https://bionlp.nlm.nih.gov/tac2017adversereactions/

Task 1: Extract AdverseReactions and related mentions (Severity, Factor, DrugClass, Negation, Animal). This is similar to many NLP Named Entity Recognition (NER) evaluations.

In [86]:
import untangle
import glob
import pandas as pd
from collections import Counter
import re
import string
import csv

In [3]:
path = '/Users/jzhu/git/nlp_adversedrug/data/train_xml/'

In [4]:
def parse_xml(filename):
    """
    @input a filename string
    @return:
    1. For training data: X, a dictionary mapping each section id to its text string, and Y, a list of
        dictionaries, one per mention (keys: id (not used for Task 1), section, type, start, len, text)
    2. For test data: only X is meaningful (Y is expected to stay empty)
    """
    X = {}
    Y = []
    
    obj = untangle.parse(filename)
    for text in obj.Label.Text.Section:
        X[text['id']] = text.cdata
        
    if obj.Label.Mentions.Mention:
        for mention in obj.Label.Mentions.Mention:
            entity = {}
            entity['id'] = mention['id']
            entity['section'] = mention['section']
            entity['type'] = mention['type']
            entity['start'] = mention['start']
            entity['len'] = mention['len']
            entity['text'] = mention['str']
            Y.append(entity)
            
    return X, Y

# test

In [70]:
filename = path + 'ADCETRIS.xml'
X, Y = parse_xml(filename)

In [71]:
X.keys()

[u'S3', u'S2', u'S1']

In [72]:
X['S1']



In [73]:
X['S1'][236:(236+11)]

u'Anaphylaxis'

In [41]:
Y[:2]

[{'id': u'M1',
  'len': u'21',
  'section': u'S1',
  'start': u'156',
  'text': u'Peripheral Neuropathy',
  'type': u'AdverseReaction'},
 {'id': u'M2',
  'len': u'11',
  'section': u'S1',
  'start': u'236',
  'text': u'Anaphylaxis',
  'type': u'AdverseReaction'}]
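
The `start` and `len` fields are strings holding character offsets into the section text, which is why `X['S1'][236:(236+11)]` returns the mention string. The sketch below turns that check into a small helper. The names `mention_spans` and `check_mention` and the sample text are hypothetical. Joining the pieces of a discontinuous mention with a single space is an assumption about how the `str` attribute is built, not something the XML shown here confirms.

```python
def mention_spans(mention):
    """Return (start, end) character spans for one mention dict.

    'start' and 'len' are strings; a discontinuous mention stores
    comma-separated values such as '0,10' and '3,4'.
    """
    starts = [int(s) for s in mention['start'].split(',')]
    lens = [int(n) for n in mention['len'].split(',')]
    return [(s, s + n) for s, n in zip(starts, lens)]


def check_mention(text, mention):
    """True if the slices, joined by a space (assumption), equal mention['text']."""
    pieces = [text[s:e] for s, e in mention_spans(mention)]
    return ' '.join(pieces) == mention['text']


# Hypothetical section text and mention, shaped like the parse_xml output.
section_text = 'Serious reactions: Anaphylaxis and rash.'
mention = {'id': 'M1', 'section': 'S1', 'type': 'AdverseReaction',
           'start': '19', 'len': '11', 'text': 'Anaphylaxis'}

spans = mention_spans(mention)
ok = check_mention(section_text, mention)
```

Running `check_mention(X[e['section']], e)` over every entry of `Y` would be one way to spot mentions whose offsets do not reproduce their text before any labeling is attempted.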

### For NER_DL: extract only the entities and tags from the training folder (then use the pre-trained word2vec from spaCy)

In [21]:
f = path + 'ADCETRIS.xml'
X, Y = parse_xml(f)
X.keys()

[u'S3', u'S2', u'S1']

In [112]:
train = []

for f in glob.glob(path+'*.xml'):
    X, Y = parse_xml(f)

    for section in X.keys():
        doc = X[section]

        # split the words in a doc
        word_ind = [[m.group(0), m.start(), m.end(), 'O'] for m in re.finditer(r'\w+', doc)
            if m.group(0) ]
        words = [w[0] for w in word_ind]
        starts = [s[1] for s in word_ind]
        ends = [e[2] for e in word_ind]
        types = [e[3] for e in word_ind]
        start_dict = dict(zip(starts, range(len(starts))))
        end_dict = dict(zip(ends, range(len(starts))))

        # parse the names in the same doc
        e_text = []
        e_type = []
        e_start = []
        e_end = []
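        for e in Y:
            if e['section'] == section:
                # a mention may hold several comma-separated start/len pairs;
                # use separate names so the token-level `starts` list is not shadowed
                m_starts = e['start'].split(',')
                m_lens = e['len'].split(',')
                for i in range(len(m_starts)):
                    e_text.append(e['text'])
                    e_type.append(e['type'])
                    e_start.append(int(m_starts[i]))
                    e_end.append(int(m_starts[i]) + int(m_lens[i]))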

        # label the names in the list of total words
        for i in range(len(e_start)):
            if (e_start[i] in start_dict) and (e_end[i] in end_dict):
                ind_start = start_dict[e_start[i]]
                ind_end = end_dict[e_end[i]]

                words[ind_start] = ' '.join(words[ind_start:(ind_end + 1)])
                types[ind_start] = e_type[i]
                for j in range(ind_start, ind_end):
                    types[j+1] = 'rm' # label to remove later

            else:
                print e_text[i], e_start[i], e_end[i]

        # remove those words tagged with 'rm' (i.e. combine names with more than one word)
        train += [[words[i], types[i]] for i in range(len(words)) if types[i] != 'rm']

        # insert an empty row to separate this section from the next one
        train.append(['', ''])

Grade 3/ 2185 2193
Diarrhea 3475 3483
Diarrhea 3483 3491
Tremor 4627 4633
Tremor 4633 4639
Dyspnea 4878 4885
Dyspnea 4885 4892
Rash 5154 5158
Rash 5158 5162
Kidney Failure 4886 4900
cardiac arrest 4714 4728
syncope 4744 4751
HEMORRHAGE 762 772
0.0% 1659 1663
0.0% 3816 3820
Hemorrhagic stroke 3817 3835
Bleed at critical site 9344 9366
Genital mycotic infections females 3089 3115
Urinary tract infections 3721 3745
Increased urination 4204 4223
Genital mycotic infections males 4457 4483
Volume Depletion 6976 6992
Volume depletion 6992 7008
Major 12782 12787
Minor 13070 13075
hematocrit values >55% 18092 18114
2.8% 13437 13441
3.3% 13595 13599
3-fold the upper limit of normal (ULN) 14632 14670
blood potassium decreased 6530 6555
=10 muU/mL 8600 8610
Hypoglycemia 15224 15236
0.9% 22805 22809
1.2% 22814 22818
0.3% 22861 22865
0.7% 22870 22874
0.1% 22939 22943
0.4% 22992 22996
Severe 9620 9626
transaminase >8 * ULN 6145 6147
transaminase >8 * ULN 6159 6164
transaminase >5 * ULN 6149 6151
tran
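
The spans printed above are the ones that failed the `start_dict` / `end_dict` lookup. One likely cause, at least for entries such as `0.0%` or `Grade 3/`, is that the mention ends on a character that `\w+` never matches, so its end offset is not the end of any token. The sketch below reproduces that situation on a hypothetical sentence. `token_boundaries` and `aligns` are illustrative names, and the offsets apply only to the sample string.

```python
import re


def token_boundaries(doc):
    """Collect the start and end offsets of every word token in doc."""
    starts, ends = set(), set()
    for m in re.finditer(r'\w+', doc):
        starts.add(m.start())
        ends.add(m.end())
    return starts, ends


def aligns(doc, start, end):
    """True when [start, end) begins and ends exactly on token boundaries."""
    starts, ends = token_boundaries(doc)
    return start in starts and end in ends


# Hypothetical text; the offsets below were computed for this string only.
doc = 'Rash occurred in 0.0% of patients.'

rash_ok = aligns(doc, 0, 4)    # 'Rash' is a whole token
pct_ok = aligns(doc, 17, 21)   # '0.0%' ends on '%', which is not a word character
```

Here `rash_ok` is true and `pct_ok` is false, which mirrors the two branches of the labeling loop.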

In [113]:
len(train)

213037

In [111]:
213037 - 212798 # num of sections of all xml files

239

In [115]:
with open("train.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(train)
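
Each row of `train.csv` is a `[word, type]` pair, and an empty `['', '']` row marks the end of a section. A minimal sketch of reading that layout back into per-section lists follows. It is written for Python 3 and uses an in-memory buffer instead of the real file, whereas the cell above opens the file in `"wb"` mode, which is the Python 2 convention. `group_sections` and the sample rows are hypothetical.

```python
import csv
import io


def group_sections(rows):
    """Split [word, tag] rows into sections at the ['', ''] separator rows."""
    sections, current = [], []
    for row in rows:
        if len(row) == 2 and row[0] == '' and row[1] == '':
            # separator row: close the current section if it has any words
            if current:
                sections.append(current)
            current = []
        else:
            current.append((row[0], row[1]))
    if current:
        sections.append(current)
    return sections


# Hypothetical rows in the same layout as `train`.
rows = [['Anaphylaxis', 'AdverseReaction'], ['occurred', 'O'], ['', ''],
        ['Severe', 'Severity'], ['rash', 'AdverseReaction'], ['', '']]

# Round-trip through the csv module using an in-memory text buffer.
buf = io.StringIO(newline='')
csv.writer(buf).writerows(rows)
buf.seek(0)
sections = group_sections(csv.reader(buf))
```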

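Note:
* Many outliers still need to be handled: the mentions printed above have a start or end offset that does not fall on a word boundary, so they are left unlabeled.

* Because the mention offsets in the XML are character positions in the original text, words cannot be labeled one at a time without knowing where each word sits in that text.
* So I first list all the words of a section, one per line, together with their start and end offsets,
* then use the mention offsets from the XML to relabel the words that belong to a mention with the mention's type.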
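
One possible way to handle the outliers is to label by overlap instead of exact boundary match: any token whose character span overlaps a mention span receives that mention's type. The sketch below is an alternative under that assumption, not the method used in the cells above. It also emits one BIO tag per token rather than merging a multi-word name into a single row. `tokenize`, `label_tokens`, and the sample sentence are hypothetical.

```python
import re


def tokenize(doc):
    """Return (word, start, end) for every word token in doc."""
    return [(m.group(0), m.start(), m.end()) for m in re.finditer(r'\w+', doc)]


def label_tokens(doc, mentions):
    """Tag tokens that overlap a mention span.

    mentions: list of (start, end, type) with end exclusive.
    Returns a list of (word, tag) pairs using B-/I-/O tags.
    """
    tokens = tokenize(doc)
    tags = ['O'] * len(tokens)
    for m_start, m_end, m_type in mentions:
        first = True
        for i, (_, t_start, t_end) in enumerate(tokens):
            # two half-open intervals overlap when each starts before the other ends
            if t_start < m_end and t_end > m_start:
                tags[i] = ('B-' if first else 'I-') + m_type
                first = False
    return [(word, tag) for (word, _, _), tag in zip(tokens, tags)]


# Hypothetical sentence; the first mention ends on '/', which is not a token end.
doc = 'Grade 3/4 neutropenia occurred in 0.0% of patients.'
mentions = [(0, 8, 'Severity'), (10, 21, 'AdverseReaction')]

labeled = label_tokens(doc, mentions)
```

With this approach the `Grade 3/` span still tags `Grade` and `3`, even though its end offset does not coincide with the end of a token. Overlapping mentions would overwrite each other's tags, so that case would need a separate decision.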