# Abbreviation Disambiguation in Medical Texts - Data Modeling

This notebook continues from the notebook 'Step 2- Data Preprocessing' and covers:

1. Modeling the preprocessed data using GridSearchCV on Logistic Regression.
2. Testing the model on the test set.

## Step# 1: Loading Dataset

In [73]:
#Importing the Required Python Packages
import shutil
import string
import numpy as np
import pandas as pd
import pickle
from tqdm import tqdm
import ast
from sklearn import utils
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score, classification_report
import matplotlib.pyplot as plt
pd.set_option('display.max_colwidth', None)  # None = do not truncate; -1 is deprecated in pandas >= 1.0

In [2]:
# Let's load the training dataset.
train = pd.read_csv('Train/train_final.csv')
train.head(3)

Unnamed: 0,LABEL,ABV,TOKEN
0,anticoagulation,AC,"['yearold', 'girl', 'presented', 'unilateral', 'c6', 'nerve', 'palsy', 'papilledema', 't3', 'severe', 'ha', 'magnetic', 'resonance', 'imaging', 'showed', 'thrombosis', 'superior', 'sagittal', 'sinus', 'pt', 'right', 'transverse', 'sinus', 'attributed', 'oral', 'contraceptive', 'use', 't3', 'coagulation', 'workup', 'negative', 'svr', 'anticoagulation', 'caused', 'hemorrhagic', 'papillopathy', 'eyes', 'raising', 'question', 'ac', 'discontinued', 'diagnostic', 'management', 'issues', 'cbf', 'venous', 'sinus', 'thrombosis', 'secondary', 'intracranial', 'hypertension', 'discussed']"
1,anticoagulation,AC,"['mvr', 'procedure', 'choice', 'treat', 'mitral', 'valve', 'dysfunction', 'advantages', 'mvr', 'mitral', 'vr', 'include', 'improved', 'lts', 'better', 'preservation', 'left', 'vvi', 'cf', 'greater', 'freedom', 'endocarditis', 'thromboembolism', 'anticoagulantrelated', 'hemorrhage', 'feasibility', 'durability', 'mvr', 'depend', 'etiology', 'mitral', 'valve', 'dysfunction', 'degenerative', 'icm', 'mitral', 'valve', 'diseases', 'valve', 'repair', 'possible', 'cases', 'freedom', 'reoperation', 'high', 'contrast', 'rheumatic', 'valves', 'amenable', 'repair', 'durability', 'limited', 'isolated', 'valve', 'operations', 'performed', 'minimally', 'invasive', 'fashion', 'partial', 'upper', 'sternotomy', 'patients', 'atrial', 'fibrillation', 'mvd', 'maze', 'procedure', 'pulmonary', 'vein', 'isolation', 'performed', 'eliminating', 'atrial', 'fibrillation', 'need', 'lt', 'ac']"
2,typical development,TD,"['explored', 'classification', 'accuracy', 'english', 'spanish', 'versions', 'experimental', 'semantic', 'language', 'measure', 'functional', 'monolingualbilingual', 'children', 'language', 'impairment', 'total', 'children', 'participated', 'including', 'balanced', 'bilinguals', 'language', 'impairment', 'td', 'monolingual', 'spanish', 'language', 'impairment', 'typical', 'od', 'monolingual', 'english', 'language', 'impairment', 'typical', 'od', 'children', 'years', 'cut', 'points', 'derived', 'functionally', 'monolingual', 'children', 'applied', 'bilinguals', 'assess', 'predictive', 'accuracy', 'english', 'spanish', 'semantics', 'correct', 'classification', 'english', 'monolinguals', 'spanish', 'monolinguals', 'discriminant', 'analysis', 'yielded', 'correct', 'classification', 'balanced', 'bilingual', 'children', 'english', 'spanish', 'respectively', 'semanticsbased', 'measure', 'fair', 'good', 'classification', 'accuracy', 'functional', 'monolinguals', 'spanishenglish', 'bilingual', 'children', 'language', 'tested']"


In [3]:
# Let's load the validation and test datasets as well.
valid = pd.read_csv('Validation/valid_final.csv')
test = pd.read_csv('Test/test_final.csv')

In [4]:
valid.head(3)

Unnamed: 0,LABEL,ABV,TOKEN
0,benzophenone,BP,"['bp', 'fema', 'cas', 'po', 'diet', 'rats', 'target', 'dls', 'mgkg', 'body', 'weightday', 'days', 'mgkgday', 'days', 'body', 'weights', 'food', 'consumption', 'measured', 'weekly', 'haematology', 'clinical', 'chemistry', 'urinalysis', 'values', 'obtained', 'wk', 'end', 'study', 'gross', 'microscopic', 'pathological', 'examinations', 'conducted', 'organ', 'weights', 'recorded', 'treatmentrelated', 'changes', 'occurred', 'ea', 'count', 'haemoglobin', 'haematocrit', 'bilirubin', 'total', 'protein', 'albumin', 'mid', 'highdose', 'c2', 'changes', 'occur', 'groups', 'sexes', 'indications', 'increased', 'absolute', 'relative', 'cl', 'kidney', 'weights', 'mid', 'highdose', 'groups', 'statistically', 'consistent', 'absolute', 'kidney', 'weights', 'histopathology', 'cl', 'mid', 'highdose', 'cg', 'showed', 'hepatocellular', 'enlargement', 'associated', 'clumping', 'cytoplasmic', 'basophilic', 'material', 'central', 'vein', 'nel', 'demonstrated', 'mgkgday', 'days', 'administration', 'equivalent', 'ni', 'mgday', 'kg', 'human', 'basis', 'calculated', 'possible', 'average', 'daily', 'intake', 'mgday', 'safety', 'factor', 'greater', 'demonstrated', 'safety', 'factor', 'based', 'realistic', 'capita', 'consumption', 'microgramday', 'approximately', 'million']"
1,adult,AD,"['investigated', 'expression', 'xist', 'gene', 'mouse', 'female', 'ad', 'kidney', 'embryos', 'embryonic', 'stem', 'es', 'cells', 'undergoing', 'vitro', 'differentiation', 'ebs', 'quantitative', 'rtpcr', 'single', 'nucleotide', 'primer', 'extension', 'snupe', 'ca', 'found', 'xist', 'rna', 'ad', 'kidney', 'mouse', 'strains', 'approximately', 'transcripts', 'cell', 'modest', 'differences', 'strains', 'carrying', 'different', 'xce', 'alleles', 'female', 'embryos', 'days', 'pc', 'number', 'xist', 'transcripts', 'cell', 'isogenic', 'ad', 'tissue', 'quantitative', 'oligonucleotide', 'hybridization', 'assays', 't3', 'rtpcr', 'investigated', 'xist', 'expression', 'es', 'lines', 'heterozygous', 'pgk', 'xist', 'loci', 'found', 'xx', 'es', 'lines', 'xist', 'rna', 'levels', 'increased', 'embryoid', 'body', 'formation', 'levels', 'seen', 'found', 'adult', 'female', 'kidney', 'addition', 'found', 'allelic', 'ratio', 'xist', 'transcripts', 'reciprocal', 'xx', 'es', 'cell', 'lines', 'differentiating', 'vitro', 'mz', 'isogenic', 'day', 'female', 'embryos', 'results', 'suggest', 'dp', 'preferential', 'paternal', 'imprinting', 'days', 'vitro', 'differentiation', 'es', 'cells', 'influence', 'xce', 'locus', 'randomness', 'xinactivation', 'embryos', 'operate', 'es', 'cell', 'lines', 'overall', 'conclusion', 'low', 'c2', 'xist', 'rna', 'female', 'kidney', 'embryos', 'differentiating', 'xx', 'es', 'cells', 'compatible', 'models', 'require', 'xist', 'rna', 'cover', 'entire', 'inactive', 'x', 'chromosome']"
2,primary visual cortex,V1,"['recently', 'reported', 'paradoxical', 'facilitatory', 'effect', 'hz', 'rtms', 'rtms', 'v1', 'migraine', 'possibly', 'failure', 'inhibitory', 'circuits', 'unable', 'upregulated', 'vlf', 'rtms', 'investigate', 'inhibitory', 'circuit', 'dysfunction', 'extends', 'striate', 'sc', 'ma', 'studied', 'effects', 'hz', 'rtms', 'ra', 'extrastriate', 'cortex', 'perception', 'illusory', 'contours', 'patients', 'lowfrequency', 'rtms', 'enhanced', 'activity', 'extrastriate', 'cortex', 'migraineurs', 'speeding', 'reaction', 'times', 'illusory', 'contour', 'perception', 'finding', 'supports', 'view', 'failure', 'inhibitory', 'circuits', 'involving', 'extrastriate', 'sc', 'ma']"


In [5]:
test.head(3)

Unnamed: 0,LABEL,ABV,TOKEN
0,micronucleus,MN,"['work', 'tested', 'hypothesis', 'content', 'spontaneous', 'micronuclei', 'lymphocytes', 'apparently', 'healthy', 'normal', 'human', 'subject', 'exhibited', 'unusually', 'high', 'mn', 'frequency', 'nonrandom', 'dna', 'probes', 'fluorescent', 'insitu', 'hybridization', 'fish', 'beginning', 'probe', 'generated', 'subjects', 'micronuclei', 'micronuclei', 'obtained', 'peripheral', 'bl', 'microdissection', 'subjected', 'rapd', 'rapdpcr', 'unique', 'pcr', 'product', 'isolate', 'cosmid', 'clone', 'human', 'genomic', 'library', 'clone', 'hybridized', 'chromosome', 'subsequently', 'commercial', 'probes', 'included', 'fish', 'analyses', 'micronuclei', 'subject', 'age', 'sexmatched', 'controls', 'significant', 'differences', 'found', 'subject', 'controls', 'percentages', 'micronuclei', 'hybridizing', 'centromere', 'probe', 'x', 'chromosome', 'painting', 'probe', 'chromosome', 'subject', 'highly', 'significant', 'increase', 'p', 'chromosome', 'micronuclei', 'level', 'expected', 'present', 'chance', 'characterization', 'micronuclei', 'promising', 'tool', 'studies', 'mechanisms', 'inherited', 'induced', 'cin', 'strength', 'strategy', 'employed', 't0', 'characterizing', 'chromosomes', 'present', 'micronuclei', 'work', 't3', 'observation', 'chromosomal', 'instability', 'foundation', 't0', 'mechanism', 'underlying', 'observation']"
1,bipolar,BP,"['onethird', 'lithiumtreated', 'bp', 'patients', 'excellent', 'lithium', 'nr', 'lithium', 'monotherapy', 'totally', 'prevents', 'episodes', 'bp', 'disorder', 'years', 'patients', 'clinically', 'characterized', 'episodic', 'clinical', 'course', 'cr', 'remission', 'bipolar', 'family', 'history', 'low', 'psychiatric', 'comorbidity', 'maniadepression', 'episode', 'sequences', 'moderate', 'number', 'episodes', 'low', 'number', 'hospitalizations', 'prelithium', 'period', 'recently', 'found', 'temperamental', 'features', 'hypomania', 'hyperthymic', 'temperament', 'lack', 'cognitive', 'disorganization', 'predict', 'best', 'results', 'lithium', 'prophylaxis', 'lithium', 'exerts', 'neuroprotective', 'effect', 'increased', 'expression', 'brainderived', 'neurotrophic', 'factor', 'bdnf', 'inhibition', 'glycogen', 'synthase', 'kinase', 'gsk', 'play', 'important', 'role', 'response', 'lithium', 'connected', 'tt', 'bdnf', 'gene', 'ss', 'bdnf', 'levels', 'better', 'response', 'lithium', 'connected', 'met', 'allele', 'bdnf', 'valmet', 'polymorphism', 'hyperthymic', 'temperament', 'excellent', 'lithium', 'nr', 'normal', 'cognitive', 'functions', 'serum', 'bdnf', 'levels', 't3', 'lt', 'duration', 'illness', 'preservation', 'cognitive', 'functions', 'longterm', 'lithiumtreated', 'patients', 'connected', 'stimulation', 'bdnf', 'system', 'resulting', 'prevention', 'affective', 'episodes', 'exerting', 'deleterious', 'cognitive', 'effects', 'possibly', 'lithiums', 'antiviral', 'effects', 'number', 'candidate', 'genes', 'related', 'neurotransmitters', 'intracellular', 'signaling', 'neuroprotection', 'circadian', 'rhythms', 'pathogenic', 'mechanisms', 'bipolar', 'disorder', 'found', 'associated', 'lithium', 'prophylactic', 'response', 'consortium', 'lithium', 'genetics', 'conligen', 'recently', 'performed', 'genomewide', 'association', 't0', 'lithium', 'response', 'bp', 'disorder']"
2,genetic programming,GP,"['performance', 'energy', 'conversion', 'system', 'depends', 'exergy', 'analysis', 'entropy', 'generation', 'minimisation', 'new', 'simple', 'fourparameter', 'equation', 'presented', 'paper', 'predict', 'standard', 'state', 'absolute', 'entropy', 'real', 'gases', 'sstd', 'mm', 'development', 'validation', 'accomplished', 'linear', 'gp', 'lgp', 'method', 'comprehensive', 'dataset', 'widely', 'materials', 'proposed', 'model', 'compared', 'results', 'obtained', 'threelayer', 'feed', 'forward', 'neural', 'network', 'mm', 'ffnn', 'mm', 'rootmeansquare', 'error', 'rmse', 'coefficient', 'determination', 'r', 'data', 'obtained', 'lgp', 'mm', 'jmol', 'k', 'respectively', 'statistical', 'assessments', 'evaluate', 'predictive', 'power', 'mm', 'addition', 't0', 'provides', 'appropriate', 'understanding', 'important', 'molecular', 'variables', 'exergy', 'analysis', 'compared', 'lgp', 'based', 'mm', 'application', 'ffnn', 'improved', 'r', 'developed', 'mm', 'useful', 'design', 'materials', 'achieve', 'desired', 'entropy', 'value']"


### Let's keep only the relevant records in the validation and test sets (abbreviations and labels seen in training); each set should then hold only 4k records

In [6]:
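# Keep only rows whose abbreviation appears in the training set
abbrev = list(train['ABV'].unique())
valid = valid[valid['ABV'].isin(abbrev)]
test = test[test['ABV'].isin(abbrev)]
# Keep only rows whose label appears in the training set;
# .copy() avoids SettingWithCopyWarning when the frames are modified later
labels = list(train['LABEL'].unique())
valid = valid[valid['LABEL'].isin(labels)].copy()
test = test[test['LABEL'].isin(labels)].copy()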
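
As a minimal sketch of the filtering step above, the snippet below applies the same `isin` logic to two toy frames. The frames, their values, and the `keep_seen` helper are hypothetical and only stand in for `train` and `valid`. Printing the row counts before and after is a simple way to check how many records actually remain.

```python
import pandas as pd

# Toy stand-ins for the real train/validation frames (illustrative values only)
train_toy = pd.DataFrame({
    "LABEL": ["anticoagulation", "typical development", "bipolar"],
    "ABV": ["AC", "TD", "BP"],
})
valid_toy = pd.DataFrame({
    "LABEL": ["benzophenone", "bipolar", "adult", "anticoagulation"],
    "ABV": ["BP", "BP", "AD", "AC"],
})

def keep_seen(df, reference):
    """Keep rows of df whose ABV and LABEL both occur in reference (hypothetical helper)."""
    mask = (df["ABV"].isin(reference["ABV"].unique())
            & df["LABEL"].isin(reference["LABEL"].unique()))
    # .copy() returns an independent frame rather than a view of df
    return df[mask].copy()

valid_kept = keep_seen(valid_toy, train_toy)
print(len(valid_toy), "->", len(valid_kept))
print(valid_kept)
```

Note that an abbreviation seen in training (here `BP`) can still carry a label that was never seen, which is why both columns are checked.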


### Let's drop the 'ABV' column

In [7]:
train.drop(columns='ABV', inplace = True)
valid.drop(columns='ABV', inplace = True)
test.drop(columns='ABV', inplace = True)

In [8]:
# Convert TOKEN column from string to list
train['TOKEN'] = train['TOKEN'].apply(lambda x: ast.literal_eval(x))
valid['TOKEN'] = valid['TOKEN'].apply(lambda x: ast.literal_eval(x))
test['TOKEN'] = test['TOKEN'].apply(lambda x: ast.literal_eval(x))
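
The `TOKEN` column is read from CSV as text, so each cell is a string that merely looks like a list. A small illustration of what `ast.literal_eval` does with such a string follows; the sample string is shortened from the first training row, and the second call is only there to show that non-literal expressions are refused.

```python
import ast

# How a list-valued cell looks after a CSV round trip: plain text
raw = "['yearold', 'girl', 'presented', 'unilateral', 'c6']"
tokens = ast.literal_eval(raw)
print(type(raw).__name__, "->", type(tokens).__name__)
print(tokens[:3])

# literal_eval only accepts Python literals, so arbitrary expressions are rejected
rejected = False
try:
    ast.literal_eval("__import__('os').getcwd()")
except (ValueError, SyntaxError) as err:
    rejected = True
    print("rejected:", type(err).__name__)
```

This is why `ast.literal_eval` is a safer choice than `eval` for parsing stored lists.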

### Let's tag every token list with its label

In [9]:
train_tagged = train.apply(lambda x: TaggedDocument(words = x['TOKEN'], tags = [x['LABEL']]), axis=1)
valid_tagged = valid.apply(lambda x: TaggedDocument(words = x['TOKEN'], tags = [x['LABEL']]), axis=1)
test_tagged = test.apply(lambda x: TaggedDocument(words = x['TOKEN'], tags = [x['LABEL']]), axis=1)

In [10]:
train_tagged.values[:5]

array([TaggedDocument(words=['yearold', 'girl', 'presented', 'unilateral', 'c6', 'nerve', 'palsy', 'papilledema', 't3', 'severe', 'ha', 'magnetic', 'resonance', 'imaging', 'showed', 'thrombosis', 'superior', 'sagittal', 'sinus', 'pt', 'right', 'transverse', 'sinus', 'attributed', 'oral', 'contraceptive', 'use', 't3', 'coagulation', 'workup', 'negative', 'svr', 'anticoagulation', 'caused', 'hemorrhagic', 'papillopathy', 'eyes', 'raising', 'question', 'ac', 'discontinued', 'diagnostic', 'management', 'issues', 'cbf', 'venous', 'sinus', 'thrombosis', 'secondary', 'intracranial', 'hypertension', 'discussed'], tags=['anticoagulation']),
       TaggedDocument(words=['mvr', 'procedure', 'choice', 'treat', 'mitral', 'valve', 'dysfunction', 'advantages', 'mvr', 'mitral', 'vr', 'include', 'improved', 'lts', 'better', 'preservation', 'left', 'vvi', 'cf', 'greater', 'freedom', 'endocarditis', 'thromboembolism', 'anticoagulantrelated', 'hemorrhage', 'feasibility', 'durability', 'mvr', 'depend', 'et

## Step# 2: Apply the Doc2Vec vectorizer on the Dataset

In [33]:
vectorize = Doc2Vec(dm=0, vector_size=100, min_count=2, window = 2)
vectorize.build_vocab(train_tagged.values)

In [34]:
vectorize.train(train_tagged.values, total_examples=len(train_tagged.values), epochs=30)

### Building the Final Vector Feature Classifier

In [35]:
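def vec_for_learning(model, tagged_docs):
    # Infer one vector per document; returns (labels, feature vectors) as two tuples
    sents = tagged_docs.values
    # `steps` was removed in gensim 4.0; `epochs` is the current name of this argument
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, epochs=30)) for doc in sents])
    return targets, regressors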

In [36]:
y_train, X_train = vec_for_learning(vectorize, train_tagged)

### Let's perform a grid search to find the best hyperparameters for the Logistic Regression model

In [41]:
param_grid = {'C':[0.001, 0.01, 0.1, 1, 10, 100]}
grid_model = GridSearchCV(LogisticRegression(n_jobs=-1), param_grid)

In [42]:
grid_model.fit(X_train, y_train)

GridSearchCV(estimator=LogisticRegression(n_jobs=-1),
             param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100]})

In [43]:
### Best parameters for the Grid Search
grid_model.best_params_

{'C': 1}

In [44]:
### Mean cross-validated accuracy of the best parameter setting
grid_model.best_score_

0.9805815760310533
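
`best_score_` is the mean score across the cross-validation folds of the training data for the winning setting, not a score on a separate held-out set. The sketch below makes that explicit on synthetic data. The data from `make_classification`, the smaller grid, and `max_iter=1000` are assumptions for illustration and not the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Doc2Vec features (assumption: not the notebook's data)
X_toy, y_toy = make_classification(n_samples=300, n_features=20, n_informative=5,
                                   n_classes=3, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="accuracy",  # stated explicitly; also the default for classifiers
    cv=5,                # stated explicitly; also the default number of folds
)
search.fit(X_toy, y_toy)

# One mean fold score per candidate; best_score_ is the largest of these
mean_scores = search.cv_results_["mean_test_score"]
for params, score in zip(search.cv_results_["params"], mean_scores):
    print(params, round(float(score), 3))
print("best:", search.best_params_, round(float(search.best_score_), 3))
```

Inspecting `cv_results_` this way also shows how close the other values of `C` came to the winner.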

### Apply the best parameters to Logistic Regression and train the model.

In [45]:
logreg = LogisticRegression(n_jobs=-1, C=1)
logreg.fit(X_train, y_train)

LogisticRegression(C=1, n_jobs=-1)

In [46]:
### Apply the above Model on Validation Set
y_valid, X_valid = vec_for_learning(vectorize, valid_tagged)
y_pred_valid = logreg.predict(X_valid)

In [47]:
print('Validation Accuracy:', accuracy_score(y_valid, y_pred_valid))
print('Validation F1-Score:', f1_score(y_valid, y_pred_valid, average='weighted'))

Validation Accuracy: 0.6913
Validation F1-Score: 0.6848102160464058
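
For reference, `average='weighted'` means the per-class F1 scores are averaged with each class weighted by its number of true samples. A minimal pure-Python sketch of that calculation is shown below. The `weighted_f1` helper and the toy labels are hypothetical; the notebook itself relies on scikit-learn's `f1_score`.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative helper)."""
    support = Counter(y_true)  # number of true samples per class
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n_true  # weight each class by its support
    return total / len(y_true)

y_true_toy = ["bipolar", "bipolar", "bipolar", "adult"]
y_pred_toy = ["bipolar", "bipolar", "adult", "adult"]

accuracy_toy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
print("accuracy:", accuracy_toy)
print("weighted F1:", round(weighted_f1(y_true_toy, y_pred_toy), 4))
```

Because frequent classes dominate the weighted average, it can hide poor performance on rare expansions; a per-class report is the place to look for those.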


### On the validation set, the Logistic Regression model gives:
1. F1-score: 0.68
2. Accuracy: 69%

Both are well below the cross-validated accuracy of 0.98 obtained on the training data, so the model generalizes less well than the grid search score suggests. Let's now apply this model to our test set and check its performance metrics.

In [48]:
### Apply the above Model on Test Set
y_test, X_test = vec_for_learning(vectorize, test_tagged)
y_pred_test = logreg.predict(X_test)

### Let's calculate some performance metrics on the test predictions.

In [68]:
accuracy = accuracy_score(y_test, y_pred_test)
f1_scr = f1_score(y_test, y_pred_test, average='weighted')
print('Test Accuracy:', accuracy)
print('Test F1-Score:', f1_scr)

Test Accuracy: 0.6962
Test F1-Score: 0.6906125218543699


In [74]:
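# The labels are already strings, so classification_report names its rows itself
# (in sorted label order). Passing list(set(y_test)) as target_names is unsafe:
# a set has no defined order, so names can be attached to the wrong rows.
# The row names in the output below came from such a list and may not match their metrics.
cr = classification_report(y_test, y_pred_test)
print(cr)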

                                                   precision    recall  f1-score   support

                                habitual abortion       0.83      1.00      0.91        15
                                 hemolytic anemia       0.79      0.91      0.85        45
                                   thermodilution       0.76      0.60      0.67        48
                               activated charcoal       0.84      0.88      0.86        48
                                  edmonstonzagreb       0.42      0.30      0.35        54
                       metabolism of benzoapyrene       0.95      1.00      0.98        21
                                  buffer capacity       0.79      0.93      0.85        41
                                       decay rate       1.00      0.78      0.88         9
                          hypersensitive reaction       0.96      0.86      0.91        29
                      live attenuated coldadapted       0.87      0.85      0.86        5

### Thus, the above report shows:
1. Average precision: 0.69
2. Average recall: 0.70
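
The row names in a classification report are only trustworthy when they follow the same order as the labels. A short sketch on toy labels (hypothetical values, not the notebook's predictions) shows two safe ways to call `classification_report` with string labels:

```python
from sklearn.metrics import classification_report

y_true_toy = ["bipolar", "adult", "adult", "micronucleus"]
y_pred_toy = ["bipolar", "adult", "bipolar", "micronucleus"]

# Option 1: let the report name its own rows; string labels are used in sorted order
print(classification_report(y_true_toy, y_pred_toy, zero_division=0))
report = classification_report(y_true_toy, y_pred_toy, output_dict=True, zero_division=0)

# Option 2: if explicit names are needed, pass labels and target_names in the same order
names = sorted(set(y_true_toy) | set(y_pred_toy))
print(classification_report(y_true_toy, y_pred_toy, labels=names,
                            target_names=names, zero_division=0))
```

With `output_dict=True` the per-class metrics can also be looked up by label, for example `report["adult"]["recall"]`, which avoids any dependence on row order.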

# Next Steps

## A basic Logistic Regression implementation gives about 70% accuracy. As next steps, we can:
1. Try tuning the Doc2Vec vectorizer's hyperparameters.
2. Try other classification algorithms, such as SVM and Random Forest, and compare the results.
3. The present model has been trained to disambiguate 20 medical abbreviations, but the same approach can be generalized to other fields, such as scientific research and Internet slang.
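
One possible way to carry out the classifier comparison is sketched below. It is only an outline under stated assumptions: the data comes from `make_classification` rather than the Doc2Vec vectors, and the model settings are scikit-learn defaults apart from `max_iter`, `n_estimators`, and `random_state`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the document vectors (assumption: not the notebook's data)
X_toy, y_toy = make_classification(n_samples=300, n_features=20, n_informative=5,
                                   n_classes=3, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM (RBF kernel)": SVC(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Score every candidate with the same folds and the same metric
results = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X_toy, y_toy, cv=5, scoring="f1_weighted")
    results[name] = float(scores.mean())
    print(f"{name}: {results[name]:.3f}")
```

In the notebook, `X_train` and `y_train` would take the place of the toy arrays, and the validation set would remain the final check for whichever model looks best.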