# Abbreviation Disambiguation in Medical Texts - Data Modeling

This notebook continues from the notebook 'Step 2- Data Preprocessing' and covers:

1. Modeling the preprocessed data using GridSearchCV with Logistic Regression, SVM and XGBoost.
2. Testing the models on the test set.
3. Comparing the models and identifying the next steps.

## Step# 1: Loading Dataset

In [None]:
#Importing the Required Python Packages
import shutil
import string
import numpy as np
import pandas as pd
import pickle
from tqdm import tqdm
import ast
from sklearn import utils
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import f1_score, accuracy_score, classification_report
import matplotlib.pyplot as plt
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
pd.set_option('display.max_colwidth', None)

In [None]:
# Let's load the train dataset.
train = pd.read_csv('Train/train_final.csv')
train.head(3)

In [None]:
# Let's load the validation and test datasets as well
valid = pd.read_csv('Validation/valid_final.csv')
test = pd.read_csv('Test/test_final.csv')

In [None]:
valid.head(3)

In [None]:
test.head(3)

### Let's keep only the relevant records in the validation and test sets.

In [None]:
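# Keep only rows whose abbreviation was seen in the training set
abbrev = list(train['ABV'].unique())
valid = valid[valid['ABV'].isin(abbrev)]
test = test[test['ABV'].isin(abbrev)]
# Keep only rows whose label (expansion) was seen in the training set
labels = list(train['LABEL'].unique())
valid = valid[valid['LABEL'].isin(labels)]
test = test[test['LABEL'].isin(labels)]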
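The filtering above matters because a classifier can only predict labels it saw during training. A minimal sketch of the same idea on plain Python rows follows; the abbreviations and expansions are made-up toy values, not records from the dataset.

```python
# Toy rows standing in for the notebook's DataFrames (values are invented).
train_rows = [
    {"ABV": "AB", "LABEL": "abortion"},
    {"ABV": "AB", "LABEL": "blood group in ABO system"},
    {"ABV": "CVA", "LABEL": "cerebrovascular accident"},
]
test_rows = [
    {"ABV": "AB", "LABEL": "abortion"},
    {"ABV": "AB", "LABEL": "antibody"},           # label never seen in training
    {"ABV": "DM", "LABEL": "diabetes mellitus"},  # abbreviation never seen in training
]

known_abbrevs = {row["ABV"] for row in train_rows}
known_labels = {row["LABEL"] for row in train_rows}


def keep_known(rows):
    """Drop rows whose abbreviation or label is absent from the training data."""
    return [r for r in rows
            if r["ABV"] in known_abbrevs and r["LABEL"] in known_labels]


filtered = keep_known(test_rows)
print(filtered)
```

Rows dropped this way would otherwise always count as errors, since the model has no way to output an unseen label.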


### Let's drop the 'ABV' column

In [None]:
train.drop(columns='ABV', inplace = True)
valid.drop(columns='ABV', inplace = True)
test.drop(columns='ABV', inplace = True)

In [None]:
# Convert TOKEN column from string to list
train['TOKEN'] = train['TOKEN'].apply(ast.literal_eval)
valid['TOKEN'] = valid['TOKEN'].apply(ast.literal_eval)
test['TOKEN'] = test['TOKEN'].apply(ast.literal_eval)
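The conversion is needed because a list column written to CSV comes back as its string representation. A small sketch of what `ast.literal_eval` does with such a string (the token values are illustrative):

```python
import ast

# How a token list looks after a round trip through CSV: a plain string.
raw = "['patient', 'history', 'of', 'cva']"
tokens = ast.literal_eval(raw)

print(type(raw).__name__, "->", type(tokens).__name__)
print(tokens[0])

# literal_eval only accepts Python literals, so arbitrary expressions are rejected.
rejected = False
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError:
    rejected = True
print("expression rejected:", rejected)
```

This is why `ast.literal_eval` is generally preferred over `eval` for parsing stored lists.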

### Let's tag every token list with its label

In [None]:
train_tagged = train.apply(lambda x: TaggedDocument(words = x['TOKEN'], tags = [x['LABEL']]), axis=1)
valid_tagged = valid.apply(lambda x: TaggedDocument(words = x['TOKEN'], tags = [x['LABEL']]), axis=1)
test_tagged = test.apply(lambda x: TaggedDocument(words = x['TOKEN'], tags = [x['LABEL']]), axis=1)

In [None]:
train_tagged.values[:5]
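To make the structure of the tagged documents concrete, here is a rough sketch that uses a plain `namedtuple` as a stand-in for gensim's `TaggedDocument` (the name `TaggedDoc` and the sample rows are hypothetical). Each document carries its token list in `words` and its label as the single entry of `tags`.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument, which also exposes `words` and `tags`.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

# Invented (token list, label) pairs mirroring the TOKEN and LABEL columns.
rows = [
    (["pt", "admitted", "with", "cva"], "cerebrovascular accident"),
    (["history", "of", "ab"], "abortion"),
]

tagged = [TaggedDoc(words=tokens, tags=[label]) for tokens, label in rows]

for doc in tagged:
    print(doc.tags[0], "<-", doc.words)
```

Because the label is used as the tag, all documents with the same label share one tag, which is what later lets `doc.tags[0]` serve as the training target.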

## Step# 2: Apply Doc2vec vectorizer on the Dataset

In [None]:
vectorize = Doc2Vec(dm=0, vector_size=100, min_count=2, window = 2)
vectorize.build_vocab(train_tagged.values)

In [None]:
vectorize.train(train_tagged.values, total_examples=len(train_tagged.values), epochs=30)

### Building the Final Vector Feature Classifier

In [None]:
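def vec_for_learning(model, tagged_docs):
    # Infer a vector for each document and pair it with its label (the first tag)
    sents = tagged_docs.values
    # `epochs` is the argument name in gensim 4.x; older releases also accepted `steps`
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, epochs=30)) for doc in sents])
    return targets, regressors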

In [None]:
y_train, X_train = vec_for_learning(vectorize, train_tagged)
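The `zip(*...)` call in `vec_for_learning` turns a list of (label, vector) pairs into two parallel tuples. A minimal sketch with made-up two-dimensional vectors in place of inferred Doc2Vec vectors:

```python
# Invented (label, vector) pairs; real vectors would come from infer_vector.
pairs = [
    ("label_a", [0.1, 0.2]),
    ("label_b", [0.3, 0.4]),
    ("label_a", [0.5, 0.6]),
]

# zip(*pairs) "unzips" the pairs: first all labels, then all vectors.
targets, features = zip(*pairs)

print(targets)
print(features)
```

Both results are tuples of equal length, so position `i` in `targets` always matches position `i` in `features`.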

## Model# 1: Logistic Classifier

### Let's perform a grid search to find the best hyperparameters for the Logistic Regression model

In [None]:
param_grid = {'C':[0.001, 0.01, 0.1, 1, 10, 100]}
grid_model = GridSearchCV(LogisticRegression(n_jobs=-1), param_grid)

In [None]:
grid_model.fit(X_train, y_train)

In [None]:
### Best parameters for the Grid Search
grid_model.best_params_

In [None]:
### Mean cross-validated accuracy of the best parameters
grid_model.best_score_
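For intuition, `best_score_` is the mean cross-validated score of the best parameter setting, not a score on held-out validation or test data. The sketch below reimplements that idea in plain Python on synthetic 1-D data, with a toy k-nearest-neighbour classifier standing in for Logistic Regression; the data, the classifier and the helper names are all assumptions made for illustration.

```python
import random


def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest 1-D training points."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


def cv_accuracy(xs, ys, k, n_folds=5):
    """Mean accuracy over n_folds held-out folds for one hyperparameter value."""
    scores = []
    for fold in range(n_folds):
        test_idx = set(range(fold, len(xs), n_folds))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        hits = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i]
                   for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / n_folds


random.seed(0)
# Two synthetic classes centred at 0 and 3.
xs = [random.gauss(0, 1) for _ in range(30)] + [random.gauss(3, 1) for _ in range(30)]
ys = [0] * 30 + [1] * 30

grid = [1, 3, 5]                                  # candidate values, like param_grid
results = {k: cv_accuracy(xs, ys, k) for k in grid}
best_k = max(results, key=results.get)            # analogue of best_params_
best_score = results[best_k]                      # analogue of best_score_
print(results, best_k, best_score)
```

Every candidate is scored only on folds it was not trained on, and the winner is the candidate with the highest mean.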

### Apply the best parameters to Logistic Regression and train the model.

In [None]:
logreg = LogisticRegression(n_jobs=-1, C=1)
logreg.fit(X_train, y_train)

In [None]:
### Apply the above Model on Validation Set
y_valid, X_valid = vec_for_learning(vectorize, valid_tagged)
y_pred_valid = logreg.predict(X_valid)

In [None]:
print('Validation Accuracy:', accuracy_score(y_valid, y_pred_valid))
print('Validation F1-Score:', f1_score(y_valid, y_pred_valid, average='weighted'))
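The `average='weighted'` option computes one F1 score per class and averages them weighted by each class's number of true samples. A hand-rolled sketch on toy labels (the helper `weighted_f1` is written for illustration and is not part of scikit-learn):

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)


y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("weighted F1:", weighted_f1(y_true, y_pred))
```

Here class `a` has F1 0.8 with support 3 and class `b` has F1 2/3 with support 1, so frequent classes dominate the weighted score.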

### On the validation set, the Logistic Regression model gives
1. F1-score: 0.68
2. Accuracy: 69%

Hence, let's apply this model to our test set and check its performance metrics.

In [None]:
### Apply the above Model on Test Set
y_test, X_test = vec_for_learning(vectorize, test_tagged)
y_pred_test = logreg.predict(X_test)

### Let's calculate some performance metrics on the test predictions.

In [None]:
accuracy = accuracy_score(y_test, y_pred_test)
f1_scr = f1_score(y_test, y_pred_test, average='weighted')
print('Test Accuracy:', accuracy)
print('Test F1-Score:', f1_scr)

In [None]:
# The report lists classes in sorted order, so pass the sorted labels explicitly
y_unique = sorted(set(y_test))
cr = classification_report(y_test, y_pred_test, labels=y_unique)
print(cr)
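`classification_report` lists classes in sorted label order, and any names passed through `target_names` are matched to those rows by position, so an unsorted list can attach names to the wrong rows. The sketch below computes the same per-class precision, recall and support by hand on toy labels; `per_class_report` is a hypothetical helper, not a scikit-learn function.

```python
def per_class_report(y_true, y_pred):
    """Precision, recall and support per class, in sorted label order."""
    rows = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        rows[label] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
            "support": actual,
        }
    return rows


y_true = ["abortion", "abortion", "abortion", "stroke"]
y_pred = ["abortion", "abortion", "stroke", "stroke"]

report = per_class_report(y_true, y_pred)
for label, row in report.items():
    print(f"{label:10s} precision={row['precision']:.2f} "
          f"recall={row['recall']:.2f} support={row['support']}")
```

Keying each row by its own label, as done here, avoids relying on list order at all.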

### Thus, from the above Report it can be seen:
1. Average Precision: 0.69
2. Average Recall: 0.70

## Model# 2: SVM

### Let's perform a grid search to find the best hyperparameters for the SVM

In [None]:
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 'kernel': ['rbf']}
grid_svm = GridSearchCV(SVC(), param_grid)

In [None]:
grid_svm.fit(X_train, y_train)

In [None]:
### Best parameters for the Grid Search
grid_svm.best_params_

In [None]:
### Mean cross-validated accuracy of the best parameters
grid_svm.best_score_

### Apply the best parameters to SVC and train the model

In [None]:
svcModel = SVC(C=10, gamma=0.01, kernel='rbf')
svcModel.fit(X_train, y_train)

In [None]:
### Apply the above Model on Validation Set
y_valid, X_valid = vec_for_learning(vectorize, valid_tagged)
y_pred_valid = svcModel.predict(X_valid)

In [None]:
print('SVM Validation Accuracy:', accuracy_score(y_valid, y_pred_valid))
print('SVM Validation F1-Score:', f1_score(y_valid, y_pred_valid, average='weighted'))

### On the validation set, the SVC model gives
1. F1-score: 0.71
2. Accuracy: 70%

Hence, let's apply this model to our test set and check its performance metrics.

In [None]:
### Apply the above Model on Test Set
y_test, X_test = vec_for_learning(vectorize, test_tagged)
y_pred_test = svcModel.predict(X_test)

### Let's calculate some performance metrics on the test predictions.

In [None]:
accuracy = accuracy_score(y_test, y_pred_test)
f1_scr = f1_score(y_test, y_pred_test, average='weighted')
print('SVM Test Accuracy:', accuracy)
print('SVM Test F1-Score:', f1_scr)

In [None]:
# The report lists classes in sorted order, so pass the sorted labels explicitly
y_unique = sorted(set(y_test))
cr = classification_report(y_test, y_pred_test, labels=y_unique)
print(cr)

### Thus, from the above Report it can be seen:
1. Average Precision: 0.71
2. Average Recall: 0.71

## Model# 3: XGBoost

### Let's create a parameter grid for the XGBoost model

In [None]:
param_grid = {'n_estimators':[100, 500, 1000], 'max_depth':[5, 6, 7], 'min_child_weight': [3, 5, 8]}

In [None]:
unique = list(set(y_train))
X_train = pd.DataFrame(X_train)
y_train = np.asarray(y_train)
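Depending on the installed xgboost version, `XGBClassifier` may expect integer class labels `0..n-1` rather than strings; this is an assumption about the environment, not something the notebook states. If that applies, the labels can be encoded before fitting and decoded after predicting. A minimal pure-Python sketch with invented labels, mirroring the sorted-class mapping that scikit-learn's `LabelEncoder` uses:

```python
# Invented string labels standing in for y_train.
y_train_labels = [
    "cerebrovascular accident",
    "abortion",
    "abortion",
    "diabetes mellitus",
]

# Map each distinct label to an integer id, in sorted order.
classes = sorted(set(y_train_labels))
to_id = {label: i for i, label in enumerate(classes)}

y_encoded = [to_id[label] for label in y_train_labels]   # what the model would train on
y_decoded = [classes[i] for i in y_encoded]              # map predictions back to text

print(y_encoded)
print(y_decoded == y_train_labels)
```

The same `to_id` mapping has to be reused for the validation and test labels so that ids stay consistent across splits.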

In [None]:
# XGBoost's parameter for the number of classes is `num_class`
XGBgrid = GridSearchCV(
    XGBClassifier(learning_rate=0.1, gamma=0, objective='multi:softmax',
                  num_class=len(unique), seed=27),
    param_grid)

In [None]:
XGBgrid.fit(X_train, y_train)

In [None]:
### Best parameters for the Grid Search
XGBgrid.best_params_

In [None]:
### Mean cross-validated accuracy of the best parameters
XGBgrid.best_score_

### Train an XGBoost Classifier

In [None]:
XGBModel = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=7,
    min_child_weight=4,  # note: not one of the values searched above ([3, 5, 8])
    gamma=0,
    objective='multi:softmax',
    seed=27)
XGBModel.fit(X_train, y_train)

In [None]:
### Apply the above Model on Validation Set
y_valid, X_valid = vec_for_learning(vectorize, valid_tagged)
X_valid = pd.DataFrame(X_valid)
y_valid = np.asarray(y_valid)
y_pred_valid = XGBModel.predict(X_valid)

In [None]:
print('XGBoost Validation Accuracy:', accuracy_score(y_valid, y_pred_valid))
print('XGBoost Validation F1-Score:', f1_score(y_valid, y_pred_valid, average='weighted'))

### On the validation set, the XGBoost model gives
1. F1-score: 0.58
2. Accuracy: 59.1%

Hence, let's apply this model to our test set and check its performance metrics.

In [None]:
### Apply the above Model on Test Set
y_test, X_test = vec_for_learning(vectorize, test_tagged)
X_test = pd.DataFrame(X_test)
y_test = np.asarray(y_test)
y_pred_test = XGBModel.predict(X_test)

### Let's calculate some performance metrics on the test predictions.

In [None]:
accuracy = accuracy_score(y_test, y_pred_test)
f1_scr = f1_score(y_test, y_pred_test, average='weighted')
print('XGBoost Test Accuracy:', accuracy)
print('XGBoost Test F1-Score:', f1_scr)

In [None]:
# The report lists classes in sorted order, so pass the sorted labels explicitly
y_unique = sorted(set(y_test))
cr = classification_report(y_test, y_pred_test, labels=y_unique)
print(cr)

### Thus, from the above Report it can be seen:
1. Average Precision: 0.59
2. Average Recall: 0.59
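To compare the three models side by side, the figures reported in the sections above can be collected in one place. The sketch below uses only those reported numbers; the dictionary layout and key names are an arbitrary choice made for illustration.

```python
# Validation F1/accuracy and test precision/recall as reported above.
results = {
    "Logistic Regression": {"val_f1": 0.68, "val_acc": 0.69,
                            "test_precision": 0.69, "test_recall": 0.70},
    "SVM":                 {"val_f1": 0.71, "val_acc": 0.70,
                            "test_precision": 0.71, "test_recall": 0.71},
    "XGBoost":             {"val_f1": 0.58, "val_acc": 0.591,
                            "test_precision": 0.59, "test_recall": 0.59},
}

# Rank models by validation F1, highest first.
ranking = sorted(results, key=lambda name: results[name]["val_f1"], reverse=True)
best = ranking[0]

print(f"{'model':22s} val_f1  val_acc  test_p  test_r")
for name in ranking:
    r = results[name]
    print(f"{name:22s} {r['val_f1']:<7} {r['val_acc']:<8} "
          f"{r['test_precision']:<7} {r['test_recall']}")
print("best by validation F1:", best)
```

On these numbers the SVM edges out Logistic Regression, and XGBoost trails both by roughly ten points.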

# Next Steps

## A basic Logistic Regression implementation is about 70% accurate, so as next steps we can:
1. Try tuning the Doc2Vec vectorizer's hyperparameters.
2. Try some other classification algorithms, such as SVM and Random Forest, and compare the results.
3. The present model has been trained to disambiguate 20 medical abbreviations, but the same approach could be generalized to other fields, such as scientific research and Internet slang.