### Download the data

In [1]:
!wget https://raw.githubusercontent.com/suvigyajain0101/CaseStudies/main/AdverseEventClassification/Data/AE_Data.csv

--2022-08-19 20:23:38--  https://raw.githubusercontent.com/suvigyajain0101/CaseStudies/main/AdverseEventClassification/Data/AE_Data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5998096 (5.7M) [text/plain]
Saving to: ‘AE_Data.csv’


2022-08-19 20:23:39 (157 MB/s) - ‘AE_Data.csv’ saved [5998096/5998096]



### Import Libraries

In [2]:
import pandas as pd
import numpy as np
import re

In [3]:
WORDS_TO_REMOVE = ['##padding##', 'ti-', 'ti -']

In [4]:
df = pd.read_csv('/content/AE_Data.csv')
df.head()

Unnamed: 0,title,abstract,label
0,antimicrobial impacts of essential oils on foo...,the antimicrobial activity of twelve essential...,0
1,purification and characterization of a cystein...,antimicrobial peptide (amp) crustin is a type ...,0
2,telavancin activity tested against gram-positi...,objectives: to reassess the activity of telava...,0
3,the in vitro antimicrobial activity of cymbopo...,background: it is well known that cymbopogon (...,0
4,screening currency notes for microbial pathoge...,fomites are a well-known source of microbial i...,0


In [5]:
df['label'].value_counts()

0    3851
1     294
Name: label, dtype: int64

### Data Cleaning

1. Combine title and abstract
2. Lowercase the entire corpus
3. Remove newlines and tabs from the dataset
4. Remove brackets, #, colons, 'TI' (title identifier), and '##PADDING##'
5. Lemmatize and remove stopwords
6. Remove records with 10 or fewer words

In [6]:
df['text'] = df['title'] + ' ' + df['abstract']
df.head()

Unnamed: 0,title,abstract,label,text
0,antimicrobial impacts of essential oils on foo...,the antimicrobial activity of twelve essential...,0,antimicrobial impacts of essential oils on foo...
1,purification and characterization of a cystein...,antimicrobial peptide (amp) crustin is a type ...,0,purification and characterization of a cystein...
2,telavancin activity tested against gram-positi...,objectives: to reassess the activity of telava...,0,telavancin activity tested against gram-positi...
3,the in vitro antimicrobial activity of cymbopo...,background: it is well known that cymbopogon (...,0,the in vitro antimicrobial activity of cymbopo...
4,screening currency notes for microbial pathoge...,fomites are a well-known source of microbial i...,0,screening currency notes for microbial pathoge...


In [7]:
df.replace(r'\n','', regex=True).iloc[4140, :]['text']

'TI  - [AUTO-INFECTION (INTESTINAL) IN RADIATION SICKNESS AND ITS PREVENTION IN WISTAR WHITE RATS]. ##PADDING##'

In [8]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [9]:
eng_stopwords = stopwords.words('english')
stemmer = WordNetLemmatizer()

joined_words_to_remove = '|'.join(WORDS_TO_REMOVE)


def clean_text(x):
  # Lower case the text
  lower_x = x.lower()

  # Remove line breaks and tabs
  no_break_x = re.sub("\n|\r|\t", " ", lower_x)

  # Remove specific words
  no_waste_words_x = re.sub(joined_words_to_remove, " ", no_break_x)

  # Remove all non alphabet, numeral and space characters
  alpha_x = re.sub('[^0-9a-zA-Z ]+', ' ', no_waste_words_x)

  # Remove stopwords and lemmatize the word. Join at the end will also remove multi-spaces
  lemma_x = ' '.join([stemmer.lemmatize(word) for word in alpha_x.split() if word not in eng_stopwords])

  return lemma_x
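
One caveat on the "Remove specific words" step: the pattern `ti-` is not anchored, so it also matches inside ordinary words such as "anti-inflammatory", while a title tag written with two spaces (`TI  - `, as in record 4140 above) slips through. Below is a minimal sketch of an anchored alternative. The names `MARKER_PATTERN` and `strip_markers` and the sample string are hypothetical, not part of the notebook, and the sketch assumes the title tag only ever appears at the start of the text.

```python
import re

# The pattern built from WORDS_TO_REMOVE in the notebook
naive_pattern = re.compile("##padding##|ti-|ti -")

# Hypothetical alternative: match the title tag only at the start of the text,
# with any amount of whitespace around the dash, plus the padding token
MARKER_PATTERN = re.compile(r"^\s*ti\s*-\s*|##padding##")


def strip_markers(text):
    """Lowercase the text and blank out the title tag and padding token."""
    return MARKER_PATTERN.sub(" ", text.lower())


# Made-up sample in the style of the raw records
sample = "TI  - anti-inflammatory agents in rats. ##PADDING##"

naive_result = naive_pattern.sub(" ", sample.lower())
anchored_result = strip_markers(sample)

print("naive    :", repr(naive_result))
print("anchored :", repr(anchored_result))
```

The naive pattern splits "anti-inflammatory" and leaves the leading `ti` in place, while the anchored one removes only the tag and the padding token.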

Let's test the function on a few examples

In [10]:
for sample_text in df.sample(5)['text'].values:
  print('ORIGINAL TEXT : ', sample_text)
  print('-'*100)
  print('CLEANED TEXT : ', clean_text(sample_text))
  print('\n')
  print('*'*100)

ORIGINAL TEXT :  [comparative studies on activities of antimicrobial agents against causative organisms isolated from patients with urinary tract infections (1994). ii. background of patients].
 clinical background was investigated on 628 bacterial strains isolated from patients with urinary tract infections (utis) in 10 hospitals during period from  june, 1994 to may, 1995. 1. distributions of sex, age and urinary tract infections among over sixties males, the majority was taken by complicated urinary tract infections. among females, the uncomplicated urinary tract infections was most frequent without a relation of age. as for over 40 females, the increase of complicated uti was admitted. 2. distribution of sex, age and pathogens isolated from utis in uncomplicated utis,other_species was most frequently isolated without a relation of age, and nextother_species and cns. in complicated utis without indwelling catheter,other_species,other_species andother_species were isolated the most f

In [11]:
# Apply cleaning function to the text field
df['clean_text'] = df['text'].apply(clean_text)

# Get the word count and keep only records with more than 10 words
df['text_len'] = df['clean_text'].str.split().apply(len)

cleaned_df = df[df['text_len'] > 10][['clean_text', 'label']]

In [12]:
cleaned_df.head()

Unnamed: 0,clean_text,label
0,antimicrobial impact essential oil food borne ...,0
1,purification characterization cysteine rich 14...,0
2,telavancin activity tested gram positive clini...,0
3,vitro antimicrobial activity cymbopogon essent...,0
4,screening currency note microbial pathogen ant...,0


In [13]:
print('Total records retained after data cleaning : ', cleaned_df.shape[0])
cleaned_df['label'].value_counts()

Total records retained after data cleaning :  4013


0    3719
1     294
Name: label, dtype: int64

### TF-IDF Vectorizer

Convert the text to numeric features. We'll use TF-IDF, which scores each word by how often it appears in a document, down-weighted by how many documents in the corpus contain it.

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfconverter = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
X = tfidfconverter.fit_transform(cleaned_df['clean_text']).toarray()
y = cleaned_df['label'].values
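
To make the weighting concrete, here is a small hand-rolled sketch of the scheme scikit-learn documents as the `TfidfVectorizer` default: smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The function `tfidf` and the toy documents are illustrative assumptions, and the sketch tokenizes on whitespace only, so it is not a drop-in replacement for the vectorizer.

```python
import math
from collections import Counter


def tfidf(docs):
    """Smoothed TF-IDF with L2-normalized rows, one dict of weights per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        raw = {t: term_freq[t] * idf[t] for t in term_freq}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({t: v / norm for t, v in raw.items()})
    return rows


# Toy corpus (made up for illustration)
toy_docs = [
    "antimicrobial activity essential oil",
    "antimicrobial peptide crustin",
    "antimicrobial activity telavancin",
]

weights = tfidf(toy_docs)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:15s} {weight:.3f}")
```

In the first toy document, "antimicrobial" appears in every document and gets the lowest weight, while "essential" and "oil" appear only once in the corpus and get the highest.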

Since the data is imbalanced, we need to split it into train and test sets so that both preserve the label distribution of the full dataset. That's where stratified sampling comes in.

In [15]:
from sklearn.model_selection import train_test_split

test_split = 0.2

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_split, stratify=y)
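
Note that the cells above fit the vectorizer on all of `cleaned_df` before splitting, so the vocabulary and IDF weights also reflect the test documents. A minimal sketch of the alternative order (split the raw text first, then fit TF-IDF on the training portion only) is shown below. The toy strings, labels, and variable names are made up for illustration, and `random_state=0` is an assumption added for reproducibility.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus: 8 negatives and 2 positives (illustrative only)
toy_texts = [
    "antimicrobial activity essential oil",
    "antimicrobial peptide crustin shrimp",
    "telavancin activity clinical isolates",
    "cymbopogon essential oil activity",
    "currency note microbial pathogen",
    "urinary tract infection isolates",
    "peptide purification characterization",
    "essential oil food borne pathogen",
    "patient developed rash after dose",
    "severe reaction reported after treatment",
]
toy_labels = [0] * 8 + [1] * 2

# Split the raw text first, stratified on the label
train_txt, test_txt, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.5, stratify=toy_labels, random_state=0
)

# Fit on the training text only, then reuse the fitted vocabulary for the test text
vectorizer = TfidfVectorizer()
toy_X_train = vectorizer.fit_transform(train_txt)
toy_X_test = vectorizer.transform(test_txt)

print(toy_X_train.shape, toy_X_test.shape)
```

Words that occur only in the test text are simply ignored by `transform`, which mirrors what would happen with unseen documents at prediction time.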

In [16]:
print('Label Distribution in the training data')
print(np.unique(y_train, return_counts=True))
print('*'*50)
print('Label Distribution in the testing data')
print(np.unique(y_test, return_counts=True))

Label Distribution in the training data
(array([0, 1]), array([2975,  235]))
**************************************************
Label Distribution in the testing data
(array([0, 1]), array([744,  59]))
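
As a quick sanity check on the stratification, the counts printed above can be turned into positive-class shares. This is plain arithmetic on the reported numbers; the helper name `positive_share` is hypothetical.

```python
# Label counts copied from the output above
train_counts = {0: 2975, 1: 235}
test_counts = {0: 744, 1: 59}
full_counts = {label: train_counts[label] + test_counts[label] for label in (0, 1)}


def positive_share(counts):
    """Fraction of records labelled 1."""
    return counts[1] / (counts[0] + counts[1])


for name, counts in [("train", train_counts), ("test", test_counts), ("full", full_counts)]:
    print(f"{name:5s} positive share: {positive_share(counts):.4f}")
```

All three shares come out at roughly 7.3%, which is what a stratified split is meant to guarantee.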


Now that the text has been converted into features, we can model the data 

## Machine Learning Models

The dataset is so imbalanced that accuracy is not a good metric for judging a model. We'll use the classification report, and more importantly the F1 score, for model comparison. Also note that we want to reduce false negatives as much as we can: we don't want to classify a document as non-adverse-event when it is in fact adverse-event related.
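
To see why accuracy is misleading here, consider a trivial baseline that always predicts class 0. The sketch below uses only the test-split counts printed earlier; the variable names are illustrative.

```python
# Test-split label counts reported in the notebook output above
n_negative, n_positive = 744, 59
total = n_negative + n_positive

# A "model" that always predicts 0 gets every negative right and every positive wrong
baseline_accuracy = n_negative / total
baseline_recall_positive = 0 / n_positive

print(f"accuracy       : {baseline_accuracy:.3f}")
print(f"class-1 recall : {baseline_recall_positive:.3f}")
```

The baseline reaches about 93% accuracy while missing every adverse-event document, so any accuracy figure in the mid-90s has to be read against that floor.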

In [17]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

### Experiment 1 - Random Forest

In [18]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the model
classifier = RandomForestClassifier(n_estimators=1000, random_state=0)

# Fit the model to the data
classifier.fit(X_train, y_train)

# Generate predictions
y_pred = classifier.predict(X_test)

# Prepare Classification Report
print(confusion_matrix(y_test,y_pred))
print('\n')
print(classification_report(y_test,y_pred))

[[743   1]
 [ 31  28]]


              precision    recall  f1-score   support

           0       0.96      1.00      0.98       744
           1       0.97      0.47      0.64        59

    accuracy                           0.96       803
   macro avg       0.96      0.74      0.81       803
weighted avg       0.96      0.96      0.95       803



Performance on 0s is satisfactory, but 1s are pretty terrible: recall is only 0.47, so more than half of the adverse-event documents are missed!
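
The class-1 row of the report can be reproduced by hand from the confusion matrix, which makes it clear where the low recall comes from. This is a sketch of the standard precision, recall, and F1 formulas applied to the Random Forest matrix printed above.

```python
# Random Forest confusion matrix from the output above: rows = true label, columns = predicted
cm = [[743, 1],
      [31, 28]]

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)   # of the documents flagged as 1, how many really are 1
recall = tp / (tp + fn)      # of the true 1s, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"precision: {precision:.2f}")
print(f"recall   : {recall:.2f}")
print(f"f1       : {f1:.2f}")
```

The 31 false negatives out of 59 positives are what drag recall, and with it F1, down.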

### Experiment 2 - Multinomial NB

In [19]:
from sklearn.naive_bayes import MultinomialNB

# Initialize the model
classifier = MultinomialNB()

# Fit the model to the data
classifier.fit(X_train, y_train)

# Generate predictions
y_pred = classifier.predict(X_test)

# Prepare Classification Report
print(confusion_matrix(y_test,y_pred))
print('\n')
print(classification_report(y_test,y_pred))

[[738   6]
 [ 41  18]]


              precision    recall  f1-score   support

           0       0.95      0.99      0.97       744
           1       0.75      0.31      0.43        59

    accuracy                           0.94       803
   macro avg       0.85      0.65      0.70       803
weighted avg       0.93      0.94      0.93       803



Even worse!

### Experiment 3 - SVM : SGD Classifier

In [21]:
from sklearn.linear_model import SGDClassifier

# Initialize the model
classifier = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, random_state=42)

# Fit the model to the data
classifier.fit(X_train, y_train)

# Generate predictions
y_pred = classifier.predict(X_test)

# Prepare Classification Report
print(confusion_matrix(y_test,y_pred))
print('\n')
print(classification_report(y_test,y_pred))

[[741   3]
 [ 38  21]]


              precision    recall  f1-score   support

           0       0.95      1.00      0.97       744
           1       0.88      0.36      0.51        59

    accuracy                           0.95       803
   macro avg       0.91      0.68      0.74       803
weighted avg       0.95      0.95      0.94       803





Better performance than MNB, but worse than Random Forest

### Experiment 4 - K-Nearest Neighbors

In [23]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize the model
classifier = KNeighborsClassifier(n_neighbors=2)

# Fit the model to the data
classifier.fit(X_train, y_train)

# Generate predictions
y_pred = classifier.predict(X_test)

# Prepare Classification Report
print(confusion_matrix(y_test,y_pred))
print('\n')
print(classification_report(y_test,y_pred))

[[737   7]
 [ 43  16]]


              precision    recall  f1-score   support

           0       0.94      0.99      0.97       744
           1       0.70      0.27      0.39        59

    accuracy                           0.94       803
   macro avg       0.82      0.63      0.68       803
weighted avg       0.93      0.94      0.92       803



Not a good idea, TBH!

### Experiment 5 - Decision Trees

In [25]:
from sklearn.tree import DecisionTreeClassifier

# Initialize the model
classifier = DecisionTreeClassifier(max_depth=5)

# Fit the model to the data
classifier.fit(X_train, y_train)

# Generate predictions
y_pred = classifier.predict(X_test)

# Prepare Classification Report
print(confusion_matrix(y_test,y_pred))
print('\n')
print(classification_report(y_test,y_pred))

[[737   7]
 [ 44  15]]


              precision    recall  f1-score   support

           0       0.94      0.99      0.97       744
           1       0.68      0.25      0.37        59

    accuracy                           0.94       803
   macro avg       0.81      0.62      0.67       803
weighted avg       0.92      0.94      0.92       803



A single tree does not perform better than a Random Forest, which is what ensemble theory would lead us to expect 😀

### Experiment 6 - AdaBoost (discrete SAMME)

In [26]:
from sklearn.ensemble import AdaBoostClassifier

base_model = DecisionTreeClassifier(max_depth=9, min_samples_leaf=1)

# Initialize the model
classifier = AdaBoostClassifier(
    # `base_estimator` is the older parameter name; newer scikit-learn releases call it `estimator`
    base_estimator=base_model,
    learning_rate=1.0,
    n_estimators=400,
    algorithm="SAMME",
)

# Fit the model to the data
classifier.fit(X_train, y_train)

# Generate predictions
y_pred = classifier.predict(X_test)

# Prepare Classification Report
print(confusion_matrix(y_test,y_pred))
print('\n')
print(classification_report(y_test,y_pred))

[[741   3]
 [ 26  33]]


              precision    recall  f1-score   support

           0       0.97      1.00      0.98       744
           1       0.92      0.56      0.69        59

    accuracy                           0.96       803
   macro avg       0.94      0.78      0.84       803
weighted avg       0.96      0.96      0.96       803



That's a considerable improvement!

### Experiment 7 - AdaBoost (real SAMME.R)

In [27]:
from sklearn.ensemble import AdaBoostClassifier

base_model = DecisionTreeClassifier(max_depth=9, min_samples_leaf=1)

# Initialize the model
classifier = AdaBoostClassifier(
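    # `base_estimator` is the older parameter name; newer scikit-learn releases call it `estimator`
    base_estimator=base_model,
    learning_rate=1.0,
    n_estimators=400,
    # "SAMME.R" is deprecated in newer scikit-learn releases
    algorithm="SAMME.R",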
)

# Fit the model to the data
classifier.fit(X_train, y_train)

# Generate predictions
y_pred = classifier.predict(X_test)

# Prepare Classification Report
print(confusion_matrix(y_test,y_pred))
print('\n')
print(classification_report(y_test,y_pred))

[[743   1]
 [ 33  26]]


              precision    recall  f1-score   support

           0       0.96      1.00      0.98       744
           1       0.96      0.44      0.60        59

    accuracy                           0.96       803
   macro avg       0.96      0.72      0.79       803
weighted avg       0.96      0.96      0.95       803



Worse than the discrete SAMME algorithm and slightly behind Random Forest on class 1 (F1 of 0.60 vs. 0.64), but still better than the other models.
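
To compare all seven experiments side by side, the class-1 metrics can be recomputed from the confusion matrices printed above and ranked by F1. The dictionary below only restates those printed counts; the helper `class1_scores` is a hypothetical name.

```python
# (false positives, false negatives, true positives) for class 1, copied from each experiment
results = {
    "Random Forest": (1, 31, 28),
    "Multinomial NB": (6, 41, 18),
    "SGD (hinge)": (3, 38, 21),
    "K-Nearest Neighbors": (7, 43, 16),
    "Decision Tree": (7, 44, 15),
    "AdaBoost (SAMME)": (3, 26, 33),
    "AdaBoost (SAMME.R)": (1, 33, 26),
}


def class1_scores(fp, fn, tp):
    """Return (precision, recall, f1) for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


# Rank models by class-1 F1, best first
ranked = sorted(results, key=lambda name: class1_scores(*results[name])[2], reverse=True)

print(f"{'model':22s} {'prec':>5s} {'rec':>5s} {'f1':>5s} {'missed':>7s}")
for name in ranked:
    fp, fn, tp = results[name]
    precision, recall, f1 = class1_scores(fp, fn, tp)
    print(f"{name:22s} {precision:5.2f} {recall:5.2f} {f1:5.2f} {fn:7d}")
```

The "missed" column is the false-negative count, the quantity this case study most wants to keep low. Keep in mind that the split was made without a fixed `random_state`, so these figures come from a single random split.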