#### Antibiotic Resistant Bacteria Multiclass-Classification and Drug Discovery
#### Corey J Sinnott
# Resistance Mechanism Multiclass-Classification

## Executive Summary

This report was commissioned to determine a robust, fast, and reproducible means of searching for and developing new antibiotics, in an effort to combat antibiotic resistance. Conclusions and recommendations follow the in-depth analysis.
   
Data was obtained from the following sources:
- Comprehensive Antibiotic Resistance Database (CARD) via the CARD CLI: 
 - https://card.mcmaster.ca
- ChEMBL via Python client library: 
 - https://www.ebi.ac.uk/chembl/ 

**Full Executive Summary, Conclusion, Recommendations, Data Dictionary and Sources can be found in README.**

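## Contents:
- [Data Import & Cleaning](#Data-Import-&-Cleaning)
- [Model Testing](#Model-Testing)
- [Tuning Model](#Tuning-Model)
- [Final Model & Evaluation](#Final-Model-&-Evaluation)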

#### Importing Libraries

In [15]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
import warnings
warnings.filterwarnings("ignore")
from sklearn import set_config
from sklearn.compose import make_column_transformer, make_column_selector
#from sklearn.preprocessing import OneHotEncoder
from category_encoders import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV, train_test_split, RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, balanced_accuracy_score,\
                            precision_score, recall_score, roc_auc_score,\
                            plot_confusion_matrix, classification_report, plot_roc_curve
from sklearn.preprocessing import LabelBinarizer
from lazypredict.Supervised import Classification, Regression

# Data Import & Cleaning

In [3]:
df = pd.read_csv('./data/trimmed_df_for_classification_v1.csv')

In [4]:
df.sample(5)

Unnamed: 0,filename,rgi_main.Drug Class,rgi_main.Resistance Mechanism,rgi_main.CARD_Protein_Sequence,rgi_main.Percentage Length of Reference Sequence,lmat.score,lmat.count,mlst.alleles_1,mlst.alleles_2,mlst.alleles_3,mlst.alleles_4,mlst.alleles_5,mlst.alleles_6,mlst.alleles_7,mlst.alleles_8,mlst.alleles_9,mlst.alleles_10
1959,8a3f84c136a2a0ddd1faabc440a7cf57,fluoroquinolone antibiotic; acridine dye; tric...,antibiotic efflux,MSNVTSFRSELKQLFHLMLPILITQFAQAGFGLIDTIMAGHLSAAD...,100.0,3944.97,5618.0,Pas_cpn60(2),Pas_fusA(2),Pas_gltA(2),Pas_pyrG(2),Pas_recA(2),Pas_rplB(2),Pas_rpoB(2),,,
1921,872aa31717b84c6bd5a7aa9befc14bcf,fluoroquinolone antibiotic; acridine dye; tric...,antibiotic efflux,MSNVTSFRSELKQLFHLMLPILITQFAQAGFGLIDTIMAGHLSAAD...,100.0,4412.82,6012.0,Pas_cpn60(2),Pas_fusA(2),Pas_gltA(2),Pas_pyrG(2),Pas_recA(2),Pas_rplB(2),Pas_rpoB(2),,,
2340,a2e1d7732cf6f6f2ecb249ed8f15ef4b,,,,,17679.1,10162.0,,,,,,,,,,
851,3cb91ddcb4572492892a0a26e7d6dd11,fluoroquinolone antibiotic; tetracycline antib...,antibiotic efflux,MFKKIFPLALVSSLRFLGLFIVLPVISLYADSFHSSSPLLVGLAVG...,100.0,13.596,7.0,atpA(~2764),efp(82),mutY(~2663),ppa(~2658),trpC(~2204),ureI(~1586),yphC(~2041),,,
1537,6ed1a7efd720f0ec184b17a420c94419,fluoroquinolone antibiotic,antibiotic efflux,MDFEKDVIRTVTFKLIPALVILYLVAYIDRAAVGFAHLHMGADVGI...,44.24,5646.62,7823.0,Pas_cpn60(~3),Pas_fusA(2),Pas_gltA(2),Pas_pyrG(~80),Pas_recA(~195),Pas_rplB(~2),Pas_rpoB(~5),,,


In [5]:
df['rgi_main.Resistance Mechanism'].value_counts()

antibiotic efflux                     1826
antibiotic inactivation                636
antibiotic target alteration           287
antibiotic target replacement           91
antibiotic target protection            83
reduced permeability to antibiotic      30
Name: rgi_main.Resistance Mechanism, dtype: int64

Assigning a numeric value to each mechanism of resistance.

In [6]:
# Map each resistance mechanism to an integer code in a single pass.
# Missing values (and any label not listed here) become NaN.
mechanism_codes = {
    'antibiotic efflux': 0,
    'antibiotic inactivation': 1,
    'antibiotic target alteration': 2,
    'antibiotic target replacement': 3,
    'antibiotic target protection': 4,
    'reduced permeability to antibiotic': 5,
}

# Assign the result back instead of relying on inplace=True on a column selection
df['rgi_main.Resistance Mechanism'] = df['rgi_main.Resistance Mechanism'].map(mechanism_codes)

In [7]:
df['rgi_main.Resistance Mechanism'].value_counts()

0.0    1826
1.0     636
2.0     287
3.0      91
4.0      83
5.0      30
Name: rgi_main.Resistance Mechanism, dtype: int64
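
The counts above are heavily skewed toward antibiotic efflux. As a rough illustration, the class shares and the ratio between the largest and smallest class can be worked out directly from the printed counts. The dictionary below copies those counts by hand, and the variable names are hypothetical rather than part of the notebook.

```python
# Counts copied from the value_counts() output above (labels 0-5).
mechanism_counts = {0: 1826, 1: 636, 2: 287, 3: 91, 4: 83, 5: 30}

total = sum(mechanism_counts.values())

# Share of each class among the labelled rows
class_shares = {label: count / total for label, count in mechanism_counts.items()}

# How many times larger the biggest class is than the smallest
imbalance_ratio = max(mechanism_counts.values()) / min(mechanism_counts.values())

for label, share in class_shares.items():
    print(f"class {label}: {share:.3f}")
print(f"largest / smallest class: {imbalance_ratio:.1f}")
```

A skew of this size is worth keeping in mind when reading plain accuracy figures later on.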

In [8]:
df.dropna(subset = ['rgi_main.CARD_Protein_Sequence', 'rgi_main.Resistance Mechanism'], inplace=True)

In [9]:
df['rgi_main.Resistance Mechanism'] = df['rgi_main.Resistance Mechanism'].astype(np.int64)

Defining variables & encoding

In [21]:
#just protein test
X = df[['rgi_main.CARD_Protein_Sequence']]
y = df['rgi_main.Resistance Mechanism']

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y)
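
The `stratify = y` argument asks `train_test_split` to keep the class proportions roughly equal in both splits, which matters with classes this imbalanced. The sketch below checks that on synthetic labels that merely mirror the class counts reported earlier. The placeholder feature, the `_demo` names, and `random_state=0` are assumptions added for illustration and are not in the notebook cell.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset (no real features).
counts = [1826, 636, 287, 91, 83, 30]
y_demo = np.repeat(np.arange(len(counts)), counts)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# random_state is added here only so the sketch is repeatable.
_, _, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=0)

# Per-class share in each split
train_share = np.bincount(y_tr, minlength=len(counts)) / len(y_tr)
test_share = np.bincount(y_te, minlength=len(counts)) / len(y_te)

print(np.round(train_share, 3))
print(np.round(test_share, 3))
```

With the default 25% test size, the test split holds 739 rows, which matches the support totals in the classification reports.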

In [12]:
ohe = OneHotEncoder()

In [13]:
X_train = ohe.fit_transform(X_train)
X_test = ohe.transform(X_test)
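
One-hot encoding a column of full protein sequences treats every distinct sequence as its own category, so the model sees sequence identity rather than sequence content. The toy sketch below illustrates this. It uses scikit-learn's `OneHotEncoder` as a stand-in for the `category_encoders` class used above (the two are not guaranteed to behave identically), and the short sequences are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Invented toy sequences; the real column holds full-length proteins.
train_seqs = np.array([["MSNV"], ["MFKK"], ["MSNV"]])
test_seqs = np.array([["MFKK"], ["MDFE"]])  # "MDFE" never appears in training

# handle_unknown="ignore" encodes unseen categories as an all-zero row.
enc = OneHotEncoder(handle_unknown="ignore")
train_enc = enc.fit_transform(train_seqs).toarray()
test_enc = enc.transform(test_seqs).toarray()

print(enc.categories_[0])  # one indicator column per distinct training sequence
print(train_enc)
print(test_enc)            # the unseen sequence carries no information
```

In this sketch a sequence that was not seen during fitting is encoded as all zeros, which suggests the approach can only recognise sequences it has already encountered.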

# Model Testing

Lazypredict model search

In [16]:
clf = Classification()

In [17]:
models, predictions = clf.fit(X_train, X_test, y_train, y_test)

100%|██████████| 29/29 [00:46<00:00,  1.59s/it]


In [18]:
models

Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
SGDClassifier,0.96,0.94,,0.96,0.06
LabelPropagation,0.96,0.93,,0.95,0.15
RidgeClassifier,0.96,0.93,,0.95,0.04
RandomForestClassifier,0.96,0.93,,0.95,0.58
Perceptron,0.96,0.93,,0.95,0.07
PassiveAggressiveClassifier,0.96,0.93,,0.95,0.09
NearestCentroid,0.96,0.93,,0.95,0.04
LogisticRegression,0.96,0.93,,0.95,0.09
LinearSVC,0.96,0.93,,0.95,7.48
LabelSpreading,0.96,0.93,,0.95,0.2


Random Forest

In [122]:
from sklearn.metrics import classification_report

In [125]:
RF = RandomForestClassifier(n_estimators = 500)
RF.fit(X_train, y_train)
y_pred = RF.predict(X_test)
print(f'accuracy  = {np.round(accuracy_score(y_test, y_pred), 3)}')
print(f'metrics: {classification_report(y_test, y_pred)}')

accuracy  = 0.97
metrics:               precision    recall  f1-score   support

         0.0       0.95      1.00      0.98       457
         1.0       1.00      0.92      0.96       159
         2.0       1.00      0.92      0.96        72
         3.0       1.00      0.87      0.93        23
         4.0       1.00      0.95      0.98        21
         5.0       1.00      1.00      1.00         7

    accuracy                           0.97       739
   macro avg       0.99      0.94      0.97       739
weighted avg       0.97      0.97      0.97       739



In [126]:
confusion_matrix(y_test, y_pred)
# Every misclassification lands in class 0 (the majority class), reflecting the class imbalance

array([[457,   0,   0,   0,   0,   0],
       [ 12, 147,   0,   0,   0,   0],
       [  6,   0,  66,   0,   0,   0],
       [  3,   0,   0,  20,   0,   0],
       [  1,   0,   0,   0,  20,   0],
       [  0,   0,   0,   0,   0,   7]])
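
The per-class recall and the macro-averaged recall (balanced accuracy) in the report can be recomputed from this confusion matrix, where rows are true classes and columns are predicted classes. The sketch below copies the matrix by hand; the variable names are hypothetical.

```python
import numpy as np

# Confusion matrix copied from the Random Forest output above.
cm = np.array([[457,   0,   0,   0,   0,   0],
               [ 12, 147,   0,   0,   0,   0],
               [  6,   0,  66,   0,   0,   0],
               [  3,   0,   0,  20,   0,   0],
               [  1,   0,   0,   0,  20,   0],
               [  0,   0,   0,   0,   0,   7]])

support = cm.sum(axis=1)            # true samples per class
recall = np.diag(cm) / support      # fraction of each class recovered
accuracy = np.trace(cm) / cm.sum()  # overall fraction correct
balanced_accuracy = recall.mean()   # unweighted mean of per-class recall

print(np.round(recall, 2))
print(round(accuracy, 4), round(balanced_accuracy, 4))
```

Overall accuracy is dominated by class 0, while the balanced figure gives the minority classes equal weight, which is why it is lower.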

SVC

In [135]:
svc = SVC()

In [136]:
svc.fit(X_train, y_train)

SVC()

In [144]:
svc.score(X_test, y_test)

0.9702300405953992

In [137]:
y_pred = svc.predict(X_test)
print(f'accuracy  = {np.round(accuracy_score(y_test, y_pred), 3)}')
print(f'metrics: {classification_report(y_test, y_pred)}')

accuracy  = 0.97
metrics:               precision    recall  f1-score   support

         0.0       0.95      1.00      0.98       457
         1.0       1.00      0.92      0.96       159
         2.0       1.00      0.92      0.96        72
         3.0       1.00      0.87      0.93        23
         4.0       1.00      0.95      0.98        21
         5.0       1.00      1.00      1.00         7

    accuracy                           0.97       739
   macro avg       0.99      0.94      0.97       739
weighted avg       0.97      0.97      0.97       739



In [139]:
confusion_matrix(y_test, y_pred)

array([[457,   0,   0,   0,   0,   0],
       [ 12, 147,   0,   0,   0,   0],
       [  6,   0,  66,   0,   0,   0],
       [  3,   0,   0,  20,   0,   0],
       [  1,   0,   0,   0,  20,   0],
       [  0,   0,   0,   0,   0,   7]])

# Tuning Model

In [128]:
params = {
    'n_estimators' : [1000, 2500],
    'ccp_alpha' : [0.0, 0.01, 0.001],
    'min_samples_leaf' : [0, 1, 3],  # note: 0 is invalid (integers must be >= 1), so those candidates fail to fit
    'criterion' : ['gini', 'entropy'],
    'oob_score' : [True, False]
}

In [132]:
grid = GridSearchCV(RF, params)

In [133]:
grid.fit(X_train, y_train)

GridSearchCV(estimator=RandomForestClassifier(n_estimators=500),
             param_grid={'ccp_alpha': [0.0, 0.01, 0.001],
                         'criterion': ['gini', 'entropy'],
                         'min_samples_leaf': [0, 1, 3],
                         'n_estimators': [1000, 2500],
                         'oob_score': [True, False]})

In [138]:
grid.best_params_

{'ccp_alpha': 0.0,
 'criterion': 'gini',
 'min_samples_leaf': 1,
 'n_estimators': 1000,
 'oob_score': True}
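
To gauge the cost of this search, the candidate combinations in the grid can be enumerated by hand. The sketch below uses plain Python rather than scikit-learn. It assumes the default of 5 cross-validation folds (no `cv` argument was passed) and relies on scikit-learn requiring integer `min_samples_leaf` values to be at least 1.

```python
from itertools import product

# Same grid as in the tuning cell above.
params = {
    'n_estimators': [1000, 2500],
    'ccp_alpha': [0.0, 0.01, 0.001],
    'min_samples_leaf': [0, 1, 3],
    'criterion': ['gini', 'entropy'],
    'oob_score': [True, False],
}

# Every combination of parameter values is one candidate model.
names = list(params)
candidates = [dict(zip(names, values)) for values in product(*params.values())]

# Candidates with min_samples_leaf = 0 cannot be fitted.
valid = [c for c in candidates if c['min_samples_leaf'] >= 1]

cv_folds = 5  # assumed: GridSearchCV default when cv is not given
print(f"candidates: {len(candidates)}, valid: {len(valid)}")
print(f"cross-validation fits attempted: {len(candidates) * cv_folds}")
```

A third of the grid is spent on an unusable setting, so replacing the 0 with a valid value would make better use of the search time.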

In [134]:
y_pred = grid.predict(X_test)
print(f'accuracy  = {np.round(accuracy_score(y_test, y_pred), 3)}')
print(f'metrics: {classification_report(y_test, y_pred)}')

accuracy  = 0.97
metrics:               precision    recall  f1-score   support

         0.0       0.95      1.00      0.98       457
         1.0       1.00      0.92      0.96       159
         2.0       1.00      0.92      0.96        72
         3.0       1.00      0.87      0.93        23
         4.0       1.00      0.95      0.98        21
         5.0       1.00      1.00      1.00         7

    accuracy                           0.97       739
   macro avg       0.99      0.94      0.97       739
weighted avg       0.97      0.97      0.97       739



In [143]:
grid.score(X_test, y_test)

0.9702300405953992

# Final Model & Evaluation

In [20]:
def multiclass_classification_and_eval(X, y, model):
    """
    One-hot encode X, fit `model` on a training split, and print evaluation metrics.

    Returns the fitted model, the encoded test features, the test labels,
    and the test predictions.
    """
    ohe = OneHotEncoder()
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    X_train = ohe.fit_transform(X_train)
    X_test = ohe.transform(X_test)
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    # Null accuracy for a multiclass target is the share of the most frequent class;
    # max(mean, 1 - mean) is only valid for binary 0/1 labels.
    null_accuracy = y_test.value_counts(normalize=True).max()
    print(f'null accuracy = {round(null_accuracy, 3)}')
    print(f'accuracy  = {np.round(accuracy_score(y_test, y_pred), 3)}')
    print(f'metrics: {classification_report(y_test, y_pred)}')
    
    return model, X_test, y_test, y_pred

In [24]:
mcc_model, X_test, y_test, y_pred = multiclass_classification_and_eval(X, y, 
                                             RandomForestClassifier(n_jobs = -1,
                                                            criterion = 'gini',
                                                            min_samples_leaf = 1,
                                                            n_estimators = 1000,
                                                            #verbose = 1,
                                                            oob_score = True
                                                            ))

null accuracy = 0.613
accuracy  = 0.957
metrics:               precision    recall  f1-score   support

           0       0.93      1.00      0.97       453
           1       1.00      0.87      0.93       159
           2       1.00      0.92      0.96        66
           3       1.00      0.89      0.94        27
           4       1.00      0.88      0.93        24
           5       1.00      1.00      1.00        10

    accuracy                           0.96       739
   macro avg       0.99      0.93      0.96       739
weighted avg       0.96      0.96      0.96       739



In [25]:
confusion_matrix(y_test, y_pred)

array([[453,   0,   0,   0,   0,   0],
       [ 21, 138,   0,   0,   0,   0],
       [  5,   0,  61,   0,   0,   0],
       [  3,   0,   0,  24,   0,   0],
       [  3,   0,   0,   0,  21,   0],
       [  0,   0,   0,   0,   0,  10]])
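
For a multiclass target, a natural baseline is the accuracy obtained by always predicting the most frequent class. It can be read off the confusion matrix, since each row sum is the number of true samples in that class. The sketch below copies the matrix above by hand; the variable names are hypothetical.

```python
import numpy as np

# Confusion matrix copied from the final model output above.
cm = np.array([[453,   0,   0,   0,   0,   0],
               [ 21, 138,   0,   0,   0,   0],
               [  5,   0,  61,   0,   0,   0],
               [  3,   0,   0,  24,   0,   0],
               [  3,   0,   0,   0,  21,   0],
               [  0,   0,   0,   0,   0,  10]])

support = cm.sum(axis=1)                           # true samples per class
majority_baseline = support.max() / support.sum()  # always predict the largest class
model_accuracy = np.trace(cm) / cm.sum()           # correct predictions / all predictions

print(f"majority-class baseline = {majority_baseline:.3f}")
print(f"model accuracy          = {model_accuracy:.3f}")
```

The gap between the two figures is what the model adds over guessing the majority class.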

In [28]:
feature_import_df = pd.DataFrame(mcc_model.feature_importances_, 
                                   index =X_test.columns,  
                                   columns=['importance']).sort_values('importance', 
                                                                       ascending=False)

In [30]:
feature_import_df[:8]

feature,importance
rgi_main.CARD_Protein_Sequence_22,0.06
rgi_main.CARD_Protein_Sequence_42,0.03
rgi_main.CARD_Protein_Sequence_80,0.03
rgi_main.CARD_Protein_Sequence_2,0.03
rgi_main.CARD_Protein_Sequence_3,0.03
rgi_main.CARD_Protein_Sequence_4,0.03
rgi_main.CARD_Protein_Sequence_67,0.03
rgi_main.CARD_Protein_Sequence_9,0.02


### Countvectorizing for better feature extraction