# Predicting Drug/Patient Fit with Naive Bayes

In [None]:
import pandas as pd
from sklearn.naive_bayes import ComplementNB, MultinomialNB, GaussianNB, CategoricalNB
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import cross_validate

In [None]:
# Import data from the ETL pipeline.
# Note: the `squeeze` keyword was removed from read_csv in pandas 2.0,
# so the single-column label files are squeezed to a Series explicitly.
x_train = pd.read_csv('../input/drug-classification-etl/Fact_imb_train_features.csv', index_col=0)
y_train = pd.read_csv('../input/drug-classification-etl/Fact_imb_train_labels.csv', index_col=0).squeeze('columns')

x_test = pd.read_csv('../input/drug-classification-etl/Fact_imb_test_features.csv', index_col=0)
y_test = pd.read_csv('../input/drug-classification-etl/Fact_imb_test_labels.csv', index_col=0).squeeze('columns')

In [None]:
def train_baseline_model(train, labels, model):
    """Fit `model` (an estimator class) with its default hyperparameters."""
    mdl = model()
    mdl.fit(train, labels)
    return mdl

# Get a baseline for modeling.

# ComplementNB
cmp_bl_mdl = train_baseline_model(x_train, y_train, ComplementNB)
print(f'ComplementNB = {cmp_bl_mdl.score(x_test,y_test)}') # 68%

# MultinomialNB
mul_bl_md2 = train_baseline_model(x_train, y_train, MultinomialNB)
print(f'MultinomialNB = {mul_bl_md2.score(x_test,y_test)}') # 65%



## Resampling

*** This work was moved to the ETL ***

The classes are very imbalanced. Modeling this data as is would lead to overfitting on the majority class. To combat this, we resample the classes. We will try NearMiss in order to avoid the consequences of significant data loss.

*** Balanced Facts are found in the ETL output. ***
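
To make the imbalance concern concrete, here is a small sketch using made-up labels (the class names and counts are hypothetical, not taken from the ETL output). It shows that a "model" which always predicts the majority class can score well on accuracy while never identifying a minority class.

```python
from collections import Counter

# Hypothetical, made-up label distribution; not the real drug data.
labels = ["drugX"] * 80 + ["drugY"] * 15 + ["drugZ"] * 5

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# A "model" that always predicts the majority class.
predictions = [majority_class] * len(labels)

accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Recall per class: fraction of each class's samples predicted correctly.
recall_per_class = {
    cls: sum(p == t for p, t in zip(predictions, labels) if t == cls) / n
    for cls, n in counts.items()
}

print(f"accuracy = {accuracy:.2f}")
print(recall_per_class)
```

Accuracy alone hides the fact that recall is zero for every minority class, which is the motivation for resampling before training.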

In [None]:
# Read in the balanced Facts.
# Note: the `squeeze` keyword was removed from read_csv in pandas 2.0,
# so the single-column label file is squeezed to a Series explicitly.
x_train_resampled = pd.read_csv('../input/drug-classification-etl/Fact_resampled_train_features.csv', index_col=0)
y_train_resampled = pd.read_csv('../input/drug-classification-etl/Fact_resampled_train_labels.csv', index_col=0).squeeze('columns')

In [None]:
re_comp_md = train_baseline_model(x_train_resampled, y_train_resampled, ComplementNB)
print(f'ComplementNB = {re_comp_md.score(x_test,y_test)}')

re_mul_md = train_baseline_model(x_train_resampled, y_train_resampled, MultinomialNB)
print(f'MultinomialNB = {re_mul_md.score(x_test,y_test)}')

Gau_md = train_baseline_model(x_train_resampled, y_train_resampled, GaussianNB)
print(f'GaussianNB = {Gau_md.score(x_test,y_test)}')

The accuracy of the ComplementNB model dropped slightly, while that of the MultinomialNB model rose. GaussianNB is highest at 89%. Still, the accuracy is not great for any of the Naive Bayes models. Let's run K-fold cross-validation to check that the model is not overfitting.

In [None]:
def cross_val(model, x_train, y_train, folds=10):
    """Run K-fold cross-validation and return per-fold train and test scores."""
    scoring = {'acc': 'accuracy',
               'prec_micro': 'precision_micro',
               'rec_micro': 'recall_micro'}
    scores = cross_validate(model, x_train, y_train, scoring=scoring,
                            cv=folds, return_train_score=True)
    return scores

In [None]:
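# cross_validate refits a clone of the estimator on each fold,
# so the 4 folds here are drawn from the test set.
gas_cv_res = cross_val(Gau_md, x_test, y_test, 4)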

display(gas_cv_res['test_acc'].mean())
display(gas_cv_res['test_prec_micro'].mean())
display(gas_cv_res['test_rec_micro'].mean())
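
One way to read cross-validation output for overfitting is to compare the per-fold training and validation scores, which `return_train_score=True` makes available. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the ETL features (an assumption), and `gap` is a hypothetical name introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data; the real features come from the ETL output.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

scores = cross_validate(GaussianNB(), X, y, scoring={'acc': 'accuracy'},
                        cv=4, return_train_score=True)

train_acc = scores['train_acc'].mean()
val_acc = scores['test_acc'].mean()
gap = train_acc - val_acc  # a large positive gap suggests overfitting

print(f'train = {train_acc:.3f}, validation = {val_acc:.3f}, gap = {gap:.3f}')
```

A training score far above the validation score would point to overfitting; similar scores suggest the model generalizes about as well as it fits.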

In [None]:
gau_preds = Gau_md.predict(x_test)

display(f1_score(y_test, gau_preds, average='micro'))

display(precision_score(y_test, gau_preds, average='micro'))

display(recall_score(y_test, gau_preds, average='micro'))
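
For single-label multiclass predictions, micro-averaged precision, recall, and F1 all reduce to accuracy, because every misclassification counts as exactly one false positive (for the predicted class) and one false negative (for the true class). The sketch below checks this by hand on made-up labels (hypothetical, not the real predictions).

```python
# Made-up labels for illustration only.
y_true = ["A", "A", "A", "B", "B", "C", "C", "C", "C", "A"]
y_pred = ["A", "B", "A", "B", "C", "C", "C", "A", "C", "A"]

classes = sorted(set(y_true) | set(y_pred))

# Pool true positives, false positives and false negatives over all classes.
tp = sum(t == c and p == c for c in classes for t, p in zip(y_true, y_pred))
fp = sum(t != c and p == c for c in classes for t, p in zip(y_true, y_pred))
fn = sum(t == c and p != c for c in classes for t, p in zip(y_true, y_pred))

precision_micro = tp / (tp + fp)
recall_micro = tp / (tp + fn)
f1_micro = 2 * precision_micro * recall_micro / (precision_micro + recall_micro)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(precision_micro, recall_micro, f1_micro, accuracy)
```

The three micro-averaged scores in the cell above should therefore match the plain accuracy; macro or weighted averaging would be needed to expose per-class differences.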

## Conclusions:

Naive Bayes is not the best model for this data. Of the Naive Bayes models, Gaussian Naive Bayes performs best, with a micro-averaged F1 of 89%. It is important to remember that the imbalance in the data set must be accounted for. I believe that this work will serve as a decent baseline for the rest of the modeling.