## BernoulliNB

In this notebook we use the BernoulliNB algorithm from the scikit-learn package (https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html). We are solving a classification problem: predicting whether a PA form will be approved, based on the information provided on the form. Our features are 'correct_diagnosis', 'tried_and_failed', 'contraindication', 'drug' (drug type), 'bin' (payer id), and 'reject_code', all of which are categorical. Our label is 'pa_approved'.
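BernoulliNB models each feature as a binary (0/1) variable, so multi-valued categorical columns such as 'drug', 'bin', and 'reject_code' are usually one-hot encoded first. Whether `cmm_pa_clf.csv` is already encoded is not stated here, so the following is only a sketch of what that step might look like; the toy values (drug names, payer ids, reject codes) are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical toy rows; the values are placeholders, not taken from the real data.
toy = pd.DataFrame({
    "correct_diagnosis": [1, 0, 1],
    "tried_and_failed": [0, 0, 1],
    "contraindication": [0, 1, 0],
    "drug": ["A", "B", "C"],
    "bin": [111111, 222222, 111111],
    "reject_code": [70, 75, 76],
})

# One-hot encode the multi-valued columns so that every feature is 0/1.
encoded = pd.get_dummies(toy, columns=["drug", "bin", "reject_code"], dtype=int)
print(encoded.columns.tolist())
```

Each category becomes its own indicator column (for example `drug_A`), which matches the binary-feature assumption of the Bernoulli model.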

In [1]:
# import packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

In [2]:
#read data
cmm_pa_clf_read = pd.read_csv("../Data/cmm_pa_clf.csv",index_col = 0)
cmm_pa_clf_data = cmm_pa_clf_read.drop(columns = 'pa_approved').copy()
cmm_pa_clf_target = cmm_pa_clf_read['pa_approved'].copy()
X_train, X_test, Y_train, Y_test = train_test_split(
    cmm_pa_clf_data, cmm_pa_clf_target, test_size = 0.2,
    random_state = 10475, shuffle = True,
    stratify = cmm_pa_clf_target)

## Baseline:
We predict that every PA form will be approved. In this case the true positive rate and the false positive rate are both 1, so the ROC-AUC score of the baseline model is 0.5. Since 73.445% of the forms are approved, the baseline accuracy is 73.445% and its error rate is 100 - 73.445 = 26.555%.
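The baseline figures can be checked with a short sketch. It assumes the class counts printed in the training-set classification report later in this notebook (118105 denied, 326655 approved) and rebuilds a label vector from them rather than loading the CSV.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Class counts copied from the training-set classification report in this notebook.
n_denied, n_approved = 118105, 326655
y_true = np.concatenate([np.zeros(n_denied), np.ones(n_approved)])

# Baseline: predict "approved" (1) for every form, i.e. a constant score.
y_pred = np.ones_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_auc = roc_auc_score(y_true, y_pred)  # a constant score cannot rank the classes

print(f"accuracy: {baseline_accuracy:.5f}")
print(f"error:    {1 - baseline_accuracy:.5f}")
print(f"ROC-AUC:  {baseline_auc:.1f}")
```

Any tuned model should beat both numbers to be worth using: accuracy above the approval rate, and ROC-AUC above 0.5.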

## BernoulliNB  

The Bernoulli NB algorithm does not have many parameters, so we will tune alpha (the additive smoothing parameter). We will also compare the performance of the algorithm depending on whether or not we learn the class prior.
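To make the two parameters concrete, here is a minimal NumPy sketch of the quantities they control. It is an illustration of additive smoothing and class priors on a made-up toy dataset, not scikit-learn's implementation; the function name `bernoulli_nb_params` and the toy arrays are hypothetical.

```python
import numpy as np

def bernoulli_nb_params(X, y, alpha=1.0, fit_prior=True):
    """Estimate class priors and smoothed P(x_j = 1 | class) for binary features."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # Number of samples per class, and per-class counts of x_j = 1.
    class_count = np.array([(y == c).sum() for c in classes], dtype=float)
    feature_count = np.array([X[y == c].sum(axis=0) for c in classes])
    # Additive smoothing: alpha is added to the "1" count, and 2 * alpha to the
    # class total because each binary feature has two possible outcomes.
    feature_prob = (feature_count + alpha) / (class_count[:, None] + 2 * alpha)
    if fit_prior:
        prior = class_count / class_count.sum()  # learned from class frequencies
    else:
        prior = np.full(len(classes), 1.0 / len(classes))  # uniform prior
    return classes, prior, feature_prob

# Hypothetical toy data: 4 samples, 2 binary features.
X_toy = [[1, 0], [1, 1], [0, 0], [1, 0]]
y_toy = [1, 1, 0, 1]

classes, prior, feature_prob = bernoulli_nb_params(X_toy, y_toy, alpha=0.1, fit_prior=True)
_, uniform_prior, _ = bernoulli_nb_params(X_toy, y_toy, alpha=0.1, fit_prior=False)

print("learned prior:", prior)
print("uniform prior:", uniform_prior)
print("P(x_j = 1 | class):")
print(feature_prob)
```

A larger alpha pulls every conditional probability toward 0.5, while `fit_prior=False` ignores how often each class occurs in the training data.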

In [3]:
tuned_parameters = {'alpha': [0.1*i for i in range(1,21)],'fit_prior': [True,False]}
scores = ['accuracy','roc_auc']
bnb_clf = BernoulliNB()
skf = StratifiedKFold(n_splits=6,random_state=10475, shuffle=True)
for scr in scores:
    print("# Tuning hyper-parameters for %s" % scr)
    print()
    clf_tun = GridSearchCV(estimator = bnb_clf, param_grid = tuned_parameters, scoring="%s" % scr,cv = skf)
    clf_tun.fit(X_train, Y_train)
    print("Best parameters set found based on the parameter set:")
    print()
    print(clf_tun.best_params_)
    print("Grid scores on parameter set:")
    print()
    means = clf_tun.cv_results_["mean_test_score"]
    stds = clf_tun.cv_results_["std_test_score"]
    for mean, std, params in zip(means, stds, clf_tun.cv_results_["params"]):
        print("%0.3f (+/-%0.03f) for %r \n" % (mean, std * 2, params))
    print()

# Tuning hyper-parameters for accuracy

Best parameters set found based on the parameter set:

{'alpha': 0.1, 'fit_prior': True}
Grid scores on parameter set:

0.803 (+/-0.001) for {'alpha': 0.1, 'fit_prior': True} 

0.731 (+/-0.003) for {'alpha': 0.1, 'fit_prior': False} 

0.803 (+/-0.001) for {'alpha': 0.2, 'fit_prior': True} 

0.731 (+/-0.003) for {'alpha': 0.2, 'fit_prior': False} 

0.803 (+/-0.001) for {'alpha': 0.30000000000000004, 'fit_prior': True} 

0.731 (+/-0.003) for {'alpha': 0.30000000000000004, 'fit_prior': False} 

0.803 (+/-0.001) for {'alpha': 0.4, 'fit_prior': True} 

0.731 (+/-0.003) for {'alpha': 0.4, 'fit_prior': False} 

0.803 (+/-0.001) for {'alpha': 0.5, 'fit_prior': True} 

0.731 (+/-0.003) for {'alpha': 0.5, 'fit_prior': False} 

0.803 (+/-0.001) for {'alpha': 0.6000000000000001, 'fit_prior': True} 

0.731 (+/-0.003) for {'alpha': 0.6000000000000001, 'fit_prior': False} 

0.803 (+/-0.001) for {'alpha': 0.7000000000000001, 'fit_prior': True} 

0.731 (+/-0.003)

In [4]:
def column(matrix, i):
    return [row[i] for row in matrix]

In [5]:
# Refit with the best parameters found above and evaluate on the training set
bnb_tuned = BernoulliNB(alpha = 0.1, fit_prior = True)
bnb_tuned.fit(X_train,Y_train)
Y_pred = bnb_tuned.predict(X_train)
print(classification_report(Y_train, Y_pred))
print('Accuracy score of this set of parameter is: ', accuracy_score(Y_train, Y_pred),'\n')
# Keep only the predicted probability of the positive class (column 1)
Y_pred_proba = bnb_tuned.predict_proba(X_train)
Y_pred_proba = column(Y_pred_proba,1)
print('ROC-AUC score of this set of parameter is: ', roc_auc_score(Y_train, Y_pred_proba),'\n')

              precision    recall  f1-score   support

         0.0       0.76      0.38      0.50    118105
         1.0       0.81      0.96      0.88    326655

    accuracy                           0.80    444760
   macro avg       0.79      0.67      0.69    444760
weighted avg       0.80      0.80      0.78    444760

Accuracy score of this set of parameter is:  0.8031095422250203 

ROC-AUC score of this set of parameter is:  0.8697661832192508 



In [6]:
bnb_tuned_r = BernoulliNB(alpha = 0.1, fit_prior = True)
bnb_tuned_r.fit(X_train,Y_train)
Y_pred = bnb_tuned_r.predict(X_train)
print(classification_report(Y_train, Y_pred))
print('Accuracy score of this set of parameter is: ', accuracy_score(Y_train, Y_pred),'\n')
Y_pred_proba_r = bnb_tuned_r.predict_proba(X_train)
Y_pred_proba_r = column(Y_pred_proba_r,1)
print('ROC-AUC score of this set of parameter is: ', roc_auc_score(Y_train, Y_pred_proba_r),'\n')

              precision    recall  f1-score   support

         0.0       0.76      0.38      0.50    118105
         1.0       0.81      0.96      0.88    326655

    accuracy                           0.80    444760
   macro avg       0.79      0.67      0.69    444760
weighted avg       0.80      0.80      0.78    444760

Accuracy score of this set of parameter is:  0.8031095422250203 

ROC-AUC score of this set of parameter is:  0.8697661832192508 

