In this notebook, we use LazyPredict (https://lazypredict.readthedocs.io/en/latest/) to quickly fit our data to many models and get some intuition about which models deserve a proper fit. We do not treat the LazyPredict output as a final result in any sense; the main purpose is to find out which models are suitable for our data set and (maybe) to set a baseline for how well our models can perform.
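
To make the idea of a baseline concrete, here is a minimal sketch on synthetic labels (not the CMM data). It assumes an 80/20 class split chosen purely for illustration and uses scikit-learn's `DummyClassifier`, which is our choice for the example and not something LazyPredict requires. A classifier that always predicts the majority class already reaches high plain accuracy on imbalanced labels, which is one reason to read the Accuracy and Balanced Accuracy columns side by side.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Synthetic labels: 80 positives and 20 negatives (made-up proportions).
y = np.array([1] * 80 + [0] * 20)
# A single constant feature; the dummy model ignores the features anyway.
X = np.zeros((len(y), 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

acc = accuracy_score(y, pred)               # share of correct predictions
bal_acc = balanced_accuracy_score(y, pred)  # mean of the per-class recalls
print(acc, bal_acc)  # 0.8 0.5
```

Any real model should beat both numbers; a model that only matches the majority-class accuracy has learned very little.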

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from lazypredict.Supervised import LazyClassifier
from sklearn.model_selection import train_test_split



Load data, and perform a train-test split.

Notice that bin (payer ID), drug, and reject_code are categorical, so we replace each of them with dummy variables and remove the original columns.

Let us first fit the models without using any date information.
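
As a small illustration of the dummy-variable step, the sketch below applies `pd.get_dummies` to a toy table. The column names mirror the ones in this notebook, but the four rows are invented for the example and are not taken from `Data/CMM.csv`.

```python
import pandas as pd

# Toy stand-in for the claims table; the values are invented for illustration.
toy = pd.DataFrame({
    "drug": ["A", "B", "C", "A"],
    "bin": [417380, 417614, 417740, 999001],
    "reject_code": [70.0, 75.0, 76.0, 70.0],
    "pa_approved": [1.0, 0.0, 1.0, 1.0],
})

# Each listed column is replaced by one indicator column per distinct value,
# named "<column>_<value>"; columns that are not listed pass through unchanged.
encoded = pd.get_dummies(toy, columns=["drug", "bin", "reject_code"])

print(list(encoded.columns))
```

Because the `columns` argument is given, the original `drug`, `bin`, and `reject_code` columns no longer appear in the result, so no separate drop is needed for them.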

In [2]:
# Load the claims data and keep only rows with a non-null dim_pa_id.
cmm = pd.read_csv("Data/CMM.csv")
cmm_pa = cmm[cmm['dim_pa_id'].notna()]
# One-hot encode the categorical columns; get_dummies removes the originals.
cmm_pa = pd.get_dummies(data=cmm_pa, columns=['drug', 'bin', 'reject_code']).copy()
# Drop the date columns, the ID columns, and pharmacy_claim_approved.
cmm_pa = cmm_pa.drop(columns=['date_val', 'calendar_year', 'calendar_month', 'calendar_day',
                              'day_of_week', 'is_weekday', 'is_workday', 'is_holiday',
                              'dim_date_id', 'dim_claim_id', 'pharmacy_claim_approved', 'dim_pa_id'])
cmm_pa_target = cmm_pa.pa_approved.copy()
cmm_pa_data = cmm_pa.drop(columns=['pa_approved'])
# test_size=0.9 leaves a stratified 10% subsample in the *_1 variables.
cmm_pa_data_1, cmm_pa_data_2, cmm_pa_target_1, cmm_pa_target_2 = train_test_split(
    cmm_pa_data, cmm_pa_target, test_size=0.9,
    random_state=10475, shuffle=True, stratify=cmm_pa_target)

Since LazyPredict trains around 30 models and we have around 560,000 rows, a run on the full data may not finish on our computers. We therefore randomly select 10% of the data, stratified by pa_approved, as our new dataset, then perform a train-test split on this small subset and train on it.
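
The subsampling trick can be sketched on synthetic data as follows. The 70/30 class split and the `row_id` feature are assumptions made up for this example; the `train_test_split` arguments are the same ones used in this notebook. Passing `test_size=0.9` with `stratify` keeps a 10% subsample whose class proportions match the full data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 700 ones and 300 zeros (made-up proportions).
labels = pd.Series([1.0] * 700 + [0.0] * 300, name="pa_approved")
features = pd.DataFrame({"row_id": np.arange(len(labels))})

# test_size=0.9 sends 90% of the rows to the second pair of outputs,
# so the first pair is the 10% subsample we keep for quick experiments.
small_X, rest_X, small_y, rest_y = train_test_split(
    features, labels, test_size=0.9,
    random_state=10475, shuffle=True, stratify=labels,
)

# The positive rate in the subsample should match the full data closely.
print(len(small_X), small_y.mean(), labels.mean())
```

The second pair (`rest_X`, `rest_y`) is simply left unused here, just as `cmm_pa_data_2` and `cmm_pa_target_2` are in this notebook.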

In [3]:
len(cmm_pa_data_1)

55595

In [4]:
len(cmm_pa_target_1)

55595

In [5]:
cmm_pa_data_1_train, cmm_pa_data_1_test, cmm_pa_target_1_train, cmm_pa_target_1_test = train_test_split(
    cmm_pa_data_1, cmm_pa_target_1, test_size=0.5,
    random_state=10475, shuffle=True, stratify=cmm_pa_target_1)

In [6]:
cmm_pa_data_1_train.head()

Unnamed: 0,correct_diagnosis,tried_and_failed,contraindication,drug_A,drug_B,drug_C,bin_417380,bin_417614,bin_417740,bin_999001,reject_code_70.0,reject_code_75.0,reject_code_76.0
1200832,1.0,0.0,0.0,1,0,0,0,0,0,1,0,0,1
425910,1.0,1.0,0.0,0,1,0,0,0,1,0,1,0,0
678075,1.0,0.0,0.0,0,1,0,0,1,0,0,0,1,0
195071,1.0,0.0,0.0,0,1,0,0,1,0,0,0,1,0
488947,1.0,0.0,0.0,1,0,0,0,1,0,0,1,0,0


In [7]:
cmm_pa_target_1_train.head()

1200832   1.00
425910    0.00
678075    1.00
195071    1.00
488947    1.00
Name: pa_approved, dtype: float64

In [8]:
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
# fit takes (X_train, X_test, y_train, y_test); the per-model score table is returned first.
models, predictions = clf.fit(cmm_pa_data_1_train, cmm_pa_data_1_test, cmm_pa_target_1_train, cmm_pa_target_1_test)
print(models)

100%|███████████████████████████████████████████| 29/29 [02:13<00:00,  4.62s/it]

                               Accuracy  Balanced Accuracy  ROC AUC  F1 Score  \
Model                                                                           
NearestCentroid                    0.74               0.78     0.78      0.75   
BernoulliNB                        0.78               0.78     0.78      0.79   
PassiveAggressiveClassifier        0.74               0.78     0.78      0.75   
QuadraticDiscriminantAnalysis      0.74               0.78     0.78      0.75   
GaussianNB                         0.74               0.78     0.78      0.75   
SGDClassifier                      0.81               0.74     0.74      0.81   
LinearDiscriminantAnalysis         0.81               0.74     0.74      0.81   
NuSVC                              0.81               0.73     0.73      0.80   
LinearSVC                          0.81               0.73     0.73      0.81   
CalibratedClassifierCV             0.81               0.73     0.73      0.81   
KNeighborsClassifier        




In [33]:
cmm = pd.read_csv("Data/CMM.csv")
cmm_pa_date_inclu = cmm[cmm['dim_pa_id'].notna()]
# This time the calendar columns are kept and one-hot encoded as well.
cmm_pa_date_inclu = pd.get_dummies(data=cmm_pa_date_inclu,
                                   columns=['drug', 'bin', 'reject_code', 'calendar_year',
                                            'calendar_month', 'calendar_day', 'day_of_week']).copy()
cmm_pa_date_inclu = cmm_pa_date_inclu.drop(columns=['dim_date_id', 'dim_claim_id', 'pharmacy_claim_approved', 'dim_pa_id', 'date_val'])
cmm_pa__date_inclu_target = cmm_pa_date_inclu.pa_approved.copy()
cmm_pa_date_inclu_data = cmm_pa_date_inclu.drop(columns=['pa_approved'])
# Stratify on this dataset's own target.
cmm_pa_date_inclu_data_1, cmm_pa_date_inclu_data_2, cmm_pa_date_inclu_target_1, cmm_pa_date_inclu_target_2 = train_test_split(
    cmm_pa_date_inclu_data, cmm_pa__date_inclu_target, test_size=0.9, random_state=10475, shuffle=True,
    stratify=cmm_pa__date_inclu_target)

In [34]:
len(cmm_pa_date_inclu_data_1)

55595

In [35]:
len(cmm_pa_date_inclu_target_1)

55595

In [36]:
cmm_pa_date_inclu_data_1.head()

Unnamed: 0,is_weekday,is_workday,is_holiday,correct_diagnosis,tried_and_failed,contraindication,drug_A,drug_B,drug_C,bin_417380,...,calendar_day_29,calendar_day_30,calendar_day_31,day_of_week_1,day_of_week_2,day_of_week_3,day_of_week_4,day_of_week_5,day_of_week_6,day_of_week_7
450103,1,1,0,1.0,1.0,0.0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
551092,1,1,0,0.0,1.0,0.0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
973673,1,1,0,1.0,0.0,0.0,1,0,0,1,...,0,0,0,0,0,1,0,0,0,0
414837,1,1,0,0.0,0.0,0.0,1,0,0,1,...,0,0,0,0,0,0,0,1,0,0
237989,1,1,0,1.0,1.0,1.0,1,0,0,1,...,0,0,0,0,1,0,0,0,0,0


In [37]:
# Stratify on the target that belongs to this subsample.
cmm_pa_date_inclu_data_1_train, cmm_pa_date_inclu_data_1_test, cmm_pa_date_inclu_target_1_train, cmm_pa_date_inclu_target_1_test = train_test_split(
    cmm_pa_date_inclu_data_1, cmm_pa_date_inclu_target_1, test_size=0.5,
    random_state=10475, shuffle=True, stratify=cmm_pa_date_inclu_target_1)

In [38]:
clf_date_inclu = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf_date_inclu.fit(cmm_pa_date_inclu_data_1_train, cmm_pa_date_inclu_data_1_test,
                                         cmm_pa_date_inclu_target_1_train, cmm_pa_date_inclu_target_1_test)
print(models)

100%|███████████████████████████████████████████| 29/29 [05:39<00:00, 11.69s/it]

                               Accuracy  Balanced Accuracy  ROC AUC  F1 Score  \
Model                                                                           
NearestCentroid                    0.74               0.78     0.78      0.75   
BernoulliNB                        0.77               0.78     0.78      0.78   
GaussianNB                         0.75               0.76     0.76      0.76   
LinearDiscriminantAnalysis         0.81               0.73     0.73      0.81   
CalibratedClassifierCV             0.81               0.72     0.72      0.80   
SGDClassifier                      0.80               0.72     0.72      0.80   
SVC                                0.81               0.72     0.72      0.80   
LinearSVC                          0.81               0.72     0.72      0.80   
AdaBoostClassifier                 0.81               0.72     0.72      0.80   
RidgeClassifier                    0.81               0.72     0.72      0.80   
RidgeClassifierCV           




In [15]:
cmm_pa_date_inclu_data_1.head()

Unnamed: 0,correct_diagnosis,tried_and_failed,contraindication,drug_A,drug_B,drug_C,bin_417380,bin_417614,bin_417740,bin_999001,reject_code_70.0,reject_code_75.0,reject_code_76.0
450103,1.0,1.0,0.0,1,0,0,0,1,0,0,1,0,0
551092,0.0,1.0,0.0,1,0,0,0,1,0,0,1,0,0
973673,1.0,0.0,0.0,1,0,0,1,0,0,0,0,1,0
414837,0.0,0.0,0.0,1,0,0,1,0,0,0,0,1,0
237989,1.0,1.0,1.0,1,0,0,1,0,0,0,0,1,0
