***Problem Statement***: One of the challenges for all pharmaceutical companies is to understand the persistency of a drug as per the physician's prescription. To solve this problem, ABC Pharma approached an analytics company to automate this identification process.

***ML Problem***: With the objective of gathering insights into the factors that impact persistency, build a classification model for the given dataset.

***Target Variable***: `persistency_flag`  

***Task***:

- Problem understanding   
- Data Understanding  
- Data Cleaning and Feature engineering  
- Model Development  
- Model Selection  
- Model Evaluation  
- Report the accuracy, precision, and recall for both classes of the target variable  
- Report ROC-AUC as well  
- Deploy the model  
- Explain the challenges and model selection  

## Feature Description

| Bucket                   | Variable                            | Variable Description                                                                                                                                                                                                                                                                         |
|--------------------------|-------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Unique Row Id            | Patient ID                          | Unique ID of each patient                                                                                                                                                                                                                                                                    |
| Target Variable          | Persistency_Flag                    | Flag indicating if a patient was persistent or not                                                                                                                                                                                                                                           |
| Demographics             | Age                                 | Age of the patient during their therapy                                                                                                                                                                                                                                                      |
|                          | Race                                | Race of the patient from the patient table                                                                                                                                                                                                                                                   |
|                          | Region                              | Region of the patient from the patient table                                                                                                                                                                                                                                                 |
|                          | Ethnicity                           | Ethnicity of the patient from the patient table                                                                                                                                                                                                                                              |
|                          | Gender                              | Gender of the patient from the patient table                                                                                                                                                                                                                                                 |
|                          | IDN Indicator                       | Flag indicating patients mapped to IDN                                                                                                                                                                                                                                                       |
| Provider Attributes      | NTM - Physician Specialty           | Specialty of the HCP that prescribed the NTM Rx                                                                                                                                                                                                                                              |
| Clinical Factors         | NTM - T-Score                       | T Score of the patient at the time of the NTM Rx (within 2 years prior from rxdate)                                                                                                                                                                                                          |
|                          | Change in T Score                   | Change in Tscore before starting with any therapy and after receiving therapy  (Worsened, Remained Same, Improved, Unknown)                                                                                                                                                                  |
|                          | NTM - Risk Segment                  | Risk Segment of the patient at the time of the NTM Rx (within 2 years days prior from rxdate)                                                                                                                                                                                                |
|                          | Change in Risk Segment              | Change in Risk Segment before starting with any therapy and after receiving therapy (Worsened, Remained Same, Improved, Unknown)                                                                                                                                                             |
|                          | NTM - Multiple Risk Factors         | Flag indicating if  patient falls under multiple risk category (having more than 1 risk) at the time of the NTM Rx (within 365 days prior from rxdate)                                                                                                                                       |
|                          | NTM - Dexa Scan Frequency           | Number of DEXA scans taken prior to the first NTM Rx date (within 365 days prior from rxdate)                                                                                                                                                                                                |
|                          | NTM - Dexa Scan Recency             | Flag indicating the presence of Dexa Scan before the NTM Rx (within 2 years prior from rxdate or between their first Rx and Switched Rx; whichever is smaller and applicable)                                                                                                                |
|                          | Dexa During Therapy                 | Flag indicating if the patient had a Dexa Scan during their first continuous therapy                                                                                                                                                                                                         |
|                          | NTM - Fragility Fracture Recency    | Flag indicating if the patient had a recent fragility fracture (within 365 days prior from rxdate)                                                                                                                                                                                           |
|                          | Fragility Fracture During Therapy   | Flag indicating if the patient had fragility fracture  during their first continuous therapy                                                                                                                                                                                                 |
|                          | NTM - Glucocorticoid Recency        | Flag indicating usage of Glucocorticoids (>=7.5mg strength) in the one year look-back from the first NTM Rx                                                                                                                                                                                  |
|                          | Glucocorticoid Usage During Therapy | Flag indicating if the patient had a Glucocorticoid usage during the first continuous therapy                                                                                                                                                                                                |
| Disease/Treatment Factor | NTM - Injectable Experience         | Flag indicating any injectable drug usage in the recent 12 months before the NTM OP Rx                                                                                                                                                                                                       |
|                          | NTM - Risk Factors                  | Risk Factors that the patient is falling into. For chronic Risk Factors complete lookback to be applied and for non-chronic Risk Factors, one year lookback from the date of first OP Rx                                                                                                     |
|                          | NTM - Comorbidity                   | Comorbidities are divided into two main categories - Acute and chronic, based on the ICD codes. For chronic disease we are taking complete look back from the first Rx date of NTM therapy and for acute diseases, time period  before the NTM OP Rx with one year lookback has been applied |
|                          | NTM - Concomitancy                  | Concomitant drugs recorded prior to starting with a therapy(within 365 days prior from first rxdate)                                                                                                                                                                                         |
|                          | Adherence                           | Adherence for the therapies                                                                                                                                                                                                                                                                  |

## EDA

In [61]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy
import sklearn

from sklearn.cluster import KMeans

# Import PCA
from sklearn.decomposition import PCA
from scipy.stats import pearsonr

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

from scipy import stats
from sklearn import metrics
from sklearn.metrics import f1_score,confusion_matrix,classification_report
from sklearn.feature_selection import f_classif
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score


from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
import warnings
warnings.filterwarnings('ignore')

rand_state = 25

In [62]:
# bring in cleaned dataset
df = pd.read_csv('data/data.csv', index_col='ptid')

In [63]:
# fix column names for XGBoost
df.rename(columns={'age_bucket_<55':'age_bucket_under_55', 'age_bucket_>75':'age_bucket_over_75'}, inplace=True)

In [64]:
df.head()

ptid,persistency_flag,gender_male,gluco_record_prior_ntm,gluco_record_during_rx,dexa_freq_during_rx,dexa_during_rx,frag_frac_prior_ntm,frag_frac_during_rx,idn_indicator,injectable_experience_during_rx,...,risk_segment_prior_ntm_vlr_lr,tscore_bucket_prior_ntm_>-2.5,risk_segment_during_rx_vlr_lr,risk_segment_during_rx_no_t_risk_during_rx,tscore_bucket_during_rx_>-2.5,tscore_bucket_during_rx_no_t_risk_during_rx,change_t_score_no change,change_t_score_worsened,change_t_score_no_t_risk_during_rx,adherent_flag_non-adherent
P1,1,1,0,0,0,0,0,0,0,1,...,1,1,1,0,0,0,1,0,0,0
P2,0,1,0,0,0,0,0,0,0,1,...,1,1,0,1,0,1,0,0,1,0
P4,0,0,0,1,0,0,0,0,0,1,...,0,1,0,0,0,0,1,0,0,0
P5,0,0,1,1,0,0,0,0,0,1,...,0,0,0,1,0,1,0,0,1,0
P6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,1,0


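A ROC curve plots the true positive rate against the false positive rate, and ROC-AUC summarizes it in one number. When the classes are imbalanced, a precision-recall curve is often a more informative companion, because it focuses on the positive (persistent) class.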
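
As a rough illustration of the two summaries, the sketch below computes ROC-AUC and average precision (a precision-recall summary) with scikit-learn. It runs on synthetic stand-in data with roughly the same class mix as this dataset; the data, the `X_demo`/`y_demo` names, and the choice of logistic regression are assumptions made only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): roughly 62% class 0 and 38% class 1.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.62, 0.38], random_state=25
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, stratify=y_demo, random_state=25
)

clf = LogisticRegression(solver="liblinear", random_state=25).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]  # predicted probability of class 1

roc_auc = roc_auc_score(y_te, scores)           # area under TPR vs. FPR
pr_auc = average_precision_score(y_te, scores)  # summary of precision vs. recall
baseline = y_te.mean()                          # a random ranker scores about this on PR

print(f"ROC-AUC: {roc_auc:.3f}")
print(f"Average precision: {pr_auc:.3f} (chance level is about {baseline:.3f})")
```

A random ranker gets a ROC-AUC near 0.5 regardless of class mix, while its average precision sits near the positive-class prevalence, which is why the two numbers are best read side by side.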

In [65]:
df.dtypes

persistency_flag                               int64
gender_male                                    int64
gluco_record_prior_ntm                         int64
gluco_record_during_rx                         int64
dexa_freq_during_rx                            int64
                                               ...  
tscore_bucket_during_rx_no_t_risk_during_rx    int64
change_t_score_no change                       int64
change_t_score_worsened                        int64
change_t_score_no_t_risk_during_rx             int64
adherent_flag_non-adherent                     int64
Length: 107, dtype: object

### Rebalance Dataset

In [66]:
df.persistency_flag.value_counts(normalize=True)

0    0.622736
1    0.377264
Name: persistency_flag, dtype: float64

The classes are moderately imbalanced (about 62% non-persistent vs. 38% persistent), so we should assess how resampling to a balanced dataset affects model performance.

In [67]:
from sklearn.utils import resample

# Slice the persistent and non-persistent groups
non_persistency = df[df["persistency_flag"] == 0]
persistency     = df[df["persistency_flag"] == 1]

# select the appropriate number of non-persistent subjects
select_non_persistent = resample(non_persistency,
                           replace=False,
                           n_samples=len(persistency),
                           random_state=rand_state)

balanced_df = pd.concat([persistency, select_non_persistent])

In [68]:
balanced_df.persistency_flag.value_counts(normalize=True)

1    0.5
0    0.5
Name: persistency_flag, dtype: float64

### Prepare the Training Data

In [69]:
def prep_data(data):
    X = data.drop("persistency_flag", axis=1)
    y = data["persistency_flag"]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=rand_state)

    return (X, y, X_train, X_test, y_train, y_test)
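
Because `prep_data` splits a dataset that may already have been downsampled, models trained on the balanced data are also tested on balanced data, which makes their scores harder to compare with the unbalanced runs. One possible variation, sketched below, splits first with `stratify=y` and then downsamples only the training rows, so the test set keeps the original class mix. The function name `prep_data_train_only_downsample`, its `target` and `seed` parameters, and the synthetic `demo` frame are hypothetical; the sketch also assumes class 1 is the minority, as it is in this dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample


def prep_data_train_only_downsample(data, target="persistency_flag", seed=25):
    """Hypothetical variant of prep_data: split first, then balance the training rows only."""
    X = data.drop(target, axis=1)
    y = data[target]

    # stratify=y keeps the class proportions the same in the train and test splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, stratify=y, random_state=seed
    )

    # Downsample the majority class (assumed to be 0) within the training split only
    train = X_train.assign(**{target: y_train})
    minority = train[train[target] == 1]
    majority = train[train[target] == 0]
    majority_down = resample(
        majority, replace=False, n_samples=len(minority), random_state=seed
    )
    train_bal = pd.concat([minority, majority_down])

    return train_bal.drop(target, axis=1), X_test, train_bal[target], y_test


# Synthetic stand-in for the real dataframe (an assumption for this example)
X_arr, y_arr = make_classification(
    n_samples=1000, n_features=5, weights=[0.62, 0.38], random_state=25
)
demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
demo["persistency_flag"] = y_arr

X_train_bal, X_test_demo, y_train_bal, y_test_demo = prep_data_train_only_downsample(demo)
print(y_train_bal.value_counts(normalize=True))  # training rows after balancing
print(y_test_demo.value_counts(normalize=True))  # test rows keep the original mix
```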

### Create the lists for training

In [70]:
# datasets
datasets = { "Balanced": balanced_df, "Unbalanced": df}

## Models

In [73]:
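Logit = LogisticRegression(solver='liblinear',random_state = rand_state)
RndFOR = RandomForestClassifier(random_state = rand_state)
KNN = KNeighborsClassifier()
Xgb = XGBClassifier(random_state = rand_state, eval_metric='mlogloss')
# Sgdc, Dtree and Mlp are used in the list below, but their definitions are
# missing from this export; default hyperparameters are assumed here.
Sgdc = SGDClassifier(random_state = rand_state)
Dtree = DecisionTreeClassifier(random_state = rand_state)
Mlp = MLPClassifier(random_state = rand_state)

models = [Logit, RndFOR, KNN, Xgb, Sgdc, Dtree, Mlp]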

In [74]:
rows = []
for data_name, data in datasets.items():

    X, y, X_train, X_test, y_train, y_test = prep_data(data)

    for model in models:
        # fit on the training split, then predict the held-out split
        fitted = model.fit(X_train, y_train)
        y_pred = fitted.predict(X_test)

        model_name = str(model).split("(")[0]
        accuracy = metrics.accuracy_score(y_test, y_pred)
        precision = metrics.precision_score(y_test, y_pred)
        sensitivity = metrics.recall_score(y_test, y_pred)               # recall of class 1
        specificity = metrics.recall_score(y_test, y_pred, pos_label=0)  # recall of class 0
        # named f1 so that the imported f1_score function is not shadowed
        f1 = metrics.f1_score(y_test, y_pred)

        rows.append([model_name, data_name, accuracy, precision, sensitivity, specificity, f1])

results_raw = pd.DataFrame(rows, columns = ["Model","Dataset", "Accuracy", "Precision", "Sensitivity", "Specificity", "F1_score"]).fillna(0)
results_raw = results_raw.sort_values("Accuracy",ascending = False)
results_raw = results_raw.round(decimals = 2)
results_raw

,Model,Dataset,Accuracy,Precision,Sensitivity,Specificity,F1_score
10,XGBClassifier,Unbalanced,0.82,0.82,0.7,0.9,0.75
7,LogisticRegression,Unbalanced,0.81,0.82,0.65,0.91,0.72
1,RandomForestClassifier,Balanced,0.8,0.82,0.77,0.83,0.8
8,RandomForestClassifier,Unbalanced,0.8,0.81,0.64,0.9,0.72
13,MLPClassifier,Unbalanced,0.8,0.79,0.66,0.88,0.72
11,SGDClassifier,Unbalanced,0.79,0.84,0.59,0.93,0.69
0,LogisticRegression,Balanced,0.79,0.81,0.76,0.83,0.79
3,XGBClassifier,Balanced,0.79,0.81,0.76,0.82,0.79
9,KNeighborsClassifier,Unbalanced,0.79,0.83,0.59,0.92,0.69
2,KNeighborsClassifier,Balanced,0.76,0.85,0.64,0.88,0.73


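Among the rows shown, Random Forest on the Balanced dataset appears to be the best performing on F1 score (0.80) and sensitivity (0.77), although XGBoost on the Unbalanced dataset has the highest accuracy (0.82). The two datasets are scored on test sets with different class mixes, so the comparison is only approximate.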
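
The task list also asks for precision and recall for both classes plus ROC-AUC, which the loop above does not report. The sketch below shows one way these could be collected with `classification_report` and `roc_auc_score`. The helper names `positive_class_scores` and `evaluate` are hypothetical, and the data is synthetic; the fallback to `decision_function` is there because some estimators, such as `SGDClassifier` with its default hinge loss, do not expose `predict_proba`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split


def positive_class_scores(model, X):
    """Return a continuous score for class 1 from a fitted binary classifier."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    # Margin-based models expose decision_function instead of probabilities
    return model.decision_function(X)


def evaluate(model, X_test, y_test):
    """Collect accuracy, per-class precision/recall and ROC-AUC for a fitted model."""
    y_pred = model.predict(X_test)
    report = classification_report(y_test, y_pred, output_dict=True, zero_division=0)
    return {
        "accuracy": report["accuracy"],
        "precision_0": report["0"]["precision"],
        "recall_0": report["0"]["recall"],
        "precision_1": report["1"]["precision"],
        "recall_1": report["1"]["recall"],
        "roc_auc": roc_auc_score(y_test, positive_class_scores(model, X_test)),
    }


# Synthetic stand-in data (an assumption for this example)
X_syn, y_syn = make_classification(
    n_samples=1500, n_features=10, weights=[0.62, 0.38], random_state=25
)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_syn, y_syn, test_size=0.33, stratify=y_syn, random_state=25
)

logit_eval = evaluate(
    LogisticRegression(solver="liblinear", random_state=25).fit(X_tr2, y_tr2), X_te2, y_te2
)
sgd_eval = evaluate(SGDClassifier(random_state=25).fit(X_tr2, y_tr2), X_te2, y_te2)

for name, result in [("LogisticRegression", logit_eval), ("SGDClassifier", sgd_eval)]:
    print(name, {key: round(value, 2) for key, value in result.items()})
```

ROC-AUC is computed from continuous scores rather than hard predictions, so passing `y_pred` to `roc_auc_score` would understate what the model can rank.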