Problem Statement:

One of the challenges for pharmaceutical companies is understanding the persistency of a drug relative to the physician's prescription. To address this, ABC Pharma approached an analytics company to automate the process of identifying persistency.

ML Problem:

With the objective of gathering insights into the factors that affect persistency, build a classification model for the given dataset.

Target Variable: Persistency_Flag

Task:

Problem understanding
Data Understanding
Data Cleaning and Feature engineering
Model Development
Model Selection
Model Evaluation
Report the accuracy, precision, and recall for both classes of the target variable
Report ROC-AUC as well
Deploy the model
Explain the challenges and model selection
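
As a rough illustration of the evaluation items in the task list, the sketch below computes accuracy, per-class precision and recall, and ROC-AUC in plain Python. The labels and scores are made up for illustration (1 = Persistent, 0 = Non-Persistent), and the helper names `class_report` and `roc_auc` are hypothetical; in practice one would likely use `classification_report` and `roc_auc_score` from `sklearn.metrics`.

```python
# Hypothetical labels and model scores, for illustration only
# (1 = Persistent, 0 = Non-Persistent).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1, 0.8, 0.3]

# Turn scores into hard predictions with an assumed 0.5 threshold
y_pred = [int(s >= 0.5) for s in y_score]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_report(y_true, y_pred, label):
    """Return (precision, recall) when `label` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(y_true, y_score):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print("accuracy:", accuracy)
for label in (0, 1):
    precision, recall = class_report(y_true, y_pred, label)
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f}")
print("roc_auc:", roc_auc(y_true, y_score))
```

Reporting precision and recall separately for each class matters here because a model can score well on the majority class while missing most of the other one.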

Feature Description:

Bucket	Variable	Variable Description
Unique Row Id	Patient ID	Unique ID of each patient
Target Variable	Persistency_Flag	Flag indicating if a patient was persistent or not
Demographics	Age	Age of the patient during their therapy
	Race	Race of the patient from the patient table
	Region	Region of the patient from the patient table
	Ethnicity	Ethnicity of the patient from the patient table
	Gender	Gender of the patient from the patient table
	IDN Indicator	Flag indicating patients mapped to IDN
Provider Attributes	NTM - Physician Specialty	Specialty of the HCP that prescribed the NTM Rx
Clinical Factors	NTM - T-Score	T Score of the patient at the time of the NTM Rx (within 2 years prior to rxdate)
	Change in T Score	Change in T Score before starting any therapy and after receiving therapy (Worsened, Remained Same, Improved, Unknown)
	NTM - Risk Segment	Risk Segment of the patient at the time of the NTM Rx (within 2 years prior to rxdate)
	Change in Risk Segment	Change in Risk Segment before starting any therapy and after receiving therapy (Worsened, Remained Same, Improved, Unknown)
	NTM - Multiple Risk Factors	Flag indicating if the patient falls under the multiple risk category (more than 1 risk) at the time of the NTM Rx (within 365 days prior to rxdate)
	NTM - Dexa Scan Frequency	Number of DEXA scans taken prior to the first NTM Rx date (within 365 days prior to rxdate)
	NTM - Dexa Scan Recency	Flag indicating the presence of a Dexa Scan before the NTM Rx (within 2 years prior to rxdate or between their first Rx and Switched Rx, whichever is smaller and applicable)
	Dexa During Therapy	Flag indicating if the patient had a Dexa Scan during their first continuous therapy
	NTM - Fragility Fracture Recency	Flag indicating if the patient had a recent fragility fracture (within 365 days prior to rxdate)
	Fragility Fracture During Therapy	Flag indicating if the patient had a fragility fracture during their first continuous therapy
	NTM - Glucocorticoid Recency	Flag indicating usage of Glucocorticoids (>=7.5mg strength) in the one-year lookback from the first NTM Rx
	Glucocorticoid Usage During Therapy	Flag indicating if the patient had Glucocorticoid usage during the first continuous therapy
Disease/Treatment Factor	NTM - Injectable Experience	Flag indicating any injectable drug usage in the 12 months before the NTM OP Rx
	NTM - Risk Factors	Risk factors that apply to the patient. For chronic risk factors, a complete lookback is applied; for non-chronic risk factors, a one-year lookback from the date of the first OP Rx is applied
	NTM - Comorbidity	Comorbidities are divided into two main categories, acute and chronic, based on ICD codes. For chronic diseases, a complete lookback from the first Rx date of NTM therapy is applied; for acute diseases, a one-year lookback before the NTM OP Rx is applied
	NTM - Concomitancy	Concomitant drugs recorded prior to starting a therapy (within 365 days prior to first rxdate)
	Adherence	Adherence for the therapies

In [31]:
import pandas as pd
import numpy as np

In [32]:
df = pd.read_excel('Healthcare_dataset.xlsx', sheet_name='Dataset')
df.columns = df.columns.str.lower()

In [33]:
df.head()

Unnamed: 0,ptid,persistency_flag,gender,race,ethnicity,region,age_bucket,ntm_speciality,ntm_specialist_flag,ntm_speciality_bucket,...,risk_family_history_of_osteoporosis,risk_low_calcium_intake,risk_vitamin_d_insufficiency,risk_poor_health_frailty,risk_excessive_thinness,risk_hysterectomy_oophorectomy,risk_estrogen_deficiency,risk_immobilization,risk_recurring_falls,count_of_risks
0,P1,Persistent,Male,Caucasian,Not Hispanic,West,>75,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown,...,N,N,N,N,N,N,N,N,N,0
1,P2,Non-Persistent,Male,Asian,Not Hispanic,West,55-65,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown,...,N,N,N,N,N,N,N,N,N,0
2,P3,Non-Persistent,Female,Other/Unknown,Hispanic,Midwest,65-75,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown,...,N,Y,N,N,N,N,N,N,N,2
3,P4,Non-Persistent,Female,Caucasian,Not Hispanic,Midwest,>75,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown,...,N,N,N,N,N,N,N,N,N,1
4,P5,Non-Persistent,Female,Caucasian,Not Hispanic,Midwest,>75,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown,...,N,N,N,N,N,N,N,N,N,1


There are 69 fields, far too many to break out here, but they are binned in the following categories:
- Unique Row Id  
- Target Variable
- Demographics  
- Provider Attributes  
- Clinical Factors  
- Disease/Treatment Factor
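
One quick way to get a feel for how the fields cluster is to count columns by name prefix. The sketch below does this in plain Python on a handful of column names copied from the `df.head()` output above; the helper `prefix_of` is hypothetical, and the text before the first underscore is only a rough stand-in for the buckets listed here.

```python
from collections import Counter

# A few column names copied from the df.head() output; the full frame has 69.
columns = [
    "ptid", "persistency_flag", "gender", "race", "ethnicity", "region",
    "age_bucket", "ntm_speciality", "risk_low_calcium_intake",
    "risk_vitamin_d_insufficiency", "risk_recurring_falls", "count_of_risks",
]


def prefix_of(name):
    # Use the text before the first underscore as a rough group label
    return name.split("_", 1)[0]


prefix_counts = Counter(prefix_of(c) for c in columns)

# Show the most common prefixes first
for prefix, count in prefix_counts.most_common():
    print(prefix, count)
```

On the real frame, the same counter could be built from `df.columns` instead of the hard-coded list.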

In [34]:
df.isna().sum().any()  # True if any column contains missing values

False

In [35]:
# list of variables to one-hot encode (convert from categorical to binary indicators)
categorical = ['persistency_flag', 
                'gender', 
                'gluco_record_prior_ntm', 
                'gluco_record_during_rx',
                'dexa_during_rx',
                'frag_frac_prior_ntm',	
                'frag_frac_during_rx',
                'risk_segment_prior_ntm',	
                'tscore_bucket_prior_ntm',
                'adherent_flag',
                'idn_indicator',
                'injectable_experience_during_rx',
                'comorb_encounter_for_screening_for_malignant_neoplasms',
                'comorb_encounter_for_immunization',
                #'comorb_encntr_for_general_exam_w_o_complaint_susp_or_reprtd_dx',
                'comorb_vitamin_d_deficiency',
                'comorb_other_joint_disorder_not_elsewhere_classified',
                'comorb_encntr_for_oth_sp_exam_w_o_complaint_suspected_or_reprtd_dx',
                'comorb_long_term_current_drug_therapy',
                'comorb_dorsalgia',
                'comorb_personal_history_of_other_diseases_and_conditions',
                'comorb_other_disorders_of_bone_density_and_structure',
                'comorb_disorders_of_lipoprotein_metabolism_and_other_lipidemias',
                'comorb_osteoporosis_without_current_pathological_fracture',
                'comorb_personal_history_of_malignant_neoplasm',
                'comorb_gastro_esophageal_reflux_disease',
                'concom_cholesterol_and_triglyceride_regulating_preparations',
                'concom_narcotics',
                'concom_systemic_corticosteroids_plain',
                'concom_anti_depressants_and_mood_stabilisers',
                'concom_fluoroquinolones',	
                'concom_cephalosporins',
                'concom_macrolides_and_similar_types',
                'concom_broad_spectrum_penicillins',
                'concom_anaesthetics_general',
                'concom_viral_vaccines',
                'risk_type_1_insulin_dependent_diabetes',
                'risk_osteogenesis_imperfecta',
                'risk_rheumatoid_arthritis',
                'risk_untreated_chronic_hyperthyroidism',
                'risk_untreated_chronic_hypogonadism',
                'risk_untreated_early_menopause',
                'risk_patient_parent_fractured_their_hip',
                'risk_smoking_tobacco',
                'risk_chronic_malnutrition_or_malabsorption',
                'risk_chronic_liver_disease',
                'risk_family_history_of_osteoporosis',
                'risk_low_calcium_intake',
                'risk_vitamin_d_insufficiency',
                'risk_poor_health_frailty', 
                'risk_excessive_thinness',  
                'risk_hysterectomy_oophorectomy',
                'risk_estrogen_deficiency', 
                'risk_immobilization', 
                'risk_recurring_falls']
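
To make the encoding step concrete, here is a minimal sketch of what `pd.get_dummies` with `drop_first=True` does to two-level columns. The toy rows are made up; only the column names mirror entries in `categorical`. Passing `dtype=int` is an assumption added here so the result prints as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy frame with made-up rows; column names mirror two entries of `categorical`.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "risk_recurring_falls": ["N", "Y", "N"],
})

# drop_first=True drops the first level (in sorted order) of each column,
# so a two-level column becomes a single 0/1 indicator.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each resulting column name is the original name plus the retained level (for example `gender_Male`), which is why a column-name clean-up step is useful afterwards.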


In [36]:
df = pd.get_dummies(df[categorical], drop_first=True)

# Now clean up column names
df.columns = df.columns.str.lower()
# Remove only a literal trailing '_y'. str.rstrip('_y') would strip any trailing
# '_' or 'y' characters and truncate names such as 'risk_poor_health_frailty'.
df.columns = df.columns.str.replace(r'_y$', '', regex=True)
df = df.rename(columns={'persistency_flag_persistent': 'persistency_flag'})

In [37]:
df

Unnamed: 0,persistency_flag,gender_male,gluco_record_prior_ntm,gluco_record_during_rx,dexa_during_rx,frag_frac_prior_ntm,frag_frac_during_rx,risk_segment_prior_ntm_vlr_lr,tscore_bucket_prior_ntm_>-2.5,adherent_flag_non-adherent,...,risk_chronic_liver_disease,risk_family_history_of_osteoporosis,risk_low_calcium_intake,risk_vitamin_d_insufficiency,risk_poor_health_frailty,risk_excessive_thinness,risk_hysterectomy_oophorectomy,risk_estrogen_deficiency,risk_immobilization,risk_recurring_falls
0,1,1,0,0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3,0,0,0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3419,1,0,0,0,0,0,0,1,1,0,...,0,0,0,1,0,0,0,0,0,0
3420,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3421,1,0,0,0,1,0,0,1,1,0,...,0,0,0,1,0,0,0,0,0,0
3422,0,0,0,0,0,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
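
A note on cleaning up the `_y` dummy suffix: `str.rstrip` treats its argument as a set of characters to strip, not as a literal suffix, so a name whose stem ends in `y` can lose its final letter. The plain-Python sketch below shows the difference on one column name from this dataset, using an anchored regular expression as one possible alternative.

```python
import re

name = "risk_poor_health_frailty_y"

# rstrip removes every trailing character that is '_' or 'y',
# so it eats the suffix and then the final 'y' of "frailty".
stripped = name.rstrip("_y")

# An anchored pattern removes only a literal "_y" at the end of the string.
fixed = re.sub(r"_y$", "", name)

print(stripped)  # risk_poor_health_frailt
print(fixed)     # risk_poor_health_frailty
```

On Python 3.9 or later, `name.removesuffix("_y")` gives the same result as the regular expression.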
