Structured Data Assignement
Problem Statement
The dataset in question contains a comprehensive collection of electronic health records belonging to patients who have been diagnosed with a specific disease. These health records comprise a detailed log of every aspect of the patients' medical history, including all diagnoses, symptoms, prescribed drug treatments, and medical tests that they have undergone. Each row represents a healthcare record/medical event for a patient and it includes a timestamp for each entry/event, thereby allowing for a chronological view of the patient's medical history.

The Data has mainly three columns

Patient-Uid - Unique Alphanumeric Identifier for a patient
Date - Date when patient encountered the event.
Incident - This columns describes which event occurred on the day.
Problem
The development of drugs is critical in providing therapeutic options for patients suffering from chronic and terminal illnesses. “Target Drug”, in particular, is designed to enhance the patient's health and well-being without causing dependence on other medications that could potentially lead to severe and life-threatening side effects. These drugs are specifically tailored to treat a particular disease or condition, offering a more focused and effective approach to treatment, while minimising the risk of harmful reactions.

Objective
To develop a predictive model which will predict whether a patient will be eligible*** for “Target Drug” or not in next 30 days. Knowing if the patient is eligible or not will help physician treating the patient make informed decision on the which treatments to give.

A patient is considered eligible for a particular drug when they have taken their first prescription for that drug.

Importing Libraries

In [1]:
import pandas as pd
from datetime import datetime,timedelta
import numpy as np

In [2]:
df = pd.read_parquet('train.parquet')

In [3]:
df

Unnamed: 0,Patient-Uid,Date,Incident
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-03-09,PRIMARY_DIAGNOSIS
1,a0dc93f2-1c7c-11ec-9cd2-16262ee38c7f,2015-05-16,PRIMARY_DIAGNOSIS
3,a0dc94c6-1c7c-11ec-a3a0-16262ee38c7f,2018-01-30,SYMPTOM_TYPE_0
4,a0dc950b-1c7c-11ec-b6ec-16262ee38c7f,2015-04-22,DRUG_TYPE_0
8,a0dc9543-1c7c-11ec-bb63-16262ee38c7f,2016-06-18,DRUG_TYPE_1
...,...,...,...
29080886,a0ee9f75-1c7c-11ec-94c7-16262ee38c7f,2018-07-06,DRUG_TYPE_6
29080897,a0ee1284-1c7c-11ec-a3d5-16262ee38c7f,2017-12-29,DRUG_TYPE_6
29080900,a0ee9b26-1c7c-11ec-8a40-16262ee38c7f,2018-10-18,DRUG_TYPE_10
29080903,a0ee1a92-1c7c-11ec-8341-16262ee38c7f,2015-09-18,DRUG_TYPE_6


In [4]:
df.duplicated().sum()

35571

In [5]:
df.drop_duplicates(inplace=True)

In [6]:
df['Incident'].value_counts()

DRUG_TYPE_6          549616
DRUG_TYPE_1          484565
PRIMARY_DIAGNOSIS    424879
DRUG_TYPE_0          298881
DRUG_TYPE_2          256203
DRUG_TYPE_7          251239
DRUG_TYPE_8          158706
DRUG_TYPE_3          126615
TEST_TYPE_1           96810
TARGET DRUG           67218
DRUG_TYPE_9           66894
DRUG_TYPE_5           55940
SYMPTOM_TYPE_0        46078
DRUG_TYPE_11          45419
SYMPTOM_TYPE_6        32066
TEST_TYPE_0           27570
SYMPTOM_TYPE_7        22019
DRUG_TYPE_10          20911
DRUG_TYPE_14          17306
DRUG_TYPE_13          12321
DRUG_TYPE_12           9540
SYMPTOM_TYPE_14        8927
SYMPTOM_TYPE_1         8608
SYMPTOM_TYPE_2         8168
TEST_TYPE_3            8115
SYMPTOM_TYPE_5         7583
SYMPTOM_TYPE_8         7430
TEST_TYPE_2            7021
SYMPTOM_TYPE_15        6295
SYMPTOM_TYPE_10        6005
SYMPTOM_TYPE_29        5910
SYMPTOM_TYPE_16        4940
DRUG_TYPE_15           4906
SYMPTOM_TYPE_9         4885
DRUG_TYPE_4            4566
SYMPTOM_TYPE_4      

In [7]:
df['Incident'].nunique()

57

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3185297 entries, 0 to 29080911
Data columns (total 3 columns):
 #   Column       Dtype         
---  ------       -----         
 0   Patient-Uid  object        
 1   Date         datetime64[ns]
 2   Incident     object        
dtypes: datetime64[ns](1), object(2)
memory usage: 97.2+ MB


In [9]:
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d').dt.date


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3185297 entries, 0 to 29080911
Data columns (total 3 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   Patient-Uid  object
 1   Date         object
 2   Incident     object
dtypes: object(3)
memory usage: 97.2+ MB


In [11]:
# grouping the data based on petirnt id
patient_data = df.groupby('Patient-Uid').apply(lambda x: x[['Date', 'Incident']].values.tolist())

In [12]:
patient_data

Patient-Uid
a0db1e73-1c7c-11ec-ae39-16262ee38c7f    [[2019-03-09, PRIMARY_DIAGNOSIS], [2020-08-04,...
a0dc93f2-1c7c-11ec-9cd2-16262ee38c7f    [[2015-05-16, PRIMARY_DIAGNOSIS], [2016-03-23,...
a0dc94c6-1c7c-11ec-a3a0-16262ee38c7f    [[2018-01-30, SYMPTOM_TYPE_0], [2018-03-23, DR...
a0dc950b-1c7c-11ec-b6ec-16262ee38c7f    [[2015-04-22, DRUG_TYPE_0], [2016-06-15, DRUG_...
a0dc9543-1c7c-11ec-bb63-16262ee38c7f    [[2016-06-18, DRUG_TYPE_1], [2016-05-24, DRUG_...
                                                              ...                        
a0f0d4c5-1c7c-11ec-bfec-16262ee38c7f    [[2020-05-20, PRIMARY_DIAGNOSIS], [2020-03-09,...
a0f0d4f4-1c7c-11ec-b144-16262ee38c7f    [[2020-07-18, PRIMARY_DIAGNOSIS], [2020-07-18,...
a0f0d523-1c7c-11ec-89d2-16262ee38c7f    [[2020-05-21, PRIMARY_DIAGNOSIS], [2020-06-03,...
a0f0d553-1c7c-11ec-a70a-16262ee38c7f    [[2015-05-22, SYMPTOM_TYPE_7], [2020-07-21, PR...
a0f0d582-1c7c-11ec-a6c1-16262ee38c7f    [[2020-06-05, PRIMARY_DIAGNOSIS], [2016-09-28,..

In [13]:
patient_data_df = patient_data.reset_index().rename(columns={0: 'PatientDetails'})

In [14]:
patient_data_df

Unnamed: 0,Patient-Uid,PatientDetails
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,"[[2019-03-09, PRIMARY_DIAGNOSIS], [2020-08-04,..."
1,a0dc93f2-1c7c-11ec-9cd2-16262ee38c7f,"[[2015-05-16, PRIMARY_DIAGNOSIS], [2016-03-23,..."
2,a0dc94c6-1c7c-11ec-a3a0-16262ee38c7f,"[[2018-01-30, SYMPTOM_TYPE_0], [2018-03-23, DR..."
3,a0dc950b-1c7c-11ec-b6ec-16262ee38c7f,"[[2015-04-22, DRUG_TYPE_0], [2016-06-15, DRUG_..."
4,a0dc9543-1c7c-11ec-bb63-16262ee38c7f,"[[2016-06-18, DRUG_TYPE_1], [2016-05-24, DRUG_..."
...,...,...
27028,a0f0d4c5-1c7c-11ec-bfec-16262ee38c7f,"[[2020-05-20, PRIMARY_DIAGNOSIS], [2020-03-09,..."
27029,a0f0d4f4-1c7c-11ec-b144-16262ee38c7f,"[[2020-07-18, PRIMARY_DIAGNOSIS], [2020-07-18,..."
27030,a0f0d523-1c7c-11ec-89d2-16262ee38c7f,"[[2020-05-21, PRIMARY_DIAGNOSIS], [2020-06-03,..."
27031,a0f0d553-1c7c-11ec-a70a-16262ee38c7f,"[[2015-05-22, SYMPTOM_TYPE_7], [2020-07-21, PR..."


In [15]:
def get_date(item):
    return item[0]
def sort_date(incidents):
    sorted_date = sorted(incidents,key=lambda x:get_date(x))
    return sorted_date
patient_data_df['PatientDetails'] = patient_data_df['PatientDetails'].apply(sort_date)

In [16]:
patient_data_df

Unnamed: 0,Patient-Uid,PatientDetails
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,"[[2015-09-22, DRUG_TYPE_7], [2018-04-13, SYMPT..."
1,a0dc93f2-1c7c-11ec-9cd2-16262ee38c7f,"[[2015-04-10, DRUG_TYPE_0], [2015-04-12, DRUG_..."
2,a0dc94c6-1c7c-11ec-a3a0-16262ee38c7f,"[[2015-04-08, DRUG_TYPE_0], [2015-04-08, PRIMA..."
3,a0dc950b-1c7c-11ec-b6ec-16262ee38c7f,"[[2015-04-22, DRUG_TYPE_0], [2015-04-22, DRUG_..."
4,a0dc9543-1c7c-11ec-bb63-16262ee38c7f,"[[2015-04-14, DRUG_TYPE_1], [2015-04-24, TEST_..."
...,...,...
27028,a0f0d4c5-1c7c-11ec-bfec-16262ee38c7f,"[[2015-04-18, DRUG_TYPE_6], [2015-05-07, DRUG_..."
27029,a0f0d4f4-1c7c-11ec-b144-16262ee38c7f,"[[2015-07-01, DRUG_TYPE_6], [2015-07-01, DRUG_..."
27030,a0f0d523-1c7c-11ec-89d2-16262ee38c7f,"[[2015-04-07, DRUG_TYPE_6], [2015-04-07, DRUG_..."
27031,a0f0d553-1c7c-11ec-a70a-16262ee38c7f,"[[2015-05-17, DRUG_TYPE_9], [2015-05-22, SYMPT..."


In [17]:
# Function to create a new column Target for showing whether the patient is eligible for target drug or not
def check_target_drug(incidents):
    for incident in incidents:
        if incident[1] == 'TARGET DRUG':
            return 1
    return 0

# Add 'target' column
patient_data_df['Target'] = patient_data_df['PatientDetails'].apply(check_target_drug)


In [18]:
patient_data_df

Unnamed: 0,Patient-Uid,PatientDetails,Target
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,"[[2015-09-22, DRUG_TYPE_7], [2018-04-13, SYMPT...",0
1,a0dc93f2-1c7c-11ec-9cd2-16262ee38c7f,"[[2015-04-10, DRUG_TYPE_0], [2015-04-12, DRUG_...",0
2,a0dc94c6-1c7c-11ec-a3a0-16262ee38c7f,"[[2015-04-08, DRUG_TYPE_0], [2015-04-08, PRIMA...",0
3,a0dc950b-1c7c-11ec-b6ec-16262ee38c7f,"[[2015-04-22, DRUG_TYPE_0], [2015-04-22, DRUG_...",0
4,a0dc9543-1c7c-11ec-bb63-16262ee38c7f,"[[2015-04-14, DRUG_TYPE_1], [2015-04-24, TEST_...",0
...,...,...,...
27028,a0f0d4c5-1c7c-11ec-bfec-16262ee38c7f,"[[2015-04-18, DRUG_TYPE_6], [2015-05-07, DRUG_...",1
27029,a0f0d4f4-1c7c-11ec-b144-16262ee38c7f,"[[2015-07-01, DRUG_TYPE_6], [2015-07-01, DRUG_...",1
27030,a0f0d523-1c7c-11ec-89d2-16262ee38c7f,"[[2015-04-07, DRUG_TYPE_6], [2015-04-07, DRUG_...",1
27031,a0f0d553-1c7c-11ec-a70a-16262ee38c7f,"[[2015-05-17, DRUG_TYPE_9], [2015-05-22, SYMPT...",1


In [19]:
patient_data_df

Unnamed: 0,Patient-Uid,PatientDetails,Target
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,"[[2015-09-22, DRUG_TYPE_7], [2018-04-13, SYMPT...",0
1,a0dc93f2-1c7c-11ec-9cd2-16262ee38c7f,"[[2015-04-10, DRUG_TYPE_0], [2015-04-12, DRUG_...",0
2,a0dc94c6-1c7c-11ec-a3a0-16262ee38c7f,"[[2015-04-08, DRUG_TYPE_0], [2015-04-08, PRIMA...",0
3,a0dc950b-1c7c-11ec-b6ec-16262ee38c7f,"[[2015-04-22, DRUG_TYPE_0], [2015-04-22, DRUG_...",0
4,a0dc9543-1c7c-11ec-bb63-16262ee38c7f,"[[2015-04-14, DRUG_TYPE_1], [2015-04-24, TEST_...",0
...,...,...,...
27028,a0f0d4c5-1c7c-11ec-bfec-16262ee38c7f,"[[2015-04-18, DRUG_TYPE_6], [2015-05-07, DRUG_...",1
27029,a0f0d4f4-1c7c-11ec-b144-16262ee38c7f,"[[2015-07-01, DRUG_TYPE_6], [2015-07-01, DRUG_...",1
27030,a0f0d523-1c7c-11ec-89d2-16262ee38c7f,"[[2015-04-07, DRUG_TYPE_6], [2015-04-07, DRUG_...",1
27031,a0f0d553-1c7c-11ec-a70a-16262ee38c7f,"[[2015-05-17, DRUG_TYPE_9], [2015-05-22, SYMPT...",1


In [20]:
type(patient_data_df['PatientDetails'][0][0][0])

datetime.date

In [21]:
prediction_date = (pd.to_datetime('today') + pd.DateOffset(days=30)).date()
def last_incident_0(incidents):
    diff = (prediction_date - incidents[-1][0])
    return diff
def last_incident_1(incidents):
    index=0
    for i in range(len(incidents)):
        if incidents[i][1]=='TARGET DRUG':
            index = i-1
    diff = (prediction_date - incidents[index][0])
    return diff
        
        
    

In [22]:
patient_data_df['Days'] = patient_data_df.apply(lambda row: last_incident_0(row['PatientDetails']) if row['Target'] == 0 else last_incident_1(row['PatientDetails']), axis=1)


In [23]:
'''
For patients eligible for target drug only the data 30 days before the day the target drug is required for building model
since we have to predict 30 days prior to giving target drug. Function target_1 does this
Function target_0 gets all the incidents happened for patients not eligible for target drug
'''

def target_0(incidents):
    incidents = [i[1] for i in incidents]
    return incidents
def target_1(incidents):
    for incident in incidents:
        target_date = datetime.today().date()
        if incident[1]=='TARGET DRUG':
            target_date = incident[0]
            break
    incidents = [incident[1] for incident in incidents if incident[0]<=target_date]
    return incidents

In [24]:
patient_data_df['PatientDetails'] = patient_data_df.apply(lambda row: target_0(row['PatientDetails']) if row['Target'] == 0 else target_1(row['PatientDetails']), axis=1)


In [25]:
patient_data_df['Days'] = patient_data_df['Days'].dt.days

In [26]:
patient_data_df

Unnamed: 0,Patient-Uid,PatientDetails,Target,Days
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,"[DRUG_TYPE_7, SYMPTOM_TYPE_2, DRUG_TYPE_7, SYM...",0,1202
1,a0dc93f2-1c7c-11ec-9cd2-16262ee38c7f,"[DRUG_TYPE_0, DRUG_TYPE_2, DRUG_TYPE_0, PRIMAR...",0,1335
2,a0dc94c6-1c7c-11ec-a3a0-16262ee38c7f,"[DRUG_TYPE_0, PRIMARY_DIAGNOSIS, DRUG_TYPE_7, ...",0,1694
3,a0dc950b-1c7c-11ec-b6ec-16262ee38c7f,"[DRUG_TYPE_0, DRUG_TYPE_7, DRUG_TYPE_2, PRIMAR...",0,1232
4,a0dc9543-1c7c-11ec-bb63-16262ee38c7f,"[DRUG_TYPE_1, TEST_TYPE_1, SYMPTOM_TYPE_8, DRU...",0,1199
...,...,...,...,...
27028,a0f0d4c5-1c7c-11ec-bfec-16262ee38c7f,"[DRUG_TYPE_6, DRUG_TYPE_0, DRUG_TYPE_6, DRUG_T...",1,1213
27029,a0f0d4f4-1c7c-11ec-b144-16262ee38c7f,"[DRUG_TYPE_6, DRUG_TYPE_8, DRUG_TYPE_1, DRUG_T...",1,1224
27030,a0f0d523-1c7c-11ec-89d2-16262ee38c7f,"[DRUG_TYPE_6, DRUG_TYPE_1, DRUG_TYPE_9, DRUG_T...",1,1244
27031,a0f0d553-1c7c-11ec-a70a-16262ee38c7f,"[DRUG_TYPE_9, SYMPTOM_TYPE_7, DRUG_TYPE_2, DRU...",1,1221


In [27]:
new_df = patient_data_df.explode('PatientDetails')

In [28]:
new_df

Unnamed: 0,Patient-Uid,PatientDetails,Target,Days
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,DRUG_TYPE_7,0,1202
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,SYMPTOM_TYPE_2,0,1202
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,DRUG_TYPE_7,0,1202
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,SYMPTOM_TYPE_0,0,1202
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,DRUG_TYPE_9,0,1202
...,...,...,...,...
27032,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,DRUG_TYPE_1,1,1232
27032,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,PRIMARY_DIAGNOSIS,1,1232
27032,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,TEST_TYPE_1,1,1232
27032,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,TEST_TYPE_2,1,1232


In [29]:
inc_count = new_df.groupby(['Patient-Uid','PatientDetails']).size().reset_index(name='count')

In [30]:
inc_count

Unnamed: 0,Patient-Uid,PatientDetails,count
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,DRUG_TYPE_0,29
1,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,DRUG_TYPE_11,1
2,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,DRUG_TYPE_2,11
3,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,DRUG_TYPE_6,10
4,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,DRUG_TYPE_7,6
...,...,...,...
309593,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,DRUG_TYPE_7,1
309594,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,PRIMARY_DIAGNOSIS,1
309595,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,TARGET DRUG,1
309596,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,TEST_TYPE_1,4


In [31]:
patient_df = inc_count.pivot(index = 'Patient-Uid',columns = 'PatientDetails',values='count').fillna(0)

In [32]:
patient_df = patient_df.reset_index()

In [33]:
patient_df.columns.name=None

In [34]:
patient_df.shape

(27033, 58)

In [35]:
patient_df['Target'] = patient_data_df['Target']

In [36]:
patient_df.drop(['TARGET DRUG'],axis=1,inplace=True)

In [37]:
patient_df.drop('Patient-Uid',axis=1,inplace=True)

In [38]:
patient_df

Unnamed: 0,DRUG_TYPE_0,DRUG_TYPE_1,DRUG_TYPE_10,DRUG_TYPE_11,DRUG_TYPE_12,DRUG_TYPE_13,DRUG_TYPE_14,DRUG_TYPE_15,DRUG_TYPE_16,DRUG_TYPE_17,...,SYMPTOM_TYPE_7,SYMPTOM_TYPE_8,SYMPTOM_TYPE_9,TEST_TYPE_0,TEST_TYPE_1,TEST_TYPE_2,TEST_TYPE_3,TEST_TYPE_4,TEST_TYPE_5,Target
0,29.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,10.0,2.0,0.0,0.0,0.0,0.0,0
1,8.0,27.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,4.0,0.0,0.0,0.0,0.0,0
2,6.0,7.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,2.0,0.0,0.0,0.0,0.0,0
3,15.0,42.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
4,2.0,45.0,0.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,6.0,0.0,9.0,27.0,1.0,0.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27028,41.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
27029,16.0,22.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
27030,7.0,48.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,1
27031,7.0,44.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1


In [39]:
patient_df

Unnamed: 0,DRUG_TYPE_0,DRUG_TYPE_1,DRUG_TYPE_10,DRUG_TYPE_11,DRUG_TYPE_12,DRUG_TYPE_13,DRUG_TYPE_14,DRUG_TYPE_15,DRUG_TYPE_16,DRUG_TYPE_17,...,SYMPTOM_TYPE_7,SYMPTOM_TYPE_8,SYMPTOM_TYPE_9,TEST_TYPE_0,TEST_TYPE_1,TEST_TYPE_2,TEST_TYPE_3,TEST_TYPE_4,TEST_TYPE_5,Target
0,29.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,10.0,2.0,0.0,0.0,0.0,0.0,0
1,8.0,27.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,4.0,0.0,0.0,0.0,0.0,0
2,6.0,7.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,2.0,0.0,0.0,0.0,0.0,0
3,15.0,42.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
4,2.0,45.0,0.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,6.0,0.0,9.0,27.0,1.0,0.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27028,41.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
27029,16.0,22.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
27030,7.0,48.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,1
27031,7.0,44.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1


In [40]:
patient_df['Target'].value_counts(normalize=True)

0    0.653239
1    0.346761
Name: Target, dtype: float64

In [41]:
X = patient_df.drop('Target',axis=1)
y = patient_df['Target']

In [44]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X = ss.fit_transform(X)

In [45]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=4)

In [46]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train,y_train)
lr.score(X_test,y_test)

0.7595561035758323

In [47]:
from sklearn.metrics import f1_score
y_pred = lr.predict(X_test)
f1_score(y_test,y_pred)

0.5896464646464646

In [48]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
li = [5,6,7,8,9,10,12,13,15,20]
for depth in li:
    dt=DecisionTreeClassifier(max_depth=depth)
    dt.fit(X_train,y_train)
    train_score = f1_score(dt.predict(X_train),y_train)
    cross_val = np.mean(cross_val_score(dt,X_train,y_train,cv=10,scoring='f1'))
    print("Depth :",depth,"Train score :",train_score,"Cross val score :",cross_val)
    

Depth : 5 Train score : 0.6596820279873983 Cross val score : 0.6385386178061114
Depth : 6 Train score : 0.6624747709982922 Cross val score : 0.6273114312695374
Depth : 7 Train score : 0.678599466164233 Cross val score : 0.6380854907593096
Depth : 8 Train score : 0.6991053209857165 Cross val score : 0.6398102055834097
Depth : 9 Train score : 0.7155729208197114 Cross val score : 0.6431134825990099
Depth : 10 Train score : 0.747786923435316 Cross val score : 0.6377284625312523
Depth : 12 Train score : 0.8094437652811736 Cross val score : 0.6352394988655881
Depth : 13 Train score : 0.832095326524296 Cross val score : 0.6344969837647042
Depth : 15 Train score : 0.8855044453034403 Cross val score : 0.6297332080796242
Depth : 20 Train score : 0.9671869052150743 Cross val score : 0.6148768937082402


In [49]:
dt = DecisionTreeClassifier(max_depth = 15)
dt.fit(X_train,y_train)
f1_score(dt.predict(X_test),y_test)

0.6273235878000362

In [50]:
from sklearn.ensemble import RandomForestClassifier
li = [1,2,3,4,5,6,7,8,9,10,13,14,15]
for depth in li:
    rf = RandomForestClassifier(max_depth = depth,n_estimators=100,max_features = 'sqrt')
    rf.fit(X_train,y_train)
    train_score = f1_score(dt.predict(X_train),y_train)
    cross_val = np.mean(cross_val_score(rf,X_train,y_train,cv=10,scoring='f1'))
    print("Depth :",depth,"Train score :",train_score,"Cross val score :",cross_val)

Depth : 1 Train score : 0.885418277408381 Cross val score : 0.0
Depth : 2 Train score : 0.885418277408381 Cross val score : 0.04051329387114792
Depth : 3 Train score : 0.885418277408381 Cross val score : 0.22137262754967027
Depth : 4 Train score : 0.885418277408381 Cross val score : 0.3414000793980592
Depth : 5 Train score : 0.885418277408381 Cross val score : 0.43022703952844366
Depth : 6 Train score : 0.885418277408381 Cross val score : 0.50303664739185
Depth : 7 Train score : 0.885418277408381 Cross val score : 0.5517397139016953
Depth : 8 Train score : 0.885418277408381 Cross val score : 0.5896384324327342
Depth : 9 Train score : 0.885418277408381 Cross val score : 0.621750757686703
Depth : 10 Train score : 0.885418277408381 Cross val score : 0.638322118587833
Depth : 13 Train score : 0.885418277408381 Cross val score : 0.668790228096408
Depth : 14 Train score : 0.885418277408381 Cross val score : 0.6761016023958073
Depth : 15 Train score : 0.885418277408381 Cross val score : 0.681

In [51]:
rf = RandomForestClassifier(max_depth = 15,n_estimators=100,max_features = 'sqrt')
rf.fit(X_train,y_train)
f1_score(rf.predict(X_test),y_test)

0.686637761135199

In [52]:
pip install xgboost




In [53]:
import xgboost as xgb
for lr in [0.01,0.02,0.03,0.04,0.05,0.1,0.11,0.12,0.13,0.14,0.15,0.2,0.5,0.7,1]:
  model = xgb.XGBClassifier(learning_rate = lr, n_estimators=100, verbosity = 0)
  model.fit(X_train, y_train)
  print("Learning rate : ", lr," Train score : ", f1_score(model.predict(X_train),y_train)," Cross-Val score : ", np.mean(cross_val_score(model, X_train, y_train, cv=10)))

Learning rate :  0.01  Train score :  0.6313768513439386  Cross-Val score :  0.7673200698244004
Learning rate :  0.02  Train score :  0.7035033702255138  Cross-Val score :  0.7865556478804183
Learning rate :  0.03  Train score :  0.7279071654629554  Cross-Val score :  0.7944295161097579
Learning rate :  0.04  Train score :  0.7442607160057794  Cross-Val score :  0.7980759479957873
Learning rate :  0.05  Train score :  0.7556405963485611  Cross-Val score :  0.8014579417437562
Learning rate :  0.1  Train score :  0.7870815281606933  Cross-Val score :  0.804523145805901
Learning rate :  0.11  Train score :  0.789432300676207  Cross-Val score :  0.8064780224014367
Learning rate :  0.12  Train score :  0.794252329069951  Cross-Val score :  0.8057913655405639
Learning rate :  0.13  Train score :  0.7967530932303569  Cross-Val score :  0.8071647909456114
Learning rate :  0.14  Train score :  0.7980148101465259  Cross-Val score :  0.8058438846132798
Learning rate :  0.15  Train score :  0.8041

In [54]:
model = xgb.XGBClassifier(learning_rate = 0.5,n_estimators=100,verbosity=0)
model.fit(X_train,y_train)
f1_score(model.predict(X_test),y_test)

0.7007218212104386

In [55]:
test = pd.read_parquet('test.parquet')

In [56]:
test

Unnamed: 0,Patient-Uid,Date,Incident
0,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2016-12-08,SYMPTOM_TYPE_0
1,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2018-10-17,DRUG_TYPE_0
2,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2017-12-01,DRUG_TYPE_2
3,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2018-12-05,DRUG_TYPE_1
4,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2017-11-04,SYMPTOM_TYPE_0
...,...,...,...
1372854,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,2017-05-11,DRUG_TYPE_13
1372856,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,2018-08-22,DRUG_TYPE_2
1372857,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,2017-02-04,DRUG_TYPE_2
1372858,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,2017-09-25,DRUG_TYPE_8


In [57]:
test['Date'] = pd.to_datetime(test['Date'], format='%Y-%m-%d').dt.date


In [58]:
test_data = test.groupby('Patient-Uid').apply(lambda x: x[['Date', 'Incident']].values.tolist())

In [59]:
test_data = test_data.reset_index().rename(columns={0: 'PatientDetails'})

In [60]:
def get_date(item):
    return item[0]
def sort_date(incidents):
    sorted_date = sorted(incidents,key=lambda x:get_date(x))
    return sorted_date
test_data['PatientDetails'] = test_data['PatientDetails'].apply(sort_date)

In [61]:
test_data

Unnamed: 0,Patient-Uid,PatientDetails
0,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,"[[2016-06-23, DRUG_TYPE_7], [2016-12-08, SYMPT..."
1,a0f9e9f9-1c7c-11ec-b565-16262ee38c7f,"[[2015-04-17, PRIMARY_DIAGNOSIS], [2015-04-18,..."
2,a0f9ea43-1c7c-11ec-aa10-16262ee38c7f,"[[2015-04-09, DRUG_TYPE_1], [2015-04-28, DRUG_..."
3,a0f9ea7c-1c7c-11ec-af15-16262ee38c7f,"[[2015-04-14, DRUG_TYPE_6], [2015-05-14, PRIMA..."
4,a0f9eab1-1c7c-11ec-a732-16262ee38c7f,"[[2015-10-21, DRUG_TYPE_2], [2016-03-18, DRUG_..."
...,...,...
11477,a102720c-1c7c-11ec-bd9a-16262ee38c7f,"[[2015-04-10, DRUG_TYPE_7], [2015-05-07, DRUG_..."
11478,a102723c-1c7c-11ec-9f80-16262ee38c7f,"[[2015-05-27, DRUG_TYPE_1], [2015-06-30, DRUG_..."
11479,a102726b-1c7c-11ec-bfbf-16262ee38c7f,"[[2015-05-25, DRUG_TYPE_6], [2015-07-02, DRUG_..."
11480,a102729b-1c7c-11ec-86ba-16262ee38c7f,"[[2017-11-13, DRUG_TYPE_2], [2018-04-20, DRUG_..."


In [62]:
prediction_date = (pd.to_datetime('today') + pd.DateOffset(days=30)).date()
def last_incident_0(incidents):
    diff = (prediction_date - incidents[-1][0])
    return diff
test_data['Days'] = test_data['PatientDetails'].apply(last_incident_0)

In [63]:
test_data['Days'] = test_data['Days'].dt.days

In [64]:
test_data

Unnamed: 0,Patient-Uid,PatientDetails,Days
0,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,"[[2016-06-23, DRUG_TYPE_7], [2016-12-08, SYMPT...",1648
1,a0f9e9f9-1c7c-11ec-b565-16262ee38c7f,"[[2015-04-17, PRIMARY_DIAGNOSIS], [2015-04-18,...",1493
2,a0f9ea43-1c7c-11ec-aa10-16262ee38c7f,"[[2015-04-09, DRUG_TYPE_1], [2015-04-28, DRUG_...",1495
3,a0f9ea7c-1c7c-11ec-af15-16262ee38c7f,"[[2015-04-14, DRUG_TYPE_6], [2015-05-14, PRIMA...",1343
4,a0f9eab1-1c7c-11ec-a732-16262ee38c7f,"[[2015-10-21, DRUG_TYPE_2], [2016-03-18, DRUG_...",1476
...,...,...,...
11477,a102720c-1c7c-11ec-bd9a-16262ee38c7f,"[[2015-04-10, DRUG_TYPE_7], [2015-05-07, DRUG_...",1340
11478,a102723c-1c7c-11ec-9f80-16262ee38c7f,"[[2015-05-27, DRUG_TYPE_1], [2015-06-30, DRUG_...",1602
11479,a102726b-1c7c-11ec-bfbf-16262ee38c7f,"[[2015-05-25, DRUG_TYPE_6], [2015-07-02, DRUG_...",1422
11480,a102729b-1c7c-11ec-86ba-16262ee38c7f,"[[2017-11-13, DRUG_TYPE_2], [2018-04-20, DRUG_...",1678


In [65]:
def target_0(incidents):
    incidents = [i[1] for i in incidents]
    return incidents
test_data['PatientDetails'] = test_data['PatientDetails'].apply(target_0)

In [66]:
new_df = test_data.explode('PatientDetails')

In [67]:
inc_count = new_df.groupby(['Patient-Uid','PatientDetails']).size().reset_index(name='count')

In [68]:
inc_count

Unnamed: 0,Patient-Uid,PatientDetails,count
0,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,DRUG_TYPE_0,8
1,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,DRUG_TYPE_1,3
2,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,DRUG_TYPE_11,1
3,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,DRUG_TYPE_2,2
4,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,DRUG_TYPE_5,1
...,...,...,...
123853,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,DRUG_TYPE_5,8
123854,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,DRUG_TYPE_7,44
123855,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,DRUG_TYPE_8,59
123856,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,DRUG_TYPE_9,10


In [81]:
test_df = inc_count.pivot(index = 'Patient-Uid',columns = 'PatientDetails',values='count').fillna(0)

In [82]:
test_df = patient_df.reset_index()

In [101]:
test_df['Days'] = test_data['Days']

In [84]:
test_df.columns.name=None

In [89]:
test_df.drop('index',axis=1,inplace=True)

In [100]:
test_df

Unnamed: 0,DRUG_TYPE_0,DRUG_TYPE_1,DRUG_TYPE_10,DRUG_TYPE_11,DRUG_TYPE_12,DRUG_TYPE_13,DRUG_TYPE_14,DRUG_TYPE_15,DRUG_TYPE_16,DRUG_TYPE_17,...,SYMPTOM_TYPE_6,SYMPTOM_TYPE_7,SYMPTOM_TYPE_8,SYMPTOM_TYPE_9,TEST_TYPE_0,TEST_TYPE_1,TEST_TYPE_2,TEST_TYPE_3,TEST_TYPE_4,TEST_TYPE_5
0,8.0,3.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
1,2.0,30.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,...,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,4.0,33.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0
3,2.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11477,33.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11478,4.0,6.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11479,14.0,5.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11480,5.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [102]:
ss = StandardScaler()
X = ss.fit_transform(test_df)

In [103]:
X

array([[ 0.0063907 , -0.746983  , -0.15937173, ..., -0.06826023,
        -0.05636554,  0.30363476],
       [-0.54601571,  0.97245173, -0.15937173, ..., -0.06826023,
        -0.05636554, -0.37005471],
       [-0.36188024,  1.16350003, -0.15937173, ..., -0.06826023,
        -0.05636554, -0.36136194],
       ...,
       [ 0.5587971 , -0.61961747,  0.27170158, ..., -0.06826023,
        -0.05636554, -0.67864795],
       [-0.26981251, -0.42856916, -0.15937173, ..., -0.06826023,
        -0.05636554,  0.43402627],
       [-0.63808344, -0.93803131, -0.15937173, ..., -0.06826023,
        -0.05636554,  0.63395992]])

In [104]:
target = model.predict(X)

In [105]:
target

array([1, 0, 0, ..., 0, 1, 0])

In [117]:
test_data['Target'] = target

In [118]:
final = test_data[['Patient-Uid','Target']]

In [120]:
final.to_csv('Final_submission.csv', index = False)