## Problem Statement
Insurance Plus++, a premium payer, wants to use predictive modeling on healthcare data to predict future events among its covered patients. Using existing data about patients' previous medical events, it wants to predict the upcoming events in each patient's journey. Events are recorded in the standardized ICD-9 format. In this challenge, the goal is to predict the next 10 events in 2014 for each patient, in order of occurrence.

The challenge was organized by ZS on HackerEarth and was open to US citizens.
I ranked 5th on the leaderboard.

In [1]:
import numpy as np 
import pandas as pd 

In [2]:
train=pd.read_csv("../input/train.csv")
test=pd.read_csv("../input/test.csv")

## Data Preparation

In [3]:
train.head()

Unnamed: 0,UID,Age,Gender,Date,Event_Code
0,Id_e45bbc48,14,F,201205,8707
1,Id_e45a8472,52,F,201305,7261
2,Id_e45b20d6,12,F,201212,1967
3,Id_e45aabad,22,F,201211,7172
4,Id_e45c5780,73,F,201312,8100


In [4]:
train=train.drop(['Age','Gender'],axis=1)

In [5]:
# Getting Year from Date
def get_year(x):
    date = str(x)
    year = date[:4]
    return year

In [6]:
# Adding the Year column to Train Data
train['Year'] = train.apply(lambda x : get_year(x['Date']), axis=1)

In [7]:
# Getting Month from Date
def get_month(x):
    date = str(x)
    month = date[4:]
    return month

In [8]:
# Adding the Month column to Train Data
train['Month'] = train.apply(lambda x : get_month(x['Date']), axis=1)
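
The row-wise `apply` calls above can be slow on a large frame. As a minimal sketch, assuming `Date` always holds a six-digit `YYYYMM` value, the same two columns can be derived with vectorized string slicing. The `toy` frame below is illustrative only and is not part of the competition data.

```python
import pandas as pd

# Toy frame with the same YYYYMM integer layout as the Date column.
toy = pd.DataFrame({"Date": [201205, 201305, 201212]})

# Convert once to strings, then slice every row in a single vectorized call.
date_str = toy["Date"].astype(str)
toy["Year"] = date_str.str[:4]    # first four characters
toy["Month"] = date_str.str[4:]   # remaining characters (the zero-padded month)

print(toy)
```

Both columns stay strings here, which matches the string comparisons such as `train['Year']=='2013'` used later in the notebook.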

In [9]:
train.head()

Unnamed: 0,UID,Date,Event_Code,Year,Month
0,Id_e45bbc48,201205,8707,2012,5
1,Id_e45a8472,201305,7261,2013,5
2,Id_e45b20d6,201212,1967,2012,12
3,Id_e45aabad,201211,7172,2012,11
4,Id_e45c5780,201312,8100,2013,12


### Extracting each patient's recent events (at least 10 where available)

In [10]:
# Split the data by year once, instead of re-filtering inside the loop
events_2013 = train[train['Year']=='2013']
events_2012 = train[train['Year']=='2012']
events_2011 = train[train['Year']=='2011']

pid_list = set(train['UID'])
frames = []
for item in pid_list:
    # Start from the patient's 2013 events
    df1 = events_2013[events_2013['UID']==item]
    # While fewer than 10 rows are held, add events from earlier years
    # for months that are not covered yet
    for older in (events_2012, events_2011):
        if df1.shape[0]<10:
            df3 = older[older['UID']==item]
            difference_month = set(df3['Month']).difference(set(df1['Month']))
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
            df1 = pd.concat([df1, df3[df3['Month'].isin(difference_month)]])
    frames.append(df1)

train = pd.concat(frames)
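
Filtering the full frame once per patient gets expensive as the number of patients grows. The sketch below applies the same selection rule (keep 2013 events, then backfill from 2012 and 2011 with months not yet seen while fewer than 10 rows are held) by grouping on `UID` a single time. The function name `recent_events`, its `min_rows` parameter, and the toy data are assumptions made for illustration, and the row order of the result may differ from the loop above.

```python
import pandas as pd

def recent_events(events: pd.DataFrame, min_rows: int = 10) -> pd.DataFrame:
    """Per patient: keep 2013 rows, then backfill from 2012 and 2011 with
    months not yet seen while fewer than `min_rows` rows are held."""
    kept = []
    for _, patient in events.groupby("UID", sort=False):
        current = patient[patient["Year"] == "2013"]
        for year in ("2012", "2011"):
            if len(current) >= min_rows:
                break
            older = patient[patient["Year"] == year]
            # Only take rows whose month is not already represented.
            unseen = older[~older["Month"].isin(current["Month"])]
            current = pd.concat([current, unseen])
        kept.append(current)
    return pd.concat(kept)

# Toy data for one hypothetical patient "a".
toy = pd.DataFrame({
    "UID":   ["a", "a", "a", "a", "a"],
    "Date":  [201301, 201302, 201202, 201203, 201104],
    "Year":  ["2013", "2013", "2012", "2012", "2011"],
    "Month": ["01", "02", "02", "03", "04"],
})
out = recent_events(toy, min_rows=3)
print(out)
```

With `min_rows=3`, the two 2013 rows are kept, the 2012 row for month `03` is added, and the 2011 row is skipped because the threshold has been reached.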

In [11]:
train.head(5)

Unnamed: 0,UID,Date,Event_Code,Year,Month
7465,Id_e45ad297,201307,9637,2013,7
11934,Id_e45ad297,201304,E878,2013,4
15813,Id_e45ad297,201312,1955,2013,12
18711,Id_e45ad297,201307,8472,2013,7
24987,Id_e45ad297,201304,8102,2013,4


## Computing the Probability of Occurrence
### Formula used: P(E|M) = [ P(M|E) * P(E) ] / P(M)
Here E is an event code and M is a calendar month, both counted within a single patient's history.
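
To make the formula concrete, here is a small worked example on a made-up history for one patient. The `history` list and the `posterior` helper are hypothetical. With a strict conditional P(M|E) = count(M, E) / count(E), Bayes' rule collapses to count(M, E) / count(M), which the last line checks. Note that the cells below divide the pair count by the patient's total event count instead, so their score is not identical to this quantity.

```python
from collections import Counter

# Hypothetical history for one patient: (month, event_code) pairs.
history = [("01", "A"), ("01", "A"), ("01", "B"), ("02", "A")]

n = len(history)
event_counts = Counter(event for _, event in history)  # how often each event occurs
month_counts = Counter(month for month, _ in history)  # how many events fall in each month
pair_counts = Counter(history)                         # how often each (month, event) pair occurs

def posterior(event, month):
    """P(E|M) via Bayes' rule, estimated from the counts above."""
    p_e = event_counts[event] / n
    p_m = month_counts[month] / n
    p_m_given_e = pair_counts[(month, event)] / event_counts[event]
    return p_m_given_e * p_e / p_m

# Bayes' rule agrees with the direct estimate count(M, E) / count(M).
direct = pair_counts[("01", "A")] / month_counts["01"]
print(posterior("A", "01"), direct)
```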

In [12]:
# Probability for Event P(E)

new_train_df = train.groupby(["UID","Event_Code"]).size().reset_index(name="count_of_event_for_patient")
trial_train = train.merge(new_train_df, on = ['UID','Event_Code'])

new_train_df1 = train.groupby(["UID"]).size().reset_index(name="total_events_for_patient") 
trial_train = trial_train.merge(new_train_df1, on = ['UID'])

trial_train['prob_of_event'] = trial_train['count_of_event_for_patient'] / trial_train['total_events_for_patient']

train = trial_train
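
The count-then-merge pattern above can also be written with `groupby(...).transform`, which returns one value per original row and so needs no merge. This is a sketch on a toy frame, not the code used for the submission; the column names follow the notebook, while the data is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "UID":        ["a", "a", "a", "b"],
    "Event_Code": ["8707", "8707", "7261", "1967"],
})

# Rows sharing the same (patient, event) pair.
event_count = toy.groupby(["UID", "Event_Code"])["Event_Code"].transform("count")
# All rows belonging to the patient.
total_count = toy.groupby("UID")["Event_Code"].transform("count")

toy["prob_of_event"] = event_count / total_count
print(toy)
```

The same idea extends to the per-month counts by swapping `Event_Code` for `Month` in the grouping keys.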

In [13]:
train.head()

Unnamed: 0,UID,Date,Event_Code,Year,Month,count_of_event_for_patient,total_events_for_patient,prob_of_event
0,Id_e45ad297,201307,9637,2013,7,3,113,0.026549
1,Id_e45ad297,201307,9637,2013,7,3,113,0.026549
2,Id_e45ad297,201307,9637,2013,7,3,113,0.026549
3,Id_e45ad297,201304,E878,2013,4,1,113,0.00885
4,Id_e45ad297,201312,1955,2013,12,1,113,0.00885


In [14]:
# Probability for Month P(M)

new_train_df = train.groupby(["UID","Month"]).size().reset_index(name="count_of_month_for_patient")
trial_train = train.merge(new_train_df, on = ['UID','Month'])

new_train_df1 = train.groupby(["UID"]).size().reset_index(name="total_months_for_patient") 
trial_train = trial_train.merge(new_train_df1, on = ['UID'])

trial_train['prob_of_month'] = trial_train['count_of_month_for_patient'] / trial_train['total_months_for_patient']
train = trial_train

In [15]:
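# Probability of Month Given the Event P(M|E)
# Note: the denominator below is the patient's total event count, so this value is
# the joint frequency of (Month, Event) rather than a strict conditional probability.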

new_train_df = train.groupby(["UID","Event_Code","Month"]).size().reset_index(name="count_of_month_and_event_for_patient")
trial_train = train.merge(new_train_df, on = ['UID','Event_Code','Month'])

new_train_df1 = train.groupby(["UID"]).size().reset_index(name="total_event_for_patient_when_month") 
trial_train = trial_train.merge(new_train_df1, on = ['UID'])

trial_train['prob_of_month_when_event'] = trial_train['count_of_month_and_event_for_patient'] / trial_train['total_event_for_patient_when_month']
train = trial_train

In [16]:
# Computing the Probability
train['prob_of_occurrence'] = (train['prob_of_month_when_event']*train['prob_of_event']) / train['prob_of_month']
new_sort = train

In [17]:
# Extracting the Top 10 probability Events
freq_events1 = pd.crosstab(index=[new_sort['UID']],columns=new_sort['Event_Code'], values = new_sort['prob_of_occurrence'], aggfunc=np.mean)
freq_events1 = freq_events1.fillna(0)  # fillna returns a new frame, so assign it back
freq_events1.reset_index(drop=False, inplace=True)

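# For each patient (row), sort event codes by score and keep the 10 highest as Event1..Event10
event_cols = ['Event'+str(i) for i in range(1,11)]
submit = freq_events1.loc[:,freq_events1.columns != 'UID'].apply(lambda row: pd.Series(row.sort_values(ascending=False).iloc[:10].index, index=event_cols),axis=1).reset_index()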
submit.drop('index',inplace=True, axis=1)

submit['UID'] = freq_events1['UID']

cols = submit.columns.tolist()
cols = cols[-1:] + cols[:-1]
submit = submit[cols]
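
The wide crosstab holds one column per distinct event code, which can use a lot of memory. A possible alternative, sketched here under the assumption that only the top-ranked codes per patient are needed, stays in long format and pivots at the end. The helper `top_events` and its `k` parameter are hypothetical. Unlike the crosstab approach, patients with fewer than `k` distinct events get missing values in the trailing columns rather than arbitrary codes.

```python
import pandas as pd

def top_events(scores: pd.DataFrame, k: int = 10) -> pd.DataFrame:
    """Return one row per UID with columns Event1..Event{k}, ranked by mean score."""
    # Average the score per (patient, event), as the crosstab does with its mean aggregation.
    mean_scores = scores.groupby(["UID", "Event_Code"], as_index=False)["prob_of_occurrence"].mean()
    ranked = mean_scores.sort_values(["UID", "prob_of_occurrence"], ascending=[True, False])
    top = ranked.groupby("UID").head(k).copy()
    top["rank"] = top.groupby("UID").cumcount() + 1
    wide = top.pivot(index="UID", columns="rank", values="Event_Code")
    wide = wide.reindex(columns=range(1, k + 1))  # guarantee exactly k columns
    wide.columns = [f"Event{r}" for r in wide.columns]
    return wide.reset_index()

# Toy scores: patient "a" has three events (X appears twice), patient "b" has one.
toy = pd.DataFrame({
    "UID":                ["a", "a", "a", "a", "b"],
    "Event_Code":         ["X", "X", "Y", "Z", "W"],
    "prob_of_occurrence": [0.4, 0.6, 0.9, 0.1, 0.3],
})
out = top_events(toy, k=2)
print(out)
```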

In [18]:
submit

Unnamed: 0,UID,Event1,Event2,Event3,Event4,Event5,Event6,Event7,Event8,Event9,Event10
0,Id_e45a3682,2214,3258,7087,8561,8501,3641,3180,2015,2761,9231
1,Id_e45a3683,2773,3489,8100,8708,2189,7807,7685,9921,8502,5815
2,Id_e45a3684,3273,2735,2334,3417,3614,4273,V586,3641,V588,8561
3,Id_e45a3685,2706,1975,3614,3131,2533,2657,7194,7159,2500,9921
4,Id_e45a3686,2687,3194,2674,7109,5990,2541,2710,3641,8502,8004
5,Id_e45a3687,9714,9701,3556,7151,7241,3082,7705,V761,G020,2788
6,Id_e45a3688,2656,3579,2766,311,3372,4847,2550,3715,2942,2335
7,Id_e45a3689,2645,3749,2836,2189,3466,3475,1975,2780,2270,9920
8,Id_e45a368a,3591,3510,3486,3615,3213,3454,3320,3673,2024,3509
9,Id_e45a368b,2775,9923,9921,532,9961,3721,2693,3731,3432,5964


In [19]:
# final submission
submit.to_csv("submission.csv", index=False)