# Survival Analysis: Implementation

In [None]:
!pip install lifelines

In [None]:
!pip install ppscore

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from lifelines.plotting import plot_lifetimes      # lifelines package for survival analysis
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 6)           # default figure size for all plots

### Example with fictitious data

In the case of users 4 and 5, we do not know when the event will occur, but we still use their data to estimate the probability of survival. If we chose not to include the censored data, our estimates would very likely be biased and underestimated. Including censored data in the estimate is what makes survival analysis so powerful.

$n_i$ is defined as the population at risk just prior to time $t_i$, and $d_i$ is defined as the number of events that occurred at time $t_i$.
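
Using these definitions, the Kaplan-Meier (product-limit) estimate of the survival function is the running product of (1 - d_i / n_i) over the event times up to t. The sketch below computes it by hand in plain Python for the same toy data used in the lifelines example; the helper name `kaplan_meier` is made up for illustration and is not part of lifelines.

```python
from collections import Counter

def kaplan_meier(durations, event_observed):
    """Return [(t_i, S(t_i))] at each distinct event time (product-limit rule)."""
    # d_i: number of observed events at each distinct event time t_i
    events_at = Counter(t for t, e in zip(durations, event_observed) if e == 1)
    survival = 1.0
    curve = []
    for t_i in sorted(events_at):
        # n_i: subjects still at risk just before t_i (duration >= t_i)
        n_i = sum(1 for t in durations if t >= t_i)
        d_i = events_at[t_i]
        survival *= 1 - d_i / n_i
        curve.append((t_i, survival))
    return curve

# Same toy data as the lifelines example in this notebook
durations = [5, 6, 6, 2.5, 4, 4]
event_observed = [1, 0, 0, 1, 1, 1]

km_curve = kaplan_meier(durations, event_observed)
for t_i, s in km_curve:
    print(f"t = {t_i}: n-at-risk based estimate S(t) = {s:.3f}")
```

The two censored users (duration 6, event 0) never trigger a drop in the curve, but they still count in n_i at every earlier event time, which is how censored data contributes to the estimate.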

In [None]:
from lifelines import KaplanMeierFitter

## Example Data 
durations = [5,6,6,2.5,4,4]
event_observed = [1, 0, 0, 1, 1, 1]

## Create a kmf object
kmf = KaplanMeierFitter() 


## Fit the data into the model
kmf.fit(durations, event_observed,label='Kaplan Meier Estimate')

## Plot the estimate
kmf.plot(ci_show=False) ## ci_show toggles the confidence interval; it is hidden here because the data set is tiny.
print(kmf)

<b>Right censoring</b> – a data point is above a certain value but it is unknown by how much. ... The observed value is the minimum of the censoring and failure times; subjects whose <b>failure time is greater than their censoring time</b> are right-censored.
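
As a small illustration of that definition, the sketch below builds the observed durations and event flags from made-up failure and censoring times. All numbers are hypothetical; in real data the true failure time of a censored subject is of course unknown.

```python
# Hypothetical true failure times and censoring times (e.g. end of study), in months
failure_times = [3, 8, 12, 5]
censoring_times = [10, 6, 9, 7]

# The observed value is the minimum of the failure and censoring times
observed = [min(f, c) for f, c in zip(failure_times, censoring_times)]

# The event is observed (1) only if failure happens no later than censoring;
# otherwise the subject is right-censored (0)
event_observed = [1 if f <= c else 0 for f, c in zip(failure_times, censoring_times)]

print(observed)        # durations passed to the fitter
print(event_observed)  # 1 = event observed, 0 = right-censored
```

These two lists have the same shape as the `durations` and `event_observed` arguments passed to `KaplanMeierFitter.fit` in this notebook.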

## Real World Example 

### We will be using Telco Customer Churn data from Kaggle
https://www.kaggle.com/blastchar/telco-customer-churn/

In [None]:
##  create a dataframe
df = pd.read_csv("../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv") 

## Explanation of Dataset

<b>customerID:</b> Customer ID
<br>
<b>gender:</b> Whether the customer is a male or a female
<br>
<b>SeniorCitizen:</b> Whether the customer is a senior citizen or not (1, 0)
<br>
<b>Partner:</b> Whether the customer has a partner or not (Yes, No)
<br>
<b>Dependents:</b> Whether the customer has dependents or not (Yes, No)
<br>
<b>tenure:</b> Number of months the customer has stayed with the company
<br>
<b>PhoneService:</b> Whether the customer has a phone service or not (Yes, No)
<br>
<b>MultipleLines:</b> Whether the customer has multiple lines or not (Yes, No, No phone service)
<br>
<b>InternetService:</b> Customer’s internet service provider (DSL, Fiber optic, No)
<br>
<b>OnlineSecurity:</b> Whether the customer has online security or not (Yes, No, No internet service)
<br>
<b>OnlineBackup:</b> Whether the customer has online backup or not (Yes, No, No internet service)
<br>
<b>DeviceProtection:</b> Whether the customer has device protection or not (Yes, No, No internet service)
<br>
<b>TechSupport:</b> Whether the customer has tech support or not (Yes, No, No internet service)
<br>
<b>StreamingTV:</b> Whether the customer has streaming TV or not (Yes, No, No internet service)
<br>
<b>StreamingMovies:</b> Whether the customer has streaming movies or not (Yes, No, No internet service)
<br>
<b>Contract:</b> The contract term of the customer (Month-to-month, One year, Two year)
<br>
<b>PaperlessBilling:</b> Whether the customer has paperless billing or not (Yes, No)
<br>
<b>PaymentMethod:</b> The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
<br>
<b>MonthlyCharges:</b> The amount charged to the customer monthly
<br>
<b>TotalCharges:</b> The total amount charged to the customer
<br>
<b>Churn:</b> Whether the customer churned or not (Yes or No)

In [None]:
## Have a first look at the data
df.head() 

In [None]:
## Data Types and Missing Values in Columns
df.info()  

In [None]:
## Convert TotalCharges to numeric
df['TotalCharges']=pd.to_numeric(df['TotalCharges'],errors='coerce')

## Replace Yes and No in the Churn column with 1 and 0: 1 for the event and 0 for censored data.
df['Churn']=df['Churn'].apply(lambda x: 1 if x == 'Yes' else 0 )

In [None]:
## after converting the column TotalCharges to numeric
df.info()  ## Column TotalCharges now has missing values

In [None]:
## Impute the null values with the median value
## (assign the result back; calling fillna with inplace=True on a selected column is unreliable in recent pandas)
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())

Ways of filling missing values: https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779

In [None]:
## Create a list of categorical columns
cat_cols = [i for i in df.columns if df[i].dtype == object]
cat_cols.remove('customerID')  ## customerID is removed because it is unique for every row.

In [None]:
## Let's look at the categories and their distribution in each categorical column.

for i in cat_cols:
    print('Column Name: ',i)
    print(df[i].value_counts())
    print('-----------------------------')

# Predictive Power Score

In [None]:
import ppscore as pps
plt.figure(figsize=(16,12))
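## In ppscore >= 1.0, pps.matrix returns a long-format table (x, y, ppscore, ...); pivot it into a square matrix first
pps_matrix = pps.matrix(df)[['x','y','ppscore']].pivot(columns='x', index='y', values='ppscore')
sns.heatmap(pps_matrix,annot=True,fmt=".2f")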

# Correlation Matrix

In [None]:
plt.figure(figsize=(16,12))
sns.heatmap(df.corr(numeric_only=True),annot=True,fmt=".2f")  ## numeric_only skips the text columns (required in pandas >= 2.0)

More details about the Kaplan-Meier graphs shown below: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3059453/

In [None]:
## Let's create an overall Kaplan-Meier curve, without breaking it into groups of covariates.

## Import the library
from lifelines import KaplanMeierFitter


durations = df['tenure'] ## Time to event (or to censoring), in months
event_observed = df['Churn']  ## 1 if the customer churned, 0 if censored

## Create a KaplanMeierFitter object named km
km = KaplanMeierFitter() ## instantiate the class to create an object

## Fit the data into the model
km.fit(durations, event_observed,label='Kaplan Meier Estimate')

## Plot the estimate
km.plot()

## Let's create Kaplan-Meier Curves for Cohorts

Let's create three cohorts of customers based on their contract type (month-to-month, one year, or two year). We want to know which cohort has the best customer retention.

In [None]:
kmf = KaplanMeierFitter() 


T = df['tenure']     ## time to event
E = df['Churn']      ## event occurred or censored


groups = df['Contract']             ## Create the cohorts from the 'Contract' column
ix1 = (groups == 'Month-to-month')   ## Cohort 1
ix2 = (groups == 'Two year')         ## Cohort 2
ix3 = (groups == 'One year')         ## Cohort 3


kmf.fit(T[ix1], E[ix1], label='Month-to-month')    ## fit the cohort 1 data
ax = kmf.plot()


kmf.fit(T[ix2], E[ix2], label='Two year')         ## fit the cohort 2 data
ax1 = kmf.plot(ax=ax)


kmf.fit(T[ix3], E[ix3], label='One year')        ## fit the cohort 3 data
kmf.plot(ax=ax1)                                 ## Plot the KM curves for all three cohorts on the same axes

We see that month-to-month subscribers have the highest probability of churning.

In [None]:
kmf1 = KaplanMeierFitter() ## instantiate the class to create an object

## Two cohorts are compared: 1. users not subscribed to Streaming TV, 2. users subscribed to Streaming TV.
groups = df['StreamingTV']   
i1 = (groups == 'No')      ## i1: boolean mask selecting the 1st cohort
i2 = (groups == 'Yes')     ## i2: boolean mask selecting the 2nd cohort


## fit the model for 1st cohort
kmf1.fit(T[i1], E[i1], label='Not Subscribed StreamingTV')
a1 = kmf1.plot()

## fit the model for 2nd cohort
kmf1.fit(T[i2], E[i2], label='Subscribed StreamingTV')
kmf1.plot(ax=a1)

From the curves, it is evident that customers who have subscribed to Streaming TV have better retention than customers who have not.

We can see that the survival probability of the cohort without Streaming TV is lower than that of the cohort with it. For the non-subscribed cohort, the survival probability drops steeply in the first 10 months and declines more slowly after that; for the subscribed cohort, the rate of decrease is fairly constant. Therefore, for the cohort that has not subscribed to Streaming TV, retention efforts should focus on the volatile first 10 months.

In [None]:
kmf2 = KaplanMeierFitter() ## instantiate the class to create an object


groups = df['gender']   
j1 = (groups == 'Male')      ## j1: boolean mask selecting the 1st cohort
j2 = (groups == 'Female')     ## j2: boolean mask selecting the 2nd cohort


## fit the model for 1st cohort
kmf2.fit(T[j1], E[j1], label='Male')
a1 = kmf2.plot()

## fit the model for 2nd cohort
kmf2.fit(T[j2], E[j2], label='Female')
kmf2.plot(ax=a1)

In [None]:
kmf3 = KaplanMeierFitter() ## instantiate the class to create an object


groups = df['Partner']   
k1 = (groups == 'No')      ## k1: boolean mask selecting the 1st cohort
k2 = (groups == 'Yes')     ## k2: boolean mask selecting the 2nd cohort


## fit the model for 1st cohort
kmf3.fit(T[k1], E[k1], label='Do not have a partner')
a1 = kmf3.plot()

## fit the model for 2nd cohort
kmf3.fit(T[k2], E[k2], label='Have a partner')
kmf3.plot(ax=a1)

Additionally, Kaplan-Meier curves are useful only when the predictor variable is categorical (e.g., treatment A vs. treatment B, or males vs. females). They do not work easily for quantitative predictors such as gene expression, weight, or age.

An alternative method is the Cox proportional hazards regression analysis, which works for both quantitative predictor variables and for categorical variables. Furthermore, the Cox regression model extends survival analysis methods to assess simultaneously the effect of several risk factors on survival time.

## Cox Proportional Hazard Model (Survival Regression)

In [None]:
from lifelines import CoxPHFitter     

In [None]:
## The objective here is to introduce the implementation of the model,
## so only a subset of the columns in the original data is used for training.
df_r= df.loc[:,['tenure','Churn','gender','Partner','Dependents','PhoneService','MonthlyCharges','SeniorCitizen','StreamingTV']]
df_r.head() ## have a look at the data 

In [None]:
## Create dummy variables by using one-hot encoding
df_dummy = pd.get_dummies(df_r, drop_first=True)
df_dummy.head()

In his seminal paper, Cox (1972) presented the proportional hazards model, which specifies that the conditional hazard function of failure time given a set of covariates is the product of an unknown baseline hazard function and an exponential regression function of the covariates.

Description of the above model: https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html#cox-s-proportional-hazard-model
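
Written out, the verbal description above corresponds to the standard form of the Cox model. The symbols here are notation chosen for this note: $h_0(t)$ is the unknown baseline hazard, $x_1, \dots, x_p$ are the covariates, and $\beta_1, \dots, \beta_p$ are the regression coefficients (the `coef` column that lifelines reports).

$$
h(t \mid x) = h_0(t)\,\exp\left(\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p\right)
$$

For two customers who differ by one unit in covariate $x_j$ and are otherwise identical, the baseline hazard cancels, leaving the hazard ratio:

$$
\frac{h(t \mid x_j + 1)}{h(t \mid x_j)} = \exp(\beta_j)
$$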

In [None]:
# Using Cox Proportional Hazards model
cph = CoxPHFitter()   ## Instantiate the class to create a cph object
cph.fit(df_dummy, 'tenure', event_col='Churn')   ## Fit the data to train the model
cph.print_summary()    ## Have a look at the significance of the features

In [None]:
cph.plot() ## With a fitted model, the plot method is an alternative way to view the coefficients and their ranges.

This plot is another way to show the coefficients. For example, PhoneService_Yes (having a phone service) has a coefficient of about 0.69. Because this covariate is binary, going from 0 to 1 multiplies the baseline hazard by a factor of exp(0.69) ≈ 2.00, which is about a 100% increase (a doubling). In the Cox proportional hazards model, a higher hazard means a greater risk of the event occurring. The value exp(0.69) is called the hazard ratio.

An interesting point to note is that the β (coef) values for the covariates MonthlyCharges and gender_Male are both approximately zero (about -0.01), yet MonthlyCharges plays a significant role in predicting churn while gender_Male does not. The reason is that MonthlyCharges is a continuous value that can be on the order of tens, hundreds, or thousands; when it is multiplied by even a small coefficient (β = -0.01), the effect becomes substantial. The covariate gender_Male, on the other hand, can only take the value 0 or 1, and in both cases, exp(-0.01 * 0) and exp(-0.01 * 1), the effect is negligible.
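
The arithmetic behind the hazard ratio can be checked in a couple of lines. This is only a sketch: the value 0.69 is the approximate coefficient quoted above, not one read from a fitted model here.

```python
import math

coef = 0.69  # approximate coefficient quoted above for PhoneService_Yes

hazard_ratio = math.exp(coef)               # multiplicative change in hazard
percent_change = (hazard_ratio - 1) * 100   # same change expressed as a percentage

print(f"hazard ratio: {hazard_ratio:.2f}")
print(f"change in hazard: {percent_change:.0f}%")
```

A hazard ratio close to 2 means the hazard roughly doubles, i.e. it rises by roughly 100%.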
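
To make the scale argument concrete, the sketch below compares the hazard ratio implied by the same small coefficient for a binary covariate and for a continuous one. The coefficient -0.01 is the approximate value quoted above, and the 50-unit difference in MonthlyCharges is a hypothetical example, not a value taken from the data.

```python
import math

beta = -0.01  # approximate coefficient quoted above for both covariates

# gender_Male is binary, so the largest possible difference between two customers is 1
hr_gender = math.exp(beta * 1)

# MonthlyCharges is continuous; 50 is a hypothetical difference between two customers
charge_difference = 50
hr_charges = math.exp(beta * charge_difference)

print(f"gender_Male (0 -> 1):         hazard ratio = {hr_gender:.3f}")
print(f"MonthlyCharges (+{charge_difference} units):  hazard ratio = {hr_charges:.3f}")
```

With the same coefficient, the binary covariate barely moves the hazard, while a 50-unit difference in the continuous covariate changes it by roughly 40%.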

In [None]:
## We want to see the survival curve at the customer level, so we select 4 customers (rows 1 through 4) and drop the tenure and Churn columns.

tr_rows = df_dummy.iloc[1:5, 2:]
tr_rows

In [None]:
## Let's predict the survival curve for the selected customers.
## Each customer can be identified by the row number shown in the legend next to the curve.
cph.predict_survival_function(tr_rows).plot()

From the graph above, we can see that customer 2 has the highest probability of churning.

Creating survival curves at the individual customer level helps us proactively design tailor-made strategies for high-value customers in different survival-risk segments along the timeline.

# Additional Resources

Lifelines Python documentation: https://lifelines.readthedocs.io/en/latest/Quickstart.html

SciPy 2015 lecture by Allen Downey: https://www.youtube.com/watch?v=XHYFNraQEEo

Princeton University lecture notes: https://data.princeton.edu/wws509/notes/c7.pdf

Thanks for checking out the analysis<br>
-Akshat Anand