# Agenda

1. Why use survival analysis?
2. Prepare data.
3. Create survival models.
4. Evaluate.
5. Improve.
6. Test.


In [None]:
import pandas as pd
url = 'https://datasciencemeetup.s3.amazonaws.com/data/WA_Fn-UseC_-Telco-Customer-Churn+2.csv'
df = pd.read_csv(url)
df

In [None]:
pip install lifelines

In [None]:
import matplotlib.pyplot as plt  # visualization
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

In [None]:
import lifelines

# Why survival analysis > classification
Gene Signature Improves Prediction of Multi-Drug Resistant Ovarian Cancer Survival:
![sa](https://home.ccr.cancer.gov/inthejournals/dev/images/GeneSignatureMDROvarianCancer.jpg)

Reliability engineering:
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/78/Bathtub_curve.svg/1280px-Bathtub_curve.svg.png" alt="Drawing" style="width: 400px;"/>

In [None]:
df_churn = df

In [None]:
df_churn = pd.read_csv('../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')
df_churn

In [None]:
df_churn['customerID'].duplicated().any()

In [None]:
df_churn.set_index('customerID', inplace=True)
df_churn

# Split your data to avoid information leakage

In [None]:
df_churn.info()

In [None]:
def clean_data(df):
    df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
    df['TotalCharges'] = df['TotalCharges'].fillna(0)
    X = pd.get_dummies(df.drop(['Churn', 'tenure'], axis=1), drop_first=True)
    y = df.loc[:, ['tenure', 'Churn']]
    y.columns = ['time', 'status']
    y['status'] = y.loc[:, 'status'] == 'Yes'
    df_clean = pd.concat([y, X], axis=1)
    return df_clean
df_churn_clean = clean_data(df_churn)
df_churn_clean

In [None]:
num_customers = df_churn_clean.shape[0]
num_train = int(num_customers * 0.7)
train_customers = np.random.choice(df_churn_clean.index, size=num_train, replace=False)
df_train = df_churn_clean.loc[train_customers]

num_dev = int((num_customers - num_train) * 0.5)
dev_customers = np.random.choice(df_churn_clean.drop(train_customers).index, size=num_dev, replace=False)
df_dev = df_churn_clean.loc[dev_customers]

test_customers = df_churn.drop(train_customers).drop(dev_customers).index
df_test = df_churn_clean.loc[test_customers]

In [None]:
print('Train', df_train.shape)
print('Dev', df_dev.shape)
print('Test', df_test.shape)
print('Total:', df_dev.shape[0] + df_test.shape[0] + df_train.shape[0])
assert df_dev.shape[0] + df_test.shape[0] + df_train.shape[0] == df_churn.shape[0]
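
The split above draws different customers on every run because no random seed is set. As a minimal sketch of the same 70/15/15 idea made reproducible, assuming a hypothetical helper `split_ids` and a seed of 42 (neither is part of the original notebook):

```python
import random

def split_ids(ids, train_frac=0.7, seed=42):
    """Shuffle ids with a fixed seed, then cut into train/dev/test lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_dev = int((len(shuffled) - n_train) * 0.5)  # half of the remainder
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test

# Hypothetical customer IDs 0..99 stand in for df_churn_clean.index
train_ids, dev_ids, test_ids = split_ids(range(100))
```

Because the three lists are slices of one shuffled list, no customer can appear in more than one set, which is the property that prevents leakage.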

These are all our features. We will pay special attention to `tenure` and `Churn` (renamed `time` and `status` during cleaning) because they are our _response_ variables.

In [None]:
df_train.info()

Let's see our lifetimes

In [None]:
from lifelines.plotting import plot_lifetimes
# Sample whole rows so each duration stays paired with its own event flag
sample = df_train.sample(25, replace=False)
plt.figure(figsize=(16, 6));
plot_lifetimes(sample['time'], sample['status'])
plt.xlabel('Months subscribed');  # tenure in this dataset is measured in months
plt.ylabel('Customer ID');
plt.title('Customer subscription lifelines');

# Survival analysis
## The survival function
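The survival function answers the question: "what is the probability that the event has not yet happened by time $t$?" For churn, that is the probability that a customer is still subscribed after time $t$. Here $T$ is the time at which the event occurs.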
$$S(t) = Pr(t < T)$$
### Kaplan-Meier
The __Kaplan-Meier Estimator__ estimates the survival function using only __time__ and __status__. It is the simplest survival model: it tells you the probability that the event has not occurred by time $t$. The shaded region around the line is the _confidence interval_, which shows how much uncertainty there is around the estimated curve.

> A 95% CI is the interval that you are 95% certain contains the true population value as it might be estimated from a much larger study.
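
To make the estimator concrete, here is a minimal hand-rolled sketch of the product-limit calculation on made-up data. The function `kaplan_meier` and the toy lists are illustrative assumptions, not part of `lifelines`; at each event time the running survival is multiplied by the fraction of at-risk customers who did not churn.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimate)."""
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # churned and censored customers both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            survival *= 1 - d / at_risk  # fraction of the risk set that survives t
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve

# Hypothetical toy data: tenure in months; True = churned, False = censored
toy_durations = [1, 2, 2, 3, 5]
toy_events = [True, True, False, True, False]
km_curve = kaplan_meier(toy_durations, toy_events)
```

Censored customers never multiply the curve down, but they do shrink the risk set, which is how the estimator uses customers who have not churned yet.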

In [None]:
T = df_train['time']
E = df_train['status']
# alpha is the significance level: alpha=0.01 draws a 99% confidence interval
kmf = lifelines.KaplanMeierFitter().fit(T, E, alpha=0.01)
kmf.plot_survival_function()
plt.ylim(0, 1);

### What is Kaplan-Meier good for? Customer segmentation
You don't need much data to fit it. The catch is that it can't predict what an _individual_ will do, only what your _population_ may do. You can improve on a single Kaplan-Meier curve with __customer segmentation__: fit one curve per segment and compare them.

In [None]:
df_churn.head()

In [None]:
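ax = plt.subplot(111)
group = 'OnlineSecurity'
# Group only the training customers; df_churn still holds the original categorical columns
for name, df_group in df_churn.loc[df_train.index].groupby(group):
    idx = df_group.index
    # Use a separate fitter so the population-level `kmf` is not overwritten
    kmf_group = lifelines.KaplanMeierFitter().fit(T.loc[idx], E.loc[idx], alpha=0.05, label=name)
    kmf_group.plot_survival_function(ax=ax)
plt.title(group);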

In [None]:
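ax = plt.subplot(111)
group = 'InternetService'
# Group only the training customers; df_churn still holds the original categorical columns
for name, df_group in df_churn.loc[df_train.index].groupby(group):
    idx = df_group.index
    # Use a separate fitter so the population-level `kmf` is not overwritten
    kmf_group = lifelines.KaplanMeierFitter().fit(T.loc[idx], E.loc[idx], alpha=0.05, label=name)
    kmf_group.plot_survival_function(ax=ax)
plt.title(group);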

# Estimate Hazard
The hazard function is one of the most difficult aspects of survival analysis to explain. Think of hazard as a _risk_ of the event occurring at any given time. Here is the definition:
> The hazard function is defined as the event rate at time $t$ conditional on survival until time $t$.

The bottom line is that you always want your hazard to be low. If it goes up, that's a red flag that something has changed between you and your customer. This is an opportunity to learn what has changed and whether you can intervene.
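
For reference, the verbal definition above is usually written as follows, assuming a continuous event time $T$ with density $f(t)$ (the symbol $f$ is introduced here for illustration and is not used elsewhere in this notebook):
$$h(t) = \lim_{\Delta t \to 0} \frac{Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}$$
The division by $S(t)$ is what makes the hazard conditional on having survived until $t$.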

## The Nelson-Aalen estimator

### Cumulative hazard
Like the Kaplan-Meier estimator, the Nelson-Aalen estimator is non-parametric. It estimates the __cumulative hazard__ $H(t)$: the hazard accumulated up to time $t$, which can be read as the expected number of events by time $t$ if the event could repeat.
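
As a rough sketch of what the estimator computes, the cumulative hazard can be built by hand as a running sum of (events at time t) / (customers at risk at time t). The function `nelson_aalen` and the toy data below are illustrative assumptions, not the `lifelines` implementation, and the sketch uses no smoothing.

```python
from collections import Counter

def nelson_aalen(durations, events):
    """Return [(t, H(t))]: running sum of d_i / n_i at each distinct event time."""
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # churned and censored customers both leave the risk set
    at_risk = len(durations)
    cumulative = 0.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            cumulative += d / at_risk  # hazard increment at time t
            curve.append((t, cumulative))
        at_risk -= exits[t]
    return curve

# Hypothetical toy data: tenure in months; True = churned, False = censored
toy_durations = [1, 2, 2, 3, 5]
toy_events = [True, True, False, True, False]
na_curve = nelson_aalen(toy_durations, toy_events)
```

Unlike a survival curve, the cumulative hazard only goes up and is not bounded by 1.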

In [None]:
from lifelines import NelsonAalenFitter
naf = NelsonAalenFitter(alpha=0.05, nelson_aalen_smoothing=False).fit(T, E)
naf.cumulative_hazard_.head()

In [None]:
?NelsonAalenFitter

In [None]:
naf.plot()

In [None]:
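ax = plt.subplot(111)
group = 'InternetService'
# Group only the training customers; df_churn still holds the original categorical columns
for name, df_group in df_churn.loc[df_train.index].groupby(group):
    idx = df_group.index
    naf = NelsonAalenFitter().fit(T.loc[idx], E.loc[idx], label=name)
    naf.plot(ax=ax)
plt.title(group);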

### The hazard rate
Visualizing the hazard function in `lifelines` requires setting a `bandwidth` parameter, which controls how widely the estimate is smoothed over time. The lower the bandwidth, the less smooth the hazard function will be. Try a few different values.
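
To see why the bandwidth matters, here is a small sketch of kernel smoothing applied to cumulative-hazard increments. It assumes an Epanechnikov kernel and hypothetical increments; the exact kernel and edge handling inside `lifelines` may differ, so treat this as an illustration of the idea only.

```python
def epanechnikov(u):
    """Epanechnikov kernel: nonzero only for |u| <= 1."""
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0

def smoothed_hazard(t, event_times, increments, bandwidth):
    """Kernel-weighted average of cumulative-hazard jumps near time t."""
    weighted = sum(
        epanechnikov((t - ti) / bandwidth) * dh
        for ti, dh in zip(event_times, increments)
    )
    return weighted / bandwidth

# Hypothetical jumps in the cumulative hazard at months 1, 2 and 3
event_times = [1, 2, 3]
increments = [0.2, 0.25, 0.5]

narrow = smoothed_hazard(2, event_times, increments, bandwidth=0.5)  # only month 2 contributes
wide = smoothed_hazard(2, event_times, increments, bandwidth=5)      # all three months contribute
```

A narrow bandwidth reacts to each individual jump, which gives a spiky curve; a wide bandwidth averages over neighbouring months and gives a smoother one.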

In [None]:
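ax = plt.subplot(111)
group = 'InternetService'
# Group only the training customers; df_churn still holds the original categorical columns
for name, df_group in df_churn.loc[df_train.index].groupby(group):
    idx = df_group.index
    naf = NelsonAalenFitter().fit(T.loc[idx], E.loc[idx], label=name)
    naf.plot_hazard(ax=ax, bandwidth=5)
plt.title(group);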

# How can you predict what individual customers will do?
The non-parametric models, Kaplan-Meier and Nelson-Aalen, are great for analyzing groups of people. But to analyze what any individual person will do, you need more information. You need a model that uses each person's features, such as the semi-parametric Cox model below.



In [None]:
df_train.info()

## Cox Proportional Hazards Regression
This is one of the most popular models used in research papers, clinical trials, and survival analysis.

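$$h(t, X) = h_0(t)\exp(\beta X)$$

Here $h_0(t)$ is the baseline hazard shared by everyone, and $\exp(\beta X)$ scales it up or down according to an individual's features $X$. The model uses the features in your data to estimate a survival function for each customer.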
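
A small numeric sketch can make the formula less abstract. The coefficients, feature vectors, and baseline hazard values below are all hypothetical; the point is only that the ratio of two customers' hazards does not depend on $t$, which is the "proportional" in proportional hazards.

```python
import math

def cox_hazard(baseline_hazard, beta, x):
    """h(t, X) = h_0(t) * exp(beta . X) for one customer at one time point."""
    linear_predictor = sum(b * xi for b, xi in zip(beta, x))
    return baseline_hazard * math.exp(linear_predictor)

# Hypothetical coefficients for two binary features
beta = [0.7, -0.4]
customer_a = [1, 0]
customer_b = [0, 1]

# Hypothetical baseline hazard h_0(t) at months 1, 12 and 24
baseline = {1: 0.02, 12: 0.01, 24: 0.005}

# The baseline cancels, so the ratio is exp(0.7 - (-0.4)) at every time point
ratios = [
    cox_hazard(h0, beta, customer_a) / cox_hazard(h0, beta, customer_b)
    for h0 in baseline.values()
]
```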

In [None]:
from lifelines import CoxPHFitter
?CoxPHFitter.fit

In [None]:
cph = CoxPHFitter()
cph.fit(df_train, duration_col='time', event_col='status')
cph.print_summary()

In [None]:
# detecting multicollinearity: https://www.analyticsvidhya.com/blog/2020/03/what-is-multicollinearity/
# VIF starts at 1 and has no upper limit
# VIF = 1, no correlation between the independent variable and the other variables
# VIF exceeding 5 or 10 indicates high multicollinearity between this independent variable and the others

from statsmodels.stats.outliers_influence import variance_inflation_factor

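def calc_vif(X):
    # Cast to float so boolean dummy columns do not produce an object array
    values = X.astype(float).values

    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(values, i) for i in range(values.shape[1])]

    return vif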

feature_vif = calc_vif(df_train.drop(['status', 'time'], axis=1))
feature_vif.sort_values('VIF')
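
For intuition about the numbers above: the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on the others. With only two features and an intercept, R² is the squared Pearson correlation, so a hand calculation is short. The helper names and data below are hypothetical, and `statsmodels` does not add an intercept on its own, so its values can differ from this centered version.

```python
def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def two_feature_vif(xs, ys):
    """VIF = 1 / (1 - R^2); with two features, R^2 is r squared."""
    r = pearson_r(xs, ys)
    return 1 / (1 - r * r)

# Hypothetical features: the first two move together, the third is unrelated
monthly = [20, 40, 60, 80, 100]
total = [25, 35, 65, 75, 100]
unrelated = [1, -1, 0, -1, 1]

vif_high = two_feature_vif(monthly, total)      # strongly correlated pair
vif_low = two_feature_vif(monthly, unrelated)   # uncorrelated pair
```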

In [None]:
features_keep = feature_vif.loc[feature_vif['VIF'] < 10, 'variables'].tolist()
features_keep.extend(['time', 'status'])
features_keep

In [None]:
cph = CoxPHFitter()
cph.fit(df_train.loc[:, features_keep], duration_col='time', event_col='status')
cph.print_summary()

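__Key outputs:__
- P-values test whether a feature's coefficient differs from zero; a small value suggests the feature is associated with the hazard.
- exp(coef) is the hazard ratio: the factor by which the hazard is multiplied when the feature increases by one unit, with the other features held fixed. For a binary feature, group 1 has the feature and group 2 does not:
$$exp(coef) = \frac{hazard\ of\ group\ 1(t)}{hazard\ of\ group\ 2(t)}$$
- Concordance measures how well the model ranks customers by risk: 0.5 is no better than random ordering and 1.0 is a perfect ranking.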

# Evaluating your model

## Plotting coefficients

In [None]:
cph.plot();

## Predicting on the development set

In [None]:
surv_pred_train = cph.predict_survival_function(df_train)
surv_median_pred_train = cph.predict_median(df_train)

In [None]:
surv_pred_train

In [None]:
kmf.survival_function_

In [None]:
import plotly.graph_objs as go

fig = go.Figure()

buttons = []

fig.add_trace(
    go.Scatter(
        x=kmf.survival_function_.index, 
        y=kmf.survival_function_['KM_estimate'],
        marker_color="black",
        name='KM estimate'
    )
)
buttons.append(dict(method='restyle', label='KM Estimate', 
                    args=[{'y': [kmf.survival_function_['KM_estimate'].values]}]))

for customerID in surv_pred_train.columns[0:50]:
    color = 'red' if df_train.loc[customerID, 'status'] else 'blue'
    customer_probabilities = surv_pred_train[customerID]
    fig.add_trace(
        go.Scatter(
            x=surv_pred_train.index, 
            y=customer_probabilities,
            marker_color=color,
            name=customerID
        )
    )
    buttons.append(dict(method='restyle',
                        label=customerID,
                        args=[{'y': [customer_probabilities.values]}])
                  )
updatemenu = [{}]
updatemenu[0]['buttons'] = buttons
updatemenu[0]['direction'] = 'down'
updatemenu[0]['showactive'] = True

# update layout and show figure
fig.update_layout(updatemenus=updatemenu)



#fig.show()

In [None]:
surv_median_pred_train

In [None]:
surv_pred_train.columns[100]

In [None]:
patient_name = surv_pred_train.columns[90]
ax = surv_pred_train[patient_name].plot()
kmf.plot(ax=ax, label='KM Estimator')
# First vertical line: observed tenure; red line: predicted median survival time
plt.vlines(df_train.loc[patient_name, 'time'], ymin=0, ymax=1)
plt.vlines(surv_median_pred_train[patient_name], ymin=0, ymax=1, color='red')

In [None]:
# predict on dev
surv_pred_dev = cph.predict_survival_function(df_dev)
surv_median_pred_dev = cph.predict_median(df_dev)

In [None]:
patient_name = surv_pred_dev.columns[80]
ax = surv_pred_dev[patient_name].plot()
kmf.plot(ax=ax, label='KM Estimator')
# First vertical line: observed tenure; red line: predicted median survival time
plt.vlines(df_dev.loc[patient_name, 'time'], ymin=0, ymax=1)
plt.vlines(surv_median_pred_dev[patient_name], ymin=0, ymax=1, color='red')

In [None]:
ax = surv_pred_dev.iloc[:, 0:6].plot();
kmf.plot(ax=ax, label='KM Estimator')

## Concordance
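The concordance index is the fraction of comparable customer pairs that the model ranks in the correct order: 0.5 is no better than chance and 1.0 is perfect. Background: https://stats.stackexchange.com/a/478305/11867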
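
A hand-rolled sketch on toy data shows what is being counted. The function `concordance` below is a simplified illustration, not the `lifelines` implementation: it skips pairs with tied times, and it assumes a higher score means longer predicted survival, which is why the notebook negates the partial hazard before scoring.

```python
from itertools import combinations

def concordance(times, scores, events):
    """Fraction of comparable pairs whose scores are ordered like their times."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: ignore tied times
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # shorter time is censored, so the true order is unknown
        comparable += 1
        if scores[first] < scores[second]:
            concordant += 1
        elif scores[first] == scores[second]:
            concordant += 0.5  # tied scores count as half
    return concordant / comparable

# Hypothetical data: tenure, predicted score, and whether churn was observed
toy_times = [2, 4, 6, 8]
toy_scores = [1, 3, 2, 4]
toy_events = [True, True, False, True]
c_index = concordance(toy_times, toy_scores, toy_events)
```

Here five pairs are comparable and four are ordered correctly; the pair of customers with times 6 and 8 is dropped because the shorter of the two is censored.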

In [None]:
from lifelines.utils import concordance_index
concordance_index(df_dev['time'], -cph.predict_partial_hazard(df_dev), df_dev['status'])

## Checking your assumptions
When models don't perform well, it can be because the data does not satisfy their assumptions. Cox PH assumes that the hazards for any two individuals have the same shape, so that if you divide one by the other, the hazard ratio is constant.

In [None]:
cph.check_assumptions(df_train.loc[:, features_keep], p_value_threshold=0.05, show_plots=True)

In [None]:
assumptions_results = lifelines.statistics.proportional_hazard_test(
    cph, df_train.loc[:, features_keep], time_transform='rank'
)
assumptions_results.print_summary()

In [None]:
assumptions_results.p_value

In [None]:
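# Keep features whose test does not reject proportional hazards (p >= 0.05);
# a small p-value signals a violation of the assumption
features_selected = list(np.array(assumptions_results.name)[assumptions_results.p_value >= 0.05])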
features_selected.extend(['time', 'status'])

In [None]:
features_selected

In [None]:
cph2 = CoxPHFitter().fit(df_train.loc[:, features_selected],
                         duration_col='time',
                         event_col='status')

In [None]:
concordance_index(df_dev['time'], -cph2.predict_partial_hazard(df_dev), df_dev['status'])

In [None]:
cph2.print_summary()

In [None]:
import umap

In [None]:
?umap.UMAP

In [None]:
mapper = umap.UMAP(metric='hamming', n_neighbors=50)
mapper.fit(df_train[features_selected])

In [None]:
embedding = mapper.transform(df_train[features_selected])

In [None]:
df_embedding = pd.DataFrame(embedding, index=df_train.index)
df_embedding['status'] = df_train['status']

In [None]:
df_embedding

In [None]:
df_embedding.plot(kind='scatter', x=0, y=1)

In [None]:
import seaborn as sns

In [None]:
?sns.scatterplot

In [None]:
plt.figure(figsize=(16, 6))
sns.scatterplot(data=df_embedding, x=0, y=1, hue='status');
plt.show()

In [None]:
?pd.DataFrame.plot