# Aim
Illustrate survival analysis with a relevant worked example: the dataset of COVID-19 cases in South Korea.

# Conclusions

1. The sample is very small.
2. The median time to death is 4 days.
3. There are no significant differences between genders in the time to death. Males and females appear to be affected at the same pace. The proportion of females with the disease appears to be smaller. The sample size is very small.
4. The same holds for people of different ages. However, the number of people recorded is strongly affected by age, so there is potential selection bias. The sample size is very small.
5. People with previous diseases/conditions die sooner.


# Literature
## Key ideas
* **Survival Analysis**: Survival analysis is a branch of statistics for analyzing the expected duration of time until one or more events happen, such as death in biological organisms and failure in mechanical systems. This topic is called reliability theory or reliability analysis in engineering, duration analysis or duration modelling in economics, and event history analysis in sociology. Survival analysis attempts to answer questions such as: what is the proportion of a population which will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be taken into account? How do particular circumstances or characteristics increase or decrease the probability of survival? From https://en.wikipedia.org/wiki/Survival_analysis
* **Logrank test**: is a hypothesis test to compare the survival distributions of two samples. From  https://en.wikipedia.org/wiki/Logrank_test 
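
To make the idea of a survival function concrete, here is a minimal pure-Python sketch of the Kaplan-Meier (product-limit) estimator, which the notebook later obtains from lifelines. The function name `kaplan_meier` and the toy data are illustrative assumptions, not part of the Kaggle dataset or of any library.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct event time (product-limit estimator)."""
    deaths = Counter(t for t, e in zip(durations, observed) if e)
    leaving = Counter(durations)  # events and censored cases both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk subjects that survive time t
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Toy data (made up): durations in days, 1 = death observed, 0 = censored
durations = [2, 3, 3, 5, 8, 8, 10]
observed = [1, 1, 0, 1, 0, 1, 0]
for t, s in kaplan_meier(durations, observed):
    print(f"S({t}) = {s:.3f}")
```

Censored subjects never lower the curve themselves; they only shrink the number at risk for later event times.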
## Papers
1. Wu C, Chen X, Cai Y, et al. Risk Factors Associated With Acute Respiratory Distress Syndrome and Death in Patients With Coronavirus Disease 2019 Pneumonia in Wuhan, China. JAMA Intern Med. Published online March 13, 2020. doi:10.1001/jamainternmed.2020.0994
    * Particularly interesting because of the case
    
    <img src="../papers/Wu_et_al_2020.png" width="350" height="450" />
## Libraries
### R
### Python
* [Lifelines](https://lifelines.readthedocs.io/en/latest/Quickstart.html)

## Blogs

1. Overall view in Python:
    * https://towardsdatascience.com/survival-analysis-intuition-implementation-in-python-504fde4fcf8e
2. Examples in R and Python: https://plot.ly/python/v3/ipython-notebooks/survival-analysis-r-vs-python/

# Data
From [Kaggle](https://www.kaggle.com/kimjihoo/coronavirusdataset)

In [None]:
import os
import sys
from datetime import timedelta

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from tqdm import tqdm

# Install lifelines on the fly if it is missing from the environment
try:
    from lifelines.statistics import logrank_test
except ImportError:
    !conda install -c conda-forge lifelines=0.24.2 -y
    from lifelines.statistics import logrank_test

from lifelines import CoxPHFitter, KaplanMeierFitter, NelsonAalenFitter, WeibullFitter
from lifelines.utils import median_survival_times

%matplotlib inline



In [None]:
# where to save things
OUTPUT = 'kaggle/working/survival_analysis'
# Data from SK, https://www.kaggle.com/kimjihoo/coronavirusdataset#PatientInfo.csv
DATA_LOCATTION = '/kaggle/input/coronavirusdataset/PatientInfo.csv'
os.makedirs(OUTPUT,exist_ok=True)

In [None]:
for dirname, _, filenames in os.walk("/kaggle/input"):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
data = pd.read_csv(DATA_LOCATTION)
data = data[[col for col in data.columns if 'Unnamed' not in col]]
data.head().T

In [None]:
# This cell seems to take ages on Kaggle
data["sex"].value_counts().plot.bar()
plt.gcf()
plt.title('Sex')

In [None]:
"""Check numbers death, recovered and sick"""
print('Sample size: {}'.format(data.shape[0]))
print('Number of deceased: {}'.format((data['state'] == 'deceased').sum()))
print('Number of recovered: {}'.format((data['state'] == 'released').sum()))

# Features

In [None]:
data['confirmed_date'] = pd.to_datetime(data['confirmed_date'])
data['released_date'] = pd.to_datetime(data['released_date'])
data['symptom_onset_date'] = pd.to_datetime(data['symptom_onset_date'])
data['deceased_date'] = pd.to_datetime(data['deceased_date'])
#data['duraction_confirmed_death'] = data['released_date']-data['confirmed_date']
data['days_death_confirmed'] = data['deceased_date']-data['confirmed_date']
data['days_death_symptons'] = data['deceased_date']-data['symptom_onset_date']

For people who are still sick (isolated), we take the maximum date in the dataset as the end of the observation window, in order to calculate durations.
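
The censoring logic can be sketched on its own, assuming a fixed cut-off date. The helper `duration_and_event` and the constant `CENSOR_DATE` are hypothetical names used only for illustration, and the two patients are made up.

```python
from datetime import date

# Assumed last date covered by the records; still-sick patients are censored here
CENSOR_DATE = date(2020, 3, 20)


def duration_and_event(start, state, deceased_date=None, censor_date=CENSOR_DATE):
    """Return (duration in days, observed flag) for one patient."""
    if state == "deceased":
        return (deceased_date - start).days, 1
    # Still isolated: we only know the patient survived at least until censor_date
    return (censor_date - start).days, 0


print(duration_and_event(date(2020, 3, 1), "deceased", date(2020, 3, 5)))  # (4, 1)
print(duration_and_event(date(2020, 3, 10), "isolated"))  # (10, 0)
```

A censored duration is a lower bound on the true time to the event, which is why the survival estimators need the observed flag alongside the duration.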

In [None]:
max_date_on_dataset = max([data['released_date'].max(), data['deceased_date'].max()])
max_date_on_dataset

In [None]:
def create_starting_date(patient_data):
    """
    Create the starting date from the onset of the first symptoms,
    falling back to confirmed_date when it is not available
    """
    if pd.isnull(patient_data['symptom_onset_date']):
        return patient_data['confirmed_date']
    else:
        return patient_data['symptom_onset_date']
    
data['start_date'] = data.apply(lambda x: create_starting_date(x), axis = 1)
data['start_date'].describe()

In [None]:
def create_end_date(patient_data, max_date_on_dataset=pd.Timestamp('2020-03-20')):
    """
    Create the end date of the observation window: the censoring date
    for patients still isolated, the date of death for deceased patients
    and the release date for released patients.
    """
    if patient_data['state'] == 'isolated':
        return max_date_on_dataset
    elif patient_data['state'] == 'deceased':
        return patient_data['deceased_date']
    elif patient_data['state'] == 'released':
        return patient_data['released_date']
    else:
        return None

# Convert to datetime so that end_date - start_date yields a timedelta
data['end_date'] = pd.to_datetime(data.apply(create_end_date, axis=1))
data['end_date'].describe()

Let's have a look at the deceased cases.

In [None]:
deceseade_report = data[data['state'] == 'deceased'][['state','days_death_confirmed',
        'days_death_symptons','confirmed_date','released_date',
        'symptom_onset_date','deceased_date']]
deceseade_report.head()

In [None]:
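def disease_to_int(x):
    """Encode the disease flag as 1 (previous disease) or 0 (none or unknown)."""
    try:
        return int(x)
    except (ValueError, TypeError):
        return 0

data['disease_encoded'] = data['disease'].apply(disease_to_int)
#data['disease_encoded'].hist()
data['disease_encoded'].value_counts().plot.bar()
plt.gcf()
plt.title('Patients with previous diseases (1) vs none or unknown (0)')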

Everything makes sense. There are two cases for which we do not have any information, and there are people who died first and were confirmed later.

In [None]:
data[['state','days_death_confirmed',
        'days_death_symptons','confirmed_date','released_date',
        'symptom_onset_date','deceased_date']].head()

In [None]:
# Keep deceased patients (event observed) and isolated patients (censored)
data_fileted = data[(data['state']=='deceased')  |  (data['state']=='isolated')].copy()
#data_fileted = data[(data['state']=='deceased')].copy()
data_fileted['duration'] = data_fileted['end_date'] - data_fileted['start_date']
# observed = 1 if the death was observed, 0 if the case is censored
data_fileted['observed'] = (data_fileted['state']=='deceased').astype(int)
data_fileted.dropna(subset = ['observed','duration'],inplace = True)

In [None]:
data_fileted.head()

In [None]:
data_fileted[['duration','observed']].describe()

In [None]:
data_fileted['observed'].hist()

In [None]:
kmf = KaplanMeierFitter()
# Remove patients with a negative duration (e.g. confirmed only after death)
data_fileted = data_fileted[data_fileted["duration"]>= np.timedelta64(0,'D')].copy()
# Convert the durations from timedelta to an integer number of days
data_fileted["duration"] = data_fileted["duration"].dt.days

T = data_fileted["duration"]
E = data_fileted["observed"]

kmf.fit(T, event_observed=E)


Distribution of the durations (deaths and censored cases)

In [None]:
T.hist()
plt.gcf()
plt.title('Distribution of durations in days')

In [None]:
kmf.survival_function_.plot()
plt.title('Survival function');

In [None]:
kmf.plot()
plt.gcf()
plt.title('Survival Function')

In [None]:
kmf.median_survival_time_

In [None]:
median_ci = median_survival_times(kmf.confidence_interval_)
median_ci

In [None]:
# Auxiliary function: replace zero durations with a small positive number,
# since the parametric fitters used below need strictly positive durations
def add_small_number(x):
    if x==0:
        return x+0.01
    else:
        return x
data_fileted["duration"].apply(add_small_number).head()

In [None]:
from lifelines import KaplanMeierFitter
from lifelines import (WeibullFitter, ExponentialFitter,
LogNormalFitter, LogLogisticFitter, NelsonAalenFitter,
PiecewiseExponentialFitter, GeneralizedGammaFitter, SplineFitter)

fig, axes = plt.subplots(3, 3, figsize=(10, 7.5))

T = data_fileted["duration"].apply(add_small_number)
E = data_fileted["observed"]


kmf = KaplanMeierFitter().fit(T, E, label='KaplanMeierFitter')
wbf = WeibullFitter().fit(T, E, label='WeibullFitter')
exf = ExponentialFitter().fit(T, E, label='ExponentialFitter')
lnf = LogNormalFitter().fit(T, E, label='LogNormalFitter')
llf = LogLogisticFitter().fit(T, E, label='LogLogisticFitter')
pwf = PiecewiseExponentialFitter([40, 60]).fit(T, E, label='PiecewiseExponentialFitter')
gg = GeneralizedGammaFitter().fit(T, E, label='GeneralizedGammaFitter')
spf = SplineFitter([6, 20, 40, 75]).fit(T, E, label='SplineFitter')

wbf.plot_survival_function(ax=axes[0][0])
exf.plot_survival_function(ax=axes[0][1])
lnf.plot_survival_function(ax=axes[0][2])
kmf.plot_survival_function(ax=axes[1][0])
llf.plot_survival_function(ax=axes[1][1])
pwf.plot_survival_function(ax=axes[1][2])
gg.plot_survival_function(ax=axes[2][0])
spf.plot_survival_function(ax=axes[2][1])
fig.suptitle('Comparison of Parametric and Non-Parametric Curves', fontsize=16)


A mean of four days and the median both suggest that day 4 is the point by which most people die.
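
As a rough illustration of how a median survival time is read off a step curve, the sketch below returns the first time at which the estimated survival drops to 0.5 or below. The function `median_survival_time` and the curve values are hypothetical and are not the numbers fitted in this notebook.

```python
def median_survival_time(curve):
    """First time at which the survival estimate drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return float("inf")  # the curve never reaches 0.5


# Hypothetical step curve: (time in days, estimated survival probability)
curve = [(1, 0.90), (2, 0.75), (4, 0.50), (7, 0.30)]
print(median_survival_time(curve))  # 4
```

With heavy censoring the curve may never reach 0.5, in which case the median is undefined and the sketch returns infinity.
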
# Are there significant gender differences?
We check this with KM curves and a log-rank test. Later, a Cox regression revisits the question while controlling for age.

In [None]:
ax = plt.subplot(111)
dem = (data_fileted["sex"] == "female")
kmf.fit(T[dem], event_observed=E[dem], label="Females")
kmf.plot(ax=ax)

kmf.fit(T[~dem], event_observed=E[~dem], label="Males")
kmf.plot(ax=ax)

plt.ylim(0, 1);
plt.title("Survival function by gender");

In [None]:
data_fileted["sex"].value_counts().plot.bar()
plt.gcf()
plt.title('Gender Distribution')

The graph suggests we cannot really distinguish by gender.
Let's check using a test (`logrank_test`). We do not find that the curves are statistically different; see the p-value below.
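
For intuition about what the test computes, here is a minimal pure-Python sketch of the two-sample log-rank statistic: at each event time it compares the deaths observed in group A with the number expected if both groups shared one survival curve. The function `logrank_p_value` and the toy groups are assumptions for illustration; the notebook itself relies on `logrank_test` from lifelines.

```python
import math


def logrank_p_value(durations_a, events_a, durations_b, events_b):
    """Two-sample log-rank test; returns (chi-square statistic, p-value)."""
    event_times = sorted(
        {t for t, e in zip(durations_a, events_a) if e}
        | {t for t, e in zip(durations_b, events_b) if e}
    )
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for d in durations_a if d >= t)  # at risk in group A
        n_b = sum(1 for d in durations_b if d >= t)  # at risk in group B
        d_a = sum(1 for d, e in zip(durations_a, events_a) if e and d == t)
        d_b = sum(1 for d, e in zip(durations_b, events_b) if e and d == t)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the deaths in group A at time t
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Toy groups (made up): every duration is an observed death
early = [1, 2, 3, 4, 5]
late = [10, 11, 12, 13, 14]
print(logrank_p_value(early, [1] * 5, late, [1] * 5))
```

A small p-value indicates that the two survival curves differ more than chance alone would explain; two identical groups give a statistic of zero.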

In [None]:
results = logrank_test(T[dem], T[~dem], E[dem], E[~dem], alpha=.99)
results.print_summary()

# People with previous diseases

In [None]:
data_fileted['disease_encoded'].value_counts().plot.bar()
plt.gcf()
plt.title('People with previous diseases (1), in deceased and isolated cases')

In [None]:

regime_types = [0,1]
labels =['No previous diseases', 'Previous disease'] 

for i in regime_types:
    ax = plt.subplot(2, 2, i + 1)

    ix = data_fileted['disease_encoded'] == i
    kmf.fit(T[ix], E[ix], label=labels[i])
    kmf.plot(ax=ax, legend=False)

    plt.title(labels[i])
    plt.xlim(0, 8)

    if i==0:
        plt.ylabel('Frac. alive after $n$ days')
# Use a figure-level title so the per-panel titles are not overwritten
plt.suptitle('Survival by previous disease')
plt.tight_layout()

In [None]:
dem = (data_fileted["disease_encoded"] == 1)
results = logrank_test(T[dem], T[~dem], E[dem], E[~dem], alpha=.99)
results.print_summary()

Significant differences at the 95% confidence level.

# Age differences

In [None]:
#regime_types = data_fileted['age'].unique()
regime_types = ['50s', '60s', '70s', '80s']

for i, regime_type in enumerate(regime_types):
    ax = plt.subplot(2, 2, i + 1)

    ix = data_fileted['age'] == regime_type
    kmf.fit(T[ix], E[ix], label=regime_type)
    kmf.plot(ax=ax, legend=False)

    plt.title(regime_type)
    plt.xlim(0, 8)

    if i==0:
        plt.ylabel('Frac. alive after $n$ days')

plt.tight_layout()

In [None]:
for i in regime_types:
    print('comparing age group {} vs all others'.format(i))
    dem = (data_fileted["age"] == i)
    results = logrank_test(T[dem], T[~dem], E[dem], E[~dem], alpha=.99)
    results.print_summary()

The groups are not significantly different, probably because of the small sample size.

In [None]:
dem_50s = (data_fileted["age"] == '50s')
dem_80s = (data_fileted["age"] == '80s')
results = logrank_test(T[dem_50s], T[dem_80s], E[dem_50s], E[dem_80s], alpha=.99)
results.print_summary()

# [Estimating hazard rates using Nelson-Aalen](https://lifelines.readthedocs.io/en/latest/Survival%20analysis%20with%20lifelines.html#estimating-hazard-rates-using-nelson-aalen)
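
As a sketch of what this estimator does, the cumulative hazard in its basic form is the running sum, over event times, of the deaths divided by the number still at risk. The function `nelson_aalen` and the toy data are illustrative assumptions; lifelines may handle tied event times slightly differently.

```python
from collections import Counter


def nelson_aalen(durations, observed):
    """Return [(t, H(t))]: the running sum of deaths / number at risk."""
    deaths = Counter(t for t, e in zip(durations, observed) if e)
    leaving = Counter(durations)  # events and censored cases both leave the risk set
    at_risk = len(durations)
    cumulative_hazard = 0.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            cumulative_hazard += d / at_risk
            curve.append((t, cumulative_hazard))
        at_risk -= leaving[t]
    return curve


# Toy data (made up): durations in days, 1 = death observed, 0 = censored
for t, h in nelson_aalen([1, 2, 2, 4], [1, 1, 0, 1]):
    print(f"H({t}) = {h:.3f}")
```

Unlike the survival function, the cumulative hazard is not bounded by 1; its slope indicates how quickly events are accumulating.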

In [None]:
naf = NelsonAalenFitter()
naf.fit(T,event_observed=E)

In [None]:
print(naf.cumulative_hazard_.head())
naf.plot()
plt.gcf()
plt.title('Cumulative Hazard')

In [None]:
for i, regime_type in enumerate(regime_types):
    print(i, regime_type)

In [None]:
#regime_types = data_fileted['age'].unique()
regime_types = ['50s', '60s', '70s', '80s']

for i, regime_type in enumerate(regime_types):
    ax = plt.subplot(2, 2, i + 1)

    ix = data_fileted['age'] == regime_type
    naf.fit(T[ix],event_observed=E[ix], label=regime_type)
    naf.plot(ax=ax, legend=False)

    plt.title(regime_type)
    plt.xlim(0, 8)

    if i==0:
        plt.ylabel('Cumulative hazard function')

plt.tight_layout()

# Weibull Parametric Fitting

In [None]:
data_fileted.columns

In [None]:
# Check that duration and observed are numeric before fitting
data_fileted[['duration','observed']].dtypes

In [None]:
def gender_encode(x):
    if x=='female':
        return 1
    elif x=='male':
        return 0
    else:
        return x
    
data_fileted['sex_encoded']= data_fileted['sex'].apply(gender_encode)
#data_fileted['sex_encoded'].hist()
data_fileted['sex_encoded'].value_counts().plot.bar()
plt.gcf()
plt.title('Gender distribution, female (1) male (0)')

In [None]:

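def remove_final_s(x):
    """Turn an age band such as '50s' into the integer 50; leave other values as-is."""
    try:
        return int(x.replace('s',''))
    except (AttributeError, ValueError):
        return x  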
    
data_fileted['age_int']=data_fileted['age'].apply(remove_final_s)

## Basics of the Cox proportional hazards model
The purpose of the model is to evaluate simultaneously the effect of several factors on survival. In other words, it allows us to examine how specified factors influence the rate of a particular event happening (e.g., infection, death) at a particular point in time.
Documentation: https://lifelines.readthedocs.io/en/latest/fitters/regression/CoxPHFitter.html

A really good explanation: http://www.sthda.com/english/wiki/cox-proportional-hazards-model
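
To illustrate how Cox coefficients are read, the sketch below computes the hazard of one subject relative to another. Under the model the hazard is a baseline hazard multiplied by the exponential of a linear combination of the covariates, so the baseline cancels in the ratio. The function `hazard_ratio` and the coefficient values are hypothetical; they are not the values fitted on the Kaggle data.

```python
import math


def hazard_ratio(coefficients, covariates_a, covariates_b):
    """Hazard of subject A relative to subject B under a Cox model.

    The baseline hazard cancels, leaving exp(beta . (x_a - x_b)).
    """
    linear_difference = sum(
        coefficients[name] * (covariates_a[name] - covariates_b[name])
        for name in coefficients
    )
    return math.exp(linear_difference)


# Hypothetical coefficients, NOT fitted on the Kaggle data
betas = {"age_int": 0.03, "sex_encoded": -0.4}
older = {"age_int": 80, "sex_encoded": 0}
younger = {"age_int": 50, "sex_encoded": 0}
print(round(hazard_ratio(betas, older, younger), 2))  # 2.46
```

A ratio above 1 means the first subject faces a higher instantaneous risk at every point in time, which is the proportional hazards assumption.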

In [None]:
   
cph = CoxPHFitter()
data_fileted = data_fileted.dropna(subset = ['sex_encoded','age_int','duration','observed'])
cph.fit(data_fileted[['age_int','duration','observed','sex_encoded']], duration_col='duration', event_col='observed')

cph.plot_covariate_groups('age_int', [30, 40, 50, 60, 70, 80, 90], cmap='coolwarm')
plt.gcf()
plt.title('Survival curve by age group')

In [None]:
cph.plot_covariate_groups('sex_encoded', [1, 0], cmap='coolwarm')
plt.gcf()
plt.title('Survival Curve by Gender. Female 1. Male 0')

It looks like the disease could progress faster in younger people, potentially because older people may have gone to the doctor earlier. To understand this better, let's look at the details. The variable may not be significant, since its p-value is 0.19.

In [None]:
cph.print_summary()  # access the results using cph.summary

In [None]:
# We can see that the sex variable is not very useful by plotting the coefficients
cph.plot();

In [None]:
data_fileted['age'].value_counts().plot.bar()

# Future work
1. Consider the uncertainty in the start date of each case.