# Survival Analysis — Full Reproducible Notebook

This notebook reproduces the `run_analysis_full.py` script in an interactive format. It performs EDA, Kaplan–Meier estimation, Cox proportional hazards modeling, proportional-hazards testing, and saves text deliverables. Run cells sequentially. If `lifelines` is not installed, install it using `pip install lifelines`.

**Files expected:** `simulated_survival_data.csv` (placed in the same folder as this notebook).


In [None]:
# Setup: imports and file paths
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

DATA_PATH = Path('simulated_survival_data.csv')
OUTDIR = Path('results_notebook')
OUTDIR.mkdir(exist_ok=True)

print('Data path:', DATA_PATH.resolve())
print('Output dir:', OUTDIR.resolve())

# If lifelines isn't installed, uncomment the following line and run it:
# !pip install lifelines


In [None]:
# Load data
if not DATA_PATH.exists():
    raise FileNotFoundError(f"{DATA_PATH} not found. Put simulated_survival_data.csv in the notebook directory.")
df = pd.read_csv(DATA_PATH)
df.head()

## Exploratory Data Analysis (EDA)
Descriptive statistics, missing values, event distribution, and key summaries.

In [None]:
# EDA
desc = df.describe(include='all').T
missing = df.isnull().sum()
events = df['event'].value_counts()
event_rate = df['event'].mean()

print('Shape:', df.shape)
print('\nMissing values:\n', missing)
print('\nEvent counts:\n', events.to_string())
print('\nEvent rate: {:.2%}'.format(event_rate))

# Save EDA summary to file
eda_text = []
eda_text.append(f"Dataset shape: {df.shape[0]} rows, {df.shape[1]} columns\n")
eda_text.append('Column types:\n' + df.dtypes.to_string() + '\n\n')
eda_text.append('Missing values:\n' + missing.to_string() + '\n\n')
eda_text.append('Descriptive statistics (numeric):\n' + df.describe().to_string() + '\n\n')
eda_text.append('Event distribution:\n' + events.to_string() + '\n\n')
eda_text.append(f'Event rate: {event_rate:.4f}\n')
with open(OUTDIR / 'eda_summary.txt', 'w') as f:
    f.write('\n'.join(eda_text))
print('\nEDA summary saved to', OUTDIR / 'eda_summary.txt')

## Kaplan–Meier Estimation
Fit a Kaplan–Meier curve for overall survival and report survival probabilities.
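
As a rough illustration of what the fitter computes, the product-limit estimator can be sketched in plain Python. The function name `kaplan_meier` and the toy data below are hypothetical and are not part of `run_analysis_full.py`; the sketch assumes `events` uses 1 for an observed event and 0 for censoring, matching the `event` column convention used in this notebook.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimator)."""
    pairs = sorted(zip(times, events))
    n_at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share this observed time
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Multiply by the conditional probability of surviving past t
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Events and censored subjects both leave the risk set after t
        n_at_risk -= removed
    return curve


# Toy data (illustrative only, not taken from simulated_survival_data.csv)
toy_times = [2, 3, 3, 5, 8, 9]
toy_events = [1, 1, 0, 1, 0, 1]
km_curve = kaplan_meier(toy_times, toy_events)
for t, s in km_curve:
    print(f"t={t}: S(t)={s:.4f}")
```

Censored subjects never lower the curve directly; they only shrink the risk set used at later event times, which is why the step at `t=5` in the toy data is larger than the step at `t=2`.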

In [None]:
# Kaplan-Meier
try:
    from lifelines import KaplanMeierFitter
except ImportError as e:
    raise ImportError('lifelines not installed. Run `pip install lifelines` and re-run this cell.') from e

kmf = KaplanMeierFitter()
kmf.fit(df['time'], event_observed=df['event'], label='Overall')
ax = kmf.plot_survival_function()
ax.set_title('Kaplan-Meier Survival Curve')
ax.set_xlabel('Time')
ax.set_ylabel('Survival Probability')
plt.tight_layout()
plt.savefig(OUTDIR / 'km_plot.png', dpi=150)
plt.show()

# Save KM estimates and survival at selected times
times_to_report = [1,3,6,12,24,36]
with open(OUTDIR / 'km_estimates.txt', 'w') as f:
    f.write('Kaplan-Meier estimates (summary)\n')
    f.write('='*60 + '\n\n')
    f.write(str(kmf.event_table.head(200)) + '\n\n')
    f.write('Survival probabilities at times (units same as time column):\n')
    for t in times_to_report:
        f.write(f'  time {t}: {kmf.predict(t):.4f}\n')
print('KM estimates saved to', OUTDIR / 'km_estimates.txt')

## Cox Proportional Hazards Model
One-hot encode `treatment`, fit the Cox model on `age`, `sex`, `biomarker`, and `treatment_B`, and save the summary.
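
For orientation, the Cox model has the standard form below. The symbols $h_0(t)$ (baseline hazard), $x_1, \dots, x_p$ (covariates, with $p = 4$ in this notebook), and $\beta_j$ (log hazard ratios) are the usual textbook notation and do not appear in the script itself.

$$
h(t \mid x) = h_0(t)\,\exp\left(\beta_1 x_1 + \dots + \beta_p x_p\right)
$$

Raising covariate $x_j$ by one unit while holding the others fixed multiplies the hazard by a constant factor:

$$
\mathrm{HR}_j = \exp(\beta_j)
$$

This ratio does not depend on $t$, which is the proportional hazards assumption examined later in the notebook.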

In [None]:
# Cox PH model
from lifelines import CoxPHFitter
df2 = df.copy()
# Ensure correct dtypes
df2['sex'] = df2['sex'].astype(int)
# dtype=int keeps the dummy numeric (pandas >= 2.0 returns booleans by default)
df2 = pd.get_dummies(df2, columns=['treatment'], drop_first=True, dtype=int)  # treatment_B created

covariates = ['age', 'sex', 'biomarker', 'treatment_B']
cph = CoxPHFitter()
cph.fit(df2[['time','event'] + covariates], duration_col='time', event_col='event', show_progress=True)
cph.print_summary()

# Save summary to text file
with open(OUTDIR / 'cox_summary.txt', 'w') as f:
    f.write('Cox Proportional Hazards Model Summary\n')
    f.write('='*80 + '\n\n')
    f.write(cph.summary.to_string())
print('Cox summary saved to', OUTDIR / 'cox_summary.txt')

## Proportional Hazards (PH) Test
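Use lifelines' `proportional_hazard_test` to test the PH assumption for each covariate. The test is based on scaled Schoenfeld residuals; the cell below uses the `rank` time transform.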
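
To give a feel for the quantity behind this test, here is a minimal sketch of unscaled Schoenfeld residuals for a single covariate. The function name `schoenfeld_residuals`, the toy data, and the value of `beta` are hypothetical; the sketch assumes all observed times are distinct (no ties) and is not how lifelines implements the test internally.

```python
import math


def schoenfeld_residuals(times, events, x, beta):
    """Unscaled Schoenfeld residuals for one covariate, assuming no tied times."""
    residuals = []
    order = sorted(range(len(times)), key=lambda i: times[i])
    for pos, i in enumerate(order):
        if not events[i]:
            continue  # censored subjects contribute no residual
        # Risk set: everyone whose observed time is >= times[i]
        risk_set = order[pos:]
        weights = [math.exp(beta * x[j]) for j in risk_set]
        # Expected covariate value in the risk set under the Cox model
        expected_x = sum(w * x[j] for w, j in zip(weights, risk_set)) / sum(weights)
        residuals.append((times[i], x[i] - expected_x))
    return residuals


# Toy data (illustrative only); beta = 0.0 is an assumed coefficient
toy_times = [1, 2, 3, 4]
toy_events = [1, 1, 0, 1]
toy_x = [1, 0, 1, 0]
toy_residuals = schoenfeld_residuals(toy_times, toy_events, toy_x, beta=0.0)
for t, r in toy_residuals:
    print(f"t={t}: residual={r:+.4f}")
```

If proportional hazards holds, these residuals should show no systematic trend against (transformed) time; a clear trend for a covariate suggests its effect changes over follow-up.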

In [None]:
# PH test
from lifelines.statistics import proportional_hazard_test
# Pass the same columns that were used to fit the model
results = proportional_hazard_test(cph, df2[['time', 'event'] + covariates], time_transform='rank')
print(results.summary)
with open(OUTDIR / 'ph_test.txt', 'w') as f:
    f.write('Proportional Hazards Test (lifelines)\n')
    f.write('='*80 + '\n\n')
    f.write(str(results.summary))
print('PH test saved to', OUTDIR / 'ph_test.txt')

## Interpretation & Writing Deliverables
The following cell writes an interpretation template with example text. After running the notebook, replace the placeholders with the actual values from the results.
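
When replacing the hazard-ratio placeholders, it can help to see how a hazard ratio and its Wald 95% confidence interval follow from a coefficient and its standard error. The sketch below uses a hypothetical helper `hazard_ratio_with_ci` and made-up input values; the real numbers should be read from `cox_summary.txt`.

```python
import math


def hazard_ratio_with_ci(coef, se, z=1.96):
    """Return (HR, lower, upper) using a normal-approximation (Wald) interval."""
    hr = math.exp(coef)
    lower = math.exp(coef - z * se)
    upper = math.exp(coef + z * se)
    return hr, lower, upper


# Illustrative values only; these are NOT results from this notebook
example_coef, example_se = -0.5, 0.2
hr, lower, upper = hazard_ratio_with_ci(example_coef, example_se)
print(f"HR={hr:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

An interval that excludes 1 corresponds to a coefficient whose Wald test is significant at roughly the 5% level.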

In [None]:
# Write final interpretation template based on results
template = f"""Final interpretation (fill with values from results):

- EDA: {df.shape[0]} subjects. Event rate = {df['event'].mean():.3f}.
- Kaplan-Meier: check results in {OUTDIR / 'km_estimates.txt'} and plot {OUTDIR / 'km_plot.png'} for survival probabilities.
- Cox PH: see {OUTDIR / 'cox_summary.txt'} for hazard ratios, 95% CIs, and p-values for covariates.
- PH test: see {OUTDIR / 'ph_test.txt'}. If PH assumption violated, consider stratified Cox or time-varying covariates.
- Example conclusion: 'Treatment B was associated with HR=..., 95%CI=..., p=... (replace with actual numbers).'
"""
with open(OUTDIR / 'final_interpretation.txt', 'w') as f:
    f.write(template)
print('Final interpretation template saved to', OUTDIR / 'final_interpretation.txt')
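print('\nExample of Cox summary head:')
cox_summary_path = OUTDIR / 'cox_summary.txt'
if cox_summary_path.exists():
    # Print the first 20 lines as text rather than as a Python list
    print('\n'.join(cox_summary_path.read_text().splitlines()[:20]))
else:
    print('Run Cox cell to generate summary.')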