# Survival Analysis of Full Cohort 

**This notebook analyzes median overall survival and hazard ratios in patients with advanced or metastatic urothelial cancer. The analysis compares patients receiving first-line checkpoint inhibitor monotherapy (pembrolizumab [KEYNOTE-361] or atezolizumab [IMvigor130]) with those receiving first-line chemotherapy.**

In [1]:
import numpy as np
import pandas as pd

from flatiron_cleaner import DataProcessorUrothelial

## Patient Censoring 

In [2]:
cohort = pd.read_csv('../outputs/full_cohort.csv')

In [10]:
cohort.shape

(6461, 3)

In [3]:
cohort.sample(3)

,PatientID,LineName,StartDate
840,FF79CB6948199,Pembrolizumab,2018-12-04
3116,FF98BE15B7014,chemo,2012-10-22
5670,F28BC76726153,chemo,2019-02-04


In [5]:
processor = DataProcessorUrothelial()

# Process Enhanced_Mortality_V2.csv and use visit, telemedicine, biomarkers, oral, and progression data to determine censoring date 
mortality_df = processor.process_mortality(file_path = '../data/Enhanced_Mortality_V2.csv',
                                           index_date_df = cohort, 
                                           index_date_column = 'StartDate',
                                           visit_path = '../data/Visit.csv', 
                                           telemedicine_path = '../data/Telemedicine.csv', 
                                           biomarkers_path = '../data/Enhanced_AdvUrothelialBiomarkers.csv', 
                                           oral_path = '../data/Enhanced_AdvUrothelial_Orals.csv',
                                           progression_path = '../data/Enhanced_AdvUrothelial_Progression.csv',
                                           drop_dates = False)

2025-04-10 14:27:26,748 - INFO - Successfully read Enhanced_Mortality_V2.csv file with shape: (9040, 2) and unique PatientIDs: 9040
2025-04-10 14:27:26,764 - INFO - Successfully merged Enhanced_Mortality_V2.csv df with index_date_df resulting in shape: (6461, 3) and unique PatientIDs: 6461
2025-04-10 14:27:27,173 - INFO - The following columns ['last_visit_date', 'last_biomarker_date', 'last_oral_date', 'last_progression_date'] are used to calculate the last EHR date
2025-04-10 14:27:27,179 - INFO - Successfully processed Enhanced_Mortality_V2.csv file with final shape: (6461, 6) and unique PatientIDs: 6461. There are 0 out of 6461 patients with missing duration values
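
The log lines above suggest that the censoring date comes from the latest recorded EHR activity. A minimal sketch of that general idea on toy data follows. It is not the actual implementation in `flatiron_cleaner`; the data are invented, and the `death_date` and `last_ehr_date` column names are hypothetical.

```python
import pandas as pd

# Toy data for illustration only; values are not from the Flatiron files.
toy = pd.DataFrame({
    "PatientID": ["A", "B"],
    "StartDate": pd.to_datetime(["2018-01-01", "2019-06-01"]),
    "death_date": pd.to_datetime(["2018-12-31", None]),  # hypothetical column name
    "last_visit_date": pd.to_datetime(["2018-12-01", "2020-01-15"]),
    "last_oral_date": pd.to_datetime(["2018-11-15", "2020-03-01"]),
})

# Last recorded EHR activity: row-wise maximum of the activity dates.
toy["last_ehr_date"] = toy[["last_visit_date", "last_oral_date"]].max(axis=1)

# event = 1 if a death date exists; otherwise the patient is censored (0).
toy["event"] = toy["death_date"].notna().astype(int)

# Follow-up ends at death if observed, else at the last EHR activity date.
end_date = toy["death_date"].fillna(toy["last_ehr_date"])
toy["duration"] = (end_date - toy["StartDate"]).dt.days

print(toy[["PatientID", "event", "duration"]])
```

Patient A dies before any later activity and is counted as an event; patient B has no death date and is censored at the last oral-therapy record.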


In [9]:
mortality_df.shape

(6461, 6)

In [12]:
df = pd.merge(mortality_df, cohort[['PatientID', 'LineName']], on = 'PatientID', how = 'left')

In [14]:
df.shape

(6461, 7)
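
The unchanged row count (6461) indicates that the merge did not duplicate patients. As a hedged sketch, the same check can be built into the merge itself with the `validate` argument of `pd.merge`; the toy frames below are invented stand-ins for `mortality_df` and `cohort`.

```python
import pandas as pd

# Toy stand-ins for mortality_df and cohort (values are illustrative).
left = pd.DataFrame({"PatientID": ["A", "B", "C"], "event": [1, 0, 1]})
right = pd.DataFrame({
    "PatientID": ["A", "B", "C"],
    "LineName": ["chemo", "Pembrolizumab", "chemo"],
})

# validate="one_to_one" raises pandas.errors.MergeError if PatientID
# is duplicated in either frame, instead of silently adding rows.
merged = pd.merge(left, right, on="PatientID", how="left", validate="one_to_one")

# Every patient should keep exactly one row and receive a LineName.
assert len(merged) == len(left)
assert merged["LineName"].notna().all()
print(merged.shape)
```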

## Power Calculation
**Calculated here: https://sample-size.net/sample-size-survival-analysis/**

In [16]:
# Proportion exposed
df.query('LineName != "chemo"').shape[0]/df.shape[0]

0.33539699736882833

In [20]:
# Number of deaths total
df.query('event == 1').shape[0]

4501

In [22]:
# Baseline event rate for chemotherapy group
chemo_df = df[df['LineName'] == 'chemo'] 
num_events = chemo_df['event'].sum()

# Total person-time (sum of follow-up times, regardless of censoring)
total_person_time = chemo_df['duration'].sum()

# Crude event rate (deaths per patient-day, since durations are in days)
event_rate = num_events / total_person_time

print(event_rate)

0.0011452952547330413
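
As a rough plausibility check on this rate, a constant-hazard (exponential) model implies a median survival of ln(2) divided by the rate. The sketch below hard-codes the rate printed above and assumes durations are measured in days. The exponential assumption is a simplification introduced here, not something the notebook establishes.

```python
import math

# Crude chemotherapy event rate printed above (deaths per patient-day).
event_rate = 0.0011452952547330413

# Assumption: constant hazard, so S(t) = exp(-rate * t)
# and the median survival time is ln(2) / rate.
median_days = math.log(2) / event_rate
median_months = median_days / 30.44  # average number of days per month

print(f"Implied median survival: {median_days:.0f} days (~{median_months:.1f} months)")
```

A Kaplan-Meier estimate is the appropriate way to report median survival; this value only helps confirm that the crude rate is of a sensible magnitude.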


In [23]:
# Censoring rate for the entire dataset (assumed equal in both treatment groups)
num_censored = (df['event'] == 0).sum()

# Total person-time (same as for event rate)
total_person_time = df['duration'].sum()

# Censoring rate (censored patients per patient-day)
censoring_rate = num_censored / total_person_time

print(censoring_rate)

0.0005660352293395697


In [24]:
# Mean observed follow-up in days, used as the planned average length of follow-up
average_follow_up = df['duration'].mean()

print(average_follow_up)

535.9359232316979


**Deaths and sample sizes needed by HR, given a proportion of 0.33 in the treatment group, a two-tailed alpha of 0.05, and a beta of 0.2 (80% power):**
* **HR 0.7: 277 deaths (252 patients in checkpoint and 498 in chemotherapy with total of 750)**
* **HR 0.8: 708 deaths (624 patients in checkpoint and 1236 in chemotherapy with total of 1860)**
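
The death counts above can be cross-checked with Schoenfeld's approximation for the number of events in a two-group survival comparison, d = (z_{1-α/2} + z_{1-β})² / (p(1-p)(ln HR)²). The sketch below is an independent check and not necessarily the method the linked calculator uses; `required_deaths` is a hypothetical helper name. With the unrounded exposed proportion computed earlier, it yields 277 and 708 deaths.

```python
import math
from statistics import NormalDist


def required_deaths(hazard_ratio, prop_exposed, alpha=0.05, power=0.80):
    """Schoenfeld's approximation for the number of deaths needed."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed alpha
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    d = (z_alpha + z_beta) ** 2 / (
        prop_exposed * (1 - prop_exposed) * math.log(hazard_ratio) ** 2
    )
    return math.ceil(d)  # round up to whole deaths


# Proportion of patients on checkpoint inhibitors, as computed above.
prop_exposed = 0.33539699736882833

for hr in (0.7, 0.8):
    print(f"HR {hr}: {required_deaths(hr, prop_exposed)} deaths")
```

The cohort contains 4501 deaths, well above both requirements.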