Data Cleaning

This notebook prepares Flatiron Health CSV files for patients with advanced urothelial cancer treated with first-line checkpoint inhibitors or chemotherapy. It builds on the previous data_cleaning_trial_patients file, which used the flatiron_cleaner package. The cleaned dataframes are merged into a single dataset that serves as the input for unsupervised clustering to identify clinically or biologically meaningful subgroups. Two variables not provided by the flatiron_cleaner package, neutrophil count and lymphocyte count, are added manually.

In [1]:
## Import packages
import numpy as np
import pandas as pd
import flatiron_cleaner

from flatiron_cleaner import DataProcessorUrothelial
from flatiron_cleaner import merge_dataframes

In [2]:
df = pd.read_csv('../outputs/full_cohort.csv')

In [3]:
df.head(5)

Unnamed: 0,PatientID,LineName,StartDate
0,F5AAF96C85477,Pembrolizumab,2021-07-08
1,F788831A66E9A,Pembrolizumab,2023-02-22
2,F75847DF35E43,Atezolizumab,2019-04-25
3,F6E944C1709E6,Pembrolizumab,2020-08-12
4,F75087BE5F959,Pembrolizumab,2020-09-09


In [4]:
df.shape

(6461, 3)

In [5]:
ids = df.PatientID.to_list()
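
Every downstream step assumes one row, and one StartDate, per PatientID. A minimal sketch of a guard for that assumption follows; the toy `cohort` frame and the helper name `unique_patient_ids` are hypothetical and not part of flatiron_cleaner.

```python
import pandas as pd

# Hypothetical toy cohort; the real notebook reads '../outputs/full_cohort.csv'.
cohort = pd.DataFrame({
    "PatientID": ["A1", "B2", "C3"],
    "StartDate": ["2021-07-08", "2023-02-22", "2019-04-25"],
})

def unique_patient_ids(frame: pd.DataFrame) -> list:
    """Return PatientIDs as a list, raising if any ID appears more than once."""
    duplicated = frame.loc[frame["PatientID"].duplicated(), "PatientID"].unique()
    if len(duplicated) > 0:
        raise ValueError(f"Duplicate PatientIDs: {list(duplicated)}")
    return frame["PatientID"].to_list()

ids_checked = unique_patient_ids(cohort)
```

Failing early here is cheaper than discovering duplicated rows after the per-patient merges.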

## Clean CSV files 

In [6]:
# Initialize class 
processor = DataProcessorUrothelial()

### Process Enhanced_AdvUrothelial.csv

In [7]:
enhanced_df = processor.process_enhanced(file_path = '../data/Enhanced_AdvUrothelial.csv',
                                         patient_ids = ids)

2025-05-03 14:11:01,556 - INFO - Successfully read Enhanced_AdvUrothelial.csv file with shape: (13129, 13) and unique PatientIDs: 13129
2025-05-03 14:11:01,556 - INFO - Filtering for 6461 specific PatientIDs
2025-05-03 14:11:01,564 - INFO - Successfully filtered Enhanced_AdvUrothelial.csv file with shape: (6461, 13) and unique PatientIDs: 6461
2025-05-03 14:11:01,596 - INFO - Successfully processed Enhanced_AdvUrothelial.csv file with final shape: (6461, 13) and unique PatientIDs: 6461


In [8]:
enhanced_df['SmokingStatus'] = enhanced_df['SmokingStatus'].map({
    'History of smoking': 1,
    'No history of smoking': 0,
    'Unknown/not documented': 0
})
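
Note that `Series.map` with a dictionary returns NaN for any value that is not a key. A minimal sketch of a check for unmapped categories, using a toy Series in which the label 'Former smoker' is a made-up value for illustration:

```python
import pandas as pd

# Toy data; 'Former smoker' is a hypothetical label that is absent from the mapping.
smoking = pd.Series([
    "History of smoking",
    "No history of smoking",
    "Unknown/not documented",
    "Former smoker",
])

mapping = {
    "History of smoking": 1,
    "No history of smoking": 0,
    "Unknown/not documented": 0,
}

mapped = smoking.map(mapping)

# Values that were present in the input but became NaN after mapping.
unmapped = smoking[mapped.isna() & smoking.notna()].unique().tolist()
```

An empty `unmapped` list indicates that the dictionary covers every category present in the column.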

In [9]:
enhanced_df['GroupStage_mod'] = enhanced_df["GroupStage_mod"].map({
    '0': '0-II',
    'I': '0-II',
    'II': '0-II',
    'III': 'III',
    'IV': 'IV',
    'unknown': 'unknown'
})

enhanced_df['GroupStage_mod'] = enhanced_df['GroupStage_mod'].astype('category')

In [10]:
enhanced_df['PrimarySite_lower'] = enhanced_df['PrimarySite'].isin(['Bladder', 'Urethra']).astype('int64')
enhanced_df = enhanced_df.drop(columns = ['PrimarySite'])

In [11]:
enhanced_df['days_diagnosis_to_adv'] = enhanced_df['days_diagnosis_to_adv'].fillna(0)

### Process Demographics.csv 

In [12]:
demographics_df = processor.process_demographics(file_path = '../data/Demographics.csv',
                                                 index_date_df = df,
                                                 index_date_column = 'StartDate')

2025-05-03 14:11:10,794 - INFO - Successfully read Demographics.csv file with shape: (13129, 6) and unique PatientIDs: 13129
2025-05-03 14:11:10,835 - INFO - Successfully processed Demographics.csv file with final shape: (6461, 6) and unique PatientIDs: 6461


### Process Enhanced_AdvUrothelialBiomarkers.csv

In [13]:
biomarkers_df = processor.process_biomarkers(file_path = '../data/Enhanced_AdvUrothelialBiomarkers.csv',
                                             index_date_df = df, 
                                             index_date_column = 'StartDate',
                                             days_before = None, 
                                             days_after = 14)

2025-05-03 14:11:13,478 - INFO - Successfully read Enhanced_AdvUrothelialBiomarkers.csv file with shape: (9924, 19) and unique PatientIDs: 4251
2025-05-03 14:11:13,493 - INFO - Successfully merged Enhanced_AdvUrothelialBiomarkers.csv df with index_date_df resulting in shape: (6326, 20) and unique PatientIDs: 2623
2025-05-03 14:11:13,550 - INFO - Successfully processed Enhanced_AdvUrothelialBiomarkers.csv file with final shape: (6461, 4) and unique PatientIDs: 6461


In [14]:
biomarkers_df.PDL1_percent_staining.value_counts(dropna = False)

PDL1_percent_staining
NaN          6372
5% - 9%        25
10% - 19%      16
20% - 29%      14
30% - 39%       9
90% - 99%       6
1%              5
50% - 59%       3
70% - 79%       3
80% - 89%       3
40% - 49%       2
2% - 4%         1
60% - 69%       1
100%            1
0%              0
< 1%            0
Name: count, dtype: int64

In [15]:
def map_pdl1(value):
    if pd.isna(value):  # leave missing as is
        return value
    elif value in ['0%', '< 1%']:
        return '0%'
    else:
        return '>=1%'

biomarkers_df['PDL1_binary'] = biomarkers_df['PDL1_percent_staining'].apply(map_pdl1)
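
As a quick illustration of how `map_pdl1` behaves, the sketch below applies the same function to a toy Series (the values are illustrative, not taken from the cohort):

```python
import numpy as np
import pandas as pd

def map_pdl1(value):
    if pd.isna(value):  # leave missing as is
        return value
    elif value in ['0%', '< 1%']:
        return '0%'
    else:
        return '>=1%'

# Toy staining categories, including a missing value.
toy_staining = pd.Series(['< 1%', '5% - 9%', np.nan, '0%', '100%'])
toy_binary = toy_staining.apply(map_pdl1)
```

Missing values pass through unchanged, so any later binarization has to decide explicitly how to code them.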

In [16]:
biomarkers_df.PDL1_binary.value_counts(dropna = False)

PDL1_binary
NaN     6372
>=1%      89
Name: count, dtype: int64

In [17]:
biomarkers_df = biomarkers_df.drop(columns = ['PDL1_percent_staining'])

In [18]:
biomarkers_df['FGFR_status'] = biomarkers_df['FGFR_status'].fillna('unknown')
biomarkers_df['PDL1_status'] = biomarkers_df['PDL1_status'].fillna('unknown')

### Process ECOG.csv

In [19]:
ecog_df = processor.process_ecog(file_path = '../data/ECOG.csv', 
                                 index_date_df = df,
                                 index_date_column = 'StartDate',
                                 days_before = 90,
                                 days_after = 0,
                                 days_before_further = 180)

2025-05-03 14:11:26,289 - INFO - Successfully read ECOG.csv file with shape: (184794, 4) and unique PatientIDs: 9933
2025-05-03 14:11:26,333 - INFO - Successfully merged ECOG.csv df with index_date_df resulting in shape: (118838, 5) and unique PatientIDs: 5453
2025-05-03 14:11:26,430 - INFO - Successfully processed ECOG.csv file with final shape: (6461, 3) and unique PatientIDs: 6461


In [20]:
ecog_df['ecog_index'] = ecog_df['ecog_index'].cat.add_categories('unknown').fillna('unknown')

ecog_df['ecog_index'] = ecog_df["ecog_index"].map({
    0: '0-1',
    1: '0-1',
    2: '2',
    3: '3-4',
    4: '3-4',
    'unknown': 'unknown'
})

ecog_df['ecog_index'] = ecog_df['ecog_index'].astype('category')

In [21]:
ecog_df['ecog_newly_gte2'] = ecog_df['ecog_newly_gte2'].fillna(0)

### Process Vitals.csv

In [22]:
vitals_df = processor.process_vitals(file_path = '../data/Vitals.csv',
                                     index_date_df = df,
                                     index_date_column = 'StartDate',
                                     weight_days_before = 90,
                                     days_after = 0,
                                     vital_summary_lookback = 180, 
                                     abnormal_reading_threshold = 1)

2025-05-03 14:11:38,394 - INFO - Successfully read Vitals.csv file with shape: (3604484, 16) and unique PatientIDs: 13109
2025-05-03 14:11:40,061 - INFO - Successfully merged Vitals.csv df with index_date_df resulting in shape: (2038026, 17) and unique PatientIDs: 6461
2025-05-03 14:11:40,993 - INFO - Successfully processed Vitals.csv file with final shape: (6461, 8) and unique PatientIDs: 6461


### Process Lab.csv

In [23]:
labs_df = processor.process_labs(file_path = '../data/Lab.csv',
                                 index_date_df = df,
                                 index_date_column = 'StartDate',
                                 days_before = 90,
                                 days_after = 0,
                                 summary_lookback = 180)

2025-05-03 14:11:58,322 - INFO - Successfully read Lab.csv file with shape: (9373598, 17) and unique PatientIDs: 12700
2025-05-03 14:12:02,796 - INFO - Successfully merged Lab.csv df with index_date_df resulting in shape: (5615579, 18) and unique PatientIDs: 6408
2025-05-03 14:12:14,842 - INFO - Successfully processed Lab.csv file with final shape: (6461, 76) and unique PatientIDs: 6461


### Process MedicationAdministration.csv

In [24]:
medications_df = processor.process_medications(file_path = '../data/MedicationAdministration.csv',
                                               index_date_df = df,
                                               index_date_column = 'StartDate',
                                               days_before = 90,
                                               days_after = 0)

2025-05-03 14:12:20,137 - INFO - Successfully read MedicationAdministration.csv file with shape: (997836, 11) and unique PatientIDs: 10983
2025-05-03 14:12:20,492 - INFO - Successfully merged MedicationAdministration.csv df with index_date_df resulting in shape: (565555, 12) and unique PatientIDs: 6341
2025-05-03 14:12:20,542 - INFO - Successfully processed MedicationAdministration.csv file with final shape: (6461, 9) and unique PatientIDs: 6461


### Process Diagnosis.csv

In [25]:
diagnosis_df = processor.process_diagnosis(file_path = '../data/Diagnosis.csv',
                                           index_date_df = df,
                                           index_date_column = 'StartDate',
                                           days_before = None,
                                           days_after = 0)

2025-05-03 14:12:24,338 - INFO - Successfully read Diagnosis.csv file with shape: (625348, 6) and unique PatientIDs: 13129
2025-05-03 14:12:24,459 - INFO - Successfully merged Diagnosis.csv df with index_date_df resulting in shape: (309101, 7) and unique PatientIDs: 6461
2025-05-03 14:12:25,387 - INFO - Successfully processed Diagnosis.csv file with final shape: (6461, 40) and unique PatientIDs: 6461


In [26]:
diagnosis_df['other_gi_met'] = (
    diagnosis_df['adrenal_met'] | diagnosis_df['peritoneum_met'] | diagnosis_df['gi_met']
)

diagnosis_df['other_combined_met'] = (
    diagnosis_df['brain_met'] | diagnosis_df['other_met']
)

diagnosis_df = diagnosis_df.drop(columns = ['adrenal_met', 'peritoneum_met', 'gi_met', 'brain_met', 'other_met'])

### Pulling in Neutrophil and Lymphocyte values

In [27]:
lab = pd.read_csv('../data/Lab.csv')
lab = lab[lab['PatientID'].isin(ids)]

In [28]:
lab.shape

(5615579, 17)

In [29]:
# Function that returns number of rows and count of unique PatientIDs for a dataframe. 
def row_ID(dataframe):
    row = dataframe.shape[0]
    ID = dataframe['PatientID'].nunique()
    return row, ID

In [30]:
row_ID(lab)

(5615579, 6408)

In [31]:
lab = pd.merge(lab, df[['PatientID', 'StartDate']], on = 'PatientID', how = 'left')

In [32]:
lab['ResultDate'] = pd.to_datetime(lab['ResultDate'], format="%Y-%m-%d")

In [33]:
# Select rows with neutrophil count or lymphocyte count
# Neutrophil count -- (LOINC: 26499-4, 751-8, 30451-9, 753-4)
# Lymphocyte count -- (LOINC: 26474-7, 30364-4, 731-0, 732-8)

neutrophil_loinc = ['26499-4', '751-8', '30451-9', '753-4']
lymphocyte_loinc = ['26474-7', '30364-4', '731-0', '732-8']

lab_core = (
    lab[lab['LOINC'].isin(neutrophil_loinc + lymphocyte_loinc)]
    .filter(items = ['PatientID', 
                     'ResultDate', 
                     'LOINC', 
                     'LabComponent', 
                     'TestUnits', 
                     'TestUnitsCleaned', 
                     'TestResult', 
                     'TestResultCleaned', 
                     'StartDate'])
)

In [34]:
lab_core['LOINC'] = lab_core['LOINC'].astype(str)
conditions = [
    ((lab_core['LOINC'] == '26499-4') | (lab_core['LOINC'] == '751-8') | (lab_core['LOINC'] == '30451-9') | (lab_core['LOINC'] == '753-4')),
    (lab_core['LOINC'] == '26474-7') | (lab_core['LOINC'] == '30364-4') | (lab_core['LOINC'] == '731-0') | (lab_core['LOINC'] == '732-8')]

choices = ['neutrophil_count',  
           'lymphocyte_count']

lab_core.loc[:, 'lab_name'] = np.select(conditions, choices, default="Unknown").astype(str)
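
The selection and labelling steps can also be driven by a single LOINC-to-name dictionary, which keeps the code list in one place. A minimal sketch on toy data follows; the dictionary name `LOINC_TO_LAB` and the toy rows are assumptions for illustration, and the third toy row uses a LOINC code that is outside the map.

```python
import pandas as pd

# LOINC codes as listed in the notebook comments.
LOINC_TO_LAB = {
    "26499-4": "neutrophil_count",
    "751-8": "neutrophil_count",
    "30451-9": "neutrophil_count",
    "753-4": "neutrophil_count",
    "26474-7": "lymphocyte_count",
    "30364-4": "lymphocyte_count",
    "731-0": "lymphocyte_count",
    "732-8": "lymphocyte_count",
}

# Toy lab rows; the last row has a code that is not in the map.
toy_lab = pd.DataFrame({
    "PatientID": ["A", "A", "B"],
    "LOINC": ["751-8", "731-0", "718-7"],
    "TestResultCleaned": [4.2, 1.1, 12.0],
})

# Keep only mapped codes, then label each row from the same dictionary.
toy_core = toy_lab[toy_lab["LOINC"].isin(list(LOINC_TO_LAB))].copy()
toy_core["lab_name"] = toy_core["LOINC"].map(LOINC_TO_LAB)
```

Because filtering and labelling share one dictionary, no row can end up with the "Unknown" default.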

In [35]:
lab_core['lab_name'].value_counts()

lab_name
lymphocyte_count    200662
neutrophil_count    179419
Name: count, dtype: int64

In [36]:
# Remove missing lab values. 
lab_core = lab_core.dropna(subset = ['TestResultCleaned'])

In [37]:
lab_core['TestUnits'].value_counts()

TestUnits
10*3/uL     332694
10*3/mm3      4973
10*9/L        4345
cell/uL       2993
/mm3           204
/uL            195
10*3/L          50
10*3/mL          7
Name: count, dtype: int64

In [38]:
pd.crosstab(lab_core['lab_name'], lab_core['TestUnits'], margins = True, margins_name = "Total")

lab_name,/mm3,/uL,10*3/L,10*3/mL,10*3/mm3,10*3/uL,10*9/L,cell/uL,Total
lymphocyte_count,101,88,25,7,2703,174916,2845,1475,182160
neutrophil_count,103,107,25,0,2270,157778,1500,1518,163301
Total,204,195,50,7,4973,332694,4345,2993,345461


In [39]:
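# Rescale cell counts by reported unit: 10*3/L is multiplied by 1,000,000; /mm3, /uL, and 10*3/mL are multiplied by 1,000.
# All other units, including cell/uL, are left unchanged by the conditions below.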
conditions = [
    ((lab_core['lab_name'] == 'neutrophil_count') | (lab_core['lab_name'] == 'lymphocyte_count')) & 
    (lab_core['TestUnits'] == '10*3/L'),
    ((lab_core['lab_name'] == 'neutrophil_count') | (lab_core['lab_name'] == 'lymphocyte_count')) & 
    ((lab_core['TestUnits'] == '/mm3') | (lab_core['TestUnits'] == '/uL') | (lab_core['TestUnits'] == '10*3/mL'))]

choices = [lab_core['TestResultCleaned'] * 1000000,
           lab_core['TestResultCleaned'] * 1000]

lab_core.loc[:, 'test_result_cleaned'] = np.select(conditions, choices, default = lab_core['TestResultCleaned'])
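
For reference, the textbook dimensional conversion of a raw count to 10*3/uL divides by a unit-specific factor. The sketch below illustrates that arithmetic only. It is not the notebook's rescaling step, which operates on TestResultCleaned, and whether that column has already been standardized should be confirmed against the data dictionary. The helper name `to_thousand_per_ul` and the divisor table are hypothetical.

```python
# Divisors that take a raw count in the given unit to 10*3/uL.
# 1 mm3 equals 1 uL, and 1 L equals 1,000,000 uL.
DIVISOR_TO_THOUSAND_PER_UL = {
    "10*3/uL": 1,
    "10*3/mm3": 1,
    "10*9/L": 1,
    "/uL": 1_000,
    "/mm3": 1_000,
    "cell/uL": 1_000,
    "10*3/mL": 1_000,
    "10*3/L": 1_000_000,
}

def to_thousand_per_ul(raw_value: float, unit: str) -> float:
    """Convert a raw cell count reported in `unit` to 10*3/uL."""
    if unit not in DIVISOR_TO_THOUSAND_PER_UL:
        raise ValueError(f"Unrecognized unit: {unit}")
    return raw_value / DIVISOR_TO_THOUSAND_PER_UL[unit]

example_per_ul = to_thousand_per_ul(4500, "/uL")
example_per_l = to_thousand_per_ul(4.5, "10*9/L")
```

A neutrophil count of 4500 /uL and one of 4.5 x 10*9/L describe the same concentration, which is why both land on the same value.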

In [40]:
# Eligibility window is -90 to 0 days from start date. 
lab_core['StartDate'] = pd.to_datetime(lab_core['StartDate'], format="%Y-%m-%d")
lab_core_win = (
    lab_core
    .assign(lab_date_diff = (lab_core['ResultDate'] - lab_core['StartDate']).dt.days)
    .query('lab_date_diff >= -90 and lab_date_diff <= 0')
    .filter(items = ['PatientID', 'ResultDate', 'TestResultCleaned', 'lab_name', 'StartDate', 'test_result_cleaned', 'lab_date_diff'])
)
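
A minimal sketch of the same window logic on toy dates, assuming the -90 to 0 day window relative to StartDate used above (the rows are illustrative):

```python
import pandas as pd

# Three toy results for one patient whose index date is 2021-07-08.
toy_dates = pd.DataFrame({
    "PatientID": ["A", "A", "A"],
    "ResultDate": pd.to_datetime(["2021-01-01", "2021-06-30", "2021-07-09"]),
    "StartDate": pd.to_datetime(["2021-07-08"] * 3),
})

# Negative values are days before the index date; positive values are after it.
toy_dates["lab_date_diff"] = (toy_dates["ResultDate"] - toy_dates["StartDate"]).dt.days

# Series.between is inclusive on both ends by default.
in_window = toy_dates[toy_dates["lab_date_diff"].between(-90, 0)]
```

Only the result drawn 8 days before the index date is kept; the earlier result is too old and the later one falls after treatment start.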

In [41]:
row_ID(lab_core_win)

(27395, 5682)

In [42]:
lab_core_win.loc[:, 'lab_date_diff'] = lab_core_win['lab_date_diff'].abs()

In [43]:
# Select the lab result closest to the start date (index date) and pivot to a wide table. 
lab_diag_wide = (
    lab_core_win
    .loc[lab_core_win.groupby(['PatientID', 'lab_name'])['lab_date_diff'].idxmin()]
    .pivot(index = 'PatientID', columns = 'lab_name', values = 'test_result_cleaned')
    .reset_index()
    .rename(columns = {
        'neutrophil_count': 'neutrophil',
        'lymphocyte_count': 'lymphocyte'})
)

lab_diag_wide.columns.name = None
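
The closest-result selection can be illustrated on a toy frame. This is a sketch with made-up values that mirrors the groupby, `idxmin`, and `pivot` chain above:

```python
import pandas as pd

toy_labs = pd.DataFrame({
    "PatientID": ["A", "A", "A", "B"],
    "lab_name": ["neutrophil_count", "neutrophil_count", "lymphocyte_count", "neutrophil_count"],
    "lab_date_diff": [30, 2, 10, 45],  # absolute days from the index date
    "test_result_cleaned": [5.0, 6.5, 1.2, 3.3],
})

# idxmin returns the row label of the smallest gap within each patient/lab group.
closest_rows = toy_labs.loc[
    toy_labs.groupby(["PatientID", "lab_name"])["lab_date_diff"].idxmin()
]

toy_wide = (
    closest_rows
    .pivot(index="PatientID", columns="lab_name", values="test_result_cleaned")
    .reset_index()
)
toy_wide.columns.name = None
```

When two results share the same smallest gap, `idxmin` keeps the first one it encounters, so the row order of the input determines the tie-break.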

In [44]:
row_ID(lab_diag_wide)

(5682, 5682)

In [45]:
lab_diag_wide.sample(3)

Unnamed: 0,PatientID,lymphocyte,neutrophil
1455,F412959B03189,1.66,9.32
5456,FF5300B7888CA,1.5,6.5404
1066,F2EBA47DD533B,1.4193,6.4


In [46]:
lab_diag_wide = (
    pd.concat(
        [lab_diag_wide, 
        pd.Series(ids)[~pd.Series(ids).isin(lab_diag_wide['PatientID'])].to_frame(name = 'PatientID')],
        sort = False)
)
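
An equivalent way to restore patients without qualifying labs is a left merge from the full ID list. The sketch below uses toy data, and the names `all_ids` and `toy_wide_labs` are hypothetical:

```python
import pandas as pd

all_ids = ["A", "B", "C"]

# Toy wide table that is missing patient B.
toy_wide_labs = pd.DataFrame({
    "PatientID": ["A", "C"],
    "lymphocyte": [1.2, 0.9],
    "neutrophil": [6.5, 4.0],
})

# A left merge keeps every ID once and fills absent lab values with NaN.
toy_full = pd.DataFrame({"PatientID": all_ids}).merge(
    toy_wide_labs, on="PatientID", how="left"
)
```

Unlike `pd.concat`, the merge also yields a clean 0..n-1 index and preserves the order of the ID list.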

In [47]:
row_ID(lab_diag_wide)

(6461, 6461)

Max, min

In [50]:
# Eligibility window is -180 to 0 days from start date. 
lab_core_win_summ = (
    lab_core
    .assign(lab_date_diff = (lab_core['ResultDate'] - lab_core['StartDate']).dt.days)
    .query('lab_date_diff >= -180 and lab_date_diff <= 0')
    .filter(items = ['PatientID', 'ResultDate', 'TestResultCleaned', 'lab_name', 'StartDate', 'test_result_cleaned', 'lab_date_diff'])
)

In [51]:
# Pivot table of maximum values for core labs during eligibility period of -180 to 0 days from start date. 
lab_max_wide = (
    lab_core_win_summ
    .groupby(['PatientID', 'lab_name'])['test_result_cleaned'].max()
    .to_frame()
    .reset_index()
    .pivot(index = 'PatientID', columns = 'lab_name', values = 'test_result_cleaned')
    .reset_index()
    .rename(columns = {
        'neutrophil_count': 'neutrophil_max',
        'lymphocyte_count': 'lymphocyte_max'})
)

lab_max_wide.columns.name = None

In [52]:
row_ID(lab_max_wide)

(5723, 5723)

In [58]:
# Pivot table of minimum values for core labs during eligibility period of -180 to 0 days from start date. 
lab_min_wide = (
    lab_core_win_summ
    .groupby(['PatientID', 'lab_name'])['test_result_cleaned'].min()
    .to_frame()
    .reset_index()
    .pivot(index = 'PatientID', columns = 'lab_name', values = 'test_result_cleaned')
    .reset_index()
    .rename(columns = {
        'neutrophil_count': 'neutrophil_min',
        'lymphocyte_count': 'lymphocyte_min'})
)

lab_min_wide.columns.name = None

In [59]:
row_ID(lab_min_wide)

(5723, 5723)
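
The maximum and minimum tables can also be produced in a single pass with `agg` and `unstack`. A minimal sketch on toy values, with column names chosen to match the `_max` and `_min` convention used above:

```python
import pandas as pd

toy_summ = pd.DataFrame({
    "PatientID": ["A", "A", "A", "B"],
    "lab_name": ["neutrophil_count", "neutrophil_count", "lymphocyte_count", "neutrophil_count"],
    "test_result_cleaned": [5.0, 6.5, 1.2, 3.3],
})

# One row per patient; columns form a (statistic, lab_name) MultiIndex.
summary = (
    toy_summ.groupby(["PatientID", "lab_name"])["test_result_cleaned"]
    .agg(["max", "min"])
    .unstack("lab_name")
)

# Flatten to names such as 'neutrophil_max' and 'lymphocyte_min'.
summary.columns = [f"{lab.replace('_count', '')}_{stat}" for stat, lab in summary.columns]
summary = summary.reset_index()
```

This avoids building two intermediate tables and merging them afterwards.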

In [60]:
neutrophil_lymphocyte_df = pd.merge(lab_diag_wide, lab_max_wide, on = 'PatientID', how = 'outer')

In [61]:
neutrophil_lymphocyte_df = pd.merge(neutrophil_lymphocyte_df, lab_min_wide, on = 'PatientID', how = 'outer')

In [62]:
neutrophil_lymphocyte_df.sample(3)

Unnamed: 0,PatientID,lymphocyte,neutrophil,lymphocyte_max,neutrophil_max,lymphocyte_min,neutrophil_min
4573,FB52FD54E6250,1.0,7.4,1.5,7.7,0.8,2.9
4353,FAD53C0FC1CAF,,,,,,
5668,FE0B1B2A2DA7B,1.7,,1.7,,1.3,


## Merge dataframes

In [96]:
final_df = merge_dataframes(enhanced_df,
                            demographics_df,
                            biomarkers_df,
                            ecog_df,
                            vitals_df,
                            labs_df,
                            medications_df,
                            diagnosis_df,
                            neutrophil_lymphocyte_df)

2025-05-03 14:50:08,225 - INFO - Anticipated number of merges: 8
2025-05-03 14:50:08,228 - INFO - Anticipated number of columns in final dataframe presuming all columns are unique except for PatientID: 155
2025-05-03 14:50:08,238 - INFO - Dataset 1 shape: (6461, 13), unique PatientIDs: 6461
2025-05-03 14:50:08,245 - INFO - Dataset 2 shape: (6461, 6), unique PatientIDs: 6461
2025-05-03 14:50:08,246 - INFO - Dataset 3 shape: (6461, 4), unique PatientIDs: 6461
2025-05-03 14:50:08,247 - INFO - Dataset 4 shape: (6461, 3), unique PatientIDs: 6461
2025-05-03 14:50:08,249 - INFO - Dataset 5 shape: (6461, 8), unique PatientIDs: 6461
2025-05-03 14:50:08,250 - INFO - Dataset 6 shape: (6461, 76), unique PatientIDs: 6461
2025-05-03 14:50:08,251 - INFO - Dataset 7 shape: (6461, 9), unique PatientIDs: 6461
2025-05-03 14:50:08,253 - INFO - Dataset 8 shape: (6461, 37), unique PatientIDs: 6461
2025-05-03 14:50:08,257 - INFO - Dataset 9 shape: (6461, 7), unique PatientIDs: 6461
2025-05-03 14:50:08,286 - 

In [97]:
final_df.shape

(6461, 155)

In [98]:
final_df.head(2)

Unnamed: 0,PatientID,DiseaseGrade,SmokingStatus,Surgery,GroupStage_mod,TStage_mod,NStage_mod,MStage_mod,SurgeryType_mod,days_diagnosis_to_adv,...,liver_met,bone_met,other_gi_met,other_combined_met,lymphocyte,neutrophil,lymphocyte_max,neutrophil_max,lymphocyte_min,neutrophil_min
0,F0016E985D839,High grade (G2/G3/G4),1,1,IV,T3,N1,M0,upper,0.0,...,0,0,0,0,1.6,3.3,2.4,5.4,1.6,3.3
1,F001E5D4C6FA0,Low grade (G1),1,1,unknown,T1,unknown,unknown,bladder,274.0,...,0,0,0,0,2.4,,2.9,,2.4,


In [99]:
list(final_df.columns)

['PatientID',
 'DiseaseGrade',
 'SmokingStatus',
 'Surgery',
 'GroupStage_mod',
 'TStage_mod',
 'NStage_mod',
 'MStage_mod',
 'SurgeryType_mod',
 'days_diagnosis_to_adv',
 'adv_diagnosis_year',
 'days_diagnosis_to_surgery',
 'PrimarySite_lower',
 'Gender',
 'age',
 'Ethnicity_mod',
 'Race_mod',
 'region',
 'PDL1_status',
 'FGFR_status',
 'PDL1_binary',
 'ecog_index',
 'ecog_newly_gte2',
 'weight_index',
 'bmi_index',
 'percent_change_weight',
 'hypotension',
 'tachycardia',
 'fevers',
 'hypoxemia',
 'albumin',
 'alp',
 'alt',
 'ast',
 'bicarbonate',
 'bun',
 'calcium',
 'chloride',
 'creatinine',
 'hemoglobin',
 'platelet',
 'potassium',
 'sodium',
 'total_bilirubin',
 'wbc',
 'albumin_max',
 'alp_max',
 'alt_max',
 'ast_max',
 'bicarbonate_max',
 'bun_max',
 'calcium_max',
 'chloride_max',
 'creatinine_max',
 'hemoglobin_max',
 'platelet_max',
 'potassium_max',
 'sodium_max',
 'total_bilirubin_max',
 'wbc_max',
 'albumin_min',
 'alp_min',
 'alt_min',
 'ast_min',
 'bicarbonate_min',
 '

## Variable selection for final data frame

After discussing clinical relevance and suitability for this analysis, we plan to proceed with the following ~61 variables:
    'PrimarySite_lower',
    'SmokingStatus',
    'Surgery',
    'GroupStage_mod',
    'days_diagnosis_to_adv',
    'age',
    'PDL1_binary',
    'FGFR_status',
    'ecog_index',
    'ecog_newly_gte2',
    'weight_index',
    'bmi_index',
    'percent_change_weight',
    'hypotension',
    'tachycardia',
    'fevers',
    'hypoxemia',
    'platelet',
    'hemoglobin',
    'wbc',
    'creatinine',
    'potassium',
    'calcium',
    'sodium',
    'bicarbonate',
    'total_bilirubin',
    'alp',
    'ast',
    'albumin',
    'wbc_max',
    'creatinine_max',
    'sodium_max',
    'total_bilirubin_max',
    'alp_max',
    'ast_max',
    'platelet_min',
    'hemoglobin_min',
    'wbc_min',
    'albumin_min',
    'sodium_min',
    'bicarbonate_min',
    'potassium_min',
    'anticoagulant',
    'opioid',
    'antibiotic',
    'diabetic_med',
    'bone_therapy_agent',
    'immunosuppressant',
    'van_walraven_score',
    'lymph_met',
    'thoracic_met',
    'liver_met',
    'bone_met',
    'other_gi_met',
    'other_combined_met',
    'neutrophil',
    'neutrophil_max',
    'neutrophil_min',
    'lymphocyte',
    'lymphocyte_max',
    'lymphocyte_min'

In [100]:
final_df['ecog_newly_gte2'].value_counts()

ecog_newly_gte2
0    6130
1     331
Name: count, dtype: Int64

Filter down to the 60 variables of interest:

In [101]:
final_df = final_df.filter(items=[
    'PatientID',
    'PrimarySite_lower',
    'SmokingStatus',
    'Surgery',
    'GroupStage_mod',
    'days_diagnosis_to_adv',
    'age',
    'PDL1_binary',
    'FGFR_status',
    'ecog_index',
    'ecog_newly_gte2',
    'bmi_index',
    'percent_change_weight',
    'hypotension',
    'tachycardia',
    'fevers',
    'hypoxemia',
    'platelet',
    'hemoglobin',
    'wbc',
    'creatinine',
    'potassium',
    'calcium',
    'sodium',
    'bicarbonate',
    'total_bilirubin',
    'alp',
    'ast',
    'albumin',
    'wbc_max',
    'creatinine_max',
    'sodium_max',
    'total_bilirubin_max',
    'alp_max',
    'ast_max',
    'platelet_min',
    'hemoglobin_min',
    'wbc_min',
    'albumin_min',
    'sodium_min',
    'bicarbonate_min',
    'potassium_min',
    'anticoagulant',
    'opioid',
    'antibiotic',
    'diabetic_med',
    'bone_therapy_agent',
    'immunosuppressant',
    'van_walraven_score',
    'lymph_met',
    'thoracic_met',
    'liver_met',
    'bone_met',
    'other_gi_met',
    'other_combined_met',
    'neutrophil',
    'neutrophil_max',
    'neutrophil_min',
    'lymphocyte',
    'lymphocyte_max',
    'lymphocyte_min'])

In [102]:
final_df.shape

(6461, 61)

In [103]:
list(final_df.columns)

['PatientID',
 'PrimarySite_lower',
 'SmokingStatus',
 'Surgery',
 'GroupStage_mod',
 'days_diagnosis_to_adv',
 'age',
 'PDL1_binary',
 'FGFR_status',
 'ecog_index',
 'ecog_newly_gte2',
 'bmi_index',
 'percent_change_weight',
 'hypotension',
 'tachycardia',
 'fevers',
 'hypoxemia',
 'platelet',
 'hemoglobin',
 'wbc',
 'creatinine',
 'potassium',
 'calcium',
 'sodium',
 'bicarbonate',
 'total_bilirubin',
 'alp',
 'ast',
 'albumin',
 'wbc_max',
 'creatinine_max',
 'sodium_max',
 'total_bilirubin_max',
 'alp_max',
 'ast_max',
 'platelet_min',
 'hemoglobin_min',
 'wbc_min',
 'albumin_min',
 'sodium_min',
 'bicarbonate_min',
 'potassium_min',
 'anticoagulant',
 'opioid',
 'antibiotic',
 'diabetic_med',
 'bone_therapy_agent',
 'immunosuppressant',
 'van_walraven_score',
 'lymph_met',
 'thoracic_met',
 'liver_met',
 'bone_met',
 'other_gi_met',
 'other_combined_met',
 'neutrophil',
 'neutrophil_max',
 'neutrophil_min',
 'lymphocyte',
 'lymphocyte_max',
 'lymphocyte_min']

## Preparing data frame for clustering (UMAP --> HDBSCAN): binarizing variables and imputing missing values

Primary Site

In [104]:
final_df['PrimarySite_lower'].value_counts()

PrimarySite_lower
1    4965
0    1496
Name: count, dtype: int64

Smoking

In [95]:
final_df['SmokingStatus'].value_counts()

SmokingStatus
1    4742
0    1719
Name: count, dtype: int64

Surgery

In [105]:
final_df['Surgery'].value_counts()

Surgery
0    3345
1    3116
Name: count, dtype: Int64

GroupStage_mod

In [106]:
final_df['GroupStage_mod'].value_counts()

GroupStage_mod
unknown    2978
IV         2346
0-II        606
III         531
Name: count, dtype: int64

Convert this into two binary indicators, stage_0_III and stage_IV, to capture whether a patient had localized or metastatic disease at the time of diagnosis. Patients with unknown stage are coded 0 for both indicators.

In [107]:
final_df['stage_0_III'] = np.where(final_df['GroupStage_mod'].isin(['0-II', 'III']), 1, 0)

In [108]:
final_df['stage_0_III'].value_counts()

stage_0_III
0    5324
1    1137
Name: count, dtype: int64

In [109]:
final_df['stage_IV'] = np.where(final_df['GroupStage_mod'].isin(['IV']), 1, 0)

In [111]:
final_df['stage_IV'].value_counts()

stage_IV
0    4115
1    2346
Name: count, dtype: int64

In [112]:
del final_df['GroupStage_mod']
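
The resulting encoding can be checked on a toy Series containing each stage category once. This is a sketch of the same `np.where` and `isin` logic:

```python
import numpy as np
import pandas as pd

toy_stage = pd.Series(["0-II", "III", "IV", "unknown"], dtype="category")

toy_flags = pd.DataFrame({
    "stage_0_III": np.where(toy_stage.isin(["0-II", "III"]), 1, 0),
    "stage_IV": np.where(toy_stage.isin(["IV"]), 1, 0),
})
```

Unknown stage is the implicit reference level: it is the only category with 0 in both columns.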

Days from diagnosis

In [113]:
final_df['days_diagnosis_to_adv'].isna().sum()

np.int64(0)

In [114]:
final_df['days_diagnosis_to_adv'].median()

np.float64(101.0)

Age

In [115]:
final_df['age'].isna().sum()

np.int64(0)

In [116]:
final_df['age'].median()

np.float64(73.0)

PDL1 Status

In [118]:
final_df['PDL1_binary'].value_counts()

PDL1_binary
>=1%    89
Name: count, dtype: int64

In [119]:
final_df['pdl1'] = np.where(final_df['PDL1_binary'].isin(['>=1%']), 1, 0)

In [120]:
final_df['pdl1'].value_counts()

pdl1
0    6372
1      89
Name: count, dtype: int64

In [121]:
del final_df['PDL1_binary']

FGFR

In [122]:
final_df['FGFR_status'].value_counts()

FGFR_status
unknown     5563
negative     632
positive     266
Name: count, dtype: int64

In [123]:
final_df['fgfr'] = np.where(final_df['FGFR_status'].isin(['positive']), 1, 0)

In [124]:
final_df['fgfr'].value_counts()

fgfr
0    6195
1     266
Name: count, dtype: int64

In [125]:
del final_df['FGFR_status']

ECOG Index

In [296]:
final_df['ecog_index'].value_counts()

ecog_index
0-1        3547
unknown    1987
2           728
3-4         199
Name: count, dtype: int64

In [297]:
final_df['ecog_2+'] = np.where(final_df['ecog_index'].isin(['2', '3-4']), 1, 0)

In [298]:
final_df['ecog_2+'].value_counts()

ecog_2+
0    5534
1     927
Name: count, dtype: int64

In [299]:
del final_df['ecog_index']

ECOG Newly gte2

In [130]:
final_df['ecog_newly_gte2'].value_counts()

ecog_newly_gte2
0    6130
1     331
Name: count, dtype: Int64

BMI

In [132]:
final_df['bmi_index'].isna().sum()

np.int64(94)

For missing BMI values, impute using median value

In [133]:
final_df['bmi_index'].median()

np.float64(26.417649130497168)

In [134]:
final_df['bmi_index'] = final_df['bmi_index'].fillna(final_df['bmi_index'].median())


In [135]:
final_df['bmi_index'].isna().sum()

np.int64(0)
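
The same median imputation is repeated for each continuous variable below. A minimal sketch of doing it for several columns at once, assuming a toy frame and a hypothetical `numeric_cols` list:

```python
import numpy as np
import pandas as pd

toy_values = pd.DataFrame({
    "bmi_index": [22.0, np.nan, 30.0],
    "platelet": [250.0, 310.0, np.nan],
})

numeric_cols = ["bmi_index", "platelet"]

# Column-wise medians computed from the observed (non-missing) values.
medians = toy_values[numeric_cols].median()

# Passing a Series to DataFrame.fillna fills each column with its own median.
toy_imputed = toy_values.copy()
toy_imputed[numeric_cols] = toy_imputed[numeric_cols].fillna(medians)
```

Assigning the result back avoids the chained `inplace=True` pattern, and computing each median from its own column removes the risk of filling one variable with another variable's median.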

Percent Change in weight

In [136]:
final_df['percent_change_weight'].isna().sum()

np.int64(524)

In [137]:
final_df['percent_change_weight'].median()

np.float64(-0.9803410038610013)

In [138]:
final_df['percent_change_weight'] = final_df['percent_change_weight'].fillna(final_df['percent_change_weight'].median())


In [139]:
final_df['percent_change_weight'].isna().sum()

np.int64(0)

Hypotension

In [140]:
final_df['hypotension'].value_counts()

hypotension
0    6185
1     276
Name: count, dtype: Int64

Tachycardia

In [141]:
final_df['tachycardia'].value_counts()

tachycardia
0    4706
1    1755
Name: count, dtype: Int64

Fevers

In [142]:
final_df['fevers'].value_counts()

fevers
0    6324
1     137
Name: count, dtype: Int64

Hypoxemia

In [143]:
final_df['hypoxemia'].value_counts()

hypoxemia
0    6279
1     182
Name: count, dtype: Int64

Platelet

In [144]:
final_df['platelet'].isna().sum()

np.int64(370)

In [145]:
final_df['platelet'].median()

np.float64(269.0)

In [146]:
final_df['platelet'] = final_df['platelet'].fillna(final_df['platelet'].median())


In [147]:
final_df['platelet'].isna().sum()

np.int64(0)

Hemoglobin

In [148]:
final_df['hemoglobin'].isna().sum()

np.int64(372)

In [149]:
final_df['hemoglobin'].median()

np.float64(11.6)

In [150]:
final_df['hemoglobin'] = final_df['hemoglobin'].fillna(final_df['hemoglobin'].median())


In [151]:
final_df['hemoglobin'].isna().sum()

np.int64(0)

WBC

In [152]:
final_df['wbc'].isna().sum()

np.int64(376)

In [153]:
final_df['wbc'].median()

np.float64(8.0)

In [154]:
final_df['wbc'] = final_df['wbc'].fillna(final_df['wbc'].median())


In [156]:
final_df['wbc'].isna().sum()

np.int64(0)

Creatinine

In [157]:
final_df['creatinine'].isna().sum()

np.int64(753)

In [159]:
final_df['creatinine'].median()

np.float64(1.2)

In [158]:
final_df['creatinine'] = final_df['creatinine'].fillna(final_df['creatinine'].median())


In [160]:
final_df['creatinine'].isna().sum()

np.int64(0)

Potassium

In [161]:
final_df['potassium'].isna().sum()

np.int64(807)

In [162]:
final_df['potassium'].median()

np.float64(4.4)

In [163]:
final_df['potassium'] = final_df['potassium'].fillna(final_df['potassium'].median())


In [164]:
final_df['potassium'].isna().sum()

np.int64(0)

Calcium

In [165]:
final_df['calcium'].isna().sum()

np.int64(853)

In [166]:
final_df['calcium'].median()

np.float64(9.3)

In [167]:
final_df['calcium'] = final_df['calcium'].fillna(final_df['calcium'].median())

In [168]:
final_df['calcium'].isna().sum()

np.int64(0)

Sodium

In [169]:
final_df['sodium'].isna().sum()

np.int64(806)

In [170]:
final_df['sodium'].median()

np.float64(138.0)

In [171]:
final_df['sodium'] = final_df['sodium'].fillna(final_df['sodium'].median())

In [172]:
final_df['sodium'].isna().sum()

np.int64(0)

Bicarbonate

In [173]:
final_df['bicarbonate'].isna().sum()

np.int64(846)

In [174]:
final_df['bicarbonate'].median()

np.float64(25.0)

In [175]:
# Fill with the bicarbonate median (the earlier version used the sodium median by mistake)
final_df['bicarbonate'] = final_df['bicarbonate'].fillna(final_df['bicarbonate'].median())

In [176]:
final_df['bicarbonate'].isna().sum()

np.int64(0)
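
Repeating the same fill cell for every lab column makes it easy to pair a column with the wrong median. A minimal sketch of a loop-based alternative follows; the helper name `impute_median` and the toy values are assumptions for illustration, not part of this notebook or of flatiron_cleaner.

```python
import numpy as np
import pandas as pd


def impute_median(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of `frame` with NaNs in `columns` replaced by each column's own median."""
    out = frame.copy()
    for col in columns:
        # The median always comes from the same column that is being filled.
        out[col] = out[col].fillna(out[col].median())
    return out


# Toy data standing in for final_df (values are illustrative only).
toy = pd.DataFrame({
    "sodium": [138.0, np.nan, 140.0],
    "bicarbonate": [24.0, 26.0, np.nan],
})
imputed = impute_median(toy, ["sodium", "bicarbonate"])
print(imputed)
```

Assigning the result back to the column, rather than calling `fillna(..., inplace=True)` on a selected column, also avoids the pandas chained-assignment warning.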

Bilirubin

In [177]:
final_df['total_bilirubin'].isna().sum()

np.int64(899)

In [178]:
final_df['total_bilirubin'].median()

np.float64(0.4)

In [179]:
final_df['total_bilirubin'] = final_df['total_bilirubin'].fillna(final_df['total_bilirubin'].median())

In [180]:
final_df['total_bilirubin'].isna().sum()

np.int64(0)

Alk Phos

In [181]:
final_df['alp'].isna().sum()

np.int64(869)

In [182]:
final_df['alp'].median()

np.float64(89.0)

In [183]:
final_df['alp'] = final_df['alp'].fillna(final_df['alp'].median())

In [184]:
final_df['alp'].isna().sum()

np.int64(0)

AST

In [185]:
final_df['ast'].isna().sum()

np.int64(871)

In [186]:
final_df['ast'].median()

np.float64(18.0)

In [187]:
final_df['ast'] = final_df['ast'].fillna(final_df['ast'].median())

In [188]:
final_df['ast'].isna().sum()

np.int64(0)

Albumin

In [189]:
final_df['albumin'].isna().sum()

np.int64(970)

In [190]:
final_df['albumin'].median()

np.float64(38.0)

Note: albumin is reported here in g/L, not g/dL, so the median of 38.0 g/L corresponds to 3.8 g/dL.

In [191]:
final_df['albumin'] = final_df['albumin'].fillna(final_df['albumin'].median())

In [192]:
final_df['albumin'].isna().sum()

np.int64(0)
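
If albumin is ever needed in g/dL (for example, to compare against thresholds quoted in that unit), the conversion is a division by 10. The sketch below uses toy values; the output name `albumin_g_dl` is hypothetical and no such column exists in `final_df`.

```python
import pandas as pd

# Toy albumin values in g/L, standing in for final_df['albumin'].
albumin_g_per_l = pd.Series([38.0, 25.0, 42.0], name="albumin")

# 1 dL = 0.1 L, so a concentration in g/L divided by 10 gives g/dL.
albumin_g_per_dl = (albumin_g_per_l / 10).rename("albumin_g_dl")
print(albumin_g_per_dl.round(1).tolist())
```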

WBC Max

In [193]:
final_df['wbc_max'].isna().sum()

np.int64(351)

In [194]:
final_df['wbc_max'].median()

np.float64(9.1)

In [195]:
final_df['wbc_max'] = final_df['wbc_max'].fillna(final_df['wbc_max'].median())

In [196]:
final_df['wbc_max'].isna().sum()

np.int64(0)

Cr Max

In [197]:
final_df['creatinine_max'].isna().sum()

np.int64(691)

In [198]:
final_df['creatinine_max'].median()

np.float64(1.29)

In [199]:
final_df['creatinine_max'] = final_df['creatinine_max'].fillna(final_df['creatinine_max'].median())

In [200]:
final_df['creatinine_max'].isna().sum()

np.int64(0)

Sodium Max

In [201]:
final_df['sodium_max'].isna().sum()

np.int64(743)

In [202]:
final_df['sodium_max'].median()

np.float64(140.0)

In [203]:
final_df['sodium_max'] = final_df['sodium_max'].fillna(final_df['sodium_max'].median())

In [204]:
final_df['sodium_max'].isna().sum()

np.int64(0)

Bilirubin max

In [205]:
final_df['total_bilirubin_max'].isna().sum()

np.int64(829)

In [206]:
final_df['total_bilirubin_max'].median()

np.float64(0.5)

In [207]:
final_df['total_bilirubin_max'] = final_df['total_bilirubin_max'].fillna(final_df['total_bilirubin_max'].median())

In [208]:
final_df['total_bilirubin_max'].isna().sum()

np.int64(0)

Alk Phos Max

In [209]:
final_df['alp_max'].isna().sum()

np.int64(801)

In [210]:
final_df['alp_max'].median()

np.float64(94.0)

In [211]:
final_df['alp_max'] = final_df['alp_max'].fillna(final_df['alp_max'].median())

In [212]:
final_df['alp_max'].isna().sum()

np.int64(0)

AST Max

In [213]:
final_df['ast_max'].isna().sum()

np.int64(803)

In [214]:
final_df['ast_max'].median()

np.float64(21.0)

In [215]:
final_df['ast_max'] = final_df['ast_max'].fillna(final_df['ast_max'].median())

In [216]:
final_df['ast_max'].isna().sum()

np.int64(0)

Platelet Min

In [217]:
final_df['platelet_min'].isna().sum()

np.int64(346)

In [218]:
final_df['platelet_min'].median()

np.float64(239.0)

In [219]:
final_df['platelet_min'] = final_df['platelet_min'].fillna(final_df['platelet_min'].median())

In [220]:
final_df['platelet_min'].isna().sum()

np.int64(0)

Hemoglobin Min

In [221]:
final_df['hemoglobin_min'].isna().sum()

np.int64(347)

In [222]:
final_df['hemoglobin_min'].median()

np.float64(11.2)

In [223]:
final_df['hemoglobin_min'] = final_df['hemoglobin_min'].fillna(final_df['hemoglobin_min'].median())

In [224]:
final_df['hemoglobin_min'].isna().sum()

np.int64(0)

WBC Min

In [225]:
final_df['wbc_min'].isna().sum()

np.int64(351)

In [226]:
final_df['wbc_min'].median()

np.float64(7.2)

In [227]:
final_df['wbc_min'] = final_df['wbc_min'].fillna(final_df['wbc_min'].median())

In [228]:
final_df['wbc_min'].isna().sum()

np.int64(0)

Albumin Min

In [229]:
final_df['albumin_min'].isna().sum()

np.int64(902)

In [230]:
final_df['albumin_min'].median()

np.float64(38.0)

In [231]:
final_df['albumin_min'] = final_df['albumin_min'].fillna(final_df['albumin_min'].median())

In [232]:
final_df['albumin_min'].isna().sum()

np.int64(0)

Sodium Min

In [233]:
final_df['sodium_min'].isna().sum()

np.int64(743)

In [234]:
final_df['sodium_min'].median()

np.float64(137.95)

In [235]:
final_df['sodium_min'] = final_df['sodium_min'].fillna(final_df['sodium_min'].median())

In [236]:
final_df['sodium_min'].isna().sum()

np.int64(0)

Bicarbonate Min

In [237]:
final_df['bicarbonate_min'].isna().sum()

np.int64(780)

In [238]:
final_df['bicarbonate_min'].median()

np.float64(24.0)

In [239]:
final_df['bicarbonate_min'] = final_df['bicarbonate_min'].fillna(final_df['bicarbonate_min'].median())

In [240]:
final_df['bicarbonate_min'].isna().sum()

np.int64(0)

Potassium Min

In [242]:
final_df['potassium_min'].isna().sum()

np.int64(744)

In [243]:
final_df['potassium_min'].median()

np.float64(4.2)

In [244]:
final_df['potassium_min'] = final_df['potassium_min'].fillna(final_df['potassium_min'].median())

In [245]:
final_df['potassium_min'].isna().sum()

np.int64(0)

Anticoagulant Treatment

In [246]:
final_df['anticoagulant'].value_counts()

anticoagulant
0    6391
1      70
Name: count, dtype: Int64

Opioid Treatment

In [247]:
final_df['opioid'].value_counts()

opioid
0    6152
1     309
Name: count, dtype: Int64

Antibiotic

In [248]:
final_df['antibiotic'].value_counts()

antibiotic
0    5987
1     474
Name: count, dtype: Int64

Diabetic medication

In [249]:
final_df['diabetic_med'].value_counts()

diabetic_med
0    6377
1      84
Name: count, dtype: Int64

Bone Therapy Agent

In [251]:
final_df['bone_therapy_agent'].value_counts()

bone_therapy_agent
0    6118
1     343
Name: count, dtype: Int64

Immunosuppressant

In [252]:
final_df['immunosuppressant'].value_counts()

immunosuppressant
0    6455
1       6
Name: count, dtype: Int64

Van Walraven Score

In [254]:
final_df['van_walraven_score'].isna().sum()

np.int64(2755)

Note: 2,755 of 6,461 patients (about 43%) have no Van Walraven score, so the median imputation below affects a large share of the cohort.

In [256]:
final_df['van_walraven_score'].median()

np.float64(3.0)

In [257]:
final_df['van_walraven_score'] = final_df['van_walraven_score'].fillna(final_df['van_walraven_score'].median())

In [258]:
final_df['van_walraven_score'].isna().sum()

np.int64(0)
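
With this many scores missing, median imputation alone hides which patients actually had a recorded score. One option, which this notebook does not apply, is to store a missingness flag before filling. The sketch below uses toy values, and the column name `van_walraven_missing` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy scores standing in for final_df['van_walraven_score'].
toy = pd.DataFrame({"van_walraven_score": [3.0, np.nan, 7.0, np.nan, 0.0]})

# Record missingness first, because it cannot be recovered after imputation.
toy["van_walraven_missing"] = toy["van_walraven_score"].isna().astype("int64")

# Then fill the gaps with the median of the observed scores.
toy["van_walraven_score"] = toy["van_walraven_score"].fillna(
    toy["van_walraven_score"].median()
)
print(toy)
```

Whether such a flag helps or hurts downstream clustering depends on the method and is a modelling decision, not something this sketch settles.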

Lymph Node Metastases

In [259]:
final_df['lymph_met'].value_counts()

lymph_met
0    5676
1     785
Name: count, dtype: Int64

Thoracic Metastases

In [260]:
final_df['thoracic_met'].value_counts()

thoracic_met
0    5926
1     535
Name: count, dtype: Int64

Liver Metastases

In [261]:
final_df['liver_met'].value_counts()

liver_met
0    6095
1     366
Name: count, dtype: Int64

Bone Metastases

In [262]:
final_df['bone_met'].value_counts()

bone_met
0    5666
1     795
Name: count, dtype: Int64

Other GI Metastases

In [263]:
final_df['other_gi_met'].value_counts()

other_gi_met
0    6274
1     187
Name: count, dtype: Int64

Other Combined Metastases

In [264]:
final_df['other_combined_met'].value_counts()

other_combined_met
0    5900
1     561
Name: count, dtype: Int64
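
The binary flags checked one at a time above (medications and metastatic sites) can also be summarised in a single table. This is a sketch on toy data; only two flag columns are shown, and the summary column names `n_positive` and `percent` are assumptions for illustration.

```python
import pandas as pd

# Toy 0/1 flags using the same nullable Int64 dtype as the notebook's columns.
toy = pd.DataFrame({
    "liver_met": pd.array([0, 1, 0, 0], dtype="Int64"),
    "bone_met": pd.array([1, 1, 0, 0], dtype="Int64"),
})
flag_cols = ["liver_met", "bone_met"]

# For 0/1 columns, the sum is the positive count and the mean is the prevalence.
summary = pd.DataFrame({
    "n_positive": toy[flag_cols].sum(),
    "percent": toy[flag_cols].mean() * 100,
})
print(summary)
```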

Neutrophil count

In [266]:
final_df['neutrophil'].isna().sum()

np.int64(1518)

In [267]:
final_df['neutrophil'].median()

np.float64(5.6)

In [268]:
final_df['neutrophil'] = final_df['neutrophil'].fillna(final_df['neutrophil'].median())

In [269]:
final_df['neutrophil'].isna().sum()

np.int64(0)

Neutrophil Max

In [270]:
final_df['neutrophil_max'].isna().sum()

np.int64(1464)

In [271]:
final_df['neutrophil_max'].median()

np.float64(6.3945)

In [272]:
final_df['neutrophil_max'] = final_df['neutrophil_max'].fillna(final_df['neutrophil_max'].median())

In [273]:
final_df['neutrophil_max'].isna().sum()

np.int64(0)

Neutrophil Min

In [274]:
final_df['neutrophil_min'].isna().sum()

np.int64(1464)

In [275]:
final_df['neutrophil_min'].median()

np.float64(4.8)

In [276]:
final_df['neutrophil_min'] = final_df['neutrophil_min'].fillna(final_df['neutrophil_min'].median())

In [277]:
final_df['neutrophil_min'].isna().sum()

np.int64(0)

Lymphocyte count

In [278]:
final_df['lymphocyte'].isna().sum()

np.int64(853)

In [279]:
final_df['lymphocyte'].median()

np.float64(1.38)

In [280]:
final_df['lymphocyte'] = final_df['lymphocyte'].fillna(final_df['lymphocyte'].median())

In [281]:
final_df['lymphocyte'].isna().sum()

np.int64(0)

Lymphocyte Max

In [282]:
final_df['lymphocyte_max'].isna().sum()

np.int64(812)

In [283]:
final_df['lymphocyte_max'].median()

np.float64(1.6)

In [284]:
final_df['lymphocyte_max'] = final_df['lymphocyte_max'].fillna(final_df['lymphocyte_max'].median())

In [285]:
final_df['lymphocyte_max'].isna().sum()

np.int64(0)

Lymphocyte Min

In [286]:
final_df['lymphocyte_min'].isna().sum()

np.int64(812)

In [287]:
final_df['lymphocyte_min'].median()

np.float64(1.2)

In [288]:
final_df['lymphocyte_min'] = final_df['lymphocyte_min'].fillna(final_df['lymphocyte_min'].median())

In [289]:
final_df['lymphocyte_min'].isna().sum()

np.int64(0)

In [290]:
final_df.isna().sum()

PatientID                0
PrimarySite_lower        0
SmokingStatus            0
Surgery                  0
days_diagnosis_to_adv    0
                        ..
lymphocyte_min           0
stage_0_III              0
stage_IV                 0
pdl1                     0
fgfr                     0
Length: 62, dtype: int64

In [292]:
final_df.isna().sum()[final_df.isna().sum() > 0]

Series([], dtype: int64)
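
The empty Series above confirms by eye that no column has missing values. The same check can be turned into a guard that stops the run before export if anything is still missing. This is a sketch on toy data; the helper name `columns_with_missing` is hypothetical.

```python
import numpy as np
import pandas as pd


def columns_with_missing(frame: pd.DataFrame) -> pd.Series:
    """Return the per-column NaN counts, keeping only columns with at least one NaN."""
    counts = frame.isna().sum()  # computed once, then filtered
    return counts[counts > 0]


# Toy frame with one remaining gap.
toy = pd.DataFrame({"age": [80, 77], "albumin": [38.0, np.nan]})
remaining = columns_with_missing(toy)
print(remaining)

# After filling the gap, the guard passes silently.
toy["albumin"] = toy["albumin"].fillna(toy["albumin"].median())
assert columns_with_missing(toy).empty, "Unexpected missing values before export"
```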

In [293]:
final_df.head(2)

Unnamed: 0,PatientID,PrimarySite_lower,SmokingStatus,Surgery,days_diagnosis_to_adv,age,ecog_index,ecog_newly_gte2,bmi_index,percent_change_weight,...,neutrophil,neutrophil_max,neutrophil_min,lymphocyte,lymphocyte_max,lymphocyte_min,stage_0_III,stage_IV,pdl1,fgfr
0,F0016E985D839,0,1,1,0.0,80,2,0,20.557003,0.433996,...,3.3,5.4,3.3,1.6,2.4,1.6,0,1,0,0
1,F001E5D4C6FA0,1,1,1,274.0,77,unknown,0,26.080805,-0.980341,...,5.6,6.3945,4.8,2.4,2.9,2.4,0,0,0,0


## Export dataframe

In [300]:
final_df.to_csv('../outputs/final_df_for_analysis.csv', index = False)

In [301]:
# Save dtypes
final_df.dtypes.apply(lambda x: x.name).to_csv('../outputs/final_df_dtypes.csv')
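
Saving the dtypes alongside the data lets a later notebook restore them when reading the CSV back in, since nullable `Int64` columns would otherwise load as plain `int64` or `float64`. The sketch below round-trips a toy frame through a temporary directory using the same two export calls; the file names are placeholders. It assumes every saved dtype name is one that `pd.read_csv` accepts in its `dtype` argument, which holds for `object`, `float64`, and `Int64` but not for datetime columns.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for final_df.
toy = pd.DataFrame({
    "PatientID": ["A1", "B2"],
    "liver_met": pd.array([0, 1], dtype="Int64"),
    "albumin": [38.0, 41.5],
})

with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "toy.csv"
    dtypes_path = Path(tmp) / "toy_dtypes.csv"

    # Same two export steps as above, applied to the toy frame.
    toy.to_csv(data_path, index=False)
    toy.dtypes.apply(lambda x: x.name).to_csv(dtypes_path)

    # The dtype file has column names as its index and a single value column.
    dtype_map = pd.read_csv(dtypes_path, index_col=0).iloc[:, 0].to_dict()

    # Reload the data with the recorded dtypes.
    reloaded = pd.read_csv(data_path, dtype=dtype_map)

print(reloaded.dtypes)
```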