# DataProcessorUrothelial Tutorial

The DataProcessorUrothelial package streamlines the processing of Flatiron Health's advanced urothelial cancer datasets. It provides specialized functions to clean and standardize CSV files containing clinical data (e.g., demographics, vitals, labs, medications, ICD codes). Each processing function handles format-specific requirements and common data quality issues, and outputs a standardized dataframe. These dataframes can then be merged into a single, comprehensive dataset ready for analysis.

In [1]:
# Development setup only 
# These lines are only needed when running this notebook from the example folder in the repository
import sys
import os
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

## Setup 
To begin using the package, import the required modules and initialize the processor.

In [2]:
from flatiron_cleaner import DataProcessorUrothelial
from flatiron_cleaner import merge_dataframes

import pandas as pd

In [3]:
# Initialize class 
processor = DataProcessorUrothelial()

In [4]:
# Load the dataframe that holds the index date of interest for each PatientID
df = pd.read_csv('../data_uro/LineOfTherapy.csv')

In [5]:
# For our example we'll select patients receiving first-line pembrolizumab, carboplatin plus gemcitabine, or cisplatin plus gemcitabine.
df = (
    df
    .query('LineNumber == 1')
    .query('LineName == "Pembrolizumab" or LineName == "Carboplatin,Gemcitabine" or LineName == "Cisplatin,Gemcitabine"')
    [['PatientID', 'StartDate']]
)
ids = df.PatientID.to_list()

In [6]:
df.shape

(5780, 2)
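
The chained `.query` calls above can also be written with `isin`, which may be easier to extend as the list of regimens grows. The sketch below runs on a toy frame whose rows are invented for illustration; only the column names are taken from the cell above. It also checks that each patient contributes a single index date, which the later merges implicitly rely on.

```python
import pandas as pd

# Toy stand-in for LineOfTherapy.csv; the rows are invented for illustration.
lot = pd.DataFrame({
    "PatientID": ["A", "A", "B", "C", "D"],
    "LineNumber": [1, 2, 1, 1, 1],
    "LineName": ["Pembrolizumab", "Carboplatin,Gemcitabine",
                 "Cisplatin,Gemcitabine", "Atezolizumab",
                 "Carboplatin,Gemcitabine"],
    "StartDate": ["2020-01-05", "2020-06-01", "2019-03-10",
                  "2021-02-02", "2018-11-20"],
})

regimens = ["Pembrolizumab", "Carboplatin,Gemcitabine", "Cisplatin,Gemcitabine"]

# Keep first-line rows whose regimen is in the list, then the two needed columns.
cohort = lot.loc[
    (lot["LineNumber"] == 1) & lot["LineName"].isin(regimens),
    ["PatientID", "StartDate"],
]

# Each patient should contribute exactly one index date.
assert cohort["PatientID"].is_unique

cohort_ids = cohort["PatientID"].to_list()
```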

## Cleaning CSV Files 
Process individual data files using specialized functions. Each function applies the cleaning and standardization steps specific to its CSV file.

In [7]:
# Process Enhanced_AdvUrothelial.csv
enhanced_df = processor.process_enhanced(file_path = '../data_uro/Enhanced_AdvUrothelial.csv', 
                                         patient_ids = ids)

2025-03-04 10:55:36,631 - INFO - Successfully read Enhanced_AdvUrothelial.csv file with shape: (13129, 13) and unique PatientIDs: 13129
2025-03-04 10:55:36,631 - INFO - Filtering for 5780 specific PatientIDs
2025-03-04 10:55:36,635 - INFO - Successfully filtered Enhanced_AdvUrothelial.csv file with shape: (5780, 13) and unique PatientIDs: 5780
2025-03-04 10:55:36,650 - INFO - Successfully processed Enhanced_AdvUrothelial.csv file with final shape: (5780, 13) and unique PatientIDs: 5780


In [8]:
# Process Demographics.csv 
demographics_df = processor.process_demographics(file_path = '../data_uro/Demographics.csv',
                                                 index_date_df = df,
                                                 index_date_column = 'StartDate')

2025-03-04 10:55:36,662 - INFO - Successfully read Demographics.csv file with shape: (13129, 6) and unique PatientIDs: 13129
2025-03-04 10:55:36,674 - INFO - Successfully processed Demographics.csv file with final shape: (5780, 6) and unique PatientIDs: 5780


In [9]:
# Process Practice.csv
practice_df = processor.process_practice(file_path = '../data_uro/Practice.csv',
                                         patient_ids = ids)

2025-03-04 10:55:36,688 - INFO - Successfully read Practice.csv file with shape: (14181, 4) and unique PatientIDs: 13129
2025-03-04 10:55:36,689 - INFO - Filtering for 5780 specific PatientIDs
2025-03-04 10:55:36,692 - INFO - Successfully filtered Practice.csv file with shape: (6299, 4) and unique PatientIDs: 5780
2025-03-04 10:55:36,760 - INFO - Successfully processed Practice.csv file with final shape: (5780, 2) and unique PatientIDs: 5780


In [10]:
# Process Enhanced_AdvUrothelialBiomarkers.csv
biomarkers_df = processor.process_biomarkers(file_path = '../data_uro/Enhanced_AdvUrothelialBiomarkers.csv',
                                             index_date_df = df, 
                                             index_date_column = 'StartDate',
                                             days_before = None, 
                                             days_after = 14)

2025-03-04 10:55:36,780 - INFO - Successfully read Enhanced_AdvUrothelialBiomarkers.csv file with shape: (9924, 19) and unique PatientIDs: 4251
2025-03-04 10:55:36,791 - INFO - Successfully merged Enhanced_AdvUrothelialBiomarkers.csv df with index_date_df resulting in shape: (5753, 20) and unique PatientIDs: 2370
2025-03-04 10:55:36,832 - INFO - Successfully processed Enhanced_AdvUrothelialBiomarkers.csv file with final shape: (5780, 4) and unique PatientIDs: 5780


In [11]:
# Process ECOG.csv
ecog_df = processor.process_ecog(file_path = '../data_uro/ECOG.csv', 
                                 index_date_df = df,
                                 index_date_column = 'StartDate',
                                 days_before = 90,
                                 days_after = 0,
                                 days_before_further = 180)

2025-03-04 10:55:36,904 - INFO - Successfully read ECOG.csv file with shape: (184794, 4) and unique PatientIDs: 9933
2025-03-04 10:55:36,945 - INFO - Successfully merged ECOG.csv df with index_date_df resulting in shape: (106365, 5) and unique PatientIDs: 4853
2025-03-04 10:55:37,025 - INFO - Successfully processed ECOG.csv file with final shape: (5780, 3) and unique PatientIDs: 5780
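
The `days_before` and `days_after` arguments define a window around each patient's index date. The sketch below shows one plausible reading of such a window, assuming both bounds are inclusive and that the record closest to the index date is kept. It is not the package's implementation: `closest_in_window`, `RecordDate`, and `Value` are hypothetical names, and the toy rows are invented.

```python
import pandas as pd

def closest_in_window(records, index_df, days_before=90, days_after=0):
    """Keep, per patient, the record closest to the index date within the window.

    Hypothetical helper: assumes an inclusive window
    [index - days_before, index + days_after].
    """
    merged = records.merge(index_df, on="PatientID", how="inner")
    # Positive offset = record after the index date, negative = before it.
    merged["offset_days"] = (merged["RecordDate"] - merged["StartDate"]).dt.days
    in_window = merged[merged["offset_days"].between(-days_before, days_after)]
    # Sort by absolute distance so the first row per patient is the closest one.
    in_window = in_window.assign(abs_offset=in_window["offset_days"].abs())
    closest = in_window.sort_values("abs_offset").drop_duplicates("PatientID")
    # Left-merge back so patients without a qualifying record keep a row (NaN).
    return index_df[["PatientID"]].merge(
        closest[["PatientID", "Value"]], on="PatientID", how="left"
    )

index_df = pd.DataFrame({
    "PatientID": ["A", "B"],
    "StartDate": pd.to_datetime(["2020-01-10", "2020-03-01"]),
})
records = pd.DataFrame({
    "PatientID": ["A", "A", "A", "B"],
    "RecordDate": pd.to_datetime(
        ["2019-06-01", "2019-12-20", "2020-01-15", "2019-01-01"]
    ),
    "Value": [0, 1, 2, 1],
})

result = closest_in_window(records, index_df)
```

The final left merge is a design choice that matches the pattern in the logs above, where the processed frames keep one row for each of the 5780 patients even when fewer patients had records in the source file.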


In [12]:
# Process Vitals.csv
vitals_df = processor.process_vitals(file_path = '../data_uro/Vitals.csv',
                                     index_date_df = df,
                                     index_date_column = 'StartDate',
                                     weight_days_before = 90,
                                     days_after = 0,
                                     vital_summary_lookback = 180, 
                                     abnormal_reading_threshold = 1)

2025-03-04 10:55:40,708 - INFO - Successfully read Vitals.csv file with shape: (3604484, 16) and unique PatientIDs: 13109
2025-03-04 10:55:42,465 - INFO - Successfully merged Vitals.csv df with index_date_df resulting in shape: (1863569, 17) and unique PatientIDs: 5780
2025-03-04 10:55:43,413 - INFO - Successfully processed Vitals.csv file with final shape: (5780, 8) and unique PatientIDs: 5780


In [13]:
# Process Lab.csv
labs_df = processor.process_labs(file_path = '../data_uro/Lab.csv',
                                 index_date_df = df,
                                 index_date_column = 'StartDate',
                                 additional_loinc_mappings = {'crp': ['1988-5']},
                                 days_before = 90,
                                 days_after = 0,
                                 summary_lookback = 180)

2025-03-04 10:55:55,749 - INFO - Successfully read Lab.csv file with shape: (9373598, 17) and unique PatientIDs: 12700
2025-03-04 10:55:58,971 - INFO - Successfully merged Lab.csv df with index_date_df resulting in shape: (5101910, 18) and unique PatientIDs: 5735
2025-03-04 10:56:09,880 - INFO - Successfully processed Lab.csv file with final shape: (5780, 81) and unique PatientIDs: 5780


In [14]:
# Process MedicationAdministration.csv
medications_df = processor.process_medications(file_path = '../data_uro/MedicationAdministration.csv',
                                               index_date_df = df,
                                               index_date_column = 'StartDate',
                                               days_before = 90,
                                               days_after = 0)

2025-03-04 10:56:11,168 - INFO - Successfully read MedicationAdministration.csv file with shape: (997836, 11) and unique PatientIDs: 10983
2025-03-04 10:56:11,505 - INFO - Successfully merged MedicationAdministration.csv df with index_date_df resulting in shape: (527007, 12) and unique PatientIDs: 5679
2025-03-04 10:56:11,546 - INFO - Successfully processed MedicationAdministration.csv file with final shape: (5780, 9) and unique PatientIDs: 5780


In [15]:
# Process Diagnosis.csv
diagnosis_df = processor.process_diagnosis(file_path = '../data_uro/Diagnosis.csv',
                                           index_date_df = df,
                                           index_date_column = 'StartDate',
                                           days_before = None,
                                           days_after = 0)

2025-03-04 10:56:11,962 - INFO - Successfully read Diagnosis.csv file with shape: (625348, 6) and unique PatientIDs: 13129
2025-03-04 10:56:12,082 - INFO - Successfully merged Diagnosis.csv df with index_date_df resulting in shape: (286648, 7) and unique PatientIDs: 5780
2025-03-04 10:56:12,949 - INFO - Successfully processed Diagnosis.csv file with final shape: (5780, 40) and unique PatientIDs: 5780


In [16]:
# Process Insurance.csv 
insurance_df = processor.process_insurance(file_path = '../data_uro/Insurance.csv',
                                           index_date_df = df,
                                           index_date_column = 'StartDate',
                                           days_before = None,
                                           days_after = 0,
                                           missing_date_strategy = 'liberal')

2025-03-04 10:56:13,023 - INFO - Successfully read Insurance.csv file with shape: (53709, 14) and unique PatientIDs: 12391
2025-03-04 10:56:13,067 - INFO - Successfully merged Insurance.csv df with index_date_df resulting in shape: (25081, 15) and unique PatientIDs: 5478
2025-03-04 10:56:13,134 - INFO - Successfully processed Insurance.csv file with final shape: (5780, 5) and unique PatientIDs: 5780


In [17]:
# Process Enhanced_Mortality_V2.csv and use visit, telemedicine, biomarkers, oral, and progression data to determine censoring date 
mortality_df = processor.process_mortality(file_path = '../data_uro/Enhanced_Mortality_V2.csv',
                                           index_date_df = df, 
                                           index_date_column = 'StartDate',
                                           visit_path = '../data_uro/Visit.csv', 
                                           telemedicine_path = '../data_uro/Telemedicine.csv', 
                                           biomarkers_path = '../data_uro/Enhanced_AdvUrothelialBiomarkers.csv', 
                                           oral_path = '../data_uro/Enhanced_AdvUrothelial_Orals.csv',
                                           progression_path = '../data_uro/Enhanced_AdvUrothelial_Progression.csv',
                                           drop_dates = True)

2025-03-04 10:56:13,150 - INFO - Successfully read Enhanced_Mortality_V2.csv file with shape: (9040, 2) and unique PatientIDs: 9040
2025-03-04 10:56:13,162 - INFO - Successfully merged Enhanced_Mortality_V2.csv df with index_date_df resulting in shape: (5780, 3) and unique PatientIDs: 5780
2025-03-04 10:56:13,546 - INFO - The following columns ['last_visit_date', 'last_biomarker_date', 'last_oral_date', 'last_progression_date'] are used to calculate the last EHR date
2025-03-04 10:56:13,551 - INFO - Successfully processed Enhanced_Mortality_V2.csv file with final shape: (5780, 3) and unique PatientIDs: 5780. There are 0 out of 5780 patients with missing duration values
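
The log above reports that several "last activity" columns feed a single last EHR date. The sketch below shows how a censoring date and follow-up duration could be derived from such columns. It assumes the last EHR date is the row-wise maximum of the activity dates and that patients without a death date are censored on that date. The names `death_date`, `last_ehr_date`, `event`, and `duration` are hypothetical, the toy rows are invented, and the package's actual logic may differ.

```python
import pandas as pd

# Toy data; activity column names mirror those reported in the log.
df_toy = pd.DataFrame({
    "PatientID": ["A", "B"],
    "StartDate": pd.to_datetime(["2020-01-01", "2020-01-01"]),
    "death_date": pd.to_datetime(["2020-07-01", None]),
    "last_visit_date": pd.to_datetime(["2020-06-15", "2021-01-01"]),
    "last_biomarker_date": pd.to_datetime([None, "2020-03-01"]),
    "last_oral_date": pd.to_datetime([None, None]),
    "last_progression_date": pd.to_datetime(["2020-05-01", "2020-09-01"]),
})

activity_cols = [
    "last_visit_date",
    "last_biomarker_date",
    "last_oral_date",
    "last_progression_date",
]

# Assumption: the last EHR date is the latest non-missing activity date.
df_toy["last_ehr_date"] = df_toy[activity_cols].max(axis=1)

# Event indicator: 1 if a death date is recorded, 0 if censored.
df_toy["event"] = df_toy["death_date"].notna().astype(int)

# Follow-up ends at death when known, otherwise at the last EHR date.
end_date = df_toy["death_date"].fillna(df_toy["last_ehr_date"])
df_toy["duration"] = (end_date - df_toy["StartDate"]).dt.days
```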


## Merge Processed Dataframes
Merge the processed dataframes into a single analysis-ready dataset.

In [18]:
merged_data = merge_dataframes(enhanced_df, 
                               demographics_df, 
                               practice_df, 
                               biomarkers_df, 
                               ecog_df, 
                               vitals_df,
                               labs_df,
                               medications_df, 
                               diagnosis_df, 
                               insurance_df,
                               mortality_df)

2025-03-04 10:56:13,566 - INFO - Anticipated number of merges: 10
2025-03-04 10:56:13,566 - INFO - Anticipated number of columns in final dataframe presuming all columns are unique except for PatientID: 164
2025-03-04 10:56:13,569 - INFO - Dataset 1 shape: (5780, 13), unique PatientIDs: 5780
2025-03-04 10:56:13,570 - INFO - Dataset 2 shape: (5780, 6), unique PatientIDs: 5780
2025-03-04 10:56:13,571 - INFO - Dataset 3 shape: (5780, 2), unique PatientIDs: 5780
2025-03-04 10:56:13,572 - INFO - Dataset 4 shape: (5780, 4), unique PatientIDs: 5780
2025-03-04 10:56:13,572 - INFO - Dataset 5 shape: (5780, 3), unique PatientIDs: 5780
2025-03-04 10:56:13,573 - INFO - Dataset 6 shape: (5780, 8), unique PatientIDs: 5780
2025-03-04 10:56:13,575 - INFO - Dataset 7 shape: (5780, 81), unique PatientIDs: 5780
2025-03-04 10:56:13,575 - INFO - Dataset 8 shape: (5780, 9), unique PatientIDs: 5780
2025-03-04 10:56:13,576 - INFO - Dataset 9 shape: (5780, 40), unique PatientIDs: 5780
2025-03-04 10:56:13,577 -
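
The anticipated column count in the log follows from the final shapes reported for each processed file: the eleven frames have 13 + 6 + 2 + 4 + 3 + 8 + 81 + 9 + 40 + 5 + 3 = 174 columns in total, and each of the 10 merges shares the `PatientID` key, giving 174 - 10 = 164. The sketch below reproduces that arithmetic on toy frames using `functools.reduce`. `merge_on_patient_id` is a hypothetical stand-in, and the outer join is an assumption; neither is taken from the package's `merge_dataframes`.

```python
from functools import reduce

import pandas as pd

def merge_on_patient_id(*frames):
    """Hypothetical PatientID-keyed merge; not the package's code."""
    # Guard against row multiplication: every frame needs one row per patient.
    for i, frame in enumerate(frames, start=1):
        if not frame["PatientID"].is_unique:
            raise ValueError(f"Dataset {i} has duplicate PatientIDs")
    return reduce(
        lambda left, right: left.merge(right, on="PatientID", how="outer"),
        frames,
    )

# Toy frames with invented values.
a = pd.DataFrame({"PatientID": ["A", "B"], "age": [70, 65]})
b = pd.DataFrame({"PatientID": ["A", "B"], "ecog_index": ["0", "1"]})
c = pd.DataFrame({"PatientID": ["B", "A"], "albumin": [3.9, 4.2]})

merged = merge_on_patient_id(a, b, c)

# Expected width: sum of column counts minus one shared key per merge.
expected_cols = sum(f.shape[1] for f in (a, b, c)) - (3 - 1)

# The same arithmetic applied to the shapes reported in the logs above.
logged_widths = [13, 6, 2, 4, 3, 8, 81, 9, 40, 5, 3]
anticipated = sum(logged_widths) - (len(logged_widths) - 1)
```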

In [19]:
merged_data.columns.to_list()

['PatientID',
 'PrimarySite',
 'DiseaseGrade',
 'SmokingStatus',
 'Surgery',
 'GroupStage_mod',
 'TStage_mod',
 'NStage_mod',
 'MStage_mod',
 'SurgeryType_mod',
 'days_diagnosis_to_adv',
 'adv_diagnosis_year',
 'days_diagnosis_to_surgery',
 'Gender',
 'age',
 'Ethnicity_mod',
 'Race_mod',
 'region',
 'PracticeType_mod',
 'PDL1_status',
 'PDL1_percent_staining',
 'FGFR_status',
 'ecog_index',
 'ecog_newly_gte2',
 'weight_index',
 'bmi_index',
 'percent_change_weight',
 'hypotension',
 'tachycardia',
 'fevers',
 'hypoxemia',
 'albumin',
 'alp',
 'alt',
 'ast',
 'bicarbonate',
 'bun',
 'calcium',
 'chloride',
 'creatinine',
 'crp',
 'hemoglobin',
 'platelet',
 'potassium',
 'sodium',
 'total_bilirubin',
 'wbc',
 'albumin_max',
 'alp_max',
 'alt_max',
 'ast_max',
 'bicarbonate_max',
 'bun_max',
 'calcium_max',
 'chloride_max',
 'creatinine_max',
 'crp_max',
 'hemoglobin_max',
 'platelet_max',
 'potassium_max',
 'sodium_max',
 'total_bilirubin_max',
 'wbc_max',
 'albumin_min',
 'alp_min',
 

In [20]:
for col, dtype in merged_data.dtypes.items():
    print(f"{col}: {dtype}")

PatientID: object
PrimarySite: category
DiseaseGrade: category
SmokingStatus: category
Surgery: Int64
GroupStage_mod: category
TStage_mod: category
NStage_mod: category
MStage_mod: category
SurgeryType_mod: category
days_diagnosis_to_adv: float64
adv_diagnosis_year: category
days_diagnosis_to_surgery: float64
Gender: category
age: Int64
Ethnicity_mod: category
Race_mod: category
region: category
PracticeType_mod: category
PDL1_status: category
PDL1_percent_staining: category
FGFR_status: category
ecog_index: category
ecog_newly_gte2: Int64
weight_index: float64
bmi_index: float64
percent_change_weight: float64
hypotension: Int64
tachycardia: Int64
fevers: Int64
hypoxemia: Int64
albumin: float64
alp: float64
alt: float64
ast: float64
bicarbonate: float64
bun: float64
calcium: float64
chloride: float64
creatinine: float64
crp: float64
hemoglobin: float64
platelet: float64
potassium: float64
sodium: float64
total_bilirubin: float64
wbc: float64
albumin_max: float64
alp_max: float64
alt_ma