# DataProcessorUrothelial Tutorial

The DataProcessorUrothelial package streamlines the processing of Flatiron Health's advanced urothelial cancer datasets. It provides specialized functions to clean and standardize CSV files containing clinical data (e.g., demographics, vitals, labs, medications, ICD codes). Each processing function handles the format-specific requirements and common data quality issues of its file and returns a standardized dataframe. These dataframes can then be merged into a single, comprehensive dataset ready for analysis.

In [1]:
# Development setup only 
# These lines are only needed when running this notebook from the repository
import sys
from pathlib import Path
src_path = str(Path('../src').resolve())
sys.path.append(src_path)

## Setup 
To begin using the package, import the required modules and initialize the processor.

In [2]:
from urothelial_processor import DataProcessorUrothelial
from merge_utils import merge_dataframes

import pandas as pd

In [3]:
# Initialize class 
processor = DataProcessorUrothelial()

In [4]:
# Import dataframe with index date of interest for PatientIDs
df = pd.read_csv('../data/Enhanced_AdvUrothelial.csv')

In [5]:
# For our example we'll select 1000 random patients to clean
df = df.sample(1000)
ids = df.PatientID.to_list()
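
Note that `df.sample(1000)` draws a different subset on every run. If a repeatable subset is wanted, the `random_state` argument of pandas' `DataFrame.sample` fixes the draw. A minimal sketch on toy data (the `cohort` frame and the seed value 42 are illustrative and not part of the package):

```python
import pandas as pd

# Toy stand-in for the cohort file, with one row per PatientID
cohort = pd.DataFrame({"PatientID": [f"P{i:03d}" for i in range(50)]})

# A fixed random_state makes the draw repeatable across notebook runs
sample_a = cohort.sample(n=10, random_state=42)
sample_b = cohort.sample(n=10, random_state=42)

ids_a = sample_a["PatientID"].to_list()
ids_b = sample_b["PatientID"].to_list()
print(ids_a == ids_b)  # True: both draws select the same patients
```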

## Cleaning CSV Files 
Process individual data files using specialized functions. Each function handles the data cleaning and standardization specific to its CSV file.

In [6]:
# Process Enhanced_AdvUrothelial.csv
enhanced_df = processor.process_enhanced_adv(file_path = '../data/Enhanced_AdvUrothelial.csv', 
                                             patient_ids = ids)

2025-02-20 14:23:05,946 - INFO - Successfully read Enhanced_AdvUrothelial.csv file with shape: (13129, 13) and unique PatientIDs: 13129
2025-02-20 14:23:05,946 - INFO - Filtering for 1000 specific PatientIDs
2025-02-20 14:23:05,948 - INFO - Successfully filtered Enhanced_AdvUrothelial.csv file with shape: (1000, 13) and unique PatientIDs: 1000
2025-02-20 14:23:05,957 - INFO - Successfully processed Enhanced_AdvUrothelial.csv file with final shape: (1000, 13) and unique PatientIDs: 1000


In [7]:
# Process Demographics.csv 
demographics_df = processor.process_demographics(file_path = '../data/Demographics.csv',
                                                 index_date_df = df,
                                                 index_date_column = 'AdvancedDiagnosisDate')

2025-02-20 14:23:05,967 - INFO - Successfully read Demographics.csv file with shape: (13129, 6) and unique PatientIDs: 13129
2025-02-20 14:23:05,975 - INFO - Successfully processed Demographics.csv file with final shape: (1000, 6) and unique PatientIDs: 1000


In [8]:
# Process Practice.csv
practice_df = processor.process_practice(file_path = '../data/Practice.csv',
                                         patient_ids = ids)

2025-02-20 14:23:05,986 - INFO - Successfully read Practice.csv file with shape: (14181, 4) and unique PatientIDs: 13129
2025-02-20 14:23:05,986 - INFO - Filtering for 1000 specific PatientIDs
2025-02-20 14:23:05,988 - INFO - Successfully filtered Practice.csv file with shape: (1072, 4) and unique PatientIDs: 1000
2025-02-20 14:23:06,003 - INFO - Successfully processed Practice.csv file with final shape: (1000, 2) and unique PatientIDs: 1000


In [9]:
# Process Enhanced_Mortality_V2.csv and use visit, telemedicine, biomarker,
# oral, and progression data to determine the censoring date
mortality_df = processor.process_mortality(file_path = '../data/Enhanced_Mortality_V2.csv',
                                           index_date_df = df, 
                                           index_date_column = 'AdvancedDiagnosisDate',
                                           visit_path = '../data/Visit.csv', 
                                           telemedicine_path = '../data/Telemedicine.csv', 
                                           biomarker_path = '../data/Enhanced_AdvUrothelialBiomarkers.csv', 
                                           oral_path = '../data/Enhanced_AdvUrothelial_Orals.csv',
                                           progression_path = '../data/Enhanced_AdvUrothelial_Progression.csv')

2025-02-20 14:23:06,009 - INFO - Successfully read Enhanced_Mortality_V2.csv file with shape: (9040, 2) and unique PatientIDs: 9040
2025-02-20 14:23:06,017 - INFO - Successfully merged Enhanced_Mortality_V2.csv df with index_date_df resulting in shape: (1000, 3) and unique PatientIDs: 1000
2025-02-20 14:23:06,368 - INFO - The follwing columns ['last_visit_date', 'last_oral_date', 'last_biomarker_date', 'last_progression_date'] are used to calculate the last EHR date
2025-02-20 14:23:06,371 - INFO - Successfully processed Enhanced_Mortality_V2.csv file with final shape: (1000, 3) and unique PatientIDs: 1000. There are 0 out of 1000 patients with missing duration values
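
The log above mentions a "last EHR date" built from several last-activity columns, and a duration value per patient. One plausible way such values could be derived is sketched below on toy data. This is an assumption about the approach, not the package's confirmed implementation: the `DateOfDeath`, `event`, and `duration` names and the rule "censor at the latest recorded activity" are illustrative.

```python
import pandas as pd

# Hypothetical toy data; the last_* column names mirror the log message
toy = pd.DataFrame({
    "PatientID": ["A", "B", "C"],
    "AdvancedDiagnosisDate": pd.to_datetime(["2020-01-01", "2020-03-01", "2020-06-01"]),
    "DateOfDeath": pd.to_datetime(["2020-12-31", None, None]),
    "last_visit_date": pd.to_datetime(["2020-12-01", "2021-02-01", None]),
    "last_oral_date": pd.to_datetime([None, "2021-03-15", None]),
    "last_biomarker_date": pd.to_datetime(["2020-02-01", None, "2020-07-01"]),
    "last_progression_date": pd.to_datetime([None, "2020-09-01", None]),
})

activity_cols = ["last_visit_date", "last_oral_date",
                 "last_biomarker_date", "last_progression_date"]

# Latest recorded activity per patient; missing dates (NaT) are skipped
toy["last_ehr_date"] = toy[activity_cols].max(axis=1)

# event = 1 if a death date is recorded, 0 if the patient is censored
toy["event"] = toy["DateOfDeath"].notna().astype(int)

# Follow-up ends at death when observed, otherwise at the last EHR date
end_date = toy["DateOfDeath"].fillna(toy["last_ehr_date"])
toy["duration"] = (end_date - toy["AdvancedDiagnosisDate"]).dt.days

print(toy[["PatientID", "event", "duration"]])
```

A patient with neither a death date nor any recorded activity would end up with a missing duration under this scheme, which is the kind of case the final log line counts.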


In [10]:
# Process Enhanced_AdvUrothelialBiomarkers.csv
biomarkers_df = processor.process_biomarkers(file_path = '../data/Enhanced_AdvUrothelialBiomarkers.csv',
                                             index_date_df = df, 
                                             index_date_column = 'AdvancedDiagnosisDate',
                                             days_before = None, 
                                             days_after = 14)

# Replace missing values with "unknown"
biomarkers_df['pdl1_status'] = biomarkers_df['pdl1_status'].fillna('unknown')

# Add the "unknown" category only if it is not already present;
# calling add_categories with an existing category raises a ValueError
if 'unknown' not in biomarkers_df['fgfr_status'].cat.categories:
    biomarkers_df['fgfr_status'] = biomarkers_df['fgfr_status'].cat.add_categories('unknown')
biomarkers_df['fgfr_status'] = biomarkers_df['fgfr_status'].fillna('unknown')

2025-02-20 14:23:06,400 - INFO - Successfully read Enhanced_AdvUrothelialBiomarkers.csv file with shape: (9924, 19) and unique PatientIDs: 4251
2025-02-20 14:23:06,406 - INFO - Successfully merged Enhanced_AdvUrothelialBiomarkers.csv df with index_date_df resulting in shape: (738, 20) and unique PatientIDs: 326
2025-02-20 14:23:06,417 - INFO - Successfully processed Enhanced_AdvUrothelialBiomarkers.csv file with final shape: (1000, 4) and unique PatientIDs: 1000

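
In pandas, `cat.add_categories` raises a `ValueError` when the category being added already exists, so filling a categorical column with a label is safest behind a membership check. The helper below is a small sketch of that pattern on toy data; `fill_unknown` is a hypothetical name and not part of the package.

```python
import pandas as pd

def fill_unknown(series: pd.Series, label: str = "unknown") -> pd.Series:
    """Fill missing values with `label`, adding the category only if needed."""
    if isinstance(series.dtype, pd.CategoricalDtype):
        if label not in series.cat.categories:
            series = series.cat.add_categories(label)
    return series.fillna(label)

# Case 1: "unknown" is already a category, so no category is added
has_unknown = pd.Series(["positive", None, "unknown", "negative"], dtype="category")
filled_a = fill_unknown(has_unknown)

# Case 2: "unknown" is not yet a category, so it is added before filling
no_unknown = pd.Series(["positive", None, "negative"], dtype="category")
filled_b = fill_unknown(no_unknown)

print(filled_a.tolist())
print(filled_b.tolist())
```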

In [None]:
# Process ECOG.csv
ecog_df = processor.process_ecog(file_path = '../data/ECOG.csv', 
                                 index_date_df = df,
                                 index_date_column = 'AdvancedDiagnosisDate',
                                 days_before = 90,
                                 days_after = 0,
                                 days_before_further = 180)

In [None]:
# Process Vitals.csv
vitals_df = processor.process_vitals(file_path = '../data/Vitals.csv',
                                     index_date_df = df,
                                     index_date_column = 'AdvancedDiagnosisDate',
                                     weight_days_before = 90,
                                     weight_days_after = 0,
                                     vital_summary_lookback = 180)

In [None]:
# Process Lab.csv
labs_df = processor.process_labs(file_path = '../data/Lab.csv',
                                 index_date_df = df,
                                 index_date_column = 'AdvancedDiagnosisDate',
                                 days_before = 90,
                                 days_after = 0,
                                 summary_lookback = 180)

In [None]:
# Process MedicationAdministration.csv
medications_df = processor.process_medications(file_path = '../data/MedicationAdministration.csv',
                                               index_date_df = df,
                                               index_date_column = 'AdvancedDiagnosisDate',
                                               days_before = 90,
                                               days_after = 0)
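
Several of the calls above take `days_before` and `days_after` arguments. One plausible reading, which is an assumption here and not confirmed by the package, is that they bound a window around the index date, with `days_before = None` meaning no lower bound. The sketch below illustrates that reading on toy data; `within_window` and `days_from_index` are hypothetical names.

```python
import pandas as pd

def within_window(records, index_dates, date_col, index_col,
                  days_before=None, days_after=0):
    """Keep records dated within [index - days_before, index + days_after]."""
    merged = records.merge(index_dates, on="PatientID", how="inner")
    # Signed offset in days: negative values fall before the index date
    merged["days_from_index"] = (merged[date_col] - merged[index_col]).dt.days
    mask = merged["days_from_index"] <= days_after
    if days_before is not None:
        mask &= merged["days_from_index"] >= -days_before
    return merged[mask]

records = pd.DataFrame({
    "PatientID": ["A", "A", "A"],
    "RecordDate": pd.to_datetime(["2019-09-01", "2019-12-01", "2020-01-10"]),
})
index_dates = pd.DataFrame({
    "PatientID": ["A"],
    "AdvancedDiagnosisDate": pd.to_datetime(["2020-01-01"]),
})

# Only records from the 90 days up to and including the index date
recent = within_window(records, index_dates, "RecordDate",
                       "AdvancedDiagnosisDate", days_before=90, days_after=0)

# No lower bound: every record on or before the index date
all_prior = within_window(records, index_dates, "RecordDate",
                          "AdvancedDiagnosisDate", days_before=None, days_after=0)

print(recent["days_from_index"].tolist())
print(all_prior["days_from_index"].tolist())
```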

In [None]:
# Process Diagnosis.csv
diagnosis_df = processor.process_diagnosis(file_path = '../data/Diagnosis.csv',
                                           index_date_df = df,
                                           index_date_column = 'AdvancedDiagnosisDate',
                                           days_before = None,
                                           days_after = 0)

In [None]:
# Process Insurance.csv 
insurance_df = processor.process_insurance(file_path = '../data/Insurance.csv',
                                           index_date_df = df,
                                           index_date_column = 'AdvancedDiagnosisDate',
                                           days_before = None,
                                           days_after = 0)

## Merge Processed Dataframes
Merge the processed dataframes into a single analysis-ready dataset.

In [None]:
merged_data = merge_dataframes(enhanced_df, 
                               demographics_df, 
                               practice_df, 
                               mortality_df, 
                               biomarkers_df, 
                               ecog_df, 
                               vitals_df,
                               labs_df,
                               medications_df, 
                               diagnosis_df, 
                               insurance_df)
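
For intuition, combining several one-row-per-patient dataframes can be expressed as a chain of joins on `PatientID`. The sketch below shows that general pattern with `functools.reduce`. It is not the source of `merge_dataframes`; the function name `merge_on_patient_id`, the outer join, and the duplicate check are all assumptions made for illustration.

```python
from functools import reduce

import pandas as pd

def merge_on_patient_id(*frames: pd.DataFrame) -> pd.DataFrame:
    """Join dataframes on PatientID, requiring one row per patient in each."""
    for frame in frames:
        if frame["PatientID"].duplicated().any():
            raise ValueError("Each dataframe must have one row per PatientID")
    # Fold the frames pairwise into one wide table
    return reduce(
        lambda left, right: left.merge(right, on="PatientID", how="outer"),
        frames,
    )

demo_a = pd.DataFrame({"PatientID": [1, 2], "age": [70, 65]})
demo_b = pd.DataFrame({"PatientID": [2, 3], "ecog": [1, 0]})
demo_c = pd.DataFrame({"PatientID": [1, 2, 3], "albumin": [3.9, 4.1, 3.5]})

merged_demo = merge_on_patient_id(demo_a, demo_b, demo_c)
print(merged_demo.shape)
```

Because every processed dataframe in this tutorial covers the same 1000 patients, the choice between an inner and an outer join would not change the row count here.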

In [None]:
merged_data.columns.to_list()

In [None]:
for col, dtype in merged_data.dtypes.items():
    print(f"{col}: {dtype}")