# DataProcessorUrothelial Tutorial

The DataProcessorUrothelial package streamlines the processing of Flatiron Health's advanced urothelial cancer datasets. It provides specialized functions to clean and standardize CSV files containing clinical data (e.g., demographics, vitals, labs, medications, ICD codes). Each processing function handles format-specific requirements and common data quality issues and outputs a standardized dataframe; these dataframes can then be merged into a single, comprehensive dataset ready for analysis.

In [1]:
# Development setup only 
# These lines are only needed when running this notebook from the example folder in the repository
import sys
import os
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

## Setup 
To begin using the package, import the required modules and initialize the processor.

In [2]:
from flatiron_cleaner.urothelial import DataProcessorUrothelial
from flatiron_cleaner.merge_utils import merge_dataframes

import pandas as pd

In [3]:
# Initialize class 
processor = DataProcessorUrothelial()

In [4]:
# Import dataframe with index date of interest for PatientIDs
df = pd.read_csv('../data_uro/Enhanced_AdvUrothelial.csv')

In [5]:
# For our example we'll select 1000 random patients to clean
df = df.sample(n = 1000, random_state = 42)
ids = df.PatientID.to_list()
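
The sampling step above fixes `random_state` so the same 1000 patients are selected on every run, and `ids` is taken from the sampled dataframe so that both stay in sync. A minimal sketch of that behavior on toy data (the dataframe and its values are made up for illustration and are not taken from the Flatiron files):

```python
import pandas as pd

# Toy stand-in for Enhanced_AdvUrothelial.csv (hypothetical values)
toy = pd.DataFrame({
    "PatientID": [f"P{i:03d}" for i in range(10)],
    "AdvancedDiagnosisDate": pd.date_range("2020-01-01", periods=10, freq="D"),
})

# The same random_state yields the same sampled rows on every run
sample_a = toy.sample(n=4, random_state=42)
sample_b = toy.sample(n=4, random_state=42)
same_sample = sample_a.equals(sample_b)

# Sampling without replacement (the default) keeps each patient at most once
ids_toy = sample_a["PatientID"].to_list()
ids_unique = len(ids_toy) == len(set(ids_toy))
```

Because later cells pass either `ids` or the sampled `df` to the processing functions, keeping both derived from the same sample helps ensure every processed dataframe covers the same patients.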

## Cleaning CSV Files 
Process individual data files using specialized functions. Each function handles the data cleaning and standardization specific to its CSV file.

In [6]:
# Process Enhanced_AdvUrothelial.csv
enhanced_df = processor.process_enhanced(file_path = '../data_uro/Enhanced_AdvUrothelial.csv', 
                                         patient_ids = ids)

2025-03-01 15:21:40,449 - INFO - Successfully read Enhanced_AdvUrothelial.csv file with shape: (13129, 13) and unique PatientIDs: 13129
2025-03-01 15:21:40,450 - INFO - Filtering for 1000 specific PatientIDs
2025-03-01 15:21:40,452 - INFO - Successfully filtered Enhanced_AdvUrothelial.csv file with shape: (1000, 13) and unique PatientIDs: 1000
2025-03-01 15:21:40,461 - INFO - Successfully processed Enhanced_AdvUrothelial.csv file with final shape: (1000, 13) and unique PatientIDs: 1000


In [7]:
# Process Demographics.csv 
demographics_df = processor.process_demographics(file_path = '../data_uro/Demographics.csv',
                                                 index_date_df = df,
                                                 index_date_column = 'AdvancedDiagnosisDate')

2025-03-01 15:21:40,474 - INFO - Successfully read Demographics.csv file with shape: (13129, 6) and unique PatientIDs: 13129
2025-03-01 15:21:40,484 - INFO - Successfully processed Demographics.csv file with final shape: (1000, 6) and unique PatientIDs: 1000


In [8]:
# Process Practice.csv
practice_df = processor.process_practice(file_path = '../data_uro/Practice.csv',
                                         patient_ids = ids)

2025-03-01 15:21:40,495 - INFO - Successfully read Practice.csv file with shape: (14181, 4) and unique PatientIDs: 13129
2025-03-01 15:21:40,495 - INFO - Filtering for 1000 specific PatientIDs
2025-03-01 15:21:40,497 - INFO - Successfully filtered Practice.csv file with shape: (1082, 4) and unique PatientIDs: 1000
2025-03-01 15:21:40,513 - INFO - Successfully processed Practice.csv file with final shape: (1000, 2) and unique PatientIDs: 1000


In [9]:
# Process Enhanced_AdvUrothelialBiomarkers.csv
biomarkers_df = processor.process_biomarkers(file_path = '../data_uro/Enhanced_AdvUrothelialBiomarkers.csv',
                                             index_date_df = df, 
                                             index_date_column = 'AdvancedDiagnosisDate',
                                             days_before = None, 
                                             days_after = 14)

# Replace missing biomarker statuses with "unknown"
biomarkers_df['PDL1_status'] = biomarkers_df['PDL1_status'].fillna('unknown')

# FGFR_status is categorical, so 'unknown' must be added as a category before filling
biomarkers_df['FGFR_status'] = biomarkers_df['FGFR_status'].cat.add_categories('unknown')
biomarkers_df['FGFR_status'] = biomarkers_df['FGFR_status'].fillna('unknown')

2025-03-01 15:21:40,533 - INFO - Successfully read Enhanced_AdvUrothelialBiomarkers.csv file with shape: (9924, 19) and unique PatientIDs: 4251
2025-03-01 15:21:40,539 - INFO - Successfully merged Enhanced_AdvUrothelialBiomarkers.csv df with index_date_df resulting in shape: (768, 20) and unique PatientIDs: 329
2025-03-01 15:21:40,550 - INFO - Successfully processed Enhanced_AdvUrothelialBiomarkers.csv file with final shape: (1000, 4) and unique PatientIDs: 1000
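
The biomarker cell calls `cat.add_categories('unknown')` on `FGFR_status` before filling it. This is a general pandas requirement for categorical columns: a fill value must already be a registered category. The sketch below illustrates this on a toy series; it does not use the package, and it makes no claim about the dtype of `PDL1_status`, which presumably either already includes `'unknown'` as a category or is not categorical.

```python
import pandas as pd

# Toy categorical series with one missing value
status = pd.Series(["positive", None, "negative"], dtype="category")

# Filling with a label that is not a registered category raises an error
# (TypeError or ValueError, depending on the pandas version)
try:
    status.fillna("unknown")
    raised = False
except (TypeError, ValueError):
    raised = True

# Registering the category first makes the fill succeed
filled = status.cat.add_categories("unknown").fillna("unknown")
```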


In [10]:
# Process ECOG.csv
ecog_df = processor.process_ecog(file_path = '../data_uro/ECOG.csv', 
                                 index_date_df = df,
                                 index_date_column = 'AdvancedDiagnosisDate',
                                 days_before = 90,
                                 days_after = 0,
                                 days_before_further = 180)

2025-03-01 15:21:40,616 - INFO - Successfully read ECOG.csv file with shape: (184794, 4) and unique PatientIDs: 9933
2025-03-01 15:21:40,642 - INFO - Successfully merged ECOG.csv df with index_date_df resulting in shape: (14342, 5) and unique PatientIDs: 768
2025-03-01 15:21:40,653 - INFO - Successfully processed ECOG.csv file with final shape: (1000, 3) and unique PatientIDs: 1000
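
Several processing functions take `days_before` and `days_after` arguments that define a window around the index date. The sketch below shows one way such a window could be applied with plain pandas. It is an illustration under stated assumptions, not the package's implementation: the `filter_window` helper, the `EcogDate` and `EcogValue` column names, and the reading of `days_before = None` as "no lower bound" are all hypothetical.

```python
import pandas as pd

# Hypothetical ECOG-like records (column names are assumptions)
records = pd.DataFrame({
    "PatientID": ["A", "A", "A", "B"],
    "EcogDate": pd.to_datetime(["2020-01-10", "2020-03-20", "2020-04-05", "2019-06-01"]),
    "EcogValue": [1, 2, 3, 0],
})
index_dates = pd.DataFrame({
    "PatientID": ["A", "B"],
    "AdvancedDiagnosisDate": pd.to_datetime(["2020-04-01", "2020-04-01"]),
})

def filter_window(records, index_dates, date_col, index_col, days_before, days_after):
    """Keep records dated within [index - days_before, index + days_after]."""
    merged = records.merge(index_dates, on="PatientID", how="inner")
    # Signed offset in days: negative means before the index date
    delta = (merged[date_col] - merged[index_col]).dt.days
    mask = delta <= days_after
    if days_before is not None:  # None is treated here as an unbounded lookback
        mask &= delta >= -days_before
    return merged[mask]

in_window = filter_window(records, index_dates, "EcogDate",
                          "AdvancedDiagnosisDate", days_before=90, days_after=0)
unbounded = filter_window(records, index_dates, "EcogDate",
                          "AdvancedDiagnosisDate", days_before=None, days_after=0)
```

With a 90-day lookback, only patient A's two pre-index records survive; with no lower bound, patient B's older record is kept as well.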


In [11]:
# Process Vitals.csv
vitals_df = processor.process_vitals(file_path = '../data_uro/Vitals.csv',
                                     index_date_df = df,
                                     index_date_column = 'AdvancedDiagnosisDate',
                                     weight_days_before = 90,
                                     days_after = 0,
                                     vital_summary_lookback = 180, 
                                     abnormal_reading_threshold = 1)

2025-03-01 15:21:44,330 - INFO - Successfully read Vitals.csv file with shape: (3604484, 16) and unique PatientIDs: 13109
2025-03-01 15:21:45,496 - INFO - Successfully merged Vitals.csv df with index_date_df resulting in shape: (264027, 17) and unique PatientIDs: 997
2025-03-01 15:21:45,600 - INFO - Successfully processed Vitals.csv file with final shape: (1000, 8) and unique PatientIDs: 1000


In [None]:
# Process Lab.csv
labs_df = processor.process_labs(file_path = '../data_uro/Lab.csv',
                                 index_date_df = df,
                                 index_date_column = 'AdvancedDiagnosisDate',
                                 additional_loinc_mappings = {'crp': ['1988-5']},
                                 days_before = 90,
                                 days_after = 0,
                                 summary_lookback = 180)

In [None]:
# Process MedicationAdministration.csv
medications_df = processor.process_medications(file_path = '../data_uro/MedicationAdministration.csv',
                                               index_date_df = df,
                                               index_date_column = 'AdvancedDiagnosisDate',
                                               days_before = 90,
                                               days_after = 0)

In [None]:
# Process Diagnosis.csv
diagnosis_df = processor.process_diagnosis(file_path = '../data_uro/Diagnosis.csv',
                                           index_date_df = df,
                                           index_date_column = 'AdvancedDiagnosisDate',
                                           days_before = None,
                                           days_after = 0)

In [None]:
# Process Insurance.csv 
insurance_df = processor.process_insurance(file_path = '../data_uro/Insurance.csv',
                                           index_date_df = df,
                                           index_date_column = 'AdvancedDiagnosisDate',
                                           days_before = None,
                                           days_after = 0,
                                           missing_date_strategy = 'liberal')

In [None]:
# Process Enhanced_Mortality_V2.csv; visit, telemedicine, biomarker, oral, and progression data are used to determine the censoring date
mortality_df = processor.process_mortality(file_path = '../data_uro/Enhanced_Mortality_V2.csv',
                                           index_date_df = df, 
                                           index_date_column = 'AdvancedDiagnosisDate',
                                           visit_path = '../data_uro/Visit.csv', 
                                           telemedicine_path = '../data_uro/Telemedicine.csv', 
                                           biomarkers_path = '../data_uro/Enhanced_AdvUrothelialBiomarkers.csv', 
                                           oral_path = '../data_uro/Enhanced_AdvUrothelial_Orals.csv',
                                           progression_path = '../data_uro/Enhanced_AdvUrothelial_Progression.csv',
                                           drop_dates = True)
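
The mortality cell draws on several activity files to determine a censoring date. A common approach in survival analysis is to censor patients without a recorded death at their last known activity date. The sketch below shows how a last-activity date could be derived from multiple tables; it is an assumption about the general idea, not a description of how `process_mortality` works, and the `last_activity` helper and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical activity tables (column names are assumptions)
visits = pd.DataFrame({
    "PatientID": ["A", "A", "B"],
    "VisitDate": pd.to_datetime(["2021-01-05", "2021-03-01", "2020-12-15"]),
})
orals = pd.DataFrame({
    "PatientID": ["A"],
    "StartDate": pd.to_datetime(["2021-04-10"]),
})

def last_activity(sources):
    """Return the latest date per PatientID across (dataframe, date_column) pairs."""
    stacked = pd.concat(
        [frame[["PatientID", col]].rename(columns={col: "activity_date"})
         for frame, col in sources],
        ignore_index=True,
    )
    return stacked.groupby("PatientID")["activity_date"].max()

last_dates = last_activity([(visits, "VisitDate"), (orals, "StartDate")])
```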

## Merge Processed Dataframes
Merge the processed dataframes into a single analysis-ready dataset.

In [None]:
merged_data = merge_dataframes(enhanced_df, 
                               demographics_df, 
                               practice_df, 
                               biomarkers_df, 
                               ecog_df, 
                               vitals_df,
                               labs_df,
                               medications_df, 
                               diagnosis_df, 
                               insurance_df,
                               mortality_df)
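
Each processed dataframe above ends with one row per `PatientID`, which is what makes a clean merge possible. The sketch below shows how such a merge could be written with plain pandas. It is not the implementation of `merge_dataframes`; the `merge_on_patient_id` helper, the outer join, and the toy columns are assumptions for illustration.

```python
from functools import reduce

import pandas as pd

# Toy processed tables, one row per patient (hypothetical values)
demo = pd.DataFrame({"PatientID": ["A", "B", "C"], "age": [61, 70, 55]})
ecog = pd.DataFrame({"PatientID": ["A", "B", "C"], "ecog": [1, 0, 2]})
vitals = pd.DataFrame({"PatientID": ["A", "B", "C"], "weight": [80.0, 65.5, 72.3]})

def merge_on_patient_id(*frames):
    """Fold any number of one-row-per-patient dataframes into one."""
    return reduce(
        # validate="one_to_one" raises if either side repeats a PatientID
        lambda left, right: left.merge(right, on="PatientID", how="outer",
                                       validate="one_to_one"),
        frames,
    )

toy_merged = merge_on_patient_id(demo, ecog, vitals)
```

The `validate` argument turns an accidental many-to-one join, which would silently duplicate rows, into an immediate error.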

In [None]:
merged_data.columns.to_list()

In [None]:
for col, dtype in merged_data.dtypes.items():
    print(f"{col}: {dtype}")
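
Beyond listing columns and dtypes, it can be useful to confirm that the merged dataset still has one row per patient and to see how much data is missing in each column. The following sketch does this on a toy dataframe; the `summarize` helper is hypothetical and not part of the package, and the same function could be applied to `merged_data`.

```python
import pandas as pd

def summarize(frame, id_col="PatientID"):
    """Report row count, ID uniqueness, and the fraction of missing values per column."""
    return {
        "rows": len(frame),
        "unique_ids": frame[id_col].nunique(),
        "duplicate_ids": int(frame[id_col].duplicated().sum()),
        "missing_fraction": frame.isna().mean().round(2).to_dict(),
    }

# Toy merged dataset (hypothetical values)
toy = pd.DataFrame({
    "PatientID": ["A", "B", "C", "D"],
    "ecog": [1, None, 0, 2],
    "weight": [80.0, 65.5, None, None],
})
report = summarize(toy)
```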