OBJECTIVE: Create a dataframe of relevant variables for a cohort of patients who would have been eligible for the KEYNOTE-361 and IMvigor130 clinical trials (i.e., patients who received first-line treatment with atezolizumab, pembrolizumab, or platinum-based chemotherapy).

In [5]:
import numpy as np
import pandas as pd

In [6]:
# Function that returns number of rows and count of unique PatientIDs for a dataframe. 
def row_ID(dataframe):
    row = dataframe.shape[0]
    ID = dataframe['PatientID'].nunique()
    return row, ID

1. Load the full cohort previously defined

In [10]:
cohort = pd.read_csv('../checkpoint_trial/full_cohort.csv')

In [11]:
cohort.sample(3)

Unnamed: 0,PatientID,LineName,StartDate
6021,F7AC49EC24CEC,chemo,2017-06-01
3980,F2EB76CA2B96E,chemo,2018-02-20
304,F98178DDF95CD,Pembrolizumab,2021-03-25


In [12]:
cohort_IDs = cohort['PatientID'].to_numpy()

In [13]:
len(cohort_IDs)

6461

2. Load the demographics file and clean it

In [14]:
demographics = pd.read_csv('../data/Demographics.csv')

In [15]:
demographics.sample(3)

Unnamed: 0,PatientID,BirthYear,Gender,Race,Ethnicity,State
8619,FC73D6C2DC102,1949,M,White,Not Hispanic or Latino,MI
5088,FD70EF51E180B,1942,M,White,Not Hispanic or Latino,CT
3801,F1240D494CFDD,1953,M,White,,AR


In [16]:
# .copy() avoids a SettingWithCopyWarning when new columns are added below
demographics = demographics[demographics['PatientID'].isin(cohort_IDs)].copy()

In [17]:
row_ID(demographics)

(6461, 6461)

In [18]:
demographics.sample(3)

Unnamed: 0,PatientID,BirthYear,Gender,Race,Ethnicity,State
9250,F4CB4A4CFB600,1952,M,White,,NE
1802,FA301DB837F97,1958,F,Other Race,Hispanic or Latino,
11005,FB719B1B2F8DD,1961,M,White,Not Hispanic or Latino,PA


Race:
The recommendation from Flatiron is to do the following:
This approach specifically addresses the nuance of “Hispanic or Latino” appearing as both a Race and Ethnicity value in Flatiron data, as detailed in the Race and Ethnicity Overview. In order to align with OMB Standards, Flatiron recommends treating “Hispanic or Latino” as an ethnicity, using the following logic:

-Identify patients with a Race value of “Hispanic or Latino”
-For these patients, recode Race to NULL and Ethnicity to “Hispanic or Latino”

The resulting dataset will remove all instances of “Hispanic or Latino” as a Race, leaving “White,” “Black or African American,” “Asian,” “Other Race,” and NULL as possible Race values. 

In [19]:
# If Race is 'Hispanic or Latino', recode it as the string 'NULL' (unknown); otherwise leave the value unchanged.
demographics['race'] = (
    np.where(demographics['Race'] == 'Hispanic or Latino', 'NULL', demographics['Race'])
)

In [20]:
# Missing race values are also recoded as the string 'NULL'
demographics['race'] = demographics['race'].fillna('NULL')

In [21]:
demographics['race'].value_counts().sum()

np.int64(6461)

In [22]:
race_counts = demographics['Race'].value_counts()
print(race_counts)

Race
White                        4561
Other Race                    822
Black or African American     306
Asian                          86
Hispanic or Latino              8
Name: count, dtype: int64


In [23]:
race_counts = demographics['race'].value_counts()
print(race_counts)

race
White                        4561
Other Race                    822
NULL                          686
Black or African American     306
Asian                          86
Name: count, dtype: int64


Ethnicity:

In [24]:
# If Race is 'Hispanic or Latino', code ethnicity as 'hispanic_latino'; otherwise keep the original Ethnicity value.
demographics['ethnicity'] = (
    np.where(demographics['Race'] == 'Hispanic or Latino', 'hispanic_latino', demographics['Ethnicity'])
)

In [25]:
demographics['ethnicity'] = demographics['ethnicity'].fillna('NULL')

In [26]:
demographics['ethnicity'] = demographics['ethnicity'].replace({'Hispanic or Latino': 'hispanic_latino'})

In [27]:
demographics['ethnicity'] = demographics['ethnicity'].replace({'Not Hispanic or Latino': 'not_hispanic_latino'})

In [28]:
ethnicity_counts = demographics['ethnicity'].value_counts()
print(ethnicity_counts)

ethnicity
not_hispanic_latino    4896
NULL                   1318
hispanic_latino         247
Name: count, dtype: int64


In [29]:
demographics = demographics.drop(columns = ['Race', 'Ethnicity'])

In [30]:
demographics.sample(3)

Unnamed: 0,PatientID,BirthYear,Gender,State,race,ethnicity
12676,FA783F4C6C50A,1939,M,VA,Other Race,not_hispanic_latino
12157,F0016E985D839,1943,F,TN,White,not_hispanic_latino
5318,FBA9C1C7DA1A2,1936,M,FL,White,


Per Flatiron, it is recommended that race and ethnicity be combined into a single variable as follows:
-Hispanic or Latino 
-Not Hispanic or Latino, White 
-Not Hispanic or Latino, Black or African American 
-Not Hispanic or Latino, Asian 
-Not Hispanic or Latino, Other Race 
-Not Hispanic or Latino, Unknown Race 
-Unknown  
Creating this column is deferred for now: deciding how to handle cases where either race or ethnicity is unknown adds some complexity, and race is unlikely to be central to the question at hand.
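If the combined variable is needed later, one way it could be derived from the recoded `race` and `ethnicity` columns is sketched below. This is only a sketch under one assumed convention for unknowns (Hispanic ethnicity takes precedence over race, and unknown ethnicity maps to "Unknown" regardless of race). The helper name `combine_race_ethnicity` and the toy data are hypothetical, not part of the Flatiron recommendation.

```python
import pandas as pd

def combine_race_ethnicity(df):
    """Combine recoded 'race' and 'ethnicity' columns into one label.

    Assumed convention (one of several possible):
      - ethnicity 'hispanic_latino'     -> 'Hispanic or Latino'
      - ethnicity 'not_hispanic_latino' -> 'Not Hispanic or Latino, <race>',
        with race 'NULL' shown as 'Unknown Race'
      - ethnicity 'NULL'                -> 'Unknown'
    """
    race_label = df['race'].replace({'NULL': 'Unknown Race'})
    # Start from 'Unknown' and overwrite rows with a known ethnicity.
    combined = pd.Series('Unknown', index=df.index, dtype=object)
    not_hisp = df['ethnicity'] == 'not_hispanic_latino'
    combined.loc[not_hisp] = 'Not Hispanic or Latino, ' + race_label[not_hisp]
    combined.loc[df['ethnicity'] == 'hispanic_latino'] = 'Hispanic or Latino'
    return combined

# Toy data for illustration only.
toy = pd.DataFrame({
    'race': ['White', 'NULL', 'Asian', 'Other Race'],
    'ethnicity': ['not_hispanic_latino', 'not_hispanic_latino',
                  'hispanic_latino', 'NULL'],
})
toy['race_ethnicity'] = combine_race_ethnicity(toy)
```

Other conventions for the unknown cases are equally defensible, so the choice should be made once the role of race in the analysis is clear.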

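BirthYear: convert to age at the start of first-line treatment (age = year of StartDate minus BirthYear, so it can be off by up to one year because only the birth year is available)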

In [31]:
# Loaded here for convenience; cleaned in section 3 below
enhanced_adv = pd.read_csv('../data/Enhanced_AdvUrothelial.csv')

In [32]:
demographics = pd.merge(demographics, cohort[['PatientID', 'StartDate']], on = 'PatientID')

In [33]:
demographics.sample(3)

Unnamed: 0,PatientID,BirthYear,Gender,State,race,ethnicity,StartDate
3273,F9E1592A24FC4,1934,M,FL,,,2019-06-10
6249,F8E4A4819F60F,1950,F,VA,Other Race,,2013-09-10
3223,F4936BB6C02D7,1943,M,FL,White,not_hispanic_latino,2016-09-15
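The merge in In [32] relies on both dataframes having exactly one row per PatientID. A minimal sketch, on toy data, of how the `validate` argument of `pd.merge` could guard that assumption; the frames `left`, `right`, and `right_dup` are illustrative stand-ins for `demographics` and `cohort`.

```python
import pandas as pd

left = pd.DataFrame({'PatientID': ['A', 'B', 'C'],
                     'BirthYear': [1950, 1942, 1961]})
right = pd.DataFrame({'PatientID': ['A', 'B', 'C'],
                      'StartDate': ['2017-06-01', '2018-02-20', '2021-03-25']})

# validate='one_to_one' raises pandas.errors.MergeError if either side
# has duplicate keys, so the merge cannot silently add rows.
merged = pd.merge(left, right, on='PatientID', how='inner',
                  validate='one_to_one')

# With a duplicated key on the right side, the same call fails loudly.
right_dup = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    pd.merge(left, right_dup, on='PatientID', validate='one_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```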


In [34]:
print(demographics['StartDate'].dtype)

object


In [35]:
print(demographics['StartDate'].isnull().sum())

0


In [36]:
print(demographics['StartDate'].apply(type).unique())

[<class 'str'>]


In [37]:
demographics['StartDate'] = pd.to_datetime(demographics['StartDate'], format="%Y-%m-%d")

In [38]:
print(demographics['StartDate'].dtype)

datetime64[ns]


In [39]:
demographics.loc[:, 'age'] = demographics['StartDate'].dt.year - demographics['BirthYear']

In [40]:
demographics.sample(3)

Unnamed: 0,PatientID,BirthYear,Gender,State,race,ethnicity,StartDate,age
4892,FFC5358AE3BFF,1943,F,NY,White,not_hispanic_latino,2017-07-17,74
2305,FA4479A3F5C9F,1957,M,CA,White,not_hispanic_latino,2018-05-24,61
5625,F4FD64657B54C,1935,M,TN,White,not_hispanic_latino,2011-07-15,76


In [41]:
demographics = demographics.drop(columns = ['BirthYear', 'StartDate'])

In [42]:
demographics.sample(3)

Unnamed: 0,PatientID,Gender,State,race,ethnicity,age
1428,FF937FD974FED,M,AL,White,,67
1596,F3D61FAC3AD20,M,AR,White,not_hispanic_latino,66
2050,F527A66986491,F,CA,White,not_hispanic_latino,84


Practice type

In [43]:
practice = pd.read_csv('../data/Practice.csv')

In [44]:
practice = practice[practice['PatientID'].isin(cohort_IDs)]

In [45]:
row_ID(practice)

(7036, 6461)

In [46]:
practice_counts = practice['PracticeType'].value_counts()
print(practice_counts)

PracticeType
COMMUNITY    5751
ACADEMIC     1285
Name: count, dtype: int64


Some patients have more than one PracticeType value (7036 rows for 6461 patients); these patients are labeled "BOTH".

In [47]:
#First determine how many practice types are present for each patient
practice_unique_count = (
    practice.groupby('PatientID')['PracticeType'].agg('nunique')
    .to_frame()
    .reset_index()
    .rename(columns = {'PracticeType': 'n_type'})
)

In [48]:
practice_n = pd.merge(practice, practice_unique_count, on = 'PatientID')

In [49]:
# Label patients with more than one practice type as "BOTH"
practice_n['p_type'] = (
    np.where(practice_n['n_type'] == 1, practice_n['PracticeType'], 'BOTH')
)

In [50]:
practice_n = (
    practice_n.drop_duplicates(subset = ['PatientID'], keep = 'first')
    .filter(items = ['PatientID', 'p_type'])
)

In [51]:
row_ID(practice_n)

(6461, 6461)

In [52]:
practice_n.sample(3)

Unnamed: 0,PatientID,p_type
4266,F06C6B6F957C7,COMMUNITY
3962,F3C68AE1738AC,COMMUNITY
155,F568EC5EE78C9,COMMUNITY


In [53]:
practice_counts = practice_n['p_type'].value_counts()
print(practice_counts)

p_type
COMMUNITY    5179
ACADEMIC      816
BOTH          466
Name: count, dtype: int64
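The count, merge, and label steps above could also be collapsed into a single groupby pass. The sketch below is an alternative on toy data, not the method used in this notebook; the helper name `collapse_practice_type` is hypothetical.

```python
import pandas as pd

def collapse_practice_type(practice):
    """Return one row per PatientID with a 'p_type' column.

    A single distinct PracticeType is kept as-is; anything else is
    labeled 'BOTH', mirroring the np.where logic used above.
    """
    def label(types):
        unique = types.dropna().unique()
        return unique[0] if len(unique) == 1 else 'BOTH'

    return (
        practice.groupby('PatientID')['PracticeType']
        .agg(label)
        .rename('p_type')
        .reset_index()
    )

# Toy data for illustration only.
practice_toy = pd.DataFrame({
    'PatientID': ['P1', 'P1', 'P2', 'P3', 'P3'],
    'PracticeType': ['COMMUNITY', 'ACADEMIC', 'COMMUNITY',
                     'ACADEMIC', 'ACADEMIC'],
})
collapsed = collapse_practice_type(practice_toy)
```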


In [54]:
demographics = pd.merge(demographics, practice_n, on = 'PatientID')

In [55]:
demographics.sample(3)

Unnamed: 0,PatientID,Gender,State,race,ethnicity,age,p_type
1516,FA8756DA68AF4,M,AL,White,not_hispanic_latino,81,COMMUNITY
3358,F92E398E3085E,M,FL,White,hispanic_latino,66,COMMUNITY
4513,F44AA39257FEF,F,NJ,White,not_hispanic_latino,76,COMMUNITY


Gender:

In [56]:
gender_counts = demographics['Gender'].value_counts()
print(gender_counts)

Gender
M    4726
F    1733
Name: count, dtype: int64


Gender is missing for 2 patients (6461 - 4726 - 1733 = 2). No imputation for now; missingness will be addressed in the final step.
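`value_counts()` drops missing values by default, which is why the counts above sum to 6459 rather than 6461. A minimal sketch on toy data showing how `dropna=False` and `isna()` make the missing rows visible:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
gender_toy = pd.Series(['M', 'F', np.nan, 'M'], name='Gender')

counts_default = gender_toy.value_counts()              # NaN rows excluded
counts_with_na = gender_toy.value_counts(dropna=False)  # NaN gets its own row
n_missing = int(gender_toy.isna().sum())
```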

In [57]:
# Convert column name to snake_case
demographics = demographics.rename(columns = {'Gender': 'gender'})

State:

In [58]:
# Group states into U.S. Census Bureau regions; Puerto Rico (PR) is coded as 'unknown'
state_dict = { 
    'ME': 'northeast', 
    'NH': 'northeast',
    'VT': 'northeast', 
    'MA': 'northeast',
    'CT': 'northeast',
    'RI': 'northeast',  
    'NY': 'northeast', 
    'NJ': 'northeast', 
    'PA': 'northeast', 
    'IL': 'midwest', 
    'IN': 'midwest', 
    'MI': 'midwest', 
    'OH': 'midwest', 
    'WI': 'midwest',
    'IA': 'midwest',
    'KS': 'midwest',
    'MN': 'midwest',
    'MO': 'midwest', 
    'NE': 'midwest',
    'ND': 'midwest',
    'SD': 'midwest',
    'DE': 'south',
    'FL': 'south',
    'GA': 'south',
    'MD': 'south',
    'NC': 'south', 
    'SC': 'south',
    'VA': 'south',
    'DC': 'south',
    'WV': 'south',
    'AL': 'south',
    'KY': 'south',
    'MS': 'south',
    'TN': 'south',
    'AR': 'south',
    'LA': 'south',
    'OK': 'south',
    'TX': 'south',
    'AZ': 'west',
    'CO': 'west',
    'ID': 'west',
    'MT': 'west',
    'NV': 'west',
    'NM': 'west',
    'UT': 'west',
    'WY': 'west',
    'AK': 'west',
    'CA': 'west',
    'HI': 'west',
    'OR': 'west',
    'WA': 'west',
    'PR': 'unknown'
}

demographics['region'] = demographics['State'].map(state_dict)

In [59]:
demographics['region'] = demographics['region'].fillna('unknown')

In [60]:
region_counts = demographics['region'].value_counts()
print(region_counts)

region
south        2578
unknown      1523
west          809
northeast     799
midwest       752
Name: count, dtype: int64
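The 'unknown' region pools three different cases: patients with no State recorded, codes explicitly mapped to 'unknown' (PR), and any codes absent from the mapping. A sketch on toy data of how these could be told apart before State is dropped; `state_dict_toy` is a small illustrative subset of the mapping and 'GU' is a made-up unmapped code.

```python
import numpy as np
import pandas as pd

# Toy subset of the mapping; the notebook's state_dict would be used in practice.
state_dict_toy = {'NY': 'northeast', 'FL': 'south', 'PR': 'unknown'}
state = pd.Series(['NY', 'FL', np.nan, 'PR', 'GU'], name='State')

# Patients with no State recorded at all.
n_missing_state = int(state.isna().sum())

# Codes present in the data that have no key in the mapping.
unmapped_codes = sorted(set(state.dropna()) - set(state_dict_toy))

# Same two steps as the notebook: map, then fill unmapped/missing with 'unknown'.
region = state.map(state_dict_toy).fillna('unknown')
```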


In [61]:
demographics = demographics.drop(columns = ['State'])

In [62]:
demographics.sample(3)

Unnamed: 0,PatientID,gender,race,ethnicity,age,p_type,region
3690,F54CA14F1A83E,F,White,not_hispanic_latino,56,COMMUNITY,south
2116,FC1CA9BA613EE,M,White,not_hispanic_latino,76,COMMUNITY,west
1095,FE0521B794539,M,White,not_hispanic_latino,84,ACADEMIC,unknown


In [63]:
%whos DataFrame

Variable                Type         Data/Info
----------------------------------------------
cohort                  DataFrame              PatientID      <...>\n[6461 rows x 3 columns]
demographics            DataFrame              PatientID gende<...>\n[6461 rows x 7 columns]
enhanced_adv            DataFrame               PatientID Diag<...>[13129 rows x 13 columns]
practice                DataFrame               PatientID     <...>\n[7036 rows x 4 columns]
practice_n              DataFrame              PatientID     p<...>\n[6461 rows x 2 columns]
practice_unique_count   DataFrame              PatientID  n_ty<...>\n[6461 rows x 2 columns]


In [64]:
#Keep cohort, demographics, enhanced_adv
del practice
del practice_n
del practice_unique_count

3. Clean enhanced_adv dataset

In [65]:
row_ID(enhanced_adv)

(13129, 13129)

In [66]:
# Filter for patients in the cohort; .copy() avoids a SettingWithCopyWarning when columns are added below
enhanced_adv = enhanced_adv[enhanced_adv['PatientID'].isin(cohort_IDs)].copy()

In [67]:
row_ID(enhanced_adv)

(6461, 6461)

In [68]:
enhanced_adv.sample(3)

Unnamed: 0,PatientID,DiagnosisDate,AdvancedDiagnosisDate,PrimarySite,DiseaseGrade,GroupStage,TStage,NStage,MStage,SmokingStatus,Surgery,SurgeryDate,SurgeryType
8982,F1EE95DFEF1CB,2017-03-24,2019-05-07,Bladder,High grade (G2/G3/G4),Stage II,T3,N0,M0,No history of smoking,True,2017-04-06,Cystoprostatectomy
11223,FAC796573054B,2017-08-23,2017-08-23,Urethra,High grade (G2/G3/G4),Stage IV,Unknown/not documented,Unknown/not documented,Unknown/not documented,No history of smoking,True,2018-02-12,"Cystectomy, NOS"
10042,F9E6B487BF1BD,2018-01-01,2022-08-19,Bladder,High grade (G2/G3/G4),Unknown/not documented,T2,N0,M0,No history of smoking,True,2019-05-01,Complete (radical) cystectomy


GroupStage

In [69]:
stage_counts = enhanced_adv['GroupStage'].value_counts()
print(stage_counts)

GroupStage
Unknown/not documented    2978
Stage IV                  2055
Stage II                   459
Stage III                  263
Stage IVB                  172
Stage IIIA                 164
Stage IVA                  119
Stage I                    113
Stage IIIB                 104
Stage 0is                   20
Stage 0a                    14
Name: count, dtype: int64


In [70]:
# Dictionary for regrouping stages
stage_dict = { 
    'Stage 0': '0',
    'Stage 0is': '0',
    'Stage 0a': '0',
    'Stage I': 'I',
    'Stage II': 'II',
    'Stage III': 'III',
    'Stage IIIA': 'III',
    'Stage IIIB': 'III',
    'Stage IV': 'IV',
    'Stage IVA': 'IV',
    'Stage IVB': 'IV',
    'Unknown/not documented': 'unknown'
}

In [71]:
enhanced_adv['stage'] = enhanced_adv['GroupStage'].map(stage_dict)

In [72]:
stage_counts = enhanced_adv['stage'].value_counts()
print(stage_counts)

stage
unknown    2978
IV         2346
III         531
II          459
I           113
0            34
Name: count, dtype: int64


In [73]:
enhanced_adv = enhanced_adv.drop(columns = ['GroupStage'])

AdvancedDiagnosisDate

In [74]:
enhanced_adv = enhanced_adv.rename(columns = {'AdvancedDiagnosisDate': 'adv_diagnosis_date'})

In [75]:
enhanced_adv['adv_diagnosis_date'] = pd.to_datetime(enhanced_adv['adv_diagnosis_date'], format="%Y-%m-%d")

In [76]:
#confirm datetime conversion successful
print(enhanced_adv['adv_diagnosis_date'].dtype)

datetime64[ns]


In [77]:
enhanced_adv.loc[:, 'adv_diagnosis_year'] = enhanced_adv['adv_diagnosis_date'].dt.year

In [78]:
enhanced_adv.sample(3)

Unnamed: 0,PatientID,DiagnosisDate,adv_diagnosis_date,PrimarySite,DiseaseGrade,TStage,NStage,MStage,SmokingStatus,Surgery,SurgeryDate,SurgeryType,stage,adv_diagnosis_year
5495,F748865D83873,2013-01-01,2015-03-04,Bladder,High grade (G2/G3/G4),Unknown/not documented,NX,M0,History of smoking,True,2015-05-15,Cystoprostatectomy,unknown,2015
5021,F798B5EC4E3AB,2022-10-26,2022-10-26,Bladder,High grade (G2/G3/G4),T2,N3,M0,No history of smoking,False,,,III,2022
4273,FC0AA4FC866F8,2019-12-03,2021-02-12,Bladder,High grade (G2/G3/G4),T2,Unknown/not documented,M0,No history of smoking,False,,,unknown,2021


DiagnosisDate

In [79]:
enhanced_adv = enhanced_adv.rename(columns = {'DiagnosisDate': 'diagnosis_date'})

In [80]:
enhanced_adv['diagnosis_date'] = pd.to_datetime(enhanced_adv['diagnosis_date'], format="%Y-%m-%d")

In [81]:
# Missing diagnosis_date is replaced with adv_diagnosis_date; other dates are left untouched.
enhanced_adv['diagnosis_date'] = (
    enhanced_adv['diagnosis_date'].fillna(enhanced_adv['adv_diagnosis_date'])
)

In [82]:
# Confirm diagnosis_date is still datetime after filling missing values
print(enhanced_adv['diagnosis_date'].dtype)

datetime64[ns]


Time from initial diagnosis date to advanced diagnosis date (in days)

In [83]:
enhanced_adv.loc[:, 'delta_adv_diagnosis'] = (enhanced_adv['adv_diagnosis_date'] - enhanced_adv['diagnosis_date']).dt.days

In [84]:
enhanced_adv.sample(3)

Unnamed: 0,PatientID,diagnosis_date,adv_diagnosis_date,PrimarySite,DiseaseGrade,TStage,NStage,MStage,SmokingStatus,Surgery,SurgeryDate,SurgeryType,stage,adv_diagnosis_year,delta_adv_diagnosis
5347,FB6DBAEFD47BA,2020-01-01,2022-03-31,Bladder,Unknown/not documented,Unknown/not documented,Unknown/not documented,M0,History of smoking,False,,,unknown,2022,820
11607,F96F96D93E5A3,2012-05-18,2012-05-18,Renal Pelvis,High grade (G2/G3/G4),TX,N2,M1,History of smoking,False,,,IV,2012,0
9342,FFC19DE00759D,2020-09-24,2020-09-24,Bladder,High grade (G2/G3/G4),T3b,N2,M0,History of smoking,True,2020-10-30,Partial cystectomy,III,2020,0
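Because missing diagnosis dates were filled with the advanced diagnosis date, those rows get an interval of 0 days. A sketch on toy data of the same calculation plus a sanity check for negative intervals (advanced diagnosis recorded before initial diagnosis); whether any such rows exist in the real data is not checked here.

```python
import pandas as pd

# Toy data for illustration only; the second row has no initial diagnosis date.
dates = pd.DataFrame({
    'diagnosis_date': pd.to_datetime(['2015-04-16', None, '2020-09-24']),
    'adv_diagnosis_date': pd.to_datetime(['2017-02-21', '2016-09-02', '2020-09-24']),
})

# Same fill rule as the notebook: missing diagnosis_date takes adv_diagnosis_date.
dates['diagnosis_date'] = dates['diagnosis_date'].fillna(dates['adv_diagnosis_date'])
dates['delta_adv_diagnosis'] = (
    dates['adv_diagnosis_date'] - dates['diagnosis_date']
).dt.days

# Negative values would mean advanced diagnosis precedes initial diagnosis.
n_negative = int((dates['delta_adv_diagnosis'] < 0).sum())
```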


PrimarySite

In [85]:
site_counts = enhanced_adv['PrimarySite'].value_counts()
print(site_counts)

PrimarySite
Bladder         4919
Renal Pelvis     897
Ureter           599
Urethra           46
Name: count, dtype: int64


In [86]:
enhanced_adv = enhanced_adv.rename(columns = {'PrimarySite': 'primary_site'})

In [87]:
enhanced_adv['primary_site'] = enhanced_adv['primary_site'].replace({
    'Bladder': 'lower_tract',
    'Urethra': 'lower_tract',
    'Renal Pelvis': 'upper_tract',
    'Ureter': 'upper_tract'
})

In [88]:
site_counts = enhanced_adv['primary_site'].value_counts()
print(site_counts)

primary_site
lower_tract    4965
upper_tract    1496
Name: count, dtype: int64


DiseaseGrade

In [89]:
enhanced_adv['DiseaseGrade'].value_counts()

DiseaseGrade
High grade (G2/G3/G4)     5441
Unknown/not documented     711
Low grade (G1)             309
Name: count, dtype: int64

In [90]:
enhanced_adv = enhanced_adv.rename(columns = {'DiseaseGrade': 'disease_grade'})

In [91]:
enhanced_adv['disease_grade'] = enhanced_adv['disease_grade'].replace({
    'High grade (G2/G3/G4)': 'high_grade',
    'Unknown/not documented': 'unknown',
    'Low grade (G1)': 'low_grade'
})

In [92]:
enhanced_adv['disease_grade'].value_counts()

disease_grade
high_grade    5441
unknown        711
low_grade      309
Name: count, dtype: int64

TStage

In [93]:
enhanced_adv['TStage'].value_counts()

TStage
Unknown/not documented    2226
T2                        1140
T3                         738
T1                         720
T4                         295
T4a                        232
T3a                        200
T3b                        196
T2b                        192
T2a                        158
TX                         136
Ta                         126
T4b                         54
Tis                         42
T0                           6
Name: count, dtype: int64

In [94]:
enhanced_adv = enhanced_adv.rename(columns = {'TStage': 't_stage'})

In [95]:
# Dictionary for regrouping t stages
t_stage_dict = { 
    'Unknown/not documented': 'unknown',
    'T2': 'T2',
    'T3': 'T3',
    'T1': 'T1',
    'T4': 'T4',
    'T4a': 'T4',
    'T3a': 'T3',
    'T3b': 'T3',
    'T2b': 'T2',
    'T2a': 'T2',
    'TX': 'unknown',
    'Ta': 'Ta',
    'T4b': 'T4',
    'Tis': 'Tis',
    'T0': 'T0'
    
}

enhanced_adv['t_stage'] = enhanced_adv['t_stage'].map(t_stage_dict)

In [96]:
enhanced_adv['t_stage'].value_counts()

t_stage
unknown    2362
T2         1490
T3         1134
T1          720
T4          581
Ta          126
Tis          42
T0            6
Name: count, dtype: int64

NStage

In [97]:
enhanced_adv['NStage'].value_counts()

NStage
Unknown/not documented    3107
N0                        1315
NX                         743
N2                         612
N1                         497
N3                         187
Name: count, dtype: int64

In [98]:
enhanced_adv = enhanced_adv.rename(columns = {'NStage': 'n_stage'})

In [99]:
# Dictionary for regrouping n stages
n_stage_dict = { 
    'Unknown/not documented': 'unknown',
    'N0': 'N0',
    'NX': 'unknown',
    'N2': 'N2',
    'N1': 'N1',
    'N3': 'N3'
}

enhanced_adv['n_stage'] = enhanced_adv['n_stage'].map(n_stage_dict)

In [100]:
enhanced_adv['n_stage'].value_counts()

n_stage
unknown    3850
N0         1315
N2          612
N1          497
N3          187
Name: count, dtype: int64

MStage

In [101]:
enhanced_adv['MStage'].value_counts()

MStage
M0                        2630
Unknown/not documented    1878
M1                        1330
MX                         393
M1b                        139
M1a                         91
Name: count, dtype: int64

In [102]:
enhanced_adv = enhanced_adv.rename(columns = {'MStage': 'm_stage'})

In [103]:
# Dictionary for regrouping m stages
m_stage_dict = { 
    'M0': 'M0',
    'Unknown/not documented': 'unknown',
    'M1': 'M1',
    'MX': 'unknown',
    'M1b': 'M1',
    'M1a': 'M1'
}

enhanced_adv['m_stage'] = enhanced_adv['m_stage'].map(m_stage_dict)

In [104]:
enhanced_adv['m_stage'].value_counts()

m_stage
M0         2630
unknown    2271
M1         1560
Name: count, dtype: int64

SmokingStatus

In [105]:
enhanced_adv['SmokingStatus'].value_counts()

SmokingStatus
History of smoking        4742
No history of smoking     1677
Unknown/not documented      42
Name: count, dtype: int64

In [106]:
enhanced_adv = enhanced_adv.rename(columns = {'SmokingStatus': 'smoking_status'})

In [107]:
enhanced_adv['smoking_status'] = enhanced_adv['smoking_status'].replace({
    'History of smoking': 'smoker',
    'No history of smoking': 'never_smoker',
    'Unknown/not documented': 'unknown'
})

In [108]:
enhanced_adv['smoking_status'].value_counts()

smoking_status
smoker          4742
never_smoker    1677
unknown           42
Name: count, dtype: int64

Surgery

In [109]:
enhanced_adv['Surgery'].value_counts()

Surgery
False    3345
True     3116
Name: count, dtype: int64

In [110]:
enhanced_adv = enhanced_adv.rename(columns = {'Surgery': 'surgery_status'})

In [111]:
enhanced_adv['surgery_status'].value_counts()

surgery_status
False    3345
True     3116
Name: count, dtype: int64

In [112]:
print(enhanced_adv['surgery_status'].dtype)

bool


SurgeryDate

In [113]:
print(enhanced_adv['SurgeryDate'].dtype)

object


In [114]:
enhanced_adv = enhanced_adv.rename(columns = {'SurgeryDate': 'surgery_date'})

In [115]:
enhanced_adv['surgery_date'] = pd.to_datetime(enhanced_adv['surgery_date'], format="%Y-%m-%d")

In [116]:
print(enhanced_adv['surgery_date'].dtype)

datetime64[ns]


Rows without a surgery date are left empty (NaT) for now, which preserves the datetime dtype of the column.
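Since empty surgery dates are kept, it may be worth confirming that they line up with surgery_status. A sketch on toy data of such a consistency check; the column names follow the renamed columns above, and the toy rows are made up to show one inconsistent case.

```python
import pandas as pd

# Toy data for illustration only; the third row has surgery but no date.
surgery_toy = pd.DataFrame({
    'surgery_status': [True, False, True],
    'surgery_date': pd.to_datetime(['2017-04-06', None, None]),
})

# Surgery recorded but no date available.
status_without_date = (
    surgery_toy['surgery_status'] & surgery_toy['surgery_date'].isna()
)
# A date recorded although surgery_status is False.
date_without_status = (
    ~surgery_toy['surgery_status'] & surgery_toy['surgery_date'].notna()
)

n_status_without_date = int(status_without_date.sum())
n_date_without_status = int(date_without_status.sum())
```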

SurgeryType

In [117]:
enhanced_adv['SurgeryType'].value_counts()

SurgeryType
Cystoprostatectomy               1210
Nephroureterectomy                882
Complete (radical) cystectomy     539
Partial cystectomy                126
Nephrectomy                       124
Ureterectomy                       98
Cystectomy, NOS                    73
Other                              58
Unknown/not documented              4
Urethrectomy                        2
Name: count, dtype: int64

In [118]:
enhanced_adv = enhanced_adv.rename(columns = {'SurgeryType': 'surgery_type'})

In [119]:
# Dictionary for regrouping surgery type
surgery_type_dict = { 
    'Cystoprostatectomy': 'cystoprostatectomy',
    'Nephroureterectomy': 'nephroureterectomy',
    'Complete (radical) cystectomy': 'radical_cystectomy',
    'Partial cystectomy': 'partial_cystectomy',
    'Nephrectomy': 'nephrectomy',
    'Ureterectomy': 'ureterectomy',
    'Cystectomy, NOS': 'cystectomy_nos',
    'Other': 'other_surgery',
    'Unknown/not documented': 'unknown_surgery',
    'Urethrectomy': 'urethrectomy'
}

enhanced_adv['surgery_type'] = enhanced_adv['surgery_type'].map(surgery_type_dict)

In [120]:
enhanced_adv['surgery_type'] = enhanced_adv['surgery_type'].fillna('no_surgery')

In [121]:
enhanced_adv['surgery_type'].value_counts()

surgery_type
no_surgery            3345
cystoprostatectomy    1210
nephroureterectomy     882
radical_cystectomy     539
partial_cystectomy     126
nephrectomy            124
ureterectomy            98
cystectomy_nos          73
other_surgery           58
unknown_surgery          4
urethrectomy             2
Name: count, dtype: int64

In [122]:
#Final enhanced_adv dataframe
enhanced_adv.sample(3)

Unnamed: 0,PatientID,diagnosis_date,adv_diagnosis_date,primary_site,disease_grade,t_stage,n_stage,m_stage,smoking_status,surgery_status,surgery_date,surgery_type,stage,adv_diagnosis_year,delta_adv_diagnosis
7011,FED83F499A14E,2016-09-02,2016-09-02,lower_tract,high_grade,T2,N3,unknown,never_smoker,False,NaT,no_surgery,IV,2016,0
11134,FB96EF87D47D8,2020-03-26,2020-03-26,lower_tract,high_grade,T3,N1,M0,smoker,True,2020-03-27,radical_cystectomy,III,2020,0
11020,F82B8AA223624,2015-04-16,2017-02-21,lower_tract,high_grade,T2,N0,M0,smoker,True,2015-08-11,cystoprostatectomy,II,2017,677
