# Flatiron Health mBC: Data Wrangling Test Set

**OBJECTIVE: Create a dataframe of relevant variables using test cohort patients which will be used to validate machine learning survival models.**

**BACKGROUND: The 13 CSV Flatiron files will be cleaned in the exact same fashion for the test set patients as for the training set patients. For more information on the cleaning process refer to Notebook: Data Wrangling Training Set.**

**OUTLINE:**
1. **File cleaning for patients in test set**
2. **Merge files to create master test dataframe** 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt

In [2]:
# Function that returns number of rows and count of unique PatientIDs for a dataframe. 
def row_ID(dataframe):
    row = dataframe.shape[0]
    ID = dataframe['PatientID'].nunique()
    return row, ID

In [3]:
#Import test IDs saved from Data Wrangling Training Set file. 
test_IDs = pd.read_csv('test_IDs.csv')

In [4]:
# Array of PatientIDs in test set.
test_IDs = test_IDs['PatientID'].to_numpy()

In [5]:
len(test_IDs)

6336

## Part 1: Data wrangling

**Relevant CSV files will be imported and processed. A file is considered processed when each row corresponds to a unique patient from the test set and each column is a relevant variable for mortality prognostication. The eligibility window for collecting variables is typically defined as -90 days to +30 days from the index date, which is the date of metastatic diagnosis. An upper bound of +30 days was selected because the median time to start of first-line treatment is about 30 days from metastatic diagnosis.** 
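
**A minimal sketch of the eligibility window, using toy data rather than Flatiron records. The helper `in_window` and the column name `event_date` are hypothetical and only illustrate the -90 to +30 day rule; both bounds are assumed to be inclusive.**

```python
import pandas as pd

def in_window(event_date, index_date, lower=-90, upper=30):
    """Return a boolean Series that is True where event_date falls within
    [index_date + lower days, index_date + upper days], bounds inclusive."""
    delta = (pd.to_datetime(event_date) - pd.to_datetime(index_date)).dt.days
    return delta.between(lower, upper)

# Toy rows (hypothetical): one event per row, with the patient's metastatic date.
toy = pd.DataFrame({
    'PatientID': ['A', 'A', 'B', 'B'],
    'event_date': ['2020-01-01', '2020-06-01', '2019-12-15', '2020-02-20'],
    'met_date': ['2020-03-01', '2020-03-01', '2020-02-01', '2020-02-01'],
})

# Flag events inside the window; only flagged rows would feed the variables.
toy['eligible'] = in_window(toy['event_date'], toy['met_date'])
```

**Only the second row (92 days after the index date) falls outside the window in this toy example.**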

**The following CSV files from Flatiron will be cleaned:**
1. **Demographics and Practice**
2. **Enhanced_MetastaticBreast**
3. **Enhanced_Mortality_V2**
4. **MedicationAdministration**
5. **Enhanced_MetBreastBiomarkers**
6. **Insurance**
7. **ECOG**
8. **Vitals**
9. **Labs**
10. **Diagnosis and Enhanced_MetBreastSitesOfMet**
11. **SocialDeterminantsOfHealth**

### 1. Demographics

In [6]:
demographics = pd.read_csv('Demographics.csv')

In [7]:
demographics = demographics[demographics['PatientID'].isin(test_IDs)]

In [8]:
row_ID(demographics)

(6336, 6336)

#### Race and Ethnicity

In [9]:
# If race value is 'Hispanic or Latino', code as unknown, otherwise value unchanged.
demographics['race'] = (
    np.where(demographics['Race'] == 'Hispanic or Latino', 'unknown', demographics['Race'])
)

In [10]:
# Missing race value will be recoded as Unknown
demographics['race'] = demographics['race'].fillna('unknown')

In [11]:
demographics['race'].value_counts().sum()

6336

In [12]:
# If race value is 'Hispanic or Latino', code ethnicity as 'hispanic_latino'; otherwise keep the Ethnicity value. 
demographics['ethnicity'] = (
    np.where(demographics['Race'] == 'Hispanic or Latino', 'hispanic_latino', demographics['Ethnicity'])
)

In [13]:
demographics['ethnicity'] = demographics['ethnicity'].fillna('unknown')

In [14]:
demographics['ethnicity'] = demographics['ethnicity'].replace({'Hispanic or Latino': 'hispanic_latino'})

In [15]:
demographics = demographics.drop(columns = ['Race', 'Ethnicity'])

#### BirthYear

In [16]:
enhanced_met = pd.read_csv('Enhanced_MetastaticBreast.csv')

In [17]:
demographics = pd.merge(demographics, enhanced_met[['PatientID', 'MetDiagnosisDate']], on = 'PatientID')

In [18]:
# Assign the column directly so the dtype becomes datetime64 (needed for .dt below).
demographics['MetDiagnosisDate'] = pd.to_datetime(demographics['MetDiagnosisDate'])

In [19]:
demographics.loc[:, 'age'] = demographics['MetDiagnosisDate'].dt.year - demographics['BirthYear']

In [20]:
demographics = demographics.drop(columns = ['BirthYear', 'MetDiagnosisDate'])

#### PracticeType

In [21]:
practice = pd.read_csv('Practice.csv')

In [22]:
practice = practice[practice['PatientID'].isin(test_IDs)]

In [23]:
row_ID(practice)

(6487, 6336)

In [24]:
practice_unique_count = (
    practice.groupby('PatientID')['PracticeType'].agg('nunique')
    .to_frame()
    .reset_index()
    .rename(columns = {'PracticeType': 'n_type'})
)

In [25]:
practice_n = pd.merge(practice, practice_unique_count, on = 'PatientID')

In [26]:
practice_n['p_type'] = (
    np.where(practice_n['n_type'] == 1, practice_n['PracticeType'], 'BOTH')
)

In [27]:
practice_n = (
    practice_n.drop_duplicates(subset = ['PatientID'], keep = 'first')
    .filter(items = ['PatientID', 'p_type'])
)
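
**The practice-type logic above can be sketched on toy data as follows. This is an illustrative variant, not the notebook's exact code: it uses `groupby(...).transform('nunique')` in place of the separate count-and-merge step, and the patient IDs are hypothetical.**

```python
import numpy as np
import pandas as pd

# Toy practice table (hypothetical): patient B was seen at both practice types.
toy_practice = pd.DataFrame({
    'PatientID': ['A', 'B', 'B'],
    'PracticeType': ['COMMUNITY', 'ACADEMIC', 'COMMUNITY'],
})

# Number of distinct practice types per patient, aligned to the original rows.
n_type = toy_practice.groupby('PatientID')['PracticeType'].transform('nunique')

# Keep the single practice type, or label the patient 'BOTH' if more than one.
toy_practice['p_type'] = np.where(n_type == 1, toy_practice['PracticeType'], 'BOTH')

# Collapse to one row per patient.
p_type = (
    toy_practice.drop_duplicates(subset=['PatientID'], keep='first')
    .set_index('PatientID')['p_type']
)
```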

In [28]:
demographics = pd.merge(demographics, practice_n, on = 'PatientID')

#### Gender

In [29]:
# Impute missing gender as F, the most common value. 
demographics['Gender'] = demographics['Gender'].fillna('F')

In [30]:
demographics = demographics.rename(columns = {'Gender': 'gender'})

#### State

In [31]:
# Group states into Census-Bureau regions  
state_dict = { 
    'ME': 'northeast', 
    'NH': 'northeast',
    'VT': 'northeast', 
    'MA': 'northeast',
    'CT': 'northeast',
    'RI': 'northeast',  
    'NY': 'northeast', 
    'NJ': 'northeast', 
    'PA': 'northeast', 
    'IL': 'midwest', 
    'IN': 'midwest', 
    'MI': 'midwest', 
    'OH': 'midwest', 
    'WI': 'midwest',
    'IA': 'midwest',
    'KS': 'midwest',
    'MN': 'midwest',
    'MO': 'midwest', 
    'NE': 'midwest',
    'ND': 'midwest',
    'SD': 'midwest',
    'DE': 'south',
    'FL': 'south',
    'GA': 'south',
    'MD': 'south',
    'NC': 'south', 
    'SC': 'south',
    'VA': 'south',
    'DC': 'south',
    'WV': 'south',
    'AL': 'south',
    'KY': 'south',
    'MS': 'south',
    'TN': 'south',
    'AR': 'south',
    'LA': 'south',
    'OK': 'south',
    'TX': 'south',
    'AZ': 'west',
    'CO': 'west',
    'ID': 'west',
    'MT': 'west',
    'NV': 'west',
    'NM': 'west',
    'UT': 'west',
    'WY': 'west',
    'AK': 'west',
    'CA': 'west',
    'HI': 'west',
    'OR': 'west',
    'WA': 'west',
    'PR': 'unknown'
}

demographics['region'] = demographics['State'].map(state_dict)

In [32]:
demographics['region'] = demographics['region'].fillna('unknown')

In [33]:
demographics['region'].value_counts(dropna = False).sum()

6336

In [34]:
demographics = demographics.drop(columns = ['State'])

In [35]:
# Final test demographics table.
demographics.sample(5)

Unnamed: 0,PatientID,gender,race,ethnicity,age,p_type,region
481,F82D20F623B8F,F,White,Not Hispanic or Latino,66,COMMUNITY,west
4529,F204E1A80F6CE,F,White,Not Hispanic or Latino,53,COMMUNITY,midwest
3686,FAC4DDEF44814,F,White,Not Hispanic or Latino,61,COMMUNITY,midwest
6281,F69D56D6CAB49,F,Black or African American,Not Hispanic or Latino,79,COMMUNITY,south
97,FBB6B3C2E5055,F,White,Not Hispanic or Latino,52,ACADEMIC,unknown


In [36]:
%whos DataFrame

Variable                Type         Data/Info
----------------------------------------------
demographics            DataFrame              PatientID gende<...>\n[6336 rows x 7 columns]
enhanced_met            DataFrame               PatientID Diag<...>n[31677 rows x 4 columns]
practice                DataFrame               PatientID     <...>\n[6487 rows x 4 columns]
practice_n              DataFrame              PatientID     p<...>\n[6336 rows x 2 columns]
practice_unique_count   DataFrame              PatientID  n_ty<...>\n[6336 rows x 2 columns]


In [37]:
# Keep demographics and enhanced_met
del practice
del practice_n
del practice_unique_count

### 2. Enhanced_MetastaticBreast

In [38]:
enhanced_met = enhanced_met[enhanced_met['PatientID'].isin(test_IDs)]

In [39]:
row_ID(enhanced_met)

(6336, 6336)

#### GroupStage 

In [40]:
# Dictionary for regrouping stages
stage_dict = { 
    '0': '0',
    'I': 'I',
    'II': 'II',
    'III': 'III',
    'IV': 'IV',
    'Not documented': 'unknown'
}

enhanced_met['stage'] = enhanced_met['GroupStage'].map(stage_dict)

In [41]:
enhanced_met = enhanced_met.drop(columns = ['GroupStage'])

#### MetDiagnosisDate

In [42]:
enhanced_met = enhanced_met.rename(columns = {'MetDiagnosisDate': 'met_date'})

In [43]:
# Assign the column directly so the dtype becomes datetime64 (needed for .dt below).
enhanced_met['met_date'] = pd.to_datetime(enhanced_met['met_date'])

In [44]:
enhanced_met.loc[:, 'met_year'] = enhanced_met['met_date'].dt.year

#### DiagnosisDate

In [45]:
enhanced_met = enhanced_met.rename(columns = {'DiagnosisDate': 'diagnosis_date'})

In [46]:
# Missing diagnosis_date will be replaced with met_date; other dates will be left untouched. 
enhanced_met['diagnosis_date'] = (
    np.where(enhanced_met['diagnosis_date'].isna(), enhanced_met['met_date'], enhanced_met['diagnosis_date'])
)

In [47]:
enhanced_met['diagnosis_date'] = pd.to_datetime(enhanced_met['diagnosis_date'])

#### Time from diagnosis date to metastatic date

In [48]:
enhanced_met.loc[:, 'delta_met_diagnosis'] = (enhanced_met['met_date'] - enhanced_met['diagnosis_date']).dt.days

In [49]:
# Final enhanced_met dataframe
enhanced_met.sample(5)

Unnamed: 0,PatientID,diagnosis_date,met_date,stage,met_year,delta_met_diagnosis
18624,F3C4D37A4301C,2015-02-13,2018-11-21,III,2018,1377
15661,F48BA80BBE987,2013-01-02,2016-12-29,II,2016,1457
13557,F447BAC53A2D9,2013-05-09,2015-08-06,III,2015,819
7292,F03D9E19426F1,2002-05-07,2013-08-26,III,2013,4129
14200,FD62FCEE15DD4,2011-09-15,2014-09-29,III,2014,1110


In [50]:
%whos DataFrame

Variable       Type         Data/Info
-------------------------------------
demographics   DataFrame              PatientID gende<...>\n[6336 rows x 7 columns]
enhanced_met   DataFrame               PatientID diag<...>\n[6336 rows x 6 columns]


### 3. Enhanced_Mortality_V2

In [51]:
mortality = pd.read_csv('Enhanced_Mortality_V2.csv')

In [52]:
mortality = mortality[mortality['PatientID'].isin(test_IDs)]

In [53]:
row_ID(mortality)

(3693, 3693)

In [54]:
mortality = mortality.rename(columns = {'DateOfDeath': 'death_date'})

In [55]:
# For patients with year granularity, impute the middle of the year (i.e., July 1).
mortality['death_date'] = (
    np.where(mortality['death_date'].str.len() == 4, mortality['death_date'] + '-07-01', mortality['death_date'])
)

In [56]:
# For patients with month granularity, impute 15th of the month.
mortality['death_date'] = (
    np.where(mortality['death_date'].str.len() == 7, mortality['death_date'] + '-15', mortality['death_date'])
)

In [57]:
mortality['death_date'] = pd.to_datetime(mortality['death_date'])
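
**A minimal sketch of the same imputation rule on toy strings. The helper `impute_partial_date` is hypothetical; it assumes dates arrive as 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD' strings, as the length checks above imply.**

```python
import pandas as pd

def impute_partial_date(date_str):
    """Pad 'YYYY' to 'YYYY-07-01' and 'YYYY-MM' to 'YYYY-MM-15'.
    Full dates and missing values are returned unchanged."""
    if pd.isna(date_str):
        return date_str
    if len(date_str) == 4:
        return date_str + '-07-01'
    if len(date_str) == 7:
        return date_str + '-15'
    return date_str

# Toy death dates (hypothetical): year only, month only, full date, missing.
toy_deaths = pd.Series(['2019', '2020-03', '2021-05-09', None])
imputed = pd.to_datetime(toy_deaths.map(impute_partial_date))
```

**Missing values pass through and become `NaT`, which is what later marks a patient as censored.**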

#### Censoring

**For patients for whom a date of death is not known, the censor date can be defined either as the data cutoff date or as the last confirmed activity date. The last confirmed activity date is broadly defined as the last date at which there is evidence in the EHR that a patient is alive. Evidence of a record in at least one of the items listed below qualifies as patient-level confirmed activity:**
* Visit: VisitDate
* Telemedicine: VisitDate
* Enhanced_MetBreast_Orals: StartDate or EndDate
* Enhanced_MetBreastBiomarkers: SpecimenCollectedDate
* Enhanced_MetBreastProgression: LastClinicNoteDate or ProgressionDate
* Enhanced_MetBreastSitesOfMet: DateOfMetastasis
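
**The last confirmed activity date can be sketched as a row-wise maximum over per-source maxima. The toy tables below are hypothetical and cover only three of the sources listed above; the outer merge keeps patients who appear in only some of them.**

```python
from functools import reduce

import pandas as pd

# Per-patient latest date from each source (toy values, not Flatiron data).
toy_visit_max = pd.DataFrame({
    'PatientID': ['A', 'B'],
    'visit_max': pd.to_datetime(['2021-01-10', '2020-05-01']),
})
toy_orals_max = pd.DataFrame({
    'PatientID': ['A'],
    'orals_max': pd.to_datetime(['2021-03-02']),
})
toy_mets_max = pd.DataFrame({
    'PatientID': ['B', 'C'],
    'mets_max': pd.to_datetime(['2019-12-01', '2018-07-04']),
})

# Outer-merge all sources on PatientID so no patient is dropped.
merged = reduce(
    lambda left, right: pd.merge(left, right, on='PatientID', how='outer'),
    [toy_visit_max, toy_orals_max, toy_mets_max],
)

# Row-wise max ignores missing dates (NaT) by default.
date_cols = ['visit_max', 'orals_max', 'mets_max']
merged['last_activity'] = merged[date_cols].max(axis=1)
toy_last_activity = merged.set_index('PatientID')['last_activity']
```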

In [58]:
visit = pd.read_csv('Visit.csv')
telemedicine = pd.read_csv('Telemedicine.csv')
orals = pd.read_csv('Enhanced_MetBreast_Orals.csv')
biomarkers = pd.read_csv('Enhanced_MetBreastBiomarkers.csv')
progression = pd.read_csv('Enhanced_MetBreastProgression.csv')
mets = pd.read_csv('Enhanced_MetBreastSitesOfMet.csv')

##### Visit and Telemedicine

In [59]:
visit.shape

(1936611, 7)

In [60]:
telemedicine.shape

(23412, 3)

In [61]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two tables the same way.
visit_tele = pd.concat(
    [visit[['PatientID', 'VisitDate']], telemedicine[['PatientID', 'VisitDate']]]
)

In [62]:
visit_tele.shape

(1960023, 2)

In [63]:
visit_tele.loc[:,'VisitDate'] = pd.to_datetime(visit_tele['VisitDate'])

In [64]:
# Select max VisitDate from combined Visit and Telemedicine table.
visit_tele_max = (
    visit_tele
    [visit_tele['PatientID'].isin(test_IDs)]
    .groupby('PatientID')['VisitDate'].max()
    .to_frame(name = 'visit_max')
    .reset_index()
)

In [65]:
row_ID(visit_tele_max)

(6336, 6336)

##### Orals

In [66]:
orals = orals[orals['PatientID'].isin(test_IDs)]

In [67]:
orals.loc[:, 'StartDate'] = pd.to_datetime(orals['StartDate'])

In [68]:
orals.loc[:, 'EndDate'] = pd.to_datetime(orals['EndDate'])

In [69]:
orals_max = (
    orals
    .assign(max_date = orals[['StartDate', 'EndDate']].max(axis = 1))
    .groupby('PatientID')['max_date'].max()
    .to_frame(name = 'orals_max')
    .reset_index()
)

##### Biomarkers

In [70]:
biomarkers = biomarkers[biomarkers['PatientID'].isin(test_IDs)]

In [71]:
biomarkers.loc[:, 'SpecimenCollectedDate'] = pd.to_datetime(biomarkers['SpecimenCollectedDate'])

In [72]:
biomarkers_max = (
    biomarkers
    .groupby('PatientID')['SpecimenCollectedDate'].max()
    .to_frame(name = 'biomarkers_max')
    .reset_index()
)

##### Progression

In [73]:
progression = progression[progression['PatientID'].isin(test_IDs)]

In [74]:
progression.loc[:, 'ProgressionDate'] = pd.to_datetime(progression['ProgressionDate'])

In [75]:
progression.loc[:, 'LastClinicNoteDate'] = pd.to_datetime(progression['LastClinicNoteDate'])

In [76]:
progression_max = (
    progression
    .assign(max_date = progression[['ProgressionDate', 'LastClinicNoteDate']].max(axis = 1))
    .groupby('PatientID')['max_date'].max()
    .to_frame(name = 'progression_max')
    .reset_index()
)

##### Sites of metastasis

In [77]:
mets = mets[mets['PatientID'].isin(test_IDs)]

In [78]:
mets.loc[:, 'DateOfMetastasis'] = pd.to_datetime(mets['DateOfMetastasis'])

In [79]:
mets_max = (
    mets
    .groupby('PatientID')['DateOfMetastasis'].max()
    .to_frame(name = 'mets_max')
    .reset_index()
)

##### Max date merge

In [80]:
last_activity = pd.merge(visit_tele_max, orals_max, on = 'PatientID', how = 'outer')

In [81]:
last_activity = pd.merge(last_activity, biomarkers_max, on = 'PatientID', how = 'outer')

In [82]:
last_activity = pd.merge(last_activity, progression_max, on = 'PatientID', how = 'outer')

In [83]:
last_activity = pd.merge(last_activity, mets_max, on = 'PatientID', how = 'outer')

In [84]:
row_ID(last_activity)

(6336, 6336)

In [85]:
# Find max of each row. 
last_activity = (
    last_activity
    .assign(last_activity = last_activity[['visit_max', 'orals_max', 'biomarkers_max', 'progression_max', 'mets_max']].max(axis = 1))
    .filter(items = ['PatientID', 'last_activity'])
)

In [86]:
len(last_activity) == len(test_IDs)

True

In [87]:
last_activity['last_activity'].isna().sum()

0

In [88]:
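# Append test IDs that are missing from the mortality table (no recorded death).
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
mortality = pd.concat(
    [
        mortality,
        pd.Series(test_IDs)[~pd.Series(test_IDs).isin(mortality['PatientID'])].to_frame(name = 'PatientID')
    ],
    sort = False
)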

In [89]:
row_ID(mortality)

(6336, 6336)

In [90]:
mortality = pd.merge(mortality, enhanced_met[['PatientID', 'met_date']], on = 'PatientID')

In [91]:
mortality = pd.merge(mortality, last_activity, on = 'PatientID')

In [92]:
row_ID(mortality)

(6336, 6336)

In [93]:
mortality.loc[:, 'death_status'] = np.where(mortality['death_date'].isna(), 0, 1)

In [94]:
# timerisk_activity is time from metastatic diagnosis to death or last activity if no death date.
mortality.loc[:, 'timerisk_activity'] = (
    np.where(mortality['death_date'].isna(),
             (mortality['last_activity'] - mortality['met_date']).dt.days,
             (mortality['death_date'] - mortality['met_date']).dt.days)
)

In [95]:
# If timerisk_activity is less than 0, set to 0 otherwise remains unchanged. 
mortality['timerisk_activity'] = np.where(mortality['timerisk_activity'] < 0, 0, mortality['timerisk_activity'])
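
**A compact sketch of the time-at-risk calculation on toy data. It follows the same rule (death date if known, otherwise last activity; negative values floored at 0) but uses `fillna` and `clip` instead of `np.where`. The column name `timerisk` and all dates are hypothetical.**

```python
import pandas as pd

# Toy patients: one death, one censored, one whose last activity precedes met_date.
toy_surv = pd.DataFrame({
    'met_date': pd.to_datetime(['2018-01-01', '2018-01-01', '2018-06-01']),
    'death_date': pd.to_datetime(['2018-03-02', None, None]),
    'last_activity': pd.to_datetime(['2018-02-20', '2019-01-01', '2018-05-20']),
})

# Event indicator: 1 if a death date exists, 0 if censored.
toy_surv['death_status'] = toy_surv['death_date'].notna().astype(int)

# End of follow-up is the death date when known, otherwise last activity.
end_date = toy_surv['death_date'].fillna(toy_surv['last_activity'])

# Days at risk from metastatic diagnosis, with negative values set to 0.
toy_surv['timerisk'] = (end_date - toy_surv['met_date']).dt.days.clip(lower=0)
```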

In [96]:
mortality.sample(5)

Unnamed: 0,PatientID,death_date,met_date,last_activity,death_status,timerisk_activity
4507,F8A7B93172BAC,NaT,2018-01-16,2022-08-24,0,1681.0
2480,FAA5BDA8A68DB,2017-04-15,2017-03-16,2017-03-23,1,30.0
5807,F024ABFE6F126,NaT,2018-08-30,2022-08-31,0,1462.0
836,FC7B66DA36C10,2017-12-15,2017-01-31,2017-12-18,1,318.0
3955,FE8DC46EFBA68,NaT,2021-06-24,2021-07-12,0,18.0


In [97]:
mortality = pd.merge(mortality, enhanced_met[['PatientID', 'diagnosis_date']], on = 'PatientID', how = 'outer')

In [98]:
# timerisk_activity_first is time from first diagnosis (metastatic or not) to death or last activity if no death date.
mortality.loc[:, 'timerisk_activity_first'] = (
    np.where(mortality['death_date'].isna(),
             (mortality['last_activity'] - mortality['diagnosis_date']).dt.days,
             (mortality['death_date'] - mortality['diagnosis_date']).dt.days)
)

In [99]:
# If timerisk_activity_first is less than 0, set to 0; otherwise it remains unchanged. 
mortality['timerisk_activity_first'] = np.where(
    mortality['timerisk_activity_first'] < 0, 0, mortality['timerisk_activity_first'])

In [100]:
mortality.to_csv('mortality_cleaned_te.csv', index = False, header = True)

In [101]:
mortality = mortality.filter(items = ['PatientID', 'death_status', 'timerisk_activity'])

In [102]:
mortality.sample(5)

Unnamed: 0,PatientID,death_status,timerisk_activity
5110,F48E54608CF85,0,11.0
3630,FFFFA7CC85A8F,1,2294.0
6173,F7E13D76781A1,0,50.0
4991,FAC040427105B,0,5.0
5319,F9E21748E3AD1,0,696.0


In [103]:
%whos DataFrame

Variable          Type         Data/Info
----------------------------------------
biomarkers        DataFrame                PatientID Bio<...>[48165 rows x 19 columns]
biomarkers_max    DataFrame              PatientID bioma<...>\n[6275 rows x 2 columns]
demographics      DataFrame              PatientID gende<...>\n[6336 rows x 7 columns]
enhanced_met      DataFrame               PatientID diag<...>\n[6336 rows x 6 columns]
last_activity     DataFrame              PatientID last_<...>\n[6336 rows x 2 columns]
mets              DataFrame               PatientID Date<...>n[15986 rows x 3 columns]
mets_max          DataFrame              PatientID   met<...>\n[6306 rows x 2 columns]
mortality         DataFrame              PatientID  deat<...>\n[6336 rows x 3 columns]
orals             DataFrame               PatientID     <...>n[15195 rows x 5 columns]
orals_max         DataFrame              PatientID  oral<...>\n[5183 rows x 2 columns]
progression       DataFrame               Patien

In [104]:
# Keep demographics, enhanced_met, and mortality
del biomarkers
del biomarkers_max
del last_activity
del orals
del orals_max
del telemedicine
del visit
del visit_tele
del visit_tele_max

### 4. MedicationAdministration

In [105]:
med_admin = pd.read_csv('MedicationAdministration.csv')

In [106]:
med_admin = med_admin[med_admin['PatientID'].isin(test_IDs)]

In [107]:
row_ID(med_admin)

(518864, 5158)

In [108]:
med_admin.shape

(518864, 11)

**An indicator variable will be created for key medications (e.g., steroids, opioids, other pain medications, antibiotics, anticoagulation, diabetic medications) around the time of metastatic diagnosis. The eligibility window runs from -90 days from metastatic diagnosis to the start of first-line therapy or +30 days, whichever comes first. Start of first-line therapy is included as an upper bound because steroids are frequently administered to treat chemotherapy-induced nausea, so a steroid indicator might inadvertently capture chemotherapy treatment if the upper bound were set after the start of first-line therapy.** 
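
**A minimal sketch of the upper-bound rule on toy dates. It expresses the rule as "the earlier of the day before first-line start and +30 days from metastatic diagnosis", which is one way to read the description above. The data are hypothetical.**

```python
import pandas as pd

# Toy patients: early first-line start, late start, and no treatment at all.
toy_bounds = pd.DataFrame({
    'met_date': pd.to_datetime(['2020-01-01'] * 3),
    'StartDate': pd.to_datetime(['2020-01-11', '2020-03-01', None]),
})

# Hard cap: +30 days from metastatic diagnosis.
cap = toy_bounds['met_date'] + pd.Timedelta(days=30)

# Candidate bound: one day before first-line therapy starts (NaT if untreated).
day_before_start = toy_bounds['StartDate'] - pd.Timedelta(days=1)

# Keep the day before start only when it is earlier than the cap;
# a missing StartDate compares as False and therefore falls back to the cap.
toy_bounds['upper_bound_date'] = day_before_start.where(day_before_start < cap, cap)
```

**Medication rows would then be kept when the administration date lies between -90 days from `met_date` and `upper_bound_date`.**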

In [109]:
line_therapy = pd.read_csv('LineOfTherapy.csv')

In [110]:
line_therapy = line_therapy[line_therapy['PatientID'].isin(med_admin['PatientID'])]

In [111]:
line_therapy_1 = (
    line_therapy 
    .query('LineNumber == 1 and IsMaintenanceTherapy == False')
)

In [112]:
# If a patient has two first-line therapy rows, keep the first one listed (the earliest, assuming rows are ordered by StartDate).
line_therapy_1 = line_therapy_1.drop_duplicates(subset = ['PatientID'], keep = 'first')

In [113]:
med_admin = pd.merge(med_admin, line_therapy_1[['PatientID', 'StartDate']], on = 'PatientID', how = 'left')

In [114]:
med_admin = pd.merge(med_admin, enhanced_met[['PatientID', 'met_date']], on = 'PatientID', how = 'left')

In [115]:
med_admin.loc[:, 'AdministeredDate'] = pd.to_datetime(med_admin['AdministeredDate'])

In [116]:
med_admin.loc[:, 'StartDate'] = pd.to_datetime(med_admin['StartDate'])

In [117]:
med_admin['AdministeredDate'].isna().sum()

0

In [118]:
# Median days from metastatic date to start of first line of therapy. 
(
    med_admin
    .drop_duplicates(subset = ['PatientID'], keep = 'first')
    .assign(start_met_diff = lambda x: (x.StartDate - x.met_date).dt.days)
    .start_met_diff
    .median()
)

23.0

In [119]:
# New variable upper_bound which defines upper bound
# If no StartDate (ie., no treatment received), then upper bound +30 from metastatic diagnosis 
# If StartDate is greater than 30 days from metastatic diagnosis, then upper bound +30 from metastatic diagnosis
# If StartDate is less than or equal 30 from metastatic diagnosis, then upper bound is one day before StartDate
conditions = [
    (med_admin['StartDate'].isna()) | ((med_admin['StartDate'] - med_admin['met_date']).dt.days > 30),
    ((med_admin['StartDate'] - med_admin['met_date']).dt.days <= 30)]    

choices = [30, (med_admin['StartDate'] - med_admin['met_date']).dt.days - 1]
    
med_admin.loc[:, 'upper_bound'] = np.select(conditions, choices)

In [120]:
med_admin.loc[:, 'upper_bound_date'] = (
    np.where(med_admin['upper_bound'] != 30, 
             med_admin['StartDate'] - pd.DateOffset(days = 1), 
             med_admin['met_date'] + pd.DateOffset(days = 30))
)

In [121]:
# Select window from -90 days before metastatic diagnosis to upper_bound_date, and remove clinical study drug. 
med_admin_win = (
    med_admin
    [((med_admin['AdministeredDate'] - med_admin['met_date']).dt.days >= -90) &
    (med_admin['AdministeredDate'] <= med_admin['upper_bound_date']) &
    (med_admin['CommonDrugName'] != 'Clinical study drug')]
)

In [122]:
row_ID(med_admin_win)

(15903, 912)

#### Antineoplastic 

**No indicator variable created.** 

#### Antiemetic

**No indicator variable created.** 

#### Solution-fluid

**No indicator variable created.** 

#### Steroid

In [123]:
med_admin_win.loc[:, 'steroid_diag'] = (
    np.where((med_admin_win['DrugCategory'] == 'steroid') & 
             ((med_admin_win['Route'] == 'Intravenous') | 
              (med_admin_win['Route'] == 'Oral') | 
              (med_admin_win['Route'] == 'Intrajejunal') |
              (med_admin_win['Route'] == 'Nasogastric') |
              (med_admin_win['Route'] == 'enteral')), 1, 0)
)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = value
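
**The warning above appears because the windowed table is a filtered slice of another dataframe. One common way to avoid it, sketched here on toy data, is to take an explicit `.copy()` of the slice before adding columns. The column `days_from_met` is hypothetical, and `isin` is used as a shorter equivalent of the chained route comparisons.**

```python
import numpy as np
import pandas as pd

# Toy medication rows (hypothetical values).
toy_meds = pd.DataFrame({
    'DrugCategory': ['steroid', 'pain agent', 'steroid'],
    'Route': ['Oral', 'Oral', 'Topical'],
    'days_from_met': [-10, 5, 200],
})

# .copy() makes the filtered frame independent of toy_meds,
# so assigning a new column does not trigger SettingWithCopyWarning.
toy_meds_win = toy_meds[toy_meds['days_from_met'] <= 30].copy()

# Systemic routes taken from the steroid cell above.
systemic_routes = ['Intravenous', 'Oral', 'Intrajejunal', 'Nasogastric', 'enteral']

toy_meds_win['steroid_diag'] = np.where(
    (toy_meds_win['DrugCategory'] == 'steroid')
    & toy_meds_win['Route'].isin(systemic_routes),
    1, 0,
)
```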


#### Pain

##### Opioid PO

In [124]:
# List of available opioids in the US. 
opioid_list = [
    'buprenorphine',
    'codeine',
    'fentanyl',
    'hydrocodone',
    'hydromorphone',
    'methadone',
    'morphine',
    'oxycodone',
    'oxymorphone',
    'tapentadol',
    'tramadol'
]

In [125]:
med_admin_win.loc[:, 'opioid_PO_diag'] = (
    np.where(((med_admin_win['Route'] == 'Oral') | 
              (med_admin_win['Route'] == 'Transdermal') | 
              (med_admin_win['Route'] == 'Sublingual')) &
             (med_admin_win['CommonDrugName'].str.contains('|'.join(opioid_list))), 1, 0)
)
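
**A short sketch of the name matching used above. Joining the list with `|` builds a regular expression that matches any listed drug as a substring, so combination products are also caught. The options `case=False` and `na=False` are additions not used in the notebook; they are shown as one way to handle mixed case and missing drug names. The drug list is shortened and the names are toy values.**

```python
import pandas as pd

# Shortened opioid list for illustration.
toy_opioids = ['morphine', 'oxycodone', 'tramadol']
pattern = '|'.join(toy_opioids)  # 'morphine|oxycodone|tramadol'

# Toy drug names: a combination product, a capitalized name, a non-opioid, a missing value.
toy_names = pd.Series(['oxycodone/acetaminophen', 'Morphine', 'ibuprofen', None])

# na=False turns missing names into False instead of NaN,
# so the result is a clean boolean mask that can be combined with & or negated with ~.
is_opioid = toy_names.str.contains(pattern, case=False, na=False)
```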

##### Nonopioid PO

In [126]:
med_admin_win.loc[:, 'nonopioid_PO_diag'] = (
    np.where((med_admin_win['DrugCategory'] == 'pain agent') & 
             (med_admin_win['Route'] == 'Oral') & 
             (~med_admin_win['CommonDrugName'].str.contains('|'.join(opioid_list))), 1, 0)
)

##### Pain IV

In [127]:
med_admin_win.loc[:, 'pain_IV_diag'] = (
    np.where((med_admin_win['DrugCategory'] == 'pain agent') & 
             (med_admin_win['Route'] == 'Intravenous'), 1, 0)
)

#### Hematologic agent

##### Heparin and other parenteral agents

In [128]:
med_admin_win.loc[:, 'heparin_diag'] = (
    np.where(((med_admin_win['CommonDrugName'].str.contains('heparin')) & 
              (med_admin_win['AdministeredUnits'] == 'unit/kg/hr')) | 
             (med_admin_win['CommonDrugName'].str.contains('bivalirudin')) | 
             (med_admin_win['CommonDrugName'].str.contains('argatroban')), 1, 0)
)

##### Enoxaparin and other subcutaneous agents 

In [129]:
med_admin_win.loc[:, 'enoxaparin_diag'] = (
    np.where(((med_admin_win['CommonDrugName'].str.contains('enoxaparin')) & 
              (med_admin_win['AdministeredAmount'] > 40)) | 
             ((med_admin_win['CommonDrugName'].str.contains('dalteparin')) & 
              (med_admin_win['AdministeredAmount'] > 5000)) | 
             ((med_admin_win['CommonDrugName'].str.contains('fondaparinux')) & 
              (med_admin_win['AdministeredAmount'] > 2.5)), 1, 0)
)

##### DOAC

In [130]:
med_admin_win.loc[:, 'doac_diag'] = (
    np.where((med_admin_win['CommonDrugName'].str.contains('apixaban')) | 
             (med_admin_win['CommonDrugName'].str.contains('rivaroxaban')) | 
             (med_admin_win['CommonDrugName'].str.contains('dabigatran')) | 
             (med_admin_win['CommonDrugName'].str.contains('edoxaban')), 1, 0)
)

##### Warfarin

In [131]:
med_admin_win.loc[:, 'warfarin_diag'] = np.where((med_admin_win['CommonDrugName'].str.contains('warfarin')), 1, 0)

##### Anticoagulation merge 

In [132]:
# Combine heparin, enoxaparin, DOAC, and warfarin columns into a single anticoagulation indicator variable. 
med_admin_win['ac_diag'] = (
    med_admin_win['heparin_diag'] + med_admin_win['enoxaparin_diag'] + med_admin_win['doac_diag'] + med_admin_win['warfarin_diag']
)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [133]:
# Drop heparin, enoxaparin, DOAC, and warfarin columns. 
med_admin_win = med_admin_win.drop(columns = ['heparin_diag', 'enoxaparin_diag', 'doac_diag', 'warfarin_diag'])

##### DAPT

**No indicator variable created.** 

##### GCSF

**No indicator variable created.** 

##### Epoetin

**No indicator variable created.** 

##### tPA

**No indicator variable created.** 

#### Anti-infective 

##### Anti-infective IV

In [134]:
med_admin_win.loc[:, 'antiinfective_IV_diag'] = (
    np.where((med_admin_win['DrugCategory'] == 'anti-infective') & 
             (med_admin_win['Route'] == 'Intravenous'), 1, 0)
)

##### Anti-infective PO

In [135]:
med_admin_win.loc[:, 'antiinfective_diag'] = (
    np.where((med_admin_win['DrugCategory'] == 'anti-infective') & 
             (med_admin_win['Route'] == 'Oral'), 1, 0)
)

#### Anesthetic

**No indicator variable created.** 

#### Cytoprotective

**No indicator variable created.** 

#### Antihyperglycemic

In [136]:
med_admin_win.loc[:, 'antihyperglycemic_diag'] = np.where(med_admin_win['DrugCategory'] == 'antihyperglycemic', 1, 0)

#### Proton pump inhibitor

In [137]:
med_admin_win.loc[:, 'ppi_diag'] = np.where(med_admin_win['DrugCategory'] == 'proton pump inhibitor', 1, 0)

#### Antidepressant

In [138]:
med_admin_win.loc[:, 'antidepressant_diag'] = np.where(med_admin_win['DrugCategory'] == 'antidepressant', 1, 0)

#### Bone therapy agent

In [139]:
med_admin_win.loc[:, 'bta_diag'] = np.where(med_admin_win['DrugCategory'] == 'bone therapy agent (bta)', 1, 0)

#### Hormone

In [140]:
med_admin_win.loc[:, 'thyroid_diag'] = np.where(med_admin_win['CommonDrugName'] == 'levothyroxine', 1, 0)

#### Gout and hyperurecemia agent 

**No indicator variable created.** 

#### Immunosuppressive 

In [141]:
med_admin_win.loc[:, 'is_diag'] = np.where(med_admin_win['DrugCategory'] == 'immunosuppressive', 1, 0)

#### Sedative agent

**No indicator variable created.** 

#### Endocrine

**No indicator variable created.** 

#### Antidote and reversal agent

**No indicator variable created.** 

#### Hyperglycemic

**No indicator variable created.** 

#### Antithyroid agent

**No indicator variable created.** 

#### Anticholinergic

**No indicator variable created.** 

#### Calciumimetic

**No indicator variable created.** 

#### Targeted therapy

**No indicator variable created.** 

#### Condensing

In [142]:
# Select columns with indicator variables and PatientID, then collapse rows by PatientID and sum columns. 
med_admin_wide = (
    med_admin_win
    [med_admin_win.columns[med_admin_win.columns.str.contains('diag|PatientID')]]
    .groupby('PatientID').sum()
)

In [143]:
# Replace numbers greater than 1 with 1; 0 remains unchanged. 
med_admin_wide = (
    med_admin_wide.mask(med_admin_wide > 1, 1)
    .reset_index()
)

In [144]:
row_ID(med_admin_wide)

(912, 912)

In [145]:
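# Append test IDs with no medication rows in the window; their indicators are set to 0.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
med_admin_wide = (
    pd.concat(
        [
            med_admin_wide,
            pd.Series(test_IDs)[~pd.Series(test_IDs).isin(med_admin_wide['PatientID'])].to_frame(name = 'PatientID')
        ],
        sort = False
    )
    .fillna(0)
)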
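
**The condensing steps can be sketched on toy data as follows. This variant takes the per-patient maximum of each 0/1 indicator, which gives the same result as summing and capping at 1, and uses a left merge from the full ID list to add patients with no medication rows. All values are hypothetical.**

```python
import pandas as pd

# Toy long-format indicators: patient A has two rows, patient B has one.
toy_long = pd.DataFrame({
    'PatientID': ['A', 'A', 'B'],
    'steroid_diag': [1, 1, 0],
    'ppi_diag': [0, 0, 1],
})

# One row per patient: an indicator is 1 if any row for that patient is 1.
toy_wide = toy_long.groupby('PatientID').max().reset_index()

# Full cohort, including patient C who has no medication rows.
toy_all_ids = pd.DataFrame({'PatientID': ['A', 'B', 'C']})

# Left merge keeps every cohort patient; missing indicators become 0.
toy_full = toy_all_ids.merge(toy_wide, on='PatientID', how='left').fillna(0)
flag_cols = ['steroid_diag', 'ppi_diag']
toy_full[flag_cols] = toy_full[flag_cols].astype(int)
```

**Casting back to `int` avoids the `0.0`/`1.0` float indicators that result from filling missing values.**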

In [146]:
row_ID(med_admin_wide)

(6336, 6336)

In [147]:
med_admin_wide.shape

(6336, 14)

In [148]:
med_admin_wide.sample(5)

Unnamed: 0,PatientID,steroid_diag,opioid_PO_diag,nonopioid_PO_diag,pain_IV_diag,ac_diag,antiinfective_IV_diag,antiinfective_diag,antihyperglycemic_diag,ppi_diag,antidepressant_diag,bta_diag,thyroid_diag,is_diag
1350,F4AF7EDD8F041,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5256,F0CB683A3AE90,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
87,F1713F00140A4,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1273,F049E045CAEF5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
427,F714A93CAA55A,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [149]:
# Proportion of patients receiving relevant medications around the time of metastatic diagnosis. 
(med_admin_wide.iloc[:, 1:].sum()/len(med_admin_wide)).sort_values(ascending = False)

bta_diag                  0.038352
steroid_diag              0.035511
pain_IV_diag              0.030461
antiinfective_IV_diag     0.025095
opioid_PO_diag            0.022885
nonopioid_PO_diag         0.022569
ppi_diag                  0.009785
antiinfective_diag        0.008207
ac_diag                   0.005997
antidepressant_diag       0.005997
antihyperglycemic_diag    0.005051
thyroid_diag              0.003472
is_diag                   0.000158
dtype: float64

In [150]:
%whos DataFrame

Variable          Type         Data/Info
----------------------------------------
demographics      DataFrame              PatientID gende<...>\n[6336 rows x 7 columns]
enhanced_met      DataFrame               PatientID diag<...>\n[6336 rows x 6 columns]
line_therapy      DataFrame               PatientID     <...>n[14044 rows x 9 columns]
line_therapy_1    DataFrame               PatientID     <...>\n[4823 rows x 9 columns]
med_admin         DataFrame                PatientID    <...>518864 rows x 15 columns]
med_admin_wide    DataFrame              PatientID  ster<...>n[6336 rows x 14 columns]
med_admin_win     DataFrame                PatientID    <...>[15903 rows x 28 columns]
mets              DataFrame               PatientID Date<...>n[15986 rows x 3 columns]
mets_max          DataFrame              PatientID   met<...>\n[6306 rows x 2 columns]
mortality         DataFrame              PatientID  deat<...>\n[6336 rows x 3 columns]
progression       DataFrame               Patien

In [151]:
# Keep demographics, enhanced_met, med_admin_wide, and mortality
del line_therapy
del line_therapy_1
del med_admin
del med_admin_win

### 5. Enhanced_MetBreastBiomarkers

In [152]:
biomarkers = pd.read_csv('Enhanced_MetBreastBiomarkers.csv')

In [153]:
biomarkers = biomarkers[biomarkers['PatientID'].isin(test_IDs)]

In [154]:
biomarkers.shape

(48165, 19)

**The Biomarkers dataframe is in a long format. The goal is to build a single-row-per-patient dataframe with columns reflecting a patient's biomarker status within a predefined eligibility window. For this project, the eligibility window is defined as negative infinity to +30 days from the date of metastatic diagnosis (i.e., index date).** 

**Regarding biomarker date information, result date is the date the biomarker result was first reported, and so represents the date on which the clinician would be expected to have information about the patient’s biomarker status to inform the course of treatment. Flatiron recommends using result date as the relevant biomarker test date and using specimen received date as the proxy when result date is not available. The gaps between collected date and either received or result date are substantially more variable.**

**We'll begin by imputing specimen received date when result date is missing. Then, we'll select all biomarkers that fall within the eligibility window.**
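
**A minimal sketch of the date fallback and window filter on toy data. The rows and dates are invented for illustration, and `Series.fillna` stands in for the `np.where` call used in the cells that follow.**

```python
import pandas as pd

# Toy rows; values are illustrative only, not Flatiron records.
toy = pd.DataFrame({
    'PatientID': ['A', 'A', 'B', 'C'],
    'ResultDate': ['2020-01-10', None, '2020-03-15', None],
    'SpecimenReceivedDate': ['2020-01-08', '2020-02-20', '2020-03-01', None],
    'met_date': ['2020-01-01', '2020-01-01', '2020-01-01', '2020-01-01'],
})
for col in ['ResultDate', 'SpecimenReceivedDate', 'met_date']:
    toy[col] = pd.to_datetime(toy[col])

# Use result date when present, otherwise fall back to specimen received date.
toy['result_date'] = toy['ResultDate'].fillna(toy['SpecimenReceivedDate'])

# Days from metastatic diagnosis (index date) to the biomarker date.
toy['bio_date_diff'] = (toy['result_date'] - toy['met_date']).dt.days

# No lower bound, upper bound of +30 days; rows with no usable date drop out.
toy_win = toy[toy['bio_date_diff'] <= 30]
```

**Only the first row for patient A survives: its result date is 9 days after the index date, while the other rows are either later than +30 days or have no date at all.**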

In [155]:
biomarkers.loc[:, 'ResultDate'] = pd.to_datetime(biomarkers['ResultDate'])

In [156]:
biomarkers.loc[:, 'SpecimenReceivedDate'] = pd.to_datetime(biomarkers['SpecimenReceivedDate'])

In [157]:
# Replace missing result date with specimen received date. 
biomarkers.loc[:, 'result_date'] = (
    np.where(biomarkers['ResultDate'].isna(), biomarkers['SpecimenReceivedDate'], biomarkers['ResultDate'])
)

In [158]:
biomarkers = pd.merge(biomarkers, enhanced_met[['PatientID', 'met_date']], on = 'PatientID', how = 'left')

In [159]:
# Create new variable that captures difference in days between result date and metastatic diagnosis. 
biomarkers.loc[:, 'bio_date_diff'] = (biomarkers['result_date'] - biomarkers['met_date']).dt.days

In [160]:
# Select all biomarker results dated on or before +30 days from metastatic diagnosis. 
biomarker_win = biomarkers[biomarkers['bio_date_diff'] <= 30]

**The next step is defining positive and negative status for each biomarker. For ER, PR, and HER2, the most recent result within the eligibility window (that is, the latest result dated on or before +30 days from metastatic diagnosis) will be selected. Positive will be selected over negative if both are on the same date.** 

**For BRCA and PIK3CA, status will be labeled as positive if ever positive and negative if always negative. Similarly for PDL1, the highest percent staining value is selected rather than value closest to metastatic diagnosis.**
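
**A small sketch, on invented rows, of the sort-then-deduplicate pattern used for ER, PR, and HER2 in section 5.1. Within each patient and biomarker, rows are ordered by result date (most recent first) and then by status (positive first), so `drop_duplicates(keep='first')` retains the preferred row. The 2/1 coding mirrors the notebook; the data are hypothetical.**

```python
import pandas as pd

toy = pd.DataFrame({
    'PatientID': ['A', 'A', 'A', 'B'],
    'BiomarkerName': ['ER', 'ER', 'ER', 'ER'],
    'result_date': pd.to_datetime(
        ['2020-01-05', '2020-01-20', '2020-01-20', '2020-02-01']),
    'bio_status': [2, 1, 2, 1],  # 2 = positive, 1 = negative
})

toy_wide = (
    toy
    .sort_values(
        by=['PatientID', 'BiomarkerName', 'result_date', 'bio_status'],
        ascending=[True, True, False, False])
    .drop_duplicates(subset=['PatientID', 'BiomarkerName'], keep='first')
    .pivot(index='PatientID', columns='BiomarkerName', values='bio_status')
    .reset_index()
)
toy_wide.columns.name = None
```

**Patient A has a positive and a negative ER result on the latest date, so the positive result (2) is kept; patient B keeps the single negative result (1).**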

#### 5.1 Assigning patient-level biomarker status for ER, PR, and HER2

In [161]:
# Identify positive and negative cases
biomarker_name = [
    'ER',
    'PR',
    'HER2']

pos_neg = [
    'Positive',
    'IHC positive (3+)',
    'FISH positive/amplified',
    'Positive NOS',
    'NGS positive (ERBB2 amplified)',
    'Negative', 
    'IHC negative (0-1+)',
    'FISH negative/not amplified',
    'Negative NOS',
    'NGS negative (ERBB2 not amplified)']

In [162]:
biomarker_hr_her2 = (
    biomarker_win
    .query('BiomarkerName == @biomarker_name')
    .query('BiomarkerStatus == @pos_neg')
)

In [163]:
row_ID(biomarker_hr_her2)

(26364, 5538)

In [164]:
# Create indicator variable equal to 2 if positive, 1 if negative, and 0 if unknown or missing. 
conditions = [
    (biomarker_hr_her2['BiomarkerStatus'] == 'Positive') | 
    (biomarker_hr_her2['BiomarkerStatus'] == 'IHC positive (3+)') | 
    (biomarker_hr_her2['BiomarkerStatus'] == 'FISH positive/amplified') |
    (biomarker_hr_her2['BiomarkerStatus'] == 'Positive NOS') |
    (biomarker_hr_her2['BiomarkerStatus'] == 'NGS positive (ERBB2 amplified)'), 
    (biomarker_hr_her2['BiomarkerStatus'] == 'Negative') |
    (biomarker_hr_her2['BiomarkerStatus'] == 'IHC negative (0-1+)') |
    (biomarker_hr_her2['BiomarkerStatus'] == 'FISH negative/not amplified') |
    (biomarker_hr_her2['BiomarkerStatus'] == 'Negative NOS') |
    (biomarker_hr_her2['BiomarkerStatus'] == 'NGS negative (ERBB2 not amplified)')]

choices = [2,1]
biomarker_hr_her2.loc[:, 'bio_status'] = np.select(conditions, choices)

In [165]:
biomarker_hr_her2_wide = (
    biomarker_hr_her2
    .sort_values(by = ['PatientID', 'BiomarkerName', 'result_date', 'bio_status'], ascending = [True, True, False, False])
    .drop_duplicates(subset = ['PatientID', 'BiomarkerName'], keep = 'first')
    .pivot(index = 'PatientID', columns = 'BiomarkerName', values = 'bio_status')
    .reset_index()
)

In [166]:
row_ID(biomarker_hr_her2_wide)

(5538, 5538)

In [167]:
biomarker_hr_her2_wide = (
    biomarker_hr_her2_wide
    .append(
        pd.Series(test_IDs)[~pd.Series(test_IDs).isin(biomarker_hr_her2_wide['PatientID'])].to_frame(name = 'PatientID'),
        sort = False)
    .fillna(0)
)

In [168]:
row_ID(biomarker_hr_her2_wide)

(6336, 6336)
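
**`DataFrame.append`, used above to add test patients with no qualifying rows, was deprecated in pandas 1.4 and removed in pandas 2.0. The sketch below shows one way to get the same result with `pd.concat`. The helper name `add_missing_ids` is hypothetical and is not part of the original notebook; note that it resets the index, which the original cells do not.**

```python
import pandas as pd

def add_missing_ids(df, ids, fill_value=None):
    """Return df plus one empty row for every ID in `ids` not already in df."""
    ids = pd.Series(ids, name='PatientID')
    missing = ids[~ids.isin(df['PatientID'])].to_frame()
    out = pd.concat([df, missing], ignore_index=True, sort=False)
    return out if fill_value is None else out.fillna(fill_value)

# Toy example: two patients have a status, two do not.
toy_wide = pd.DataFrame({'PatientID': ['A', 'B'], 'ER': [2, 1]})
toy_full = add_missing_ids(toy_wide, ['A', 'B', 'C', 'D'], fill_value=0)
```

**Patients C and D are appended with ER set to 0, matching the "unknown or missing" code used in this section.**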

#### 5.2 Assigning patient-level biomarker status for BRCA and PIK3CA

In [169]:
biomarker_brca_pik = (
    biomarker_win
    .query('BiomarkerName == "BRCA" or BiomarkerName == "PIK3CA"')
)

In [170]:
row_ID(biomarker_brca_pik)

(1820, 1307)

In [171]:
# Create indicator variable equal to 2 if positive, 1 if negative, and 0 if unknown or missing. 
conditions = [
    (biomarker_brca_pik['BiomarkerStatus'] == 'BRCA1 mutation identified') |
    (biomarker_brca_pik['BiomarkerStatus'] == 'BRCA2 mutation identified') |
    (biomarker_brca_pik['BiomarkerStatus'] == 'Both BRCA1 and BRCA2 mutations identified') |
    (biomarker_brca_pik['BiomarkerStatus'] == 'BRCA mutation NOS') | 
    (biomarker_brca_pik['BiomarkerStatus'] == 'Positive'), 
    (biomarker_brca_pik['BiomarkerStatus'] == 'No BRCA mutation') |
    (biomarker_brca_pik['BiomarkerStatus'] == 'Genetic Variant of Unknown Significance (VUS)') |
    (biomarker_brca_pik['BiomarkerStatus'] == 'Genetic Variant Favor Polymorphism') |
    (biomarker_brca_pik['BiomarkerStatus'] == 'Negative')]

choices = [2,1]
biomarker_brca_pik.loc[:, 'bio_status'] = np.select(conditions, choices, default = 0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = value


In [172]:
# For each patient and biomarker, keep the highest status (positive if ever positive, else negative, else unknown), then pivot. 
biomarker_brca_pik_wide = (
    biomarker_brca_pik
    .sort_values(by = ['PatientID', 'BiomarkerName', 'bio_status'], ascending = False)
    .drop_duplicates(subset = ['PatientID', 'BiomarkerName'], keep = 'first')
    .pivot(index = 'PatientID', columns = 'BiomarkerName', values = 'bio_status')
    .reset_index()
)
biomarker_hr_her2_wide.columns.name = None

In [173]:
row_ID(biomarker_brca_pik_wide)

(1307, 1307)

In [174]:
biomarker_brca_pik_wide = (
    biomarker_brca_pik_wide
    .append(
        pd.Series(test_IDs)[~pd.Series(test_IDs).isin(biomarker_brca_pik_wide['PatientID'])].to_frame(name = 'PatientID'),
        sort = False)
    .fillna(0)
)

In [175]:
row_ID(biomarker_brca_pik_wide)

(6336, 6336)

In [176]:
biomarker_notpdl1_wide = pd.merge(biomarker_hr_her2_wide, biomarker_brca_pik_wide, on = 'PatientID')

In [177]:
row_ID(biomarker_notpdl1_wide)

(6336, 6336)

In [178]:
biomarker_notpdl1_wide['BRCA'] = (
    biomarker_notpdl1_wide['BRCA'].replace({
        2: 'positive',
        1: 'negative',
        0: 'unknown',
        np.nan: 'unknown'})
)

In [179]:
biomarker_notpdl1_wide['ER'] = (
    biomarker_notpdl1_wide['ER'].replace({
        2: 'positive',
        1: 'negative',
        0: 'unknown',
        np.nan: 'unknown'})
)

In [180]:
biomarker_notpdl1_wide['HER2'] = (
    biomarker_notpdl1_wide['HER2'].replace({
        2: 'positive',
        1: 'negative',
        0: 'unknown',
        np.nan: 'unknown'})
)

In [181]:
biomarker_notpdl1_wide['PIK3CA'] = (
    biomarker_notpdl1_wide['PIK3CA'].replace({
        2: 'positive',
        1: 'negative',
        0: 'unknown',
        np.nan: 'unknown'})
)

In [182]:
biomarker_notpdl1_wide['PR'] = (
    biomarker_notpdl1_wide['PR'].replace({
        2: 'positive',
        1: 'negative',
        0: 'unknown',
        np.nan: 'unknown'})
)

In [183]:
biomarker_notpdl1_wide.sample(5)

Unnamed: 0,PatientID,ER,HER2,PR,BRCA,PIK3CA
5257,FF29F7EDA6251,negative,negative,negative,unknown,unknown
1262,F3798CA509A3C,positive,positive,positive,unknown,unknown
2676,F7A80B0DF833B,positive,negative,negative,unknown,unknown
1467,F40D5A0EAC6E6,positive,negative,positive,unknown,unknown
3649,FA9BB1E31CBD0,positive,positive,negative,negative,negative


#### 5.3 Assigning patient-level PD-L1

**Flatiron recommends using PercentStaining as the primary source of truth to assess PD-L1 status over other options in the Biomarker table. For patients with multiple PD-L1 testing instances, the maximum PercentStaining level will be selected and assigned to the patient. PD-L1 testing instances from earlier years, specifically before 2017, are likely to be missing PercentStaining values.** 
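
**Because PercentStaining is recorded as text categories, taking a maximum requires an explicit ordering. The sketch below ranks the categories listed in this notebook and takes the per-patient maximum with `groupby`, as an alternative to the sort-and-deduplicate approach used in the cells that follow. The toy rows are invented.**

```python
import pandas as pd

# Staining categories in increasing order, as they appear in this notebook.
levels = ['0%', '< 1%', '1%', '2% - 4%', '5% - 9%', '10% - 19%',
          '20% - 29%', '30% - 39%', '40% - 49%', '50% - 59%',
          '60% - 69%', '70% - 79%', '80% - 89%', '90% - 99%', '100%']
rank = {label: i for i, label in enumerate(levels, start=1)}
label_of = {i: label for label, i in rank.items()}

toy = pd.DataFrame({
    'PatientID': ['A', 'A', 'B', 'B', 'C'],
    'PercentStaining': ['< 1%', '10% - 19%', None, '0%', None],
})

# Missing staining gets rank 0 so that any recorded value outranks it.
toy['rank'] = toy['PercentStaining'].map(rank).fillna(0).astype(int)

# Highest rank per patient, translated back to its category label.
toy_max = (
    toy.groupby('PatientID')['rank'].max()
    .map(label_of)
    .reset_index(name='max_staining')
)
```

**Patient A resolves to `10% - 19%`, patient B to `0%`, and patient C, who has no recorded staining, is left missing.**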

In [184]:
biomarker_win_pdl1 = (
    biomarker_win
    .query('BiomarkerName == "PDL1"')
)

In [185]:
row_ID(biomarker_win_pdl1)

(353, 248)

In [186]:
pdl1_dict = { 
    np.nan: 0,
    '0%': 1, 
    '< 1%': 2,
    '1%': 3, 
    '2% - 4%': 4,
    '5% - 9%': 5,
    '10% - 19%': 6,  
    '20% - 29%': 7, 
    '30% - 39%': 8, 
    '40% - 49%': 9, 
    '50% - 59%': 10, 
    '60% - 69%': 11, 
    '70% - 79%': 12, 
    '80% - 89%': 13, 
    '90% - 99%': 14,
    '100%': 15
}

biomarker_win_pdl1.loc[:, 'percent_staining'] = biomarker_win_pdl1['PercentStaining'].map(pdl1_dict)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = value


In [187]:
biomarker_pdl1_staining = (
    biomarker_win_pdl1
    .sort_values(by = ['PatientID', 'percent_staining'], ascending = False)
    .drop_duplicates(subset = ['PatientID'], keep = 'first')
    .pivot(index = 'PatientID', columns = 'BiomarkerName', values = 'percent_staining')
    .rename(columns = {'PDL1': 'pdl1_staining'})
    .reset_index()
)
biomarker_pdl1_staining.columns.name = None

In [188]:
row_ID(biomarker_pdl1_staining)

(248, 248)

In [189]:
biomarker_pdl1_staining.sample(5)

Unnamed: 0,PatientID,pdl1_staining
70,F426FE4D9880C,1
134,F960D8C499020,5
149,FA64AFE390A56,1
128,F8C306C4E7566,1
17,F14EC485ADD6A,1


In [190]:
pdl1_dict_rev = { 
    0: np.nan,
    1: '0%', 
    2: '0%',
    3: '>1%', 
    4: '>1%',
    5: '>1%',
    6: '>1%',  
    7: '>1%', 
    8: '>1%', 
    9: '>1%', 
    10: '>1%', 
    11: '>1%', 
    12: '>1%', 
    13: '>1%', 
    14: '>1%',
    15: '>1%'
}

biomarker_pdl1_staining.loc[:, 'pdl1_staining'] = biomarker_pdl1_staining['pdl1_staining'].map(pdl1_dict_rev)

In [191]:
biomarker_pdl1_staining_wide = (
    biomarker_pdl1_staining
    .append(
        pd.Series(test_IDs)[~pd.Series(test_IDs).isin(biomarker_pdl1_staining['PatientID'])].to_frame(name = 'PatientID'),
        sort = False)
    .fillna('unknown')
)

In [192]:
biomarker_pdl1_staining_wide['pdl1_staining'] = (
    biomarker_pdl1_staining_wide['pdl1_staining'].replace({np.nan: 'unknown'}))

In [193]:
biomarker_pdl1_staining_wide.sample(5)

Unnamed: 0,PatientID,pdl1_staining
1754,F0D65F20894CB,unknown
2177,FBE6D5B8F4C59,unknown
2716,FE0B2F2BE35D9,unknown
4033,F059CEA8C0154,unknown
6089,FF211CD77DA66,unknown


In [194]:
biomarker_pdl1_staining_wide.shape

(6336, 2)

**Flatiron recommends using BiomarkerStatus to impute missing patient-level PercentStaining category values, as follows:**
* **Impute a PercentStaining value of “≥1%” for patients with at least one confirmed positive PD-L1 result within the eligible window.** 
* **Impute a PercentStaining value of “0%” to patients with no confirmed positive PD-L1 results and at least one confirmed negative PD-L1 result within the eligible window.** 
* **Do not impute a PercentStaining value to patients who have no confirmed positive or negative PD-L1 results within the eligible window.**
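
**The three rules above can be sketched on toy data as follows. The column names mirror those built later in this section, the rows are invented, and the status coding (2 = at least one positive, 1 = negative only, 0 = neither) is the one used in this notebook. The label `≥1%` is taken from the rule text; the notebook's own cells label this category `>1%`.**

```python
import pandas as pd

toy = pd.DataFrame({
    'PatientID': ['A', 'B', 'C', 'D'],
    'pdl1_staining': ['unknown', 'unknown', 'unknown', '0%'],
    'pdl1_status': [2, 1, 0, 2],
})

# Value to fall back on when staining is unknown, derived from status.
from_status = toy['pdl1_status'].map({2: '≥1%', 1: '0%', 0: 'unknown'})

# Keep a known staining category; otherwise use the status-based value.
known = toy['pdl1_staining'] != 'unknown'
toy['pdl1_imputed'] = toy['pdl1_staining'].where(known, from_status)
```

**Patient D shows that a recorded staining category takes precedence over BiomarkerStatus: the status is only consulted when staining is unknown.**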

In [195]:
biomarker_win_pdl1.BiomarkerStatus.value_counts(dropna = False)

No interpretation given in report    185
PD-L1 negative/not detected          117
PD-L1 positive                        37
Unsuccessful/indeterminate test        8
Unknown                                4
Results pending                        2
Name: BiomarkerStatus, dtype: int64

In [196]:
# Create indicator variable equal to 2 if positive, 1 if negative, and 0 if unknown or missing. 
conditions = [
    (biomarker_win_pdl1['BiomarkerStatus'] == 'Rearrangement present') | 
    (biomarker_win_pdl1['BiomarkerStatus'] == 'Mutation positive') | 
    (biomarker_win_pdl1['BiomarkerStatus'] == 'PD-L1 positive'),
    (biomarker_win_pdl1['BiomarkerStatus'] == 'Rearrangement not present') | 
    (biomarker_win_pdl1['BiomarkerStatus'] == 'Mutation negative') | 
    (biomarker_win_pdl1['BiomarkerStatus'] == 'PD-L1 negative/not detected')
]

choices = [2,1]
biomarker_win_pdl1.loc[:, 'bio_status'] = np.select(conditions, choices, default = 0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = value


In [197]:
# Among PDL1-tested patients, select the highest status (positive over negative over unknown) for those with repeat testing, then pivot. 
biomarker_pdl1_status = (
    biomarker_win_pdl1
    .sort_values(by = ['PatientID', 'bio_status'], ascending = False)
    .drop_duplicates(subset = ['PatientID'], keep = 'first')
    .pivot(index = 'PatientID', columns = 'BiomarkerName', values = 'bio_status')
    .rename(columns = {'PDL1': 'pdl1_status'})
    .reset_index()
)
biomarker_pdl1_status.columns.name = None

In [198]:
row_ID(biomarker_pdl1_status)

(248, 248)

In [199]:
biomarker_pdl1_status.head()

Unnamed: 0,PatientID,pdl1_status
0,F024BFC72DA47,1
1,F02CE34ADA3C4,0
2,F04BBA351D29A,0
3,F06B245193651,0
4,F076A362DF540,0


In [200]:
biomarker_pdl1_status_wide = (
    biomarker_pdl1_status
    .append(
        pd.Series(test_IDs)[~pd.Series(test_IDs).isin(biomarker_pdl1_status['PatientID'])].to_frame(name = 'PatientID'),
        sort = False)
    .fillna(0)
)

In [201]:
biomarker_pdl1 = pd.merge(biomarker_pdl1_staining_wide, biomarker_pdl1_status_wide, on = 'PatientID')

In [202]:
# If PDL1 staining is unknown, set to '>1%' (the >=1% category) if ever positive, and to '0%' if never positive but at least one negative. 
# If PDL1 staining is known, keep its category: '>1%' for staining of 1-100% and '0%' for staining below 1%.
conditions = [
    ((biomarker_pdl1['pdl1_staining'] == 'unknown') & (biomarker_pdl1['pdl1_status'] == 2)) | 
    (biomarker_pdl1['pdl1_staining'] == '>1%'),
    ((biomarker_pdl1['pdl1_staining'] == 'unknown') & (biomarker_pdl1['pdl1_status'] == 1)) | 
    (biomarker_pdl1['pdl1_staining'] == '0%'), 
    ((biomarker_pdl1['pdl1_staining'] == 'unknown') & (biomarker_pdl1['pdl1_status'] == 0)),
]

choices = ['>1%', '0%', 'unknown']

# A string default keeps the result dtype consistent with the string choices.
biomarker_pdl1.loc[:, 'pdl1_n'] = np.select(conditions, choices, default = 'unknown')

In [203]:
biomarker_pdl1.sample(5)

Unnamed: 0,PatientID,pdl1_staining,pdl1_status,pdl1_n
5232,FC692DFABC3B0,unknown,0.0,unknown
4625,F01271AC8E717,unknown,0.0,unknown
5841,F666C7CAAD1D4,unknown,0.0,unknown
1515,F96337CC210C4,unknown,0.0,unknown
1613,F76E990EED8EC,unknown,0.0,unknown


In [204]:
biomarker_pdl1_wide = (
    biomarker_pdl1[['PatientID', 'pdl1_n']]
)

In [205]:
row_ID(biomarker_pdl1_wide)

(6336, 6336)

In [206]:
biomarker_wide = pd.merge(biomarker_notpdl1_wide, biomarker_pdl1_wide, on = 'PatientID')

In [207]:
row_ID(biomarker_wide)

(6336, 6336)

In [208]:
biomarker_wide.sample(5)

Unnamed: 0,PatientID,ER,HER2,PR,BRCA,PIK3CA,pdl1_n
5359,FF7C6E012A84A,negative,positive,negative,unknown,unknown,unknown
6327,F6BCCD9807F6E,unknown,unknown,unknown,unknown,unknown,unknown
5548,F7BA72A0B6C15,unknown,unknown,unknown,unknown,unknown,unknown
2297,F688277DF84CC,positive,negative,positive,negative,unknown,unknown
86,F037DD082A7C8,positive,negative,negative,unknown,unknown,unknown


In [209]:
%whos DataFrame

Variable                       Type         Data/Info
-----------------------------------------------------
biomarker_brca_pik             DataFrame               PatientID Biom<...>n[1820 rows x 23 columns]
biomarker_brca_pik_wide        DataFrame              PatientID  BRCA<...>\n[6336 rows x 3 columns]
biomarker_hr_her2              DataFrame               PatientID Biom<...>[26364 rows x 23 columns]
biomarker_hr_her2_wide         DataFrame              PatientID   ER <...>\n[6336 rows x 4 columns]
biomarker_notpdl1_wide         DataFrame              PatientID      <...>\n[6336 rows x 6 columns]
biomarker_pdl1                 DataFrame              PatientID pdl1_<...>\n[6336 rows x 4 columns]
biomarker_pdl1_staining        DataFrame             PatientID pdl1_s<...>n\n[248 rows x 2 columns]
biomarker_pdl1_staining_wide   DataFrame              PatientID pdl1_<...>\n[6336 rows x 2 columns]
biomarker_pdl1_status          DataFrame             PatientID  pdl1_<...>n\n[248 rows x 2 c

In [210]:
# Keep biomarker_wide, demographics, enhanced_met, med_admin_wide, and mortality
del biomarker_brca_pik
del biomarker_brca_pik_wide
del biomarker_hr_her2
del biomarker_hr_her2_wide
del biomarker_pdl1
del biomarker_pdl1_staining 
del biomarker_pdl1_staining_wide
del biomarker_pdl1_status
del biomarker_pdl1_status_wide
del biomarker_pdl1_wide 
del biomarker_win
del biomarker_win_pdl1
del biomarkers

### 6. Insurance

In [211]:
insurance = pd.read_csv('Insurance.csv')

In [212]:
insurance = insurance[insurance['PatientID'].isin(test_IDs)]

In [213]:
row_ID(insurance)

(28722, 5991)

**The insurance table contains patient insurance/payer information. Patients may have multiple payer categories concurrently. Start date is populated roughly 90% of the time, while end date is populated about 25% of the time. This multiple-row-per-patient table will be transformed into a single-row-per-patient table. Indicator variables for each payer category active at time of metastatic diagnosis will be made as columns. Insurance will be considered active if its start date is no later than 30 days after metastatic diagnosis, regardless of end date.** 
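
**A compact sketch of the long-to-wide transformation on toy data. It uses `pd.crosstab` and `clip` rather than the per-category `np.where` columns built in the cells that follow, so treat it as an illustration of the idea, not as the notebook's method. Rows and values are invented.**

```python
import pandas as pd

toy = pd.DataFrame({
    'PatientID': ['A', 'A', 'A', 'B'],
    'payer_category': ['Medicare', 'Commercial Health Plan',
                       'Medicare', 'Medicaid'],
    'insurance_date_diff': [-400, 10, 45, -20],  # days from index date
})

# Keep coverage that started no later than +30 days from the index date.
toy_win = toy[toy['insurance_date_diff'] <= 30]

# Count rows per patient and payer, then cap counts at 1 to get indicators.
toy_wide = (
    pd.crosstab(toy_win['PatientID'], toy_win['payer_category'])
    .clip(upper=1)
    .reset_index()
)
toy_wide.columns.name = None
```

**Patient A's second Medicare row starts 45 days after the index date and is excluded, but the earlier Medicare row still sets the indicator to 1.**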

In [214]:
insurance.loc[:, 'StartDate'] = pd.to_datetime(insurance['StartDate'])

In [215]:
insurance = pd.merge(insurance, enhanced_met[['PatientID', 'met_date']], on = 'PatientID', how = 'left')

In [216]:
# Remove rows with start dates before 1920, which are likely coding errors. 
insurance = insurance[(insurance['StartDate']).dt.year >= 1920]

In [217]:
insurance.loc[:, 'insurance_date_diff'] = (insurance['StartDate'] - insurance['met_date']).dt.days

In [218]:
insurance_win = insurance[insurance['insurance_date_diff'] <= 30]

In [219]:
row_ID(insurance)

(25993, 5626)

In [220]:
# Recode payer category 
conditions = [
    (insurance_win['IsMedicareAdv'] == 'Yes') | 
    (insurance_win['IsPartAOnly'] == 'Yes') | 
    (insurance_win['IsPartBOnly'] == 'Yes') |
    (insurance_win['IsPartAandPartB'] == 'Yes') |
    (insurance_win['IsPartDOnly'] == 'Yes'),
    (insurance_win['IsManagedGovtPlan'] == 'Yes'),
    (insurance_win['IsManagedMedicaid'] == 'Yes'),
    (insurance_win['IsMedicareMedicaid'] == 'Yes')]

choices = ['Medicare', 'Other Government Program', 'Medicaid', 'medicare_medicaid']

insurance_win.loc[:, 'payer_category'] = np.select(conditions, choices, insurance_win['PayerCategory'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = value


#### Medicare

In [221]:
insurance_win.loc[:, 'medicare'] = np.where(insurance_win['payer_category'] == 'Medicare', 1, 0)

#### Medicaid

In [222]:
insurance_win.loc[:, 'medicaid'] = np.where(insurance_win['payer_category'] == 'Medicaid', 1, 0)

#### Medicare/Medicaid 

In [223]:
insurance_win.loc[:, 'medicare_medicaid'] = np.where(insurance_win['payer_category'] == 'medicare_medicaid', 1, 0)

#### Commercial 

In [224]:
insurance_win.loc[:, 'commercial'] = np.where(insurance_win['payer_category'] == 'Commercial Health Plan', 1, 0)

#### Patient Assistance Programs 

In [225]:
insurance_win.loc[:, 'patient_assistance'] = np.where(insurance_win['payer_category'] == 'Patient Assistance Program', 1, 0)

#### Other Government Program 

In [226]:
insurance_win.loc[:, 'other_govt'] = np.where(insurance_win['payer_category'] == 'Other Government Program', 1, 0)

#### Self Pay 

In [227]:
insurance_win.loc[:, 'self_pay'] = np.where(insurance_win['payer_category'] == 'Self Pay', 1, 0)

#### Other Payer

In [228]:
insurance_win.loc[:, 'other'] = np.where(insurance_win['payer_category'] == 'Other Payer - Type Unknown', 1, 0)

#### Condense 

In [229]:
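# Sum the payer indicator columns by PatientID. Selecting them explicitly keeps
# string and date columns out of the sum.
payer_cols = [
    'medicare', 'medicaid', 'medicare_medicaid', 'commercial',
    'patient_assistance', 'other_govt', 'self_pay', 'other']

insurance_wide = (
    insurance_win
    .groupby('PatientID')[payer_cols].sum()
)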

In [230]:
# Set any value greater than 1 to 1; leave 0 unchanged. 
insurance_wide = (
    insurance_wide
    .mask(insurance_wide > 1, 1)
    .reset_index()
)

In [231]:
row_ID(insurance_wide)

(4596, 4596)

In [232]:
# Append missing test IDs.
insurance_wide = (
    insurance_wide
    .append(
        pd.Series(test_IDs)[~pd.Series(test_IDs).isin(insurance_wide['PatientID'])].to_frame(name = 'PatientID'),
        sort = False)
)

In [233]:
row_ID(insurance_wide)

(6336, 6336)

In [234]:
insurance_wide = insurance_wide.fillna(0)

In [235]:
insurance_wide.sample(5)

Unnamed: 0,PatientID,medicare,medicaid,medicare_medicaid,commercial,patient_assistance,other_govt,self_pay,other
1591,F54BBDBCDF92B,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2347,F821CB1322E4E,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4011,FF9287D0824FC,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
379,FC9A3D259EE70,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
482,FA9484B421C63,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [236]:
%whos DataFrame

Variable                 Type         Data/Info
-----------------------------------------------
biomarker_notpdl1_wide   DataFrame              PatientID      <...>\n[6336 rows x 6 columns]
biomarker_wide           DataFrame              PatientID      <...>\n[6336 rows x 7 columns]
demographics             DataFrame              PatientID gende<...>\n[6336 rows x 7 columns]
enhanced_met             DataFrame               PatientID diag<...>\n[6336 rows x 6 columns]
insurance                DataFrame               PatientID     <...>[25993 rows x 16 columns]
insurance_wide           DataFrame              PatientID  medi<...>\n[6336 rows x 9 columns]
insurance_win            DataFrame               PatientID     <...>[11919 rows x 25 columns]
med_admin_wide           DataFrame              PatientID  ster<...>n[6336 rows x 14 columns]
mets                     DataFrame               PatientID Date<...>n[15986 rows x 3 columns]
mets_max                 DataFrame              PatientID 

In [237]:
# Keep biomarker_wide, demographics, enhanced_met, insurance_wide, med_admin_wide, and mortality
del insurance
del insurance_win

### 7. ECOG

In [238]:
ecog = pd.read_csv('ECOG.csv')

In [239]:
ecog = ecog[ecog['PatientID'].isin(test_IDs)]

In [240]:
row_ID(ecog)

(126570, 4807)

**The ECOG table is a longitudinal record of structured ECOG scores captured in the EHR for each patient. Many patients have multiple ECOG scores reported. A new dataframe will be built where one ECOG score will be assigned to each patient. The index date will be the date of metastatic diagnosis, with an eligibility window of -90 days to +30 days from the index date. The ECOG score closest to the index date will be assigned to the patient. In the case of two ECOG scores on the same day or equidistant but on opposite sides of the index date, the higher ECOG score (worse performance) will be selected.** 

**BaselineECOG is a composite table that selects one ECOG score within +7 days and -30 days of a line of therapy. Patients might have two baseline ECOG values for line number 1 due to maintenance therapy. BaselineECOG will not be used for creating baseline models.** 
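
**A toy illustration of the ECOG selection rule: filter to the -90 to +30 day window, rank by absolute distance from the index date, and break ties in favor of the higher score. The rows are invented, and `abs_diff` is a helper column introduced only for this sketch.**

```python
import pandas as pd

toy = pd.DataFrame({
    'PatientID': ['A', 'A', 'A', 'B', 'B'],
    'EcogValue': [0, 2, 1, 1, 3],
    'ecog_date_diff': [-10, 10, 25, -95, 5],  # days from index date
})

# Eligibility window of -90 to +30 days (inclusive on both ends).
toy_win = toy[toy['ecog_date_diff'].between(-90, 30)]

toy_closest = (
    toy_win
    .assign(abs_diff=toy_win['ecog_date_diff'].abs())
    # Nearest date first; among equally near dates, highest ECOG first.
    .sort_values(by=['PatientID', 'abs_diff', 'EcogValue'],
                 ascending=[True, True, False])
    .drop_duplicates(subset=['PatientID'], keep='first')
    .filter(items=['PatientID', 'EcogValue'])
    .rename(columns={'EcogValue': 'ecog_diagnosis'})
)
```

**Patient A has scores of 0 and 2 at equal distance from the index date, so 2 is kept. Patient B's score at -95 days falls outside the window, leaving the score of 3.**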

In [241]:
ecog = pd.merge(ecog, enhanced_met[['PatientID', 'met_date']], on = 'PatientID', how = 'left')

In [242]:
ecog.loc[:, 'EcogDate'] = pd.to_datetime(ecog['EcogDate'])      

In [243]:
ecog.loc[:, 'ecog_date_diff'] = (ecog['EcogDate'] - ecog['met_date']).dt.days

In [244]:
ecog_win = ecog[(ecog['ecog_date_diff'] >= -90) & (ecog['ecog_date_diff'] <= 30)]

In [245]:
row_ID(ecog_win)

(7761, 2861)

In [246]:
# Time from metastatic diagnosis to ECOG date will be converted to an absolute value. 
ecog_win.loc[:, 'ecog_date_diff'] = ecog_win['ecog_date_diff'].abs()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(ilocs[0], value, pi)


In [247]:
# Sort so the ECOG nearest to the date of metastatic diagnosis is the top row (largest ECOG first if tied), then select the top row.
ecog_diagnosis_wide = (
    ecog_win
    .sort_values(by = ['PatientID', 'ecog_date_diff', 'EcogValue'], ascending = [True, True, False])
    .drop_duplicates(subset = ['PatientID'], keep = 'first' )
    .filter(items = ['PatientID', 'EcogValue'])
    .rename(columns = {'EcogValue': 'ecog_diagnosis'})
)

In [248]:
row_ID(ecog_diagnosis_wide)

(2861, 2861)

In [249]:
# Append missing test IDs. 
ecog_diagnosis_wide = (
    ecog_diagnosis_wide
    .append(
        pd.Series(test_IDs)[~pd.Series(test_IDs).isin(ecog_diagnosis_wide['PatientID'])].to_frame(name = 'PatientID'),
        sort = False)
    .fillna('unknown')
)

In [250]:
row_ID(ecog_diagnosis_wide)

(6336, 6336)

In [251]:
ecog_diagnosis_wide.sample(5)

Unnamed: 0,PatientID,ecog_diagnosis
188,FD588ABA550EF,unknown
87632,FDF96238FC317,0.0
240,F4F6658CECABE,unknown
1298,F4012A3CE3A85,unknown
3794,F9763C5B0576D,unknown


In [252]:
%whos DataFrame

Variable                 Type         Data/Info
-----------------------------------------------
biomarker_notpdl1_wide   DataFrame              PatientID      <...>\n[6336 rows x 6 columns]
biomarker_wide           DataFrame              PatientID      <...>\n[6336 rows x 7 columns]
demographics             DataFrame              PatientID gende<...>\n[6336 rows x 7 columns]
ecog                     DataFrame                PatientID    <...>[126570 rows x 6 columns]
ecog_diagnosis_wide      DataFrame               PatientID ecog<...>\n[6336 rows x 2 columns]
ecog_win                 DataFrame                PatientID    <...>\n[7761 rows x 6 columns]
enhanced_met             DataFrame               PatientID diag<...>\n[6336 rows x 6 columns]
insurance_wide           DataFrame              PatientID  medi<...>\n[6336 rows x 9 columns]
med_admin_wide           DataFrame              PatientID  ster<...>n[6336 rows x 14 columns]
mets                     DataFrame               PatientID

In [253]:
# Keep biomarker_wide, demographics, ecog_diagnosis_wide, enhanced_met, insurance_wide, med_admin_wide, and mortality
del ecog
del ecog_win

### 8. Vitals

In [254]:
vitals = pd.read_csv('Vitals.csv')

In [255]:
vitals = vitals[vitals['PatientID'].isin(test_IDs)]

In [256]:
row_ID(vitals)

(2008013, 6319)

**The Vitals table is a longitudinal record of vitals captured in the EHR for each patient. A weight and BMI variable at time of metastatic diagnosis will be created. The eligibility window will be -90 days to +30 days from metastatic diagnosis. Average height from all visits will be used to calculate BMI. In the case of two weights on the same day or equidistant but on opposite sides of the index date, the lowest weight will be selected. Percent change in weight and weight slope will be calculated from weights recorded within 3 months (-90 to +90 days) of metastatic diagnosis. Patients must have at least two weight recordings to calculate percent change in weight or weight slope.** 
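
**A short sketch of the BMI calculation and its missingness flag. It assumes weight is recorded in kilograms and height in centimeters, which is what the factor of 10,000 in the BMI cell of this section implies; the function name `bmi_from_kg_cm` and the toy values are hypothetical.**

```python
import pandas as pd

def bmi_from_kg_cm(weight_kg, height_cm):
    """BMI in kg/m^2 from weight in kg and height in cm (assumed units)."""
    return weight_kg / (height_cm ** 2) * 10000

toy = pd.DataFrame({
    'weight_diag': [70.0, 88.0],
    'height_avg': [165.0, None],  # second patient has no recorded height
})
toy['bmi_diag'] = bmi_from_kg_cm(toy['weight_diag'], toy['height_avg'])

# Flag patients whose BMI could not be computed.
toy['bmi_diag_na'] = toy['bmi_diag'].isna().astype(int)
```

**A missing height propagates to a missing BMI, which the indicator column then records as 1.**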

#### Weight and BMI

In [257]:
# Create weight dataframe; remove weight values that are empty or equal to zero.
weight = (
    vitals
    .query('Test == "body weight"')
    .filter(items = ['PatientID', 'TestDate', 'TestResultCleaned'])
    .rename(columns = {'TestResultCleaned': 'weight'})
    .dropna(subset = ['weight'])
    .query('weight != 0')
)

In [258]:
weight.loc[:, 'TestDate'] = pd.to_datetime(weight['TestDate'])

In [259]:
weight = pd.merge(weight, enhanced_met[['PatientID', 'met_date']], on = 'PatientID', how = 'left')

In [260]:
# Weight eligibility window is -90 to +30 days from metastatic diagnosis. 
weight_win_bmi = (
    weight
    .assign(weight_date_diff = (weight['TestDate'] - weight['met_date']).dt.days)
    .query('weight_date_diff >= -90 and weight_date_diff <= 30')
)

In [261]:
weight_win_bmi.loc[:, 'weight_date_diff'] = weight_win_bmi['weight_date_diff'].abs()

In [262]:
# Select weight closest to date of metastatic diagnosis; lowest weight selected in the event of two weights on same day or equidistant. 
weight_bmi_wide = (
    weight_win_bmi
    .sort_values(by = ['PatientID', 'weight_date_diff', 'weight'], ascending = [True, True, True])
    .drop_duplicates(subset = ['PatientID'], keep = 'first')
    .filter(items = ['PatientID', 'weight'])
    .rename(columns = {'weight': 'weight_diag'})
)

In [263]:
# Dataframe of average height for each patient. 
height_avg = (
    vitals
    .query('Test == "body height"')
    .filter(items = ['PatientID', 'TestResultCleaned'])
    .query('TestResultCleaned > 0')
    .groupby('PatientID')['TestResultCleaned'].mean()
    .to_frame()
    .reset_index()
    .rename(columns = {'TestResultCleaned': 'height_avg'})
)

In [264]:
weight_bmi_wide = pd.merge(weight_bmi_wide, height_avg, on = 'PatientID', how = 'left')

In [265]:
# Create BMI column. 
weight_bmi_wide = (
    weight_bmi_wide
    .assign(bmi_diag = lambda x: (x['weight_diag']/(x['height_avg']*x['height_avg']))*10000)
    .drop(columns = ['height_avg'])
)

In [266]:
# Append missing test IDs and create a missing variable for those without BMI at diagnosis. 
weight_bmi_wide = (
    weight_bmi_wide
    .append(
        pd.Series(test_IDs)[~pd.Series(test_IDs).isin(weight_bmi_wide['PatientID'])].to_frame(name = 'PatientID'),
        sort = False)
)

In [267]:
row_ID(weight_bmi_wide)

(6336, 6336)

In [268]:
weight_bmi_wide.loc[:, 'bmi_diag_na'] = np.where(weight_bmi_wide['bmi_diag'].isna(), 1, 0)

#### Percent change 

In [269]:
# Select eligibility window of -90 to +90 days from metastatic diagnosis.
weight_win_summary = (
    weight
    .assign(weight_date_diff = (weight['TestDate'] - weight['met_date']).dt.days)
    .query('weight_date_diff >= -90 and weight_date_diff <= 90')
)

In [270]:
# Select patients with more than 1 weight recording within eligibility window.
weight_win_summary = weight_win_summary[weight_win_summary.duplicated(subset = ['PatientID'], keep = False)]

In [271]:
# Select weight from the earliest time within eligibility window. 
weight_tmin = weight_win_summary.loc[weight_win_summary.groupby('PatientID')['weight_date_diff'].idxmin()]

In [272]:
# Select weight from the latest time within eligibility window. 
weight_tmax = weight_win_summary.loc[weight_win_summary.groupby('PatientID')['weight_date_diff'].idxmax()]

In [273]:
# Combine above two dataframes and sort from earliest recorded weight to latest recorded weight for each patient. 
weight_tcomb = (
    pd.concat([weight_tmin, weight_tmax])
    .sort_values(by = ['PatientID', 'weight_date_diff'], ascending = True)
)

In [274]:
row_ID(weight_tcomb)

(9404, 4702)

In [275]:
weight_tcomb.loc[:, 'weight_pct_change'] = weight_tcomb.groupby('PatientID')['weight'].pct_change()

In [276]:
weight_tcomb.loc[:, 'diff_date_diff'] = weight_tcomb['weight_date_diff'].diff()

In [277]:
# Drop empty rows for weight_pct_change.
weight_pct_wide = (
    weight_tcomb
    .dropna(subset = ['weight_pct_change'])
    .filter(items = ['PatientID', 'weight_pct_change', 'diff_date_diff'])
)

In [278]:
row_ID(weight_pct_wide)

(4702, 4702)
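
**The percent change computed above is the relative difference between the earliest and latest weights in the window. The sketch below reproduces that quantity on toy data using `groupby(...).first()` and `.last()` instead of the `idxmin`/`idxmax` and `pct_change` steps; the rows are invented.**

```python
import pandas as pd

toy = pd.DataFrame({
    'PatientID': ['A', 'A', 'A', 'B'],
    'weight_date_diff': [-30, 0, 60, 5],  # days from index date
    'weight': [80.0, 78.0, 76.0, 65.0],   # kg
})

# Keep patients with at least two weight recordings in the window.
toy = toy[toy.duplicated(subset=['PatientID'], keep=False)]

# Order chronologically, then take the first and last weight per patient.
ordered = toy.sort_values(by=['PatientID', 'weight_date_diff'])
first = ordered.groupby('PatientID')['weight'].first()
last = ordered.groupby('PatientID')['weight'].last()

toy_pct = ((last - first) / first).reset_index(name='weight_pct_change')
```

**Patient A goes from 80 kg to 76 kg, a change of -0.05 (5% loss). Patient B has a single recording and is excluded.**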

In [279]:
# Append missing test IDs and create a missing variable for those without weight_pct_change. 
weight_pct_wide = (
    weight_pct_wide
    .append(
        pd.Series(test_IDs)[~pd.Series(test_IDs).isin(weight_pct_wide['PatientID'])].to_frame(name = 'PatientID'),
        sort = False)
    .drop(columns = ['diff_date_diff'])
)

In [280]:
row_ID(weight_pct_wide)

(6336, 6336)

In [281]:
weight_pct_wide.loc[:, 'weight_pct_na'] = np.where(weight_pct_wide['weight_pct_change'].isna(), 1, 0)

#### Weight slope

In [282]:
from scipy.stats import linregress 

In [283]:
weight_win_summary.loc[:, 'date_ordinal'] = weight_win_summary['TestDate'].map(dt.datetime.toordinal)

In [284]:
# Dataframe of slope for weight recordings within window period (kg/day).
weight_slope_wide = (
    weight_win_summary
    .groupby('PatientID')
    .apply(lambda x: pd.Series(linregress(x['date_ordinal'], x['weight'])))
    .rename(columns = {0: 'weight_slope'})
    .reset_index()
    .filter(items = ['PatientID', 'weight_slope']))   

In [285]:
row_ID(weight_slope_wide)

(4702, 4702)
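
`linregress` fits an ordinary least-squares line, whose slope is the covariance of date and weight divided by the variance of date. The sketch below reimplements only that slope in plain Python to show what the `weight_slope` column measures and why a patient whose recordings all fall on one date has no defined slope. The function name `ols_slope` and the example values are hypothetical.

```python
import math
from datetime import date

def ols_slope(dates, values):
    """Least-squares slope of values against dates, in units per day."""
    xs = [d.toordinal() for d in dates]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(values) / len(values)
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    # With a single distinct date the denominator is zero and no slope exists.
    return ss_xy / ss_x if ss_x else math.nan

# Hypothetical patient losing 1 kg every 10 days.
steady_loss = ols_slope(
    [date(2020, 1, 1), date(2020, 1, 11), date(2020, 1, 21)],
    [80.0, 79.0, 78.0],
)

# Hypothetical patient with two recordings on the same day.
same_day = ols_slope([date(2020, 3, 5), date(2020, 3, 5)], [70.0, 71.0])
```

The first call gives a slope of about -0.1 kg/day, and the second returns NaN, which the later merge carries through as a missing `weight_slope`.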

In [286]:
# Append missing test IDs. 
weight_slope_wide = (
    pd.concat(
        [weight_slope_wide,
         pd.Series(test_IDs)[~pd.Series(test_IDs).isin(weight_slope_wide['PatientID'])].to_frame(name = 'PatientID')],
        sort = False)
)

In [287]:
row_ID(weight_slope_wide)

(6336, 6336)

#### Weight merge 

In [288]:
weight_wide = pd.merge(weight_bmi_wide, weight_pct_wide, on = 'PatientID')

In [289]:
weight_wide = pd.merge(weight_wide, weight_slope_wide, on = 'PatientID')

In [290]:
row_ID(weight_wide)

(6336, 6336)

In [291]:
weight_wide.sample(5)

| | PatientID | weight_diag | bmi_diag | bmi_diag_na | weight_pct_change | weight_pct_na | weight_slope |
|---|---|---|---|---|---|---|---|
| 5538 | FA3AB91EFA8FA | NaN | NaN | 1 | NaN | 1 | NaN |
| 1675 | F59E01635E1A4 | 87.996848 | 32.282932 | 0 | -0.095993 | 0 | -0.090335 |
| 5870 | F6A4E120BBB59 | NaN | NaN | 1 | NaN | 1 | NaN |
| 511 | F1C2DE6245BF4 | 69.6 | 24.324278 | 0 | -0.031474 | 0 | -0.017796 |
| 118 | F068FEE2B8043 | 78.925008 | 27.251951 | 0 | -0.017241 | 0 | -0.024567 |


In [292]:
%whos DataFrame

Variable                 Type         Data/Info
-----------------------------------------------
biomarker_notpdl1_wide   DataFrame              PatientID      <...>\n[6336 rows x 6 columns]
biomarker_wide           DataFrame              PatientID      <...>\n[6336 rows x 7 columns]
demographics             DataFrame              PatientID gende<...>\n[6336 rows x 7 columns]
ecog_diagnosis_wide      DataFrame               PatientID ecog<...>\n[6336 rows x 2 columns]
enhanced_met             DataFrame               PatientID diag<...>\n[6336 rows x 6 columns]
height_avg               DataFrame              PatientID  heig<...>\n[6250 rows x 2 columns]
insurance_wide           DataFrame              PatientID  medi<...>\n[6336 rows x 9 columns]
med_admin_wide           DataFrame              PatientID  ster<...>n[6336 rows x 14 columns]
mets                     DataFrame               PatientID Date<...>n[15986 rows x 3 columns]
mets_max                 DataFrame              PatientID 

In [293]:
# Keep biomarker_wide, demographics, ecog_diagnosis_wide, enhanced_met, insurance_wide, med_admin_wide, mortality, 
# and weight_wide
del height_avg
del vitals
del weight
del weight_bmi_wide
del weight_pct_wide
del weight_slope_wide
del weight_tcomb
del weight_tmax
del weight_tmin
del weight_win_bmi
del weight_win_summary

### 9. Labs

In [294]:
lab = pd.read_csv('Lab.csv')

In [295]:
lab = lab[lab['PatientID'].isin(test_IDs)]

In [296]:
row_ID(lab)

(6110125, 6057)

**The Lab table is a longitudinal record of lab results captured in the EHR, with multiple rows per patient. A single-row-per-patient table will be built focusing on the following NCCN-recommended labs:** 
* **Creatinine -- (LOINC: 2160-0 and 38483-4)**
* **Hemoglobin -- (LOINC: 718-7 and 20509-6)**
* **White blood cell count -- (LOINC: 26464-8 and 6690-2)**
* **Neutrophil count -- (LOINC: 26499-4, 751-8, 30451-9, and 753-4)**
* **Albumin, serum -- (LOINC: 1751-7)**
* **Total bilirubin -- (LOINC: 42719-5 and 1975-2)**
* **Sodium -- (LOINC: 2947-0 and 2951-2)**
* **Bicarb -- (LOINC: 1963-8, 1959-6, 14627-4, 1960-4, and 2028-9)**
* **Calcium -- (LOINC: 17861-6 and 49765-1)** 
* **AST -- (LOINC: 1920-8)**
* **ALT -- (LOINC: 1742-6, 1743-4, and 1744-2)**
* **Platelet -- (LOINC: 26515-7, 777-3, 778-1, and 49497-1)**
* **Potassium -- (LOINC: 6298-4 and 2823-3)**
* **Chloride -- (LOINC: 2075-0)**
* **BUN -- (LOINC: 3094-0)**
* **ALP -- (LOINC: 6768-6)**

**The index date is the date of metastatic diagnosis, with an eligibility window of -90 days to +30 days. The lab value closest to the index date will be selected for each patient. The following summary statistics, using an eligibility window of negative infinity to +30 days from metastatic diagnosis, will also be created for the above variables:** 
* **Max**
* **Min**
* **Mean**
* **Standard deviation** 
* **Slope**
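
The two eligibility windows described above can be illustrated for a single patient and a single lab. This is a minimal sketch with made-up day offsets and results, written in plain Python rather than pandas; the variable names are hypothetical.

```python
# Hypothetical (days_from_index, result) pairs for one patient and one lab.
records = [(-400, 4.1), (-60, 3.9), (-5, 3.6), (12, 3.4), (45, 3.0)]

# Baseline window: -90 to +30 days; keep the result closest to the index date.
baseline_window = [r for r in records if -90 <= r[0] <= 30]
baseline = min(baseline_window, key=lambda r: abs(r[0]))[1]

# Summary window: anything up to +30 days, with no lower bound.
summary_values = [value for day, value in records if day <= 30]
summary = {
    'max': max(summary_values),
    'min': min(summary_values),
    'mean': sum(summary_values) / len(summary_values),
}
```

The result at day -400 is too old for the baseline value but still contributes to the summary statistics, while the result at day +45 is excluded from both.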

#### 9.1 Baseline lab values

In [297]:
lab = pd.merge(lab, enhanced_met[['PatientID', 'met_date']], on = 'PatientID', how = 'left')

In [298]:
lab.loc[:, 'ResultDate'] = pd.to_datetime(lab['ResultDate']) 

In [299]:
# Select rows with clinically relevant labs.
core_loinc = [
    '2160-0', '38483-4',                                  # creatinine
    '718-7', '20509-6',                                   # hemoglobin
    '26464-8', '6690-2',                                  # white blood cell count
    '26499-4', '751-8', '30451-9', '753-4',               # neutrophil count
    '1751-7',                                             # albumin
    '42719-5', '1975-2',                                  # total bilirubin
    '2947-0', '2951-2',                                   # sodium
    '1963-8', '1959-6', '14627-4', '1960-4', '2028-9',    # bicarb
    '17861-6', '49765-1',                                 # calcium
    '1920-8',                                             # AST
    '1742-6', '1743-4', '1744-2',                         # ALT
    '26515-7', '777-3', '778-1', '49497-1',               # platelet
    '6298-4', '2823-3',                                   # potassium
    '2075-0',                                             # chloride
    '3094-0',                                             # BUN
    '6768-6']                                             # ALP

lab_core = (
    lab[lab['LOINC'].isin(core_loinc)]
    .filter(items = ['PatientID', 
                     'ResultDate', 
                     'LOINC', 
                     'LabComponent', 
                     'TestUnits', 
                     'TestUnitsCleaned', 
                     'TestResult', 
                     'TestResultCleaned', 
                     'met_date'])
)

In [300]:
conditions = [
    ((lab_core['LOINC'] == '2160-0') | (lab_core['LOINC'] == '38483-4')),
    ((lab_core['LOINC'] == '718-7') | (lab_core['LOINC'] == '20509-6')),
    ((lab_core['LOINC'] == '26464-8') | (lab_core['LOINC'] == '6690-2')), 
    ((lab_core['LOINC'] == '26499-4') | (lab_core['LOINC'] == '751-8') | (lab_core['LOINC'] == '30451-9') | (lab_core['LOINC'] == '753-4')),
    (lab_core['LOINC'] == '1751-7'),
    ((lab_core['LOINC'] == '42719-5') | (lab_core['LOINC'] == '1975-2')),
    ((lab_core['LOINC'] == '2947-0') | (lab_core['LOINC'] == '2951-2')),
    ((lab_core['LOINC'] == '1963-8') | (lab_core['LOINC'] == '1959-6') | (lab_core['LOINC'] == '14627-4') | (lab_core['LOINC'] == '1960-4') | (lab_core['LOINC'] == '2028-9')),
    ((lab_core['LOINC'] == '17861-6') | (lab_core['LOINC'] == '49765-1')),
    (lab_core['LOINC'] == '1920-8'),
    ((lab_core['LOINC'] == '1742-6') | (lab_core['LOINC'] == '1743-4') | (lab_core['LOINC'] == '1744-2')),
    ((lab_core['LOINC'] == '26515-7') | (lab_core['LOINC'] == '777-3') | (lab_core['LOINC'] == '778-1') | (lab_core['LOINC'] == '49497-1')),
    ((lab_core['LOINC'] == '6298-4') | (lab_core['LOINC'] == '2823-3')),
    (lab_core['LOINC'] == '2075-0'), 
    (lab_core['LOINC'] == '3094-0'),
    (lab_core['LOINC'] == '6768-6')]

choices = ['creatinine', 
           'hemoglobin', 
           'wbc', 
           'neutrophil_count',  
           'albumin', 
           'total_bilirubin', 
           'sodium', 
           'bicarb',
           'calcium',
           'ast', 
           'alt',
           'platelet',
           'potassium', 
           'chloride',
           'bun',
           'alp']

lab_core.loc[:, 'lab_name'] = np.select(conditions, choices)
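
A dictionary lookup is one possible alternative to `np.select` for the same labeling step. The sketch below assumes a mapping named `loinc_to_lab` (hypothetical) that holds only a handful of the LOINC codes listed earlier; a full version would need every code. Unlike `np.select`, `Series.map` leaves unmapped codes as NaN.

```python
import pandas as pd

# A few of the LOINC codes listed above; a full mapping would cover all of them.
loinc_to_lab = {
    '2160-0': 'creatinine', '38483-4': 'creatinine',
    '718-7': 'hemoglobin', '20509-6': 'hemoglobin',
    '1751-7': 'albumin',
}

# Hypothetical lab rows identified only by their LOINC code.
toy_lab = pd.DataFrame({'LOINC': ['718-7', '38483-4', '1751-7', '20509-6']})
toy_lab['lab_name'] = toy_lab['LOINC'].map(loinc_to_lab)
```

Keeping the code-to-name pairs in one place means the row filter and the labeling step can share a single source of codes.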

In [301]:
# Remove missing lab values. 
lab_core = lab_core.dropna(subset = ['TestResultCleaned'])

In [302]:
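# Rescale cleaned results recorded with TestUnits of '10*3/L' (wbc, neutrophil_count, platelet) or 'g/uL' (hemoglobin).
conditions = [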
    ((lab_core['lab_name'] == 'wbc') | (lab_core['lab_name'] == 'neutrophil_count') | (lab_core['lab_name'] == 'platelet')) & 
    (lab_core['TestUnits'] == '10*3/L'),
    (lab_core['lab_name'] == 'hemoglobin') & (lab_core['TestUnits'] == 'g/uL')]

choices = [lab_core['TestResultCleaned'] * 1000000,
           lab_core['TestResultCleaned'] / 100000]

lab_core.loc[:, 'test_result_cleaned'] = np.select(conditions, choices, default = lab_core['TestResultCleaned'])

In [303]:
# Eligibility window is -90 to +30 days from metastatic diagnosis. 
lab_core_win = (
    lab_core
    .assign(lab_date_diff = (lab_core['ResultDate'] - lab_core['met_date']).dt.days)
    .query('lab_date_diff >= -90 and lab_date_diff <= 30')
    .filter(items = ['PatientID', 'ResultDate', 'TestResultCleaned', 'lab_name', 'met_date', 'test_result_cleaned', 'lab_date_diff'])
)

In [304]:
lab_core_win.loc[:, 'lab_date_diff'] = lab_core_win['lab_date_diff'].abs()

In [305]:
# Select lab closest to date of metastatic diagnosis and pivot to a wide table. 
lab_diag_wide = (
    lab_core_win
    .loc[lab_core_win.groupby(['PatientID', 'lab_name'])['lab_date_diff'].idxmin()]
    .pivot(index = 'PatientID', columns = 'lab_name', values = 'test_result_cleaned')
    .reset_index()
    .rename(columns = {
        'albumin': 'albumin_diag',
        'creatinine': 'creatinine_diag',
        'hemoglobin': 'hemoglobin_diag',
        'neutrophil_count': 'neutrophil_count_diag',
        'total_bilirubin': 'total_bilirubin_diag',
        'wbc': 'wbc_diag',
        'sodium': 'sodium_diag', 
        'bicarb': 'bicarb_diag',
        'calcium': 'calcium_diag',
        'ast': 'ast_diag', 
        'alt': 'alt_diag',
        'platelet': 'platelet_diag',
        'potassium': 'potassium_diag',
        'chloride': 'chloride_diag',
        'bun': 'bun_diag',
        'alp': 'alp_diag'})
)

lab_diag_wide.columns.name = None

In [306]:
row_ID(lab_diag_wide)

(4169, 4169)

In [307]:
# Append test IDs without a baseline lab value. 
lab_diag_wide = (
    pd.concat(
        [lab_diag_wide,
         pd.Series(test_IDs)[~pd.Series(test_IDs).isin(lab_diag_wide['PatientID'])].to_frame(name = 'PatientID')],
        sort = False)
)

In [308]:
row_ID(lab_diag_wide)

(6336, 6336)
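
An equivalent way to bring a partial table up to the full cohort is to reindex on `PatientID`. This is a sketch on made-up data, not the notebook's method; `cohort_ids` stands in for `test_IDs`, and the values are hypothetical. Note that reindexing returns rows in cohort order, whereas appending places the added IDs at the end.

```python
import pandas as pd

# Hypothetical cohort of four IDs, two of which have a baseline value.
cohort_ids = ['A', 'B', 'C', 'D']
partial = pd.DataFrame({'PatientID': ['B', 'D'], 'albumin_diag': [38.0, 41.5]})

# Reindexing on PatientID adds the absent IDs as all-NaN rows.
full = (
    partial
    .set_index('PatientID')
    .reindex(cohort_ids)
    .rename_axis('PatientID')
    .reset_index()
)
```

The result has one row per cohort member, with NaN wherever no value was recorded.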

In [309]:
# Create missing variables for labs at time of diagnosis. 
for x in range (1, len(lab_diag_wide.columns)):
    lab_diag_wide.loc[:, lab_diag_wide.columns[x]+'_na'] = np.where(lab_diag_wide[lab_diag_wide.columns[x]].isna(), 1, 0)

In [310]:
list(lab_diag_wide.columns)

['PatientID',
 'albumin_diag',
 'alp_diag',
 'alt_diag',
 'ast_diag',
 'bicarb_diag',
 'bun_diag',
 'calcium_diag',
 'chloride_diag',
 'creatinine_diag',
 'hemoglobin_diag',
 'neutrophil_count_diag',
 'platelet_diag',
 'potassium_diag',
 'sodium_diag',
 'total_bilirubin_diag',
 'wbc_diag',
 'albumin_diag_na',
 'alp_diag_na',
 'alt_diag_na',
 'ast_diag_na',
 'bicarb_diag_na',
 'bun_diag_na',
 'calcium_diag_na',
 'chloride_diag_na',
 'creatinine_diag_na',
 'hemoglobin_diag_na',
 'neutrophil_count_diag_na',
 'platelet_diag_na',
 'potassium_diag_na',
 'sodium_diag_na',
 'total_bilirubin_diag_na',
 'wbc_diag_na']
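
The same indicator columns can also be built without a loop. The sketch below uses a made-up three-patient table and assumes every column other than `PatientID` should receive an `_na` flag.

```python
import numpy as np
import pandas as pd

# Hypothetical baseline labs with some values missing.
toy = pd.DataFrame({
    'PatientID': ['A', 'B', 'C'],
    'albumin_diag': [38.0, np.nan, 41.5],
    'alp_diag': [np.nan, np.nan, 90.0],
})

# Build every indicator at once: 1 where the value is missing, else 0.
value_cols = toy.columns.drop('PatientID')
flags = toy[value_cols].isna().astype(int).add_suffix('_na')
toy_flagged = pd.concat([toy, flags], axis=1)
```

The flag columns come out in the same order as the value columns, matching the layout printed above.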

#### Mean, max, min, and standard deviation

In [311]:
# Eligibility window is negative infinity to +30 days from metastatic diagnosis. 
lab_core_win_summ = (
    lab_core
    .assign(lab_date_diff = (lab_core['ResultDate'] - lab_core['met_date']).dt.days)
    .query('lab_date_diff <= 30')
    .filter(items = ['PatientID', 'ResultDate', 'TestResultCleaned', 'lab_name', 'met_date', 'test_result_cleaned', 'lab_date_diff'])
)

In [312]:
# Pivot table of average values for core labs during eligibility period of negative infinity to +30 days from metastatic diagnosis. 
lab_avg_wide = (
    lab_core_win_summ
    .groupby(['PatientID', 'lab_name'])['test_result_cleaned'].mean()
    .to_frame()
    .reset_index()
    .pivot(index = 'PatientID', columns = 'lab_name', values = 'test_result_cleaned')
    .reset_index()
    .rename(columns = {
        'albumin': 'albumin_avg',
        'creatinine': 'creatinine_avg',
        'hemoglobin': 'hemoglobin_avg',
        'neutrophil_count': 'neutrophil_count_avg',
        'total_bilirubin': 'total_bilirubin_avg',
        'wbc': 'wbc_avg',
        'sodium': 'sodium_avg', 
        'bicarb': 'bicarb_avg',
        'calcium': 'calcium_avg',
        'ast': 'ast_avg', 
        'alt': 'alt_avg',
        'platelet': 'platelet_avg',
        'potassium': 'potassium_avg',
        'chloride': 'chloride_avg',
        'bun': 'bun_avg',
        'alp': 'alp_avg'})
)

lab_avg_wide.columns.name = None

In [313]:
row_ID(lab_avg_wide)

(4533, 4533)

In [314]:
# Pivot table of maximum values for core labs during eligibility period of negative infinity to +30 days from metastatic diagnosis. 
lab_max_wide = (
    lab_core_win_summ
    .groupby(['PatientID', 'lab_name'])['test_result_cleaned'].max()
    .to_frame()
    .reset_index()
    .pivot(index = 'PatientID', columns = 'lab_name', values = 'test_result_cleaned')
    .reset_index()
    .rename(columns = {
        'albumin': 'albumin_max',
        'creatinine': 'creatinine_max',
        'hemoglobin': 'hemoglobin_max',
        'neutrophil_count': 'neutrophil_count_max',
        'total_bilirubin': 'total_bilirubin_max',
        'wbc': 'wbc_max', 
        'sodium': 'sodium_max', 
        'bicarb': 'bicarb_max',
        'calcium': 'calcium_max',
        'ast': 'ast_max', 
        'alt': 'alt_max',
        'platelet': 'platelet_max',
        'potassium': 'potassium_max',
        'chloride': 'chloride_max',
        'bun': 'bun_max', 
        'alp': 'alp_max'})
)

lab_max_wide.columns.name = None

In [315]:
row_ID(lab_max_wide)

(4533, 4533)

In [316]:
# Pivot table of minimum values for core labs during eligibility period of negative infinity to +30 days from metastatic diagnosis. 
lab_min_wide = (
    lab_core_win_summ
    .groupby(['PatientID', 'lab_name'])['test_result_cleaned'].min()
    .to_frame()
    .reset_index()
    .pivot(index = 'PatientID', columns = 'lab_name', values = 'test_result_cleaned')
    .reset_index()
    .rename(columns = {
        'albumin': 'albumin_min',
        'creatinine': 'creatinine_min',
        'hemoglobin': 'hemoglobin_min',
        'neutrophil_count': 'neutrophil_count_min',
        'total_bilirubin': 'total_bilirubin_min',
        'wbc': 'wbc_min',
        'sodium': 'sodium_min', 
        'bicarb': 'bicarb_min',
        'calcium': 'calcium_min',
        'ast': 'ast_min', 
        'alt': 'alt_min',
        'platelet': 'platelet_min',
        'potassium': 'potassium_min',
        'chloride': 'chloride_min',
        'bun': 'bun_min',
        'alp': 'alp_min'})
)

lab_min_wide.columns.name = None

In [317]:
row_ID(lab_min_wide)

(4533, 4533)

In [318]:
# Pivot table of standard deviation for core labs during eligibility period of negative infinity to +30 days from metastatic diagnosis. 
lab_std_wide = (
    lab_core_win_summ
    .groupby(['PatientID', 'lab_name'])['test_result_cleaned'].std()
    .to_frame()
    .reset_index()
    .pivot(index = 'PatientID', columns = 'lab_name', values = 'test_result_cleaned')
    .reset_index()
    .rename(columns = {
        'albumin': 'albumin_std',
        'creatinine': 'creatinine_std',
        'hemoglobin': 'hemoglobin_std',
        'neutrophil_count': 'neutrophil_count_std',
        'total_bilirubin': 'total_bilirubin_std',
        'wbc': 'wbc_std',
        'sodium': 'sodium_std', 
        'bicarb': 'bicarb_std',
        'calcium': 'calcium_std',
        'ast': 'ast_std', 
        'alt': 'alt_std',
        'platelet': 'platelet_std',
        'potassium': 'potassium_std',
        'chloride': 'chloride_std',
        'bun': 'bun_std', 
        'alp': 'alp_std'})
)

lab_std_wide.columns.name = None

In [319]:
row_ID(lab_std_wide)

(4533, 4533)

In [320]:
lab_summary_wide = pd.merge(lab_avg_wide, lab_max_wide, on = 'PatientID', how = 'outer')

In [321]:
lab_summary_wide = pd.merge(lab_summary_wide, lab_min_wide, on = 'PatientID', how = 'outer')

In [322]:
lab_summary_wide = pd.merge(lab_summary_wide, lab_std_wide, on = 'PatientID', how = 'outer')

In [323]:
row_ID(lab_summary_wide)

(4533, 4533)
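
The four pivots above could be produced in a single pass, which is sketched below on hypothetical data. This is an alternative arrangement under the assumption that the `_avg`, `_max`, `_min`, and `_std` suffixes are the only naming requirement; it is not how the notebook builds `lab_summary_wide`.

```python
import pandas as pd

# Hypothetical long-format lab results.
toy = pd.DataFrame({
    'PatientID': ['A', 'A', 'A', 'B', 'B'],
    'lab_name': ['albumin', 'albumin', 'sodium', 'albumin', 'albumin'],
    'test_result_cleaned': [40.0, 36.0, 140.0, 30.0, 34.0],
})

# Aggregate once, then move lab_name into the columns.
stats = (
    toy
    .groupby(['PatientID', 'lab_name'])['test_result_cleaned']
    .agg(['mean', 'max', 'min', 'std'])
    .rename(columns={'mean': 'avg'})
    .unstack('lab_name')
)

# Flatten the (statistic, lab) column pairs into names such as 'albumin_avg'.
stats.columns = [f'{lab}_{stat}' for stat, lab in stats.columns]
toy_summary = stats.reset_index()
```

A lab with a single result, such as sodium for patient A, has an average but a NaN standard deviation, and a lab a patient never had is NaN throughout.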

In [324]:
# Append test IDs without any lab summary statistics. 
lab_summary_wide = (
    pd.concat(
        [lab_summary_wide,
         pd.Series(test_IDs)[~pd.Series(test_IDs).isin(lab_summary_wide['PatientID'])].to_frame(name = 'PatientID')],
        sort = False)
)

In [325]:
row_ID(lab_summary_wide)

(6336, 6336)

In [326]:
lab_summary_wide.sample(5)

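| | PatientID | albumin_avg | alp_avg | alt_avg | ast_avg | bicarb_avg | bun_avg | calcium_avg | chloride_avg | creatinine_avg | ... | calcium_std | chloride_std | creatinine_std | hemoglobin_std | neutrophil_count_std | platelet_std | potassium_std | sodium_std | total_bilirubin_std | wbc_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3590 | FCA3617D6D41F | 41.0 | 69.0 | 19.0 | 12.25 | 30.0 | 12.5 | 9.4 | 103.25 | 0.735 | ... | 0.432049 | 2.076054 | 0.120692 | 1.197001 | 0.936647 | 16.753109 | 0.338858 | 0.861684 | 0.173205 | 1.193797 |
| 2421 | F89814180B8CD | 40.362222 | 103.348837 | 11.860465 | 15.418605 | 24.169767 | 22.030233 | 9.204651 | 105.186047 | 0.964651 | ... | 0.303121 | 2.129787 | 0.124238 | 0.843645 | 1.212769 | 54.343248 | 0.382233 | 2.219914 | 0.139529 | 1.644034 |
| 2286 | F81C4CB85CF6F | 33.666667 | 1103.0 | 16.333333 | 29.666667 | 27.666667 | 16.333333 | 9.233333 | 91.666667 | 0.8 | ... | 0.251661 | 1.527525 | 0.1 | 0.52915 | 2.987195 | 64.531646 | 0.321455 | 2.516611 | 0.057735 | 2.787472 |
| 2091 | F2655469D11F3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1081 | F4E2E8EA130FA | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |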


In [327]:
list(lab_summary_wide.columns)

['PatientID',
 'albumin_avg',
 'alp_avg',
 'alt_avg',
 'ast_avg',
 'bicarb_avg',
 'bun_avg',
 'calcium_avg',
 'chloride_avg',
 'creatinine_avg',
 'hemoglobin_avg',
 'neutrophil_count_avg',
 'platelet_avg',
 'potassium_avg',
 'sodium_avg',
 'total_bilirubin_avg',
 'wbc_avg',
 'albumin_max',
 'alp_max',
 'alt_max',
 'ast_max',
 'bicarb_max',
 'bun_max',
 'calcium_max',
 'chloride_max',
 'creatinine_max',
 'hemoglobin_max',
 'neutrophil_count_max',
 'platelet_max',
 'potassium_max',
 'sodium_max',
 'total_bilirubin_max',
 'wbc_max',
 'albumin_min',
 'alp_min',
 'alt_min',
 'ast_min',
 'bicarb_min',
 'bun_min',
 'calcium_min',
 'chloride_min',
 'creatinine_min',
 'hemoglobin_min',
 'neutrophil_count_min',
 'platelet_min',
 'potassium_min',
 'sodium_min',
 'total_bilirubin_min',
 'wbc_min',
 'albumin_std',
 'alp_std',
 'alt_std',
 'ast_std',
 'bicarb_std',
 'bun_std',
 'calcium_std',
 'chloride_std',
 'creatinine_std',
 'hemoglobin_std',
 'neutrophil_count_std',
 'platelet_std',
 'potassium

#### Slope

In [328]:
lab_core_win_summ.loc[:, 'result_date_ordinal'] = lab_core_win_summ['ResultDate'].map(dt.datetime.toordinal)

In [329]:
lab_slope_wide = (
    lab_core_win_summ
    .groupby(['PatientID', 'lab_name'])
    .apply(lambda x: pd.Series(linregress(x['result_date_ordinal'], x['test_result_cleaned'])))
    .rename(columns = {0: 'slope'})
    .reset_index()
    .filter(items = ['PatientID', 'lab_name', 'slope'])
    .pivot(index = 'PatientID', columns = 'lab_name', values = 'slope')
    .reset_index()
    .rename(columns = {
        'albumin': 'albumin_slope',
        'creatinine': 'creatinine_slope',
        'hemoglobin': 'hemoglobin_slope',
        'neutrophil_count': 'neutrophil_count_slope',
        'total_bilirubin': 'total_bilirubin_slope',
        'wbc': 'wbc_slope',
        'sodium': 'sodium_slope', 
        'bicarb': 'bicarb_slope',
        'calcium': 'calcium_slope',
        'ast': 'ast_slope', 
        'alt': 'alt_slope',
        'platelet': 'platelet_slope',
        'potassium': 'potassium_slope',
        'chloride': 'chloride_slope',
        'bun': 'bun_slope',
        'alp': 'alp_slope'})
)

lab_slope_wide.columns.name = None

In [330]:
row_ID(lab_slope_wide)

(4533, 4533)

In [331]:
# Append test IDs without a lab slope. 
lab_slope_wide = (
    pd.concat(
        [lab_slope_wide,
         pd.Series(test_IDs)[~pd.Series(test_IDs).isin(lab_slope_wide['PatientID'])].to_frame(name = 'PatientID')],
        sort = False)
)

In [332]:
# Create missing variables for lab slope. 
for x in range (1, len(lab_slope_wide.columns)):
    lab_slope_wide.loc[:, lab_slope_wide.columns[x]+'_na'] = np.where(lab_slope_wide[lab_slope_wide.columns[x]].isna(), 1, 0)

In [333]:
row_ID(lab_slope_wide)

(6336, 6336)

#### Merge

In [334]:
lab_wide = pd.merge(lab_diag_wide, lab_summary_wide, on = 'PatientID')

In [335]:
lab_wide = pd.merge(lab_wide, lab_slope_wide, on = 'PatientID')

In [336]:
row_ID(lab_wide)

(6336, 6336)
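
Because each input table is expected to hold exactly one row per patient, `pd.merge` can be asked to enforce that. The sketch below uses made-up tables and the `validate` argument; it is an optional safeguard, not part of the original workflow.

```python
import pandas as pd

# Hypothetical one-row-per-patient tables.
left = pd.DataFrame({'PatientID': ['A', 'B', 'C'], 'albumin_diag': [38.0, None, 41.5]})
right = pd.DataFrame({'PatientID': ['A', 'B', 'C'], 'albumin_avg': [37.0, None, 40.0]})

# validate='one_to_one' raises MergeError if either side repeats a PatientID.
merged = pd.merge(left, right, on='PatientID', validate='one_to_one')

# A duplicated key on the right side is caught rather than silently adding rows.
right_dup = pd.concat([right, right.iloc[[0]]])
try:
    pd.merge(left, right_dup, on='PatientID', validate='one_to_one')
    duplicate_caught = False
except pd.errors.MergeError:
    duplicate_caught = True
```

This complements the `row_ID` checks, which confirm the row and patient counts after the merge has already run.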

In [337]:
list(lab_wide.columns)

['PatientID',
 'albumin_diag',
 'alp_diag',
 'alt_diag',
 'ast_diag',
 'bicarb_diag',
 'bun_diag',
 'calcium_diag',
 'chloride_diag',
 'creatinine_diag',
 'hemoglobin_diag',
 'neutrophil_count_diag',
 'platelet_diag',
 'potassium_diag',
 'sodium_diag',
 'total_bilirubin_diag',
 'wbc_diag',
 'albumin_diag_na',
 'alp_diag_na',
 'alt_diag_na',
 'ast_diag_na',
 'bicarb_diag_na',
 'bun_diag_na',
 'calcium_diag_na',
 'chloride_diag_na',
 'creatinine_diag_na',
 'hemoglobin_diag_na',
 'neutrophil_count_diag_na',
 'platelet_diag_na',
 'potassium_diag_na',
 'sodium_diag_na',
 'total_bilirubin_diag_na',
 'wbc_diag_na',
 'albumin_avg',
 'alp_avg',
 'alt_avg',
 'ast_avg',
 'bicarb_avg',
 'bun_avg',
 'calcium_avg',
 'chloride_avg',
 'creatinine_avg',
 'hemoglobin_avg',
 'neutrophil_count_avg',
 'platelet_avg',
 'potassium_avg',
 'sodium_avg',
 'total_bilirubin_avg',
 'wbc_avg',
 'albumin_max',
 'alp_max',
 'alt_max',
 'ast_max',
 'bicarb_max',
 'bun_max',
 'calcium_max',
 'chloride_max',
 'creatinin

In [338]:
%whos DataFrame

Variable                 Type         Data/Info
-----------------------------------------------
biomarker_notpdl1_wide   DataFrame              PatientID      <...>\n[6336 rows x 6 columns]
biomarker_wide           DataFrame              PatientID      <...>\n[6336 rows x 7 columns]
demographics             DataFrame              PatientID gende<...>\n[6336 rows x 7 columns]
ecog_diagnosis_wide      DataFrame               PatientID ecog<...>\n[6336 rows x 2 columns]
enhanced_met             DataFrame               PatientID diag<...>\n[6336 rows x 6 columns]
insurance_wide           DataFrame              PatientID  medi<...>\n[6336 rows x 9 columns]
lab                      DataFrame                 PatientID   <...>110125 rows x 18 columns]
lab_avg_wide             DataFrame              PatientID  albu<...>n[4533 rows x 17 columns]
lab_core                 DataFrame                 PatientID Re<...>810279 rows x 11 columns]
lab_core_win             DataFrame                 Patient

In [339]:
# Keep biomarker_wide, demographics, ecog_diagnosis_wide, enhanced_met, insurance_wide, lab_wide, med_admin_wide, 
# mortality, and weight_wide
del lab
del lab_avg_wide
del lab_core
del lab_core_win
del lab_core_win_summ
del lab_diag_wide
del lab_max_wide
del lab_min_wide
del lab_slope_wide
del lab_std_wide
del lab_summary_wide

### 10. Diagnosis

In [340]:
diagnosis = pd.read_csv('Diagnosis.csv')

In [341]:
diagnosis = diagnosis[diagnosis['PatientID'].isin(test_IDs)]

In [342]:
row_ID(diagnosis)

(322910, 6336)

#### Elixhauser

In [343]:
diagnosis = pd.merge(diagnosis, enhanced_met[['PatientID', 'met_date']], on = 'PatientID', how = 'left')

In [344]:
diagnosis.loc[:, 'DiagnosisDate'] = pd.to_datetime(diagnosis['DiagnosisDate'])

In [345]:
diagnosis.loc[:, 'diagnosis_date_diff'] = (diagnosis['DiagnosisDate'] - diagnosis['met_date']).dt.days

In [346]:
# Remove decimal to make mapping to Elixhauser easier. 
diagnosis.loc[:, 'diagnosis_code'] = diagnosis['DiagnosisCode'].replace(r'\.', '', regex = True)

##### Elixhauser for ICD-9

In [347]:
# ICD-9 dataframe with unique codes for each patient. 
diagnosis_elix_9 = (
    diagnosis
    .query('diagnosis_date_diff <= 30')
    .query('DiagnosisCodeSystem == "ICD-9-CM"')
    .drop_duplicates(subset = (['PatientID', 'DiagnosisCode']), keep = 'first')
    .filter(items = ['PatientID', 'DiagnosisCode', 'diagnosis_code'])
)

In [348]:
row_ID(diagnosis_elix_9)

(22691, 3510)

In [349]:
diagnosis_elix_9.loc[:, 'chf'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('39891|'
                                                          '402(01|11|91)|'
                                                          '404(01|03|[19][13])|'
                                                          '42(5[456789]|8)'), 1, 0)
)
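
`Series.str.match` anchors the pattern at the start of each string, so every Elixhauser pattern in this section acts as a prefix test on the decimal-free code. The sketch below shows the same behavior with the standard `re` module; the helper `is_chf` and the test strings are illustrative only.

```python
import re

# The congestive heart failure pattern from the cell above.
CHF_ICD9 = re.compile('39891|402(01|11|91)|404(01|03|[19][13])|42(5[456789]|8)')

def is_chf(diagnosis_code):
    """True when the decimal-free ICD-9 code starts with a CHF pattern."""
    return CHF_ICD9.match(diagnosis_code) is not None

# '4280' and '4254' are 428.0 and 425.4 with the decimal removed; '4019' is
# 401.9 (hypertension); '14280' is a made-up string that contains '428'
# but does not start with it.
examples = {code: is_chf(code) for code in ['4280', '4254', '4019', '14280']}
```

Only the first two strings are flagged: a match must begin at the first character, and any trailing digits after the matched prefix are ignored.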

In [350]:
diagnosis_elix_9.loc[:, 'cardiac_arrhythmias'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('426([079]|1[023])|'
                                                          '427[012346789]|'
                                                          '7850|'
                                                          '996(01|04)|'
                                                          'V450|'
                                                          'V533'), 1, 0)
)

In [351]:
diagnosis_elix_9.loc[:, 'valvular_disease'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('0932|'
                                                          '39[4567]|'
                                                          '424|'
                                                          '746[3456]|'
                                                          'V422|'
                                                          'V433'), 1, 0)
)

In [352]:
diagnosis_elix_9.loc[:, 'pulmonary_circulation'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('41(5[01]|6|7[089])'), 1, 0)
)

In [353]:
diagnosis_elix_9.loc[:, 'peripheral_vascular'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('0930|'
                                                          '4373|'
                                                          '44([01]|3[123456789]|71)|'
                                                          '557[19]|'
                                                          'V434'), 1, 0)
)

In [354]:
diagnosis_elix_9.loc[:, 'htn_uncomplicated'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('401'), 1, 0)
)

In [355]:
diagnosis_elix_9.loc[:, 'htn_complicated'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('40[2345]'), 1, 0)
)

In [356]:
diagnosis_elix_9.loc[:, 'paralysis'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('3341|'
                                                          '34([23]|4[01234569])'), 1, 0)
)

In [357]:
diagnosis_elix_9.loc[:, 'other_neuro_disorders'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('33(19|2[01]|3([45]|92)|[45]|62)|'
                                                          '34([015]|8[13])|'
                                                          '78[04]3'), 1, 0)
)

In [358]:
diagnosis_elix_9.loc[:, 'chronic_pulmonary'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('416[89]|'
                                                          '49|'
                                                          '50([012345]|64|8[18])'), 1, 0)
)

In [359]:
diagnosis_elix_9.loc[:, 'diabetes_uncomplicated'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('250[0123]'), 1, 0)
)

In [360]:
diagnosis_elix_9.loc[:, 'diabetes_complicated'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('250[456789]'), 1, 0)
)

In [361]:
diagnosis_elix_9.loc[:, 'hypothyroidism'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('2409|'
                                                          '24([34]|6[18])'), 1, 0)
)

In [362]:
diagnosis_elix_9.loc[:, 'renal_failure'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('403[019]1|'
                                                          '404[019][23]|'
                                                          '58([56]|80)|'
                                                          'V4(20|51)|'
                                                          'V56'), 1, 0)
)

In [363]:
diagnosis_elix_9.loc[:, 'liver_disease'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('070(2[23]|3[23]|44|54|6|9)|'
                                                          '456[012]|'
                                                          '57([01]|2[2345678]|3[3489])|'
                                                          'V427'), 1, 0)
)

In [364]:
diagnosis_elix_9.loc[:, 'peptic_ulcer_disease'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('53[1234][79]'), 1, 0)
)

In [365]:
diagnosis_elix_9.loc[:, 'aids_hiv'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('04[234]'), 1, 0)
)

In [366]:
diagnosis_elix_9.loc[:, 'lymphoma'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('20([012]|30)|'
                                                          '2386'), 1, 0)
)

In [367]:
diagnosis_elix_9.loc[:, 'metastatic_cancer'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('19[6789]'), 1, 0)
)

In [368]:
diagnosis_elix_9.loc[:, 'solid_tumor_wout_mets'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('1[456]|'
                                                          '17[012456789]|'
                                                          '18|'
                                                          '19([012345])'), 1, 0)
)

In [369]:
diagnosis_elix_9.loc[:, 'rheumatoid_arthritis'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('446|'
                                                          '7010|'
                                                          '71(0[0123489]|12|4|93)|'
                                                          '72([05]|85|889|930)'), 1, 0)
)

In [370]:
diagnosis_elix_9.loc[:, 'coagulopathy'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('28(6|7[1345])'), 1, 0)
)

In [371]:
diagnosis_elix_9.loc[:, 'obesity'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('2780'), 1, 0)
)

In [372]:
diagnosis_elix_9.loc[:, 'weight_loss'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('26[0123]|'
                                                          '7832|'
                                                          '7994'), 1, 0)
)

In [373]:
diagnosis_elix_9.loc[:, 'fluid_electrolyte'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('2(536|76)'), 1, 0)
)

In [374]:
diagnosis_elix_9.loc[:, 'blood_loss_anemia'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('2800'), 1, 0)
)

In [375]:
diagnosis_elix_9.loc[:, 'deficiency_anemia'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('28(0[123456789]|1)'), 1, 0)
)

In [376]:
diagnosis_elix_9.loc[:, 'alcohol_abuse'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('2652|'
                                                          '291[12356789]|'
                                                          '30(3[09]|50)|'
                                                          '3575|'
                                                          '4255|'
                                                          '5353|'
                                                          '571[0123]|'
                                                          '980|'
                                                          'V113'), 1, 0)
)

In [377]:
diagnosis_elix_9.loc[:, 'drug_abuse'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('292|'
                                                          '30(4|5[23456789])|'
                                                          'V6542'), 1, 0)
)

In [378]:
diagnosis_elix_9.loc[:, 'psychoses'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('2938|'
                                                          '296[0145]4|'
                                                          '29[578]'), 1, 0)
)

In [379]:
diagnosis_elix_9.loc[:, 'depression'] = (
    np.where(diagnosis_elix_9['diagnosis_code'].str.match('296[235]|'
                                                          '3(004|09|11)'), 1, 0)
)

In [380]:
# Create variable that captures ICD-9 codes not included in Elixhauser. 
diagnosis_elix_9.loc[:, 'elixhauser_other'] = (
    np.where(diagnosis_elix_9.iloc[:, 3:].eq(0).all(axis = 1), 1, 0)
)

In [381]:
# Single-row-per-patient dataframe with columns as Elixhauser comorbidities. 
diagnosis_elix_9_wide = (
    diagnosis_elix_9
    .drop(columns = ['DiagnosisCode', 'diagnosis_code'])
    .groupby('PatientID').sum()
    .reset_index()
)

In [382]:
row_ID(diagnosis_elix_9_wide)

(3510, 3510)
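
**The `elixhauser_other` flag and the collapse to one row per patient can be sketched on a toy table. This version selects the flag columns by name rather than by position, which is an assumption made for illustration and not how the cells above are written; all values are made up.**

```python
import pandas as pd

# Hypothetical long-format table: one row per patient and diagnosis code.
toy = pd.DataFrame({
    'PatientID': ['A', 'A', 'B'],
    'DiagnosisCode': ['428.0', '724.2', '724.2'],
    'diagnosis_code': ['4280', '7242', '7242'],
    'chf': [1, 0, 0],
    'depression': [0, 0, 0],
})

# Every column that is not an identifier is treated as a comorbidity flag.
id_cols = ['PatientID', 'DiagnosisCode', 'diagnosis_code']
flag_cols = toy.columns.difference(id_cols)

# A code counts as "other" when none of the comorbidity flags fired for it.
toy['elixhauser_other'] = toy[flag_cols].eq(0).all(axis=1).astype(int)

# Summing per patient gives one row per PatientID with a count per comorbidity.
toy_wide = (
    toy
    .drop(columns=['DiagnosisCode', 'diagnosis_code'])
    .groupby('PatientID').sum()
    .reset_index()
)
print(toy_wide)
```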

##### Elixhauser for ICD-10

In [383]:
# ICD-10 dataframe with unique codes for each patient.  
diagnosis_elix_10 = (
    diagnosis
    .query('diagnosis_date_diff <= 30')
    .query('DiagnosisCodeSystem == "ICD-10-CM"')
    .drop_duplicates(subset = (['PatientID', 'DiagnosisCode']), keep = 'first')
    .filter(items = ['PatientID', 'DiagnosisCode', 'diagnosis_code'])
)

In [384]:
row_ID(diagnosis_elix_10)

(34495, 3567)

In [385]:
diagnosis_elix_10.loc[:, 'chf'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('I099|'
                                                           'I1(10|3[02])|'
                                                           'I255|'
                                                           'I4(2[056789]|3)|'
                                                           'I50|'
                                                           'P290'), 1, 0)
)

In [386]:
diagnosis_elix_10.loc[:, 'cardiac_arrhythmias'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('I4(4[123]|5[69]|[789])|'
                                                           'R00[018]|'
                                                           'T821|'
                                                           'Z[49]50'), 1, 0)
)

In [387]:
diagnosis_elix_10.loc[:, 'valvular_disease'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('A520|'
                                                           'I0([5678]|9[18])|'
                                                           'I3[456789]|'
                                                           'Q23[0123]|'
                                                           'Z95[234]'), 1, 0)
)

In [388]:
diagnosis_elix_10.loc[:, 'pulmonary_circulation'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('I2([67]|8[089])'), 1, 0)
)

In [389]:
diagnosis_elix_10.loc[:, 'peripheral_vascular'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('I7([01]|3[189]|71|9[02])|'
                                                           'K55[189]|'
                                                           'Z95[89]'), 1, 0)
)

In [390]:
diagnosis_elix_10.loc[:, 'htn_uncomplicated'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('I10'), 1, 0)
)

In [391]:
diagnosis_elix_10.loc[:, 'htn_complicated'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('I1[1235]'), 1, 0)
)

In [392]:
diagnosis_elix_10.loc[:, 'paralysis'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('G041|'
                                                           'G114|'
                                                           'G8(0[12]|[12]|3[012349])'), 1, 0)
)

In [393]:
diagnosis_elix_10.loc[:, 'other_neuro_disorders'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('G1[0123]|'
                                                           'G2([012]|5[45])|'
                                                           'G3(1[289]|[2567])|'
                                                           'G4[01]|'
                                                           'G93[14]|'
                                                           'R470|'
                                                           'R56'), 1, 0)
)

In [394]:
diagnosis_elix_10.loc[:, 'chronic_pulmonary'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('I27[89]|'
                                                           'J4[01234567]|'
                                                           'J6([01234567]|84)|'
                                                           'J70[13]'), 1, 0)
)

In [395]:
diagnosis_elix_10.loc[:, 'diabetes_uncomplicated'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('E1[01234][019]'), 1, 0)
)

In [396]:
diagnosis_elix_10.loc[:, 'diabetes_complicated'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('E1[01234][2345678]'), 1, 0)
)

In [397]:
diagnosis_elix_10.loc[:, 'hypothyroidism'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('E0[0123]|'
                                                           'E890'), 1, 0)
)

In [398]:
diagnosis_elix_10.loc[:, 'renal_failure'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('I1(20|31)|'
                                                           'N1[89]|'
                                                           'N250|'
                                                           'Z49[012]|'
                                                           'Z9(40|92)'), 1, 0)
)

In [399]:
diagnosis_elix_10.loc[:, 'liver_disease'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('B18|'
                                                           'I8(5|64)|'
                                                           'I982|'
                                                           'K7(0|1[13457]|[234]|6[023456789])|'
                                                           'Z944'), 1, 0)
)

In [400]:
diagnosis_elix_10.loc[:, 'peptic_ulcer_disease'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('K2[5678][79]'), 1, 0)
)

In [401]:
diagnosis_elix_10.loc[:, 'aids_hiv'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('B2[0124]'), 1, 0)
)

In [402]:
diagnosis_elix_10.loc[:, 'lymphoma'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('C8[123458]|'
                                                           'C9(0[02]|6)'), 1, 0)
)

In [403]:
diagnosis_elix_10.loc[:, 'metastatic_cancer'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('C(7[789]|80)'), 1, 0)
)

In [404]:
diagnosis_elix_10.loc[:, 'solid_tumor_wout_mets'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('C[01]|'
                                                           'C2[0123456]|'
                                                           'C3[01234789]|'
                                                           'C4[01356789]|'
                                                           'C5[012345678]|'
                                                           'C6|'
                                                           'C7[0123456]|'
                                                           'C97'), 1, 0)
)

In [405]:
diagnosis_elix_10.loc[:, 'rheumatoid_arthritis'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('L94[013]|'
                                                           'M0[568]|'
                                                           'M12[03]|'
                                                           'M3(0|1[0123]|[2345])|'
                                                           'M4(5|6[189])'), 1, 0)
)

In [406]:
diagnosis_elix_10.loc[:, 'coagulopathy'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('D6([5678]|9[13456])'), 1, 0)
)

In [407]:
diagnosis_elix_10.loc[:, 'obesity'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('E66'), 1, 0)
)

In [408]:
diagnosis_elix_10.loc[:, 'weight_loss'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('E4[0123456]|'
                                                           'R6(34|4)'), 1, 0)
)

In [409]:
diagnosis_elix_10.loc[:, 'fluid_electrolyte'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('E222|'
                                                           'E8[67]'), 1, 0)
)

In [410]:
diagnosis_elix_10.loc[:, 'blood_loss_anemia'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('D500'), 1, 0)
)

In [411]:
diagnosis_elix_10.loc[:, 'deficiency_anemia'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('D5(0[89]|[123])'), 1, 0)
)

In [412]:
diagnosis_elix_10.loc[:, 'alcohol_abuse'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('F10|'
                                                           'E52|'
                                                           'G621|'
                                                           'I426|'
                                                           'K292|'
                                                           'K70[039]|'
                                                           'T51|'
                                                           'Z502|'
                                                           'Z7(14|21)'), 1, 0)
)

In [413]:
diagnosis_elix_10.loc[:, 'drug_abuse'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('F1[12345689]|'
                                                           'Z7(15|22)'), 1, 0)
)

In [414]:
diagnosis_elix_10.loc[:, 'psychoses'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('F2[0234589]|'
                                                           'F3([01]2|15)'), 1, 0)
)

In [415]:
diagnosis_elix_10.loc[:, 'depression'] = (
    np.where(diagnosis_elix_10['diagnosis_code'].str.match('F204|'
                                                           'F3(1[345]|[23]|41)|'
                                                           'F4[13]2'), 1, 0)
)

In [416]:
# Create variable that captures ICD-10 codes not included in Elixhauser. 
diagnosis_elix_10.loc[:, 'elixhauser_other'] = (
    np.where(diagnosis_elix_10.iloc[:, 3:].eq(0).all(1), 1, 0)
)

In [417]:
diagnosis_elix_10_wide = (
    diagnosis_elix_10
    .drop(columns = ['DiagnosisCode', 'diagnosis_code'])
    .groupby('PatientID').sum()
    .reset_index()
)

In [418]:
row_ID(diagnosis_elix_10_wide)

(3567, 3567)

In [419]:
# Merge Elixhauser 9 and 10 and sum by PatientID.
diagnosis_elixhauser = (
    pd.concat([diagnosis_elix_9_wide, diagnosis_elix_10_wide])
    .groupby('PatientID').sum()
)

In [420]:
# Create unique ICD count for each patient.
diagnosis_elixhauser['icd_count'] = diagnosis_elixhauser.sum(axis = 1)  

In [421]:
# Other than unique ICD count, values greater than 1 are set to 1; 0 remains unchanged. 
diagnosis_elixhauser.iloc[:, :-1] = (
    diagnosis_elixhauser.iloc[:, :-1].mask(diagnosis_elixhauser.iloc[:, :-1] >1, 1)
)

In [422]:
diagnosis_elixhauser = diagnosis_elixhauser.reset_index()

In [423]:
row_ID(diagnosis_elixhauser)

(5535, 5535)
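
**The combine, count, and binarize steps above can be sketched on two small tables. The frames `wide_9` and `wide_10` and their values are hypothetical, and `clip(upper=1)` is used here as a stand-in for the `mask(... > 1, 1)` call in the notebook.**

```python
import pandas as pd

# Hypothetical per-patient counts from the ICD-9 and ICD-10 tables.
wide_9 = pd.DataFrame({'PatientID': ['A', 'B'], 'chf': [1, 0], 'obesity': [0, 2]})
wide_10 = pd.DataFrame({'PatientID': ['B', 'C'], 'chf': [1, 0], 'obesity': [1, 0]})

# Stack both tables and add up the counts for patients present in both.
combined = pd.concat([wide_9, wide_10]).groupby('PatientID').sum()

# Total number of flagged codes per patient, computed before binarizing.
combined['icd_count'] = combined.sum(axis=1)

# Cap every comorbidity column at 1 while leaving icd_count untouched.
flag_cols = combined.columns.drop('icd_count')
combined[flag_cols] = combined[flag_cols].clip(upper=1)
combined = combined.reset_index()
print(combined)
```

**Note that a code matching more than one comorbidity pattern contributes more than once to `icd_count`, so the count is an approximation of the number of unique codes.**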

In [424]:
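# Append test IDs with no qualifying diagnosis codes; their flags are filled with 0.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
diagnosis_elixhauser = (
    pd.concat(
        [diagnosis_elixhauser,
         pd.Series(test_IDs)[~pd.Series(test_IDs).isin(diagnosis_elixhauser['PatientID'])].to_frame(name = 'PatientID')],
        sort = False)
    .fillna(0)
)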

In [425]:
row_ID(diagnosis_elixhauser)

(6336, 6336)

In [426]:
diagnosis_elixhauser.sample(5)

Unnamed: 0,PatientID,chf,cardiac_arrhythmias,valvular_disease,pulmonary_circulation,peripheral_vascular,htn_uncomplicated,htn_complicated,paralysis,other_neuro_disorders,...,weight_loss,fluid_electrolyte,blood_loss_anemia,deficiency_anemia,alcohol_abuse,drug_abuse,psychoses,depression,elixhauser_other,icd_count
533,F178942359405,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4.0
2632,F7765C7CF71D6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2651,F78CE7D05ACBF,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,5.0
2145,F5F95929A868F,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0
4384,FC9C76440D698,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,5.0


#### Other cancer 

##### ICD-9 Cancer codes 

In [427]:
# Select all ICD-9 cancer codes between 140-209.
# Exclude benign neoplasms: 210-229, carcinoma in situ: 230-234, and neoplasms of uncertain behavior or nature: 235-239.
cancer_9 = (
    diagnosis_elix_9[diagnosis_elix_9['DiagnosisCode'].str.startswith(
        ('14','15', '16', '17', '18', '19', '20'))]
    .filter(items = ['PatientID', 'DiagnosisCode', 'diagnosis_code'])
)

In [428]:
row_ID(cancer_9)

(5165, 3196)

**Remove the following ICD-9 codes representing breast cancer, metastasis, ill-defined neoplasms, and non-melanoma skin cancers (e.g., BCC and SCC):**
* **174 - Malignant neoplasm of female breast**
* **175 - Malignant neoplasm of male breast**
* **173 - Other and unspecified malignant neoplasm of skin**
* **196 - Secondary and unspecified malignant neoplasm of lymph nodes**
* **197 - Secondary malignant neoplasm of respiratory and digestive systems**
* **198 - Secondary malignant neoplasm of other specified sites** 
* **199 - Malignant neoplasm without specification of site**

In [429]:
# Dataframe of ICD-9 neoplasm codes excluding breast cancer, metastasis, and non-melanoma skin cancer.
other_cancer_9 = (
    cancer_9[~cancer_9['diagnosis_code'].str.match('17([345])|'
                                                   '19([6789])')]
    .copy()  # explicit copy so the assignment below does not write to a slice of cancer_9
)

In [430]:
other_cancer_9.loc[:,'other_cancer_9'] = 1

In [431]:
other_cancer_9 = (
    other_cancer_9
    .drop_duplicates(subset = 'PatientID', keep = 'first')
    .filter(items = ['PatientID', 'other_cancer_9'])
)

In [432]:
row_ID(other_cancer_9)

(207, 207)

In [433]:
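# Append missing test IDs; pd.concat is used because DataFrame.append was removed in pandas 2.0.
other_cancer_9 = (
    pd.concat(
        [other_cancer_9,
         pd.Series(test_IDs)[~pd.Series(test_IDs).isin(other_cancer_9['PatientID'])].to_frame(name = 'PatientID')],
        sort = False)
    .fillna(0)
)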

In [434]:
row_ID(other_cancer_9)

(6336, 6336)

##### ICD-10 Cancer codes

In [435]:
# Select ICD-10 neoplasm codes by prefix: all C codes (malignant) and D38-D49.
# Not selected: in situ neoplasms (D00-D09), benign neoplasms (D10-D36), benign neuroendocrine tumors (D3A), and D37.
# D47.2, D48, and D49 are removed in the next step.
cancer_10 = (
    diagnosis_elix_10[diagnosis_elix_10['DiagnosisCode'].str.startswith(
        ('C0', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'D38', 'D39', 'D4'))]
    .filter(items = ['PatientID', 'DiagnosisCode', 'diagnosis_code'])
)

In [436]:
row_ID(cancer_10)

(7052, 3202)

**Remove the following ICD-10 codes, which capture breast cancer, metastasis, non-melanoma skin cancers (e.g., BCC and SCC), and neoplasms of uncertain or unspecified behavior:**
* **C50 - Malignant neoplasm of breast** 
* **C44 - Other and unspecified malignant neoplasm of skin**
* **C77 - Secondary and unspecified malignant neoplasm of lymph nodes**
* **C78 - Secondary malignant neoplasm of respiratory and digestive organs**
* **C79 - Secondary malignant neoplasm of other and unspecified sites**
* **C80 - Malignant neoplasm without specification of site**
* **D47.2 - Monoclonal gammopathy**
* **D48 - Neoplasm of uncertain behavior of other and unspecified sites**
* **D49 - Neoplasms of unspecified behavior** 
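
**As a rough check of the exclusion list above, the pattern can be applied to a handful of made-up codes. The sketch assumes `diagnosis_code` stores ICD-10 codes without the decimal point (so D47.2 appears as `D472`); the series `codes` is hypothetical.**

```python
import pandas as pd

# Hypothetical ICD-10 codes with the decimal point removed (D47.2 -> 'D472').
codes = pd.Series(['C509', 'C4491', 'C7951', 'C809', 'D472', 'D471', 'D489', 'C189', 'C9000'])

# Exclusion pattern for the codes listed above; str.match anchors at the start of the string.
excluded = codes.str.match('C50|C44|C7[789]|C80|D4(72|[89])')

# Codes that survive the exclusion and would count as "other cancer".
other = codes[~excluded].tolist()
print(other)
```

**Here `D471` survives because only D47.2 is excluded within D47, while `C189` and `C9000` are malignancies outside the exclusion list.**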

In [437]:
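# Dataframe of ICD-10 neoplasm codes excluding breast cancer, metastasis, non-melanoma skin cancer,
# and neoplasms of uncertain or unspecified behavior.
other_cancer_10 = (
    cancer_10[~cancer_10['diagnosis_code'].str.match('C50|'
                                                    'C44|'
                                                    'C7[789]|'
                                                    'C80|'
                                                    'D4(72|[89])')]
    .copy()  # explicit copy so the later column assignment does not write to a slice of cancer_10
)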

In [438]:
other_cancer_10.loc[:,'other_cancer_10'] = 1

In [439]:
# Drop duplicates.
other_cancer_10 = (
    other_cancer_10
    .drop_duplicates(subset = 'PatientID', keep = 'first')
    .filter(items = ['PatientID', 'other_cancer_10'])
)

In [440]:
row_ID(other_cancer_10)

(220, 220)

In [441]:
# Append missing test IDs; pd.concat is used because DataFrame.append was removed in pandas 2.0.
other_cancer_10 = (
    pd.concat(
        [other_cancer_10,
         pd.Series(test_IDs)[~pd.Series(test_IDs).isin(other_cancer_10['PatientID'])].to_frame(name = 'PatientID')],
        sort = False)
    .fillna(0)
)

In [442]:
row_ID(other_cancer_10)

(6336, 6336)

In [443]:
other_cancer = pd.merge(other_cancer_9, other_cancer_10, on = 'PatientID')

In [444]:
# Combine other_cancer_9 and other_cancer_10; replace values equal to 2 with 1.
other_cancer = (
    other_cancer
    .assign(other_cancer = other_cancer['other_cancer_9'] + other_cancer['other_cancer_10'])
    .filter(items = ['PatientID', 'other_cancer'])
    .replace(2, 1)
)

In [445]:
row_ID(other_cancer)

(6336, 6336)
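
**An equivalent way to combine the two flags is to take the row-wise maximum, which avoids the intermediate value of 2. This is a sketch of an alternative, not the approach used above; the frames `flags_9` and `flags_10` are hypothetical.**

```python
import pandas as pd

# Hypothetical one-row-per-patient flags from each coding system.
flags_9 = pd.DataFrame({'PatientID': ['A', 'B', 'C'], 'other_cancer_9': [1.0, 0.0, 1.0]})
flags_10 = pd.DataFrame({'PatientID': ['A', 'B', 'C'], 'other_cancer_10': [1.0, 0.0, 0.0]})

# validate='one_to_one' raises an error if PatientID is duplicated in either table.
merged = pd.merge(flags_9, flags_10, on='PatientID', validate='one_to_one')

# A patient is flagged if either coding system flags them.
merged['other_cancer'] = merged[['other_cancer_9', 'other_cancer_10']].max(axis=1)
result = merged[['PatientID', 'other_cancer']]
print(result)
```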

#### Sites of metastases

In [446]:
mets = pd.read_csv('Enhanced_MetBreastSitesOfMet.csv')

In [447]:
mets = mets[mets['PatientID'].isin(test_IDs)]

In [448]:
row_ID(mets)

(15986, 6306)

In [449]:
mets = pd.merge(mets, enhanced_met[['PatientID', 'met_date']], on = 'PatientID', how = 'left')

In [450]:
mets.loc[:, 'DateOfMetastasis'] = pd.to_datetime(mets['DateOfMetastasis'])

In [451]:
mets.loc[:, 'diagnosis_met_diff'] = (mets['DateOfMetastasis'] - mets['met_date']).dt.days

In [452]:
mets = mets.query('diagnosis_met_diff <= 30')

**The median number of mets at time of metastatic diagnosis is 1. The most common site is bone followed by lung, distant lymph node, and liver. Sites of metastasis will be simplified into the following groups:**

* **1. Bone or bone marrow**
* **2. Lung or pleura**
* **3. Distant lymph node**
* **4. Liver**
* **5. Brain or CNS site**
* **6. Skin or soft tissue**
* **7. Peritoneum**
* **8. Other: other, adrenal, ovary, spleen, pancreas, kidney, or thyroid**
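
**The grouping above can also be written as a lookup table. The sketch below is one possible formulation, assuming the site labels match the values stored in `SiteOfMetastasis`; the dictionary `site_groups` and the toy rows are introduced only for illustration.**

```python
import pandas as pd

# Lookup table from raw site label to simplified group.
site_groups = {
    'Bone': 'bone_met', 'Bone marrow': 'bone_met',
    'Lung': 'thorax_met', 'Pleura': 'thorax_met',
    'Distant lymph node': 'lymph_met',
    'Liver': 'liver_met',
    'Brain': 'cns_met', 'CNS site': 'cns_met',
    'Skin': 'skin_met', 'Soft tissue': 'skin_met',
    'Peritoneum': 'peritoneum_met',
    'Other': 'other_met', 'Adrenal': 'other_met', 'Ovary': 'other_met',
    'Spleen': 'other_met', 'Pancreas': 'other_met', 'Kidney': 'other_met',
    'Thyroid': 'other_met',
}

# Hypothetical long-format rows: one row per patient and site.
toy = pd.DataFrame({
    'PatientID': ['A', 'A', 'A', 'B'],
    'SiteOfMetastasis': ['Bone', 'Bone marrow', 'Liver', 'Pleura'],
})
toy['met_loc'] = toy['SiteOfMetastasis'].map(site_groups)

# One row per patient with a 0/1 column per site group observed in the data.
toy_wide = pd.crosstab(toy['PatientID'], toy['met_loc']).clip(upper=1).reset_index()
print(toy_wide)
```

**`pd.crosstab` only creates columns for groups that actually occur, so a group with no patients would need to be added separately.**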

In [453]:
# Recode mets
conditions = [
    (mets['SiteOfMetastasis'] == 'Bone') | 
    (mets['SiteOfMetastasis'] == 'Bone marrow'),
    (mets['SiteOfMetastasis'] == 'Lung') | 
    (mets['SiteOfMetastasis'] == 'Pleura'),
    (mets['SiteOfMetastasis'] == 'Distant lymph node'),
    (mets['SiteOfMetastasis'] == 'Liver'),
    (mets['SiteOfMetastasis'] == 'Brain') | 
    (mets['SiteOfMetastasis'] == 'CNS site'),
    (mets['SiteOfMetastasis'] == 'Skin') | 
    (mets['SiteOfMetastasis'] == 'Soft tissue'),
    (mets['SiteOfMetastasis'] == 'Peritoneum'),
    (mets['SiteOfMetastasis'] == 'Other') | 
    (mets['SiteOfMetastasis'] == 'Adrenal') |
    (mets['SiteOfMetastasis'] == 'Ovary') |
    (mets['SiteOfMetastasis'] == 'Spleen') |
    (mets['SiteOfMetastasis'] == 'Pancreas') |
    (mets['SiteOfMetastasis'] == 'Kidney') |
    (mets['SiteOfMetastasis'] == 'Thyroid')]

choices = ['bone_met', 'thorax_met', 'lymph_met', 'liver_met', 'cns_met', 'skin_met', 'peritoneum_met', 'other_met']

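# Explicit string default: np.select otherwise fills unmatched rows with the integer 0,
# which does not share a dtype with the string choices.
mets.loc[:, 'met_loc'] = np.select(conditions, choices, default = 'unclassified')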

In [454]:
mets['bone_met'] = np.where(mets['met_loc'] == 'bone_met', 1, 0)

In [455]:
mets['thorax_met'] = np.where(mets['met_loc'] == 'thorax_met', 1, 0)

In [456]:
mets['lymph_met'] = np.where(mets['met_loc'] == 'lymph_met', 1, 0)

In [457]:
mets['liver_met'] = np.where(mets['met_loc'] == 'liver_met', 1, 0)

In [458]:
mets['cns_met'] = np.where(mets['met_loc'] == 'cns_met', 1, 0)

In [459]:
mets['skin_met'] = np.where(mets['met_loc'] == 'skin_met', 1, 0)

In [460]:
mets['peritoneum_met'] = np.where(mets['met_loc'] == 'peritoneum_met', 1, 0)

In [461]:
mets['other_met'] = np.where(mets['met_loc'] == 'other_met', 1, 0)

In [462]:
# Drop unnecessary columns and condense. 
mets_wide = (
    mets
    .drop(columns = ['DateOfMetastasis', 'SiteOfMetastasis', 'met_date', 'diagnosis_met_diff', 'met_loc'])
    .groupby('PatientID').sum()
)

In [463]:
# Set any value greater than 1 to 1; leave 0 unchanged. 
mets_wide = (
    mets_wide
    .mask(mets_wide > 1, 1)
    .reset_index()
)

In [464]:
row_ID(mets_wide)

(6271, 6271)

In [465]:
# Append missing test IDs; pd.concat is used because DataFrame.append was removed in pandas 2.0.
mets_wide = (
    pd.concat(
        [mets_wide,
         pd.Series(test_IDs)[~pd.Series(test_IDs).isin(mets_wide['PatientID'])].to_frame(name = 'PatientID')],
        sort = False)
)

In [466]:
row_ID(mets_wide)

(6336, 6336)

In [467]:
mets_wide = mets_wide.fillna(0)

In [468]:
mets_wide.sample(5)

Unnamed: 0,PatientID,bone_met,thorax_met,lymph_met,liver_met,cns_met,skin_met,peritoneum_met,other_met
3690,F96D60CB4F98C,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
4896,FC732E9038D0A,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1729,F431D8008DCCA,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5441,FDD5B0B2E874E,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
4097,FA7F118C1E459,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [469]:
# Proportion of test patients with each site of metastasis at time of metastatic diagnosis.
(mets_wide.iloc[:, 1:9].sum()/mets_wide.shape[0]).sort_values(ascending = False)

bone_met          0.628157
thorax_met        0.332544
lymph_met         0.274937
liver_met         0.226799
cns_met           0.079703
other_met         0.066446
skin_met          0.062658
peritoneum_met    0.024463
dtype: float64

#### Merge

In [470]:
diagnosis_wide = pd.merge(diagnosis_elixhauser, other_cancer, on = 'PatientID')

In [471]:
diagnosis_wide = pd.merge(diagnosis_wide, mets_wide, on = 'PatientID')

In [472]:
row_ID(diagnosis_wide)

(6336, 6336)

In [473]:
list(diagnosis_wide.columns)

['PatientID',
 'chf',
 'cardiac_arrhythmias',
 'valvular_disease',
 'pulmonary_circulation',
 'peripheral_vascular',
 'htn_uncomplicated',
 'htn_complicated',
 'paralysis',
 'other_neuro_disorders',
 'chronic_pulmonary',
 'diabetes_uncomplicated',
 'diabetes_complicated',
 'hypothyroidism',
 'renal_failure',
 'liver_disease',
 'peptic_ulcer_disease',
 'aids_hiv',
 'lymphoma',
 'metastatic_cancer',
 'solid_tumor_wout_mets',
 'rheumatoid_arthritis',
 'coagulopathy',
 'obesity',
 'weight_loss',
 'fluid_electrolyte',
 'blood_loss_anemia',
 'deficiency_anemia',
 'alcohol_abuse',
 'drug_abuse',
 'psychoses',
 'depression',
 'elixhauser_other',
 'icd_count',
 'other_cancer',
 'bone_met',
 'thorax_met',
 'lymph_met',
 'liver_met',
 'cns_met',
 'skin_met',
 'peritoneum_met',
 'other_met']

In [474]:
%whos DataFrame

Variable                 Type         Data/Info
-----------------------------------------------
biomarker_notpdl1_wide   DataFrame              PatientID      <...>\n[6336 rows x 6 columns]
biomarker_wide           DataFrame              PatientID      <...>\n[6336 rows x 7 columns]
cancer_10                DataFrame                PatientID Dia<...>\n[7052 rows x 3 columns]
cancer_9                 DataFrame                PatientID Dia<...>\n[5165 rows x 3 columns]
demographics             DataFrame              PatientID gende<...>\n[6336 rows x 7 columns]
diagnosis                DataFrame                PatientID    <...>[322910 rows x 9 columns]
diagnosis_elix_10        DataFrame                PatientID Dia<...>[34495 rows x 35 columns]
diagnosis_elix_10_wide   DataFrame              PatientID  chf <...>n[3567 rows x 33 columns]
diagnosis_elix_9         DataFrame                PatientID Dia<...>[22691 rows x 35 columns]
diagnosis_elix_9_wide    DataFrame              PatientID 

In [475]:
# Keep biomarker_wide, demographics, diagnosis_wide, ecog_diagnosis_wide, enhanced_met, insurance_wide, 
# lab_wide, med_admin_wide, mortality, and weight_wide
del cancer_10
del cancer_9
del diagnosis
del diagnosis_elix_10
del diagnosis_elix_10_wide
del diagnosis_elix_9
del diagnosis_elix_9_wide
del diagnosis_elixhauser
del mets
del mets_wide
del other_cancer
del other_cancer_10
del other_cancer_9

### 11. SocialDeterminantsOfHealth

In [476]:
sdoh = pd.read_csv('SocialDeterminantsOfHealth.csv')

In [477]:
sdoh = sdoh[sdoh['PatientID'].isin(test_IDs)]

In [478]:
row_ID(sdoh)

(5293, 5293)

In [479]:
conditions = [
    (sdoh['SESIndex2015_2019'] == '5 - Highest SES'),
    (sdoh['SESIndex2015_2019'] == '1 - Lowest SES')]    

choices = ['5', '1']
    
sdoh.loc[:, 'ses'] = np.select(conditions, choices, default = sdoh['SESIndex2015_2019'])

In [480]:
sdoh = sdoh.drop(columns = ['PracticeID', 'SESIndex2015_2019'])

In [481]:
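# Append missing test IDs; pd.concat is used because DataFrame.append was removed in pandas 2.0.
sdoh_wide = (
    pd.concat(
        [sdoh,
         pd.Series(test_IDs)[~pd.Series(test_IDs).isin(sdoh['PatientID'])].to_frame(name = 'PatientID')],
        sort = False)
)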

In [482]:
row_ID(sdoh_wide)

(6336, 6336)

In [483]:
sdoh_wide.ses.value_counts(dropna = False, normalize = True)

NaN    0.261048
4      0.157197
3      0.155619
2      0.143782
5      0.142992
1      0.139362
Name: ses, dtype: float64
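
**The SES recode keeps only the leading digit of the two labelled end points. A compact alternative is to extract the leading digit from every value. The sketch assumes the middle levels are stored as bare digit strings, which is inferred from the output above rather than stated in the source; `ses_raw` is hypothetical.**

```python
import pandas as pd

# Hypothetical SES index values; only the end points carry a text suffix.
ses_raw = pd.Series(['5 - Highest SES', '1 - Lowest SES', '3', None])

# Pull out the leading digit; missing values stay missing.
ses = ses_raw.str.extract(r'^(\d)', expand=False)
print(ses)
```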

In [484]:
%whos DataFrame

Variable                 Type         Data/Info
-----------------------------------------------
biomarker_notpdl1_wide   DataFrame              PatientID      <...>\n[6336 rows x 6 columns]
biomarker_wide           DataFrame              PatientID      <...>\n[6336 rows x 7 columns]
demographics             DataFrame              PatientID gende<...>\n[6336 rows x 7 columns]
diagnosis_wide           DataFrame              PatientID  chf <...>n[6336 rows x 43 columns]
ecog_diagnosis_wide      DataFrame               PatientID ecog<...>\n[6336 rows x 2 columns]
enhanced_met             DataFrame               PatientID diag<...>\n[6336 rows x 6 columns]
insurance_wide           DataFrame              PatientID  medi<...>\n[6336 rows x 9 columns]
lab_wide                 DataFrame              PatientID  albu<...>[6336 rows x 129 columns]
med_admin_wide           DataFrame              PatientID  ster<...>n[6336 rows x 14 columns]
mets_max                 DataFrame              PatientID 

In [485]:
# Keep biomarker_wide, demographics, diagnosis_wide, ecog_diagnosis_wide, enhanced_met, insurance_wide,
# lab_wide, med_admin_wide, mortality, sdoh_wide, and weight_wide
del sdoh

## Part 2: File merge

In [486]:
enhanced_met = enhanced_met.drop(columns = ['diagnosis_date', 'met_date'])

In [487]:
test_full = pd.merge(demographics, enhanced_met, on = 'PatientID')

In [488]:
test_full = pd.merge(test_full, mortality, on = 'PatientID')

In [489]:
test_full = pd.merge(test_full, med_admin_wide, on = 'PatientID')

In [490]:
test_full = pd.merge(test_full, biomarker_wide, on = 'PatientID')

In [491]:
test_full = pd.merge(test_full, insurance_wide, on = 'PatientID')

In [492]:
test_full = pd.merge(test_full, ecog_diagnosis_wide, on = 'PatientID')

In [493]:
test_full = pd.merge(test_full, weight_wide, on = 'PatientID')

In [494]:
test_full = pd.merge(test_full, lab_wide, on = 'PatientID')

In [495]:
test_full = pd.merge(test_full, diagnosis_wide, on = 'PatientID')

In [496]:
test_full = pd.merge(test_full, sdoh_wide, on = 'PatientID')

In [497]:
row_ID(test_full)

(6336, 6336)
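
**The chain of merges above can also be expressed as a single fold over a list of tables, with pandas checking that `PatientID` is unique on both sides of each merge. A minimal sketch with hypothetical stand-in tables (the real inputs are the processed dataframes merged above):**

```python
from functools import reduce

import pandas as pd

# Hypothetical stand-ins for the processed one-row-per-patient tables.
frames = [
    pd.DataFrame({'PatientID': ['A', 'B'], 'age': [61, 54]}),
    pd.DataFrame({'PatientID': ['A', 'B'], 'ecog_diagnosis': ['0', '1']}),
    pd.DataFrame({'PatientID': ['B', 'A'], 'bone_met': [1.0, 0.0]}),
]

# validate='one_to_one' raises MergeError if PatientID repeats in either table.
full = reduce(
    lambda left, right: pd.merge(left, right, on='PatientID', validate='one_to_one'),
    frames,
)
print(full)
```

**Because the default merge is an inner join, a patient missing from any table would be dropped silently, which is why the row count is checked against the size of the test cohort.**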

In [498]:
len(test_full.columns)

217

In [499]:
list(test_full.columns)

['PatientID',
 'gender',
 'race',
 'ethnicity',
 'age',
 'p_type',
 'region',
 'stage',
 'met_year',
 'delta_met_diagnosis',
 'death_status',
 'timerisk_activity',
 'steroid_diag',
 'opioid_PO_diag',
 'nonopioid_PO_diag',
 'pain_IV_diag',
 'ac_diag',
 'antiinfective_IV_diag',
 'antiinfective_diag',
 'antihyperglycemic_diag',
 'ppi_diag',
 'antidepressant_diag',
 'bta_diag',
 'thyroid_diag',
 'is_diag',
 'ER',
 'HER2',
 'PR',
 'BRCA',
 'PIK3CA',
 'pdl1_n',
 'medicare',
 'medicaid',
 'medicare_medicaid',
 'commercial',
 'patient_assistance',
 'other_govt',
 'self_pay',
 'other',
 'ecog_diagnosis',
 'weight_diag',
 'bmi_diag',
 'bmi_diag_na',
 'weight_pct_change',
 'weight_pct_na',
 'weight_slope',
 'albumin_diag',
 'alp_diag',
 'alt_diag',
 'ast_diag',
 'bicarb_diag',
 'bun_diag',
 'calcium_diag',
 'chloride_diag',
 'creatinine_diag',
 'hemoglobin_diag',
 'neutrophil_count_diag',
 'platelet_diag',
 'potassium_diag',
 'sodium_diag',
 'total_bilirubin_diag',
 'wbc_diag',
 'albumin_diag_na'

In [500]:
test_full.to_csv('test_full.csv', index = False, header = True)