<a href="https://colab.research.google.com/github/wenjunsun/personal-machine-learning-projects/blob/master/cancer-fracture/task1/prepare_data_for_ML_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook prepares the fracture dataset. The previous data-preparation notebook needs a few changes, so we describe and implement the revised preparation process here.

Things we will do differently here:

1. Under the previous processing scheme only 14% of patients had a fracture. That is too few positive cases, so we include more fracture data.

2. Based on the age distribution, we bin age into 3 categories: <=40, >40 to <70, and >=70, encoded as 1, 2, and 3 respectively.

3. Use the new Stages data, which has fewer null values.

# 0. Go to the directory containing data + import packages

In [1]:
import pandas as pd
import numpy as np

In [2]:
cd drive/My\ Drive/fracture_with_emissa/Datasets/Raw\ Data

/content/drive/My Drive/fracture_with_emissa/Datasets/Raw Data


In [3]:
ls

BillingCodes.csv           Labs_encoded.csv
BonyLesions.csv            Lesions_encoded.csv
data_agg_2.csv             Medications.csv
data_agg.csv               medicines_90_days_before_STC.csv
data_for_ML_back_fill.csv  MyelomaTherapy.csv
Demographics.csv           PlasmaCells.csv
Demographics_encoded.csv   RadiationTherapy.csv
Diagnoses.csv              Signs.csv
Final_stage.csv            Stage.csv
Fractures.csv              Stages_encoded.csv
Labs2.csv                  STC_Days.csv
Labs_closest_to_SCT.csv    SurvivalDays.csv
Labs.csv                   Symptoms.csv


# 1. include more fractures.

In [None]:
STC_data = pd.read_csv('STC_Days.csv')

In [None]:
STC_data

Unnamed: 0,ID,STC_Day
0,MM1,343.0
1,MM2,229.0
2,MM4,425.0
3,MM5,416.0
4,MM6,657.0
...,...,...
697,MM834,204.0
698,MM843,841.0
699,MM836,786.0
700,MM837,203.0


In [None]:
data = pd.read_csv('BillingCodes.csv')

In [None]:
data.head()

Unnamed: 0,ID,DaysFromDx,DxCodingMethod,DxCode,DxDescription,CCSLevel1Name,CCSLevel2Name,CCSLevel3Name
0,MM1,354,ICD9CM,V10.87,Personal history of malignant neoplasm of thyroid,Neoplasms,Cancer; other primary,Cancer of thyroid [36.]
1,MM1,373,ICD9CM,V10.87,Personal history of malignant neoplasm of thyroid,Neoplasms,Cancer; other primary,Cancer of thyroid [36.]
2,MM1,355,ICD9CM,V10.87,Personal history of malignant neoplasm of thyroid,Neoplasms,Cancer; other primary,Cancer of thyroid [36.]
3,MM1,741,ICD9CM,V12.29,"Personal history of other endocrine, metabolic...",Endocrine; nutritional; and metabolic diseases...,Other nutritional; endocrine; and metabolic di...,Other and unspecified metabolic; nutritional; ...
4,MM1,318,ICD9CM,V14.0,Personal history of allergy to penicillin,Symptoms; signs; and ill-defined conditions an...,Symptoms; signs; and ill-defined conditions,Allergic reactions [253.]


In [None]:
# We expect the word 'fracture' to appear in at least one of
# DxDescription, CCSLevel1Name, CCSLevel2Name, CCSLevel3Name.
# Let's get the rows where any of these 4 columns
# contains the word 'fracture'.

# given a row, return True if any of the 4 columns contains the word
# 'fracture' (case-insensitive); non-string cells such as NaN are skipped
def doesRowContainFracture(row):
  columns = ['DxDescription', 'CCSLevel1Name', 'CCSLevel2Name', 'CCSLevel3Name']
  return any(isinstance(row[col], str) and 'fracture' in row[col].lower() for col in columns)

In [None]:
# get rows with word fracture in them.
data_fracture = data[data.apply(doesRowContainFracture, axis = 1)]

In [None]:
data_fracture.head()

Unnamed: 0,ID,DaysFromDx,DxCodingMethod,DxCode,DxDescription,CCSLevel1Name,CCSLevel2Name,CCSLevel3Name
469,MM2,-5558,ICD9CM,807.00,"Closed fracture of rib(s), unspecified",Injury and poisoning,Fractures,Other fractures [231.]
656,MM3,101,ICD10CM,M84.48XD,"Pathological fracture, other site, subsequent ...",Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
657,MM3,194,ICD10CM,M84.48XD,"Pathological fracture, other site, subsequent ...",Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
658,MM3,0,ICD10CM,M84.58XA,"Pathological fracture in neoplastic disease, o...",Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
740,MM4,433,ICD10CM,Z87.81,Personal history of (healed) traumatic fracture,Injury and poisoning,Other injuries and conditions due to external ...,No Value


In [None]:
data_fracture.shape

(2078, 8)
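
The row-wise filter above can also be written with pandas' vectorized string methods, which avoids calling a Python function once per row. The following is a minimal sketch on toy rows rather than the real `BillingCodes.csv`; the helper name `rows_containing` and the toy frame are hypothetical, and it assumes the four text columns hold only strings or missing values.

```python
import pandas as pd

# columns searched for the keyword, as in doesRowContainFracture
TEXT_COLUMNS = ['DxDescription', 'CCSLevel1Name', 'CCSLevel2Name', 'CCSLevel3Name']

def rows_containing(df, keyword, columns=TEXT_COLUMNS):
    """Boolean mask: True where any listed column contains keyword (case-insensitive)."""
    mask = pd.Series(False, index=df.index)
    for col in columns:
        # na=False counts missing cells as "no match"; regex=False matches literal text
        mask |= df[col].str.contains(keyword, case=False, na=False, regex=False)
    return mask

# toy stand-in for the billing codes (hypothetical rows)
toy_billing = pd.DataFrame({
    'DxDescription': ['Closed fracture of rib(s), unspecified',
                      'Personal history of allergy to penicillin',
                      'Other site, subsequent encounter',
                      None],
    'CCSLevel1Name': ['Injury and poisoning', 'Symptoms', 'Diseases of the musculoskeletal system', None],
    'CCSLevel2Name': ['Fractures', 'Symptoms', 'Pathological fracture [207.]', 'No Value'],
    'CCSLevel3Name': ['Other fractures [231.]', 'Allergic reactions [253.]', 'No Value', None],
})

fracture_mask = rows_containing(toy_billing, 'fracture')
toy_fracture_rows = toy_billing[fracture_mask]
```

On a frame of a few hundred thousand billing rows the vectorized version is usually noticeably faster, but the result should be the same set of rows.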

In [None]:
# delete the fracture-related codes that are not related to the myeloma disease
excluded_descriptions = [
  'encounter for removal of internal fixation device',
  'follow-up examination, following treatment of healed fracture',
  'open wound of tooth (broken) (fractured) (due to trauma), without mention of complication',
  'other aftercare involving internal fixation device',
  # this phrase also covers 'other osteoporosis without current pathological fracture'
  'osteoporosis without current pathological fracture',
  'tooth fracture',
]

# True if the description contains any of the excluded phrases
def isExcludedDescription(description):
  return any(phrase in description.lower() for phrase in excluded_descriptions)

data_fracture = data_fracture[~data_fracture['DxDescription'].apply(isExcludedDescription)]

In [None]:
data_fracture.shape

(1831, 8)

In [None]:
# num_of_fractures counts every patient with a fracture code, whether or not
# they had STC and whether or not the fracture fell within 1 year after STC.
num_of_fractures = len(data_fracture['ID'].unique())
print(f'number of patients having fracture after elimination of non-pathological fractures: {num_of_fractures}')

number of patients having fracture after elimination of non-pathological fractures: 352


Are there any patients who had a fracture but did not have stem cell transplant treatment? Let's find out.

In [None]:
patients_with_fracture = data_fracture['ID'].unique()
patients_with_STC = STC_data['ID'].unique()

In [None]:
set(patients_with_fracture) - set(patients_with_STC)

{'MM157',
 'MM199',
 'MM238',
 'MM290',
 'MM3',
 'MM332',
 'MM36',
 'MM363',
 'MM383',
 'MM387',
 'MM396',
 'MM422',
 'MM432',
 'MM444',
 'MM454',
 'MM460',
 'MM491',
 'MM505',
 'MM522',
 'MM533',
 'MM571',
 'MM632',
 'MM655',
 'MM660',
 'MM672',
 'MM684',
 'MM696',
 'MM697',
 'MM712',
 'MM722',
 'MM806',
 'MM820',
 'MM83',
 'MM841',
 'MM849',
 'MM850',
 'MM851',
 'MM853',
 'MM854',
 'MM865',
 'MM895',
 'MM898',
 'MM902',
 'MM910',
 'MM913',
 'MM933',
 'MM95',
 'MM968'}

In [None]:
'MM3' in data_fracture['ID'].unique()

True

In [None]:
'MM3' in STC_data['ID'].unique()

False

As we can see, patient MM3 had a fracture but did not have a stem cell transplant. We are only studying patients who had a stem cell transplant, so patients like MM3 are excluded from the study for now.

In [None]:
num_patients_with_fracture_but_without_STC = len(set(patients_with_fracture) - set(patients_with_STC))
print(f'number of patients with fracture but didn\'t have stem cell treatment: {num_patients_with_fracture_but_without_STC} ')

number of patients with fracture but didn't have stem cell treatment: 48 


Now we find all the patients who had a fracture within our time window: the year (365 days) after the stem cell treatment.

In [None]:
fractureList = [] # list of [ID, 0 or 1] pairs: 1 if this patient has a fracture in the window
fracture_IDs = set(data_fracture['ID'].unique()) # computed once, outside the loop
for ID in STC_data['ID'].unique(): # for every patient who had STC:
  if ID not in fracture_IDs:
    # the patient has no fracture code at all, put 0 there
    fractureList.append([ID, 0])
  else:
    # get this patient's fracture rows
    patient_df = data_fracture[data_fracture['ID'] == ID]
    STC_Day_of_patient = STC_data[STC_data['ID'] == ID]['STC_Day'].values[0]
    # put 1 if any row is dated within 1 year (365 days) after STC
    in_window = (patient_df['DaysFromDx'] >= STC_Day_of_patient) & (patient_df['DaysFromDx'] <= STC_Day_of_patient + 365)
    if in_window.any():
      fractureList.append([ID, 1])
    else:
      # the patient does have a fracture, but not within a year after STC, put 0
      fractureList.append([ID, 0])

In [None]:
# make a dataframe out of list:
df_result = pd.DataFrame(data = fractureList, columns= ['ID', 'HasFracture?'])

In [None]:
df_result['HasFracture?'].sum()

129

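Number of STC patients with a fracture, by time window relative to the STC day: day 0 to 365 (the window used here): 129; day 0 onward: 204; any time before or after STC: 304 (the 352 patients with a fracture minus the 48 who did not have STC).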
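
The per-patient loop above can be expressed as a merge followed by a window test. This is a sketch on toy data, not the notebook's actual cell: the function name `label_fracture_within_window` is hypothetical, and it assumes each patient has exactly one `STC_Day`.

```python
import pandas as pd

def label_fracture_within_window(stc, fractures, window_days=365):
    """One row per STC patient; HasFracture? is 1 if any fracture code is dated
    within [STC_Day, STC_Day + window_days], else 0."""
    # pair every fracture record with that patient's STC day
    merged = fractures.merge(stc[['ID', 'STC_Day']], on='ID', how='inner')
    # between() is inclusive on both ends, matching the >= and <= in the loop
    in_window = merged['DaysFromDx'].between(merged['STC_Day'], merged['STC_Day'] + window_days)
    ids_with_fracture = set(merged.loc[in_window, 'ID'])
    labels = stc[['ID']].drop_duplicates().copy()
    labels['HasFracture?'] = labels['ID'].isin(ids_with_fracture).astype(int)
    return labels.reset_index(drop=True)

# toy data (hypothetical IDs and days)
toy_stc = pd.DataFrame({'ID': ['A', 'B', 'C', 'E'], 'STC_Day': [100.0, 200.0, 300.0, 10.0]})
toy_fractures = pd.DataFrame({
    'ID': ['A', 'B', 'B', 'D', 'E'],
    'DaysFromDx': [150, 50, 600, 10, 375],  # D never had STC; E sits exactly on the boundary
})
toy_labels = label_fracture_within_window(toy_stc, toy_fractures)
```

Changing `window_days` is one way to reproduce the other windows mentioned above, although the open-ended windows would need a different lower or upper bound.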

In [None]:
df_result.to_csv('Fractures.csv', index = False)

# 2. age encoding 

In [None]:
data = pd.read_csv('Demographics.csv')

In [None]:
data

Unnamed: 0,ID,AgeAtDx,PatientSex,RacialGroup
0,MM1,67,Male,White
1,MM2,61,Male,Black
2,MM3,59,Male,White
3,MM4,68,Female,White
4,MM5,63,Male,White
...,...,...,...,...
826,MM843,70,Male,White
827,MM835,56,Male,White
828,MM836,46,Male,White
829,MM837,59,Male,White


In [None]:
# given an age, return its encoding:
def encodeAge(age):
  # <=40 , >40 to <70, >=70 to category 1, 2, 3 respectively.
  if age <= 40:
    return 1
  elif age < 70:
    return 2
  else:
    return 3

# Male -> 1, Female -> 0
def encodeSex(sex):
  return 1 if sex == 'Male' else 0

# White -> 1, Black -> 2, Asian -> 3
# else -> 4
def encodeRace(race):
  if race == 'White':
    return 1
  elif race == 'Black':
    return 2
  elif race == 'Asian':
    return 3
  else:
    return 4
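
The age bins can also be computed for the whole column at once. The sketch below uses `numpy.select` on a toy series; the variable names are illustrative. Note that, like `encodeAge`, it sends a missing age to category 3, because both comparisons are false for NaN, so missing ages would need separate handling if they occur.

```python
import numpy as np
import pandas as pd

# toy ages covering both bin edges (40 and 70)
ages = pd.Series([35, 40, 41, 69, 70, 88])

# conditions are checked in order; the first one that holds decides the category
conditions = [(ages <= 40).to_numpy(), (ages < 70).to_numpy()]
age_category = pd.Series(np.select(conditions, [1, 2], default=3), index=ages.index)
```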

In [None]:
# apply encoding to Age column
data['AgeAtDx'] = data['AgeAtDx'].apply(encodeAge)
# apply encoding function to sex column
data['PatientSex'] = data['PatientSex'].apply(encodeSex)
# apply encoding function to race column
data['RacialGroup'] = data['RacialGroup'].apply(encodeRace)

In [None]:
data

Unnamed: 0,ID,AgeAtDx,PatientSex,RacialGroup
0,MM1,2,1,1
1,MM2,2,1,2
2,MM3,2,1,1
3,MM4,2,0,1
4,MM5,2,1,1
...,...,...,...,...
826,MM843,3,1,1
827,MM835,2,1,1
828,MM836,2,1,1
829,MM837,2,1,1


In [None]:
data.to_csv('Demographics_encoded.csv', index = False)

# 3. Use new Stages data (hopefully fewer nulls)

In [None]:
data_stages = pd.read_csv('Final_stage.csv')

In [None]:
data_stages

Unnamed: 0,ID,StagingSystem,Stage
0,MM2,ISS,III
1,MM3,ISS,I
2,MM4,ISS,III
3,MM5,ISS,I
4,MM6,ISS,I
...,...,...,...
815,MM832,,II
816,MM930,,II
817,MM957,,II
818,MM935,,II


In [None]:
def encodeStage(stage):
  if stage == 'I':
    return 1
  elif stage == 'II':
    return 2
  elif stage == 'III':
    return 3
  else:
    return None

In [None]:
data_stages['Stage'] = data_stages['Stage'].apply(encodeStage)

In [None]:
sorted(data_stages['Stage'].unique())

[1, 2, 3]
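
The output `[1, 2, 3]` indicates that every stage in the file was one of I, II, or III. A dictionary-based `Series.map` gives the same encoding and makes it easy to list any value that did not map. This is a sketch on a toy series; `STAGE_CODES` and the toy values are hypothetical.

```python
import pandas as pd

STAGE_CODES = {'I': 1, 'II': 2, 'III': 3}

# toy stages, including a missing value and an unexpected label
toy_stages = pd.Series(['I', 'III', 'II', None, 'unknown'])

# values absent from the dictionary become NaN instead of raising an error
encoded_stage = toy_stages.map(STAGE_CODES)

# inspect what failed to map before deciding how to handle it
unmapped = toy_stages[encoded_stage.isna()]
```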

In [None]:
data_stages = data_stages[['ID', 'Stage']]

In [None]:
# save this to csv.
data_stages.to_csv('Stages_encoded.csv', index = False)

# 4. reprepare aggregated data

In [None]:
data_labs = pd.read_csv('Labs_closest_to_SCT.csv')
data_demographics = pd.read_csv('Demographics_encoded.csv')
data_medications = pd.read_csv('medicines_90_days_before_STC.csv')
data_cancer_stage = pd.read_csv('Stages_encoded.csv')
data_bony_lesions = pd.read_csv('Lesions_encoded.csv')
data_fracture = pd.read_csv('Fractures.csv')

In [None]:
# combine fracture data with lesions data
data_agg = data_fracture.merge(data_bony_lesions, on = 'ID', how='outer')

In [None]:
# combine with cancer stage data
data_agg = data_agg.merge(data_cancer_stage, on = 'ID', how='outer')

In [None]:
# combine with medications
data_agg = data_agg.merge(data_medications, on = 'ID', how = 'outer')

In [None]:
# combine with demographics data
data_agg = data_agg.merge(data_demographics, on = 'ID', how = 'outer')

In [None]:
data_agg

Unnamed: 0,ID,HasFracture?,DxType,DaysFromDx,BonyLesions,Stage,Vitamin D supplements,Calcium,Denosumab,Pamidronate,Zoledronate,Dexamethasone,AgeAtDx,PatientSex,RacialGroup
0,MM1,0.0,MRI,297.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0
1,MM2,0.0,MRI,176.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,2.0
2,MM4,1.0,MRI,411.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,1.0
3,MM5,0.0,MRI,373.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0
4,MM6,0.0,,,,1.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
828,MM423,,,,,,,,,,,,2.0,0.0,2.0
829,MM486,,,,,,,,,,,,2.0,0.0,1.0
830,MM583,,,,,,,,,,,,2.0,0.0,1.0
831,MM675,,,,,,,,,,,,2.0,0.0,1.0
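
The merges above use `how='outer'`, and a later cell drops the rows that have no `HasFracture?` label. Starting from the fracture table and using left joins should keep the same patients in one step, provided each table has at most one row per `ID`. A sketch with a hypothetical helper `merge_on_id` and toy tables:

```python
from functools import reduce
import pandas as pd

def merge_on_id(frames, how='left'):
    """Merge a list of DataFrames on 'ID', left to right."""
    return reduce(lambda left, right: left.merge(right, on='ID', how=how), frames)

# toy tables (hypothetical values); patient C has demographics but no label
toy_fx = pd.DataFrame({'ID': ['A', 'B'], 'HasFracture?': [0, 1]})
toy_lesions = pd.DataFrame({'ID': ['A'], 'BonyLesions': [1]})
toy_demo = pd.DataFrame({'ID': ['A', 'B', 'C'], 'AgeAtDx': [2, 3, 2]})

left_joined = merge_on_id([toy_fx, toy_lesions, toy_demo], how='left')
outer_then_dropped = (merge_on_id([toy_fx, toy_lesions, toy_demo], how='outer')
                      .dropna(subset=['HasFracture?']))
```

If a table holds several rows for one patient, either approach duplicates that patient, so it is worth checking `df['ID'].is_unique` before merging.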


Before we can combine this with the lab data, we need to process the lab data: for each patient and each lab test, we flag whether the level is normal or abnormal based on the lab sheet.

In [None]:
data_labs[data_labs['ObservationName'] == 'Parathyroid hormone']

Unnamed: 0,ID,ObservationId,ObservationName,Panel,DaysFromDx,ObservationValueNumeric,Units,AbnormalFlags,LowerLimit,UpperLimit
39,MM7,IPTH,Parathyroid hormone,Endocrine,404,47.0,pg/mL,N,12.0,88.0
103,MM18,IPTH,Parathyroid hormone,Endocrine,1341,142.0,pg/mL,H,12.0,88.0
157,MM26,IPTH,Parathyroid hormone,Endocrine,1173,17.0,pg/mL,N,12.0,88.0
167,MM844,IPTH,Parathyroid hormone,Endocrine,3508,1.0,pg/mL,L,12.0,88.0
184,MM29,IPTH,Parathyroid hormone,Endocrine,3802,17.0,pg/mL,N,12.0,88.0
...,...,...,...,...,...,...,...,...,...,...
4675,MM715,IPTH,Parathyroid hormone,Endocrine,2266,28.0,pg/mL,N,12.0,88.0
4825,MM738,IPTH,Parathyroid hormone,Endocrine,2,39.0,pg/mL,N,12.0,88.0
5092,MM814,IPTH,Parathyroid hormone,Endocrine,674,197.0,pg/mL,H,12.0,88.0
5138,MM855,IPTH,Parathyroid hormone,Endocrine,791,249.0,pg/mL,H,12.0,88.0


We can take out the parathyroid hormone rows because fewer than 10% of patients (62 / 700) took this lab test, which would leave too many nulls.

In [None]:
data_labs = data_labs[data_labs['ObservationName'] != 'Parathyroid hormone']

In [None]:
important_chemicals = ['Calcium', 'Phosphate', 'Parathyroid hormone', \
                       'Alkaline\xa0Phosphatase', 'Vitamin\xa0D3', \
                       'Estradiol', 'Testosterone', 'Thyroid\xa0Stimulating\xa0Hormone',\
                       'Creatinine', 'C-Reactive Protein', 'Sedimentation\xa0Rate']

Flag interpretation per lab test (0 = normal, 1 = abnormal): \\
Calcium: N -> 0; H, L -> 1 \\
Phosphate: N -> 0; H, L -> 1 \\
Parathyroid hormone: not in lab sheet \\
Alkaline Phosphatase: L, N -> 0; H -> 1 \\
Vitamin D3: H, N -> 0; L -> 1 \\
Estradiol: H, N -> 0; L -> 1 \\
Testosterone: H, N -> 0; L -> 1 \\
Thyroid Stimulating Hormone: N -> 0; H, L -> 1 \\
Creatinine: L, N -> 0; H -> 1 \\
C-Reactive Protein: L, N -> 0; H -> 1 \\
Sedimentation Rate: L, N -> 0; H -> 1 \\

In [None]:
lab_interpretation = {}
lab_interpretation['Calcium'] = {'N':0, 'H': 1, 'L': 1}
lab_interpretation['Phosphate'] = {'N':0, 'H': 1, 'L': 1}
lab_interpretation['Alkaline\xa0Phosphatase'] = {'N':0, 'H': 1, 'L': 0}
lab_interpretation['Vitamin\xa0D3'] = {'N':0, 'H': 0, 'L': 1} # low Vitamin D3 is the abnormal case, as listed in the table above
lab_interpretation['Estradiol'] = {'N':0, 'H': 0, 'L': 1}
lab_interpretation['Testosterone'] = {'N':0, 'H': 0, 'L': 1}
lab_interpretation['Thyroid\xa0Stimulating\xa0Hormone'] = {'N':0, 'H': 1, 'L': 1}
lab_interpretation['Creatinine'] = {'N':0, 'H': 1, 'L': 0}
lab_interpretation['C-Reactive Protein'] = {'N':0, 'H': 1, 'L': 0}
lab_interpretation['Sedimentation\xa0Rate'] = {'N':0, 'H': 1, 'L': 0}


In [None]:
# transform lab results flags to match lab interpretation.
data_labs['Abnormal?'] = data_labs.apply(lambda row: lab_interpretation[row['ObservationName']][row['AbnormalFlags']], axis = 1)
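
The lookup above raises a `KeyError` if a row has a test name or flag that is not in `lab_interpretation`, for example a missing `AbnormalFlags` value. A sketch of a more forgiving lookup follows; the helper `flag_abnormal`, the two-entry rule table, and the toy rows are hypothetical.

```python
import pandas as pd

# a two-test excerpt of the interpretation table, for illustration only
toy_rules = {
    'Calcium': {'N': 0, 'H': 1, 'L': 1},
    'Creatinine': {'N': 0, 'H': 1, 'L': 0},
}

def flag_abnormal(labs, rules):
    """Return 0/1 per row, or a missing value when the test or flag is not in rules."""
    def lookup(row):
        # dict.get returns None instead of raising KeyError for an unlisted key
        return rules.get(row['ObservationName'], {}).get(row['AbnormalFlags'])
    return labs.apply(lookup, axis=1)

toy_lab_rows = pd.DataFrame({
    'ObservationName': ['Calcium', 'Creatinine', 'Calcium', 'Unlisted test'],
    'AbnormalFlags': ['H', 'L', None, 'N'],
})
toy_flags = flag_abnormal(toy_lab_rows, toy_rules)
```

Rows that come back missing can then be inspected or dropped explicitly, rather than stopping the whole cell.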

In [None]:
data_labs

Unnamed: 0,ID,ObservationId,ObservationName,Panel,DaysFromDx,ObservationValueNumeric,Units,AbnormalFlags,LowerLimit,UpperLimit,Abnormal?
0,MM1,CA,Calcium,Electrolyte,343,9.100,mg/dL,N,8.9,10.2,0
1,MM1,P,Phosphate,Electrolyte,343,4.700,mg/dL,H,2.5,4.5,1
2,MM1,ALK,Alkaline Phosphatase,Liver function,343,93.000,U/L,N,36.0,161.0,0
3,MM1,VITD3,Vitamin D3,Nutrtion,289,48.700,ng/mL,N,20.0,100.0,0
4,MM1,TEST,Testosterone,Endocrine,294,0.700,ng/mL,L,1.6,5.9,1
...,...,...,...,...,...,...,...,...,...,...,...
5254,MM838,VITD3,Vitamin D3,Nutrtion,2316,42.600,ng/mL,N,20.0,100.0,0
5255,MM838,TEST,Testosterone,Endocrine,1628,1.800,ng/mL,N,1.6,5.9,0
5256,MM838,TSH,Thyroid Stimulating Hormone,Endocrine,712,7.306,uIU/mL,H,0.4,5.0,1
5257,MM838,CRE,Creatinine,Kidney function,242,1.000,mg/dL,N,0.3,1.2,0


Now we need to convert these rows into one column per chemical, indicating whether each patient has an abnormal level of that chemical.

In [None]:
list_of_lab_results = []
for ID in data_labs['ID'].unique():
  patient_labs = data_labs[data_labs['ID'] == ID]
  thisRow = [ID]
  for chemical in important_chemicals:
    if chemical in patient_labs['ObservationName'].unique():
      patient_labs_this_chemical = patient_labs[patient_labs['ObservationName'] == chemical]
      if patient_labs_this_chemical.shape[0] == 1:
        thisRow.append(patient_labs_this_chemical['Abnormal?'].values[0])
      else:
        # test the values: `in` on a Series checks the index labels, not the values
        if 1 in patient_labs_this_chemical['Abnormal?'].values:
          thisRow.append(1)
        else:
          thisRow.append(0)
    else:
      thisRow.append(None)
  list_of_lab_results.append(thisRow)
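
The loop above is meant to produce one row per patient and one column per chemical, with 1 if any of the patient's readings for that chemical is abnormal and a null if the test was never taken. The same reshaping can be sketched with `groupby` and `unstack`; the toy frame below is hypothetical.

```python
import pandas as pd

# toy long-format labs: MM1 has two calcium readings, MM2 has no phosphate reading
toy_labs = pd.DataFrame({
    'ID': ['MM1', 'MM1', 'MM1', 'MM2'],
    'ObservationName': ['Calcium', 'Calcium', 'Phosphate', 'Calcium'],
    'Abnormal?': [0, 1, 1, 0],
})

# max over 0/1 flags is 1 exactly when at least one reading is abnormal;
# unstack turns the test names into columns and leaves NaN where a test is absent
toy_wide = (toy_labs.groupby(['ID', 'ObservationName'])['Abnormal?']
            .max()
            .unstack()
            .reset_index())
```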

In [None]:
data_labs_new = pd.DataFrame(data = list_of_lab_results, columns= ['ID'] + important_chemicals)

In [None]:
data_labs_new.drop(['Parathyroid hormone'], axis = 1, inplace = True)

In [None]:
data_labs_new

Unnamed: 0,ID,Calcium,Phosphate,Alkaline Phosphatase,Vitamin D3,Estradiol,Testosterone,Thyroid Stimulating Hormone,Creatinine,C-Reactive Protein,Sedimentation Rate
0,MM1,0.0,1.0,0.0,0.0,,1.0,,1.0,1.0,
1,MM2,0.0,0.0,0.0,0.0,,0.0,,0.0,0.0,
2,MM4,0.0,0.0,0.0,0.0,,,0.0,1.0,0.0,
3,MM5,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,
4,MM6,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...
695,MM834,0.0,0.0,0.0,,,0.0,1.0,0.0,1.0,
696,MM843,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,1.0
697,MM836,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,1.0,
698,MM837,0.0,0.0,0.0,0.0,,1.0,,0.0,0.0,


In [None]:
data_labs_new.isnull().sum()

ID                               0
Calcium                          1
Phosphate                        4
Alkaline Phosphatase             1
Vitamin D3                     135
Estradiol                      550
Testosterone                   387
Thyroid Stimulating Hormone    221
Creatinine                       1
C-Reactive Protein              21
Sedimentation Rate             599
dtype: int64

In [None]:
data_labs_new.to_csv('Labs_encoded.csv', index=False)

We might need to remove nulls later, but for now let's combine the lab flags with our aggregated dataframe.

In [None]:
# combine lab data with everything else
data_agg = data_agg.merge(data_labs_new, on = 'ID', how = 'outer')

In [None]:
data_agg

Unnamed: 0,ID,HasFracture?,DxType,DaysFromDx,BonyLesions,Stage,Vitamin D supplements,Calcium_x,Denosumab,Pamidronate,Zoledronate,Dexamethasone,AgeAtDx,PatientSex,RacialGroup,Calcium_y,Phosphate,Alkaline Phosphatase,Vitamin D3,Estradiol,Testosterone,Thyroid Stimulating Hormone,Creatinine,C-Reactive Protein,Sedimentation Rate
0,MM1,0.0,MRI,297.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,1.0,0.0,0.0,,1.0,,1.0,1.0,
1,MM2,0.0,MRI,176.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,2.0,0.0,0.0,0.0,0.0,,0.0,,0.0,0.0,
2,MM4,1.0,MRI,411.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,,,0.0,1.0,0.0,
3,MM5,0.0,MRI,373.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,
4,MM6,0.0,,,,1.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
828,MM423,,,,,,,,,,,,2.0,0.0,2.0,,,,,,,,,,
829,MM486,,,,,,,,,,,,2.0,0.0,1.0,,,,,,,,,,
830,MM583,,,,,,,,,,,,2.0,0.0,1.0,,,,,,,,,,
831,MM675,,,,,,,,,,,,2.0,0.0,1.0,,,,,,,,,,


In [None]:
# delete the rows with NaN HasFracture? values, since that is the label we are trying to predict
data_agg = data_agg.dropna(subset = ['HasFracture?'])

In [None]:
data_agg

Unnamed: 0,ID,HasFracture?,DxType,DaysFromDx,BonyLesions,Stage,Vitamin D supplements,Calcium_x,Denosumab,Pamidronate,Zoledronate,Dexamethasone,AgeAtDx,PatientSex,RacialGroup,Calcium_y,Phosphate,Alkaline Phosphatase,Vitamin D3,Estradiol,Testosterone,Thyroid Stimulating Hormone,Creatinine,C-Reactive Protein,Sedimentation Rate
0,MM1,0.0,MRI,297.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,1.0,0.0,0.0,,1.0,,1.0,1.0,
1,MM2,0.0,MRI,176.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,2.0,0.0,0.0,0.0,0.0,,0.0,,0.0,0.0,
2,MM4,1.0,MRI,411.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,,,0.0,1.0,0.0,
3,MM5,0.0,MRI,373.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,
4,MM6,0.0,,,,1.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
697,MM834,0.0,MRI,161.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,0.0,0.0,0.0,,,0.0,1.0,0.0,1.0,
698,MM843,0.0,MRI,812.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,1.0
699,MM836,0.0,MRI,754.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,1.0,
700,MM837,0.0,MRI,173.0,1.0,3.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,,1.0,,0.0,0.0,


In [None]:
# drop some columns we are not going to use in prediction
# (reassigning instead of inplace=True avoids pandas' SettingWithCopyWarning on the filtered frame)
data_agg = data_agg.drop(['DxType', 'DaysFromDx'], axis = 1)


In [None]:
data_agg.head()

Unnamed: 0,ID,HasFracture?,BonyLesions,Stage,Vitamin D supplements,Calcium_x,Denosumab,Pamidronate,Zoledronate,Dexamethasone,AgeAtDx,PatientSex,RacialGroup,Calcium_y,Phosphate,Alkaline Phosphatase,Vitamin D3,Estradiol,Testosterone,Thyroid Stimulating Hormone,Creatinine,C-Reactive Protein,Sedimentation Rate
0,MM1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,1.0,0.0,0.0,,1.0,,1.0,1.0,
1,MM2,0.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,2.0,0.0,0.0,0.0,0.0,,0.0,,0.0,0.0,
2,MM4,1.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,,,0.0,1.0,0.0,
3,MM5,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,
4,MM6,0.0,,1.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,


In [None]:
data_agg = data_agg.rename(columns={'Calcium_x': 'TookCalciumMedicine', 'Calcium_y': 'LabCalciumLevel'})

Let's save this aggregated data to csv, without handling null values yet.

In [None]:
data_agg.to_csv('data_agg.csv', index=False)

# 5. dealing with nulls

There is basically no elegant way to deal with nulls: as a mathematical problem, it is ill-defined. We could impute the missing values with a regression model, a Boltzmann machine, or something similar, and we may try that, but imputation does not change the underlying problem that the information is missing.
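
One simple way to organize the cleanup is to look at the fraction of missing values per column, drop the sparsest columns, and then drop the remaining incomplete rows. The sketch below does this on a toy frame; the helper `drop_sparse_columns` and the 0.5 threshold are illustrative assumptions, not a rule taken from this notebook, where the columns and rows to drop are chosen by hand.

```python
import pandas as pd

def drop_sparse_columns(df, max_missing=0.5):
    """Drop columns whose fraction of nulls exceeds max_missing.
    Returns the reduced frame and the per-column missing fractions."""
    missing_fraction = df.isnull().mean()
    keep = missing_fraction[missing_fraction <= max_missing].index
    return df[keep], missing_fraction

# toy frame: column A is 75% null, column B is 25% null
toy = pd.DataFrame({
    'ID': ['p1', 'p2', 'p3', 'p4'],
    'A': [1.0, None, None, None],
    'B': [1.0, 0.0, None, 1.0],
})
reduced, missing_fraction = drop_sparse_columns(toy, max_missing=0.5)
complete_cases = reduced.dropna()  # keep only rows with no nulls left
```

Dropping rows shrinks the training set and can bias it toward patients who had more tests ordered, so it is worth comparing the fracture rate before and after.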

In [None]:
data = pd.read_csv('data_agg.csv')

In [None]:
data

Unnamed: 0,ID,HasFracture?,BonyLesions,Stage,Vitamin D supplements,TookCalciumMedicine,Denosumab,Pamidronate,Zoledronate,Dexamethasone,AgeAtDx,PatientSex,RacialGroup,LabCalciumLevel,Phosphate,Alkaline Phosphatase,Vitamin D3,Estradiol,Testosterone,Thyroid Stimulating Hormone,Creatinine,C-Reactive Protein,Sedimentation Rate
0,MM1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,1.0,0.0,0.0,,1.0,,1.0,1.0,
1,MM2,0.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,2.0,0.0,0.0,0.0,0.0,,0.0,,0.0,0.0,
2,MM4,1.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,,,0.0,1.0,0.0,
3,MM5,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,
4,MM6,0.0,,1.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
697,MM834,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,0.0,0.0,0.0,,,0.0,1.0,0.0,1.0,
698,MM843,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,1.0
699,MM836,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,1.0,
700,MM837,0.0,1.0,3.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,,1.0,,0.0,0.0,


In [None]:
data.isnull().sum()

ID                               0
HasFracture?                     0
BonyLesions                     33
Stage                            4
Vitamin D supplements            0
TookCalciumMedicine              0
Denosumab                        0
Pamidronate                      0
Zoledronate                      0
Dexamethasone                    0
AgeAtDx                          1
PatientSex                       1
RacialGroup                      1
LabCalciumLevel                  3
Phosphate                        6
Alkaline Phosphatase             3
Vitamin D3                     137
Estradiol                      552
Testosterone                   389
Thyroid Stimulating Hormone    223
Creatinine                       3
C-Reactive Protein              23
Sedimentation Rate             601
dtype: int64

In [None]:
data.drop(['Estradiol', 'Testosterone', 'Sedimentation\xa0Rate'], axis = 1, inplace = True)

In [None]:
data.isnull().sum()

ID                               0
HasFracture?                     0
BonyLesions                     33
Stage                            4
Vitamin D supplements            0
TookCalciumMedicine              0
Denosumab                        0
Pamidronate                      0
Zoledronate                      0
Dexamethasone                    0
AgeAtDx                          1
PatientSex                       1
RacialGroup                      1
LabCalciumLevel                  3
Phosphate                        6
Alkaline Phosphatase             3
Vitamin D3                     137
Thyroid Stimulating Hormone    223
Creatinine                       3
C-Reactive Protein              23
dtype: int64

In [None]:
data[data.isnull()['AgeAtDx']]

Unnamed: 0,ID,HasFracture?,BonyLesions,Stage,Vitamin D supplements,TookCalciumMedicine,Denosumab,Pamidronate,Zoledronate,Dexamethasone,AgeAtDx,PatientSex,RacialGroup,LabCalciumLevel,Phosphate,Alkaline Phosphatase,Vitamin D3,Thyroid Stimulating Hormone,Creatinine,C-Reactive Protein
622,MM908,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,


In [None]:
# remove this row: it is missing the demographics and every lab value
data = data[~data.isnull()['AgeAtDx']]

In [None]:
data.isnull().sum()

ID                               0
HasFracture?                     0
BonyLesions                     33
Stage                            4
Vitamin D supplements            0
TookCalciumMedicine              0
Denosumab                        0
Pamidronate                      0
Zoledronate                      0
Dexamethasone                    0
AgeAtDx                          0
PatientSex                       0
RacialGroup                      0
LabCalciumLevel                  2
Phosphate                        5
Alkaline Phosphatase             2
Vitamin D3                     136
Thyroid Stimulating Hormone    222
Creatinine                       2
C-Reactive Protein              22
dtype: int64

In [None]:
# drop any rows with a null Vitamin D3 value
data = data.dropna(subset = ['Vitamin\xa0D3'])

In [None]:
data.isnull().sum()

ID                               0
HasFracture?                     0
BonyLesions                     28
Stage                            1
Vitamin D supplements            0
TookCalciumMedicine              0
Denosumab                        0
Pamidronate                      0
Zoledronate                      0
Dexamethasone                    0
AgeAtDx                          0
PatientSex                       0
RacialGroup                      0
LabCalciumLevel                  0
Phosphate                        1
Alkaline Phosphatase             0
Vitamin D3                       0
Thyroid Stimulating Hormone    172
Creatinine                       0
C-Reactive Protein              12
dtype: int64

In [None]:
data.shape

(565, 20)

In [None]:
data = data.dropna(subset = ['Phosphate'])

In [None]:
data.isnull().sum()

ID                               0
HasFracture?                     0
BonyLesions                     28
Stage                            1
Vitamin D supplements            0
TookCalciumMedicine              0
Denosumab                        0
Pamidronate                      0
Zoledronate                      0
Dexamethasone                    0
AgeAtDx                          0
PatientSex                       0
RacialGroup                      0
LabCalciumLevel                  0
Phosphate                        0
Alkaline Phosphatase             0
Vitamin D3                       0
Thyroid Stimulating Hormone    171
Creatinine                       0
C-Reactive Protein              11
dtype: int64

In [None]:
data = data.drop('Thyroid\xa0Stimulating\xa0Hormone', axis = 1)

In [None]:
data.isnull().sum()

ID                        0
HasFracture?              0
BonyLesions              28
Stage                     1
Vitamin D supplements     0
TookCalciumMedicine       0
Denosumab                 0
Pamidronate               0
Zoledronate               0
Dexamethasone             0
AgeAtDx                   0
PatientSex                0
RacialGroup               0
LabCalciumLevel           0
Phosphate                 0
Alkaline Phosphatase      0
Vitamin D3                0
Creatinine                0
C-Reactive Protein       11
dtype: int64

In [None]:
data = data.dropna(subset = ['Stage', 'BonyLesions', 'C-Reactive Protein'])

In [None]:
data.isnull().sum()

ID                       0
HasFracture?             0
BonyLesions              0
Stage                    0
Vitamin D supplements    0
TookCalciumMedicine      0
Denosumab                0
Pamidronate              0
Zoledronate              0
Dexamethasone            0
AgeAtDx                  0
PatientSex               0
RacialGroup              0
LabCalciumLevel          0
Phosphate                0
Alkaline Phosphatase     0
Vitamin D3               0
Creatinine               0
C-Reactive Protein       0
dtype: int64

In [None]:
data.shape

(525, 19)

In [None]:
data.to_csv('data_agg_2.csv', index = False)

# 6. one hot encoding

In [4]:
ls

BillingCodes.csv           Labs_encoded.csv
BonyLesions.csv            Lesions_encoded.csv
data_agg_2.csv             Medications.csv
data_agg.csv               medicines_90_days_before_STC.csv
data_for_ML_back_fill.csv  MyelomaTherapy.csv
Demographics.csv           PlasmaCells.csv
Demographics_encoded.csv   RadiationTherapy.csv
Diagnoses.csv              Signs.csv
Final_stage.csv            Stage.csv
Fractures.csv              Stages_encoded.csv
Labs2.csv                  STC_Days.csv
Labs_closest_to_SCT.csv    SurvivalDays.csv
Labs.csv                   Symptoms.csv


In [5]:
data = pd.read_csv('data_agg_2.csv')

In [6]:
data

Unnamed: 0,ID,HasFracture?,BonyLesions,Stage,Vitamin D supplements,TookCalciumMedicine,Denosumab,Pamidronate,Zoledronate,Dexamethasone,AgeAtDx,PatientSex,RacialGroup,LabCalciumLevel,Phosphate,Alkaline Phosphatase,Vitamin D3,Creatinine,C-Reactive Protein
0,MM1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0
1,MM2,0.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
2,MM4,1.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,MM5,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
4,MM7,1.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
520,MM832,0.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
521,MM843,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
522,MM836,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
523,MM837,0.0,1.0,3.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
data['HasFracture?'].sum()

108.0

In [8]:
data.isnull().sum()

ID                       0
HasFracture?             0
BonyLesions              0
Stage                    0
Vitamin D supplements    0
TookCalciumMedicine      0
Denosumab                0
Pamidronate              0
Zoledronate              0
Dexamethasone            0
AgeAtDx                  0
PatientSex               0
RacialGroup              0
LabCalciumLevel          0
Phosphate                0
Alkaline Phosphatase     0
Vitamin D3               0
Creatinine               0
C-Reactive Protein       0
dtype: int64

In [9]:
# need to do one hot encoding on 1. Age variable, 2. sex, 3. racial group
data = pd.get_dummies(data, columns=['AgeAtDx', 'PatientSex', 'RacialGroup'])

In [10]:
data

Unnamed: 0,ID,HasFracture?,BonyLesions,Stage,Vitamin D supplements,TookCalciumMedicine,Denosumab,Pamidronate,Zoledronate,Dexamethasone,LabCalciumLevel,Phosphate,Alkaline Phosphatase,Vitamin D3,Creatinine,C-Reactive Protein,AgeAtDx_1.0,AgeAtDx_2.0,AgeAtDx_3.0,PatientSex_0.0,PatientSex_1.0,RacialGroup_1.0,RacialGroup_2.0,RacialGroup_3.0,RacialGroup_4.0
0,MM1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0,1,0,0,1,1,0,0,0
1,MM2,0.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0,1,0,1,0,0
2,MM4,1.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0,1,0,1,0,1,0,0,0
3,MM5,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0,1,1,0,0,0
4,MM7,1.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0,1,0,1,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
520,MM832,0.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0,1,0,0,1,1,0,0,0
521,MM843,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1,0,1,1,0,0,0
522,MM836,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0,1,0,0,1,1,0,0,0
523,MM837,0.0,1.0,3.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0,1,1,0,0,0
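
`pd.get_dummies` only creates columns for the levels that are present in the data, so a category that happens to be absent (for example, no patient aged 40 or under after the null filtering) would silently lose its column. Declaring the levels with `pd.Categorical` keeps the column layout fixed. A sketch on a toy frame with hypothetical values:

```python
import pandas as pd

toy = pd.DataFrame({
    'ID': ['A', 'B', 'C'],
    'AgeAtDx': [2.0, 3.0, 2.0],      # level 1.0 does not occur in this sample
    'RacialGroup': [1.0, 2.0, 1.0],  # levels 3.0 and 4.0 do not occur
})

# declare the full set of levels so absent ones still get an all-zero column
toy['AgeAtDx'] = pd.Categorical(toy['AgeAtDx'], categories=[1.0, 2.0, 3.0])
toy['RacialGroup'] = pd.Categorical(toy['RacialGroup'], categories=[1.0, 2.0, 3.0, 4.0])

encoded = pd.get_dummies(toy, columns=['AgeAtDx', 'RacialGroup'], dtype=int)
```

For linear models, passing `drop_first=True` removes one redundant column per variable; tree-based models generally do not need it.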


In [13]:
data.to_csv('data_agg_3.csv', index = False)