<a href="https://colab.research.google.com/github/wenjunsun/personal-machine-learning-projects/blob/master/cancer-fracture/task1/prepare_data_for_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we will try to prepare the data necessary for machine learning prediction. The upshot is to use the data 90 days before the stem cell transplant of a patient with myeloma to predict if there will be fracture of bones after the stem cell transplant.

In [1]:
cd drive/My\ Drive/fracture_with_emissa/Datasets/Raw\ Data

/content/drive/My Drive/fracture_with_emissa/Datasets/Raw Data


In [2]:
ls

BillingCodes.csv          Medications.csv
BonyLesions.csv           medicines_90_days_before_STC.csv
data_agg.csv              MyelomaTherapy.csv
Demographics.csv          PlasmaCells.csv
Demographics_encoded.csv  RadiationTherapy.csv
Diagnoses.csv             Signs.csv
Fractures.csv             Stage.csv
Labs2.csv                 Stages_encoded.csv
Labs_closest_to_SCT.csv   SurvivalDays.csv
Labs.csv                  Symptoms.csv
Lesions_encoded.csv


In [3]:
import pandas as pd

# 1. prepare data from lab results.

In this section we will do the following:

First extract the day on which each patient have the stem cell transplant, then we find in the lab results csv file the closest lab results to the SCT (SCT stands for stem cell transplant) date, which will be provided as features to our machine learning models later.

## 1.1 extract the date of SCT for each patient

In [None]:
data = pd.read_csv('MyelomaTherapy.csv')

In [None]:
data.head()

Unnamed: 0,ID,MedTx,Combination,Class,Mechanism,DaysFromDxStart,DaysFromDxStop,Line,Duration,TreatmentPhase
0,MM1,Bortezomib,VRD,Proteosome inhibitor,Proteosome inhibitor,116.0,260.0,1,195.0,Induction
1,MM1,Dexamethasone,VRD,Steroid,Steroid,116.0,260.0,1,195.0,Induction
2,MM1,Lenalidomide,VRD,Immunotherapy,Immunomodulatory,116.0,260.0,1,195.0,Induction
3,MM1,Cyclophosphamide,Not specified,Chemotherapy,Alkylator,311.0,311.0,2,32.0,Induction
4,MM1,Dexamethasone,Not specified,Steroid,Steroid,311.0,299.0,2,32.0,Induction


In [None]:
data[data['ID'] == 'MM1']

Unnamed: 0,ID,MedTx,Combination,Class,Mechanism,DaysFromDxStart,DaysFromDxStop,Line,Duration,TreatmentPhase
0,MM1,Bortezomib,VRD,Proteosome inhibitor,Proteosome inhibitor,116.0,260.0,1,195.0,Induction
1,MM1,Dexamethasone,VRD,Steroid,Steroid,116.0,260.0,1,195.0,Induction
2,MM1,Lenalidomide,VRD,Immunotherapy,Immunomodulatory,116.0,260.0,1,195.0,Induction
3,MM1,Cyclophosphamide,Not specified,Chemotherapy,Alkylator,311.0,311.0,2,32.0,Induction
4,MM1,Dexamethasone,Not specified,Steroid,Steroid,311.0,299.0,2,32.0,Induction
5,MM1,Stem cell transplant,Not specified,Stem Cell Transplant,Stem Cell Transplant,343.0,343.0,3,114.0,Stem cell transplant
6,MM1,Bortezomib,Not specified,Proteosome inhibitor,Proteosome inhibitor,457.0,,4,,Maintenance


As we can see, patient MM1 has the stem cell transplant on day 343. We want to extract this number for each patient. And store the information in a python dictionary. 

For example, the dictionary may look like this after processing: `{'MM1': 343, 'MM2': 200, ...}`

In [None]:
data[data['ID'] == 'MM3']

Unnamed: 0,ID,MedTx,Combination,Class,Mechanism,DaysFromDxStart,DaysFromDxStop,Line,Duration,TreatmentPhase
15,MM3,Dexamethasone,VRD,Steroid,Steroid,73.0,241.0,1,,Induction
16,MM3,Lenalidomide,VRD,Immunotherapy,Immunomodulatory,73.0,241.0,1,,Induction
17,MM3,Bortezomib,VRD,Proteosome inhibitor,Proteosome inhibitor,73.0,241.0,1,,Induction


As we can see. Patient MM3 didn't do SCT. So our dictionary will simply not store its mapping. In other words, after processing dictionary won't have MM3 as one of its keys. We are only focusing on the STC patients for now.

In [None]:
data[data['ID'] == 'MM20']

Unnamed: 0,ID,MedTx,Combination,Class,Mechanism,DaysFromDxStart,DaysFromDxStop,Line,Duration,TreatmentPhase
221,MM20,Bortezomib,CyBorD,Proteosome inhibitor,Proteosome inhibitor,2670.0,2733.0,1,69.0,Induction
222,MM20,Dexamethasone,CyBorD,Steroid,Steroid,2670.0,2733.0,1,69.0,Induction
223,MM20,Cyclophosphamide,CyBorD,Chemotherapy,Alkylator,2670.0,2733.0,1,69.0,Induction
224,MM20,Bortezomib,Not specified,Proteosome inhibitor,Proteosome inhibitor,2739.0,2763.0,2,29.0,Induction
225,MM20,Bortezomib,VRD,Proteosome inhibitor,Proteosome inhibitor,2768.0,2841.0,3,167.0,Induction
226,MM20,Lenalidomide,VRD,Immunotherapy,Immunomodulatory,2768.0,2841.0,3,167.0,Induction
227,MM20,Dexamethasone,VRD,Steroid,Steroid,2768.0,2841.0,3,167.0,Induction
228,MM20,Liposomal doxorubicin,Not specified,Chemotherapy,Topoisomerase inhibitor,2935.0,2999.0,4,89.0,Induction
229,MM20,Carfilzomib,Not specified,Proteosome inhibitor,Proteosome inhibitor,2935.0,2999.0,4,89.0,Induction
230,MM20,Dexamethasone,Not specified,Steroid,Steroid,2935.0,2999.0,4,89.0,Induction


As we can see, patient MM20 had 2 stem cell transplant treatment. In this case the dictionary will simply store day of his/her first treatment

In [None]:
STC_data = data[data['MedTx'] == 'Stem cell transplant']

In [None]:
STC_data

Unnamed: 0,ID,MedTx,Combination,Class,Mechanism,DaysFromDxStart,DaysFromDxStop,Line,Duration,TreatmentPhase
5,MM1,Stem cell transplant,Not specified,Stem Cell Transplant,Stem Cell Transplant,343.0,343.0,3,114.0,Stem cell transplant
14,MM2,Stem cell transplant,Not specified,Stem Cell Transplant,Stem Cell Transplant,229.0,229.0,4,,Stem cell transplant
25,MM4,Stem cell transplant,Not specified,Stem Cell Transplant,Stem Cell Transplant,425.0,425.0,5,98.0,Stem cell transplant
29,MM5,Stem cell transplant,Not specified,Stem Cell Transplant,Stem Cell Transplant,416.0,417.0,2,435.0,Stem cell transplant
37,MM6,Stem cell transplant,Not specified,Stem Cell Transplant,Stem Cell Transplant,657.0,657.0,3,107.0,Stem cell transplant
...,...,...,...,...,...,...,...,...,...,...
9893,MM836,Stem cell transplant,Not specified,Stem Cell Transplant,Stem Cell Transplant,786.0,786.0,6,58.0,Stem cell transplant
9894,MM836,Stem cell transplant,Not specified,Stem Cell Transplant,Stem Cell Transplant,844.0,844.0,7,210.0,Stem cell transplant
9899,MM837,Stem cell transplant,Not specified,Stem Cell Transplant,Stem Cell Transplant,203.0,203.0,3,,Stem cell transplant
9905,MM838,Stem cell transplant,Not specified,Stem Cell Transplant,Stem Cell Transplant,244.0,244.0,3,92.0,Stem cell transplant


In [None]:
STC_data.iloc[1,5]

229.0

In [None]:
# the dictionary that looks like this: {'MM1': 343, 'MM2': 200, ...}
# patient ID -> Days from diagnosis that patient used STC
patientToDayOfSTC = dict()
for i in range(STC_data.shape[0]):
  ID = STC_data.iloc[i, 0] # 0th column is the ID column, i is the row number
  day = STC_data.iloc[i, 5] # 5th column is the DaysFromDxStart column, 
                            # which is the day which patient used STC
  if ID not in patientToDayOfSTC:
    patientToDayOfSTC[ID] = day
  else:
    continue # we only keep the first day of STC treatment, ignore the later ones.


In [None]:
len(patientToDayOfSTC)

702

So we have 702 patient who did the STC.

## 1.2 extrat the closest lab results to STC date from Labs.csv

In [None]:
data = pd.read_csv('Labs.csv')

In [None]:
data.head()

Unnamed: 0,ID,ObservationId,ObservationName,Panel,DaysFromDx,ObservationValueNumeric,Units,AbnormalFlags,LowerLimit,UpperLimit
0,MM1,P,Phosphate,Electrolyte,363,4.0,mg/dL,N,2.5,4.5
1,MM1,PLT,Platelet Count,Blood count,366,258.0,THOU/uL,N,150.0,400.0
2,MM1,ALK,Alkaline Phosphatase,Liver function,314,74.0,U/L,N,36.0,161.0
3,MM1,RDWCV,Red Blood Cell Distribution Width,Blood count,358,14.3,%,N,11.6,14.4
4,MM1,RDWCV,Red Blood Cell Distribution Width,Blood count,355,13.8,%,N,11.6,14.4


In [None]:
data[(data['ID'] == 'MM1') & (data['ObservationName'] == 'Calcium')].sort_values(by = ['DaysFromDx'])

Unnamed: 0,ID,ObservationId,ObservationName,Panel,DaysFromDx,ObservationValueNumeric,Units,AbnormalFlags,LowerLimit,UpperLimit
592,MM1,CA,Calcium,Electrolyte,289,8.5,mg/dL,L,8.9,10.2
130,MM1,CA,Calcium,Electrolyte,294,9.2,mg/dL,N,8.9,10.2
271,MM1,CA,Calcium,Electrolyte,296,9.3,mg/dL,N,8.9,10.2
172,MM1,CA,Calcium,Electrolyte,301,9.8,mg/dL,N,8.9,10.2
186,MM1,CA,Calcium,Electrolyte,303,9.4,mg/dL,N,8.9,10.2
459,MM1,CA,Calcium,Electrolyte,307,9.4,mg/dL,N,8.9,10.2
353,MM1,CA,Calcium,Electrolyte,310,9.9,mg/dL,N,8.9,10.2
6,MM1,CA,Calcium,Electrolyte,311,9.6,mg/dL,N,8.9,10.2
320,MM1,CA,Calcium,Electrolyte,311,9.6,mg/dL,N,8.9,10.2
74,MM1,CA,Calcium,Electrolyte,314,9.2,mg/dL,N,8.9,10.2


In [None]:
patientToDayOfSTC['MM1']

343.0

For example, as shown above, the row we are going to preserve for patient MM1 about his/her lab results about Calcium, we will choose the lab done the closest to 343 but prior to 343, which in this case is row with index 23 (having the date value of 343). That is the row that will go to our result dataframe.

In [None]:
data['ObservationName'].unique()

array(['Phosphate', 'Platelet\xa0Count', 'Alkaline\xa0Phosphatase',
       'Red Blood Cell Distribution Width', 'Calcium',
       'White Blood Cell Count', 'Hematocrit', 'Total Protein',
       'Mean Corpuscular Volume', 'Albumin', 'Neutrophils', 'Lymphocytes',
       'Creatinine', 'Lactate\xa0Dehydrogenase', 'Hemoglobin',
       'Aspartate Aminotransferase', 'Alanine\xa0Aminotransferase\xa0',
       'Beta-2\xa0Microglobulin', 'Total Bilirubin', 'Monoclonal protein',
       'Cytomegalovirus PCR', 'Immunoglobulin\xa0G',
       'C.\xa0difficile PCR', 'Total Cholesterol',
       'Bence\xa0Jones\xa0Protein', 'Immunoglobulin\xa0A',
       'Varicella Zoster\xa0Antibody', 'Immunoglobulin\xa0M',
       'C-Reactive Protein', 'Free light chain ratio', 'Testosterone',
       'Prostate\xa0Specific\xa0Antigen', 'Vitamin\xa0D3',
       'Triglycerides', 'Blasts', 'Thyroid\xa0Stimulating\xa0Hormone',
       'Iron', 'Sedimentation\xa0Rate', 'Troponin I',
       'Parathyroid hormone', 'Ferritin', 'Folat

In [None]:
# the chemicals we think are important for prediction.
# don't think we have estrogen, but estradiol?
important_chemicals = ['Calcium', 'Phosphate', 'Parathyroid hormone', \
                       'Alkaline\xa0Phosphatase', 'Vitamin\xa0D3', \
                       'Estradiol', 'Testosterone', 'Thyroid\xa0Stimulating\xa0Hormone',\
                       'Creatinine', 'C-Reactive Protein', 'Sedimentation\xa0Rate']

In [None]:
result_df = pd.DataFrame(columns = data.columns)

In [None]:
result_df

Unnamed: 0,ID,ObservationId,ObservationName,Panel,DaysFromDx,ObservationValueNumeric,Units,AbnormalFlags,LowerLimit,UpperLimit


In [None]:
progress = 0
for ID, patient_STC_day in patientToDayOfSTC.items():
  
  for chemical in important_chemicals:
    # get this patient's lab results history of this chemical

    patient_data = data[(data['ID'] == ID) & (data['ObservationName'] == chemical)]
    
    if patient_data.shape[0] != 0:
      row_to_be_added = None # the variable that stores the row that will be added to our result dataframe.
      
      # get the row with lowest day after STC date
      # set this to row_to_be_added first so it can be overwritten by
      # the highest day before STC date later. 
      patient_labs_after_STC = patient_data[patient_data['DaysFromDx'] > patient_STC_day]
      if patient_labs_after_STC.shape[0] != 0:
        row_to_be_added = patient_labs_after_STC[patient_labs_after_STC.DaysFromDx == patient_labs_after_STC.DaysFromDx.min()]
      
      # set the row to be the one with highest day (closest) to STC date
      # if there are any such rows.
      patient_labs_before_STC = patient_data[patient_data['DaysFromDx'] <= patient_STC_day]
      if patient_labs_before_STC.shape[0] != 0:
        row_to_be_added = patient_labs_before_STC[patient_labs_before_STC.DaysFromDx == patient_labs_before_STC.DaysFromDx.max()]
      
      # if the row is not None, that means we have found a row with the "closest" date to STC date
      # add this row to result dataframe
      if row_to_be_added is not None:
        result_df = result_df.append(row_to_be_added, ignore_index=True)
      # else:
        # else we know we don't have any data on 

  progress += 1
  if (progress % 10 == 0):
    print(f'Progress bar: {progress / len(patientToDayOfSTC)}')

Progress bar: 0.014245014245014245
Progress bar: 0.02849002849002849
Progress bar: 0.042735042735042736
Progress bar: 0.05698005698005698
Progress bar: 0.07122507122507123
Progress bar: 0.08547008547008547
Progress bar: 0.09971509971509972
Progress bar: 0.11396011396011396
Progress bar: 0.1282051282051282
Progress bar: 0.14245014245014245
Progress bar: 0.15669515669515668
Progress bar: 0.17094017094017094
Progress bar: 0.18518518518518517
Progress bar: 0.19943019943019943
Progress bar: 0.21367521367521367
Progress bar: 0.22792022792022792
Progress bar: 0.24216524216524216
Progress bar: 0.2564102564102564
Progress bar: 0.2706552706552707
Progress bar: 0.2849002849002849
Progress bar: 0.29914529914529914
Progress bar: 0.31339031339031337
Progress bar: 0.32763532763532766
Progress bar: 0.3418803418803419
Progress bar: 0.3561253561253561
Progress bar: 0.37037037037037035
Progress bar: 0.38461538461538464
Progress bar: 0.39886039886039887
Progress bar: 0.4131054131054131
Progress bar: 0.427

In [None]:
result_df

Unnamed: 0,ID,ObservationId,ObservationName,Panel,DaysFromDx,ObservationValueNumeric,Units,AbnormalFlags,LowerLimit,UpperLimit
0,MM1,CA,Calcium,Electrolyte,343,9.1,mg/dL,N,8.9,10.2
1,MM1,P,Phosphate,Electrolyte,343,4.7,mg/dL,H,2.5,4.5
2,MM1,ALK,Alkaline Phosphatase,Liver function,343,93,U/L,N,36,161
3,MM1,VITD3,Vitamin D3,Nutrtion,289,48.7,ng/mL,N,20,100
4,MM1,TEST,Testosterone,Endocrine,294,0.7,ng/mL,L,1.6,5.9
...,...,...,...,...,...,...,...,...,...,...
5254,MM838,VITD3,Vitamin D3,Nutrtion,2316,42.6,ng/mL,N,20,100
5255,MM838,TEST,Testosterone,Endocrine,1628,1.8,ng/mL,N,1.6,5.9
5256,MM838,TSH,Thyroid Stimulating Hormone,Endocrine,712,7.306,uIU/mL,H,0.4,5
5257,MM838,CRE,Creatinine,Kidney function,242,1,mg/dL,N,0.3,1.2


All right, let's now check if our result is right, by looking at the data we extracted for patient MM2!

In [None]:
# our extracted data
result_df[result_df['ID'] == 'MM2']

Unnamed: 0,ID,ObservationId,ObservationName,Panel,DaysFromDx,ObservationValueNumeric,Units,AbnormalFlags,LowerLimit,UpperLimit
7,MM2,CA,Calcium,Electrolyte,229,9.2,mg/dL,N,8.9,10.2
8,MM2,P,Phosphate,Electrolyte,223,4.3,mg/dL,N,2.5,4.5
9,MM2,ALK,Alkaline Phosphatase,Liver function,223,54.0,U/L,N,37.0,159.0
10,MM2,VITD3,Vitamin D3,Nutrtion,167,18.7,ng/mL,L,20.0,100.0
11,MM2,TEST,Testosterone,Endocrine,169,3.2,ng/mL,N,1.6,5.9
12,MM2,CRE,Creatinine,Kidney function,229,1.0,mg/dL,N,0.51,1.18
13,MM2,HSCRP,C-Reactive Protein,Immune,169,9.8,mg/L,N,0.0,10.0


In [None]:
patientToDayOfSTC['MM2']

229.0

In [None]:
# original data
data[(data['ID'] == 'MM2') & (data['ObservationName'] == 'Calcium')].sort_values(by = ['DaysFromDx'])

Unnamed: 0,ID,ObservationId,ObservationName,Panel,DaysFromDx,ObservationValueNumeric,Units,AbnormalFlags,LowerLimit,UpperLimit
1175,MM2,CA,Calcium,Electrolyte,167,9.8,mg/dL,N,8.9,10.2
1231,MM2,CA,Calcium,Electrolyte,170,8.6,mg/dL,L,8.9,10.2
740,MM2,CA,Calcium,Electrolyte,174,9.4,mg/dL,N,8.9,10.2
1128,MM2,CA,Calcium,Electrolyte,177,8.9,mg/dL,N,8.9,10.2
1191,MM2,CA,Calcium,Electrolyte,181,9.4,mg/dL,N,8.9,10.2
805,MM2,CA,Calcium,Electrolyte,184,9.2,mg/dL,N,8.9,10.2
1230,MM2,CA,Calcium,Electrolyte,188,8.8,mg/dL,L,8.9,10.2
907,MM2,CA,Calcium,Electrolyte,191,9.0,mg/dL,N,8.9,10.2
802,MM2,CA,Calcium,Electrolyte,195,8.5,mg/dL,L,8.9,10.2
939,MM2,CA,Calcium,Electrolyte,196,8.1,mg/dL,L,8.9,10.2


In [None]:
data[(data['ID'] == 'MM2') & (data['ObservationName'] == 'Phosphate')].sort_values(by = ['DaysFromDx'])

Unnamed: 0,ID,ObservationId,ObservationName,Panel,DaysFromDx,ObservationValueNumeric,Units,AbnormalFlags,LowerLimit,UpperLimit
977,MM2,P,Phosphate,Electrolyte,167,2.8,mg/dL,N,2.5,4.5
1135,MM2,P,Phosphate,Electrolyte,170,2.9,mg/dL,N,2.5,4.5
1257,MM2,P,Phosphate,Electrolyte,174,3.5,mg/dL,N,2.5,4.5
1027,MM2,P,Phosphate,Electrolyte,177,3.5,mg/dL,N,2.5,4.5
961,MM2,P,Phosphate,Electrolyte,181,3.8,mg/dL,N,2.5,4.5
827,MM2,P,Phosphate,Electrolyte,184,3.6,mg/dL,N,2.5,4.5
1286,MM2,P,Phosphate,Electrolyte,188,4.2,mg/dL,N,2.5,4.5
1050,MM2,P,Phosphate,Electrolyte,191,3.3,mg/dL,N,2.5,4.5
1115,MM2,P,Phosphate,Electrolyte,195,3.8,mg/dL,N,2.5,4.5
1003,MM2,P,Phosphate,Electrolyte,198,2.3,mg/dL,L,2.5,4.5


In [None]:
data[(data['ID'] == 'MM2') & (data['ObservationName'] == 'Parathyroid hormone')].sort_values(by = ['DaysFromDx'])

Unnamed: 0,ID,ObservationId,ObservationName,Panel,DaysFromDx,ObservationValueNumeric,Units,AbnormalFlags,LowerLimit,UpperLimit


Which we can see from the above dataframes that we indeed get the right rows in our `result_df`. There is no data about Parathyroid hormone for MM2 patient, so there isn't a row for that in the result_df.

Let's now save this as a csv file.

In [None]:
result_df.to_csv('Labs_closest_to_SCT.csv', index = False)

# 2. encode demographics data.

In this section we will encode the demographics data in the following fashion:

Race: White:1, Black: 2, Asian: 3, Other or if there is no information for the race of the patient: 4

Sex       M=1, F=0

Age  :     <=69 : 1,       >=70: 0

In [None]:
data = pd.read_csv('Demographics.csv')

In [None]:
data

Unnamed: 0,ID,AgeAtDx,PatientSex,RacialGroup
0,MM1,67,Male,White
1,MM2,61,Male,Black
2,MM3,59,Male,White
3,MM4,68,Female,White
4,MM5,63,Male,White
...,...,...,...,...
826,MM843,70,Male,White
827,MM835,56,Male,White
828,MM836,46,Male,White
829,MM837,59,Male,White


In [None]:
data['RacialGroup'].unique()

array(['White', 'Black', 'Asian', 'Not reported', 'Other'], dtype=object)

In [None]:
data['PatientSex'].unique()

array(['Male', 'Female'], dtype=object)

First I just want to acknowledge that this method of encoding might be not so good, since we are imposing that somehow gender is comparable and male is greater than female (Male = 1, Female = 0). Later we can use one hot encoding or other better encoding mechanism. For now let's convert these strings to numbers first. 

In [None]:
# given a age return its encoding:
# <= 69 -> 1, >=70 -> 0
def encodeAge(age):
  # not good idea to leave a magical number
  # like 69 but this is not software engineering
  # so it is okay..
  return 1 if age <= 69 else 0

# Male -> 1, Female -> 0
def encodeSex(sex):
  return 1 if sex == 'Male' else 0

# White -> 1, Black -> 2, Asian -> 3
# else -> 4
def encodeRace(race):
  if race == 'White':
    return 1
  elif race == 'Black':
    return 2
  elif race == 'Asian':
    return 3
  else:
    return 4

In [None]:
# apply encoding to Age column
data['AgeAtDx'] = data['AgeAtDx'].apply(encodeAge)

In [None]:
data

Unnamed: 0,ID,AgeAtDx,PatientSex,RacialGroup
0,MM1,1,Male,White
1,MM2,1,Male,Black
2,MM3,1,Male,White
3,MM4,1,Female,White
4,MM5,1,Male,White
...,...,...,...,...
826,MM843,0,Male,White
827,MM835,1,Male,White
828,MM836,1,Male,White
829,MM837,1,Male,White


In [None]:
# apply encoding function to sex column
data['PatientSex'] = data['PatientSex'].apply(encodeSex)

In [None]:
# apply encoding function to race column
data['RacialGroup'] = data['RacialGroup'].apply(encodeRace)

In [None]:
data

Unnamed: 0,ID,AgeAtDx,PatientSex,RacialGroup
0,MM1,1,1,1
1,MM2,1,1,2
2,MM3,1,1,1
3,MM4,1,0,1
4,MM5,1,1,1
...,...,...,...,...
826,MM843,0,1,1
827,MM835,1,1,1
828,MM836,1,1,1
829,MM837,1,1,1


As we can see the transformed data looks like what we expected. Now let's save this dataframe.

In [None]:
data.to_csv('Demographics_encoded.csv', index = False)

# 3. prepare medications data

In this section we will prepare the medications data of each patient. By this I mean the 5 medicines: denosumab, pamidronate, zoledronate, calcium, vitamin D. We will have 5 columns denoting these 5 medicines. And the value in the cell will have a 1 if patient has taken this drug 90 days before STC, and 0 otherwise.

Note: if patient didn't take any of the 5 medicines during 90 days before STC, look at the days before 90 days as well. If no data for this patient taking this medicine before STC date in Medications.csv, look at MyelomaTherapy.csv

In [None]:
data = pd.read_csv('MyelomaTherapy.csv')

In [None]:
data.head()

Unnamed: 0,ID,MedTx,Combination,Class,Mechanism,DaysFromDxStart,DaysFromDxStop,Line,Duration,TreatmentPhase
0,MM1,Bortezomib,VRD,Proteosome inhibitor,Proteosome inhibitor,116.0,260.0,1,195.0,Induction
1,MM1,Dexamethasone,VRD,Steroid,Steroid,116.0,260.0,1,195.0,Induction
2,MM1,Lenalidomide,VRD,Immunotherapy,Immunomodulatory,116.0,260.0,1,195.0,Induction
3,MM1,Cyclophosphamide,Not specified,Chemotherapy,Alkylator,311.0,311.0,2,32.0,Induction
4,MM1,Dexamethasone,Not specified,Steroid,Steroid,311.0,299.0,2,32.0,Induction


In [None]:
sorted(data['MedTx'].unique())

['Adriamycin',
 'Alpelisib',
 'Anti-BCMA CAR-T',
 'Anti-CD352',
 'Anti-CD38',
 'Anti-CD45',
 'Anti-DKK1',
 'Anti-FGFR3',
 'Anti-Huluc63',
 'Anti-PD1',
 'Bendamustine',
 'Bortezomib',
 'Cabazitaxel',
 'Carfilzomib',
 'Carmustine',
 'Cisplatin',
 'Cyclophosphamide',
 'Cytarabine',
 'Daratumumab',
 'Decitabine',
 'Dexamethasone',
 'Donor lymphocyte infusion',
 'Doxorubicin',
 'Doxorubicin ',
 'Elotuzumab',
 'Erlotinib',
 'Etoposide',
 'Filanesib',
 'Fludarabine',
 'G-CSF',
 'GSK2857916',
 'Gamma secretase inhibitor',
 'Ibrutinib',
 'Ifosfamide',
 'Interferon',
 'Interferon alpha',
 'Isatuximab',
 'Ixazomib',
 'LGH447',
 'Lenalidomide',
 'Liposomal doxorubicin',
 'Lucatumumab',
 'MDX1338',
 'Melphalan',
 'Methotrexate',
 'Methylprednisolone',
 'Ofatumumab',
 'Paciltaxel',
 'Panobinostat',
 'Pembrolizumab',
 'Plerixafor',
 'Pomalidomide',
 'Prednisone',
 'Ricolinostat',
 'Rituximab',
 'SEA-BCMA',
 'SGN-CD352A',
 'SNS01-T',
 'Stem cell transplant',
 'TTI-621',
 'Thalidomide',
 'Thiotepa',
 '

As we can see, for all the 5 medicines we consider, none of them are in the medication list in MyelomaTherapy.csv. So I guess we only need to look at Medications.csv

In [None]:
data = pd.read_csv('Medications.csv')

In [None]:
data

Unnamed: 0,ID,Note,DaysFromDx,Time,BeginOffset,EndOffset,Term,NegationScore,TermScore
0,MM7,AdmitNote,1961,14:48:00,794,803,Abatacept,,0.485489
1,MM7,SCCA-OutptRecord,42,17:20:00,906,912,Acacia,,0.381300
2,MM55,Nutrition-OutptRecord,1633,11:07:00,834,840,Acacia,,0.406916
3,MM55,Nutrition-OutptRecord,1633,11:07:00,5057,5063,Acacia,,0.490309
4,MM416,Nutrition-OutptRecord,123,10:30:00,386,392,Acacia,,0.350080
...,...,...,...,...,...,...,...,...,...
1303298,MM509,Neurology-InptRecord,84,11:07:00,4465,4475,Zonisamide,,0.995776
1303299,MM455,SCCA-OutptRecord,180,22:21:08,4790,4799,Zopiclone,,0.994309
1303300,MM455,PhysicalTherapy-OutptRecord,186,14:05:00,1729,1738,Zopiclone,,0.975874
1303301,MM455,SCCA-OutptRecord,20,11:59:34,1743,1752,Zopiclone,,0.991613


reminder of the 5 medicines we are looking at: denosumab, pamidronate, zoledronate, calcium, vitamin D

In [None]:
len(data['Term'].unique())

1348

In [None]:
sorted(data['Term'].unique()[1310:])

['Verapamil',
 'Vervain',
 'Vilazodone',
 'Vincristine',
 'Vitamin',
 'Vitamin D supplements',
 'Vitamin supplement',
 'Vitamin supplementation',
 'Voriconazole',
 'Vorinostat',
 'Vortex',
 'Warfarin',
 'Warfarin sodium',
 'White petrolatum',
 'Witch hazel',
 'Xylitol',
 'Zafirlukast',
 'Zaleplon',
 'Zanamivir',
 'Zeaxanthin',
 'Zileuton',
 'Zinc',
 'Zinc acetate',
 'Zinc chloride',
 'Zinc gluconate',
 'Zinc oxide',
 'Zinc sulfate',
 'Zinc supplement',
 'Ziprasidone',
 'Zoledronate',
 'Zoledronic acid',
 'Zolmitriptan',
 'Zolpidem',
 'Zolpidem tartrate',
 'Zomepirac',
 'Zonisamide',
 'Zopiclone',
 'Zymar']

Assume Vitamin D supplements means Vitamin D.

In [None]:
'Vitamin D supplements' in data['Term'].unique()

True

In [None]:
'Calcium' in data['Term'].unique()

True

In [None]:
'Denosumab' in data['Term'].unique()

True

In [None]:
'Pamidronate' in data['Term'].unique()

True

In [None]:
'Zoledronate' in data['Term'].unique()

True

In [None]:
'Dexamethasone' in data['Term'].unique()

True

In [None]:
# also need to include Dexamethasone.
# so there are actually 6 medicines we consider.
medicines = ['Vitamin D supplements', 'Calcium', 'Denosumab', 'Pamidronate', 'Zoledronate', 'Dexamethasone']

In [None]:
# the rows that contain the the ID + 6 other values
# indicating whether the patient took each medicine.
# for example one row might be ['MM1', 1, 0, 0, 0, 0, 0]
rows = []
progress = 0
for ID, STC_Day in patientToDayOfSTC.items():
  # the data of this patient 90 days prior to STC date
  this_patient_data = data[(data['ID'] == ID) & (data['DaysFromDx'] <= STC_Day) & (data['DaysFromDx'] >= STC_Day - 90)]
  # if this dataframe is empty, or that the medications in this 90 days prior
  # to STC date don't contain any of the 6 medicines we want information of
  # we look at the days even prior to 90 days.
  if this_patient_data.shape[0] == 0 or (not (any(item in this_patient_data['Term'].unique() for item in medicines))):
    this_patient_data = data[(data['ID'] == ID) & (data['DaysFromDx'] <= STC_Day)]

  thisRow = [ID] # put ID in this row first

  # for each medicine, if this medicine is in the dataframe
  # add 1 to our row, else add 0
  for medicine in medicines:
    if medicine in this_patient_data['Term'].unique():
      thisRow.append(1)
    else:
      thisRow.append(0)
  # add this row to rows
  rows.append(thisRow)
  progress += 1
  if (progress % 10 == 0):
    print(progress/len(patientToDayOfSTC))

0.014245014245014245
0.02849002849002849
0.042735042735042736
0.05698005698005698
0.07122507122507123
0.08547008547008547
0.09971509971509972
0.11396011396011396
0.1282051282051282
0.14245014245014245
0.15669515669515668
0.17094017094017094
0.18518518518518517
0.19943019943019943
0.21367521367521367
0.22792022792022792
0.24216524216524216
0.2564102564102564
0.2706552706552707
0.2849002849002849
0.29914529914529914
0.31339031339031337
0.32763532763532766
0.3418803418803419
0.3561253561253561
0.37037037037037035
0.38461538461538464
0.39886039886039887
0.4131054131054131
0.42735042735042733
0.4415954415954416
0.45584045584045585
0.4700854700854701
0.4843304843304843
0.4985754985754986
0.5128205128205128
0.5270655270655271
0.5413105413105413
0.5555555555555556
0.5698005698005698
0.584045584045584
0.5982905982905983
0.6125356125356125
0.6267806267806267
0.6410256410256411
0.6552706552706553
0.6695156695156695
0.6837606837606838
0.698005698005698
0.7122507122507122
0.7264957264957265
0.74074

In [None]:
len(rows)

702

In [None]:
# convert rows to a dataframe.
df_medicines = pd.DataFrame(rows, columns = ['ID'] + medicines)  

In [None]:
df_medicines.sum()

ID                       MM1MM2MM4MM5MM6MM7MM8MM10MM12MM13MM15MM953MM17...
Vitamin D supplements                                                    1
Calcium                                                                344
Denosumab                                                                5
Pamidronate                                                             52
Zoledronate                                                              8
Dexamethasone                                                          462
dtype: object

As we can see, majority of people took Dexamethasone, which is expected.

In [None]:
df_medicines

Unnamed: 0,ID,Vitamin D supplements,Calcium,Denosumab,Pamidronate,Zoledronate,Dexamethasone
0,MM1,0,1,0,0,0,1
1,MM2,0,0,0,0,0,1
2,MM4,0,0,0,0,0,1
3,MM5,0,0,0,0,0,0
4,MM6,0,0,0,0,0,1
...,...,...,...,...,...,...,...
697,MM834,0,0,0,0,0,0
698,MM843,0,0,0,0,0,0
699,MM836,0,0,0,0,0,1
700,MM837,0,1,0,0,0,1


In [None]:
# save dataframe to csv.
df_medicines.to_csv('medicines_90_days_before_STC.csv', index = False)

# 4. encode cancer stage

In [None]:
data = pd.read_csv('Stage.csv')

In [None]:
data

Unnamed: 0,ID,StagingSystem,Stage
0,MM2,ISS,III
1,MM3,ISS,I
2,MM4,ISS,III
3,MM5,ISS,I
4,MM6,ISS,I
...,...,...,...
526,MM834,ISS,I
527,MM843,ISS,II
528,MM835,ISS,I
529,MM836,ISS,II


We want to encode stage I as 1, stage II as 2, and stage III as 3

In [None]:
def encodeStage(stage):
  if stage == 'I':
    return 1
  elif stage == 'II':
    return 2
  elif stage == 'III':
    return 3
  else:
    return None

In [None]:
 data['Stage'] = data['Stage'].apply(lambda x: encodeStage(x))

In [None]:
data

Unnamed: 0,ID,StagingSystem,Stage
0,MM2,ISS,3
1,MM3,ISS,1
2,MM4,ISS,3
3,MM5,ISS,1
4,MM6,ISS,1
...,...,...,...
526,MM834,ISS,1
527,MM843,ISS,2
528,MM835,ISS,1
529,MM836,ISS,2


In [None]:
# we only want the ID and stage columns
data = data[['ID', 'Stage']]

In [None]:
data

Unnamed: 0,ID,Stage
0,MM2,3
1,MM3,1
2,MM4,3
3,MM5,1
4,MM6,1
...,...,...
526,MM834,1
527,MM843,2
528,MM835,1
529,MM836,2


In [None]:
# save this to csv.
data.to_csv('Stages_encoded.csv', index = False)

# 5. encode Bony Lesions

In [None]:
data = pd.read_csv('BonyLesions.csv')

In [None]:
data

Unnamed: 0,ID,DxType,DaysFromDx,BonyLesions
0,MM1,MRI,22.0,0
1,MM1,MRI,2064.0,1
2,MM1,MRI,716.0,1
3,MM1,MRI,442.0,>3
4,MM1,MRI,251.0,0
...,...,...,...,...
3800,MM838,MRI,3773.0,0
3801,MM838,MRI,1687.0,0
3802,MM838,MRI,3773.0,0
3803,MM838,MRI,991.0,0


In [None]:
data['BonyLesions'].unique()

array(['0', '1', '>3', '2'], dtype=object)

we decided to encode BonyLesions of '0' to 0, and '1', '2', '>3' to 1

We will first look from the closest prior to STC day, if there is no data there, we will look after STC day as well. 

In [None]:
df_lesions = pd.DataFrame(columns=data.columns)

In [None]:
df_lesions

Unnamed: 0,ID,DxType,DaysFromDx,BonyLesions


In [None]:
for ID, STC_Day in patientToDayOfSTC.items():
  less_than_STC_data = data[(data['ID'] == ID) & (data['DaysFromDx'] <= STC_Day)]
  if less_than_STC_data.shape[0] == 0:
    greater_than_STC_data = data[(data['ID'] == ID) & (data['DaysFromDx'] > STC_Day)]
    if greater_than_STC_data.shape[0] != 0:
      df_lesions = df_lesions.append(greater_than_STC_data.sort_values(by = ['DaysFromDx']).iloc[0:1, :], ignore_index=True)
  else:
    df_lesions = df_lesions.append(less_than_STC_data.sort_values(by = ['DaysFromDx'], ascending = False).iloc[0:1, :], ignore_index=True)

In [None]:
df_lesions

Unnamed: 0,ID,DxType,DaysFromDx,BonyLesions
0,MM1,MRI,297.0,0
1,MM2,MRI,176.0,>3
2,MM4,MRI,411.0,1
3,MM5,MRI,373.0,0
4,MM7,MRI,197.0,>3
...,...,...,...,...
664,MM834,MRI,161.0,>3
665,MM843,MRI,812.0,>3
666,MM836,MRI,754.0,>3
667,MM837,MRI,173.0,>3


let's check if we indeed get the correct result for patient MM1

In [None]:
data[data['ID'] == 'MM1'].sort_values(by = ['DaysFromDx'])

Unnamed: 0,ID,DxType,DaysFromDx,BonyLesions
0,MM1,MRI,22.0,0
4,MM1,MRI,251.0,0
5,MM1,MRI,297.0,0
7,MM1,MRI,428.0,2
3,MM1,MRI,442.0,>3
6,MM1,MRI,701.0,>3
2,MM1,MRI,716.0,1
1,MM1,MRI,2064.0,1


In [None]:
patientToDayOfSTC['MM1']

343.0

We 297 is within 343 - 90, so that row should be selected, and indeed that is what's selected by our dataframe.

In [None]:
def encodeLesions(lesionsString):
  if lesionsString == '0':
    return 0
  else:
    return 1

In [None]:
df_lesions

Unnamed: 0,ID,DxType,DaysFromDx,BonyLesions
0,MM1,MRI,297.0,0
1,MM2,MRI,176.0,>3
2,MM4,MRI,411.0,1
3,MM5,MRI,373.0,0
4,MM7,MRI,197.0,>3
...,...,...,...,...
664,MM834,MRI,161.0,>3
665,MM843,MRI,812.0,>3
666,MM836,MRI,754.0,>3
667,MM837,MRI,173.0,>3


In [None]:
df_lesions['BonyLesions'] = df_lesions['BonyLesions'].apply(encodeLesions)

In [None]:
df_lesions

Unnamed: 0,ID,DxType,DaysFromDx,BonyLesions
0,MM1,MRI,297.0,0
1,MM2,MRI,176.0,1
2,MM4,MRI,411.0,1
3,MM5,MRI,373.0,0
4,MM7,MRI,197.0,1
...,...,...,...,...
664,MM834,MRI,161.0,1
665,MM843,MRI,812.0,1
666,MM836,MRI,754.0,1
667,MM837,MRI,173.0,1


In [None]:
# save this to csv.
df_lesions.to_csv('Lesions_encoded.csv', index=False)

# 6. prepare frature data

Ultimately we want to predict whether this patient has bone frature during the 365 days after stem cell transplant. This information is stored in the columns of `BillingCodes.csv`

In [None]:
data = pd.read_csv('BillingCodes.csv')

In [None]:
data

Unnamed: 0,ID,DaysFromDx,DxCodingMethod,DxCode,DxDescription,CCSLevel1Name,CCSLevel2Name,CCSLevel3Name
0,MM1,354,ICD9CM,V10.87,Personal history of malignant neoplasm of thyroid,Neoplasms,Cancer; other primary,Cancer of thyroid [36.]
1,MM1,373,ICD9CM,V10.87,Personal history of malignant neoplasm of thyroid,Neoplasms,Cancer; other primary,Cancer of thyroid [36.]
2,MM1,355,ICD9CM,V10.87,Personal history of malignant neoplasm of thyroid,Neoplasms,Cancer; other primary,Cancer of thyroid [36.]
3,MM1,741,ICD9CM,V12.29,"Personal history of other endocrine, metabolic...",Endocrine; nutritional; and metabolic diseases...,Other nutritional; endocrine; and metabolic di...,Other and unspecified metabolic; nutritional; ...
4,MM1,318,ICD9CM,V14.0,Personal history of allergy to penicillin,Symptoms; signs; and ill-defined conditions an...,Symptoms; signs; and ill-defined conditions,Allergic reactions [253.]
...,...,...,...,...,...,...,...,...
365050,MM838,4899,ICD10CM,C90.00,Multiple myeloma not having achieved remission,Neoplasms,Cancer of lymphatic and hematopoietic tissue,No Value
365051,MM838,2905,ICD9CM,203.02,"Multiple myeloma, in relapse",Neoplasms,Cancer of lymphatic and hematopoietic tissue,Multiple myeloma [40.]
365052,MM838,4594,ICD10CM,C90.00,Multiple myeloma not having achieved remission,Neoplasms,Cancer of lymphatic and hematopoietic tissue,No Value
365053,MM838,2528,ICD9CM,V70.7,Examination of participant in clinical trial,Symptoms; signs; and ill-defined conditions an...,Factors influencing health care,Medical examination/evaluation [256.]


In [None]:
# we think fracture will be one of the word in the string of
# DxDescription, CCSLevel1Name, CCSLevel2Name, CCSLevel3Name
# Let's get those rows where there any of the 4 columns
# contain the word 'fracture'.

# define a function such that given a row, return true if that row contains
# the word fracture in one of its 4 columns
def doesRowContainFracture(row):
  return (isinstance(row['DxDescription'], str) and 'Fracture'.lower() in row['DxDescription'].lower()) or \
         (isinstance(row['CCSLevel1Name'], str) and 'Fracture'.lower() in row['CCSLevel1Name'].lower()) or \
         (isinstance(row['CCSLevel2Name'], str) and 'Fracture'.lower() in row['CCSLevel2Name'].lower()) or \
         (isinstance(row['CCSLevel3Name'], str) and 'Fracture'.lower() in row['CCSLevel3Name'].lower())

In [None]:
doesRowContainFracture(data.iloc[0])

False

As we can see, row 0 does indeed not have fracture in 1 of the 4 columns. function seems be doing what it is supposed to.

In [None]:
# get rows with word fracture in them.
data_fracture = data[data.apply(doesRowContainFracture, axis = 1)]

In [None]:
num_patients_fracture = len(data_fracture['ID'].unique())
print(f'there are {num_patients_fracture} number of patients with fracture')

there are 383 number of patients with fracture


In [None]:
data_fracture

Unnamed: 0,ID,DaysFromDx,DxCodingMethod,DxCode,DxDescription,CCSLevel1Name,CCSLevel2Name,CCSLevel3Name
469,MM2,-5558,ICD9CM,807.00,"Closed fracture of rib(s), unspecified",Injury and poisoning,Fractures,Other fractures [231.]
656,MM3,101,ICD10CM,M84.48XD,"Pathological fracture, other site, subsequent ...",Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
657,MM3,194,ICD10CM,M84.48XD,"Pathological fracture, other site, subsequent ...",Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
658,MM3,0,ICD10CM,M84.58XA,"Pathological fracture in neoplastic disease, o...",Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
740,MM4,433,ICD10CM,Z87.81,Personal history of (healed) traumatic fracture,Injury and poisoning,Other injuries and conditions due to external ...,No Value
...,...,...,...,...,...,...,...,...
360199,MM843,190,ICD9CM,805.2,Closed fracture of dorsal [thoracic] vertebra ...,Injury and poisoning,Fractures,Other fractures [231.]
360421,MM843,324,ICD9CM,V54.17,Aftercare for healing traumatic fracture of ve...,Injury and poisoning,Fractures,Other fractures [231.]
360447,MM843,239,ICD9CM,V54.17,Aftercare for healing traumatic fracture of ve...,Injury and poisoning,Fractures,Other fractures [231.]
360478,MM843,415,ICD9CM,V54.17,Aftercare for healing traumatic fracture of ve...,Injury and poisoning,Fractures,Other fractures [231.]


In [None]:
data_fracture['DxDescription'].unique()

array(['Closed fracture of rib(s), unspecified',
       'Pathological fracture, other site, subsequent encounter for fracture with routine healing',
       'Pathological fracture in neoplastic disease, other specified site, initial encounter for fracture',
       'Personal history of (healed) traumatic fracture',
       'Pathologic fracture of vertebrae',
       'Pathologic fracture of other specified site',
       'Closed fracture of T7-T12 level with complete lesion of cord',
       'Age-related osteoporosis without current pathological fracture',
       'Personal history of (healed) other pathological fracture',
       'Personal history of pathologic fracture',
       'Collapsed vertebra, not elsewhere classified, site unspecified, subsequent encounter for fracture with routine healing',
       'Pathological fracture, other site, initial encounter for fracture',
       'Unspecified fracture of left femur, initial encounter for closed fracture',
       'Aftercare for healing patholog

In [None]:
# print all rows with description "Encounter for removal of internal fixation device"
data_fracture[data_fracture['DxDescription'].apply(lambda x: 'encounter for removal of internal fixation device' in x.lower())]

Unnamed: 0,ID,DaysFromDx,DxCodingMethod,DxCode,DxDescription,CCSLevel1Name,CCSLevel2Name,CCSLevel3Name
118531,MM206,1214,ICD9CM,V54.01,Encounter for removal of internal fixation device,Injury and poisoning,Fractures,Other fractures [231.]
353751,MM828,-1542,ICD9CM,V54.01,Encounter for removal of internal fixation device,Injury and poisoning,Fractures,Other fractures [231.]


In [None]:
# delete these 2 rows from dataframe
data_fracture = data_fracture[data_fracture['DxDescription'].apply(lambda x: 'encounter for removal of internal fixation device' not in x.lower())]

In [None]:
data_fracture.shape

(2076, 8)

2078 (original dataframe size) - 2 (current dataframe size) = 2076, so we have indeed deleted 2 rows

In [None]:
data_fracture[data_fracture['DxDescription'].apply(lambda x: 'follow-up examination, following treatment of healed fracture' in x.lower())]

Unnamed: 0,ID,DaysFromDx,DxCodingMethod,DxCode,DxDescription,CCSLevel1Name,CCSLevel2Name,CCSLevel3Name
207131,MM396,482,ICD9CM,V67.4,"Follow-up examination, following treatment of ...",Injury and poisoning,Fractures,Other fractures [231.]
359003,MM834,643,ICD9CM,V67.4,"Follow-up examination, following treatment of ...",Injury and poisoning,Fractures,Other fractures [231.]


In [None]:
data_fracture = data_fracture[data_fracture['DxDescription'].apply(lambda x: 'follow-up examination, following treatment of healed fracture' not in x.lower())]

In [None]:
data_fracture.shape

(2074, 8)

In [None]:
data_fracture[data_fracture['DxDescription'].apply(lambda x: 'open wound of tooth (broken) (fractured) (due to trauma), without mention of complication' in x.lower())]

Unnamed: 0,ID,DaysFromDx,DxCodingMethod,DxCode,DxDescription,CCSLevel1Name,CCSLevel2Name,CCSLevel3Name
6235,MM12,276,ICD9CM,873.63,Open wound of tooth (broken) (fractured) (due ...,Injury and poisoning,Open wounds,Open wounds of head; neck; and trunk [235.]
9868,MM20,3031,ICD9CM,873.63,Open wound of tooth (broken) (fractured) (due ...,Injury and poisoning,Open wounds,Open wounds of head; neck; and trunk [235.]
12499,MM22,2722,ICD9CM,873.63,Open wound of tooth (broken) (fractured) (due ...,Injury and poisoning,Open wounds,Open wounds of head; neck; and trunk [235.]
31079,MM55,1318,ICD9CM,873.63,Open wound of tooth (broken) (fractured) (due ...,Injury and poisoning,Open wounds,Open wounds of head; neck; and trunk [235.]
41925,MM867,-1200,ICD9CM,873.63,Open wound of tooth (broken) (fractured) (due ...,Injury and poisoning,Open wounds,Open wounds of head; neck; and trunk [235.]
48116,MM73,2235,ICD9CM,873.63,Open wound of tooth (broken) (fractured) (due ...,Injury and poisoning,Open wounds,Open wounds of head; neck; and trunk [235.]
111845,MM195,1757,ICD9CM,873.63,Open wound of tooth (broken) (fractured) (due ...,Injury and poisoning,Open wounds,Open wounds of head; neck; and trunk [235.]
111981,MM195,1812,ICD9CM,873.63,Open wound of tooth (broken) (fractured) (due ...,Injury and poisoning,Open wounds,Open wounds of head; neck; and trunk [235.]
134776,MM239,197,ICD9CM,873.63,Open wound of tooth (broken) (fractured) (due ...,Injury and poisoning,Open wounds,Open wounds of head; neck; and trunk [235.]
223026,MM434,739,ICD9CM,873.63,Open wound of tooth (broken) (fractured) (due ...,Injury and poisoning,Open wounds,Open wounds of head; neck; and trunk [235.]


In [None]:
data_fracture[data_fracture['DxDescription'].apply(lambda x: 'open wound of tooth (broken) (fractured) (due to trauma), without mention of complication' in x.lower())].shape[0]

17

In [None]:
data_fracture = data_fracture[data_fracture['DxDescription'].apply(lambda x: 'open wound of tooth (broken) (fractured) (due to trauma), without mention of complication' not in x.lower())]

In [None]:
data_fracture.shape

(2057, 8)

In [None]:
data_fracture[data_fracture['DxDescription'].apply(lambda x: 'other aftercare involving internal fixation device' in x.lower())].shape[0]

18

In [None]:
data_fracture = data_fracture[data_fracture['DxDescription'].apply(lambda x: 'other aftercare involving internal fixation device' not in x.lower())]

In [None]:
data_fracture.shape

(2039, 8)

In [None]:
data_fracture[data_fracture['DxDescription'].apply(lambda x: 'other osteoporosis without current pathological fracture' in x.lower())].shape[0]

50

In [None]:
data_fracture = data_fracture[data_fracture['DxDescription'].apply(lambda x: 'other osteoporosis without current pathological fracture' not in x.lower())]

In [None]:
data_fracture.shape

(1989, 8)

In [None]:
data_fracture = data_fracture[data_fracture['DxDescription'].apply(lambda x: 'osteoporosis without current pathological fracture' not in x.lower())]

In [None]:
data_fracture.shape

(1831, 8)

In [None]:
data_fracture[data_fracture['DxDescription'].apply(lambda x: 'tooth fracture' in x.lower())]

Unnamed: 0,ID,DaysFromDx,DxCodingMethod,DxCode,DxDescription,CCSLevel1Name,CCSLevel2Name,CCSLevel3Name


In [None]:
data_fracture[data_fracture['DxDescription'].apply(lambda x: 'tooth' in x.lower())]

Unnamed: 0,ID,DaysFromDx,DxCodingMethod,DxCode,DxDescription,CCSLevel1Name,CCSLevel2Name,CCSLevel3Name
147249,MM274,3577,ICD10CM,S02.5XXA,"Fracture of tooth (traumatic), initial encount...",Injury and poisoning,Fractures,No Value


In [None]:
data_fracture[data_fracture['DxDescription'].apply(lambda x: 'tooth' in x.lower())].iloc[0]['DxDescription']

'Fracture of tooth (traumatic), initial encounter for closed fracture'

Does this count as 'tooth fracture'? I guess it does, so let's delete that.

In [None]:
data_fracture = data_fracture[data_fracture['DxDescription'].apply(lambda x: 'tooth' not in x.lower())]

In [None]:
data_fracture.shape

(1830, 8)

In [None]:
len(data_fracture['ID'].unique())

351

In [None]:
data_fracture[data_fracture['CCSLevel1Name'].apply(lambda x: 'injury and poisoning' in x.lower())]

Unnamed: 0,ID,DaysFromDx,DxCodingMethod,DxCode,DxDescription,CCSLevel1Name,CCSLevel2Name,CCSLevel3Name
469,MM2,-5558,ICD9CM,807.00,"Closed fracture of rib(s), unspecified",Injury and poisoning,Fractures,Other fractures [231.]
740,MM4,433,ICD10CM,Z87.81,Personal history of (healed) traumatic fracture,Injury and poisoning,Other injuries and conditions due to external ...,No Value
741,MM4,428,ICD10CM,Z87.81,Personal history of (healed) traumatic fracture,Injury and poisoning,Other injuries and conditions due to external ...,No Value
742,MM4,427,ICD10CM,Z87.81,Personal history of (healed) traumatic fracture,Injury and poisoning,Other injuries and conditions due to external ...,No Value
2507,MM7,1028,ICD9CM,806.26,Closed fracture of T7-T12 level with complete ...,Injury and poisoning,Spinal cord injury [227.],No Value
...,...,...,...,...,...,...,...,...
360199,MM843,190,ICD9CM,805.2,Closed fracture of dorsal [thoracic] vertebra ...,Injury and poisoning,Fractures,Other fractures [231.]
360421,MM843,324,ICD9CM,V54.17,Aftercare for healing traumatic fracture of ve...,Injury and poisoning,Fractures,Other fractures [231.]
360447,MM843,239,ICD9CM,V54.17,Aftercare for healing traumatic fracture of ve...,Injury and poisoning,Fractures,Other fractures [231.]
360478,MM843,415,ICD9CM,V54.17,Aftercare for healing traumatic fracture of ve...,Injury and poisoning,Fractures,Other fractures [231.]


In [None]:
for description in data_fracture[data_fracture['CCSLevel1Name'].apply(lambda x: 'injury and poisoning' in x.lower())]['DxDescription'].unique():
  print(description)

Closed fracture of rib(s), unspecified
Personal history of (healed) traumatic fracture
Closed fracture of T7-T12 level with complete lesion of cord
Collapsed vertebra, not elsewhere classified, site unspecified, subsequent encounter for fracture with routine healing
Unspecified fracture of left femur, initial encounter for closed fracture
Aftercare for healing pathologic fracture of vertebrae
Closed fracture of dorsal [thoracic] vertebra without mention of spinal cord injury
Closed fracture of unspecified part of neck of femur
Closed fracture of two ribs
Closed fracture of lumbar vertebra without mention of spinal cord injury
Other displaced fracture of upper end of left humerus, initial encounter for closed fracture
Unspecified fracture of shaft of humerus, left arm, initial encounter for closed fracture
Collapsed vertebra, not elsewhere classified, thoracic region, initial encounter for fracture
Aftercare for healing traumatic fracture of lower arm
Aftercare for healing pathologic fr

Should we delete the rows with `'injury and poisoning'` as the CCSLevel1 description? It makes sense to delete them because we don't want to count the fracture from personal injuries, we want the ones that arises from cancer. **Consult Emisa later**. Based on my decision, I think we should delete those rows.

In [None]:
data_fracture = data_fracture[data_fracture['CCSLevel1Name'].apply(lambda x: 'injury and poisoning' not in x.lower())]

In [None]:
data_fracture.shape

(1113, 8)

Now we have deleted all the fractures that are not pathological fractures, let's make a dictionary that maps a patient's ID to whether he/she had fracture during 1 year after STC treatment.

In [None]:
data_fracture[data_fracture['ID'] == 'MM852']

Unnamed: 0,ID,DaysFromDx,DxCodingMethod,DxCode,DxDescription,CCSLevel1Name,CCSLevel2Name,CCSLevel3Name
37558,MM852,882,ICD9CM,733.13,Pathologic fracture of vertebrae,Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value


In [None]:
fractureList = [] # list of mappings from ID to 0 or 1 based on if this patient has fracture
for ID in patientToDayOfSTC.keys(): # for every patient who took STC:
  if ID not in data_fracture['ID'].unique(): # if they are not in the dataframe
    # it means they don't have fracture after STC. put 0 there
    fractureList.append([ID, 0])
  else:
    # get their dataframe
    patient_df = data_fracture[data_fracture['ID'] == ID]
    # if that dataframe contain any row that has date during 1 year after STC,
    # put 1 there
    if patient_df[(patient_df['DaysFromDx'] >= patientToDayOfSTC[ID]) & (patient_df['DaysFromDx'] <= patientToDayOfSTC[ID] + 365)].shape[0] != 0:
      fractureList.append([ID, 1])
    else:
      # means patient does have fracture but not within a year after STC, put 0.
      fractureList.append([ID, 0])

In [None]:
# make a dataframe out of list:
df_result = pd.DataFrame(data = fractureList, columns= ['ID', 'HasFracture?'])

In [None]:
df_result

Unnamed: 0,ID,HasFracture?
0,MM1,0
1,MM2,0
2,MM4,0
3,MM5,0
4,MM6,0
...,...,...
697,MM834,0
698,MM843,0
699,MM836,0
700,MM837,0


In [None]:
df_result['HasFracture?'].sum()

104

So we have 104 patients out of 700 with fracture within 1 year after STC it seems like

Should verify if our results are correct by checking if our results match the original dataset of data_fracture

In [None]:
df_result[df_result['HasFracture?'] == 1]

Unnamed: 0,ID,HasFracture?
5,MM7,1
6,MM8,1
10,MM15,1
31,MM39,1
46,MM53,1
...,...,...
665,MM788,1
667,MM914,1
671,MM805,1
691,MM828,1


In [None]:
data_fracture[data_fracture['ID'] == 'MM7'].sort_values(by = ['DaysFromDx'])

Unnamed: 0,ID,DaysFromDx,DxCodingMethod,DxCode,DxDescription,CCSLevel1Name,CCSLevel2Name,CCSLevel3Name
2283,MM7,5,ICD9CM,733.13,Pathologic fracture of vertebrae,Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
2293,MM7,6,ICD9CM,733.13,Pathologic fracture of vertebrae,Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
2282,MM7,7,ICD9CM,733.13,Pathologic fracture of vertebrae,Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
2286,MM7,8,ICD9CM,733.13,Pathologic fracture of vertebrae,Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
2287,MM7,26,ICD9CM,733.13,Pathologic fracture of vertebrae,Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
2284,MM7,153,ICD9CM,733.13,Pathologic fracture of vertebrae,Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
2290,MM7,200,ICD9CM,733.13,Pathologic fracture of vertebrae,Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
2292,MM7,201,ICD9CM,733.13,Pathologic fracture of vertebrae,Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
2288,MM7,209,ICD9CM,733.13,Pathologic fracture of vertebrae,Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
2294,MM7,210,ICD9CM,733.13,Pathologic fracture of vertebrae,Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value


In [None]:
patientToDayOfSTC['MM7']

203.0

We see that MM7 does indeed has a fracture 1 year after day 203, which our result includes a 1 for. seems like our process is correct.

In [None]:
# save dataframe to csv.
df_result.to_csv('Fractures.csv', index = False)

# 7. put every piece of data of a patient together

Now that we have processed and prepared individual piece of information of each patient. (lesions, demographics, fracture) Let's put them into a global dataframe 

In [None]:
ls

BillingCodes.csv          Labs.csv                          Signs.csv
BonyLesions.csv           Lesions_encoded.csv               Stage.csv
Demographics.csv          Medications.csv                   Stages_encoded.csv
Demographics_encoded.csv  medicines_90_days_before_STC.csv  SurvivalDays.csv
Diagnoses.csv             MyelomaTherapy.csv                Symptoms.csv
Fractures.csv             PlasmaCells.csv
Labs_closest_to_SCT.csv   RadiationTherapy.csv


In [83]:
data_labs = pd.read_csv('Labs_closest_to_SCT.csv')
data_demographics = pd.read_csv('Demographics_encoded.csv')
data_medications = pd.read_csv('medicines_90_days_before_STC.csv')
data_cancer_stage = pd.read_csv('Stages_encoded.csv')
data_bony_lesions = pd.read_csv('Lesions_encoded.csv')
data_fracture = pd.read_csv('Fractures.csv')

In [84]:
data_labs

Unnamed: 0,ID,ObservationId,ObservationName,Panel,DaysFromDx,ObservationValueNumeric,Units,AbnormalFlags,LowerLimit,UpperLimit
0,MM1,CA,Calcium,Electrolyte,343,9.100,mg/dL,N,8.9,10.2
1,MM1,P,Phosphate,Electrolyte,343,4.700,mg/dL,H,2.5,4.5
2,MM1,ALK,Alkaline Phosphatase,Liver function,343,93.000,U/L,N,36.0,161.0
3,MM1,VITD3,Vitamin D3,Nutrtion,289,48.700,ng/mL,N,20.0,100.0
4,MM1,TEST,Testosterone,Endocrine,294,0.700,ng/mL,L,1.6,5.9
...,...,...,...,...,...,...,...,...,...,...
5254,MM838,VITD3,Vitamin D3,Nutrtion,2316,42.600,ng/mL,N,20.0,100.0
5255,MM838,TEST,Testosterone,Endocrine,1628,1.800,ng/mL,N,1.6,5.9
5256,MM838,TSH,Thyroid Stimulating Hormone,Endocrine,712,7.306,uIU/mL,H,0.4,5.0
5257,MM838,CRE,Creatinine,Kidney function,242,1.000,mg/dL,N,0.3,1.2


In [85]:
data_demographics

Unnamed: 0,ID,AgeAtDx,PatientSex,RacialGroup
0,MM1,1,1,1
1,MM2,1,1,2
2,MM3,1,1,1
3,MM4,1,0,1
4,MM5,1,1,1
...,...,...,...,...
826,MM843,0,1,1
827,MM835,1,1,1
828,MM836,1,1,1
829,MM837,1,1,1


In [86]:
data_medications

Unnamed: 0,ID,Vitamin D supplements,Calcium,Denosumab,Pamidronate,Zoledronate,Dexamethasone
0,MM1,0,1,0,0,0,1
1,MM2,0,0,0,0,0,1
2,MM4,0,0,0,0,0,1
3,MM5,0,0,0,0,0,0
4,MM6,0,0,0,0,0,1
...,...,...,...,...,...,...,...
697,MM834,0,0,0,0,0,0
698,MM843,0,0,0,0,0,0
699,MM836,0,0,0,0,0,1
700,MM837,0,1,0,0,0,1


In [87]:
data_cancer_stage

Unnamed: 0,ID,Stage
0,MM2,3
1,MM3,1
2,MM4,3
3,MM5,1
4,MM6,1
...,...,...
526,MM834,1
527,MM843,2
528,MM835,1
529,MM836,2


In [88]:
data_bony_lesions

Unnamed: 0,ID,DxType,DaysFromDx,BonyLesions
0,MM1,MRI,297.0,0
1,MM2,MRI,176.0,1
2,MM4,MRI,411.0,1
3,MM5,MRI,373.0,0
4,MM7,MRI,197.0,1
...,...,...,...,...
664,MM834,MRI,161.0,1
665,MM843,MRI,812.0,1
666,MM836,MRI,754.0,1
667,MM837,MRI,173.0,1


In [89]:
data_fracture

Unnamed: 0,ID,HasFracture?
0,MM1,0
1,MM2,0
2,MM4,0
3,MM5,0
4,MM6,0
...,...,...
697,MM834,0
698,MM843,0
699,MM836,0
700,MM837,0


In [90]:
# combine fracture data with lesions data
data_agg = data_fracture.merge(data_bony_lesions, on = 'ID', how='outer')

In [91]:
# combine with cancer stage data
data_agg = data_agg.merge(data_cancer_stage, on = 'ID', how='outer')

In [92]:
# combine with medications
data_agg = data_agg.merge(data_medications, on = 'ID', how = 'outer')

In [93]:
# combine with demographics data
data_agg = data_agg.merge(data_demographics, on = 'ID', how = 'outer')

In [94]:
data_agg

Unnamed: 0,ID,HasFracture?,DxType,DaysFromDx,BonyLesions,Stage,Vitamin D supplements,Calcium,Denosumab,Pamidronate,Zoledronate,Dexamethasone,AgeAtDx,PatientSex,RacialGroup
0,MM1,0.0,MRI,297.0,0.0,,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
1,MM2,0.0,MRI,176.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,2.0
2,MM4,0.0,MRI,411.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0
3,MM5,0.0,MRI,373.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
4,MM6,0.0,,,,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
827,MM700,,,,,,,,,,,,1.0,1.0,1.0
828,MM711,,,,,,,,,,,,0.0,1.0,1.0
829,MM820,,,,,,,,,,,,0.0,0.0,1.0
830,MM956,,,,,,,,,,,,1.0,0.0,2.0


It turns out before we combine it with lab data, we need to process the lab data somehow. We need to flag if this chemical level is normal/abnormal on this patient based on the lab sheet.

In [24]:
data_labs[data_labs['ObservationName'] == 'Parathyroid hormone']

Unnamed: 0,ID,ObservationId,ObservationName,Panel,DaysFromDx,ObservationValueNumeric,Units,AbnormalFlags,LowerLimit,UpperLimit
39,MM7,IPTH,Parathyroid hormone,Endocrine,404,47.0,pg/mL,N,12.0,88.0
103,MM18,IPTH,Parathyroid hormone,Endocrine,1341,142.0,pg/mL,H,12.0,88.0
157,MM26,IPTH,Parathyroid hormone,Endocrine,1173,17.0,pg/mL,N,12.0,88.0
167,MM844,IPTH,Parathyroid hormone,Endocrine,3508,1.0,pg/mL,L,12.0,88.0
184,MM29,IPTH,Parathyroid hormone,Endocrine,3802,17.0,pg/mL,N,12.0,88.0
...,...,...,...,...,...,...,...,...,...,...
4675,MM715,IPTH,Parathyroid hormone,Endocrine,2266,28.0,pg/mL,N,12.0,88.0
4825,MM738,IPTH,Parathyroid hormone,Endocrine,2,39.0,pg/mL,N,12.0,88.0
5092,MM814,IPTH,Parathyroid hormone,Endocrine,674,197.0,pg/mL,H,12.0,88.0
5138,MM855,IPTH,Parathyroid hormone,Endocrine,791,249.0,pg/mL,H,12.0,88.0


can take out the rows with parathyroid hormone because less than 10% (62 / 700) of people took this lab test. Too many nulls

In [95]:
data_labs = data_labs[data_labs['ObservationName'] != 'Parathyroid hormone']

In [96]:
important_chemicals = ['Calcium', 'Phosphate', 'Parathyroid hormone', \
                       'Alkaline\xa0Phosphatase', 'Vitamin\xa0D3', \
                       'Estradiol', 'Testosterone', 'Thyroid\xa0Stimulating\xa0Hormone',\
                       'Creatinine', 'C-Reactive Protein', 'Sedimentation\xa0Rate']

Calcium: N: normal (0), H, L -> abnormal (1) \\
phosphate: N: normal (0), H, L -> abnormal \\
Parathyroid hormone: not in lab sheet \\
Alkaline Phosphatase: L, N -> 0 normal, H -> 1 abnormal \\
Vitamin D3: H, N -> 0 normal, L -> 1 abnormal \\
Estradiol: H, N -> 0, L -> abnormal \\
testoterone: H, N -> 0, L -> 1 \\
thyroid stimulating hormone: N -> 0, H, L -> 1 \\
creatinine: L, N -> 0, H -> 1 \\
c-reactive protein: L, N -> 0, H -> 1 \\
sedimentation rate: L, N -> 0, H -> 1 \\

In [97]:
lab_interpretation = {}
lab_interpretation['Calcium'] = {'N':0, 'H': 1, 'L': 1}
lab_interpretation['Phosphate'] = {'N':0, 'H': 1, 'L': 1}
lab_interpretation['Alkaline\xa0Phosphatase'] = {'N':0, 'H': 1, 'L': 0}
lab_interpretation['Vitamin\xa0D3'] = {'N':0, 'H': 1, 'L': 0}
lab_interpretation['Estradiol'] = {'N':0, 'H': 0, 'L': 1}
lab_interpretation['Testosterone'] = {'N':0, 'H': 0, 'L': 1}
lab_interpretation['Thyroid\xa0Stimulating\xa0Hormone'] = {'N':0, 'H': 1, 'L': 1}
lab_interpretation['Creatinine'] = {'N':0, 'H': 1, 'L': 0}
lab_interpretation['C-Reactive Protein'] = {'N':0, 'H': 1, 'L': 0}
lab_interpretation['Sedimentation\xa0Rate'] = {'N':0, 'H': 1, 'L': 0}


In [98]:
# transform lab results flags to match lab interpretation.
data_labs['Abnormal?'] = data_labs.apply(lambda row: lab_interpretation[row['ObservationName']][row['AbnormalFlags']], axis = 1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [99]:
data_labs

Unnamed: 0,ID,ObservationId,ObservationName,Panel,DaysFromDx,ObservationValueNumeric,Units,AbnormalFlags,LowerLimit,UpperLimit,Abnormal?
0,MM1,CA,Calcium,Electrolyte,343,9.100,mg/dL,N,8.9,10.2,0
1,MM1,P,Phosphate,Electrolyte,343,4.700,mg/dL,H,2.5,4.5,1
2,MM1,ALK,Alkaline Phosphatase,Liver function,343,93.000,U/L,N,36.0,161.0,0
3,MM1,VITD3,Vitamin D3,Nutrtion,289,48.700,ng/mL,N,20.0,100.0,0
4,MM1,TEST,Testosterone,Endocrine,294,0.700,ng/mL,L,1.6,5.9,1
...,...,...,...,...,...,...,...,...,...,...,...
5254,MM838,VITD3,Vitamin D3,Nutrtion,2316,42.600,ng/mL,N,20.0,100.0,0
5255,MM838,TEST,Testosterone,Endocrine,1628,1.800,ng/mL,N,1.6,5.9,0
5256,MM838,TSH,Thyroid Stimulating Hormone,Endocrine,712,7.306,uIU/mL,H,0.4,5.0,1
5257,MM838,CRE,Creatinine,Kidney function,242,1.000,mg/dL,N,0.3,1.2,0


Now we need to convert all these rows into columns indicating whether each patient has abnormal level of each chemical.

In [101]:
list_of_lab_results = []
for ID in data_labs['ID'].unique():
  patient_labs = data_labs[data_labs['ID'] == ID]
  thisRow = [ID]
  for chemical in important_chemicals:
    if chemical in patient_labs['ObservationName'].unique():
      patient_labs_this_chemical = patient_labs[patient_labs['ObservationName'] == chemical]
      if patient_labs_this_chemical.shape[0] == 1:
        thisRow.append(patient_labs_this_chemical['Abnormal?'].values[0])
      else:
        if 1 in patient_labs_this_chemical['Abnormal?']:
          thisRow.append(1)
        else:
          thisRow.append(0)
    else:
      thisRow.append(None)
  list_of_lab_results.append(thisRow)

In [102]:
data_labs_new = pd.DataFrame(data = list_of_lab_results, columns= ['ID'] + important_chemicals)

In [103]:
data_labs_new.drop(['Parathyroid hormone'], axis = 1, inplace = True)

In [104]:
data_labs_new

Unnamed: 0,ID,Calcium,Phosphate,Alkaline Phosphatase,Vitamin D3,Estradiol,Testosterone,Thyroid Stimulating Hormone,Creatinine,C-Reactive Protein,Sedimentation Rate
0,MM1,0.0,1.0,0.0,0.0,,1.0,,1.0,1.0,
1,MM2,0.0,0.0,0.0,0.0,,0.0,,0.0,0.0,
2,MM4,0.0,0.0,0.0,0.0,,,0.0,1.0,0.0,
3,MM5,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,
4,MM6,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...
695,MM834,0.0,0.0,0.0,,,0.0,1.0,0.0,1.0,
696,MM843,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,1.0
697,MM836,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,1.0,
698,MM837,0.0,0.0,0.0,0.0,,1.0,,0.0,0.0,


In [105]:
data_labs_new.isnull().sum()

ID                               0
Calcium                          1
Phosphate                        4
Alkaline Phosphatase             1
Vitamin D3                     135
Estradiol                      550
Testosterone                   387
Thyroid Stimulating Hormone    221
Creatinine                       1
C-Reactive Protein              21
Sedimentation Rate             599
dtype: int64

might need to remove nulls later, but now let's combine them with our aggregated dataframe

In [106]:
# combine lab data with everything else
data_agg = data_agg.merge(data_labs_new, on = 'ID', how = 'outer')

In [107]:
data_agg

Unnamed: 0,ID,HasFracture?,DxType,DaysFromDx,BonyLesions,Stage,Vitamin D supplements,Calcium_x,Denosumab,Pamidronate,Zoledronate,Dexamethasone,AgeAtDx,PatientSex,RacialGroup,Calcium_y,Phosphate,Alkaline Phosphatase,Vitamin D3,Estradiol,Testosterone,Thyroid Stimulating Hormone,Creatinine,C-Reactive Protein,Sedimentation Rate
0,MM1,0.0,MRI,297.0,0.0,,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,,1.0,,1.0,1.0,
1,MM2,0.0,MRI,176.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,2.0,0.0,0.0,0.0,0.0,,0.0,,0.0,0.0,
2,MM4,0.0,MRI,411.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,,,0.0,1.0,0.0,
3,MM5,0.0,MRI,373.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,
4,MM6,0.0,,,,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
827,MM700,,,,,,,,,,,,1.0,1.0,1.0,,,,,,,,,,
828,MM711,,,,,,,,,,,,0.0,1.0,1.0,,,,,,,,,,
829,MM820,,,,,,,,,,,,0.0,0.0,1.0,,,,,,,,,,
830,MM956,,,,,,,,,,,,1.0,0.0,2.0,,,,,,,,,,


In [108]:
# delete the rows NaN HasFracture? values since that is what we are trying to predict
data_agg = data_agg.dropna(subset = ['HasFracture?'])

In [109]:
data_agg

Unnamed: 0,ID,HasFracture?,DxType,DaysFromDx,BonyLesions,Stage,Vitamin D supplements,Calcium_x,Denosumab,Pamidronate,Zoledronate,Dexamethasone,AgeAtDx,PatientSex,RacialGroup,Calcium_y,Phosphate,Alkaline Phosphatase,Vitamin D3,Estradiol,Testosterone,Thyroid Stimulating Hormone,Creatinine,C-Reactive Protein,Sedimentation Rate
0,MM1,0.0,MRI,297.0,0.0,,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,,1.0,,1.0,1.0,
1,MM2,0.0,MRI,176.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,2.0,0.0,0.0,0.0,0.0,,0.0,,0.0,0.0,
2,MM4,0.0,MRI,411.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,,,0.0,1.0,0.0,
3,MM5,0.0,MRI,373.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,
4,MM6,0.0,,,,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
697,MM834,0.0,MRI,161.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,,,0.0,1.0,0.0,1.0,
698,MM843,0.0,MRI,812.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,1.0
699,MM836,0.0,MRI,754.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,1.0,
700,MM837,0.0,MRI,173.0,1.0,,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,,1.0,,0.0,0.0,


In [110]:
# drop some columns we are not going to use in prediction
data_agg.drop(['DxType', 'DaysFromDx'], axis = 1, inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [111]:
data_agg.head()

Unnamed: 0,ID,HasFracture?,BonyLesions,Stage,Vitamin D supplements,Calcium_x,Denosumab,Pamidronate,Zoledronate,Dexamethasone,AgeAtDx,PatientSex,RacialGroup,Calcium_y,Phosphate,Alkaline Phosphatase,Vitamin D3,Estradiol,Testosterone,Thyroid Stimulating Hormone,Creatinine,C-Reactive Protein,Sedimentation Rate
0,MM1,0.0,0.0,,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,,1.0,,1.0,1.0,
1,MM2,0.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,2.0,0.0,0.0,0.0,0.0,,0.0,,0.0,0.0,
2,MM4,0.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,,,0.0,1.0,0.0,
3,MM5,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,
4,MM6,0.0,,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,


Let's save this file to csv.

In [112]:
data_agg.to_csv('data_agg.csv', index = False)