
In this notebook we try to increase the number of positive examples (patients with a fracture) in our dataset. This does not add rows; it only changes some labels from negative to positive. We can do this because some patients have more than one stem cell transplant (STC) day. For example, if patient A has STC days 100 and 1000 and has no fracture between days 100 and 465 but does have one on day 1100, we now count that patient as having a fracture, whereas before we did not.
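
The labeling change described above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the function names are hypothetical, and it assumes the earlier rule looked only at a patient's first STC day with a 365-day follow-up window.

```python
WINDOW_DAYS = 365  # follow-up window after a transplant, as in the example above


def has_fracture_after(stc_day, fracture_days, window=WINDOW_DAYS):
    # True if any fracture falls in [stc_day, stc_day + window]
    return any(stc_day <= day <= stc_day + window for day in fracture_days)


def label_first_transplant_only(stc_days, fracture_days):
    # assumed earlier rule: only the first transplant counts
    return int(has_fracture_after(min(stc_days), fracture_days))


def label_any_transplant(stc_days, fracture_days):
    # rule used in this notebook: any transplant can produce a positive label
    return int(any(has_fracture_after(day, fracture_days) for day in stc_days))


stc_days = {100, 1000}
fracture_days = [1100]
old_label = label_first_transplant_only(stc_days, fracture_days)
new_label = label_any_transplant(stc_days, fracture_days)
```

For patient A the first rule gives 0 and the second gives 1, which is exactly the kind of label that flips in this notebook.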

# 0. go to data directory + load libraries

In [1]:
cd drive/My\ Drive/fracture_with_emissa/Datasets/Raw\ Data

/content/drive/My Drive/fracture_with_emissa/Datasets/Raw Data


In [2]:
ls

BillingCodes.csv           Labs_encoded.csv
BonyLesions.csv            Lesions_encoded.csv
data_agg_2.csv             Medications.csv
data_agg_3.csv             medicines_90_days_before_STC.csv
data_agg.csv               MyelomaTherapy.csv
data_for_ML_back_fill.csv  PlasmaCells.csv
Demographics.csv           RadiationTherapy.csv
Demographics_encoded.csv   Signs.csv
Diagnoses.csv              Stage.csv
Final_stage.csv            Stages_encoded.csv
Fractures.csv              STC_Days.csv
Labs2.csv                  STC_days.ditionary
Labs_closest_to_SCT.csv    SurvivalDays.csv
Labs.csv                   Symptoms.csv


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [4]:
import pickle

# 1. Get new Stem cell treatment days of everyone (including multiple STC dates if the patient took STC multiple times)

In [None]:
data = pd.read_csv('MyelomaTherapy.csv')

In [None]:
data

Unnamed: 0,ID,MedTx,Combination,Class,Mechanism,DaysFromDxStart,DaysFromDxStop,Line,Duration,TreatmentPhase
0,MM1,Bortezomib,VRD,Proteosome inhibitor,Proteosome inhibitor,116.0,260.0,1,195.0,Induction
1,MM1,Dexamethasone,VRD,Steroid,Steroid,116.0,260.0,1,195.0,Induction
2,MM1,Lenalidomide,VRD,Immunotherapy,Immunomodulatory,116.0,260.0,1,195.0,Induction
3,MM1,Cyclophosphamide,Not specified,Chemotherapy,Alkylator,311.0,311.0,2,32.0,Induction
4,MM1,Dexamethasone,Not specified,Steroid,Steroid,311.0,299.0,2,32.0,Induction
...,...,...,...,...,...,...,...,...,...,...
9939,MM838,Dexamethasone,Not specified,Steroid,Steroid,4881.0,4887.0,22,20.0,Relapse
9940,MM838,Cyclophosphamide,Not specified,Chemotherapy,Alkylator,4881.0,4887.0,22,20.0,Relapse
9941,MM838,Fludarabine,Not specified,Chemotherapy,Antimetabolite,4901.0,4903.0,23,,Relapse
9942,MM838,Cyclophosphamide,Not specified,Chemotherapy,Alkylator,4901.0,4903.0,23,,Relapse


In [None]:
stem_cell_data = data[data['MedTx'] == 'Stem cell transplant']

In [None]:
# the dictionary stores mapping from a
# id of a patient to a set of his/her
# STC days. Example: 'MM1' : {365, 444}, 'MM2' : {444}
idToDays = dict()

for i in range(stem_cell_data.shape[0]):
  thisRow = stem_cell_data.iloc[i]
  ID = thisRow['ID']
  day = thisRow['DaysFromDxStart']
  if ID not in idToDays:
    idToDays[ID] = set()
  idToDays[ID].add(day)
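
The same ID-to-days mapping can also be built with a `groupby`. A minimal sketch on toy data, assuming the column names `ID`, `MedTx`, and `DaysFromDxStart` shown above; the `toy_` variables are made up for illustration.

```python
import pandas as pd

# toy stand-in for MyelomaTherapy.csv (values are illustrative only)
toy_therapy = pd.DataFrame({
    'ID': ['MM1', 'MM7', 'MM7', 'MM7'],
    'MedTx': ['Stem cell transplant', 'Stem cell transplant',
              'Dexamethasone', 'Stem cell transplant'],
    'DaysFromDxStart': [343.0, 203.0, 210.0, 259.0],
})

toy_stc = toy_therapy[toy_therapy['MedTx'] == 'Stem cell transplant']

# collect each patient's transplant days into a set
toy_id_to_days = toy_stc.groupby('ID')['DaysFromDxStart'].apply(set).to_dict()
```

This avoids indexing rows one at a time with `iloc`, which matters little at this size but keeps the intent in one line.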

In [None]:
# print out the patients with multiple STC treatments.
for patient_id in idToDays:  # avoid shadowing the built-in id()
  if len(idToDays[patient_id]) > 1:
    print(patient_id)

MM7
MM10
MM953
MM18
MM20
MM24
MM844
MM29
MM35
MM45
MM862
MM942
MM955
MM52
MM852
MM934
MM65
MM867
MM859
MM868
MM68
MM69
MM70
MM72
MM73
MM950
MM946
MM924
MM77
MM79
MM80
MM943
MM81
MM971
MM82
MM871
MM87
MM88
MM92
MM93
MM857
MM96
MM873
MM969
MM98
MM877
MM102
MM103
MM880
MM883
MM108
MM112
MM130
MM132
MM133
MM954
MM136
MM142
MM927
MM149
MM153
MM155
MM158
MM159
MM166
MM169
MM170
MM172
MM173
MM179
MM892
MM201
MM202
MM205
MM208
MM210
MM212
MM214
MM215
MM217
MM219
MM221
MM229
MM230
MM237
MM240
MM260
MM894
MM271
MM274
MM277
MM280
MM282
MM283
MM288
MM291
MM922
MM308
MM309
MM320
MM324
MM334
MM339
MM343
MM345
MM350
MM351
MM353
MM357
MM366
MM897
MM374
MM388
MM400
MM408
MM430
MM441
MM453
MM456
MM459
MM474
MM475
MM517
MM525
MM530
MM539
MM545
MM552
MM554
MM565
MM578
MM586
MM588
MM589
MM600
MM608
MM609
MM622
MM639
MM677
MM710
MM714
MM725
MM915
MM810
MM812
MM930
MM836
MM838


So quite a lot of patients had more than one stem cell transplant: the output above lists 149 IDs.

In [None]:
# check in our original dataset that those people do have multiple
# STC.
data[(data['ID'] == 'MM838') & (data['MedTx'] == 'Stem cell transplant')]

Unnamed: 0,ID,MedTx,Combination,Class,Mechanism,DaysFromDxStart,DaysFromDxStop,Line,Duration,TreatmentPhase
9905,MM838,Stem cell transplant,Not specified,Stem Cell Transplant,Stem Cell Transplant,244.0,244.0,3,92.0,Stem cell transplant
9906,MM838,Stem cell transplant,Not specified,Stem Cell Transplant,Stem Cell Transplant,336.0,336.0,4,199.0,Stem cell transplant


In [None]:
idToDays['MM838']

{244.0, 336.0}

That looks correct. Now we want to store this Python object (a mapping from patient ID to a set of STC days) on disk so we can use it later. We will use the `pickle` module from the Python standard library to do that.

In [None]:
with open('STC_days.ditionary', 'wb') as dictionary_file:
  pickle.dump(idToDays, dictionary_file)
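
One reason `pickle` is convenient here is that the values are Python sets, which JSON cannot represent directly. A small sketch with a toy dictionary (the variable names are illustrative):

```python
import json
import pickle

toy_id_to_days = {'MM838': {244.0, 336.0}}

# pickle round-trips the dictionary, sets included
restored = pickle.loads(pickle.dumps(toy_id_to_days))

# JSON has no set type, so the same object is rejected
try:
    json.dumps(toy_id_to_days)
    json_accepts_sets = True
except TypeError:
    json_accepts_sets = False
```

Pickle files can execute code when loaded, so they should only be loaded from sources you trust, such as a file this notebook wrote itself.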

# 2. re-prepare fracture data, excluding 'personal history of fracture' and including fractures within 365 days after any STC day

In [5]:
# this is how you load a pickle file into python object:
with open('STC_days.ditionary', 'rb') as dictionary_file:
    idToDays = pickle.load(dictionary_file)
    print(idToDays)

{'MM1': {343.0}, 'MM2': {229.0}, 'MM4': {425.0}, 'MM5': {416.0}, 'MM6': {657.0}, 'MM7': {259.0, 203.0}, 'MM8': {541.0}, 'MM10': {347.0, 171.0}, 'MM12': {318.0}, 'MM13': {165.0}, 'MM15': {3363.0}, 'MM953': {101.0, 150.0}, 'MM17': {183.0}, 'MM18': {1240.0, 1306.0, 236.0}, 'MM19': {432.0}, 'MM20': {3065.0, 3193.0}, 'MM21': {181.0}, 'MM22': {184.0}, 'MM23': {211.0}, 'MM24': {865.0, 923.0}, 'MM26': {1154.0}, 'MM844': {976.0, 818.0}, 'MM28': {2543.0}, 'MM29': {251.0, 3677.0}, 'MM32': {247.0}, 'MM974': {3505.0}, 'MM33': {265.0}, 'MM35': {339.0, 446.0}, 'MM982': {1346.0}, 'MM37': {223.0}, 'MM38': {1902.0}, 'MM39': {273.0}, 'MM40': {157.0}, 'MM41': {1345.0}, 'MM923': {209.0}, 'MM45': {392.0, 1509.0}, 'MM46': {170.0}, 'MM966': {235.0}, 'MM861': {156.0}, 'MM862': {1072.0, 1952.0}, 'MM942': {3067.0, 2486.0, 183.0}, 'MM955': {1220.0, 2268.0}, 'MM47': {253.0}, 'MM51': {3112.0}, 'MM863': {2370.0}, 'MM52': {209.0, 2083.0}, 'MM53': {4191.0}, 'MM54': {553.0}, 'MM55': {369.0}, 'MM56': {3421.0}, 'MM57': {

In [6]:
idToDays

{'MM1': {343.0},
 'MM10': {171.0, 347.0},
 'MM102': {234.0, 309.0},
 'MM103': {673.0, 1454.0, 1532.0},
 'MM104': {162.0},
 'MM105': {189.0},
 'MM106': {243.0},
 'MM107': {426.0},
 'MM108': {366.0, 429.0, 2038.0},
 'MM109': {550.0},
 'MM110': {484.0},
 'MM112': {288.0, 356.0},
 'MM113': {3559.0},
 'MM114': {250.0},
 'MM115': {180.0},
 'MM116': {250.0},
 'MM117': {289.0},
 'MM118': {221.0},
 'MM12': {318.0},
 'MM123': {200.0},
 'MM124': {1036.0},
 'MM125': {1326.0},
 'MM126': {208.0},
 'MM127': {224.0},
 'MM128': {217.0},
 'MM129': {582.0},
 'MM13': {165.0},
 'MM130': {230.0, 2532.0},
 'MM131': {279.0},
 'MM132': {220.0, 1187.0, 1254.0},
 'MM133': {214.0, 2741.0},
 'MM134': {2257.0},
 'MM135': {206.0},
 'MM136': {327.0, 432.0},
 'MM138': {168.0},
 'MM139': {427.0},
 'MM140': {1112.0},
 'MM141': {335.0},
 'MM142': {281.0, 1307.0},
 'MM143': {399.0},
 'MM144': {520.0},
 'MM145': {303.0},
 'MM146': {283.0},
 'MM149': {183.0, 2181.0},
 'MM15': {3363.0},
 'MM150': {201.0},
 'MM151': {1081.0},

In [7]:
data = pd.read_csv('BillingCodes.csv')

In [8]:
data

Unnamed: 0,ID,DaysFromDx,DxCodingMethod,DxCode,DxDescription,CCSLevel1Name,CCSLevel2Name,CCSLevel3Name
0,MM1,354,ICD9CM,V10.87,Personal history of malignant neoplasm of thyroid,Neoplasms,Cancer; other primary,Cancer of thyroid [36.]
1,MM1,373,ICD9CM,V10.87,Personal history of malignant neoplasm of thyroid,Neoplasms,Cancer; other primary,Cancer of thyroid [36.]
2,MM1,355,ICD9CM,V10.87,Personal history of malignant neoplasm of thyroid,Neoplasms,Cancer; other primary,Cancer of thyroid [36.]
3,MM1,741,ICD9CM,V12.29,"Personal history of other endocrine, metabolic...",Endocrine; nutritional; and metabolic diseases...,Other nutritional; endocrine; and metabolic di...,Other and unspecified metabolic; nutritional; ...
4,MM1,318,ICD9CM,V14.0,Personal history of allergy to penicillin,Symptoms; signs; and ill-defined conditions an...,Symptoms; signs; and ill-defined conditions,Allergic reactions [253.]
...,...,...,...,...,...,...,...,...
365050,MM838,4899,ICD10CM,C90.00,Multiple myeloma not having achieved remission,Neoplasms,Cancer of lymphatic and hematopoietic tissue,No Value
365051,MM838,2905,ICD9CM,203.02,"Multiple myeloma, in relapse",Neoplasms,Cancer of lymphatic and hematopoietic tissue,Multiple myeloma [40.]
365052,MM838,4594,ICD10CM,C90.00,Multiple myeloma not having achieved remission,Neoplasms,Cancer of lymphatic and hematopoietic tissue,No Value
365053,MM838,2528,ICD9CM,V70.7,Examination of participant in clinical trial,Symptoms; signs; and ill-defined conditions an...,Factors influencing health care,Medical examination/evaluation [256.]


In [9]:
# we expect fracture diagnoses to mention the word 'fracture' in at least one of
# DxDescription, CCSLevel1Name, CCSLevel2Name, CCSLevel3Name.
# Let's get the rows where any of those 4 columns contains it.

# given a row, return True if any of the 4 columns contains the word
# 'fracture' (case-insensitive). Non-string values such as NaN are skipped.
def doesRowContainFracture(row):
  columns = ['DxDescription', 'CCSLevel1Name', 'CCSLevel2Name', 'CCSLevel3Name']
  return any(isinstance(row[column], str) and 'fracture' in row[column].lower()
             for column in columns)

In [10]:
# get rows with word fracture in them.
data_fracture = data[data.apply(doesRowContainFracture, axis = 1)]
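
The row-wise `apply` above can be slow on a table with several hundred thousand rows. A vectorized alternative is sketched below on toy data; it assumes the same four column names, and the `toy_` and `fracture_mask` names are illustrative.

```python
import pandas as pd

# toy stand-in for BillingCodes.csv (values are illustrative only)
toy_codes = pd.DataFrame({
    'DxDescription': ['Closed fracture of rib(s), unspecified',
                      'Multiple myeloma, in relapse',
                      None],
    'CCSLevel1Name': ['Injury and poisoning', 'Neoplasms', 'Injury and poisoning'],
    'CCSLevel2Name': ['Fractures',
                      'Cancer of lymphatic and hematopoietic tissue',
                      'Fractures'],
    'CCSLevel3Name': ['Other fractures [231.]', 'Multiple myeloma [40.]', 'No Value'],
})

text_columns = ['DxDescription', 'CCSLevel1Name', 'CCSLevel2Name', 'CCSLevel3Name']

# case-insensitive substring match per column; missing values count as "no match"
fracture_mask = toy_codes[text_columns].apply(
    lambda column: column.str.contains('fracture', case=False, na=False)
).any(axis=1)

toy_fracture_rows = toy_codes[fracture_mask]
```

The `na=False` argument plays the role of the `isinstance(..., str)` checks in the row-wise version.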

In [11]:
data_fracture

Unnamed: 0,ID,DaysFromDx,DxCodingMethod,DxCode,DxDescription,CCSLevel1Name,CCSLevel2Name,CCSLevel3Name
469,MM2,-5558,ICD9CM,807.00,"Closed fracture of rib(s), unspecified",Injury and poisoning,Fractures,Other fractures [231.]
656,MM3,101,ICD10CM,M84.48XD,"Pathological fracture, other site, subsequent ...",Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
657,MM3,194,ICD10CM,M84.48XD,"Pathological fracture, other site, subsequent ...",Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
658,MM3,0,ICD10CM,M84.58XA,"Pathological fracture in neoplastic disease, o...",Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
740,MM4,433,ICD10CM,Z87.81,Personal history of (healed) traumatic fracture,Injury and poisoning,Other injuries and conditions due to external ...,No Value
...,...,...,...,...,...,...,...,...
360199,MM843,190,ICD9CM,805.2,Closed fracture of dorsal [thoracic] vertebra ...,Injury and poisoning,Fractures,Other fractures [231.]
360421,MM843,324,ICD9CM,V54.17,Aftercare for healing traumatic fracture of ve...,Injury and poisoning,Fractures,Other fractures [231.]
360447,MM843,239,ICD9CM,V54.17,Aftercare for healing traumatic fracture of ve...,Injury and poisoning,Fractures,Other fractures [231.]
360478,MM843,415,ICD9CM,V54.17,Aftercare for healing traumatic fracture of ve...,Injury and poisoning,Fractures,Other fractures [231.]


In [12]:
# drop rows that mention 'fracture' but that we do not count as fracture events
# (device removal, follow-up exams, tooth injuries, osteoporosis without
# fracture, history of a healed fracture).
excluded_descriptions = [
  'encounter for removal of internal fixation device',
  'follow-up examination, following treatment of healed fracture',
  'open wound of tooth (broken) (fractured) (due to trauma), without mention of complication',
  'other aftercare involving internal fixation device',
  'other osteoporosis without current pathological fracture',
  'osteoporosis without current pathological fracture',
  'tooth',
  'personal history of (healed) traumatic fracture',
]
for phrase in excluded_descriptions:
  data_fracture = data_fracture[data_fracture['DxDescription'].apply(lambda x: phrase not in x.lower())]
data_fracture = data_fracture[data_fracture['CCSLevel2Name'].apply(lambda x: 'other injuries and conditions due to external causes' not in x.lower())]

In [13]:
data_fracture

Unnamed: 0,ID,DaysFromDx,DxCodingMethod,DxCode,DxDescription,CCSLevel1Name,CCSLevel2Name,CCSLevel3Name
469,MM2,-5558,ICD9CM,807.00,"Closed fracture of rib(s), unspecified",Injury and poisoning,Fractures,Other fractures [231.]
656,MM3,101,ICD10CM,M84.48XD,"Pathological fracture, other site, subsequent ...",Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
657,MM3,194,ICD10CM,M84.48XD,"Pathological fracture, other site, subsequent ...",Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
658,MM3,0,ICD10CM,M84.58XA,"Pathological fracture in neoplastic disease, o...",Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
2282,MM7,7,ICD9CM,733.13,Pathologic fracture of vertebrae,Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
...,...,...,...,...,...,...,...,...
360199,MM843,190,ICD9CM,805.2,Closed fracture of dorsal [thoracic] vertebra ...,Injury and poisoning,Fractures,Other fractures [231.]
360421,MM843,324,ICD9CM,V54.17,Aftercare for healing traumatic fracture of ve...,Injury and poisoning,Fractures,Other fractures [231.]
360447,MM843,239,ICD9CM,V54.17,Aftercare for healing traumatic fracture of ve...,Injury and poisoning,Fractures,Other fractures [231.]
360478,MM843,415,ICD9CM,V54.17,Aftercare for healing traumatic fracture of ve...,Injury and poisoning,Fractures,Other fractures [231.]


In [23]:
data_fracture[data_fracture['ID'] == 'MM3']

Unnamed: 0,ID,DaysFromDx,DxCodingMethod,DxCode,DxDescription,CCSLevel1Name,CCSLevel2Name,CCSLevel3Name
656,MM3,101,ICD10CM,M84.48XD,"Pathological fracture, other site, subsequent ...",Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
657,MM3,194,ICD10CM,M84.48XD,"Pathological fracture, other site, subsequent ...",Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value
658,MM3,0,ICD10CM,M84.58XA,"Pathological fracture in neoplastic disease, o...",Diseases of the musculoskeletal system and con...,Pathological fracture [207.],No Value


In [30]:
data_fracture[data_fracture['ID'] == 'MM3']['DaysFromDx'].values

array([101, 194,   0])

In [50]:
# ex1. patient_fracture_days = [101, 194, 0]
# STC_days = {229, 1000, 3000}
# -> return None because no fracture is within 365 days after any STC day

# ex2. patient_fracture_days = [300, 194, 0]
# STC_days = {229, 3000, 1000}
# -> return 229 since day 300 is within 365 days after day 229.

# return the smallest STC day that is followed by a fracture within
# 365 days (inclusive); if there is no such day, return None.
# STC_days does not need to be sorted: it is sorted inside the function.
def STC_day_of_fracture(STC_days, fracture_days):
  STC_days_list = sorted(STC_days)
  for day in STC_days_list:
    lower_limit = day
    upper_limit = day + 365
    for fracture_day in fracture_days:
      if lower_limit <= fracture_day <= upper_limit:
        return day
  return None

In [52]:
# some unit tests
assert STC_day_of_fracture({229, 3000, 1000}, [100, 200, -11]) is None
assert STC_day_of_fracture({229, 3000, 1000}, [100, 300, -11]) == 229
assert STC_day_of_fracture({229, 3000, 1000}, [100, 300, 3200]) == 229
assert STC_day_of_fracture({229, 3000, 1000}, [100, 3100, 3200]) == 3000
assert STC_day_of_fracture({229, 3000, 1000}, [100, 3366]) is None
assert STC_day_of_fracture({229, 3000, 1000}, [1100, 3300]) == 1000

In [53]:
fractureList = [] # rows of [ID, STC day, fracture label], e.g. ['MM1', 343.0, 0]

# compute the set of patients with fracture records once, not on every iteration
patients_with_fracture = set(data_fracture['ID'].unique())

for ID in idToDays: # for every patient who had an STC:
  if ID not in patients_with_fracture:
    # no fracture records at all: label 0 and keep the earliest STC day
    fractureList.append([ID, min(idToDays[ID]), 0])
  else:
    patient_fracture_days = data_fracture[data_fracture['ID'] == ID]['DaysFromDx'].values
    # e.g. patient_fracture_days = [101, 194, 0]
    STC_days = idToDays[ID]
    # e.g. STC_days = {200, 300, ..}
    STC_day_that_produces_fracture = STC_day_of_fracture(STC_days, patient_fracture_days)

    if STC_day_that_produces_fracture is not None:
      fractureList.append([ID, STC_day_that_produces_fracture, 1])
    else:
      # fractures exist, but none within 365 days after any STC day
      fractureList.append([ID, min(idToDays[ID]), 0])

In [56]:
df_result = pd.DataFrame(data = fractureList, columns= ['ID', 'STC_day', 'HasFracture?'])

In [57]:
df_result

Unnamed: 0,ID,STC_day,HasFracture?
0,MM1,343.0,0
1,MM2,229.0,0
2,MM4,425.0,0
3,MM5,416.0,0
4,MM6,657.0,0
...,...,...,...
697,MM834,204.0,0
698,MM843,841.0,0
699,MM836,786.0,0
700,MM837,203.0,0


In [58]:
df_result['HasFracture?'].sum()

131

We used to have about 120 patients with fracture, now we have 131.

In [64]:
df_result[['ID', 'STC_day']].to_csv('STC_Days.csv', index = False)

In [67]:
df_result[['ID', 'HasFracture?']].to_csv('Fractures.csv', index = False)

# 3. remake labs, medications, and lesions data

The STC day recorded for a patient may have changed in the previous step (when a fracture follows a later transplant, that transplant's day is now used), so we need to rebuild the datasets that depend on the STC day.

## 3.1 make labs data

In [73]:
data_STC_days = pd.read_csv('STC_Days.csv')

In [74]:
data_STC_days

Unnamed: 0,ID,STC_day
0,MM1,343.0
1,MM2,229.0
2,MM4,425.0
3,MM5,416.0
4,MM6,657.0
...,...,...
697,MM834,204.0
698,MM843,841.0
699,MM836,786.0
700,MM837,203.0


We need to convert this dataframe into a dictionary, because the later code expects a dictionary, not a dataframe.

In [84]:
patientToDayOfSTC = dict()

In [81]:
data_STC_days.iloc[0].values[0]

'MM1'

In [82]:
data_STC_days.iloc[0].values[1]

343.0

In [85]:
for i in range(data_STC_days.shape[0]):
  ID = data_STC_days.iloc[i].values[0]
  day = data_STC_days.iloc[i].values[1]
  patientToDayOfSTC[ID] = day
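
The same dictionary can be built in one expression by zipping the two columns. A minimal sketch on toy data, assuming the `ID` and `STC_day` column names from `STC_Days.csv`; the `toy_` names are illustrative.

```python
import pandas as pd

# toy stand-in for STC_Days.csv
toy_stc_days = pd.DataFrame({'ID': ['MM1', 'MM2'], 'STC_day': [343.0, 229.0]})

# pair each ID with its STC day and turn the pairs into a dictionary
toy_patient_to_day = dict(zip(toy_stc_days['ID'], toy_stc_days['STC_day']))
```

Referring to columns by name also avoids depending on their position, which `values[0]` and `values[1]` do.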

In [86]:
patientToDayOfSTC

{'MM1': 343.0,
 'MM10': 171.0,
 'MM102': 234.0,
 'MM103': 673.0,
 'MM104': 162.0,
 'MM105': 189.0,
 'MM106': 243.0,
 'MM107': 426.0,
 'MM108': 366.0,
 'MM109': 550.0,
 'MM110': 484.0,
 'MM112': 288.0,
 'MM113': 3559.0,
 'MM114': 250.0,
 'MM115': 180.0,
 'MM116': 250.0,
 'MM117': 289.0,
 'MM118': 221.0,
 'MM12': 318.0,
 'MM123': 200.0,
 'MM124': 1036.0,
 'MM125': 1326.0,
 'MM126': 208.0,
 'MM127': 224.0,
 'MM128': 217.0,
 'MM129': 582.0,
 'MM13': 165.0,
 'MM130': 230.0,
 'MM131': 279.0,
 'MM132': 220.0,
 'MM133': 214.0,
 'MM134': 2257.0,
 'MM135': 206.0,
 'MM136': 327.0,
 'MM138': 168.0,
 'MM139': 427.0,
 'MM140': 1112.0,
 'MM141': 335.0,
 'MM142': 281.0,
 'MM143': 399.0,
 'MM144': 520.0,
 'MM145': 303.0,
 'MM146': 283.0,
 'MM149': 183.0,
 'MM15': 3363.0,
 'MM150': 201.0,
 'MM151': 1081.0,
 'MM152': 281.0,
 'MM153': 463.0,
 'MM154': 598.0,
 'MM155': 2141.0,
 'MM156': 630.0,
 'MM158': 316.0,
 'MM159': 283.0,
 'MM160': 682.0,
 'MM162': 565.0,
 'MM163': 1452.0,
 'MM165': 323.0,
 'MM166': 2

In [87]:
assert patientToDayOfSTC['MM838'] == 244.0
assert patientToDayOfSTC['MM6'] == 657.0
assert patientToDayOfSTC['MM2'] == 229.0

In [68]:
data = pd.read_csv('Labs.csv')

In [69]:
data.head()

Unnamed: 0,ID,ObservationId,ObservationName,Panel,DaysFromDx,ObservationValueNumeric,Units,AbnormalFlags,LowerLimit,UpperLimit
0,MM1,P,Phosphate,Electrolyte,363,4.0,mg/dL,N,2.5,4.5
1,MM1,PLT,Platelet Count,Blood count,366,258.0,THOU/uL,N,150.0,400.0
2,MM1,ALK,Alkaline Phosphatase,Liver function,314,74.0,U/L,N,36.0,161.0
3,MM1,RDWCV,Red Blood Cell Distribution Width,Blood count,358,14.3,%,N,11.6,14.4
4,MM1,RDWCV,Red Blood Cell Distribution Width,Blood count,355,13.8,%,N,11.6,14.4


In [70]:
# the chemicals we think are important for prediction.
# don't think we have estrogen, but estradiol?
important_chemicals = ['Calcium', 'Phosphate', 'Parathyroid hormone', \
                       'Alkaline\xa0Phosphatase', 'Vitamin\xa0D3', \
                       'Estradiol', 'Testosterone', 'Thyroid\xa0Stimulating\xa0Hormone',\
                       'Creatinine', 'C-Reactive Protein', 'Sedimentation\xa0Rate']

In [71]:
result_df = pd.DataFrame(columns = data.columns)

In [72]:
result_df

Unnamed: 0,ID,ObservationId,ObservationName,Panel,DaysFromDx,ObservationValueNumeric,Units,AbnormalFlags,LowerLimit,UpperLimit


In [88]:
progress = 0
for ID, patient_STC_day in patientToDayOfSTC.items():
  
  for chemical in important_chemicals:
    # get this patient's lab results history of this chemical

    patient_data = data[(data['ID'] == ID) & (data['ObservationName'] == chemical)]
    
    if patient_data.shape[0] != 0:
      row_to_be_added = None # the variable that stores the row that will be added to our result dataframe.
      
      # get the row with lowest day after STC date
      # set this to row_to_be_added first so it can be overwritten by
      # the highest day before STC date later. 
      patient_labs_after_STC = patient_data[patient_data['DaysFromDx'] > patient_STC_day]
      if patient_labs_after_STC.shape[0] != 0:
        row_to_be_added = patient_labs_after_STC[patient_labs_after_STC.DaysFromDx == patient_labs_after_STC.DaysFromDx.min()]
      
      # set the row to be the one with highest day (closest) to STC date
      # if there are any such rows.
      patient_labs_before_STC = patient_data[patient_data['DaysFromDx'] <= patient_STC_day]
      if patient_labs_before_STC.shape[0] != 0:
        row_to_be_added = patient_labs_before_STC[patient_labs_before_STC.DaysFromDx == patient_labs_before_STC.DaysFromDx.max()]
      
      # if the row is not None, that means we have found a row with the "closest" date to STC date
      # add this row to result dataframe
      # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
      if row_to_be_added is not None:
        result_df = pd.concat([result_df, row_to_be_added], ignore_index=True)
      # else: no lab row with a usable day was found, so nothing is added

  progress += 1
  if (progress % 10 == 0):
    print(f'Progress bar: {progress / len(patientToDayOfSTC)}')

Progress bar: 0.014245014245014245
Progress bar: 0.02849002849002849
Progress bar: 0.042735042735042736
Progress bar: 0.05698005698005698
Progress bar: 0.07122507122507123
Progress bar: 0.08547008547008547
Progress bar: 0.09971509971509972
Progress bar: 0.11396011396011396
Progress bar: 0.1282051282051282
Progress bar: 0.14245014245014245
Progress bar: 0.15669515669515668
Progress bar: 0.17094017094017094
Progress bar: 0.18518518518518517
Progress bar: 0.19943019943019943
Progress bar: 0.21367521367521367
Progress bar: 0.22792022792022792
Progress bar: 0.24216524216524216
Progress bar: 0.2564102564102564
Progress bar: 0.2706552706552707
Progress bar: 0.2849002849002849
Progress bar: 0.29914529914529914
Progress bar: 0.31339031339031337
Progress bar: 0.32763532763532766
Progress bar: 0.3418803418803419
Progress bar: 0.3561253561253561
Progress bar: 0.37037037037037035
Progress bar: 0.38461538461538464
Progress bar: 0.39886039886039887
Progress bar: 0.4131054131054131
Progress bar: 0.427
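
The rule used in the loop above (take the latest result on or before the STC day, otherwise the earliest result after it) can also be expressed without appending row by row. This is a minimal sketch on toy data, assuming the column names `ID`, `ObservationName`, and `DaysFromDx` from `Labs.csv`; the helper name `closest_rows_to_stc` is hypothetical. Unlike the loop, it keeps a single row when several results share the same day.

```python
import pandas as pd


def closest_rows_to_stc(records, stc_day_by_id, key_columns):
    # attach each patient's STC day; rows without one are dropped
    df = records.copy()
    df['stc_day'] = df['ID'].map(stc_day_by_id)
    df = df.dropna(subset=['stc_day', 'DaysFromDx'])

    # rows on or before the STC day sort first, then by distance to it
    df['is_after'] = df['DaysFromDx'] > df['stc_day']
    df['distance'] = (df['DaysFromDx'] - df['stc_day']).abs()
    df = df.sort_values(['is_after', 'distance'])

    # the first row per key is the preferred one
    picked = df.drop_duplicates(subset=key_columns, keep='first')
    return picked.drop(columns=['stc_day', 'is_after', 'distance']).sort_index()


# toy lab history (values are illustrative only)
toy_labs = pd.DataFrame({
    'ID': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2'],
    'ObservationName': ['Calcium', 'Calcium', 'Calcium', 'Phosphate', 'Phosphate'],
    'DaysFromDx': [300, 340, 350, 250, 400],
})
toy_stc_day_by_id = {'MM1': 343.0, 'MM2': 229.0}

toy_closest = closest_rows_to_stc(toy_labs, toy_stc_day_by_id, ['ID', 'ObservationName'])
```

For MM1 the day-340 result is chosen over day 350 even though both are close, because results before the transplant are preferred; MM2 has no earlier result, so the first one afterwards (day 250) is used.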

In [89]:
result_df.to_csv('Labs_closest_to_SCT.csv', index = False)

In [90]:
data_labs = pd.read_csv('Labs_closest_to_SCT.csv')

In [91]:
data_labs = data_labs[data_labs['ObservationName'] != 'Parathyroid hormone']

In [92]:
important_chemicals = ['Calcium', 'Phosphate', 'Parathyroid hormone', \
                       'Alkaline\xa0Phosphatase', 'Vitamin\xa0D3', \
                       'Estradiol', 'Testosterone', 'Thyroid\xa0Stimulating\xa0Hormone',\
                       'Creatinine', 'C-Reactive Protein', 'Sedimentation\xa0Rate']

In [93]:
lab_interpretation = {}
lab_interpretation['Calcium'] = {'N':0, 'H': 1, 'L': 1}
lab_interpretation['Phosphate'] = {'N':0, 'H': 1, 'L': 1}
lab_interpretation['Alkaline\xa0Phosphatase'] = {'N':0, 'H': 1, 'L': 0}
lab_interpretation['Vitamin\xa0D3'] = {'N':0, 'H': 1, 'L': 0}
lab_interpretation['Estradiol'] = {'N':0, 'H': 0, 'L': 1}
lab_interpretation['Testosterone'] = {'N':0, 'H': 0, 'L': 1}
lab_interpretation['Thyroid\xa0Stimulating\xa0Hormone'] = {'N':0, 'H': 1, 'L': 1}
lab_interpretation['Creatinine'] = {'N':0, 'H': 1, 'L': 0}
lab_interpretation['C-Reactive Protein'] = {'N':0, 'H': 1, 'L': 0}
lab_interpretation['Sedimentation\xa0Rate'] = {'N':0, 'H': 1, 'L': 0}

In [94]:
# transform lab results flags to match lab interpretation.
data_labs['Abnormal?'] = data_labs.apply(lambda row: lab_interpretation[row['ObservationName']][row['AbnormalFlags']], axis = 1)

In [95]:
list_of_lab_results = []
for ID in data_labs['ID'].unique():
  patient_labs = data_labs[data_labs['ID'] == ID]
  thisRow = [ID]
  for chemical in important_chemicals:
    if chemical in patient_labs['ObservationName'].unique():
      patient_labs_this_chemical = patient_labs[patient_labs['ObservationName'] == chemical]
      if patient_labs_this_chemical.shape[0] == 1:
        thisRow.append(patient_labs_this_chemical['Abnormal?'].values[0])
      else:
        # test the values: `in` on a Series would check its index labels instead
        if 1 in patient_labs_this_chemical['Abnormal?'].values:
          thisRow.append(1)
        else:
          thisRow.append(0)
    else:
      thisRow.append(None)
  list_of_lab_results.append(thisRow)
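
A pandas detail that matters for the membership test in the cell above: the `in` operator applied to a Series looks at the index labels, not at the values. A short illustration with a toy Series:

```python
import pandas as pd

# index labels 10 and 11, values 0 and 1
flags = pd.Series([0, 1], index=[10, 11])

found_in_index = 1 in flags          # checks the labels 10 and 11
found_in_values = 1 in flags.values  # checks the values 0 and 1
```

Here `found_in_index` is `False` and `found_in_values` is `True`, so value checks should go through `.values` (or a comparison such as `(flags == 1).any()`).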

In [96]:
data_labs_new = pd.DataFrame(data = list_of_lab_results, columns= ['ID'] + important_chemicals)

In [97]:
data_labs_new.drop(['Parathyroid hormone'], axis = 1, inplace = True)

In [98]:
data_labs_new.to_csv('Labs_encoded.csv', index=False)

## 3.2 make medicines data

In [99]:
data = pd.read_csv('Medications.csv')

In [100]:
data.head()

Unnamed: 0,ID,Note,DaysFromDx,Time,BeginOffset,EndOffset,Term,NegationScore,TermScore
0,MM7,AdmitNote,1961,14:48:00,794,803,Abatacept,,0.485489
1,MM7,SCCA-OutptRecord,42,17:20:00,906,912,Acacia,,0.3813
2,MM55,Nutrition-OutptRecord,1633,11:07:00,834,840,Acacia,,0.406916
3,MM55,Nutrition-OutptRecord,1633,11:07:00,5057,5063,Acacia,,0.490309
4,MM416,Nutrition-OutptRecord,123,10:30:00,386,392,Acacia,,0.35008


In [101]:
medicines = ['Vitamin D supplements', 'Calcium', 'Denosumab', 'Pamidronate', 'Zoledronate', 'Dexamethasone']

In [102]:
# each row holds the patient ID followed by 6 values
# indicating whether the patient took each medicine.
# for example one row might be ['MM1', 1, 0, 0, 0, 0, 0]
rows = []
progress = 0
for ID, STC_Day in patientToDayOfSTC.items():
  # the data of this patient 90 days prior to STC date
  this_patient_data = data[(data['ID'] == ID) & (data['DaysFromDx'] <= STC_Day) & (data['DaysFromDx'] >= STC_Day - 90)]
  # if this dataframe is empty, or that the medications in this 90 days prior
  # to STC date don't contain any of the 6 medicines we want information of
  # we look at the days even prior to 90 days.
  if this_patient_data.shape[0] == 0 or (not (any(item in this_patient_data['Term'].unique() for item in medicines))):
    this_patient_data = data[(data['ID'] == ID) & (data['DaysFromDx'] <= STC_Day)]

  thisRow = [ID] # put ID in this row first

  # for each medicine, if this medicine is in the dataframe
  # add 1 to our row, else add 0
  for medicine in medicines:
    if medicine in this_patient_data['Term'].unique():
      thisRow.append(1)
    else:
      thisRow.append(0)
  # add this row to rows
  rows.append(thisRow)
  progress += 1
  if (progress % 10 == 0):
    print(progress/len(patientToDayOfSTC))

0.014245014245014245
0.02849002849002849
0.042735042735042736
0.05698005698005698
0.07122507122507123
0.08547008547008547
0.09971509971509972
0.11396011396011396
0.1282051282051282
0.14245014245014245
0.15669515669515668
0.17094017094017094
0.18518518518518517
0.19943019943019943
0.21367521367521367
0.22792022792022792
0.24216524216524216
0.2564102564102564
0.2706552706552707
0.2849002849002849
0.29914529914529914
0.31339031339031337
0.32763532763532766
0.3418803418803419
0.3561253561253561
0.37037037037037035
0.38461538461538464
0.39886039886039887
0.4131054131054131
0.42735042735042733
0.4415954415954416
0.45584045584045585
0.4700854700854701
0.4843304843304843
0.4985754985754986
0.5128205128205128
0.5270655270655271
0.5413105413105413
0.5555555555555556
0.5698005698005698
0.584045584045584
0.5982905982905983
0.6125356125356125
0.6267806267806267
0.6410256410256411
0.6552706552706553
0.6695156695156695
0.6837606837606838
0.698005698005698
0.7122507122507122
0.7264957264957265
0.74074
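
The window-with-fallback rule in the loop above can be isolated into a small function, which makes it easier to test. A minimal sketch on toy data, assuming the `Term` and `DaysFromDx` column names from `Medications.csv`; the function name `medicine_flags` and the toy terms are hypothetical.

```python
import pandas as pd


def medicine_flags(patient_meds, stc_day, medicines, window_days=90):
    # everything recorded on or before the STC day
    before = patient_meds[patient_meds['DaysFromDx'] <= stc_day]
    # the subset inside the window just before the STC day
    recent = before[before['DaysFromDx'] >= stc_day - window_days]

    # fall back to the full pre-transplant history when the window
    # is empty or mentions none of the medicines of interest
    chosen = recent if recent['Term'].isin(medicines).any() else before

    seen = set(chosen['Term'])
    return [int(medicine in seen) for medicine in medicines]


toy_medicines = ['Calcium', 'Denosumab']

# only an unrelated term falls inside the 90-day window, so the fallback applies
toy_meds_old = pd.DataFrame({'Term': ['Calcium', 'Aspirin'], 'DaysFromDx': [100, 300]})
flags_fallback = medicine_flags(toy_meds_old, 343.0, toy_medicines)

# Denosumab falls inside the window, so older records are ignored
toy_meds_recent = pd.DataFrame({'Term': ['Calcium', 'Denosumab'], 'DaysFromDx': [100, 320]})
flags_recent = medicine_flags(toy_meds_recent, 343.0, toy_medicines)
```

Note the consequence of the rule: in the second case the older Calcium record is not counted, because the window already contained one of the medicines.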

In [103]:
len(rows)

702

In [104]:
# convert rows to a dataframe.
df_medicines = pd.DataFrame(rows, columns = ['ID'] + medicines)

In [105]:
# save dataframe to csv.
df_medicines.to_csv('medicines_90_days_before_STC.csv', index = False)

## 3.3 make lesions data

In [106]:
data = pd.read_csv('BonyLesions.csv')

In [107]:
df_lesions = pd.DataFrame(columns=data.columns)
df_lesions

Unnamed: 0,ID,DxType,DaysFromDx,BonyLesions


In [108]:
# pd.concat is used because DataFrame.append was removed in pandas 2.0
for ID, STC_Day in patientToDayOfSTC.items():
  # prefer the latest lesion record on or before the STC day
  less_than_STC_data = data[(data['ID'] == ID) & (data['DaysFromDx'] <= STC_Day)]
  if less_than_STC_data.shape[0] == 0:
    # otherwise fall back to the earliest record after the STC day
    greater_than_STC_data = data[(data['ID'] == ID) & (data['DaysFromDx'] > STC_Day)]
    if greater_than_STC_data.shape[0] != 0:
      df_lesions = pd.concat([df_lesions, greater_than_STC_data.sort_values(by = ['DaysFromDx']).iloc[0:1, :]], ignore_index=True)
  else:
    df_lesions = pd.concat([df_lesions, less_than_STC_data.sort_values(by = ['DaysFromDx'], ascending = False).iloc[0:1, :]], ignore_index=True)

In [109]:
def encodeLesions(lesionsString):
  if lesionsString == '0':
    return 0
  else:
    return 1

In [110]:
df_lesions['BonyLesions'] = df_lesions['BonyLesions'].apply(encodeLesions)

In [111]:
# save this to csv.
df_lesions.to_csv('Lesions_encoded.csv', index=False)

# 4. reprepare aggregated datasets

In [115]:
data_labs = pd.read_csv('Labs_encoded.csv')
data_demographics = pd.read_csv('Demographics_encoded.csv')
data_medications = pd.read_csv('medicines_90_days_before_STC.csv')
data_cancer_stage = pd.read_csv('Stages_encoded.csv')
data_bony_lesions = pd.read_csv('Lesions_encoded.csv')
data_fracture = pd.read_csv('Fractures.csv')

In [116]:
data_agg = data_fracture.merge(data_bony_lesions, on = 'ID', how='outer')
data_agg = data_agg.merge(data_cancer_stage, on = 'ID', how='outer')
data_agg = data_agg.merge(data_medications, on = 'ID', how = 'outer')
data_agg = data_agg.merge(data_demographics, on = 'ID', how = 'outer')
data_agg = data_agg.merge(data_labs, on = 'ID', how = 'outer')
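
Since rows without a fracture label are discarded afterwards, a left join starting from the fracture table should give the same patients directly, and renaming the two `Calcium` columns before merging avoids the `_x`/`_y` suffixes. A minimal sketch on toy tables, assuming each ID appears once per table; the `toy_` names are illustrative.

```python
from functools import reduce

import pandas as pd

# toy stand-ins for Fractures.csv, the medicines table, and the labs table
toy_fracture = pd.DataFrame({'ID': ['MM1', 'MM2'], 'HasFracture?': [0, 1]})
toy_medications = pd.DataFrame({'ID': ['MM1', 'MM2', 'MM3'], 'Calcium': [1, 0, 0]})
toy_labs = pd.DataFrame({'ID': ['MM1', 'MM3'], 'Calcium': [0.0, 1.0]})

# give the clashing columns distinct names before merging
toy_medications = toy_medications.rename(columns={'Calcium': 'TookCalciumMedicine'})
toy_labs = toy_labs.rename(columns={'Calcium': 'LabCalciumLevel'})

# left-join every feature table onto the labeled patients
feature_tables = [toy_medications, toy_labs]
toy_agg = reduce(
    lambda left, right: left.merge(right, on='ID', how='left'),
    feature_tables,
    toy_fracture,
)
```

MM3 has no label and never enters the result, while MM2 keeps its row with a missing lab value.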

In [117]:
data_agg

Unnamed: 0,ID,HasFracture?,DxType,DaysFromDx,BonyLesions,Stage,Vitamin D supplements,Calcium_x,Denosumab,Pamidronate,Zoledronate,Dexamethasone,AgeAtDx,PatientSex,RacialGroup,Calcium_y,Phosphate,Alkaline Phosphatase,Vitamin D3,Estradiol,Testosterone,Thyroid Stimulating Hormone,Creatinine,C-Reactive Protein,Sedimentation Rate
0,MM1,0.0,MRI,297.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,1.0,0.0,0.0,,1.0,,1.0,1.0,
1,MM2,0.0,MRI,176.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,2.0,0.0,0.0,0.0,0.0,,0.0,,0.0,0.0,
2,MM4,0.0,MRI,411.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,,,0.0,1.0,0.0,
3,MM5,0.0,MRI,373.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,
4,MM6,0.0,,,,1.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
828,MM423,,,,,,,,,,,,2.0,0.0,2.0,,,,,,,,,,
829,MM486,,,,,,,,,,,,2.0,0.0,1.0,,,,,,,,,,
830,MM583,,,,,,,,,,,,2.0,0.0,1.0,,,,,,,,,,
831,MM675,,,,,,,,,,,,2.0,0.0,1.0,,,,,,,,,,


In [118]:
# drop the rows where HasFracture? is NaN, since that is the label we are trying to predict
data_agg = data_agg.dropna(subset = ['HasFracture?'])

In [119]:
data_agg

Unnamed: 0,ID,HasFracture?,DxType,DaysFromDx,BonyLesions,Stage,Vitamin D supplements,Calcium_x,Denosumab,Pamidronate,Zoledronate,Dexamethasone,AgeAtDx,PatientSex,RacialGroup,Calcium_y,Phosphate,Alkaline Phosphatase,Vitamin D3,Estradiol,Testosterone,Thyroid Stimulating Hormone,Creatinine,C-Reactive Protein,Sedimentation Rate
0,MM1,0.0,MRI,297.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,1.0,0.0,0.0,,1.0,,1.0,1.0,
1,MM2,0.0,MRI,176.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,2.0,0.0,0.0,0.0,0.0,,0.0,,0.0,0.0,
2,MM4,0.0,MRI,411.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,,,0.0,1.0,0.0,
3,MM5,0.0,MRI,373.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,
4,MM6,0.0,,,,1.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
697,MM834,0.0,MRI,161.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,0.0,0.0,0.0,,,0.0,1.0,0.0,1.0,
698,MM843,0.0,MRI,812.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,1.0
699,MM836,0.0,MRI,754.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,1.0,
700,MM837,0.0,MRI,173.0,1.0,3.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,,1.0,,0.0,0.0,


In [120]:
# drop some columns we are not going to use in prediction
# (reassign instead of using inplace=True: data_agg is the result of dropna, and
# an in-place drop on it triggers pandas' SettingWithCopyWarning)
data_agg = data_agg.drop(['DxType', 'DaysFromDx'], axis = 1)


In [121]:
data_agg = data_agg.rename(columns={'Calcium_x': 'TookCalciumMedicine', 'Calcium_y': 'LabCalciumLevel'})
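
The `Calcium_x` / `Calcium_y` names come from the merge: both the medications table and the labs table have a column called `Calcium`, so pandas appends its default suffixes. A small sketch with made-up data showing that the `suffixes` argument of `merge` can name them at merge time instead; the suffix strings `_med` and `_lab` are only examples.

```python
import pandas as pd

# hypothetical toy tables that share a non-key column name
meds = pd.DataFrame({'ID': ['MM1', 'MM2'], 'Calcium': [1, 0]})
labs = pd.DataFrame({'ID': ['MM1', 'MM2'], 'Calcium': [0, 1]})

# default suffixes are '_x' (left table) and '_y' (right table)
default_merge = meds.merge(labs, on='ID', how='outer')
# explicit suffixes make the origin of each column visible
named_merge = meds.merge(labs, on='ID', how='outer', suffixes=('_med', '_lab'))

default_cols = list(default_merge.columns)
named_cols = list(named_merge.columns)
```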

In [122]:
data_agg.to_csv('data_agg.csv', index=False)

# 5. deal with nulls

In [123]:
data = pd.read_csv('data_agg.csv')

In [124]:
data.isnull().sum()

ID                               0
HasFracture?                     0
BonyLesions                     33
Stage                            4
Vitamin D supplements            0
TookCalciumMedicine              0
Denosumab                        0
Pamidronate                      0
Zoledronate                      0
Dexamethasone                    0
AgeAtDx                          1
PatientSex                       1
RacialGroup                      1
LabCalciumLevel                  3
Phosphate                        6
Alkaline Phosphatase             3
Vitamin D3                     137
Estradiol                      552
Testosterone                   389
Thyroid Stimulating Hormone    223
Creatinine                       3
C-Reactive Protein              23
Sedimentation Rate             601
dtype: int64

In [125]:
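# drop the lab columns that are mostly missing
# (\xa0 is a non-breaking space that appears in some of the column names)
data.drop(['Estradiol', 'Testosterone', 'Sedimentation\xa0Rate'], axis = 1, inplace = True)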
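
Instead of picking the sparse columns by eye, they could be selected by their fraction of missing values. This is a sketch on made-up data; the cutoff `max_null_fraction = 0.5` is a hypothetical choice, not something the notebook settled on.

```python
import numpy as np
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    'ID': ['a', 'b', 'c', 'd'],
    'Phosphate': [0.0, 1.0, np.nan, 0.0],
    'Estradiol': [np.nan, np.nan, np.nan, 1.0],
})

null_fraction = toy.isnull().mean()   # share of missing values per column
max_null_fraction = 0.5               # hypothetical cutoff
sparse_cols = null_fraction[null_fraction > max_null_fraction].index.tolist()
toy_dense = toy.drop(columns=sparse_cols)
```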

In [126]:
data[data['AgeAtDx'].isnull()]

Unnamed: 0,ID,HasFracture?,BonyLesions,Stage,Vitamin D supplements,TookCalciumMedicine,Denosumab,Pamidronate,Zoledronate,Dexamethasone,AgeAtDx,PatientSex,RacialGroup,LabCalciumLevel,Phosphate,Alkaline Phosphatase,Vitamin D3,Thyroid Stimulating Hormone,Creatinine,C-Reactive Protein
622,MM908,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,


In [127]:
# remove this row: it is missing the demographics and every lab value
data = data[data['AgeAtDx'].notnull()]

In [128]:
data.isnull().sum()

ID                               0
HasFracture?                     0
BonyLesions                     33
Stage                            4
Vitamin D supplements            0
TookCalciumMedicine              0
Denosumab                        0
Pamidronate                      0
Zoledronate                      0
Dexamethasone                    0
AgeAtDx                          0
PatientSex                       0
RacialGroup                      0
LabCalciumLevel                  2
Phosphate                        5
Alkaline Phosphatase             2
Vitamin D3                     136
Thyroid Stimulating Hormone    222
Creatinine                       2
C-Reactive Protein              22
dtype: int64

In [129]:
# drop any rows with null value in Vitamin D3 cell
data = data.dropna(subset = ['Vitamin\xa0D3'])

In [130]:
data = data.dropna(subset = ['Phosphate'])

In [131]:
data.isnull().sum()

ID                               0
HasFracture?                     0
BonyLesions                     28
Stage                            1
Vitamin D supplements            0
TookCalciumMedicine              0
Denosumab                        0
Pamidronate                      0
Zoledronate                      0
Dexamethasone                    0
AgeAtDx                          0
PatientSex                       0
RacialGroup                      0
LabCalciumLevel                  0
Phosphate                        0
Alkaline Phosphatase             0
Vitamin D3                       0
Thyroid Stimulating Hormone    171
Creatinine                       0
C-Reactive Protein              11
dtype: int64

In [132]:
data = data.drop('Thyroid\xa0Stimulating\xa0Hormone', axis = 1)

In [133]:
data.isnull().sum()

ID                        0
HasFracture?              0
BonyLesions              28
Stage                     1
Vitamin D supplements     0
TookCalciumMedicine       0
Denosumab                 0
Pamidronate               0
Zoledronate               0
Dexamethasone             0
AgeAtDx                   0
PatientSex                0
RacialGroup               0
LabCalciumLevel           0
Phosphate                 0
Alkaline Phosphatase      0
Vitamin D3                0
Creatinine                0
C-Reactive Protein       11
dtype: int64

In [134]:
data = data.dropna(subset = ['Stage', 'BonyLesions', 'C-Reactive Protein'])
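
Each `dropna` removes whole patients, so it is worth counting how many rows a call costs. A minimal sketch on made-up data; the names `required`, `toy_complete` and `rows_dropped` are only for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    'Stage': [1.0, np.nan, 2.0, 3.0],
    'BonyLesions': [0.0, 1.0, np.nan, 1.0],
    'C-Reactive Protein': [1.0, 0.0, 0.0, 0.0],
})

required = ['Stage', 'BonyLesions', 'C-Reactive Protein']
rows_before = len(toy)
# a row is dropped if ANY of the listed columns is null
toy_complete = toy.dropna(subset=required)
rows_dropped = rows_before - len(toy_complete)
```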

In [135]:
data.isnull().sum()

ID                       0
HasFracture?             0
BonyLesions              0
Stage                    0
Vitamin D supplements    0
TookCalciumMedicine      0
Denosumab                0
Pamidronate              0
Zoledronate              0
Dexamethasone            0
AgeAtDx                  0
PatientSex               0
RacialGroup              0
LabCalciumLevel          0
Phosphate                0
Alkaline Phosphatase     0
Vitamin D3               0
Creatinine               0
C-Reactive Protein       0
dtype: int64

In [136]:
data.shape

(525, 19)

In [137]:
data.to_csv('data_agg_2.csv', index = False)

# 6. one-hot encoding

In [138]:
data = pd.read_csv('data_agg_2.csv')

In [139]:
data

Unnamed: 0,ID,HasFracture?,BonyLesions,Stage,Vitamin D supplements,TookCalciumMedicine,Denosumab,Pamidronate,Zoledronate,Dexamethasone,AgeAtDx,PatientSex,RacialGroup,LabCalciumLevel,Phosphate,Alkaline Phosphatase,Vitamin D3,Creatinine,C-Reactive Protein
0,MM1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0
1,MM2,0.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
2,MM4,0.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,MM5,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
4,MM7,1.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
520,MM832,0.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
521,MM843,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
522,MM836,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
523,MM837,0.0,1.0,3.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [140]:
data['HasFracture?'].sum()

110.0

We had 108 fractures last time; now we have 110, a slight improvement.
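
With 110 positives out of 525 rows, about 21% of the patients are labelled as having a fracture. A quick way to check the class balance, sketched on made-up labels (the Series below is not the real data):

```python
import pandas as pd

# hypothetical labels, not the real data
labels = pd.Series([0.0, 0.0, 0.0, 1.0], name='HasFracture?')

positive_rate = labels.mean()                    # fraction of rows with a fracture
class_counts = labels.value_counts().to_dict()   # rows per class
```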

In [141]:
# one-hot encode the categorical variables: age group, sex, and racial group
data = pd.get_dummies(data, columns=['AgeAtDx', 'PatientSex', 'RacialGroup'])
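
`get_dummies` creates one indicator column per level, so the indicators for each variable always sum to 1. Some models (for example unregularized linear ones) prefer that one level be dropped; whether that matters here depends on the model, which this notebook does not decide. A sketch on made-up data comparing the two options:

```python
import pandas as pd

# hypothetical toy data using the same kind of coded categories
toy = pd.DataFrame({'AgeAtDx': [1.0, 2.0, 3.0], 'PatientSex': [0.0, 1.0, 1.0]})

# one column per level (what the cell above does)
full = pd.get_dummies(toy, columns=['AgeAtDx', 'PatientSex'])
# drop the first level of each variable to remove the redundant column
reduced = pd.get_dummies(toy, columns=['AgeAtDx', 'PatientSex'], drop_first=True)

full_cols = list(full.columns)
reduced_cols = list(reduced.columns)
```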

In [142]:
data

Unnamed: 0,ID,HasFracture?,BonyLesions,Stage,Vitamin D supplements,TookCalciumMedicine,Denosumab,Pamidronate,Zoledronate,Dexamethasone,LabCalciumLevel,Phosphate,Alkaline Phosphatase,Vitamin D3,Creatinine,C-Reactive Protein,AgeAtDx_1.0,AgeAtDx_2.0,AgeAtDx_3.0,PatientSex_0.0,PatientSex_1.0,RacialGroup_1.0,RacialGroup_2.0,RacialGroup_3.0,RacialGroup_4.0
0,MM1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0,1,0,0,1,1,0,0,0
1,MM2,0.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0,1,0,1,0,0
2,MM4,0.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0,1,0,1,0,1,0,0,0
3,MM5,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0,1,1,0,0,0
4,MM7,1.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0,1,0,1,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
520,MM832,0.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0,1,0,0,1,1,0,0,0
521,MM843,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1,0,1,1,0,0,0
522,MM836,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0,1,0,0,1,1,0,0,0
523,MM837,0.0,1.0,3.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0,1,1,0,0,0


In [143]:
data.to_csv('data_agg_3.csv', index = False)