Summary

This file contains the code for data preparation, training, and testing
- Data Preparation, dataset used https://drive.google.com/file/d/19LmP_-UDSEOmCJwjYCRzz4EiRe_RK3Og/view?usp=sharing
- Loading data
- Calculate and add Age column to the patients dataframe
- Combining columns from different files
- Dropping unwanted columns
- Encoding LabValues according to the Gender, and LabName
- Converting Gender and Race to one-hot
- Preparing ground truth values
- Train/Test split
- Training a Machine Learning classifier
- Training a small neural network

The final train/test data was prepared for 170 diabetes-related DiagnosisCode values. The performance of the neural network classifier is not up to the mark because the model was trained for few epochs and the model architecture is small. Also, the encoded 0/1 values were used for LabValue directly, without converting the column to one-hot.
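
The last limitation can be illustrated on a toy frame. The sketch below shows one possible remedy, which is an assumption and not what this notebook does: it pivots the encoded 0/1 LabValue into one column per LabName, so that each admission becomes a single row. The toy data and the name `wide` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data in the same long format as the merged dataframe
toy = pd.DataFrame({
    'PatientID':   ['A', 'A', 'A', 'B', 'B'],
    'AdmissionID': [1, 1, 1, 1, 1],
    'LabName':     ['CBC: HEMATOCRIT', 'METABOLIC: SODIUM', 'CBC: HEMATOCRIT',
                    'CBC: HEMATOCRIT', 'METABOLIC: SODIUM'],
    'LabValue':    [1.0, 0.0, 0.0, 1.0, 1.0],   # already encoded as 0/1
})

# One row per admission, one column per lab.
# The mean is the share of in-range results for that lab in that admission.
wide = toy.pivot_table(index=['PatientID', 'AdmissionID'],
                       columns='LabName', values='LabValue', aggfunc='mean')
print(wide)
```

With this layout the classifier would see which lab each value belongs to, information that is lost once LabName is dropped.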


# Prepare data for training and testing

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import datetime
import seaborn as sns
%matplotlib inline

## Load dataset

In [2]:
# load google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# load data
patients = pd.read_csv('/content/drive/MyDrive/Molecular/PatientCorePopulatedTable.txt', index_col=False, delimiter='\t', ) 
admission = pd.read_csv('/content/drive/MyDrive/Molecular/AdmissionsCorePopulatedTable.txt', index_col=False, delimiter='\t', ) 
labs = pd.read_csv('/content/drive/MyDrive/Molecular/LabsCorePopulatedTable.txt', index_col=False, delimiter='\t', ) 
diagnoses = pd.read_csv('/content/drive/MyDrive/Molecular/AdmissionsDiagnosesCorePopulatedTable.txt', index_col=False, delimiter='\t', ) 

In [4]:
patients.head(1)

Unnamed: 0,PatientID,PatientGender,PatientDateOfBirth,PatientRace,PatientMaritalStatus,PatientLanguage,PatientPopulationPercentageBelowPoverty
0,3A3C2AFB-FFFA-4E69-B4E6-73C1245D5D12,Male,1975-01-04 14:49:59.587,White,Single,Unknown,15.6


In [5]:
admission.head(1)

Unnamed: 0,PatientID,AdmissionID,AdmissionStartDate,AdmissionEndDate
0,43556DC2-BCFC-45A8-84C3-1D3E4A11B02F,1,1974-07-26 15:05:30.333,1974-07-30 22:27:44.987


In [6]:
labs.head(1)

Unnamed: 0,PatientID,AdmissionID,LabName,LabValue,LabUnits,LabDateTime
0,915BC24E-8C44-4D33-A386-CEA965B83F32,1,CBC: HEMATOCRIT,40.7,%,1946-09-07 22:20:26.677


In [7]:
diagnoses.head(1)

Unnamed: 0,PatientID,AdmissionID,PrimaryDiagnosisCode,PrimaryDiagnosisDescription
0,E74E9DF1-D8FD-41BC-8CDE-226CFE318E0B,1,E09.42,Drug or chemical induced diabetes mellitus wit...


In [8]:
print('patients, shape', patients.shape)
print('admission, shape', admission.shape)
print('labs shape', labs.shape)
print('diagnoses shape', diagnoses.shape)

patients, shape (10000, 7)
admission, shape (36143, 4)
labs shape (10726505, 6)
diagnoses shape (36143, 4)


In [9]:
# unique records in diagnoses
diagnoses.nunique()

PatientID                      10000
AdmissionID                       12
PrimaryDiagnosisCode            2625
PrimaryDiagnosisDescription     2618
dtype: int64

In [10]:
# Single patient data in the Lab
code = labs[labs.PatientID.str.startswith('915BC24E-8C44-4D33-A386-CEA965B83F32')] 
code.nunique()

PatientID         1
AdmissionID       3
LabName          35
LabValue        515
LabUnits         14
LabDateTime    1138
dtype: int64

## Calculate and add Age column to the patients dataframe

In [11]:
# compute a patient's age in whole years from the date-of-birth string
def findAge(dob):
  date_time_obj = datetime.datetime.strptime(dob, '%Y-%m-%d %H:%M:%S.%f')
  birth_date = date_time_obj.date()
  end_date = datetime.date(2021, 1, 1)  # fixed reference date used for all patients
  time_difference = end_date - birth_date
  # 365.2425 is the mean length of a Gregorian year in days
  return int(time_difference.days/365.2425)

# compute the age of every patient
ages = patients['PatientDateOfBirth'].apply(findAge)

# Add age column after DOB
patients.insert(3, "PatientAge", ages, True) 
patients.head(2)

Unnamed: 0,PatientID,PatientGender,PatientDateOfBirth,PatientAge,PatientRace,PatientMaritalStatus,PatientLanguage,PatientPopulationPercentageBelowPoverty
0,3A3C2AFB-FFFA-4E69-B4E6-73C1245D5D12,Male,1975-01-04 14:49:59.587,45,White,Single,Unknown,15.6
1,801AFB51-036F-40E3-BDFE-FED4844BE275,Male,1964-09-06 13:15:43.043,56,White,Unknown,English,13.23
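
The same calculation can also be done without a Python-level function call per row. The following is a minimal sketch on a hypothetical two-row sample built from the dates of birth shown above. It assumes the same fixed reference date of 2021-01-01 and the same 365.2425-day year.

```python
import pandas as pd

# Hypothetical two-row sample using the dates of birth shown above
toy = pd.DataFrame({'PatientDateOfBirth': ['1975-01-04 14:49:59.587',
                                           '1964-09-06 13:15:43.043']})

dob = pd.to_datetime(toy['PatientDateOfBirth'])
reference = pd.Timestamp('2021-01-01')   # same fixed reference date as findAge

# Drop the time of day, count whole days, then convert days to years
days = (reference - dob.dt.normalize()).dt.days
toy['PatientAge'] = (days / 365.2425).astype(int)
print(toy)
```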


## Merge data from different files

In [12]:
# merge patients and labs
patients_labs = pd.merge(patients, labs, on = ['PatientID'], how='inner')
patients_labs.shape

(10726505, 13)

In [13]:
patients_labs.head(2)

Unnamed: 0,PatientID,PatientGender,PatientDateOfBirth,PatientAge,PatientRace,PatientMaritalStatus,PatientLanguage,PatientPopulationPercentageBelowPoverty,AdmissionID,LabName,LabValue,LabUnits,LabDateTime
0,3A3C2AFB-FFFA-4E69-B4E6-73C1245D5D12,Male,1975-01-04 14:49:59.587,45,White,Single,Unknown,15.6,1,METABOLIC: AST/SGOT,31.5,U/L,2000-09-12 09:10:38.407
1,3A3C2AFB-FFFA-4E69-B4E6-73C1245D5D12,Male,1975-01-04 14:49:59.587,45,White,Single,Unknown,15.6,1,CBC: HEMOGLOBIN,16.2,gm/dl,2000-09-12 21:40:33.247


In [14]:
# merge patients_labs and diagnoses
patients_labs_diagnose = pd.merge(patients_labs, diagnoses, on = ['PatientID', 'AdmissionID'], how='inner')
patients_labs_diagnose.shape

(10726505, 15)
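
The row count stays at 10726505 after both merges, which suggests that every lab row matched exactly one patient and one diagnosis. A minimal sketch of how that assumption could be checked explicitly is shown below, using the `validate` argument of `pd.merge` on hypothetical miniature tables.

```python
import pandas as pd

# Hypothetical miniature versions of the labs and diagnoses tables
labs_toy = pd.DataFrame({'PatientID': ['A', 'A', 'B'],
                         'AdmissionID': [1, 1, 1],
                         'LabName': ['CBC: MCH', 'CBC: RDW', 'CBC: MCH']})
diag_toy = pd.DataFrame({'PatientID': ['A', 'B'],
                         'AdmissionID': [1, 1],
                         'PrimaryDiagnosisCode': ['E09.42', 'E11.64']})

# validate='many_to_one' raises MergeError if the right table has duplicate keys,
# i.e. if an admission had more than one diagnosis row
merged = pd.merge(labs_toy, diag_toy, on=['PatientID', 'AdmissionID'],
                  how='inner', validate='many_to_one')

# A many-to-one inner merge that drops nothing keeps the left row count
rows_preserved = len(merged) == len(labs_toy)
print(merged.shape, rows_preserved)
```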

In [15]:
# Check for NaN values
patients_labs_diagnose.isna().sum()

PatientID                                  0
PatientGender                              0
PatientDateOfBirth                         0
PatientAge                                 0
PatientRace                                0
PatientMaritalStatus                       0
PatientLanguage                            0
PatientPopulationPercentageBelowPoverty    0
AdmissionID                                0
LabName                                    0
LabValue                                   0
LabUnits                                   0
LabDateTime                                0
PrimaryDiagnosisCode                       0
PrimaryDiagnosisDescription                0
dtype: int64

In [16]:
# Get 50 random samples from the merged dataframe
temp50 = patients_labs_diagnose.sample(n=50)
temp50

Unnamed: 0,PatientID,PatientGender,PatientDateOfBirth,PatientAge,PatientRace,PatientMaritalStatus,PatientLanguage,PatientPopulationPercentageBelowPoverty,AdmissionID,LabName,LabValue,LabUnits,LabDateTime,PrimaryDiagnosisCode,PrimaryDiagnosisDescription
9087142,2F8776EB-CB46-4D02-A750-9FEB717BA066,Male,1956-11-14 22:54:00.463,64,White,Single,Spanish,1.09,2,METABOLIC: SODIUM,128.9,mmol/L,1993-07-14 04:35:08.997,O9A.5,"Psychological abuse complicating pregnancy, ch..."
337754,45B7CF78-8ADF-4F45-A06A-07A35EB38023,Female,1946-09-30 01:02:46.130,74,Asian,Divorced,Icelandic,17.52,5,CBC: NEUTROPHILS,3.4,k/cumm,2012-02-29 04:38:19.167,D48.3,Neoplasm of uncertain behavior of retroperitoneum
5987398,ACA53276-8488-46EE-A045-BAB10C36B16A,Male,1921-06-15 20:16:05.117,99,White,Single,English,16.46,1,METABOLIC: AST/SGOT,32.9,U/L,1944-05-30 00:08:18.983,E10.351,Type 1 diabetes mellitus with proliferative di...
982043,06C76D98-CAAE-4B82-9CFD-93157B7BE320,Female,1941-03-10 16:20:34.077,79,White,Single,English,94.95,1,CBC: BASOPHILS,0.1,k/cumm,1961-04-09 22:42:24.010,L91,Hypertrophic disorders of skin
4092190,2625F76A-6CAE-4209-AF8D-64686C2BC2CC,Male,1928-11-06 21:14:09.123,92,White,Married,Spanish,15.93,5,CBC: LYMPHOCYTES,2.2,k/cumm,2011-04-16 04:00:51.730,M05.741,Rheumatoid arthritis with rheumatoid factor of...
8463648,F624163F-15E8-44D1-B1EA-FF95E6E43501,Female,1957-06-06 14:47:42.950,63,White,Married,English,11.22,1,METABOLIC: ALT/SGPT,28.9,U/L,1983-12-07 08:25:21.053,S37.812,Contusion of adrenal gland
8017080,052EB7AB-6253-448C-8D66-3E43E554BF80,Female,1969-05-08 15:05:54.533,51,Asian,Married,Unknown,16.81,3,URINALYSIS: WHITE BLOOD CELLS,0.4,wbc/hpf,2001-02-09 12:24:49.433,M06.372,"Rheumatoid nodule, left ankle and foot"
9945808,182AF252-4223-43E2-AF7B-799445A10A17,Male,1943-12-03 19:17:55.630,77,Asian,Married,Unknown,93.55,3,METABOLIC: TOTAL PROTEIN,9.4,gm/dL,1980-10-15 09:52:06.170,L60,Nail disorders
8758534,87BBEE4A-01E5-42B6-8E64-A6905158DC16,Male,1970-12-08 21:26:44.320,50,Unknown,Married,Spanish,14.23,3,CBC: NEUTROPHILS,1.7,k/cumm,2004-08-24 19:53:08.373,F11,Opioid related disorders
8720909,E018A6F9-2D48-464A-A123-CD3BD248EC08,Male,1962-08-09 22:18:33.860,58,White,Separated,Icelandic,18.74,4,METABOLIC: GLUCOSE,62.1,mg/dL,2005-10-04 15:12:09.560,O10.313,Pre-existing hypertensive heart and chronic ki...


## Drop unwanted columns

In [17]:
# drop columns
patients_labs_diagnose.drop(['PatientDateOfBirth', 'PatientMaritalStatus', 'PatientLanguage', 'PatientPopulationPercentageBelowPoverty', 'LabUnits', 'LabDateTime'], axis = 1, inplace=True)

In [18]:
patients_labs_diagnose.shape

(10726505, 9)

In [19]:
patients_labs_diagnose.head()

Unnamed: 0,PatientID,PatientGender,PatientAge,PatientRace,AdmissionID,LabName,LabValue,PrimaryDiagnosisCode,PrimaryDiagnosisDescription
0,3A3C2AFB-FFFA-4E69-B4E6-73C1245D5D12,Male,45,White,1,METABOLIC: AST/SGOT,31.5,C75.1,Malignant neoplasm of pituitary gland
1,3A3C2AFB-FFFA-4E69-B4E6-73C1245D5D12,Male,45,White,1,CBC: HEMOGLOBIN,16.2,C75.1,Malignant neoplasm of pituitary gland
2,3A3C2AFB-FFFA-4E69-B4E6-73C1245D5D12,Male,45,White,1,URINALYSIS: PH,6.3,C75.1,Malignant neoplasm of pituitary gland
3,3A3C2AFB-FFFA-4E69-B4E6-73C1245D5D12,Male,45,White,1,URINALYSIS: WHITE BLOOD CELLS,3.0,C75.1,Malignant neoplasm of pituitary gland
4,3A3C2AFB-FFFA-4E69-B4E6-73C1245D5D12,Male,45,White,1,CBC: HEMATOCRIT,40.9,C75.1,Malignant neoplasm of pituitary gland


## Encode lab values to 0/1

In [20]:
# Lab test name, their min and max values for male and female
# The minimum and maximum values were collected from the Internet

# format: LabName, Male(min, max), Female (min, max)

test_range =[ ['CBC: HEMATOCRIT', [38.3, 48.6], [35.5, 44.9]],                    
              ['METABOLIC: ANION GAP', [8, 12], [8, 12]],      
              ['CBC: LYMPHOCYTES', [20, 40], [20, 40]],              
              ['CBC: HEMOGLOBIN', [13.5, 17.5], [11.5, 15.5]],    
              ['METABOLIC: SODIUM', [135, 145], [135, 145]],             
              ['METABOLIC: ALBUMIN', [3.4, 5.4 ], [3.4, 5.4 ]],             
              ['METABOLIC: BUN', [6, 20], [6, 20]],             
              ['CBC: NEUTROPHILS', [40, 60], [40, 60]],             
              ['METABOLIC: CALCIUM', [8.5, 10.2], [8.5, 10.2]],       
              ['METABOLIC: GLUCOSE', [3.9, 5.6], [3.9, 5.6]],  # NOTE: this looks like a mmol/L range, while the sampled rows report glucose in mg/dL

              ['URINALYSIS: PH', [4.5, 8.0], [4.5, 8.0]],
              ['METABOLIC: BILI TOTAL', [0.1, 1.2], [0.1, 1.2]],
              ['METABOLIC: POTASSIUM', [38.3, 48.6], [35.5, 44.9]],
              ['URINALYSIS: RED BLOOD CELLS', [38.3, 48.6], [35.5, 44.9]],
              ['METABOLIC: CARBON DIOXIDE', [23, 29 ], [23, 29 ]],
              ['METABOLIC: CREATININE', [0.6, 1.3], [0.6, 1.3]],
              ['URINALYSIS: SPECIFIC GRAVITY', [1.002, 1.030], [1.002, 1.030]],
              ['CBC: MEAN CORPUSCULAR VOLUME', [80,100], [80,100]],
              ['METABOLIC: CHLORIDE', [96, 106], [96, 106]],    
              ['METABOLIC: ALT/SGPT', [4, 36], [4, 36]],
             
              ['METABOLIC: AST/SGOT', [8, 33], [8, 33]],
              ['METABOLIC: ALK PHOS', [20, 130], [20, 130]],
              ['CBC: EOSINOPHILS', [0.03, 0.5], [0.03, 0.5]],
              ['CBC: ABSOLUTE NEUTROPHILS', [1.6, 6.5], [1.6, 6.5]],
              ['CBC: MCH', [25.4, 34.6], [25.4, 34.6]],
              ['URINALYSIS: WHITE BLOOD CELLS', [38.3, 48.6], [35.5, 44.9]],
              ['CBC: ABSOLUTE LYMPHOCYTES', [1.0, 3.1], [1.0, 3.1]],
              ['CBC: PLATELET COUNT', [150, 400], [150, 400]],
              ['CBC: RED BLOOD CELL COUNT', [4.7, 6.1], [4.2, 5.4]],
              ['CBC: WHITE BLOOD CELL COUNT', [4.5, 11.0], [4.5, 11.0]],
              ['CBC: RDW', [39, 46], [39, 46]],

              ['CBC: MCHC', [31, 36], [31, 36]],
              ['CBC: MONOCYTES', [0.3, 0.8], [0.3, 0.8]],
              ['METABOLIC: TOTAL PROTEIN', [6.0, 8.3], [6.0, 8.3]],
              ['CBC: BASOPHILS', [0.01, 0.08], [0.01, 0.08]]                   
      ]

test_range = sorted(test_range)
#test_range

The above ranges were collected from the URLs below:
 - https://www.labtestzote.com/complete-blood-count-the-reference-range-in-male-and-females/
 - https://medlineplus.gov/ency/article/003644.htm
 - https://www.ucsfhealth.org/medical-tests/wbc-count
 - https://www.ucsfhealth.org/medical-tests/comprehensive-metabolic-panel
 - METABOLIC: POTASSIUM, URINALYSIS: RED BLOOD CELLS, URINALYSIS: WHITE BLOOD CELLS: the values in the table are random placeholders, not collected reference ranges
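
Reference ranges are only meaningful in the unit the lab reports. The sampled rows earlier in this notebook show METABOLIC: GLUCOSE in mg/dL, while 3.9 to 5.6 looks like a range in mmol/L. Assuming that reading is right, the sketch below converts the range using the molar mass of glucose (about 180.16 g/mol). The helper name `mmol_l_to_mg_dl` is hypothetical.

```python
# Molar mass of glucose is about 180.16 g/mol, so 1 mmol/L is about 18.016 mg/dL
MG_DL_PER_MMOL_L = 18.016

def mmol_l_to_mg_dl(value):
    """Convert a glucose concentration from mmol/L to mg/dL."""
    return value * MG_DL_PER_MMOL_L

low, high = 3.9, 5.6  # range used in test_range for METABOLIC: GLUCOSE
low_mg_dl = round(mmol_l_to_mg_dl(low), 1)
high_mg_dl = round(mmol_l_to_mg_dl(high), 1)
print(low_mg_dl, high_mg_dl)
```

A similar unit check may be worthwhile for the other labs, for example those reported in k/cumm.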

In [21]:
# get the name and value column
labname_value = pd.DataFrame(patients_labs_diagnose[['LabName', 'LabValue']])
uniquelab = labname_value['LabName'].unique()

# sort the lab name values by name
uniquelab = sorted(uniquelab)

In [22]:
# separate male and female records (labname, gender, min, max)
labvalues_range_male=[]
labvalues_range_female=[]
for i in range(len(uniquelab)):
  labvalues_range_male.append([uniquelab[i], 'Male', test_range[i][1][0], test_range[i][1][1] ])  # lab, gender, min, max
  labvalues_range_female.append([uniquelab[i], 'Female', test_range[i][2][0], test_range[i][2][1] ])

In [23]:
# print a single record for each
print(labvalues_range_male[4])
print(labvalues_range_female[4])

['CBC: HEMATOCRIT', 'Male', 38.3, 48.6]
['CBC: HEMATOCRIT', 'Female', 35.5, 44.9]
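
The loop above pairs `uniquelab[i]` with `test_range[i]` by position, which works only while both sorted lists contain exactly the same lab names. A minimal sketch of a name-keyed alternative follows. The toy lists and the `missing` list are hypothetical and only illustrate the idea.

```python
# Hypothetical subset of test_range, in the same [name, male_range, female_range] format
test_range_toy = [['CBC: HEMOGLOBIN', [13.5, 17.5], [11.5, 15.5]],
                  ['CBC: HEMATOCRIT', [38.3, 48.6], [35.5, 44.9]]]
uniquelab_toy = ['CBC: HEMATOCRIT', 'CBC: HEMOGLOBIN', 'CBC: UNKNOWN TEST']

# Key the ranges by lab name so the pairing does not depend on sort order
ranges_by_name = {name: (male, female) for name, male, female in test_range_toy}

male_rows, female_rows, missing = [], [], []
for lab in uniquelab_toy:
    if lab not in ranges_by_name:
        missing.append(lab)          # report labs that have no reference range
        continue
    male, female = ranges_by_name[lab]
    male_rows.append([lab, 'Male', male[0], male[1]])
    female_rows.append([lab, 'Female', female[0], female[1]])

print(male_rows[0], missing)
```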


In [24]:
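# assign to df (this is a reference to patients_labs_diagnose, not a copy, so the encoding below also changes patients_labs_diagnose)
df = patients_labs_diagnose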

In [25]:
# encode LabValue as 1 when it lies within the [min, max] range for the patient's Gender and LabName, else 0

def encodeLabValues(labvalues_range_male, labvalues_range_female, df):
  # collect in-range rows in a boolean mask, so that a raw value of exactly 1.0
  # is never mistaken for an already encoded value
  in_range = pd.Series(False, index=df.index)
  for name, gender, minValue, maxValue in labvalues_range_male + labvalues_range_female:
    in_range |= ((df.PatientGender==gender) & (df.LabName==name) &
                 (df.LabValue>=minValue) & (df.LabValue<=maxValue))

  # in-range values become 1, all other values become 0 (df is modified in place)
  df['LabValue'] = in_range.astype(float)
  return df

In [26]:
# get LabValue encoded as 0/1 for each LabName
enccode_df = encodeLabValues(labvalues_range_male, labvalues_range_female, df)
enccode_df.sample(n=10) # get 10 random records to cross-check that the lab values are encoded properly

Unnamed: 0,PatientID,PatientGender,PatientAge,PatientRace,AdmissionID,LabName,LabValue,PrimaryDiagnosisCode,PrimaryDiagnosisDescription
4439469,827E7C9B-8F84-4293-8C3C-BD6F80A0B3DF,Female,79,White,4,METABOLIC: CHLORIDE,1.0,D41.21,Neoplasm of uncertain behavior of right ureter
415229,F0363379-D0A4-4F17-9690-8C4A57B2BC50,Female,98,White,1,URINALYSIS: SPECIFIC GRAVITY,1.0,T82.110,Breakdown (mechanical) of cardiac electrode
6318982,275B442B-ABBA-4569-9D5E-F373258AFD4B,Male,63,White,3,CBC: RED BLOOD CELL COUNT,1.0,O99.284,"Endocrine, nutritional and metabolic diseases ..."
8825753,853B78C4-E619-4543-B6CA-9CE373AC8D33,Male,57,Asian,1,CBC: RDW,0.0,E09,Drug or chemical induced diabetes mellitus
2241712,6C9000F1-5D82-4CEF-865E-5095D93C9E7E,Male,50,Asian,1,METABOLIC: CREATININE,1.0,D33.4,Benign neoplasm of spinal cord
10418059,98C59B67-E9D9-4809-9EB0-293F4D898696,Female,53,African American,1,CBC: BASOPHILS,0.0,M63.842,Disorders of muscle in diseases classified els...
3384162,93DD56A5-C312-4030-AA15-9174589B2BEB,Male,58,White,2,CBC: ABSOLUTE LYMPHOCYTES,0.0,N00.6,Acute nephritic syndrome with dense deposit di...
3976369,01008EF7-0DF9-4455-A522-FEBE365414A9,Male,68,White,2,CBC: MCHC,0.0,J84.842,Pulmonary interstitial glycogenosis
5861277,5E29B37F-7275-4BD7-A500-7F20727105A9,Female,95,Unknown,2,URINALYSIS: SPECIFIC GRAVITY,1.0,O9A.512,"Psychological abuse complicating pregnancy, se..."
3706247,4E9B83E9-7E4D-4641-BF77-C5A74ED271D5,Male,64,Unknown,4,CBC: PLATELET COUNT,1.0,O46.02,Antepartum hemorrhage with disseminated intrav...
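
The range check can also be expressed as a join against a reference table instead of a loop over lab names. This is a minimal sketch on hypothetical toy data, not the method used in this notebook. The column names `min_value`, `max_value` and `LabEncoded` are assumptions introduced for illustration.

```python
import pandas as pd

# Hypothetical long-format reference table: one row per (LabName, PatientGender)
ranges = pd.DataFrame({
    'LabName':       ['CBC: HEMATOCRIT', 'CBC: HEMATOCRIT'],
    'PatientGender': ['Male', 'Female'],
    'min_value':     [38.3, 35.5],
    'max_value':     [48.6, 44.9],
})

toy = pd.DataFrame({
    'PatientGender': ['Male', 'Female', 'Male'],
    'LabName':       ['CBC: HEMATOCRIT'] * 3,
    'LabValue':      [40.9, 50.0, 1.0],
})

# Attach each row's range, then test the raw value against it in one step
joined = toy.merge(ranges, on=['LabName', 'PatientGender'], how='left')
in_range = joined['LabValue'].between(joined['min_value'], joined['max_value'])
joined['LabEncoded'] = in_range.astype(int)
print(joined[['PatientGender', 'LabValue', 'LabEncoded']])
```

Writing the result to a separate column keeps the raw LabValue available, and a raw value of exactly 1.0 that is out of range is encoded as 0.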


## Convert Gender and Race to one-hot 

In [27]:
orig=enccode_df.copy()

In [28]:
# convert Gender to one-hot vector
one_hot_gender = pd.get_dummies(enccode_df['PatientGender'], prefix='gender')

In [29]:
orig=orig.join(one_hot_gender)
orig.head()

Unnamed: 0,PatientID,PatientGender,PatientAge,PatientRace,AdmissionID,LabName,LabValue,PrimaryDiagnosisCode,PrimaryDiagnosisDescription,gender_Female,gender_Male
0,3A3C2AFB-FFFA-4E69-B4E6-73C1245D5D12,Male,45,White,1,METABOLIC: AST/SGOT,1.0,C75.1,Malignant neoplasm of pituitary gland,0,1
1,3A3C2AFB-FFFA-4E69-B4E6-73C1245D5D12,Male,45,White,1,CBC: HEMOGLOBIN,1.0,C75.1,Malignant neoplasm of pituitary gland,0,1
2,3A3C2AFB-FFFA-4E69-B4E6-73C1245D5D12,Male,45,White,1,URINALYSIS: PH,1.0,C75.1,Malignant neoplasm of pituitary gland,0,1
3,3A3C2AFB-FFFA-4E69-B4E6-73C1245D5D12,Male,45,White,1,URINALYSIS: WHITE BLOOD CELLS,0.0,C75.1,Malignant neoplasm of pituitary gland,0,1
4,3A3C2AFB-FFFA-4E69-B4E6-73C1245D5D12,Male,45,White,1,CBC: HEMATOCRIT,1.0,C75.1,Malignant neoplasm of pituitary gland,0,1


In [30]:
# convert Race to one-hot vector
one_hot_PatientRace = pd.get_dummies(enccode_df['PatientRace'], prefix = 'Race')
orig=orig.join(one_hot_PatientRace)

# Drop original Gender and race column
orig = orig.drop(['PatientGender', 'PatientRace'], axis = 1)
orig.head()

Unnamed: 0,PatientID,PatientAge,AdmissionID,LabName,LabValue,PrimaryDiagnosisCode,PrimaryDiagnosisDescription,gender_Female,gender_Male,Race_African American,Race_Asian,Race_Unknown,Race_White
0,3A3C2AFB-FFFA-4E69-B4E6-73C1245D5D12,45,1,METABOLIC: AST/SGOT,1.0,C75.1,Malignant neoplasm of pituitary gland,0,1,0,0,0,1
1,3A3C2AFB-FFFA-4E69-B4E6-73C1245D5D12,45,1,CBC: HEMOGLOBIN,1.0,C75.1,Malignant neoplasm of pituitary gland,0,1,0,0,0,1
2,3A3C2AFB-FFFA-4E69-B4E6-73C1245D5D12,45,1,URINALYSIS: PH,1.0,C75.1,Malignant neoplasm of pituitary gland,0,1,0,0,0,1
3,3A3C2AFB-FFFA-4E69-B4E6-73C1245D5D12,45,1,URINALYSIS: WHITE BLOOD CELLS,0.0,C75.1,Malignant neoplasm of pituitary gland,0,1,0,0,0,1
4,3A3C2AFB-FFFA-4E69-B4E6-73C1245D5D12,45,1,CBC: HEMATOCRIT,1.0,C75.1,Malignant neoplasm of pituitary gland,0,1,0,0,0,1
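
The two `get_dummies` / `join` / `drop` steps above can be combined into a single call. The sketch below shows this on a hypothetical two-row frame. `dtype=int` is passed so that the indicator columns are 0/1 integers, as in the output above.

```python
import pandas as pd

# Hypothetical two-row frame with the same categorical columns
toy = pd.DataFrame({'PatientAge': [45, 74],
                    'PatientGender': ['Male', 'Female'],
                    'PatientRace': ['White', 'Asian']})

# Encode both columns at once; the original columns are replaced by the indicators
encoded = pd.get_dummies(toy, columns=['PatientGender', 'PatientRace'],
                         prefix=['gender', 'Race'], dtype=int)
print(encoded)
```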


## Prepare Y

In [31]:
original2=orig.copy()

In [63]:
# select the rows for final training and testing: PrimaryDiagnosisCode starting with a diabetes prefix (E08, E09, E10, E11)
diabetes1 = original2[(original2['PrimaryDiagnosisCode'].str.startswith('E08')) | 
                      (original2['PrimaryDiagnosisCode'].str.startswith('E09')) |
                      (original2['PrimaryDiagnosisCode'].str.startswith('E10')) |
                      (original2['PrimaryDiagnosisCode'].str.startswith('E11'))
                       ]

codes = diabetes1['PrimaryDiagnosisCode'].unique()
final_data = original2.loc[original2['PrimaryDiagnosisCode'].isin(codes)]

In [64]:
final_data.sample(10)

Unnamed: 0,PatientID,PatientAge,AdmissionID,LabName,LabValue,PrimaryDiagnosisCode,PrimaryDiagnosisDescription,gender_Female,gender_Male,Race_African American,Race_Asian,Race_Unknown,Race_White
2261551,F68950BF-FFB5-4DEE-86D8-A3A2F5DF47BC,58,1,METABOLIC: CHLORIDE,1.0,E11.64,Type 2 diabetes mellitus with hypoglycemia,1,0,0,0,0,1
1711302,BCD190C2-4016-4870-8A1B-E803DCE56B20,73,4,CBC: EOSINOPHILS,1.0,E10.2,Type 1 diabetes mellitus with kidney complicat...,1,0,0,1,0,0
5839313,6BE68512-0245-41DB-9901-FEFC07233374,55,2,CBC: NEUTROPHILS,0.0,E08.349,Diabetes mellitus due to underlying condition ...,0,1,0,1,0,0
3251265,CC6BFFB5-4920-4D74-8080-C3D08159C2F6,72,3,URINALYSIS: WHITE BLOOD CELLS,0.0,E09,Drug or chemical induced diabetes mellitus,1,0,1,0,0,0
1406175,4AB007D8-EB13-49BF-9FB0-EF87EFB32131,79,4,METABOLIC: CALCIUM,0.0,E09.630,Drug or chemical induced diabetes mellitus wit...,0,1,0,0,0,1
389260,A6F66055-7850-4AC2-A2B3-6EA4BB8B6B1E,51,4,CBC: WHITE BLOOD CELL COUNT,0.0,E09.351,Drug or chemical induced diabetes mellitus wit...,1,0,0,1,0,0
897202,B51A9A51-A174-4586-9057-59D52FDDF999,97,4,URINALYSIS: SPECIFIC GRAVITY,1.0,E10.21,Type 1 diabetes mellitus with diabetic nephrop...,0,1,0,1,0,0
4174238,FC2058D4-F6C3-4B41-83BD-5272CF59E366,61,4,METABOLIC: BUN,1.0,E09.1,Drug or chemical induced diabetes mellitus wit...,1,0,1,0,0,0
3522064,3E9E05E8-1515-4F4F-9A06-C29289430FCC,37,3,METABOLIC: GLUCOSE,0.0,E10.44,Type 1 diabetes mellitus with diabetic amyotrophy,1,0,0,0,0,1
2176912,B9F48EC2-72E4-474C-8376-4FC9F859DC41,53,2,CBC: BASOPHILS,0.0,E10.351,Type 1 diabetes mellitus with proliferative di...,0,1,0,1,0,0
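
The four chained `startswith` conditions select codes by their three-character prefix. A compact equivalent, sketched here on a hypothetical list of codes, compares the first three characters against a list of prefixes.

```python
import pandas as pd

# Hypothetical codes; the prefixes are the ones used in the cell above
codes_toy = pd.Series(['E08.349', 'E11.64', 'C75.1', 'E09', 'L91'])
diabetes_prefixes = ['E08', 'E09', 'E10', 'E11']

# Compare the first three characters of each code against the prefix list
is_diabetes = codes_toy.str[:3].isin(diabetes_prefixes)
print(codes_toy[is_diabetes].tolist())
```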


In [67]:
# separate truth values (codes)
truth_data = final_data['PrimaryDiagnosisCode']

# Drop original diagnosiscode column and others from final data
final_data = final_data.drop(['PrimaryDiagnosisCode', 'PatientID', 'AdmissionID', 'PrimaryDiagnosisDescription', 'LabName'], axis = 1)
final_data.head()

Unnamed: 0,PatientAge,LabValue,gender_Female,gender_Male,Race_African American,Race_Asian,Race_Unknown,Race_White
754,56,0.0,0,1,0,0,0,1
755,56,1.0,0,1,0,0,0,1
756,56,1.0,0,1,0,0,0,1
757,56,0.0,0,1,0,0,0,1
758,56,0.0,0,1,0,0,0,1


## Divide the data into training and testing

In [68]:
import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(final_data, truth_data, test_size=0.33, random_state=42)

In [69]:
X_train.shape, y_train.shape

((456970, 8), (456970,))

In [70]:
X_test.shape, y_test.shape

((225075, 8), (225075,))
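
With 170 classes, a purely random split may leave rare codes unevenly represented between the two parts. One option, which is an assumption here and not what this notebook does, is the `stratify` argument of `train_test_split`. A minimal sketch on hypothetical toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: three classes with 30, 30 and 6 samples
X_toy = np.arange(66).reshape(-1, 1)
y_toy = np.array(['E09'] * 30 + ['E10'] * 30 + ['E11'] * 6)

# stratify=y_toy keeps the class proportions roughly equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy)

print(np.unique(y_tr, return_counts=True))
print(np.unique(y_te, return_counts=True))
```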

# Train models on the dataset

### Train a DecisionTreeClassifier

In [76]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
# training a DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

model_dt = DecisionTreeClassifier(max_depth = 40).fit(X_train, y_train)
model_predicted = model_dt.predict(X_test)

print('Train accuracy', model_dt.score(X_train, y_train))
print('Test accuracy', accuracy_score(y_test, model_predicted))

Train accuracy 0.3106746613563254
Test accuracy 0.3059868932578029
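
An accuracy figure is easier to interpret next to a trivial baseline. The sketch below uses scikit-learn's `DummyClassifier`, which always predicts the most frequent class. The toy labels are hypothetical and the score shown is not a result for this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical imbalanced labels: 6 x E09, 3 x E10, 1 x E11
y_toy = np.array(['E09'] * 6 + ['E10'] * 3 + ['E11'] * 1)
X_toy = np.zeros((len(y_toy), 1))      # features are ignored by this baseline

baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)
print('Baseline accuracy', baseline_acc)
```

Fitting the same baseline on `X_train`, `y_train` and scoring it on `X_test`, `y_test` would show how much the decision tree gains over always guessing the most common code.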


In [77]:
# save model (the with-block closes the file after writing)
import pickle
with open('model_DT.pk', 'wb') as f:
    pickle.dump(model_dt, f)
print("model saved")

model saved
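
For completeness, a saved model can be restored with `pickle.load`. The sketch below round-trips a hypothetical toy tree through a temporary file. It stands in for `model_dt`, and the path is an assumption. Pickle files should only be loaded from trusted sources.

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy model standing in for model_dt
X_toy = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_toy = ['E09', 'E09', 'E11', 'E11']
model = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Write the model to a temporary file, then read it back
path = os.path.join(tempfile.mkdtemp(), 'model_DT.pk')
with open(path, 'wb') as f:
    pickle.dump(model, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored.predict([[1, 0]]))
```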


In [89]:
# predict on a single sample
newsample = X_test[0:1]
predicted = model_dt.predict(newsample)
print('Model predicted class', str(predicted))

# get PrimaryDiagnosisDescription based on the predicted code
result = enccode_df.loc[enccode_df['PrimaryDiagnosisCode']==predicted[0]]
result['PrimaryDiagnosisDescription'][0:1]

Model predicted class ['E09.4']


645166    Drug or chemical induced diabetes mellitus wit...
Name: PrimaryDiagnosisDescription, dtype: object

### Train a Neural network

In [100]:
# Encode truth values
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
code_dd = label_encoder.fit_transform(truth_data)
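
`LabelEncoder` maps each distinct code to an integer in sorted order, and `inverse_transform` maps integers back to codes. A minimal sketch with hypothetical codes:

```python
from sklearn import preprocessing

# Hypothetical diagnosis codes
codes_toy = ['E10.2', 'E09', 'E11.64', 'E09']

encoder = preprocessing.LabelEncoder()
encoded = encoder.fit_transform(codes_toy)

# Classes are stored sorted; the integer label is the position in that list
print(list(encoder.classes_), list(encoded))
print(encoder.inverse_transform([0]))
```

The same `inverse_transform` call can turn a predicted class index from the network back into a PrimaryDiagnosisCode.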

In [102]:
import numpy as np
# Import Keras modules
from keras import models
from keras import layers
#from keras.utils import to_categorical
from tensorflow.keras.utils import to_categorical
#
# Create the network
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(8,)))
network.add(layers.Dense(170, activation='softmax'))
#
# Compile the network#
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

X_train, X_test, y_train, y_test = train_test_split(final_data, code_dd, test_size=0.33, random_state=42)

# Create categorical labels (num_classes matches the 170 units of the output layer)
train_labels = to_categorical(y_train, num_classes=170)
test_labels = to_categorical(y_test, num_classes=170)
#
# train the neural network
network.fit(X_train, train_labels, epochs=20, batch_size=40)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f5554215390>

In [None]:
# Get test accuracy (categorical_crossentropy expects the one-hot labels, not the integer codes)
test_loss, test_acc = network.evaluate(X_test, test_labels)
print('Test Accuracy: ', test_acc, '\nTest Loss: ', test_loss)
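
The relationship between integer labels, one-hot labels and softmax outputs can be sketched with NumPy alone. The class count and probabilities below are hypothetical. `np.eye(num_classes)[y_int]` produces the same kind of one-hot matrix as `to_categorical`.

```python
import numpy as np

num_classes = 4                        # hypothetical; the network above uses 170
y_int = np.array([2, 0, 3])            # integer codes as produced by a label encoder

# One-hot encode: row i has a 1 in column y_int[i]
one_hot = np.eye(num_classes)[y_int]

# Hypothetical softmax outputs for three samples
probs = np.array([[0.1, 0.2, 0.6, 0.1],
                  [0.7, 0.1, 0.1, 0.1],
                  [0.3, 0.4, 0.2, 0.1]])

predicted = probs.argmax(axis=1)       # most probable class per sample
accuracy = (predicted == y_int).mean()
print(one_hot.shape, predicted, accuracy)
```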

In [None]:
#https://datascienceparichay.com/article/pandas-merge-dataframes-on-multiple-columns/
#https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html