<a href="https://colab.research.google.com/github/rhiats/diab_sev_mimicIII/blob/main/Analysis_of_Patients_with_Type_2_Diabetes_in_ICU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Stratification Analysis of Patients with Type 2 Diabetes in ICU**

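The cohort is patients diagnosed with diabetes.

Diabetic patients can be identified either by ICD-9 code (category 250) or by searching the Admissions diagnosis text for 'diabet...'. This notebook uses the Admissions text search.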

- Admissions - breakdown of insurance type (one hot encode)
- Admissions - gender (one hot encode)
- Prescriptions - top 5 drugs (one hot encode)
- Prescriptions - distribution of the number of drugs per patient

http://www.icd9data.com/2014/Volume1/240-279/249-259/250/250.00.htm

https://physionet.org/content/mimic3-carevue/1.4/

https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
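As a small illustration of the one-hot encoding mentioned in the list above, the sketch below applies `pd.get_dummies` to a toy table. The rows are made up for illustration and are not drawn from MIMIC-III; only the column names mirror the ones used later in this notebook.

```python
import pandas as pd

# Toy rows standing in for the merged patient table (values are made up).
toy = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "insurance": ["Medicare", "Private", "Medicaid"],
})

# One indicator column per category level; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["gender", "insurance"], dtype=int)
print(encoded)
```

Each row ends up with exactly one 1 per original column, which is why Euclidean distance on such data mostly measures how many categories differ.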

In [1]:
!pip install kmodes
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from kmodes.kprototypes import KPrototypes




In [None]:
df_admissions=pd.read_csv('/content/drive/MyDrive/mimic-iii-clinical-database-carevue-subset-1.4/ADMISSIONS.csv.gz', compression='gzip')
df_patients=pd.read_csv('/content/drive/MyDrive/mimic-iii-clinical-database-carevue-subset-1.4/PATIENTS.csv.gz', compression='gzip')
df_presciptions=pd.read_csv('/content/drive/MyDrive/mimic-iii-clinical-database-carevue-subset-1.4/PRESCRIPTIONS.csv.gz', compression='gzip')

**Identify Diabetic Patients**

In [None]:
def diabetetesFind(x):
  """
    Convert a value to a lowercase string so it can be searched case-insensitively.
    @p: a single value from a column such as 'diagnosis'
    @r: the value as a lowercase string
  """
  x=str(x)
  x=x.lower()
  return x

df_admissions['diagnosis_lower']=df_admissions['diagnosis'].apply(diabetetesFind)
df_diab=df_admissions[df_admissions['diagnosis_lower'].str.contains(pat = 'diabet')]

num_patients=df_diab['subject_id'].nunique()
'There are {} patients in the ICU diagnosed with diabetes.'.format(num_patients)
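The introduction also mentions ICD-9 category 250. As a hedged alternative to the free-text search, the sketch below filters diagnosis codes on a toy table. It assumes codes are stored as strings without the decimal point (for example `25000` for 250.00) in a column called `icd9_code`; both the column name and the rows are assumptions for illustration, not part of this notebook's data loading.

```python
import pandas as pd

# Toy diagnosis rows; 'icd9_code' is an assumed column name and the values are made up.
toy_dx = pd.DataFrame({
    "subject_id": [1, 1, 2, 3, 4],
    "icd9_code": ["25000", "4019", "25001", "25062", "V5867"],
})

codes = toy_dx["icd9_code"].astype(str)
# Category 250 is diabetes mellitus; a fifth digit of 0 or 2 marks type 2 (or unspecified type),
# while 1 or 3 marks type 1.
is_type2 = codes.str.match(r"^250\d[02]$")
type2_ids = sorted(toy_dx.loc[is_type2, "subject_id"].unique().tolist())
print(type2_ids)
```

A code-based filter like this separates type 1 from type 2, which the text search for 'diabet' cannot do.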

**Distribution of Patient Gender**

In [None]:
sub_id_diab=list(set(df_diab['subject_id'].to_list()))

df_gender=df_patients[df_patients['subject_id'].isin(sub_id_diab)][['subject_id','gender']]

fem_mal_df=df_gender.groupby(['gender']).count().reset_index()

fem_mal_df.rename(columns={"subject_id": "Count","gender": "Gender"}, inplace=True)

sns.barplot(data=fem_mal_df, x='Gender', y="Count", palette="deep")

plt.savefig("/content/drive/MyDrive/output/gender.png")

**Distribution of Insurance Types for Diabetic Patients**

In [None]:
df_insur_diab=df_admissions[df_admissions['subject_id'].isin(sub_id_diab)][['subject_id','insurance']]
df_insur_diab.drop_duplicates(subset=['subject_id','insurance'],inplace=True)

insur_df_cnt=df_insur_diab.groupby(['insurance']).nunique().reset_index()

insur_df_cnt.rename(columns={"subject_id": "Count","insurance": "Insurance"}, inplace=True)

sns.barplot(data=insur_df_cnt, x='Insurance', y="Count", palette="deep")

plt.savefig("/content/drive/MyDrive/output/insurance.png")

**Top 5 Drugs used by Diabetic Patients in the ICU**

In [None]:
df_drug_diab=df_presciptions[df_presciptions['subject_id'].isin(sub_id_diab)][['subject_id','drug']]
df_drug=df_drug_diab.groupby(['drug']).nunique().reset_index()
df_drug.rename(columns={"subject_id": "Count","drug": "Drug"}, inplace=True)
df_top_5_drugs=df_drug.nlargest(5, 'Count')

sns.barplot(data=df_top_5_drugs, x='Drug', y="Count", palette="deep")
plt.xticks(rotation = 25)

plt.savefig("/content/drive/MyDrive/output/top5Drugs.png")

**Distribution of Medications per Person**

In [None]:
df_drug_per_patient=df_drug_diab.groupby(['subject_id']).nunique().reset_index()
df_drug_per_patient.rename(columns={"drug": "Number of Medications"}, inplace=True)
print(df_drug_per_patient.nlargest(5, 'Number of Medications'))

# No hue variable is set, so a palette has no effect here and is omitted
sns.histplot(data=df_drug_per_patient, x="Number of Medications")

plt.savefig("/content/drive/MyDrive/output/drugs_per_patient.png")

In [None]:
drugs_19213_ser=df_presciptions[df_presciptions['subject_id']==19213]['drug']
drugs_19213_ser_lst=list(set(drugs_19213_ser.to_list()))
"Patient 19213 is on {} unique medications while in the ICU".format(len(drugs_19213_ser_lst))

**Feature Matrix for PCA**

For patients who switch insurance, the most frequently used insurance type is considered. When a patient switches once and both types are used equally often, each instance is recorded separately. In this dataset, patients switch insurance only once, so both instances are kept in the analysis. Some patients have no prescriptions in the system.

The features considered are:
- Number of Medications a patient is using during treatment
- Gender (Female/Male)
- Insurance (Government, Medicaid, Medicare, Private, Self-Pay)
- Medication (Other, Heparin, Insulin, Magnesium Sulfate, Normal Saline (NS), Potassium Chloride)
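The insurance rule described above (keep the most-used type, and keep both when they tie) can be sketched on toy data as follows. The rows and the `n_admissions` column name are made up for illustration; this is one possible way to express the rule, not necessarily how the cell below implements it.

```python
import pandas as pd

# Toy admissions (made-up values): patient 1 mostly uses Medicare,
# patient 2 uses two insurers equally, patient 3 never switches.
toy_adm = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2, 3],
    "insurance": ["Medicare", "Medicare", "Private", "Medicaid", "Private", "Private"],
})

# Count admissions per (patient, insurance) pair.
counts = (toy_adm.groupby(["subject_id", "insurance"])
          .size()
          .reset_index(name="n_admissions"))

# Keep every insurance type that reaches the patient's maximum count,
# so a tie keeps both rows.
max_per_patient = counts.groupby("subject_id")["n_admissions"].transform("max")
most_used = counts[counts["n_admissions"] == max_per_patient].reset_index(drop=True)
print(most_used)
```

Patient 2 appears twice in the result, which matches the choice to register each equally used insurance type as a separate instance.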

In [None]:
# Combine gender and insurance per patient; the outer join keeps patients missing from either table
gend_insur_diab=df_gender.merge(df_insur_diab, on='subject_id',how='outer')

switch_insurance_df=gend_insur_diab[gend_insur_diab.duplicated(subset=['subject_id','gender'],keep=False)]

freq_switch_insurance_df=switch_insurance_df.groupby(["insurance","subject_id"]).count().reset_index()
# Rename the count column (rename needs columns=, otherwise it targets index labels)
freq_switch_insurance_df.rename(columns={"gender": "Frequency"}, inplace=True)

max_switch_insurance_df=freq_switch_insurance_df.groupby(["insurance","subject_id"]).max().reset_index()

gend_insur_drug_diab=gend_insur_diab.merge(df_drug_diab, on='subject_id',how='outer')

gend_insur_drug_diab['drug']=gend_insur_drug_diab['drug'].apply(diabetetesFind)

gend_insur_drug_diab.drop_duplicates(subset=['subject_id','gender','insurance','drug'],inplace=True)

gend_insur_drug_diab.loc[~gend_insur_drug_diab.drug.isin(['insulin','ns','potassium chloride', 'heparin','magnesium sulfate']), 'drug'] = 'Other'

gend_insur_drug_diab['drug'].unique()

X_df=gend_insur_drug_diab.merge(df_drug_per_patient, on='subject_id',how='outer')

X_df.drop(columns=['subject_id'],axis=1, inplace=True)

X_df=X_df.fillna(0)

X=X_df.to_numpy()

#one_hot_encoded_data = pd.get_dummies(gend_insur_drug_diab_per_p_df, columns = ['gender', 'insurance','drug'])


**Model Selection**

While I originally considered segmenting the patients using PCA to calculate a severity score, I decided to use a mixed clustering model because the features are predominantly categorical. PCA works best on continuous data, where it can find the directions of maximum variance. K-Means relies on Euclidean distance, which does not work well with one-hot encoded data. K-Modes instead compares categories by counting mismatches and represents each cluster by its most frequent values (modes). K-Prototypes combines the two approaches, so it handles both the categorical and the continuous features.
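To make the mixed distance concrete, here is a minimal sketch of the dissimilarity commonly associated with K-Prototypes: squared Euclidean distance on the numeric features plus a weight `gamma` times the number of categorical mismatches. The function name and the example values are hypothetical, and the `kmodes` library may differ in details such as how it picks `gamma` by default.

```python
import numpy as np

def mixed_dissimilarity(num_a, num_b, cat_a, cat_b, gamma=1.0):
    """Squared Euclidean distance on numeric parts plus gamma times the categorical mismatches."""
    diff = np.asarray(num_a, dtype=float) - np.asarray(num_b, dtype=float)
    numeric_part = float(np.sum(diff ** 2))
    # Simple matching: each differing category contributes 1.
    categorical_part = sum(a != b for a, b in zip(cat_a, cat_b))
    return numeric_part + gamma * categorical_part

# Two made-up patients: 12 vs 10 medications, differing only in insurance.
d = mixed_dissimilarity([12], [10],
                        ["F", "Medicare", "insulin"],
                        ["F", "Private", "insulin"],
                        gamma=0.5)
print(d)
```

The weight `gamma` controls how much a category mismatch counts relative to a difference in the numeric feature, so an unscaled numeric column with a wide range can dominate the clustering.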

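The Elbow method is used to select the number of clusters. The clustering cost (the total distance of each point to its cluster centroid) is computed for a range of cluster counts, and the count at which the cost curve flattens is chosen. The optimal number of clusters here is 5.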
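Reading the elbow off a plot is a judgment call. One possible heuristic, sketched below, picks the point on the normalized cost curve that lies farthest from the straight line joining its first and last points. The function name is hypothetical and the cost values are made-up numbers, not output from this notebook.

```python
import numpy as np

def elbow_point(ks, costs):
    """Return the k whose (k, cost) point is farthest from the chord between the curve's endpoints."""
    ks = np.asarray(ks, dtype=float)
    costs = np.asarray(costs, dtype=float)
    # Scale both axes to [0, 1] so neither dominates the distance.
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (costs - costs.min()) / (costs.max() - costs.min())
    # Perpendicular distance of each point to the line through the first and last points.
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    dist = np.abs(dx * (y - y[0]) - dy * (x - x[0])) / np.hypot(dx, dy)
    return int(ks[np.argmax(dist)])

# Illustrative cost curve (made up): steep drop up to k=3, then nearly flat.
toy_ks = [1, 2, 3, 4, 5, 6]
toy_costs = [100, 60, 30, 25, 22, 20]
best_k = elbow_point(toy_ks, toy_costs)
print(best_k)
```

A heuristic like this is a cross-check on the visual reading rather than a replacement for it, since K-Prototypes costs vary between random initializations.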

https://medium.com/@tarek.tm/how-to-handle-categorical-variables-in-clustering-1daa3b05bf25

https://medium.com/analytics-vidhya/clustering-on-mixed-data-types-in-python-7c22b3898086

https://medium.com/@keswani-rohitkumar/k-prototypes-clustering-algorithm-f5d8e09a0104

https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/

In [None]:
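def elbow(df):
  """
    Fit K-Prototypes for 2 to 10 clusters and record the cost of each fit.
    @p: feature matrix whose first three columns are categorical
    @r: list of cluster counts, list of matching costs
  """
  cluster_num, cost=[],[]
  for i in range(2,11):
    cluster_num.append(i)
    kproto=KPrototypes(n_clusters=i)
    # Use the matrix passed in rather than the global X
    kproto.fit_predict(df,categorical=[0, 1,2])
    cost.append(kproto.cost_)

  return cluster_num,cost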

cluster_n,cost_arr=elbow(X)

In [None]:
plt.plot(cluster_n, cost_arr, 'bo',linestyle='-')

**5 Clusters of Diabetic Patients**

In [None]:
kproto=KPrototypes(n_clusters=5)
clusters=kproto.fit_predict(X,categorical=[0, 1,2])
X_df["clusters"]=pd.Series(clusters,index=X_df.index)