# Getting the diabetic dataset from UCI

This notebook works with the hospital-visit dataset of diabetic patients from [UCI](https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008).

The dataset represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. An encounter record was extracted whenever any kind of diabetes was entered into the system as a diagnosis for the patient.

This notebook shows my first exploration step of the dataset, including interpreting the ICD codes and filtering the records that carry a diabetes diagnosis.

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Some options for Pandas and Seaborn:

In [2]:
# Always display all the columns
pd.set_option('display.width', 5000) 
pd.set_option('display.max_columns', 200) 

# Plain Seaborn figures with matplotlib color codes mapped to the default seaborn palette 
sns.set(style="white", color_codes=True)

Import the original dataset CSV file as a Pandas dataframe:

In [3]:
df = pd.read_csv("diabetic_data.csv")

In [4]:
df.shape

(101766, 50)

In [5]:
df.dtypes

encounter_id                 int64
patient_nbr                  int64
race                        object
gender                      object
age                         object
weight                      object
admission_type_id            int64
discharge_disposition_id     int64
admission_source_id          int64
time_in_hospital             int64
payer_code                  object
medical_specialty           object
num_lab_procedures           int64
num_procedures               int64
num_medications              int64
number_outpatient            int64
number_emergency             int64
number_inpatient             int64
diag_1                      object
diag_2                      object
diag_3                      object
number_diagnoses             int64
max_glu_serum               object
A1Cresult                   object
metformin                   object
repaglinide                 object
nateglinide                 object
chlorpropamide              object
glimepiride         
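Many of the columns above are typed `object`. The diagnosis columns later in this notebook show that missing values are recorded as the string `?`, so a quick count of that placeholder per column can be useful before going further. The following is a minimal sketch on a toy frame; the values and the name `toy` are illustrative assumptions, not rows from `diabetic_data.csv`.

```python
import pandas as pd

# Toy frame standing in for diabetic_data.csv (illustrative values only)
toy = pd.DataFrame({
    "weight": ["?", "?", "[75-100)"],
    "diag_2": ["?", "250.01", "250"],
    "time_in_hospital": [1, 3, 2],
})

# Count the "?" placeholders per column; numeric columns simply yield 0
missing_counts = toy.eq("?").sum()
print(missing_counts)
```

On the real dataframe the same expression would be `df.eq("?").sum()`.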

## Interpretation of ICD code

An overview of the primary, secondary and additional diagnoses:

In [6]:
diag_df = df[["diag_1", "diag_2", "diag_3"]]
diag_df.describe()

|        | diag_1 | diag_2 | diag_3 |
| ------ | ------:| ------:| ------:|
| count  | 101766 | 101766 | 101766 |
| unique | 717    | 749    | 790    |
| top    | 428    | 276    | 250    |
| freq   | 6862   | 6752   | 11555  |


The dataset contains up to three diagnoses per encounter (primary, secondary and additional). Each of these columns holds more than 700 unique ICD-9 codes, which makes them very hard to include in a model and to interpret meaningfully. We therefore collapsed the diagnosis codes into 9 disease categories, in much the same way as the original publication that used this dataset. The 9 categories are listed below:

| Disease        | ICD-9 code    | Categorical code |
| -------------  |:-------------:| ----------------:|
| Circulatory    | 390-459, 785  |         1        |
| Respiratory    | 460-519, 786  |         2        |
| Digestive      | 520-579, 787  |         3        |
| Diabetes       | 250.XX        |         4        |
| Injury         | 800-999       |         5        |
| Musculoskeletal| 710-739       |         6        |
| Genitourinary  | 580-629, 788  |         7        |
| Neoplasms      | 140-239       |         8        |
| Others         | E and V codes, all other codes |  0  |

In [7]:
def interpret_diag(val):
    """Map a raw ICD-9 code to one of the category codes in the table above."""
    try:
        value = float(val)
    except ValueError:
        # E and V codes are not numeric and fall into "Others"
        return "0"

    if 390 <= value < 460 or value == 785:
        return "1"
    elif 460 <= value < 520 or value == 786:  # 520 itself belongs to Digestive
        return "2"
    elif 520 <= value < 580 or value == 787:
        return "3"
    elif 250 <= value < 256:  # note: wider than the 250.XX range listed in the table
        return "4"
    elif 800 <= value < 1000:
        return "5"
    elif 710 <= value < 740:
        return "6"
    elif 580 <= value < 630 or value == 788:
        return "7"
    elif 140 <= value < 240:
        return "8"
    elif value == -1:  # missing diagnosis, recorded as "?" in the raw data
        return "?"
    else:
        return "0"

for i in range(1, 4):
    original = f"diag_{i}"
    marked = f"m_diag_{i}"   # raw code with "?" replaced by -1
    category = f"c_diag{i}"  # collapsed category code
    df[marked] = df[original].replace('?', -1)
    df[category] = df[marked].apply(interpret_diag)
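The cell above calls a Python function once per value. As an alternative, the same grouping can be expressed with vectorized comparisons and `numpy.select`. This is only a sketch: `categorize_icd9` is a hypothetical helper name, and the diabetes range follows the table (250.XX, that is 250 up to but not including 251) rather than the wider `< 256` bound used in the cell above.

```python
import numpy as np
import pandas as pd

def categorize_icd9(codes: pd.Series) -> pd.Series:
    """Vectorized sketch of the ICD-9 grouping (hypothetical helper)."""
    # E/V codes and "?" are not numeric, so they become NaN here;
    # every comparison against NaN is False.
    num = pd.to_numeric(codes, errors="coerce")
    conditions = [
        codes.eq("?"),                               # missing diagnosis
        ((num >= 390) & (num < 460)) | (num == 785),  # circulatory
        ((num >= 460) & (num < 520)) | (num == 786),  # respiratory
        ((num >= 520) & (num < 580)) | (num == 787),  # digestive
        (num >= 250) & (num < 251),                   # diabetes (250.XX)
        (num >= 800) & (num < 1000),                  # injury
        (num >= 710) & (num < 740),                   # musculoskeletal
        ((num >= 580) & (num < 630)) | (num == 788),  # genitourinary
        (num >= 140) & (num < 240),                   # neoplasms
    ]
    choices = ["?", "1", "2", "3", "4", "5", "6", "7", "8"]
    # Anything that matches no condition falls into "Others"
    result = np.select(conditions, choices, default="0")
    return pd.Series(result, index=codes.index)

# Illustrative codes, not taken from the dataset
sample = pd.Series(["250.83", "?", "V27", "428", "520", "786", "998", "157"])
print(categorize_icd9(sample).tolist())
```

Keeping the conditions and the choices in two parallel lists makes it easy to compare the ranges against the category table line by line.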

In [8]:
df.shape

(101766, 56)

In [9]:
c_diag_df = df[["c_diag1", "c_diag2", "c_diag3"]]
c_diag_df.describe()

|        | c_diag1 | c_diag2 | c_diag3 |
| ------ | -------:| -------:| -------:|
| count  | 101766  | 101766  | 101766  |
| unique | 10      | 10      | 10      |
| top    | 1       | 1       | 1       |
| freq   | 30437   | 31881   | 30306   |


In [10]:
diag_df = df[["diag_1", "diag_2","diag_3","c_diag1", "c_diag2", "c_diag3"]]
diag_df.head(15).T

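|         | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| ------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| diag_1  | 250.83 | 276 | 648 | 8 | 197 | 414 | 414 | 428 | 398 | 434 | 250.7 | 157 | 428 | 428 | 518 |
| diag_2  | ? | 250.01 | 250 | 250.43 | 157 | 411 | 411 | 492 | 427 | 198 | 403 | 288 | 250.43 | 411 | 998 |
| diag_3  | ? | 255 | V27 | 403 | 250 | 250 | V45 | 250 | 38 | 486 | 996 | 197 | 250.6 | 427 | 627 |
| c_diag1 | 4 | 0 | 0 | 0 | 8 | 1 | 1 | 1 | 1 | 1 | 4 | 8 | 1 | 1 | 2 |
| c_diag2 | ? | 4 | 4 | 4 | 8 | 1 | 1 | 2 | 1 | 8 | 1 | 0 | 4 | 1 | 5 |
| c_diag3 | ? | 4 | 0 | 1 | 4 | 4 | 0 | 4 | 0 | 2 | 5 | 8 | 4 | 1 | 7 |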
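The category codes are compact but not very readable. A small lookup built from the category table can turn them back into disease names when inspecting distributions. The sketch below uses the 15 `c_diag1` values shown above; the names `CATEGORY_LABELS` and `counts`, and the label "Missing" for `?`, are assumptions introduced here for illustration.

```python
import pandas as pd

# Labels taken from the category table; "Missing" for "?" is an assumed label
CATEGORY_LABELS = {
    "0": "Others", "1": "Circulatory", "2": "Respiratory",
    "3": "Digestive", "4": "Diabetes", "5": "Injury",
    "6": "Musculoskeletal", "7": "Genitourinary", "8": "Neoplasms",
    "?": "Missing",
}

# The first 15 c_diag1 values from the preview above
c_diag1 = pd.Series(["4", "0", "0", "0", "8", "1", "1", "1",
                     "1", "1", "4", "8", "1", "1", "2"])

# Translate codes to names, then count how often each category occurs
counts = c_diag1.map(CATEGORY_LABELS).value_counts()
print(counts)
```

On the full dataframe the same pattern would be `df["c_diag1"].map(CATEGORY_LABELS).value_counts()`.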


In [11]:
# Drop the raw and intermediate diagnosis columns; only the c_diag* categories are kept
df.drop(columns=["diag_1", "diag_2", "diag_3",
                 "m_diag_1", "m_diag_2", "m_diag_3"], inplace=True)

In [12]:
df.shape

(101766, 50)

At this point, we might want to save the work so far to a CSV file as a backup, so it can be reloaded later without repeating all the steps above ([link](https://github.com/zhihuaqi/DiaControl/blob/master/data/diabetes_data_catelog.csv)).

In [13]:
df.to_csv('./diabetes_data_catelog.csv')
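When the backup is loaded again, two details are worth keeping in mind: `to_csv` without `index=False` writes the row index as an unnamed first column, and `read_csv` infers types, so a category column holding only digits could come back as integers and no longer compare equal to `"4"`. The sketch below illustrates both on a toy frame with an in-memory buffer; the frame contents are made up, and passing `dtype` for the `c_diag*` columns is a suggestion rather than something this notebook does.

```python
import io
import pandas as pd

# Toy frame with a string category column (illustrative values only)
toy = pd.DataFrame({"encounter_id": [1, 2], "c_diag1": ["4", "1"]})

# Write the CSV the same way as above: the index becomes an unnamed first column
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the index; dtype keeps the category codes as strings
reloaded = pd.read_csv(buffer, index_col=0, dtype={"c_diag1": str})
print(reloaded.dtypes)
```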

## Filtering for diabetic hospital visits
Because we focus on diabetes control, we extracted all encounter records that have a diabetes diagnosis (ICD-9 code: 250.*, category code: 4) for further analysis.

In [14]:
# at least one of the c_diag columns should be category 4 (diabetes)
d_df = df[(df.c_diag1 == "4")|(df.c_diag2 == "4")|(df.c_diag3 == "4")]

In [15]:
d_df.shape

(38314, 50)
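The same filter can be written without repeating the comparison for each column, by comparing all three category columns at once and keeping rows where any of them matches. This is a minimal sketch on a toy frame; `toy`, `diag_cols`, `has_diabetes` and `d_toy` are illustrative names.

```python
import pandas as pd

# Toy frame with the three category columns (illustrative values only)
toy = pd.DataFrame({
    "c_diag1": ["4", "1", "8"],
    "c_diag2": ["?", "4", "1"],
    "c_diag3": ["?", "0", "2"],
})

diag_cols = ["c_diag1", "c_diag2", "c_diag3"]

# True for rows where at least one diagnosis falls in category "4" (diabetes)
has_diabetes = toy[diag_cols].eq("4").any(axis=1)

# .copy() gives an independent frame, so later edits do not touch the original
d_toy = toy[has_diabetes].copy()
print(d_toy)
```

This form also scales to more diagnosis columns by extending `diag_cols` only.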

We again save the filtered records to a CSV file as a backup, so they can be loaded later without repeating the steps above ([link](https://github.com/zhihuaqi/DiaControl/blob/master/data/diabetes_data_preprocessed.csv)).

In [16]:
d_df.to_csv('./diabetes_data_preprocessed.csv')