## Diabetes Readmission Prediction

### Problem Statement
The goal is to build a classification model that predicts whether a diabetic patient will be readmitted to the hospital after discharge and, if so, whether the readmission will occur within 30 days (a critical high-risk period) or after 30 days.

**Input:** A dataset of 50 columns (two identifiers, 47 candidate features, and the target) covering the patient's demographics, hospital stay, medical history, and test results.

**Output:** A prediction of one of three classes:
1.  **`NO`**: Patient is not readmitted.
2.  **`<30`**: Patient is readmitted within 30 days. (Primary focus for intervention)
3.  **`>30`**: Patient is readmitted after 30 days.

**End Goal:** To identify high-risk patients before discharge, enabling healthcare providers to implement targeted follow-up care and potentially prevent readmission.

## Feature Information



#### **1. Patient Identification**
*   **`encounter_id`**: Unique ID for the hospital stay. (Will be dropped, not a feature).
*   **`patient_nbr`**: Unique ID for the patient. (Will be dropped, not a feature).

#### **2. Demographic Information**
*   **`race`**: Patient's race (e.g., Caucasian, AfricanAmerican, Hispanic).
*   **`gender`**: Male, Female.
*   **`age`**: Age group, in 10-year intervals (e.g., `[40-50)`, `[50-60)`).
*   **`weight`**: Weight in pounds. *(Note: 97% missing, likely unusable)*.

#### **3. Admission & Discharge Details**
*   **`admission_type_id`**: How patient was admitted (e.g., 1-Emergency, 2-Urgent, 3-Elective).
*   **`discharge_disposition_id`**: How patient left (e.g., 1-Home, 3-Transferred to a skilled nursing facility, 11-Expired). **(Very Important)**
*   **`admission_source_id`**: Where patient was admitted from (e.g., 1-Physician Referral, 7-Emergency Room).
*   **`time_in_hospital`**: Number of days between admission and discharge.

#### **4. Hospital System & Care Provided**
*   **`payer_code`**: Insurance provider (e.g., MC-Medicare, MD-Medicaid). *(52% missing)*.
*   **`medical_specialty`**: Department of the admitting physician (e.g., Cardiology, Surgery). *(53% missing, many categories)*.
*   **`num_lab_procedures`**: Number of lab tests performed during the stay.
*   **`num_procedures`**: Number of procedures (other than lab tests) performed.
*   **`num_medications`**: Number of distinct generic names administered.

#### **5. Prior Utilization (Visits in the Past Year)**
*   **`number_outpatient`**: Number of outpatient visits (clinic, doctor's office).
*   **`number_emergency`**: Number of emergency room visits.
*   **`number_inpatient`**: Number of previous hospital admissions (stays). **(Strong Predictor)**
*   **`number_diagnoses`**: Number of diagnoses entered into the system.

#### **6. Diagnosis & Test Results**
*   **`diag_1`**: Primary diagnosis (ICD-9 code).
*   **`diag_2`**: Secondary diagnosis.
*   **`diag_3`**: Additional secondary diagnosis.
*   **`max_glu_serum`**: Glucose serum test result (Values: `">200"`, `">300"`, `"Norm"`, `"None"` when the test was not taken).
*   **`A1Cresult`**: A1c test result (Values: `">8"`, `">7"`, `"Norm"`, `"None"` when the test was not taken).

#### **7. Medications (23 Features)**
**For each medication below, the value indicates what happened during the stay:**
*   `"No"` = not prescribed
*   `"Steady"` = no change in dosage
*   `"Up"` = dosage increased
*   `"Down"` = dosage decreased

**List of Medications:**
`metformin`, `repaglinide`, `nateglinide`, `chlorpropamide`, `glimepiride`, `acetohexamide`, `glipizide`, `glyburide`, `tolbutamide`, `pioglitazone`, `rosiglitazone`, `acarbose`, `miglitol`, `troglitazone`, `tolazamide`, `examide`, `citoglipton`, `insulin`, `glyburide-metformin`, `glipizide-metformin`, `glimepiride-pioglitazone`, `metformin-rosiglitazone`, `metformin-pioglitazone`
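
The per-medication values can be condensed into a single count of dosage changes per encounter. The following is a minimal sketch, not part of the original analysis: `count_dose_changes`, `DOSE_CHANGE_VALUES`, and the three-column `MEDICATION_COLUMNS` subset are hypothetical names, and the example record is made up. Values are compared case-insensitively because the exact capitalization in the raw file is treated as an assumption here.

```python
# Values that signal a dosage change; compared case-insensitively because
# the exact capitalization in the raw file is an assumption here.
DOSE_CHANGE_VALUES = {"up", "down"}

# Illustrative subset of the medication columns listed above.
MEDICATION_COLUMNS = ["metformin", "glipizide", "insulin"]


def count_dose_changes(record: dict) -> int:
    """Count medications whose dosage went up or down in one encounter."""
    return sum(
        1
        for column in MEDICATION_COLUMNS
        if str(record.get(column, "")).strip().lower() in DOSE_CHANGE_VALUES
    )


# Made-up encounter record, for illustration only.
example = {"metformin": "Steady", "glipizide": "No", "insulin": "Up"}
n_changes = count_dose_changes(example)
print(n_changes)
```

On a DataFrame, the same function could be applied row by row with `df.apply(lambda row: count_dose_changes(row.to_dict()), axis=1)`.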

#### **8. Medication Summaries**
*   **`change`**: Was there any change in diabetic medications? (`"Ch"` for change, `"No"` for no change).
*   **`diabetesMed`**: Was any diabetes medication prescribed? (`"Yes"` or `"No"`).

#### **9. Target Variable**
*   **`readmitted`**: **What we want to predict.**
    *   `"NO"` = Not readmitted.
    *   `"<30"` = Readmitted within 30 days. **(Key focus)**
    *   `">30"` = Readmitted after 30 days.
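
Most model libraries expect numeric labels. The sketch below shows one possible encoding of the three classes, plus a binary flag for the `<30` class that the problem statement treats as the primary focus. The names `READMITTED_CODES`, `encode_readmitted`, and `is_early_readmission` are hypothetical, and the integer codes are an arbitrary choice rather than something the dataset defines.

```python
# Hypothetical label encoding for the `readmitted` target.
# The integer codes are an arbitrary assumption, not part of the dataset.
READMITTED_CODES = {"NO": 0, ">30": 1, "<30": 2}


def encode_readmitted(label: str) -> int:
    """Map a raw `readmitted` value to an integer class code."""
    try:
        return READMITTED_CODES[label.strip()]
    except KeyError:
        raise ValueError(f"Unexpected readmitted value: {label!r}") from None


def is_early_readmission(label: str) -> bool:
    """Return True only for the high-risk `<30` class."""
    return label.strip() == "<30"


labels = ["NO", "<30", ">30", "NO"]
encoded = [encode_readmitted(value) for value in labels]
early = [is_early_readmission(value) for value in labels]
print(encoded)
print(early)
```

With a DataFrame, `df["readmitted"].map(READMITTED_CODES)` applies the same mapping to the whole column.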

---

In [28]:
## Import necessary libraries
import pandas as pd

In [29]:
## Load the dataset
path = "../data/diabetic_data.csv"

def load_dataset(path: str):
    try:
        df = pd.read_csv(path)
        print("Dataset loaded successfully!")
        return df
    except Exception as e:
        print(f"Error while loading dataset: {e}")
        return None

# Load and store in a variable
df = load_dataset(path)

Dataset loaded successfully!


In [30]:
## Let's explore all columns of the dataset by transposing the first rows
df.head().T

|  | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| encounter_id | 2278392 | 149190 | 64410 | 500364 | 16680 |
| patient_nbr | 8222157 | 55629189 | 86047875 | 82442376 | 42519267 |
| race | Caucasian | Caucasian | AfricanAmerican | Caucasian | Caucasian |
| gender | Female | Female | Female | Male | Male |
| age | [0-10) | [10-20) | [20-30) | [30-40) | [40-50) |
| weight | ? | ? | ? | ? | ? |
| admission_type_id | 6 | 1 | 1 | 1 | 1 |
| discharge_disposition_id | 25 | 1 | 1 | 1 | 1 |
| admission_source_id | 1 | 7 | 7 | 7 | 7 |
| time_in_hospital | 1 | 3 | 2 | 2 | 1 |


In [31]:
## Let's find the total number of rows and columns
print(f"Total Rows: {df.shape[0]}")
print(f"Total Columns: {df.shape[1]}")

Total Rows: 101766
Total Columns: 50


In [32]:
## Let's view some additional dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 50 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   encounter_id              101766 non-null  int64 
 1   patient_nbr               101766 non-null  int64 
 2   race                      101766 non-null  object
 3   gender                    101766 non-null  object
 4   age                       101766 non-null  object
 5   weight                    101766 non-null  object
 6   admission_type_id         101766 non-null  int64 
 7   discharge_disposition_id  101766 non-null  int64 
 8   admission_source_id       101766 non-null  int64 
 9   time_in_hospital          101766 non-null  int64 
 10  payer_code                101766 non-null  object
 11  medical_specialty         101766 non-null  object
 12  num_lab_procedures        101766 non-null  int64 
 13  num_procedures            101766 non-null  int64 
 14  num_

**Observations**
- Most features are `object` dtype: 37 of the 50 columns, with the remaining 13 being numeric.

In [33]:
## Let's check the total number of null values in each feature
df.isnull().sum()

encounter_id                    0
patient_nbr                     0
race                            0
gender                          0
age                             0
weight                          0
admission_type_id               0
discharge_disposition_id        0
admission_source_id             0
time_in_hospital                0
payer_code                      0
medical_specialty               0
num_lab_procedures              0
num_procedures                  0
num_medications                 0
number_outpatient               0
number_emergency                0
number_inpatient                0
diag_1                          0
diag_2                          0
diag_3                          0
number_diagnoses                0
max_glu_serum               96420
A1Cresult                   84748
metformin                       0
repaglinide                     0
nateglinide                     0
chlorpropamide                  0
glimepiride                     0
acetohexamide 

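**Observations**
- Feature **max_glu_serum** has **96420** null values.
- Feature **A1Cresult** has **84748** null values.
- These nulls most likely correspond to the "none" category described earlier (test not performed), which recent pandas versions parse as missing by default.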
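
`df.isnull()` only counts values that pandas recognised as missing while parsing. The `weight` column holds `?` in every row of the preview above yet reports 0 nulls, so `?` appears to serve as a missing-value placeholder. The following is a minimal sketch of two ways to surface it; the sample CSV is made up for illustration and is not taken from the dataset.

```python
import io

import pandas as pd

# Tiny made-up CSV that mimics the "?" placeholder seen in the `weight` column.
sample_csv = io.StringIO(
    "race,weight,time_in_hospital\n"
    "Caucasian,?,1\n"
    "?,?,3\n"
    "AfricanAmerican,180,2\n"
)

# Option 1: count the placeholder on an already loaded frame.
raw = pd.read_csv(sample_csv)
placeholder_counts = (raw == "?").sum()

# Option 2: let read_csv convert "?" to NaN at load time.
sample_csv.seek(0)
clean = pd.read_csv(sample_csv, na_values=["?"])
null_counts = clean.isnull().sum()

print(placeholder_counts)
print(null_counts)
```

Passing `na_values=["?"]` to the `pd.read_csv` call inside `load_dataset` would apply the second option to the full file, at the cost of changing the null counts shown above.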

In [34]:
## Let's compute the null-value percentage of these two features
print(f'Null value percentage of feature max_glu_serum is :  {round(df["max_glu_serum"].isnull().sum() / df.shape[0] * 100, 2)}%')
print(f'Null value percentage of feature A1Cresult is :  {round(df["A1Cresult"].isnull().sum() / df.shape[0] * 100, 2)}%')

Null value percentage of feature max_glu_serum is :  94.75%
Null value percentage of feature A1Cresult is :  83.28%


**Observations**
- Both features are more than 75% null (94.75% and 83.28%), so let's drop them.
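
The same check can be run over every column at once instead of one `print` per feature. This is a minimal sketch on a made-up frame; `demo` and `NULL_THRESHOLD_PCT` are hypothetical names, and the threshold simply mirrors the 75% rule of thumb used above.

```python
import pandas as pd

# Made-up frame standing in for the real dataset.
demo = pd.DataFrame(
    {
        "max_glu_serum": [None, None, None, None, ">200"],
        "A1Cresult": [None, ">8", None, None, None],
        "time_in_hospital": [1, 3, 2, 2, 1],
    }
)

# Fraction of missing values per column, expressed as a percentage.
null_pct = demo.isnull().mean().mul(100).round(2)

# Hypothetical cut-off, matching the 75% rule of thumb used above.
NULL_THRESHOLD_PCT = 75
to_drop = null_pct[null_pct > NULL_THRESHOLD_PCT].index.tolist()
reduced = demo.drop(columns=to_drop)

print(null_pct)
print(to_drop)
```

Keeping the threshold in a named constant makes it easy to revisit the decision if a mostly-missing feature later turns out to be informative.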

In [35]:
## Before modifying the original dataset, let's take a copy
df_copy = df.copy()

In [36]:
## Let's drop the features [max_glu_serum, A1Cresult]
df_copy = df_copy.drop(columns=["max_glu_serum", "A1Cresult"])

### Numerical Features

In [46]:
## Let's group all numerical features by data type
numerical_features = [feature for feature in df_copy.columns if df_copy[feature].dtype != "O"]
numerical_features

['encounter_id',
 'patient_nbr',
 'admission_type_id',
 'discharge_disposition_id',
 'admission_source_id',
 'time_in_hospital',
 'num_lab_procedures',
 'num_procedures',
 'num_medications',
 'number_outpatient',
 'number_emergency',
 'number_inpatient',
 'number_diagnoses']
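
An alternative to comparing each dtype by hand is `DataFrame.select_dtypes`, which selects columns by dtype family. This is a minimal sketch on a made-up two-column frame (`demo` is hypothetical), not a replacement for the cell above.

```python
import pandas as pd

# Made-up frame with one numeric and one text column.
demo = pd.DataFrame(
    {"time_in_hospital": [1, 3], "race": ["Caucasian", "Hispanic"]}
)

# "number" matches every integer and float dtype.
numerical = demo.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric is treated as categorical here.
categorical = demo.select_dtypes(exclude="number").columns.tolist()

print(numerical)
print(categorical)
```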

In [47]:
## Let's build a DataFrame from "numerical_features" so that pandas operations can be applied to it
numerical_features_df = df_copy[numerical_features].copy()
numerical_features_df

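|  | encounter_id | patient_nbr | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | number_diagnoses |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2278392 | 8222157 | 6 | 25 | 1 | 1 | 41 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 149190 | 55629189 | 1 | 1 | 7 | 3 | 59 | 0 | 18 | 0 | 0 | 0 | 9 |
| 2 | 64410 | 86047875 | 1 | 1 | 7 | 2 | 11 | 5 | 13 | 2 | 0 | 1 | 6 |
| 3 | 500364 | 82442376 | 1 | 1 | 7 | 2 | 44 | 1 | 16 | 0 | 0 | 0 | 7 |
| 4 | 16680 | 42519267 | 1 | 1 | 7 | 1 | 51 | 0 | 8 | 0 | 0 | 0 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 101761 | 443847548 | 100162476 | 1 | 3 | 7 | 3 | 51 | 0 | 16 | 0 | 0 | 0 | 9 |
| 101762 | 443847782 | 74694222 | 1 | 4 | 5 | 5 | 33 | 3 | 18 | 0 | 0 | 1 | 9 |
| 101763 | 443854148 | 41088789 | 1 | 1 | 7 | 1 | 53 | 0 | 9 | 1 | 0 | 0 | 13 |
| 101764 | 443857166 | 31693671 | 2 | 3 | 7 | 10 | 45 | 2 | 21 | 0 | 0 | 1 | 9 |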


In [49]:
## 'encounter_id' and 'patient_nbr' are identifiers, not features, so let's drop them
numerical_features_df = numerical_features_df.drop(columns=['encounter_id', 'patient_nbr'])

In [55]:
## Let's get a quick overview using summary statistics
numerical_features_df.describe().T

|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| admission_type_id | 101766.0 | 2.024006 | 1.445403 | 1.0 | 1.0 | 1.0 | 3.0 | 8.0 |
| discharge_disposition_id | 101766.0 | 3.715642 | 5.280166 | 1.0 | 1.0 | 1.0 | 4.0 | 28.0 |
| admission_source_id | 101766.0 | 5.754437 | 4.064081 | 1.0 | 1.0 | 7.0 | 7.0 | 25.0 |
| time_in_hospital | 101766.0 | 4.395987 | 2.985108 | 1.0 | 2.0 | 4.0 | 6.0 | 14.0 |
| num_lab_procedures | 101766.0 | 43.095641 | 19.674362 | 1.0 | 31.0 | 44.0 | 57.0 | 132.0 |
| num_procedures | 101766.0 | 1.33973 | 1.705807 | 0.0 | 0.0 | 1.0 | 2.0 | 6.0 |
| num_medications | 101766.0 | 16.021844 | 8.127566 | 1.0 | 10.0 | 15.0 | 20.0 | 81.0 |
| number_outpatient | 101766.0 | 0.369357 | 1.267265 | 0.0 | 0.0 | 0.0 | 0.0 | 42.0 |
| number_emergency | 101766.0 | 0.197836 | 0.930472 | 0.0 | 0.0 | 0.0 | 0.0 | 76.0 |
| number_inpatient | 101766.0 | 0.635566 | 1.262863 | 0.0 | 0.0 | 0.0 | 1.0 | 21.0 |
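
The means and quartiles reported for `admission_type_id`, `discharge_disposition_id`, and `admission_source_id` are hard to interpret, since the feature list describes these columns as category codes rather than quantities. One option, sketched here on made-up rows, is to cast them to the pandas `category` dtype so that `describe()` covers only true quantities. The frame `demo` and the list `id_columns` are hypothetical names.

```python
import pandas as pd

# Made-up rows; the column names match the dataset, the values are illustrative.
demo = pd.DataFrame(
    {
        "admission_type_id": [1, 1, 3, 6],
        "discharge_disposition_id": [1, 1, 3, 25],
        "admission_source_id": [7, 7, 1, 1],
        "time_in_hospital": [1, 3, 2, 2],
    }
)

# Columns that hold category codes rather than quantities.
id_columns = ["admission_type_id", "discharge_disposition_id", "admission_source_id"]
demo = demo.astype({column: "category" for column in id_columns})

# describe() with default arguments now skips the recoded columns.
numeric_summary = demo.describe().T

# Frequency counts are a more meaningful summary for the codes.
admission_counts = demo["admission_type_id"].value_counts()

print(numeric_summary.index.tolist())
print(admission_counts)
```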


### Categorical Features

In [40]:
## Let's group all categorical features by data type
categorical_features = [feature for feature in df_copy.columns if df_copy[feature].dtype == "O"]
categorical_features

['race',
 'gender',
 'age',
 'weight',
 'payer_code',
 'medical_specialty',
 'diag_1',
 'diag_2',
 'diag_3',
 'metformin',
 'repaglinide',
 'nateglinide',
 'chlorpropamide',
 'glimepiride',
 'acetohexamide',
 'glipizide',
 'glyburide',
 'tolbutamide',
 'pioglitazone',
 'rosiglitazone',
 'acarbose',
 'miglitol',
 'troglitazone',
 'tolazamide',
 'examide',
 'citoglipton',
 'insulin',
 'glyburide-metformin',
 'glipizide-metformin',
 'glimepiride-pioglitazone',
 'metformin-rosiglitazone',
 'metformin-pioglitazone',
 'change',
 'diabetesMed',
 'readmitted']

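**Observations**
- `age` should not be `object`: it holds 10-year bracket labels such as `[40-50)` and needs an ordinal or numeric encoding.
- `weight` should not be `object`: it is described as a weight in pounds, but the preview shows `?` placeholders instead of numbers.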
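
One way to make `age` numeric, assuming every value follows the `[low-high)` pattern seen in the preview, is to replace each bracket with its midpoint. The helper `age_midpoint` is hypothetical, and the midpoint is only one of several reasonable encodings (an ordinal rank would be another).

```python
import re

# Matches bracket labels such as "[40-50)".
_AGE_BRACKET = re.compile(r"^\[(\d+)-(\d+)\)$")


def age_midpoint(bracket: str) -> float:
    """Return the midpoint of an age bracket like "[40-50)"."""
    match = _AGE_BRACKET.match(bracket.strip())
    if match is None:
        raise ValueError(f"Unrecognised age bracket: {bracket!r}")
    low, high = (int(group) for group in match.groups())
    return (low + high) / 2


midpoints = [age_midpoint(b) for b in ["[0-10)", "[40-50)", "[90-100)"]]
print(midpoints)
```

Applied to the notebook's frame, this would look like `df_copy["age"].map(age_midpoint)`. The function raises on any value that does not match the pattern, so unexpected labels surface immediately rather than being silently converted.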