## Elements of Statistical Learning (ESL) Final Project

**Our Machine Learning Question:**

**How can we improve the accuracy of predicting** which diabetic patients are likely to be readmitted to the hospital within 30 days across different classifiers **by addressing the severe class imbalance in the dataset?**
The dataset we chose is on “Early Readmission Prediction of Patients Diagnosed with Diabetes,” which has a significant class imbalance between early readmitted patients and others. The positive class for our classification (patients readmitted within 30 days) is underrepresented at 11.16% of the encounters, a ratio of roughly 1-to-8.

The techniques we are considering implementing to overcome the class imbalance in the dataset are as follows:
1. SMOTE (Synthetic Minority Oversampling Technique)
2. Class weights
3. Ensemble methods

All the above techniques will be compared across different classifiers to answer the following question: How well do different models (e.g., Logistic Regression, Random Forest, SVM) handle the imbalance and predict early readmissions?

Recommendations from Professor David:

- Check how different models react to class imbalance prior to implementing balancing techniques. 
- Also, check whether the balancing techniques leave the same objects misclassified, or whether objects that used to be classified correctly are now misclassified.
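
One possible way to carry out the second recommendation is to compare the sets of misclassified test samples before and after a balancing technique is applied. The sketch below uses small made-up label arrays, and `compare_errors` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def compare_errors(y_true, pred_before, pred_after):
    """Split sample indices by how their correctness changed after balancing."""
    y_true = np.asarray(y_true)
    wrong_before = np.asarray(pred_before) != y_true
    wrong_after = np.asarray(pred_after) != y_true
    return {
        "still_wrong": np.flatnonzero(wrong_before & wrong_after),
        "fixed": np.flatnonzero(wrong_before & ~wrong_after),
        "newly_wrong": np.flatnonzero(~wrong_before & wrong_after),
    }

# Made-up labels: 1 = readmitted within 30 days, 0 = otherwise
y_true      = [1, 1, 0, 0, 0, 1]
pred_before = [0, 0, 0, 0, 0, 1]   # baseline model misses two positives
pred_after  = [1, 0, 1, 0, 0, 1]   # after balancing: one fixed, one new error
result = compare_errors(y_true, pred_before, pred_after)
print({name: idx.tolist() for name, idx in result.items()})
```

The `newly_wrong` indices are the objects that were classified well before balancing and go wrong afterwards. Both models must be evaluated on the same untouched test split for the comparison to be meaningful.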

In [96]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Data Loading and Exploration

The chosen dataset represents ten years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. Each row is the record of a hospital encounter for a patient diagnosed with diabetes who underwent laboratory tests, received medications, and stayed up to 14 days.

Link to the dataset: https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008

**Our Aim:** Identifying patients at risk of being readmitted within 30 days of discharge.

In [97]:
data = pd.read_csv('diabetic_data.csv')
print(data.head())

   encounter_id  patient_nbr             race  gender      age weight  \
0       2278392      8222157        Caucasian  Female   [0-10)      ?   
1        149190     55629189        Caucasian  Female  [10-20)      ?   
2         64410     86047875  AfricanAmerican  Female  [20-30)      ?   
3        500364     82442376        Caucasian    Male  [30-40)      ?   
4         16680     42519267        Caucasian    Male  [40-50)      ?   

   admission_type_id  discharge_disposition_id  admission_source_id  \
0                  6                        25                    1   
1                  1                         1                    7   
2                  1                         1                    7   
3                  1                         1                    7   
4                  1                         1                    7   

   time_in_hospital  ... citoglipton insulin  glyburide-metformin  \
0                 1  ...          No      No                   No

In [98]:
print(data.columns)

Index(['encounter_id', 'patient_nbr', 'race', 'gender', 'age', 'weight',
       'admission_type_id', 'discharge_disposition_id', 'admission_source_id',
       'time_in_hospital', 'payer_code', 'medical_specialty',
       'num_lab_procedures', 'num_procedures', 'num_medications',
       'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1',
       'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted'],
      dtype='object')


In [99]:
target_distribution = data['readmitted'].value_counts()

print("Distribution of 'readmitted' target column:")
print(target_distribution)

target_percentage = data['readmitted'].value_counts(normalize=True) * 100
print("\nPercentage distribution of 'readmitted' target column:")
print(target_percentage.round(2))

Distribution of 'readmitted' target column:
readmitted
NO     54864
>30    35545
<30    11357
Name: count, dtype: int64

Percentage distribution of 'readmitted' target column:
readmitted
NO     53.91
>30    34.93
<30    11.16
Name: proportion, dtype: float64


### Binary Classification Formulation

There are three classes in the target 'readmitted' column:
- Class 1: <30 (early readmission, within 30 days)
- Class 2: NO (no readmission)
- Class 3: >30 (late readmission, after more than 30 days)

For our early readmission prediction problem, we use binary classification because the goal is to identify early readmissions correctly. For this purpose, the "NO" and ">30" cases are equivalent and can be combined into a single group.

Hence, the **binary classification** formulation can be described as follows:
- Class 1: <30 (early readmission)
- Class 2: NO and >30 (no readmission and late readmission)
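
The binary formulation can be sketched as a single comparison on the target column. The snippet uses a toy Series in place of `data['readmitted']`, and the 0/1 encoding (1 for the positive class) is an assumption of this sketch.

```python
import pandas as pd

# Toy stand-in for data['readmitted'], using the three class labels listed above
readmitted = pd.Series(["NO", ">30", "<30", "NO", "<30", ">30"])

# 1 = readmitted within 30 days (positive class), 0 = "NO" or ">30"
y = (readmitted == "<30").astype(int)
print(y.tolist())
```

On the full DataFrame the same expression could be stored in a new column, for example under a hypothetical name such as `readmitted_binary`.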

### Understanding the Imbalance in the Dataset

After formulating the problem as binary classification, the combined share of the "NO" and ">30" classes is approximately 88.84%, whereas the early readmission class "<30" accounts for 11.16%.

Hence, the dataset is severely imbalanced. The positive class for our classification (early readmitted patients) is underrepresented, with a ratio of roughly 1-to-8 (11,357 positive versus 90,409 negative encounters).
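
The imbalance ratio can be checked directly from the class counts printed earlier. A minimal arithmetic sketch, with the counts copied by hand from the `value_counts()` output:

```python
# Counts copied from the value_counts() output above
counts = {"NO": 54864, ">30": 35545, "<30": 11357}

positives = counts["<30"]
negatives = counts["NO"] + counts[">30"]
total = positives + negatives

print(f"positive share: {positives / total:.2%}")
print(f"negatives per positive: {negatives / positives:.2f}")
```

This gives about 7.96 negative encounters for every positive one.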

## TODO During the Christmas Break

Link to the Google sheets for writing down comments on the dataset columns and rationalizing the implemented approaches:
https://docs.google.com/spreadsheets/d/1wQvVQijqFmdOWjL4nJtSQ5hRea5tVSDQo9-br04bgew/edit?usp=sharing

## Data Exploration Ideas

### 1. Missing Value Handling
- **Indicator of Missing Values**: Missing values are represented by `?` in the dataset.
- **Imputation Strategies**:
  - **Baseline Approach: Deleting the Rows with Missing Values:** Remove rows that contain at least one missing value in any column.
  - **Correlation-based Imputation**: Identify which features are highly correlated and impute missing values accordingly.
  - **Statistical Methods**:
    - For **categorical features**: Use the most frequent class (mode).
    - For **numerical features**: Use the mean, median, or a neighbor-based approach (e.g., K-Nearest Neighbors).
      - **Note**: Research the name of the neighbor-based imputation technique (e.g., KNN Imputation).
- **Documentation**: Record the rationale behind the chosen imputation strategy.
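
As a rough illustration of the statistical strategies above, the sketch below fills a categorical column with its mode and uses scikit-learn's `KNNImputer` (the neighbor-based technique) for numerical columns. The toy values are made up, and `n_neighbors=2` is an arbitrary choice for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with made-up values; np.nan marks missing entries
df = pd.DataFrame({
    "race": ["Caucasian", np.nan, "Caucasian", "AfricanAmerican"],
    "num_medications": [10.0, 12.0, np.nan, 30.0],
    "time_in_hospital": [2.0, 3.0, 3.0, 9.0],
})

# Categorical feature: fill with the most frequent value (mode)
df["race"] = df["race"].fillna(df["race"].mode().iloc[0])

# Numerical features: each missing value is replaced by the average of that
# feature over the nearest rows, measured on the features that are present
num_cols = ["num_medications", "time_in_hospital"]
df[num_cols] = KNNImputer(n_neighbors=2).fit_transform(df[num_cols])
print(df)
```

In a real pipeline the imputer would be fitted on the training split only and then applied to the test split, so that no test information leaks into the imputed values.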

### 2. Feature Reduction & Extraction
- **Initial Cleanup**:
  - Drop non-informative columns such as:
    - ID fields.
    - Columns with constant values (e.g., same value for all rows).
- **Rationale**: Clearly document why specific columns were dropped.

- **Dimensionality Reduction**:
  - Implement methods like **Principal Component Analysis (PCA)**.
  - **Library**: Check if Scikit-learn has built-in functions for feature extraction or reduction.
  - Alternatively, develop custom logic for feature reduction based on feature correlation.
- **Rationale**: Justify the use of any dimensionality reduction method and its impact on the dataset.
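
The cleanup and reduction steps above might look as follows. This is a sketch on synthetic data: `constant_flag` is a hypothetical column standing in for any constant-valued column, and the choice of two principal components is arbitrary. The features are standardized first because PCA is sensitive to feature scale.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "encounter_id": np.arange(100),             # ID field
    "constant_flag": ["No"] * 100,              # hypothetical constant column
    "num_lab_procedures": rng.normal(40, 20, 100),
    "num_medications": rng.normal(16, 8, 100),
    "time_in_hospital": rng.normal(4, 3, 100),
})

# 1) Drop identifier columns
df = df.drop(columns=["encounter_id"])

# 2) Drop columns that hold a single value for all rows
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=constant_cols)

# 3) Standardize, then project the numeric features onto two components
X_scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)
print(constant_cols, components.shape, pca.explained_variance_ratio_)
```

The `explained_variance_ratio_` attribute is one way to justify how many components to keep.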

---

## Data Preprocessing Ideas

### 1. Normalization
- **Research Questions**:
  - Which features require normalization?
    - Should normalization apply only to the target class or to all features?
- **Initial Assumption**:
  - Normalization should be applied to all features since some classifiers are sensitive to feature scaling.
- **Documentation**: Record the decision and reasoning behind normalization.
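
Regarding the research question above: scaling is normally applied to the input features, not to the class label. A minimal sketch with scikit-learn's `StandardScaler`, assuming a train/test split already exists (the arrays here are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numeric features; rows are samples, columns are features
X_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
X_test = np.array([[3.0, 20.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

Fitting the scaler on the training split only avoids leaking test statistics into training. Categorical features would need encoding rather than scaling.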

### 2. Handling Class Imbalance
- **Techniques to Address Imbalance**:
  1. **SMOTE (Synthetic Minority Oversampling Technique)**: Generate synthetic samples for the minority class.
  2. **Class Weights**: Assign higher weights to the minority class during training.
  3. **Ensemble Methods**: Use ensemble techniques such as Random Forest or Bagging, typically combined with resampling or class weighting, since ensembles are not robust to imbalance on their own.
- **Classifier Dependency**:
  - Some imbalance techniques are suitable only for specific classification algorithms.
  - **Documentation**: Clearly note the classifiers compatible with each imbalance-handling technique.

---

## Notes
- Keep a record of all decisions and approaches in the notebook or Markdown.
- Justify each step with reasoning or research findings for transparency and reproducibility.

---

# Handling Data Imbalance Techniques

## 1. SMOTE (Synthetic Minority Oversampling Technique)
- **Purpose**: Generates synthetic samples for the minority class to balance the dataset.
- **How it Works**: SMOTE creates new samples by interpolating between existing minority class examples.
- **Extensions**:
  - Borderline SMOTE: Focuses on samples near the decision boundary.
  - ADASYN: Generates more synthetic samples for harder-to-learn minority examples.
  - SMOTETomek: Combines SMOTE with Tomek Links to clean the dataset.
- **Resources**:
  - [SMOTE for Imbalanced Classification](https://www.geeksforgeeks.org/smote-for-imbalanced-classification-with-python/): Includes table for when to use each variant.
  - [imblearn.over_sampling.SMOTE Documentation](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html).
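
To make the interpolation idea concrete, here is a simplified NumPy sketch. It is an illustration only, not the imbalanced-learn implementation; `smote_like_samples` is a hypothetical helper and the minority points are made up. In practice `imblearn.over_sampling.SMOTE` with its `fit_resample(X, y)` method would be used, applied to the training split only.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))           # pick a random minority sample
        nb = neighbours[base, rng.integers(k)]    # pick one of its k nearest neighbours
        gap = rng.random()                        # interpolation factor in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

X_minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])
new_points = smote_like_samples(X_minority, n_new=5)
print(new_points)
```

Each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbours.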

---

## 2. Class Weights
- **Purpose**: Adjust model training to account for class imbalance.
- **How it Works**: Assigns higher weights to the minority class, forcing the model to focus on it more.
- **F1 Score Formula**:
  \[ F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
  - If F1 = 0 for the minority class, the model produces no true positives, i.e., it fails to detect the minority class entirely.
- **Implementation**:
  - Use the `class_weight` parameter offered by many scikit-learn classifiers and by LightGBM's scikit-learn interface; CatBoost offers a comparable `class_weights` parameter.
    - Example: For Logistic Regression, set `class_weight='balanced'` or provide manual weights.
- **Resources**:
  - [Improve Class Imbalance Using Class Weights](https://www.analyticsvidhya.com/blog/2020/10/improve-class-imbalance-class-weights/).
  - [How to Set Class Weights in Keras](https://datascience.stackexchange.com/questions/13490/how-to-set-class-weights-for-imbalanced-classes-in-keras).
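
As a worked example, the weights that scikit-learn's `class_weight='balanced'` heuristic would assign can be computed by hand from the class counts printed earlier. The 0/1 label encoding is an assumption of this sketch.

```python
import numpy as np

# Binary labels rebuilt from the class counts printed earlier:
# 0 = "NO" or ">30" (54864 + 35545 rows), 1 = "<30" (11357 rows)
y = np.concatenate([np.zeros(54864 + 35545, dtype=int),
                    np.ones(11357, dtype=int)])

# Heuristic documented for class_weight='balanced' in scikit-learn:
# n_samples / (n_classes * count_per_class)
counts = np.bincount(y)
weights = len(y) / (len(counts) * counts)
class_weight = {cls: round(float(w), 3) for cls, w in enumerate(weights)}
print(class_weight)
```

The minority class receives a weight of about 4.48 and the majority class about 0.56. The resulting dictionary could be passed as manual weights, e.g. `LogisticRegression(class_weight=class_weight)`, or `class_weight='balanced'` can be used directly.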

---

## 3. Ensemble Methods
- **Purpose**: Combines multiple classifiers to improve performance and handle class imbalance.
- **Techniques**:
  - **Data-Level Approaches**:
    - **Undersampling**: Reduces the majority class size.
    - **Oversampling**: Increases the minority class size.
    - **Hybrid Approaches**: Combines under and oversampling methods.
  - **Algorithm-Level Techniques**:
    - **Cost-Sensitive Learning**: Assigns different misclassification costs to classes.
    - **Threshold-Moving**: Adjusts the decision threshold to favor the minority class.
  - **Ensemble Learning Methods**:
    - **Bagging**: SMOTEBagging – Combines SMOTE with bagging methods.
    - **Boosting**: RUSBoost – Applies random undersampling with boosting.
    - **Balanced ensembles**: EasyEnsemble – Trains multiple boosted learners, each on a different balanced subset obtained by undersampling the majority class.
    - **Hybrid Methods**: Mix bagging + boosting, hybrid sampling + ensemble learning, or dynamic selection + preprocessing.
- **Resources**:
  - [Ensemble Techniques for Class Imbalance](https://thecontentfarm.net/ensemble-techniques-for-handling-class-imbalance/).
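
Threshold-moving from the algorithm-level techniques can be illustrated without retraining anything. The sketch below uses made-up predicted probabilities and compares the default 0.5 threshold with a lower one; `precision_recall_f1` is a hypothetical helper written for this example.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
proba = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.35, 0.45, 0.32, 0.40, 0.60])

results = {}
for threshold in (0.5, 0.3):
    y_pred = (proba >= threshold).astype(int)
    results[threshold] = precision_recall_f1(y_true, y_pred)
    p, r, f1 = results[threshold]
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Lowering the threshold trades precision for recall on the minority class. The threshold should be chosen on a validation split, not on the test set.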


### Missing Value Handling

In [100]:
missing_values = data.isnull().sum()
columns_with_missing_values = missing_values[missing_values > 0]
print(f"Columns with missing values:\n{columns_with_missing_values}")

Columns with missing values:
max_glu_serum    96420
A1Cresult        84748
dtype: int64


Some columns in the dataset represent missing values with the '?' character, which `isnull()` does not detect. Hence, a second exploration step counts the '?' entries in each column.

In [101]:
question_marks = (data == '?').sum()
columns_with_question_marks = question_marks[question_marks > 0]
print(f"Columns containing question marks and their counts:\n{columns_with_question_marks}")

Columns containing question marks and their counts:
race                  2273
weight               98569
payer_code           40256
medical_specialty    49949
diag_1                  21
diag_2                 358
diag_3                1423
dtype: int64


Based on this observation, we compute the total number of missing values per column (both NaN and '?' entries).

In [102]:
total_missing_values = missing_values + question_marks
columns_with_all_missing_values = total_missing_values[total_missing_values > 0]
print("Columns with missing values (including '?' and NaN):")
print(columns_with_all_missing_values)

Columns with missing values (including '?' and NaN):
race                  2273
weight               98569
payer_code           40256
medical_specialty    49949
diag_1                  21
diag_2                 358
diag_3                1423
max_glu_serum        96420
A1Cresult            84748
dtype: int64


In [103]:
# Convert '?'s into pandas NA values
data = data.replace('?', pd.NA)

# Check to see if the missing value summations match with the previous cell's output
missing_values = data.isnull().sum()
columns_with_missing_values = missing_values[missing_values > 0]
print(f"Columns with missing values:\n{columns_with_missing_values}")

Columns with missing values:
race                  2273
weight               98569
payer_code           40256
medical_specialty    49949
diag_1                  21
diag_2                 358
diag_3                1423
max_glu_serum        96420
A1Cresult            84748
dtype: int64


#### Baseline Approach: Deleting the Rows with Missing Values 

Remove rows that contain at least one missing value in any column.

In [104]:
data_dropped = data.dropna()
print(f"Original dataset shape: {data.shape}")
print(f"Cleaned dataset shape: {data_dropped.shape}")

Original dataset shape: (101766, 50)
Cleaned dataset shape: (0, 50)


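The output of the previous cell shows that no rows remain when we drop every row containing a missing value. This means that each row contains at least one missing value.

Hence, a more sophisticated missing-value handling strategy is needed.

Firstly, the following three columns are missing in roughly 83% to 97% of the rows:
- weight               (98569 missing values out of 101766, about 96.9%)
- max_glu_serum        (96420 missing values out of 101766, about 94.7%)
- A1Cresult            (84748 missing values out of 101766, about 83.3%)

With so few observed values, these columns carry little usable information and are dropped before continuing with the rest of the analysis. One caveat worth verifying: in `max_glu_serum` and `A1Cresult`, the NaN entries probably come from the literal string "None" in the CSV (test not performed), which recent pandas versions parse as NaN by default. If so, these entries could be kept as a "not measured" category instead of being treated as missing.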
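
Instead of listing such columns by hand, the fraction of missing values per column can be computed and compared with a cutoff. A minimal sketch on toy data, where the 0.5 cutoff is an assumed value and not one chosen in this project:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; np.nan marks missing entries
df = pd.DataFrame({
    "weight": [np.nan, np.nan, np.nan, np.nan, "[75-100)"],
    "race": ["Caucasian", np.nan, "Caucasian", "AfricanAmerican", "Caucasian"],
    "time_in_hospital": [1, 3, 2, 2, 1],
})

missing_fraction = df.isna().mean()   # share of missing values per column
threshold = 0.5                       # assumed cutoff for this example
cols_to_drop = missing_fraction[missing_fraction > threshold].index.tolist()

df_reduced = df.drop(columns=cols_to_drop)
print(missing_fraction.round(2).to_dict(), cols_to_drop)
```

Recording the chosen cutoff alongside the dropped column names makes the decision easier to justify later.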


In [105]:
columns_to_drop = ['weight', 'max_glu_serum', 'A1Cresult']
data_columns_dropped = data.drop(columns=columns_to_drop)
print(f"Original dataset shape: {data.shape}")
print(f"Dataset shape after dropping columns: {data_columns_dropped.shape}")

data = data_columns_dropped.copy()

Original dataset shape: (101766, 50)
Dataset shape after dropping columns: (101766, 47)


NOTE: The two columns with roughly 40% to 50% missing values could also be dropped. Discuss this with teammates.

- payer_code           (40256 missing values out of 101766, about 39.6%)
- medical_specialty    (49949 missing values out of 101766, about 49.1%)

In [106]:
# Now, again remove rows that contain at least one missing value in any column.
data_dropped = data.dropna()
print(f"Original dataset shape: {data.shape}")
print(f"Cleaned dataset shape: {data_dropped.shape}")

Original dataset shape: (101766, 47)
Cleaned dataset shape: (26755, 47)


#### Correlation-based Imputation

Identify which features are highly correlated and impute missing values accordingly.
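
One possible reading of this idea, sketched under strong assumptions: if a categorical feature with missing values is associated with another, fully observed feature, missing entries can be filled with the most frequent value within each group of the observed feature. The association between `admission_source_id` and `medical_specialty` is assumed here purely for illustration, the toy values are made up, and `fill_with_group_mode` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; np.nan marks missing entries
df = pd.DataFrame({
    "admission_source_id": [7, 7, 7, 1, 1, 1],
    "medical_specialty": ["Emergency", "Emergency", np.nan,
                          "Cardiology", np.nan, "Cardiology"],
})

def fill_with_group_mode(frame, target, by):
    """Fill missing `target` values with the mode of `target` within each `by` group."""
    def _fill(group):
        mode = group.mode()
        # Leave the group untouched if it has no observed value at all
        return group.fillna(mode.iloc[0]) if not mode.empty else group
    return frame.groupby(by)[target].transform(_fill)

df["medical_specialty"] = fill_with_group_mode(
    df, "medical_specialty", "admission_source_id"
)
print(df)
```

Whether two features are related strongly enough to justify this would first need to be checked on the real data, for example with a contingency table.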