## Elements of Statistical Learning (ESL) Final Project

**Our Machine Learning Question:**

**How can we improve the accuracy of predicting** which diabetic patients are likely to be readmitted to the hospital within 30 days across different classifiers **by addressing the severe class imbalance in the dataset?**
The dataset we chose is on “Early Readmission Prediction of Patients Diagnosed with Diabetes,” which has a significant class imbalance between early admitted patients and others. The positive class for our classification (early admitted patients) is underrepresented with a ratio of 1-to-9.

The techniques we are considering implementing to overcome the class imbalance in the dataset are as follows:
1. SMOTE (Synthetic Minority Oversampling Technique)
2. Class weights
3. Ensemble methods

All the above techniques will be compared across different classifiers to answer the following question: How well do different models (e.g., Logistic Regression, Random Forest, SVM) handle the imbalance and predict early readmissions?

Recommendations from Professor David:

- Check how different models react to class imbalance prior to implementing balancing techniques. 
- Also, check if the balancing techniques actually cause that the same objects are misclassified, or that suddenly also other objects go wrong (that used to be classified well).

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Data Loading and Exploration

The chosen dataset represents ten years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. Each row concerns hospital records of patients diagnosed with diabetes, who underwent laboratory, medications, and stayed up to 14 days.

Link to the dataset: https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008

**Our Aim:** Identifying patients at risk of being readmitted within 30 days of discharge.

In [2]:
dataset = pd.read_csv('diabetic_data.csv')
dataset.head(20)

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),?,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),?,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),?,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),?,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),?,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
5,35754,82637451,Caucasian,Male,[50-60),?,2,1,2,3,...,No,Steady,No,No,No,No,No,No,Yes,>30
6,55842,84259809,Caucasian,Male,[60-70),?,3,1,2,4,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
7,63768,114882984,Caucasian,Male,[70-80),?,1,1,7,5,...,No,No,No,No,No,No,No,No,Yes,>30
8,12522,48330783,Caucasian,Female,[80-90),?,2,1,4,13,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
9,15738,63555939,Caucasian,Female,[90-100),?,3,3,4,12,...,No,Steady,No,No,No,No,No,Ch,Yes,NO


In [4]:
print(dataset.columns)

Index(['encounter_id', 'patient_nbr', 'race', 'gender', 'age', 'weight',
       'admission_type_id', 'discharge_disposition_id', 'admission_source_id',
       'time_in_hospital', 'payer_code', 'medical_specialty',
       'num_lab_procedures', 'num_procedures', 'num_medications',
       'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1',
       'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted'],
      dtype='object')


In [6]:
target_distribution = dataset['readmitted'].value_counts()

print("Distribution of 'readmitted' target column:")
print(target_distribution)

target_percentage = dataset['readmitted'].value_counts(normalize=True) * 100
print("\nPercentage distribution of 'readmitted' target column:")
print(f"{round(target_percentage, 2)}")

Distribution of 'readmitted' target column:
readmitted
NO     54864
>30    35545
<30    11357
Name: count, dtype: int64

Percentage distribution of 'readmitted' target column:
readmitted
NO     53.91
>30    34.93
<30    11.16
Name: proportion, dtype: float64


### Binary Classification Formulation

There are three classes in the target 'readmitted' column:
- Class 1: <30 (early admission)
- Class 2: NO (no admission)
- Class 3: >30 (late admission)

For the purpose of our early admission prediction problem, we will use binary classification because we are looking for a way to classify early admissions correctly in this problem. Therefore, for our problem, the "No" and ">30" cases mean the same and can be combined into the same group.

Hence, the **binary classification** formulation can be described as follows:
- Class 1: <30 (early admission)
- Class 2: NO and >30 (no admission and late admission)

### Understanding the Imbalance in the Dataset

After formulating the dataset as a binary classification problem, it can be seen that the combined percentage of the "No" and ">30" classes is approximately 88.84%, whereas the percentage distribution of the early admission class "<30" is 11.16%.

Hence, it can be seen that there is a severe imbalance in the dataset. The positive class for our classification (early admitted patients) is underrepresented, with a ratio of 1-to-9.

## TODO During the Christmas Break

Link to the Google sheets for writing down comments on the dataset columns and rationalizing the implemented approaches:
https://docs.google.com/spreadsheets/d/1wQvVQijqFmdOWjL4nJtSQ5hRea5tVSDQo9-br04bgew/edit?usp=sharing

## Data Exploration Ideas

### 1. Missing Value Handling
- **Indicator of Missing Values**: Missing values are represented by `?` in the dataset.
- **Imputation Strategies**:
  - **Baseline Approach: Deleting the Rows with Missing Values:** Remove rows that contain at least one missing value in any column.
  - **Correlation-based Imputation**: Identify which features are highly correlated and impute missing values accordingly.
  - **Statistical Methods**:
    - For **categorical features**: Use the most frequent class (mode).
    - For **numerical features**: Use the mean, median, or a neighbor-based approach (e.g., K-Nearest Neighbors).
      - **Note**: Research the name of the neighbor-based imputation technique (e.g., KNN Imputation).
- **Documentation**: Record the rationale behind the chosen imputation strategy.

### 2. Feature Reduction & Extraction
- **Initial Cleanup**:
  - Drop non-informative columns such as:
    - ID fields.
    - Columns with constant values (e.g., same value for all rows).
- **Rationale**: Clearly document why specific columns were dropped.

- **Dimensionality Reduction**:
  - Implement methods like **Principal Component Analysis (PCA)**.
  - **Library**: Check if Scikit-learn has built-in functions for feature extraction or reduction.
  - Alternatively, develop custom logic for feature reduction based on feature correlation.
- **Rationale**: Justify the use of any dimensionality reduction method and its impact on the dataset.

---

## Data Preprocessing Ideas

### 1. Normalization
- **Research Questions**:
  - Which features require normalization?
    - Should normalization apply only to the target class or to all features?
- **Initial Assumption**:
  - Normalization should be applied to all features since some classifiers are sensitive to feature scaling.
- **Documentation**: Record the decision and reasoning behind normalization.

### 2. Handling Class Imbalance
- **Techniques to Address Imbalance**:
  1. **SMOTE (Synthetic Minority Oversampling Technique)**: Generate synthetic samples for the minority class.
  2. **Class Weights**: Assign higher weights to the minority class during training.
  3. **Ensemble Methods**: Use techniques like Random Forest or Bagging that are robust to imbalanced data.
- **Classifier Dependency**:
  - Some imbalance techniques are suitable only for specific classification algorithms.
  - **Documentation**: Clearly note the classifiers compatible with each imbalance-handling technique.

---

## Notes
- Keep a record of all decisions and approaches in the notebook or Markdown.
- Justify each step with reasoning or research findings for transparency and reproducibility.

---

# Handling Data Imbalance Techniques

## 1. SMOTE (Synthetic Minority Oversampling Technique)
- **Purpose**: Generates synthetic samples for the minority class to balance the dataset.
- **How it Works**: SMOTE creates new samples by interpolating between existing minority class examples.
- **Extensions**:
  - Borderline SMOTE: Focuses on samples near the decision boundary.
  - ADASYN: Generates more synthetic samples for harder-to-learn minority examples.
  - SMOTETomek: Combines SMOTE with Tomek Links to clean the dataset.
- **Resources**:
  - [SMOTE for Imbalanced Classification](https://www.geeksforgeeks.org/smote-for-imbalanced-classification-with-python/): Includes table for when to use each variant.
  - [imblearn.over_sampling.SMOTE Documentation](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html).

---

## 2. Class Weights
- **Purpose**: Adjust model training to account for class imbalance.
- **How it Works**: Assigns higher weights to the minority class, forcing the model to focus on it more.
- **F1 Score Formula**:
  \[ F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
  - If F1 = 0, the model performs poorly on the minority class.
- **Implementation**:
  - Use the `class_weight` parameter in classifiers such as Scikit-learn, LightGBM, or CatBoost.
    - Example: For Logistic Regression, set `class_weight='balanced'` or provide manual weights.
- **Resources**:
  - [Improve Class Imbalance Using Class Weights](https://www.analyticsvidhya.com/blog/2020/10/improve-class-imbalance-class-weights/).
  - [How to Set Class Weights in Keras](https://datascience.stackexchange.com/questions/13490/how-to-set-class-weights-for-imbalanced-classes-in-keras).

---

## 3. Ensemble Methods
- **Purpose**: Combines multiple classifiers to improve performance and handle class imbalance.
- **Techniques**:
  - **Data-Level Approaches**:
    - **Undersampling**: Reduces the majority class size.
    - **Oversampling**: Increases the minority class size.
    - **Hybrid Approaches**: Combines under and oversampling methods.
  - **Algorithm-Level Techniques**:
    - **Cost-Sensitive Learning**: Assigns different misclassification costs to classes.
    - **Threshold-Moving**: Adjusts the decision threshold to favor the minority class.
  - **Ensemble Learning Methods**:
    - **Bagging**: SMOTEBagging – Combines SMOTE with bagging methods.
    - **Boosting**: RUSBoost – Applies random undersampling with boosting.
    - **Stacking**: EasyEnsemble – Combines multiple models with data resampling.
    - **Hybrid Methods**: Mix bagging + boosting, hybrid sampling + ensemble learning, or dynamic selection + preprocessing.
- **Resources**:
  - [Ensemble Techniques for Class Imbalance](https://thecontentfarm.net/ensemble-techniques-for-handling-class-imbalance/).


### Setting up the dataset

In [9]:
from sklearn.model_selection import train_test_split

# Merging ">30" and "NO" (not readmitted) categories of the target variable
dataset['readmitted'] = dataset['readmitted'].apply(lambda x: 1 if x == '<30' else 0)

# Isolate target column
X = dataset.drop('readmitted', axis=1) # inplace=False (default)
y = dataset['readmitted']

# Splitting the dataset in training and test sets
# From this point on, we should only use the training set until we test the models
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# TODO stratification? to preserve the distribution of classes in training and test sets
# TODO grouping? to prevent data leakage
# we can't have information of the same person being separated in training and test sets
# lowercase x until we change Zeyu's to validation

data = x_train.copy() # TODO change this properly

### Missing Value Handling

In [10]:
missing_values = data.isnull().sum()
columns_with_missing_values = missing_values[missing_values > 0]
print(f"Columns with missing values:\n{columns_with_missing_values}")

Columns with missing values:
max_glu_serum    77130
A1Cresult        67778
dtype: int64


It is observed from the dataset that in some columns, missing values are represented by the '?' character. Hence, a second data exploration step for analyzing which columns contain the question mark character and their counts is conducted.

In [11]:
question_marks = (data == '?').sum()
columns_with_question_marks = question_marks[question_marks > 0]
print(f"Columns containing question marks and their counts:\n{columns_with_question_marks}")

Columns containing question marks and their counts:
race                  1824
weight               78852
payer_code           32135
medical_specialty    39969
diag_1                  17
diag_2                 271
diag_3                1131
dtype: int64


Based on this observation, we decided to find the summation of missing values (both NaN and ? characters). 

In [12]:
total_missing_values = missing_values + question_marks
columns_with_all_missing_values = total_missing_values[total_missing_values > 0]
print("Columns with missing values (including '?' and NaN):")
print(columns_with_all_missing_values)

Columns with missing values (including '?' and NaN):
race                  1824
weight               78852
payer_code           32135
medical_specialty    39969
diag_1                  17
diag_2                 271
diag_3                1131
max_glu_serum        77130
A1Cresult            67778
dtype: int64


In [13]:
# Convert '?'s into pandas NA values
data = data.replace('?', pd.NA)

# Check to see if the missing value summations match with the previous cell's output
missing_values = data.isnull().sum()
columns_with_missing_values = missing_values[missing_values > 0]
print(f"Columns with missing values:\n{columns_with_missing_values}")

assert (total_missing_values == missing_values).all(), "Mismatch in missing value summations!"

# If the assertion is successful, print a success message
print("\nConsistency check passed: Missing value summations match for '?' and NaN handling.")

Columns with missing values:
race                  1824
weight               78852
payer_code           32135
medical_specialty    39969
diag_1                  17
diag_2                 271
diag_3                1131
max_glu_serum        77130
A1Cresult            67778
dtype: int64

Consistency check passed: Missing value summations match for '?' and NaN handling.


#### Baseline Approach: Deleting the Rows with Missing Values 

Remove rows that contain at least one missing value in any column.

In [14]:
data_dropped = data.dropna()
print(f"Original dataset shape: {data.shape}")
print(f"Cleaned dataset shape after dropping rows: {data_dropped.shape}")

Original dataset shape: (81412, 49)
Cleaned dataset shape after dropping rows: (0, 49)


It is observed from the output of the previous cell that no rows remain in the dataset when we drop the rows containing a missing value. This suggests that each row contains at least one missing value.

Hence, a more sophisticated missing-value handling is needed. 

Firstly, the following three columns contain approximately 90% missing values:
- weight               (98569 missing values out of 101766)
- max_glu_serum        (96420 missing values out of 101766)
- A1Cresult            (84748 missing values out of 101766)

Hence, they are not informative and can be dropped before continuing with the rest of the analysis.


In [15]:
print(f"Original dataset shape: {data.shape}")
columns_to_drop = ['weight', 'max_glu_serum', 'A1Cresult']
data = data.drop(columns=columns_to_drop, axis=1)
print(f"Dataset shape after dropping columns: {data.shape}")

Original dataset shape: (81412, 49)
Dataset shape after dropping columns: (81412, 46)


NOTE: The two columns that contain approximately 50% missing values can also be dropped. Discuss it with your teammates.

- payer_code           (40256 missing values out of 101766)
- medical_specialty    (49949 missing values out of 101766)

In [16]:
# Now, again remove rows that contain at least one missing value in any column.
print(f"Dataset shape before dropping rows: {data.shape}")
data_dropped = data.dropna()
print(f"Cleaned dataset shape after dropping rows: {data_dropped.shape}")

Dataset shape before dropping rows: (81412, 46)
Cleaned dataset shape after dropping rows: (21494, 46)


#### Correlation-based Imputation

Identify which features are highly correlated and impute missing values accordingly.

In [17]:
print("Column types before applying one-hot encoding:")
print(data.dtypes[:25])

Column types before applying one-hot encoding:
encounter_id                 int64
patient_nbr                  int64
race                        object
gender                      object
age                         object
admission_type_id            int64
discharge_disposition_id     int64
admission_source_id          int64
time_in_hospital             int64
payer_code                  object
medical_specialty           object
num_lab_procedures           int64
num_procedures               int64
num_medications              int64
number_outpatient            int64
number_emergency             int64
number_inpatient             int64
diag_1                      object
diag_2                      object
diag_3                      object
number_diagnoses             int64
metformin                   object
repaglinide                 object
nateglinide                 object
chlorpropamide              object
dtype: object


In [18]:
print("Column types before applying one-hot encoding:")
print(data.dtypes[25:])

# First, apply one-hot encoding to non-numeric columns,
# prior to computing the correlation between the features

data_encoded = data.copy()
for column in data.select_dtypes(include=['object', 'category']).columns:
    data_encoded[column] = data_encoded[column].astype('category').cat.codes

Column types before applying one-hot encoding:
glimepiride                 object
acetohexamide               object
glipizide                   object
glyburide                   object
tolbutamide                 object
pioglitazone                object
rosiglitazone               object
acarbose                    object
miglitol                    object
troglitazone                object
tolazamide                  object
examide                     object
citoglipton                 object
insulin                     object
glyburide-metformin         object
glipizide-metformin         object
glimepiride-pioglitazone    object
metformin-rosiglitazone     object
metformin-pioglitazone      object
change                      object
diabetesMed                 object
dtype: object


In [19]:
print("Column types after applying one-hot encoding:")
print(data_encoded.dtypes[:25])

Column types after applying one-hot encoding:
encounter_id                int64
patient_nbr                 int64
race                         int8
gender                       int8
age                          int8
admission_type_id           int64
discharge_disposition_id    int64
admission_source_id         int64
time_in_hospital            int64
payer_code                   int8
medical_specialty            int8
num_lab_procedures          int64
num_procedures              int64
num_medications             int64
number_outpatient           int64
number_emergency            int64
number_inpatient            int64
diag_1                      int16
diag_2                      int16
diag_3                      int16
number_diagnoses            int64
metformin                    int8
repaglinide                  int8
nateglinide                  int8
chlorpropamide               int8
dtype: object


In [20]:
print("Column types after applying one-hot encoding:")
print(data_encoded.dtypes[25:])

Column types after applying one-hot encoding:
glimepiride                 int8
acetohexamide               int8
glipizide                   int8
glyburide                   int8
tolbutamide                 int8
pioglitazone                int8
rosiglitazone               int8
acarbose                    int8
miglitol                    int8
troglitazone                int8
tolazamide                  int8
examide                     int8
citoglipton                 int8
insulin                     int8
glyburide-metformin         int8
glipizide-metformin         int8
glimepiride-pioglitazone    int8
metformin-rosiglitazone     int8
metformin-pioglitazone      int8
change                      int8
diabetesMed                 int8
dtype: object


In [21]:
# Check the first 5 rows of the one-hot encoded data to see the encodings
print(data_encoded.head)

<bound method NDFrame.head of        encounter_id  patient_nbr  race  gender  age  admission_type_id  \
24079      81844290        94788     2       0    7                  1   
98079     396159158    135023315     2       1    5                  1   
6237       31258956     18397782     2       1    8                  1   
72208     210691074     67509558     2       1    8                  1   
33075     104902980     23272362     0       0    7                  1   
...             ...          ...   ...     ...  ...                ...   
6265       31296060      3344202     2       1    7                  1   
54886     159139902     93611655     2       1    6                  5   
76820     232191828     85600899     2       1    7                  3   
860         6740700      8208234     2       0    6                  6   
15795      60115668     77943780     2       0    4                  6   

       discharge_disposition_id  admission_source_id  time_in_hospital  \
24079  

In [22]:
# Now, compute the correlation matrix using the one-hot encoded data

correlation_matrix = data_encoded.corr()
print("Correlation Matrix:")
print(correlation_matrix)

Correlation Matrix:
                          encounter_id  patient_nbr      race    gender  \
encounter_id                  1.000000     0.514508  0.080394  0.006532   
patient_nbr                   0.514508     1.000000  0.149678  0.005483   
race                          0.080394     0.149678  1.000000  0.056734   
gender                        0.006532     0.005483  0.056734  1.000000   
age                           0.071384     0.070792  0.112790 -0.049667   
admission_type_id            -0.158658    -0.013086  0.095050  0.012961   
discharge_disposition_id     -0.133966    -0.138138  0.005491 -0.019367   
admission_source_id          -0.112098    -0.034054  0.029967 -0.003162   
time_in_hospital             -0.064575    -0.026424 -0.014443 -0.031082   
payer_code                    0.439541     0.228585  0.037413  0.000373   
medical_specialty            -0.171585    -0.140651 -0.030297  0.008465   
num_lab_procedures           -0.024343     0.016870 -0.021331 -0.002811   
num_p

In [26]:
def compute_high_correlations(correlation_threshold, corr_matrix): # TODO changed the name from correlation_matrix -M
    """
        Method for calculating highly correlated features above a given threshold
    """
    highly_correlated_features = corr_matrix[(corr_matrix.abs() > correlation_threshold) & (corr_matrix != 1.0)]
    print(f"\nHighly Correlated Features (Threshold > {correlation_threshold}):")
    print(highly_correlated_features.dropna(how="all", axis=0).dropna(how="all", axis=1))

In [28]:
compute_high_correlations(correlation_threshold=0.6, corr_matrix=correlation_matrix)
compute_high_correlations(correlation_threshold=0.5, corr_matrix=correlation_matrix)


Highly Correlated Features (Threshold > 0.6):
Empty DataFrame
Columns: []
Index: []

Highly Correlated Features (Threshold > 0.5):
              encounter_id  patient_nbr    change  diabetesMed
encounter_id           NaN     0.514508       NaN          NaN
patient_nbr       0.514508          NaN       NaN          NaN
change                 NaN          NaN       NaN    -0.508258
diabetesMed            NaN          NaN -0.508258          NaN


**Important Observation**  

It has been observed that the highest correlation among the features is 0.5 in magnitude (as the absolute values of the correlation matrix are used for computations). No correlations above this threshold have been observed.  

Since 0.5 is not considered a high correlation, implementing a correlation-based imputation technique for handling missing values does not seem plausible.

#### Statistical Methods for Missing Value Handling

- Missing values in **categorical features** will be imputed using the most frequent class (mode).

- Missing values in **numerical features** will be imputed using the **k-nearest neighbor (KNN) imputation approach**.

In [30]:
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer

print("Shape of dataset before imputation:", data.shape)

# All NA values are converted into nan for compatibility with sklearn
data_impute = data.replace({pd.NA: np.nan}) 

categorical_columns = data_impute.select_dtypes(include=['object', 'category']).columns
numerical_columns = data_impute.select_dtypes(include=['number']).columns

# Handling missing values in categorical features using Simple Imputer with "most frequent" strategy
data_impute[categorical_columns] = SimpleImputer(strategy='most_frequent').fit_transform(data_impute[categorical_columns])

# Handling missing values in numerical features using KNN Imputer with the given k value
k = 5 # TODO test different k values with pipeline
data_impute[numerical_columns] = KNNImputer(n_neighbors=k).fit_transform(data_impute[numerical_columns])

print("Shape of dataset after imputation:", data_impute.shape)
print("Total missing values in the dataset after imputation:", data_impute.isnull().sum().sum())

data_impute.head(20)

Shape of dataset before imputation: (81412, 46)
Shape of dataset after imputation: (81412, 46)
Total missing values in the dataset after imputation: 0


Unnamed: 0,encounter_id,patient_nbr,race,gender,age,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,payer_code,...,examide,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed
24079,81844290.0,94788.0,Caucasian,Female,[70-80),1.0,1.0,7.0,4.0,MC,...,No,No,No,No,No,No,No,No,No,No
98079,396159158.0,135023315.0,Caucasian,Male,[50-60),1.0,1.0,7.0,1.0,BC,...,No,No,No,No,No,No,No,No,No,No
6237,31258956.0,18397782.0,Caucasian,Male,[80-90),1.0,1.0,7.0,4.0,MC,...,No,No,No,No,No,No,No,No,No,Yes
72208,210691074.0,67509558.0,Caucasian,Male,[80-90),1.0,3.0,7.0,3.0,MC,...,No,No,Steady,No,No,No,No,No,Ch,Yes
33075,104902980.0,23272362.0,AfricanAmerican,Female,[70-80),1.0,11.0,7.0,11.0,MC,...,No,No,No,No,No,No,No,No,No,No
12913,51999162.0,19832265.0,Caucasian,Male,[40-50),3.0,18.0,1.0,4.0,MC,...,No,No,No,No,No,No,No,No,No,Yes
81022,250443612.0,38173338.0,Caucasian,Female,[60-70),1.0,6.0,7.0,6.0,MC,...,No,No,Steady,No,No,No,No,No,Ch,Yes
79358,243417654.0,1868706.0,Caucasian,Male,[80-90),1.0,6.0,7.0,8.0,MC,...,No,No,Steady,No,No,No,No,No,No,Yes
60741,170003016.0,59043186.0,Caucasian,Female,[30-40),1.0,1.0,7.0,1.0,PO,...,No,No,No,No,No,No,No,No,No,No
62297,173665674.0,91722366.0,AfricanAmerican,Female,[70-80),2.0,2.0,7.0,5.0,SP,...,No,No,Steady,No,No,No,No,No,Ch,Yes


# Data Preprocessing

## Normalization

Methods:

- Min-Max Scaling

- Standard Scaling

Features that may need normalization (numeric):

time_in_hospital, num_lab_procedures, num_procedures, num_medications, number_outpatient, number_emergency, number_inpatient, number_diagnoses

In [31]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

numeric_features = ['num_lab_procedures', 'num_medications', 'number_diagnoses']
# TODO why only those 3 when more are mentioned above?

scaler = MinMaxScaler()

data_scaled = data_impute.copy()
#data_scaled[numeric_features] = scaler.fit_transform(data_scaled[numeric_features])

print(data_scaled[numeric_features].head())

       num_lab_procedures  num_medications  number_diagnoses
24079                48.0             11.0               9.0
98079                42.0              5.0               6.0
6237                 44.0             10.0               7.0
72208                54.0              8.0               8.0
33075                35.0             23.0               8.0


## SMOTE

Encode non-numerical features first:

In [34]:
from sklearn.preprocessing import OneHotEncoder

data_onehot = data_scaled.copy()

# this is already done above
# data_onehot['binary'] = data_onehot['readmitted'].apply(lambda x: 1 if x == '<30' else 0)
# data_onehot.drop(['readmitted'], axis=1, inplace=True)

categorical_features = data_onehot.select_dtypes(include=['object', 'category']).columns
numerical_features = data_onehot.select_dtypes(include=['number']).columns

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded_onehot = encoder.fit_transform(data_onehot[categorical_features])
encoded_features = pd.DataFrame(encoded_onehot, columns=encoder.get_feature_names_out(categorical_features))
data_onehot = pd.concat([data_onehot.drop(columns=categorical_features), encoded_features], axis=1)

# encoded_features = pd.get_dummies(data_onehot[categorical_features])
# data_onehot = pd.concat([data_onehot[numerical_features], encoded_features], axis=1)

print(f"Dataset Shape Before Encoding: {data_scaled.shape}")
print(f"Dataset Shape After Encoding: {data_onehot.shape}")
print(data_onehot.head())

KeyError: 'readmitted'

Onehot encoding significantly increase features, may need feature reduction here.

In [176]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from collections import Counter

# changed it to reflect the splitting in the beginning
X_t = data_onehot #.drop('binary', axis=1)
y_t = x_test #data_onehot['binary']

X_train_val, X_val, y_train_val, y_val = train_test_split(X_t, y_t, test_size=0.2, random_state=42, stratify=y_t)

smote = SMOTE(random_state=42)
X_resampled, Y_resampled = smote.fit_resample(X_train_val, y_train_val)

print('Original Class Distribution:', Counter(y_train_val))
print('Over-sampled Class Distribution:', Counter(Y_resampled))

Original Class Distribution: Counter({0: 72326, 1: 9086})
Over-sampled Class Distribution: Counter({0: 72326, 1: 72326})


Classes are now balanced

In [177]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

model = LogisticRegression(random_state=42, max_iter=500)

model.fit(X_train_val, y_train_val)
Y_pred = model.predict(X_val)
print('Original Data Performance:')
print(classification_report(y_val, Y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_val, Y_pred))

model.fit(X_resampled, Y_resampled)
Y_pred_smote = model.predict(X_val)
print('\nSMOTE Data Performance:')
print(classification_report(y_val, Y_pred_smote))
print('Confusion Matrix:\n', confusion_matrix(y_val, Y_pred_smote))

Original Data Performance:
              precision    recall  f1-score   support

           0       0.89      1.00      0.94     18083
           1       0.00      0.00      0.00      2271

    accuracy                           0.89     20354
   macro avg       0.44      0.50      0.47     20354
weighted avg       0.79      0.89      0.84     20354

Confusion Matrix:
 [[18083     0]
 [ 2271     0]]


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))



SMOTE Data Performance:
              precision    recall  f1-score   support

           0       0.89      0.60      0.71     18083
           1       0.12      0.42      0.18      2271

    accuracy                           0.58     20354
   macro avg       0.50      0.51      0.45     20354
weighted avg       0.81      0.58      0.66     20354

Confusion Matrix:
 [[10787  7296]
 [ 1307   964]]


The most brute force oversampling, it somewhat increases performance on the minority class (not much though), but it also destroys predictions on the majority class, need more experiments with other methods.