<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 4 - Unveiling Chronic Disease in Singaporean Lifestyle

> Authors: Chung Yau, Gilbert, Han Kiong, Zheng Gang
---

**Problem Statement:**  
In Singapore, the increasing prevalence of chronic diseases presents a pressing public health concern, underscoring the need for proactive intervention strategies. 

How can we identify individuals at high risk for chronic diseases based on their behavioral habits? By doing so, we can enable early detection and provide recommendations, fostering a proactive approach to preventing various chronic diseases.

  
**Target Audience:**  
The product team at Synapxe, in preparation for the Healthier SG 2025 roadmap workshop. 

These are the notebooks for this project:  
 1. `01_Data_Collection_Food.ipynb`  
 2. `02_Data_Preprocessing.ipynb`   
 3. `03_FeatureEngineering_and_EDA.ipynb`
 4. `04_Data_Modelling.ipynb` 
 5. `05_Hyperparameter_Model Fitting_Evaluation.ipynb`
 6. `05a_Model_Pickling.ipynb`
 7. `06_Implementation_FoodRecommender.ipynb` 

 ---

# This Notebook: 05a_Model_Pickling

This notebook isolates the chosen model, using the best parameters found in the hyperparameter tuning notebook (05), so that it can be pickled and uploaded to Streamlit.

---

### Import Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.under_sampling import RandomUnderSampler
import pickle


### Read the CSV 

In [2]:
# read cleaned csv file into a data frame
df = pd.read_csv('../data/03_asian_data.csv')

In [3]:
df.head()

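| | heart_attack | stroke | asthma | skin_cancer | other_cancer | cpd_bronchitis | depression | kidney_disease | diabetes | sex | ... | one_alc_per_day | binge_drink | ave_drink_week | fruit | vegetable | exercise_cat | high_cholesterol | weight | BMI | CD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 233.0 | 0 | 0 | 2 | 1 | 6804.0 | 22.733803 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 5.397605e-79 | 0 | 0 | 0 | 0 | 5897.0 | 23.923891 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 5.397605e-79 | 0 | 1 | 1 | 1 | 6350.0 | 21.21688 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 5.397605e-79 | 0 | 0 | 1 | 1 | 8048.087779 | 28.055853 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 117.0 | 1 | 1 | 2 | 0 | 7484.0 | 23.098765 | 0 |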


In [4]:
df.shape

(16104, 33)

### Preprocessing

Similar to what was done in the data_modelling notebook and the GridSearch_HyperparameterTuning notebook, the preprocessing involves: 
1. Dropping the 'race' column, since filtering for Asian respondents was already performed during EDA
2. Checking the class balance of 1 and 0 in the 'CD' column

In [5]:
# drop the 'race' column
df.drop(columns=['race'], inplace=True)

In [6]:
# Check the count and the percentage of each unique value in the 'CD' column
print(df['CD'].value_counts())
print(df['CD'].value_counts(normalize=True) * 100)

CD
0    10631
1     5473
Name: count, dtype: int64
CD
0    66.014655
1    33.985345
Name: proportion, dtype: float64


### Transformation of data 


Similar to what was done in the data_modelling notebook and the GridSearch_HyperparameterTuning notebook, the transformation of data involves: 
1. Define X and y
2. Create a train-test split of X and y 

In [7]:
# Columns to drop from X: the target 'CD', the chronic-disease
# indicator columns, and height/weight
columns_to_check = [
    'cpd_bronchitis',
    'depression',
    'arthritis',
    'heart_attack',
    'stroke',
    'asthma',
    'diabetes',
    'kidney_disease',
    'heart_disease',
    'CD',
    'height',
    'weight'
]

X = df.drop(columns=columns_to_check)
y = df['CD']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
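
The `stratify=y` argument in the split above is meant to keep the share of positive `CD` cases roughly the same in the train and test sets. The following is a minimal sketch of that behaviour, assuming synthetic labels with the same class counts as the `CD` column (10631 zeros and 5473 ones) and a placeholder feature; `X_demo`, `y_demo` and the other names are hypothetical and not part of the project code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the 'CD' column
y_demo = np.array([0] * 10631 + [1] * 5473)
# Placeholder feature; its values are irrelevant to the split proportions
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of positive cases overall, in the train set, and in the test set
overall_rate = y_demo.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"overall: {overall_rate:.4f}, train: {train_rate:.4f}, test: {test_rate:.4f}")
```

All three rates should come out close to the 34% positive share reported earlier, so the test metrics are computed on a class mix that resembles the full dataset.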

In [8]:
# check columns in X
X.columns

Index(['skin_cancer', 'other_cancer', 'sex', 'martial', 'employment_status',
       'blind', 'diff_walking', 'occasion_drink_30days', 'high_bp', 'age',
       'education', 'smoker_status', 'one_alc_per_day', 'binge_drink',
       'ave_drink_week', 'fruit', 'vegetable', 'exercise_cat',
       'high_cholesterol', 'BMI'],
      dtype='object')

### Modelling


Similar to what was done in the data_modelling notebook and the GridSearch_HyperparameterTuning notebook, the modelling involves: 
1. Undersample the majority class in the training set with RandomUnderSampler to address class imbalance
2. Apply PolynomialFeatures to selected columns
3. Create a pipeline of PolynomialFeatures, StandardScaler, PCA and the LogisticRegression classifier
4. Fit the pipeline using the best parameters
5. Generate scoring metrics (accuracy, precision, recall, F1 score) on the train and test sets to see how well the model performs with the best parameters
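
As a quick sanity check on the feature count, the sketch below applies the same kind of `ColumnTransformer` to a placeholder frame with the 20 column names from `X.columns`. It assumes the default `PolynomialFeatures()` settings (degree 2, bias column included) and random stand-in values; `X_demo` and `n_out` are hypothetical names. With 7 polynomial columns, degree 2 yields 1 + 7 + 7 + 21 = 36 terms, and the 13 remaining columns pass through unchanged, so PCA with 30 components reduces 49 features to 30.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PolynomialFeatures

# Column names copied from the X.columns output
all_cols = [
    'skin_cancer', 'other_cancer', 'sex', 'martial', 'employment_status',
    'blind', 'diff_walking', 'occasion_drink_30days', 'high_bp', 'age',
    'education', 'smoker_status', 'one_alc_per_day', 'binge_drink',
    'ave_drink_week', 'fruit', 'vegetable', 'exercise_cat',
    'high_cholesterol', 'BMI',
]
poly_cols = ['occasion_drink_30days', 'BMI', 'education', 'smoker_status',
             'exercise_cat', 'ave_drink_week', 'age']

# Random placeholder values; only the resulting shape matters here
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.random((10, len(all_cols))), columns=all_cols)

transformer = ColumnTransformer(
    transformers=[('poly', PolynomialFeatures(), poly_cols)],
    remainder='passthrough',
)

# 36 polynomial terms + 13 passthrough columns
n_out = transformer.fit_transform(X_demo).shape[1]
print(n_out)
```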

In [9]:
# Perform RandomUnderSampler for class imbalance
rus = RandomUnderSampler(random_state=42)
X_train_resampled, y_train_resampled = rus.fit_resample(X_train, y_train)

# Define columns you want to apply polynomial features to
poly_cols = [
    'occasion_drink_30days', 
    'BMI',                    
    'education',              
    'smoker_status',          
    'exercise_cat',          
    'ave_drink_week',         
    'age'                    
]
# Define PCA with n_components=30
pca = PCA(n_components=30)

# Create a ColumnTransformer to apply PolynomialFeatures to selected columns
poly_transformer = ColumnTransformer(
    transformers=[
        ('poly', PolynomialFeatures(), poly_cols)
    ],
    remainder='passthrough'  # Pass through columns not specified for polynomial features
)

# Define the pipeline with best parameters
pipeline = Pipeline([
    ('poly_transformer', poly_transformer),
    ('std_scaler', StandardScaler()),
    ('pca', pca),  # Apply PCA with n_components=30
    ('lr', LogisticRegression(C=0.021544346900318832, class_weight='balanced',
                               max_iter=100, penalty='l1', solver='liblinear', random_state=42))
])

# Fit the pipeline to the training data
pipeline.fit(X_train_resampled, y_train_resampled)

# Predict on the train data
train_y_pred = pipeline.predict(X_train_resampled)

# Calculate evaluation metrics for train data
train_accuracy = accuracy_score(y_train_resampled, train_y_pred)
train_precision = precision_score(y_train_resampled, train_y_pred)
train_recall = recall_score(y_train_resampled, train_y_pred)
train_f1 = f1_score(y_train_resampled, train_y_pred)

# Predict on the test data
y_pred = pipeline.predict(X_test)

# Calculate evaluation metrics for test data
test_accuracy = accuracy_score(y_test, y_pred)
test_precision = precision_score(y_test, y_pred)
test_recall = recall_score(y_test, y_pred)
test_f1 = f1_score(y_test, y_pred)

# Create DataFrames to store the evaluation metrics
train_results_df = pd.DataFrame({
    'classifier': ['LogisticRegression'],
    'train accuracy': [train_accuracy],
    'train precision': [train_precision],
    'train recall': [train_recall],
    'train f1 score': [train_f1]
})

test_results_df = pd.DataFrame({
    'classifier': ['LogisticRegression'],
    'test accuracy': [test_accuracy],
    'test precision': [test_precision],
    'test recall': [test_recall],
    'test f1 score': [test_f1]
})

In [10]:
train_results_df

| | classifier | train accuracy | train precision | train recall | train f1 score |
|---|---|---|---|---|---|
| 0 | LogisticRegression | 0.694724 | 0.725948 | 0.625628 | 0.672065 |


In [11]:
test_results_df

| | classifier | test accuracy | test precision | test recall | test f1 score |
|---|---|---|---|---|---|
| 0 | LogisticRegression | 0.719652 | 0.579734 | 0.637443 | 0.607221 |


### Pickling of Desired Model

In [12]:
# Save the trained model to a file
with open('trained_model.pkl', 'wb') as f:
    pickle.dump(pipeline, f)

print("Model is pickled and saved as 'trained_model.pkl'")

Model is pickled and saved as 'trained_model.pkl'
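
Before uploading the file, it may be worth confirming that a pickled pipeline gives the same predictions once it is loaded back. The sketch below illustrates such a round trip on a small synthetic dataset with a simplified two-step pipeline; `demo_pipeline`, `demo_model.pkl` and the data are hypothetical stand-ins, not the project's model or the contents of `trained_model.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-in for the real training data
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Simplified pipeline, for illustration only
demo_pipeline = Pipeline([
    ('std_scaler', StandardScaler()),
    ('lr', LogisticRegression(solver='liblinear', random_state=42)),
])
demo_pipeline.fit(X_demo, y_demo)

# Pickle to a temporary file, then load it back as a consuming app would
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.pkl')
    with open(model_path, 'wb') as f:
        pickle.dump(demo_pipeline, f)
    with open(model_path, 'rb') as f:
        loaded_pipeline = pickle.load(f)

# The reloaded pipeline should reproduce the original predictions
same_predictions = np.array_equal(
    demo_pipeline.predict(X_demo), loaded_pipeline.predict(X_demo)
)
print(same_predictions)
```

The same pattern applies to `trained_model.pkl`: open it in `'rb'` mode, call `pickle.load`, and pass a DataFrame with the same feature columns to `predict`. Pickled scikit-learn models are generally expected to be loaded with the same scikit-learn version that produced them, so pinning package versions in the deployment environment is a reasonable precaution.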
