<div style="border-radius: 10px; border: #6B8E23 solid; padding: 15px; background-color: #F5F5DC; font-size: 100%; text-align: left">

<h3 align="left"><font color='#556B2F'>📜 Introduction : </font></h3>
    
Smoking harms nearly every organ in the body, causes a wide range of diseases, and significantly reduces smokers' life expectancy. As of 2018, it was recognized as the leading cause of preventable disease and death worldwide, making it a major threat to public health.

According to the World Health Organization, smoking is projected to cause 10 million deaths by 2030.

Evidence-based treatments to help people quit smoking have had limited success, with fewer than one third of participants achieving abstinence. Many physicians find smoking cessation counseling ineffective and time-consuming, so it is used infrequently in daily practice. To address this challenge, various factors have been proposed for identifying smokers who are more likely to quit, such as nicotine dependence, carbon monoxide levels, daily cigarette consumption, age of smoking initiation, previous quit attempts, marital status, emotional well-being, personality traits, and motivation to quit. However, using these factors individually for prediction often yields complex and conflicting results.

A prediction model offers a more straightforward way to assess an individual's likelihood of quitting smoking. In recent years, machine learning methods have been used to build health outcome prediction models, including models that predict smoking status from bio-signals. A team of scientists is currently working on such models, and our task is to assist them by developing a machine learning model that identifies an individual's smoking status from bio-signals.

# Content

1. [Importing & Reading Data](#1)
1. [EDA](#2)
1. [Visualization](#3)
    * [Categorical](#4)
    * [Numerical](#5)
    * [Correlation](#6)
1. [Feature Engineering](#7)
    * [Creating New Features](#8)
    * [Outlier Detection](#9)
    * [Checking Distributions](#10)
1. [Modelling](#11)
1. [Split](#12)
1. [Feature Importance](#13)
1. [LightGBM Classifier](#14)
1. [XGBoost Classifier](#15)
1. [Voting and Stacking Classifier](#16)
1. [Prediction](#17)

<a id="1"></a>
<h1 style="border-radius: 10px; border: 2px solid #6B8E23; background-color: #F5F5DC; font-family: 'Pacifico', cursive; font-size: 200%; text-align: center; border-radius: 15px 50px; padding: 15px; box-shadow: 5px 5px 5px #556B2F; color: #556B2F;">🐼 Importing & Reading Data 🐼</h1>


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

!wget https://raw.githubusercontent.com/h4pZ/rose-pine-matplotlib/main/themes/rose-pine-dawn.mplstyle -P /tmp
plt.style.use("/tmp/rose-pine-dawn.mplstyle")

import warnings
warnings.filterwarnings("ignore")

In [None]:
path = "/kaggle/input/playground-series-s3e24/"
train = pd.read_csv(path+"train.csv")
test = pd.read_csv(path+"test.csv")
sub = pd.read_csv(path+"sample_submission.csv")
org = pd.read_csv("/kaggle/input/smoker-status-prediction-using-biosignals/train_dataset.csv")

<a id="2"></a>
<h1 style="border-radius: 10px; border: 2px solid #6B8E23; background-color: #F5F5DC; font-family: 'Pacifico', cursive; font-size: 200%; text-align: center; border-radius: 15px 50px; padding: 15px; box-shadow: 5px 5px 5px #556B2F; color: #556B2F;">🔬 EDA 🔬</h1>

In [None]:
train = pd.concat([train.drop("id",axis=1), org], ignore_index=True)
test.drop("id",axis=1, inplace=True)

In [None]:
train.head(10)

<div style="border-radius: 10px; border: #6B8E23 solid; padding: 15px; background-color: #F5F5DC; font-size: 100%; text-align: left">

<h3 align="left"><font color='#556B2F'>👀 Features : </font></h3>

1. **id:** Identification number

2. **age:** Age represented in 5-year intervals.

3. **height (cm):** Individual's height in centimeters.

4. **weight (kg):** Individual's weight in kilograms.

5. **waist (cm):** Waist circumference in centimeters.

6. **eyesight (left):** Eyesight measurement for the left eye.

7. **eyesight (right):** Eyesight measurement for the right eye.

8. **hearing (left):** Hearing ability for the left ear.

9. **hearing (right):** Hearing ability for the right ear.

10. **systolic:** Systolic blood pressure measurement.

11. **relaxation:** Diastolic blood pressure measurement.

12. **fasting blood sugar:** Fasting blood sugar measurement.

13. **Cholesterol (total):** Total cholesterol level.

14. **triglyceride:** Triglyceride level.

15. **HDL (High-Density Lipoprotein):** HDL cholesterol level (good cholesterol).

16. **LDL (Low-Density Lipoprotein):** LDL cholesterol level (bad cholesterol).

17. **hemoglobin:** Hemoglobin level.

18. **Urine protein:** Urine protein level.

19. **serum creatinine:** Serum creatinine level.

20. **AST (Aspartate Aminotransferase):** AST enzyme level.

21. **ALT (Alanine Aminotransferase):** ALT enzyme level.

22. **Gtp (γ-Glutamyltranspeptidase):** Gtp enzyme level.

23. **dental caries:** Dental caries status.

24. **smoking:** Smoking status.

In [None]:
train.info()

In [None]:
train.isnull().sum()

In [None]:
train.duplicated().sum()  # count the duplicate rows

In [None]:
train.drop_duplicates(inplace=True)  # drop duplicate rows, if any

In [None]:
train.nunique()

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
* As we can see, there are no null values.
* Duplicate rows should be removed before modelling, so `drop_duplicates` was applied.
* Next, the smoking status is mapped from 0/1 to boolean values (`False`/`True`) for modelling.
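
The checks listed above can be sketched on a small made-up frame. The `toy` data and the variable names below are illustrative assumptions, not the competition data:

```python
import pandas as pd

# Toy frame standing in for the real data (all values are made up).
toy = pd.DataFrame({
    "age": [40, 40, 55, 60],
    "hemoglobin": [15.1, 15.1, 13.2, 16.0],
    "smoking": [1, 1, 0, 1],
})

n_missing = int(toy.isnull().sum().sum())    # total number of missing cells
n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row

# Drop the repeated rows, then map the 0/1 target to booleans.
clean = toy.drop_duplicates().reset_index(drop=True)
clean["smoking"] = clean["smoking"].map({0: False, 1: True})

print(n_missing, n_duplicates, len(clean))
```

Resetting the index after `drop_duplicates` is optional; it is done here only so the toy result has a contiguous index.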

In [None]:
train["smoking"]=train["smoking"].map({0: False,1: True})

In [None]:
num_cols = [col for col in train.columns if (train[col].dtype in ["int64","float64"]) & (train[col].nunique()>10)]

num_cols

In [None]:
cat_cols = [col for col in train.columns if train[col].nunique()<10]

cat_cols

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
* Numeric columns with more than 10 unique values are treated as **numerical features**, and columns with fewer than 10 unique values as **categorical variables**.
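
As a rough illustration of this split, the helper below (a hypothetical `split_columns`, not part of the notebook) applies the same unique-count rule to a toy frame. Because both comparisons are strict, a column with exactly 10 unique values lands in neither list, which may or may not be intended:

```python
import pandas as pd

def split_columns(df, threshold=10):
    """Return (numerical, categorical) column names based on unique-value counts."""
    numerical = [
        c for c in df.columns
        if pd.api.types.is_numeric_dtype(df[c]) and df[c].nunique() > threshold
    ]
    categorical = [c for c in df.columns if df[c].nunique() < threshold]
    return numerical, categorical

# Toy data (made up): 12, 2 and exactly 10 distinct values respectively.
toy = pd.DataFrame({
    "waist": [70 + i for i in range(12)],
    "dental caries": [0, 1] * 6,
    "decade": list(range(10)) + [0, 1],
})

numerical, categorical = split_columns(toy)
print(numerical, categorical)
```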

<a id="3"></a>
<h1 style="border-radius: 10px; border: 2px solid #6B8E23; background-color: #F5F5DC; font-family: 'Pacifico', cursive; font-size: 200%; text-align: center; border-radius: 15px 50px; padding: 15px; box-shadow: 5px 5px 5px #556B2F; color: #556B2F;">🐼 Visualization 🐼</h1>

<a id = "4"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Categorical✨</p>

In [None]:
plt.figure(figsize=(12,len(cat_cols)*3))
for idx,column in enumerate(cat_cols[:-1]):
    plt.subplot(len(cat_cols)//2+1,2,idx+1)
    sns.countplot(hue="smoking", x=column, data=train, palette="pastel")
    plt.title(f"{column} Distribution",weight='bold',fontsize=12)
    plt.tight_layout()

<a id = "5"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Numerical✨</p>

In [None]:
plt.figure(figsize=(12,len(num_cols)*3))
for idx,column in enumerate(num_cols):
    plt.subplot(len(num_cols)//2+1,2,idx+1)
    sns.boxplot(x="smoking", y=column, data=train,palette="pastel")
    plt.title(f"{column} Distribution")
    plt.tight_layout()

<a id = "6"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Correlation✨</p>

In [None]:
plt.figure(figsize=(12,12))
corr=train[num_cols].corr(numeric_only=True)
mask= np.triu(np.ones_like(corr))
sns.heatmap(corr, annot=True, fmt=".1f", linewidths=1, mask=mask, cmap=sns.color_palette("icefire"));

<a id="7"></a>
<h1 style="border-radius: 10px; border: 2px solid #6B8E23; background-color: #F5F5DC; font-family: 'Pacifico', cursive; font-size: 200%; text-align: center; border-radius: 15px 50px; padding: 15px; box-shadow: 5px 5px 5px #556B2F; color: #556B2F;">🛠️ Feature Engineering 🛠️</h1>

<a id = "8"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Creating New Features✨</p>

In [None]:
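# Convert height from centimetres to metres. Note that the column keeps its
# original name "height(cm)" even though it now holds metres.
train["height(cm)"] = train["height(cm)"] / 100
test["height(cm)"] = test["height(cm)"] / 100

# BMI = weight (kg) / height (m)^2
train["BMI"] = train["weight(kg)"] / train["height(cm)"] ** 2
test["BMI"] = test["weight(kg)"] / test["height(cm)"] ** 2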

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
BMI (Body Mass Index) is a numerical measure of a person's weight relative to their height. It is a simple and quick way to assess whether an individual has a healthy body weight for their height, and is calculated by dividing weight in kilograms by the square of height in meters.

* BMI = (Weight in kilograms) / (Height in meters)^2
    * BMI < 18.5 - Underweight
    * BMI 18.5-24.99 - Normal
    * BMI 25-29.99 - Overweight
    * BMI 30-34.99 - Obesity
* Ranked as:
    * Normal - 1
    * Underweight, Overweight - 2
    * Obesity - 3
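
The code cell above stores only the raw BMI value. If the ranking described here were wanted as a feature, it could be sketched roughly as follows. The helper names `bmi` and `bmi_rank` are hypothetical, and treating every BMI of 30 or more as "Obesity" is an assumption, since the list stops at 34.99:

```python
import numpy as np
import pandas as pd

def bmi(weight_kg, height_cm):
    """BMI = weight (kg) / height (m)^2, with height given in centimetres."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2

def bmi_rank(bmi_values):
    """Map BMI to the 1/2/3 ranking from the list above."""
    # Left-closed bins: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
    category = pd.cut(
        bmi_values,
        bins=[0, 18.5, 25, 30, np.inf],
        labels=["Underweight", "Normal", "Overweight", "Obesity"],
        right=False,
    )
    ranks = {"Underweight": 2, "Normal": 1, "Overweight": 2, "Obesity": 3}
    return category.astype(str).map(ranks).astype(int)

# Toy measurements (made up).
toy = pd.DataFrame({"weight(kg)": [50, 70, 85, 100], "height(cm)": [175, 175, 175, 175]})
bmi_values = bmi(toy["weight(kg)"], toy["height(cm)"])
ranks = bmi_rank(bmi_values)
print(ranks.tolist())
```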

In [None]:
# Left-closed bins match the thresholds described below:
# <150 -> 1, 150-199 -> 2, 200-499 -> 3, 500+ -> 4
trig_bins = [0, 150, 200, 500, np.inf]

train["Triglyceride_Rank"] = pd.cut(train["triglyceride"], bins=trig_bins,
                                    labels=[1, 2, 3, 4], right=False).astype(int)

test["Triglyceride_Rank"] = pd.cut(test["triglyceride"], bins=trig_bins,
                                   labels=[1, 2, 3, 4], right=False).astype(int)

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
To maintain good health, it is essential to keep triglyceride levels within a healthy range. Triglyceride levels (mg/dL):
    
* Less than 150 - Normal
* 150-199 - Moderate risk
* 200-499 - High risk
* 500 and above - Very high risk
    
These categories are encoded as numbers for modelling:
    
* Normal - 1
* Moderate Risk - 2
* High Risk - 3
* Very High Risk - 4

In [None]:
# Left-closed bins match the thresholds described below:
# <200 -> 1, 200-239 -> 2, 240+ -> 3
cho_bins = [0, 200, 240, np.inf]

train["Cho_Rank"] = pd.cut(train["Cholesterol"], bins=cho_bins,
                           labels=[1, 2, 3], right=False).astype(int)

test["Cho_Rank"] = pd.cut(test["Cholesterol"], bins=cho_bins,
                          labels=[1, 2, 3], right=False).astype(int)

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>

Total cholesterol levels are typically measured in milligrams per deciliter (mg/dL) of blood in the United States. The measurement includes both "low-density lipoprotein" (LDL) cholesterol, often referred to as "bad cholesterol," and "high-density lipoprotein" (HDL) cholesterol, known as "good cholesterol." Levels of total cholesterol:
    
* Less than 200 mg/dL - Desirable
* 200-239 mg/dL - Borderline high
* 240 mg/dL and above - High
    
These categories are encoded as rank numbers for modelling:
    
* Desirable - 1
* Borderline High - 2
* High - 3
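
One way to check that a threshold encoding puts the boundary values where the lists say they belong (150 as moderate risk, 200 as borderline high) is a small generic helper. This is a minimal sketch; `rank_by_thresholds` is a hypothetical name, and it assumes left-closed bins via `pd.cut(..., right=False)`:

```python
import numpy as np
import pandas as pd

def rank_by_thresholds(values, thresholds):
    """Rank 1 below thresholds[0], rank 2 from thresholds[0] up to thresholds[1], and so on."""
    bins = [-np.inf, *thresholds, np.inf]
    labels = list(range(1, len(thresholds) + 2))
    # right=False makes each bin include its lower edge and exclude its upper edge.
    return pd.cut(values, bins=bins, labels=labels, right=False).astype(int)

# Boundary values chosen to probe each threshold (made-up inputs).
triglyceride = pd.Series([149, 150, 199, 200, 499, 500])
cholesterol = pd.Series([199, 200, 239, 240])

trig_rank = rank_by_thresholds(triglyceride, [150, 200, 500])
cho_rank = rank_by_thresholds(cholesterol, [200, 240])
print(trig_rank.tolist(), cho_rank.tolist())
```

Using `np.inf` as the last edge avoids depending on the maximum of any particular data set, so the same bins apply to both the training and test frames.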

In [None]:
train['Cholesterol_Lipit'] = train['Cholesterol'] / train['HDL']
test['Cholesterol_Lipit'] = test['Cholesterol'] / test['HDL']

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3> 
    
The `Cholesterol_Lipit` feature is the ratio of total cholesterol to HDL cholesterol, two of the values reported in a lipid profile. A cholesterol and lipid profile is a blood test that measures the levels of lipids in the blood. Lipids are fats and related compounds used by the body for energy. The profile is an important medical test used to assess a person's cardiovascular health and predict the risk of cardiovascular diseases.

It helps assess the accumulation of fat in the vessel walls and the risk of atherosclerosis (hardening of the arteries). Based on the results, healthcare professionals evaluate a person's risk of cardiovascular disease and recommend treatment or lifestyle changes if necessary. This test can help in the prevention and management of heart attacks, strokes, and other cardiovascular diseases.

In [None]:
# train["LDL_tri"] = train["LDL"] / train["triglyceride"]
# train["Liver_Enzyme"] = train["AST"] / train["ALT"]
# 
# test["LDL_tri"] = test["LDL"] / test["triglyceride"]
# test["Liver_Enzyme"] = test["AST"] / test["ALT"]

In [None]:
# Drop the eyesight and hearing features from both frames.
drop_cols = ["eyesight(left)", "eyesight(right)", "hearing(left)", "hearing(right)"]
train = train.drop(columns=drop_cols)
test = test.drop(columns=drop_cols)

In [None]:
train.head()

In [None]:
train.info()

<a id = "9"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Outlier Detection✨</p>

In [None]:
num_cols = [col for col in train.columns if (train[col].dtype in ["int64","float64"]) & (train[col].nunique()>10)]

In [None]:
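def corr_skew_outliner(df, cols):
    """Clip each column to its 1st and 95th percentiles, then take the square root to reduce skew."""
    for col in cols:
        lower = df[col].quantile(0.01)  # 1st percentile
        upper = df[col].quantile(0.95)  # 95th percentile
        # Clip the extremes to the percentile bounds, then apply sqrt.
        df[col] = np.sqrt(df[col].clip(lower=lower, upper=upper))

    return df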

In [None]:
train = corr_skew_outliner(train,num_cols)
test = corr_skew_outliner(test,num_cols)

In [None]:
plt.figure(figsize=(12,len(num_cols)*3))
for idx,column in enumerate(num_cols):
    plt.subplot(len(num_cols)//2+1,2,idx+1)
    sns.boxplot(x="smoking", y=column, data=train,palette="pastel")
    plt.title(f"{column} Distribution")
    plt.tight_layout()

<a id = "10"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Checking Distributions✨</p>

In [None]:
plt.figure(figsize=(12,len(num_cols)*3))
for idx,column in enumerate(num_cols):
    plt.subplot(len(num_cols)//2+1,2,idx+1)
    sns.histplot(x=column, hue="smoking", data=train,bins=30,kde=True, palette="pastel")
    plt.title(f"{column} Distribution")
    plt.tight_layout()

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
* Clipping the outliers to the 1st and 95th percentiles seems to have worked.
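
The notebook computes the clipping bounds separately for each frame it is applied to. A common variant, shown here only as a sketch and not as the notebook's method, learns the bounds on the training data and reuses them on the test data so both are transformed identically. The helper names `fit_clip_bounds` and `clip_and_sqrt` and the toy values are assumptions:

```python
import numpy as np
import pandas as pd

def fit_clip_bounds(df, cols, lower_q=0.01, upper_q=0.95):
    """Compute (lower, upper) quantile bounds per column from one frame."""
    return {c: (df[c].quantile(lower_q), df[c].quantile(upper_q)) for c in cols}

def clip_and_sqrt(df, bounds):
    """Clip each column to the given bounds, then apply a square-root transform."""
    out = df.copy()  # leave the input frame untouched
    for col, (lo, hi) in bounds.items():
        out[col] = np.sqrt(out[col].clip(lower=lo, upper=hi))
    return out

# Toy frames (made up) with one extreme value each.
train_toy = pd.DataFrame({"Gtp": [10.0, 20.0, 30.0, 40.0, 1000.0]})
test_toy = pd.DataFrame({"Gtp": [5.0, 25.0, 5000.0]})

bounds = fit_clip_bounds(train_toy, ["Gtp"])
train_out = clip_and_sqrt(train_toy, bounds)
test_out = clip_and_sqrt(test_toy, bounds)
print(bounds, test_out["Gtp"].round(2).tolist())
```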

<a id="11"></a>
<h1 style="border-radius: 10px; border: 2px solid #6B8E23; background-color: #F5F5DC; font-family: 'Pacifico', cursive; font-size: 200%; text-align: center; border-radius: 15px 50px; padding: 15px; box-shadow: 5px 5px 5px #556B2F; color: #556B2F;">🐼 Modelling 🐼</h1>

<a id = "12"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Split✨</p>

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X = train.drop("smoking", axis=1)
y = train["smoking"]


X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.1, random_state = 401)

<a id = "13"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Feature Importance✨</p>

In [None]:
import lightgbm
lgb = lightgbm.LGBMClassifier()
lgb.fit(X_train, y_train)
lightgbm.plot_importance(lgb);
accuracy_score(y_test,lgb.predict(X_test))

In [None]:
import xgboost
xgb = xgboost.XGBClassifier()
xgb.fit(X_train, y_train)
xgboost.plot_importance(xgb);
accuracy_score(y_test,xgb.predict(X_test))

<a id = "14"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨LightGBM Classifier✨</p>

In [None]:
from lightgbm import LGBMClassifier
import optuna

def objective_lgb(trial):
    """Define the objective function"""

    params = {
        'metric': trial.suggest_categorical('metric', ['binary_error']),
        'max_depth': trial.suggest_int('max_depth', 1, 10),
        'min_child_samples': trial.suggest_int('min_child_samples', 1, 15),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.05, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 300, 700),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'subsample': trial.suggest_float('subsample', 0.1, 0.9),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.01, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.01, 1.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.01, 1.0),
        "seed" : trial.suggest_categorical('seed', [42]),
        'device': trial.suggest_categorical('device', ['gpu']),
    }


    model_lgb = LGBMClassifier(**params)
    model_lgb.fit(X_train, y_train)
    y_pred = model_lgb.predict(X_test)
    return accuracy_score(y_test,y_pred)

In [None]:
study_lgb = optuna.create_study(direction='maximize')
optuna.logging.set_verbosity(optuna.logging.WARNING)
study_lgb.optimize(objective_lgb, n_trials=50,show_progress_bar=True)

In [None]:
# Print the best parameters;

print('Best parameters', study_lgb.best_params)

In [None]:
lgb = LGBMClassifier(**study_lgb.best_params)
lgb.fit(X_train, y_train)
y_pred = lgb.predict(X_test)

print('Accuracy: ', accuracy_score(y_test, y_pred))

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(lgb,X_test, y_test,display_labels=("False", "True"),cmap="RdPu");

In [None]:
import shap 
explainer = shap.TreeExplainer(lgb)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)

<a id = "15"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨XGBoost Classifier✨</p>

In [None]:
from xgboost import XGBClassifier
import optuna
def objective_xg(trial):
    """Define the objective function"""

    params = {
        'booster': trial.suggest_categorical('booster', ['gbtree']),
        'max_depth': trial.suggest_int('max_depth', 1, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.05, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 300, 1000),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'gamma': trial.suggest_float('gamma', 1e-8, 1.0, log=True),
        'subsample': trial.suggest_float('subsample', 0.3, 0.9, log=True),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.01, 1.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.01, 1.0),
        "seed" : trial.suggest_categorical('seed', [42]),
        'tree_method': trial.suggest_categorical('tree_method', ['gpu_hist']),
        'objective': trial.suggest_categorical('objective', ['binary:logistic']),
    }
    model_xgb = XGBClassifier(**params)
    model_xgb.fit(X_train, y_train)
    y_pred = model_xgb.predict(X_test)
    return accuracy_score(y_test,y_pred)

In [None]:
study_xgb = optuna.create_study(direction='maximize')
optuna.logging.set_verbosity(optuna.logging.WARNING)
study_xgb.optimize(objective_xg, n_trials=50,show_progress_bar=True)

In [None]:
# Print the best parameters;

print('Best parameters', study_xgb.best_params)

In [None]:
xgb = XGBClassifier(**study_xgb.best_params)
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)

print('Accuracy: ', accuracy_score(y_test, y_pred))

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(xgb,X_test, y_test,display_labels=("False", "True"),cmap="RdPu");

In [None]:
import shap 
explainer = shap.TreeExplainer(xgb)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)

<a id = "16"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Voting and Stacking Classifier✨</p>

In [None]:
from sklearn.ensemble import VotingClassifier
voting = VotingClassifier(estimators=[
                                      ('lgbm', lgb), 
                                      ('xgb', xgb)], voting='soft')
voting.fit(X_train,y_train)
voting_pred = voting.predict(X_test)

print('Accuracy: ', accuracy_score(y_test, voting_pred))

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(voting,X_test, y_test,display_labels=("False", "True"),cmap="RdPu");

In [None]:
from sklearn.ensemble import StackingClassifier
stk = StackingClassifier(estimators=[
                                      ('lgbm', lgb), 
                                      ('xgb', xgb)])
stk.fit(X_train,y_train)
stk_pred = stk.predict(X_test)

print('Accuracy: ', accuracy_score(y_test, stk_pred))

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(stk,X_test, y_test,display_labels=("False", "True"),cmap="RdPu");

<a id="17"></a>
<h1 style="border-radius: 10px; border: 2px solid #6B8E23; background-color: #F5F5DC; font-family: 'Pacifico', cursive; font-size: 200%; text-align: center; border-radius: 15px 50px; padding: 15px; box-shadow: 5px 5px 5px #556B2F; color: #556B2F;">🏅Prediction🏅</h1>

In [None]:
sub["smoking"]=voting.predict_proba(test)[:, 1]
sub.to_csv('submission.csv',index=False)
sub