<a href="https://www.kaggle.com/code/najeebz/obesity-risk-multi-class-mlp-classifier-detailed?scriptVersionId=163474598" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# [Multi-Class Prediction of Obesity Risk](https://www.kaggle.com/competitions/playground-series-s4e2)
### Playground Series - Season 4, Episode 2

_______________________________________________________________________ 
# Author Details:
- Name: Najeeb Haider Zaidi
- Email: zaidi.nh@gmail.com
- Profiles: [Github](https://github.com/snajeebz)  [LinkedIn](https://www.linkedin.com/in/najeebz) [Kaggle](https://www.kaggle.com/najeebz)
- Prepared for the submission to the competition.
________________________________________________________________________
# Attributions:


[Walter Reade, Ashley Chow. (2024). Multi-Class Prediction of Obesity Risk. Kaggle.](https://www.kaggle.com/competitions/playground-series-s4e2)
________________________________________________________________________

This notebook is a competition submission, so it covers the whole process, from loading the data to writing the submission CSV file in the required format.
__________________________________________________________________________
# Code Execution and Versioning Repository: 
- [Execute the notebook in Kaggle](https://www.kaggle.com/najeebz/obesity-risk-multi-class-mlp-classifier-detailed)
- [Github Repository](https://github.com/snajeebz/playground)

____________________________________________________________________
# Citation:

Najeeb Zaidi. (2024). Multi-Class Prediction of Obesity Risk. Competition Submission. Kaggle. https://www.kaggle.com/najeebz/obesity-risk-multi-class-mlp-classifier-detailed

___________________________________________________________________
# Other Contributions to this Competition:
1. [Obesity Risk Features Generation XGBoost](https://www.kaggle.com/code/najeebz/obesity-risk-features-generation-xgboost)
1. [MultiClass Prediction LGBM Simple and Easy](https://www.kaggle.com/code/najeebz/multiclass-prediction-lgbm-simple-and-easy)
1. [Obesity Risk Viz, EDA, Auto Visualization tools](https://www.kaggle.com/code/najeebz/obesity-risk-viz-eda-auto-visualization-tools)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
from warnings import filterwarnings;
filterwarnings('ignore')
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df=pd.read_csv('/kaggle/input/playground-series-s4e2/train.csv')
test=pd.read_csv('/kaggle/input/playground-series-s4e2/test.csv')


In [None]:
df.head()

In [None]:
test.head()

In [None]:
df.isnull().sum()


In [None]:
test.isnull().sum()


In [None]:
df.columns

# Dataset Description:

The dataset for this competition (both train and test) was generated from a deep learning model trained on the Obesity or CVD risk dataset. Feature distributions are close to, but not exactly the same as, the original. The original dataset can be used both to explore the differences and to see whether incorporating it in training improves model performance.

Note: This dataset is particularly well suited for visualizations, clustering, and general EDA.

Files:
- train.csv - the training dataset; NObeyesdad is the categorical target
- test.csv - the test dataset; your objective is to predict the class of NObeyesdad for each row
- sample_submission.csv - a sample submission file in the correct format

# Visualization

In [None]:
cat_cols=df[['Gender','family_history_with_overweight','FAVC','CAEC','SMOKE','SCC','CALC','MTRANS','NObeyesdad']]
num_cols=df[['Age','Height','Weight','FCVC','NCP','CH2O','FAF','TUE']]


# Individual Columns

## Categorical Data Value Counts Plots

In [None]:
for col in cat_cols:
    plt.figure(figsize=[15,7])
    sns.countplot(data=df, x=col).set(title= col+' Value Distribution')  # keyword form works across seaborn versions
    plt.show()

## Numerical Columns Histograms with Mean and Median

In [None]:
for col in num_cols:
    plt.figure(figsize=[10,7])
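    # distplot is deprecated in seaborn; histplot with stat='density' gives a comparable plot
    sns.histplot(df[col], kde=True, stat='density').set(title= col+' Histogram')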
    plt.axvline(df[col].mean(),color='r', label='Mean')
    plt.axvline(df[col].median(),color='y', linestyle='--',label='Median')
    plt.legend()
    plt.show()


# Categorical Columns Data Hierarchy 

In [None]:
import plotly.express as px
fig = px.sunburst(
    df,
    path=['NObeyesdad','Gender','MTRANS','family_history_with_overweight','SMOKE'], 
    color='Gender',color_discrete_map={'Male':'gold', 'Female':'darkblue'},
    width=1200, height=1200
)
fig.show()



## Analysis Note:
According to the dataset, Obesity_Type_II and Obesity_Type_III are each specific to a single gender.
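A cross-tabulation is one quick way to check a note like this numerically. The sketch below uses a handful of made-up rows (the `toy` frame is hypothetical, not taken from train.csv) just to show the call; on the real data the same call would be applied to `df`.

```python
import pandas as pd

# Made-up rows purely to demonstrate the call; not taken from train.csv.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female", "Male"],
    "NObeyesdad": ["Obesity_Type_II", "Normal_Weight", "Obesity_Type_III",
                   "Obesity_Type_III", "Normal_Weight", "Obesity_Type_II"],
})

# Rows are target classes, columns are genders, cells are row counts.
counts = pd.crosstab(toy["NObeyesdad"], toy["Gender"])
print(counts)
```

A class that is specific to one gender shows a zero in the other gender's column.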

# Numerical Columns Gender wise Distribution in NObeyesdad

In [None]:
for col in num_cols:
    plt.figure(figsize=(15, 5))
    sns.lineplot(data=df, x='NObeyesdad', y=col, hue='Gender').set(title= col+' vs NObeyesdad')
    

In [None]:
df['NObeyesdad'].value_counts()

## Observations:
1. There are no null values.
1. The data is divided into seven classes. The column to predict is NObeyesdad.
1. Two classes of NObeyesdad are each specific to a single gender.
1. The class distribution is fairly balanced: every class has roughly 2,400 to 3,200 rows, except one class with 4,046 rows.


In [None]:
df.columns

# Features Generation

## Feature 1: BMI
Body mass index (BMI) is a value derived from the mass (weight) and height of a person. The BMI is defined as the body mass divided by the square of the body height, and is expressed in units of kg/m2, resulting from mass in kilograms (kg) and height in metres (m).

In the current dataset, BMI can be used to distinguish two of the seven classes (Normal_Weight and Insufficient_Weight).
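As a small worked example of the definition above, the sketch below computes BMI for three hypothetical people (the values in `sample` are invented for illustration, with Weight in kg and Height in metres).

```python
import pandas as pd

# Hypothetical rows for illustration; Weight in kg, Height in metres.
sample = pd.DataFrame({
    "Weight": [50.0, 70.0, 110.0],
    "Height": [1.60, 1.75, 1.70],
})

# BMI = weight divided by the square of height (kg/m^2)
sample["BMI"] = sample["Weight"] / sample["Height"] ** 2
print(sample.round(2))
```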

In [None]:
# BMI = weight (kg) divided by the square of height (m), as defined above
df['BMI']=df['Weight']/df['Height']**2
test['BMI']=test['Weight']/test['Height']**2


In [None]:
plt.figure(figsize=(15, 5))

sns.boxplot(data=df,x='NObeyesdad',y='BMI', hue='Gender')

In [None]:
describe_df=df['BMI'][df['NObeyesdad']=='Normal_Weight'].describe().reset_index()
describe_df.rename(columns={'BMI':'Normal_Weight'}, inplace=True)
describe_df['Underweight']=df['BMI'][df['NObeyesdad']=='Insufficient_Weight'].describe().reset_index().BMI
describe_df.drop(0).plot.bar(x='index')

In [None]:
sns.histplot(df['BMI'][df['NObeyesdad']=='Insufficient_Weight'],kde=True,stat='density').set(title= 'BMI Under Weight Histogram')
plt.axvline(df['BMI'][df['NObeyesdad']=='Insufficient_Weight'].mean(),color='r', label='Mean')
plt.axvline(df['BMI'][df['NObeyesdad']=='Insufficient_Weight'].median(),color='y', linestyle='--',label='Median')
plt.legend()
df['BMI'][df['NObeyesdad']=='Insufficient_Weight'].describe()

In [None]:
sns.histplot(df['BMI'][df['NObeyesdad']=='Normal_Weight'],kde=True,stat='density').set(title= 'BMI Normal Weight Histogram')
plt.axvline(df['BMI'][df['NObeyesdad']=='Normal_Weight'].mean(),color='r', label='Mean')
plt.axvline(df['BMI'][df['NObeyesdad']=='Normal_Weight'].median(),color='y', linestyle='--',label='Median')
plt.legend()
df['BMI'][df['NObeyesdad']=='Normal_Weight'].describe()

## Analysis:
BMI does not show a linear relationship with the obesity classification in this data and, contrary to the earlier assumption, there is a significant number of outliers. This suggests that there are unaccounted-for variables, linear or nonlinear. Two options follow:
1. Delete the outliers from the training dataset and train the model on the linearly related data only.
1. Keep the data intact and add three-cluster columns based on Height, Gender and Weight and/or Gender and BMI. Age can also be added to some combinations to identify nonlinear relations.

In this notebook we go with option two and create two K-means cluster features.

To create clusters that include Gender, we need to convert Gender into a numeric column. This modifies the dataset, so let's make a copy of the training data before moving further.

In [None]:
train_df=df.drop(columns='id')
test_df=test.drop(columns='id')

train_df.columns

## Now let's convert the labelled Categorical Columns to Numeric/Boolean

In [None]:
cat_cols.head()

In [None]:
train_df=pd.get_dummies(train_df,columns=['Gender','family_history_with_overweight','SMOKE','MTRANS','SCC','FAVC','CAEC'],dtype=int)
test_df=pd.get_dummies(test_df,columns=['Gender','family_history_with_overweight','SMOKE','MTRANS','SCC','FAVC','CAEC'],dtype=int)

In [None]:
test_df

## Labelling the NObeyesdad and CALC Column
The test dataset has one additional unique value in the CALC column, i.e. Always, so this column is encoded separately with an explicit mapping.
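The sketch below illustrates why an extra category matters. It uses two tiny made-up frames (`train_like` and `test_like` are hypothetical): one-hot encoding each frame on its own produces different column sets, while a shared ordinal mapping keeps a single aligned column.

```python
import pandas as pd

# Hypothetical values: 'Always' appears only in the test-like frame.
train_like = pd.DataFrame({"CALC": ["no", "Sometimes", "Frequently"]})
test_like = pd.DataFrame({"CALC": ["no", "Always", "Sometimes"]})

# One-hot encoding each frame independently yields different column sets.
train_cols = set(pd.get_dummies(train_like, columns=["CALC"]).columns)
test_cols = set(pd.get_dummies(test_like, columns=["CALC"]).columns)
print(test_cols - train_cols)  # columns the training frame never produced

# A shared ordinal mapping gives both frames one aligned numeric column.
calc_order = {"no": 0, "Sometimes": 1, "Frequently": 2, "Always": 3}
train_like["CALC_code"] = train_like["CALC"].map(calc_order)
test_like["CALC_code"] = test_like["CALC"].map(calc_order)
print(test_like)
```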

In [None]:
test['CALC'].unique()

In [None]:
train_df['NObeyesdad'] = df['NObeyesdad'].map({'Overweight_Level_II':0, 'Normal_Weight':1,'Insufficient_Weight':2,'Obesity_Type_III':3,'Obesity_Type_II':4,'Overweight_Level_I':5,'Obesity_Type_I':6})
train_df['CALC'] = df['CALC'].map({'no':0, 'Sometimes':1,'Frequently':2,'Always':3})
test_df['CALC'] = test['CALC'].map({'no':0, 'Sometimes':1,'Frequently':2,'Always':3})


In [None]:
train_df['NObeyesdad'].unique()

In [None]:
train_df.columns

In [None]:
test_df.columns

## Creating Clustered Features

In [None]:
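from sklearn.cluster import KMeans

def cluster(X):
    # Fit a 3-cluster K-means model and return the labels rescaled to 0.2, 0.4, 0.6
    kmeans = KMeans(n_clusters=3, random_state=0, n_init="auto")
    kmeans.fit(X)
    return (kmeans.labels_ + 1) / 5

# Note: K-means is fitted separately on train and test here, so a given cluster id
# is not guaranteed to describe the same group in both sets.
X=train_df[['BMI', 'Gender_Female', 'Gender_Male']]
train_df['Cluster-1']=(cluster(X)+1)/3  # second rescaling of the cluster values
X=train_df[['Gender_Female', 'Gender_Male','Age']]
train_df['Cluster-2']=(cluster(X)+1)/3  

X=test_df[['BMI', 'Gender_Female', 'Gender_Male']]
test_df['Cluster-1']=(cluster(X)+1)/3  
X=test_df[['Gender_Female', 'Gender_Male','Age']]
test_df['Cluster-2']=(cluster(X)+1)/3 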
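An alternative worth considering, sketched here under assumptions, is to fit K-means once on the training rows and reuse the fitted centroids for the test rows, so that a cluster id means the same thing in both sets. The frames below are synthetic stand-ins (`make_frame`, `train_like` and `test_like` are hypothetical helpers, not part of the notebook), and `n_init=10` is an arbitrary choice for the sketch.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def make_frame(n):
    # Synthetic stand-in for train_df / test_df (all values are made up)
    gender_male = rng.integers(0, 2, n)
    return pd.DataFrame({
        "BMI": rng.normal(28, 6, n),
        "Gender_Female": 1 - gender_male,
        "Gender_Male": gender_male,
    })

train_like, test_like = make_frame(200), make_frame(50)
cols = ["BMI", "Gender_Female", "Gender_Male"]

# Fit once on the training rows, then reuse the same centroids for the test rows
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10).fit(train_like[cols])
train_like["Cluster-1"] = kmeans.labels_
test_like["Cluster-1"] = kmeans.predict(test_like[cols])
print(test_like["Cluster-1"].value_counts())
```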

# Correlation between Features

In [None]:
corr = train_df.corr()
# plot the heatmap
plt.figure(figsize=(30,30))
s=sns.heatmap(corr,annot=True, cmap='crest')

# Creating Testing and Training Data

In [None]:
def scale(X):
    # Standardize every feature to zero mean and unit variance
    from sklearn import preprocessing
    scaler=preprocessing.StandardScaler()
    scaler.fit(X)
    return scaler.transform(X)

In [None]:
X=train_df.drop(columns='NObeyesdad')
y=train_df[['NObeyesdad']]
print(X.columns)
print(y.columns)
print(test_df.columns)

In [None]:

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(scale(X),y,train_size=0.99, random_state=42)
X.keys()
test_df=scale(test_df)
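Because the cell above standardizes the training and test features with separately fitted scalers, the two sets end up on slightly different scales. A common alternative, shown here as a minimal sketch on made-up arrays (`X_train_like` and `X_test_like` are hypothetical), is to fit one `StandardScaler` on the training features and apply it to both sets.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices standing in for the training and test features
X_train_like = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_like = np.array([[2.0, 40.0]])

# Learn the mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_train_like)

# Apply the same statistics to both sets
train_scaled = scaler.transform(X_train_like)
test_scaled = scaler.transform(X_test_like)
print(train_scaled.mean(axis=0), test_scaled)
```

With this approach a test value equal to the training mean maps to 0, whatever the rest of the test set looks like.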

# Creating Training Evaluation Function

In [None]:
def evaluate(y_test,y_pred):
    from sklearn.metrics import recall_score
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import confusion_matrix
    print("Accuracy: ",accuracy_score(y_test,y_pred)) 
    # Recall is the proportion of actual members of a class that are predicted correctly.
    # For example, if 100 rows truly belong to a class and 67 of them are predicted
    # correctly, recall for that class is 0.67. average='macro' takes the unweighted
    # mean of the per-class recalls.
    print("Recall Score: ", recall_score(y_test,y_pred, average='macro'))
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(10, 10))
    s=sns.heatmap(cm,annot=True, cmap='Reds')
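To make the macro-averaged recall concrete, here is a small worked example on invented labels (`y_true` and `y_hat` are hypothetical and unrelated to the competition data).

```python
from sklearn.metrics import recall_score

# Invented labels for three classes, two true rows per class
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 1, 1, 1, 2, 0]

# Per-class recall: class 0 -> 1/2, class 1 -> 2/2, class 2 -> 1/2
per_class = recall_score(y_true, y_hat, average=None)

# Macro recall is the unweighted mean of the per-class values: (0.5 + 1.0 + 0.5) / 3
macro = recall_score(y_true, y_hat, average="macro")
print(per_class, macro)
```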

# Training MLP Classifier

In [None]:
from sklearn.neural_network import MLPClassifier
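# Note: learning_rate applies only to the 'sgd' solver, and early_stopping and
# n_iter_no_change only to 'sgd' or 'adam'; with solver='lbfgs' they have no effect.
clf = MLPClassifier(solver='lbfgs', 
              max_iter =5000, 
              alpha=0.0001, 
              hidden_layer_sizes=(20,), 
              activation='relu',
              learning_rate='adaptive', 
              verbose=0,
              early_stopping=True,  # boolean; recent scikit-learn versions reject the integer 1
              n_iter_no_change=100)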
# 'activation': 'relu', 'alpha': 0.0001, 'early_stopping': 1, 'hidden_layer_sizes': 20, 'learning_rate': 'adaptive', 'max_iter': 5000, 'solver': 'lbfgs', 'verbose': 0
print ('Training the model')
clf.fit(X_train,y_train)
print(clf.score(X_train,y_train))
y_pred=clf.predict(X_test)
evaluate(y_test,y_pred)

# Predicting the results

In [None]:
test['NObeyesdad']=clf.predict(test_df)


In [None]:
test['NObeyesdad'].value_counts()

In [None]:
submission=test[['id','NObeyesdad']].copy()  # copy so later column assignments do not act on a view of test


In [None]:
submission['NObeyesdad'].unique()

# Creating the Submission File

In [None]:
submission['NObeyesdad'] = test['NObeyesdad'].map({0:'Overweight_Level_II', 1:'Normal_Weight',2:'Insufficient_Weight',3:'Obesity_Type_III',4:'Obesity_Type_II',5:'Overweight_Level_I',6:'Obesity_Type_I'})
submission['NObeyesdad'].value_counts()
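The decoding dictionary above has to mirror the encoding dictionary used for training exactly. One way to reduce the risk of the two drifting apart, sketched here with hypothetical names (`label_to_code`, `code_to_label`, `preds`), is to derive the inverse mapping from the forward one.

```python
import pandas as pd

# Forward mapping, copied from the encoding step earlier in the notebook
label_to_code = {
    "Overweight_Level_II": 0, "Normal_Weight": 1, "Insufficient_Weight": 2,
    "Obesity_Type_III": 3, "Obesity_Type_II": 4, "Overweight_Level_I": 5,
    "Obesity_Type_I": 6,
}

# Inverse mapping derived from the forward one, so both always agree
code_to_label = {code: label for label, code in label_to_code.items()}

# Made-up predictions standing in for clf.predict(...) output
preds = pd.Series([3, 0, 6])
labels = preds.map(code_to_label)
print(labels.tolist())
```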

In [None]:
submission.to_csv('submission.csv', index=False)

In [None]:
submission['NObeyesdad'].value_counts()