## Obesity Dataset

This dataset includes data for estimating obesity levels in individuals from Mexico, Peru and Colombia, based on their eating habits and physical condition. It contains 17 attributes and 2111 records. Each record is labeled with the class variable NObesity (Obesity Level; the column is named `NObeyesdad` in the CSV file), which allows classification into Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III. 77% of the data was generated synthetically using the Weka tool and the SMOTE filter; the remaining 23% was collected directly from users through a web platform.

The attributes are:

* What is your gender? (Gender) = Female, Male
* What is your age? (Age) = Numeric value
* What is your height? (Height) = Numeric value in meters
* What is your weight? (Weight) = Numeric value in kilograms
* Has a family member suffered or suffers from overweight? (family_history_with_overweight) = Yes, No
* Do you eat high caloric food frequently? (FAVC) = Yes, No
* Do you usually eat vegetables in your meals? (FCVC) = 1-3, by how usual it is
* How many main meals do you have daily? (NCP) = Between 1 and 2, 3, more than 4
* Do you eat any food between meals? (CAEC) = No, Sometimes, Frequently, Always
* Do you smoke? (SMOKE) = Yes, No
* How much water do you drink daily? (CH2O) = Less than a liter, between 1 and 2 L, more than 2 L
* Do you monitor the calories you eat daily? (SCC) = Yes, No
* How often do you have physical activity? (FAF) = I do not have, 1 or 2 days, 2 or 4 days, 4 or 5 days
* How much time do you use technological devices? (TUE) = 0-2 hours, 3-5 hours, more than 5 hours
* How often do you drink alcohol? (CALC) = I don't drink, Sometimes, Frequently, Always
* Which transportation do you usually use? (MTRANS) = Automobile, Motorbike, Bike, Public Transportation, Walking

* Associated task: regression, classification, clustering
      

## 1. Importing Libraries

In [None]:
# Default
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import plotly.express as px
import plotly 
import plotly.graph_objects as go
import seaborn as sns

# Data preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

# Tuning parameter
from sklearn.model_selection import GridSearchCV

# Model
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from keras.optimizers import SGD
from keras.optimizers import Adam
from sklearn.neural_network import MLPClassifier

# result
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score


#retina
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('retina')

## 2. Importing dataset

In [None]:
df = pd.read_csv("../input/obesity-levels/ObesityDataSet_raw_and_data_sinthetic.csv")
# rename the label column from 'NObeyesdad' to 'result'
df = df.rename(columns={'NObeyesdad': 'result'})
df.head()

In [None]:
print('The dataset consists of',df.shape[0],'observations')
print('The dataset consists of',df.shape[1],'columns')

In [None]:
df.info()

* The data includes both numerical and categorical columns
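
To list which columns are numerical and which are categorical, `select_dtypes` can be used. This is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration; only the column names come from the dataset):

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the real dataset
toy = pd.DataFrame({
    "Gender": ["Female", "Male"],
    "Age": [21.0, 23.0],
    "MTRANS": ["Walking", "Bike"],
    "Weight": [64.0, 80.0],
})

# Columns holding numbers
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else is treated as categorical here
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

On the real data the same two calls would be applied to `df`.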


In [None]:
df.isnull().sum()

* There are no missing values

In [None]:
df.describe()

## 3. EDA

Overall distribution of the result

In [None]:
name = df['result'].value_counts().index
num = df['result'].value_counts().values

# names and values are already aggregated, so the raw dataframe is not passed in
fig = px.pie(names=name,values=num
             ,title='Pie chart of the overall result',width=800,height=600)
fig.update_traces(textposition='inside',textinfo='label+percent')
fig.show()

plt.figure(figsize=(12,7))
sns.countplot(x='result',data=df,order=['Insufficient_Weight','Normal_Weight','Overweight_Level_I',
'Overweight_Level_II','Obesity_Type_I','Obesity_Type_II','Obesity_Type_III'])
plt.xlabel(None)
plt.ylabel('Count',fontsize=12)
plt.title('Results of each weight',fontsize=15)
plt.show()

* The label of this dataset contains 7 weight types: Insufficient weight, Normal weight, Overweight level I, Overweight level II, Obesity type I, Obesity type II, and Obesity type III. The classes are spread fairly evenly (each 12%-16% of the data), so the dataset is only slightly imbalanced
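
The class shares read off the pie chart can also be computed directly. The sketch below uses a hypothetical toy label series (`labels` is made up, not the real data) to show the idea:

```python
import pandas as pd

# Hypothetical toy labels standing in for df['result']
labels = pd.Series(
    ["Normal_Weight"] * 3 + ["Obesity_Type_I"] * 4 + ["Insufficient_Weight"] * 3,
    name="result",
)

# Share of each class as a percentage of all rows
share = labels.value_counts(normalize=True).mul(100).round(1)
print(share)

# Ratio between the largest and the smallest class; 1.0 means perfectly balanced
imbalance_ratio = share.max() / share.min()
print(f"largest / smallest class: {imbalance_ratio:.2f}")
```

A ratio close to 1 supports treating the classes as roughly balanced.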

Results and Gender

In [None]:
plt.figure(figsize=(15,8))
sns.countplot(x='result',data=df,hue='Gender',order=['Insufficient_Weight','Normal_Weight','Overweight_Level_I',
'Overweight_Level_II','Obesity_Type_I','Obesity_Type_II','Obesity_Type_III'])
plt.ylabel('Count',fontsize=12)
plt.xlabel(None)
plt.title('The result of weight Vs Gender',fontsize=15)
plt.show()

* When the obesity levels are split by gender, some levels are heavily imbalanced (Obesity type II and Obesity type III)
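
The gender split per obesity level can be put into numbers with `pd.crosstab`. This is only a sketch on hypothetical toy rows (`toy` is made up for illustration):

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataframe
toy = pd.DataFrame({
    "result": ["Obesity_Type_II", "Obesity_Type_II", "Obesity_Type_II",
               "Obesity_Type_III", "Obesity_Type_III",
               "Normal_Weight", "Normal_Weight"],
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Male", "Female"],
})

# Number of rows for every (result, Gender) pair
counts = pd.crosstab(toy["result"], toy["Gender"])
# Same table as row-wise shares, so each row sums to 1
shares = pd.crosstab(toy["result"], toy["Gender"], normalize="index")

print(counts)
print(shares.round(2))
```

A row whose shares are far from 0.5 / 0.5 is one where the level is dominated by a single gender.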

Results and Age

In [None]:
plt.figure(figsize=(15,8))
sns.barplot(x='result',y='Age',data=df,order=['Insufficient_Weight','Normal_Weight','Overweight_Level_I',
'Overweight_Level_II','Obesity_Type_I','Obesity_Type_II','Obesity_Type_III'])
plt.title('Result Vs Age',fontsize=15)
plt.ylabel('Age (years)',fontsize=12)
plt.xlabel(None)
plt.show()

* Insufficient weight occurs at a young age (19-21)
* The obesity level increases with age, except for Obesity type III
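
By default `sns.barplot` draws the mean of `y` for each category, so the bar heights can be reproduced with a `groupby`. A small sketch on hypothetical toy values (`toy` is made up):

```python
import pandas as pd

# Hypothetical toy rows; the real notebook uses df with 'result' and 'Age'
toy = pd.DataFrame({
    "result": ["Insufficient_Weight", "Insufficient_Weight",
               "Normal_Weight", "Normal_Weight"],
    "Age": [19.0, 21.0, 22.0, 24.0],
})

# Keep the same left-to-right order as in the plots
order = ["Insufficient_Weight", "Normal_Weight"]

# Mean age per class, which is the quantity the bar heights show
mean_age = toy.groupby("result")["Age"].mean().reindex(order)
print(mean_age)
```

Printing the table next to the plot makes it easier to quote exact ages in the notes.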

Result and weight

In [None]:
plt.figure(figsize=(13,8))
sns.barplot(x='result',y='Weight',data=df,
            order=['Insufficient_Weight','Normal_Weight','Overweight_Level_I',
'Overweight_Level_II','Obesity_Type_I','Obesity_Type_II','Obesity_Type_III'])
plt.xlabel(None)
plt.ylabel('Weight (kg)',fontsize=12)
plt.title('Result Vs Weight',fontsize=15)
plt.show()

* This makes sense

Has a family member suffered or suffers from overweight?

In [None]:
plt.figure(figsize=(15,8))
sns.countplot(x='result',data=df,hue='family_history_with_overweight',
            order=['Insufficient_Weight','Normal_Weight','Overweight_Level_I',
'Overweight_Level_II','Obesity_Type_I','Obesity_Type_II','Obesity_Type_III'])
plt.xlabel(None)
plt.ylabel('Count',fontsize=12)
plt.title('Result Vs overweight family',fontsize=15)
plt.show()


* A family history of overweight appears to affect the obesity level of the sample

Do you eat high caloric food frequently?

In [None]:
plt.figure(figsize=(15,8))
sns.countplot(x='result',data=df,hue='FAVC',
            order=['Insufficient_Weight','Normal_Weight','Overweight_Level_I',
'Overweight_Level_II','Obesity_Type_I','Obesity_Type_II','Obesity_Type_III'])
plt.xlabel(None)
plt.ylabel('Count',fontsize=12)
plt.title('Result Vs Eat high caloric food',fontsize=15)
plt.show()

* This makes sense

Do you usually eat vegetables in your meals?

In [None]:
plt.figure(figsize=(15,8))
sns.barplot(x='result',y='FCVC',data=df,
            order=['Insufficient_Weight','Normal_Weight','Overweight_Level_I',
'Overweight_Level_II','Obesity_Type_I','Obesity_Type_II','Obesity_Type_III'])
plt.xlabel(None)
plt.ylabel('Frequency of vegetables (1-3)',fontsize=12)
plt.title('Result Vs Vegetables in meals',fontsize=15)
plt.show()

How many main meals do you have daily?

In [None]:
plt.figure(figsize=(15,8))
# order (not hue_order) controls the x-axis order when no hue is set
sns.barplot(x='result',y='NCP',data=df,order=['Insufficient_Weight','Normal_Weight','Overweight_Level_I',
'Overweight_Level_II','Obesity_Type_I','Obesity_Type_II','Obesity_Type_III'])
plt.xlabel(None)
plt.ylabel('Amount of meals',fontsize=12)
plt.title('Result Vs Number of main meals per day',fontsize=15)
plt.show()

Do you eat any food between meals?

In [None]:
plt.figure(figsize=(15,8))
sns.countplot(x='result',data=df,hue='CAEC',
            order=['Insufficient_Weight','Normal_Weight','Overweight_Level_I',
'Overweight_Level_II','Obesity_Type_I','Obesity_Type_II','Obesity_Type_III'])
plt.xlabel(None)
plt.ylabel('Count',fontsize=12)
plt.title('Result Vs Any food between meals',fontsize=15)
plt.show()

Do you smoke? 

In [None]:
plt.figure(figsize=(15,8))
sns.countplot(x='result',data=df,hue='SMOKE',
            order=['Insufficient_Weight','Normal_Weight','Overweight_Level_I',
'Overweight_Level_II','Obesity_Type_I','Obesity_Type_II','Obesity_Type_III'])
plt.xlabel(None)
plt.ylabel('Count',fontsize=12)
plt.title('Result Vs Smoking',fontsize=15)
plt.show()

How much water do you drink daily?

In [None]:
plt.figure(figsize=(15,8))
# order (not hue_order) controls the x-axis order when no hue is set
sns.barplot(x='result',y='CH2O',data=df,order=['Insufficient_Weight','Normal_Weight','Overweight_Level_I',
'Overweight_Level_II','Obesity_Type_I','Obesity_Type_II','Obesity_Type_III'])
plt.xlabel(None)
plt.ylabel('Water (L)',fontsize=12)
plt.title('Result Vs drink water (L)',fontsize=15)
plt.show()

Do you monitor the calories you eat daily?

In [None]:
plt.figure(figsize=(15,8))
sns.countplot(x='result',data=df,hue='SCC',
            order=['Insufficient_Weight','Normal_Weight','Overweight_Level_I',
'Overweight_Level_II','Obesity_Type_I','Obesity_Type_II','Obesity_Type_III'])
plt.xlabel(None)
plt.ylabel('Count',fontsize=12)
plt.title('Result Vs Monitor the calories',fontsize=15)
plt.show()

How often do you have physical activity?

In [None]:
plt.figure(figsize=(15,8))
# order (not hue_order) controls the x-axis order when no hue is set
sns.barplot(x='result',y='FAF',data=df,order=['Insufficient_Weight','Normal_Weight','Overweight_Level_I',
'Overweight_Level_II','Obesity_Type_I','Obesity_Type_II','Obesity_Type_III'])
plt.ylabel('Amount of days',fontsize=12)
plt.xlabel(None)
plt.title('Result Vs Weekly working out \n 0-5 day',fontsize=15)
plt.show()


How much time do you use technological devices?

In [None]:
plt.figure(figsize=(15,8))
# order (not hue_order) controls the x-axis order when no hue is set
sns.barplot(x='result',y='TUE',data=df,order=['Insufficient_Weight','Normal_Weight','Overweight_Level_I',
'Overweight_Level_II','Obesity_Type_I','Obesity_Type_II','Obesity_Type_III'])
plt.xlabel(None)
plt.ylabel('Hours',fontsize=12)
plt.title('Result Vs Time using technological devices',fontsize=15)
plt.show()



How often do you drink alcohol?

In [None]:
plt.figure(figsize=(15,8))
sns.countplot(x='result',data=df,hue='CALC',
            order=['Insufficient_Weight','Normal_Weight','Overweight_Level_I',
'Overweight_Level_II','Obesity_Type_I','Obesity_Type_II','Obesity_Type_III'])
plt.xlabel(None)
plt.ylabel('Count',fontsize=12)
plt.title('Result Vs How often drink alcohol',fontsize=15)
plt.show()

Which transportation do you usually use?

In [None]:
plt.figure(figsize=(15,8))
sns.countplot(x='result',data=df,hue='MTRANS',
            order=['Insufficient_Weight','Normal_Weight','Overweight_Level_I',
'Overweight_Level_II','Obesity_Type_I','Obesity_Type_II','Obesity_Type_III'])
plt.xlabel(None)
plt.ylabel('Count',fontsize=12)
plt.title('Result Vs Transportation',fontsize=15)
plt.show()

Heatmap (Correlation)

In [None]:
plt.figure(figsize=(12,10))
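# Correlate only the numeric columns; the categorical ones are still strings here
sns.heatmap(df.select_dtypes(include='number').corr(),annot=True,square=True,center=0,vmin=-1,vmax=1,
            cmap='BrBG',linewidths=5)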

## 4. Data Preprocessing

The dataset contains both numerical and categorical data, so I split it into features (the attributes) and the answer (the label), then encoded the object-type columns.

In [None]:
data = df.copy()

feature = data.drop('result',axis=1)
answer = data['result'].values.reshape(-1)

In [None]:
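# Encode each categorical (object-typed) feature column as integer codes
for column_name in feature.columns:
    if feature[column_name].dtype == object:
        feature[column_name] = LabelEncoder().fit_transform(feature[column_name])

# Keep this encoder: le.classes_ is used later to label the confusion matrices
le = LabelEncoder()
answer = le.fit_transform(answer)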

In [None]:
feature.head()
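
One thing to keep in mind with the encoding above is that the integer codes cannot be mapped back to category names once the encoder is thrown away. A possible variation, sketched here on hypothetical toy data (`toy`, `encoded` and `encoders` are names made up for this example), keeps one fitted encoder per column. Note that scikit-learn documents `LabelEncoder` as meant for target values; it is used on features here only to stay close to the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows; only the column names mirror the real dataset
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "CAEC": ["Sometimes", "no", "Always"],
    "Age": [21.0, 23.0, 30.0],
})

encoders = {}          # column name -> fitted LabelEncoder
encoded = toy.copy()
for col in encoded.select_dtypes(exclude="number").columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(encoded[col])
    encoders[col] = enc

print(encoded)
# The stored encoder can translate codes back into the original categories
print(list(encoders["CAEC"].inverse_transform([0, 1, 2])))
```

The codes follow the sorted order of the category names, so they carry no meaning such as "more often" or "less often".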

## 5. Modeling and tuning hyperparameter

This experiment is performed on several models:

* DecisionTree
* RandomForest
* Neural network using Multilayer perceptron
* Neural network using Keras

GridSearch is used to tune the hyperparameters.
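
For the models that need scaled inputs, one option is to put the scaler and the classifier into a `Pipeline` and run the grid search on the pipeline, so the scaler is refitted on the training part of every fold. This is a minimal sketch under assumptions: the data comes from `make_classification` (synthetic, not the obesity data) and the grid is deliberately tiny.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the obesity dataset)
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Scaling is part of the estimator, so each CV fold fits its own scaler
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", MLPClassifier(max_iter=300, random_state=42)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {
    "clf__hidden_layer_sizes": [(16,), (32,)],
    "clf__activation": ["relu", "tanh"],
}

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_train, y_train)

print(grid.best_params_)
test_accuracy = grid.score(X_test, y_test)
print("test accuracy:", round(test_accuracy, 3))
```

After fitting, `grid` itself behaves like the best pipeline, so `grid.predict(X_test)` scales and predicts in one call.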
 

**Decision Tree**

In [None]:
xtrain,xtest,ytrain,ytest = train_test_split(feature,answer,test_size=0.3,random_state=42)

param_grid = {'criterion':['gini', 'entropy'],
              'splitter':['best','random'],
              'max_depth':list(range(1,50)),
              }

grid = GridSearchCV(DecisionTreeClassifier(random_state=42),param_grid,cv=5)
grid.fit(xtrain,ytrain)
print(grid.best_params_)

In [None]:
clf = DecisionTreeClassifier(criterion='entropy',max_depth=9,splitter='best')
clf.fit(xtrain,ytrain)
y_pred = clf.predict(xtest)
y_prob = clf.predict_proba(xtest)

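# Map each class name to its encoded integer label
mapping = dict(zip(le.classes_, range(len(le.classes_))))

cm = confusion_matrix(ytest,y_pred)
# Label rows (actual) and columns (predicted) with the class names
cm_df = pd.DataFrame(cm,index=le.classes_,columns=le.classes_)

plt.figure(figsize=(8,6))
sns.heatmap(cm_df, annot=True, fmt='d')  # fmt='d' prints counts as plain integers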
plt.title('The accuracy Decision Tree (Best param): {0:.3f}'.format(accuracy_score(ytest,y_pred)),fontsize=13)
plt.ylabel('Actual label')
plt.xlabel('Predicted label',)
plt.show()

print('Accuracy score: ', accuracy_score(ytest,y_pred))
print('Precision score: ',precision_score(ytest,y_pred,average='macro'))
print('Recall: ', recall_score(ytest,y_pred,average='macro'))
print('F1 score: ',f1_score(ytest,y_pred,average='macro'))
print('ROC-AUC score',roc_auc_score(ytest, y_prob, multi_class="ovo",
                                  average="macro"))
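
The scores above use `average='macro'`, which means the metric is computed per class and then averaged with equal weight. The sketch below shows this on hypothetical toy labels and probabilities (all values are made up for illustration):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Hypothetical toy example with 3 classes
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
# One row of class probabilities per sample; each row sums to 1
y_prob = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
    [0.5, 0.2, 0.3],
])

# Precision for each class separately
per_class_precision = precision_score(y_true, y_pred, average=None)
# Macro precision is the unweighted mean of the per-class values
macro_precision = precision_score(y_true, y_pred, average="macro")
macro_recall = recall_score(y_true, y_pred, average="macro")

# One-vs-one ROC AUC needs the probabilities, not the predicted labels
auc_ovo = roc_auc_score(y_true, y_prob, multi_class="ovo", average="macro")

print(per_class_precision, macro_precision, macro_recall, auc_ovo)
```

Because every class counts equally, a weak result on one obesity level lowers the macro score no matter how many samples that level has.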



**Random Forest**

In [None]:
xtrain,xtest,ytrain,ytest = train_test_split(feature,answer,test_size=0.3,random_state=42)

param_grid = {'n_estimators':list(range(8,30)),
              'criterion':['gini','entropy'],
              'max_depth':list(range(1,50))
              }

grid = GridSearchCV(RandomForestClassifier(random_state=42),param_grid,cv=5)
grid.fit(xtrain,ytrain)
print(grid.best_params_)

In [None]:
clf = RandomForestClassifier(criterion='entropy',max_depth=10,n_estimators=29)
clf.fit(xtrain,ytrain)
y_pred = clf.predict(xtest)
y_prob = clf.predict_proba(xtest)

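# Map each class name to its encoded integer label
mapping = dict(zip(le.classes_, range(len(le.classes_))))
print(mapping)

cm = confusion_matrix(ytest,y_pred)
# Label rows (actual) and columns (predicted) with the class names
cm_df = pd.DataFrame(cm,index=le.classes_,columns=le.classes_)

plt.figure(figsize=(8,6))
sns.heatmap(cm_df, annot=True, fmt='d')  # fmt='d' prints counts as plain integers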
plt.title('The accuracy of Random Forest (Best param): {0:.3f}'.format(accuracy_score(ytest,y_pred)),fontsize=13)
plt.ylabel('Actual label')
plt.xlabel('Predicted label',)
plt.show()

print('Accuracy score: ', accuracy_score(ytest,y_pred))
print('Precision score: ',precision_score(ytest,y_pred,average='macro'))
print('Recall: ', recall_score(ytest,y_pred,average='macro'))
print('F1 score: ',f1_score(ytest,y_pred,average='macro'))
print('ROC-AUC score',roc_auc_score(ytest, y_prob, multi_class="ovo",
                                  average="macro"))



**Neural network: Multilayer perceptron**

In [None]:
xtrain,xtest,ytrain,ytest = train_test_split(feature,answer,test_size=0.3,random_state=42)

scaler = StandardScaler()
xtrain = scaler.fit_transform(xtrain)
# Reuse the scaler fitted on the training set; do not refit on the test set
xtest = scaler.transform(xtest)

param_grid = {'hidden_layer_sizes': [(16,),(32,),(48,)],
    'activation': ['logistic', 'relu',  'tanh'],
    'solver': ['sgd', 'adam'],
    'learning_rate': ['constant','adaptive'],
    'learning_rate_init':[0.001,0.1,0.2]
}

grid = GridSearchCV(MLPClassifier(random_state=42),param_grid=param_grid,cv=5)
grid.fit(xtrain,ytrain)
print(grid.best_params_)

In [None]:
clf = MLPClassifier(random_state=42, activation='tanh',hidden_layer_sizes=(16,),learning_rate='constant',learning_rate_init=0.2,solver='sgd')
clf.fit(xtrain,ytrain)
y_pred = clf.predict(xtest)
y_prob = clf.predict_proba(xtest)

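# Map each class name to its encoded integer label
mapping = dict(zip(le.classes_, range(len(le.classes_))))

cm = confusion_matrix(ytest,y_pred)
# Label rows (actual) and columns (predicted) with the class names
cm_df = pd.DataFrame(cm,index=le.classes_,columns=le.classes_)

plt.figure(figsize=(8,6))
sns.heatmap(cm_df, annot=True, fmt='d')  # fmt='d' prints counts as plain integers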
plt.title('The accuracy of Neural network (MLP) (Best param): {0:.3f}'.format(accuracy_score(ytest,y_pred)),fontsize=13)
plt.ylabel('Actual label')
plt.xlabel('Predicted label',)
plt.show()

print('Accuracy score: ', accuracy_score(ytest,y_pred))
print('Precision score: ',precision_score(ytest,y_pred,average='macro'))
print('Recall: ', recall_score(ytest,y_pred,average='macro'))
print('F1 score: ',f1_score(ytest,y_pred,average='macro'))
print('ROC-AUC score',roc_auc_score(ytest, y_prob, multi_class="ovo",
                                  average="macro"))

**Neural network: Keras**

In [None]:
xtrain,xtest,ytrain,ytest = train_test_split(feature,answer,test_size=0.3,random_state=42)

scaler = StandardScaler()
xtrain = scaler.fit_transform(xtrain)
# Reuse the scaler fitted on the training set; do not refit on the test set
xtest = scaler.transform(xtest)
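
A note on scaling: the test set should be transformed with the mean and standard deviation learned from the training set. Fitting a scaler on the test set gives it a different scale from the one the model was trained on. A tiny sketch with made-up numbers shows the difference:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training mean and std

# Refitting on the test set centres it on its own mean instead
test_refit = StandardScaler().fit_transform(test)

print("training mean:", scaler.mean_[0])
print("transform only:", test_scaled.ravel())
print("refit on test :", test_refit.ravel())
```

With `transform`, the test value 2.0 maps to 0 because it equals the training mean; after refitting, the same value maps to -1.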

In [None]:

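def create_model(optimizer='adam',init_mode='uniform',activation='relu',learn_rate=0.01):
    # Build the optimizer object so that learn_rate is actually applied;
    # passing only the optimizer name would silently use its default rate.
    # Assumes a Keras version whose optimizers accept `learning_rate`.
    opt = SGD(learning_rate=learn_rate) if optimizer == 'SGD' else Adam(learning_rate=learn_rate)
    model = Sequential()
    model.add(Dense(32, input_dim=16,kernel_initializer=init_mode, activation=activation))
    model.add(Dense(7,kernel_initializer=init_mode, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
    return model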

model = KerasClassifier(build_fn=create_model,epochs=200,batch_size=10, verbose=0)

optimizer = ['SGD', 'Adam']
learn_rate = [0.001, 0.01, 0.1, 0.2]
init_mode = ['uniform', 'normal', 'zero']
activation = ['relu', 'tanh', 'sigmoid']

param_grid = dict(optimizer=optimizer,learn_rate=learn_rate,init_mode=init_mode,activation=activation)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=5)
grid_result = grid.fit(xtrain, ytrain)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

In [None]:
from keras.utils import to_categorical
ytrain = to_categorical(ytrain)


# The optimizer is given by name ('adam'), so Adam runs with its default learning rate
def create_model():
    model = Sequential()
    model.add(Dense(32, input_dim=16,kernel_initializer='uniform', activation='tanh'))
    model.add(Dense(7, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# create model
model = KerasClassifier(build_fn=create_model, verbose=0,epochs=200,batch_size=10)
# train on the scaled training set (ytrain is one-hot encoded above)
model.fit(xtrain,ytrain)

y_pred = model.predict(xtest)
y_prob = model.predict_proba(xtest)

In [None]:
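cm = confusion_matrix(ytest,y_pred)
# Transform to df for easier plotting, labelling rows (actual) and columns (predicted)
cm_df = pd.DataFrame(cm,index=le.classes_,columns=le.classes_)

plt.figure(figsize=(8,6))
sns.heatmap(cm_df, annot=True, fmt='d')  # fmt='d' prints counts as plain integers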
plt.title('The accuracy of Neural Network (Keras) (Best param): {0:.3f}'.format(accuracy_score(ytest,y_pred)),fontsize=13)
plt.ylabel('Actual label')
plt.xlabel('Predicted label',)
plt.show()

print('Accuracy score: ', accuracy_score(ytest,y_pred))
print('Precision score: ',precision_score(ytest,y_pred,average='macro'))
print('Recall: ', recall_score(ytest,y_pred,average='macro'))
print('F1 score: ',f1_score(ytest,y_pred,average='macro'))
print('ROC-AUC score',roc_auc_score(ytest, y_prob, multi_class="ovo",
                                  average="macro"))

## 6. Summary

In [None]:
Accuracy = [0.9495268138801262,0.9290220820189274,0.9495268138801262,0.9542586750788643]
Precision = [0.9496370646467616,0.9289982846109511,0.9479671732410866,0.9527158255497573]
Recall =[0.949554506796094,0.9282477859822877,0.9490035908012435,0.9539127392113407]
F1Score = [0.9484366746957662,0.9282682517367851,0.9482428382213861,0.9529216047587254]
Roc = [0.974036944270467,0.994470045662756,0.9975705078538368,0.9976146480444302]

result = pd.DataFrame(index=['DecisionTree','RandomForest','MLP','Keras'])
result['Accuracy'] = Accuracy
result['Precision'] = Precision
result['Recall'] = Recall
result['F1-Score'] = F1Score
result['ROC-AUC score'] = Roc

print(result)
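
To read the table faster, the best model per metric can be looked up with `idxmax`. The sketch below rebuilds part of the table from the numbers reported above (three of the five metrics, to keep it short):

```python
import pandas as pd

# Scores copied from the summary cell above
result = pd.DataFrame(
    {
        "Accuracy": [0.9495268138801262, 0.9290220820189274,
                     0.9495268138801262, 0.9542586750788643],
        "F1-Score": [0.9484366746957662, 0.9282682517367851,
                     0.9482428382213861, 0.9529216047587254],
        "ROC-AUC score": [0.974036944270467, 0.994470045662756,
                          0.9975705078538368, 0.9976146480444302],
    },
    index=["DecisionTree", "RandomForest", "MLP", "Keras"],
)

# Name of the best model for every metric column
best_per_metric = result.idxmax()
print(best_per_metric)

# Models ranked by accuracy, best first
ranked = result.sort_values("Accuracy", ascending=False)
print(ranked.round(4))
```

On these numbers the Keras network comes out on top for each of the three metrics, although the gap to the Decision Tree and the MLP is small.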

As I said, I am still a beginner in this field, so if you spot anything wrong or have suggestions on what to improve, that would be a great help to me. I appreciate it, thank you.