# Exploring and modelling Obesity Dataset

This Notebook explores the *Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico*[1] published on the University of California Irvine Machine Learning Repository ([link to the dataset](https://archive.ics.uci.edu/ml/datasets/Estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition+)). "ScienceDirect" provides free access to the corresponding [paper](https://www.sciencedirect.com/science/article/pii/S2352340919306985?via%3Dihub).

The dataset has 2111 records and 17 columns (16 features and the class label). The records are labeled with the class variable "NObeyesdad" (obesity level), which allows classification into 7 groups: "Insufficient Weight", "Normal Weight", "Overweight Level I", "Overweight Level II", "Obesity Type I", "Obesity Type II" and "Obesity Type III". The dataset authors note that 23% of the records were collected directly from users through a web platform, and the remaining 77% were generated synthetically with the Weka tool and the SMOTE filter.

Eating habits, physical activity, and genes are factors that affect a person's predisposition to obesity. The task here is to explore the dataset and to find a decent model capable of telling whether someone is overweight or obese, or whether their body weight falls within the normal (healthy) range. In addition, an attempt was made to cluster the data based on all features (predictors). Both tasks - classification and clustering - are described after data exploration.

#### Imports

In [None]:
%matplotlib inline

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import make_scorer, f1_score, accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score

from sklearn.cluster import KMeans

In [None]:
%pip install scikit-plot

In [None]:
import scikitplot as skplt

## I. Load data

The original data are provided in a `csv` file. It is loaded and stored in `obesity_data`. The first five rows are displayed below.

In [None]:
obesity_data = pd.read_csv("../input/obesity-levels/ObesityDataSet_raw_and_data_sinthetic.csv")

In [None]:
obesity_data.head()

A brief check confirms the dataset has 2111 rows and 17 columns.

In [None]:
obesity_data.shape

## II. Exploratory Data Analysis

The output below shows that there are no missing values in the DataFrame; 8 columns hold numeric (float64) values, and the other 9 (including the class label) hold categorical ones. All are explored further in this chapter. In general, the dataset is tidy, so data cleaning was not necessary.

In [None]:
obesity_data.info()

### II.1. Explore Features

As mentioned earlier, part of the data were collected through an online survey. Respondents had several options to answer each question. Features hold information gathered for each particular question and the corresponding possible answers. These are described and explored below.

Counting and visualizing categorical variables is wrapped in a function to avoid repeating the same operations. The first plot shows the number of men and women in the dataset.

In [None]:
def count_values(dataset, cat_variable, order = None):
    """
    Function: Counts values in each category and displays them on a plot.
    
    Parameters: Dataset, category feature, and order of appearance (order is optional).
    """
    ax = sns.countplot(x = cat_variable, data = dataset, palette = "Blues_r", order = order)
    for p in ax.patches:
        ax.annotate(f"\n{p.get_height()}", (p.get_x()+0.2, p.get_height()), 
                    ha = "center", va = "top", color = "white", size = 10)
    
    plt.title(f"Number of items in each {cat_variable} category")
    plt.show()

#### Gender

There is an almost equal number of females and males in the dataset. Data is available for slightly more men than women, but this does not make it imbalanced.

In [None]:
count_values(obesity_data, "Gender")

#### Age

Computing and visualizing distribution of continuous values is wrapped in a function, too. It displays not only data distribution but also its mean and median.

In [None]:
def plot_distribution(dataset, feature):
    """
    Function: Computes and displays distribution of features with continuous values; plots their mean and median.
    
    Parameters: Dataset and feature with continuous values.
    """
    plt.hist(dataset[feature], bins = "fd")
    
    plt.axvline(dataset[feature].mean(), color = "red", label = "mean")
    plt.axvline(dataset[feature].median(), color = "orange", label = "median")
    
    plt.xlabel(f"{feature}")
    plt.ylabel("Count")
    plt.legend()
    plt.title(f"Distribution of values in {feature}")
    plt.show()

The youngest person in the dataset is 14 years old, and the oldest one - 61 years of age. Values in this column are not normally distributed; the histogram is positively skewed, with the mean (24.31) and the median (22.78) closer to the lower bound.

In [None]:
obesity_data["Age"].describe()

In [None]:
obesity_data["Age"].median()

In [None]:
plot_distribution(obesity_data, "Age")

#### Height

Obesity is determined by computing the `Body mass index` (BMI). It is a function of a person's height and weight. The exact formula is $\text{BMI} = \frac{\text{Weight}}{\text{Height}^2}$, with weight in kilograms and height in metres. Thus, height is an important element for determining obesity.
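As a small worked example of the formula, the sketch below computes BMI for three made-up people. The DataFrame `sample` and its values are hypothetical and are not taken from the dataset; only the column names mirror it.

```python
import pandas as pd

# Hypothetical heights (m) and weights (kg), for illustration only
sample = pd.DataFrame({
    "Height": [1.60, 1.75, 1.80],
    "Weight": [50.0, 70.0, 110.0],
})

# BMI = weight in kilograms divided by squared height in metres
sample["BMI"] = sample["Weight"] / sample["Height"] ** 2

print(sample.round(2))
```

The same one-line expression could be applied to `obesity_data` to add a BMI column, should one be needed.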

Distribution of height values is plotted below. Most people are 1.60 m - 1.85 m tall. Both mean and median values are around 1.70. Still, height values do not seem to be normally distributed.

In [None]:
plot_distribution(obesity_data, "Height")

#### Weight

Weight does not offer many interesting observations. Its distribution is more or less bimodal; the mean and the median are shifted to the left because of the larger number of people weighing around 80 kg.

In [None]:
plot_distribution(obesity_data, "Weight")

It would be interesting to see if there is any relationship between "Height" and "Weight" since both metrics are used to compute `Body mass index`. Furthermore, these are the most important features (see Chapter V) for predicting whether a person is overweight or obese.

The code below plots each person's weight against their height. The red line is a least-squares fit; its upward slope shows a positive correlation between the two variables, which means that higher values of one tend to go together with higher values of the other. In other words, taller people are more likely to weigh more.

In [None]:
plt.scatter(obesity_data["Height"], obesity_data["Weight"], alpha = 0.5)
m, b = np.polyfit(obesity_data["Height"], obesity_data["Weight"], 1)
plt.plot(obesity_data["Height"], m * obesity_data["Height"] + b, color = "red")

plt.xlabel("Height [m]")
plt.ylabel("Weight [kg]")
plt.title("Correlation between 'Height' and 'Weight'")
plt.show()
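The slope of the fitted line depends on the units of both axes, so it does not say how strong the relationship is. Pearson's correlation coefficient is a unit-free alternative. The sketch below computes both on a few made-up points; `heights` and `weights` are hypothetical values, not records from the dataset.

```python
import numpy as np

# Hypothetical height (m) and weight (kg) pairs, for illustration only
heights = np.array([1.50, 1.60, 1.70, 1.80, 1.90])
weights = np.array([50.0, 58.0, 70.0, 77.0, 90.0])

# Pearson's r is the off-diagonal entry of the 2x2 correlation matrix
pearson_r = np.corrcoef(heights, weights)[0, 1]

# Slope and intercept of the least-squares line (degree-1 polynomial fit)
slope, intercept = np.polyfit(heights, weights, 1)

print(f"r = {pearson_r:.3f}, slope = {slope:.1f} kg per metre")
```

On the real data the same two calls would take `obesity_data["Height"]` and `obesity_data["Weight"]` as arguments.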

#### Does overweight run in the family? 

People were asked if family members suffered from overweight. Most of them replied affirmatively.

In [None]:
count_values(obesity_data, "family_history_with_overweight")

#### Consumption of high caloric food

Survey respondents had to say if they eat high-calorie food frequently. There were only two possible answers: "yes" or "no". Most of them (ca. 88%) admitted they consume high-calorie food.

In [None]:
count_values(obesity_data, "FAVC")

#### Consumption of vegetables

The "FCVC" column denotes how often people consume vegetables. Possible answers were "Never", "Sometimes", and "Always". The values are stored as numbers rather than as categories. It could be assumed that "3" means "Always", "2" - "Sometimes", and "1" - "Never". The values in between are most likely a by-product of the synthetic records: SMOTE creates new samples by interpolating between existing ones, which yields non-integer values.
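One plausible explanation for the fractional values, given that most records were generated with SMOTE, is interpolation between two real answers. The sketch below is a simplified illustration of that single step, not the Weka implementation; `answer_a`, `answer_b` and the random gaps are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical encoded answers: 2 = "Sometimes", 3 = "Always"
answer_a = 2.0
answer_b = 3.0

# SMOTE-style interpolation: move a random fraction of the way from a to b
gap = rng.random(5)
synthetic_values = answer_a + gap * (answer_b - answer_a)

# Rounding maps the synthetic values back onto the original answer codes
rounded_back = np.rint(synthetic_values).astype(int)

print(synthetic_values.round(3), rounded_back)
```

Rounding is one possible way to recover categories from such columns; whether it is appropriate depends on the analysis.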

In [None]:
plot_distribution(obesity_data, "FCVC")

#### Meals per day

Similarly, respondents had to indicate the number of main meals they have daily: "Between 1 and 2", "Three", and "More than three". Instead of categorical values, this feature also holds numerical ones. Mean and median are not informative here either.

In [None]:
plot_distribution(obesity_data, "NCP")

#### Food between meals

People had to say if and how often they eat between meals. They could answer either "No" (if they do not eat anything between regular meals), or "Sometimes", "Frequently", or "Always". The data suggests that most people "sometimes" have small snacks between meals.

In [None]:
count_values(obesity_data, "CAEC", ["no", "Sometimes", "Frequently", "Always"])

#### Smoke

Most respondents do not smoke.

In [None]:
count_values(obesity_data, "SMOKE")

#### Drink water

Drinking water habits should have been categorised in three groups: "Less than a liter", "Between 1 and 2 L", and "More than 2 L". Instead, the answers are entered as continuous values. Their distribution (not very informative either) is shown below.

In [None]:
plot_distribution(obesity_data, "CH2O")

#### Monitor intake of calories 

It seems people do not worry about the calories they consume daily. On the other hand, they might not have been aware of the nutritional value and ingredients of each food if these were not listed on the packaging.

In [None]:
count_values(obesity_data, "SCC")

#### Physical activity

Respondents were asked about their physical activity. They had to choose one of four answers: "I do not have", "1 or 2 days", "2 or 4 days", and "4 or 5 days". Values in the "FAF" column are continuous instead of categorical. These are plotted below, but their distribution (as well as mean and median) is hard to interpret.

In [None]:
plot_distribution(obesity_data, "FAF")

#### Physical INactivity

Similarly, people were asked to state how much time they spend using technological devices such as cell phones, videogames, television, computers, etc. They could say "0-2 hours", "3-5 hours", and "More than 5 hours". Responses are stored as continuous values. Their distribution, which is hard to interpret, is shown below.

In [None]:
plot_distribution(obesity_data, "TUE")

#### Drink alcohol

Most people drink alcohol "sometimes", but almost a third claim they do not consume any alcoholic beverages.

In [None]:
count_values(obesity_data, "CALC")

#### Transportation

Most people (around 3/4) rely on public transportation. Far fewer respondents use their cars. The remainder either walk or use a bike or motorbike.

In [None]:
plt.figure(figsize = (7, 4))
count_values(obesity_data, "MTRANS")

#### Normal, Overweight or Obese?

People, according to their `Body mass index (BMI)`, are categorised as:

* Underweight if BMI < 18.5
* Normal if BMI 18.5 - 24.9 
* Overweight if BMI 25.0 - 29.9
* Obesity I if BMI 30.0 - 34.9
* Obesity II if BMI 35.0 - 39.9
* Obesity III if BMI ≥ 40
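The thresholds above can be sketched as a binning step. The function `bmi_category`, the names `bmi_bins` and `bmi_names`, and the demo values are hypothetical; half-open intervals such as [18.5, 25.0) are an assumption made here so that no BMI value falls between two categories.

```python
import numpy as np
import pandas as pd

# Lower bound of each category; intervals are closed on the left, open on the right
bmi_bins = [0, 18.5, 25.0, 30.0, 35.0, 40.0, np.inf]
bmi_names = ["Underweight", "Normal", "Overweight",
             "Obesity I", "Obesity II", "Obesity III"]

def bmi_category(weight_kg, height_m):
    """Compute BMI from weight (kg) and height (m) and map it to a category."""
    bmi = weight_kg / height_m ** 2
    return pd.cut(bmi, bins=bmi_bins, labels=bmi_names, right=False)

# Hypothetical people, for illustration only
demo = bmi_category(pd.Series([50.0, 70.0, 110.0]), pd.Series([1.80, 1.75, 1.60]))
print(demo.tolist())
```

Note that the dataset's own labels split "Overweight" into two levels, so these six categories do not map one-to-one onto the seven classes.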

Number of people per category is displayed below (note: categories are ordered logically).

The plot shows the dataset is roughly balanced; only the "Obesity_Type_I" class slightly outnumbers the other categories.

In [None]:
plt.figure(figsize = (12, 5))
count_values(obesity_data, "NObeyesdad", ["Insufficient_Weight", "Normal_Weight", "Overweight_Level_I", "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"]) 

##### Overweight and Gender

It is interesting to see how Overweight/Obesity interact with different categorical variables. The function below computes and plots this interaction.

In [None]:
def cross_plot(dataset, lead_category, sup_category, order = None):
    """
    Function: Plots interaction between two categorical variables.
    
    Parameters: Dataset, lead category, supplemental category, and order of appearance (order is optional).
    """
    
    sns.countplot(x = lead_category, hue = sup_category, data = dataset, order = order, palette = "Blues_r")
    
    plt.show()

Women are more likely to have "Insufficient weight" than men. On the other hand, there are more obese men than women, save in the last, extreme obesity category.

In [None]:
plt.figure(figsize = (13, 5))
cross_plot(obesity_data, "NObeyesdad", "Gender", ["Insufficient_Weight", "Normal_Weight", "Overweight_Level_I", "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"])

##### Overweight and family history

It seems obesity runs in the family. The vast majority of those categorised as overweight or obese had family members suffering from weight problems.

In [None]:
plt.figure(figsize = (13, 5))
cross_plot(obesity_data, "NObeyesdad", "family_history_with_overweight", ["Insufficient_Weight", "Normal_Weight", "Overweight_Level_I", "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"])

##### Overweight and high calories food

Both normal-weight and overweight/obese people consume high-calorie food. Perhaps the quantity of food makes the difference and affects body fat.

In [None]:
plt.figure(figsize = (13, 5))
cross_plot(obesity_data, "NObeyesdad", "FAVC", ["Insufficient_Weight", "Normal_Weight", "Overweight_Level_I", "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"]) 

##### Overweight and food between meals

The weight of those who "frequently" or "always" eat between meals does not seem to be abnormal. It is mostly people who have snacks from time to time that fall into the "Overweight" or "Obesity" categories.

In [None]:
plt.figure(figsize = (18, 5))
cross_plot(obesity_data, "NObeyesdad", "CAEC", ["Insufficient_Weight", "Normal_Weight", "Overweight_Level_I", "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"]) 

##### Overweight and smoking

It seems smoking is not a predictor or does not affect body weight. There is a tiny number of smokers who could be both normal and overweight.

In [None]:
plt.figure(figsize = (13, 5))
cross_plot(obesity_data, "NObeyesdad", "SMOKE", ["Insufficient_Weight", "Normal_Weight", "Overweight_Level_I", "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"]) 

##### Overweight and monitoring calories

People who tend to monitor their calorie intake are less likely to gain excess weight.

In [None]:
plt.figure(figsize = (13, 5))
cross_plot(obesity_data, "NObeyesdad", "SCC", ["Insufficient_Weight", "Normal_Weight", "Overweight_Level_I", "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"]) 

##### Overweight and alcohol

There might be a weak link between alcohol and obesity. The data suggest that people who "sometimes" drink alcohol could face weight problems.

In [None]:
plt.figure(figsize = (18, 5))
cross_plot(obesity_data, "NObeyesdad", "CALC", ["Insufficient_Weight", "Normal_Weight", "Overweight_Level_I", "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"]) 

##### Overweight and means of transport

Means of transport does not seem to (significantly) affect a person's weight. Slim, normal-weight and overweight people alike use public transport; all groups rely on cars as well.

In [None]:
plt.figure(figsize = (18, 5))
cross_plot(obesity_data, "NObeyesdad", "MTRANS", ["Insufficient_Weight", "Normal_Weight", "Overweight_Level_I", "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"]) 

### II.2. Explore Statistics

Summary statistics do not reveal much information about the features with numeric values. Data in most columns (except age, height and weight) are hard to interpret. Nonetheless, these are displayed below.

In [None]:
obesity_data.describe().T

In [None]:
obesity_numeric = obesity_data[["Age", "Height", "Weight", "FCVC", "NCP", "CH2O", "FAF", "TUE"]]

The boxplots below show quartiles and outliers. Distributions in the last 5 columns are not taken into account. 

The first boxplot suggests that there are outliers in the "Age" column. However, 40, 50 or 60 years of age are normal values (they are neither extreme nor errors), and for this reason they are not removed. "Height" does not seem to have outliers, and "Weight" has only a couple. These are not treated either.

In [None]:
fig, axs = plt.subplots(ncols = 4, nrows = 2, figsize = (20, 8))
# fig.delaxes(axs[1][3])
idx = 0
axs = axs.flatten()
for k, v in obesity_numeric.items():
    sns.boxplot(y = k, data = obesity_numeric, ax = axs[idx])
    idx += 1
plt.tight_layout(pad = 0.4, w_pad = 0.5, h_pad = 5.0)

Apart from the positive relationship between "Height" and "Weight" noted above, (linear) correlation between numeric features is weak or nonexistent. Thus, all features remain in the table.

In [None]:
plt.figure(figsize = (12, 10))
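# Correlate the numeric columns only; DataFrame.corr() raises on string columns in pandas >= 2.0
sns.heatmap(obesity_numeric.corr(),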
           annot = True,
           cmap = "Blues_r",
           linewidths = 2, 
           linecolor = "white")
plt.title("Correlation matrix of obesity data")
plt.show()

## III. Data pre-processing
### III.1. Encoding Features

Categorical variables are one-hot encoded with `get_dummies()`. Labels (i.e., the column stating which weight category a person belongs to) are stored in a separate variable, which will be used later.

In [None]:
obesity_dummies = pd.get_dummies(obesity_data[["Gender", "family_history_with_overweight", "FAVC", "CAEC", "SMOKE", "SCC", "CALC", "MTRANS"]])

In [None]:
obesity_lab = obesity_data[["NObeyesdad"]]

All three sets - numeric features, one-hot encoded ones, and labels - are concatenated in a new DataFrame. It has 32 columns now. Its first rows are displayed below.
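To make the effect of one-hot encoding concrete, the sketch below applies `get_dummies()` to a tiny made-up table. The DataFrame `toy` is hypothetical; only the column names and answer values mirror the dataset. Each categorical column is replaced by one indicator column per distinct value.

```python
import pandas as pd

# Hypothetical three-row table with two categorical columns
toy = pd.DataFrame({
    "SMOKE": ["no", "yes", "no"],
    "CAEC": ["Sometimes", "Always", "no"],
})

# One indicator column per (column, value) pair, named "<column>_<value>"
toy_dummies = pd.get_dummies(toy)

print(toy_dummies.columns.tolist())
print(toy_dummies.shape)
```

Here two columns with 2 and 3 distinct values become 5 indicator columns. The same counting explains the width of the concatenated table: the 8 encoded columns expand into 23 indicators, which together with 8 numeric features and 1 label column gives 32.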

In [None]:
obesity_concatenated = pd.concat([obesity_numeric, obesity_dummies, obesity_lab], axis = 1)

In [None]:
obesity_concatenated.head()

### III.2. Separate Features and Labels

Features and labels are separated and stored in different variables.

In [None]:
obesity_label = obesity_concatenated["NObeyesdad"]
obesity_features = obesity_concatenated.drop("NObeyesdad", axis = 1)

In [None]:
obesity_label

### III.3. Convert Numerical Values

A brief check shows that some columns hold "float64" numbers, while the one-hot encoded ones hold "uint8" values (or "bool" in recent pandas versions). To keep the feature table homogeneous, all values are converted into floats.

In [None]:
obesity_features.info()

In [None]:
obesity_features = obesity_features.astype("float")

The line of code below confirms the features hold only "float64" numbers now.

In [None]:
obesity_features.dtypes

### III.4. Scale Features

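Values in all features should be in the same range. Otherwise, scale-sensitive algorithms (e.g., distance-based ones such as KMeans) might give features with larger values undue weight. Obesity features are scaled with `MinMaxScaler()`, which maps the values of every feature to the range between 0 and 1. The second cell below confirms the scaling was successful. Note that the split in III.6 uses the unscaled `obesity_features`; this does not affect the decision tree, which is insensitive to feature scale.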

In [None]:
obesity_features_scaled = MinMaxScaler().fit_transform(obesity_features)

In [None]:
# axis = 0 gives the minimum and maximum of each column (feature), which should be 0 and 1
obesity_features_scaled.min(axis = 0), obesity_features_scaled.max(axis = 0)
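Min-max scaling computes $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$ separately for each column. The NumPy sketch below reproduces that formula by hand and also illustrates a common precaution: taking the minimum and maximum from the training rows only and reusing them for the test rows. The arrays `train` and `test` are hypothetical, and the train-only fit is a suggestion, not what the notebook above does.

```python
import numpy as np

# Hypothetical (height, weight) rows, for illustration only
train = np.array([[1.50, 40.0], [1.70, 80.0], [1.90, 120.0]])
test = np.array([[1.60, 100.0]])

# Column-wise statistics are taken from the training rows only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

# Apply the same transformation to both sets
train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled.min(axis=0), train_scaled.max(axis=0))
print(test_scaled)
```

With `MinMaxScaler`, the equivalent pattern would be `fit` on the training features followed by `transform` on both sets.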

### III.5. Encode Labels

Most Machine Learning classification algorithms expect labels with numeric values (and not strings). For this reason, obesity class is encoded with `LabelEncoder()`. The latter replaces each class with an integer. 

First, the encoder is instantiated. Then, `fit()` learns the set of classes present in the data. Finally, `transform()` replaces each class with its respective integer.

In [None]:
encoder = LabelEncoder()

In [None]:
encoder.fit(obesity_label)

In [None]:
list(encoder.classes_)

In [None]:
obesity_labels_encoded = encoder.transform(obesity_label)

In [None]:
obesity_labels_encoded
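It helps to remember that `LabelEncoder` numbers the classes in sorted (alphabetical) order of their names, not in order of severity. The sketch below shows this on a short made-up list and converts the integers back with `inverse_transform()`; `demo_encoder` and `demo_labels` are hypothetical names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, for illustration only
demo_labels = ["Normal_Weight", "Obesity_Type_I", "Insufficient_Weight", "Normal_Weight"]

demo_encoder = LabelEncoder()
codes = demo_encoder.fit_transform(demo_labels)

# classes_ holds the class names in sorted order; the position is the integer code
print(list(demo_encoder.classes_))
print(list(codes))

# inverse_transform maps integer codes back to the original class names
decoded = demo_encoder.inverse_transform(codes)
print(list(decoded))
```

For this reason the overweight levels receive higher codes than the obesity types in the encoded labels.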

### III.6. Train - Test split

The dataset is split into training and testing sets. A validation set was not withheld since the dataset is small and a sufficient number of samples should be kept for training. Cross-validation during Grid Search addresses this drawback.

The splitting function (`train_test_split`) shuffles the data, reserves 20% for testing and, through `stratify`, keeps the class proportions the same in both sets. The shapes of the resulting sets are checked below.

In [None]:
obesity_features_tr, obesity_features_ts, obesity_labels_tr, obesity_labels_ts = train_test_split(
                obesity_features, obesity_labels_encoded, 
                test_size = 0.2, stratify = obesity_labels_encoded,
                random_state = 42) # shuffle=True

In [None]:
obesity_features_tr.shape, obesity_labels_tr.shape, obesity_features_ts.shape, obesity_labels_ts.shape

`Counter` tells how many examples are placed in each class. The outputs below show that there is a sufficient number of samples per class in both the training and the testing set.

In [None]:
Counter(obesity_labels_tr)

In [None]:
Counter(obesity_labels_ts)

## IV. Train model to classify data into obesity categories

The first modelling task is to classify data into obesity categories. "Accuracy" is a good performance metric but "f1 score" (the harmonic mean of "precision" and "recall") is a more appropriate one. To use it for grid search and cross-validation, it is wrapped with `make_scorer` and stored in a variable.
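For a single class, the F1 score is $F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}$. The sketch below evaluates it for one made-up pair of values and contrasts it with the geometric mean, which is a different quantity. The helper `f1_from` and the numbers are hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical values, for illustration only
precision, recall = 0.8, 0.5

harmonic = f1_from(precision, recall)
geometric = (precision * recall) ** 0.5

print(f"F1 (harmonic mean): {harmonic:.4f}")
print(f"Geometric mean:     {geometric:.4f}")
```

The harmonic mean is never larger than the geometric mean, so F1 penalizes an imbalance between precision and recall more strongly. With `average = "weighted"`, per-class F1 scores are averaged with weights proportional to the number of samples in each class.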

It could be assumed that many classifiers would return good scores. `DecisionTreeClassifier()` is chosen for its simplicity and interpretability. It has several hyper-parameters that could be tuned, but only the tree's maximum depth is tuned here.

`GridSearchCV()` checks which value returns the best results. The search space is limited to maximum tree depths between 5 and 15. These values are stored in a dictionary, which is passed to the search. Models are trained and cross-validated on 5 folds.

### IV.1. Build Model

In [None]:
f1 = make_scorer(f1_score, average = "weighted")

In [None]:
params = {
    "max_depth": [5, 7, 9, 11, 13, 15]
}

In [None]:
grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid = params, cv = 5, scoring = f1)

In [None]:
grid_search.fit(obesity_features_tr, obesity_labels_tr)

Cross-validation shows that almost all combinations reach "f1 score" close to or above 90%. 

In [None]:
grid_search.cv_results_
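The raw `cv_results_` dictionary is hard to read. A common way to inspect it is to load it into a DataFrame and keep only a few columns. The sketch below does this on synthetic data from `make_classification`, so its scores say nothing about the obesity dataset; `X_demo`, `y_demo`, `demo_search` and `summary` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)

demo_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                           param_grid={"max_depth": [3, 5, 7]},
                           cv=5, scoring="f1_weighted")
demo_search.fit(X_demo, y_demo)

# One row per candidate, best-ranked first
summary = (pd.DataFrame(demo_search.cv_results_)
           [["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]]
           .sort_values("rank_test_score"))
print(summary)
```

Looking at `std_test_score` next to `mean_test_score` helps to judge whether the difference between two depths is larger than the fold-to-fold variation.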

The best model is a tree with a maximum depth of 13. Therefore, this value is set as a hyper-parameter.

In [None]:
grid_search.best_params_

### IV.2. Train and Evaluate Model

In [None]:
model_tree = DecisionTreeClassifier(max_depth = 13, random_state = 42)

For clarity, `fit`, `predict`, and `score` are placed in a function, which facilitates model training, evaluation and selection. In this particular case it only prints the Decision Tree's performance in terms of its "accuracy" and "f1 score" on both sets.

In [None]:
def train_predict_score(estimator, train_features, train_labels, test_features, test_labels):
    """
    Function: Trains model, predicts classes and computes accuracy and f1 score.
    
    Parameters: estimator, X_train, y_train, X_test, y_test.
    """
    estimator.fit(train_features, train_labels)
    
    print(f"Accuracy on Train data: {accuracy_score(train_labels, estimator.predict(train_features))}")
    print(f"F1 score on Train data: {f1_score(train_labels, estimator.predict(train_features), average = 'weighted')}")
    print(f"Accuracy on Test data: {accuracy_score(test_labels, estimator.predict(test_features))}")
    print(f"F1 on Test data: {f1_score(test_labels, estimator.predict(test_features), average = 'weighted')}")

Both "accuracy" and "f1 score" are 100% on the training data but only 91%-92% on the testing data. This gap suggests the model is overfitting. Its performance could be improved with regularization (e.g., a shallower tree or a minimum number of samples per leaf), with feature selection (e.g., removing unimportant columns), or by collecting more samples. None of these techniques is explored further since "accuracy" and "f1 score" above 90% are acceptable.

In [None]:
train_predict_score(model_tree, obesity_features_tr, obesity_labels_tr, obesity_features_ts, obesity_labels_ts)
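One of the regularization options mentioned above, a minimum number of samples per leaf, can be tried with a few lines. The sketch below uses synthetic data from `make_classification`, so the numbers it prints are not results for the obesity dataset; all variable names and the value `min_samples_leaf=5` are hypothetical choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, for illustration only
X_toy, y_toy = make_classification(n_samples=600, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)
X_tr, X_ts, y_tr, y_ts = train_test_split(X_toy, y_toy, test_size=0.2,
                                          stratify=y_toy, random_state=0)

candidates = {
    "unrestricted": DecisionTreeClassifier(random_state=0),
    "min_samples_leaf=5": DecisionTreeClassifier(min_samples_leaf=5, random_state=0),
}

# Store (train accuracy, test accuracy) for each candidate tree
scores = {}
for name, tree in candidates.items():
    tree.fit(X_tr, y_tr)
    scores[name] = (tree.score(X_tr, y_tr), tree.score(X_ts, y_ts))
    print(f"{name}: train = {scores[name][0]:.3f}, test = {scores[name][1]:.3f}")
```

A smaller gap between the two accuracies indicates less overfitting; whether test accuracy itself improves depends on the data and has to be checked case by case.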

Decision trees are easy to interpret. If plotted (see below), they show how decisions were taken (i.e., how classification happened). Each node "asks" a question; if the response is "True", the sample is passed to the child node on the left; if it is "False", the sample goes to the child on the right. This process continues either until no more questions can be asked, or until the "max_depth" limit is reached. Only the top levels of the tree (down to depth 2) are displayed below.

In [None]:
plt.figure(figsize = (22, 6))
plot_tree(model_tree, max_depth = 2)
plt.show()

##### Classification Report

`classification_report` is a `scikit-learn` function which shows classification metrics for each class. For example, most of the samples in "Obesity_Type_III" (class 4) were properly classified: the model reached 100% "precision" and 99% "f1 score" for this class. On the other hand, "Normal_Weight" (class 1) was harder to predict and got around 80% on "precision" and "f1 score".

In [None]:
print(classification_report(obesity_labels_ts, model_tree.predict(obesity_features_ts)))

In [None]:
model_tree.classes_

In [None]:
list(encoder.classes_)

##### Confusion Matrix

`confusion_matrix` shows *actual* vs *predicted* labels. Rows represent actual classes, while columns represent predicted classes. For example, 47 samples of class 0 were classified correctly but 7 were wrongly placed in class 1. Only one sample of class 4 was misclassified as a sample of class 3.

In [None]:
plt.figure(figsize = (8, 6))
sns.heatmap(confusion_matrix(obesity_labels_ts, model_tree.predict(obesity_features_ts)),
           annot = True,
           fmt = ".0f",
           cmap = "Blues_r",
           linewidths = 2, 
           linecolor = "white",
           xticklabels = model_tree.classes_,
           yticklabels = model_tree.classes_)
plt.show()
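The row and column convention can be checked on a tiny example. In the sketch below, `actual` and `predicted` are made-up labels; per-class recall is the diagonal divided by the row sums, and per-class precision is the diagonal divided by the column sums.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, for illustration only
actual = [0, 0, 0, 1, 1, 2]
predicted = [0, 0, 1, 1, 1, 2]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(actual, predicted)
print(cm)

# Recall: correct predictions divided by the number of actual samples per class
recall_per_class = cm.diagonal() / cm.sum(axis=1)

# Precision: correct predictions divided by the number of predictions per class
precision_per_class = cm.diagonal() / cm.sum(axis=0)

print(recall_per_class.round(3), precision_per_class.round(3))
```

One sample of class 0 was predicted as class 1, so it appears in row 0, column 1; it lowers the recall of class 0 and the precision of class 1.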

##### ROC Score and Curve

Another popular classification metric is the ROC curve (Receiver Operating Characteristic curve). It is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: True Positive Rate and False Positive Rate. Area Under the Curve (AUC) represents the probability that the model assigns a higher score to a random positive example than to a random negative example. AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.
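The ranking interpretation of AUC can be verified by counting pairs. The sketch below compares every positive score with every negative score (ties count as half a win) and checks the result against `roc_auc_score`. The scores are made-up values for a binary problem.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Hypothetical predicted scores, for illustration only
positive_scores = [0.9, 0.8, 0.4]
negative_scores = [0.7, 0.3, 0.2, 0.1]

# Fraction of (positive, negative) pairs in which the positive is ranked higher
pairs = list(product(positive_scores, negative_scores))
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
pairwise_auc = wins / len(pairs)

# The same quantity computed from the ROC curve
labels = [1] * len(positive_scores) + [0] * len(negative_scores)
sklearn_auc = roc_auc_score(labels, positive_scores + negative_scores)

print(pairwise_auc, sklearn_auc)
```

In the multi-class "One vs Rest" setting, this computation is repeated with each class in turn treated as the positive one, and the per-class values are averaged.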

AUC and ROC curve require computing probability prediction scores. These show the probability a certain sample belongs to a particular class.

In [None]:
obesity_score_probability = model_tree.predict_proba(obesity_features_ts)

The first sample has the highest probability of being 0-th class, the second - 1-st class, and so forth.

In [None]:
obesity_score_probability

The aggregated AUC score for all classes (computed as "One vs Rest") is around 95%, which indicates good performance.

In [None]:
roc_auc_score(obesity_labels_ts, obesity_score_probability, multi_class = "ovr")

ROC curves are plotted below. They climb up and to the left, which indicates good model performance. As found earlier, the model best predicts class 4 (light green line), class 6 (red line) and class 0 (black line). AUCs for all classes are displayed in the legend.

In [None]:
skplt.metrics.plot_roc(obesity_labels_ts, obesity_score_probability)
plt.show()

## V. Clustering 

An experiment was made to cluster the dataset's features. Forming separate clusters would indicate that values of the given features are specific to a particular overweight/obesity type. The task is performed with "KMeans" - one of the simplest clustering algorithms. Instantiating it requires setting the number of clusters to form, which is also the number of centroids to generate. The number of clusters is known: 7, one for each weight type. "K-means++" is the chosen method for initialization - it selects initial cluster centers in a smart way to speed up convergence.

In [None]:
kmeans = KMeans(n_clusters = 7, init = "k-means++")

Features and their projection should be visualized to show how clustering works. However, the data have far more dimensions than can be displayed on a 2D plot. For this reason, only the two most important features are shown. `DecisionTreeClassifier()` found that the second ("Height") and the third ("Weight") columns are the most important ones, with importances of 21.9% and 47.85%, respectively. These values measure each feature's share of the total impurity reduction achieved by the tree's splits. The output below also shows that some columns contributed little or nothing to the tree's decisions and could have been removed.

In [None]:
model_tree.feature_importances_
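The raw importance array is easier to read when paired with column names and sorted. The sketch below shows the pattern on a small synthetic table in which the label depends only on BMI; `toy`, `toy_labels`, `toy_tree` and the "Noise" column are hypothetical, and the resulting numbers are unrelated to those reported above.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic table: two informative columns and one pure-noise column
toy = pd.DataFrame({
    "Height": rng.uniform(1.5, 1.9, 200),
    "Weight": rng.uniform(40, 130, 200),
    "Noise": rng.random(200),
})
# Label is 1 when BMI is at least 25, so it depends on Height and Weight only
toy_labels = (toy["Weight"] / toy["Height"] ** 2 >= 25).astype(int)

toy_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(toy, toy_labels)

# Pair each importance with its column name and sort in descending order
importances = pd.Series(toy_tree.feature_importances_,
                        index=toy.columns).sort_values(ascending=False)
print(importances.round(3))
```

Applied to the notebook's model, the same pattern would be `pd.Series(model_tree.feature_importances_, index = obesity_features.columns)`. Importances always sum to 1, so they are relative shares rather than absolute measures.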

Clusters (formed by "Height" and "Weight" features) in the testing data according to their real labels are plotted below. 

In [None]:
def plot_clusters(dataset, feature_one, feature_two, labels, title = None):
    """
    Function: Computes and displays clusters.
    
    Parameters: dataset, 2 features, cluster indicator.
    """
    sns.scatterplot(data = dataset, x = feature_one, y = feature_two, hue = labels, palette = "Blues_r")
    if title is not None:
        plt.title(title)
    plt.show()

In [None]:
plot_clusters(obesity_features_ts, "Height", "Weight", obesity_labels_ts, "Clusters in Test data")

`KMeans` computes the distance between each point (described by its feature values) and the cluster centroids, and assigns the point to the nearest one. Thus, clustering could be considered an unsupervised counterpart of classification. Its performance is not assessed here, although metrics for this purpose exist: for example, the silhouette score or, when true labels are available, the adjusted Rand index.
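As a hedged illustration of how a clustering could be scored, the sketch below runs `KMeans` on well-separated synthetic blobs and computes two metrics: the adjusted Rand index, which compares clusters with known labels and ignores how the clusters are numbered, and the silhouette score, which needs no labels. The blob centers and all variable names are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Three well-separated synthetic groups, for illustration only
X_blobs, y_blobs = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                              cluster_std=0.5, random_state=0)

demo_kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
demo_clusters = demo_kmeans.fit_predict(X_blobs)

# Agreement with the true groups; 1.0 means identical partitions
ari = adjusted_rand_score(y_blobs, demo_clusters)

# Cohesion vs separation of the clusters; ranges from -1 to 1
sil = silhouette_score(X_blobs, demo_clusters)

print(f"Adjusted Rand index: {ari:.3f}, silhouette score: {sil:.3f}")
```

For the notebook's data the analogous calls would be `adjusted_rand_score(obesity_labels_ts, predicted_labels)` and `silhouette_score(obesity_features_ts, predicted_labels)`. Because KMeans relies on distances, the outcome may also change if the features are scaled first.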

In [None]:
predicted_labels = kmeans.fit_predict(obesity_features_ts)

Classes on both plots differ since the clustering algorithm does not know how to order them (i.e., which predicted values correspond to class 0, which - to class 1, etc.). Nonetheless, KMeans managed to group "Height" and "Weight" points in 7 categories which largely overlap with the testing labels.

In [None]:
plot_clusters(obesity_features_ts, "Height", "Weight", predicted_labels, "Predicted clusters")

## Conclusion

A person's height and weight are the most important factors determining their obesity status. Other factors might also play a role, e.g., eating habits and physical activity. Dataset features could be used both for classification and clustering tasks, but it should be borne in mind that most samples are synthetically generated, i.e., they do not reflect the real world. Thus, robust conclusions require much more data that are representative of larger groups.

## References:

[1] Palechor, F. M., & de la Hoz Manotas, A. (2019). Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico. Data in Brief, 104344.