In [None]:
# Bruno Viera Ribeiro - 09/2020

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Project Healthy Diet (fighting COVID-19)

How can eating habits help fight the current COVID-19 pandemic? A healthy diet is very important to prevent and recover from various infections. Keeping a healthy immune system is a **must** in our current situation, and what we eat (along with exercising and clearing our heads every now and then) is key.

While it is clear that good nutrition alone will neither cure nor prevent the spread of COVID-19, it helps us fight back in the case of infection and prevents several other health issues. Many tips can be found on [this](https://www.who.int/campaigns/connecting-the-world-to-combat-coronavirus/healthyathome/healthyathome---healthy-diet) very useful and clear page maintained by the WHO (World Health Organization).

In this project, we will use data *on food intake by country* along with data associated with the *spread of COVID-19 and other health issues* to help gain new insights into the importance of nutrition and eating habits in combating the spread of disease.

Data for this project is taken from [this](https://www.kaggle.com/mariaren/covid19-healthy-diet-dataset) very interesting Kaggle dataset. From the owner of the dataset:

> In this dataset, I have combined data of different types of food, world population obesity and undernourished rate, and global COVID-19 cases count from around the world in order to learn more about how a healthy eating style could help combat the Corona Virus. And from the dataset, we can gather information regarding diet patterns from countries with lower COVID infection rate, and adjust our own diet accordingly

There are 5 files in the dataset:
* Fat_Supply_Quantity_Data.csv: percentage of fat intake from different food groups for 170 different countries.
* Food_Supply_Quantity_kg_Data.csv: percentage of food intake (in $kg$) from different food groups for 170 different countries.
* Food_Supply_kcal_Data.csv: percentage of energy intake (in $kcal$) from different food groups for 170 different countries.
* Protein_Supply_Quantity_Data.csv: percentage of protein intake from different food groups for 170 different countries.
    * All of these files also include columns for obesity, undernourishment and COVID-19 cases as percentages of the total population.
* Supply_Food_Data_Descriptions.csv: this file is obtained from FAO.org and lists the specific types of food that belong to each category in the datasets above.


Now we can dig into the files.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

Let's start by looking into the descriptions

In [None]:
pd.set_option('display.max_colwidth', None)
desc_df = pd.read_csv('../input/covid19-healthy-diet-dataset/Supply_Food_Data_Descriptions.csv', index_col = 'Categories')
desc_df

It looks like we might have some redundant categories. Reading `Animal Products` and `Vegetal Products`, it seems they are a summary of other categories. We should be careful when using these categories for modeling.

# Food intake (in kg) by food group

## Data cleaning

In [None]:
kg_df_full = pd.read_csv('../input/covid19-healthy-diet-dataset/Food_Supply_Quantity_kg_Data.csv')
kg_df_full.head()

In [None]:
kg_df_full.columns

In [None]:
kg_df_full.columns.size

In [None]:
# Let's drop the last column as it is just a unit information
kg_df = kg_df_full.drop('Unit (all except Population)', axis = 1)
kg_df.head()

Beyond the columns described in the `Categories` from the data description, we have 7 other columns:
* Obesity: obesity rate
* Undernourished: undernourished rate
* Confirmed: confirmed cases of COVID-19, by population
* Deaths: confirmed deaths from COVID-19, by population
* Recovered: recovered cases of COVID-19, by population
* Active: active cases of COVID-19, by population
* Population: country population

In [None]:
kg_df.isnull().sum()

We have some missing data from these last columns. We'll start by simply dropping these data.

In [None]:
kg_df.head()

In [None]:
kg_df = kg_df.dropna()

In [None]:
kg_df.info()

Something is not a number in the `Undernourished` column. Let's inspect:

In [None]:
kg_df['Undernourished'][:20]

In [None]:
kg_df['Undernourished'].iloc[0]

OK, so we have strings and some of them are of the form '<2.5'. Let's replace these values with '2.0', as a very crude way of dealing with these values. We need to remember, in the analysis, that all values '2.0' represent something below 2.5.

In [None]:
kg_df.loc[kg_df['Undernourished'] == '<2.5', 'Undernourished'] = '2.0'

In [None]:
kg_df['Undernourished'][:20]

Now, to turn data into numeric types:

In [None]:
kg_df['Undernourished'] = pd.to_numeric(kg_df['Undernourished'])

In [None]:
kg_df.info()

Now we have no missing values and all data is numeric, except for country names.
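
The replacement and conversion steps above can also be done in one pass that keeps a record of which rows were censored. This is only a minimal sketch on a toy Series: the values and the `censored` name are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the 'Undernourished' column (values are invented)
raw = pd.Series(['29.8', '<2.5', '6.2', '<2.5', None], name='Undernourished')

# Flag the censored entries before overwriting them, so the information is kept
censored = raw.eq('<2.5')

# Swap the '<2.5' marker for the crude stand-in value, then convert to floats;
# anything that still cannot be parsed becomes NaN instead of raising
numeric = pd.to_numeric(raw.where(~censored, '2.0'), errors='coerce')

print(numeric)
print(censored)
```

Keeping the boolean flag around makes it easy to exclude or highlight the "below 2.5" countries later, instead of having to remember that every 2.0 is a placeholder.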

##  General COVID-19 data: analysis and further cleaning

Before digging into the data from food intake, let's create a simple visualization of COVID-19 cases by country.

In [None]:
fig = px.scatter(kg_df, x="Confirmed", y = "Deaths",size = "Active", hover_name='Country', log_x=False,
                 size_max=30, trendline = "ols", marginal_x = "box",marginal_y = "violin", template="simple_white")
fig.show()

Here, the size of points corresponds to the active cases of COVID-19. As expected, there is a tendency of having more deaths where more confirmed cases are present.

Now, to understand the dataset a bit more clearly, let's do some sanity checks.

In [None]:
kg_df.columns

What is the sum of `Animal Products` and `Vegetal Products`?

In [None]:
kg_df['Animal Products'] + kg_df['Vegetal Products']

In [None]:
(kg_df['Animal Products'] + kg_df['Vegetal Products']).mean()

Well, for all countries this sum appears to be roughly $50 \%$ of food intake in $kg$. That is strange, as these two should be the sum of all other columns.

To understand the data better, let's sum all food related categories.

In [None]:
kg_df.iloc[:, 1:24].sum(axis=1)

OK, so it looks like every entry is counted twice: once in its own category and once inside `Animal Products` or `Vegetal Products`. From my understanding, `Animal Products` + `Vegetal Products` should sum to $100\%$ of the food intake. This is easily fixed by multiplying all food category columns by 2.
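
To see why double counting would produce exactly $50\%$, here is a small worked example. It assumes (this is my guess about how the percentages were built, not something the dataset documents) that each percentage was computed over the sum of all columns, aggregates included. The raw numbers are invented.

```python
import pandas as pd

# Hypothetical raw supply in kg for one country (numbers invented for illustration)
raw = pd.Series({'Meat': 30.0, 'Milk': 20.0, 'Cereals': 40.0, 'Vegetables': 10.0})

# The aggregates are sums of their member categories
raw['Animal Products'] = raw['Meat'] + raw['Milk']
raw['Vegetal Products'] = raw['Cereals'] + raw['Vegetables']

# If percentages are taken over ALL columns, aggregates included,
# every kilogram appears twice in the denominator
pct = 100 * raw / raw.sum()

aggregate_share = pct['Animal Products'] + pct['Vegetal Products']
fixed_share = 2 * aggregate_share

print(pct)
print(aggregate_share, fixed_share)
```

Under that assumption the two aggregates can only ever add up to half of the total, which matches what we observe, and doubling every food column restores the expected $100\%$.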

In [None]:
kg_df.iloc[:,1:24] = kg_df.iloc[:, 1:24] * 2

In [None]:
(kg_df['Animal Products'] + kg_df['Vegetal Products']).round(1)

In [None]:
(kg_df['Animal Products'] + kg_df['Vegetal Products']).mean()

That fixed the issue. Now, let's do some sanity checks with the COVID-19 categories. 

Columns related to this are: **'Confirmed', 'Deaths', 'Recovered', 'Active'**.

If my understanding is correct, the number of confirmed cases should be the sum of deaths, recovered and active. Let's investigate.

In [None]:
(kg_df['Confirmed'] - (kg_df['Deaths'] + kg_df['Recovered'] + kg_df['Active'])).round(2)

Great! Our understanding is correct.
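
Instead of eyeballing a column of rounded differences, the same check can be made explicit with a tolerance. A minimal sketch on a toy frame (the numbers are made up, and the second row is deliberately inconsistent):

```python
import numpy as np
import pandas as pd

# Toy rows with invented percentages; the second row does not add up on purpose
toy = pd.DataFrame({
    'Confirmed': [0.50, 0.20],
    'Deaths':    [0.01, 0.01],
    'Recovered': [0.40, 0.10],
    'Active':    [0.09, 0.05],
})

parts = toy['Deaths'] + toy['Recovered'] + toy['Active']

# Compare with a tolerance, since the values are rounded floats
consistent = np.isclose(toy['Confirmed'], parts, atol=1e-4)

# Show only the rows that break the identity
print(toy[~consistent])
```

On the real data, `toy` would simply be replaced by `kg_df`, and an empty result would confirm the identity for every country.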

To further investigate the impact of deaths by COVID-19, we will create a column `Mortality` which will be calculated as `Deaths` by `Confirmed`.

In [None]:
kg_df['Mortality'] = kg_df['Deaths']/kg_df['Confirmed']

In [None]:
kg_df['Mortality']

Next, we'll look at some general distributions from the COVID-19 data:

In [None]:
# Distributions
fig = px.bar(kg_df, x = "Country", y ="Confirmed").update_xaxes(categoryorder="total descending")
fig.show()

In [None]:
# Distributions
fig = px.bar(kg_df, x = "Country", y ="Deaths").update_xaxes(categoryorder="total descending")
fig.show()

In [None]:
# Distributions
fig = px.bar(kg_df, x = "Country", y ="Active").update_xaxes(categoryorder="total descending")
fig.show()

In [None]:
# Distributions
fig = px.bar(kg_df, x = "Country", y ="Mortality").update_xaxes(categoryorder="total descending")
fig.show()

From this last figure, we can see that `Yemen` stands out with a very alarming mortality (almost $30\%$). However, `Yemen` is also among the countries with the lowest death rate (a death rate of 0.001955).
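
These two observations do not contradict each other: `Deaths` is a share of the whole population, while `Mortality` is a share of confirmed cases only. A rough arithmetic sketch follows. The `Deaths` value is the one quoted above, but the `Confirmed` value is a hypothetical number I picked just to illustrate the effect, not one read from the dataset.

```python
# 'Deaths' is quoted in the text above; 'Confirmed' is an assumed value
deaths_pct_of_population = 0.001955
confirmed_pct_of_population = 0.0068   # assumption, chosen only for illustration

# Mortality as defined in this notebook: deaths per confirmed case
mortality = deaths_pct_of_population / confirmed_pct_of_population

print(f"Mortality: {mortality:.3f}")
```

When very few cases are confirmed, even a tiny number of deaths gives a large ratio, so a high `Mortality` may say as much about testing coverage as about the disease itself.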

In [None]:
kg_df[kg_df.Country == 'Yemen']['Deaths']

## Investigate: does obesity rate affect impact of COVID-19?

There is a nice report from **Science** ([sciencemag](https://www.sciencemag.org/)) linking obesity to COVID-19 mortality:
* [Why COVID-19 is more deadly in people with obesity—even if they’re young](https://www.sciencemag.org/news/2020/09/why-covid-19-more-deadly-people-obesity-even-if-theyre-young)

From the authors:
> Since the pandemic began, dozens of studies have reported that many of the sickest COVID-19 patients have been people with obesity. In recent weeks, that link has come into sharper focus as large new population studies have cemented the association and demonstrated that even people who are merely overweight are at higher risk.

Our hypothesis is that we can find a pattern in this dataset supporting this report. To do so, we'll start by simply plotting the Obesity rate against our newly defined Mortality.

In [None]:
fig = px.scatter(kg_df[kg_df.Country != 'Yemen'], x="Mortality", y = "Obesity", size = "Active", hover_name='Country', log_x=False,
                 size_max=30, template="simple_white")

fig.add_shape(
        # Line Horizontal
            type="line",
            x0=0,
            y0=kg_df[kg_df.Country != 'Yemen']['Obesity'].mean(),
            x1=kg_df[kg_df.Country != 'Yemen']['Mortality'].max(),
            y1=kg_df[kg_df.Country != 'Yemen']['Obesity'].mean(),
            line=dict(
                color="crimson",
                width=4
            ),
    )


fig.show()

The red line represents the average obesity rate among countries. In this analysis, we have excluded "Yemen", as it was far above the "main cluster" of other countries. To clarify, here is the same graph including "Yemen":

In [None]:
fig = px.scatter(kg_df, x="Mortality", y = "Obesity", size = "Active", hover_name='Country', log_x=False,
                 size_max=30, template="simple_white")

fig.add_shape(
        # Line Horizontal
            type="line",
            x0=0,
            y0=kg_df['Obesity'].mean(),
            x1=kg_df['Mortality'].max(),
            y1=kg_df['Obesity'].mean(),
            line=dict(
                color="crimson",
                width=4
            ),
    )


fig.show()

In [None]:
fig = px.scatter(kg_df, x="Deaths", y = "Obesity", size = "Mortality",
                 hover_name='Country', log_x=False, size_max=30, template="simple_white")

fig.add_shape(
        # Line Horizontal
            type="line",
            x0=0,
            y0=kg_df['Obesity'].mean(),
            x1=kg_df['Deaths'].max(),
            y1=kg_df['Obesity'].mean(),
            line=dict(
                color="crimson",
                width=4
            ),
    )

fig.show()

In this figure, the size of the points corresponds to the country's COVID-19 mortality. Here we can see that Yemen indeed stands out as having a high mortality (the huge point just below the red line marking mean obesity).

In [None]:
kg_df[kg_df.Obesity < kg_df['Obesity'].mean()].shape

In [None]:
kg_df[kg_df.Obesity > kg_df['Obesity'].mean()].shape

Our finding: **The "high mortality" and "high death rate" countries all seem to have an above average obesity rate.**

## Distribution of food intake (in kg) - exploring high obesity cases

<mark style="background-color: lightblue">Let's inspect this further. What can we say about the food intake in countries grouped by obesity rate?</mark>

In [None]:
df_high_ob = kg_df[kg_df.Obesity > kg_df['Obesity'].mean()]
df_low_ob = kg_df[kg_df.Obesity <= kg_df['Obesity'].mean()]

In [None]:
kg_df['ObesityAboveAvg'] = (kg_df["Obesity"] > kg_df['Obesity'].mean()).astype(int)

We have created a column `ObesityAboveAvg` that has value **1** if the country has obesity rate above the mean of all other countries, and **0** otherwise.

In [None]:
fig = px.histogram(kg_df, x = "Animal Products", nbins=50, color = "ObesityAboveAvg", marginal="rug")

fig.add_shape(
        # Median value of Animal Products intake in low obesity countries
            type="line",
            x0=df_low_ob['Animal Products'].median(),
            y0=0,
            x1=df_low_ob['Animal Products'].median(),
            y1=12,
            line=dict(
                color="darkblue",
                width=4
            ),
    )

fig.add_shape(
        # Median value of Animal Products intake in high obesity countries
            type="line",
            x0=df_high_ob['Animal Products'].median(),
            y0=0,
            x1=df_high_ob['Animal Products'].median(),
            y1=12,
            line=dict(
                color="crimson",
                width=4
            ),
    )



fig.show()

In [None]:
fig = px.histogram(kg_df, x = "Vegetal Products", nbins=50, color = "ObesityAboveAvg", marginal="rug")

fig.add_shape(
        # Median value of Vegetal Products intake in low obesity countries
            type="line",
            x0=df_low_ob['Vegetal Products'].median(),
            y0=0,
            x1=df_low_ob['Vegetal Products'].median(),
            y1=12,
            line=dict(
                color="darkblue",
                width=4
            ),
    )

fig.add_shape(
        # Median value of Vegetal Products intake in high obesity countries
            type="line",
            x0=df_high_ob['Vegetal Products'].median(),
            y0=0,
            x1=df_high_ob['Vegetal Products'].median(),
            y1=12,
            line=dict(
                color="crimson",
                width=4
            ),
    )

fig.show()

This might be a naive first analysis, but countries with obesity rates above the mean of all countries have a higher consumption of `Animal Products` and lower consumption of `Vegetal Products`. The vertical lines in both figures represent the **median** value of intake for each group.
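
The group medians drawn as vertical lines can also be read off as plain numbers with a `groupby`. A minimal sketch on toy data (the values are invented; only the column names mirror the notebook's):

```python
import pandas as pd

# Toy data with invented values; column names mirror the notebook's
toy = pd.DataFrame({
    'Animal Products':  [10.0, 14.0, 30.0, 34.0, 38.0],
    'Vegetal Products': [90.0, 86.0, 70.0, 66.0, 62.0],
    'ObesityAboveAvg':  [0, 0, 1, 1, 1],
})

# One row per obesity group, one column per aggregate
group_medians = (
    toy.groupby('ObesityAboveAvg')[['Animal Products', 'Vegetal Products']].median()
)

print(group_medians)
```

Running the same call on `kg_df` should give the positions of the four vertical lines without needing the separate `df_low_ob` and `df_high_ob` frames.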

In [None]:
fig = px.bar(kg_df, x = "Country", y ="Deaths", facet_col = "ObesityAboveAvg")
fig.update_xaxes(matches=None,categoryorder="total descending")
fig.show()

In the figure above, we can see **clearly** that the "high obesity rate" countries suffer a worse impact from COVID-19.

## Distribution of food intake (in kg) by product type

Ok, now we can dig into the separate food types (`Animal Products` and `Vegetal Products`) to see their distributions.

First, let's define a list for the features in each food type:

In [None]:
kg_df.columns

In [None]:
animal_features = ['Animal fats', 'Aquatic Products, Other', 'Eggs', 'Fish, Seafood', 'Meat',
                   'Milk - Excluding Butter', 'Offals']
vegetal_features = ['Alcoholic Beverages', 'Cereals - Excluding Beer', 'Fruits - Excluding Wine', 'Miscellaneous', 'Oilcrops', 'Pulses',
                    'Spices', 'Starchy Roots', 'Stimulants', 'Sugar & Sweeteners', 'Sugar Crops', 'Treenuts',
                    'Vegetable Oils', 'Vegetables']

In [None]:
# Sanity check
kg_df[animal_features + vegetal_features].sum(axis=1).round(2)

In [None]:
df_high_ob.mean(numeric_only=True)  # skip the non-numeric `Country` column

Now that we have a list with all categories within `Animal Products`, we can check the distribution of intake in our defined "High Obesity" and "Low Obesity" countries:

In [None]:
fig = px.pie(values = df_high_ob[animal_features].mean().tolist(), names = animal_features,
             title='Mean food intake by Animal products groups - High Obesity Countries')
fig.show()

In [None]:
fig = px.pie(values = df_low_ob[animal_features].mean().tolist(), names = animal_features,
             title='Mean food intake by Animal products groups - Low Obesity Countries')
fig.show()

Ok, the distributions are somewhat similar. The order of highest to lowest intake is the same (except for `Offals` and `Animal fats`). However, two things stand out:
* The `Milk - Excluding Butter` intake in the first group is huge (almost $60\%$!)
* The difference between the `Fish, Seafood` intake in both groups (the first - around $7\%$, the second - around $20\%$).

Let's see the vegetal intake:

In [None]:
fig = px.pie(values = df_high_ob[vegetal_features].mean().tolist(), names = vegetal_features,
             title='Mean food intake by Vegetal products groups - High Obesity Countries')
fig.show()

fig = px.pie(values = df_low_ob[vegetal_features].mean().tolist(), names = vegetal_features,
             title='Mean food intake by Vegetal products groups - Low Obesity Countries')
fig.show()

Let's define, for ease of writing, the following: **HOC** - High Obesity Countries and **LOC** - Low Obesity Countries.

Here, we have some major differences:
* The intake of `Starchy Roots` in LOC is almost $20\%$, double that of HOC.
* The intake of `Alcoholic Beverages` is $5.8\%$ in LOC, whereas in HOC it reaches almost $10\%$.
* The intake of `Sugar & Sweeteners` is $4.78\%$ in LOC, whereas in HOC it reaches almost $9.5\%$.

Some others can be seen, but these caught my attention.

We can create a simple graph of `Vegetal Products` versus `Animal Products` to get a new visual on the distribution of **HOC** and **LOC**.

In [None]:
fig = px.scatter(kg_df, x = 'Animal Products', y ='Vegetal Products',
                 color='ObesityAboveAvg', hover_name = 'Country')
fig.show()

The color corresponds to the `ObesityAboveAvg` column. Here, again, we see there is a relation between high consumption of Animal Products (compared with Vegetal Products) and high obesity rates. Using the hover information you can find the country with the highest Animal Products intake (Finland) and the one with the highest Vegetal Products intake (Nigeria).

# Modeling - Classification

## 1 - KNN for ObesityAboveAvg

We can do a couple of modeling exercises with this dataset. The first thing we'll try is to check if we can predict the `ObesityAboveAvg` column using food features. Let's start with some exploration:

In [None]:
df_ob = kg_df[animal_features+vegetal_features+['ObesityAboveAvg']]
df_ob.head()

In [None]:
df_ob.describe()

In [None]:
df_ob.corr()

In [None]:
ob_features = df_ob.columns.drop('ObesityAboveAvg')
ob_target = 'ObesityAboveAvg'

print('Model features: ', ob_features)
print('Model target: ', ob_target)

## Training and test datasets

In [None]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df_ob, test_size = 0.2, shuffle = True, random_state = 28)

### Target balancing

In [None]:
print('Training set shape:', train_data.shape)

print('Class 0 samples in the training set:', sum(train_data[ob_target] == 0))
print('Class 1 samples in the training set:', sum(train_data[ob_target] == 1))

print('Class 0 samples in the test set:', sum(test_data[ob_target] == 0))
print('Class 1 samples in the test set:', sum(test_data[ob_target] == 1))

We want to fix any imbalance only in the training set. The test set should keep the original distribution.
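
One possible way to package the upsampling is a small helper that finds the smaller class by itself, so it does not matter which label turns out to be the minority. This is only a sketch: `upsample_minority` is a hypothetical name of my own, not part of pandas or scikit-learn, and the toy data is made up.

```python
import pandas as pd

def upsample_minority(df, target, random_state=42):
    """Resample every smaller class (with replacement) up to the largest class size."""
    counts = df[target].value_counts()
    n_max = counts.max()
    parts = []
    for label, n in counts.items():
        rows = df[df[target] == label]
        if n < n_max:
            rows = rows.sample(n=n_max, replace=True, random_state=random_state)
        parts.append(rows)
    # Shuffle so the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state)

# Toy training set: 6 rows of class 1 and 2 rows of class 0 (invented data)
toy_train = pd.DataFrame({'x': range(8),
                          'ObesityAboveAvg': [1, 1, 1, 1, 1, 1, 0, 0]})

balanced = upsample_minority(toy_train, 'ObesityAboveAvg')
class_counts = balanced['ObesityAboveAvg'].value_counts()
print(class_counts)
```

The helper is applied to the training split only, so the test set keeps its original class distribution.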

In [None]:
from sklearn.utils import shuffle

class_0_no = train_data[train_data[ob_target] == 0]
class_1_no = train_data[train_data[ob_target] == 1]

upsampled_class_0_no = class_0_no.sample(n=len(class_1_no), replace=True, random_state=42)

train_data = pd.concat([class_1_no, upsampled_class_0_no])
train_data = shuffle(train_data)

In [None]:
print('Training set shape:', train_data.shape)

print('Class 1 samples in the training set:', sum(train_data[ob_target] == 1))
print('Class 0 samples in the training set:', sum(train_data[ob_target] == 0))

## Data preprocessing pipeline

First, we will do preprocessing on the training set. As there are no missing values, we will build a pipeline that scales the features to similar orders of magnitude by bringing all of them between 0 and 1 using MinMaxScaler, and then applies a [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier) classifier.
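
To make the scaling step concrete, here is the min-max formula written out by hand in NumPy, which is the same column-wise transform `MinMaxScaler` applies with its default range of 0 to 1. The feature values are made up; the point is how much a large-scale feature dominates the distances that KNN relies on.

```python
import numpy as np

# Two invented features on very different scales
X = np.array([[ 2.0, 1000.0],
              [ 4.0, 3000.0],
              [10.0, 2000.0]])

# Min-max scaling, column by column: (x - min) / (max - min)
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_scaled = (X - col_min) / (col_max - col_min)

# Euclidean distance between the first two rows, before and after scaling
dist_raw = np.linalg.norm(X[0] - X[1])
dist_scaled = np.linalg.norm(X_scaled[0] - X_scaled[1])

print(X_scaled)
print(dist_raw, dist_scaled)
```

Before scaling, the distance is driven almost entirely by the second column; after scaling, both features contribute on comparable terms.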

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

## Defining the pipeline

classifier = Pipeline([
    ('scaler', MinMaxScaler()),
    ('estimator', KNeighborsClassifier(n_neighbors = 3))
])

# Visualize the pipeline
from sklearn import set_config
set_config(display='diagram')
classifier

## Training

First we train our classifier with the **.fit()** method.

In [None]:
# Get train data
X_train = train_data[ob_features]
y_train = train_data[ob_target]

# Fit the classifier
classifier.fit(X_train, y_train)

## Testing

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score

# Using the fitted model to make predictions on the training set

train_preds = classifier.predict(X_train)

print('Model performance on the train set:')
print(confusion_matrix(y_train, train_preds))
print(classification_report(y_train, train_preds))
print("Train accuracy:", accuracy_score(y_train, train_preds))

In [None]:
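from sklearn.metrics import ConfusionMatrixDisplay

# plot_confusion_matrix was removed in scikit-learn 1.2. Alias its replacement
# under the old name so the later cells keep working unchanged.
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator

disp = plot_confusion_matrix(classifier, X_train, y_train)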

disp.ax_.set_title('Confusion matrix for train set');

Now let's check performance on the test set:

In [None]:
# Get data to test classifier
X_test = test_data[ob_features]
y_test = test_data[ob_target]

test_preds = classifier.predict(X_test)

print('Model performance on the test set:')
print(confusion_matrix(y_test, test_preds))
print(classification_report(y_test, test_preds))
print("Test accuracy:", accuracy_score(y_test, test_preds))

In [None]:
disp = plot_confusion_matrix(classifier, X_test, y_test)

disp.ax_.set_title('Confusion matrix for test set');

## Tuning the value of n_neighbors

In [None]:
# Setting k values to try on our validation performance
k_values = list(range(1,11))

# Creating a validation set within the train set
sub_train_data, val_data = train_test_split(train_data, test_size = 0.2, shuffle = True, random_state = 28)

# Upsampling to fix imbalance
class_0_no = sub_train_data[sub_train_data[ob_target] == 0]
class_1_no = sub_train_data[sub_train_data[ob_target] == 1]

upsampled_class_0_no = class_0_no.sample(n=len(class_1_no), replace=True, random_state=42)

sub_train_data = pd.concat([class_1_no, upsampled_class_0_no])
sub_train_data = shuffle(sub_train_data, random_state = 28)

# Creating training and validation sets
X_sub_train = sub_train_data[ob_features]
y_sub_train = sub_train_data[ob_target]

X_val = val_data[ob_features]
y_val = val_data[ob_target]

In [None]:
# Searching for best performing K value
for k in k_values:
    classifier = Pipeline([
    ('scaler', MinMaxScaler()),
    ('estimator', KNeighborsClassifier(n_neighbors = k))
    ])
    
    classifier.fit(X_sub_train, y_sub_train)
    val_preds = classifier.predict(X_val)
    print(f"K = {k} -- Validation accuracy: {accuracy_score(y_val, val_preds)}")

**THIS IS STOCHASTIC AND WILL CHANGE ON EACH RUN:** It looks like K = 2 has the best performance. Let's use K = 2 for our classifier, train it on the train set and test it on our test set.

In [None]:
# Build the classifier
classifier = Pipeline([
    ('scaler', MinMaxScaler()),
    ('estimator', KNeighborsClassifier(n_neighbors = 2))
])

# Fit the classifier
classifier.fit(X_train, y_train)

# Making predictions on test set
test_preds = classifier.predict(X_test)

print('Model performance on the test set:')
print(confusion_matrix(y_test, test_preds))
print(classification_report(y_test, test_preds))
print("Test accuracy:", accuracy_score(y_test, test_preds))

disp = plot_confusion_matrix(classifier, X_test, y_test)

disp.ax_.set_title('Confusion matrix for test set - k = 2');

We gained a slight improvement in accuracy with this quick tuning.

### Using GridSearchCV

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

knn = KNeighborsClassifier()

# Creating a dictionary of all values to test
param_grid = {'n_neighbors': np.arange(2,10)}

# Use grid search to test all values
knn_gscv = GridSearchCV(knn, param_grid, cv = 5)

# Fit the model to data
knn_gscv.fit(X_train, y_train)

# Check for best parameter
knn_gscv.best_params_

In [None]:
# Accuracy when at best parameters
knn_gscv.best_score_
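
Note that the search above tunes a bare `KNeighborsClassifier` on unscaled features, while the earlier classifier scaled its inputs first. A possible variant is to search over the whole pipeline, so that the scaler is refit inside every cross-validation fold. The sketch below uses synthetic data from `make_classification` as a stand-in, since it is meant to run on its own; in the notebook, `X_demo` and `y_demo` would be `X_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the diet dataset)
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=28)

pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('estimator', KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {'estimator__n_neighbors': list(range(2, 10))}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

Because each candidate is scored on five folds instead of one validation split, the chosen K should fluctuate less from run to run.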

# Modeling - Regression

## 2 - Predicting mortality

We'll try to build a model (regressor) to predict the mortality rate based on food intake information and obesity. Let's start by choosing the right features.

In [None]:
kg_df.columns

In [None]:
# df_mort = kg_df[animal_features+vegetal_features+['Obesity','Mortality']]
df_mort = kg_df[kg_df.Country != 'Yemen'][animal_features+vegetal_features+['Obesity','Mortality']]
# df_mort = kg_df[['Animal Products','Vegetal Products','Obesity','Mortality']]

df_mort = shuffle(df_mort)

mort_features = df_mort.columns.drop('Mortality')
mort_target = 'Mortality'

print('Model features: ', mort_features)
print('Model target: ', mort_target)

X = df_mort[mort_features]
y = df_mort[mort_target]

### Train-test split

In [None]:
train_data, test_data = train_test_split(df_mort, test_size = 0.2, shuffle = True, random_state = 28)

Let's take a look at the data before building our models.

We'll start with some visuals and build a scatter matrix for our dataframe. Because we have a large number of features, we will only plot those among the top intake in HOC (check the pie chart for HOC intake).

In [None]:
df = df_mort[['Meat', 'Milk - Excluding Butter', 'Fish, Seafood',
                         'Cereals - Excluding Beer', 'Obesity','Mortality']]
g = sns.PairGrid(df)
g.map(plt.scatter)

So, we are interested in the last "row" of this matrix. Nothing seems particularly linear, but we'll see what we can tell from building linear models.

**NOTE**: as the data seems very scattered, I am expecting poor $R^2$ scores (for an interesting explanation of why this is so, I recommend reading these two well-written articles: [Interpreting R-squared](https://statisticsbyjim.com/regression/interpret-r-squared-regression/) and [Interpreting low R-squared in regression models](https://statisticsbyjim.com/regression/low-r-squared-regression/)).
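
As a reminder of what the score measures, $R^2$ is $1 - SS_{res}/SS_{tot}$: it compares the model's squared errors with those of a baseline that always predicts the mean. A small hand-rolled sketch (the `r_squared` helper and the toy targets are my own, written only for illustration):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])                   # invented targets

r2_perfect = r_squared(y, y)                         # exact predictions
r2_mean = r_squared(y, np.full_like(y, y.mean()))    # always predict the mean
r2_bad = r_squared(y, y[::-1])                       # predictions in reverse order

print(r2_perfect, r2_mean, r2_bad)
```

So a score near zero means the model is doing no better than predicting the average mortality, and a negative score means it is doing worse than that.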

To keep track of what is important later on, let's check what features correlate the most with our target.

In [None]:
df_mort.corr().tail()

With the correlation matrix we can sort the values of the `Mortality` row to get the info we need.

In [None]:
df_mort.corr().loc['Mortality'].sort_values()

So, the highest correlation with `Mortality` is `Milk - Excluding Butter` (note that this is the highest intake from both HOC and LOC).
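
Since `sort_values()` orders by signed value, strong negative correlations end up at the opposite end of the list from strong positive ones. One way to rank features purely by strength is to sort on the absolute value, as in this sketch on a toy frame (all numbers invented):

```python
import pandas as pd

# Toy frame: 'A' rises with the target, 'B' falls with it, 'C' is unrelated
toy = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0, 5.0],
    'B': [5.0, 4.0, 3.0, 2.0, 1.0],
    'C': [2.0, 1.0, 2.0, 1.0, 2.0],
    'Mortality': [0.010, 0.020, 0.030, 0.040, 0.050],
})

# Correlation of every feature with the target, without the target itself
corr_with_target = toy.corr()['Mortality'].drop('Mortality')

# Strongest relationships first, regardless of sign
ranked = corr_with_target.sort_values(key=abs, ascending=False)

print(ranked)
```

Here `A` and `B` are equally informative even though their signs differ, which a plain ascending sort would hide.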

Let's actually build some models now.

### [Ridge Regression](https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression)

First, we'll separate the features and the target in our training data.

In [None]:
# Get train data
X_train = train_data[mort_features]
y_train = train_data[mort_target]

Now, to build our regressor with a standardization step in our pipeline (always scale your data for Ridge regression!).
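
Why scaling matters for Ridge: the penalty acts on the raw coefficient sizes, so merely changing the units of a feature changes how strongly it is shrunk. The sketch below illustrates this on synthetic data (not the diet dataset); the `fit_predict` helper and `alpha=10` are arbitrary choices of mine.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(28)
X = rng.normal(size=(50, 2))                            # synthetic features
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=50)

# Same data, but the second feature is expressed in a unit 1000 times smaller
X_rescaled = X.copy()
X_rescaled[:, 1] *= 1000

def fit_predict(model, features):
    """Fit on the given features and return in-sample predictions."""
    return model.fit(features, y).predict(features)

# Plain Ridge: predictions change when only the units change
raw_gap = np.abs(
    fit_predict(Ridge(alpha=10), X) - fit_predict(Ridge(alpha=10), X_rescaled)
).max()

# Standardize first: both versions become the same problem
scaled_gap = np.abs(
    fit_predict(make_pipeline(StandardScaler(), Ridge(alpha=10)), X)
    - fit_predict(make_pipeline(StandardScaler(), Ridge(alpha=10)), X_rescaled)
).max()

print(raw_gap, scaled_gap)
```

With the scaler in the pipeline, the gap between the two fits should be at the level of floating-point noise, whereas without it the predictions visibly disagree.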

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

## Defining the pipeline

regressor = Pipeline([
    ('scaler', StandardScaler()),
    ('estimator', Ridge(random_state=28))
])

# Visualize the pipeline
from sklearn import set_config
set_config(display='diagram')
regressor

#### Training the model

In [None]:
# Training
regressor.fit(X_train, y_train)

In [None]:
# Scoring the training set

train_preds = regressor.predict(X_train)
regressor.score(X_train, y_train)

#### Cross validate our score

In [None]:
# Cross validate
cv_score = cross_val_score(regressor, X_train, y_train, cv = 10)
print(cv_score)
print(cv_score.mean())

This looks very bad... Let's see some other metrics for our model.

We'll use [mean squared error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error), [mean absolute error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error) and [R2](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score) to evaluate the model.

First, let's build a simple helper function to return a dictionary with all of our scores for the chosen model.

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Create function to evaluate model on a few different scores
def show_scores(model, X_train, X_test, y_train, y_test):    
    train_preds = model.predict(X_train)
    test_preds = model.predict(X_test)
    scores = {'Training MAE': mean_absolute_error(y_train, train_preds),
              'Test MAE': mean_absolute_error(y_test, test_preds),
              'Training MSE': mean_squared_error(y_train, train_preds),
              'Test MSE': mean_squared_error(y_test, test_preds),
              'Training R^2': r2_score(y_train, train_preds),
              'Test R^2': r2_score(y_test, test_preds)}
    return scores

Now, let's test the model in the test set.

In [None]:
# Get data to test model
X_test = test_data[mort_features]
y_test = test_data[mort_target]

show_scores(regressor, X_train, X_test , y_train, y_test)

#### Visualizing

Let's make a simple visualization of our model's predictions using the first feature entry (`Animal fats`).

In [None]:
test_plot = X_test.copy()
test_plot['Mortality'] = y_test
test_plot['Mortality_pred'] = regressor.predict(X_test)

test_plot.head()

In [None]:
# fig = px.scatter(test_plot, x = 'Animal fats', y = ['Mortality','Mortality_pred'],
#                  trendline = "ols")


# fig.show()

In [None]:
fig, ax = plt.subplots(figsize=[10,8])

sns.regplot(x = 'Animal fats', y = 'Mortality', data = test_plot, ax = ax, label='Mortality')
sns.regplot(x = 'Animal fats', y = 'Mortality_pred', data = test_plot, ax = ax, label='Mortality_pred')

plt.legend();

Above we see a plot of `Mortality` as a function of `Animal fats`. Our model fails to make a good prediction but it somehow captures the **tendency** of our target. Let's try to improve by comparing other models.

### Training and testing multiple models

Now that we have a general flow of testing our model, let's build a function to test different models.

We will use, besides our Ridge regressor, three other models:
* [SVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR)
* [Random Forest Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
*  [XGBoost Regressor](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn)

As tree-based models don't require scaling as we have done for the Ridge regressor, our function will have to account for scaling as a parameter. The main goal is to print out various metrics for each model.
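
The reason trees can skip the scaler is that their splits depend only on the ordering of values within each feature, not on their magnitude. A quick sketch on synthetic data (not the diet dataset) that fits the same tree before and after rescaling the features by arbitrary positive constants:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(28)
X = rng.normal(size=(60, 3))                       # synthetic features
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]

# Rescale every feature by a different positive constant
X_rescaled = X * np.array([1000.0, 10.0, 7.0])

tree_a = DecisionTreeRegressor(max_depth=3, random_state=28).fit(X, y)
tree_b = DecisionTreeRegressor(max_depth=3, random_state=28).fit(X_rescaled, y)

# Both trees should partition the samples the same way,
# so their predictions on the matching inputs agree
same_predictions = np.allclose(tree_a.predict(X), tree_b.predict(X_rescaled))

print(same_predictions)
```

The same argument carries over to ensembles of trees such as random forests and gradient boosting, which is why the helper exposes scaling as an option.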

In [None]:
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost.sklearn import XGBRegressor

# First, we create a dict with our desired models
models = {'Ridge':Ridge(random_state=28),
          'SVR':SVR(),
          'RandomForest':RandomForestRegressor(),
          'XGBoost':XGBRegressor(n_estimators = 1000, learning_rate = 0.05)}

# Now to build the function that tests each model
def model_build(model, X_train, y_train, X_test, y_test, scale=True):
    
    if scale:
        regressor = Pipeline([
            ('scaler', StandardScaler()),
            ('estimator', model)
        ])
    
    else:
        regressor = Pipeline([
            ('estimator', model)
        ])

    # Training
    regressor.fit(X_train, y_train)

    # Scoring the training set

    train_preds = regressor.predict(X_train)
    print(f"R2 on training set: {regressor.score(X_train, y_train)}")

    # Cross validate
    cv_score = cross_val_score(regressor, X_train, y_train, cv = 10)

    print(f"Cross validate R2 score: {cv_score.mean()}")

    # Scoring the test set
    for k, v in show_scores(regressor, X_train, X_test , y_train, y_test).items():
        print("     ", k, v)

Now that we have our helper function, we loop through our `models` dictionary and score each one of them.

In [None]:
for name, model in models.items():
    print(f"==== Scoring {name} model====")
    
    if name == 'RandomForest' or name == 'XGBoost':
        model_build(model, X_train, y_train, X_test, y_test, scale=False)
    else:
        model_build(model, X_train, y_train, X_test, y_test)
    print()
    print(40*"=")
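
Printed metrics are hard to compare side by side. As a minimal sketch (not part of the original workflow), the scores could instead be collected into a DataFrame. The helper name `collect_scores`, the synthetic data from `make_regression`, and the chosen metrics are assumptions for illustration; XGBoost is left out so the block only depends on scikit-learn and pandas.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def collect_scores(models, X_train, y_train, X_test, y_test, no_scale=()):
    """Hypothetical helper: fit each model and return one row of metrics per model."""
    rows = []
    for name, model in models.items():
        # Models listed in `no_scale` skip the StandardScaler step
        steps = [] if name in no_scale else [('scaler', StandardScaler())]
        steps.append(('estimator', model))
        regressor = Pipeline(steps)

        # Cross-validated R2 on the training data only
        cv_r2 = cross_val_score(regressor, X_train, y_train, cv=5).mean()

        # Fit on the full training set, then score the held-out test set
        regressor.fit(X_train, y_train)
        preds = regressor.predict(X_test)
        rows.append({'model': name,
                     'cv_r2': cv_r2,
                     'test_r2': r2_score(y_test, preds),
                     'test_mae': mean_absolute_error(y_test, preds)})
    return pd.DataFrame(rows).set_index('model')


# Synthetic stand-in data (assumption); the notebook would pass its own split
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=28)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=28)

demo_models = {'Ridge': Ridge(),
               'SVR': SVR(),
               'RandomForest': RandomForestRegressor(n_estimators=50, random_state=28)}

scores = collect_scores(demo_models, Xtr, ytr, Xte, yte, no_scale=('RandomForest',))
print(scores.sort_values('test_r2', ascending=False))
```

Returning a table instead of printing keeps the comparison sortable and makes it easy to plot or export later.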
        

### Hyperparameter tuning for XGBoost model

We can pick our best-performing model and try some hyperparameter tuning with a simple [GridSearch](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

Let's start by defining our parameters:

In [None]:
xgb = XGBRegressor()

parameters = {'nthread':[4], # with hyperthreading, xgboost may become slower
              'objective':['reg:squarederror'],
              'learning_rate': [.03, 0.05, .07], #so called `eta` value
              'max_depth': [5, 6, 7],
              'min_child_weight': [4],
              'subsample': [0.7],
              'colsample_bytree': [0.7],
              'n_estimators': [500, 1000]}
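
To get a feel for how expensive a search over this grid is, the number of parameter combinations can be counted with scikit-learn's `ParameterGrid`. This is only a sketch: the grid is repeated under the name `grid` so the block is self-contained, and the fold count of 5 is taken from the `cv = 5` argument used in the search cell.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined in the cell above, repeated so the block runs on its own
grid = {'nthread': [4],
        'objective': ['reg:squarederror'],
        'learning_rate': [.03, 0.05, .07],
        'max_depth': [5, 6, 7],
        'min_child_weight': [4],
        'subsample': [0.7],
        'colsample_bytree': [0.7],
        'n_estimators': [500, 1000]}

# Only three entries have more than one value: 3 learning rates x 3 depths x 2 tree counts
n_candidates = len(ParameterGrid(grid))
n_folds = 5  # assumption: matches cv = 5 in the grid search
n_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

With 18 candidates and 5 folds the search trains 90 models, plus one final refit on the whole training set when `refit` is left at its default.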

Now we can do the search (note that it can take a long time).

In [None]:
# from sklearn.model_selection import GridSearchCV

# xgb_grid = GridSearchCV(xgb, parameters, cv = 5, n_jobs = 4, verbose = True)

# xgb_grid.fit(X_train, y_train)

# print(xgb_grid.best_score_)
# print(xgb_grid.best_params_)

## RAN AND GOT THE PARAMETERS USED BELOW

In [None]:
xgb_best = XGBRegressor(colsample_bytree = 0.7,
                        learning_rate = 0.05,
                        max_depth = 6,
                        min_child_weight = 4,
                        n_estimators = 500,
                        nthread = 4,
                        objective = 'reg:squarederror',
                        subsample = 0.7)

In [None]:
model_build(xgb_best, X_train, y_train, X_test, y_test, scale=False)
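
Copying the winning parameters by hand works, but a fitted `GridSearchCV` also exposes the tuned model directly through `best_estimator_`, which is already refit on the full training data when `refit` is left at its default. The sketch below illustrates this under stated assumptions: a `Ridge` regressor and synthetic data stand in for the XGBoost model and the notebook's data, so the search runs in a fraction of a second.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=28)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=28)

# A tiny grid over the regularization strength of a stand-in model
search = GridSearchCV(Ridge(), {'alpha': [0.1, 1.0, 10.0]}, cv=5)
search.fit(Xtr, ytr)

# The best model is available without retyping its parameters
best_model = search.best_estimator_
print(search.best_params_)
print(f"Test R2 of the tuned model: {best_model.score(Xte, yte):.3f}")
```

Saving `best_params_` to disk is another option when the search is too slow to rerun in every session.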

## 2.1 - A simpler model for Mortality

Let's try to reduce the dimensionality by using only three features: **Animal Products**, **Vegetal Products** and **Obesity**.

In [None]:
df_mort2 = kg_df[kg_df.Country != 'Yemen'][['Animal Products','Vegetal Products','Obesity','Mortality']]


df_mort2 = shuffle(df_mort2)

mort2_features = df_mort2.columns.drop('Mortality')
mort2_target = 'Mortality'

print('Model features: ', mort2_features)
print('Model target: ', mort2_target)

X = df_mort2[mort2_features]
y = df_mort2[mort2_target]

In [None]:
df_mort2.head()

Let's draw a simple boxplot of our distributions. To make the `Mortality` distribution easier to see, we'll multiply that column by $1000$ just for the boxplot.

In [None]:
dummie = df_mort2.copy()
dummie['Mortality'] = dummie['Mortality']*1000


plt.figure(figsize=(10,10))
sns.boxplot(data = dummie, palette = 'rainbow');

Now to split our data.

In [None]:
train_data, test_data = train_test_split(df_mort2, test_size = 0.2, shuffle = True, random_state = 28)

# Get train data
X_train = train_data[mort2_features]
y_train = train_data[mort2_target]

# Get data to test model
X_test = test_data[mort2_features]
y_test = test_data[mort2_target]

In [None]:
# First, we create a dict with our desired models
models = {'Ridge':Ridge(random_state=28),
          'SVR':SVR(),
          'RandomForest':RandomForestRegressor(),
          'XGBoost':XGBRegressor(n_estimators = 1000, learning_rate = 0.05)}

In [None]:
for name, model in models.items():
    print(f"==== Scoring {name} model====")
    
    if name == 'RandomForest' or name == 'XGBoost':
        model_build(model, X_train, y_train, X_test, y_test, scale=False)
    else:
        model_build(model, X_train, y_train, X_test, y_test)
    print()
    print(40*"=")

### Building best performing model

In [None]:
model = RandomForestRegressor()

#### Training

In [None]:
model.fit(X_train, y_train)

#### Making predictions and visualizing

In [None]:
test_preds = model.predict(X_test)

test_plot = X_test.copy()
test_plot['Mortality'] = y_test
test_plot['Mortality_pred'] = test_preds

test_plot.head()

In [None]:
def plotTest(col, target, data):
    fig, ax = plt.subplots(figsize=[10,8])

    sns.regplot(x = col, y = target, data = data, ax = ax, label=target)
    sns.regplot(x = col, y = target+'_pred', data = data, ax = ax, label=target+'_pred')

    plt.legend();

To visualize the resulting model, let's plot the dependency of the target (`Mortality`) on each feature separately. Each plot shows both the **real** data and the predicted data.

In [None]:
plotTest('Animal Products', 'Mortality', test_plot)

In [None]:
plotTest('Vegetal Products', 'Mortality', test_plot)

In [None]:
plotTest('Obesity', 'Mortality', test_plot)

We have very few data points and the data is very scattered, which results in a poor $R^2$ value. Still, this simplified model seems to compare well with the more complex one.
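
To make the $R^2$ values easier to read, recall that $R^2 = 1 - SS_{res}/SS_{tot}$: it compares the model's squared errors with those of a baseline that always predicts the mean. The sketch below computes it by hand on a toy example; the function name `r2_by_hand` and the numbers are illustrative assumptions, not values from our data.

```python
import numpy as np


def r2_by_hand(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, computed directly from its definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared errors of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared errors of the mean baseline
    return 1.0 - ss_res / ss_tot


y_true = [3.0, -0.5, 2.0, 7.0]

good = r2_by_hand(y_true, [2.5, 0.0, 2.0, 8.0])   # close predictions, roughly 0.949
baseline = r2_by_hand(y_true, [2.875] * 4)        # always predicting the mean gives 0
bad = r2_by_hand(y_true, [7.0, 2.0, -0.5, 3.0])   # worse than the mean, so negative

print(round(good, 3), round(baseline, 3), round(bad, 3))
```

A score near zero therefore means the model is barely better than predicting the average for every country, and a negative score means it is worse.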

## 3 - Predicting obesity

As obesity has a higher correlation with all "food features", let's try to build models to predict the actual obesity rate. We expect these models to have better metrics than the ones built to predict mortality.

In [None]:
df_obes = kg_df[animal_features+vegetal_features+['Obesity']]

df_obes = shuffle(df_obes)

obes_features = df_obes.columns.drop('Obesity')
obes_target = 'Obesity'

print('Model features: ', obes_features)
print('Model target: ', obes_target)

X = df_obes[obes_features]
y = df_obes[obes_target]

Let's check the correlation of features with target:

In [None]:
df_obes.corr().loc['Obesity'].sort_values()
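
The features with the strongest negative and positive correlation can also be picked programmatically instead of being read off the sorted output. This is a small sketch on a toy DataFrame; the column names and values are made up for illustration and are not taken from `kg_df`.

```python
import pandas as pd

# Toy data (assumption): Meat rises with Obesity, Cereals falls, Fruits is unrelated
toy = pd.DataFrame({
    'Meat':    [1, 2, 3, 4, 5],
    'Cereals': [5, 4, 3, 2, 1],
    'Fruits':  [2, 1, 3, 1, 2],
    'Obesity': [10, 20, 30, 40, 50],
})

# Correlation of every feature with the target, dropping the target itself
corr_with_target = toy.corr()['Obesity'].drop('Obesity').sort_values()

most_negative = corr_with_target.idxmin()
most_positive = corr_with_target.idxmax()

print(corr_with_target)
print('Most negative:', most_negative)
print('Most positive:', most_positive)
```

Dropping the target before sorting avoids its trivial correlation of 1 with itself showing up as the top entry.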

**Train-test splitting**

In [None]:
train_data, test_data = train_test_split(df_obes, test_size = 0.2, shuffle = True, random_state = 28)

# Get train data
X_train = train_data[obes_features]
y_train = train_data[obes_target]

# Get data to test model
X_test = test_data[obes_features]
y_test = test_data[obes_target]

We can use the same workflow as above to test different models. To keep things clear, we'll define the dictionary of models once more:

In [None]:
# First, we create a dict with our desired models
models = {'Ridge':Ridge(random_state=28),
          'SVR':SVR(),
          'RandomForest':RandomForestRegressor(),
          'XGBoost':XGBRegressor(n_estimators = 1000, learning_rate = 0.05)}

We can loop through these and use our `model_build` function once more:

In [None]:
for name, model in models.items():
    print(f"==== Scoring {name} model====")
    
    if name == 'RandomForest' or name == 'XGBoost':
        model_build(model, X_train, y_train, X_test, y_test, scale=False)
    else:
        model_build(model, X_train, y_train, X_test, y_test)
    print()
    print(40*"=")

### Building best performing model

In [None]:
model = XGBRegressor(n_estimators = 1000, learning_rate = 0.05)

#### Training

In [None]:
model.fit(X_train, y_train)

#### Making predictions and visualizing

In [None]:
test_preds = model.predict(X_test)

test_plot = X_test.copy()
test_plot['Obesity'] = y_test
test_plot['Obesity_pred'] = test_preds

test_plot.head()

Let's make two plots:

* First, the `Obesity` dependency on `Cereals - Excluding Beer`, as it has the most negative correlation with the target.
* Second, the `Obesity` dependency on `Meat`, as it has the most positive correlation with the target.

For both graphs we'll plot the **real** values of `Obesity` and the ones predicted by our model.

In [None]:
plotTest('Cereals - Excluding Beer', 'Obesity', test_plot)

In [None]:
plotTest('Meat', 'Obesity', test_plot)

It looks like our model is doing a good job of predicting these features' influence on the obesity rate.

## Clustering countries by obesity and mortality due to COVID-19

Now that we have seen that there is a relation between `Obesity` rates and `Mortality`, we can try to cluster countries based on these two features.

The first thing we have to do is to filter all other features:

In [None]:
X = kg_df[kg_df.Country != 'Yemen'][['Obesity', 'Mortality']]

X.head()

Centroid-based algorithms such as K-means rely on distances between points, so the features should be scaled before modelling. And, as this is a case of unsupervised learning, we don't need to split the data.

Let's first instantiate our scaler.

In [None]:
scaler = StandardScaler()

# Fit the scaler
scaler.fit(X)

In [None]:
# Transform our data
X_scaled = scaler.transform(X)

# Sanity checks
print(X_scaled.mean(axis = 0))

print(X_scaled.std(axis=0))
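
As a quick illustration of what the scaler does, the sketch below checks that `StandardScaler` matches the manual formula $(x - \mu)/\sigma$ applied per column, with $\sigma$ being the population standard deviation (`ddof=0`, NumPy's default). The synthetic two-column array is an assumption standing in for `Obesity` and `Mortality`, which live on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(28)

# Synthetic stand-in (assumption): one column around 20, one around 0.02
X_demo = rng.normal(loc=[20.0, 0.02], scale=[8.0, 0.01], size=(50, 2))

scaled = StandardScaler().fit_transform(X_demo)

# Manual standardization: subtract the column mean, divide by the column std (ddof=0)
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns contribute comparably to the Euclidean distances that K-means minimizes.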

### K-means modeling

In [None]:
from sklearn.cluster import KMeans

# Instantiate the model
kmeans = KMeans(n_clusters = 3)

# Fit the model
kmeans.fit(X_scaled)

# Make predictions
preds = kmeans.predict(X_scaled)

print(preds)

In [None]:
# Amount of countries in each cluster

unique_countries, counts_countries = np.unique(preds, return_counts=True)
print(unique_countries)
print(counts_countries)

<mark style="background-color: lightblue">We excluded Yemen as it is an outlier in the `Mortality` distribution.</mark>

#### Visualizing

In [None]:
df_vis = kg_df[kg_df.Country != 'Yemen'].copy()
df_vis['cluster'] = [str(i) for i in preds]

df_vis.head()

In [None]:
fig = px.scatter(df_vis, x = 'Mortality', y = 'Obesity', color = 'cluster', hover_name = 'Country')
fig.show()

We can find an optimal value for $k$ (number of clusters) using the "elbow" method.

In [None]:
# Calculate inertia for a range of cluster counts
inertia = []

for i in np.arange(1,11):
    km = KMeans(n_clusters = i)
    km.fit(X_scaled)
    inertia.append(km.inertia_)
    
# Plotting
plt.plot(np.arange(1,11), inertia, marker = 'o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.grid()
plt.show();

It appears that $k = 3$ is already a good value for our modelling.
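
Reading an elbow off a plot is somewhat subjective. A complementary check, which was not used in this notebook, is the silhouette score: it is higher when clusters are compact and well separated. The sketch below runs it on synthetic blobs with three known centers (an assumption for illustration), so the expected answer is known in advance.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (assumption)
X_demo, _ = make_blobs(n_samples=150,
                       centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0,
                       random_state=28)

# Silhouette needs at least two clusters, so the range starts at k = 2
sil = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=28).fit_predict(X_demo)
    sil[k] = silhouette_score(X_demo, labels)

best_k = max(sil, key=sil.get)

for k, score in sil.items():
    print(k, round(score, 3))
print('Best k by silhouette:', best_k)
```

On real data the two criteria do not always agree, so it is worth looking at both before settling on a value of $k$.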

We can wrap this whole clustering process in a function:

In [None]:
def cluster_preds(df, feat1, feat2, k):
    X = df[[feat1, feat2]]

    # Scaling
    scaler = StandardScaler()

    # Fit the scaler
    scaler.fit(X)

    # Transform our data
    X_scaled = scaler.transform(X)

    # Instantiate the model
    kmeans = KMeans(n_clusters = k)

    # Fit the model
    kmeans.fit(X_scaled)

    # Make predictions
    preds = kmeans.predict(X_scaled)

    # Visualizing
    df_vis = df.copy()
    df_vis['cluster'] = [str(i) for i in preds]

    fig = px.scatter(df_vis, x = feat1, y = feat2, color = 'cluster', hover_name = 'Country')
    fig.show()

Now we can quickly cluster countries based on `Animal Products` intake and `Obesity` rate (recall that we used these features in our "simpler model" to predict mortality).

In [None]:
cluster_preds(kg_df, 'Animal Products', 'Obesity', 3)

As a further look into the impact of COVID-19, we can cluster countries based on `Deaths` and `Confirmed` cases (which relate to `Mortality`).

In [None]:
cluster_preds(kg_df, 'Confirmed', 'Deaths', 3)
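
To interpret clusters like these, it can help to look at the cluster centers in the original units rather than in scaled space. The sketch below does this with `inverse_transform` on synthetic data; the variable names and the two made-up features are assumptions, and the same idea could be added to `cluster_preds` by returning the fitted scaler and model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(28)

# Synthetic stand-in for two features on very different scales (assumption)
X_demo = np.column_stack([rng.uniform(5, 40, 90),
                          rng.uniform(0.0, 0.1, 90)])

# Scale, then cluster in scaled space, as in the notebook
scaler_demo = StandardScaler().fit(X_demo)
km_demo = KMeans(n_clusters=3, n_init=10, random_state=28)
km_demo.fit(scaler_demo.transform(X_demo))

# Map the centers back to the original units for interpretation
centers_original = scaler_demo.inverse_transform(km_demo.cluster_centers_)
print(centers_original)
```

Each row is then a "typical" member of a cluster, expressed in the same units as the input columns.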

# Conclusions

There are MANY factors that matter in the fight against the current COVID-19 pandemic. Maintaining good eating habits helps keep our immune system healthy and ready to combat a possible disease.

In this notebook I tried to explore possible patterns in data on COVID-19 and food intake in different countries. One major goal was to find the influence of obesity rates on the effect of the disease in each country. After splitting countries into HOC and LOC groups, it was possible to create a classifier that predicts, with good accuracy, which group a country belongs to based on its food intake data.

With this in place, we created regression models to try to predict the `Mortality` of COVID-19 in countries based on their eating habits and obesity rate. Two approaches were taken: one with all food-related features as inputs and a simpler one. Both have issues (mainly scatter and non-linearity), but they let us demonstrate the use of different models and metrics.

Next, we built a model to predict `Obesity` rates based on eating habits in each country. This model was far more successful, and the overall tendency of the data was captured and predicted.

Finally, we built a quick helper function to do some clustering based on pairs of features.

For a more visual data exploration, I have built simple dashboards to do some EDA with Dash:
- [App1](https://healthycovid19app1.herokuapp.com/)
- [App2](https://healthycovid19app2.herokuapp.com/)
- [App3](https://healthycovid19app3.herokuapp.com/)

The app was split into 3 to avoid long-process errors in Heroku.

Please comment if you liked the notebook and critique it if you found any inaccuracies. I am still very new to the field and this is my second ever notebook, so suggestions are very welcome!

Stay safe everyone!