# Machine Learning - Final project - Kerbidi Kim Lou / Malka Laura

## Context


For this final project, you are required to choose and define a business problem to which you will apply machine learning. 

## Data
You shall choose a dataset from the datasets available on <a href="https://www.kaggle.com/datasets">kaggle.com</a>.

You are free to choose any dataset you want, however, your choice should be motivated by something that interests you, for example:
- your speciality or your future professional project
- recent events in the world such as the USA election or the covid
- etc.

The following are a selection of some datasets from Kaggle, you can choose one of them if you want. 

<a href="https://www.kaggle.com/c/home-credit-default-risk/data">Home Credit Default Risk</a>  

<a href="https://www.kaggle.com/mariaren/covid19-healthy-diet-dataset">COVID-19 Healthy Diet Dataset</a>

<a href="https://www.kaggle.com/volodymyrgavrysh/bank-marketing-campaigns-dataset">Bank marketing campaigns dataset</a>




## What should you do ?

Make a notebook telling interesting things about the data you have fetched, tell a story (or many) using everything you learned. Build predictive models and compare them.  
You have to submit at least a notebook and any resources you used (like images or any other files).


Your final submission should include the following: 

- Problem definition
- Data Exploration
- Data Processing (Cleaning, etc.)
- Features Selection
- Features Engineering
- Model Selection
- Learning Curves analysis
- Dimensionality Reduction
- Results Visualization
- Results Interpretation

## Assessment 
Here are the criteria we will use to assess your work:

### Is it meaningful?
As a machine learning expert you have to produce something meaningful enough, just plotting random data is not going to work. Like a story your analysis should have some kind of logical progression.

### How well did you use the technical knowledge you’ve been taught?
Obviously, the way you use everything you learned during the lectures is going to be assessed.

### Cleanliness, aesthetics and clearness of your notebook
Is your analysis full of unused code? Is it difficult to read? Have you tried to make it easy and enjoyable to read?

### Innovation
Creativity, surprising things or any good initiatives you take are potential bonus points.


### Careful:
	This work is individual, plagiarism is going to be measured by both machines and humans. Too many similarities between your work and any online or python buddy work will result in grade penalties.


Good Luck!




--------------------------

Let's **load the packages we would like to use.** 

In [None]:
pip install plotly_express==0.4.0 #this one will be useful to hover on one of our scatter plots and obtain the exact information on the points

In [None]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import numpy as em
import pandas as pd
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
import plotly.express as px

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

import seaborn as sns
sns.set_palette("Set2")

--------------------------

# Table of Contents

* [Dataset chosen on Kaggle](#chapter1)
* [Problem definition](#chapter2)
* [Data Exploration and Processing](#chapter3)
    * [Value Types](#section_3_1)
    * [Missing Values](#section_3_2)
    * [Analyzing extremes](#section_3_3)
        * [Alcoholic consumption](#section_3_3_1)
        * [Undernourished](#section_3_3_2)
        * [Obesity](#section_3_3_3)
    * [Early visualizations](#section_3_4)
        * [World average diet](#section_3_4_1)
        * [World covid cases](#section_3_4_2)
    * [Insights](#section_3_5)
        * [Diet vs covid](#section_3_5_1)
        * [Health state vs covid](#section_3_5_2)
* [Features Selection](#chapter4)
    * [Most deaths: Belgium](#Section_4_1)
    * [Most confirmed cases: Montenegro](#Section_4_2)
    * [Least deaths: Cambodia](#Section_4_3)
    * [Least confirmed cases: Vanuatu](#Section_4_4)
* [Features Engineering](#chapter5)
    * [Handmade clustering](#Section_5_1)
        * [Mortality rate scoring](#Section_5_1_1)
        * [Confirmed cases scoring](#Section_5_1_2)
    * [Unsupervised clustering](#Section_5_2)
        * [Preparation](#Section_5_2_1)
        * [K-means](#Section_5_2_2)
        * [Elbow method](#Section_5_2_3)
        * [Silhouette plots](#Section_5_2_4)
* [Model Selection](#chapter6) 
    * [Creating train and test sets](#Section_6_1)
    * [Performing features scaling](#Section_6_2)
    * [Linear regression](#Section_6_3)
    * [Random forest](#Section_6_4)
* [Learning Curves analysis](#chapter7) 
* [Dimensionality Reduction](#chapter8)
    * [Missing values filter](#Section_8_1)
    * [High correlation filter](#Section_8_2)
    * [Random forest](#Section_8_3)
* [Results Visualization and Interpretation](#chapter9) 
    * [Obesity vs undernutrition: a happy medium?](#Section_9_1)
        * [Obesity average diet](#Section_9_1_1)
        * [Undernutrition average diet](#Section_9_1_2)    
    * [Extreme covid results and diet: back to our 4 extremes](#Section_9_2)
        * [Most deaths: Belgium](#section_9_2_1)
        * [Most confirmed cases: Montenegro](#Section_9_2_2)
        * [Least confirmed cases: Vanuatu](#Section_9_2_3)   
        * [Least deaths: Cambodia](#Section_9_2_4)

--------------------------

## Dataset chosen on Kaggle <a class="anchor" id="chapter1"></a>

https://www.kaggle.com/mariaren/covid19-healthy-diet-dataset

## Problem definition <a class="anchor" id="chapter2"></a>

We chose a dataset combining different types of **food,** world population **obesity and undernourished rate**, and **global covid cases count** from **around the world.** 

The idea is to understand how a **healthy eating style could help combat the coronavirus,** by identifying the diet patterns of countries with lower COVID infection rates.

Our goal here is to **provide diet recommendations based on our findings.**


Each dataset provides a **different diet measure** across the food categories, depending on what we want to focus on, so we have 
- fat quantity, 
- energy intake (kcal), 
- food supply quantity (kg), 
- protein for different categories of food 

To which have been added:
- obesity rate
- undernourished rate 
- the most up to date confirmed/deaths/recovered/active cases.

Let's start by **loading the data.**

In [None]:
fat_quantity = pd.read_csv("../input/covid19healthydietdataset/Fat_Supply_Quantity_Data.csv")
food_kcal = pd.read_csv("../input/covid19healthydietdataset/Food_Supply_kcal_Data.csv")
food_kg = pd.read_csv("../input/covid19healthydietdataset/Food_Supply_Quantity_kg_Data.csv")
protein_quantity = pd.read_csv("../input/covid19healthydietdataset/Protein_Supply_Quantity_Data.csv")
supply_food = pd.read_csv("../input/covid19-healthy-diet-dataset/Supply_Food_Data_Descriptions.csv")

Now let's **discover the different datasets.**

In [None]:
fat_quantity.head()

In [None]:
food_kcal.head()

In [None]:
food_kg.head()

In [None]:
protein_quantity.head()

In [None]:
supply_food.head()

In almost all datasets, the data are organized by country. There are 170 countries in these datasets.
<br>After we discovered the different columns in each dataset, we wanted to **focus on how each column data is calculated.**

In [None]:
fat_quantity.drop(['Obesity','Confirmed','Undernourished','Deaths','Recovered','Active', 'Population', 'Unit (all except Population)'], axis = 1, inplace = True)

In [None]:
print('Sum of diet measures per Fat quantity :')
print(fat_quantity.sum(axis = 1))

First, we noticed that the different diet measures are described as their **percentage of prevalence in the total diet.** For Afghanistan for example, alcohol represents 0% of an inhabitant's diet. 

In [None]:
food_kg.head()

Then, we noticed that the **different rates for undernourished, obesity and COVID are in percentage of the total population.**

In [None]:
food_kg['Confirmed'].round() == (food_kg['Active'] + food_kg['Deaths'] + food_kg['Recovered']).round()

Finally, we wanted to make sure that **confirmed cases** are the **sum of deaths, recovered and active cases.**
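
The element-wise comparison above returns one boolean per country. As a minimal sketch (the toy values and the `atol` tolerance below are assumptions, not figures from the dataset), the same check can be summarised with `np.isclose`, which also makes it easy to list the rows that do not add up:

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values, in % of population), not taken from the dataset
toy = pd.DataFrame({
    "Confirmed": [1.20, 0.50, 3.00],
    "Active":    [0.30, 0.10, 1.00],
    "Deaths":    [0.02, 0.01, 0.10],
    "Recovered": [0.88, 0.39, 1.80],
})

parts = toy["Active"] + toy["Deaths"] + toy["Recovered"]

# atol is an assumed tolerance that absorbs rounding in the published figures
matches = np.isclose(toy["Confirmed"], parts, atol=1e-2)

all_match = bool(matches.all())   # single yes/no answer for the whole table
mismatches = toy[~matches]        # rows where the identity does not hold
print(all_match)
print(mismatches)
```

In this toy table the third row is deliberately inconsistent, so it is the only one listed in `mismatches`.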

Now that we know more about each value, **we start to explore.**

## Data Exploration and Processing <a class="anchor" id="chapter3"></a>

### 1. Value Types  <a class="anchor" id="section_3_1"></a>

Let's dig into **different data types.**

We focus on one of the datasets since they are similar, and chose the **easiest to understand: food_kg.**

In [None]:
food_kg.dtypes.value_counts()

Let's go deeper and dig into **different data types for each variable.**

In [None]:
food_kg.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

We have data on **170 countries,** and the **Undernourished and Unit columns are considered as objects,** let's modify that.
<br> First, we look at the unique values of the Unit column: this helps us understand whether we can directly convert it to float or whether one or more string elements block the conversion.

In [None]:
food_kg['Unit (all except Population)'].unique()

The Unit column only contains the % sign, indicating that all the columns except the Population one are in percentages. It is important to know the unit used, but now that we have this information we can **delete this column, which is no longer useful.**

In [None]:
food_kg = food_kg.drop(['Unit (all except Population)'], axis=1)

Now that we have fixed the Unit column problem we are going to **focus on the Undernourished one.** Again, we first look at the unique values of this column, which helps us understand whether we can directly convert it to float or whether one or more string elements block the conversion.

In [None]:
food_kg['Undernourished'].unique()

Indeed, the problem seems to come from the **"<2.5"** value, as the float type does not support special characters. To fix that, we are going to replace all the "<2.5" values with **just "2.5".** 

In [None]:
food_kg["Undernourished"] = food_kg["Undernourished"].replace('<2.5','2.5')
food_kg['Undernourished'].value_counts()

Now that we have fixed the string elements in the column, we can actually **do the conversion to float.**

In [None]:
food_kg["Undernourished"] = pd.to_numeric((food_kg["Undernourished"]), downcast="float")

To confirm that the Undernourished column is now a float type, and to dive a bit deeper in the composition of the data set, we now look at **an overview of all information.**

In [None]:
food_kg.info()

We have **six columns with missing values** (not reaching 170 values): obesity, undernourished, confirmed, deaths, recovered, and active.
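
The "<2.5" handling described above can be sketched on a toy Series. This is only an illustration under assumptions: the values are made up, and the `is_censored` flag is a hypothetical addition that keeps track of which entries were originally reported as "below 2.5" before they are overwritten.

```python
import pandas as pd

# Toy column mimicking the mix of strings, censored values and missing entries
raw = pd.Series(["<2.5", "7.1", "59.6", None, "<2.5"], name="Undernourished")

# Remember which entries were censored, so the information is not lost
is_censored = raw.eq("<2.5")

# Replace the censored marker by its upper bound (same choice as in the notebook),
# then convert; errors="coerce" turns anything unparsable into NaN
cleaned = pd.to_numeric(raw.replace("<2.5", "2.5"), errors="coerce")

print(cleaned)
print(is_censored.sum(), "censored values")
```

Keeping the flag is a design choice: 2.5 is an upper bound rather than a measured value, so later analyses can decide whether to treat those countries separately.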

### 2. Missing Values  <a class="anchor" id="section_3_2"></a>

Let's create a function to **check missing data** and unveil **the percentage of data missing** for each dataframe, as seen in the python bootcamp.

In [None]:
def missing_data(data):
    nb_values = data.isnull().sum().sort_values(ascending = False) #contains the number of values missing
    percent_values = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False) #contains the percentage of values missing
    return pd.concat([nb_values, percent_values], axis=1, keys=['Number of Missing Values', 'Percentage of Missing Values'])

Let's apply the function. We display **6 rows as we know we have 6 columns with missing values**.

In [None]:
missing_data(food_kg).head(6)

The **number of missing values is low** - from 3 to 9 missing values out of 170 rows. Let's see which countries are concerned.

In [None]:
food_kg_missing = food_kg[food_kg.isna().any(axis=1)]
food_kg_missing

Since we **cannot approximate those values** nor find the exact missing values (we know neither the extraction date nor the population count used for the calculation), we decided to **delete the countries** for which values are missing.

In [None]:
food_kg = food_kg.dropna(axis=0)

### 3. Analyzing extremes <a class="anchor" id="section_3_3"></a>

We then proceed to explore the data by describing it; this gives us a **data description**. 

In [None]:
food_kg.describe()

There seem to be some **anomalies** in the dataset, notably strong extremes that we see in the **max row**. To check whether the data is accurate, we quickly test some of the most striking figures.

#### a. Alcoholic consumption  <a class="anchor" id="section_3_3_1"></a>

We start our test with the **Alcoholic Beverages** column; one country's average diet is supposedly composed of **15%** of alcoholic beverages. We first look at the corresponding country and then do a short search on the internet to confirm or refute the information. 

In [None]:
food_kg[food_kg['Alcoholic Beverages'] == food_kg['Alcoholic Beverages'].max()]

Here, the corresponding country is **Burkina Faso.** An internet search shows that, although Islam is the most prevalent religion in Burkina Faso (nearly 60% of the population according to a survey conducted in 2006), alcohol consumption in the country is high. Indeed, in 2018, **22** liters of pure alcohol were consumed per year and per inhabitant (<a href='https://movendi.ngo/news/2020/07/03/burkina-faso-300000-liters-of-liquor-destroyed-in-ouagadougou/#:~:text=Per%20capita%20alcohol%20intake%20of,adults%20that%20number%20is%2046%25'>source</a>). In comparison, in France in 2016, **11.7** liters were consumed per year per inhabitant (<a href='https://www.stop-alcool.ch/fr/l-alcool-en-general-2/statistiques-sur-la-consommation/quelques-chiffres-pour-la-france#:~:text=Avec%2011%2C7%20litres%20d,2'>source</a>).

#### b. Undernourished  <a class="anchor" id="section_3_3_2"></a>

Then, we move on to the **Undernourished column:** 

In [None]:
food_kg[food_kg['Undernourished'] == food_kg['Undernourished'].max()]

The corresponding country is the **Central African Republic.** Again, this information seems accurate: **79%** of the country's population was estimated to be living in poverty in 2018, and is thus more likely to be undernourished (<a href='https://www.wfp.org/countries/central-african-republic#:~:text=The%20Central%20African%20Republic%20'>source</a>).

#### c. Obesity  <a class="anchor" id="section_3_3_3"></a>

Finally, the last surprising figure is related to the **Obesity column:**

In [None]:
food_kg[food_kg['Obesity'] == food_kg['Obesity'].max()]

The corresponding country is **Kiribati**, an archipelago republic in the Central Pacific Ocean. After looking it up on the internet, it is once again true: the obesity rate of the country was as high as **46%** in 2016 (<a href='https://bmcpublichealth.biomedcentral.com/articles/10.1186/s12889-020-09217-z#:~:text=In%20fact%2C%20in%202016%2C%20the,of%201.96%25%20%5B5%5D'> source</a>).

Although some data seem abnormal at first sight, there are **no anomalies nor abnormal extremes**; it is a good sign. Indeed, despite the **variety** of the data set, it seems to hold **accurate data** that will enable us to conduct a relevant analysis. 

This table gives us an **approximation of world food consumption in covid times**, as the data of 170 countries are gathered here. 

This helps us get **an understanding of the world's average health**:
- Obesity : 19% of the population of the 170 countries
- Undernourished : 11% of the population of the 170 countries

And learning more about the **covid pandemic**: 
- Confirmed : 1.2% of the population are confirmed cases of covid
- Deaths : 0.02% of deaths due to covid
- Recovered : 0.8% of recovered patients from covid

### 4. Early visualizations  <a class="anchor" id="section_3_4"></a>

#### a. World average diet <a class="anchor" id="section_3_4_1"></a>

First, we want to **visualize the world average diet**. To do so, we create a variable called "diet_mean" where we put the dataset description. We select the *mean* row with *iloc* (index 1, since index 0 holds the count) in order to only have the mean of all columns. Then, we *drop* the columns related to covid, health or population. We also only keep the columns having a mean above 1%; we thus select 11 product categories.

In [None]:
diet_mean = food_kg.describe().iloc[1]
diet_mean = pd.DataFrame(diet_mean).drop(['Deaths', 'Population','Undernourished','Obesity', 'Recovered', 'Confirmed', 'Active'], axis=0)
diet_mean = diet_mean.sort_values(by='mean', ascending=False).iloc[:11]

In [None]:
diet_mean_plot = diet_mean.plot.pie(subplots=True, figsize=(15, 15), autopct='%1.1f%%')

When looking through the diet details, we can see that **vegetal products** are the most consumed, followed by **animal products and cereals.** This pie chart will help us in our analysis later on; it will serve as a **reference** to understand the differences in covid cases between countries, based on diet.
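
The selection steps described above (average each column, drop the non-diet columns, keep the categories above 1%) can also be written without going through `describe()`. The following is a minimal sketch on a toy table; the values and the `non_diet` list are assumptions for illustration only.

```python
import pandas as pd

# Toy table with the same layout as food_kg (hypothetical values, in %)
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "Vegetal Products": [40.0, 44.0],
    "Animal Products": [10.0, 6.0],
    "Spices": [0.2, 0.4],
    "Obesity": [20.0, 10.0],
})

# In the notebook this would be the covid, health and population columns
non_diet = ["Obesity"]

diet_mean = (
    toy.drop(columns=["Country"] + non_diet)  # keep diet categories only
       .mean()                                # world average per category
       .sort_values(ascending=False)
)

# Keep only the categories that weigh more than 1% of the average diet
main_categories = diet_mean[diet_mean > 1]
print(main_categories)
```

Filtering on the 1% threshold directly, rather than taking a fixed number of rows, keeps the selection correct if the data changes.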

#### b. World covid cases <a class="anchor" id="section_3_4_2"></a>

Now, we want to **visualize the world average covid state**. To do so, we use *plotly express*, which lets us **hover** over a scatter plot and see the statistics per country more clearly, as explained [here](https://plotly.com/python/hover-text-and-formatting/).  

In [None]:
covid_stats_plot = px.scatter(food_kg, x='Confirmed', y='Deaths', hover_name='Country', size_max=30)
covid_stats_plot.show()

This graph gives a sense of the **distribution of countries according to their covid deaths and confirmed cases**. Extremes can be easily spotted. 

### 5. Insights <a class="anchor" id="section_3_5"></a>

Now that we know our dataset a bit better, let's analyze **correlations**, and **challenge our own bias**, especially regarding **obesity being an aggravating factor of covid.**
To do so, we drop all the mirror correlations, i.e. each variable paired with itself (e.g. population/population, spices/spices, etc.). For the code below we used two sources: <a href='https://stackoverflow.com/questions/17778394/list-highest-correlation-pairs-from-a-large-correlation-matrix-in-pandas'> one for abs and unstack</a> and <a href='https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.unstack.html'> one for dropping the head rows</a>. 
That way we now have all the *'true'* correlations.
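
Dropping a fixed number of head rows works as long as the number of perfectly correlated pairs is known in advance. As an alternative sketch (not the method used in this notebook), the diagonal and the duplicated pairs can be removed with an upper-triangle mask; the toy columns `a` to `d` below are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: four columns, with 'd' built to be strongly correlated with 'a'
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
toy["d"] = toy["a"] * 2 + rng.normal(scale=0.1, size=50)

corr = toy.corr(method="pearson").abs()

# Strict upper triangle: removes the diagonal (x vs x) and the mirrored pairs (y vs x)
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Masked cells become NaN; dropna() removes them whatever the pandas version
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
print(pairs.head())
```

With 4 columns this leaves the 6 distinct pairs, sorted from the strongest to the weakest absolute correlation.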

#### a. Diet vs covid <a class="anchor" id="section_3_5_1"></a>

In [None]:
corr_food=food_kg.select_dtypes('number').corr(method='pearson')  # numeric columns only: 'Country' is a string column
corr_final=corr_food.abs().unstack().sort_values(ascending = False)
corr_final.drop(corr_final.head(32).index, inplace=True)

We want to create a list of the **strongest correlations with diets for 3 covid states**: confirmed, deaths, and recovered. We will not use the active covid state as it is part of the confirmed covid cases. 

To do so we have to *drop* the health attributes for each list:

In [None]:
print('Confirmed')
corr_confirmed = corr_final['Confirmed'].head(15)
corr_confirmed = corr_confirmed.drop(['Recovered', 'Deaths', 'Active', 'Undernourished', 'Obesity'])
print(corr_confirmed)
print()

print('Deaths')
corr_deaths = corr_final['Deaths'].head(15)
corr_deaths = corr_deaths.drop(['Recovered', 'Confirmed', 'Active', 'Undernourished', 'Obesity'])
print(corr_deaths)
print()

print('Recovered')
corr_recovered = corr_final['Recovered'].head(14)
corr_recovered = corr_recovered.drop(['Confirmed', 'Deaths', 'Undernourished', 'Obesity'])
print(corr_recovered)

Then, we **merge these lists in one**, giving us the most interesting diet attributes to compare to the health states.

In [None]:
corr_base = corr_deaths + corr_confirmed + corr_recovered
corr_base

Now we know the **11 strongest correlations** of those 3 covid states, listed above.

We are going to visualize it a bit better by making a **heatmap for each of them**, solely based on the diets' attributes listed above:

In [None]:
corr_heatmap=food_kg[['Deaths','Animal Products','Animal fats','Cereals - Excluding Beer','Eggs','Meat','Milk - Excluding Butter','Pulses','Starchy Roots','Sugar & Sweeteners','Vegetal Products']]
x=corr_heatmap.corr(method='pearson')
plt.figure(figsize=(7,5), dpi= 80)
sns.heatmap(x[['Deaths']].sort_values(by=['Deaths'],ascending=False),cmap='Pastel2_r',annot=True,linewidth=0.6)
plt.title('Covid deaths cases diets')
plt.xticks()
plt.suptitle('Pearson Correlation Coefficient', size=18, va='top')

corr_heatmap=food_kg[['Confirmed','Animal Products','Animal fats','Cereals - Excluding Beer','Eggs','Meat','Milk - Excluding Butter','Pulses','Starchy Roots','Sugar & Sweeteners','Vegetal Products']]
x=corr_heatmap.corr(method='pearson')
plt.figure(figsize=(7,5), dpi= 80)
sns.heatmap(x[['Confirmed']].sort_values(by=['Confirmed'],ascending=False),cmap='Pastel2_r',annot=True,linewidth=0.6)
plt.title('Covid confirmed cases diets')
plt.xticks()

corr_heatmap=food_kg[['Recovered','Animal Products','Animal fats','Cereals - Excluding Beer','Eggs','Meat','Milk - Excluding Butter','Pulses','Starchy Roots','Sugar & Sweeteners','Vegetal Products']]
x=corr_heatmap.corr(method='pearson')
plt.figure(figsize=(7,5), dpi= 80)
sns.heatmap(x[['Recovered']].sort_values(by=['Recovered'],ascending=False),cmap='Pastel2_r',annot=True,linewidth=0.6)
plt.title('Covid recovered cases diets')
plt.xticks()

We can see that the first two diets, **confirmed and deaths cases diets, are very similar**. Indeed, the top 3 correlations are **animal products, milk (excluding butter), and animal fat**. When we compare that to the recovered cases diets, the **animal fat correlation is significantly lower**. Recovered cases diets have a **weaker correlation with meat** as well. 

This could mean that, on average, **recovered cases eat less meat and animal fat**. This is in line with the idea that [malnutrition is a threat-multiplier](https://globalnutritionreport.org/blog/nutrition-and-covid-19-malnutrition-threat-multiplier/).

#### b. Health state vs covid <a class="anchor" id="section_3_5_2"></a>

After having compared diets to covid results, we now do the **same covid comparison to the health state** to see if there is a pattern as well:

In [None]:
corr_heatmap=food_kg[['Deaths','Confirmed','Recovered','Obesity','Undernourished']]
x=corr_heatmap.corr(method='pearson')
plt.figure(figsize=(10,8), dpi= 80)
sns.heatmap(x,cmap='Pastel2_r',annot=True,linewidth=0.6)
plt.title('Pearson Correlation Coefficient')
plt.xticks(rotation=45)

Indeed, we can now see that **obesity has a stronger correlation with covid deaths than with recovery** and that **undernourishment has a stronger correlation with covid recovery than with deaths.**

This could mean that, on average, **obese patients are more likely to die from covid** while **undernourished patients are more likely to survive**. This is consistent with reports that [obesity worsens outcomes from covid](https://www.cdc.gov/obesity/data/obesity-and-covid-19.html).

Such results are to be **interpreted carefully** as many other factors are to be taken into account - for example, undernourished patients are more likely to live in emerging countries, where the population is very young and more likely to survive. 

## Features Selection <a class="anchor" id="chapter4"></a>

We decided to select **4 countries to focus on.** Those countries reflect extremes in terms of deaths and confirmed cases in relation to their population, and could therefore be **representative of some of our results.**

### 1. The one with the most deaths in relation to the population: Belgium <a class="anchor" id="Section_4_1"></a>

In [None]:
food_kg[food_kg['Deaths'] == food_kg['Deaths'].max()]

First, we have Belgium, 
[the world's worst affected country when it comes to the coronavirus mortality rate](https://www.bbc.com/news/world-europe-52491210).

### 2. The one with the most confirmed cases in relation to the population: Montenegro <a class="anchor" id="Section_4_2"></a>

In [None]:
food_kg[food_kg['Confirmed'] == food_kg['Confirmed'].max()]

Then, we have Montenegro, **the world's worst affected country when it comes to the coronavirus confirmed cases rate**.

### 3. The most populated one with the least deaths in relation to the population: Cambodia <a class="anchor" id="Section_4_3"></a>

For this one we have several countries with very few covid-related deaths. For our analysis to be relevant we decided to choose among these countries the one with the largest population. To do so we sorted it by population.

In [None]:
food_kg[food_kg['Deaths'] == food_kg['Deaths'].min()].sort_values(by='Population', ascending=False)

We now have Belgium's polar opposite, Cambodia, [one of the world's least affected countries when it comes to the coronavirus mortality rate](https://www.abc.net.au/news/2020-12-04/cambodia-handling-covid-19-community-transmission-zero-deaths/12938226).

### 4. The one with the least confirmed cases in relation to the population: Vanuatu <a class="anchor" id="Section_4_4"></a>

In [None]:
food_kg[food_kg['Confirmed'] == food_kg['Confirmed'].min()].sort_values(by='Population', ascending=False)

Finally, we have Montenegro's polar opposite, Vanuatu, [the world's least affected country when it comes to the coronavirus confirmed cases rate](https://time.com/5910456/pacific-islands-covid-19-vanuatu/).

## Features Engineering <a class="anchor" id="chapter5"></a>

Now we dive into feature engineering, as we try to **create new input features from our existing ones**. In this sense, we thought about creating a new column in which we would **score countries based on their coronavirus results** (again, confirmed and deaths rates). Such grading would allow us to start clustering countries based on the way they handled the situation. 

### 1. Handmade clustering <a class="anchor" id="Section_5_1"></a>

#### a. Mortality rate scoring <a class="anchor" id="Section_5_1_1"></a>

We create bins, knowing that covid mortality rates go from **0 to almost 0.15%** of the population (from Cambodia to Belgium). We want to have **4 grades** (1 to 4 - 4 being the worst) so we **arbitrarily** create **4 bins.** 

In [None]:
score_bins = [-0.1, 0.0375, 0.075, 0.1125, 0.15] # lower edge below 0: pd.cut intervals are open on the left, so an edge of exactly 0 would leave the zeroes unbinned
grades = ['1','2','3','4']
cats = pd.cut(food_kg.Deaths, score_bins, labels=grades)
food_kg['DeathsScore'] = cats
food_kg

We can check the distribution by looking at the **occurrence of each grade.** The occurrences are **not evenly shared** since the bins were arbitrarily made.

In [None]:
food_kg.DeathsScore.value_counts()

We can have a look at our 4 clusters **through a graph.** 

In [None]:
food_kg['DeathsScore'] = food_kg['DeathsScore'].astype(str)
food_kg['DeathsScore'] = food_kg['DeathsScore'].astype(float)
covid_man_cluster_conf = px.scatter(food_kg, x='Confirmed', y='Deaths', color='DeathsScore', hover_name='Country', size_max=30)
covid_man_cluster_conf.show()

Our handmade clustering works but it is **not very representative.**  
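
Two variations on the binning above may be worth noting, shown here as a sketch on made-up death rates rather than on the dataset: `include_lowest=True` lets `pd.cut` keep exact zeroes with a lower edge of 0, and `pd.qcut` builds quantile-based bins so that each grade holds roughly the same number of countries.

```python
import pandas as pd

# Hypothetical death rates (% of population), skewed towards low values
deaths = pd.Series([0.0, 0.001, 0.002, 0.01, 0.03, 0.05, 0.09, 0.14])

# Equal-width bins; include_lowest=True keeps the exact zeroes in the first bin
fixed = pd.cut(deaths, bins=[0, 0.0375, 0.075, 0.1125, 0.15],
               labels=[1, 2, 3, 4], include_lowest=True)

# Quantile bins: edges are chosen from the data so the grades are balanced
balanced = pd.qcut(deaths, q=4, labels=[1, 2, 3, 4])

print(fixed.value_counts().sort_index())
print(balanced.value_counts().sort_index())
```

The trade-off is interpretability: equal-width grades have a fixed meaning, whereas quantile grades only rank countries relative to each other.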

#### b. Confirmed cases scoring <a class="anchor" id="Section_5_1_2"></a>

Same as above, we create bins, knowing that the covid confirmed case rate goes from **almost 0 to almost 6%** (from Vanuatu to Montenegro). We want to have **4 grades** (1 to 4 - 4 being the worst) so we **arbitrarily** create **4 bins.** 

In [None]:
score_bins_2 = [-1, 1.5, 3, 4.5, 6] # lower edge of -1: pd.cut intervals are open on the left, so an edge of exactly 0 would leave the zeroes unbinned
grades = ['1','2','3','4']
cats_2 = pd.cut(food_kg.Confirmed, score_bins_2, labels=grades)
food_kg['ConfirmedScore'] = cats_2
food_kg

Again, we can check the distribution by looking at the **occurrence of each grade.** The occurrences are **not evenly shared** since the bins were arbitrarily made.

In [None]:
food_kg.ConfirmedScore.value_counts()

In [None]:
food_kg['ConfirmedScore'] = food_kg['ConfirmedScore'].astype(str)
food_kg['ConfirmedScore'] = food_kg['ConfirmedScore'].astype(float)
covid_man_cluster_conf = px.scatter(food_kg, x='Confirmed', y='Deaths', color='ConfirmedScore', hover_name='Country', size_max=30)
covid_man_cluster_conf.show()

Again, our handmade clustering works but it is **not very representative.** We can try to do better. 

### 2. Unsupervised clustering <a class="anchor" id="Section_5_2"></a>

Now that we have tried some handmade clustering through arbitrary bins, we can go for **unsupervised clustering** as we have seen in Lesson 5. 

#### a. Preparation <a class="anchor" id="Section_5_2_1"></a>

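We start by **separating the string column 'Country' from the numerical columns**: the numerical columns are standardized and 'Country' is one-hot encoded, then both are combined into a new dataset called *food_kg_int.* This dataset is now an **array,** storing values of the same data type.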

In [None]:
food_kg_country = food_kg[['Country']]
food_kg_drop = food_kg.drop('Country', axis=1)

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(food_kg_drop)
food_kg_scaled = scaler.transform(food_kg_drop)
food_kg_scaled

In [None]:
food_kg_coun = food_kg[['Country']]
food_kg_coun.tail(3)

In [None]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
food_kg_coun_encoded = encoder.fit_transform(food_kg_coun)

In [None]:
from sklearn.preprocessing import OneHotEncoder
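cat_encoder = OneHotEncoder()
# .toarray() makes the result dense without relying on the `sparse` argument, which newer scikit-learn versions no longer accept
food_kg_coun_1hot = cat_encoder.fit_transform(food_kg_coun).toarray()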

In [None]:
print("Country:",food_kg_coun_1hot.shape)

Now we **concatenate** the processed numerical and categorical features into **one matrix.**

In [None]:
food_kg_int = np.concatenate((food_kg_scaled, food_kg_coun_1hot), axis=1, out=None)
food_kg_int.shape

And then we create a new dataset, called *food_kg_ar* **focusing on confirmed cases and deaths**, that we turn into an **array** as well.

In [None]:
food_kg_ar = food_kg[['Confirmed','Deaths']].to_numpy()

Finally, we **turn both back into DataFrames rather than arrays.**

In [None]:
food_kg_pd = pd.DataFrame(food_kg_int)
food_kg_pd.head()

In [None]:
food_kg_li = food_kg[['Confirmed','Deaths']]
food_kg_li.head()
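
The preparation steps of this section (scaling the numerical columns, one-hot encoding 'Country', concatenating the two) can also be expressed in a single object. The following is a minimal sketch using scikit-learn's `ColumnTransformer` on a toy table; it is an alternative formulation, not the code used above, and the toy values are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy table with one categorical and two numerical columns (hypothetical values)
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Confirmed": [0.1, 1.0, 3.0],
    "Deaths": [0.00, 0.02, 0.10],
})

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Confirmed", "Deaths"]),              # scale numbers
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Country"]),    # encode strings
    ],
    sparse_threshold=0,  # always return a dense array
)

# One call scales, encodes and concatenates: 2 scaled columns + 3 one-hot columns
features = preprocess.fit_transform(toy)
print(features.shape)
```

One point to keep in mind with this design: since every country appears only once, its one-hot column is unique to it, so these columns make all countries equally distant from each other and add no grouping information for a clustering algorithm.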

#### b.  K-Means <a class="anchor" id="Section_5_2_2"></a>

We start with the **K-Means model,** since K-means is easy to implement and computationally very efficient. We **create 5 groups based on their feature similarities.**

In [None]:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=5,
            init='random',
            n_init=10, 
            max_iter=300,
            tol=1e-04,
            random_state=0)

Then we **have a look at the predicted clusters.**

In [None]:
food_kg_km = km.fit_predict(food_kg_int)

And we check the **predicted clusters' centers (centroids).**

In [None]:
km.cluster_centers_

We get the **labels found of the K-means.**

In [None]:
km.labels_

Finally, we plot **the results of clustering.**

In [None]:
# Columns 0 and 1 of food_kg_int are the first two diet features, not the covid rates,
# so we look up where 'Confirmed' and 'Deaths' sit among the scaled numerical columns.
conf_idx = food_kg_drop.columns.get_loc('Confirmed')
deaths_idx = food_kg_drop.columns.get_loc('Deaths')

# One colour per cluster (the model above was fitted with 5 clusters)
colors = ['lightgreen', 'orange', 'lightblue', 'green', 'purple']
for cluster, color in zip(np.unique(food_kg_km), colors):
    plt.scatter(food_kg_int[food_kg_km == cluster, conf_idx],
                food_kg_int[food_kg_km == cluster, deaths_idx],
                s=50,
                c=color,
                marker='o',
                label='cluster %d' % (cluster + 1))

# Centroids, projected on the same two columns
plt.scatter(km.cluster_centers_[:, conf_idx],
            km.cluster_centers_[:, deaths_idx],
            s=250,
            marker='*',
            c='red',
            label='centroids')

# The values are standardized, so the axes are in standard deviations
plt.ylabel('Deaths (standardized)')
plt.xlabel('Confirmed (standardized)')
plt.title('Country clusters')

plt.legend()
plt.grid()
plt.tight_layout()
plt.show()

The result is a bit **messy,** let's see if we had chosen the right number of clusters in the first place through the **elbow method.**

#### c.  Elbow method <a class="anchor" id="Section_5_2_3"></a>

In order **to quantify the quality of our clustering**, we need to use **distortion** to **compare the performance of different K-means clusterings**. 

In [None]:
print('Distortion: %.2f' % km.inertia_)
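
To make the notion of distortion concrete, here is a minimal sketch on synthetic data (the arrays `X_demo` and `labels_demo` and the model `km_demo` are illustrative assumptions, not the notebook's variables). It recomputes the within-cluster sum of squared distances by hand and compares it with the `inertia_` attribute.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data standing in for the notebook's feature matrix (assumption)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(30, 2)),
                    rng.normal(loc=5.0, scale=0.5, size=(30, 2))])

km_demo = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
labels_demo = km_demo.fit_predict(X_demo)

# Distortion = sum of squared distances from each sample to its assigned centroid
assigned_centroids = km_demo.cluster_centers_[labels_demo]
manual_distortion = np.sum((X_demo - assigned_centroids) ** 2)

print('inertia_: %.4f' % km_demo.inertia_)
print('manual:   %.4f' % manual_distortion)
```

A lower distortion means tighter clusters, but it always decreases as k grows, which is why it is read through an elbow plot rather than minimised directly.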

We use a graphical tool, the **elbow method,** to estimate the optimal number of clusters k for a given task.

In [None]:
distortions = []
for i in range(1, 12):
    km = KMeans(n_clusters=i, 
                init='k-means++', 
                n_init=10, 
                max_iter=300, 
                random_state=0)
    km.fit(food_kg_ar)
    distortions.append(km.inertia_)
plt.plot(range(1, 12), distortions , marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.title('Comparing the performance of different K-means clusterings')
plt.tight_layout()
plt.show()

The distortion no longer decreases significantly beyond **k=2**, so 2 clusters would have been a better choice for this dataset.

#### d.  Silhouette plots <a class="anchor" id="Section_5_2_4"></a>

We compute the **silhouette score** of each sample to quantify the quality of our clustering.

In [None]:
import numpy as np
from matplotlib import cm
from sklearn.metrics import silhouette_samples

silhouette_vals = silhouette_samples(food_kg_int, food_kg_km, metric='euclidean')
silhouette_vals

We now **compute the mean silhouette coefficient of all samples.**

In [None]:
from sklearn.metrics import silhouette_score
silhouette_score_ = silhouette_score(food_kg_int, food_kg_km)
silhouette_score_

And we **create a plot of the silhouette coefficients** for a K-means clustering with **k=5.**

In [None]:
#Getting the clusters from food_kg_km
cluster_labels = np.unique(food_kg_km)
n_clusters = cluster_labels.shape[0]

y_ax_lower, y_ax_upper = 0, 0
yticks = []

#For each cluster, getting the silhouette values and sort them
for i, c in enumerate(cluster_labels):
    c_silhouette_vals = silhouette_vals[food_kg_km == c]
    #sort them
    c_silhouette_vals.sort()
    
    y_ax_upper += len(c_silhouette_vals)
    
    #specify the color with respect to the number of clusters
    color = cm.jet(i / n_clusters)
    plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0, 
            edgecolor='none', color=color)

    yticks.append((y_ax_lower + y_ax_upper) / 2)
    y_ax_lower += len(c_silhouette_vals)

#Computing and plotting the average silhouette
silhouette_avg = silhouette_score(food_kg_int,food_kg_km)


plt.axvline(silhouette_avg, color="red", linestyle="--") 

plt.yticks(yticks, cluster_labels + 1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')

plt.tight_layout()
plt.show()

Some samples have a **negative silhouette coefficient** here, meaning those points are on average closer to the neighboring group than to their own: they are therefore probably **misclassified.**
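
As a rough illustration of where a negative value comes from, the sketch below computes the silhouette coefficient s = (b - a) / max(a, b) by hand for one sample, where a is the mean distance to its own cluster and b the mean distance to the nearest other cluster. The tiny dataset and the helper `silhouette_of` are made up for the example.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Tiny hand-made 1-D dataset (assumption); the last point of cluster 0 sits next to cluster 1
X_demo = np.array([[0.0], [1.0], [9.0], [10.0], [11.0]])
labels_demo = np.array([0, 0, 0, 1, 1])

def silhouette_of(index, X, labels):
    """Silhouette s = (b - a) / max(a, b) for one sample."""
    distances = np.linalg.norm(X - X[index], axis=1)
    own = labels == labels[index]
    own[index] = False                      # exclude the sample itself
    a = distances[own].mean()               # mean distance to its own cluster
    b = min(distances[labels == other].mean()
            for other in np.unique(labels) if other != labels[index])
    return (b - a) / max(a, b)

s_manual = silhouette_of(2, X_demo, labels_demo)   # a = 8.5, b = 1.5, so s is about -0.82
s_sklearn = silhouette_samples(X_demo, labels_demo, metric='euclidean')[2]
print(s_manual, s_sklearn)
```

The sample at 9.0 is labelled as cluster 0 but lies much closer to cluster 1, so b is smaller than a and the coefficient is negative.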

As underlined by the silhouette plot above, our clustering is **not so great**; we will therefore **not pursue clustering** and will continue our explanation of covid deaths with other models.

## Model Selection <a class="anchor" id="chapter6"></a>

Given this dataset and **the emphasis we have already laid on deaths** through clustering and classification, we thought it would be interesting to model the data in order to identify **the factors most strongly associated with covid deaths.**

### 1. Creating train and test sets <a class="anchor" id="Section_6_1"></a>

Let's **separate the data into training and test sets** using random selection, holding out 20% of the data for testing (`test_size=0.2`).

In [None]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(food_kg, test_size=0.2)

In [None]:
test_set.head()

The test and train sets **seem representative.**

We **now drop the labels** from the training set and **create a new variable for the labels.**

In [None]:
food_kg_train = train_set.drop("Deaths", axis=1) # drop labels for training set
food_kg_train_labels = train_set["Deaths"].copy()

food_kg_test = test_set.drop("Deaths", axis=1) # drop labels for test set
food_kg_test_labels = test_set["Deaths"].copy()

Here, **we don't need an imputer** or any additional manipulations since **we no longer have any missing values.** We only set the categorical *Country* column aside to obtain the numerical features.

In [None]:
food_kg_train_num = food_kg_train.drop('Country', axis=1)
food_kg_test_num = food_kg_test.drop('Country', axis=1)

### 2. Performing feature scaling <a class="anchor" id="Section_6_2"></a>

Now we perform **feature scaling** on the cleaned training and testing *food_kg* datasets.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(food_kg_train_num)
food_kg_train_num_scaled = scaler.transform(food_kg_train_num)
food_kg_test_num_scaled = scaler.transform(food_kg_test_num)
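
For reference, standardization applies z = (x - mean) / std with the mean and standard deviation learned on the training set and then reused for the test set. A minimal sketch on toy arrays (`train_demo`, `test_demo` and `scaler_demo` are assumptions used only for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy train/test matrices (assumption); one column is enough to show the idea
train_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
test_demo = np.array([[2.5], [10.0]])

scaler_demo = StandardScaler().fit(train_demo)   # statistics come from the training set only
scaled_test = scaler_demo.transform(test_demo)

# Same transformation written out: z = (x - mean_train) / std_train
manual_test = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)

print(scaled_test.ravel())
print(manual_test.ravel())
```

Reusing the training statistics keeps information from the test set out of the preprocessing step.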

In [None]:
food_kg_train_num_scaled

And we **preprocess the categorical input features.**

In [None]:
food_kg_train_coun = food_kg_train[['Country']]
food_kg_test_coun = food_kg_test[['Country']]
food_kg_test_coun.tail(3)

We transform our *food_kg* train categories and *food_kg* test categories **from categories into numerical data.**

In [None]:
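from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
# LabelEncoder expects 1-D input; fit on every country so train and test share one mapping
encoder.fit(food_kg['Country'])
food_kg_train_coun_encoded = encoder.transform(food_kg_train['Country'])
food_kg_test_coun_encoded = encoder.transform(food_kg_test['Country'])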

In [None]:
from sklearn.preprocessing import OneHotEncoder

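# Fit on the training countries only; countries unseen during fit become all-zero rows,
# so the train and test matrices keep the same number of columns
cat_encoder = OneHotEncoder(handle_unknown='ignore')
food_kg_train_coun_1hot = cat_encoder.fit_transform(food_kg_train_coun).toarray()
food_kg_test_coun_1hot = cat_encoder.transform(food_kg_test_coun).toarray()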
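
A small sketch of how a one-hot encoder behaves when the test set contains categories that were absent from the training set. The country lists are hypothetical and the encoder is configured with `handle_unknown='ignore'`, which is one possible choice rather than the only one.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical country names, used only for illustration
train_demo = pd.DataFrame({'Country': ['Belgium', 'Cambodia', 'Vanuatu']})
test_demo = pd.DataFrame({'Country': ['Montenegro', 'Belgium']})

# handle_unknown='ignore' encodes categories unseen during fit as an all-zero row
encoder_demo = OneHotEncoder(handle_unknown='ignore')
train_1hot = encoder_demo.fit_transform(train_demo).toarray()
test_1hot = encoder_demo.transform(test_demo).toarray()   # transform only, no refit

print(train_1hot.shape, test_1hot.shape)
print(test_1hot)
```

Both matrices end up with one column per training country, which is what a model fitted on the training matrix expects at prediction time.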

In [None]:
print("Country:",food_kg_train_coun_1hot.shape)

In [None]:
food_kg_train_num_scaled.shape

We **concatenate** the processed numerical and categorical features **into one matrix.**

In [None]:
food_kg_train_prepared = np.concatenate((food_kg_train_num_scaled, food_kg_train_coun_1hot), axis=1)
food_kg_test_prepared = np.concatenate((food_kg_test_num_scaled, food_kg_test_coun_1hot), axis=1)
# Print both shapes: a bare expression in the middle of a cell is not displayed
print(food_kg_train_prepared.shape, food_kg_test_prepared.shape)

Now **onto linear regression.**

### 3. Linear Regression <a class="anchor" id="Section_6_3"></a>

We try to **model mortality** through linear regression. 

In [None]:
food_kg_train_prepared

Let's train a **linear regression model** on the prepared *food_kg* training set.

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(food_kg_train_prepared, food_kg_train_labels)

Now **we predict.**

In [None]:
food_kg_train_predictions = lin_reg.predict(food_kg_train_prepared)

And we measure this regression model’s **RMSE** on the whole training set.

In [None]:
from sklearn.metrics import mean_squared_error

lin_mse = mean_squared_error(food_kg_train_labels, food_kg_train_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

Let's measure this regression model’s **MAE** on the whole training set.

In [None]:
from sklearn.metrics import mean_absolute_error

lin_mae = mean_absolute_error(food_kg_train_labels, food_kg_train_predictions)
lin_mae
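
To see how the two metrics differ, here is a tiny worked example on made-up numbers (not taken from the dataset). RMSE squares the errors before averaging, so it penalises the single large error more than MAE does.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])   # made-up targets
y_pred = np.array([2.0, 5.0, 4.0, 7.0])   # made-up predictions

errors = y_pred - y_true                   # [-1, 0, 2, 0]
mae = np.mean(np.abs(errors))              # (1 + 0 + 2 + 0) / 4 = 0.75
rmse = np.sqrt(np.mean(errors ** 2))       # sqrt((1 + 0 + 4 + 0) / 4), about 1.118

print('MAE: %.3f, RMSE: %.3f' % (mae, rmse))
```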

### 4. Random forest <a class="anchor" id="Section_6_4"></a>

Let's try a **random forest model** on the prepared *food_kg* training set.

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(food_kg_train_prepared, food_kg_train_labels)

Now **we predict** and measure the **RMSE** on the training set.

In [None]:
food_kg_train_predictions = forest_reg.predict(food_kg_train_prepared)
forest_mse = mean_squared_error(food_kg_train_labels, food_kg_train_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

Let's perform a **10-fold cross-validation.**

In [None]:
from sklearn.model_selection import cross_val_score

forest_scores = cross_val_score(forest_reg, food_kg_train_prepared, food_kg_train_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)

And display the **resulting scores:**

In [None]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

In [None]:
display_scores(forest_rmse_scores)
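
The gap between a training RMSE and a cross-validated RMSE can be illustrated on synthetic data. The sketch below is an assumption-laden toy (random features, a noisy linear target, a forest named `forest_demo`), meant only to show why the cross-validated figure is usually the more trustworthy one.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic regression data (assumption): a noisy linear signal on 5 features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 5))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=120)

forest_demo = RandomForestRegressor(n_estimators=50, random_state=42)
forest_demo.fit(X_demo, y_demo)

# RMSE on the data the forest was trained on (optimistic)
train_rmse = np.sqrt(mean_squared_error(y_demo, forest_demo.predict(X_demo)))

# RMSE estimated on held-out folds (closer to what unseen data would give)
cv_scores = cross_val_score(forest_demo, X_demo, y_demo,
                            scoring='neg_mean_squared_error', cv=10)
cv_rmse = np.sqrt(-cv_scores).mean()

print('training RMSE: %.3f' % train_rmse)
print('10-fold CV RMSE: %.3f' % cv_rmse)
```

A training RMSE far below the cross-validated one is a common sign that the forest overfits the training set.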

## Learning Curves analysis <a class="anchor" id="chapter7"></a>


We will use the following function to **plot learning curves with cross validation.** 

As seen in class 6, the function generates 3 plots: 
- the **test and training** learning curve, 
- the **training samples** vs **fit times curve,**
- the **fit times vs score curve.**

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

def plot_learning_curve(estimator, title, X, y, axes=None, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    
    if axes is None:
        _, axes = plt.subplots(1, 3, figsize=(20, 5))

    axes[0].set_title(title)
    if ylim is not None:
        axes[0].set_ylim(*ylim)
    axes[0].set_xlabel("Training examples")
    axes[0].set_ylabel("Score")

    train_sizes, train_scores, test_scores, fit_times, _ = \
        learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
                       train_sizes=train_sizes,
                       return_times=True)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fit_times_mean = np.mean(fit_times, axis=1)
    fit_times_std = np.std(fit_times, axis=1)

    # Plot learning curve
    axes[0].grid()
    axes[0].fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1,
                         color="r")
    axes[0].fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1,
                         color="g")
    axes[0].plot(train_sizes, train_scores_mean, 'o-', color="r",
                 label="Training score")
    axes[0].plot(train_sizes, test_scores_mean, 'o-', color="g",
                 label="Cross-validation score")
    axes[0].legend(loc="best")

    # Plot n_samples vs fit_times
    axes[1].grid()
    axes[1].plot(train_sizes, fit_times_mean, 'o-')
    axes[1].fill_between(train_sizes, fit_times_mean - fit_times_std,
                         fit_times_mean + fit_times_std, alpha=0.1)
    axes[1].set_xlabel("Training examples")
    axes[1].set_ylabel("fit_times")
    axes[1].set_title("Scalability of the model")

    # Plot fit_time vs score
    axes[2].grid()
    axes[2].plot(fit_times_mean, test_scores_mean, 'o-')
    axes[2].fill_between(fit_times_mean, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1)
    axes[2].set_xlabel("fit_times")
    axes[2].set_ylabel("Score")
    axes[2].set_title("Performance of the model")

    return plt

To define X and y below, we used this <a href='https://www.dataquest.io/blog/learning-curves-machine-learning/'>source</a>:

In [None]:
food_lc = food_kg.drop(['Country'], axis=1)
features = list(food_lc.columns)
target = 'Country'
X = food_kg[features]
y = food_kg[target]

In [None]:
from sklearn.model_selection import ShuffleSplit
cross_val_strategy = ShuffleSplit(n_splits=100,test_size=0.2)
plot_learning_curve(estimator=GaussianNB(), title='Learning Curves (Naive Bayes)', X=X, y=y, axes=None, ylim=None, cv=cross_val_strategy, n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5))
plt.show()

cv = ShuffleSplit(n_splits=10,test_size=0.2) 
plot_learning_curve(estimator=SVC(gamma=0.001), title=r"Learning Curves (SVM, RBF kernel, $\gamma=0.001$)", X=X, y=y, axes=None, ylim=None, cv=cv, train_sizes=np.linspace(.1, 1.0, 5))
plt.show()

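Nonetheless, these learning curves don't seem to give us any conclusive results, probably because the target, *Country*, is close to a unique label for each row, which leaves the classifiers little to generalize from. We are therefore going to **move on** and **refocus on the different diets and their impacts.**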

## Dimensionality Reduction <a class="anchor" id="chapter8"></a>

As seen in class 7, dimensionality reduction is a way to **reduce the number of features** in your dataset **without losing much information**, while preserving the model’s performance 
[(source)](https://www.analyticsvidhya.com/blog/2018/08/dimensionality-reduction-techniques-python/).
Our dataset is quite small, so we **only applied basic dimensionality reduction techniques.**

### 1. Missing Values filter <a class="anchor" id="Section_8_1"></a>

In the Data Exploration part, we created a *function* to show the **percentage of missing values** per category, and decided to **get rid** of the *Countries* that had missing values.

Indeed, we were faced with **two choices**: imputing the missing values or dropping the variable. Knowing we could not recover the missing data due to the **uncertainty** in terms of extraction date and population count taken into account for calculation, we went for the second option. 

**Such filtering on missing values is a form of dimensionality reduction.**

### 2. High Correlation filter <a class="anchor" id="Section_8_2"></a>

Another filter we used is the high correlation filter, for variables that have **similar trends** and are **likely to carry similar information.** 

Indeed, when we worked on correlations, we **focused on the most correlated diet attributes** to compare to the health states in different countries, and we **phased out some of them** (those that were too similar or contained in one another).

For **each covid state**, we thus focused **only** on *'Animal Products','Animal fats','Cereals - Excluding Beer','Eggs','Meat','Milk - Excluding Butter','Pulses','Starchy Roots','Sugar & Sweeteners'* and *'Vegetal Products'* **rather than the whole column assortment** provided in the first place.
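
A high correlation filter can also be automated. The sketch below is one possible version, not the manual selection described above: the helper `drop_highly_correlated`, the 0.9 threshold and the toy columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Drop one column out of every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Made-up columns: 'Meat' is almost a copy of 'Animal Products'
rng = np.random.default_rng(0)
animal = rng.normal(size=50)
demo = pd.DataFrame({'Animal Products': animal,
                     'Meat': animal + rng.normal(scale=0.01, size=50),
                     'Pulses': rng.normal(size=50)})

reduced, dropped = drop_highly_correlated(demo, threshold=0.9)
print(dropped)
print(list(reduced.columns))
```

For each correlated pair, the column that appears later is the one dropped, so the column order decides which variable survives; a manual choice guided by domain knowledge, as done in this notebook, avoids that arbitrariness.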

### 3. Random forest filter <a class="anchor" id="Section_8_3"></a>

Random forest can also be used for dimensionality reduction, as it offers a **built-in feature importance measure** that helps us select a **smaller subset of features** [(source)](https://www.analyticsvidhya.com/blog/2018/08/dimensionality-reduction-techniques-python/).

In [None]:
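# forest_reg was trained on the scaled numerical columns followed by the one-hot Country columns,
# so the feature names must follow that same order
features = list(food_kg_train_num.columns) + sorted(food_kg_train['Country'].unique())
importances = forest_reg.feature_importances_
indices = np.argsort(importances)[-10:]  # We focus on the top 10 features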

Now we plot the **feature importance graph.**

In [None]:
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

Based on the above graph, the key features are the **active cases rate** (contained in the confirmed rate we focused on), the **undernourished feature** and the **confirmed cases feature.** 

The **mortality rate is also among the top 10 important features**, and so are the **diet indicators** we kept through our correlation filter (*'Alcoholic Beverages'*, *'Milk - Excluding Butter'* and more).

## Results Visualization and Interpretation <a class="anchor" id="chapter9"></a>

### 1. Obesity vs undernutrition: a happy medium? <a class="anchor" id="Section_9_1"></a>

We have established that **obesity and undernutrition are correlated with covid cases.** We are now going to dive deeper by analyzing the **diet of the countries most affected by each of these health attributes**, and see how covid impacted them.

#### a. Obesity average diet <a class="anchor" id="Section_9_1_1"></a>

We take the **top 10 countries in terms of obesity rate** and put them in a variable so we can analyse and plot them :

In [None]:
obesity_set = food_kg[food_kg['Obesity'] == food_kg['Obesity']].sort_values(by='Obesity', ascending=False).head(10)
obesity_mean = obesity_set.describe().iloc[1]
obesity_mean = pd.DataFrame(obesity_mean).drop(['Deaths', 'Population','Undernourished','Obesity', 'Recovered', 'Confirmed', 'Active', 'ConfirmedScore','DeathsScore'], axis=0)
obesity_mean = obesity_mean.sort_values(by='mean', ascending=False).iloc[:11]
obesity_mean_plot = obesity_mean.plot.pie(subplots=True, figsize=(25, 10), autopct='%1.1f%%')

The pie chart above **looks a lot like the average diet pie chart we made earlier for world consumption.** The countries with the highest obesity rates seem to consume more vegetables than the world average.

#### b. Undernutrition average diet  <a class="anchor" id="Section_9_1_2"></a>

Just as we did above for the average obesity diet, we take the **top 10 countries in terms of undernourishment rate** and put them in a variable so we can analyse and plot them:

In [None]:
undernutrition_set = food_kg[food_kg['Undernourished'] == food_kg['Undernourished']].sort_values(by='Undernourished', ascending=False).head(10)
undernutrition_mean = undernutrition_set.describe().iloc[1]
undernutrition_mean = pd.DataFrame(undernutrition_mean).drop(['Deaths', 'Population','Undernourished','Obesity', 'Recovered', 'Confirmed', 'Active', 'ConfirmedScore','DeathsScore'], axis=0)
undernutrition_mean = undernutrition_mean.sort_values(by='mean', ascending=False).iloc[:11]
undernutrition_mean_plot = undernutrition_mean.plot.pie(subplots=True, figsize=(25, 10), autopct='%1.1f%%')

Now we can **easily spot the differences.** Here, undernourished populations consume **far fewer animal products** and **many more starchy roots** than either the world average or the most obese countries. Moreover, they seem to **consume slightly more alcoholic beverages.**
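
Since the same preparation steps are repeated for every pie chart, they could be factored into a helper. The function `mean_diet` below is a hypothetical sketch, not part of the original notebook, and it is demonstrated on a tiny made-up frame rather than on *food_kg*.

```python
import pandas as pd

# Columns that are not diet components, as dropped in the cells above
NON_DIET = ['Deaths', 'Population', 'Undernourished', 'Obesity', 'Recovered',
            'Confirmed', 'Active', 'ConfirmedScore', 'DeathsScore']

def mean_diet(df, rank_by, n_countries=10, n_items=11):
    """Mean diet of the n_countries rows with the highest rank_by value."""
    top = df.dropna(subset=[rank_by]).nlargest(n_countries, rank_by)
    means = top.mean(numeric_only=True).drop(labels=NON_DIET, errors='ignore')
    return means.sort_values(ascending=False).head(n_items)

# Tiny made-up frame with the same kind of columns (values are not real data)
demo = pd.DataFrame({'Country': ['A', 'B', 'C'],
                     'Obesity': [30.0, 10.0, 20.0],
                     'Meat': [4.0, 1.0, 2.0],
                     'Pulses': [1.0, 5.0, 3.0]})

top_two = mean_diet(demo, rank_by='Obesity', n_countries=2)
print(top_two)
```

Under these assumptions, the resulting Series could then be passed to `plot.pie(autopct='%1.1f%%')` to draw a chart similar to the ones above.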

### 2. Extreme covid results and diet: back to our 4 extremes <a class="anchor" id="Section_9_2"></a>

To better understand the four countries that we identified earlier and their dynamics, we are going to **analyze the diet of each of them**, to see whether it is linked to their results.

#### a. Belgium: the country with the most deaths relative to its population <a class="anchor" id="Section_9_2_1"></a>

In [None]:
belgium_case = food_kg[food_kg['Deaths'] == food_kg['Deaths'].max()]
belgium_case

It is important to underline that Belgium has a **24.5% obesity rate, which could partly explain the high mortality rate.**

In [None]:
belgium_case = belgium_case.describe().iloc[1]
belgium_diet = pd.DataFrame(belgium_case).drop(['Deaths', 'Population','Undernourished','Obesity', 'Recovered', 'Confirmed', 'Active', 'ConfirmedScore','DeathsScore'], axis=0)
belgium_diet = belgium_diet.sort_values(by='mean', ascending=False).iloc[:11]
belgium_diet_plot = belgium_diet.plot.pie(subplots=True, figsize=(25, 10), autopct='%1.1f%%')

We can see here that Belgium's population consumes **more animal products, milk (excluding butter) and alcohol than the world average**, or **even than the average obese diet.**

#### b. Montenegro: the country with the most confirmed cases relative to its population <a class="anchor" id="Section_9_2_2"></a>

In [None]:
montenegro_diet = food_kg[food_kg['Confirmed'] == food_kg['Confirmed'].max()]
montenegro_diet

Montenegro has a high obesity rate of **24.9%** too, and a **quite high mortality rate.**

In [None]:
montenegro_diet = montenegro_diet.describe().iloc[1]
montenegro_diet = pd.DataFrame(montenegro_diet).drop(['Deaths', 'Population','Undernourished','Obesity', 'Recovered', 'Confirmed', 'Active', 'ConfirmedScore','DeathsScore'], axis=0)
montenegro_diet = montenegro_diet.sort_values(by='mean', ascending=False).iloc[:11]
montenegro_diet_plot = montenegro_diet.plot.pie(subplots=True, figsize=(25, 10), autopct='%1.1f%%')

Montenegro's population consumes **even more animal products, meat and milk (excluding butter) than Belgium's.** 

#### c. Vanuatu: the country with the fewest confirmed cases relative to its population <a class="anchor" id="Section_9_2_3"></a>

In [None]:
vanuatu_diet = food_kg[food_kg['Confirmed'] == food_kg['Confirmed'].min()].sort_values(by='Population', ascending=False)
vanuatu_diet 

Vanuatu is very interesting, as the country also has a **high obesity rate of 23.5%**. Nevertheless, its mortality and confirmed rates are among the lowest. Its **diet may therefore be quite different** from the others', which could explain those differences:

In [None]:
vanuatu_diet  = vanuatu_diet.describe().iloc[1]
vanuatu_diet  = pd.DataFrame(vanuatu_diet).drop(['Deaths', 'Population','Undernourished','Obesity', 'Recovered', 'Confirmed', 'Active', 'ConfirmedScore','DeathsScore'], axis=0)
vanuatu_diet  = vanuatu_diet.sort_values(by='mean', ascending=False).iloc[:11]
vanuatu_diet_plot = vanuatu_diet.plot.pie(subplots=True, figsize=(25, 10), autopct='%1.1f%%')

Vanuatu's population consumes **far fewer animal products than the first two of our examples, and many more starchy roots and oilcrops.**

#### d. Cambodia: the country with the fewest deaths relative to its population <a class="anchor" id="Section_9_2_4"></a>

In [None]:
cambodia_diet = food_kg[food_kg['Deaths'] == food_kg['Deaths'].min()].sort_values(by='Population', ascending=False)
cambodia_diet

Finally, Cambodia has a **low obesity rate** and a **high undernourishment rate.** 

In [None]:
cambodia_diet = cambodia_diet.describe().iloc[1]
cambodia_diet = pd.DataFrame(cambodia_diet).drop(['Deaths', 'Population','Undernourished','Obesity', 'Recovered', 'Confirmed', 'Active', 'ConfirmedScore','DeathsScore'], axis=0)
cambodia_diet = cambodia_diet.sort_values(by='mean', ascending=False).iloc[:11]
cambodia_diet_plot = cambodia_diet.plot.pie(subplots=True, figsize=(25, 10), autopct='%1.1f%%')

Cambodia's population **consumes more starchy roots and cereals (excluding beer) than the world average.**

<br>
<b>Conclusions</b>

In conclusion, now that we have studied those pie charts, there seem to be **some patterns**. Firstly, as we already highlighted, **obesity appears to be a risk factor**, associated in most cases with a high mortality rate. Moreover, it seems that **alcoholic beverages could also be a factor**; we can notice it in the case of Belgium, which is quite similar to Montenegro but differs in terms of alcoholic beverage consumption. That could explain the difference in mortality rate between the two countries.

Nonetheless, our conclusions rely only on the factors available in this dataset. Other factors such as **age, other health problems**, as well as each **country's covid-related policies** could influence the confirmed and mortality rates. Indeed, the reaction of some countries such as Cambodia, which **promptly introduced a national lockdown, helped to control the spread of the virus** (<a href='https://blogs.worldbank.org/health/what-explains-cambodias-effective-emergency-health-response-covid-19-coronavirus'>source</a>).