# Dataset 

https://www.kaggle.com/mariaren/covid19-healthy-diet-dataset

# Problem definition 

We chose a dataset that combines food supply by category, obesity and undernourishment rates, and COVID-19 case counts for countries around the world.

The idea is to understand how a healthy diet could help combat the coronavirus by identifying the diet patterns of countries with a lower COVID-19 infection rate.

Our goal here is to provide diet recommendations based on our findings.

Each dataset measures a different aspect of the diet across food categories. Depending on what we want to focus on, we have:

- fat quantity,
- energy intake (kcal),
- food supply quantity (kg),
- protein for different categories of food

The following have been added to each dataset:

- obesity rate
- undernourished rate
- the most up-to-date confirmed, deaths, recovered, and active case counts.

We are going to focus on the fat quantity dataset.

In [1]:
import numpy as np
import pandas as pd

from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.tree import export_graphviz, plot_tree
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.svm import LinearSVC, SVC, SVR

Let's start by loading the data

In [5]:
fat_quantity = pd.read_csv("data/Fat_Supply_Quantity_Data.csv")

## Data Exploration and Processing

Now let's explore the dataset:
- check the head
- the columns
- the variable types

In [24]:
display(fat_quantity.head())
print(fat_quantity.columns)
df_types = pd.DataFrame(fat_quantity.dtypes).reset_index()
df_types.columns = ['features','type']
display(df_types.sort_values(by='type'))

Unnamed: 0,Country,Alcoholic Beverages,Animal Products,Animal fats,"Aquatic Products, Other",Cereals - Excluding Beer,Eggs,"Fish, Seafood",Fruits - Excluding Wine,Meat,...,Vegetable Oils,Vegetables,Obesity,Undernourished,Confirmed,Deaths,Recovered,Active,Population,Unit (all except Population)
0,Afghanistan,0.0,21.6397,6.2224,0.0,8.0353,0.6859,0.0327,0.4246,6.1244,...,17.0831,0.3593,4.5,29.8,0.125149,0.005058,0.098263,0.021827,38928000.0,%
1,Albania,0.0,32.0002,3.4172,0.0,2.6734,1.6448,0.1445,0.6418,8.7428,...,9.2443,0.6503,22.3,6.2,1.733298,0.0358,0.87456,0.822939,2838000.0,%
2,Algeria,0.0,14.4175,0.8972,0.0,4.2035,1.2171,0.2008,0.5772,3.8961,...,27.3606,0.5145,26.6,3.9,0.208754,0.005882,0.137268,0.065604,44357000.0,%
3,Angola,0.0,15.3041,1.313,0.0,6.5545,0.1539,1.4155,0.3488,11.0268,...,22.4638,0.1231,6.8,25.0,0.050049,0.001144,0.02744,0.021465,32522000.0,%
4,Antigua and Barbuda,0.0,27.7033,4.6686,0.0,3.2153,0.3872,1.5263,1.2177,14.3202,...,14.4436,0.2469,19.1,,0.15102,0.005102,0.140816,0.005102,98000.0,%


Index(['Country', 'Alcoholic Beverages', 'Animal Products', 'Animal fats',
       'Aquatic Products, Other', 'Cereals - Excluding Beer', 'Eggs',
       'Fish, Seafood', 'Fruits - Excluding Wine', 'Meat', 'Miscellaneous',
       'Milk - Excluding Butter', 'Offals', 'Oilcrops', 'Pulses', 'Spices',
       'Starchy Roots', 'Stimulants', 'Sugar Crops', 'Sugar & Sweeteners',
       'Treenuts', 'Vegetal Products', 'Vegetable Oils', 'Vegetables',
       'Obesity', 'Undernourished', 'Confirmed', 'Deaths', 'Recovered',
       'Active', 'Population', 'Unit (all except Population)'],
      dtype='object')


Unnamed: 0,features,type
15,Spices,float64
26,Confirmed,float64
29,Active,float64
24,Obesity,float64
23,Vegetables,float64
22,Vegetable Oils,float64
21,Vegetal Products,float64
20,Treenuts,float64
19,Sugar & Sweeteners,float64
18,Sugar Crops,float64


Let's create a function to **check missing data** and unveil **the percentage of data missing** for each dataframe

In [79]:
from pathlib import Path

# Import the dataframes
files = ['data/Fat_Supply_Quantity_Data.csv',
         'data/Food_Supply_kcal_Data.csv',
         'data/Food_Supply_Quantity_kg_Data.csv',
         'data/heart.csv',
         'data/Protein_Supply_Quantity_Data.csv',
         'data/Supply_Food_Data_Descriptions.csv']

# Path(f).stem keeps the file name without directory or extension.
# str.strip() removes a set of characters rather than a prefix/suffix,
# so it would also drop the final "s" of "Descriptions".
dfs = []
for f in files:
    df = pd.read_csv(f)
    df.Name = Path(f).stem
    dfs.append(df)
df_FSQD, df_FSKD, df_FSQKD, df_H, df_PSQD, df_SFDD = dfs


In [80]:
def check_missing_data(df_list: list):
    """Print the percentage of missing cells for each dataframe in df_list."""
    for df in df_list:
        missing_cell = df.isna().sum().sum()  # total number of NaN cells
        n_cell = df.size                      # rows * columns
        nan_percent = missing_cell / n_cell * 100
        print(f"missing cell({df.Name}) = \t{np.round(nan_percent,3)}%\t({missing_cell}/{n_cell})")

In [81]:
# pd.DataFrame(fat_quantity.isna().sum(axis=0)/fat_quantity.size*100).sort_values(by=0,ascending=False)
check_missing_data(dfs)


missing cell(Fat_Supply_Quantity_Data) = 	0.662%	(36/5440)
missing cell(Food_Supply_kcal_Data) = 	0.662%	(36/5440)
missing cell(Food_Supply_Quantity_kg_Data) = 	0.662%	(36/5440)
missing cell(heart) = 	0.0%	(0/4242)
missing cell(Protein_Supply_Quantity_Data) = 	0.662%	(36/5440)
missing cell(Supply_Food_Data_Description) = 	0.0%	(0/46)


Delete the countries for which values are missing.

In [93]:
display(fat_quantity[fat_quantity.isna().any(axis=1)])
df_FSQD = df_FSQD.dropna(axis=0,how='any')

Unnamed: 0,Country,Alcoholic Beverages,Animal Products,Animal fats,"Aquatic Products, Other",Cereals - Excluding Beer,Eggs,"Fish, Seafood",Fruits - Excluding Wine,Meat,...,Vegetable Oils,Vegetables,Obesity,Undernourished,Confirmed,Deaths,Recovered,Active,Population,Unit (all except Population)
4,Antigua and Barbuda,0.0,27.7033,4.6686,0.0,3.2153,0.3872,1.5263,1.2177,14.3202,...,14.4436,0.2469,19.1,,0.15102,0.005102,0.140816,0.005102,98000.0,%
10,Bahamas,0.0,30.2259,4.56,0.0,3.6327,1.2829,1.4991,0.8995,17.4941,...,10.1659,0.3974,32.1,,1.952672,0.041476,1.544529,0.366667,393000.0,%
26,Canada,0.0,23.1663,9.7895,0.0,1.544,1.2437,0.4244,0.4244,7.8211,...,20.4733,0.222,31.3,<2.5,1.238073,0.035533,1.008172,,38190000.0,%
29,Chile,0.0,32.5206,5.2608,0.0,3.1847,1.4085,0.5204,0.4186,20.2172,...,10.844,0.2602,28.8,2.7,2.947252,0.081823,2.807078,,19470000.0,%
52,French Polynesia,0.0,25.1596,4.6648,0.0,3.7271,0.822,1.5842,0.1995,14.0782,...,14.0024,0.1397,,4.2,,,,,280000.0,%
59,Grenada,0.0,23.0842,2.4889,0.0,1.4883,1.444,1.2856,2.8436,12.9576,...,16.8841,0.1077,20.2,,0.061062,0.0,0.036283,0.024779,113000.0,%
80,Kiribati,0.0,10.5699,0.7467,0.0,1.956,0.265,3.0833,0.3517,6.1955,...,7.2939,0.1397,45.6,2.7,,,,,125000.0,%
81,"Korea, North",0.0,11.4709,0.0262,0.0,7.2234,1.6911,0.5375,0.6424,8.7179,...,27.5957,1.0357,7.1,47.8,,,,,25779000.0,%
105,Myanmar,0.0,26.047,3.0066,0.0,3.8192,0.9251,2.5441,0.275,18.1335,...,12.7453,0.2938,5.7,10.6,,,,,54704000.0,%
109,New Caledonia,0.0,22.2697,3.6599,0.0,5.0979,1.0439,0.8385,0.2222,11.6086,...,16.8532,0.1887,,7.1,,,,,295000.0,%


Look at the different data types for each variable.

In [103]:
num_col = df_FSQD.describe().columns.to_list()
num_col = df_FSQD[num_col].corr().sort_values(by='Deaths').index.tolist()

In [104]:
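# Requires the seaborn package to be installed.
import seaborn as sns
# Restrict the correlation matrix to numeric columns ('Country' and the unit column are text).
sns.heatmap(df_FSQD.corr(numeric_only=True))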


ModuleNotFoundError: No module named 'seaborn'

Explore the variables that are not of float type and see if you can convert them to float type.
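
In the rows displayed earlier, `Undernourished` contains the string `<2.5`, which is enough to make pandas read the whole column as `object`. Here is a minimal sketch of one possible conversion on a toy frame. Treating `<2.5` as its upper bound 2.5 is an assumption made for illustration; replacing it with NaN or another value is just as defensible.

```python
import pandas as pd

# Toy frame mimicking the 'Undernourished' column, where "<2.5" forces dtype object.
toy = pd.DataFrame({"Country": ["Canada", "Chile", "Grenada"],
                    "Undernourished": ["<2.5", "2.7", None]})

# Assumption: treat the censored value "<2.5" as its upper bound 2.5.
cleaned = toy["Undernourished"].str.replace("<", "", regex=False)

# errors="coerce" turns anything still unparsable into NaN instead of raising.
toy["Undernourished"] = pd.to_numeric(cleaned, errors="coerce")

print(toy.dtypes)
```

The same two steps could be applied to the real dataframe before calling `dropna`, so that the column takes part in the numeric analysis.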

# Clustering

## Data preparation

Scale the dataset
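
One possible way to do this is with `StandardScaler`, already imported at the top of the notebook. The sketch below uses a small hypothetical frame with two of the columns; the values are copied from the first rows shown above and serve only as an illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small stand-in for two numeric columns of the cleaned dataframe.
toy = pd.DataFrame({"Obesity": [4.5, 22.3, 26.6, 6.8],
                    "Deaths": [0.005058, 0.0358, 0.005882, 0.001144]})

# StandardScaler centres each column on 0 and divides by its standard deviation.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy),
                      columns=toy.columns, index=toy.index)

print(scaled.mean())        # close to 0 for every column
print(scaled.std(ddof=0))   # close to 1 for every column (population std)
```

Scaling matters here because K-means and the other distance-based methods would otherwise be dominated by the column with the largest range.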

## Plot some data

Now, we want to visualize some variables for each country. To do so, we use Plotly Express, which lets us hover over a scatter plot and read the statistics of each country more clearly, as explained in the Plotly documentation:

https://plotly.com/python/hover-text-and-formatting/#:~:text=Basic%20Charts%20tutorials.-,Hover%20Labels,having%20a%20hover%20label%20appear.

Plot the "Obesity" vs "Deaths" statistics

Plot the "Animal fats" vs "Deaths" statistics

## K-means and Elbow method

We start with the K-Means model:
- use the scikit-learn method
- use the method you implemented.

Use a graphical tool, the elbow method, to estimate the optimal number of clusters k for a given task.
- Determine the optimal number of clusters for the previous 2 plots.

In [1]:
from sklearn.cluster import KMeans
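
The elbow method can be sketched as follows: fit K-means for several values of k and record the inertia (within-cluster sum of squared distances) each time. The data below is synthetic (three artificial blobs standing in for the scaled "Obesity" vs "Deaths" points), and the range of k is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled 2-D points: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

# Inertia tends to decrease as k grows; the "elbow" is where the decrease flattens.
ks = range(1, 8)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 2))
```

Plotting `inertias` against `ks` (for example with `plt.plot(list(ks), inertias, marker="o")`) gives the elbow curve; the value of k where the curve bends is a reasonable candidate for the number of clusters.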


Plot the obtained clusters

## Other clustering methods

We are going to explore other clustering methods, such as Mean-Shift.

You can read more about it in the following resource:
https://scikit-learn.org/stable/modules/clustering.html


Apply the method to our datasets made of 2 variables ("Obesity" vs "Deaths")

In [2]:
# Mean-Shift
from sklearn.cluster import MeanShift, estimate_bandwidth
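
A minimal sketch of how these two imports fit together, on synthetic data (two artificial blobs, not the real "Obesity" vs "Deaths" values). The `quantile=0.5` setting is an assumption chosen for this toy example and would need tuning on the real data.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Synthetic stand-in for the 2-D dataset: two well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(40, 2)) for c in centers])

# estimate_bandwidth derives a kernel width from nearest-neighbour distances.
bandwidth = estimate_bandwidth(X, quantile=0.5, random_state=0)

# Mean-Shift does not need the number of clusters in advance.
ms = MeanShift(bandwidth=bandwidth).fit(X)
n_clusters = len(np.unique(ms.labels_))

print("bandwidth:", bandwidth)
print("number of clusters found:", n_clusters)
print(ms.cluster_centers_)
```

Unlike K-means, the number of clusters is a result of the fit rather than an input; it is controlled indirectly through the bandwidth.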


Plot the obtained clusters

Check out other algorithms such as DBSCAN or OPTICS. Why are these algorithms interesting, and in which cases?

In [36]:
from sklearn.cluster import DBSCAN
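
As a starting point, here is a small sketch of DBSCAN on synthetic data: two dense blobs plus one isolated point. The values of `eps` and `min_samples` are illustrative assumptions, not tuned settings for the COVID dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data: two dense blobs and one far-away outlier (last row).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.2, size=(40, 2))
blob_b = rng.normal(loc=[4.0, 4.0], scale=0.2, size=(40, 2))
outlier = np.array([[10.0, -10.0]])
X = np.vstack([blob_a, blob_b, outlier])

# eps: neighbourhood radius; min_samples: neighbours needed to form a dense region.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_

# DBSCAN marks points that belong to no dense region with the label -1.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("clusters:", n_clusters, "noise points:", n_noise)
```

Density-based methods like this one do not require the number of clusters up front, can follow non-spherical cluster shapes, and flag isolated points as noise instead of forcing them into a cluster.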


# Regression and prediction

Given this dataset and the emphasis we have already placed on deaths through clustering, it would be interesting to study this dataset for a regression purpose and see how accurately we can predict the mortality rate as a function of the given features.

## Creating train and test sets 

Let's separate the data into training and test sets using random selection.

Now drop the labels from the training set and create a new variable for the labels.

Scale the datasets.
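
The three steps above could look like the sketch below. The frame is synthetic (random numbers under three column names borrowed from the dataset), and the 80/20 split and `random_state=42` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned dataframe (random values, real column names).
rng = np.random.default_rng(42)
toy = pd.DataFrame(rng.random((50, 3)),
                   columns=["Animal fats", "Obesity", "Deaths"])

# 1. Random split into training and test sets.
train_set, test_set = train_test_split(toy, test_size=0.2, random_state=42)

# 2. Separate the features from the label ("Deaths").
X_train = train_set.drop(columns="Deaths")
y_train = train_set["Deaths"].copy()
X_test = test_set.drop(columns="Deaths")
y_test = test_set["Deaths"].copy()

# 3. Fit the scaler on the training set only, then apply it to both sets,
#    so that no information from the test set leaks into training.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```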

## Random Forest

Let's try a random forest model on the prepared fat_quantity training set.

RandomForestRegressor(random_state=42)

Now we predict.

Let's perform a 10-fold cross-validation and display the resulting scores:
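
A possible sketch with `cross_val_score`, on synthetic data in which the target depends on the first feature only (a made-up relation, used so the example runs on its own). Reporting RMSE is a choice made here, not something the notebook prescribes.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for the prepared training set.
rng = np.random.default_rng(42)
X = rng.random((100, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.05, size=100)

forest = RandomForestRegressor(n_estimators=50, random_state=42)

# scikit-learn maximises scores, so the MSE is returned negated.
scores = cross_val_score(forest, X, y,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)

print("RMSE per fold:", rmse_scores)
print("mean:", rmse_scores.mean(), "std:", rmse_scores.std())
```

The standard deviation across folds gives a rough idea of how much the estimate depends on the particular split.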

## Learning Curves analysis 

Use the function seen in **Module 1** to plot learning curves with cross-validation.

In [31]:
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, axes=None, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    pass

Try to interpret the obtained learning curve.
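
To help with the interpretation, the numbers behind a learning curve can be inspected directly. The sketch below calls `learning_curve` on synthetic data (it is not the Module 1 plotting function) and prints the mean training and validation scores for each training-set size; the estimator settings and `cv=5` are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

# Synthetic regression problem standing in for the prepared training set.
rng = np.random.default_rng(0)
X = rng.random((120, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.05, size=120)

train_sizes, train_scores, valid_scores = learning_curve(
    RandomForestRegressor(n_estimators=30, random_state=0),
    X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5), scoring="r2")

# One row per training-set size, one column per cross-validation fold.
for n, tr, va in zip(train_sizes,
                     train_scores.mean(axis=1),
                     valid_scores.mean(axis=1)):
    print(f"n={n:3d}  train R2={tr:.3f}  validation R2={va:.3f}")
```

A large, persistent gap between the training and validation scores suggests overfitting, while two curves that plateau close together at a low score suggest underfitting.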

Perform a grid search to try to obtain the best hyperparameters. What is the best score that you obtained?
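
A minimal sketch of such a search with `GridSearchCV`, again on synthetic data. The parameter grid is a small illustrative assumption; a realistic grid would cover more values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the prepared training set.
rng = np.random.default_rng(42)
X = rng.random((100, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.05, size=100)

# Illustrative grid: every combination is evaluated with 5-fold cross-validation.
param_grid = {"n_estimators": [10, 30], "max_features": [1, 2, 3]}

search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best RMSE:", np.sqrt(-search.best_score_))
```

After fitting, `search.best_estimator_` holds a model refitted on the whole training data with the best parameters found.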

## SVM

Use the SVM regressor to estimate the death rate. See if you can get a better model than with the Random forest regressor.
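
One way to set this up is sketched below on synthetic data. The RBF kernel and the values of `C` and `epsilon` are assumptions; they would have to be tuned before comparing against the random forest.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression problem standing in for the prepared training set.
rng = np.random.default_rng(42)
X = rng.random((100, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.05, size=100)

# SVMs are sensitive to feature scales, so scaling is part of the pipeline
# and is refitted inside each cross-validation fold.
svm_reg = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.01))

scores = cross_val_score(svm_reg, X, y,
                         scoring="neg_mean_squared_error", cv=10)
svm_rmse = np.sqrt(-scores)
print("mean RMSE:", svm_rmse.mean(), "std:", svm_rmse.std())
```

Using the same scoring and the same number of folds as for the random forest keeps the two models directly comparable.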

## Linear regression

# Dimensionality reduction

Let's take a look at the whole dataset and see if there are any clusters.

To do this, perform and plot a PCA with 2 components.
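
The projection step could be sketched as follows. The input is a synthetic matrix standing in for the numeric columns of the dataset, and scaling before PCA is an assumption (a common one, since PCA is sensitive to feature scales).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric columns of the dataset (60 rows, 6 features).
rng = np.random.default_rng(0)
X = rng.random((60, 6))

# Standardise first so that no single feature dominates the components.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)
print("explained variance ratio:", pca.explained_variance_ratio_)
```

The two columns of `X_2d` can then be passed to a scatter plot, and `explained_variance_ratio_` indicates how much of the total variance the 2-D view retains.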

Dimensionality reduction is a way to reduce the number of features in your dataset without losing much information, while preserving the model’s performance. Check out the Random Forest based method and PCA for dimensionality reduction in the following resource:

https://www.analyticsvidhya.com/blog/2018/08/dimensionality-reduction-techniques-python/

## Random Forest feature selection

Plot the feature importance graph.
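
A sketch of how the importances can be extracted and ranked. The data is synthetic: the column names come from the dataset, but the values are random and the target is deliberately built from "Obesity" alone, so the ranking printed here says nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features under real column names; values are random.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((200, 4)),
                 columns=["Animal fats", "Obesity", "Vegetables", "Eggs"])
# Made-up target that depends only on "Obesity" (toy relation for illustration).
y = 3.0 * X["Obesity"] + rng.normal(scale=0.05, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ sums to 1; higher means the feature was more useful for splits.
importances = pd.Series(forest.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
print(importances)
```

Calling `importances.plot.barh()` on the resulting Series is one way to obtain the graph.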

Comment on the graph.

## PCA dimensionality reduction

PCA is a technique that extracts a new, smaller set of variables from an existing large set of variables. Apply clustering methods to this new set of variables. Do the clusters differ from those obtained on the "Obesity" vs "Deaths" data?

Apply the elbow method to determine the right number of clusters.

Use diverse methods to cluster the countries.