**The goal of this analysis is to find out how a country's diet correlates with its COVID-19 mortality rate. Food cultures differ widely across the world, so it is interesting to see which food categories best predict a country's death rate.**

# Import packages

In [None]:
! pip install pandas numpy matplotlib plotly dash-core-components scikit-learn dash missingpy yellowbrick xgboost seaborn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objs as go
import dash_core_components as dcc
import sklearn
import plotly.figure_factory as ff
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Exploratory Data Analysis

In [None]:
df_fat_quantity = pd.read_csv(r'../input/covid19-healthy-diet-dataset/Fat_Supply_Quantity_Data.csv')
df_food_quantity = pd.read_csv(r'../input/covid19-healthy-diet-dataset/Food_Supply_Quantity_kg_Data.csv')
df_food_kcal = pd.read_csv(r'../input/covid19-healthy-diet-dataset/Food_Supply_kcal_Data.csv')
df_protein_quantity = pd.read_csv(r'../input/covid19-healthy-diet-dataset/Protein_Supply_Quantity_Data.csv')
df_food_description = pd.read_csv(r'../input/covid19-healthy-diet-dataset/Supply_Food_Data_Descriptions.csv')

### Food Quantity Data 

In [None]:
df_food_quantity.describe()

In [None]:
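# numeric_only=True skips text columns such as Country (required by recent pandas versions)
x=df_food_quantity.corr(method='pearson', numeric_only=True).round(3)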
ff.create_annotated_heatmap(z=x[['Deaths']].sort_values(by=['Deaths'],ascending=False).values, x = ['Deaths'], y=x[['Deaths']].sort_values(by=['Deaths'],ascending=False).index.to_list(), showscale=True, colorscale='Viridis')

In [None]:
x=df_food_quantity.corr(method='pearson', numeric_only=True).round(3)
ff.create_annotated_heatmap(z=x.values, x = x.index.tolist(), y=x.index.to_list(), showscale=True, colorscale='Viridis')

In [None]:
df_food_quantity.Deaths = df_food_quantity.Deaths.astype(np.float32)*100.0

In [None]:
df_food_quantity.Deaths = df_food_quantity.Deaths.map(lambda val: np.log(val + 1))

In [None]:
df_food_quantity['Undernourished'] = df_food_quantity.apply(lambda row: 2.5 if row['Undernourished'] == '<2.5' else float(row['Undernourished']), axis = 1)

#### Distributions  

In [None]:
for feature_name in ["Oilcrops",'Alcoholic Beverages', 'Animal fats', 'Animal Products','Milk - Excluding Butter', 'Obesity', 'Vegetal Products']:
    fig = go.Figure(px.histogram(df_food_quantity[feature_name],nbins=7))
    fig.show()

By visualizing the histograms we can conclude the following:

Animal Products, Obesity and Vegetal Products have a roughly normal distribution. We'll probably just scale their values using the z-score formula.
Alcoholic Beverages, Animal fats and Milk - Excluding Butter, on the other hand, present a right-skewed distribution. Log scaling may help bring these features closer to a normal distribution.
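
To make the two scaling choices concrete, here is a minimal sketch on synthetic data. The column names and values below are made up for illustration and are not taken from the dataset. It standardizes a roughly symmetric feature with a z-score and applies `np.log1p` to a right-skewed one; using `log1p` rather than `log` is an assumption made here so that zero values stay finite.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for the real columns (illustrative values only)
demo = pd.DataFrame({
    "symmetric_feature": rng.normal(loc=20.0, scale=5.0, size=500),
    "skewed_feature": rng.lognormal(mean=0.0, sigma=1.0, size=500),
})

# z-score: subtract the mean, divide by the standard deviation
sym = demo["symmetric_feature"]
demo["symmetric_z"] = (sym - sym.mean()) / sym.std()

# Log scaling: log1p(x) = log(1 + x) compresses the long right tail
demo["skewed_log"] = np.log1p(demo["skewed_feature"])

print(demo[["symmetric_z", "skewed_log"]].describe().round(3))
print("skew before:", round(demo["skewed_feature"].skew(), 3))
print("skew after: ", round(demo["skewed_log"].skew(), 3))
```

After the z-score the feature has mean 0 and standard deviation 1, while the log-scaled feature keeps its ordering but has a much shorter right tail.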

In [None]:
df_food_quantity = df_food_quantity.drop('Unit (all except Population)', axis=1)

In [None]:
df_food_quantity = df_food_quantity.dropna()

In [None]:
df_food_quantity.iloc[:,1:24] = df_food_quantity.iloc[:, 1:24] * 2

In [None]:
# Mortality  = Deaths / Confirmed
df_food_quantity['Mortality'] = df_food_quantity['Deaths'] / df_food_quantity['Confirmed']

In [None]:
# Distributions
fig = px.bar(df_food_quantity, x = "Country", y ="Confirmed").update_xaxes(categoryorder="total descending")
fig.show()

In [None]:
fig = px.bar(df_food_quantity, x = "Country", y ="Deaths").update_xaxes(categoryorder="total descending")
fig.show()

In [None]:
# Distributions

# Yemen's mortality seems heavily overestimated (outlier)
fig = px.bar(df_food_quantity, x = "Country", y ="Mortality").update_xaxes(categoryorder="total descending")
fig.show()

In [None]:
fig = px.scatter(df_food_quantity, x="Confirmed", y = "Deaths",size = "Active", hover_name='Country', log_x=False,
                 size_max=30, trendline = "ols", marginal_x = "box",marginal_y = "violin", template="simple_white")
fig.show()

In [None]:
# Investigate: does obesity rate affect impact of COVID-19

In [None]:
fig = px.scatter(df_food_quantity[df_food_quantity.Country != 'Yemen'], x="Mortality", y = "Obesity", size = "Active", hover_name='Country', log_x=False,
                 size_max=30, template="simple_white")

fig.add_shape(
        # Line Horizontal
            type="line",
            x0=0,
            y0=df_food_quantity[df_food_quantity.Country != 'Yemen']['Obesity'].mean(),
            x1=df_food_quantity[df_food_quantity.Country != 'Yemen']['Mortality'].max(),
            y1=df_food_quantity[df_food_quantity.Country != 'Yemen']['Obesity'].mean(),
            line=dict(
                color="crimson",
                width=4
            ),
    )

fig.show()

In [None]:
fig = px.scatter(df_food_quantity, x="Deaths", y = "Obesity", size = "Mortality", hover_name='Country', log_x=False,
                 size_max=30, template="simple_white")

fig.add_shape(
        # Line Horizontal
            type="line",
            x0=0,
            y0=df_food_quantity['Obesity'].mean(),
            x1=df_food_quantity['Deaths'].max(),
            y1=df_food_quantity['Obesity'].mean(),
            line=dict(
                color="crimson",
                width=4
            ),
    )

fig.show()

### Animal Products: Which products make the difference between high- and low-obesity countries?

In [None]:
df_high_ob = df_food_quantity[df_food_quantity.Obesity > df_food_quantity['Obesity'].mean()]
df_low_ob = df_food_quantity[df_food_quantity.Obesity <= df_food_quantity['Obesity'].mean()]

In [None]:
animal_features = ['Animal fats', 'Aquatic Products, Other', 'Eggs', 'Fish, Seafood', 'Meat',
                   'Milk - Excluding Butter', 'Offals']
vegetal_features = ['Alcoholic Beverages', 'Cereals - Excluding Beer', 'Fruits - Excluding Wine', 'Miscellaneous', 'Oilcrops', 'Pulses',
                    'Spices', 'Starchy Roots', 'Stimulants', 'Sugar & Sweeteners', 'Sugar Crops', 'Treenuts',
                    'Vegetable Oils', 'Vegetables']

#### Animal products

In [None]:
fig = px.pie(values = df_high_ob[animal_features].mean().tolist(), names = animal_features,
             title='Mean food intake by Animal products groups - High Obesity Countries')
fig.show()

In [None]:
fig = px.pie(values = df_low_ob[animal_features].mean().tolist(), names = animal_features,
             title='Mean food intake by Animal products groups - Low Obesity Countries')
fig.show()

#### Vegetal products

In [None]:
fig = px.pie(values = df_high_ob[vegetal_features].mean().tolist(), names = vegetal_features,
             title='Mean food intake by Vegetal products groups - High Obesity Countries')
fig.show()

fig = px.pie(values = df_low_ob[vegetal_features].mean().tolist(), names = vegetal_features,
             title='Mean food intake by Vegetal products groups - Low Obesity Countries')
fig.show()

#### High- vs. low-obesity countries

In [None]:
df_food_quantity['ObesityAboveAverage'] = (df_food_quantity["Obesity"] > df_food_quantity['Obesity'].mean()).astype(int)

In [None]:
fig = px.scatter(df_food_quantity, x = 'Animal Products', y ='Vegetal Products',
                 color='ObesityAboveAverage', hover_name = 'Country')
fig.show()

In [None]:
fig = px.bar(df_food_quantity, x = "Country", y ="Deaths", facet_col = "ObesityAboveAverage")
fig.update_xaxes(matches=None,categoryorder="total descending")
fig.show()

In the figure above, the countries with above-average obesity rates appear to be hit harder by COVID-19.
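
A bar chart makes this comparison by eye; a numerical summary per group is a useful complement. The sketch below uses a small made-up table (the column names mirror the notebook's, the values are hypothetical) and compares the mean and median of `Deaths` between the two obesity groups.

```python
import pandas as pd

# Hypothetical values, for illustration only
demo = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E", "F"],
    "Obesity": [30.0, 28.0, 25.0, 8.0, 6.0, 10.0],
    "Deaths":  [0.09, 0.07, 0.05, 0.01, 0.02, 0.03],
})

# Same flag as in the notebook: 1 if obesity is above the overall mean
demo["ObesityAboveAverage"] = (demo["Obesity"] > demo["Obesity"].mean()).astype(int)

# Mean, median and group size of Deaths per obesity group
summary = demo.groupby("ObesityAboveAverage")["Deaths"].agg(["mean", "median", "count"])
print(summary)
```

On the real data, the same `groupby` call on `df_food_quantity` would put numbers next to the visual impression from the faceted bar chart.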

In [None]:
fig = px.bar(df_food_quantity, x = "Country", y ="Confirmed", facet_col = "ObesityAboveAverage")
fig.update_xaxes(matches=None,categoryorder="total descending")
fig.show()

In [None]:
fig = px.bar(df_food_quantity, x = "Country", y ="Recovered", facet_col = "ObesityAboveAverage")
fig.update_xaxes(matches=None,categoryorder="total descending")
fig.show()

## Goal: predict Mortality 

In [None]:
fig = px.scatter_matrix(df_food_quantity[['Meat', 'Milk - Excluding Butter', 'Fish, Seafood',
                         'Cereals - Excluding Beer', 'Obesity','Mortality']])
fig.show()

### Diet vs COVID19 

In [None]:
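# Exclude the binary obesity flag (the column is named 'ObesityAboveAverage') and non-numeric columns
corr_food=df_food_quantity.loc[:, df_food_quantity.columns != 'ObesityAboveAverage'].corr(method='pearson', numeric_only=True)
corr_final=corr_food.abs().unstack().sort_values(ascending = False)
# Drop self-correlations (a variable paired with itself) rather than a hard-coded number of rows
corr_final = corr_final[corr_final.index.get_level_values(0) != corr_final.index.get_level_values(1)]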
corr_confirmed = corr_final['Confirmed'].head(15)
corr_confirmed = corr_confirmed.drop(['Recovered', 'Deaths', 'Active', 'Undernourished', 'Obesity'])
corr_deaths = corr_final['Deaths'].head(15)
corr_deaths = corr_deaths.drop(['Recovered', 'Confirmed', 'Active', 'Undernourished', 'Obesity'])
corr_recovered = corr_final['Recovered'].head(14)
corr_recovered = corr_recovered.drop(['Confirmed', 'Deaths', 'Undernourished', 'Obesity'])

In [None]:
corr_heatmap=df_food_quantity[['Deaths','Animal Products','Animal fats','Cereals - Excluding Beer','Eggs','Meat','Milk - Excluding Butter','Pulses','Starchy Roots','Sugar & Sweeteners','Vegetal Products']]
x=corr_heatmap.corr(method='pearson')
fig = go.Figure(ff.create_annotated_heatmap(z=x[['Deaths']].sort_values(by=['Deaths'],ascending=False).values, x = ['Deaths'], y=x[['Deaths']].sort_values(by=['Deaths'],ascending=False).index.to_list(), colorscale='Viridis'))
fig.show()

corr_heatmap=df_food_quantity[['Confirmed','Animal Products','Animal fats','Cereals - Excluding Beer','Eggs','Meat','Milk - Excluding Butter','Pulses','Starchy Roots','Sugar & Sweeteners','Vegetal Products']]
x=corr_heatmap.corr(method='pearson')
fig = go.Figure(ff.create_annotated_heatmap(z=x[['Confirmed']].sort_values(by=['Confirmed'],ascending=False).values, x = ['Confirmed'], y=x[['Confirmed']].sort_values(by=['Confirmed'],ascending=False).index.to_list(), colorscale='Viridis'))
fig.show()

corr_heatmap=df_food_quantity[['Recovered','Animal Products','Animal fats','Cereals - Excluding Beer','Eggs','Meat','Milk - Excluding Butter','Pulses','Starchy Roots','Sugar & Sweeteners','Vegetal Products']]
x=corr_heatmap.corr(method='pearson')
fig = go.Figure(ff.create_annotated_heatmap(z=x[['Recovered']].sort_values(by=['Recovered'],ascending=False).values, x = ['Recovered'], y=x[['Recovered']].sort_values(by=['Recovered'],ascending=False).index.to_list(), colorscale='Viridis'))
fig.show()


#### Health diet vs COVID19

In [None]:
corr_heatmap=df_food_quantity[['Deaths','Confirmed','Recovered','Obesity','Undernourished', 'Mortality']]
x=corr_heatmap.corr(method='pearson').round(3)
ff.create_annotated_heatmap(z=x.values, x=x.columns.to_list(), y=x.columns.to_list(), colorscale='Viridis', showscale=True)

#### Obesity average diet

In [None]:
obesity_set = df_food_quantity[df_food_quantity['Obesity'].notna()].sort_values(by='Obesity', ascending=False).head(10)
obesity_mean = obesity_set.describe().iloc[1]
obesity_mean = pd.DataFrame(obesity_mean).drop(['Deaths', 'Population','Undernourished','Obesity', 'Recovered', 'Confirmed', 'Active'], axis=0)
obesity_mean = obesity_mean.sort_values(by='mean', ascending=False).iloc[:11]

In [None]:
fig = px.pie(values = obesity_mean['mean'].values, names = obesity_mean.index.tolist(),
             )
fig.show()

#### Undernutrition average diet 

In [None]:
undernutrition_set = df_food_quantity[df_food_quantity['Undernourished'].notna()].sort_values(by='Undernourished', ascending=False).head(10)
undernutrition_mean = undernutrition_set.describe().iloc[1]
undernutrition_mean = pd.DataFrame(undernutrition_mean).drop(['Deaths', 'Population','Undernourished','Obesity', 'Recovered', 'Confirmed', 'Active',], axis=0)
undernutrition_mean = undernutrition_mean.sort_values(by='mean', ascending=False).iloc[:11]

In [None]:
fig = px.pie(values = undernutrition_mean['mean'].values, names = undernutrition_mean.index.tolist(),
             )
fig.show()

# Supervised Approach

## Predict Deaths 

In [None]:
feature_names = ['Animal fats', 'Alcoholic Beverages', 'Animal Products','Milk - Excluding Butter', 'Obesity', 'Vegetal Products']
for feature_name in feature_names:
    fig = go.Figure(px.histogram(df_food_quantity[feature_name]))
    fig.show()

### Response Variables

In [None]:
feature_names = ['Deaths', 'Recovered', 'Confirmed']
for feature_name in feature_names:
    fig = go.Figure(px.histogram(df_food_quantity[feature_name]))
    fig.show()

In [None]:
def zscore(mean, std, val):
    epsilon = 0.000001
    return (val - mean) / (epsilon + std)
feature_names=['Animal Products', 'Obesity', 'Vegetal Products','Animal fats', 'Milk - Excluding Butter' ]
z_score_scaled_feature_names = ['Animal Products', 'Obesity', 'Vegetal Products']
log_scaled_feature_names = ['Animal fats', 'Milk - Excluding Butter']

training_df_copy =df_food_quantity.copy()
z_score_scaled_features = training_df_copy[z_score_scaled_feature_names].copy()

# Apply z-score on 'Animal Products', 'Obesity' and 'Vegetal Products'
for feature_name in z_score_scaled_feature_names:
    mean = z_score_scaled_features[feature_name].mean()
    std = z_score_scaled_features[feature_name].std()
    z_score_scaled_features[feature_name] = zscore(mean, std, z_score_scaled_features[feature_name])

log_scaled_features = training_df_copy[log_scaled_feature_names].copy()
for feature_name in log_scaled_feature_names:
    # Apply log scaling to 'Animal fats' and 'Milk - Excluding Butter'
    # log1p(x) = log(1 + x) keeps zero values finite
    log_scaled_features[feature_name] = np.log1p(log_scaled_features[feature_name])

In [None]:
training_df_copy[z_score_scaled_feature_names]=z_score_scaled_features
training_df_copy[log_scaled_feature_names] = log_scaled_features

In [None]:
X = training_df_copy[feature_names]
y = training_df_copy['Deaths']

In [None]:
from sklearn.utils import shuffle

animal_features = ['Animal fats', 'Aquatic Products, Other', 'Eggs', 'Fish, Seafood', 'Meat',
                   'Milk - Excluding Butter', 'Offals']
vegetal_features = ['Alcoholic Beverages', 'Cereals - Excluding Beer', 'Fruits - Excluding Wine', 'Miscellaneous', 'Oilcrops', 'Pulses',
                    'Spices', 'Starchy Roots', 'Stimulants', 'Sugar & Sweeteners', 'Sugar Crops', 'Treenuts',
                    'Vegetable Oils', 'Vegetables']

df_mort = df_food_quantity[df_food_quantity.Country != 'Yemen'][animal_features+vegetal_features+['Obesity','Mortality']]
# df_mort = kg_df[['Animal Products','Vegetal Products','Obesity','Mortality']]

df_mort = shuffle(df_mort)

mort_features = df_mort.columns.drop('Mortality')
mort_target = 'Mortality'

print('Model features: ', mort_features)
print('Model target: ', mort_target)

X = df_mort[mort_features]
y = df_mort[mort_target]


## Missing Values

In [None]:
from missingpy import MissForest

# Make an instance and perform the imputation
imputer = MissForest()
X_imputed = imputer.fit_transform(X)
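
`missingpy` may not import on recent scikit-learn releases. As a hedged alternative (a different technique from MissForest, though similar in spirit), the sketch below uses scikit-learn's `IterativeImputer` with a random forest as the per-column estimator. The data are synthetic, and the settings (`n_estimators=20`, `max_iter=5`) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic feature matrix with about 10% of entries knocked out
full = rng.normal(size=(60, 4))
full[:, 3] = full[:, 0] + 0.1 * rng.normal(size=60)  # column 3 depends on column 0
mask = rng.random(full.shape) < 0.10
with_gaps = full.copy()
with_gaps[mask] = np.nan

# Iterative imputation: each column with gaps is predicted from the others
demo_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    max_iter=5,
    random_state=0,
)
filled = demo_imputer.fit_transform(with_gaps)

print("missing before:", int(np.isnan(with_gaps).sum()))
print("missing after: ", int(np.isnan(filled).sum()))
```

Observed entries are left untouched; only the gaps are filled in.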

## Data Splitting

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, train_size=0.8, shuffle = True, random_state = 28)

## Train Models 

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score

#Random Forest 
from sklearn.ensemble import RandomForestRegressor
random_forest = RandomForestRegressor()
random_forest.fit(X_train, y_train)

# Regression tree
from sklearn import tree
arbre_regression = tree.DecisionTreeRegressor()
arbre_regression.fit(X_train, y_train)

# Multiple linear regression
from sklearn.linear_model import LinearRegression
reg_multiple = LinearRegression()
reg_multiple.fit(X_train, y_train)

### Results / Predictions 

In [None]:
print(reg_multiple.coef_)
print(reg_multiple.score(X_train, y_train))

### Linear Regression

In [None]:
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, reg_multiple.predict(X_test)))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(y_test, reg_multiple.predict(X_test)))

### Random Forest 

In [None]:
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, random_forest.predict(X_test)))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(y_test, random_forest.predict(X_test)))

### Regression Tree

In [None]:
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, arbre_regression.predict(X_test)))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(y_test, arbre_regression.predict(X_test)))
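
The two metrics printed above can be reproduced by hand, which helps when reading them. The sketch below uses made-up targets and predictions (not model output from this notebook) and checks the manual formulas against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([0.10, 0.20, 0.30, 0.40])
y_pred = np.array([0.12, 0.18, 0.33, 0.35])

# MSE: average squared residual
mse_manual = np.mean((y_true - y_pred) ** 2)

# R^2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Because R^2 compares the model with a constant prediction at the mean, it can be negative on a test set when the model does worse than that baseline.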

## Improving the Models

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Create function to evaluate model on a few different scores
def show_scores(model, X_train, X_test, y_train, y_test):    
    train_preds = model.predict(X_train)
    test_preds = model.predict(X_test)
    scores = {'Training MAE': mean_absolute_error(y_train, train_preds),
              'Test MAE': mean_absolute_error(y_test, test_preds),
              'Training MSE': mean_squared_error(y_train, train_preds),
              'Test MSE': mean_squared_error(y_test, test_preds),
              'Training R^2': r2_score(y_train, train_preds),
              'Test R^2': r2_score(y_test, test_preds)}
    return scores

In [None]:
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost.sklearn import XGBRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# First, we create a dict with our desired models
models = {'Ridge':Ridge(random_state=28),
          'SVR':SVR(),
          'RandomForest':RandomForestRegressor(),
          'XGBoost':XGBRegressor(n_estimators = 1000, learning_rate = 0.05)}

# Now to build the function that tests each model
def model_build(model, X_train, y_train, X_test, y_test, scale=True):
    
    if scale:
        regressor = Pipeline([
            ('scaler', StandardScaler()),
            ('estimator', model)
        ])
    
    else:
        regressor = Pipeline([
            ('estimator', model)
        ])

    # Training
    regressor.fit(X_train, y_train)

    # Scoring the training set

    train_preds = regressor.predict(X_train)
    print(f"R2 on single split: {regressor.score(X_train, y_train)}")

    # Cross validate
    cv_score = cross_val_score(regressor, X_train, y_train, cv = 10)

    print(f"Cross validate R2 score: {cv_score.mean()}")

    # Scoring the test set
    for k, v in show_scores(regressor, X_train, X_test , y_train, y_test).items():
        print("     ", k, v)
        
    
for name, model in models.items():
    print(f"==== Scoring {name} model====")
    
    if name == 'RandomForest' or name == 'XGBoost':
        model_build(model, X_train, y_train, X_test, y_test, scale=False)
    else:
        model_build(model, X_train, y_train, X_test, y_test,)
    print()
    print(40*"=")
        

In [None]:
model = RandomForestRegressor()
model.fit(X_train, y_train)

test_preds = model.predict(X_test)

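test_plot = pd.DataFrame(X_test, columns=X.columns)
# y_test keeps its original (shuffled) index; convert to an array to avoid index misalignment
test_plot['Mortality'] = np.asarray(y_test)
test_plot['Mortality_pred'] = test_preds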

test_plot.head()

In [None]:
import seaborn as sns

def plotTest(col, target, data):
    # Overlay regression fits of the actual and the predicted target against one feature
    fig, ax = plt.subplots(figsize=[10,8])

    sns.regplot(x = col, y = target, data = data, ax = ax, label=target)
    sns.regplot(x = col, y = target+'_pred', data = data, ax = ax, label=target+'_pred')

    plt.legend()

In [None]:
plotTest('Animal fats', 'Mortality', test_plot)

Many factors matter in the fight against the current COVID-19 epidemic. Maintaining good eating habits helps keep our immune system healthy and ready to combat a possible disease.
In this notebook I explored possible patterns in COVID-19 data and food intake across countries. One major goal was to assess how obesity rates relate to the impact of the disease in each country. After splitting countries into high-obesity (HOC) and low-obesity (LOC) groups, it was possible to create a classifier, with good accuracy, that predicts a country's group from its food intake data.
We then built regression models to try to predict COVID-19 mortality from each country's eating habits and obesity rate. Two approaches were taken: one using all food-related features and a simpler one. Both have issues (mainly spread and non-linearity), but they illustrate the use of different models and metrics.

### Method 2

In [None]:
df_fat_quantity = pd.read_csv(r'../input/covid19-healthy-diet-dataset/Fat_Supply_Quantity_Data.csv')
df_food_quantity = pd.read_csv(r'../input/covid19-healthy-diet-dataset/Food_Supply_Quantity_kg_Data.csv')
df_food_kcal = pd.read_csv(r'../input/covid19-healthy-diet-dataset/Food_Supply_kcal_Data.csv')
df_protein_quantity = pd.read_csv(r'../input/covid19-healthy-diet-dataset/Protein_Supply_Quantity_Data.csv')
df_food_description = pd.read_csv(r'../input/covid19-healthy-diet-dataset/Supply_Food_Data_Descriptions.csv')

In [None]:
df = pd.DataFrame()
df[[i+'-fat' for i in ['Country', 'Alcoholic Beverages', 'Animal fats',
       'Cereals - Excluding Beer', 'Fruits - Excluding Wine', 'Miscellaneous',
       'Milk - Excluding Butter', 'Stimulants', 'Sugar Crops',
       'Sugar & Sweeteners', 'Vegetable Oils']]] = df_fat_quantity[['Country', 'Alcoholic Beverages', 'Animal fats',
       'Cereals - Excluding Beer', 'Fruits - Excluding Wine', 'Miscellaneous',
       'Milk - Excluding Butter', 'Stimulants', 'Sugar Crops',
       'Sugar & Sweeteners', 'Vegetable Oils']]

In [None]:
df[[i+'-kcal' for i in df_food_kcal.columns[[2,4,5,6,7,9,11,13,14,15,16,19,20,21,22,23]]]]= df_food_kcal[df_food_kcal.columns[[2,4,5,6,7,9,11,13,14,15,16,19,20,21,22,23]]]

In [None]:
df[[i+'-food' for i in df_food_quantity.columns[[1,2,3,12,17,18,19,21,23]]]]= df_food_quantity[df_food_quantity.columns[[1,2,3,12,17,18,19,21,23]]]

In [None]:
df[[i+'-protein' for i in df_protein_quantity.columns[[3,8]]]]= df_protein_quantity[df_protein_quantity.columns[[3,8]]]
df[[i+'-protein' for i in df_protein_quantity.columns[10:30]]]= df_protein_quantity[df_protein_quantity.columns[10:30]]

In [None]:
df['Undernourished-protein'] = df.apply(lambda row: 2.5 if row['Undernourished-protein'] == '<2.5' else float(row['Undernourished-protein']), axis = 1)

In [None]:
for feature_name in df.columns:
    fig = go.Figure(px.histogram(df[feature_name]))
    fig.show()

In [None]:
df = df.drop(columns=['Animal fats-food', 'Vegetal Products-food', 'Animal Products-kcal', 'Vegetal Products-kcal', 'Alcoholic Beverages-fat', 'Sugar Crops-fat', 'Sugar & Sweeteners-fat', 'Sugar & Sweeteners-food', 'Sugar Crops-protein','Aquatic Products, Other-kcal'])

In [None]:
x = np.corrcoef([df_food_quantity["Sugar & Sweeteners"].values.tolist(),df_protein_quantity["Sugar & Sweeteners"].values.tolist(),df_food_kcal['Sugar & Sweeteners'].values.tolist(),df_fat_quantity['Sugar & Sweeteners'].values.tolist()]).round(3)
fig = go.Figure(ff.create_annotated_heatmap(z=x,x=['Sugar & Sweeteners-food',"Sugar & Sweeteners-protein","Sugar & Sweeteners-kcal","Sugar & Sweeteners-fat"],y=['Sugar & Sweeteners-food',"Sugar & Sweeteners-protein","Sugar & Sweeteners-kcal","Sugar & Sweeteners-fat"], colorscale='Viridis', showscale=True))
fig.show()

In [None]:
for feature_name in df.columns[44:]:
    fig = go.Figure(px.histogram(df[feature_name]))
    fig.show()

In [None]:
from missingpy import MissForest

# Make an instance and perform the imputation
imputer = MissForest()
X = imputer.fit_transform(df[df.columns[1:]].values.tolist())

In [None]:
df[df.columns[1:]] = X

In [None]:
corr_heatmap=df
x=corr_heatmap.corr(method='pearson', numeric_only=True).round(3)
fig = go.Figure(ff.create_annotated_heatmap(z=x[['Deaths-protein']].sort_values(by='Deaths-protein').values, x=['Deaths-protein'], y=x[['Deaths-protein']].sort_values(by='Deaths-protein').index.to_list(), colorscale='Viridis', showscale=True))
fig.update_layout(height=1000)
fig.show()

In [None]:
corr_heatmap=df
x=corr_heatmap.corr(method='pearson', numeric_only=True).round(3)
fig = go.Figure(ff.create_annotated_heatmap(z=x[['Recovered-protein']].sort_values(by='Recovered-protein').values, x=['Recovered-protein'], y=x[['Recovered-protein']].sort_values(by='Recovered-protein').index.to_list(), colorscale='Viridis', showscale=True))
fig.show()

In [None]:
df = df.dropna()

In [None]:
X = df[['Miscellaneous-protein', 'Sugar & Sweeteners-kcal', 'Meat-kcal', 'Pulses-kcal','Stimulants-protein','Oilcrops-kcal','Fruits - Excluding Wine-protein', 'Eggs-kcal']]
y = df['Confirmed-protein']

In [None]:
X = df[['Miscellaneous-protein', 'Vegetables-protein', 'Obesity-protein', 'Undernourished-protein', 'Animal fats-fat']]
y = df['Deaths-protein']

In [None]:
X = df[['Miscellaneous-protein','Stimulants-fat' ,'Treenuts-protein', 'Eggs-kcal','Offals-protein']]
y = df['Recovered-protein']

## Results 

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score

X_train, X_test, y_train, y_test = train_test_split(X.values.tolist(), y.values.tolist(), train_size=0.7, shuffle = True)
#Random Forest 
from sklearn.ensemble import RandomForestRegressor
random_forest = RandomForestRegressor()
random_forest.fit(X_train, y_train)

# Regression tree
from sklearn import tree
arbre_regression = tree.DecisionTreeRegressor()
arbre_regression.fit(X_train, y_train)

# Multiple linear regression
from sklearn.linear_model import LinearRegression
reg_multiple = LinearRegression()
reg_multiple.fit(X_train, y_train)
print('Linear Regression')
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, reg_multiple.predict(X_test)))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(y_test, reg_multiple.predict(X_test)))

print('Regression Tree')
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, arbre_regression.predict(X_test)))
print('Coefficient of determination: %.2f'
      % r2_score(y_test, arbre_regression.predict(X_test)))

print('Random Forest')
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, random_forest.predict(X_test)))
print('Coefficient of determination: %.2f'
      % r2_score(y_test, random_forest.predict(X_test)))


In [None]:
test_df = pd.DataFrame()
test_df['random_forest_pred'] = random_forest.predict(X_test)
test_df['arbre_regression_pred'] = arbre_regression.predict(X_test)
test_df['regression_lineaire_mult_pred'] = reg_multiple.predict(X_test)

In [None]:
from yellowbrick.regressor import ResidualsPlot

visualizer = ResidualsPlot(random_forest)

visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
visualizer.show() 

In [None]:
from yellowbrick.regressor import ResidualsPlot

visualizer = ResidualsPlot(random_forest,hist=False, qqplot=True
)
visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
visualizer.show() 

In [None]:
r2_score(y_test, random_forest.predict(X_test))

In [None]:
print('Mean squared error: %.5f'
              % mean_squared_error(y_test, random_forest.predict(X_test)))

# PCA  

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA(n_components=6)
pca.fit(df[df.columns[1:20]])

In [None]:
print(pca.explained_variance_ratio_)
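
The explained variance ratios are easier to act on as a cumulative sum, which shows how many components are needed to reach a chosen share of the variance. The sketch below uses synthetic data, and the 90% threshold is an arbitrary choice. It also standardizes the features first, which is an assumption on my part: the notebook fits PCA on the raw columns.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 100 samples, 8 features driven by 2 hidden factors plus noise
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 8))
data = latent @ mixing + 0.1 * rng.normal(size=(100, 8))

# PCA is sensitive to feature scale, so standardize before fitting
scaled = StandardScaler().fit_transform(data)

pca_demo = PCA(n_components=6).fit(scaled)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(cumulative.round(3))

# Smallest number of components reaching 90% of the variance, if any of the six does
reached = cumulative >= 0.90
n_keep = int(np.argmax(reached)) + 1 if reached.any() else None
print("components needed for 90%:", n_keep)
```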

In [None]:
X = pca.transform(df[df.columns[1:20]])
y = df['Deaths-protein']

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score

X_train, X_test, y_train, y_test = train_test_split(X.tolist(), y.values.tolist(), train_size=0.8, shuffle = True)
# Random Forest
from sklearn.ensemble import RandomForestRegressor
random_forest = RandomForestRegressor()
random_forest.fit(X_train, y_train)

# Regression tree
from sklearn import tree
arbre_regression = tree.DecisionTreeRegressor()
arbre_regression.fit(X_train, y_train)

# Multiple linear regression
from sklearn.linear_model import LinearRegression
reg_multiple = LinearRegression()
reg_multiple.fit(X_train, y_train)

print('Linear Regression')
print('Mean squared error: %.2f'
          % mean_squared_error(y_test, reg_multiple.predict(X_test)))
    # The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
          % r2_score(y_test, reg_multiple.predict(X_test)))

print('Regression Tree')
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, arbre_regression.predict(X_test)))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(y_test, arbre_regression.predict(X_test)))

print('Random Forest')
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, random_forest.predict(X_test)))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(y_test, random_forest.predict(X_test)))


In [None]:
print(X.shape)
print(y.shape)

In [None]:
r2_score(y_test, random_forest.predict(X_test))

In [None]:
r2_score(y_test, arbre_regression.predict(X_test))

In [None]:
from yellowbrick.regressor import ResidualsPlot

visualizer = ResidualsPlot(reg_multiple)

visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
visualizer.show() 

In [None]:
from yellowbrick.regressor import ResidualsPlot

visualizer = ResidualsPlot(reg_multiple,hist=False, qqplot=True
)
visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
visualizer.show() 
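
The residual and Q-Q plots above are visual checks of two regression assumptions: equal variance and normally distributed residuals. They can also be checked numerically. A minimal sketch on synthetic data, assuming a Shapiro-Wilk test (`scipy.stats.shapiro`) for normality and a simple split-half variance ratio for equal variance; neither check is part of the original analysis.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic regression problem whose noise grows with x (unequal variance on purpose)
x = rng.uniform(1.0, 10.0, size=300)
y = 2.0 * x + rng.normal(scale=0.5 * x)

model = LinearRegression().fit(x.reshape(-1, 1), y)
fitted = model.predict(x.reshape(-1, 1))
residuals = y - fitted

# Normality: Shapiro-Wilk; a small p-value suggests non-normal residuals
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Equal variance: compare residual variance for low vs. high fitted values
order = np.argsort(fitted)
half = len(order) // 2
var_low = residuals[order[:half]].var()
var_high = residuals[order[half:]].var()

print("Shapiro-Wilk p-value:", shapiro_p)
print("variance ratio (high / low):", var_high / var_low)
```

A variance ratio far from 1 points to unequal variance, which is the pattern a funnel-shaped residual plot shows.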

# K-Means 

In [None]:
X = df[['Miscellaneous-protein', 'Vegetables-protein', 'Obesity-protein',  'Animal fats-fat']].values

In [None]:
from sklearn.preprocessing import StandardScaler
scalerX = StandardScaler().fit(X)
X_scaled = scalerX.transform(X)

# Using the elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
#plt.show()
figure = plt.gcf()  # get current figure
figure.set_size_inches(8, 4) # set the figure size manually (8x4 inches)
#plt.savefig("Elbow.png", bbox_inches='tight') # bbox_inches removes extra white spaces
plt.show()
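
The elbow can be hard to read by eye. As a complement (not something this notebook does), the silhouette score gives a second opinion on the number of clusters: it is higher when points sit close to their own cluster and far from the others. A minimal sketch on synthetic data with three well-separated groups:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustrative only)
points, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

# Silhouette score for each candidate k (it is undefined for k = 1)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(s, 3))
print("best k by silhouette:", best_k)
```

On the notebook's data, the same loop could be run on `X_scaled` to compare against the elbow plot.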

In [None]:
# K = 4 

In [None]:
num_opt_clusters=3

# Fitting K-Means to the dataset
kmeans = KMeans(n_clusters = num_opt_clusters, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X_scaled)
dataset =df[['Miscellaneous-protein', 'Vegetables-protein', 'Obesity-protein', 'Animal fats-fat']].copy()
# Assign labels positionally; row-by-row .loc[i] breaks if the index is not 0..n-1
dataset["Cluster"] = y_kmeans
    
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.title('Cluster Analysis',fontsize=20, fontweight='bold')
# The plotted axes are the first two columns of X
plt.xlabel('Miscellaneous-protein',fontsize=16, fontweight='bold')
plt.ylabel('Vegetables-protein',fontsize=16, fontweight='bold')
plt.legend()

figure = plt.gcf()  # get current figure
figure.set_size_inches(32, 18) # set figure's size manually to your full screen (32x18)
plt.show()

In [None]:
dataset['Deaths'] = df['Deaths-protein']
dataset['Country'] = df['Country-fat']

In [None]:
dataset.groupby('Cluster').mean()

In [None]:
dataset.groupby('Cluster').count()

# Conclusion

In summary, a country's confirmed and active COVID-19 cases can be explained relatively well by food categories such as the calorie content of oilcrops and the protein content of infant food and miscellaneous food. The same cannot be said of deaths and recovered cases. This could be because these models do not satisfy the necessary assumptions of equal variance and normally distributed residuals. It is also important to note that mortality figures were not yet final, so the first model should be taken with a grain of salt.

Recall also that these models only describe the correlation between food categories and the rate of deaths.

There is no evidence to suggest that a country's diet has an effect on the spread of COVID-19. Many other factors that are unrelated to diet also drive the spread, e.g. how active the general public is, the preventive measures implemented by each country, and population density.