# Diet Vs Coronavirus - ML Approach

## Objective

The objective of this model is to predict the percentage of deaths (per country) related to the coronavirus pandemic, taking into account statistical data about the food habits of the population (food types: animal, eggs, fish, beer, etc.). From the predictions of the model, we can conclude which types of food have a bigger impact on the final outcome.

## The dataset

In order to train and evaluate the model, we'll use [Covid-19 healthy dataset](https://www.kaggle.com/mariaren/covid19-healthy-diet-dataset). Unfortunately, the number of labeled examples is fairly low, so we'll create a classic ML model (a simple linear regressor).

## Import relevant modules

The following hidden code cell imports the packages that we'll use to explore and process the data and to create, run and evaluate the model.

In [None]:
#@title Import relevant modules

import pandas as pd
import tensorflow as tf
import numpy as np
from matplotlib import pyplot as plt
from tensorflow.keras import layers

# The following lines adjust the granularity of reporting. 
pd.options.display.max_rows = 30
pd.options.display.float_format = "{:.8f}".format

## Data

The following sections will be dedicated to the processes of data acquisition, exploration and processing.

### Data acquisition

Since our dataset is not that large, we'll use pandas to load the data in memory from a .csv file.

In [None]:
column_names = [
  'country',
  'alchoholic_beverages',
  'animal_products',
  'animal_fats',
  'aquatic_products',
  'cereals_excluding_beer',
  'eggs',
  'fish_and_seafood',
  'fruits',
  'meat',
  'miscellaneous',
  'milk_excluding_butter',
  'offals',
  'oilcrops',
  'pulses',
  'spices',
  'starchy_roots',
  'stimulants',
  'sugar_crops',
  'sugar_and_sweeteners',
  'treenuts',
  'vegetal_products',
  'vegetal_oils',
  'vegetables',
  'obesity',
  'undernourished',
  'confirmed',
  'deaths',
  'recovered',
  'active',
  'population',
  'unit',
]

diet_data = pd.read_csv(
  filepath_or_buffer='https://raw.githubusercontent.com/GrozescuRares/diet_vs_corona/master/diet_vs_corona.csv',
  skiprows=1,
  names=column_names,
)
diet_data = diet_data.reindex(np.random.permutation(diet_data.index))

diet_data.head()

### Data exploration

In this section we'll explore the dataset, since a large part of most machine learning projects is getting to know your data.

In [None]:
#@title Get statistics on the dataset.

diet_data.describe()

After analyzing the statistics we identified some anomalies:


*   There are missing values for the columns: Obesity, Confirmed, Deaths, Recovered and Active.
*   For several columns the maximum is very high compared to the other quantiles, which suggests that those columns contain outliers. For example, the Fruits - Excluding Wine column has a maximum value of 9.7, whereas the quantiles, mean and std values would lead us to expect a maximum of approximately 3.0. This issue also occurs for the Oilcrops, Pulses, Spices, Starchy Roots and Population columns.

All this considered, we need to carefully choose our features and decide how to handle the examples which have missing values for some columns.
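
The two checks above (missing values and suspiciously large maxima) can also be done programmatically. The following is a minimal sketch on a hypothetical toy frame, not on the real dataset; the column values and the `iqr_outliers` helper are assumptions made for illustration, and the 1.5 * IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for two of the dataset columns.
toy = pd.DataFrame({
    "fruits": [0.4, 0.5, 0.6, 0.7, 0.8, 9.7],
    "obesity": [20.0, np.nan, 25.0, 22.0, np.nan, 18.0],
})

# Count the missing values in each column.
missing_counts = toy.isna().sum()

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]

fruit_outliers = iqr_outliers(toy["fruits"])

print(missing_counts)
print(fruit_outliers)
```

On the real data, the same two calls could be applied to `diet_data` column by column.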



#### Find feature(s) whose raw values correlate with the label

We want to find out which features have the most predictive power for our problem. In order to get that information we'll use the [**correlation matrix**](https://medium.com/towards-artificial-intelligence/training-a-machine-learning-model-on-a-dataset-with-highly-correlated-features-debddf5b2e34).

In [None]:
#@title Get correlation matrix

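# Restrict the computation to numeric columns; recent pandas versions raise
# an error on string columns such as 'country' and 'unit' (requires pandas >= 1.5).
diet_data.corr(numeric_only=True)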

After analyzing the correlation matrix we can conclude that 'Animal Products', 'Cereals - Excluding Beer', 'Vegetal Products' and 'Obesity' correlate most strongly with 'Deaths'. So, we'll use those values as features (numeric features).
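
Reading a full correlation matrix by eye is error-prone, so it can help to rank the features by the absolute value of their correlation with the label. The sketch below uses synthetic data; the column names and the `rank_by_correlation` helper are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical synthetic columns; the names only mimic the dataset.
strong = rng.normal(size=n)
weak = rng.normal(size=n)
toy = pd.DataFrame({
    "strong_feature": strong,
    "weak_feature": weak,
    "deaths": 2.0 * strong + 0.1 * weak + rng.normal(scale=0.5, size=n),
    "country": ["x"] * n,  # non-numeric column, excluded below
})

def rank_by_correlation(df, label):
    """Sort the numeric features by |correlation with the label|, strongest first."""
    corr = df.select_dtypes(include="number").corr()[label].drop(label)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]

ranking = rank_by_correlation(toy, "deaths")
print(ranking)
```

The sign of each entry is kept, so both strongly positive and strongly negative features end up at the top of the ranking.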

In [None]:
# Define features and labels.
feature_names = ['animal_products', 'cereals_excluding_beer', 'obesity', 'vegetal_products']
label_name = 'deaths'

In [None]:
# Inspect features data
diet_data[feature_names].head()

#### Visualize features data distribution

Visualizing the distribution of the data will help us decide whether we need to normalize it. We'll plot the histogram of each feature using pandas.

In [None]:
for feature_name in feature_names:
  diet_data.hist(column=feature_name)

By visualizing the histograms we can conclude the following:


*   *Animal Products*, *Obesity* and *Vegetal Products* have a roughly normal distribution. We'll probably just scale their values using the z-score formula.
*   *Cereals - Excluding Beer*, on the other hand, has a right-skewed distribution. Log scaling may help us bring this feature closer to a normal distribution.
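
The visual impression from the histograms can be backed by a number: the sample skewness, which is close to 0 for a symmetric distribution and clearly positive for a right-skewed one. This is a minimal sketch on hypothetical random samples, not on the dataset itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical samples: one roughly symmetric, one right-skewed.
symmetric = pd.Series(rng.normal(loc=20.0, scale=3.0, size=1000))
right_skewed = pd.Series(rng.lognormal(mean=1.0, sigma=0.8, size=1000))

skew_symmetric = symmetric.skew()
skew_right = right_skewed.skew()
# Taking the log of a right-skewed, strictly positive feature reduces the skew.
skew_after_log = np.log(right_skewed).skew()

print(skew_symmetric, skew_right, skew_after_log)
```

Applied to the notebook's data, `diet_data[feature_names].skew()` would give one value per feature.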



### Data processing

In this section we'll look at how we can normalize our data so that it is closer to a normal distribution, and we'll decide how to handle the records with missing values.

#### Dropping records

Since some records have missing values for the label or for a feature, and given the context of the problem, we'll drop those records.

In [None]:
# Get a data frame which only contains the features and the label
training_columns = feature_names + [label_name]
training_df = diet_data[training_columns]
training_df = training_df.astype(np.float32)

# Drop records with nan values
training_df = training_df.dropna()

print('Dropped records with missing values.')

#### Data normalization

In the last section we plotted the histogram for all the features and we saw that *Cereals - Excluding Beer* is right-skewed, while the other features are roughly normally distributed. In this section we'll explore z-score and log scaling.
**Note**: We'll apply the scaling to a copy of training_df, just to visualize the difference. The actual scaling will be done as part of model creation.

In [None]:
def zscore(mean, std, val):
  epsilon = 0.000001
  
  return (val - mean) / (epsilon + std)

z_score_scaled_feature_names = ['animal_products', 'obesity', 'vegetal_products']
log_scaled_feature_names = ['cereals_excluding_beer']

training_df_copy = training_df.copy()
z_score_scaled_features = training_df_copy[z_score_scaled_feature_names].copy()

# Apply z-score on 'Animal Products', 'Obesity' and 'Vegetal Products'
for feature_name in z_score_scaled_feature_names:
  mean = z_score_scaled_features[feature_name].mean()
  std = z_score_scaled_features[feature_name].std()
  z_score_scaled_features[feature_name] = zscore(mean, std, z_score_scaled_features[feature_name])
  z_score_scaled_features.hist(column=feature_name)

log_scaled_features = training_df_copy[log_scaled_feature_names].copy()
for feature_name in log_scaled_feature_names:
  # Apply log scaling for 'Cereals - Excluding Beer'
  log_scaled_features[feature_name] = np.log(log_scaled_features[feature_name])
  log_scaled_features.hist(column=feature_name)

After applying z-score and log scaling, the distributions of all our features look much closer to normal, so we'll stick with this approach when creating the model.

#### Data noise and label normalization

The last thing that we need to do before creating the model is to reduce the noise in the label values and bring the label into a range similar to that of the features. To reduce the complexity of the computations, we'll multiply the label by 100 and keep just the first 4 digits after the decimal point.

To avoid taking the log of 0, which yields -inf, we add 1 to each value before applying the log.
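
Because the label is transformed, anything the model predicts lives in the transformed space and has to be mapped back before it can be read as a percentage. Here is a small sketch of the forward and inverse transforms on hypothetical values; it uses `np.log1p` and `np.expm1`, which compute `log(x + 1)` and `exp(x) - 1`.

```python
import numpy as np

# Hypothetical death rates in percent, including a zero.
death_pct = np.array([0.0, 0.0012, 0.05, 0.13])

# Forward transform: scale by 100, round to 4 decimals, then log(x + 1).
scaled = np.round(death_pct * 100.0, 4)
log_label = np.log1p(scaled)

# Inverse transform: undo the log, then undo the scaling.
recovered = np.expm1(log_label) / 100.0

print(log_label)
print(recovered)
```

The zero entry maps to 0 rather than -inf, and the inverse recovers the original percentages.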

In [None]:
training_df[label_name] = training_df[label_name].astype(np.float32) * 100.0
training_df[label_name] = training_df[label_name].round(4)
training_df[label_name] = training_df[label_name].map(lambda val: np.log(val + 1))

training_df.describe()

In [None]:
training_df.hist(column=label_name)

## Model

The sections below are dedicated to the processes of creating, training and evaluating the model.

### Splitting the dataset

We split the dataset into training and testing data, separating the features from the label.
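
With so few records, the result of a random split can vary noticeably from run to run, so fixing the seed makes experiments comparable (scikit-learn's `train_test_split` accepts a `random_state` argument for this). The following dependency-light sketch shows what such a seeded split does; `split_frame` and the toy frame are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def split_frame(df, test_fraction=0.10, seed=0):
    """Shuffle row positions with a fixed seed and split into train/test frames."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df))
    n_test = max(1, int(round(len(df) * test_fraction)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return df.iloc[train_idx], df.iloc[test_idx]

# Hypothetical toy frame with 50 records.
toy = pd.DataFrame({"feature": np.arange(50.0), "label": np.arange(50.0) * 2})
train_df, test_df = split_frame(toy)

print(len(train_df), len(test_df))
```

Calling `split_frame(toy)` again with the same seed returns the same partition.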

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(training_df[feature_names], training_df[label_name], test_size=0.10)

print('We have {} training records and {} records for evaluating the model.'.format(len(X_train), len(X_test)))

### Creating the input layer

In this section we'll create the input layer that will be used by our model. When defining the columns, we'll take into consideration the normalization methods that we discussed in the *Data normalization* phase.
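
One detail deserves care when per-feature normalizers are built in a loop or comprehension: a Python lambda looks up the loop variable when it is called, not when it is defined, so every normalizer can silently end up using the statistics of the last feature. The sketch below shows the pitfall and one way around it in plain Python; the `stats` dictionary and its values are hypothetical.

```python
def zscore(mean, std, val, epsilon=1e-6):
    return (val - mean) / (epsilon + std)

# Hypothetical per-feature (mean, std) statistics.
stats = {"a": (10.0, 2.0), "b": (100.0, 20.0)}

# Late binding: both lambdas read `name` when called, so both use feature "b".
late = [lambda val: zscore(stats[name][0], stats[name][1], val) for name in stats]

# Early binding: default arguments freeze each feature's own mean and std.
early = [
    lambda val, mean=stats[name][0], std=stats[name][1]: zscore(mean, std, val)
    for name in stats
]

print(late[0](12.0), early[0](12.0))
```

`functools.partial(zscore, mean, std)` would be an equivalent way to bind the statistics early.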

In [None]:
# Create the features normalized using z-score.
z_score_scaled_features = [
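  tf.feature_column.numeric_column(
      feature_name,
      # Bind mean and std as default arguments so that each lambda keeps its own
      # feature's statistics instead of those of the last feature_name.
      normalizer_fn=lambda val, mean=X_train[feature_name].mean(), std=X_train[feature_name].std(): zscore(mean, std, val),
  )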
  for feature_name in z_score_scaled_feature_names
]

# Create the features normalized using log scaling
log_scaled_features = [
  tf.feature_column.numeric_column(
      feature_name,
      normalizer_fn=lambda val: tf.math.log(val),
  )
  for feature_name in log_scaled_feature_names
]

# Create the input layer
input_layer = layers.DenseFeatures(z_score_scaled_features + log_scaled_features)

print('Created input layer.')

### Define functions that create and train a model; define plot function

We'll define a function for creating and compiling a simple linear regression model.
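
For intuition, a single dense unit computes `w · x + b`, and training adjusts `w` and `b` to minimize the mean squared error. The sketch below reproduces that idea in NumPy on hypothetical synthetic data. Note that it uses plain gradient descent rather than the RMSprop optimizer used in the notebook, so it illustrates the model, not the exact training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 4 features with a known linear relation plus small noise.
X = rng.normal(size=(200, 4))
true_w = np.array([0.5, -1.0, 2.0, 0.0])
y = X @ true_w + 0.3 + rng.normal(scale=0.05, size=200)

w = np.zeros(4)
b = 0.0
learning_rate = 0.1

for _ in range(500):
    error = X @ w + b - y
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2.0 * (X.T @ error) / len(y)
    grad_b = 2.0 * error.mean()
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

rmse = np.sqrt(np.mean((X @ w + b - y) ** 2))
print(w, b, rmse)
```

The learned weights end up close to `true_w`, and the RMSE approaches the noise level of the synthetic data.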

In [None]:
def create_model(my_learning_rate, input_layer):
  """Create and compile a simple linear regression model."""

  model = tf.keras.models.Sequential()

  # Add the layer containing the feature columns to the model.
  model.add(input_layer)

  # Add one linear layer to the model to yield a simple linear regressor.
  model.add(tf.keras.layers.Dense(units=1))

  # Construct the layers into a model that TensorFlow can execute.
  model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=my_learning_rate),
    loss='mean_squared_error',
    metrics=[tf.keras.metrics.RootMeanSquaredError()],
  )

  return model

print('Defined create_model function.')

The function below trains the model on a given dataset.

In [None]:
def train_model(model, x, y, epochs, batch_size):
  """Feed a dataset into the model in order to train it."""

  features = {name:np.array(value) for name, value in x.items()}
  label = y.to_numpy()

  history = model.fit(
    x=features,
    y=label,
    batch_size=batch_size,
    epochs=epochs,
    shuffle=True,
  )

  # The list of epochs is stored separately from the rest of history.
  epochs = history.epoch
  
  # Isolate the root mean squared error for each epoch.
  hist = pd.DataFrame(history.history)
  rmse = hist['root_mean_squared_error']

  return epochs, rmse

print('Defined train_model function.')   

We'll define the function that we are going to use in order to plot the results of the training process.

In [None]:
def plot_the_loss_curve(epochs, rmse):
  """Plot a curve of loss vs. epoch."""

  plt.figure()
  plt.xlabel('Epoch')
  plt.ylabel('Root Mean Squared Error')

  plt.plot(epochs, rmse, label="Loss")
  plt.legend()
  plt.ylim([rmse.min()*0.94, rmse.max()* 1.05])
  plt.show()

print('Defined plot function.')

### Train the model

In this section we'll create the model and train it on the labeled examples.

In [None]:
# The following variables are the hyperparameters.
learning_rate = 0.003
epochs = 64
batch_size = 12

# Create and compile the model.
model = create_model(learning_rate, input_layer)

# Train the model on the training set.
epochs, rmse = train_model(model, X_train, Y_train, epochs, batch_size)

plot_the_loss_curve(epochs, rmse)

### Evaluate the model

In [None]:
print("\nEvaluate the model against the test set:")

test_features = {name:np.array(value) for name, value in X_test.items()}

results = model.evaluate(x=test_features, y=Y_test.to_numpy(), batch_size=batch_size)

In [None]:
new_data = {
  'animal_products': [17.7],
  'cereals_excluding_beer': [7.9],
  'obesity': [10.5],
  'vegetal_products': [26.2],
}

new_data = {name:np.array(value) for name, value in new_data.items()}

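results = model.predict(new_data)

# The label was transformed as log(deaths * 100 + 1), so invert that transform.
predicted_deaths = (np.exp(results[0][0]) - 1) / 100.0

print('The predicted deaths percentage is {}.'.format(predicted_deaths))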

After evaluating the model, and despite the fact that the predictions are not very accurate because of the very small number of training examples, we can still observe a pattern: increasing the percentage of features that have a negative correlation with the label (if their values increase, the outcome decreases) while decreasing the percentage of features that have a positive correlation yields a lower predicted percentage of deaths. For example, a distribution of:


```
{
  'animal_products': 18.7,
  'cereals_excluding_beer': 7.9,
  'obesity': 20.5,
  'vegetal_products': 15.2,
}
```
results in a higher predicted death percentage than:


```
{
  'animal_products': 14.7,
  'cereals_excluding_beer': 7.9,
  'obesity': 20.5,
  'vegetal_products': 19.2,
}
```
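
The reasoning behind this comparison follows from the linearity of the model: the difference between two predictions equals the weighted sum of the feature differences, so only the signs and sizes of the weights matter. The sketch below illustrates this with hypothetical weights chosen purely for illustration; they are not the trained values, and the real model also normalizes its inputs first.

```python
import numpy as np

feature_order = ["animal_products", "cereals_excluding_beer", "obesity", "vegetal_products"]

# Hypothetical weights and bias, NOT the trained values.
weights = np.array([0.4, -0.2, 0.3, -0.4])
bias = 0.1

def predict(x):
    """Linear prediction w . x + b."""
    return float(np.dot(weights, x) + bias)

diet_a = np.array([18.7, 7.9, 20.5, 15.2])
diet_b = np.array([14.7, 7.9, 20.5, 19.2])

# Only the two features that differ contribute to the difference.
difference = predict(diet_a) - predict(diet_b)
print(difference)
```

With a positive weight on animal products and a negative weight on vegetal products, the first diet always scores higher, whatever the bias is.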

So, we can conclude that, according to the model, changing the proportions of fat intake sources even slightly makes a difference in the predicted outcome.




# Diet Vs Coronavirus - Data Approach

## Objective

The objective of this data analysis is to confirm that a population with a healthy diet and lifestyle has a low rate of deaths related to the coronavirus pandemic.

## The dataset

In order to prove our theory, we'll use [Covid-19 healthy dataset](https://www.kaggle.com/mariaren/covid19-healthy-diet-dataset). In the ML Approach, we observed that *Animal Products*, *Cereals - Excluding Beer*, *Vegetal Products* and *Obesity* are the most strongly correlated with the deaths percentage, so we'll be using those values.

## Import relevant modules

The following hidden code cell imports the packages that we'll use to explore, process and visualize the data.

In [None]:
#@title Import relevant modules

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

# The following lines adjust the granularity of reporting. 
pd.options.display.max_rows = 30
pd.options.display.float_format = "{:.8f}".format

## Load the data

We'll load the data in memory using pandas, selecting only the five columns that we are interested in.

In [None]:
column_names = [
  'country',
  'alchoholic_beverages',
  'animal_products',
  'animal_fats',
  'aquatic_products',
  'cereals_excluding_beer',
  'eggs',
  'fish_and_seafood',
  'fruits',
  'meat',
  'miscellaneous',
  'milk_excluding_butter',
  'offals',
  'oilcrops',
  'pulses',
  'spices',
  'starchy_roots',
  'stimulants',
  'sugar_crops',
  'sugar_and_sweeteners',
  'treenuts',
  'vegetal_products',
  'vegetal_oils',
  'vegetables',
  'obesity',
  'undernourished',
  'confirmed',
  'deaths',
  'recovered',
  'active',
  'population',
  'unit',
]
used_column_names = [
  'animal_products',
  'cereals_excluding_beer',
  'vegetal_products',
  'obesity',
  'deaths',
]

diet_data_simple = pd.read_csv(
  filepath_or_buffer='https://raw.githubusercontent.com/GrozescuRares/diet_vs_corona/master/diet_vs_corona.csv',
  skiprows=1,
  names=column_names,
  usecols=used_column_names,
)

diet_data_simple = diet_data_simple.dropna()
diet_data_simple.head()

## Analyze the data

This section is dedicated to the process of analyzing the data and confirming our theory. We'll start by taking a look at the statistics of the dataset.

In [None]:
#@title Get statistics on the dataset.

diet_data_simple.describe()

After observing and analyzing those statistics, we consider that a good approach is to sort the records by deaths percentage and then select twenty rows: the ten records with the highest percentage of deaths and the ten with the lowest (ignoring records whose deaths value is 0). Last but not least, we'll average the values for each group of records and then compare them using a bar chart.
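
The same select-and-average procedure can be written compactly with `nlargest` and `nsmallest`. The following is a minimal sketch on made-up toy records; the values and the `compare_extremes` helper are hypothetical and only illustrate the steps described above.

```python
import pandas as pd

# Hypothetical toy records; the values are made up for illustration.
toy = pd.DataFrame({
    "obesity": [30.0, 28.0, 25.0, 10.0, 8.0, 5.0, 20.0],
    "vegetal_products": [15.0, 16.0, 18.0, 30.0, 32.0, 35.0, 22.0],
    "deaths": [0.09, 0.08, 0.07, 0.002, 0.001, 0.0, 0.03],
})

def compare_extremes(df, label, k):
    """Average the features of the k highest and k lowest non-zero label records."""
    nonzero = df[df[label] != 0.0]
    highest = nonzero.nlargest(k, label).drop(columns=label).mean()
    lowest = nonzero.nsmallest(k, label).drop(columns=label).mean()
    return pd.DataFrame({"highest": highest, "lowest": lowest})

summary = compare_extremes(toy, "deaths", k=2)
print(summary)
```

The resulting frame has one row per feature and one column per group, which maps directly onto a grouped bar chart.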

In [None]:
#@title Sort data by deaths
diet_data_sorted = diet_data_simple.sort_values(by=['deaths'])

diet_data_sorted

In [None]:
#@title Separate data in groups
diet_data_sorted = diet_data_sorted[diet_data_sorted.deaths != 0.0]

highest_deaths_rate_data = diet_data_sorted.tail(10)
lowest_deaths_rate_data = diet_data_sorted.head(10)

print('Data was separated in two groups by deaths rate.')

In [None]:
#@title Check records with the highest death rate

highest_deaths_rate_data

In [None]:
#@title Check records with the lowest death rate

lowest_deaths_rate_data

In [None]:
#@title Compute average for both groups

highest_deaths_rate_mean = {column_name:highest_deaths_rate_data[column_name].mean() for column_name in used_column_names[:-1]}
print('Average values for records with highest death rate: \n{}'.format(highest_deaths_rate_mean))

lowest_deaths_rate_mean = {column_name:lowest_deaths_rate_data[column_name].mean() for column_name in used_column_names[:-1]}
print('Average values for records with lowest death rate: \n{}'.format(lowest_deaths_rate_mean))

In [None]:
#@title Visualize charts

labels = used_column_names[:-1]

x = np.array([0, 2, 4, 6])  # the label locations
width = 0.7  # the width of the bars

fig, ax = plt.subplots(figsize=(20, 12))
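rects1 = ax.bar(x - width/2, list(highest_deaths_rate_mean.values()), width, label='High deaths rate group')
# Pass the same width to the second group so the bars sit side by side without overlapping.
rects2 = ax.bar(x + width/2, list(lowest_deaths_rate_mean.values()), width, label='Low deaths rate group')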

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Percentage')
ax.set_title('Percentage of fat intake')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend(loc='upper right')

def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 1),  # 1 point vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')
autolabel(rects1)
autolabel(rects2)

plt.show()

As the bar chart shows, the group with the lower death rate consumes more vegetal products and cereals, while the group with the higher death rate has a higher obesity rate and consumes more animal products.
In conclusion, based on this data, we can confirm that a population with a healthy diet and lifestyle has a low rate of deaths related to the coronavirus pandemic.