## Modeling with AutoML 
H2O will speed up some of the basic data processing steps, feature engineering, and model testing so I can focus on determining the best model for this dataset.


In [1]:
# Installing prerequisite packages
!pip install --upgrade pip
!pip install requests
!pip install tabulate
!pip install "colorama>=0.3.8"
!pip install future



In [None]:
!pip3 install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o

In [3]:
# Importing necessary libraries
import h2o
from h2o.automl import H2OAutoML
import random, os, sys
import psutil
import logging
import pandas as pd
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt
import statsmodels.api as sm

In [None]:
# Size the H2O cluster at half of the currently available memory, in whole GB
pct_memory=0.5
virtual_memory=psutil.virtual_memory()
min_mem_size=int(round(int(pct_memory*virtual_memory.available)/1073741824,0))  # 1073741824 bytes = 1 GiB
print(min_mem_size)
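
The sizing arithmetic above can be factored into a small helper. This is a minimal sketch, not part of the original notebook: the function name `mem_size_gb` and its optional `floor_gb` argument are assumptions added for illustration, and it takes the available byte count as a plain argument so it can be tried without `psutil`.

```python
BYTES_PER_GIB = 1024 ** 3  # 1073741824, the constant used in the cell above


def mem_size_gb(available_bytes, pct_memory=0.5, floor_gb=0):
    """Return the share of available memory in whole GiB, never below floor_gb.

    Hypothetical helper: `floor_gb` is an assumption, not an H2O setting.
    """
    share_gib = pct_memory * available_bytes / BYTES_PER_GIB
    return max(floor_gb, int(round(share_gib)))


# Half of 16 GiB of available memory
print(mem_size_gb(16 * BYTES_PER_GIB))
# With little free memory, the floor takes over
print(mem_size_gb(1 * BYTES_PER_GIB, floor_gb=2))
```

A floor like this guards against starting a cluster too small to hold the data, at the cost of failing on machines that cannot supply that much memory.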

In [None]:
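port_no=random.randint(5555,55555)
logs_path=os.getcwd()    # assumed location for the H2O logs; adjust as needed
logfile='h2o_logs.zip'   # assumed name for the log archive
try:
  h2o.init(strict_version_check=False,min_mem_size_GB=min_mem_size,port=port_no) # start h2o
except Exception:
  logging.critical('h2o.init failed')
  try:
    # These calls need a reachable cluster, so they may fail as well
    h2o.download_all_logs(dirname=logs_path, filename=logfile)
    h2o.cluster().shutdown()
  except Exception:
    pass
  sys.exit(2)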

## Importing the dataset

In [None]:
# Adding the dataset from GitHub https://github.com/vraosharma-northeastern/exploratory-data-analysis/blob/main/Nutrition%20/food.csv
!wget https://raw.githubusercontent.com/vishnuraosharma/exploratory-data-analysis/main/Kcal%20Predictions/food.csv

In [4]:
# Reading the file into a dataframe
dff = pd.read_csv('food.csv')

# Remove the redundant 'Data.' tag from every column name
dff.columns = dff.columns.str.replace('Data.', '', regex=False)

# Show summary statistics for the numeric columns
dff.describe()

| Statistic | Nutrient Data Bank Number | Alpha Carotene | Ash | Beta Carotene | Beta Cryptoxanthin | Carbohydrate | Cholesterol | Choline | Fiber | Kilocalories | ... | Major Minerals.Potassium | Major Minerals.Sodium | Major Minerals.Zinc | Vitamins.Vitamin A - IU | Vitamins.Vitamin A - RAE | Vitamins.Vitamin B12 | Vitamins.Vitamin B6 | Vitamins.Vitamin C | Vitamins.Vitamin E | Vitamins.Vitamin K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7413.0 | 7413.0 | 7413.0 | 7413.0 | 7413.0 | 7413.0 | 7413.0 | 7413.0 | 7413.0 | 7413.0 | ... | 7413.0 | 7413.0 | 7413.0 | 7413.0 | 7413.0 | 7413.0 | 7413.0 | 7413.0 | 7413.0 | 7413.0 |
| mean | 14116.44368 | 21.210711 | 1.852459 | 159.043437 | 8.776744 | 21.785381 | 37.162822 | 20.673546 | 1.993147 | 219.655875 | ... | 268.348172 | 331.590719 | 1.875125 | 767.568191 | 99.43707 | 1.172903 | 0.269547 | 9.075651 | 0.842837 | 9.448604 |
| std | 8767.416214 | 269.714183 | 2.993228 | 1126.285026 | 154.18486 | 27.123491 | 119.738438 | 45.48199 | 4.292873 | 171.668713 | ... | 404.91622 | 977.046544 | 4.193682 | 3871.307652 | 761.653061 | 4.512816 | 0.565116 | 63.443284 | 4.169756 | 66.067619 |
| min | 1001.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 8121.0 | 0.0 | 0.83 | 0.0 | 0.0 | 0.49 | 0.0 | 0.0 | 0.0 | 82.0 | ... | 103.0 | 31.0 | 0.19 | 0.0 | 0.0 | 0.0 | 0.03 | 0.0 | 0.0 | 0.0 |
| 50% | 12539.0 | 0.0 | 1.24 | 0.0 | 0.0 | 9.29 | 2.0 | 0.0 | 0.3 | 181.0 | ... | 210.0 | 86.0 | 0.77 | 33.0 | 0.0 | 0.01 | 0.11 | 0.0 | 0.05 | 0.0 |
| 75% | 18424.0 | 0.0 | 2.2 | 1.0 | 0.0 | 30.59 | 60.0 | 20.0 | 2.3 | 331.0 | ... | 328.0 | 428.0 | 2.46 | 280.0 | 24.0 | 0.83 | 0.33 | 3.5 | 0.39 | 1.7 |
| max | 93600.0 | 14251.0 | 99.8 | 42891.0 | 7923.0 | 100.0 | 3100.0 | 1388.0 | 79.0 | 902.0 | ... | 16500.0 | 38758.0 | 181.61 | 100000.0 | 30000.0 | 98.89 | 12.0 | 2400.0 | 149.4 | 1714.5 |


## Data Preprocessing

*Household Weights.1st Household Weight* and *Household Weights.2nd Household Weight* are the same measure expressed in different units, and the second has more nulls. We will therefore drop *Household Weights.2nd Household Weight* and its corresponding *Household Weights.2nd Household Weight Description* below.

Also, because *Household Weights.1st Household Weight Description* doesn't tell us anything that *Household Weights.1st Household Weight* doesn't already contain, we will drop it as well.

Furthermore, all foods should have a weight (in grams) greater than 0, so we will remove all rows where *Household Weights.1st Household Weight* is 0.

In [None]:
# Drop the 2nd household weight and both household weight description columns
dff.drop(['Household Weights.2nd Household Weight Description', 'Household Weights.2nd Household Weight','Household Weights.1st Household Weight Description'], axis=1, inplace=True)

# Drop rows where Household Weights.1st Household Weight is 0
# (.copy() avoids pandas' SettingWithCopyWarning on the later in-place drops)
dff = dff[dff['Household Weights.1st Household Weight'] != 0].copy()

We will also drop *Vitamin A - IU* because it is redundant with *Vitamin A - RAE*, and the [NIH](https://ods.od.nih.gov/factsheets/VitaminA-HealthProfessional/#:~:text=The%20units%20of%20measurement%20for,beta%2Dcarotene%20%3D%200.3%20mcg%20RAE) recommends using *Vitamin A - RAE* to measure Vitamin A intake.

In [None]:
# Drop *Vitamin A - IU* because it is redundant with *Vitamin A - RAE*
dff.drop(['Vitamins.Vitamin A - IU'], axis=1, inplace=True)

Looking at the means, standard deviations, minimums, and maximums of each field, we can see that the data needs to be evaluated column by column to determine whether each field makes sense. At face value, a mean of 21.8 for Carbohydrate and 37.2 for Cholesterol seems off, but we need to consider the units of each column.

Of course, we will treat the Data Bank Number as a surrogate key and disregard its distribution. Similarly, Category and Description are categorical variables, so we will exclude them from our unit analysis.

Documentation for the dataset is a bit weak. For example, although the first few attributes have clearly defined units on Kaggle, the units for the remaining numeric attributes are not mentioned. Because the data comes from USDA FoodData Central, we can compare the values in our dataset against that source of truth and infer the units for each column. To keep this simple, we will use the dataset's first entry, Butter [1001](https://fdc.nal.usda.gov/fdc-app.html#/food-details/790508/nutrients), as the reference. We will assume that each column uses the same unit as the source of truth unless the values differ by one or more orders of magnitude:

**Attribute: Unit**
1. Ash: g
2. Alpha Carotene: µg
3. Beta Carotene: µg
4. Beta Cryptoxanthin: µg
5. Carbohydrate: g
6. Cholesterol: mg
7. Choline: mg
8. Fat.Monosaturated Fat: g
9. Fat.Polysaturated Fat: g
10. Fat.Saturated Fat: g
11. Fat.Total Lipid: g
12. Fiber: g
13. Household Weights.1st Household Weight: g
14. Kilocalories: kcal
15. Lutein and Zeaxanthin: µg
16. Lycopene: µg
17. Major Minerals.Calcium: mg
18. Major Minerals.Copper: mg
19. Major Minerals.Iron: mg
20. Major Minerals.Magnesium: mg
21. Major Minerals.Phosphorus: mg
22. Major Minerals.Potassium: mg
23. Major Minerals.Sodium: mg
24. Major Minerals.Zinc: mg
25. Manganese: mg
26. Niacin: mg
27. Pantothenic Acid: mg
28. Protein: g
29. Refuse Percentage: % by weight
30. Retinol: µg
31. Riboflavin: mg
32. Selenium: µg
33. Sugar Total: g
34. Thiamin: mg
35. Vitamins.Vitamin A - RAE: µg
36. Vitamins.Vitamin B12: µg
37. Vitamins.Vitamin B6: mg
38. Vitamins.Vitamin C: mg
39. Vitamins.Vitamin E: mg
40. Vitamins.Vitamin K: µg
41. Water: g
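
The order-of-magnitude comparison used to infer these units can be sketched as a small check. This is an illustration under stated assumptions, not the procedure actually run for this notebook: the function name `same_order_of_magnitude`, the half-decade `tolerance`, and the sample numbers are all hypothetical.

```python
import math


def same_order_of_magnitude(dataset_value, reference_value, tolerance=0.5):
    """Return True when the two values differ by less than `tolerance` decades."""
    if dataset_value == 0 or reference_value == 0:
        # log10 is undefined at zero; only two zeros count as a match
        return dataset_value == reference_value
    decades_apart = abs(math.log10(abs(dataset_value) / abs(reference_value)))
    return decades_apart < tolerance


# Illustrative numbers only, not taken from the dataset or from FoodData Central
print(same_order_of_magnitude(215.0, 214.0))  # nearly equal, so the units likely match
print(same_order_of_magnitude(215.0, 0.215))  # off by 1000x, so the units likely differ
```

A mismatch of about three decades is the typical signature of a g/mg or mg/µg mix-up, which is why comparing on a log scale is convenient here.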

Taking a look at the data again, everything seems to make sense. I find it a little strange that the Carbohydrate standard deviation is so high, but that could be because of the variety of foods in the dataset.

Also, though most of these fields are self-explanatory, I'll provide a quick definition for the ones that aren't:
* **Ash**: The inorganic residue remaining after the water and organic matter have been removed by heating in the presence of oxidizing agents, which provides a measure of the total amount of minerals within a food.
* **Refuse Percentage**: The percentage of a food that is not normally consumed, e.g. bones, shells, seeds, etc.
* **Retinol**: A form of Vitamin A.

As a final note, I will not be translating the categorical data found in the *Category* column into a set of dummy variables because the categorization is arbitrary. For example, the column contains a few entries that are not necessarily food types, e.g. 'Spices and Herbs', 'no category', etc., and occasionally foods of the same type are split into categories by brand, e.g. 'Soup', 'Campbell's Soup'. I could clean this up by introducing a new roll-up category, but this is beyond the scope of this project.

To practice working with dummy variables, however, I will remove the *Vitamins.Vitamin B12* column because, as we can see below, it has the lowest correlation with Kilocalories of all the Vitamins. I'll then add a dummy variable for the column. 
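
The text does not say how the dummy variable is defined. One possible reading, sketched here on a made-up three-row frame, is a binary flag marking whether a food contains any Vitamin B12. The flag name `Has Vitamin B12` and the `> 0` threshold are assumptions, and the toy values are not from `food.csv`.

```python
import pandas as pd

# Toy data for illustration only; these values are not from food.csv
toy = pd.DataFrame({
    'Vitamins.Vitamin B12': [0.0, 0.17, 2.5],
    'Kilocalories': [52, 717, 250],
})

# Hypothetical dummy: 1 when the food contains any B12, 0 otherwise
toy['Has Vitamin B12'] = (toy['Vitamins.Vitamin B12'] > 0).astype(int)

# Replace the continuous column with its dummy
toy = toy.drop(columns=['Vitamins.Vitamin B12'])
print(toy)
```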

Finally, let's drop the *Data Bank Number* column because it is a surrogate key and the *Category* and *Description* columns because they are categorical variables that we are not going to be using in our model.

In [None]:
# Drop the 'Data Bank Number', 'Category', and 'Description' columns
dff.drop(['Category', 'Description', 'Nutrient Data Bank Number'], axis=1, inplace=True)

## Train Test Split
Let's load the pre-processed data into an H2O frame and split it up into training and test sets. We will use 80% of the data for training and 20% for testing.

Remember, our predictors will be all the remaining columns except for *Kilocalories*, our response variable.

In [None]:
# Show initial shape of dataframe
print('Initial shape of dataframe:', dff.shape)

In [None]:
# Convert the dataframe to an H2OFrame
df = h2o.H2OFrame(dff)

# Splitting the data into training and test sets
pct_rows=0.80
df_train, df_test = df.split_frame([pct_rows], seed = 1)

# Show shape of training set and test set
print('Training set shape:', df_train.shape)
print('Test set shape:', df_test.shape)
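
Note that `split_frame` assigns each row to a split at random rather than cutting the frame at an exact row count, so the printed shapes are only approximately 80/20. The sketch below imitates that idea in plain Python. It illustrates probabilistic splitting in general, not H2O's actual implementation, and `probabilistic_split` is a hypothetical name.

```python
import random


def probabilistic_split(rows, train_ratio=0.8, seed=1):
    """Send each row to train with probability train_ratio, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_ratio else test).append(row)
    return train, test


rows = list(range(1000))
train, test = probabilistic_split(rows)
# Sizes land near 800/200 but are generally not exact
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which is why the cell above passes `seed = 1`.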

In [None]:
# Setting the predictor and response variables
y = 'Kilocalories'
X = list(df.columns)
X.remove(y)

# Print predictors
print('Predictors:', X)

# Print response variable
print('Response:', y)

## Fitting Some Models

### Linear Regression