# Recipes Analysis

**Name(s)**: Daniel Budidharma, Tristan Leo

**Website Link**: https://vdanielb.github.io/RecipesAnalysis/

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

import plotly.express as px
import plotly.graph_objects as go
pd.options.plotting.backend = 'plotly'

# from dsc80_utils import * # Feel free to uncomment and use this.

## Step 1: Introduction

First let's load in the dataset and take a look at it.

In [2]:
recipes = pd.read_csv('data/RAW_recipes.csv')
interactions = pd.read_csv('data/RAW_interactions.csv')

In [3]:
display(recipes.head())
display(interactions.head())

Unnamed: 0.1,Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
0,111,1 brownies in the world best ever,333281,40,985201,2008-10-27,"['60-minutes-or-less', 'time-to-make', 'course...","[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]",10,['heat the oven to 350f and arrange the rack i...,"these are the most; chocolatey, moist, rich, d...","['bittersweet chocolate', 'unsalted butter', '...",9
1,115,1 in canada chocolate chip cookies,453467,45,1848091,2011-04-11,"['60-minutes-or-less', 'time-to-make', 'cuisin...","[595.1, 46.0, 211.0, 22.0, 13.0, 51.0, 26.0]",12,"['pre-heat oven the 350 degrees f', 'in a mixi...",this is the recipe that we use at my school ca...,"['white sugar', 'brown sugar', 'salt', 'margar...",11
2,118,412 broccoli casserole,306168,40,50969,2008-05-30,"['60-minutes-or-less', 'time-to-make', 'course...","[194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0]",6,"['preheat oven to 350 degrees', 'spray a 2 qua...",since there are already 411 recipes for brocco...,"['frozen broccoli cuts', 'cream of chicken sou...",9
3,119,millionaire pound cake,286009,120,461724,2008-02-12,"['time-to-make', 'course', 'cuisine', 'prepara...","[878.3, 63.0, 326.0, 13.0, 20.0, 123.0, 39.0]",7,"['freheat the oven to 300 degrees', 'grease a ...",why a millionaire pound cake? because it's su...,"['butter', 'sugar', 'eggs', 'all-purpose flour...",7
4,125,2000 meatloaf,475785,90,2202916,2012-03-06,"['time-to-make', 'course', 'main-ingredient', ...","[267.0, 30.0, 12.0, 12.0, 29.0, 48.0, 2.0]",17,"['pan fry bacon , and set aside on a paper tow...","ready, set, cook! special edition contest entr...","['meatloaf mixture', 'unsmoked bacon', 'goat c...",13


Unnamed: 0,user_id,recipe_id,date,rating,review
0,1293707,40893,2011-12-21,5,"So simple, so delicious! Great for chilly fall..."
1,126440,85009,2010-02-27,5,I made the Mexican topping and took it to bunk...
2,57222,85009,2011-10-01,5,"Made the cheddar bacon topping, adding a sprin..."
3,124416,120345,2011-08-06,0,"Just an observation, so I will not rate. I fo..."
4,2000192946,120345,2015-05-10,2,This recipe was OVERLY too sweet. I would sta...


## Step 2: Data Cleaning and Exploratory Data Analysis

We first remove the `Unnamed: 0` column from our `recipes` dataframe. `Unnamed: 0` is just the row's index number in the original dataset before we took a subset of it.

In [4]:
recipes = recipes.drop(columns=["Unnamed: 0"])

Let's look at a particular row in `interactions`

In [5]:
display(interactions.iloc[3:4])
print(interactions['review'].iloc[3])

Unnamed: 0,user_id,recipe_id,date,rating,review
3,124416,120345,2011-08-06,0,"Just an observation, so I will not rate. I fo..."


Just an observation, so I will not rate.  I followed this procedure with strawberries instead of raspberries.  Perhaps this is the reason it did not work well.  Sorry to report that the strawberries I did in August were moldy in October.  They were stored in my downstairs fridge, which is very cold and infrequently opened.  Delicious and fresh-tasting prior to that, though.  So, keep a sharp eye on them.  Personally I would not keep them longer than a month.  This recipe also appears as #120345 posted in July 2009, which is when I tried it.  I also own the Edna Lewis cookbook in which this appears.


Notice that the lowest possible rating a user can give is 1 star. So how does this review have a rating of 0? It turns out a 0 means the reviewer didn't leave a rating at all. As the review in this particular row says, "...so I will not rate". It makes sense then to replace these values with NaN, so that they don't drag down the averages we compute later.

In [6]:
interactions['rating'] = interactions['rating'].replace(0, np.nan)

Another thing we should notice is that the values in the `tags` column in `recipes` aren't actually lists. This is also true for other columns with values that look like lists. They're actually strings! To convert them into lists, we define a function and apply it to all those columns:

In [7]:
def convert_col_string_to_list(df, col):
    translation_table = str.maketrans({"[": "", 
                                   "]": "",
                                    "\'":""})
    df[col] = df[col].str.translate(translation_table).str.split(', ')

for col in ['tags','nutrition', 'steps', 'ingredients']:
    convert_col_string_to_list(recipes, col)
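One caveat of stripping characters and splitting on `', '` is that any element that itself contains a comma followed by a space (common in `steps`) gets broken into several pieces. As a hedged alternative, not what this notebook uses, the standard-library `ast.literal_eval` parses the string as a Python list literal. The sample string below is made up for illustration.

```python
import ast

# Made-up example in the same format as the raw `steps` column.
raw = "['pan fry bacon , and set aside', 'mix well, then bake']"

# Approach used above: strip brackets and quotes, then split on ', '.
table = str.maketrans({"[": "", "]": "", "'": ""})
split_version = raw.translate(table).split(', ')

# Alternative: parse the string as a Python list literal.
parsed_version = ast.literal_eval(raw)

print(split_version)
print(parsed_version)
```

With pandas this would look like `recipes[col].apply(ast.literal_eval)`, assuming the column has no missing values.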

And let's verify they're lists now:

In [8]:
print("The type of the value is: ",type(recipes['tags'].iloc[4268]))
(recipes['tags'].iloc[4268])

The type of the value is:  <class 'list'>


['30-minutes-or-less',
 'time-to-make',
 'course',
 'main-ingredient',
 'cuisine',
 'preparation',
 'occasion',
 'north-american',
 'low-protein',
 'healthy',
 'condiments-etc',
 'vegetables',
 'american',
 'southwestern-united-states',
 'tex-mex',
 'easy',
 'dietary',
 'spicy',
 'low-sodium',
 'low-cholesterol',
 'low-calorie',
 'low-carb',
 'garnishes',
 'healthy-2',
 'low-in-something',
 'onions',
 'peppers',
 'tomatoes',
 'taste-mood',
 'number-of-servings',
 '3-steps-or-less']

Now we can actually perform list operations on those columns. Next, we're interested in finding the average rating per recipe. To do that we'll first have to merge the `recipes` and `interactions` dataframes.

In [9]:
recipes_with_ratings = recipes.merge(interactions, left_on='id', right_on='recipe_id',how='left')
recipes_with_ratings.head()

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients,user_id,recipe_id,date,rating,review
0,1 brownies in the world best ever,333281,40,985201,2008-10-27,"[60-minutes-or-less, time-to-make, course, mai...","[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]",10,[heat the oven to 350f and arrange the rack in...,"these are the most; chocolatey, moist, rich, d...","[bittersweet chocolate, unsalted butter, eggs,...",9,386585.0,333281.0,2008-11-19,4.0,"These were pretty good, but took forever to ba..."
1,1 in canada chocolate chip cookies,453467,45,1848091,2011-04-11,"[60-minutes-or-less, time-to-make, cuisine, pr...","[595.1, 46.0, 211.0, 22.0, 13.0, 51.0, 26.0]",12,"[pre-heat oven the 350 degrees f, in a mixing ...",this is the recipe that we use at my school ca...,"[white sugar, brown sugar, salt, margarine, eg...",11,424680.0,453467.0,2012-01-26,5.0,Originally I was gonna cut the recipe in half ...
2,412 broccoli casserole,306168,40,50969,2008-05-30,"[60-minutes-or-less, time-to-make, course, mai...","[194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0]",6,"[preheat oven to 350 degrees, spray a 2 quart ...",since there are already 411 recipes for brocco...,"[frozen broccoli cuts, cream of chicken soup, ...",9,29782.0,306168.0,2008-12-31,5.0,This was one of the best broccoli casseroles t...
3,412 broccoli casserole,306168,40,50969,2008-05-30,"[60-minutes-or-less, time-to-make, course, mai...","[194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0]",6,"[preheat oven to 350 degrees, spray a 2 quart ...",since there are already 411 recipes for brocco...,"[frozen broccoli cuts, cream of chicken soup, ...",9,1196280.0,306168.0,2009-04-13,5.0,I made this for my son's first birthday party ...
4,412 broccoli casserole,306168,40,50969,2008-05-30,"[60-minutes-or-less, time-to-make, course, mai...","[194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0]",6,"[preheat oven to 350 degrees, spray a 2 quart ...",since there are already 411 recipes for brocco...,"[frozen broccoli cuts, cream of chicken soup, ...",9,768828.0,306168.0,2013-08-02,5.0,Loved this. Be sure to completely thaw the br...


`recipes_with_ratings` is now a dataframe with multiple rows for a single recipe, each row corresponding to a review for that recipe. If it has no reviews, then the columns associated with a review should be NaN. Now let's compute the average rating per recipe and include that in our original `recipes` dataframe, no duplicates.

In [10]:
recipes_with_ratings['average_rating'] = recipes_with_ratings.groupby('id')['rating'].transform('mean')
recipes = recipes_with_ratings.drop_duplicates(subset='id')
recipes = recipes.drop(columns=['user_id', 'date', 'recipe_id','rating','review'])
print(recipes.shape) #verify it still has same number of rows. It does
recipes.head()

(83782, 13)


Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients,average_rating
0,1 brownies in the world best ever,333281,40,985201,2008-10-27,"[60-minutes-or-less, time-to-make, course, mai...","[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]",10,[heat the oven to 350f and arrange the rack in...,"these are the most; chocolatey, moist, rich, d...","[bittersweet chocolate, unsalted butter, eggs,...",9,4.0
1,1 in canada chocolate chip cookies,453467,45,1848091,2011-04-11,"[60-minutes-or-less, time-to-make, cuisine, pr...","[595.1, 46.0, 211.0, 22.0, 13.0, 51.0, 26.0]",12,"[pre-heat oven the 350 degrees f, in a mixing ...",this is the recipe that we use at my school ca...,"[white sugar, brown sugar, salt, margarine, eg...",11,5.0
2,412 broccoli casserole,306168,40,50969,2008-05-30,"[60-minutes-or-less, time-to-make, course, mai...","[194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0]",6,"[preheat oven to 350 degrees, spray a 2 quart ...",since there are already 411 recipes for brocco...,"[frozen broccoli cuts, cream of chicken soup, ...",9,5.0
6,millionaire pound cake,286009,120,461724,2008-02-12,"[time-to-make, course, cuisine, preparation, o...","[878.3, 63.0, 326.0, 13.0, 20.0, 123.0, 39.0]",7,"[freheat the oven to 300 degrees, grease a 10-...",why a millionaire pound cake? because it's su...,"[butter, sugar, eggs, all-purpose flour, whole...",7,5.0
7,2000 meatloaf,475785,90,2202916,2012-03-06,"[time-to-make, course, main-ingredient, prepar...","[267.0, 30.0, 12.0, 12.0, 29.0, 48.0, 2.0]",17,"[pan fry bacon , and set aside on a paper towe...","ready, set, cook! special edition contest entr...","[meatloaf mixture, unsmoked bacon, goat cheese...",13,5.0
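To make the merge-then-average step easier to follow, here is a minimal sketch on invented data. The names `toy_recipes` and `toy_reviews` are hypothetical stand-ins for `recipes` and `interactions`. It shows that a left merge keeps recipes without reviews, and that a recipe whose ratings are all missing ends up with a NaN average.

```python
import pandas as pd

# Toy stand-ins for `recipes` and `interactions` (values are invented).
toy_recipes = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
toy_reviews = pd.DataFrame({"recipe_id": [1, 1, 2], "rating": [4.0, 5.0, None]})

# Left merge keeps recipe 3 even though nobody reviewed it.
merged = toy_recipes.merge(toy_reviews, left_on="id", right_on="recipe_id", how="left")

# Mean rating per recipe; NaN ratings are skipped, and all-NaN groups stay NaN.
avg = merged.groupby("id")["rating"].mean()
toy_recipes["average_rating"] = toy_recipes["id"].map(avg)
print(toy_recipes)
```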


Now we can start on some EDA.

Naively, we might expect the distribution of ratings to look something like a normal distribution, with most people rating 3 stars for average satisfaction, while few people would have extreme experiences that would warrant a 5 star or 1 star. Does our ratings column look like a normal distribution? Let's check.

In [11]:
px.histogram(recipes, x="average_rating")

Surprisingly, there are a lot of 5s. Does this mean every recipe on food.com is a masterpiece? Probably not. It just means people are generous with ratings. It also might mean recipes that would've been rated low just don't get reviewed as much as recipes that are rated high. This makes sense: higher ratings lead to more views, which lead to even more reviews.  
<br> Still, this isn't good because it means the average rating doesn't tell us much about the actual quality of a recipe compared to other recipes. If everything is 5 stars, how do we know which recipe is better than another? It is for this reason that we think any analysis involving the average rating probably won't be very useful.

We can do something similar with number of reviews of each recipe. We calculate the number of reviews for each recipe and then we plot a histogram.

In [12]:
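# Note: counting 'name' counts merged rows, so a recipe with no reviews still gets 1
# (and a recipe whose name is missing gets 0). Counting 'user_id' would count actual reviews.
ids_num_ratings = recipes_with_ratings.groupby('id').count()['name']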
recipes = recipes.set_index('id')
recipes['num_reviews'] = ids_num_ratings
recipes = recipes.reset_index()
px.histogram(recipes['num_reviews'])

As you can see, an overwhelming majority of recipes have only 1 review. So any analysis or prediction involving this would also likely be meaningless. For example, we could build a very accurate model that predicts the number of reviews a recipe will get by doing no calculations and just predicting 1 every time.

Let's also look at what kind of tags there are:

In [13]:
recipes['tags'].explode().unique()

array(['60-minutes-or-less', 'time-to-make', 'course', 'main-ingredient',
       'preparation', 'for-large-groups', 'desserts', 'lunch', 'snacks',
       'cookies-and-brownies', 'chocolate', 'bar-cookies', 'brownies',
       'number-of-servings', 'cuisine', 'north-american', 'canadian',
       'british-columbian', 'side-dishes', 'vegetables', 'easy',
       'beginner-cook', 'broccoli', 'occasion', 'american',
       'southern-united-states', 'dinner-party', 'holiday-event', 'cakes',
       'dietary', 'christmas', 'thanksgiving', 'low-sodium',
       'low-in-something', 'taste-mood', 'sweet', '4-hours-or-less',
       'main-dish', 'potatoes', 'meatloaf', 'simply-potatoes2',
       'weeknight', '30-minutes-or-less', 'beef', 'diabetic',
       'kid-friendly', 'stove-top', 'comfort-food', 'inexpensive',
       'ground-beef', 'meat', 'greens', 'lettuces', 'tomatoes',
       'equipment', '3-steps-or-less', 'soups-stews', 'beans', 'pork',
       'mexican', 'stews', 'crock-pot-slow-cooker', 's

## Step 3: Assessment of Missingness

Let's see how much missing data we have, as well as a breakdown of missing values in each column.

In [14]:
print('total missing values: ', recipes.isna().sum().sum())
recipes.isna().sum()

total missing values:  2680


id                   0
name                 1
minutes              0
contributor_id       0
submitted            0
tags                 0
nutrition            0
n_steps              0
steps                0
description         70
ingredients          0
n_ingredients        0
average_rating    2609
num_reviews          0
dtype: int64

In [15]:
print('total missing values: ', interactions.isna().sum().sum())
interactions.isna().sum()

total missing values:  52001


user_id          0
recipe_id        0
date             0
rating       51832
review         169
dtype: int64

We'll look at some of these. Firstly, let's look at the one missing name value in `recipes`.

In [16]:
recipes[recipes['name'].isna()]

Unnamed: 0,id,name,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients,average_rating,num_reviews
238,368257,,10,779451,2009-04-27,"[15-minutes-or-less, time-to-make, course, pre...","[1596.2, 249.0, 155.0, 0.0, 2.0, 112.0, 14.0]",6,"[in a bowl , combine ingredients except for ol...",-------------,"[lemon, honey, horseradish mustard, garlic clo...",10,,0


Since it's only 1 missing value in this column out of 83,782 rows, doing a missingness analysis on this column would be pretty meaningless, and its effect would be negligible anyway. It is still very weird that this one recipe name is missing.

Another column in `recipes` with missing values is `description`. We believe this is NMAR because whether a description is missing likely depends on the description itself: if the user believes there is no need to describe the dish, they simply leave it blank and it shows up as a missing value.

Next we should consider the rating column. It has the most missing values out of all the columns. This makes sense because there are many people who write reviews or comments on a recipe without leaving a rating. Our guess is this is MCAR. A permutation test can't prove MCAR outright, but it can check whether the missingness of rating depends on another observed column, so we'll use one to look for evidence against MCAR. Our hypotheses are:
- **Null Hypothesis**: The rating column is MCAR
- **Alternative Hypothesis**: The rating column is not MCAR

In [17]:
#TODO : The thing
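The cell above is still a TODO, so the following is only a sketch of what such a missingness permutation test could look like, not the authors' implementation. It runs on synthetic data, and `review_length` is a hypothetical column chosen purely for illustration. The test statistic is the absolute difference in mean `review_length` between rows with and without a rating.

```python
import numpy as np

rng = np.random.default_rng(12)
n = 500

# Synthetic data: missingness is generated independently of review_length.
review_length = rng.integers(10, 500, size=n).astype(float)
rating_missing = rng.random(n) < 0.3

def abs_diff_in_means(values, is_missing):
    """Absolute difference in mean between the missing and non-missing groups."""
    return abs(values[is_missing].mean() - values[~is_missing].mean())

observed = abs_diff_in_means(review_length, rating_missing)

# Shuffle the missingness labels to simulate "missingness is unrelated to review_length".
simulated = np.empty(1000)
for i in range(simulated.size):
    shuffled = rng.permutation(rating_missing)
    simulated[i] = abs_diff_in_means(review_length, shuffled)

p_value = (simulated >= observed).mean()
print("p-value:", p_value)
```

A large p-value here would only mean we found no evidence that missingness depends on this one column. It would not rule out dependence on other columns or on the rating itself.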

## Step 4: Hypothesis Testing

We're interested in comparing American and Asian dishes. Specifically, we're concerned about health. Now, a healthy diet is usually a balanced diet, so we can't conclude one nutrient is objectively better to always have more of. But we can at the very least say saturated fat is objectively **bad** for you. Many national and international health organizations, such as [The American Heart Association](https://www.heart.org/en/healthy-living/healthy-eating/eat-smart/fats/saturated-fats) and [World Health Organization](https://www.who.int/news/item/17-07-2023-who-updates-guidelines-on-fats-and-carbohydrates) recommend either limiting or replacing saturated fat intake.<br><br>
So to compare the healthiness of American and Asian dishes, we will be focusing on saturated fat content. We will do this comparison using a hypothesis test. 

First, some data wrangling. We need to extract the saturated fat from the nutrition column, which is currently a column of lists, with each list containing the values of various nutrients. We know from looking at the website that the saturated fat is the second-to-last entry in each list, so we extract that and assign it to a new column:

In [18]:
recipes['saturated_fat'] = recipes['nutrition'].apply(lambda x: float(x[-2]))

We should keep in mind the saturated fat values are in percentages of daily value. 

Next, we assign labels to every row depending on if it's an American or Asian recipe. This information is stored in the tags, and all the tags are lowercase which makes our job easier. We assign a new column to see if the recipe is asian, american, or neither:

In [19]:
recipes['asian_or_american'] = recipes['tags'].apply(lambda x: 'asian' if 'asian' in x else 'american' if 'american' in x else 'neither')
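One detail of the labelling rule above is easy to miss: the `'asian'` check comes first, so a recipe tagged both `asian` and `american` is labelled `asian`. Also, `in` on a list checks for an exact element, so `'north-american'` alone does not count as `'american'`. The helper name `label_cuisine` and the tag lists below are made up for illustration.

```python
def label_cuisine(tags):
    """Same rule as the lambda above, written out as a function."""
    if 'asian' in tags:
        return 'asian'
    if 'american' in tags:
        return 'american'
    return 'neither'

both = label_cuisine(['asian', 'american', 'easy'])
american_only = label_cuisine(['north-american', 'american'])
no_match = label_cuisine(['north-american', 'desserts'])
print(both, american_only, no_match)
```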

Now we filter the dataset to only include Asian and American recipes. We name this dataframe `asia_america_recipes`, and we will perform a permutation test on it.

In [20]:
asia_america_recipes = recipes[recipes['asian_or_american']!='neither']
asia_america_recipes.iloc[18:21]

Unnamed: 0,id,name,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients,average_rating,num_reviews,saturated_fat,asian_or_american
150,432027,pink stuff,75,1646101,2010-07-06,"[time-to-make, course, main-ingredient, cuisin...","[954.5, 48.0, 572.0, 38.0, 48.0, 129.0, 49.0]",4,[mix cool whip and cottage cheese in large bow...,this is my friends moms recipe. i vary it from...,"[cool whip, low-fat small-curd cottage cheese,...",5,5.0,1,129.0,american
153,395202,pumpkin pie filling for mexico,70,128945,2009-10-18,"[time-to-make, course, main-ingredient, cuisin...","[772.0, 30.0, 150.0, 33.0, 58.0, 41.0, 36.0]",7,[using an egg beater or food processor pure th...,"i live in chapala, jalisco, mexico and canned ...","[cooked sweet potatoes, mexican crema, milk, e...",13,,1,41.0,american
155,315110,rathu isso curry sri lankan red prawn curry,35,518707,2008-07-22,"[curries, 60-minutes-or-less, time-to-make, co...","[456.0, 27.0, 206.0, 47.0, 37.0, 77.0, 18.0]",5,[wash prawns and remove heads but leave shells...,another sri lankan recipe from chamaine solomo...,"[prawns, onion, garlic cloves, fresh ginger, c...",14,5.0,1,77.0,asian


Now that we have a wrangled dataset, we can get to work constructing our hypothesis test. To decide our alternative hypothesis, we see which group currently has the higher mean saturated fat:

In [21]:
mean_satfat_asia = asia_america_recipes[asia_america_recipes['asian_or_american']=='asian']['saturated_fat'].mean()
mean_satfat_america = asia_america_recipes[asia_america_recipes['asian_or_american']=='american']['saturated_fat'].mean()
print('Asian mean saturated fat: ', mean_satfat_asia, '\nAmerican mean saturated fat: ', mean_satfat_america)

Asian mean saturated fat:  30.088743169398906 
American mean saturated fat:  44.822358346094944


We observe that American recipes have higher saturated fat on average. So that will be our alternative hypothesis. Our hypotheses are:
- **Null Hypothesis**: American and Asian recipes on food.com have the same amount of saturated fat.
- **Alternative Hypothesis**: American recipes have more saturated fat than Asian recipes.
- Our test statistic will be `Mean saturated fat in American recipes` - `Mean saturated fat in Asian recipes`

In [22]:
observed_stat = mean_satfat_america - mean_satfat_asia

num_simulations = 5000
shuffled_df = asia_america_recipes.copy()
simulated_stats = []

for i in range(num_simulations):
    shuffled_df['asian_or_american'] = np.random.permutation(shuffled_df['asian_or_american'])

    shuffled_satfat_america = shuffled_df[shuffled_df['asian_or_american']=='american']['saturated_fat'].mean()
    shuffled_satfat_asia = shuffled_df[shuffled_df['asian_or_american']=='asian']['saturated_fat'].mean()

    one_sim_stat = shuffled_satfat_america-shuffled_satfat_asia
    simulated_stats.append(one_sim_stat)

simulated_stats = np.array(simulated_stats)
# The p-value is the proportion (not the count) of simulated stats at least as large as the observed one.
p_value = np.mean(simulated_stats >= observed_stat)
print('The p value is: ', p_value)

The p value is:  0.0


In [23]:
fig = px.histogram(simulated_stats)
fig.add_vline(x=observed_stat, line_width=2,  line_color="red")
fig.add_annotation(
    x=observed_stat,
    y=1,
    yref="paper",
    text="Observed statistic",
    showarrow=True,
    arrowhead=1
)
fig.show()

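Our p-value is 0, meaning none of the 5000 simulated differences was as large as the observed one. At any conventional significance level (e.g. 0.05), we reject the null hypothesis. The data supports the conclusion that American recipes on food.com have more saturated fat on average than Asian recipes.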
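An empirical p-value of exactly 0 really means "smaller than the resolution of the simulation". A common convention, which is an addition here and not something the notebook does, is to report `(count + 1) / (n_sims + 1)` so the estimate is never exactly 0. The sketch below uses synthetic saturated-fat values invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(12)

# Synthetic saturated-fat values (invented), with a clear gap between groups.
american = rng.normal(45, 10, size=200)
asian = rng.normal(30, 10, size=200)
values = np.concatenate([american, asian])
is_american = np.arange(values.size) < american.size

observed = values[is_american].mean() - values[~is_american].mean()

n_sims = 2000
count = 0
for _ in range(n_sims):
    shuffled = rng.permutation(is_american)
    stat = values[shuffled].mean() - values[~shuffled].mean()
    count += int(stat >= observed)

raw_p = count / n_sims                     # can be exactly 0
adjusted_p = (count + 1) / (n_sims + 1)    # never 0; counts the observed labelling itself
print(raw_p, adjusted_p)
```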

For fun, we'll plot the distribution of saturated fat in Asian recipes against the distribution of saturated fat in American recipes as a sanity check.

In [24]:
fig = px.histogram(asia_america_recipes[asia_america_recipes['asian_or_american']=='asian']['saturated_fat'])
fig.data[0].name = 'Asia'
fig.add_trace(
    go.Histogram(
        x=asia_america_recipes[asia_america_recipes['asian_or_american']=='american']['saturated_fat'],
        opacity=0.7,
        name='America'
    )
)
fig.show()

## Step 5: Framing a Prediction Problem

One challenge we face as college students is trying to manage time. So we decided to build a model that could predict the total cooking time (in minutes) of whatever one might want to cook. This will be a regression problem. 

We will prioritize RMSE as our performance metric. We feel this is better than R^2 for this problem because when we want an estimate of how long a recipe will take to make, we'd be more worried about how "off" that estimate might be than about how "good" the fit of our model is. RMSE is also more interpretable: it is in the same units as the target, so an RMSE of 10 minutes means our estimates are typically off by roughly 10 minutes. So we will prioritize RMSE, but we will also still keep track of R^2 to see the fit of our model.
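As a small worked example of the two metrics, here they are computed by hand on four invented cooking times. The numbers are made up. MAE is included only to show that RMSE is close to, but not the same as, the plain average error.

```python
import numpy as np

# Invented cooking times in minutes and invented predictions.
y_true = np.array([30.0, 45.0, 60.0, 90.0])
y_pred = np.array([40.0, 40.0, 70.0, 80.0])

errors = y_true - y_pred

rmse = np.sqrt(np.mean(errors ** 2))   # in minutes, penalizes large misses more
mae = np.mean(np.abs(errors))          # in minutes, the plain average miss
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # unitless

print(f"RMSE: {rmse:.2f}  MAE: {mae:.2f}  R^2: {r2:.3f}")
```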

One easy way to build a really accurate model for this is to look at the tags that say '60-minutes-or-less' or '30-minutes-or-less'. However, this would be uninteresting and also kind of defeat the purpose. In the "real world", when you're trying to cook a new recipe, you won't know those tags. So we'll ignore them. We will also ignore nutrition values other than calories. There's no way we could know exactly how many carbs or how much protein our recipe will have, but people are generally more familiar with estimating calories, so we'll use that.

## Step 6: Baseline Model

For our baseline model, our features will be number of ingredients, number of steps, and calories per serving. First, we import the necessary libraries.

In [25]:
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import Pipeline

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, FunctionTransformer

The amount of calories is still stored in the `nutrition` column, so we extract that information and assign it to a new column `calories`. We will also remove some outliers. We choose to remove recipes that take 3 hours or more to make and recipes that have 2000 calories or more, since 2000 calories is the daily reference intake commonly used on nutrition labels.

In [26]:
recipes['calories'] = recipes['nutrition'].apply(lambda x: float(x[0]))
recipes_no_outliers = recipes[(recipes['minutes'] < 180) & (recipes['calories']<2000)]

Now let's do some scatterplots to get an idea of the fit of our model.

In [27]:
px.scatter(recipes_no_outliers, x='n_steps', y='minutes')

In [28]:
px.scatter(recipes_no_outliers, x='n_ingredients', y='minutes')

In [29]:
px.scatter(recipes_no_outliers, x='calories', y='minutes')

It turns out the data has no clear pattern, so a linear regression probably won't do well. We'll try it out anyway. First, a train test split. We will be using this same train test split throughout most of the project. We will also use random_state=12 for reproducibility and consistency.

In [30]:
X_train, X_test, y_train, y_test = (
    train_test_split(recipes_no_outliers[["n_ingredients", "n_steps", "nutrition"]], recipes_no_outliers["minutes"], random_state=12)
)

In [31]:
def extract_calories(nutrition_col):
    # nutrition entries are still strings after the list conversion, so cast explicitly
    return nutrition_col.apply(lambda x: float(x[0]))

#extracts calories from nutrition col
nutrition_transformer = Pipeline([
    ("calories", FunctionTransformer(lambda x: x.apply(extract_calories).values.reshape(-1, 1))),
])

preprocessor = ColumnTransformer([
    ("calories", nutrition_transformer, ["nutrition"])
    ],
remainder='passthrough')

pl = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", LinearRegression())
])

model = pl.fit(X_train, y_train)
print("Training R^2: ", model.score(X_train,y_train))
print("Test R^2: ", model.score(X_test,y_test))
print("Test RMSE: ", np.sqrt(mean_squared_error(y_pred=model.predict(X_test), y_true=y_test)))

Training R^2:  0.1993134599394183
Test R^2:  0.2091395180276311
Test RMSE:  27.368402303468894


As you can see, it performs pretty badly. That's not surprising considering how the scatterplots looked. So we choose to use a decision tree instead. We will set max_depth=10 to avoid overfitting and set random_state=12 for reproducibility.

In [32]:
from sklearn.tree import DecisionTreeRegressor

pl = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", DecisionTreeRegressor(max_depth=10,random_state=12))
])

model = pl.fit(X_train, y_train)
print("Training R^2: ", model.score(X_train,y_train))
print("Test R^2: ", model.score(X_test,y_test))
print("Test RMSE: ", np.sqrt(mean_squared_error(y_pred=model.predict(X_test), y_true=y_test)))

Training R^2:  0.2646823618997144
Test R^2:  0.1768201428544265
Test RMSE:  27.922022650693687


Let's evaluate our model using K-Fold Cross Validation

In [33]:
kf = KFold(n_splits=5, shuffle=True)
X = recipes_no_outliers[["n_ingredients", "n_steps", "nutrition"]]
y = recipes_no_outliers["minutes"]

scoring = {
    "R2": make_scorer(r2_score)
}

scores = {}
for metric in scoring:
    score = cross_val_score(pl, X, y, cv=kf, scoring=scoring[metric])
    scores[metric] = score.mean()

rmse_scores = np.sqrt(-cross_val_score(pl, X, y, cv=kf, scoring="neg_mean_squared_error"))

# Print results
print(f"Mean RMSE: {rmse_scores.mean():.2f}")
print(f"Mean R2 Score: {scores['R2']:.2f}")

Mean RMSE: 28.32
Mean R2 Score: 0.17


Still not very good. To improve our final model we'll use a random forest to avoid overfitting and also GridSearchCV to tune our hyperparameters. We'll also include more features. Then we will see if LinearRegression or RandomForestRegressor is better.

## Step 7: Final Model

We'll do a random forest and use GridSearchCV to fine tune our model.  
  
Furthermore, we're going to one hot encode a list of ingredients. There are too many unique ingredients in this whole dataset to feasibly one hot encode, so we'll focus on a few common ones. That is, we're going to engineer features for whether a recipe contains each of these ingredients: ['beef', 'pork', 'chicken', 'corn', 'potatoes', 'rice', 'bread', 'pasta', 'milk', 'cheese', 'butter', 'sugar', 'flour', 'tomatoes', 'squash']. Since there could be many types of the same ingredient (e.g. sweet corn vs normal corn, unsalted butter vs salted butter), we will make it so that any instance of that word appearing in the ingredients column means the ingredient is present. For example, if a recipe has 'sweet corn' as an ingredient, we consider that as containing corn.

Also, we're going to one hot encode whether a recipe is of category 'breakfast', 'lunch', 'dinner-party', 'desserts', or 'snacks'.
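The substring rule for ingredients can be sketched on a couple of made-up ingredient lists. The helper name `contains_flags` is hypothetical. The second call shows a side effect worth keeping in mind: substring matching also fires on unrelated ingredients such as 'cornstarch' or 'licorice'.

```python
# Subset of the target ingredients, for brevity.
targets = ['corn', 'rice', 'butter']

def contains_flags(ingredients, targets):
    """Return {target: 1 or 0}, where 1 means the word appears inside any ingredient string."""
    return {t: int(any(t in item for item in ingredients)) for t in targets}

intended = contains_flags(['sweet corn', 'unsalted butter'], targets)
false_positives = contains_flags(['cornstarch', 'licorice'], targets)
print(intended)
print(false_positives)
```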

Since our model now uses the additional columns 'ingredients' and 'tags', we will have to make a new train_test_split. But we'll still use random_state=12 for consistency so we can more accurately compare with the baseline model.

In [34]:
X_train, X_test, y_train, y_test = (
    train_test_split(recipes_no_outliers[["n_ingredients", "n_steps", "nutrition", "ingredients", "tags"]], 
                     recipes_no_outliers["minutes"], 
                     random_state=12)
)

In [35]:
from sklearn.ensemble import RandomForestRegressor

target_ingredients = ['beef', 'pork', 'chicken', 'corn', 'potatoes', 'rice', 'bread', 'pasta',
                      'milk', 'cheese', 'butter', 'sugar', 'flour', 'tomatoes', 'squash']
def ingredient_onehot_encoder(X):
    df_encoded = pd.DataFrame()
    
    for ingredient in target_ingredients:
        df_encoded[ingredient] = X['ingredients'].apply(lambda x: int(any(ingredient in item for item in x)))

    return df_encoded

nutrition_features = ['calories', 'total_fat', 'sugar', 'sodium', 
                      'protein', 'saturated_fat', 'carbohydrates']
def extract_nutrients(X):
    df = pd.DataFrame()
    for i,nutrition in enumerate(nutrition_features):
        df[nutrition] = X['nutrition'].apply(lambda x: x[i])

    return df

target_types = ['breakfast', 'lunch', 'dinner-party', 'desserts', 'snacks']
def extract_category(X):
    df = pd.DataFrame()
    for category in target_types:
        df[f'is_{category}'] = X['tags'].apply(lambda x: int(category in x))
    return df

final_preprocessor = ColumnTransformer([
    ("calories", nutrition_transformer, ["nutrition"]),
    ('onehot', FunctionTransformer(ingredient_onehot_encoder, validate=False), ['ingredients']),
    ('food_category', FunctionTransformer(extract_category, validate=False), ['tags']),
    ('n_steps', 'passthrough', ['n_steps']),
    ('n_ingredients', 'passthrough', ['n_ingredients'])
    ],
    remainder='drop')

pl = Pipeline([
    ("preprocessor", final_preprocessor),
    ("regressor", RandomForestRegressor(max_depth=10,random_state=12))
])

model = pl.fit(X_train, y_train)
print("Test R^2: ", model.score(X_test,y_test))
print("Test RMSE: ", np.sqrt(mean_squared_error(y_pred=model.predict(X_test), y_true=y_test)))

Test R^2:  0.26774067124671386
Test RMSE:  26.334917623056903


Now that we have a basic pipeline, let's explore some more options for how we could improve our final model.

To whoever is reading this notebook: the process below takes a long time since we're comparing a lot of different models and training techniques. Scroll to the end of this section to get a breakdown of the performance of each model and which model we finally decide on as our final model.

### Linear Regression vs Random Forest Regressor performance

First, let's tune the hyperparameters of our Random Forest Regressor and then compare its performance against the linear model. 

We will tune the number of trees, as well as the max depth. This is to find a sweet spot between bias and variance, and ultimately avoid underfitting or overfitting. We will also test a few criteria. The reason we don't go straight for squared error as the criterion is that each decision tree in the forest only uses the criterion to pick the best local split, which does not necessarily minimize RMSE for the whole model. So it's possible that the poisson criterion, for example, ends up with a lower overall RMSE. We use every available criterion except for absolute error (MAE). This is because MAE is very slow and makes GridSearchCV run for hours. Sadly, that is a limitation.
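For a sense of how expensive such a search is, the sketch below counts the hyperparameter combinations for a grid with three values per setting (the values match the grid in the next cell). It assumes GridSearchCV's default of 5 cross-validation folds, which applies when `cv` is not passed.

```python
from itertools import product

n_estimators = [10, 50, 100]
max_depths = list(range(2, 30, 10))   # same values as np.arange(2, 30, 10): 2, 12, 22
criteria = ['squared_error', 'friedman_mse', 'poisson']

# Every combination the grid search has to try.
candidates = list(product(n_estimators, max_depths, criteria))

n_folds = 5  # assumed default number of folds
total_fits = len(candidates) * n_folds
print(len(candidates), "candidates,", total_fits, "fits before the final refit")
```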

In [36]:
from sklearn.model_selection import GridSearchCV
hyperparameters = {
    'regressor__n_estimators':  [10, 50, 100],
    'regressor__max_depth': np.arange(2, 30, 10), 
    'regressor__criterion': ['squared_error', 'friedman_mse', 'poisson']
}
grids = GridSearchCV(
    pl,
    n_jobs=-1, 
    param_grid=hyperparameters,
    return_train_score=True,
    scoring='neg_mean_squared_error'
)

grids.fit(X_train, y_train)
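# Note: GridSearchCV.score uses the `scoring` passed above (negative MSE), so the value printed under the R^2 label is -MSE, not R^2
print("Test R^2 RandomForest model: ", grids.score(X_test, y_test))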
print("Test RMSE RandomForest model: ", np.sqrt(mean_squared_error(y_pred=grids.predict(X_test), y_true=y_test)))

Test R^2 RandomForest model:  -692.4467420208257
Test RMSE RandomForest model:  26.31438279764178


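The negative "R^2" looks very worrying at first, but it is not actually R^2. Because the search was created with scoring='neg_mean_squared_error', grids.score returns the negative MSE on the test set, and -692.45 is just -(26.31)^2. When we used R^2 as the scoring before this, we got a slightly higher RMSE and an R^2 of 0.28. We'll do another GridSearchCV scored on R^2 further down to compare the RMSEs. For now, we check our Linear Regression model.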
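A quick arithmetic check on the two numbers printed above supports reading the negative score as a negated MSE rather than a true R^2. This is only a sanity check on the reported values, not a re-run of the model; the variable names are ours.

```python
import math

# Values copied from the printed output above.
reported_score = -692.4467420208257  # printed under the "R^2" label
reported_rmse = 26.31438279764178    # printed under the "RMSE" label

# If the score were the negated MSE, it would equal -(RMSE ** 2).
neg_mse_from_rmse = -(reported_rmse ** 2)
matches = math.isclose(reported_score, neg_mse_from_rmse, rel_tol=1e-9)

print(neg_mse_from_rmse, matches)
```

To report an actual R^2 for a search scored on MSE, one option is to call `sklearn.metrics.r2_score(y_test, grids.predict(X_test))` directly instead of `grids.score`.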

In [None]:
linear_pl = Pipeline([
    ("preprocessor", final_preprocessor),
    ("regressor", LinearRegression())
])
linear_model = linear_pl.fit(X_train, y_train)

print("Test R^2 linear model: ", linear_model.score(X_test,y_test))
print("Test RMSE linear model: ", np.sqrt(mean_squared_error(y_pred=linear_model.predict(X_test), y_true=y_test)))

Test R^2 linear model:  0.2520543407044272
Test RMSE linear model:  26.61549395757894


Our random forest regressor performs better here, going by RMSE (26.31 vs. 26.62). **So we'll focus on using random forest from now on**. Let's keep its best params in mind for further fine-tuning later.

In [38]:
grids.best_params_

{'regressor__criterion': 'squared_error',
 'regressor__max_depth': np.int64(12),
 'regressor__n_estimators': 100}

### Good recipes only vs all recipes (still excluding outliers)

At this point we suspect that our fit is poor because of bad data quality. Anyone on food.com can upload a recipe with arbitrary n_steps, minutes, ingredients, etc., and there is no real quality check. So we'll try filtering the recipes down to "good quality" data points only. We decide a data point is of "good quality" if it has an average rating >= 4. From the EDA, we saw that a lot of recipes have an average rating above 4, so this shouldn't hurt our sample size too much. The pipeline stays the same.

However, for the purposes of this project, the comparison with the baseline model isn't really "fair", since we need a new train/test split containing only good recipes rather than the split used for the baseline model. So if good recipes end up giving better results, we'll still write about both models (fit on all recipes and fit on only good recipes) in our report and in our notebook.

In [39]:
good_recipes = recipes_no_outliers[recipes_no_outliers["average_rating"] >= 4]
good_recipes.shape #still a good amount of data!

(69364, 17)

In [40]:
X_train_good, X_test_good, y_train_good, y_test_good = (
    train_test_split(good_recipes[["n_ingredients", "n_steps", "nutrition", "ingredients", 'tags']], 
                     good_recipes["minutes"], 
                     random_state=12)
)


model_good = pl.fit(X_train_good, y_train_good)
print("Test R^2: ", model_good.score(X_test_good,y_test_good))
print("Test RMSE: ", np.sqrt(mean_squared_error(y_pred=model_good.predict(X_test_good), y_true=y_test_good)))

Test R^2:  0.26484795893757374
Test RMSE:  26.333788257702953


In [None]:
grids_good = GridSearchCV(
    pl,
    n_jobs=-1,
    param_grid=hyperparameters,
    return_train_score=True,
    scoring='neg_mean_squared_error'
)

grids_good.fit(X_train_good, y_train_good)
# grids_good.score returns negative MSE here (scoring='neg_mean_squared_error'), not R^2
print("Test R^2: ", grids_good.score(X_test_good, y_test_good))
print("Test RMSE: ", np.sqrt(mean_squared_error(y_pred=grids_good.predict(X_test_good), y_true=y_test_good)))


invalid value encountered in cast



Test R^2:  -690.4144251477217
Test RMSE:  26.275738336871175


We can fine-tune our model further by looking at best_params_ and running another GridSearchCV on a narrower grid around those values. We train two models using GridSearchCV, one optimized for R^2 and the other for RMSE.

In [42]:
grids_good.best_params_

{'regressor__criterion': 'poisson',
 'regressor__max_depth': np.int64(12),
 'regressor__n_estimators': 100}

In [None]:
hyperparameters = {
    'regressor__n_estimators':  np.arange(80,121,10),
    'regressor__max_depth': np.arange(12, 25, 2), 
    'regressor__criterion': ['squared_error', 'poisson'],
    'regressor__max_features': [None, 'sqrt', 'log2'],
}
good_model_rmse = GridSearchCV(
    pl,
    n_jobs=-1, 
    param_grid=hyperparameters,
    return_train_score=True,
    scoring='neg_mean_squared_error'
)
good_model_rmse.fit(X_train_good, y_train_good)
# good_model_rmse.score returns negative MSE here (scoring='neg_mean_squared_error'), not R^2
print("Test R^2, trained on good recipes only, optimized on RMSE: ", good_model_rmse.score(X_test_good, y_test_good))
print("RMSE, trained on good recipes only, optimized on RMSE: ", np.sqrt(mean_squared_error(y_pred=good_model_rmse.predict(X_test_good), y_true=y_test_good)))
print("\n")
print("Test R^2, tested on all recipes regardless of rating: ", good_model_rmse.score(X_test, y_test))
print("Test RMSE, tested on all recipes regardless of rating: ", np.sqrt(mean_squared_error(y_pred=good_model_rmse.predict(X_test), y_true=y_test)))


invalid value encountered in cast



Test R^2:  -674.2368983833263
RMSE:  25.966072063046546
Test R^2 on all recipes regardless of rating:  -540.5248409956642
Test RMSE on all recipes regardless of rating:  23.24919011483334


In [None]:
good_model_rmse.best_params_

{'regressor__criterion': 'poisson',
 'regressor__max_depth': np.int64(16),
 'regressor__max_features': 'sqrt',
 'regressor__n_estimators': np.int64(110)}

In [None]:
good_model_r2 = GridSearchCV(
    pl,
    n_jobs=-1, 
    param_grid=hyperparameters,
    return_train_score=True,
)
good_model_r2.fit(X_train_good, y_train_good)
print("Test R^2, trained on good recipes only, optimized on R2: ", good_model_r2.score(X_test_good, y_test_good))
print("RMSE, trained on good recipes only, optimized on R2: ", np.sqrt(mean_squared_error(y_pred=good_model_r2.predict(X_test_good), y_true=y_test_good)))
print("\n")
print("Test R^2, tested on all recipes regardless of rating: ", good_model_r2.score(X_test, y_test))
print("Test RMSE, tested on all recipes regardless of rating: ", np.sqrt(mean_squared_error(y_pred=good_model_r2.predict(X_test), y_true=y_test)))


invalid value encountered in cast



Test R^2, trained on good recipes only, optimized on R2:  0.28523544959518354
RMSE, trained on good recipes only, optimized on R2:  25.966072063046546
Test R^2, tested on all recipes regardless of rating:  0.429288475474092
Test RMSE, tested on all recipes regardless of rating:  23.24919011483334


It looks like our model fit on good recipes has a better RMSE when tested on the full dataset. But hang on, it's very unintuitive to get a better R^2 and RMSE on the full dataset by fitting on a subset of it. Maybe some data points in X_train_good are also in X_test, so the model is being tested on some of its own training data?

In [106]:
#check for any X_tests that are in X_train_good
len(X_test[X_test.index.isin(X_train_good.index)])

12987

That seems to be the case: 12,987 rows of X_test were also used to train the model on good recipes. So we ignore the R^2 and RMSE scores tested on the whole dataset here.
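If we still wanted a score on the full test set, one option would be to drop the overlapping rows first. The sketch below shows the idea on hypothetical toy frames that stand in for X_test and X_train_good; only the index values matter, and the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's splits.
X_test = pd.DataFrame({"n_steps": [5, 8, 3, 12]}, index=[10, 11, 12, 13])
X_train_good = pd.DataFrame({"n_steps": [8, 12, 7]}, index=[11, 13, 20])

# Rows of X_test that the good-recipes model already saw during training.
leaked = X_test.index.isin(X_train_good.index)
print("overlapping rows:", leaked.sum())

# Keep only unseen rows for a leak-free evaluation set.
X_test_unseen = X_test[~leaked]
print(X_test_unseen.index.tolist())
```

The same boolean mask would also need to be applied to y_test so that features and targets stay aligned.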

Let's compare the performance of our model trained and tested on good_recipes against our model trained and tested on all recipes (without outliers). We run another GridSearchCV for the model trained on all recipes, again training two models: one optimized for minimal RMSE and the other for maximal R^2.

In [67]:
full_model_rmse = GridSearchCV(
    pl,
    n_jobs=-1, 
    param_grid=hyperparameters,
    return_train_score=True,
    scoring='neg_mean_squared_error',
)
full_model_rmse.fit(X_train, y_train)
# full_model_rmse.score returns negative MSE here (scoring='neg_mean_squared_error'), not R^2
print("Test R^2 on full recipes, optimized on RMSE: ", full_model_rmse.score(X_test, y_test))
print("Test RMSE on full recipes, optimized on RMSE: ", np.sqrt(mean_squared_error(y_pred=full_model_rmse.predict(X_test), y_true=y_test)))


invalid value encountered in cast



Test R^2 on full recipes, optimized on RMSE:  -672.7940250945276
Test RMSE on full recipes, optimized on RMSE:  25.938273363786717


In [68]:
full_model_rmse.best_params_

{'regressor__criterion': 'squared_error',
 'regressor__max_depth': np.int64(16),
 'regressor__max_features': 'sqrt',
 'regressor__n_estimators': np.int64(120)}

In [69]:
full_model_r2 = GridSearchCV(
    pl,
    n_jobs=-1, 
    param_grid=hyperparameters,
    return_train_score=True,
)
full_model_r2.fit(X_train, y_train)
print("Test R^2, model optimized for R^2: ", full_model_r2.score(X_test, y_test))
print("Test RMSE, model optimized for R^2: ", np.sqrt(mean_squared_error(y_pred=full_model_r2.predict(X_test), y_true=y_test)))


invalid value encountered in cast



Test R^2, model optimized for R^2:  0.2896324560286031
Test RMSE, model optimized for R^2:  25.938273363786717


### Our Final Model

So now we have a number of possible models. We show them in this DataFrame along with their RMSE and R^2 scores, rounded to two decimal places. The DataFrame is sorted so that the lowest RMSE is at the top. If there is a tie, we pick the model with the higher R^2.

In [89]:
possible_models = pd.DataFrame()
possible_models['metric'] = ['RMSE', 'R^2']
possible_models['Baseline Linear Regression'] = [27.37, 0.21]
possible_models['Baseline Decision Tree Regressor'] = [28.32, 0.17]

possible_models['Linear Regression on full recipes'] = [26.62, 0.25]
possible_models['Random Forest Regressor on full recipes, optimized on RMSE'] = [25.94, -672.79]
possible_models['Random Forest Regressor on full recipes, optimized on R^2'] = [25.94, 0.29]

possible_models['Random Forest Regressor on only good recipes, optimized on RMSE'] = [25.97, -674.24]
possible_models['Random Forest Regressor on only good recipes, optimized on R^2'] = [25.97, 0.29]

possible_models = possible_models.T.drop('metric')
possible_models.columns = ['RMSE', 'R^2']
possible_models = possible_models.sort_values(by=['RMSE', 'R^2'], ascending=[True, False])
possible_models

Unnamed: 0,RMSE,R^2
"Random Forest Regressor on full recipes, optimized on R^2",25.94,0.29
"Random Forest Regressor on full recipes, optimized on RMSE",25.94,-672.79
"Random Forest Regressor on only good recipes, optimized on R^2",25.97,0.29
"Random Forest Regressor on only good recipes, optimized on RMSE",25.97,-674.24
Linear Regression on full recipes,26.62,0.25
Baseline Linear Regression,27.37,0.21
Baseline Decision Tree Regressor,28.32,0.17


Curiously, RMSE is the same regardless of whether we optimize GridSearchCV for R^2 or RMSE, while the reported R^2 becomes hugely negative when optimizing on RMSE. We considered two possible explanations.

**Possibility number 1** is that when optimizing for RMSE, GridSearchCV selects different hyperparameters than when optimizing for R^2, and these hyperparameters give predictions with a low RMSE but also a low R^2.

**Possibility number 2** is that the models optimized for different metrics emphasize different features, and the difference in feature importance explains the difference in R^2.


However, when we check these possibilities, they're both wrong! The best params and feature importances are the same across both. The real cause is the scoring itself: calling score on a GridSearchCV created with scoring='neg_mean_squared_error' returns the negative MSE, not R^2. The value -672.79 is simply -(25.94)^2, so the negative entries in the R^2 column above are negated MSEs, and both searches actually select the same model.

In [100]:
full_model_r2.best_params_ == full_model_rmse.best_params_

True

In [105]:
full_model_r2.best_estimator_["regressor"].feature_importances_ == full_model_rmse.best_estimator_["regressor"].feature_importances_

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True])
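Identical hyperparameters and importances mean both searches end up with the same forest, so a true R^2 and the MSE on a fixed test set cannot disagree: they are tied together by R^2 = 1 - MSE / Var(y). The sketch below checks that identity on synthetic numbers (not the recipe data) using only NumPy.

```python
import numpy as np

rng = np.random.default_rng(12)

# Synthetic targets and predictions, purely for illustration.
y_true = rng.normal(loc=40.0, scale=30.0, size=1_000)
y_pred = y_true + rng.normal(scale=25.0, size=1_000)

mse = np.mean((y_true - y_pred) ** 2)
r2 = 1.0 - mse / np.var(y_true)  # identity: R^2 = 1 - MSE / Var(y)

# Same quantity computed from the sums-of-squares definition of R^2.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_def = 1.0 - ss_res / ss_tot

print(f"MSE={mse:.2f}  neg-MSE={-mse:.2f}  R^2={r2:.3f}")
```

Because the variance of the test targets is fixed, lowering the MSE always raises R^2, which is why ranking models by either metric on the same test set gives the same order.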

Our final model will be a Random Forest Regressor fit and tested on full recipes, optimized to minimize RMSE and maximize R^2.

In [107]:
final_model = full_model_r2
final_model

In [108]:
final_model.best_estimator_["regressor"].feature_importances_

array([0.17933863, 0.03160233, 0.01452837, 0.02727502, 0.01420941,
       0.02883087, 0.01094859, 0.01444908, 0.00497241, 0.01605361,
       0.02065575, 0.01979831, 0.02091821, 0.04099357, 0.01387434,
       0.00541447, 0.01304659, 0.01900575, 0.01559504, 0.01296939,
       0.0061225 , 0.28260076, 0.18679701])

Looking at feature importances, the top 3 most important features used in our model are n_steps, n_ingredients, and calories. This is to be expected.
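The ranking can be read off the printed array programmatically. The sketch below copies the importances from the output above and sorts them. Mapping positions to names is an assumption on our part: the last two positions follow from n_steps and n_ingredients being the final two entries of the ColumnTransformer, and position 0 is taken to be calories on the assumption that the nutrition features come first.

```python
import numpy as np

# Importances copied from the array printed above.
importances = np.array([
    0.17933863, 0.03160233, 0.01452837, 0.02727502, 0.01420941,
    0.02883087, 0.01094859, 0.01444908, 0.00497241, 0.01605361,
    0.02065575, 0.01979831, 0.02091821, 0.04099357, 0.01387434,
    0.00541447, 0.01304659, 0.01900575, 0.01559504, 0.01296939,
    0.0061225, 0.28260076, 0.18679701,
])

# Indices of the three largest importances, largest first.
top3 = np.argsort(importances)[::-1][:3]
top3_share = importances[top3].sum()

print("top-3 positions:", top3.tolist())
print("share of total importance:", round(float(top3_share), 3))
```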

## Step 8: Fairness Analysis

In [50]:
# TODO

# Previous attempt

Below is the code for a previous prediction problem. We decided not to continue with it for the reasons described below, but we keep it here for documentation purposes. Most of the code is commented out so that it doesn't slow down running the notebook.

## Step 5.1: Framing a Prediction Problem

Our original plan was to predict whether a recipe was American or not based on nutrition, n_ingredients, and n_steps. However, this proved to be uninteresting. While our model did reach an accuracy of 88.9%, our recall, precision, and F1 score were all 0. After further investigation, it turned out that our model guessed 0 (not American) every time. Most recipes are not American, so it makes sense that the model would predict not American every time to maximize accuracy. While this made the model more accurate, it drove the F1 score to 0. We ultimately decided not to continue with this prediction problem because a model that predicts the same thing every time isn't interesting at all, even if it is highly accurate.
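The accuracy/F1 mismatch is easy to reproduce by hand. The sketch below uses hypothetical label counts, chosen so that 11.1% of recipes are American (the share implied by an always-0 model scoring 88.9% accuracy), and computes the metrics for a model that always predicts "not American". Undefined precision is treated as 0 by convention.

```python
# Hypothetical counts: 11.1% positives, matching the 88.9% accuracy above.
n_total, n_american = 1000, 111
y_true = [1] * n_american + [0] * (n_total - n_american)
y_pred = [0] * n_total  # a model that always predicts "not American"

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total
precision = tp / (tp + fp) if (tp + fp) else 0.0  # no positive predictions
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)

print(f"accuracy={accuracy:.3f} precision={precision} recall={recall} f1={f1}")
```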

## Step 6.1: Baseline Model

In [51]:
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error, mean_squared_error, r2_score, f1_score, precision_score, recall_score, accuracy_score
from sklearn.pipeline import Pipeline

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, FunctionTransformer

In [52]:
# from sklearn.tree import DecisionTreeClassifier
# recipes['is_american'] = recipes['tags'].apply(lambda x: 1 if 'american' in x else 0)
# X_train, X_test, y_train, y_test = (
#     train_test_split(recipes[["n_ingredients", "n_steps", "nutrition"]]
#                      , recipes["is_american"],
#                      random_state=12)
# )

# def extract_calories(nutrition_col):
#     return (nutrition_col.apply(lambda x: x[0]))

# #extracts calories from nutrition col
# nutrition_transformer = Pipeline([
#     ("extract", FunctionTransformer(lambda x: x.apply(extract_calories).values.reshape(-1, 1))),
# ])

# preprocessor = ColumnTransformer([
#     ("nutrition", nutrition_transformer, ["nutrition"])
#     ],
# remainder='passthrough')

# pl = Pipeline([
#     ("preprocessor", preprocessor),
#     ("regressor", DecisionTreeClassifier(max_depth=15,random_state=12))
# ])

# model = pl.fit(X_train, y_train)
# print("Training Accuracy: ", model.score(X_train,y_train))
# print("Test Accuracy: ", model.score(X_test,y_test))

In [53]:
# # Define K-Fold Cross-Validation
# kf = KFold(n_splits=5, shuffle=True)
# X = recipes[["n_ingredients", "n_steps", "nutrition"]]
# y = recipes["is_american"]

# # Define scoring metrics
# scoring = {
#     "F1": make_scorer(f1_score),
#     "Precision": make_scorer(precision_score),
#     "Recall": make_scorer(recall_score),
#     "Accuracy": make_scorer(accuracy_score)
# }

# # Perform cross-validation
# scores = {}
# for metric in scoring:
#     score = cross_val_score(pl, X, y, cv=kf, scoring=scoring[metric])
#     scores[metric] = score.mean()

# # Print results
# print(f"Mean F1: {scores['F1']:.2f}")
# print(f"Mean Precision: {scores['Precision']:.2f}")
# print(f"Mean Recall: {scores['Recall']:.2f}")
# print(f"Mean Accuracy: {scores['Accuracy']:.2f}")

## Step 7.1: Final Model

In [54]:
# from sklearn.ensemble import RandomForestClassifier

# X_train, X_test, y_train, y_test = (
#     train_test_split(recipes[["n_ingredients", "n_steps", "nutrition", 'ingredients']], 
#                      recipes["is_american"],
#                     random_state=12)
# )


# def ingredient_onehot_encoder(X):
#     target_ingredients = ['beef', 'pork', 'chicken', 'corn', 'potatoes', 'rice', 'bread', 'pasta',
#                       'milk', 'cheese', 'butter', 'sugar', 'flour', 'tomatoes', 'squash']
#     df_encoded = pd.DataFrame()
    
#     for ingredient in target_ingredients:
#         df_encoded[ingredient] = X['ingredients'].apply(lambda x: int(any(ingredient in item for item in x)))

#     return df_encoded

# def extract_nutrients(X):
#     nutrition_features = ['calories', 'total_fat', 'sugar', 'sodium', 
#                           'protein', 'saturated_fat', 'carbohydrates']
#     df = pd.DataFrame()
#     for i,nutrition in enumerate(nutrition_features):
#         df[nutrition] = X['nutrition'].apply(lambda x: x[i])

#     return df

# preprocessor = ColumnTransformer([
#     ('nutrition', FunctionTransformer(extract_nutrients, validate=False), ['nutrition']),
#     ('onehot', FunctionTransformer(ingredient_onehot_encoder, validate=False), ['ingredients'])
#     ],
# remainder='passthrough')

# pl = Pipeline([
#     ("preprocessor", preprocessor),
#     ("regressor", RandomForestClassifier(max_depth=10,random_state=12))
# ])

# model = pl.fit(X_train, y_train)
# print("Training Accuracy: ", model.score(X_train,y_train))
# print("Test Accuracy: ", model.score(X_test,y_test))

In [55]:
# from sklearn.model_selection import GridSearchCV
# hyperparameters = {
#     'regressor__n_estimators':  [10, 50, 100],
#     'regressor__max_depth': np.arange(2, 30, 10), 
#     'regressor__criterion': ['gini', 'entropy']
# }
# grids = GridSearchCV(
#     pl,
#     n_jobs=-1, # Use multiple processors to parallelize
#     param_grid=hyperparameters,
#     return_train_score=True
# )
# grids.fit(X_train, y_train)

In [56]:
# y_pred = grids.predict(X_test)

In [57]:
# print("accuracy: ", grids.score(X_test, y_test))
# print("precision: ", precision_score(y_test, y_pred))
# print("recall: ", recall_score(y_test, y_pred))
# print("f1: ", f1_score(y_test, y_pred))

In [58]:
# grids.best_params_

In [59]:
# y_pred.sum()

While accuracy did go up by 0.9%, our precision, recall, and F1 score became 0. After further investigation, it seems it's because our model guesses 0 (not American) every time, shown by how the sum of y_pred is 0. This is due to how most of the data is not American, so it makes sense that our model would want to predict not American every time to maximize accuracy. While this does make the model more accurate, it makes our F1 score very low. We ultimately decided not to continue with this prediction problem because making a model that only predicts one thing every time isn't interesting at all, even if it is highly accurate.