# Recipes Analysis

**Name(s)**: Daniel Budidharma, Tristan Leo

**Website Link**: https://vdanielb.github.io/RecipesAnalysis/

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

import plotly.express as px
import plotly.graph_objects as go
pd.options.plotting.backend = 'plotly'

# from dsc80_utils import * # Feel free to uncomment and use this.

## Step 1: Introduction

First let's load in the dataset and take a look at it.

In [None]:
recipes = pd.read_csv('data/RAW_recipes.csv')
interactions = pd.read_csv('data/RAW_interactions.csv')

In [None]:
display(recipes.head())
display(interactions.head())

## Step 2: Data Cleaning and Exploratory Data Analysis

We first remove the `Unnamed: 0` column from our `recipes` dataframe. It is just the row index from the original dataset, left over from before we took a subset of it.

In [None]:
recipes = recipes.drop(columns=["Unnamed: 0"])

Let's look at a particular row in `interactions`:

In [None]:
display(interactions.iloc[3:4])
print(interactions['review'].iloc[3])

Notice that the lowest possible rating a user can give is 1 star. So how does this review have a rating of 0? It turns out that a 0 means the reviewer didn't leave a rating at all. As the review in this particular row says, "...so I will not rate". It therefore makes sense to replace these values with NaN.

In [None]:
interactions['rating'] = interactions['rating'].replace(0, np.nan)
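
To see why this replacement matters, here is a minimal sketch on made-up ratings (the values below are hypothetical, not taken from the dataset). Treating the placeholder 0 as a real rating drags the mean down, while `NaN` is skipped by `mean()`.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for one recipe; 0 stands for "no rating left".
ratings_toy = pd.Series([5, 0, 4])

mean_with_zeros = ratings_toy.mean()                   # 0 is counted as a rating
mean_with_nan = ratings_toy.replace(0, np.nan).mean()  # 0 is treated as missing

print(mean_with_zeros, mean_with_nan)
```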

Another thing to notice is that the values in the `tags` column of `recipes` aren't actually lists. The same is true for the other columns whose values look like lists. They're actually strings! To convert them into lists, we define a function and apply it to all of those columns:

In [None]:
def convert_col_string_to_list(df, col):
    translation_table = str.maketrans({"[": "", 
                                   "]": "",
                                    "\'":""})
    df[col] = df[col].str.translate(translation_table).str.split(', ')

for col in ['tags','nutrition', 'steps', 'ingredients']:
    convert_col_string_to_list(recipes, col)
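
One caveat of stripping brackets and quotes and then splitting on `', '` is that any item that itself contains a comma followed by a space (common in `steps`) gets split into several pieces. As a hedged alternative, assuming every value is a well-formed Python list literal, the standard library's `ast.literal_eval` can parse the string directly. The helper name `parse_list_string` and the example strings below are hypothetical.

```python
import ast

def parse_list_string(value):
    """Parse a string such as "['a', 'b']" into a real Python list."""
    parsed = ast.literal_eval(value)
    if not isinstance(parsed, list):
        raise ValueError(f"expected a list literal, got {type(parsed).__name__}")
    return parsed

# Hypothetical values shaped like the `tags`, `nutrition`, and `steps` columns.
tags_raw = "['60-minutes-or-less', 'main-dish', 'american']"
nutrition_raw = "[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]"
steps_raw = "['mix flour, sugar, and salt', 'bake for 20 minutes']"

tags_toy = parse_list_string(tags_raw)
nutrition_toy = parse_list_string(nutrition_raw)   # numbers come back as floats
steps_toy = parse_list_string(steps_raw)           # commas inside a step are preserved

# For comparison, the strip-and-split approach breaks the first step apart.
table = str.maketrans({"[": "", "]": "", "'": ""})
steps_naive = steps_raw.translate(table).split(", ")

print(len(steps_toy), len(steps_naive))
```

With this approach the nutrition values are already floats, so later `float(...)` conversions become no-ops rather than errors.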

And let's verify they're lists now

In [None]:
print("The type of the value is: ",type(recipes['tags'].iloc[4268]))
(recipes['tags'].iloc[4268])

Now we can actually perform list operations on those columns. Next, we're interested in finding the average rating per recipe. To do that, we first have to merge the `recipes` and `interactions` dataframes.

In [None]:
recipes_with_ratings = recipes.merge(interactions, left_on='id', right_on='recipe_id',how='left')
recipes_with_ratings.head()

`recipes_with_ratings` is now a dataframe with multiple rows for a single recipe, each row corresponding to a review for that recipe. If a recipe has no reviews, the review-related columns are NaN. Now let's compute the average rating per recipe and add it to our original `recipes` dataframe, keeping one row per recipe.

In [None]:
recipes_with_ratings['average_rating'] = recipes_with_ratings.groupby('id')['rating'].transform('mean')
recipes = recipes_with_ratings.drop_duplicates(subset='id')
recipes = recipes.drop(columns=['user_id', 'date', 'recipe_id','rating','review'])
print(recipes.shape)
recipes.head()
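
The same per-recipe average can also be reached without building the large merged table. The sketch below is only an illustration on toy data (the `_toy` frames are hypothetical): it aggregates the ratings first and then looks each recipe id up in the result.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for `recipes` and `interactions`.
recipes_toy = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
interactions_toy = pd.DataFrame({
    "recipe_id": [1, 1, 2],
    "rating": [5.0, np.nan, 4.0],   # NaN = review without a rating
})

# Mean rating per recipe; NaN ratings are skipped by mean().
avg_rating = interactions_toy.groupby("recipe_id")["rating"].mean()

# Look up each recipe id; recipes with no reviews get NaN.
recipes_toy["average_rating"] = recipes_toy["id"].map(avg_rating)
print(recipes_toy)
```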

Now we can start on some EDA.

The distribution of ratings should theoretically look something like a normal distribution, with most people rating 3 stars for average satisfaction, while few people would have extreme experiences that would warrant a 5 star or 1 star. Does our ratings column look like a normal distribution? Let's check.

In [None]:
px.histogram(recipes, x="average_rating")

Surprisingly, there are a lot of 5s. Does this mean every recipe on food.com is a masterpiece? Probably not. It more likely means people are generous with ratings. It might also mean that recipes that would have been rated low just don't get reviewed as much as recipes that are rated high. This makes sense: higher ratings lead to more views, which lead to even more reviews.  
<br> Still, this isn't good, because it means the average rating doesn't tell us much about the quality of a recipe relative to other recipes. If everything is 5 stars, how do we know which recipe is better? For this reason, we think any analysis involving the average rating probably won't be very useful.

We can do something similar with number of reviews of each recipe. We define a function to get the number of reviews of each recipe id. And then we plot a histogram.

In [None]:
def get_num_reviews(recipe_id):
    # Count the rows of `interactions` whose recipe_id matches this recipe.
    return interactions[interactions['recipe_id'] == recipe_id].shape[0]
recipes['num_reviews'] = recipes['id'].apply(get_num_reviews)
px.histogram(recipes['num_reviews'])
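
Filtering `interactions` once per recipe can be slow on a large table. A minimal sketch of a vectorized alternative, shown on toy data (the `_toy` frames are hypothetical), counts all reviews in one pass with `value_counts` and then maps the counts onto the recipe ids:

```python
import pandas as pd

# Toy stand-ins for `recipes` and `interactions`.
recipes_toy = pd.DataFrame({"id": [10, 20, 30]})
interactions_toy = pd.DataFrame({
    "recipe_id": [10, 10, 30],
    "user_id": [1, 2, 1],
})

# One pass over the reviews: recipe_id -> number of reviews.
review_counts = interactions_toy["recipe_id"].value_counts()

# Recipes that never appear in the reviews get 0.
recipes_toy["num_reviews"] = (
    recipes_toy["id"].map(review_counts).fillna(0).astype(int)
)
print(recipes_toy)
```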

As you can see, an overwhelming majority of recipes have 0 reviews. So any analysis or prediction involving this would also likely be meaningless. For example, I can build a very accurate model that predicts the number of reviews a recipe will get by doing no calculations and just predicting 0 every time.

## Step 3: Assessment of Missingness

Let's see how much data is missing, along with a breakdown of missing values in each column.

In [None]:
print('total missing values: ', recipes.isna().sum().sum())
recipes.isna().sum()

In [None]:
print('total missing values: ', interactions.isna().sum().sum())
interactions.isna().sum()

We'll look at some of these. Firstly, let's look at the one missing name value in `recipes`.

In [None]:
recipes[recipes['name'].isna()]

Since this column has only 1 missing value out of hundreds of thousands of rows, a missingness analysis on it would be pretty meaningless, and the effect would be negligible anyway.

Another column in `recipes` with missing values is 'description'. We believe this is NMAR because if the user believes there is no need to describe the dish, then it will simply have no description and therefore be a missing value.

Next we consider the `rating` column, which has the most missing values of any column. This makes sense because many people write reviews or comments on a recipe without leaving a rating. Our guess is that this missingness is MCAR. We'll perform a permutation test to check that guess. Our hypotheses are:
- **Null Hypothesis**: The rating column is MCAR
- **Alternative Hypothesis**: The rating column is not MCAR
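
A permutation test can only check whether missingness depends on one particular other column, so failing to reject is consistent with MCAR but does not prove it. The sketch below is one possible shape for such a test, not the analysis for this project: the function name, the synthetic data, and the choice of the absolute difference in means as the test statistic are all assumptions.

```python
import numpy as np

def missingness_permutation_test(values, is_missing, n_permutations=1000, seed=0):
    """Return (observed statistic, p-value) for dependence of missingness on `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    is_missing = np.asarray(is_missing, dtype=bool)

    def abs_mean_diff(mask):
        # Absolute difference in mean `values` between missing and non-missing rows.
        return abs(values[mask].mean() - values[~mask].mean())

    observed = abs_mean_diff(is_missing)
    simulated = np.array([
        abs_mean_diff(rng.permutation(is_missing)) for _ in range(n_permutations)
    ])
    return observed, float(np.mean(simulated >= observed))

# Synthetic stand-ins: `n_steps_toy` plays the role of some other column,
# the boolean arrays flag rows whose rating would be NaN.
rng = np.random.default_rng(1)
n_steps_toy = rng.poisson(10, size=200)
missing_unrelated = np.arange(200) % 5 == 0   # unrelated to n_steps_toy by construction
missing_dependent = n_steps_toy > 12          # depends on n_steps_toy by construction

_, p_unrelated = missingness_permutation_test(n_steps_toy, missing_unrelated)
_, p_dependent = missingness_permutation_test(n_steps_toy, missing_dependent)
print(p_unrelated, p_dependent)
```

On the real data, `values` would be a column of the merged dataframe and `is_missing` would be `rating.isna()`; which column to test against is a choice the analysis still has to make.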

In [None]:
#TODO : The thing

## Step 4: Hypothesis Testing

We're interested in comparing American and Asian dishes. Specifically, we're concerned about health. Now, a healthy diet is usually a balanced diet, so we can't conclude one nutrient is objectively better to always have more of. But we can at the very least say saturated fat is objectively **bad** for you. Many national and international health organizations, such as [The American Heart Association](https://www.heart.org/en/healthy-living/healthy-eating/eat-smart/fats/saturated-fats) and [World Health Organization](https://www.who.int/news/item/17-07-2023-who-updates-guidelines-on-fats-and-carbohydrates) recommend either limiting or replacing saturated fat intake.<br><br>
So to compare the healthiness of American and Asian dishes, we will be focusing on saturated fat content. We will do this comparison using a hypothesis test. 

First, some data wrangling. We need to extract the saturated fat from the nutrition column, which is currently a column of lists, with each list containing the values of various nutrients. We know from looking at the website that the saturated fat is the second-to-last entry in each list, so we extract that and assign it to a new column.

In [None]:
recipes['saturated_fat'] = recipes['nutrition'].apply(lambda x: float(x[-2]))

We should keep in mind the saturated fat values are in percentages of daily value. 

Next, we assign labels to every row depending on if it's an American or Asian recipe. This information is stored in the tags, and all the tags are lowercase which makes our job easier. We assign a new column to see if the recipe is asian, american, or neither:

In [None]:
recipes['asian_or_american'] = recipes['tags'].apply(lambda x: 'asian' if 'asian' in x else 'american' if 'american' in x else 'neither')

Now we filter the dataset to include only Asian and American recipes, which is the data we will run our permutation test on. We name this dataframe `asia_america_recipes`.

In [None]:
asia_america_recipes = recipes[recipes['asian_or_american']!='neither']
asia_america_recipes.iloc[18:21]

Now that we have a wrangled dataset, we can get to work constructing our hypothesis test. To decide our alternative hypothesis, we see which one currently has the higher mean saturated fat

In [None]:
mean_satfat_asia = asia_america_recipes[asia_america_recipes['asian_or_american']=='asian']['saturated_fat'].mean()
mean_satfat_america = asia_america_recipes[asia_america_recipes['asian_or_american']=='american']['saturated_fat'].mean()
print('Asian mean saturated fat: ', mean_satfat_asia, '\nAmerican mean saturated fat: ', mean_satfat_america)

We observe that American recipes have higher saturated fat on average, so our alternative hypothesis will be one-sided in that direction. Our hypotheses are:
- **Null Hypothesis**: American and Asian recipes on food.com have the same mean saturated fat; any observed difference is due to chance.
- **Alternative Hypothesis**: American recipes have more saturated fat on average than Asian recipes.
- Our test statistic will be `Mean saturated fat in American recipes` - `Mean saturated fat in Asian recipes`

In [None]:
observed_stat = mean_satfat_america - mean_satfat_asia

num_simulations = 10000
shuffled_df = asia_america_recipes.copy()
simulated_stats = []

for i in range(num_simulations):
    shuffled_df['asian_or_american'] = np.random.permutation(shuffled_df['asian_or_american'])

    shuffled_satfat_america = shuffled_df[shuffled_df['asian_or_american']=='american']['saturated_fat'].mean()
    shuffled_satfat_asia = shuffled_df[shuffled_df['asian_or_american']=='asian']['saturated_fat'].mean()

    one_sim_stat = shuffled_satfat_america-shuffled_satfat_asia
    simulated_stats.append(one_sim_stat)

simulated_stats = np.array(simulated_stats)
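# The p-value is the proportion (not the count) of simulated statistics
# that are at least as large as the observed one.
p_value = np.mean(simulated_stats >= observed_stat)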
print('The p value is: ', p_value)

In [None]:
fig = px.histogram(simulated_stats)
fig.add_vline(x=observed_stat, line_width=2,  line_color="red")
fig.add_annotation(
    x=observed_stat,
    y=1,
    yref="paper",
    text="Observed statistic",
    showarrow=True,
    arrowhead=1
)
fig.show()

Our p-value is 0: none of the 10,000 shuffled statistics was at least as large as the observed one. We therefore reject the null hypothesis. The data suggest that American recipes on food.com have more saturated fat on average than Asian recipes.
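
A simulated p-value of exactly 0 really means "smaller than the simulation can resolve". The sketch below shows the plain proportion next to a common add-one variant that never returns 0. The function name and the toy numbers are hypothetical, and the add-one convention is an option, not what the notebook uses.

```python
import numpy as np

def empirical_p_value(simulated_stats, observed_stat, add_one=False):
    """Proportion of simulated statistics at least as large as the observed one."""
    simulated_stats = np.asarray(simulated_stats, dtype=float)
    n_extreme = int(np.count_nonzero(simulated_stats >= observed_stat))
    if add_one:
        # Counts the observed statistic as one more draw, so the result is never 0.
        return (n_extreme + 1) / (simulated_stats.size + 1)
    return n_extreme / simulated_stats.size

# Hypothetical simulated statistics, all far below the observed value.
sims_toy = np.array([-0.5, 0.1, 0.3, -0.2])
p_plain = empirical_p_value(sims_toy, 2.0)
p_add_one = empirical_p_value(sims_toy, 2.0, add_one=True)
print(p_plain, p_add_one)
```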

For fun, we'll plot the distribution of the saturated fat in Asian recipes vs the distribution of saturated fat in American recipes to make sure

In [None]:
fig = px.histogram(asia_america_recipes[asia_america_recipes['asian_or_american']=='asian']['saturated_fat'])
fig.data[0].name = 'Asia'
fig.add_trace(
    go.Histogram(
        x=asia_america_recipes[asia_america_recipes['asian_or_american']=='american']['saturated_fat'],
        opacity=0.7,
        name='America'
    )
)
fig.update_layout(barmode='overlay')  # overlay the two histograms instead of stacking them
fig.show()

## Step 5.1: Framing a Prediction Problem

Our original plan was to predict whether a recipe was American or not based on nutrition, n_ingredients, and n_steps. However, this proved to be uninteresting. While our model did reach an accuracy of 88.9%, our recall, precision, and F1 score were 0. After further investigation, we found that our model guessed 0 (not American) every time. Because most recipes are not American, always predicting "not American" maximizes accuracy, but with no positive predictions the F1 score is 0. We ultimately decided not to continue with this prediction problem, because a model that predicts the same thing every time isn't interesting, even if it is highly accurate.
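
The arithmetic behind this can be checked by hand. The sketch below uses hypothetical counts (111 American recipes out of 1,000, chosen only to mirror the 88.9% figure) and a small helper, `binary_metrics`, written for this illustration.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # With no positive predictions the ratios are undefined; report them as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical class balance: 111 American (1) and 889 non-American (0) recipes.
y_true_toy = [1] * 111 + [0] * 889
y_pred_toy = [0] * 1000          # a model that always predicts "not American"

accuracy, precision, recall, f1 = binary_metrics(y_true_toy, y_pred_toy)
print(accuracy, precision, recall, f1)
```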

## Step 6.1: Baseline Model

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error, mean_squared_error, r2_score, f1_score, precision_score, recall_score, accuracy_score
from sklearn.pipeline import Pipeline

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, FunctionTransformer

In [None]:
from sklearn.tree import DecisionTreeClassifier
recipes['is_american'] = recipes['tags'].apply(lambda x: 1 if 'american' in x else 0)
X_train, X_test, y_train, y_test = (
    train_test_split(recipes[["n_ingredients", "n_steps", "nutrition"]]
                     , recipes["is_american"],
                     random_state=12)
)

def extract_calories(nutrition_col):
    return (nutrition_col.apply(lambda x: x[0]))

#extracts calories from nutrition col
nutrition_transformer = Pipeline([
    ("extract", FunctionTransformer(lambda x: x.apply(extract_calories).values.reshape(-1, 1))),
])

preprocessor = ColumnTransformer([
    ("nutrition", nutrition_transformer, ["nutrition"])
    ],
remainder='passthrough')

pl = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", DecisionTreeClassifier(max_depth=15,random_state=12))
])

model = pl.fit(X_train, y_train)
print("Training Accuracy: ", model.score(X_train,y_train))
print("Test Accuracy: ", model.score(X_test,y_test))

In [None]:
# Define K-Fold Cross-Validation
kf = KFold(n_splits=5, shuffle=True)
X = recipes[["n_ingredients", "n_steps", "nutrition"]]
y = recipes["is_american"]

# Define scoring metrics
scoring = {
    "F1": make_scorer(f1_score),
    "Precision": make_scorer(precision_score),
    "Recall": make_scorer(recall_score),
    "Accuracy": make_scorer(accuracy_score)
}

# Perform cross-validation
scores = {}
for metric in scoring:
    score = cross_val_score(pl, X, y, cv=kf, scoring=scoring[metric])
    scores[metric] = score.mean()

# Print results
print(f"Mean F1: {scores['F1']:.2f}")
print(f"Mean Precision: {scores['Precision']:.2f}")
print(f"Mean Recall: {scores['Recall']:.2f}")
print(f"Mean Accuracy: {scores['Accuracy']:.2f}")

## Step 7.1: Final Model

In [None]:
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = (
    train_test_split(recipes[["n_ingredients", "n_steps", "nutrition", 'ingredients']], 
                     recipes["is_american"],
                    random_state=12)
)


def ingredient_onehot_encoder(X):
    target_ingredients = ['beef', 'pork', 'chicken', 'corn', 'potatoes', 'rice', 'bread', 'pasta',
                      'milk', 'cheese', 'butter', 'sugar', 'flour', 'tomatoes', 'squash']
    df_encoded = pd.DataFrame()
    
    for ingredient in target_ingredients:
        df_encoded[ingredient] = X['ingredients'].apply(lambda x: int(any(ingredient in item for item in x)))

    return df_encoded

def extract_nutrients(X):
    nutrition_features = ['calories', 'total_fat', 'sugar', 'sodium', 
                          'protein', 'saturated_fat', 'carbohydrates']
    df = pd.DataFrame()
    for i,nutrition in enumerate(nutrition_features):
        df[nutrition] = X['nutrition'].apply(lambda x: x[i])

    return df

preprocessor = ColumnTransformer([
    ('nutrition', FunctionTransformer(extract_nutrients, validate=False), ['nutrition']),
    ('onehot', FunctionTransformer(ingredient_onehot_encoder, validate=False), ['ingredients'])
    ],
remainder='passthrough')

pl = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", RandomForestClassifier(max_depth=10,random_state=12))
])

model = pl.fit(X_train, y_train)
print("Training Accuracy: ", model.score(X_train,y_train))
print("Test Accuracy: ", model.score(X_test,y_test))

In [None]:
model.get_params()

In [None]:
from sklearn.model_selection import GridSearchCV
hyperparameters = {
    'regressor__n_estimators':  [10, 50, 100],
    'regressor__max_depth': np.arange(2, 30, 10), 
    'regressor__criterion': ['gini', 'entropy']
}
grids = GridSearchCV(
    pl,
    n_jobs=-1, # Use multiple processors to parallelize
    param_grid=hyperparameters,
    return_train_score=True
)
grids.fit(X_train, y_train)

In [None]:
y_pred = grids.predict(X_test)

In [None]:
print("accuracy: ", grids.score(X_test, y_test))
# scikit-learn metrics take the true labels first, then the predictions
print("precision: ", precision_score(y_test, y_pred))
print("recall: ", recall_score(y_test, y_pred))
print("f1: ", f1_score(y_test, y_pred))

In [None]:
grids.best_params_

In [None]:
y_pred.sum()

While accuracy did go up by 0.9%, our precision, recall, and F1 score became 0. After further investigation, it seems it's because our model guesses 0 (not American) every time, shown by how the sum of y_pred is 0. This is due to how most of the data is not American, so it makes sense that our model would want to predict not American every time to maximize accuracy. While this does make the model more accurate, it makes our F1 score very low. We ultimately decided not to continue with this prediction problem because making a model that only predicts one thing every time isn't interesting at all, even if it is highly accurate.

## Step 5: Framing a Prediction Problem

One challenge we face as college students is trying to manage time. So we decided to build a model that could predict the total cooking time of whatever one might want to cook.

Our initial plan was to use a linear regression model, but the results weren't good as you will see later. Our final model will use a RandomForestRegressor. 

## Step 6: Baseline Model

For our baseline model, our features will be number of ingredients, number of steps, and calories per serving. First, we import the necessary libraries.

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import Pipeline

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, FunctionTransformer

The number of calories is still stored in the `nutrition` column, so we extract that information and assign it to a new column `calories`. We will also remove some outliers: recipes that take 3 hours or more to make, and recipes with 2000 calories or more, since 2,000 calories is a commonly cited daily intake for an adult.

In [None]:
recipes['calories'] = recipes['nutrition'].apply(lambda x: float(x[0]))
recipes_no_outliers = recipes[(recipes['minutes'] < 180) & (recipes['calories']<2000)]

Now let's do some scatterplots to get an idea of the fit of our model.

In [None]:
px.scatter(recipes_no_outliers, x='n_steps', y='minutes')

In [None]:
px.scatter(recipes_no_outliers, x='n_ingredients', y='minutes')

In [None]:
px.scatter(recipes_no_outliers, x='calories', y='minutes')

It turns out the data has no clear pattern, so a linear regression probably won't do well. We'll try it out anyway.

In [None]:
X_train, X_test, y_train, y_test = (
    train_test_split(recipes_no_outliers[["n_ingredients", "n_steps", "nutrition"]], recipes_no_outliers["minutes"], random_state=1)
)


def extract_calories(nutrition_col):
    return (nutrition_col.apply(lambda x: x[0]))

#extracts calories from nutrition col
nutrition_transformer = Pipeline([
    ("extract", FunctionTransformer(lambda x: x.apply(extract_calories).values.reshape(-1, 1))),
])

preprocessor = ColumnTransformer([
    ("nutrition", nutrition_transformer, ["nutrition"])
    ],
remainder='passthrough')

pl = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", LinearRegression())
])

model = pl.fit(X_train, y_train)
print("Training R^2: ", model.score(X_train,y_train))
print("Test R^2: ", model.score(X_test,y_test))

As you can see, it performs pretty badly. That's not surprising considering how the scatterplots looked. So we choose to use a decision tree instead. We will set `max_depth=10` to limit overfitting and `random_state=12` for reproducibility.

In [None]:
from sklearn.tree import DecisionTreeRegressor
X_train, X_test, y_train, y_test = (
    train_test_split(recipes_no_outliers[["n_ingredients", "n_steps", "nutrition"]], 
                     recipes_no_outliers["minutes"], random_state=12)
)


def extract_calories(nutrition_col):
    return (nutrition_col.apply(lambda x: x[0]))

#extracts calories from nutrition col
nutrition_transformer = Pipeline([
    ("extract", FunctionTransformer(lambda x: x.apply(extract_calories).values.reshape(-1, 1))),
])

preprocessor = ColumnTransformer([
    ("nutrition", nutrition_transformer, ["nutrition"])
    ],
remainder='passthrough')

pl = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", DecisionTreeRegressor(max_depth=10,random_state=12))
])

model = pl.fit(X_train, y_train)
print("Training R^2: ", model.score(X_train,y_train))
print("Test R^2: ", model.score(X_test,y_test))

Let's evaluate our model using K-Fold Cross Validation

In [None]:
# Define K-Fold Cross-Validation
kf = KFold(n_splits=5, shuffle=True)
X = recipes_no_outliers[["n_ingredients", "n_steps", "nutrition"]]
y = recipes_no_outliers["minutes"]

# Define scoring metrics
scoring = {
    "MAE": make_scorer(mean_absolute_error),
    "MSE": make_scorer(mean_squared_error),
    "R2": make_scorer(r2_score)
}

# Perform cross-validation
scores = {}
for metric in scoring:
    score = cross_val_score(pl, X, y, cv=kf, scoring=scoring[metric])
    scores[metric] = score.mean()

# Compute RMSE separately since it's the square root of MSE
rmse_scores = np.sqrt(-cross_val_score(pl, X, y, cv=kf, scoring="neg_mean_squared_error"))

# Print results
print(f"Mean MAE: {scores['MAE']:.2f}")
print(f"Mean MSE: {scores['MSE']:.2f}")
print(f"Mean RMSE: {rmse_scores.mean():.2f}")
print(f"Mean R² Score: {scores['R2']:.2f}")
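
Because `KFold(shuffle=True)` has no `random_state` here, each `cross_val_score` call reshuffles the data, so the metrics above are computed on different splits. As a hedged sketch of one alternative, `cross_validate` can score several metrics on the same folds in a single pass. The synthetic data from `make_regression` and the `_toy` names are stand-ins, not the recipe data or the notebook's pipeline.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the recipe features.
X_toy, y_toy = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=12)

# Fixing random_state makes the folds reproducible.
kf_toy = KFold(n_splits=5, shuffle=True, random_state=12)

# Built-in scorer names; error metrics are negated so that higher is better.
scoring_toy = {
    "MAE": "neg_mean_absolute_error",
    "RMSE": "neg_root_mean_squared_error",
    "R2": "r2",
}

cv_results = cross_validate(
    DecisionTreeRegressor(max_depth=10, random_state=12),
    X_toy, y_toy, cv=kf_toy, scoring=scoring_toy,
)

mean_mae = -cv_results["test_MAE"].mean()
mean_rmse = -cv_results["test_RMSE"].mean()
mean_r2 = cv_results["test_R2"].mean()
print(f"MAE={mean_mae:.2f} RMSE={mean_rmse:.2f} R2={mean_r2:.2f}")
```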

Still not very good. To improve our final model, we'll use a random forest to reduce overfitting and GridSearchCV to tune our hyperparameters. We'll also include more features.

## Step 7: Final Model

We'll use a random forest and GridSearchCV to fine-tune our model.  
  
Furthermore, we're going to one-hot encode a list of ingredients. There are too many unique ingredients in this dataset to feasibly one-hot encode them all, so we'll focus on a few common ones. That is, we're going to engineer features for whether a recipe contains these ingredients: ['beef', 'pork', 'chicken', 'corn', 'potatoes', 'rice', 'bread', 'pasta', 'milk', 'cheese', 'butter', 'sugar', 'flour', 'tomatoes', 'squash']. Since there can be many types of the same ingredient (e.g. sweet corn vs normal corn, unsalted butter vs salted butter), any occurrence of the word within an entry of the ingredients column counts as the ingredient being present. For example, if a recipe has 'sweet corn' as an ingredient, we consider it to contain corn.
  
And also, we're going to feature engineer more nutrition columns. That is, instead of just using calories, we'll also use 'total_fat', 'sugar', 'sodium', 'protein', 'saturated_fat', and 'carbohydrates' in our model
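
The substring rule described above can be sketched on plain lists. The helper name `encode_ingredients` and the three-ingredient subset are hypothetical. One thing to keep in mind is that substring matching also fires on unrelated words that happen to contain the target, as the second call shows.

```python
# A hypothetical subset of the target ingredients listed above.
TARGETS_TOY = ["corn", "butter", "rice"]

def encode_ingredients(ingredients, targets=TARGETS_TOY):
    """Map each target to 1 if it appears inside any ingredient string, else 0."""
    return {
        target: int(any(target in item for item in ingredients))
        for target in targets
    }

# 'sweet corn' counts as corn, 'unsalted butter' as butter.
row_expected = encode_ingredients(["sweet corn", "unsalted butter", "salt"])

# Caveat: 'cornstarch' contains 'corn' and 'licorice' contains 'rice'.
row_caveat = encode_ingredients(["cornstarch", "licorice"])

print(row_expected)
print(row_caveat)
```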

In [None]:
from sklearn.ensemble import RandomForestRegressor

X_train, X_test, y_train, y_test = (
    train_test_split(recipes_no_outliers[["n_ingredients", "n_steps", "nutrition", 'ingredients']], 
                     recipes_no_outliers["minutes"], 
                     random_state=12)
)

target_ingredients = ['beef', 'pork', 'chicken', 'corn', 'potatoes', 'rice', 'bread', 'pasta',
                      'milk', 'cheese', 'butter', 'sugar', 'flour', 'tomatoes', 'squash']

def ingredient_onehot_encoder(X):
    df_encoded = pd.DataFrame()
    
    for ingredient in target_ingredients:
        df_encoded[ingredient] = X['ingredients'].apply(lambda x: int(any(ingredient in item for item in x)))

    return df_encoded

def extract_nutrients(X):
    nutrition_features = ['calories', 'total_fat', 'sugar', 'sodium', 
                          'protein', 'saturated_fat', 'carbohydrates']
    df = pd.DataFrame()
    for i,nutrition in enumerate(nutrition_features):
        df[nutrition] = X['nutrition'].apply(lambda x: x[i])

    return df

preprocessor = ColumnTransformer([
    ('nutrition', FunctionTransformer(extract_nutrients, validate=False), ['nutrition']),
    ('onehot', FunctionTransformer(ingredient_onehot_encoder, validate=False), ['ingredients'])
    ],
remainder='passthrough')

pl = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", RandomForestRegressor(max_depth=10,random_state=12))
])

model = pl.fit(X_train, y_train)
print("Training R^2: ", model.score(X_train,y_train))
print("Test R^2: ", model.score(X_test,y_test))

In [None]:
from sklearn.model_selection import GridSearchCV
hyperparameters = {
    'regressor__n_estimators':  [10, 50, 100],
    'regressor__max_depth': np.arange(2, 30, 10), 
    'regressor__criterion': ['squared_error', 'friedman_mse', 'poisson']
}
grids = GridSearchCV(
    pl,
    n_jobs=-1, # Use multiple processors to parallelize
    param_grid=hyperparameters,
    return_train_score=True
)
grids.fit(X_train, y_train)
print("R^2: ", grids.score(X_test, y_test))
print("RMSE: ", np.sqrt(mean_squared_error(y_pred=grids.predict(X_test), y_true=y_test)))
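
To get a feel for how much work this grid search does, the candidate combinations can be counted by hand. The sketch below copies the grid under the hypothetical name `grid_sketch` and assumes GridSearchCV's default of 5-fold cross-validation, since no `cv` argument is passed above.

```python
from itertools import product

import numpy as np

# Same values as the grid above, under a separate (hypothetical) name.
grid_sketch = {
    "regressor__n_estimators": [10, 50, 100],
    "regressor__max_depth": np.arange(2, 30, 10),
    "regressor__criterion": ["squared_error", "friedman_mse", "poisson"],
}

# np.arange(2, 30, 10) yields only three depths.
depths = [int(d) for d in grid_sketch["regressor__max_depth"]]

# Every combination of the three lists is one candidate model.
candidates = list(product(*grid_sketch.values()))

# Assumption: default 5-fold CV, so each candidate is fit 5 times.
n_fits = len(candidates) * 5
print(depths, len(candidates), n_fits)
```

The `regressor__` prefix refers to the pipeline step named `"regressor"`, which is how the search reaches the forest's parameters inside the pipeline.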

In [None]:
grids.best_params_

Not bad, but at this point we suspect that the fit is being held back by poor data quality. Some people on food.com could just upload random recipes, with random n_steps, random minutes, random ingredients, etc., and there's no good quality check. So we'll try to filter the recipes down to "good quality" data points. We decide a data point is of "good quality" if it has an average rating >= 4.5. From the EDA, we saw that a lot of recipes have an average rating above 4, so this shouldn't hurt our sample size too much. The pipeline will still be the same.

In [None]:
good_recipes = recipes_no_outliers[recipes_no_outliers["average_rating"] >= 4.5]
good_recipes.shape #still 55818 rows!

In [None]:
X_train_good, X_test_good, y_train_good, y_test_good = (
    train_test_split(good_recipes[["n_ingredients", "n_steps", "nutrition", 'ingredients']], 
                     good_recipes["minutes"], 
                     random_state=12)
)

model_good = pl.fit(X_train_good, y_train_good)
print("Training R^2: ", model_good.score(X_train_good,y_train_good))
print("Test R^2: ", model_good.score(X_test_good,y_test_good))

In [None]:
grids_good = GridSearchCV(
    pl,
    n_jobs=-1, # Use multiple processors to parallelize
    param_grid=hyperparameters,
    return_train_score=True
)
grids_good.fit(X_train_good, y_train_good)
# Score the search that was fit on the filtered data (grids_good, not grids)
print("r2: ", grids_good.score(X_test_good, y_test_good))
print("RMSE: ", np.sqrt(mean_squared_error(y_pred=grids_good.predict(X_test_good), y_true=y_test_good)))

In [None]:
grids_good.best_params_

## Step 8: Fairness Analysis

In [None]:
# TODO