# Project 4 Checkpoint 1

**Name(s)**: (your name(s) here)

**Website Link**: (your website link)

In [2]:
import pandas as pd
import numpy as np
from pathlib import Path
import plotly.express as px
pd.options.plotting.backend = 'plotly'
from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler
from dsc80_utils import * # Feel free to uncomment and use this.
from collections import defaultdict, Counter
import random
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from itertools import chain
import ast
import re
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

## Step 1: Introduction


 I'm interested in see how healthy certain meals can be with the information in the dataset. I was also intrigued about how big the dataset was. It make me think that I can build automatic recipe generator, with certain inputs (ingredients). Suppose you open your refrigerator and see ingredients but you don't know what to make. With train model, you can come up with a recipe that you can cook up, depending on the ingredients you have.

## Step 2: Data Cleaning and Exploratory Data Analysis

In [3]:
raw_interactions = pd.read_csv("Raw_interactions.csv")
raw_recipes = pd.read_csv("RAW_recipes.csv")

In [4]:
RR = pd.merge(raw_recipes, raw_interactions, left_on = 'id', right_on = 'recipe_id', how = 'left')

In [5]:
RR.columns

Index(['Unnamed: 0', 'name', 'id', 'minutes', 'contributor_id', 'submitted',
       'tags', 'nutrition', 'n_steps', 'steps', 'description', 'ingredients',
       'n_ingredients', 'user_id', 'recipe_id', 'date', 'rating', 'review'],
      dtype='object')

In [6]:
RR['rating'] = RR['rating'].replace(0, np.nan)

0 rating may mean missing data, instead of users leaving 0 rating. If we leave it as 0, it could skew statistical analyses (e.g. average). If we have it as np.nan, it will be excluded from the calculation. Moreover, using np.nan rather than 0 can make further data manipulation easier such as data visualization and imputing missing values. 

In [7]:
average_ratings = RR.groupby('name')['rating'].mean()

In [8]:
average_ratings

name
0 carb   0 cal gummy worms              4.75
0 point ice cream  only 1 ingredient    5.00
0 point soup   ww                       4.78
                                        ... 
zydeco soup                             5.00
zydeco spice mix                        5.00
zydeco ya ya deviled eggs               5.00
Name: rating, Length: 83628, dtype: float64

In [9]:
recipes = raw_recipes.merge(average_ratings.rename('average_rating'), on='name', how='left')

In [10]:
recipes

Unnamed: 0.1,Unnamed: 0,name,id,minutes,...,description,ingredients,n_ingredients,average_rating
0,111,1 brownies in the world best ever,333281,40,...,"these are the most; chocolatey, moist, rich, d...","['bittersweet chocolate', 'unsalted butter', '...",9,4.0
1,115,1 in canada chocolate chip cookies,453467,45,...,this is the recipe that we use at my school ca...,"['white sugar', 'brown sugar', 'salt', 'margar...",11,5.0
2,118,412 broccoli casserole,306168,40,...,since there are already 411 recipes for brocco...,"['frozen broccoli cuts', 'cream of chicken sou...",9,5.0
...,...,...,...,...,...,...,...,...,...
83779,231634,zydeco ya ya deviled eggs,308080,40,...,"deviled eggs, cajun-style","['hard-cooked eggs', 'mayonnaise', 'dijon must...",8,5.0
83780,231635,cookies by design cookies on a stick,298512,29,...,"i've heard of the 'cookies by design' company,...","['butter', 'eagle brand condensed milk', 'ligh...",10,1.0
83781,231636,cookies by design sugar shortbread cookies,298509,20,...,"i've heard of the 'cookies by design' company,...","['granulated sugar', 'shortening', 'eggs', 'fl...",7,3.0


In [11]:
recipes.columns

Index(['Unnamed: 0', 'name', 'id', 'minutes', 'contributor_id', 'submitted',
       'tags', 'nutrition', 'n_steps', 'steps', 'description', 'ingredients',
       'n_ingredients', 'average_rating'],
      dtype='object')

In [12]:
recipes['ingredients'][1]

"['white sugar', 'brown sugar', 'salt', 'margarine', 'eggs', 'vanilla', 'water', 'all-purpose flour', 'whole wheat flour', 'baking soda', 'chocolate chips']"

### Univaraiate Analysis

In [13]:
recipes.columns

Index(['Unnamed: 0', 'name', 'id', 'minutes', 'contributor_id', 'submitted',
       'tags', 'nutrition', 'n_steps', 'steps', 'description', 'ingredients',
       'n_ingredients', 'average_rating'],
      dtype='object')

In [14]:
fig1 = px.histogram(recipes, x='minutes', nbins=20, title='Distribution of minutes', 
                    labels={'minutes': 'Minutes'})

# Update layout to set x-axis maximum value
fig1.update_layout(
    xaxis_title='Minutes',
    yaxis_title='Count',
    xaxis=dict(range=[0, 100000])  # Sets x-axis range from 0 to 100,000
)

fig1.show()

In [15]:
fig2 = px.histogram(recipes, x='average_rating', nbins=20, title='Distribution of Average Ratings', 
                    labels={'average_rating': 'Average Rating'})
fig2.update_layout(xaxis_title='Average Rating', yaxis_title='Count')
fig2.show()

The distribution of average rating column is skwered to the left. Most of the recipes have 5 star rating.

### Bivariate Analysis

In [14]:
filtered_recipes = recipes[(raw_recipes['minutes'] > 0) & (raw_recipes['minutes'] <= 100000)]

In [17]:
fig4 = px.scatter(
    filtered_recipes,
    x='average_rating',
    y='minutes',
    title='Scatter Plot of Average Rating vs Cooking Time (Limited to ≤100k Minutes)',
    labels={'average_rating': 'Average Rating', 'minutes': 'Cooking Time (Minutes)'}
)
fig4.show()

In [18]:
fig5 = px.scatter(
    recipes,
    x='average_rating',
    y='n_ingredients',
    title='Scatter Plot of Average Rating vs Cooking Time (Limited to ≤100k Minutes)',
    labels={'average_rating': 'Average Rating', 'minutes': 'Cooking Time (Minutes)'}
)
fig5.show()

In [19]:
recipes.columns

Index(['Unnamed: 0', 'name', 'id', 'minutes', 'contributor_id', 'submitted',
       'tags', 'nutrition', 'n_steps', 'steps', 'description', 'ingredients',
       'n_ingredients', 'average_rating'],
      dtype='object')

In [20]:
fig6 = px.scatter(
    recipes,
    x='n_ingredients',
    y='n_steps',
    title='Scatter Plot of Average Rating vs Cooking Time (Limited to ≤100k Minutes)',
    labels={'average_rating': 'Average Rating', 'minutes': 'Cooking Time (Minutes)'}
)
fig6.show()

In [21]:
fig6 = px.scatter(
    recipes,
    x='n_steps',
    y='average_rating',
    title='Scatter Plot of Average Rating vs Cooking Time (Limited to ≤100k Minutes)',
    labels={'average_rating': 'Average Rating', 'minutes': 'Cooking Time (Minutes)'}
)
fig6.show()

## Step 3: Assessment of Missingness

In [22]:
grouped_by_steps = filtered_recipes.groupby('n_steps').agg({
    'average_rating': ['mean', 'median', 'count'],
    'minutes': ['mean', 'median', 'max']
}).reset_index()

print("Grouped by Number of Steps:")
print(grouped_by_steps)

# Pivot table: Analyze average rating by number of steps and ingredients
pivot_table = filtered_recipes.pivot_table(
    index='n_steps',
    columns='n_ingredients',
    values='average_rating',
    aggfunc='mean'
)

Grouped by Number of Steps:
   n_steps average_rating               minutes               
                     mean median count     mean  median    max
0        1           4.65    5.0  1083    31.94     5.0   7440
1        2           4.67    5.0  2578    36.95     7.0   4335
2        3           4.66    5.0  3955    50.38    10.0  10100
..     ...            ...    ...   ...      ...     ...    ...
81      93           5.00    5.0     1   360.00   360.0    360
82      98           5.00    5.0     1  2930.00  2930.0   2930
83     100           5.00    5.0     1  1680.00  1680.0   1680

[84 rows x 7 columns]


In [23]:
column_with_most_nan = recipes.isnull().sum().idxmax()
print(column_with_most_nan)

average_rating


In [24]:
recipes.isnull().sum().sort_values(ascending=False).head(5)

average_rating    2597
description         70
name                 1
Unnamed: 0           0
id                   0
dtype: int64

In [25]:
raw_recipes[raw_recipes['description'].isnull()]

Unnamed: 0.1,Unnamed: 0,name,id,minutes,...,steps,description,ingredients,n_ingredients
1486,3674,almond cookie bites,401761,16,...,"['preheat oven to 350 degrees f', 'in medium b...",,"['all-purpose flour', ""fisher chef's naturals ...",9
3087,8317,apricot gorgonzola crescent appetizers,332410,40,...,['heat oven to 350f spray large cookie sheet w...,,['pillsbury refrigerated crescent dinner rolls...,6
3685,9827,asparagus milanese,382664,15,...,"['snap off the tough ends of the asparagus', '...",,"['asparagus', 'parmigiano-reggiano cheese', 'b...",5
...,...,...,...,...,...,...,...,...,...
81188,224680,wasatch mountain chili,290480,50,...,['in a large saucepan over medium heat cook on...,,"['onion', 'olive oil', 'hominy', 'great northe...",14
81701,225923,white bean chicken chili giada de laurentiis,430591,75,...,['in a large heavy-bottomed saucepan or dutch ...,,"['olive oil', 'onion', 'garlic cloves', 'groun...",18
83070,229667,yukon gold potatoes jacques pepin style,387006,20,...,['place the potatoes in a deep skillet and add...,,"['yukon gold potatoes', 'salt', 'fresh ground ...",6


In [15]:
# Create a binary indicator for 'description' column's missingness
recipes['description_missing'] = recipes['description'].isnull().astype(int)

# Perform permutation test for 'description' column's missingness
def permutation_test(data, col_to_test):
    observed_diff = data[data['description_missing'] == 1][col_to_test].mean() - \
                    data[data['description_missing'] == 0][col_to_test].mean()
    
    combined = data[col_to_test].dropna().values
    n_missing = data['description_missing'].sum()
    
    perm_diffs = []
    for _ in range(1000):  # Number of permutations
        permuted = np.random.permutation(combined)
        perm_missing = permuted[:n_missing]
        perm_non_missing = permuted[n_missing:]
        perm_diffs.append(perm_missing.mean() - perm_non_missing.mean())
    
    p_value = (np.abs(perm_diffs) >= np.abs(observed_diff)).mean()
    return observed_diff, p_value

# Test dependency of 'description' column's missingness on other columns
results_description = {}
for col in ['minutes', 'n_ingredients', 'n_steps']:
    diff, p_val = permutation_test(recipes.dropna(subset=[col]), col)
    results_description[col] = {'observed_diff': diff, 'p_value': p_val}

# Display results
print("Permutation Test Results for 'description' Missingness:")
for col, res in results_description.items():
    print(f"{col}: Observed Difference = {res['observed_diff']:.4f}, P-Value = {res['p_value']:.4f}")


Permutation Test Results for 'description' Missingness:
minutes: Observed Difference = -43.8247, P-Value = 0.5780
n_ingredients: Observed Difference = -1.4724, P-Value = 0.0020
n_steps: Observed Difference = 0.9954, P-Value = 0.2000


## Step 4: Hypothesis Testing

### Research Question:
What types of recipes tend to be healthier?

### Keyword Definition in Reseach Question
Below is the defintions of the keywords 'types' and 'healthier' in the research question in terms of quantifiable, representable metrics in the dataset. \
\
'Types': Each recipe has a column called 'tags' which consists a list of tags realted to the recipe. The 'tags' will be the 'type' of recipes. Each recipes can have multiple types. \
\
'Healthier': Determine whether a recipe is healthy or not by looking at the column 'nutrition' which consist nutrition information in the form [calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV)]. Specfically, consider protein (PDV) and sugar (PDV). If protein (PDV) is greater or equal to 20, sugar (PDV) is lower than 5, and carbohydrate (PDV) is less than 26. 

### Null Hypothesis
The types of recips does not affect its healthiness. Protein, sugar, and carbohydrate percent daily values are independent of recipe types.

### Alternative Hypothesis
The type of recipe affects its healtiness. Protein, sugar and carbohydrate percent daily values depend on recipe types.

### Data Preprocessing

In [16]:
recipes

Unnamed: 0.1,Unnamed: 0,name,id,minutes,...,ingredients,n_ingredients,average_rating,description_missing
0,111,1 brownies in the world best ever,333281,40,...,"['bittersweet chocolate', 'unsalted butter', '...",9,4.0,0
1,115,1 in canada chocolate chip cookies,453467,45,...,"['white sugar', 'brown sugar', 'salt', 'margar...",11,5.0,0
2,118,412 broccoli casserole,306168,40,...,"['frozen broccoli cuts', 'cream of chicken sou...",9,5.0,0
...,...,...,...,...,...,...,...,...,...
83779,231634,zydeco ya ya deviled eggs,308080,40,...,"['hard-cooked eggs', 'mayonnaise', 'dijon must...",8,5.0,0
83780,231635,cookies by design cookies on a stick,298512,29,...,"['butter', 'eagle brand condensed milk', 'ligh...",10,1.0,0
83781,231636,cookies by design sugar shortbread cookies,298509,20,...,"['granulated sugar', 'shortening', 'eggs', 'fl...",7,3.0,0


In [17]:
recipes.columns

Index(['Unnamed: 0', 'name', 'id', 'minutes', 'contributor_id', 'submitted',
       'tags', 'nutrition', 'n_steps', 'steps', 'description', 'ingredients',
       'n_ingredients', 'average_rating', 'description_missing'],
      dtype='object')

In [18]:
recipes[['nutrition', 'tags']].dtypes

nutrition    object
tags         object
dtype: object

Currently, the column 'nutrition' and 'tags' are both in string. However, they both have brackets [] inside, so changing to approriate type of list is necessary before the hypothesis test.

In [19]:
recipes['nutrition'] = recipes['nutrition'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
recipes['tags'] = recipes['tags'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

In [20]:
recipes['nutrition'][1]

[595.1, 46.0, 211.0, 22.0, 13.0, 51.0, 26.0]

In [21]:
recipes['tags'][1]

['60-minutes-or-less',
 'time-to-make',
 'cuisine',
 'preparation',
 'north-american',
 'for-large-groups',
 'canadian',
 'british-columbian',
 'number-of-servings']

Now, we will divide each nutrition facts to each columns.

In [22]:
recipes[['calories', 'total_fat_PDV', 'sugar_PDV', 'sodium_PDV', 'protein_PDV', 'saturated_fat_PDV', 'carbohydrates_PDV']] = pd.DataFrame(recipes['nutrition'].tolist(), index=recipes.index)

In [23]:
recipes['protein_PDV']

0         3.0
1        13.0
2        22.0
         ... 
83779     6.0
83780     7.0
83781     4.0
Name: protein_PDV, Length: 83782, dtype: float64

Then, we want to know all the unique tags that are in the dataset how many are there.

In [24]:
unique_tags = set(chain.from_iterable(recipes['tags']))
unique_tags = sorted(unique_tags)
print(unique_tags)

['', '1-day-or-more', '15-minutes-or-less', '3-steps-or-less', '30-minutes-or-less', '4-hours-or-less', '5-ingredients-or-less', '60-minutes-or-less', 'Throw the ultimate fiesta with this sopaipillas recipe from Food.com.', 'a1-sauce', 'african', 'american', 'amish-mennonite', 'angolan', 'appetizers', 'apples', 'april-fools-day', 'argentine', 'artichoke', 'asian', 'asparagus', 'australian', 'austrian', 'avocado', 'bacon', 'baja', 'baked-beans', 'baking', 'bananas', 'bar-cookies', 'barbecue', 'bass', 'bean-soup', 'beans', 'beans-side-dishes', 'bear', 'beef', 'beef-barley-soup', 'beef-crock-pot', 'beef-kidney', 'beef-liver', 'beef-organ-meats', 'beef-ribs', 'beef-sauces', 'beef-sausage', 'beginner-cook', 'beijing', 'belgian', 'berries', 'beverages', 'birthday', 'biscotti', 'bisques-cream-soups', 'black-bean-soup', 'black-beans', 'blueberries', 'bok-choys', 'brazilian', 'bread-machine', 'bread-pudding', 'breads', 'breakfast', 'breakfast-casseroles', 'breakfast-eggs', 'breakfast-potatoes',

Now, we will determine whether each recipes are helthy or not, based on the definiton of 'healthy' defined above.

In [25]:
recipes['is_healthy'] = ((recipes['protein_PDV'] >= 20) & (recipes['sugar_PDV'] < 5) & (recipes['carbohydrates_PDV'] <= 26)).astype(int)

In [26]:
recipes['is_healthy'].value_counts()

is_healthy
0    78869
1     4913
Name: count, dtype: int64

We can see that there's 4913 heathy recipes!

### Determining the Type of Test 
Since we are dealing with discrete, unpaired, categorical data type 'tags', we will use chi-square test. 

In [27]:
len(unique_tags)

549

In [28]:
recipes_tags = recipes

There's 549 unique tags in the dataset, as seen above. Perforiming indiviual hypothesis test for each tage will not only take long and inefficient, it will increas the risk of Type 1 error. 

### Analyzing each Tag

In [29]:
# Step 1: Create binary columns for all tags (vectorized)
binary_tags = pd.DataFrame(
    {tag: recipes['tags'].apply(lambda x: int(tag in x)) for tag in unique_tags}
)

# Step 2: Add binary columns to the original DataFrame
recipes_tags = pd.concat([recipes, binary_tags], axis=1)

# Step 3: Calculate counts for contingency tables
healthy_counts = recipes_tags.groupby('is_healthy')[unique_tags].sum()
total_counts = recipes_tags[unique_tags].sum()

# Ensure all counts are non-negative
A = healthy_counts.loc[1]  # Healthy recipes with the tag
C = healthy_counts.loc[0]  # Healthy recipes without the tag
B = total_counts - A       # Unhealthy recipes with the tag
D = len(recipes) - total_counts - C  # Unhealthy recipes without the tag

# Clip values to ensure non-negativity (optional but safe)
A = A.clip(lower=0)
B = B.clip(lower=0)
C = C.clip(lower=0)
D = D.clip(lower=0)

# Create contingency tables for all tags
d = {'A': A, 'B': B, 'C': C, 'D': D}
contingency_tables = pd.DataFrame(data = d)


In [30]:
recipes

Unnamed: 0.1,Unnamed: 0,name,id,minutes,...,protein_PDV,saturated_fat_PDV,carbohydrates_PDV,is_healthy
0,111,1 brownies in the world best ever,333281,40,...,3.0,19.0,6.0,0
1,115,1 in canada chocolate chip cookies,453467,45,...,13.0,51.0,26.0,0
2,118,412 broccoli casserole,306168,40,...,22.0,36.0,3.0,0
...,...,...,...,...,...,...,...,...,...
83779,231634,zydeco ya ya deviled eggs,308080,40,...,6.0,5.0,0.0,0
83780,231635,cookies by design cookies on a stick,298512,29,...,7.0,21.0,9.0,0
83781,231636,cookies by design sugar shortbread cookies,298509,20,...,4.0,11.0,6.0,0


In [31]:
recipes_tags

Unnamed: 0.1,Unnamed: 0,name,id,minutes,...,winter,yams-sweet-potatoes,yeast,zucchini
0,111,1 brownies in the world best ever,333281,40,...,0,0,0,0
1,115,1 in canada chocolate chip cookies,453467,45,...,0,0,0,0
2,118,412 broccoli casserole,306168,40,...,0,0,0,0
...,...,...,...,...,...,...,...,...,...
83779,231634,zydeco ya ya deviled eggs,308080,40,...,0,0,0,0
83780,231635,cookies by design cookies on a stick,298512,29,...,0,0,0,0
83781,231636,cookies by design sugar shortbread cookies,298509,20,...,0,0,0,0


In [32]:
# Perform chi-square tests for all tags
def calculate_p_value(row):
    observed = [[row['A'], row['B']], [row['C'], row['D']]]
    if any(val < 0 for val in row):  # Skip invalid rows
        return None
    _, p_value, _, _ = chi2_contingency(observed)
    return p_value

chi2_results = contingency_tables.apply(calculate_p_value, axis=1)

# Prepare results as a DataFrame (drop tags with invalid results)
results = pd.DataFrame({'tag': unique_tags, 'p_value': chi2_results}).dropna()

In [33]:
results_df = pd.DataFrame(results)
results_df['adjusted_p_value'] = multipletests(results_df['p_value'], method='bonferroni')[1]

In [34]:
significant_tags = results_df[(results_df['adjusted_p_value'] < 0.05) & (results_df['adjusted_p_value'] != 0)]
print(significant_tags)

                                         tag    p_value  adjusted_p_value
                                               3.22e-22          1.77e-19
1-day-or-more                  1-day-or-more   2.78e-62          1.53e-59
5-ingredients-or-less  5-ingredients-or-less  1.03e-178         5.63e-176
...                                      ...        ...               ...
weeknight                          weeknight   2.22e-32          1.22e-29
white-rice                        white-rice  9.90e-119         5.44e-116
whole-duck                        whole-duck   8.13e-22          4.46e-19

[256 rows x 3 columns]


In [35]:
sorted_significant_tags = significant_tags.sort_values(by='adjusted_p_value', ascending=False)
print(sorted_significant_tags)


                   tag    p_value  adjusted_p_value
gelatin        gelatin   7.95e-05          4.36e-02
czech            czech   5.93e-05          3.25e-02
stews            stews   4.47e-05          2.46e-02
...                ...        ...               ...
deer              deer  4.80e-283         2.64e-280
beef-liver  beef-liver  4.44e-286         2.44e-283
pickeral      pickeral  1.85e-286         1.02e-283

[256 rows x 3 columns]


In [36]:
healthy_tags = sorted_significant_tags['tag'].tolist()
print(healthy_tags)

['gelatin', 'czech', 'stews', 'polish', 'corn', 'quick-breads', 'veggie-burgers', 'presentation', 'granola-and-porridge', 'tomatoes', 'thanksgiving', 'celebrity', 'dehydrator', 'brazilian', 'high-fiber', 'chowders', 'food-processor-blender', 'brunch', 'icelandic', 'tropical-fruit', 'reynolds-wrap', 'lentils', 'saudi-arabian', 'jewish-ashkenazi', 'pancakes-and-waffles', 'asparagus', 'berries', 'cocktails', 'amish-mennonite', 'brewing', 'marinades-and-rubs', 'comfort-food', 'oaxacan', 'chili', 'greens', 'irish', 'ecuadorean', 'birthday', 'curries', 'russian', 'costa-rican', 'guatemalan', 'oamc-freezer-make-ahead', 'broccoli', 'spaghetti', 'indian', 'lasagna', 'egg-free', 'savory-sauces', 'stove-top', 'pakistani', 'oven', 'mexican', 'baja', 'thai', 'freezer', 'gluten-free', 'tempeh', 'pressure-canning', 'vietnamese', 'ontario', 'spring', 'condiments-etc', 'chard', 'soups-stews', 'novelty', 'szechuan', 'creole', 'brown-rice', 'medium-grain-rice', 'peruvian', 'pennsylvania-dutch', 'colombia

In [37]:
recipes['only_healthy_tags'] = recipes['tags'].apply(lambda x: all(tag in healthy_tags for tag in x))

# Get recipes with only healthy tags
recipes_with_only_healthy_tags = recipes[recipes['only_healthy_tags']]

# Display the filtered recipes
recipes_with_only_healthy_tags


Unnamed: 0.1,Unnamed: 0,name,id,minutes,...,saturated_fat_PDV,carbohydrates_PDV,is_healthy,only_healthy_tags
473,1257,3 ingredient moroccan dry rub,505748,3,...,0.0,1.0,0,True
550,1450,4 layer pizza dip,506106,40,...,20.0,1.0,0,True
601,1590,5 minute salad,506141,5,...,27.0,8.0,0,True
...,...,...,...,...,...,...,...,...,...
81658,225804,whiskey marinade,506238,245,...,0.0,5.0,0,True
82435,227866,winter wonderland martini,506222,5,...,28.0,3.0,0,True
83543,231003,zucchini baked in sour cream,505947,30,...,42.0,2.0,0,True


In [38]:
recipes.size

2010768

### Conclusion 
Out of 549 hypothesis test, we were able to reject the null hypothesis for 256 tags which are stored in the variable 'healthy tags'.\
\
In other words out of 549 unique recipe tags, the hypothesis test identified 256 tags that are statistically significant in their association with being healthy. These 256 “healthy tags” were then used to filter recipes, resulting in 102 recipes that exclusively consist of these healthy tags. This represents a small subset of over 2 million recipes analyzed.

## Step 5: Framing a Prediction Problem

The goal of my prediction problem is to predict likely ingredients given a set of tags. 

In [39]:
recipes['ingredients'] = recipes['ingredients'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
unique_ingrdients = sorted(set(chain.from_iterable(recipes['ingredients'])))
unique_ingrdients

['1% fat buttermilk',
 '1% fat cottage cheese',
 '1% low-fat chocolate milk',
 '1% low-fat milk',
 '10 bean soup mix',
 '10% cream',
 '100% bran',
 '12-inch flour tortillas',
 '15 bean mix',
 '15 bean soup mix',
 '15% cream',
 '18% table cream',
 '2% buttermilk',
 '2% cheddar cheese',
 '2% evaporated milk',
 '2% fat cottage cheese',
 '2% large-curd cottage cheese',
 '2% low-fat milk',
 '2% mexican cheese blend',
 '2% milk',
 '2% mozzarella cheese',
 '3 bean mix',
 '35% cream',
 '4% fat cottage cheese',
 '6 inch fat-free whole wheat pita bread',
 '6-inch corn tortillas',
 '6-inch flour tortillas',
 '6-inch whole wheat pitas',
 '7-up',
 '7-up soda',
 '70% lean ground beef',
 '8-inch baked pie shell',
 '8-inch flour tortillas',
 '8-inch graham cracker crust',
 '8-inch pre-baked crumb crust',
 '80% lean ground beef',
 '85% lean ground beef',
 '9 inch pie shell',
 '9" pastry pie shells',
 '9" unbaked pie shell',
 '9-grain bread',
 '9-inch baked pie crust',
 '9-inch deep dish pie crust',
 '9

In [40]:
len(unique_ingrdients)

11193

In [41]:
len(unique_tags)

549

In [42]:
recipes.columns

Index(['Unnamed: 0', 'name', 'id', 'minutes', 'contributor_id', 'submitted',
       'tags', 'nutrition', 'n_steps', 'steps', 'description', 'ingredients',
       'n_ingredients', 'average_rating', 'description_missing', 'calories',
       'total_fat_PDV', 'sugar_PDV', 'sodium_PDV', 'protein_PDV',
       'saturated_fat_PDV', 'carbohydrates_PDV', 'is_healthy',
       'only_healthy_tags'],
      dtype='object')

In [43]:
unique_ingrdients

['1% fat buttermilk',
 '1% fat cottage cheese',
 '1% low-fat chocolate milk',
 '1% low-fat milk',
 '10 bean soup mix',
 '10% cream',
 '100% bran',
 '12-inch flour tortillas',
 '15 bean mix',
 '15 bean soup mix',
 '15% cream',
 '18% table cream',
 '2% buttermilk',
 '2% cheddar cheese',
 '2% evaporated milk',
 '2% fat cottage cheese',
 '2% large-curd cottage cheese',
 '2% low-fat milk',
 '2% mexican cheese blend',
 '2% milk',
 '2% mozzarella cheese',
 '3 bean mix',
 '35% cream',
 '4% fat cottage cheese',
 '6 inch fat-free whole wheat pita bread',
 '6-inch corn tortillas',
 '6-inch flour tortillas',
 '6-inch whole wheat pitas',
 '7-up',
 '7-up soda',
 '70% lean ground beef',
 '8-inch baked pie shell',
 '8-inch flour tortillas',
 '8-inch graham cracker crust',
 '8-inch pre-baked crumb crust',
 '80% lean ground beef',
 '85% lean ground beef',
 '9 inch pie shell',
 '9" pastry pie shells',
 '9" unbaked pie shell',
 '9-grain bread',
 '9-inch baked pie crust',
 '9-inch deep dish pie crust',
 '9

In [44]:
unique_tags

['',
 '1-day-or-more',
 '15-minutes-or-less',
 '3-steps-or-less',
 '30-minutes-or-less',
 '4-hours-or-less',
 '5-ingredients-or-less',
 '60-minutes-or-less',
 'Throw the ultimate fiesta with this sopaipillas recipe from Food.com.',
 'a1-sauce',
 'african',
 'american',
 'amish-mennonite',
 'angolan',
 'appetizers',
 'apples',
 'april-fools-day',
 'argentine',
 'artichoke',
 'asian',
 'asparagus',
 'australian',
 'austrian',
 'avocado',
 'bacon',
 'baja',
 'baked-beans',
 'baking',
 'bananas',
 'bar-cookies',
 'barbecue',
 'bass',
 'bean-soup',
 'beans',
 'beans-side-dishes',
 'bear',
 'beef',
 'beef-barley-soup',
 'beef-crock-pot',
 'beef-kidney',
 'beef-liver',
 'beef-organ-meats',
 'beef-ribs',
 'beef-sauces',
 'beef-sausage',
 'beginner-cook',
 'beijing',
 'belgian',
 'berries',
 'beverages',
 'birthday',
 'biscotti',
 'bisques-cream-soups',
 'black-bean-soup',
 'black-beans',
 'blueberries',
 'bok-choys',
 'brazilian',
 'bread-machine',
 'bread-pudding',
 'breads',
 'breakfast',
 '

In [45]:
recipes['steps'][1]

"['pre-heat oven the 350 degrees f', 'in a mixing bowl , sift together the flours and baking powder', 'set aside', 'in another mixing bowl , blend together the sugars , margarine , and salt until light and fluffy', 'add the eggs , water , and vanilla to the margarine / sugar mixture and mix together until well combined', 'add in the flour mixture to the wet ingredients and blend until combined', 'scrape down the sides of the bowl and add the chocolate chips', 'mix until combined', 'scrape down the sides to the bowl again', 'using an ice cream scoop , scoop evenly rounded balls of dough and place of cookie sheet about 1 - 2 inches apart to allow for spreading during baking', 'bake for 10 - 15 minutes or until golden brown on the outside and soft & chewy in the center', 'serve hot and enjoy !']"

In [46]:
recipes['ingredients'][1]

['white sugar',
 'brown sugar',
 'salt',
 'margarine',
 'eggs',
 'vanilla',
 'water',
 'all-purpose flour',
 'whole wheat flour',
 'baking soda',
 'chocolate chips']

In [47]:
recipes

Unnamed: 0.1,Unnamed: 0,name,id,minutes,...,saturated_fat_PDV,carbohydrates_PDV,is_healthy,only_healthy_tags
0,111,1 brownies in the world best ever,333281,40,...,19.0,6.0,0,False
1,115,1 in canada chocolate chip cookies,453467,45,...,51.0,26.0,0,False
2,118,412 broccoli casserole,306168,40,...,36.0,3.0,0,False
...,...,...,...,...,...,...,...,...,...
83779,231634,zydeco ya ya deviled eggs,308080,40,...,5.0,0.0,0,False
83780,231635,cookies by design cookies on a stick,298512,29,...,21.0,9.0,0,False
83781,231636,cookies by design sugar shortbread cookies,298509,20,...,11.0,6.0,0,False


## Step 6: Baseline Model

In [48]:
recipes_copy = recipes.copy(deep = True)

In [49]:
recipes_copy['ingredients']

0        [bittersweet chocolate, unsalted butter, eggs,...
1        [white sugar, brown sugar, salt, margarine, eg...
2        [frozen broccoli cuts, cream of chicken soup, ...
                               ...                        
83779    [hard-cooked eggs, mayonnaise, dijon mustard, ...
83780    [butter, eagle brand condensed milk, light bro...
83781    [granulated sugar, shortening, eggs, flour, cr...
Name: ingredients, Length: 83782, dtype: object

In [50]:
recipes_copy['tags']

0        [60-minutes-or-less, time-to-make, course, mai...
1        [60-minutes-or-less, time-to-make, cuisine, pr...
2        [60-minutes-or-less, time-to-make, course, mai...
                               ...                        
83779    [60-minutes-or-less, time-to-make, course, mai...
83780    [30-minutes-or-less, time-to-make, course, pre...
83781    [30-minutes-or-less, time-to-make, course, pre...
Name: tags, Length: 83782, dtype: object

In [51]:
recipes['minutes']

0        40
1        45
2        40
         ..
83779    40
83780    29
83781    20
Name: minutes, Length: 83782, dtype: int64

### One-hot Encoding for 'Ingredients' Column

In [52]:
mlb_ingredients = MultiLabelBinarizer()
ingredients_encoded = pd.DataFrame(
    mlb_ingredients.fit_transform(recipes_copy['ingredients']),
    columns=mlb_ingredients.classes_,
    index=recipes_copy.index
)

In [53]:
ingredients_encoded

Unnamed: 0,1% fat buttermilk,1% fat cottage cheese,1% low-fat chocolate milk,1% low-fat milk,...,ziti rigati,zoom quick hot cereal,zucchini,zucchini with italian-style tomato sauce
0,0,0,0,0,...,0,0,0,0
1,0,0,0,0,...,0,0,0,0
2,0,0,0,0,...,0,0,0,0
...,...,...,...,...,...,...,...,...,...
83779,0,0,0,0,...,0,0,0,0
83780,0,0,0,0,...,0,0,0,0
83781,0,0,0,0,...,0,0,0,0


In [54]:
len(unique_ingrdients)

11193

In [55]:
ingredients_encoded = ingredients_encoded.add_prefix('ingredient_')

In [56]:
ingredients_encoded

Unnamed: 0,ingredient_1% fat buttermilk,ingredient_1% fat cottage cheese,ingredient_1% low-fat chocolate milk,ingredient_1% low-fat milk,...,ingredient_ziti rigati,ingredient_zoom quick hot cereal,ingredient_zucchini,ingredient_zucchini with italian-style tomato sauce
0,0,0,0,0,...,0,0,0,0
1,0,0,0,0,...,0,0,0,0
2,0,0,0,0,...,0,0,0,0
...,...,...,...,...,...,...,...,...,...
83779,0,0,0,0,...,0,0,0,0
83780,0,0,0,0,...,0,0,0,0
83781,0,0,0,0,...,0,0,0,0


### One-hot Encoding for 'Tags' column

In [57]:
mlb_tags = MultiLabelBinarizer()
tags_encoded = pd.DataFrame(
    mlb_tags.fit_transform(recipes_copy['tags']),
    columns=mlb_tags.classes_,
    index=recipes_copy.index
)
tags_encoded = tags_encoded.add_prefix('tag_')

In [58]:
tags_encoded

Unnamed: 0,tag_,tag_1-day-or-more,tag_15-minutes-or-less,tag_3-steps-or-less,...,tag_winter,tag_yams-sweet-potatoes,tag_yeast,tag_zucchini
0,0,0,0,0,...,0,0,0,0
1,0,0,0,0,...,0,0,0,0
2,0,0,0,0,...,0,0,0,0
...,...,...,...,...,...,...,...,...,...
83779,0,0,0,1,...,0,0,0,0
83780,0,0,0,0,...,0,0,0,0
83781,0,0,0,0,...,0,0,0,0


In [60]:
len(unique_tags)

549

### Combining the Encoded Columns to the Original Dataframe

In [61]:
recipes_encoded = pd.concat([
    recipes_copy.drop(['ingredients', 'tags'], axis=1), 
    ingredients_encoded, 
    tags_encoded
], axis=1)

### Splitting the Data into Features and Target

In [62]:
X = recipes_encoded.drop('minutes', axis=1)
y = recipes_encoded['minutes']

In [64]:
y

0        40
1        45
2        40
         ..
83779    40
83780    29
83781    20
Name: minutes, Length: 83782, dtype: int64

In [65]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

: 

## Step 7: Final Model

In [59]:
# TODO

## Step 8: Fairness Analysis

In [None]:
# TODO