In [3]:
import pandas as pd
import numpy as np
from ast import literal_eval

In [4]:
df = pd.read_csv('recipes_w_search_terms.csv')

In [45]:
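# Keep an independent backup; plain assignment would only create a second name for the same DataFrame.
orig_df = df.copy()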

In [None]:
# Restore from the backup without tying df to it.
df = orig_df.copy()

In [57]:
df["ingredients"] = df["ingredients"].apply(literal_eval)
df["steps"] = df["steps"].apply(literal_eval)

In [5]:
df["tags"] = df["tags"].apply(literal_eval)
df["search_terms"] = df["search_terms"].apply(literal_eval)
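Because the cells in this notebook were run out of order, calling `literal_eval` on a column that has already been parsed would raise an error. A minimal sketch of a re-runnable version follows; the helper name `parse_literal` and the toy values are made up for illustration and are not part of the original notebook.

```python
from ast import literal_eval

import pandas as pd


def parse_literal(value):
    """Parse a stringified Python literal; pass already-parsed values through."""
    if isinstance(value, str):
        return literal_eval(value)
    return value


# Toy frame standing in for the real CSV (values are invented for illustration).
toy = pd.DataFrame({
    "ingredients": ["['butter', 'salt']", "['flour']"],
    "search_terms": ["{'appetizer', 'dinner'}", "{'side'}"],
})

columns = ["ingredients", "search_terms"]

# First pass turns the strings into real lists and sets.
for col in columns:
    toy[col] = toy[col].apply(parse_literal)

# A second pass is harmless because parsed values are returned unchanged.
for col in columns:
    toy[col] = toy[col].apply(parse_literal)
```

On the real data the same loop could be run over `ingredients`, `steps`, `tags`, and `search_terms`, assuming every non-missing value in those columns is a valid Python literal.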

In [6]:
df[df["search_terms"].apply(lambda x: 'appetizer' in x)]

Unnamed: 0,id,name,description,ingredients,ingredients_raw_str,serving_size,servings,steps,tags,search_terms
16,232099,Scarlett's Crock Pot Cheese (And Prawn) Fondue...,Ideal for long hot summer days when you don't ...,"['butter', 'shallots', 'prawns', 'garlic clove...","[""1 tablespoon butter"",""2 shallots,...",1 (74 g),6,['Preheat the cooker to high whilst preparing ...,"[time-to-make, course, main-ingredient, cuisin...","{appetizer, dinner}"
34,350688,Roasted Shrimp Cocktail,This recipe for shrimp cocktail is way better ...,"['shrimp', 'olive oil', 'kosher salt', 'fresh ...","[""2 lbs shrimp"",""1 tablespoon olive ...",1 (115 g),6,"['Preheat the oven to 400 degrees F.', 'Peel a...","[15-minutes-or-less, time-to-make, course, mai...","{low-carb, snack, low-calorie, lunch, appetize..."
35,232115,Caramel Butter Brie,This makes a wonderful sweet/salty appetizer t...,"['butter', 'light brown sugar', 'white sugar',...","[""1/4 cup butter"",""1/4 cup light brown...",1 (77 g),8,"['Preheat oven to 350 degrees.', 'Melt butter,...","[60-minutes-or-less, time-to-make, course, mai...",{appetizer}
39,127155,Crab &amp; Fresh Basil Stuffed Mushrooms,A wonderful tasting and attractive hors d’oeuv...,"['monterey jack cheese', 'plain breadcrumbs', ...","[""3 cups monterey jack cheese, shredded ...",1 (185 g),8,"['Combine cheese, breadcrumbs, salt, green oni...","[60-minutes-or-less, time-to-make, course, mai...","{low-calorie, low-carb, appetizer, dinner}"
41,145160,Thanksgiving Mashed Potatoes,Mashed potatoes for this special occasion,"['potatoes', 'butter', 'salt', 'milk', 'chicke...","[""5 large potatoes (washed)"",""1/2 cup ...",1 (538 g),5,"['Bake the potatoes and when ready, mash them ...","[15-minutes-or-less, time-to-make, course, mai...","{side, low-calorie, appetizer}"
...,...,...,...,...,...,...,...,...,...,...
494868,229578,Curried Corn-Crab Cakes,A fanastic twist on crab cakes.,"['fresh corn kernels', 'onion', 'red bell pepp...","[""3/4 cup fresh corn kernels (about 2 ears...",1 (115 g),8,['Heat a large nonstick skillet over medium-hi...,"[30-minutes-or-less, time-to-make, course, mai...","{appetizer, dinner}"
494878,215289,I-Love-Pickles Fried Dill Pickles,Everyone should try this southern treat just o...,"['all-purpose flour', 'cornstarch', 'baking po...","[""1 cup all-purpose flour"",""1/4 cup c...",1 (123 g),8,"['Stir flour, cornstarch, baking powder and sa...","[60-minutes-or-less, time-to-make, course, mai...","{side, appetizer}"
494898,227056,Oven Baked Zucchini Slices,"This is so easy to make, and so good! Better f...","['breadcrumbs', 'parmesan cheese', 'mayonnaise...","[""1/4 cup breadcrumbs (plain or Italian)"",...",1 (62 g),4,"['Mix bread crumbs and cheese in one bowl.', '...","[30-minutes-or-less, time-to-make, course, mai...","{side, appetizer, baked}"
494931,126024,Cheese Straws,Interesting.,"['flour', 'salt', 'cheese', 'shortening', 'wat...","[""1 cup flour"",""1/2 teaspoon salt"",""1...",1 (74 g),4,"['Mix flour, salt, cheese and shortening.', 'A...","[15-minutes-or-less, time-to-make, course, mai...","{side, appetizer}"



#### State your main research question
What patterns can I find in the ingredients of the recipes? Can I connect ingredients with tags or search terms? Can I generate good names for the recipes based on the other variables? What patterns might cluster analysis reveal about the different types of recipes and cuisines?
#### Brief summary of where your data came from
I got my data from a public Kaggle dataset of recipes collected from Food.com, one of the largest recipe sites. Some cleaning has already been done, such as extracting each ingredient's name from the raw ingredient list. The Kaggle page (https://www.kaggle.com/datasets/shuyangli94/foodcom-recipes-with-search-terms-and-tags/data) doesn't state the licensing terms explicitly, but it cites studies that have used this dataset, so it appears to be acceptable to use.
#### Explanation/description (in words) of all the variables in your data (italicized variables are targets)
- *Name*: the title of the recipe, as a string. It is of interest for text analysis and generation.
- Description: the description provided for the recipe, as a string. It could be useful for text analysis and could also help categorize the data.
- Ingredients: the cleaned list of ingredient names. This is the variable I'm most excited about: I want to use the ingredients list to cluster recipes and predict their tags.
- Ingredients_raw_str: the uncleaned version of the ingredients, including quantities and preparation instructions. I may not end up using this variable.
- Serving_size: the weight in grams of one serving of the recipe.
- Servings: the number of servings a recipe makes.
- Steps: plain-text directions, stored as an array of strings.
- *Tags*: user-created tags that describe the recipe. I want to try to predict these tags.
- *Search_terms*: terms that would return the recipe if you searched for them on the site. It could be useful to try to predict these as well.


#### Summary statistics for all variables 
##### For numeric variables include: sample size, mean, standard deviation, and 5 number summary (min, q1, q2, q3, max)
##### For categorical variables include: sample size, category counts

Most of my variables are text-based, so this section takes some improvisation: for list-valued columns I summarize their lengths instead of their values.

In [8]:
print("Total number of observations:")
df.shape[0]

Total number of observations:


494963

The name and description are text fields. Every recipe has a name, but 9,601 recipes (494,963 − 485,362) have no description.

In [11]:
print("Sample size of descriptions:")
df["description"].dropna().shape[0]

Sample size of descriptions:


485362

In [25]:
print("Description of the number of ingredients:")
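# NOTE: len() counts list items only if "ingredients" has already been parsed with literal_eval;
# on the raw CSV strings it counts characters instead.
num_ingredients = df["ingredients"].apply(len)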

num_ingredients.describe()

Description of the number of ingredients:


count    494963.000000
mean        144.185139
std          64.664102
min           2.000000
25%          98.000000
50%         135.000000
75%         179.000000
max         843.000000
Name: ingredients, dtype: float64
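A mean of about 144 and a maximum of 843 look more like character counts than ingredient counts, which is what `len` returns if the column still held the raw CSV strings when the cell ran. The sketch below, using a made-up example string, shows the difference between the two measurements.

```python
from ast import literal_eval

# How a single cell of the "ingredients" column looks when read straight from the CSV.
raw = "['butter', 'shallots', 'prawns']"

# Parsing turns the string into a real Python list.
parsed = literal_eval(raw)

chars = len(raw)     # length of the string, in characters
items = len(parsed)  # number of ingredients in the list

print(f"characters: {chars}, ingredients: {items}")
```

Re-running the summary after the `literal_eval` cell has executed should give the item counts; the same check applies to the `steps`, `tags`, and `search_terms` summaries.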

In [35]:
print("Description of the serving size")
# Assumes every value looks like "1 (74 g)": drop the leading "1 (" and the trailing "g)".
serving_size = df["serving_size"].apply(lambda x: int(x[3:-2]))
serving_size.describe()

Description of the serving size


count    4.949630e+05
mean     3.750634e+02
std      2.702044e+03
min     -4.750000e+02
25%      1.220000e+02
50%      2.190000e+02
75%      3.810000e+02
max      1.595816e+06
Name: serving_size, dtype: float64
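The minimum of -475 g and the maximum of about 1.6 million g suggest that some serving sizes are data-entry errors. A possible alternative to slicing by position, sketched below on made-up values, is to extract the number with a regular expression and then keep only positive weights. The pattern assumes the `1 (74 g)` format seen in the table above.

```python
import pandas as pd

# Toy values: three in the expected format (one negative) and one malformed entry.
toy = pd.Series(["1 (74 g)", "1 (115 g)", "1 (-475 g)", "unknown"])

# Capture the (possibly negative) integer that sits between "(" and "g)".
# Rows that do not match the pattern become NaN instead of raising an error.
grams = pd.to_numeric(toy.str.extract(r"\((-?\d+)\s*g\)", expand=False))

# Keep only physically plausible weights for the summary statistics.
plausible = grams[grams > 0]

print(plausible.describe())
```

Whether to drop, cap, or keep the extreme values is a judgment call that depends on the later analysis.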

In [40]:
print("Description of the number of servings:")
df["servings"].describe()

Description of the number of servings:


count    494963.000000
mean          7.063164
std          94.677417
min           1.000000
25%           4.000000
50%           4.000000
75%           8.000000
max       32767.000000
Name: servings, dtype: float64

In [39]:
print("The max number of servings is from a recipe for whale meat stew...")
# idxmax() returns an index label, so look the row up with .loc rather than .iloc.
df.loc[df["servings"].idxmax()]

The max number of servings is from a recipe for whale meat stew...


id                                                                 72549
name                                                   Alaskan Blue Stew
description            I copied this recipe off the wall of Sourdough...
ingredients            ['whale meat', 'unbleached flour', 'olive oil'...
ingredients_raw_str    ["1 (242000   lb)   blue whale meat, boned and...
serving_size                                                   1 (199 g)
servings                                                           32767
steps                  ['Cut whale in bite size pieces (including blu...
tags                   ['weeknight', 'time-to-make', 'course', 'main-...
search_terms                                          {'stew', 'dinner'}
Name: 51114, dtype: object
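The maximum of 32767 equals 2**15 - 1, the largest signed 16-bit integer, so it may be a capped value and not a literal serving count. One common way to flag values like this is the 1.5 × IQR rule, sketched here on made-up numbers. It is not necessarily the rule this project will end up using.

```python
import pandas as pd

# Toy serving counts; 32767 mimics the maximum observed in the real data.
servings = pd.Series([4, 4, 6, 8, 8, 12, 32767])

# Quartiles and the conventional upper fence at Q3 + 1.5 * IQR.
q1, q3 = servings.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Anything above the fence is flagged for manual inspection.
outliers = servings[servings > upper_fence]

print(f"upper fence: {upper_fence}")
print(outliers)
```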

In [41]:
print("Summary of number of steps")
# Counts steps only if "steps" has been parsed into lists; on raw strings this is a character count.
df["steps"].apply(len).describe()

Summary of number of steps


count    494963.000000
mean        598.236620
std         428.468252
min           2.000000
25%         320.000000
50%         501.000000
75%         757.000000
max       12688.000000
Name: steps, dtype: float64

In [48]:
print("this is an outlier for number of steps, it isn't written in the usual format.")
# idxmax() returns an index label, so use .loc to fetch the matching row and column.
df.loc[df["steps"].apply(len).idxmax(), "steps"]

this is an outlier for number of steps, it isn't written in the usual format.


'[\'First of all: these are not typical directions, but you need to know about needed equipment before attempting this cake. Here it is:\', \'8-inch round cake pan, at least 2 inches high.\', \'8-inch round cake pan with removable bottom or 8-inch springform pan.\', \'untreated heavy-duty jelly-roll pans.\', \'rubber spatula, offset spatula, and flexible 8-inch metal icing spatula.\', \'decorating turntable, lazy Susan, or inverted round cake pan.\', \'ridged plastic shelf liner, freezer paper, or 055 Mylar (I used the plastic shelf liner).\', \'parchment paper and waxed paper.\', \'MAKING THE CAKE:\', \'Position a rack in the lower third of the oven or just below the center of the oven and preheat the oven to 350°F Fit the bottom of an 8-inch round cake pan, one at least 2 inches high, with parchment paper and set aside.\', "Pour the clarified butter into a 1-quart bowl and stir in the vanilla extract, if you\'re using it. The butter must be hot when added to the batter, so either kee

In [49]:
print("Summary of number of tags")
df["tags"].apply(lambda x: len(x)).describe()

Summary of number of tags


count    494963.000000
mean        242.596499
std          99.800796
min           4.000000
25%         168.000000
50%         230.000000
75%         304.000000
max        1029.000000
Name: tags, dtype: float64

In [50]:
print("Summary of number of search terms")
df["search_terms"].apply(lambda x: len(x)).describe()

Summary of number of search terms


count    494963.000000
mean         32.087202
std          19.781546
min           7.000000
25%          19.000000
50%          28.000000
75%          43.000000
max         164.000000
Name: search_terms, dtype: float64
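The length summaries above do not give the category counts requested for categorical variables. One way to get them for a set-valued column is to explode it so that each term gets its own row, then count the values. The sketch below uses a made-up frame and assumes the column has already been parsed into sets.

```python
import pandas as pd

# Toy frame; on the real data this would be the parsed "search_terms" column.
toy = pd.DataFrame({
    "search_terms": [
        {"appetizer", "dinner"},
        {"appetizer"},
        {"side", "appetizer"},
    ],
})

# Sort each set into a list first so the exploded order is deterministic,
# then give every term its own row and count how often each one appears.
term_counts = toy["search_terms"].apply(sorted).explode().value_counts()

print(term_counts)
```

The same pattern should work for `tags` and `ingredients`, which are stored as lists.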

#### Two or three interesting graphs that start to address your main question of interest

I'm not certain that my question of interest can really be addressed with a graph. I'll need to do some more work to analyze the text here and start making predictions.
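One possible starting point for a graph is a co-occurrence table of ingredients against tags, which could later be drawn as a heatmap. The sketch below builds such a table from a made-up frame. The column names match the dataset, but the values and the choice of `pd.crosstab` are only illustrative.

```python
import pandas as pd

# Toy recipes with parsed list columns (values are invented for illustration).
toy = pd.DataFrame({
    "ingredients": [["butter", "salt"], ["butter", "flour"], ["shrimp", "salt"]],
    "tags": [["baked"], ["baked", "dessert"], ["seafood"]],
})

# Explode both columns so that each row is one (ingredient, tag) pair.
pairs = (
    toy.explode("ingredients", ignore_index=True)
       .explode("tags", ignore_index=True)
)

# Count how many recipes pair each ingredient with each tag.
cooccurrence = pd.crosstab(pairs["ingredients"], pairs["tags"])

print(cooccurrence)
```

On the full dataset the table would be very wide, so restricting it to the most frequent ingredients and tags first would probably be necessary.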

#### Answer these questions:
##### Were there any challenges or obstacles in finding the right dataset for your project?
It took a while to find a food-related dataset that also supported a substantial research question I could answer. I like this one because of the ingredients list. I'm interested in analyzing the connections between ingredients, and this dataset is well suited to that.
##### Are there any other problems, concerns, or challenges that you are facing regarding your project?
I need to learn more about text analysis and pre-trained models, since that's going to be a big part of how I analyze this data.