## Introduction
This notebook explores the Food.com dataset, focusing on ingredient-level variables such as how often each ingredient appears in recipes and how its usage in submitted recipes changes over time. It also builds a model that predicts the probability that a recipe includes a particular ingredient, given the name chosen for that recipe.

   
## Table of Contents
- [Reading in and merging the data](#1)
- [Looking at data related to ingredients](#2)
    - [Ingredient frequency](#2.1)
    - [Ingredient-specific review scores](#2.2)
    - [Ingredient usage over time](#2.3)
- [Naive Bayes Model](#3)
    - [Getting recipe name tokens](#3.1)
    - [Estimating probabilities](#3.2)
    - [Trying out the model](#3.3)

In [None]:
from collections import defaultdict
from plotnine import *
import plotnine
import numpy as np
import pandas as pd
import ast
import os
import math
import warnings
warnings.filterwarnings('ignore')

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Reading in and merging the data <a name="1"></a>
Read in the files from the dataset that we want to look at.

In [None]:
raw_recipes = pd.read_csv("/kaggle/input/food-com-recipes-and-user-interactions/RAW_recipes.csv")
raw_interactions = pd.read_csv("/kaggle/input/food-com-recipes-and-user-interactions/RAW_interactions.csv")
pp_recipes = pd.read_csv("/kaggle/input/food-com-recipes-and-user-interactions/PP_recipes.csv")
pp_users = pd.read_csv("/kaggle/input/food-com-recipes-and-user-interactions/PP_users.csv")
ingr_map = pd.read_pickle("/kaggle/input/food-com-recipes-and-user-interactions/ingr_map.pkl")

Let's create a merged version of the recipes data (where rows refer to a unique recipe) that has both the raw and preprocessed data, and just the columns we care about. Let's also rename the ID column to specifically state that it is referring to a recipe ID, which will be useful when merging data between dataframes.
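
As a rough illustration of the merge-and-rename step, here is a minimal sketch on toy data. The two small frames and their values are made up for this example and are not the real Food.com files.

```python
import pandas as pd

# Toy stand-ins for the preprocessed and raw recipe tables (hypothetical values).
toy_pp = pd.DataFrame({"id": [1, 2], "ingredient_ids": ["[10, 11]", "[11]"]})
toy_raw = pd.DataFrame({"id": [1, 2, 3], "name": ["apple pie", "beef stew", "rice bowl"]})

# merge() defaults to an inner join, so recipe 3 (missing from toy_pp) is dropped.
toy_recipes = toy_pp.merge(toy_raw, on="id")

# Rename the key so later joins against other tables are unambiguous.
toy_recipes = toy_recipes.rename(columns={"id": "recipe_id"})
print(toy_recipes)
```

Because the join is an inner join, only recipes present in both tables survive, which is worth keeping in mind when interpreting the counts later on.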

In [None]:
recipes = pp_recipes.merge(right=raw_recipes, left_on="id", right_on="id")
recipes = recipes[["id", "name", "submitted", "ingredient_ids", "ingredients", "n_ingredients"]]
recipes = recipes.rename({"id":"recipe_id"}, axis="columns")
recipes.head()

That looks good, so let's move on to the interactions data.

In [None]:
interactions = raw_interactions[["user_id", "recipe_id", "rating", "review"]]
interactions.head()

That looks good too, let's move on to the ingredients. For the ingredients dataframe, let's again rename the columns to make a clear distinction between ID types before merging any data. The original file has raw strings from which the ingredients were parsed out and preprocessed. Let's just look at the preprocessed strings and get rid of duplicate rows, so that a single row in this dataframe now refers to a unique ingredient in the dataset. That way we can add on to this dataframe with new variables that have values specific to a particular ingredient, like its frequency in the dataset, or the average review score for recipes that include it.
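
The deduplication idea can be sketched on a toy mapping table. The rows below are invented, and the `raw_ingr` column name is an assumption used only for illustration; only `id` and `replaced` are taken from the notebook.

```python
import pandas as pd

# Toy ingredient map: several raw strings can share one preprocessed name and ID.
toy_map = pd.DataFrame({
    "raw_ingr": ["olive oil", "extra virgin olive oil", "flour"],  # hypothetical column
    "replaced": ["olive oil", "olive oil", "flour"],
    "id": [7, 7, 3],
})

# Keep only the ID and preprocessed name, then collapse duplicate rows
# so that each row describes exactly one ingredient.
toy_ingr = (
    toy_map.rename(columns={"id": "ingr_id", "replaced": "ingr_name"})
    [["ingr_id", "ingr_name"]]
    .drop_duplicates(ignore_index=True)
)
print(toy_ingr)
```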

In [None]:
ingr_df = ingr_map.copy(deep=True)
ingr_df = ingr_df.rename({"id":"ingr_id","replaced":"ingr_name"}, axis="columns")
ingr_df = ingr_df[["ingr_id", "ingr_name"]]
ingr_df = ingr_df.drop_duplicates(ignore_index=True)
ingr_df.head(10)

## Looking at data related to ingredients <a name="2"></a>
### Looking at the frequency of specific ingredients <a name="2.1"></a>

In order to define other variables with respect to particular ingredients, let's create a version of the recipes dataframe that is exploded with respect to the column that has the list of ingredient IDs. This will form a dataframe where each row refers to a specific ingredient present in a specific recipe. Note that in order to do this, we need to convert the contents of the ingredient IDs field to be actual lists rather than strings representing lists. 
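
A minimal sketch of the parse-then-explode step on toy data (the values are invented) may help make the resulting row structure concrete:

```python
import ast
import pandas as pd

# Ingredient lists arrive as strings that merely look like Python lists.
toy = pd.DataFrame({"recipe_id": [1, 2], "ingredient_ids": ["[10, 11]", "[11]"]})

# literal_eval safely parses each string into a real list.
toy["ingredient_ids"] = toy["ingredient_ids"].apply(ast.literal_eval)

# explode() emits one row per list element, repeating the other columns.
toy_exploded = toy.explode("ingredient_ids", ignore_index=True)
print(toy_exploded)
```

Recipe 1 now occupies two rows (one per ingredient) and recipe 2 occupies one.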

In [None]:
recipes_exploded = recipes.copy(deep=True)
recipes_exploded["ingredient_ids"] = recipes_exploded["ingredient_ids"].apply(ast.literal_eval)
recipes_exploded = recipes_exploded.explode(column="ingredient_ids", ignore_index=True)
recipes_exploded.head(10)

Because ingredient IDs are now in their own column, we can now group by ingredient and check the quantity of recipes that each belongs to, and add this to the ingredients dataframe.

In [None]:
ingr_df["num_recipes"] = ingr_df["ingr_id"].map(dict(recipes_exploded.groupby("ingredient_ids")["recipe_id"].size()))
ingr_df.head(10)

Let's get the total number of recipes in the dataset so that these counts can be represented as frequencies.
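
The count-to-frequency conversion can be sketched as follows on toy data (all values are invented). Ingredients that never occur end up with missing values, which is why a later cleanup step is needed.

```python
import pandas as pd

# Toy exploded table: one row per (recipe, ingredient) pair.
toy_exploded = pd.DataFrame({"recipe_id": [1, 1, 2, 3], "ingredient_ids": [10, 11, 11, 11]})
toy_ingr = pd.DataFrame({"ingr_id": [10, 11, 12]})

# Number of rows per ingredient, looked up by ingredient ID.
counts = toy_exploded.groupby("ingredient_ids")["recipe_id"].size()
toy_ingr["num_recipes"] = toy_ingr["ingr_id"].map(counts)

# Divide by the number of distinct recipes to get a frequency.
n_recipes = toy_exploded["recipe_id"].nunique()
toy_ingr["frequency"] = toy_ingr["num_recipes"] / n_recipes
print(toy_ingr)
```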

In [None]:
total_number_of_recipes = recipes["recipe_id"].unique().size
ingr_df["frequency"] = ingr_df["num_recipes"]/total_number_of_recipes
ingr_df.head(10)

Some of the ingredients in this dataframe are not actually used in any of the recipes that are included in the dataset, so let's get rid of those.

In [None]:
ingr_df = ingr_df.dropna()
ingr_df.head(10)

### Getting scores specific to individual ingredients <a name="2.2"></a>

We also might want to know if there is a relationship between particular ingredients and review scores. Scores are attributed to recipes not ingredients, so we need to merge the ingredient lists from the recipes dataframe with the interactions dataframe, so that we can then explode that dataframe with respect to the ingredients list.

In [None]:
interactions_exploded = interactions.copy(deep=True)
interactions_exploded = interactions_exploded.merge(how="left", right=recipes[["recipe_id","ingredient_ids"]], left_on="recipe_id", right_on="recipe_id")
interactions_exploded = interactions_exploded.dropna()
interactions_exploded["ingredient_ids"] = interactions_exploded["ingredient_ids"].apply(ast.literal_eval)
interactions_exploded = interactions_exploded.explode(column="ingredient_ids", ignore_index=True)
interactions_exploded.head(10)

We can use those relationships to add two columns to the ingredients dataframe: the mean rating across all reviews of recipes that include a given ingredient, and the total number of reviews written for recipes that include it.
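
Here is a small sketch of the per-ingredient aggregation on toy data (values invented). Note that each review counts once per ingredient, so heavily reviewed recipes weigh more in the mean.

```python
import pandas as pd

# Toy reviews and recipes; recipe 1 has two reviews.
toy_reviews = pd.DataFrame({"recipe_id": [1, 1, 2], "rating": [5, 3, 4]})
toy_recipes = pd.DataFrame({"recipe_id": [1, 2], "ingredient_ids": [[10, 11], [11]]})

# Attach ingredient lists to reviews, then emit one row per (review, ingredient).
joined = (
    toy_reviews.merge(toy_recipes, on="recipe_id", how="left")
    .explode("ingredient_ids", ignore_index=True)
)

# Mean rating and number of ratings for each ingredient.
per_ingr = joined.groupby("ingredient_ids")["rating"].agg(["mean", "count"])
print(per_ingr)
```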

In [None]:
ingr_df["mean_rating"] = ingr_df["ingr_id"].map(dict(interactions_exploded.groupby("ingredient_ids")["rating"].mean()))
ingr_df["num_ratings"] = ingr_df["ingr_id"].map(dict(interactions_exploded.groupby("ingredient_ids")["rating"].size()))
ingr_df.head(10)

In [None]:
plotnine.options.dpi = 100
plotnine.options.figure_size=(6,4)
(ggplot(ingr_df)
 + geom_point(aes(x="frequency", y="mean_rating"), color="black", show_legend=False)
 + theme_bw()
 + ylab("Mean Rating")
 + xlab("Frequency")
)

For ingredients that are used more frequently, the mean rating across all recipes that include them tends to approach the mean rating for the whole dataset, as expected. Including less frequently used ingredients does not necessarily seem to affect the average review score of the recipes that include them. As another way to visualize this, let's define 'rare' ingredients as those that appear in 5 or fewer recipes in the dataset, bin recipes by how many 'rare' ingredients they include, and then look at the distribution of average scores for the recipes in each bin.
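
The binning idea can be sketched on toy data as below. The threshold of 1 and all values are chosen only to keep the example small.

```python
import pandas as pd

# Toy ingredient counts and exploded recipe table (hypothetical values).
toy_ingr = pd.DataFrame({"ingr_id": [10, 11, 12], "num_recipes": [1, 3, 1]})
toy_exploded = pd.DataFrame({
    "recipe_id": [1, 1, 2, 3, 3],
    "ingredient_ids": [10, 11, 11, 11, 12],
})

threshold = 1  # toy threshold for this sketch
rare = set(toy_ingr.loc[toy_ingr["num_recipes"] <= threshold, "ingr_id"])

# Flag each (recipe, ingredient) row, then count flagged rows per recipe.
toy_exploded["rare_ingr"] = toy_exploded["ingredient_ids"].isin(rare)
num_rare = toy_exploded.groupby("recipe_id")["rare_ingr"].sum()
print(num_rare)
```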

In [None]:
threshold = 5
# Use a set so that each membership test below is a constant-time lookup.
rare_ingredients = set(ingr_df.loc[ingr_df["num_recipes"] <= threshold, "ingr_id"])
recipes_exploded["rare_ingr"] = recipes_exploded["ingredient_ids"].map(lambda x: x in rare_ingredients)
recipes_exploded.head(10)

In [None]:
recipes["num_rare_ingr"] = recipes["recipe_id"].map(dict(recipes_exploded.groupby("recipe_id")["rare_ingr"].sum()))
recipes.head(10)

In [None]:
recipes["mean_rating"] = recipes["recipe_id"].map(dict(interactions.groupby("recipe_id")["rating"].mean()))

plotnine.options.dpi = 100
plotnine.options.figure_size=(6,4)
(ggplot(recipes)
 + geom_boxplot(aes(x="num_rare_ingr", y="mean_rating", group="num_rare_ingr"), show_legend=False)
 + theme_bw()
 + ylab("Mean Rating")
 + xlab("Number of Rare Ingredients")
)

We should also look at the number of recipes in each category (in each boxplot).

In [None]:
recipes.groupby("num_rare_ingr").size()

### Looking at ingredient inclusion over time <a name="2.3"></a>
Another ingredient-level aspect of this dataset is how the inclusion of specific ingredients in submitted recipes changes over time. These trends might reflect the changing popularity of specific foods or recipes, or the prices or availability of particular ingredients. Drawing conclusions would require in-depth knowledge of how the submitted recipes were sampled from all recipes submitted during these periods; when the recipes were accessed, rather than just when they were submitted, is also relevant. Still, this section provides a simple example of getting started with this type of analysis.

Let's get started by making a histogram of recipes based on the year they were submitted.

In [None]:
# Extract the year as the part of the submission date string before the first "-".
recipes["year"] = recipes["submitted"].map(lambda x: x.split("-")[0])
recipes = recipes.sort_values(by="year")

plotnine.options.dpi = 100
plotnine.options.figure_size=(6,4)
(ggplot(recipes)
 + geom_histogram(aes(x="year"), color="darkgray", show_legend=False)
 + theme_bw()
 + ylab("Count")
 + xlab("Year")
 + theme(axis_text_x=element_text(rotation=65, hjust=1))
)

The best-represented portion of the data is from 2002 to 2009, so let's limit the dataset to those years when looking at trends in ingredient usage.

In [None]:
recipes_subset = recipes_exploded.copy(deep=True)
recipes_subset["year"] = recipes_subset["submitted"].map(lambda x: x.split("-")[0]).astype(int)
recipes_subset = recipes_subset[(recipes_subset["year"]>=2002) & (recipes_subset["year"]<=2009)]
# drop=True avoids carrying the old row labels along as an extra "index" column.
recipes_subset = recipes_subset.reset_index(drop=True)
recipes_subset.head()

We're using the version of the recipes dataframe here that has been exploded with respect to the list of ingredients in each recipe. This way, we can now group by individual years and find the number of recipes that use each individual ingredient.
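
One way to compute per-year usage fractions is sketched below on toy data (values invented). The denominator here is the number of distinct recipes per year, since the exploded table has one row per (recipe, ingredient) pair rather than one row per recipe.

```python
import pandas as pd

# Toy exploded table with a year column.
toy = pd.DataFrame({
    "recipe_id": [1, 1, 2, 3, 3],
    "year": [2002, 2002, 2002, 2003, 2003],
    "ingredient_ids": [10, 11, 11, 10, 12],
})

# Distinct recipes submitted in each year.
recipes_per_year = toy.groupby("year")["recipe_id"].nunique()

# Distinct recipes per (year, ingredient) pair.
using = (
    toy.groupby(["year", "ingredient_ids"])["recipe_id"]
    .nunique()
    .rename("num_recipes")
    .reset_index()
)
using["fraction_using"] = using["num_recipes"] / using["year"].map(recipes_per_year)
print(using)
```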

In [None]:
by_year_df = pd.DataFrame(recipes_subset.groupby("year")["ingredient_ids"].value_counts())
by_year_df.columns = ["num_recipes"]
by_year_df.reset_index(inplace=True)
# recipes_subset has one row per (recipe, ingredient) pair, so count distinct recipe IDs per year rather than rows.
year_to_num_recipes = recipes_subset.groupby("year")["recipe_id"].nunique()
by_year_df["fraction_using"] = by_year_df["num_recipes"] / by_year_df["year"].map(year_to_num_recipes)
by_year_df.head(10)

Now we have a dataframe with the number of recipes that use each ingredient in a given year, as well as the fraction of recipes submitted during that year that use this ingredient. Let's merge this with our existing ingredients dataframe so that we have access to the other columns, like ingredient names.

In [None]:
by_year_df = by_year_df.merge(right=ingr_df[["ingr_name","ingr_id"]], how="left", left_on="ingredient_ids", right_on="ingr_id")
by_year_df = by_year_df[["ingr_name", "ingr_id", "year", "num_recipes", "fraction_using"]]
by_year_df.head(10)

Now we can start looking at trends for individual ingredients within this period. As an example, let's look at trends for different types of cooking oils. We can start by subsetting the ingredients dataframe for ingredients that contain the word 'oil' and then sorting by frequency to find some cooking oils that are well-represented in the dataset.
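
A quick sketch shows why matching on whole words matters here. The names below are invented examples, not rows from the dataset.

```python
names = ["olive oil", "boiled egg", "sesame oil", "foil-wrapped potato"]

# Whole-word match: split on whitespace and look for the exact token "oil".
whole_word = [n for n in names if "oil" in n.split()]

# Plain substring match also catches "boiled" and "foil-wrapped".
substring = [n for n in names if "oil" in n]

print(whole_word)
print(substring)
```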

In [None]:
# Keep ingredients whose name contains "oil" as a whole word, then take the ten most frequent.
is_oil = ingr_df["ingr_name"].apply(lambda x: "oil" in x.split())
cooking_oils_df = ingr_df[is_oil].sort_values(by="frequency", ascending=False, ignore_index=True).head(10)
cooking_oils_df

In [None]:
cooking_oil_names = list(cooking_oils_df["ingr_name"].values)
cooking_oil_names_of_interest = cooking_oil_names[:]
cooking_oil_names_of_interest.remove("oil")
cooking_oil_names_of_interest.remove("cooking oil")
print(cooking_oil_names_of_interest)

In [None]:
plot_data = by_year_df[by_year_df["ingr_name"].isin(cooking_oil_names_of_interest)]

plotnine.options.dpi = 100
plotnine.options.figure_size=(8,4)
(ggplot(plot_data)
 + geom_point(aes(x="year", y="fraction_using"), color="darkgray", show_legend=False)
 + facet_wrap("ingr_name", ncol=4, scales="fixed")
 + theme_bw()
 + theme(subplots_adjust={'wspace': 0.1})
 + ylab("Fraction Using")
 + xlab("Year")
 + theme(axis_text_x=element_text(rotation=45, hjust=1))
)

It looks like the fraction of recipes that use many of these types of oil is relatively consistent across this time period, with the exception of olive oil, which appears to be increasing. Could this be because recipes that previously would have used other kinds of oil are now using olive oil? One way to get a clearer look at this question is to take all recipes using any ingredient containing the word 'oil' and plot the share attributable to each type, so we can see how the proportions change over time.
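
The within-year proportions can also be computed numerically. The sketch below uses invented counts and a hypothetical two-item list of common oils; it relabels everything else as "other" and normalizes each year to sum to 1.

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2002, 2002, 2002, 2003, 2003],
    "ingr_name": ["olive oil", "canola oil", "walnut oil", "olive oil", "canola oil"],
    "num_recipes": [2, 1, 1, 3, 1],
})
common = ["olive oil", "canola oil"]  # hypothetical list of well-represented oils

# Keep common names as-is and relabel the rest as "other".
toy["ingr_name"] = toy["ingr_name"].where(toy["ingr_name"].isin(common), "other")

# Sum counts per (year, name), then divide by each year's total.
totals = toy.groupby(["year", "ingr_name"])["num_recipes"].sum()
shares = totals / totals.groupby(level="year").transform("sum")
print(shares)
```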

In [None]:
# Subset the dataset again to include oils, this time keeping all of them but relabeling any that aren't in our list of commonly used ones as 'other'.
# .copy() gives an independent dataframe, so the column assignment below does not act on a view of by_year_df.
by_year_df_subsets = by_year_df[by_year_df["ingr_name"].apply(lambda x: "oil" in x.split())].copy()
by_year_df_subsets["ingr_name"] = by_year_df_subsets["ingr_name"].map(lambda x: x if x in cooking_oil_names else "other")
cooking_oil_names.append("other")
by_year_df_subsets.head(10)

In [None]:
# This is necessary to make sure the order of the sections in a given bar is the way we want it and consistent with the legend.
category_order = cooking_oil_names
by_year_df_subsets['ingr_name'] = pd.Categorical(by_year_df_subsets['ingr_name'], categories=category_order, ordered=True)

# Get a color palette and create a mapping from each ingredient that needs to be in the plot to a color.
from palettable.mycarta import Cube1_11
pal = Cube1_11.hex_colors
color_mapping = dict(zip(cooking_oil_names, pal[:len(cooking_oil_names)]))

# Make the plot.
(ggplot(data=by_year_df_subsets)
    + aes(y='fraction_using', x='year', fill="ingr_name")
    + scale_x_continuous(breaks=np.arange(2002,2010))
    + theme_bw()
    + geom_bar(position="fill", stat="identity")
    + ylab("Proportion")
    + xlab("Year")
    + scale_fill_manual(name="Cooking Oil", values=color_mapping, limits=cooking_oil_names)
)

## Ingredient Prediction with a Naive Bayes Model <a name="3"></a>

The names of recipes in the dataset, although messy, have relationships with the ingredients that are likely to be in the recipe. These relationships might be explicitly indicated by the name, such as 'apple pie' containing apples, or less directly, such as 'apple pie' containing flour. We can use a Naive Bayes model to predict the probability that a recipe includes a given ingredient, given the words that are present in the recipe's name.
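
For reference, the textbook Naive Bayes decomposition for this setting can be written as follows, where $i$ is an ingredient and $t_1, \dots, t_n$ are the tokens of a recipe name. The symbols are notation introduced here for illustration, and the estimates computed later in this notebook may differ in detail from this form.

$$
P(i \mid t_1, \dots, t_n) \;\propto\; P(i) \prod_{k=1}^{n} P(t_k \mid i)
\quad\Longrightarrow\quad
\log P(i \mid t_1, \dots, t_n) = \log P(i) + \sum_{k=1}^{n} \log P(t_k \mid i) + \text{const}
$$

The "naive" part is the assumption that name tokens are conditionally independent of one another given the ingredient. Working in log space turns the product into a sum and avoids numerical underflow.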

### Tokenizing the recipe names <a name="3.1"></a>

Let's look at the names that were provided for the recipes, splitting each one on whitespace so that every row holds a single name token.

In [None]:
recipes_exploded_wrt_name_tokens = recipes.copy(deep=True)
recipes_exploded_wrt_name_tokens["name_tokens"] = recipes_exploded_wrt_name_tokens["name"].map(lambda x: x.split())
recipes_exploded_wrt_name_tokens = recipes_exploded_wrt_name_tokens.explode(column="name_tokens", ignore_index=True)
recipes_exploded_wrt_name_tokens = recipes_exploded_wrt_name_tokens[["recipe_id", "name_tokens", "ingredient_ids"]]
recipes_exploded_wrt_name_tokens.head(10)

Each row now refers to a specific name token in a specific recipe, with its list of ingredients. Let's explode with respect to the ingredient IDs again.
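
Exploding twice yields every (name token, ingredient) pair for a recipe. A minimal sketch on one invented recipe:

```python
import pandas as pd

toy = pd.DataFrame({
    "recipe_id": [1],
    "name": ["apple pie"],
    "ingredient_ids": [[10, 11, 12]],
})

# Split the name into tokens, then explode tokens and ingredients in turn.
toy["name_tokens"] = toy["name"].str.split()
pairs = (
    toy.explode("name_tokens", ignore_index=True)
    .explode("ingredient_ids", ignore_index=True)
)
print(pairs[["recipe_id", "name_tokens", "ingredient_ids"]])
```

Two tokens times three ingredients gives six rows, so row counts in the doubly exploded table grow with both name length and ingredient count.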

In [None]:
recipes_exploded_wrt_name_tokens["ingredient_ids"] = recipes_exploded_wrt_name_tokens["ingredient_ids"].apply(ast.literal_eval)
recipes_exploded_wrt_name_tokens = recipes_exploded_wrt_name_tokens.explode(column="ingredient_ids", ignore_index=True)
recipes_exploded_wrt_name_tokens.head(10)

### Estimating the probabilities <a name="3.2"></a>
Now that we have this dataframe, we can easily extract counts of how often each ingredient appears in recipes whose names contain a specific token.

In [None]:
ingredient_id_to_prob = dict(zip(ingr_df["ingr_id"].values, ingr_df["frequency"]))

In [None]:
name_token_and_ingredient_id_to_counts = dict(recipes_exploded_wrt_name_tokens.groupby("name_tokens")["ingredient_ids"].value_counts())
name_token_to_count = dict(recipes_exploded_wrt_name_tokens.groupby("name_tokens").size())

In [None]:
name_token_and_ingredient_id_to_prob = lambda name_token,ingr_id: (name_token_and_ingredient_id_to_counts.get((name_token,ingr_id), 0)+1)/name_token_to_count[name_token]
name_token_and_ingredient_id_to_prob("beef", 1685)

Let's create another lambda that lets us pass in an ingredient ID and a list of name tokens and get back a log-probability score for that ingredient given those name tokens.
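
The scoring scheme can be sketched in plain Python with made-up statistics. All names and numbers below are hypothetical; the structure (a log prior plus a sum of logs of add-one-smoothed ratios) follows the cells in this section.

```python
import math

# Hypothetical toy statistics, not values from the dataset.
prior = {"flour": 0.5, "beef": 0.2}  # fraction of recipes using each ingredient
pair_counts = {("pie", "flour"): 8, ("pie", "beef"): 1, ("apple", "flour"): 5}
token_counts = {"pie": 10, "apple": 6}

def smoothed(token, ingr):
    # Add one to the pair count so unseen pairs do not produce log(0).
    return (pair_counts.get((token, ingr), 0) + 1) / token_counts[token]

def score(ingr, tokens):
    # Sum logs instead of multiplying small numbers together.
    return math.log(prior[ingr]) + sum(math.log(smoothed(t, ingr)) for t in tokens)

flour_score = score("flour", ["apple", "pie"])
beef_score = score("beef", ["apple", "pie"])
print(flour_score, beef_score)
```

Because the add-one term is not normalized, the smoothed ratios can reach or exceed 1, so the results are best read as ranking scores rather than calibrated probabilities.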

In [None]:
ingr_prob = lambda ingr_id,name_tokens: math.log(ingredient_id_to_prob[ingr_id]) + np.sum([math.log(name_token_and_ingredient_id_to_prob(name_token,ingr_id)) for name_token in name_tokens])
ingr_prob(2200, ["beef", "chicken", "soy", "rice"])

Let's create another lambda that lets us pass in a list of tokens and get back a mapping between ingredient IDs and scores obtained using the model.

In [None]:
ingr_probs = lambda name_tokens: {ingr_id:ingr_prob(ingr_id,name_tokens) for ingr_id in ingr_df["ingr_id"].values}
#ingr_probs(["beef", "chicken", "soy", "rice"])

### Trying out the model <a name="3.3"></a>
Let's compose some potential recipe names out of individual tokens, and pass them to the model to see what the most likely ingredients for that name are.
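
Ranking candidates by score amounts to sorting a mapping by value. A small sketch with invented scores:

```python
# Hypothetical log scores for four ingredients.
scores = {"flour": -0.1, "beef": -4.9, "apple": -0.7, "salt": -1.2}

# Sort (ingredient, score) pairs from highest to lowest score and keep the top 3.
top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)
```

Log scores are negative or near zero, so "highest" means closest to zero.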

In [None]:
recipe_name_tokens = ["stir", "fry"]
results = pd.DataFrame(zip(*ingr_probs(recipe_name_tokens).items())).transpose()
results.columns = ["ingr_id", "score"]
results = ingr_df.merge(results, on="ingr_id")
results = results.sort_values(by="score", ascending=False, ignore_index=True)
results.head(20)

In [None]:
recipe_name_tokens = ["beef", "burrito"]
results = pd.DataFrame(zip(*ingr_probs(recipe_name_tokens).items())).transpose()
results.columns = ["ingr_id", "score"]
results = ingr_df.merge(results, on="ingr_id")
results = results.sort_values(by="score", ascending=False, ignore_index=True)
results.head(20)

In [None]:
recipe_name_tokens = ["apple", "pie"]
results = pd.DataFrame(zip(*ingr_probs(recipe_name_tokens).items())).transpose()
results.columns = ["ingr_id", "score"]
results = ingr_df.merge(results, on="ingr_id")
results = results.sort_values(by="score", ascending=False, ignore_index=True)
results.head(20)