# Concept
In the original conceptualization of the project, we said that we wanted to find out the CO2 impact of each recipe, its nutritional value and its cultural origin. To do so, the plan was to combine three datasets: Cooking recipes, Recipe 1M+ and the Open Food Facts Database. After examining the data, we found that Recipe 1M+ contains all the information we need to implement our plan. It consists of several datasets that combine recipe and ingredient information. Here we analyse two of them to assess whether our objectives are feasible.


We load the data with Spark in case it is too large to be handled in pandas.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession
# Namespaced import: a wildcard import of pyspark.sql.functions would
# shadow Python built-ins such as sum, min and max.
from pyspark.sql import functions as F

In [2]:
spark = SparkSession.builder.master("local[*]").getOrCreate()

# First Glance at the data


We first explore a dataset that contains nutritional information together with the quantity, unit and weight of every ingredient in each recipe. Although this data is much smaller than the full dataset, it is very clean and can give us an idea of what information we can retrieve.

In [4]:
root_path = '../recipes_with_nutritional_info.json'
recipes = spark.read.json(root_path,multiLine=True)

In [5]:
recipes.show(10)

+--------------------+----------+--------------------+--------------------+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|  fsa_lights_per100g|        id|         ingredients|        instructions| nutr_per_ingredient| nutr_values_per100g|partition|            quantity|               title|                unit|                 url|     weight_per_ingr|
+--------------------+----------+--------------------+--------------------+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|[green, green, gr...|000095fc1d|[[yogurt, greek, ...|[[Layer all ingre...|[[0.8845044000000...|[81.1294613189476...|    train|   [[8], [1], [1/4]]|     Yogurt Parfaits|[[ounce], [cup], ...|http://tastykitch...|[226.796, 152.0, ...|
|[red, orange, ora...|00051d5b9d|[[sugars, granula...|[[Cream sugar 

In [7]:
recipes.printSchema()

root
 |-- fsa_lights_per100g: struct (nullable = true)
 |    |-- fat: string (nullable = true)
 |    |-- salt: string (nullable = true)
 |    |-- saturates: string (nullable = true)
 |    |-- sugars: string (nullable = true)
 |-- id: string (nullable = true)
 |-- ingredients: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- text: string (nullable = true)
 |-- instructions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- text: string (nullable = true)
 |-- nutr_per_ingredient: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- fat: double (nullable = true)
 |    |    |-- nrg: double (nullable = true)
 |    |    |-- pro: double (nullable = true)
 |    |    |-- sat: double (nullable = true)
 |    |    |-- sod: double (nullable = true)
 |    |    |-- sug: double (nullable = true)
 |-- nutr_values_per100g: struct (nullable = true)
 |    |-- energy: double (nullable = true)
 |    

Reflecting on the dataset and the feasibility of our objectives, we can say that evaluating the CO2 impact of each recipe would require an additional dataset giving the emissions per ingredient. Even then, the computed total impact would not be accurate, since the CO2 emissions of a given ingredient are expected to vary widely depending on its origin and its use.
The nutritional value aspect of the analysis has already been done by the team of scientists who created the data. This information is, however, not available for the full dataset of recipes. We could try to expand it by matching repeating ingredients. This might be difficult given that ingredients are often quite specific (many of them come in many different types). However, we consider that if we combine these types and take a single value for their nutritional attributes, the inaccuracy would not influence global tendencies.

The dataset seems to be very rich. As we know, it has been preprocessed to be fed to a machine learning algorithm, so we expect it to be very clean, with no NaN values or duplicates. We will still check, just in case. For the moment, we focus mainly on the recipes and the way they can be clustered, leaving their nutritional values for a later stage of the analysis. We therefore work on a dataframe that keeps only the id, title, ingredients and instructions columns.


In [8]:
recipes=recipes.select(["id","title","ingredients","instructions"])

In [10]:
print('Size before removing NaN and duplicates:', recipes.count())
recipes=recipes.dropna()
recipes=recipes.drop_duplicates()
print('Size after removing NaN and duplicates:', recipes.count())

Size before removing NaN and duplicates: 51235
Size after removing NaN and duplicates: 51235


We have about 51 thousand recipes (51,235), a subset of the 800 thousand available, and the check above removed none of them. This subset is well suited to understanding how the data can be structured and what could be done with it.
Let's look at the inner structure.


In [12]:
one_line = recipes.take(1)
print(type(one_line[0][2]))
one_line[0][2]

<class 'list'>


[Row(text='sugars, granulated'),
 Row(text='cocoa, dry powder, unsweetened'),
 Row(text='wheat flour, white, all-purpose, unenriched'),
 Row(text='salt, table'),
 Row(text='butter, without salt'),
 Row(text='vanilla extract')]

We see that each row of the ingredients column holds a list of ingredients, where each element is an ingredient name followed by words describing it. We would like to get a feeling for the order of magnitude of the full list of ingredients.
To do so, we export the data to a pandas dataframe, as this makes it easier to explode it by ingredient.


In [31]:
recipes_pd = recipes.toPandas()

In [32]:
recipes_pd.head(10)

Unnamed: 0,id,title,ingredients,instructions
0,02968fae50,Hot Chocolate Sauce over Pound Cake,"[(sugars, granulated,), (cocoa, dry powder, un...",[(Fill bottom of double boiler with enough wat...
1,030ae71300,Italian Wheat Bread With Pesto,"[(water, bottled, generic,), (honey,), (salt, ...",[(Put ingredients in machine in the order reco...
2,03a63c5699,Pie Crust,"[(butter, without salt,), (shortening, vegetab...","[(Combine the butter, shortening, 2 cups of th..."
3,05aaa57168,Easy Roast Beef,"[(beef, grass-fed, ground, raw,), (soup, cream...","[(Rinse beef, pat dry, and place in slow cooke..."
4,06916e9d4c,Homemade Ginger Beer,"[(spices, ginger, ground,), (water, bottled, g...","[(Place ginger in a food processor, and proces..."
5,06a4bf1502,Easy Braunschweiger Ball,"[(pork sausage, link/patty, unprepared,), (che...",[(Use Braunschweiger and cream cheese at room ...
6,0a3d094811,Three Step Taco Salad,"[(beef, grass-fed, ground, raw,), (seasoning m...",[(Cook the beef in large skillet over medium-h...
7,0bb944d2e1,Bittersweet Chocolate Sauce,"[(sugars, granulated,), (cocoa, dry powder, un...",[(Combine the sugar and cocoa in a heavy botto...
8,0f2df80545,Santa Fe Cured Pork Loin,"[(pork, fresh, loin, tenderloin, separable lea...","[(In large saucepan, heat all ingredients EXCE..."
9,0f628acbbd,Creamy Chicken Alfredo,"[(cream, fluid, heavy whipping,), (butter, wit...","[(Directions:,), (Place the cream, butter and ..."


In [33]:
recipes_pd = recipes_pd.explode("ingredients")

In [35]:
recipes_pd

Unnamed: 0,id,title,ingredients,instructions
0,02968fae50,Hot Chocolate Sauce over Pound Cake,"(sugars, granulated,)",[(Fill bottom of double boiler with enough wat...
0,02968fae50,Hot Chocolate Sauce over Pound Cake,"(cocoa, dry powder, unsweetened,)",[(Fill bottom of double boiler with enough wat...
0,02968fae50,Hot Chocolate Sauce over Pound Cake,"(wheat flour, white, all-purpose, unenriched,)",[(Fill bottom of double boiler with enough wat...
0,02968fae50,Hot Chocolate Sauce over Pound Cake,"(salt, table,)",[(Fill bottom of double boiler with enough wat...
0,02968fae50,Hot Chocolate Sauce over Pound Cake,"(butter, without salt,)",[(Fill bottom of double boiler with enough wat...
...,...,...,...,...
51234,ffaaf44bde,Cream Fresh Cauliflower Soup,"(milk, fluid, 1% fat, without added vitamin a ...","[(Saute onion in butter until tender.,), (Brea..."
51234,ffaaf44bde,Cream Fresh Cauliflower Soup,"(cheese, parmesan, hard,)","[(Saute onion in butter until tender.,), (Brea..."
51234,ffaaf44bde,Cream Fresh Cauliflower Soup,"(spices, curry powder,)","[(Saute onion in butter until tender.,), (Brea..."
51234,ffaaf44bde,Cream Fresh Cauliflower Soup,"(cornstarch,)","[(Saute onion in butter until tender.,), (Brea..."


We see that there are more than 300 thousand entries in the dataframe. We expect some ingredients to appear in several recipes, and most entries to become repeats once similar ingredients are combined.
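A minimal sketch of how this expectation could be checked, using a handful of ingredient strings copied from the outputs above. It assumes that the text before the first comma is the base ingredient, which is a simplification on our part and not a property documented by the dataset; `base_name` is a hypothetical helper.

```python
from collections import Counter

# Ingredient strings copied from the outputs above. In the notebook they
# would come from the exploded "ingredients" column (the text of each Row).
sample_ingredients = [
    "sugars, granulated",
    "sugars, granulated",
    "spices, ginger, ground",
    "spices, curry powder",
    "butter, without salt",
    "salt, table",
]

def base_name(ingredient_text):
    """Return the text before the first comma (assumed to be the base ingredient)."""
    return ingredient_text.split(",")[0].strip().lower()

# Count exact strings, then count again after collapsing to the base name.
full_counts = Counter(sample_ingredients)
base_counts = Counter(base_name(text) for text in sample_ingredients)

print(len(sample_ingredients), "entries")
print(len(full_counts), "distinct ingredient strings")
print(len(base_counts), "distinct base names")
```

Note that this heuristic merges ginger and curry powder under "spices", so a coarser grouping is not always a better one.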

# Scaling up
We would now like to scale up and use the full database of recipes. As a first step we import a subset of it: we select the first 50 thousand recipes in order to get a sense of the data while still being able to run Spark locally.


In [37]:
root_path = '../first_50k.json'
recipes_50k = spark.read.json(root_path,multiLine=True)
recipes_50k.show(10)

+----------+--------------------+--------------------+---------+--------------------+--------------------+
|        id|         ingredients|        instructions|partition|               title|                 url|
+----------+--------------------+--------------------+---------+--------------------+--------------------+
|000018c8a5|[[6 ounces penne]...|[[Preheat the ove...|    train|Worlds Best Mac a...|http://www.epicur...|
|000033e39b|[[1 c. elbow maca...|[[Cook macaroni a...|    train|Dilly Macaroni Sa...|http://cookeatsha...|
|000035f7ed|[[8 tomatoes, qua...|[[Add the tomatoe...|    train|            Gazpacho|http://www.foodne...|
|00003a70b1|[[2 12 cups milk]...|[[Preheat oven to...|     test|Crunchy Onion Pot...|http://www.food.c...|
|00004320bb|[[1 (3 ounce) pac...|[[Dissolve Jello ...|    train|Cool 'n Easy Crea...|http://www.food.c...|
|0000631d90|[[12 cup shredded...|[[In a large skil...|    train|Easy Tropical Bee...|http://www.food.c...|
|000075604a|[[2 Chicken thigh...|[[Pi

Apart from the missing nutritional values, this data is similar to the previous dataset. One potential problem is that the ingredients are less clean, and the ingredients are exactly what we are interested in.
Hence we will need to reorganize the data by isolating the measures from the descriptions of the ingredients. Although this is not an obvious task, we observe that only a very limited number of words describe the quantity (e.g. cups, c., ounce, tbsp), and these can be separated out along with the numerical value.
Separating the actual ingredient from its specifications will require a deeper analysis.
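The separation of quantity and unit described above could be sketched with a regular expression. The unit list, the pattern and the helper `split_ingredient` are assumptions made for illustration, and apart from "6 ounces penne" the example strings are made up rather than taken from the dataset.

```python
import re

# Hypothetical, deliberately short list of unit words; a real list would
# be built by inspecting the data.
UNITS = ["cups", "cup", "c.", "ounces", "ounce", "oz", "tbsp", "tsp"]

# Longest units first so that "cups" is tried before "cup".
unit_pattern = "|".join(
    re.escape(unit) for unit in sorted(UNITS, key=len, reverse=True)
)

PATTERN = re.compile(
    # Quantity: "6", "1/4", "2 1/2" or "0.5" (optional).
    r"^\s*(?P<quantity>\d+(?:\s+\d+/\d+|/\d+|\.\d+)?)?\s*"
    # Unit: must be followed by whitespace or end of string (optional).
    r"(?:(?P<unit>" + unit_pattern + r")(?=\s|$)\s*)?"
    # Everything else is treated as the ingredient description.
    r"(?P<name>.*)$",
    re.IGNORECASE | re.DOTALL,
)

def split_ingredient(text):
    """Split a raw ingredient line into (quantity, unit, name)."""
    match = PATTERN.match(text)
    unit = match.group("unit")
    return (
        match.group("quantity"),
        unit.lower() if unit else None,
        match.group("name").strip(),
    )

for line in ["6 ounces penne", "2 1/2 cups milk", "1 tbsp olive oil", "3 eggs"]:
    print(split_ingredient(line))
```

The raw data also seems to contain quantities such as "2 12 cups", which this sketch does not interpret; such cases would need separate handling.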

In [38]:
###

A first idea would be to check which words already appear in the clean ingredients list and which words appear the most often.
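One way this idea might look in practice is sketched below. The clean strings are taken from the first dataset's output above, while the raw lines are made-up examples; `tokenize` is a hypothetical helper that keeps lowercase alphabetic words only.

```python
import re
from collections import Counter

# Clean ingredient strings copied from the first dataset's output.
clean_ingredients = [
    "sugars, granulated",
    "butter, without salt",
    "salt, table",
    "vanilla extract",
]

# Made-up raw ingredient lines in the style of the larger dataset.
raw_ingredients = [
    "6 ounces penne",
    "1 cup granulated sugar",
    "2 tbsp butter",
    "1 tsp salt",
    "1 tsp vanilla extract",
]

def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

# Vocabulary of the clean list, and word frequencies in the raw lines.
clean_vocabulary = {word for text in clean_ingredients for word in tokenize(text)}
raw_word_counts = Counter(word for text in raw_ingredients for word in tokenize(text))

known = {w: n for w, n in raw_word_counts.items() if w in clean_vocabulary}
unknown = {w: n for w, n in raw_word_counts.items() if w not in clean_vocabulary}

print("most frequent words:", raw_word_counts.most_common(3))
print("already in the clean list:", sorted(known))
print("not in the clean list:", sorted(unknown))
```

Here "sugar" is reported as unknown because the clean list uses the plural "sugars", which suggests that some normalization of word forms would be needed before comparing the two lists.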

In [39]:
###

Once we have a clean ingredients list for all the recipes, we will proceed with the analysis of the data.
The first step would be to find which of the newly extracted ingredients have nutritional information available in the first dataset. We will consider removing the specifications of the ingredients in order to simplify the task and keep our set of usable recipes as large as possible. We assume for the time being that not differentiating between, for example, regular and Kosher salt would not be a huge loss.
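The matching step could be sketched as follows, under the assumption that dropping the specifications means keeping only the text before the first comma of a clean entry. The helpers `base_name` and `match_to_clean` are hypothetical, and the raw names are illustrative.

```python
# Clean ingredient strings copied from the first dataset's output.
clean_ingredients = [
    "salt, table",
    "butter, without salt",
    "sugars, granulated",
    "vanilla extract",
]

def base_name(text):
    """Return the text before the first comma (assumed base ingredient)."""
    return text.split(",")[0].strip().lower()

# Index the clean entries by their base name.
clean_by_base = {}
for text in clean_ingredients:
    clean_by_base.setdefault(base_name(text), []).append(text)

def match_to_clean(raw_name):
    """Return the clean entries whose base name words all occur in raw_name."""
    words = raw_name.lower().split()
    candidates = []
    for base, entries in clean_by_base.items():
        if all(word in words for word in base.split()):
            candidates.extend(entries)
    return candidates

for raw_name in ["kosher salt", "unsalted butter", "penne", "granulated sugar"]:
    print(raw_name, "->", match_to_clean(raw_name))
```

With this simplification, "kosher salt" is matched to the table salt entry, as intended. "granulated sugar" finds no match because of the plural "sugars", so singular and plural forms would have to be reconciled first.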

In [40]:
###

The next step of the project will consist of clustering the recipes based on the similarity of their ingredients. We will consider doing so in different ways.
One possibility is to come up with a scoring system based on a metric computed from the ingredients. Another would be to train an algorithm on a subset of recipes whose type and origin are easily extractable (e.g. 'pasta' is Italian, 'cake' is pastry).
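Both ideas can be illustrated with a small sketch. The Jaccard index is used here as one possible similarity metric; it is our choice for the example, not a decision already made for the project. The ingredient sets for `recipe_b` and `recipe_c` are made up, and `KEYWORD_LABELS` only contains the two examples mentioned above.

```python
import re

def jaccard(ingredients_a, ingredients_b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(ingredients_a), set(ingredients_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# recipe_a uses the base names of the ingredients shown above for
# "Hot Chocolate Sauce over Pound Cake"; recipe_b and recipe_c are hypothetical.
recipe_a = {"sugars", "cocoa", "wheat flour", "salt", "butter"}
recipe_b = {"sugars", "cocoa", "butter"}
recipe_c = {"beef", "salt"}

print(jaccard(recipe_a, recipe_b))  # 3 shared out of 5 distinct
print(jaccard(recipe_a, recipe_c))  # 1 shared out of 6 distinct

# Hypothetical keyword table for deriving training labels from titles.
KEYWORD_LABELS = {"pasta": "Italian", "cake": "pastry"}

def label_from_title(title):
    """Return the label of the first keyword found in the title, else None."""
    words = re.findall(r"[a-z]+", title.lower())
    for keyword, label in KEYWORD_LABELS.items():
        if keyword in words:
            return label
    return None

print(label_from_title("Hot Chocolate Sauce over Pound Cake"))
print(label_from_title("Pie Crust"))
```

Recipes whose titles yield a label could serve as the training subset, while pairwise similarities such as the ones above could feed an unsupervised clustering.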


In [41]:
###

At this stage we should be able to compare our different clustering methods and select a final classification.
Once this is done, we can ask numerous questions about the differences between different cuisines. Which one is the healthiest? Which one uses the most of a specific ingredient? Which types of recipes are simpler to execute? ...

In [None]:
###