# COGS 118B - Final Project

# Insert title here

## Group members

- Zixian Cai
- Richard Masser-Frye
- Vinuthna Hasthi
- Pranav Nair
- Jesus Tello

# Abstract 
How does a recipe define a food item? What makes a cake a cake? In this project, we seek to cluster recipes based on their ingredients and other variables, and to determine whether a machine learning model can distinguish recipes by the type of food item they describe. To narrow the scope of the problem, we focus on baked goods and attempt to cluster the data by type of baked good (e.g., cakes, cookies, muffins). The data is a set of recipes, each of which includes a title, a set of ingredients (with associated quantities), and a set of instructions. Features will be extracted from the ingredients and instructions and fed into a Gaussian mixture model (GMM) or another clustering scheme. We will measure success by how accurately the clustering algorithms separate the recipes into their correct categories.

TODO: edit me to include results and generally be better representative of how the project turned out

# Background

Broadly, recipes are sets of instructions for preparing specific dishes. They generally include some information about the dish, the ingredients used to make it, and how to assemble or process those ingredients. Baked goods in particular are dishes (such as cookies and cakes) that contain flour as a primary ingredient and involve combining ingredients into a batter or dough, which is then baked in an oven. The consistency of the baked good generally depends on the proportions of flour, egg, sugar, butter, and baking powder used, and on how long and at what temperature it is baked.

Recipe datasets have previously been used in machine learning research. Often, the research is not particularly focused on recipes, but uses them for their advantageous qualities (most recipes share a similar structure, they can be easily scraped from websites, they can't be copyrighted, etc.). For example, a 2017 paper by Yang et al.<a name="yang"></a>[<sup>[1]</sup>](#yangnote) used machine learning to analyze "reference expressions" (phrases that refer to previously mentioned objects in a text) and used recipe instructions as one example domain. Other published works are more explicitly food-focused; Herranz et al.<a name="herranz"></a>[<sup>[2]</sup>](#herranznote) provide an overview of techniques that can be applied to recipes, ingredient lists, and even images of food. The paper associated with the RecipeNLG dataset by Bień et al.<a name="recipenlg"></a>[<sup>[3]</sup>](#recipenlgnote) also provides an overview of previous efforts in machine learning on recipes, including the two aforementioned papers.

# Problem Statement

The problem at hand is the categorization of baked-goods recipes based on their ingredients. Although ingredient lists are not explicitly numerical, they can be abstracted into numbers; specifically, the quantity of each ingredient in grams. Each recipe can thus be expressed as a vector, and clustering techniques (such as GMM or hierarchical clustering) can be applied to the dataset to sort the recipes into groups. We can also use high-dimensional data visualization techniques like PCA or manifold learning to get a sense of the overall shape of the dataset. Finally, we can measure the success of the clustering by comparing the predicted labels to the true labels found in the original dataset.

# Data

We used a dataset called RecipeNLG created by researchers at Poznan University of Technology.<a name="recipenlg2"></a>[<sup>[3]</sup>](#recipenlgnote)

- Links: [RecipeNLG official website](https://recipenlg.cs.put.poznan.pl/), [RecipeNLG Kaggle page](https://www.kaggle.com/datasets/paultimothymooney/recipenlg)
- RecipeNLG contains 2,231,142 recipes in all, but for our project we filtered it down to 235,762 recipes that had "cake", "cookie" or "muffin" in the title. After processing, it was further filtered down to 145,087 data points.
- Post-processing, an observation consists of quantities of each of the following ingredients, in grams:
    - egg
    - flour
    - sugar
    - butter
    - vanilla extract
    - milk
    - evaporated milk
    - condensed milk
    - shortening
    - powdered sugar
    - cornmeal
    - baking soda
    - baking powder
    - oats
- Whether a particular datapoint is from a cake, cookie, or muffin recipe is stored in a column called 'category', and is treated as a ground truth observation (i.e. it is not used for unsupervised tasks)

The initial filtering of the data for baked-goods recipes was done in [filter_recipes.ipynb](https://github.com/richmass1/group_template/blob/main/filter_recipes.ipynb), by selecting recipes that had in their titles "cake", "cookie", or "muffin". Converting the recipes into data points was done in [process_recipes.ipynb](https://github.com/richmass1/group_template/blob/main/process_recipes.ipynb), by searching each ingredient string (for example, `'1/2 c. sugar'`) for an ingredient name, then extracting the number and unit, converting the number to a float, then converting the unit to grams. 
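
The conversion step described above can be sketched roughly as follows. This is a minimal illustration, not the code in `process_recipes.ipynb`: the regular expression, the `GRAMS_PER_UNIT` table (including its gram values), the `UNIT_ALIASES` mapping, and the function name `ingredient_to_grams` are all assumptions made for the example.

```python
import re
from fractions import Fraction

# Hypothetical conversion table: approximate grams per unit for a few ingredients.
# The numbers are illustrative placeholders, not the values used in the project.
GRAMS_PER_UNIT = {
    "sugar": {"c": 200.0, "tbsp": 12.5, "tsp": 4.0},
    "flour": {"c": 120.0, "tbsp": 7.5, "tsp": 2.5},
    "butter": {"c": 227.0, "tbsp": 14.0, "tsp": 5.0},
}

# Hypothetical mapping from unit spellings to the keys used above.
UNIT_ALIASES = {"c": "c", "cup": "c", "cups": "c", "tbsp": "tbsp", "tsp": "tsp"}

# Quantity ("2", "1/2", "1 1/2", "0.5"), then a unit with an optional period, then the rest.
PATTERN = re.compile(
    r"^\s*(?P<qty>\d+(?:\s+\d+/\d+|/\d+|\.\d+)?)\s*(?P<unit>[a-zA-Z]+)\.?\s+(?P<rest>.+)$"
)


def ingredient_to_grams(text):
    """Return (ingredient_name, grams), or None if the string cannot be parsed."""
    match = PATTERN.match(text)
    if match is None:
        return None
    unit = UNIT_ALIASES.get(match.group("unit").lower())
    if unit is None:
        return None
    rest = match.group("rest").lower()
    # Naive substring search; a fuller version would check longer names first
    # (e.g. "powdered sugar" before "sugar").
    for name, table in GRAMS_PER_UNIT.items():
        if name in rest:
            # Convert the quantity to a float: "1 1/2" -> 1 + 1/2 -> 1.5
            quantity = float(sum(Fraction(part) for part in match.group("qty").split()))
            return name, quantity * table[unit]
    return None


print(ingredient_to_grams("1/2 c. sugar"))
print(ingredient_to_grams("1 1/2 c. flour"))
```

Strings without a recognized unit (such as `'2 eggs'`) return `None` in this sketch; handling count-based ingredients would need an additional per-item weight, which is not shown here.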

After processing, the data was further filtered by removing data points that had no flour, data points with invalid values, and data points with values that seemed overly large.
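
A filter of this kind might look like the sketch below. The cutoff `MAX_GRAMS`, the function name `filter_recipes`, and the toy data are assumptions for illustration; the text does not state the threshold actually used for "overly large" values.

```python
import numpy as np
import pandas as pd

# Hypothetical cutoff for "overly large" quantities, in grams.
MAX_GRAMS = 2000.0


def filter_recipes(df, ingredient_cols, max_grams=MAX_GRAMS):
    """Keep rows that contain flour, have only valid values, and have no extreme values."""
    values = df[ingredient_cols]
    has_flour = df["flour"] > 0
    # Invalid values here means NaN, infinite, or negative quantities.
    is_valid = np.isfinite(values).all(axis=1) & (values >= 0).all(axis=1)
    not_too_large = (values <= max_grams).all(axis=1)
    return df[has_flour & is_valid & not_too_large].reset_index(drop=True)


# Toy data: row 1 has no flour, row 2 has a missing value, row 3 has an extreme value.
toy = pd.DataFrame(
    {
        "flour": [240.0, 0.0, 120.0, 360.0, 180.0],
        "sugar": [200.0, 100.0, np.nan, 50000.0, 150.0],
        "category": ["cake", "cookie", "muffin", "cake", "cookie"],
    }
)
clean = filter_recipes(toy, ["flour", "sugar"])
print(clean)
```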

# Proposed Solution

Our strategy was to try a number of ML algorithms and see which performed best on our particular dataset. An early idea was to first reduce dimensionality, then use GMM to cluster the data points, and see whether GMM could accurately separate the recipes into their true categories. We also decided to try alternative clustering techniques like hierarchical clustering and spectral clustering. Finally, we tried PCA, UMAP, and t-SNE as data visualization techniques.

# Evaluation Metrics

For clustering techniques like GMM and hierarchical clustering, we initially proposed using the Bayesian information criterion (BIC) to evaluate the clustering itself, and the adjusted Rand index (ARI) to evaluate how well it assigned data points to their true categories. However, it turned out that the data was not conducive to simple clustering, rendering ARI unnecessary.
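
As a rough sketch of how these two metrics fit together, the example below selects the number of GMM components by BIC and then scores the resulting cluster assignments against known labels with ARI. It uses scikit-learn and synthetic blobs from `make_blobs` as a stand-in for the recipe vectors; the data, the range of component counts, and the variable names are assumptions, not the project's actual setup.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the recipe vectors: 3 groups in 5 dimensions.
X, y_true = make_blobs(n_samples=600, centers=3, n_features=5, random_state=0)
X = StandardScaler().fit_transform(X)

# Fit a GMM for each candidate number of components and record its BIC
# (lower BIC indicates a better trade-off between fit and model complexity).
bic_by_k = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0)
    gmm.fit(X)
    bic_by_k[k] = gmm.bic(X)

best_k = min(bic_by_k, key=bic_by_k.get)

# Refit with the selected number of components and compare clusters to the true labels.
labels = GaussianMixture(n_components=best_k, random_state=0).fit_predict(X)
ari = adjusted_rand_score(y_true, labels)

print("BIC-selected number of components:", best_k)
print("ARI against true labels:", round(ari, 3))
```

ARI is invariant to how clusters are numbered, so it can compare cluster indices to category labels directly; a value near 0 indicates agreement no better than chance, and 1 indicates a perfect match.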

For dimensionality reduction algorithms (PCA, t-SNE, UMAP), we used a more supervised approach. For all three techniques, we computed silhouette scores using the true labels to gauge how well each technique placed recipes of the same type near each other. For PCA and UMAP, we also trained a k-nearest neighbors (kNN) classifier on the embedded data and then used ARI to evaluate the performance of the classifier.
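
The evaluation described above can be illustrated with PCA as follows. This is a sketch on synthetic data using scikit-learn; the 14-feature blobs, the 75/25 train/test split, and `n_neighbors=5` are assumptions chosen for the example rather than the settings used in the notebooks.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 14 features (one per ingredient) and 3 categories.
X, y = make_blobs(n_samples=600, centers=3, n_features=14, random_state=0)
X = StandardScaler().fit_transform(X)

# Embed the data in two dimensions.
embedding = PCA(n_components=2, random_state=0).fit_transform(X)

# Silhouette score computed with the TRUE labels: high values mean that points of the
# same category sit close together, and far from other categories, in the embedding.
sil = silhouette_score(embedding, y)

# Train kNN on part of the embedded data and score its predictions on the rest with ARI.
emb_train, emb_test, y_train, y_test = train_test_split(
    embedding, y, test_size=0.25, stratify=y, random_state=0
)
knn = KNeighborsClassifier(n_neighbors=5).fit(emb_train, y_train)
ari = adjusted_rand_score(y_test, knn.predict(emb_test))

print("Silhouette score (true labels):", round(sil, 3))
print("kNN ARI on held-out points:", round(ari, 3))
```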

# Results

### Starting out, getting a picture of the data with PCA
<img width=400 src="./images/PCA_projection.png">

Image from [this notebook](https://github.com/richmass1/group_template/blob/main/118b%20PCA%20project.ipynb)

<img width=400 src="./images/PCAwithlabels.png">

Image from [this notebook](https://github.com/richmass1/group_template/blob/main/PCA_UMAP_tSNE.ipynb)

## Clustering Techniques

### GMM

<img width=400 src="./images/GMM.png">

Image from [this notebook](https://github.com/richmass1/group_template/blob/main/118B%20GMM%20Project.ipynb)

### Hierarchical clustering
<img width=400 src="./images/hierarchical.png">

Image from [this notebook](https://github.com/richmass1/group_template/blob/main/Hierarchical%20Spectral%20Clustering.ipynb)

# Discussion

### Interpreting the result

OK, you've given us quite a bit of tech information above, now it's time to tell us what to pay attention to in all that.  Think clearly about your results, decide on one main point and 2-4 secondary points you want us to understand. Highlight HOW your results support those points.  You probably want 2-5 sentences per point.

### Limitations

Are there any problems with the work?  For instance would more data change the nature of the problem? Would it be good to explore more hyperparams than you had time for?   

### Ethics & Privacy

For the most part, we do not expect to run into ethics or privacy issues. We acknowledge potential ethical concerns around the cultural sensitivities associated with foods. Regarding data privacy, we will ensure that we do not breach any privacy rules or laws concerning personal recipes: the data used in our analysis is obtained strictly from publicly available sources, so no private recipes are exposed. Our team will address any ethics or privacy concerns that emerge, and we will adapt the project if we identify unethical behavior on the part of our team, algorithm, or data.

### Conclusion

Reiterate your main point and in just a few sentences tell us how your results support it. Mention how this work would fit in the background/context of other work in this field if you can. Suggest directions for future work if you want to.

# Footnotes
<a name="yangnote"></a>1.[^](#yang): Zichao Yang, Phil Blunsom, Chris Dyer, and Wang Ling. (9 Aug 2017) Reference-Aware Language Models. https://arxiv.org/pdf/1611.01628.pdf<br> 

<a name="herranznote"></a>2.[^](#herranz): Luis Herranz, Weiqing Min and Shuqiang Jiang. (22 Jan 2018) Food recognition and recipe analysis: integrating visual content, context and external knowledge. https://arxiv.org/pdf/1801.07239.pdf<br> 

<a name="recipenlgnote"></a>3.[^](#recipenlg), [^](#recipenlg2): Michał Bień, Michał Gilski, Martyna Maciejewska, Wojciech Taisner, Dawid Wisniewski, and Agnieszka Lawrynowicz. (Dec 2020) RecipeNLG: A Cooking Recipes Dataset for Semi-Structured Text Generation. https://aclanthology.org/2020.inlg-1.4/<br>