# Ingradient Descent: Breaking Food Down into Ingredients

*Written by Brad Ritzema and Zach Chin*

## The Real-World Goal


For our project, we wanted to create a neural network that could turn a picture of food into a list of ingredients. For example, given a picture of a taco, we wanted it to output ingredients such as tortilla, carne asada, cilantro, and cheese. This problem is interesting because it goes beyond typical single-label classification, such as determining whether an image shows a cat or a dog, or identifying a digit from 0 to 9.

This project is interesting even without the technical challenges! Imagine you are in a restaurant trying a dish you’ve never had before. It would be interesting to take a picture of that food, learn about the ingredients, and be able to recreate it at home. Additionally, from a health and safety perspective, those with food allergies can determine if their food contains any potential allergens before taking a bite.

## The Dataset(s)



We utilized two different datasets in our modelling process: [Recipes5k](http://www.ub.edu/cvub/recipes5k/) and [Recipe1M+](http://pic2recipe.csail.mit.edu/). 

Recipe1M+ is a dataset curated by researchers from the Polytechnic University of Catalonia, MIT, and the Qatar Computing Research Institute. It contains one million cooking recipes and 13 million images of food (hence the name, 1M+). Each recipe is associated with a number of images, a list of ingredients, and instructions for preparing the dish. This dataset was difficult to work with at the start of this project: the images were spread across directories with no apparent pattern, and the sheer size of the dataset was a double-edged sword, since downloading even the important files took significant time. Additionally, the list of ingredients had not been cleaned, making it a challenge to develop a common vocabulary for the model.

The other dataset, Recipes5k, is a dataset curated by researchers at the University of Barcelona. It is derived from the Food-101 dataset, which contains 101 different types of food along with their names. Recipes5k takes the names from this dataset, along with multiple images and lists of ingredients taken from [Yummly](http://www.yummly.com/) for each food type. The dataset’s main strength is that the ingredients for each image have already been cleaned; for example, similar ingredients like “tomato paste” and “tomato slices” were condensed into the term “tomato.” The images themselves were contained in directories named after the food type. This made it very simple to generate a common vocabulary for our model. On the downside, the dataset uses *txt* files rather than *json* files to describe the data, so we had to link the data ourselves across the *txt* files rather than load it from one *json* file. Despite this, Recipes5k’s simple images and labels suited our needs perfectly for the task of relating an image of food to its ingredients.
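The linking step described above can be sketched roughly as follows. This is a minimal illustration under an assumed layout, not the dataset's documented format: one listing of image paths, a parallel listing of indices, and a third listing whose lines hold comma-separated ingredients. The function name `link_annotations` and the toy rows are hypothetical.

```python
def link_annotations(image_lines, label_lines, ingredient_lines):
    """Join three parallel text listings into one record per image.

    image_lines[i]      -> relative image path, e.g. "tacos/0_example.jpg"
    label_lines[i]      -> index (as text) into ingredient_lines
    ingredient_lines[j] -> comma-separated ingredients for recipe j
    """
    records = []
    for path, label in zip(image_lines, label_lines):
        path = path.strip()
        ingredients = ingredient_lines[int(label)].strip().split(",")
        records.append({
            "fname": path,
            "dish": path.split("/")[0],  # directory is named after the food type
            "ingredients": [i.strip() for i in ingredients if i.strip()],
        })
    return records


# Toy stand-ins for the contents of the txt files (not real dataset rows).
images = ["tacos/0_example.jpg\n", "apple_pie/1_example.jpg\n"]
labels = ["1\n", "0\n"]
ingredients = ["apple,sugar,flour,butter\n", "tortilla,beef,cilantro,cheese\n"]

records = link_annotations(images, labels, ingredients)
print(records[0])
```

A list of records like this can be passed straight to `pandas.DataFrame(records)` to obtain a single table of images and labels.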

## The Technical Goal


One goal we set out to accomplish was to determine how different datasets affected the results of our model. For this experiment, we worked with the two datasets mentioned above. While Recipes5k had folders dedicated to foods such as apple pie, burgers, and tacos, Recipe1M+ had no organization and was just a collection of recipes and data. However, this enabled it to contain a much greater variety of food than Recipes5k. We hypothesized that due to its categorization of food (and therefore its limited number of food types), Recipes5k would produce an inferior model.


## Technical Approach

To determine whether our hypothesis was right or wrong, we created two separate models. We trained one model on the Recipes5k dataset and the other on the Recipe1M+ dataset. Analyzing the results from each model was a bit difficult. Both models had a high `accuracy_multi` metric (above 0.984). However, looking at the models' actual predictions, we found this metric to be misleading: more errors were occurring (at least by our standards) than `accuracy_multi` suggested. So, we manually reviewed our models and analyzed their predictions. We found that the models had very similar results with no clear-cut winner.

\**Note: `accuracy_multi` is a metric used in the *fastai* library to compute the accuracy of multi-label classification models during training. We will refer to this metric frequently throughout this paper.*


## Exploring the Data

When exploring the Recipe1M+ dataset, we noticed a trend that significantly raised the difficulty of our multi-label classification goal. The list of ingredients had not been cleaned, leaving items like “1 cup Philadelphia Herb & Garlic Cream Cheese Product” or “1 (6 oz.) ready-to-use graham cracker crumb crust” in our labels. Additionally, the images were not merely labelled “Mac & Cheese” or “Chili,” but rather “World’s Best Mac & Cheese” and “Around the Kitchen Chili” (challenges that were not present in the Recipes5k dataset). As a result, the Recipe1M+ data required more cleaning before we could use it to train a model.

## The Modeling Setup

For our model setup, we decided to use the higher-level APIs from the *fastai* library. We used *fastai*’s `DataBlock` and `DataLoaders` classes to load our data into a format compatible with *fastai*'s functions. We set the `DataBlock` up as an `ImageBlock`-`MultiCategoryBlock`, which specified that our problem dealt with placing multiple labels on images. We set up the `DataBlock`’s `splitter`, `get_x`, and `get_y` functions\* to work with the *pandas* data frame that we had created.

We used the `cnn_learner` function to create the model from a provided architecture, a pre-trained 18-layer ResNet, along with a threshold of 0.2. We found that on the Recipes5k dataset, the model did not benefit much from additional layers. On the Recipe1M+ dataset, it was infeasible to add more layers due to the significant amount of time it took to train even an 18-layer ResNet.
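The setup described above might look roughly like the sketch below. The column names (`fname`, `labels`, `is_valid`), the `;` delimiter, the `images` directory, and the `Resize(224)` transform are all assumptions made for illustration; they are not taken from our actual notebook. The *fastai* import is deferred so the helper functions can be read and tried on their own.

```python
from functools import partial
from pathlib import Path


def get_x(row, image_root="images"):
    """Path to the image file for one data-frame row."""
    return Path(image_root) / row["fname"]


def get_y(row):
    """List of ingredient labels for one row (assumed ';'-delimited)."""
    return row["labels"].split(";")


def splitter(df):
    """Return (train_indices, valid_indices) from a boolean 'is_valid' column."""
    train = df.index[~df["is_valid"]].tolist()
    valid = df.index[df["is_valid"]].tolist()
    return train, valid


def build_learner(df, image_root="images", thresh=0.2):
    # Imported lazily so the helpers above work without fastai installed.
    from fastai.vision.all import (
        DataBlock, ImageBlock, MultiCategoryBlock, Resize,
        accuracy_multi, cnn_learner, resnet18,
    )

    dblock = DataBlock(
        blocks=(ImageBlock, MultiCategoryBlock),
        splitter=splitter,
        get_x=partial(get_x, image_root=image_root),
        get_y=get_y,
        item_tfms=Resize(224),  # assumed image size
    )
    dls = dblock.dataloaders(df)
    # The threshold is applied by the metric, not by the network itself.
    return cnn_learner(dls, resnet18, metrics=partial(accuracy_multi, thresh=thresh))
```

Note that more recent *fastai* releases name this function `vision_learner`; `cnn_learner` is kept here to match the text.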

We set up the categories to be one-hot-encoded, so we utilized binary cross entropy with logits as our loss function (although we didn’t need to specify it, as *fastai* chose it automatically). We utilized `accuracy_multi` as our metric, although (as will be explained in the "Analysis of Errors" section) this did not suit the context of our data well.
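To make the loss concrete, here is a small plain-Python sketch of binary cross entropy computed from logits, averaged over the labels of one image. It illustrates the formula only and is not *fastai*'s implementation; the four-word vocabulary and the logit values are made up.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of -[y*log(s(z)) + (1-y)*log(1-s(z))],
        # where s is the sigmoid function.
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


vocab = ["tortilla", "cheese", "apple", "salt"]
target = [1, 1, 0, 1]              # 0/1 encoding of a taco's labels
confident = [4.0, 3.0, -4.0, 2.0]  # logits that agree with the target
wrong = [-4.0, -3.0, 4.0, -2.0]    # logits that disagree everywhere

print(bce_with_logits(confident, target))
print(bce_with_logits(wrong, target))
```

Each label contributes its own independent yes/no term, which is why this loss fits a problem where several ingredients can be present at once.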

\**Functions to get the training-validation set split, the images, and their corresponding labels, respectively.*

## Validation Approach

Since both datasets had already nicely partitioned each recipe/image into a training and validation set, we decided to use a training-validation split to train the models. However, we were only able to download part of the training set from the Recipe1M+ dataset, so we needed to split that sample further ourselves.
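One way to split a downloaded sample further is sketched below. It assumes each image carries a recipe identifier and that whole recipes should land on one side of the split, so that images of the same dish do not appear in both sets. The 20% validation fraction, the seed, and the function name `split_by_recipe` are illustrative choices, not the values we used.

```python
import random


def split_by_recipe(recipe_ids, valid_fraction=0.2, seed=42):
    """Assign whole recipes to train or validation so their images stay together."""
    unique = sorted(set(recipe_ids))
    random.Random(seed).shuffle(unique)
    n_valid = max(1, int(len(unique) * valid_fraction))
    valid_ids = set(unique[:n_valid])
    # One boolean per image: True means the image goes to the validation set.
    return [rid in valid_ids for rid in recipe_ids]


# One entry per image; several images can share a recipe id (toy ids).
image_recipe_ids = ["r1", "r1", "r2", "r3", "r3", "r4", "r5"]
is_valid = split_by_recipe(image_recipe_ids)
print(is_valid)
```

The resulting list can be stored as an `is_valid` column in the data frame and read by a `splitter` function.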

## Baseline Results

We built our simplest model off of the Recipes5k dataset due to its simplicity (compared to Recipe1M+). The initial model returned a tensor of probabilities, which we were able to use to generate the human-readable predictions. According to the output during our model's training, this model was able to make predictions with 97% accuracy, much to our surprise. When testing the model with out-of-sample images, however, we received drastically different results than expected. The baseline model tended to output extremely common ingredients, such as oil, salt, and pepper, but missed certain ingredients that were highly apparent in the input image. Additionally, the model would predict some ingredients that were not present in the image. For example, the model would fail to output “lettuce” when given a picture of a salad and printed “chicken” when fed a picture of apple pie.
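Turning the tensor of probabilities into readable predictions amounts to pairing each probability with its vocabulary entry and keeping those above a threshold. A minimal sketch, with a made-up five-word vocabulary and made-up probabilities:

```python
def decode_predictions(probs, vocab, thresh=0.2):
    """Return (ingredient, probability) pairs above thresh, most likely first."""
    picked = [(name, p) for name, p in zip(vocab, probs) if p > thresh]
    return sorted(picked, key=lambda pair: pair[1], reverse=True)


vocab = ["chicken", "lettuce", "oil", "pepper", "salt"]
probs = [0.05, 0.15, 0.40, 0.22, 0.61]  # made-up values for illustration

print(decode_predictions(probs, vocab))
```

In this toy case only the pantry staples clear the threshold while "lettuce" does not, which mirrors the behavior we saw from the baseline model.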


## Attempts at Improved Results

We made multiple attempts at improving our results. First, we modified the number of epochs we were running. Our baseline models ran for one epoch in order to reduce training time, and running more than one epoch significantly increased our `accuracy_multi` metric. Second, we tried to find the optimal threshold. Too high a threshold would only include common ingredients such as salt, pepper, and oil. Too low a threshold would include many unrelated ingredients. We tested thresholds ranging from 0.1 to 0.4, and eventually settled on a threshold of 0.2. Third, we trained the Recipe1M+ model on fewer images. We began with 30,000 images and decreased the number to 2,500 to speed up training and train over more epochs.
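The threshold trade-off can be illustrated with precision and recall on a single image. These two measures are brought in here for illustration; they are not metrics we tracked during training, and the probabilities below are invented.

```python
def precision_recall(probs, truth, thresh):
    """Precision and recall of the labels whose probability exceeds thresh."""
    predicted = {i for i, p in enumerate(probs) if p > thresh}
    actual = {i for i, t in enumerate(truth) if t}
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


# Made-up probabilities for one image over a six-ingredient vocabulary.
probs = [0.65, 0.35, 0.45, 0.25, 0.15, 0.05]
truth = [1, 0, 1, 1, 0, 0]

for thresh in (0.1, 0.2, 0.3, 0.4):
    p, r = precision_recall(probs, truth, thresh)
    print(f"thresh={thresh:.1f} precision={p:.2f} recall={r:.2f}")
```

Lower thresholds admit more unrelated ingredients (lower precision), while higher thresholds start dropping true ingredients (lower recall).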

Many of our attempts at improvement worked, but some didn’t. In order to narrow our label space, we tried dropping the least common ingredients found in the data. However, we found our accuracy actually decreased following this. We also tried dropping our most common labels (e.g. salt, pepper, oil) to see if that would push the model to make more detailed predictions. We found that it had little to no effect. 
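Both label-space experiments come down to counting how often each ingredient appears and filtering on that count. A rough sketch, where the helper name `filter_labels`, the `min_count` cutoff, and the toy recipes are hypothetical:

```python
from collections import Counter


def filter_labels(label_lists, min_count=1, drop=()):
    """Keep labels seen at least min_count times and not listed in drop."""
    counts = Counter(label for labels in label_lists for label in labels)
    keep = {l for l, c in counts.items() if c >= min_count and l not in drop}
    return [[l for l in labels if l in keep] for labels in label_lists]


recipes = [
    ["salt", "oil", "tortilla", "cheese"],
    ["salt", "oil", "saffron", "rice"],
    ["salt", "cheese", "rice"],
]

print(filter_labels(recipes, min_count=2))           # drops the rarest labels
print(filter_labels(recipes, drop={"salt", "oil"}))  # drops the staples
```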


## Analysis of Errors

When analyzing our model, it is easy to see where it went wrong, but not why it went wrong. One error that we saw is the over-prediction of the most common ingredients. For example, when predicting the ingredients in a piece of carrot cake, our model predicted that garlic was an ingredient. We believe this occurred because of how frequently garlic appears within the data: it is used in a significant number of different recipes. As such, we hypothesized (although couldn’t definitively conclude) that our model was “playing it safe” by predicting common ingredients like garlic.

On the opposite end, we also saw the under-prediction of ingredients. When predicting ingredients for an image of paella, for example, we found that the output was missing "saffron". We hypothesized that the high price of saffron led to it appearing in few recipes, which could lead to under-prediction. The number of errors that we discovered manually in our model undermined the quantitative scores that the model gave us. The model reported accuracy upwards of 0.98, yet we believe that there is more actual error than that number suggests. This could be due to our lack of knowledge of alternatives to the `accuracy_multi` metric that we used. `accuracy_multi` calculates the mean, over all ingredients in the vocabulary, of whether each was correctly included in or excluded from the predicted labels. Since our vocabulary was so large, accidentally including (or failing to include) an ingredient would not lower the average by much, and thus would lead to a very small penalty for errors.
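A small worked example shows how weak that penalty is. The sketch below uses a plain-Python stand-in for the metric (not *fastai*'s code), an assumed vocabulary of 1,000 ingredients, and a recipe with 10 true ingredients. For contrast it also computes an F1 score, which is one possible alternative and not something we used in the project.

```python
def multi_label_accuracy(pred, truth):
    """Fraction of vocabulary entries whose include/exclude decision is right."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def f1(pred, truth):
    """Harmonic mean of precision and recall over the positive labels."""
    tp = sum(p and t for p, t in zip(pred, truth))
    if tp == 0:
        return 0.0
    precision = tp / sum(pred)
    recall = tp / sum(truth)
    return 2 * precision * recall / (precision + recall)


vocab_size = 1000                            # assumed size, for illustration only
truth = [1] * 10 + [0] * (vocab_size - 10)   # a recipe with 10 ingredients
predict_nothing = [0] * vocab_size
# 5 correct ingredients, 5 missed, 5 wrongly added, everything else excluded.
half_right = [1] * 5 + [0] * 5 + [1] * 5 + [0] * (vocab_size - 15)

for name, pred in [("predict nothing", predict_nothing), ("half right", half_right)]:
    print(name, multi_label_accuracy(pred, truth), f1(pred, truth))
```

Both predictions disagree with the truth on only 10 of 1,000 entries, so both score 0.99 on the accuracy-style metric, even though one of them lists no ingredients at all.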


## Analysis of the Effects of Alternate Choices

As covered in the “Attempts at Improved Results” section, we experimented with a variety of models. Models trained for more epochs tended to have a higher accuracy metric than those trained for fewer epochs. Unexpectedly, we discovered that the model architecture, whether it be an 18-layer, 34-layer, or 50-layer ResNet, did not make much difference when training a model on the Recipes5k dataset.

Our testing of different thresholds also didn’t seem to make much of a difference in the `accuracy_multi` metric. As explained above, this could have been caused by the sizable label space, which keeps the penalty for any single error small. We needed to keep this in mind throughout the entire training process: while it was exciting to see accuracy measures over 97%, our manual testing of each model revealed the underlying problem.

One of the bigger alternative choices we experimented with was which ingredients list to use to filter the image labels. Recipes5k conveniently came with one such common ingredients list, so we were able to use it quickly to train our baseline model. However, this severely limited the kinds of ingredients present in our data: less common ingredients like “saffron” or “gochujang” would be completely ignored. As such, we attempted to filter the ingredients in our sample of Recipe1M+ by using a blacklist rather than an allowed ingredients list. Models trained with this blacklist tended to perform more poorly, as the raw text in Recipe1M+ made it nearly impossible to extract simple ingredients. We would end up with predictions like “neighborhood” and “inches”, since our blacklist couldn’t cover every non-ingredient word in the English vocabulary. As a result, we decided to stick to the common ingredients list, as our experiments showed it to be more consistent and reliable when making predictions.
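The difference between the two filtering strategies can be sketched on one of the raw labels quoted earlier. Both word lists below are tiny hypothetical stand-ins, not the lists we actually used.

```python
import re

# Hypothetical allow-list of known ingredient phrases.
ALLOWED = {"cream cheese", "garlic", "graham cracker", "milk"}
# Hypothetical blacklist of non-ingredient words.
BLOCKED = {"cup", "oz", "ready", "to", "use", "product", "crumb", "crust"}


def tokens(text):
    """Lowercase alphabetic words; digits and punctuation are discarded."""
    return re.findall(r"[a-z]+", text.lower())


def by_allow_list(text, allowed=ALLOWED):
    """Keep only known ingredient phrases that occur in the raw string."""
    cleaned = " ".join(tokens(text))
    return sorted(a for a in allowed if re.search(rf"\b{re.escape(a)}\b", cleaned))


def by_blacklist(text, blocked=BLOCKED):
    """Keep every word that is not explicitly blocked."""
    return [t for t in tokens(text) if t not in blocked]


raw = "1 cup Philadelphia Herb & Garlic Cream Cheese Product"
print(by_allow_list(raw))
print(by_blacklist(raw))
```

The allow-list can only ever return phrases it already knows, while the blacklist lets through any word nobody thought to block, such as the brand name here.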


## Summary of Our Findings

Before starting this project, we had no idea what we were going to discover and how much we were going to discover. One of the main takeaways was the difficulties of working with data. The Recipe1M+ dataset particularly gave us difficulty. First, due to its immense size, the data was very difficult to download to our personal machines. Also, this dataset was very difficult to navigate. In addition, this dataset did not have a clean list of ingredients to convert to labels and put into our model - the ingredients lists that it provided had many variations, such as “Kirkland milk”, “2% milk”, etc. 

We found that we could possibly address this by condensing our data into categories. For example, milk would just be classified as dairy. However, we ended up not doing this because we felt that this would change the scope of our project. We wanted to predict ingredients/components rather than simple food categories. 

Categorization could have helped with another problem that we ran into. We found that in order to validate our model correctly, every label in the validation set also had to appear in the training set.
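A check for this condition can be written with a few set operations. The sketch below is one possible approach with hypothetical helper names and toy label lists; it reports validation labels that never occur in training and optionally strips them.

```python
def unseen_labels(train_labels, valid_labels):
    """Labels that occur in the validation set but never in the training set."""
    seen = {label for labels in train_labels for label in labels}
    return {label for labels in valid_labels for label in labels} - seen


def restrict_to_seen(train_labels, valid_labels):
    """Drop validation labels the model could never have learned."""
    seen = {label for labels in train_labels for label in labels}
    return [[l for l in labels if l in seen] for labels in valid_labels]


train = [["salt", "rice"], ["cheese", "tortilla"]]
valid = [["salt", "saffron", "rice"], ["cheese"]]

print(unseen_labels(train, valid))
print(restrict_to_seen(train, valid))
```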

Once we had cleaned up our data and had a working model, we discovered more about the limitations of our model (and possibly even our data). We tried adjusting the hyperparameters (like threshold) and initially got a large improvement. However, we were unable to fine-tune the model as much as we would have liked. This suggests two options for future work. One, it might be preferable to build a lower-level model in order to have more control over our model’s training, and thus fine-tune our work. Two, we may need to adjust the data we are working with.

Finally, we came to a conclusion on the hypothesis that was mentioned earlier. Our hypothesis was that, due to its categorization, Recipes5k would produce an inferior model. Recipes5k only had 101 different dishes and we were worried about how this would affect our results. However, we actually discovered that the model trained on Recipes5k produced very similar results and had similar accuracies to the model trained on Recipe1M+.

Analyzing our results, we also found some of the biases our models potentially had. They could easily overpredict some ingredients and underpredict others. With such a small subset of foods, the model trained on Recipes5k would likely not recognize dishes from non-Western cultures, and could potentially mislabel the food with Western ingredients.


## Limitations and Future Directions

One of the most significant limitations we encountered on this project was time and computing resources. With a large dataset like Recipe1M+, we spent a large portion of the project creating subsets of the data to use to train our model. The size of this dataset also required massive computational resources; Google Colab was barely sufficient to train one epoch using 0.2% of the thirteen million images available. Our limited access to more powerful GPUs greatly hindered our ability to train multiple variations of a model quickly and with ease.

As briefly mentioned in the “Exploring the Data” section, the lists of ingredients in Recipe1M+ were packed with difficulties, some of which we could not solve with our current knowledge. One challenge was figuring out how to clean labels like “1 cup Philadelphia Herb & Garlic Cream Cheese Product.” In order to extract “cream cheese,” we would need a blacklist or a common ingredients list to reference in order to eliminate the extra words. However, “garlic” was considered a standalone ingredient as well, so we could not see a way to filter it out of the label. Additionally, we struggled to merge variants of the same word (e.g. “potato” and “potatoes”), which added even more complexity to our model’s vocabulary.
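As an illustration of why merging word variants is harder than it looks, here is a deliberately naive singularization heuristic. It is a sketch of one possible approach, not something we implemented, and the rules are rough assumptions about English plurals.

```python
def singularize(word):
    """Very rough English singularization; a heuristic, not a real stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"   # berries -> berry
    if word.endswith("oes"):
        return word[:-2]         # potatoes -> potato
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]         # carrots -> carrot
    return word


labels = ["potato", "potatoes", "tomatoes", "berries", "carrots", "swiss"]
print(sorted({singularize(w) for w in labels}))
```

Rules like these merge “potato” and “potatoes”, but they also mangle words such as “hummus” or “cookies”, so a curated vocabulary or a proper lemmatizer would likely still be needed.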

Overall, further research is needed into methods for cleaning ingredient labels such as those found in Recipe1M+. Even an exhaustive list of ingredients might suffice. Recipes5k did provide a simple but limited list of common ingredients, which we ultimately ended up using to filter the ingredients in Recipe1M+. However, this list disregards variation in ingredients across cultures, as it was built using Western foods as a reference. In order to limit this bias, we need accurate and standardized methods for determining what counts as an ingredient and how to separate ingredient phrases within a text.


## Appendix

| Element                                                                                                                                                                                                                                                                                          | Included? | Notes |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------|-------|
| A succinct but descriptive **title**                                                                                                                                                                                                                                                             |X           |       |
| A **real-world question or goal** and *why* it's interesting.                                                                                                                                                                                                                                    |X           |       |
| A description of the **dataset**: what sort of data does it contain? Where did it come from? Why did you choose it? What are its strengths and limitations?                                                                                                                                      |X           |       |
| A specific technical goal or question                                                                                                                                                                                                                                                            |X           |       |
| Your technical **approach** for achieving that goal or answering that question                                                                                                                                                                                                                   |X           |       |
| What you noticed from **exploring the data** (e.g., counts by category, distributions of continuous variables, things you notice from inspecting individual samples at random)                                                                                                                   |X           |       |
| Your **modeling setup**: what are your features? Targets? Metrics? Loss function?                                                                                                                                                                                                                |X           |       |
| Your **validation approach**: train-val-test split? cross-validation?                                                                                                                                                                                                                            |X           |       |
| Your **baseline results**: applying the simplest model you can think of; how good were the results (quantitatively and perhaps qualitatively)?                                                                                                                                                   |X           |       |
| Your **attempts at improved results**: what did you adjust, and why? How did the results change?                                                                                                                                                                                                 |X           |       |
| An **analysis of errors** (quantitatively and perhaps qualitatively)                                                                                                                                                                                                                             |X           |       |
| **An analysis of the effects of alternative choices.** You can consider differences in model architecture, specific task, hyperparameter choices, inclusion/exclusion criteria, etc. Remember to think about the choice of **metrics** and the **uncertainty** involved in any estimate of them. |X           |       |
| A **summary of your findings**. Did you achieve your goal or answer your question?                                                                                                                                                                                                               |X           |       |
| **Limitations and future directions**                                                                                                                                                                                                                                                            |X           |       |