# Introduction

### Research Question
Can we train a neural network to balance nutrition and flavour in order to classify and suggest food ingredients for existing recipes?

### Problem Context
The growing interest in personalized nutrition and recipe generation has led to an increasing demand for systems that can automatically suggest ingredients based on a combination of factors like nutritional content and flavour compatibility. However, balancing these two factors when suggesting ingredients remains a challenging task, especially when working with unstructured data from recipes and ingredient databases. This research explores the possibility of using deep learning to automate this process.

### Objective
In this project, we aim to develop a model that can suggest the best ingredients for existing recipes, considering both flavour and nutritional requirements. We hypothesize that by combining data from a flavour dataset (detailing the overlap of molecular components between ingredients) and a nutritional dataset (providing macronutrient distributions), we can train a neural network to make accurate ingredient suggestions.

### Data Sources
We used three external datasets to build our model:
1. **FlavourDB2**: A molecular flavour database that contains information on over 900 food ingredients. Pairwise flavour scores between ingredients are quantified by the amount of overlapping molecules, suggesting how well they would pair in recipes (source: [FlavourDB2](https://cosylab.iiitd.edu.in/flavordb2/)).
2. **NEVO Database**: Managed by the National Institute for Public Health and the Environment (RIVM), this dataset provides detailed nutritional information on foods, including energy, macronutrients, vitamins, and minerals (source: [NEVO](https://www.rivm.nl/en)).
3. **Recipe Dataset**: A dataset of approximately 125,000 online recipes scraped from various food websites. Each recipe includes a title, list of ingredients, measurements, and preparation instructions (source: [Eight Portions](https://eightportions.com/datasets/Recipes/#fn:1)).

Additionally, we used the pre-trained tokenizer from [FoodBaseBERT-NER](https://huggingface.co/Dizex/FoodBaseBERT-NER), a BERT model fine-tuned on food items, to process ingredient lists. The tokenizer converts ingredient names into token sequences that the model's embedding layer takes as input.

## Methods

This section details the data preprocessing, the model architecture, the loss function, and the key performance metrics used in this study.

### Data Source and Preprocessing
For this project, we combined three key datasets:
1. **FlavourDB2**: Contains molecular data for 900+ ingredients, with pairwise flavour scores based on molecular overlap between ingredients. This allows us to quantify how well two ingredients would pair based on their molecular composition.
2. **NEVO**: Provides nutritional data for various food items, including macronutrient breakdown (carbs, fats, proteins), vitamins, and minerals, which we used to ensure the nutritional balance in our ingredient recommendations.
3. **Recipe Dataset**: A collection of 125,000 recipes from various websites. From this, we extracted ingredient lists and used them to train our model to predict the most suitable ingredients based on flavour and nutrition.
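
The pairwise flavour score described in item 1 can be sketched as a set intersection. This is a minimal illustration, not the project's actual preprocessing code: the function names and the molecule sets are hypothetical placeholders, and the Jaccard normalization is an optional variant that the text above does not mention.

```python
from itertools import combinations


def flavour_overlap(molecules_a: set, molecules_b: set) -> int:
    """Count the flavour molecules shared by two ingredients."""
    return len(molecules_a & molecules_b)


def jaccard_overlap(molecules_a: set, molecules_b: set) -> float:
    """Optional normalized variant (assumption): shared / total distinct molecules."""
    union = molecules_a | molecules_b
    return len(molecules_a & molecules_b) / len(union) if union else 0.0


# Made-up molecule sets for illustration only; these are not FlavourDB2 entries.
profiles = {
    "ingredient_a": {"m1", "m2", "m3", "m4"},
    "ingredient_b": {"m3", "m4", "m5"},
    "ingredient_c": {"m6"},
}

# Pairwise score table keyed by (ingredient, ingredient).
pairwise_scores = {
    (a, b): flavour_overlap(profiles[a], profiles[b])
    for a, b in combinations(sorted(profiles), 2)
}
```

A raw count favours ingredients with many recorded molecules, which is why a normalized variant may be worth comparing against.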

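We cleaned the datasets by aligning ingredient names so that each ingredient refers to the same food in both the flavour and nutrition datasets. Fuzzy matching was used to link names that are spelled differently across sources. This step is error-prone, because names that look similar can refer to different foods (e.g., 'butter' vs. 'peanut butter').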
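
The butter example can be reproduced with Python's standard `difflib` module. This is a sketch under assumptions: the report does not say which fuzzy-matching library or threshold was used, and `match_ingredient` and the candidate list are hypothetical.

```python
import difflib


def match_ingredient(name, candidates, cutoff=0.6):
    """Return the closest candidate name, or None if nothing reaches the cutoff."""
    matches = difflib.get_close_matches(name.lower(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical names from a nutrition table that lacks a plain 'butter' entry.
nutrition_names = ["peanut butter", "olive oil"]

# Similarity of 'butter' and 'peanut butter' is 2 * 6 / (6 + 13), about 0.63,
# which clears difflib's default cutoff of 0.6.
loose = match_ingredient("butter", nutrition_names)

# A stricter cutoff rejects the pair instead of linking two different foods.
strict = match_ingredient("butter", nutrition_names, cutoff=0.8)
```

Here `loose` is `'peanut butter'` and `strict` is `None`. A stricter cutoff trades false matches for unmatched ingredients, so unmatched names still need manual review or a curated synonym list.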

The recipe dataset was preprocessed by identifying and extracting the ingredient list from each recipe. We then computed a target score, `total_loss`, from the overlap of flavour molecules between ingredients and from how closely the macronutrients match a target balance of 50% carbohydrates, 30% fat, and 20% protein.
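
One way to turn the 50/30/20 target into a number is to measure how far a recipe's macronutrient shares deviate from it. The sketch below is an assumption on our part: the report does not state whether shares are computed by mass or by energy, nor how the nutrition term is combined with the flavour term. `nutrition_penalty` is a hypothetical name, and shares are computed by mass in grams.

```python
TARGET_SHARES = {"carbs": 0.5, "fat": 0.3, "protein": 0.2}


def macro_shares(carbs_g, fat_g, protein_g):
    """Fraction of total macronutrient mass contributed by each macronutrient."""
    total = carbs_g + fat_g + protein_g
    if total == 0:
        return {name: 0.0 for name in TARGET_SHARES}
    return {
        "carbs": carbs_g / total,
        "fat": fat_g / total,
        "protein": protein_g / total,
    }


def nutrition_penalty(carbs_g, fat_g, protein_g):
    """Sum of absolute deviations from the target shares (0.0 means on target)."""
    shares = macro_shares(carbs_g, fat_g, protein_g)
    return sum(abs(shares[name] - TARGET_SHARES[name]) for name in TARGET_SHARES)


on_target = nutrition_penalty(50, 30, 20)   # matches the 50/30/20 target
carbs_only = nutrition_penalty(100, 0, 0)   # all carbohydrates
```

With this definition a recipe at the target scores 0, and larger values indicate a worse balance.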

### Model Architecture
We designed a custom deep learning model for the task, focusing on processing ingredient lists and computing ingredient scores based on flavour and nutrition. The architecture includes a mix of embedding layers, convolutional layers for feature extraction, and fully connected layers for classification.

#### Model Description
1. **Embedding Layer**: Converts the ingredient text into dense representations using embeddings.
2. **Convolutional Layers**: Two 1D convolutional layers process the tokenized ingredient list, learning patterns related to ingredient composition.
3. **Pooling Layer**: MaxPooling reduces the dimensionality of the feature maps.
4. **Fully Connected Layers**: Perform the final classification, with dropout applied to reduce overfitting and batch normalization applied to stabilize training.
5. **Output Layers**: Two outputs are generated: one for the ingredient prediction (`ingredient_output`) and one for the ingredient score (`score_output`). The `ingredient_output` uses a softmax activation to produce a probability for each candidate ingredient, and the `score_output` uses a linear activation to regress the `total_loss` score.

This approach balances the predictions between nutritional and flavour aspects, providing both ingredient labels and a score indicating how well the ingredient matches the recipe's requirements.

#### Rationale
The choice of this architecture was guided by the need to handle textual data (ingredients) and to output both a categorical class (the ingredient) and a continuous score (`total_loss`). The convolutional layers capture local patterns in the token sequences, while the fully connected layers map the pooled features to the two outputs.

### Suggestions for Improvement and Alternative Architectures
While the current model has its merits, there are opportunities for improvement:
1. **Enhanced Data Preprocessing**: Ingredient normalization and semantic similarity handling (e.g., using Word2Vec or GloVe embeddings) could help improve data consistency.
2. **Hybrid Approach**: Combining deep learning with rule-based systems or optimization algorithms could improve ingredient suggestion by better balancing flavour and nutrition.
3. **Reinforcement Learning (RL)**: Introducing RL could allow the model to dynamically refine ingredient suggestions based on feedback, optimizing for both flavour and nutritional balance.
4. **Loss Function Adjustments**: Exploring a **multi-task loss function** that adjusts the balance between ingredient classification and score prediction could improve training outcomes.
5. **Alternative Models**: Alternatives to the CNN, such as simpler **Feedforward Neural Networks (FNNs)**, **Neural Collaborative Filtering (NCF)**, or **Graph Neural Networks (GNNs)**, could be more effective in capturing ingredient relationships.
6. **Non-Deep Learning Solutions**: Approaches like **Integer Linear Programming (ILP)** or **Matrix Factorization** for ingredient selection could also be explored for optimization-based ingredient suggestion.

### Loss Function and Key Performance Metrics
We used **categorical cross-entropy** for the ingredient classification task, as it is suitable for multi-class classification problems. For the **score prediction**, we used **mean squared error (MSE)**, as we are predicting a continuous score based on the ingredient list.

The model was trained to minimize a combined loss: a weighted sum of the ingredient classification loss (weight 0.7) and the score prediction loss (weight 0.3). This weighting places more emphasis on ingredient classification than on score prediction.
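
For a single training example, the combined objective can be written out as follows. This is a plain-Python sketch of the arithmetic rather than the training code itself: the function names are hypothetical, and only the 0.7 and 0.3 weights come from the description above.

```python
import math


def categorical_cross_entropy(probs, true_index, eps=1e-12):
    """Negative log-probability assigned to the true ingredient class."""
    return -math.log(max(probs[true_index], eps))


def squared_error(score_pred, score_true):
    """Per-example squared error; averaging it over a batch gives the MSE."""
    return (score_pred - score_true) ** 2


def combined_loss(probs, true_index, score_pred, score_true,
                  w_classification=0.7, w_score=0.3):
    """Weighted sum of the classification loss and the score regression loss."""
    return (w_classification * categorical_cross_entropy(probs, true_index)
            + w_score * squared_error(score_pred, score_true))


# Softmax output over three candidate ingredients; the true class is index 1.
example = combined_loss([0.25, 0.5, 0.25], true_index=1,
                        score_pred=0.4, score_true=0.5)
```

Because cross-entropy and squared error live on different scales, the weights alone do not determine how much each task contributes to the gradient; comparing the magnitude of the two terms during training is a useful check.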

The primary evaluation metrics were:
1. **Ingredient Accuracy**: Measures how often the predicted ingredient matches the target ingredient in the recipe.
2. **Validation Loss**: The combined loss evaluated on the validation data, indicating overall model performance on unseen recipes.
3. **Mean Absolute Error (MAE) for Score**: The average absolute difference between the predicted and actual ingredient score.
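
The accuracy and MAE metrics listed above reduce to a few lines each. The helper names and the sample values below are hypothetical and serve only to show the calculation.

```python
def ingredient_accuracy(predicted, target):
    """Fraction of examples where the predicted ingredient equals the target."""
    if len(predicted) != len(target) or not target:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == t for p, t in zip(predicted, target)) / len(target)


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual scores."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Illustrative values only; these are not results from the study.
accuracy = ingredient_accuracy(["salt", "basil", "egg", "milk"],
                               ["salt", "thyme", "egg", "milk"])
mae = mean_absolute_error([0.5, 0.25, 1.0], [0.5, 0.75, 0.5])
```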

### Model Fitting
We used the **Adam optimizer** with a learning rate scheduler to adapt the learning rate during training. **Early stopping** was employed to halt training when the validation loss stopped improving, helping to avoid overfitting.
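
The early-stopping rule can be sketched independently of any framework. The patience value and the validation losses below are hypothetical, since the report does not state the settings that were used.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience


# Made-up validation losses: improvement stalls after the third epoch.
val_losses = [0.90, 0.80, 0.75, 0.76, 0.77, 0.78]
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

In this run the loop stops at epoch 5, two epochs after the best loss of 0.75. Deep learning frameworks typically provide an equivalent callback, often with an option to restore the best weights.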

During training, we used dropout regularization to prevent overfitting and batch normalization to stabilize the training process.

The model was trained for up to 100 epochs with a batch size of 32. The data were split into 60% for training and 40% for validation.
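
A reproducible 60/40 split can be sketched as below. The seed and the function name are assumptions, and the report does not say whether the split was shuffled or stratified.

```python
import random


def train_validation_split(items, train_fraction=0.6, seed=42):
    """Shuffle a copy of `items` and cut it into training and validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder recipe IDs stand in for the real dataset.
train_ids, validation_ids = train_validation_split(range(10))
```

A 40% validation share is larger than the common 20% and leaves less data for training, so the split ratio is one of the settings worth revisiting.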

### Results
After training, the model showed the following results on the validation set:
- **Ingredient Accuracy**: 0.69
- **Validation Loss**: 0.7346
- **Validation MAE for Score**: 0.0394

### Analysis and Model Insight
With a validation ingredient accuracy of 0.69, the results suggest that the model struggles to fully balance flavour and nutrition for ingredient classification. Several factors contribute to this performance:

1. **Dataset Issues**: Our dataset suffered from errors in ingredient identification due to fuzzy search mismatches (e.g., 'butter' vs. 'peanut butter'), which led to incorrect overlaps for molecular and nutritional data.
2. **Task Complexity**: Balancing flavour and nutrition may not be inherently suited to deep learning. Both the molecular overlap and the macronutrient balance can be computed directly from the source datasets, so the task is closer to a computed lookup, which traditional algorithms (e.g., rule-based or optimization methods) handle better.
3. **Model Suitability**: The current model may not be optimal for the task. The complexity of ingredient classification with both nutritional and flavour constraints may require a more specialized approach, such as a hybrid model combining deep learning with optimization techniques.

### Conclusions
Based on the results, we conclude that training a neural network to balance nutrition and flavour for ingredient suggestion is difficult with the current data and architecture. Dataset quality and the inherent complexity of the task both contributed to this outcome.

### Opportunities for Improvement
To enhance the model's performance, we suggest the following:
1. Improve the ingredient matching process to reduce errors in the training data.
2. Consider combining deep learning with optimization algorithms to better handle the dual constraints of flavour and nutrition.
3. Explore reinforcement learning for dynamic ingredient suggestion and feedback-based adjustments.

### Key Takeaways
For practitioners, this study highlights the challenges of using deep learning for ingredient suggestion tasks that require balancing multiple factors. For researchers, it opens up opportunities for more refined models that combine deep learning with optimization or other techniques.

## Division of Labour
- **Group Member 1**: Worked on data preprocessing, model design, and training.
- **Group Member 2**: Handled evaluation, analysis, and results interpretation.