
# **Preprocessing**

## **Importing the original dataset**

The dataset is stored as a .csv file in our [GitHub repository](https://github.com/splasherzz/food-allergen-detector). We load it into the notebook as the DataFrame `og_food`.

In [1]:
import io
import pandas as pd
import numpy as np
import scipy.stats as stats
import seaborn as sns

# initializing the dataset
data = 'https://raw.githubusercontent.com/splasherzz/food-allergen-detector/main/%5BOriginal%5D%20Food%20Ingredients%20and%20Allergens.csv'
og_food = pd.read_csv(data)
og_food.head()

Unnamed: 0,Food Product,Main Ingredient,Sweetener,Fat/Oil,Seasoning,Allergens,Prediction
0,Almond Cookies,Almonds,Sugar,Butter,Flour,"Almonds, Wheat, Dairy",Contains
1,Almond Cookies,Almonds,Sugar,Butter,Flour,"Almonds, Wheat, Dairy",Contains
2,Chicken Noodle Soup,Chicken broth,,,Salt,"Chicken, Wheat, Celery",Contains
3,Chicken Noodle Soup,Chicken broth,,,Salt,"Chicken, Wheat, Celery",Contains
4,Cheddar Cheese,Cheese,,,Salt,Dairy,Contains


## **Dataset Features**

Below is a table of the dataset features and their respective descriptions.

<table>
  <tr>
    <th>Column Name</th>
    <th>Description</th>
  </tr>

  <tr>
    <td>Food Product</td>
    <td>Name of the food product</td>
  </tr>
  <tr>
    <td>Main Ingredient</td>
    <td>Defining or distinctive ingredient of the food product</td>
  </tr>
  <tr>
    <td>Sweetener</td>
    <td>Substance added to food or drink to impart the flavor of sweetness</td>
  </tr>
  <tr>
    <td>Fat/Oil</td>
    <td>Lipids made from plants, animals, or synthetic compounds used when frying, baking, and preparing foods</td>
  </tr>
  <tr>
    <td>Seasoning</td>
    <td>Salt, herbs, or spices added to food to enhance the flavour</td>
  </tr>
  <tr>
    <td>Allergens</td>
    <td>Any normally harmless substance that causes an immediate allergic reaction in a susceptible person</td>
  </tr>
  <tr>
    <td>Prediction</td>
    <td>Anticipated outcome of the model</td>
  </tr>
</table>

## **Type Formatting**

The table below summarizes the data types commonly encountered when using pandas.

<table>
  <tr>
    <th>Data type</th>
    <th>Description</th>
  </tr>

  <tr>
    <td>object</td>
    <td>Text or mixed numeric and non-numeric values</td>
  </tr>
  <tr>
    <td>int64</td>
    <td>Integer numbers</td>
  </tr>
  <tr>
    <td>float64</td>
    <td>Floating point numbers</td>
  </tr>
  <tr>
    <td>bool</td>
    <td>True/False values</td>
  </tr>
  <tr>
    <td>datetime64[ns]</td>
    <td>Date and time values</td>
  </tr>
  <tr>
    <td>timedelta64[ns]</td>
    <td>Differences between two datetimes</td>
  </tr>
  <tr>
    <td>category</td>
    <td>Finite list of text values</td>
  </tr>
</table>

Calling `dtypes` on the dataset shows that every column has the data type `object`, described above as "text or mixed numeric and non-numeric values." We therefore convert all columns to `category`, which fits the description "finite list of text values" and suits our dataset, since each column holds a limited set of labels describing the food product.

In [2]:
# converting every column of type object to the "category" data type
for item in og_food:
  if og_food[item].dtype == object:
    og_food[item] = og_food[item].astype('category')

og_food.dtypes

Food Product       category
Main Ingredient    category
Sweetener          category
Fat/Oil            category
Seasoning          category
Allergens          category
Prediction         category
dtype: object
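
As a minimal sketch of the same conversion, the example below uses a made-up frame (`toy`, not the real dataset) and picks out the `object` columns with `select_dtypes` instead of looping. The column values are illustrative assumptions only.

```python
import pandas as pd

# Hypothetical toy data standing in for og_food; dtype=object mirrors what read_csv produced above
toy = pd.DataFrame(
    {
        "Food Product": ["Almond Cookies", "Cheddar Cheese", "Almond Cookies"],
        "Prediction": ["Contains", "Contains", "Contains"],
    },
    dtype=object,
)

# Collect the names of all object-typed columns, then convert them in one call
object_cols = toy.select_dtypes(include="object").columns
toy = toy.astype({col: "category" for col in object_cols})

print(toy.dtypes)
# A category column stores each distinct value only once
print(list(toy["Food Product"].cat.categories))
```

The `cat.categories` accessor is a quick way to see how many distinct labels a column actually holds.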

## **Handling Duplicates & Null Values**

Before dropping duplicates, we first check the dataset for null values. There is exactly one, in the `Prediction` column.

In [3]:
# handling null values
print("Total number of missing values in whole dataset:", og_food.isna().sum().sum())
print("\n")
print("Breakdown of which columns have missing values:\n", og_food.isna().sum())
print("\n")
print("Entry with null values:\n", og_food[og_food.isna().any(axis=1)])

Total number of missing values in whole dataset: 1


Breakdown of which columns have missing values:
 Food Product       0
Main Ingredient    0
Sweetener          0
Fat/Oil            0
Seasoning          0
Allergens          0
Prediction         1
dtype: int64


Entry with null values:
     Food Product Main Ingredient Sweetener Fat/Oil     Seasoning  \
338   Baked Ziti           Pasta      None  Cheese  Tomato sauce   

        Allergens Prediction  
338  Wheat, Dairy        NaN  


Manually inspecting the row with the null value (entry 338) showed that it duplicates another entry whose `Prediction` is correctly filled in. We therefore drop this single row, then drop the remaining duplicates in the dataset, keeping only the first occurrence of each.

In [4]:
# dropping entry with null value
og_food.dropna(inplace=True)

# handling & dropping duplicates
og_food.drop_duplicates(keep='first', inplace=True)
og_food.head()

Unnamed: 0,Food Product,Main Ingredient,Sweetener,Fat/Oil,Seasoning,Allergens,Prediction
0,Almond Cookies,Almonds,Sugar,Butter,Flour,"Almonds, Wheat, Dairy",Contains
2,Chicken Noodle Soup,Chicken broth,,,Salt,"Chicken, Wheat, Celery",Contains
4,Cheddar Cheese,Cheese,,,Salt,Dairy,Contains
5,Ranch Dressing,Buttermilk,Sugar,Vegetable oil,"Garlic, herbs",Dairy,Contains
6,Caramel Popcorn,Popcorn,Sugar,Butter,Salt,Dairy,Contains
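
The manual check described above could also be done programmatically. The sketch below is an illustration on a made-up three-row frame (`toy`), not the real dataset: it flags rows whose values match another row on every column except `Prediction`.

```python
import pandas as pd

# Hypothetical rows mimicking the situation around entry 338
toy = pd.DataFrame({
    "Food Product": ["Baked Ziti", "Baked Ziti", "Cheddar Cheese"],
    "Allergens": ["Wheat, Dairy", "Wheat, Dairy", "Dairy"],
    "Prediction": ["Contains", None, "Contains"],
})

# Compare rows on every column except the possibly missing label
feature_cols = [c for c in toy.columns if c != "Prediction"]

# keep=False marks every member of a duplicate group, not only the later ones
has_twin = toy.duplicated(subset=feature_cols, keep=False)
null_rows = toy["Prediction"].isna()

# Null rows whose feature values also appear in another row are candidates for dropping
candidates = toy[null_rows & has_twin]
print(candidates)

# Same cleaning steps as in the cell above
cleaned = toy.dropna().drop_duplicates(keep="first")
print(cleaned.shape)
```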


After dropping the duplicates and the row with a null value, we check the shape of the dataset. As shown below, 308 of the initial 400 entries remain.

In [5]:
og_food.shape

(308, 7)

## **Augmenting the dataset**
We augment the dataset by downloading the cleaned file and manually adding 92 entries, which restores the row count to what it was before the deletions. To source the data, we searched Google for food products, their ingredients, and their allergen labels.
Before adding each product, we searched the dataset for it to make sure it was not a duplicate.

In [6]:
# saving it as a CSV file
# df = pd.DataFrame(og_food)
# df.to_csv("[Augmented] Food Ingredients and Allergens.csv", index=False) 
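
The duplicate search before each addition was done by hand. A minimal sketch of an automated version is shown below; `is_new_product` is a hypothetical helper, and ignoring case and surrounding whitespace is our assumption rather than something the manual process specified.

```python
import pandas as pd

def is_new_product(df: pd.DataFrame, name: str) -> bool:
    """Return True if `name` is not already in the 'Food Product' column.

    Hypothetical helper: the comparison ignores case and surrounding whitespace.
    """
    existing = df["Food Product"].astype(str).str.strip().str.lower()
    return name.strip().lower() not in set(existing)

# Made-up stand-in for the cleaned dataset
toy = pd.DataFrame({"Food Product": ["Almond Cookies", "Ranch Dressing"]})

print(is_new_product(toy, "almond cookies "))  # already present under a different spelling
print(is_new_product(toy, "Aloo Gobi"))        # not present yet
```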

## **Importing the augmented dataset**

Next, we import the augmented dataset as `aug_food` and verify that its shape is correct.

The dataset is uploaded as a .csv file in our [GitHub repository](https://github.com/splasherzz/food-allergen-detector).

In [7]:
data1 = 'https://raw.githubusercontent.com/splasherzz/food-allergen-detector/main/%5BAugmented%5D%20Food%20Ingredients%20and%20Allergens.csv'
aug_food = pd.read_csv(data1)

aug_food.head()

Unnamed: 0,Food Product,Main Ingredient,Sweetener,Fat/Oil,Seasoning,Allergens,Prediction
0,Almond Cookies,Almonds,Sugar,Butter,Flour,"Almonds, Wheat, Dairy",Contains
1,Chicken Noodle Soup,Chicken broth,,,Salt,"Chicken, Wheat, Celery",Contains
2,Cheddar Cheese,Cheese,,,Salt,Dairy,Contains
3,Ranch Dressing,Buttermilk,Sugar,Vegetable oil,"Garlic, herbs",Dairy,Contains
4,Caramel Popcorn,Popcorn,Sugar,Butter,Salt,Dairy,Contains
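
A shape check can be written as an explicit assertion. The sketch below assumes the augmented file keeps the 7 original columns and holds 308 + 92 = 400 rows; `verify_shape` is a hypothetical helper, and the placeholder frame only stands in for `aug_food`.

```python
import pandas as pd

def verify_shape(df: pd.DataFrame, expected: tuple) -> None:
    """Raise an AssertionError if df.shape differs from the expected (rows, columns)."""
    assert df.shape == expected, f"expected {expected}, got {df.shape}"

# Row counts taken from the text: 308 cleaned rows plus 92 manually added rows
cleaned_rows, added_rows, n_columns = 308, 92, 7
expected_shape = (cleaned_rows + added_rows, n_columns)

# Placeholder frame with the expected dimensions; in the notebook this would be aug_food
placeholder = pd.DataFrame([[None] * n_columns] * expected_shape[0])
verify_shape(placeholder, expected_shape)
print(expected_shape)
```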


## **Converting Categorical Features through Categorical Data Encoding**

To convert the qualitative values 'Does not contain' and 'Contains' into a numerical representation, we map them to 0 and 1, respectively, using `map`. This gives our classification model a binary numerical target to predict.

In [8]:
# mapping the prediction values to 0 or 1
aug_food['Prediction'] = aug_food['Prediction'].map({'Contains': 1, 'Does not contain': 0})

# showing that the prediction values are now set to 0/1
aug_food['Prediction']

0      1
1      1
2      1
3      1
4      1
      ..
395    1
396    1
397    1
398    1
399    1
Name: Prediction, Length: 400, dtype: int64
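
One property of `map` worth keeping in mind is that any value missing from the dictionary becomes `NaN`. The sketch below uses made-up labels, including a deliberately misspelled one, to show how such values could be detected after mapping.

```python
import pandas as pd

# Hypothetical labels; the last one is a deliberate typo (lowercase "c")
labels = pd.Series(["Contains", "Does not contain", "contains"])
mapping = {"Contains": 1, "Does not contain": 0}

encoded = labels.map(mapping)

# Values that are not keys of the mapping turn into NaN, so they are easy to find
unmapped = labels[encoded.isna()]
print(unmapped.tolist())
```

In the cell above the result has dtype `int64` with length 400, which indicates that every label in `aug_food` was matched.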

We then one-hot encode the augmented dataset with `get_dummies`. This transforms each categorical variable into binary (0/1) indicator columns that the classification model can work with, so it can learn relationships between the food attributes and the prediction label. Because we pass `drop_first=True`, the first category of each column gets no indicator of its own and is instead represented by all of that column's indicators being 0.

In [9]:
# Performing one-hot encoding on categorical columns
aug_food = pd.get_dummies(aug_food, drop_first=True, prefix="", prefix_sep="")

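# Re-inserting the 'Prediction' column. Note that len(aug_food) is the row count (400),
# so the column lands at index 400 rather than in the last position. Its position does
# not affect modeling, because 'Prediction' is selected by name later on.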
aug_food.insert(len(aug_food), 'Prediction', aug_food.pop('Prediction'))

# Showing the dataset
aug_food.head()

Unnamed: 0,Aloo Gobi,Aloo Paratha,Apple,Apple Cider,Apple Crisp,Apple Pie,Apple sauce,Apple tart,Arabic Fattoush,Arancini,...,"Wheat, Dairy, Alcohol","Wheat, Dairy, Cocoa","Wheat, Dairy, Eggs","Wheat, Dairy, Nuts","Wheat, Pork, Dairy","Wheat, calamari","Wheat, dairy","Wheat, eggs","Wheat, eggs, dairy","Wheat, fish"
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
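
To illustrate what `drop_first`, `prefix`, and `prefix_sep` do, here is a small sketch on a made-up two-column frame (`toy`). It also shows a side effect of empty prefixes: when the same value occurs in two source columns, the resulting indicator columns can share a name. Whether this happens in the real dataset is not checked here.

```python
import pandas as pd

# Hypothetical data in which "Sugar" appears in two different columns
toy = pd.DataFrame({
    "Sweetener": ["Sugar", "Honey", "Sugar"],
    "Seasoning": ["Salt", "Sugar", "Salt"],
})

# Default prefixes keep the source column in each indicator name
with_prefix = pd.get_dummies(toy, drop_first=True)
print(list(with_prefix.columns))

# Empty prefixes, as in the cell above, name each indicator after the value alone
no_prefix = pd.get_dummies(toy, drop_first=True, prefix="", prefix_sep="")
print(list(no_prefix.columns))
```

With `drop_first=True`, "Honey" and "Salt" (the first categories in sorted order) get no column of their own.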


## **Preprocessed Dataset Features**

The features in the preprocessed dataset are the one-hot indicator columns, each named after the value it encodes, from food products such as "Aloo Gobi" and "Apple Cider" to allergen combinations such as "Wheat, fish." They are stored as unsigned 8-bit integers (`uint8`), which is memory-efficient and more than sufficient for binary values.

In [10]:
aug_food.dtypes

Aloo Gobi             uint8
Aloo Paratha          uint8
Apple                 uint8
Apple Cider           uint8
Apple Crisp           uint8
                      ...  
Wheat, calamari       uint8
Wheat, dairy          uint8
Wheat, eggs           uint8
Wheat, eggs, dairy    uint8
Wheat, fish           uint8
Length: 854, dtype: object

In [11]:
aug_food.shape

(400, 854)
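
As a rough illustration of the memory argument, the sketch below builds a zero-filled NumPy array with the same dimensions as the 853 indicator columns (854 minus `Prediction`) and compares its size as `uint8` and as `int64`. The array is a stand-in, not the real data; note also that pandas 2.0 and later return `bool` indicator columns from `get_dummies` by default.

```python
import numpy as np

# Stand-in for the indicator matrix: 400 rows, 853 one-hot columns
flags = np.zeros((400, 853), dtype=np.uint8)

# uint8 uses 1 byte per value, int64 uses 8 bytes per value
bytes_uint8 = flags.nbytes
bytes_int64 = flags.astype(np.int64).nbytes
print(bytes_uint8, bytes_int64)

# Range of values that uint8 can represent
info = np.iinfo(np.uint8)
print(info.min, info.max)
```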

# **Data Modeling**

In [12]:
from sklearn.ensemble import RandomForestClassifier
# random forest with 10 trees; random_state fixes the seed of the forest itself
classifier = RandomForestClassifier(n_estimators=10, random_state=42)

In [13]:
from sklearn.model_selection import train_test_split

# separating the feature columns (X) from the target column (y)
X = aug_food.drop('Prediction', axis = 1)
y = aug_food['Prediction']

In [14]:
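# holding out 20% of the rows for testing; no random_state is set, so the split differs between runs
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=0.2)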
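
The split above is random, so the scores reported below may vary from run to run. As a sketch of one way to make a split repeatable, the example below uses synthetic data (not `aug_food`) and passes `random_state` and `stratify`; both arguments are our additions and were not used in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 400 samples, 5 binary features, 250 positive and 150 negative labels
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(400, 5))
y_toy = np.array([1] * 250 + [0] * 150)

# random_state makes the split repeatable; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(X_tr.shape, X_te.shape)
print(int(y_te.sum()), "positives in the test part")
```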

In [15]:
classifier.fit(X_train,y_train)

In [16]:
pred = classifier.predict(X_test)

In [17]:
classifier.score(X_test, y_test)

0.975

In [18]:
from sklearn.metrics import classification_report, confusion_matrix
print('Confusion Matrix: \n', confusion_matrix(y_test,pred))

Confusion Matrix: 
 [[28  2]
 [ 0 50]]


In [19]:
print('Classification Report: \n', classification_report(y_test,pred))

Classification Report: 
               precision    recall  f1-score   support

           0       1.00      0.93      0.97        30
           1       0.96      1.00      0.98        50

    accuracy                           0.97        80
   macro avg       0.98      0.97      0.97        80
weighted avg       0.98      0.97      0.97        80
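
The figures in the classification report can be reproduced by hand from the confusion matrix. The short calculation below uses the four counts printed above, reading rows as true labels and columns as predicted labels.

```python
# Counts from the confusion matrix above: rows are true labels, columns are predictions
tn, fp = 28, 2   # true class 0: 28 predicted as 0, 2 predicted as 1
fn, tp = 0, 50   # true class 1: 0 predicted as 0, 50 predicted as 1

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 78 / 80
precision_1 = tp / (tp + fp)                # 50 / 52
recall_1 = tp / (tp + fn)                   # 50 / 50
precision_0 = tn / (tn + fn)                # 28 / 28
recall_0 = tn / (tn + fp)                   # 28 / 30

print(round(accuracy, 3), round(precision_1, 2), round(recall_0, 2))
```

The two misclassified rows are both false positives: products without allergens that were predicted as containing them.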



In [20]:
from sklearn.metrics import accuracy_score
# Evaluate performance on the training set
train_predictions = classifier.predict(X_train)
train_accuracy = accuracy_score(y_train, train_predictions)
print("Training Accuracy:", train_accuracy)

# Evaluate performance on the held-out test set (labeled "Validation" in the printed output)
test_predictions = classifier.predict(X_test)
val_accuracy = accuracy_score(y_test, test_predictions)
print("Validation Accuracy:", val_accuracy)

Training Accuracy: 1.0
Validation Accuracy: 0.975
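
The accuracies above come from a single random split. One possible follow-up, sketched here on synthetic data rather than on `aug_food`, is k-fold cross-validation with `cross_val_score`, which reports one score per fold. The choice of five folds and the generated data are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a 400-row feature matrix with a binary target
X_toy, y_toy = make_classification(n_samples=400, n_features=20, random_state=0)

# Same classifier settings as in the notebook
model = RandomForestClassifier(n_estimators=10, random_state=42)

# Five folds: each row is used for testing exactly once
scores = cross_val_score(model, X_toy, y_toy, cv=5)
print(scores.mean(), scores.std())
```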
