<a href="https://colab.research.google.com/github/splasherzz/food-allergen-detector/blob/main/Copy%20of%20Project%20Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Preprocessing**

## **Importing the original dataset**

The dataset is uploaded as a .csv file in our [GitHub repository](https://github.com/splasherzz/food-allergen-detector). This is imported and initialized in our notebook as `og_food`.

In [522]:
import io
import pandas as pd
import numpy as np
import scipy.stats as stats
import seaborn as sns

# initializing the dataset
data = 'https://raw.githubusercontent.com/splasherzz/food-allergen-detector/main/datasets/%5BOriginal%5D%20Food%20Ingredients%20and%20Allergens.csv'
og_food = pd.read_csv(data)
og_food.head()

Unnamed: 0,Food Product,Main Ingredient,Sweetener,Fat/Oil,Seasoning,Allergens,Prediction
0,Almond Cookies,Almonds,Sugar,Butter,Flour,"Almonds, Wheat, Dairy",Contains
1,Almond Cookies,Almonds,Sugar,Butter,Flour,"Almonds, Wheat, Dairy",Contains
2,Chicken Noodle Soup,Chicken broth,,,Salt,"Chicken, Wheat, Celery",Contains
3,Chicken Noodle Soup,Chicken broth,,,Salt,"Chicken, Wheat, Celery",Contains
4,Cheddar Cheese,Cheese,,,Salt,Dairy,Contains


## **Dataset Features**

Below is a table of the dataset features and their respective descriptions.

<table>
  <tr>
    <th>Column Name</th>
    <th>Description</th>
  </tr>

  <tr>
    <td>Food Product</td>
    <td>Name of the food product</td>
  </tr>
  <tr>
    <td>Main Ingredient</td>
    <td>Defining or distinctive ingredient of the food product</td>
  </tr>
  <tr>
    <td>Sweetener</td>
    <td>Substance added to food or drink to impart the flavor of sweetness</td>
  </tr>
  <tr>
    <td>Fat/Oil</td>
    <td>Lipids made from plants, animals, or synthetic compounds used when frying, baking, and preparing foods</td>
  </tr>
  <tr>
    <td>Seasoning</td>
    <td>Salt, herbs, or spices added to food to enhance the flavour</td>
  </tr>
  <tr>
    <td>Allergens</td>
    <td>Any normally harmless substance that causes an immediate allergic reaction in a susceptible person</td>
  </tr>
  <tr>
    <td>Prediction</td>
    <td>Anticipated outcome of the model</td>
  </tr>
</table>

## **Type Formatting**

The table provided below summarizes the possible data types that we may encounter when using Pandas.

<table>
  <tr>
    <th>Data type</th>
    <th>Description</th>
  </tr>

  <tr>
    <td>object</td>
    <td>Text or mixed numeric and non-numeric values</td>
  </tr>
  <tr>
    <td>int64</td>
    <td>Integer numbers</td>
  </tr>
  <tr>
    <td>float64</td>
    <td>Floating point numbers</td>
  </tr>
  <tr>
    <td>bool</td>
    <td>True/False values</td>
  </tr>
  <tr>
    <td>datetime64[ns]</td>
    <td>Date and time values</td>
  </tr>
  <tr>
    <td>timedelta64[ns]</td>
    <td>Differences between two datetimes</td>
  </tr>
  <tr>
    <td>category</td>
    <td>Finite list of text values</td>
  </tr>
</table>

Using `dtypes` on the dataset, we observe that all columns have the data type `object`, which is described as "text or mixed numeric and non-numeric values." We then perform type formatting to convert every column to `category`, which fits the description of a "finite list of text values." This type is appropriate for our dataset because each column holds a limited set of labels describing the food product.

In [523]:
# performing type formatting: convert every `object` column to "category"
for item in og_food:
  if og_food[item].dtype == object:
    og_food[item] = og_food[item].astype('category')

og_food.dtypes

Food Product       category
Main Ingredient    category
Sweetener          category
Fat/Oil            category
Seasoning          category
Allergens          category
Prediction         category
dtype: object
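
As a rough illustration of why `category` suits columns with few distinct values, the sketch below compares the memory footprint of the same labels stored as `object` and as `category`. The toy series is an assumption made for illustration and is not taken from the dataset.

```python
import pandas as pd

# Toy column with two distinct labels repeated many times (illustrative data)
labels = pd.Series(["Contains", "Does not contain"] * 500)

as_object = labels.astype(object)
as_category = labels.astype("category")

# deep=True also counts the Python string objects, not only the pointers to them
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("object bytes:  ", object_bytes)
print("category bytes:", category_bytes)
print("categories:    ", list(as_category.cat.categories))
```

A categorical column stores each distinct label once and keeps small integer codes per row, which is where the saving comes from.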

## **Handling Duplicates & Null Values**

Before dropping duplicates, we first check whether the dataset contains null values. Only one null value was found, in the `Prediction` column.

In [524]:
# handling null values
print("Total number of missing values in whole dataset:", og_food.isna().sum().sum())
print("\n")
print("Breakdown of which columns have missing values:\n", og_food.isna().sum())
print("\n")
print("Entry with null values:\n", og_food[og_food.isna().any(axis=1)])

Total number of missing values in whole dataset: 1


Breakdown of which columns have missing values:
 Food Product       0
Main Ingredient    0
Sweetener          0
Fat/Oil            0
Seasoning          0
Allergens          0
Prediction         1
dtype: int64


Entry with null values:
     Food Product Main Ingredient Sweetener Fat/Oil     Seasoning  \
338   Baked Ziti           Pasta      None  Cheese  Tomato sauce   

        Allergens Prediction  
338  Wheat, Dairy        NaN  


Upon manually checking the row with the null value (entry 338), we found that it duplicates another entry whose `Prediction` is filled in. We therefore drop this single row. We also drop the duplicates in the dataset, keeping only the first occurrence of each.
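
The manual check described above could also be done programmatically. The following is a minimal sketch on a toy frame (the rows and the reduced column set are assumptions for illustration): it looks for a complete row that matches the incomplete one on every column except `Prediction`.

```python
import pandas as pd

# Toy frame mimicking the situation: one row lacks a Prediction (illustrative values)
df = pd.DataFrame({
    "Food Product": ["Baked Ziti", "Baked Ziti", "Cheddar Cheese"],
    "Allergens": ["Wheat, Dairy", "Wheat, Dairy", "Dairy"],
    "Prediction": ["Contains", None, "Contains"],
})

feature_cols = [c for c in df.columns if c != "Prediction"]

incomplete = df[df["Prediction"].isna()]
# Deduplicate so that each incomplete row can match at most one complete twin
complete = df[df["Prediction"].notna()].drop_duplicates()

# Inner merge on the feature columns: a surviving row means a complete twin exists
twins = incomplete[feature_cols].merge(complete, on=feature_cols, how="inner")
print(twins)
```

If `twins` has as many rows as `incomplete`, every incomplete row is covered by a complete duplicate and can be dropped without losing information.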

In [525]:
# dropping entry with null value
og_food.dropna(inplace=True)

# handling & dropping duplicates
og_food.drop_duplicates(keep='first', inplace=True)
og_food.head()

Unnamed: 0,Food Product,Main Ingredient,Sweetener,Fat/Oil,Seasoning,Allergens,Prediction
0,Almond Cookies,Almonds,Sugar,Butter,Flour,"Almonds, Wheat, Dairy",Contains
2,Chicken Noodle Soup,Chicken broth,,,Salt,"Chicken, Wheat, Celery",Contains
4,Cheddar Cheese,Cheese,,,Salt,Dairy,Contains
5,Ranch Dressing,Buttermilk,Sugar,Vegetable oil,"Garlic, herbs",Dairy,Contains
6,Caramel Popcorn,Popcorn,Sugar,Butter,Salt,Dairy,Contains


After dropping the duplicates and the row with a null value, we check the current shape of the dataset. As shown below, only 308 of the initial 400 entries remain.

In [526]:
og_food.shape

(308, 7)

## **Augmenting the dataset**
We augment the dataset by downloading the cleaned file and manually adding 92 entries, which restores the row count from before the deletion. To procure the data, we searched Google for food products, their ingredients, and their allergen labels.
Before adding each food product, we searched the dataset for it to make sure that no duplicates were introduced.

In [527]:
# saving it as a CSV file
# df = pd.DataFrame(og_food)
# df.to_csv("[Augmented] Food Ingredients and Allergens.csv", index=False) 

## **Importing the augmented dataset**

Then, we import the augmented dataset as `aug_food`, and verify if its shape is correct.

The dataset is uploaded as a .csv file in our [GitHub repository](https://github.com/splasherzz/food-allergen-detector).

In [528]:
data1 = 'https://raw.githubusercontent.com/splasherzz/food-allergen-detector/main/datasets/%5BAugmented%5D%20Food%20Ingredients%20and%20Allergens.csv'
aug_food = pd.read_csv(data1)

# removing parenthesized text, together with any whitespace right before it
aug_food['Main Ingredient'] = aug_food['Main Ingredient'].str.replace(r'\s*\([^)]*\)', '', regex=True)
aug_food['Fat/Oil'] = aug_food['Fat/Oil'].str.replace(r'\s*\([^)]*\)', '', regex=True)
aug_food['Seasoning'] = aug_food['Seasoning'].str.replace(r'\s*\([^)]*\)', '', regex=True)

aug_food.head()

Unnamed: 0,Food Product,Main Ingredient,Sweetener,Fat/Oil,Seasoning,Allergens,Prediction
0,Almond Cookies,Almonds,Sugar,Butter,Flour,"Almonds, Wheat, Dairy",Contains
1,Chicken Noodle Soup,Chicken broth,,,Salt,"Chicken, Wheat, Celery",Contains
2,Cheddar Cheese,Cheese,,,Salt,Dairy,Contains
3,Ranch Dressing,Buttermilk,Sugar,Vegetable oil,"Garlic, herbs",Dairy,Contains
4,Caramel Popcorn,Popcorn,Sugar,Butter,Salt,Dairy,Contains


## **Converting Categorical Features through Categorical Data Encoding**

To convert the qualitative values 'Does not contain' and 'Contains' into a numerical representation, we map them to 0 and 1, respectively, using `map`. This is necessary for training our multilabel classification model, which requires numerical input for making predictions.
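
One caveat with `map` is that any value missing from the dictionary becomes `NaN`. The sketch below, on a toy series with a deliberately misspelled label (an assumption for illustration), shows one way to surface such values after mapping.

```python
import pandas as pd

# Toy labels; the last one is intentionally misspelled (illustrative data)
labels = pd.Series(["Contains", "Does not contain", "contains "])

mapping = {"Contains": 1, "Does not contain": 0}
encoded = labels.map(mapping)

# Values without a matching key come back as NaN, so they can be listed here
unmapped = labels[encoded.isna()]
print(unmapped.tolist())
```

An empty list means every label was mapped to 0 or 1.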

In [529]:
# mapping the prediction values to 0 or 1
aug_food['Prediction'] = aug_food['Prediction'].map({'Contains': 1, 'Does not contain': 0})

# showing that the prediction values are now set to 0/1
aug_food['Prediction']


0      1
1      1
2      1
...
102    1
103    0
104    0
105    0
106    0
107    1
108    1
109    1
110    0
...

We then perform one-hot encoding on the augmented food dataset via `get_dummies`. Because a single cell may list several comma-separated values (for example, several allergens), each cell is first split into its individual values, and every distinct value becomes its own binary (0/1) column. This lets the classification model work with categorical data and learn patterns and relationships between different food attributes, which are crucial for accurate multilabel classification predictions.
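
The split-then-encode idea can be sketched on a toy column as follows. This is a variant for illustration, not the notebook's exact code: it uses `explode` and `str.strip` (so that values are trimmed of the space left after each comma), and the sample data is an assumption.

```python
import pandas as pd

# Toy multi-valued column (illustrative data)
allergens = pd.Series(["Almonds, Wheat, Dairy", "Dairy", "Wheat, Dairy"])

tokens = (
    allergens.str.lower()
    .str.split(",")   # each cell becomes a list of values
    .explode()        # one row per value, original index repeated
    .str.strip()      # remove the whitespace left around each value
)

# One indicator column per distinct value, collapsed back to one row per product
one_hot = pd.get_dummies(tokens, dtype=int).groupby(level=0).max()
print(one_hot)
```

Using `max` rather than `sum` keeps the result strictly 0/1 even if a value is repeated within one cell.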

In [530]:
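pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns

# Performing one-hot encoding on categorical columns.
# The same three steps are repeated for every column below:
#   1. lowercase the text and split it on commas into a list of values
#   2. expand the lists, one-hot encode each value, and sum per row
#   3. append the indicator columns and drop the original column
# Note: splitting on ',' alone keeps the space after each comma, so values
# such as ' wheat' and 'wheat' end up as separate columns.
aug_food['Main Ingredient'] = aug_food['Main Ingredient'].str.lower()
aug_food['Main Ingredient'] = aug_food['Main Ingredient'].str.split(',')

aug_main = pd.get_dummies(aug_food['Main Ingredient'].apply(pd.Series).stack()).groupby(level=0).sum()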

aug_food = pd.concat([aug_food, aug_main], axis=1)
aug_food = aug_food.drop('Main Ingredient', axis=1)


aug_food['Sweetener'] = aug_food['Sweetener'].str.lower()
aug_food['Sweetener'] = aug_food['Sweetener'].str.split(',')

aug_sweetener = pd.get_dummies(aug_food['Sweetener'].apply(pd.Series).stack()).groupby(level=0).sum()

aug_food = pd.concat([aug_food, aug_sweetener], axis=1)
aug_food = aug_food.drop('Sweetener', axis=1)


aug_food['Fat/Oil'] = aug_food['Fat/Oil'].str.lower()
aug_food['Fat/Oil'] = aug_food['Fat/Oil'].str.split(',')

aug_fat_oil = pd.get_dummies(aug_food['Fat/Oil'].apply(pd.Series).stack()).groupby(level=0).sum()

aug_food = pd.concat([aug_food, aug_fat_oil], axis=1)
aug_food = aug_food.drop('Fat/Oil', axis=1)


aug_food['Seasoning'] = aug_food['Seasoning'].str.lower()
aug_food['Seasoning'] = aug_food['Seasoning'].str.split(',')

aug_seasoning = pd.get_dummies(aug_food['Seasoning'].apply(pd.Series).stack()).groupby(level=0).sum()

aug_food = pd.concat([aug_food, aug_seasoning], axis=1)
aug_food = aug_food.drop('Seasoning', axis=1)


aug_food['Allergens'] = aug_food['Allergens'].str.lower()
aug_food['Allergens'] = aug_food['Allergens'].str.split(',')

aug_allergen = pd.get_dummies(aug_food['Allergens'].apply(pd.Series).stack()).groupby(level=0).sum()

aug_food = pd.concat([aug_food, aug_allergen], axis=1)
print(aug_allergen)

aug_food = aug_food.drop('Allergens', axis=1)

# moving `Prediction` to the last column
moving_pred = aug_food['Prediction']
aug_food = aug_food.drop('Prediction', axis = 1)
aug_food['Prediction'] = moving_pred

# dropping the `Food Product` column
aug_food = aug_food.drop('Food Product', axis = 1)

# Showing the dataset


         alcohol   anchovies   beef   calamari   celery   cocoa   coconut  \
0    0         0           0      0          0        0       0         0   
1    0         0           0      0          0        1       0         0   
2    0         0           0      0          0        0       0         0   
3    0         0           0      0          0        0       0         0   
4    0         0           0      0          0        0       0         0   
5    0         0           0      0          0        0       0         0   
6    0         0           0      0          0        0       0         0   
7    0         0           0      0          0        0       0         0   
8    0         0           0      0          0        0       0         0   
9    0         0           0      0          0        0       0         0   
10   0         0           0      0          0        0       0         0   
11   0         0           0      0          0        0       0         0   

## **Preprocessed Dataset Features**

The features in the preprocessed dataset are indicator columns for the individual ingredient and allergen values, such as `almond flour` and `apple`. These features are stored as unsigned 8-bit integers (`uint8`) for efficient memory usage, since the binary values of the features fit comfortably within this range.

In [531]:
aug_food.dtypes

almond flour              uint8
almonds                   uint8
apple                     uint8
apples                    uint8
arborio rice              uint8
avocado                   uint8
avocadoes                 uint8
bacon                     uint8
bananas                   uint8
basmati rice              uint8
beef                      uint8
beet                      uint8
bell peppers              uint8
berries                   uint8
black beans               uint8
blueberries               uint8
bread                     uint8
brie cheese               uint8
broccoli                  uint8
brussels sprouts          uint8
buttermilk                uint8
butternut squash          uint8
cabbage                   uint8
calamari                  uint8
carrots                   uint8
cauliflower               uint8
cheese                    uint8
chicken                   uint8
chicken breast            uint8
chicken broth             uint8
chicken wings             uint8
chickpea

In [532]:
aug_food.shape

(400, 475)

# **Data Modeling**

In [533]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=200)

In [534]:
from sklearn.model_selection import train_test_split

X = aug_food.drop('Prediction', axis = 1)
y = aug_food['Prediction']

In [535]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=41)

In [536]:
classifier.fit(X_train,y_train)

In [537]:
classifier.score(X_test, y_test)

0.9916666666666667

In [538]:
from sklearn.metrics import classification_report, confusion_matrix
pred = classifier.predict(X_test)
print('Confusion Matrix: \n', confusion_matrix(y_test,pred))

Confusion Matrix: 
 [[45  0]
 [ 1 74]]
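
The figures in this confusion matrix can be turned into accuracy, precision, and recall by hand. The sketch below redoes that arithmetic with NumPy, assuming scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([[45, 0],
               [1, 74]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # 119 correct out of 120
precision_1 = tp / (tp + fp)      # of the rows predicted 1, how many are truly 1
recall_1 = tp / (tp + fn)         # of the rows truly 1, how many were found
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"accuracy={accuracy:.4f}")
print(f"class 1: precision={precision_1:.2f}, recall={recall_1:.2f}")
print(f"class 0: precision={precision_0:.2f}, recall={recall_0:.2f}")
```

The single off-diagonal entry is one product that truly contains an allergen but was predicted as not containing one.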


In [539]:
print('Classification Report: \n', classification_report(y_test,pred))

Classification Report: 
               precision    recall  f1-score   support

           0       0.98      1.00      0.99        45
           1       1.00      0.99      0.99        75

    accuracy                           0.99       120
   macro avg       0.99      0.99      0.99       120
weighted avg       0.99      0.99      0.99       120



In [540]:
from sklearn.metrics import accuracy_score
# Evaluate performance on the training set
train_predictions = classifier.predict(X_train)
train_accuracy = accuracy_score(y_train, train_predictions)
print("Training Accuracy:", train_accuracy)

# Evaluate performance on the held-out test set
test_predictions = classifier.predict(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)
print("Testing Accuracy:", test_accuracy)

Training Accuracy: 1.0
Testing Accuracy: 0.9916666666666667


# **Saving the model**



In [541]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

k = 20

kfold = KFold(n_splits=k)

scores = cross_val_score(classifier, X_train, y_train, cv=kfold)

for i, score in enumerate(scores):
    print(f"Fold {i+1}: {score}")

mean_score = scores.mean()
std_score = scores.std()

# Refit the model on the entire training set
classifier.fit(X_train, y_train)

# Evaluate the model on the held-out test set
val_score = classifier.score(X_test, y_test)

print(val_score)

print(mean_score)
print(std_score)


Fold 1: 1.0
Fold 2: 1.0
Fold 3: 1.0
Fold 4: 1.0
Fold 5: 0.9285714285714286
Fold 6: 1.0
Fold 7: 1.0
Fold 8: 1.0
Fold 9: 0.9285714285714286
Fold 10: 1.0
Fold 11: 1.0
Fold 12: 1.0
Fold 13: 1.0
Fold 14: 1.0
Fold 15: 1.0
Fold 16: 1.0
Fold 17: 0.9285714285714286
Fold 18: 1.0
Fold 19: 0.9285714285714286
Fold 20: 1.0
0.9916666666666667
0.9857142857142855
0.028571428571428557
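
The summary statistics above can be reproduced from the per-fold scores. In the sketch below, the scores are re-entered by hand: sixteen folds at 1.0 and four at 0.9285714..., which is consistent with 13 correct out of 14 samples per fold (an inference from the printed values, not something the notebook states).

```python
import numpy as np

# Per-fold accuracies as printed above: 16 perfect folds, 4 folds at 13/14
scores = np.array([1.0] * 16 + [13 / 14] * 4)

mean_score = scores.mean()
std_score = scores.std()  # NumPy's default is the population std (ddof=0)

print(f"mean={mean_score:.4f}, std={std_score:.4f}")
```

Reporting the mean together with the standard deviation gives a sense of how much the accuracy varies from fold to fold.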
