<a href="https://colab.research.google.com/github/splasherzz/food-allergen-detector/blob/main/Project%20Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Preprocessing**

## **Importing the original dataset**

The dataset is stored as a .csv file in our [GitHub repository](https://github.com/splasherzz/food-allergen-detector). We load it into the notebook as the DataFrame `og_food`.

In [None]:
import io
import pandas as pd
import numpy as np
import scipy.stats as stats
import seaborn as sns

# initializing the dataset
data = 'https://raw.githubusercontent.com/splasherzz/food-allergen-detector/main/%5BOriginal%5D%20Food%20Ingredients%20and%20Allergens.csv'
og_food = pd.read_csv(data)
og_food.head()

Unnamed: 0,Food Product,Main Ingredient,Sweetener,Fat/Oil,Seasoning,Allergens,Prediction
0,Almond Cookies,Almonds,Sugar,Butter,Flour,"Almonds, Wheat, Dairy",Contains
1,Almond Cookies,Almonds,Sugar,Butter,Flour,"Almonds, Wheat, Dairy",Contains
2,Chicken Noodle Soup,Chicken broth,,,Salt,"Chicken, Wheat, Celery",Contains
3,Chicken Noodle Soup,Chicken broth,,,Salt,"Chicken, Wheat, Celery",Contains
4,Cheddar Cheese,Cheese,,,Salt,Dairy,Contains


## **Dataset Features**

Below is a table of the dataset features and their respective descriptions.

<table>
  <tr>
    <th>Column Name</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>Food Product</td>
    <td>Name of the food product</td>
  </tr>
  <tr>
    <td>Main Ingredient</td>
    <td>Defining or distinctive ingredient of the food product</td>
  </tr>
  <tr>
    <td>Sweetener</td>
    <td>Substance added to food or drink to impart the flavor of sweetness</td>
  </tr>
  <tr>
    <td>Fat/Oil</td>
    <td>Lipids made from plants, animals, or synthetic compounds used when frying, baking, and preparing foods</td>
  </tr>
  <tr>
    <td>Seasoning</td>
    <td>Salt, herbs, or spices added to food to enhance the flavour</td>
  </tr>
  <tr>
    <td>Allergens</td>
    <td>Any normally harmless substance that causes an immediate allergic reaction in a susceptible person</td>
  </tr>
  <tr>
    <td>Prediction</td>
    <td>Anticipated outcome of the model</td>
  </tr>
</table>

## **Type Formatting**

The table provided below summarizes the possible data types that we may encounter when using Pandas.

<table>
  <tr>
    <th>Data type</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>object</td>
    <td>Text or mixed numeric and non-numeric values</td>
  </tr>
  <tr>
    <td>int64</td>
    <td>Integer numbers</td>
  </tr>
  <tr>
    <td>float64</td>
    <td>Floating point numbers</td>
  </tr>
  <tr>
    <td>bool</td>
    <td>True/False values</td>
  </tr>
  <tr>
    <td>datetime64[ns]</td>
    <td>Date and time values</td>
  </tr>
  <tr>
    <td>timedelta64[ns]</td>
    <td>Differences between two datetimes</td>
  </tr>
  <tr>
    <td>category</td>
    <td>Finite list of text values</td>
  </tr>
</table>

Calling `dtypes` on the dataset shows that every column has the data type `object`, described above as "text or mixed numeric and non-numeric values." We therefore convert all columns to `category`, which matches the description "finite list of text values." This type suits our dataset because each column holds a limited set of labels describing the food product.

In [None]:
# converting every column of type object to the "category" data type
for item in og_food:
  if og_food[item].dtype == object:
    og_food[item] = og_food[item].astype('category')

og_food.dtypes

Food Product       category
Main Ingredient    category
Sweetener          category
Fat/Oil            category
Seasoning          category
Allergens          category
Prediction         category
dtype: object
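
The same conversion can also be written without an explicit loop. The snippet below is a minimal sketch on a small made-up frame (`toy` and its values are illustrative assumptions, not rows guaranteed to match the real file); it selects the `object` columns with `select_dtypes` and converts them in one `astype` call.

```python
import pandas as pd

# Toy frame standing in for og_food; the values are illustrative only.
# dtype=object is set explicitly so the example behaves the same
# regardless of the pandas version's default string dtype.
toy = pd.DataFrame(
    {
        "Food Product": ["Almond Cookies", "Cheddar Cheese", "Almond Cookies"],
        "Prediction": ["Contains", "Contains", "Contains"],
    },
    dtype=object,
)

# Pick out every object-typed column, then convert them all at once
text_cols = toy.select_dtypes(include="object").columns
toy = toy.astype({col: "category" for col in text_cols})

print(toy.dtypes)
# Each category column stores its distinct labels once
print(toy["Food Product"].cat.categories.tolist())
```

A `category` column keeps one copy of each distinct label plus integer codes per row, which is why it fits columns with a finite set of text values.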

## **Handling Duplicates & Null Values**

Before dropping duplicates, we first check whether the dataset contains null values. There is exactly one, in the `Prediction` column.

In [None]:
# handling null values
print("Total number of missing values in whole dataset:", og_food.isna().sum().sum())
print("\n")
print("Breakdown of which columns have missing values:\n", og_food.isna().sum())
print("\n")
print("Entry with null values:\n", og_food[og_food.isna().any(axis=1)])

Total number of missing values in whole dataset: 1


Breakdown of which columns have missing values:
 Food Product       0
Main Ingredient    0
Sweetener          0
Fat/Oil            0
Seasoning          0
Allergens          0
Prediction         1
dtype: int64


Entry with null values:
     Food Product Main Ingredient Sweetener Fat/Oil     Seasoning  \
338   Baked Ziti           Pasta      None  Cheese  Tomato sauce   

        Allergens Prediction  
338  Wheat, Dairy        NaN  


Manually checking the row with the null value (entry 338) showed that it duplicates another entry whose `Prediction` is filled in. We therefore drop this single row. We also drop the duplicates in the dataset, keeping only the first occurrence of each.

In [None]:
# dropping entry with null value
og_food.dropna(inplace=True)

# handling & dropping duplicates
og_food.drop_duplicates(keep='first', inplace=True)
og_food.head()

Unnamed: 0,Food Product,Main Ingredient,Sweetener,Fat/Oil,Seasoning,Allergens,Prediction
0,Almond Cookies,Almonds,Sugar,Butter,Flour,"Almonds, Wheat, Dairy",Contains
2,Chicken Noodle Soup,Chicken broth,,,Salt,"Chicken, Wheat, Celery",Contains
4,Cheddar Cheese,Cheese,,,Salt,Dairy,Contains
5,Ranch Dressing,Buttermilk,Sugar,Vegetable oil,"Garlic, herbs",Dairy,Contains
6,Caramel Popcorn,Popcorn,Sugar,Butter,Salt,Dairy,Contains
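
The manual check described above could also be done programmatically. The following is a sketch on made-up rows, assuming that "duplicate" means identical values in every column except `Prediction`; the frame `toy` and the helper names are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy rows mimicking the situation described above (not the real data)
toy = pd.DataFrame({
    "Food Product": ["Baked Ziti", "Baked Ziti", "Cheddar Cheese", "Cheddar Cheese"],
    "Allergens": ["Wheat, Dairy", "Wheat, Dairy", "Dairy", "Dairy"],
    "Prediction": ["Contains", None, "Contains", "Contains"],
})

feature_cols = [c for c in toy.columns if c != "Prediction"]

# Rows whose label is missing
null_rows = toy[toy["Prediction"].isna()]

# For each such row, look for a labelled row with identical features
labelled = toy.dropna(subset=["Prediction"]).drop_duplicates()
matches = null_rows[feature_cols].merge(labelled, on=feature_cols, how="left")
print(matches)

# Number of exact duplicates that drop_duplicates(keep="first") would remove
n_dupes = int(toy.dropna().duplicated(keep="first").sum())
print(n_dupes)
```

If `matches` shows a filled `Prediction` for the null row, dropping that row loses no information, which is the reasoning used above.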


After dropping the duplicates and the row with a null value, we check the current shape of the dataset. As shown below, only 308 of the initial 400 entries remain.

In [None]:
og_food.shape

(308, 7)

## **Augmenting the dataset**

We augment the dataset by downloading the cleaned file and manually adding 92 entries, which restores the row count from before the deletion (308 + 92 = 400). To source the data, we searched Google for food products, their ingredients, and their allergen labels.
Before adding a food product, we searched the dataset for it to make sure it was not already present.

In [None]:
# saving it as a CSV file
# df = pd.DataFrame(og_food)
# df.to_csv("[Augmented] Food Ingredients and Allergens.csv", index=False) 
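
The duplicate check before adding new entries was done by hand. As a rough sketch of how it could be automated, the snippet below compares candidate names against existing ones after normalizing case and whitespace; `existing`, `candidates`, and `normalize` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Stand-in for the "Food Product" column of the cleaned dataset
existing = pd.Series(
    ["Almond Cookies", "Cheddar Cheese", "Ranch Dressing"], name="Food Product"
)

# Hypothetical names being considered for the augmented file
candidates = ["cheddar cheese ", "Apple Pie"]


def normalize(name: str) -> str:
    """Collapse whitespace and ignore case so near-identical names match."""
    return " ".join(name.split()).casefold()


known = set(existing.map(normalize))

# Keep only candidates that are not already in the dataset
new_entries = [c for c in candidates if normalize(c) not in known]
print(new_entries)
```

Normalizing first guards against near-duplicates that differ only in capitalization or stray spaces, which an exact-match search would miss.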

## **Importing the augmented dataset**

We then import the augmented dataset as `food` and verify that its shape is correct.

The dataset is uploaded as a .csv file in our [GitHub repository](https://github.com/splasherzz/food-allergen-detector).

In [None]:
data = 'https://raw.githubusercontent.com/splasherzz/food-allergen-detector/main/%5BAugmented%5D%20Food%20Ingredients%20and%20Allergens.csv'
food = pd.read_csv(data)
df = food.copy()  # independent working copy, so later edits to df leave food untouched

food.head()

Unnamed: 0,Food Product,Main Ingredient,Sweetener,Fat/Oil,Seasoning,Allergens,Prediction
0,Almond Cookies,Almonds,Sugar,Butter,Flour,"Almonds, Wheat, Dairy",Contains
1,Chicken Noodle Soup,Chicken broth,,,Salt,"Chicken, Wheat, Celery",Contains
2,Cheddar Cheese,Cheese,,,Salt,Dairy,Contains
3,Ranch Dressing,Buttermilk,Sugar,Vegetable oil,"Garlic, herbs",Dairy,Contains
4,Caramel Popcorn,Popcorn,Sugar,Butter,Salt,Dairy,Contains


## **Converting Categorical Features**

The prediction values will be set to 0s or 1s such that 'Does not contain' becomes 0, and 'Contains' becomes 1. We display the prediction values below.

In [None]:
# mapping the prediction values to 0 or 1
df['Prediction'] = df['Prediction'].map({'Contains': 1, 'Does not contain': 0})

# showing that the prediction values are now set to 0/1
df['Prediction']

0      1
1      1
2      1
3      1
4      1
      ..
395    1
396    1
397    1
398    1
399    1
Name: Prediction, Length: 400, dtype: int64
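
One thing to watch with `map` and a dictionary: any label missing from the dictionary silently becomes `NaN`. The sketch below, on made-up labels, shows one way to list such labels before trusting the result; the variant spelling `"contains "` is an invented example, not a value known to be in the dataset.

```python
import pandas as pd

# Made-up labels, including one variant spelling for illustration
labels = pd.Series(["Contains", "Does not contain", "contains "], name="Prediction")
mapping = {"Contains": 1, "Does not contain": 0}

# Labels absent from the mapping turn into NaN
mapped = labels.map(mapping)
unmapped = labels[mapped.isna()].unique().tolist()
print(unmapped)

# Tidying case and whitespace first lets every label map cleanly
cleaned = labels.str.strip().str.capitalize().map(mapping)
print(cleaned.tolist())
```

An empty `unmapped` list confirms that every row received a 0 or 1.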

We then one-hot encode the remaining categorical columns with `pd.get_dummies`, so that each distinct value in a column becomes its own 0/1 indicator column.

In [None]:
# one-hot encoding each categorical column (drop_first=True omits the first level of each)
food = pd.get_dummies(df['Food Product'], drop_first=True)
ingr = pd.get_dummies(df['Main Ingredient'], drop_first=True)
sweet = pd.get_dummies(df['Sweetener'], drop_first=True)
fat = pd.get_dummies(df['Fat/Oil'], drop_first=True)
seas = pd.get_dummies(df['Seasoning'], drop_first=True)
aller = pd.get_dummies(df['Allergens'], drop_first=True)

# replacing the original categorical columns with the encoded ones, keeping the prediction value as the last column
df.drop(['Food Product','Main Ingredient','Sweetener','Fat/Oil','Seasoning','Allergens'], axis=1, inplace=True)
df = pd.concat([food,ingr,sweet,fat,seas,aller,df], axis=1)

# showing the dataset
df.head()
#df.to_csv("Preprocessed.csv", index=False)

Unnamed: 0,Aloo Gobi,Aloo Paratha,Apple,Apple Cider,Apple Crisp,Apple Pie,Apple sauce,Apple tart,Arabic Fattoush,Arancini,...,"Wheat, Dairy, Cocoa","Wheat, Dairy, Eggs","Wheat, Dairy, Nuts","Wheat, Pork, Dairy","Wheat, calamari","Wheat, dairy","Wheat, eggs","Wheat, eggs, dairy","Wheat, fish",Prediction
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
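
As the column names in the output suggest, `pd.get_dummies` treats each full `Allergens` string as a single category, so `"Wheat, Dairy"` and `"Wheat, dairy"` end up in separate columns. The sketch below uses made-up values to show this behaviour, then an alternative that is not used in this notebook: splitting on `", "` with `Series.str.get_dummies` to get one column per individual allergen.

```python
import pandas as pd

# Made-up allergen strings for illustration
allergens = pd.Series(["Wheat, Dairy", "Dairy", "Wheat, dairy"], name="Allergens")

# As in the cell above: every distinct string becomes one column
combo = pd.get_dummies(allergens)
print(combo.columns.tolist())

# drop_first=True removes the first (alphabetically smallest) level
combo_dropped = pd.get_dummies(allergens, drop_first=True)
print(combo_dropped.columns.tolist())

# Alternative sketch: lowercase, then split on ", " so that each
# individual allergen gets its own indicator column
per_allergen = allergens.str.lower().str.get_dummies(sep=", ")
print(per_allergen)
```

Whether per-allergen columns help depends on the model; the notebook keeps the combination-level encoding.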


## **Preprocessed Dataset Features**

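After encoding, every feature column of the preprocessed dataset is a binary indicator (0 or 1) for one value of `Food Product`, `Main Ingredient`, `Sweetener`, `Fat/Oil`, `Seasoning`, or `Allergens`, and the final column is the `Prediction` label. Note that `food` was reassigned in the encoding cell and now holds only the one-hot-encoded `Food Product` columns, which is what the two cells below inspect.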

In [None]:
food.dtypes

Aloo Gobi                      uint8
Aloo Paratha                   uint8
Apple                          uint8
Apple Cider                    uint8
Apple Crisp                    uint8
                               ...  
Wheat Bread                    uint8
White Bread                    uint8
Zucchini Bread                 uint8
Zucchini Noodles               uint8
Zucchini Noodles with Pesto    uint8
Length: 350, dtype: object

In [None]:
food.shape

(400, 350)

# **Data Modeling**