# 2 Lactose Intolerance Dish Classifier

### Setting up the environment

Set up the environment by installing the 'mlend' package and importing the necessary libraries. Google Drive is mounted in the Colab environment so that the dataset files can be downloaded to it and accessed from there.

In [1]:
!pip install mlend



In [2]:
from google.colab import drive

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import spkit as sp

from skimage import exposure
from skimage.color import rgb2hsv, rgb2gray
import skimage as ski

import mlend
from mlend import download_yummy, yummy_load

import os, sys, re, pickle, glob
import urllib.request
import zipfile

import IPython.display as ipd
from tqdm import tqdm
import librosa

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
baseDir = download_yummy(save_to = '/content/drive/MyDrive/Data/MLEnd')
baseDir

Downloading 3250 image files from https://github.com/MLEndDatasets/Yummy
100%|[0m▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓[0m|3250\3250|003250.jpg
Done!


'/content/drive/MyDrive/Data/MLEnd/yummy'

# Dataset

In the dataset preprocessing phase, the 3,250 food samples in the dataset were labeled according to their lactose content. Each dish, described by its name and ingredients, was labeled as either 'lactose' or 'non-lactose' by searching the dish name and ingredients for lactose-related keywords. The 'Dish_name' column was then cleaned by replacing underscores with spaces to improve readability. Finally, the dataset was split into training and testing sets according to the 'Benchmark_A' attribute.

In [4]:
#load dataset
df = pd.read_csv('/content/drive/MyDrive/Data/MLEnd/yummy/MLEndYD_image_attributes_benchmark.csv').set_index('filename')
df

filename,Diet,Cuisine_org,Cuisine,Dish_name,Home_or_restaurant,Ingredients,Healthiness_rating,Healthiness_rating_int,Likeness,Likeness_int,Benchmark_A
000001.jpg,non_vegetarian,japanese,japanese,chicken_katsu_rice,marugame_udon,"rice,chicken_breast,spicy_curry_sauce",neutral,3.0,like,4.0,Train
000002.jpg,non_vegetarian,english,english,english_breakfast,home,"eggs,bacon,hash_brown,tomato,bread,tomato,bake...",unhealthy,2.0,like,4.0,Train
000003.jpg,non_vegetarian,chinese,chinese,spicy_chicken,jinli_flagship_branch,"chili,chicken,peanuts,sihuan_peppercorns,green...",neutral,3.0,strongly_like,5.0,Train
000004.jpg,vegetarian,indian,indian,gulab_jamun,home,"sugar,water,khoya,milk,salt,oil,cardamon,ghee",unhealthy,2.0,strongly_like,5.0,Train
000005.jpg,non_vegetarian,indian,indian,chicken_masala,home,"chicken,lemon,turmeric,garam_masala,coriander_...",healthy,4.0,strongly_like,5.0,Train
...,...,...,...,...,...,...,...,...,...,...,...
003246.jpg,vegetarian,indian,indian,zeera_rice,home,"1_cup_basmati_rice,2_cups_water,2_tablespoons_...",healthy,4.0,strongly_like,5.0,Train
003247.jpg,vegetarian,indian,indian,paneer_and_dal,home,"fried_cottage_cheese,ghee,lentils,milk,wheat_f...",healthy,4.0,strongly_like,5.0,Test
003248.jpg,vegetarian,indian,indian,samosa,home,"potato,onion,peanut,salt,turmeric_powder,red_c...",very_unhealthy,1.0,like,4.0,Test
003249.jpg,vegan,indian,indian,fruit_milk,home,"kiwi,banana,apple,milk",very_healthy,5.0,strongly_like,5.0,Train


In [5]:
# Keywords for lactose-based products
lactose_prod = ['milk', 'cheese', 'butter', 'yogurt', 'custard', 'cream']

# Label a dish 'lactose' if any keyword occurs (as a substring) in its name or ingredients
def classify_dish(row):
    for word in lactose_prod:
        if word in row['Dish_name'].lower() or word in row['Ingredients'].lower():
            return 'lactose'
    return 'non-lactose'
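
Because the check uses Python's `in` operator on the whole string, it matches substrings rather than whole ingredients. The sketch below contrasts that behaviour with whole-token matching on made-up ingredient strings. The function names `has_keyword_substring` and `has_keyword_token` are hypothetical and not part of the notebook, and this variant is not the labelling used for the results reported later.

```python
import re

# Same keyword list as above; the helper names below are illustrative only.
LACTOSE_KEYWORDS = ['milk', 'cheese', 'butter', 'yogurt', 'custard', 'cream']

def has_keyword_substring(text):
    """Substring match, mirroring the `word in text` test used above."""
    text = text.lower()
    return any(word in text for word in LACTOSE_KEYWORDS)

def has_keyword_token(text):
    """Match whole tokens only, splitting on commas, underscores and whitespace."""
    tokens = set(re.split(r'[,_\s]+', text.lower()))
    return any(word in tokens for word in LACTOSE_KEYWORDS)

# Made-up ingredient strings in the dataset's comma/underscore style.
examples = ['kiwi,banana,apple,milk', 'butternut_squash,salt', 'rice,chicken_breast']
for text in examples:
    print(text, has_keyword_substring(text), has_keyword_token(text))
```

Whole-token matching avoids flagging 'butternut_squash' through the keyword 'butter', but it still flags plant-based items such as 'coconut_milk', so neither variant replaces a manual review of the labels.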

In [6]:
# Apply the function to each row in the dataframe
df['lactose'] = df.apply(classify_dish, axis=1)
df

filename,Diet,Cuisine_org,Cuisine,Dish_name,Home_or_restaurant,Ingredients,Healthiness_rating,Healthiness_rating_int,Likeness,Likeness_int,Benchmark_A,lactose
000001.jpg,non_vegetarian,japanese,japanese,chicken_katsu_rice,marugame_udon,"rice,chicken_breast,spicy_curry_sauce",neutral,3.0,like,4.0,Train,non-lactose
000002.jpg,non_vegetarian,english,english,english_breakfast,home,"eggs,bacon,hash_brown,tomato,bread,tomato,bake...",unhealthy,2.0,like,4.0,Train,non-lactose
000003.jpg,non_vegetarian,chinese,chinese,spicy_chicken,jinli_flagship_branch,"chili,chicken,peanuts,sihuan_peppercorns,green...",neutral,3.0,strongly_like,5.0,Train,non-lactose
000004.jpg,vegetarian,indian,indian,gulab_jamun,home,"sugar,water,khoya,milk,salt,oil,cardamon,ghee",unhealthy,2.0,strongly_like,5.0,Train,lactose
000005.jpg,non_vegetarian,indian,indian,chicken_masala,home,"chicken,lemon,turmeric,garam_masala,coriander_...",healthy,4.0,strongly_like,5.0,Train,non-lactose
...,...,...,...,...,...,...,...,...,...,...,...,...
003246.jpg,vegetarian,indian,indian,zeera_rice,home,"1_cup_basmati_rice,2_cups_water,2_tablespoons_...",healthy,4.0,strongly_like,5.0,Train,non-lactose
003247.jpg,vegetarian,indian,indian,paneer_and_dal,home,"fried_cottage_cheese,ghee,lentils,milk,wheat_f...",healthy,4.0,strongly_like,5.0,Test,lactose
003248.jpg,vegetarian,indian,indian,samosa,home,"potato,onion,peanut,salt,turmeric_powder,red_c...",very_unhealthy,1.0,like,4.0,Test,non-lactose
003249.jpg,vegan,indian,indian,fruit_milk,home,"kiwi,banana,apple,milk",very_healthy,5.0,strongly_like,5.0,Train,lactose


In [7]:
# Replace underscores with spaces in the dish names for readability
df['Dish_name'] = df['Dish_name'].str.replace('_', ' ')
df

filename,Diet,Cuisine_org,Cuisine,Dish_name,Home_or_restaurant,Ingredients,Healthiness_rating,Healthiness_rating_int,Likeness,Likeness_int,Benchmark_A,lactose
000001.jpg,non_vegetarian,japanese,japanese,chicken katsu rice,marugame_udon,"rice,chicken_breast,spicy_curry_sauce",neutral,3.0,like,4.0,Train,non-lactose
000002.jpg,non_vegetarian,english,english,english breakfast,home,"eggs,bacon,hash_brown,tomato,bread,tomato,bake...",unhealthy,2.0,like,4.0,Train,non-lactose
000003.jpg,non_vegetarian,chinese,chinese,spicy chicken,jinli_flagship_branch,"chili,chicken,peanuts,sihuan_peppercorns,green...",neutral,3.0,strongly_like,5.0,Train,non-lactose
000004.jpg,vegetarian,indian,indian,gulab jamun,home,"sugar,water,khoya,milk,salt,oil,cardamon,ghee",unhealthy,2.0,strongly_like,5.0,Train,lactose
000005.jpg,non_vegetarian,indian,indian,chicken masala,home,"chicken,lemon,turmeric,garam_masala,coriander_...",healthy,4.0,strongly_like,5.0,Train,non-lactose
...,...,...,...,...,...,...,...,...,...,...,...,...
003246.jpg,vegetarian,indian,indian,zeera rice,home,"1_cup_basmati_rice,2_cups_water,2_tablespoons_...",healthy,4.0,strongly_like,5.0,Train,non-lactose
003247.jpg,vegetarian,indian,indian,paneer and dal,home,"fried_cottage_cheese,ghee,lentils,milk,wheat_f...",healthy,4.0,strongly_like,5.0,Test,lactose
003248.jpg,vegetarian,indian,indian,samosa,home,"potato,onion,peanut,salt,turmeric_powder,red_c...",very_unhealthy,1.0,like,4.0,Test,non-lactose
003249.jpg,vegan,indian,indian,fruit milk,home,"kiwi,banana,apple,milk",very_healthy,5.0,strongly_like,5.0,Train,lactose


# Machine Learning Pipeline
The machine learning pipeline for this project begins with feature extraction using TfidfVectorizer from sklearn.feature_extraction.text, which transforms the dish names into numerical vectors suitable for modelling. The vectorized data is then passed to the LinearSVC model from sklearn.svm, a linear classifier well suited to sparse text features. The dataset is split 70/30 into training and testing sets using the predefined 'Benchmark_A' column, so the model is evaluated on dishes it did not see during training. The output of the pipeline is a prediction of whether a food item is 'lactose' or 'non-lactose'. Note that the labels were derived from both dish names and ingredients, whereas the model sees only the dish name.
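
As a compact illustration of the two stages described above, the sketch below chains a TfidfVectorizer and a LinearSVC with scikit-learn's `make_pipeline`. The dish names and labels are toy values made up for this example, and the notebook itself runs the two stages in separate cells rather than through a `Pipeline` object.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy dish names and labels (1 = lactose, 0 = non-lactose); made up for illustration.
toy_names = ['cheese pizza', 'milk tea', 'butter chicken', 'cream pasta',
             'fried rice', 'grilled salmon', 'veg roll', 'boiled eggs']
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Chain the vectorizer and the classifier so both are fitted in one call.
pipeline = make_pipeline(TfidfVectorizer(), LinearSVC(C=0.1))
pipeline.fit(toy_names, toy_labels)

# Predict on unseen toy dish names; raw strings go in, class labels come out.
predictions = pipeline.predict(['cheese sandwich', 'steamed rice'])
print(predictions)
```

Wrapping both steps in one object means the vectorizer is always fitted on the same data as the classifier, which helps avoid accidentally fitting it on test data.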


In [8]:
# Function to extract labels and encoded labels from DataFrame

class_mapping = {'non-lactose': 0, 'lactose': 1}

def get_labels_and_encoded(df, class_mapping):
    Y = df['lactose'].tolist()
    Y_encoded = [class_mapping[label] for label in Y]
    return Y, Y_encoded

# Get labels for the training set
train_Y, train_Y_encoded = get_labels_and_encoded(df[df['Benchmark_A'] == 'Train'], class_mapping)

# Get labels for the testing set
test_Y, test_Y_encoded = get_labels_and_encoded(df[df['Benchmark_A'] == 'Test'], class_mapping)

# Extract 'Dish_name' for TrainSet and TestSet
TrainSet = {
    'X': df[df['Benchmark_A'] == 'Train']['Dish_name'].tolist(),
    'Y': train_Y,
    'Y_encoded': train_Y_encoded
}

TestSet = {
    'X': df[df['Benchmark_A'] == 'Test']['Dish_name'].tolist(),
    'Y': test_Y,
    'Y_encoded': test_Y_encoded
}


In [9]:
#Check if our structure is right
print(TrainSet.keys())
print(TestSet.keys())

dict_keys(['X', 'Y', 'Y_encoded'])
dict_keys(['X', 'Y', 'Y_encoded'])


In [10]:
print(TrainSet['X'])

['chicken katsu rice', 'english breakfast', 'spicy chicken', 'gulab jamun', 'chicken masala', 'boiled eggs', 'scrambled eggs', 'chicken wings', 'cod fish,mash and beans', 'veg roll', 'pan fried salmon', 'instant noodle', 'biryani', 'rice beetroot curry', 'pie and chips', 'bolognese pasta', 'chicken biryani', 'bottlegourd', 'upma', 'maggie', 'cottage cheese butter masala', 'brinjal curry', 'falafel', 'cauliflower curry rice', 'fried pork and egg', 'mixed vegetable fried rice', 'ramen', 'sliced pork with garlic sauce', 'maggi noodles', 'shredded chicken chow mein', 'salad', 'cauliflower curry & indian bread', 'rice sambhar', 'full chicken', 'pizza', 'punjabi khichdi', 'big bombay sub', 'egg rice bowl', 'shawarma plate', 'manti', 'alexandrian koshari', 'bear wings', 'turkish brunch', 'southern fried chicken wrap', 'coconut curry served with roti', 'onion and paratha', 'rice with broccoli masala', 'margherita pizza', 'scrambled egg', 'stir fried noodle', 'dal makhani with garlic naan bread

In [11]:
print(TrainSet['Y'])

['non-lactose', 'non-lactose', 'non-lactose', 'lactose', 'non-lactose', 'non-lactose', 'lactose', 'lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'lactose', 'non-lactose', 'lactose', 'non-lactose', 'non-lactose', 'lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'lactose', 'non-lactose', 'lactose', 'non-lactose', 'lactose', 'non-lactose', 'non-lactose', 'lactose', 'non-lactose', 'lactose', 'lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'non-lactose', 'lactose', 'lactose', 'lactose', 'lactose', 'non-lactose', 'non-lactose', 'lactose', 'lactose', 'non-lactose', 'non-lactose', 'lactose', 'non-lactose', 'la

In [12]:
print(TrainSet['Y_encoded'])

[0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [13]:
X_train = TrainSet['X']
X_test  = TestSet['X']

Y_train = TrainSet['Y_encoded']
Y_test  = TestSet['Y_encoded']

# Transformation Stage
In the transformation stage, the 'Dish_name' text was converted into numerical vectors using the TF-IDF vectorizer. TF-IDF was chosen because it gives more weight to words that are distinctive for a dish and less weight to words that appear in many dish names. The vectorizer was fitted on the training data only and then used to transform both the training and testing data into TF-IDF scores, so no information from the test set influences the vocabulary or the weights. Because TF-IDF treats each dish name as a bag of words, it captures which terms are important rather than what they mean, which is still enough for the model to pick up on indicative words in dish names.
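
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. It assumes the smoothed formula that scikit-learn documents as the TfidfVectorizer default, idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row. The corpus and the helper name `tfidf` are made up for illustration, and tokenization is simplified to splitting on spaces.

```python
import math
from collections import Counter

# Toy corpus of dish names; made up for illustration.
corpus = ['chicken curry', 'chicken rice', 'paneer curry rice', 'chicken wings']

def tfidf(corpus):
    """Return (idf per term, one {term: weight} dict per document)."""
    docs = [doc.split() for doc in corpus]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: counts[t] * idf[t] for t in counts}
        # L2-normalize so every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return idf, rows

idf, rows = tfidf(corpus)
print(idf)
print(rows[3])
```

In this toy corpus 'chicken' appears in three of the four names, so it receives a lower weight than 'wings', which appears only once.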


In [14]:
X_train = np.array(X_train)
X_test = np.array(X_test)

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create the vectorizer
vectorizer = TfidfVectorizer()

# Fit on the training data
vectorizer.fit(X_train)

# Transform both training and testing data
X_train_tfidf = vectorizer.transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)


In [16]:
df_train = df[df['Benchmark_A'] == 'Train']
df_test = df[df['Benchmark_A'] == 'Test']

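# Re-create the vectorizer with English stop words removed. This overwrites the
# vectorizer and TF-IDF matrices from the previous cell; these are the features used below.
vectorizer = TfidfVectorizer(stop_words='english')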
X_train_tfidf = vectorizer.fit_transform(df_train['Dish_name'])
X_test_tfidf = vectorizer.transform(df_test['Dish_name'])


# Modelling
For the modelling stage, a Linear Support Vector Classifier (LinearSVC) was chosen for its ability to handle the high-dimensional, sparse feature spaces typical of text classification. Since the task is binary, LinearSVC's search for a maximum-margin separating hyperplane makes it a good option for distinguishing lactose from non-lactose dishes. The regularization parameter C was set to 0.1 to balance a large margin against classification error on the training data.
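
As a reference for what C controls, the L2-regularized squared-hinge objective can be written as below. This assumes LinearSVC's default settings (`penalty='l2'`, `loss='squared_hinge'`) and ignores how the library treats the intercept internally. The symbols $w$, $b$, $x_i$ and $y_i$ are notation introduced here, with the labels recoded as $y_i \in \{-1, +1\}$.

```latex
\min_{w,\,b} \;\; \frac{1}{2}\lVert w \rVert_2^2
\;+\; C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\,(w^\top x_i + b)\bigr)^2
```

A smaller C, such as the 0.1 used here, gives the margin term more weight relative to the training errors, which favours a wider margin at the cost of more misclassified training points.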


In [17]:
# Linear Model
from sklearn.svm import LinearSVC

model = LinearSVC(C=0.1)
model.fit(X_train_tfidf, Y_train)

# Methodology
The training process involved fitting the LinearSVC model on the TF-IDF-transformed training set, enabling it to learn from the textual features of dish names. Performance was evaluated using accuracy as the primary metric, complemented by a classification report that includes precision, recall, and F1-score for each class. These metrics were chosen because, taken together, they show how the model performs on each class rather than only overall.
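
For reference, the metrics named above can be computed by hand from true and predicted labels. The sketch below uses made-up labels and a hypothetical helper `binary_metrics`; it is meant only to show the definitions, not to reproduce the notebook's numbers.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}

# Made-up labels: 1 = lactose, 0 = non-lactose.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case accuracy is 0.7 while recall on the positive class is only 0.5, which illustrates why accuracy alone can hide poor performance on the smaller class.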


In [18]:
from sklearn.metrics import classification_report

ytp = model.predict(X_train_tfidf)
ysp = model.predict(X_test_tfidf)

train_accuracy = np.mean(ytp==Y_train)
test_accuracy  = np.mean(ysp==Y_test)

# Classification Report
print(classification_report(Y_test, ysp, target_names=['Non-Lactose', 'Lactose']))
print('Training Accuracy:\t',train_accuracy)
print('Test Accuracy:\t',test_accuracy)

              precision    recall  f1-score   support

 Non-Lactose       0.79      0.95      0.87       687
     Lactose       0.78      0.41      0.54       288

    accuracy                           0.79       975
   macro avg       0.79      0.68      0.70       975
weighted avg       0.79      0.79      0.77       975

Training Accuracy:	 0.8197802197802198
Test Accuracy:	 0.7928205128205128


# Results
After training, the LinearSVC model reached a training accuracy of about 82% and a test accuracy of about 79%. The classification report shows similar precision for both classes (0.79 for Non-Lactose and 0.78 for Lactose) but very different recall: 0.95 for Non-Lactose against 0.41 for Lactose. The model therefore identifies non-lactose dishes reliably, but it misses more than half of the lactose-containing dishes in the test set.
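
To put the test accuracy in context, it can be compared with a majority-class baseline that always predicts 'Non-Lactose'. The short calculation below uses only the support counts and the rounded Lactose scores printed in the classification report above; the variable names are illustrative.

```python
# Support counts and Lactose-class scores copied from the classification report.
support = {'Non-Lactose': 687, 'Lactose': 288}
lactose_precision, lactose_recall = 0.78, 0.41

total = sum(support.values())
# Accuracy of a classifier that always predicts the larger class.
majority_baseline = max(support.values()) / total
# F1 recomputed from the rounded precision and recall as a consistency check.
lactose_f1 = (2 * lactose_precision * lactose_recall
              / (lactose_precision + lactose_recall))

print(f'Majority-class baseline accuracy: {majority_baseline:.3f}')
print(f'Lactose F1 from rounded values: {lactose_f1:.2f}')
```

The baseline comes out at roughly 0.70, so the reported test accuracy of about 0.79 is an improvement of around nine percentage points over always predicting the majority class. The recomputed F1 agrees with the 0.54 in the report.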


# Conclusions
The experiment with LinearSVC on the text data gave promising results and shows potential for further refinement and application. Future work could explore additional features, other classification algorithms, or a combination of text and image data to improve performance, in particular recall on the Lactose class. Experimenting with different regularization strengths, or with kernel functions (which would mean replacing LinearSVC with a kernel SVM such as sklearn's SVC), could also yield improvements. In conclusion, the model provides a good foundation for identifying lactose and non-lactose dishes, with room for further optimization.
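
One possible way to explore different regularization strengths is a cross-validated grid search over C. The sketch below uses scikit-learn's `GridSearchCV` on made-up dish names; the candidate values of C, the number of folds, and the choice of balanced accuracy as the score are assumptions for illustration and were not part of this experiment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy dish names and labels (1 = lactose, 0 = non-lactose); made up for illustration.
toy_names = ['cheese pizza', 'milk tea', 'butter chicken', 'cream pasta',
             'cheese sandwich', 'milk shake', 'fried rice', 'grilled salmon',
             'veg roll', 'boiled eggs', 'chicken rice', 'egg fried rice']
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Candidate C values are an assumption; make_pipeline names the step 'linearsvc'.
search = GridSearchCV(
    make_pipeline(TfidfVectorizer(), LinearSVC()),
    param_grid={'linearsvc__C': [0.01, 0.1, 1, 10]},
    scoring='balanced_accuracy',  # less sensitive to class imbalance than accuracy
    cv=3,
)
search.fit(toy_names, toy_labels)
print(search.best_params_)
```

On the real data, the search would be run on the training split only, keeping the 'Benchmark_A' test split untouched for the final evaluation.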
