This notebook contains the machine learning models that will be trained on the training data of the <i>food-101</i> dataset. Algorithms of different complexity will be used. It is good practice to start with the simplest model and try more complex ones later. Before we get our hands dirty with modeling, one more step lies between EDA and modeling: feature engineering. Normally feature engineering would get its own notebook, but here the dataset is already clean: images are arranged in proper folders, every food item has 1000 images (except for one, as seen in EDA), and the data is already split into training and test sets, so there is not much to do. Also, any changes we make to the dataset may depend on the model we choose.

Starting the first model.
# Support Vector Machine 

A support vector machine is a discriminative classifier formally defined by a separating hyperplane.

<i>SVM</i> is one of the simplest models that we can use for classification. Images of different sizes could affect how well <i>SVM</i> learns. However, this is just an assumption. To see whether it holds, we can train SVM on two datasets and compare the performance. To achieve this, let's create a copy of the dataset in which all images are stored as squares of size <b>300x300</b>.

### Feature engineering for SVM

A rectangular image can be made square in two ways: by resizing its dimensions or by cropping. Resizing keeps all the information but distorts the image, moving it away from what the food looks like in the real world. For example, let's see how the smallest image in the dataset looks when we resize it to a square.

<b>Original Image</b>

In [None]:
from matplotlib import pyplot as plt
from PIL import Image

%matplotlib inline

image = Image.open('../../data/raw/food-101/images/macarons/3247436.jpg')
plt.imshow(image)
plt.show()

<b>Resized Image</b>

In [None]:
import numpy as np

# Side of the square: square root of width * height, which keeps the area roughly the same
sqrWidth = np.ceil(np.sqrt(image.size[0] * image.size[1])).astype(int) 
im_resize = image.resize((sqrWidth, sqrWidth))
plt.imshow(im_resize)
plt.show()

This looks quite bad but still holds the information about the food. Let's see what happens when we cut the extra dimensions to make the image square.

In [None]:
# Create a new square image (black by default) whose side equals the shorter side of the
# original image (capped at max_size), then paste the original centered on it;
# the negative paste offsets crop the overflow on the longer side
def make_square(image, max_size=600, fill_color=(0, 0, 0)):
    x, y = image.size
    size = min(max_size, x, y)
    new_im = Image.new('RGB', (size, size), fill_color)
    new_im.paste(image, (int((size - x) / 2), int((size - y) / 2)))
    return new_im

new_image = make_square(image)
plt.imshow(new_image)
plt.show()

This looks more like a real image; in fact, cropping removes some of the background noise from the original image. However, we lose information when cropping.
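
The negative paste offsets in `make_square` amount to a centered crop. As a minimal sketch, the equivalent crop box can be computed directly. The helper `center_crop_box` is hypothetical and not part of the notebook; it returns a tuple in the (left, upper, right, lower) order that PIL's `Image.crop` expects.

```python
def center_crop_box(width, height, max_size=600):
    """Return (left, upper, right, lower) for a centered square crop."""
    # Same side length as make_square: the shorter side, capped at max_size
    size = min(max_size, width, height)
    left = (width - size) // 2
    upper = (height - size) // 2
    return (left, upper, left + size, upper + size)

# A 512x384 image loses 64 pixels on the left and 64 on the right
box = center_crop_box(512, 384)
print(box)
```

Passing this box to `image.crop(box)` should give the same visible region as pasting with negative offsets, without allocating a background image first.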

The latter method, cropping, has a variation that gives a new kind of square image. Instead of using the shorter side of the image, we can use the longer one and fill the extra space with black or white to produce a square image.

In [None]:
# Create a new square image (black by default) whose side equals the longer side of the
# original image, then paste the original centered on it; the leftover space keeps the fill color
def make_big_square(image, min_size=50, fill_color=(0, 0, 0)):
    x, y = image.size
    size = max(min_size, x, y)
    new_im = Image.new('RGB', (size, size), fill_color)
    new_im.paste(image, (int((size - x) / 2), int((size - y) / 2)))
    return new_im

new_image = make_big_square(image)
plt.imshow(new_image)
plt.show()

This keeps all the information and converts the image to a square, but it also adds a lot of extra pixels that carry no information about the food. We don't know yet whether this helps learning. We can create a new dataset of images in this format as well to compare the performance of the algorithm.

For the resized images, shrinking the longer dimension makes sense only up to a point. If the ratio of the dimensions is very high, the resized image can be far from realistic. Let's see how many images in the dataset have a ratio of 2:1 or more.

In [None]:
from tqdm import tqdm
import os

path = '../../data/raw/food-101/images'

imageCount = 0
fileNameList = []

for r, d, f in tqdm(os.walk(path)):
    for file in f:
        fileName = os.path.join(r, file)
        image = Image.open(fileName)
        # dividing the longer side of image with the smaller one
        ratio = (max(image.size[0],image.size[1]) / min(image.size[0],image.size[1])) 
        if(ratio >= 2):
            fileNameList.append(fileName)
            imageCount += 1

print("Number of images with ratio more than 2:1 are-" + str(imageCount))

So, there are 47 images with a ratio of 2:1 or more, which is negligible compared to the total of 100999 images. Let's display 3 images from the list.

In [None]:
from IPython.display import Image as Images, display
display(Images(filename=fileNameList[0]))
display(Images(filename=fileNameList[21]))
display(Images(filename=fileNameList[45]))

These 47 images also contain a lot of false images. Let's take a look at some of them. Because they are so few in number, we don't need to delete anything.

In [None]:
display(Images(filename=fileNameList[1]))
display(Images(filename=fileNameList[17]))
display(Images(filename=fileNameList[28]))
display(Images(filename=fileNameList[29]))

It is good that only 47 images have a dimension ratio of 2:1 or more, but we don't yet know how many images are rectangular. To compare how different algorithms behave with different image sizes and scaling, we need a good number of rectangular images. Let's count them.

In [None]:
path = '../../data/raw/food-101/images'

rectangleImageCount = 0

for r, d, f in tqdm(os.walk(path)):
    for file in f:
        fileName = os.path.join(r, file)
        image = Image.open(fileName)
        if(image.size[0] != image.size[1]):
            rectangleImageCount += 1

print("Number of rectangle images are: " + str(rectangleImageCount))

There are 38793 rectangular images in the dataset, which is 38.4% of the total. This is high enough to see how the different reshaping techniques change learning performance. Let's start by creating the first dataset.

### ImageShrink
The first dataset contains only square images, obtained by shrinking the longer dimension to match the shorter one. We made a copy of the dataset and will now replace each rectangular image in the copy with a square one.

In [None]:
path = '../../data/raw/food-101/imagesShrink'

for r, d, f in tqdm(os.walk(path)):
    for file in f:
        fileName = os.path.join(r, file)
        image = Image.open(fileName)
        # Finding the shorter dimension
        shorterDimension = min(image.size[0],image.size[1])
        im_resize = image.resize((shorterDimension, shorterDimension))
        # Replacing the original images with resized one.
        im_resize.save(fileName, 'JPEG' )
            

Let's see if that worked. Displaying a random rectangular image from both directories.

In [None]:
display(Images(filename='../../data/raw/food-101/images/apple_pie/693210.jpg'))
display(Images(filename='../../data/raw/food-101/imagesShrink/apple_pie/693210.jpg'))

This looks good. Moving on to another dataset in which the longer side is cropped to fit a square. To do this, we make a copy of the images dataset named imagesCrop and run the code below.

### ImageCrop

In [None]:
pathCrop = '../../data/raw/food-101/imagesCrop/'

for r, d, f in tqdm(os.walk(pathCrop)):
    for file in f:
        fileName = os.path.join(r, file)
        image = Image.open(fileName)
        # cropping the image using the make_square function used earlier
        new_image = make_square(image)
        # Replacing the original image with the cropped one
        new_image.save(fileName, 'JPEG' )

Let's see if this operation was completed successfully or not. 

In [None]:
display(Images(filename='../../data/raw/food-101/images/apple_pie/693210.jpg'))
display(Images(filename='../../data/raw/food-101/imagesCrop/apple_pie/693210.jpg'))

This worked. As we can see, part of the subject has been cropped out while transforming the image. This could be a problem when the food is not in the center of the image. If we see a big drop in performance for cropped images, we can look for a way to move the food to the center before cropping. For now, let's create the third dataset, where rectangular images are made square by extending the shorter dimension to match the longer one. To do this, we make a copy of the images dataset named imagesExtend and run the code below.

### ImagesExtend

In [None]:
pathExtend = '../../data/raw/food-101/imagesExtend/'

for r, d, f in tqdm(os.walk(pathExtend)):
    for file in f:
        fileName = os.path.join(r, file)
        image = Image.open(fileName)
        # padding the image to a square using the make_big_square function defined earlier
        new_image = make_big_square(image)
        # Replacing the original image with the padded one
        new_image.save(fileName, 'JPEG' )

Let's see if this operation was completed successfully or not. 

In [None]:
display(Images(filename='../../data/raw/food-101/images/apple_pie/693210.jpg'))
display(Images(filename='../../data/raw/food-101/imagesExtend/apple_pie/693210.jpg'))

Let's now start with the learning part and compare the performance of SVM on these four datasets. 

This will not be that simple, though, because each pixel is treated as a feature. So we start with the simplest configuration: we use the cropped square images and resize them all to the same size, 100x100 pixels. On top of that, color images have three channels, one each for red, green and blue. We can avoid that for now by converting all images to grayscale. We created a new directory named imagesCrop100x100; the code below converts all images in that folder to 100x100 grayscale.

In [None]:
pathExtend = '../../data/raw/food-101/imagesCrop100x100/'

for r, d, f in tqdm(os.walk(pathExtend)):
    for file in f:
        fileName = os.path.join(r, file)
        # converting image to greyscale
        image = Image.open(fileName).convert("RGB")
        image = image.convert('L')
        # resizing image to 100x100
        im_resize = image.resize((100, 100))
        # Replacing the original images with resized one.
        im_resize.save(fileName, 'JPEG' )
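
Since each pixel is treated as a feature, the effect of these two choices on the feature count can be checked with simple arithmetic. This is only an illustrative sketch; `feature_count` is a hypothetical helper, not something the notebook defines.

```python
def feature_count(width, height, channels):
    """Number of features after flattening an image of the given shape."""
    return width * height * channels

rgb_features = feature_count(100, 100, 3)   # color image: one value per channel per pixel
gray_features = feature_count(100, 100, 1)  # grayscale image: one value per pixel
print(rgb_features, gray_features)
```

Converting to grayscale cuts the number of features per image to a third, from 30000 to 10000.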

Let's take a look at the image now

In [None]:
display(Images(filename='../../data/raw/food-101/images/apple_pie/693210.jpg'))
display(Images(filename='../../data/raw/food-101/imagesCrop100x100/apple_pie/693210.jpg'))

This looks right. Since a computer only understands numbers, let's convert the images to arrays of pixel values.

In [None]:
from skimage.io import imread
import numpy as np

pathExtend = '../../data/raw/food-101/imagesCrop100x100/'

count = 0
imagesData = []
for r, d, f in tqdm(os.walk(pathExtend)):
    for file in f:
        fileName = os.path.join(r, file)
        # reading the pixel values into a matrix
        image2DArray = imread(fileName)
        # converting the matrix to a 1D array
        flattenedImage = image2DArray.flatten()
        flattenedImage = np.array(flattenedImage)
        # Appending the file name to the array 
        flattenedImage = np.append(flattenedImage, int(file[:-4]))
        # Appending the class label to the array
        flattenedImage = np.append(flattenedImage, count)
        imagesData.append(flattenedImage)
    count += 1
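
One caveat about the loop above: the label comes from a counter that advances once per directory visited by `os.walk`, so it depends on the traversal order, which `os.walk` does not guarantee to be alphabetical. A minimal sketch of a more explicit mapping follows. The helper `build_label_map` and the three sample folder names are illustrative assumptions, not part of the notebook.

```python
def build_label_map(class_names):
    """Map each class folder name to a 1-based label in alphabetical order."""
    return {name: label for label, name in enumerate(sorted(class_names), start=1)}

# Hypothetical subset of class folder names, deliberately out of order
label_map = build_label_map(["waffles", "bibimbap", "apple_pie"])
print(label_map)
```

Inside the loop, the folder name could be taken from `os.path.basename(r)` and looked up in such a map instead of relying on `count`.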

Well, a lot happened above. Let's take a look at the values by inspecting an arbitrarily chosen entry, index 7890 of the list.

In [None]:
print(imagesData[7890])
print("Total number of data points are: " + str(len(imagesData[7890])))

This gives an idea of the data. There are 10002 values per image, which makes sense: 10000 of them are the pixel values of the 100x100 grayscale image. The second-to-last value is the name of the image and can be used to identify it. The last value is the label of the image and serves as the class for classification, going from <b>1</b> for <i>apple pie</i> to <b>101</b> for <i>waffles</i>.

Going by that logic, the entry at index 7890 should be an image of <i>bibimbap</i>, namely 597420.jpg. Let's display both images, i.e. the original and the grayscale 100x100 version. We can also regenerate the image from the pixel values for comparison.

In [None]:
display(Images(filename='../../data/raw/food-101/images/bibimbap/597420.jpg'))
display(Images(filename='../../data/raw/food-101/imagesCrop100x100/bibimbap/597420.jpg'))

Well this looks good. Let's regenerate the image from pixel values to see if it matches the image above.

In [None]:
imagePixels = imagesData[7890][:-2]

# Convert the pixels into an array using numpy
imagePixels = np.array(imagePixels, dtype=np.uint8)

# reshaping to a 2D array
imagePixels = np.reshape(imagePixels, (-1, 100))

# Use PIL to create an image from the new array of pixels
new_image = Image.fromarray(imagePixels)
# displaying image in grayscale
plt.imshow(new_image, cmap = plt.cm.gray)
plt.show()

Bingo!! The regenerated image matches the grayscale image above. We can now move to the next step, i.e. splitting the data into training and test sets. Luckily, the data came already split. If we look into the <i>meta</i> directory, we see 4 files, 2 JSON and 2 text. The JSON and text files contain the same information in different formats. Of the JSON files, one is <i>test</i> and the other is <i>train</i>.

The image data is available in pixel form in the variable <b>imagesData</b>. From it we can create two variables called <b>trainDataList</b> and <b>testDataList</b>. We can also store these lists in files so that we can load them quickly next time.

In [None]:
import json
import sys

# changing the output to maxmimum size 
np.set_printoptions(threshold=sys.maxsize)

# Reading the train json file and converting the data to a dictionary
with open(r'..\..\data\raw\food-101\meta\train.json', 'r') as f:
    trainDataDictionary = json.load(f)
    
# Writing the data to a file for reuse 
trainingDataFile = open(r'..\..\data\processed\trainingimageData.txt','w')

# Building a list with only those rows of imagesData that belong to the training set
trainDataList = []
for key in tqdm(trainDataDictionary.keys()):
    for val in trainDataDictionary[key]:
        imageName = val.split('/')[1]
        for imageDataArray in imagesData:
            if(int(imageName) == imageDataArray[-2]):
                trainDataList.append(imageDataArray)
                trainingDataFile.write(str(imageDataArray))
                trainingDataFile.write('\n')              
trainingDataFile.close()

# doing the same for the test data 
# Reading the test json file and converting the data to a dictionary
with open(r'..\..\data\raw\food-101\meta\test.json', 'r') as f:
    testDataDictionary = json.load(f)
    
# Writing the data to a file for reuse 
testDataFile = open(r'..\..\data\processed\testimageData.txt','w')
    
# Building a list with only those rows of imagesData that belong to the test set
testDataList = []
for key in tqdm(testDataDictionary.keys()):
    for val in testDataDictionary[key]:
        imageName = val.split('/')[1]
        for imageDataArray in imagesData:
            if(int(imageName) == imageDataArray[-2]):
                testDataList.append(imageDataArray)
                testDataFile.write(str(imageDataArray))
                testDataFile.write('\n')              
testDataFile.close()
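
The nested loops above scan all of `imagesData` once for every entry in the JSON file, which gets slow for a dataset of this size. A minimal sketch of the same matching with a dictionary lookup follows. It assumes, as the loops above implicitly do, that the image id alone identifies a row. The helper names and the toy data are made up for illustration.

```python
def index_by_image_id(rows):
    """Map the image id (second-to-last value of each row) to the row itself."""
    return {int(row[-2]): row for row in rows}

def select_rows(split, rows_by_id):
    """Collect the rows for every 'class_name/image_id' entry of a split dictionary."""
    selected = []
    for entries in split.values():
        for entry in entries:
            image_id = int(entry.split('/')[1])
            row = rows_by_id.get(image_id)
            if row is not None:
                selected.append(row)
    return selected

# Toy stand-ins for imagesData and the contents of train.json (made-up values)
toy_rows = [[10, 20, 1001, 1], [30, 40, 1002, 1], [50, 60, 2001, 2]]
toy_split = {"apple_pie": ["apple_pie/1002"], "baby_back_ribs": ["baby_back_ribs/2001"]}

train_rows = select_rows(toy_split, index_by_image_id(toy_rows))
print(train_rows)
```

Building the index costs one pass over the rows, after which each JSON entry is resolved in constant time.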

We are also going to store the data in an HDF5 file. For large amounts of data, reading and writing HDF5 is much faster than using a text file.

In [None]:
import json
import sys
import h5py

# Address to store the HDF5 file 
hdf5Path = r'..\..\data\processed\dataset.hdf5'

# Reading the train json file and converting the data to a dictionary
with open(r'..\..\data\raw\food-101\meta\train.json', 'r') as f:
    trainDataDictionary = json.load(f)

# Fixing a shape for the array in which image data will be stored
trainShape = (75750, 10002)

# Open the hdf5 file in write mode
hdf5File = h5py.File(hdf5Path, mode='w')
hdf5File.create_dataset("train_images", trainShape, np.uint32)

count = 0

# Storing the data to the HDF5 file   
for key in tqdm(trainDataDictionary.keys()):
    for val in trainDataDictionary[key]:
        imageName = val.split('/')[1]
        for imageDataArray in imagesData:
            if(int(imageName) == imageDataArray[-2]):
                hdf5File["train_images"][count, ...] = imageDataArray     
        count += 1

# doing the same for the test data 
# Reading the test json file and converting the data to a dictionary
with open(r'..\..\data\raw\food-101\meta\test.json', 'r') as f:
    testDataDictionary = json.load(f)
    
# Fixing a shape for the array in which image data will be stored
testShape = (25250, 10002)

hdf5File.create_dataset("test_images", testShape, np.uint32)

count = 0

# Storing the test data to the HDF5 file
for key in tqdm(testDataDictionary.keys()):
    for val in testDataDictionary[key]:
        imageName = val.split('/')[1]
        for imageDataArray in imagesData:
            if(int(imageName) == imageDataArray[-2]):
                hdf5File["test_images"][count, ...] = imageDataArray  
        count += 1

hdf5File.close()

Interesting: the <i>testimageData</i> file is 1.85 GB and the <i>trainingimageData</i> file is 5.56 GB. These are big files. The <i>dataset.hdf5</i> file, which contains both the train and test data, is 963 MB. This is a big difference. Apart from the file size, reading and writing the .hdf5 file is much faster than text or .csv files when there is a lot of data.
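
As a rough sanity check on these sizes, the uncompressed size of the stored arrays can be estimated from their shape and the bytes per value. This back-of-the-envelope sketch ignores HDF5 metadata and chunking overhead, and `raw_size_bytes` is a hypothetical helper.

```python
def raw_size_bytes(rows, columns, bytes_per_value):
    """Uncompressed size of a 2D array with fixed-width values."""
    return rows * columns * bytes_per_value

total_rows = 75750 + 25250   # train rows + test rows
columns = 10002              # 10000 pixels + image name + label

size_uint32 = raw_size_bytes(total_rows, columns, 4)  # four bytes per value
size_uint8 = raw_size_bytes(total_rows, columns, 1)   # one byte per value

print(round(size_uint32 / 1024**2), "MiB at 4 bytes per value")
print(round(size_uint8 / 1024**2), "MiB at 1 byte per value")
```

A text file stores every value as decimal digits plus separators, which is why it comes out larger than a binary layout. The one-byte figure is close to the reported size of <i>dataset.hdf5</i>. Whether that file was written with a narrower type than the `np.uint32` shown above is not something this estimate can confirm.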

Let's read back the values from the file into some variables. 

In [None]:
import h5py

# Address to store the HDF5 file 
hdf5Path = r'..\..\data\processed\dataset.hdf5'

# open the hdf5 file
hdf5File = h5py.File(hdf5Path, "r")

trainData = hdf5File["train_images"][:]
testData = hdf5File["test_images"][:]

hdf5File.close()

This was crazy fast!!! Let's separate the labels and remove the image name from the image data.

In [None]:
trainLabels = trainData[:,-1:]
trainDataCopy = np.copy(trainData)
trainDataCopy = trainDataCopy[:,:-2] 

testLabels = testData[:,-1:]
testDataCopy = np.copy(testData)
testDataCopy = testDataCopy[:,:-2] 

We are going to use the scikit-learn library to fit the SVM model to the data. But before that, we need to put the data into a form the model accepts easily.

In [None]:
import pandas as pd

# Converting to a dataframe
trainDataCopy = pd.DataFrame(trainDataCopy)

# Flattening the labels to be a 1D array
trainLabels = trainLabels.flatten()

This looks ready to go into the classifier.

In [None]:
from sklearn import svm
from sklearn.model_selection import GridSearchCV

param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
 ]

svc = svm.SVC(verbose=True)
clf = GridSearchCV(svc, param_grid, verbose=True)
clf.fit(trainDataCopy, trainLabels)

This looks like a big task for SVM. We can reduce the problem size to make it easier for the algorithm to learn and see the difference in performance. Instead of treating it as a 101-class problem, we can do a binary classification first. From the same dataset we can take just the data for <span style="color:green"><b>Cup Cakes</b></span> and <span style="color:green"><b>Donuts</b></span>.

To achieve this we can read the train and test data again, but this time only the values given for <i>donuts</i> and <i>cup_cakes</i>. Let's retrieve this data from the json files.

In [None]:
# Reading the train and test json file and converting the data to a List
with open(r'..\..\data\raw\food-101\meta\train.json', 'r') as file:
    binaryTrainFoodJson = json.load(file)
    
keys = ["donuts", "cup_cakes"]

binaryTrainList = [binaryTrainFoodJson.get(key) for key in keys]

with open(r'..\..\data\raw\food-101\meta\test.json', 'r') as file:
    binaryTestFoodJson = json.load(file)
    
keys = ["donuts", "cup_cakes"]

binaryTestList = [binaryTestFoodJson.get(key) for key in keys]

Awesome! So, we have the train and test image names in <i>binaryTrainList</i> and <i>binaryTestList</i>. Let's store the matching image data in the HDF5 file.

In [None]:
import json
import sys
import h5py

# Address to store the HDF5 file 
hdf5Path = r'..\..\data\processed\dataset.hdf5'

# Fixing a shape for the array in which image data will be stored
trainShape = (1500, 10002)

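# Open the hdf5 file in write mode (note: mode='w' truncates an existing file,
# so the full train/test datasets written earlier to this path are overwritten)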
hdf5File = h5py.File(hdf5Path, mode='w')
hdf5File.create_dataset("subset_train_images", trainShape, np.uint32)

count = 0

# Storing the data to the HDF5 file   
for i in tqdm(range(2)):
    for item in binaryTrainList[i]:
        imageName = item.split('/')[1]
        for imageDataArray in imagesData:
            if(int(imageName) == imageDataArray[-2]):
                hdf5File["subset_train_images"][count, ...] = imageDataArray     
        count += 1

# doing the same for the test data 
# Fixing a shape for the array in which image data will be stored
testShape = (500, 10002)

hdf5File.create_dataset("subset_test_images", testShape, np.uint32)

count = 0

# Storing the test data of the two classes to the HDF5 file
for i in tqdm(range(2)):
    for item in binaryTestList[i]:
        imageName = item.split('/')[1]
        for imageDataArray in imagesData:
            if(int(imageName) == imageDataArray[-2]):
                hdf5File["subset_test_images"][count, ...] = imageDataArray  
        count += 1

hdf5File.close()

With the data stored, let's rerun the training.

In [None]:
# Address to store the HDF5 file 
hdf5Path = r'..\..\data\processed\dataset.hdf5'

# open the hdf5 file
hdf5File = h5py.File(hdf5Path, "r")

trainData = hdf5File["subset_train_images"][:]
testData = hdf5File["subset_test_images"][:]

hdf5File.close()

Separating label and removing the image name from the dataset

In [None]:
trainLabels = trainData[:,-1:]
trainDataCopy = np.copy(trainData)
trainDataCopy = trainDataCopy[:,:-2] 

testLabels = testData[:,-1:]
testDataCopy = np.copy(testData)
testDataCopy = testDataCopy[:,:-2] 

Doing data transformation

In [None]:
import pandas as pd

# Converting to a dataframe
trainDataCopy = pd.DataFrame(trainDataCopy)
testDataCopy = pd.DataFrame(testDataCopy)

# Flattening the labels to be a 1D array
trainLabels = trainLabels.flatten()
testLabels = testLabels.flatten()

Classification using SVM

In [None]:
from sklearn import svm

#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel

#Train the model using the training sets
clf.fit(trainDataCopy, trainLabels)

#Predict the response for test dataset
testPrediction = clf.predict(testDataCopy)

With training done, let's look at the performance, starting with the accuracy.

In [None]:
from sklearn import metrics

print("Accuracy:", metrics.accuracy_score(testLabels, testPrediction)) 

So, we got 53.4% accuracy, which is barely better than chance for a balanced two-class problem. However, accuracy is not always the best performance indicator for classification. Let's check the precision and recall.

In [None]:
print("Precision:", metrics.precision_score(testLabels, testPrediction, pos_label=30))
print("Recall:", metrics.recall_score(testLabels, testPrediction, pos_label=30))

With precision at 53.23% and recall at 56%, these are poor values as well.
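
For reference, precision is the share of predicted positives that are truly positive, and recall is the share of true positives that were found. The following sketch computes both by hand on made-up labels. The helper `precision_recall` is hypothetical, and the use of 32 as the negative class is an assumption for illustration only.

```python
def precision_recall(y_true, y_pred, pos_label):
    """Compute precision and recall for one positive class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == pos_label and t == pos_label)  # true positives
    fp = sum(1 for t, p in pairs if p == pos_label and t != pos_label)  # false positives
    fn = sum(1 for t, p in pairs if p != pos_label and t == pos_label)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels: 30 is the positive class, 32 the negative class
y_true = [30, 30, 30, 32, 32, 32]
y_pred = [30, 30, 32, 30, 32, 32]

precision, recall = precision_recall(y_true, y_pred, pos_label=30)
print(precision, recall)
```

Here two of the three predicted positives are correct and two of the three actual positives are found, so both values come out at two thirds.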

We ran the algorithm on a very basic configuration. Let's run the algorithm again with a different kernel and some other hyper-parameter values. 

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001, 0.0005, 0.005], 'kernel': ['rbf']},
 ]

svc = svm.SVC(verbose=True)
clf = GridSearchCV(svc, param_grid, verbose=True)

%time clf.fit(trainDataCopy, trainLabels)
print(clf.best_params_)
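
To get a feel for how long such a search takes, the number of model fits can be counted up front. This sketch assumes 5-fold cross-validation, which is the default in recent scikit-learn versions when `cv` is not passed; `count_candidates` is a hypothetical helper.

```python
from itertools import product

def count_candidates(param_grid):
    """Count the parameter combinations in a list of grid dictionaries."""
    total = 0
    for grid in param_grid:
        # Each dictionary expands to the Cartesian product of its value lists
        total += len(list(product(*grid.values())))
    return total

param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001, 0.0005, 0.005], 'kernel': ['rbf']},
]

candidates = count_candidates(param_grid)
folds = 5  # assumption: default number of cross-validation folds
total_fits = candidates * folds
print(candidates, total_fits)
```

With the default `refit=True`, one more fit on the whole training set follows for the best combination.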

The best results were for {'C': 1, 'kernel': 'linear'}, which is the same configuration that ran the first time. Let's see how we can improve on it. We can try to extract some meaningful features: principal component analysis (PCA) can extract 150 fundamental components of the images to feed into our support vector machine classifier.

In [None]:
from sklearn.decomposition import PCA as RandomizedPCA
from sklearn.pipeline import make_pipeline
from sklearn import svm

pca = RandomizedPCA(n_components=150, whiten=True, random_state=42)
clf = svm.SVC(kernel='rbf', class_weight='balanced')
model = make_pipeline(pca, clf)

#Train the model using the training sets
model.fit(trainDataCopy, trainLabels)

#Predict the response for test dataset
testPrediction = model.predict(testDataCopy)
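
What the PCA step does can be sketched with NumPy alone: center the data, find the directions of largest variance with a singular value decomposition, and project onto the first few of them. This is an illustrative sketch on random toy data, not the scikit-learn implementation. It omits the whitening that `whiten=True` applies, and `pca_project` is a hypothetical helper.

```python
import numpy as np

def pca_project(X, n_components):
    """Project the rows of X onto the top principal components using SVD."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = (S[:n_components] ** 2) / (X.shape[0] - 1)
    projected = centered @ components.T
    return projected, explained_variance

rng = np.random.default_rng(42)
toy_images = rng.integers(0, 256, size=(20, 64))  # 20 fake flattened 8x8 grayscale images

projected, variance = pca_project(toy_images, n_components=5)
print(projected.shape)
```

Each image is reduced from 64 values to 5, in the same way the pipeline above reduces 10000 pixel values to 150 components before they reach the SVM.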

In [None]:
print("Accuracy:", metrics.accuracy_score(testLabels, testPrediction)) 

print("Precision:", metrics.precision_score(testLabels, testPrediction, pos_label=30))
print("Recall:", metrics.recall_score(testLabels, testPrediction, pos_label=30))

With accuracy at 61.6%, precision at 62.5% and recall at 58%, we see a jump in all performance measures from using a different kernel and 150 principal components. Let's try a combination of hyper-parameters again on this new configuration.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'svc__C': [1, 5, 10, 50],
              'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]}

grid = GridSearchCV(model, param_grid)

%time grid.fit(trainDataCopy, trainLabels)
print(grid.best_params_)

The best configuration turns out to be C = 5 and gamma = 0.005. Let's see the performance with this configuration.

In [None]:
model = grid.best_estimator_
testPrediction = model.predict(testDataCopy)

In [None]:
print("Accuracy:", metrics.accuracy_score(testLabels, testPrediction)) 

print("Precision:", metrics.precision_score(testLabels, testPrediction, pos_label=30))
print("Recall:", metrics.recall_score(testLabels, testPrediction, pos_label=30))

With accuracy at 61%, precision at 61.9% and recall at 58%, it is nearly the same. In fact, it is fractionally lower.

Things to remember
* Images where we lose information if we crop them to make them square
* Using an algorithm to find the important part of the food image, which can also be used to check performance, especially for cropped images
* Checking performance with and without data augmentation
* After making images square, checking performance with and without resizing all images to the same size
* Using at least 5 different classifiers to compare performance