# Image Classification for Amazon Products

**Purpose**:

Image classification is a crucial task for online marketplaces such as Amazon, which rely on images to showcase products to customers. Amazon sells a vast array of products, from electronics to clothing, and image classification can help categorize them accurately. This proposal describes a machine-learning project for classifying images of Amazon products.
Specifically, we will use our own customized web scraping tool to download thousands of images for a range of human-wearable products, including earbuds, VR sets, fitness trackers, hearing aids, watches, sunglasses, and hats. Our goal is to build a model that accurately classifies the product shown in each image.


**Methodology**:

The first step in this project is data preprocessing. The images are resized to a standard size, and the pixel values are normalized to a range of [0, 1]. The images are also labelled based on the product category to which they belong. Then, the data is split into training and validation sets, where the training set is used to train the model, and the validation set is used to evaluate the performance of the model during training.
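
The normalization and split described above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's actual pipeline: the random `images` array, the four hypothetical classes, the 80/20 split ratio, and the use of `stratify` are all assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for already-resized images (assumption): 40 RGB images
# of 150x150 pixels with raw pixel values in [0, 255].
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(40, 150, 150, 3), dtype=np.uint8).astype("float32")
labels = np.repeat(np.arange(4), 10)  # 4 hypothetical classes, 10 images each

# Normalize pixel values from [0, 255] to [0, 1].
images_scaled = images / 255.0

# Hold out 20% of the images for validation (ratio is an assumption).
# stratify=labels keeps the class proportions the same in both sets.
X_train, X_val, y_train, y_val = train_test_split(
    images_scaled, labels, test_size=0.2, stratify=labels, random_state=25
)
print(X_train.shape, X_val.shape)
```

Stratifying is optional, but it avoids a validation set that happens to miss a small class.
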
As an exploration step, we will first use conventional machine learning models such as logistic regression, K-nearest neighbours (KNN), support vector machines (SVM), and a Gaussian naive Bayes classifier. Each model is trained on the labelled images, and its performance is evaluated with cross-validation and on the separate validation set. To reduce computational cost, principal component analysis (PCA) is applied as a dimensionality reduction step before each model. PCA reduces the dimensionality of the image data by projecting the pixel values onto the directions of greatest variance, rather than by selecting individual pixels. This can improve the efficiency of the machine learning algorithms and may reduce overfitting. Finally, each trained model is used to classify new images into their respective product categories.
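
One possible way to combine PCA, a classifier, and cross-validated hyperparameter search is a scikit-learn pipeline. The sketch below uses synthetic feature vectors in place of flattened images; the class shift, the 32x32x3 vector length, the choice of KNN, and the parameter grid are illustrative assumptions rather than the settings used in this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_per_class, n_classes, n_pixels = 30, 3, 32 * 32 * 3

# Synthetic "flattened images" (assumption): class k has every feature shifted by k.
labels = np.repeat(np.arange(n_classes), n_per_class)
X = rng.normal(loc=labels[:, None] * 1.0, scale=1.0,
               size=(n_per_class * n_classes, n_pixels))

# PCA sits inside the pipeline, so it is re-fitted on the training folds only
# and never sees the held-out fold during cross-validation.
pipeline = make_pipeline(PCA(random_state=0), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name.
param_grid = {
    "pca__n_components": [5, 10],
    "kneighborsclassifier__n_neighbors": [3, 5],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, labels)
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern should work with the other classifiers by swapping the last pipeline step and its grid entries.
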
In contrast to conventional machine learning models, we will also use deep learning methods such as convolutional neural networks (CNNs) to classify the images, which are often better suited to complex image datasets. This method will be covered in later classes; the general idea is that the model applies convolutional filters to the input image to extract features at different spatial scales. The extracted features are then fed into a series of fully connected layers that perform the classification task. The output of the model is a softmax layer that predicts a probability distribution over the product categories.
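
To make the CNN building blocks concrete, here is a small NumPy-only sketch of a single convolutional filter, a ReLU, one fully connected layer, and a softmax. It is not a trained network: the 8x8 input, the edge-detecting kernel, the random weights, and the helper names `conv2d` and `softmax` are all assumptions chosen for illustration.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D cross-correlation of a single-channel image with a kernel
    (the operation deep learning libraries usually call convolution)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the kernel with the current image patch.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def softmax(logits):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
image = rng.random((8, 8))                                # toy grayscale image
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])   # vertical-edge filter
features = np.maximum(conv2d(image, kernel), 0)           # ReLU activation

# Fully connected layer with random, untrained weights for 12 classes.
weights = rng.normal(size=(features.size, 12))
probs = softmax(features.ravel() @ weights)
print(features.shape, probs.shape, probs.sum())
```

A real CNN stacks many such filters, learns the kernel and weight values from data, and would normally be built with a deep learning framework rather than by hand.
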


First, we will try a range of conventional machine learning models such as a random forest classifier, KNN, a decision tree classifier, and a naive Bayes classifier. Later, we will also implement a deep learning model such as a CNN.

The overall layout for this analysis is:
1. Import all required packages
2. Load the data and label each image
3. Visualize some of the images and process them
4. Try different machine learning models, find the best hyperparameters, and evaluate their respective performance
5. Use the trained model to make predictions

In [28]:
import numpy as np
import random
import pandas as pd
import seaborn as sb
from matplotlib import pyplot as plt

from pathlib import Path
import os
import re

# cv2 is provided by the opencv-python package (pip install opencv-python)
import cv2

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,accuracy_score,mean_squared_error 
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.utils import shuffle

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn import svm

In [25]:
# Collect the class names: one sub-folder of ./images per class, skipping hidden entries
class_names = [class_name for class_name in os.listdir("./images") if not class_name.startswith(".")]
# Map each class name to an integer label
class_names_label = {class_name:i for i, class_name in enumerate(class_names)}
nb_classes = len(class_names)
IMAGE_SIZE = (150, 150)
print(f"We have in total {nb_classes} different classes."+
      f"\nAnd they are:\n {', '.join(class_names)}.")

We have in total 12 different classes.
And they are:
 tshirt, sunglasses, watches, speaker, chair, pens, shorts, phone, earbuds, hat, shoes, bottle.


## Load Data

In [26]:
# '''
# os.walk() generates the file names in a directory tree by walking the tree either top-down or bottom-up.
# For each directory in the tree rooted at directory top (including top itself),
# it yields a 3-tuple (dirpath, dirnames, filenames).
# For example, with 12 folders in ./images, it will loop 13 times = root + 12 folders.
# Read more in https://www.geeksforgeeks.org/os-walk-python/
# '''
# class_name = []
# for root, dirs, files in os.walk("./images", topdown = True):
#     for name in files:
#         print(os.path.join(root, name))

# for i in range(10):
#     plt.imshow(images[i].astype('uint8'))

In [34]:
#Load the data from each folder
images = []
labels = []
IMAGE_SIZE = (150, 150)

for folder in os.listdir("./images"):
    # Skip hidden entries, because some configuration files also sit in this directory
    if folder.startswith("."):
        continue
    label = class_names_label[folder]
    for file in os.listdir(os.path.join("./images", folder)):
        # Get the path name of the image
        if file.startswith("."):
            continue
        img_path = os.path.join("./images", folder, file)
        # Read the image from disk as a 3D array (height x width x BGR channels)
        image = cv2.imread(img_path)
        # cv2.imread returns None for unreadable or non-image files; skip those
        if image is None:
            continue
        # OpenCV loads images in BGR order; convert to RGB for plotting with matplotlib
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        # Resize every image to the same fixed size so they can be stacked into one array
        image = cv2.resize(image, IMAGE_SIZE)
        
        # Append the image and its corresponding label to the output
        images.append(image)
        labels.append(label)
        
images = np.array(images, dtype = 'float32')
labels = np.array(labels, dtype = 'int32')

# Shuffle images and labels with the same random permutation so each image keeps its label.
# sklearn.utils.shuffle is a convenience alias for resample(*arrays, replace=False).
# Without shuffling, the images would stay grouped together by class.
images, labels = shuffle(images, labels, random_state=25)
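
After loading, it can be useful to check how many images each class contributed, since a strongly imbalanced dataset affects how accuracy should be read. The helper below is a hypothetical addition (the name `class_counts` and the toy labels are assumptions, not part of the original notebook); with the real data it would be called as `class_counts(labels, class_names)`.

```python
import numpy as np

def class_counts(labels, class_names):
    """Return a dict mapping each class name to its number of images."""
    # bincount counts how often each integer label occurs;
    # minlength keeps classes with zero images in the result.
    counts = np.bincount(labels, minlength=len(class_names))
    return {name: int(n) for name, n in zip(class_names, counts)}

# Toy example with three of the class names and six made-up labels.
example_names = ["tshirt", "sunglasses", "watches"]
example_labels = np.array([0, 2, 1, 0, 0, 2], dtype="int32")
print(class_counts(example_labels, example_names))
# -> {'tshirt': 3, 'sunglasses': 1, 'watches': 2}
```
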

## Visualization