## Overview
For this project, I wanted to explore image recognition through supervised machine learning. While researching image classifiers, I found that a popular solution is a Convolutional Neural Network built with the Keras library. Keras is a deep learning library written in Python that is used for easy and fast prototyping.

This project is twofold:
1. Creating an image classifier
2. Exploring practical uses for the model
                         
For this project, I will be using fashion product images. A common product-related business question that machine learning can help answer is "how might we make recommendations based on what users like or dislike?" While this project does not go as far as creating a complete solution for this case, it tackles the first steps towards it.

## Method

Since I planned on training an image classifier, I needed a lot of images and did not want to compile them myself. I found this [Fashion Product Data Set](https://www.kaggle.com/paramaggarwal/fashion-product-images-small) on Kaggle. The data set includes over 44,000 images covering a variety of product types. Additionally, it includes multiple attributes and categories that the model can be trained on.

This dataset also came with [Starter Code](https://www.kaggle.com/paramaggarwal/fashion-product-images-classifier) for an image classifier that was automatically generated by a Kaggle bot. I used this code as a base, making modifications and additions drawn from other example image classifiers across the web. In particular, I wanted to follow some best practices, such as further segmenting the data into training, validation, and testing sets. I also added a step that evaluates the model and predicts the output.


## Analysis



#### Load Libraries
The first step is to load all the libraries for this project. These include essentials such as numpy and pandas, as well as the machine learning libraries keras and sklearn.

In [None]:

#setting up all the essentials
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import matplotlib.image as mpimg


#machine learning libraries 
from keras.models import Sequential, Model
from keras_preprocessing.image import ImageDataGenerator
from keras.layers import Dense, Activation, Flatten, Dropout, BatchNormalization, Conv2D, MaxPooling2D, GlobalAveragePooling2D
from keras import regularizers, optimizers
from keras.applications.mobilenet_v2 import MobileNetV2
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler

import os # accessing directory structure

#### Importing Data

The next step was setting up the data for further use. Two folders were provided in the Fashion Product dataset: one for the training data and one for the testing data. The next set of code imports the CSV file. Since each image file is named after the row ID in the CSV, creating a new column with the ID and the .jpg extension makes the image easy to access in later code. This was done for both the training and test datasets. Then we view each table to check that it's correct and to see what other information the dataset has.

In [None]:
#import and setup training dataset
TRAINING_DATASET_PATH = "/kaggle/input/myntradataset/"
#skip malformed rows (pandas 1.3+ spells this option on_bad_lines="skip")
training_data = pd.read_csv(TRAINING_DATASET_PATH + "styles.csv", error_bad_lines=False)
#each image file is named "<id>.jpg", so build the filename from the id column
training_data['image'] = training_data['id'].astype(str) + ".jpg"
#shuffle the rows
training_data = training_data.sample(frac=1).reset_index(drop=True)
training_data.head(10)

In [None]:
#import and setup test dataset
TEST_DATASET_PATH = "/kaggle/input/"
#read only the first 128 rows and skip malformed ones
test_data = pd.read_csv(TEST_DATASET_PATH + "styles.csv", nrows=128, error_bad_lines=False)
#each image file is named "<id>.jpg", so build the filename from the id column
test_data['image'] = test_data['id'].astype(str) + ".jpg"
#shuffle the rows
test_data = test_data.sample(frac=1).reset_index(drop=True)
test_data.head(10)

#### Image Pre-Processing
In this section, we pre-process the images to be ready for the neural network. 

As if our dataset wasn't big enough, a very common practice is to create additional images for training through data augmentation. Here we generate images by shearing, zooming, and flipping the originals. This gives the model additional training material, which generally improves accuracy and makes for a better model.

While we already have a training and test dataset, we are going to further split the training set into two segments: training and validation. There are various schools of thought on the ideal ratio (80/20, 70/30, etc.); however, it seems to depend on your data and finding the best fit, so we're going to start with 80/20. The rationale behind having these 3 segments is to have one for training, one for validating the model, and another for further evaluation of how the model would work on brand new images.
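As a rough illustration of the three segments described above, the sketch below partitions a list of image IDs into training, validation, and test subsets. The helper `three_way_split`, the 10% test fraction, and the example IDs are hypothetical and are not part of this notebook, which relies on Keras's `validation_split` and a separate test CSV instead.

```python
import random

def three_way_split(items, validation_fraction=0.2, test_fraction=0.1, seed=42):
    """Shuffle items and split them into (train, validation, test) lists.

    Hypothetical helper for illustration only.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    # Hold out the test set first, then carve validation out of the remainder.
    n_test = round(len(shuffled) * test_fraction)
    n_valid = round((len(shuffled) - n_test) * validation_fraction)

    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test

image_ids = list(range(1000))  # stand-in for the dataset's image IDs
train_ids, valid_ids, test_ids = three_way_split(image_ids)
print(len(train_ids), len(valid_ids), len(test_ids))
```

Because the three lists never overlap, the test images stay unseen until the final evaluation.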

Lastly, we identify what the images are going to be trained on. To start, I've chosen the masterCategory column as the class the model will train for. The masterCategory has 5 broad fashion product types: Accessories, Apparel, Footwear, Free Items, and Personal Care. I thought this would be the easiest starting point since there aren't too many categories and they are distinct, as opposed to other columns in the dataset that may be more ambiguous and harder to train on.

In [None]:
#create additional images for training, while splitting dataset into two datasets for training and validation data
train_datagen = ImageDataGenerator(
        rescale=1./255,
        validation_split=0.2,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True)

#process training dataset images
training_generator = train_datagen.flow_from_dataframe(
    dataframe=training_data,
    directory=TRAINING_DATASET_PATH + "images",
    x_col="image",
    y_col="masterCategory",
    target_size=(96,96),
    batch_size=32,
    subset="training"
)

#process validation dataset images
valid_generator = train_datagen.flow_from_dataframe(
    dataframe=training_data,
    directory=TRAINING_DATASET_PATH + "images",
    x_col="image",
    y_col="masterCategory",
    target_size=(96,96),
    batch_size=32,
    subset="validation"
)

#process testing dataset images
test_datagen = ImageDataGenerator(rescale=1./255)
test_generator = test_datagen.flow_from_dataframe(
    dataframe=test_data,
    directory=TEST_DATASET_PATH + "images",
    x_col="image",
    y_col="masterCategory",
    target_size=(96,96),
    batch_size=32,
    shuffle=False #keep images in dataframe order so predictions line up with filenames
)

#create classes (categories to be sorted into)
classes = len(training_generator.class_indices)
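To make two of the generator options above more concrete, here is a minimal sketch of what `rescale=1./255` and `horizontal_flip=True` do to pixel data, written with plain Python lists. The functions `rescale` and `horizontal_flip` and the tiny image are hypothetical stand-ins; Keras applies the flip randomly during training, while this sketch always flips.

```python
def rescale(image, factor=1.0 / 255):
    """Scale every pixel value, mirroring the generator's rescale option."""
    return [[pixel * factor for pixel in row] for row in image]

def horizontal_flip(image):
    """Mirror the image left to right by reversing each row."""
    return [row[::-1] for row in image]

# A hypothetical 2x3 grayscale image with 8-bit pixel values.
image = [
    [0, 128, 255],
    [255, 51, 0],
]

scaled = rescale(image)             # values now fall between 0 and 1
flipped = horizontal_flip(scaled)   # an extra, mirrored training example
print(flipped)
```

The flipped copy shows the same product from a mirrored viewpoint, which is why it can serve as additional training material.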

#### Build the model

Once all images have been prepped, it's time to build and train the model.

In [None]:
# create the base pre-trained model
base_model = MobileNetV2(input_shape=(96, 96, 3), include_top=False, weights='imagenet')

# add a global spatial average pooling layer
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
predictions = Dense(classes, activation='softmax')(x)

# this is the model we will train
model = Model(inputs=base_model.input, outputs=predictions)

# first: train only the top layers (which were randomly initialized)
# i.e. freeze all convolutional MobileNetV2 layers
for layer in base_model.layers:
    layer.trainable = False

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

In [None]:
#training the model; keep the returned History object so the loss and accuracy can be plotted
history = model.fit_generator(
    generator=training_generator,
    steps_per_epoch=training_generator.n//training_generator.batch_size,

    validation_data=valid_generator,
    validation_steps=valid_generator.n//valid_generator.batch_size,

    epochs=3 #the number of times to repeat training for higher accuracy
)

#save model to file
model.save('/kaggle/working/model.h5')
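The step counts passed to training are numbers of batches, not numbers of images. The sketch below, using hypothetical sample counts and helper names, shows how a batch count follows from the sample count and batch size, and how floor division differs from rounding up when the last batch is only partly full.

```python
import math

def full_batches(num_samples, batch_size):
    """Batches per epoch when the final partial batch is dropped."""
    return num_samples // batch_size

def batches_with_remainder(num_samples, batch_size):
    """Batches per epoch when the final partial batch is included."""
    return math.ceil(num_samples / batch_size)

# Hypothetical sizes; in the notebook these come from the generator's
# n and batch_size attributes.
num_samples, batch_size = 1000, 32
print(full_batches(num_samples, batch_size))            # 31
print(batches_with_remainder(num_samples, batch_size))  # 32
```

With 1000 samples and a batch size of 32, the 32nd batch holds only the remaining 8 images, so the choice decides whether those 8 are seen in every epoch.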

#### Evaluating the model
Now that the model has been built and trained, this next section evaluates it against the validation data. `evaluate_generator` returns the loss followed by the compiled metrics, and the code prints out the model's loss and accuracy.

In [None]:
# Evaluate model with validation data set
STEP_SIZE_VALID=valid_generator.n//valid_generator.batch_size
STEP_SIZE_TEST=test_generator.n//test_generator.batch_size # used later for predictions

evaluation = model.evaluate_generator(generator=valid_generator,
steps=STEP_SIZE_VALID)

# Print out final values of all metrics
# (older Keras names the metric 'acc', newer versions use 'accuracy')
key2name = {'acc':'Accuracy', 'accuracy':'Accuracy', 'loss':'Loss',
    'val_acc':'Validation Accuracy', 'val_loss':'Validation Loss'}
results = []
for i,key in enumerate(model.metrics_names):
    results.append('%s = %.2f' % (key2name.get(key, key), evaluation[i]))
print(", ".join(results))
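For intuition about the two numbers printed above, here is a small hand-rolled sketch of categorical cross-entropy loss and accuracy, the loss and metric the model was compiled with. The probabilities and labels are made up for illustration, and the functions are simplified stand-ins rather than the Keras implementations.

```python
import math

def categorical_crossentropy(true_index, probabilities):
    """Loss for one sample: negative log of the probability given to the true class."""
    return -math.log(probabilities[true_index])

def accuracy(true_indices, predicted_probabilities):
    """Fraction of samples whose highest-probability class is the true class."""
    hits = sum(
        1
        for true_index, probs in zip(true_indices, predicted_probabilities)
        if probs.index(max(probs)) == true_index
    )
    return hits / len(true_indices)

# Hypothetical softmax outputs for three images and three classes.
y_true = [0, 2, 1]
y_prob = [
    [0.7, 0.2, 0.1],  # correct and fairly confident
    [0.1, 0.3, 0.6],  # correct
    [0.5, 0.4, 0.1],  # wrong: class 0 outranks the true class 1
]

losses = [categorical_crossentropy(t, p) for t, p in zip(y_true, y_prob)]
mean_loss = sum(losses) / len(losses)
print("Loss = %.2f, Accuracy = %.2f" % (mean_loss, accuracy(y_true, y_prob)))
```

Loss rewards putting high probability on the right class, while accuracy only checks whether the top-ranked class is right, so the two can move somewhat independently.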



To get an idea of what the loss and accuracy look like through the training process, these numbers are plotted in the following graph.

In [None]:
fig = plt.figure(figsize=(10,5))

# Plot accuracy (left panel)
# note: newer Keras versions name these keys 'accuracy' and 'val_accuracy'
plt.subplot(121)
plt.plot(history.history['acc'],'bo--', label = "acc")
plt.plot(history.history['val_acc'], 'ro--', label = "val_acc")
plt.title("train_acc vs val_acc")
plt.ylabel("accuracy")
plt.xlabel("epochs")
plt.legend()

# Plot loss function (right panel)
plt.subplot(122)
plt.plot(history.history['loss'],'bo--', label = "loss")
plt.plot(history.history['val_loss'], 'ro--', label = "val_loss")
plt.title("train_loss vs val_loss")
plt.ylabel("loss")
plt.xlabel("epochs")
plt.legend()

plt.tight_layout()
plt.show()

With pretty low loss and high accuracy, I think the model looks pretty good to move onto the next step! 

#### Predicting Output
For a more practical use of this model, it's time to test how well it makes predictions. When given an image, can it assign the correct category? In this section, we go through the test dataset, use the model to assign what it thinks the category should be, and then export a CSV file comparing each prediction with the correct categorization according to the original dataset.

In [None]:
#reset the test generator to make sure the order is correct 
test_generator.reset()

#make predictions
pred=model.predict_generator(test_generator,
steps=STEP_SIZE_TEST,
verbose=1)

In [None]:
#map out the indices to the category names to make more legible 
predicted_class_indices=np.argmax(pred,axis=1)

labels = (training_generator.class_indices)
labels = dict((v,k) for k,v in labels.items())
predictions = [labels[k] for k in predicted_class_indices]
labels
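The mapping step above can be sketched without Keras or numpy. The class names, index values, and probabilities below are hypothetical; the point is only to show how the highest-probability index in each row is turned back into a category name by inverting the name-to-index dictionary.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical mapping in the same shape as a generator's class_indices.
class_indices = {"Accessories": 0, "Apparel": 1, "Footwear": 2}

# Invert it so an index can be looked up to get the category name.
index_to_label = {index: name for name, index in class_indices.items()}

# Hypothetical softmax outputs, one row per image.
pred = [
    [0.10, 0.85, 0.05],
    [0.20, 0.10, 0.70],
]

predicted_labels = [index_to_label[argmax(row)] for row in pred]
print(predicted_labels)  # ['Apparel', 'Footwear']
```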



In [None]:
#compare the prediction to the original test dataset
filenames=test_generator.filenames
actual = test_data['masterCategory']

match = []
count = 0
correct = 0
#walk through predictions and true labels side by side
for predicted, expected in zip(predictions, actual):
    if predicted == expected:
        match.append("True")
        correct = correct + 1
    else:
        match.append("False")
    count = count + 1
        

In [None]:
results=pd.DataFrame({"Filename":filenames,
                      "Predictions":predictions,
                      "Actual":actual,
                      "Correct":match})

#export results to csv and print accuracy
results.to_csv("results.csv",index=False)
print("Number Predictions Correct: " + str(correct) + " out of " + str(count) + " Percent Correct: " + str(correct/count))
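A single overall percentage can hide which categories are being confused. As a possible follow-up that is not part of the original notebook, the sketch below breaks accuracy down per category. The helper `per_category_accuracy` and the label lists are hypothetical.

```python
from collections import Counter

def per_category_accuracy(actual, predicted):
    """Return {category: fraction of that category's items predicted correctly}."""
    totals = Counter(actual)
    hits = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {category: hits[category] / totals[category] for category in totals}

# Hypothetical true and predicted labels for five images.
actual_labels = ["Apparel", "Apparel", "Footwear", "Accessories", "Footwear"]
predicted_labels = ["Apparel", "Footwear", "Footwear", "Apparel", "Footwear"]

breakdown = per_category_accuracy(actual_labels, predicted_labels)
for category, score in sorted(breakdown.items()):
    print("%s: %.2f" % (category, score))
```

A category that scores far below the others would be a natural place to start investigating.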

## Conclusion

I was able to create a model that had pretty low loss and high accuracy. I was surprised to see that when making predictions, which would be the primary use case, it only had ~37% accuracy. I think that I most likely overfitted my model: it was very accurate on the training and validation datasets, but not generalizable enough for brand new data. I tried changing the training data, going from the first 5,000 entries that I originally pulled to keep the code manageable to the entire ~40k, but it still had pretty low correct prediction rates. I also tried reducing the number of epochs (originally 10) to see if that would help with generalizability, but that also did very little. If I had more time, I would definitely explore how to tweak the model for better prediction accuracy.

Once the model is able to predict with better accuracy, I can see it being adapted for multiple applications, especially if the model were trained on the multiple categories that this dataset provides. For example, given an image of a product, it could identify the product's category and attributes. Based on those, it could recommend similar products as replacements or even suggest complementary items to create a full outfit.
