# About the Dataset

## Context

The dataset comes from the e-commerce industry and offers a rich source of information for analysis and research. It includes:

- **High-Resolution Product Images**: Professionally captured images showcasing each product.
- **Label Attributes**: Multiple attributes describing the product, manually entered during cataloging.
- **Descriptive Text**: Comments and descriptions detailing the product characteristics.

This dataset offers a comprehensive view of products, including visual, categorical, and descriptive information, making it ideal for various analyses such as image classification, attribute prediction, and product recommendation.

## Content

The dataset is structured as follows:

1. **Product Identification**:
   - Each product is uniquely identified by an ID (e.g., 42431).

2. **Styles File**:
   - **File Name**: `styles.csv`
   - **Content**: Maps each product ID to its attributes and categories.

## Objective

The goal is to build an image classifier that can accurately classify product images into their respective master categories. This involves leveraging high-resolution product images and the `styles.csv` file to train and validate the model.


In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
import math
from PIL import Image
import io

In [57]:
df = pd.read_csv('styles.csv')

In [59]:
df

Unnamed: 0,id,gender,masterCategory,subCategory,articleType,baseColour,season,year,usage,productDisplayName,productDisplayName1,productDisplayName2
0,15970,Men,Apparel,Topwear,Shirts,Navy Blue,Fall,2011.0,Casual,Turtle Check Men Navy Blue Shirt,,
1,39386,Men,Apparel,Bottomwear,Jeans,Blue,Summer,2012.0,Casual,Peter England Men Party Blue Jeans,,
2,59263,Women,Accessories,Watches,Watches,Silver,Winter,2016.0,Casual,Titan Women Silver Watch,,
3,21379,Men,Apparel,Bottomwear,Track Pants,Black,Fall,2011.0,Casual,Manchester United Men Solid Black Track Pants,,
4,53759,Men,Apparel,Topwear,Tshirts,Grey,Summer,2012.0,Casual,Puma Men Grey T-shirt,,
...,...,...,...,...,...,...,...,...,...,...,...,...
44441,17036,Men,Footwear,Shoes,Casual Shoes,White,Summer,2013.0,Casual,Gas Men Caddy Casual Shoe,,
44442,6461,Men,Footwear,Flip Flops,Flip Flops,Red,Summer,2011.0,Casual,Lotto Men's Soccer Track Flip Flop,,
44443,18842,Men,Apparel,Topwear,Tshirts,Blue,Fall,2011.0,Casual,Puma Men Graphic Stellar Blue Tshirt,,
44444,46694,Women,Personal Care,Fragrance,Perfume and Body Mist,Blue,Spring,2017.0,Casual,Rasasi Women Blue Lady Perfume,,


# Exploratory Data Analysis (EDA) Guide

## Objective

The goal is to clean and prepare the dataset for analysis by aggregating product display columns and mapping image paths to product IDs.

## Steps

### 1. Data Aggregation

The product display name is spread across three columns (`productDisplayName`, `productDisplayName1`, `productDisplayName2`) due to improper CSV encoding. We combine them into a single column.
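
As a minimal sketch of this aggregation on made-up rows (the product names and the `name_parts` variable are illustrative assumptions, not rows from `styles.csv`), the three columns can be joined while skipping empty parts:

```python
import pandas as pd

# Toy rows (hypothetical): the second name was split into three parts during parsing.
toy = pd.DataFrame({
    "productDisplayName": ["Puma Men Grey T-shirt", "Gift Set"],
    "productDisplayName1": [None, "Pack of 2"],
    "productDisplayName2": [None, "Blue"],
})

name_parts = ["productDisplayName", "productDisplayName1", "productDisplayName2"]

# Join the non-empty string parts of each row with ", ".
toy["final product name"] = toy[name_parts].apply(
    lambda row: ", ".join(part for part in row if isinstance(part, str) and part),
    axis=1,
)
print(toy["final product name"].tolist())
```

Joining only the non-empty parts avoids having to strip trailing separators afterwards.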

### 2. Image Extraction and Mapping

We extract images from a zip file and then link each image to its respective product ID. This is done by using the image filenames to determine the product ID and then creating a new column in the dataset that includes the file paths to these images.
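
The mapping step can be sketched on a small, made-up archive listing. The file names, the `styles` frame, and the use of an outer merge with `indicator=True` are assumptions for illustration; the notebook itself uses an inner merge.

```python
import pandas as pd

# Hypothetical archive listing, including the directory entry a zip file may contain.
files_in_zip = ["images/", "images/15970.jpg", "images/39386.jpg", "images/99999.jpg"]
styles = pd.DataFrame({"id": [15970, 39386, 59263],
                       "masterCategory": ["Apparel", "Apparel", "Accessories"]})

images = pd.DataFrame({"Image Name": files_in_zip})
# Capture the digits of the file stem; entries without a match become NaN.
images["id"] = images["Image Name"].str.extract(r"(\d+)\.jpg$", expand=False)
images = images.dropna(subset=["id"]).astype({"id": int})

# An outer merge with indicator=True shows which IDs lack an image or a styles row.
merged = styles.merge(images, on="id", how="outer", indicator=True)
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(merged["_merge"].value_counts())
print(matched)
```

Inspecting the `_merge` column before keeping only the matched rows makes it visible how many products an inner merge would silently drop.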






In [61]:
df['productDisplayName1'] = df['productDisplayName1'].fillna('')
df['productDisplayName2'] = df['productDisplayName2'].fillna('')
df['final product name'] = df['productDisplayName'] + ", " + df['productDisplayName1'] + ", " + df['productDisplayName2']
# rstrip removes any trailing commas and spaces left behind by empty name parts
df['final product name'] = df['final product name'].str.rstrip(', ')
df = df.drop(['productDisplayName', 'productDisplayName1', 'productDisplayName2'], axis=1)

In [63]:
df

Unnamed: 0,id,gender,masterCategory,subCategory,articleType,baseColour,season,year,usage,final product name
0,15970,Men,Apparel,Topwear,Shirts,Navy Blue,Fall,2011.0,Casual,Turtle Check Men Navy Blue Shirt
1,39386,Men,Apparel,Bottomwear,Jeans,Blue,Summer,2012.0,Casual,Peter England Men Party Blue Jeans
2,59263,Women,Accessories,Watches,Watches,Silver,Winter,2016.0,Casual,Titan Women Silver Watch
3,21379,Men,Apparel,Bottomwear,Track Pants,Black,Fall,2011.0,Casual,Manchester United Men Solid Black Track Pants
4,53759,Men,Apparel,Topwear,Tshirts,Grey,Summer,2012.0,Casual,Puma Men Grey T-shirt
...,...,...,...,...,...,...,...,...,...,...
44441,17036,Men,Footwear,Shoes,Casual Shoes,White,Summer,2013.0,Casual,Gas Men Caddy Casual Shoe
44442,6461,Men,Footwear,Flip Flops,Flip Flops,Red,Summer,2011.0,Casual,Lotto Men's Soccer Track Flip Flop
44443,18842,Men,Apparel,Topwear,Tshirts,Blue,Fall,2011.0,Casual,Puma Men Graphic Stellar Blue Tshirt
44444,46694,Women,Personal Care,Fragrance,Perfume and Body Mist,Blue,Spring,2017.0,Casual,Rasasi Women Blue Lady Perfume


In [65]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential
from tensorflow.keras.losses import SparseCategoricalCrossentropy

In [67]:
from zipfile import ZipFile
zip_path = 'images.zip'
with ZipFile(zip_path) as myzip:
    files_in_zip = myzip.namelist()

In [69]:
image = pd.DataFrame(files_in_zip, columns=['Image Name'])
# Pull the numeric product ID out of each file name, e.g. 'images/15970.jpg' -> 15970
image['id'] = image['Image Name'].str.extract(r'(\d+)', expand=False)
image['id'] = image['id'].astype(int)
final_df = pd.merge(df, image, how='inner', on='id')

In [71]:
final_df

Unnamed: 0,id,gender,masterCategory,subCategory,articleType,baseColour,season,year,usage,final product name,Image Name
0,15970,Men,Apparel,Topwear,Shirts,Navy Blue,Fall,2011.0,Casual,Turtle Check Men Navy Blue Shirt,images/15970.jpg
1,39386,Men,Apparel,Bottomwear,Jeans,Blue,Summer,2012.0,Casual,Peter England Men Party Blue Jeans,images/39386.jpg
2,59263,Women,Accessories,Watches,Watches,Silver,Winter,2016.0,Casual,Titan Women Silver Watch,images/59263.jpg
3,21379,Men,Apparel,Bottomwear,Track Pants,Black,Fall,2011.0,Casual,Manchester United Men Solid Black Track Pants,images/21379.jpg
4,53759,Men,Apparel,Topwear,Tshirts,Grey,Summer,2012.0,Casual,Puma Men Grey T-shirt,images/53759.jpg
...,...,...,...,...,...,...,...,...,...,...,...
44436,17036,Men,Footwear,Shoes,Casual Shoes,White,Summer,2013.0,Casual,Gas Men Caddy Casual Shoe,images/17036.jpg
44437,6461,Men,Footwear,Flip Flops,Flip Flops,Red,Summer,2011.0,Casual,Lotto Men's Soccer Track Flip Flop,images/6461.jpg
44438,18842,Men,Apparel,Topwear,Tshirts,Blue,Fall,2011.0,Casual,Puma Men Graphic Stellar Blue Tshirt,images/18842.jpg
44439,46694,Women,Personal Care,Fragrance,Perfume and Body Mist,Blue,Spring,2017.0,Casual,Rasasi Women Blue Lady Perfume,images/46694.jpg


In [73]:
final_df['Image Name'] = "imagedata/"+final_df['Image Name']
old_substring = 'imagedata/images/'
# The trailing separator keeps the folder and the file name apart after replacement
new_substring = 'C:\\Users\\RishiGupta\\OneDrive - ASPECTRATIO PRIVATE LIMITED\\Documents\\data science all\\myntradataset\\images\\'

# Replace only the specified substring
final_df['Image Name'] = final_df['Image Name'].str.replace(old_substring, new_substring, regex=False)

## Handling Dataset Imbalance and Sampling

### Dataset Imbalance

Upon analyzing the distribution of master categories, we discovered that the dataset is imbalanced, with some classes having significantly more samples than others. This imbalance can affect model performance and training efficiency.

### Sampling Strategy

To manage this imbalance and to accommodate the constraints of running the model on a CPU, we sample 5,000 images from the dataset. This reduces the computational load while keeping the dataset at a manageable size.

### Ensuring Class Representation

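To keep every class represented in the sample:
- **Weighted Sampling**: Use the pandas `sample` function with per-row weights set to the inverse of each class's frequency. Rows from rare classes are then far more likely to be drawn, so the sample is much closer to balanced than the full dataset and the less frequent classes are still included. Because sampling is done without replacement, a class can contribute at most the rows it has (105 for Free Items, 25 for Sporting Goods).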
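
A minimal sketch of inverse-frequency weighted sampling on a toy frame follows. The class names `A`, `B`, `C`, their counts, and the `weight` column are made up for illustration; the notebook stores the same kind of weight in its `Proportion` column.

```python
import pandas as pd

# Toy class distribution (hypothetical counts): 900 common rows, 90 and 10 rare ones.
toy = pd.DataFrame({"masterCategory": ["A"] * 900 + ["B"] * 90 + ["C"] * 10})

# Per-row weight = 1 / (class frequency); pandas normalises the weights to sum to 1.
freq = toy["masterCategory"].map(toy["masterCategory"].value_counts(normalize=True))
toy["weight"] = 1.0 / freq

# Sampling is without replacement, so a class can never exceed its own row count.
sample = toy.sample(n=100, weights="weight", random_state=0)
print(sample["masterCategory"].value_counts())
```

With these weights every class starts with the same total probability mass, which is why the rare classes tend to be exhausted while the common class is heavily downsampled.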

In [75]:
final_df['masterCategory'].value_counts()

masterCategory
Apparel           21395
Accessories       11289
Footwear           9222
Personal Care      2404
Free Items          105
Sporting Goods       25
Home                  1
Name: count, dtype: int64

In [77]:
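# Drop the 'Home' category, which has only one item
final_df = final_df[final_df['masterCategory'] != 'Home']
data = {
    'masterCategory': ['Apparel', 'Accessories', 'Footwear', 'Personal Care', 'Free Items', 'Sporting Goods'],
    'Count': [21395, 11289, 9222, 2404, 105, 25]
}
df = pd.DataFrame(data)
# Despite its name, 'Proportion' holds the inverse class frequency, used below as the sampling weight
df['Proportion'] = 1/(df['Count'] / df['Count'].sum())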

In [79]:
final_df = pd.merge(final_df, df, how = 'inner', on = 'masterCategory')

In [81]:
final_df_sample = final_df.sample(n = 5000, weights = 'Proportion', random_state=42)

In [83]:
final_df_sample['masterCategory'].value_counts()

masterCategory
Apparel           1340
Accessories       1259
Footwear          1227
Personal Care     1044
Free Items         105
Sporting Goods      25
Name: count, dtype: int64

## Dataset Preparation for Neural Network

### 1. Label Encoding

We use a label encoder to convert categorical master categories into numerical values. This step is essential for transforming the textual categories into a format suitable for machine learning models.

### 2. Data Splitting

The dataset is split into three subsets:
- **Training Set (60%)**: Used to train the neural network.
- **Validation Set (20%)**: Used to tune hyperparameters and evaluate model performance during training.
- **Test Set (20%)**: Used to assess the final model's performance on unseen data.
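
The 60/20/20 split is reached in two stages: 20% is held out for testing, then 25% of the remaining 80% becomes the validation set (0.25 × 0.8 = 0.2). The helper below is a hypothetical illustration of that arithmetic only; scikit-learn's own rounding may differ by a row for other sizes.

```python
def split_sizes(n_rows, test_frac=0.2, val_frac_of_rest=0.25):
    """Return (train, val, test) row counts for a two-stage hold-out split."""
    n_test = round(n_rows * test_frac)
    n_rest = n_rows - n_test
    n_val = round(n_rest * val_frac_of_rest)
    return n_rest - n_val, n_val, n_test

train, val, test = split_sizes(5000)
print(train, val, test)  # 3000 1000 1000
```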

### 3. Data Normalization

Images are normalized to a range of [0,1] to improve convergence during neural network training. This normalization step ensures that the pixel values are scaled uniformly, which helps the model learn more effectively.
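
As a small NumPy sketch of this scaling (the 2x2 image below is made up for illustration), 8-bit pixel values are converted to floats and divided by 255:

```python
import numpy as np

# Hypothetical 2x2 RGB image with 8-bit pixel values.
pixels = np.array([[[0, 128, 255], [64, 64, 64]],
                   [[255, 255, 255], [10, 20, 30]]], dtype=np.uint8)

# Convert to float before dividing so the values are stored as 32-bit floats.
scaled = pixels.astype(np.float32) / 255.0
print(scaled.min(), scaled.max())
```

Keras also provides `tf.keras.applications.resnet50.preprocess_input`, which applies the preprocessing documented for the ImageNet weights; whether it works better here than [0, 1] scaling is not tested in this notebook.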

### 4. Data Loading and Batching

The dataset is loaded into a TensorFlow `Dataset` object, with images processed and batched for training. We use batches of 32 images to keep memory usage bounded and to process several images per gradient step, which is more efficient than feeding images one at a time.

This preparation process ensures that the dataset is well-structured and optimized for training a neural network, with proper encoding, splitting, normalization, and batching of data.
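
The batch size of 32 determines how many steps each epoch takes. The sketch below uses a hypothetical helper, `num_batches`, and assumes the split sizes that follow from 5,000 sampled rows (3,000 / 1,000 / 1,000); the results line up with the `94/94` and `32/32` step counters in the training and evaluation output.

```python
import math

def num_batches(n_samples, batch_size=32):
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(n_samples / batch_size)

# Split sizes follow from 5,000 sampled rows and a 60/20/20 split.
for name, n in [("train", 3000), ("val", 1000), ("test", 1000)]:
    print(name, num_batches(n))
```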


In [85]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
final_df_sample['masterCategory'] = le.fit_transform(final_df_sample['masterCategory'])

In [87]:
X_train_val, X_test = train_test_split(final_df_sample, test_size=0.2, stratify= final_df_sample['masterCategory'], random_state=42)
X_train, X_val = train_test_split(X_train_val, test_size=0.25, stratify= X_train_val['masterCategory'], random_state=42)

In [89]:
def load_data(final_df_sample):
    category = final_df_sample['masterCategory'].tolist()
    names = final_df_sample['Image Name'].tolist()
    def load_and_preprocess_image(image_path, target_size=(224, 224)):
        image = tf.io.read_file(image_path)
        image = tf.image.decode_jpeg(image, channels=3)
        image = tf.image.resize(image, target_size)
        image = image / 255.0 
        image.set_shape([target_size[0], target_size[1], 3])
        return image

    def process_image_and_label(image_path, label):
        image = load_and_preprocess_image(image_path)
        return image, label
    dataset = tf.data.Dataset.from_tensor_slices((names, category))
    dataset = dataset.map(lambda x, y: process_image_and_label(x, y), num_parallel_calls=tf.data.AUTOTUNE)
    batch_size = 32
    dataset = dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)
    return dataset

In [91]:
dataset_train = load_data(X_train)
dataset_val = load_data(X_val)
dataset_test = load_data(X_test)

## Model Definition and Training

### 1. Model Architecture

We employ transfer learning with ResNet-50 as the base model. ResNet-50 is chosen for several reasons:

- **Feature Extraction**: ResNet-50 excels at extracting features from images due to its deep residual network architecture. This architecture includes shortcut connections that help the network learn complex features more effectively by addressing the vanishing gradient problem.
  
- **Pre-Trained Weights**: ResNet-50 comes with pre-trained weights from large-scale datasets like ImageNet. These pre-trained weights capture a wide range of visual features, which can be beneficial for our specific task even if our dataset is smaller or different in nature.

- **Residual Connections**: The residual connections in ResNet-50 allow the network to train deeper models without degradation in performance. This is achieved by enabling gradients to flow more effectively through the network, which improves feature learning and overall model accuracy.

- **Efficient Training**: Using ResNet-50 as a base model reduces the need for training a deep neural network from scratch, saving computational resources and time. It also leverages the extensive training done on large datasets to enhance the performance of our model.

### 2. Model Configuration

- **Base Model**: 
  - We use ResNet-50 as a pre-trained feature extractor.
  - The layers of ResNet-50 are kept frozen (not updated during training) to preserve the learned features. 

- **Top Neural Network**:
  - A custom neural network is added on top of ResNet-50.
  - This network is specifically trained for our image classification task, allowing the model to adapt the extracted features to our particular dataset.
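
To give a sense of how small the trainable part is, the sketch below counts the parameters of the head used in the model cell (Dense layers of 40, 25, 10, and 6 units). It assumes the frozen base outputs a 2048-dimensional vector, which is what ResNet-50 produces with `include_top=False` and `pooling='avg'`; `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# 2048 is the length of ResNet-50's globally average-pooled feature vector.
layer_sizes = [2048, 40, 25, 10, 6]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(per_layer, sum(per_layer))  # [81960, 1025, 260, 66] 83311
```

Only these roughly 83 thousand parameters are updated during training; the millions of weights in the frozen base stay fixed.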

### 3. Optimizer

We use the Adam optimizer for training:
- **Adam Optimizer**: 
  - Adam is chosen because it adapts the step size of each parameter using running estimates of the mean and variance of that parameter's gradients.
  - This per-parameter scaling often leads to faster convergence than gradient descent with a single fixed learning rate.
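
For intuition, here is a sketch of one textbook Adam update for a single scalar parameter. The `adam_step` function is a hypothetical illustration, not the Keras implementation; the learning rate 0.007 matches the notebook, and the other defaults are assumed to match Keras's documented defaults.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.007, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
for t, grad in enumerate([4.0, 0.5, -0.1], start=1):
    theta, m, v = adam_step(theta, grad, m, v, t)
    print(t, round(theta, 5))
```

On the first step the update is close to `lr` in size regardless of the gradient's magnitude, because the gradient is divided by the square root of its own squared value.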

### 4. Training Configuration

- **Number of Epochs**: 
  - The model is trained for 25 epochs.
  - This duration allows the model to learn effectively from the dataset while balancing training time and performance.

By leveraging ResNet-50’s advanced feature extraction capabilities and using the Adam optimizer, we create a robust model that efficiently learns to classify images while benefiting from the extensive knowledge embedded in the pre-trained ResNet-50 model.


In [659]:
model = Sequential()
base_model = tf.keras.applications.ResNet50(
    include_top=False,
    weights="imagenet",
    input_tensor=None,
    input_shape=(224, 224, 3),
    pooling='avg',
)
base_model.trainable = False
model.add(base_model)
model.add(Dense(units=40, activation='relu'))
model.add(Dense(units=25, activation='relu'))
model.add(Dense(units=10, activation='relu'))
model.add(Dense(units=6, activation='linear'))
opt = tf.keras.optimizers.Adam(learning_rate=0.007)
model.compile(loss = SparseCategoricalCrossentropy(from_logits = True), optimizer=opt, metrics=['accuracy'])

In [619]:
model.fit(dataset_train, epochs =25, validation_data = dataset_val)

Epoch 1/25
94/94 ━━━━━━━━━━━━━━━━━━━━ 202s 2s/step - accuracy: 0.3142 - loss: 1.4912 - val_accuracy: 0.5920 - val_loss: 1.0685
Epoch 2/25
94/94 ━━━━━━━━━━━━━━━━━━━━ 166s 2s/step - accuracy: 0.5854 - loss: 1.0616 - val_accuracy: 0.6680 - val_loss: 0.9030
Epoch 3/25
94/94 ━━━━━━━━━━━━━━━━━━━━ 167s 2s/step - accuracy: 0.6516 - loss: 0.8891 - val_accuracy: 0.6600 - val_loss: 0.8768
Epoch 4/25
94/94 ━━━━━━━━━━━━━━━━━━━━ 173s 2s/step - accuracy: 0.6781 - loss: 0.9028 - val_accuracy: 0.8120 - val_loss: 0.6194
Epoch 5/25
94/94 ━━━━━━━━━━━━━━━━━━━━ 175s 2s/step - accuracy: 0.7632 - loss: 0.6854 - val_accuracy: 0.8220 - val_loss: 0.5638
Epoch 6/25
94/94 ━━━━━━━━━━━━━━━━━━━━ 165s 2s/step - accuracy: 0.7897 - loss: 0.6238 - val_accuracy: 0.8310 - val_loss: 0.5318
Epoch 7/25
94/94 ━━━━

<keras.src.callbacks.history.History at 0x264599b04d0>

In [620]:
weights = model.get_weights()

In [705]:
model.set_weights(weights)

In [669]:
results = model.evaluate(dataset_test)

32/32 ━━━━━━━━━━━━━━━━━━━━ 44s 1s/step - accuracy: 0.8475 - loss: 0.5101


In [707]:
y_pred = model.predict(dataset_test)

32/32 ━━━━━━━━━━━━━━━━━━━━ 44s 1s/step


In [708]:
probabilities = tf.nn.softmax(y_pred).numpy()
predicted_classes = np.argmax(probabilities, axis=1)

In [709]:
predicted_classes = np.array(predicted_classes, dtype='float32')

In [711]:
y_true = []
for image, label in dataset_test:
    for i in range(len(label)):
        y_true.append(label[i].numpy())

In [710]:
y_true = np.array(y_true, dtype='float32')

In [712]:
predicted_classes = predicted_classes.tolist()

In [713]:
from sklearn.metrics import confusion_matrix, f1_score

In [714]:
f1_score(y_true, predicted_classes, average = 'weighted')

0.8476366547853702

## Model Performance and Improvement

### Performance Results

After training the model, we obtained the following results:
- **Training Accuracy**: 82%
  - Indicates how well the model performs on the data it was trained on.
  
- **Validation Accuracy**: 87.5%
  - Measured on held-out data after each epoch. It is higher than the training accuracy, which suggests that the model is learning effectively and not overfitting.

- **Test Accuracy**: 85%
  - Reflects the model's performance on completely unseen data. This accuracy demonstrates the model’s ability to generalize well to new images.

- **F1 Score**: 0.85
  - The weighted F1 score of 0.85 balances precision and recall and weights each class by its number of test samples. Because of that weighting, the score is dominated by the larger classes; per-class scores would be needed to confirm how well the model handles the less frequent ones.
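
The sketch below illustrates on made-up labels how a weighted F1 score can stay high even when a rare class is never predicted. The function `f1_per_class` and the toy labels are illustrative assumptions, written in plain Python rather than with scikit-learn.

```python
from collections import Counter

def f1_per_class(y_true, y_pred):
    """Return {label: F1} computed from true/predicted label lists."""
    scores = {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores[c] = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return scores

# Toy labels (hypothetical): class 0 is common and always found, class 1 is rare and always missed.
y_true = [0] * 18 + [1] * 2
y_pred = [0] * 20

per_class = f1_per_class(y_true, y_pred)
support = Counter(y_true)
weighted = sum(per_class[c] * support[c] for c in per_class) / len(y_true)
macro = sum(per_class.values()) / len(per_class)
print(per_class, round(weighted, 3), round(macro, 3))
```

Reporting the macro average or the per-class scores next to the weighted score gives a clearer picture for imbalanced data.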

### Potential Improvement

To further enhance model performance, especially for categories with fewer samples, consider implementing image augmentation:
- **Image Augmentation**: Apply techniques such as rotation, flipping, scaling, and color adjustments to artificially increase the number of samples in underrepresented categories. 
  - This can help the model learn more robust features and improve generalization for categories with fewer samples.
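
Flipping and rotation can be sketched in NumPy as follows. The helper `geometric_variants` and the tiny test image are assumptions for illustration; which transformations actually suit product photos would need to be checked against the data.

```python
import numpy as np

def geometric_variants(image):
    """Return a horizontally flipped and a 90-degree rotated copy of an HxWxC image."""
    flipped = image[:, ::-1, :]                   # mirror left-right
    rotated = np.rot90(image, k=1, axes=(0, 1))   # rotate counter-clockwise by 90 degrees
    return flipped, rotated

# Hypothetical 2x3 single-channel image so the effect is easy to read.
image = np.arange(6, dtype=np.float32).reshape(2, 3, 1)
flipped, rotated = geometric_variants(image)
print(flipped[..., 0])
print(rotated.shape)
```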
  

In [93]:
def augment_image(image):
    # Random colour jitter: brightness, hue, saturation, and contrast
    image = tf.image.random_brightness(image, max_delta=0.3)
    image = tf.image.random_hue(image, max_delta=0.2)
    image = tf.image.random_saturation(image, lower=0.7, upper=1.3)
    image = tf.image.random_contrast(image, lower=0.7, upper=1.3)
    return image

def augment_and_collect(dataset, label_to_augment, num_augmentations):
    # Expects an unbatched dataset of (image, label) pairs
    augmented_images = []
    augmented_labels = []

    for image, label in dataset:
        if label == label_to_augment:
            # Replace each image of the target class with num_augmentations jittered copies
            for _ in range(num_augmentations):
                aug_image = augment_image(image)
                augmented_images.append(aug_image)
                augmented_labels.append(label)
        else:
            # Keep images of all other classes unchanged
            augmented_images.append(image)
            augmented_labels.append(label)

    return tf.data.Dataset.from_tensor_slices((augmented_images, augmented_labels))

In [None]:
dataset_train_augmented = dataset_train.unbatch()
dataset_train_augmented = augment_and_collect(dataset_train_augmented, 5, 10)
dataset_train_augmented = dataset_train_augmented.batch(32).prefetch(tf.data.AUTOTUNE)

In [725]:
c = 0
for image, label in dataset_train_augmented:
    c = c+1
print(c)

106


In [679]:
opt = tf.keras.optimizers.Adam(learning_rate=0.003)
model.compile(loss = SparseCategoricalCrossentropy(from_logits = True), optimizer=opt, metrics=['accuracy'])

In [681]:
model.fit(dataset_train_augmented, epochs =15, validation_data = dataset_val)

Epoch 1/15
106/106 ━━━━━━━━━━━━━━━━━━━━ 201s 2s/step - accuracy: 0.7546 - loss: 0.7167 - val_accuracy: 0.8730 - val_loss: 0.4680
Epoch 2/15
106/106 ━━━━━━━━━━━━━━━━━━━━ 198s 2s/step - accuracy: 0.7675 - loss: 0.6293 - val_accuracy: 0.8620 - val_loss: 0.4729
Epoch 3/15
106/106 ━━━━━━━━━━━━━━━━━━━━ 205s 2s/step - accuracy: 0.7707 - loss: 0.6161 - val_accuracy: 0.8670 - val_loss: 0.4931
Epoch 4/15
106/106 ━━━━━━━━━━━━━━━━━━━━ 196s 2s/step - accuracy: 0.7646 - loss: 0.6155 - val_accuracy: 0.8570 - val_loss: 0.5242
Epoch 5/15
106/106 ━━━━━━━━━━━━━━━━━━━━ 227s 2s/step - accuracy: 0.7745 - loss: 0.6300 - val_accuracy: 0.8700 - val_loss: 0.4794
Epoch 6/15
106/106 ━━━━━━━━━━━━━━━━━━━━ 255s 2s/step - accuracy: 0.7722 - loss: 0.6057 - val_accuracy: 0.8640 - val_loss: 0.5046
Epoch 7/15
106/106

<keras.src.callbacks.history.History at 0x26473bab710>

In [683]:
model.evaluate(dataset_test)

32/32 ━━━━━━━━━━━━━━━━━━━━ 43s 1s/step - accuracy: 0.8023 - loss: 0.6330


[0.5693063735961914, 0.8199999928474426]

In [685]:
weights1  = model.get_weights()

In [687]:
model.set_weights(weights1)

## Model Re-Evaluation After Image Augmentation

### Model Training with Augmented Images

After applying image augmentation techniques to increase the number of samples for underrepresented categories, we retrained the model starting from the previously trained version. This approach was chosen to leverage the pre-learned features and improve convergence.

### Performance Results

Upon evaluating the retrained model with augmented images, the following results were observed:
- **Validation Accuracy**: Decreased to 85%
  - The validation accuracy, which was previously 87.5%, dropped to 85%. This suggests that the model did not generalize as well to the validation data after augmentation.

- **Test Accuracy**: Decreased to 82%
  - The test set accuracy, which was previously 85%, fell to 82%. This decline indicates that the model's performance on unseen data has also decreased.

### Analysis

- **Original Model Performance**: The original model, which was trained without image augmentation, achieved higher accuracy on both the validation and test sets.
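- **Impact of Augmentation**: Only one class (label 5) was augmented, using brightness, hue, saturation, and contrast jitter. The decrease in accuracy suggests that these extra samples may have introduced noise or shifted the data distribution in a way that hurt the model’s performance. The comparison is also not fully controlled, since the model was retrained from the earlier weights with a lower learning rate (0.003 instead of 0.007).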

### Conclusion

Despite the intention to improve the model's performance through augmentation, the results indicate that the original model without augmentation performed better on both validation and test datasets. Further analysis might be needed to refine augmentation techniques or explore other methods for handling class imbalance.
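
One way to start that further analysis is to look at recall per class rather than at overall accuracy. The sketch below builds a confusion matrix with NumPy on made-up labels; `per_class_recall` is a hypothetical helper and expects integer class labels.

```python
import numpy as np

def per_class_recall(y_true, y_pred, n_classes):
    """Build a confusion matrix (rows = true class) and return it with per-class recall."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    support = cm.sum(axis=1)
    recall = np.divide(np.diag(cm), support, out=np.zeros(n_classes), where=support > 0)
    return cm, recall

# Hypothetical labels for three classes; class 2 is rare.
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 2]
cm, recall = per_class_recall(y_true, y_pred, n_classes=3)
print(cm)
print(recall)
```

Comparing these per-class values before and after augmentation would show whether the augmented class actually improved, even if overall accuracy dropped.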
