<div style="display: flex; background-color: #3F579F;">
    <h1 style="margin: auto; font-weight: bold; padding: 30px 30px 0px 30px; color:#fff;" align="center">Automatically classify consumer goods - P6</h1>
</div>
<div style="display: flex; background-color: #3F579F; margin: auto; padding: 5px 30px 0px 30px;" >
    <h3 style="width: 100%; text-align: center; float: left; font-size: 24px; color:#fff;" align="center">| Notebook - Convolutional Neural Networks |</h3>
</div>
<div style="display: flex; background-color: #3F579F; margin: auto; padding: 10px 30px 30px 30px;">
    <h4 style="width: 100%; text-align: center; float: left; font-size: 24px; color:#fff;" align="center">Data Scientist course - OpenClassrooms</h4>
</div>

<div class="alert alert-block alert-info">
    <p>In this notebook, we classify the product images using a Convolutional Neural Network (CNN)</p>
</div>

<div style="background-color: #506AB9;" >
    <h2 style="margin: auto; padding: 20px; color:#fff; ">1. Libraries and functions</h2>
</div>

<div style="background-color: #6D83C5;" >
    <h3 style="margin: auto; padding: 20px; color:#fff; ">1.1. Libraries and functions</h3>
</div>

In [1]:
## General
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_theme(style="darkgrid")

## TensorFlow
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing.image import load_img, img_to_array

## Scikit Learn 
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score  # ARI used in the clustering section

## Own specific functions 
from functions import *

## Images paths
THUMBNAILS_IMAGES_PATH  = "images/Flipkart/thumbnails/"
ORIGINAL_IMAGES_PATH = "images/Flipkart/"

<div style="background-color: #506AB9;" >
    <h2 style="margin: auto; padding: 20px; color:#fff; ">2. Importing files and Initial analysis</h2>
</div>

<div style="background-color: #6D83C5;" >
    <h3 style="margin: auto; padding: 20px; color:#fff; ">2.1. Importing and preparing files</h3>
</div>

<div class="alert alert-block alert-info">
    We load the dataset so that we have reference labels to compare the results against
</div>

In [2]:
df_data = pd.read_csv(os.path.join("datasets", "df_data.csv"), index_col=[0])

In [3]:
df_data = df_data[["image", "category_1"]].copy()

<div style="background-color: #6D83C5;" >
    <h3 style="margin: auto; padding: 20px; color:#fff; ">2.2. Initial analysis</h3>
</div>

In [4]:
df_analysis(df_data, "df_data", analysis_type="complete")


Analysis Header of df_data dataset
--------------------------------------------------------------------------------
- Dataset shape:			 1050 rows and 2 columns
- Total of NaN values:			 0
- Percentage of NaN:			 0.0 %
- Total of full duplicates rows:	 0
- Total of empty rows:			 0
- Total of empty columns:		 0
- Unique indexes:			 True
- Memory usage:				 24.6+ KB

Detailed analysis of df_data dataset
----------------------------------------------------------------------


Unnamed: 0,name,type,records,unique
0,image,object,1050,1050
1,category_1,object,1050,7


<div style="background-color: #506AB9;" >
    <h2 style="margin: auto; padding: 20px; color:#fff; ">3. Convolutional Neural Networks - VGG16</h2>
</div>

<div style="background-color: #6D83C5;" >
    <h3 style="margin: auto; padding: 20px; color:#fff; ">3.1. Setup the model</h3>
</div>

In [5]:
model = VGG16(weights="imagenet", include_top=False)
model.summary()

Model: "vgg16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, None, None, 3)]   0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, None, None, 64)    1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, None, None, 64)    36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, None, None, 64)    0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, None, None, 128)   73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, None, None, 128)   147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, None, None, 128)   0     
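
<div class="alert alert-block alert-info">
    <p>With <code>include_top=False</code> the network stops after its last pooling block, so the length of the extracted feature vector depends on the input size. The sketch below works through that arithmetic. It assumes the standard VGG16 layout (five 2x2 max-pooling stages and 512 channels in the last block), which the truncated summary above does not show; <code>vgg16_feature_shape</code> is a hypothetical helper, not part of Keras.</p>
</div>

```python
# Standard VGG16 layout (assumed here, not read from the model object)
N_POOL_BLOCKS = 5          # each 2x2 max-pooling halves height and width
LAST_BLOCK_CHANNELS = 512  # channels produced by the last convolutional block


def vgg16_feature_shape(height, width):
    """Return the (height, width, channels) shape of the last pooling output."""
    for _ in range(N_POOL_BLOCKS):
        height //= 2
        width //= 2
    return height, width, LAST_BLOCK_CHANNELS


shape = vgg16_feature_shape(224, 224)
flattened_length = shape[0] * shape[1] * shape[2]
print(shape, flattened_length)  # (7, 7, 512) 25088
```

For 224x224 inputs, each image should therefore yield a flattened vector of 25088 values.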

<div style="background-color: #6D83C5;" >
    <h3 style="margin: auto; padding: 20px; color:#fff; ">3.2. Feature extraction</h3>
</div>

In [6]:
vgg16_feature_list = []

for ind in df_data.index:
    
    # Load the image and resize it to the 224x224 input size
    image = load_img(ORIGINAL_IMAGES_PATH + 
                     df_data.loc[ind, "image"],
                     target_size=(224, 224))
    
    # Convert to a (1, 224, 224, 3) batch and apply the VGG16 preprocessing
    image = img_to_array(image)
    image = np.expand_dims(image, axis=0)
    image = preprocess_input(image)
    
    # Extract the convolutional features and flatten them to a 1-D vector
    vgg16_feature = model.predict(image, verbose=0)
    vgg16_feature_list.append(vgg16_feature.flatten())

# One row per image, one column per feature
vgg16_feature_list_np = np.array(vgg16_feature_list)
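
<div class="alert alert-block alert-info">
    <p>Calling <code>predict</code> once per image can be slow for about a thousand images. One common alternative is to send several images per call. The sketch below only illustrates the batching logic with NumPy: <code>extract_features_in_batches</code>, <code>fake_predict</code> and the batch size are hypothetical, and <code>predict_fn</code> stands in for <code>model.predict</code>.</p>
</div>

```python
import numpy as np


def extract_features_in_batches(images, predict_fn, batch_size=32):
    """Run predict_fn on slices of `images` and return one flat vector per image.

    images     : array (n_images, height, width, channels), already preprocessed
    predict_fn : callable mapping a batch to a feature array
                 (hypothetical stand-in for model.predict)
    """
    features = []
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size]
        batch_features = np.asarray(predict_fn(batch))
        # Flatten everything except the batch dimension
        features.append(batch_features.reshape(len(batch), -1))
    return np.concatenate(features, axis=0)


def fake_predict(batch):
    """Toy stand-in for the network: average-pool each image to 2x2 cells."""
    n, h, w, c = batch.shape
    return batch.reshape(n, 2, h // 2, 2, w // 2, c).mean(axis=(2, 4))


toy_images = np.random.default_rng(0).random((10, 8, 8, 3))
toy_features = extract_features_in_batches(toy_images, fake_predict, batch_size=4)
print(toy_features.shape)  # (10, 12)
```

The result has the same layout as the per-image loop: one row per image and one column per flattened feature.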




<div class="alert alert-block alert-info">
    <p>Let's see the result</p>
</div>

In [None]:
# Store one feature vector per row; assigning by position avoids chained
# indexing and does not depend on the values of the DataFrame index
df_data["VGG16"] = list(vgg16_feature_list_np)

df_data.head()

<div class="alert alert-block alert-info">
    <p>Now, let's create the BoVW based on the result</p>
</div>

In [None]:
# Stack the per-image vectors into a matrix: one row per image, one column per feature
df_VGG16 = pd.DataFrame(np.vstack(df_data["VGG16"].tolist()))
df_VGG16.head()

In [None]:
plt.figure(figsize=(15, 15))
ax = plt.subplot(311)

ax.set_title("Labels histogram - VGG16", size=20, fontweight="bold")
ax.set_xlabel("Visual words", size=14)
ax.set_ylabel("Frequency", size=14)

# Plot the feature vector of one image (row 1), not a single feature column
ax.plot(df_VGG16.iloc[1].to_numpy())

plt.tight_layout()
plt.show()

<div style="background-color: #6D83C5;" >
    <h3 style="margin: auto; padding: 20px; color:#fff; ">3.3. PCA and T-SNE dimension reduction</h3>
</div>

<div class="alert alert-block alert-info">
    <p>Let's look at the dataset shape before doing the PCA </p>
</div>

In [None]:
print("Dataset shape: " + str(df_VGG16.shape[0]) + " rows and " + 
      str(df_VGG16.shape[1]) + " columns")

<div class="alert alert-block alert-info">
    <p>Next, we apply PCA, keeping enough components to explain 80% of the variance</p>
</div>

In [None]:
pca = PCA(n_components=0.80)
VGG16_pca = pca.fit_transform(df_VGG16)
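
<div class="alert alert-block alert-info">
    <p>When <code>n_components</code> is a float between 0 and 1, scikit-learn's PCA keeps the smallest number of components whose cumulative explained variance ratio exceeds that fraction. The NumPy sketch below reproduces that selection rule on toy data, under the assumption that a plain SVD of the centered data is an acceptable stand-in for the library's solver; <code>n_components_for_variance</code> and the toy matrix are hypothetical.</p>
</div>

```python
import numpy as np


def n_components_for_variance(X, threshold=0.80):
    """Return (k, cumulative) where k is the number of principal components
    needed for the cumulative explained variance ratio to exceed `threshold`."""
    X_centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    explained_variance = singular_values ** 2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    cumulative = np.cumsum(ratio)
    # First position where the cumulative ratio is strictly above the threshold
    k = int(np.searchsorted(cumulative, threshold, side="right")) + 1
    return k, cumulative


# Toy data: 200 samples, 10 features with decreasing scale
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 10)) * np.linspace(5.0, 0.5, 10)

k, cumulative = n_components_for_variance(X_toy, threshold=0.80)
print(k, round(float(cumulative[k - 1]), 3))
```

Comparing <code>k</code> with the number of columns in the output of <code>fit_transform</code> is a quick way to check how much the dimension was reduced.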

<div class="alert alert-block alert-info">
    <p>Let's look at the dataset shape again </p>
</div>

In [None]:
VGG16_pca.shape

<div class="alert alert-block alert-info">
    <p>Before running T-SNE, we <b>encode</b> the first level of the category tree with LabelEncoder</p>
</div>

In [None]:
le = LabelEncoder()
df_data["category_encode"] = le.fit_transform(df_data["category_1"])
df_data[["category_1", "category_encode"]].head()
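
<div class="alert alert-block alert-info">
    <p>LabelEncoder maps each distinct label to an integer between 0 and n_classes - 1, following the sorted order of the labels. The sketch below shows the same mapping with <code>np.unique</code>; the category names are made up for illustration and are not taken from the dataset.</p>
</div>

```python
import numpy as np

# Hypothetical labels, for illustration only
toy_categories = np.array(["watches", "baby care", "watches", "computers"])

# Sorted unique labels and, for each input, the position of its label
classes, codes = np.unique(toy_categories, return_inverse=True)

print(classes.tolist())  # ['baby care', 'computers', 'watches']
print(codes.tolist())    # [2, 0, 2, 1]

# Indexing the classes with the codes recovers the original labels
decoded = classes[codes]
```

The integer codes carry no order or distance between categories; they only serve as identifiers for the comparison with the cluster labels.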

<div class="alert alert-block alert-info">
    <p>Let's reduce the dimension with T-SNE</p>
</div>

In [None]:
tsne = TSNE(n_components=2, perplexity=30,
            n_iter=2000, init="random",
            random_state=6, learning_rate="auto")

X_tsne = tsne.fit_transform(VGG16_pca)

VGG16_pca_tsne = pd.DataFrame(X_tsne[:, 0:2], columns=["tsne1", "tsne2"])
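# Assign by position so the labels do not depend on index alignment with df_data
VGG16_pca_tsne["class_encode"] = df_data["category_encode"].to_numpy()
VGG16_pca_tsne["class"] = df_data["category_1"].to_numpy()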

VGG16_pca_tsne.head()

<div style="background-color: #6D83C5;" >
    <h3 style="margin: auto; padding: 20px; color:#fff; ">3.4. Clusterization</h3>
</div>

<div style="background-color: #6D83C5;" >
    <h4 style="margin: auto; padding: 20px; color:#fff; ">3.4.1. KMeans</h4>
</div>

<div class="alert alert-block alert-info">
    <p>The number of clusters is set to the number of categories in the first level of the category tree</p>
</div>

In [None]:
n_clusters = df_data["category_1"].nunique()

<div class="alert alert-block alert-info">
    <p>Let's run the clustering</p>
</div>

In [None]:
kmeans = KMeans(init="k-means++", n_clusters=n_clusters,
                max_iter=1000, random_state=10)

cluster_labels = kmeans.fit_predict(VGG16_pca_tsne[["tsne1", "tsne2"]])
VGG16_pca_tsne["cluster"] = cluster_labels

# Calculating ARI based on the first level of the tree categories
ari = adjusted_rand_score(VGG16_pca_tsne["class_encode"], VGG16_pca_tsne["cluster"])

In [None]:
ari
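
<div class="alert alert-block alert-info">
    <p>The ARI compares two partitions by counting pairs of samples that are grouped together in both, corrected for chance, so it does not matter which integer each cluster receives. The pure-Python sketch below follows the usual contingency-table formula; <code>adjusted_rand_index</code> is a hypothetical helper written for illustration, not the scikit-learn implementation.</p>
</div>

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index of two labelings of the same samples (n >= 2)."""
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))  # contingency table
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of sample pairs placed together, in both / in each partition
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_pairs - expected) / (maximum - expected)


# Same partition with swapped cluster ids: perfect agreement
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# Clusters that split every true class in half: approximately -0.5
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

A value close to 1 means the clusters recover the categories, while a value close to 0 means the agreement is no better than chance.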

In [None]:
fig, axes = plt.subplots(1, 2, sharex=True, figsize=(16, 8))
# fig.suptitle(key.upper() + " - ARI score: " + str(round(value, 2)),
#              fontsize=18, fontweight="bold")    

sns.scatterplot(ax=axes[0], x="tsne1", y="tsne2", hue="class", 
                data=VGG16_pca_tsne, legend="brief",
                palette=sns.color_palette("tab10", n_colors=7),
                s=50, alpha=0.6)
axes[0].legend(loc="best", prop={"size": 12},
          title="Categories")
axes[0].set_title("True categories", fontsize=14)

sns.scatterplot(ax=axes[1], x="tsne1", y="tsne2", hue="cluster", 
                data=VGG16_pca_tsne, legend="brief",
                palette=sns.color_palette("tab10", n_colors=7),
                s=50, alpha=0.6)
axes[1].legend(loc="best", prop={"size": 12},
          title="Clusters")
axes[1].set_title("Clusters", fontsize=14)

plt.tight_layout()
plt.show()
print("\n")

In [None]:
# Reference notebook:
# https://github.com/valentincorad/CentraleSupelec-Projects/blob/main/Projet%205%20-%20Classifiez%20automatiquement%20des%20biens%20de%20consommation.ipynb

<div class="alert alert-block alert-danger">
    <h1>>>>> FLAG POSITION &lt;&lt;&lt;&lt; </h1>
</div>

In [None]:
raise SystemExit("Stop right there!")

<div style="background-color: #6D83C5;" >
    <h3 style="margin: auto; padding: 20px; color:#fff; ">3.3. PCA and T-SNE dimension reduction</h3>
</div>

<div class="alert alert-block alert-info">
    <p>Let's look at the dataset shape before the dimension reduction</p>
</div>

In [None]:
print("Dataset shape: " + str(df_VGG16.shape[0]) + " rows and " + 
      str(df_VGG16.shape[1]) + " columns")

<div class="alert alert-block alert-info">
    <p>Before running T-SNE, we <b>encode</b> the first level of the category tree with LabelEncoder</p>
</div>

In [None]:
le = LabelEncoder()
df_data["category_encode"] = le.fit_transform(df_data["category_1"])
df_data[["category_1", "category_encode"]].head()

<div class="alert alert-block alert-info">
    <p>Let's reduce the dimension with T-SNE</p>
</div>

In [None]:
tsne = TSNE(n_components=2, perplexity=30,
            n_iter=2000, init="random",
            random_state=6, learning_rate="auto")

X_tsne = tsne.fit_transform(df_VGG16)

VGG16_tsne = pd.DataFrame(X_tsne[:, 0:2], columns=["tsne1", "tsne2"])
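# Assign by position so the labels do not depend on index alignment with df_data
VGG16_tsne["class_encode"] = df_data["category_encode"].to_numpy()
VGG16_tsne["class"] = df_data["category_1"].to_numpy()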

VGG16_tsne.head()

<div style="background-color: #6D83C5;" >
    <h3 style="margin: auto; padding: 20px; color:#fff; ">3.4. Clusterization</h3>
</div>

<div style="background-color: #6D83C5;" >
    <h4 style="margin: auto; padding: 20px; color:#fff; ">3.4.1. KMeans</h4>
</div>

<div class="alert alert-block alert-info">
    <p>The number of clusters is set to the number of categories in the first level of the category tree</p>
</div>

In [None]:
n_clusters = df_data["category_1"].nunique()

<div class="alert alert-block alert-info">
    <p>Let's run the clustering</p>
</div>

In [None]:
kmeans = KMeans(init="k-means++", n_clusters=n_clusters,
                max_iter=1000, random_state=10)

cluster_labels = kmeans.fit_predict(VGG16_tsne[["tsne1", "tsne2"]])
VGG16_tsne["cluster"] = cluster_labels

# Calculating ARI based on the first level of the tree categories
ari = adjusted_rand_score(VGG16_tsne["class_encode"], VGG16_tsne["cluster"])

In [None]:
ari

In [None]:
    
fig, axes = plt.subplots(1, 2, sharex=True, figsize=(16, 8))
# fig.suptitle(key.upper() + " - ARI score: " + str(round(value, 2)),
#              fontsize=18, fontweight="bold")    

sns.scatterplot(ax=axes[0], x="tsne1", y="tsne2", hue="class", 
                data=VGG16_tsne, legend="brief",
                palette=sns.color_palette("tab10", n_colors=7),
                s=50, alpha=0.6)
axes[0].legend(loc="best", prop={"size": 12},
          title="Categories")
axes[0].set_title("True categories", fontsize=14)

sns.scatterplot(ax=axes[1], x="tsne1", y="tsne2", hue="cluster", 
                data=VGG16_tsne, legend="brief",
                palette=sns.color_palette("tab10", n_colors=7),
                s=50, alpha=0.6)
axes[1].legend(loc="best", prop={"size": 12},
          title="Clusters")
axes[1].set_title("Clusters", fontsize=14)

plt.tight_layout()
plt.show()
print("\n")

<div class="alert alert-block alert-danger">
    <h1>>>>> FLAG POSITION &lt;&lt;&lt;&lt; </h1>
</div>

In [None]:
df_VGG16 = extract_data("VGG16")

In [None]:
df_VGG16.head()

<div class="alert alert-block alert-info">
    <p>Let's reduce the dimension with T-SNE</p>
</div>

In [None]:
tsne = TSNE(n_components=2, perplexity=30, 
            n_iter=2000, init="random",
            random_state=6, learning_rate="auto")

X_tsne = tsne.fit_transform(datasets_pca[i])

<div class="alert alert-block alert-danger">
    <h1>>>>> FLAG POSITION &lt;&lt;&lt;&lt; </h1>
</div>

In [None]:
raise SystemExit("Stop right there!")

In [None]:
model = VGG16()

In [None]:
model.summary()

In [None]:
df = pd.read_csv(os.path.join("datasets", "flipkart_com-ecommerce_sample_1050.csv"))
df.head()

In [None]:
for ind in df.index:
    #print(df["image"][ind])
    
    image = load_img(ORIGINAL_IMAGES_PATH + df["image"][ind],
                     target_size=(224, 224))
    image = img_to_array(image)
    image = image.reshape((1, image.shape[0], 
                           image.shape[1], 
                           image.shape[2]))
    image = preprocess_input(image)
    y_pred = model.predict(image)
    label = decode_predictions(y_pred, top=1)
    
    print(label)
    
    