### Contents
* 1. IMPORTING LIBRARIES AND DATASET
* 2. PERFORM EXPLORATORY DATA ANALYSIS AND DATA CLEANING
* 3. FIND THE OPTIMAL NUMBER OF CLUSTERS USING ELBOW METHOD
    * Apply k-Means
* 4. APPLY PRINCIPAL COMPONENT ANALYSIS AND VISUALIZE THE RESULTS
* 5. APPLY AUTOENCODERS (PERFORM DIMENSIONALITY REDUCTION USING AUTOENCODERS)
    * Apply K-Means again after obtaining results from encoders
    * Final Observations

# 1. IMPORTING LIBRARIES AND DATASET

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import cv2
from IPython.display import display
import zipfile

from sklearn.preprocessing import StandardScaler, normalize
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.keras import Sequential
from tensorflow.keras import layers, optimizers
from tensorflow.keras.layers import *
from tensorflow.keras import Model
from tensorflow.keras.utils import plot_model
from tensorflow.keras.initializers import glorot_uniform
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping, ModelCheckpoint, LearningRateScheduler

In [None]:
sales_df = pd.read_csv("../input/sample-sales-data/sales_data_sample.csv", encoding='unicode_escape')
# MSRP is the manufacturer's suggested retail price (the sticker price) of a product.
# It is used to standardize product prices across a company's store locations.
sales_df

In [None]:
sales_df.info()

In [None]:
# Convert order date to datetime format
sales_df['ORDERDATE'] = pd.to_datetime(sales_df['ORDERDATE'])
# Check the type of data of ORDERDATE
sales_df.dtypes

In [None]:
# checking for null values
sales_df.isnull().sum()

The columns `ADDRESSLINE2`, `STATE`, `POSTALCODE` and `TERRITORY` contain null values.
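
Raw null counts are easier to judge as a share of all rows. The following is a minimal sketch on a toy frame; the values and the 40% threshold are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for sales_df; values are illustrative only.
toy = pd.DataFrame({
    "STATE": ["NY", np.nan, np.nan, "CA"],
    "TERRITORY": [np.nan, "EMEA", np.nan, np.nan],
    "COUNTRY": ["USA", "France", "Norway", "USA"],
})

# Fraction of missing values per column, largest first.
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)

# Columns whose missing share exceeds a hypothetical 40% threshold.
mostly_null = null_share[null_share > 0.4].index.tolist()
print(mostly_null)
```

A column-level share like this can help decide whether dropping a column or imputing its values is the more reasonable choice.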

In [None]:
# 'ADDRESSLINE2', 'STATE', 'POSTALCODE' and 'TERRITORY' contain many null values, so we can drop them.
# 'COUNTRY' is kept to represent the geographical information of each order.
# We can also drop the city, address line 1, phone number, contact first/last name, customer name and order number, since they are not required for the analysis.

to_drop  = ['ADDRESSLINE1', 'ADDRESSLINE2', 'POSTALCODE', 'CITY', 'TERRITORY', 'PHONE', 'STATE', 'CONTACTFIRSTNAME', 'CONTACTLASTNAME', 'CUSTOMERNAME', 'ORDERNUMBER']
sales_df = sales_df.drop(to_drop, axis = 1)
sales_df.head()

In [None]:
#checking again for null values
sales_df.isnull().sum().sum()

No null values remain, so we are good to go.

# 2. PERFORM EXPLORATORY DATA ANALYSIS AND DATA CLEANING

In [None]:
#number of unique values
sales_df.nunique()

In [None]:
sales_df.COUNTRY.unique()

In [None]:
sales_df.COUNTRY.value_counts()

In [None]:
def barplot_visualization(x):
    '''
    Plot the count of each distinct value in column `x` of sales_df.
    '''
    # value_counts() returns the categories (index) and their counts in matching order;
    # pairing unique() with value_counts() can attach counts to the wrong bars.
    counts = sales_df[x].value_counts()
    fig = px.bar(x=counts.index, y=counts.values, height=600, color=counts.index)
    fig.update_layout(yaxis=dict(title=dict(text='Count', font=dict(size=20))),
                      xaxis=dict(title=dict(text=x, font=dict(size=20))),
                      title_text=x.capitalize() + ' Bar Plot'
                     )
    fig.show()

barplot_visualization('COUNTRY')

In [None]:
barplot_visualization('STATUS')

In [None]:
barplot_visualization('DEALSIZE')

In [None]:
barplot_visualization('PRODUCTLINE')

### Encoding Categorical Variables

In [None]:
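status_dict = {'Shipped':1, 'Cancelled':2, 'On Hold':2, 'Disputed':2, 'In Process':0, 'Resolved':0}
# Assign the result back instead of using inplace=True on a column selection,
# which may not modify sales_df in recent pandas versions
sales_df['STATUS'] = sales_df['STATUS'].replace(status_dict)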

In [None]:
sales_df = pd.get_dummies(data=sales_df, columns=['PRODUCTLINE', 'DEALSIZE', 'COUNTRY'])
sales_df.shape

In [None]:
sales_df.head()

In [None]:
pd.Categorical(sales_df['PRODUCTCODE'])

In [None]:
pd.Categorical(sales_df['PRODUCTCODE']).codes

In [None]:
# PRODUCTCODE has 109 unique values, so one-hot encoding it would add 109 columns.
# We avoid that by using integer category codes instead.
# This is not the optimal encoding, but it helps avoid the curse of dimensionality.
sales_df['PRODUCTCODE'] = pd.Categorical(sales_df['PRODUCTCODE']).codes
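
To make the trade-off between the two encodings concrete, here is a small sketch on hypothetical product codes (the strings below are only examples). Integer codes keep a single column, while one-hot encoding adds one column per distinct value.

```python
import pandas as pd

# Toy product codes (hypothetical values).
codes = pd.Series(["S18_1749", "S10_1678", "S18_1749", "S24_2000"])

# Integer codes: a single column; categories are sorted, so codes follow sorted order.
as_codes = pd.Categorical(codes).codes
print(as_codes.tolist())  # [1, 0, 1, 2]

# One-hot: one column per distinct value.
one_hot = pd.get_dummies(codes)
print(one_hot.shape)  # (4, 3)
```

One caveat worth keeping in mind: integer codes impose an arbitrary ordering, and a distance-based method such as k-means will treat code 2 as "further" from code 0 than code 1 is.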

In [None]:
date_group = sales_df.groupby('ORDERDATE').sum()
date_group

In [None]:
fig = px.line(x = date_group.index, y = date_group.SALES, title = 'Sales vs Date')
fig.update_layout(yaxis=dict(title=dict(text='Sales', font=dict(size=15))),
                  xaxis=dict(title=dict(text='Date', font=dict(size=15)))
                 )
fig.show()

In [None]:
# We can drop 'ORDERDATE' and keep the rest of the date-related data such as 'MONTH'
sales_df.drop("ORDERDATE", axis = 1, inplace = True)
sales_df.shape

In [None]:
plt.figure(figsize = (20, 20))
corr_matrix = sales_df.iloc[:, :10].corr()
sns.heatmap(corr_matrix, annot=True);

**OBSERVATIONS**
- QTR_ID and MONTH_ID are highly correlated
- MSRP is positively correlated with PRICEEACH and SALES
- PRODUCTCODE is negatively correlated with MSRP, PRICEEACH and SALES
- SALES, PRICEEACH and QUANTITYORDERED are positively correlated with each other
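
Reading strong correlations off a large heatmap can be error-prone, so a short helper that lists them may be useful. This is a sketch only: `top_correlated_pairs` and the 0.8 threshold are hypothetical, and the toy data simply mimics how a quarter is derived from a month.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, r) for column pairs with |r| >= threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    # Strongest relationships first.
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Toy data: QTR_ID is derived from MONTH_ID, so the two are strongly correlated.
months = np.arange(1, 13)
toy = pd.DataFrame({
    "MONTH_ID": months,
    "QTR_ID": (months - 1) // 3 + 1,
    "NOISE": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8],
})
pairs = top_correlated_pairs(toy)
print(pairs)
```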

In [None]:
# QTR_ID and MONTH_ID are highly correlated, so they carry nearly the same information.
# Let's drop 'QTR_ID' (we could drop 'MONTH_ID' instead)
sales_df.drop("QTR_ID", axis = 1, inplace = True)
sales_df.shape

In [None]:

# A distplot shows (1) a histogram, (2) a KDE plot and (3) a rug plot.
# (1) Histogram: a graphical display of data using bars of various heights. Each bar groups values into a range, and taller bars show that more data falls in that range.
# (2) KDE plot: a Kernel Density Estimate visualizes the probability density of a continuous variable.
# (3) Rug plot: a plot of a single quantitative variable, displayed as marks along an axis (a one-dimensional scatter plot).
import plotly.figure_factory as ff

for col in sales_df.columns[:8]:
    if col != 'ORDERLINENUMBER':
        # create_distplot expects a list of numeric series, one per group
        fig = ff.create_distplot([sales_df[col].astype(float)], ['distplot'])
        fig.update_layout(title_text=col)
        fig.show()

In [None]:
# Visualize the relationship between variables using pairplots

fig = px.scatter_matrix(sales_df, 
                        dimensions=sales_df.columns[:8], color='MONTH_ID')# fill color by months
fig.update_layout(title_text='Sales Data',
                  width=1100,
                  height=1100
                 )
fig.show()

**OBSERVATIONS**
* A trend exists between 'SALES' and 'QUANTITYORDERED'
* A trend exists between 'MSRP' and 'PRICEEACH' (with some outliers)
* A trend exists between 'PRICEEACH' and 'SALES'
* Sales appear to grow as we move from 2003 to 2004 to 2005 ('SALES' vs. 'YEAR_ID')
# 3. FIND THE OPTIMAL NUMBER OF CLUSTERS USING ELBOW METHOD

In [None]:
# Scale the data
scaler = StandardScaler()
sales_df_scaled = scaler.fit_transform(sales_df)

In [None]:
wcss = []
for i in range(1, 15):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(sales_df_scaled)
    wcss.append(kmeans.inertia_)  # inertia_ is the sum of squared distances of samples to their closest cluster center (WCSS)

# Plot against the actual cluster counts (1..14) so the x-axis shows k rather than a list index
plt.plot(range(1, 15), wcss, marker='o', linestyle='--')
plt.title('The Elbow Method (Finding the right number of clusters)')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()

From this plot we can observe that the elbow of the curve seems to form at around 5 clusters. We first apply k-means with 5 clusters, and later use autoencoders to reduce the dimensionality before clustering again.
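
To see what the elbow curve is measuring, the within-cluster sum of squares (WCSS) can be computed by hand. The sketch below uses a hypothetical helper `wcss_of` and four made-up points on a line; it is an illustration of the quantity, not the k-means fitting procedure itself.

```python
import numpy as np

def wcss_of(points, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    diffs = points - centers[labels]
    return float(np.sum(diffs ** 2))

# Two tight toy groups on a line (illustrative values).
points = np.array([[0.0], [2.0], [10.0], [12.0]])

# k = 1: every point is assigned to the overall mean.
one_center = points.mean(axis=0, keepdims=True)
wcss_k1 = wcss_of(points, np.zeros(4, dtype=int), one_center)
print(wcss_k1)  # 104.0

# k = 2: each group gets its own mean, and WCSS drops sharply.
two_centers = np.array([[1.0], [11.0]])
wcss_k2 = wcss_of(points, np.array([0, 0, 1, 1]), two_centers)
print(wcss_k2)  # 4.0
```

Adding further clusters to this toy data would reduce WCSS only slightly, which is the flattening that the elbow plot looks for.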

In [None]:
#applying k-means with 5 clusters
kmeans = KMeans(n_clusters=5, init='k-means++')
kmeans.fit(sales_df_scaled)
labels = kmeans.labels_
labels

In [None]:
kmeans.cluster_centers_.shape

In [None]:
cluster_centers = pd.DataFrame(data=kmeans.cluster_centers_, columns=sales_df.columns)
cluster_centers

In [None]:
# In order to understand what these numbers mean, let's perform inverse transformation
cluster_centers = scaler.inverse_transform(cluster_centers)
cluster_centers = pd.DataFrame(data=cluster_centers, columns=sales_df.columns)
cluster_centers
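
The inverse transformation simply undoes the standardization: each scaled value is multiplied by the feature's standard deviation and shifted by its mean. A minimal NumPy sketch with made-up numbers:

```python
import numpy as np

# Toy feature matrix: rows are samples, columns are features (illustrative).
X = np.array([[10.0, 100.0], [20.0, 300.0], [30.0, 500.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)          # population std (ddof=0), as StandardScaler uses

Z = (X - mean) / std         # forward transform
X_back = Z * std + mean      # inverse transform

# A center at the origin of the scaled space maps back to the feature means.
center_scaled = np.zeros(2)
center_original = center_scaled * std + mean
print(center_original)       # [ 20. 300.]
```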

In [None]:
sales_df['ORDERLINENUMBER'] = sales_df['ORDERLINENUMBER'].astype(float)

In [None]:
# Add a label (which cluster) corresponding to each data point
sales_df_cluster = pd.concat([sales_df, pd.DataFrame({'cluster':labels})], axis = 1)
sales_df_cluster

In [None]:
# Plot a histogram of each feature, split by cluster
for i in sales_df.columns[:8]:
    plt.figure(figsize=(30,6))
    for j in range(5):
        plt.subplot(1,5,j+1)
        cluster = sales_df_cluster[sales_df_cluster['cluster']==j]
        cluster[i].hist()
        plt.title('{} \ncluster {}'.format(i,j))
    # Show each feature's figure as soon as it is complete
    plt.show()

**OBSERVATIONS:**
* CLUSTER 0 (highest) - Customers in this group buy items in high quantity, at a price of ~99 per item. They also correspond to the highest total sales (~8293) and are the highest buyers of products with a high MSRP (~158).
* CLUSTER 1 - This cluster is close to cluster 4, with an MSRP of around 94, an average quantity ordered of ~34, an average price of ~83 and sales of ~3169.
* CLUSTER 2 (lowest) - This group represents customers who buy items in varying quantities (~30) and tend to buy low-priced items (~68). Their sales are ~2061, and they buy products with a lower MSRP (~62).
* CLUSTER 3 - This is the second highest cluster. This group buys in medium quantity (~38), with total sales of up to ~4405 and an average price of ~95. The MSRP is around 115.
* CLUSTER 4 - This group represents customers who are only active during the holidays. They buy in lower quantity (~35) but tend to buy average-priced items (~87). They also correspond to lower total sales (~3797) and tend to buy items with an MSRP of around 116.

**NOTE:** The KMeans results in the final (saved version) run may differ in cluster numbering and values, but the observations will be similar.
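
Because cluster numbers are arbitrary, two runs can produce the same grouping under different labels. The sketch below uses a hypothetical helper, `same_partition`, to check whether two labelings describe identical groups regardless of numbering; the label lists are made up.

```python
import pandas as pd

def same_partition(labels_a, labels_b):
    """True if the two labelings group the samples identically,
    ignoring which integer each cluster happens to receive."""
    table = pd.crosstab(pd.Series(labels_a), pd.Series(labels_b))
    # Identical partitions give exactly one non-zero cell per row and per column.
    nonzero = table.to_numpy() > 0
    return bool((nonzero.sum(axis=0) == 1).all() and (nonzero.sum(axis=1) == 1).all())

run_1 = [0, 0, 1, 1, 2]
run_2 = [2, 2, 0, 0, 1]   # same groups, different numbering
run_3 = [0, 1, 1, 1, 2]   # one sample moved to another group

print(same_partition(run_1, run_2))  # True
print(same_partition(run_1, run_3))  # False
```

If repeatable numbering matters, passing a fixed `random_state` to `KMeans` is another option.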

# 4. APPLY PRINCIPAL COMPONENT ANALYSIS AND VISUALIZE THE RESULTS

In [None]:
pca = PCA(n_components=3)
principal_comp = pca.fit_transform(sales_df_scaled)
principal_comp

In [None]:
pca_df = pd.DataFrame(data=principal_comp, columns=['pca1', 'pca2', 'pca3'])
pca_df.head()

In [None]:
pca_df = pd.concat([pca_df, pd.DataFrame({'cluster':labels})], axis=1)
pca_df.head()

In [None]:
fig = px.scatter_3d(pca_df, x='pca1', y='pca2', z='pca3', 
                    color='cluster', symbol='cluster', size_max=18, opacity=0.7)
fig.update_layout(margin = dict(l = 0, r = 0, b = 0, t = 0))

Some clusters seem to overlap each other; we will try to address this using autoencoders.
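
Part of the apparent overlap may come from the projection itself: three components can only show a fraction of the variance in the scaled features. The sketch below computes that fraction with a plain SVD on random toy data. The helper name is hypothetical; scikit-learn's fitted `PCA` object exposes the same quantity as `explained_variance_ratio_`.

```python
import numpy as np

def explained_variance_ratio(X, n_components):
    """Share of total variance captured by the leading principal components."""
    Xc = X - X.mean(axis=0)                      # PCA works on centered data
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, largest first
    var = s ** 2                                 # proportional to component variances
    return var[:n_components] / var.sum()

rng = np.random.default_rng(0)
# Toy data: 200 samples, 10 uncorrelated features with equal variance.
X = rng.normal(size=(200, 10))
ratio = explained_variance_ratio(X, 3)
print(ratio, ratio.sum())
```

When the summed ratio is low, clusters that are well separated in the full feature space can still look mixed in a 3D scatter plot.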

# 5. APPLY AUTOENCODERS (PERFORM DIMENSIONALITY REDUCTION USING AUTOENCODERS)

* Autoencoders are a type of artificial neural network used to perform data encoding (representation learning)
* Autoencoders are trained to reproduce their input at the output
* Autoencoders work by adding a bottleneck to the network
* This bottleneck forces the network to create a compressed (encoded) version of the original input
* Autoencoders work well when there is correlation between the inputs
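
The bottleneck idea in the list above can be illustrated without a deep learning framework. The following is a toy sketch only: a linear autoencoder (no biases or activations) trained with plain gradient descent in NumPy on made-up, strongly correlated inputs. The sizes, learning rate and step count are arbitrary assumptions, and this is not the Keras model used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy inputs: 4 features that are all linear mixes of 2 hidden factors,
# so the inputs are strongly correlated (illustrative data only).
factors = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 4))
X = factors @ mixing

# Linear autoencoder: 4 -> 2 (bottleneck) -> 4.
W_enc = rng.normal(scale=0.1, size=(4, 2))
W_dec = rng.normal(scale=0.1, size=(2, 4))

def mse(a, b):
    return float(np.mean((a - b) ** 2))

initial_loss = mse(X @ W_enc @ W_dec, X)

lr = 0.01
for _ in range(2000):
    code = X @ W_enc                  # compressed representation
    recon = code @ W_dec              # reconstruction of the input
    # Gradient of 0.5 * (per-sample mean of squared error) w.r.t. recon
    err = (recon - X) / len(X)
    grad_dec = code.T @ err
    grad_enc = X.T @ (err @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final_loss = mse(X @ W_enc @ W_dec, X)
print(initial_loss, final_loss)
```

Because the four toy features depend on only two underlying factors, a two-unit bottleneck can in principle reconstruct them well, which is the sense in which correlated inputs suit autoencoders.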

In [None]:
sales_df.shape

In [None]:

input_df = Input(shape = (38,))
x = Dense(50, activation = 'relu')(input_df)
x = Dense(500, activation = 'relu', kernel_initializer = 'glorot_uniform')(x)
x = Dense(500, activation = 'relu', kernel_initializer = 'glorot_uniform')(x)
x = Dense(2000, activation = 'relu', kernel_initializer = 'glorot_uniform')(x)
encoded = Dense(8, activation = 'relu', kernel_initializer = 'glorot_uniform')(x)
x = Dense(2000, activation = 'relu', kernel_initializer = 'glorot_uniform')(encoded)
x = Dense(500, activation = 'relu', kernel_initializer = 'glorot_uniform')(x)
decoded = Dense(38, kernel_initializer = 'glorot_uniform')(x)

# autoencoder
autoencoder = Model(input_df, decoded)

# encoder - used for dimensionality reduction
encoder = Model(input_df, encoded)

autoencoder.compile(optimizer = 'adam', loss='mean_squared_error')

In [None]:
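# Train on the scaled features, the same representation that is passed to predict() afterwards
autoencoder.fit(sales_df_scaled, sales_df_scaled, batch_size=128, epochs=500, verbose=3)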

In [None]:
# Use the encoder (not the full autoencoder) to obtain the 8-dimensional compressed representation
encoded_df = encoder.predict(sales_df_scaled)

In [None]:
wcss = []
for i in range(1, 15):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(encoded_df)
    wcss.append(kmeans.inertia_)  # inertia_ is the sum of squared distances of samples to their closest cluster center (WCSS)

# Plot against the actual cluster counts (1..14) so the x-axis shows k rather than a list index
plt.plot(range(1, 15), wcss, marker='o', linestyle='--')
plt.title('The Elbow Method (Finding the right number of clusters)')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()

In [None]:
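# From the elbow plot above, 3 clusters seems the best choice
kmeans = KMeans(3)
kmeans.fit(encoded_df)
labels = kmeans.labels_
# Note: this refits k-means on the scaled original features, so kmeans.cluster_centers_
# is expressed in the 38-feature space; `labels` above still comes from the fit on encoded_df
y = kmeans.fit_predict(sales_df_scaled)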

In [None]:
df_cluster_dr = pd.concat([sales_df, pd.DataFrame({'cluster':labels})], axis = 1)
df_cluster_dr.head()

In [None]:
# Pass the column index directly; wrapping it in a list would create a MultiIndex
cluster_centers = pd.DataFrame(data = kmeans.cluster_centers_, columns = sales_df.columns)
cluster_centers = scaler.inverse_transform(cluster_centers)
cluster_centers = pd.DataFrame(data = cluster_centers, columns = sales_df.columns)
cluster_centers

In [None]:
# Plot a histogram of each feature, split by cluster
for i in sales_df.columns[:8]:
    plt.figure(figsize = (30, 6))
    for j in range(3):
        plt.subplot(1, 3, j+1)
        cluster = df_cluster_dr[df_cluster_dr['cluster'] == j]
        cluster[i].hist()
        plt.title('{}    \nCluster - {} '.format(i,j))

    plt.show()

**FINAL OBSERVATIONS:**
* Cluster 0 - This group represents customers who buy items in high quantity (47) and usually buy high-priced items (99). They bring in more sales than the other clusters and are mostly active throughout the year. They usually buy products with product codes 10-90 and a high MSRP (158).
* Cluster 1 - This group represents customers who buy items in average quantity (37) and tend to buy high-priced items (95). They bring in average sales (4398) and are active all year round. They are the highest buyers of products with product codes 0-10 and 90-100, and they prefer products with a high MSRP (115).
* Cluster 2 - This group represents customers who buy items in small quantity (30) and tend to buy low-priced items (69). They correspond to the lowest total sales (2061) and are active all year round. They are the highest buyers of products with product codes 0-20 and 100-110, and they tend to buy products with a low MSRP (77).
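
Cluster profiles like the ones above can also be tabulated directly from the labeled data instead of being read off histograms. A minimal sketch on a made-up frame (the rows below are illustrative and are not actual records from the dataset):

```python
import pandas as pd

# Toy rows with a cluster label already attached (values are illustrative).
toy = pd.DataFrame({
    "QUANTITYORDERED": [45, 49, 36, 38, 30, 30],
    "PRICEEACH": [100.0, 98.0, 95.0, 95.0, 70.0, 68.0],
    "cluster": [0, 0, 1, 1, 2, 2],
})

# Mean of every feature per cluster, plus the cluster size.
profile = toy.groupby("cluster").mean()
profile["n_rows"] = toy.groupby("cluster").size()
print(profile)
```

The same pattern could be applied to `df_cluster_dr`, assuming it carries the `cluster` column as in the cells above.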

In [None]:
# Reduce the original data to 3 dimensions using PCA to visualize the clusters
pca = PCA(n_components = 3)
prin_comp = pca.fit_transform(sales_df_scaled)
pca_df = pd.DataFrame(data = prin_comp, columns = ['pca1', 'pca2', 'pca3'])
pca_df = pd.concat([pca_df, pd.DataFrame({'cluster':labels})], axis = 1)
pca_df.head()

In [None]:
# Visualize clusters using 3D-Scatterplot
fig = px.scatter_3d(pca_df, x = 'pca1', y = 'pca2', z = 'pca3',
              color='cluster', symbol = 'cluster', size_max = 10, opacity = 0.7)
fig.update_layout(margin = dict(l = 0, r = 0, b = 0, t = 0))