# **Table of Contents**
* [Introduction](#Introduction)
* [EDA](#EDA)
    - [Univariate](#Univariate)
    - [Bivariate](#Bivariate)
* [Clustering](#Clustering)
    - [K-Means Clustering](#K-Means-Clustering)
    - [Cluster visualization with Principal Component Analysis (PCA)](#Cluster-visualization-with-Principal-Component-Analysis---PCA)
* [Conclusion](#Conclusion)

<a id="section-one"></a>
# Introduction
Hi everyone,

I am going to explore the [*wholesale customers dataset*](https://archive.ics.uci.edu/ml/datasets/Wholesale+customers#) from the UCI Machine Learning Repository and use an unsupervised clustering model to segment the customers.

The dataset contains information on the clients of a wholesale distributor, and more specifically:
* Annual customer spending (in monetary units, m.u.) on: fresh products, milk products, grocery products, frozen products, detergents and paper products, delicatessen products
* Sales channel: Horeca (Hotel/Restaurant/Café) or Retail (nominal)
* Purchase region: Lisbon, Oporto or Other (nominal)

We are first going to explore the dataset before applying the K-Means Clustering model to discover different segments of customers.

In [None]:
# Load the watermark extension before using its magic command
%load_ext watermark
%watermark -a "Adrien DB" -d -v -u
%watermark --iversions

In [None]:
# Importing the Libraries
import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import dabl

In [None]:
# Importing the Dataset
import os
df = pd.read_csv(r"../input/wholesale-customers-data-set/Wholesale customers data.csv")

In [None]:
df.head()

In [None]:
msno.matrix(df, figsize = (30,4))

Our dataset seems to be complete. Let's check the types of data that we have:

In [None]:
df_data = dabl.clean(df, verbose=1)
dabl.detect_types(df_data)

We don't have to clean our dataset: out of our 8 columns, we already have:
* 6 continuous features ('Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen')
* 2 categorical features ('Channel', 'Region')

However, let's replace the numeric codes in 'Channel' and 'Region' with readable labels to make the later plots clearer:

In [None]:
df['Channel'] = df['Channel'].map({1:'Horeca', 2:'Retail'})
# Assign the result back instead of calling replace(..., inplace=True) on a column
df['Region'] = df['Region'].map({1:'Lisbon', 2:'Oporto', 3:'Other'})

<a id="section-two"></a>
# EDA
We are going to start exploring our data with a univariate analysis (each feature individually), before carrying out a bivariate analysis that compares pairs of features to find correlations between them.

<a id="subsection-two"></a>
## Univariate

In [None]:
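def plot_distribution(df, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
    """Draw a count plot for each categorical column and a histogram for each numeric one."""
    # The 'seaborn-whitegrid' matplotlib style was renamed in matplotlib 3.6,
    # so we set the equivalent style through seaborn instead
    sns.set_style('whitegrid')
    fig = plt.figure(figsize=(width, height))
    fig.subplots_adjust(wspace=wspace, hspace=hspace)
    rows = math.ceil(df.shape[1] / cols)
    for i, column in enumerate(df.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        if not pd.api.types.is_numeric_dtype(df[column]):
            # Categorical feature: count the occurrences of each category
            g = sns.countplot(y=column, data=df, ax=ax)
            # Truncate long category labels to 18 characters
            substrings = [s.get_text()[:18] for s in g.get_yticklabels()]
            g.set(yticklabels=substrings)
            plt.xticks(rotation=25)
        else:
            # Numeric feature: histogram with a kernel density estimate
            # (sns.distplot is deprecated in recent seaborn releases)
            g = sns.histplot(df[column], kde=True, ax=ax)
            plt.xticks(rotation=25)

plot_distribution(df, cols=3, width=20, height=20, hspace=0.45, wspace=0.5)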

The distribution plots of the product categories suggest that we have some outliers in the data. Let's have a closer look before we decide what to do:

In [None]:
# Let's remove the categorical columns (the first two: Channel and Region):
df2 = df.iloc[:, 2:]

# Let's draw a box plot for each remaining feature
# (renamed so that it no longer overwrites plot_distribution defined above)
def plot_boxplots(data, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
    sns.set_style('whitegrid')
    fig = plt.figure(figsize=(width, height))
    fig.subplots_adjust(wspace=wspace, hspace=hspace)
    rows = math.ceil(data.shape[1] / cols)
    for i, column in enumerate(data.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        # Pass the values as x explicitly to get a horizontal box plot
        sns.boxplot(x=data[column], ax=ax)
        plt.xticks(rotation=25)

plot_boxplots(df2, cols=3, width=20, height=10, hspace=0.45, wspace=0.5)

**Outliers** should be detected but not necessarily removed; it depends on the situation. Here I will assume that the wholesale distributor provided us with correct data, so I will keep them as they are.
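To put a number on what the box plots show, one common convention is to count the values lying more than 1.5 times the interquartile range beyond the quartiles (the same rule box plot whiskers use by default). The sketch below is only an illustration: the helper `count_iqr_outliers` and the toy values are hypothetical and not taken from the wholesale dataset.

```python
import pandas as pd

def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Fresh": [10, 12, 11, 13, 12, 200],
    "Milk": [5, 6, 5, 7, 6, 6],
})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

On the real data, calling `count_iqr_outliers(df2)` would give a per-feature count that can inform the decision to keep or drop these rows.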

<a id="subsection-two"></a>
## Bivariate
*(This is not required for our clustering segmentation; let's just look at more relations in our dataset out of curiosity.)*

Let's use the Seaborn pairplot to have a first look at how our features interact:

In [None]:
sns.set(style="ticks")
g = sns.pairplot(df,corner=True,kind='reg')
g.fig.set_size_inches(15,15)

From the pairplot above, the correlation between "*detergents and paper products*" and "*grocery products*" seems to be pretty strong, meaning that customers who spend more on one of these product types tend to spend more on the other as well. Let's look at the Pearson correlation coefficient to confirm this:

In [None]:
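# Compute the correlation matrix on the numeric columns only
# ('Channel' and 'Region' now hold text labels)
corr = df.corr(numeric_only=True)
# Generate a mask for the upper triangle (np.bool was removed from NumPy, use bool)
mask = np.triu(np.ones_like(corr, dtype=bool))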
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0.5,
            square=True, linewidths=.5, cbar_kws={"shrink": .6},annot=True)

plt.title("Pearson correlation", fontsize =20)

In a classification or regression problem we would have investigated this r of 0.92 further, since two highly correlated features carry largely redundant information, but we'll skip that here and jump into the clustering.
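For reference, the Pearson coefficient reported in the heatmap is the covariance of two features divided by the product of their standard deviations. Here is a minimal sketch of that formula, assuming two plain numeric sequences; the function name `pearson_r` and the sample values are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical spending values, not taken from the dataset
grocery = [100, 200, 300, 400]
detergents = [12, 19, 33, 41]

r = pearson_r(grocery, detergents)
# Cross-check against NumPy's built-in correlation matrix
r_numpy = np.corrcoef(grocery, detergents)[0, 1]
print(r, r_numpy)
```

A value close to 1 means the two features rise together almost linearly, which is what the heatmap shows for grocery and detergents/paper spending.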

<a id="section-three"></a>
# Clustering
<a id="K-Means Clustering"></a>
## K-Means Clustering
Our dataset is small, so either K-Means or hierarchical clustering would be practical here; we will use a K-Means clustering model.
### The Elbow Method
Let's find the optimal number of clusters by using the Elbow method:

In [None]:
# First we need to convert our categorical features (Region and Channel) to dummy variables.
# dtype=int gives 0/1 columns so that the resulting array is purely numeric.
df2 = pd.get_dummies(df, dtype=int)
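As a quick illustration of what the dummy encoding does, here is a small sketch on a toy frame (the values are hypothetical): each category of a text column becomes its own 0/1 column, while numeric columns are left untouched.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (illustrative values)
toy = pd.DataFrame({
    "Channel": ["Horeca", "Retail", "Horeca"],
    "Fresh": [10, 20, 30],
})

# One indicator column is created per category of 'Channel'
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```

The numeric column comes first, followed by `Channel_Horeca` and `Channel_Retail`.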

In [None]:
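X = df2.values

sns.set()
from sklearn.cluster import KMeans
# Within-cluster sum of squares (WCSS) for each candidate number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', 
                    max_iter = 300,
                    n_init=10,
                    random_state = 0)  # fixed seed so the curve is reproducible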
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss)
plt.xticks(ticks=range(1, 11))
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

The idea here is to select the number of clusters beyond which adding another cluster no longer reduces the WCSS by much. Let's go for 6.
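WCSS stands for within-cluster sum of squares: the sum of squared distances between each point and the centroid of the cluster it is assigned to. scikit-learn exposes it as `inertia_`. The sketch below recomputes it by hand on synthetic data (the `make_blobs` sample and the variable names are assumptions for illustration, not the wholesale data).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for X: three well-separated blobs
X_demo, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X_demo)

# Look up the centroid assigned to every point, then sum the squared distances
assigned_centroids = km.cluster_centers_[km.labels_]
wcss_manual = float(((X_demo - assigned_centroids) ** 2).sum())

print(wcss_manual, km.inertia_)
```

The two printed values should agree up to floating-point error, which is why plotting `inertia_` against the number of clusters gives the elbow curve.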

### Training the K-Means model on the dataset

In [None]:
kmeans = KMeans(n_clusters = 6,
                init = 'k-means++',
                max_iter = 300,
                n_init=10,
                random_state = 0)
y_kmeans = kmeans.fit_predict(X)

### Adding the cluster numbers to the dataset

In [None]:
# Work on a copy so that the original df is not modified
df_cluster = df.copy()
df_cluster['Cluster'] = y_kmeans
df_cluster.head()

In [None]:
df_cluster.Cluster.value_counts()
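Beyond the cluster sizes, one simple way to see what distinguishes the segments is to compare the average spending per cluster. This is only a sketch on a toy frame with made-up values and two hypothetical clusters; on the real data the same `groupby` could be applied to the numeric columns of `df_cluster`.

```python
import pandas as pd

# Toy frame (illustrative values) with an already assigned cluster label
toy_clusters = pd.DataFrame({
    "Fresh":   [100, 120, 900, 950, 110],
    "Grocery": [400, 420, 50, 60, 390],
    "Cluster": [0, 0, 1, 1, 0],
})

# Number of customers per cluster
sizes = toy_clusters["Cluster"].value_counts().sort_index()
# Mean spending per product category within each cluster
profile = toy_clusters.groupby("Cluster").mean()

print(sizes)
print(profile)
```

In this toy example, cluster 0 is grocery-heavy while cluster 1 is fresh-heavy, which is the kind of contrast to look for in the real segments.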

<a id="Cluster vizualisation with Principal Component Analysis - PCA"></a>
## Cluster visualization with Principal Component Analysis - PCA
We cannot visualize our clusters that easily because our dataset is multidimensional. So we'll use Principal Component Analysis to reduce our dataset to a two-dimensional one, then add our identified clusters to visualize them.
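To make the idea concrete, here is a minimal NumPy sketch of what a two-component PCA does: standardize the features, find the directions of largest variance, and project the data onto the first two. The synthetic data and the variable names (`data`, `z`, `projected`, `explained`) are assumptions for illustration; they are not part of the notebook's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 4-feature data: the second column is nearly a multiple of the first
base = rng.normal(size=(200, 2))
data = np.column_stack([
    base[:, 0],
    2 * base[:, 0] + 0.1 * rng.normal(size=200),
    base[:, 1],
    rng.normal(size=200),
])

# Standardize each feature to zero mean and unit variance
z = (data - data.mean(axis=0)) / data.std(axis=0)

# The rows of vt are the principal directions, ordered by decreasing variance
u, s, vt = np.linalg.svd(z, full_matrices=False)

projected = z @ vt[:2].T               # the two PCA coordinates per sample
explained = (s ** 2) / (s ** 2).sum()  # share of variance per component

print(projected.shape, explained[:2].sum())
```

The sum of the first two entries of `explained` indicates how much of the total variance the 2D picture retains, which is worth keeping in mind when reading the scatter plot.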

### Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)

### Applying PCA

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
# Fit PCA on the scaled feature matrix X rather than on the unscaled df2
pc = pca.fit_transform(X)
pc_df = pd.DataFrame(pc)
pc_df.columns = ['pc1','pc2']

In [None]:
pca_clustering = pd.concat([pc_df,df_cluster['Cluster']],axis=1)

### Visualizing our clusters on the PCA axes

In [None]:
plt.figure(figsize=(7,7))
sns.scatterplot(x='pc1', y='pc2', hue= 'Cluster', data=pca_clustering,palette='Set1').set_title('K-Means Clustering')
plt.show()

<a id="section-four"></a>
# Conclusion

The K-Means clustering model allowed us to segment the customers into 6 distinct groups. We were able to visualize these clusters after performing a dimensionality reduction with Principal Component Analysis.