## MALL Customer Segmentation

The goal is to identify the types of customers visiting the mall and to answer a simple question: who are the "good mall spenders", males or females?
Segmenting customers by income and spending helps to shape strategies for the products and services that cater to each group's needs.

We start by importing the dependencies:
1. NumPy and pandas for basic calculations and data tables.
2. Matplotlib and its supporting modules, along with seaborn, for visualisation.
3. scikit-learn for K-means clustering, standardization of the data and the silhouette metrics.
4. os to walk the input directory and locate the data file.

In [None]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib import style
from matplotlib.ticker import MaxNLocator
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
from mpl_toolkits.mplot3d import Axes3D

### Data Import

In [None]:
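# Walk the Kaggle input directory and print every file found.
# Note: `path` ends up holding the last file visited, which is the one read below.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        path=os.path.join(dirname, filename)
        print(path)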

In [None]:
raw_data=pd.read_csv(path)
raw_data

We read the data from the CSV file and display it.
At first glance we notice three continuous variables (Age, Annual Income and Spending Score) and one categorical variable (Gender), plus a CustomerID identifier.
Based on income, spending and gender we can create our clusters or groupings.

### Understanding the DATA and its distribution

In [None]:
raw_data.drop(["CustomerID"], axis=1, inplace=True)

In [None]:
raw_data.columns

In [None]:
raw_data.info()

From the output above we can see the data type of each column. Since all four columns report the same non-null count, we can assume that there are no null values.

In [None]:
raw_data.describe()

Comparing the min and max with the 25% and 75% quartiles gives a sense of how the data is distributed, although the presence of outliers cannot be ruled out yet.
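
One common way to follow up on possible outliers is the 1.5 × IQR rule built on the same quartiles that describe() reports. The sketch below is only an illustration: the helper name iqr_outliers and the income values are hypothetical and are not taken from the dataset.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])  # the 25% and 75% quartiles
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical incomes in k$ (illustrative only, not from the mall data)
incomes = [15, 40, 45, 50, 55, 60, 62, 70, 78, 137]
outliers = iqr_outliers(incomes)
print(outliers)
```

On the notebook's data the same function could be applied to a column such as raw_data["Annual Income (k$)"].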

### NULL value check

In [None]:
raw_data.isnull().sum()

This confirms that there are no null values.

### Gender and its demographics

Grouping the data by gender to identify the spending habits of each gender and their respective incomes.

In [None]:
gender_group_mean=raw_data.groupby(["Gender"]).mean().reset_index()
gender_group_mean

In [None]:
plt.figure(figsize=(6,6))
plt.bar(gender_group_mean["Gender"], gender_group_mean["Annual Income (k$)"], color=["orange","indigo"])

Males on average earn slightly more than females.

In [None]:
plt.figure(figsize=(8,6))
plt.bar(gender_group_mean["Gender"], gender_group_mean["Spending Score (1-100)"], color=["b","g"])

The average spending score of females is higher than that of males.

From this we can draw a few tentative conclusions about the demographics:
1. Females earn slightly less but spend more on average, so their savings are likely lower than those of their male counterparts.
2. Assuming males are at least as numerous as females in the general population, their lower count and lower spending in this data suggest that many of them are non-users who could be converted into users in the future.
3. More male-oriented products, themes or attractions could be placed in the mall to draw male crowds.
4. As a general observation (not something this dataset can show), females tend to spend more on clothes, jewellery and kids' stores, while males spend more on games, consumer electronics, food and movies, although there is often no clear demarcation between the two.


In [None]:
plt.figure(figsize=(6,6))
sns.boxenplot(x=raw_data["Gender"], y=raw_data["Annual Income (k$)"], palette ="rainbow") #recent seaborn versions require x= and y= as keywords
plt.grid(color="silver")

From the plot above we can see that women's income is more widely distributed, with a large concentration of women earning between $40k and $80k, while men's income is concentrated between $45k and $79k. The narrowing of the plot towards its ends indicates the presence of outliers. Women generally have similar pay scales, whereas men have multiple pay scales.

In [None]:
plt.figure(figsize=(14,6))

plt.subplot(1,2,1) #subplot rows, columns and index
sns.boxenplot(x=raw_data["Gender"], y=raw_data["Spending Score (1-100)"], palette ="rocket")
plt.grid(color="silver")

plt.subplot(1,2,2)
sns.boxplot(x=raw_data["Gender"], y=raw_data["Spending Score (1-100)"], palette ="seismic")
plt.grid(color="silver")

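The plots above compare the distribution of spending scores for males and females. Note that the spending score is a unitless value between 1 and 100, not a dollar amount, so the ranges should be read from the plots in score points. Consistent with the group means computed earlier, male spending scores are somewhat lower overall than female ones.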

In [None]:
plt.figure(figsize=(20,8))

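plt.subplot(1,2,1)
#histplot with a KDE overlay replaces the deprecated distplot
ax=sns.histplot(raw_data["Spending Score (1-100)"], kde=True, stat="density", color="mediumspringgreen")

plt.subplot(1,2,2)
ax=sns.histplot(raw_data["Annual Income (k$)"], kde=True, stat="density", color="blue")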


The spending score histogram is well balanced, with its peak in the centre (skewness close to 0), which suggests the absence of noticeable outliers. The annual income histogram is very slightly skewed to the right, which indicates the presence of a few mild outliers.
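
The visual impression of skewness can be backed by a number. The sketch below computes the population form of the Fisher-Pearson skewness coefficient with NumPy; the helper name skewness and the sample values are hypothetical. A value near 0 indicates a symmetric distribution and a positive value a longer right tail.

```python
import numpy as np

def skewness(values):
    """Fisher-Pearson skewness: third central moment / (second central moment)^1.5."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # population variance
    m3 = np.mean(centered ** 3)  # third central moment
    return m3 / m2 ** 1.5

# Illustrative samples (not from the mall data)
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric))
print(skewness(right_skewed))
```

pandas also offers Series.skew(), which applies a bias correction, so its result differs slightly from this population form on small samples.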

In [None]:
raw_data.groupby(["Gender"]).count().reset_index()

In the data we have more female customers than male customers.


### Converting the Categorical Variable to Numeric

This is required because we are going to pass the whole dataset to StandardScaler's fit_transform, which only accepts numeric input.

In [None]:
raw_data["Gender"]=raw_data["Gender"].replace(["Male", "Female"], [0,1])
raw_data

### Standard Scaling of data.

#### Using Sklearn's StandardScaler()

The reason we standardize the data is that KMeans() creates clusters based on distances (Euclidean distances). All variables therefore need to be on the same scale so that each contributes equally to the distance.

Without scaling, a variable recorded in large numbers, such as a weight of 500 g, would appear more dominant to the algorithm than one recorded as 5 kg, despite the fact that 5 kg is the larger quantity.

Thus we standardize the data by subtracting the mean of each variable and dividing by its standard deviation, so that every variable has zero mean and unit variance.
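
As a small illustration of the idea, the sketch below standardizes the same hypothetical weights expressed once in grams and once in kilograms. The helper name standardize and the values are made up for the example; the population standard deviation (NumPy's default, ddof=0) is used, which is also what StandardScaler uses.

```python
import numpy as np

def standardize(x):
    """Subtract the mean and divide by the (population) standard deviation."""
    return (x - x.mean()) / x.std()

# The same hypothetical weights in two different units
weights_g = np.array([500.0, 750.0, 1000.0, 1250.0])
weights_kg = weights_g / 1000.0

z_g = standardize(weights_g)
z_kg = standardize(weights_kg)

print(z_g)
print(z_kg)
```

After standardization the two versions are identical, so the choice of unit no longer influences the distances that K-means computes.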

In [None]:
std_slr=StandardScaler()
raw_data_std=std_slr.fit_transform(raw_data)
raw_data_std[0:5]

In [None]:
raw_data_std=pd.DataFrame(raw_data_std, columns=raw_data.columns)
raw_data_std

Taking the standardized values of the two main variables (annual income and spending score) for clustering.

In [None]:
data=raw_data_std.iloc[:,2:].values
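
Positional slicing with iloc depends on the column order staying as it is. As a hedged alternative, the two features could be selected by name; the sketch below uses a tiny made-up frame with the same column names to show that both selections agree when the order is as expected.

```python
import pandas as pd

# Toy frame with the notebook's column names (values are illustrative only)
frame = pd.DataFrame({
    "Gender": [0, 1, 1],
    "Age": [19, 21, 20],
    "Annual Income (k$)": [15, 15, 16],
    "Spending Score (1-100)": [39, 81, 6],
})

features = ["Annual Income (k$)", "Spending Score (1-100)"]
by_name = frame[features].values       # explicit selection by column name
by_position = frame.iloc[:, 2:].values # positional selection, as in the cell above

print(by_name.shape)
```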

### Finding the optimum Clustering

#### Using KMeans via the Elbow Method and Silhouette Scores

K-means is used to form the clusters and to find their centroids.

Basically we run KMeans() from sklearn with most of its parameters at their defaults, except for the number of clusters, which we vary in a for loop to find the optimum number of clusters and its silhouette score.

#### KMeans

The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares.

This algorithm requires the number of clusters to be specified. Inertia makes the assumption that clusters are convex and isotropic, which is not always the case. It responds poorly to elongated clusters, or manifolds with irregular shapes. Because inertia is built on Euclidean distances, it is also sensitive to the scale of each variable, which is another reason standardization is required.

https://scikit-learn.org/stable/modules/clustering.html#k-means

Parameters: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
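
To make the inertia criterion concrete, here is a minimal NumPy sketch of a plain Lloyd-style iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is an illustration only, not scikit-learn's implementation; the function name kmeans_lloyd, the random initialisation and the toy points are assumptions made for the example.

```python
import numpy as np

def kmeans_lloyd(points, k, n_iter=100, seed=0):
    """Plain Lloyd iteration; returns labels, centroids and inertia."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance of every point to every centroid, shape (n, k)
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if the cluster is empty)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # Inertia: sum of squared distances of each point to its own centroid
    inertia = d2[np.arange(len(points)), labels].sum()
    return labels, centroids, inertia

# Two small, well separated toy groups
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids, inertia = kmeans_lloyd(points, k=2)
print(labels, inertia)
```

The value returned as inertia corresponds to what KMeans exposes as inertia_, the quantity plotted as WSS in the elbow method.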


#### Silhouette Scores

The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). 

The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.

Parameters: Mostly at defaults, with the labels taken from KMeans.labels_.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html
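
The formula (b - a) / max(a, b) can be worked through by hand on a tiny example. The sketch below is a straightforward NumPy version written for illustration; the function name silhouette_values and the four one-dimensional points are hypothetical.

```python
import numpy as np

def silhouette_values(points, labels):
    """Per-sample silhouette coefficient (b - a) / max(a, b)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples
    dist = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2))
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself
        if not same.any():
            continue  # a cluster with a single sample scores 0
        a = dist[i, same].mean()  # mean intra-cluster distance
        b = min(dist[i, labels == other].mean()  # mean distance to the nearest other cluster
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Two tight groups far apart on a line
points = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
s = silhouette_values(points, labels)
print(s, s.mean())
```

For the first point, a = 1 and b = (10 + 11) / 2 = 10.5, which gives (10.5 - 1) / 10.5, a value close to 1 as expected for well separated clusters.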

In [None]:
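wss=[] #within-cluster sum of squares (inertia) for each k
sil=[] #mean silhouette score for each k
for k in range(2,20):
    kmeans=KMeans(n_clusters=k, random_state=1).fit(data)
    wss.append(kmeans.inertia_)
    labels=kmeans.labels_
    #score the labels on the same two features that were used to fit the model
    sil.append(silhouette_score(data, labels, metric = 'euclidean'))
print(wss)
print(sil)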

In [None]:
k=range(2,20)
fig,ax=plt.subplots(figsize=(20,8))
ax.set_facecolor('white')
ax.plot(k,wss, color="green")
ax.xaxis.set_major_locator(MaxNLocator(nbins=20,integer=True)) #forces the scales to be in integer, nbins as scales size.
ax.set_xlabel("No of clusters",color="black", fontsize=20)
ax.set_ylabel("WSS (within-cluster sum of squares)", color="green", fontsize=20)
ax2=ax.twinx() #creates the second axis on the first plot (ax)
ax2.plot(k,sil, color="blue")
ax2.set_ylabel("Silhouette scores", color="Blue", fontsize=20)
ax.grid(True, color="silver")
plt.show()

As the number of clusters increases, the **WSS** (within-cluster sum of squares) decreases *sharply* up to 5, after which the curve begins to flatten out.
This is the elbow method: after an initial steep descent, the rate of decrease of WSS slows, suggesting that adding further clusters brings **no significant improvement to the model**.

The silhouette score, on the other hand, measures how well separated the clusters are, with higher values being better. It peaks at 5, exactly where the steep descent of the WSS curve begins to flatten out, suggesting that *"5"* is the optimal choice for this dataset.

**Thus the chosen number of clusters = 5**
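
Reading the peak off the plot can also be done programmatically. The sketch below repeats the loop on synthetic data with five well separated groups and picks the k with the highest silhouette score. The blob centres, the range of k and n_init=10 are assumptions made for the example and are not taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: five well separated blobs (illustrative, not the mall data)
centers = [(-10, -10), (-10, 10), (0, 0), (10, -10), (10, 10)]
X, _ = make_blobs(n_samples=250, centers=centers, cluster_std=1.0, random_state=0)

ks = list(range(2, 9))
wss, sil = [], []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    wss.append(model.inertia_)                       # within-cluster sum of squares
    sil.append(silhouette_score(X, model.labels_))   # mean silhouette for this k

best_k = ks[int(np.argmax(sil))]  # k with the highest mean silhouette score
print(best_k)
```

On real data the silhouette peak and the elbow do not always agree this cleanly, so it is worth checking both before fixing the number of clusters.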

In [None]:
n=5
kmeans=KMeans(n_clusters=n, random_state=1).fit(data)
clusters=kmeans.labels_
centroids=kmeans.cluster_centers_
clusters

In [None]:
raw_data_std2=pd.DataFrame(pd.concat([raw_data_std, pd.Series(kmeans.labels_)], axis=1).rename(columns={0:"Clusters"})).copy()
raw_data2=pd.DataFrame(pd.concat([raw_data, pd.Series(kmeans.labels_)], axis=1).rename(columns={0:"Clusters"})).copy()
raw_data2

In [None]:
raw_data2_copy=raw_data2.copy()
raw_data2_copy.sort_values(["Clusters"], inplace=True)
raw_data_std2.sort_values(["Clusters"], inplace=True)
for i in range(n): #cluster labels run from 0 to n-1; map them to the letters A, B, C, ...
    raw_data2_copy["Clusters"]=raw_data2_copy["Clusters"].replace(i, chr(i+65))
    raw_data_std2["Clusters"]=raw_data_std2["Clusters"].replace(i, chr(i+65))
raw_data2_copy["Clusters"].unique()

In [None]:
raw_data2_copy

In [None]:
fig=plt.figure(figsize=(14,8))
ax=fig.add_subplot(projection="3d") #creates 3D axes, part of matplotlib
x=np.array(raw_data["Age"])
y=np.array(raw_data["Annual Income (k$)"])
z=np.array(raw_data["Spending Score (1-100)"])
ax.scatter(x, y, z, c=clusters) #colour each point by its cluster label
plt.title('SPENDING SCORE VS ANNUAL INCOME VS AGE')
ax.set_xlabel('Age')
ax.set_ylabel('ANNUAL Income')
ax.set_zlabel('Spending Score')

A 3D plot showing the clusters against age, annual income and spending score.

In [None]:
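x=raw_data.iloc[:,2:].values #unscaled income and spending score, used for plotting
y_means=kmeans.fit_predict(data) #refits on the scaled features; the fixed random_state reproduces the labels above
print(y_means)
print("X shape:",x.shape)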

## Cluster plot and its interpretation

Code snippet: x[y_means==0,0] selects the rows of x that were assigned to cluster 0 (y_means==0), and the second "0" is the column index of the array x. x has shape (n, 2), so 0 and 1 are the valid column indices. They are used to put Annual Income (column 0) on the X axis and Spending Score (column 1) on the Y axis.
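
The indexing described above can be tried on a tiny array. In this sketch the values of x and y_means are made up for illustration; only the indexing pattern mirrors the notebook.

```python
import numpy as np

# Column 0: annual income, column 1: spending score (illustrative values)
x = np.array([[15, 39],
              [16, 81],
              [78, 17],
              [77, 88]])
y_means = np.array([0, 1, 0, 1])  # cluster label of each row

mask = y_means == 0               # boolean mask, True for rows in cluster 0
income_cluster0 = x[mask, 0]      # column 0 of the selected rows
spending_cluster0 = x[mask, 1]    # column 1 of the selected rows

print(income_cluster0, spending_cluster0)
```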

In [None]:
plt.figure(figsize=(14,8))

plt.scatter(x[y_means==0,0], x[y_means==0,1], color="cyan", label="Normal") #pos argument 0 for X and 1 for Y
plt.scatter(x[y_means==1,0], x[y_means==1,1], color="indigo", label="High Spenders") #y_means==1 for cluster center 1
plt.scatter(x[y_means==2,0], x[y_means==2,1], color="red", label="Value buyers")
plt.scatter(x[y_means==3,0], x[y_means==3,1], color="green", label="Savers")
plt.scatter(x[y_means==4,0], x[y_means==4,1], color="blue", label="Impulse buyers")

plt.xlabel("Annual Income (Thousand $)")
plt.ylabel("Spending Score (1-100)")
plt.legend(loc="right")
plt.grid(True, color="silver")
plt.title("Customers Demographics")
plt.show()

### Interpretation

***Above and below*** are the two cluster plots based on income and spending habits.

Here we can see *5* distinct groups with minimal overlap, which suggests our clustering model did a pretty good job.


**Explanations and Suggestions**

Of the 5 clusters we observe the following:
1. The **Red** group, lying in the bottom left of the plot, has both the lowest income and the lowest spending capability. They are rational in their spending habits and will therefore choose **value** products that offer the maximum benefit for the lowest price. A **mass** marketing strategy works for this type of consumer segment, and **offers and discounts** are quite attractive to this group. All-in-one department stores such as Walmart or Big Bazaar (India) are their preferred retailers.

2. The **Blue** group are mostly **impulse buyers** who don't believe in saving. A good array of displays, together with relevant digital marketing techniques such as location-based advertising of offers, can be used to attract this group. Display plays a key role in selling to these customers. They can be seen flocking around exclusive branded stores as well as all-in-one department stores.

3. The **Cyan** group behaves like an ideal group that likes to keep earning and spending in balance. They are generally value buyers who like products that are low in cost but give a premium feel. Such customers will opt for EMI-based (instalment) purchases of high-value consumer goods, or like to exchange an old item for a new one. They can be seen flocking around exclusive branded stores as well as all-in-one department stores.

4. The **Green** group are savers who don't believe in unnecessary expenditure. They generally buy a movie ticket without the popcorn, and prefer to buy most things from all-in-one department stores.

5. The **Indigo** group are high earners and high spenders, and thus a lucrative target for all types of goods. They are thrill and luxury seekers who don't grudge spending the extra buck for a good experience. Exclusive branded outlets are the preferred choice here.



NB: Group colours refer to the graph above only.

In [None]:
plt.figure(figsize=(14,8))
sns.scatterplot(x="Annual Income (k$)", y="Spending Score (1-100)", hue="Clusters", data=raw_data_std2) #recent seaborn versions require x= and y= as keywords
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:, 1], s = 100,c="Blue", label = 'centroid')
plt.legend()
plt.grid()

The centroid of each group (the centre of mass of its points, i.e. their mean position) is highlighted in blue.
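
Since a centroid is simply the mean of the points in its cluster, it can be computed directly, and a centroid found in standardized space can be mapped back to the original units by reversing the scaling. The sketch below uses made-up points and labels to illustrate both steps.

```python
import numpy as np

# Illustrative (income, spending score) points and their cluster labels
points = np.array([[20.0, 80.0], [24.0, 76.0], [80.0, 20.0], [84.0, 16.0]])
labels = np.array([0, 0, 1, 1])

# Standardize each column: subtract the mean, divide by the standard deviation
mean, std = points.mean(axis=0), points.std(axis=0)
points_std = (points - mean) / std

# Centroid of each cluster = mean of its points (here in standardized space)
centroids_std = np.array([points_std[labels == j].mean(axis=0) for j in range(2)])

# Reverse the scaling to express the centroids in the original units
centroids_raw = centroids_std * std + mean
print(centroids_raw)
```

With a fitted StandardScaler, the same reversal could be done using the mean_ and scale_ values of the two columns involved, which would allow the centroids to be drawn on a plot of the unscaled data.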

### Compute the Silhouette Coefficient for each sample.

In [None]:
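sample_silhouette_values=silhouette_samples(data, clusters) #same feature space that was used for clustering
#raw_data_std2 was sorted by cluster earlier, so align the values on the original row index
raw_data_std2["silhouette_values"]=pd.Series(sample_silhouette_values, index=raw_data_std.index)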

In [None]:
raw_data_std2

In [None]:
raw_data_std2.groupby(["Clusters"])["silhouette_values"].mean()

Silhouette scores for each sample range from **-1 ≤ score ≤ +1**. A high score means the sample is similar to the other samples in its own cluster and well separated from the samples in other clusters.

**+1** suggests ***strong similarity*** within the group and ***dissimilarity*** with other groups (good clustering)

**0** suggests the sample lies on the boundary between two groups (overlapping clusters)

**-1** suggests **strong dissimilarity** within the group and **similarity** with another group (poor clustering)
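
One practical detail when attaching per-sample values to a DataFrame whose rows have been re-ordered (for example with sort_values): a plain NumPy array is assigned by position, whereas a pandas Series is aligned on the index. The sketch below, with made-up values, shows the difference.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"Clusters": [1, 0, 1, 0]})
values = np.array([0.1, 0.2, 0.3, 0.4])  # one value per row, in the original row order

sorted_frame = frame.sort_values("Clusters", kind="mergesort")  # rows are re-ordered

positional = sorted_frame.copy()
positional["silhouette_values"] = values  # assigned by position: rows get the wrong values

aligned = sorted_frame.copy()
aligned["silhouette_values"] = pd.Series(values, index=frame.index)  # aligned on the index

print(positional.sort_index())
print(aligned.sort_index())
```

Only the aligned version keeps each value next to the row it was computed for, which matters before averaging the values per cluster.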

In [None]:
silhouette_score(data, clusters) #all data points, in the feature space used for clustering