# 1. Introduction: Business Goal & Problem Definition


The goal of this project is to identify, study and analyze a mall's customer clusters, so that the business can better understand its customer segments and adapt its marketing strategy to each of them, increasing revenue. For that we'll use the Mall Customer Segmentation dataset available on Kaggle, which contains 200 customers. Each customer has the following attributes:

* Gender
* Age
* Annual Income (k$)
* Spending Score (1-100)

# 2. Importing Basic Libraries

In [None]:
import io
import openpyxl
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 3. Data Collection

In [None]:
mall_ds = pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv', encoding='latin1', sep=",")

mall_ds

# 4. Data Preliminary Exploration

In [None]:
#Checking a dataset sample

pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)
pd.options.display.float_format="{:,.2f}".format
mall_ds.sample(n=10, random_state=0)

In [None]:
#Checking dataset info by feature

mall_ds.info(verbose=True, show_counts=True) #null_counts was deprecated and later removed from pandas; show_counts replaces it

In [None]:
#Checking the existence of zeros in rows

(mall_ds==0).sum(axis=0).to_excel("zeros_per_feature.xlsx")
(mall_ds==0).sum(axis=0)

In [None]:
#Checking the existence of duplicated rows

mall_ds.duplicated().sum()

In [None]:
#Checking basic statistical data by feature

mall_ds.describe(include="all")

# 5. Data Cleaning

We'll perform the following:

1. Create a calculated column (Spending Score / Annual Income) that could potentially be important to the model
2. Convert the categorical variable (Gender) to dummies

No further treatment is needed:

* No missing, zero or invalid values to treat
* No duplications found
* No outliers found
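The notebook does not show how the absence of outliers was established. One common way to check is the 1.5 x IQR rule, sketched below. This is an illustration rather than the check actually used here: the helper name `iqr_outlier_counts`, the multiplier `k=1.5` and the toy values are all assumptions.

```python
import pandas as pd

def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Toy data standing in for the real columns (values are made up)
toy = pd.DataFrame({
    "Age": [19, 21, 23, 31, 35, 40, 45, 52],
    "Annual Income (k$)": [15, 16, 17, 18, 19, 20, 21, 300],
})

outlier_counts = iqr_outlier_counts(toy)
print(outlier_counts)
```

On the real data the same helper could be called as `iqr_outlier_counts(mall_ds)`. A non-zero count flags values worth inspecting and does not by itself mean they should be removed.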

In [None]:
#1

mall_ds["spending_score_to_annual_score_ratio"] = mall_ds["Spending Score (1-100)"] / mall_ds["Annual Income (k$)"] #feature engineering

#2

mall_ds = pd.concat([mall_ds, pd.get_dummies(mall_ds["Gender"])], axis=1) #gender dummy coding

mall_ds.to_excel("mall_ds_clean.xlsx")

# 6. Data Exploration

In [None]:
#Plotting Categorical Variables

fig, ax = plt.subplots(1, 2)
fig.suptitle("Gender Frequency", fontsize=15)
mall_ds["Gender"].value_counts().plot.bar(color="purple", ax=ax[0])
mall_ds["Gender"].value_counts().plot.pie(autopct='%1.1f%%',shadow=True,textprops={"fontsize": 10},ax=ax[1])
plt.xticks(rotation=90)
plt.yticks(rotation=45)


#Plotting Numerical Variables

fig, ax = plt.subplots(1, 3)
fig.suptitle("Age Distribution", fontsize=15)
sns.histplot(mall_ds["Age"], kde=True, stat="density", ax=ax[0]) #histplot replaces the deprecated distplot
sns.boxplot(x=mall_ds["Age"], ax=ax[1])
sns.violinplot(x=mall_ds["Age"], ax=ax[2])

fig, ax = plt.subplots(1, 3)
fig.suptitle("Annual Income (k$) Distribution", fontsize=15)
sns.histplot(mall_ds["Annual Income (k$)"], kde=True, stat="density", ax=ax[0]) #histplot replaces the deprecated distplot
sns.boxplot(x=mall_ds["Annual Income (k$)"], ax=ax[1])
sns.violinplot(x=mall_ds["Annual Income (k$)"], ax=ax[2])

fig, ax = plt.subplots(1, 3)
fig.suptitle("Spending Score (1-100) Distribution", fontsize=15)
sns.histplot(mall_ds["Spending Score (1-100)"], kde=True, stat="density", ax=ax[0]) #histplot replaces the deprecated distplot
sns.boxplot(x=mall_ds["Spending Score (1-100)"], ax=ax[1])
sns.violinplot(x=mall_ds["Spending Score (1-100)"], ax=ax[2])

# 7. Correlations Analysis & Features Selection

In [None]:
#Dropping the irrelevant ID column and the original categorical column (already dummy coded)

mall_ds2 = mall_ds.drop(["CustomerID", "Gender"], axis=1)

#Plotting a Heatmap

fig, ax = plt.subplots(1, figsize=(25,25))
sns.heatmap(mall_ds2.corr(), annot=True, fmt=",.2f")
plt.title("Heatmap Correlation", fontsize=20)
plt.tick_params(labelsize=12)
plt.xticks(rotation=90)
plt.yticks(rotation=45)

#Plotting a Pairplot

sns.pairplot(mall_ds2)

# 8. Data Modelling

In [None]:
#Defining Xs

X_orig = mall_ds
X = mall_ds2

#Scaling all features

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_scaled = sc_X.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled)
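As a quick illustration of what the scaling step does, the sketch below compares `StandardScaler` with the manual formula (x - mean) / std on a small made-up array. The array values are assumptions chosen only for the demonstration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: two features on very different scales
toy = np.array([[15.0, 39.0], [16.0, 81.0], [17.0, 6.0], [18.0, 77.0]])

scaled = StandardScaler().fit_transform(toy)

# StandardScaler uses the population standard deviation (ddof=0), like np.std
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every feature has mean 0 and standard deviation 1, so no single feature dominates the Euclidean distances that K-means and DBSCAN rely on.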

# 9. Machine Learning Algorithms Implementation & Assessment

# 9.1.1 K-means

In [None]:
#Creating a K-means model and checking its Metrics

from sklearn.cluster import KMeans

#Applying the Elbow Method: computing the distortion (inertia) for a range of cluster counts

distortions = []
for i in range(1, 21):
    km = KMeans(n_clusters=i, init="random", n_init=10, max_iter=300, tol=1e-04, random_state=0)
    km.fit(X_scaled)
    distortions.append(km.inertia_)

#Plotting

plt.plot(range(1, 21), distortions, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Distortion")
plt.show()

#Applying the Silhouette Method to assess how consistent the clusters are for each cluster count

from sklearn.metrics import silhouette_score
silhouette_coefficients = []
for j in range(2, 21):
    km = KMeans(n_clusters=j, init="random", n_init=10, max_iter=300, tol=1e-04, random_state=0)
    km.fit(X_scaled)
    score = silhouette_score(X_scaled, km.labels_)
    silhouette_coefficients.append(score)

#Plotting

plt.style.use("fivethirtyeight")
plt.plot(range(2, 21), silhouette_coefficients)
plt.xticks(range(2, 21))
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()

#Choosing number of clusters

n_clusters = 3
print('Estimated number of clusters: %d' % n_clusters)
km = KMeans(n_clusters=n_clusters, random_state=0) #fixed seed so the labels are reproducible
km.fit(X_scaled)
print("Silhouette Coefficient: %0.3f" % silhouette_score(X_scaled, km.labels_)) #reusing the fitted labels instead of refitting

#Plotting chosen number of clusters

from yellowbrick.cluster import silhouette_visualizer
silhouette_visualizer(KMeans(n_clusters=n_clusters, random_state=0), X_scaled)

#Visualizing clusters in the dataset
X_orig = pd.DataFrame(X_orig)
X_orig["cluster"] = km.labels_
X_orig.to_excel("model_km.xlsx")
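In the cell above the number of clusters is read off the elbow and silhouette plots by eye. As a complementary sketch, the cluster count with the highest silhouette score can also be picked programmatically. The example below uses synthetic data from `make_blobs` (an assumption, so that it runs without the CSV). The names `X_demo`, `scores` and `best_k` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_scaled: three well-separated groups
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 10], [10, -10]],
    cluster_std=0.5,
    random_state=0,
)

# Silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Candidate with the highest average silhouette
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The highest silhouette is only one criterion. On real data it is usually weighed against the elbow plot and against how interpretable the resulting segments are for the business.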

# 9.1.2 Clusters exploration

In [None]:
print("Cluster 0")
X_orig.query("cluster == 0").describe(include="all")

In [None]:
print("Cluster 1")
X_orig.query("cluster == 1").describe(include="all")

In [None]:
print("Cluster 2")
X_orig.query("cluster == 2").describe(include="all")
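The three per-cluster summaries above can also be gathered into a single table with `groupby`, which makes the clusters easier to compare side by side. This is a minimal sketch on a made-up frame: the values and the names `toy`, `profile` and `sizes` are assumptions, and only the column names mirror the notebook.

```python
import pandas as pd

# Made-up rows that only mimic the notebook's column names
toy = pd.DataFrame({
    "Age": [40, 44, 38, 42, 22, 26],
    "Annual Income (k$)": [80, 90, 60, 70, 25, 35],
    "Spending Score (1-100)": [20, 10, 30, 20, 80, 90],
    "cluster": [0, 0, 1, 1, 2, 2],
})

# One row per cluster, with mean/min/max for every numeric feature
profile = toy.groupby("cluster").agg(["mean", "min", "max"])

# Number of customers per cluster
sizes = toy["cluster"].value_counts().sort_index()

print(profile)
print(sizes)
```

With the notebook's data, the equivalent call would be applied to the numeric columns of `X_orig` after the `cluster` column has been assigned.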

In [None]:
#Plotting Categorical Variables

fig, ax = plt.subplots(1, len(X_orig["cluster"].unique()))
fig.suptitle("Gender Frequency", fontsize=15)
X_orig.query("cluster == 0")["Gender"].value_counts().plot.pie(autopct='%1.1f%%',shadow=True,textprops={"fontsize": 10},ax=ax[0])
X_orig.query("cluster == 1")["Gender"].value_counts().plot.pie(autopct='%1.1f%%',shadow=True,textprops={"fontsize": 10},ax=ax[1])
X_orig.query("cluster == 2")["Gender"].value_counts().plot.pie(autopct='%1.1f%%',shadow=True,textprops={"fontsize": 10},ax=ax[2])


#Plotting Numerical Variables

fig, ax = plt.subplots(1, len(X_orig["cluster"].unique()))
fig.suptitle("Age Distribution", fontsize=15)
sns.histplot(X_orig.query("cluster == 0")["Age"], kde=True, stat="density", label="Cluster 0", ax=ax[0]) #histplot replaces the deprecated distplot
sns.histplot(X_orig.query("cluster == 1")["Age"], kde=True, stat="density", label="Cluster 1", ax=ax[1])
sns.histplot(X_orig.query("cluster == 2")["Age"], kde=True, stat="density", label="Cluster 2", ax=ax[2])

fig, ax = plt.subplots(1, len(X_orig["cluster"].unique()))
fig.suptitle("Annual Income (k$) Distribution", fontsize=15)
sns.histplot(X_orig.query("cluster == 0")["Annual Income (k$)"], kde=True, stat="density", label="Cluster 0", ax=ax[0]) #histplot replaces the deprecated distplot
sns.histplot(X_orig.query("cluster == 1")["Annual Income (k$)"], kde=True, stat="density", label="Cluster 1", ax=ax[1])
sns.histplot(X_orig.query("cluster == 2")["Annual Income (k$)"], kde=True, stat="density", label="Cluster 2", ax=ax[2])

fig, ax = plt.subplots(1, len(X_orig["cluster"].unique()))
fig.suptitle("Spending Score (1-100) Distribution", fontsize=15)
sns.histplot(X_orig.query("cluster == 0")["Spending Score (1-100)"], kde=True, stat="density", label="Cluster 0", ax=ax[0]) #histplot replaces the deprecated distplot
sns.histplot(X_orig.query("cluster == 1")["Spending Score (1-100)"], kde=True, stat="density", label="Cluster 1", ax=ax[1])
sns.histplot(X_orig.query("cluster == 2")["Spending Score (1-100)"], kde=True, stat="density", label="Cluster 2", ax=ax[2])

fig, ax = plt.subplots(1, len(X_orig["cluster"].unique()))
fig.suptitle("Spending score ratio Distribution", fontsize=15)
sns.histplot(X_orig.query("cluster == 0")["spending_score_to_annual_score_ratio"], kde=True, stat="density", label="Cluster 0", ax=ax[0]) #histplot replaces the deprecated distplot
sns.histplot(X_orig.query("cluster == 1")["spending_score_to_annual_score_ratio"], kde=True, stat="density", label="Cluster 1", ax=ax[1])
sns.histplot(X_orig.query("cluster == 2")["spending_score_to_annual_score_ratio"], kde=True, stat="density", label="Cluster 2", ax=ax[2])

In [None]:
# #Plotting scatter graph per pair features

# #Mapping every individual cluster to a color

# colors = ['goldenrod', 'olive', 'navy']

# vectorizer = np.vectorize(lambda x: colors[x % len(colors)])

# #Plotting

# for i in range(0, X_scaled.shape[1]):
#     for j in range(1, X_scaled.shape[1]):
#         plt.scatter(X_scaled.iloc[:,i], X_scaled.iloc[:,j])
#         plt.xlabel(X.columns[i])
#         plt.ylabel(X.columns[j])
#         plt.show()

# 9.2 DBSCAN

In [None]:
#Creating a DBSCAN model and checking its Metrics
#Note: we're exploring DBSCAN only as a study exercise in this project; we'll adopt K-Means

from sklearn.neighbors import NearestNeighbors

#We can calculate the distance from each point to its closest neighbour using NearestNeighbors.
#Because we query the same points the model was fitted on, each point is returned as its own first neighbour (distance 0), hence n_neighbors=2.
#The kneighbors method returns two arrays: the distances to the n_neighbors closest points and the indices of those points.

neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(X_scaled)
distances, indices = nbrs.kneighbors(X_scaled)

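#Sorting and plotting results

distances = np.sort(distances, axis=0)
distances = distances[:,1] #column 0 is the distance of each point to itself (0)
plt.plot(distances)
plt.xlabel("Points sorted by distance to the closest neighbour")
plt.ylabel("Distance to the closest neighbour (candidate eps)")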
plt.show()

from sklearn.cluster import DBSCAN

#Selecting the best eps (the optimal value for epsilon will be found at the point of maximum curvature)

dbs = DBSCAN(eps=0.8)
dbs.fit(X_scaled)

#The labels_ attribute holds the cluster label assigned to each point (-1 marks noise)

clusters = dbs.labels_

from sklearn import metrics

#Number of clusters in labels, ignoring noise (outlier) (-1) if present

n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
n_noise_ = list(clusters).count(-1)
print('Estimated number of clusters: %d' % n_clusters)
print('Estimated number of noise points: %d' % n_noise_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X_scaled, clusters))

#Visualizing clusters in the dataset
X_orig = pd.DataFrame(X_orig)
X_orig["cluster"] = dbs.labels_
X_orig.to_excel("model_dbs.xlsx")
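In the cell above, `eps=0.8` is chosen by reading the point of maximum curvature off the sorted-distance plot. As a rough sketch of how that point could be located programmatically, the helper below picks the point of the normalised curve that lies farthest below the straight line joining its two ends. This is one simple heuristic among several and not the method used in the notebook. The helper name `knee_by_chord_distance` and the synthetic data are assumptions, and the helper expects an increasing curve whose first and last values differ.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

def knee_by_chord_distance(sorted_values):
    """Return (index, value) of the point farthest below the end-to-end chord."""
    y = np.asarray(sorted_values, dtype=float)
    x = np.arange(len(y), dtype=float)
    # Normalise both axes to [0, 1] so neither dominates the geometry
    x_n = (x - x[0]) / (x[-1] - x[0])
    y_n = (y - y[0]) / (y[-1] - y[0])
    # The chord of the normalised curve is y = x; a flat-then-steep curve lies below it
    gap = x_n - y_n
    idx = int(np.argmax(gap))
    return idx, float(y[idx])

# Small hand-made curve: flat, then a sharp rise
curve = [1, 1, 1, 1, 1, 1, 1, 1, 3, 10]
knee_idx, knee_val = knee_by_chord_distance(curve)
print(knee_idx, knee_val)

# Same idea on nearest-neighbour distances of synthetic data (stand-in for X_scaled)
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)
dist, _ = NearestNeighbors(n_neighbors=2).fit(X_demo).kneighbors(X_demo)
sorted_dist = np.sort(dist[:, 1])  # column 0 is each point's distance to itself
eps_idx, eps_candidate = knee_by_chord_distance(sorted_dist)
print(eps_idx, round(eps_candidate, 3))
```

The value returned is only a starting candidate for `eps`; it is still worth checking the resulting number of clusters and noise points, as the cell above does.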

# 10. Conclusions


In this exercise we went through the whole process: collecting data, exploring features and distributions, treating data, understanding correlations, selecting relevant features, modelling the data and presenting a clustering model. The model identifies groups of similar customers, described below, so that the Mall can better understand its customer segments and adapt its marketing strategy to each of them, bringing more revenue and market share to the business.

First group of clients:
The first group is formed by men, 42 years old on average, with the highest annual income and the lowest spending score of all groups. This group has the greatest growth potential, so the Mall should adopt strategies tailored to this group's profile and explore its strong potential for growth in more sophisticated items for men.

Second group of clients:
The second group is formed by women, 40 years old on average, with a considerable annual income and also a low spending score ratio (meaning low spending relative to income). This is the group with the second-highest growth potential, so here too the Mall should invest in strategies to offer more sophisticated items for women.

Third group of clients:
The third group is formed by a mix of men and women who are much younger, 24 years old on average, and have a much lower annual income, but the highest spending score of all (4.5x that of group 1 and 4.25x that of group 2), meaning they represent the most significant part of our business today. This group has less room to grow, but retaining it is crucial for business continuity.