![](http://icdn.2cda.pl/g/157025_75710482662543650778.gif)

The purpose of customer segmentation is to divide customers into groups based on common characteristics so that each group can be marketed to effectively and appropriately.
All customers share a common need for a product or service, but they differ in behavior because of age, gender, socioeconomic status, lifestyle, and other factors.

# Table of Contents

1. Header Files

2. About Data Set

3. Data Preparation
    
    * 3.1 - Read the data
    * 3.2 - Variable Description
    * 3.3 - Data summary
    * 3.4 - Missing cells
    * 3.5 - Analysing Outliers
    * 3.6 - Histogram, Pairplot
    * 3.7 - Pie chart
    * 3.8 - Standard Deviation
    * 3.9 - Scatter plot
    
4. Data preprocessing

    * 4.1 - Scaling
    * 4.2 - Dendrogram
    * 4.3 - Elbow method
    * 4.4 - Silhouette method
    * 4.5 - PCA

5. Modelling

# 1. Header Files

In [None]:
import matplotlib.pyplot as plt
import missingno as msno
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch
import seaborn as sns
import warnings

from PIL import Image
from scipy.cluster.hierarchy import cophenet
from scipy.cluster.hierarchy import linkage,dendrogram,cut_tree
from scipy.spatial.distance import pdist
from sklearn.cluster import DBSCAN
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import MinMaxScaler
from yellowbrick.cluster import KElbowVisualizer

# 2. About Data Set

Customer ID - Unique identification of customer

Gender - Sex of the customer

Age - Age of customer

Annual Income - Annual income of the customer, in thousands of dollars

Spending Score - Score from 1 to 100 reflecting the customer's readiness to spend money

# 3. Data Preparation

#### 3.1 Read the data

In [None]:
data = pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')
data.head()

#### 3.2 Variable Description

In [None]:
data.info()

1. Customer ID - numerical - unique customer number, integer

2. Gender - categorical - binary (Male/Female)

3. Age - numerical - integer

4. Annual Income (k$) - numerical - integer

5. Spending Score (1-100) - numerical - integer

In [None]:
data = data.drop('CustomerID', axis=1)

In [None]:
data.shape

In [None]:
data.head()

#### 3.3 Data summary

In [None]:
data.describe()

#### 3.4 Missing cells

In [None]:
msno.bar(data, color='blue')
plt.show()

In [None]:
print('Data columns with null values:',data.isnull().sum(), sep = '\n')

Looking at the numbers and graphs, there are **no missing values!**

#### 3.5 Analysing Outliers

In [None]:
for x in data.select_dtypes(np.number).columns:
    sns.boxplot(x=data[x], color='blue')
    plt.show()
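
The boxplots above mark points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. As a rough sketch, the same rule can be applied numerically to count outliers per column. The helper name `iqr_outlier_counts` and the toy values are illustrative assumptions, not part of the original notebook or the mall data set.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series index with the columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Toy data (hypothetical values chosen so that one income is clearly extreme)
toy = pd.DataFrame({
    "Age": [20, 25, 30, 35, 40, 45, 50],
    "Annual Income (k$)": [15, 20, 25, 30, 35, 40, 137],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

In the notebook, the same helper could be called on `data` to put numbers next to each boxplot.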

#### 3.6 Histogram, Pairplot 

In [None]:
plt.rcParams['figure.figsize'] = (10, 8)
# Correlate numeric columns only; Gender is still a string column at this point
sns.heatmap(data.corr(numeric_only=True), annot=True, fmt='.1g', cmap='Blues')
plt.show()

sns.pairplot(data, aspect=1.5)
plt.show()

My analysis results:

1. Age shows only a weak linear correlation with Spending Score, so age alone is a poor linear predictor of spending.

2. Age and annual income have a slightly negative correlation, but it is too weak to say that age determines income.

3. Annual income and Spending Score have a slightly positive correlation, but it is negligible because the relationship between them is not linear!
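
A correlation coefficient close to zero only rules out a linear relationship; it does not show that two variables are unrelated. The small sketch below uses synthetic values (an assumption for illustration, not the mall data) in which `y` is completely determined by `x`, yet the Pearson correlation is essentially zero.

```python
import numpy as np

# Symmetric x values; y depends on x exactly, but not linearly
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

# Pearson correlation only measures the strength of a linear relationship
pearson_r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r between x and x**2: {pearson_r:.3f}")
```

This is why the scatter plots later in the notebook are worth reading alongside the heatmap.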

In [None]:
plt.rcParams['figure.figsize'] = (10, 8)
sns.pairplot(data, hue='Gender', aspect=1.5)

Looking at the graph, females and males are distributed evenly across all variables, so it is difficult to separate customers by gender alone.

#### 3.7 Pie chart

In [None]:
values = data['Gender'].value_counts()
# Take the labels from the counts themselves so they always match the wedges
labels = values.index

fig, ax = plt.subplots(figsize = (10, 8), dpi = 100)
explode = (0, 0.06)

patches, texts, autotexts = ax.pie(values, labels = labels, autopct = '%1.2f%%', shadow = True,
                                   startangle = 90, explode = explode)

plt.setp(texts, size = 19, color = 'black')
plt.setp(autotexts, size = 19, color = 'white')
plt.show()

In [None]:
data["Gender"].value_counts()

It can be seen that the share of **women is 12 percentage points higher** than that of men.
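
A gap between two shares can be expressed in percentage points or as a relative difference, and the two numbers are not the same. The sketch below assumes a 56/44 split purely for illustration; the series is built by hand rather than read from the CSV.

```python
import pandas as pd

# Hypothetical gender column (assumed 56/44 split, not loaded from the data set)
gender = pd.Series(["Female"] * 56 + ["Male"] * 44, name="Gender")

# Share of each category in percent
share = gender.value_counts(normalize=True) * 100

# Difference between the two shares, in percentage points
point_gap = share["Female"] - share["Male"]

# How much larger the female share is relative to the male share, in percent
relative_gap = (share["Female"] / share["Male"] - 1) * 100

print(f"Gap: {point_gap:.1f} percentage points, {relative_gap:.1f}% relative")
```

In the notebook, `data["Gender"].value_counts(normalize=True)` would give the real shares.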

#### 3.8 Standard Deviation

In [None]:
ax = sns.boxplot(x="Gender", y="Spending Score (1-100)", data=data)
plt.title("Spending Score by gender", size = 19)

In [None]:
# Mean income and spending score per gender (string aggfunc avoids pandas' FutureWarning)
pd.pivot_table(data=data, columns = "Gender", values = ["Annual Income (k$)", "Spending Score (1-100)"],
               aggfunc = "mean")

It can be seen that male customers have a higher average income, but female customers have a higher average Spending Score.

In [None]:
ax = sns.boxplot(x="Gender", y="Annual Income (k$)", data=data)
plt.title("Annual Income by gender", size = 19)

In [None]:
# Standard deviation of income and spending score per gender
pd.pivot_table(data=data, columns = "Gender", values = ["Annual Income (k$)", "Spending Score (1-100)"],
               aggfunc = "std")

The standard deviation of annual income is about 26, and women's Spending Score also covers a wide range.

#### 3.9 Scatter plot

In [None]:
plt.figure(figsize=(10,8))
sns.scatterplot(x = data["Age"], y = data["Spending Score (1-100)"], hue = data["Gender"])
plt.title("Spending Score by age and gender", size = 19)

Distinctive points:

1. **The spending score of customers over 40 years old does not exceed 60 points!**

2. Few customers under the age of 30 have a spending score of 40 or less.

3. For customers under the age of 30, we can infer a higher spending score!

In [None]:
plt.figure(figsize=(10,8))
sns.scatterplot(data=data, x="Annual Income (k$)", y="Spending Score (1-100)", hue="Gender")
plt.title("Annual Income and Spending Score by Gender", size = 19)

My analysis results:

1. Customers with incomes between 80 and 140 k$ split into two groups: one with a low spending score (0-40) and one with a high spending score (60-100).

2. Customers with incomes between 0 and 40 k$ also split into two groups: one with a low spending score (0-40) and one with a high spending score (60-100).

3. A high annual income does not necessarily mean a high spending score.

In [None]:
# Label each customer as Young (35 and under) or Old (over 35).
# Working on a copy avoids pandas' SettingWithCopyWarning.
data_2 = data.copy()
data_2["Age_"] = np.where(data_2["Age"] < 36, "Young", "Old")
print(data_2["Age_"].value_counts())

plt.figure(figsize=(10,8))
sns.scatterplot(x = data_2["Annual Income (k$)"], y = data_2["Spending Score (1-100)"], hue = data_2["Age_"])
plt.title("Annual Income and Spending Score for Young (35 and under) and Old (over 35) Customers", size = 19)

1. Younger customers, aged 35 and under, have higher spending scores.

2. Older customers have lower spending scores.

Let's take a look at each age group this time.

In [None]:
# Split customers into non-overlapping age bands so each customer is counted once.
# .copy() avoids pandas' SettingWithCopyWarning when the label column is added.
Y_1 = data.loc[data["Age"] < 20].copy()
Y_2 = data.loc[data["Age"].between(20, 29)].copy()
Y_3 = data.loc[data["Age"].between(30, 39)].copy()
Y_4 = data.loc[data["Age"].between(40, 49)].copy()
Y_5 = data.loc[data["Age"].between(50, 59)].copy()
Y_6 = data.loc[data["Age"] > 59].copy()

Y_1["Age_"] = "10Y"
Y_2["Age_"] = "20Y"
Y_3["Age_"] = "30Y"
Y_4["Age_"] = "40Y"
Y_5["Age_"] = "50Y"
Y_6["Age_"] = "60Y"

data_3 = pd.concat([Y_1, Y_2, Y_3, Y_4, Y_5, Y_6], axis = 0)
print(data_3["Age_"].value_counts())

plt.figure(figsize=(10,8))
sns.scatterplot(x = data_3["Annual Income (k$)"], y = data_3["Spending Score (1-100)"], hue = data_3["Age_"])
plt.title("Spending Score and annual income by age group", size = 19)

1. It can be seen that the Spending Score of people in their 20s and 30s is high.

2. Every age group includes customers with an annual income of 40-60k$ and a Spending Score of 40-60.
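
Decade groups like the ones above can also be built in a single step with `pd.cut`, which assigns every value to exactly one bin. This is a minimal sketch on a hand-made age series (an assumption for illustration); the bin edges mirror the notebook's labels, and the name `age_group` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy ages chosen to sit on the bin boundaries (not taken from the data set)
ages = pd.Series([18, 19, 20, 29, 30, 45, 59, 60, 70], name="Age")

# Bins are right-inclusive by default: (0, 19], (19, 29], ..., (59, inf)
bins = [0, 19, 29, 39, 49, 59, np.inf]
labels = ["10Y", "20Y", "30Y", "40Y", "50Y", "60Y"]

age_group = pd.cut(ages, bins=bins, labels=labels)
print(age_group.value_counts().sort_index())
```

Applied to the notebook's data, `pd.cut(data["Age"], bins=bins, labels=labels)` would give one label per customer without concatenating separate frames.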

In [None]:
# Draw one income vs. spending scatter plot per age group, coloured by gender
age_groups = {
    "Teenagers": Y_1,
    "Twenties": Y_2,
    "Thirties": Y_3,
    "Forties": Y_4,
    "Fifties": Y_5,
    "Sixties": Y_6,
}

for name, group in age_groups.items():
    plt.figure(figsize=(10,8))
    sns.scatterplot(data=group, x="Annual Income (k$)", y="Spending Score (1-100)", hue="Gender")
    # The keyword is lowercase `size`; `Size` is not a valid text property
    plt.title(f"Annual Income and Spending Score for {name}", size = 19)
    plt.show()

1. Teens' features:

    * Customers with an annual income of 40-60k$ have a Spending Score of 40-60.

2. 20s' features:

    * The points appear to be divided into 5 clusters.
    * Customers with an annual income of 40-60k$ have a Spending Score of 40-60.

3. 30s' features:

    * The points appear to be divided into 5 clusters.
    * The low-income, high-spending cluster has fewer customers than the other clusters.
    * Customers with an annual income of 40-60k$ have a Spending Score of 40-60.

4. 40s' features:

    * The points appear to be divided into 3 clusters.
    * Spending Scores are concentrated in the 40-60 range.
    * Regardless of their income, these customers usually have a low Spending Score.
    * Customers with an annual income of 40-60k$ have a Spending Score of 40-60.

5. 50s' features:

    * The points appear to be divided into 3 clusters.
    * Regardless of their income, these customers usually have a low Spending Score.
    * Customers with an annual income of 40-60k$ have a Spending Score of 40-60.

6. 60s' features:

    * The points appear to be divided into 2 clusters.
    * Customers with an annual income of 40-60k$ have a Spending Score of 40-60.

Conclusions:

1. Customers from their 20s up to about 35 years of age, who have high spending scores, are the main target of the market.

2. Customers over 50 show a moderate spending pattern.

# 4. Data preprocessing

#### 4.1 Scaling

In [None]:
df = data.copy()
df_num = data.select_dtypes(np.number) 
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df_num)
scaled_data = pd.DataFrame(scaled_data, columns= df_num.columns)
scaled_data
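
`MinMaxScaler` with its default settings rescales each column to the range 0 to 1 using (x - min) / (max - min). The sketch below checks that formula by hand on three toy ages (assumed values, not taken from the data set).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature column with three toy values
values = np.array([[18.0], [35.0], [70.0]])

# Manual min-max scaling, computed per column
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

# The same transformation via scikit-learn
scaled = MinMaxScaler().fit_transform(values)

print(np.hstack([manual, scaled]))
```

Because every feature ends up on the same 0 to 1 scale, no single column dominates the distance computations used by the clustering steps that follow.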

In [None]:
df['Gender'] = df['Gender'].astype('category').cat.codes
processed_df = pd.concat([scaled_data, df['Gender']], axis=1) 
processed_df

Convert the Gender column to numeric codes and join it to the scaled features.

In [None]:
sns.pairplot(processed_df, hue='Gender', aspect=1.5)

#### 4.2 Dendrogram

In [None]:
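coph=dict()
for method in ['ward','average','complete','single']:
    mergings=linkage(processed_df,method=method)
    # Cophenetic correlation: how faithfully the tree preserves pairwise distances
    c,d=cophenet(mergings,pdist(processed_df))
    coph[method]=c
print(coph)

# Pick the method with the highest coefficient (compare values, not the key names)
print('\nOptimal Linkage Method:',max(coph, key=coph.get))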
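
The cophenetic correlation coefficient measures how well the distances implied by a dendrogram match the original pairwise distances; values closer to 1 indicate a more faithful tree. Below is a small self-contained sketch on two synthetic blobs (assumed data, not the mall customers) that selects the linkage method by the largest coefficient.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Two well-separated toy blobs of 20 points each
points = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 2)),
    rng.normal(3.0, 0.1, size=(20, 2)),
])
distances = pdist(points)

scores = {}
for method in ["ward", "average", "complete", "single"]:
    coefficient, _ = cophenet(linkage(points, method=method), distances)
    scores[method] = coefficient

# max() over a dict iterates its keys, so a key function is needed
# to compare the coefficients rather than the method names
best_method = max(scores, key=scores.get)
print(scores)
print("Best method by cophenetic correlation:", best_method)
```

Calling `max(scores)` without the key function would return the alphabetically last name, `'ward'`, whatever the coefficients are.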

In [None]:
fig = plt.figure(figsize=(25, 10))
dendrogram=sch.dendrogram(sch.linkage(processed_df,method='ward'))
plt.title("Dendrogram")
plt.xlabel("Customers")
plt.ylabel("Euclidean distance")
plt.show()

Hierarchical clustering has a high time complexity, and determining the optimal number of clusters from a dendrogram can be confusing.
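
Once a number of clusters has been chosen, the tree can be cut into flat cluster labels with SciPy's `fcluster`. This is a minimal sketch on three synthetic blobs (assumed data); the choice of three clusters is part of the toy setup, not a result from the notebook.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)

# Three tight toy blobs of 15 points each, stacked in order
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
points = np.vstack([rng.normal(center, 0.2, size=(15, 2)) for center in centers])

# Build the hierarchy with Ward linkage, then cut it into at most 3 flat clusters
tree = linkage(points, method="ward")
labels = fcluster(tree, t=3, criterion="maxclust")

print(labels)
```

The resulting `labels` array has one entry per point and can be attached to a DataFrame as a new column.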

#### 4.3 Elbow method

In [None]:
X = processed_df[['Annual Income (k$)','Spending Score (1-100)']]
dist_points_from_centroids = []
slscore = []
k = range(2,10)
for clusters in k:
    model = KMeans(n_clusters=clusters, max_iter=1000, random_state=10).fit(X)
    dist_points_from_centroids.append(model.inertia_)
    slscore.append(silhouette_score(X,model.labels_))
plt.xlabel("K")
plt.axvline(5,c='red')
plt.ylabel("inertia")
plt.title("Elbow Method")
plt.plot(k,dist_points_from_centroids)

In the elbow plot, the inertia curve bends at K = 5, which suggests 5 clusters.

#### 4.4 Silhouette method

In [None]:
plt.axvline(5,c='red')
plt.xlabel("K")
plt.ylabel("score")
plt.title("Silhouette score")
plt.plot(k, slscore)

K = 5 received the highest silhouette score. This means that n_clusters = 5 is the right choice.
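
Instead of reading the best K off the plot, the K with the highest silhouette score can be selected programmatically. The sketch below runs on four synthetic blobs (assumed data, so the winning K here is 4 by construction and says nothing about the mall data).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Four tight toy blobs on the corners of a square
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]])
points = np.vstack([rng.normal(center, 0.3, size=(30, 2)) for center in centers])

# Silhouette score for each candidate number of clusters
scores = {}
for n_clusters in range(2, 8):
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=10).fit(points)
    scores[n_clusters] = silhouette_score(points, model.labels_)

best_k = max(scores, key=scores.get)
print("Best K by silhouette score:", best_k)
```

In the notebook, the equivalent lookup on the existing lists would be `list(k)[int(np.argmax(slscore))]`.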

#### 4.5 PCA

In [None]:
km=KMeans(n_clusters=5, random_state=10)  # fixed seed so the cluster labels are reproducible

In [None]:
km.fit(processed_df)

In [None]:
cluster_centers=km.cluster_centers_
cluster_centers

In [None]:
clusters=km.predict(processed_df)
clusters

In [None]:
pca=PCA(n_components=4)
pca_data=pca.fit_transform(processed_df)

In [None]:
pca_data

In [None]:
pca_data.shape, data.shape, clusters.shape

In [None]:
pca_df=pd.DataFrame(pca_data, columns=['PC1', 'PC2', 'PC3', 'PC4'])
pca_df.head()

In [None]:
pca_df['clusters']=clusters
pca_df.head()

In [None]:
pca_df['clusters'].value_counts()
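
When the number of components equals the number of features, PCA only rotates the data and keeps all of its variance. The explained variance ratio shows how much of that variance a two-dimensional plot of the first two components retains. The sketch below uses random toy features (an assumption for illustration), so the printed ratios say nothing about the mall data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 200 toy rows with 4 features, standing in for a scaled feature table
features = rng.random((200, 4))

pca = PCA(n_components=4).fit(features)

# Fraction of the total variance captured by each component, largest first
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

print("Per component:", np.round(ratios, 3))
print("Captured by PC1 + PC2:", round(float(cumulative[1]), 3))
```

In the notebook, the fitted object from the cell above exposes the same attribute as `pca.explained_variance_ratio_`.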

# 5. Modelling

In [None]:
# Plot the first two principal components, coloured by K-Means cluster.
# x and y are passed as keywords, which recent seaborn versions require.
palette = ['pink', 'red', 'orange', 'blue', 'black']
sns.scatterplot(data=pca_df, x='PC1', y='PC2', hue='clusters', palette=palette)

plt.legend(title='Cluster', loc='upper left', bbox_to_anchor=(1.02, 1))
plt.show()