# About


**About Mall Datasets**

This dataset contains information about people visiting the mall. The dataset has gender, customer id, age, annual income, and spending score.

**What is Customer Segmentation?**

Customer segmentation is the process of dividing a customer base into groups of individuals who are similar in ways relevant to marketing, such as gender, age, interests, and spending habits.

**Environment setup & import necessary libraries**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt # data visualization
from sklearn.cluster import KMeans # Import Sklearn KMeans clustering 
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Implement Customer Segmentation



**Read Datasets with pandas**

In [None]:
df = pd.read_csv('../input/mall-customers/Mall_Customers.csv')
# check first five rows
df.head()

**Check means, std ... with pandas describe method**

In [None]:
df.describe()

**Check null values & datatypes**

In [None]:
df.isnull().sum()

In [None]:
df.dtypes

# Customer Gender Visualization

In this section, we create a bar plot and a pie chart to show the gender distribution in the dataset (the `Genre` column of `df`).

In [None]:
# Count Genre 
df.Genre.value_counts()

In [None]:
# Groupby Genre visualization
# df.groupby('Genre').Genre.value_counts().unstack().plot.bar()
sns.countplot(x='Genre', data=df)
plt.title('Customer gender count')
plt.show()

From the above bar plot, we observe that the number of females is higher than the number of males. Now, let us draw a pie chart to observe the ratio of male to female customers.

**Create pie chart**

In [None]:
totalgenre = df.Genre.value_counts()
# Take the labels from the index so they always match the order of the counts
genrelabel = totalgenre.index.tolist()

In [None]:
plt.axis('equal')  # equal aspect ratio so the pie is drawn as a circle
# radius enlarges the pie; autopct prints each percentage with two decimal places
plt.pie(totalgenre, labels=genrelabel, radius=1.5, autopct='%0.2f%%', shadow=True, explode=[0, 0], startangle=45)
plt.title('Pie Chart Depicting Ratio of Female & Male')
plt.show()  # also suppresses the text output of the last expression

*From the above graph, we conclude that females make up 56% of the customers in the dataset and males 44%.*
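
The percentages shown on the pie chart can also be computed directly. The following is a minimal sketch on a made-up frame (`toy` is a hypothetical stand-in for `df`, built to mirror the 56/44 split reported above), using `value_counts(normalize=True)`:

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df: 56 females and 44 males.
toy = pd.DataFrame({"Genre": ["Female"] * 56 + ["Male"] * 44})

# normalize=True returns fractions of the total instead of raw counts.
shares = toy["Genre"].value_counts(normalize=True).mul(100).round(2)

# The labels come from the index, so each label is paired with its own value.
for label, pct in shares.items():
    print(f"{label}: {pct:.2f}%")
```

Reading the labels from the result's index avoids relying on a hand-written label list being in the same order as the counts.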

# Visualization of Age Distribution with Histogram

Let us plot a histogram to view the frequency distribution of customer ages. We first take a summary of the Age variable.

In [None]:
df['Age'].describe()

In [None]:
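my_bins = 10
# plt.hist uses 10 equal-width bins by default; the count is set explicitly here.
# It returns a tuple (counts, bin_edges, patches).
arr = plt.hist(df['Age'], bins=my_bins, rwidth=0.95)
plt.xlabel('Age Class')
plt.ylabel('Frequency')
plt.title('Histogram of Age Class')
# Annotate each bar with its count, placed at the left edge of the bin
for i in range(my_bins):
    plt.text(arr[1][i], arr[0][i], str(int(arr[0][i])))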

In [None]:
plt.boxplot(df["Age"])
plt.title('Boxplot for Descriptive Analysis of Age')
plt.show()

From the above two visualizations, we conclude that the most common customer ages are between 30 and 35. The minimum age of customers is 18, whereas the maximum age is 70.
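
Reading the busiest age class off the chart can be cross-checked numerically. The sketch below uses a small invented `ages` array (not the mall data) and `np.histogram`, which by default produces equal-width bins over the data range, like `plt.hist`:

```python
import numpy as np

# Hypothetical ages, invented for illustration only.
ages = np.array([18, 21, 30, 31, 32, 32, 33, 35, 36, 40, 47, 49, 54, 60, 70])

# Ten equal-width bins spanning min(ages) to max(ages).
counts, edges = np.histogram(ages, bins=10)

# Index of the bin holding the most customers.
busiest = int(np.argmax(counts))
print(f"Most common age class: {edges[busiest]:.1f} to {edges[busiest + 1]:.1f}")
print(f"Customers in that class: {counts[busiest]}")
print(f"Youngest: {ages.min()}, oldest: {ages.max()}")
```

On the real data, replacing `ages` with `df['Age']` should give the same bins as the histogram above.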

# Analysis of the Annual Income of the Customers

We will create visualizations to analyze the annual income of the customers. We first plot a histogram and then examine the data using a density plot.

In [None]:
df['Annual Income (k$)'].describe()

In [None]:
my_bins = 10
# plt.hist uses 10 equal-width bins by default; the count is set explicitly here.
arr = plt.hist(df['Annual Income (k$)'], bins=my_bins, rwidth=0.95)
plt.xlabel('Annual Income Class (k$)')
plt.ylabel('Frequency')
plt.title('Histogram of Annual Income')
# Annotate each bar with its count, placed at the left edge of the bin
for i in range(my_bins):
    plt.text(arr[1][i], arr[0][i], str(int(arr[0][i])))

In [None]:
# Density plot and histogram of annual income.
# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True replaces it.
sns.histplot(data=df, x='Annual Income (k$)', kde=True, stat='density',
             color='darkblue',
             edgecolor='black',
             line_kws={'linewidth': 4})

From the above descriptive analysis, we conclude that the minimum annual income of the customers is 15 k$ and the maximum is 137 k$. Customers earning around 70 k$ have the highest frequency count in the histogram. The average annual income of all the customers is 60.56 k$.

# Analyzing Spending Score of the Customers


In [None]:
# Check the summary of Spending Score of the Customers
df['Spending Score (1-100)'].describe()

In [None]:
my_bins=10
# plt.hist uses 10 equal-width bins by default; the count is set explicitly here.
arr=plt.hist(df['Spending Score (1-100)'],bins=my_bins, rwidth=0.95) 
plt.xlabel('Spending Score Class')
plt.ylabel('Frequency')
plt.title('Histogram for Spending Score')
for i in range(my_bins):
    plt.text(arr[1][i],arr[0][i],str(arr[0][i]))

In [None]:
plt.boxplot(df["Spending Score (1-100)"])
plt.title('Boxplot for Descriptive Analysis of Spending Score')
plt.show()

The minimum spending score is 1, the maximum is 99, and the average is 50.20. From the histogram, we conclude that the class between 40 and 50 contains more customers than any other spending score class.

# K-Means Algorithm
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.

Every data point is allocated to a cluster so as to reduce the within-cluster sum of squares. In other words, the K-means algorithm identifies k centroids and then allocates every data point to the nearest centroid, while keeping the within-cluster sum of squares as small as possible. Here we use the elbow method to choose k.
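
To make the assign-then-update loop concrete, here is a minimal NumPy sketch of the standard Lloyd iteration. It is an illustration under stated assumptions, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the `toy` points are all hypothetical.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm, for illustration only."""
    rng = np.random.default_rng(seed)

    def nearest(centroids):
        # Distance from every point to every centroid, then the closest one.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        return dists.argmin(axis=1)

    # Initialise with k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest(centroids)
        # Move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated

    labels = nearest(centroids)
    # Within-cluster sum of squares for the final assignment.
    sse = float(((points - centroids[labels]) ** 2).sum())
    return labels, centroids, sse

# Hypothetical (income, score) pairs forming two well-separated groups.
toy = np.array([[15, 80], [16, 78], [17, 82], [80, 20], [82, 18], [85, 22]], dtype=float)
labels, centroids, sse = kmeans_sketch(toy, k=2)
print(labels)
print(centroids)
print(round(sse, 3))
```

Which group receives label 0 depends on the random initialisation, so only the grouping itself is meaningful, not the label numbers.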

**Elbow Method**

The “elbow” method helps data scientists select the number of clusters by fitting the model with a range of values for K and plotting the sum of squared errors (SSE) for each. If the line chart resembles an arm, then the “elbow” (the point where the curve bends most sharply) is a good indication that the underlying model fits best at that point. Yellowbrick's KElbowVisualizer implements this method and annotates the “elbow” with a dashed line; in this notebook the curve is instead computed manually from the `inertia_` attribute of `KMeans`.
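
For reference, the quantity on the vertical axis of an elbow chart can be written out as below. The symbols are introduced here for illustration and do not appear elsewhere in this notebook: $C_j$ is the set of points assigned to cluster $j$ and $\mu_j$ is its centroid. scikit-learn reports this value for a fitted model as `inertia_`.

```latex
\mathrm{SSE}(K) = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

SSE can only decrease as K grows, which is why the method looks for the bend in the curve rather than its minimum.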

In [None]:
km = KMeans(n_clusters=3)
km

In [None]:
y_predicted = km.fit_predict(df[['Annual Income (k$)', 'Spending Score (1-100)']])
# clustering
y_predicted

In [None]:
df['cluster'] = y_predicted
df.head()

In [None]:
#check centroid
km.cluster_centers_

In [None]:
df0 = df[df.cluster == 0]
df1 = df[df.cluster == 1]
df2 = df[df.cluster == 2]

In [None]:
plt.scatter(df0['Annual Income (k$)'], df0['Spending Score (1-100)'], color='green')
plt.scatter(df1['Annual Income (k$)'], df1['Spending Score (1-100)'], color='red')
plt.scatter(df2['Annual Income (k$)'], df2['Spending Score (1-100)'], color='black')
plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:,1], color='yellow', marker='o', label='centroid', s=200)
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend(['Cluster1', 'Cluster2', 'Cluster3', 'centroid'])

**Check elbow and SSE**

In [None]:
k_rng = range(1, 10)
sse = []
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit(df[['Annual Income (k$)', 'Spending Score (1-100)']])
    sse.append(km.inertia_)
sse

In [None]:
plt.xlabel('K')
plt.ylabel('Sum of Squared Errors (SSE)')
plt.plot(k_rng, sse)

In the elbow plot, the curve bends at k = 5, so k = 5 is a good choice for the number of clusters.
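
Reading the elbow by eye can be backed up with a simple heuristic. The sketch below picks the point farthest from the straight line joining the two ends of the curve. This is one common rule of thumb, not what the notebook itself does; the helper name `elbow_by_chord_distance` is hypothetical, and the SSE values are invented for illustration rather than taken from the run above.

```python
import numpy as np

def elbow_by_chord_distance(ks, sse):
    """Return the k whose (k, sse) point lies farthest from the straight
    line joining the first and last points of the curve."""
    ks = np.asarray(ks, dtype=float)
    sse = np.asarray(sse, dtype=float)
    # Rescale both axes to [0, 1] so neither dominates the distance.
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (sse - sse.min()) / (sse.max() - sse.min())
    # Perpendicular distance of each point from the chord between the ends.
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    dist = np.abs(dx * (y - y[0]) - dy * (x - x[0])) / np.hypot(dx, dy)
    return int(ks[np.argmax(dist)])

# Hypothetical SSE values for k = 1..9, invented for illustration only.
ks = range(1, 10)
sse = [270, 185, 110, 80, 44, 38, 32, 27, 22]
print(elbow_by_chord_distance(ks, sse))
```

The result of this heuristic depends on the range of k that is scanned, so it is best treated as a sanity check on the visual reading rather than a replacement for it.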

In [None]:
km = KMeans(n_clusters=5)
y_predicted = km.fit_predict(df[['Annual Income (k$)', 'Spending Score (1-100)']])
# clustering
y_predicted

In [None]:
df['cluster'] = y_predicted
df.head()

In [None]:
km.cluster_centers_

In [None]:
df0 = df[df.cluster == 0]
df1 = df[df.cluster == 1]
df2 = df[df.cluster == 2]
df3 = df[df.cluster == 3]
df4 = df[df.cluster == 4]
plt.scatter(df0['Annual Income (k$)'], df0['Spending Score (1-100)'], color='green')
plt.scatter(df1['Annual Income (k$)'], df1['Spending Score (1-100)'], color='red')
plt.scatter(df2['Annual Income (k$)'], df2['Spending Score (1-100)'], color='blue')
plt.scatter(df3['Annual Income (k$)'], df3['Spending Score (1-100)'], color='cyan')
plt.scatter(df4['Annual Income (k$)'], df4['Spending Score (1-100)'], color='magenta')
plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:,1], color='yellow', marker='o', label='centroid', s=150)
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend(['Cluster1', 'Cluster2', 'Cluster3', 'Cluster4', 'Cluster5', 'centroid'])