# Assignment - 8 

1. Read the dataset and get insights into your data 

2. Perform bi-variate data visualization using box plots for the following and explain your inference for each plot:
 Gender vs Spending Score
 
 Gender vs Annual Income
 
 Gender vs Age
 
3. Use a correlation matrix to identify the correlations between different features, what do you infer from this correlation matrix?

4. Explain with visualization the % split between Male and Female.

Explain your inference with visualization for frequency of visitors of the mall in terms of –

 Age
 
 Annual Income
 
 Spending Score
 
5. Convert categorical variables to numerical variables using one hot encoding.

6. Cluster your data using k-means clustering. Explain how you choose the value of k.

7. Explain each of your clusters in terms of all its attributes (use visualizations to explain better).


In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
from sklearn.cluster import KMeans 
from sklearn.metrics import silhouette_score
import seaborn as sns

In [None]:
cust = pd.read_csv('../input/mall-customers/Mall_Customers.csv')
cust.head()

Basic Visualizations

Gender

In [None]:
#let's compare the number of men and women with a count plot (bar chart)

sns.countplot(x='Genre', data=cust)
plt.title('Customer Gender')

In [None]:
#to make a pie chart (labels are taken from value_counts() so they always match the counts):
gender=cust.Genre.value_counts()
plt.pie(gender, labels=gender.index, autopct='%0.2f%%',startangle=90)
plt.title('Distribution of men and women within the customers of the Mall')
plt.show()

Age Distribution

In [None]:
#let's see the max and min of ages
cust.describe()

The minimum age is 18 and the maximum is 70. We can create 6 bins to group people by age, with each bin spanning 10 years.

In [None]:
bin_list=[10,20,30,40,50,60,70]
plt.hist(cust['Age'], bins=bin_list, rwidth=0.9)
plt.xlabel('Age')
plt.ylabel('frequency')
plt.title('Age distribution of customers')
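
The 10-year age groups can also be counted directly, without a plot. The following is a minimal sketch using `pd.cut`; the `ages` values are hypothetical and only stand in for `cust['Age']`. Note that `plt.hist` treats each bin as closed on the left (except the last bin), while `pd.cut` with `right=True` closes bins on the right, so a customer aged exactly 20 or 30 may land in a different bin.

```python
import pandas as pd

# Hypothetical ages standing in for cust['Age'] (not taken from the dataset).
ages = pd.Series([18, 19, 23, 31, 35, 38, 47, 52, 64, 70], name="Age")

# Same edges as bin_list; right=True puts an age of exactly 70 in (60, 70].
bin_edges = [10, 20, 30, 40, 50, 60, 70]
age_groups = pd.cut(ages, bins=bin_edges, right=True)

# Count customers per bin, keeping the bins in age order.
counts = age_groups.value_counts(sort=False)
print(counts)
```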

Annual Income of Customers

In [None]:
plt.hist(cust['Annual Income (k$)'], bins=12, rwidth=0.9)
plt.xlabel("Income in 1000's of $")
plt.ylabel("frequency")
plt.title('Annual income of customers')

Spending score of the customers

In [None]:
plt.hist(cust['Spending Score (1-100)'], bins=[0,10,20,30,40,50,60,70,80,90,100], rwidth=0.9)
plt.xlabel("Spending score")
plt.ylabel("frequency")
plt.title('Spending Score of customers')

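Encoding Gender as a numerical variable. Gender has only two categories, so a single 0/1 column carries the same information as one-hot encoding with one of the two columns dropped.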

In [None]:
#let's also drop the customer ID because it's not important
cust.drop("CustomerID", axis = 1, inplace=True)

#map the two gender labels to 0/1; assigning the result back avoids the
#chained inplace replace, which newer pandas versions warn about
cust["Genre"] = cust["Genre"].map({"Male": 0, "Female": 1})
cust
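
For comparison, one-hot encoding in the stricter sense (one indicator column per category) can be sketched with `pd.get_dummies`. The `sample` frame below is hypothetical and only reuses the dataset's `Genre` column name. Note that `Genre_Male` is 1 for men, which is the opposite of the male = 0 convention used in this notebook.

```python
import pandas as pd

# Hypothetical rows using the dataset's 'Genre' column name.
sample = pd.DataFrame({"Genre": ["Male", "Female", "Female", "Male"],
                       "Age": [19, 21, 23, 35]})

# One indicator column per category; dtype=int gives 0/1 rather than booleans.
one_hot = pd.get_dummies(sample, columns=["Genre"], dtype=int)
print(one_hot)

# drop_first=True keeps a single column (Genre_Male), which carries the same information.
binary = pd.get_dummies(sample, columns=["Genre"], drop_first=True, dtype=int)
print(binary)
```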

Let's start by visualising the relationship between different variable groups.
We are interested primarily in those who have a high spending score because this is the category we want to keep as customers for the mall. So let's check if there is a relationship between age and spending score, and annual income and spending score.

Gender and Spending

In [None]:
plt.scatter(cust['Genre'], cust['Spending Score (1-100)'])

It's hard to see clusters or relationships in this graph, mainly because gender takes only two distinct values, so all the points fall on two vertical lines. Therefore, we will leave gender out of the clustering to keep the analysis simpler.
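
A box plot is usually easier to read than a scatter plot when one variable has only two values. The sketch below uses synthetic spending scores, not the real data; with the notebook's dataframe the two groups would be selected with something like `cust.loc[cust['Genre'] == 0, 'Spending Score (1-100)']`.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic spending scores for illustration only (not the real dataset).
rng = np.random.default_rng(0)
male_scores = rng.integers(1, 101, size=50)
female_scores = rng.integers(1, 101, size=50)

fig, ax = plt.subplots(figsize=(6, 5))

# One box per gender; boxes are drawn at x positions 1 and 2 by default.
box = ax.boxplot([male_scores, female_scores])
ax.set_xticks([1, 2])
ax.set_xticklabels(["Male (0)", "Female (1)"])
ax.set_ylabel("Spending Score")
ax.set_title("Gender vs Spending Score")
plt.show()
```

The same pattern would apply to Gender vs Annual Income and Gender vs Age by swapping the column.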

Age and Spending

In [None]:
plt.scatter(cust['Age'], cust['Spending Score (1-100)'])

It appears that there is some relationship between being younger (under 35 years old) and spending more, while relatively older people have spending scores below 60. This graph therefore suggests 2 clusters.
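
A correlation matrix puts a number on relationships like this one. The sketch below is an illustration under stated assumptions: the `demo` frame is synthetic, with spending deliberately built to fall with age, so the values it prints say nothing about the real customers.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Age / Income / Spending columns (not real data):
# spending is built to fall with age, income is independent noise.
rng = np.random.default_rng(42)
age = rng.integers(18, 71, size=200)
demo = pd.DataFrame({
    "Age": age,
    "Annual Income (k$)": rng.integers(15, 138, size=200),
    "Spending Score (1-100)": np.clip(90 - age + rng.normal(0, 10, size=200), 1, 100),
})

# Pearson correlation between every pair of numeric columns.
corr = demo.corr()
print(corr.round(2))
```

With the real data, `cust.corr()` would give the equivalent table, and `sns.heatmap(corr, annot=True)` is one way to display it.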

Income and Spending

In [None]:
plt.scatter(cust['Annual Income (k$)'], cust['Spending Score (1-100)'])

In this graph, however, there appear to be 5 groups, or clusters, when we compare annual income and spending score, and this is probably what we are interested in seeing. Let's evaluate this further below.

**K-means clustering**

**Age vs spending clustering**
In the 'Age' vs 'Spending score' graph we saw roughly 2 groups. The elbow plot further below, however, indicates that 4 clusters is the better choice, so we fit 4 clusters here. First, we keep only the Age and Spending score columns so the data stays in 2D.

In [None]:
#Let's have a new dataframe first with only the Age and the spending score

cust_age=cust.drop(["Annual Income (k$)", "Genre"], axis = 1)

In [None]:
#we can test a cluster number 2 to verify what we saw in the 'Age' vs 'Spending score' graph. 
#However, we will use 4 clusters here as we saw in the elbow plot that 4 is the optimal number. See below

k_means_age=KMeans(n_clusters=4, random_state=0)  #fixed seed so the labels are reproducible

#We can also use this code below in case we want to determine the n_init number
#k_means = KMeans(init = "k-means++", n_clusters = clusterNum, n_init = 20)

k_means_age.fit(cust_age)
labels = k_means_age.labels_
print(labels)

Let's see where the cluster centers are located

In [None]:
centers_age=k_means_age.cluster_centers_
centers_age

Let's plot a graph to visualise this relationship

In [None]:
plt.figure(figsize=(10, 8))

plt.scatter(cust_age['Age'], 
            cust_age['Spending Score (1-100)'], 
            c=k_means_age.labels_, s=100)

plt.scatter(centers_age[:,0], centers_age[:,1], color='blue', marker='s', s=200) 

plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.title('K-Means with 4 clusters')

plt.show()

Let's measure the silhouette score of this clustering:

The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.

In [None]:
score = silhouette_score (cust_age, k_means_age.labels_)

print("The score is = ", score)

This graph shows that we can have 4 clusters based on age and spending score. Therefore, we can have 4 groups:

* Younger people with a high spending score
* Younger people with an average spending score
* People of all ages with a low spending score (below 60)
* Older people with an average spending score

A silhouette score of 0.5 is good. Let's see if we can get it higher.
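
One way to look for a higher score is to fit k-means for several values of k and compare the silhouette scores. The sketch below does this on synthetic, well-separated blobs (an assumption made so the example is self-contained); with the notebook's data, `X` would be `cust_age`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with four well-separated groups (illustration only).
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=0)

# The silhouette needs at least 2 clusters, so the sweep starts at k=2.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest mean silhouette is the best-separated configuration.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 2))
```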

Let's find the optimal number of clusters by constructing an elbow plot

In [None]:
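elbowlist1 = []
for i in range(1,15): 
    #use a separate name so the 4-cluster model fitted earlier is not overwritten
    km = KMeans(n_clusters=i, init="k-means++",random_state=0)
    km.fit(cust_age)
    elbowlist1.append(km.inertia_)  #within-cluster sum of squared distances

plt.plot(range(1,15),elbowlist1,marker="*",c="black")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow plot for optimal number of clusters: age and spending")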

Here we can confirm that choosing 4 clusters is reasonable. The elbow plot has a distinct break in slope at 4, indicating that 4 is the optimal number of clusters when comparing age and spending.
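
The quantity on the vertical axis of an elbow plot is the inertia: the sum of squared distances from each point to the center of its assigned cluster. As a small check of that definition, the sketch below recomputes it by hand on synthetic points (an assumption for illustration; with the notebook's data, `X` would be `cust_age.values`).

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic points for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Look up the center assigned to every point, then sum the squared distances.
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = np.sum((X - assigned_centers) ** 2)

# Should agree with the value scikit-learn stores, up to floating-point error.
print(np.isclose(manual_inertia, km.inertia_))
```

Adding clusters generally lowers the inertia, which is why the plot is read for the point where the decrease levels off rather than for its minimum.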

**Annual income vs spending clustering**
Let's now test 5 clusters to verify what we saw in the 'Annual income' vs 'Spending score' graph. First, we will keep only the Annual income and Spending score columns so the data stays in 2D.

In [None]:
#we drop the Age and Genre columns, keeping annual income and spending score
cust_income=cust.drop(["Age", "Genre"], axis = 1)

In [None]:
#let's test 5 clusters to verify what we saw in the 'Annual income' vs 'Spending score' graph.

k_means_income=KMeans(n_clusters=5, random_state=0)  #fixed seed so the labels are reproducible

#We can also use this code below in case we want to determine the n_init number
#k_means = KMeans(init = "k-means++", n_clusters = clusterNum, n_init = 20)

k_means_income.fit(cust_income)
labels = k_means_income.labels_
print(labels)

Let's see where the cluster centers are located

In [None]:
centers_income=k_means_income.cluster_centers_
centers_income

Let's plot a graph to visualise this relationship

In [None]:
plt.figure(figsize=(10, 8))

plt.scatter(cust_income['Annual Income (k$)'], 
            cust_income['Spending Score (1-100)'], 
            c=k_means_income.labels_, s=100)

plt.scatter(centers_income[:,0], centers_income[:,1], color='blue', marker='s', s=200) 

plt.xlabel('Annual Income in K$')
plt.ylabel('Spending Score')
plt.title('K-Means with 5 clusters')

plt.show()

In [None]:
score_2 = silhouette_score (cust_income, k_means_income.labels_)

print("The score is = ", score_2)

Here we can see much more clearly that there are 5 clusters, corresponding to five different groups:

* Low annual income and high spending score --> Interesting category
* High annual income and high spending score --> Interesting category
* Low annual income and low spending score --> Not interesting at all
* High annual income and low spending score
* Middle annual income and middle spending score


Here we have a silhouette score of 0.55, which is better than before. This means that this clustering fits better than the age vs spending one calculated above.

Let's just make sure that 5 is a good number of clusters by constructing an elbow plot

In [None]:
elbowlist2 = []
for i in range(1,15): 
    #use a separate name so the 5-cluster model fitted earlier is not overwritten
    km = KMeans(n_clusters=i, init="k-means++",random_state=0)
    km.fit(cust_income)
    elbowlist2.append(km.inertia_)  #within-cluster sum of squared distances

plt.plot(range(1,15),elbowlist2,marker="*",c="black")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow plot for optimal number of clusters: income and spending")

Here we can confirm that choosing 5 clusters is reasonable. The elbow plot has a distinct break in slope at 5, indicating that 5 is the optimal number of clusters when comparing income and spending.

Visualise the gender distribution in this clustering

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))

#colour each customer by gender instead of by cluster label
scatter = ax.scatter(cust_income['Annual Income (k$)'], 
            cust_income['Spending Score (1-100)'], 
            c=cust['Genre'], s=100)

ax.scatter(centers_income[:,0], centers_income[:,1], color='blue', marker='s', s=200) 

#legend entries are the encoded values: 0 = male, 1 = female
ax.legend(*scatter.legend_elements(), loc="right", title="Gender")

ax.set_xlabel('Annual Income in K$')
ax.set_ylabel('Spending Score')
ax.set_title('Gender distribution over the 5 income/spending clusters')

plt.show()

There isn't much difference between the genders. One could argue that females have a higher spending score than males because there is relatively more yellow than purple in the high spending score categories. However, note that there are more females than males in this dataset (56% against 44%), so it is normal to see more females in this graph. Therefore, gender does not play a noticeable role in this clustering.

Remember:

male = 0
Female = 1
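
The comparison against the overall 56% / 44% split can be made explicit by tabulating the gender share inside each cluster. The sketch below uses randomly generated labels (the `demo` frame and its `cluster` column are hypothetical); with the notebook's objects, the two inputs would be `k_means_income.labels_` from the 5-cluster fit and `cust['Genre']`.

```python
import numpy as np
import pandas as pd

# Synthetic labels for illustration only: 0 = male, 1 = female, clusters 0..4.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Genre": rng.integers(0, 2, size=200),
    "cluster": rng.integers(0, 5, size=200),
})

# Share of each gender within every cluster (each row sums to 1).
share = pd.crosstab(demo["cluster"], demo["Genre"], normalize="index")

# Overall share, the baseline each cluster should be compared against.
baseline = demo["Genre"].value_counts(normalize=True).sort_index()

print(share.round(2))
print(baseline.round(2))
```

A cluster whose row is close to the baseline has no particular gender skew.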

Let's also check whether 5 is the optimal number of clusters when age, income and spending are used together, by constructing an elbow plot

In [None]:
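%matplotlib inline   

#use all three features (age, income and spending), leaving gender out
cust_3D = cust.drop("Genre", axis = 1)

elbowlist3 = []
for i in range(1,15): 
    k_means_3D = KMeans(n_clusters=i, init="k-means++",random_state=0)
    k_means_3D.fit(cust_3D)
    elbowlist3.append(k_means_3D.inertia_)  

plt.plot(range(1,15),elbowlist3,marker="*",c="black")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow plot for optimal number of clusters: age, income and spending")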

Here as well, the choice of 5 clusters looks correct. The elbow plot has a distinct break in slope at 5, indicating that 5 is the optimal number of clusters when comparing age, income and spending.

K-means clustering was performed on a mall customer dataset to classify customers into segments. Five customer segments were found, with different age, income and spending trends. To help the mall management retain customers and increase sales, we recommend focusing on the following segments:

* Rich, high-spending people aged between 20 and 40
* Relatively poor, high-spending people aged between 15 and 30
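
Segment descriptions like these can be backed by a per-cluster summary of every attribute. The following is a minimal sketch on synthetic data (the `demo` frame is an assumption, so its numbers are meaningless); with the real data, the cluster labels from the 5-cluster income/spending model would be attached to `cust` in the same way.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the three numeric columns (not the real data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.integers(18, 71, size=200),
    "Annual Income (k$)": rng.integers(15, 138, size=200),
    "Spending Score (1-100)": rng.integers(1, 101, size=200),
})

# Cluster on income and spending only, then attach the labels to every row.
features = demo[["Annual Income (k$)", "Spending Score (1-100)"]]
demo["cluster"] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)

# Mean of every attribute per cluster, plus the number of customers in each.
profile = demo.groupby("cluster").mean().round(1)
profile["size"] = demo.groupby("cluster").size()
print(profile)
```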