# Customer Segmentation
### Finding a mall's target customers by analyzing customers' spending scores in relation to their age and income. The spending score is calculated from purchasing behavior.

In [None]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
%matplotlib inline
plt.rcParams['figure.figsize'] = (15, 6)

In [None]:
# List the dataset files available under /kaggle/input
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')

In [None]:
df

In [None]:
# Let's get some info about the dataset
df.info()

**`df.info()` shows 200 rows and 5 columns, with no missing data to deal with. Most of the columns are numeric.**
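
The missing-value check can also be made explicit rather than read off the `info()` output. The snippet below is a minimal sketch on a hypothetical three-row stand-in frame (the rows are made up, only the column names match the dataset); on the real data the same two calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in rows that reuse the dataset's column names
sample = pd.DataFrame({
    'CustomerID': [1, 2, 3],
    'Gender': ['Male', 'Female', 'Female'],
    'Age': [30, 42, 25],
    'Annual Income (k$)': [20, 55, 80],
    'Spending Score (1-100)': [35, 50, 90],
})

# Count missing values per column; all zeros means there is nothing to impute
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Columns that pandas treats as numeric (Gender is text, so it is excluded)
numeric_columns = sample.select_dtypes(include='number').columns.tolist()
print(numeric_columns)
```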

In [None]:
# Let's rename the last 2 columns to shorter names
df.rename(columns={'Annual Income (k$)':'Income', 'Spending Score (1-100)':'Spending_score'}, inplace=True)
df.head()

In [None]:
# Let's remove the CustomerID column because it's only an identifier and is useless for clustering
df.drop(columns=['CustomerID'], inplace=True)

In [None]:
# Let's summarize the data statistically
df.describe()

In [None]:
# Let's visualize the features and their relationships with each other
sns.pairplot(df)

**In the Income vs. Spending_score plot, there appear to be 5 groups of customers**

## Using the elbow method to find the number of clusters

In [None]:
X = df[['Income','Spending_score']]
dist_points_from_centroids = []
slscore = []
k = range(2,10)
for clusters in k:
    model = KMeans(n_clusters=clusters, max_iter=1000, random_state=10).fit(X)
    dist_points_from_centroids.append(model.inertia_)
    slscore.append(silhouette_score(X,model.labels_))
plt.xlabel("K")
plt.ylabel("inertia")
plt.title("Elbow Method")
plt.plot(k,dist_points_from_centroids)

**In the graph above, the inertia curve bends like an elbow at K = 5, which suggests there are 5 clusters**
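
The quantity on the y-axis of the elbow plot, `inertia_`, is the sum of squared distances from each point to the centroid of its assigned cluster. The sketch below recomputes it by hand to show what the curve measures. It uses synthetic points from `make_blobs` as a stand-in for the Income and Spending_score columns, and `n_init=10` is set explicitly here as an assumption, not something the notebook specifies.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points standing in for the Income / Spending_score columns
points, _ = make_blobs(n_samples=200, centers=5, random_state=10)

model = KMeans(n_clusters=5, n_init=10, random_state=10).fit(points)

# Look up each point's assigned centroid, then sum the squared distances to it
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = np.sum((points - assigned_centroids) ** 2)

# The two numbers should agree up to floating-point error
print(manual_inertia, model.inertia_)
```

Inertia always decreases as K grows, which is why the elbow method looks for the point where the decrease flattens rather than for a minimum.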

## Let's validate the number of clusters with the silhouette score

In [None]:
plt.xlabel("K")
plt.ylabel("score")
plt.title("Silhouette score")
plt.plot(k, slscore)

**We see that K = 5 gets the highest silhouette score, which supports n_clusters=5 as the right choice**
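
Instead of reading the best K off the plot, it can be picked programmatically as the K with the largest silhouette score. This is a minimal sketch on synthetic, well-separated blobs (hypothetical data, not the mall customers); the names `candidate_ks`, `scores` and `best_k` are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four tight synthetic blobs at the corners of a square (hypothetical data)
corner_centers = [[0, 0], [10, 10], [0, 10], [10, 0]]
points, _ = make_blobs(n_samples=300, centers=corner_centers,
                       cluster_std=0.5, random_state=0)

candidate_ks = list(range(2, 10))
scores = []
for n_clusters in candidate_ks:
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=10).fit_predict(points)
    scores.append(silhouette_score(points, labels))

# The K whose clustering has the highest mean silhouette score
best_k = candidate_ks[int(np.argmax(scores))]
print(best_k)
```

The silhouette score ranges from -1 to 1, and higher values mean points sit closer to their own cluster than to the nearest other cluster.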

In [None]:
# Let's create the model
kmeans = KMeans(n_clusters=5, max_iter=1000, random_state=10).fit(X)

In [None]:
# Let's see the cluster labels assigned to the customers
kmeans.labels_

In [None]:
# Let's make a new column named cluster and assign the labels to it
df['cluster']=kmeans.labels_

In [None]:
df.head()

In [None]:
# Let's see the number of people in each group
plt.title("clusters with the number of customers")
plt.xlabel("clusters")
plt.ylabel("Count")
df.cluster.value_counts().plot(kind='bar')

## Cluster 0 has the most customers. Let's see how the customers in this cluster differ from the others

In [None]:
# Average only the numeric columns; Gender is text and cannot be averaged
df.groupby('cluster').mean(numeric_only=True).plot(kind='bar')
plt.show()

### The customers with the highest spending scores are in clusters 1 and 3, aged around 25 to 30. Customers in cluster 1 earn a lot, while customers in cluster 3 don't earn that much but still spend a lot. Customers in clusters 0, 2 and 4 are all aged around 45. People in cluster 0 spend about as much as they earn, so maybe they have big families. That is not the case in cluster 4: they earn a lot but don't spend, which could mean smaller families, though that is only a guess. Finally, cluster 2, also of parental age, spends about as much as it earns, so these customers may be supporting big families as well.

## OK, let's compare the number of men and women in each cluster

In [None]:
plt.title("Men VS Women ratio in each cluster")
plt.ylabel("Count")
sns.countplot(x=df.cluster, hue=df.Gender)
plt.show()

### Women outnumber men in every cluster except cluster 4
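
The count plot shows absolute numbers, so clusters of different sizes are hard to compare directly. One way to get an actual ratio is a row-normalised cross-tabulation. The sketch below uses a hypothetical six-row frame with the same `Gender` and `cluster` column names; on the real data the same call would take `df['cluster']` and `df['Gender']`.

```python
import pandas as pd

# Hypothetical labelled customers; the real frame has 200 rows
labelled = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Female', 'Female', 'Male', 'Male'],
    'cluster': [0, 0, 0, 1, 1, 1],
})

# normalize='index' divides each row by its total, giving the share of
# each gender within each cluster (every row sums to 1)
gender_share = pd.crosstab(labelled['cluster'], labelled['Gender'], normalize='index')
print(gender_share)
```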

## Let's visualize the clusters

In [None]:
plt.figure(figsize=(15,9))
g = sns.scatterplot(x='Income', y='Spending_score', hue='cluster', data=df,
                    palette=['green', 'orange', 'brown', 'dodgerblue', 'red'], legend='full')

# End