In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('/kaggle/input/customer-segmentation-tutorial-in-python/Mall_Customers.csv', index_col="CustomerID")
df

In [None]:
df.describe()

**Exploring missing data.**

In [None]:
df.isna().sum()

**Create a copy of the data frame.
Bin age, income, and score into quartile ranges (Q1-Q4) with `pd.qcut`.**

In [None]:
df_copy = df.copy()
df_copy["Age_bin"]=pd.qcut(df_copy["Age"],q=4)
df_copy["Annual_Income_bin"]=pd.qcut(df_copy["Annual Income (k$)"],q=4)
df_copy["score_bin"]=pd.qcut(df_copy["Spending Score (1-100)"],q=4)
df_copy.head()
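
For readers new to `pd.qcut`, here is a minimal sketch of what quartile binning does. The ages below are made up for illustration and are not taken from `Mall_Customers.csv`; the names `ages`, `age_bin`, and `bin_counts` are hypothetical.

```python
import pandas as pd

# Hypothetical ages, for illustration only.
ages = pd.Series([18, 21, 25, 30, 35, 40, 50, 70], name="Age")

# q=4 cuts at the 25th, 50th and 75th percentiles, so each bin
# receives about the same number of rows.
age_bin = pd.qcut(ages, q=4)
bin_counts = age_bin.value_counts().sort_index()
print(bin_counts)
```

Because the bins are defined by quantiles, they hold similar numbers of rows but can have very different widths.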

**Understanding the data graphically and finding meaningful interactions**

Does annual income affect the spending score?
Do age and income affect the spending score?
Which age group has a better score?

This can help improve the company's marketing strategy, e.g. which groups of clients
are already locked in and which groups need more advertising or persuasion.

In [None]:
##### Annual income bin VS spending score.  
plt.figure(figsize=(15,10))
sns.boxplot(data=df_copy,y='Spending Score (1-100)',x='Annual_Income_bin')
plt.figure(figsize=(15,10))
sns.boxplot(data=df_copy,y='Spending Score (1-100)',x='Annual_Income_bin',hue="Gender")
plt.figure(figsize=(15,10))
sns.boxplot(data=df_copy,y='Spending Score (1-100)',x='Annual_Income_bin',hue="Age_bin")
plt.axhline(y=50)

**BOX-PLOT INTERPRETATION**

The first boxplot illustrates that those who make (78.0-137.0)k have a higher spending score. Note that the bins come from `pd.qcut`, so each income category holds roughly the same number of customers; a boxplot shows the spread of scores, not the group size.

From fig2, females who earn (78.0-137.0)k have a higher spending score than men with the same annual income. Another interesting finding is that men in the lower (41.5-61.6)k category have a higher spending score than women who earn (41.5-61.6)k.

In fig3 I drew a horizontal line at 50, which is the mean of the spending score. The line inside each box is the group's median, so any group whose median is above the horizontal line can be treated as locked-in customers,
and those below the line as not locked-in customers. Hence, if the company wants to win more customers, it needs to advertise or market to those below the line. And if it has a new product, it should reach out to the locked-in customers who fall above the average spending score.
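
The boxplot reading can be cross-checked numerically by computing the median score per bin and comparing it with the reference line at 50. This is a minimal sketch on a made-up frame that reuses the notebook's column names; `toy`, `medians`, and `above_line` are hypothetical names, and in the notebook the same `groupby` would run on `df_copy`.

```python
import pandas as pd

# Made-up rows for illustration; in the notebook this would be df_copy.
toy = pd.DataFrame({
    "Annual Income (k$)": [15, 20, 40, 45, 60, 65, 90, 120],
    "Spending Score (1-100)": [30, 40, 45, 55, 48, 52, 20, 90],
})
toy["Annual_Income_bin"] = pd.qcut(toy["Annual Income (k$)"], q=4)

# Median spending score per income bin (the line inside each box).
medians = (
    toy.groupby("Annual_Income_bin", observed=True)["Spending Score (1-100)"]
    .median()
)

# Bins whose median sits above the reference line at 50.
above_line = medians[medians > 50]
print(medians)
print(above_line)
```

Grouping on two columns, for example `["Annual_Income_bin", "Gender"]`, would give the numeric counterpart of the plots that use `hue`.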

**BUILDING KMEANS MODEL**

In [None]:
# Preparing the data for KMeans: encode Gender as integers on the copy,
# because MinMaxScaler only accepts numeric columns.
df_copy1 = df.copy()
df_copy1['Gender'] = df_copy1['Gender'].map({'Male': 1, 'Female': 0})  # Male -> 1, Female -> 0
print(df_copy1)

In [None]:
#normalizing data
from sklearn.preprocessing import MinMaxScaler
df_norm=MinMaxScaler().fit_transform(df_copy1)#normalizing data. 
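
As a quick illustration of what `MinMaxScaler` does, the sketch below rescales each column to the range [0, 1] and reproduces the result by hand. The array `X_demo` is made up (its columns merely stand in for a gender flag, age, and income), and `scaled` and `manual` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative values only: columns stand in for gender flag, age, income.
X_demo = np.array([[0, 20, 15.0],
                   [1, 35, 60.0],
                   [1, 50, 135.0]])

scaled = MinMaxScaler().fit_transform(X_demo)

# The same result by hand: (x - column min) / (column max - column min).
manual = (X_demo - X_demo.min(axis=0)) / (X_demo.max(axis=0) - X_demo.min(axis=0))

print(np.allclose(scaled, manual))
```

Scaling matters for KMeans because it relies on Euclidean distance, so an unscaled income column would otherwise dominate a 0/1 gender column.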

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
###Finding the ideal value of K. 
Sum_of_squared_distances = []
K = range(1,15)
for k in K:
    km = KMeans(n_clusters=k,
               init='k-means++',n_init=14,max_iter=300,random_state=0)
    km = km.fit(df_norm)
    Sum_of_squared_distances.append(km.inertia_)

In [None]:
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
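
The quantity plotted on the y-axis, `inertia_`, is the sum of squared distances from each point to its assigned cluster centre. A minimal sketch on synthetic data (three made-up blobs, not the mall data) that recomputes it by hand; `X_toy`, `toy_km`, and `manual_inertia` are hypothetical names chosen to avoid clashing with the notebook's variables.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: three well-separated groups (illustrative only).
rng = np.random.default_rng(0)
toy_centers = np.array([[0, 0], [5, 5], [0, 5]])
X_toy = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in toy_centers])

toy_km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)

# Look up the centre assigned to every point, then sum the squared distances.
assigned = toy_km.cluster_centers_[toy_km.labels_]
manual_inertia = ((X_toy - assigned) ** 2).sum()

print(toy_km.inertia_, manual_inertia)
```

Inertia always decreases as k grows, which is why the elbow method looks for the point where the decrease levels off rather than for the minimum.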

In [None]:
!pip install kneed

In [None]:
# Locate the elbow of the inertia curve with kneed; this gives k=4.
from kneed import KneeLocator
k=KneeLocator(range(1, 15), Sum_of_squared_distances, curve="convex", direction="decreasing")
k.elbow

In [None]:
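# Note: n_iter_ is the number of iterations run by the last model fitted in the
# loop above (k=14); it is not a cluster count. Its value, 5, is used below as a
# second candidate for k alongside the elbow result.
km.n_iter_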

In [None]:
#BUILD MODEL WITH 4 CLUSTERS 
kmeans_4 = KMeans(n_clusters=4, 
                  init='random',n_init=14,
                  max_iter=300,random_state=0).fit(df_norm)#=4


In [None]:
#BUILD MODEL WITH 5 CLUSTERS 
kmeans_5 = KMeans(n_clusters=5, 
                  init='random',n_init=14,
                  max_iter=300,random_state=0).fit(df_norm)#=5


In [None]:
#silhouette_score
from sklearn.metrics import silhouette_score
print("This is the silhouette_score for k=4 ",silhouette_score(df_norm, kmeans_4.labels_))
print("This is the silhouette_score for k=5 ",silhouette_score(df_norm, kmeans_5.labels_))
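
The same comparison can be run over a range of candidate k values instead of only two. A minimal sketch on synthetic blobs (made-up data, not the mall customers); `X_sil`, `sil_scores`, and `best_k` are hypothetical names. The silhouette score needs at least two clusters, so the loop starts at 2.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic 2-D data: three well-separated groups (illustrative only).
rng = np.random.default_rng(0)
demo_centers = np.array([[0, 0], [5, 5], [0, 5]])
X_sil = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in demo_centers])

# Mean silhouette score for each candidate number of clusters.
sil_scores = {}
for n in range(2, 7):
    demo_labels = KMeans(n_clusters=n, n_init=10, random_state=0).fit_predict(X_sil)
    sil_scores[n] = silhouette_score(X_sil, demo_labels)

# Higher is better; pick the k with the largest mean silhouette.
best_k = max(sil_scores, key=sil_scores.get)
print(sil_scores, best_k)
```

On real data the scores are usually much closer together than on this toy example, so it is worth reading them alongside the elbow plot rather than in isolation.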

In [None]:
# Cluster labels for k=4 (the model is already fitted, so reuse its labels)
label_k4=kmeans_4.labels_
unique_label=np.unique(label_k4)
unique_label

**CREATING A BOX PLOT WITH THE CLUSTERS.**

In [None]:
df_copy_copy=df_copy1.copy()

In [None]:
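# Assign the label array directly so labels line up with rows by position.
# (Wrapping it in a DataFrame gives a 0-based index that does not match
# CustomerID, which shifts the labels by one row and leaves a NaN.)
df_copy_copy['Cluster'] = label_k4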

In [None]:
df_copy_copy

In [None]:
df_copy_copy.isna().sum()

In [None]:
df_copy_copy.dropna()
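
A note on attaching labels to a frame indexed by `CustomerID`: pandas aligns a Series or DataFrame on index values rather than on position, which can shift labels and introduce a missing value. A minimal sketch with made-up values; `demo`, `made_up_labels`, `misaligned`, and `aligned` are hypothetical names.

```python
import pandas as pd

# Toy frame indexed 1..3, like CustomerID in the notebook.
demo = pd.DataFrame(
    {"Age": [19, 21, 20]},
    index=pd.Index([1, 2, 3], name="CustomerID"),
)
made_up_labels = [2, 0, 1]  # made-up cluster labels, one per row in order

# A Series built from the labels gets a fresh 0..2 index, so pandas
# aligns on index values: row 1 receives the label stored at index 1,
# and row 3 finds no match and becomes NaN.
misaligned = demo.assign(Cluster=pd.Series(made_up_labels))

# Assigning the plain list (or a NumPy array) is positional and keeps the order.
aligned = demo.assign(Cluster=made_up_labels)

print(misaligned)
print(aligned)
```

Checking `isna().sum()` after adding a column, as done above, is a cheap way to catch this kind of misalignment.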

In [None]:
# Plotting the clusters against the other variables.
fig,axes=plt.subplots(2,3,figsize=(20,15))

fig.suptitle("Cluster Results.")

sns.boxplot(ax=axes[0,2],data=df_copy_copy,y='Spending Score (1-100)',x='Cluster')
sns.boxplot(ax=axes[0,1],data=df_copy_copy,y='Annual Income (k$)',x='Cluster')
sns.boxplot(ax=axes[0,0],data=df_copy_copy,y='Age',x='Cluster');

sns.boxplot(ax=axes[1,2],data=df_copy_copy,y='Spending Score (1-100)',x='Cluster',hue='Gender')
sns.boxplot(ax=axes[1,1],data=df_copy_copy,y='Annual Income (k$)',x='Cluster',hue='Gender')
sns.boxplot(ax=axes[1,0],data=df_copy_copy,y='Age',x='Cluster',hue='Gender');

**CLUSTER ANALYSIS**


In the boxplots above, each cluster [0-3] is plotted against the other variables.

Using the spending score as a reference, we can see that clusters 0 and 1 sit above the mean spending score of 50, while clusters 2 and 3 fall below it.
Clusters 0 and 1 also tell us that customers with a high spending score tend to be in the 25-40 age range. In addition, females in clusters 0 and 1 have a higher spending score.

Clusters 2 and 3 have a mean age of about 45 and 40, respectively, so older people tend to have a lower spending score. Men also tend to have a lower spending score.

For annual income, the clusters are roughly even.
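
The statements above can be backed with a per-cluster summary table. This is a minimal sketch on made-up rows that stand in for `df_copy_copy`; `demo_clusters`, `summary`, and `sizes` are hypothetical names.

```python
import pandas as pd

# Made-up rows standing in for df_copy_copy (values are illustrative).
demo_clusters = pd.DataFrame({
    "Age": [25, 30, 45, 50, 28, 60],
    "Annual Income (k$)": [60, 62, 58, 61, 59, 63],
    "Spending Score (1-100)": [80, 75, 30, 20, 90, 10],
    "Cluster": [0, 0, 1, 1, 0, 1],
})

# Mean and median of each variable per cluster, plus cluster sizes.
summary = demo_clusters.groupby("Cluster").agg(["mean", "median"])
sizes = demo_clusters["Cluster"].value_counts().sort_index()

print(summary)
print(sizes)
```

Reporting both the mean and the median helps here, since a boxplot shows the median while the text above refers to means.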