# INTRODUCTION 

This Kernel segments the customers of a mall store, analyzes each cluster in detail, and tests for gender differences in spending with a t-test.
* First, we look at the data as a whole: the distribution of each variable, the relationships between variables, and a first idea of what our customers look like.
* Second, we run the K-Means algorithm on Spending Score and Annual Income to create groups for further analysis.
* Finally, we analyze each cluster and perform a t-test on spending score by gender. This tells us whether there is a statistically significant difference between the mean spending scores of men and women. The results can then be used to develop specific action plans by gender and by cluster.

Hypotheses:
* H0: There is no difference between the mean spending scores of men and women.
* H1: There is a statistically significant difference between the mean spending scores of men and women.
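
Written formally, and assuming the pooled-variance (Student) form that `scipy.stats.ttest_ind` applies by default, the two hypotheses and the test statistic can be sketched as follows. The subscripts and the symbols $\bar{x}$, $s$ and $n$ (sample mean, sample standard deviation and sample size of each group) are notation introduced here for illustration.

```latex
H_0:\ \mu_{\text{women}} = \mu_{\text{men}}
\qquad
H_1:\ \mu_{\text{women}} \neq \mu_{\text{men}}

t = \frac{\bar{x}_{\text{women}} - \bar{x}_{\text{men}}}
         {s_p \sqrt{\dfrac{1}{n_{\text{women}}} + \dfrac{1}{n_{\text{men}}}}},
\qquad
s_p^2 = \frac{(n_{\text{women}} - 1)\, s_{\text{women}}^2 + (n_{\text{men}} - 1)\, s_{\text{men}}^2}
             {n_{\text{women}} + n_{\text{men}} - 2}
```

H0 is rejected at the 5% level when the two-sided p-value associated with $t$ falls below 0.05.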



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from scipy import stats

In [None]:
#Import the dataset
df = pd.read_csv('/kaggle/input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
#Plot the relationships between all variables
plt.style.use('ggplot')
sns.pairplot(data=df,hue= 'Gender')

In [None]:
#Number of Male and Female
sns.countplot(x='Gender',data=df)

In [None]:
#BoxPlot 
for col in df.select_dtypes('int64'): 
    plt.figure(figsize=(8,8))
    sns.boxplot(x="Gender" ,y=col, data=df)

## Takeaways:

The dataset contains 5 columns and 200 observations. There are no missing values and only one outlier, which we will keep in the dataset.

The plots above suggest that the features are roughly normally distributed. The medians for men and women are always close or identical, even though women are slightly more numerous in the dataset than men.

Men also seem to have a larger interquartile range than women.
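
The visual impressions above can also be checked numerically. The sketch below is one possible way to do it, not the method used in this Kernel: it counts missing values, runs a Shapiro-Wilk normality test, and counts points outside the 1.5 x IQR whiskers. The `demo` frame is synthetic stand-in data and `summarize_checks` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the mall data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Age': rng.integers(18, 71, size=200),
    'Annual Income (k$)': rng.normal(60, 26, size=200).round(),
    'Spending Score (1-100)': rng.integers(1, 100, size=200),
})

def summarize_checks(frame):
    """Return missing counts, Shapiro-Wilk p-values and IQR outlier counts per column."""
    rows = {}
    for col in frame.columns:
        values = frame[col].dropna()
        q1, q3 = values.quantile([0.25, 0.75])
        iqr = q3 - q1
        # Points beyond the boxplot whiskers (1.5 x IQR rule)
        outliers = ((values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)).sum()
        rows[col] = {
            'missing': int(frame[col].isna().sum()),
            'shapiro_p': stats.shapiro(values).pvalue,
            'iqr_outliers': int(outliers),
        }
    return pd.DataFrame(rows).T

checks = summarize_checks(demo)
print(checks)
```

A Shapiro-Wilk p-value below 0.05 would argue against normality for that column, so on the real data it is worth running `summarize_checks(df.select_dtypes('number'))` before relying on the plots alone.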

# K-MEANS

In [None]:
#SCALING THE DATA

#Drop 'Gender' because categorical variables do not fit into a K-Means algorithm.
#'CustomerID' and 'Age' are also dropped so that we cluster on Annual Income and Spending Score only.
X = df.drop(['CustomerID','Gender','Age'],axis=1)

#Scale the data
scale = StandardScaler()
X_scaled = scale.fit_transform(X)

X_scaled

In [None]:
#Create the cluster with KMeans

plt.figure(figsize=(10,8))

#Compute the inertia (WCSS) for each number of clusters to find the elbow.
wcss =[]
for k in range(1,15):
    kmean=KMeans(n_clusters=k,init='k-means++',max_iter=300,n_init=20,random_state=45)
    kmean.fit(X_scaled)
    wcss.append(kmean.inertia_)
plt.plot(range(1,15),wcss)
plt.ylabel('WCSS')
plt.xlabel('Number of clusters')
plt.title('ELBOW METHOD')

### The elbow of the curve suggests selecting 5 clusters.
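
Reading an elbow is somewhat subjective, so a second criterion can help. The sketch below uses the silhouette score as a complementary check. It is an alternative technique that is not part of this Kernel, and it runs on synthetic blobs (`points`, a hypothetical stand-in for the scaled income and spending data) so that it is self-contained.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with five well-separated groups (assumption, not the mall data)
points, _ = make_blobs(
    n_samples=200,
    centers=[[-5, -5], [-5, 5], [5, -5], [5, 5], [0, 0]],
    cluster_std=0.6,
    random_state=45,
)
points_scaled = StandardScaler().fit_transform(points)

# The silhouette score needs at least 2 clusters, so k starts at 2
silhouettes = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, init='k-means++', n_init=20,
                    random_state=45).fit_predict(points_scaled)
    silhouettes[k] = silhouette_score(points_scaled, labels)

# The k with the highest average silhouette is the preferred candidate
best_k = max(silhouettes, key=silhouettes.get)
print(best_k, round(silhouettes[best_k], 3))
```

On the real data, replacing `points_scaled` with `X_scaled` would show whether the silhouette criterion agrees with the elbow.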

In [None]:
#Build the model
kmean = KMeans(n_clusters=5,init='k-means++',max_iter=300,random_state=345)
model=kmean.fit(X_scaled)
cluster = model.predict(X_scaled)
cluster

In [None]:
#Look at the inertia
model.inertia_

In [None]:
#Plot the scaled data with each cluster in a different color

plt.figure(figsize=(10,8))
#Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=X_scaled[:,0],y=X_scaled[:,1],hue=model.labels_,palette='muted')

#Add the centroids to the plot
sns.scatterplot(x=model.cluster_centers_[:,0],y=model.cluster_centers_[:,1],color='black',marker='d',sizes=20)
plt.legend(loc='upper right')
plt.title('Cluster on Annual Income and Spending Score')

The algorithm is done: the clusters are well defined and the centroids sit at the center of each cluster.
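
The centroids plotted above are expressed in standardized units. To read them in k$ and spending-score points, they can be mapped back with the scaler's `inverse_transform`. The sketch below illustrates the idea on synthetic data; `demo`, `scaler` and `demo_model` are hypothetical names, and on the real data the fitted `scale` and `model` objects would play those roles.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the two clustering features (assumption)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'Annual Income (k$)': rng.uniform(15, 140, size=200),
    'Spending Score (1-100)': rng.uniform(1, 100, size=200),
})

scaler = StandardScaler()
demo_scaled = scaler.fit_transform(demo)
demo_model = KMeans(n_clusters=5, n_init=20, random_state=345).fit(demo_scaled)

# Map the centroids from standardized units back to the original units
centroids = pd.DataFrame(
    scaler.inverse_transform(demo_model.cluster_centers_),
    columns=demo.columns,
)
# Number of customers assigned to each cluster
centroids['size'] = np.bincount(demo_model.labels_, minlength=5)
print(centroids.round(1))
```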

# ANALYSIS OF EACH CLUSTER

### We analyze every cluster except the one in the bottom-left, which represents customers with low income and a low spending score. We skip it because it is not really useful for the business.

In [None]:
#Add the cluster labels to the original dataset

df['cluster']=cluster
df

In [None]:
#Redraw the scatter plot on the original scale for interpretation
plt.figure(figsize=(10,8))
sns.scatterplot(x=df['Annual Income (k$)'],y=df['Spending Score (1-100)'],hue=model.labels_,palette='muted')
plt.legend(loc='upper right')
plt.title('Cluster on Annual Income and Spending Score')
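
The cluster numbers returned by K-Means are arbitrary and can change with `random_state` or the library version, so it can help to name each cluster from its position rather than rely on the label. The sketch below is a hypothetical helper (`name_segments`, with an assumed 0.5 standard-deviation threshold) applied to a toy frame. It is an illustration, not part of the original analysis.

```python
import pandas as pd

def level(z, name):
    """Turn a z-score into a readable level (the 0.5 threshold is an assumption)."""
    if z > 0.5:
        return f'high {name}'
    if z < -0.5:
        return f'low {name}'
    return f'average {name}'

def name_segments(frame, income_col, score_col, label_col):
    """Describe each cluster by where its mean sits relative to the whole dataset."""
    cols = [income_col, score_col]
    means = frame.groupby(label_col)[cols].mean()
    # Distance of each cluster mean from the overall mean, in standard deviations
    z = (means - frame[cols].mean()) / frame[cols].std()
    out = means.copy()
    out['segment'] = [
        f"{level(zi, 'income')}, {level(zs, 'spending')}"
        for zi, zs in zip(z[income_col], z[score_col])
    ]
    out['customers'] = frame.groupby(label_col).size()
    return out

# Toy data (hypothetical values) with three obvious groups
toy = pd.DataFrame({
    'Annual Income (k$)': [20, 25, 90, 85, 55, 56],
    'Spending Score (1-100)': [80, 75, 15, 20, 50, 48],
    'cluster': [0, 0, 1, 1, 2, 2],
})
segments = name_segments(toy, 'Annual Income (k$)', 'Spending Score (1-100)', 'cluster')
print(segments)
```

Calling `name_segments(df, 'Annual Income (k$)', 'Spending Score (1-100)', 'cluster')` on the real data gives a quick check that, for example, cluster 0 really is the low-income, high-spending group in a given run.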

# Cluster #0: Low Annual Income and High Spending Score

In [None]:
cluster_0 = df[df['cluster']==0]
cluster_0.describe()

In [None]:
#Define a function that draws several plots for a cluster, so the analysis is reproducible

def cluster_plot(cluster_df):
    plt.figure(figsize=(10,10))
    plt.subplots_adjust(bottom=0.5,top=2.5)
    plt.subplot(4,1,1)
    
    #Pass the column by name; recent seaborn versions do not accept it positionally
    sns.countplot(x='Gender',data=cluster_df)
    plt.title('Count of Men and Women')
    
    plt.subplot(4,1,2)
    sns.scatterplot(x='Age',y='Annual Income (k$)',size='Spending Score (1-100)',hue='Gender',data=cluster_df)
    plt.title ('Age vs Annual Income with spending score and sex distribution')
    plt.legend(bbox_to_anchor=(1.04,1), loc="upper left")

    plt.subplot(4,1,3)
    sns.boxplot(x='Gender',y='Spending Score (1-100)',data=cluster_df)
    plt.title('Spending score vs Gender')

    plt.subplot(4,1,4)
    sns.boxplot(x='Gender',y='Annual Income (k$)',data=cluster_df)
    plt.title('Annual Income vs Gender')

cluster_plot(cluster_0)

## Cluster #0 Final takeaways

This cluster might represent student customers, because most of them are young and have a low annual income.

**Key observations:**
* The customers are quite young, with a mean age of 23.
* The mean spending score is 79.36, which is quite good, and the minimum score is 61.
* The mean annual income is $25,720. Although men have a higher median annual income than women, women have a larger variance. The same holds for spending score.

**Actions:**
This cluster is important because we want to build trust and loyalty with these customers. The goal is to keep them long enough that they still buy from our store once they have a real job and earn more money. The transition should be from cluster 0 (top left) to cluster 3 (top right).

For now, the company should send them deals on sale products from time to time. These deals should follow the school calendar: make sure to send some when school begins and when it ends.

# Cluster #2: Low spending score with high Annual Income

In [None]:
#Create the cluster
cluster_2 = df[df['cluster']==2]
cluster_2.head()

In [None]:
#Get some descriptive statistics
cluster_2.describe()

In [None]:
#General Visualisation
cluster_plot(cluster_2)

### Statistical analysis between two groups

In [None]:
#Define a function for the t-test

def t_test(test_cluster):

    #Select by gender directly instead of recoding 'Gender' in place,
    #which would modify the cluster dataframe and raise a pandas warning

    #Spending scores of the male observations
    sample_a = test_cluster.loc[test_cluster['Gender']=='Male','Spending Score (1-100)']

    #Spending scores of the female observations
    sample_b = test_cluster.loc[test_cluster['Gender']=='Female','Spending Score (1-100)']

    #Execute the independent two-sample t-test
    test = stats.ttest_ind(sample_b,sample_a)
    print(test.pvalue)

    #Print the conclusion at the 5% significance level
    if test.pvalue < 0.05:
        print('Reject the null hypothesis: there is a statistically significant difference in spending score between men and women')
    else:
        print('Fail to reject the null hypothesis: there is no statistically significant difference in spending score between men and women')

t_test(cluster_2)
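
By default `stats.ttest_ind` assumes that both groups have equal variances, while the boxplots suggest that the spread can differ between men and women. A hedged alternative is Welch's t-test (`equal_var=False`), reported together with an effect size. The sketch below uses a hypothetical helper, `compare_groups`, on made-up scores. Cohen's d is computed here with the simple average of the two variances, which is one convention among several.

```python
import numpy as np
from scipy import stats

def compare_groups(sample_a, sample_b, alpha=0.05):
    """Welch's t-test plus Cohen's d for two independent samples."""
    sample_a = np.asarray(sample_a, dtype=float)
    sample_b = np.asarray(sample_b, dtype=float)
    # equal_var=False selects Welch's test, which does not pool the variances
    result = stats.ttest_ind(sample_a, sample_b, equal_var=False)
    # Effect size: mean difference divided by the root of the average variance
    pooled_sd = np.sqrt((sample_a.var(ddof=1) + sample_b.var(ddof=1)) / 2)
    cohen_d = (sample_a.mean() - sample_b.mean()) / pooled_sd
    return {
        't': float(result.statistic),
        'p': float(result.pvalue),
        'cohen_d': float(cohen_d),
        'reject_h0': bool(result.pvalue < alpha),
    }

# Made-up spending scores (hypothetical, not taken from the dataset)
women_scores = [22, 30, 18, 27, 35, 24, 29, 31]
men_scores = [10, 14, 9, 16, 12, 11]
outcome = compare_groups(women_scores, men_scores)
print(outcome)
```

On the real data, the two inputs would be the spending scores of women and men within one cluster. With groups this small, the effect size is often more informative than the p-value alone.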

## Cluster #2 Final Takeaways

**Key observations:**
* The mean age is 41.
* The average income is $88,200.
* The spending score is low, with an average of 17.11, pulled down mostly by men, who score quite low in this cluster.
* Women spend more than men in this cluster. The t-test supports this.

**Actions:**
We see a shift in customer persona in this cluster. Women have a higher income and a higher spending score than men. The goal here is to move these customers up toward cluster #3, whose customers have the same annual income but a higher spending score.
From what we saw, the company should concentrate on women, sending them specific deals or more advertising to move them up in the plot.

# Cluster #3: High spending score with high Annual Income

In [None]:
#Create cluster #3
cluster_3 = df[df['cluster']==3]
cluster_3.head()

In [None]:
# Descriptive Analysis
cluster_3.describe()

In [None]:
#General Visualisation
cluster_plot(cluster_3)

In [None]:
# Statistical T-Test
t_test(cluster_3)

## Cluster #3 Final Takeaways

**Key observations:**
* The average age is 32, lower than in cluster #2.
* The average spending score is 82, which is good.
* On average, the income is around $86,538.
* There is no statistically significant difference between the spending scores of men and women.

**Actions:**
This cluster is a priority. We want to keep these customers. The company should offer them a gift from time to time. Although they are already loyal customers without any action required, the company has to maintain a good relationship with them. Advertising sent to them should always include cross-sell products that are frequently bought together.


# Cluster #4: Average Annual Income and Average Spending Score

In [None]:
#Create the cluster dataframe
cluster_4 = df[df['cluster']==4]
cluster_4.head()

In [None]:
cluster_4.describe()

In [None]:
#Plot some visualisation
cluster_plot(cluster_4)

In [None]:
#Execute the statistical test
t_test(cluster_4)

## Cluster #4 Final Takeaways

**Key observations:**
* This is the most populated cluster, with 81 customers.
* The average age is 42, ranging from 18 to 70.
* The average annual income is $55,290 and the average spending score is 49.
* Women predominate over men.
* There is no statistically significant difference in spending score between men and women.

**Actions:**
This cluster represents typical customers who vary in age, annual income and spending score. The obvious goal is to move them up, but not necessarily toward cluster #3.
We might get them to spend more by sending ads for products that people in the same cluster buy frequently.

# CONCLUSION

I hope you liked this Kernel. If so, please UPVOTE, and feel free to leave a comment if you think something should be done differently.

THANK YOU !
