# Grade: /100 pts
# Assignment 8: Customer Segmentation with Clustering

In this assignment, you will solve a classic problem in quantitative marketing: customer segmentation. A properly segmented customer database is essential for designing marketing campaigns, as it allows companies to define value-centric actions targeted at customers with different profiles. On this occasion, you will analyze the customers of a supermarket chain, *Fruver*.

Assume you own a consulting company that is in charge of this project. You will analyze two different strategies carried out by your employees and, at the end, decide which one produces the better customer segmentation.

The information is provided in the document `data_customers.csv` which has the following columns:

- **ID:** Customer identifier (it has no predictive power).
- **Education:** Education status of the customer.
- **Income:** Customer's annual household income.
- **Kidhome:** Number of children in customer's family.
- **Teenhome:** Number of teenagers in customer's family.
- **Recency:** Number of days since the last purchase in the supermarket.
- **NumWebVisitsMonth:** Number of visits to the supermarket web page in the last month.
- **Complain:** Whether the customer has filed a complaint.
- **Living_Status:** Whether the customer lives alone or with a partner.
- **Total_Promos_accept:** Total number of promotions accepted.
- **Age:** Customer's current age.
- **Total_Consumption:** Total amount spent in the supermarket.
- **Total_Num_Purchases:** Total number of purchases.
- **Seniority:** Number of months the customer has been enrolled with the supermarket.

### Follow These Steps before submitting
Once you are finished, be sure to complete the following steps.

1.  Restart your kernel by clicking 'Kernel' > 'Restart & Run All'.

2.  Fix any errors which result from this.

3.  Repeat steps 1. and 2. until your notebook runs without errors.

4.  Submit your completed notebook to OWL by the deadline. 

5.  Your submission document should be saved in the form: `LastName_FirstName_Assignment5.ipynb`


In [None]:
# pip install yellowbrick

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering, KMeans
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
#from yellowbrick.cluster.elbow import kelbow_visualizer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm
from sklearn.decomposition import PCA
import datetime
import seaborn as sns
import itertools
%matplotlib inline

## Question 1: Loading the data and Preprocessing (5 pts)


1. Import the data and keep only the customers younger than 100 years. Report the new data shape. Are there any null values?
2. Your employees took these numeric predictors: `Income`, `Recency`, `NumWebVisitsMonth`, `Age`, `Total_Consumption`, `Total_Num_Purchases`, and `Seniority`. Therefore, create the data frame `data_customers_num` that contains only these variables. Print the shape of this new data frame. Notice that your employees used neither `Kidhome` nor `Teenhome`: since their range is {0, 1, 2}, it is better to treat them as categorical variables.
3. Present the descriptive statistics of the numeric variables. What can you say about the variables you have? Why should you normalize the data? Normalize the data and create the new data frame `df_num_z`. (Do not forget to reuse the column names of `data_customers_num` as the column names of `df_num_z`.)
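
As a minimal sketch of step 3, the snippet below standardizes a small toy data frame and restores the column names, since `StandardScaler.fit_transform` returns a NumPy array rather than a data frame. The values and the names `toy` and `toy_z` are hypothetical and are not taken from `data_customers.csv`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values) whose columns live on very different scales.
toy = pd.DataFrame({
    "Income": [20000.0, 50000.0, 80000.0, 110000.0],
    "Recency": [5.0, 30.0, 60.0, 95.0],
})

scaler = StandardScaler()
# fit_transform returns a NumPy array, so column names and index are restored explicitly.
toy_z = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns, index=toy.index)

# After standardization each column has mean 0 and population standard deviation 1.
print(toy_z.mean())
print(toy_z.std(ddof=0))
```

Without this step, a distance-based method would be dominated by the column with the largest numeric range (here `Income`).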

In [14]:
# 1pt
# Read the data
data_customers = pd.read_csv("data_customers.csv")
display(data_customers.head())
# New data
data_customers_sample = data_customers.loc[data_customers["Age"]<100]
print(f'The new data shape is:{data_customers_sample.shape}')
print(f'The number of null values is :{data_customers_sample.isna().sum().sum()}')

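| | ID | Education | Income | Kidhome | Teenhome | Recency | NumWebVisitsMonth | Complain | Living_Status | Total_Promos_accep | Age | Total_Consumption | Total_Num_Purchases | Seniority |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | Graduate | 58138.0 | 0 | 0 | 58 | 7 | 0 | Alone | 0 | 57 | 1617 | 25 | 33 |
| 1 | 2174 | Graduate | 46344.0 | 1 | 1 | 38 | 5 | 0 | Alone | 0 | 60 | 27 | 6 | 5 |
| 2 | 4141 | Graduate | 71613.0 | 0 | 0 | 26 | 4 | 0 | Partner | 0 | 49 | 776 | 21 | 17 |
| 3 | 6182 | Graduate | 26646.0 | 1 | 0 | 26 | 6 | 0 | Partner | 0 | 30 | 53 | 8 | 3 |
| 4 | 5324 | Postgraduate | 58293.0 | 1 | 0 | 94 | 5 | 0 | Partner | 0 | 33 | 422 | 19 | 12 |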


The new data shape is:(2205, 14)
The number of null values is :0


In [15]:
# 1pt
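# numeric predictors
# NOTE: this selects from `data_customers`, not the age-filtered `data_customers_sample`,
# so customers with Age >= 100 are still included (see the shape and the max Age below).
data_customers_num = data_customers.filter(['Income', 'Recency', 'NumWebVisitsMonth', 'Age', 'Total_Consumption', 'Total_Num_Purchases', 'Seniority'], axis = 1)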
print(f'The new shape is {data_customers_num.shape}')

The new shape is (2208, 7)


In [16]:
# 0.5 pts
# Descriptive statistics
data_customers_num.describe()

| | Income | Recency | NumWebVisitsMonth | Age | Total_Consumption | Total_Num_Purchases | Seniority |
|---|---|---|---|---|---|---|---|
| count | 2208.0 | 2208.0 | 2208.0 | 2208.0 | 2208.0 | 2208.0 | 2208.0 |
| mean | 51633.638134 | 49.01404 | 5.334239 | 45.192935 | 606.875906 | 14.879076 | 18.137228 |
| std | 20713.37653 | 28.938638 | 2.413837 | 11.991913 | 602.090272 | 7.615973 | 7.668229 |
| min | 1730.0 | 0.0 | 0.0 | 18.0 | 5.0 | 0.0 | 1.0 |
| 25% | 35196.0 | 24.0 | 3.0 | 37.0 | 69.0 | 8.0 | 13.0 |
| 50% | 51301.0 | 49.0 | 6.0 | 44.0 | 397.0 | 15.0 | 18.0 |
| 75% | 68289.75 | 74.0 | 7.0 | 55.0 | 1047.25 | 21.0 | 24.0 |
| max | 113734.0 | 99.0 | 20.0 | 121.0 | 2525.0 | 43.0 | 36.0 |


In [18]:
# 1 pt
# Standardization
normalizer = StandardScaler()
df_num_z = normalizer.fit_transform(data_customers_num)

In [21]:
# 0.5 pts
# Descriptive statistics for normalized data
df_num_z

array([[ 0.31408859,  0.31058807,  0.69024456, ...,  1.67807547,
         1.3292086 ,  1.93866657],
       [-0.2554309 , -0.38068602, -0.13849932, ..., -0.96332276,
        -1.16611333, -1.71359055],
       [ 0.96478175, -0.79545048, -0.55287126, ...,  0.28095854,
         0.80387767, -0.1483375 ],
       ...,
       [ 0.25821831,  1.45119033,  0.27587262, ...,  1.05344293,
         0.5412122 , -0.80052627],
       [ 0.8504336 , -1.41759716, -0.9672432 , ...,  0.39226274,
         1.06654313, -0.80052627],
       [ 0.05965429, -0.31155861,  0.69024456, ..., -0.72244053,
        -0.50944967,  1.15604004]])

**Written Answer:** Why should you normalize the data?

**ANSWER HERE (1 pt)**:

____________

## Question 2: First Strategy (25 pts)

To solve the project, your employee D decided to use the following strategy: 

1. First, perform dimension reduction with PCA using 2 components. Then look for the best number of clusters (between 3 and 5) using `Hierarchical clustering` with `affinity = 'cosine'` and `linkage = 'complete'`, together with Silhouette analysis.
2. Graph the scatter plot of the PCA-transformed data differentiated by cluster.
3. Make the scatterplot of `Total_Consumption` vs `Income` by cluster.
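
The first step of this strategy can be sketched as follows. This is only an illustration on synthetic data from `make_blobs`, not on the customer file, and the names `X_pca`, `scores`, and `best_k` are hypothetical. Scoring the silhouette with the cosine distance (to match the clustering distance) is an assumption, and the keyword lookup is there because the distance argument of `AgglomerativeClustering` is called `affinity` in older scikit-learn releases and `metric` in newer ones.

```python
import inspect

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized customer data (not the real file).
X, _ = make_blobs(n_samples=300, n_features=7, centers=4, random_state=3)
X_z = StandardScaler().fit_transform(X)

# Reduce to 2 principal components before clustering.
X_pca = PCA(n_components=2).fit_transform(X_z)

# Pick whichever keyword the installed scikit-learn version accepts.
params = inspect.signature(AgglomerativeClustering.__init__).parameters
dist_kw = "metric" if "metric" in params else "affinity"

scores = {}
for k in [3, 4, 5]:
    agg = AgglomerativeClustering(n_clusters=k, linkage="complete", **{dist_kw: "cosine"})
    labels = agg.fit_predict(X_pca)
    # Assumption: evaluate the silhouette with the same cosine distance.
    scores[k] = silhouette_score(X_pca, labels, metric="cosine")
    print(f"k={k}: average silhouette = {scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print(f"Best k by average silhouette: {best_k}")
```

The number of clusters with the highest average silhouette is the natural candidate, although the per-cluster silhouette plots are worth checking before settling on it.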

In [24]:
# 5 pts (PCA)
pca = PCA(n_components=2)
pca.fit(df_num_z)
df_num_z2 = pca.transform(df_num_z)


In [None]:
# 8 pts 
# Silhouette Analysis
# Range for the number of clusters
range_n_clusters = [3,4,5]

for i in range_n_clusters:
    Agg = AgglomerativeClustering(n_clusters=i,       # Number of clusters
                                  affinity='cosine',  # Type of distance (this argument is named `metric` in scikit-learn >= 1.2)
                                  linkage='complete'  # Type of linkage
                                  )
    
    
    

**Written Answer:** How many clusters did D select based on the previous results?

**ANSWER HERE (2pts):** 

In [None]:
# 5 pts second point


In [None]:
# 5 pts
# Scatterplot


## Question 3: Second Strategy (55 pts)

### 3.1 (30 pts)

Employee J selected a different approach:
1. First, J computed the average silhouette score for every combination of three predictors out of the seven available, and selected the combination with the maximum value.

You are going to replicate these results, generating a data frame `Results_df` whose columns are `Subset_Predictors`, `AVG_S_3`, `AVG_S_4`, `AVG_S_5`, where:
- `Subset_Predictors` contains the three predictors that have been taken into consideration, e.g., `[Age, Income, Total_Consumption]`.
- `AVG_S_i` is the average silhouette score using $k = i$ clusters, when running KMeans on the `Subset_Predictors` variables.

J did not forget that, as a team, you always use `random_state = 3`. J also remembered that the `itertools.combinations` function may be useful.
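
The layout of `Results_df` can be sketched as follows, again on synthetic data with seven hypothetical predictors `f0` to `f6` rather than the customer variables. Setting `n_init=10` is an assumption made here to keep the behaviour the same across scikit-learn versions; the assignment only fixes `random_state = 3`.

```python
import itertools

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with seven hypothetical predictors f0..f6.
X, _ = make_blobs(n_samples=200, n_features=7, centers=4, random_state=3)
cols = [f"f{i}" for i in range(7)]
df_z = pd.DataFrame(StandardScaler().fit_transform(X), columns=cols)

score_cols = ["AVG_S_3", "AVG_S_4", "AVG_S_5"]
rows = []
# One row per combination of 3 predictors out of 7.
for subset in itertools.combinations(cols, 3):
    data = df_z[list(subset)]
    row = {"Subset_Predictors": list(subset)}
    for k in [3, 4, 5]:
        labels = KMeans(n_clusters=k, n_init=10, random_state=3).fit_predict(data)
        row[f"AVG_S_{k}"] = silhouette_score(data, labels)
    rows.append(row)

Results_df = pd.DataFrame(rows, columns=["Subset_Predictors"] + score_cols)

# Row (subset) and column (number of clusters) holding the overall maximum.
best_row = Results_df[score_cols].max(axis=1).idxmax()
best_col = Results_df.loc[best_row, score_cols].astype(float).idxmax()
print(Results_df.shape)
print(Results_df.loc[best_row, "Subset_Predictors"], best_col)
```

Collecting the rows in a list and building the data frame once at the end tends to be simpler than appending to a data frame inside the loop.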

In [None]:
# 9 pts 
# Generate the data frame of results
# Initialize the DataFrame and set the name of its columns


# Fill Results_df according to the instructions



In [None]:
# 15 pts 
# Print the first 5 rows of the Results_df DataFrame


In [None]:
# 1 pt
# Report the shape of the Results_df DataFrame


In [None]:
# 2 pts
# Find the maximum avg silhouette_score

print(f'The maximum average silhouette score using a subset of 3 predictors is :{}')

**Written Answer:** What is the subset of 3 predictors that generated the maximum average silhouette score? How many clusters did J decide to use?

In [None]:
# 1 pt
# Find the index when we find the maximum value


**ANSWER HERE (2pts):**

### 3.2 Verification (15 pts)

J wants to verify this is indeed the best number of clusters. 
1. Create an elbow plot for between 3 and 10 clusters on the selected subset of variables, using the `calinski_harabasz` metric. According to this metric, what is the best number of clusters? What is the meaning of this metric?
2. Secondly, perform the silhouette analysis for the same cluster range, using the `SilhouetteVisualizer` function as shown in `LabWeek10` to generate the plot. Do not forget to print the average silhouette scores! **Does the previous result agree with the one given by the silhouette analysis?**
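
As a rough sketch of the comparison in these two steps, the snippet below computes both scores directly with scikit-learn on synthetic data. It swaps in `calinski_harabasz_score` and `silhouette_score` for the yellowbrick visualizers named above, so it prints numbers instead of drawing the plots. The Calinski-Harabasz score is the ratio of between-cluster dispersion to within-cluster dispersion, so higher values indicate better separated clusters. As before, `n_init=10` and the variable names are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three selected predictors (not the real data).
X, _ = make_blobs(n_samples=300, n_features=3, centers=4, random_state=3)
X_z = StandardScaler().fit_transform(X)

ch_scores, sil_scores = {}, {}
for k in range(3, 11):  # 3 to 10 clusters inclusive
    labels = KMeans(n_clusters=k, n_init=10, random_state=3).fit_predict(X_z)
    ch_scores[k] = calinski_harabasz_score(X_z, labels)
    sil_scores[k] = silhouette_score(X_z, labels)
    print(f"k={k}: CH = {ch_scores[k]:.1f}, avg silhouette = {sil_scores[k]:.3f}")

# Both metrics are "higher is better", so their argmax values can be compared.
best_k_ch = max(ch_scores, key=ch_scores.get)
best_k_sil = max(sil_scores, key=sil_scores.get)
print(f"Best k by CH: {best_k_ch}, best k by silhouette: {best_k_sil}")
```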


In [None]:
# 5 pts
# Code for creating the elbow plot


**ANSWER HERE (3pts)** 

In [None]:
# 5 pts
# Second point
# Code for creating the silhouette analysis


**ANSWER HERE (2pts):** 

### 3.3  PCA for clustering visualization (5 pts)

Considering the previous selection, J decided to visualize the results with a PCA transformation. J applied PCA with 2 components and created a scatterplot in which each cluster has a different colour. **Note that the clusters must still be calculated over the unrotated data.**
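
One way to read the note above: fit the clustering on the standardized (unrotated) variables, and use PCA only to obtain two coordinates for plotting. The sketch below shows this on synthetic data. The number of clusters, `n_init=10`, and all variable names are assumptions for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the selected, standardized predictors.
X, _ = make_blobs(n_samples=300, n_features=3, centers=4, random_state=3)
X_z = StandardScaler().fit_transform(X)

# Clusters are computed on the unrotated data ...
labels = KMeans(n_clusters=4, n_init=10, random_state=3).fit_predict(X_z)
# ... and PCA is used only to get 2-D coordinates for display.
coords = PCA(n_components=2).fit_transform(X_z)

fig, ax = plt.subplots(figsize=(6, 4))
points = ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="viridis", s=15)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("Clusters shown in PCA space")
ax.legend(*points.legend_elements(), title="Cluster")
```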

### 3.4 Scatterplot (5 pts)

Finally, J also presented the scatterplot of `Total_Consumption` vs `Income` differentiated by cluster.

_________

## Question 5: Deploying the model (15 pts)

### 5.1 Final Decision

**Written Answer:** Now you have to decide which method to use for customer segmentation. Choose one of the two strategies and explain, in no more than one paragraph, why you would choose it.

**ANSWER HERE (4 pts):**

### 5.2 Naming the clusters
Using the selected method, create a table of the per-cluster averages of `Income`, `Total_Consumption`, and `Total_Num_Purchases`. Use the original, non-scaled variables and Pandas' `groupby` function.
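
The shape of such a table can be illustrated with a toy data frame. The values and the `cluster` column name below are hypothetical; in the assignment the labels would come from the selected clustering method.

```python
import pandas as pd

# Toy, non-scaled values with a hypothetical cluster label per customer.
toy = pd.DataFrame({
    "Income": [20000, 30000, 80000, 100000],
    "Total_Consumption": [50, 150, 1200, 1800],
    "Total_Num_Purchases": [4, 6, 20, 24],
    "cluster": [0, 0, 1, 1],
})

# One row per cluster, one column per variable, each cell a cluster average.
cluster_means = (
    toy.groupby("cluster")[["Income", "Total_Consumption", "Total_Num_Purchases"]].mean()
)
print(cluster_means)
```

Reading the rows side by side (for example, low income and low spend versus high income and high spend) is what makes it possible to give each cluster a meaningful name.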

In [None]:
# 2 pts
# Generate the table


**Written Answer:** Name the different clusters and propose a strategy the company could use for each of them.

**ANSWER HERE (9pts):**

