___
<h1><center>  Clustering</center></h1>

___

## Customer Segmentation of a Wholesale Distributor

### Description :

The objective is to split the customers of a wholesale distributor into groups that are as homogeneous as possible internally but as different as possible from each other, so that a different targeted action can be carried out for each group.

We will use the *Wholesale customers* dataset, which can be downloaded from the University of California, Irvine (**URL:** https://archive.ics.uci.edu/ml/datasets/Wholesale+customers).

### Dataset Description:

The dataset has **8 descriptive variables X**.

The total number of samples is 440 clients.

**Independent variables X:**


1. FRESH: annual expenditure (CU) on fresh products (Continuous);
1. MILK: annual expenditure (CU) on dairy products (Continuous);
1. GROCERY: annual expenditure (CU) on grocery products (Continuous);
1. FROZEN: annual expenditure (CU) on frozen products (Continuous);
1. DETERGENTS_PAPER: annual expenditure (CU) on detergents and paper products (Continuous);
1. DELICATESSEN: annual expenditure (CU) on delicatessen products (Continuous);
1. CHANNEL: customer channel - Horeca (Hotel / Restaurant / Café) or Retail (Nominal);
1. REGION: customer region - Lisbon, Oporto or Other (Nominal).

**More details:**

There are two categorical or nominal variables, "REGION" and "CHANNEL".

REGION Frequency
* Lisbon 77
* Oporto 47
* Other Region 316

CHANNEL Frequency
* Horeca 298
* Retail 142

In [None]:
from IPython.display import display, HTML
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from scipy.stats import boxcox, probplot, norm, shapiro
from sklearn.preprocessing import PowerTransformer, MinMaxScaler
from sklearn.cluster import KMeans
import os
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Helper to check normality: one Q-Q plot per column plus a Shapiro-Wilk test.
# Reusable in other clustering problems.

def comprueba_normalidad(df, return_type='axes', title='Normality check'):
    '''
    Draw a normal Q-Q plot for each column of df and return a DataFrame
    with the Shapiro-Wilk test statistic and p-value of each column.
    '''
    fig_tot = len(df.columns)
    fig_por_fila = 3    # subplots per row (plt.subplot needs an integer)
    tamanio_fig = 4.    # size of each subplot
    num_filas = int(np.ceil(fig_tot / fig_por_fila))
    plt.figure(figsize=(fig_por_fila*tamanio_fig+5, num_filas*tamanio_fig+2))
    shapiro_test = {}
    for i, col in enumerate(df.columns):
        ax = plt.subplot(num_filas, fig_por_fila, i+1)
        probplot(x=df[col], dist=norm, plot=ax)
        plt.title(col)
        # shapiro returns (test statistic, p-value)
        shapiro_test[col] = list(shapiro(df[col]))
    plt.suptitle(title)
    plt.show()
    shapiro_test = pd.DataFrame(shapiro_test, index=['Test Statistic', 'p-value']).transpose()
    return shapiro_test

<h1><center> First Step </center></h1>

In [None]:
os.listdir()

In [None]:
XY = pd.read_csv('../input/uci-wholesale-customers-data/Wholesale customers data.csv')

In [None]:
XY.head(2)

In [None]:
XY.describe()

In [None]:
XY.info()

In [None]:
XY.isnull().sum()

In [None]:
# Map the numeric codes to readable labels
XY['Channel'] = XY['Channel'].map({1:'Horeca', 2:'Retail'})
XY['Region'] = XY['Region'].map({3:'Other Region', 2:'Oporto', 1: 'Lisbon'})

<h1><center> GRAPHICS </center></h1>

I store only the numeric variables in **XY_cuants**, since I am going to plot them and apply some transformations to them.

In [None]:
XY_cuants = XY[['Fresh','Milk','Grocery','Frozen','Detergents_Paper','Delicassen']].copy()

In [None]:
# Standardize each numeric variable (z-score) so that the box plot is easier to read.
# Try plotting without this step to see how the graph changes.
XY_normalizado = (XY_cuants - XY_cuants.mean()) / XY_cuants.std()

In [None]:
plt.figure(figsize=(14,6))
ax = sns.boxplot(data=XY_normalizado)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
plt.title('Box plots of the independent variables X')
plt.ylabel('Normalized value of the variable')
_ = plt.xlabel('Variable name')

In [None]:
# Same box plot on the raw numeric values, for comparison.
# In this case there is not much difference, but it is always a good idea to try it.
plt.figure(figsize=(14,6))
ax = sns.boxplot(data=XY_cuants)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
plt.title('Box plots of the independent variables X')
plt.ylabel('Raw value of the variable')
_ = plt.xlabel('Variable name')

In [None]:
## Representation of the distributions of the variables using histograms.

In [None]:
plt.figure(figsize=(18,20))
for i, column in enumerate(XY_cuants.columns):
    plt.subplot(5, 5, i+1)
    # histplot replaces the deprecated distplot
    sns.histplot(XY_cuants[column], bins=30, kde=True, stat='density')
    plt.title('Distribution of {}'.format(column))
plt.show()

<h1><center> Representation of the correlation matrix </center></h1>

In [None]:
# Pearson correlations are computed on the numeric variables only
matriz_correlaciones = XY_cuants.corr(method='pearson')
n_ticks = len(matriz_correlaciones.columns)
plt.figure( figsize=(9, 9) )
plt.xticks(range(n_ticks), matriz_correlaciones.columns, rotation='vertical')
plt.yticks(range(n_ticks), matriz_correlaciones.columns)
plt.colorbar(plt.imshow(matriz_correlaciones, interpolation='nearest', 
                            vmin=-1., vmax=1., 
                            cmap=plt.get_cmap('Blues')))
_ = plt.title('Pearson correlation matrix')

<h1><center> Data transformation to meet the assumptions </center></h1>

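Since we are going to apply the K-means algorithm later, the data should meet some assumptions.

* K-means works best when the variables are roughly symmetric and bell-shaped, close to a **normal distribution**, because it groups points around centroids using Euclidean distance.
* It is also very sensitive to **outliers**, since a few extreme values can pull a centroid away from the bulk of its group.

Therefore, we transform the variables so that they approximately follow a normal distribution, and we treat the outliers.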
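
To illustrate the sensitivity to outliers, here is a minimal sketch on made-up one-dimensional values (they are not taken from the dataset). With a single cluster, the K-means centroid is simply the mean of the points, so one extreme value is enough to drag it away.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative "spending" values, invented for this example.
clean = np.array([[10.0], [11.0], [12.0], [13.0], [14.0]])
with_outlier = np.vstack([clean, [[200.0]]])

# With one cluster, the fitted centroid equals the mean of the data.
centroid_clean = KMeans(n_clusters=1, n_init=10, random_state=0).fit(clean).cluster_centers_[0, 0]
centroid_outlier = KMeans(n_clusters=1, n_init=10, random_state=0).fit(with_outlier).cluster_centers_[0, 0]

print(centroid_clean)    # centroid of the five clean points
print(centroid_outlier)  # centroid pulled towards the extreme value
```

The second centroid no longer represents the five typical customers, which is why the outliers are treated before clustering.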

# Data normalization:

Variable normalization is the process in which a variable is transformed to follow a normal or Gaussian distribution.

In general, we only want to normalize the data when we use a machine learning algorithm or a statistical technique that assumes, or works better with, approximately Gaussian data. Examples are Student's t-tests, ANOVA, linear regression, logistic regression, linear discriminant analysis (LDA), k-means, etc.

Among the ways to transform a variable to normal are methods such as the Box-Cox transformation or the Yeo-Johnson method.
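
As a rough comparison of the two methods, the sketch below applies both to synthetic right-skewed data (a log-normal sample, assumed here only for illustration). Box-Cox only accepts strictly positive values, while Yeo-Johnson also accepts zeros and negatives. The names `x_demo`, `skew_before`, etc. are hypothetical.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Synthetic, strictly positive, right-skewed data standing in for spending.
x_demo = rng.lognormal(mean=3.0, sigma=1.0, size=(500, 1))

# Box-Cox requires every value to be strictly positive.
x_boxcox = PowerTransformer(method='box-cox').fit_transform(x_demo)
# Yeo-Johnson has no such restriction.
x_yeojohnson = PowerTransformer(method='yeo-johnson').fit_transform(x_demo)

skew_before = skew(x_demo.ravel())
skew_boxcox = skew(x_boxcox.ravel())
skew_yeojohnson = skew(x_yeojohnson.ravel())
print(skew_before, skew_boxcox, skew_yeojohnson)
```

A skewness close to zero after the transform suggests a more symmetric distribution. Since the spending variables in this dataset are positive, Box-Cox is applicable here.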

The following graphs are <a href='https://es.wikipedia.org/wiki/Gr%C3%A1fico_Q-Q'>Q-Q plots</a>, which compare the quantiles of two distributions. Here, each variable is compared with a normal distribution: if the variable is normally distributed, the points fall close to the red line.
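
The same idea can be checked numerically. As a sketch on synthetic samples (not the dataset), `probplot` called without `plot=` returns the Q-Q points and a least-squares fit whose correlation coefficient `r` measures how close the points are to a straight line.

```python
import numpy as np
from scipy.stats import probplot, norm

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=500)        # should lie on the line
skewed_sample = rng.exponential(size=500)   # should bend away from it

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r)).
_, (_, _, r_normal) = probplot(normal_sample, dist=norm)
_, (_, _, r_skewed) = probplot(skewed_sample, dist=norm)

print(r_normal, r_skewed)
```

The closer `r` is to 1, the better the points follow the reference line.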

In [None]:
shapiro_test = comprueba_normalidad(XY_cuants, title='Normality of the original variables')

In [None]:
shapiro_test

Normality is rejected for every variable: none of them is normally distributed.

**Shapiro-Wilk test:** if the p-value is less than the significance level $\alpha$, we reject the hypothesis that the data come from a normal distribution.
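
The decision rule can be sketched as follows, assuming a significance level of 0.05 and a synthetic right-skewed sample (neither is taken from the original analysis).

```python
import numpy as np
from scipy.stats import shapiro

alpha = 0.05  # assumed significance level
rng = np.random.default_rng(0)
# Synthetic log-normal sample with the same size as the dataset (440).
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=440)

statistic, p_value = shapiro(skewed)
reject_normality = p_value < alpha
print(f'W = {statistic:.3f}, p-value = {p_value:.3g}, reject normality: {reject_normality}')
```

The statistic W lies between 0 and 1, and values close to 1 are compatible with normality.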

Now I transform the variables with a Box-Cox transformation.

In [None]:
bc = PowerTransformer(method='box-cox')
X_cuants_boxcox = bc.fit_transform(XY_cuants)
X_cuants_boxcox = pd.DataFrame(X_cuants_boxcox, columns=XY_cuants.columns)

In [None]:
shapiro_test = comprueba_normalidad(X_cuants_boxcox, title='Normality of the transformed variables')

In [None]:
# Looks perfect!

In [None]:
shapiro_test

The Shapiro-Wilk statistic is now very high (close to 1) for all variables, so we continue with this transformation.

In [None]:
plt.figure(figsize=(18,20))
for i, column in enumerate(X_cuants_boxcox.columns):
    plt.subplot(4, 4, i+1)
    # histplot replaces the deprecated distplot
    sns.histplot(X_cuants_boxcox[column], bins=30, kde=True, stat='density')
    plt.title('Distribution of {}'.format(column))
plt.show()


Now, the distributions look Gaussian.

<h1><center> Outliers: </center></h1>


The other preprocessing step we need is to treat the outliers, or atypical values.
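
A common way to do this is the 1.5 × IQR rule: values beyond the limits are capped at the nearest limit instead of being dropped. A minimal sketch on made-up values (the array below is invented for illustration):

```python
import numpy as np

# Illustrative values with one obvious outlier.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Cap values outside [lower_limit, upper_limit] at the nearest limit.
capped = np.clip(values, lower_limit, upper_limit)
print(lower_limit, upper_limit)
print(capped)
```

Capping keeps all 440 customers in the analysis, which matters here because every customer has to be assigned to a group.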

In [None]:
plt.figure(figsize=(15,7))
ax = sns.boxplot(data=X_cuants_boxcox)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
plt.title('Box plots of the independent variables X')
plt.ylabel('Normalized value of the variable')
_ = plt.xlabel('Variable name')

In [None]:
# Cap the outliers of each variable using the 1.5 * IQR rule
for k in list(X_cuants_boxcox.columns):
    q1, q3 = np.percentile(X_cuants_boxcox[k], [25, 75])
    IQR = q3 - q1

    limite_superior = q3 + 1.5*IQR
    limite_inferior = q1 - 1.5*IQR

    # Values beyond a limit are replaced by that limit
    X_cuants_boxcox[k] = X_cuants_boxcox[k].clip(lower=limite_inferior, upper=limite_superior)

In [None]:
plt.figure(figsize=(15,7))
ax = sns.boxplot(data=X_cuants_boxcox)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
plt.title('Box plots of the independent variables X')
plt.ylabel('Normalized value of the variable')
_ = plt.xlabel('Variable name')

In [None]:
# No outliers now. OK, next step.

## I create dummies of categorical variables
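
As a small sketch of what dummy encoding does, the toy frame below (invented values) is encoded with `drop_first=True`, which drops the first category of each variable because it is implied by the others.

```python
import pandas as pd

# Toy frame with the two categorical variables and one numeric column.
toy = pd.DataFrame({
    'Channel': ['Horeca', 'Retail', 'Horeca'],
    'Region':  ['Lisbon', 'Oporto', 'Other Region'],
    'Fresh':   [1.0, 2.0, 3.0],
})

# One indicator column per category, except the first one of each variable.
toy_dummies = pd.get_dummies(toy, columns=['Channel', 'Region'], drop_first=True)
print(list(toy_dummies.columns))
```

A customer with all indicators of a variable equal to zero belongs to the dropped category (here `Horeca` and `Lisbon`).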

In [None]:
# df joins the two original categorical variables with the transformed numeric variables
df  =  pd.concat([XY[['Channel','Region']],X_cuants_boxcox],axis=1)
df[:3]

In [None]:
df = pd.get_dummies(df,columns=['Channel','Region'],drop_first=True)
df[:3]


## Pre-scaling the data:

We should scale the data when we use methods based on distance measures, such as SVMs, k-NN or K-means. In these algorithms, a change of one unit in a numeric variable is given the same importance regardless of the variable.

For example, consider prices in different currencies. One dollar is worth much more than one yen, so if two products are priced in different currencies, the algorithm gives an increase of one yen the same importance as an increase of one dollar.

In this case, since all spending is expressed in the same currency, scaling would not be strictly necessary. However, after scaling we compare positions within the range of each variable, that is, whether a customer is among those who spend more or less on a given product.
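
Min-max scaling maps each column to the range [0, 1] with (x - min) / (max - min). The sketch below checks this formula against `MinMaxScaler` on a small made-up matrix (`X_toy` is a hypothetical name).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two columns on very different scales (illustrative values).
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [4.0, 500.0]])

# Manual min-max scaling, column by column.
manual = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))
scaled_toy = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_toy)

print(scaled_toy)
print(np.allclose(manual, scaled_toy))
```

After scaling, both columns span the same range, so neither dominates the Euclidean distance.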

In [None]:
scaler = MinMaxScaler(feature_range=(0, 1))
X_escalado = scaler.fit_transform(df)
X_escalado = pd.DataFrame(X_escalado,columns=df.columns)
X_escalado.head()

<h1><center> Segmentation using K-means clustering: </center></h1>

In [None]:
# The next cells look for the best number of clusters for our dataset.

In [None]:
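cluster_range = range(1,20)
cluster_wss=[]  # within-cluster sum of squares (inertia) for each number of clusters
for cluster in cluster_range:
    # A fixed random_state and an explicit n_init make the curve reproducible
    model = KMeans(n_clusters=cluster, n_init=10, random_state=0)
    model.fit(X_escalado)
    cluster_wss.append(model.inertia_)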

In [None]:
plt.figure(figsize=[10,6])
plt.title('WSS curve to find the optimal number of clusters or groups')
plt.xlabel('# groups')
plt.ylabel('WSS')
plt.plot(list(cluster_range),cluster_wss,marker='o')
plt.show()


The optimal number of groups is usually taken at the point where the curve bends (the "elbow"). In this case it would be around 4 to 6 groups.

We will use 6 groups, but in practice one usually tries several values and checks whether the resulting groups make sense from a business point of view, as I discuss later.
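
To make the bend easier to see, here is a sketch on synthetic data with three well-separated groups (generated with `make_blobs`, so not the wholesale data). The drop in WSS is large until the true number of groups is reached and small afterwards.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three clearly separated groups.
X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0, random_state=0)

wss_demo = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    wss_demo.append(km.inertia_)

# drops[i] is the reduction in WSS when going from i+1 to i+2 clusters.
drops = -np.diff(wss_demo)
print(np.round(drops, 1))
```

On real data such as this dataset the bend is usually much less sharp, which is why a range like 4 to 6 is a reasonable reading of the curve.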

In [None]:
model = KMeans(n_clusters=6,random_state=0)
model.fit(X_escalado)


### Assigning each customer to a cluster

I create a dataframe with all the original variables plus a new column with the cluster assigned to each customer:

In [None]:
#Original Dataset with the predictions
df_total = XY.copy()
df_total['cluster']=model.predict(X_escalado)
df_total[:2]

In [None]:
df_total.cluster.value_counts().plot(kind='bar', figsize=(10,4))
plt.title('Number of customers per group')
plt.xlabel('Group')
_ = plt.ylabel('Count')

In [None]:
# Here we can see how many customers fall in each cluster

# Now we have to characterize each group to uncover the information hidden in our DataFrame and add value to the analysis.

I also build a dataframe with the mean of each variable in each group. These means represent each of the groups.

This is essential, since the objective of this problem is to take different actions for each group. To do so, we need to know what each group is like.

In [None]:
# numeric_only=True skips the text columns Channel and Region
descriptivos_grupos = df_total.groupby(['cluster'],as_index=False).mean(numeric_only=True)
descriptivos_grupos

# I explain the groups using the means of each variable per group:

In [None]:
df_total.groupby('cluster').mean(numeric_only=True).plot(kind='bar', figsize=(15,7))
plt.title('Mean spending per product in each cluster')
plt.xlabel('Cluster number')
_ = plt.ylabel('Mean spending')

Finally, for each of the groups, I obtain the average expenditure on each product.

As a side note, behavior could also be analyzed by splitting each group by the two categorical variables and comparing the average expenses. This would give segmentations by channel, geography and customer characteristics.
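
A minimal sketch of that idea, using a toy frame with invented values and a hypothetical name `toy_total` instead of `df_total`:

```python
import pandas as pd

# Toy data (illustrative values, not taken from the dataset).
toy_total = pd.DataFrame({
    'cluster': [0, 0, 0, 1, 1, 1],
    'Channel': ['Horeca', 'Horeca', 'Retail', 'Retail', 'Retail', 'Horeca'],
    'Fresh':   [100, 200, 300, 400, 600, 50],
})

# Mean spending per cluster and channel; unstack puts the channels in columns.
mean_by_channel = (
    toy_total.groupby(['cluster', 'Channel'])['Fresh']
    .mean()
    .unstack('Channel')
)
print(mean_by_channel)
```

The same pattern would apply with `Region` in place of `Channel`, or with both variables in the `groupby` list.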

In [None]:
df_total[:2]

## That was all. Any questions or suggestions are welcome.