# CLUSTERING PROBLEM

Purpose: The goal is to develop a customer segmentation model to define a credit card company's marketing strategy.

Model Class: *Unsupervised*

Model Type: *Clustering*

Edit Date: 4/8/2020

Cluster Models Include:
- K-Means
- Hierarchical

Resources:
* https://afnan.io/2017-10-31/using-k-means-clustering-in-scikit-learn/
* https://www.kaggle.com/ainslie/credit-card-data-clustering-analysis/data
* https://lab.pwc.com/automation/details/3850
* https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html

Data:
https://www.kaggle.com/ainslie/credit-card-data-clustering-analysis/data

The dataset summarizes the usage behavior of about 9000 active credit card holders over a 6-month period.
The file is at a customer level with 18 behavioral variables.

# DEPENDENCIES

Load the dependencies for model development. Current package requirements include:
* Sklearn
* Pandas
* Numpy
* Scipy
* Matplotlib
* Seaborn

In [None]:
# data management
import pandas as pd
import numpy as np

# visualization
from matplotlib import cm  # colormaps (cm.nipy_spectral) used in the silhouette plots
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

# preprocessing
import sklearn as sk
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, FunctionTransformer

# clusters models
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering, DBSCAN
from sklearn import metrics
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import dendrogram, linkage

# Data

In [None]:
data = pd.read_csv("../input/ccdata/CC GENERAL.csv")

## Basic Data Analysis - Overview

In [None]:
data.shape

In [None]:
data.head(3)

In [None]:
features = data.columns[1:]


1. **CUST_ID** : Identification of Credit Card holder
2. **BALANCE** : Balance amount left in their account to make purchases
3. **BALANCE_FREQUENCY** : How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
4. **PURCHASES** : Amount of purchases made from account
5. **ONEOFF_PURCHASES** : Maximum purchase amount done in one-go
6. **INSTALLMENTS_PURCHASES** : Amount of purchase done in installment
7. **CASH_ADVANCE** : Cash in advance given by the user
8. **PURCHASES_FREQUENCY** : How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
9. **ONEOFF_PURCHASES_FREQUENCY** : How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased)
10. **PURCHASES_INSTALLMENTS_FREQUENCY** : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
11. **CASH_ADVANCE_FREQUENCY** : How frequently the cash in advance is being paid
12. **CASH_ADVANCE_TRX** : Number of transactions made with "Cash in Advance"
13. **PURCHASES_TRX** : Number of purchase transactions made
14. **CREDIT_LIMIT** : Limit of Credit Card for user
15. **PAYMENTS** : Amount of Payment done by user
16. **MINIMUM_PAYMENTS** : Minimum amount of payments made by user
17. **PRC_FULL_PAYMENT** : Percent of full payment paid by user
18. **TENURE** : Tenure of credit card service for user

CUST_ID will not be used as a model variable because, in my view, it carries no information about customer behavior.


In [None]:
data.info()

In [None]:
data[features].describe()

We can see from the table above that the following variables have outliers:
* BALANCE,
* PURCHASES,
* ONEOFF_PURCHASES, 
* INSTALLMENTS_PURCHASES, 
* CASH_ADVANCE, 
* CASH_ADVANCE_TRX, 
* PURCHASES_TRX, 
* CREDIT_LIMIT, 
* PAYMENTS and 
* MINIMUM_PAYMENTS

A data point is considered an outlier if either of the following conditions applies:
1. It lies more than 1.5 times the interquartile range (IQR) above the 3rd quartile or below the 1st quartile.
2. It lies more than 3 or 4 standard deviations from the mean.
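
The two rules above can be sketched on a toy array. This is only a minimal illustration, not the detection code used later in this notebook: the helper names `iqr_outliers` and `zscore_outliers` and the sample values are assumptions made for the example.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

def zscore_outliers(values, n_std=3):
    """Return the values more than n_std standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) > n_std]

# Hypothetical sample: mostly small amounts plus one extreme value
sample = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 500])
print(iqr_outliers(sample))
print(zscore_outliers(sample))
```

Both rules flag only the value 500 here. The IQR rule is based on quartiles, so it is less affected by the extreme value itself than the z-score rule, whose mean and standard deviation are both inflated by it.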


In [None]:
data.nunique()

**Missing values**

In [None]:
data.isna().sum()

## Exploratory Data Analysis (EDA)

### Missing Values

**CREDIT_LIMIT**

In [None]:
print(data[data.CREDIT_LIMIT.isna()].shape[0],' customers')
print("{0:.2f}%".format(100*data[data.CREDIT_LIMIT.isna()].shape[0]/data.shape[0]))

In [None]:
data[data.CREDIT_LIMIT.isna()]

In [None]:
data.CREDIT_LIMIT.describe()

Since one possibility is to fill the missing values with zero, I first check whether there are customers with a credit limit equal to zero.

In [None]:
print('Customers with zero credit limit:' , data[data.CREDIT_LIMIT==0].shape[0])

We see that filling with zero is not a good option.

Taking into account the characteristics of customer 15349, I look for special values in the CREDIT_LIMIT column among customers without purchases but with cash advances.

In [None]:
data_aux = data[(data.PURCHASES_TRX==0)&(data.CASH_ADVANCE_TRX>0)][['CASH_ADVANCE','CASH_ADVANCE_TRX','CREDIT_LIMIT']]
print(data_aux.describe())
data_aux.head()

We observe above that there are no significant differences with respect to the complete data. So, in this case, I decided to fill the missing value with the median.

**MINIMUM_PAYMENTS**

In [None]:
print(data[data.MINIMUM_PAYMENTS.isna()].shape[0],' customers')
print("{0:.2f}%".format(100*data[data.MINIMUM_PAYMENTS.isna()].shape[0]/data.shape[0]))

In [None]:
data[data.MINIMUM_PAYMENTS.isna()].head(7)

When PAYMENTS = 0, the value of MINIMUM_PAYMENTS is always NaN:

In [None]:
data[(data.PAYMENTS==0)].shape[0] == data[(data.PAYMENTS==0)&(data.MINIMUM_PAYMENTS.isna())].shape[0]

When MINIMUM_PAYMENTS is NaN, the value of PRC_FULL_PAYMENT is always zero:

In [None]:
data[(data.MINIMUM_PAYMENTS.isna())&(data.PRC_FULL_PAYMENT==0)].shape[0] == data[data.MINIMUM_PAYMENTS.isna()].shape[0]

In [None]:
data.MINIMUM_PAYMENTS.describe()

I didn't find anything special regarding the missing values of the column, so I will use the median to fill.

**CONCLUSIONS OF MISSING VALUES:** Fill the missing values in CREDIT_LIMIT and MINIMUM_PAYMENTS with the median of the column.
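
As a small illustration of why the median is a reasonable fill value for skewed columns, the sketch below compares the mean and the median on a toy column. The values are hypothetical and are not taken from the dataset.

```python
import numpy as np

# Hypothetical column with one missing value and one extreme value
col = np.array([100.0, 150.0, 200.0, np.nan, 250.0, 20000.0])

mean_fill = np.nanmean(col)      # pulled upwards by the extreme value
median_fill = np.nanmedian(col)  # stays close to a typical value

# Replace only the missing entries with the median
filled = np.where(np.isnan(col), median_fill, col)
print(mean_fill, median_fill)
print(filled)
```

Here the mean of the observed values is 4140 while the median is 200, so a mean fill would place the missing customer far above almost everyone else.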

### Outliers

**z-score method**

In [None]:
def detect_col_outliers(ls_data):
    """Return the values lying more than 4 standard deviations from the mean (z-score rule)."""
    mean = np.mean(ls_data)
    std = np.std(ls_data)

    return [i for i in ls_data if np.abs(i - mean) > 4 * std]

In [None]:
features_outliers = ['BALANCE','PURCHASES','ONEOFF_PURCHASES','INSTALLMENTS_PURCHASES','CASH_ADVANCE','CASH_ADVANCE_TRX','PURCHASES_TRX','CREDIT_LIMIT','PAYMENTS','MINIMUM_PAYMENTS']
for name_col in features_outliers:
    rtdo = detect_col_outliers(data[name_col])
    print('-'*50)
    print(name_col)
    print('# values outlier: ', len(rtdo))
    print('{0:.2f}% of the total data'.format(100*len(rtdo)/data.shape[0]))

**IQR method**

In [None]:
plt.figure(figsize=(15,10))
sns.boxplot(data=data[features])
plt.xticks(rotation=90)

**Columns transformation**

In [None]:
nr_rows = len(features_outliers)
nr_cols = 3

fig, axs = plt.subplots(nr_rows, nr_cols, figsize=(nr_cols*3.5,nr_rows*3))

for r, col in enumerate(features_outliers):
    # histplot(kde=True) is used instead of distplot, which is deprecated since seaborn 0.11
    sns.histplot(data[col], kde=True, ax=axs[r][0]).set_title('Original')
    sns.histplot(np.sqrt(data[col]), kde=True, ax=axs[r][1]).set_title('Square root')
    sns.histplot(np.log1p(data[col]), kde=True, ax=axs[r][2]).set_title('log(1+x)')
plt.tight_layout()    
plt.show()  

**CONCLUSION OF OUTLIERS:** Ten columns have outlier problems: 
* BALANCE,
* PURCHASES,
* ONEOFF_PURCHASES, 
* INSTALLMENTS_PURCHASES, 
* CASH_ADVANCE, 
* CASH_ADVANCE_TRX, 
* PURCHASES_TRX, 
* CREDIT_LIMIT, 
* PAYMENTS and 
* MINIMUM_PAYMENTS

and for these variables I think it is appropriate to apply a logarithmic transformation.
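
The effect of the logarithmic transformation can be sketched on a toy array. The amounts below are hypothetical; the point is only to show how `np.log1p` compresses a long right tail while remaining defined at zero.

```python
import numpy as np

# Hypothetical right-skewed amounts, including a zero and one very large value
amounts = np.array([0.0, 50.0, 120.0, 300.0, 900.0, 40000.0])

logged = np.log1p(amounts)  # log(1 + x), defined at x = 0

# Ratio between the largest value and the median, before and after the transform
ratio_raw = amounts.max() / np.median(amounts)
ratio_log = logged.max() / np.median(logged)
print(ratio_raw, ratio_log)

# expm1 inverts log1p, so the original scale can be recovered when needed
restored = np.expm1(logged)
```

On this toy data the largest value is roughly 190 times the median before the transform and roughly twice the median after it, which is why distance-based methods such as K-means are less dominated by extreme customers once the columns are logged.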

### Discrete variables

In [None]:
int_cols = data[features].select_dtypes(include=['int']).columns
int_cols

In [None]:
for col in int_cols:
    print(data[col].value_counts().sort_values(ascending=False))
    print('-'*30)

In [None]:
data[int_cols].hist(figsize=(15,8))
plt.tight_layout()

### Correlation Analysis

In [None]:
#Using Pearson Correlation
plt.figure(figsize=(12,10))
corr_m = data[features].corr()
sns.heatmap(corr_m, annot=True, cmap=plt.cm.Reds).set_title('Correlation Matrix')
plt.show()

* PURCHASES has a high level of correlation with ONEOFF_PURCHASES.
* CASH_ADVANCE_TRX has a high level of correlation with CASH_ADVANCE_FREQUENCY.
* PURCHASES_TRX has a good level of correlation with INSTALLMENTS_PURCHASES and PURCHASES_FREQUENCY.
* BALANCE has a negative correlation with PRC_FULL_PAYMENT.

**PURCHASES analysis**

In [None]:
cor_purchases = abs(corr_m["PURCHASES"])
cor_purchases[cor_purchases>0.5].sort_values(ascending=False)

In [None]:
print('{0:.2f}%'.format(100*sum(data.PURCHASES == data.ONEOFF_PURCHASES + data.INSTALLMENTS_PURCHASES)/data.shape[0]))

In [None]:
data[data.PURCHASES != data.ONEOFF_PURCHASES + data.INSTALLMENTS_PURCHASES].head()

In [None]:
sns.pairplot(data[['PURCHASES','ONEOFF_PURCHASES','INSTALLMENTS_PURCHASES']],
             markers="+",
             kind='reg',
             diag_kind=None, 
             height=4)

**CASH_ADVANCE analysis**

In [None]:
sns.pairplot(data[['CASH_ADVANCE_FREQUENCY','CASH_ADVANCE_TRX']],
             markers="+",
             kind='reg',
             height=4)

**CONCLUSIONS OF CORRELATIONS:** Taking into account the high correlation between PURCHASES and ONEOFF_PURCHASES, and the interpretation of the variables in the problem, I think PURCHASES is largely redundant and can optionally be removed from the variables included in the models.

## Preprocessing

In [None]:
features = data.columns[1:]
features_group1 = ['BALANCE','ONEOFF_PURCHASES','INSTALLMENTS_PURCHASES','CASH_ADVANCE','CASH_ADVANCE_TRX','PURCHASES_TRX','PAYMENTS','CREDIT_LIMIT','MINIMUM_PAYMENTS']
# keep the original column order (a set difference would give an arbitrary order)
features_group2 = [f for f in features if f not in features_group1]

In [None]:
# using median in columns with outliers 
g1_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('log', FunctionTransformer(np.log1p)),
    #('scaler', MinMaxScaler(feature_range=(0, 1)))
    ('scaler', StandardScaler())
    ])

# using median in columns without outliers 
g2_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
    ])

preprocessor = ColumnTransformer(
    transformers=[
        ('group1', g1_transformer, features_group1),
        ('group2', g2_transformer, features_group2),
        ])

## Analysis of Preprocessing

In [None]:
preprocessor.fit(data) 
np_data = preprocessor.transform(data) 
print(np_data[np.isnan(np_data)])
df_data = pd.DataFrame(np_data, columns=features_group1+features_group2)
print(df_data.isna().sum())
print(df_data.shape)
df_data.head(6)

In [None]:
#to check StandardScaler
df_data.describe()

In [None]:
# to check outliers
plt.figure(figsize=(15,10))
sns.boxplot(data=df_data)
plt.xticks(rotation=90)

### Preprocessing - Extra Miscellaneous

Another way to deal with outliers is to bin the values into ranges.
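
As a compact illustration of the idea, the same kind of range coding can be sketched with `pd.cut`. This is an alternative sketch, not the code used in the cells below; the sample amounts are hypothetical and the bin edges simply mirror the 0 / 500 / 1000 / 3000 / 5000 / 10000 scheme.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts covering every range
amounts = pd.Series([0.0, 250.0, 500.0, 750.0, 2999.0, 4000.0, 9000.0, 25000.0])
edges = [-np.inf, 0, 500, 1000, 3000, 5000, 10000, np.inf]

# right=True gives intervals (low, high], so 0 falls in range 0 and 500 in range 1
codes = pd.cut(amounts, bins=edges, labels=False, right=True)
print(codes.tolist())
```

One difference to keep in mind: `pd.cut` returns NaN for missing inputs, whereas the explicit `.loc` assignments in the cells below leave missing values in range 0.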

In [None]:
data_range = data.copy()

In [None]:
columns=['BALANCE', 'PURCHASES', 'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'CREDIT_LIMIT', 'PAYMENTS', 'MINIMUM_PAYMENTS']

for c in columns:
    
    Range=c+'_RANGE'
    data_range[Range]=0        
    data_range.loc[((data[c]>0)&(data[c]<=500)),Range]=1
    data_range.loc[((data[c]>500)&(data[c]<=1000)),Range]=2
    data_range.loc[((data[c]>1000)&(data[c]<=3000)),Range]=3
    data_range.loc[((data[c]>3000)&(data[c]<=5000)),Range]=4
    data_range.loc[((data[c]>5000)&(data[c]<=10000)),Range]=5
    data_range.loc[((data[c]>10000)),Range]=6

In [None]:
columns=['BALANCE_FREQUENCY', 'PURCHASES_FREQUENCY', 'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY', 'CASH_ADVANCE_FREQUENCY', 'PRC_FULL_PAYMENT']

for c in columns:  

    Range=c+'_RANGE'
    data_range[Range]=0
    for i in range(10):
        data_range.loc[((data[c]>i*0.1)&(data[c]<=(i+1)*0.1)), Range]=i+1

In [None]:
columns=['PURCHASES_TRX', 'CASH_ADVANCE_TRX']  

for c in columns:
    
    Range=c+'_RANGE'
    data_range[Range]=0
    data_range.loc[((data[c]>0)&(data[c]<=5)),Range]=1
    data_range.loc[((data[c]>5)&(data[c]<=10)),Range]=2
    data_range.loc[((data[c]>10)&(data[c]<=15)),Range]=3
    data_range.loc[((data[c]>15)&(data[c]<=20)),Range]=4
    data_range.loc[((data[c]>20)&(data[c]<=30)),Range]=5
    data_range.loc[((data[c]>30)&(data[c]<=50)),Range]=6
    data_range.loc[((data[c]>50)&(data[c]<=100)),Range]=7
    data_range.loc[((data[c]>100)),Range]=8

In [None]:
data_range.drop(['CUST_ID', 'BALANCE', 'BALANCE_FREQUENCY', 'PURCHASES',
       'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE',
       'PURCHASES_FREQUENCY',  'ONEOFF_PURCHASES_FREQUENCY',
       'PURCHASES_INSTALLMENTS_FREQUENCY', 'CASH_ADVANCE_FREQUENCY',
       'CASH_ADVANCE_TRX', 'PURCHASES_TRX', 'CREDIT_LIMIT', 'PAYMENTS',
       'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT' ], axis=1, inplace=True)

In [None]:
len(data.columns), len(data_range.columns)

In [None]:
data_range.head()

In [None]:
data_range.describe()

In [None]:
plt.figure(figsize=(15,10))
sns.boxplot(data=data_range)
plt.xticks(rotation=90)

In [None]:
features_group3 = ['INSTALLMENTS_PURCHASES_RANGE','MINIMUM_PAYMENTS_RANGE','ONEOFF_PURCHASES_FREQUENCY_RANGE','CASH_ADVANCE_FREQUENCY_RANGE','PRC_FULL_PAYMENT_RANGE','CASH_ADVANCE_TRX_RANGE']
# keep the original column order (a set difference would give an arbitrary order)
features_group4 = [c for c in data_range.columns if c not in features_group3]

In [None]:
# using median in columns with outliers 
g1_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('log', FunctionTransformer(np.log1p)),
    #('scaler', MinMaxScaler(feature_range=(0, 1)))
    ('scaler', StandardScaler())
    ])

# using median in columns without outliers 
g2_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
    ])

preprocessor2 = ColumnTransformer(
    transformers=[
        ('group1', g1_transformer, features_group3),
        ('group2', g2_transformer, features_group4),
        ])

In [None]:
data_range.columns

In [None]:
preprocessor2.fit(data_range) 
np_data_range = preprocessor2.transform(data_range) 

In [None]:
print(np_data_range[np.isnan(np_data_range)])
df_data2 = pd.DataFrame(np_data_range, columns=features_group3+features_group4)
print(df_data2.isna().sum())
print(df_data2.shape)
df_data2.head(6)

In [None]:
df_data2.describe()

In [None]:
plt.figure(figsize=(15,10))
sns.boxplot(data=df_data2)
plt.xticks(rotation=90)

## PCA data

In [None]:
pca = PCA(n_components=2)
pca.fit(np_data)

In [None]:
data_pca = pca.transform(np_data)
plt.figure(figsize=(8,6))
# plot the projected data, not the first two original features
plt.scatter(data_pca[:,0], data_pca[:,1])
plt.xlabel('First principal component')
plt.ylabel('Second principal component')

In [None]:
print(pca.noise_variance_)
print(pca.explained_variance_ratio_)

The estimated noise variance (the variance left in the discarded components) is not small, so we cannot rely on the 2-D PCA plot to judge the shape of the data.
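
To make these two quantities concrete, here is a minimal NumPy sketch of how the explained variance ratio and the noise variance relate to the eigenvalues of the covariance matrix. It runs on random, hypothetical data rather than on the credit card features, so the numbers it prints say nothing about this dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical standardized data: 200 samples, 5 features
X = rng.normal(size=(200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigenvalues of the covariance matrix are the variances along each principal axis
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending order
ratio = eigvals / eigvals.sum()

kept = ratio[:2].sum()      # share of the variance visible in a 2-D plot
noise = eigvals[2:].mean()  # average variance of the discarded axes
print(ratio.round(3), kept.round(3), noise.round(3))
```

When `kept` is well below 1 and `noise` is comparable to the leading eigenvalues, a 2-D scatter plot hides much of the structure, which is the situation described above.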

# MODELS

## K-means

Below is a toy example to illustrate how the algorithm works.

![image.png](https://stanford.edu/~cpiech/cs221/img/kmeansViz.png)


Image: https://stanford.edu/~cpiech/cs221/img/kmeansViz.png

#### N°Clusters for K-means: Elbow Method

The idea behind the elbow method is to run k-means clustering on the dataset for a range of values of k (e.g. k = 1 to 10) and, for each value of k, calculate the sum of squared errors (SSE).

The SSE is the sum of squared distances between the data points and their cluster centroid. Increasing the number of clusters (k) always reduces these distances and therefore the SSE, to the extreme of reaching zero when k equals the number of data points. **So the goal is to choose a small value of k that still has a low SSE.**
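
The behaviour of the SSE can be checked by hand on a toy example. The sketch below is an illustration with hypothetical points and a hypothetical helper `sse`; it computes the same quantity that `KMeans` exposes as `inertia_`, but for cluster labels that are given rather than learned.

```python
import numpy as np

def sse(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for label in np.unique(labels):
        members = points[labels == label]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

# Hypothetical toy data: two tight groups of two points each
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

sse_k1 = sse(points, np.array([0, 0, 0, 0]))  # one cluster
sse_k2 = sse(points, np.array([0, 0, 1, 1]))  # the two natural groups
sse_k4 = sse(points, np.array([0, 1, 2, 3]))  # one cluster per point
print(sse_k1, sse_k2, sse_k4)
```

On this toy data the SSE drops from 104 to 4 when going from one to two clusters and only from 4 to 0 afterwards, so the "elbow" sits at k = 2.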

In [None]:
Sum_of_squared_distances = []
K = range(1, 20)
for k in K:
    km = KMeans(n_clusters=k, 
                init='k-means++',
                max_iter=400, 
                n_init=80, 
                random_state=0).fit(np_data)
    Sum_of_squared_distances.append(km.inertia_)

plt.figure(figsize=(10,10))
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()

#### N° Clusters for K-means: Silhouette Coefficient Method

A higher Silhouette Coefficient score relates to a model with better-defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:

$a$: The mean distance between a sample and all other points in the same cluster.

$b$: The mean distance between a sample and all other points in the next nearest cluster.

The Silhouette Coefficient for a single sample is then given as:

$$s=\dfrac{b-a}{\max(a,b)}$$
 
To find the optimal value of k for KMeans, loop over n_clusters = 2..n (the coefficient is undefined for a single cluster) and compute the mean Silhouette Coefficient over all samples for each k.

A higher Silhouette Coefficient indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.

The Silhouette Coefficient ranges between -1 and 1 and indicates how close each point in one cluster is to the points in the neighbouring clusters. Points with values close to 1 are far from the other clusters, values near 0 lie on the boundary between two clusters, and negative values suggest points that may have been assigned to the wrong cluster. In an ideal situation we would expect all the points of a cluster to have Silhouette Coefficients close to 1. 
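
The definition can be followed step by step on a toy example. The helper `silhouette_of` and the points below are hypothetical and only meant to show how $a$, $b$ and $s$ are obtained for one sample; in the notebook itself the values come from `silhouette_score` and `silhouette_samples`.

```python
import numpy as np

def silhouette_of(index, points, labels):
    """Silhouette coefficient s = (b - a) / max(a, b) for one sample.

    Assumes the sample's cluster has at least two members.
    """
    dists = np.linalg.norm(points - points[index], axis=1)
    own = labels == labels[index]
    own[index] = False  # the sample itself is excluded from a
    a = dists[own].mean()
    # b: smallest mean distance to the points of any other cluster
    b = min(dists[labels == other].mean()
            for other in np.unique(labels) if other != labels[index])
    return (b - a) / max(a, b)

# Hypothetical toy data: two clusters on a line
points = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
labels = np.array([0, 0, 0, 1, 1])

s_first = silhouette_of(0, points, labels)   # point far from the other cluster
s_border = silhouette_of(2, points, labels)  # point closest to the other cluster
print(s_first, s_border)
```

For the first point $a = 1.5$ and $b = 10.5$, giving $s = 9/10.5 \approx 0.86$; for the point nearest to the other cluster $b$ shrinks to $8.5$, so its coefficient is slightly lower.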

In [None]:
silhouette_scores = [] 
K = range(2, 20)

for k in K:
    km = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=45).fit_predict(np_data)
    scr = silhouette_score(np_data, km)
    silhouette_scores.append(scr)
    print("For n_clusters =", k, "The average silhouette_score is :", scr)
plt.plot(K, silhouette_scores, 'bx-')
plt.xlabel('k')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Method For Optimal k')
plt.show()



In [None]:
K = range(2,10)

for k in K:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    ax1.set_xlim([-0.1, 1])
    ax1.set_ylim([0, len(np_data) + (k + 1) * 10])

    clusterer = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=45)
    cluster_labels = clusterer.fit_predict(np_data)

    silhouette_avg = silhouette_score(np_data, cluster_labels)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(np_data, cluster_labels)

    y_lower = 10
    for i in range(k):
        # Aggregate the silhouette scores for samples belonging to cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / k)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])


    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / k)
    pca = PCA(n_components=2)
    pca.fit(np_data)
    X = pca.transform(np_data)

    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    pca_centers = pca.transform(clusterer.cluster_centers_)
    # Draw white circles at cluster centers
    ax2.scatter(pca_centers[:, 0], pca_centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(pca_centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st principal feature")
    ax2.set_ylabel("Feature space for the 2nd principal feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % k),
                 fontsize=14, fontweight='bold')

plt.show()

### Set clusters

In [None]:
km = KMeans(n_clusters=6, 
            init='k-means++',
            max_iter=400, 
            n_init=80, 
            random_state=0)

km_pipe = Pipeline(steps=[('preprocessor', preprocessor),
                          ('km', km)])

km_pipe.fit(data)

In [None]:
labels = km.labels_

In [None]:
clusters = pd.concat([data, pd.DataFrame({'CLUSTER':labels})], axis=1)
clusters.head()

In [None]:
clusters.CLUSTER.value_counts()

In [None]:
clusters.CLUSTER.hist(figsize=(10, 8))
plt.tight_layout()

In [None]:
# save clusters to csv
clusters.to_csv('Clusters_CreditCards_Kmeans.csv')

### Interpretation of clusters

In [None]:
# plot only the model features (skips the CUST_ID and CLUSTER columns)
for c in features:
    grid= sns.FacetGrid(clusters, col='CLUSTER')
    grid.map(plt.hist, c)

In [None]:
# numeric_only=True skips the non-numeric CUST_ID column
clusters.groupby(['CLUSTER']).mean(numeric_only=True)

***Cluster 0*** People with a high level of income (balance) and a high credit limit who take cash in advance.

***Cluster 1*** People with a low level of income who do not purchase frequently.

***Cluster 2*** Low balance, but the balance is updated frequently, i.e. a larger number of transactions. They purchase mostly in installments.

***Cluster 3*** They purchase mostly in one go and with high frequency. The percentage of full payment paid is low (debtors).

***Cluster 4*** People with a medium level of income who don't spend much money and who accept large amounts of cash advances, but not frequently.

***Cluster 5*** High spenders with a high credit limit who make expensive purchases and take more cash in advance.

In [None]:
dist = 1 - cosine_similarity(np_data)

pca = PCA(2)
pca.fit(dist)
X_PCA = pca.transform(dist)
X_PCA.shape

In [None]:
x, y = X_PCA[:, 0], X_PCA[:, 1]

colors = {0: 'red',
          1: 'blue',
          2: 'green', 
          3: 'yellow', 
          4: 'orange',  
          5:'purple'}

names = {0: 'high level of income and high credit limit who take cash in advance', 
         1: 'low level of income. Not Frequent purchases', 
         2: 'who purchases mostly in installments', 
         3: 'They purchase mostly in one-go with a high frequency. the percent of full payment paid is low (debtors)', 
         4: 'do not spend much money and who accept large amounts of cash advances but not frequently',
         5: 'High spenders who take more cash in advance'}
  
df = pd.DataFrame({'x': x, 'y':y, 'label':labels}) 
groups = df.groupby('label')

fig, ax = plt.subplots(figsize=(20, 13)) 

for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=5,
            color=colors[name],label=names[name], mec='none')
    ax.set_aspect('auto')
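    # booleans are required here; the string 'off' is not a valid switch in current matplotlib
    ax.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
    ax.tick_params(axis='y', which='both', left=False, right=False, labelleft=False)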
    
ax.legend()
ax.set_title("Customers Segmentation based on their Credit Card usage behaviour.")
plt.show()

## Hierarchical

Hierarchical clustering starts by treating each observation as a separate cluster. Then, it repeatedly executes the following two steps: 
1. identify the two clusters that are closest together, and
2. merge the two most similar clusters. This iterative process continues until all the clusters are merged together.
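
The two steps above can be sketched directly in code. This is a deliberately naive illustration using single linkage (distance between the closest members of two clusters) on hypothetical points; the helper `agglomerate` is not part of any library, and the notebook itself relies on `AgglomerativeClustering` and `scipy.cluster.hierarchy.linkage`.

```python
import numpy as np

def agglomerate(points, n_clusters):
    """Naive single-linkage agglomerative clustering; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # every point starts as its own cluster
    while len(clusters) > n_clusters:
        best = None
        # Step 1: find the two closest clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(np.linalg.norm(points[p] - points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        # Step 2: merge them
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical toy points on a line
points = np.array([[0.0], [1.0], [5.0], [6.0], [20.0]])
print(agglomerate(points, 3))
print(agglomerate(points, 2))
```

Stopping at three clusters groups the two nearby pairs and leaves the distant point alone; continuing to two clusters merges the pairs first, because they are closer to each other than to the distant point. The loop is far too slow for thousands of customers and is meant only to show the mechanics.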

<img src="https://dpzbhybb2pdcj.cloudfront.net/rhys/v-7/Figures/CH17_FIG_2_MLR.png" width="400">

image: https://dpzbhybb2pdcj.cloudfront.net/rhys/v-7/Figures/CH17_FIG_2_MLR.png

![image.png](https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcRhdKUnlaTWstr6eoGVrHV6iDhOLr4ZhBXValerAT4vUfbWXrgA&usqp=CAU)

image from: https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcRhdKUnlaTWstr6eoGVrHV6iDhOLr4ZhBXValerAT4vUfbWXrgA&usqp=CAU

### N° of clusters - Visualization

![image](https://3.bp.blogspot.com/-TQYHVkgesMg/WbTcMIOuquI/AAAAAAAAD3Y/dY4YpxJ3OhU5VGppwcrS6j-ewvlddxSjwCLcBGAs/s1600/hcust.PNG)

image from: https://3.bp.blogspot.com/-TQYHVkgesMg/WbTcMIOuquI/AAAAAAAAD3Y/dY4YpxJ3OhU5VGppwcrS6j-ewvlddxSjwCLcBGAs/s1600/hcust.PNG

In [None]:
preprocessor.fit(data) 
np_data = preprocessor.transform(data) 

In [None]:
silhouette_list_hierarchical = []
for cluster in range(2,10):
    for linkage_method in ['ward', 'average','single']:
        # Euclidean distance is the default, so it is not passed explicitly
        agglomerative = AgglomerativeClustering(linkage=linkage_method, n_clusters=cluster).fit_predict(np_data)
        sil_score = metrics.silhouette_score(np_data, agglomerative, metric='euclidean')
        silhouette_list_hierarchical.append((cluster, sil_score, linkage_method))
        
df_hierarchical = pd.DataFrame(silhouette_list_hierarchical, columns=['cluster', 'sil_score','linkage_method'])
df_hierarchical.sort_values('sil_score', ascending=False)

The dendrogram can be hard to read when the original observation matrix from which the linkage is derived is large. Truncation is used to condense the dendrogram.

I'm going to plot with different parameters to see the best option.

In [None]:
Z_avg = linkage(np_data, 'average')

plt.figure(figsize=(15,10))
dendrogram(Z_avg, leaf_rotation=90, p=5, color_threshold=20, leaf_font_size=10, truncate_mode='level')
plt.axhline(y=125, color='r', linestyle='--')
plt.show()

In [None]:
Z_ward = linkage(np_data, 'ward')

plt.figure(figsize=(15,10))
dendrogram(Z_ward, leaf_rotation=90, p=5, color_threshold=20, leaf_font_size=10, truncate_mode='level')
plt.axhline(y=125, color='r', linestyle='--')
plt.show()

In [None]:
Z_single = linkage(np_data, 'single')

plt.figure(figsize=(15,10))
dendrogram(Z_single, leaf_rotation=90, p=15, color_threshold=20, leaf_font_size=10, truncate_mode='level')
plt.axhline(y=125, color='r', linestyle='--')
plt.show()

### Set 2 clusters

In [None]:
hierarchical = AgglomerativeClustering(n_clusters=2, linkage='average')

In [None]:
pipe_hierar = Pipeline(steps=[
                              ('preprocessor', preprocessor),
                              ('hierarchical', hierarchical)]
                       )

pipe_hierar.fit(data)

In [None]:
clusters_hierar = pd.concat([data, pd.DataFrame({'CLUSTER':hierarchical.labels_})], axis=1)
clusters_hierar.head()

In [None]:
clusters_hierar.to_csv('Clusters_CreditCard_Hierarchical.csv')

In [None]:
# numeric_only=True skips the non-numeric CUST_ID column
clusters_hierar.groupby('CLUSTER').mean(numeric_only=True)

In [None]:
clusters_hierar.CLUSTER.value_counts()