# Credit Card Customer Segmentation

## The problem

A marketing strategy team is trying to build new [personas](https://www.interaction-design.org/literature/article/personas-why-and-how-you-should-use-them) so that its marketing campaigns are better targeted and achieve a higher return on investment (ROI).

This problem can be framed as a customer segmentation task.
To tackle it, we will follow the relevant steps of a well-known data science methodology, CRISP-DM:

1. **Understand the business problem**;
2. **Understand the data**;
3. **Prepare the data**;
4. **Modeling**;
5. **Evaluation**;
6. **Deploy** (for this problem we will not deploy any model, although the project may evolve in the future).


## The Data
The data summarizes the usage of 9000 credit card holders over 6 months; the variables are intended to capture the behaviour of those customers.

- **CUST_ID** : Identification of the credit card holder (ID)
- **BALANCE** : Balance amount left in the account to make purchases
- **BALANCE_FREQUENCY** : How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
- **PURCHASES** : Amount of purchases made from the account
- **ONEOFF_PURCHASES** : Maximum purchase amount done in one go
- **INSTALLMENTS_PURCHASES** : Amount of purchases done in installments
- **CASH_ADVANCE** : Cash in advance given by the user
- **PURCHASES_FREQUENCY** : How frequently purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
- **ONEOFF_PURCHASES_FREQUENCY** : How frequently purchases are happening in one go (1 = frequently purchased, 0 = not frequently purchased)
- **PURCHASES_INSTALLMENTS_FREQUENCY** : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
- **CASH_ADVANCE_FREQUENCY** : How frequently cash in advance is being paid
- **CASH_ADVANCE_TRX** : Number of transactions made with "Cash in Advance"
- **PURCHASES_TRX** : Number of purchase transactions made
- **CREDIT_LIMIT** : Credit card limit for the user
- **PAYMENTS** : Amount of payments made by the user
- **MINIMUM_PAYMENTS** : Minimum amount of payments made by the user
- **PRC_FULL_PAYMENT** : Percentage of full payment paid by the user
- **TENURE** : Tenure of the credit card service for the user

# Packages

In [None]:
# os / sys 
import os 
import sys
import warnings


# standard

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns

from glob import glob

# stats 

from sklearn.preprocessing import PowerTransformer

# machine learning

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_samples, silhouette_score



In [None]:
# global configs
%matplotlib inline
warnings.filterwarnings('ignore')
sns.set_theme()
sns.set_style('whitegrid')

# Loading Data

In [None]:
file = glob('raw/*.csv')

raw_df = pd.read_csv(file[0], index_col = 'CUST_ID')

#drop duplicates
raw_df.drop_duplicates(inplace = True, ignore_index = True)

raw_df

# Exploratory Data Analysis

By the end of this section, the following points need to be covered:
- Missing Values
- Columns Type
- Distributions
- Relationships between columns (correlation, interaction).

In [None]:
# Missing values
def check_missing(data):
    """Print the count and percentage of missing values for each column that has any."""
    # columns that have NA
    mask_columns_na = data.columns[data.isna().any()]

    # count of NA per column
    na_count = data[mask_columns_na].isna().sum()

    percentage_na = (na_count / len(data)) * 100
    df_na = pd.DataFrame({'NA Values': na_count, 'Percentage NA': percentage_na})
    df_na = df_na.sort_values(by = 'Percentage NA', ascending = False)

    print(df_na)

check_missing(raw_df)

In [None]:
# Payments
plt.figure(figsize= (16,6))
sns.histplot(raw_df, x = 'MINIMUM_PAYMENTS', log_scale = True)
plt.show()

print('-'*50)
print("The number of CC holders that haven't paid anything is: {}".format((raw_df['MINIMUM_PAYMENTS'] == 0).sum()))

Having zero CC holders who haven't paid anything doesn't seem plausible to me, so let's take a quick look at some of the rows where MINIMUM_PAYMENTS is NA.
To be more precise, we're looking for behaviour consistent with not paying anything.

In [None]:
raw_df[raw_df['MINIMUM_PAYMENTS'].isna()]

Okay, so it seems my hypothesis is 80% correct. As we can see, in many of the rows with NAs the PAYMENTS column is equal to zero.
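
The overlap between missing minimum payments and zero payments can also be measured instead of read off the table. A minimal sketch on a toy frame (the values are invented for illustration; only the two column names come from the data set):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; only the column names match the data set
toy = pd.DataFrame({
    'PAYMENTS':         [0.0, 0.0, 120.5, 0.0, 300.0],
    'MINIMUM_PAYMENTS': [np.nan, np.nan, np.nan, 0.0, 85.0],
})

na_mask = toy['MINIMUM_PAYMENTS'].isna()

# Among rows with a missing minimum payment, the fraction that also has PAYMENTS == 0
share_zero_pay = (toy.loc[na_mask, 'PAYMENTS'] == 0).mean()
print('Share of NA rows with zero payments: {:.0%}'.format(share_zero_pay))
```

Applying the same mask and comparison to `raw_df` would give the actual share for this data set.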

In [None]:
mask_zero_pay = raw_df['PAYMENTS'] == 0.0 
raw_df.loc[mask_zero_pay, 'MINIMUM_PAYMENTS'] = raw_df.loc[mask_zero_pay, 'MINIMUM_PAYMENTS'].fillna(0)
    
check_missing(raw_df)


Now less than 1% of the data is missing, so we have two options:

- Understand why and impute with a descriptive statistic (mean, median)
- Drop the rows

In practice, this is a tradeoff between time (money) and return. I don't think recovering less than 1% of the data would give a good return on the time invested.
I will drop those rows.
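
For reference, the imputation route would only take a couple of lines. The sketch below uses a toy series with invented values and the median, which is less sensitive than the mean to a heavy right tail. It is an illustration of the alternative, not what this notebook does next.

```python
import numpy as np
import pandas as pd

# Toy values, invented for illustration
min_pay = pd.Series([100.0, np.nan, 250.0, 90.0, np.nan, 4000.0], name='MINIMUM_PAYMENTS')

median_value = min_pay.median()          # NaNs are skipped by default
imputed = min_pay.fillna(median_value)   # returns a new Series, the original is untouched

print(imputed)
```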

In [None]:
raw_df.dropna(how = 'any', axis = 0, inplace = True)


In [None]:
raw_df['TENURE'] = raw_df['TENURE'].astype('category')
print(raw_df.dtypes)


In [None]:
raw_df.describe()

In [None]:
def plot_kde(n, df):
    """ 
    Plot kde for n columns 

    Input
    n =  number of columns
    df = data frame
    Output
    Return - None


    """
    plt.figure(figsize=(15,18))
    for i in range(0,n):
        plt.subplot(6,3,i+1)
        sns.kdeplot(df[df.columns[i]])
        plt.title(df.columns[i])
    plt.tight_layout()

In [None]:
plot_kde(n=16, df = raw_df)

Not surprisingly, the data doesn't follow a normal distribution; most columns are heavily skewed, in line with a power law (the 80-20 rule).
Hence, we will need some transformation to reduce the skewness.
- Categorical: Tenure
- Left-skewed: Balance frequency
- Right-skewed: all the others.

_Note: all the values are non-negative, but there are zeros, so we will need to use log(x+1)._
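
To see why log(x+1) helps, here is a small sketch on synthetic data (a log-normal sample with a few exact zeros, invented for illustration and not drawn from the credit card data). The `skewness` helper is a hypothetical name defined only for this example.

```python
import numpy as np

def skewness(x):
    """Sample skewness: mean of the cubed standardized values."""
    x = np.asarray(x, dtype=float)
    return np.mean(((x - x.mean()) / x.std()) ** 3)

rng = np.random.default_rng(0)
# Synthetic right-skewed "amounts" with a few exact zeros, invented for illustration
amounts = rng.lognormal(mean=6.0, sigma=1.5, size=5000)
amounts[:50] = 0.0

logged = np.log1p(amounts)  # log(1 + x), defined at x = 0

print('skewness before: {:.2f}'.format(skewness(amounts)))
print('skewness after:  {:.2f}'.format(skewness(logged)))
```

`np.log1p(x)` computes the same quantity as `np.log(1 + x)` and maps zero to zero, so no rows are lost to `-inf`.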


In [None]:
# Reducing the skewness through log transformation

cols = raw_df.columns.difference(['TENURE', 'BALANCE_FREQUENCY'])

for col in cols:
    raw_df[col] = np.log(1 + raw_df[col])




In [None]:
plot_kde(n=16, df= raw_df)

Well, it's still not normally distributed, but the data looks much better than before.

In [None]:
plt.figure(figsize = (12,12))
sns.heatmap(raw_df.corr(), cmap = 'coolwarm', linewidths = .5, annot = True, vmin = -1)
plt.show()

In [None]:

#df = pd.get_dummies(raw_df, columns = ['TENURE'], drop_first= True)
#df.head()

In [None]:
scaler = StandardScaler()
df = raw_df.copy()
df[df.columns] = scaler.fit_transform(df[df.columns])
df

# Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality-reduction method often used on large data sets: it transforms a large set of variables into a smaller one that still contains most of the information in the original set, i.e., fewer variables without losing too much information.
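
The idea of "most of the information" can be made concrete with the cumulative explained variance. The sketch below works on synthetic data (8 features generated from 3 hidden factors, invented for illustration) and uses a plain NumPy SVD rather than scikit-learn, just to show where the numbers come from.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 8 correlated features driven by 3 hidden factors plus small noise
factors = rng.normal(size=(500, 3))
loadings = rng.normal(size=(3, 8))
X = factors @ loadings + 0.1 * rng.normal(size=(500, 8))

# Standardize, then get the component variances from the singular values
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
singular_values = np.linalg.svd(X_std, compute_uv=False)
explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)

cumulative = np.cumsum(explained_ratio)
# Smallest number of components whose cumulative share reaches 90%
k = int(np.searchsorted(cumulative, 0.90) + 1)
print('components needed for 90% of the variance:', k)
```

Because the 8 features are built from only 3 factors, a handful of components is enough here; real data usually needs more.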


In [None]:
# Identify the ideal number of components by the variance explained
pca = PCA().fit(df)
n_components = np.arange(1, len(pca.explained_variance_ratio_) + 1)

plt.figure(figsize = (12,6))
plt.plot(n_components, np.cumsum(pca.explained_variance_ratio_), marker = 'o', linestyle = '--')
plt.xlabel('Number of components')
plt.xticks(n_components)
plt.ylabel('Cumulative variance explained')
plt.show()

Hence, I will select the 90% mark, which is 6 components.
Below we take a quick look at how the explained variance is distributed across the components.


In [None]:
for i, value in enumerate(list(pca.explained_variance_)):
    print('Explained variance - PCA {comp}: {value}'.format(comp = i+1, value = value))

In [None]:
X_red = PCA(0.9).fit_transform(df)
X_red


# Modeling

We will use the silhouette plot and score to choose the optimal number of clusters!
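
As a reminder of what the score measures: for each sample, `a` is the mean distance to the other members of its own cluster, `b` is the mean distance to the nearest other cluster, and the silhouette is `(b - a) / max(a, b)`. A minimal NumPy sketch on four invented points (the `silhouette_values` helper is a hypothetical name for this example, not the scikit-learn function):

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-sample silhouette s = (b - a) / max(a, b), using Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a cluster with a single sample keeps a score of 0
        # a: mean distance to the other members of the same cluster
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == k].mean() for k in np.unique(labels) if k != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Two obvious groups, far apart along the x axis
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_values(X, [0, 0, 1, 1])  # labels follow the groups
bad = silhouette_values(X, [0, 1, 0, 1])   # labels cut across the groups

print('good labelling:', good.mean())
print('bad labelling: ', bad.mean())
```

Values close to 1 mean a sample sits well inside its cluster, and negative values suggest it has probably been assigned to the wrong one.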

In [None]:
def silhouette_ploter(array, upper_range):
    ''' Input array and upper limit of cluster to iterate over '''
    
    range_n_clusters = range(2,upper_range)
    for n_clusters in range_n_clusters:
        # Create a subplot with 1 row and 2 columns
        fig, (ax1, ax2) = plt.subplots(1, 2)
        fig.set_size_inches(16, 6)

        # The 1st subplot is the silhouette plot
        # The silhouette coefficient can range from -1, 1
        ax1.set_xlim([-1, 1])
        # The (n_clusters+1)*10 is for inserting blank space between silhouette
        # plots of individual clusters, to demarcate them clearly.
        ax1.set_ylim([0, len(array) + (n_clusters + 1) * 10])

        # Initialize the clusterer with n_clusters value and a random generator
        # seed of 23 for reproducibility.
        clusterer = KMeans(n_clusters=n_clusters, random_state=23)
        cluster_labels = clusterer.fit_predict(array)

        # The silhouette_score gives the average value for all the samples.
        # This gives a perspective into the density and separation of the formed
        # clusters
        silhouette_avg = silhouette_score(array, cluster_labels)
        print("For n_clusters =", n_clusters,
              "The average silhouette_score is :", silhouette_avg)

        # Compute the silhouette scores for each sample
        sample_silhouette_values = silhouette_samples(array, cluster_labels)

        y_lower = 10
        for i in range(n_clusters):
            # Aggregate the silhouette scores for samples belonging to
            # cluster i, and sort them
            ith_cluster_silhouette_values = \
                sample_silhouette_values[cluster_labels == i]

            ith_cluster_silhouette_values.sort()

            size_cluster_i = ith_cluster_silhouette_values.shape[0]
            y_upper = y_lower + size_cluster_i

            color = cm.nipy_spectral(float(i) / n_clusters)
            ax1.fill_betweenx(np.arange(y_lower, y_upper),
                              0, ith_cluster_silhouette_values,
                              facecolor=color, edgecolor=color, alpha=0.7)

            # Label the silhouette plots with their cluster numbers at the middle
            ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

            # Compute the new y_lower for next plot
            y_lower = y_upper + 10  # 10 for the 0 samples

        ax1.set_title("The silhouette plot for the various clusters.")
        ax1.set_xlabel("The silhouette coefficient values")
        ax1.set_ylabel("Cluster label")

        # The vertical line for average silhouette score of all the values
        ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

        ax1.set_yticks([])  # Clear the yaxis labels / ticks
        ax1.set_xticks([-1,-0.8, -0.6, -0.4,-0.2, 0, 0.2, 0.4, 0.6, 0.8, 1])

        # 2nd Plot showing the actual clusters formed
        colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
        ax2.scatter(array[:, 0], array[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                    c=colors, edgecolor='k')

        # Labeling the clusters
        centers = clusterer.cluster_centers_
        # Draw white circles at cluster centers
        ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                    c="white", alpha=1, s=200, edgecolor='k')

        for i, c in enumerate(centers):
            ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                        s=50, edgecolor='k')

        ax2.set_title("The visualization of the clustered data.")
        ax2.set_xlabel("Feature space for the 1st feature")
        ax2.set_ylabel("Feature space for the 2nd feature")

        plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                      "with n_clusters = %d" % n_clusters),
                     fontsize=14, fontweight='bold')


In [None]:
silhouette_ploter(X_red, 6)

From the silhouette plots and scores, the appropriate number of clusters is 3, since the average score is higher and there are no negative coefficients.
Let's use the elbow method as well and see if we get any surprises.
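
The elbow method tracks the within-cluster sum of squared distances, which scikit-learn exposes as `inertia_`. A minimal sketch of that quantity on four invented points (the `inertia` helper is a hypothetical name used only here):

```python
import numpy as np

def inertia(X, labels):
    """Sum of squared distances from each point to the centroid of its cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        total += np.sum((members - centroid) ** 2)
    return total

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
tight = inertia(X, [0, 0, 1, 1])   # clusters match the two visible groups
loose = inertia(X, [0, 1, 0, 1])   # clusters cut across the groups

print('tight clusters:', tight)
print('loose clusters:', loose)
```

Inertia always decreases as k grows, so the plot is read for the point where the decrease flattens out rather than for a minimum.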

In [None]:
def elbow_plotter(array, upper_limit):
    distortions = []
    k = range(2,upper_limit)
    for n_clusters in k:
        kmeanModel = KMeans(n_clusters=n_clusters, random_state=23)
        kmeanModel.fit(array)
        distortions.append(kmeanModel.inertia_)
    plt.figure(figsize=(16,8))
    plt.plot(k, distortions, marker = 'o', linestyle = '--')
    plt.xlabel('k')
    plt.ylabel('Distortion')
    plt.title('The Elbow Method showing the optimal k')
    plt.show()


In [None]:
elbow_plotter(X_red, 10)

# Evaluation

This is the first iteration of the model, using the 0.9 threshold for cumulative variance explained.

Looking at the silhouette plot and the 2-dimensional scatter plot, the most promising number of clusters is 2 or 3. Honestly, I don't think only 2 clusters will help solve the business problem, so let's check both on the data.


In [None]:
kmeans_2 = KMeans(n_clusters=2, random_state=23)
kmeans_3 = KMeans(n_clusters=3, random_state=23)

kmeans_2.fit(X_red)
kmeans_3.fit(X_red)

print('Silhouette score of our model with 2 clusters is ' + str(silhouette_score(X_red, kmeans_2.labels_)))
print('Silhouette score of our model with 3 clusters is ' + str(silhouette_score(X_red, kmeans_3.labels_)))


In [None]:
df['cluster_2'] = kmeans_2.labels_
df['cluster_3'] = kmeans_3.labels_



In [None]:
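# Undo the standardization first, then invert log(1 + x) to recover the original units
df[raw_df.columns] = scaler.inverse_transform(df[raw_df.columns])
df[cols] = np.expm1(df[cols])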


In [None]:
best_cols_2 = ["BALANCE", "PURCHASES","PURCHASES_FREQUENCY", "CASH_ADVANCE","CREDIT_LIMIT", "PAYMENTS", "MINIMUM_PAYMENTS", 'cluster_2']
best_cols_3 = ["BALANCE", "PURCHASES", "PURCHASES_FREQUENCY", "CASH_ADVANCE","CREDIT_LIMIT", "PAYMENTS", "MINIMUM_PAYMENTS", 'cluster_3']

df.head()



In [None]:
plt.figure(figsize=(20,20)) 
sns.pairplot(data = df[best_cols_2], hue = 'cluster_2')

In [None]:
plt.figure(figsize=(20,20)) 
sns.pairplot(data = df[best_cols_3], hue = 'cluster_3')

In [None]:
plt.figure(figsize=(10,6))
sns.scatterplot(data=df, x='CREDIT_LIMIT', y='PURCHASES', hue='cluster_3')
plt.title('Distribution of clusters based on Credit limit and total purchases')
plt.show()

# Conclusion