# Segment Analysis of clients using Credit Card Data

The objective of this study is to apply clustering techniques to segment credit card clients and, through these segments, better understand the market.
One application of this type of study is in marketing campaigns, where knowing the different consumer profiles helps to target each group.

Some of the concepts that will be presented in this project are:
- Clustering with k-means
- Dimensionality reduction using PCA
- Dimensionality reduction with autoencoders

## 1 - Loading the Libraries and the File

In [None]:
#Analysis and visualization
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#Scaling the data
from sklearn.preprocessing import StandardScaler

#For clustering
from sklearn.cluster import KMeans

#For reduction of dimensionality
from sklearn.decomposition import PCA
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

In [None]:
credit_data = pd.read_csv('../input/ccdata/CC GENERAL.csv')
credit_data.head()

Checking the size of the dataset and some basic information on the data:

In [None]:
credit_data.shape

In [None]:
credit_data.info()

In [None]:
credit_data.dtypes

In [None]:
credit_data.describe()

Summary of this section:
* There are 8950 records with 18 features.
* The data is numerical, except for the customer id (CUST_ID), which is an object containing letters and numbers.
* On average, clients maintain a balance of 1564 dollars in their account.
* On average, clients spend 1000 dollars on purchases.
* Regarding the purchase mode, clients spend on average 592 dollars on one-off purchases and 411 dollars on purchases with installments.
* Good news for the bank: clients, on average, take 978 dollars in cash advances. Keep in mind that, in general, the fees and interest charged on cash advances are higher than those charged on regular credit card purchases.
* In regards to frequency, clients make purchases with installments (mean = 0.364) more frequently than one-off purchases (mean = 0.202).
* Regarding credit limits, the maximum limit is 30,000 dollars and the minimum is 50 dollars. On average, clients have a credit card limit of 4494 dollars.


## 2 - Exploratory Data Analysis

#### Checking for null values: 

In [None]:
plt.figure(figsize = (16,6))
sns.heatmap(credit_data.isnull());

There are null values in the variables 'MINIMUM_PAYMENTS' and 'CREDIT_LIMIT':

In [None]:
credit_data.isnull().sum()

In [None]:
credit_data.loc[(credit_data['MINIMUM_PAYMENTS'].isnull() == True)]

There are many ways of replacing null values. In this case, the null values will be replaced with the column mean, as both variables (credit limit and minimum payments) are continuous:

In [None]:
credit_data['MINIMUM_PAYMENTS'].mean()

In [None]:
credit_data.loc[(credit_data['MINIMUM_PAYMENTS'].isnull() == True), 'MINIMUM_PAYMENTS'] = credit_data['MINIMUM_PAYMENTS'].mean()

In [None]:
credit_data['CREDIT_LIMIT'].mean()

In [None]:
credit_data.loc[(credit_data['CREDIT_LIMIT'].isnull() == True), 'CREDIT_LIMIT'] = credit_data['CREDIT_LIMIT'].mean()
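The two replacements above can also be expressed in a single step with `fillna`. The sketch below uses a small toy DataFrame (the values are illustrative and not taken from the dataset) to show the idea: the column means are computed ignoring the missing entries and then used to fill them.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for credit_data (illustrative values only)
toy = pd.DataFrame({
    'MINIMUM_PAYMENTS': [100.0, np.nan, 300.0, 200.0],
    'CREDIT_LIMIT': [1000.0, 3000.0, np.nan, 2000.0],
})

cols = ['MINIMUM_PAYMENTS', 'CREDIT_LIMIT']

# mean() skips NaN by default; fillna matches the means to columns by name
toy_filled = toy.fillna(toy[cols].mean())

print(toy_filled)
print(toy_filled.isnull().sum())
```

When a column is strongly skewed, the median (`toy[cols].median()`) may be a more robust fill value than the mean; which one is preferable depends on the data.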

Now just checking if the null values were replaced:

In [None]:
credit_data.isnull().sum()

Checking for duplicated values:

In [None]:
credit_data.duplicated().sum()

The customer ID (CUST_ID) is unnecessary information that would only add complexity to the data, as it is an object and not a numerical value. This column will be deleted from the dataset:

In [None]:
credit_data.drop('CUST_ID', axis = 1, inplace = True)

In [None]:
credit_data.columns

In [None]:
sns.set_palette("Set1")
plt.rcParams.update({'font.size': 12})
sns.set_style("whitegrid")
credit_data.hist(bins=40, figsize=(30, 30));

We can extract some insights for some of the most relevant variables:
* BALANCE values are most frequent around 1000 dollars.
* PURCHASES values concentrate below 5000 dollars.
* BALANCE FREQUENCY - we can see that clients frequently update the balance in their accounts.
* ONEOFF_PURCHASES and INSTALLMENT_PURCHASES - looking at the scale of the graphs, we notice that purchases with installments are more frequent for values no greater than 5000 dollars and one-off purchases are more frequent for values no greater than 10000 dollars.
* PURCHASE FREQUENCY shows a segmentation of clients: one group makes purchases very frequently, while the other group rarely makes purchases.
* MINIMUM PAYMENTS and PRC FULL PAYMENT - these variables show us that many clients opt for paying the minimum of their credit card bill. Very few clients pay the full bill. This is also good for the bank, as the interest charged on unpaid credit card bills is high.
* TENURE shows that most of the clients are long-term clients, concentrated at the highest tenure value in the data (12).

In [None]:
#plt.figure(figsize=(20,80))
#sns.set_palette("cool_r")
#sns.set_style("darkgrid")
#for i in range(len(credit_data.columns)):
 # plt.subplot(9,2,i+1)
 # sns.distplot(credit_data[credit_data.columns[i]], kde = True)
 # plt.title(credit_data.columns[i])
#plt.tight_layout();

### Visualizing the correlations between variables:

In [None]:
correlations = credit_data.corr()
correlations

In [None]:
f, ax = plt.subplots(figsize=(20,15))
sns.heatmap(correlations, annot=True);

Correlation is stronger as the absolute value approaches 1. From the correlation matrix we take that:
* PURCHASE INSTALLMENTS FREQUENCY is correlated with PURCHASES FREQUENCY, which is consistent with the earlier observation that purchases with installments are the more frequent purchase mode.
* PURCHASES and ONEOFF PURCHASES are strongly correlated, and it seems that most of the purchase values come from one-off purchases. The correlation between INSTALLMENTS PURCHASES and PURCHASES is 0.68, not as strong as the correlation with one-off purchases.

## 3 - Clustering the data

K-means, an unsupervised learning algorithm, will be used to group similar clients together.


### Scaling the data before clustering
We have data on frequency, which varies from 0 to 1, and data on payments, which has a much greater scale. Before applying a clustering algorithm it is important to put the data on the same scale, since the algorithm relies on the distance between data points and large-scale variables would otherwise dominate that distance.


In [None]:
scaler = StandardScaler()
credit_data_scaled = scaler.fit_transform(credit_data)

Checking scaling:

In [None]:
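# Column 0 of the scaled array corresponds to BALANCE (index [0] alone would select the first row)
minmax_nonscaled = credit_data['BALANCE'].min(), credit_data['BALANCE'].max()
minmax_scaled = credit_data_scaled[:, 0].min(), credit_data_scaled[:, 0].max()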

print("Minimum and maximum values before scaling = {}".format(minmax_nonscaled))
print("Minimum and maximum values after scaling = {}".format(minmax_scaled))
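To make the effect of standardization concrete, here is a minimal sketch that reproduces the calculation by hand on a toy matrix (illustrative values, not from the dataset). It assumes the usual definition of standardization, subtracting the column mean and dividing by the population standard deviation, which is what `StandardScaler` computes with its default settings.

```python
import numpy as np

# Toy matrix: column 0 on a large scale, column 1 between 0 and 1
X = np.array([[1000.0, 0.1],
              [2000.0, 0.5],
              [3000.0, 0.9]])

mean = X.mean(axis=0)
std = X.std(axis=0)           # population standard deviation (ddof=0)
X_scaled = (X - mean) / std   # z-score of every value, column by column

print(X_scaled)
print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns span a comparable range, so neither dominates the Euclidean distances used by k-means.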

### Determining number of clusters with the Elbow Method

To choose the best number of clusters the elbow method will be implemented. This is one of the most popular methods to determine the number of clusters. 

The elbow method is based on the WCSS (within-cluster sum of squares): the sum of the squared distances between each data point and the centroid of its cluster. A lower WCSS means less variability of the data inside the clusters. Because WCSS keeps decreasing as clusters are added, the goal is not to minimize it outright but to find the "elbow": the number of clusters after which adding more brings only a small reduction.
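As an illustration of the quantity involved, the sketch below computes the WCSS by hand for four toy points. The helper name `within_cluster_ss` and the data are made up for this example; in the notebook the same quantity is read from the `inertia_` attribute of a fitted `KMeans` object.

```python
import numpy as np

def within_cluster_ss(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]   # centroids[labels] picks one centroid per point
    return float(np.sum(diffs ** 2))

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# Two clusters: each point is at distance 1 from its centroid
two_labels = np.array([0, 0, 1, 1])
two_centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
wcss_two = within_cluster_ss(points, two_labels, two_centroids)

# One cluster: every point is assigned to the overall mean
one_labels = np.zeros(len(points), dtype=int)
one_centroid = points.mean(axis=0, keepdims=True)
wcss_one = within_cluster_ss(points, one_labels, one_centroid)

print(wcss_one, wcss_two)  # 104.0 4.0
```

Going from one cluster to two cuts the WCSS sharply here because the toy points form two well-separated groups.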

In [None]:
wcss= []
range_values = range(1, 20)
for i in range_values:
  kmeans = KMeans(n_clusters=i)
  kmeans.fit(credit_data_scaled)
  wcss.append(kmeans.inertia_)

In [None]:
print(wcss)

In [None]:
plt.figure(figsize=(15,8))
# Plot against range_values so the x-axis shows the actual number of clusters
plt.plot(range_values, wcss, 'o-', color='c')
plt.xlabel('Number of clusters', fontsize=14)
plt.ylabel('WCSS', fontsize=14);

Using the elbow method it seems that the optimum number of clusters is between 7 and 10. 
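Reading the elbow from a plot is subjective. One possible complement, sketched below, is to look at the relative reduction in WCSS obtained by each additional cluster. The WCSS values here are hypothetical and chosen only to illustrate the calculation; in the notebook the `wcss` list would be used instead.

```python
import numpy as np

# Hypothetical WCSS values for k = 1..8 (illustrative, not results from the dataset)
wcss_example = np.array([1000.0, 700.0, 520.0, 400.0, 340.0, 310.0, 295.0, 285.0])
ks = np.arange(1, len(wcss_example) + 1)

# Relative improvement gained by going from k-1 to k clusters
rel_drop = -np.diff(wcss_example) / wcss_example[:-1]

for k, drop in zip(ks[1:], rel_drop):
    print(f"k={k}: WCSS drops by {drop:.1%}")
```

When the relative drop becomes small and stays small, additional clusters add little; where exactly to draw that line remains a judgment call.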

### Implementing the number of clusters

Testing the implementation with 8 clusters:

In [None]:
kmeans = KMeans(n_clusters=8)
kmeans.fit(credit_data_scaled)
labels = kmeans.labels_

In [None]:
#Checking the number of clients per label:
np.unique(labels, return_counts=True)

In [None]:
# Centroid of each group, converted back to the original (unscaled) units
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
# Pass the column index directly; wrapping it in a list would create a MultiIndex
cluster_centers = pd.DataFrame(data = cluster_centers, columns = credit_data.columns)
cluster_centers
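The inverse transformation is what makes the centroids readable in dollars and frequencies again. A minimal sketch of the arithmetic, on toy data and assuming standard z-score scaling: each scaled value is multiplied by the column's standard deviation and the column mean is added back.

```python
import numpy as np

# Toy data: column 0 in dollars, column 1 a frequency between 0 and 1
X = np.array([[100.0, 0.2],
              [300.0, 0.4],
              [500.0, 0.9]])
mean, std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / std

# A centroid in scaled space: here, the mean of the first two scaled rows
centroid_scaled = X_scaled[:2].mean(axis=0)

# Undo the standardization to express the centroid in the original units
centroid_original = centroid_scaled * std + mean

print(centroid_original)  # close to [200.0, 0.3], the mean of the first two original rows
```

Because standardization is a linear operation, the back-transformed centroid equals the mean of the original rows that belong to the cluster.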

Adding the cluster information to the original dataframe:

In [None]:
credit_data_cluster = pd.concat([credit_data, pd.DataFrame({'GROUP': labels})], axis = 1)
credit_data_cluster.head()

In [None]:
for i in credit_data.columns:
  plt.figure(figsize=(30,5))
  for j in range(8):
    plt.subplot(1, 8, j + 1)
    cluster = credit_data_cluster[credit_data_cluster['GROUP'] == j]
    cluster[i].hist(bins = 20)
    plt.title('{} \nGroup {}'.format(i, j))
  plt.show()
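Histograms per group can be complemented by a compact table of group sizes and per-group averages. The sketch below shows the idea on a toy DataFrame with illustrative values; with the notebook's data the same calls would be applied to `credit_data_cluster`.

```python
import pandas as pd

# Toy frame standing in for credit_data_cluster (illustrative values only)
toy = pd.DataFrame({
    'BALANCE': [100.0, 300.0, 2000.0, 4000.0],
    'PURCHASES': [50.0, 150.0, 900.0, 1100.0],
    'GROUP': [0, 0, 1, 1],
})

# Average of every feature inside each group
group_means = toy.groupby('GROUP').mean()

# Number of clients in each group, ordered by group label
group_sizes = toy['GROUP'].value_counts().sort_index()

print(group_means)
print(group_sizes)
```

A table like this makes it easier to name the segments, for example a low-balance, low-spending group versus a high-balance, high-spending group.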

#### Ordering the data by group and saving it into a new csv file

In [None]:
ordered_data = credit_data_cluster.sort_values(by = 'GROUP')
ordered_data.head()

In [None]:
ordered_data.to_csv('group.csv')

## 4 - Principal Component Analysis

In this section, PCA (Principal Component Analysis) will be used to reduce the dimensionality of the data. It creates new uncorrelated variables, the principal components, that successively maximize variance.
By doing this, PCA makes the data easier to visualize and interpret while minimizing information loss.
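To make "uncorrelated variables that successively maximize variance" concrete, the following sketch performs PCA by hand with a singular value decomposition on random toy data (not the credit card dataset). It shows how much variance each component explains and that the projected components are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples, 3 features, with feature 1 strongly tied to feature 0
a = rng.normal(size=200)
X = np.column_stack([a, 2 * a + 0.1 * rng.normal(size=200), rng.normal(size=200)])

Xc = X - X.mean(axis=0)                           # PCA works on centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Variance captured by each component, and its share of the total
explained_variance = S ** 2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Projection of the data onto the first two principal components
scores = Xc @ Vt[:2].T

print(explained_ratio)
print(np.round(np.corrcoef(scores.T), 6))         # off-diagonal entries are ~0
```

With scikit-learn, the equivalent share of variance is available after fitting as `pca.explained_variance_ratio_`, which is a useful check on how much information a 2D projection retains.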


In [None]:
pca = PCA(n_components=2)
principal_comp = pca.fit_transform(credit_data_scaled)
principal_comp

In [None]:
pca_data = pd.DataFrame(data = principal_comp, columns=['pca1', 'pca2'])
pca_data.head()

In [None]:
pca_data = pd.concat([pca_data, pd.DataFrame({'GROUP': labels})], axis = 1)
pca_data.head()

In [None]:
plt.figure(figsize=(20,8))
sns.scatterplot(x = 'pca1', y = 'pca2', hue = 'GROUP', data = pca_data, palette = 'Set1')
plt.xlabel("PCA 1", fontsize=14)
plt.ylabel("PCA 2", fontsize=14);

## 5 - Autoencoders

Autoencoders are another technique for dimensionality reduction; they can be used as an alternative to PCA or in conjunction with it. An autoencoder is a type of neural network that compresses the information of the input variables into a lower-dimensional space and then tries to reconstruct the input data from that compressed representation.


In [None]:
credit_data_scaled.shape

In [None]:
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

In [None]:
# INPUT LAYER: 17 neurons (one per feature)
# 1st HIDDEN LAYER: 500 neurons, ReLU activation
# 2nd HIDDEN LAYER: 2000 neurons, ReLU activation
input_data = Input(shape=(17,))
x = Dense(500, activation='relu')(input_data)
x = Dense(2000, activation='relu')(x)

encoded = Dense(10, activation='relu')(x)

x = Dense(2000, activation='relu')(encoded)
x = Dense(500, activation='relu')(x)

decoded = Dense(17)(x)

In [None]:
autoencoder = Model(input_data, decoded)

A separate encoder model, sharing the same layers, gives access to the encoded (compressed) data only:

In [None]:
encoder = Model(input_data, encoded)

### Training the autoencoder

In [None]:
#Using Adam optimizer
autoencoder.compile(optimizer = 'Adam', loss = 'mean_squared_error')

In [None]:
autoencoder.fit(credit_data_scaled, credit_data_scaled, epochs = 50)
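One way to judge how well a trained autoencoder reconstructs its input is to compute the mean squared error per row, the same quantity the loss averages over the whole dataset. The helper below is a hypothetical sketch using plain NumPy and toy arrays; the name `reconstruction_mse` is not part of Keras.

```python
import numpy as np

def reconstruction_mse(original, reconstructed):
    """Mean squared error per row between the input and its reconstruction."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    return np.mean((original - reconstructed) ** 2, axis=1)

# Toy example: the first row is reconstructed perfectly, the second is not
original = np.array([[1.0, 2.0], [3.0, 4.0]])
reconstructed = np.array([[1.0, 2.0], [2.0, 6.0]])

errors = reconstruction_mse(original, reconstructed)
print(errors)  # [0.  2.5]
```

In the notebook this could be called as `reconstruction_mse(credit_data_scaled, autoencoder.predict(credit_data_scaled))`; rows with unusually high error are clients the compressed representation describes poorly.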

In [None]:
credit_data_scaled.shape

In [None]:
compact_data = encoder.predict(credit_data_scaled)

#### Defining new clusters

In [None]:
compact_data.shape

In [None]:
credit_data_scaled[0]

In [None]:
compact_data[0]

In [None]:
wcss_2 = []
range_values = range(1, 20)
for i in range_values:
  kmeans = KMeans(n_clusters=i)
  kmeans.fit(compact_data)
  wcss_2.append(kmeans.inertia_)

In [None]:
plt.plot(range_values, wcss_2, 'bx-')
plt.xlabel('Clusters')
plt.ylabel('WCSS');

In [None]:
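# wcss: values from the original scaled data; wcss_2: values from the encoded data
plt.plot(range_values, wcss, 'x-', color = 'c')
plt.plot(range_values, wcss_2, 'x-', color = 'm');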

The second WCSS curve flattens out, becoming close to linear, at around 3 to 4 clusters.

In [None]:
kmeans = KMeans(n_clusters=4)
kmeans.fit(compact_data)

In [None]:
labels = kmeans.labels_
labels, labels.shape

In [None]:
data_cluster_at = pd.concat([credit_data, pd.DataFrame({'cluster': labels})], axis = 1)
data_cluster_at.head()

Applying PCA to the new dataset, as a second reduction of dimensionality:

In [None]:
pca = PCA(n_components = 2)
prin_comp = pca.fit_transform(compact_data)
pca_df = pd.DataFrame(data = prin_comp, columns = ['pca1', 'pca2'])
pca_df.head()

In [None]:
pca_df = pd.concat([pca_df, pd.DataFrame({'cluster': labels})], axis = 1)
pca_df.head()

In [None]:
plt.figure(figsize=(10,10))
sns.scatterplot(x = 'pca1', y = 'pca2', hue = 'cluster', data = pca_df, palette = ['cyan', 'black', 'blue', 'pink']);

In [None]:
df_cluster_ordered = data_cluster_at.sort_values(by = 'cluster')
df_cluster_ordered.head()

In [None]:
df_cluster_ordered.tail()

Saving the results:

In [None]:
# Written as .xlsx because recent pandas versions no longer write the legacy .xls format
df_cluster_ordered.to_excel('cluster_ordered.xlsx')