# Inside unsupervised learning: Group segmentation using clustering
## Build systems to segment users into distinct and homogenous groups
### by Ankur A. Patel + O'Reilly Media, Inc.

## Overview - Part B
In this notebook, you will learn how to:
1. Perform good feature engineering
2. Cluster users into distinct and homogenous groups
3. Efficiently label a dataset after clustering, turning an unsupervised problem into a semi-supervised one

Specifically, we will cluster borrowers from Lending Club into distinct groups using the clustering algorithms we introduced in Part A of this course.

## Data Preparation
Let's load in the Lending Club dataset.

In [None]:
# Import libraries
'''Main'''
import numpy as np
import pandas as pd
import os, time, re, pickle, gzip

'''Data Viz'''
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
import matplotlib as mpl

%matplotlib inline

'''Data Prep and Model Evaluation'''
from sklearn import preprocessing as pp
from sklearn.model_selection import train_test_split 
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.metrics import roc_curve, auc, roc_auc_score

'''Algorithms'''
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import fastcluster
from scipy.cluster.hierarchy import dendrogram, cophenet, fcluster
from scipy.spatial.distance import pdist

In [None]:
# Load the datasets
os.chdir('/home/jovyan/')
current_path = os.getcwd()
file = '/data/lending_club_data/LoanStats3a.csv'
data = pd.read_csv(current_path + file)

In [None]:
# Select columns to keep
columnsToKeep = ['loan_amnt','funded_amnt','funded_amnt_inv','term', \
                 'int_rate','installment','grade','sub_grade', \
                 'emp_length','home_ownership','annual_inc', \
                 'verification_status','pymnt_plan','purpose', \
                 'addr_state','dti','delinq_2yrs','earliest_cr_line', \
                 'mths_since_last_delinq','mths_since_last_record', \
                 'open_acc','pub_rec','revol_bal','revol_util', \
                 'total_acc','initial_list_status','out_prncp', \
                 'out_prncp_inv','total_pymnt','total_pymnt_inv', \
                 'total_rec_prncp','total_rec_int','total_rec_late_fee', \
                 'recoveries','collection_recovery_fee','last_pymnt_d', \
                 'last_pymnt_amnt']

data = data.loc[:,columnsToKeep]

In [None]:
# Explore shape of data
data.shape

In [None]:
# View first 5 rows of the data
data.head()

In [None]:
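# Transform features from string to numeric
for i in ["term","int_rate","emp_length","revol_util"]:
    # Keep digits and the decimal point so "10.65%" becomes 10.65, not 1065
    cleaned = data[i].astype(str).str.replace(r"[^0-9.]", "", regex=True)
    # Assign the whole column so its dtype becomes numeric (NaN where empty)
    data[i] = pd.to_numeric(cleaned, errors="coerce")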
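
To make the effect of this cleaning step concrete, here is a minimal sketch on a hand-made frame. The sample values are invented for illustration (they only mimic the string formats in the Lending Club file), and `toy` is a hypothetical name. This sketch keeps the decimal point when stripping characters, so a rate such as "10.65%" stays 10.65.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values in the same string formats as the raw file
toy = pd.DataFrame({
    "term": [" 36 months", " 60 months", np.nan],
    "int_rate": ["10.65%", "7.9%", np.nan],
    "emp_length": ["10+ years", "< 1 year", "n/a"],
})

for col in toy.columns:
    # Drop everything except digits and the decimal point
    cleaned = toy[col].astype(str).str.replace(r"[^0-9.]", "", regex=True)
    # Entries with no digits left (missing values, "n/a") become NaN
    toy[col] = pd.to_numeric(cleaned, errors="coerce")

print(toy)
print(toy.dtypes)
```

Every column ends up with a numeric dtype, and entries that held no digits are left as NaN for the imputation step to handle.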

In [None]:
# Determine which features are numerical
numericalFeats = [x for x in data.columns if data[x].dtype != 'object']

In [None]:
# Display NaNs by feature
nanCounter = np.isnan(data.loc[:,numericalFeats]).sum()
nanCounter

In [None]:
# Impute NaNs with mean 
fillWithMean = ['loan_amnt','funded_amnt','funded_amnt_inv','term', \
                'int_rate','installment','emp_length','annual_inc',\
                'dti','open_acc','revol_bal','revol_util','total_acc',\
                'out_prncp','out_prncp_inv','total_pymnt', \
                'total_pymnt_inv','total_rec_prncp','total_rec_int', \
                'last_pymnt_amnt']

# Impute NaNs with zero
fillWithZero = ['delinq_2yrs','mths_since_last_delinq', \
                'mths_since_last_record','pub_rec','total_rec_late_fee', \
                'recoveries','collection_recovery_fee']

# Perform imputation
# Mean imputation with scikit-learn's SimpleImputer
from sklearn.impute import SimpleImputer
im = SimpleImputer(strategy='mean')
data.loc[:,fillWithMean] = im.fit_transform(data[fillWithMean])

data.loc[:,fillWithZero] = data.loc[:,fillWithZero].fillna(value=0,axis=1)

In [None]:
# Check for NaNs one last time
nanCounter = np.isnan(data.loc[:,numericalFeats]).sum()
nanCounter
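
The two imputation strategies can be compared on a tiny example. This is a minimal sketch with invented numbers; `toy` and `mean_imputer` are hypothetical names. It assumes `SimpleImputer` from `sklearn.impute` for the mean strategy and plain `fillna` for the zero strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented values: one missing income, two missing delinquency counts
toy = pd.DataFrame({
    "annual_inc": [40000.0, np.nan, 80000.0],
    "delinq_2yrs": [np.nan, 1.0, np.nan],
})

# Mean imputation: the missing income becomes (40000 + 80000) / 2
mean_imputer = SimpleImputer(strategy="mean")
toy[["annual_inc"]] = mean_imputer.fit_transform(toy[["annual_inc"]])

# Zero imputation: a missing count is replaced with 0
toy["delinq_2yrs"] = toy["delinq_2yrs"].fillna(0)

print(toy)
```

Mean imputation suits features where a typical value is a sensible stand-in. Zero imputation suits count-like features where a missing entry plausibly means that nothing was recorded.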

## Feature Engineering & Scaling

In [None]:
# Feature engineering
data['installmentOverLoanAmnt'] = data.installment/data.loan_amnt
data['loanAmntOverIncome'] = data.loan_amnt/data.annual_inc
data['revol_balOverIncome'] = data.revol_bal/data.annual_inc
data['totalPymntOverIncome'] = data.total_pymnt/data.annual_inc
data['totalPymntInvOverIncome'] = data.total_pymnt_inv/data.annual_inc
data['totalRecPrncpOverIncome'] = data.total_rec_prncp/data.annual_inc
data['totalRecIncOverIncome'] = data.total_rec_int/data.annual_inc

newFeats = ['installmentOverLoanAmnt','loanAmntOverIncome', \
            'revol_balOverIncome','totalPymntOverIncome', \
           'totalPymntInvOverIncome','totalRecPrncpOverIncome', \
            'totalRecIncOverIncome']

In [None]:
# Select features for training
# Copy so that scaling X_train does not touch the original data
numericalPlusNewFeats = numericalFeats+newFeats
X_train = data.loc[:,numericalPlusNewFeats].copy()

# Scale data
sX = pp.StandardScaler()
X_train.loc[:,:] = sX.fit_transform(X_train)
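
Scaling matters here because K-means and Ward linkage rely on Euclidean distances, so a feature measured in dollars would otherwise dominate one measured in percent. The following sketch, with invented numbers and hypothetical names (`toy`, `scaler`, `scaled`), shows what `StandardScaler` does to each column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented values on very different scales
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 10000.0, 15000.0, 30000.0],
    "int_rate": [7.5, 10.0, 12.5, 18.0],
})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy),
                      index=toy.index, columns=toy.columns)

# Each column now has mean 0 and population standard deviation 1
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

Note that `StandardScaler` uses the population standard deviation, which is why the check passes `ddof=0`.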

In [None]:
# View new columns
X_train.columns

In [None]:
# Designate labels for evaluation
labels = data.grade
labels.unique()

In [None]:
# Fill missing labels
labels = labels.fillna(value="Z")

# Convert labels to numerical values
lbl = pp.LabelEncoder()
lbl.fit(list(labels.values))
labels = pd.Series(data=lbl.transform(labels.values), name="grade")

# Store as y_train
y_train = labels
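
As a small illustration of the encoding step, the sketch below runs `LabelEncoder` on a handful of invented grades. The names `grades`, `encoder`, and `encoded` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented grades, including one missing value
grades = pd.Series(["B", "A", None, "C", "A"], name="grade")
grades = grades.fillna("Z")

encoder = LabelEncoder()
encoded = pd.Series(encoder.fit_transform(grades), name="grade")

# classes_ is sorted, so A -> 0, B -> 1, C -> 2, Z -> 3
print(list(encoder.classes_))
print(encoded.tolist())

# inverse_transform recovers the original letters
print(list(encoder.inverse_transform(encoded)))
```

Because the classes are sorted alphabetically, the placeholder "Z" for missing grades always receives the largest code.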

In [None]:
# View new labels vs. original labels
labelsOriginalVSNew = pd.concat([labels, data.grade],axis=1)
labelsOriginalVSNew

In [None]:
# Compare loan grades with interest rates
interestAndGrade = pd.DataFrame(data=[data.int_rate,labels])
interestAndGrade = interestAndGrade.T

interestAndGrade.groupby("grade").mean()

In [None]:
# Define function to evaluate goodness of the clusters

In [None]:
def analyzeCluster(clusterDF, labelsDF):
    countByCluster = \
        pd.DataFrame(data=clusterDF['cluster'].value_counts())
    countByCluster.reset_index(inplace=True,drop=False)
    countByCluster.columns = ['cluster','clusterCount']
        
    preds = pd.concat([labelsDF,clusterDF], axis=1)
    preds.columns = ['trueLabel','cluster']
    
    countByLabel = pd.DataFrame(data=preds.groupby('trueLabel').count())
        
    countMostFreq = pd.DataFrame(data=preds.groupby('cluster').agg( \
        lambda x:x.value_counts().iloc[0]))
    countMostFreq.reset_index(inplace=True,drop=False)
    countMostFreq.columns = ['cluster','countMostFrequent']
    
    accuracyDF = countMostFreq.merge(countByCluster, \
        left_on="cluster",right_on="cluster")
    
    overallAccuracy = accuracyDF.countMostFrequent.sum()/ \
        accuracyDF.clusterCount.sum()
    
    accuracyByLabel = accuracyDF.countMostFrequent/ \
        accuracyDF.clusterCount
    
    return countByCluster, countByLabel, countMostFreq, \
        accuracyDF, overallAccuracy, accuracyByLabel
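
The "accuracy" that `analyzeCluster` reports is what is often called cluster purity: each cluster is credited with its most frequent true label, and the matching points are counted as correct. Here is a compact sketch of the same idea on ten invented points. `cluster_purity` is a hypothetical helper written for this illustration, not part of the notebook.

```python
import pandas as pd

def cluster_purity(clusters, labels):
    """Share of points whose label matches their cluster's most common label."""
    df = pd.DataFrame({"cluster": list(clusters), "label": list(labels)})
    # Size of the largest label group inside each cluster
    majority = df.groupby("cluster")["label"].agg(
        lambda s: s.value_counts().iloc[0])
    # Total number of points in each cluster
    size = df.groupby("cluster")["label"].size()
    return majority.sum() / size.sum(), majority / size

# Invented assignments: three clusters, three true labels
clusters = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
labels = ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"]

overall, by_cluster = cluster_purity(clusters, labels)
print(overall)
print(by_cluster)
```

Cluster 0 holds three A's out of four points (0.75), cluster 1 holds two B's out of three, and cluster 2 is pure, giving 8 correct out of 10 overall. Keep in mind that purity tends to rise as the number of clusters grows, and reaches 1 when every point is its own cluster, so it should be read alongside the cluster count.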

## Clustering Application #1 - K-means

In [None]:
from sklearn.cluster import KMeans

n_clusters = 10
n_init = 10
max_iter = 300
tol = 0.0001
random_state = 2018

kmeans = KMeans(n_clusters=n_clusters, n_init=n_init, \
                max_iter=max_iter, tol=tol, \
                random_state=random_state)

kMeans_inertia = pd.DataFrame(data=[],index=range(10,31), \
                              columns=['inertia'])

overallAccuracy_kMeansDF = pd.DataFrame(data=[], \
    index=range(10,31),columns=['overallAccuracy'])

for n_clusters in range(10,31):
    kmeans = KMeans(n_clusters=n_clusters, n_init=n_init, \
                    max_iter=max_iter, tol=tol, \
                    random_state=random_state)

    kmeans.fit(X_train)
    kMeans_inertia.loc[n_clusters] = kmeans.inertia_
    X_train_kmeansClustered = kmeans.predict(X_train)
    X_train_kmeansClustered = pd.DataFrame(data= \
        X_train_kmeansClustered, index=X_train.index, \
        columns=['cluster'])
    
    countByCluster_kMeans, countByLabel_kMeans, \
    countMostFreq_kMeans, accuracyDF_kMeans, \
    overallAccuracy_kMeans, accuracyByLabel_kMeans = \
    analyzeCluster(X_train_kmeansClustered, y_train)
    
    overallAccuracy_kMeansDF.loc[n_clusters] = \
        overallAccuracy_kMeans

In [None]:
# Overall accuracy as the number of clusters increases
overallAccuracy_kMeansDF.plot()

In [None]:
# Accuracy by cluster for the last run of the loop (30 clusters)
accuracyByLabel_kMeans
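
The loop above also records `inertia_`, the within-cluster sum of squared distances, for each value of `n_clusters`. The sketch below shows how inertia behaves as the number of clusters grows. It uses synthetic data from `make_blobs` rather than the Lending Club features, and `X_toy` and `inertias` are hypothetical names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 points drawn around 4 centers (illustrative only)
X_toy, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0,
                      random_state=2018)

inertias = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=2018)
    km.fit(X_toy)
    # Sum of squared distances from each point to its assigned centroid
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Inertia generally falls as clusters are added, so on its own it cannot pick the number of clusters. A common heuristic is to look for the point where the decline flattens out.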

## Clustering Application #2 - Hierarchical Clustering

In [None]:
import fastcluster
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

Z = fastcluster.linkage_vector(X_train, method='ward', \
                               metric='euclidean')

Z_dataFrame = pd.DataFrame(data=Z,columns=['clusterOne', \
                'clusterTwo','distance','newClusterSize'])

In [None]:
# View first 20 clustered rows
Z_dataFrame[:20]

In [None]:
# View last 20 clustered rows
Z_dataFrame.tail(20)
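
To read the linkage matrix, it helps to look at a case small enough to check by hand. The sketch below uses four invented points and SciPy's `linkage` as a stand-in for `fastcluster.linkage_vector` (an assumption made to keep the example self-contained; both return a matrix in the same four-column format). `X_toy` and `Z_toy` are hypothetical names.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two tight pairs of points that are far apart from each other
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

# Each row: [cluster i, cluster j, merge distance, size of new cluster]
Z_toy = linkage(X_toy, method="ward", metric="euclidean")
print(Z_toy)

# Cutting below the final merge distance keeps the two pairs separate
low_cut = fcluster(Z_toy, t=5, criterion="distance")
print(low_cut)

# Cutting above it merges everything into one cluster
high_cut = fcluster(Z_toy, t=100, criterion="distance")
print(high_cut)
```

With n points the matrix has n - 1 rows, one per merge. The first rows join the closest points, and the last row joins the final two clusters into one that contains every point.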

In [None]:
# Cut off tree and see how many clusters are left
from scipy.cluster.hierarchy import fcluster

distance_threshold = 100
clusters = fcluster(Z, distance_threshold, criterion='distance')
X_train_hierClustered = pd.DataFrame(data=clusters, \
    index=X_train.index,columns=['cluster'])

In [None]:
# Number of clusters left after cutting off the tree
print("Number of distinct clusters: ", \
      len(X_train_hierClustered['cluster'].unique()))

In [None]:
# Evaluate overall accuracy from hierarchical clustering
countByCluster_hierClust, countByLabel_hierClust, \
    countMostFreq_hierClust, accuracyDF_hierClust, \
    overallAccuracy_hierClust, accuracyByLabel_hierClust = \
    analyzeCluster(X_train_hierClustered, y_train)

print("Overall accuracy from hierarchical clustering: ", \
      overallAccuracy_hierClust)

In [None]:
# View accuracy by cluster
print("Accuracy by cluster for hierarchical clustering")
accuracyByLabel_hierClust

## Clustering Application #3 - HDBSCAN

In [None]:
import hdbscan

min_cluster_size = 20
min_samples = 20
alpha = 1.0
cluster_selection_method = 'leaf'

hdb = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, \
    min_samples=min_samples, alpha=alpha, \
    cluster_selection_method=cluster_selection_method)

X_train_hdbscanClustered = hdb.fit_predict(X_train)
X_train_hdbscanClustered = pd.DataFrame(data= \
    X_train_hdbscanClustered, index=X_train.index, \
    columns=['cluster'])

countByCluster_hdbscan, countByLabel_hdbscan, \
    countMostFreq_hdbscan, accuracyDF_hdbscan, \
    overallAccuracy_hdbscan, accuracyByLabel_hdbscan = \
    analyzeCluster(X_train_hdbscanClustered, y_train)

In [None]:
# View overall accuracy from HDBSCAN
print("Overall accuracy from HDBSCAN: ", overallAccuracy_hdbscan)

In [None]:
# View count of entities within each cluster
print("Cluster results for HDBSCAN")
countByCluster_hdbscan

In [None]:
# View accuracy by cluster
accuracyByLabel_hdbscan
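
One detail to keep in mind when reading these numbers: HDBSCAN assigns the label -1 to points it treats as noise, and `analyzeCluster` counts -1 as if it were an ordinary cluster. The sketch below, on invented labels with hypothetical names, shows one way to report the noise share separately and compute purity over assigned points only.

```python
import pandas as pd

# Hypothetical HDBSCAN output: -1 marks points left unassigned as noise
clusters = pd.Series([-1, -1, 0, 0, 0, 1, 1, -1, 1, 1], name="cluster")
labels = pd.Series(list("ABAAABBCBC"), name="trueLabel")

# Fraction of points flagged as noise
noise_share = (clusters == -1).mean()

# Purity over the points that were assigned to a real cluster
mask = clusters != -1
assigned = pd.DataFrame({"cluster": clusters[mask],
                         "trueLabel": labels[mask]})
majority = assigned.groupby("cluster")["trueLabel"].agg(
    lambda s: s.value_counts().iloc[0])
purity_assigned = majority.sum() / len(assigned)

print(noise_share)
print(purity_assigned)
```

Here three of ten points are noise, and six of the seven assigned points match their cluster's majority label. Reporting both figures makes it clear when a high purity comes at the cost of leaving many points unclustered.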

## Exercises
Adjust the parameters for K-means, hierarchical clustering, and HDBSCAN per the instructions and recalculate overall accuracy on the training set.

### K-Means
Use 50 clusters and recalculate accuracy.

In [None]:
from sklearn.cluster import KMeans

n_clusters = #Fill in
n_init = 10
max_iter = 300
tol = 0.0001
random_state = 2018

kmeans = KMeans(n_clusters=n_clusters, n_init=n_init, \
                max_iter=max_iter, tol=tol, \
                random_state=random_state)

kmeans.fit(#Fill in)
X_train_kmeansClustered = kmeans.predict(#Fill in)
X_train_kmeansClustered = pd.DataFrame(data= \
        X_train_kmeansClustered, index=X_train.index, \
        columns=['cluster'])
    
countByCluster_kMeans, countByLabel_kMeans, \
countMostFreq_kMeans, accuracyDF_kMeans, \
overallAccuracy_kMeans, accuracyByLabel_kMeans = \
analyzeCluster(X_train_kmeansClustered, y_train)

print("Overall accuracy from k-means: ", \
      overallAccuracy_kMeans)

### Hierarchical clustering
Use a distance threshold (i.e., tree cutoff) of 50 and recalculate accuracy.

In [None]:
import fastcluster
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

Z = fastcluster.linkage_vector(X_train, method='ward', \
                               metric='euclidean')

Z_dataFrame = pd.DataFrame(data=Z,columns=['clusterOne', \
                'clusterTwo','distance','newClusterSize'])

# Cut off tree and see how many clusters are left
from scipy.cluster.hierarchy import fcluster

distance_threshold = #Fill in
clusters = fcluster(Z, distance_threshold, criterion='distance')
X_train_hierClustered = pd.DataFrame(data=clusters, \
    index=X_train.index,columns=['cluster'])

# Number of clusters left after cutting off the tree
print("Number of distinct clusters: ", \
      len(X_train_hierClustered['cluster'].unique()))

# Evaluate overall accuracy from hierarchical clustering
countByCluster_hierClust, countByLabel_hierClust, \
    countMostFreq_hierClust, accuracyDF_hierClust, \
    overallAccuracy_hierClust, accuracyByLabel_hierClust = \
    analyzeCluster(X_train_hierClustered, y_train)

print("Overall accuracy from hierarchical clustering: ", \
      overallAccuracy_hierClust)

### HDBSCAN
Use min_cluster_size of 10 and min_samples of 5 and recalculate accuracy.

In [None]:
import hdbscan

min_cluster_size = #Fill in
min_samples = #Fill in
alpha = 1.0
cluster_selection_method = 'leaf'

hdb = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, \
    min_samples=min_samples, alpha=alpha, \
    cluster_selection_method=cluster_selection_method)

X_train_hdbscanClustered = hdb.fit_predict(X_train)
X_train_hdbscanClustered = pd.DataFrame(data= \
    X_train_hdbscanClustered, index=X_train.index, \
    columns=['cluster'])

countByCluster_hdbscan, countByLabel_hdbscan, \
    countMostFreq_hdbscan, accuracyDF_hdbscan, \
    overallAccuracy_hdbscan, accuracyByLabel_hdbscan = \
    analyzeCluster(X_train_hdbscanClustered, y_train)

print("Overall accuracy from HDBSCAN: ", \
      overallAccuracy_hdbscan)

## Answers to the Exercises

In [None]:
# Exercise 1 Answers
# K-means
from sklearn.cluster import KMeans

n_clusters = 50
n_init = 10
max_iter = 300
tol = 0.0001
random_state = 2018

kmeans = KMeans(n_clusters=n_clusters, n_init=n_init, \
                max_iter=max_iter, tol=tol, \
                random_state=random_state)

kmeans.fit(X_train)
X_train_kmeansClustered = kmeans.predict(X_train)
X_train_kmeansClustered = pd.DataFrame(data= \
        X_train_kmeansClustered, index=X_train.index, \
        columns=['cluster'])
    
countByCluster_kMeans, countByLabel_kMeans, \
countMostFreq_kMeans, accuracyDF_kMeans, \
overallAccuracy_kMeans, accuracyByLabel_kMeans = \
analyzeCluster(X_train_kmeansClustered, y_train)

print("Overall accuracy from k-means: ", \
      overallAccuracy_kMeans)

In [None]:
# Exercise 2 Answers
# Hierarchical clustering
import fastcluster
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

Z = fastcluster.linkage_vector(X_train, method='ward', \
                               metric='euclidean')

Z_dataFrame = pd.DataFrame(data=Z,columns=['clusterOne', \
                'clusterTwo','distance','newClusterSize'])

# Cut off tree and see how many clusters are left
from scipy.cluster.hierarchy import fcluster

distance_threshold = 50
clusters = fcluster(Z, distance_threshold, criterion='distance')
X_train_hierClustered = pd.DataFrame(data=clusters, \
    index=X_train.index,columns=['cluster'])

# Number of clusters left after cutting off the tree
print("Number of distinct clusters: ", \
      len(X_train_hierClustered['cluster'].unique()))

# Evaluate overall accuracy from hierarchical clustering
countByCluster_hierClust, countByLabel_hierClust, \
    countMostFreq_hierClust, accuracyDF_hierClust, \
    overallAccuracy_hierClust, accuracyByLabel_hierClust = \
    analyzeCluster(X_train_hierClustered, y_train)

print("Overall accuracy from hierarchical clustering: ", \
      overallAccuracy_hierClust)

In [None]:
# Exercise 3 Answers
# HDBSCAN

import hdbscan

min_cluster_size = 10
min_samples = 5
alpha = 1.0
cluster_selection_method = 'leaf'

hdb = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, \
    min_samples=min_samples, alpha=alpha, \
    cluster_selection_method=cluster_selection_method)

X_train_hdbscanClustered = hdb.fit_predict(X_train)
X_train_hdbscanClustered = pd.DataFrame(data= \
    X_train_hdbscanClustered, index=X_train.index, \
    columns=['cluster'])

countByCluster_hdbscan, countByLabel_hdbscan, \
    countMostFreq_hdbscan, accuracyDF_hdbscan, \
    overallAccuracy_hdbscan, accuracyByLabel_hdbscan = \
    analyzeCluster(X_train_hdbscanClustered, y_train)

print("Overall accuracy from HDBSCAN: ", \
      overallAccuracy_hdbscan)

## Conclusion to Part B
In this notebook, we applied K-means, hierarchical clustering, and HDBSCAN to the dataset of Lending Club applications to group similar applicants together.

We used the labels of loan grades to see how well the clustering was able to segment the borrowers into distinct and homogenous groups based on creditworthiness.

The results were reasonable but leave room for improvement: better feature engineering and selection, along with more hyperparameter tuning, could yield purer clusters.

Group segmentation is one real-world application of clustering, and you can now use clustering methods to group users of your choice in your own field.

Congratulations, you've finished this course! 
Go build more clustering systems!

The next course in the Inside Unsupervised Learning series is Feature Extraction using Autoencoders and Semi-Supervised Learning.
https://learning.oreilly.com/live-training/courses/inside-unsupervised-learning-feature-extraction-using-autoencoders-and-semi-supervised-learning/0636920283492/

You could also learn more about Unsupervised Learning in my book, Hands-on Unsupervised Learning Using Python.
https://www.unsupervisedlearningbook.com/