In this notebook I try to divide a bank's clients into groups based on their behaviour. I used a few methods and tools for this:
- KMeans clustering
- Hierarchical clustering (Agglomerative)
- Silhouette score
- DBSCAN
- TSNE 

In [None]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('../input/german-credit/german_credit_data.csv', index_col=0)

In [None]:
df.head()

It seems to have quite a lot of categorical features.

In [None]:
df.shape

It's a relatively small dataframe, so we can apply any method or function and it still won't take long to run on my laptop.

## Missing values

Let's check if we have any nulls.

In [None]:
df.isnull().sum()

It's easy to compute the percentages in this case: 18.3% of values are null in the 'Saving accounts' feature and 39.4% in the 'Checking account' feature. There is no solid rule for what to do with missing values. Usually, I drop the affected observations if a feature has less than 5% nulls. In this case, let's take a closer look.
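
As a small sketch of how these percentages can be computed directly, here is the same idea on a toy frame. The frame `toy` and its values are made up for illustration; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the credit data (values are made up).
toy = pd.DataFrame({
    'Saving accounts': ['little', np.nan, 'rich', 'little', np.nan],
    'Checking account': [np.nan, 'moderate', np.nan, np.nan, 'little'],
    'Age': [25, 40, 33, 51, 29],
})

# isnull() gives a boolean mask; its column-wise mean is the fraction of nulls.
null_pct = toy.isnull().mean() * 100
print(null_pct.round(1))
```

On the real data the same one-liner would be `df.isnull().mean() * 100`.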

First, let's replace the nulls with 'None' to see what the distribution of observations looks like.

In [None]:
df['Saving accounts'] = df['Saving accounts'].fillna('None')
df['Saving accounts'].value_counts(normalize=True)

It's not clear why information about saving accounts is missing for some observations, and we can't just drop 18.3% of the data. So I will leave it as it is, with the additional option 'None'. Basically, we interpret these nulls as one more category of the feature.

Later we will encode it with numbers. We could use one-hot encoding, but here we can use plain numbers (as a ranking feature), because we assumed above that nulls are a distinct category.
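
To make the difference between the two encodings concrete, here is a minimal sketch on a toy series. The series `s` and the dictionary `rank` are illustrative; placing 'None' at rank 0 is the assumption made in this notebook, not a property of the data.

```python
import numpy as np
import pandas as pd

# Toy column with missing values (made-up data).
s = pd.Series(['little', np.nan, 'moderate', 'rich', np.nan], name='Checking account')

# Step 1: turn missing values into an explicit category.
s = s.fillna('None')

# Step 2a: ranking (ordinal) encoding. One column, categories become ordered numbers.
rank = {'None': 0, 'little': 1, 'moderate': 2, 'rich': 3}
encoded = s.map(rank)
print(encoded.tolist())

# Step 2b: one-hot encoding. One column per category, no order implied.
onehot = pd.get_dummies(s)
print(onehot.shape)
```

The ranking version keeps the feature in a single column, at the cost of assuming that 'None' sits below 'little'.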

In [None]:
df['Checking account'] = df['Checking account'].fillna('None')
df['Checking account'].value_counts(normalize=True)

Let's stick to our hypothesis and do the same thing with 'Checking account' as with 'Saving accounts'.

Just checking that everything is ok.

In [None]:
df.head()

In [None]:
df.shape

## Categorical features

In [None]:
df.Sex.hist();

There are about two times more males than females. Let's encode sex: male is 1, female is 0.

In [None]:
df['Sex'] = df['Sex'].apply(lambda x: 1 if x=='male' else 0)

In [None]:
df.Housing.value_counts(normalize=True)

This feature should be encoded with one hot encoding.

In [None]:
df.Purpose.value_counts(normalize=True)

We can decrease the number of categories for this feature. I take the last three categories and merge them into the category 'others' (each of them accounts for less than 5% of observations).
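
The rare categories can also be picked by the 5% threshold instead of by name. This is only a sketch on made-up counts; the series `purpose` is illustrative and does not reproduce the real proportions.

```python
import pandas as pd

# Made-up category counts (100 rows in total).
purpose = pd.Series(['car'] * 50 + ['radio/TV'] * 40 + ['repairs'] * 4
                    + ['domestic appliances'] * 3 + ['vacation/others'] * 3)

# Share of each category, then the names of those under the 5% threshold.
share = purpose.value_counts(normalize=True)
rare = share[share < 0.05].index

# Keep frequent categories as they are and replace rare ones with 'others'.
collapsed = purpose.where(~purpose.isin(rare), 'others')
print(collapsed.value_counts())
```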

In [None]:
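# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions may not apply to the original dataframe.
df['Purpose'] = df['Purpose'].replace(['repairs', 'domestic appliances', 'vacation/others'], 'others')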

Let's check that we've done everything right.

In [None]:
df.Purpose.value_counts(normalize=True)

In [None]:
df['Saving accounts'].value_counts(normalize=True)

In [None]:
# Rank the categories by increasing savings; 'quite rich' is placed below 'rich'.
df['Saving accounts'] = df['Saving accounts'].map({'None': 0, 'little': 1, 'moderate': 2, 'quite rich': 3, 'rich': 4})

In [None]:
df['Checking account'].value_counts(normalize=True)

In [None]:
# Same ranking encoding, assigned back instead of using inplace=True.
df['Checking account'] = df['Checking account'].map({'None': 0, 'little': 1, 'moderate': 2, 'rich': 3})

Just checking that everything looks right.

In [None]:
df.head()

Only 'Housing' and 'Purpose' cannot be encoded as rankings, so we will use one-hot encoding for them instead.
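
Here is a minimal sketch of what one-hot encoding with `pd.get_dummies` does to such a column. The frame `toy` is made up; `dtype=int` is my choice here to get 0/1 columns regardless of the pandas version.

```python
import pandas as pd

# Made-up frame: one text feature and one already-numeric feature.
toy = pd.DataFrame({'Housing': ['own', 'rent', 'free', 'own'],
                    'Job': [2, 1, 3, 2]})

# Only text columns are expanded; numeric columns pass through unchanged.
# drop_first=True drops the first category ('free'), which is implied
# whenever all the remaining indicator columns are 0.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

A feature with three categories becomes two indicator columns, so no order between 'own', 'rent' and 'free' is imposed.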

## Overview of the features

In [None]:
df.hist(figsize=(12,12));

In [None]:
df['Duration'] = np.log(df['Duration'])
df['Age'] = np.log(df['Age'])
df['Credit amount'] = np.log(df['Credit amount'])

Age, credit amount and duration are numerical features with long tails, so we log-transform them to get distributions closer to normal.
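
To see the effect of the log on a long-tailed feature, here is a sketch on synthetic data. The log-normal sample `amount` only mimics a right-skewed variable; it is not the real credit amount.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Log-normal draws have a long right tail (synthetic data).
amount = pd.Series(rng.lognormal(mean=8.0, sigma=0.8, size=1000))

# Skewness near 0 means a roughly symmetric distribution.
skew_before = amount.skew()
skew_after = np.log(amount).skew()
print('skew before log: {:.2f}'.format(skew_before))
print('skew after log:  {:.2f}'.format(skew_after))
```

The skewness should drop noticeably after the transform, which is what we want before using distance-based clustering.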

As we don't have many observations, making a pairplot won't take much time.

In [None]:
sns.pairplot(df);

It doesn't give any additional information.

In [None]:
plt.figure(figsize=(20,20))
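# 'Housing' and 'Purpose' are still text at this point, so correlate numeric columns only.
sns.heatmap(df.select_dtypes('number').corr(), annot=True);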

No pair of features is very strongly correlated. The most correlated features are 'Job', 'Credit amount' and 'Duration'. It also seems that, among the bank's clients, men are older than women.

## One Hot Encoding

In [None]:
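# dtype=int keeps the indicator columns as 0/1 numbers (newer pandas versions default to bool).
df = pd.get_dummies(df, drop_first=True, dtype=int)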

In [None]:
df.head()

In [None]:
df.shape

## Scaling

Everything is working, so we are ready to move on to clustering.
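
This section does not show a scaling step itself. As a hedged sketch of what one could look like, assuming scikit-learn's `StandardScaler` (my choice for illustration, not something the notebook confirms), each feature is brought to zero mean and unit variance so that no single column dominates the Euclidean distances used by the clustering methods.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up frame with features on different scales.
toy = pd.DataFrame({'Age': [3.1, 3.5, 3.9, 4.1],
                    'Credit amount': [6.9, 7.8, 8.4, 9.2],
                    'Sex': [0, 1, 1, 0]})

# fit_transform returns a NumPy array, so wrap it back into a dataframe.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns, index=toy.index)

print(scaled.mean().round(6))        # close to 0 for every column
print(scaled.std(ddof=0).round(6))   # close to 1 for every column
```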

## Creating models

In [None]:
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

from sklearn.metrics import silhouette_score
from sklearn.manifold import TSNE

### K Means

To use the K-means method we first need to choose the number of clusters.

In [None]:
inertia = []
k = range(1, 11)
for k_i in k:
    km = KMeans(n_clusters=k_i).fit(df)
    inertia.append(km.inertia_)
    
plt.plot(k, inertia)
plt.xlabel('k')
plt.ylabel('inertia')
plt.title('The Elbow Method showing the optimal k');

From the elbow plot, it seems to be 2, 4 or maybe 5 clusters.

### Hierarchical clustering

In [None]:
distance_mat = pdist(df)

Z = hierarchy.linkage(distance_mat, 'ward')

In [None]:
plt.figure(figsize=(20, 10))

plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('cluster size')
plt.ylabel('distance')
hierarchy.dendrogram(
    Z,
    truncate_mode='lastp',
    p=12,  
    leaf_font_size=12.,
    show_contracted=True, 
)
plt.show()

Well, from the dendrogram, 2, 3 and 4 clusters look like the best candidates. However, it's not clear-cut.
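
Once a candidate number of clusters is chosen, the same linkage matrix can be cut into flat labels with `hierarchy.fcluster`. This is a sketch on synthetic blobs; the array `X` is made up and only stands in for the encoded dataframe.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs of 20 points each.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in (0.0, 5.0, 10.0)])

# Same steps as above: condensed distance matrix, then Ward linkage.
Z = hierarchy.linkage(pdist(X), 'ward')

# Cut the tree so that at most 3 flat clusters remain (labels start at 1).
labels = hierarchy.fcluster(Z, t=3, criterion='maxclust')
print(np.bincount(labels)[1:])
```

Comparing the cluster sizes for 2, 3 and 4 clusters is a quick way to check whether a cut only splits off a handful of points.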

### Silhouette score

Now we use the sklearn implementation of the silhouette score to better understand how many clusters hierarchical clustering finds.

In [None]:
silhouette_scores = [] 
k = range(2,8)

for n_cluster in k:
    silhouette_scores.append( 
        silhouette_score(df, AgglomerativeClustering(n_clusters = n_cluster).fit_predict(df))) 
    
    
# Plotting a bar graph to compare the results 

plt.bar(k, silhouette_scores) 
plt.xlabel('Number of clusters', fontsize = 10) 
plt.ylabel('Silhouette Score', fontsize = 10) 
plt.show() 

The silhouette score points to 3 clusters here. Possibly 2 or 4, but not 5.
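
Instead of reading the best value off the bar chart, the scores can be kept in a dictionary and the maximum picked programmatically. A sketch on synthetic blobs (the array `X` is made up):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs of 30 points each.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0.0, 5.0, 10.0)])

# Silhouette score for each candidate number of clusters.
scores = {}
for n in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=n).fit_predict(X)
    scores[n] = silhouette_score(X, labels)

# The candidate with the highest score.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real data the maximum is usually much less pronounced than on these blobs, so it is still worth looking at the runner-up values.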

# 2 clusters

### DBSCAN

Guessing the best parameters for DBSCAN always feels like a game. However, this model can give you the percentage of noise (which can still be another cluster).

I tried to find the epsilon that gives the minimal amount of noise for 2 clusters.
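
The search for epsilon can be written as a small loop rather than done by hand. This is only a sketch on synthetic data; the grid of `eps` values and the array `X` are assumptions for illustration and say nothing about the value used below.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Two well-separated synthetic blobs of 50 points each.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0.0, 5.0)])

results = []
for eps in np.arange(0.1, 1.01, 0.1):
    labels = DBSCAN(eps=eps, min_samples=4).fit(X).labels_
    # Label -1 marks noise, so it does not count as a cluster.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise_pct = 100 * np.mean(labels == -1)
    results.append((round(float(eps), 2), n_clusters, noise_pct))

# Keep the eps values that give exactly two clusters, then take the least noisy one.
two_clusters = [r for r in results if r[1] == 2]
best = min(two_clusters, key=lambda r: r[2])
print('eps={}, clusters={}, noise={:.1f}%'.format(*best))
```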

In [None]:
db = DBSCAN(eps=1.61, min_samples=4).fit(df)

In [None]:
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise_ = list(db.labels_).count(-1)

print('Estimated number of clusters: {}'.format(n_clusters_))
print('Estimated percentage of noise points: {:.2f}%'.format(100*n_noise_/df.shape[0]))

### TSNE

Let's see how well we are able to group clients into clusters. This is a function for choosing the perplexity (it's not the best way to do it).

In [None]:
def draw_tsne(df):
    """Plot t-SNE embeddings of df for several perplexity values."""
    perplexities = [5, 10, 20, 30, 40, 50]
    _, axes = plt.subplots(nrows=2, ncols=3, figsize=(16, 8), sharey=True)

    # axes.ravel() walks the 2x3 grid row by row, one perplexity per panel.
    for ax, perplexity in zip(axes.ravel(), perplexities):
        tsne = TSNE(perplexity=perplexity).fit_transform(df)
        ax.title.set_text('Perplexity {}'.format(perplexity))
        sns.scatterplot(x=tsne[:, 0], y=tsne[:, 1], ax=ax);

In [None]:
draw_tsne(df)

In [None]:
tsne=TSNE(perplexity=30).fit_transform(df)

In [None]:
plt.figure(figsize=(12,12))
plt.title('Perplexity 30')
sns.scatterplot(x = tsne[:, 0], y = tsne[:, 1]);

In [None]:
plt.figure(figsize=(10, 10))
plt.title('DBSCAN, 2 clusters')
plt.scatter(tsne[:, 0], tsne[:, 1], c=db.labels_);

Well, there is some pattern, but it's not too good.
Let's compare KMeans and hierarchical clustering.

In [None]:
km = KMeans(n_clusters=2).fit(df)
agg_cluster = AgglomerativeClustering(n_clusters = 2).fit(df)

_, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 10), sharey=True)

axes[0].title.set_text('K-MEANS, 2 clusters')
sns.scatterplot(x = tsne[:, 0], y = tsne[:, 1], hue=km.labels_, ax=axes[0]);


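# Set the title on the right-hand axes explicitly instead of relying on the current axes.
axes[1].title.set_text('Hierarchical clustering, 2 clusters')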
sns.scatterplot(x = tsne[:, 0], y = tsne[:, 1], hue=agg_cluster.labels_, ax=axes[1]);

It seems like hierarchical clustering is better for two clusters.

# 3 clusters

In [None]:
km = KMeans(n_clusters=3).fit(df)
agg_cluster = AgglomerativeClustering(n_clusters = 3).fit(df)

_, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 10), sharey=True)

axes[0].title.set_text('K-MEANS, 3 clusters')
sns.scatterplot(x = tsne[:, 0], y = tsne[:, 1], hue=km.labels_, ax=axes[0], palette=['green','orange','brown']);


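# Set the title on the right-hand axes explicitly instead of relying on the current axes.
axes[1].title.set_text('Hierarchical clustering, 3 clusters')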
sns.scatterplot(x = tsne[:, 0], y = tsne[:, 1], hue=agg_cluster.labels_, ax=axes[1], palette=['green','orange','brown']);

And again we can see three groups.

# 4 clusters

In [None]:
km = KMeans(n_clusters=4).fit(df)
agg_cluster = AgglomerativeClustering(n_clusters = 4).fit(df)

_, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 10), sharey=True)

axes[0].title.set_text('K-MEANS, 4 clusters')
sns.scatterplot(x = tsne[:, 0], y = tsne[:, 1], hue=km.labels_, ax=axes[0], palette=['green','orange','brown', 'yellow']);


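# Set the title on the right-hand axes explicitly instead of relying on the current axes.
axes[1].title.set_text('Hierarchical clustering, 4 clusters')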
sns.scatterplot(x = tsne[:, 0], y = tsne[:, 1], hue=agg_cluster.labels_, ax=axes[1], palette=['green','orange','brown', 'yellow']);

Well, it seems like we have 3 clusters. So let's try to find out who they are.

# Interpretation

Let's try to interpret what these three groups are.

In [None]:
agg_cluster = AgglomerativeClustering(n_clusters = 3).fit(df)

In [None]:
fig, ax = plt.subplots(nrows=5, ncols=3, figsize=(40, 20))

# One boxplot per feature, split by cluster label.
# ax.T.ravel() fills the grid column by column, top to bottom.
for column, axis in zip(df.columns, ax.T.ravel()):
    sns.boxplot(y=column, x=agg_cluster.labels_,
                data=df,
                palette="colorblind", ax=axis)


So, there are three groups:
- Men with moderately skilled jobs and no savings
- Women with highly skilled jobs and some savings. This group also borrows larger amounts for longer periods, and not for TV/radio.
- Men with no job or who are not residents, with a lot of savings
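
A numeric summary can back up what the boxplots show. This is a sketch on synthetic data: the frame `toy` is made up, and the per-cluster mean is just one possible summary, not the method used above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Two made-up client groups: young with small credits, older with large credits.
toy = pd.DataFrame({
    'Age': np.concatenate([rng.normal(25, 2, 30), rng.normal(50, 2, 30)]),
    'Credit amount': np.concatenate([rng.normal(2000, 100, 30), rng.normal(8000, 100, 30)]),
})

labels = AgglomerativeClustering(n_clusters=2).fit_predict(toy)

# Mean of every feature per cluster, plus the cluster sizes.
profile = toy.groupby(labels).mean()
profile['size'] = np.bincount(labels)
print(profile.round(1))
```

For the 0/1 encoded features, the per-cluster mean is simply the share of clients in that category, which makes descriptions like the ones above easy to check.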