## German Credit Risk

Data:
- Age (numeric)
- Sex (text: male, female)
- Job (numeric: 0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled)
- Housing (text: own, rent, or free)
- Saving accounts (text - little, moderate, quite rich, rich)
- Checking account (text: little, moderate, rich)
- Credit amount (numeric, in DM, Deutsche Mark)
- Duration (numeric, in months)
- Purpose (text: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly as py
import os
import plotly.io as pio
pio.renderers.default='notebook'

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering, DBSCAN
import scipy.cluster.hierarchy as shc 


import category_encoders as ce

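# Matplotlib 3.6 renamed the seaborn styles; fall back to the old name on older versions
style_name = 'seaborn-v0_8-colorblind' if 'seaborn-v0_8-colorblind' in plt.style.available else 'seaborn-colorblind'
plt.style.use(style_name)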
%matplotlib inline

In [None]:
df = pd.read_csv('../input/german-credit/german_credit_data.csv', index_col = 'Unnamed: 0')
print(df.shape)
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.describe(include=['object'])

In [None]:
df.nunique()

In [None]:
numeric = ['Age', 'Job', 'Credit amount', 'Duration']
categorical = ['Sex', 'Housing', 'Saving accounts', 'Checking account', 'Purpose']

## Exploratory Data Analysis

In [None]:
def check_missing(data, output_path=None):
    """Count the number and the proportion of missing values per column."""
    result = pd.concat([data.isnull().sum(), data.isnull().mean()], axis=1)
    result = result.rename(index=str, columns={0: 'total missing', 1: 'proportion'})
    if output_path:
        result.to_csv(f'{output_path}missing.csv')
        print(output_path, 'missing.csv')
    return result

In [None]:
check_missing(data=df)

In [None]:
df= df.fillna('unknown')

In [None]:
df.hist(figsize = (20,15));

In [None]:
for col in df[categorical].columns:
    sns.countplot(y =col, data = df)
    plt.show()

- The distribution of `Age` is positively skewed, so we will apply a log transformation to this feature;
- There are about twice as many male customers as female ones;
- Most customers are skilled;
- Most customers own their house;
- Most customers have little in their saving accounts;
- The distribution of `Credit amount` is positively skewed, so we will apply a log transformation to this feature;
- `Duration` ranges from 4 to 72 months. Credits for one or two years are the most common.
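
The skewness noted above can also be measured instead of being read off the histograms. The following is a minimal sketch on a synthetic log-normal sample; the name `amounts` and the distribution parameters are assumptions made for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for a column such as `Credit amount`
rng = np.random.default_rng(0)
amounts = pd.Series(rng.lognormal(mean=8.0, sigma=0.8, size=1000))

skew_raw = amounts.skew()          # sample skewness of the raw values
skew_log = np.log(amounts).skew()  # sample skewness after the log transform

print(f"raw skew: {skew_raw:.2f}, log skew: {skew_log:.2f}")
```

A clearly positive value indicates a long right tail, and a value near zero after the transform suggests that the log made the distribution roughly symmetric.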

In [None]:
plt.figure(figsize=(20,8))
plotnumber =1
for column in df[numeric]:
    ax = plt.subplot(2,2,plotnumber)
    sns.boxplot(data = df, x = column, palette='pastel')
    plt.xlabel(column)
    plotnumber+=1
plt.show()

In [None]:
sns.pairplot(df)
plt.show();

In [None]:
# Restrict to numeric columns; pandas >= 2.0 raises on text columns otherwise
corr = df.corr(numeric_only=True)
plt.figure(figsize=(10,8));
sns.heatmap(corr, annot=True, fmt='.2f');

`Duration` and `Credit amount` are highly correlated.

## Feature engineering

#### Numerical

In [None]:
data = df.copy()

In [None]:
np.log(data['Age']).hist()

In [None]:
data['Age'] = np.log(data['Age'])

In [None]:
np.log(data['Credit amount']).hist()

In [None]:
data['Credit amount'] = np.log(data['Credit amount'])

#### Categorical

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
# The same encoder is refit on each column; only the encoded values are kept
encoder = LabelEncoder()
for label in categorical:
    data[label] = encoder.fit_transform(data[label])

In [None]:
data[categorical]

#### Scaling

In [None]:
# Standardize every column to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(data)
data_scaled = pd.DataFrame(X_scaled, columns=data.columns)
data_scaled.head()

Clustering relies on a distance between pairs of data points: points that are close to each other are more likely to belong to the same cluster.  
We scale the data so that all dimensions are treated equally. In other words, we want each column to contribute comparably to the distance.
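
To illustrate why scaling matters for distances, here is a small sketch with three made-up customers described by age and credit amount. The numbers and the helper `dist` are hypothetical; only `StandardScaler` comes from the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three made-up customers: [age in years, credit amount in DM]
X = np.array([[25.0, 1000.0],
              [60.0, 1100.0],
              [26.0, 5000.0]])

def dist(a, b):
    """Euclidean distance between two points."""
    return float(np.linalg.norm(a - b))

# Raw units: the amount column dominates the distance
d_raw_age_gap = dist(X[0], X[1])      # customers differ mostly in age
d_raw_amount_gap = dist(X[0], X[2])   # customers differ mostly in amount

# After standardization both columns are on a comparable scale
X_std = StandardScaler().fit_transform(X)
d_std_age_gap = dist(X_std[0], X_std[1])
d_std_amount_gap = dist(X_std[0], X_std[2])

print(d_raw_age_gap, d_raw_amount_gap)
print(d_std_age_gap, d_std_amount_gap)
```

In raw units the 35-year age gap is almost invisible next to the 4000 DM gap, while after standardization the two gaps produce distances of similar size.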

#### PCA

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

In [None]:
X_pca.shape

#### UMAP

In [None]:
import umap

In [None]:
reducer = umap.UMAP(random_state=42)
X_umap = reducer.fit_transform(X_scaled)

In [None]:
X_umap.shape

#### tSNE

In [None]:
from sklearn.manifold import TSNE

In [None]:
tsne = TSNE(n_components=2, random_state=10)
X_tsne = tsne.fit_transform(X_scaled)

In [None]:
X_tsne.shape

We will use different dimensionality reduction techniques for data visualization after clustering.
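
A 2-D projection keeps only part of the information in the data, which is worth keeping in mind when reading the cluster plots. For PCA that share can be inspected through the fitted model's `explained_variance_ratio_` attribute. The sketch below uses a synthetic matrix (`X_demo`, an assumption) rather than the notebook's `X_scaled`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 9 features; two of them are made correlated
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 9))
X_demo[:, 1] += 2 * X_demo[:, 0]
X_demo_scaled = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=2).fit(X_demo_scaled)
ratio = pca_demo.explained_variance_ratio_  # share of variance per component

print(ratio, ratio.sum())
```

If the two components together explain only a small share of the variance, clusters that overlap in the PCA plot may still be well separated in the full feature space.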

## Clustering

#### K-means

In [None]:
inertia = []
for i in range(1,11):
    kmeans = KMeans(n_clusters=i, random_state=10).fit(data_scaled)
    labels = kmeans.labels_
    inertia_i = kmeans.inertia_
    inertia.append(inertia_i)

In [None]:
plt.plot(range(1,11), inertia, marker='o');

In [None]:
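# Ratio of successive inertia drops: D for k clusters compares the drop
# from k to k+1 with the drop from k-1 to k (inertia[i] belongs to k = i + 1)
D = []
for i in range(1,9):
    Di = (inertia[i] - inertia[i+1])/(inertia[i-1] - inertia[i])
    D.append(Di)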

In [None]:
plt.plot(range(2,10), D, marker='o');

In [None]:
kmeans = KMeans(n_clusters=4, random_state=10).fit(X_scaled)
labels_kmeans = kmeans.labels_

In [None]:
plt.title('K-means, 4 clusters')
sns.scatterplot(x = X_pca[:,0], y = X_pca[:,1], hue=labels_kmeans, palette='rainbow');

In [None]:
data_clustered = df.copy()
data_clustered['cluster_kmeans'] = labels_kmeans

#### Hierarchical

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage
plt.figure(figsize=(20,10))
linkage_ = linkage(X_scaled, method='ward')
dendrogram_ = dendrogram(linkage_)

The tallest vertical segments of the dendrogram mark the largest merge distances. A horizontal line drawn through them crosses 3 or 4 branches, which suggests that the best number of clusters is 3-4.
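
The visual reading of the dendrogram can be approximated numerically by looking for the largest jump between successive merge heights in the linkage matrix and then cutting the tree with `fcluster`. This is a sketch on three synthetic blobs (`X_demo` and `centers` are assumptions); on real data the largest gap is only a heuristic.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three well separated synthetic blobs of 30 points each
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

Z = linkage(X_demo, method='ward')

# Column 2 of the linkage matrix holds the merge heights in increasing order
heights = Z[:, 2]
gaps = np.diff(heights)

# Cutting inside the largest gap leaves this many clusters
n_suggested = len(heights) - int(np.argmax(gaps))

# Flat cluster labels for a fixed number of clusters
labels_3 = fcluster(Z, t=3, criterion='maxclust')

print(n_suggested, np.unique(labels_3))
```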

In [None]:
from tqdm import tqdm
from sklearn.metrics import silhouette_score
silhouette = []
for i in tqdm(range(2,11)):
    agg = AgglomerativeClustering(n_clusters=i).fit(X_scaled)
    labels = agg.labels_
    score = silhouette_score(X_scaled, labels)
    silhouette.append(score)

In [None]:
plt.plot(range(2,11), silhouette, marker='o');

In [None]:
agg_cluster = AgglomerativeClustering(n_clusters = 3).fit(X_scaled)
labels_agg = agg_cluster.labels_

In [None]:
plt.title('Hierarchical clustering, 3 clusters')
sns.scatterplot(x = X_tsne[:, 0], y = X_tsne[:, 1], hue=agg_cluster.labels_,  palette=['green','orange','blue']);

In [None]:
data_clustered['cluster_agg'] = labels_agg

#### DBSCAN

In [None]:
def dbscan_clustering(eps_range, X):
    """Run DBSCAN for each eps and plot the silhouette score and the number of clusters."""
    silhouette = []
    clusters = []
    for i in tqdm(eps_range):
        dbscan = DBSCAN(eps=i).fit(X)
        labels = dbscan.labels_
        uniq_labels = np.unique(labels)
        n_clusters = len(uniq_labels[uniq_labels != -1])
        if n_clusters > 1:
            score = silhouette_score(X, labels)
        else:
            score = 0
        silhouette.append(score)
        clusters.append(n_clusters)
        
    fig, ax1 = plt.subplots()

    color = 'tab:red'
    ax1.plot(eps_range, silhouette, marker='o', color=color)
    ax1.set_xlabel('eps')
    ax1.set_ylabel('silhouette', color=color)
    ax1.tick_params(axis='y', labelcolor=color)

    ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis

    color = 'tab:blue'
    ax2.plot(eps_range, clusters, marker='o', color=color)
    ax2.set_ylabel('n_clusters', color=color)  
    ax2.tick_params(axis='y', labelcolor=color)

    fig.tight_layout()  # otherwise the right y-label is slightly clipped
    plt.show()

In [None]:
eps_range = np.arange(1,4,0.05)
dbscan_clustering(eps_range, X_scaled)

In [None]:
eps_range = np.arange(1.5,2.5,0.05)
dbscan_clustering(eps_range, X_scaled)

In [None]:
eps_range = np.arange(1.95,2.5,0.01)
dbscan_clustering(eps_range, X_scaled)

In [None]:
from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=3)
nbrs = neigh.fit(X_scaled)
distances, indices = nbrs.kneighbors(X_scaled)

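# Sort each column independently; column 0 is the distance of a point to itself (0)
distances = np.sort(distances, axis=0)
# Keep the sorted distances to the nearest other point
distances = distances[:,1]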

plt.plot(distances)

Combining the two methods for finding suitable DBSCAN parameters gives:
- eps = 2.15
- min_samples >= dimensionality + 1
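
A common way to connect the two parameters is to plot, for every point, the distance to its `min_samples`-th neighbour (counting the point itself) and to look for a knee in the sorted curve. The sketch below assumes a synthetic 4-feature matrix `X_demo`; it illustrates the heuristic and is not the exact procedure used above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in with 4 features
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))

# Rule of thumb from the text: dimensionality + 1
min_samples = X_demo.shape[1] + 1

# Querying the training data returns each point itself in column 0 (distance 0),
# which matches DBSCAN, where min_samples also counts the point itself
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)

# Sorted distance to the last requested neighbour; a knee in this curve hints at eps
k_dist = np.sort(distances[:, -1])

print(min_samples, k_dist[:3], k_dist[-3:])
```

Plotting `k_dist` with `plt.plot(k_dist)` gives the usual k-distance curve.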

In [None]:
dbscan = DBSCAN(eps=2.15, min_samples=10).fit(X_scaled)
labels_dbscan = dbscan.labels_

In [None]:
plt.title('DBSCAN')
sns.scatterplot(x = X_umap[:,0], y = X_umap[:,1], hue=labels_dbscan, palette='rainbow');

In [None]:
data_clustered['cluster_dbscan'] = labels_dbscan

## Interpretation

In [None]:
# Select the numeric columns before averaging; text columns cannot be averaged
data_clustered.groupby('cluster_kmeans')[numeric].mean()

In [None]:
data_clustered['cluster_kmeans'].value_counts()

In [None]:
fig, ax  = plt.subplots(1,3,figsize=(20,5))
sns.scatterplot(x = data_clustered['Duration'], y = data_clustered['Credit amount'], hue=labels_kmeans, ax=ax[0], palette='rainbow');
sns.scatterplot(x = data_clustered['Age'], y = data_clustered['Credit amount'], hue=labels_kmeans, ax=ax[1], palette='rainbow');
sns.scatterplot(x = data_clustered['Age'], y = data_clustered['Duration'], hue=labels_kmeans, ax=ax[2], palette='rainbow');

In [None]:
for col in data_clustered[numeric].columns:
    sns.boxplot(data=data_clustered, x=col, y=labels_kmeans, orient='h')
    plt.show();

#### Cluster 0

In [None]:
data_clustered[data_clustered['cluster_kmeans']==0]['Sex'].hist()

In [None]:
data_clustered[data_clustered['cluster_kmeans']==0]['Job'].hist()

In [None]:
sns.countplot(y ='Purpose', data = data_clustered[data_clustered['cluster_kmeans']==0])
plt.show()

#### Cluster 1

In [None]:
data_clustered[data_clustered['cluster_kmeans']==1]['Sex'].hist()

In [None]:
sns.countplot(y ='Purpose', data = data_clustered[data_clustered['cluster_kmeans']==1])
plt.show()


#### Cluster 2

In [None]:
data_clustered[data_clustered['cluster_kmeans']==2]['Sex'].hist()

In [None]:
sns.countplot(y ='Purpose', data = data_clustered[data_clustered['cluster_kmeans']==2])
plt.show()

#### Cluster 3

In [None]:
data_clustered[data_clustered['cluster_kmeans']==3]['Sex'].hist()

In [None]:
sns.countplot(y ='Purpose', data = data_clustered[data_clustered['cluster_kmeans']==3])
plt.show()

## Conclusion

- **Cluster 0** has the highest mean `Credit amount` (6569.7 DM) and the longest mean `Duration` (35 months). This cluster consists mostly of men;
- **Cluster 1** has a mean `Credit amount` of 2073.7 DM and a mean `Duration` of 35 months. This cluster consists only of women;
- **Cluster 2** is the smallest cluster, with a mean `Credit amount` of 3251.1 DM and a mean `Duration` of 20 months. This cluster consists mostly of men;
- **Cluster 3** is the biggest cluster, with the lowest mean `Credit amount` (1960.1 DM) and the shortest mean `Duration` (15 months). This cluster consists only of men;
- The average `Age` is very similar across the clusters;
- In all clusters the most common `Purpose` values are car and radio/TV.
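
Statements such as "only women" or "mostly men" can be backed by a table instead of separate histograms. The sketch below uses a toy frame with made-up values (`toy` is hypothetical and stands in for `data_clustered`) to show one possible way to profile clusters.

```python
import pandas as pd

# Toy frame standing in for `data_clustered`; all values are made up
toy = pd.DataFrame({
    'cluster_kmeans': [0, 0, 1, 1, 2, 2, 3, 3],
    'Sex': ['male', 'male', 'female', 'female', 'male', 'female', 'male', 'male'],
    'Credit amount': [6000, 7000, 2000, 2200, 3000, 3500, 1900, 2000],
})

# Share of each sex within every cluster (each row sums to 1)
sex_share = pd.crosstab(toy['cluster_kmeans'], toy['Sex'], normalize='index')

# Mean credit amount and cluster size in one table
profile = toy.groupby('cluster_kmeans')['Credit amount'].agg(['mean', 'count'])

print(sex_share)
print(profile)
```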