<div>
    <img style="float:right;" src="images/snext-logo.png"/>
    <div style="float:left;color:#626262;padding-top:30px"><h1>Exercise: Unsupervised learning in Python with scikit-learn</h1></div>
</div>

This notebook contains the skeleton of a simple data analytics documentation that demonstrates two unsupervised algorithms on a given dataset.

Walk through the analysis by executing the cells one by one, complete the contained assignments, then apply the learnings to a new case.

## 1. Case description

### Business Problem
In financial institutions, the process of loan approval and determining the interest rate offered is critical in mitigating potential risks and maximizing returns. The decision-making process involves assessing the risk of each loan application based on various factors, such as credit score, income, and past repayment history. The assessment informs the bank of the probability of the borrower defaulting on the loan, which affects the interest rate offered to the applicant. Therefore, having a reliable and accurate risk assessment model is essential for financial institutions to make informed decisions.

### Research Problem
The research problem is to develop a classification model that can accurately classify loan applications into risk or no-risk categories. The model will review historical data on past loan applications and outcomes to identify patterns and predict the probability of the loan defaulting. Based on the model output, the loan applications shall be classified into those with low or high risk. The outcome of the model will help the bank make informed decisions on the loan amount, interest rates, and payment schedules, thus mitigating potential risks and enhancing returns.

### Training Data
The training dataset will consist of past loan applications and the corresponding outcomes. The data points collected will include the borrower's credit score, income, years of experience, and financial history such as investments, credit card debt, mortgage information, and other assets. The outcome variable will be a binary classification of either a loan default or no default. The model will undergo a series of tests using cross-validation techniques before implementation.

### Exercise
Developing a risk assessment model for loan applications is crucial for financial institutions to minimize risks and maximize profitability. Students in a university can engage in this problem to gain hands-on experience with data analysis and predictive modeling. The project will involve building and benchmarking classification models that can accurately classify loan applications, considering various features to predict if a loan is likely to default or not. The project will allow students to learn the methods of data pre-processing, model building and evaluation.

## 2. Data loading, preparation and exploration

Load required libraries and Jupyter extensions

In [None]:
import pandas as pd
import requests
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# load Jupyter plugins to enable SQL to query data and display plots inline (below the code cell)

%load_ext sql
%matplotlib inline

In [None]:
# The data for this exercise is contained in a sqlite database that is compressed with ZIP
# ZIP file is expected to be in folder data 

# Uncompress database / skip this if you downloaded and unzipped the database manually
import zipfile
zipfile.ZipFile('data/snext-data.zip', 'r').extractall('data')

In [None]:
%sql sqlite:///data/snext-database.db

In [None]:
data = %sql SELECT * FROM credit_ger
df = data.DataFrame()
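
If the `%sql` magic is not available, the same kind of query can be run with the standard library `sqlite3` module and `pandas.read_sql_query`. The following is only a sketch: it builds a small in-memory database with a made-up table `credit_demo` (a hypothetical stand-in for `credit_ger`) so that it runs without the course data.

```python
import sqlite3
import pandas as pd

# In-memory database with a made-up two-row table (the real table is credit_ger)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE credit_demo (id INTEGER, Age INTEGER, Risk TEXT)")
conn.executemany(
    "INSERT INTO credit_demo VALUES (?, ?, ?)",
    [(0, 67, "good"), (1, 22, "bad")],
)
conn.commit()

# Run the query and receive the result directly as a DataFrame
demo_df = pd.read_sql_query("SELECT * FROM credit_demo", conn)
conn.close()

print(demo_df)
```

For the actual exercise, the connection would presumably point at `data/snext-database.db` instead of `:memory:`.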

In [None]:
# set index, shorten/unify feature names
df = df.set_index(["id"])
df = df.rename({
    "Age": "age",
    "Sex": "sex",
    "Job": "job",
    "Housing": "housing",
    "Saving_accounts": "savings", 
    "Checking_account": "cash",
    "Credit_amount": "amount",
    "Duration": "duration",
    "Purpose": "purpose",
    "Risk": "risk"
}, axis="columns")
df.head(5)

### Data preparation

We'll apply the same transformations as in the chapter "Feature Engineering" from the notebook about supervised learning.

In [None]:
# X will hold the input features (input for the model)
# y will hold the label (the desired output of the model)
X = pd.DataFrame()
y = pd.DataFrame()

# Dummy encode binary nominal features
y["risk"] = (df.risk == "bad")*1   # *1 translates True/False to 0/1
X["male"] = (df.sex == "male")*1

# Apply one-hot encoding to nominal features with multiple categories

df.purpose = df.purpose.str.slice(0,8) # shorten purpose string
encoded_features = pd.get_dummies(df[["housing","purpose","savings","cash"]])
X = pd.concat([X, encoded_features], axis=1) # append features to dataframe X with training data

# all metric variables can remain as is, so we append them to the training data 
X = pd.concat([X,df[["age", "amount", "duration"]]], axis=1)
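
To see what the encoding steps produce, here is a minimal sketch on a made-up three-row table. The column names mirror the notebook, but the values and the names `toy` and `features` are illustrative assumptions. Passing `dtype=int` is a choice made here to get 0/1 columns regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook, values are made up
toy = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "housing": ["own", "rent", "free"],
    "age": [35, 22, 49],
})

features = pd.DataFrame(index=toy.index)
features["male"] = (toy.sex == "male").astype(int)      # binary feature -> one 0/1 column
dummies = pd.get_dummies(toy[["housing"]], dtype=int)    # one 0/1 column per category
features = pd.concat([features, dummies, toy[["age"]]], axis=1)  # metric feature stays as is

print(features)
```

Each row has exactly one `housing_*` column set to 1, which is what one-hot encoding means.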

## 3. Modelling

### 3.1 K-Means

#### Determine the optimal number of clusters to generate

In [None]:
inertias = []   # list collects the inertia (sum of squared distances of all points to their cluster centroid) for each solution we plan to generate

for n_clusters in range(2,10):  # generate multiple solutions for 2-9 clusters
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(X)  # for each solution try 10 different random starting configurations for cluster centroids
    inertias.append(kmeans.inertia_)  # add the inertia of the best of the 10 runs to our list

# Plot inertias 
plt.figure(figsize=(10,5))   
plt.title('Elbow criterion')
plt.xlabel("n_clusters")
plt.ylabel("inertia")
plt.plot(range(2,10), inertias, marker="o")
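
To make the quantity on the y-axis concrete, the following sketch recomputes the k-means inertia by hand as the sum of squared distances between each point and its assigned centroid, and compares it with `inertia_`. It assumes synthetic 2D data rather than the credit dataset; the names `points`, `km` and `manual_inertia` are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (assumption): three well-separated blobs in 2D
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10]])
points = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# Look up the centroid each point was assigned to
assigned_centroids = km.cluster_centers_[km.labels_]

# Inertia = sum of squared distances of each point to its assigned centroid
manual_inertia = ((points - assigned_centroids) ** 2).sum()

print(manual_inertia, km.inertia_)
```

Because adding clusters can only shrink these distances, the curve always falls; the elbow criterion looks for the point where the decrease flattens.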

---
### <span style="color:#46B7E9;">Assignment: Interpret the elbow criterion diagram</span>
1. Think about what this line tells you about the generated cluster configurations. If unsure, rewatch the video about k-means.
2. Apply the elbow criterion: Which cluster configuration seems best? Determine the optimal number of clusters and put it in the following cell.

In [None]:
optimal_n_clusters = 

#### Apply clustering

In [None]:
kmeans = KMeans(n_clusters=optimal_n_clusters, n_init=10)
clusters = kmeans.fit_predict(X)  # apply clustering and assign each data row to the resulting cluster
df["clusters"] = clusters # add this to the original dataframe.

---
### <span style="color:#46B7E9;">Assignment: Explore the clustering solution</span>
1. Generate some descriptive statistics or plots to understand what the clusters represent
2. Try to find fitting labels for all clusters like "car loans for young people".

In [None]:
# ...

#### Cluster Labels
- Cluster 1: ...
- Cluster 2: ...
- ...

### Hints - but try it yourself first, you can do it :-)

In [None]:
# some diagrams to get you started...

import seaborn as sns
fig, ax  = plt.subplots(2,3,figsize=(20,12))
fig.suptitle("Cluster interpretation by cluster centroids")
sns.scatterplot(x=df.duration, y = df.amount, hue=clusters, ax=ax[0,0], palette="bright")
sns.scatterplot(x=df.age, y = df.amount, hue=clusters, ax=ax[0,1], palette='bright')
sns.scatterplot(x=df.age, y = df.duration, hue=clusters, ax=ax[0,2], palette='bright')

ax[1,0].set_title ("Age distribution by cluster")
sns.boxplot(data=[ df[df.clusters == i].age for i in range(0,optimal_n_clusters)], ax=ax[1,0])

ax[1,1].set_title ("Credit amount distribution by cluster")
sns.boxplot(data=[ df[df.clusters == i].amount for i in range(0,optimal_n_clusters)], ax=ax[1,1])

ax[1,2].set_title ("Credit duration distribution by cluster")
sns.boxplot(data=[ df[df.clusters == i].duration for i in range(0,optimal_n_clusters)], ax=ax[1,2])

plt.tight_layout()
plt.show()
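
Besides plots, a per-cluster summary table can help with labelling. This is a sketch on made-up data: `toy` is a hypothetical stand-in for `df` with its `clusters` column, and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for df with its "clusters" column
toy = pd.DataFrame({
    "age": [23, 25, 51, 48, 30],
    "amount": [1200, 1500, 9000, 8500, 1300],
    "duration": [12, 10, 48, 36, 12],
    "clusters": [0, 0, 1, 1, 0],
})

# Mean and median of each metric feature per cluster
summary = toy.groupby("clusters")[["age", "amount", "duration"]].agg(["mean", "median"])

# Number of rows in each cluster
sizes = toy["clusters"].value_counts().sort_index()

print(summary)
print(sizes)
```

In this toy case, cluster 0 would read as small, short loans for younger applicants and cluster 1 as large, long loans for older applicants.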

### 3.2 Principal Component Analysis (PCA)
In this section we use PCA to reduce the number of dimensions (variables), which makes the dataset easier to handle and lets us visualize all points in a 3D space.

### <span style="color:#46B7E9;">Assignment: Before we get started...</span>
1. Think about why our data preprocessing so far is insufficient for PCA. If unsure, rewatch the video and pay attention to the prerequisites.
2. What is missing? Compare your thoughts to the transformation in the next (hidden) cell.

In [None]:
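# Further required processing of the data
# Solution: we need to z-standardize the features, so their absolute scale has no impact on the PCA solution

to_be_rescaled = ["age", "amount", "duration"]  # the metric features (listed for reference, not used below)
scaler = StandardScaler()   # tool from the scikit-learn library that applies z-standardization (mean of 0, std of 1)

# note: this standardizes all columns of X, including the 0/1 dummy variables; the index of X is kept so rows stay aligned with y
scaled_X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)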
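
What `StandardScaler` computes can be checked by hand. The sketch below uses made-up values for two features on very different scales and compares the manual z-score with the scaler's output; the array `raw` is an illustrative assumption, not part of the credit data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two features on very different scales (e.g. age and amount)
raw = np.array([[25.0, 1000.0],
                [35.0, 5000.0],
                [45.0, 9000.0]])

# Manual z-score: subtract the column mean, divide by the population std (ddof=0)
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Same transformation via scikit-learn
scaled = StandardScaler().fit_transform(raw)

print(scaled)
```

After scaling, both columns have mean 0 and standard deviation 1, so the large absolute values of the second column no longer dominate the variance that PCA tries to capture.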

#### Determine the optimal number of components with a scree plot

In [None]:
pca = PCA(n_components=5)  # extract five components from the data
pca.fit (scaled_X)
X_trans = pca.transform(scaled_X)  # apply pca to dataset to calculate coordinates in PCA space for each credit application

In [None]:
# this is how much variance each component can explain
print(pca.explained_variance_ratio_)

In [None]:
# let's plot this in a scree plot
PC_values = range(0,pca.n_components_) # components
plt.plot(PC_values, pca.explained_variance_ratio_, 'o-', linewidth=2, color='blue')  # plot % of variance explained vs. components
plt.title('Scree Plot')
plt.xlabel('Principal Component #')
plt.ylabel('Variance Explained')
plt.show()

---
### <span style="color:#46B7E9;">Assignment: Interpret the scree plot</span>
1. Think about what the diagram tells you.
2. How much of the original variance can the PCA reproduce approximately with 3 components (that we could visualize in a 3D diagram)?

In [None]:
# calculate the answer
sum(pca.explained_variance_ratio_[0:3])
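
The same question can be answered for every number of components at once with a cumulative sum. A minimal sketch, assuming synthetic data with decreasing spread per feature instead of the credit data; `data`, `pca_demo` and `cumulative` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): 200 samples, 6 features with decreasing spread
rng = np.random.default_rng(42)
data = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])

pca_demo = PCA(n_components=5).fit(data)

# Running total of the explained variance share after 1, 2, ..., 5 components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

print(cumulative)
```

The third entry of `cumulative` is the share of variance retained by the first three components.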

---
### <span style="color:#46B7E9;">Assignment: Think about the shapes of the transformed dataset</span>
What shape should the transformed dataset `X_trans` have?

In [None]:
X_trans.shape

#### Now let's explore the components by generating the loading matrix
The matrix explains how the components are assembled from the input variables.

In [None]:
print("Loading Matrix")
loading_matrix = pd.DataFrame(pca.components_.T, index=X.columns)
loading_matrix.sort_values(0)
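
To illustrate how the weights in `components_` relate to the transformed data, the sketch below reproduces `transform()` by hand on synthetic data: center the data, then take the weighted sum of the input variables for each component. The data and the names `pca_demo` and `manual` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): 100 samples with 4 features
rng = np.random.default_rng(1)
data = rng.normal(size=(100, 4))

pca_demo = PCA(n_components=2).fit(data)

# components_ has one row per component and one column per input feature
print(pca_demo.components_.shape)

# transform() centers the data and projects it onto the component directions
manual = (data - pca_demo.mean_) @ pca_demo.components_.T
projected = pca_demo.transform(data)

print(np.allclose(manual, projected))
```

A large positive or negative weight therefore means that the input variable pushes the component score strongly in that direction.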

---
### <span style="color:#46B7E9;">Assignment: Interpret and name the components 0, 1 and 2</span>
1. Sort the loading matrix by columns 0, 1 and 2 and explore the input variables with the largest positive and negative weights.
2. How would you label components 0, 1 and 2, given which input variables carry weight in the loading matrix and in which direction?

- Component 0:
- Component 1:
- Component 2:

#### Let's visualize the result

In [None]:
import mpl_toolkits.mplot3d 
plt.style.use('default')
 
# Prepare 3D graph
fig = plt.figure(figsize=(10,10))
ax = plt.axes(projection='3d')
 
# Plot scaled features
xdata = X_trans[:,0]
ydata = X_trans[:,1]
zdata = X_trans[:,2]
 
# Plot 3D plot
ax.scatter3D(xdata, ydata, zdata, c=y, cmap='RdBu')  # c=y sets the color of the dot to the risk/no-risk variable we stored in dataframe y
 
# Plot title of graph
plt.title("3D Scatter of Credit Applications")

# Rotate diagram
ax.view_init(30, 30, 0)

# Plot x, y, z labels
ax.set_xlabel('Component 0', rotation=150)
ax.set_ylabel('Component 1')
ax.set_zlabel('Component 2', rotation=60)
plt.show()

---
### <span style="color:#46B7E9;">Assignment: Analyze and interpret the plot</span>
1. Try changing the orientation of the diagram (`ax.view_init`) to separate the high-risk and low-risk point clouds more clearly. Is it possible / helpful?
2. Write the code to describe the risk/no-risk credits using the components to check your visual interpretation

In [None]:
# hint 1
df_tmp = pd.concat([pd.DataFrame(X_trans, index=y.index), y], axis=1)  # reuse y's index so the rows align
df_tmp

In [None]:
# hint 2
df_tmp[df_tmp.risk==0].describe()
# ...
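
One possible way to continue from the hints is to compare the mean component scores of the two risk classes. This is a sketch on made-up scores and labels, not on the credit data; `scores`, `labels` and `demo` are hypothetical names. The label index is reused for the scores so that `pd.concat` aligns rows instead of padding with NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical component scores and labels, not taken from the credit data
scores = np.array([[ 1.0,  0.2],
                   [ 0.8, -0.1],
                   [-0.9,  0.3],
                   [-1.1, -0.4]])
labels = pd.Series([0, 0, 1, 1], index=[101, 102, 103, 104], name="risk")

# Reuse the label index so that concat aligns rows instead of padding with NaN
demo = pd.concat([pd.DataFrame(scores, index=labels.index), labels], axis=1)

# Mean score per component, split by risk class
group_means = demo.groupby("risk").mean()

print(group_means)
```

A component whose mean differs clearly between the two classes is a candidate axis along which the point clouds separate.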