<div>
    <img style="float:right;" src="images/snext-logo.png"/>
    <div style="float:left;color:#626262;padding-top:30px"><h1>Exercise: Supervised learning in Python with scikit-learn</h1></div>
</div>

This notebook contains the skeleton of a simple data analysis that benchmarks two models on a given dataset.

Walk through the analysis by executing the cells one by one, complete the assignments it contains, then apply what you have learned to a new case.

## 1. Case description

### Business Problem
In financial institutions, the process of loan approval and determining the interest rate offered is critical in mitigating potential risks and maximizing returns. The decision-making process involves assessing the risk of each loan application based on various factors, such as credit score, income, and past repayment history. The assessment informs the bank of the probability of the borrower defaulting on the loan, which affects the interest rate offered to the applicant. Therefore, having a reliable and accurate risk assessment model is essential for financial institutions to make informed decisions.

### Research Problem
The research problem is to develop a classification model that can accurately classify loan applications into risk or no-risk categories. The model will review historical data on past loan applications and outcomes to identify patterns and predict the probability of the loan defaulting. Based on the model output, the loan applications shall be classified into those with low or high risk. The outcome of the model will help the bank make informed decisions on the loan amount, interest rates, and payment schedules, thus mitigating potential risks and enhancing returns.

### Training Data
The training dataset will consist of past loan applications and the corresponding outcomes. The data points collected will include the borrower's credit score, income, years of experience, and financial history such as investments, credit card debt, mortgage information, and other assets. The outcome variable will be a binary classification of either a loan default or no default. The model will undergo a series of tests using cross-validation techniques before implementation.
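
The cross-validation mentioned above can be sketched as follows. This is a minimal illustration only: it uses synthetic data from `make_classification` as a hypothetical stand-in for loan applications, and the names `X_demo`, `y_demo` and the choice of 5 folds are assumptions, not part of the exercise.

```python
# Sketch: 5-fold cross-validation on synthetic data (not the credit dataset)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in data: 500 rows, 8 numeric features, binary outcome
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

model = LogisticRegression(max_iter=1000)

# Each of the 5 folds serves once as validation data while the other 4 are used for fitting
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print("Accuracy per fold:", scores.round(2))
print("Mean accuracy:", scores.mean().round(2))
```

Averaging over several folds gives a more stable estimate of model quality than a single train/test split, at the cost of fitting the model several times.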

### Exercise
Developing a risk assessment model for loan applications is crucial for financial institutions to minimize risks and maximize profitability. Students in a university can engage in this problem to gain hands-on experience with data analysis and predictive modeling. The project will involve building and benchmarking classification models that can accurately classify loan applications, considering various features to predict if a loan is likely to default or not. The project will allow students to learn the methods of data pre-processing, model building and evaluation.

## 2. Data loading, preparation and exploration

Load the required libraries and Jupyter extensions

In [None]:
import pandas as pd
import requests
import matplotlib.pyplot as plt
from sklearn import model_selection, linear_model, tree
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.metrics import precision_score, recall_score

# load Jupyter plugins to enable SQL to query data and display plots inline (below the code cell)

%load_ext sql
%matplotlib inline

In [1]:
# The data for this exercise is contained in a sqlite database that is compressed with ZIP
# ZIP file is expected to be in folder data 

# Uncompress database / skip this if you downloaded and unzipped the database manually
import zipfile
zipfile.ZipFile('data/snext-data.zip', 'r').extractall('data')

In [None]:
%sql sqlite:///data/snext-database.db

In [None]:
data = %sql SELECT * FROM credit_ger
df = data.DataFrame()
df.head()

In [None]:
# set index, shorten/unify feature names
df = df.set_index(["id"])
df = df.rename({
    "Age": "age",
    "Sex": "sex",
    "Job": "job",
    "Housing": "housing",
    "Saving_accounts": "savings", 
    "Checking_account": "cash",
    "Credit_amount": "amount",
    "Duration": "duration",
    "Purpose": "purpose",
    "Risk": "risk"
}, axis="columns")
df.head(5)

---
### <span style="color:#46B7E9;">Assignment: Explore the dataset to gain some understanding about the contained credit applications</span>

## 2. Feature Engineering
What is it about?

> Feature engineering is the process of preparing data for algorithms by making features accessible. Typically in this step, domain knowledge of the modeler is incorporated into the dataset.

Feature engineering is an important step in machine learning as it can greatly affect the accuracy of the models. Without feature engineering, the models may not be able to capture the important information in the data. It requires not only technical skills but also a deep understanding of the problem and domain knowledge. By preparing the features carefully, we can improve the models' performance and make better predictions.

In this example, we focus on the minimum: the selected machine learning methods should be able to handle the dataset technically. There are several string features (nominal scale) in the dataset that cannot be directly processed. Therefore, it is useful to recode them into numbers.

In [None]:
# create dataframes for processed data
# X will hold the input features (input for the model)
# y will hold the label (the desired output of the model)
X = pd.DataFrame()
y = pd.DataFrame()

#### Dummy encoding for binary, nominal features

In [None]:
y["risk"] = (df.risk == "bad")*1   # *1 translates True/False to 0/1
X["male"] = (df.sex == "male")*1

print(X.male.value_counts())
print("")
print(y.risk.value_counts())

#### One-Hot-Encoding for nominal features with multiple categories
This method creates one variable for each category of a nominal feature.

In [None]:
example = df["housing"]
one_hot_encoded = pd.get_dummies(example)*1
pd.concat([example, one_hot_encoded], axis=1).head(10)

In [None]:
# apply one-hot-encoding to nominal features with multiple categories

df.purpose = df.purpose.str.slice(0,8) # shorten purpose string

encoded_features = pd.get_dummies(df[["housing","purpose","savings","cash"]], dtype=int) # dtype=int yields 0/1 instead of True/False
X = pd.concat([X, encoded_features], axis=1) # append features to dataframe X with training data

In [None]:
# all metric variables can remain as is, so we append them to the training data 
X = pd.concat([X,df[["age", "amount", "duration"]]], axis=1)

---
### <span style="color:#46B7E9;">Assignment: Inspect the recoded training data X and training data labels y</span>

## 3. Modelling

### Split the data into training and test sets

In [None]:
# check how many (rows, columns) each dataframe holds
print(X.shape)
print(y.shape)

In [None]:
# separate 20% of the data as "test data" that will be set aside and not be used for model training
# we'll use this data later on to check the model performance with data it hasn't "seen" yet

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2, random_state=0)

---
### <span style="color:#46B7E9;">Assignment: Think about how many rows and columns the split dataframes hold before executing the next cell</span>

Answer: X_train has ... rows, ... columns, ...

In [None]:
# check the shape of the resulting dataframes (rows, columns)
print(f"Training data shape: {X_train.shape}")
print(f"Test data shape: {X_test.shape}")
print(f"Training data labels: {y_train.shape}")
print(f"Test data labels: {y_test.shape}")

### Logistic Regression

In [None]:
# fit the regression model
reg = linear_model.LogisticRegression()
reg = reg.fit(X_train, y_train["risk"])

In [None]:
# analyze weight factors
stat = pd.DataFrame([X.columns, reg.coef_.ravel()]).transpose()
stat = stat.sort_values(by=[1])
stat = stat[abs(stat[1])>0.05]   # only important parameters
stat = stat.set_index(0)
stat.plot(kind="barh", title="Regression coefficients of features", legend=False, xlabel="Coefficient value", ylabel="Coefficient name")

---
### <span style="color:#46B7E9;">Assignment: Interpret the diagram with regression coefficients</span>
1. What is the meaning of a regression coefficient?
2. From the diagram: What are the top factors making a credit application look more or less risky? Hint: Label encoding in y is 0=no risk, 1=risk.

Answer: 

### Decision Tree

In [None]:
# build the tree
decision_tree = tree.DecisionTreeClassifier(min_samples_leaf=20) # a split is only made if each resulting leaf keeps at least 20 samples (credit application cases)
decision_tree = decision_tree.fit(X_train, y_train["risk"]) # only use training data
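
To get a feeling for what `min_samples_leaf` does, the following sketch fits two trees on synthetic data and inspects their leaves. The data from `make_classification` and the names `X_demo`, `y_demo`, `leaf_counts` and `smallest_leaf` are illustrative assumptions and unrelated to the credit dataset.

```python
# Sketch: effect of min_samples_leaf on tree size (synthetic data, not the credit dataset)
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=0)

leaf_counts = {}
smallest_leaf = {}
for min_leaf in (1, 20):
    clf = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0).fit(X_demo, y_demo)
    is_leaf = clf.tree_.children_left == -1                 # leaf nodes have no children
    leaf_counts[min_leaf] = clf.get_n_leaves()
    smallest_leaf[min_leaf] = int(clf.tree_.n_node_samples[is_leaf].min())
    print(f"min_samples_leaf={min_leaf}: {leaf_counts[min_leaf]} leaves, "
          f"smallest leaf holds {smallest_leaf[min_leaf]} samples")
```

With `min_samples_leaf=20`, no leaf can hold fewer than 20 training samples, which limits how finely the tree can carve up the data.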

In [None]:
# visualize tree
fig, ax = plt.subplots(1,1,figsize=(35,15))
plt.style.use('default')  # Bug in scikit-learn: if a Seaborn style is set, the tree is not rendered correctly, so reset the style first
t = tree.plot_tree(decision_tree, ax=ax, class_names=True, label="root", precision=2, feature_names=list(X.columns), fontsize=12, proportion=True, filled=True)
plt.show()

---
### <span style="color:#46B7E9;">Assignment: Compare the top regression coefficients with the key splits of the decision tree</span>
1. Recap or research/rewatch the videos from the course to answer this question: How does a feature become important in each of the algorithms?
2. What do you think: should both algorithms rely primarily on the same features?
3. Compare the three most important features in the tree visualization and regression coefficient diagram to check your hypothesis. Are the important features identical, overlapping or different?
4. Think and/or research: under which circumstances will both algorithms not pick the same features as the most important?

## 4. Model Evaluation and Benchmarking

#### Simple metric: Accuracy

In [None]:
print(f"Accuracy DecisionTree: {decision_tree.score(X_train, y_train)} with training data, {decision_tree.score(X_test, y_test)} with test data")

---
### <span style="color:#46B7E9;">Assignment: Interpret the quality metric</span>
1. Research the exact definition: how does the `score` method of a scikit-learn classifier calculate the mean accuracy?
2. Think about the circumstances under which this metric might be misleading!

Answer:

### Confusion Matrices

#### Logistic Regression

In [None]:
ConfusionMatrixDisplay.from_predictions(y_test, reg.predict(X_test), display_labels=["no-risk","risk"], cmap=plt.cm.Blues)

#### Decision Tree

In [None]:
ConfusionMatrixDisplay.from_predictions(y_test, decision_tree.predict(X_test), display_labels=["no-risk","risk"], cmap=plt.cm.Blues)

---
### <span style="color:#46B7E9;">Assignment: Interpret the confusion matrices</span>
1. The matrices show two kinds of error. Which one is "bigger" in terms of how often it occurs?
2. Which of the problems is worse business-wise?


### Precision and Recall

When analyzing a Confusion Matrix, there are two facets of errors that we are particularly interested in. These can be framed as two questions: 
- how many of the applications flagged as risky were actually risky, and
- how many of the risky loans were detected?

These questions are answered by the metrics Precision and Recall (read more [here](https://en.wikipedia.org/wiki/Precision_and_recall)).

Precision refers to the **proportion of actual credit risks among the predicted credit risks**, while Recall refers to the **proportion of identified risks among all risky loans**. 
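
As a small worked example of these two definitions, the sketch below computes both metrics by hand from the counts of a confusion matrix. The ten labels are made up for illustration and do not come from the credit dataset.

```python
# Sketch: Precision and Recall computed by hand from confusion matrix counts
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for ten applications: 1 = risk, 0 = no risk
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted risks that are actual risks
recall = tp / (tp + fn)      # share of actual risks that were predicted

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```

Here two of the three predicted risks are real (Precision 0.67), while only two of the four actual risks were found (Recall 0.50). The values match what `precision_score` and `recall_score` return for the same labels.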

From a business perspective, the types of errors have varying weights and consequences: falsely assuming risk results in missed opportunities for business growth, while failing to identify risk can lead to substantial financial losses. As such, Recall, with its emphasis on identifying risks, should be valued more highly than Precision.

It is important to understand Precision and Recall as they represent two key metrics for evaluating predictive models. Precision measures the ability of the model to avoid making false positive predictions, while Recall measures the ability to detect all positive instances. In other words, Precision identifies how many of the predicted risks were actual risks, and Recall indicates how many of the actual risks were predicted.

Ultimately, it is crucial for businesses to use models that balance Precision and Recall. A model that flags risk only when it is very certain may have high Precision but low Recall, so risky loans slip through and cause financial losses. On the other hand, a model that flags risk too readily may have high Recall but low Precision, leading to a greater number of false positives and missed opportunities for growth. Finding the right balance between these two metrics is key to building a successful predictive model and avoiding costly errors down the line.
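
One way to see this trade-off is to vary the probability threshold above which an application is flagged as risky. The following is a minimal sketch on synthetic data. The dataset from `make_classification`, the names `X_demo`, `y_demo`, `results` and the thresholds 0.3, 0.5 and 0.7 are assumptions chosen for illustration.

```python
# Sketch: trading Precision against Recall by moving the decision threshold
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 30% positives), not the credit dataset
X_demo, y_demo = make_classification(n_samples=1000, n_features=8,
                                     weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob_risk = clf.predict_proba(X_te)[:, 1]   # predicted probability of class 1 ("risk")

results = {}
for threshold in (0.3, 0.5, 0.7):
    y_hat = (prob_risk >= threshold).astype(int)   # flag as risk once the probability reaches the threshold
    results[threshold] = (
        precision_score(y_te, y_hat, zero_division=0),
        recall_score(y_te, y_hat, zero_division=0),
    )
    print(f"threshold {threshold}: precision {results[threshold][0]:.2f}, "
          f"recall {results[threshold][1]:.2f}")
```

Raising the threshold can only keep Recall the same or lower it, because fewer applications are flagged. Precision often rises in return, although that is not guaranteed.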

In [None]:
tree_precision = precision_score(y_test.values.ravel(), decision_tree.predict(X_test))
tree_recall    = recall_score(y_test.values.ravel(), decision_tree.predict(X_test))

reg_precision = precision_score(y_test.values.ravel(), reg.predict(X_test))
reg_recall    = recall_score(y_test.values.ravel(), reg.predict(X_test))

print("Tree:       Precision {:.2f}%, Recall {:.2f}%".format(100 * tree_precision, 100 * tree_recall))
print("Regression: Precision {:.2f}%, Recall {:.2f}%".format(100 * reg_precision, 100 * reg_recall))

---
### <span style="color:#46B7E9;">Assignment: Interpret the precision and recall metric</span>
1. What do these metrics tell you about the prediction quality? What new strengths and weaknesses can you uncover?
2. How relevant is this difference between the models from a business side?

Answer: 

### Analyzing prediction errors
To discover areas for improvement, let's examine the biggest mistakes made by the model. To demonstrate the process we look at the predictions of the decision tree.

In [None]:
Y_pred = decision_tree.predict(X_test)                                    # predictions from the tree model for the test data
Y_prob = decision_tree.predict_proba(X_test)                              # probabilities for classes 0 and 1
df_pred = pd.DataFrame(Y_pred, columns=["prediction"], index=X_test.index)         # keep the id index so rows align
df_prob = pd.DataFrame(Y_prob, columns=["Prob_0", "Prob_1"], index=X_test.index)
df_err = pd.concat([X_test, y_test, df_pred, df_prob], axis=1)   # assemble dataframe with all diagnostic information

df_err = df_err[df_err.risk != df_err.prediction].copy()         # keep only the misclassified applications
df_err["error_size"] = df_err[["Prob_0","Prob_1"]].max(axis=1)   # probability the model assigned to its (wrong) prediction

df_err.sort_values("error_size", ascending=False)

Now let's visualize some characteristics of the misclassified credit applications.

The first row of plots describes the misclassified applications, the second row the whole test dataset.

In [None]:
fig, ax = plt.subplots(2,3, figsize=(12,6))

df_err.amount.plot(kind="kde", ax=ax[0,0])
df_err.duration.plot(kind="kde", ax=ax[0,1])
df_err.age.plot(kind="kde", ax=ax[0,2])

X_test.amount.plot(kind="kde", ax=ax[1,0])
X_test.duration.plot(kind="kde", ax=ax[1,1])
X_test.age.plot(kind="kde", ax=ax[1,2])

ax[0,0].set_title("amount")
ax[0,1].set_title("duration")
ax[0,2].set_title("age")

#[a.grid(linestyle="--", linewidth=.5) for a in ax]

plt.tight_layout()
plt.show()

---
### <span style="color:#46B7E9;">Assignment: Interpret the findings and conduct additional analysis</span>
1. Which credit applications does the tree struggle with?
2. Create some hypotheses: What could be the cause? How could we mitigate the problem?
   - Is the model not capable enough?
   - Are we missing data that could shed more light on these specific cases?
   - ...


Answer: 