<a href="https://colab.research.google.com/github/stefanlessmann/ASE-ML/blob/main/Day-1-Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to our Section 12 Deep Learning Practice Session
Today, we put together everything studied thus far and develop a fully fledged pipeline for training, testing, and selecting a neural network classifier to predict the risk of credit default. 

The outline of today's session is as follows:
- Load real-world credit data from GitHub
- Eyeball the data using the `Pandas` library
- Perform basic data preprocessing
- Partition the data into training and testing data
- Train and assess a neural network classifier. For assessment, compute the classification accuracy of your trained network.
- Optional task for experts: Tune the architecture of the neural network

Much work to do, so let's go!


In [None]:
# Standard libraries for data handling and plotting
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd


## Load real-world credit data from GitHub
The data set we use for this session is available at the following URL: <br>
https://raw.githubusercontent.com/stefanlessmann/ASE-ML/master/hmeq.csv

Create a variable named `data_url` in which you store this URL. Next, use the method `read_csv()` from the `Pandas` library to download the data right from the web and store it in a `DataFrame`. 
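
One possible way to approach this step is sketched below. The helper `load_credit_data` and the three-row sample are illustrative assumptions and not part of the course material; the sample values are made up so that the call can be tried without network access.

```python
import io

import pandas as pd

# URL of the HMEQ data set as given above
data_url = "https://raw.githubusercontent.com/stefanlessmann/ASE-ML/master/hmeq.csv"


def load_credit_data(source=data_url):
    """Read a CSV file from a URL, a local path, or a file-like object."""
    return pd.read_csv(source)


# Offline check with a made-up sample (NOT real HMEQ rows); the empty field
# becomes NaN, which is how Pandas marks missing values
sample_csv = io.StringIO(
    "BAD,LOAN,MORTDUE,VALUE\n"
    "1,1000,20000,35000\n"
    "0,1500,,60000\n"
    "0,2000,15000,18000\n"
)
demo = load_credit_data(sample_csv)
print(demo.shape)
```

In the notebook, `df = load_credit_data()` (or simply `df = pd.read_csv(data_url)`) would download the full data set. The preprocessing cells further down expect the result under the name `df`.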

In [None]:
# Enter code to download the demo data set from the web 


#### Introducing the HMEQ data set (you can skip this section)
Our data set, called the "Home Equity" or, in brief, HMEQ data set, is provided by www.creditriskanalytics.net. It comprises information about a set of borrowers, who are described by demographic variables and variables concerning their business relationship with the lender. A binary target variable called 'BAD' is provided and indicates whether a borrower has defaulted on her/his debt. You can think of the data as a standard use case of binary classification.

You can obtain the data, together with other interesting finance data sets, directly from www.creditriskanalytics.net. The website also provides a brief description of the data set. Specifically, the data set consists of 5,960 observations and 13 variables including the target variable. The variables are defined as follows:

- BAD: the target variable, 1=default; 0=non-default 
- LOAN: amount of the loan request
- MORTDUE: amount due on an existing mortgage
- VALUE: value of current property
- REASON: DebtCon=debt consolidation; HomeImp=home improvement
- JOB: occupational categories
- YOJ: years at present job
- DEROG: number of major derogatory reports
- DELINQ: number of delinquent credit lines
- CLAGE: age of oldest credit line in months
- NINQ: number of recent credit inquiries
- CLNO: number of credit lines
- DEBTINC: debt-to-income ratio

As you can see, the features aim at describing the financial situation of a borrower. We will keep using the data set for many modeling tasks in this demo notebook and future demo notebooks. So it makes sense to familiarize yourself with the above features. Make sure you understand what type of information they provide and what this information might reveal about the risk of defaulting.  

## Eyeballing data using the Pandas library
In this part, we briefly demonstrate some practices for obtaining a first intuition about a data set. Let's first identify some standard methods that the `Pandas` library provides for this purpose. The list is not meant to be comprehensive. With that disclaimer, however, it is fair to say that you will apply all of the following methods to a new DataFrame.  
- `info()` 
- `describe()`
- `head()` / `tail()`

Try them out and inspect the output. Briefly summarize your findings.
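
As a rough illustration of what these methods report, the sketch below applies them to a small toy frame. The frame `demo` and its values are made up for this example; in the notebook you would call the same methods on your downloaded `DataFrame`.

```python
import numpy as np
import pandas as pd

# Made-up toy frame that mimics a few HMEQ columns (assumption, not real data)
demo = pd.DataFrame({
    "BAD": [1, 0, 0, 1],
    "LOAN": [1000, 1500, 2000, 2500],
    "VALUE": [35000.0, np.nan, 18000.0, 42000.0],
    "JOB": ["Other", "Office", None, "Mgr"],
})

demo.info()                # column names, dtypes, and non-null counts
summary = demo.describe()  # count, mean, std, min, quartiles, max of numeric columns
print(summary)
print(demo.head(2))        # first two rows
print(demo.tail(2))        # last two rows
```

Note that `describe()` skips the non-numeric column `JOB` by default, and that its `count` row reveals missing values: `VALUE` has only three non-missing entries.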

In [None]:
# Standard methods for exploring a DataFrame 


Creating outputs in the form of lists and tables is fine and can provide useful insights. However, you probably know the saying "*a picture says more than a thousand words*". Suitable graphics are often more insightful than (big) tables and can convey information in a more accessible way. Fortunately, `Pandas` offers many useful features to visualize the data within a DataFrame.

Try to accomplish the following tasks. A web search using competently selected search phrases will easily reveal the `Pandas` methods you need and how these are invoked. 
- Create a *histogram* for the dependent (aka target) variable **BAD**
- Create a density plot to depict the distribution of the variable **LOAN**
- It is more common to visualize the distribution of numerical variables using *boxplots*. Consider again the variable **LOAN** and depict its distribution using a *boxplot*
- Create one last boxplot of the variable **LOAN**. This time, depict the distribution of loan amounts separately for good and bad borrowers. To that end, use the target variable **BAD** to add a grouping to your boxplot.
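
The four plots could be produced along the following lines. This is only a sketch: it uses a synthetic stand-in frame `demo` with random values instead of the HMEQ data, and it places all plots in one figure, whereas the notebook provides one cell per plot. The density plot relies on SciPy being installed.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the credit data (assumption: random values, not HMEQ)
rng = np.random.default_rng(seed=0)
demo = pd.DataFrame({
    "BAD": rng.integers(0, 2, size=200),
    "LOAN": rng.gamma(shape=2.0, scale=9000.0, size=200),
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

demo["BAD"].plot.hist(ax=axes[0, 0], title="Histogram of BAD")
demo["LOAN"].plot.density(ax=axes[0, 1], title="Density of LOAN")  # needs SciPy
demo["LOAN"].plot.box(ax=axes[1, 0], title="Boxplot of LOAN")
demo.boxplot(column="LOAN", by="BAD", ax=axes[1, 1])  # one box per value of BAD

fig.suptitle("")  # drop the figure-wide title that the grouped boxplot adds
fig.tight_layout()
```

With the real data, replacing `demo` by your own `DataFrame` should be all that is needed.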

In [None]:
# Create a histogram for the dependent (aka target) variable BAD

In [None]:
# Create a density plot to depict the distribution of the variable LOAN

In [None]:
# Create a box plot to depict the distribution of the variable LOAN

In [None]:
# Create a box plot of the variable LOAN while grouping by the target

## Perform basic data preprocessing
Data preparation is a topic of great importance. We could easily devote an entire lecture to this topic. Below, we sketch only a small part of the tasks typically performed in the scope of data preparation. 

In fact, we could not proceed with the data in its present form. For example, going back to the tabular preview of our data, you might notice that several variables show some missing values (denoted as NaN for *not a number* in the preview). Leaving those missing values without treatment would break our neural network.

Similarly, the data preview highlights that different variables show substantially different value ranges. Simply compare the values (or descriptive statistics also shown above) of the variable **LOAN** to those of **VALUE** to see this. For example, the maximal value of the variable **LOAN** is 89,900, whereas the largest value of the variable **VALUE** is 855,909, so almost ten times bigger. Differences in the value range of variables are normal but, unless suitably treated in data preparation, will also harm our neural network. 

Last, the data set includes some categorical variables. These would also need special treatment, which we do not have the time to detail. Thus, we simply delete them.

In sum, we need to accomplish the following tasks:
- Deleting the categorical variables **REASON** and **JOB**
- Replacing all missing values in the data using *mean imputation*
- Transforming the value range of all variables such that values are scaled between zero and one.
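
To make the last two tasks concrete, here is a minimal sketch that carries out mean imputation and min-max scaling by hand on a made-up toy column. The values are assumptions chosen so that the arithmetic is easy to follow.

```python
import numpy as np
import pandas as pd

# Made-up toy column with one missing value (assumption, not real data)
loan = pd.Series([1000.0, np.nan, 3000.0, 5000.0], name="LOAN")

# Mean imputation: the mean ignores NaN, so it is (1000 + 3000 + 5000) / 3 = 3000
loan_imputed = loan.fillna(loan.mean())

# Min-max scaling: x' = (x - min) / (max - min) maps the column onto [0, 1]
loan_range = loan_imputed.max() - loan_imputed.min()
loan_scaled = (loan_imputed - loan_imputed.min()) / loan_range

print(loan_scaled.tolist())  # [0.0, 0.5, 0.5, 1.0]
```

The smallest value maps to zero, the largest to one, and the imputed value ends up in the middle of the range.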

Let's go...

In [None]:
# Delete selected columns
df = df.drop(columns=["REASON", "JOB"])
df.info() # verify the columns are no longer part of the DataFrame

In [None]:
# Replace missing values in a column with the mean of that column 
df = df.fillna(value=df.mean())
df

In [None]:
# Scale value ranges of variables to zero and one
# REMARK: it is important we perform this operation only on the independent 
#         variables and not on the dependent variable. Our target variable
#         remains unchanged. 
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0,1))

# Remove the target variable so that it remains unchanged
y = df.pop("BAD")

# Scale the remaining columns
# We also create a new DataFrame to store the transformed data. 
# This is because our scaler does not alter the original data but
# first creates a copy of the data and then transforms this copy.  
df_ready = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Verify our transformed data shows the same value ranges for all columns
df_ready.describe()  # We reuse the method .describe(), which computes min/max values for each column


## Partition the data into training and testing data
Data partitioning is standard practice to ensure we use disjoint data for training and evaluating a machine learning model. 

The `sklearn` library provides many methods to partition data for standard machine learning workflows. Arguably, the easiest approach is to use the method `train_test_split()`. Check out its documentation and partition our *prepared* data. Let's say we use 30% of the data for testing and the rest for training.
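
A minimal sketch of such a split is shown below. It uses synthetic stand-ins `X_demo` and `y_demo` (assumptions for this example) so that it runs on its own; in the notebook you would pass `df_ready` and `y` instead. The value of `random_state` is arbitrary and only makes the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the prepared features and the target (assumption)
rng = np.random.default_rng(seed=0)
X_demo = pd.DataFrame(rng.random((100, 3)), columns=["LOAN", "VALUE", "DEBTINC"])
y_demo = pd.Series(rng.integers(0, 2, size=100), name="BAD")

# Hold out 30% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)  # (70, 3) (30, 3)
```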

## Train and assess a neural network classifier
Finally, we come to the point where neural networks come in. In fact, the following exercise of training and assessing a classifier will look pretty much the same with any learning algorithm. That is one of the nice features of the `sklearn` library. It offers a consistent interface to many different learning algorithms. This way, we can easily switch from one learning algorithm to another if we like. For now, however, we focus on neural networks. 

Here is the chain of tasks we need to perform:
- Import relevant libraries to train a neural network classifier
- Train our NN classifier using the training data
- Compute NN predictions for the test data
- Compute a performance indicator over the test set predictions. 
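
The chain of tasks above could look roughly as follows. The sketch assumes `MLPClassifier` as the `sklearn` neural network class and uses synthetic data from `make_classification` in place of the credit data; the network size and the other settings are illustrative choices, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic binary classification data as a stand-in for the credit data (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# One hidden layer with 16 nodes; these settings are illustrative, not tuned
nn = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
nn.fit(X_train, y_train)  # train on the training data

y_pred = nn.predict(X_test)  # discrete class predictions for the test data
accuracy = accuracy_score(y_test, y_pred)  # share of correctly classified test cases
print(f"Test accuracy: {accuracy:.3f}")
```

Because of the consistent `sklearn` interface, swapping `MLPClassifier` for another classifier would leave the `fit()` and `predict()` calls unchanged.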


In [None]:
# Import relevant libraries to train a neural network classifier


# Train our NN classifier using the training data


# Compute NN predictions for the test data


# Compute a performance indicator over the test set predictions. 




## Optional task for experts: Tuning the architecture of the neural network
Try to improve your NN classifier by systematically evaluating alternative network architectures. For example, you can consider networks with different numbers of layers and different numbers of nodes in those layers. 

In fact, there are many other configuration options (called meta-parameters) you could consider, as we discussed in the lecture. The point of this task is not to find the best NN classifier for our specific data set. Rather, the point is to familiarize you with relevant libraries and methods to carry out such tuning tasks using Python. In the lecture, we introduced the concept of *grid search*. Time to run a web search for, e.g., "sklearn grid search" and see what it produces ;) 
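
For orientation, a grid search over a few network architectures might be set up as sketched below. The candidate architectures, the number of folds, and the synthetic data are assumptions made for this example; `GridSearchCV` and `MLPClassifier` are the `sklearn` classes used.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the prepared credit data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate architectures: one or two hidden layers with different numbers of nodes
param_grid = {"hidden_layer_sizes": [(8,), (16,), (8, 8)]}

search = GridSearchCV(
    estimator=MLPClassifier(max_iter=1000, random_state=0),
    param_grid=param_grid,
    scoring="accuracy",
    cv=3,  # 3-fold cross-validation for every candidate
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

Further meta-parameters can be added as extra keys in `param_grid`. Keep in mind that the number of models to train grows multiplicatively with every added option.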

In [None]:
# Trying to improve our NN by grid searching meta-parameter options