# Credit Default Prediction using Supervised and Unsupervised Learning Techniques

Dataset: https://www.kaggle.com/c/home-credit-default-risk/data

Description

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Because the dataset is labeled, we try both supervised and unsupervised approaches.

## Supervised

We used the following machine learning algorithms to build classification models for the supervised task. <br>
1) Logistic Regression <br>
2) Random Forest <br>
3) Extreme Gradient boosting

### Logistic Regression

Logistic regression predicts the probability of an outcome that can only have two values (i.e. a dichotomy). The prediction is based on the use of one or several predictors (numerical and categorical). A linear regression is not appropriate for predicting the value of a binary variable for two reasons: <br>
• A linear regression will predict values outside the acceptable range (e.g. predicting probabilities outside the range 0 to 1) <br>
• Since the dichotomous experiments can only have one of two possible values for each experiment, the residuals will not be normally distributed about the predicted line. <br>
On the other hand, a logistic regression produces a logistic curve, which is limited to values between 0 and 1. Logistic regression is similar to a linear regression, but the curve is constructed using the natural logarithm of the “odds” of the target variable, rather than the probability. Moreover, the predictors do not have to be normally distributed or have equal variance in each group. <br>

![LogReg.png](attachment:LogReg.png)

Image source: https://www.c-sharpcorner.com/article/logistic-regression/

In logistic regression, the constant (b0) shifts the curve left and right, and the slope (b1) sets its steepness. By a simple transformation, the logistic regression equation can be written in terms of the odds (the probability of the event divided by the probability of no event).

## $$\frac{p}{1-p}=\exp(b_0+b_1 x)$$

Finally, taking the natural log of both sides, we can write the equation in terms of log-odds (logit) which is a linear function of the predictors. The coefficient (b1) is the amount the logit (log-odds) changes with a one unit change in x.

## $$\ln\left(\frac{p}{1-p}\right)=b_0+b_1 x$$

As mentioned before, logistic regression can handle any number of numerical and/or categorical variables.

## $$p=\frac{1}{1+e^{-(b_0+b_1 x_1+b_2 x_2+\dots+b_p x_p)}}$$
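As a quick numerical check of the equations above, the following sketch evaluates the logistic function for one predictor and confirms that the odds and the log-odds recover the linear predictor. The coefficients `b0`, `b1` and the inputs `x` are made-up illustrative values, not fitted to the Home Credit data.

```python
import numpy as np

def logistic(z):
    """Map a linear predictor z = b0 + b1*x to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(p):
    """Inverse of the logistic function: ln(p / (1 - p))."""
    return np.log(p / (1.0 - p))

# Illustrative coefficients (assumed values, not fitted to any data).
b0, b1 = -1.0, 0.5
x = np.array([-4.0, 0.0, 2.0, 6.0])

z = b0 + b1 * x          # linear predictor
p = logistic(z)          # predicted probabilities, always between 0 and 1
odds = p / (1.0 - p)     # should equal exp(b0 + b1*x)
```

Increasing `x` by one unit adds `b1` to the log-odds, which is the interpretation of the coefficient given above.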

Model Evaluation

![image.png](attachment:image.png)

Model Evaluation after Balancing the class using SMOTETomek

SMOTETomek is a hybrid method that combines an under-sampling method (Tomek links) with an over-sampling method (SMOTE).
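To make the two ingredients concrete, here is a from-scratch NumPy sketch of the ideas. It is not the library implementation used for the results: `smote_samples` and `tomek_links` are hypothetical helper names, the toy data is synthetic, and dropping both members of each Tomek link is a choice made for this sketch.

```python
import numpy as np

def smote_samples(X_min, n_new, k=3, seed=0):
    """SMOTE idea: create points on segments between minority neighbors."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                    # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]       # k nearest minority neighbors
    base = rng.integers(0, len(X_min), size=n_new)
    nb = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                   # position along the segment
    return X_min[base] + gap * (X_min[nb] - X_min[base])

def tomek_links(X, y):
    """Indices of points that form mutual nearest-neighbor pairs across classes."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)
    idx = np.arange(len(X))
    mutual = nn[nn] == idx
    return idx[mutual & (y != y[nn])]

# Synthetic, imbalanced toy data (assumed for illustration only).
rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(40, 2))
X_min = rng.normal(2.5, 0.5, size=(8, 2))

X_new = smote_samples(X_min, n_new=32, k=3)        # over-sample the minority class
X_bal = np.vstack([X_maj, X_min, X_new])
y_bal = np.array([0] * 40 + [1] * 40)

links = tomek_links(X_bal, y_bal)                  # then clean the class boundary
keep = np.setdiff1d(np.arange(len(X_bal)), links)
X_clean, y_clean = X_bal[keep], y_bal[keep]
```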

![image.png](attachment:image.png)

Model Evaluation after Balancing the class using  RandomOverSampler 

![image.png](attachment:image.png)

### Random Forest

Random forests, also known as random decision forests, are a popular ensemble method that can be used to build predictive models for both classification and regression problems. Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).

Each tree is grown as follows: <br>
1. If the number of cases in the training set is N, sample N cases at random - but with replacement, from the original data. This sample will be the training set for growing the tree. <br>
2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
3. Each tree is grown to the largest extent possible. There is no pruning.
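The sampling in steps 1 and 2 and the final vote can be sketched in a few lines of NumPy. The sizes `N` and `M`, the choice `m = sqrt(M)`, and the helper `majority_vote` are assumptions made for illustration; the text above only requires m<<M.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 25              # assumed numbers of training cases and input variables
m = int(np.sqrt(M))          # one common choice for m

# Step 1: bootstrap sample, N draws with replacement from the training set.
boot_idx = rng.integers(0, N, size=N)
oob_idx = np.setdiff1d(np.arange(N), boot_idx)   # cases this tree never sees

# Step 2: at a node, choose m of the M variables at random (no repeats).
node_features = rng.choice(M, size=m, replace=False)

def majority_vote(tree_votes):
    """Forest prediction per case; rows are trees, columns are cases."""
    votes = np.asarray(tree_votes)
    n_classes = votes.max() + 1
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)
```

The cases left out of a bootstrap sample (roughly a third of the data on average) are the "out-of-bag" cases for that tree.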

In the original paper on random forests, it was shown that the forest error rate depends on two things: <br>
▪ The correlation between any two trees in the forest. Increasing the correlation increases the forest error rate. <br>
▪ The strength of each individual tree in the forest. A tree with a low error rate is a strong classifier. Increasing the strength of the individual trees decreases the forest error rate.

Reducing m reduces both the correlation and the strength; increasing it increases both. Somewhere in between is an "optimal" range of m, usually quite wide. Using the out-of-bag (oob) error rate, a value of m in that range can be found quickly. This is the only adjustable parameter to which random forests are somewhat sensitive. <br>
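A minimal sketch of this search with scikit-learn, where `max_features` plays the role of m and `oob_score_` gives the out-of-bag accuracy. The synthetic data from `make_classification` and the candidate values of m are assumptions for illustration, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data (assumed, not the Home Credit set).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)

oob_error = {}
for m in (2, 4, 8, 16):                      # candidate values of m
    forest = RandomForestClassifier(
        n_estimators=100,
        max_features=m,                      # variables tried at each split
        oob_score=True,                      # score each case on trees that never saw it
        bootstrap=True,
        random_state=0,
    )
    forest.fit(X, y)
    oob_error[m] = 1.0 - forest.oob_score_

best_m = min(oob_error, key=oob_error.get)   # m with the lowest oob error
```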

Features of Random Forests <br>
▪ Its authors report it as unexcelled in accuracy among the algorithms available at the time.<br>
▪ It runs efficiently on large data bases.<br>
▪ It can handle thousands of input variables without variable deletion.<br>
▪ It gives estimates of what variables are important in the classification.<br>
▪ It generates an internal unbiased estimate of the generalization error as the forest building progresses.<br>
▪ It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.<br>
▪ It has methods for balancing error in class population unbalanced data sets.<br>
▪ Generated forests can be saved for future use on other data.<br>
▪ Prototypes are computed that give information about the relation between the variables and the classification.<br>
▪ It computes proximities between pairs of cases that can be used in clustering, locating outliers, or (by scaling) give interesting views of the data.<br>
▪ These capabilities can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.<br>
▪ It offers an experimental method for detecting variable interactions.<br>

Model Evaluation

![image.png](attachment:image.png)

Model Evaluation after Balancing the class using SMOTETomek

![image.png](attachment:image.png)

Model Evaluation after Balancing the class using  RandomOverSampler 

![image.png](attachment:image.png)

### Extreme Gradient Boosting (XGBoost)

XGBoost is one of the most popular and efficient implementations of the Gradient Boosted Trees algorithm, a supervised learning method that is based on function approximation by optimizing specific loss functions as well as applying several regularization techniques.
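To illustrate the underlying idea of boosting trees against a loss function, here is a bare-bones gradient boosting loop for log loss. It is deliberately not XGBoost: it uses scikit-learn's `DecisionTreeRegressor` as the base learner and omits XGBoost's regularization and second-order updates. The data, tree depth, learning rate and number of rounds are all assumed values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

def log_loss(y, raw):
    """Mean binary log loss for raw (log-odds) scores."""
    p = 1.0 / (1.0 + np.exp(-raw))
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Synthetic stand-in data (assumed for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

raw = np.zeros(len(y))          # start from log-odds 0, i.e. p = 0.5
learning_rate = 0.5
trees, losses = [], [log_loss(y, raw)]

for _ in range(20):
    p = 1.0 / (1.0 + np.exp(-raw))
    residual = y - p            # negative gradient of log loss w.r.t. the raw score
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    raw += learning_rate * tree.predict(X)   # small step in the direction of the fit
    trees.append(tree)
    losses.append(log_loss(y, raw))
```

Each round fits a small tree to the current errors and adds a damped version of it to the model, so the training loss in `losses` shrinks from round to round.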

Model Evaluation

![image.png](attachment:image.png)

Model Evaluation after Balancing the class using SMOTETomek

![image.png](attachment:image.png)

Model Evaluation after Balancing the class using  RandomOverSampler 

![image.png](attachment:image.png)

## Unsupervised

We used both the scikit-learn and PyOD frameworks. PyOD is a comprehensive and scalable Python toolkit for detecting outlying objects in multivariate data.

We used the following machine learning algorithms to build anomaly detection models for the unsupervised task.

Outlier Ensembles <br>
1) Isolation Forest <br>

Proximity-Based <br>
2) Local Outlier Factor <br>
3) Clustering-Based Local Outlier Factor <br>
4) Histogram-based Outlier Score <br>
5) k Nearest Neighbors (use the distance to the kth nearest neighbor as the outlier score) <br>

Linear Model <br>
6) Principal Component Analysis <br>

Non-Linear Model <br>
7) Kernel Principal Component Analysis <br>

### What is an anomaly?

An anomaly can be viewed as a rare or unusual observation in the dataset. For example, in a credit card transaction dataset, the fraudulent transactions are anomalies because the number of fraud cases is very small compared to the number of normal transactions.

In anomaly detection, we try to identify observations that are statistically different from the rest of the observations.

### Isolation Forest Algorithm

Isolation Forest is a relatively recent technique for detecting anomalies. The algorithm is based on the fact that anomalies are data points that are few and different. As a result of these properties, anomalies are susceptible to a mechanism called isolation.

This method differs fundamentally from distance- and density-based approaches: it uses isolation as a more effective and efficient means of detecting anomalies than the commonly used basic distance and density measures. Moreover, the algorithm has linear time complexity with a low constant and a small memory requirement. It builds a well-performing model with a small number of trees using small sub-samples of fixed size, regardless of the size of the data set.

Typical machine learning methods tend to work better when the patterns they try to learn are balanced, meaning the same amount of good and bad behaviors are present in the dataset.

How Isolation Forests Work

The Isolation Forest algorithm isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of that feature. The reasoning is as follows: isolating anomalous observations is easier because only a few conditions are needed to separate them from the normal observations, whereas isolating normal observations requires more conditions. Therefore, an anomaly score can be calculated from the number of conditions required to separate a given observation.

The way that the algorithm constructs the separation is by first creating isolation trees, or random decision trees. Then, the score is calculated as the path length to isolate the observation.
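A minimal scikit-learn sketch of this behavior on synthetic data: one point is planted far from a Gaussian cloud and should receive the most anomalous score. The data, the planted point and the parameter values are assumptions for illustration, not the configuration evaluated below.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 300 normal points plus one planted anomaly (assumed example).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)), [[8.0, 8.0]]])

iso = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
iso.fit(X)

scores = iso.score_samples(X)   # lower score = shorter average path = more anomalous
labels = iso.predict(X)         # -1 for anomalies, +1 for normal observations
```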

Scikit Learn Model Evaluation:

![image.png](attachment:image.png)

PyOD Model Evaluation:

![image.png](attachment:image.png)

### Local Outlier Factor (LOF) Algorithm

The LOF algorithm is an unsupervised outlier detection method that computes the local density deviation of a given data point with respect to its neighbors. It considers as outliers the samples that have a substantially lower density than their neighbors.

The number of neighbors considered (parameter n_neighbors) is typically chosen 1) greater than the minimum number of objects a cluster has to contain, so that other objects can be local outliers relative to this cluster, and 2) smaller than the maximum number of nearby objects that can potentially be local outliers. In practice, such information is generally not available, and taking n_neighbors=20 appears to work well in general.
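A short scikit-learn sketch with n_neighbors=20 on synthetic data (an assumed toy example with one planted outlier, not the project data):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: a dense cloud plus one isolated point (assumed example).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)), [[8.0, 8.0]]])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)               # -1 for outliers, +1 for inliers
scores = lof.negative_outlier_factor_     # near -1 for inliers, far below for outliers
```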

Scikit Learn Model Evaluation:

![image.png](attachment:image.png)

PyOD Model Evaluation:

![image.png](attachment:image.png)

### Cluster Based Local Outlier Factor

CBLOF calculates an outlier score from a cluster-based local outlier factor.

CBLOF takes as input the data set and the cluster model generated by a clustering algorithm. It divides the clusters into small and large clusters using the parameters alpha and beta. The anomaly score is then calculated from the size of the cluster the point belongs to and the distance to the nearest large cluster.

The original publication weights the outlier factor by the sizes of the clusters. Since this might lead to unexpected behavior (outliers close to small clusters are not found), weighting is disabled by default, and outlier scores are then computed solely from the distance to the closest large cluster center.

By default, k-means is used as the clustering algorithm instead of the Squeezer algorithm mentioned in the original paper.
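The unweighted variant can be sketched with scikit-learn's `KMeans`. This is an illustration of the idea, not PyOD's code: `cblof_scores` is a hypothetical helper, the data is synthetic, and the rule used here to split large from small clusters accepts either the alpha or the beta condition, which is an assumption (implementations differ on this detail).

```python
import numpy as np
from sklearn.cluster import KMeans

def cblof_scores(X, n_clusters=3, alpha=0.9, beta=5, seed=0):
    """Unweighted CBLOF-style score: distance to the nearest large-cluster center."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    sizes = np.bincount(km.labels_, minlength=n_clusters)
    order = np.argsort(-sizes)                  # cluster ids, largest first
    sorted_sizes = sizes[order]
    covered = np.cumsum(sorted_sizes)

    # Find the boundary between large and small clusters.
    n_large = n_clusters                        # fallback: every cluster is large
    for b in range(1, n_clusters):
        enough_points = covered[b - 1] >= alpha * len(X)
        sharp_drop = sorted_sizes[b - 1] >= beta * sorted_sizes[b]
        if enough_points or sharp_drop:
            n_large = b
            break

    large_centers = km.cluster_centers_[order[:n_large]]
    dists = np.linalg.norm(X[:, None, :] - large_centers[None, :, :], axis=2)
    return dists.min(axis=1)

# Two big clusters and a tiny, distant group of three points (assumed example).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0.0, 0.0], 1.0, size=(100, 2)),
    rng.normal([6.0, 0.0], 1.0, size=(100, 2)),
    rng.normal([20.0, 20.0], 0.1, size=(3, 2)),
])
scores = cblof_scores(X)
```

The three points in the tiny cluster are scored by their distance to the nearest large cluster, so they end up with the highest scores.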

PyOD Model Evaluation:

![image.png](attachment:image.png)

### Principal Component Analysis 

We use PCA to learn the underlying structure of the loan application dataset. The more anomalous the data is, the more likely it is to be fraudulent, assuming that fraud is rare and looks somewhat different from the majority of application data, which are normal. Once we learn this structure, we will use the learned model to reconstruct the loan application data and then calculate how different the reconstructed data are from the original data. Those applications that PCA does the poorest job of reconstructing are the most anomalous (and most likely to be fraudulent).


The algorithms will have the largest reconstruction error on those data points that are hardest to model—in other words, those that occur the least often and are the most anomalous. Since fraud is rare and presumably different than normal observations, the fraudulent observations should exhibit the largest reconstruction error. So let’s define the anomaly score as the reconstruction error. The reconstruction error for each transaction is the sum of the squared differences between the original feature matrix and the reconstructed matrix using the dimensionality reduction algorithm. We will scale the sum of the squared differences by the max-min range of the sum of the squared differences for the entire dataset, so that all the reconstruction errors are within a zero to one range.

The observations that have the largest sum of squared differences will have an error close to one, while those that have the smallest sum of squared differences will have an error close to zero.

This should be familiar. Like the supervised models built earlier, the dimensionality reduction algorithm effectively assigns each observation an anomaly score between zero and one. Zero is normal and one is anomalous (and most likely to be fraudulent).
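The scoring described above can be sketched as follows. The helper name `anomaly_scores`, the synthetic three-feature data and the choice of one component are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

def anomaly_scores(X, X_reconstructed):
    """Sum of squared reconstruction differences per row, min-max scaled to [0, 1]."""
    loss = np.sum((X - X_reconstructed) ** 2, axis=1)
    return (loss - loss.min()) / (loss.max() - loss.min())

# Synthetic data that lies close to a line, plus one point far off it (assumed).
rng = np.random.default_rng(0)
t = rng.normal(0.0, 3.0, size=300)
X = np.column_stack([t, 2.0 * t, rng.normal(0.0, 0.1, size=300)])
X = np.vstack([X, [[0.0, 0.0, 10.0]]])

pca = PCA(n_components=1, random_state=0)
X_reconstructed = pca.inverse_transform(pca.fit_transform(X))
scores = anomaly_scores(X, X_reconstructed)   # the off-line point scores 1.0
```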


Scikit Learn Model Evaluation for Normal PCA:

![image.png](attachment:image.png)

Now let’s design a fraud detection solution using kernel PCA, which is a nonlinear form of PCA and is useful if the fraud transactions are not linearly separable from the nonfraud observations.

We need to specify the number of components we would like to generate, the kernel (we will use the RBF kernel), and the gamma (which is set to 1/n_features by default, so 1/30 in our case). We also need to set fit_inverse_transform to true so that we can use the inverse_transform function provided by scikit-learn.

Finally, because kernel PCA is so expensive to train, we will train on just the first two thousand samples in the dataset. This is not ideal, but it is necessary to perform experiments quickly.

We will use this training to transform the entire training set and generate the principal components. Then, we will use the inverse_transform function to recreate the original dimension from the principal components derived by kernel PCA:
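A minimal sketch of these steps with scikit-learn's `KernelPCA`. The random matrix stands in for the scaled feature matrix, and the sizes are shrunk (500 rows, 5 features, fitting on the first 200 rows, 3 components) so the example runs quickly; all of these values are assumptions.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Stand-in for the scaled feature matrix (assumed synthetic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))

kpca = KernelPCA(
    n_components=3,
    kernel="rbf",
    gamma=None,                    # None means 1 / n_features
    fit_inverse_transform=True,    # required before inverse_transform can be used
    random_state=0,
)
kpca.fit(X[:200])                  # train on a subset only, to keep the cost down

X_kpca = kpca.transform(X)                     # components for the entire set
X_reconstructed = kpca.inverse_transform(X_kpca)

# Same anomaly score as for normal PCA: scaled squared reconstruction error.
loss = np.sum((X - X_reconstructed) ** 2, axis=1)
scores = (loss - loss.min()) / (loss.max() - loss.min())
```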

Scikit Learn Model Evaluation for Kernel PCA:

![image.png](attachment:image.png)

PyOD Model Evaluation for PCA (the sum of weighted projected distances to the eigenvector hyperplanes):

Principal component analysis (PCA) can be used in detecting outliers.
PCA is a linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space.

In this procedure, covariance matrix of the data can be decomposed to orthogonal vectors, called eigenvectors, associated with eigenvalues. The eigenvectors with high eigenvalues capture most of the variance in the data.

Therefore, a low dimensional hyperplane constructed by k eigenvectors can capture most of the variance in the data. However, outliers are different from normal data points, which is more obvious on the hyperplane constructed by the eigenvectors with small eigenvalues.

Therefore, outlier scores can be obtained as the sum of the projected distance of a sample on all eigenvectors.
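One way to sketch this idea in NumPy is to project each centered sample onto every eigenvector and weight the squared projection by the inverse eigenvalue, so that deviations along low-variance directions count more. This weighting is an assumption chosen to illustrate the idea and may differ from what PyOD computes internally; `pca_outlier_scores` and the toy data are hypothetical.

```python
import numpy as np

def pca_outlier_scores(X):
    """Sum over eigenvectors of squared projection divided by eigenvalue."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # columns of eigvecs are eigenvectors
    proj = Xc @ eigvecs                      # coordinates along each eigenvector
    return np.sum(proj ** 2 / eigvals, axis=1)

# Strongly correlated 2-D data plus one point that breaks the correlation (assumed).
rng = np.random.default_rng(0)
t = rng.normal(0.0, 3.0, size=300)
X = np.column_stack([t, t + rng.normal(0.0, 0.2, size=300)])
X = np.vstack([X, [[2.0, -2.0]]])

scores = pca_outlier_scores(X)
```

The planted point is unremarkable along the main direction but lies far out along the eigenvector with the small eigenvalue, so it receives the largest score.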

![image.png](attachment:image.png)

### K-Nearest Neighbor

k-nearest neighbors (k-NN) is a simple, non-parametric, lazy learning technique that classifies data based on similarity under a distance metric such as Euclidean, Manhattan, Minkowski, or Hamming distance.

The outlier detector reuses the same neighbor search: for any data point, the distance to its kth nearest neighbor can be viewed as its outlier score. PyOD supports three KNN detectors, largest, mean and median, which use as the outlier score the distance to the kth neighbor, the average distance to the k neighbors, and the median distance to the k neighbors, respectively.
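The three scoring variants can be sketched directly in NumPy. `knn_outlier_scores` is a hypothetical helper written for illustration (brute-force pairwise distances, Euclidean metric), not PyOD's implementation.

```python
import numpy as np

def knn_outlier_scores(X, k=5, method="largest"):
    """Outlier score from the distances to the k nearest neighbors of each point."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # exclude each point itself
    knn_d = np.sort(d, axis=1)[:, :k]        # distances to the k nearest neighbors
    if method == "largest":
        return knn_d[:, -1]                  # distance to the kth neighbor
    if method == "mean":
        return knn_d.mean(axis=1)
    if method == "median":
        return np.median(knn_d, axis=1)
    raise ValueError(f"unknown method: {method}")

# Synthetic data with one planted outlier (assumed example).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)), [[8.0, 8.0]]])

scores = {m: knn_outlier_scores(X, k=5, method=m)
          for m in ("largest", "mean", "median")}
```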

PyOD Model Evaluation for KNN

![image.png](attachment:image.png)

### Histogram-based outlier score

Histogram-Based Outlier Score (HBOS) is an efficient unsupervised method. It assumes feature independence and calculates the degree of outlyingness by building histograms. Although it is only a combination of univariate methods and cannot model dependencies between features, its fast computation makes it attractive for large data sets.

For each single feature (dimension), a univariate histogram is constructed first. If the feature consists of categorical data, the values of each category are simply counted and the relative frequency (height of the histogram) is computed. For numerical features, two different methods can be used: (1) static bin-width histograms or (2) dynamic bin-width histograms.

The first is the standard histogram building technique using k equal-width bins over the value range. The frequency (relative amount) of samples falling into each bin is used as an estimate of the density (height of the bins).

The dynamic bin width is determined as follows: values are sorted first, and then a fixed number of N/k successive values are grouped into a single bin, where N is the total number of instances and k the number of bins. Since the area of a bin in a histogram represents the number of observations, it is the same for all bins in this case. Because the width of the bin is defined by its first and last value and the area is the same for all bins, the height of each individual bin can be computed. This means that bins covering a larger interval of the value range have less height and thus represent a lower density. There is one exception: under certain circumstances, more than N/k data instances might have exactly the same value, for example if the feature is an integer and a long-tail distribution has to be estimated. In this case, the algorithm must allow more than N/k values in the same bin, and the area of these larger bins grows accordingly.

Both methods are offered in HBOS because the distributions of feature values in real-world data differ widely. Especially when value ranges have large gaps (intervals without data instances), the fixed bin-width approach estimates the density poorly (a few bins may contain most of the data). Since anomaly detection tasks usually involve such gaps in the value ranges, because outliers are far away from normal data, the HBOS authors recommend the dynamic width mode, especially if distributions are unknown or long-tailed. The number of bins k also needs to be set. An often used rule of thumb is setting k to the square root of the number of instances N.

Now, for each dimension d, an individual histogram has been computed (regardless of whether it is categorical, fixed-width or dynamic-width), where the height of each single bin represents a density estimate. The histograms are then normalized such that the maximum height is 1.0. This ensures an equal weight of each feature in the outlier score. Finally, the HBOS of every instance p is calculated using the corresponding height of the bins where the instance is located:

## $$HBOS(p)=\sum_{i=1}^{d}\log\left(\frac{1}{hist_i(p)}\right)$$
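A minimal NumPy sketch of the static bin-width variant, scoring each instance as the sum over features of log(1 / normalized bin height). `hbos_scores` is a hypothetical helper and the data is synthetic; PyOD's implementation has additional options that are not reproduced here.

```python
import numpy as np

def hbos_scores(X, n_bins=10):
    """Static bin-width HBOS: sum over features of log(1 / normalized bin height)."""
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        col = X[:, j]
        edges = np.linspace(col.min(), col.max(), n_bins + 1)
        # Bin index of every value; the maximum is clipped into the last bin.
        idx = np.clip(np.searchsorted(edges, col, side="right") - 1, 0, n_bins - 1)
        counts = np.bincount(idx, minlength=n_bins)
        heights = counts / counts.max()        # tallest bin gets height 1.0
        scores += np.log(1.0 / heights[idx])   # sparse bins add a large term
    return scores

# Synthetic data with one planted outlier (assumed example).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(1000, 2)), [[10.0, 10.0]]])

scores = hbos_scores(X, n_bins=10)
```

With equal-width bins the density is proportional to the bin count, so normalizing the counts is equivalent to normalizing the heights.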


PyOD Model Evaluation for HBOS

![image.png](attachment:image.png)

# Hyperparameters

![image.png](attachment:image.png)

# Leaderboard for the algorithms

![image.png](attachment:image.png)

Since XGBoost (RandomOverSampler) has the best ROC AUC score, that model is deployed.

## WEB APPLICATION using FLASK

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

## MySQL Database

![image.png](attachment:image.png)

![image.png](attachment:image.png)

## Logging

DEBUG: Detailed information, typically of interest only when diagnosing problems.

INFO: Confirmation that things are working as expected.

WARNING: An indication that something unexpected happened, or indicative of some problem in the near future (e.g. ‘disk space low’). The software is still working as expected.

ERROR: Due to a more serious problem, the software has not been able to perform some function.

CRITICAL: A serious error, indicating that the program itself may be unable to continue running.
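A small sketch of how these levels can be written to a log file with Python's standard `logging` module. The logger name `loan_app`, the message texts and the log format are assumptions for illustration and are not taken from the deployed application.

```python
import logging

def build_logger(path, name="loan_app"):
    """Return a logger that writes every level from DEBUG upward to a file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s:%(levelname)s:%(message)s"))
    logger.addHandler(handler)
    return logger

if __name__ == "__main__":
    log = build_logger("test.log")
    log.debug("Raw form input received")            # hypothetical messages
    log.info("Prediction served")
    log.warning("Missing optional field, using default")
    log.error("Could not write prediction to the database")
    log.critical("Model file not found, cannot start")
```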

Test.log file

![image.png](attachment:image.png)

## Deployed on Heroku

Link: https://loan-default-prediction-by-kv.herokuapp.com/