# Credit Default Prediction using Supervised and Unsupervised Learning Techniques

Dataset: https://www.kaggle.com/c/home-credit-default-risk/data

Description

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Since our dataset has labels, we will try both supervised and unsupervised approaches.

## Supervised

We build classification models for our supervised classification task using the following machine learning algorithms: <br>
1) Logistic Regression <br>
2) Random Forest <br>
3) Extreme Gradient Boosting (XGBoost)

### Logistic Regression

Logistic regression predicts the probability of an outcome that can take only two values (i.e. a dichotomy). The prediction is based on one or more predictors (numerical and categorical). Linear regression is not appropriate for predicting a binary variable for two reasons: <br>
• A linear regression can predict values outside the acceptable range (e.g. probabilities outside the range 0 to 1). <br>
• Because each observation can take only one of two values, the residuals will not be normally distributed about the fitted line. <br>
On the other hand, a logistic regression produces a logistic curve, which is limited to values between 0 and 1. Logistic regression is similar to linear regression, but the curve is constructed using the natural logarithm of the “odds” of the target variable, rather than the probability. Moreover, the predictors do not have to be normally distributed or have equal variance in each group. <br>
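The first objection to linear regression can be checked on a small example. The sketch below fits an ordinary least-squares line to a binary outcome using only the standard library; the toy data and the helper name `fit_line` are made up for illustration and are not taken from the Home Credit dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up toy data: x is a predictor, y is a binary outcome (1 = default).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

intercept, slope = fit_line(xs, ys)

# Evaluate the fitted line below, inside, and above the observed x range.
linear_predictions = [intercept + slope * x for x in (0, 4.5, 10)]
print(linear_predictions)
```

On this data the fitted line gives a negative value at x = 0 and a value above 1 at x = 10, so neither can be read as a probability.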

![LogReg.png](attachment:LogReg.png)

Image source: https://www.c-sharpcorner.com/article/logistic-regression/

In logistic regression, the constant (b0) moves the curve left and right and the slope (b1) defines the steepness of the curve. By a simple transformation, the logistic regression equation can be written in terms of the odds.

## $$\frac{p}{1-p}=\exp(b_0+b_1 x)$$

Finally, taking the natural log of both sides, we can write the equation in terms of the log-odds (logit), which is a linear function of the predictors. The coefficient (b1) is the amount by which the logit (log-odds) changes with a one-unit change in x.

## $$\ln\left(\frac{p}{1-p}\right)=b_0+b_1 x$$
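The interpretation of b1 can be verified numerically. This is a minimal sketch with hypothetical coefficients (`b0 = -3.0`, `b1 = 0.5`, chosen only for illustration): it checks that a one-unit step in x changes the logit by b1, which is the same as multiplying the odds by exp(b1).

```python
import math

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -3.0, 0.5

def probability(x):
    """Logistic curve: p = 1 / (1 + exp(-(b0 + b1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(p):
    """Odds of the outcome: p / (1 - p)."""
    return p / (1.0 - p)

def logit(p):
    """Log-odds: ln(p / (1 - p))."""
    return math.log(odds(p))

x = 4.0
p_at_x = probability(x)
p_next = probability(x + 1.0)

# The logit is linear in x, so a one-unit step changes it by b1 ...
logit_change = logit(p_next) - logit(p_at_x)
# ... which multiplies the odds by exp(b1).
odds_multiplier = odds(p_next) / odds(p_at_x)

print(logit_change, odds_multiplier, math.exp(b1))
```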

As mentioned before, logistic regression can handle any number of numerical and/or categorical variables.

## $$p=\frac{1}{1+e^{-(b_0+b_1 x_1+b_2 x_2+\dots+b_p x_p)}}$$
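As a rough illustration of the multi-predictor form, the sketch below evaluates the formula for one numerical predictor and one three-level categorical predictor encoded as two 0/1 indicator columns. The coefficients, the encoding, and the function name `predict_probability` are assumptions made for this example, not values fitted to the Home Credit data.

```python
import math

def predict_probability(intercept, coefficients, features):
    """p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bp*xp)))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical model: one numerical predictor plus a three-level categorical
# predictor encoded as two 0/1 indicator columns (level A is the reference).
intercept = -1.0
coefficients = [0.8, 0.5, -0.3]   # [numeric, is_level_b, is_level_c]

applicant_level_a = [1.25, 0, 0]  # reference level: both indicators are 0
applicant_level_b = [1.25, 1, 0]

p_a = predict_probability(intercept, coefficients, applicant_level_a)
p_b = predict_probability(intercept, coefficients, applicant_level_b)
print(p_a, p_b)
```

Because the coefficient on `is_level_b` is positive in this made-up model, the level-B applicant gets a higher predicted probability than the otherwise identical level-A applicant.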

Model Evaluation

![image.png](attachment:image.png)

### Random Forest

Random forests, also known as random decision forests, are a popular ensemble method that can be used to build predictive models for both classification and regression problems. A random forest grows many classification trees. To classify a new object, pass its input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
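The voting step can be sketched in a few lines. The "trees" here are hand-written threshold rules standing in for trained decision trees, and `forest_predict` is a hypothetical helper name; only the majority-vote logic is the point.

```python
from collections import Counter

def forest_predict(trees, x):
    """Return the class receiving the most votes across all trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Stand-in "trees": hand-written threshold rules on a two-feature input.
trees = [
    lambda x: 1 if x[0] > 0.5 else 0,
    lambda x: 1 if x[1] > 0.3 else 0,
    lambda x: 1 if x[0] + x[1] > 1.2 else 0,
]

print(forest_predict(trees, (0.9, 0.1)))  # one vote for class 1, two for class 0
print(forest_predict(trees, (0.9, 0.4)))  # all three trees vote for class 1
```

Using an odd number of trees avoids tied votes in a two-class problem.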

Each tree is grown as follows: <br>
1. If the number of cases in the training set is N, sample N cases at random - but with replacement, from the original data. This sample will be the training set for growing the tree. <br>
2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing. <br>
3. Each tree is grown to the largest extent possible. There is no pruning.
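The two sources of randomness in steps 1 and 2 can be sketched as follows. This is only an illustration of the sampling, not a full tree learner; the helper names and the values of N, M and m are assumptions made for the example.

```python
import random

def bootstrap_indices(n_cases, rng):
    """Step 1: draw n_cases row indices at random, with replacement."""
    return [rng.randrange(n_cases) for _ in range(n_cases)]

def candidate_features(n_features, m, rng):
    """Step 2: pick m distinct feature indices to consider at one node."""
    return rng.sample(range(n_features), m)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
N, M, m = 10, 9, 3      # illustrative sizes only

rows = bootstrap_indices(N, rng)          # training rows for one tree
features = candidate_features(M, m, rng)  # split candidates for one node
print(rows, features)
```

A real implementation would call `candidate_features` again at every node, while `bootstrap_indices` is drawn once per tree.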

In the original paper on random forests, it was shown that the forest error rate depends on two things: <br>
▪ The correlation between any two trees in the forest. Increasing the correlation increases the forest error rate. <br>
▪ The strength of each individual tree in the forest. A tree with a low error rate is a strong classifier. Increasing the strength of the individual trees decreases the forest error rate.

Reducing m reduces both the correlation and the strength. Increasing it increases both. Somewhere in between is an "optimal" range of m, which is usually quite wide. Using the out-of-bag (oob) error rate, a value of m in this range can quickly be found. This is the only adjustable parameter to which random forests are somewhat sensitive. <br>
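The oob error rate relies on the cases that each tree's bootstrap sample leaves out, which serve as a held-out set for that tree. The sketch below (helper name and sizes are assumptions) estimates how large that left-out share is and compares it with the closed-form value (1 - 1/N)^N, which approaches 1/e, roughly 37%, as N grows.

```python
import math
import random

def out_of_bag(n_cases, rng):
    """Return the case indices left out of one bootstrap sample."""
    in_bag = {rng.randrange(n_cases) for _ in range(n_cases)}
    return [i for i in range(n_cases) if i not in in_bag]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
n_cases = 1000

# Average the out-of-bag share over 200 simulated bootstrap samples.
oob_fractions = [len(out_of_bag(n_cases, rng)) / n_cases for _ in range(200)]
mean_oob_fraction = sum(oob_fractions) / len(oob_fractions)

# Probability that a given case is never drawn in n_cases draws.
expected = (1 - 1 / n_cases) ** n_cases
print(mean_oob_fraction, expected, math.exp(-1))
```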

Features of Random Forests <br>
▪ It is among the most accurate general-purpose classifiers, although no algorithm is best on every data set.<br>
▪ It runs efficiently on large databases.<br>
▪ It can handle thousands of input variables without variable deletion.<br>
▪ It gives estimates of which variables are important in the classification.<br>
▪ It generates an internal unbiased estimate of the generalization error as the forest building progresses.<br>
▪ It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.<br>
▪ It has methods for balancing error in data sets with unbalanced class populations.<br>
▪ Generated forests can be saved for future use on other data.<br>
▪ Prototypes are computed that give information about the relation between the variables and the classification.<br>
▪ It computes proximities between pairs of cases that can be used in clustering, locating outliers, or (by scaling) giving interesting views of the data.<br>
▪ The capabilities above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.<br>
▪ It offers an experimental method for detecting variable interactions.<br>

Model Evaluation

![image.png](attachment:image.png)

### Extreme Gradient Boosting

XGBoost is one of the most popular and efficient implementations of the Gradient Boosted Trees algorithm, a supervised learning method that is based on function approximation by optimizing specific loss functions as well as applying several regularization techniques.
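The core idea of gradient boosting can be sketched without any library. The example below is plain gradient boosting with one-split regression stumps and squared-error loss, not XGBoost itself: it omits the regularization mentioned above and all of XGBoost's engineering. The toy data, function names, and parameter values are assumptions made for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that minimizes squared error."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Additive model: start from the mean, then add shrunken stumps."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is the residual.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

def mean_squared_error(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up one-dimensional toy data with at least two distinct x values.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]

baseline = gradient_boost(xs, ys, n_rounds=0)   # predicts the mean only
boosted = gradient_boost(xs, ys, n_rounds=20)

baseline_mse = mean_squared_error(baseline, xs, ys)
boosted_mse = mean_squared_error(boosted, xs, ys)
print(baseline_mse, boosted_mse)
```

Each round fits a stump to the current residuals and adds a shrunken copy of it to the model, so the training error falls as rounds accumulate. The learning rate trades the size of each step against the number of rounds needed.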

Model Evaluation

![image.png](attachment:image.png)