
Risk-Bucketed POD Prediction ML Models

Risk bucketing involves categorizing borrowers by creditworthiness into groups that exhibit similar characteristics. The underlying objective is to obtain homogeneous groups, or clusters, so that credit risk can be estimated more accurately. Failing to distinguish between borrowers with different levels of risk leads to inaccurate predictions, since the model cannot capture the distinct characteristics of each group. By dividing borrowers into groups based on their riskiness, risk bucketing allows for more accurate predictions. Various statistical methods can be used to accomplish this, but here we employ clustering techniques, namely the K-means and DBSCAN algorithms, to produce homogeneous clusters.

I have implemented DBSCAN (Density-Based Spatial Clustering of Applications with Noise), a density-based clustering algorithm that is particularly useful for identifying clusters of arbitrary shape, size, and orientation. Unlike K-means, DBSCAN can handle noise and outliers in the data, and it does not require the user to specify the number of clusters beforehand; instead, it identifies clusters based on the density of data points in each neighborhood. DBSCAN is also more robust than K-means because it does not depend on randomly initialized centroids.
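A minimal sketch of how risk buckets could be produced with scikit-learn's DBSCAN (the feature names and the eps/min_samples values below are illustrative assumptions, not the notebook's exact columns or settings):

```python
# Minimal sketch: bucketing borrowers with DBSCAN (illustrative, not the notebook's exact code).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Hypothetical borrower features.
df = pd.DataFrame({
    "age": [25, 40, 35, 60, 22, 45],
    "credit_amount": [1000, 5000, 3000, 12000, 800, 7000],
    "duration": [12, 24, 18, 48, 6, 36],
})

# DBSCAN is scale-sensitive, so standardize the features before clustering.
X = StandardScaler().fit_transform(df)

# eps and min_samples are tuning choices; a label of -1 marks noise/outliers.
labels = DBSCAN(eps=0.9, min_samples=2).fit_predict(X)
df["risk_bucket"] = labels
print(df)
```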

Notebook: https://github.com/sprasadhpy/Risk-Bucketed--POD-predictiomodels-1-/blob/main/Risk%20Bucketed%20-POD%20Prediction%20ML%20Models.ipynb

Implemented ML Algorithms :

Logistic Regression : Logistic regression is a commonly used classification algorithm in machine learning and data analysis. It is effective for modeling binary outcomes and predicting the likelihood of an event; the algorithm estimates the probability of a binary outcome from predictor variables. (Figure: AUC curve)
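A minimal sketch of a logistic-regression PD model scored with ROC AUC, using synthetic data as a stand-in for one risk bucket:

```python
# Minimal sketch: logistic regression PD model with an ROC-AUC score (illustrative data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic, imbalanced stand-in for one risk bucket's borrowers.
X, y = make_classification(n_samples=500, n_features=4, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = logit.predict_proba(X_test)[:, 1]          # predicted probability of default
print("ROC AUC:", roc_auc_score(y_test, proba))
```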

Bayesian Model : To predict the probability of default, the PyMC3 package is used for Bayesian estimation. Among the several approaches available for Bayesian analysis with PyMC3, the first application employs the MAP (maximum a posteriori) estimate, which uses a representative point of the posterior distribution for efficient modeling. The Bayesian model also features a deterministic variable (p) that depends solely on its parent variables: age, job, credit amount, and duration. This approach enables accurate predictions and a detailed analysis of the probability of default.
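A minimal PyMC3 sketch of the MAP approach described above; the priors, feature preprocessing, and variable names are assumptions for illustration:

```python
# Minimal PyMC3 sketch of a logistic PD model estimated with MAP (illustrative data and priors).
import numpy as np
import pymc3 as pm

# Hypothetical standardized predictors and default flags for one cluster.
n = 200
rng = np.random.default_rng(0)
age, job, credit_amount, duration = (rng.normal(size=n) for _ in range(4))
default = rng.integers(0, 2, size=n)

with pm.Model() as logistic_model:
    beta = pm.Normal("beta", mu=0, sigma=10, shape=4)
    intercept = pm.Normal("intercept", mu=0, sigma=10)
    # Deterministic node p depends only on its parent variables.
    p = pm.Deterministic(
        "p",
        pm.math.sigmoid(intercept + beta[0] * age + beta[1] * job
                        + beta[2] * credit_amount + beta[3] * duration),
    )
    pm.Bernoulli("default", p=p, observed=default)
    map_estimate = pm.find_MAP()   # maximum a posteriori point estimate

print(map_estimate["beta"], map_estimate["intercept"])
```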

The implementation also lets the user perform full Bayesian analysis, plot the trace, and display summary statistics for the trace of each logistic model. It sets the PyMC3 logging level to ERROR and defines two logistic models, logistic_model1 and logistic_model2. It then samples from each model using Metropolis as the step method and generates a trace, which is plotted with az.plot_trace() and summarized with display(az.summary(trace)).
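A minimal sketch of the sampling and trace-inspection step, assuming the logistic_model defined in the previous sketch (logistic_model1 and logistic_model2 follow the same pattern):

```python
# Minimal sketch: Metropolis sampling and trace inspection for the logistic model above.
import logging
import arviz as az
import pymc3 as pm

logging.getLogger("pymc3").setLevel(logging.ERROR)   # silence PyMC3 info messages

with logistic_model:
    trace = pm.sample(2000, step=pm.Metropolis(), tune=1000, return_inferencedata=True)

az.plot_trace(trace, var_names=["beta", "intercept"])
display(az.summary(trace, var_names=["beta", "intercept"]))   # display() is available in notebooks
```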

Support Vector Machine : SVM performs well with high-dimensional data, which makes it a suitable approach for predicting the probability of default in a multivariate setting. To optimize SVM performance and conduct hyperparameter tuning, HalvingRandomSearchCV is used. This approach relies on iterative candidate selection with fewer resources, which saves time while still finding strong configurations. HalvingRandomSearchCV employs successive halving: in the first iteration all candidate parameter combinations are evaluated on a small number of training samples, in the next iteration only the better-scoring candidates are evaluated on a larger number of samples, and this continues until only the top-scoring candidates remain in the final iteration.
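A minimal sketch of SVM tuning with HalvingRandomSearchCV; the parameter distributions and synthetic data are illustrative:

```python
# Minimal sketch: successive-halving random search for an SVM PD classifier.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401  (enables the estimator)
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=0)

param_distributions = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", 0.01, 0.001],
    "kernel": ["rbf", "linear"],
}

# Successive halving: many candidates on few samples first; survivors get more data each round.
search = HalvingRandomSearchCV(
    SVC(probability=True), param_distributions,
    factor=2, random_state=0, n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```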


Random Forest : The random forest classifier can also be used to model the probability of default, and it performs well with large numbers of samples. Using a halving search approach, we determine the best combination of hyperparameters, including n_estimators, criterion, max_features, max_depth, and min_samples_split. Each cluster has its own set of optimal hyperparameters; the selected model is more complex, with a greater depth, and the optimal maximum number of features differs across the clusters.
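A minimal sketch of the random-forest halving search over the hyperparameters listed above (grid values are illustrative):

```python
# Minimal sketch: halving grid search for a random-forest PD classifier.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=800, n_features=10, weights=[0.8, 0.2], random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", "log2"],
    "max_depth": [4, 8, None],
    "min_samples_split": [2, 10],
}

# The resource is the number of training samples: each round keeps the best candidates on more data.
search = HalvingGridSearchCV(RandomForestClassifier(random_state=0), param_grid, factor=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```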


XGBoost : XGBoost is a boosting algorithm that combines multiple decision trees to create a strong ensemble model. The algorithm iteratively improves the performance of the model by adding new trees that focus on the most challenging examples.
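A minimal XGBoost sketch for PD classification; the hyperparameter values are illustrative defaults, not the tuned settings from the notebook:

```python
# Minimal sketch: XGBoost PD classifier scored with ROC AUC (illustrative data and settings).
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=800, n_features=10, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Boosting adds trees sequentially; each new tree focuses on examples the ensemble still gets wrong.
model = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1,
                      eval_metric="logloss", random_state=0)
model.fit(X_train, y_train)
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```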


Neural Network : To set up the NN model, GridSearchCV optimizes the number of hidden layers, the optimization technique, and the learning rate. The MLP implementation exposes several parameters, including the hidden-layer sizes, the optimization technique (solver), and the learning rate. The optimized hyperparameters of the two clusters differ only in the number of neurons in the hidden layers: cluster one has more neurons in the first hidden layer, while cluster two has more in the second hidden layer.
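A minimal sketch of tuning scikit-learn's MLPClassifier with GridSearchCV over hidden-layer size, solver, and initial learning rate (the grid values are assumptions):

```python
# Minimal sketch: grid search over MLP hidden-layer sizes, solver, and learning rate.
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=0)

param_grid = {
    "hidden_layer_sizes": [(32,), (64, 32), (32, 64)],
    "solver": ["adam", "sgd"],
    "learning_rate_init": [0.001, 0.01],
}

search = GridSearchCV(MLPClassifier(max_iter=500, random_state=0), param_grid,
                      scoring="roc_auc", n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```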


KerasClassifier : KerasClassifier enables the use of Keras network architectures (including CNNs and RNNs) for PD estimation, with flexibility in defining the network architecture and optimization algorithm. Hyperparameters such as batch size, number of epochs, and dropout rate can be fine-tuned to the data, while the sigmoid activation function is a natural output choice for a binary classification problem like PD estimation. Deep learning with NNs provides a more complex structure that can capture the data dynamics for better predictive performance.
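A minimal sketch of the KerasClassifier grid search over batch size, epochs, and dropout rate. It assumes the legacy tf.keras scikit-learn wrapper; on newer TensorFlow versions the scikeras package provides the equivalent wrapper:

```python
# Minimal sketch: grid search over batch size, epochs, and dropout rate with KerasClassifier.
# Assumes the legacy tf.keras scikit-learn wrapper (removed in recent TensorFlow releases).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0)

def build_model(dropout_rate=0.2):
    # Small feed-forward network with a sigmoid output for the binary PD target.
    model = Sequential([
        Dense(32, activation="relu", input_shape=(X.shape[1],)),
        Dropout(dropout_rate),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

clf = KerasClassifier(build_fn=build_model, verbose=0)
param_grid = {"batch_size": [10, 100], "epochs": [50, 150], "dropout_rate": [0.2, 0.4]}
search = GridSearchCV(clf, param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)
print(search.best_params_)
```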

Best hyperparameters for the first cluster (DL): {'batch_size': 10, 'dropout_rate': 0.2, 'epochs': 50}; DL ROC AUC = 0.5102

Best hyperparameters for the second cluster (DL): {'batch_size': 100, 'dropout_rate': 0.4, 'epochs': 150}; DL ROC AUC = 0.6711

GAN + TabNet : Source : https://www.sciencedirect.com/science/article/pii/S0957417423000441

Source : https://assets.researchsquare.com/files/rs-724813/v1_covered.pdf?c=1631875380

To predict credit default on the lending dataset, a combination of the TabNet deep learning model and a GAN was used. Because of the small number of defaults in the dataset, the samples are imbalanced, which hurt model performance. To address this, a GAN was employed to generate synthetic samples of bad (defaulting) users, and the TabNet model was then trained on the original and synthetic samples together to predict credit default.
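A minimal sketch of the GAN + TabNet pipeline: synthetic minority-class samples augment the training set before fitting TabNet. The GAN itself is out of scope here; gan_generate() is a hypothetical stand-in for a trained GAN generator, and the data is synthetic:

```python
# Minimal sketch: augment the minority (default) class, then train TabNet on the combined set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from pytorch_tabnet.tab_model import TabNetClassifier

X, y = make_classification(n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

def gan_generate(X_minority, n_samples):
    # Hypothetical stand-in for a GAN generator: here we just jitter real minority rows with noise.
    idx = np.random.default_rng(0).integers(0, len(X_minority), n_samples)
    noise = np.random.default_rng(1).normal(scale=0.05, size=(n_samples, X_minority.shape[1]))
    return X_minority[idx] + noise

X_bad = X_train[y_train == 1]
X_synth = gan_generate(X_bad, n_samples=len(X_train[y_train == 0]) - len(X_bad))
X_aug = np.vstack([X_train, X_synth]).astype(np.float32)
y_aug = np.concatenate([y_train, np.ones(len(X_synth), dtype=int)])

clf = TabNetClassifier(verbose=0)
clf.fit(X_aug, y_aug, eval_set=[(X_test.astype(np.float32), y_test)], max_epochs=50, patience=10)
print("Test accuracy:", (clf.predict(X_test.astype(np.float32)) == y_test).mean())
```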

After applying the combination of TabNet DL model and GAN to the lending dataset to predict credit default, the accuracy in the default clusters increased from 0.4123 (logistic regression) to 0.4854. While this improvement is promising, it is important to test for overfitting, especially since the synthetic samples generated by the GAN may introduce bias into the model.

