zShubh/Credit-Risk-Analysis-Using-Machine-Learning


Credit Risk Analysis

Apply supervised machine learning to a real-world challenge: credit card risk. The models are built and evaluated with Scikit-Learn.

Picture of credit cards

Photo by Avery Evans on Unsplash

Overview

This project builds and evaluates several supervised machine learning models to predict credit risk. Being able to predict credit risk with machine learning algorithms can help banks and financial institutions predict anomalies, reduce risk cases, monitor portfolios, and provide recommendations on what to do in cases of fraud.

Credit risk is an inherently unbalanced classification problem, as good loans easily outnumber risky loans. Using data from LendingClub, a peer-to-peer lending services company, we employ different techniques to train and evaluate models with unbalanced classes ...

Preprocessing steps:

Transformational steps:

  • Identifying and handling the missing values (None)
  • Encoding categorical variables with a mix of Pandas and Scikit-Learn's 'LabelEncoder'
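
The encoding step can be sketched roughly as follows. This is a minimal illustration rather than the repository's actual code: the toy DataFrame, its column names (`home_ownership`, `loan_amnt`, `loan_status`) and its values are assumptions made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
    "loan_amnt": [5000, 12000, 8000, 3000],
    "loan_status": ["low_risk", "high_risk", "low_risk", "low_risk"],
})

# One-hot encode the categorical feature with pandas.
X = pd.get_dummies(df.drop(columns="loan_status"), columns=["home_ownership"])

# Encode the string target as integers with scikit-learn.
# LabelEncoder assigns codes in sorted order of the class names.
encoder = LabelEncoder()
y = encoder.fit_transform(df["loan_status"])
```

`encoder.classes_` keeps the original label names, so integer predictions can be mapped back with `encoder.inverse_transform`.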

Splitting the data set:

  • Feature scaling with StandardScaler
  • Normalization
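
A minimal sketch of splitting and scaling, assuming synthetic data in place of the loan features. The array shapes, the 25% test size and the random seeds are illustrative choices, not values taken from the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix: 200 rows, 3 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
# Imbalanced target: 180 majority-class rows and 20 minority-class rows.
y = np.array([0] * 180 + [1] * 20)

# stratify=y keeps the class ratio the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Fit the scaler on the training data only, then apply it to both sets,
# so that no information from the test set leaks into training.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```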

Implement machine learning models:

  • Logistic Regression
  • Decision Tree
  • Random Forest
  • Support Vector Machine
  • Gradient Boosting for classification
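
As a rough sketch, the five model families listed above could be trained and compared in one loop. The synthetic dataset, the hyperparameters and the choice of balanced accuracy as the comparison metric are all assumptions for illustration; the project's own settings may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 90% / 10%) standing in for the loan data.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "Support Vector Machine": SVC(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
}

# Fit each model and record its balanced accuracy on the held-out set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = balanced_accuracy_score(y_test, model.predict(X_test))
```

Balanced accuracy averages the recall of each class, which makes it less misleading than plain accuracy when one class dominates.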

Models

Use re-sampling to attempt to address class imbalance:

  • Combination Sampling with imblearn's SMOTEENN

Evaluate the performance of machine learning models:

  • Model evaluation with metrics calculated from the confusion matrix.
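
To illustrate how metrics follow from the confusion matrix, here is a small sketch with hand-written labels. The label vectors are made up, with 1 standing for a high-risk loan and 0 for a low-risk loan (an assumed encoding).

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 1 = high-risk loan, 0 = low-risk loan (assumed encoding).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Recall: share of truly high-risk loans that the model flagged.
recall = tp / (tp + fn)
# Precision: share of flagged loans that were truly high-risk.
precision = tp / (tp + fp)

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"recall={recall:.3f} precision={precision:.3f}")
```

A false negative here is a high-risk loan predicted as low-risk, which is why recall on the high-risk class matters so much for this problem.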

Consider using Scikit-Learn pipelines when loading data from an API or flat file. A pipeline is a single object that carries out all of your preprocessing work (feature selection, imputation, etc.).
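
A minimal pipeline sketch, assuming synthetic data with a few missing values. The step names (`impute`, `scale`, `model`) and the choice of median imputation and logistic regression are illustrative, not the project's confirmed setup.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends only on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)
# Knock out every tenth value in the second feature to simulate missing data.
X[::10, 1] = np.nan

# Each step is fitted in order; the final step is the estimator.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

pipe.fit(X, y)
preds = pipe.predict(X)
```

Because the preprocessing lives inside the pipeline, the same fitted object can be passed to `cross_val_score` or applied to new data without repeating the steps by hand.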

Understanding Machine Learning

Machine Learning -> Supervised Learning -> Classification OR Regression:

  • Classification predicts the CATEGORY that data belongs to.
  • Regression predicts a NUMERICAL value based on previously observed data.

Resources

  • Software: Visual Studio Code, Jupyter Lab
  • Languages: Python
  • Libraries: numpy, pandas, matplotlib, seaborn, scikit-learn
  • Data Sources: Loan Data: ../data/raw/LoanStats_2019Q1.csv

Analysis

Two evaluation methods: ensemble learning and re-sampling

The Easy Ensemble AdaBoost Classifier performs best with our preprocessing steps and dataset; we would therefore move forward with this estimator for further predictions.

Oversampling with SMOTE yields the highest recall score for predicting both low-risk and high-risk loan statuses. We would put forward that this is the best model considering the financial cost associated with false negatives.

Naive Random Oversampler

SMOTE

Cluster Centroids

SMOTEENN

Balanced Random Forest Classifier

Easy Ensemble AdaBoost Classifier

Todo Checklist

A checklist of what I would still like to finish:

  • Use Scikit Learn's Pipelines
