surabhi0901/comprehensive_banking_analytics

Comprehensive Banking Analytics Project Overview

Introduction

This project analyzes banking data through Exploratory Data Analysis (EDA), customer segmentation, credit risk assessment, and performance prediction. The goal is to identify the key features influencing Credit Score and to build predictive models that assess credit risk and performance.

Prerequisites

Make sure you have the following libraries installed:

  • pandas
  • numpy
  • seaborn
  • scikit-learn
  • xgboost
  • imbalanced-learn
  • matplotlib

You can install these libraries using pip:

pip install pandas numpy seaborn scikit-learn xgboost imbalanced-learn matplotlib

Data

The dataset used in this project is a CSV file containing banking information. The data is loaded using pandas, and several preprocessing steps prepare it for analysis and modeling.
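As a minimal sketch of the loading and first-look steps, the snippet below reads a CSV with pandas and reports its shape, dtypes, and null counts. The inline sample, its column names, and the `load_data` helper are hypothetical stand-ins, since the actual file name and schema are not stated here.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV; column names are illustrative only.
SAMPLE_CSV = """Age,Annual_Income,Credit_Mix,Credit_Score
23,19114.12,Good,Good
35,,Standard,Standard
41,34847.84,Bad,Poor
29,143162.64,Good,Good
"""


def load_data(source):
    """Read a CSV from a file path or file-like object into a DataFrame."""
    return pd.read_csv(source)


# For a real run, pass the path to the dataset instead of the inline sample.
df = load_data(io.StringIO(SAMPLE_CSV))

print(df.shape)                  # (rows, columns)
df.info()                        # dtypes and non-null counts per column
null_counts = df.isnull().sum()  # missing values per column
print(null_counts)
```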

Project Structure

  1. Import Libraries: Importing necessary libraries for data manipulation, visualization, and modeling.
  2. Data Loading: Reading the data from a CSV file.
  3. Data Exploration: Initial exploration of the data, checking for null values, and basic data information.
  4. Data Preprocessing: Converting categorical columns to numerical values and handling missing values.
  5. Exploratory Data Analysis (EDA):
    • Histograms of numerical columns.
    • Boxplots of numerical columns.
    • Correlation heatmap.
    • Pairplot of numerical columns.
    • Distribution of the target variable (Credit Score).
    • Countplots to analyze the effect of different features on Credit Score.
  6. Feature Importance: Using a Random Forest Classifier to identify important features.
  7. Customer Segmentation: Using K-Means clustering for customer segmentation based on selected features.
  8. Credit Risk Assessment: Building and evaluating various classification models to predict Credit Score.
  9. Performance Prediction: Building and evaluating various regression models to predict Credit Score.
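The preprocessing step (item 4 above) can be sketched as follows. This is one possible approach under stated assumptions, not necessarily the one used in this repository: numeric gaps are filled with the column median, categorical gaps with the most frequent value, and categories are mapped to integer codes. The toy frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names and labels are illustrative only.
df = pd.DataFrame({
    "Annual_Income": [19114.12, np.nan, 34847.84, 143162.64],
    "Credit_Mix": ["Good", "Standard", None, "Good"],
    "Credit_Score": ["Good", "Standard", "Poor", "Good"],
})


def preprocess(frame):
    """Fill missing values and convert categorical columns to integer codes."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Assumption: median imputation for numeric columns.
            out[col] = out[col].fillna(out[col].median())
        else:
            # Assumption: most-frequent-value imputation, then integer codes
            # assigned in sorted order of the category labels.
            out[col] = out[col].fillna(out[col].mode().iloc[0])
            out[col] = out[col].astype("category").cat.codes
    return out


clean = preprocess(df)
print(clean)
```

Integer codes impose an arbitrary order on the categories, which tree-based models tolerate well; for linear or distance-based models, one-hot encoding may be a better fit.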

Results

Exploratory Data Analysis (EDA)

  • Histograms and boxplots provided insights into the distribution and outliers of numerical features.
  • The correlation heatmap helped identify relationships between features.
  • Pairplots visualized the relationships between pairs of features.
  • Countplots showed the distribution of Credit Scores based on various features.
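A minimal sketch of the EDA plots listed above, run on synthetic data. To keep the example small it uses only pandas and matplotlib rather than seaborn, and the column names are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the banking data; column names are illustrative.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Num_Bank_Accounts": rng.integers(0, 10, n),
    "Num_of_Loan": rng.integers(0, 8, n),
    "Annual_Income": rng.normal(50_000, 15_000, n),
    "Credit_Score": rng.integers(0, 3, n),
})
numeric_cols = list(df.select_dtypes(include="number").columns)

# Histograms of the numerical columns.
df[numeric_cols].hist(bins=20, figsize=(10, 6))
plt.tight_layout()
plt.close("all")  # call plt.savefig(...) before closing to keep the figure

# Correlation heatmap.
corr = df[numeric_cols].corr()
fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(numeric_cols)))
ax.set_xticklabels(numeric_cols, rotation=45, ha="right")
ax.set_yticks(range(len(numeric_cols)))
ax.set_yticklabels(numeric_cols)
fig.colorbar(im, ax=ax)
fig.tight_layout()
plt.close(fig)

# Distribution of the target variable.
target_counts = df["Credit_Score"].value_counts().sort_index()
print(target_counts)
```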

Feature Importance

  • Random Forest Classifier identified the most important features influencing the Credit Score.
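Ranking features with a Random Forest might look like the sketch below. The data is synthetic and the target is deliberately derived from a single column, so the ranking is known in advance; the column names and hyperparameters are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features; column names are illustrative only.
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "Num_Credit_Inquiries": rng.integers(0, 15, n),
    "Num_of_Loan": rng.integers(0, 8, n),
    "Noise": rng.normal(size=n),
})
# Demo-only target: three classes derived from Num_Credit_Inquiries alone.
y = pd.cut(
    X["Num_Credit_Inquiries"], bins=[-1, 4, 9, 15], labels=[0, 1, 2]
).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Impurity-based importances sum to 1; higher means more influential.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```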

Customer Segmentation

  • K-Means clustering segmented customers into distinct groups based on selected features, with the number of clusters chosen using the elbow method.
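The sketch below illustrates K-Means segmentation with an elbow check on synthetic two-feature data. The three blobs, the feature scaling step, and the choice of k = 3 are assumptions for the example, not the values used in this project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Three well-separated synthetic blobs standing in for customer features.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [8.0, 8.0], [0.0, 8.0]])
X = np.vstack([c + rng.normal(scale=0.7, size=(100, 2)) for c in centers])

# K-Means is distance based, so features are put on a common scale first.
X_scaled = StandardScaler().fit_transform(X)

# Elbow method: record the within-cluster sum of squares for each k.
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = km.inertia_
print(inertias)

# Fit the final model with the k where the curve flattens (3 for this data).
final_model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = final_model.fit_predict(X_scaled)
print(np.bincount(labels))
```

Plotting `inertias` against k gives the elbow curve; the bend marks the point after which extra clusters add little.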

Credit Risk Assessment

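  • Six classification models were trained to predict Credit Score: Logistic Regression, Decision Tree, Random Forest, XGBoost, KNN, and Naive Bayes.
  • Accuracy and a classification report (precision, recall, F1-score, and support per class) were generated for each model.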
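A training and evaluation loop of this kind can be sketched as below. The data comes from scikit-learn's `make_classification`, and the split ratio and hyperparameters are assumptions. XGBoost is left out to keep the example dependency-free; an `xgboost.XGBClassifier` could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the Credit Score target.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)  # evaluate on held-out data only
    results[name] = accuracy_score(y_test, y_pred)
    print(f"Classifier: {name}")
    print(f"Accuracy: {results[name]}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
```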

Performance Prediction

  • Six regression models were trained to predict Credit Score: Linear Regression, Decision Tree, Random Forest, Gradient Boosting, XGBoost, and KNeighbors.
  • RMSE and R2 score were reported for each model.
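The regression metrics can be computed as in the sketch below, which fits several regressors to synthetic data from `make_regression` and reports RMSE and R2 on a held-out split. The data, split ratio, and hyperparameters are assumptions, and XGBoost is again omitted to avoid the extra dependency.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for a numeric Credit Score target.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

regressors = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=42),
    "Random Forest Regressor": RandomForestRegressor(random_state=42),
    "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=42),
    "KNeighbors Regressor": KNeighborsRegressor(),
}

results = {}
for name, reg in regressors.items():
    reg.fit(X_train, y_train)
    y_pred = reg.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))  # RMSE = sqrt(MSE)
    r2 = float(r2_score(y_test, y_pred))
    results[name] = {"rmse": rmse, "r2": r2}
    print(f"Model: {name}")
    print(f"Root Mean Squared Error (RMSE): {rmse}")
    print(f"R2 Score: {r2}")
```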

Conclusion

This project demonstrated a comprehensive approach to analyzing banking data. By performing EDA, identifying important features, segmenting customers, and building predictive models, we gained valuable insights into the factors influencing Credit Scores and the overall performance of customers. The results can help in making informed decisions for credit risk assessment and customer segmentation.

Future Work

  • Fine-tuning models with hyperparameter optimization.
  • Exploring additional feature engineering techniques.
  • Implementing more advanced machine learning models.
  • Integrating external data sources for enhanced predictions.

License

This project is licensed under the MIT License.

Images of the graphs

Histogram of all the relevant numerical columns

[Figure: Histogram]

Boxplots of all the relevant numerical columns

[Figure: Boxplots]

Correlation Heatmap

[Figure: Correlation Heatmap]

Pairplot

[Figure: Pairplot]

Visualization of the target column distribution

[Figure: Credit Score Distribution]

Barplots

[Figure: Credit Score based on Num of Bank Acc]

[Figure: Credit Score based on Num of Bank Credit Cards]

[Figure: Credit Score based on Num of Loans]

[Figure: Credit Score based on Num of Bank Credit Inquiries]

[Figure: Credit Score based on Credit Mix]

Feature importance to help with feature selection

[Figure: Random Forest Feature Importances]

Customer Segmentation

Graph showing elbow method to get the k value

[Figure: Graph for elbow method]

Customer Segmentation by K-means Clustering

[Figure: Customer Segmentation by K-means Clustering]

Credit Risk Assessment

Classifier: Logistic Regression
Accuracy: 0.93325
Classification Report:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5874
           1       0.91      0.96      0.94     10599
           2       0.87      0.73      0.79      3527

    accuracy                           0.93     20000
   macro avg       0.93      0.90      0.91     20000
weighted avg       0.93      0.93      0.93     20000


Classifier: Decision Tree
Accuracy: 1.0
Classification Report:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5874
           1       1.00      1.00      1.00     10599
           2       1.00      1.00      1.00      3527

    accuracy                           1.00     20000
   macro avg       1.00      1.00      1.00     20000
weighted avg       1.00      1.00      1.00     20000


Classifier: Random Forest
Accuracy: 1.0
Classification Report:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5874
           1       1.00      1.00      1.00     10599
           2       1.00      1.00      1.00      3527

    accuracy                           1.00     20000
   macro avg       1.00      1.00      1.00     20000
weighted avg       1.00      1.00      1.00     20000


Classifier: XGBoost
Accuracy: 1.0
Classification Report:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5874
           1       1.00      1.00      1.00     10599
           2       1.00      1.00      1.00      3527

    accuracy                           1.00     20000
   macro avg       1.00      1.00      1.00     20000
weighted avg       1.00      1.00      1.00     20000


Classifier: KNN
Accuracy: 0.7768
Classification Report:

              precision    recall  f1-score   support

           0       0.76      0.82      0.79      5874
           1       0.80      0.78      0.79     10599
           2       0.72      0.69      0.71      3527

    accuracy                           0.78     20000
   macro avg       0.76      0.76      0.76     20000
weighted avg       0.78      0.78      0.78     20000


Classifier: Naive Bayes
Accuracy: 0.6292
Classification Report:

              precision    recall  f1-score   support

           0       0.64      0.73      0.68      5874
           1       0.81      0.51      0.62     10599
           2       0.44      0.83      0.57      3527

    accuracy                           0.63     20000
   macro avg       0.63      0.69      0.63     20000
weighted avg       0.69      0.63      0.63     20000


Performance Prediction


Model: Linear Regression
Root Mean Squared Error (RMSE): 2.8070534180026826e-15
R2 Score: 1.0

Model: Decision Tree Regressor
Root Mean Squared Error (RMSE): 0.0
R2 Score: 1.0

Model: Random Forest Regressor
Root Mean Squared Error (RMSE): 0.0
R2 Score: 1.0

Model: Gradient Boosting Regressor
Root Mean Squared Error (RMSE): 1.794278735749497e-05
R2 Score: 0.999999999294415

Model: XGBoost Regressor
Root Mean Squared Error (RMSE): 3.329686164276412e-06
R2 Score: 0.9999999999757017

Model: KNeighbors Regressor
Root Mean Squared Error (RMSE): 0.4220118481749061
R2 Score: 0.6096817763207161

