This project aims to provide a comprehensive analysis of banking data, including Exploratory Data Analysis (EDA), customer segmentation, credit risk assessment, and performance prediction. The goal is to gain insights into the data, identify key features influencing credit scores, and build predictive models to assess credit risk and performance.
Make sure you have the following libraries installed:
- pandas
- numpy
- seaborn
- scikit-learn
- xgboost
- imbalanced-learn
- matplotlib
You can install these libraries using pip:
`pip install pandas numpy seaborn scikit-learn xgboost imbalanced-learn matplotlib`
The dataset used in this project is a CSV file containing banking information. The data is loaded with pandas, and various preprocessing steps are performed to prepare it for analysis and modeling.
- Import Libraries: Importing necessary libraries for data manipulation, visualization, and modeling.
- Data Loading: Reading the data from a CSV file.
- Data Exploration: Initial exploration of the data, checking for null values, and basic data information.
- Data Preprocessing: Converting categorical columns to numerical values and handling missing values.
- Exploratory Data Analysis (EDA):
  - Histograms of numerical columns.
  - Boxplots of numerical columns.
  - Correlation heatmap.
  - Pairplot of numerical columns.
  - Distribution of the target variable (Credit Score).
  - Countplots to analyze the effect of different features on Credit Score.
- Feature Importance: Using a Random Forest Classifier to identify important features.
- Customer Segmentation: Using K-Means clustering for customer segmentation based on selected features.
- Credit Risk Assessment: Building and evaluating various classification models to predict Credit Score.
- Performance Prediction: Building and evaluating various regression models to predict Credit Score as a numeric target.
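The loading and preprocessing steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the tiny inline DataFrame stands in for the banking CSV (which would be read with `pd.read_csv`), and the column names `Age` and `Occupation` are placeholders.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values and encode categorical columns as integers."""
    df = df.copy()
    # Fill missing numeric values with the column median
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    # Convert categorical (object) columns to numeric codes
    for col in df.select_dtypes(include="object").columns:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    return df

# Tiny illustrative frame; the real data comes from the project's CSV
raw = pd.DataFrame({"Age": [25, None, 40],
                    "Occupation": ["Doctor", "Teacher", "Doctor"]})
clean = preprocess(raw)
print(clean.dtypes)
```

Median imputation is one common choice here; the project may instead drop rows or use mode imputation for categoricals.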
- Histograms and boxplots provided insights into the distribution and outliers of numerical features.
- The correlation heatmap helped identify relationships between features.
- Pairplots visualized the relationships between pairs of features.
- Countplots showed the distribution of Credit Scores based on various features.
- Random Forest Classifier identified the most important features influencing the Credit Score.
- K-means clustering segmented customers into distinct groups based on selected features.
- Various classification models were trained to predict Credit Score.
- Performance metrics like accuracy and classification reports were generated for each model.
- Various regression models were trained to predict Credit Score.
- Performance metrics like RMSE and R2 score were generated for each model.
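The feature-importance step can be sketched with scikit-learn's `RandomForestClassifier`. The synthetic data and column names below are placeholders, not the project's actual features; the target is constructed to depend mostly on one column so it ranks highest.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
# Synthetic stand-in for the preprocessed banking data (placeholder columns)
X = pd.DataFrame({
    "Annual_Income": rng.normal(50_000, 15_000, 500),
    "Num_of_Loan": rng.integers(0, 10, 500),
    "Outstanding_Debt": rng.normal(1_000, 300, 500),
})
# The label is a function of Outstanding_Debt, so it should dominate
y = (X["Outstanding_Debt"] > 1_000).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
# feature_importances_ sums to 1; higher means more influential
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```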
This project demonstrated a comprehensive approach to analyzing banking data. By performing EDA, identifying important features, segmenting customers, and building predictive models, we gained valuable insights into the factors influencing Credit Scores and the overall performance of customers. The results can help in making informed decisions for credit risk assessment and customer segmentation.
- Fine-tuning models with hyperparameter optimization.
- Exploring additional feature engineering techniques.
- Implementing more advanced machine learning models.
- Integrating external data sources for enhanced predictions.
This project is licensed under the MIT License.
Histograms of all the relevant numerical columns
Boxplots of all the relevant numerical columns
Correlation Heatmap
Pairplot
Visualization of the target column distribution
Barplots
Feature importance to help with feature selection
Graph showing elbow method to get the k value
Customer Segmentation by K-means Clustering
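The elbow method and K-means segmentation can be sketched as follows. The two synthetic features stand in for the project's selected columns, and k = 3 is chosen because the toy data is generated from three groups; the real elbow would be read off the project's own inertia plot.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Three synthetic customer groups (placeholder for the selected features)
X = np.vstack([
    rng.normal([0, 0], 0.5, (100, 2)),
    rng.normal([5, 5], 0.5, (100, 2)),
    rng.normal([0, 5], 0.5, (100, 2)),
])
X = StandardScaler().fit_transform(X)

# Elbow method: plot inertia (within-cluster sum of squares) for k = 1..9
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 10)]
plt.plot(range(1, 10), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("inertia")
plt.savefig("elbow.png")

# Fit the final model at the elbow and label each customer
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))
```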
Classifier: Logistic Regression (Accuracy: 0.93325)

| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 5874 |
| 1 | 0.91 | 0.96 | 0.94 | 10599 |
| 2 | 0.87 | 0.73 | 0.79 | 3527 |
| accuracy | | | 0.93 | 20000 |
| macro avg | 0.93 | 0.90 | 0.91 | 20000 |
| weighted avg | 0.93 | 0.93 | 0.93 | 20000 |

Classifier: Decision Tree (Accuracy: 1.0)

| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 5874 |
| 1 | 1.00 | 1.00 | 1.00 | 10599 |
| 2 | 1.00 | 1.00 | 1.00 | 3527 |
| accuracy | | | 1.00 | 20000 |
| macro avg | 1.00 | 1.00 | 1.00 | 20000 |
| weighted avg | 1.00 | 1.00 | 1.00 | 20000 |

Classifier: Random Forest (Accuracy: 1.0)

| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 5874 |
| 1 | 1.00 | 1.00 | 1.00 | 10599 |
| 2 | 1.00 | 1.00 | 1.00 | 3527 |
| accuracy | | | 1.00 | 20000 |
| macro avg | 1.00 | 1.00 | 1.00 | 20000 |
| weighted avg | 1.00 | 1.00 | 1.00 | 20000 |

Classifier: XGBoost (Accuracy: 1.0)

| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 5874 |
| 1 | 1.00 | 1.00 | 1.00 | 10599 |
| 2 | 1.00 | 1.00 | 1.00 | 3527 |
| accuracy | | | 1.00 | 20000 |
| macro avg | 1.00 | 1.00 | 1.00 | 20000 |
| weighted avg | 1.00 | 1.00 | 1.00 | 20000 |

Classifier: KNN (Accuracy: 0.7768)

| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.76 | 0.82 | 0.79 | 5874 |
| 1 | 0.80 | 0.78 | 0.79 | 10599 |
| 2 | 0.72 | 0.69 | 0.71 | 3527 |
| accuracy | | | 0.78 | 20000 |
| macro avg | 0.76 | 0.76 | 0.76 | 20000 |
| weighted avg | 0.78 | 0.78 | 0.78 | 20000 |

Classifier: Naive Bayes (Accuracy: 0.6292)

| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.64 | 0.73 | 0.68 | 5874 |
| 1 | 0.81 | 0.51 | 0.62 | 10599 |
| 2 | 0.44 | 0.83 | 0.57 | 3527 |
| accuracy | | | 0.63 | 20000 |
| macro avg | 0.63 | 0.69 | 0.63 | 20000 |
| weighted avg | 0.69 | 0.63 | 0.63 | 20000 |
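The classification comparison above follows a common scikit-learn pattern: split the data, loop over classifiers, and print the accuracy plus a classification report for each. This sketch uses a synthetic three-class dataset in place of the banking features and includes only a subset of the models for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class stand-in for the credit-score target
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
}
accuracies = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracies[name] = accuracy_score(y_test, y_pred)
    print(f"Classifier: {name}  Accuracy: {accuracies[name]:.4f}")
    print(classification_report(y_test, y_pred))
```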
| Model | RMSE | R2 Score |
|---|---|---|
| Linear Regression | 2.8070534180026826e-15 | 1.0 |
| Decision Tree Regressor | 0.0 | 1.0 |
| Random Forest Regressor | 0.0 | 1.0 |
| Gradient Boosting Regressor | 1.794278735749497e-05 | 0.999999999294415 |
| XGBoost Regressor | 3.329686164276412e-06 | 0.9999999999757017 |
| KNeighbors Regressor | 0.4220118481749061 | 0.6096817763207161 |
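The regression comparison follows the same loop pattern, reporting RMSE and R2 for each model. Synthetic data again stands in for the banking features, and only a subset of the regressors is shown.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the banking features and numeric target
X, y = make_regression(n_samples=1000, n_features=8, noise=5.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(random_state=42),
    "KNeighbors Regressor": KNeighborsRegressor(),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE
    r2 = r2_score(y_test, y_pred)
    results[name] = (rmse, r2)
    print(f"Model: {name}  RMSE: {rmse:.4f}  R2: {r2:.4f}")
```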