End-to-end ML pipeline to predict Good vs Bad credit risk from structured applicant data (German Credit). Includes clean preprocessing, statistical screening, model comparison, hyperparameter tuning, and final evaluation with visualizations.
Why this matters: in credit underwriting, the most expensive mistake is approving a Bad Risk. This repo shows how to build, evaluate, and tune decision thresholds with that business constraint in mind.
- Clean data mapping from coded categories (e.g., `A11`, `A34`) to human-readable labels.
- EDA visuals: class-split category distributions and boxplots for numeric features.
- Statistical screening: Chi-square (categoricals) & one-way ANOVA (numerics) to pick signal-bearing features.
- Feature engineering: one-hot encoding; optional PCA for dimensionality reduction.
- Model zoo: Logistic Regression, Random Forest, SVM, Gradient Boosting, XGBoost, LightGBM, Extra Trees, AdaBoost.
- Tuning: `GridSearchCV` on the top model (LightGBM in the notebook).
- Evaluation: F1, ROC AUC, confusion matrix, classification report, ROC curve; feature importances (if supported).
- Risk-aware tips: class weights & threshold tuning to reduce Bad→Good errors.
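The class-weight lever mentioned above can be sketched as follows (toy data stands in for the credit set, and the 5:1 weighting is an illustrative assumption, not a tuned value):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the credit data: 1 = Good, 0 = Bad (the costly class)
X, y = make_classification(n_samples=1000, weights=[0.3, 0.7], random_state=42)

# Penalize mistakes on Bad (class 0) five times as heavily as on Good,
# pushing the model away from costly Bad -> Good errors
clf = LogisticRegression(class_weight={0: 5.0, 1: 1.0}, max_iter=1000)
clf.fit(X, y)
```

Most scikit-learn classifiers (and LightGBM/XGBoost) accept a similar `class_weight` or `scale_pos_weight`-style knob, so the same idea carries over to the other models in the zoo.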
```text
.
├── Credit_Risk_Analysis_German_Credit.ipynb   # Main notebook (end-to-end)
├── data/
│   └── german.data                            # Raw data (see Dataset note below)
├── reports/                                   # (optional) saved figures/metrics
├── src/                                       # (optional) if you factor out utilities
└── README.md
```
The notebook path may differ depending on where you saved it. Adjust as needed.
- Name: German Credit
- Size: 1,000 rows; Features: 20 (7 numeric, 13 categorical); Target: Good(1) / Bad(2).
- Note: The raw file is typically named `german.data`. If licensing prevents bundling it, download it yourself (commonly available via the UCI Machine Learning Repository) and place it under `./data/german.data`.
The notebook reads:

```python
df = pd.read_csv("german.data", sep=" ", header=None)
```

- Python: 3.9+
- Libraries: `numpy`, `pandas`, `matplotlib`, `seaborn`, `scikit-learn`, `scipy`, `xgboost`, `lightgbm`, `jupyter`
Quick setup:
```bash
# (Recommended) create a virtual env
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

# Install deps
pip install -U pip
pip install numpy pandas matplotlib seaborn scikit-learn scipy xgboost lightgbm jupyter
```

- Place the data
  - Put `german.data` under `./data/` or next to the notebook. If you change the path, update the `pd.read_csv(...)` line.
- Launch Jupyter
```bash
jupyter notebook   # or: jupyter lab
```

- Open `Credit_Risk_Analysis_German_Credit.ipynb` and run all cells, top to bottom.
- Preprocessing
- Map coded categoricals (e.g., checking account, credit history, purpose, etc.) to readable labels.
- Create one-hot dummies for selected categorical features.
  - Map target: `Good Risk → 1`, `Bad Risk → 0`.
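These three steps can be sketched on a toy slice of the data (the code/label mapping shown is a small illustrative subset of the dataset's dictionary, not the full one used in the notebook):

```python
import pandas as pd

# Toy slice of the raw data; real rows have 20 features plus the target,
# which german.data codes as 1 = Good, 2 = Bad
df = pd.DataFrame({
    "checking": ["A11", "A12", "A14", "A11"],
    "duration": [6, 48, 12, 24],
    "target": [1, 2, 1, 2],
})

# 1) Map coded categories to readable labels
checking_map = {"A11": "< 0 DM", "A12": "0-200 DM", "A14": "no account"}
df["checking"] = df["checking"].map(checking_map)

# 2) One-hot encode the categorical feature
df = pd.get_dummies(df, columns=["checking"], prefix="checking")

# 3) Remap the target: Good Risk -> 1, Bad Risk -> 0
df["target"] = (df["target"] == 1).astype(int)
print(df["target"].tolist())  # [1, 0, 1, 0]
```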
- EDA
- Category distributions split by class (pyramid-style bar charts).
- Boxplots for numeric features by class (median, spread, outliers).
- Univariate screening
  - Chi-square for categoricals → association with class.
- ANOVA for numerics → mean differences across classes.
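Both screens have one-line implementations in SciPy. A minimal sketch on synthetic data (column names here are placeholders for the real features):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, f_oneway

rng = np.random.default_rng(0)
# Toy data: one categorical and one numeric feature plus a binary class
df = pd.DataFrame({
    "purpose": rng.choice(["car", "radio/TV", "education"], size=300),
    "credit_amount": rng.normal(3000, 1000, size=300),
    "risk": rng.choice([0, 1], size=300),
})

# Chi-square: association between a categorical feature and the class
table = pd.crosstab(df["purpose"], df["risk"])
chi2, p_chi, dof, _ = chi2_contingency(table)

# One-way ANOVA: mean difference of a numeric feature across classes
groups = [g["credit_amount"].values for _, g in df.groupby("risk")]
f_stat, p_anova = f_oneway(*groups)

print(p_chi, p_anova)  # keep features with small p-values
```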
- Dimensionality reduction (optional)
- PCA to 16 components (or set a variance target).
- Note: tree/boosting models usually don’t need PCA.
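Either variant (fixed component count or variance target) is one argument to scikit-learn's `PCA`; a sketch on a random stand-in matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))  # stand-in for the one-hot feature matrix

# Either a fixed component count, as in the notebook...
pca_fixed = PCA(n_components=16).fit(X)

# ...or a variance target: keep enough components for 95% explained variance
pca_var = PCA(n_components=0.95).fit(X)
print(pca_var.n_components_)
```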
- Modeling & tuning
- Compare 8+ models on F1 (stratified train/test split).
  - `GridSearchCV` on the best model (LightGBM in this project).
- Final evaluation
  - F1 (positive class defaults to `Good=1`), ROC AUC.
  - Confusion matrix & classification report.
  - ROC curve; feature importance if available.
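The evaluation step boils down to a handful of scikit-learn metric calls. A self-contained sketch (a toy `RandomForestClassifier` stands in for the tuned model):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

best_model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
y_pred = best_model.predict(X_te)
y_prob = best_model.predict_proba(X_te)[:, 1]

f1 = f1_score(y_te, y_pred)          # positive class = 1 (Good)
auc = roc_auc_score(y_te, y_prob)    # threshold-independent separability
print("F1:", f1, "ROC AUC:", auc)
print(confusion_matrix(y_te, y_pred))
print(classification_report(y_te, y_pred))
```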
- F1 summarizes precision/recall for the positive class.
- ROC AUC measures separability (threshold-independent).
- Confusion matrix highlights which mistakes happen:
- The costliest in credit is predicting Good when actually Bad.
- Tip: If your goal is catching Bad applicants, either:
  - Flip the positive class (treat Bad = 1), or
  - Keep `Good = 1` but raise the decision threshold to reduce false positives for Good.
Threshold tuning example:

```python
thr = 0.6  # try a 0.5 → 0.8 grid
y_pred_custom = (best_model.predict_proba(X_test)[:, 1] >= thr).astype(int)
```

- Stratified train/test split with a fixed `random_state`.
- `GridSearchCV` with a specified CV and scoring.
- For full reproducibility, pin exact library versions via `requirements.txt` or `pip-tools`/`poetry`.
- Explainability: SHAP on the best non-PCA model (human-readable features).
- Cost-sensitive learning: class weights, custom loss aligned to cost matrix.
- Calibration: Platt or isotonic for better probability estimates.
- Production: export a `Pipeline` (preprocess + model) and serve via FastAPI; add monitoring & drift checks.
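Of the items above, probability calibration has a compact sketch in scikit-learn (toy data; `LinearSVC` stands in here for any uncalibrated base model):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, random_state=42)

# method="sigmoid" is Platt scaling; method="isotonic" is isotonic regression
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
calibrated.fit(X, y)
probs = calibrated.predict_proba(X)[:, 1]
print(probs.min(), probs.max())
```

Well-calibrated probabilities matter here because the threshold-tuning tip above only works if `predict_proba` outputs are meaningful as probabilities.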
- Code: choose a license (e.g., MIT) and add `LICENSE`.
- Data: respect the dataset's original terms. If not redistributing, add a small fetch script or instructions to download it.
If you have questions or want to collaborate, feel free to open an issue or reach out at bishankaseal@gmail.com.