Use historical loan application data to predict whether or not an applicant will be able to repay a loan.
Expected output: a predicted probability of default, between 0 and 1.
The dataset is provided by Home Credit.
- ✔️ Exploratory Data Analysis (EDA);
- ✔️ Data cleaning;
- ✔️ Feature engineering;
- ✔️ Handling of imbalanced classes;
- ✔️ Model training: Naïve Bayes, Logistic Regression, Stochastic Gradient Descent, Random Forest, LightGBM;
- ✔️ Model evaluation: AUC (Area Under the Curve), Recall, F1-Score;
- ✔️ Hyperparameter optimization;
- ✔️ Evaluation of variable importance with SHAP (Shapley Additive exPlanations).
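
To make the evaluation metrics listed above concrete, here is a minimal, dependency-free sketch of what AUC, Recall and F1-Score measure on an imbalanced target. The function names (`recall`, `f1`, `roc_auc`), the toy labels and scores, and the 0.5 decision threshold are illustrative assumptions; in practice the project would more likely rely on the equivalent scikit-learn metrics.

```python
def recall(y_true, y_pred):
    """Share of actual defaulters (label 1) that the model flagged."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1(y_true, y_pred):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true, y_score):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. This pairwise form is quadratic in the
    number of samples, so it is only meant for small illustrations.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (assumed for illustration): 2 defaults out of 10 applicants.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.05, 0.10, 0.20, 0.15, 0.30, 0.60, 0.25, 0.40, 0.70, 0.35]

# Turn probabilities into class labels with an assumed 0.5 threshold.
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

print("recall:", recall(y_true, y_pred))
print("f1:", f1(y_true, y_pred))
print("auc:", roc_auc(y_true, y_score))
```

Recall and F1 depend on the chosen threshold, whereas AUC is computed from the ranking of the predicted probabilities alone, which is one reason it is a common headline metric when the positive class is rare.
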
scikit-learn, LightGBM, SHAP
- Start Here: A Gentle Introduction - Will Koehrsen on Kaggle
- Collection of useful functions - Ann Antonova on Kaggle
- Introduction to Imbalanced Classification - Jason Brownlee (machinelearningmastery.com)
- Reduce memory usage on large datasets with datatype conversion
- Baseline: scikit-learn - Naive Bayes (Gaussian)
- Linear models: scikit-learn - Logistic Regression; scikit-learn - Stochastic Gradient Descent (SGD)
- Non-linear models: Random Forest Classifier (ensemble method); LightGBM (Gradient Boosting Machine) documentation
- Cross-validation: evaluating estimator performance; cross_validate
- Evaluation metrics: Evaluation of models with scikit-learn
- Tuning the hyperparameters of an estimator; GridSearchCV
- Interpretable Machine Learning with SHAP
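
The cross-validation and hyperparameter tuning resources above can be tied together in a short sketch. It is not the project's actual pipeline: the data is a synthetic stand-in generated with `make_classification` (about 8% positives, an assumed ratio), the model is a plain logistic regression with `class_weight="balanced"` as one possible way to handle the class imbalance, and the grid of `C` values is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data: roughly 8% positives (defaults).
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=5,
    weights=[0.92],
    random_state=0,
)

# Scale features, then fit a logistic regression that reweights classes
# inversely to their frequency.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Stratified folds keep the default rate similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Cross-validated AUC, recall and F1 for the untuned model.
scores = cross_validate(model, X, y, cv=cv, scoring=["roc_auc", "recall", "f1"])
mean_auc = scores["test_roc_auc"].mean()
print("mean AUC:", round(mean_auc, 3))

# Grid search over the regularization strength, selecting on AUC.
search = GridSearchCV(
    model,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=cv,
)
search.fit(X, y)
print("best params:", search.best_params_)

# Probability of default for the first five rows, each between 0 and 1.
proba = search.predict_proba(X[:5])[:, 1]
print(proba)
```

The same pattern should carry over to the other estimators named in this README (for example a random forest or a LightGBM classifier) by swapping the final pipeline step and the parameter grid.
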