# Predictive Credit Risk Modeling for Sustainable Lending: An In-depth Analysis Using Machine Learning at SuperLender


## Introduction

Loan default is a substantial challenge for lenders and borrowers alike. Lenders incur financial losses and see an erosion of trust when borrowers fail to repay, while borrowers face consequences ranging from damaged credit scores to potential legal action. Given these stakes, it is important to predict the likelihood of default and to take preventive measures that mitigate the associated risk.

This project uses machine learning to build a predictive model of loan default from data sourced from SuperLender, a prominent local digital lending company. SuperLender takes a data-driven approach to assessing the credit risk of its customers. Two factors govern the likelihood of repayment: the borrower's willingness to repay and the borrower's ability to repay.

As the analytics and modeling arm of this project, Nyika Analytika examines how SuperLender uses machine learning models to forecast loan outcomes and evaluates their predictive performance. In doing so, it seeks to clarify the methodology behind SuperLender's credit risk assessment and to show how technology and finance combine to address one of the lending industry's most pressing challenges. The project aims to contribute both to SuperLender's specific case and to the broader discussion of how to improve credit risk assessment and protect the financial ecosystem against loan default.

## Business Understanding

Borrower default is a persistent challenge for financial institutions. Its effects go beyond direct monetary losses to profit margins, liquidity, and the long-term sustainability of the business. Defaults can also tarnish a company's reputation, erode investor confidence, and impede future borrowing opportunities. To address this, SuperLender intends to implement a credit risk model that estimates a borrower's likelihood of repaying a loan.

In the ambit of this project, the primary objective is to develop a sophisticated credit risk model leveraging state-of-the-art machine learning techniques. This model aims to harness the power of historical data, scrutinizing patterns and trends to predict potential defaults. By doing so, SuperLender aspires to usher in a paradigm shift in risk management, moving from a reactive stance to a proactive one. The envisioned model is not merely a safeguard against financial losses but is designed to serve as an invaluable tool, furnishing the credit manager and other institutional stakeholders with critical insights.

At the core of this initiative is the aspiration to empower decision-makers with a data-driven approach. The credit risk model developed in this project will not only inform the binary decision of approving or denying a loan but will extend its influence to guide targeted and nuanced strategies. These strategies include the customization of loan terms tailored to individual borrowers, thereby optimizing the lending process. The broader aim is to instill a culture of informed decision-making within the institution, enabling not only risk mitigation but also fostering a more dynamic and responsive approach to the evolving landscape of borrower behavior.

In essence, the synergy between machine learning techniques and SuperLender's mission transcends the singular goal of predicting loan defaults. It represents a strategic initiative to revolutionize credit risk assessment, embedding data-driven decision-making into the very fabric of the lending process. As Nyika Analytika undertakes this venture, the overarching objective is to unravel the intricacies of SuperLender's vision, ultimately contributing to the evolution of best practices in the realm of credit risk management within the dynamic sphere of digital lending.

## Statement of the Problem

A defaulted loan is a costly expense for a lending business. Because responsible lending underpins long-term viability, financial institutions need strong risk assessment strategies. The core challenge is to predict customer loan defaults accurately, thereby minimizing financial risk and supporting sustainable lending.

Whether a customer meets their loan obligations depends on many factors, including demographics and financial history. Financial institutions must therefore distinguish customers with a proven capacity to repay from those who present a higher risk of default, so that credit is allocated to borrowers who both need it and have the financial means to meet their repayment obligations.

In essence, the problem statement encapsulates the challenge faced by financial institutions: how to refine their risk assessment methodologies to predict customer loan defaults with a high degree of precision. The goal is not only to shield the institution from the financial fallout of defaulted loans but, perhaps more crucially, to foster an ecosystem of responsible lending. In doing so, financial institutions can not only safeguard their own fiscal health but also contribute to the broader stability of the financial landscape.

As Nyika Analytika delves into this intricate challenge, the aim is to unravel the complexities of predicting customer loan defaults, contributing to the development of a nuanced and effective credit risk model. The ultimate aspiration is to provide financial institutions, exemplified by SuperLender, with a predictive tool that empowers them to make informed decisions, lending responsibly and ensuring the longevity of their business operations in an ever-evolving financial landscape.

## Objectives

### Main Objective

The overarching goal of this project is to develop a robust predictive model that estimates the likelihood of a customer repaying a loan. This objective aligns with SuperLender's broader mission to enhance its credit risk assessment through machine learning.

### Specific Objectives

1. **Determine Demographic Influences:**
   - Investigate and analyze demographic factors that significantly affect the likelihood of loan repayment. By examining variables such as age, income, employment status, and other pertinent demographic information, the goal is to identify patterns and correlations that can inform the predictive model.

2. **Explore Past Financial Details and Behavior:**
   - Scrutinize historical financial data and customer behavior to identify key indicators that influence loan repayment likelihood. This involves a comprehensive examination of past credit history, spending patterns, and financial habits, aiming to extract actionable insights that contribute to a more nuanced and accurate credit risk model.

3. **Develop a User Interface (UI) for Credit Managers:**
   - Create a user-friendly interface designed specifically for credit managers, providing them with timely and comprehensive information on customer loan repayment details. The UI should be intuitive, offering actionable insights derived from the developed predictive model. This tool empowers credit managers to make informed decisions promptly, contributing to a more efficient and data-driven lending process.

## Data Understanding

In the pursuit of developing a robust credit risk model, Nyika Analytika will harness datasets from Zindi, Africa's premier professional network for data scientists. The focus of the data exploration spans three distinct datasets, each contributing unique dimensions to the understanding of customer loan repayment tendencies.

**a) Demographic Data:**
   - **`customerid` (Primary Key):** Serves as a unique identifier, ensuring traceability across multiple datasets, anchoring borrowers' histories.
   - **`birthdate` (Date of Birth):** Offers insights into borrowers' ages, a factor correlated with financial stability and loan repayment capacity.
   - **`bank_account_type` (Type of Primary Bank Account):** Reflects customers' banking preferences, potentially indicating financial stability.
   - **`latitude_gps` / `longitude_gps`:** Geographic coordinates facilitate the assessment of regional risk factors influencing repayment tendencies.
   - **`bank_name_clients` (Name of the Bank):** Provides insights into banking history and its potential impact on loan behavior.
   - **`bank_branch_clients` (Location of the Branch):** Context about borrowers' banking relationships, if available.
   - **`employment_status_clients`:** Critical in determining income stability and the ability to repay loans.
   - **`level_of_education_clients`:** Reflects financial literacy and potential income, influencing loan behavior.
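
As an illustration of how `birthdate` could become an age feature, the sketch below computes completed years at a reference date. The sample rows, the `add_age` helper, and the reference date `2017-07-01` are all assumptions made for this example, not values taken from the dataset.

```python
import pandas as pd

# Invented rows shaped like the demographic table (values are hypothetical).
demographics = pd.DataFrame({
    "customerid": ["c1", "c2", "c3"],
    "birthdate": ["1985-06-15", "1990-12-01", "1978-01-30"],
})


def add_age(df: pd.DataFrame, as_of: str) -> pd.DataFrame:
    """Return a copy of df with an integer `age` column (completed years at `as_of`)."""
    out = df.copy()
    birth = pd.to_datetime(out["birthdate"])
    ref = pd.Timestamp(as_of)
    # True where the birthday has already occurred in the reference year.
    had_birthday = (birth.dt.month < ref.month) | (
        (birth.dt.month == ref.month) & (birth.dt.day <= ref.day)
    )
    # Subtract one year for customers whose birthday is still to come.
    out["age"] = ref.year - birth.dt.year - (~had_birthday).astype(int)
    return out


# Hypothetical reference date; a real analysis might use each loan's approval date.
demo_with_age = add_age(demographics, as_of="2017-07-01")
print(demo_with_age)
```

Using a fixed reference date keeps the sketch simple; measuring age at the time of each loan would need the approval date from the performance table.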

**b) Performance Data:**
   - **`customerid` (Primary Key):** Associates loan performance with individual borrowers, ensuring a comprehensive analysis.
   - **`systemloanid` (Loan ID):** Unique identifier for each loan, enabling the tracking of specific loan histories.
   - **`loannumber` (Number of the Loan Being Predicted):** Crucial for assessing a borrower's history of loan applications and ability to manage multiple financial commitments.
   - **`approveddate` (Date Loan Was Approved):** Facilitates historical trend analysis in loan approval, considering changes in economic conditions and lending policies.
   - **`loanamount`:** The principal borrowed; a candidate predictor of default, since larger loans can be harder to repay.
   - **`totaldue` (Total Repayment Required):** Assesses the borrower's capacity to meet financial obligations.
   - **`termdays` (Loan Term):** Duration significantly impacting default risk, with different patterns for longer-term versus shorter-term loans.
   - **`referredby`:** If available, provides insights into customer referrals influencing loan behavior.
   - **`good_bad_flag` (Loan Performance):** The target variable: the loan outcome label that this project aims to predict.
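
A minimal sketch of how the performance fields might be prepared for modeling follows. It assumes the amount column is named `loanamount` and that `good_bad_flag` holds the labels `"Good"` and `"Bad"`; the rows themselves are invented, and the derived columns `default` and `interest_rate` are names chosen for this example.

```python
import pandas as pd

# Invented rows shaped like the performance table (values are hypothetical).
performance = pd.DataFrame({
    "customerid": ["c1", "c2", "c3", "c4"],
    "loanamount": [10000.0, 20000.0, 10000.0, 30000.0],
    "totaldue": [11500.0, 24500.0, 13000.0, 34500.0],
    "termdays": [15, 30, 30, 30],
    "good_bad_flag": ["Good", "Good", "Bad", "Good"],
})

# Binary target: 1 = "Bad" (default), 0 = "Good" (repaid). Label values are assumed.
performance["default"] = performance["good_bad_flag"].map({"Good": 0, "Bad": 1})

# Cost of credit relative to principal, e.g. 11500 / 10000 - 1 = 0.15.
performance["interest_rate"] = performance["totaldue"] / performance["loanamount"] - 1

# Share of each class; a skewed split is worth knowing before choosing metrics.
class_share = performance["default"].value_counts(normalize=True)
print(class_share)
```

Checking the class shares early matters because accuracy alone can look strong on an imbalanced target while the minority class is predicted poorly.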

**c) Previous Loans Data:**
   - **`customerid` (Primary Key):** Ensures the association of historical loan data with individual borrowers for comprehensive analysis.
   - **`systemloanid` (Loan ID):** Unique identifier for each loan, facilitating tracking of specific loan histories.
   - **`loannumber` (Number of the Loan Being Predicted):** Understanding historical borrowing patterns and their potential influence on current defaults.
   - **`approveddate` (Date Loan Was Approved):** Timing information crucial for understanding loan behavior over time.
   - **`creationdate` (Date Loan Was Created):** Offers insights into the timing of loan applications.
   - **`loanamount`:** The amount borrowed on each previous loan.
   - **`totaldue` (Total Repayment Required):** Assesses the borrower's capacity to meet financial obligations.
   - **`closeddate` (Date Loan Was Settled):** Indicates when previous loans were paid off, offering insights into past loan performance.
   - **`referredby`:** If available, offers insights into customer referrals influencing loan behavior.
   - **`firstduedate` (Date of First Payment Due):** Provides information about the initial payment schedule.
   - **`firstrepaiddate` (Actual Date of First Payment):** Records the date of the customer's first payment on previous loans, indicating initial repayment behavior.
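
Since `customerid` links the three tables, one way to use the previous-loans data is to summarize each customer's history and join it onto the performance table. The sketch below is an assumption about how that could look, not the approach taken in this notebook: the rows are invented, and the feature names `days_late`, `n_prev_loans`, `mean_days_late`, and `share_late` are hypothetical.

```python
import pandas as pd

# Invented rows shaped like the previous-loans and performance tables.
prev_loans = pd.DataFrame({
    "customerid": ["c1", "c1", "c2"],
    "systemloanid": [101, 102, 201],
    "firstduedate": ["2017-01-10", "2017-03-10", "2017-02-01"],
    "firstrepaiddate": ["2017-01-08", "2017-03-15", "2017-02-01"],
})
performance = pd.DataFrame({
    "customerid": ["c1", "c2", "c3"],
    "good_bad_flag": ["Good", "Bad", "Good"],
})

# Parse the date columns so they can be subtracted.
for col in ("firstduedate", "firstrepaiddate"):
    prev_loans[col] = pd.to_datetime(prev_loans[col])

# Positive values mean the first payment arrived after it was due.
prev_loans["days_late"] = (
    prev_loans["firstrepaiddate"] - prev_loans["firstduedate"]
).dt.days

# One row per customer summarizing past repayment behavior.
history = (
    prev_loans.groupby("customerid")
    .agg(
        n_prev_loans=("systemloanid", "count"),
        mean_days_late=("days_late", "mean"),
        share_late=("days_late", lambda d: (d > 0).mean()),
    )
    .reset_index()
)

# Left join keeps customers with no previous loans (their history is missing).
model_table = performance.merge(history, on="customerid", how="left")
model_table["n_prev_loans"] = model_table["n_prev_loans"].fillna(0).astype(int)
print(model_table)
```

A left join is used so that first-time borrowers stay in the modeling table; their missing history features would still need an imputation choice before training.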

### Importing Libraries

# Data Analysis and Visualization
import pandas as pd
import numpy as np
import seaborn as sns
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import datetime
%matplotlib inline
pd.set_option('display.max_columns', None)

# Suppress Warnings
import warnings
warnings.filterwarnings("ignore")

# Data Pre-processing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE

# Machine Learning Models
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import (
    RandomForestClassifier, BaggingClassifier, AdaBoostClassifier,
    GradientBoostingClassifier,
)
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

# Model Evaluation and Selection
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

# Model Metrics and Visualization
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc, f1_score, roc_auc_score
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.feature_selection import RFE
from sklearn.tree import plot_tree, export_text