**Instructions**
1. Make sure your project is related to the business of the companies you are interested in.
2. The project should not only help the employer's business but also be compatible with the tools they already use.
3. Don’t spend time generating brand-new ideas. Using mature methods to solve a specific problem is preferred.
4. Don’t try to build a perfect project. Start with a Minimum Viable Product. Resources are always limited.

<font color='red'>**Question:**</font> What is the problem you would like to solve? Why is it important?

**Example**: In direct mail marketing, tens of millions of mail pieces are sent to potential customers. Only a very small fraction of recipients respond; most of the mail goes straight into the recycling bin. If we can accurately identify the people who are likely to respond, we can save money on direct mail marketing or acquire more customers with the same resources.

The problem is to predict credit risk for loan applications by determining the likelihood of loan default. This is crucial because:

1. Financial institutions need to assess credit risk accurately to minimize potential losses
2. Poor credit risk assessment can lead to significant financial losses and economic instability
3. Better risk assessment can help make lending more accessible to creditworthy borrowers while protecting lenders
4. Automated and accurate credit risk assessment can reduce operational costs and processing time

By developing an accurate credit risk prediction model, we can help financial institutions make better lending decisions, reduce losses from defaults, and potentially make credit more accessible to qualified borrowers who might be overlooked by traditional methods.


<font color='red'>**Question:**</font> How to measure the business value of your solution? What is the benchmark?

The business value can be measured in several ways:

1. **Reduction in Default Rates**
   - If current default rate is X%, improving prediction accuracy could reduce it by Y%
   - The dataset's current loan intent risk score has an AUC of 0.469, which is below the 0.5 of a random ranking, so significant improvement potential exists

2. **Cost Savings**
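   - For every 1,000 loans, preventing defaults on 1% of them (10 loans) at an assumed average loan amount of $35,000 would avoid up to $350,000 in lost principal; on a $1M portfolio, the same 1% reduction is worth about $10,000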
   - Reduced operational costs through automated risk assessment
   - Lower loan loss provisions due to better risk assessment

3. **Operational Efficiency**
   - Reducing manual review time by automating risk assessment
   - Faster loan processing leading to higher throughput
   - More consistent risk assessment across applications

4. **Improved Customer Experience**
   - Faster loan processing
   - More accurate risk-based pricing
   - Better matching of loan products to customer risk profiles

**Benchmarks:**
- Current industry average default rates for different loan types (EDUCATION: 10.1%, MEDICAL: 15.2%, etc.)
- Traditional credit scoring methods (like FICO scores)
- The current loan intent risk score's AUC of 0.469 (a random ranking scores 0.5)


**Example**: If our goal is saving money in direct mail marketing, the benchmark could be the current situation, in which we send mail to everyone in an area. Let's assume there are a million families in this area, only 0.1% of them respond, and each piece of mail costs 10 cents. If our model can successfully filter out half of the families who won't respond, we save about 50,000 dollars per mailing.
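The arithmetic behind this benchmark can be checked with a few lines. This is only a sketch of the example's own numbers; the variable names are ours, and it assumes the model drops no true responders.

```python
# Figures taken from the direct mail example above.
families = 1_000_000      # households mailed in the baseline
response_rate = 0.001     # 0.1% respond
cost_per_mail = 0.10      # dollars per piece of mail
filtered_share = 0.5      # share of non-responders the model filters out

baseline_cost = families * cost_per_mail
non_responders = families * (1 - response_rate)
mails_avoided = non_responders * filtered_share
savings = mails_avoided * cost_per_mail

print(f"baseline cost: ${baseline_cost:,.0f}")
print(f"savings per mailing: ${savings:,.0f}")
```

The exact figure is $49,950, which the example rounds to 50,000 dollars.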

<font color='red'>**Question:**</font> What are the metrics you will use? Why?

**Example**: We can use AUC-ROC for model training. We will use precision and recall to evaluate our models, keeping recall close to 100% while maximizing precision. If we can't keep recall at 100%, we need to consider the value of each customer and calculate the loss.

We will use multiple metrics to evaluate our model:

1. **Primary Metrics**
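   - **AUC-ROC**: Threshold-independent measure of how well the model ranks defaults above non-defaults; it can look optimistic when defaults are rare, so we pair it with the precision-recall curve
   - **Precision-Recall curve**: Important because, with default as the positive class, false negatives (approving loans that default) and false positives (rejecting good loans) have different business costs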
   - **F1-Score**: Balanced measure of precision and recall
   - **Confusion Matrix**: To understand the types of errors the model makes

2. **Secondary Metrics**
   - **Kolmogorov-Smirnov (KS) statistic**: Common in credit scoring to measure separation between good and bad loans
   - **Gini coefficient**: Industry standard for credit scoring models; equal to `2 * AUC - 1`
   - **Expected monetary value**: Incorporating actual costs of false positives/negatives
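As a rough illustration of how the ranking metrics relate, the sketch below computes AUC, Gini, and KS in plain Python on made-up labels and scores (1 = default). The helper names and the toy data are assumptions for illustration only; in practice a library implementation would normally be used.

```python
def auc(labels, scores):
    """Probability that a random defaulter scores above a random non-defaulter (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(labels, scores):
    """Gini coefficient as used in credit scoring: 2 * AUC - 1."""
    return 2 * auc(labels, scores) - 1


def ks_statistic(labels, scores):
    """Largest gap between the empirical score CDFs of defaulters and non-defaulters."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    best = 0.0
    for threshold in sorted(set(scores)):
        cdf_pos = sum(s <= threshold for s in pos) / len(pos)
        cdf_neg = sum(s <= threshold for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


# Toy data, not taken from the dataset.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]

toy_auc = auc(toy_labels, toy_scores)
toy_gini = gini(toy_labels, toy_scores)
toy_ks = ks_statistic(toy_labels, toy_scores)
print(f"AUC={toy_auc:.3f}  Gini={toy_gini:.3f}  KS={toy_ks:.3f}")
```

A score with an AUC below 0.5 has a negative Gini, which signals that it ranks borrowers in the wrong direction.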

These metrics were chosen because:
- They handle imbalanced datasets well (common in credit risk)
- They provide different perspectives on model performance
- They are industry-standard metrics for credit risk assessment
- They help balance the trade-off between identifying defaults and maintaining a healthy approval rate
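To make the threshold trade-off concrete, here is a minimal sketch that derives the confusion matrix, precision, recall, F1, and an expected cost at several thresholds. It treats default as the positive class (1 = default). The two cost constants and the toy probabilities are hypothetical placeholders, not figures from the dataset.

```python
# Hypothetical business costs (assumptions for illustration only).
COST_MISSED_DEFAULT = 10_000     # false negative: approved a loan that defaults
COST_REJECTED_GOOD_LOAN = 1_000  # false positive: rejected a loan that would repay


def confusion_counts(labels, predictions):
    """Return (tp, fp, fn, tn) with default (1) as the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data, not taken from the dataset.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_probs = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]

costs = {}
for threshold in (0.35, 0.5, 0.65):
    preds = [1 if p >= threshold else 0 for p in toy_probs]
    tp, fp, fn, tn = confusion_counts(toy_labels, preds)
    precision, recall, f1 = precision_recall_f1(tp, fp, fn)
    costs[threshold] = fn * COST_MISSED_DEFAULT + fp * COST_REJECTED_GOOD_LOAN
    print(f"t={threshold}: tp={tp} fp={fp} fn={fn} tn={tn} "
          f"P={precision:.2f} R={recall:.2f} F1={f1:.2f} cost=${costs[threshold]:,}")
```

With asymmetric costs like these, the cheapest threshold is not necessarily the one with the best F1, which is why expected monetary value sits alongside the statistical metrics.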


<font color='red'>**Question:**</font> What data do you need? How will you get the data?

**Example**: Direct mail marketing data is very hard to get, so it is not suitable for a capstone project. Make sure you can get the data you need before you commit to a project. Here are some sources of data:  
1. House prices from Zillow, rent prices from Craigslist, restaurant reviews from Yelp, product reviews from Amazon, etc.
2. Some companies provide APIs, such as Twitter (https://developer.twitter.com/en/docs)
3. Many datasets are available on Kaggle (https://www.kaggle.com/datasets)


We already have the primary dataset from Kaggle (credit-risk-dataset) which includes:

1. **Core Features**
   - Personal information: age, income, home ownership, employment length
   - Loan details: intent, grade, amount, interest rate
   - Credit history: default history, credit history length
   - Target variable: loan_status (default/non-default)

2. **Enhanced Features** (through our web crawler)
   - Market-based loan intent risk scores
   - Home ownership risk scores
   - Combined risk scores

3. **Additional Data We Could Gather**
   - Economic indicators (unemployment rates, GDP growth)
   - Industry-specific default rates
   - Property values for secured loans
   - Market interest rates and trends

The data quality is good with:
- Sufficient sample size
- Relevant features for credit risk assessment
- Real-world loan performance data
- Enhanced features from market data

<font color='red'>**Question:**</font> What algorithms will you choose for this project? Why?

**Example**: We will use Random Forest as a benchmark algorithm because it is very stable and can handle nonlinear relationships between features and the target. Eventually, though, we will use Logistic Regression with engineered features, because we need to keep the model interpretable to communicate with the marketing team.

We'll use a multi-stage approach with multiple algorithms:

1. **Baseline Models**
   - **Logistic Regression**: 
     - Interpretable, standard in credit industry
     - Good for understanding feature importance
     - Supports the explainability that regulatory compliance typically requires
   
   - **Random Forest**: 
     - Handles non-linear relationships
     - Good with mixed data types
     - Built-in feature importance

2. **Advanced Models**
   - **XGBoost**: 
     - Usually provides best performance for structured data
     - Handles missing values well
     - Can reweight the minority class (`scale_pos_weight`) for imbalanced datasets
   
   - **LightGBM**: 
     - Efficient with large datasets
     - Good with categorical features
     - Fast training and prediction

3. **Ensemble Methods**
   - Stacking: Combine predictions from multiple models
   - Weighted averaging: Based on model performance on different segments

The final choice will balance:
- Model performance (AUC, precision, recall)
- Interpretability (required for regulatory compliance)
- Computational efficiency
- Ease of deployment and maintenance
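The weighted-averaging ensemble mentioned above can be sketched as follows. The model names, weights, and probabilities are made up for illustration; in practice the weights might come from each model's validation performance.

```python
def weighted_average(probabilities_by_model, weights):
    """Blend per-model default probabilities using normalized weights."""
    total_weight = sum(weights.values())
    n_rows = len(next(iter(probabilities_by_model.values())))
    return [
        sum(weights[name] * probabilities_by_model[name][i] for name in weights) / total_weight
        for i in range(n_rows)
    ]


# Hypothetical predicted default probabilities for three applications.
model_probs = {
    "logistic_regression": [0.2, 0.8, 0.5],
    "xgboost": [0.1, 0.9, 0.7],
}
# Hypothetical weights: trust the boosted model three times as much.
model_weights = {"logistic_regression": 1.0, "xgboost": 3.0}

blended = weighted_average(model_probs, model_weights)
print([round(p, 3) for p in blended])
```

Stacking would replace the fixed weights with a second model trained on out-of-fold predictions, at the cost of some interpretability.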


<font color='red'>**Question:**</font> What will be the input and output?

**Example**: The input would be the information of people (if available) such as zipcode, household income, age, etc. The output will be 1 (will respond) or 0 (won't respond).

**Inputs:**
1. Personal Information
   - Age
   - Income
   - Employment length
   - Home ownership status

2. Loan Details
   - Amount
   - Intent
   - Grade
   - Interest rate
   - Percent of income

3. Credit History
   - Default history
   - Credit history length

4. Enhanced Features
   - Loan intent risk score
   - Home ownership risk score
   - Combined risk score

**Outputs:**
1. Primary Output
   - Binary classification (1 = default, 0 = non-default)

2. Secondary Outputs
   - Probability of default (0-1 score)
   - Risk tier assignment (Low, Medium, High risk)
   - Recommended interest rate range
   - Confidence score for the prediction
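One way to derive the binary label and the risk tier from the predicted probability is sketched below. The decision threshold and tier cutoffs are placeholder assumptions that would need to be set from the business costs; `RiskOutput` and `to_output` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RiskOutput:
    default_probability: float  # model score in [0, 1]
    predicted_default: int      # 1 = default, 0 = non-default
    risk_tier: str              # "Low", "Medium", or "High"


def to_output(probability, decision_threshold=0.5, tier_cutoffs=(0.2, 0.5)):
    """Map a default probability to the primary and secondary outputs.

    The threshold and cutoffs are illustrative defaults, not tuned values.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    low_cutoff, high_cutoff = tier_cutoffs
    if probability < low_cutoff:
        tier = "Low"
    elif probability < high_cutoff:
        tier = "Medium"
    else:
        tier = "High"
    return RiskOutput(probability, int(probability >= decision_threshold), tier)


for p in (0.1, 0.35, 0.8):
    print(to_output(p))
```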

In [1]:
import kagglehub
import os
import shutil

# Download latest version
path = kagglehub.dataset_download("laotse/credit-risk-dataset")

# Create a local directory for the dataset (no error if it already exists)
local_dir = "credit-risk-dataset"
os.makedirs(local_dir, exist_ok=True)

# Copy all files from the downloaded path to our local directory
for file in os.listdir(path):
    src_file = os.path.join(path, file)
    dst_file = os.path.join(local_dir, file)
    shutil.copy2(src_file, dst_file)

print("Dataset files have been downloaded to:", os.path.abspath(local_dir))

Dataset files have been downloaded to: c:\Users\leemi\Techlent ML camp\Group_project\credit-risk-dataset


In [2]:
import pandas as pd

# Read from the local copy created above instead of a machine-specific absolute path
df = pd.read_csv(os.path.join(local_dir, 'credit_risk_dataset.csv'))

In [3]:
df.head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3
1,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2
2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
3,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2
4,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4
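Even these five rows are worth a quick sanity check. The sketch below re-types a few columns from the output above and flags rows where employment length exceeds age, or where `loan_percent_income` disagrees with `loan_amnt / person_income`. It uses only the standard library so it runs without the CSV file; the 0.01 tolerance and the helper names are our assumptions.

```python
import csv
import io

# Selected columns copied from the df.head() output above.
SAMPLE = """person_age,person_income,person_emp_length,loan_amnt,loan_percent_income
22,59000,123.0,35000,0.59
21,9600,5.0,1000,0.1
25,9600,1.0,5500,0.57
23,65500,4.0,35000,0.53
24,54400,8.0,35000,0.55
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))


def implausible_employment(row):
    """Employment length (years) cannot exceed the person's age."""
    return float(row["person_emp_length"]) > float(row["person_age"])


def ratio_mismatch(row, tolerance=0.01):
    """Reported loan-to-income ratio should match loan_amnt / person_income."""
    computed = float(row["loan_amnt"]) / float(row["person_income"])
    return abs(computed - float(row["loan_percent_income"])) > tolerance


bad_employment = [i for i, row in enumerate(rows) if implausible_employment(row)]
bad_ratio = [i for i, row in enumerate(rows) if ratio_mismatch(row)]
print("implausible employment length:", bad_employment)
print("loan_percent_income mismatch:", bad_ratio)
```

On this sample, row 0 (age 22, employment length 123.0) fails the first check and row 4 fails the second, so the full dataset likely needs outlier handling before modeling.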


In [5]:
# Load the enhanced dataset (adds the crawler-derived risk scores) from the same local folder
df = pd.read_csv(os.path.join(local_dir, 'enhanced_credit_risk_dataset.csv'))
df.head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,loan_intent_risk_score,home_ownership_risk_score,combined_risk_score
0,22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3,0.259802,0.44,0.331881
1,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2,0.387553,0.17,0.300532
2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3,0.4322,0.29,0.37532
3,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2,0.4322,0.44,0.43532
4,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4,0.4322,0.44,0.43532
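On the five rows shown above, `combined_risk_score` matches a 0.6/0.4 weighted average of `loan_intent_risk_score` and `home_ownership_risk_score`. This is an inference from the displayed sample only; the crawler's actual formula is not shown here, so the weights below are an assumption to be verified against the full file.

```python
# (loan_intent_risk_score, home_ownership_risk_score, combined_risk_score)
# copied from the df.head() output above.
sample_scores = [
    (0.259802, 0.44, 0.331881),
    (0.387553, 0.17, 0.300532),
    (0.4322, 0.29, 0.37532),
    (0.4322, 0.44, 0.43532),
    (0.4322, 0.44, 0.43532),
]

# Weights inferred from the sample rows (assumption, not documented).
INTENT_WEIGHT = 0.6
HOME_WEIGHT = 0.4


def combine(intent_score, home_score):
    """Weighted average of the two component risk scores."""
    return INTENT_WEIGHT * intent_score + HOME_WEIGHT * home_score


max_error = max(abs(combine(i, h) - c) for i, h, c in sample_scores)
print(f"largest deviation on the sample: {max_error:.1e}")
```

If the relationship holds on every row, the combined score adds no information beyond its two components for linear models.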
