**Instructions**
1. Make sure your project is related to the business of the companies you are interested in.
2. The project should not only help the employer's business but also be compatible with the tools they already use.
3. Don’t spend time generating brand-new ideas. Using mature methods to solve a specific problem is preferred.
4. Don’t try to build a perfect project. Start with a Minimum Viable Product. Resources are always limited.

<font color='red'>**Question:**</font> What is the problem you would like to solve? Why is it important?

**Example**: In direct mail marketing, tens of millions of pieces of mail are sent to potential customers. Only a very small fraction of recipients respond, and most of the mail is thrown directly into recycling bins. If we can accurately identify the people who are likely to respond, we can save money on direct mail marketing or acquire more customers with the same resources.

<font color='red'>**Question:**</font> How to measure the business value of your solution? What is the benchmark?

**Example**: If our goal is to save money on direct mail marketing, the benchmark could be the current situation, in which we send mail to everyone in an area. Let's assume there are a million families in this area, only 0.1% of them respond, and each piece of mail costs 10 cents, so one campaign costs 100,000 dollars. If our model can successfully filter out half of the families who won't respond, we can save about 50,000 dollars per campaign.
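
The arithmetic in this example can be checked with a short sketch. The figures come from the example; the variable names are illustrative assumptions. The exact saving works out to 49,950 dollars, which the example rounds to 50,000.

```python
# Figures from the example above; variable names are illustrative.
n_families = 1_000_000   # households mailed in the benchmark campaign
response_rate = 0.001    # 0.1% of recipients respond
cost_per_mail = 0.10     # dollars per piece of mail
filtered_share = 0.5     # share of non-responders the model filters out

baseline_cost = n_families * cost_per_mail
n_non_responders = n_families * (1 - response_rate)
mails_avoided = n_non_responders * filtered_share
savings = mails_avoided * cost_per_mail

print(f"Baseline cost per campaign: ${baseline_cost:,.0f}")
print(f"Mail pieces avoided:        {mails_avoided:,.0f}")
print(f"Savings per campaign:       ${savings:,.0f}")
```

Changing `filtered_share` or `response_rate` shows how sensitive the business value is to model quality and to the base response rate.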

<font color='red'>**Question:**</font> What are the metrics you will use? Why?

**Example**: We can use AUC-ROC for model training and selection. We will use precision and recall to evaluate our models: keep recall close to 100% and maximize precision at the same time. If we can't keep recall at 100%, we need to consider the value of each customer and calculate the loss from the responders we miss.
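
As a minimal sketch of the "keep recall high, then maximize precision" idea, the code below scans candidate score thresholds and keeps the one with the best precision among those that meet a recall target. The function names, labels, and scores are made up for illustration and are not part of any library.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is predicted as 1."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def best_threshold(y_true, scores, min_recall=1.0):
    """Return (threshold, precision, recall) with the highest precision among
    candidate thresholds whose recall is at least min_recall, or None."""
    best = None
    for t in sorted(set(scores)):
        p, r = precision_recall(y_true, scores, t)
        if r >= min_recall and (best is None or p > best[1]):
            best = (t, p, r)
    return best


# Made-up labels (1 = responded) and model scores, for illustration only.
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
scores = [0.05, 0.10, 0.35, 0.40, 0.60, 0.20, 0.15, 0.90]

threshold, precision, recall = best_threshold(y_true, scores, min_recall=1.0)
print(f"threshold={threshold}, precision={precision:.2f}, recall={recall:.2f}")
```

Lowering `min_recall` (for example to 0.95) trades a few missed responders for fewer wasted mailings, which is where the per-customer value enters the calculation.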

<font color='red'>**Question:**</font> What data do you need? How will you get the data?

**Example**: Direct mail marketing data is very hard to get, so it is not suitable for a capstone project. Make sure you can get the data you need before you decide to work on a project. Here are some sources of data:  
1. House price data from Zillow, rent prices from Craigslist, restaurant reviews from Yelp, product reviews from Amazon, etc.
2. Some companies, such as Twitter, provide APIs (https://developer.twitter.com/en/docs)
3. Many datasets are available on Kaggle (https://www.kaggle.com/datasets)

<font color='red'>**Question:**</font> What algorithms will you choose for this project? Why?

**Example**: We will use Random Forest as a benchmark algorithm, because it is very stable and can handle nonlinear relationships between the features and the target. Eventually, however, we will use Logistic Regression with engineered features, because the model must stay interpretable so that we can explain it to the marketing team.
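
One way to sketch this comparison is with scikit-learn, shown below. It assumes scikit-learn is installed and uses synthetic, imbalanced data from `make_classification` as a stand-in for real mailing data. The hyperparameters are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (assumption: not the real mailing data).
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Benchmark: Random Forest, which can capture nonlinear relationships.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
rf_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])

# Interpretable candidate: Logistic Regression on standardized features.
lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr.fit(X_train, y_train)
lr_auc = roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])

print(f"Random Forest AUC-ROC:       {rf_auc:.3f}")
print(f"Logistic Regression AUC-ROC: {lr_auc:.3f}")

# Coefficients show the direction and relative weight of each feature.
coefs = lr.named_steps["logisticregression"].coef_[0]
for i, c in enumerate(coefs):
    print(f"feature_{i}: {c:+.3f}")
```

If the Logistic Regression score falls well below the Random Forest benchmark, that gap suggests which nonlinear effects the engineered features still need to capture.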

<font color='red'>**Question:**</font> What will be the input and output?

**Example**: The input would be information about each person (if available), such as zip code, household income, and age. The output will be 1 (will respond) or 0 (won't respond).

In [1]:
import kagglehub
import os
import shutil

# Download latest version
path = kagglehub.dataset_download("laotse/credit-risk-dataset")

# Create a local directory for the dataset if it doesn't exist
local_dir = "credit-risk-dataset"
os.makedirs(local_dir, exist_ok=True)

# Copy all files from the downloaded path to our local directory
for file in os.listdir(path):
    src_file = os.path.join(path, file)
    dst_file = os.path.join(local_dir, file)
    shutil.copy2(src_file, dst_file)

print("Dataset files have been downloaded to:", os.path.abspath(local_dir))

Dataset files have been downloaded to: c:\Users\leemi\Techlent ML camp\Group_project\credit-risk-dataset


In [2]:
import pandas as pd

# Read the CSV that the first cell copied into local_dir
df = pd.read_csv(os.path.join(local_dir, 'credit_risk_dataset.csv'))

In [3]:
df.head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3
1,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2
2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
3,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2
4,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4


In [7]:
# Read the enhanced dataset stored in the same local directory
df = pd.read_csv(os.path.join(local_dir, 'enhanced_credit_risk_dataset.csv'))
df.head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,loan_intent_risk_score,home_ownership_risk_score,combined_risk_score
0,22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3,0.28788,0.44,0.348728
1,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2,0.33524,0.17,0.269144
2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3,0.3197,0.29,0.30782
3,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2,0.3197,0.44,0.36782
4,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4,0.3197,0.44,0.36782
