# Bucket FICO scores

Charlie wants her model to work on future data sets, so she needs a general approach to generating the buckets. Given a fixed number of buckets, matching the number of input labels for the model, she would like to find the boundaries that best summarize the data. You need to create a rating map that assigns each borrower's FICO score to a rating, where a lower rating signifies a better credit score.
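
A minimal sketch of such a rating map, assuming the bucket boundaries are already known. The boundary values and the name `fico_to_rating` are hypothetical and only illustrate the convention that rating 1 is the best bucket.

```python
import numpy as np

# Hypothetical bucket boundaries: illustrative values only,
# not boundaries fitted to the loan data.
boundaries = [580, 670, 740]

def fico_to_rating(scores, boundaries):
    """Map FICO scores to integer ratings, where 1 is the best bucket."""
    scores = np.asarray(scores)
    # np.digitize returns 0 for scores below the first boundary and
    # len(boundaries) for scores at or above the last one.
    bucket = np.digitize(scores, boundaries)
    # Flip the order so the highest-scoring bucket receives rating 1.
    return len(boundaries) + 1 - bucket

ratings = fico_to_rating([550, 600, 700, 800], boundaries)
print(ratings)
```

With three boundaries there are four buckets, so ratings run from 1 (highest scores) to 4 (lowest scores).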

This process is known as quantization. There are many ways to solve the problem, each optimizing a different property of the resulting buckets, such as the mean squared error or the log-likelihood (see below for definitions).
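
To make the two objectives concrete, here is a sketch that scores one choice of boundaries. It assumes the usual definitions: the mean squared error treats each score as approximated by the mean score of its bucket, and the log-likelihood sums `k*ln(p) + (n-k)*ln(1-p)` over buckets, where `n` is the number of borrowers in a bucket, `k` the number of defaults, and `p = k/n`. The function name `bucket_objectives` and the toy data are hypothetical.

```python
import numpy as np

def bucket_objectives(scores, defaults, boundaries):
    """Return (mean squared error, log-likelihood) for one choice of boundaries."""
    scores = np.asarray(scores, dtype=float)
    defaults = np.asarray(defaults, dtype=float)
    bucket = np.digitize(scores, boundaries)  # bucket index of each borrower

    squared_error = 0.0
    log_likelihood = 0.0
    for b in np.unique(bucket):
        in_bucket = bucket == b
        # MSE view: every score is represented by its bucket's mean score.
        squared_error += np.sum((scores[in_bucket] - scores[in_bucket].mean()) ** 2)
        # Log-likelihood view: n borrowers, k defaults, default rate p = k / n.
        n = in_bucket.sum()
        k = defaults[in_bucket].sum()
        if 0 < k < n:  # buckets with p = 0 or p = 1 contribute log(1) = 0
            p = k / n
            log_likelihood += k * np.log(p) + (n - k) * np.log(1 - p)
    return squared_error / len(scores), log_likelihood

# Toy data (made up): four borrowers, one boundary at 600.
mse, ll = bucket_objectives([500, 520, 700, 720], [1, 0, 0, 0], [600])
print(mse, ll)
```

A lower mean squared error means the buckets approximate the scores more closely; a higher (less negative) log-likelihood means the buckets separate defaulters from non-defaulters more sharply.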

In [18]:
import numpy as np
import pandas as pd
from scipy import stats

from sklearn.tree import DecisionTreeClassifier

In [55]:
file_path = "Loan_Data.csv"
Loan = pd.read_csv(file_path)

Loan = Loan.drop(["customer_id"], axis = 1)
Loan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   credit_lines_outstanding  10000 non-null  int64  
 1   loan_amt_outstanding      10000 non-null  float64
 2   total_debt_outstanding    10000 non-null  float64
 3   income                    10000 non-null  float64
 4   years_employed            10000 non-null  int64  
 5   fico_score                10000 non-null  int64  
 6   default                   10000 non-null  int64  
dtypes: float64(3), int64(4)
memory usage: 547.0 KB


In [63]:
X = Loan[["fico_score"]]
y = Loan[["default"]]

tree = DecisionTreeClassifier(max_depth = 2)

# Fit the decision tree to the data
tree.fit(X, y)

# Extract the optimal thresholds from the decision tree
thresholds = tree.tree_.threshold[tree.tree_.children_left != tree.tree_.children_right]

# Sort and remove duplicates
thresholds = list(np.sort(np.unique(thresholds)))
thresholds = [0] + thresholds
thresholds.append(850)

# Use the thresholds for discretization
Loan['fico_bin'] = pd.cut(Loan['fico_score'], bins = thresholds)

# Calculate the default rate within each FICO bin (kept as a bin -> probability lookup)
class_probabilities = Loan.groupby('fico_bin', observed=True)['default'].mean().to_dict()

# Attach each borrower's bin default rate, clipped so that log() stays finite
p = Loan.groupby('fico_bin', observed=True)['default'].transform('mean').clip(1e-12, 1 - 1e-12)

# Calculate the Bernoulli log-likelihood for each instance
Loan['Log_Likelihood'] = np.where(Loan['default'] == 1, np.log(p), np.log(1 - p))

# Print the thresholds and the summed log-likelihood
print("Thresholds: ", thresholds)
print("Log Likelihood: ", Loan["Log_Likelihood"].sum())


Thresholds:  [0, 520.5, 580.5, 640.5, 850]
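
The decision tree returns however many thresholds its depth allows, not a requested number of buckets. One possible alternative, sketched here under stated assumptions, is dynamic programming over the sorted distinct scores: for each bucket count and each end position, keep the split that maximizes the summed Bernoulli log-likelihood. The names `optimal_buckets` and `_bucket_ll` are hypothetical, the toy data is made up, and the function assumes there are at least `n_buckets` distinct scores.

```python
import numpy as np

def _bucket_ll(n, k):
    """Bernoulli log-likelihood of a bucket with n borrowers and k defaults."""
    if k == 0 or k == n:
        return 0.0  # p is 0 or 1, so every term is log(1) = 0
    p = k / n
    return k * np.log(p) + (n - k) * np.log(1 - p)

def optimal_buckets(scores, defaults, n_buckets):
    """Return (best log-likelihood, highest score of each bucket)."""
    scores = np.asarray(scores)
    values = np.unique(scores)  # sorted distinct scores
    idx = np.searchsorted(values, scores)
    m = len(values)
    # Prefix sums of borrowers (N) and defaults (K) over the distinct scores.
    N = np.concatenate(([0], np.cumsum(np.bincount(idx, minlength=m))))
    K = np.concatenate(([0], np.cumsum(np.bincount(idx, weights=defaults, minlength=m))))

    # best[b, j]: best log-likelihood using b buckets for the first j distinct scores.
    best = np.full((n_buckets + 1, m + 1), -np.inf)
    prev = np.zeros((n_buckets + 1, m + 1), dtype=int)
    best[0, 0] = 0.0
    for b in range(1, n_buckets + 1):
        for j in range(b, m + 1):
            for i in range(b - 1, j):  # last bucket covers values[i:j]
                cand = best[b - 1, i] + _bucket_ll(N[j] - N[i], K[j] - K[i])
                if cand > best[b, j]:
                    best[b, j] = cand
                    prev[b, j] = i

    # Walk back through the stored split points to recover the bucket edges.
    upper_edges = []
    j = m
    for b in range(n_buckets, 0, -1):
        upper_edges.append(values[j - 1])
        j = prev[b, j]
    return best[n_buckets, m], upper_edges[::-1]

# Toy data (made up): low scores all default, high scores never do.
toy_scores = [300, 310, 320, 700, 710, 720]
toy_defaults = [1, 1, 1, 0, 0, 0]
best_ll, upper_edges = optimal_buckets(toy_scores, toy_defaults, n_buckets=2)
print(best_ll, upper_edges)
```

The triple loop costs on the order of `n_buckets * m**2` steps for `m` distinct scores, which stays manageable for FICO scores because they take only a few hundred distinct values.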
