import pandas as pd
import numpy as np
from math import log

# Load data
df = pd.read_csv(r"C:\Users\satya\Downloads\QR - JP Morgan\Task 3 and 4_Loan_Data.csv")

# Extract relevant columns
fico = df['fico_score'].astype(int).to_list()
defaults = df['default'].astype(int).to_list()
n = len(fico)

# Aggregate by unique FICO score
data = pd.DataFrame({'fico': fico, 'default': defaults})
agg = data.groupby('fico').agg(total=('default', 'size'),
                               defaults=('default', 'sum')).reset_index().sort_values('fico')

scores = agg['fico'].to_list()
total = agg['total'].to_list()
default = agg['defaults'].to_list()
m = len(scores)

# Prefix sums
prefix_total = np.cumsum(total)
prefix_default = np.cumsum(default)

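# Binomial log-likelihood of a bucket with n_obs records and n_def defaults,
# evaluated at the bucket's observed default rate p = n_def / n_obs
def log_likelihood(n_obs, n_def):
    if n_obs == 0:
        return 0.0
    p = n_def / n_obs
    # A rate of exactly 0 or 1 contributes 0 (the limit of x * log(x) as x -> 0)
    if p <= 0 or p >= 1:
        return 0.0
    return n_def * log(p) + (n_obs - n_def) * log(1 - p)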

# Dynamic programming approach for K buckets
K = 5
dp = np.full((K+1, m), -1e18)
prev = np.full((K+1, m), -1, dtype=int)

for j in range(m):
    dp[1, j] = log_likelihood(prefix_total[j], prefix_default[j])

for k in range(2, K+1):
    for j in range(k-1, m):
        for t in range(k-2, j):
            n_bucket = prefix_total[j] - prefix_total[t]
            k_bucket = prefix_default[j] - prefix_default[t]
            ll = log_likelihood(n_bucket, k_bucket)
            val = dp[k-1, t] + ll
            if val > dp[k, j]:
                dp[k, j] = val
                prev[k, j] = t

# Reconstruct boundaries
cuts = []
k = K
j = m - 1
while k > 0:
    t = prev[k, j]
    cuts.append((t+1, j))
    j = t
    k -= 1
cuts = cuts[::-1]

# Create summary table
boundaries = []
for (i, j) in cuts:
    low = scores[i]
    high = scores[j]
    cnt = prefix_total[j] - (prefix_total[i-1] if i > 0 else 0)
    defs = prefix_default[j] - (prefix_default[i-1] if i > 0 else 0)
    rate = defs / cnt
    boundaries.append([low, high, cnt, defs, round(rate, 4)])

buckets = pd.DataFrame(boundaries, columns=['FICO_min', 'FICO_max', 'Count', 'Defaults', 'Default_rate'])
print(buckets)


   FICO_min  FICO_max  Count  Defaults  Default_rate
0       408       520    301       199        0.6611
1       521       580   1407       536        0.3810
2       581       640   3438       703        0.2045
3       641       696   3197       336        0.1051
4       697       850   1657        77        0.0465


### Explanation

The objective is to quantize continuous FICO scores into discrete “rating buckets” that summarize borrower credit risk. 
In this notebook, rating 0 is the lowest-FICO, highest-risk bucket and rating 4 is the highest-FICO, lowest-risk bucket.

This is a discretization problem: the bucket boundaries are chosen to maximize the total log-likelihood

$$L = \sum_{i=1}^{K} \left[ k_i \log p_i + (n_i - k_i) \log(1 - p_i) \right]$$

where $k_i$ is the number of defaults in bucket $i$, $n_i$ is the number of records in bucket $i$, and $p_i = k_i / n_i$ is 
the observed default rate within the bucket. A bucket with $p_i$ equal to 0 or 1 contributes 0 to the sum.
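
As a small illustration of why this objective rewards separating groups with different default rates, the sketch below evaluates it for the two lowest buckets of the printed output (301 records with 199 defaults, and 1,407 records with 536 defaults), first kept apart and then merged. The helper name `bucket_log_likelihood` is introduced here for illustration only and is not part of the notebook's code.

```python
from math import log

def bucket_log_likelihood(n, k):
    """Binomial log-likelihood of k defaults among n records at p = k / n."""
    if n == 0 or k == 0 or k == n:
        return 0.0  # a rate of 0 or 1 contributes 0 by convention
    p = k / n
    return k * log(p) + (n - k) * log(1 - p)

# (count, defaults) for the two lowest buckets in the printed output
low = (301, 199)
next_low = (1407, 536)

separate = bucket_log_likelihood(*low) + bucket_log_likelihood(*next_low)
merged = bucket_log_likelihood(low[0] + next_low[0], low[1] + next_low[1])

print(f"separate buckets: {separate:.1f}")
print(f"merged bucket:    {merged:.1f}")
```

Because each bucket is scored at its own observed default rate, keeping the two groups apart can never score lower than merging them, and it scores strictly higher whenever their default rates differ.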


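To find the optimal bucket boundaries, a dynamic programming (DP) approach is used. For each number of buckets, DP 
determines where to place the last split in the sorted FICO scores so that the total log-likelihood is maximized. 
Prefix sums give the count and defaults of any candidate bucket in constant time, so the algorithm runs in O(K × N²) 
time, where K is the number of desired buckets and N is the number of unique FICO scores (`m` in the code).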
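
The recurrence behind this search can be sketched as follows. The symbols are introduced here for illustration: $N_j$ and $D_j$ stand for the cumulative record and default counts up to the $j$-th unique score (the code's `prefix_total[j]` and `prefix_default[j]`, 0-indexed), $\ell(n, k)$ is the bucket log-likelihood, and $m$ is the number of unique scores.

$$
\begin{aligned}
\mathrm{dp}[1, j] &= \ell(N_j,\, D_j), \\
\mathrm{dp}[k, j] &= \max_{k-2 \,\le\, t \,<\, j} \Big[ \mathrm{dp}[k-1, t] + \ell(N_j - N_t,\; D_j - D_t) \Big], \qquad 2 \le k \le K .
\end{aligned}
$$

The optimum is $\mathrm{dp}[K, m-1]$. Storing the maximizing $t$ in `prev[k, j]` lets the boundaries be read off by walking back from $(K, m-1)$.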

### Results
| Rating     | FICO Range | Count | Defaults | Default Rate |
|:-----------|:------------|------:|----------:|--------------:|
| 0 (worst)  | 408 – 520   | 301   | 199       | 0.6611 |
| 1           | 521 – 580   | 1,407 | 536       | 0.3810 |
| 2           | 581 – 640   | 3,438 | 703       | 0.2045 |
| 3           | 641 – 696   | 3,197 | 336       | 0.1051 |
| 4 (best)   | 697 – 850   | 1,657 | 77        | 0.0465 |



### Interpretation

- *Borrowers in the highest bucket (697–850) have the lowest probability of default (~4.6%), indicating strong creditworthiness.*
- *Borrowers in the lowest bucket (408–520) show very high default risk (~66%), signifying poor credit quality.*
- *This structure can serve as a rating map for assigning risk grades in future datasets.*



### Rating Map

| Rating | FICO Range | Credit Category |
|:------:|:------------|:----------------|
| 4 | 697–850 | Excellent |
| 3 | 641–696 | Good |
| 2 | 581–640 | Fair |
| 1 | 521–580 | Poor |
| 0 | 408–520 | Very Poor |
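
One way the rating map above could be applied to new scores is sketched below. The names `LOWER_BOUNDS` and `fico_to_rating` are hypothetical and not part of the notebook. The sketch also assumes that scores below 408 fall into rating 0 and scores above 850 into rating 4, which the original analysis does not specify.

```python
from bisect import bisect_right

# Lowest FICO score of ratings 1, 2, 3 and 4, taken from the rating map.
LOWER_BOUNDS = [521, 581, 641, 697]

def fico_to_rating(fico_score):
    """Return the rating (0 = worst, 4 = best) for a FICO score.

    Assumption: scores outside the observed 408-850 range are assigned
    to the nearest end bucket.
    """
    # Counts how many lower bounds are <= fico_score.
    return bisect_right(LOWER_BOUNDS, fico_score)

for score in (408, 520, 521, 640, 696, 697, 850):
    print(score, fico_to_rating(score))
```

Storing only the lower bounds keeps the map compact, and the binary search makes each lookup O(log K) in the number of buckets.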



### Conclusion

This analysis builds a quantization framework for FICO scores using a log-likelihood–based dynamic programming method.
It optimally partitions borrower credit scores into risk-based buckets, producing clear, data-driven rating boundaries.
The log-likelihood approach is preferred here over MSE because it explicitly models default probabilities, which aligns 
directly with credit risk prediction goals.