# SQL-Based Exploratory Analysis of Credit Risk

This notebook explores a cleaned credit risk dataset using SQL queries. The data was preprocessed to remove outliers, and engineered features were added, including income brackets, age groups, debt ratio categories, and credit utilization bands.

The goal is to answer key questions about borrower delinquency with SQL queries run against a SQLite database from Python.


## Setup

In [5]:
import pandas as pd
import sqlite3

df = pd.read_csv('../data/clean_credit_risk.csv')
connection_object = sqlite3.connect('../database/credit_risk.db')
df.to_sql('credit_risk', connection_object, if_exists='replace', index=False)

16182
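
As a quick sanity check, the number of rows SQLite sees can be compared with the length of the DataFrame that was written. The sketch below is only illustrative: it uses an in-memory database and a small made-up frame (`toy_df`, `toy_conn` are hypothetical names) so that it runs without the CSV or the `.db` file.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the cleaned dataset (values are made up for illustration).
toy_df = pd.DataFrame({"dlq_2yrs": [0, 1, 1, 0], "total_late": [0, 2, 1, 0]})

# In-memory database, so the example does not touch ../database/credit_risk.db.
toy_conn = sqlite3.connect(":memory:")
toy_df.to_sql("credit_risk", toy_conn, if_exists="replace", index=False)

# Count rows as SQLite sees them and compare with the DataFrame length.
count_df = pd.read_sql_query("SELECT COUNT(*) AS n FROM credit_risk;", toy_conn)
row_count = int(count_df["n"].iloc[0])
rows_match = row_count == len(toy_df)
print(rows_match)
```

Against the real table, the same `COUNT(*)` query would be expected to return the 16182 shown above.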

## Analysis

### What share of borrowers defaulted?

In [10]:
query = """
SELECT AVG(dlq_2yrs) AS default_rate
FROM credit_risk;
"""
display(pd.read_sql_query(query, connection_object))

|   | default_rate |
|---|--------------|
| 0 | 0.496539 |



#### Insight:

Roughly **49.7%** of borrowers in the dataset defaulted (the mean of `dlq_2yrs`). This is surprisingly high: it may indicate a high-risk borrower pool, or it may be a result of the dataset being class-balanced or limited to higher-risk applicants. Either way, this value serves as a **baseline** for comparing subgroup risk levels in later queries.
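
One way to use the overall rate as a baseline is to compute each subgroup's rate alongside its difference from the overall average. The following is a minimal sketch on made-up data; the column name `age_group` and its labels are assumptions for illustration and may not match the real table.

```python
import sqlite3
import pandas as pd

# Toy data; `age_group` is a hypothetical column name and the values are made up.
toy_groups = pd.DataFrame({
    "age_group": ["18-29", "18-29", "30-49", "30-49", "50+", "50+"],
    "dlq_2yrs":  [1, 1, 1, 0, 0, 0],
})
toy_conn_groups = sqlite3.connect(":memory:")
toy_groups.to_sql("credit_risk", toy_conn_groups, index=False)

# Per-group default rate, plus its distance from the overall (baseline) rate.
query = """
SELECT
  age_group,
  AVG(dlq_2yrs) AS default_rate,
  AVG(dlq_2yrs) - (SELECT AVG(dlq_2yrs) FROM credit_risk) AS diff_from_baseline
FROM credit_risk
GROUP BY age_group
ORDER BY age_group;
"""
by_group = pd.read_sql_query(query, toy_conn_groups)
print(by_group)
```

A positive `diff_from_baseline` marks a subgroup that defaults more often than the pool as a whole; a negative value marks one that defaults less often.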


### How many borrowers have never been late vs. those who have?


In [11]:
query = """
SELECT 
  SUM(total_late = 0) AS never_late,
  SUM(total_late > 0) AS late
FROM credit_risk;
"""
display(pd.read_sql_query(query, connection_object))


|   | never_late | late |
|---|------------|------|
| 0 | 9434 | 6748 |



#### Insight:

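The majority of borrowers (9,434 of 16,182, about 58%) have **no late payments**, while the remaining 6,748 (about 42%) have at least one. This query does not look at debt ratio or utilization, but the split hints that payment history may separate borrowers in ways those financial measures alone would not. This segmentation lays the groundwork for more **targeted comparisons** based on behavior, not just financials.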
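
The percentage can also be computed directly in SQL rather than by hand. The sketch below extends the query with a share column; it runs on a small made-up table (`toy_late` and `toy_conn_late` are hypothetical names), and it relies on SQLite evaluating comparisons such as `total_late = 0` to 0 or 1.

```python
import sqlite3
import pandas as pd

# Toy data (made up): three borrowers never late, two late at least once.
toy_late = pd.DataFrame({"total_late": [0, 0, 0, 1, 2]})
toy_conn_late = sqlite3.connect(":memory:")
toy_late.to_sql("credit_risk", toy_conn_late, index=False)

# AVG over a 0/1 comparison gives the fraction of rows where it holds.
query = """
SELECT
  SUM(total_late = 0) AS never_late,
  SUM(total_late > 0) AS late,
  ROUND(100.0 * AVG(total_late = 0), 1) AS pct_never_late
FROM credit_risk;
"""
late_summary = pd.read_sql_query(query, toy_conn_late)
print(late_summary)
```

Note that rows where `total_late` is NULL would be ignored by both `SUM` and `AVG`, so the two counts only add up to the table size when the column has no missing values.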
