# **Data Sources**

This section outlines the data sources for Internal Ratings Based (IRB) and IFRS 9 capital modelling, categorised into internal and external sources. The specific data required, and its granularity, depend on the complexity of the model, the institution's risk profile, and regulatory requirements. Data quality and governance are paramount to ensure the reliability and accuracy of the capital calculations.

In [30]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from src.config import HOME_CREDIT_DATA_DIR

In [4]:
os.chdir(HOME_CREDIT_DATA_DIR)

## **Internal Data Sources**

These sources provide information on the bank's own portfolio and operations. Robust data governance and validation processes are essential to ensure data accuracy and consistency.

1. **Month-End Financial Data**: This forms the foundation of most models. It includes:
   - Delinquency Data: Days past due (DPD) and delinquency status (e.g., 30 or 90 days past due). This is crucial for estimating Probability of Default (PD).
   - Balances: Outstanding loan balances, credit card balances, and other relevant financial exposures. These are used to calculate Exposure at Default (EAD).
   - Limits: Credit limits, exposure limits, and other relevant risk limits applied to borrowers. These are used in EAD calculations and for stress testing.
2. **Transactional Data**: Individual transaction details, including payment amounts, dates, and types of transactions. This provides insights into borrower behaviour and can be used to refine PD and EAD estimations.
3. **Collection Data**: Details on collection efforts, including contact attempts, recovery amounts, and write-off information. This is crucial for refining PD and Loss Given Default (LGD) estimations.
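
As a minimal sketch of how the month-end fields above feed PD and EAD work, the snippet below derives a delinquency bucket from DPD and a utilisation ratio from balance and limit. The records are made up for illustration, the column names mirror `credit_card_balance.parquet`, and the bucket edges are an assumption rather than a regulatory definition.

```python
import numpy as np
import pandas as pd

# Hypothetical month-end records; column names follow credit_card_balance.parquet.
monthly = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 2, 2],
        "SK_DPD": [0, 35, 95, 0],
        "AMT_BALANCE": [500.0, 900.0, 1200.0, 0.0],
        "AMT_CREDIT_LIMIT_ACTUAL": [1000, 1000, 1000, 0],
    }
)

# Illustrative delinquency buckets (the edges are an assumption).
bins = [-1, 0, 29, 89, np.inf]
labels = ["current", "1-29 DPD", "30-89 DPD", "90+ DPD"]
monthly["DPD_BUCKET"] = pd.cut(monthly["SK_DPD"], bins=bins, labels=labels)

# Utilisation = balance / limit; a zero limit becomes NaN instead of inf.
limit = monthly["AMT_CREDIT_LIMIT_ACTUAL"].replace(0, np.nan)
monthly["UTILISATION"] = monthly["AMT_BALANCE"] / limit

print(monthly)
```

Utilisation above 1 (an over-limit account) is kept as-is here; whether to cap it is a modelling decision.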


In [37]:
financial_df = pd.read_parquet(
    path="credit_card_balance.parquet",
    columns=[
        "SK_ID_CURR",
        "NAME_CONTRACT_STATUS",
        "SK_DPD",
        "SK_DPD_DEF",
        "AMT_BALANCE",
        "AMT_CREDIT_LIMIT_ACTUAL",
        "MONTHS_BALANCE",
    ],
)
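# Order each customer's records chronologically.
financial_df.sort_values(by=["SK_ID_CURR", "MONTHS_BALANCE"], inplace=True)
financial_df.reset_index(drop=True, inplace=True)

# MONTHS_BALANCE is a month offset (negative = months in the past).
# Anchoring it to today's date is for illustration only; "timedelta64[M]" is an
# average-length month, so the dates are approximate rather than true month-ends.
today = pd.Timestamp.today()
financial_df["SNAPSHOT_DATE"] = today + financial_df["MONTHS_BALANCE"].astype("timedelta64[M]")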

financial_df.head()

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF,AMT_BALANCE,AMT_CREDIT_LIMIT_ACTUAL,MONTHS_BALANCE,SNAPSHOT_DATE
0,100006,Active,0,0,0.0,270000,-6,2025-01-12 17:03:30.265247
1,100006,Active,0,0,0.0,270000,-5,2025-02-12 17:03:30.265247
2,100006,Active,0,0,0.0,270000,-4,2025-03-15 17:03:30.265247
3,100006,Active,0,0,0.0,270000,-3,2025-04-14 17:03:30.265247
4,100006,Active,0,0,0.0,270000,-2,2025-05-15 17:03:30.265247
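
The cell above adds `MONTHS_BALANCE` as a `timedelta64[M]`, which NumPy treats as an average-length month. The resulting dates therefore drift by a day or so and carry the time of day, as the output shows, and newer pandas releases may reject this unit altogether. A calendar-aware alternative, sketched below with `pd.Period` arithmetic, yields true month-end dates. The reference month `2025-07` is an assumption for illustration; the dataset does not state a real calendar date.

```python
import pandas as pd

# Hypothetical offsets in months relative to the reference month (0 or negative).
months_balance = pd.Series([-6, -5, -1, 0], name="MONTHS_BALANCE")

# Assumed reference month; replace with the actual reporting month if known.
reference_month = pd.Period("2025-07", freq="M")

# Period arithmetic shifts by whole calendar months.
snapshot_period = pd.PeriodIndex(
    [reference_month + int(m) for m in months_balance], freq="M"
)

# Convert each period to its last calendar day at midnight.
snapshot_date = pd.Series(
    snapshot_period.to_timestamp(how="end").normalize(), name="SNAPSHOT_DATE"
)

print(snapshot_date)
```

On a large extract, it would likely be cheaper to compute the mapping once for the unique `MONTHS_BALANCE` values and then use `Series.map`.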


In [39]:
transactional_df = pd.read_parquet(
    path="credit_card_balance.parquet",
    columns=[
        "SK_ID_CURR",
        "AMT_DRAWINGS_ATM_CURRENT",
        "AMT_DRAWINGS_CURRENT",
        "AMT_DRAWINGS_OTHER_CURRENT",
        "AMT_DRAWINGS_POS_CURRENT",
        "CNT_DRAWINGS_ATM_CURRENT",
        "CNT_DRAWINGS_CURRENT",
        "CNT_DRAWINGS_OTHER_CURRENT",
        "CNT_DRAWINGS_POS_CURRENT",
        "CNT_INSTALMENT_MATURE_CUM",
        "MONTHS_BALANCE",
    ],
)
transactional_df.sort_values(by=["SK_ID_CURR", "MONTHS_BALANCE"], inplace=True)
transactional_df.reset_index(drop=True, inplace=True)

today = pd.Timestamp.today()
transactional_df["SNAPSHOT_DATE"] = today + transactional_df["MONTHS_BALANCE"].astype(
    "timedelta64[M]"
)

transactional_df.head()

Unnamed: 0,SK_ID_CURR,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,CNT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_POS_CURRENT,CNT_INSTALMENT_MATURE_CUM,MONTHS_BALANCE,SNAPSHOT_DATE
0,100006,,0.0,,,,0,,,0.0,-6,2025-01-12 17:06:14.599528
1,100006,,0.0,,,,0,,,0.0,-5,2025-02-12 17:06:14.599528
2,100006,,0.0,,,,0,,,0.0,-4,2025-03-15 17:06:14.599528
3,100006,,0.0,,,,0,,,0.0,-3,2025-04-14 17:06:14.599528
4,100006,,0.0,,,,0,,,0.0,-2,2025-05-15 17:06:14.599528
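
Transactional fields such as these are often summarised into behavioural features before they enter a PD or EAD model. The sketch below computes a trailing three-month sum of drawings per customer on made-up data. The feature name `DRAWINGS_3M` is hypothetical, and the row-based window assumes each customer has one record per month with no gaps.

```python
import pandas as pd

# Hypothetical monthly drawings; column names follow credit_card_balance.parquet.
tx = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 1, 1, 2, 2],
        "MONTHS_BALANCE": [-4, -3, -2, -1, -2, -1],
        "AMT_DRAWINGS_CURRENT": [100.0, 0.0, 50.0, 200.0, 10.0, 20.0],
    }
).sort_values(["SK_ID_CURR", "MONTHS_BALANCE"], ignore_index=True)

# Trailing 3-month sum within each customer; shorter histories use what is available.
tx["DRAWINGS_3M"] = tx.groupby("SK_ID_CURR")["AMT_DRAWINGS_CURRENT"].transform(
    lambda s: s.rolling(window=3, min_periods=1).sum()
)

print(tx)
```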


In [40]:
collection_df = pd.read_parquet(
    path="credit_card_balance.parquet",
    columns=[
        "SK_ID_CURR",
        "AMT_INST_MIN_REGULARITY",
        "AMT_PAYMENT_CURRENT",
        "AMT_PAYMENT_TOTAL_CURRENT",
        "AMT_RECEIVABLE_PRINCIPAL",
        "AMT_RECIVABLE",
        "AMT_TOTAL_RECEIVABLE",
        "MONTHS_BALANCE",
    ],
)
collection_df.sort_values(by=["SK_ID_CURR", "MONTHS_BALANCE"], inplace=True)
collection_df.reset_index(drop=True, inplace=True)

today = pd.Timestamp.today()
collection_df["SNAPSHOT_DATE"] = today + collection_df["MONTHS_BALANCE"].astype("timedelta64[M]")

collection_df.head()

Unnamed: 0,SK_ID_CURR,AMT_INST_MIN_REGULARITY,AMT_PAYMENT_CURRENT,AMT_PAYMENT_TOTAL_CURRENT,AMT_RECEIVABLE_PRINCIPAL,AMT_RECIVABLE,AMT_TOTAL_RECEIVABLE,MONTHS_BALANCE,SNAPSHOT_DATE
0,100006,0.0,,0.0,0.0,0.0,0.0,-6,2025-01-12 17:06:46.131962
1,100006,0.0,,0.0,0.0,0.0,0.0,-5,2025-02-12 17:06:46.131962
2,100006,0.0,,0.0,0.0,0.0,0.0,-4,2025-03-15 17:06:46.131962
3,100006,0.0,,0.0,0.0,0.0,0.0,-3,2025-04-14 17:06:46.131962
4,100006,0.0,,0.0,0.0,0.0,0.0,-2,2025-05-15 17:06:46.131962
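
Because the internal extracts rely on data validation, a few basic checks can be run on each one before modelling. The helper below is a minimal sketch on made-up data. `basic_quality_report` is a hypothetical name, and the three indicators are illustrative rather than a complete governance framework.

```python
import pandas as pd


def basic_quality_report(df, key_cols):
    """Return a few simple data-quality indicators for a monthly extract."""
    return {
        "n_rows": len(df),
        # Rows that repeat an earlier (customer, month) key.
        "n_duplicate_keys": int(df.duplicated(subset=key_cols).sum()),
        # Share of missing values per column.
        "missing_share": df.isna().mean().round(3).to_dict(),
    }


# Hypothetical extract with one duplicated key and one missing payment.
sample = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 1, 2],
        "MONTHS_BALANCE": [-2, -1, -1, -1],
        "AMT_PAYMENT_CURRENT": [None, 50.0, 50.0, 10.0],
    }
)

report = basic_quality_report(sample, key_cols=["SK_ID_CURR", "MONTHS_BALANCE"])
print(report)
```

In the notebook, the same call could be applied to `financial_df`, `transactional_df`, and `collection_df` with the same key columns, provided that customer and month are expected to identify a row uniquely in each extract.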


## **External Data Sources**

These sources provide information from outside the bank, offering a broader perspective on borrower risk. Careful consideration of data quality, licensing, and privacy regulations is essential.

1. **Application Data**: Information provided by customers during the loan application process. This includes income, employment history, assets, liabilities, and other financial information. This is used to assess creditworthiness and estimate PD.
2. **Credit Bureau Data**: Information from credit bureaus, including credit scores, credit history, payment behaviour, and credit inquiries. This is a critical input for PD estimation and can help to validate internal data.

In [56]:
application_train_df = pd.read_parquet(path="application_train.parquet")
application_test_df = pd.read_parquet(path="application_test.parquet")

# Stack the train and test applications; rows from application_test have no TARGET (NaN).
application_df = pd.concat([application_train_df, application_test_df])
application_df.sort_values(by=["SK_ID_CURR"], inplace=True)
application_df.reset_index(drop=True, inplace=True)

application_df.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100001,,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,100002,1.0,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
2,100003,0.0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100004,0.0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
4,100005,,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0


In [65]:
credit_bureau_df = pd.read_parquet(path="bureau.parquet")
credit_bureau_balance_df = pd.read_parquet(path="bureau_balance.parquet")

credit_bureau_df.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.0,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.5,,,0.0,Consumer credit,-16,
3,215354,5714465,Active,currency 1,-203,0,,,,0,90000.0,,,0.0,Credit card,-16,
4,215354,5714466,Active,currency 1,-629,0,1197.0,,77674.5,0,2700000.0,,,0.0,Consumer credit,-21,
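
Bureau data holds one row per external credit, so it usually needs to be aggregated to customer level before it can be joined to the application data on `SK_ID_CURR`. The sketch below shows one way to do that on made-up data. The `BUREAU_*` feature names are hypothetical, and the choice of aggregates is illustrative.

```python
import pandas as pd

# Hypothetical application and bureau records; column names follow the dataset.
applications = pd.DataFrame(
    {"SK_ID_CURR": [1, 2, 3], "AMT_INCOME_TOTAL": [1000.0, 2000.0, 1500.0]}
)
bureau = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 2],
        "SK_ID_BUREAU": [10, 11, 12],
        "CREDIT_ACTIVE": ["Active", "Closed", "Active"],
        "AMT_CREDIT_SUM_DEBT": [500.0, 0.0, None],
    }
)

# Collapse bureau credits to one row per customer.
# Note: "sum" skips missing debt values, so an all-missing group sums to 0.0.
bureau_agg = (
    bureau.assign(IS_ACTIVE=bureau["CREDIT_ACTIVE"].eq("Active"))
    .groupby("SK_ID_CURR", as_index=False)
    .agg(
        BUREAU_N_CREDITS=("SK_ID_BUREAU", "count"),
        BUREAU_N_ACTIVE=("IS_ACTIVE", "sum"),
        BUREAU_TOTAL_DEBT=("AMT_CREDIT_SUM_DEBT", "sum"),
    )
)

# Left join keeps applicants without bureau history; validate guards against fan-out.
features = applications.merge(
    bureau_agg, on="SK_ID_CURR", how="left", validate="one_to_one"
)

print(features)
```

Applicants with no bureau record end up with missing values in the `BUREAU_*` columns. Whether to impute these or treat "no bureau history" as its own category is a modelling decision.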
