## Notebook Overview

### 0.1.1 Project Overview

This notebook focuses on the crucial initial steps of a machine learning project for credit risk prediction. Specifically, we will load the raw Home Credit dataset, perform essential data cleaning and memory optimization, and save the preprocessed data in a more efficient format (Parquet) for subsequent stages of the project.

### 0.1.2 Why is this Important?

Efficient data loading and preprocessing are fundamental to any successful machine learning project. By optimizing memory usage and ensuring data quality, we lay the groundwork for faster processing, improved model training, and more reliable results. This is particularly important in credit risk prediction where accurate and timely decisions are crucial for financial institutions.

### 0.1.3 Dataset Description

The Home Credit dataset contains information about loan applications from Home Credit Group. It includes a variety of features related to applicant demographics, financial history, and loan specifics. This data will be used to build models that predict the likelihood of loan defaults. The dataset is structured across several interconnected tables, offering a rich source of information for analysis.

### 0.1.4 Our Approach

We will leverage the Polars library for data loading and manipulation due to its efficiency in handling large datasets. Our approach includes:

1. **Data Loading:** Load the `application_train.csv` and `application_test.csv` files using Polars.
2. **Initial Exploration:** Examine the first few rows and data types to gain an initial understanding of the data.
3. **Lowercase Column Names:** Standardize column names by converting them to lowercase.
4. **Memory Optimization:** Reduce the dataset's memory footprint by converting data types to lower-precision alternatives without significant loss of information.
5. **Categorical Conversion:** Cast numerically coded categorical features (such as `region_rating_client` and `target`) to a categorical data type.
6. **Save Preprocessed Data:** Save the optimized DataFrames to Parquet files for efficient storage and retrieval in subsequent notebooks.

### 0.1.5 Additional Notes

This notebook assumes that the raw data files (`application_train.csv` and `application_test.csv`) are located in the `../data/raw/` directory. The preprocessed data will be saved in the `../data/processed/` directory.
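
Since every later cell depends on this directory layout, a quick check up front can save a confusing error further down. The following is a minimal sketch using only the standard library; the helper names `missing_raw_files` and `ensure_processed_dir` are hypothetical and not part of the project.

```python
from pathlib import Path

# Paths taken from the note above; the helper names are illustrative only.
RAW_DIR = Path("../data/raw")
PROCESSED_DIR = Path("../data/processed")
RAW_FILES = ("application_train.csv", "application_test.csv")


def missing_raw_files(raw_dir: Path = RAW_DIR) -> list[str]:
    """Return the names of expected raw files that are not present."""
    return [name for name in RAW_FILES if not (raw_dir / name).is_file()]


def ensure_processed_dir(processed_dir: Path = PROCESSED_DIR) -> Path:
    """Create the output directory (and any parents) if it does not exist."""
    processed_dir.mkdir(parents=True, exist_ok=True)
    return processed_dir


missing = missing_raw_files()
if missing:
    print(f"Missing raw files in {RAW_DIR}: {missing}")
```

Calling `ensure_processed_dir()` before the final save step avoids a failure if `../data/processed/` has not been created yet.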

In [2]:
import polars as pl
from retail_bank_risk.data_preprocessing_utils import reduce_memory_usage_pl

We then load the CSV files using Polars.


In [3]:
application_train = pl.read_csv("../data/raw/application_train.csv")
application_test = pl.read_csv("../data/raw/application_test.csv")

We've loaded the main data tables, `application_train` and `application_test`, which contain the core details of each loan application.

This information might be sufficient for building our initial models.

If we need more information to improve accuracy, we'll then explore the other tables and incorporate relevant data from them.

Next, we check the first rows of the train and test datasets.


In [4]:
application_train.head()

SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,…,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
i64,i64,str,str,str,str,i64,f64,f64,f64,f64,str,str,str,str,str,f64,i64,i64,f64,i64,f64,i64,i64,i64,i64,i64,i64,str,f64,i64,i64,str,i64,i64,i64,i64,…,f64,str,str,f64,str,str,f64,f64,f64,f64,f64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,f64,f64,f64,f64,f64,f64
100002,1,"""Cash loans""","""M""","""N""","""Y""",0,202500.0,406597.5,24700.5,351000.0,"""Unaccompanied""","""Working""","""Secondary / secondary special""","""Single / not married""","""House / apartment""",0.018801,-9461,-637,-3648.0,-2120,,1,1,0,1,1,0,"""Laborers""",1.0,2,2,"""WEDNESDAY""",10,0,0,0,…,0.0,"""reg oper account""","""block of flats""",0.0149,"""Stone, brick""","""No""",2.0,2.0,2.0,2.0,-1134.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
100003,0,"""Cash loans""","""F""","""N""","""N""",0,270000.0,1293502.5,35698.5,1129500.0,"""Family""","""State servant""","""Higher education""","""Married""","""House / apartment""",0.003541,-16765,-1188,-1186.0,-291,,1,1,0,1,1,0,"""Core staff""",2.0,1,1,"""MONDAY""",11,0,0,0,…,0.01,"""reg oper account""","""block of flats""",0.0714,"""Block""","""No""",1.0,0.0,1.0,0.0,-828.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
100004,0,"""Revolving loans""","""M""","""Y""","""Y""",0,67500.0,135000.0,6750.0,135000.0,"""Unaccompanied""","""Working""","""Secondary / secondary special""","""Single / not married""","""House / apartment""",0.010032,-19046,-225,-4260.0,-2531,26.0,1,1,1,1,1,0,"""Laborers""",1.0,2,2,"""MONDAY""",9,0,0,0,…,,,,,,,0.0,0.0,0.0,0.0,-815.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
100006,0,"""Cash loans""","""F""","""N""","""Y""",0,135000.0,312682.5,29686.5,297000.0,"""Unaccompanied""","""Working""","""Secondary / secondary special""","""Civil marriage""","""House / apartment""",0.008019,-19005,-3039,-9833.0,-2437,,1,1,0,1,0,0,"""Laborers""",2.0,2,2,"""WEDNESDAY""",17,0,0,0,…,,,,,,,2.0,0.0,2.0,0.0,-617.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,
100007,0,"""Cash loans""","""M""","""N""","""Y""",0,121500.0,513000.0,21865.5,513000.0,"""Unaccompanied""","""Working""","""Secondary / secondary special""","""Single / not married""","""House / apartment""",0.028663,-19932,-3038,-4311.0,-3458,,1,1,0,1,0,0,"""Core staff""",1.0,2,2,"""THURSDAY""",11,0,0,0,…,,,,,,,0.0,0.0,0.0,0.0,-1106.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [5]:
application_test.head()

SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,…,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
i64,str,str,str,str,i64,f64,f64,f64,f64,str,str,str,str,str,f64,i64,i64,f64,i64,f64,i64,i64,i64,i64,i64,i64,str,f64,i64,i64,str,i64,i64,i64,i64,i64,…,f64,str,str,f64,str,str,f64,f64,f64,f64,f64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,f64,f64,f64,f64,f64,f64
100001,"""Cash loans""","""F""","""N""","""Y""",0,135000.0,568800.0,20560.5,450000.0,"""Unaccompanied""","""Working""","""Higher education""","""Married""","""House / apartment""",0.01885,-19241,-2329,-5170.0,-812,,1,1,0,1,0,1,,2.0,2,2,"""TUESDAY""",18,0,0,0,0,…,,,"""block of flats""",0.0392,"""Stone, brick""","""No""",0.0,0.0,0.0,0.0,-1740.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
100005,"""Cash loans""","""M""","""N""","""Y""",0,99000.0,222768.0,17370.0,180000.0,"""Unaccompanied""","""Working""","""Secondary / secondary special""","""Married""","""House / apartment""",0.035792,-18064,-4469,-9118.0,-1623,,1,1,0,1,0,0,"""Low-skill Laborers""",2.0,2,2,"""FRIDAY""",9,0,0,0,0,…,,,,,,,0.0,0.0,0.0,0.0,0.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
100013,"""Cash loans""","""M""","""Y""","""Y""",0,202500.0,663264.0,69777.0,630000.0,,"""Working""","""Higher education""","""Married""","""House / apartment""",0.019101,-20038,-4458,-2175.0,-3503,5.0,1,1,0,1,0,0,"""Drivers""",2.0,2,2,"""MONDAY""",14,0,0,0,0,…,,,,,,,0.0,0.0,0.0,0.0,-856.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
100028,"""Cash loans""","""F""","""N""","""Y""",2,315000.0,1575000.0,49018.5,1575000.0,"""Unaccompanied""","""Working""","""Secondary / secondary special""","""Married""","""House / apartment""",0.026392,-13976,-1866,-2000.0,-4208,,1,1,0,1,1,0,"""Sales staff""",4.0,2,2,"""WEDNESDAY""",11,0,0,0,0,…,0.0817,"""reg oper account""","""block of flats""",0.37,"""Panel""","""No""",0.0,0.0,0.0,0.0,-1805.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
100038,"""Cash loans""","""M""","""Y""","""N""",1,180000.0,625500.0,32067.0,625500.0,"""Unaccompanied""","""Working""","""Secondary / secondary special""","""Married""","""House / apartment""",0.010032,-13040,-2191,-4000.0,-4262,16.0,1,1,1,1,0,0,,3.0,2,2,"""FRIDAY""",5,0,0,0,0,…,,,,,,,0.0,0.0,0.0,0.0,-821.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,


Both tables are wide, with many features per application. As expected, the test dataset does not include the target variable.

First, let's lowercase the column names, and then check the exact number of rows and features in each dataset.


In [6]:
application_train = application_train.select(
    [pl.col(name).alias(name.lower()) for name in application_train.columns]
)
application_test = application_test.select(
    [pl.col(name).alias(name.lower()) for name in application_test.columns]
)

In [7]:
print("Number of Rows and Columns in Application Train:", application_train.shape)
print("Number of Rows and Columns in Application Test:", application_test.shape)

Number of Rows and Columns in Application Train: (307511, 122)
Number of Rows and Columns in Application Test: (48744, 121)
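
The one-column difference (122 versus 121) is consistent with the test set lacking only the target. A small sketch to confirm that directly from the column lists follows; the helper name `column_differences` is hypothetical and the lists here are shortened stand-ins for `application_train.columns` and `application_test.columns`.

```python
def column_differences(train_columns, test_columns):
    """Return (only_in_train, only_in_test), each sorted for stable output."""
    train_set, test_set = set(train_columns), set(test_columns)
    return sorted(train_set - test_set), sorted(test_set - train_set)


# Shortened stand-ins for the real column lists.
train_cols = ["sk_id_curr", "target", "amt_credit"]
test_cols = ["sk_id_curr", "amt_credit"]

only_train, only_test = column_differences(train_cols, test_cols)
print(only_train, only_test)  # ['target'] []
```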


Next, before proceeding, we will reduce the dataset's memory footprint by converting the data types to lower-precision alternatives. Smaller columns take less memory, which typically speeds up later processing.

In [8]:
application_train = reduce_memory_usage_pl(application_train)
application_test = reduce_memory_usage_pl(application_test)

Size before memory reduction: 289.61 MB
Initial data types: Counter({Float64: 65, Int64: 41, String: 16})
Size after memory reduction: 113.52 MB
Final data types: Counter({Float32: 65, Int8: 35, Categorical(ordering='physical'): 16, Int32: 4, Int16: 2})
Size before memory reduction: 45.51 MB
Initial data types: Counter({Float64: 65, Int64: 40, String: 16})
Size after memory reduction: 17.94 MB
Final data types: Counter({Float32: 65, Int8: 34, Categorical(ordering='physical'): 16, Int32: 4, Int16: 2})


We converted the data types of the `application_train` and `application_test` DataFrames to lower-precision alternatives.

This reduced the memory footprint of the training data from 289.61 MB to 113.52 MB and that of the test data from 45.51 MB to 17.94 MB, a reduction of roughly 61% in both cases.
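
The saving comes at the cost of floating-point precision: a 32-bit float carries a 24-bit significand, so it represents fewer digits than a 64-bit float. The sketch below uses only the standard library to round-trip a few values through 32 bits; `to_float32` is a hypothetical helper written for this illustration.

```python
import struct


def to_float32(value: float) -> float:
    """Round-trip a Python float (64-bit) through a 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


# A credit amount from the sample rows survives unchanged.
print(to_float32(1293502.5))   # 1293502.5

# Above 2**24, consecutive whole numbers are no longer all representable.
print(to_float32(16777217.0))  # 16777216.0

# Small fractions are stored approximately, with a tiny absolute error.
print(abs(to_float32(0.018801) - 0.018801))
```

For amounts and ratios in the ranges seen in the sample rows, the error is negligible, which is in line with the aim of avoiding significant loss of information.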

Let's confirm the new data types by looking at the first rows again.

In [9]:
application_train.head()

sk_id_curr,target,name_contract_type,code_gender,flag_own_car,flag_own_realty,cnt_children,amt_income_total,amt_credit,amt_annuity,amt_goods_price,name_type_suite,name_income_type,name_education_type,name_family_status,name_housing_type,region_population_relative,days_birth,days_employed,days_registration,days_id_publish,own_car_age,flag_mobil,flag_emp_phone,flag_work_phone,flag_cont_mobile,flag_phone,flag_email,occupation_type,cnt_fam_members,region_rating_client,region_rating_client_w_city,weekday_appr_process_start,hour_appr_process_start,reg_region_not_live_region,reg_region_not_work_region,live_region_not_work_region,…,nonlivingarea_medi,fondkapremont_mode,housetype_mode,totalarea_mode,wallsmaterial_mode,emergencystate_mode,obs_30_cnt_social_circle,def_30_cnt_social_circle,obs_60_cnt_social_circle,def_60_cnt_social_circle,days_last_phone_change,flag_document_2,flag_document_3,flag_document_4,flag_document_5,flag_document_6,flag_document_7,flag_document_8,flag_document_9,flag_document_10,flag_document_11,flag_document_12,flag_document_13,flag_document_14,flag_document_15,flag_document_16,flag_document_17,flag_document_18,flag_document_19,flag_document_20,flag_document_21,amt_req_credit_bureau_hour,amt_req_credit_bureau_day,amt_req_credit_bureau_week,amt_req_credit_bureau_mon,amt_req_credit_bureau_qrt,amt_req_credit_bureau_year
i32,i8,cat,cat,cat,cat,i16,f32,f32,f32,f32,cat,cat,cat,cat,cat,f32,i32,i32,f32,i32,f32,i8,i8,i8,i8,i8,i8,cat,f32,i8,i8,cat,i16,i8,i8,i8,…,f32,cat,cat,f32,cat,cat,f32,f32,f32,f32,f32,i8,i8,i8,i8,i8,i8,i8,i8,i8,i8,i8,i8,i8,i8,i8,i8,i8,i8,i8,i8,f32,f32,f32,f32,f32,f32
100002,1,"""Cash loans""","""M""","""N""","""Y""",0,202500.0,406597.5,24700.5,351000.0,"""Unaccompanied""","""Working""","""Secondary / secondary special""","""Single / not married""","""House / apartment""",0.018801,-9461,-637,-3648.0,-2120,,1,1,0,1,1,0,"""Laborers""",1.0,2,2,"""WEDNESDAY""",10,0,0,0,…,0.0,"""reg oper account""","""block of flats""",0.0149,"""Stone, brick""","""No""",2.0,2.0,2.0,2.0,-1134.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
100003,0,"""Cash loans""","""F""","""N""","""N""",0,270000.0,1293502.5,35698.5,1129500.0,"""Family""","""State servant""","""Higher education""","""Married""","""House / apartment""",0.003541,-16765,-1188,-1186.0,-291,,1,1,0,1,1,0,"""Core staff""",2.0,1,1,"""MONDAY""",11,0,0,0,…,0.01,"""reg oper account""","""block of flats""",0.0714,"""Block""","""No""",1.0,0.0,1.0,0.0,-828.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
100004,0,"""Revolving loans""","""M""","""Y""","""Y""",0,67500.0,135000.0,6750.0,135000.0,"""Unaccompanied""","""Working""","""Secondary / secondary special""","""Single / not married""","""House / apartment""",0.010032,-19046,-225,-4260.0,-2531,26.0,1,1,1,1,1,0,"""Laborers""",1.0,2,2,"""MONDAY""",9,0,0,0,…,,,,,,,0.0,0.0,0.0,0.0,-815.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
100006,0,"""Cash loans""","""F""","""N""","""Y""",0,135000.0,312682.5,29686.5,297000.0,"""Unaccompanied""","""Working""","""Secondary / secondary special""","""Civil marriage""","""House / apartment""",0.008019,-19005,-3039,-9833.0,-2437,,1,1,0,1,0,0,"""Laborers""",2.0,2,2,"""WEDNESDAY""",17,0,0,0,…,,,,,,,2.0,0.0,2.0,0.0,-617.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,
100007,0,"""Cash loans""","""M""","""N""","""Y""",0,121500.0,513000.0,21865.5,513000.0,"""Unaccompanied""","""Working""","""Secondary / secondary special""","""Single / not married""","""House / apartment""",0.028663,-19932,-3038,-4311.0,-3458,,1,1,0,1,0,0,"""Core staff""",1.0,2,2,"""THURSDAY""",11,0,0,0,…,,,,,,,0.0,0.0,0.0,0.0,-1106.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


This initial exploration shows that several features, although stored as numbers, are inherently categorical.

These include `region_rating_client`, `region_rating_client_w_city`, `reg_city_not_work_city`, and `target`.

To ensure accurate analysis and modeling, we need to explicitly convert these features to the appropriate categorical data type.

In [14]:
categorical_cols_train = [
    "region_rating_client",
    "region_rating_client_w_city",
    "reg_city_not_work_city",
    "target",
]

categorical_cols_test = [
    "region_rating_client",
    "region_rating_client_w_city",
    "reg_city_not_work_city",
]

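# Cast each numerically coded column to Categorical via its string form.
# round(0) and the Int32 cast normalize the values to whole numbers first,
# so the category labels read "1", "2", ... rather than "1.0", "2.0", ...
application_train = application_train.with_columns(
    [
        pl.col(col).round(0).cast(pl.Int32).cast(pl.Utf8).cast(pl.Categorical)
        for col in categorical_cols_train
    ]
)

# Same conversion for the test set, which has no target column.
application_test = application_test.with_columns(
    [
        pl.col(col).round(0).cast(pl.Int32).cast(pl.Utf8).cast(pl.Categorical)
        for col in categorical_cols_test
    ]
)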

Next, we will save the preprocessed DataFrames (`application_train` and `application_test`) to Parquet files (`application_train_preprocessed.parquet` and `application_test_preprocessed.parquet`) for efficient storage and retrieval.

These files will be used in the next notebook (`02_feature_engineering.ipynb`) to perform feature engineering and further data preparation. 

In [10]:
application_train.write_parquet(
    "../data/processed/application_train_preprocessed.parquet"
)
application_test.write_parquet(
    "../data/processed/application_test_preprocessed.parquet"
)