# Credit Risk Data Preprocessing
This notebook demonstrates the full data preprocessing pipeline, including error handling and parameterization.

## Introduction
* This notebook walks through the data preprocessing steps for the credit risk dataset.
* It covers loading the raw data, cleaning and normalizing text columns, and preparing the data for further analysis or modeling.
* The processing code lives in the `data_processing` module, and the loading step catches `FileNotFoundError` so that a missing file is reported with a message instead of a traceback.


Add the project's `src` directory to the system path so that the `data_processing` module can be imported.

In [1]:
# Import data processing module and set up path for reproducibility and modularity
import sys
sys.path.append('../src')  # Ensure src is in the path
import data_processing 
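
A relative path such as `'../src'` depends on the notebook's working directory, and re-running the cell appends it again. As a minimal sketch (the helper name `add_to_sys_path` is hypothetical and not part of this project), the path can be resolved to an absolute one and appended only if it is not already present:

```python
import sys
from pathlib import Path


def add_to_sys_path(relative_dir: str) -> str:
    """Resolve relative_dir against the current working directory and
    append it to sys.path only if it is not already there.

    Hypothetical helper for illustration; not part of data_processing.
    """
    resolved = str(Path(relative_dir).resolve())
    if resolved not in sys.path:
        sys.path.append(resolved)
    return resolved


# Safe to call more than once: the entry is added a single time.
src_dir = add_to_sys_path("../src")
```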

## Load the raw data
* The next cell attempts to load the raw credit risk data from the specified directory using the `load_raw_data` function from the `data_processing` module. 
* It includes error handling to print a message if the file is not found, and sets `df` to `None` in that case.


In [2]:
try:
    df = data_processing.load_raw_data(raw_data_dir='../data/raw/')
    print('Data loaded successfully!')
except FileNotFoundError as e:
    print(e)
    df = None

Data loaded successfully!
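
The implementation of `load_raw_data` is not shown in this notebook. The sketch below only illustrates one way the file lookup behind such a loader could work, assuming it picks the first CSV in the directory when no filename is given and raises `FileNotFoundError` otherwise. The name `find_csv` and its parameters are hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_csv(raw_data_dir: str, filename: Optional[str] = None) -> Path:
    """Return the path of the CSV to load, or raise FileNotFoundError.

    Hypothetical helper; the real load_raw_data may behave differently.
    """
    directory = Path(raw_data_dir)
    if filename is not None:
        path = directory / filename
        if not path.is_file():
            raise FileNotFoundError(f"No such file: {path}")
        return path
    # No filename given: take the first CSV in alphabetical order.
    candidates = sorted(directory.glob("*.csv"))
    if not candidates:
        raise FileNotFoundError(f"No CSV files found in {directory}")
    return candidates[0]
```

The returned path could then be passed to `pandas.read_csv`, with the `try`/`except FileNotFoundError` shown above wrapped around the call.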


### The next cell displays the first few rows of the loaded raw DataFrame using the `head()` method.


In [3]:
df.head()

Unnamed: 0,TransactionId,BatchId,AccountId,SubscriptionId,CustomerId,CurrencyCode,CountryCode,ProviderId,ProductId,ProductCategory,ChannelId,Amount,Value,TransactionStartTime,PricingStrategy,FraudResult
0,TransactionId_76871,BatchId_36123,AccountId_3957,SubscriptionId_887,CustomerId_4406,UGX,256,ProviderId_6,ProductId_10,airtime,ChannelId_3,1000.0,1000,2018-11-15T02:18:49Z,2,0
1,TransactionId_73770,BatchId_15642,AccountId_4841,SubscriptionId_3829,CustomerId_4406,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-20.0,20,2018-11-15T02:19:08Z,2,0
2,TransactionId_26203,BatchId_53941,AccountId_4229,SubscriptionId_222,CustomerId_4683,UGX,256,ProviderId_6,ProductId_1,airtime,ChannelId_3,500.0,500,2018-11-15T02:44:21Z,2,0
3,TransactionId_380,BatchId_102363,AccountId_648,SubscriptionId_2185,CustomerId_988,UGX,256,ProviderId_1,ProductId_21,utility_bill,ChannelId_3,20000.0,21800,2018-11-15T03:32:55Z,2,0
4,TransactionId_28195,BatchId_38780,AccountId_4841,SubscriptionId_3829,CustomerId_988,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-644.0,644,2018-11-15T03:34:21Z,2,0


### The next cell displays the last few rows of the loaded raw DataFrame using the `tail()` method.


In [4]:
df.tail()

Unnamed: 0,TransactionId,BatchId,AccountId,SubscriptionId,CustomerId,CurrencyCode,CountryCode,ProviderId,ProductId,ProductCategory,ChannelId,Amount,Value,TransactionStartTime,PricingStrategy,FraudResult
95657,TransactionId_89881,BatchId_96668,AccountId_4841,SubscriptionId_3829,CustomerId_3078,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-1000.0,1000,2019-02-13T09:54:09Z,2,0
95658,TransactionId_91597,BatchId_3503,AccountId_3439,SubscriptionId_2643,CustomerId_3874,UGX,256,ProviderId_6,ProductId_10,airtime,ChannelId_3,1000.0,1000,2019-02-13T09:54:25Z,2,0
95659,TransactionId_82501,BatchId_118602,AccountId_4841,SubscriptionId_3829,CustomerId_3874,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-20.0,20,2019-02-13T09:54:35Z,2,0
95660,TransactionId_136354,BatchId_70924,AccountId_1346,SubscriptionId_652,CustomerId_1709,UGX,256,ProviderId_6,ProductId_19,tv,ChannelId_3,3000.0,3000,2019-02-13T10:01:10Z,2,0
95661,TransactionId_35670,BatchId_29317,AccountId_4841,SubscriptionId_3829,CustomerId_1709,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-60.0,60,2019-02-13T10:01:28Z,2,0


#### The next cell preprocesses the loaded DataFrame using the `preprocess_dataframe` function from the `data_processing` module  

In [11]:
# Clean and normalize the text columns of the loaded DataFrame
df_cleaned = data_processing.preprocess_dataframe(df)


### Format TransactionStartTime as Datetime
* Use the `date_formatter` utility to convert the `TransactionStartTime` column to pandas datetime. Optionally, specify a format string if known.

In [10]:
if df is not None and 'TransactionStartTime' in df.columns:
    from data_processing import date_formatter
    df['TransactionStartTime'] = date_formatter(df['TransactionStartTime'])
    display(df[['TransactionStartTime']].head())

Unnamed: 0,TransactionStartTime
0,2018-11-15 02:18:49
1,2018-11-15 02:19:08
2,2018-11-15 02:44:21
3,2018-11-15 03:32:55
4,2018-11-15 03:34:21
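
The body of `date_formatter` is not shown here. As a rough sketch of the same idea using plain pandas, the function below parses ISO 8601 strings like those in the raw data and returns timezone-naive timestamps. The name `to_naive_datetime` and its `fmt` parameter are hypothetical, and dropping the timezone is an assumption based on the output above.

```python
from typing import Optional

import pandas as pd


def to_naive_datetime(series: pd.Series, fmt: Optional[str] = None) -> pd.Series:
    """Parse strings as UTC datetimes, then drop the timezone.

    Values that cannot be parsed become NaT instead of raising.
    Hypothetical stand-in; not the project's date_formatter.
    """
    parsed = pd.to_datetime(series, format=fmt, errors="coerce", utc=True)
    return parsed.dt.tz_localize(None)


raw = pd.Series(["2018-11-15T02:18:49Z", "2018-11-15T02:19:08Z"])
# Passing an explicit format avoids per-value format inference.
formatted = to_naive_datetime(raw, fmt="%Y-%m-%dT%H:%M:%SZ")
```

Using `errors="coerce"` is a design choice: malformed timestamps surface as `NaT` and can be counted afterwards with `formatted.isna().sum()`.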


#### The next cell displays the first few rows of the cleaned DataFrame using the `head()` method.

In [13]:
df_cleaned.head()

Unnamed: 0,TransactionId,BatchId,AccountId,SubscriptionId,CustomerId,CurrencyCode,CountryCode,ProviderId,ProductId,ProductCategory,ChannelId,Amount,Value,TransactionStartTime,PricingStrategy,FraudResult
0,transactionid_76871,batchid_36123,accountid_3957,subscriptionid_887,customerid_4406,ugx,256,providerid_6,productid_10,airtime,channelid_3,1000.0,1000,2018-11-15 02:18:49,2,0
1,transactionid_73770,batchid_15642,accountid_4841,subscriptionid_3829,customerid_4406,ugx,256,providerid_4,productid_6,financial_services,channelid_2,-20.0,20,2018-11-15 02:19:08,2,0
2,transactionid_26203,batchid_53941,accountid_4229,subscriptionid_222,customerid_4683,ugx,256,providerid_6,productid_1,airtime,channelid_3,500.0,500,2018-11-15 02:44:21,2,0
3,transactionid_380,batchid_102363,accountid_648,subscriptionid_2185,customerid_988,ugx,256,providerid_1,productid_21,utility_bill,channelid_3,20000.0,21800,2018-11-15 03:32:55,2,0
4,transactionid_28195,batchid_38780,accountid_4841,subscriptionid_3829,customerid_988,ugx,256,providerid_4,productid_6,financial_services,channelid_2,-644.0,644,2018-11-15 03:34:21,2,0


#### The next cell displays the last few rows of the cleaned DataFrame using the `tail()` method.


In [14]:
df_cleaned.tail()

Unnamed: 0,TransactionId,BatchId,AccountId,SubscriptionId,CustomerId,CurrencyCode,CountryCode,ProviderId,ProductId,ProductCategory,ChannelId,Amount,Value,TransactionStartTime,PricingStrategy,FraudResult
95657,transactionid_89881,batchid_96668,accountid_4841,subscriptionid_3829,customerid_3078,ugx,256,providerid_4,productid_6,financial_services,channelid_2,-1000.0,1000,2019-02-13 09:54:09,2,0
95658,transactionid_91597,batchid_3503,accountid_3439,subscriptionid_2643,customerid_3874,ugx,256,providerid_6,productid_10,airtime,channelid_3,1000.0,1000,2019-02-13 09:54:25,2,0
95659,transactionid_82501,batchid_118602,accountid_4841,subscriptionid_3829,customerid_3874,ugx,256,providerid_4,productid_6,financial_services,channelid_2,-20.0,20,2019-02-13 09:54:35,2,0
95660,transactionid_136354,batchid_70924,accountid_1346,subscriptionid_652,customerid_1709,ugx,256,providerid_6,productid_19,tv,channelid_3,3000.0,3000,2019-02-13 10:01:10,2,0
95661,transactionid_35670,batchid_29317,accountid_4841,subscriptionid_3829,customerid_1709,ugx,256,providerid_4,productid_6,financial_services,channelid_2,-60.0,60,2019-02-13 10:01:28,2,0


## Preprocess DataFrame
Clean and normalize all text columns. Optionally, specify which columns to process.

In [15]:
if df is not None:
    df = data_processing.preprocess_dataframe(df)
    display(df.head())

Unnamed: 0,TransactionId,BatchId,AccountId,SubscriptionId,CustomerId,CurrencyCode,CountryCode,ProviderId,ProductId,ProductCategory,ChannelId,Amount,Value,TransactionStartTime,PricingStrategy,FraudResult
0,transactionid_76871,batchid_36123,accountid_3957,subscriptionid_887,customerid_4406,ugx,256,providerid_6,productid_10,airtime,channelid_3,1000.0,1000,2018-11-15 02:18:49,2,0
1,transactionid_73770,batchid_15642,accountid_4841,subscriptionid_3829,customerid_4406,ugx,256,providerid_4,productid_6,financial_services,channelid_2,-20.0,20,2018-11-15 02:19:08,2,0
2,transactionid_26203,batchid_53941,accountid_4229,subscriptionid_222,customerid_4683,ugx,256,providerid_6,productid_1,airtime,channelid_3,500.0,500,2018-11-15 02:44:21,2,0
3,transactionid_380,batchid_102363,accountid_648,subscriptionid_2185,customerid_988,ugx,256,providerid_1,productid_21,utility_bill,channelid_3,20000.0,21800,2018-11-15 03:32:55,2,0
4,transactionid_28195,batchid_38780,accountid_4841,subscriptionid_3829,customerid_988,ugx,256,providerid_4,productid_6,financial_services,channelid_2,-644.0,644,2018-11-15 03:34:21,2,0
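
For intuition, the per-value cleaning can be sketched with the standard library alone. This is an illustrative guess at the kind of normalization involved (lowercasing, stripping punctuation, collapsing whitespace), not the actual body of `preprocess_dataframe`. The name `clean_text` is hypothetical.

```python
import re


def clean_text(value):
    """Lowercase a string, remove punctuation, and collapse whitespace.

    Non-string values (numbers, None) are returned unchanged.
    Hypothetical helper; underscores survive because \\w matches them.
    """
    if not isinstance(value, str):
        return value
    text = re.sub(r"[^\w\s]", "", value)      # drop punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text.lower()


print(clean_text("  TransactionId_76871 "))  # transactionid_76871
```

Applied to a DataFrame, a function like this would typically be mapped over each object (string) column, for example with `df[col].map(clean_text)`.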


## Tokenize a Text Column
If you have a text column (e.g., 'message'), you can add a tokenized version.

In [16]:
if df is not None and 'message' in df.columns:
    df = data_processing.add_tokenized_column(df, 'message')
    display(df[['message', 'tokens']].head())
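
The transaction data shown above has no `message` column, so this cell produces no output. To illustrate what a tokenized value might look like, here is a minimal word-level tokenizer. It is an assumption about the general approach, and the name `tokenize` is hypothetical rather than taken from `data_processing`.

```python
import re
from typing import List


def tokenize(text) -> List[str]:
    """Split text into word tokens; non-string input yields an empty list.

    Hypothetical helper for illustration only.
    """
    if not isinstance(text, str):
        return []
    # \w+ matches runs of letters, digits, and underscores.
    return re.findall(r"\w+", text)


print(tokenize("send 500 ugx now"))  # ['send', '500', 'ugx', 'now']
```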

## Save Processed Data
Save the cleaned DataFrame to the processed directory.

In [17]:
if df is not None:
    data_processing.save_processed_data(df, filename='processed.csv', processed_dir='../data/processed/')
    print('Processed data saved!')

Processed data saved!
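
A save step of this kind usually has to make sure the target directory exists before writing. The sketch below shows that pattern with pandas and `pathlib`. The function name `save_csv` is hypothetical, and the real `save_processed_data` may differ.

```python
from pathlib import Path

import pandas as pd


def save_csv(df: pd.DataFrame, filename: str, out_dir: str) -> Path:
    """Write df to out_dir/filename, creating out_dir if it is missing.

    Hypothetical helper; not the project's save_processed_data.
    """
    out_path = Path(out_dir) / filename
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # index=False keeps the row index from being written as an extra column.
    df.to_csv(out_path, index=False)
    return out_path
```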


## Run Unit Tests
* The following cell runs the unit tests for the data processing module to ensure all functions work as expected.


In [10]:
!pytest ../tests/test_data_processing.py

platform win32 -- Python 3.13.2, pytest-8.0.2, pluggy-1.6.0
rootdir: d:\10Acadamy\Credit-Risk-Probability-Model
plugins: anyio-4.9.0, hydra-core-1.3.2, cov-6.1.1
collected 6 items

..\tests\test_data_processing.py ......                                  [100%]
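
The contents of `test_data_processing.py` are not reproduced in this notebook. As a sketch of the style such tests might take, the example below defines a small stand-in function and two pytest-style tests for it. Both `normalize_label` and the test names are hypothetical.

```python
def normalize_label(value: str) -> str:
    """Hypothetical function under test: trim, collapse spaces, lowercase."""
    return " ".join(value.split()).lower()


def test_lowercases_and_collapses_spaces():
    assert normalize_label("  UGX   Airtime ") == "ugx airtime"


def test_is_idempotent():
    # Cleaning an already-clean value should not change it again.
    once = normalize_label("Financial_Services")
    assert normalize_label(once) == once
```

pytest collects any function whose name starts with `test_`, so plain `assert` statements are enough and no test class is required.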




## Summary of Steps

1. **Load the Raw Data:**
   - Use the `data_processing.load_raw_data()` function to import the raw CSV data from the specified directory.
   - The function can automatically detect the first CSV file if no filename is provided.
   - The loaded data is stored in a pandas DataFrame (`df`).

2. **Display the First Few Rows:**
   - Use `display(df.head())` to visually inspect the first five rows of the raw data.
   - This helps verify that the data has loaded correctly and gives an overview of the columns and sample values.

3. **Clean and Normalize Text Columns:**
   - Apply the `data_processing.preprocess_dataframe()` function to clean and normalize all text columns in the DataFrame.
   - This function removes special characters and extra spaces, and converts text to lowercase.
   - Amharic-specific normalization is included if needed.
   - Optionally, you can specify which columns to clean; otherwise, all object (string) columns are processed.

4. **Display the Cleaned DataFrame:**
   - Use `display(df.head())` again to show the cleaned and normalized data.
   - This allows you to compare the data before and after cleaning.

5. **Tokenize a Text Column:**
   - If the DataFrame contains a text column (e.g., 'message'), use `data_processing.add_tokenized_column()` to add a new column with tokenized text.
   - The new column (default name: 'tokens') contains lists of tokens (words) for each row.
   - Display the first few rows of the original and tokenized columns for inspection.

6. **Save the Processed DataFrame:**
   - Use `data_processing.save_processed_data()` to save the cleaned DataFrame to the processed data directory.
   - Specify the filename and directory as needed.
   - This ensures your processed data is stored for future use or modeling.

7. **Run Unit Tests:**
   - Execute the unit tests in `../tests/test_data_processing.py` using `pytest` to verify that all data processing functions work as expected.
   - This step helps catch errors or regressions in the data processing pipeline.
