## Task 3 - Feature Engineering

### Recommendations from Task 2 for Feature Engineering
 - Encode categorical variables (ProductCategory, ChannelId, ProviderId) using one-hot or label encoding.
 - Create time-based features (hour, day, month) from TransactionStartTime to capture temporal patterns.
 - Handle negative Amounts separately (e.g., split into debit/credit features).
 - Apply log-transformation to Amount/Value to address skewness.
 - Use robust scaling for outliers or cap extreme values.
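
Three of the recommendations above (time-based features, separate handling of negative amounts, and a log transform) can be sketched in plain pandas. This is a minimal illustration on a made-up three-row frame, not the logic inside `process_data`. The helper name `add_basic_features` and the output columns `AmountIsNegative`, `AmountAbs`, and `LogValue` are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd


def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: time parts, sign split, and log transform."""
    out = df.copy()

    # Parse timestamps, then pull out calendar components.
    ts = pd.to_datetime(out["TransactionStartTime"], utc=True)
    out["TransactionHour"] = ts.dt.hour
    out["TransactionDay"] = ts.dt.day
    out["TransactionMonth"] = ts.dt.month
    out["TransactionYear"] = ts.dt.year

    # Keep the sign and the magnitude of Amount as separate features.
    out["AmountIsNegative"] = (out["Amount"] < 0).astype(int)
    out["AmountAbs"] = out["Amount"].abs()

    # log1p compresses the long right tail and is defined at zero.
    out["LogValue"] = np.log1p(out["Value"])
    return out


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "TransactionStartTime": [
        "2018-11-15T02:18:49Z",
        "2018-11-15T02:19:08Z",
        "2019-01-02T13:05:00Z",
    ],
    "Amount": [1000.0, -20.0, 500.0],
    "Value": [1000, 20, 500],
})
print(add_basic_features(sample))
```

Splitting sign from magnitude lets a log transform be applied to `AmountAbs` as well, since the logarithm is undefined for negative inputs.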

**Based on the insights from the Task 2 EDA, feature engineering proceeds as follows.**

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%reload_ext autoreload

In [3]:
import logging
import os
import sys
import warnings

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')


In [4]:
# Make the modules under ../src/ importable from this notebook
sys.path.append(os.path.abspath('../src/'))

In [5]:
from data_preprocessing import data_loader
from data_preprocessing_FE import process_data

### Perform Feature Engineering

In [6]:
# Load data
df = data_loader('../data/raw/data.csv')
logging.info("Data loaded successfully")

2025-07-06 14:41:03,113 - INFO - CSV file loaded successfully from ../data/raw/data.csv.
2025-07-06 14:41:03,121 - INFO - Data loaded successfully


In [7]:
df.columns

Index(['TransactionId', 'BatchId', 'AccountId', 'SubscriptionId', 'CustomerId',
       'CurrencyCode', 'CountryCode', 'ProviderId', 'ProductId',
       'ProductCategory', 'ChannelId', 'Amount', 'Value',
       'TransactionStartTime', 'PricingStrategy', 'FraudResult'],
      dtype='object')

In [8]:
# Define columns
numerical_columns = ['Amount', 'Value']
categorical_columns = ['ProductCategory', 'ChannelId', 'ProviderId']
customer_id_col = 'CustomerId'
time_column = 'TransactionStartTime'
        
# Process data
X_processed, y, feature_names = process_data(
    df,
    target_column='FraudResult',
    numerical_columns=numerical_columns,
    categorical_columns=categorical_columns,
    customer_id_col=customer_id_col,
    time_column=time_column
)

2025-07-06 14:41:24,234 - INFO - Starting data processing
2025-07-06 14:41:24,606 - INFO - Creating data processing pipeline
2025-07-06 14:41:24,619 - INFO - Extracting time-based features
2025-07-06 14:41:31,982 - INFO - Columns after time feature extraction: ['TransactionId', 'BatchId', 'AccountId', 'SubscriptionId', 'CustomerId', 'CurrencyCode', 'CountryCode', 'ProviderId', 'ProductId', 'ProductCategory', 'ChannelId', 'Amount', 'Value', 'TransactionStartTime', 'PricingStrategy', 'TransactionHour', 'TransactionDay', 'TransactionMonth', 'TransactionYear']
2025-07-06 14:41:32,311 - INFO - Aggregating features by CustomerId
2025-07-06 14:41:32,911 - INFO - Columns after aggregation: ['TransactionId', 'BatchId', 'AccountId', 'SubscriptionId', 'CustomerId', 'CurrencyCode', 'CountryCode', 'ProviderId', 'ProductId', 'ProductCategory', 'ChannelId', 'Amount', 'Value', 'PricingStrategy', 'TransactionHour', 'TransactionDay', 'TransactionMonth', 'TransactionYear', 'Amount_TotalAmount', 'Amount_A
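
The aggregation step in the log adds per-customer columns such as `Amount_TotalAmount` and `Amount_AvgAmount` while keeping one row per transaction. One way to get that shape is to aggregate by customer and merge the result back, sketched below. The helper `add_customer_aggregates` is hypothetical and may differ from what `process_data` does internally; in particular, filling a missing standard deviation with 0 is an assumption.

```python
import pandas as pd


def add_customer_aggregates(
    df: pd.DataFrame,
    customer_id_col: str = "CustomerId",
    amount_col: str = "Amount",
) -> pd.DataFrame:
    """Hypothetical helper: per-customer stats merged onto each transaction."""
    agg = (
        df.groupby(customer_id_col)[amount_col]
        .agg(["sum", "mean", "count", "std"])
        .rename(columns={
            "sum": f"{amount_col}_TotalAmount",
            "mean": f"{amount_col}_AvgAmount",
            "count": f"{amount_col}_TransactionCount",
            "std": f"{amount_col}_StdAmount",
        })
        .reset_index()
    )
    # std is NaN for customers with a single transaction; 0 is an assumed fill.
    std_col = f"{amount_col}_StdAmount"
    agg[std_col] = agg[std_col].fillna(0.0)

    # Left merge keeps every transaction row and its original order.
    return df.merge(agg, on=customer_id_col, how="left")


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "CustomerId": ["CustomerId_1", "CustomerId_1", "CustomerId_2"],
    "Amount": [100.0, 300.0, 50.0],
})
print(add_customer_aggregates(sample))
```

Because the aggregates are merged back, every transaction of the same customer carries identical aggregate values.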

In [9]:
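# Save processed data (create the output directory if it does not exist yet)
os.makedirs('../data/processed', exist_ok=True)
X_processed.to_csv('../data/processed/processed_data.csv', index=False)
y.to_csv('../data/processed/target.csv', index=False)
logging.info("Processed data and target saved")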

2025-07-06 14:42:37,774 - INFO - Processed data and target saved


In [10]:
print("Feature names:", feature_names)
print("Processed data preview:")
X_processed.head()

Feature names: ['Amount', 'Value', 'TransactionHour', 'TransactionDay', 'TransactionMonth', 'TransactionYear', 'Amount_TotalAmount', 'Amount_AvgAmount', 'Amount_TransactionCount', 'Amount_StdAmount', 'ProductCategory_data_bundles', 'ProductCategory_financial_services', 'ProductCategory_movies', 'ProductCategory_other', 'ProductCategory_ticket', 'ProductCategory_transport', 'ProductCategory_tv', 'ProductCategory_utility_bill', 'ChannelId_ChannelId_2', 'ChannelId_ChannelId_3', 'ChannelId_ChannelId_5', 'ProviderId_ProviderId_2', 'ProviderId_ProviderId_3', 'ProviderId_ProviderId_4', 'ProviderId_ProviderId_5', 'ProviderId_ProviderId_6', 'TransactionId', 'BatchId', 'AccountId', 'SubscriptionId', 'CustomerId', 'CurrencyCode', 'CountryCode', 'ProductId', 'PricingStrategy']
Processed data preview:


| | Amount | Value | TransactionHour | TransactionDay | TransactionMonth | TransactionYear | Amount_TotalAmount | Amount_AvgAmount | Amount_TransactionCount | Amount_StdAmount | ... | ProviderId_ProviderId_6 | TransactionId | BatchId | AccountId | SubscriptionId | CustomerId | CurrencyCode | CountryCode | ProductId | PricingStrategy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.139857 | -0.072291 | -2.15553 | -0.100739 | 0.848684 | -0.994246 | -0.514949 | -0.754644 | -0.311831 | -0.763824 | ... | 1.0 | TransactionId_76871 | BatchId_36123 | AccountId_3957 | SubscriptionId_887 | CustomerId_4406 | UGX | 256 | ProductId_10 | 2 |
| 1 | -0.457536 | -0.080251 | -2.15553 | -0.100739 | 0.848684 | -0.994246 | -0.514949 | -0.754644 | -0.311831 | -0.763824 | ... | 0.0 | TransactionId_73770 | BatchId_15642 | AccountId_4841 | SubscriptionId_3829 | CustomerId_4406 | UGX | 256 | ProductId_6 | 2 |
| 2 | -0.295582 | -0.076352 | -2.15553 | -0.100739 | 0.848684 | -0.994246 | -0.688512 | -0.92267 | -0.444993 | -1.270194 | ... | 1.0 | TransactionId_26203 | BatchId_53941 | AccountId_4229 | SubscriptionId_222 | CustomerId_4683 | UGX | 256 | ProductId_1 | 2 |
| 3 | 1.7522 | 0.096648 | -1.949214 | -0.100739 | 0.848684 | -0.994246 | -0.325636 | 1.26598 | -0.40402 | 1.587514 | ... | 0.0 | TransactionId_380 | BatchId_102363 | AccountId_648 | SubscriptionId_2185 | CustomerId_988 | UGX | 256 | ProductId_21 | 2 |
| 4 | -0.65188 | -0.075183 | -1.949214 | -0.100739 | 0.848684 | -0.994246 | -0.325636 | 1.26598 | -0.40402 | 1.587514 | ... | 0.0 | TransactionId_28195 | BatchId_38780 | AccountId_4841 | SubscriptionId_3829 | CustomerId_988 | UGX | 256 | ProductId_6 | 2 |
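
Feature names such as `ChannelId_ChannelId_2` arise when a one-hot encoder prefixes each category value (`ChannelId_2`) with its column name (`ChannelId`). The sketch below reproduces that naming with `pd.get_dummies` and standardizes a numeric column by hand. It assumes the first category level is dropped and that the numeric columns were standard-scaled, which the preview suggests but the notebook does not confirm; the sample rows are made up.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Amount": [1000.0, -20.0, 500.0, 20000.0],
    "ChannelId": ["ChannelId_3", "ChannelId_2", "ChannelId_3", "ChannelId_1"],
})

# One-hot encode; the column name becomes the prefix of each new column.
# drop_first=True removes the first level (assumed here) to avoid redundancy.
encoded = pd.get_dummies(
    sample, columns=["ChannelId"], drop_first=True, dtype=float
)

# Standardize Amount to zero mean and unit variance (population std, ddof=0).
amount = encoded["Amount"]
encoded["Amount"] = (amount - amount.mean()) / amount.std(ddof=0)

print(encoded.columns.tolist())
print(encoded)
```

Standard scaling is sensitive to extreme values; a median and interquartile-range scaler, as recommended in Task 2, would be the more outlier-robust alternative.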
