# Feature Engineering

1. Create Aggregate Features

   Examples:
   - Total Transaction Amount: Sum of all transaction amounts for each customer.
   - Average Transaction Amount: Mean transaction amount per customer.
   - Transaction Count: Number of transactions per customer.
   - Standard Deviation of Transaction Amounts: Variability of transaction amounts per customer.
2. Extract Features

   Examples:
   - Transaction Hour: The hour of the day when the transaction occurred.
   - Transaction Day: The day of the month when the transaction occurred.
   - Transaction Month: The month when the transaction occurred.
   - Transaction Year: The year when the transaction occurred.
3. Encode Categorical Variables

   Convert categorical variables into numerical format using:
   - One-Hot Encoding: Converts each category into its own binary (0/1) indicator column.
   - Label Encoding: Assigns a unique integer to each category.
4. Handle Missing Values

   Use imputation or removal to handle missing values:
   - Imputation: Fill missing values with the mean, median, or mode, or with more advanced methods such as KNN imputation.
   - Removal: Drop rows or columns that contain missing values, if they are few.
5. Normalize/Standardize Numerical Features

   Normalization and standardization are scaling techniques that bring all numerical features onto a similar scale.
   - Normalization: Rescales the data to the range [0, 1].
   - Standardization: Rescales the data to have a mean of 0 and a standard deviation of 1.
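
Missing-value handling is listed above but not exercised later in this notebook (the selected feature columns report no nulls in `df_feat.info()`). As a minimal sketch of the two options, assuming scikit-learn's `SimpleImputer` and a small toy frame with hypothetical values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values) with one missing Amount.
toy = pd.DataFrame({
    "Amount": [100.0, np.nan, 300.0, 500.0],
    "Value": [100.0, 200.0, 300.0, 500.0],
})

# Imputation: replace each NaN with the median of its column.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# Removal: drop every row that contains a missing value.
dropped = toy.dropna()

print(imputed)
print(dropped)
```

Swapping `strategy="median"` for `"mean"` or `"most_frequent"` gives the other simple imputation variants mentioned in the list.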


In [53]:
import pandas as pd
import pytz
import sys
import os
sys.path.append(os.path.abspath('../scripts'))

import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

from eda import *
import prediction as dp
from feature import *

## Load data

In [54]:
df , _ = load_data()

2024-10-08 22:42:57,929 - INFO - Loading Data ...
2024-10-08 22:42:58,479 - INFO - Loading Data Finshed


In [55]:
df.columns

Index(['TransactionId', 'BatchId', 'AccountId', 'SubscriptionId', 'CustomerId',
       'CurrencyCode', 'CountryCode', 'ProviderId', 'ProductId',
       'ProductCategory', 'ChannelId', 'Amount', 'Value',
       'TransactionStartTime', 'PricingStrategy', 'FraudResult'],
      dtype='object')

## Aggregate features

In [56]:
df_agg = aggeregatef(df)

In [57]:
df_agg

CustomerId,Transaction_count,Total_Transaction,Average_Transaction,Transaction_std
CustomerId_1,1,-10000.0,-10000.000000,
CustomerId_10,1,-10000.0,-10000.000000,
CustomerId_1001,5,20000.0,4000.000000,6558.963333
CustomerId_1002,11,4225.0,384.090909,560.498966
CustomerId_1003,6,20000.0,3333.333333,6030.478146
...,...,...,...,...
CustomerId_992,6,20000.0,3333.333333,6088.240030
CustomerId_993,5,20000.0,4000.000000,6745.368782
CustomerId_994,101,543873.0,5384.881188,14800.656784
CustomerId_996,17,139000.0,8176.470588,4433.329648
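
The `aggeregatef` helper is imported from `../scripts/feature.py` and its body is not shown in this notebook. A rough sketch of an equivalent aggregation, assuming it groups by `CustomerId` and summarizes the `Amount` column (the toy values below are hypothetical), could look like this:

```python
import pandas as pd

# Toy transactions (hypothetical values); column names match the dataset.
toy = pd.DataFrame({
    "CustomerId": ["CustomerId_1", "CustomerId_2", "CustomerId_2", "CustomerId_2"],
    "Amount": [-10000.0, 1000.0, 2000.0, 3000.0],
})

# One row per customer: count, sum, mean and sample standard deviation.
toy_agg = toy.groupby("CustomerId")["Amount"].agg(
    Transaction_count="count",
    Total_Transaction="sum",
    Average_Transaction="mean",
    Transaction_std="std",
)
print(toy_agg)
```

pandas computes `std` as the sample standard deviation (`ddof=1`), which is undefined for a single observation. That is why customers with only one transaction have an empty `Transaction_std` in the output above.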


In [132]:
df_agg.sort_values(by='Transaction_count')

CustomerId,Transaction_count,Total_Transaction,Average_Transaction,Transaction_std
CustomerId_1,1,-10000.0,-10000.000000,
CustomerId_2690,1,1000.0,1000.000000,
CustomerId_2694,1,1000.0,1000.000000,
CustomerId_2698,1,500.0,500.000000,
CustomerId_2703,1,500000.0,500000.000000,
...,...,...,...,...
CustomerId_4033,778,1768355.5,2272.950514,10382.687289
CustomerId_1096,784,1949226.0,2486.257653,17819.372757
CustomerId_647,1869,3633564.0,1944.121990,7715.389537
CustomerId_3634,2085,2628793.0,1260.811990,5388.206928


## Feature Extraction 

In [62]:
dff = df.copy()
# Parse 'TransactionStartTime' as datetime and convert it to UTC
# (tz_convert requires the parsed values to be timezone-aware)
dff['TransactionStartTime'] = pd.to_datetime(dff['TransactionStartTime']).dt.tz_convert('UTC')

# # Convert to Uganda time (UTC+3)
# uganda_tz = pytz.timezone('Africa/Kampala')
# dff['TransactionStartTime_Uganda'] = dff['TransactionStartTime'].dt.tz_convert(uganda_tz)


In [63]:
dff['TransactionYear'] =  dff['TransactionStartTime'].dt.year
dff['TransactionMonth'] = dff['TransactionStartTime'].dt.month
dff['TransactionDay'] = dff['TransactionStartTime'].dt.day
dff['TransactionHour'] = dff['TransactionStartTime'].dt.hour

In [64]:
# Find the start and end dates
start_date = dff['TransactionStartTime'].min()
end_date = dff['TransactionStartTime'].max()

# Display the results
print("Start Date:", start_date)
print("End Date:", end_date)

Start Date: 2018-11-15 02:18:49+00:00
End Date: 2019-02-13 10:01:28+00:00


In [68]:
# Per-transaction recency: days between the latest timestamp in the data and each transaction
dff['rec'] = end_date - dff['TransactionStartTime']
dff['Recency'] = dff['rec'].dt.days

In [115]:

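# Per-customer recency in days, broadcast to every row with transform('max').
# Uses the UTC column; 'TransactionStartTime_Uganda' only exists if the commented-out conversion is run.
last_txn = dff.groupby('CustomerId')['TransactionStartTime'].transform('max')
dff['recency'] = (end_date - last_txn).dt.days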

In [65]:
dff.columns

Index(['TransactionId', 'BatchId', 'AccountId', 'SubscriptionId', 'CustomerId',
       'CurrencyCode', 'CountryCode', 'ProviderId', 'ProductId',
       'ProductCategory', 'ChannelId', 'Amount', 'Value',
       'TransactionStartTime', 'PricingStrategy', 'FraudResult',
       'TransactionYear', 'TransactionMonth', 'TransactionDay',
       'TransactionHour'],
      dtype='object')

In [18]:
# Requires the 'TransactionStartTime_Uganda' column from the commented-out conversion above
latest_transaction = dff.groupby('CustomerId')['TransactionStartTime_Uganda'].max()
latest_transaction

CustomerId
CustomerId_1      2018-11-21 19:49:14+03:00
CustomerId_10     2018-11-21 19:49:09+03:00
CustomerId_1001   2018-11-16 11:20:39+03:00
CustomerId_1002   2019-01-18 13:05:00+03:00
CustomerId_1003   2019-02-01 18:04:51+03:00
                             ...           
CustomerId_992    2019-02-08 13:27:42+03:00
CustomerId_993    2019-01-18 18:56:30+03:00
CustomerId_994    2019-02-12 14:17:08+03:00
CustomerId_996    2018-12-07 18:24:31+03:00
CustomerId_998    2019-02-13 10:47:23+03:00
Name: TransactionStartTime_Uganda, Length: 3742, dtype: datetime64[ns, Africa/Kampala]

## Model preparation

In [6]:
# Categorical columns to encode
categorical_feature = ['ProviderId', 'ProductId', 'ProductCategory',
                       'ChannelId', 'PricingStrategy']

# Numerical columns to scale
num_cat = ['Amount', 'TransactionYear', 'TransactionMonth',
           'TransactionDay', 'TransactionHour']


In [7]:
all_feature = categorical_feature + num_cat
df_feat = dff[all_feature]
print(df_feat.shape)


(95662, 10)


In [8]:
df_feat.columns

Index(['ProviderId', 'ProductId', 'ProductCategory', 'ChannelId',
       'PricingStrategy', 'Amount', 'TransactionYear', 'TransactionMonth',
       'TransactionDay', 'TransactionHour'],
      dtype='object')

In [133]:
df_feat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95662 entries, 0 to 95661
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ProviderId        95662 non-null  object 
 1   ProductId         95662 non-null  object 
 2   ProductCategory   95662 non-null  object 
 3   ChannelId         95662 non-null  object 
 4   PricingStrategy   95662 non-null  int64  
 5   Amount            95662 non-null  float64
 6   TransactionYear   95662 non-null  int32  
 7   TransactionMonth  95662 non-null  int32  
 8   TransactionDay    95662 non-null  int32  
 9   TransactionHour   95662 non-null  int32  
dtypes: float64(1), int32(4), int64(1), object(4)
memory usage: 5.8+ MB


In [16]:
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler, OneHotEncoder
import joblib

In [20]:
df_label = encoder('oneHotEncoder', df_feat, categorical_feature)

In [21]:
df_label.shape

(95662, 46)
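
The `encoder` helper comes from `../scripts/feature.py` and is not shown here. If it leaves the 5 numerical columns untouched and drops no category, the 46 columns would consist of those 5 plus 41 indicator columns. A minimal sketch of one-hot encoding with `pd.get_dummies`, used here as a stand-in for whatever the helper does internally (toy values are hypothetical):

```python
import pandas as pd

# Toy frame (hypothetical values) with two categorical columns and one numerical column.
toy = pd.DataFrame({
    "ChannelId": ["ChannelId_2", "ChannelId_3", "ChannelId_2"],
    "PricingStrategy": [2, 4, 2],
    "Amount": [1000.0, -20.0, 500.0],
})

# Each distinct category becomes its own 0/1 column; 'Amount' is left as-is.
toy_encoded = pd.get_dummies(toy, columns=["ChannelId", "PricingStrategy"], dtype=int)
print(toy_encoded.columns.tolist())
print(toy_encoded.shape)
```

`PricingStrategy` is stored as an integer but listed among the categorical features, so it is passed explicitly in `columns` to force encoding.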

In [22]:
df_scaled = dp.scaler('standardScaler', df_label, num_cat)

In [23]:
df_scaled.shape

(95662, 46)
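
The shape is unchanged because scaling rewrites values in place rather than adding columns. `dp.scaler` is defined in `../scripts/prediction.py` and is not shown, so the following is only a sketch of standardizing selected columns, assuming scikit-learn's `StandardScaler` and hypothetical toy values:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame (hypothetical values): two numerical columns and one indicator column.
toy = pd.DataFrame({
    "Amount": [100.0, 200.0, 300.0, 400.0],
    "TransactionHour": [0, 6, 12, 18],
    "ChannelId_ChannelId_3": [1, 0, 1, 0],
})
numeric_cols = ["Amount", "TransactionHour"]

# Cast to float first so the scaled values fit the column dtype.
toy_scaled = toy.astype({col: "float64" for col in numeric_cols})

# Standardize only the numerical columns; indicator columns stay 0/1.
toy_scaled[numeric_cols] = StandardScaler().fit_transform(toy_scaled[numeric_cols])
print(toy_scaled)
```

Replacing `StandardScaler` with `MinMaxScaler` would give the [0, 1] normalization described at the top of the notebook.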

In [24]:
from scipy.stats import chi2_contingency, ttest_ind
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, accuracy_score

In [30]:
def split_data(X, y, test_size=0.2, random_state=42):
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

# Split the data
X = df_scaled.copy()
y = dff['FraudResult']
X_train, X_test, y_train, y_test = split_data(X, y)
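
If `FraudResult` is heavily imbalanced, as fraud labels often are, a plain random split can leave the test set with very few positive cases. One possible variation, sketched here on hypothetical toy data rather than the notebook's frame, is to pass `stratify` to `train_test_split` so both splits keep the same class ratio:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data (hypothetical): 10 positives out of 100 rows.
X_toy = pd.DataFrame({"Amount": np.arange(100, dtype=float)})
y_toy = pd.Series([1] * 10 + [0] * 90, name="FraudResult")

# stratify=y_toy preserves the 10% positive rate in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```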