## Xente Fraud Detection: Feature Engineering
Competition: https://zindi.africa/competitions/xente-fraud-detection-challenge

Problem statement: Create a machine learning model to detect fraudulent transactions.

Predict `FraudResult` probability

Evaluation: The error metric for this competition is the `F1 score`, which ranges from 0 (total failure) to 1 (perfect score). Hence, the closer your score is to 1, the better your model.
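
To make the metric concrete, here is a minimal sketch that computes F1 by hand from precision and recall and checks it against `sklearn.metrics.f1_score`. The labels and probabilities are made up for illustration (they are not competition data). F1 is computed on hard 0/1 labels, so predicted probabilities need a threshold first; the 0.5 used here is an assumption, not a value from the competition.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels and predicted fraud probabilities (illustration only)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_prob = np.array([0.1, 0.6, 0.8, 0.4, 0.2, 0.9, 0.3, 0.1])

# F1 needs hard labels; 0.5 is an assumed threshold
y_pred = (y_prob >= 0.5).astype(int)

# Count true positives, false positives and false negatives
tp = int(((y_pred == 1) & (y_true == 1)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true, y_pred)

print(precision, recall, f1_manual, f1_sklearn)
```

Because the threshold changes which transactions count as fraud, it is worth tuning on a held-out set rather than fixing it in advance.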

In [30]:
# Load libraries

# to handle datasets
import pandas as pd
import numpy as np
import datetime as dt

# for plotting
import matplotlib.pyplot as plt
%matplotlib inline

# to divide train and test set
from sklearn.model_selection import train_test_split

# feature scaling
from sklearn.preprocessing import MinMaxScaler

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)



In [31]:
import warnings
warnings.simplefilter("ignore")


### Feature Engineering

In the following cells, I will engineer / pre-process the variables of the Xente Fraud Detection Dataset. I will engineer the variables to address:

1. Missing values
2. Temporal variables
3. Non-Gaussian distributed variables
4. Categorical variables: remove rare labels
5. Categorical variables: convert strings to numbers
6. Standardise the values of the variables to the same range
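
Step 4 (removing rare labels) could look roughly like the sketch below. Everything in it is an assumption made for illustration: the helper name `group_rare_labels`, the `min_frac` threshold, the `'Rare'` placeholder and the toy data. The idea is to learn category frequencies on the train set only and apply the same grouping to any other split.

```python
import pandas as pd

def group_rare_labels(train, other, var, min_frac=0.01, rare_label='Rare'):
    """Hypothetical helper: replace categories that cover less than
    min_frac of the train rows with rare_label.
    Frequencies are learned on train only and applied to both frames."""
    freq = train[var].value_counts(normalize=True)
    frequent = freq[freq >= min_frac].index
    # Series.where keeps values where the condition holds, else uses rare_label
    train_out = train[var].where(train[var].isin(frequent), rare_label)
    other_out = other[var].where(other[var].isin(frequent), rare_label)
    return train_out, other_out

# Toy data: 'C' appears in 1% of train rows, 'D' never appears in train
toy_train = pd.DataFrame({'ProductId': ['A'] * 60 + ['B'] * 39 + ['C']})
toy_test = pd.DataFrame({'ProductId': ['A', 'C', 'D']})

tr, te = group_rare_labels(toy_train, toy_test, 'ProductId', min_frac=0.05)
print(tr.value_counts())
print(te.tolist())
```

A side effect of this approach is that labels never seen in the train set also fall into the rare bucket instead of becoming missing values later on.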



In [1]:
# Load datasets
df_data = pd.read_csv('../data/interim/wrangled_data.csv', parse_dates=['TransactionStartTime'])


In [33]:
# view the size of the dataset
print('The size of the dataset: ' + str(df_data.shape))

# view some rows
df_data.head()

The size of the dataset: (140681, 12)


Unnamed: 0,Amount,ChannelId,CustomerId,FraudResult,PricingStrategy,ProductCategory,ProductId,ProviderId,TransactionId,TransactionStartTime,Value,source
0,1000.0,ChannelId_3,CustomerId_4406,0.0,2,airtime,ProductId_10,ProviderId_6,TransactionId_76871,2018-11-15 02:18:49+00:00,1000,train
1,-20.0,ChannelId_2,CustomerId_4406,0.0,2,financial_services,ProductId_6,ProviderId_4,TransactionId_73770,2018-11-15 02:19:08+00:00,20,train
2,500.0,ChannelId_3,CustomerId_4683,0.0,2,airtime,ProductId_1,ProviderId_6,TransactionId_26203,2018-11-15 02:44:21+00:00,500,train
3,20000.0,ChannelId_3,CustomerId_988,0.0,2,utility_bill,ProductId_21,ProviderId_1,TransactionId_380,2018-11-15 03:32:55+00:00,21800,train
4,-644.0,ChannelId_2,CustomerId_988,0.0,2,financial_services,ProductId_6,ProviderId_4,TransactionId_28195,2018-11-15 03:34:21+00:00,644,train


In [34]:
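# separate the rows marked 'train' from the rows marked 'test' in the source column;
# .copy() gives independent frames, so later in-place edits do not touch df_data
df_train = df_data[df_data['source'] == 'train'].copy()
df_new_data = df_data[df_data['source'] == 'test'].copy()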

In [35]:
# Let's separate into train and test set
# Remember to set the seed (random_state for this sklearn function)

X_train, X_test, y_train, y_test = train_test_split(df_train, df_train.FraudResult,
                                                    test_size=0.1,
                                                    random_state=0) # we are setting the seed here
X_train.shape, X_test.shape

((86095, 12), (9567, 12))
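
Fraud labels are usually heavily imbalanced, so a purely random split can leave the hold-out set with a different fraud rate than the train set. A minimal sketch of a stratified alternative, using the `stratify` argument of `train_test_split` on synthetic data (the 2% fraud rate and the toy frame are assumptions, not figures from this dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame: 20 fraud rows out of 1000 (assumed rate, for illustration)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Value': rng.integers(1, 1000, size=1000),
    'FraudResult': [1] * 20 + [0] * 980,
})

# stratify keeps the class proportions equal in both splits
tr, te = train_test_split(toy, test_size=0.1, random_state=0,
                          stratify=toy['FraudResult'])

print(tr['FraudResult'].mean(), te['FraudResult'].mean())
```

In this notebook the equivalent change would be passing `stratify=df_train.FraudResult` to the call above; I have not re-run the cells with it, so the shapes and outputs shown here come from the unstratified split.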

In [36]:
# log-transform the numerical variable Value (np.log requires Value > 0)
for var in ['Value']:
    X_train[var] = np.log(X_train[var])
    X_test[var] = np.log(X_test[var])
    df_new_data[var] = np.log(df_new_data[var])
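
One caveat with the log transform: `np.log` returns `-inf` at zero, which would later break the scaler. The sample rows above only show positive values, and I have not verified whether zeros occur in the full data, so the sketch below uses a toy series to show how to check for the problem and how `np.log1p` (log of 1 + x) sidesteps it.

```python
import numpy as np
import pandas as pd

# Toy values including a zero (illustration only, not the real column)
values = pd.Series([0, 20, 500, 1000, 21800], name='Value')

# np.log(0) is -inf; silence the divide-by-zero warning for the demo
with np.errstate(divide='ignore'):
    logged = np.log(values)

# Count the entries that became infinite
n_bad = int(np.isinf(logged).sum())

# log1p maps 0 to 0 and stays finite for all non-negative inputs
safe = np.log1p(values)

print(n_bad)
print(safe.round(3).tolist())
```

If `log1p` is used, it has to be applied to the train, test and new-data splits alike so that all three stay on the same scale.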

In [37]:
# extract year, month, day, hour and minute from TransactionStartTime into new columns
def extract_date_features(df):
    '''Add Transaction_* columns to df in place; df needs a datetime TransactionStartTime column.'''
    start = df['TransactionStartTime']

    # the .dt accessor is vectorised, so no row-wise apply is needed
    df['Transaction_year'] = start.dt.year
    df['Transaction_month'] = start.dt.month
    df['Transaction_day'] = start.dt.day
    df['Transaction_hour'] = start.dt.hour
    df['Transaction_minute'] = start.dt.minute
    

In [38]:
# Extract date features
extract_date_features(X_train)
extract_date_features(X_test)
extract_date_features(df_new_data)

In [39]:
# keep the original TransactionId strings so they can be restored after encoding
df_train_transaction_ids = X_train['TransactionId']
df_test_transaction_ids = X_test['TransactionId']
df_new_data_transaction_ids = df_new_data['TransactionId']


In [40]:
# drop columns
X_train.drop(columns=['CustomerId','source','Amount','TransactionStartTime'], inplace=True)
X_test.drop(columns=['CustomerId','source','Amount','TransactionStartTime'], inplace=True)
df_new_data.drop(columns=['CustomerId','source','Amount','TransactionStartTime'], inplace=True)


In [41]:
# let's capture the categorical variables first
cat_vars = [var for var in X_train.columns if X_train[var].dtype == 'O']
cat_vars

['ChannelId', 'ProductCategory', 'ProductId', 'ProviderId', 'TransactionId']

In [42]:
categorical_vars = ['ChannelId', 'ProductCategory', 'ProductId', 'ProviderId']

In [43]:
# this function assigns discrete values to the strings of a variable,
# so that a smaller value corresponds to a smaller mean of the target

def replace_categories(train, test, new_data, var, target):
    '''Encode var in place in all three frames, using the ordering learned on train.'''
    # order the labels by the mean of the target in the train set only
    ordered_labels = train.groupby([var])[target].mean().sort_values().index
    ordinal_label = {k: i for i, k in enumerate(ordered_labels, 0)}
    # labels that never occur in train are not in the mapping and become NaN
    train[var] = train[var].map(ordinal_label)
    test[var] = test[var].map(ordinal_label)
    new_data[var] = new_data[var].map(ordinal_label)
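
To see what `replace_categories` does, here is the same logic written out step by step on a toy frame. The data is made up, and the `-1` fallback for unseen labels is an assumption of mine, not something the notebook does.

```python
import pandas as pd

# Toy train set: c1 has fraud rate 0.0, c3 has 0.5, c2 has 1.0
toy_train = pd.DataFrame({
    'ChannelId': ['c1', 'c1', 'c2', 'c2', 'c3', 'c3'],
    'FraudResult': [0, 0, 1, 1, 0, 1],
})
# 'c4' never appears in the train set
toy_test = pd.DataFrame({'ChannelId': ['c2', 'c4']})

# Rank the labels by their mean target in the train set
means = toy_train.groupby('ChannelId')['FraudResult'].mean().sort_values()
mapping = {label: rank for rank, label in enumerate(means.index)}

# Labels missing from the mapping become NaN
encoded_test = toy_test['ChannelId'].map(mapping)

# Assumed fallback: send unseen labels to -1 instead of leaving NaN
encoded_test_filled = encoded_test.fillna(-1).astype(int)

print(mapping)
print(encoded_test.tolist())
print(encoded_test_filled.tolist())
```

The mapping is learned from the train set only, which keeps information about the hold-out target out of the features, but it also means any label that appears only in the test or new data needs an explicit fallback.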

In [44]:
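# note: cat_vars still contains TransactionId, so the IDs are encoded here too
# and restored from the saved copies further down; looping over categorical_vars
# would avoid that round trip
for var in cat_vars:
    replace_categories(X_train, X_test, df_new_data, var, 'FraudResult')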

In [45]:
X_train.head()

Unnamed: 0,ChannelId,FraudResult,PricingStrategy,ProductCategory,ProductId,ProviderId,TransactionId,Value,Transaction_year,Transaction_month,Transaction_day,Transaction_hour,Transaction_minute
59577,1,0.0,2,6,14,2,61971,4.60517,2019,1,13,19,0
90,1,0.0,4,5,15,2,34926,6.907755,2018,11,15,7,1
72095,2,0.0,2,5,15,1,18581,8.29405,2019,1,25,12,57
54884,2,0.0,2,5,16,1,29054,7.600902,2019,1,8,19,49
43368,2,0.0,2,5,15,1,6870,6.214608,2018,12,28,6,49


In [46]:
# restore the original TransactionId strings saved before encoding
X_train['TransactionId'] = df_train_transaction_ids
X_test['TransactionId'] = df_test_transaction_ids
df_new_data['TransactionId'] = df_new_data_transaction_ids

In [47]:
# list the train set columns that contain missing values (expect an empty list)
[var for var in X_train.columns if X_train[var].isnull().sum()>0]

[]

In [48]:
# list the test set columns that contain missing values (expect an empty list)
[var for var in X_test.columns if X_test[var].isnull().sum()>0]

[]

In [49]:
train_vars = [var for var in X_train.columns if var not in ['TransactionId', 'FraudResult']]
len(train_vars)

11

In [50]:
X_train[['TransactionId', 'FraudResult']].reset_index(drop=True)

Unnamed: 0,TransactionId,FraudResult
0,TransactionId_109305,0.0
1,TransactionId_74975,0.0
2,TransactionId_51548,0.0
3,TransactionId_98435,0.0
4,TransactionId_67877,0.0
5,TransactionId_7976,0.0
6,TransactionId_71943,0.0
7,TransactionId_107148,0.0
8,TransactionId_53768,0.0
9,TransactionId_38186,0.0


In [51]:
# fit scaler
scaler = MinMaxScaler() # create an instance
scaler.fit(X_train[train_vars]) # fit the scaler to the train set for later use

# transform the train, test and new data sets, and add back the TransactionId and FraudResult variables
X_train = pd.concat([X_train[['TransactionId', 'FraudResult']].reset_index(drop=True),
                    pd.DataFrame(scaler.transform(X_train[train_vars]), columns=train_vars)],
                    axis=1)

X_test = pd.concat([X_test[['TransactionId', 'FraudResult']].reset_index(drop=True),
                    pd.DataFrame(scaler.transform(X_test[train_vars]), columns=train_vars)],
                    axis=1)

df_new_data = pd.concat([df_new_data[['TransactionId', 'FraudResult']].reset_index(drop=True),
                    pd.DataFrame(scaler.transform(df_new_data[train_vars]), columns=train_vars)],
                    axis=1)
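
A small sketch of why the scaler is fitted on the train set only, and what that implies for the other sets. The numbers are toy values: the scaler learns the train minimum and maximum, so test values outside that range land outside [0, 1].

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data: the test set contains values below and above the train range
toy_train = pd.DataFrame({'Value': [2.0, 4.0, 6.0]})
toy_test = pd.DataFrame({'Value': [1.0, 5.0, 8.0]})

# Learn min and max from the train set only
scaler = MinMaxScaler().fit(toy_train)

# Apply the same train-derived scaling to both sets
scaled_train = scaler.transform(toy_train).ravel()
scaled_test = scaler.transform(toy_test).ravel()

print(scaled_train)
print(scaled_test)
```

Values outside [0, 1] in the test set are expected here and are usually harmless for tree-based models. If a model needs bounded inputs, recent scikit-learn versions accept `MinMaxScaler(clip=True)`, which clips transformed values to the fitted range.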

In [52]:
X_train.tail()

Unnamed: 0,TransactionId,FraudResult,ChannelId,PricingStrategy,ProductCategory,ProductId,ProviderId,Value,Transaction_year,Transaction_month,Transaction_day,Transaction_hour,Transaction_minute
86090,TransactionId_1505,0.0,0.666667,1.0,0.625,0.0,0.8,0.448181,0.0,1.0,0.233333,0.869565,0.694915
86091,TransactionId_68325,0.0,0.333333,0.5,0.75,0.636364,0.4,0.208843,0.0,1.0,0.933333,0.347826,0.491525
86092,TransactionId_62369,0.0,0.666667,0.5,0.625,0.727273,0.2,0.429516,0.0,1.0,0.866667,0.73913,0.050847
86093,TransactionId_1639,0.0,0.666667,0.5,0.625,0.0,0.2,0.448181,0.0,1.0,0.9,0.347826,0.101695
86094,TransactionId_83620,0.0,0.666667,0.5,0.625,0.727273,0.2,0.403209,1.0,0.0,0.7,0.652174,0.372881


In [53]:
# check absence of missing values
X_train.isnull().sum()

TransactionId         0
FraudResult           0
ChannelId             0
PricingStrategy       0
ProductCategory       0
ProductId             0
ProviderId            0
Value                 0
Transaction_year      0
Transaction_month     0
Transaction_day       0
Transaction_hour      0
Transaction_minute    0
dtype: int64

In [54]:
# save the train and test sets for the next notebook!
X_train.to_csv("../data/processed/x_train.csv",index=False)
X_test.to_csv("../data/processed/x_test.csv",index=False)
df_new_data.to_csv("../data/processed/new_data.csv",index=False)