## Xente Fraud Detection: Feature Selection
Competition: https://zindi.africa/competitions/xente-fraud-detection-challenge

Problem statement: Create a machine learning model to detect fraudulent transactions.

Target: the probability of `FraudResult` for each transaction.

Evaluation: The error metric for this competition is the `F1 score`, which ranges from 0 (total failure) to 1 (perfect score). Hence, the closer your score is to 1, the better your model.
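
Because the F1 score is computed on hard class labels, predicted probabilities have to be thresholded before scoring. A minimal sketch, assuming a cut-off of 0.5 and made-up labels and probabilities (neither comes from the competition data):

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels and predicted fraud probabilities (not competition data)
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.10, 0.60, 0.80, 0.40, 0.20, 0.90])

# F1 needs class labels, so the probabilities are thresholded first
threshold = 0.5  # assumed cut-off; in practice it would be tuned on validation data
y_pred = (y_prob >= threshold).astype(int)

# 2 true positives, 1 false positive, 1 false negative -> precision = recall = 2/3
score = f1_score(y_true, y_pred)
print(round(score, 3))
```

With heavily imbalanced fraud data, the threshold that maximises F1 is often not 0.5, so it is usually worth treating it as a tunable parameter.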

## Feature Selection

In the following cells, we select the most predictive variables to use when building the machine learning models.

In [4]:
# Load libraries

# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt
%matplotlib inline

# to build the models
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

# for feature extraction with Univariate Statistical Tests (Chi-squared for classification)
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# for feature extraction with RFE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# for feature importance
from sklearn.ensemble import ExtraTreesClassifier

In [5]:
import warnings
warnings.simplefilter("ignore")

# to display all the columns of the dataframe in the notebook
pd.set_option('display.max_columns', None)

In [6]:
# Load datasets
X_train = pd.read_csv('../data/processed/x_train.csv')
X_test = pd.read_csv('../data/processed/x_test.csv')

X_test.head()

Unnamed: 0,TransactionId,FraudResult,ChannelId,PricingStrategy,ProductCategory,ProductId,ProviderId,Value,Transaction_year,Transaction_month,Transaction_day,Transaction_hour,Transaction_minute
0,TransactionId_91392,0.0,0.666667,1.0,0.625,0.727273,0.8,0.448181,1.0,0.0,0.733333,0.478261,0.847458
1,TransactionId_119416,0.0,0.666667,0.5,0.625,0.681818,0.2,0.40998,1.0,0.0,0.6,0.73913,0.186441
2,TransactionId_124012,0.0,0.666667,0.5,0.625,0.681818,0.2,0.448181,0.0,0.909091,0.7,0.26087,0.576271
3,TransactionId_12251,0.0,0.333333,0.5,0.75,0.636364,0.4,0.403209,1.0,0.090909,0.0,0.73913,0.762712
4,TransactionId_27059,0.0,0.666667,0.5,0.625,0.727273,0.2,0.403209,1.0,0.0,0.466667,0.521739,0.355932


In [7]:
# capture the target
y_train = X_train['FraudResult']
y_test = X_test['FraudResult']

# drop unnecessary variables from our training and testing sets
X_train.drop(['FraudResult','TransactionId'], axis=1, inplace=True)
X_test.drop(['FraudResult','TransactionId'], axis=1, inplace=True)

### Feature Selection

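Three techniques are compared: univariate chi-squared tests, recursive feature elimination (RFE), and feature importance from an Extra Trees classifier.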

#### Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)

In [8]:
X = X_train.values
Y = y_train.values

In [9]:
# feature extraction
test = SelectKBest(score_func=chi2, k=5)
fit = test.fit(X, Y)
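
The chi-squared test in scikit-learn only accepts non-negative feature values, which holds here because the processed features lie in [0, 1]. A small sketch of that constraint, using made-up arrays (`X_ok`, `X_bad` and `y_demo` are illustrative names, not part of this notebook):

```python
import numpy as np
from sklearn.feature_selection import chi2

# Made-up, non-negative features and binary labels
X_ok = np.array([[0.2, 0.5], [0.9, 0.1], [0.4, 0.7], [0.8, 0.3]])
y_demo = np.array([0, 1, 0, 1])

# chi2 returns one score and one p-value per feature
scores, p_values = chi2(X_ok, y_demo)

# Shifting the data below zero makes chi2 reject the input
X_bad = X_ok - 0.5
try:
    chi2(X_bad, y_demo)
    raised = False
except ValueError:
    raised = True

print(scores, raised)
```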

In [14]:
# summarize scores
set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)

[  5.848   5.716   1.387   7.233  62.464  80.087   4.381   3.218   1.501
   0.511   0.328]


In [15]:
# summarize selected features
print(features[0:5,:])

[[ 0.333  0.5    0.636  0.4    0.254]
 [ 0.333  1.     0.682  0.4    0.403]
 [ 0.667  0.5    0.682  0.2    0.493]
 [ 0.667  0.5    0.727  0.2    0.448]
 [ 0.667  0.5    0.682  0.2    0.358]]


In [16]:
features

array([[ 0.333,  0.5  ,  0.636,  0.4  ,  0.254],
       [ 0.333,  1.   ,  0.682,  0.4  ,  0.403],
       [ 0.667,  0.5  ,  0.682,  0.2  ,  0.493],
       ..., 
       [ 0.667,  0.5  ,  0.727,  0.2  ,  0.43 ],
       [ 0.667,  0.5  ,  0.   ,  0.2  ,  0.448],
       [ 0.667,  0.5  ,  0.727,  0.2  ,  0.403]])

In [17]:
X[0:1]

array([[ 0.333,  0.5  ,  0.75 ,  0.636,  0.4  ,  0.254,  1.   ,  0.   ,
         0.4  ,  0.826,  0.   ]])

In [18]:
X_train.head(5)

Unnamed: 0,ChannelId,PricingStrategy,ProductCategory,ProductId,ProviderId,Value,Transaction_year,Transaction_month,Transaction_day,Transaction_hour,Transaction_minute
0,0.333333,0.5,0.75,0.636364,0.4,0.253815,1.0,0.0,0.4,0.826087,0.0
1,0.333333,1.0,0.625,0.681818,0.4,0.403209,0.0,0.909091,0.466667,0.304348,0.016949
2,0.666667,0.5,0.625,0.681818,0.2,0.493153,1.0,0.0,0.8,0.521739,0.966102
3,0.666667,0.5,0.625,0.727273,0.2,0.448181,1.0,0.0,0.233333,0.826087,0.830508
4,0.666667,0.5,0.625,0.681818,0.2,0.358237,0.0,1.0,0.9,0.26087,0.830508


In [61]:
#Top 5: ChannelId,PricingStrategy,ProductId,ProviderId,Value
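
Reading the top features off a bare score array is error-prone. One possible way to attach column names is sketched below on synthetic data; `X_demo`, `y_demo` and the column names are assumptions made for illustration and stand in for the scaled training set.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for the scaled training data (all values in [0, 1])
rng = np.random.RandomState(0)
y_demo = rng.randint(0, 2, size=200)
X_demo = pd.DataFrame({
    "informative": 0.7 * y_demo + 0.3 * rng.rand(200),  # tracks the label closely
    "noise_a": rng.rand(200),
    "noise_b": rng.rand(200),
})

selector = SelectKBest(score_func=chi2, k=1).fit(X_demo, y_demo)

# Pair each score with its column name and sort, highest first
chi2_scores = pd.Series(selector.scores_, index=X_demo.columns)
chi2_scores = chi2_scores.sort_values(ascending=False)

# get_support() returns the boolean mask of the k selected columns
top_features = list(X_demo.columns[selector.get_support()])

print(chi2_scores)
print(top_features)
```

Applied to this notebook, the same pattern would use `X_train.columns` as the index for `fit.scores_`.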

#### Feature Extraction with RFE

In [24]:
# feature extraction
model = LogisticRegression()
# n_features_to_select must be passed by keyword in recent scikit-learn releases
rfe = RFE(model, n_features_to_select=6)
fit = rfe.fit(X, Y)
print("Num Features:")
print(fit.n_features_)

print("Selected Features:")
print(fit.support_)

print("Feature Ranking:")
print(fit.ranking_)

Num Features:
6
Selected Features:
[ True  True False False  True  True  True  True False False False]
Feature Ranking:
[1 1 3 2 1 1 1 1 5 6 4]


In [25]:
#Selected 6 (rank 1): ChannelId, PricingStrategy, ProviderId, Value, Transaction_year, Transaction_month

In [26]:
# this is how we can make a list of the selected features
selected_feat = X_train.columns[(fit.support_)]
selected_feat

Index(['ChannelId', 'PricingStrategy', 'ProviderId', 'Value',
       'Transaction_year', 'Transaction_month'],
      dtype='object')
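
The RFE ranking can be labelled with column names in the same way as the boolean mask. A sketch on synthetic data follows; the `make_classification` dataset, the column names `f0`..`f5` and `max_iter=1000` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 6 features, 3 of them informative
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Keep 3 features; one feature is eliminated per iteration (default step=1)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X_demo, y_demo)

# Rank 1 marks a selected feature; higher ranks were eliminated earlier
ranking = pd.Series(rfe.ranking_, index=X_demo.columns).sort_values()
selected = list(X_demo.columns[rfe.support_])

print(ranking)
print(selected)
```

Note that RFE does not order the selected features among themselves: all of them share rank 1.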

#### Feature Importance with Extra Trees Classifier

In [27]:
# feature extraction
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)

[ 0.009  0.033  0.017  0.028  0.062  0.561  0.002  0.02   0.092  0.08
  0.095]


In [None]:
# Most important features: Value, Transaction_minute, Transaction_day, Transaction_hour, ProviderId
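
`ExtraTreesClassifier` is randomised, so the importances can change from run to run unless a seed is fixed. A sketch with a fixed `random_state` and named importances, on synthetic data (the dataset, the column names and `n_estimators=100` are assumptions, not the notebook's settings):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data: 6 features, 3 of them informative
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
cols = [f"f{i}" for i in range(6)]

# Fixing random_state makes the importances reproducible between runs
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_arr, y_demo)

# Impurity-based importances are normalised to sum to 1
importances = pd.Series(model.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)

print(importances.head(5))
```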

In [None]:
# Features common to all 3 selection processes: Value, ProviderId
# (PricingStrategy is selected by chi-squared and RFE but ranks only sixth by Extra Trees importance)
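
The overlap between the three methods can be recomputed directly from the outputs printed earlier in this notebook. In the sketch below the column order is taken from `X_train.head()`, the top five scores are used for chi-squared and Extra Trees, and the `True` entries of the RFE mask are used for RFE; `top_k` is a helper name introduced here for illustration.

```python
import numpy as np

columns = [
    "ChannelId", "PricingStrategy", "ProductCategory", "ProductId",
    "ProviderId", "Value", "Transaction_year", "Transaction_month",
    "Transaction_day", "Transaction_hour", "Transaction_minute",
]

# Values copied from the printed outputs of the three selection steps
chi2_scores = np.array([5.848, 5.716, 1.387, 7.233, 62.464, 80.087,
                        4.381, 3.218, 1.501, 0.511, 0.328])
rfe_support = np.array([True, True, False, False, True, True,
                        True, True, False, False, False])
et_importances = np.array([0.009, 0.033, 0.017, 0.028, 0.062, 0.561,
                           0.002, 0.02, 0.092, 0.08, 0.095])


def top_k(scores, k=5):
    """Return the names of the k highest-scoring columns as a set."""
    order = np.argsort(scores)[::-1][:k]
    return {columns[i] for i in order}


chi2_top = top_k(chi2_scores)
rfe_selected = {name for name, keep in zip(columns, rfe_support) if keep}
et_top = top_k(et_importances)

# Features chosen by every method
common = chi2_top & rfe_selected & et_top
print(sorted(common))
```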

In [28]:
# save the selected list of features
# note: selected_feat currently holds the six features selected by RFE above
pd.Series(selected_feat).to_csv('../data/processed/selected_features.csv', index=False)
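
A later notebook would typically read the saved list back and subset the feature matrix with it. A self-contained sketch of that round trip, writing to a temporary directory instead of `../data/processed/`; the feature names, the `feature` column header and `X_demo` are illustrative assumptions. The header is written explicitly because the default for `header` on `Series.to_csv` has differed between pandas versions.

```python
import os
import tempfile

import pandas as pd

# Illustrative feature list; in the notebook this would be selected_feat
selected = pd.Series(["ProviderId", "Value"], name="feature")

# Write with an explicit header so the file reads back the same way everywhere
path = os.path.join(tempfile.mkdtemp(), "selected_features.csv")
selected.to_csv(path, index=False, header=True)

# Read the list back and use it to subset a (made-up) feature matrix
loaded = pd.read_csv(path)["feature"].tolist()
X_demo = pd.DataFrame({
    "ChannelId": [0.333, 0.667],
    "ProviderId": [0.4, 0.2],
    "Value": [0.254, 0.403],
})
X_reduced = X_demo[loaded]

print(X_reduced)
```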