## Xente Fraud Detection: Model Deployment Prep
Competition: https://zindi.africa/competitions/xente-fraud-detection-challenge

Problem statement: Create a machine learning model to detect fraudulent transactions.

Predict `FraudResult` probability

Evaluation: The error metric for this competition is the `F1 score`, which ranges from 0 (total failure) to 1 (perfect score). Hence, the closer your score is to 1, the better your model.

### Model Deployment Prep


Now we want to deploy our model. We want to create an API that we can call with new data, that is, with the characteristics of new transactions, to get an estimate of the `FraudResult` probability. To do so, we need to write code in a very specific way. We will show you how to write production code in the coming lectures.

Here we summarise the key pieces of code that we need to take forward to put this project's model in production.

In [74]:
# Load libraries

# to handle datasets
import pandas as pd
import numpy as np

# to divide train and test set
from sklearn.model_selection import train_test_split

# feature scaling
from sklearn.preprocessing import MinMaxScaler

# to build the models


# to evaluate the models

# to persist the model and the scaler
# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

In [75]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

In [76]:
import warnings
warnings.simplefilter("ignore")

In [77]:
# Load datasets
df_data = pd.read_csv('../data/raw/training.csv',parse_dates=['TransactionStartTime'])
selected_features = pd.read_csv('../data/processed/selected_features.csv')
df_data.head()

,TransactionId,BatchId,AccountId,SubscriptionId,CustomerId,CurrencyCode,CountryCode,ProviderId,ProductId,ProductCategory,ChannelId,Amount,Value,TransactionStartTime,PricingStrategy,FraudResult
0,TransactionId_76871,BatchId_36123,AccountId_3957,SubscriptionId_887,CustomerId_4406,UGX,256,ProviderId_6,ProductId_10,airtime,ChannelId_3,1000.0,1000,2018-11-15 02:18:49+00:00,2,0
1,TransactionId_73770,BatchId_15642,AccountId_4841,SubscriptionId_3829,CustomerId_4406,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-20.0,20,2018-11-15 02:19:08+00:00,2,0
2,TransactionId_26203,BatchId_53941,AccountId_4229,SubscriptionId_222,CustomerId_4683,UGX,256,ProviderId_6,ProductId_1,airtime,ChannelId_3,500.0,500,2018-11-15 02:44:21+00:00,2,0
3,TransactionId_380,BatchId_102363,AccountId_648,SubscriptionId_2185,CustomerId_988,UGX,256,ProviderId_1,ProductId_21,utility_bill,ChannelId_3,20000.0,21800,2018-11-15 03:32:55+00:00,2,0
4,TransactionId_28195,BatchId_38780,AccountId_4841,SubscriptionId_3829,CustomerId_988,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-644.0,644,2018-11-15 03:34:21+00:00,2,0


#### Separate dataset into train and test

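Before beginning to engineer our features, it is important to separate our data into training and testing sets. Everything learned during feature engineering (frequent labels, ordinal mappings, scaler ranges) should come from the training set only, so that the test set gives an honest estimate of performance and we avoid over-fitting. There is an element of randomness in dividing the dataset, so remember to set the seed.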

In [78]:
# Let's separate into train and test set
# Remember to set the seed (random_state for this sklearn function)

X_train, X_test, y_train, y_test = train_test_split(df_data, df_data.FraudResult,
                                                    test_size=0.1,
                                                    random_state=0) # we are setting the seed here
X_train.shape, X_test.shape

((86095, 16), (9567, 16))
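
Because fraud cases are rare, a purely random split can leave the test set with very few positives. The following is a minimal sketch, on synthetic data, of how the `stratify` argument of `train_test_split` keeps the class ratio equal in both partitions. The array sizes and the 2% positive rate are assumptions chosen for illustration, not values from the Xente data, and the notebook itself does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced labels (assumption: 2% positives, illustration only)
rng = np.random.RandomState(0)
y = np.zeros(10_000, dtype=int)
y[:200] = 1
X = rng.normal(size=(10_000, 3))

# stratify=y asks sklearn to preserve the class proportions in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)

print("positive rate in train:", y_tr.mean())
print("positive rate in test: ", y_te.mean())
```

Whether stratification is appropriate here depends on how the production data arrives; for time-ordered transactions a chronological split may be the better choice.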

#### Numerical variables
We will log-transform the numerical variables that contain no zeros (here, `Value`) to obtain a more Gaussian-like distribution. This tends to help linear machine learning models.

In [79]:
# numerical variable
for var in ['Value']:
    X_train[var] = np.log(X_train[var])
    X_test[var] = np.log(X_test[var])
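
The restriction to variables without zeros matters at scoring time too: a single zero in incoming data turns into `-inf` under `np.log`. Below is a small sketch, with made-up amounts, of that failure mode and of `np.log1p` as one possible alternative. Swapping in `log1p` is an assumption for illustration, not what this notebook does, and it would have to be applied identically in training and in production.

```python
import numpy as np
import pandas as pd

# Made-up amounts, including a zero (illustration only)
values = pd.Series([0, 20, 500, 1000], name="Value")

# Plain log: log(0) evaluates to -inf (the divide warning is silenced here)
with np.errstate(divide="ignore"):
    plain_log = np.log(values)

# log1p computes log(1 + x), which is finite at zero
shifted_log = np.log1p(values)

print(plain_log.tolist())
print(shifted_log.tolist())
```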

In [80]:
# function to extract year, month, day, hour and minute as new dataframe columns
def extract_date_features(df):
    '''Add date-part columns to df in place; TransactionStartTime must be a datetime column.'''
    start = df['TransactionStartTime'].dt  # vectorised datetime accessor
    df['Transaction_year'] = start.year
    df['Transaction_month'] = start.month
    df['Transaction_day'] = start.day
    df['Transaction_hour'] = start.hour
    df['Transaction_minute'] = start.minute

In [81]:
# Extract date features
extract_date_features(X_train)
extract_date_features(X_test)

In [82]:
# make a list of the categorical variables that contain missing values
# (`features` is the list of selected feature names shown in cell 84; it is
# assumed to have been built from `selected_features` in a step not shown here)
vars_with_na = [var for var in features if X_train[var].isnull().sum() > 0 and X_train[var].dtypes == 'O']

# print the variable name and the percentage of missing values
for var in vars_with_na:
    print(var, np.round(100 * X_train[var].isnull().mean(), 1), '% missing values')

In [83]:
# let's capture the categorical variables first
cat_vars = [var for var in features if X_train[var].dtype == 'O']
cat_vars

['ChannelId', 'ProviderId']

In [84]:
features

['ChannelId',
 'PricingStrategy',
 'ProviderId',
 'Value',
 'Transaction_year',
 'Transaction_month']

In [85]:
def find_frequent_labels(df, var, rare_perc):
    # finds the labels that are shared by more than a certain % of the transactions in the dataset
    df = df.copy()
    tmp = df.groupby(var)['FraudResult'].count() / len(df)
    return tmp[tmp>rare_perc].index

frequent_labels_dict = {}

for var in cat_vars:
    frequent_ls = find_frequent_labels(X_train, var, 0.01)
    
    # we save the list in a dictionary
    frequent_labels_dict[var] = frequent_ls
    
    X_train[var] = np.where(X_train[var].isin(frequent_ls), X_train[var], 'Rare')
    X_test[var] = np.where(X_test[var].isin(frequent_ls), X_test[var], 'Rare')
    
# now we save the dictionary
np.save('../data/FrequentLabels.npy', frequent_labels_dict)

In [86]:
frequent_labels_dict

{'ChannelId': Index(['ChannelId_2', 'ChannelId_3', 'ChannelId_5'], dtype='object', name='ChannelId'),
 'ProviderId': Index(['ProviderId_1', 'ProviderId_3', 'ProviderId_4', 'ProviderId_5',
        'ProviderId_6'],
       dtype='object', name='ProviderId')}
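
At scoring time the saved dictionary has to be read back and applied to incoming rows. The sketch below shows one way to do that under stated assumptions: it writes to a temporary directory rather than `../data/`, and the incoming rows are invented. Because `np.save` pickles a Python dictionary, loading it requires `allow_pickle=True`, and `.item()` unwraps the zero-dimensional object array back into a dict.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Frequent labels as found on the training set (values taken from the output above)
frequent = {"ChannelId": pd.Index(["ChannelId_2", "ChannelId_3", "ChannelId_5"])}

# Hypothetical location: a temporary directory stands in for ../data/
path = os.path.join(tempfile.mkdtemp(), "FrequentLabels.npy")
np.save(path, frequent)

# The dict was pickled, so allow_pickle=True is needed; .item() returns the dict
loaded = np.load(path, allow_pickle=True).item()

# Invented incoming rows: ChannelId_1 was not frequent in training
new_data = pd.DataFrame({"ChannelId": ["ChannelId_3", "ChannelId_1"]})
new_data["ChannelId"] = np.where(
    new_data["ChannelId"].isin(loaded["ChannelId"]), new_data["ChannelId"], "Rare"
)
print(new_data)
```

Only load pickled files from sources you trust, since unpickling can execute arbitrary code.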

In [87]:
# this function will assign discrete values to the strings of the variables, 
# so that the smaller value corresponds to the smaller mean of target
    
def replace_categories(train, test, var, target):
    train = train.copy()
    test = test.copy()
    
    ordered_labels = train.groupby([var])[target].mean().sort_values().index
    ordinal_label = {k:i for i, k in enumerate(ordered_labels, 0)} 
    
    train[var] = train[var].map(ordinal_label)
    test[var] = test[var].map(ordinal_label)
    
    return ordinal_label, train, test

In [88]:
ordinal_label_dict = {}
for var in cat_vars:
    ordinal_label, X_train, X_test = replace_categories(X_train, X_test, var, 'FraudResult')
    ordinal_label_dict[var] = ordinal_label
    
# now we save the dictionary
np.save('../data/OrdinalLabels.npy', ordinal_label_dict)
ordinal_label_dict

{'ChannelId': {'ChannelId_5': 0,
  'ChannelId_2': 1,
  'ChannelId_3': 2,
  'Rare': 3},
 'ProviderId': {'Rare': 0,
  'ProviderId_6': 1,
  'ProviderId_4': 2,
  'ProviderId_5': 3,
  'ProviderId_1': 4,
  'ProviderId_3': 5}}
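
One thing to watch in production is that `Series.map` returns `NaN` for any label missing from the mapping. The sketch below reuses the `ChannelId` mapping printed above with invented incoming labels and shows one possible guard: sending unknown labels to `'Rare'` before mapping. This guard is an assumption for illustration, and it only works for variables whose mapping actually contains a `'Rare'` key.

```python
import pandas as pd

# Mapping learned on the training set (copied from the output above)
ordinal_label = {"ChannelId_5": 0, "ChannelId_2": 1, "ChannelId_3": 2, "Rare": 3}

# Invented incoming labels; ChannelId_9 was never seen in training
incoming = pd.Series(["ChannelId_3", "ChannelId_9", "Rare"])

# Direct mapping leaves the unseen label as NaN
encoded = incoming.map(ordinal_label)
n_unmapped = int(encoded.isnull().sum())

# Guard: replace labels absent from the mapping with 'Rare', then map
known = incoming.isin(list(ordinal_label))
encoded_safe = incoming.where(known, "Rare").map(ordinal_label)

print(n_unmapped)
print(encoded_safe.tolist())
```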

In [89]:
# check absence of na
[var for var in features if X_train[var].isnull().sum()>0]

[]

In [90]:
# check absence of na
[var for var in features if X_test[var].isnull().sum()>0]

[]

In [91]:
# fit scaler
scaler = MinMaxScaler() # create an instance
scaler.fit(X_train[features]) #  fit  the scaler to the train set for later use

# transform the train and test set, and add on the TransactionId and FraudResult variables
X_train = pd.concat([X_train[['TransactionId', 'FraudResult']].reset_index(drop=True),
                    pd.DataFrame(scaler.transform(X_train[features]), columns=features)],
                    axis=1)

X_test = pd.concat([X_test[['TransactionId', 'FraudResult']].reset_index(drop=True),
                    pd.DataFrame(scaler.transform(X_test[features]), columns=features)],
                    axis=1)
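
The fitted scaler is needed again at scoring time, so it has to be persisted alongside the model. Here is a minimal sketch using `joblib.dump` and `joblib.load`. The file name `scaler.pkl`, the temporary directory and the toy array are hypothetical, chosen only so the example runs on its own.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy training matrix (illustration only): column ranges are [0, 10] and [10, 30]
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
scaler = MinMaxScaler().fit(train)

# Hypothetical file name and location
path = os.path.join(tempfile.mkdtemp(), "scaler.pkl")
joblib.dump(scaler, path)

# Later, in the scoring service: restore the scaler and transform new rows
restored = joblib.load(path)
new_row = np.array([[5.0, 40.0]])
scaled = restored.transform(new_row)
print(scaled)
```

Note that values outside the range seen in training scale to numbers outside [0, 1], as the second column of the new row does here; the scaler does not clip by default.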

In [92]:
X_train.head()

,TransactionId,FraudResult,ChannelId,PricingStrategy,ProviderId,Value,Transaction_year,Transaction_month
0,TransactionId_109305,0,0.333333,0.5,0.4,0.253815,1.0,0.0
1,TransactionId_74975,0,0.333333,1.0,0.4,0.403209,0.0,0.909091
2,TransactionId_51548,0,0.666667,0.5,0.2,0.493153,1.0,0.0
3,TransactionId_98435,0,0.666667,0.5,0.2,0.448181,1.0,0.0
4,TransactionId_67877,0,0.666667,0.5,0.2,0.358237,0.0,1.0


In [93]:
# check absence of na
[var for var in X_train.columns if X_train[var].isnull().sum()>0]

[]

In [94]:
# check absence of na
[var for var in X_test.columns if X_test[var].isnull().sum()>0]

[]

In [95]:
# capture the target
#y_train = X_train['FraudResult']
#y_test = X_test['FraudResult']


#### Algorithms

In [97]:
# Test options and evaluation metric
num_folds = 10
scoring = 'f1'  # the competition metric; accuracy is uninformative with so few fraud cases
seed = 0

In [98]:
# Make predictions on validation dataset
alg = DecisionTreeClassifier()
alg.fit(X_train.drop(columns=['TransactionId','FraudResult']), y_train)
predictions = alg.predict(X_test.drop(columns=['TransactionId','FraudResult']))
print('accuracy_score')
print(accuracy_score(y_test, predictions))
print(' ')
print('confusion_matrix')
print(confusion_matrix(y_test, predictions))
print(' ')
print('classification_report')
print(classification_report(y_test, predictions))

accuracy_score
0.999686422076
 
confusion_matrix
[[9548    2]
 [   1   16]]
 
classification_report
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      9550
           1       0.89      0.94      0.91        17

   micro avg       1.00      1.00      1.00      9567
   macro avg       0.94      0.97      0.96      9567
weighted avg       1.00      1.00      1.00      9567
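
Since the competition is scored on F1, it is worth reading that number directly off the confusion matrix rather than relying on accuracy. The short calculation below uses the four counts printed above (rows are true classes, columns are predicted classes, as in sklearn's `confusion_matrix`) and reproduces the precision, recall and F1 reported for the fraud class.

```python
# Counts from the confusion matrix above: [[tn, fp], [fn, tp]]
tn, fp, fn, tp = 9548, 2, 1, 16

# Precision: share of predicted frauds that really are frauds
precision = tp / (tp + fp)

# Recall: share of real frauds that the model caught
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Accuracy, for comparison: dominated by the 9,550 non-fraud rows
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 4))
```

With only 17 fraud cases in the test set, one or two misclassifications move F1 noticeably while accuracy barely changes.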

