# Prepare the real-time scoring model

The team at Woodgrove Bank has provided exported CSV copies of historical data for you to train your model against. Run the following cells to load the required libraries and read the data sets from the local `./data` folder.

In [None]:
#!pip install --upgrade azureml-train-automl-runtime==1.36.0
#!pip install --upgrade azureml-automl-runtime==1.36.0
#!pip install --upgrade scikit-learn
#!pip install --upgrade numpy

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# sklearn.externals.joblib was deprecated in 0.21
from sklearn import __version__ as sklearnver
from packaging.version import Version
if Version(sklearnver) < Version("0.21.0"):
    from sklearn.externals import joblib
else:
    import joblib

import numpy as np
import pandas as pd


account_df = pd.read_csv('./data/Account_Info.csv')
fraud_df = pd.read_csv('./data/Fraud_Transactions.csv')
untagged_df = pd.read_csv( './data/Untagged_Transactions.csv')



View the fraud dataframe.

In [4]:
fraud_df.head(3)

Unnamed: 0,transactionID,accountID,transactionAmount,transactionCurrencyCode,transactionDate,transactionTime,localHour,transactionDeviceId,transactionIPaddress
0,65020E58-781D-4FFC-BEF2-0FDF87BE671D,A985156981092344,1148.6,CAD,20130402,14450,20.0,,66.46
1,8EC10EBC-F4BB-4148-9073-4B2BA93C9B34,A985156981066925,150.31,USD,20130403,135015,8.0,,67.8
2,CD624353-E473-4EE0-8BDF-A818627AA1D1,A985156970845915,99.98,USD,20130403,161950,11.0,,96.57


View the account info dataframe.

In [5]:
account_df.head(3)

Unnamed: 0,accountID,transactionDate,transactionTime,accountOwnerName,accountAddress,accountPostalCode,accountCity,accountState,accountCountry,accountOpenDate,accountAge,isUserRegistered,paymentInstrumentAgeInAccount,numPaymentRejects1dPerUser
0,A1688852564389340,20130401,2932,,,30170-000,,MG,BR,,1.0,False,0.000694,0.0
1,A985156162171434,20130401,3005,,,da11 9ps,,England,GB,,61.0,True,55.490972,0.0
2,A844427191626038,20130401,12302,,,4671,,Queensland,AU,,1.0,False,0.002083,0.0


View the untagged transactions dataframe.

In [6]:
# Reorder the dataframe's columns alphabetically
cols=untagged_df.columns.tolist()
cols.sort()
untagged_df=untagged_df[cols]

untagged_df.head(3)

Unnamed: 0,accountID,browserLanguage,browserType,cardNumberInputMethod,cardType,cvvVerifyResult,digitalItemCount,ipCountryCode,ipPostcode,ipState,...,transactionCurrencyConversionRate,transactionDate,transactionDeviceId,transactionDeviceType,transactionID,transactionIPaddress,transactionMethod,transactionScenario,transactionTime,transactionType
0,A985156985579195,en-AU,,,VISA,M,1,au,3000,victoria,...,,20130409,,,5EAC1EBD-1428-4593-898E-F4B56BC3FA06,121.219,,A,95040,P
1,A985156966855837,en-AU,,,VISA,M,0,us,14534,new york,...,,20130409,,,48C88D1C-3705-472B-A4A3-5FCE45A5429B,216.15,,A,94256,P
2,A844428012992486,nn-NO,,,MC,M,1,no,1006,oslo,...,,20130409,,,13B2A110-EA04-42CD-88CC-A85814A5C961,94.246,,A,95257,P


## Prepare data

The raw data has some issues that we need to clean up before we can use it to train a model. We do that in the following cells.

### Prepare accounts

Begin by cleaning the data in the accounts data set.
Remove the columns that have very few or no values: `accountOwnerName`, `accountAddress`, `accountCity` and `accountOpenDate`.

In [7]:
account_df_clean = account_df[["accountID", "transactionDate", "transactionTime", 
                               "accountPostalCode", "accountState", "accountCountry", 
                               "accountAge", "isUserRegistered", "paymentInstrumentAgeInAccount", 
                               "numPaymentRejects1dPerUser"]]

Create a copy of the dataframe so our data manipulation does not affect the original.

In [8]:
account_df_clean = account_df_clean.copy()

Let's ensure that values that are not numeric (e.g., they have incorrect string values or garbage data) are converted to NaN and then we can fill those NaN values with 0.

In [9]:
# Coerce non-numeric values to NaN, then replace the NaN values with 0
account_df_clean['paymentInstrumentAgeInAccount'] = pd.to_numeric(account_df_clean['paymentInstrumentAgeInAccount'], errors='coerce')
account_df_clean['paymentInstrumentAgeInAccount'] = account_df_clean['paymentInstrumentAgeInAccount'].fillna(0)
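
To make the coerce-then-fill behavior concrete, here is a minimal sketch on a toy series. The values are hypothetical and are not taken from the bank's data.

```python
import pandas as pd

# Toy column mixing numeric strings with garbage values (hypothetical data).
raw = pd.Series(["55.49", "0.002", "abc", None, "12"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
as_numbers = pd.to_numeric(raw, errors="coerce")

# Replace the NaN placeholders with 0, mirroring the cell above.
cleaned = as_numbers.fillna(0)
print(cleaned.tolist())  # [55.49, 0.002, 0.0, 0.0, 12.0]
```

Note that filling with 0 makes a missing value indistinguishable from a genuine 0, which is acceptable here only because the notebook treats both the same way.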

Next, let's convert the `numPaymentRejects1dPerUser` column so that it has a datatype of `float` instead of `object`.

In [10]:
account_df_clean["numPaymentRejects1dPerUser"] = account_df_clean["numPaymentRejects1dPerUser"].astype(float)

In [11]:
account_df_clean["numPaymentRejects1dPerUser"].value_counts()

numPaymentRejects1dPerUser
0.0     191382
1.0       5500
2.0       1476
3.0        562
4.0        254
5.0        136
6.0         51
7.0         30
8.0         27
10.0        24
9.0         14
17.0         9
14.0         6
13.0         4
16.0         3
32.0         2
12.0         2
11.0         2
15.0         1
23.0         1
18.0         1
26.0         1
28.0         1
29.0         1
Name: count, dtype: int64

`account_df_clean` is now ready for use in modeling.

### Prepare untagged transactions

Next, clean up the untagged transactions data set. It contains 16 columns whose values are all null. Let's drop these columns to simplify our data set.

In [12]:
untagged_df_clean = untagged_df.dropna(axis=1, how="all").copy()

We can examine the count of non-null values and the inferred data type for each column, for example by calling `untagged_df_clean.info()`. Looking at that output, we have some work to do. For a start, some columns have fewer than 200,000 non-null values, which means they contain null values that we need to fix.
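
The original inspection cell is not shown here. One way to run that kind of check, sketched on a small hypothetical dataframe rather than the real data, is:

```python
import pandas as pd

# Hypothetical stand-in for the untagged transactions (3 rows, some nulls).
toy = pd.DataFrame({
    "localHour": [13.0, None, 20.0],
    "cardType": ["VISA", None, None],
    "transactionAmount": [10.0, 5.5, 99.0],
})

# info() prints the non-null count and inferred dtype of every column.
toy.info()

# The same counts as a Series, which is easier to filter programmatically.
non_null_counts = toy.notna().sum()
columns_with_nulls = non_null_counts[non_null_counts < len(toy)].index.tolist()
print(columns_with_nulls)  # ['localHour', 'cardType']
```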

Let's clean up the `localHour` field.

Replace null values in `localHour` with `-99`. Also replace values of `-1` with `-99`.

In [13]:
untagged_df_clean["localHour"] = untagged_df_clean["localHour"].fillna(-99)
untagged_df_clean.loc[untagged_df_clean.loc[:,"localHour"] == -1, "localHour"] = -99

Confirm the values now look good.

In [14]:
untagged_df_clean["localHour"].value_counts()

localHour
 13.0    12783
 15.0    12720
 14.0    12694
 10.0    12439
 11.0    12372
 12.0    12315
 16.0    11929
 19.0    11880
 20.0    11588
 18.0    11539
 17.0    11458
 9.0     11200
 21.0     9728
 8.0      8768
-99.0     7037
 22.0     6986
 7.0      5368
 23.0     4716
 6.0      3094
 0.0      2944
 1.0      1859
 5.0      1596
 2.0      1122
 4.0       969
 3.0       896
Name: count, dtype: int64

Clean up the remaining null fields:
- Fix missing values in the location fields by setting them to `NA` for unknown.
- Set `isProxyIP` to `False`.
- Set `cardType` to `U` for unknown (which is a new level).
- Set `cvvVerifyResult` to `N`, so that transactions that failed because the wrong CVV2 number was entered, or because no CVV2 number was entered, are treated as if there was no CVV2 match.

In [15]:
untagged_df_clean = untagged_df_clean.fillna(value={"ipState": "NA", "ipPostcode": "NA", "ipCountryCode": "NA", 
                               "isProxyIP":False, "cardType": "U", 
                               "paymentBillingPostalCode" : "NA", "paymentBillingState":"NA",
                               "paymentBillingCountryCode" : "NA", "cvvVerifyResult": "N"
                              })

Confirm all null values have been addressed.
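
A minimal sketch of such a confirmation, assuming a per-column null count is what is wanted. It uses a hypothetical two-row frame, not the real data.

```python
import pandas as pd

# Hypothetical frame with gaps in two of the columns handled above.
toy = pd.DataFrame({
    "ipState": ["victoria", None],
    "cardType": [None, "MC"],
})

filled = toy.fillna(value={"ipState": "NA", "cardType": "U"})

# isnull().sum() counts the remaining nulls per column; all zeros means we are done.
remaining_nulls = filled.isnull().sum()
print(remaining_nulls)
```

On the real data the equivalent check would be `untagged_df_clean.isnull().sum()`. Any column that still reports a non-zero count needs its own fill rule.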

The `transactionScenario` column provides no insights because all rows have the same `A` value. Let's drop that column. Same idea for the `transactionType` column.

In [16]:
del untagged_df_clean["transactionScenario"]

In [17]:
del untagged_df_clean["transactionType"]

`untagged_df_clean` is now ready for use in modeling.

### Prepare fraud transactions

Now move on to preparing the fraud transactions data set.

The `transactionDeviceId` has no meaningful values, so we will drop it.

In [18]:
fraud_df_clean = fraud_df.copy()
del fraud_df_clean['transactionDeviceId']

The fraud data set has a `localHour` field with missing values that we need to fill, just as we did for the untagged transactions data set.

In [19]:
fraud_df_clean["localHour"] = fraud_df_clean["localHour"].fillna(-99)

Examine your work. You should have 8640 non-null values in each column.

`fraud_df_clean` is now ready for use in modeling.

## Create labels

The goal is to create a dataframe with all transactions, where each transaction is tagged via the `isFraud` column with a value of `0` - no fraud or `1` - fraudulent. 

Any transactions that appear in untagged_transactions dataframe that also appear in the fraud dataframe will be marked as fraudulent. 

The remaining transactions will be marked as not fraudulent. 

Run the following cells to create the labels series.

In [20]:
all_labels = untagged_df_clean["transactionID"].isin(fraud_df_clean["transactionID"])
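
The labeling step relies on `Series.isin`, which returns one boolean per row. A small sketch with hypothetical transaction IDs shows the behavior:

```python
import pandas as pd

# Hypothetical IDs: four transactions, two of which appear in the fraud list.
transactions = pd.DataFrame({"transactionID": ["T1", "T2", "T3", "T4"]})
known_fraud = pd.DataFrame({"transactionID": ["T2", "T4", "T9"]})

# True where the transaction ID also appears in the fraud data set.
labels = transactions["transactionID"].isin(known_fraud["transactionID"])
print(labels.tolist())   # [False, True, False, True]
print(int(labels.sum())) # 2
```

Fraud IDs with no matching transaction (`T9` here) simply do not contribute a label.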

In [21]:
all_transactions = untagged_df_clean

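Next, package the cleanup steps as custom scikit-learn transformers and save them to a module file, so that the same logic can be imported by the pipeline in this notebook and later by the web service.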

In [1]:
# Write the custom transformers out to ./customestimators.py
custom_estimators_source = """
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class NumericCleaner(BaseEstimator, TransformerMixin):
    # Replaces missing and -1 localHour values with the sentinel -99.
    def fit(self, X, y=None):
        print("NumericCleaner.fit called")
        return self
    def transform(self, X):
        print("NumericCleaner.transform called")
        X = X.copy()  # work on a copy so the caller's data frame is not modified
        X["localHour"] = X["localHour"].fillna(-99)
        X.loc[X["localHour"] == -1, "localHour"] = -99
        return X

class CategoricalCleaner(BaseEstimator, TransformerMixin):
    # Fills missing cardType with "U" and missing cvvVerifyResult with "N".
    def fit(self, X, y=None):
        print("CategoricalCleaner.fit called")
        return self
    def transform(self, X):
        print("CategoricalCleaner.transform called")
        X = X.fillna(value={"cardType": "U", "cvvVerifyResult": "N"})
        return X
"""

with open("./customestimators.py", "w") as file:
    file.write(custom_estimators_source)

Next, load the estimators.

In [23]:
from customestimators import NumericCleaner, CategoricalCleaner

Now build the pipeline that will prepare the data. 

The gist of the following cell is to split the data preparation into two paths, splitting the data set vertically by column, and then combine the results. The `ColumnTransformer` effectively concatenates the output of the numeric transformations with the output of the categorical transformations.

- Numeric Transformer Pipeline: We use the custom transformer created previously to clean up the numeric columns. Since the model you will train in this notebook is a Support Vector Machine classifier, we need to standardize the scale of the numeric values, which is what the `StandardScaler` provides.
- Categorical Transformer Pipeline: We use the custom transformer created previously to clean up the categorical columns. Then we one-hot encode each value of each categorical column, resulting in a wider data frame with one column for each possible value (and 1 appearing in rows that had that value).

In [24]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features=["transactionAmountUSD", "transactionDate", "transactionTime", "localHour", 
                  "transactionIPaddress", "digitalItemCount", "physicalItemCount"]

categorical_features=["transactionCurrencyCode", "browserLanguage", "paymentInstrumentType", "cardType", "cvvVerifyResult"]                           

numeric_transformer = Pipeline(steps=[
    ('cleaner', NumericCleaner()),
    ('scaler', StandardScaler())
])
                               
categorical_transformer = Pipeline(steps=[
    ('cleaner', CategoricalCleaner()),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
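
The `handle_unknown='ignore'` setting matters once the model is scoring new data, because a category that was never seen during fitting is encoded as all zeros instead of raising an error. A minimal sketch with hypothetical card types:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Fit on three known levels (hypothetical training data).
train = pd.DataFrame({"cardType": ["VISA", "MC", "U"]})
encoder = OneHotEncoder(handle_unknown="ignore").fit(train)

# "AMEX" was not seen during fit; "MC" was.
new_rows = pd.DataFrame({"cardType": ["AMEX", "MC"]})
encoded = encoder.transform(new_rows).toarray()

print(encoder.categories_[0].tolist())  # ['MC', 'U', 'VISA']
print(encoded.tolist())                 # [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
```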

Let's confirm that we can run all our historical data through this transformation pipeline, and observe the resulting shape.

In [25]:
preprocessed_result = preprocessor.fit_transform(all_transactions)

NumericCleaner.fit called
NumericCleaner.transform called
CategoricalCleaner.fit called
CategoricalCleaner.transform called


In [26]:
preprocessed_result.shape

(200000, 292)
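
The 292 output columns are the 7 numeric features plus one column for each distinct value found in the categorical features. Columns not listed in either group are dropped, because that is the `ColumnTransformer` default. The following sketch reproduces the same arithmetic on a toy frame with hypothetical column names:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: 2 numeric columns, 2 categorical columns, 1 unused column.
toy = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 40.0],
    "hour": [1.0, 2.0, 3.0, 4.0],
    "currency": ["USD", "CAD", "USD", "EUR"],  # 3 distinct values
    "card": ["VISA", "MC", "VISA", "VISA"],    # 2 distinct values
    "note": ["a", "b", "c", "d"],              # not listed, so it is dropped
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["amount", "hour"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["currency", "card"]),
])

result = toy_preprocessor.fit_transform(toy)
print(result.shape)  # (4, 7): 2 numeric + 3 currency + 2 card columns
```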

In [28]:
pd.DataFrame(preprocessed_result.todense()).head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,282,283,284,285,286,287,288,289,290,291
0,-0.152051,-1.422822,-0.51683,0.417653,0.265748,0.218614,-0.357016,0.0,1.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.555361,-1.422822,-0.527675,-5.086697,1.964129,-1.403816,0.593898,0.0,1.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,-0.039733,-1.422822,-0.513828,-0.00217,-0.216818,0.218614,-0.357016,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2.53365,-1.422822,-0.407303,0.184418,-0.239646,-1.403816,3.446637,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.15022,-1.422822,-0.399612,0.044477,-0.199231,0.218614,-0.357016,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


## Create pipeline and train a simple model

Now you will build upon the transformation pipeline you created previously to train a model to classify rows as fraudulent or not fraudulent.

Run the following cells to make sure you've imported the dependencies for the pipeline (you probably already have, but having them clearly loaded here will help you when porting your code to a web service).

In [29]:
from customestimators import NumericCleaner, CategoricalCleaner
from sklearn.model_selection import train_test_split

As might be obvious, our data has a lot of samples that are not fraudulent. If we proceed to train a model on it as is, we will effectively train the model to predict non-fraud. This situation, where one class (non-fraud) appears much more often than the other (fraud), is called a class imbalance. To mitigate its effect, we can reduce the number of non-fraud samples so that we have the same number of non-fraud and fraud samples.

Run the following cell to downsample by randomly selecting 1,151 non-fraud rows, and then union these rows with our 1,151 fraud rows.

> Feel free to ignore any `SettingWithCopyWarning` warnings in the cell output below.

In [33]:
only_fraud_samples = all_transactions.loc[all_labels == True]
only_fraud_samples["label"] = True
only_non_fraud_samples = all_transactions.loc[all_labels == False]
only_non_fraud_samples["label"] = False
random_non_fraud_samples = only_non_fraud_samples.sample(n=1151, replace=False, random_state=42)
balanced_transactions = pd.concat([random_non_fraud_samples, only_fraud_samples])

balanced_transactions["label"].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  only_fraud_samples["label"] = True
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  only_non_fraud_samples["label"] = False


label
False    1151
True     1151
Name: count, dtype: int64
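
The `SettingWithCopyWarning` messages come from assigning a new column to a slice of `all_transactions`. As a sketch of one alternative, not the notebook's own code, `DataFrame.assign` returns a new frame and avoids the warning. The data here is hypothetical.

```python
import pandas as pd

# Hypothetical data: 10 transactions, 3 of them fraudulent.
toy = pd.DataFrame({"amount": range(10)})
toy_labels = pd.Series([True, False, False, True, False,
                        False, False, True, False, False])

# assign() builds a new dataframe, so nothing is written to a slice.
fraud = toy.loc[toy_labels].assign(label=True)
non_fraud = toy.loc[~toy_labels].assign(label=False)

# Downsample the majority class to the size of the minority class.
sampled_non_fraud = non_fraud.sample(n=len(fraud), replace=False, random_state=42)
balanced = pd.concat([sampled_non_fraud, fraud])

print(balanced["label"].value_counts())
```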

Next, you need to separate out the label column from the dataframe so the labels are not used as input features:

In [34]:
balanced_labels = balanced_transactions["label"]
del balanced_transactions["label"]

Now create two subsets of the data frame: one that will be used for training the model (`X_train` and `y_train`) and another that is reserved for testing its performance (`X_test` and `y_test`).

In [35]:
X_train, X_test, y_train, y_test = train_test_split(balanced_transactions, balanced_labels, 
                                                    test_size=0.2, random_state=42)
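
The split above is purely random, so the class mix in the test set can drift slightly from 50/50. If an exact class ratio in both subsets matters, `train_test_split` accepts a `stratify` argument. This is an optional variation, shown on hypothetical data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced data: 10 fraud and 10 non-fraud rows.
features = pd.DataFrame({"amount": range(20)})
labels = pd.Series([True, False] * 10)

# stratify=labels keeps the class proportions identical in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(X_tr), len(X_te))  # 16 4
print(int(y_te.sum()))       # 2 fraud rows out of the 4 test rows
```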

Now train the model. In this case, you will use the `LinearSVC` class.

> Feel free to ignore any `ConvergenceWarning` warnings in the cell output below

In [36]:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Chain the preprocessing step and the classifier.
# Pipeline expects its steps as a list of (name, estimator) pairs.
svm_clf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("linear_svc", LinearSVC(C=1, loss="hinge"))
])
svm_clf.fit(X_train, y_train)

NumericCleaner.fit called
NumericCleaner.transform called
CategoricalCleaner.fit called
CategoricalCleaner.transform called




Test the model predicting against a single row from the test set.

In [37]:
svm_clf.predict(X_test[0:1])

NumericCleaner.transform called
CategoricalCleaner.transform called


array([ True])

Next, evaluate the model by examining how well it is predicting against all data in the training set.

In [38]:
y_train_preds = svm_clf.predict(X_train)

NumericCleaner.transform called
CategoricalCleaner.transform called


Use a confusion matrix to see how often your model correctly predicted non-fraud and fraud (the top left and bottom right values). Also examine how your model made mistakes (the bottom left and top right values). In the output below, the columns are predicted non-fraud and predicted fraud, and the rows are actual non-fraud and actual fraud (as described by the training data).

In [39]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
confusion_matrix(y_train, y_train_preds)

array([[382, 539],
       [112, 808]], dtype=int64)

Take a look at the performance of your model using the common set of metrics for a classifier. Do you think this is good or bad?

In [40]:
print("Accuracy:", accuracy_score(y_train, y_train_preds))
print("Precision:", precision_score(y_train, y_train_preds))
print("Recall:", recall_score(y_train, y_train_preds))
print("F1:", f1_score(y_train, y_train_preds))
print("AUC:", roc_auc_score(y_train, y_train_preds))

Accuracy: 0.6463878326996197
Precision: 0.5998515219005197
Recall: 0.8782608695652174
F1: 0.7128363475959418
AUC: 0.6465137138271255
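
Each of these metrics can be derived by hand from the training confusion matrix shown earlier, which is a useful sanity check. The sketch below redoes the arithmetic with those four counts. It also shows that when `roc_auc_score` is given hard True/False predictions, as it is here, the result equals the average of the true positive rate and the true negative rate.

```python
import numpy as np

# Training confusion matrix from the output above:
# rows = actual (non-fraud, fraud), columns = predicted (non-fraud, fraud).
cm = np.array([[382, 539],
               [112, 808]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)    # of the rows flagged as fraud, how many were fraud
recall = tp / (tp + fn)       # of the actual fraud rows, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)  # true negative rate
auc_from_hard_labels = (recall + specificity) / 2

print(round(accuracy, 4), round(precision, 4), round(recall, 4),
      round(f1, 4), round(auc_from_hard_labels, 4))
# 0.6464 0.5999 0.8783 0.7128 0.6465
```

The low specificity (about 0.41) shows where the model struggles: it flags more than half of the legitimate transactions as fraud.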


Given that this is just a parsimonious model, it provides a start that performs better than random (as indicated by the AUC being greater than 0.5). There is more work (such as additional feature engineering) that you would want to do to improve on the current performance before deploying it in production. A parsimonious model helps us see whether the desired classification is possible given the data, and allows us to quickly get to something we can deploy as a service to enable integration early on. Then we can iterate, deploying improved versions of the model.

Now evaluate the model against the test data set, which contains data the trained model has not seen. How does it perform?

In [41]:
y_test_preds = svm_clf.predict(X_test)
print(confusion_matrix(y_test, y_test_preds))
print(accuracy_score(y_test, y_test_preds))
print("Accuracy:", accuracy_score(y_test, y_test_preds))
print("Precision:", precision_score(y_test, y_test_preds))
print("Recall:", recall_score(y_test, y_test_preds))
print("F1:", f1_score(y_test, y_test_preds))
print("AUC:", roc_auc_score(y_test, y_test_preds))

NumericCleaner.transform called
CategoricalCleaner.transform called
[[ 98 132]
 [ 34 197]]
0.6399132321041214
Accuracy: 0.6399132321041214
Precision: 0.5987841945288754
Recall: 0.8528138528138528
F1: 0.7035714285714286
AUC: 0.639450404667796
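
The AUC values in this notebook are computed from hard True/False predictions. AUC is usually computed from a continuous score instead, and `LinearSVC` exposes one through `decision_function`. The sketch below shows the difference on synthetic data generated with `make_classification`. The data set and variable names are assumptions for illustration, and the printed numbers say nothing about the fraud model itself.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic binary classification problem (not the bank's data).
X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge", max_iter=10000)),
])
clf.fit(X_tr, y_tr)

# AUC from hard labels: a single operating point on the ROC curve.
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

# AUC from signed distances to the decision boundary: uses the full ranking.
auc_from_scores = roc_auc_score(y_te, clf.decision_function(X_te))

print("AUC (labels):", auc_from_labels)
print("AUC (scores):", auc_from_scores)
```

On the fraud pipeline the equivalent call would be `roc_auc_score(y_test, svm_clf.decision_function(X_test))`.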


The overall performance of the model against data it has not seen (the test data) is similar to how it performs with the training data. That's a good sign, indicating we did not overfit the model to the training data.

Next, let's look at the steps to prepare the model for deployment as a web service.

## Save the model to disk

In preparation for deploying the model, you need to save the model to disk.

In [42]:
joblib.dump(svm_clf, 'fraud_score.pkl')

['fraud_score.pkl']
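
`joblib.dump` serializes the whole fitted pipeline, preprocessing steps included. A minimal round-trip sketch, using a toy model and a temporary file rather than `fraud_score.pkl`:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Tiny synthetic training set (hypothetical data).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([False, False, False, True, True, True])

toy_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge", max_iter=10000)),
])
toy_model.fit(X, y)

# Save to a temporary directory, then load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(toy_model, model_path)
    restored_model = joblib.load(model_path)

# The restored pipeline should predict exactly like the original.
predictions_match = bool((restored_model.predict(X) == toy_model.predict(X)).all())
print(predictions_match)
```

One caveat for the real pipeline: the pickle stores references to `NumericCleaner` and `CategoricalCleaner` by module and class name, so `customestimators.py` must be importable wherever the model is loaded.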

## Test loading the model

Next, simulate reloading the model from disk, just as the web service (which you will create in a moment) will have to do.

In [51]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from customestimators import NumericCleaner, CategoricalCleaner

# sklearn.externals.joblib was deprecated in 0.21
from sklearn import __version__ as sklearnver
from packaging.version import Version
if Version(sklearnver) < Version("0.21.0"):
    from sklearn.externals import joblib
else:
    import joblib

desired_cols = ['accountID',
 'browserLanguage',
 'cardType',
 'cvvVerifyResult',
 'digitalItemCount',
 'ipCountryCode',
 'ipPostcode',
 'ipState',
 'isProxyIP',
 'localHour',
 'paymentBillingCountryCode',
 'paymentBillingPostalCode',
 'paymentBillingState',
 'paymentInstrumentType',
 'physicalItemCount',
 'transactionAmount',
 'transactionAmountUSD',
 'transactionCurrencyCode',
 'transactionDate',
 'transactionID',
 'transactionIPaddress',
 'transactionTime']

scoring_pipeline = joblib.load('fraud_score.pkl')

In [52]:
untagged_df_fresh = pd.read_csv('./data/Untagged_Transactions.csv')[desired_cols]

test_pipeline_preds = scoring_pipeline.predict(untagged_df_fresh)
test_pipeline_preds


NumericCleaner.transform called
CategoricalCleaner.transform called


array([ True,  True, False, ..., False, False, False])

In [53]:
one_row = untagged_df_fresh.iloc[:1]
test_pipeline_preds2 = scoring_pipeline.predict(one_row)
test_pipeline_preds2

NumericCleaner.transform called
CategoricalCleaner.transform called


array([ True])