In this notebook we will train a [Logistic Regression model](https://en.wikipedia.org/wiki/Logistic_regression) to distinguish between feature vectors corresponding to legitimate transactions and fraudulent transactions. 

Logistic Regression is a classic statistical method for binary classification, meaning that each sample belongs to one of two classes. In our case these are _legitimate_ and _fraudulent_ transactions.
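
As a rough illustration of the idea, a logistic regression model computes a weighted sum of the features and passes it through the logistic (sigmoid) function to obtain a probability, which is then thresholded to pick a class. The sketch below uses hypothetical coefficients and a hypothetical two-element feature vector chosen only for illustration; they are not taken from the model trained in this notebook.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for a single feature vector."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

# Hypothetical coefficients and features, for illustration only.
weights = [1.5, -2.0]
bias = 0.5
features = [1.0, 0.5]

# score = 1.5 * 1.0 + (-2.0) * 0.5 + 0.5 = 1.0
p = predict_proba(features, weights, bias)
label = "fraud" if p >= 0.5 else "legitimate"
print(label)
```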

In [1]:
import numpy as np
import pandas as pd
df = pd.read_parquet("fraud-cleaned-sample.parquet")

# Train/test split

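We need to split our data set into two. One part will be used for training the model, and the other will be a testing set we can use to evaluate the model. We're using time-series data, so we'll split the data set based on time: transactions from the earliest 70% of the time range covered by the data are used for training, and the remainder for testing. We also work with a random 10% sample of the rows.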

In [2]:
first = df['timestamp'].min()
last = df['timestamp'].max()
cutoff = first + ((last - first) * 0.7)

df = df.sample(frac=0.1).copy()

train = df[df['timestamp'] <= cutoff]
test = df[df['timestamp'] > cutoff]

We also load in the feature engineering pipeline stage which we developed in [notebook 2](02-feature-engineering.ipynb).

In [3]:
import cloudpickle as cp
# Use a context manager so the file handle is closed once the pipeline is loaded.
with open('feature_pipeline.sav', 'rb') as f:
    feature_pipeline = cp.load(f)

## Dealing with Imbalanced Classes

When the training data set contains unequal representation from each of your classes we say we are dealing with _imbalanced classes_. In our data set only approximately 2% of the samples are fraudulent, and the remaining 98% are legitimate. Thus we have imbalanced classes. 

This causes problems for a few reasons:
1. a model which classified all new transactions as 'legitimate' would be correct 98% of the time. This high accuracy can trick you into thinking that your model is working well, despite it just returning 'legitimate' for every sample it sees. 
2. even if your model tries to learn patterns in the data, it may struggle to learn from the 'fraudulent' data since there simply isn't enough of it.
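
The first point can be checked with a minimal sketch on made-up labels (2% fraud, 98% legitimate, mirroring the proportions quoted above but not drawn from the notebook's data). A "model" that always answers 'legitimate' scores high on accuracy while catching no fraud at all.

```python
# Synthetic labels for illustration: 20 fraudulent and 980 legitimate samples.
labels = ["fraud"] * 20 + ["legitimate"] * 980

# A trivial "model" that labels every transaction as legitimate.
predictions = ["legitimate"] * len(labels)

# Accuracy: fraction of predictions that match the true label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall for the fraud class: fraction of true fraud that was flagged.
fraud_caught = sum(p == "fraud" and y == "fraud" for p, y in zip(predictions, labels))
fraud_recall = fraud_caught / labels.count("fraud")

print(f"accuracy: {accuracy:.2f}")          # accuracy: 0.98
print(f"fraud recall: {fraud_recall:.2f}")  # fraud recall: 0.00
```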


There are a few approaches we could take to tackle the problem, but the two we are going to use today are: 
1. use metrics which are more informative than simply counting how often the model makes a correct classification. 
2. weight the samples in inverse proportion to the frequency of their label. These weights are passed to the logistic regression model during training, so that the penalty for misclassifying a sample is scaled by its weight. 
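
To see why this kind of weighting helps, here is a minimal sketch with made-up counts (not taken from the notebook's data). With two classes, giving each class a weight equal to the frequency of the *other* class is proportional to inverse-frequency weighting, and it makes the total weight carried by each class equal.

```python
# Hypothetical class counts, for illustration only.
n_fraud, n_legit = 20, 980

f_fraud = n_fraud / (n_fraud + n_legit)   # frequency of the rare class
f_legit = 1 - f_fraud                     # frequency of the common class

# Each class is weighted by the frequency of the other class.
w_fraud = 1 - f_fraud
w_legit = f_fraud

# The ratio of the weights matches the ratio of the inverse frequencies.
ratio_weights = w_fraud / w_legit
ratio_inverse = (1 / f_fraud) / (1 / f_legit)

# Total weight per class: both come to about 19.6, so neither class dominates.
total_fraud = n_fraud * w_fraud
total_legit = n_legit * w_legit

print(ratio_weights, ratio_inverse)
print(total_fraud, total_legit)
```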


In this next cell we compute the weights for each of the data labels. 

In [4]:
# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning.
train = train.copy()
# Each class is weighted by the frequency of the other class.
fraud_frequency = train[train["label"] == "fraud"]["timestamp"].count() / train["timestamp"].count()
train.loc[train["label"] == "legitimate", "weights"] = fraud_frequency
train.loc[train["label"] == "fraud", "weights"] = (1 - fraud_frequency)


## Training the model

We're now ready to train our Logistic Regression model. We pass the model the feature vectors (generated using our `feature_pipeline` from the previous notebook) and the weights we computed above.

In [5]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=500)

svecs = feature_pipeline.fit_transform(train)
lr.fit(svecs, train["label"], sample_weight=train["weights"])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Model Validation 

We need to check how well the model performs on data it wasn't trained on. We use the model we just trained to make predictions for the data in our _test_ set, and compare those predictions to the truth. 



In [6]:
from sklearn.metrics import classification_report

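# Reuse the pipeline as fitted on the training data (assumes it exposes `transform`),
# so that test features are encoded exactly as the model saw them during training.
predictions = lr.predict(feature_pipeline.transform(test))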
print(classification_report(test.label.values, predictions))


              precision    recall  f1-score   support

       fraud       0.15      0.95      0.26      1447
  legitimate       1.00      0.89      0.94     73759

    accuracy                           0.90     75206
   macro avg       0.57      0.92      0.60     75206
weighted avg       0.98      0.90      0.93     75206



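The report shows that the model catches most fraudulent transactions (recall of 0.95 for the `fraud` class), but its precision for that class is only 0.15: most of the transactions it flags as fraud are in fact legitimate. The overall accuracy of 0.90 is dominated by the much larger `legitimate` class.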

We can visualise the accuracy of these classifications in a confusion matrix.

In [7]:
from mlworkflows import plot
df, chart = plot.binary_confusion_matrix(test["label"], predictions)
chart

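Viewing the raw counts emphasises that the model misclassifies a lot of 'legitimate' transactions as 'fraudulent': 7800 of the 73759 legitimate transactions are flagged as fraud, while only 77 of the 1447 fraudulent ones are missed. Note that, judging by the support figures in the classification report (1370 + 77 = 1447 fraudulent transactions), the first label column of the table holds the true class and the second holds the predicted class, despite their names.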

In [8]:
df

|   | predicted | actual | raw_count | value |
|---|---|---|---|---|
| 0 | fraud | fraud | 1370 | 0.946786 |
| 1 | fraud | legitimate | 77 | 0.053214 |
| 2 | legitimate | fraud | 7800 | 0.10575 |
| 3 | legitimate | legitimate | 65959 | 0.89425 |
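
The raw counts can be tied back to the classification report with a little arithmetic. The sketch below treats `fraud` as the positive class and assumes that the first label column of the table is the true class, which is what the support figures in the report imply (1370 + 77 = 1447 fraudulent transactions).

```python
# Counts taken from the table above, with "fraud" as the positive class.
tp = 1370    # fraud correctly flagged as fraud
fn = 77      # fraud missed (predicted legitimate)
fp = 7800    # legitimate wrongly flagged as fraud
tn = 65959   # legitimate correctly passed

precision = tp / (tp + fp)                  # of everything flagged, how much was fraud
recall = tp / (tp + fn)                     # of all fraud, how much was flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision: {precision:.2f}")  # precision: 0.15
print(f"recall:    {recall:.2f}")     # recall:    0.95
print(f"f1-score:  {f1:.2f}")         # f1-score:  0.26
print(f"accuracy:  {accuracy:.2f}")   # accuracy:  0.90
```

The low precision comes almost entirely from the 7800 false positives, which is why accuracy alone gives a misleading picture here.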


# Save the model as a pipeline stage

In [9]:
from mlworkflows import util
util.serialize_to(lr, "lr.sav")