In [1]:
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [2]:
train = pd.read_csv("../data/train.csv")

In [3]:
# select promising features discovered during the EDA
X = train[["V4", "V11", "V7", "Amount"]].astype("Float64")

# engineer new features
X["V4xV11"] = X["V4"] * X["V11"]

X["V7_is_negative"] = X["V7"] < 0
X["V7_is_negative"] = X["V7_is_negative"].astype("Int64")

y = train["Class"].astype("Int64")

In [4]:
steps = [
    ("scaler", StandardScaler()),
    (
        "classifier",
        LogisticRegression(random_state=42, max_iter=1000, solver="liblinear")
    ),
]

pipeline = Pipeline(steps=steps)

# What metric to pick?

Since the class distribution is highly imbalanced (only 0.17% of the transactions are fraud), we cannot use metrics like *accuracy*.

This is easy to see if you consider a model that always predicts *not fraud*: it would achieve >99% accuracy, yet be utterly useless.
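As a quick illustration of this point, here is a minimal sketch on made-up labels. The sample count (10,000) is an assumption chosen only to match the 0.17% fraud rate quoted above; it is not the real dataset.

```python
# Hypothetical labels: 10,000 transactions, 17 of them fraud (0.17%).
n_samples = 10_000
n_frauds = 17
y_true = [1] * n_frauds + [0] * (n_samples - n_frauds)

# A "model" that always predicts not fraud.
y_pred = [0] * n_samples

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_samples

# Recall: share of actual frauds that were flagged.
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / n_frauds

print(f"accuracy: {accuracy:.4f}")  # 0.9983
print(f"recall:   {recall:.4f}")    # 0.0000
```

The accuracy looks excellent even though not a single fraud is caught.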

Furthermore, we can't rely solely on *recall* either: a model that flags every transaction as fraud has perfect recall, and a high number of false positives (non-frauds classified as frauds) hurts the business (e.g. a bank) as well.

We will start with the simple (unweighted) F1-score as a more appropriate metric, since it balances *precision* and *recall* evenly. In the future, we will consider giving more weight to recall, since false negatives (frauds we couldn't catch) are more expensive than false positives (e.g. a delay or a misunderstanding that can be fixed with minimal human intervention in most cases).
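One possible way to weight recall more heavily later on is the F-beta score with `beta > 1`. The sketch below is not part of the baseline; the toy labels and the choice `beta=2` are assumptions made purely for illustration.

```python
from sklearn.metrics import f1_score, fbeta_score, make_scorer

# Toy labels (made up): 4 frauds, 6 non-frauds.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Predictions that catch 3 of 4 frauds but raise 3 false alarms,
# i.e. precision = 0.5 and recall = 0.75.
y_pred = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

f1 = f1_score(y_true, y_pred)
f2 = fbeta_score(y_true, y_pred, beta=2)  # beta=2 favours recall over precision

print(f"F1: {f1:.4f}")  # 0.6000
print(f"F2: {f2:.4f}")  # 0.6818

# A scorer object that could be passed as `scoring=` to cross_val_score.
f2_scorer = make_scorer(fbeta_score, beta=2)
```

Because recall is higher than precision in this toy example, F2 comes out above F1. Passing `scoring=f2_scorer` instead of `scoring="f1"` would evaluate a model under the recall-weighted metric.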

In [5]:
# the number of samples is reasonably large, so a random split would likely yield representative folds;
# still, because the dataset is highly imbalanced and the positive class is rare (0.17% of the samples),
# we opt for stratified k-fold cross-validation so that every fold keeps the same class ratio
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
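To make the effect of stratification concrete, the following sketch counts the positives that land in each test fold under plain and stratified splitting. The toy labels (10 positives in 1,000 samples) and the helper `positives_per_fold` are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels (made up): 10 positives among 1,000 samples.
y_toy = np.array([1] * 10 + [0] * 990)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split


def positives_per_fold(splitter):
    """Count the positive samples in each test fold produced by `splitter`."""
    return [
        int(y_toy[test_idx].sum())
        for _, test_idx in splitter.split(X_toy, y_toy)
    ]


plain_counts = positives_per_fold(
    KFold(n_splits=5, shuffle=True, random_state=42)
)
stratified_counts = positives_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
)

print("KFold:          ", plain_counts)
print("StratifiedKFold:", stratified_counts)
```

With stratification every test fold receives exactly two of the ten positives, whereas a plain shuffled split gives no such guarantee.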

In [6]:
# to gain insight into our metric of choice, we will first evaluate a few dummy models

from sklearn.dummy import DummyClassifier

# always predict the most frequent class (here: not fraud); only predict_proba returns the class prior
prior = DummyClassifier(strategy="prior", random_state=42)

print("prior")
print(cross_val_score(estimator=prior, X=X, y=y, scoring="f1", cv=skf))

# always predict fraud
always_positive = DummyClassifier(strategy="constant", constant=1, random_state=42)
print("always fraud")
print(cross_val_score(estimator=always_positive, X=X, y=y, scoring="f1", cv=skf))

# always predict not fraud
always_negative = DummyClassifier(strategy="constant", constant=0, random_state=42)
print("always not fraud")
print(cross_val_score(estimator=always_negative, X=X, y=y, scoring="f1", cv=skf))

prior
[0. 0. 0. 0. 0.]
always fraud
[0.00341753 0.00346127 0.00346127 0.00346127 0.00346127]
always not fraud
[0. 0. 0. 0. 0.]
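
The scores above can be sanity-checked by hand. A model that never predicts fraud has no true positives, so its F1 is 0. A model that always predicts fraud has recall 1 and precision equal to the fraud rate. The sketch below plugs in the approximate 0.17% rate quoted earlier (an approximation, not the exact fold-level rate).

```python
# Approximate fraud rate quoted in this notebook (0.17%).
prevalence = 0.0017

# "Always fraud": every fraud is caught, but only `prevalence` of the flags are right.
precision = prevalence
recall = 1.0

# F1 is the harmonic mean of precision and recall.
f1_always_fraud = 2 * precision * recall / (precision + recall)

print(f"{f1_always_fraud:.5f}")  # 0.00339
```

This is close to the cross-validated values of about 0.0034 printed above, which suggests the metric behaves as expected on the dummy models.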


In [7]:
# and now our baseline logistic regression model
cross_val_score(estimator=pipeline, X=X, y=y, scoring="f1", cv=skf)

array([0.58064516, 0.55172414, 0.58267717, 0.60655738, 0.56666667])

In [8]:
pipeline.fit(X, y)

joblib.dump(pipeline, "../models/baseline.pkl")

['../models/baseline.pkl']
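
For reference, a saved pipeline can be restored with `joblib.load`. The sketch below does a round trip on a small synthetic pipeline in a temporary directory, so it does not depend on `../models/baseline.pkl`; the toy data and the file name `toy_baseline.pkl` are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up): 200 samples, 3 features, a simple linear rule as label.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Same structure as the baseline pipeline, fitted on the toy data.
toy_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression(random_state=42, solver="liblinear")),
    ]
)
toy_pipeline.fit(X_toy, y_toy)

# Save to a temporary file, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "toy_baseline.pkl"  # hypothetical file name
    joblib.dump(toy_pipeline, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions.
same_predictions = np.array_equal(
    toy_pipeline.predict(X_toy), restored.predict(X_toy)
)
print(same_predictions)
```

Loading a pickle generally requires a compatible scikit-learn version, so it is worth recording the version used when the model was saved.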

# Conclusion
Key takeaways:
1. We weighed the alternatives and picked an appropriate metric for the imbalanced fraud dataset: the *F1 score*.
2. We established a baseline using a simple logistic regression model.
3. We sanity-checked our metric and baseline by comparing them against a few dummy models.