# Daily retraining for Fake Property Detection: Impact, Stability, and Thresholding

# Introduction

This analysis is a first look at the predictions of our daily-retrained Batman model [1], which is generated every day by our automated retraining pipeline [2]. This model runs in production in "dry-running mode", which means that it makes predictions for each property registration, but we are not (yet) using these predictions in our business decisions.

Our business decisions are still being made by a model that we trained once on 2021-02-26 [3]. This model is identical to [1] in feature set, model architecture, and hyperparameters; the only difference is that [1] is retrained and redeployed every day using fresh data.

We aim to look into the following **research questions**:

- **RQ 1**: does the daily-retrained-model (i.e., model [1]) make better predictions than the stale model (i.e., model [3])?
- **RQ 2**: does automated daily retraining on some days lead to the deployment of a "bad" model?
- **RQ 3**: does the distribution of the model scores of the daily-retrained-model (i.e., model [1]) have a higher variance from day-to-day than the distribution of model scores of the stale model (i.e., model [3])? If so, we might have a challenge in setting thresholds for the daily-retrained-model.

This analysis is structured as follows:

- **Data preparation**, which is a prerequisite to be able to answer all three questions
- **The effectiveness of daily-retraining on Batman**, aims to answer RQ 1
- **The consistency of the effectiveness of daily-retraining on Batman**, aims to answer RQ 2
- **Analysis of Volume-stability**, aims to answer RQ 3
- **Conclusions**

This analysis is fairly long and detailed, and it mixes code for the experimental setup with textual interpretation of plots and results. To quickly find only the results and their interpretation: they all start with the indicator "**Observation:**", so you can skim through this analysis by CTRL-F-ing on that word.

**References**:

[1] [https://ml.booking.com/model/partner_fraud_preopening_batman_20210226_daily_retrained](https://ml.booking.com/model/partner_fraud_preopening_batman_20210226_daily_retrained)

[2] [https://gitlab.booking.com/core/machine-learning-platform/model-building/training-pipelines/-/tree/master/partner_fraud_preopening_batman_20210226_daily_retrained](https://gitlab.booking.com/core/machine-learning-platform/model-building/training-pipelines/-/tree/master/partner_fraud_preopening_batman_20210226_daily_retrained)

[3] [https://ml.booking.com/model/partner_fraud_preopening_batman_20210226](https://ml.booking.com/model/partner_fraud_preopening_batman_20210226)

In [None]:
from pyspark.sql import functions as sf, types
from scipy.stats import kendalltau
from sklearn import metrics

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np

In [None]:
spark.sql("REFRESH TABLE counterfraud.batman_model_instances")

# Data preparation
We filter the data on the time range from 2021-04-02 (the day we started making online predictions with the daily-retrained-model) to yesterday, which gives us almost two weeks of model predictions. For the last few days, our labels might not be fully mature yet (i.e., we don't know all the fraud yet). However, over time we have become much better at catching fake properties fast, and nowadays we find ~90% of the fake properties within just a few days (as found in [this analysis](https://analysis.booking.com/post/75280930.kp)).

In [None]:
instances = (
    spark.table("counterfraud.batman_model_instances")
    .where(sf.col("signup_finished_at") >= "2021-04-02")
    .where(sf.col("signup_finished_at") < "2021-05-17")
    # Spark's substr is 1-based: keep the first 10 characters (the yyyy-MM-dd part)
    .withColumn("registration_date", sf.col("signup_finished_at").substr(1, 10))
    .select("property_id", "is_fake_hotel", "registration_date")
)

In [None]:
json_parse_schema = types.StructType(
    [
        types.StructField("score", types.FloatType(), True),
    ]
)

## Obtain the predictions of the stale model
batman_stale_predictions = (
    spark.table("dbimports.sectools_batmanmodelpredictionlog")
    .where(sf.col("model_instance") == "partner_fraud_preopening_batman_20210226")
    .where(sf.col("source_system") == "RS")
    .select("property_id", "prediction")
    .withColumn("prediction", sf.from_json("prediction", json_parse_schema))
    .select(sf.col("property_id"), sf.col("prediction.*"))
    .withColumnRenamed("score", "stale_prediction")
)

In [None]:
instances_w_batman = instances.join(batman_stale_predictions, on="property_id", how="left")

In [None]:
## Obtain the predictions of the daily-retrained model
batman_retrained_predictions = (
    spark.table("dbimports.sectools_batmanmodelpredictionlog")
    .where(sf.col("model_instance") == "partner_fraud_preopening_batman_20210226_daily_retrained")
    .where(sf.col("source_system") == "RS")
    .select("property_id", "prediction")
    .withColumn("prediction", sf.from_json("prediction", json_parse_schema))
    .select(sf.col("property_id"), sf.col("prediction.*"))
    .withColumnRenamed("score", "retrained_prediction")
)

We now join both the predictions of the stale model and those of the daily-retrained-model into the instances.

In [None]:
batman_predictions = (
    instances_w_batman
    .join(batman_retrained_predictions, on="property_id", how="left")
)

## A first look at daily-retrained-model predictions

Let's see which properties the daily-retrained-model flags that are not (yet) known to be fake.

In [None]:
(
    batman_predictions
    .where(sf.col("is_fake_hotel") == 0)
    .withColumn("prediction_delta", sf.col("retrained_prediction") - sf.col("stale_prediction"))
    .orderBy("prediction_delta", ascending=False)
).show(10, False)

Note that there are many hotels whose fraud risk is much higher under the retrained model than under the stale model. These might be new types of fraud that simply did not yet exist in the training data when the stale model was trained. I gave a list of 50 properties (all with a new-model score more than 0.4 above the stale-model score) to Amir Kalter from our fraud operations team for manual investigation.

His analysis findings are shown [here](https://docs.google.com/spreadsheets/d/18Is0vlZzbaJIDN3MvvF4gfOTN_kC0Tt09x72XHJBMM0/edit#gid=0), where he:

- closed 30 out of these 50 properties because they were fake
- found 2 false positives that were non-fake
- for 18 properties he was not able to find conclusive evidence, but suspected most of those to be fake

In [None]:
status = spark.table("dbimports.acquisition_property").selectExpr("id as property_id", "status")

Meaning of these statuses:

- Status 140 is spam registration
- Status 11 are "test"-properties

In [None]:
(
    batman_predictions
    .join(status, on="property_id", how="left")
    .where(sf.col("is_fake_hotel") == 0)
    # The score bucket is the first decimal of the fraud-risk, e.g., score 0.743 is score_bucket 7.
    # Computed numerically, because very small floats are cast to strings in scientific notation.
    .withColumn("score_bucket", sf.floor(sf.col("retrained_prediction") * 10))
    .groupBy("score_bucket")
    .pivot("status")
    .agg(sf.sum(sf.lit(1)).alias("count"))
    .orderBy("score_bucket")
    .fillna(0)
).show(20, False)

Note that spam hotels (status 140) tend to be fairly uniformly distributed over the buckets.
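
The score bucket is the first decimal of the fraud risk. The plain-Python sketch below compares a numeric version of that rule with a string-based version; `score_bucket` and `score_bucket_from_string` are hypothetical helpers written for illustration and are not part of the pipeline. The two agree on typical scores but diverge on very small ones, which are rendered in scientific notation.

```python
import math


def score_bucket(score):
    """Hypothetical helper: first decimal of a score in [0, 1], e.g. 0.743 -> 7."""
    # Clamp to 9 so that a score of exactly 1.0 lands in the top bucket.
    return min(int(math.floor(score * 10)), 9)


def score_bucket_from_string(score):
    """String-based variant: third character of the textual representation."""
    return str(score)[2]


print(score_bucket(0.743))                # 7
print(score_bucket_from_string(0.743))    # '7'
print(score_bucket(0.00005))              # 0
print(score_bucket_from_string(0.00005))  # '-' because str(0.00005) == '5e-05'
```

The numeric version can still misplace a score that sits exactly on a bucket boundary because of floating-point rounding, so it is a sketch rather than a drop-in rule.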

In [None]:
properties_pd = (
    batman_predictions
    .join(status, on="property_id", how="left")
    .toPandas()
)

In [None]:
properties_pd

We have earlier identified a data issue and found that all spam properties are non-fake properties that were wrongly put into that status by some hacky processing on the side of the registrations team. We exclude those properties, as well as the test properties, from further analysis.

In [None]:
nonspam_nontest_properties = properties_pd[(properties_pd["status"] != 140) & (properties_pd["status"] != 11)]

In [None]:
nonspam_nontest_properties

# The effectiveness of daily-retraining on Batman

Let's do some basic visual analysis: we plot the scores of the stale production model against those of the daily-retrained model and inspect the results.

In [None]:
plt.figure(figsize=(7, 7), dpi=80)
ax = sns.scatterplot(
    data=nonspam_nontest_properties,
    x="retrained_prediction",
    y="stale_prediction",
    hue="is_fake_hotel",
    alpha=0.2,
)
ax.set(ylim=(0, 1), xlim=(0, 1))
# Reference line y = x: points below it score higher under the daily-retrained model.
diagonal = np.linspace(0, 1)
plt.plot(diagonal, diagonal, color="r")
plt.show()

**Observation:** Visually, we see that:

- It seems in this plot that **there is a lot of orange (fake) in the bottom right**: where the stale model gives a low score and the daily retrained model gives a high score.
- The daily-retrained-model **seems to give higher scores than the stale model**, since there are more data points below the red y=x-line than above it.

To analyze this more quantitatively, we run a logistic regression, predicting the fakeness of a hotel based on model scores of both the stale model and of the daily-retrained-model.

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [None]:
nonspam_nontest_properties = nonspam_nontest_properties.dropna()

In [None]:
model1 = smf.glm(
    formula='is_fake_hotel ~ stale_prediction + retrained_prediction', 
    data=nonspam_nontest_properties,
    family=sm.families.Binomial()
)
results1 = model1.fit()
results1.summary()

**Observation:**
We see in the GLM regression that the daily retrained model is a strong predictor of whether the hotel is fake.

We also see in the GLM regression that the 95%-CI of the stale-score coefficient is fully below zero. This indicates that, once the score of the daily-retrained-model is known, a *higher* score in the stale model makes the property *less likely* to be fake rather than more likely.

This seems to suggest that the daily-retrained-model is a much better predictor of hotel fakeness than the stale model (remember, the stale model is what we currently run in production).
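
For reference, the regression fitted above (a binomial GLM, which in statsmodels uses the logit link by default) can be written out as follows. The symbols $\beta_0$, $\beta_s$ and $\beta_r$ are notation introduced here for illustration; they do not appear in the code.

$$
\log\frac{P(\text{fake})}{1 - P(\text{fake})} = \beta_0 + \beta_s \cdot \text{stale\_prediction} + \beta_r \cdot \text{retrained\_prediction}
$$

Under this model, holding the retrained score fixed and raising the stale score by $\Delta$ multiplies the odds of being fake by $e^{\beta_s \Delta}$. A negative $\beta_s$ therefore corresponds to a factor below 1, which is the "less likely to be fake" reading given above.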

We additionally look at the overall ROC-AUC of the daily-retrained-model and the stale model.

In [None]:
metrics.roc_auc_score(
    nonspam_nontest_properties["is_fake_hotel"],
    nonspam_nontest_properties["stale_prediction"]
)

In [None]:
metrics.roc_auc_score(
    nonspam_nontest_properties["is_fake_hotel"],
    nonspam_nontest_properties["retrained_prediction"]
)

**Observation:**
Daily retraining increased the ROC-AUC of Batman from 0.868 to 0.906 over the two-week period.

Based on this analysis we can answer RQ 1 positively: **the daily-retrained-model does indeed make better predictions than the stale model**.
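
As a reading aid for the numbers above: ROC-AUC can be interpreted as the probability that a randomly chosen fake property receives a higher score than a randomly chosen non-fake one, with ties counting half. The following is a minimal plain-Python sketch of that definition on made-up labels and scores; `pairwise_auc` is a hypothetical helper for illustration, not a replacement for `metrics.roc_auc_score`.

```python
def pairwise_auc(labels, scores):
    """ROC-AUC as the fraction of (fake, non-fake) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC is undefined when only one class is present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example data (not taken from the analysis).
labels = [0, 0, 1, 1]
stale = [0.1, 0.4, 0.35, 0.8]
retrained = [0.1, 0.3, 0.6, 0.9]
print(pairwise_auc(labels, stale))      # 0.75
print(pairwise_auc(labels, retrained))  # 1.0
```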

# The consistency of the effectiveness of daily-retraining on Batman

Here we analyze the effect of the daily-retrained-model score **per day of registrations** instead of over the whole period. The hypothesis here is that over time the benefit of the daily-retrained-model over the stale model gets larger because the "difference in data freshness" grows over time. With this analysis we also hope to answer RQ2: if on some days the daily-retrained-model deploys a bad model, we expect that on some registration days the stale model outperforms the daily-retrained-model.

First, we parse the coefficient and its confidence intervals for the retrained_prediction from the logistic regression statsmodels model object.

## Confidence Intervals of the contributions of daily-retraining

In [None]:
retrained_mean_coeff = results1.params.T["retrained_prediction"]
print(retrained_mean_coeff)

In [None]:
retrained_lower_coeff, retrained_upper_coeff = results1.conf_int(alpha=0.05, cols=None).T["retrained_prediction"].to_list()
print(retrained_lower_coeff, retrained_upper_coeff)

Check that these numbers are indeed identical to those in the summary table above.

We now repeat the analysis per day, to see how the daily-retrained-model coefficients trend over time.

In [None]:
# Sort the unique dates so that the per-day results come out in chronological order
registration_dates = sorted(set(nonspam_nontest_properties["registration_date"]))

Let's implement a function that returns the coefficient and its CI for all the predictions of a single day of registrations.

In [None]:
def get_coeffs_for_day(reg_date):
    date_results = smf.glm(
        formula='is_fake_hotel ~ stale_prediction + retrained_prediction', 
        data=nonspam_nontest_properties[nonspam_nontest_properties["registration_date"] == reg_date],
        family=sm.families.Binomial()
    ).fit()
    mean = date_results.params.T["retrained_prediction"]
    lower, upper = date_results.conf_int(alpha=0.05, cols=None).T["retrained_prediction"].to_list()
    
    return mean, lower, upper, reg_date

Let's try it on a single day to see if it works.

In [None]:
get_coeffs_for_day("2021-04-10")

Let's now apply this function to all the dates for which we have predictions.

In [None]:
results_per_date = [get_coeffs_for_day(reg_date) for reg_date in registration_dates]

In [None]:
results_per_date_pd = pd.DataFrame(
    # conf_int(alpha=0.05) returns the bounds of the 95% CI, i.e. the 2.5% and 97.5% quantiles
    [{"mean": e1, "CI_2.5": e2, "CI_97.5": e3, "reg_date": e4} for e1, e2, e3, e4 in results_per_date]
)

Now we can start plotting the effects over time: for every registration date we plot the confidence interval of how much information is contained in the score of the daily-retrained-model **after** accounting for the information that is already contained in the score of the stale model.

In [None]:
plt.figure(figsize=(15, 6), dpi=80)

y_axis_name = "Effect of daily-retrained model scores on the log-odds scale"

ax = sns.lineplot(
    data=pd.melt(results_per_date_pd, ['reg_date'], value_name=y_axis_name), 
    x="reg_date", 
    y=y_axis_name, 
    hue='variable',
)
ax.axhline(0, ls='--', color="black")

ax.set_title("The effect of daily-retrained model scores on hotel-fakeness over time (incremental, over knowing just the stale-model score)")

for tick in ax.get_xticklabels():
    tick.set_rotation(60)

**Observation:**
Our hypothesis does seem to hold: the coefficients of the daily-retrained-model grow over time, which means that its scores become an increasingly strong signal of hotel fakeness and that the stale model's scores become, relatively, an increasingly weak signal.

Also note that there is not a single day in which the coefficients of the daily-retrained-model are weak. This suggests that so far after 11 consecutive days of automated retrain/deploys we haven't yet seen any bad deploys of the daily-retrained-model. We would like to monitor this a bit longer to be sure, but for now it does seem like **we can answer RQ2 negatively**: there is no evidence that we deploy a bad model on some days.
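
One way to make "no weak days" checkable is to flag registration dates whose confidence interval for the `retrained_prediction` coefficient includes zero. The sketch below assumes tuples in the `(mean, lower, upper, reg_date)` layout returned by `get_coeffs_for_day`; the helper name `weak_days` and all numbers are made up for illustration.

```python
def weak_days(results_per_date):
    """Return, sorted, the dates whose CI lower bound is not above zero."""
    return sorted(reg_date for _, lower, _, reg_date in results_per_date if lower <= 0)


# Made-up example values (not taken from the analysis).
example = [
    (4.2, 2.9, 5.5, "2021-04-03"),
    (0.8, -0.4, 2.0, "2021-04-04"),
    (5.1, 3.7, 6.5, "2021-04-05"),
]
print(weak_days(example))  # ['2021-04-04']
```

An empty list would correspond to the situation described above, where every day's interval lies above zero.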

## Analysis of ROC-AUC per day
Beyond analyzing the coefficient of `retrained_prediction` in a logistic regression, we also take another view on the contribution of daily retraining to model performance: we calculate the ROC-AUC for every day of registrations and plot the daily-retrained-model against the stale model.

In [None]:
def auc_group(df):
    y_hat = df.y_hat
    y = df.is_fake_hotel
    return metrics.roc_auc_score(y, y_hat)

In [None]:
melted_scores = pd.melt(
    nonspam_nontest_properties.reset_index()[["registration_date", "stale_prediction", "retrained_prediction", "is_fake_hotel"]],
    id_vars=["registration_date", "is_fake_hotel"],
    value_name="y_hat",
    var_name="model"
)

In [None]:
results = melted_scores.groupby(["registration_date", "model"]).apply(auc_group).reset_index()
results.columns = ["registration_date", "model", "ROC-AUC"]

In [None]:
plt.figure(figsize=(15,10))
ax = sns.lineplot(
    data=results, 
    x="registration_date", 
    y="ROC-AUC", 
    hue="model",
)
ax.legend(loc='upper right')

ax.axvline("2021-04-20", color="red")
ax.text("2021-04-20", 0.94, color="red", s=" <-- Lowered the Batman threshold,\n       triggering more concept drift")


for tick in ax.get_xticklabels():
    tick.set_rotation(60)

**Observation:**
The daily-retrained-model is consistently above the stale model for the whole time range, with the exception of 2021-04-17. On that day an old attack pattern (called the 2FA-attack) popped back up for one day. It was raised with the responsible product team and the loophole was closed, stopping the attack from the next day on.

The fact that the daily-retrained-model is consistently above the stale model gives further evidence for **RQ1: daily-retraining does seem to positively impact the predictions**. Additionally, this gives evidence to **negatively answer RQ 2**: we have no evidence so far that daily retraining leads to bad model deployments on some days.

It is also noticeable that after we became much more [aggressive with the Batman threshold on 2021-04-20](https://docs.google.com/document/d/1TMS4ohydf8E85xmvLjSOAqKfuSGqh1-gDcS2HCAX_gQ/edit#), the fraudsters started adapting their behavior more quickly to work around our defenses, and the ROC-AUC gap between our stale model and the daily-retrained-model seems to have grown. This is shown visually in the figure above, and confirmed by the table below.

In [None]:
results["is_before_lower_threshold"] = results["registration_date"] < "2021-04-20"

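# Average only the ROC-AUC column; registration_date is a string and cannot be averaged.
results.groupby(["is_before_lower_threshold", "model"])["ROC-AUC"].mean()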
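
The before/after comparison can also be summarised as a single number per period: the mean gap between the daily ROC-AUC of the two models. The sketch below is plain Python on made-up values; `mean_auc_gap` and the example dictionary are hypothetical and only illustrate the calculation.

```python
from statistics import mean


def mean_auc_gap(daily_auc, cutoff):
    """daily_auc maps date -> (stale_auc, retrained_auc).

    Returns the mean (retrained - stale) gap before the cutoff date and from the cutoff on.
    """
    before = [r - s for d, (s, r) in daily_auc.items() if d < cutoff]
    after = [r - s for d, (s, r) in daily_auc.items() if d >= cutoff]
    return mean(before), mean(after)


# Made-up example values (not taken from the analysis).
daily_auc = {
    "2021-04-18": (0.90, 0.92),
    "2021-04-19": (0.88, 0.92),
    "2021-04-20": (0.85, 0.93),
    "2021-04-21": (0.83, 0.93),
}
gap_before, gap_after = mean_auc_gap(daily_auc, "2021-04-20")
print(round(gap_before, 3), round(gap_after, 3))  # 0.03 0.09
```

ISO-formatted date strings compare correctly as plain strings, which is why no date parsing is needed here.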

# Analysis of Volume-stability
Here we investigate RQ3 and try to get insight into whether we expect to see risks on the volumes that we send to high-risk and to medium-risk as a consequence of daily retraining.

Let's first just eye-ball the mean and the median score per day of the stale model and of the daily-retrained-model to get a sense of how stable their score distributions are.

In [None]:
nonspam_nontest_properties.groupby("registration_date").agg(
    {
        "stale_prediction": ["mean", "median"],
        "retrained_prediction": ["mean", "median"],
        "is_fake_hotel": ["mean"],
    }
)

**Observation:**
The daily mean scores of the daily-retrained-model look slightly more wobbly than those of the stale model. This might be acceptable if the daily variability follows the fraud rate, i.e., if we send more properties to high-risk on the days that there also actually is more fraud.
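
To put a number on "wobbly", one option is the coefficient of variation (standard deviation divided by mean) of the daily mean score. This is a sketch of one possible stability measure, not the measure used in this analysis; the helper name and the example values are made up.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean; higher means less stable."""
    return stdev(values) / mean(values)


# Made-up daily mean scores (not taken from the analysis).
stale_daily_means = [0.050, 0.052, 0.049, 0.051]
retrained_daily_means = [0.060, 0.075, 0.055, 0.070]

print(round(coefficient_of_variation(stale_daily_means), 3))
print(round(coefficient_of_variation(retrained_daily_means), 3))
```

Dividing by the mean makes the two models comparable even when one of them gives systematically higher scores.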

## Analysis of volume-stability on the high-risk threshold

We now plot how many properties are above the high-risk threshold per day, and **how stable this volume is**.

- For the stale model we analyze this at threshold 0.5 (current product threshold for high risk).
- For the daily-retrained model we analyze this at threshold 0.6, because it seems to give slightly higher scores.

In [None]:
high_risk_results = nonspam_nontest_properties.groupby("registration_date").agg(
    stale_prediction=pd.NamedAgg(column='stale_prediction', aggfunc=lambda x: (x > 0.5).sum()),
    retrained_prediction=pd.NamedAgg(column='retrained_prediction', aggfunc=lambda x: (x > 0.6).sum()),
    fraud_rate=pd.NamedAgg(column='is_fake_hotel', aggfunc=lambda x: x.mean()),
)
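
The 0.6 threshold for the daily-retrained model was picked because that model seems to give slightly higher scores. A possible alternative, sketched here under the assumption that both models should flag the same number of properties, is to derive the retrained-model threshold from the volume that the stale model flags at 0.5. `volume_matched_threshold` is a hypothetical helper and the scores are made up.

```python
def volume_matched_threshold(stale_scores, retrained_scores, stale_threshold):
    """Retrained-model threshold that flags at most as many properties as the stale model does.

    "Flagged" means a score strictly above the threshold, as in the cell above.
    """
    target = sum(s > stale_threshold for s in stale_scores)
    ranked = sorted(retrained_scores, reverse=True)
    if target >= len(ranked):
        raise ValueError("the stale model flags at least as many properties as there are retrained scores")
    # Scores strictly above the (target + 1)-th largest score number at most `target`.
    return ranked[target]


# Made-up example scores (not taken from the analysis).
stale = [0.05, 0.2, 0.55, 0.7, 0.1, 0.9]
retrained = [0.1, 0.3, 0.65, 0.8, 0.58, 0.95]
threshold = volume_matched_threshold(stale, retrained, 0.5)
print(threshold)                             # 0.58
print(sum(r > threshold for r in retrained)) # 3
```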

In [None]:
plt.figure(figsize=(15,10))
y_axis_name = "Number of high-risk properties per day"
ax = sns.lineplot(
    data=pd.melt(high_risk_results.reset_index()[["registration_date",
                                                  "stale_prediction",
                                                  "retrained_prediction"]],
                 ['registration_date'],
                 value_name=y_axis_name
                ), 
    x="registration_date", 
    y=y_axis_name, 
    hue='variable',
)

ax.set_title("The number of high-risk properties per day, according to daily-retrained model and the stale model.")
ax.legend(loc='upper left')
ax.set(ylim=(0, 100))

for tick in ax.get_xticklabels():
    tick.set_rotation(60)

# Plot the fraud rate on a secondary y-axis that shares the x-axis with the volumes.
ax2 = ax.twinx()
sns.lineplot(data=high_risk_results[["fraud_rate"]], palette="BuGn", ax=ax2)
ax2.legend(loc='upper right')
ax2.set(ylim=(0, 0.15))
ax2.set_ylabel('Fraud rate')

**Observation:** On the last days the fraud rate is unreliable due to label delay. Up until 2021-04-09 the fraud rate is a close enough approximation of the true fraud rate. The retrained model does have somewhat higher volumes and larger volume variability at the high-risk level than the stale model. On the first few days of the plot, the high-risk volume of the daily-retrained-model seems to follow the fraud rate a bit better than that of the stale model, which might be good. This is limited evidence though, based on so few days, and this analysis should be repeated on more data later to find a more reliable answer.

### Are the volumes at high-risk better correlated with the fraud rate?

In [None]:
kendalltau(high_risk_results["fraud_rate"], high_risk_results["stale_prediction"])

In [None]:
kendalltau(high_risk_results["fraud_rate"], high_risk_results["retrained_prediction"])

**Observation:** Yes. While the high-risk volumes of both the stale model and the daily-retrained-model are only weakly correlated with the fraud rate, the daily-retrained model is slightly more strongly correlated with it. This might mean that if on a certain day the model sends more properties to high risk, then with the daily-retrained-model there is a higher probability that this is because there really was more fraud on that day.
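
For readers unfamiliar with the statistic: Kendall's tau compares every pair of days and checks whether the fraud rate and the flagged volume move in the same direction. The sketch below implements the simplest variant (tau-a, which assumes no ties) in plain Python on made-up values. `scipy.stats.kendalltau` computes the tau-b variant by default, which coincides with tau-a when there are no ties.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs. Assumes no ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up example values (not taken from the analysis).
fraud_rate = [0.02, 0.05, 0.03, 0.08]
high_risk_volume = [10, 30, 25, 20]
print(round(kendall_tau_a(fraud_rate, high_risk_volume), 3))  # 0.333
```

In this example four of the six day pairs are concordant and two are discordant, giving (4 - 2) / 6.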

## Analysis of volume-stability on the medium-risk threshold

We now repeat this analysis for the medium-risk volumes.

In [None]:
medium_risk_results = nonspam_nontest_properties.groupby("registration_date").agg(
    stale_prediction=pd.NamedAgg(column='stale_prediction', aggfunc=lambda x: (x > 0.086).sum()),
    retrained_prediction=pd.NamedAgg(column='retrained_prediction', aggfunc=lambda x: (x > 0.086).sum()),
    fraud_rate=pd.NamedAgg(column='is_fake_hotel', aggfunc=lambda x: x.mean()),
)

In [None]:
plt.figure(figsize=(15,10))
y_axis_name = "Number of medium-risk properties per day"
ax = sns.lineplot(
    data=pd.melt(medium_risk_results.reset_index()[["registration_date",
                                                  "stale_prediction",
                                                  "retrained_prediction"]],
                 ['registration_date'],
                 value_name=y_axis_name
                ), 
    x="registration_date", 
    y=y_axis_name, 
    hue='variable',
)

ax.set_title("The number of Medium-risk properties per day, according to daily-retrained model and the stale model.")
ax.legend(loc='upper left')
ax.set(ylim=(0, 1500))

for tick in ax.get_xticklabels():
    tick.set_rotation(60)

# Plot the fraud rate on a secondary y-axis that shares the x-axis with the volumes.
ax2 = ax.twinx()
sns.lineplot(data=medium_risk_results[["fraud_rate"]], palette="BuGn", ax=ax2)
ax2.legend(loc='upper right')
ax2.set(ylim=(0, 0.15))
ax2.set_ylabel('Fraud rate')

### Are the volumes at medium-risk better correlated with the fraud rate?

In [None]:
kendalltau(medium_risk_results["fraud_rate"], medium_risk_results["stale_prediction"])

In [None]:
kendalltau(medium_risk_results["fraud_rate"], medium_risk_results["retrained_prediction"])

**Observation:** Surprisingly, the stale model even has a negative correlation between the fraud rate and the medium-risk volume: the model assigns more medium-risk scores on days with less fraud! For the daily-retrained model there is a weak but positive correlation between the medium-risk volume and the fraud rate.

## Analysis of volume-stability on the experimental new high-risk threshold
From 2021-04-20 we started a temporary experiment to be much more aggressive with fake property registrations and lower the high-risk threshold from 0.5 to 0.15. We now explore the stability of volumes under that threshold.

In [None]:
new_high_risk_results = nonspam_nontest_properties.groupby("registration_date").agg(
    stale_prediction=pd.NamedAgg(column='stale_prediction', aggfunc=lambda x: (x > 0.15).sum()),
    retrained_prediction=pd.NamedAgg(column='retrained_prediction', aggfunc=lambda x: (x > 0.15).sum()),
    fraud_rate=pd.NamedAgg(column='is_fake_hotel', aggfunc=lambda x: x.mean()),
)

In [None]:
plt.figure(figsize=(15,10))
y_axis_name = "Number of high-risk properties per day"
ax = sns.lineplot(
    data=pd.melt(new_high_risk_results.reset_index()[["registration_date",
                                                  "stale_prediction",
                                                  "retrained_prediction"]],
                 ['registration_date'],
                 value_name=y_axis_name
                ), 
    x="registration_date", 
    y=y_axis_name, 
    hue='variable',
)

ax.set_title("The number of High-risk properties per day, according to the new threshold of 0.15 with the daily-retrained model and the stale model.")
ax.legend(loc='upper left')
ax.set(ylim=(0, 1500))

for tick in ax.get_xticklabels():
    tick.set_rotation(60)

# Plot the fraud rate on a secondary y-axis that shares the x-axis with the volumes.
ax2 = ax.twinx()
sns.lineplot(data=new_high_risk_results[["fraud_rate"]], palette="BuGn", ax=ax2)
ax2.legend(loc='upper right')
ax2.set(ylim=(0, 0.15))
ax2.set_ylabel('Fraud rate')

### Are the volumes at the new high-risk level better correlated with the fraud rate?

In [None]:
kendalltau(new_high_risk_results["fraud_rate"], new_high_risk_results["stale_prediction"])

In [None]:
kendalltau(new_high_risk_results["fraud_rate"], new_high_risk_results["retrained_prediction"])

**Observation:** Again, the stale model even has a negative correlation between the fraud rate and the high-risk volume: at the new high-risk level the stale model assigns more high-risk scores on days with less fraud! For the daily-retrained model there is a weak but positive correlation between the high-risk volume and the fraud rate.

# Conclusions

- We have seen evidence that the daily-retrained-model makes better predictions than the stale model that we currently run in production (RQ1)
- We have not seen any bad deployments of our daily-retrained-model so far (RQ2)
- At our current production thresholds, the volumes of high-risk and of medium-risk properties are comparable (RQ3). Additionally, we see that under the daily-retrained model, the high-risk and medium-risk volumes correlate better with the actual fraud rate than under the stale model.