# Batman Flexrules Analysis: Impact of flexrules on our location verification decisions for new-property-registrations

# Introduction

In this analysis we explore how much each of the flexrules contributes to the overall Batman system. In particular, we explore how many additional **false positives**, **false negatives**, **true positives** and **true negatives** each flexrule brings in comparison to the Batman model at the [lower threshold that we are currently running experimentally](https://docs.google.com/document/d/1TMS4ohydf8E85xmvLjSOAqKfuSGqh1-gDcS2HCAX_gQ/edit). While the model currently catches 76% of the fake hotels that are not caught by the blacklist, it does so at the cost of many false positives and thus friction on genuine properties. The hypothesis is that this friction can be reduced by phasing out old flexrules that do not help catch fakes beyond those the model already catches at this threshold.

This analysis of flexrules contrasts with the way in which they are typically evaluated: by looking at the precision of the rule itself. What ultimately matters for our business is the quality of our overall system, which is a combination of the model and the flexrules. We measure the degree to which a flexrule affects this overall system by calculating how often the flexrule changed the outcome of the business decision made by the overall system, compared to a hypothetical world in which the flexrule under investigation had not existed. We call this difference between the overall system and the overall system without a given flexrule the **incremental effect** of the flexrule.
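To make the definition concrete, here is a minimal pure-Python sketch of how an incremental effect could be computed from two confusion-matrix tallies. The `Counts` tuple, the `incremental_effect` helper and all numbers are hypothetical and for illustration only; they are not taken from the Batman data.

```python
from typing import NamedTuple


class Counts(NamedTuple):
    """Confusion-matrix tallies of the overall system (model + flexrules)."""
    fp: int
    tp: int
    fn: int
    tn: int


def incremental_effect(with_rule: Counts, without_rule: Counts) -> Counts:
    """Difference between the system with the rule and the system without it."""
    return Counts(*(w - wo for w, wo in zip(with_rule, without_rule)))


# Hypothetical numbers, for illustration only.
system_with_rule = Counts(fp=120, tp=45, fn=15, tn=820)
system_without_rule = Counts(fp=100, tp=42, fn=18, tn=840)

effect = incremental_effect(system_with_rule, system_without_rule)
print(effect)
```

Because every property ends up in exactly one of the four cells in both worlds, the four incremental counts always sum to zero: each incremental false positive is a property that would otherwise have been a true negative, and each incremental true positive would otherwise have been a false negative.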

This notebook contains five sections:

- **Pre-Processing the Flexrule data**: this section just loads the required tables and does some required data plumbing. This section can be skipped by those who are only interested in the business implications.
- **The Business Logic of Batman FlexRules**: this section re-implements in PySpark the Perl logic by which the Batman model and Batman flexrules are jointly used to make business decisions (i.e., whether to send the property to hubs for location verification).
- **Evaluating a new rule**: this section demonstrates to fraud operations analysts how the logic of the previous section can be used to measure the expected **incremental** effect of a new flexrule that is under development.
- **The Incremental Effect of each Individual Flexrule**: this section measures the **incremental** effect of all flexrules that are currently operational.
- **A Proposal for Deprecating Flexrules**: this section makes a proposal to decommission a large number of currently operational flexrules based on the findings of the previous section.

In [None]:
!pip install bkng-data --no-dependencies --pre

In [None]:
from typing import List
from pyspark.sql import DataFrame, functions as sf, types
from sklearn import metrics

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns

from bkng.data.teams.fpi.tools.spark import melt, parse_json_schema

In [None]:
spark.sql("REFRESH TABLE counterfraud.batman_model_features_and_labels")

In [None]:
flexrules = spark.table("dbimports.seclog_BatmanFlexRulesLog")

# Batman Instances

We start the analysis from 2021-03-26, which is the release date of the current production Batman model, and include registrations up to (but not including) 2021-05-01.

In [None]:
instances = (
    spark.table("counterfraud.batman_model_features_and_labels")
    .where(sf.col("signup_finished_at") >= "2021-03-26")
    .where(sf.col("signup_finished_at") < "2021-05-01")
    .withColumn("registration_date", sf.col("signup_finished_at").substr(0, 10))
    .select("property_id", "is_fake_hotel", "registration_date")
)

In [None]:
json_parse_schema = types.StructType(
    [
        types.StructField("score", types.FloatType(), True),
    ]
)


batman_predictions = (
    spark.table("dbimports.sectools_batmanmodelpredictionlog")
    .where(sf.col("model_instance") == "partner_fraud_preopening_batman_20210226")
    .where(sf.col("source_system") == "RS")
    .select("property_id", "prediction")
    .withColumn("prediction", sf.from_json("prediction", json_parse_schema))
    .select(sf.col("property_id"), sf.col("prediction.*"))
    .withColumnRenamed("score", "batman_score")
)

# Pre-Processing the Flexrule data

Most of the contents of this section are just data plumbing. The flexrule data is stored in slightly complicated JSON formats, so it takes a bit of code to process it into a usable form.

In [None]:
rule_actions_schema = parse_json_schema(flexrules, json_col="rule_actions", spark=spark)

In [None]:
flexrules_parsed = (
    flexrules
    .withColumn("rules_actions_json", sf.from_json(sf.col("rule_actions"), rule_actions_schema))
    .drop("rule_actions")
    .select("property_id", "rules_actions_json.*")
)

Some flexrules have characters in their name that are not allowed in Hive columns (spaces and hyphens). We will now replace those with "__" (i.e., dunder) to be able to process the data further.

In [None]:
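illegal_chars = [" ", "-"]
for col in flexrules_parsed.columns:
    # build the cleaned name first, so that columns containing both
    # spaces and hyphens get all illegal characters replaced
    new_col = col
    for illegal_char in illegal_chars:
        new_col = new_col.replace(illegal_char, "__")
    # rename once, and only if the name actually changed
    if new_col != col:
        flexrules_parsed = flexrules_parsed.withColumnRenamed(col, new_col)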

We now use the melt method to transform the data from a wide one-column-per-flexrule format to a long format that has the flexrule name in column *rule* and the list of actions that it triggered in column *action_was_taken*.

In [None]:
groupby_cols = ["property_id"]
pivot_cols = [col for col in flexrules_parsed.columns if col not in groupby_cols]

action_taken_per_rule = melt(
    flexrules_parsed, 
    groupby_cols = groupby_cols, 
    pivot_cols = pivot_cols,
    target_name_col = "rule", 
    target_value_col = "action_was_taken"
)
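For readers without access to the internal `melt` helper, the following pure-Python sketch shows the wide-to-long reshaping that this step is assumed to perform. The function `melt_rows` and the sample rule names are hypothetical, and the actual helper may differ in details such as how nulls are handled.

```python
from typing import Any, Dict, List


def melt_rows(
    rows: List[Dict[str, Any]],
    groupby_cols: List[str],
    pivot_cols: List[str],
    target_name_col: str,
    target_value_col: str,
) -> List[Dict[str, Any]]:
    """Turn each wide row into one long row per pivot column."""
    long_rows = []
    for row in rows:
        keys = {col: row[col] for col in groupby_cols}
        for col in pivot_cols:
            long_rows.append({**keys, target_name_col: col, target_value_col: row[col]})
    return long_rows


# Hypothetical wide data: one column per flexrule, holding the list of triggered actions.
wide = [
    {"property_id": 1, "rule_a": ["upgrade_score_to_1"], "rule_b": None},
    {"property_id": 2, "rule_a": None, "rule_b": ["log_only", "move_score_to_high"]},
]

long_format = melt_rows(wide, ["property_id"], ["rule_a", "rule_b"], "rule", "action_was_taken")
for row in long_format:
    print(row)
```

Two wide rows with two rule columns each become four long rows, one per (property, rule) pair.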

Each rule can trigger a list of actions, and not just one action. To get all the effects of a flexrule we explode this list in the *action_was_taken* column. Additionally, sometimes our Batman system mistakenly gets called multiple times for the same property by the B.Home team (this is pretty rare, and happens ~3 times a day). We use distinct to remove the duplicate rows that these situations cause.

In [None]:
actions_taken_per_rule = (
    action_taken_per_rule
    .where(sf.col("action_was_taken").isNotNull())
    .withColumn("action_taken", sf.explode(sf.col("action_was_taken")))
    .drop("action_was_taken")
    # it sometimes happens that there are multiple Batman calls per property;
    # in those cases, we want to take the union of all flexrule triggers,
    # which is achieved by a distinct
    .distinct()
)

We now filter out rows whose action only performs logging and does not take any real business action.

In [None]:
relevant_actions_taken_per_rule = (
    actions_taken_per_rule
    .where(~sf.col("action_taken").startswith("log"))
    .where(~sf.col("action_taken").startswith("monitor"))
    # These are just here for lineage tracking, but don't really do anything
    .where(sf.col("action_taken") != "skip_autoclose_property")
)
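The explode, distinct and filter steps above can be illustrated on a handful of rows in plain Python. This is only a sketch under the assumption that each row holds a property id, a rule name and an optional list of actions; the helper `relevant_actions`, the rule names and the `log_to_graphite` action are hypothetical.

```python
from typing import List, Optional, Set, Tuple

# Actions that only log or track lineage, mirroring the filters above.
NON_BUSINESS_PREFIXES = ("log", "monitor")
NON_BUSINESS_ACTIONS = {"skip_autoclose_property"}


def relevant_actions(
    rows: List[Tuple[int, str, Optional[List[str]]]]
) -> Set[Tuple[int, str, str]]:
    """Explode action lists, de-duplicate, and keep only business actions."""
    exploded = set()  # a set plays the role of .distinct()
    for property_id, rule, actions in rows:
        for action in actions or []:  # rules that did not trigger are skipped
            if action.startswith(NON_BUSINESS_PREFIXES) or action in NON_BUSINESS_ACTIONS:
                continue
            exploded.add((property_id, rule, action))
    return exploded


# Hypothetical input; property 2 appears twice because Batman was called twice.
rows = [
    (1, "rule_a", ["upgrade_score_to_1", "log_to_graphite"]),
    (2, "rule_b", ["move_score_to_high"]),
    (2, "rule_b", ["move_score_to_high"]),
    (3, "rule_c", None),
]
result = relevant_actions(rows)
print(sorted(result))
```

The duplicate call for property 2 collapses into a single row, the logging action of `rule_a` is dropped, and `rule_c` contributes nothing because it never triggered.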

We join in the remaining flexrule data and the Batman predictions to the instances (i.e., the non-autoclosed property registrations).

In [None]:
df = (
    instances
    .join(batman_predictions, on="property_id", how="left")
    .join(relevant_actions_taken_per_rule, on="property_id", how="left")
    .where(sf.col("batman_score").isNotNull())
)

# The Business Logic of Batman FlexRules

The logic of how these actions by the flexrules are applied to our properties is defined in https://gitlab.booking.com/core/main/-/blob/trunk/lib/Bookings/Fraud/PRS/RealtimePRS.pm#L534-554.

We now re-implement this logic in PySpark such that we can obtain the business decisions for each property. We give the method a **holdout_rules** argument that can be used to disable certain flexrules and calculate what would have happened if these rules had not existed.

In [None]:
def system_performance(
    df: DataFrame, 
    holdout_rules: List[str], 
    medium_threshold: float = 0.086,
    high_threshold: float = 0.15):
    """
    This method applies the logic from https://gitlab.booking.com/core/main/-/blob/trunk/lib/Bookings/Fraud/PRS/RealtimePRS.pm#L534-554
    in Spark. It returns the number of false positives, true positives, false negatives, and true negatives (in that order)
    of the whole Batman system when the medium and high thresholds are applied as provided in the arguments, and when all
    the flexrules are applied that are NOT in the list of the holdout_rules argument.
    """
    df = (
        df
        .groupBy("property_id")
        .agg(
            sf.max(sf.col("is_fake_hotel")).alias("is_fake_hotel"),
            sf.max(sf.col("batman_score")).alias("batman_score"),
            sf.max(
                (sf.col("action_taken") == "force_score_to_low") & (~sf.col("rule").isin(holdout_rules))
            ).alias("force_score_to_low"),
            sf.max(
                (sf.col("action_taken") == "upgrade_score_to_1") & (~sf.col("rule").isin(holdout_rules))
            ).alias("upgrade_score_to_1"),
            sf.max(
                (sf.col("action_taken") == "move_score_to_high") & (~sf.col("rule").isin(holdout_rules))
            ).alias("move_score_to_high"),
            sf.max(
                (sf.col("action_taken") == "move_score_to_medium") & (~sf.col("rule").isin(holdout_rules))
            ).alias("move_score_to_medium"),
            sf.max(
                (sf.col("action_taken") == "move_score_to_low") & (~sf.col("rule").isin(holdout_rules))
            ).alias("move_score_to_low"),
            sf.max(
                (sf.col("action_taken") == "upgrade_score_medium_to_high") & (~sf.col("rule").isin(holdout_rules))
            ).alias("upgrade_score_medium_to_high"),
        )
        .fillna(False, subset = ["force_score_to_low", "upgrade_score_to_1", "move_score_to_high", 
                                 "move_score_to_medium", "move_score_to_low", "upgrade_score_medium_to_high"])
    )
    
    df = (
        df
        .withColumn("model_bucket", sf.when(sf.col("batman_score") >= high_threshold, "high").otherwise(
            sf.when(sf.col("batman_score") >= medium_threshold, "medium").otherwise("low")))
        .withColumn("label", sf.when(sf.col("force_score_to_low"), "low").otherwise(
            sf.when(sf.col("upgrade_score_to_1"), "high").otherwise(
            sf.when(sf.col("move_score_to_high"), "high").otherwise(
            sf.when(sf.col("move_score_to_medium"), "medium").otherwise(
            sf.when(sf.col("move_score_to_low"), "low").otherwise(
            sf.when(sf.col("upgrade_score_medium_to_high") & (sf.col("model_bucket") == "medium"), "high").otherwise(    
            sf.col("model_bucket"))))))))
    )
    
    high_fp = df.where(sf.col("label") == "high").where(sf.col("is_fake_hotel") == 0).count()
    high_tp = df.where(sf.col("label") == "high").where(sf.col("is_fake_hotel") == 1).count()
    high_fn = df.where(sf.col("label") != "high").where(sf.col("is_fake_hotel") == 1).count()
    high_tn = df.where(sf.col("label") != "high").where(sf.col("is_fake_hotel") == 0).count()
    return high_fp, high_tp, high_fn, high_tn
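The precedence of the actions is easiest to see for a single property. The sketch below mirrors the nested `when`/`otherwise` chain of the PySpark function above in plain Python; it is an illustration of that function only, not of the Perl source, and the helper names `model_bucket` and `final_label` are hypothetical.

```python
from typing import Iterable


def model_bucket(score: float, medium_threshold: float = 0.086, high_threshold: float = 0.15) -> str:
    """Risk bucket implied by the model score alone."""
    if score >= high_threshold:
        return "high"
    if score >= medium_threshold:
        return "medium"
    return "low"


def final_label(score: float, actions: Iterable[str], **thresholds: float) -> str:
    """Apply flexrule actions in the same precedence order as system_performance."""
    actions = set(actions)
    bucket = model_bucket(score, **thresholds)
    if "force_score_to_low" in actions:
        return "low"
    if "upgrade_score_to_1" in actions or "move_score_to_high" in actions:
        return "high"
    if "move_score_to_medium" in actions:
        return "medium"
    if "move_score_to_low" in actions:
        return "low"
    if "upgrade_score_medium_to_high" in actions and bucket == "medium":
        return "high"
    return bucket


print(final_label(0.02, []))                                            # model only
print(final_label(0.02, ["upgrade_score_to_1"]))                        # rule overrides model
print(final_label(0.90, ["force_score_to_low", "upgrade_score_to_1"]))  # force-to-low wins
print(final_label(0.10, ["upgrade_score_medium_to_high"]))              # medium gets upgraded
print(final_label(0.02, ["upgrade_score_medium_to_high"]))              # low is left alone
```

Note that `force_score_to_low` takes priority over every other action, and that `upgrade_score_medium_to_high` only has an effect when the model itself already places the property in the medium bucket.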

# Evaluating a new rule

This section contains some steps for fraud operations to evaluate a new flexrule.

As a first step, go back to the definition of the **instances** variable above, and change the time range of the evaluation such that it matches the time range that you want to evaluate your flexrule on. It is recommended to leave out at least the most recent 10 days relative to the day of investigation, because the labels need time to mature.

At this point in the notebook, our df has the following columns:

In [None]:
df.columns

Strictly speaking **registration_date** is not used in the system_performance method, so if you prefer, you can safely drop it.

To evaluate a new candidate for a flexrule, you can prepare a new dataframe that contains these columns. This new dataframe should contain one row for each property on which the new rule would have actioned.

I'll now get you going with a dataframe that gives you 4 of these 6 columns:

In [None]:
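new_rule_df = (
    instances
    .join(batman_predictions, on="property_id", how="left")
    # keep only properties with a Batman score, consistent with how df was built above
    .where(sf.col("batman_score").isNotNull())
)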

In [None]:
new_rule_df.columns

From this point on, you can **tweak new_rule_df** according to the logic of the flexrule. The idea is to filter out all property_ids on which your new rule would not have actioned, such that you obtain a dataframe with only those property_ids on which your rule would have actioned.

In [None]:
### TODO: Implement the logic to filter new_rule_df down to only those property_ids that the rule would have actioned on

# as a ridiculous example, I'll make a flexrule here that actions on all properties that have an even property_id
new_rule_df = new_rule_df.where(sf.col("property_id") % 2 == 0)

Now we also need the **rule** and **action_taken** columns that are in the initial df. We can simply set them to fixed values.

In [None]:
rule_name = "THE_NAME_OF_THE_NEW_RULE" # set this to whatever name you want (no spaces, hyphens, or other special chars)
rule_action = "upgrade_score_to_1"  # you could set this to a different action if needed

new_rule_df = (
    new_rule_df
    .withColumn("rule", sf.lit(rule_name))
    .withColumn("action_taken", sf.lit(rule_action)) 
)

In [None]:
new_rule_df.columns

Notice that new_rule_df now has exactly the same columns as the initial df, which contains all the actions of our existing set of flexrules. This lets us append the rows of the new rule to df.

In [None]:
df_with_old_rules_and_new_rule = df.unionByName(new_rule_df)

In [None]:
# let's cache this dataframe, this is just to speed up the computation that follows below.
df_with_old_rules_and_new_rule.cache().count()

# calculate the performance without the new rule
fp, tp, fn, tn = system_performance(df_with_old_rules_and_new_rule, holdout_rules=[rule_name])

# calculate the performance with the new rule
fp_with, tp_with, fn_with, tn_with = system_performance(df_with_old_rules_and_new_rule, holdout_rules=[])

# print the effect of the rule
print(f"{rule_name} leads to:")
print(f"    {fp_with-fp} incremental false positives")
print(f"    {fn_with-fn} incremental false negatives")
print(f"    {tp_with-tp} incremental true positives")
print(f"    {tn_with-tn} incremental true negatives")
print()

Not surprisingly, a rule that actions on all the even property_ids is not a great idea: it creates an incredibly high number of incremental false positives.
    
Now, make your own rule and see if you can do something smarter :).

# The Incremental Effect of each Individual Flexrule

For each individual flexrule we now calculate what business action we would have taken if that rule had not existed, and we compare that to the business action that we took in the current system. This allows us to measure the incremental number of **false positives**, **false negatives**, **true positives**, and **true negatives** that this flexrule contributed to our business decision-making.

In [None]:
# cache the dataframe such that we can process it multiple times below
df.cache().count()

In [None]:
# For comparison, calculate the system performance under presence of all the flexrules
fp, tp, fn, tn = system_performance(df, [])

# We now iterate over all the flexrules, and try to disable them one-by-one and measure 
# the effect on the overall system
rules = df.select("rule").distinct().collect()

for rule in rules:
    rule_name = rule["rule"]
    if rule_name is not None:
        fp_without, tp_without, fn_without, tn_without = system_performance(df, [rule_name])
        print(f"{rule_name} leads to:")
        print(f"    {fp-fp_without} incremental false positives")
        print(f"    {fn-fn_without} incremental false negatives")
        print(f"    {tp-tp_without} incremental true positives")
        print(f"    {tn-tn_without} incremental true negatives")
        print()

# A Proposal for Deprecating Flexrules

Based on the results above, I recommend turning off the following Batman flexrules because they do not have any impact on the Batman system:

- 09_2020_PFOps_BATMAN_Blacklist_IBAN_Match_Weak_Strong_UpgradeScoreTo1

Note that autoclosing flexrules are very hard to evaluate and are outside the scope of this analysis, because we cannot be certain whether autoclosed properties are fake or not. Therefore, such rules are not listed above.

At the current experimental threshold of 0.15, the Batman model catches >70% of the fake hotels and has a precision of 27.8% (i.e., 72.2% of what it identifies as high-risk is genuine and gets unnecessary friction). If a rule has less than 27.8% precision among its **incremental** contributions, we can conclude that it is harmful to the overall system rather than helpful. Looking at the results above, this is the case for the following flexrules:

- 09_2020_PFOps_BATMAN_Agency_FR_Fakes_upgradescoreto1_V2 (1.9% precision in incremental detection)
- 03_2021_PFOps_BATMAN_Agency_RU_Fakes_upgradescoreto1 (4.2% precision in incremental detection)
- 09_2020_PFOps_BATMAN_Agency_ES_Fakes_upgradescoreto1_V2 (3.9% precision in incremental detection)
- 10_2020_PFOps_BATMAN_Agency_AT_Fakes_Upgrade_Score_To_1 (1.8% precision in incremental detection)
- mismatch_ip_cc1_very_high (0% precision in incremental detection, it only adds false positives!)
- 10_2020_PFOps_BATMAN_Agency_UA_Fakes_Upgrade_Score_To_1 (2.7% precision in incremental detection)
- 09_2020_PFOps_BATMAN_Agency_NL_Fakes_upgradescoreto1_V2 (1.7% precision in incremental detection)
- ip_cc1_risky (0% precision in incremental detection, it only adds false positives!)
- may18_15_min_kathy_model (8.5% precision in incremental detection)
- 11_2020_PFOps_BATMAN_Agency_LuxuryFakes_Germany_UpgradeScoreTo1 (2.4% precision in incremental detection)
- 10_2020_PFOps_BATMAN_Agency_AU_Fakes_Upgrade_Score_To_1 (4.8% precision in incremental detection)
- 10_2020_PFOps_BATMAN_Agency_CH_Fakes_Upgrade_Score_To_1 (5.7% precision in incremental detection)
- wpp_confidence_above470 (6.3% precision in incremental detection)

Rules that do have a precision that is sufficiently high among the incremental contributions:

- july18_15_min_room_price (23.07% precision in incremental detection)
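The comparison against the model's precision can be sketched as follows. This assumes that incremental precision is defined as incremental true positives divided by the sum of incremental true and false positives, which is one natural reading of the text above; the helper names and the example counts are hypothetical.

```python
from typing import Optional

MODEL_PRECISION = 0.278  # precision of the model at the 0.15 threshold, as stated above


def incremental_precision(incremental_tp: int, incremental_fp: int) -> Optional[float]:
    """Share of a rule's incremental high-risk decisions that were actually fake."""
    flagged = incremental_tp + incremental_fp
    if flagged == 0:
        return None  # the rule never changed a decision to high-risk
    return incremental_tp / flagged


def is_harmful(incremental_tp: int, incremental_fp: int, benchmark: float = MODEL_PRECISION) -> bool:
    """A rule is considered harmful if its incremental precision is below the benchmark."""
    precision = incremental_precision(incremental_tp, incremental_fp)
    return precision is not None and precision < benchmark


# Hypothetical counts, for illustration only.
print(incremental_precision(3, 97))
print(is_harmful(3, 97))
print(is_harmful(0, 0))
```

A rule without any incremental decisions has no defined incremental precision, so this sketch does not flag it as harmful; such rules are covered by the first recommendation instead, because they have no impact at all.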

**System performance under the proposal**:

In [None]:
deprecation_list = [
    "09_2020_PFOps_BATMAN_Blacklist_IBAN_Match_Weak_Strong_UpgradeScoreTo1",
    "09_2020_PFOps_BATMAN_Agency_FR_Fakes_upgradescoreto1_V2",
    "03_2021_PFOps_BATMAN_Agency_RU_Fakes_upgradescoreto1",
    "09_2020_PFOps_BATMAN_Agency_ES_Fakes_upgradescoreto1_V2",
    "10_2020_PFOps_BATMAN_Agency_AT_Fakes_Upgrade_Score_To_1",
    "mismatch_ip_cc1_very_high",
    "10_2020_PFOps_BATMAN_Agency_UA_Fakes_Upgrade_Score_To_1",
    "09_2020_PFOps_BATMAN_Agency_NL_Fakes_upgradescoreto1_V2",
    "ip_cc1_risky",
    "may18_15_min_kathy_model",
    "11_2020_PFOps_BATMAN_Agency_LuxuryFakes_Germany_UpgradeScoreTo1",
    "10_2020_PFOps_BATMAN_Agency_AU_Fakes_Upgrade_Score_To_1",
    "10_2020_PFOps_BATMAN_Agency_CH_Fakes_Upgrade_Score_To_1",
    "wpp_confidence_above470",
]

fp_new, tp_new, fn_new, tn_new = system_performance(df, deprecation_list)

print(f"The overall proposal would have resulted in:")
print(f"    {fp_new} genuine properties with friction (change of {fp_new-fp})")
print(f"    {tp_new} fake properties caught (change of {tp_new-tp})")
print(f"    {fn_new} fake properties not-caught (change of {fn_new-fn})")
print(f"    {tn_new} genuine properties without friction (change of {tn_new-tn})")

**Baseline: system performance under all flexrules and pre-experiment Batman threshold**:

For comparison, we now show the results if we instead kept all the flexrules and set the threshold back up from 0.15 to 0.5.

In [None]:
fp_new, tp_new, fn_new, tn_new = system_performance(df, [], medium_threshold = 0.086, high_threshold = 0.5)

print(f"Raising back up the threshold without changing the flexrules would have resulted in:")
print(f"    {fp_new} genuine properties with friction (change of {fp_new-fp})")
print(f"    {tp_new} fake properties caught (change of {tp_new-tp})")
print(f"    {fn_new} fake properties not-caught (change of {fn_new-fn})")
print(f"    {tn_new} genuine properties without friction (change of {tn_new-tn})")

It is clear that the combination of lowering the Batman threshold + deprecating flexrules **increases the share of fake properties caught** and at the same time **decreases friction for genuine properties**!

# Conclusions

Lowering the Batman threshold from 0.5 to 0.15 lets us catch 72% of the fake properties that the blacklist didn't catch, instead of only 6.4%, at the expense of additional friction on genuine properties (see [the proposal](https://docs.google.com/document/d/1TMS4ohydf8E85xmvLjSOAqKfuSGqh1-gDcS2HCAX_gQ/edit#heading=h.2x82ndvu33mx)).

Deprecating this long list of flexrules results in a roughly 40% decrease of friction on genuine partners (8538 genuine partners in high-risk instead of 8538 + 5366 = 13904), while we would only reduce the share of fake properties caught by 10%.

The combined effect of keeping the Batman threshold low + deprecating the flexrules results in:

- **An increase** of the number of fake properties caught from 1567 to 2309
- **A decrease** of the number of genuine properties with friction from 11338 to 8538
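As a quick sanity check, the relative changes implied by the counts quoted in this section can be recomputed as below. The helper `relative_change` is hypothetical, and the pairing of the counts with scenarios follows my reading of the text above (in particular, that 1567 and 11338 refer to the pre-experiment baseline).

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


# Friction at the 0.15 threshold: with all flexrules vs. after the deprecations.
friction_with_rules = 8538 + 5366
friction_after_deprecation = 8538
friction_change = relative_change(friction_with_rules, friction_after_deprecation)
print(f"{friction_change:.1%}")

# Combined effect relative to the baseline quoted in the bullets above.
caught_change = relative_change(1567, 2309)       # fake properties caught
genuine_change = relative_change(11338, 8538)     # genuine properties with friction
print(f"{caught_change:.1%}")
print(f"{genuine_change:.1%}")
```

This gives a friction reduction of about 38.6% from deprecating the flexrules, and for the combined change about 47.4% more fake properties caught with about 24.7% fewer genuine properties receiving friction.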

In my view this is a no-brainer, and the right thing to do is to deprecate a long list of flexrules while keeping the Batman threshold at 0.15 (as in the current experiment).