# Analysis of Reversion Activity by Anti-Vandalism Bots
**Krishna Chaitanya Velaga, Data Scientist III, Wikimedia Foundation**

**Last updated on 28 July 2023**

[TASK: T341857](https://phabricator.wikimedia.org/T341857)

# Contents

1. [Overview](#Overview)
2. [Data Gathering](#Data-Gathering)
3. [Results](#Results)

# Overview
The goal of this analysis was to understand the activity of various automated anti-[vandalism](https://en.wikipedia.org/wiki/Wikipedia:Vandalism) bots, primarily how much of the anti-vandalism burden the bots take on in their respective communities. The findings will inform decisions made during the development of [Automoderator](https://www.mediawiki.org/wiki/Moderator_Tools/Automoderator), including but not limited to developing a measurement plan and setting baselines. The analysis uses the [MediaWiki history dataset](https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/MediaWiki_history) over a period of three years (July 2020 to June 2023).

**Questions Answered**

For the given list of anti-vandal bots, and their respective wikis:
- How many reverts happen per day where the time between the edit and its revert is less than 24 hours?
- How many of those reverts per day are made by the bots (bot reverts tend to happen quickly)?
    - How does this compare to all reverts taking place within 24 hours of an edit?
- What percentage of the reverts made by the anti-vandal bots are themselves reverted?
    - Note: Calculating the false positive ratio of the bots is beyond the scope of this task; further investigation is necessary before this data can be used for that purpose.

**Considerations**

*(based on the inputs shared by [Sam Walton](https://phabricator.wikimedia.org/p/Samwalton9/), Product Manager for Automoderator)*

Given that the analysis will be used to inform the decisions for Automoderator development, the following considerations were taken into account:
- Because bots act quickly, only anti-vandalism activity that tracks new edits was considered, i.e. reverts made within 24 hours of the edit.
    - Reverts made much later (say, a month after the edit) likely come from a different process than monitoring recent changes.
- Only edits made to the [content namespaces](https://en.wikipedia.org/wiki/Wikipedia:Namespace) were considered.
- New page creations were excluded as Automoderator is not expected to monitor new page creations.
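
The considerations above amount to a handful of row-level filters. The following is a minimal pandas sketch on made-up rows, assuming column names that mirror the MediaWiki history fields used in the query further down; the helper `filter_candidate_edits` and the sample data are hypothetical and only illustrate the logic.

```python
import pandas as pd

SECONDS_PER_DAY = 24 * 60 * 60


def filter_candidate_edits(revisions: pd.DataFrame) -> pd.DataFrame:
    """Keep reverted content-namespace edits that were reverted within
    24 hours and are not new page creations (hypothetical helper)."""
    mask = (
        revisions["revision_is_identity_reverted"]
        & revisions["page_namespace_is_content"]
        & (revisions["revision_seconds_to_identity_revert"] <= SECONDS_PER_DAY)
        & (revisions["revision_parent_id"] != 0)  # parent id 0 marks a page creation
    )
    return revisions[mask]


# Made-up rows: only revision 1 satisfies every consideration.
sample = pd.DataFrame({
    "revision_id": [1, 2, 3, 4],
    "revision_is_identity_reverted": [True, True, True, True],
    "page_namespace_is_content": [True, True, False, True],
    "revision_seconds_to_identity_revert": [120, 40 * SECONDS_PER_DAY, 300, 60],
    "revision_parent_id": [10, 11, 12, 0],
})

kept = filter_candidate_edits(sample)
print(kept["revision_id"].tolist())
```

Revision 2 is dropped because it was reverted 40 days later, revision 3 because it is outside the content namespaces, and revision 4 because it is a page creation.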

# Data-Gathering

## imports

In [1]:
import wmfdata as wmf
import pandas as pd
import numpy as np

pd.options.display.max_columns = None

## spark_session

In [2]:
spark_session = wmf.spark.get_active_session()

if spark_session is not None:
    spark_session.stop()
else:
    print('no active session')

no active session


In [3]:
spark_session = wmf.spark.create_custom_session(
    master="yarn",
    app_name='bot-vandal-reverts',
    spark_config={
        "spark.driver.memory": "4g",
        "spark.dynamicAllocation.maxExecutors": 64,
        "spark.executor.memory": "16g",
        "spark.executor.cores": 4,
        "spark.sql.shuffle.partitions": 256,
        "spark.driver.maxResultSize": "2g"
        
    }
)



In [4]:
spark_session

In [5]:
spark_session.sparkContext.setLogLevel("ERROR")

## run query

In [6]:
# list as per https://phabricator.wikimedia.org/T341857
bots = {
    'enwiki': 'ClueBot NG',
    'eswiki': 'SeroBOT',
    'frwiki': 'Salebot',
    'ptwiki': 'Salebot',
    'fawiki': 'Dexbot',
    'bgwiki': 'PSS 9',
    'simplewiki': 'ChenzwBot',
    'ruwiki': 'Рейму Хакурей',
    'rowiki': 'PatrocleBot'
}

In [7]:
%%time

query = """
WITH 
    base AS (
        SELECT 
            wiki_db,
            revision_id,
            event_timestamp,
            revision_first_identity_reverting_revision_id,
            revision_seconds_to_identity_revert
        FROM 
            wmf.mediawiki_history
        WHERE 
            snapshot = '{MW_SNAPSHOT}'
            AND wiki_db IN {DBS}
            AND event_entity = 'revision'
            AND event_type = 'create'
            AND revision_is_identity_reverted
            AND page_namespace_is_content
            AND revision_seconds_to_identity_revert <= 24 * 60 * 60
            AND DATE (event_timestamp) >= DATE ('{START_DATE}')
            AND DATE (event_timestamp) <= DATE ('{END_DATE}')
            AND NOT revision_parent_id = 0
        )
            
SELECT 
    mwh.wiki_db,
    YEAR(mwh.event_timestamp) AS year,
    MONTH(mwh.event_timestamp) AS month,
    DAY(mwh.event_timestamp) AS day,
    COUNT(DISTINCT mwh.revision_id) AS all_reverts,
    COUNT(DISTINCT (
            CASE
                WHEN mwh.event_user_text IN {BOTS}
                     THEN mwh.revision_id
            END)) AS bot_reverts,
    COUNT(DISTINCT (
            CASE
                WHEN mwh.event_user_text IN {BOTS}
                     AND mwh.revision_is_identity_reverted = True
                     THEN mwh.revision_id
            END)) AS reverted_bot_reverts
FROM 
    base
JOIN wmf.mediawiki_history mwh
     ON base.revision_first_identity_reverting_revision_id = mwh.revision_id
        AND base.wiki_db = mwh.wiki_db
WHERE 
    mwh.snapshot = '{MW_SNAPSHOT}'
GROUP BY 
    YEAR(mwh.event_timestamp),
    MONTH(mwh.event_timestamp),
    DAY(mwh.event_timestamp),
    mwh.wiki_db
"""

time_bounds = ['2020-07-01', '2023-06-30'] # three years
mw_snapshot = '2023-06'

bot_revert_counts = wmf.spark.run(query.format(DBS=wmf.utils.sql_tuple(bots.keys()),
                                               BOTS=wmf.utils.sql_tuple(set(bots.values())),
                                               START_DATE=time_bounds[0],
                                               END_DATE=time_bounds[1],
                                               MW_SNAPSHOT=mw_snapshot))

                                                                                

CPU times: user 559 ms, sys: 90.2 ms, total: 649 ms
Wall time: 2min 30s
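
To make the templating step in the cell above easier to follow, here is a rough sketch of how the `{DBS}` and `{BOTS}` placeholders might be rendered into SQL tuple literals. The helper `to_sql_tuple` is a hypothetical stand-in written for this illustration and is not the `wmfdata` implementation; the template string is a shortened fragment, not the full query.

```python
def to_sql_tuple(values):
    """Render strings as a SQL tuple literal, e.g. ('a', 'b').

    Hypothetical stand-in for illustration only: it does no escaping,
    so it is only suitable for trusted, constant names.
    """
    return "(" + ", ".join(f"'{v}'" for v in values) + ")"


# A subset of the wiki-to-bot mapping, for brevity.
bots = {"enwiki": "ClueBot NG", "frwiki": "Salebot", "ptwiki": "Salebot"}

template = "wiki_db IN {DBS} AND event_user_text IN {BOTS}"
rendered = template.format(
    DBS=to_sql_tuple(bots.keys()),
    # set() removes the duplicate 'Salebot'; sorted() makes the order stable
    BOTS=to_sql_tuple(sorted(set(bots.values()))),
)
print(rendered)
```

One consequence worth keeping in mind: the bot names are passed as a single set, so each name is matched on every wiki in the list rather than only on the wiki it is paired with in `bots`.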


In [8]:
# A few edits made on the last day of the window (30 June 2023) were reverted the next day,
# so their reverts fall in July 2023. Drop those rows: they cover an incomplete day and may skew the aggregations.
bot_revert_counts = (bot_revert_counts
                     .drop(bot_revert_counts
                           .query("""(year == 2023) & (month >= 7)""")
                           .index))
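
The trimming step above can also be written as a boolean mask, which some readers may find easier to scan. This is a small sketch on made-up daily counts; the frame `daily` and the constants `END_YEAR` and `END_MONTH` are hypothetical names introduced only for the example.

```python
import pandas as pd

# Made-up daily counts, including one row that spills past the end of the window.
daily = pd.DataFrame({
    "wiki_db": ["enwiki", "enwiki", "enwiki"],
    "year": [2023, 2023, 2023],
    "month": [6, 6, 7],
    "day": [29, 30, 1],
    "all_reverts": [6400, 6550, 12],
})

END_YEAR, END_MONTH = 2023, 6

# Rows dated after the last month of the window hold only reverts of edits
# made on the final day, so they represent an incomplete day.
spill = (daily["year"] == END_YEAR) & (daily["month"] > END_MONTH)
trimmed = daily[~spill].reset_index(drop=True)
print(trimmed)
```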

In [9]:
# save data
(bot_revert_counts
 .sort_values(['wiki_db', 'year', 'month', 'day'])
 .to_csv('data_outputs/anti_vandal_bot_revert_counts.tsv', sep='\t', index=False))

# Results

In [18]:
percent_bot_reverts_mean = (
    (
        bot_revert_counts.groupby("wiki_db").agg(
            {"all_reverts": "mean", "bot_reverts": "mean", "reverted_bot_reverts": "mean"}
        )
    )
    .assign(
        bot_reverts_percent=lambda df: df["bot_reverts"] / df["all_reverts"] * 100,
        reverted_bot_reverts_percent=lambda df: df["reverted_bot_reverts"] / df["bot_reverts"] * 100,
    )
    .reset_index()
    .assign(bot=lambda df: df["wiki_db"].map(bots))
    .round(
        {
            "all_reverts": 0,
            "bot_reverts": 0,
            "reverted_bot_reverts": 1,
            "bot_reverts_percent": 2,
            "reverted_bot_reverts_percent": 2,
        }
    )
    .set_index(["wiki_db", "bot"])
    .sort_index()
)

percent_bot_reverts_mean

| wiki_db | bot | all_reverts | bot_reverts | reverted_bot_reverts | bot_reverts_percent | reverted_bot_reverts_percent |
|---|---|---|---|---|---|---|
| bgwiki | PSS 9 | 56.0 | 4.0 | 0.3 | 6.45 | 8.35 |
| enwiki | ClueBot NG | 6577.0 | 356.0 | 33.5 | 5.41 | 9.41 |
| eswiki | SeroBOT | 2125.0 | 894.0 | 78.4 | 42.09 | 8.76 |
| fawiki | Dexbot | 393.0 | 90.0 | 9.8 | 22.83 | 10.96 |
| frwiki | Salebot | 731.0 | 26.0 | 2.9 | 3.53 | 11.09 |
| ptwiki | Salebot | 201.0 | 0.0 | 0.0 | 0.0 | NaN |
| rowiki | PatrocleBot | 55.0 | 3.0 | 0.3 | 5.59 | 10.19 |
| ruwiki | Рейму Хакурей | 710.0 | 65.0 | 8.1 | 9.16 | 12.53 |
| simplewiki | ChenzwBot | 89.0 | 13.0 | 1.5 | 14.71 | 11.55 |
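
Two properties of the cell above are worth spelling out. The percentages are computed from the averaged daily counts (a ratio of means), not by averaging daily percentages, and a wiki whose mean bot reverts is zero ends up with an undefined 0/0 ratio, which is consistent with the missing `reverted_bot_reverts_percent` value for ptwiki. The sketch below uses made-up counts for two hypothetical wikis, `awiki` and `bwiki`, to show both effects.

```python
import numpy as np
import pandas as pd

# Made-up daily counts; "bwiki" has no bot reverts at all.
daily = pd.DataFrame({
    "wiki_db": ["awiki"] * 3 + ["bwiki"] * 3,
    "all_reverts": [100, 200, 300, 10, 20, 30],
    "bot_reverts": [10, 40, 90, 0, 0, 0],
    "reverted_bot_reverts": [1, 4, 9, 0, 0, 0],
})

# Ratio of means, as in the cell above: average the counts first, then divide.
means = daily.groupby("wiki_db")[
    ["all_reverts", "bot_reverts", "reverted_bot_reverts"]
].mean()
means["bot_reverts_percent"] = means["bot_reverts"] / means["all_reverts"] * 100
means["reverted_bot_reverts_percent"] = (
    means["reverted_bot_reverts"] / means["bot_reverts"] * 100
)

# Mean of daily ratios, shown only for comparison; it weights every day equally.
mean_of_daily = (
    (daily["bot_reverts"] / daily["all_reverts"] * 100)
    .groupby(daily["wiki_db"])
    .mean()
)

print(means.round(2))
print(mean_of_daily.round(2))
```

For `awiki` the ratio of means is about 23.33% while the mean of daily ratios is 20%, so the two summaries can differ noticeably when busy days have a different bot share than quiet days. For `bwiki` the reverted share is `NaN` because both the numerator and the denominator are zero.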


In [20]:
percent_bot_reverts_median = (
    (
        bot_revert_counts.groupby("wiki_db").agg(
            {"all_reverts": "median", "bot_reverts": "median", "reverted_bot_reverts": "median"}
        )
    )
    .assign(
        bot_reverts_percent=lambda df: df["bot_reverts"] / df["all_reverts"] * 100,
        reverted_bot_reverts_percent=lambda df: df["reverted_bot_reverts"] / df["bot_reverts"] * 100,
    )
    .reset_index()
    .assign(bot=lambda df: df["wiki_db"].map(bots))
    .round(
        {
            "all_reverts": 0,
            "bot_reverts": 0,
            "reverted_bot_reverts": 1,
            "bot_reverts_percent": 2,
            "reverted_bot_reverts_percent": 2,
        }
    )
    .set_index(["wiki_db", "bot"])
    .sort_index()
)

percent_bot_reverts_median

| wiki_db | bot | all_reverts | bot_reverts | reverted_bot_reverts | bot_reverts_percent | reverted_bot_reverts_percent |
|---|---|---|---|---|---|---|
| bgwiki | PSS 9 | 52.0 | 3.0 | 0.0 | 5.77 | 0.0 |
| enwiki | ClueBot NG | 6508.0 | 348.0 | 33.0 | 5.35 | 9.48 |
| eswiki | SeroBOT | 2061.0 | 861.0 | 75.0 | 41.78 | 8.71 |
| fawiki | Dexbot | 406.0 | 0.0 | 0.0 | 0.0 | NaN |
| frwiki | Salebot | 708.0 | 18.0 | 2.0 | 2.54 | 11.11 |
| ptwiki | Salebot | 173.0 | 0.0 | 0.0 | 0.0 | NaN |
| rowiki | PatrocleBot | 51.0 | 0.0 | 0.0 | 0.0 | NaN |
| ruwiki | Рейму Хакурей | 714.0 | 66.0 | 8.0 | 9.24 | 12.12 |
| simplewiki | ChenzwBot | 82.0 | 13.0 | 1.0 | 15.85 | 7.69 |
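
Some wikis show a non-zero mean but a zero median for `bot_reverts` (for example fawiki and rowiki). One possible explanation, offered here as an assumption rather than a verified finding, is that the bot made reverts on fewer than half of the days in the window. The toy series below, with made-up values, shows how that pattern produces exactly this combination.

```python
import pandas as pd

# Made-up daily bot revert counts: the bot is active on only 2 of 5 days.
bot_reverts = pd.Series([0, 0, 0, 150, 300])

mean_reverts = bot_reverts.mean()      # pulled up by the two active days
median_reverts = bot_reverts.median()  # middle value is a zero-activity day
active_share = (bot_reverts > 0).mean()

print(mean_reverts, median_reverts, active_share)
```

Checking the share of days with at least one bot revert, as `active_share` does here, would be one way to test this explanation against the real daily counts.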


In [170]:
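# NOTE: `percent_bot_reverts` is not defined in the cells shown in this notebook;
# it was presumably built from the mean and median tables above in a cell that is not included here.
percent_bot_reverts.fillna(0).to_csv('data_outputs/anti_vandal_bot_revert_percentages.tsv', sep='\t')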