# Analysis of Reversion Activity by Anti-Vandalism Bots
**Krishna Chaitanya Velaga, Data Scientist III, Wikimedia Foundation**

**Last updated on 31 July 2023**

[TASK: T341857](https://phabricator.wikimedia.org/T341857)

# Contents

1. [Overview](#Overview)
2. [Data Gathering (Bot Reverts)](#Data-Gathering-Bot-Reverts)
3. [Results (Bot Reverts)](#Results-Bot-Reverts)
4. [Data Gathering (Reverted Bot Reverts)](#Data-Gathering-Reverted-Bot-Reverts)
5. [Results (Reverted Bot Reverts)](#Results-Reverted-Bot-Reverts)

# Overview
The goal of the following analysis was to understand the activity of various automated anti-[vandalism](https://en.wikipedia.org/wiki/Wikipedia:Vandalism) bots. Primarily, how much of the anti-vandalism burden is taken on by the bots in the respective communities? The findings will eventually inform the decisions to be taken during the development of [Automoderator](https://www.mediawiki.org/wiki/Moderator_Tools/Automoderator), including but not limited to the development of a measurement plan and setting baselines. The [MediaWiki history dataset](https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/MediaWiki_history) has been used for the analysis, for a time period of three years (July 2020 to June 2023).

**Questions Answered**

<u>Primary</u>

For the given list of anti-vandal bots, and their respective wikis (along with user segmentation wherever possible):
- How many reverts happen per day, where the time between the edit and its revert is less than 24 hrs?
- How many bot reverts happen per day (as they tend to happen quickly)? 
    - How does this compare to the reverts taking place within 24 hrs of an edit?
    
<u>Secondary</u>

- What percentage of the reverts made by the anti-vandal bots are themselves reverted? (Not to be confused with the false positive rate, FPR.)
- What percentage of the reverted bot reverts are reverted back by the same user/IP whose edit had been initially reverted?

**Considerations**

*(based on the inputs shared by [Sam Walton](https://phabricator.wikimedia.org/p/Samwalton9/), Product Manager for Automoderator)*

Given that the analysis will be used to inform the decisions for Automoderator development, the following considerations were taken into account:
- Because bots revert quickly, only anti-vandalism activity that responds to new edits was considered, i.e. reverts made within 24 hrs of the edit.
    - Reverts taking place much later (say, a month) likely come from a different process than monitoring recent changes.
- Only edits made to the [content namespaces](https://en.wikipedia.org/wiki/Wikipedia:Namespace) were considered.
- New page creations were excluded as Automoderator is not expected to monitor new page creations.
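
The three considerations above amount to row filters on the revision data. The following is a minimal sketch on a toy table, not the query used in this analysis; the column names (`namespace_is_content`, `parent_rev_id`, `seconds_to_revert`) are illustrative stand-ins for the corresponding MediaWiki history fields.

```python
import pandas as pd

# Toy data; column names are illustrative, not the MediaWiki history schema.
edits = pd.DataFrame({
    "rev_id": [1, 2, 3, 4],
    "namespace_is_content": [True, True, False, True],
    "parent_rev_id": [10, 0, 11, 12],            # 0 marks a new page creation
    "seconds_to_revert": [300, 600, 120, 200_000],
})

DAY_SECONDS = 24 * 60 * 60

in_scope = edits[
    edits["namespace_is_content"]                   # content namespaces only
    & (edits["parent_rev_id"] != 0)                 # exclude new page creations
    & (edits["seconds_to_revert"] <= DAY_SECONDS)   # reverted within 24 hrs
]
print(in_scope["rev_id"].tolist())
```

Only revision 1 passes all three filters: revision 2 is a page creation, revision 3 is outside the content namespaces, and revision 4 was reverted after more than a day.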

# Data-Gathering

## imports

In [1]:
import wmfdata as wmf
import pandas as pd
import numpy as np

pd.options.display.max_columns = None

import warnings

## spark_session

In [2]:
spark_session = wmf.spark.get_active_session()

if spark_session is not None:
    spark_session.stop()
else:
    print('no active session')

no active session


In [3]:
spark_session = wmf.spark.create_custom_session(
    master="yarn",
    app_name='bot-vandal-reverts',
    spark_config={
        "spark.driver.memory": "4g",
        "spark.dynamicAllocation.maxExecutors": 64,
        "spark.executor.memory": "16g",
        "spark.executor.cores": 4,
        "spark.sql.shuffle.partitions": 256,
        "spark.driver.maxResultSize": "2g"
        
    }
)

SPARK_HOME: /usr/lib/spark3
Using Hadoop client lib jars at 3.2.0, provided by Spark.
PYSPARK_PYTHON=/opt/conda-analytics/bin/python3


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/08/15 09:32:08 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
23/08/15 09:32:18 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!


In [4]:
spark_session

In [5]:
spark_session.sparkContext.setLogLevel("ERROR")

## run query

In [6]:
# list as per https://phabricator.wikimedia.org/T341857
bots = {
    'enwiki': 'ClueBot NG',
    'eswiki': 'SeroBOT',
    'frwiki': 'Salebot',
    'ptwiki': 'Salebot',
    'fawiki': 'Dexbot',
    'bgwiki': 'PSS 9',
    'simplewiki': 'ChenzwBot',
    'ruwiki': 'Рейму Хакурей',
    'rowiki': 'PatrocleBot'
}

In [7]:
time_bounds = ['2020-07-01', '2023-06-30'] # three years
mw_snapshot = '2023-06'

### bot reverts

In [37]:
%%time

bot_reverts_query = """
WITH 
    base AS (
        SELECT 
            wiki_db,
            revision_id,
            event_timestamp,
            revision_first_identity_reverting_revision_id,
            revision_seconds_to_identity_revert,
            CASE
                WHEN event_user_is_anonymous THEN 'anon'
                ELSE 'registered'
            END AS user_type,
            CASE
                WHEN event_user_revision_count >= 0 AND event_user_revision_count < 100 THEN '0-99'
                WHEN event_user_revision_count >= 100 AND event_user_revision_count < 500 THEN '100-499'
                WHEN event_user_revision_count >= 500 THEN '500+'
                ELSE 'n/a'
            END AS edit_bucket
        FROM 
            wmf.mediawiki_history
        WHERE 
            snapshot = '{MW_SNAPSHOT}'
            AND wiki_db IN {DBS}
            AND event_entity = 'revision'
            AND event_type = 'create'
            AND revision_is_identity_reverted
            AND page_namespace_is_content
            AND revision_seconds_to_identity_revert <= 24 * 60 * 60
            AND DATE (event_timestamp) >= DATE ('{START_DATE}')
            AND DATE (event_timestamp) <= DATE ('{END_DATE}')
            AND NOT revision_parent_id = 0
        )
            
SELECT 
    mwh.wiki_db,
    YEAR(mwh.event_timestamp) AS year,
    MONTH(mwh.event_timestamp) AS month,
    DAY(mwh.event_timestamp) AS day,
    user_type,
    edit_bucket,
    COUNT(DISTINCT mwh.revision_id) AS all_reverts,
    COUNT(DISTINCT (
            CASE 
                WHEN event_user_text IN {BOTS}
                     THEN mwh.revision_id
            END)) AS bot_reverts,
    COUNT(DISTINCT (
            CASE 
                WHEN event_user_text IN {BOTS}
                     AND mwh.revision_is_identity_reverted = True
                     THEN mwh.revision_id
            END)) AS reverted_bot_reverts
FROM 
    base
JOIN wmf.mediawiki_history mwh
     ON base.revision_first_identity_reverting_revision_id = mwh.revision_id
        AND base.wiki_db = mwh.wiki_db
WHERE 
    snapshot = '{MW_SNAPSHOT}'
GROUP BY 
    YEAR(mwh.event_timestamp),
    MONTH(mwh.event_timestamp),
    DAY(mwh.event_timestamp),
    mwh.wiki_db,
    edit_bucket,
    user_type
"""

bot_revert_counts = wmf.spark.run(bot_reverts_query.format(DBS=wmf.utils.sql_tuple(bots.keys()),
                                                           BOTS=wmf.utils.sql_tuple(set(bots.values())),
                                                           START_DATE=time_bounds[0],
                                                           END_DATE=time_bounds[1],
                                                           MW_SNAPSHOT=mw_snapshot))

                                                                                

CPU times: user 570 ms, sys: 89.5 ms, total: 659 ms
Wall time: 2min 28s


In [14]:
# although three years of data has been considered by default, a few bots have started running much later
# gathering the registration date of the bots for their respective wikis

warnings.filterwarnings('ignore')

bot_reg_query = """
SELECT
    user_name AS bot,
    DATE(user_registration) AS reg_date
FROM
    user
WHERE
    user_name = '{BOT_USERNAME}'
"""

bot_reg_dates = pd.DataFrame()
for wiki_db in bots.keys():
    result = wmf.mariadb.run(bot_reg_query.format(BOT_USERNAME=bots[wiki_db]), 
                             wiki_db)
    result['wiki_db'] = wiki_db
    bot_reg_dates = pd.concat([bot_reg_dates, result], ignore_index=False)
    
bot_reg_dates

| | bot | reg_date | wiki_db |
|---|---|---|---|
| 0 | ClueBot NG | 2010-10-20 | enwiki |
| 0 | SeroBOT | 2018-04-19 | eswiki |
| 0 | Salebot | 2006-11-10 | frwiki |
| 0 | Salebot | 2008-09-21 | ptwiki |
| 0 | Dexbot | 2012-04-20 | fawiki |
| 0 | PSS 9 | 2017-03-01 | bgwiki |
| 0 | ChenzwBot | 2008-04-10 | simplewiki |
| 0 | Рейму Хакурей | 2016-08-20 | ruwiki |
| 0 | PatrocleBot | 2022-01-15 | rowiki |


In [38]:
bot_revert_counts = pd.merge(bot_revert_counts.assign(bot=lambda df: df["wiki_db"].map(bots)), 
                             bot_reg_dates, 
                             on=['bot', 'wiki_db'], 
                             how='left')

bot_revert_counts = bot_revert_counts.astype({'year': str, 'month': str, 'day': str})
bot_revert_counts['date'] = pd.to_datetime(bot_revert_counts['year'] + '-' + bot_revert_counts['month'] + '-' + bot_revert_counts['day'])
bot_revert_counts['reg_date'] = pd.to_datetime(bot_revert_counts['reg_date'])

In [39]:
# consider dates only after the bot registration date
bot_revert_counts = bot_revert_counts.query("""date > reg_date""")

In [40]:
# a few edits made on the last day of the period (30 June 2023) were reverted the following day
# drop those revert dates, as their data is incomplete and may skew the aggregations
bot_revert_counts = bot_revert_counts.query("""date <= @pd.to_datetime('2023-06-30')""")
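
As an illustration of the two date filters above, here is a small sketch on made-up rows (the counts and dates are invented for the example; only the rowiki registration date is taken from the table above). It uses boolean masks instead of `query`, which is equivalent for this purpose.

```python
import pandas as pd

# Toy daily counts for one wiki; values are invented for illustration.
counts = pd.DataFrame({
    "wiki_db": ["rowiki", "rowiki", "rowiki"],
    "date": pd.to_datetime(["2021-12-31", "2022-06-01", "2023-07-01"]),
    "reg_date": pd.to_datetime(["2022-01-15"] * 3),
    "bot_reverts": [0, 5, 1],
})

end_date = pd.Timestamp("2023-06-30")

kept = counts[
    (counts["date"] > counts["reg_date"])   # only days after the bot registered
    & (counts["date"] <= end_date)          # drop reverts that spilled past the period
]
print(kept["date"].dt.strftime("%Y-%m-%d").tolist())
```

The first row predates the bot's registration and the last row falls after the end of the period, so only the middle row is kept.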

In [42]:
# calculate percentages
bot_revert_counts = (bot_revert_counts
               .assign(
                   bot_reverts_percent=lambda df: df["bot_reverts"] / df["all_reverts"] * 100,
                   reverted_bot_reverts_percent=lambda df: df["reverted_bot_reverts"] / df["bot_reverts"] * 100)
               .fillna(0))
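
To make the role of `.fillna(0)` concrete, the sketch below applies the same two ratios to invented daily counts. On a day with no bot reverts, the second ratio is 0/0, which pandas evaluates to `NaN`; filling with 0 turns that into a 0% share rather than a missing value.

```python
import pandas as pd

# Invented daily counts; the middle row has no bot reverts at all.
daily = pd.DataFrame({
    "all_reverts": [100, 40, 10],
    "bot_reverts": [25, 0, 5],
    "reverted_bot_reverts": [5, 0, 1],
})

daily = daily.assign(
    bot_reverts_percent=lambda df: df["bot_reverts"] / df["all_reverts"] * 100,
    reverted_bot_reverts_percent=lambda df: df["reverted_bot_reverts"] / df["bot_reverts"] * 100,
).fillna(0)  # 0/0 yields NaN, which is replaced by 0 here

print(daily.round(1))
```

Note that such zero-filled days still enter any later average of the percentage columns unless they are filtered out first.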

In [43]:
# save data
(bot_revert_counts
 .sort_values(['wiki_db', 'year', 'month', 'day'])
 .to_csv('data_outputs/anti_vandal_bot_revert_counts.tsv', sep='\t', index=False))

### reverted bot reverts

In [35]:
%%time

reverted_bot_reverts_query = """
WITH 
    base AS (
        SELECT 
            wiki_db,
            revision_id AS base_rev_id,
            event_timestamp AS base_ts,
            event_user_text AS base_user_text,
            revision_first_identity_reverting_revision_id AS base_revert_id,
            revision_seconds_to_identity_revert AS base_seconds_to_revert,
            CASE
                WHEN event_user_is_anonymous THEN 'anonymous'
                ELSE 'registered'
            END AS user_type,
            CASE
                WHEN event_user_revision_count >= 0 AND event_user_revision_count < 100 THEN '0-99'
                WHEN event_user_revision_count >= 100 AND event_user_revision_count < 500 THEN '100-499'
                WHEN event_user_revision_count >= 500 THEN '500+'
                ELSE 'n/a'
            END AS edit_bucket
        FROM 
            wmf.mediawiki_history
        WHERE 
            snapshot = '{MW_SNAPSHOT}'
            AND wiki_db IN {DBS}
            AND event_entity = 'revision'
            AND event_type = 'create'
            AND revision_is_identity_reverted
            AND page_namespace_is_content
            AND revision_seconds_to_identity_revert <= 24 * 60 * 60
            AND DATE (event_timestamp) >= DATE ('{START_DATE}')
            AND DATE (event_timestamp) <= DATE ('{END_DATE}')
            AND NOT revision_parent_id = 0
        ),
        
    bot_reverts AS (
        SELECT
            base.*,
            mwh.revision_id AS br_rev_id,
            mwh.event_timestamp AS br_timestamp,
            mwh.revision_first_identity_reverting_revision_id AS br_revert_id,
            mwh.revision_seconds_to_identity_revert AS br_seconds_to_revert,
            mwh.event_user_text AS br_user_text
        FROM
            base
        JOIN wmf.mediawiki_history mwh
             ON base.base_revert_id = mwh.revision_id
                AND base.wiki_db = mwh.wiki_db
        WHERE snapshot = '{MW_SNAPSHOT}'
            AND mwh.event_user_text IN {BOTS}
            AND mwh.revision_is_identity_reverted
    ),
    
    reverted_bot_reverts AS (
        SELECT
            br.*,
            mwh.revision_id AS rbr_rev_id,
            mwh.event_timestamp AS rbr_timestamp,
            mwh.revision_first_identity_reverting_revision_id AS rbr_revert_id,
            mwh.revision_seconds_to_identity_revert AS rbr_seconds_to_revert,
            mwh.event_user_text AS rbr_user_text            
        FROM 
            bot_reverts br
        JOIN
            wmf.mediawiki_history mwh
            ON br.br_revert_id = mwh.revision_id
            AND br.wiki_db = mwh.wiki_db
        WHERE snapshot = '{MW_SNAPSHOT}'
    )
    
SELECT
    wiki_db,
    user_type,
    edit_bucket,
    br_user_text AS bot,
    COUNT(DISTINCT br_revert_id) AS reverted_bot_reverts, 
    COUNT(DISTINCT (
            CASE 
                WHEN rbr_user_text = base_user_text THEN br_revert_id 
            END)) AS bot_reverts_by_base_editor,
    YEAR(base_ts) AS year,
    MONTH(base_ts) AS month,
    DAY(base_ts) AS day
FROM reverted_bot_reverts
GROUP BY
    wiki_db,
    user_type,
    edit_bucket,
    br_user_text,
    YEAR(base_ts),
    MONTH(base_ts),
    DAY(base_ts)
"""

reverted_bot_reverts = wmf.spark.run(reverted_bot_reverts_query.format(DBS=wmf.utils.sql_tuple(bots.keys()),
                                                           BOTS=wmf.utils.sql_tuple(set(bots.values())),
                                                           START_DATE=time_bounds[0],
                                                           END_DATE=time_bounds[1],
                                                           MW_SNAPSHOT=mw_snapshot))


CPU times: user 543 ms, sys: 121 ms, total: 665 ms
Wall time: 2min 23s


In [58]:
reverted_bot_reverts.to_csv('data_outputs/reverted_bot_reverts.tsv', sep='\t', index=False)
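
The query above follows a three-step chain: a base edit, the bot revision that reverted it, and the revision that in turn reverted the bot. The following toy sketch reproduces that chain with two pandas self-joins. The table, the user names, and the convention that `reverted_by == 0` means "never reverted" are all assumptions made for the example, not properties of the MediaWiki history dataset.

```python
import pandas as pd

# Toy revision table; 0 in reverted_by means "never reverted" (toy convention).
revs = pd.DataFrame({
    "rev_id":      [1, 2, 3, 4, 5, 6],
    "user":        ["A", "Bot", "A", "B", "Bot", "C"],
    "reverted_by": [2, 3, 0, 5, 6, 0],
})
BOT = "Bot"

# Hop 1: base edit -> the revision that reverted it, kept only if made by the bot.
hop1 = revs.merge(revs, left_on="reverted_by", right_on="rev_id",
                  suffixes=("_base", "_br"))
bot_reverts = hop1[hop1["user_br"] == BOT]

# Hop 2: bot revert -> the revision that reverted the bot.
reverters = revs[["rev_id", "user"]].rename(
    columns={"rev_id": "rev_id_rbr", "user": "user_rbr"})
rbr = bot_reverts.merge(reverters, left_on="reverted_by_br", right_on="rev_id_rbr")

reverted_bot_reverts = rbr["rev_id_rbr"].nunique()
by_base_editor = rbr.loc[rbr["user_rbr"] == rbr["user_base"], "rev_id_rbr"].nunique()
print(reverted_bot_reverts, by_base_editor)
```

In this toy data both bot reverts were reverted, but only one of them (revision 3 by user A) was made by the editor whose edit the bot had originally reverted.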

# Results

## Bot Reverts

In [73]:
def aggregate_bot_reverts(group_by: list):
    # average the daily counts and percentages in bot_revert_counts (built above) per group
    return (
        bot_revert_counts
        .drop(bot_revert_counts
              .query("""(user_type == 'registered') & (edit_bucket== 'n/a')""")
              .index)
        # keep only rows on which the bot made at least one revert
        .query("""bot_reverts != 0""")
        .groupby(group_by)
        .agg({
            'all_reverts': 'mean', 
            'bot_reverts': 'mean', 
            'reverted_bot_reverts': 'mean', 
            'day': 'count',  # number of rows in the group (reported as n_days)
            'bot_reverts_percent': 'mean', 
            'reverted_bot_reverts_percent': 'mean'})
        .reset_index()
        .round({
            "all_reverts": 0,
            "bot_reverts": 0,
            "reverted_bot_reverts": 1,
            "bot_reverts_percent": 1,
            "reverted_bot_reverts_percent": 1
        })
        .astype({'all_reverts': int, 'bot_reverts': int})
        .set_index(group_by)
        .sort_index()
        .rename({'day': 'n_days'}, axis=1)
    )
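
The aggregation averages per-row daily counts, and `'day': 'count'` counts rows rather than distinct calendar days. Because the daily table has one row per wiki, day, user type, and edit bucket, grouping at a coarser level (for example without `edit_bucket`) can give a row count larger than the number of days. The sketch below shows the difference on invented data; `n_rows` and `n_distinct_days` are names chosen for the example.

```python
import pandas as pd

# Invented daily rows: two edit buckets share the first day.
daily = pd.DataFrame({
    "wiki_db":     ["xx", "xx", "xx", "xx"],
    "user_type":   ["registered"] * 4,
    "edit_bucket": ["0-99", "100-499", "0-99", "0-99"],
    "date": pd.to_datetime(["2023-01-01", "2023-01-01", "2023-01-02", "2023-01-03"]),
    "bot_reverts": [4, 2, 6, 0],
})

active = daily[daily["bot_reverts"] != 0]   # same filter as the function above

summary = active.groupby(["wiki_db", "user_type"]).agg(
    mean_bot_reverts=("bot_reverts", "mean"),
    n_rows=("date", "count"),               # what 'day': 'count' measures
    n_distinct_days=("date", "nunique"),    # actual number of calendar days
)
print(summary)
```

Here three rows remain after filtering but they cover only two distinct days, so the row count overstates the number of days when edit buckets are collapsed.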

In [106]:
# grouped by wiki_db, bot, user_type, and edit_bucket

aggregate_bot_reverts(['wiki_db', 'bot', 'user_type', 'edit_bucket']).fillna(0).to_csv('data_outputs/anti_vandal_bot_revert_aggregates.tsv', sep='\t')

In [78]:
# grouped by wiki_db, bot, and user_type

aggregate_bot_reverts(['wiki_db', 'bot', 'user_type'])

| wiki_db | bot | user_type | all_reverts | bot_reverts | reverted_bot_reverts | n_days | bot_reverts_percent | reverted_bot_reverts_percent |
|---|---|---|---|---|---|---|---|---|
| bgwiki | PSS 9 | anon | 44 | 4 | 0.2 | 899 | 8.7 | 4.9 |
| bgwiki | PSS 9 | registered | 10 | 2 | 0.4 | 342 | 26.8 | 17.9 |
| enwiki | ClueBot NG | anon | 4069 | 266 | 23.1 | 1090 | 6.2 | 8.5 |
| enwiki | ClueBot NG | registered | 1243 | 91 | 10.5 | 1098 | 7.0 | 11.9 |
| eswiki | SeroBOT | anon | 1666 | 846 | 71.3 | 1092 | 49.7 | 8.7 |
| eswiki | SeroBOT | registered | 189 | 32 | 4.5 | 1740 | 15.1 | 22.3 |
| fawiki | Dexbot | anon | 385 | 273 | 30.5 | 347 | 69.9 | 11.1 |
| fawiki | Dexbot | registered | 64 | 16 | 0.6 | 157 | 6.9 | 3.3 |
| frwiki | Salebot | anon | 477 | 21 | 2.1 | 1064 | 4.0 | 9.8 |
| frwiki | Salebot | registered | 129 | 6 | 0.9 | 1023 | 4.5 | 14.8 |


## Reverted Bot Reverts

In [92]:
reverted_bot_reverts_agg = (
    reverted_bot_reverts
    .drop(reverted_bot_reverts
          .query("""(user_type == 'registered') & (edit_bucket == 'n/a')""")
          .index)
    .groupby(["wiki_db", "bot", "user_type", "edit_bucket"])        
    .agg({
        "reverted_bot_reverts": np.mean,
        "bot_reverts_by_base_editor": np.mean,
        "day": "count"
    })
    .assign(
        bot_reverts_by_base_editor_percent=lambda df: df["bot_reverts_by_base_editor"] / df["reverted_bot_reverts"] * 100)
    .rename({'day': 'n_days'}, axis=1)
    .round({
        'reverted_bot_reverts': 2,
        'bot_reverts_by_base_editor': 2,
        'bot_reverts_by_base_editor_percent': 2
    })
)

reverted_bot_reverts_agg.to_csv('data_outputs/reverted_bot_reverts_aggegrates.tsv', sep='\t')

In [99]:
# percentage of reverted bot reverts that were reverted by the same editor whose initial edit was reverted, grouped by wiki and user_type

round(reverted_bot_reverts_agg.query("""n_days >= 30""")[['bot_reverts_by_base_editor_percent']]
      .groupby(['wiki_db', 'user_type'])
      .mean(), 2)

| wiki_db | user_type | bot_reverts_by_base_editor_percent |
|---|---|---|
| bgwiki | anonymous | 34.08 |
| bgwiki | registered | 57.93 |
| enwiki | anonymous | 38.98 |
| enwiki | registered | 46.22 |
| eswiki | anonymous | 57.62 |
| eswiki | registered | 72.41 |
| fawiki | anonymous | 35.3 |
| frwiki | anonymous | 58.44 |
| frwiki | registered | 69.01 |
| rowiki | anonymous | 45.4 |


In [100]:
# percentage of reverted bot reverts that were reverted by the same editor whose initial edit was reverted, grouped by user type and edit bucket

round(reverted_bot_reverts_agg.query("""n_days >= 30""")[['bot_reverts_by_base_editor_percent']]
      .groupby(['user_type', 'edit_bucket'])
      .mean(), 2)

| user_type | edit_bucket | bot_reverts_by_base_editor_percent |
|---|---|---|
| anonymous | | 45.87 |
| registered | 0-99 | 57.54 |
| registered | 100-499 | 80.0 |
| registered | 500+ | 71.72 |
