# Analysis of Reversion Activity by Anti-Vandalism Bots
**Krishna Chaitanya Velaga, Data Scientist III, Wikimedia Foundation**

**Last updated on 31 July 2023**

[TASK: T341857](https://phabricator.wikimedia.org/T341857)

# Contents

1. [Overview](#Overview)
2. [Data Gathering (Bot Reverts)](#Data-Gathering-Bot-Reverts)
3. [Results (Bot Reverts)](#Results-Bot-Reverts)
4. [Data Gathering (Reverted Bot Reverts)](#Data-Gathering-Reverted-Bot-Reverts)
5. [Results (Reverted Bot Reverts)](#Results-Reverted-Bot-Reverts)

# Overview
This analysis examines the activity of automated anti-[vandalism](https://en.wikipedia.org/wiki/Wikipedia:Vandalism) bots, primarily how much of the anti-vandalism burden the bots take on in their respective communities. The findings will inform decisions made during the development of [Automoderator](https://www.mediawiki.org/wiki/Moderator_Tools/Automoderator), including but not limited to the measurement plan and baselines. The analysis uses the [MediaWiki history dataset](https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/MediaWiki_history) and covers three years (July 2020 to June 2023).

**Questions Answered**

<u>Primary</u>

For the given list of anti-vandal bots, and their respective wikis (along with user segmentation wherever possible):
- How many reverts happen per day, where the time between the edit and its revert is less than 24 hrs?
- How many bot reverts happen per day (as they tend to happen quickly)? 
    - How does this compare to the reverts taking place within 24 hrs of an edit?
    
<u>Secondary</u>

- What percentage of the reverts made by the anti-vandal bots are themselves reverted? (not to be confused with the false positive rate, FPR)
- What percentage of the reverted bot reverts are reverted by the same user/IP whose edit had initially been reverted?

**Considerations**

*(based on the inputs shared by [Sam Walton](https://phabricator.wikimedia.org/p/Samwalton9/), Product Manager for Automoderator)*

Given that the analysis will be used to inform the decisions for Automoderator development, the following considerations were taken into account:
- As bots revert quickly, only anti-vandalism activity that tracks new edits was considered, i.e. reverts made within 24 hrs of the edit.
    - Reverts made much later (say, a month after the edit) likely come from a different process than monitoring recent changes.
- Only edits made to the [content namespaces](https://en.wikipedia.org/wiki/Wikipedia:Namespace) were considered.
- New page creations were excluded as Automoderator is not expected to monitor new page creations.
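
The three considerations amount to a per-edit scope test. The following is a minimal plain-Python sketch on made-up records; the field names (`is_content_namespace`, `parent_rev_id`, `edited_at`, `reverted_at`) are hypothetical and chosen for illustration, they are not columns of the MediaWiki history dataset.

```python
from datetime import datetime, timedelta

# Reverts made later than this are treated as out of scope
REVERT_WINDOW = timedelta(hours=24)


def is_in_scope(edit):
    """Return True if a reverted edit meets all three considerations."""
    if not edit["is_content_namespace"]:
        return False  # only content namespaces are considered
    if edit["parent_rev_id"] == 0:
        return False  # no parent revision means a new page creation
    return edit["reverted_at"] - edit["edited_at"] <= REVERT_WINDOW


# Made-up example records (hypothetical field names and values)
edits = [
    {"rev_id": 1, "is_content_namespace": True, "parent_rev_id": 10,
     "edited_at": datetime(2023, 1, 1, 12, 0), "reverted_at": datetime(2023, 1, 1, 12, 5)},
    {"rev_id": 2, "is_content_namespace": True, "parent_rev_id": 11,
     "edited_at": datetime(2023, 1, 1, 12, 0), "reverted_at": datetime(2023, 2, 1, 12, 0)},
    {"rev_id": 3, "is_content_namespace": False, "parent_rev_id": 12,
     "edited_at": datetime(2023, 1, 1, 12, 0), "reverted_at": datetime(2023, 1, 1, 12, 1)},
    {"rev_id": 4, "is_content_namespace": True, "parent_rev_id": 0,
     "edited_at": datetime(2023, 1, 1, 12, 0), "reverted_at": datetime(2023, 1, 1, 12, 1)},
]

in_scope_ids = [e["rev_id"] for e in edits if is_in_scope(e)]
print(in_scope_ids)  # [1]
```

Only the first record passes: the second was reverted a month later, the third is outside the content namespaces, and the fourth is a page creation.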

# Data-Gathering-Bot-Reverts

## imports

In [56]:
import wmfdata as wmf
import pandas as pd
import numpy as np

pd.options.display.max_columns = None

import seaborn as sns
import matplotlib.pyplot as plt

import warnings

## spark_session

In [2]:
spark_session = wmf.spark.get_active_session()

if spark_session is not None:
    spark_session.stop()
else:
    print('no active session')

no active session


In [4]:
spark_session = wmf.spark.create_custom_session(
    master="yarn",
    app_name='bot-vandal-reverts',
    spark_config={
        "spark.driver.memory": "4g",
        "spark.dynamicAllocation.maxExecutors": 64,
        "spark.executor.memory": "16g",
        "spark.executor.cores": 4,
        "spark.sql.shuffle.partitions": 256,
        "spark.driver.maxResultSize": "2g"
        
    }
)

SPARK_HOME: /usr/lib/spark3
Using Hadoop client lib jars at 3.2.0, provided by Spark.
PYSPARK_PYTHON=/opt/conda-analytics/bin/python3


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/08/14 10:12:00 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).


In [5]:
spark_session

In [6]:
spark_session.sparkContext.setLogLevel("ERROR")

## run query

In [7]:
# list as per https://phabricator.wikimedia.org/T341857
bots = {
    'enwiki': 'ClueBot NG',
    'eswiki': 'SeroBOT',
    'frwiki': 'Salebot',
    'ptwiki': 'Salebot',
    'fawiki': 'Dexbot',
    'bgwiki': 'PSS 9',
    'simplewiki': 'ChenzwBot',
    'ruwiki': 'Рейму Хакурей',
    'rowiki': 'PatrocleBot'
}
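
The wiki names and bot usernames in this mapping are later interpolated into SQL `IN` clauses. As a rough illustration of that step, the sketch below uses `to_sql_tuple`, a hypothetical stand-in written for this example; it is not the `wmfdata` helper, and the exact string that helper returns is an assumption. Only a subset of the mapping is used.

```python
def to_sql_tuple(values):
    """Hypothetical stand-in: render an iterable as a SQL tuple literal."""
    # double any single quote so the value stays a valid SQL string literal
    quoted = ["'" + str(v).replace("'", "''") + "'" for v in values]
    if not quoted:
        raise ValueError("cannot build an IN clause from an empty collection")
    return "(" + ", ".join(quoted) + ")"


# Subset of the mapping above; Salebot runs on two wikis
example_bots = {"frwiki": "Salebot", "ptwiki": "Salebot", "eswiki": "SeroBOT"}

dbs_clause = to_sql_tuple(example_bots.keys())
# set() removes the duplicate username; sorted() makes the order deterministic
bots_clause = to_sql_tuple(sorted(set(example_bots.values())))

print(dbs_clause)   # ('frwiki', 'ptwiki', 'eswiki')
print(bots_clause)  # ('Salebot', 'SeroBOT')
```

Because the usernames are pooled into a single clause, a bot name is matched on every wiki in the list, not only on the wiki it is mapped to.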

In [16]:
%%time

bot_reverts_query = """
WITH 
    base AS (
        SELECT 
            wiki_db,
            revision_id,
            event_timestamp,
            revision_first_identity_reverting_revision_id,
            revision_seconds_to_identity_revert,
            CASE
                WHEN event_user_is_anonymous THEN 'anon'
                ELSE 'registered'
            END AS user_type,
            CASE
                WHEN event_user_revision_count >= 0 AND event_user_revision_count < 100 THEN '0-99'
                WHEN event_user_revision_count >= 100 AND event_user_revision_count < 500 THEN '100-499'
                WHEN event_user_revision_count >= 500 THEN '500+'
                ELSE 'n/a'
            END AS edit_bucket
        FROM 
            wmf.mediawiki_history
        WHERE 
            snapshot = '{MW_SNAPSHOT}'
            AND wiki_db IN {DBS}
            AND event_entity = 'revision'
            AND event_type = 'create'
            AND revision_is_identity_reverted
            AND page_namespace_is_content
            AND revision_seconds_to_identity_revert <= 24 * 60 * 60
            AND DATE (event_timestamp) >= DATE ('{START_DATE}')
            AND DATE (event_timestamp) <= DATE ('{END_DATE}')
            AND NOT revision_parent_id = 0
        )
            
SELECT 
    mwh.wiki_db,
    YEAR(mwh.event_timestamp) AS year,
    MONTH(mwh.event_timestamp) AS month,
    DAY(mwh.event_timestamp) AS day,
    user_type,
    edit_bucket,
    COUNT(DISTINCT mwh.revision_id) AS all_reverts,
    COUNT(DISTINCT (
            CASE 
                WHEN event_user_text IN {BOTS}
                     THEN mwh.revision_id
            END)) AS bot_reverts,
    COUNT(DISTINCT (
            CASE 
                WHEN event_user_text IN {BOTS}
                     AND mwh.revision_is_identity_reverted = True
                     THEN mwh.revision_id
            END)) AS reverted_bot_reverts
FROM 
    base
JOIN wmf.mediawiki_history mwh
     ON base.revision_first_identity_reverting_revision_id = mwh.revision_id
        AND base.wiki_db = mwh.wiki_db
WHERE 
    snapshot = '{MW_SNAPSHOT}'
GROUP BY 
    YEAR(mwh.event_timestamp),
    MONTH(mwh.event_timestamp),
    DAY(mwh.event_timestamp),
    mwh.wiki_db,
    edit_bucket,
    user_type
"""

time_bounds = ['2020-07-01', '2023-06-30'] # three years
mw_snapshot = '2023-06'

bot_revert_counts = wmf.spark.run(bot_reverts_query.format(DBS=wmf.utils.sql_tuple(bots.keys()),
                                                           BOTS=wmf.utils.sql_tuple(set(bots.values())),
                                                           START_DATE=time_bounds[0],
                                                           END_DATE=time_bounds[1],
                                                           MW_SNAPSHOT=mw_snapshot))

                                                                                

CPU times: user 652 ms, sys: 146 ms, total: 798 ms
Wall time: 2min 14s
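
The query's self-join follows each reverted edit to the revision that reverted it, then counts those reverting revisions in three ways. A minimal plain-Python sketch of that logic on made-up revisions (the IDs and usernames are illustrative, and the time and namespace filters are left out):

```python
# Made-up revisions: rev_id -> (editor, id of the first revision that reverted it, or None)
revisions = {
    101: ("203.0.113.7", 102),  # edit reverted by the bot
    102: ("ClueBot NG", 103),   # bot revert, itself reverted
    103: ("203.0.113.7", 104),  # editor restores the edit, then gets reverted again
    104: ("PatrollerA", None),  # human revert that stands
    201: ("NewUser", 202),      # edit reverted by the bot
    202: ("ClueBot NG", None),  # bot revert that stands
}
BOTS = {"ClueBot NG"}

# "base" is every reverted revision; the join follows its pointer to the reverting revision
reverting_ids = {reverter for _, reverter in revisions.values() if reverter is not None}

all_reverts = len(reverting_ids)
bot_revert_ids = {rev_id for rev_id in reverting_ids if revisions[rev_id][0] in BOTS}
bot_reverts = len(bot_revert_ids)
# a bot revert counts as reverted when it has a reverting revision of its own
reverted_bot_reverts = len({r for r in bot_revert_ids if revisions[r][1] is not None})

print(all_reverts, bot_reverts, reverted_bot_reverts)  # 4 2 1
```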


In [9]:
# although three years of data has been considered by default, a few bots have started running much later
# gathering the registration date of the bots for their respective wikis

warnings.filterwarnings('ignore')

bot_reg_query = """
SELECT
    user_name AS bot,
    DATE(user_registration) AS reg_date
FROM
    user
WHERE
    user_name = '{BOT_USERNAME}'
"""

bot_reg_dates = pd.DataFrame()
for wiki_db in bots.keys():
    result = wmf.mariadb.run(bot_reg_query.format(BOT_USERNAME=bots[wiki_db]), 
                             wiki_db)
    result['wiki_db'] = wiki_db
    bot_reg_dates = pd.concat([bot_reg_dates, result], ignore_index=False)
    
bot_reg_dates

| | bot | reg_date | wiki_db |
|---|---|---|---|
| 0 | ClueBot NG | 2010-10-20 | enwiki |
| 0 | SeroBOT | 2018-04-19 | eswiki |
| 0 | Salebot | 2006-11-10 | frwiki |
| 0 | Salebot | 2008-09-21 | ptwiki |
| 0 | Dexbot | 2012-04-20 | fawiki |
| 0 | PSS 9 | 2017-03-01 | bgwiki |
| 0 | ChenzwBot | 2008-04-10 | simplewiki |
| 0 | Рейму Хакурей | 2016-08-20 | ruwiki |
| 0 | PatrocleBot | 2022-01-15 | rowiki |


In [17]:
bot_revert_counts = pd.merge(bot_revert_counts.assign(bot=lambda df: df["wiki_db"].map(bots)), 
                             bot_reg_dates, 
                             on=['bot', 'wiki_db'], 
                             how='left')

bot_revert_counts = bot_revert_counts.astype({'year': str, 'month': str, 'day': str})
bot_revert_counts['date'] = pd.to_datetime(bot_revert_counts['year'] + '-' + bot_revert_counts['month'] + '-' + bot_revert_counts['day'])
bot_revert_counts['reg_date'] = pd.to_datetime(bot_revert_counts['reg_date'])

In [19]:
# consider dates only after the bot registration date
bot_revert_counts = bot_revert_counts.query("""date > reg_date""")

In [20]:
# a few edits made on the data end date, i.e. 30 June 2023, were reverted the next day
# drop those rows, as that day's data is incomplete and may skew the aggregations
bot_revert_counts = bot_revert_counts.query("""date <= @pd.to_datetime('2023-06-30')""")

In [63]:
# calculate percentages
bot_revert_counts = (bot_revert_counts
                     .assign(
                         bot_reverts_percent=lambda df: df["bot_reverts"] / df["all_reverts"] * 100,
                         reverted_bot_reverts_percent=lambda df: df["reverted_bot_reverts"] / df["bot_reverts"] * 100)
                     .fillna(0))
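
The two percentages are plain ratios: bot reverts as a share of all reverts, and reverted bot reverts as a share of bot reverts. On days without bot reverts the second ratio is 0/0, which pandas evaluates to NaN and `fillna(0)` then sets to 0. A minimal plain-Python sketch of the same arithmetic, where `percent` is a helper written for this example and the input counts are two rows of `bot_revert_counts` shown later in this notebook:

```python
def percent(part, whole):
    """Share of `part` in `whole` as a percentage; 0.0 when `whole` is zero."""
    return part / whole * 100 if whole else 0.0


# eswiki row: all_reverts=172, bot_reverts=4, reverted_bot_reverts=1
bot_reverts_percent = percent(4, 172)
reverted_bot_reverts_percent = percent(1, 4)

# enwiki row: all_reverts=1341, bot_reverts=0, reverted_bot_reverts=0
no_bot_day_percent = percent(0, 0)

print(round(bot_reverts_percent, 6))  # 2.325581
print(reverted_bot_reverts_percent)   # 25.0
print(no_bot_day_percent)             # 0.0
```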

In [64]:
# save data
(bot_revert_counts
 .sort_values(['wiki_db', 'year', 'month', 'day'])
 .to_csv('data_outputs/anti_vandal_bot_revert_counts.tsv', sep='\t', index=False))

# Results-Bot-Reverts

In [71]:
percent_bot_reverts = (
    bot_revert_counts
    .drop(bot_revert_counts
          .query("""(user_type == 'registered') & (edit_bucket == 'n/a')""")
          .index)
    # average only over days on which the bot made at least one revert
    .query("""bot_reverts != 0""")
    .groupby(['wiki_db', 'bot', 'user_type', 'edit_bucket'])
    .agg({'all_reverts': 'mean', 'bot_reverts': 'mean', 'reverted_bot_reverts': 'mean',
          'day': 'count', 'bot_reverts_percent': 'mean', 'reverted_bot_reverts_percent': 'mean'})
    .reset_index()
    .round(
        {
            "all_reverts": 0,
            "bot_reverts": 0,
            "reverted_bot_reverts": 1,
            "bot_reverts_percent": 1,
            "reverted_bot_reverts_percent": 1,
        }
    )
    .astype({'all_reverts': int, 'bot_reverts': int})
    .set_index(['wiki_db', 'bot', 'user_type', 'edit_bucket'])
    .sort_index()
    .rename({'day': 'n_days'}, axis=1)
)

percent_bot_reverts

| wiki_db | bot | user_type | edit_bucket | all_reverts | bot_reverts | reverted_bot_reverts | n_days | bot_reverts_percent | reverted_bot_reverts_percent |
|---|---|---|---|---|---|---|---|---|---|
| bgwiki | PSS 9 | anon | | 44 | 4 | 0.2 | 899 | 8.7 | 4.9 |
| bgwiki | PSS 9 | registered | 0-99 | 10 | 2 | 0.4 | 328 | 25.3 | 17.0 |
| bgwiki | PSS 9 | registered | 100-499 | 3 | 1 | 0.3 | 14 | 57.9 | 28.6 |
| bgwiki | PSS 9 | registered | 500+ | 15 | 7 | 2.5 | 2 | 38.8 | 65.4 |
| enwiki | ClueBot NG | anon | | 4069 | 266 | 23.1 | 1090 | 6.2 | 8.5 |
| enwiki | ClueBot NG | registered | 0-99 | 1249 | 91 | 10.6 | 1089 | 7.1 | 11.5 |
| enwiki | ClueBot NG | registered | 100-499 | 296 | 1 | 0.7 | 7 | 0.4 | 71.4 |
| enwiki | ClueBot NG | registered | 500+ | 1165 | 1 | 0.0 | 3 | 0.1 | 0.0 |
| eswiki | SeroBOT | anon | | 1666 | 846 | 71.3 | 1092 | 49.7 | 8.7 |
| eswiki | SeroBOT | registered | 0-99 | 221 | 50 | 6.7 | 1092 | 21.9 | 13.2 |
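
When reading the table above, note that each percentage is the mean of the daily percentages over days with at least one bot revert, so it generally differs from dividing the mean `bot_reverts` by the mean `all_reverts`. A minimal plain-Python sketch on made-up daily counts:

```python
from statistics import mean

# Made-up daily counts for one wiki and user segment
daily = [
    {"all_reverts": 100, "bot_reverts": 10},
    {"all_reverts": 10, "bot_reverts": 5},
    {"all_reverts": 50, "bot_reverts": 0},  # no bot revert on this day
]

# mirrors the bot_reverts != 0 filter: days without bot reverts are left out
active_days = [d for d in daily if d["bot_reverts"] != 0]
n_days = len(active_days)

# what the table reports: the average of the per-day percentages
mean_of_daily_percent = mean(
    d["bot_reverts"] / d["all_reverts"] * 100 for d in active_days
)

# alternative: pool the counts first, then take one percentage
pooled_percent = (
    sum(d["bot_reverts"] for d in active_days)
    / sum(d["all_reverts"] for d in active_days)
    * 100
)

print(n_days)                           # 2
print(round(mean_of_daily_percent, 1))  # 30.0
print(round(pooled_percent, 1))         # 13.6
```

The mean of daily percentages gives every day equal weight, so low-volume days can move it noticeably; `n_days` indicates how many days each average rests on.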


In [72]:
percent_bot_reverts.fillna(0).to_csv('data_outputs/anti_vandal_bot_revert_percentages.tsv', sep='\t')

# Data-Gathering-Reverted-Bot-Reverts

In [98]:
%%time

reverted_bot_reverts_query = """
WITH 
    base AS (
        SELECT 
            wiki_db,
            revision_id AS base_rev_id,
            event_timestamp AS base_ts,
            event_user_text AS base_user_text,
            revision_first_identity_reverting_revision_id AS base_revert_id,
            revision_seconds_to_identity_revert AS base_seconds_to_revert,
            -- same labels as in bot_reverts_query, so a later merge on user_type can match
            CASE
                WHEN event_user_is_anonymous THEN 'anon'
                ELSE 'registered'
            END AS user_type,
            CASE
                WHEN event_user_revision_count >= 0 AND event_user_revision_count < 100 THEN '0-99'
                WHEN event_user_revision_count >= 100 AND event_user_revision_count < 500 THEN '100-499'
                WHEN event_user_revision_count >= 500 THEN '500+'
                ELSE 'n/a'
            END AS edit_bucket
        FROM 
            wmf.mediawiki_history
        WHERE 
            snapshot = '{MW_SNAPSHOT}'
            AND wiki_db IN {DBS}
            AND event_entity = 'revision'
            AND event_type = 'create'
            AND revision_is_identity_reverted
            AND page_namespace_is_content
            AND revision_seconds_to_identity_revert <= 24 * 60 * 60
            AND DATE (event_timestamp) >= DATE ('{START_DATE}')
            AND DATE (event_timestamp) <= DATE ('{END_DATE}')
            AND NOT revision_parent_id = 0
        ),
        
    bot_reverts AS (
        SELECT
            base.*,
            mwh.revision_id AS br_rev_id,
            mwh.event_timestamp AS br_timestamp,
            mwh.revision_first_identity_reverting_revision_id AS br_revert_id,
            mwh.revision_seconds_to_identity_revert AS br_seconds_to_revert,
            mwh.event_user_text AS br_user_text
        FROM
            base
        JOIN wmf.mediawiki_history mwh
             ON base.base_revert_id = mwh.revision_id
                AND base.wiki_db = mwh.wiki_db
        WHERE snapshot = '{MW_SNAPSHOT}'
            AND mwh.event_user_text IN {BOTS}
            AND mwh.revision_is_identity_reverted
    ),
    
    reverted_bot_reverts AS (
        SELECT
            br.*,
            mwh.revision_id AS rbr_rev_id,
            mwh.event_timestamp AS rbr_timestamp,
            mwh.revision_first_identity_reverting_revision_id AS rbr_revert_id,
            mwh.revision_seconds_to_identity_revert AS rbr_seconds_to_revert,
            mwh.event_user_text AS rbr_user_text            
        FROM 
            bot_reverts br
        JOIN
            wmf.mediawiki_history mwh
            ON br.br_revert_id = mwh.revision_id
            AND br.wiki_db = mwh.wiki_db
        WHERE snapshot = '{MW_SNAPSHOT}'
    )
    
SELECT
    wiki_db,
    user_type,
    edit_bucket,
    br_user_text AS bot,
    COUNT(DISTINCT br_revert_id) AS reverted_bot_edits,
    COUNT(DISTINCT (
            CASE 
                WHEN rbr_user_text = base_user_text THEN br_revert_id 
            END)) AS bot_reverts_by_base_editor,
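    -- dated by the bot revert, the same date basis as the daily counts in bot_reverts_query
    YEAR(br_timestamp) AS year,
    MONTH(br_timestamp) AS month,
    DAY(br_timestamp) AS day
FROM reverted_bot_reverts
GROUP BY
    wiki_db,
    user_type,
    edit_bucket,
    br_user_text,
    YEAR(br_timestamp),
    MONTH(br_timestamp),
    DAY(br_timestamp)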
"""

time_bounds = ['2020-07-01', '2023-06-30'] # three years
mw_snapshot = '2023-06'

reverted_bot_reverts = wmf.spark.run(reverted_bot_reverts_query.format(DBS=wmf.utils.sql_tuple(bots.keys()),
                                                           BOTS=wmf.utils.sql_tuple(set(bots.values())),
                                                           START_DATE=time_bounds[0],
                                                           END_DATE=time_bounds[1],
                                                           MW_SNAPSHOT=mw_snapshot))



CPU times: user 412 ms, sys: 57.6 ms, total: 469 ms
Wall time: 1min 49s
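
The query walks a three-step chain: the original edit, the bot revert of that edit, and the revision that reverted the bot. It then checks whether the last step was made by the original editor. A minimal plain-Python sketch on made-up chains (usernames and IDs are illustrative):

```python
# Made-up chains: base edit -> bot revert -> revision that reverted the bot revert
chains = [
    {"base_user": "203.0.113.7", "bot": "SeroBOT", "rbr_rev_id": 11, "rbr_user": "203.0.113.7"},
    {"base_user": "203.0.113.7", "bot": "SeroBOT", "rbr_rev_id": 12, "rbr_user": "PatrollerA"},
    {"base_user": "NewUser", "bot": "SeroBOT", "rbr_rev_id": 13, "rbr_user": "NewUser"},
]

# distinct revisions that reverted a bot revert
reverted_bot_edits = len({c["rbr_rev_id"] for c in chains})

# of those, the ones made by the editor whose edit the bot had reverted
bot_reverts_by_base_editor = len(
    {c["rbr_rev_id"] for c in chains if c["rbr_user"] == c["base_user"]}
)

share_by_base_editor = bot_reverts_by_base_editor / reverted_bot_edits * 100

print(reverted_bot_edits, bot_reverts_by_base_editor)  # 3 2
print(round(share_by_base_editor, 1))                  # 66.7
```

The comparison is on the username or IP string, so an anonymous editor who returns from a different IP address would not be counted as the same editor.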


                                                                                

In [80]:
reverted_bot_reverts.head()

| | wiki_db | user_type | edit_bucket | br_user_text | reverted_bot_edits | bot_reverts_by_base_editor | year | month | day |
|---|---|---|---|---|---|---|---|---|---|
| 0 | enwiki | anon_user | | ClueBot NG | 39 | 16 | 2021 | 1 | 2 |
| 1 | eswiki | anon_user | | SeroBOT | 95 | 74 | 2021 | 12 | 4 |
| 2 | fawiki | anon_user | | Dexbot | 29 | 10 | 2023 | 3 | 16 |
| 3 | eswiki | anon_user | | SeroBOT | 88 | 56 | 2021 | 6 | 6 |
| 4 | frwiki | reg_user | 0-99 | Salebot | 1 | 0 | 2020 | 12 | 8 |


In [84]:
reverted_bot_reverts.rename({'br_user_text': 'bot'}, axis=1, inplace=True)

In [83]:
bot_revert_counts.head()

| | wiki_db | year | month | day | user_type | edit_bucket | all_reverts | bot_reverts | reverted_bot_reverts | bot | reg_date | date | bot_reverts_percent | reverted_bot_reverts_percent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | enwiki | 2022 | 2 | 7 | registered | 500+ | 1341 | 0 | 0 | ClueBot NG | 2010-10-20 | 2022-02-07 | 0.0 | 0.0 |
| 1 | eswiki | 2023 | 6 | 14 | registered | 500+ | 172 | 4 | 1 | SeroBOT | 2018-04-19 | 2023-06-14 | 2.325581 | 25.0 |
| 2 | enwiki | 2022 | 6 | 19 | anon | | 2996 | 34 | 2 | ClueBot NG | 2010-10-20 | 2022-06-19 | 1.134846 | 5.882353 |
| 3 | enwiki | 2021 | 5 | 28 | anon | | 4482 | 402 | 37 | ClueBot NG | 2010-10-20 | 2021-05-28 | 8.96921 | 9.20398 |
| 4 | enwiki | 2021 | 7 | 16 | registered | 0-99 | 1285 | 81 | 7 | ClueBot NG | 2010-10-20 | 2021-07-16 | 6.303502 | 8.641975 |


In [86]:
bot_revert_counts.dtypes

wiki_db                                 object
year                                    object
month                                   object
day                                     object
user_type                               object
edit_bucket                             object
all_reverts                              int64
bot_reverts                              int64
reverted_bot_reverts                     int64
bot                                     object
reg_date                        datetime64[ns]
date                            datetime64[ns]
bot_reverts_percent                    float64
reverted_bot_reverts_percent           float64
dtype: object

In [100]:
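# cast the date parts back to int so they can be matched against reverted_bot_reverts
bot_revert_counts = bot_revert_counts.astype({'year': int, 'month': int, 'day': int})
# left merge: rows without a matching reverted bot revert get NaN
merge_keys = ['year', 'month', 'day', 'wiki_db', 'bot', 'user_type', 'edit_bucket']
t = pd.merge(bot_revert_counts, reverted_bot_reverts.drop('reverted_bot_edits', axis=1), on=merge_keys, how='left')
t.head()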

| | wiki_db | year | month | day | user_type | edit_bucket | all_reverts | bot_reverts | reverted_bot_reverts | bot | reg_date | date | bot_reverts_percent | reverted_bot_reverts_percent | bot_reverts_by_base_editor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | enwiki | 2022 | 2 | 7 | registered | 500+ | 1341 | 0 | 0 | ClueBot NG | 2010-10-20 | 2022-02-07 | 0.0 | 0.0 | |
| 1 | eswiki | 2023 | 6 | 14 | registered | 500+ | 172 | 4 | 1 | SeroBOT | 2018-04-19 | 2023-06-14 | 2.325581 | 25.0 | 0.0 |
| 2 | enwiki | 2022 | 6 | 19 | anon | | 2996 | 34 | 2 | ClueBot NG | 2010-10-20 | 2022-06-19 | 1.134846 | 5.882353 | |
| 3 | enwiki | 2021 | 5 | 28 | anon | | 4482 | 402 | 37 | ClueBot NG | 2010-10-20 | 2021-05-28 | 8.96921 | 9.20398 | |
| 4 | enwiki | 2021 | 7 | 16 | registered | 0-99 | 1285 | 81 | 7 | ClueBot NG | 2010-10-20 | 2021-07-16 | 6.303502 | 8.641975 | 2.0 |
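
Empty `bot_reverts_by_base_editor` cells mark rows that found no partner in the left merge. That can mean the segment had no reverted bot reverts that day, or that a key is spelled differently in the two tables. One way to tell the cases apart is to compare the distinct key values on both sides before merging. A minimal plain-Python sketch, with a hypothetical helper `unmatched_labels` and made-up rows:

```python
def unmatched_labels(left_rows, right_rows, key):
    """Report values of `key` that appear on only one side of a planned merge."""
    left = {row[key] for row in left_rows}
    right = {row[key] for row in right_rows}
    return {"left_only": left - right, "right_only": right - left}


# Made-up rows: the two sides label anonymous editors differently
left_rows = [{"user_type": "anon"}, {"user_type": "registered"}]
right_rows = [{"user_type": "anonymous"}, {"user_type": "registered"}]

report = unmatched_labels(left_rows, right_rows, "user_type")
print(report)  # {'left_only': {'anon'}, 'right_only': {'anonymous'}}
```

In pandas, passing `indicator=True` to `pd.merge` adds a `_merge` column that flags each row as `left_only`, `right_only` or `both`, which gives the same diagnosis row by row.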


# Results-Reverted-Bot-Reverts

In [None]:
reverted_bot_reverts

In [78]:
# labels and column names follow reverted_bot_reverts_query ('registered', 'bot')
# after the first aggregation, "day" holds the number of days with data for the group
(
    reverted_bot_reverts.drop(
        reverted_bot_reverts.query(
            """(user_type == 'registered') & (edit_bucket == 'n/a')"""
        ).index
    )
    .groupby(["wiki_db", "bot", "user_type", "edit_bucket"])
    .agg(
        {
            "reverted_bot_edits": "mean",
            "bot_reverts_by_base_editor": "mean",
            "day": "count",
        }
    )
    .assign(
        bot_reverts_by_base_editor_percent=lambda df: df["bot_reverts_by_base_editor"]
        / df["reverted_bot_edits"]
        * 100,
    )
).query("""day >= 50""").groupby(["user_type", "edit_bucket"]).agg(
    {
        "bot_reverts_by_base_editor_percent": "mean"
    }
)

| user_type | edit_bucket | bot_reverts_by_base_editor_percent | non_base_editor_time_to_revert | base_editor_time_to_revert |
|---|---|---|---|---|
| anon_user | | 45.868908 | 604540.734192 | 150078.081537 |
| reg_user | 0-99 | 57.533741 | 652803.368723 | 138341.568119 |
| reg_user | 100-499 | 80.0 | 7301.75 | 344117.306788 |
| reg_user | 500+ | 71.721311 | 61296.125 | 700710.646231 |
