# Baseline: Accuracy of Patroller Reverts (Probable Vandalism)

**Last updated on 15 February 2024**

[TASK: T348862](https://phabricator.wikimedia.org/T348862)

# Contents
1. [Summary](#Summary)
2. [Data Gathering](#Data-Gathering)
3. [Analysis](#Analysis)

## Summary

This analysis establishes a baseline for the proportion of reverts of probable vandalism made by bots, human patrollers, and tool-assisted human patrollers (using tools such as [AWB](https://en.wikipedia.org/wiki/Wikipedia:AutoWikiBrowser) and [Huggle](https://en.wikipedia.org/wiki/Wikipedia:Huggle)). The baseline will serve as a reference when evaluating the impact of [Automoderator](https://www.mediawiki.org/wiki/Moderator_Tools/Automoderator). The [operational definitions](https://phabricator.wikimedia.org/T349083) within the scope of Automoderator are as follows:

<u>probable vandalism:</u>
- edit belongs to the content namespace
- edit was reverted within 12 hours
- user is anonymous OR if registered
    - user edit count is less than 15 edits
    - time since user's first edit is less than 48 hours
- revert was made by a different editor

<u>patroller:</u>
- users belonging to a user group that has any of the following permissions on the respective wiki: rollback, review, patrol, block, delete, deleterevision
- OR registered users who have made 150+ content namespace edits and 10+ content namespace reverts<br>(note: for this analysis, we have considered registered users with 150+ edits)
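
The probable-vandalism definition above can be read as a single predicate. The following is a minimal sketch, not the code used in this analysis: the `Edit` record and its field names are hypothetical, and it assumes that both sub-conditions for registered users (edit count and account age) must hold together, which the definition does not state explicitly.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Edit:
    """Hypothetical record; field names are illustrative, not a real schema."""
    in_content_namespace: bool
    editor: str
    editor_is_anonymous: bool
    editor_edit_count: int
    editor_first_edit: datetime
    timestamp: datetime
    reverted_at: Optional[datetime] = None
    reverted_by: Optional[str] = None


def is_probable_vandalism(edit: Edit) -> bool:
    # the edit must be in the content namespace and must have been reverted
    if not edit.in_content_namespace or edit.reverted_at is None:
        return False

    # the revert must happen within 12 hours and be made by a different editor
    delay = edit.reverted_at - edit.timestamp
    if not (timedelta(0) <= delay <= timedelta(hours=12)):
        return False
    if edit.reverted_by == edit.editor:
        return False

    # anonymous editors qualify directly
    if edit.editor_is_anonymous:
        return True

    # assumption: registered editors must satisfy BOTH sub-conditions
    account_age = edit.timestamp - edit.editor_first_edit
    return edit.editor_edit_count < 15 and account_age < timedelta(hours=48)


now = datetime(2023, 6, 1, 12, 0)
newcomer_edit = Edit(True, "NewUser", False, 3, now - timedelta(hours=5),
                     now, now + timedelta(hours=1), "Patroller")
veteran_edit = Edit(True, "OldUser", False, 5000, now - timedelta(days=900),
                    now, now + timedelta(hours=1), "Patroller")

print(is_probable_vandalism(newcomer_edit))  # True
print(is_probable_vandalism(veteran_edit))   # False
```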

In [20]:
pr_centered('Proportion of Bots, Human Patrollers and Tool-assisted Human Patrollers', True)
pr_centered('to counter potential vandalism')
display_h({
    '': group_reverts_by_assistance(non_bot_reverts)
})

| wiki_db | % Tool-Assisted Humans | % Humans | # Reverts |
|---|---|---|---|
| enwiki | 51.21 | 48.79 | 1688736 |
| eswiki | 2.29 | 97.71 | 345225 |
| itwiki | 4.92 | 95.08 | 231699 |
| frwiki | 3.14 | 96.86 | 202110 |
| ruwiki | 2.23 | 97.77 | 192436 |
| dewiki | 2.96 | 97.04 | 178573 |
| jawiki | 1.85 | 98.15 | 69556 |
| fawiki | 21.49 | 78.51 | 64834 |
| zhwiki | 2.12 | 97.88 | 61800 |
| idwiki | 0.86 | 99.14 | 39823 |


# Data-Gathering

## Imports

In [1]:
import pandas as pd
import numpy as np
import wmfdata as wmf
import great_tables as gt

pd.options.display.max_columns = None
from IPython.display import display_html
from IPython.display import display, HTML
from IPython.display import clear_output

import os
import requests
import warnings

In [2]:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

In [3]:
os.environ.pop('HTTP_PROXY', None)
os.environ.pop('HTTPS_PROXY', None)
os.environ.pop('http_proxy', None)
os.environ.pop('https_proxy', None)

'http://webproxy.eqiad.wmnet:8080'

## spark_session

In [4]:
spark_session = wmf.spark.get_active_session()

if spark_session is not None:
    spark_session.stop()
else:
    print('no active session')

no active session


In [5]:
spark_session = wmf.spark.create_custom_session(
    master="yarn",
    app_name='counter-vandalism-assistance',
    spark_config={
        "spark.driver.memory": "6g",
        "spark.dynamicAllocation.maxExecutors": 64,
        "spark.executor.memory": "16g",
        "spark.executor.cores": 4,
        "spark.sql.shuffle.partitions": 256,
        "spark.driver.maxResultSize": "2g"
        
    }
)

clear_output()

spark_session.sparkContext.setLogLevel("ERROR")
spark_session

## functions

In [6]:
# prints a string at center of the output, bold if needed
def pr_centered(content, bold=False):
    if bold:
        content = f"<b>{content}</b>"
    
    centered_html = f"<div style='text-align:center'>{content}</div>"
    
    display(HTML(centered_html))


# display dataframes horizontally with title for each
def display_h(frames, space=100):
    html = ""
    
    for key, frame in frames.items():
        html_df = f'<div>{key} {frame._repr_html_()}</div>'
        html += html_df
        
    html = f"""
    <div style="display:flex; justify-content: space-evenly;">
    {html}
    </div>"""
    
    display_html(html, raw=True)

## query

In [7]:
mwh_snapshot = '2024-01'

lang_list = ['en', 'es', 'ja', 'de', 'fr', 'ru', 'zh', 'it', 'pt', 'fa', 'id']
wikis_list = [f'{lang}wiki' for lang in lang_list]
wikis_sql = wmf.utils.sql_tuple(wikis_list)

api_endpoint = 'https://api-ro.discovery.wmnet/w/api.php'

## query: user rights info

In [8]:
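# return the names of the user groups that hold at least one of the given rights
def extract_ugroups(group_rights_info, rights):
    groups = []

    for user_right in group_rights_info:
        # exact match on right names, so e.g. 'autopatrol' does not count as 'patrol'
        if any(right in user_right['rights'] for right in rights):
            groups.append(user_right['name'])

    return groups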
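
To illustrate what `extract_ugroups` returns, here is a small sketch run against a hand-written payload. The payload only mimics the shape of the `usergroups` list returned by `siteinfo` with `formatversion=2` (a list of objects with `name` and `rights`); the group names and rights in it are made up for the example and do not describe any particular wiki.

```python
def extract_ugroups(group_rights_info, rights):
    groups = []
    for user_right in group_rights_info:
        if any(right in user_right['rights'] for right in rights):
            groups.append(user_right['name'])
    return groups


# hypothetical sample, shaped like response['query']['usergroups']
sample_usergroups = [
    {'name': '*', 'rights': ['read', 'edit']},
    {'name': 'rollbacker', 'rights': ['rollback']},
    {'name': 'sysop', 'rights': ['block', 'delete', 'rollback', 'patrol']},
    {'name': 'bot', 'rights': ['bot', 'autopatrol']},
]

rights = ['rollback', 'review', 'patrol', 'block', 'delete', 'deleterevision']

patroller_groups = extract_ugroups(sample_usergroups, rights)
print(patroller_groups)  # ['rollbacker', 'sysop']
```

Because membership is tested against the list of rights, the match is exact: the sample `bot` group holds `autopatrol` but not `patrol`, so it is not selected.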

In [9]:
warnings.filterwarnings('ignore')

rights = ['rollback', 'review', 'patrol', 'block', 'delete', 'deleterevision']

params = {
    "action": "query",
    "format": "json",
    "meta": "siteinfo",
    "formatversion": "2",
    "siprop": "usergroups"
}

all_ugroups = {}

for lang in lang_list:
    
    response = (
        requests
        .get(
            api_endpoint, 
            headers={'Host': f'{lang}.wikipedia.org'}, 
            params=params, 
            verify=False)
        .json()
    )
    
    ugroups = extract_ugroups(response['query']['usergroups'], rights)
    all_ugroups[lang] = ugroups

## query: reverts

In [10]:
# True when the user has no groups, or only groups treated here as carrying no extended rights
def check_user_groups(groups):

    allowed_groups = ['autoconfirmed', 'confirmed', 'ipblock-exempt']
    return len(groups) == 0 or all(group in allowed_groups for group in groups)

check_user_groups_udf = udf(check_user_groups, BooleanType())
spark_session.udf.register("check_user_groups", check_user_groups_udf)

<function __main__.check_user_groups(groups)>
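
The behaviour of `check_user_groups` can be checked in plain Python, outside Spark. The sketch below repeats the function and applies it to a few made-up group lists. The `check_user_groups_null_safe` wrapper is hypothetical and not part of the notebook; it only shows one way to guard against a `None` input, in case a null array ever reaches the UDF (whether that can happen for `event_user_groups` is not established here).

```python
def check_user_groups(groups):
    allowed_groups = ['autoconfirmed', 'confirmed', 'ipblock-exempt']
    return len(groups) == 0 or all(group in allowed_groups for group in groups)


def check_user_groups_null_safe(groups):
    # hypothetical helper: treat a missing array like an empty one
    return check_user_groups(groups or [])


examples = [
    [],                              # no groups at all
    ['autoconfirmed'],               # only an "allowed" group
    ['autoconfirmed', 'rollbacker'], # one group outside the allowed list
    ['sysop'],
]

for groups in examples:
    print(groups, check_user_groups(groups))
# [] True
# ['autoconfirmed'] True
# ['autoconfirmed', 'rollbacker'] False
# ['sysop'] False
```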

In [11]:
# manually gathered list for tools available on Special:Tags for each wiki
tags = {
    'tools': ['AWB', 'twinkle', 'huggle', 'WPCleaner', 'RedWarn', 'OAuth CID: 1805', 'STiki',
              'AntiVandal script', 'Ultraviolet', 'OAuth CID: 1352', 'OAuth CID: 6365',
              'WikiLoop Battlefield', 'OAuth CID: 85', 'Deputy', 'OAuth CID: 1503',
              'OAuth CID: 1261', 'OAuth CID: 1887', 'OAuth CID: 1413', 'OAuth CID: 1188',
              'DevScript', 'fast-buttons', 'diff-tools', 'ساخته شده توسط Tofawiki'],
    'paws': ['OAuth CID: 429', 'OAuth CID: 3711', 'OAuth CID: 4664', 'OAuth CID: 1841']
}
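
The tag lists are later turned into SQL conditions by joining one `ARRAY_CONTAINS(...)` term per tag with `OR`. A minimal sketch of that step follows; the helper name `array_contains_any` is hypothetical. Unlike the inline joins in the query cell, it wraps the result in parentheses and falls back to `FALSE` for an empty list, which is an added precaution (an empty join would otherwise produce an empty string). It does not escape quotes, so it assumes the values contain no single quotes.

```python
def array_contains_any(column, values):
    """Build '(ARRAY_CONTAINS(col, 'a') OR ARRAY_CONTAINS(col, 'b') ...)'."""
    if not values:
        # nothing to match: a constant keeps the surrounding SQL valid
        return "FALSE"
    terms = [f"ARRAY_CONTAINS({column}, '{value}')" for value in values]
    return "(" + " OR ".join(terms) + ")"


condition = array_contains_any('revision_tags', ['AWB', 'huggle'])
print(condition)
# (ARRAY_CONTAINS(revision_tags, 'AWB') OR ARRAY_CONTAINS(revision_tags, 'huggle'))

print(array_contains_any('event_user_groups', []))
# FALSE
```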

In [None]:
%%time

query = """
WITH 
    base AS (
        SELECT
            wiki_db,
            revision_id AS rev_id,
            event_timestamp AS rev_ts,
            event_user_text AS user_name,
            revision_first_identity_reverting_revision_id AS rv_rev_id
        FROM 
            wmf.mediawiki_history
        WHERE 
            snapshot = '{MWH_SNAPSHOT}'
            AND wiki_db = '{DB}'
            AND event_entity = 'revision'
            AND event_type = 'create'
            AND page_namespace_is_content
            AND 
                (
                    event_user_is_anonymous 
                    OR event_user_revision_count <= 15
                )
            AND SIZE(event_user_is_bot_by_historical) = 0
            AND revision_is_identity_reverted
            AND revision_seconds_to_identity_revert <= 12*60*60
            AND revision_seconds_to_identity_revert >= 0
            AND YEAR(event_timestamp) = 2023
    ),
    
    rv_info AS (
        SELECT
            base.*,
            mwh.event_user_text AS rv_user_name,
            mwh.event_user_groups AS rv_user_groups,
            CASE
                WHEN ARRAY_CONTAINS(event_user_groups, 'sysop') THEN TRUE
                ELSE FALSE
            END AS is_sysop,
            CHECK_USER_GROUPS(event_user_groups) AS rv_user_has_no_rights,
            CASE
                WHEN {TOOL_TAGS} THEN 'tool_assisted'
                WHEN {PAWS_TAGS} THEN 'paws_assisted'
                WHEN SIZE(mwh.event_user_is_bot_by) > 0 THEN 'bot_revert'
                ELSE 'no_assistance'
            END AS assistance_type
        FROM 
            base
        JOIN
            wmf.mediawiki_history mwh
            ON base.wiki_db = mwh.wiki_db 
                AND base.rv_rev_id = mwh.revision_id
        WHERE
            snapshot = '{MWH_SNAPSHOT}'
            AND NOT base.user_name =  event_user_text
            AND NOT event_user_is_anonymous
            AND 
                (
                    mwh.event_user_revision_count >= 150
                    OR {USER_GROUPS_CONDITIONS}
                )
        )

SELECT
    wiki_db,
    rev_id,
    rv_rev_id,
    is_sysop,
    rv_user_has_no_rights,
    assistance_type
FROM
    rv_info
"""

reverts = pd.DataFrame()

for lang in all_ugroups.keys():
    
    reverts_by_wiki = wmf.spark.run(
        query
        .format(
            MWH_SNAPSHOT=mwh_snapshot, 
            DB=f'{lang}wiki', 
            USER_GROUPS_CONDITIONS= " OR ".join([f"ARRAY_CONTAINS(event_user_groups, '{value}')" for value in all_ugroups[lang]]),
            TOOL_TAGS=" OR ".join([f"ARRAY_CONTAINS(revision_tags, '{value}')" for value in tags['tools']]),
            PAWS_TAGS=" OR ".join([f"ARRAY_CONTAINS(revision_tags, '{value}')" for value in tags['paws']])
        )
    )
    
    reverts = pd.concat([reverts, reverts_by_wiki], ignore_index=True)
    
reverts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3588443 entries, 0 to 3588442
Data columns (total 6 columns):
 #   Column                 Dtype 
---  ------                 ----- 
 0   wiki_db                object
 1   rev_id                 int64 
 2   rv_rev_id              int64 
 3   is_sysop               bool  
 4   rv_user_has_no_rights  bool  
 5   assistance_type        object
dtypes: bool(2), int64(2), object(2)
memory usage: 116.4+ MB
CPU times: user 5.54 ms, sys: 195 µs, total: 5.73 ms
Wall time: 5.22 ms


In [15]:
reverts.assistance_type.value_counts()

assistance_type
no_assistance    2179639
tool_assisted     930545
bot_revert        478239
paws_assisted         20
Name: count, dtype: int64

In [16]:
# non-bot reverts
# ignore paws_assisted as the frequency is insignificant
non_bot_reverts = reverts.query("""assistance_type != ['bot_revert', 'paws_assisted']""")
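
In `DataFrame.query`, comparing a column to a list with `!=` behaves like `not in`. The sketch below uses a small made-up frame to show that the filter above is equivalent to a boolean mask built with `isin`.

```python
import pandas as pd

# toy data; the real `reverts` frame has more columns
toy = pd.DataFrame({
    'rev_id': [1, 2, 3, 4, 5],
    'assistance_type': ['no_assistance', 'tool_assisted', 'bot_revert',
                        'paws_assisted', 'no_assistance'],
})

excluded = ['bot_revert', 'paws_assisted']

via_query = toy.query("assistance_type != ['bot_revert', 'paws_assisted']")
via_isin = toy[~toy['assistance_type'].isin(excluded)]

print(via_query['rev_id'].tolist())  # [1, 2, 5]
print(via_query.equals(via_isin))    # True
```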

# Analysis

In [17]:
def group_reverts_by_assistance(df):
    
    grouped = (
        df
        .groupby(['wiki_db', 'assistance_type'])['rev_id']
        .nunique()
        .reset_index()
        .pivot(index='wiki_db', columns='assistance_type', values='rev_id')
    )
    grouped.columns.name = None
    grouped['# Reverts'] = grouped.sum(axis=1)
    grouped = grouped.fillna(0).astype(int)

    grouped = (
        grouped
        .assign(
            **{
                '% Tool-Assisted Humans': lambda x: round(x.get('tool_assisted', 0) / x['# Reverts'] * 100, 2),
                '% Humans': lambda x: round(x.get('no_assistance', 0) / x['# Reverts'] * 100, 2)
            }
        )
    )

    return grouped[['% Tool-Assisted Humans', '% Humans', '# Reverts']].sort_values('# Reverts', ascending=False)
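
To make the output of `group_reverts_by_assistance` easier to follow, the sketch below repeats the function and runs it on a small made-up frame. The toy data is illustrative only: one wiki has both tool-assisted and unassisted reverts, the other has only unassisted ones, which shows how a missing combination is filled with 0.

```python
import pandas as pd


def group_reverts_by_assistance(df):
    # count distinct reverted revisions per wiki and assistance type
    grouped = (
        df
        .groupby(['wiki_db', 'assistance_type'])['rev_id']
        .nunique()
        .reset_index()
        .pivot(index='wiki_db', columns='assistance_type', values='rev_id')
    )
    grouped.columns.name = None
    grouped['# Reverts'] = grouped.sum(axis=1)
    grouped = grouped.fillna(0).astype(int)

    # express each assistance type as a share of the wiki's total
    grouped = grouped.assign(**{
        '% Tool-Assisted Humans': lambda x: round(x.get('tool_assisted', 0) / x['# Reverts'] * 100, 2),
        '% Humans': lambda x: round(x.get('no_assistance', 0) / x['# Reverts'] * 100, 2),
    })

    return (
        grouped[['% Tool-Assisted Humans', '% Humans', '# Reverts']]
        .sort_values('# Reverts', ascending=False)
    )


# toy data: 4 reverts on enwiki (3 tool-assisted), 2 unassisted on eswiki
toy = pd.DataFrame({
    'wiki_db': ['enwiki'] * 4 + ['eswiki'] * 2,
    'rev_id': [1, 2, 3, 4, 5, 6],
    'assistance_type': ['tool_assisted', 'tool_assisted', 'tool_assisted',
                        'no_assistance', 'no_assistance', 'no_assistance'],
})

result = group_reverts_by_assistance(toy)
print(result)
```

On this toy input, enwiki comes out as 75% tool-assisted and 25% unassisted over 4 reverts, and eswiki as 0% and 100% over 2 reverts.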

In [18]:
pr_centered('Proportion of Bots, Human Patrollers and Tool-assisted Human Patrollers', True)
pr_centered('to counter potential vandalism')
# non_bot_reverts already excludes bot reverts; the assistance_type filter is kept as a safeguard
display_h({
    'All patrollers': group_reverts_by_assistance(non_bot_reverts),
    'Patrollers with sysop rights': group_reverts_by_assistance(non_bot_reverts.query("""(is_sysop == True) & (assistance_type != 'bot_revert')"""))
})
display_h({
    'Patrollers with extended rights (excl. sysop)': group_reverts_by_assistance(non_bot_reverts.query("""(is_sysop == False) & (rv_user_has_no_rights == False) & (assistance_type != 'bot_revert')""")),
    'Patrollers with no extended rights': group_reverts_by_assistance(non_bot_reverts.query("""(rv_user_has_no_rights == True) & (assistance_type != 'bot_revert')"""))
})

**All patrollers**

| wiki_db | % Tool-Assisted Humans | % Humans | # Reverts |
|---|---|---|---|
| enwiki | 51.21 | 48.79 | 1688736 |
| eswiki | 2.29 | 97.71 | 345225 |
| itwiki | 4.92 | 95.08 | 231699 |
| frwiki | 3.14 | 96.86 | 202110 |
| ruwiki | 2.23 | 97.77 | 192436 |
| dewiki | 2.96 | 97.04 | 178573 |
| jawiki | 1.85 | 98.15 | 69556 |
| fawiki | 21.49 | 78.51 | 64834 |
| zhwiki | 2.12 | 97.88 | 61800 |
| idwiki | 0.86 | 99.14 | 39823 |

**Patrollers with sysop rights**

| wiki_db | % Tool-Assisted Humans | % Humans | # Reverts |
|---|---|---|---|
| enwiki | 38.85 | 61.15 | 347753 |
| itwiki | 6.75 | 93.25 | 138714 |
| eswiki | 0.13 | 99.87 | 65168 |
| frwiki | 0.05 | 99.95 | 56857 |
| ruwiki | 0.0 | 100.0 | 56056 |
| dewiki | 0.57 | 99.43 | 30791 |
| jawiki | 2.15 | 97.85 | 17622 |
| idwiki | 0.86 | 99.14 | 15536 |
| fawiki | 15.97 | 84.03 | 12977 |
| ptwiki | 33.44 | 66.56 | 9408 |

**Patrollers with extended rights (excl. sysop)**

| wiki_db | % Tool-Assisted Humans | % Humans | # Reverts |
|---|---|---|---|
| enwiki | 54.6 | 45.4 | 1326826 |
| eswiki | 3.84 | 96.16 | 191108 |
| dewiki | 3.46 | 96.54 | 147716 |
| ruwiki | 3.18 | 96.82 | 134736 |
| itwiki | 2.24 | 97.76 | 80844 |
| zhwiki | 2.3 | 97.7 | 56923 |
| jawiki | 1.77 | 98.23 | 50535 |
| fawiki | 23.56 | 76.44 | 50249 |
| frwiki | 7.53 | 92.47 | 49126 |
| ptwiki | 38.88 | 61.12 | 22269 |

**Patrollers with no extended rights**

| wiki_db | % Tool-Assisted Humans | % Humans | # Reverts |
|---|---|---|---|
| frwiki | 2.72 | 97.28 | 96127 |
| eswiki | 0.54 | 99.46 | 88949 |
| enwiki | 37.66 | 62.34 | 14157 |
| itwiki | 1.89 | 98.11 | 12141 |
| idwiki | 1.68 | 98.32 | 8149 |
| ptwiki | 47.89 | 52.11 | 3715 |
| ruwiki | 0.0 | 100.0 | 1644 |
| fawiki | 1.18 | 98.82 | 1608 |
| jawiki | 1.22 | 98.78 | 1399 |
| zhwiki | 0.0 | 100.0 | 601 |
