# Baseline: Accuracy of Patroller Reverts (Probable Vandalism)

**Last updated on 15 February 2024**

[TASK: T348859](https://phabricator.wikimedia.org/T348859)

# Contents
1. [Summary](#Summary)
2. [Data Gathering](#Data-Gathering)
3. [Analysis](#Analysis)

# Summary

This analysis establishes a baseline for the 'accuracy' of human patrollers by measuring how many of the patrollers' reverts were reverted back by another patroller. The baseline will later serve as a reference for evaluating the impact of [Automoderator](https://www.mediawiki.org/wiki/Moderator_Tools/Automoderator). Within the scope of Automoderator, the [operational definitions](https://phabricator.wikimedia.org/T349083) are as follows:

<u>probable vandalism:</u>
- edit belongs to the content namespace
- edit was reverted within 12 hours
- user is anonymous OR if registered
    - user edit count is less than 15 edits
    - time since user's first edit is less than 48 hours
- revert was made by a different editor
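
As a rough illustration, the probable-vandalism criteria above can be written as a single predicate. This is only a sketch: the `Edit` record and its field names are hypothetical, and because the list does not say whether the two conditions for registered users combine with AND or OR, the combination is left as a parameter (defaulting, as an assumption, to requiring both).

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Callable, Iterable, Optional


@dataclass
class Edit:
    """Hypothetical record; the field names are illustrative only."""
    in_content_namespace: bool
    seconds_to_revert: Optional[float]  # None if the edit was never reverted
    author: str
    reverter: Optional[str]
    author_is_anonymous: bool
    author_edit_count: int
    author_account_age: timedelta  # time since the author's first edit


def is_probable_vandalism(
    edit: Edit,
    combine: Callable[[Iterable[bool]], bool] = all,  # assumption: both conditions required
) -> bool:
    if not edit.in_content_namespace:
        return False
    # the edit must have been reverted within 12 hours
    if edit.seconds_to_revert is None or not 0 <= edit.seconds_to_revert <= 12 * 60 * 60:
        return False
    # the revert must have been made by a different editor
    if edit.reverter is None or edit.reverter == edit.author:
        return False
    if edit.author_is_anonymous:
        return True
    # registered users: low edit count and/or recently created account
    return combine([
        edit.author_edit_count < 15,
        edit.author_account_age < timedelta(hours=48),
    ])


anon_edit = Edit(True, 600, "192.0.2.1", "PatrollerA", True, 0, timedelta(0))
veteran_edit = Edit(True, 600, "LongTimeUser", "PatrollerA", False, 5000, timedelta(days=900))
print(is_probable_vandalism(anon_edit))     # True
print(is_probable_vandalism(veteran_edit))  # False
```

Passing `combine=any` gives the looser reading, in which a registered user qualifies when either condition holds.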

<u>patroller:</u>
- users belonging to user groups that have any of the following permissions on the respective wiki: rollback, review, patrol, block, delete, deleterevision
- OR registered users who have made 150+ content namespace edits and 10+ content namespace reverts<br>(note: for this analysis, we have considered registered users with 150+ edits)
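
The patroller definition above can be sketched in the same way. The function name, argument names, and the `min_edits` / `min_reverts` parameters are hypothetical and only serve to show how the relaxed rule noted above (150+ edits, no revert threshold) differs from the full definition.

```python
# Permissions within scope, taken from the definition above
PATROLLER_RIGHTS = frozenset(
    {"rollback", "review", "patrol", "block", "delete", "deleterevision"}
)


def is_patroller(user_rights, is_registered, content_edits, content_reverts=0,
                 min_edits=150, min_reverts=10):
    """Return True if the user counts as a patroller under the definition above."""
    # any one of the listed permissions is sufficient
    if PATROLLER_RIGHTS.intersection(user_rights):
        return True
    # otherwise fall back to the activity thresholds for registered users
    return is_registered and content_edits >= min_edits and content_reverts >= min_reverts


print(is_patroller({"rollback"}, True, 20, 0))           # True: has a qualifying permission
print(is_patroller(set(), True, 400, 2))                 # False under the full definition
print(is_patroller(set(), True, 400, 2, min_reverts=0))  # True under the relaxed rule
```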

In [23]:
pr_centered('Percent of Reverts by Patrollers on Potential Vandalism Reverted Back (2023)', True)
display_h({
    '': group_reverts_by_status(valid_non_bot_reverts)
})

wiki_db,Percent of Reverts Reverted,# Reverts
enwiki,3.32,1566103
eswiki,5.4,318860
itwiki,3.69,207467
frwiki,3.86,193518
ruwiki,3.21,180866
dewiki,2.09,169182
jawiki,2.73,64106
fawiki,4.72,61569
zhwiki,4.01,54107
idwiki,4.75,36600


# Data Gathering

## Imports

In [1]:
import pandas as pd
import numpy as np
import wmfdata as wmf
import great_tables as gt

pd.options.display.max_columns = None
from IPython.display import display_html
from IPython.display import display, HTML
from IPython.display import clear_output

import os
import requests
import warnings

In [2]:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

In [3]:
os.environ.pop('HTTP_PROXY', None)
os.environ.pop('HTTPS_PROXY', None)
os.environ.pop('http_proxy', None)
os.environ.pop('https_proxy', None)

'http://webproxy.eqiad.wmnet:8080'

## spark_session

In [4]:
spark_session = wmf.spark.get_active_session()

if spark_session is not None:
    spark_session.stop()
else:
    print('no active session')

no active session


In [5]:
spark_session = wmf.spark.create_custom_session(
    master="yarn",
    app_name='vandalism-patroller-accuracy',
    spark_config={
        "spark.driver.memory": "6g",
        "spark.dynamicAllocation.maxExecutors": 64,
        "spark.executor.memory": "24g",
        "spark.executor.cores": 4,
        "spark.sql.shuffle.partitions": 256,
        "spark.driver.maxResultSize": "2g"
        
    }
)

clear_output()

spark_session.sparkContext.setLogLevel("ERROR")
spark_session

## functions

In [18]:
# prints a string at center of the output, bold if needed
def pr_centered(content, bold=False):
    if bold:
        content = f"<b>{content}</b>"
    
    centered_html = f"<div style='text-align:center'>{content}</div>"
    
    display(HTML(centered_html))


# display dataframes side by side, each preceded by its title (the dict key)
# note: the `space` argument is currently unused
def display_h(frames, space=100):
    html = ""
    
    for title, frame in frames.items():
        html += f'<div>{title} {frame._repr_html_()}</div>'
        
    html = f"""
    <div style="display:flex; justify-content: space-evenly;">
    {html}
    </div>"""
    
    display_html(html, raw=True)

In [7]:
mwh_snapshot = '2024-01'

lang_list = ['en', 'es', 'ja', 'de', 'fr', 'ru', 'zh', 'it', 'pt', 'fa', 'id']
wikis_list = [f'{lang}wiki' for lang in lang_list]
wikis_sql = wmf.utils.sql_tuple(wikis_list)

api_endpoint = 'https://api-ro.discovery.wmnet/w/api.php'

## query: user rights info

In [8]:
# extract user groups having required permissions, from MediaWiki API output
def extract_ugroups(group_rights_info, rights):

    groups = []
    
    for user_right in group_rights_info:

        if any(right in user_right['rights'] for right in rights):
            groups.append(user_right['name'])

    return groups

In [9]:
warnings.filterwarnings('ignore')

# permissions within scope
rights = ['rollback', 'review', 'patrol', 'block', 'delete', 'deleterevision']

params = {
    "action": "query",
    "format": "json",
    "meta": "siteinfo",
    "formatversion": "2",
    "siprop": "usergroups"
}

all_ugroups = {}

for lang in lang_list:
    
    response = (
        requests
        .get(
            api_endpoint, 
            headers={'Host': f'{lang}.wikipedia.org'}, 
            params=params, 
            verify=False)
        .json()
    )
    
    ugroups = extract_ugroups(response['query']['usergroups'], rights)
    all_ugroups[lang] = ugroups
    
print('** User Groups by Wikipedia **')
for lang in all_ugroups:
    print(f'{lang}wiki:', all_ugroups[lang])

** User Groups by Wikipedia **
enwiki: ['sysop', 'suppress', 'rollbacker', 'patroller', 'reviewer']
eswiki: ['sysop', 'suppress', 'rollbacker', 'patroller', 'botadmin']
jawiki: ['autoconfirmed', 'sysop', 'interface-admin', 'suppress', 'rollbacker', 'eliminator']
dewiki: ['sysop', 'suppress', 'editor', 'reviewer']
frwiki: ['sysop', 'suppress', 'autopatrolled', 'rollbacker']
ruwiki: ['sysop', 'suppress', 'closer', 'editor', 'rollbacker']
zhwiki: ['sysop', 'suppress', 'rollbacker', 'patroller']
itwiki: ['sysop', 'suppress', 'rollbacker', 'autopatrolled', 'botadmin']
ptwiki: ['autoconfirmed', 'sysop', 'suppress', 'eliminator', 'rollbacker']
fawiki: ['sysop', 'suppress', 'patroller', 'rollbacker', 'image-reviewer', 'botadmin', 'eliminator', 'reviewer']
idwiki: ['sysop', 'suppress', 'rollbacker', 'editor', 'reviewer']
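
To make the filtering step concrete, the following sketch applies the same logic as `extract_ugroups` to a small made-up payload. The sample groups and their rights are hypothetical and heavily trimmed; only the `name` / `rights` layout mirrors what the cells above read from the API response.

```python
def extract_ugroups(group_rights_info, rights):
    """Return the names of user groups holding at least one of the given rights."""
    return [
        group["name"]
        for group in group_rights_info
        if any(right in group["rights"] for right in rights)
    ]


# Hypothetical, trimmed sample in the same layout as response['query']['usergroups']
sample_usergroups = [
    {"name": "*", "rights": ["read", "edit"]},
    {"name": "rollbacker", "rights": ["rollback"]},
    {"name": "sysop", "rights": ["block", "delete", "protect"]},
]

rights = ["rollback", "review", "patrol", "block", "delete", "deleterevision"]

print(extract_ugroups(sample_usergroups, rights))  # ['rollbacker', 'sysop']
```

A group is kept as soon as it holds any one of the permissions in scope, which is why groups with very different roles can appear in the same per-wiki list.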


## query: reverts

In [10]:
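# True when the user holds no groups beyond the automatically granted / minor ones
def check_user_groups(groups):
    
    allowed_groups = ['autoconfirmed', 'confirmed', 'ipblock-exempt']
    # treat a missing (NULL) or empty group list as "no rights"
    if not groups:
        return True
    return all(group in allowed_groups for group in groups)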

check_user_groups_udf = udf(check_user_groups, BooleanType())
spark_session.udf.register("check_user_groups", check_user_groups_udf)

<function __main__.check_user_groups(groups)>
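
The behaviour of this helper is easy to check outside Spark. The sketch below mirrors the function from the cell above in plain Python and runs it on a few hand-written group lists (the lists are illustrative, not taken from the data).

```python
def check_user_groups(groups):
    # groups that do not count as extended rights
    allowed_groups = ['autoconfirmed', 'confirmed', 'ipblock-exempt']
    return len(groups) == 0 or all(group in allowed_groups for group in groups)


print(check_user_groups([]))                                   # True: no groups at all
print(check_user_groups(['autoconfirmed', 'ipblock-exempt']))  # True: only allowed groups
print(check_user_groups(['autoconfirmed', 'rollbacker']))      # False: holds an extended right
```

In the query, this result is exposed as `rv_user_has_no_rights`, which later separates patrollers with no extended rights from the other patroller groups.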

In [11]:
%%time

query = """
WITH 
    base AS (
        SELECT
            wiki_db,
            revision_id AS rev_id,
            event_timestamp AS rev_ts,
            event_user_text AS user_name,
            revision_first_identity_reverting_revision_id AS rv_rev_id,
            CASE 
                WHEN ARRAY_CONTAINS(event_user_groups, 'sysop') THEN TRUE
                ELSE FALSE
            END AS is_init_user_sysop,
            CASE 
                WHEN revision_is_identity_revert THEN TRUE
                ELSE FALSE
            END AS is_init_rev_revert                
        FROM 
            wmf.mediawiki_history
        WHERE 
            snapshot = '{MWH_SNAPSHOT}'
            AND wiki_db = '{DB}'
            AND event_entity = 'revision'
            AND event_type = 'create'
            AND page_namespace_is_content
            AND 
                (
                    event_user_is_anonymous 
                    OR event_user_revision_count <= 15
                )
            AND SIZE(event_user_is_bot_by_historical) = 0
            AND revision_is_identity_reverted
            AND revision_seconds_to_identity_revert <= 12*60*60
            AND revision_seconds_to_identity_revert >= 0
            AND YEAR(event_timestamp) = 2023
    ),
    
    rv_info AS (
        SELECT
            base.*,
            mwh.event_user_text AS rv_user_name,
            mwh.event_user_groups AS rv_user_groups,
            CASE
                WHEN ARRAY_CONTAINS(event_user_groups, 'sysop') THEN TRUE
                ELSE FALSE
            END AS is_rv_user_sysop,
            CASE
                WHEN ARRAY_CONTAINS(event_user_groups, 'bot') THEN TRUE
                ELSE FALSE
            END AS is_rv_user_bot,
            CHECK_USER_GROUPS(event_user_groups) AS rv_user_has_no_rights,
            mwh.revision_is_identity_reverted AS is_rv_reverted,
            revision_first_identity_reverting_revision_id AS rv_rv_rev_id
        FROM 
            base
        JOIN
            wmf.mediawiki_history mwh
            ON base.wiki_db = mwh.wiki_db 
                AND base.rv_rev_id = mwh.revision_id
        WHERE
            snapshot = '{MWH_SNAPSHOT}'
            AND NOT base.user_name =  event_user_text
            AND NOT event_user_is_anonymous
            AND 
                (
                    mwh.event_user_revision_count >= 150
                    OR {USER_GROUPS_CONDITIONS}
                )
        ),    
       
        final AS (
            SELECT
                rv_info.*,
                CASE 
                    WHEN mwh.event_user_is_anonymous = TRUE THEN TRUE
                    ELSE FALSE
                END AS rv_rv_user_is_anon,
                CASE 
                    WHEN rv_info.user_name = mwh.event_user_text THEN TRUE
                    ELSE FALSE
                END AS is_rv_rv_user_init,
                CASE
                    WHEN mwh.event_user_revision_count <= 100 THEN TRUE
                    ELSE FALSE
                END AS is_rv_rv_user_new
            FROM
                rv_info
            JOIN
                wmf.mediawiki_history mwh
                ON rv_info.wiki_db = mwh.wiki_db 
                    AND rv_info.rv_rv_rev_id = mwh.revision_id
            WHERE
                snapshot = '{MWH_SNAPSHOT}'
                AND is_rv_reverted
            UNION ALL
            SELECT 
                rv_info.*,
                NULL AS rv_rv_user_is_anon,
                NULL AS is_rv_rv_user_init,
                NULL AS is_rv_rv_user_new
            FROM
                rv_info
            WHERE
                NOT is_rv_reverted
        )


SELECT
    wiki_db,
    rev_id,
    rv_rev_id,
    is_rv_reverted,
    is_init_user_sysop,
    is_init_rev_revert,
    is_rv_user_sysop,
    is_rv_user_bot,
    rv_user_has_no_rights,
    rv_rv_user_is_anon,
    is_rv_rv_user_init,
    is_rv_rv_user_new
FROM
    final
"""

# run the query once per wiki, then combine the results into a single dataframe
reverts_by_wiki = []

for lang, ugroups in all_ugroups.items():
    sql_ugroups_statements = " OR ".join(f"ARRAY_CONTAINS(event_user_groups, '{value}')" for value in ugroups)
    
    reverts_by_wiki.append(
        wmf.spark.run(
            query
            .format(
                MWH_SNAPSHOT=mwh_snapshot, 
                DB=f'{lang}wiki', 
                USER_GROUPS_CONDITIONS=sql_ugroups_statements
            )
        )
    )
    
reverts = pd.concat(reverts_by_wiki, ignore_index=True)
    
reverts.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3588443 entries, 0 to 3588442
Data columns (total 12 columns):
 #   Column                 Dtype 
---  ------                 ----- 
 0   wiki_db                object
 1   rev_id                 int64 
 2   rv_rev_id              int64 
 3   is_rv_reverted         bool  
 4   is_init_user_sysop     bool  
 5   is_init_rev_revert     bool  
 6   is_rv_user_sysop       bool  
 7   is_rv_user_bot         bool  
 8   rv_user_has_no_rights  bool  
 9   rv_rv_user_is_anon     object
 10  is_rv_rv_user_init     object
 11  is_rv_rv_user_new      object
dtypes: bool(6), int64(2), object(4)
memory usage: 184.8+ MB
CPU times: user 33.8 s, sys: 4.47 s, total: 38.2 s
Wall time: 38min 18s
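
For reference, the `USER_GROUPS_CONDITIONS` placeholder in the query above is filled with one `ARRAY_CONTAINS` test per user group, joined with `OR`. The sketch below reproduces that string for the enwiki groups printed earlier; the `template` string is a shortened stand-in for the corresponding part of the `WHERE` clause, not the full query.

```python
# User groups found for enwiki in the earlier cell
enwiki_groups = ['sysop', 'suppress', 'rollbacker', 'patroller', 'reviewer']

# One ARRAY_CONTAINS test per group, combined with OR
condition = " OR ".join(
    f"ARRAY_CONTAINS(event_user_groups, '{group}')" for group in enwiki_groups
)

# Shortened stand-in for the patroller filter in the rv_info CTE
template = "(mwh.event_user_revision_count >= 150 OR {USER_GROUPS_CONDITIONS})"

print(template.format(USER_GROUPS_CONDITIONS=condition))
```

Because the group names come from the API output rather than from user input, plain string formatting is workable here; an empty group list would, however, produce an invalid condition.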


In [17]:
# non_bot_reverts: all reverts made by non-bot patrollers
# valid_non_bot_reverts: additionally drops reverts that were reverted back by an anonymous user,
# by the user whose edit was initially reverted, or by a newcomer
non_bot_reverts = reverts.query("""(is_rv_user_bot == False)""")

valid_non_bot_reverts = pd.concat([
    non_bot_reverts.query("""(is_rv_reverted == False)"""),
    non_bot_reverts.query("""(is_rv_reverted == True) & (rv_rv_user_is_anon == False) & (is_rv_rv_user_init == False) & (is_rv_rv_user_new == False)""")],
    ignore_index=False
)

# note: the denominator is all reverts, so this share also includes reverts made by bots
print(f'percentage of reverts excluded (bot reverts and potentially invalid non-bot reverts): {round(100 - valid_non_bot_reverts.shape[0] / reverts.shape[0] * 100)}%')

percentage of reverts excluded (bot reverts and potentially invalid non-bot reverts): 20%
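
The validity rule used in the cell above can be restated row by row. The sketch below is a plain-Python equivalent applied to four made-up rows; the column names match the `reverts` dataframe, while the helper name `is_valid_revert` and the sample values are hypothetical.

```python
def is_valid_revert(row):
    """Row-wise restatement of the filter that builds valid_non_bot_reverts."""
    if row["is_rv_user_bot"]:
        return False  # reverts made by bots are out of scope
    if not row["is_rv_reverted"]:
        return True   # the patroller's revert was never reverted back
    # reverted back: keep only if the re-reverting user is not anonymous,
    # not the author of the initially reverted edit, and not a newcomer
    return not (
        row["rv_rv_user_is_anon"]
        or row["is_rv_rv_user_init"]
        or row["is_rv_rv_user_new"]
    )


rows = [
    {"rev_id": 1, "is_rv_user_bot": False, "is_rv_reverted": False,
     "rv_rv_user_is_anon": None, "is_rv_rv_user_init": None, "is_rv_rv_user_new": None},
    {"rev_id": 2, "is_rv_user_bot": False, "is_rv_reverted": True,
     "rv_rv_user_is_anon": False, "is_rv_rv_user_init": False, "is_rv_rv_user_new": False},
    {"rev_id": 3, "is_rv_user_bot": False, "is_rv_reverted": True,
     "rv_rv_user_is_anon": True, "is_rv_rv_user_init": False, "is_rv_rv_user_new": False},
    {"rev_id": 4, "is_rv_user_bot": True, "is_rv_reverted": False,
     "rv_rv_user_is_anon": None, "is_rv_rv_user_init": None, "is_rv_rv_user_new": None},
]

print([row["rev_id"] for row in rows if is_valid_revert(row)])  # [1, 2]
```

Row 2 stays in the data and counts as a revert that was reverted back; row 3 is dropped because the re-revert came from an anonymous user.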


# Analysis

In [19]:
def group_reverts_by_status(df):
    
    # count distinct rev_id values per wiki, split by whether the patroller's revert was reverted back
    grouped = (
        df
        .groupby(['wiki_db', 'is_rv_reverted'])['rev_id']
        .nunique()
        .reset_index()
        .pivot(index='wiki_db', columns='is_rv_reverted', values='rev_id')
    )
    grouped.columns.name = None
    
    # total per wiki = reverted back (True) + not reverted back (False)
    grouped['# Reverts'] = grouped.sum(axis=1)
    grouped = grouped.fillna(0).astype(int)
    # share of reverts that were reverted back, in percent
    grouped['Percent of Reverts Reverted'] = round(grouped[True] / grouped['# Reverts'] * 100, 2)
    
    return grouped[['Percent of Reverts Reverted', '# Reverts']].sort_values('# Reverts', ascending=False)
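
The aggregation above boils down to a simple ratio per wiki: distinct reverts that were reverted back, divided by all distinct reverts. The sketch below reproduces that arithmetic without pandas on toy data; the function name, the wiki names, and the revision IDs are made up for illustration.

```python
from collections import defaultdict


def percent_reverted_by_wiki(rows):
    """rows: iterable of (wiki_db, rev_id, is_rv_reverted) tuples."""
    # distinct rev_ids per (wiki, status), analogous to groupby(...).nunique()
    ids = defaultdict(set)
    for wiki_db, rev_id, is_rv_reverted in rows:
        ids[(wiki_db, is_rv_reverted)].add(rev_id)

    result = {}
    for wiki in {wiki for wiki, _ in ids}:
        n_reverted = len(ids.get((wiki, True), set()))
        n_total = n_reverted + len(ids.get((wiki, False), set()))
        result[wiki] = (round(n_reverted / n_total * 100, 2), n_total)

    # largest wikis first, as in the tables
    return dict(sorted(result.items(), key=lambda item: item[1][1], reverse=True))


toy_rows = [
    ("wiki_a", 1, False), ("wiki_a", 2, False), ("wiki_a", 3, True),
    ("wiki_a", 4, False), ("wiki_a", 4, False),  # duplicate row, counted once
    ("wiki_b", 10, False), ("wiki_b", 11, False),
]

print(percent_reverted_by_wiki(toy_rows))
# {'wiki_a': (25.0, 4), 'wiki_b': (0.0, 2)}
```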

In [20]:
pr_centered('Percent of Reverts by Patrollers on Potential Vandalism Reverted Back [All Reverts]', True)
display_h({
    'All patrollers': group_reverts_by_status(non_bot_reverts),
    'Patrollers with sysop rights': group_reverts_by_status(non_bot_reverts.query("""is_rv_user_sysop == True"""))
})
display_h({
    'Patrollers with extended rights (excl. sysop)': group_reverts_by_status(non_bot_reverts.query(("""is_rv_user_sysop == False & rv_user_has_no_rights == False"""))),
    'Patrollers with no extended rights': group_reverts_by_status(non_bot_reverts.query(("""rv_user_has_no_rights == True""")))
})

All patrollers
wiki_db,Percent of Reverts Reverted,# Reverts
enwiki,10.34,1688735
eswiki,12.62,345225
itwiki,13.76,231699
frwiki,7.94,202097
ruwiki,9.02,192409
dewiki,7.24,178573
jawiki,10.35,69556
fawiki,9.68,64951
zhwiki,15.98,61820
idwiki,12.46,39823

Patrollers with sysop rights
wiki_db,Percent of Reverts Reverted,# Reverts
enwiki,9.68,347753
itwiki,13.17,138714
eswiki,12.05,65168
frwiki,6.24,56857
ruwiki,8.76,56056
dewiki,5.78,30791
jawiki,7.06,17622
idwiki,11.56,15536
fawiki,7.7,12977
ptwiki,10.08,9408


Patrollers with extended rights (excl. sysop)
wiki_db,Percent of Reverts Reverted,# Reverts
enwiki,10.46,1326825
eswiki,12.02,191108
dewiki,7.55,147716
ruwiki,9.05,134709
itwiki,14.09,80844
zhwiki,16.42,56943
jawiki,11.22,50535
fawiki,9.78,50366
frwiki,7.31,49112
ptwiki,7.53,22269

Patrollers with no extended rights
wiki_db,Percent of Reverts Reverted,# Reverts
frwiki,9.26,96128
eswiki,14.34,88949
enwiki,15.03,14157
itwiki,18.33,12141
idwiki,16.39,8149
ptwiki,11.52,3716
ruwiki,14.78,1644
fawiki,22.45,1608
jawiki,20.51,1399
zhwiki,16.64,601


In [21]:
pr_centered('Percent of Reverts by Patrollers on Potential Vandalism Reverted Back [Valid Reverts]', True)
display_h({
    'All patrollers': group_reverts_by_status(valid_non_bot_reverts),
    'Patrollers with sysop rights': group_reverts_by_status(valid_non_bot_reverts.query("""is_rv_user_sysop == True"""))
})
display_h({
    'Patrollers with extended rights (excl. sysop)': group_reverts_by_status(valid_non_bot_reverts.query(("""is_rv_user_sysop == False & rv_user_has_no_rights == False"""))),
    'Patrollers with no extended rights': group_reverts_by_status(valid_non_bot_reverts.query(("""rv_user_has_no_rights == True""")))
})

All patrollers
wiki_db,Percent of Reverts Reverted,# Reverts
enwiki,3.32,1566103
eswiki,5.4,318860
itwiki,3.69,207467
frwiki,3.86,193518
ruwiki,3.21,180866
dewiki,2.09,169182
jawiki,2.73,64106
fawiki,4.72,61569
zhwiki,4.01,54107
idwiki,4.75,36600

Patrollers with sysop rights
wiki_db,Percent of Reverts Reverted,# Reverts
enwiki,3.01,323846
itwiki,3.23,124473
eswiki,4.87,60248
frwiki,3.14,55037
ruwiki,3.51,53004
dewiki,1.73,29522
jawiki,2.0,16713
idwiki,5.05,14471
fawiki,3.95,12471
ptwiki,4.8,8887


Patrollers with extended rights (excl. sysop)
wiki_db,Percent of Reverts Reverted,# Reverts
enwiki,3.36,1229326
eswiki,4.58,176215
dewiki,2.17,139597
ruwiki,3.03,126344
itwiki,3.93,72290
zhwiki,4.04,49598
fawiki,4.53,47594
frwiki,3.19,47018
jawiki,2.94,46224
ptwiki,2.95,21218

Patrollers with no extended rights
wiki_db,Percent of Reverts Reverted,# Reverts
frwiki,4.64,91463
eswiki,7.52,82397
enwiki,6.98,12931
itwiki,7.37,10704
idwiki,5.68,7223
ptwiki,7.22,3544
ruwiki,7.71,1518
fawiki,17.09,1504
jawiki,4.88,1169
zhwiki,6.88,538


In [22]:
reverts_non_init_reverts = valid_non_bot_reverts.query("""is_init_rev_revert == False""")

pr_centered('Percent of Reverts by Patrollers on Potential Vandalism Reverted Back', True)
pr_centered('only reverts where the edit being reverted was not a revert', True)
display_h({
    'All patrollers': group_reverts_by_status(reverts_non_init_reverts),
    'Patrollers with sysop rights': group_reverts_by_status(reverts_non_init_reverts.query("""is_rv_user_sysop == True"""))
})
display_h({
    'Patrollers with extended rights (excl. sysop)': group_reverts_by_status(reverts_non_init_reverts.query(("""is_rv_user_sysop == False & rv_user_has_no_rights == False"""))),
    'Patrollers with no extended rights': group_reverts_by_status(reverts_non_init_reverts.query(("""rv_user_has_no_rights == True""")))
})

All patrollers
wiki_db,Percent of Reverts Reverted,# Reverts
enwiki,3.25,1512587
eswiki,5.34,308412
itwiki,3.65,198463
frwiki,3.83,188725
ruwiki,3.17,175593
dewiki,2.05,163835
jawiki,2.62,60436
fawiki,4.72,59662
zhwiki,3.89,51165
idwiki,4.73,35210

Patrollers with sysop rights
wiki_db,Percent of Reverts Reverted,# Reverts
enwiki,2.94,312027
itwiki,3.19,118187
eswiki,4.83,57881
frwiki,3.12,53619
ruwiki,3.48,51300
dewiki,1.71,28490
jawiki,2.01,15499
idwiki,4.98,13998
fawiki,4.0,12035
ptwiki,4.9,8482


Patrollers with extended rights (excl. sysop)
wiki_db,Percent of Reverts Reverted,# Reverts
enwiki,3.3,1188181
eswiki,4.54,170648
dewiki,2.12,135287
ruwiki,2.99,122825
itwiki,3.87,70095
zhwiki,3.92,46924
fawiki,4.5,46202
frwiki,3.15,46005
jawiki,2.8,43845
ptwiki,2.89,20666

Patrollers with no extended rights
wiki_db,Percent of Reverts Reverted,# Reverts
frwiki,4.6,89101
eswiki,7.45,79883
enwiki,6.59,12379
itwiki,7.33,10181
idwiki,5.79,6775
ptwiki,7.2,3416
ruwiki,7.49,1468
fawiki,17.89,1425
jawiki,4.12,1092
zhwiki,6.41,515
