# Baseline: Accuracy of Patroller Reverts (Probable Vandalism)

**Last updated on 2 January 2024**

[TASK: T348859](https://phabricator.wikimedia.org/T348859)

# Contents
1. [Summary](#Summary)
2. [Data Gathering](#Data-Gathering)
3. [Analysis](#Analysis)

## Summary

This analysis establishes a baseline for the 'accuracy' of human patrollers by measuring how many of the patrollers' reverts were themselves reverted back by another patroller. The baseline will later serve as a reference for evaluating the impact of [Automoderator](https://www.mediawiki.org/wiki/Moderator_Tools/Automoderator). The [operational definitions](https://phabricator.wikimedia.org/T349083) within the scope of Automoderator are the following:

<u>probable vandalism:</u>
- edit belongs to the content namespace
- edit was reverted within 12 hours
- user is anonymous OR, if registered:
    - user edit count is less than 15 edits
    - time since user's first edit is less than 48 hours
- revert was made by a different editor

<u>patroller:</u>
- users belonging to user groups that have any of the following permissions on the respective wiki: rollback, review, patrol, block, delete, deleterevision
- OR registered users who have made 150+ content namespace edits and 10+ content namespace reverts<br>(note: for this analysis, we considered registered users with 150+ edits)
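
The patroller definition above can be read as a simple predicate. The following is a minimal sketch only: the function name `is_patroller` and its parameters are hypothetical, and the actual analysis applies a relaxed version of the second criterion (150+ edits, without the revert threshold) directly in SQL.

```python
# Permissions listed in the operational definition of a patroller.
PATROLLER_RIGHTS = {"rollback", "review", "patrol", "block", "delete", "deleterevision"}


def is_patroller(user_rights, content_edits, content_reverts, is_registered=True):
    """Hypothetical helper: True if a user meets the patroller definition.

    user_rights: permissions granted through the user's groups on the wiki.
    content_edits / content_reverts: counts in the content namespace.
    """
    has_right = bool(PATROLLER_RIGHTS & set(user_rights))
    is_experienced = is_registered and content_edits >= 150 and content_reverts >= 10
    return has_right or is_experienced


print(is_patroller(["rollback"], content_edits=20, content_reverts=0))  # True
print(is_patroller([], content_edits=200, content_reverts=12))          # True
print(is_patroller([], content_edits=200, content_reverts=3))           # False
```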

In [238]:
pr_centered('Percent of Reverts by Patrollers on Potential Vandalism Reverted Back (2022)', True)
display_h({
    '': group_reverts_by_status(valid_non_bot_reverts)
})

| wiki_db | Percent of Reverts Reverted | # Reverts |
|:---|---:|---:|
| enwiki | 3.64 | 1628305 |
| eswiki | 5.25 | 308797 |
| itwiki | 3.77 | 226128 |
| ruwiki | 3.22 | 206098 |
| frwiki | 3.38 | 195289 |
| dewiki | 1.94 | 170661 |
| jawiki | 3.48 | 78123 |
| fawiki | 4.13 | 72164 |
| zhwiki | 4.87 | 70370 |
| ptwiki | 3.36 | 35227 |


# Data Gathering

## Imports

In [1]:
import pandas as pd
import numpy as np
import wmfdata as wmf

pd.options.display.max_columns = None
from IPython.display import display_html
from IPython.display import display, HTML
from IPython.display import clear_output

import os
import requests
import warnings

In [202]:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

In [2]:
os.environ.pop('HTTP_PROXY', None)
os.environ.pop('HTTPS_PROXY', None)
os.environ.pop('http_proxy', None)
os.environ.pop('https_proxy', None)

'http://webproxy.eqiad.wmnet:8080'

## spark_session

In [3]:
spark_session = wmf.spark.get_active_session()

if spark_session is not None:
    spark_session.stop()
else:
    print('no active session')

no active session


In [4]:
spark_session = wmf.spark.create_custom_session(
    master="yarn",
    app_name='vandalism-patroller-accuracy',
    spark_config={
        "spark.driver.memory": "6g",
        "spark.dynamicAllocation.maxExecutors": 64,
        "spark.executor.memory": "24g",
        "spark.executor.cores": 4,
        "spark.sql.shuffle.partitions": 256,
        "spark.driver.maxResultSize": "2g"
        
    }
)

clear_output()

spark_session.sparkContext.setLogLevel("ERROR")
spark_session

## functions

In [5]:
# display dataframes horizontally with title for each
def display_h(frames, space=100):
    html = ""
    
    for key in frames.keys():
        html_df =f'<div>{key} {frames[key]._repr_html_()}</div>'
        html += html_df
        
    html = f"""
    <div style="display:flex; justify-content: space-evenly;">
    {html}
    </div>"""
    
    display_html(html, raw=True)

In [8]:
mwh_snapshot = '2023-11'

lang_list = ['en', 'es', 'ja', 'de', 'fr', 'ru', 'zh', 'it', 'pt', 'fa', 'id']
wikis_list = [f'{lang}wiki' for lang in lang_list]
wikis_sql = wmf.utils.sql_tuple(wikis_list)

api_endpoint = 'https://api-ro.discovery.wmnet/w/api.php'

## query: user rights info

In [9]:
# extract user groups having required permissions, from MediaWiki API output
def extract_ugroups(group_rights_info, rights):

    groups = []
    
    for user_right in group_rights_info:

        if any(right in user_right['rights'] for right in rights):
            groups.append(user_right['name'])

    return groups
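
To illustrate what `extract_ugroups` does, here is a small self-contained sketch that runs the same logic on a made-up `usergroups` payload. The sample data is invented for illustration and only mimics the shape of the `siprop=usergroups` response (a list of entries with `name` and `rights`); it is not real API output.

```python
def extract_ugroups(group_rights_info, rights):
    """Return the names of groups that hold at least one of the given rights."""
    return [
        group["name"]
        for group in group_rights_info
        if any(right in group["rights"] for right in rights)
    ]


# Invented sample, shaped like response['query']['usergroups'].
sample_usergroups = [
    {"name": "*", "rights": ["read", "edit"]},
    {"name": "rollbacker", "rights": ["rollback"]},
    {"name": "sysop", "rights": ["block", "delete", "rollback"]},
]

rights = ["rollback", "review", "patrol", "block", "delete", "deleterevision"]
matched = extract_ugroups(sample_usergroups, rights)
print(matched)  # ['rollbacker', 'sysop']
```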

In [11]:
warnings.filterwarnings('ignore')

# permissions within scope
rights = ['rollback', 'review', 'patrol', 'block', 'delete', 'deleterevision']

params = {
    "action": "query",
    "format": "json",
    "meta": "siteinfo",
    "formatversion": "2",
    "siprop": "usergroups"
}

all_ugroups = {}

for lang in lang_list:
    
    response = (
        requests
        .get(
            api_endpoint, 
            headers={'Host': f'{lang}.wikipedia.org'}, 
            params=params, 
            verify=False)
        .json()
    )
    
    ugroups = extract_ugroups(response['query']['usergroups'], rights)
    all_ugroups[lang] = ugroups
    
print('** User Groups by Wikipedia **')
for lang in all_ugroups:
    print(f'{lang}wiki:', all_ugroups[lang])

** User Groups by Wikipedia **
enwiki: ['sysop', 'suppress', 'rollbacker', 'patroller', 'reviewer']
eswiki: ['sysop', 'suppress', 'rollbacker', 'patroller', 'botadmin']
jawiki: ['autoconfirmed', 'sysop', 'interface-admin', 'suppress', 'rollbacker', 'eliminator']
dewiki: ['sysop', 'suppress', 'editor', 'reviewer']
frwiki: ['sysop', 'suppress', 'autopatrolled', 'rollbacker']
ruwiki: ['sysop', 'suppress', 'closer', 'editor', 'rollbacker']
zhwiki: ['sysop', 'suppress', 'rollbacker', 'patroller']
itwiki: ['sysop', 'suppress', 'rollbacker', 'autopatrolled', 'botadmin']
ptwiki: ['autoconfirmed', 'sysop', 'suppress', 'eliminator', 'rollbacker']
fawiki: ['sysop', 'suppress', 'patroller', 'rollbacker', 'image-reviewer', 'botadmin', 'eliminator', 'reviewer']
idwiki: ['sysop', 'suppress', 'rollbacker', 'editor', 'reviewer']


## query: reverts

In [211]:
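def check_user_groups(groups):
    # True when the user holds no groups other than those in allowed_groups
    allowed_groups = ['autoconfirmed', 'confirmed', 'ipblock-exempt']
    # a NULL groups array reaches the UDF as None; treat it like an empty list
    return not groups or all(group in allowed_groups for group in groups)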

check_user_groups_udf = udf(check_user_groups, BooleanType())
spark_session.udf.register("check_user_groups", check_user_groups_udf)

<function __main__.check_user_groups(groups)>
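
As a quick check of the logic, the same function can be exercised in plain Python, outside Spark. This sketch copies the function body from the cell above; the example group lists are chosen for illustration.

```python
def check_user_groups(groups):
    allowed_groups = ['autoconfirmed', 'confirmed', 'ipblock-exempt']
    return len(groups) == 0 or all(group in allowed_groups for group in groups)


examples = [
    [],                              # no groups at all
    ['autoconfirmed'],               # only an allowed group
    ['autoconfirmed', 'rollbacker'], # allowed group plus an extended right
    ['sysop'],                       # extended right only
]

results = [check_user_groups(groups) for groups in examples]
print(results)  # [True, True, False, False]
```

A result of `True` corresponds to `rv_user_has_no_rights` in the query below.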

In [215]:
%%time

query = """
WITH 
    base AS (
        SELECT
            wiki_db,
            revision_id AS rev_id,
            event_timestamp AS rev_ts,
            event_user_text AS user_name,
            revision_first_identity_reverting_revision_id AS rv_rev_id,
            CASE 
                WHEN ARRAY_CONTAINS(event_user_groups, 'sysop') THEN TRUE
                ELSE FALSE
            END AS is_init_user_sysop,
            CASE 
                WHEN revision_is_identity_revert THEN TRUE
                ELSE FALSE
            END AS is_init_rev_revert                
        FROM 
            wmf.mediawiki_history
        WHERE 
            snapshot = '{MWH_SNAPSHOT}'
            AND wiki_db = '{DB}'
            AND event_entity = 'revision'
            AND event_type = 'create'
            AND page_namespace_is_content
            AND 
                (
                    event_user_is_anonymous 
                    OR event_user_revision_count <= 15
                )
            AND SIZE(event_user_is_bot_by_historical) = 0
            AND revision_is_identity_reverted
            AND revision_seconds_to_identity_revert <= 12*60*60
            AND revision_seconds_to_identity_revert >= 0
            AND YEAR(event_timestamp) = 2022
    ),
    
    rv_info AS (
        SELECT
            base.*,
            mwh.event_user_text AS rv_user_name,
            mwh.event_user_groups AS rv_user_groups,
            CASE
                WHEN ARRAY_CONTAINS(mwh.event_user_groups, 'sysop') THEN TRUE
                ELSE FALSE
            END AS is_rv_user_sysop,
            CASE
                WHEN ARRAY_CONTAINS(mwh.event_user_groups, 'bot') THEN TRUE
                ELSE FALSE
            END AS is_rv_user_bot,
            CHECK_USER_GROUPS(mwh.event_user_groups) AS rv_user_has_no_rights,
            mwh.revision_is_identity_reverted AS is_rv_reverted,
            mwh.revision_first_identity_reverting_revision_id AS rv_rv_rev_id
        FROM 
            base
        JOIN
            wmf.mediawiki_history mwh
            ON base.wiki_db = mwh.wiki_db 
                AND base.rv_rev_id = mwh.revision_id
        WHERE
            mwh.snapshot = '{MWH_SNAPSHOT}'
            AND base.user_name <> mwh.event_user_text
            AND NOT mwh.event_user_is_anonymous
            AND 
                (
                    mwh.event_user_revision_count >= 150
                    OR {USER_GROUPS_CONDITIONS}
                )
        ),    
       
        final AS (
            SELECT
                rv_info.*,
                CASE 
                    WHEN mwh.event_user_is_anonymous = TRUE THEN TRUE
                    ELSE FALSE
                END AS rv_rv_user_is_anon,
                CASE 
                    WHEN rv_info.user_name = mwh.event_user_text THEN TRUE
                    ELSE FALSE
                END AS is_rv_rv_user_init,
                CASE
                    WHEN mwh.event_user_revision_count <= 100 THEN TRUE
                    ELSE FALSE
                END AS is_rv_rv_user_new
            FROM
                rv_info
            JOIN
                wmf.mediawiki_history mwh
                ON rv_info.wiki_db = mwh.wiki_db 
                    AND rv_info.rv_rv_rev_id = mwh.revision_id
            WHERE
                snapshot = '{MWH_SNAPSHOT}'
                AND is_rv_reverted
            UNION ALL
            SELECT 
                rv_info.*,
                NULL AS rv_rv_user_is_anon,
                NULL AS is_rv_rv_user_init,
                NULL AS is_rv_rv_user_new
            FROM
                rv_info
            WHERE
                NOT is_rv_reverted
        )


SELECT
    wiki_db,
    rev_id,
    rv_rev_id,
    is_rv_reverted,
    is_init_user_sysop,
    is_init_rev_revert,
    is_rv_user_sysop,
    is_rv_user_bot,
    rv_user_has_no_rights,
    rv_rv_user_is_anon,
    is_rv_rv_user_init,
    is_rv_rv_user_new
FROM
    final
"""

reverts = pd.DataFrame()

for lang in all_ugroups.keys():
    sql_ugroups_statements = " OR ".join([f"ARRAY_CONTAINS(event_user_groups, '{value}')" for value in all_ugroups[lang]])
    
    reverts_by_wiki = wmf.spark.run(
        query
        .format(
            MWH_SNAPSHOT=mwh_snapshot, 
            DB=f'{lang}wiki', 
            USER_GROUPS_CONDITIONS=sql_ugroups_statements
        )
    )
    
    reverts = pd.concat([reverts, reverts_by_wiki], ignore_index=True)
    
reverts.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3800300 entries, 0 to 3800299
Data columns (total 12 columns):
 #   Column                 Dtype 
---  ------                 ----- 
 0   wiki_db                object
 1   rev_id                 int64 
 2   rv_rev_id              int64 
 3   is_rv_reverted         bool  
 4   is_init_user_sysop     bool  
 5   is_init_rev_revert     bool  
 6   is_rv_user_sysop       bool  
 7   is_rv_user_bot         bool  
 8   rv_user_has_no_rights  bool  
 9   rv_rv_user_is_anon     object
 10  is_rv_rv_user_init     object
 11  is_rv_rv_user_new      object
dtypes: bool(6), int64(2), object(4)
memory usage: 195.7+ MB
CPU times: user 42.5 s, sys: 8.5 s, total: 51 s
Wall time: 37min 36s
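
For reference, the `{USER_GROUPS_CONDITIONS}` placeholder in the query above is filled with one `ARRAY_CONTAINS` test per user group, joined with `OR`. The sketch below reproduces that string-building step on its own, using the dewiki group list printed earlier in this notebook.

```python
# User groups with patrolling-related rights on dewiki (from the output above).
dewiki_ugroups = ['sysop', 'suppress', 'editor', 'reviewer']

# Same expression as in the query loop: one ARRAY_CONTAINS test per group.
conditions = " OR ".join(
    f"ARRAY_CONTAINS(event_user_groups, '{group}')" for group in dewiki_ugroups
)

# Print one condition per line for readability.
print(conditions.replace(" OR ", "\n OR "))
```

The resulting string is combined in the query with `event_user_revision_count >= 150`, so a reverting user qualifies either through edit count or through group membership.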


In [216]:
# remove: reverts that were reverted back by an anonymous user, by the user whose edit was initially reverted, or by a newcomer
non_bot_reverts = reverts.query("""(is_rv_user_bot == False)""")

valid_non_bot_reverts = pd.concat([
    reverts.query("""(is_rv_user_bot == False) & (is_rv_reverted == False)"""),
    reverts.query("""(is_rv_user_bot == False) & (is_rv_reverted == True) & (rv_rv_user_is_anon == False) & (is_rv_rv_user_init == False) & (is_rv_rv_user_new == False)""")],
    ignore_index=False
)

print(f'percentage of potentially invalid non-bot reverts: {round(100 - valid_non_bot_reverts.shape[0] / reverts.shape[0] * 100)}%')

percentage of potentially invalid non-bot reverts: 20%
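
The filter above can be hard to read in `query` syntax. The following is a minimal pure-Python sketch of the same rule, applied to a few invented rows; the `id` field and the helper name `is_valid_revert` are hypothetical and exist only for this illustration.

```python
def is_valid_revert(row):
    """Keep non-bot reverts that were either never reverted back, or were
    reverted back by a registered, non-newcomer user other than the author
    of the originally reverted edit."""
    if row["is_rv_user_bot"]:
        return False
    if not row["is_rv_reverted"]:
        return True
    return (
        row["rv_rv_user_is_anon"] is False
        and row["is_rv_rv_user_init"] is False
        and row["is_rv_rv_user_new"] is False
    )


def make_row(row_id, bot, reverted, anon=None, init=None, new=None):
    return {
        "id": row_id, "is_rv_user_bot": bot, "is_rv_reverted": reverted,
        "rv_rv_user_is_anon": anon, "is_rv_rv_user_init": init,
        "is_rv_rv_user_new": new,
    }


# Invented rows for illustration only.
rows = [
    make_row(1, bot=False, reverted=False),                                     # never reverted back
    make_row(2, bot=False, reverted=True, anon=True, init=False, new=True),    # reverted back by an anonymous user
    make_row(3, bot=False, reverted=True, anon=False, init=True, new=False),   # reverted back by the original author
    make_row(4, bot=False, reverted=True, anon=False, init=False, new=False),  # reverted back by an experienced user
    make_row(5, bot=True, reverted=False),                                      # revert made by a bot
]

valid_ids = [row["id"] for row in rows if is_valid_revert(row)]
print(valid_ids)  # [1, 4]
```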


# Analysis

In [242]:
def group_reverts_by_status(df):
    
    grouped = (
        df
        .groupby(['wiki_db', 'is_rv_reverted'])['rev_id']
        .nunique()
        .reset_index()
        .pivot(index='wiki_db', columns='is_rv_reverted', values='rev_id')
    )
    grouped.columns.name = None
    
    grouped['# Reverts'] = grouped.sum(axis=1)
    grouped = grouped.fillna(0).astype(int)
    grouped['Percent of Reverts Reverted'] = round(grouped[True] / grouped['# Reverts'] * 100, 2)
    
    return grouped[['Percent of Reverts Reverted', '# Reverts']].sort_values('# Reverts', ascending=False)
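
The metric computed by `group_reverts_by_status` is, per wiki, the number of distinct reverted-back reverts divided by the number of distinct reverts, times 100. The sketch below restates that calculation without pandas on invented data; the wiki names `xxwiki` and `yywiki` and the function name `percent_reverted` are placeholders, not part of the analysis.

```python
from collections import defaultdict


def percent_reverted(rows):
    """rows: iterable of (wiki_db, rev_id, is_rv_reverted) tuples.
    Returns {wiki_db: (percent of reverts reverted, number of reverts)}."""
    totals = defaultdict(set)
    reverted = defaultdict(set)
    for wiki_db, rev_id, is_rv_reverted in rows:
        totals[wiki_db].add(rev_id)
        if is_rv_reverted:
            reverted[wiki_db].add(rev_id)
    return {
        wiki_db: (round(len(reverted[wiki_db]) / len(ids) * 100, 2), len(ids))
        for wiki_db, ids in totals.items()
    }


# Invented rows for illustration only.
toy_rows = [
    ("xxwiki", 1, False), ("xxwiki", 2, True), ("xxwiki", 3, False), ("xxwiki", 4, False),
    ("yywiki", 10, False), ("yywiki", 11, False), ("yywiki", 12, False),
]

summary = percent_reverted(toy_rows)
print(summary)  # {'xxwiki': (25.0, 4), 'yywiki': (0.0, 3)}
```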

In [239]:
pr_centered('Percent of Reverts by Patrollers on Potential Vandalism [All Reverts]', True)
display_h({
    'All patrollers': group_reverts_by_status(non_bot_reverts),
    'Patrollers with sysop rights': group_reverts_by_status(non_bot_reverts.query("""is_rv_user_sysop == True""")),
    'Patrollers with extended rights (excl. sysop)': group_reverts_by_status(non_bot_reverts.query(("""is_rv_user_sysop == False & rv_user_has_no_rights == False"""))),
    'Patrollers with no extended rights': group_reverts_by_status(non_bot_reverts.query(("""rv_user_has_no_rights == True""")))
})

**All patrollers**

| wiki_db | Percent of Reverts Reverted | # Reverts |
|:---|---:|---:|
| enwiki | 11.42 | 1771264 |
| eswiki | 13.09 | 336656 |
| itwiki | 17.62 | 264151 |
| ruwiki | 10.44 | 222711 |
| frwiki | 7.7 | 204423 |
| dewiki | 6.65 | 179266 |
| jawiki | 13.91 | 87588 |
| zhwiki | 13.83 | 77682 |
| fawiki | 9.85 | 76742 |
| idwiki | 14.81 | 37545 |

**Patrollers with sysop rights**

| wiki_db | Percent of Reverts Reverted | # Reverts |
|:---|---:|---:|
| enwiki | 10.78 | 298552 |
| itwiki | 17.62 | 176004 |
| frwiki | 6.28 | 71777 |
| eswiki | 14.29 | 62606 |
| ruwiki | 9.69 | 59035 |
| dewiki | 5.42 | 41938 |
| jawiki | 11.76 | 22140 |
| fawiki | 8.6 | 18719 |
| idwiki | 13.41 | 12935 |
| ptwiki | 7.54 | 12221 |

**Patrollers with extended rights (excl. sysop)**

| wiki_db | Percent of Reverts Reverted | # Reverts |
|:---|---:|---:|
| enwiki | 11.5 | 1452343 |
| eswiki | 12.48 | 201467 |
| ruwiki | 10.68 | 161501 |
| dewiki | 7.0 | 137141 |
| itwiki | 17.04 | 75021 |
| zhwiki | 14.19 | 68241 |
| jawiki | 14.12 | 62728 |
| fawiki | 10.07 | 52277 |
| frwiki | 7.84 | 52011 |
| ptwiki | 7.29 | 22082 |

**Patrollers with no extended rights**

| wiki_db | Percent of Reverts Reverted | # Reverts |
|:---|---:|---:|
| frwiki | 8.86 | 80635 |
| eswiki | 13.74 | 72583 |
| enwiki | 14.85 | 20369 |
| itwiki | 20.9 | 13126 |
| idwiki | 22.13 | 9369 |
| fawiki | 11.92 | 5746 |
| jawiki | 26.58 | 2720 |
| ptwiki | 13.9 | 2641 |
| ruwiki | 12.69 | 2175 |
| zhwiki | 27.23 | 573 |


In [244]:
pr_centered('Percent of Reverts by Patrollers on Potential Vandalism Reverted Back [Valid Reverts]', True)
display_h({
    'All patrollers': group_reverts_by_status(valid_non_bot_reverts),
    'Patrollers with sysop rights': group_reverts_by_status(valid_non_bot_reverts.query("""is_rv_user_sysop == True""")),
    'Patrollers with extended rights (excl. sysop)': group_reverts_by_status(valid_non_bot_reverts.query(("""is_rv_user_sysop == False & rv_user_has_no_rights == False"""))),
    'Patrollers with no extended rights': group_reverts_by_status(valid_non_bot_reverts.query(("""rv_user_has_no_rights == True""")))
})

**All patrollers**

| wiki_db | Percent of Reverts Reverted | # Reverts |
|:---|---:|---:|
| enwiki | 3.64 | 1628303 |
| eswiki | 5.25 | 308797 |
| itwiki | 3.77 | 226128 |
| ruwiki | 3.22 | 206095 |
| frwiki | 3.38 | 195289 |
| dewiki | 1.94 | 170661 |
| jawiki | 3.48 | 78123 |
| fawiki | 4.13 | 72164 |
| zhwiki | 4.87 | 70370 |
| ptwiki | 3.36 | 35227 |

**Patrollers with sysop rights**

| wiki_db | Percent of Reverts Reverted | # Reverts |
|:---|---:|---:|
| enwiki | 3.3 | 275442 |
| itwiki | 3.55 | 150322 |
| frwiki | 2.71 | 69143 |
| eswiki | 5.64 | 56867 |
| ruwiki | 3.31 | 55137 |
| dewiki | 1.5 | 40267 |
| jawiki | 3.06 | 20152 |
| fawiki | 3.47 | 17724 |
| idwiki | 5.79 | 11888 |
| ptwiki | 2.49 | 11587 |

**Patrollers with extended rights (excl. sysop)**

| wiki_db | Percent of Reverts Reverted | # Reverts |
|:---|---:|---:|
| enwiki | 3.68 | 1334362 |
| eswiki | 4.71 | 185024 |
| ruwiki | 3.18 | 148988 |
| dewiki | 2.07 | 130227 |
| itwiki | 3.69 | 64619 |
| zhwiki | 5.0 | 61640 |
| jawiki | 3.54 | 55845 |
| frwiki | 3.73 | 49787 |
| fawiki | 4.07 | 49007 |
| ptwiki | 3.24 | 21158 |

**Patrollers with no extended rights**

| wiki_db | Percent of Reverts Reverted | # Reverts |
|:---|---:|---:|
| frwiki | 3.76 | 76359 |
| eswiki | 6.43 | 66906 |
| enwiki | 6.24 | 18499 |
| itwiki | 7.19 | 11187 |
| idwiki | 11.56 | 8250 |
| fawiki | 6.85 | 5433 |
| ptwiki | 8.38 | 2482 |
| jawiki | 6.07 | 2126 |
| ruwiki | 3.6 | 1970 |
| zhwiki | 8.35 | 455 |


In [245]:
reverts_non_init_reverts = valid_non_bot_reverts.query("""is_init_rev_revert == False""")

pr_centered('Percent of Reverts by Patrollers on Potential Vandalism Reverted Back', True)
pr_centered('only reverts where the edit being reverted was not a revert', True)
display_h({
    'All patrollers': group_reverts_by_status(reverts_non_init_reverts),
    'Patrollers with sysop rights': group_reverts_by_status(reverts_non_init_reverts.query("""is_rv_user_sysop == True""")),
    'Patrollers with extended rights (excl. sysop)': group_reverts_by_status(reverts_non_init_reverts.query(("""is_rv_user_sysop == False & rv_user_has_no_rights == False"""))),
    'Patrollers with no extended rights': group_reverts_by_status(reverts_non_init_reverts.query(("""rv_user_has_no_rights == True""")))
})

**All patrollers**

| wiki_db | Percent of Reverts Reverted | # Reverts |
|:---|---:|---:|
| enwiki | 3.57 | 1571870 |
| eswiki | 5.22 | 298555 |
| itwiki | 3.73 | 215246 |
| ruwiki | 3.18 | 199622 |
| frwiki | 3.35 | 190627 |
| dewiki | 1.9 | 165715 |
| jawiki | 3.4 | 72919 |
| fawiki | 4.1 | 70285 |
| zhwiki | 4.79 | 67312 |
| ptwiki | 3.35 | 34278 |

**Patrollers with sysop rights**

| wiki_db | Percent of Reverts Reverted | # Reverts |
|:---|---:|---:|
| enwiki | 3.25 | 264030 |
| itwiki | 3.5 | 142245 |
| frwiki | 2.68 | 67460 |
| eswiki | 5.58 | 54591 |
| ruwiki | 3.28 | 53237 |
| dewiki | 1.48 | 39135 |
| jawiki | 3.09 | 18235 |
| fawiki | 3.46 | 17261 |
| idwiki | 5.69 | 11409 |
| ptwiki | 2.47 | 11209 |

**Patrollers with extended rights (excl. sysop)**

| wiki_db | Percent of Reverts Reverted | # Reverts |
|:---|---:|---:|
| enwiki | 3.61 | 1290003 |
| eswiki | 4.68 | 179311 |
| ruwiki | 3.15 | 144506 |
| dewiki | 2.02 | 126422 |
| itwiki | 3.65 | 62369 |
| zhwiki | 4.91 | 58981 |
| jawiki | 3.42 | 52786 |
| frwiki | 3.7 | 48748 |
| fawiki | 4.04 | 47830 |
| ptwiki | 3.26 | 20676 |

**Patrollers with no extended rights**

| wiki_db | Percent of Reverts Reverted | # Reverts |
|:---|---:|---:|
| frwiki | 3.72 | 74419 |
| eswiki | 6.41 | 64653 |
| enwiki | 6.03 | 17837 |
| itwiki | 7.18 | 10632 |
| idwiki | 11.85 | 7868 |
| fawiki | 6.72 | 5194 |
| ptwiki | 8.19 | 2393 |
| jawiki | 5.69 | 1898 |
| ruwiki | 2.82 | 1879 |
| zhwiki | 7.96 | 427 |
