# Baseline: Accuracy of Patroller Reverts (Probable Vandalism)

**Last updated on 9 January 2024**

[TASK: T353795](https://phabricator.wikimedia.org/T353795)

# Contents
1. [Summary](#Summary)
2. [Data Gathering](#Data-Gathering)
3. [Analysis](#Analysis)

# Summary

**Goal**
* The goal of this analysis is to inform whether excluding edits made by [extended confirmed users](https://en.wikipedia.org/wiki/Wikipedia:User_access_levels#Extended_confirmed_users) from the edits that [Automoderator](https://www.mediawiki.org/wiki/Moderator_Tools/Automoderator) acts on would be beneficial.
* To that end, it looks at the distribution of revert risk scores at various percentiles for extended confirmed and non-extended-confirmed users, and at the number of edits by extended confirmed users with a revert risk score greater than 0.97.

**Conclusion**
* Overall, very few edits by extended confirmed users had a revert risk score greater than 0.97.
    * ~1,800 edits in 2022 across enwiki, fawiki, jawiki, and zhwiki combined.
* The median revert risk score was ~0.2 for extended confirmed users, compared with ~0.75 for other users.
* When the revert risk score of an edit by an extended confirmed user was greater than 0.97, the edit was reverted in approximately 60% of cases.
* Given how few edits by extended confirmed users score above 0.97, excluding or not excluding them (as a global setting) is unlikely to have a significant impact on the accuracy of Automoderator.
----
Note: edits made by administrators and bots, self-reverts, and new page creations were excluded.
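
The "approximately 60%" figure can be reproduced from the revert-status counts reported in this notebook. A minimal sketch, with the counts typed in by hand rather than read from the `r97` dataframe:

```python
# Edits by extended confirmed users with risk > 0.97 in 2022, copied from the
# revert-status table in this notebook: wiki -> (not reverted, reverted).
revert_status = {
    "enwiki": (633, 822),
    "fawiki": (38, 40),
    "jawiki": (50, 98),
    "zhwiki": (40, 96),
}

total_edits = sum(kept + reverted for kept, reverted in revert_status.values())
total_reverted = sum(reverted for _, reverted in revert_status.values())
reverted_share = total_reverted / total_edits

print(f"{total_edits} edits, {reverted_share:.1%} reverted")
# -> 1817 edits, 58.1% reverted
```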

In [251]:
pr_centered('Distribution of Revert Risk for Extended Confirmed Users', True)
pr_centered('count = average number of edits per day above the threshold')
display_h({
    wiki: quantiles(rr_edits.query(f"""(wiki_db == '{wiki}') & (is_extendedconfirmed == True)"""), 'risk') for wiki in wikis
})

pr_centered('Distribution of Revert Risk for All Users (excl. Extended Confirmed)', True)
pr_centered('count = average number of edits per day above the threshold')
display_h({
    wiki: quantiles(rr_edits.query(f"""(wiki_db == '{wiki}') & (is_extendedconfirmed == False)"""), 'risk') for wiki in wikis
})

**Distribution of Revert Risk for Extended Confirmed Users**

count = average number of edits per day above the threshold

| percentile | enwiki risk | enwiki count | fawiki risk | fawiki count | jawiki risk | jawiki count | zhwiki risk | zhwiki count |
|---|---|---|---|---|---|---|---|---|
| 10th | 0.058 | 51570 | 0.066 | 1279 | 0.064 | 5708 | 0.057 | 5417 |
| 25th | 0.111 | 42975 | 0.123 | 1066 | 0.107 | 4756 | 0.102 | 4514 |
| 50th | 0.221 | 28650 | 0.264 | 711 | 0.192 | 3171 | 0.194 | 3009 |
| 75th | 0.378 | 14325 | 0.448 | 355 | 0.322 | 1585 | 0.324 | 1505 |
| 90th | 0.533 | 5730 | 0.611 | 142 | 0.468 | 634 | 0.464 | 602 |
| 99th | 0.801 | 573 | 0.872 | 14 | 0.757 | 63 | 0.74 | 60 |

**Distribution of Revert Risk for All Users (excl. Extended Confirmed)**

count = average number of edits per day above the threshold

| percentile | enwiki risk | enwiki count | fawiki risk | fawiki count | jawiki risk | jawiki count | zhwiki risk | zhwiki count |
|---|---|---|---|---|---|---|---|---|
| 10th | 0.489 | 27005 | 0.226 | 1142 | 0.369 | 4242 | 0.429 | 2482 |
| 25th | 0.697 | 22504 | 0.551 | 951 | 0.54 | 3535 | 0.614 | 2069 |
| 50th | 0.827 | 15003 | 0.814 | 634 | 0.703 | 2356 | 0.765 | 1379 |
| 75th | 0.906 | 7501 | 0.913 | 317 | 0.832 | 1178 | 0.868 | 690 |
| 90th | 0.952 | 3001 | 0.957 | 127 | 0.909 | 471 | 0.927 | 276 |
| 99th | 0.986 | 300 | 0.988 | 13 | 0.978 | 47 | 0.98 | 28 |


In [252]:
display_h({
    'Edits by Extended Confirmed Users (risk > 0.97)': r97_frequency,
    'Revert Status of Edits by Extended Confirmed Users (risk > 0.97)': r97_revert_status
})

**Edits by Extended Confirmed Users (risk > 0.97)**

| Wiki | Is Revert | # Edits |
|---|---|---|
| enwiki | False | 1205 |
| enwiki | True | 250 |
| fawiki | False | 76 |
| fawiki | True | 2 |
| jawiki | False | 120 |
| jawiki | True | 28 |
| zhwiki | False | 114 |
| zhwiki | True | 22 |

**Revert Status of Edits by Extended Confirmed Users (risk > 0.97)**

| Wiki | Was Reverted | # Edits |
|---|---|---|
| enwiki | False | 633 |
| enwiki | True | 822 |
| fawiki | False | 38 |
| fawiki | True | 40 |
| jawiki | False | 50 |
| jawiki | True | 98 |
| zhwiki | False | 40 |
| zhwiki | True | 96 |


# Data-Gathering

## Setup

In [10]:
import wmfdata as wmf
import pandas as pd
import numpy as np

import random
from datetime import datetime

from IPython.display import display_html, display, HTML, clear_output
import warnings

pd.options.display.max_columns = None
pd.options.display.max_rows = 250

In [4]:
spark_session = wmf.spark.get_active_session()

if spark_session is None:
    spark_session = wmf.spark.create_custom_session(
        master="yarn",
        app_name='rr-dist-extended-confirmed',
        spark_config={
            "spark.driver.memory": "4g",
            "spark.dynamicAllocation.maxExecutors": 64,
            "spark.executor.memory": "16g",
            "spark.executor.cores": 4,
            "spark.sql.shuffle.partitions": 256,
            "spark.driver.maxResultSize": "2g"
        }
    )

spark_session.sparkContext.setLogLevel("ERROR")

clear_output()

spark_session

## Functions

In [109]:
# prints a string at center of the output, bold if needed
def pr_centered(content, bold=False):
    if bold:
        content = f"<b>{content}</b>"
    
    centered_html = f"<div style='text-align:center'>{content}</div>"
    
    display(HTML(centered_html))


# display dataframes horizontally with title for each
def display_h(frames, space=100):
    html = ""
    
    for key in frames.keys():
        html_df =f'<div>{key} {frames[key]._repr_html_()}</div>'
        html += html_df
        
    html = f"""
    <div style="display:flex; justify-content: space-evenly;">
    {html}
    </div>"""
    
    display_html(html, raw=True)

In [249]:
# applies cell color to a given nth percentile
def style_percentile(i, percentile='50th'):
    return ['background-color: Aquamarine' if i.name == percentile else '' for _ in i]

# return quantiles for a given series (dataframe and column name)
def quantiles(frame, col='risk', style_median=False, return_counts=True):
    
    quantile_values = [0.1, 0.25, 0.5, 0.75, 0.9, 0.99]
    qdict = {f"{int(q * 100)}th": frame[col].quantile(q) for q in quantile_values}
    
    df = pd.DataFrame(list(qdict.items()), columns=['percentile', col])    
    
    if return_counts:
        # average number of edits per sampled day (30 dates) at or above each quantile value
        df['count'] = df[col].apply(lambda x: round(frame[frame[col] >= x].shape[0] / 30, 0))
        # cast only when the column exists, so return_counts=False does not raise a KeyError
        df['count'] = df['count'].astype(int)
    
    df[col] = round(df[col], 3)
    df.set_index('percentile', inplace=True)
    
    if style_median:
        return df.style.apply(style_percentile, axis=1).format("{:.1f}")
    return df
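
To make the `count` column concrete, here is a small standalone sketch on made-up scores. It assumes linear interpolation between order statistics (the default for pandas `Series.quantile`) and uses a hypothetical `n_days` parameter in place of the hard-coded 30; the helper names are illustrative and not part of the notebook.

```python
import math

def linear_quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def edits_per_day_above(values, q, n_days):
    """Average daily number of values at or above the q-th quantile."""
    threshold = linear_quantile(values, q)
    return sum(v >= threshold for v in values) / n_days

# made-up risk scores pooled over 2 sampled days (illustrative only)
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
median = linear_quantile(scores, 0.5)                     # about 0.55
per_day = edits_per_day_above(scores, 0.5, n_days=2)      # 5 scores / 2 days
```

Because the count is the number of edits at or above each quantile value divided by the number of sampled days, it shrinks roughly in proportion to `1 - q`, which is the pattern visible in the tables.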

## Query

In [5]:
# paths to pre-calculated revert risk scores
# generated by https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/mnz/examples/examples/notebooks/revertrisk_example.ipynb
rr_scores_path = '/user/paragon/riskobservatory/revertrisk_20212022_anonymous_bot.parquet'

rr_scores = spark_session.read.parquet(rr_scores_path)
rr_scores.createOrReplaceTempView('rr_scores')

rr_scores.printSchema()

                                                                                

root
 |-- rev_id: long (nullable = true)
 |-- wiki_db: string (nullable = true)
 |-- rev_timestamp: string (nullable = true)
 |-- revision_is_identity_reverted: boolean (nullable = true)
 |-- revision_seconds_to_identity_revert: long (nullable = true)
 |-- page_id: long (nullable = true)
 |-- revision_revert_risk: float (nullable = true)
 |-- user_is_anonymous: boolean (nullable = true)
 |-- user_is_bot: boolean (nullable = true)



In [12]:
wikis = ['enwiki', 'fawiki', 'jawiki', 'zhwiki']
wikis_sql = wmf.utils.sql_tuple(wikis)
mwh_snapshot = '2023-12'

In [11]:
# generate 30 random dates in a year

def generate_random_dates(year, num_dates):
    dates = []
    for _ in range(num_dates):
        month = random.randint(1, 12)
        if month in [1, 3, 5, 7, 8, 10, 12]:
            day = random.randint(1, 31)
        elif month == 2:
            day = random.randint(1, 28)
        else:
            day = random.randint(1, 30)
        
        date = datetime(year, month, day)
        dates.append(date.strftime("%Y-%m-%d"))
    
    return dates

random_dates_2022 = generate_random_dates(2022, 30)
random_dates_2022_sql = wmf.utils.sql_tuple(random_dates_2022)
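
Because each date above is drawn independently and without a seed, the same date can be picked more than once, in which case fewer than 30 distinct days are queried while the counts are still divided by 30, and the sample cannot be reproduced. A sketch of a seeded variant that samples distinct dates follows; `sample_distinct_dates` and the seed value are illustrative and were not used to produce the results in this notebook.

```python
import random
from datetime import date, timedelta

def sample_distinct_dates(year, num_dates, seed=None):
    """Return `num_dates` distinct ISO dates from `year`, sorted ascending."""
    first = date(year, 1, 1)
    n_days = (date(year + 1, 1, 1) - first).days   # 365 or 366
    rng = random.Random(seed)                      # local, seedable generator
    offsets = rng.sample(range(n_days), num_dates)  # sampling without replacement
    return sorted((first + timedelta(days=o)).isoformat() for o in offsets)

dates_2022 = sample_distinct_dates(2022, 30, seed=42)
```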

### distribution of revert risk scores

In [187]:
%%time

query = f"""
WITH 
    base AS (
        SELECT
            rr.wiki_db,
            rr.rev_id,
            revision_revert_risk AS risk,
            mwh.event_user_text,
            page_title,
            
            -- was original edit reverting another edit
            CASE
                WHEN revision_is_identity_revert THEN TRUE
                ELSE FALSE
            END AS is_revert,
            
            -- was this edit reverted
            CASE
                WHEN rr.revision_is_identity_reverted THEN TRUE
                ELSE FALSE
            END AS is_reverted,
            
            -- was the user extended confirmed
            CASE
                WHEN ARRAY_CONTAINS(event_user_groups, 'extendedconfirmed') THEN TRUE
                ELSE FALSE
            END AS is_extendedconfirmed,
            
            -- was the user anonymous
            CASE
                WHEN event_user_is_anonymous THEN TRUE
                ELSE FALSE
            END AS is_anon
        FROM 
            rr_scores rr
        JOIN 
            wmf.mediawiki_history mwh 
            ON rr.wiki_db = mwh.wiki_db 
                AND rr.rev_id = mwh.revision_id
        WHERE 
            snapshot = '{mwh_snapshot}'
            AND rr.wiki_db IN {wikis_sql}

            -- exclude page creations
            AND NOT mwh.revision_parent_id = 0

            -- exclude administrators
            AND 
                (
                    event_user_groups IS NULL
                    OR NOT ARRAY_CONTAINS(mwh.event_user_groups_historical, 'sysop') 
                )

            -- exclude bots
            AND SIZE(event_user_is_bot_by_historical) = 0        
            AND YEAR(event_timestamp) = 2022
            AND DATE(event_timestamp) IN {random_dates_2022_sql}
            AND page_namespace_is_content
    ),
    
    excl_self_reverts AS (
        SELECT
            b.*
        FROM
            base b
        JOIN 
            wmf.mediawiki_history mwh
            ON b.rev_id = mwh.revision_first_identity_reverting_revision_id 
                AND b.wiki_db = mwh.wiki_db
        WHERE
            snapshot = '{mwh_snapshot}'
            AND b.is_revert
            
            -- exclude self reverts
            AND NOT b.event_user_text = mwh.event_user_text
    )
    
SELECT
    *
FROM
    base
WHERE
    NOT is_revert
UNION ALL
SELECT
    *
FROM
    excl_self_reverts
"""

rr_edits = wmf.spark.run(query).drop_duplicates()
rr_edits.info()


<class 'pandas.core.frame.DataFrame'>
Index: 3294801 entries, 0 to 3493472
Data columns (total 9 columns):
 #   Column                Dtype  
---  ------                -----  
 0   wiki_db               object 
 1   rev_id                int64  
 2   risk                  float32
 3   event_user_text       object 
 4   page_title            object 
 5   is_revert             bool   
 6   is_reverted           bool   
 7   is_extendedconfirmed  bool   
 8   is_anon               bool   
dtypes: bool(4), float32(1), int64(1), object(3)
memory usage: 150.8+ MB
CPU times: user 59.9 s, sys: 7.68 s, total: 1min 7s
Wall time: 5min 43s
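
The self-revert filter in the query can be illustrated on a few made-up records. This sketch mirrors the logic of the `excl_self_reverts` CTE in plain Python under a simplifying assumption: an edit counts as a revert if some other record names it as its first reverting revision. The records and the helper name are hypothetical.

```python
# Each edit: rev_id, user, and the rev_id of the first revision that reverted it (or None).
edits = [
    {"rev_id": 1, "user": "A", "first_reverting_rev_id": 2},     # reverted by rev 2
    {"rev_id": 2, "user": "A", "first_reverting_rev_id": None},  # A reverts own edit
    {"rev_id": 3, "user": "B", "first_reverting_rev_id": 4},     # reverted by rev 4
    {"rev_id": 4, "user": "C", "first_reverting_rev_id": None},  # C reverts B's edit
    {"rev_id": 5, "user": "D", "first_reverting_rev_id": None},  # ordinary edit
]

def drop_self_reverts(edits):
    """Keep non-reverts, plus reverts that undo at least one other user's edit."""
    # reverting rev_id -> set of users whose edits that revision reverted
    reverted_users = {}
    for e in edits:
        reverting_id = e["first_reverting_rev_id"]
        if reverting_id is not None:
            reverted_users.setdefault(reverting_id, set()).add(e["user"])

    kept = []
    for e in edits:
        targets = reverted_users.get(e["rev_id"])
        if targets is None:              # not a revert: always kept
            kept.append(e["rev_id"])
        elif targets - {e["user"]}:      # revert of someone else's edit: kept
            kept.append(e["rev_id"])
    return kept

kept_ids = drop_self_reverts(edits)
print(kept_ids)
# -> [1, 3, 4, 5]
```

In the query, a revert that undoes several edits matches several rows in the join, which is why the result is passed through `drop_duplicates()`.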


### frequency of edits with risk greater than 0.97 for extended confirmed users

In [None]:
%%time

query = f"""
WITH 
    base AS (
        SELECT
            rr.wiki_db,
            rr.rev_id,
            event_timestamp AS rev_ts,
            page_title,
            revision_revert_risk AS risk,
            mwh.event_user_text,
            
            CASE
                WHEN revision_is_identity_revert THEN TRUE
                ELSE FALSE
            END AS is_revert,
            
            CASE
                WHEN rr.revision_is_identity_reverted THEN TRUE
                ELSE FALSE
            END AS is_reverted,
            
            CASE
                WHEN ARRAY_CONTAINS(event_user_groups, 'extendedconfirmed') THEN TRUE
                ELSE FALSE
            END AS is_extendedconfirmed
        FROM 
            rr_scores rr
        JOIN 
            wmf.mediawiki_history mwh
            ON rr.wiki_db = mwh.wiki_db AND rr.rev_id = mwh.revision_id
        WHERE 
            snapshot = '{mwh_snapshot}'
            AND rr.wiki_db IN {wikis_sql}

            -- exclude page creations
            AND NOT mwh.revision_parent_id = 0

            -- exclude administrators
            AND 
                (
                    event_user_groups IS NULL
                    OR NOT ARRAY_CONTAINS(mwh.event_user_groups_historical, 'sysop') 
                )

            -- exclude bots
            AND SIZE(event_user_is_bot_by_historical) = 0        
            AND YEAR(event_timestamp) = 2022
            AND page_namespace_is_content
            AND revision_revert_risk > 0.97
    ),
    
    excl_self_reverts AS (
        SELECT
            b.*
        FROM
            base b
        JOIN 
            wmf.mediawiki_history mwh
            ON b.rev_id = mwh.revision_first_identity_reverting_revision_id 
                AND b.wiki_db = mwh.wiki_db
        WHERE
            snapshot = '{mwh_snapshot}'
            AND b.is_revert
            
            -- exclude self reverts
            AND NOT b.event_user_text = mwh.event_user_text
    )
    
SELECT
    *
FROM
    base
WHERE
    NOT is_revert
    AND is_extendedconfirmed
UNION ALL
SELECT
    *
FROM
    excl_self_reverts
WHERE
    is_extendedconfirmed
"""

r97 = wmf.spark.run(query).drop_duplicates()
r97.info()


<class 'pandas.core.frame.DataFrame'>
Index: 1817 entries, 0 to 39005
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   wiki_db               1817 non-null   object 
 1   rev_id                1817 non-null   int64  
 2   rev_ts                1817 non-null   object 
 3   page_title            1817 non-null   object 
 4   risk                  1817 non-null   float32
 5   event_user_text       1817 non-null   object 
 6   is_revert             1817 non-null   bool   
 7   is_reverted           1817 non-null   bool   
 8   is_extendedconfirmed  1817 non-null   bool   
dtypes: bool(3), float32(1), int64(1), object(4)
memory usage: 97.6+ KB
CPU times: user 1.48 s, sys: 284 ms, total: 1.76 s
Wall time: 5min 1s


In [None]:
# r97.to_csv('ec_r97.tsv', sep='\t', index=False)

In [194]:
def rr_dist_wiki(wiki):
    
    pr_centered(f'{wiki} - Distribution of Revert Risk for Various User Groups', True)
    pr_centered('count indicates the average number of edits per day above the threshold')
    
    display_h({
        'anonymous users': quantiles(rr_edits.query(f"""(wiki_db == '{wiki}') & (is_anon == True)"""), 'risk'),
        'registered users (excl. extendedconfirmed)': quantiles(rr_edits.query(f"""(wiki_db == '{wiki}') & (is_extendedconfirmed == False) & (is_anon == False)"""), 'risk'),
        'extendedconfirmed users': quantiles(rr_edits.query(f"""(wiki_db == '{wiki}') & (is_extendedconfirmed == True)"""), 'risk'),
        'extendedconfirmed (edit is not a revert)': quantiles(rr_edits.query(f"""(wiki_db == '{wiki}') & (is_extendedconfirmed == True) & (is_revert == False)"""), 'risk')
    })

# Analysis

## Distribution of revert risk scores across various user groups

In [250]:
for wiki in wikis:
    rr_dist_wiki(wiki)

**enwiki - Distribution of Revert Risk for Various User Groups**

count indicates the average number of edits per day above the threshold

| percentile | anonymous users: risk | anonymous users: count | registered users (excl. extendedconfirmed): risk | registered users (excl. extendedconfirmed): count | extendedconfirmed users: risk | extendedconfirmed users: count | extendedconfirmed (edit is not a revert): risk | extendedconfirmed (edit is not a revert): count |
|---|---|---|---|---|---|---|---|---|
| 10th | 0.692 | 17293 | 0.293 | 9713 | 0.058 | 51570 | 0.056 | 47800 |
| 25th | 0.783 | 14410 | 0.486 | 8094 | 0.111 | 42975 | 0.104 | 39833 |
| 50th | 0.862 | 9607 | 0.691 | 5396 | 0.221 | 28650 | 0.206 | 26555 |
| 75th | 0.921 | 4804 | 0.841 | 2698 | 0.378 | 14325 | 0.357 | 13278 |
| 90th | 0.959 | 1921 | 0.925 | 1079 | 0.533 | 5730 | 0.518 | 5311 |
| 99th | 0.986 | 192 | 0.985 | 108 | 0.801 | 573 | 0.799 | 531 |

**fawiki - Distribution of Revert Risk for Various User Groups**

count indicates the average number of edits per day above the threshold

| percentile | anonymous users: risk | anonymous users: count | registered users (excl. extendedconfirmed): risk | registered users (excl. extendedconfirmed): count | extendedconfirmed users: risk | extendedconfirmed users: count | extendedconfirmed (edit is not a revert): risk | extendedconfirmed (edit is not a revert): count |
|---|---|---|---|---|---|---|---|---|
| 10th | 0.784 | 374 | 0.152 | 768 | 0.066 | 1279 | 0.064 | 1174 |
| 25th | 0.848 | 311 | 0.396 | 640 | 0.123 | 1066 | 0.115 | 978 |
| 50th | 0.906 | 208 | 0.686 | 427 | 0.264 | 711 | 0.244 | 652 |
| 75th | 0.948 | 104 | 0.862 | 213 | 0.448 | 355 | 0.432 | 326 |
| 90th | 0.971 | 42 | 0.937 | 85 | 0.611 | 142 | 0.608 | 130 |
| 99th | 0.992 | 4 | 0.984 | 9 | 0.872 | 14 | 0.876 | 13 |

**jawiki - Distribution of Revert Risk for Various User Groups**

count indicates the average number of edits per day above the threshold

| percentile | anonymous users: risk | anonymous users: count | registered users (excl. extendedconfirmed): risk | registered users (excl. extendedconfirmed): count | extendedconfirmed users: risk | extendedconfirmed users: count | extendedconfirmed (edit is not a revert): risk | extendedconfirmed (edit is not a revert): count |
|---|---|---|---|---|---|---|---|---|
| 10th | 0.503 | 2702 | 0.257 | 1540 | 0.064 | 5708 | 0.063 | 5513 |
| 25th | 0.621 | 2251 | 0.389 | 1283 | 0.107 | 4756 | 0.104 | 4594 |
| 50th | 0.749 | 1501 | 0.578 | 855 | 0.192 | 3171 | 0.186 | 3063 |
| 75th | 0.855 | 750 | 0.758 | 428 | 0.322 | 1585 | 0.315 | 1531 |
| 90th | 0.918 | 300 | 0.876 | 171 | 0.468 | 634 | 0.463 | 613 |
| 99th | 0.978 | 30 | 0.978 | 17 | 0.757 | 63 | 0.751 | 61 |

**zhwiki - Distribution of Revert Risk for Various User Groups**

count indicates the average number of edits per day above the threshold

| percentile | anonymous users: risk | anonymous users: count | registered users (excl. extendedconfirmed): risk | registered users (excl. extendedconfirmed): count | extendedconfirmed users: risk | extendedconfirmed users: count | extendedconfirmed (edit is not a revert): risk | extendedconfirmed (edit is not a revert): count |
|---|---|---|---|---|---|---|---|---|
| 10th | 0.574 | 1655 | 0.284 | 827 | 0.057 | 5417 | 0.056 | 5223 |
| 25th | 0.695 | 1379 | 0.439 | 689 | 0.102 | 4514 | 0.1 | 4353 |
| 50th | 0.803 | 920 | 0.629 | 460 | 0.194 | 3009 | 0.188 | 2902 |
| 75th | 0.886 | 460 | 0.793 | 230 | 0.324 | 1505 | 0.314 | 1451 |
| 90th | 0.935 | 184 | 0.892 | 92 | 0.464 | 602 | 0.456 | 580 |
| 99th | 0.981 | 18 | 0.976 | 9 | 0.74 | 60 | 0.737 | 58 |

## Frequency of edits with revert risk score greater than 0.97 for extended confirmed users

In [224]:
r97_frequency = (
    r97.groupby(['wiki_db', 'is_revert'])
    .agg({'rev_id': 'count'})
    .rename({'rev_id': '# Edits'}, axis=1)
)
r97_frequency.index.names = ['Wiki', 'Is Revert']

pr_centered('Edits by Extended Confirmed Users (risk > 0.97)', True)
display_h({'': r97_frequency})

**Edits by Extended Confirmed Users (risk > 0.97)**

| Wiki | Is Revert | # Edits |
|---|---|---|
| enwiki | False | 1205 |
| enwiki | True | 250 |
| fawiki | False | 76 |
| fawiki | True | 2 |
| jawiki | False | 120 |
| jawiki | True | 28 |
| zhwiki | False | 114 |
| zhwiki | True | 22 |


In [241]:
r97_revert_status = (
    r97
    .groupby(['wiki_db', 'is_reverted'])
    .agg({'rev_id': 'count'})
    .reset_index()
    .sort_values(['wiki_db', 'is_reverted'])
    .set_index(['wiki_db', 'is_reverted'])
    .rename({'rev_id': '# Edits'}, axis=1)
)
r97_revert_status.index.names = ['Wiki', 'Was Reverted']

pr_centered('Revert Status of Edits by Extended Confirmed Users (risk > 0.97)', True)
display_h({'': r97_revert_status})

**Revert Status of Edits by Extended Confirmed Users (risk > 0.97)**

| Wiki | Was Reverted | # Edits |
|---|---|---|
| enwiki | False | 633 |
| enwiki | True | 822 |
| fawiki | False | 38 |
| fawiki | True | 40 |
| jawiki | False | 50 |
| jawiki | True | 98 |
| zhwiki | False | 40 |
| zhwiki | True | 96 |


## Quality check
Because the number of edits by extended confirmed users with a revert risk score greater than 0.97 is quite low, it is worth running a quality check on the results.<br>The check below uses only simple filters, leaving out the more complex ones such as the self-revert exclusion, so its counts should be comparable to those above.

In [221]:
qa_query = f"""
SELECT
    rr.wiki_db,
    COUNT(DISTINCT rr.rev_id) AS n_edits
FROM 
    rr_scores rr
JOIN 
    wmf.mediawiki_history mwh
    ON rr.wiki_db = mwh.wiki_db 
        AND rr.rev_id = mwh.revision_id
WHERE 
    snapshot = '{mwh_snapshot}'
    AND rr.wiki_db IN {wikis_sql}
    
    -- exclude page creations
    
    AND NOT mwh.revision_parent_id = 0
    AND SIZE(event_user_is_bot_by_historical) = 0        
    AND YEAR(event_timestamp) = 2022
    AND page_namespace_is_content
    AND revision_revert_risk > 0.97
    AND ARRAY_CONTAINS(event_user_groups, 'extendedconfirmed')
GROUP BY
    rr.wiki_db
"""

qa = wmf.spark.run(qa_query).set_index('wiki_db')
qa

                                                                                

| wiki_db | n_edits |
|---|---|
| jawiki | 176 |
| zhwiki | 140 |
| enwiki | 1518 |
| fawiki | 81 |


The counts are close to those obtained above with the full set of filters: 1,915 edits here versus 1,817 after additionally excluding administrators and self-reverts.
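
The comparison can be made explicit per wiki. A minimal sketch, with both sets of counts copied by hand from the tables in this notebook:

```python
# Extended confirmed users, risk > 0.97, 2022.
# filtered: full query (is_revert False + True); qa: simplified quality-check query.
filtered = {
    "enwiki": 1205 + 250,
    "fawiki": 76 + 2,
    "jawiki": 120 + 28,
    "zhwiki": 114 + 22,
}
qa = {"enwiki": 1518, "fawiki": 81, "jawiki": 176, "zhwiki": 140}

for wiki in filtered:
    diff = qa[wiki] - filtered[wiki]
    print(f"{wiki}: filtered={filtered[wiki]}, qa={qa[wiki]}, "
          f"diff={diff} ({diff / qa[wiki]:.1%} of qa)")

total_filtered = sum(filtered.values())
total_qa = sum(qa.values())
```

The simplified query returns at least as many edits as the full query on every wiki, which is the expected direction since it applies fewer exclusions.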