# Baseline: Content Moderation Backlog (Flagged Revisions)

**Last updated on 5 January 2024**

[TASK: T348863](https://phabricator.wikimedia.org/T348863)

# Contents
1. [Summary](#Summary)
2. [Data Gathering](#Data-Gathering)
3. [Analysis](#Analysis)

## Summary

The following analysis is to determine a baseline content moderation backlogs, specifically flagged revisions (with a [follow-up analysis](https://github.com/wikimedia-research/automoderator-measurement/blob/main/baselines/T348863_content_moderation_backlogs_rchanges.ipynb) on recent changes patrolling). The baseline will be used as a reference for evaluating the impact of [Automoderator](https://www.mediawiki.org/wiki/Moderator_Tools/Automoderator) later. The [operational definitions](https://phabricator.wikimedia.org/T349083) within the scope of Automoderator are the following:

<u>probable vandalism:</u>
- edit belongs to the content namespace
- edit was reverted within 12 hours
- user is anonymous OR if registered
    - user edit count is less than 15 edits
    - time since user's first edit is less than 48 hours
- revert was made by a different editor

<u>patroller:</u>
- user's having user groups with any of the following permissions on the respective wikis: rollback, review, patrol, block, delete, deleterevision
- OR registered user who have made 150+ content namespace edits and 10+ content namespace reverts<br>(note: for this analysis, we have considered registered users with 150+ edits)

In [164]:
pr_centered('Median Time for a Flagged Revision to be Reviewed', True)
display_h({wiki:quantiles(flagged_revs.query(f"""wiki_db == '{wiki}'"""), style_median=True) for wiki in wikis_list})

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,2.0,0.0
25th,56.0,0.9
50th,790.0,13.2
75th,2637.0,44.0
90th,8541.0,142.4
99th,25873.0,431.2

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,10.0,0.2
25th,346.0,5.8
50th,13358.0,222.6
75th,134967.0,2249.5
90th,1827769.0,30462.8
99th,5338227.0,88970.5

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,0.0,0.0
25th,44.0,0.7
50th,13049.0,217.5
75th,170257.0,2837.6
90th,4034150.0,67235.8
99th,18728238.0,312137.3

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,15.0,0.2
25th,281.0,4.7
50th,7272.0,121.2
75th,35097.0,585.0
90th,825148.0,13752.5
99th,6775111.0,112918.5


In [160]:
display_h({
    'Average Monthly Unique Reviewers Reviewing Flagged Revs (2023)': avg_monthly_fr_reviewers,
    'Median Number of Reviews by Each Reviewer (2023)': reviews_per_reviewer
})

Unnamed: 0_level_0,# Unique Reviewers
wiki_db,Unnamed: 1_level_1
dewiki,2455
enwiki,231
idwiki,21
ruwiki,737

Unnamed: 0_level_0,# Reviews
wiki_db,Unnamed: 1_level_1
dewiki,7
enwiki,3
idwiki,5
ruwiki,32


In [165]:
pr_centered('Number of Reviews by Each Reviewer by Edit bucket', True)
display_h({
    '': reviews_per_reviewer_by_bucket
})

Unnamed: 0_level_0,Unnamed: 1_level_0,# Unique Reviewers,n_edits,# Reviews per Reviewer
wiki_db,Reviewer Edit Bucket,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
dewiki,0-99,1,1,1
dewiki,100-999,1526,20133,13
dewiki,1000-4999,1948,61347,31
dewiki,5000+,2359,396962,168
enwiki,100-999,16,249,16
enwiki,1000-4999,154,2741,18
enwiki,5000+,758,18967,25
idwiki,1000-4999,14,190,14
idwiki,5000+,57,2285,40
ruwiki,0-99,5,41,8


# Data-Gathering

## Imports

In [1]:
import pandas as pd
import numpy as np
import wmfdata as wmf

pd.options.display.max_columns = None
pd.options.display.max_rows = 250

from IPython.display import display_html
from IPython.display import display, HTML
from IPython.display import clear_output

import os
import requests
import warnings

## spark_session

In [2]:
# spark_session = wmf.spark.get_active_session()

# if type(spark_session) != type(None):
#     spark_session.stop()
# else:
#     print('no active session')

no active session


In [None]:
# spark_session = wmf.spark.create_custom_session(
#     master="yarn",
#     app_name='content-moderation-backlogs',
#     spark_config={
#         "spark.driver.memory": "6g",
#         "spark.dynamicAllocation.maxExecutors": 64,
#         "spark.executor.memory": "16g",
#         "spark.executor.cores": 4,
#         "spark.sql.shuffle.partitions": 256,
#         "spark.driver.maxResultSize": "2g"
        
#     }
# )

# clear_output()

# spark_session.sparkContext.setLogLevel("ERROR")
# spark_session

SPARK_HOME: /usr/lib/spark3
Using Hadoop client lib jars at 3.2.0, provided by Spark.
PYSPARK_PYTHON=/opt/conda-analytics/bin/python3


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/02/15 07:50:08 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
24/02/15 07:50:09 WARN Utils: Service 'sparkDriver' could not bind on port 12000. Attempting port 12001.
24/02/15 07:50:09 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/02/15 07:50:17 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13000. Attempting port 13001.
24/02/15 07:50:18 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!


## functions

In [13]:
# prints a string at center of the output, bold if needed
def pr_centered(content, bold=False):
    if bold:
        content = f"<b>{content}</b>"
    
    centered_html = f"<div style='text-align:center'>{content}</div>"
    
    display(HTML(centered_html))


# display dataframes horizontally with title for each
def display_h(frames, space=100):
    html = ""
    
    for key in frames.keys():
        html_df =f'<div>{key} {frames[key]._repr_html_()}</div>'
        html += html_df
        
    html = f"""
    <div style="display:flex; justify-content: space-evenly;">
    {html}
    </div>"""
    
    display_html(html, raw=True)
    
# applies cell color to a given nth percentile
def style_percentile(i, percentile='50th'):
    return ['background-color: Aquamarine' if i.name == percentile else '' for _ in i]

In [16]:
# return quatiles for a given series (dataframe and column name)
def quantiles(frame, col='diff_sec', style_median=False):    
    qdict = {
        '10th': frame[col].quantile(0.1),
        '25th': frame[col].quantile(0.25),
        '50th': frame[col].quantile(0.5),
        '75th': frame[col].quantile(0.7),
        '90th': frame[col].quantile(0.9),
        '99th': frame[col].quantile(0.99)
    }
    
    df = pd.DataFrame(qdict.values(),
                      index=qdict.keys(),
                      columns=['seconds'])
    
    df['minutes'] = round(df['seconds'] / 60, 2)
    
    df = df.astype({'seconds': int})
    df.index.name = 'percentile'
    
    if style_median:
        df = df.style.apply(style_percentile, axis=1).format("{:.1f}")
        # df = df.astype({'seconds': int})
        return df
    else:
        return df

## query: flagged revisions

In [2]:
# mwh_snapshot = '2024-01'

lang_list = ['en', 'es', 'ja', 'de', 'fr', 'ru', 'zh', 'it', 'pt', 'fa', 'id']

# the following languages do not have FlaggedRevisions enabled
exclude_langs = ['es', 'ja', 'fr', 'zh', 'it', 'pt', 'fa']

wikis_list = [f'{lang}wiki' for lang in lang_list if lang not in exclude_langs]
wikis_sql = wmf.utils.sql_tuple(wikis_list)

In [10]:
%%time

warnings.filterwarnings('ignore')

flagged_revs = pd.DataFrame()

for wiki in wikis_list:
    
    fr_query = """
    SELECT 
        fr_rev_id AS rev_id,
        page_namespace,
        fr_timestamp AS review_ts,
        MONTH(fr_timestamp) AS review_month,
        fr_rev_timestamp AS rev_ts,
        CASE
            WHEN user_editcount < 100 THEN '0-99'
            WHEN user_editcount BETWEEN 100 AND 999 THEN '100-999'
            WHEN user_editcount BETWEEN 1000 AND 4999 THEN '1000-4999'
            WHEN user_editcount >= 5000 THEN '5000+'
        END AS reviewer_edit_bucket,
        user_id AS reviewer_id    
    FROM 
        flaggedrevs fr
    JOIN 
        user u 
        ON fr.fr_user = u.user_id
    JOIN
        page p
        ON fr.fr_page_id = p.page_id
    WHERE
        fr_flags NOT LIKE '%auto%'
        AND user_name NOT LIKE '%bot%'
        AND YEAR(fr_rev_timestamp) = 2023
    ORDER BY 
        fr_timestamp DESC
    """

    flagged_revs_by_wiki = wmf.mariadb.run(fr_query, dbs=wiki)

    flagged_revs_by_wiki = (
        flagged_revs_by_wiki
        .assign(
            review_ts=pd.to_datetime(flagged_revs_by_wiki['review_ts']),
            rev_ts=pd.to_datetime(flagged_revs_by_wiki['rev_ts']),
            reviewer_edit_bucket=pd.Categorical(flagged_revs_by_wiki['reviewer_edit_bucket'])
        )
    )
    
    flagged_revs_by_wiki = (
        flagged_revs_by_wiki
        .assign(
            diff_sec=round((flagged_revs_by_wiki['review_ts'] - flagged_revs_by_wiki['rev_ts']) / np.timedelta64(1, 's')),
            diff_min=round((flagged_revs_by_wiki['review_ts'] - flagged_revs_by_wiki['rev_ts']) / np.timedelta64(1, 'm'), 2),
            wiki_db=wiki
        )
    )
    
    flagged_revs = pd.concat([flagged_revs, flagged_revs_by_wiki], ignore_index=True)
    
flagged_revs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920200 entries, 0 to 920199
Data columns (total 10 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   rev_id                920200 non-null  int64         
 1   page_namespace        920200 non-null  int64         
 2   review_ts             920200 non-null  datetime64[ns]
 3   review_month          920200 non-null  int64         
 4   rev_ts                920200 non-null  datetime64[ns]
 5   reviewer_edit_bucket  920200 non-null  object        
 6   reviewer_id           920200 non-null  int64         
 7   diff_sec              920200 non-null  float64       
 8   diff_min              920200 non-null  float64       
 9   wiki_db               920200 non-null  object        
dtypes: datetime64[ns](2), float64(2), int64(4), object(2)
memory usage: 70.2+ MB
CPU times: user 11.1 s, sys: 554 ms, total: 11.7 s
Wall time: 12min 51s


In [9]:
wmf.mariadb.run("""SELECT * FROM flaggedpage_pending""", 'dewiki')

  result = pd.read_sql_query(


Unnamed: 0,fpp_page_id,fpp_quality,fpp_rev_id,fpp_pending_since
0,53,0,241916960,20240207183913
1,236,0,235215414,20240125090150
2,875,0,234791500,20240208143204
3,968,0,241831210,20240214140741
4,1186,0,240135266,20240213085358
...,...,...,...,...
7660,12992704,0,242147287,20240214083850
7661,12993565,0,242201713,20240214223845
7662,12993703,0,242205839,20240215074746
7663,12994074,0,242208690,20240215075849


In [17]:
avg_monthly_fr_reviewers = (
    flagged_revs
    .groupby(['wiki_db', 'review_month'])['reviewer_id']
    .nunique()
    .reset_index()
    .groupby('wiki_db')
    .reviewer_id
    .mean()
    .reset_index()
    .set_index('wiki_db')
    .astype(int)
    .rename({
        'reviewer_id': '# Unique Reviewers'
    }, axis=1)
)

reviews_per_reviewer = (
    flagged_revs
    .groupby(['wiki_db', 'reviewer_id'])['rev_id']
    .nunique()
    .reset_index()
    .groupby('wiki_db')['rev_id']
    .median()
    .reset_index()
    .set_index('wiki_db')
    .astype(int)
    .rename({
        'rev_id': '# Reviews'
    }, axis=1)
)    

reviews_per_reviewer_by_bucket = (
    pd.merge(
        flagged_revs
        .groupby(['wiki_db', 'reviewer_edit_bucket'])['reviewer_id']
        .nunique()
        .reset_index()
        .rename({ 
            'reviewer_id': 'n_unique_reviewers' 
        }, axis=1),
        flagged_revs
        .groupby(['wiki_db', 'reviewer_edit_bucket'])['rev_id']
        .nunique()
        .reset_index()
        .rename({
            'rev_id': 'n_edits' 
        }, axis=1),
        on=['wiki_db', 'reviewer_edit_bucket'])
)

reviews_per_reviewer_by_bucket['edits_per_reviewer'] = round(reviews_per_reviewer_by_bucket['n_edits'] / reviews_per_reviewer_by_bucket['n_unique_reviewers']).astype(int)

rename_cols = {
    'reviewer_edit_bucket': 'Reviewer Edit Bucket',
    'n_unique_reviewers': '# Unique Reviewers',
    'edits_per_reviewer': '# Reviews per Reviewer'
}

reviews_per_reviewer_by_bucket = (
    reviews_per_reviewer_by_bucket
    .rename(rename_cols, axis=1)
    .set_index(['wiki_db', 'Reviewer Edit Bucket'], verify_integrity=True)
)

In [18]:
pr_centered('Median Time for a Flagged Revision to be Reviewed', True)
display_h({wiki:quantiles(flagged_revs.query(f"""wiki_db == '{wiki}'"""), style_median=True) for wiki in wikis_list})

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,2.0,0.0
25th,57.0,0.9
50th,791.0,13.2
75th,2629.0,43.8
90th,8509.0,141.8
99th,25803.0,430.1

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,10.0,0.2
25th,363.0,6.0
50th,14215.0,236.9
75th,149781.0,2496.3
90th,1818238.0,30304.0
99th,5405532.0,90092.2

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,0.0,0.0
25th,66.0,1.1
50th,16477.0,274.6
75th,243634.0,4060.6
90th,5097042.0,84950.7
99th,21312476.0,355208.0

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,15.0,0.2
25th,312.0,5.2
50th,7547.0,125.8
75th,36976.0,616.3
90th,1016766.0,16946.1
99th,7996650.0,133277.5


In [19]:
display_h({
    'Average Monthly Unique Reviewers Reviewing Flagged Revs (2023)': avg_monthly_fr_reviewers,
    'Median Number of Reviews by Each Reviewer (2023)': reviews_per_reviewer
})

Unnamed: 0_level_0,# Unique Reviewers
wiki_db,Unnamed: 1_level_1
dewiki,2472
enwiki,231
idwiki,22
ruwiki,752

Unnamed: 0_level_0,# Reviews
wiki_db,Unnamed: 1_level_1
dewiki,7
enwiki,3
idwiki,5
ruwiki,33


In [20]:
pr_centered('Number of Reviews by Each Reviewer by Edit bucket', True)
display_h({
    '': reviews_per_reviewer_by_bucket
})

Unnamed: 0_level_0,Unnamed: 1_level_0,# Unique Reviewers,n_edits,# Reviews per Reviewer
wiki_db,Reviewer Edit Bucket,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
dewiki,0-99,1,1,1
dewiki,100-999,1513,18448,12
dewiki,1000-4999,1980,65021,33
dewiki,5000+,2381,404305,170
enwiki,100-999,14,230,16
enwiki,1000-4999,152,2752,18
enwiki,5000+,764,19151,25
idwiki,1000-4999,14,204,15
idwiki,5000+,57,2324,41
ruwiki,0-99,6,42,7
