# Baseline: Median Time to Revert (Probable Vandalism)

**Last updated on 16 December 2023**


[TASK: T348860](https://phabricator.wikimedia.org/T348860)<br>
 ➤ ➤ [View the notebook on nbviewer](https://nbviewer.org/github/wikimedia-research/automoderator-measurement/blob/main/baselines/T348860_median_time_to_revert.ipynb)

# Contents
1. [Summary](#Summary)
2. [Data Gathering](#Data-Gathering)
3. [Analysis](#Analysis)
    * [Median Time to Revert, by Wikipedia](#Median-Time-to-Revert)
    * [Time to Revert Percentiles, by Wikipedia](#Time-to-Revert-Percentiles)

## Summary

The following analysis is to determine a baseline for Median Time to Revert for Probable Vandalism for Wikipedias in consideration. The baseline will be used as a reference for evaluating the impact of Automoderator later. The [operational definition](https://phabricator.wikimedia.org/T349083) for probable vandalism (within the scope of [Automoderator](https://www.mediawiki.org/wiki/Moderator_Tools/Automoderator)) is as the following:
- edit belongs to the content namespace
- edit was reverted within 12 hours
- user is anonymous OR if registered
    - user edit count is less than 15 edits
    - time since user's first edit is less than 48 hours
- revert was made by a different editor

## Baseline: Median Time to Revert for Probable Vandalism

In [363]:
pr_centered('Median Time to Revert for Probable Vandalism', True)
display_h({
    '': median_ttr_by_wiki
})

Unnamed: 0_level_0,seconds,minutes
Wikipedia,Unnamed: 1_level_1,Unnamed: 2_level_1
German Wikipedia,186,3.1
English Wikipedia,569,9.48
Spanish Wikipedia,60,1.0
Persian Wikipedia,198,3.31
French Wikipedia,435,7.25
Indonesian Wikipedia,3085,51.42
Italian Wikipedia,358,5.97
Japanese Wikipedia,1139,18.98
Portuguese Wikipedia,1023,17.06
Russian Wikipedia,777,12.95


# Data-Gathering

## Imports

In [2]:
import pandas as pd
import numpy as np
import wmfdata as wmf

from datetime import timedelta, datetime

pd.options.display.max_columns = None
from IPython.display import clear_output

from IPython.display import display_html
from IPython.display import display, HTML
from IPython.display import clear_output

import warnings

In [4]:
spark_session = wmf.spark.get_active_session()

if type(spark_session) != type(None):
    spark_session.stop()
else:
    print('no active session')

no active session


## spark_session

In [5]:
spark_session = wmf.spark.create_custom_session(
    master="yarn",
    app_name='vandalism-time-to-revert',
    spark_config={
        "spark.driver.memory": "6g",
        "spark.dynamicAllocation.maxExecutors": 64,
        "spark.executor.memory": "24g",
        "spark.executor.cores": 4,
        "spark.sql.shuffle.partitions": 256,
        "spark.driver.maxResultSize": "2g"
        
    }
)

clear_output()

spark_session.sparkContext.setLogLevel("ERROR")
spark_session

## functions

In [10]:
# prints a string at center of the output, bold if needed
def pr_centered(content, bold=False):
    if bold:
        content = f"<b>{content}</b>"
    
    centered_html = f"<div style='text-align:center'>{content}</div>"
    
    display(HTML(centered_html))


# display dataframes horizontally with title for each
def display_h(frames, space=100):
    html = ""
    
    for key in frames.keys():
        html_df =f'<div>{key} {frames[key]._repr_html_()}</div>'
        html += html_df
        
    html = f"""
    <div style="display:flex; justify-content: space-evenly;">
    {html}
    </div>"""
    
    display_html(html, raw=True)

In [11]:
# calculate time difference in seconds between two columns
# note: the columns should be datetime formatted
def time_delta(df, start_column, end_column):
    try: 
        return df.apply(lambda row: (row[end_column] - row[start_column]).total_seconds(), axis=1)
    except:
        return np.NaN

# applies cell color to a given nth percentile
def style_percentile(i, percentile='50th'):
    return ['background-color: Aquamarine' if i.name == percentile else '' for _ in i]

# return quatiles for a given series (dataframe and column name)
def quantiles(frame, col='time_to_revert', style_median=False):    
    qdict = {
        '10th': frame[col].quantile(0.1),
        '25th': frame[col].quantile(0.25),
        '50th': frame[col].quantile(0.5),
        '75th': frame[col].quantile(0.7),
        '90th': frame[col].quantile(0.9),
        '99th': frame[col].quantile(0.99)
    }
    
    df = pd.DataFrame(qdict.values(),
                      index=qdict.keys(),
                      columns=['seconds'])
    
    df['minutes'] = round(df['seconds'] / 60, 2)
    
    df = df.astype({'seconds': int})
    df.index.name = 'percentile'
    
    if style_median:
        df = df.style.apply(style_percentile, axis=1).format("{:.1f}")
        # df = df.astype({'seconds': int})
        return df
    else:
        return df

In [12]:
def split_into_groups(dfs, group_size=4):
    return [dfs[i:i + group_size] for i in range(0, len(dfs), group_size)]

## query

In [7]:
mwh_snapshot = '2023-11'

wikis_list = [f'{lang}wiki' for lang in ['en', 'es', 'ja', 'de', 'fr', 'ru', 'zh', 'it', 'pt', 'fa', 'id']]
wikis_sql = wmf.utils.sql_tuple(wikis_list)

In [8]:
%%time

query = f"""
WITH 
    base AS (
        SELECT
            wiki_db,
            event_user_text AS user_name,
            event_user_is_anonymous AS is_anon,
            revision_seconds_to_identity_revert AS time_to_revert,
            revision_first_identity_reverting_revision_id AS reverting_edit_id,
            event_timestamp,
            event_user_first_edit_timestamp
        FROM 
            wmf.mediawiki_history
        WHERE 
            snapshot = '{mwh_snapshot}'
            AND wiki_db IN {wikis_sql}
            AND event_entity = 'revision'
            AND event_type = 'create'
            AND page_namespace_is_content
            AND 
                (
                    event_user_is_anonymous 
                    OR event_user_revision_count <= 15
                )
            AND SIZE(event_user_is_bot_by_historical) = 0
            AND revision_is_identity_reverted
            AND revision_seconds_to_identity_revert <= 12*60*60
            AND revision_seconds_to_identity_revert >= 0
            AND YEAR(event_timestamp) = 2022
    )
SELECT
    base.*
FROM 
    base
JOIN
    wmf.mediawiki_history mwh
    ON base.wiki_db = mwh.wiki_db 
        AND base.reverting_edit_id = mwh.revision_id
WHERE
    snapshot = '{mwh_snapshot}'
    AND NOT base.user_name = mwh.event_user_text
"""

vandal_edits = wmf.spark.run(query).drop(['user_name', 'reverting_edit_id'], axis=1)
vandal_edits_df1 = vandal_edits.copy()

                                                                                

CPU times: user 22.3 s, sys: 2.37 s, total: 24.7 s
Wall time: 3min 47s


In [24]:
vandal_edits = vandal_edits_base.copy()

In [35]:
vandal_edits = (
    vandal_edits
    .assign(
        event_timestamp=pd.to_datetime(vandal_edits['event_timestamp'], utc=True),
        event_user_first_edit_timestamp=pd.to_datetime(vandal_edits['event_user_first_edit_timestamp'], utc=True),
        is_anon=pd.Categorical(vandal_edits['is_anon'])
    )
    .assign(
        elapsed_user_first_rev=time_delta(vandal_edits, 'event_user_first_edit_timestamp', 'event_timestamp')
    )
    .query("(elapsed_user_first_rev <= 48*60*60) | (is_anon == True)")
    .drop(['event_timestamp', 'event_user_first_edit_timestamp'], axis=1)
    .reset_index(drop=True)
)

vandal_edits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3771812 entries, 0 to 3771811
Data columns (total 5 columns):
 #   Column                  Dtype   
---  ------                  -----   
 0   wiki_db                 object  
 1   risk                    float32 
 2   is_anon                 category
 3   time_to_revert          int64   
 4   elapsed_user_first_rev  float64 
dtypes: category(1), float32(1), float64(1), int64(1), object(1)
memory usage: 104.3+ MB


In [46]:
for k, v in dict(vandal_edits.is_anon.value_counts()).items():
    assert v > 0, f'is_anon=={k} has {v} records'

In [14]:
db_names = (
    pd
    .read_csv('https://raw.githubusercontent.com/wikimedia-research/canonical-data/master/wiki/wikis.tsv', sep='\t')
    .query("""database_code == @wikis_list""")[['database_code', 'english_name']]
    .set_index('database_code')['english_name']
    .to_dict()
)

all_dbs_ttr = {db_names[db]: quantiles(vandal_edits.query(f"wiki_db == '{db}'"), style_median=True) for db in vandal_edits.wiki_db.unique()}

# Analysis

## Median-Time-to-Revert

In [47]:
median_ttr_by_wiki = (
    vandal_edits
    .groupby('wiki_db')['time_to_revert']
    .median()
    .reset_index(name='seconds')
    .assign(
        **{
            'minutes': lambda x: round(x['seconds'] / 60, 2),
            'Wikipedia': lambda x: x['wiki_db'].map(db_names)
        }
    )
    .astype({'seconds': int})
    .set_index('Wikipedia')
    .drop('wiki_db', axis=1)
)

In [48]:
pr_centered('Median Time to Revert (by Wikipedia)', True)
display_h({
    '': median_ttr_by_wiki
})

Unnamed: 0_level_0,seconds,minutes
Wikipedia,Unnamed: 1_level_1,Unnamed: 2_level_1
German Wikipedia,186,3.1
English Wikipedia,569,9.48
Spanish Wikipedia,60,1.0
Persian Wikipedia,198,3.31
French Wikipedia,435,7.25
Indonesian Wikipedia,3085,51.42
Italian Wikipedia,358,5.97
Japanese Wikipedia,1139,18.98
Portuguese Wikipedia,1023,17.06
Russian Wikipedia,777,12.95


## Time-to-Revert-Percentiles

In [49]:
pr_centered('<big>Time to Revert (by Wikipedia)<big>', True)
pr_centered('<u>coloured cell -> median</u>')

for group in split_into_groups(list(all_dbs_ttr.items())):
    display_h(dict(group))

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,16.0,0.3
25th,37.0,0.6
50th,166.0,2.8
75th,852.0,14.2
90th,9181.0,153.0
99th,37210.0,620.2

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,17.0,0.3
25th,58.0,1.0
50th,610.0,10.2
75th,3579.0,59.6
90th,18846.0,314.1
99th,39439.0,657.3

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,1.0,0.0
25th,2.0,0.0
50th,50.0,0.8
75th,912.0,15.2
90th,11520.0,192.0
99th,37460.0,624.3

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,19.0,0.3
25th,39.0,0.7
50th,109.0,1.8
75th,835.0,13.9
90th,10067.0,167.8
99th,36953.0,615.9


Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,27.0,0.5
25th,69.0,1.1
50th,414.0,6.9
75th,2209.0,36.8
90th,15542.0,259.0
99th,38850.0,647.5

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,105.0,1.8
25th,545.0,9.1
50th,3156.0,52.6
75th,9047.0,150.8
90th,25021.0,417.0
99th,40159.0,669.3

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,20.0,0.3
25th,50.0,0.8
50th,361.0,6.0
75th,2502.0,41.7
90th,16748.0,279.1
99th,39721.0,662.0

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,72.0,1.2
25th,239.0,4.0
50th,1543.0,25.7
75th,6393.0,106.5
90th,22878.0,381.3
99th,40446.0,674.1


Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,13.0,0.2
25th,100.0,1.7
50th,733.0,12.2
75th,3200.0,53.3
90th,17238.0,287.3
99th,39444.0,657.4

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,80.0,1.3
25th,327.0,5.5
50th,2119.0,35.3
75th,7596.0,126.6
90th,24289.0,404.8
99th,40814.0,680.2
