# Baseline: Median Time to Revert (Probable Vandalism)

[TASK: T348860](https://phabricator.wikimedia.org/T348860)<br>**[View the Notebook on nbviewer]()**

# Contents
1. [Summary](#Summary)
2. [Data Gathering](#Data-Gathering)
3. [Analysis](#Analysis)
    * [Median Time to Revert, by Wikipedia](#Median-Time-to-Revert)
    * [Time to Revert Percentiles, by Wikipedia](#Time-to-Revert-Percentiles)

## Summary

The following analysis is to determine a baseline for Median Time to Revert for Probable Vandalism for Wikipedias in consideration. The baseline will be used as a reference for evaluating the impact of Automoderator later. The [operational definition](https://phabricator.wikimedia.org/T349083) for probable vandalism (within the scope of [Automoderator](https://www.mediawiki.org/wiki/Moderator_Tools/Automoderator)) is as the following:
- edit belongs to the content namespace
- edit was reverted within 12 hours
- user is anonymous OR if registered
    - user edit count is less than 15 edits
    - time since user's first edit is less than 48 hours
- revert was made by a different editor

## Baseline: Median Time to Revert for Probable Vandalism

In [363]:
pr_centered('Median Time to Revert for Probable Vandalism', True)
display_h({
    '': median_ttr_by_wiki
})

Unnamed: 0_level_0,seconds,minutes
Wikipedia,Unnamed: 1_level_1,Unnamed: 2_level_1
German Wikipedia,186,3.1
English Wikipedia,569,9.48
Spanish Wikipedia,60,1.0
Persian Wikipedia,198,3.31
French Wikipedia,435,7.25
Indonesian Wikipedia,3085,51.42
Italian Wikipedia,358,5.97
Japanese Wikipedia,1139,18.98
Portuguese Wikipedia,1023,17.06
Russian Wikipedia,777,12.95


# Data-Gathering

## Imports

In [192]:
import pandas as pd
import numpy as np
import wmfdata as wmf

import great_tables as gt
from datetime import timedelta, datetime

pd.options.display.max_columns = None
from IPython.display import clear_output

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FuncFormatter, FixedLocator

from IPython.display import display_html
from IPython.display import display, HTML
from IPython.display import clear_output

import warnings

In [3]:
spark_session = wmf.spark.get_active_session()

if type(spark_session) != type(None):
    spark_session.stop()
else:
    print('no active session')

no active session


## spark_session

In [5]:
spark_session = wmf.spark.create_custom_session(
    master="yarn",
    app_name='vandalism-time-to-revert',
    spark_config={
        "spark.driver.memory": "6g",
        "spark.dynamicAllocation.maxExecutors": 64,
        "spark.executor.memory": "24g",
        "spark.executor.cores": 4,
        "spark.sql.shuffle.partitions": 256,
        "spark.driver.maxResultSize": "2g"
        
    }
)

clear_output()

spark_session.sparkContext.setLogLevel("ERROR")
spark_session

## query

In [8]:
rr_scores_path = '/user/paragon/riskobservatory/revertrisk_20212022_anonymous_bot.parquet'

rr_scores = spark_session.read.parquet(rr_scores_path)
rr_scores.createOrReplaceTempView('rr_scores')

rr_scores.printSchema()

                                                                                

root
 |-- rev_id: long (nullable = true)
 |-- wiki_db: string (nullable = true)
 |-- rev_timestamp: string (nullable = true)
 |-- revision_is_identity_reverted: boolean (nullable = true)
 |-- revision_seconds_to_identity_revert: long (nullable = true)
 |-- page_id: long (nullable = true)
 |-- revision_revert_risk: float (nullable = true)
 |-- user_is_anonymous: boolean (nullable = true)
 |-- user_is_bot: boolean (nullable = true)



In [9]:
mwh_snapshot = '2023-11'

wikis_list = [f'{lang}wiki' for lang in ['en', 'es', 'ja', 'de', 'fr', 'ru', 'zh', 'it', 'pt', 'fa', 'id']]
wikis_sql = wmf.utils.sql_tuple(wikis_list)

In [69]:
%%time

query = f"""
WITH 
    base AS (
        SELECT
            mwh.wiki_db,
            revision_revert_risk AS risk,
            mwh.event_user_text AS user_name,
            event_user_is_anonymous AS is_anon,
            mwh.revision_seconds_to_identity_revert AS time_to_revert,
            revision_first_identity_reverting_revision_id AS reverting_edit_id,
            event_timestamp,
            event_user_first_edit_timestamp
        FROM 
            rr_scores rr
        JOIN 
            wmf.mediawiki_history mwh 
            ON rr.wiki_db = mwh.wiki_db 
                AND rr.rev_id = mwh.revision_id
        WHERE 
            snapshot = '{mwh_snapshot}'
            AND rr.wiki_db IN {wikis_sql}
            AND event_entity = 'revision'
            AND event_type = 'create'
            AND page_namespace_is_content
            AND 
                (
                    event_user_is_anonymous 
                    OR event_user_revision_count <= 15
                )
            AND SIZE(event_user_is_bot_by_historical) = 0
            AND mwh.revision_is_identity_reverted
            AND mwh.revision_seconds_to_identity_revert <= 12*60*60
            AND mwh.revision_seconds_to_identity_revert >= 0
            AND YEAR(event_timestamp) = 2022
    )
SELECT
    base.*
FROM 
    base
JOIN
    wmf.mediawiki_history mwh
    ON base.wiki_db = mwh.wiki_db 
        AND base.reverting_edit_id = mwh.revision_id
WHERE
    snapshot = '{mwh_snapshot}'
    AND NOT base.user_name = mwh.event_user_text
"""

vandal_edits = wmf.spark.run(query).drop(['user_name', 'reverting_edit_id'], axis=1)
vandal_edits_base = vandal_edits.copy()

                                                                                92]]]

CPU times: user 52.7 s, sys: 16.6 s, total: 1min 9s
Wall time: 4min 57s


In [90]:
%%time

vandal_edits = (
    vandal_edits
    .assign(
        event_timestamp=pd.to_datetime(vandal_edits['event_timestamp'], utc=True),
        event_user_first_edit_timestamp=pd.to_datetime(vandal_edits['event_user_first_edit_timestamp'], utc=True),
        is_anon=pd.Categorical(vandal_edits['is_anon'])
    )
    .assign(
        elapsed_user_first_rev=time_delta(vandal_edits, 'event_user_first_edit_timestamp', 'event_timestamp')
    )
    .query("(elapsed_user_first_rev <= 48*60*60) | (is_anon == True)")
    .drop(['event_timestamp', 'event_user_first_edit_timestamp'], axis=1)
    .reset_index(drop=True)
)

vandal_edits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3771812 entries, 0 to 3771811
Data columns (total 5 columns):
 #   Column                  Dtype   
---  ------                  -----   
 0   wiki_db                 object  
 1   risk                    float32 
 2   is_anon                 category
 3   time_to_revert          int64   
 4   elapsed_user_first_rev  float64 
dtypes: category(1), float32(1), float64(1), int64(1), object(1)
memory usage: 104.3+ MB
CPU times: user 1.01 s, sys: 39.7 ms, total: 1.05 s
Wall time: 1.1 s


In [321]:
db_names = (
    pd
    .read_csv('https://raw.githubusercontent.com/wikimedia-research/canonical-data/master/wiki/wikis.tsv', sep='\t')
    .query("""database_code == @wikis_list""")[['database_code', 'english_name']]
    .set_index('database_code')['english_name']
    .to_dict()
)

all_dbs_ttr = {db_names[db]: quantiles(vandal_edits.query(f"wiki_db == '{db}'"), style_median=True) for db in vandal_edits.wiki_db.unique()}

## functions

In [346]:
# prints a string at center of the output, bold if needed
def pr_centered(content, bold=False):
    if bold:
        content = f"<b>{content}</b>"
    
    centered_html = f"<div style='text-align:center'>{content}</div>"
    
    display(HTML(centered_html))


# display dataframes horizontally with title for each
def display_h(frames, space=100):
    html = ""
    
    for key in frames.keys():
        html_df =f'<div>{key} {frames[key]._repr_html_()}</div>'
        html += html_df
        
    html = f"""
    <div style="display:flex; justify-content: space-evenly;">
    {html}
    </div>"""
    
    display_html(html, raw=True)

In [79]:
# calculate time difference in seconds between two columns
# note: the columns should be datetime formatted
def time_delta(df, start_column, end_column):
    try: 
        return df.apply(lambda row: (row[end_column] - row[start_column]).total_seconds(), axis=1)
    except:
        return np.NaN

# applies cell color to a given nth percentile
def style_percentile(i, percentile='50th'):
    return ['background-color: Aquamarine' if i.name == percentile else '' for _ in i]

# return quatiles for a given series (dataframe and column name)
def quantiles(frame, col='time_to_revert', style_median=False):    
    qdict = {
        '10th': frame[col].quantile(0.1),
        '25th': frame[col].quantile(0.25),
        '50th': frame[col].quantile(0.5),
        '75th': frame[col].quantile(0.7),
        '90th': frame[col].quantile(0.9),
        '99th': frame[col].quantile(0.99)
    }
    
    df = pd.DataFrame(qdict.values(),
                      index=qdict.keys(),
                      columns=['seconds'])
    
    df['minutes'] = round(df['seconds'] / 60, 2)
    
    df = df.astype({'seconds': int})
    df.index.name = 'percentile'
    
    if style_median:
        df = df.style.apply(style_percentile, axis=1).format("{:.1f}")
        # df = df.astype({'seconds': int})
        return df
    else:
        return df

In [356]:
def split_into_groups(dfs, group_size=4):
    return [dfs[i:i + group_size] for i in range(0, len(dfs), group_size)]

# Analysis

## Median-Time-to-Revert

In [354]:
median_ttr_by_wiki = (
    vandal_edits
    .groupby('wiki_db')['time_to_revert']
    .median()
    .reset_index(name='seconds')
    .assign(
        **{
            'minutes': lambda x: round(x['seconds'] / 60, 2),
            'Wikipedia': lambda x: x['wiki_db'].map(db_names)
        }
    )
    .astype({'seconds': int})
    .set_index('Wikipedia')
    .drop('wiki_db', axis=1)
)

In [355]:
pr_centered('Median Time to Revert (by Wikipedia)', True)
display_h({
    '': median_ttr_by_wiki
})

Unnamed: 0_level_0,seconds,minutes
Wikipedia,Unnamed: 1_level_1,Unnamed: 2_level_1
German Wikipedia,186,3.1
English Wikipedia,569,9.48
Spanish Wikipedia,60,1.0
Persian Wikipedia,198,3.31
French Wikipedia,435,7.25
Indonesian Wikipedia,3085,51.42
Italian Wikipedia,358,5.97
Japanese Wikipedia,1139,18.98
Portuguese Wikipedia,1023,17.06
Russian Wikipedia,777,12.95


## Time-to-Revert-Percentiles

In [360]:
pr_centered('<big>Time to Revert (by Wikipedia)<big>', True)
pr_centered('<u>coloured cell -> median</u>')

for group in split_into_groups(list(all_dbs_ttr.items())):
    display_h(dict(group))

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,17.0,0.3
25th,40.0,0.7
50th,186.0,3.1
75th,975.0,16.2
90th,9779.0,163.0
99th,37463.0,624.4

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,17.0,0.3
25th,57.0,0.9
50th,569.0,9.5
75th,3339.0,55.6
90th,18299.0,305.0
99th,39313.0,655.2

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,1.0,0.0
25th,2.0,0.0
50th,60.0,1.0
75th,993.0,16.6
90th,11662.0,194.4
99th,37465.0,624.4

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,21.0,0.3
25th,43.0,0.7
50th,198.0,3.3
75th,1338.0,22.3
90th,13104.0,218.4
99th,38194.0,636.6


Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,27.0,0.5
25th,71.0,1.2
50th,435.0,7.2
75th,2266.0,37.8
90th,15607.0,260.1
99th,38854.0,647.6

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,105.0,1.8
25th,533.0,8.9
50th,3085.0,51.4
75th,9038.0,150.7
90th,24925.0,415.4
99th,40147.0,669.1

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,20.0,0.3
25th,51.0,0.8
50th,358.0,6.0
75th,2468.0,41.1
90th,16531.0,275.5
99th,39669.0,661.1

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,67.0,1.1
25th,208.0,3.5
50th,1139.0,19.0
75th,5035.0,83.9
90th,21155.0,352.6
99th,40043.0,667.4


Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,61.0,1.0
25th,195.0,3.2
50th,1023.0,17.1
75th,3846.0,64.1
90th,17242.0,287.4
99th,39209.0,653.5

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,14.0,0.2
25th,108.0,1.8
50th,777.0,12.9
75th,3301.0,55.0
90th,17352.0,289.2
99th,39479.0,658.0

Unnamed: 0_level_0,seconds,minutes
percentile,Unnamed: 1_level_1,Unnamed: 2_level_1
10th,80.0,1.3
25th,323.0,5.4
50th,2043.0,34.1
75th,7278.0,121.3
90th,23887.0,398.1
99th,40768.0,679.5
