# Overview
The goal is to generate a dataset for the [Automoderator](https://www.mediawiki.org/wiki/Moderator_Tools/Automoderator) model testing interface. The dataset has the following columns:
* revision_id: unique id of an edit
* revision_revert_risk: revert risk score provided by the [Language-agnostic revert risk](https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language-agnostic_revert_risk) model
* wiki_db: Wikimedia project
* revision_is_identity_reverted: whether the edit has been reverted
* event_user_revision_count: edit count of the user who made the edit (up to that edit)
* user_is_anonymous: whether the user is an anonymous (IP) user; false means a registered user
* user_is_bot: whether the user is a bot or not
* is_self_revert: if the edit is a revert, whether it reverts an earlier edit by the same user; null for edits that are not reverts
* is_sysop: whether the user has admin privileges on the given wiki
* is_page_creation: whether the edit created a new page
* is_newcomer_task: whether the edit was made as a result of the [newcomer add-a-link task](https://www.mediawiki.org/wiki/Growth/Personalized_first_day/Structured_tasks/Add_a_link)
* is_cx_edit: whether the edit was made using the [Content Translation tool](https://www.mediawiki.org/wiki/Content_translation)
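
As a rough sanity check on the columns listed above, the sketch below reports which documented columns are missing from a pandas DataFrame. The helper name `missing_columns` and the toy frame are hypothetical; only the column names are taken from the list. Note that the query in this notebook selects the revision id as `rev_id`, so that column would need renaming to match the documented `revision_id`.

```python
import pandas as pd

# Column names copied from the list above.
EXPECTED_COLUMNS = [
    "revision_id",
    "revision_revert_risk",
    "wiki_db",
    "revision_is_identity_reverted",
    "event_user_revision_count",
    "user_is_anonymous",
    "user_is_bot",
    "is_self_revert",
    "is_sysop",
    "is_page_creation",
    "is_newcomer_task",
    "is_cx_edit",
]


def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `frame`, in list order."""
    return [name for name in expected if name not in frame.columns]


# Hypothetical toy frame, only to show the call.
toy = pd.DataFrame({"rev_id": [1], "wiki_db": ["enwiki"]})
print(missing_columns(toy))
```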

# Data-Gathering

In [1]:
import warnings

import pandas as pd
import wmfdata as wmf
from IPython.display import clear_output

pd.options.display.max_columns = None


## spark_session

In [2]:
# Stop any Spark session that is already active before creating a new one
spark_session = wmf.spark.get_active_session()

if spark_session is not None:
    spark_session.stop()
else:
    print('no active session')

no active session


In [3]:
spark_session = wmf.spark.create_custom_session(
    master="yarn",
    app_name='revert-risk-data-sample',
    spark_config={
        "spark.driver.memory": "4g",
        "spark.dynamicAllocation.maxExecutors": 64,
        "spark.executor.memory": "16g",
        "spark.executor.cores": 4,
        "spark.sql.shuffle.partitions": 256,
        "spark.driver.maxResultSize": "2g"
        
    }
)

clear_output()

spark_session

In [4]:
spark_session.sparkContext.setLogLevel("ERROR")

## query

In [5]:
# paths to pre-calculated revert risk scores
# generated by https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/mnz/examples/examples/notebooks/revertrisk_example.ipynb
rr_scores_path = '/user/paragon/riskobservatory/revertrisk_20212022_anonymous_bot.parquet'

rr_scores = spark_session.read.parquet(rr_scores_path)
rr_scores.createOrReplaceTempView('rr_scores')

In [6]:
sample = spark_session.sql("""
WITH 
    base AS (
        SELECT 
            *,
            ROW_NUMBER() OVER (
                PARTITION BY wiki_db, revision_is_identity_reverted
                ORDER BY RAND(0923)
            ) AS row_num
        FROM rr_scores
        WHERE wiki_db IN ('enwiki', 'idwiki')
    ),

    sample AS (
        SELECT
            *
        FROM
            base
        WHERE row_num <= 12500
    ),

    base_sample AS (
        SELECT
            mwh.event_user_text,
            s.rev_id,
            revision_revert_risk,
            s.wiki_db,
            s.revision_is_identity_reverted,
            event_user_revision_count,
            s.user_is_anonymous,
            user_is_bot,
            page_title,
            CASE 
                WHEN ARRAY_CONTAINS(mwh.event_user_groups, 'sysop') THEN TRUE
                ELSE FALSE
            END AS is_sysop,
            CASE 
                WHEN mwh.revision_parent_id = 0 THEN TRUE 
                ELSE FALSE 
            END AS is_page_creation,
            CASE 
                WHEN ARRAY_CONTAINS(mwh.revision_tags, 'newcomer task add link') THEN TRUE
                ELSE FALSE
            END AS is_newcomer_task,
            CASE
                WHEN ARRAY_CONTAINS(mwh.revision_tags, 'contenttranslation') THEN TRUE
                ELSE FALSE
            END AS is_cx_edit,
            CASE
                WHEN mwh.revision_is_identity_revert THEN TRUE
                ELSE FALSE
            END AS reverting_edit
        FROM 
            sample s
        JOIN 
            wmf.mediawiki_history mwh 
            ON s.wiki_db = mwh.wiki_db AND s.rev_id = mwh.revision_id
        WHERE 
            mwh.snapshot = '2023-09'
    ),
    
    reverts AS (
        SELECT 
            * 
        FROM 
            base_sample 
        WHERE 
            reverting_edit
    ),

    non_reverts AS (
        SELECT 
            *, 
            NULL AS is_self_revert 
        FROM 
            base_sample 
        WHERE 
            NOT reverting_edit
    ),
    
    self_reverts AS (
        SELECT
            rv.*,
            CASE 
                WHEN rv.event_user_text = mwh.event_user_text THEN TRUE
                ELSE FALSE
            END AS is_self_revert
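        FROM 
            reverts rv
        LEFT JOIN 
            wmf.mediawiki_history mwh
            -- mwh rows here are the edits undone by rv; a revert that undoes
            -- several edits matches several rows
            ON rv.wiki_db = mwh.wiki_db 
                AND rv.rev_id = mwh.revision_first_identity_reverting_revision_id
                -- use the same snapshot as base_sample; without this filter
                -- every available snapshot is joined and rows repeat per snapshot
                AND mwh.snapshot = '2023-09'
    )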
    
SELECT * FROM non_reverts
UNION ALL
SELECT * FROM self_reverts
""")

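The sampling step in the query numbers rows in random order within each `wiki_db` and `revision_is_identity_reverted` group and keeps the first 12,500, which caps the sample at 2 wikis × 2 outcomes × 12,500 = 50,000 rows. Below is a minimal pandas sketch of the same idea on toy data. The function name `stratified_sample` and the toy rows are illustrative assumptions, and NumPy's generator will not reproduce the ordering produced by Spark's `RAND`.

```python
import numpy as np
import pandas as pd


def stratified_sample(frame, strata, per_stratum, seed=923):
    """Keep at most `per_stratum` randomly chosen rows from each stratum."""
    rng = np.random.default_rng(seed)
    # Attach a random sort key, analogous to ORDER BY RAND(seed).
    shuffled = frame.assign(_rand=rng.random(len(frame))).sort_values("_rand")
    # Number rows within each stratum, analogous to ROW_NUMBER() OVER (PARTITION BY ...).
    row_num = shuffled.groupby(strata).cumcount() + 1
    return shuffled[row_num <= per_stratum].drop(columns="_rand")


# Hypothetical toy data: 10 revisions across two wikis.
toy = pd.DataFrame({
    "wiki_db": ["enwiki"] * 6 + ["idwiki"] * 4,
    "revision_is_identity_reverted": [
        True, True, True, False, False, False,
        True, False, False, False,
    ],
    "rev_id": range(10),
})

picked = stratified_sample(
    toy, ["wiki_db", "revision_is_identity_reverted"], per_stratum=2
)
print(picked.groupby(["wiki_db", "revision_is_identity_reverted"]).size())
```

A stratum with fewer rows than the cap (here the single reverted `idwiki` edit) is kept in full, so the result is only balanced when every stratum has at least as many rows as the cap.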
# Output

In [None]:
sample_frame = sample.toPandas()

In [11]:
sample_frame.drop_duplicates().to_csv('revert_risk_test_data.tsv', sep='\t')
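
Because `to_csv` is called without `index=False`, the DataFrame index is written as an unnamed first column, and after `drop_duplicates()` that index can have gaps. A minimal sketch of reading such a file back, using an in-memory buffer and hypothetical toy rows in place of the real file:

```python
import io

import pandas as pd

# Hypothetical toy rows; the first two are exact duplicates.
frame = pd.DataFrame({
    "rev_id": [10, 10, 11],
    "wiki_db": ["enwiki", "enwiki", "idwiki"],
    "is_self_revert": [None, None, True],
})

# Write the same way as the cell above, but to a buffer instead of a file.
buffer = io.StringIO()
frame.drop_duplicates().to_csv(buffer, sep="\t")
buffer.seek(0)

# index_col=0 restores the unnamed first column as the index
# instead of loading it as an extra data column.
restored = pd.read_csv(buffer, sep="\t", index_col=0)
print(restored)
```

Null values such as `is_self_revert` for non-reverts are written as empty fields and read back as missing values.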