# How many edits would Automoderator revert per day at different caution levels?

[TASK: T348869](https://phabricator.wikimedia.org/T348869)

**Purpose**<br>As part of the model testing process, we want to understand how many edits we can expect Automoderator to revert per day on average.
This will help the community understand the potential impact of Automoderator. The analysis uses [revert risk scores generated by WMF's Research team](https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/mnz/examples/examples/notebooks/revertrisk_example.ipynb) and excludes edits made by admins and bots, self-reverts, and new page creations.
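The exclusions listed above can be sketched in plain Python as a rough illustration. The field names (`parent_id`, `user_groups`, `is_bot_by`, `is_revert`) and the `reverted_authors` lookup are hypothetical stand-ins that loosely mirror `mediawiki_history` columns; the notebook itself applies these filters in SQL.

```python
def is_eligible(edit, reverted_authors):
    """Return True if a toy edit record stays in the analysis sample.

    `reverted_authors` maps a reverting revision ID to the list of users
    whose revisions it reverted (hypothetical structure for this sketch).
    """
    if edit["parent_id"] == 0:          # new page creation
        return False
    if "sysop" in edit["user_groups"]:  # administrator
        return False
    if edit["is_bot_by"]:               # non-empty list marks a bot
        return False
    if edit["is_revert"]:
        # Keep a revert only if it undid at least one edit by someone else;
        # otherwise treat it as a self-revert.
        authors = reverted_authors.get(edit["rev_id"], [])
        return any(author != edit["user"] for author in authors)
    return True


def make_edit(rev_id, parent_id, user, groups=(), bot_by=(), is_revert=False):
    return {
        "rev_id": rev_id, "parent_id": parent_id, "user": user,
        "user_groups": list(groups), "is_bot_by": list(bot_by),
        "is_revert": is_revert,
    }


toy_edits = [
    make_edit(1, 0, "A"),                    # page creation
    make_edit(2, 1, "B", groups=["sysop"]),  # admin edit
    make_edit(3, 2, "C", bot_by=["group"]),  # bot edit
    make_edit(4, 3, "D"),                    # ordinary edit
    make_edit(5, 4, "D", is_revert=True),    # D reverts D: self-revert
    make_edit(6, 5, "E", is_revert=True),    # E reverts D: kept
]
toy_reverted_authors = {5: ["D"], 6: ["D"]}

kept_ids = [e["rev_id"] for e in toy_edits if is_eligible(e, toy_reverted_authors)]
print(kept_ids)
```

Only the ordinary edit and the revert of another user's edit remain in this toy sample.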

**Results**<br>Average number of edits per day that Automoderator would potentially revert at each revert risk threshold (edits made in 2022)

| wiki | risk > 0.99 | risk > 0.985 | risk > 0.98 | risk > 0.975 | risk > 0.97
| ----- | ----- | ----- | ----- | ----- | -----
| English Wikipedia | 46 | 100 | 165 | 235 | 308
| Indonesian Wikipedia | 1 | 2 | 4 | 6 | 8
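As a rough illustration of how figures like those in the table are derived, the sketch below counts, for each day, the edits whose risk score exceeds each threshold and then averages those daily counts. The function name and the toy scores are made up for this example; the real computation is the SQL query in the Data-Gathering section.

```python
from collections import defaultdict

THRESHOLDS = (0.99, 0.985, 0.98, 0.975, 0.97)


def average_daily_reverts(edits, thresholds=THRESHOLDS):
    """Average per-day count of edits with risk above each threshold.

    `edits` is an iterable of (date, risk) pairs for a single wiki.
    """
    daily = defaultdict(lambda: {t: 0 for t in thresholds})
    for date, risk in edits:
        counts = daily[date]  # a day counts even if no edit clears a threshold
        for t in thresholds:
            if risk > t:
                counts[t] += 1
    n_days = len(daily)
    if n_days == 0:
        return {t: 0.0 for t in thresholds}
    return {t: sum(c[t] for c in daily.values()) / n_days for t in thresholds}


# Toy scores (not real data): two days, five edits.
toy_edits = [
    ("2022-01-01", 0.995),
    ("2022-01-01", 0.982),
    ("2022-01-01", 0.40),
    ("2022-01-02", 0.976),
    ("2022-01-02", 0.10),
]
averages = average_daily_reverts(toy_edits)
print(averages)
```

Lowering the threshold can only add edits to the count, which is why each row of the table grows from left to right.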

# Data-Gathering

In [1]:
import pandas as pd
import wmfdata as wmf

pd.options.display.max_columns = None
from IPython.display import clear_output

import warnings





## spark_session

In [28]:
# Stop any Spark session that is already active before creating a new one.
spark_session = wmf.spark.get_active_session()

if spark_session is not None:
    spark_session.stop()
else:
    print('no active session')

In [29]:
spark_session = wmf.spark.create_custom_session(
    master="yarn",
    app_name='automod-activity',
    spark_config={
        "spark.driver.memory": "4g",
        "spark.dynamicAllocation.maxExecutors": 64,
        "spark.executor.memory": "16g",
        "spark.executor.cores": 4,
        "spark.sql.shuffle.partitions": 256,
        "spark.driver.maxResultSize": "2g",
    }
)

clear_output()

spark_session.sparkContext.setLogLevel("ERROR")
spark_session

## query

In [4]:
# paths to pre-calculated revert risk scores
# generated by https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/mnz/examples/examples/notebooks/revertrisk_example.ipynb
rr_scores_path = '/user/paragon/riskobservatory/revertrisk_20212022_anonymous_bot.parquet'

rr_scores = spark_session.read.parquet(rr_scores_path)
rr_scores.createOrReplaceTempView('rr_scores')

rr_scores.printSchema()

                                                                                

root
 |-- rev_id: long (nullable = true)
 |-- wiki_db: string (nullable = true)
 |-- rev_timestamp: string (nullable = true)
 |-- revision_is_identity_reverted: boolean (nullable = true)
 |-- revision_seconds_to_identity_revert: long (nullable = true)
 |-- page_id: long (nullable = true)
 |-- revision_revert_risk: float (nullable = true)
 |-- user_is_anonymous: boolean (nullable = true)
 |-- user_is_bot: boolean (nullable = true)



In [10]:
%%time

query = """
WITH base AS (
    SELECT
        rr.wiki_db,
        rr.rev_id,
        revision_revert_risk AS risk,
        event_user_text,
        DATE(event_timestamp) AS date,
        mwh.revision_is_identity_revert,
        CASE 
            WHEN mwh.revision_is_identity_revert THEN 'revert' 
            ELSE 'non_revert' 
        END AS revision_type
    FROM 
        rr_scores rr
    JOIN 
        wmf.mediawiki_history mwh 
        ON rr.wiki_db = mwh.wiki_db AND rr.rev_id = mwh.revision_id
    WHERE 
        snapshot = '2023-09' 
        AND rr.wiki_db IN ('idwiki', 'enwiki')
        -- exclude page creations
        AND mwh.revision_parent_id <> 0
        -- exclude administrators
        AND NOT ARRAY_CONTAINS(mwh.event_user_groups, 'sysop')
        -- exclude bots
        AND SIZE(event_user_is_bot_by) = 0
        AND YEAR(event_timestamp) = 2022
),

excl_self_reverts AS (
    SELECT
        b.*
    FROM
        base b
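    JOIN wmf.mediawiki_history mwh
        -- revision IDs are only unique within a wiki, so match on wiki_db as well
        ON b.wiki_db = mwh.wiki_db
        AND b.rev_id = mwh.revision_first_identity_reverting_revision_id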
    WHERE
        snapshot = '2023-09'
        AND b.revision_type = 'revert'
        -- exclude self reverts
        AND b.event_user_text <> mwh.event_user_text
),

sample AS (
    SELECT * FROM base WHERE revision_type = 'non_revert'
    UNION ALL
    SELECT * FROM excl_self_reverts
),

count_score AS (
    SELECT
        date,
        wiki_db,
        SUM(CASE WHEN risk > 0.99 THEN 1 ELSE 0 END) AS r99,
        SUM(CASE WHEN risk > 0.985 THEN 1 ELSE 0 END) AS r985,
        SUM(CASE WHEN risk > 0.98 THEN 1 ELSE 0 END) AS r98,
        SUM(CASE WHEN risk > 0.975 THEN 1 ELSE 0 END) AS r975,
        SUM(CASE WHEN risk > 0.97 THEN 1 ELSE 0 END) AS r97
    FROM
        -- reads from `base`: the self-revert filter defined in `sample` is not applied to these counts
        base
    GROUP BY
        wiki_db,
        date
)

SELECT 
    wiki_db,
    ROUND(AVG(r99)) AS r99,
    ROUND(AVG(r985)) AS r985,
    ROUND(AVG(r98)) AS r98,
    ROUND(AVG(r975)) AS r975,
    ROUND(AVG(r97)) AS r97
FROM
    count_score
GROUP BY
    wiki_db
ORDER BY
    wiki_db
"""

result = wmf.spark.run(query)

                                                                                

CPU times: user 301 ms, sys: 105 ms, total: 406 ms
Wall time: 1min 37s


In [30]:
result

| | wiki_db | r99 | r985 | r98 | r975 | r97
| ----- | ----- | ----- | ----- | ----- | ----- | -----
| 0 | enwiki | 46.0 | 100.0 | 165.0 | 235.0 | 308.0
| 1 | idwiki | 1.0 | 2.0 | 4.0 | 6.0 | 8.0
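One possible way to turn this output into the Results table is sketched below. The rows are copied from the output above as plain dictionaries so that the snippet runs without Spark; with the real DataFrame, `result.to_dict("records")` should give the same shape. The wiki display names, column labels, and the `to_markdown_table` helper are choices made for this example only.

```python
WIKI_NAMES = {"enwiki": "English Wikipedia", "idwiki": "Indonesian Wikipedia"}
COLUMN_LABELS = {
    "r99": "risk > 0.99",
    "r985": "risk > 0.985",
    "r98": "risk > 0.98",
    "r975": "risk > 0.975",
    "r97": "risk > 0.97",
}

# Values copied from the query output above.
rows = [
    {"wiki_db": "enwiki", "r99": 46.0, "r985": 100.0, "r98": 165.0, "r975": 235.0, "r97": 308.0},
    {"wiki_db": "idwiki", "r99": 1.0, "r985": 2.0, "r98": 4.0, "r975": 6.0, "r97": 8.0},
]


def to_markdown_table(rows):
    """Render rows as a Markdown table in the style used in this notebook."""
    def line(cells):
        return "| " + " | ".join(cells)

    out = [
        line(["wiki"] + list(COLUMN_LABELS.values())),
        line(["-----"] * (len(COLUMN_LABELS) + 1)),
    ]
    for row in rows:
        cells = [WIKI_NAMES.get(row["wiki_db"], row["wiki_db"])]
        # The averages are already rounded in SQL, so int() only drops the ".0".
        cells += [str(int(row[col])) for col in COLUMN_LABELS]
        out.append(line(cells))
    return "\n".join(out)


table = to_markdown_table(rows)
print(table)
```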
