# SDC/Non-SDC Edits to Files on Commons

[T252443](https://phabricator.wikimedia.org/T252443)

These metrics aim to show to what extent SDC (Structured Data on Commons) has led users to update structured data about existing media, as opposed to adding structured data only when uploading new media.

In [1]:
import datetime as dt
import pandas as pd
import numpy as np

from wmfdata import hive, spark


### Configuration variables


In [6]:
wmf_snapshot = '2022-02'
start_date = '2022-02-01'
end_date = '2022-03-01'
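The three values above are linked: `start_date` is the first day of the snapshot month, and `end_date` is the first day of the following month, which the query uses as an exclusive upper bound. As a minimal sketch, they could be derived from the snapshot string instead of being typed by hand. The helper name `month_bounds` is hypothetical and not part of the original notebook.

```python
import datetime as dt


def month_bounds(snapshot):
    """Return (start_date, end_date) as ISO strings for a 'YYYY-MM' snapshot.

    end_date is the first day of the next month, meant to be used as an
    exclusive upper bound (event_timestamp < end_date).
    """
    start = dt.datetime.strptime(snapshot, "%Y-%m").date()
    # December rolls over into January of the next year.
    if start.month == 12:
        end = start.replace(year=start.year + 1, month=1)
    else:
        end = start.replace(month=start.month + 1)
    return start.isoformat(), end.isoformat()


start_date, end_date = month_bounds("2022-02")
```

With this in place, only `wmf_snapshot` would need to change from month to month.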

### Approach

We'll use edit comments to identify structured data edits, and timestamps to measure how long after page creation each edit was made. Then, we'll plot a histogram to see if there's a reasonable cutoff we can use for separating edits happening around the initial upload from those happening later.

For non-SDC edits, we'll look for edits that added information to a page, meaning edits with a positive byte difference. Such an edit must not itself be a revert, because a revert only reinstates a previous state of the file. It also must not have been reverted within 48 hours, as that suggests the edit was unproductive.

In both cases, we'll only look at non-bot edits to files (page_namespace = 6) that have not been deleted. While there might also have been valid edits to deleted files, we are mainly interested in learning whether existing pages are updated with structured data. 
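
The classification rules above can be sketched in plain Python, which makes it easy to try them on sample edit summaries before running the full query. This is a minimal sketch and not the notebook's implementation. The function `classify_edit` and its arguments are hypothetical stand-ins for the `mediawiki_history` fields `event_comment`, `revision_text_bytes_diff`, `revision_is_identity_revert`, `revision_is_identity_reverted`, and `revision_seconds_to_identity_revert`. The sample comments in the usage lines are made up for illustration.

```python
import re

# Patterns copied from the edit data query. The leading "..." skips the first
# three characters of the comment (assumed here to be the "/* " that opens an
# auto-generated edit summary).
SDC_PATTERNS = [
    re.compile(r"^...wbsetclaim-create:.*?Special:EntityPage/(P\d+)"),
    re.compile(r"^...wbsetlabel-add:"),
    re.compile(r"^...wbcreateclaim-create:"),
    re.compile(r"^...wbeditentity-update:"),
]

REVERT_GRACE_SECONDS = 48 * 60 * 60  # 48 hours


def classify_edit(comment, bytes_diff, is_revert, is_reverted, seconds_to_revert):
    """Return "sdc", "non-sdc", or None for edits matching neither rule."""
    if comment is None:
        # In SQL a NULL comment makes every REGEXP test NULL, so neither
        # CASE branch matches.
        return None
    if any(pattern.search(comment) for pattern in SDC_PATTERNS):
        return "sdc"
    # The edit counts as surviving if it was never reverted, or was reverted
    # only after the 48-hour grace period.
    survived = (not is_reverted) or (
        seconds_to_revert is not None and seconds_to_revert > REVERT_GRACE_SECONDS
    )
    adds_content = bytes_diff is not None and bytes_diff > 0
    if adds_content and survived and not is_revert:
        return "non-sdc"
    return None


# Hypothetical sample comments, for illustration only.
caption_edit = classify_edit("/* wbsetlabel-add:1|en */ a caption", 40, False, False, None)
wikitext_edit = classify_edit("added a category", 25, False, False, None)
quickly_reverted = classify_edit("added a category", 25, False, True, 3600)
```

Edits for which the function returns `None` correspond to rows the query later drops with `edit_type IS NOT NULL`.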

### Aggregation Tables


We define a table in the Data Lake to hold the aggregated results.


In [3]:
edit_table = 'cchen_sd.sdc_edits_count'

In [6]:
create_table_query = '''
CREATE TABLE IF NOT EXISTS {table_name} (
    month DATE COMMENT "month of the aggregated edit counts",
    edit_type STRING COMMENT "SDC or non-SDC edits",
    time_to_edits STRING COMMENT "bucketed time from page creation to edit",
    max_time_to_edits BIGINT COMMENT "max time in seconds within each bucket (for Superset chart)",
    edit_count BIGINT COMMENT "aggregated number of edits"
)
'''

In [7]:
hive.run(create_table_query.format(
    table_name=edit_table
))

### Edit data

In [7]:
edits_query = '''
WITH edits_data AS (
    SELECT 
        unix_timestamp(event_timestamp) - unix_timestamp(page_creation_timestamp) AS time_to_edit,
        CASE WHEN (event_comment REGEXP "^...wbsetclaim-create:.*?Special:EntityPage/(P\\\\d+)"
                OR event_comment REGEXP "^...wbsetlabel-add:"
                OR event_comment REGEXP "^...wbcreateclaim-create:"
                OR event_comment REGEXP "^...wbeditentity-update:") THEN "sdc"
             WHEN (event_comment NOT REGEXP "^...wbsetclaim-create:.*?Special:EntityPage/(P\\\\d+)"
                AND event_comment NOT REGEXP "^...wbsetlabel-add:"
                AND event_comment NOT REGEXP "^...wbcreateclaim-create:"
                AND event_comment NOT REGEXP "^...wbeditentity-update:"
                AND revision_text_bytes_diff > 0
                AND (revision_is_identity_reverted = false
                    OR revision_seconds_to_identity_revert > 48 * 60 * 60)
                AND revision_is_identity_revert = false) THEN "non-sdc"
        END AS edit_type
    FROM wmf.mediawiki_history
    WHERE snapshot = "{snapshot}"
      AND wiki_db = "commonswiki"
      AND event_entity = "revision"
      AND event_type = "create"
      AND event_timestamp >= "{start_date}"
      AND event_timestamp < "{end_date}"
      AND page_namespace = 6 -- only files
      AND page_is_deleted = false -- only live pages
      AND size(event_user_is_bot_by_historical) = 0 -- no bots
      AND size(event_user_is_bot_by) = 0 -- no bots
),

bucketed_edits_data AS (
    SELECT 
        time_to_edit,
        CASE WHEN time_to_edit < 60 THEN '0-1min'
             WHEN time_to_edit >= 60 AND time_to_edit < 60*60 THEN '1-60min'
             WHEN time_to_edit >= 60*60 AND time_to_edit < 12*60*60 THEN '1-12h'
             WHEN time_to_edit >= 12*60*60 AND time_to_edit < 24*60*60 THEN '12-24h'
             WHEN time_to_edit >= 24*60*60 AND time_to_edit < 7*24*60*60 THEN '1d-1w'
             WHEN time_to_edit >= 7*24*60*60 AND time_to_edit < 30*24*60*60 THEN '1w-1m'
             WHEN time_to_edit >= 30*24*60*60 AND time_to_edit < 90*24*60*60 THEN '1m-3m'
             WHEN time_to_edit >= 90*24*60*60 AND time_to_edit < 180*24*60*60 THEN '3m-6m'
             WHEN time_to_edit >= 180*24*60*60 AND time_to_edit < 365*24*60*60 THEN '6m-1y'
             WHEN time_to_edit >= 365*24*60*60 AND time_to_edit < 5*365*24*60*60 THEN '1y-5y'
             WHEN time_to_edit >= 5*365*24*60*60 AND time_to_edit < 10*365*24*60*60 THEN '5y-10y'
             WHEN time_to_edit >= 10*365*24*60*60 THEN '10y+'
        END AS edits_time,
        edit_type
    FROM edits_data
    WHERE edit_type IS NOT NULL
)

INSERT INTO {aggregate_table}
SELECT
    "{start_date}" AS month,
    edit_type,
    edits_time AS time_to_edits,
    MAX(time_to_edit) AS max_time_to_edits,
    COUNT(*) AS edit_count
FROM bucketed_edits_data
GROUP BY edit_type, edits_time
'''
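
The bucketing in `bucketed_edits_data` can be checked outside of Spark with a small Python equivalent. This is a sketch for sanity-checking bucket boundaries, not a replacement for the query. The names `BUCKETS` and `bucket_label` are hypothetical. Months and years are approximated as 30 and 365 days, as in the query.

```python
MINUTE = 60
HOUR = 60 * MINUTE
DAY = 24 * HOUR

# (exclusive upper bound in seconds, label), in ascending order.
BUCKETS = [
    (MINUTE, "0-1min"),
    (HOUR, "1-60min"),
    (12 * HOUR, "1-12h"),
    (DAY, "12-24h"),
    (7 * DAY, "1d-1w"),
    (30 * DAY, "1w-1m"),
    (90 * DAY, "1m-3m"),
    (180 * DAY, "3m-6m"),
    (365 * DAY, "6m-1y"),
    (5 * 365 * DAY, "1y-5y"),
    (10 * 365 * DAY, "5y-10y"),
]


def bucket_label(time_to_edit):
    """Map a time-to-edit in seconds to the bucket label used in the query."""
    if time_to_edit is None:
        # A NULL timestamp difference matches no WHEN branch in SQL.
        return None
    for upper_bound, label in BUCKETS:
        if time_to_edit < upper_bound:
            return label
    return "10y+"


one_day_label = bucket_label(DAY)
```

As in the query's first `WHEN` branch, a negative time to edit falls into the `0-1min` bucket, so any such rows would be counted there rather than dropped.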

In [9]:
spark.run(edits_query.format(
    snapshot=wmf_snapshot,
    start_date=start_date,
    end_date=end_date,
    aggregate_table=edit_table
))

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.
22/03/29 16:58:29 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 167 for reason Container killed by YARN for exceeding memory limits.  9.3 GB of 8.8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
22/03/29 16:58:29 WARN TaskSetManager: Lost task 3326.0 in stage 3.0 (TID 7778, an-worker1106.eqiad.wmnet, executor 167): ExecutorLostFailure (executor 167 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits.  9.3 GB of 8.8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
22/03/29 16:58:29 WARN TaskSetManager: Lost task 3609.0 in stage 3.0 (TID 7819, an-worker1106.eqiad.wmnet, executor 167): ExecutorLostFailure (executor 167 exited caused by one