# Baselines for Commons Upload Wizard Improvements

[T337466](https://phabricator.wikimedia.org/T337466)

The goal of this task is to calculate baseline success metrics and goals for Commons Upload Wizard improvements.

In this notebook, we want to include baselines for the following metrics:
- Total number of media files uploaded through Upload Wizard within a month (broken down by own work and not own work)
- Total number of filed deletion requests and total number of speedy deletions within a month (broken down by own work and not own work)
- The ratio of uploaded media (broken down by own work and not own work) to filed deletion requests within a month. This does not include speedy deletions
- The ratio of uploaded media (broken down by own work and not own work) to speedy deletions within a month

In [4]:
import re

from wmfdata import hive, mariadb, spark
import wmfdata 

import math
import pandas as pd
import numpy as np

from datetime import datetime, timedelta, date

In [None]:
spark_session = wmfdata.spark.create_session(app_name='pyspark regular; media-uploads',
                                  type='yarn-large')  

In [6]:
snapshot = '2023-05'  
start_date = '2023-05-01'
end_date = '2023-06-01'

## File Uploads

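Get file upload data. Bot uploads are excluded in the query itself; mobile uploads are labeled with `platform = 'mobile'` and filtered out later in the metric queries (`WHERE platform = 'other'`). For the baselines, we use data from May 2023.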
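
The query in the next cell derives three flags per revision: `upload_wizard`, `platform`, and `own_work`. As a rough illustration of that logic outside SQL, here is a minimal pure-Python sketch. The function name `classify_upload` and the example inputs are hypothetical and not part of the original notebook; only the tag names and the `own work` substring match come from the query.

```python
# Tag names taken from the CASE expression in the uploads query
MOBILE_TAGS = {"ios app edit", "android app edit", "mobile app edit", "mobile web edit"}


def classify_upload(revision_tags, event_comment):
    """Mirror the query's CASE expressions for a single revision (illustrative only)."""
    tags = set(revision_tags or [])          # a NULL tag array is treated as "no tags"
    comment = (event_comment or "").lower()  # a NULL comment never matches
    return {
        "upload_wizard": "uploadwizard" in tags,
        "platform": "mobile" if tags & MOBILE_TAGS else "other",
        "own_work": "own work" in comment,   # same idea as LOWER(...) LIKE '%own work%'
    }


# Hypothetical example rows, not taken from the real data
print(classify_upload(["uploadwizard"], "Uploaded own work with UploadWizard"))
print(classify_upload(["mobile web edit"], "Eigenes Werk"))
```

The second call shows the limitation of the substring match: a comment that is not in English is not detected as own work.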

In [7]:
uploads_query = """
SELECT
   event_timestamp,
   page_id,
   page_title,
   event_comment,
   CASE WHEN ARRAY_CONTAINS(revision_tags, 'uploadwizard') THEN true ELSE false END AS upload_wizard,
   CASE WHEN ARRAY_CONTAINS(revision_tags, 'ios app edit') 
            OR ARRAY_CONTAINS(revision_tags, 'android app edit') 
            OR ARRAY_CONTAINS(revision_tags, 'mobile app edit') 
            OR ARRAY_CONTAINS(revision_tags, 'mobile web edit') THEN 'mobile'
        ELSE 'other'
    END AS platform,
    CASE WHEN LOWER(event_comment) LIKE '%own work%' THEN true ELSE false END AS own_work
FROM wmf.mediawiki_history
WHERE snapshot = '{mw_snapshot}' 
    AND event_timestamp >= '{start_date}'
    AND event_timestamp < '{end_date}' 
    AND event_entity = 'revision' 
    AND event_type = 'create' 
    AND page_namespace_is_content_historical 
    AND NOT page_is_redirect
    AND revision_parent_id = 0
    AND wiki_db = 'commonswiki'
    AND SIZE(event_user_is_bot_by) <= 0
    AND SIZE(event_user_is_bot_by_historical) <= 0
"""

In [8]:
upload_data = spark.run( 
        uploads_query.format(
          start_date = start_date,
          end_date = end_date,
          mw_snapshot = snapshot
        )
    )

In [9]:
# store in global temp view
uploads_sdf = spark_session.createDataFrame(upload_data)
uploads_sdf.createGlobalTempView("upload_data")

## Deletion Requests

Templates and corresponding template ids: {{Delete}} (id 589), {{Speedydelete}} (id 157), {{SD}} (id 417), {{Copyvio}} (id 458)

In [10]:
deletion_query = """

SELECT 
    tl_from as page_id,
    CASE WHEN tl_target_id = 589 THEN 'deletion'
         WHEN tl_target_id = 157 OR tl_target_id = 417 THEN 'speedy_deletion'
         WHEN tl_target_id = 458 THEN 'copy_vio'
    END AS dr_type
FROM wmf_raw.mediawiki_templatelinks
WHERE snapshot = '{mw_snapshot}' 
    AND wiki_db='commonswiki'
    AND tl_from_namespace=6 
    AND tl_target_id in (589,157,417,458)
"""

In [11]:
deletion_data = spark.run( 
        deletion_query.format(
          mw_snapshot = snapshot
        )
    )

In [12]:
# store in global temp view
deletion_sdf = spark_session.createDataFrame(deletion_data)
deletion_sdf.createGlobalTempView("deletion_data")

## File Upload Metrics

In [13]:
spark.run("""
SELECT upload_wizard, own_work, 
    COUNT(DISTINCT(page_id)) AS uploads,
    COUNT(DISTINCT(page_id)) * 100.0 / SUM(COUNT(DISTINCT(page_id))) OVER () AS pct
FROM global_temp.upload_data
WHERE platform = 'other'
GROUP BY upload_wizard, own_work
""")

23/06/21 07:15:36 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
23/06/21 07:15:36 WARN TaskSetManager: Stage 2 contains a task of very large size (5435 KiB). The maximum recommended task size is 1000 KiB.

| | upload_wizard | own_work | uploads | pct |
|---|---|---|---|---|
| 0 | True | False | 76111 | 13.301653990269 |
| 1 | True | True | 229782 | 40.15819864660813 |
| 2 | False | True | 157 | 0.02743834237459 |
| 3 | False | False | 266142 | 46.51270902074828 |


In May 2023, there were 572,192 desktop (non-mobile) uploads to Commons, excluding bot uploads. Of these, 305,893 (53.5%) came through Upload Wizard and 266,299 (46.5%) came through other upload methods.

Of the Upload Wizard uploads, 75.1% are own work and 24.9% are not own work. For other upload methods, it is hard to determine whether a file is own work: unlike Upload Wizard, these methods do not produce standardized edit comments, and comments may be written in other languages. We can only detect 0.06% of these files as own work.
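
The shares quoted above can be re-derived from the four counts in the output table. The snippet below is a small arithmetic check, not part of the original analysis; the variable names are illustrative and the counts are copied by hand from the table.

```python
# Counts copied from the output table above: (upload_wizard, own_work) -> uploads
uploads = {
    (True, False): 76111,
    (True, True): 229782,
    (False, True): 157,
    (False, False): 266142,
}

total = sum(uploads.values())
wizard_total = uploads[(True, True)] + uploads[(True, False)]
other_total = uploads[(False, True)] + uploads[(False, False)]

print(f"total uploads:       {total:,}")
print(f"upload wizard:       {wizard_total:,} ({wizard_total / total:.1%})")
print(f"other methods:       {other_total:,} ({other_total / total:.1%})")
print(f"own work (wizard):   {uploads[(True, True)] / wizard_total:.1%}")
print(f"own work (other):    {uploads[(False, True)] / other_total:.2%}")
```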

## Deletion Request Metrics

In [14]:
spark.run("""
SELECT 
    upload_wizard, 
    own_work, 
    COUNT(DISTINCT(u.page_id)) AS deletion_request,
    COUNT(DISTINCT CASE WHEN dr_type = 'speedy_deletion' THEN u.page_id END) AS sd_request
FROM global_temp.upload_data u 
    INNER JOIN global_temp.deletion_data d ON u.page_id = d.page_id
WHERE platform = 'other'
GROUP BY upload_wizard, own_work
""")

23/06/21 07:15:48 WARN TaskSetManager: Stage 6 contains a task of very large size (5435 KiB). The maximum recommended task size is 1000 KiB.

| | upload_wizard | own_work | deletion_request | sd_request |
|---|---|---|---|---|
| 0 | True | False | 333 | 4 |
| 1 | True | True | 732 | 6 |
| 2 | False | True | 1 | 0 |
| 3 | False | False | 5344 | 5 |


There are 6,410 deletion requests in total (counting all tracked templates, including speedy deletions), and 83% of them are for files uploaded through other methods.

For Upload Wizard uploads, there are 1,065 deletion requests, 10 of which are speedy deletions. Of these requests, 732 are for own work and 333 are for files that are not own work.
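
As a quick check of these totals, the counts from the output table can be summed directly. This is only an illustrative recomputation with hand-copied numbers and made-up variable names.

```python
# Counts copied from the deletion request output table above:
# (upload_wizard, own_work) -> (deletion_request, sd_request)
deletions = {
    (True, False): (333, 4),
    (True, True): (732, 6),
    (False, True): (1, 0),
    (False, False): (5344, 5),
}

total_requests = sum(v[0] for v in deletions.values())
wizard_requests = sum(v[0] for k, v in deletions.items() if k[0])
wizard_speedy = sum(v[1] for k, v in deletions.items() if k[0])
other_requests = total_requests - wizard_requests

print(f"total deletion requests:  {total_requests:,}")
print(f"from upload wizard files: {wizard_requests:,} (speedy: {wizard_speedy})")
print(f"from other uploads:       {other_requests:,} ({other_requests / total_requests:.0%})")
```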

## Uploads & Deletion Ratios

In [16]:
spark.run("""
SELECT 
    upload_wizard, 
    own_work, 
    COUNT(DISTINCT(u.page_id)) AS uploads,
    COUNT(DISTINCT CASE WHEN dr_type != 'speedy_deletion' THEN u.page_id END) AS deletion_request,
    COUNT(DISTINCT CASE WHEN dr_type = 'speedy_deletion' THEN u.page_id END) AS sd_request
FROM global_temp.upload_data u 
    LEFT JOIN global_temp.deletion_data d ON u.page_id = d.page_id
WHERE platform = 'other'
GROUP BY upload_wizard, own_work
""")

23/06/21 07:17:48 WARN TaskSetManager: Stage 12 contains a task of very large size (5435 KiB). The maximum recommended task size is 1000 KiB.

| | upload_wizard | own_work | uploads | deletion_request | sd_request |
|---|---|---|---|---|---|
| 0 | True | False | 76111 | 329 | 4 |
| 1 | True | True | 229782 | 726 | 6 |
| 2 | False | True | 157 | 1 | 0 |
| 3 | False | False | 266142 | 5339 | 5 |


For Upload Wizard, the ratio of uploaded media to filed deletion requests within a month, excluding speedy deletions, is 1:0.003. This ratio is 1:0.004 for not own work and 1:0.003 for own work.

For uploads not made through Upload Wizard, the ratio of uploaded media to filed deletion requests within a month, excluding speedy deletions, is 1:0.02. This ratio is 1:0.006 for own work and 1:0.02 for not own work.
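
The ratios above, and the corresponding speedy deletion ratios listed among the target metrics, can be recomputed from the last output table. The sketch below assumes the count columns follow the order selected by the query (uploads, deletion requests excluding speedy deletions, speedy deletions). The helper names `per_upload` and `summarize` are illustrative and not part of the original notebook.

```python
# Rows copied from the output table above:
# (upload_wizard, own_work, uploads, deletion requests excl. speedy, speedy deletions)
rows = [
    (True, False, 76111, 329, 4),
    (True, True, 229782, 726, 6),
    (False, True, 157, 1, 0),
    (False, False, 266142, 5339, 5),
]


def per_upload(count, uploads):
    """Return requests per uploaded file, i.e. the y in a 1:y ratio."""
    return count / uploads if uploads else float("nan")


def summarize(selected):
    """Aggregate a list of rows into upload totals and per-upload ratios."""
    uploads = sum(r[2] for r in selected)
    return {
        "uploads": uploads,
        "deletion_ratio": per_upload(sum(r[3] for r in selected), uploads),
        "speedy_ratio": per_upload(sum(r[4] for r in selected), uploads),
    }


wizard = summarize([r for r in rows if r[0]])
other = summarize([r for r in rows if not r[0]])

for label, stats in (("upload wizard", wizard), ("other methods", other)):
    print(f"{label}: 1:{stats['deletion_ratio']:.3f} (deletion requests), "
          f"1:{stats['speedy_ratio']:.6f} (speedy deletions)")

# Per-group breakdown by own work / not own work
for wizard_flag, own_work, uploads, dr, sd in rows:
    print(wizard_flag, own_work, f"1:{per_upload(dr, uploads):.3f}", f"1:{per_upload(sd, uploads):.6f}")
```

Speedy deletions are rare in this snapshot (10 for Upload Wizard files and 5 for other uploads), so their ratios are shown with more decimal places than the deletion request ratios.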