# Media files containing captions in English/non-English languages

[T252443](https://phabricator.wikimedia.org/T252443)

This set of metrics compares, on a monthly basis, media files containing structured fields in English and in non-English languages. It includes:
- Monthly changes in media files with captions in English (en) and non-English (non-en) languages
- Time to add captions to media files

In [1]:
import re

import wmfdata 
from wmfdata import hive, spark

You are using wmfdata v1.3.1, but v1.3.3 is available.

To update, run `pip install --upgrade git+https://github.com/wikimedia/wmfdata-python.git@release --ignore-installed`.

To see the changes, refer to https://github.com/wikimedia/wmfdata-python/blob/release/CHANGELOG.md


In [2]:
spark = wmfdata.spark.get_session(app_name='pyspark regular',
                                  type='yarn-large', # local, yarn-regular, yarn-large
                                  )  

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.


PYSPARK_PYTHON=/usr/lib/anaconda-wmf/bin/python3



## Configuring Timestamps


In [3]:
wmf_snapshot = '2022-02'
start_date = '2022-02-01'
end_date = '2022-03-01' # last creation date
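
The three values above have to stay in sync: the date range should cover exactly the snapshot month. As a minimal sketch (the helper `month_bounds` is hypothetical and not part of the notebook or of wmfdata), both dates can be derived from the snapshot string:

```python
from datetime import date
from typing import Tuple


def month_bounds(snapshot: str) -> Tuple[str, str]:
    """Return (start_date, end_date) for a 'YYYY-MM' snapshot string.

    start_date is the first day of the month (inclusive) and end_date is
    the first day of the following month (exclusive).
    """
    year, month = (int(part) for part in snapshot.split("-"))
    start = date(year, month, 1)
    # Roll over to January of the next year after December.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start.isoformat(), end.isoformat()


wmf_snapshot = "2022-02"
start_date, end_date = month_bounds(wmf_snapshot)
print(start_date, end_date)  # 2022-02-01 2022-03-01
```

Deriving the bounds this way avoids editing three literals by hand each month.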

## Approach

According to [this ticket](https://phabricator.wikimedia.org/T231952#5710215), it's not straightforward to identify which pages contain captions in specific languages using the replicated MediaWiki databases. We would expect the `wbc_entity_usage` table to provide this information, but inspecting its entries for a couple of pages reveals that it contains captions ("labels") that are not shown on the corresponding page on Commons.

Instead, we reuse an approach that identifies a page getting a label added, changed, or deleted through its edit comments, with the `mediawiki_history` table in the Data Lake as the source of truth. A caption addition is recorded with the edit comment "Added [de] caption", where "[de]" is the language code (German in this case; English also appears), as well as with "wbsetlabel-add:".
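
The comment matching described above can be sketched in plain Python. This is an illustration, not the notebook's code: the sample comment shape `/* wbsetlabel-add:1|de */ ...` is an assumption inferred from the pattern used in the queries, and the helper names `caption_language` and `language_group` are hypothetical.

```python
import re
from typing import Optional

# Assumed comment shape: "/* wbsetlabel-add:1|de */ caption text".
# Three leading characters, the marker, one digit, one separator character,
# then the language code (optionally with one "-suffix").
LABEL_ADD = re.compile(r"^...wbsetlabel-add:\d.(\w+(?:-\w+)?)")

# English group: "simple", "en", or any "en-*" variant. The alternation is
# parenthesized so the ^ and $ anchors apply to every branch.
ENGLISH = re.compile(r"^(simple|en|en-.+)$")


def caption_language(comment: str) -> Optional[str]:
    """Return the language code of a caption-add comment, or None."""
    match = LABEL_ADD.match(comment)
    return match.group(1) if match else None


def language_group(code: str) -> str:
    """Classify a language code as 'en' or 'non-en'."""
    return "en" if ENGLISH.match(code) else "non-en"


code = caption_language("/* wbsetlabel-add:1|de */ Brandenburger Tor")
print(code, language_group(code))  # de non-en
```

Grouping the alternation matters: without the parentheses, a bare `en` branch would match any code that merely contains "en".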


The majority of these operations are additions (about 1.6 million edits through December 2021); changes and deletions are two orders of magnitude fewer. Because of this large difference, we ignore all changes and deletions and accept that an estimate might be off by about 10,000 pages, which is small relative to the total number of pages.


## Aggregation Tables


We define a set of tables in the Data Lake for aggregation of results.



In [4]:
caption_count_table = 'cchen_sd.caption_counts'
caption_add_table = 'cchen_sd.caption_add_time'

In [14]:
create_count_table_query = '''
CREATE TABLE IF NOT EXISTS {table_name} (
    month DATE COMMENT "the month of the aggregated caption counts",
    num_captions BIGINT COMMENT "Number of files with captions",
    num_captions_non_en BIGINT COMMENT "Number of files with non-English captions",
    num_captions_en BIGINT COMMENT "Number of files with English captions",
    num_captions_both BIGINT COMMENT "Number of files with both (sum of the latter two minus the first)"
)
'''

In [15]:
create_time_table_query = '''
CREATE TABLE IF NOT EXISTS {table_name} (
    month DATE COMMENT "the month of the aggregated caption time counts",
    caption_language STRING COMMENT "English or non-English captions",
    caption_time STRING COMMENT "How quickly after creation do captions get added",
    max_caption_time BIGINT COMMENT "max caption time within each bucket (for Superset chart)",
    num_captions BIGINT COMMENT "Aggregated number of captions"
)
'''

In [16]:
hive.run(create_count_table_query.format(
            table_name = caption_count_table
))

In [17]:
hive.run(create_time_table_query.format(
            table_name = caption_add_table
))

## Number of captions

In [5]:
caption_count_query = '''
WITH captions_counts AS ( 
    SELECT
        "{start_date}" AS month, 
        COUNT(DISTINCT page_id) AS num_captions,
        -- English means simple, en, or en-*. The alternation is grouped so that
        -- the ^ and $ anchors apply to every branch, not only the first and last.
        COUNT(DISTINCT(CASE WHEN regexp_extract(event_comment, "^...wbsetlabel-add:\\\\d.(\\\\w+(-\\\\w+)?)", 1) 
                               NOT REGEXP "^(simple|en|en-.+)$" 
                            THEN page_id 
                        END)) AS num_captions_non_en,
        COUNT(DISTINCT(CASE WHEN regexp_extract(event_comment, "^...wbsetlabel-add:\\\\d.(\\\\w+(-\\\\w+)?)", 1) 
                               REGEXP "^(simple|en|en-.+)$" 
                            THEN page_id 
                        END)) AS num_captions_en
    FROM wmf.mediawiki_history
    WHERE snapshot = "{snapshot}"
    AND wiki_db = "commonswiki"
    AND event_entity = "revision"
    AND event_type = "create"
    AND event_timestamp >= "{start_date}"
    AND event_timestamp < "{end_date}"
    AND page_is_deleted = false -- only count live pages
    AND page_namespace = 6 -- only count files
    AND event_comment REGEXP "^...wbsetlabel-add"
)

INSERT INTO {aggregate_table}
SELECT 
    month,
    num_captions,
    num_captions_non_en,
    num_captions_en,
    (num_captions_non_en + num_captions_en - num_captions) AS num_captions_both
FROM captions_counts

'''

In [6]:
spark.sql(caption_count_query.format(
    snapshot = wmf_snapshot,
    start_date = start_date,
    end_date = end_date,
    aggregate_table = caption_count_table
))

22/03/30 06:16:01 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
22/03/30 06:16:50 ERROR YarnScheduler: Lost executor 36 on an-worker1105.eqiad.wmnet: Container killed by YARN for exceeding memory limits.  9.1 GB of 8.8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
22/03/30 06:18:35 ERROR YarnScheduler: Lost executor 116 on an-worker1139.eqiad.wmnet: Container killed by YARN for exceeding memory limits.  8.9 GB of 8.8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
22/03/30 06:18:35 WARN TaskSetManager: Lost task 6062.0 in stage 1.0 (TID 5472, an-worker1139.eqiad.wmnet, executor 116): ExecutorLostFailure (executor 116 exited caused by one of the running tasks) R

DataFrame[]
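
The `num_captions_both` column relies on inclusion-exclusion: a file counted in both the English and the non-English group is counted twice in their sum, so subtracting the distinct total leaves the overlap. A small sketch with made-up page IDs (the sets and the helper `count_both` are hypothetical, for illustration only):

```python
def count_both(num_en: int, num_non_en: int, num_total: int) -> int:
    """Files with both English and non-English captions, by inclusion-exclusion."""
    return num_en + num_non_en - num_total


# Hypothetical page IDs of files that received captions in the month.
en_files = {101, 102, 103}
non_en_files = {103, 104}

num_total = len(en_files | non_en_files)  # distinct files with any caption
num_both = count_both(len(en_files), len(non_en_files), num_total)

# The arithmetic result agrees with the direct set intersection.
print(num_both, len(en_files & non_en_files))  # 1 1
```

The identity holds only if every counted file falls into at least one of the two groups, which is the case when each matched comment is classified as either English or non-English.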

## Time to add captions

In [7]:
caption_add_query = '''
WITH captions_time AS (
    SELECT 
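        -- English means simple, en, or en-*. The alternation is grouped so that
        -- the ^ and $ anchors apply to every branch, not only the first and last.
        CASE WHEN regexp_extract(event_comment, "^...wbsetlabel-add:\\\\d.(\\\\w+(-\\\\w+)?)", 1)
                 NOT REGEXP "^(simple|en|en-.+)$" THEN "non-en"
             WHEN regexp_extract(event_comment, "^...wbsetlabel-add:\\\\d.(\\\\w+(-\\\\w+)?)", 1)
                 REGEXP "^(simple|en|en-.+)$" THEN "en"
        END AS caption_language,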
        unix_timestamp(event_timestamp) - unix_timestamp(page_creation_timestamp) AS time_to_caption
    FROM wmf.mediawiki_history
    WHERE snapshot = "{snapshot}"
    AND wiki_db = "commonswiki"
    AND event_entity = "revision"
    AND event_type = "create"
    AND event_timestamp >= "{start_date}"
    AND event_timestamp < "{end_date}"
    AND page_is_deleted = false -- only count live pages
    AND page_namespace = 6 -- only count files
    AND event_comment REGEXP "^...wbsetlabel-add"
),

bucketed_captions_time AS (
    SELECT 
        caption_language,
        CASE WHEN time_to_caption < 60 THEN '0-1min'
             WHEN time_to_caption >= 60 and time_to_caption < 5*60 THEN '1-5min'
             WHEN time_to_caption >= 5*60 and time_to_caption < 60*60 THEN '5-60min'
             WHEN time_to_caption >= 60*60 and time_to_caption < 60*60*12 THEN '1-12h'
             WHEN time_to_caption >= 60*60*12 and time_to_caption < 60*60*24 THEN '12-24h'
             WHEN time_to_caption >= 60*60*24 and time_to_caption < 7*24*60*60 THEN '1d-1w'
             WHEN time_to_caption >= 60*60*24*7 and time_to_caption < 30*24*60*60 THEN '1w-1m'
             WHEN time_to_caption >= 30*24*60*60 and time_to_caption < 180*24*60*60 THEN '1m-6m'
             WHEN time_to_caption >= 60*60*24*180 and time_to_caption < 365*24*60*60 THEN '6m-1y'
             WHEN time_to_caption >= 60*60*24*365 and time_to_caption < 365*24*60*60*5 THEN '1y-5y'
             WHEN time_to_caption >= 365*24*60*60*5 THEN '5y+'
        END AS caption_time,
        time_to_caption
    FROM captions_time
)

INSERT INTO {aggregate_table}
SELECT
    "{start_date}" AS month, 
    caption_language,
    caption_time,
    MAX(time_to_caption) AS max_caption_time, 
    COUNT(*) AS num_captions
FROM bucketed_captions_time
GROUP BY caption_language,caption_time
'''

In [8]:
spark.sql(caption_add_query.format(
    snapshot = wmf_snapshot,
    start_date = start_date,
    end_date = end_date,
    aggregate_table = caption_add_table
))

22/03/30 06:19:33 ERROR YarnScheduler: Lost executor 39 on an-worker1105.eqiad.wmnet: Container killed by YARN for exceeding memory limits.  9.0 GB of 8.8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
22/03/30 06:19:33 WARN TaskSetManager: Lost task 920.0 in stage 5.0 (TID 7299, an-worker1105.eqiad.wmnet, executor 39): ExecutorLostFailure (executor 39 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits.  9.0 GB of 8.8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
22/03/30 06:19:33 WARN TaskSetManager: Lost task 1363.0 in stage 5.0 (TID 7774, an-worker1105.eqiad.wmnet, executor 39): ExecutorLostFailure (executor 39 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits.  9.0 GB of 8.8 

DataFrame[]
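
For spot-checking the bucket boundaries outside Spark, the `CASE` expression in `bucketed_captions_time` can be mirrored in Python. This is a sketch, not part of the pipeline; the function name `caption_time_bucket` is hypothetical, and the labels and thresholds are copied from the query above.

```python
from bisect import bisect_right

MINUTE, HOUR, DAY = 60, 60 * 60, 24 * 60 * 60

# Exclusive upper bound (in seconds) of every bucket except the last one.
UPPER_BOUNDS = [
    MINUTE, 5 * MINUTE, HOUR, 12 * HOUR, DAY,
    7 * DAY, 30 * DAY, 180 * DAY, 365 * DAY, 5 * 365 * DAY,
]
LABELS = [
    "0-1min", "1-5min", "5-60min", "1-12h", "12-24h",
    "1d-1w", "1w-1m", "1m-6m", "6m-1y", "1y-5y", "5y+",
]


def caption_time_bucket(seconds: int) -> str:
    """Map a time-to-caption in seconds to its bucket label.

    bisect_right counts the bounds that are <= seconds, so a value equal
    to a bound lands in the next bucket, matching the >= / < comparisons
    in the SQL.
    """
    return LABELS[bisect_right(UPPER_BOUNDS, seconds)]


for sample in (0, 60, 3599, DAY, 5 * 365 * DAY):
    print(sample, caption_time_bucket(sample))
```

As in the SQL, a negative difference (a caption timestamp earlier than the recorded page creation) would fall into the `0-1min` bucket.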