# Media Usage Across Wikis

This is a proof-of-concept notebook showing that, with the data we have available, it's possible to identify usage of media across all Wikimedia projects and to determine whether that media is hosted on Commons, hosted locally, or appears to be missing. The Phabricator task associated with this work is [T265768](https://phabricator.wikimedia.org/T265768).

We take the query from [T247417#6017438](https://phabricator.wikimedia.org/T247417#6017438) and modify it. We join with `mediawiki_image`, which allows us to identify all media hosted on the given wiki. We also join with `mediawiki_page`, restricted to file pages (namespace 6) on `commonswiki`, to correctly identify media hosted on Commons. If a file is in use but not found in either place, we label it a redlink.

It's worth noting that the Commons wiki has a database table that tracks usage of an image on other wikis, but this table is optimized for looking up usage of a specific image so it can be shown on the file page ([here's an example](https://commons.wikimedia.org/wiki/File:Black_hole_-_Messier_87_crop_max_res.jpg)). That database table should not be used for aggregations per wiki, which is our use case.

We also note that the `image` table is needed to correctly identify locally hosted files. One might think that the `page` table could be used, but it's possible to create a file page on a local wiki for a file that's hosted on Commons (see [nn:Fil:Attendekall.jpg](https://nn.wikipedia.org/wiki/Fil:Attendekall.jpg) for an example).
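To make the classification logic concrete before running it on the cluster, here is a minimal sketch in plain Python that mirrors the two left joins and the `CASE` expression of the query on toy data. The file names other than `Attendekall.jpg`, and the contents of all three collections, are made up for illustration; the helper names `classify` and `aggregate` are hypothetical and not part of `wmfdata`.

```python
from collections import defaultdict

# Hypothetical toy data standing in for the three tables the query reads.
# (wiki_db, il_to): one row per use of a file on a content page.
image_links = [
    ("nnwiki", "Attendekall.jpg"),
    ("nnwiki", "Attendekall.jpg"),
    ("nnwiki", "Local_logo.png"),
    ("nnwiki", "Missing_file.jpg"),
    ("enwiki", "Local_logo.png"),
]
# (wiki_db, img_name): files actually stored on each wiki (the image table).
local_images = {("nnwiki", "Local_logo.png")}
# Titles of file pages (namespace 6) on Commons (the page table).
commons_titles = {"Attendekall.jpg", "Local_logo.png"}


def classify(wiki_db, file_name):
    """Mirror the CASE expression: local wins over Commons, else redlink."""
    if (wiki_db, file_name) in local_images:
        return "local"
    if file_name in commons_titles:
        return "commons"
    return "redlink"


def aggregate(links):
    """Return {(wiki_db, media_source): (num_file_uses, num_files)}."""
    uses = defaultdict(int)
    files = defaultdict(set)
    for wiki_db, file_name in links:
        key = (wiki_db, classify(wiki_db, file_name))
        uses[key] += 1             # every link counts as one use
        files[key].add(file_name)  # distinct files per wiki and source
    return {key: (uses[key], len(files[key])) for key in uses}


stats = aggregate(image_links)
for key in sorted(stats):
    print(key, stats[key])
```

Note that the same file name can be classified differently on different wikis: in this toy data `Local_logo.png` is local on `nnwiki` but resolves to Commons on `enwiki`, because the local lookup is keyed on both the wiki and the file name.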

In [2]:
from wmfdata import spark

In [3]:
media_usage_query = '''
WITH ims AS ( -- image uses from content namespaces
    SELECT m.wiki_db, m.il_to
    FROM wmf_raw.mediawiki_imagelinks AS m
    INNER JOIN wmf_raw.mediawiki_project_namespace_map AS ns
    ON ns.dbname = m.wiki_db
    AND ns.namespace = m.il_from_namespace
    WHERE m.snapshot = "{snapshot}"
    AND ns.snapshot = "{snapshot}"
    AND ns.namespace_is_content = 1
),
lp AS ( -- local files
    SELECT wiki_db, img_name
    FROM wmf_raw.mediawiki_image
    WHERE snapshot = "{snapshot}"
),
cp AS ( -- files from Commons
    SELECT wiki_db, page_title
    FROM wmf_raw.mediawiki_page
    WHERE snapshot = "{snapshot}"
    AND wiki_db = "commonswiki"
    AND page_namespace = 6
)
SELECT ims.wiki_db,
    CASE
        WHEN lp.img_name IS NOT NULL THEN "local"
        WHEN cp.page_title IS NOT NULL THEN "commons"
        ELSE "redlink"
    END AS media_source,
    COUNT(*) AS num_file_uses,
    COUNT(DISTINCT ims.il_to) AS num_files
FROM ims
LEFT JOIN lp
ON ims.wiki_db = lp.wiki_db
AND ims.il_to = lp.img_name
LEFT JOIN cp
ON ims.il_to = cp.page_title
GROUP BY ims.wiki_db, media_source
'''

In [4]:
usage_stats = spark.run(media_usage_query.format(
    snapshot='2020-11'
))

The query as written provides us with aggregated counts showing the number of distinct files in use from a given source (`num_files`) and the number of times those files are used on content pages (`num_file_uses`). Below are a couple of examples using the November 2020 snapshot. That snapshot was created at the beginning of December 2020, so it reflects the state of file usage at that time.

We can see that English Wikipedia had 786,800 local files that were in use on content pages, and they were used a total of 6,559,706 times. English Wikipedia also used 4,768,444 files from Commons, and these were used 21,457,937 times. There were also 6,169 files referenced that did not exist, used 6,869 times.

Nynorsk Wikipedia ("nnwiki") presents a very different picture: it had only 11 local files in use, for a total of 14 uses. That wiki instead relied on media from Commons: 140,559 files, used 407,347 times.

In [6]:
usage_stats.loc[usage_stats['wiki_db'] == 'enwiki']

| Unnamed: 0 | wiki_db | media_source | num_file_uses | num_files |
|---|---|---|---|---|
| 722 | enwiki | commons | 21457937 | 4768444 |
| 1784 | enwiki | local | 6559706 | 786800 |
| 1807 | enwiki | redlink | 6869 | 6169 |


In [7]:
usage_stats.loc[usage_stats['wiki_db'] == 'nnwiki']

| Unnamed: 0 | wiki_db | media_source | num_file_uses | num_files |
|---|---|---|---|---|
| 220 | nnwiki | local | 14 | 11 |
| 690 | nnwiki | commons | 407347 | 140559 |
| 1144 | nnwiki | redlink | 621 | 427 |
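As one possible follow-up to the raw counts, the sketch below computes the share of file uses per source for each wiki. It uses plain Python and the six rows copied from the two outputs above, so it runs without cluster access; the helper name `use_shares` is hypothetical.

```python
# Rows copied from the enwiki and nnwiki outputs above:
# (wiki_db, media_source, num_file_uses, num_files)
rows = [
    ("enwiki", "commons", 21457937, 4768444),
    ("enwiki", "local", 6559706, 786800),
    ("enwiki", "redlink", 6869, 6169),
    ("nnwiki", "local", 14, 11),
    ("nnwiki", "commons", 407347, 140559),
    ("nnwiki", "redlink", 621, 427),
]


def use_shares(rows):
    """Return {(wiki_db, media_source): share of that wiki's file uses}."""
    totals = {}
    for wiki_db, _source, num_uses, _num_files in rows:
        totals[wiki_db] = totals.get(wiki_db, 0) + num_uses
    return {
        (wiki_db, source): num_uses / totals[wiki_db]
        for wiki_db, source, num_uses, _num_files in rows
    }


shares = use_shares(rows)
for (wiki_db, source), share in sorted(shares.items()):
    print(f"{wiki_db:8s} {source:8s} {share:.2%}")
```

Only `num_file_uses` is summed here. `num_files` is a distinct count per wiki and source, so adding it up across wikis would count a Commons file once for every wiki that uses it.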


This method lets us compute these aggregates on a monthly basis from the snapshots that are available and then process the results further (e.g. visualize them). We could also modify the query to count the number of distinct Commons files that are in use on other wikis and the number of times they are used, and similarly the number of distinct files that are hosted locally on wikis and how many times they are used.
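A minimal sketch of the monthly aggregation could look like the following. It assumes that snapshots are labelled `YYYY-MM` (as with `2020-11` above) and that every month in the requested range is actually available, which would need to be checked. The helpers `monthly_snapshots` and `collect_monthly_stats`, the stand-in runner `fake_run`, and the example template are all hypothetical; in the notebook, `spark.run` and `media_usage_query` would take the place of the last two.

```python
def monthly_snapshots(first, last):
    """Return snapshot labels 'YYYY-MM' from first to last, inclusive."""
    year, month = (int(part) for part in first.split("-"))
    last_year, last_month = (int(part) for part in last.split("-"))
    labels = []
    while (year, month) <= (last_year, last_month):
        labels.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return labels


def collect_monthly_stats(query_template, snapshots, run_query):
    """Run the formatted query once per snapshot; return {snapshot: result}."""
    return {
        snapshot: run_query(query_template.format(snapshot=snapshot))
        for snapshot in snapshots
    }


def fake_run(query):
    # Stand-in for spark.run: returns the query text instead of a DataFrame.
    return query


# Hypothetical one-line template; the real one would be media_usage_query.
template = 'SELECT COUNT(*) FROM some_table WHERE snapshot = "{snapshot}"'

results = collect_monthly_stats(
    template, monthly_snapshots("2020-11", "2021-02"), fake_run
)
for snapshot, result in results.items():
    print(snapshot, "->", result)
```

Passing the runner in as an argument keeps the loop testable without a Spark session. With real DataFrames, the per-snapshot results could then be tagged with their snapshot label and concatenated for plotting.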