# SDC Usage through Lua

This is a small addendum to the analysis done in [T231952](https://phabricator.wikimedia.org/T231952).

We're interested in understanding the impact SDC (Structured Data on Commons) has now that the Information and Artwork templates have been updated to pull in SDC data through Lua. This is relatively straightforward to measure because, as we learned from Matthias Mullie's explanation in [T231952#5717638](https://phabricator.wikimedia.org/T231952#5717638), `wbc_entity_usage` is populated when data is used through Lua. Measuring the impact therefore becomes a question of counting the files that show up in this table and use either the Information or the Artwork template.

In [1]:
import pandas as pd
import numpy as np

import datetime as dt

from wmfdata import hive, mariadb

You can find the source for `wmfdata` at https://github.com/neilpquinn/wmfdata


## What aspects are in use?

The `wbc_entity_usage` table has a column `eu_aspect`, which tracks which aspect of an entity is in use. I therefore first ran the following query in a terminal to get a sense of which aspects are in use, in other words what to look for.

```
SELECT eu_aspect, count(*) AS num_pages
FROM wbc_entity_usage
GROUP BY eu_aspect
ORDER BY num_pages DESC
LIMIT 250
```

Looking at this, we have the following aspects:

* S: sitelinks
* L: labels, also known as "Captions"
* O: statement
* T: title
* D: this appears to be used to link Commons categories to Wikidata items (ref [T238878#5685577](https://phabricator.wikimedia.org/T238878#5685577))
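The aspect values can carry a modifier after a dot (for example a language code, which is why the caption query further down matches on `^L` rather than on `L` exactly). A minimal sketch of collapsing such values to their leading letter before counting, assuming made-up rows; `sample_rows` and `base_aspect` are hypothetical names, and the numbers are not results from the real table:

```python
from collections import Counter

# Hypothetical (eu_aspect, num_pages) rows, invented for illustration only.
sample_rows = [
    ("L.en", 120),
    ("L.de", 30),
    ("O", 75),
    ("S", 40),
    ("T", 15),
]


def base_aspect(aspect: str) -> str:
    """Return the aspect letter, dropping any '.modifier' suffix."""
    return aspect.split(".", 1)[0]


# Sum the per-aspect row counts under their base letter.
totals = Counter()
for aspect, num_pages in sample_rows:
    totals[base_aspect(aspect)] += num_pages

for aspect, num_pages in totals.most_common():
    print(aspect, num_pages)
```

Note that summing across modifiers counts usage rows, not distinct pages: a page that uses both `L.en` and `L.de` contributes twice.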

However, if we dig into the [source code of the Information template](https://commons.wikimedia.org/w/index.php?title=Module:Information&oldid=375848825) (in this case the most recent revision as of Dec 15), we see that it only pulls in descriptions (captions). It's less clear what the Artwork template does, but it at least grabs a whole bunch of information from Wikidata. So, I ran the following query to get some information out of `wbc_entity_usage` for files containing the Artwork template.

```
SELECT eu_aspect
FROM templatelinks
JOIN wbc_entity_usage
ON tl_from = eu_page_id
WHERE tl_from_namespace = 6
AND tl_namespace = 10
AND tl_title = "Artwork"
LIMIT 250
```

It's a lot of S, O, and L entries. We know that O is a statement and L is a caption, but what is "S"? According to [the manual](https://www.mediawiki.org/wiki/Wikibase/Schema/wbc_entity_usage), it's "sitelinks".

Based on this, we'd like to know, for each of the Information and Artwork templates, the number of pages with the template, the number of them using captions, the number of them using statements, and the number of them using both.
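One way to get all four numbers in a single pass is conditional aggregation: compute per-file flags first, then sum them. The following is a minimal sketch against an in-memory SQLite database with invented toy rows, not the query used in this notebook. It assumes, following the list above, that `O` marks statement usage, and it uses `substr` instead of `REGEXP` because SQLite has no built-in `REGEXP`; `summary_query` is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE templatelinks (
    tl_from INTEGER, tl_from_namespace INTEGER,
    tl_namespace INTEGER, tl_title TEXT
);
CREATE TABLE wbc_entity_usage (
    eu_page_id INTEGER, eu_entity_id TEXT, eu_aspect TEXT
);
-- Toy rows, invented for illustration only.
INSERT INTO templatelinks VALUES
    (1, 6, 10, 'Artwork'), (2, 6, 10, 'Artwork'),
    (3, 6, 10, 'Artwork'), (4, 6, 10, 'Artwork'),
    (5, 6, 10, 'Information');
INSERT INTO wbc_entity_usage VALUES
    (1, 'M1', 'L.en'), (1, 'M1', 'O'),  -- caption and statement
    (2, 'M2', 'L.de'),                  -- caption only
    (3, 'M3', 'O'),                     -- statement only
    (3, 'Q42', 'L.en');                 -- label of a Wikidata item, not MediaInfo
""")

summary_query = """
SELECT COUNT(*) AS num_files,
       SUM(has_caption) AS num_caption,
       SUM(has_statement) AS num_statement,
       SUM(has_caption AND has_statement) AS num_both
FROM (
    -- One row per file, with 0/1 flags; files without usage rows get 0.
    SELECT tl_from,
           COALESCE(MAX(substr(eu_aspect, 1, 1) = 'L'), 0) AS has_caption,
           COALESCE(MAX(eu_aspect = 'O'), 0) AS has_statement
    FROM templatelinks
    LEFT JOIN wbc_entity_usage
      ON tl_from = eu_page_id
     AND substr(eu_entity_id, 1, 1) = 'M'  -- MediaInfo entities only
    WHERE tl_from_namespace = 6
      AND tl_namespace = 10
      AND tl_title = ?
    GROUP BY tl_from
)
"""

artwork_summary = conn.execute(summary_query, ("Artwork",)).fetchone()
information_summary = conn.execute(summary_query, ("Information",)).fetchone()
print(artwork_summary)      # (4, 2, 2, 1)
print(information_summary)  # (1, 0, 0, 0)
```

The entity filter sits in the `ON` clause rather than in `WHERE` so that files with the template but without any matching usage row are still counted in `num_files`.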

I further dug into the aspects used for the Artwork template through the following query:

```
SELECT *
FROM templatelinks
JOIN wbc_entity_usage
ON tl_from = eu_page_id
WHERE tl_from_namespace = 6
AND tl_namespace = 10
AND tl_title = "Artwork"
AND eu_entity_id REGEXP "^M"
AND eu_aspect != "T"
LIMIT 250
```

It looks like a lot of files populate the template with the title, which isn't really part of SDC, which is why the query above excludes the `T` aspect. It only returns 26 files, though. How many of the pages that contain the template also pull in the title?

In [5]:
title_template_query = '''
SELECT count(*), SUM(IF(eu_page_id IS NOT NULL, 1, 0)) AS num_with_title
FROM templatelinks
LEFT JOIN wbc_entity_usage
ON tl_from = eu_page_id
WHERE tl_from_namespace = 6
AND tl_namespace = 10
AND tl_title = "Artwork"
AND eu_entity_id REGEXP "^M"
AND eu_aspect = "T"
'''

In [6]:
mariadb.run(title_template_query, 'commonswiki')

|   | `count(*)` | `num_with_title` |
|---|---|---|
| 0 | 3291 | 3291.0 |


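The two counts are identical, which at first glance suggests that every file uses the title. Note, however, that the `WHERE` conditions on `eu_entity_id` and `eu_aspect` discard the unmatched rows of the `LEFT JOIN`, so the counts are equal by construction. The result tells us that 3,291 rows match, not what share of the files with the Artwork template pull in the title.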
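To illustrate how the placement of a filter changes what a `LEFT JOIN` count means, here is a minimal sketch against an in-memory SQLite database with invented toy rows (three files with the template, one of which uses the title). It is not the query from this notebook; `where_filtered` and `on_filtered` are hypothetical names, and `substr` stands in for `REGEXP`, which SQLite does not provide out of the box.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE templatelinks (tl_from INTEGER, tl_title TEXT);
CREATE TABLE wbc_entity_usage (
    eu_page_id INTEGER, eu_entity_id TEXT, eu_aspect TEXT
);
-- Toy rows, invented for illustration only.
INSERT INTO templatelinks VALUES (1, 'Artwork'), (2, 'Artwork'), (3, 'Artwork');
INSERT INTO wbc_entity_usage VALUES (1, 'M1', 'T'), (2, 'M2', 'L.en');
""")

# Filtering on the right-hand table in WHERE removes the NULL-extended rows,
# so the LEFT JOIN behaves like an inner join and both counts always agree.
where_filtered = conn.execute("""
SELECT COUNT(*),
       SUM(CASE WHEN eu_page_id IS NOT NULL THEN 1 ELSE 0 END)
FROM templatelinks
LEFT JOIN wbc_entity_usage
  ON tl_from = eu_page_id
WHERE tl_title = 'Artwork'
  AND substr(eu_entity_id, 1, 1) = 'M'
  AND eu_aspect = 'T'
""").fetchone()

# Moving the same conditions into ON keeps files without a match, so the
# first count is all files with the template and the second only those
# that actually use the title.
on_filtered = conn.execute("""
SELECT COUNT(DISTINCT tl_from),
       COUNT(DISTINCT eu_page_id)
FROM templatelinks
LEFT JOIN wbc_entity_usage
  ON tl_from = eu_page_id
 AND substr(eu_entity_id, 1, 1) = 'M'
 AND eu_aspect = 'T'
WHERE tl_title = 'Artwork'
""").fetchone()

print(where_filtered)  # (1, 1)
print(on_filtered)     # (3, 1)
```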

I don't think it's meaningful to keep digging down this rabbit hole at this point, so I'll instead focus on the number of pages that use labels (captions).

## Queries

In [7]:
## Query to count the number of files that use a given template
template_count_query = '''
SELECT count(DISTINCT tl_from) AS num_files
FROM templatelinks
WHERE tl_namespace = 10 -- template
AND tl_title = "{template_title}"
AND tl_from_namespace = 6 -- only files
'''

In [8]:
## Query to count the number of files that use a given template
## and use Lua to pull in the file's caption.

caption_count_query = '''
SELECT count(DISTINCT tl_from) AS num_files
FROM templatelinks
JOIN wbc_entity_usage
ON tl_from = eu_page_id
WHERE tl_namespace = 10 -- template
AND tl_title = "{template_title}"
AND tl_from_namespace = 6 -- only files
AND eu_entity_id REGEXP "^M" -- pulling from MediaInfo
AND eu_aspect REGEXP "^L" -- pulling in a label (caption), in any language
'''

## Information template

Number of files that contain the template:

In [10]:
mariadb.run(template_count_query.format(template_title = 'Information'), 'commonswiki')

|   | `count(DISTINCT tl_from)` |
|---|---|
| 0 | 51836906 |


Number of files that contain the template and pull in the description (in any language):

In [11]:
mariadb.run(caption_count_query.format(template_title = 'Information'), 'commonswiki')

|   | `count(DISTINCT tl_from)` |
|---|---|
| 0 | 2013 |


## Artwork template

Number of files that contain the template:

In [12]:
mariadb.run(template_count_query.format(template_title = 'Artwork'), 'commonswiki')

|   | `count(DISTINCT tl_from)` |
|---|---|
| 0 | 2429145 |


Number of files that contain the template and pull in the description (in any language):

In [13]:
mariadb.run(caption_count_query.format(template_title = 'Artwork'), 'commonswiki')

|   | `count(DISTINCT tl_from)` |
|---|---|
| 0 | 17 |


## Conclusion

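Pulling captions in through Lua isn't in widespread use currently: 2,013 of about 51.8 million files with the Information template (roughly 0.004%) and 17 of about 2.4 million files with the Artwork template (roughly 0.0007%) do so. It's perhaps worth monitoring, though, to understand whether it is replacing existing usage, in other words whether newly uploaded files get a caption in SDC that the template then adds automatically, rather than users adding the description in the wikitext.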
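To put the counts from the two template sections in proportion, the share of files that pull in a caption can be computed directly from the four query results. A small sketch, with the numbers copied from the outputs above; `counts` and `caption_share` are hypothetical names:

```python
# Counts copied from the query results in the two template sections.
counts = {
    "Information": {"num_files": 51836906, "num_with_caption": 2013},
    "Artwork": {"num_files": 2429145, "num_with_caption": 17},
}


def caption_share(num_with_caption: int, num_files: int) -> float:
    """Percentage of files with the template that pull in a caption."""
    return 100.0 * num_with_caption / num_files


for template, c in counts.items():
    share = caption_share(c["num_with_caption"], c["num_files"])
    print(f"{template}: {share:.4f}% of files pull in a caption")
# Information: 0.0039% of files pull in a caption
# Artwork: 0.0007% of files pull in a caption
```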