# comparing works' completeness vs clicks

Is there a correlation between how well described a work is (the number of fields filled out, the number of tokens in those fields, etc) and how many times it's accessed? Can we see patterns in where cataloguing effort is being spent vs the 'reward' of access?

In [None]:
import gzip
import json
import os
import shutil
from pathlib import Path

import httpx
import pandas as pd
import seaborn as sns
from tqdm.notebook import tqdm
from weco_datascience import reporting

## start by collecting the data we'll use for the analysis

In [None]:
data_dir = Path("../data")

# create the data directory if it doesn't already exist
data_dir.mkdir(parents=True, exist_ok=True)

### clicks

This data comes from our reporting cluster. We just want clicks on works in search results for now. Could expand this to look at images if we needed to.

In [None]:
clicks = reporting.query_es(
    config=os.environ,
    index="metrics-conversion-prod",
    query={
        "size": 100000,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                "must": [
                    {"range": {"@timestamp": {"lte": "2021-10-01T00:00:00.000Z"}}},
                    {
                        "term": {
                            "page.name": {
                                "value": "work",
                            }
                        }
                    }
                    # {
                    #     "term": {
                    #         "page.name": {
                    #             "value": "image",
                    #         }
                    #     }
                    # },
                ],
            }
        },
    },
)

clicks.to_json(data_dir / "searches.json")

In [None]:
clicks.head()

Get a count of how many times each work is clicked on.

In [None]:
click_counts = clicks["page.query.id"].value_counts()
click_counts

In [None]:
click_counts.plot();

Most works are almost never clicked on, and a few works are clicked on a lot. Unsurprisingly, most of the most-clicked works are images of naked people, genitals etc.
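
The shape of that distribution can be put into numbers as well as read off the plot. Here's a minimal sketch using made-up work IDs (`toy_clicked_ids` is hypothetical and stands in for the values of `clicks["page.query.id"]`), which reports the median clicks per work and the share of all clicks taken by the single most-clicked work.

```python
from collections import Counter
from statistics import median

# Hypothetical work IDs, one entry per click. Not real data.
toy_clicked_ids = ["a"] * 40 + ["b"] * 5 + ["c"] * 2 + ["d", "e", "f", "g", "h"]

# Count clicks per work, then sort from most to least clicked.
toy_click_counts = Counter(toy_clicked_ids)
toy_counts = sorted(toy_click_counts.values(), reverse=True)

toy_total = sum(toy_counts)
toy_top_share = toy_counts[0] / toy_total

print(f"works clicked: {len(toy_counts)}")
print(f"median clicks per work: {median(toy_counts)}")
print(f"share of clicks going to the most-clicked work: {toy_top_share:.0%}")
```

A low median alongside a large top share is the long-tailed pattern described above.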

### catalogue
To compare how popular each work is with how complete its data is, we need to load in the full catalogue.

In [None]:
url = "https://data.wellcomecollection.org/catalogue/v2/works.json.gz"
filename = Path(url).name
zipped_works_file_path = data_dir / filename
works_file_path = data_dir / zipped_works_file_path.stem

In [None]:
if not works_file_path.exists():
    if not zipped_works_file_path.exists():
        with open(zipped_works_file_path, "wb") as download_file:
            with httpx.stream("GET", url, timeout=999999) as response:
                total = int(response.headers["Content-Length"])
                with tqdm(
                    total=total,
                    unit_scale=True,
                    unit_divisor=1024,
                    unit="B",
                    desc=filename,
                ) as progress:
                    num_bytes_downloaded = response.num_bytes_downloaded
                    for chunk in response.iter_bytes():
                        download_file.write(chunk)
                        progress.update(
                            response.num_bytes_downloaded - num_bytes_downloaded
                        )
                        num_bytes_downloaded = response.num_bytes_downloaded


    with gzip.open(zipped_works_file_path, "rb") as f_in:
        with open(works_file_path, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)

In [None]:
def load_records(path):
    with open(path) as f:
        while line := f.readline():
            yield json.loads(line)
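
The snapshot has one JSON record per line, so it could also be read straight from the gzipped file without keeping a decompressed copy on disk. This is a sketch of that alternative, not what the notebook does: `load_records_gz` and the demo file are hypothetical, and the trade-off is that the file gets decompressed again on every pass.

```python
import gzip
import json
import tempfile
from pathlib import Path


def load_records_gz(path):
    """Yield one parsed JSON object per line of a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield json.loads(line)


# Tiny hypothetical snapshot, written to a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.json.gz"
    with gzip.open(demo_path, "wt", encoding="utf-8") as f:
        f.write(json.dumps({"id": "abc", "title": "A title"}) + "\n")
        f.write(json.dumps({"id": "def", "title": "Another"}) + "\n")
    demo_records = list(load_records_gz(demo_path))

print([record["id"] for record in demo_records])
```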

In [None]:
n_records = sum(1 for _ in load_records(works_file_path))

In [None]:
n_records

## completeness
### number of fields
In my mind, the simplest measure of a work's completeness is the number of fields on the record which are populated, ie which hold a non-empty value.
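
One detail worth spelling out is that the check below is a truthiness test, so a field which is present but holds an empty string, an empty list or `None` doesn't count. A small illustration on a made-up record (`toy_record` is hypothetical and only loosely shaped like a catalogue work):

```python
# Hypothetical record, loosely shaped like a catalogue work. Not real data.
toy_record = {
    "id": "abc123",
    "title": "An example title",
    "description": "",
    "contributors": [],
    "subjects": [{"label": "Medicine"}],
    "thumbnail": None,
}

# A field counts as populated only if its value is truthy.
toy_populated = {key: bool(value) for key, value in toy_record.items()}
toy_n_populated = sum(toy_populated.values())

print(toy_populated)
print(toy_n_populated)
```

Here only `id`, `title` and `subjects` are counted, even though the record has six keys.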

In [None]:
completeness = {
    record["id"]: {key: bool(record[key]) for key in record}
    for record in tqdm(load_records(works_file_path), total=n_records)
}

Get the count of existing fields on each record

In [None]:
completeness_counts = {
    work_id: sum(fields.values()) for work_id, fields in completeness.items()
}

Count how often each populated-field count occurs across the whole catalogue

In [None]:
pd.Series(completeness_counts).value_counts().sort_index()

In [None]:
pd.Series(completeness_counts).value_counts().sort_index().plot();

Looks like most records have between 13 and 15 populated fields. Do those counts correlate with how many times a work is clicked on?

In [None]:
# one row per clicked work which also appears in the catalogue snapshot
df = pd.DataFrame(
    {
        work_id: {"clicks": n_clicks, "completeness": completeness_counts[work_id]}
        for work_id, n_clicks in click_counts.items()
        if work_id in completeness_counts
    }
).T

df.plot.scatter(x="clicks", y="completeness", alpha=0.1);

Apparently not. I can't see any clear correlation in that scatter plot.

In [None]:
df.plot.scatter(x="clicks", y="completeness", alpha=0.1, xlim=[0, 50]);

Even within the most concentrated section of the click data, a correlation is very hard to pick out.
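
Eyeballing a scatter plot only goes so far, so a rank correlation coefficient would put a number on this. Because the click counts are so skewed, Spearman's rho (Pearson's correlation computed on ranks) is probably a safer choice than a plain Pearson coefficient. Below is a dependency-free sketch on made-up values. The `toy_` lists are hypothetical, and on the real data `df.corr(method="spearman")` should give the equivalent figure.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # extend the run while the next value ties with the current one
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for i in order[start : end + 1]:
            ranks[i] = tied_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation. Assumes neither input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical values. Not real data.
toy_clicks = [1, 1, 2, 3, 50, 400]
toy_completeness = [13, 15, 14, 13, 15, 14]
toy_rho = spearman(toy_clicks, toy_completeness)
print(round(toy_rho, 3))
```

A value near 0 would back up what the plots suggest, while a value near 1 or -1 would mean the plots are hiding a monotonic relationship.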

### token counts
Going one step further, we could see a work's completeness as the number of tokens it has in its most important fields. Surely works with more words attached to them are going to perform better in search than works which are poorly described?

Let's look at the number of words (split by whitespace) in each work's title, description, subject and contributor fields.

In [None]:
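def count_tokens(record):
    """Count whitespace-separated tokens in a work's main descriptive fields."""
    count = 0
    # free-text fields: skip any which are missing or empty
    count += sum(
        len(record[field].split())
        for field in ["title", "description"]
        if record.get(field)
    )
    # one label per contributor
    count += sum(
        len(contributor["agent"]["label"].split())
        for contributor in record.get("contributors", [])
    )
    # one label per subject
    count += sum(
        len(subject["label"].split())
        for subject in record.get("subjects", [])
    )
    return count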

In [None]:
completeness = {
    record["id"]: count_tokens(record)
    for record in tqdm(load_records(works_file_path), total=n_records)
}

In [None]:
# one row per clicked work which also appears in the catalogue snapshot
df = pd.DataFrame(
    {
        work_id: {"clicks": n_clicks, "completeness": completeness[work_id]}
        for work_id, n_clicks in click_counts.items()
        if work_id in completeness
    }
).T

In [None]:
df.plot.scatter(x="clicks", y="completeness", alpha=0.1);

Again, no sign of any correlation.

We can try it on a log scale instead, to diminish the visual effect of those extreme works.

In [None]:
df.plot.scatter(x="clicks", y="completeness", alpha=0.1, logx=True, logy=True);

This makes a lot of sense really. Elasticsearch's relevance scoring normalises for field length, ie a matching term counts for less the more words there are in the field. A work which contains a term from a user's query once in a 4-word title is probably strongly linked to that topic, while a work which mentions the same term once in a 500-word description is probably not so strongly related to it. That normalisation is likely to be one of the reasons we see no correlation here.
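
To make the size of that effect concrete, here's a sketch of the term-frequency part of the textbook BM25 formula, which is Elasticsearch's default similarity. It assumes the search index uses that default with the default parameters `k1=1.2` and `b=0.75`, it leaves out the IDF part, and the average field length is a made-up figure. In a real index each field has its own average length, so comparing a title to a description like this is illustrative only.

```python
def bm25_tf_weight(tf, field_length, avg_field_length, k1=1.2, b=0.75):
    """Term-frequency part of the textbook BM25 formula (IDF omitted)."""
    # longer-than-average fields get a norm above 1, which shrinks the weight
    length_norm = 1 - b + b * field_length / avg_field_length
    return tf * (k1 + 1) / (tf + k1 * length_norm)


AVG_LENGTH = 50  # hypothetical average field length, in tokens

# the same term, matched once, in a 4-word field and in a 500-word field
short_title = bm25_tf_weight(tf=1, field_length=4, avg_field_length=AVG_LENGTH)
long_description = bm25_tf_weight(tf=1, field_length=500, avg_field_length=AVG_LENGTH)

print(round(short_title, 3), round(long_description, 3))
```

Under these assumptions the single match in the short field is weighted several times more heavily than the same match in the long one, so piling more words onto a record doesn't straightforwardly make it rank better.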

## further work

Had we found a strong correlation here, we could expand on this work by
- using random forests to determine which fields contribute most to the views on a work
- figuring out which individual terms are most clicky
- refining the tokenisation process to match elasticsearch's approach more closely.
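
As a first step towards the last point, here's a rough sketch of how far plain whitespace splitting can drift from an analysed token count. `rough_analyzer_tokens` is a hypothetical stand-in which lowercases the text and keeps runs of word characters. It's only an approximation, as elasticsearch's standard analyzer follows Unicode text segmentation rules and treats some punctuation differently.

```python
import re


def whitespace_tokens(text):
    """The tokenisation used in this notebook: split on whitespace only."""
    return text.split()


def rough_analyzer_tokens(text):
    """Lowercase, then keep runs of word characters (a crude approximation)."""
    return re.findall(r"\w+", text.lower())


# Hypothetical title. Not taken from the catalogue.
toy_title = "Anatomy & physiology - illustrated (3rd ed.)"

print(whitespace_tokens(toy_title))
print(rough_analyzer_tokens(toy_title))
```

Whitespace splitting counts the stray `&` and `-` as tokens and leaves punctuation attached to words, so the two counts differ even on a short title.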