# Creating a comparable dataset
We want our LCSH terms and Wellcome Collection search queries to be analysable in a similar format. Let's normalise both sets (much as our standard Elasticsearch language analysis does) to reduce them to a more matchable form.

In [None]:
import os
import string
from collections import Counter
from pathlib import Path

import orjson
from tqdm.notebook import tqdm
from weco_datascience.reporting import get_data_in_date_range

First, we need to load a set of queries from our reporting cluster.

In [None]:
df = get_data_in_date_range(
    config=os.environ,
    index="metrics-conversion-prod",
    start_date="2021-09-01",
    end_date="2021-09-02",
)


In [None]:
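# Drop missing values so that events without a query don't end up as the string "nan" later on
unique_queries = df["page.query.query"].dropna().unique()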

In [None]:
len(unique_queries)

Next, we load the LCSH labels which we downloaded in notebook 01.

In [None]:
data_dir = Path("../data/lcsh")
with open(data_dir / "lcsh_ids_and_labels.json", "rb") as f:
    lcsh_dict = orjson.loads(f.read())

In [None]:
lcsh = set(list(lcsh_dict.keys()) + list(lcsh_dict.values()))

In [None]:
len(lcsh)

## Naive matching
Let's see how many matches we find without applying any transformations to the data.

In [None]:
intersection = [query for query in unique_queries if query in lcsh]

print(len(intersection))
print(intersection)

## Lowercasing
The simplest change I can imagine making is to lowercase all of the terms before looking for matches.

In [None]:
lowercased_queries = {str(x).lower() for x in unique_queries}

lowercased_lcsh = {str(x).lower() for x in lcsh}

In [None]:
intersection = [query for query in lowercased_queries if query in lowercased_lcsh]

print(len(intersection) / len(lowercased_queries))
print(intersection)

## Removing punctuation

In [None]:
from unicodedata import category

In [None]:
def strip_punctuation(input_string):
    return "".join(ch for ch in input_string if category(ch)[0] != "P")
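
To make the behaviour of this filter concrete, here is a small sketch on a few made-up strings (the example inputs are illustrative, not taken from the query logs or from LCSH). It shows that the Unicode category test removes non-ASCII punctuation such as curly quotes, leaves symbols like `+` alone (those are in category `Sm`, not `P`), and glues words together when the only thing separating them was punctuation. `punctuation_to_space` is a hypothetical alternative, not something used elsewhere in this notebook.

```python
from unicodedata import category


def strip_punctuation(input_string):
    # Same logic as the notebook: drop any character whose Unicode
    # general category starts with "P" (Pd, Po, Pi, Pf, ...)
    return "".join(ch for ch in input_string if category(ch)[0] != "P")


def punctuation_to_space(input_string):
    # Hypothetical variant: replace punctuation with a space, then
    # collapse runs of whitespace, so word boundaries are preserved
    spaced = "".join(" " if category(ch)[0] == "P" else ch for ch in input_string)
    return " ".join(spaced.split())


examples = ["heart--diseases", "medicine, medieval", "\u201cplague\u201d", "a+b"]

for example in examples:
    print(repr(example), "->", repr(strip_punctuation(example)),
          "|", repr(punctuation_to_space(example)))
```

If subdivided headings turn out to matter for matching, the space-preserving variant may be worth trying, since `heart--diseases` would then line up with a query like `heart diseases`.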

In [None]:
unpunctuated_queries = {strip_punctuation(x) for x in tqdm(lowercased_queries)}
unpunctuated_lcsh = {strip_punctuation(x) for x in tqdm(lowercased_lcsh)}

In [None]:
intersection = [query for query in unpunctuated_queries if query in unpunctuated_lcsh]

print(len(intersection) / len(unpunctuated_queries))
print(intersection)

That's more than 10% of the unique queries for a 24-hour period which can be directly matched to subjects in LCSH, with minimal normalisation and disambiguation!

## Accounting for query counts
Let's look at the raw number of queries instead of just the unique ones; maybe that 10% figure will change.
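
As a toy illustration of why the two figures can differ, the sketch below compares the match rate over unique queries with the match rate weighted by how often each query was issued. The queries and the `toy_subjects` set are invented for the example and are not drawn from the real data.

```python
from collections import Counter
from unicodedata import category


def normalise(query):
    # Lowercase, then drop Unicode punctuation (same steps as above)
    return "".join(ch for ch in query.lower() if category(ch)[0] != "P")


# Hypothetical search log and hypothetical normalised subject labels
toy_queries = ["cats", "cats", "cats", "Cats!", "dogs", "zebra finch"]
toy_subjects = {"cats", "zebra finch"}

counts = Counter(normalise(q) for q in toy_queries)

# Each distinct query counts once
unique_rate = sum(1 for q in counts if q in toy_subjects) / len(counts)

# Each distinct query counts as many times as it was issued
weighted_rate = sum(n for q, n in counts.items() if q in toy_subjects) / sum(
    counts.values()
)

print(f"unique: {unique_rate:.2f}, weighted: {weighted_rate:.2f}")
```

Here two of the three distinct queries match, but five of the six searches do, because the most popular query happens to be a known subject.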

In [None]:
query_counts = df["page.query.query"].value_counts()

In [None]:
query_counts.sum()

In [None]:
query_counts.plot()

The distribution looks nice and long-tailed, as we'd expect. If some of those high-frequency queries are in our LCSH list, our matching percentage might even go up!
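
One way to put a number on the shape of the curve, rather than judging it from the plot, is to look at the cumulative share of searches covered by the most frequent queries. The sketch below does this for an invented set of counts (`toy_counts` is hypothetical); the same loop could be run over `query_counts.items()`, which pandas already returns in descending order of frequency.

```python
from collections import Counter

# Hypothetical query frequencies, roughly halving at each step
toy_counts = Counter(
    {"a": 50, "b": 25, "c": 12, "d": 6, "e": 3, "f": 2, "g": 1, "h": 1}
)

total = sum(toy_counts.values())
running = 0
cumulative_share = []

# most_common() yields (query, count) pairs from most to least frequent
for query, n in toy_counts.most_common():
    running += n
    cumulative_share.append((query, running / total))

for query, share in cumulative_share:
    print(f"{query}: {share:.0%}")
```

In this made-up example the top two of eight queries already cover three quarters of all searches.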

In [None]:
count = 0

for query, n in query_counts.items():
    # Cast to str first, matching how the unique queries were normalised above
    normalised_query = strip_punctuation(str(query).lower())
    if normalised_query in unpunctuated_lcsh:
        count += n

print(count / query_counts.sum())

23 percent! Nearly a quarter of queries map neatly to concepts in LCSH alone, with only the most basic normalisations applied to the terms. A full suite of Elasticsearch analysers might even bring that percentage closer to 30%.
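
As a rough idea of what a next step could look like, here is a minimal sketch which adds two more normalisations in plain Python: folding accented characters to their unaccented form and collapsing whitespace. This only approximates the kind of thing an Elasticsearch analysis chain does and is not a reimplementation of any particular analyser; `fold_accents` and `normalise` are hypothetical helper names.

```python
import unicodedata


def fold_accents(text):
    # Decompose characters (e.g. "ë" -> "e" + combining diaeresis),
    # then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise(text):
    # Lowercase and fold accents
    text = fold_accents(str(text).lower())
    # Drop Unicode punctuation, as in strip_punctuation above
    text = "".join(ch for ch in text if unicodedata.category(ch)[0] != "P")
    # Collapse repeated whitespace and trim the ends
    return " ".join(text.split())


print(normalise("  Brontë,  Charlotte "))
print(normalise("Café"))
```

Applying the same function to both the queries and the LCSH labels before intersecting them would show whether these extra steps actually move the percentage.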

Whatever the 'real' percentage is, I think it's fair to call it significant.