# 2. Extracting more data for local analysis

In the last notebook, we saw that the `/works` API can do some clever querying and filtering. However, we often have questions which can't be answered by the API by itself. In those cases, it's useful to collect a load of data from the API and then analyse it locally.

In this notebook, we'll try to query the API for bigger chunks of data so that we can answer a more interesting question.

We'll aim to find out:

> If we filter the works API for a set of subjects, can we find the other subjects that most commonly co-occur with them?

We'll start by fetching all of the works which are tagged with a single subject.

Here's our base URL again:

In [2]:
base_url = "https://api.wellcomecollection.org/catalogue/v2/"

Let's make a request to the API, asking for all the works which are tagged with the subject "Influenza".

In [3]:
import requests

response = requests.get(
    base_url + "works", params={"subjects.label": "Influenza"}
).json()

In [4]:
response["totalResults"]

81

## 2.1 Page sizes

In [5]:
response["totalPages"]

9

In [6]:
len(response["results"])

10

At the moment, we're getting our results spread across 9 pages, because `pageSize` is set to 10 by default. 

We can increase the `pageSize` to get all of our 81 works in one go (up to a maximum of 100):

In [10]:
import requests

response = requests.get(
    base_url + "works", params={"subjects.label": "Influenza", "pageSize": 100}
).json()

In [15]:
response["totalResults"]


81

In [16]:
response["totalPages"]

1
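
The page counts above are what you get by dividing the total number of results by the page size and rounding up. Here is a quick check of that arithmetic, assuming the API computes `totalPages` this way (the helper name `expected_pages` is ours, not part of the API):

```python
import math


def expected_pages(total_results, page_size):
    # round up, because a partially filled last page still counts as a page
    return math.ceil(total_results / page_size)


print(expected_pages(81, 10))   # 9
print(expected_pages(81, 100))  # 1
```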

## 2.2 Requesting multiple pages of results

Some subjects only appear on a few works, but others appear on thousands. If we want to be able to analyse those larger subjects, we'll need to fetch more than 100 works at a time. To do this, we'll page through the results, making multiple requests and building a local list of results as we go.

If the API finds more than one page of results for a query, it will provide a `nextPage` field in the response, with a link to the next page of results. We can use this to fetch the next page of results, and the next, and the next, until the `nextPage` field is no longer present, at which point we know we've got all the results.

We're going to use these results to answer our question from the introduction, so we'll also ask the API to include the subjects which are associated with each work, and collect them too.

In [20]:
from tqdm.auto import tqdm


In [22]:
results = []

# fetch the first page of results
response = requests.get(
    base_url + "works",
    params={
        "subjects.label": "England",
        "include": "subjects",
        "pageSize": "100",
    },
).json()

# start a progress bar to keep track of how many results we've fetched
progress_bar = tqdm(total=response["totalResults"])

# add our results to the list and update our progress bar
results.extend(response["results"])
progress_bar.update(len(response["results"]))

# as long as there's a "nextPage" key in the response, keep fetching results
# adding them to the list, and updating the progress bar
while "nextPage" in response:
    response = requests.get(response["nextPage"]).json()
    results.extend(response["results"])
    progress_bar.update(len(response["results"]))

progress_bar.close()

works_about_england = results

  0%|          | 0/4245 [00:00<?, ?it/s]

Let's check that we've got the correct number of results:

In [23]:
len(works_about_england) == response["totalResults"]

True

Great! Now let's try collecting works for a second subject:

In [24]:
results = []

response = requests.get(
    base_url + "works",
    params={
        "subjects.label": "Germany",
        "include": "subjects",
        "pageSize": "100",
    },
).json()

progress_bar = tqdm(total=response["totalResults"])

results.extend(response["results"])
progress_bar.update(len(response["results"]))

while "nextPage" in response:
    response = requests.get(response["nextPage"]).json()
    results.extend(response["results"])
    progress_bar.update(len(response["results"]))

progress_bar.close()

works_about_germany = results

  0%|          | 0/4638 [00:00<?, ?it/s]
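
The paging loop we've now written twice depends only on whether a `nextPage` key is present. As a minimal sketch of the same pattern, here it is as a generator that runs against fake pages instead of the live API. The name `iter_results`, the `fetch` argument and the `page-2`/`page-3` keys are all made up for illustration:

```python
def iter_results(first_page, fetch):
    """Yield every result, following `nextPage` links until none remain."""
    page = first_page
    while True:
        yield from page["results"]
        if "nextPage" not in page:
            break
        page = fetch(page["nextPage"])


# fake pages standing in for API responses, keyed by made-up URLs
fake_pages = {
    "page-2": {"results": [{"id": "c"}, {"id": "d"}], "nextPage": "page-3"},
    "page-3": {"results": [{"id": "e"}]},
}
first_page = {"results": [{"id": "a"}, {"id": "b"}], "nextPage": "page-2"}

collected = list(iter_results(first_page, fetch=fake_pages.__getitem__))
print([work["id"] for work in collected])  # ['a', 'b', 'c', 'd', 'e']
```

Against the real API, `fetch` could be something like `lambda url: requests.get(url).json()`, which is the same call the loops above make.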

## 2.3 Analyzing our two sets of results 

Let's find the works which are tagged with both subjects by filtering the results of the first list by IDs from the second list.

In [25]:
ids_from_works_about_england = {work["id"] for work in works_about_england}

In [26]:
works_about_england_and_germany = [
    work
    for work in works_about_germany
    if work["id"] in ids_from_works_about_england
]

In [27]:
len(works_about_england_and_germany)

32
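
The filter above amounts to a set intersection on work IDs. Here is a toy sketch of that equivalence, using made-up IDs rather than real catalogue identifiers:

```python
# made-up records standing in for the two lists of works
works_a = [{"id": "w1"}, {"id": "w2"}, {"id": "w3"}]
works_b = [{"id": "w2"}, {"id": "w3"}, {"id": "w4"}]

ids_a = {work["id"] for work in works_a}
ids_b = {work["id"] for work in works_b}

# `&` keeps only the IDs present in both sets
shared_ids = ids_a & ids_b
print(sorted(shared_ids))  # ['w2', 'w3']

# filtering one list by the other's IDs finds the same works,
# while keeping the full records rather than just their IDs
shared_works = [work for work in works_b if work["id"] in ids_a]
print(len(shared_works) == len(shared_ids))  # True
```

Filtering the list, as the notebook does, has the advantage that each match keeps its `subjects`, which we need for the next step.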

In [28]:
works_about_england_and_germany

[{'physicalDescription': '363 pages : illustrations ; 23 cm.',
  'subjects': [{'label': 'Tuberculosis, Pulmonary - history',
    'concepts': [{'id': 'f32xsm7t',
      'label': 'Tuberculosis, Pulmonary',
      'type': 'Concept'},
     {'id': 'm7x7qxg6', 'label': 'history', 'type': 'Concept'}],
    'id': 'hndy2z49',
    'type': 'Subject'},
   {'label': 'Hospitals, Special - history',
    'concepts': [{'id': 'qrk6shrv',
      'label': 'Hospitals, Special',
      'type': 'Concept'},
     {'id': 'm7x7qxg6', 'label': 'history', 'type': 'Concept'}],
    'id': 't5k7xqfg',
    'type': 'Subject'},
   {'label': '19th-20th centuries',
    'concepts': [{'id': 't7sgt4ee',
      'label': '19th-20th centuries',
      'type': 'Period'}],
    'id': 't7sgt4ee',
    'type': 'Subject'},
   {'label': 'Germany',
    'concepts': [{'id': 'v5h4ytrw', 'label': 'Germany', 'type': 'Place'}],
    'id': 'v5h4ytrw',
    'type': 'Subject'},
   {'label': 'England',
    'concepts': [{'id': 's52pc3b6', 'label': 'England'

That's 32 works which are tagged with both `England` and `Germany`. Let's see if we can find the other subjects which are most commonly found on these works. 

Let's use a `Counter` to figure that out:

N.B. We're collecting the _concepts_ on each work because they are the atomic parts of subjects. Our catalogue includes subjects like "Surgery - 18th Century", which is made up of the concepts "Surgery" and "18th Century". Comparing concepts is more useful than comparing subjects, because subjects can be so specific that they rarely overlap.
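
To illustrate the point, here is a small sketch using two subjects from the output above, trimmed to just their labels. Counted as whole subjects they don't overlap, but counted as concepts they share `history`:

```python
from collections import Counter

# two subjects copied from the output above, trimmed to their labels
subjects = [
    {
        "label": "Tuberculosis, Pulmonary - history",
        "concepts": [{"label": "Tuberculosis, Pulmonary"}, {"label": "history"}],
    },
    {
        "label": "Hospitals, Special - history",
        "concepts": [{"label": "Hospitals, Special"}, {"label": "history"}],
    },
]

subject_counts = Counter(subject["label"] for subject in subjects)
concept_counts = Counter(
    concept["label"] for subject in subjects for concept in subject["concepts"]
)

print(max(subject_counts.values()))   # 1, no subject label repeats
print(concept_counts.most_common(1))  # [('history', 2)]
```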

In [30]:
from collections import Counter

concepts = Counter()

for record in works_about_england_and_germany:
    # we need to navigate the nested structure of the subject and its concepts to
    # get the complete list of _concepts_ on each work
    for subject in record["subjects"]:
        for concept in subject["concepts"]:
            concepts.update([concept["label"]])



The `Counter` object keeps track of the counts of each unique item we pass to it. Now that we've added the complete list, we can ask it for the most common items:

In [31]:
concepts.most_common(20)

[('England', 33),
 ('Germany', 32),
 ('history', 27),
 ('France', 12),
 ('Physicians', 6),
 ('Sweden', 5),
 ('20th century', 5),
 ('18th century', 5),
 ('Belgium', 4),
 ('Austria', 4),
 ('History', 4),
 ('19th-20th centuries', 3),
 ('Finland', 3),
 ('Public Health', 2),
 ('Europe', 2),
 ('Medicine', 2),
 ('19th century', 2),
 ('Technology', 2),
 ('Hypersensitivity, Immediate', 2),
 ('Tuberculosis, Pulmonary', 1)]

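Note that `England` is counted 33 times across 32 works: a concept is counted once for each subject it appears in, so at least one work carries `England` in more than one subject.

Great! We've answered our original question, counting concepts rather than whole subjects: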

> If we filter the works API for a set of subjects, can we find the other concepts that most commonly co-occur with them?
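
One detail in the counts above is that `history` and `History` are tallied separately, because `Counter` keys are exact strings. If you'd rather merge them, one possible approach is to fold case before counting. This is only a sketch, using a few counts copied from the England/Germany output:

```python
from collections import Counter

# a few counts copied from the England/Germany output above
concepts = Counter({"history": 27, "France": 12, "History": 4})

merged = Counter()
for label, count in concepts.items():
    # casefold() so that "history" and "History" share one key
    merged[label.casefold()] += count

print(merged.most_common(2))  # [('history', 31), ('france', 12)]
```

The trade-off is that every label is lowercased, so `France` becomes `france`. Whether that's acceptable depends on how you want to present the results.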

## 2.4 Creating a generic function for finding subject intersections

Now that we've solved this problem, let's try to make it more generic so that we can use it for other pairs of subjects.

We can re-use a lot of the code we've already written, and wrap it in a couple of reusable function definitions.

In [32]:
def get_subject_results(subject):
    """Fetch every work tagged with `subject`, following `nextPage` links."""
    response = requests.get(
        base_url + "works",
        params={
            "subjects.label": subject,
            "include": "subjects",
            "pageSize": "100",
        },
    ).json()

    progress_bar = tqdm(total=response["totalResults"])
    results = response["results"]
    progress_bar.update(len(response["results"]))

    while "nextPage" in response:
        response = requests.get(response["nextPage"]).json()
        results.extend(response["results"])
        progress_bar.update(len(response["results"]))

    progress_bar.close()
    
    return results


def find_intersecting_subject_concepts(subject_1, subject_2, n=20):
    """Return the `n` most common concepts on works tagged with both subjects."""
    subject_1_results = get_subject_results(subject_1)
    subject_2_results = get_subject_results(subject_2)
    subject_2_ids = set(result["id"] for result in subject_2_results)

    intersecting_results = [
        result for result in subject_1_results if result["id"] in subject_2_ids
    ]

    concepts = Counter()
    for record in intersecting_results:
        for subject in record["subjects"]:
            for concept in subject["concepts"]:
                concepts.update([concept["label"]])

    return concepts.most_common(n)

Calling the `find_intersecting_subject_concepts()` function with any two subjects will return a list of `(concept, count)` pairs for the `n` most common concepts found on the works which are tagged with both subjects.

In [33]:
find_intersecting_subject_concepts("Europe", "United States")

  0%|          | 0/2523 [00:00<?, ?it/s]

  0%|          | 0/8312 [00:00<?, ?it/s]

[('United States', 94),
 ('Europe', 94),
 ('history', 79),
 ('History', 20),
 ('legislation & jurisprudence', 8),
 ('Great Britain', 7),
 ('19th century', 7),
 ('20th century', 7),
 ('19th-20th centuries', 7),
 ('Research', 6),
 ('Psychoanalysis', 5),
 ('Canada', 5),
 ('Biotechnology', 5),
 ('Science', 5),
 ('Fashion', 4),
 ('Feeding and Eating Disorders', 4),
 ('Sociology', 4),
 ('Patents as Topic', 4),
 ('Drug Industry', 4),
 ('Education, Higher', 4)]

In [34]:
find_intersecting_subject_concepts("Vomiting", "Witchcraft")

  0%|          | 0/90 [00:00<?, ?it/s]

  0%|          | 0/391 [00:00<?, ?it/s]

[('Witchcraft', 2),
 ('Vomiting', 2),
 ('Costume', 2),
 ('Magic', 1),
 ('Demonology', 1),
 ('Devil', 1),
 ('Toilets', 1),
 ('Kings and rulers', 1),
 ('Enema', 1),
 ('Medicine', 1),
 ('Early works to 1800', 1),
 ('Viziers', 1),
 ('Phlebotomy', 1),
 ('Wine', 1),
 ('Physiological effect', 1),
 ('Pulse', 1),
 ('Measurement', 1),
 ('Urine', 1),
 ('Examination', 1),
 ('Woolen and worsted spinning', 1)]
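
In both outputs, the two subjects we searched for sit at the top of the list, which is expected but not very informative. Here is a sketch of a small helper that drops them from the returned pairs. The name `drop_queried_subjects` is made up here, and it assumes the concept labels match the subject labels exactly:

```python
def drop_queried_subjects(concept_counts, *subjects):
    """Remove the subjects we searched for from a list of (label, count) pairs."""
    queried = set(subjects)
    return [(label, count) for label, count in concept_counts if label not in queried]


# the first few rows of the "Vomiting" / "Witchcraft" output above
example = [("Witchcraft", 2), ("Vomiting", 2), ("Costume", 2), ("Magic", 1)]
print(drop_queried_subjects(example, "Vomiting", "Witchcraft"))
# [('Costume', 2), ('Magic', 1)]
```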

## Exercises

1. Try running the function with different subjects. Use the API to find two subjects which each appear on a few hundred or a few thousand works, and see if you can find the most common concepts on the works which are tagged with both of them.
2. Adapt the code to compare an arbitrary number of subjects, rather than just two.
