# Downloading subject headings from the Library of Congress
The Library of Congress (LoC) doesn't make it easy to get a complete list of subject headings. The neatest route seems to be the [downloads section of the LoC website](https://id.loc.gov/download/), which offers the headings in [SKOS](https://www.w3.org/2004/02/skos/) format as an [NDJSON](http://ndjson.org/) file.

In this notebook we'll download the file, and parse out the useful information from it. In the next notebook we'll focus on getting meaningful results from the intersection of this dataset and the queries sent through [wellcomecollection.org/collections](https://wellcomecollection.org/collections).

In [None]:
import zipfile
from pathlib import Path

import httpx
import orjson
from tqdm.notebook import tqdm

In [None]:
url = "https://lds-downloads.s3.amazonaws.com/lcsh.skos.ndjson.zip"
filename = Path(url).name
data_dir = Path("../data/lcsh")

# create the directory (and any missing parents) if it isn't already there
data_dir.mkdir(parents=True, exist_ok=True)

We've defined where we want to fetch the file from and where we want to save it - now we just need to download it. It's a fairly large file, so I've added a progress bar.

In [None]:
file_path = data_dir / filename

In [None]:
if not file_path.exists():
    with open(file_path, "wb") as download_file:
        with httpx.stream("GET", url) as response:
            # fail early on an HTTP error rather than saving an error page
            response.raise_for_status()
            total = int(response.headers["Content-Length"])
            with tqdm(
                total=total, unit_scale=True, unit_divisor=1024, unit="B", desc=filename
            ) as progress:
                num_bytes_downloaded = response.num_bytes_downloaded
                for chunk in response.iter_bytes():
                    download_file.write(chunk)
                    progress.update(
                        response.num_bytes_downloaded - num_bytes_downloaded
                    )
                    num_bytes_downloaded = response.num_bytes_downloaded
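One caveat with the cell above: if the download is interrupted, a partial file is left at `file_path`, and the `exists()` check will skip it on the next run. Here's a minimal sketch of one way around that. It assumes a hypothetical helper, `write_atomically`, which isn't part of this notebook: it writes the chunks to a temporary `.part` file and only renames it once everything has been written. A plain list of bytes stands in for `response.iter_bytes()` so the sketch runs without a network call.

```python
import tempfile
from pathlib import Path


def write_atomically(chunks, destination):
    """Write an iterable of byte chunks to `destination` via a temporary file.

    Hypothetical helper: the final path only appears once every chunk has
    been written, so an interrupted run never leaves a truncated file there.
    """
    destination = Path(destination)
    partial = destination.with_name(destination.name + ".part")
    with open(partial, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
    # Path.replace overwrites any existing file at the destination
    partial.replace(destination)
    return destination


# Stand-in for response.iter_bytes(), so the sketch runs offline
with tempfile.TemporaryDirectory() as tmp:
    target = write_atomically([b"abc", b"def"], Path(tmp) / "example.zip")
    written = target.read_bytes()
    leftover_exists = (Path(tmp) / "example.zip.part").exists()
```

In the download cell, the same pattern would mean opening a `.part` path for writing and renaming it to `file_path` after the loop finishes.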

We also need to unzip the file.

In [None]:
with zipfile.ZipFile(file_path, "r") as zip_ref:
    zip_ref.extractall(data_dir)
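It can be worth checking what an archive contains before relying on the extracted file name. A small sketch using `ZipFile.namelist()`, run against a throwaway archive built in a temporary directory (the member name and contents are made up for illustration, not taken from the LoC file):

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "example.zip"

    # Build a tiny archive to stand in for the real download
    with zipfile.ZipFile(archive_path, "w") as zip_ref:
        zip_ref.writestr("example.ndjson", '{"a": 1}\n')

    # List the members, then extract them next to the archive
    with zipfile.ZipFile(archive_path, "r") as zip_ref:
        members = zip_ref.namelist()
        zip_ref.extractall(tmp)

    extracted_text = (Path(tmp) / "example.ndjson").read_text()
```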

## Extract the useful data
This is a pretty big file - let's find out how many lines (i.e. records) it contains.

In [None]:
! wc -l ../data/lcsh/lcsh.skos.ndjson

At 450,645 lines, the file is probably big enough to be worth iterating through gradually, rather than reading it all at once. Let's set up a function to yield records one by one.

In [None]:
def load_records(file_path):
    # iterating over the file object reads one line at a time,
    # so only a single record is parsed per step
    with open(file_path) as f:
        for line in f:
            yield orjson.loads(line)
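To see the lazy behaviour in isolation, here's a small sketch that runs the same kind of generator over a three-line sample file. It uses the standard library `json` module in place of `orjson` so that it runs without extra dependencies, and the sample records are invented.

```python
import json
import tempfile
from pathlib import Path


def load_records(file_path):
    # Same idea as the function above, with json standing in for orjson
    with open(file_path) as f:
        for line in f:
            yield json.loads(line)


with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.ndjson"
    sample_path.write_text('{"id": 1}\n{"id": 2}\n{"id": 3}\n')

    records = load_records(sample_path)
    first = next(records)  # only the first line has been parsed so far
    remaining = list(records)  # consumes the rest of the file
```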

All that's left to do now is work through each of those records and extract the LCSH ID and the heading (`prefLabel`) for each one.

In [None]:
data = {}

generator = load_records(data_dir / Path(url).stem)

for record in tqdm(generator, total=450645):
    lcsh_id = Path(record["@context"]["about"]).name
    for item in record["@graph"]:
        if item["@id"] == record["@context"]["about"]:
            try:
                data[lcsh_id] = item["skos:prefLabel"]["@value"]
            except KeyError:
                # I've inspected these lines, and it looks like they're
                # all duplicate/deleted records
                pass
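The loop above is easier to follow against a concrete record. The sketch below wraps the same logic in a hypothetical function, `extract_id_and_label`, and runs it on an invented record. The shape mirrors the keys the loop relies on (`@context.about`, `@graph`, `@id`, `skos:prefLabel.@value`), but the URIs and labels are placeholders rather than real LCSH data.

```python
from pathlib import Path


def extract_id_and_label(record):
    """Return (lcsh_id, label) for a record, or None if it has no prefLabel."""
    about = record["@context"]["about"]
    for item in record["@graph"]:
        # only the graph item describing the record itself is of interest
        if item["@id"] == about:
            try:
                return Path(about).name, item["skos:prefLabel"]["@value"]
            except KeyError:
                return None
    return None


# Invented record: placeholder URIs and labels, not real LCSH data
example_record = {
    "@context": {"about": "http://example.org/subjects/x0001"},
    "@graph": [
        {
            "@id": "http://example.org/subjects/x0002",
            "skos:prefLabel": {"@value": "Other"},
        },
        {
            "@id": "http://example.org/subjects/x0001",
            "skos:prefLabel": {"@value": "Example heading"},
        },
    ],
}

result = extract_id_and_label(example_record)

# A record whose matching graph item has no prefLabel is skipped
no_label = extract_id_and_label(
    {
        "@context": {"about": "http://example.org/subjects/x0003"},
        "@graph": [{"@id": "http://example.org/subjects/x0003"}],
    }
)
```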

Now we can save our cleaned records to the `../data/lcsh` directory for use in future notebooks.

In [None]:
with open(data_dir / "lcsh_ids_and_labels.json", "wb") as f:
    f.write(orjson.dumps(data))
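As a rough sketch of how a later notebook might use the saved file, the example below writes a placeholder mapping to a temporary directory, reads it back, and inverts it to look up an ID by its heading. It uses the standard library `json` module so that it runs without `orjson`. The IDs and headings are invented, and the inversion assumes each heading appears only once, which may not hold for the real data.

```python
import json
import tempfile
from pathlib import Path

# Placeholder mapping; the real one maps LCSH IDs to their headings
example_data = {"x0001": "Example heading", "x0002": "Another heading"}

with tempfile.TemporaryDirectory() as tmp:
    output_path = Path(tmp) / "lcsh_ids_and_labels.json"
    output_path.write_text(json.dumps(example_data))

    # Later, e.g. in another notebook
    loaded = json.loads(output_path.read_text())

# Invert the mapping to look up an ID by its heading
ids_by_label = {label: lcsh_id for lcsh_id, label in loaded.items()}
```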