
What this is

This is code for the back end of the 2021 Library of Congress Computing Cultural Heritage in the Cloud project "Situating Ourselves in Cultural Heritage" (Andromeda Yelton, lead investigator).

Code for the front end resides at lc_site.

Installation

  • git clone https://github.com/thatandromeda/lc_etl.git
  • cd lc_etl
  • pipenv install
  • python -m spacy download en_core_web_sm
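
Optionally, you can confirm that the spaCy model downloaded correctly with a quick check from the Python shell (a sanity check only, not a pipeline step):

  • pipenv run python
  • >>> import spacy
  • >>> spacy.load("en_core_web_sm")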

Installation on M1 (Apple Silicon)

The pipenv install step may fail due to unresolved upstream issues with numpy/scipy and missing system dependencies. Before attempting it, create your pipenv and manually install the following:

brew install gfortran
brew install openblas
brew install llvm@11
export PATH="/opt/homebrew/opt/llvm@11/bin:$PATH"
export LDFLAGS="-L/opt/homebrew/opt/llvm@11/lib -L/opt/homebrew/lib"
export CPPFLAGS="-I/opt/homebrew/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm@11/include/c++/v1"
export CXX=/opt/homebrew/opt/llvm@11/bin/clang++
export CC=/opt/homebrew/opt/llvm@11/bin/clang
LLVM_CONFIG=/opt/homebrew/opt/llvm@11/bin/llvm-config pipenv run pip install --no-binary :all: --no-use-pep517 llvmlite
pipenv run pip install cython pybind11 pythran
pipenv run pip install --no-binary :all: --no-use-pep517 numpy
export OPENBLAS=/opt/homebrew/opt/openblas/lib/
pipenv run pip install --no-binary :all: --no-use-pep517 scipy

You may need to export additional flags, depending on what brew reports when it installs these packages.

Then you can run pipenv install. The command itself will not succeed, but all of the dependencies will be available in your virtualenv.

The data pipeline

How it works

The overall structure of the data pipeline is as follows:

  • define a data set
  • download fulltexts (where possible) of every item in this data set
  • download metadata for all fulltexts found
  • run one or more filters to clean up the data
  • train a neural net on the fulltexts
  • optionally enrich the downloaded metadata with information derived from the fulltexts
  • embed the (many-dimensional) neural net in two-dimensional space, so that it can be visualized in a browser
  • generate a file of 2D coordinates and metadata suitable for consumption by the quadfeather library (see the sketch after this list)
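
To make the last two steps concrete, here is a minimal sketch of reducing a trained model's document vectors to two dimensions and writing them out as a flat file. It assumes the umap-learn and pandas packages and gensim 4.x attribute names; the reducer, file names, and columns are illustrative, not the pipeline's actual code (see run_pipeline.py for that).

import gensim
import pandas as pd
import umap  # umap-learn; assumed here for illustration, not necessarily the reducer lc_etl uses

# Load a trained Doc2Vec model (gensim 4.x attribute names).
model = gensim.models.Doc2Vec.load("path/to/model")

# Reduce the document vectors to two dimensions for browser visualization.
coords = umap.UMAP(n_components=2).fit_transform(model.dv.vectors)

# Write x/y coordinates plus an identifier column to a flat CSV, the kind of
# file quadfeather can tile for deepscatter. Real metadata columns would come
# from the downloaded (and optionally enriched) metadata.
viz = pd.DataFrame(coords, columns=["x", "y"])
viz["item_id"] = model.dv.index_to_key
viz.to_csv("viz_input.csv", index=False)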

Running the data pipeline

The script run_pipeline.py shows the workflow for downloading texts and metadata from the Library of Congress; training a neural net on the texts; and transforming the neural net and metadata into a format that can be represented on the web.

You can, of course, mix and match steps as you prefer. Changes you might want to make include:

  • writing your own dataset definitions (see lc_etl/dataset_definitions for examples);
  • altering the neural net hyperparameters (see config_files for examples);
  • using filters differently (fewer of them; in a different order; with different threshold values);
  • passing different base words into assign_similarity_metadata.

Some of these may be time-consuming (in particular: train_doc2vec, filter_nonwords, assign_similarity_metadata, and downloading large data sets), so you might want to run them with nohup or another tool that lets you walk away from the process for a while.

Note that filter_nonwords can only be run if you already have an intermediate neural net you can use to find real words that are similar in meaning to OCR errors. (The BOOTSTRAP_MODEL_PATH referenced in run_pipeline.py is not part of this repository.) You can train a suitable neural net on your whole data set, but if that data set is large, it may take an enormous amount of memory to handle all the OCR errors your neural net must learn. You will be happier training your intermediate net on a reasonably sized subset of your data, accepting that it will not see low-frequency OCR errors but trusting that it will learn the common ones.
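
As an illustration of what training such an intermediate net might look like, here is a minimal gensim sketch that trains a Doc2Vec model on a directory of plain-text files and saves it. The paths, preprocessing, and hyperparameters are placeholder assumptions for the example, not the project's actual settings; see config_files and train_doc2vec for the real ones.

import pathlib
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical layout: a directory holding a reasonably sized subset of the corpus.
corpus_dir = pathlib.Path("data/subset")

documents = [
    TaggedDocument(words=gensim.utils.simple_preprocess(path.read_text()), tags=[path.stem])
    for path in sorted(corpus_dir.glob("*.txt"))
]

# Illustrative hyperparameters only; passing the documents to the constructor
# builds the vocabulary and trains the model in one step.
model = Doc2Vec(documents, vector_size=100, window=5, min_count=5, epochs=20, workers=4)
model.save("bootstrap_model")  # a path like this could serve as BOOTSTRAP_MODEL_PATH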

Exploring the data

To load a trained model so you can explore it:

  • pipenv run python
  • >>> import gensim
  • >>> model = gensim.models.Doc2Vec.load("path/to/model")
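
Once loaded, you can query the model interactively. For example (gensim 4.x attribute names; the query word is only an illustration and must appear in your training data):

  • >>> model.wv.most_similar("liberty")  # words used in similar contexts
  • >>> model.dv.most_similar(model.dv[0])  # documents most similar to the first document
  • >>> len(model.dv), model.vector_size  # corpus size and embedding dimensionality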

You can also explore the data visually using deepscatter. Install deepscatter and edit its index.html file to point to the directory of files generated by quadfeather in the run_pipeline script above.

Tests

  • pipenv run python -m unittest tests/tests.py
  • You can run a single test case with, e.g., pipenv run python -m unittest tests.tests.TestBulkScripts (or run a single test by appending its method name to that path).
