This is code for the back end of the 2021 Library of Congress Computing Cultural Heritage in the Cloud project "Situating Ourselves in Cultural Heritage" (Andromeda Yelton, lead investigator).
Code for the front end resides at `lc_site`.
```shell
git clone
cd lc_etl
pipenv install
python -m spacy download en_core_web_sm
```
The `pipenv install` may fail due to unresolved upstream issues with numpy/scipy and missing system dependencies. Before attempting it, create your pipenv and manually install the following:
```shell
brew install gfortran
brew install openblas
brew install llvm@11
export PATH="/opt/homebrew/opt/llvm@11/bin:$PATH"
export LDFLAGS="-L/opt/homebrew/opt/llvm@11/lib -L/opt/homebrew/lib"
export CPPFLAGS="-I/opt/homebrew/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm@11/include/c++/v1"
export CXX=/opt/homebrew/opt/llvm@11/bin/clang++
export CC=/opt/homebrew/opt/llvm@11/bin/clang
LLVM_CONFIG=/opt/homebrew/opt/llvm@11/bin/llvm-config pipenv run pip install --no-binary :all: --no-use-pep517 llvmlite
pipenv run pip install cython pybind11 pythran
pipenv run pip install --no-binary :all: --no-use-pep517 numpy
export OPENBLAS=/opt/homebrew/opt/openblas/lib/
pipenv run pip install --no-binary :all: --no-use-pep517 scipy
```
You may need to export additional flags, per whatever `brew` outputs on installation.

Then you can run `pipenv install`. It will not succeed, but you will have all of the dependencies available in your virtualenv.
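Because the install ends in a partial state, it can be handy to confirm which of the hard-to-build packages actually made it into the virtualenv before moving on. This small check is not part of the repository; it simply reports whether each package is importable (run it inside `pipenv run python`):

```python
# Optional sanity check (not part of this repository): report which of
# the hard-to-build dependencies are importable from the current
# environment.
import importlib.util

packages = ("llvmlite", "numpy", "scipy", "gensim")
statuses = {name: importlib.util.find_spec(name) is not None for name in packages}

for name, ok in statuses.items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```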
The overall structure of the data pipeline is as follows:
- define a data set
- download fulltexts (where possible) of every item in this data set
- download metadata for all fulltexts found
- run one or more filters to clean up the data
- train a neural net on the fulltexts
- optionally enrich the downloaded metadata with information derived from the fulltexts
- embed the (many-dimensional) neural net in two-dimensional space, so that it can be visualized in a browser
- generate a file of 2D coordinates and metadata suitable for consumption by the quadfeather library
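In outline, the pipeline is a chain of transformations from item identifiers to 2D coordinates. The sketch below is purely illustrative: the function names are hypothetical stand-ins, not this repository's actual API (the real orchestration lives in `run_pipeline.py`):

```python
# Hypothetical stand-ins for the pipeline stages described above; the
# real implementations live in this repository's modules and are wired
# together by run_pipeline.py.

def download_fulltexts(dataset):
    # Fetch the OCRed text of each item in the data set.
    return [f"Text of {item}" for item in dataset]

def run_filters(texts):
    # Clean up the raw OCR (the repo ships several composable filters).
    return [t.strip().lower() for t in texts]

def train_neural_net(texts):
    # Stand-in for doc2vec training.
    return {"documents": texts}

def embed_2d(model):
    # Stand-in for projecting the high-dimensional model into 2D.
    return [(float(i), float(i)) for i in range(len(model["documents"]))]

dataset = ["item/1", "item/2"]
coordinates = embed_2d(train_neural_net(run_filters(download_fulltexts(dataset))))
print(coordinates)  # one (x, y) pair per document
```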
The script `run_pipeline.py` shows the workflow for downloading texts and metadata from the Library of Congress; training a neural net on the texts; and transforming the neural net and metadata into a format that can be represented on the web.
You can, of course, mix and match steps as you prefer. Changes you might want to make include:
- writing your own dataset definitions (see `lc_etl/dataset_definitions` for examples);
- altering the neural net hyperparameters (see `config_files` for examples);
- using filters differently (fewer of them; in a different order; with different threshold values);
- passing different base words into `assign_similarity_metadata`.
Some of these may be time-consuming (in particular: `train_doc2vec`, `filter_nonwords`, `assign_similarity_metadata`, and downloading large data sets), so you might want to run them with `nohup`, or whatever you like for being able to walk away from a process for a while.
Note that `filter_nonwords` can only be run if you already have an intermediate neural net you can use to find real words that are similar in meaning to OCR errors. (The `BOOTSTRAP_MODEL_PATH` referenced in `run_pipeline.py` is not part of this repository.) You can train a suitable neural net on your whole data set, but if that data set is large, it may take an enormous amount of memory to handle all the OCR errors your neural net must learn; you will be happier training your intermediate net on a reasonably sized subset of your data, accepting that it will not see low-frequency OCR errors, but trusting that it will learn the common ones.
To load a model, in order to explore it:

```shell
pipenv run python
>>> import gensim
>>> model = gensim.models.Doc2Vec.load("path/to/model")
```
You can also explore visually using deepscatter. Install it and edit its `index.html` file to point to the directory containing the files generated by `quadfeather` in the `run_pipeline` script above.
To run the tests:

```shell
pipenv run python -m unittest tests/tests.py
```

You can run a single test case with, e.g., `pipenv run python -m unittest tests.tests.TestBulkScripts` (or a single test by appending it to this format).