This is code for the back end of the 2021 Library of Congress Computing Cultural Heritage in the Cloud project "Situating Ourselves in Cultural Heritage" (Andromeda Yelton, lead investigator).
Code for the front end resides at `lc_site`.
```shell
git clone
cd lc_etl
pipenv install
python -m spacy download en_core_web_sm
```
The `pipenv install` may fail due to unresolved upstream issues with numpy/scipy and missing system dependencies. Before attempting it, create your pipenv and manually install the following:
```shell
brew install gfortran
brew install openblas
brew install llvm@11
export PATH="/opt/homebrew/opt/llvm@11/bin:$PATH"
export LDFLAGS="-L/opt/homebrew/opt/llvm@11/lib -L/opt/homebrew/lib"
export CPPFLAGS="-I/opt/homebrew/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm@11/include/c++/v1"
export CXX=/opt/homebrew/opt/llvm@11/bin/clang++
export CC=/opt/homebrew/opt/llvm@11/bin/clang
LLVM_CONFIG=/opt/homebrew/opt/llvm@11/bin/llvm-config pipenv run pip install --no-binary :all: --no-use-pep517 llvmlite
pipenv run pip install cython pybind11 pythran
pipenv run pip install --no-binary :all: --no-use-pep517 numpy
export OPENBLAS=/opt/homebrew/opt/openblas/lib/
pipenv run pip install --no-binary :all: --no-use-pep517 scipy
```
You may need to export additional flags, per whatever `brew` outputs on installation.

Then you can run `pipenv install`. It will not succeed, but you will have all of the dependencies available in your virtualenv.
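Because the install ends in a partial state, it can be handy to confirm which of the hard-to-build packages actually made it into the virtualenv before moving on. This small check is not part of the repository; it simply reports whether each package is importable (run it inside `pipenv run python`):

```python
# Optional sanity check (not part of this repository): report which of
# the hard-to-build dependencies are importable from the current
# environment.
import importlib.util

packages = ("llvmlite", "numpy", "scipy", "gensim")
statuses = {name: importlib.util.find_spec(name) is not None for name in packages}

for name, ok in statuses.items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```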
The overall structure of the data pipeline is as follows:
- define a data set
- download fulltexts (where possible) of every item in this data set
- download metadata for all fulltexts found
- run one or more filters to clean up the data
- train a neural net on the fulltexts
- optionally enrich the downloaded metadata with information derived from the fulltexts
- embed the (many-dimensional) neural net in two-dimensional space, so that it can be visualized in a browser
- generate a file of 2D coordinates and metadata suitable for consumption by the quadfeather library
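In outline, the pipeline is a chain of transformations from item identifiers to 2D coordinates. The sketch below is purely illustrative: the function names are hypothetical stand-ins, not this repository's actual API (the real orchestration lives in `run_pipeline.py`):

```python
# Hypothetical stand-ins for the pipeline stages described above; the
# real implementations live in this repository's modules and are wired
# together by run_pipeline.py.

def download_fulltexts(dataset):
    # Fetch the OCRed text of each item in the data set.
    return [f"Text of {item}" for item in dataset]

def run_filters(texts):
    # Clean up the raw OCR (the repo ships several composable filters).
    return [t.strip().lower() for t in texts]

def train_neural_net(texts):
    # Stand-in for doc2vec training.
    return {"documents": texts}

def embed_2d(model):
    # Stand-in for projecting the high-dimensional model into 2D.
    return [(float(i), float(i)) for i in range(len(model["documents"]))]

dataset = ["item/1", "item/2"]
coordinates = embed_2d(train_neural_net(run_filters(download_fulltexts(dataset))))
print(coordinates)  # one (x, y) pair per document
```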
The script `run_pipeline.py` shows the workflow for downloading texts and metadata from the Library of Congress; training a neural net on the texts; and transforming the neural net and metadata into a format that can be represented on the web.
You can, of course, mix and match steps as you prefer. Changes you might want to make include:
- writing your own dataset definitions (see `lc_etl/dataset_definitions` for examples);
- altering the neural net hyperparameters (see `config_files` for examples);
- using filters differently (fewer of them; in a different order; with different threshold values);
- passing different base words into `assign_similarity_metadata`.
Some of these may be time-consuming (in particular: `train_doc2vec`, `filter_nonwords`, `assign_similarity_metadata`, and downloading large data sets), so you might want to run them with `nohup`, or whatever you like for being able to walk away from a process for a while.
Note that `filter_nonwords` can only be run if you already have an intermediate neural net you can use to find real words that are similar in meaning to OCR errors. (The `BOOTSTRAP_MODEL_PATH` referenced in `run_pipeline.py` is not part of this repository.) You can train a suitable neural net on your whole data set, but if that data set is large, it may take an enormous amount of memory to handle all the OCR errors your neural net must learn; you will be happier training your intermediate net on a reasonably sized subset of your data, accepting that it will not see low-frequency OCR errors, but trusting that it will learn the common ones.
To load a model, in order to explore it:

```shell
pipenv run python
>>> import gensim
>>> model = gensim.models.Doc2Vec.load("path/to/model")
```
You can also explore visually using deepscatter. Install it and edit its `index.html` file to point to the directory containing the files generated by `quadfeather` in the `run_pipeline` script above.
To run the tests:

```shell
pipenv run python -m unittest tests/tests.py
```

You can run a single test case with, e.g., `pipenv run python -m unittest tests.tests.TestBulkScripts` (or a single test by appending it to this format).