This repository has been archived by the owner on Dec 2, 2021. It is now read-only.

sul-dlss-deprecated/rialto-etl

RIALTO-ETL

License: Apache 2.0

RIALTO-ETL is a set of ETL tools for RIALTO, Stanford Libraries' research intelligence project.

Dependencies

  • Ruby >= 2.5.0

Usage

Pipeline to harvest organizations from Stanford Profiles API into RIALTO Core

exe/extract call StanfordOrganizations > organizations.json
exe/transform call StanfordOrganizations -i organizations.json > organizations.sparql
exe/load call Sparql -i organizations.sparql
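
The three commands above can also be driven from a short Ruby script that stops as soon as one step fails. This is only a sketch: run_steps and ORGANIZATION_STEPS are hypothetical names, not part of RIALTO-ETL, and the script assumes it is run from the repository root.

```ruby
# Run each step in order, redirecting its STDOUT to a file when :out is given.
# Raises on the first failing step so later steps do not run on bad input.
def run_steps(steps)
  steps.each do |step|
    options = step[:out] ? { out: step[:out] } : {}
    ok = system(*step[:cmd], **options)
    raise "step failed: #{step[:cmd].join(' ')}" unless ok
  end
end

# The organizations pipeline from this README, expressed as data.
ORGANIZATION_STEPS = [
  { cmd: %w[exe/extract call StanfordOrganizations], out: 'organizations.json' },
  { cmd: %w[exe/transform call StanfordOrganizations -i organizations.json],
    out: 'organizations.sparql' },
  { cmd: %w[exe/load call Sparql -i organizations.sparql] }
].freeze

# Uncomment to run the pipeline from the repository root:
# run_steps(ORGANIZATION_STEPS)
```

Keeping the steps as data makes it easy to reuse run_steps for the researchers pipeline by swapping in its commands and file names.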

Pipeline to harvest researchers from Stanford Profiles API into RIALTO Core

Notes:

  • The extract step takes about 20 minutes because it makes ~796 requests to fetch the full 1.6GB of data.
  • The transform step depends on organizations.json from the organizations pipeline.
  • The transform step takes about 35 minutes on a single thread.
  • The load step takes about 7 hours on a single thread.
exe/extract call StanfordResearchers > researchers.ndj
exe/transform call StanfordPeople -i researchers.ndj > researchers.sparql
exe/load call Sparql -i researchers.sparql
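
Because the researchers extract is large, it helps to inspect it without loading the whole file into memory. The sketch below assumes that researchers.ndj is newline-delimited JSON (one JSON document per line), which the file extension suggests but this README does not state. each_ndj_record is a hypothetical helper, not part of RIALTO-ETL.

```ruby
require 'json'

# Yield one parsed JSON document per non-blank line of an IO object.
# Reading line by line avoids holding the whole file in memory.
def each_ndj_record(io)
  return enum_for(:each_ndj_record, io) unless block_given?

  io.each_line do |line|
    line = line.strip
    next if line.empty?

    yield JSON.parse(line)
  end
end

# Hypothetical usage: count the JSON documents in an extract.
# File.open('researchers.ndj') { |file| puts each_ndj_record(file).count }
```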

Composite ETL

The composite ETL tools streamline operations by running extracts, transforms, and loads on batches of data. These tools are currently available for grants and publications.

Pipeline to harvest grants from Stanford SeRA API into RIALTO Core

Notes:

  • The transform step depends on researchers.ndj from the researchers pipeline.
  • Setting a batch size with -s speeds up extracting and transforming.
  • The load step can be skipped with the --skip-load flag.
  • The extract and transform steps will be skipped if the files already exist. Use the --force/-f flag to overwrite files.
  • If you need to run the transform and load on already extracted grant data, you can run them independently via exe/transform call StanfordGrants -i my_extracted_grant_file.json > my_transformed_grant_file.sparql and exe/load call Sparql -i my_transformed_grant_file.sparql.
exe/transform call StanfordPeopleList -i researchers.ndj > researchers.csv
exe/grants load -s 3 -i researchers.csv

Run exe/grants help load to see more of the available CLI options.
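
This README does not spell out how -s groups its work. Assuming it means "process this many researchers per batch", the grouping for -s 3 can be pictured with the sketch below. in_batches and the researcher identifiers are hypothetical and only illustrate the idea.

```ruby
# Split a list into consecutive batches of at most `size` items.
def in_batches(items, size)
  raise ArgumentError, 'size must be positive' unless size.positive?

  items.each_slice(size).to_a
end

# Seven hypothetical researcher identifiers grouped with a batch size of 3.
researcher_ids = %w[r1 r2 r3 r4 r5 r6 r7]
in_batches(researcher_ids, 3).each_with_index do |batch, index|
  puts "batch #{index + 1}: #{batch.join(', ')}"
end
```

With seven items and a batch size of 3 this yields three batches, the last holding a single item.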

Pipeline to harvest publications from Web of Science API into RIALTO Core

Notes:

  • The extract step can be skipped with the --skip-extract flag, in which case cached files in the input directory (--input-directory/-d flag) will be used for transformation and loading.
  • The load step can be skipped with the --skip-load flag.
  • The extract and transform steps will be skipped if the files already exist. Use the --force/-f flag to overwrite files.
  • If you need to run the transform and load on already extracted publication data, you can run them independently via exe/transform call WebOfScience -i my_extracted_publication_file.ndj > my_transformed_publication_file.sparql and exe/load call Sparql -i my_transformed_publication_file.sparql.
exe/publications load -d ../rialto-sample-data/publications -o data/transformed_publications

Run exe/publications help load to see more of the available CLI options.
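
The skip-if-present behaviour described in the notes for both composite pipelines (reuse existing files unless --force/-f is given) follows a common caching pattern. The sketch below shows that pattern in general terms. It is not the tools' actual implementation, and write_cached is a hypothetical name.

```ruby
# Write the block's result to `path` unless the file already exists.
# With force: true the file is always rewritten.
# Returns :written or :skipped so the caller can log what happened.
def write_cached(path, force: false)
  return :skipped if File.exist?(path) && !force

  File.write(path, yield)
  :written
end

# Hypothetical usage:
# write_cached('data/example.sparql') { expensive_transform }
# write_cached('data/example.sparql', force: true) { expensive_transform }
```

The block is only evaluated when the file is actually written, so a skipped step costs nothing beyond the existence check.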

Authentication

If you use the StanfordResearchers or StanfordOrganizations extractors, first obtain a token for the CAP API and assign it to Settings.cap.api_key. To do so, either set an environment variable named SETTINGS__CAP__API_KEY or add the value to config/settings.local.yml (which is ignored by version control and should never be checked in), like so:

cap:
  api_key: 'foobar'

Similarly, if you use the SPARQL writer, either set SETTINGS__SPARQL_WRITER__API_KEY or add the following to config/settings.local.yml:

sparql_writer:
  api_key: 'key' # SPARQL Proxy API key

Tokens are stored in shared_configs.

Run the extract process

Run exe/extract to run a named extractor and print output to STDOUT:

$ exe/extract call StanfordResearchers
{"count":10,"firstPage":true,"lastPage":false,"page":1,"totalCount":29089,"totalPages":2909,"values":[{"administrativeAppointments":[...
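
The paging fields in this sample response are internally consistent: at 10 records per page ("count"), 29089 records ("totalCount") need 2909 pages ("totalPages"). The sketch below shows that arithmetic as a quick sanity check. pages_needed is a hypothetical helper, and the meaning of each field is inferred from its name rather than from API documentation.

```ruby
# Number of requests needed to fetch `total_count` records at `per_page`
# records per request (integer ceiling division).
def pages_needed(total_count, per_page)
  raise ArgumentError, 'per_page must be positive' unless per_page.positive?

  (total_count + per_page - 1) / per_page
end

puts pages_needed(29_089, 10)
```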

List registered extract processes

Run exe/extract list to print out the list of callable extractors.

Transform

Run exe/transform to run a named transformer, based on Traject, on a named input file and print output to STDOUT:

$ exe/transform call StanfordOrganizationsToVivo -i stanford_organizations.json
{"@id":"http://authorities.stanford.edu/orgs#vice-provost-for-undergraduate-education/stanford-introductory-studies/freshman-and-sophomore-programs","@type":"http://vivoweb.org/ontology/core#Division","rdfs:label":"Freshman and Sophomore Programs","vivo:abbreviation":["FFQH"]}

Run exe/transform list to print out the list of callable transformers.

Load

Run exe/load to run a named loader on a named input file and print output to STDOUT:

$ exe/load call Sparql -i whatever.sparql
...

Configuration

RIALTO-ETL uses the config gem to manage configuration, allowing for flexible variation of configs between environments and hosts. By default, the gem assumes it is running in the 'production' environment and will look for its configurations per the config gem documentation. To explicitly set the environment to test or development, set an environment variable named ENV.
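
For orientation, the sketch below lists the files the config gem looks for by default, lowest precedence first, as described in the gem's own documentation. It assumes RIALTO-ETL keeps the gem's default lookup and a config/ directory at the repository root. settings_files is a hypothetical helper written for this illustration, so check the list against the gem version you have installed.

```ruby
# Candidate settings files for an environment; later files override earlier ones.
# Assumption: the config gem's documented default file layout.
def settings_files(env = ENV.fetch('ENV', 'production'), root: 'config')
  files = [
    "#{root}/settings.yml",
    "#{root}/settings/#{env}.yml",
    "#{root}/environments/#{env}.yml"
  ]
  # The gem's documentation notes that settings.local.yml is not loaded in tests.
  files << "#{root}/settings.local.yml" unless env == 'test'
  files + ["#{root}/settings/#{env}.local.yml", "#{root}/environments/#{env}.local.yml"]
end

puts settings_files('development')
```

Calling settings_files with no argument uses the ENV environment variable and falls back to 'production', matching the default described above.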

Help

$ exe/extract help
Commands:
  extract call NAME       # Call named extractor (`extract list` to see available names)
  extract help [COMMAND]  # Describe subcommands or one specific subcommand
  extract list            # List callable extractors

$ exe/transform help
Commands:
  transform call NAME       # Call named transformer (`transform list` to see available names)
  transform help [COMMAND]  # Describe subcommands or one specific subcommand
  transform list            # List callable transformers

$ exe/load help
Commands:
  load call NAME -i, --input-file=FILENAME  # Call named loader (` list` to see available names)
  load help [COMMAND]                       # Describe available commands or one specific command
  load list                                 # List callable loaders

Documentation

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Sample Data

The sample data we use with Rialto::Etl is kept in a private GitHub repository.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/sul-dlss/rialto-etl.