Gen3 ETL - a process from PostgreSQL to ES

Introduction to ETL

Background

Metadata of data files submitted to the commons is stored in PostgreSQL. Querying data from PostgreSQL with multiple joins is painful and inefficient, so we use ElasticSearch as the place to store the materialized dataset. Extract-transform-load (ETL) is the process that creates the materialized data from PostgreSQL and stores it in ElasticSearch.

Extract

There are multiple ways to extract data from an existing SQL database such as:

  • Extract the original data from the data source with SQL query statements, hold the necessary data in the data transformer's memory, transform it into the expected format, and finally load the output into a new data source as materialized data (a minimal sketch of this approach follows this list).
  • Physically extract the SQL data from the original data source (dumping everything, including unnecessary data) to an external data store. Then, in the transformation stage, extract the necessary information from that external store with another technology (usually a parallel framework).
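
To make the first option concrete, here is a minimal sketch using psycopg2; the connection settings, table names, and columns are hypothetical and only illustrate the idea.

import psycopg2

# Connect to the metadata database (hypothetical connection settings).
conn = psycopg2.connect(host="localhost", dbname="metadata_db",
                        user="etl_user", password="...")
cur = conn.cursor()
# One join query per target index; the result set is held in the
# transformer's memory and later reshaped into the expected format.
cur.execute("""
    SELECT subject.submitter_id, demographic.gender
    FROM subject
    JOIN demographic ON demographic.subject_id = subject.id
""")
rows = cur.fetchall()   # extracted data, now in memory
cur.close()
conn.close()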

Transform

In the data transformation stage, a series of rules or functions is applied to the extracted data to shape it into the format expected by the end target. In our case, we want to create indexing data so that the front-end can perform queries in the most efficient way.

Load

There are two loading stages: one loads data from the traditional database into a place that the transformer can easily process, and the other loads data from the transformer into the target data source.

Gen3 ETL

Gen3 ETL is designed to translate data from a graph data model stored in a PostgreSQL database into flattened indices in ElasticSearch (ES), which support efficient queries from the front-end.

Transformer

Interestingly, choosing the transformer is the most important decision in ETL, because the transformer requires specific formats for its input and output data. For our use case, Spark is one of the most advanced data-processing technologies, because its distributed architecture allows:

  1. processing data in parallel, simply, in horizontally scalable memory.
  2. iteratively processing data in multiple steps without reloading it from data storage (disk).
  3. streaming and integrating incremental data into an existing data source.

Hence, we choose Spark as the data transformer for fast and scalable data processing.

As discussed previously, there are multiple ways to extract data from the database and load it into Spark. One is to generate multiple SQL queries, execute them in parallel, and load the results directly into Spark's memory; another is to dump the whole dataset to intermediate storage such as HDFS and then load the text data stored in HDFS into Spark in parallel. Both options are sketched below.
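
A minimal PySpark sketch of both options, assuming hypothetical connection settings and HDFS paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gen3-etl").getOrCreate()

# Option 1: query PostgreSQL directly over JDBC and load the result
# into Spark's memory (this can be parallelized with partitioning options).
subjects = (spark.read.format("jdbc")
            .option("url", "jdbc:postgresql://db-host:5432/metadata_db")
            .option("dbtable", "subject")
            .option("user", "etl_user")
            .option("password", "...")
            .load())

# Option 2: the tables have already been dumped to HDFS as text files
# (e.g. by Sqoop); load those files into an RDD in parallel.
subject_rdd = spark.sparkContext.textFile("hdfs:///dump/subject/part-*")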

After reviewing the options that one of our collaborators, OICR, tried (posted here), we decided to go with a similar strategy: dump PostgreSQL to HDFS and load the HDFS data into Spark RDDs. We use Sqoop to dump the PostgreSQL database to HDFS; to perform the dump, Sqoop calls CopyManager.

Finally, we decided to use Python instead of Scala because the CDIS dev teams are much more comfortable with Python. Since all of the computation is done in Spark and we do no data manipulation at the Python level, the performance difference is not significant.

Mapping file

Every ETL process is defined by a translation from the original dataset to the expected one. Gen3-ETL provides a neutral way in which you can:

  1. aggregate and collect data from multiple nodes in the original dataset into an individual node in the target dataset.
  2. embed fields from a high-level node into lower-level nodes, or extract particular fields from any specific node.

The format of the mapping file:

mappings:
  - name: NAME_OF_THE_INDEX_in_ES
    doc_type: DOCTYPE_OF_THE_INDEX_in_ES
    type: {aggregator or collector}
    root: ROOT_NODE_OF_THE_INDEX
    props:
      #  Properties of the root node that are put in the index
      - name: submitter_id
      - name: project_id
    flatten_props:
      # list of properties, together with paths to the nodes
      # that have a 1-to-1 relationship with the root node
      - path: demographics
        props:
          # properties from node specified by the path
          - name: gender_ES
            # name of field in the ES index
            src: gender
            # name of field in the original database
            value_mappings:
            # mapping rule from value in SQL database 
            # to the value in ES
              - female: F
              - male: M
    aggregated_props:
      # the name, the path to the node, and the aggregating func
      # only the count func is supported now
      - name: _summary_lab_results_on_sample_count
        path: samples.summary_lab_results
        fn: count
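
For illustration only, the mapping above might produce a flattened ES document like the following for one root-node record (all field values here are invented):

# Purely illustrative output document; values are made up.
subject_document = {
    "submitter_id": "subject-01",                 # props from the root node
    "project_id": "myproject-P01",
    "gender_ES": "F",                             # flatten_props: demographics.gender,
                                                  # remapped from "female" to "F"
    "_summary_lab_results_on_sample_count": 12,   # aggregated_props: count of
                                                  # summary_lab_results under samples
}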