This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Judah is a simplistic service-oriented framework to handle ETL tasks in python.

judah

She (Leah) said, “This time I will praise the LORD”; so she named him Judah - Genesis 29: 35

This project is no longer being maintained.

judah is a service-oriented Python package to handle ETL (extract-transform-load) tasks easily.

It follows a service-oriented architectural (SOA) design.

Under the hood, it uses the nice little ETL framework called Bonobo.

This project is still under heavy development

Purpose

The judah framework was created to standardize the integration or ETL (extract-transform-load) applications that collect energy data from multiple external sources and save it in a warehouse.

Links

Here are a few important links:

Languages Used

Dependencies

Getting Started

  • Install the package
pip install judah
  • Copy the .example.env file to .env and make appropriate edits on it
cp .example.env .env
  • Import the source, destination and transformer classes, along with any utility functions you need, and use them accordingly
from judah.sources.export_site.date_based import DateBasedExportSiteSource
# ...  

Expected App System Design and Architecture

The judah framework expects all applications that use it to follow a service-oriented-architecture as shown below.

  • The app should have a services folder (what Python calls a package) to contain the separate ETL services, each corresponding to a given third-party data source e.g. CNN, BBC
  • Subsequently, each ETL service should be divided up into child services. Each child service should represent a unique data flow path e.g. REST-API-to-database, REST-API-to-cache, REST-API-to-queue, file-download-site-to-database, file-download-site-to-queue etc.
  • Each child service should be divided up into a number of microservices. Each microservice should correspond to a single dataset, e.g. 'available_capacity', 'installed_capacity' etc.
  • Each microservice is expected to have a destination folder, a source.py file, a controller.py file and a transformers.py file.
    • The destination folder contains the database model file to which the data is to be saved. It contains a child class of the DatabaseBaseModel class of the judah framework.
    • The source.py file contains a child class of the BaseSource class of the judah framework. This is the class responsible for connecting to the data source (e.g. the REST API) and downloading the data from there.
    • The transformers.py file contains child classes of the BaseTransformer class of the judah framework. They are responsible for transforming the source data into data that can be saved. This may involve changing field names and types, exploding the data etc.
    • The controller.py file contains a child class of the BaseController class of the judah framework. This class is responsible for controlling the data flow from the source class, through the transformers, to the destination model.
  • Each child service folder should contain a registry of these microservices in its __init__.py file. The registry is just a list of the controllers of the microservices.
  • The app should have a main.py file as the entry point of the app where the Bonobo graph is instantiated and the microservice registries mentioned in the point above are added to the graph. Look at the example_main.py file for inspiration.
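The layout described above can be sketched in plain Python. This is only an illustration under stated assumptions, not judah's actual API: the classes Source, Destination and Controller below are hypothetical stand-ins for children of BaseSource, DatabaseBaseModel and BaseController, the transformers are plain functions rather than BaseTransformer children, and a simple loop stands in for the Bonobo graph that main.py would normally build. The dataset name and field names are made up for the example.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Record = Dict[str, Any]
Transformer = Callable[[Record], Record]


class Source:
    """Hypothetical stand-in for a BaseSource child: yields raw records."""

    def __init__(self, rows: Iterable[Record]) -> None:
        self._rows = list(rows)

    def get(self) -> Iterator[Record]:
        yield from self._rows


class Destination:
    """Hypothetical stand-in for a DatabaseBaseModel child: keeps records in memory."""

    def __init__(self) -> None:
        self.saved: List[Record] = []

    def save(self, record: Record) -> None:
        self.saved.append(record)


def rename_fields(mapping: Dict[str, str]) -> Transformer:
    """Build a transformer that renames keys according to `mapping`."""
    def transform(record: Record) -> Record:
        return {mapping.get(key, key): value for key, value in record.items()}
    return transform


def cast_field(name: str, caster: Callable[[Any], Any]) -> Transformer:
    """Build a transformer that converts one field to another type."""
    def transform(record: Record) -> Record:
        return {**record, name: caster(record[name])}
    return transform


class Controller:
    """Hypothetical stand-in for a BaseController child: source -> transformers -> destination."""

    def __init__(self, source: Source, transformers: List[Transformer],
                 destination: Destination) -> None:
        self.source = source
        self.transformers = transformers
        self.destination = destination

    def run(self) -> int:
        """Move every record through the pipeline and return how many were saved."""
        count = 0
        for record in self.source.get():
            for transform in self.transformers:
                record = transform(record)
            self.destination.save(record)
            count += 1
        return count


# One microservice for one dataset ('available_capacity'), with made-up sample data.
available_capacity_destination = Destination()
available_capacity = Controller(
    source=Source([{"MW": "120.5", "country": "UG"}]),
    transformers=[rename_fields({"MW": "capacity_mw"}), cast_field("capacity_mw", float)],
    destination=available_capacity_destination,
)

# What a child service's __init__.py would expose: a list of controllers.
REGISTRY = [available_capacity]


def main(registries: List[List[Controller]]) -> int:
    """Stand-in for main.py: run every controller in every registry."""
    return sum(controller.run() for registry in registries for controller in registry)
```

Calling main([REGISTRY]) runs the single microservice and leaves the renamed, type-converted record in available_capacity_destination.saved. In a real app the registries would be added to a Bonobo graph instead; see example_main.py in the repository for the actual wiring.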

Why service-oriented architectural (SOA) design

Service-oriented architecture makes it easy to connect feature requests with the code that is written. Software requirements are often structured in a service-oriented manner. For example:

  • User can see realtime data about bitcoin
  • User can see realtime data about Ethereum
  • User can view historical data about bitcoin

When the source code follows the same structure as these requirements, it is easy for anyone to comprehend.

For the requirements above, each one gets a single pipeline with its own independent folder.
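As a rough sketch of that mapping, the snippet below scaffolds one folder per requirement, following the services / child service / microservice layout described earlier. The folder names (bitcoin, ethereum, rest_api_to_db, realtime, historical) and the LAYOUT structure are hypothetical choices made for this example; judah itself does not ship a scaffolding helper as far as this README shows.

```python
from pathlib import Path
from typing import Dict, List

# Hypothetical mapping of the three requirements above:
# service -> child service (data flow path) -> microservices (datasets)
LAYOUT: Dict[str, Dict[str, List[str]]] = {
    "bitcoin": {"rest_api_to_db": ["realtime", "historical"]},
    "ethereum": {"rest_api_to_db": ["realtime"]},
}

# Files each microservice is expected to have, per the architecture section.
MICROSERVICE_FILES = ("source.py", "controller.py", "transformers.py")


def scaffold(root: Path, layout: Dict[str, Dict[str, List[str]]]) -> List[Path]:
    """Create empty microservice folders under root/services and return them."""
    created: List[Path] = []
    for service, child_services in layout.items():
        for child_service, microservices in child_services.items():
            child_dir = root / "services" / service / child_service
            child_dir.mkdir(parents=True, exist_ok=True)
            # The registry of controllers would live in this __init__.py.
            (child_dir / "__init__.py").touch()
            for microservice in microservices:
                folder = child_dir / microservice
                (folder / "destination").mkdir(parents=True, exist_ok=True)
                for name in MICROSERVICE_FILES:
                    (folder / name).touch()
                created.append(folder)
    return created
```

Running scaffold(Path("."), LAYOUT) would create three microservice folders, one per requirement, so a reader can go from a requirement straight to the folder that implements it.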

It is also easy to turn that architecture into a stable microservice architecture if the need ever arises.

Watch this talk by Alexandra Noonan and this other one by Simon Brown

How to set up Debian server for Selenium Chrome driver

  • Install an in-memory display server (xvfb)
sudo apt-get update
sudo apt-get install -y curl unzip xvfb libxi6 libgconf-2-4
  • Install Google Chrome
curl -sS -o - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
# 'sudo echo ... >> file' fails because the redirect runs as the current user; use tee instead
echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee -a /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
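After the steps above, a quick check from Python can confirm that both installs are visible on the PATH. This is a minimal sketch using only the standard library. The binary names google-chrome and Xvfb are assumptions about what the packages above install, and the helper functions are hypothetical, not part of judah.

```python
import shutil
from typing import Dict, Iterable, List, Optional

# Assumed executable names for the google-chrome-stable and xvfb packages.
REQUIRED_BINARIES = ("google-chrome", "Xvfb")


def find_binaries(names: Iterable[str] = REQUIRED_BINARIES) -> Dict[str, Optional[str]]:
    """Map each binary name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


def missing_binaries(found: Dict[str, Optional[str]]) -> List[str]:
    """Return the names that could not be located."""
    return [name for name, path in found.items() if path is None]
```

An empty list from missing_binaries(find_binaries()) suggests the server is ready; otherwise the returned names show which install step to repeat.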

How to test

  • Clone the repository and enter its folder
git clone https://github.com/sopherapps/judah.git && cd judah
  • Copy the .example.env file to .env and make appropriate edits on it
cp .example.env .env
  • Create the test database: 'test_judah' in this case
sudo -su postgres
createdb test_judah
  • Update the TEST_POSTGRES_DB_URI variable in the .env file to that test database's connection details

  • Create a virtual environment and activate it
virtualenv -p /usr/bin/python3.6 env && source env/bin/activate
  • Install the dependencies
pip install -r requirements.txt
  • Run the test command
python -m unittest
  • To measure test coverage, run the tests under coverage and then print the report
coverage run -m unittest && coverage report -m
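If the tests fail to connect, it is worth checking that TEST_POSTGRES_DB_URI in the .env file really points at the test database created above. The sketch below parses .env-style text with the standard library and extracts the database name from the URI. The parse_env and database_name helpers and the example URI (user, password, host, port) are hypothetical and not taken from .example.env; how judah itself loads the .env file is not shown in this README.

```python
from typing import Dict
from urllib.parse import urlparse


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def database_name(uri: str) -> str:
    """Return the database name from a PostgreSQL connection URI."""
    parsed = urlparse(uri)
    # Accept driver suffixes such as postgresql+psycopg2 as well.
    if parsed.scheme.split("+")[0] not in ("postgres", "postgresql"):
        raise ValueError("not a PostgreSQL URI: scheme is %r" % parsed.scheme)
    return parsed.path.lstrip("/")


# Hypothetical example content; real credentials and host will differ.
EXAMPLE_ENV = """
# test database connection
TEST_POSTGRES_DB_URI=postgresql://postgres:secret@localhost:5432/test_judah
"""

uri = parse_env(EXAMPLE_ENV)["TEST_POSTGRES_DB_URI"]
print(database_name(uri))
```

Replacing EXAMPLE_ENV with the contents of the real .env file shows which database the test suite will write to.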

How to Use (Example commands for Linux)

  • Ensure you have Google Chrome installed. For Debian servers, see the instructions under the title "How to set up Debian server for Selenium Chrome driver"

Maintainers

Folder Structure

The judah package holds the framework components that are basically base classes to be overridden.

The folder structure, as generated by the command tree -d --matchdirs -I 'env|__pycache__', is shown below

.
├── judah
│   ├── controllers
│   │   ├── base
│   │   ├── db_to_db
│   │   ├── export_site_to_db
│   │   └── rest_api_to_db
│   ├── destinations
│   │   └── database
│   ├── sources
│   │   ├── base
│   │   ├── database
│   │   ├── export_site
│   │   └── rest_api
│   ├── transformers
│   └── utils
└── test
    ├── assets
    ├── test_controllers
    ├── test_destinations
    │   └── test_database
    ├── test_sources
    │   ├── test_database
    │   ├── test_exports_site
    │   └── test_rest_api
    ├── test_transformers
    └── test_utils

Acknowledgements
