Skip to content

A custom Python ETL pipeline with a Postgres data warehouse for modeling art inventory.

License

Notifications You must be signed in to change notification settings

srenegado/paintings-data

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

63 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

paintings-data

Woman with a Parasol – Madame Monet and Her Son, 1875 by Claude Monet,
National Gallery of Art, Washington DC

This project is an ETL pipeline that ingests data on museums, artists, and their paintings into a data warehouse in Postgres. The intent here is building the data models and the ETL process from scratch, understanding these concepts by limiting the use of more sophisticated tools. It has the following features:

  • Establishes an ETL process with Python using SQLAlchemy and Pandas
  • Creates a data warehouse with a staging and presentation area using Postgres as the storage solution
  • Validates every deployed table in the data warehouse with a comprehensive test suite

Below is a diagram that overviews the entire process.

Note: For more on data modeling and data quality, check out the docs.

Installation

This project requires Python: I used version 3.11.

Clone this repo. Then you can install the required Python dependencies via:

make init

To install the database either:

Download the data and extract the csvs to the data folder.

Usage

How the database is setup depends on if you installed PostgreSQL and pgAdmin locally or are using Docker.

Setting up the DB locally

Usually the default user is postgres and the password is normally configured when installing PostgreSQL, but you can always change it.

Inside the pipelines/config.json file, set the following values:

"user": "your_default_postgres_username",
"password": "your_default_postgres_password",
"host": "localhost",
"port": 5432

Open pgAdmin and make a new server, filling in the following areas:

Host name/address: localhost
Port: 5432
Username: your_default_postgres_username
Password: your_default_postgres_password

To create a database, first connect to the default postgres database with the default user. Then run the following query:

CREATE DATABASE paintings;

A paintings database should appear under the Databases drop down menu.

Setting up the DB via Docker

Spin up the Docker containers via

docker compose up

Open up the compose.yml file for reference.

Inside the pipelines/config.json file, set the following values:

"user": POSTGRES_USER,         # from compose.yml
"password": POSTGRES_PASSWORD, # from compose.yml
"host": "localhost",
"port": 5433

Go to localhost:5050 in your browser directs and login using PGADMIN_DEFAULT_EMAIL and PGADMIN_DEFAULT_PASSWORD.

Make a new server with the following values:

Host name/address: db
Port: 5432
Username: POSTGRES_USER
Password: POSTGRES_PASSWORD

A paintings database should already be listed under Databases.

To shut down the containers, use:

docker compose down

Running the pipeline

You can deploy the tables using

make all

You should see these tables now in the paintings database: check it out through pgAdmin.

Then you can validate the tables with

make test

and see all the tests results in the console.

Data

The data was sourced from Kaggle.

License

This project is under the MIT license (see LICENSE)

About

A custom Python ETL pipeline with a Postgres data warehouse for modeling art inventory.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published