This project implements a production‑grade ETL and analytics pipeline for the Brazilian e‑commerce dataset (Olist).
It includes:
- A raw → processed transformation pipeline
- A processed → analytics warehouse loader
- A wipe/reset system for safe re‑runs
- A full pipeline orchestrator
- Clean, color‑coded logging
- Schema‑validated dimension and fact building
The goal is to demonstrate a clean, maintainable, and reproducible data engineering workflow.
The pipeline runs in two major stages:
**Stage 1: raw → processed.** This stage:
- Normalizes column names
- Renames fields to match analytics DDL
- Cleans customers, merchants, and transactions
- Converts timestamps
- Saves processed datasets to `data_base/processed_data/`
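The transformation steps above can be sketched with pandas. The column names and rename mapping here are illustrative assumptions, not the project's actual mapping:

```python
import pandas as pd

def transform_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative raw -> processed cleanup (column names are assumptions)."""
    # Normalize column names: strip whitespace, lowercase, underscores
    df = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    # Rename fields to match the analytics DDL (hypothetical mapping)
    df = df.rename(columns={"seller_id": "merchant_id"})
    # Convert timestamp-like columns to proper datetimes
    for col in df.columns:
        if col.endswith("_timestamp") or col.endswith("_date"):
            df[col] = pd.to_datetime(df[col], errors="coerce")
    return df
```

Each cleaned frame would then be written to `data_base/processed_data/` as the stage's output.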
**Stage 2: processed → analytics.** This stage:
- Loads processed datasets
- Validates schema
- Builds dimensions (customer, merchant, product, date, etc.)
- Builds fact tables
- Loads everything into the `analytics` schema in PostgreSQL
Run the entire pipeline with one command:
```bash
python python/run_all.py
```
This performs:
- Wipe raw, processed, and analytics schema
- Transform raw → processed
- Load processed → analytics warehouse
All steps include clean, color‑coded logging.
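The orchestrator's control flow can be sketched as running each script in order and stopping on the first failure. The step list and color codes below are assumptions about how `run_all.py` might be structured:

```python
import subprocess
import sys

STEPS = [
    ["python/wipe_data.py", "all"],      # 1. wipe raw, processed, analytics
    ["python/transform.py"],             # 2. raw -> processed
    ["python/load_analytics_data.py"],   # 3. processed -> analytics
]

def run_all(steps=STEPS) -> int:
    """Run each pipeline step in order, stopping on the first failure (sketch)."""
    for step in steps:
        print(f"\033[36m==> {' '.join(step)}\033[0m")  # cyan step banner
        result = subprocess.run([sys.executable, *step])
        if result.returncode != 0:
            return result.returncode
    return 0

if __name__ == "__main__":
    sys.exit(run_all())
```

Returning the failing step's exit code makes the orchestrator usable from shell scripts and CI.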
Use `wipe_data.py` to safely reset any layer:
```bash
python python/wipe_data.py processed
python python/wipe_data.py analytics
python python/wipe_data.py raw
python python/wipe_data.py all
```
This ensures no stale files or mismatched schemas remain.
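For the file-based layers, the wipe can be as simple as deleting and recreating the layer directory. The paths below are assumptions, and the `analytics` case (not shown) would instead drop and recreate the Postgres schema:

```python
import shutil
from pathlib import Path

# Layer -> directory mapping (paths are assumptions)
LAYERS = {
    "raw": Path("data_base/raw_data"),
    "processed": Path("data_base/processed_data"),
}

def wipe(layer: str) -> None:
    """Delete and recreate a data layer's directory (sketch)."""
    targets = LAYERS.values() if layer == "all" else [LAYERS[layer]]
    for path in targets:
        shutil.rmtree(path, ignore_errors=True)  # remove stale files
        path.mkdir(parents=True, exist_ok=True)  # leave an empty layer behind
```

Recreating the empty directory means downstream scripts never have to check whether the layer exists.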
| Column | Description |
|---|---|
| customer_id | Unique customer ID |
| customer_unique_id | Persistent customer identifier |
| customer_zip_code_prefix | ZIP prefix |
| customer_city | City |
| customer_state | State |
| country | Always "Brazil" |
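Building this dimension from the processed customers table can be sketched as a column selection plus the constant `country` field (the function name is illustrative):

```python
import pandas as pd

def build_dim_customer(customers: pd.DataFrame) -> pd.DataFrame:
    """Build the customer dimension from processed data (sketch)."""
    dim = customers[[
        "customer_id", "customer_unique_id",
        "customer_zip_code_prefix", "customer_city", "customer_state",
    ]].drop_duplicates(subset="customer_id")
    dim = dim.copy()
    dim["country"] = "Brazil"  # the Olist dataset is Brazil-only
    return dim
```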
| Column | Description |
|---|---|
| merchant_id | Seller ID |
| merchant_zip_code_prefix | ZIP prefix |
| merchant_city | City |
| merchant_state | State |
| country | Always "Brazil" |
Merged from orders, items, and payments.
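A sketch of that merge, assuming the Olist key column names (`order_id`, `price`, `payment_value`). Items and payments are aggregated to one row per order first so the join does not fan out the fact grain:

```python
import pandas as pd

def build_fact_orders(orders: pd.DataFrame, items: pd.DataFrame,
                      payments: pd.DataFrame) -> pd.DataFrame:
    """Merge order headers, line items, and payments on order_id (sketch)."""
    item_totals = items.groupby("order_id", as_index=False)["price"].sum()
    pay_totals = payments.groupby("order_id", as_index=False)["payment_value"].sum()
    return (
        orders.merge(item_totals, on="order_id", how="left")
              .merge(pay_totals, on="order_id", how="left")
    )
```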
Defined in `sql/analytics_schema.sql`.
Created automatically during pipeline execution.
Includes:
- `dim_customer`
- `dim_merchant`
- `dim_product`
- `dim_date`
- `fact_orders`
- `fact_payments`
- `fact_items`
- etc.
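Applying the DDL before any loads can be sketched with a small helper. The naive `;` split below is an assumption that the DDL contains no semicolons inside function bodies or string literals:

```python
from sqlalchemy import create_engine, text

def apply_ddl(db_url: str, ddl_path: str = "sql/analytics_schema.sql") -> None:
    """Execute the analytics DDL before any table loads (sketch)."""
    engine = create_engine(db_url)
    with open(ddl_path) as f, engine.begin() as conn:
        # Naive statement split; assumes no ';' inside strings or bodies
        for statement in f.read().split(";"):
            if statement.strip():
                conn.execute(text(statement))
```

Running the DDL inside `engine.begin()` makes schema creation transactional, so a partial failure leaves no half-built schema behind.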
```bash
python -m venv venv
source venv/bin/activate   # Linux/Mac
venv\Scripts\activate      # Windows
```
```bash
pip install -r requirements.txt
```
```
DB_URL=postgresql+psycopg2://user:password@localhost:5432/payflow
```
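The scripts can pick up this connection string from the environment. A minimal sketch, assuming a `get_engine` helper (the name is illustrative):

```python
import os
from sqlalchemy import create_engine

def get_engine():
    """Build the SQLAlchemy engine from the DB_URL environment variable (sketch)."""
    db_url = os.environ.get("DB_URL")
    if not db_url:
        raise RuntimeError("DB_URL is not set")
    return create_engine(db_url)
```

Failing fast on a missing `DB_URL` surfaces misconfiguration before any data is touched.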
Typical workflow:
```bash
python python/wipe_data.py all
python python/run_all.py
```
Or manually:
```bash
python python/transform.py
python python/load_analytics_data.py
```
- The old processed loader was deprecated to avoid schema conflicts
- All analytics loaders now read exclusively from `processed_data/`
- Schema validation prevents silent mismatches
- DDL is executed before loading any analytics tables
- Logging is consistent across all scripts
- Add dbt-style documentation
- Add Airflow orchestration
- Add unit tests for dimension builders
- Add data quality checks (Great Expectations)
Yomi Ismail
Data Engineering portfolio project focused on ETL design, PostgreSQL integration, and schema preparation for analytics use cases.