This project implements a production‑grade ETL and analytics pipeline for the Brazilian e‑commerce dataset (Olist).
It includes:
- A raw → processed transformation pipeline
- A processed → analytics warehouse loader
- A wipe/reset system for safe re‑runs
- A full pipeline orchestrator
- Clean, color‑coded logging
- Schema‑validated dimension and fact building
The goal is to demonstrate a clean, maintainable, and reproducible data engineering workflow.
The pipeline runs in two major stages:
**Stage 1: raw → processed.** This stage:
- Normalizes column names
- Renames fields to match analytics DDL
- Cleans customers, merchants, and transactions
- Converts timestamps
- Saves processed datasets to `data_base/processed_data/`
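The transformation steps above can be sketched with pandas. The column names and rename mapping here are illustrative assumptions, not the project's actual mapping:

```python
import pandas as pd

def transform_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative raw -> processed cleanup (column names are assumptions)."""
    # Normalize column names: strip whitespace, lowercase, underscores
    df = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    # Rename fields to match the analytics DDL (hypothetical mapping)
    df = df.rename(columns={"seller_id": "merchant_id"})
    # Convert timestamp-like columns to proper datetimes
    for col in df.columns:
        if col.endswith("_timestamp") or col.endswith("_date"):
            df[col] = pd.to_datetime(df[col], errors="coerce")
    return df
```

Each cleaned frame would then be written to `data_base/processed_data/` as the stage's output.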
**Stage 2: processed → analytics.** This stage:
- Loads processed datasets
- Validates schema
- Builds dimensions (customer, merchant, product, date, etc.)
- Builds fact tables
- Loads everything into the `analytics` schema in PostgreSQL
Run the entire pipeline with one command:
```bash
python python/run_all.py
```
This performs:
- Wipe raw, processed, and analytics schema
- Transform raw → processed
- Load processed → analytics warehouse
All steps include clean, color‑coded logging.
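The orchestrator's control flow can be sketched as running each script in order and stopping on the first failure. The step list and color codes below are assumptions about how `run_all.py` might be structured:

```python
import subprocess
import sys

STEPS = [
    ["python/wipe_data.py", "all"],      # 1. wipe raw, processed, analytics
    ["python/transform.py"],             # 2. raw -> processed
    ["python/load_analytics_data.py"],   # 3. processed -> analytics
]

def run_all(steps=STEPS) -> int:
    """Run each pipeline step in order, stopping on the first failure (sketch)."""
    for step in steps:
        print(f"\033[36m==> {' '.join(step)}\033[0m")  # cyan step banner
        result = subprocess.run([sys.executable, *step])
        if result.returncode != 0:
            return result.returncode
    return 0

if __name__ == "__main__":
    sys.exit(run_all())
```

Returning the failing step's exit code makes the orchestrator usable from shell scripts and CI.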
Use `wipe_data.py` to safely reset any layer:
```bash
python python/wipe_data.py processed
python python/wipe_data.py analytics
python python/wipe_data.py raw
python python/wipe_data.py all
```
This ensures no stale files or mismatched schemas remain.
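For the file-based layers, the wipe can be as simple as deleting and recreating the layer directory. The paths below are assumptions, and the `analytics` case (not shown) would instead drop and recreate the Postgres schema:

```python
import shutil
from pathlib import Path

# Layer -> directory mapping (paths are assumptions)
LAYERS = {
    "raw": Path("data_base/raw_data"),
    "processed": Path("data_base/processed_data"),
}

def wipe(layer: str) -> None:
    """Delete and recreate a data layer's directory (sketch)."""
    targets = LAYERS.values() if layer == "all" else [LAYERS[layer]]
    for path in targets:
        shutil.rmtree(path, ignore_errors=True)  # remove stale files
        path.mkdir(parents=True, exist_ok=True)  # leave an empty layer behind
```

Recreating the empty directory means downstream scripts never have to check whether the layer exists.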
| Column | Description |
|---|---|
| customer_id | Unique customer ID |
| customer_unique_id | Persistent customer identifier |
| customer_zip_code_prefix | ZIP prefix |
| customer_city | City |
| customer_state | State |
| country | Always "Brazil" |
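Building this dimension from the processed customers table can be sketched as a column selection plus the constant `country` field (the function name is illustrative):

```python
import pandas as pd

def build_dim_customer(customers: pd.DataFrame) -> pd.DataFrame:
    """Build the customer dimension from processed data (sketch)."""
    dim = customers[[
        "customer_id", "customer_unique_id",
        "customer_zip_code_prefix", "customer_city", "customer_state",
    ]].drop_duplicates(subset="customer_id")
    dim = dim.copy()
    dim["country"] = "Brazil"  # the Olist dataset is Brazil-only
    return dim
```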
| Column | Description |
|---|---|
| merchant_id | Seller ID |
| merchant_zip_code_prefix | ZIP prefix |
| merchant_city | City |
| merchant_state | State |
| country | Always "Brazil" |
Merged from orders, items, and payments.
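A sketch of that merge, assuming the Olist key column names (`order_id`, `price`, `payment_value`). Items and payments are aggregated to one row per order first so the join does not fan out the fact grain:

```python
import pandas as pd

def build_fact_orders(orders: pd.DataFrame, items: pd.DataFrame,
                      payments: pd.DataFrame) -> pd.DataFrame:
    """Merge order headers, line items, and payments on order_id (sketch)."""
    item_totals = items.groupby("order_id", as_index=False)["price"].sum()
    pay_totals = payments.groupby("order_id", as_index=False)["payment_value"].sum()
    return (
        orders.merge(item_totals, on="order_id", how="left")
              .merge(pay_totals, on="order_id", how="left")
    )
```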
Defined in `sql/analytics_schema.sql`.
Created automatically during pipeline execution.
Includes:
- `dim_customer`
- `dim_merchant`
- `dim_product`
- `dim_date`
- `fact_orders`
- `fact_payments`
- `fact_items`
- etc.
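Applying the DDL before any loads can be sketched with a small helper. The naive `;` split below is an assumption that the DDL contains no semicolons inside function bodies or string literals:

```python
from sqlalchemy import create_engine, text

def apply_ddl(db_url: str, ddl_path: str = "sql/analytics_schema.sql") -> None:
    """Execute the analytics DDL before any table loads (sketch)."""
    engine = create_engine(db_url)
    with open(ddl_path) as f, engine.begin() as conn:
        # Naive statement split; assumes no ';' inside strings or bodies
        for statement in f.read().split(";"):
            if statement.strip():
                conn.execute(text(statement))
```

Running the DDL inside `engine.begin()` makes schema creation transactional, so a partial failure leaves no half-built schema behind.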
```bash
python -m venv venv
source venv/bin/activate   # Linux/Mac
venv\Scripts\activate      # Windows
```
```bash
pip install -r requirements.txt
```
```
DB_URL=postgresql+psycopg2://user:password@localhost:5432/payflow
```
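The scripts can pick up this connection string from the environment. A minimal sketch, assuming a `get_engine` helper (the name is illustrative):

```python
import os
from sqlalchemy import create_engine

def get_engine():
    """Build the SQLAlchemy engine from the DB_URL environment variable (sketch)."""
    db_url = os.environ.get("DB_URL")
    if not db_url:
        raise RuntimeError("DB_URL is not set")
    return create_engine(db_url)
```

Failing fast on a missing `DB_URL` surfaces misconfiguration before any data is touched.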
Typical workflow:
```bash
python python/wipe_data.py all
python python/run_all.py
```
Or manually:
```bash
python python/transform.py
python python/load_analytics_data.py
```
- The old processed loader was deprecated to avoid schema conflicts
- All analytics loaders now read exclusively from `processed_data/`
- Schema validation prevents silent mismatches
- DDL is executed before loading any analytics tables
- Logging is consistent across all scripts
- Add dbt-style documentation
- Add Airflow orchestration
- Add unit tests for dimension builders
- Add data quality checks (Great Expectations)
Yomi Ismail
Data Engineering portfolio project focused on ETL design, PostgreSQL integration, and schema preparation for analytics use cases.