A complete, production-ready modern data stack built entirely with open-source components. Demonstrates cross-database federation, lakehouse architecture with Apache Iceberg, dbt transformations, a semantic layer, and self-service BI, all vendor-agnostic and Git-based. Plus: dual AI interfaces - Claude MCP for natural language exploration and a Streamlit app for multi-provider comparison (Claude vs Mistral vs Ollama).
v2.1: Added dbt MetricFlow with OSI-compatible semantic models, testing the promise of "define once, use everywhere" across semantic layer tools.
v2.0: Migrated from Hive Metastore to Apache Polaris (Iceberg REST catalog) for modern lakehouse capabilities with improved authentication and setup automation. Plus: Dual AI interfaces - Claude MCP integration for Claude Desktop and Streamlit comparison app (Claude API vs Mistral AI vs Ollama local).
This implementation proves that enterprise-grade data architecture is achievable without vendor lock-in:
- ✅ Cross-database federation via Trino - query PostgreSQL, MySQL, and object storage simultaneously
- ✅ Modern lakehouse with Apache Polaris and Iceberg - ACID transactions, time travel, schema evolution
- ✅ Git-based transformations with dbt - version-controlled SQL models
- ✅ Semantic layer with Cube.js - centralized metrics and governance
- ✅ OSI-ready semantic models with dbt MetricFlow - testing the Open Semantic Interchange specification for vendor interoperability
- ✅ Self-service analytics with Metabase - drag-and-drop visualization
- ✅ AI-powered interfaces - Claude MCP for exploration + Streamlit for multi-provider comparison (Claude/Mistral/Ollama)
- ✅ Full data sovereignty - complete control over data location and processing
- ✅ Hybrid-ready - mix self-hosted with managed services as needed
Processing synthetic e-commerce data: orders from PostgreSQL + product catalogs from MySQL + user events from object storage → unified analytics layer → AI-powered natural language interface.
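To make the federation claim concrete, here is a sketch of a single Trino query touching all three sources at once. Only the `postgres`, `mysql`, and `lakehouse` catalog names come from this repo's Trino configuration; the schema, table, and column names are illustrative, so adapt them to whatever the init scripts create in your environment.

```bash
# Hypothetical cross-source join -- adjust schema/table/column names to your data
docker compose exec trino trino --execute "
SELECT
    p.product_name,
    COUNT(DISTINCT o.order_id) AS orders,
    COUNT(e.event_type)        AS related_events
FROM postgres.public.orders o
JOIN mysql.products.products p ON o.product_id = p.product_id
LEFT JOIN lakehouse.raw_data.user_events e ON e.user_id = o.user_id
GROUP BY p.product_name
ORDER BY orders DESC
LIMIT 10;"
```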
Modern Data Stack v2 Architecture
┌──────────────────────────────────────────────────────────────────┐
│                           DATA SOURCES                           │
├──────────────────────────────────────────────────────────────────┤
│   PostgreSQL            MySQL             MinIO (S3-compatible)  │
│   (Orders)              (Products)        (User Events - Parquet)│
└───────┬───────────────────┬───────────────────┬──────────────────┘
        │                   │                   │
        └───────────────────┼───────────────────┘
                            │
           ┌────────────────▼──────────────┐     ┌──────────────────────────┐
           │       FEDERATION LAYER        │◄───►│      AI INTERFACES       │
           │    Trino (35+ connectors)     │     │  • Claude MCP            │
           │   Real-time cross-DB joins    │     │  • Streamlit (3-way)     │
           └────────────────┬──────────────┘     │    Claude/Mistral/Ollama │
                            │                    └──────────────────────────┘
           ┌────────────────▼──────────────┐
           │       LAKEHOUSE CATALOG       │
           │        Apache Polaris         │
           │     (Iceberg REST Catalog)    │
           │    - OAuth authentication     │
           │    - ACID transactions        │
           │    - Schema evolution         │
           └────────────────┬──────────────┘
                            │
           ┌────────────────▼──────────────┐
           │     TRANSFORMATION LAYER      │
           │           dbt Core            │
           │   - Staging → Intermediate    │
           │     → Marts (star schema)     │
           │   - Writes Iceberg tables     │
           └────────────────┬──────────────┘
                            │
           ┌────────────────▼──────────────┐
           │        SEMANTIC LAYER         │
           │            Cube.js            │
           │    - Metrics definitions      │
           │    - Access control           │
           │    - Pre-aggregations         │
           └────────────────┬──────────────┘
                            │
           ┌────────────────▼──────────────┐
           │         VISUALIZATION         │
           │           Metabase            │
           │    - Self-service BI          │
           │    - Interactive dashboards   │
           └───────────────────────────────┘
- Docker Desktop (with Docker Compose)
- 8GB RAM minimum (16GB recommended)
- 10GB free disk space
- Optional: Claude Desktop for AI interface
# Clone repository
git clone https://github.com/vincevv017/modern-data-stack.git
cd modern-data-stack
# Start all services (takes 2-3 minutes)
docker compose up -d
# Wait for services to initialize
sleep 30
# Setup Apache Polaris catalog (auto-detects credentials)
bash init-scripts/polaris/setup-polaris.sh
# Create lakehouse schemas
bash init-scripts/polaris/setup-lakehouse-schemas.sh

After schemas are created, load user events data into the lakehouse:
# Upload Parquet file to MinIO
docker compose cp lakehouse-data/user_event/data-001.parquet mc:/tmp/
docker compose exec mc mc cp /tmp/data-001.parquet myminio/raw-data/user_event/
# Verify upload
docker compose exec mc mc ls myminio/raw-data/user_event/
# Should show: data-001.parquet
# Create external table in lakehouse
docker compose exec trino trino << 'EOSQL'
CREATE SCHEMA IF NOT EXISTS lakehouse.raw_data
WITH (location = 's3://raw-data/');
CREATE TABLE IF NOT EXISTS lakehouse.raw_data.user_events (
user_id INTEGER,
event_type VARCHAR,
session_id VARCHAR,
event_timestamp TIMESTAMP(6),
page_url VARCHAR
)
WITH (
format = 'PARQUET',
external_location = 's3://raw-data/user_event/'
);
SELECT COUNT(*) FROM lakehouse.raw_data.user_events;
EOSQL
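Beyond the row count, a quick profile of the loaded events is a useful sanity check that the Parquet schema mapped correctly; the columns used here are the ones declared in the CREATE TABLE above.

```bash
docker compose exec trino trino --execute "
SELECT event_type, COUNT(*) AS events
FROM lakehouse.raw_data.user_events
GROUP BY event_type
ORDER BY events DESC;"
```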
# β οΈ CRITICAL: Run dbt transformations to create dbt_marts tables
# Without this, Cube.js and the AI interfaces won't work!
docker compose exec dbt dbt run
# Verify complete setup
docker compose exec trino trino --execute "SHOW CATALOGS;"
docker compose exec trino trino --execute "SHOW SCHEMAS IN lakehouse;"
docker compose exec trino trino --execute "SHOW TABLES IN lakehouse.dbt_marts;"

# Build time spine (calendar table)
docker compose exec dbt dbt run --select metricflow_time_spine
# Run validation script
bash dbt/scripts/validate-metricflow.sh

Expected output:
MetricFlow Validation Suite - Testing All 12 Metrics
Total Tests: 22
Passed: 20
Expected Failures: 2 (growth metrics - need more time series data)
Failed: 0
✅ VALIDATION SUCCESSFUL!
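Once validation passes, metrics can be queried directly through the MetricFlow CLI inside the dbt container. The metric name below is a placeholder; list the real names first and substitute one of them.

```bash
# List the metrics defined under models/semantic_models/, then query one
docker compose exec dbt mf list metrics
# "revenue" is an assumed metric name -- replace it with one from the list
docker compose exec dbt mf query --metrics revenue --group-by metric_time
```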
Experience natural language queries to your lakehouse:
# Install Claude Desktop
# Download from: https://claude.ai/download
# Install Python dependencies
/opt/homebrew/bin/python3 -m pip install mcp trino requests
# Configure Claude Desktop
cat > ~/Library/Application\ Support/Claude/claude_desktop_config.json << EOF
{
"mcpServers": {
"trino": {
"command": "python3",
"args": [
"$(pwd)/mcp-servers/trino/server.py"
]
}
}
}
EOF
# Restart Claude Desktop
# Now you can query your lakehouse in natural language!

Try it:
- "What schemas exist in the lakehouse?"
- "Show me tables in dbt_marts"
- "What's the total revenue from fct_orders?"
| Service | URL | Credentials |
|---|---|---|
| Trino UI | http://localhost:8080 | None (auto-login as admin) |
| Cube.js Playground | http://localhost:4000 | None |
| Metabase | http://localhost:3000 | Setup on first visit |
| MinIO Console | http://localhost:9001 | admin / password123 |
| Polaris API | http://localhost:8181 | OAuth (auto-configured) |
| MetricFlow API | http://localhost:8001 | None (reserved for future use) |
| Claude MCP | Claude Desktop App | Natural language interface |
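To exercise the Polaris API entry above directly, you can request an OAuth token and list catalogs. This is a sketch based on the standard Polaris REST endpoints and the credential-extraction trick used later in this README; exact paths and scopes may differ between Polaris versions.

```bash
# Extract the root principal credentials that setup-polaris.sh auto-detects
CREDS=$(docker compose logs polaris | grep "root principal credentials" | tail -1 | sed 's/.*credentials: //')
CLIENT_ID=${CREDS%%:*}
CLIENT_SECRET=${CREDS##*:}

# Exchange them for a bearer token (Iceberg REST OAuth endpoint)
TOKEN=$(curl -s -X POST http://localhost:8181/api/catalog/v1/oauth/tokens \
  -d "grant_type=client_credentials&client_id=${CLIENT_ID}&client_secret=${CLIENT_SECRET}&scope=PRINCIPAL_ROLE:ALL" \
  | python3 -c "import sys, json; print(json.load(sys.stdin)['access_token'])")

# List catalogs through the Polaris management API
curl -s -H "Authorization: Bearer ${TOKEN}" http://localhost:8181/api/management/v1/catalogs
```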
Experience the full stack with this federation query:
docker compose exec trino trino --execute "
SELECT
product_name,
supplier_country,
COUNT(*) as order_count,
SUM(revenue) as total_revenue,
AVG(revenue) as avg_revenue
FROM lakehouse.dbt_marts.fct_orders
GROUP BY product_name, supplier_country
ORDER BY total_revenue DESC
LIMIT 10;"

Or ask Claude:
"Show me the top 10 products by revenue, grouped by supplier country"
This query:
- Reads from dbt-transformed Iceberg tables in the lakehouse
- Aggregates data with ACID guarantees
- Returns business metrics ready for visualization
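The same mart can also be reached through the Cube.js semantic layer's REST API instead of raw SQL. The cube is defined in `cube/model/Orders.js`, but the measure and dimension names below are assumptions; check the Playground at http://localhost:4000 for the actual names (in dev mode no auth token is required).

```bash
# Hypothetical Cube.js query -- verify measure/dimension names in the Playground
curl -s -G "http://localhost:4000/cubejs-api/v1/load" \
  --data-urlencode 'query={"measures":["Orders.totalRevenue"],"dimensions":["Orders.supplierCountry"],"limit":10}'
```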
- Modern Iceberg REST catalog with OAuth authentication
- Auto-credential detection from Polaris logs
- Comprehensive setup scripts with error handling
- Proper role-based access control (RBAC)
- dbt now writes Iceberg tables directly to lakehouse
- Separation of storage (MinIO/S3) and compute (Trino)
- ACID transactions for analytics tables
- Time travel and schema evolution support
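The time-travel support mentioned above can be checked from Trino once dbt has built the marts: every Iceberg table exposes a `$snapshots` metadata table, and earlier versions can be read back with `FOR VERSION AS OF`. A minimal sketch, assuming `fct_orders` exists:

```bash
docker compose exec trino trino << 'EOSQL'
-- List the Iceberg snapshots recorded for a dbt-built table
SELECT snapshot_id, committed_at, operation
FROM lakehouse.dbt_marts."fct_orders$snapshots"
ORDER BY committed_at DESC;
EOSQL
```

Any snapshot_id from that list can then be queried with `SELECT ... FROM lakehouse.dbt_marts.fct_orders FOR VERSION AS OF <snapshot_id>`.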
- `setup-polaris.sh` - Main setup with auto-detection
- `setup-lakehouse-schemas.sh` - Schema initialization
- `recreate-catalog.sh` - Quick catalog recreation
- `check-what-broke.sh` - Diagnostic troubleshooting
- `fs.native-s3.enabled=true` enables Trino's native S3 file system
- Required for the Polaris REST catalog with MinIO
- Fixes "No factory for location" errors
- Natural language queries to lakehouse
- Conversational schema exploration
- No SQL knowledge required
- Demonstrates modern AI + data integration
- Hive Metastore container removed
- `lakehouse.properties` now uses `iceberg.rest-catalog.*` properties
- New initialization workflow required
- OAuth credentials must be configured
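For reference, a `lakehouse.properties` built around the new property names looks roughly like the sketch below. The values (hostnames, warehouse name, MinIO keys) are illustrative; POLARIS_TRINO_CONFIG.md remains the authoritative reference, and setup-polaris.sh fills in the real credentials.

```properties
# Sketch only -- values are illustrative, see POLARIS_TRINO_CONFIG.md
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://polaris:8181/api/catalog
iceberg.rest-catalog.security=OAUTH2
iceberg.rest-catalog.oauth2.credential=<client_id>:<client_secret>
iceberg.rest-catalog.warehouse=lakehouse
fs.native-s3.enabled=true
s3.endpoint=http://minio:9000
s3.path-style-access=true
s3.region=us-east-1
s3.aws-access-key-id=admin
s3.aws-secret-access-key=password123
```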
modern-data-stack/
├── docker-compose.yml                 # Infrastructure as code
├── init-scripts/
│   ├── polaris/                       # Polaris setup scripts
│   │   ├── setup-polaris.sh           # Main setup (use this)
│   │   ├── setup-lakehouse-schemas.sh
│   │   ├── recreate-catalog.sh        # Quick rebuild
│   │   └── check-what-broke.sh        # Diagnostics
│   ├── postgres/                      # PostgreSQL init
│   └── mysql/                         # MySQL init
├── lakehouse-data/
│   └── user_event/
│       └── data-001.parquet           # Sample user events
├── trino/
│   ├── catalog/                       # Data source configs
│   │   ├── lakehouse.properties       # Polaris catalog
│   │   ├── postgres.properties        # Orders DB
│   │   └── mysql.properties           # Products DB
│   └── config/
│       └── config.properties          # Trino settings
├── dbt/
│   ├── dbt_project.yml
│   ├── profiles.yml                   # Trino connection
│   ├── models/
│   │   ├── staging/                   # Raw data models
│   │   ├── intermediate/              # Business logic
│   │   ├── marts/                     # Analytics-ready facts
│   │   └── semantic_models/           # v2.1: MetricFlow OSI definitions
│   │       ├── orders.yml             # Semantic model & metrics
│   │       ├── metricflow_time_spine.sql  # Calendar table
│   │       └── metricflow_time_spine.yml  # Time spine metadata
│   └── scripts/
│       └── validate-metricflow.sh     # v2.1: MetricFlow validation
├── cube/
│   └── model/
│       └── Orders.js                  # Semantic layer definitions
├── mcp-servers/                       # 🆕 AI Interface (Claude MCP)
│   └── trino/
│       └── server.py                  # Claude MCP server
├── streamlit-app/                     # 🆕 AI Interface (Multi-Provider)
│   ├── app.py                         # Streamlit application
│   ├── requirements.txt               # Python dependencies
│   ├── .env.example                   # API key template
│   └── README.md                      # Detailed documentation
├── POLARIS_TRINO_CONFIG.md            # Configuration notes
└── README.md
# View service status
docker compose ps
# View logs
docker compose logs -f polaris
docker compose logs -f trino
# Restart a service
docker compose restart trino
# Stop all services
docker compose down
# Stop and remove volumes (fresh start)
docker compose down -v

# Check if catalog exists
bash init-scripts/polaris/check-what-broke.sh
# Recreate catalog (if needed)
bash init-scripts/polaris/recreate-catalog.sh
# View Polaris credentials
docker compose logs polaris | grep "root principal credentials"
# Update Trino with new credentials (if needed)
CREDS=$(docker compose logs polaris | grep "root principal credentials" | tail -1 | sed 's/.*credentials: //')
sed -i.bak "s/iceberg.rest-catalog.oauth2.credential=.*/iceberg.rest-catalog.oauth2.credential=$CREDS/" trino/catalog/lakehouse.properties
docker compose restart trino

# Interactive Trino CLI
docker compose exec trino trino
# Example queries in CLI
SHOW CATALOGS;
SHOW SCHEMAS IN lakehouse;
SHOW TABLES IN lakehouse.dbt_marts;
# Exit: Ctrl+D or \q

# Run all models
docker compose exec dbt dbt run
# Run specific model
docker compose exec dbt dbt run --select fct_orders
# Test data quality
docker compose exec dbt dbt test
# Generate documentation
docker compose exec dbt dbt docs generate

# Upload additional Parquet files to MinIO
docker compose cp /path/to/file.parquet mc:/tmp/
docker compose exec mc mc cp /tmp/file.parquet myminio/raw-data/new-dataset/
# Create external table for new data
docker compose exec trino trino --execute "
CREATE TABLE lakehouse.raw_data.new_table (...)
WITH (format = 'PARQUET', external_location = 's3://raw-data/new-dataset/');"

# Check MCP server status
tail -f ~/Library/Logs/Claude/mcp-server-trino.log
# Test MCP server manually
cd mcp-servers/trino
python3 server.py
# Restart Claude Desktop to reload MCP servers
# Then ask Claude natural language questions about your data

This project includes two complementary AI interfaces for different use cases:
Natural language interface integrated directly into Claude Desktop for interactive data exploration.
Use case: Ad-hoc exploration, iterative analysis, conversational data discovery
Setup: See "Optional: Setup AI Interface (Claude MCP)" section above
Compare how different AI providers (Claude, Mistral AI, Ollama) generate SQL from natural language queries.
Location: `streamlit-app/` directory
Full documentation: `streamlit-app/README.md`
Use case: Evaluate AI providers, data sovereignty requirements, cost optimization
- 🇪🇺 European AI Sovereignty: Mistral AI (EU) + Ollama (on-premises) for GDPR compliance
- ⚖️ Three-Way Comparison: Claude API vs Mistral AI vs Local Ollama
- 💰 Cost Options: Free tier (Mistral) + local (Ollama) + paid (Claude)
- 📊 Performance Metrics: Track generation time, success rates, SQL quality
- 🎯 Stateless Design: Each query is independent for clean comparisons
cd streamlit-app
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env # Optional: add MISTRAL_API_KEY (free) or ANTHROPIC_API_KEY
brew install ollama && ollama serve &
ollama pull qwen2.5-coder:7b
streamlit run app.py  # Opens at http://localhost:8501

For detailed setup, usage, and troubleshooting, see streamlit-app/README.md.
Problem: Trino cannot see lakehouse catalog
# 1. Check Polaris is running
docker compose ps polaris
# 2. Verify catalog exists in Polaris
bash init-scripts/polaris/check-what-broke.sh
# 3. Check credentials in lakehouse.properties
cat trino/catalog/lakehouse.properties
# 4. Recreate catalog if needed
bash init-scripts/polaris/recreate-catalog.sh

Problem: Configuration property errors
Check lakehouse.properties has the correct format. See POLARIS_TRINO_CONFIG.md for details.
Critical property for MinIO:
fs.native-s3.enabled=true

Without this, you'll get "No factory for location: s3://..." errors.
Problem: Schema 'marts' does not exist
Cube.js must reference dbt_marts, not marts:
# docker-compose.yml
CUBEJS_DB_SCHEMA: dbt_marts   # Not just "marts"

// cube/model/Orders.js
sql: `SELECT * FROM lakehouse.dbt_marts.fct_orders`

Problem: Table does not exist errors
Ensure raw data is loaded:
# Check if user_events table exists
docker compose exec trino trino --execute "
SELECT COUNT(*) FROM lakehouse.raw_data.user_events;"
# If not, load the data (see "Load Sample Data" section)

Error: "At least one time spine must be configured"
# Build the time spine table
docker compose exec dbt dbt run --select metricflow_time_spine

Error: "The given input does not match any of the available group-by-items"
- Use entity-prefixed names: `order_id__supplier_country`, not `supplier_country`
- Check available dimensions:
docker compose exec dbt mf list metrics
Other errors related to metrics:
- Clean the environment
docker compose exec dbt dbt clean
- Force a full parse to rebuild the manifest
docker compose exec dbt dbt parse --no-partial-parse
- Validate MetricFlow configs against the warehouse
docker compose exec dbt mf validate-configs
Problem: Claude can't connect to MCP server
# Check logs
tail -50 ~/Library/Logs/Claude/mcp-server-trino.log
# Common issue: Wrong Python
# Install MCP in the Python Claude uses
/opt/homebrew/bin/python3 -m pip install mcp trino requests
# Test server manually
cd mcp-servers/trino
python3 server.py
# Should show: "Starting Trino MCP server..."
# Verify Trino is accessible
curl http://localhost:8080
# Restart Claude Desktop completely

When scaling beyond proof-of-concept:
- Starburst Galaxy (Trino)
  - Enterprise query optimization (Warp Speed)
  - Auto-scaling compute clusters
  - 24/7 support and SLAs
- dbt Cloud
  - Integrated development environment
  - Automated scheduling and orchestration
  - CI/CD pipelines
- Cube Cloud
  - Auto-scaling for query spikes
  - Built-in AI/BI interfaces
  - Enhanced caching
- Metabase Cloud
  - Automated backups and updates
  - Natural language queries
  - Alerting and monitoring
# Mix open-source and managed services
Storage: Self-hosted MinIO (data sovereignty)
Catalog: Self-hosted Polaris (control)
Compute: Starburst Galaxy (performance)
Transform: dbt Cloud (productivity)
Semantic: Cube Cloud (AI features)
BI: Metabase Cloud (reliability)
AI: Claude MCP (natural language)

- Apache Polaris Docs
- Apache Iceberg Docs
- Trino Documentation
- dbt Documentation
- dbt MetricFlow
- OSI Specification
- Cube.js Documentation
- Metabase Documentation
- Model Context Protocol (MCP)
- Building a Modern Lakehouse
- dbt Best Practices
- Iceberg Table Format
- Open Semantic Interchange (OSI)
- MCP: Connecting AI to Data
Contributions welcome! Areas for improvement:
- Add more dbt models (metrics layer, KPIs)
- Implement dbt tests and documentation
- Create Cube.js dashboards
- Add data quality checks
- Implement incremental loading
- Add more data sources
- Create streaming ingestion with Kafka
- Expand MCP capabilities (dbt generation, troubleshooting)
MIT License - see LICENSE file for details
- GitHub Issues: Report bugs or request features
- LinkedIn: Connect with me
#DataEngineering #ModernDataStack #OpenSource #ApachePolaris #ApacheIceberg #Trino #dbt #VendorAgnostic #DataLakehouse #DataSovereignty #AI #ClaudeMCP #MistralAI #NaturalLanguage #EuropeanAI
Built with ❤️ for the data community
Proving that vendor-agnostic, open-source data infrastructure is not just possible; it's practical. Now with dual AI interfaces for natural language exploration and multi-provider comparison.




