Demo Video on LinkedIn | Demo Video on Reddit
This repository contains my submission for the Databricks Hackathon. The project implements a complete, end-to-end data and machine learning pipeline built on the Databricks Data Intelligence Platform, designed for Databricks Free Edition (available since June 2024).
The pipeline automates the following processes:
- Data Ingestion: Sourcing data from a public FTP site and a REST API, landing it in Unity Catalog Volumes.
- Data Analytics: Running analytical queries using PySpark to derive insights from the ingested data.
- ML Pipeline: Training, registering, and serving a HuggingFace Transformer model using MLflow and Databricks Model Serving.
- Orchestration: Automating the entire pipeline using Databricks Workflows with serverless compute.
- Deployment: Packaging the entire project (notebooks, jobs, and endpoints) as a Databricks Asset Bundle (DAB) for reproducible, CI/CD-friendly deployment.
This project is broken down into the following parts, which are orchestrated by a Databricks Asset Bundle.
A SQL script (set_up_schema.sql) creates the necessary Unity Catalog structures:
- Catalog: main
- Schema: main.hackathon
- Volumes: main.hackathon.bls_data and main.hackathon.population_data
This runs as the first task in the workflow to ensure the infrastructure is ready.
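For reference, the setup boils down to a few idempotent DDL statements. A minimal sketch is shown below, expressed as spark.sql calls inside a Databricks notebook rather than the actual contents of set_up_schema.sql:

```python
# Sketch of the Unity Catalog objects created by set_up_schema.sql (illustrative only).
# In Free Edition the `main` catalog already exists, so only the schema and volumes
# typically need to be created. Run inside a Databricks notebook, where `spark` is predefined.
for statement in [
    "CREATE SCHEMA IF NOT EXISTS main.hackathon",
    "CREATE VOLUME IF NOT EXISTS main.hackathon.bls_data",
    "CREATE VOLUME IF NOT EXISTS main.hackathon.population_data",
]:
    spark.sql(statement)
```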
Two data sources are ingested and stored in Unity Catalog Volumes:
- BLS Time Series Data: A script fetches and syncs files from the BLS open dataset. It handles new/removed files and avoids duplicate uploads.
- US Population Data: A script calls the DataUSA API and saves the resulting JSON to a UC Volume.
Notebook: 01_data_ingestion.py
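A condensed sketch of the ingestion logic is shown below. The BLS directory URL, DataUSA endpoint, and file names are assumptions for illustration; 01_data_ingestion.py contains the actual implementation.

```python
# Illustrative sketch of 01_data_ingestion.py: fetch population JSON from the DataUSA API
# and keep the BLS volume in sync with the upstream file listing (idempotent).
import json
import os
import requests

HEADERS = {"User-Agent": "your.name@example.com"}  # BLS rejects requests without a User-Agent
BLS_BASE_URL = "https://download.bls.gov/pub/time.series/pr/"  # assumed BLS directory
DATAUSA_URL = "https://datausa.io/api/data?drilldowns=Nation&measures=Population"  # assumed endpoint
BLS_VOLUME = "/Volumes/main/hackathon/bls_data"
POPULATION_VOLUME = "/Volumes/main/hackathon/population_data"

# 1) US population: call the API and land the raw JSON in the UC Volume.
resp = requests.get(DATAUSA_URL, headers=HEADERS, timeout=60)
resp.raise_for_status()
with open(os.path.join(POPULATION_VOLUME, "population.json"), "w") as f:
    json.dump(resp.json(), f)

# 2) BLS sync: download only files that are missing locally and delete files that no
#    longer exist upstream. The file list here is a placeholder; in practice it is
#    parsed from the BLS directory listing.
remote_files = {"pr.data.0.Current", "pr.series"}
local_files = set(os.listdir(BLS_VOLUME))

for name in remote_files - local_files:  # new files -> download
    r = requests.get(BLS_BASE_URL + name, headers=HEADERS, timeout=120)
    r.raise_for_status()
    with open(os.path.join(BLS_VOLUME, name), "wb") as f:
        f.write(r.content)

for name in local_files - remote_files:  # stale files -> remove
    os.remove(os.path.join(BLS_VOLUME, name))
```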
An analysis notebook generates three reports:
- Population Statistics: The mean and standard deviation of the US population between 2013 and 2018.
- Best Year Report: For each series_id in the BLS data, finds the year with the maximum sum of values across all quarters.
- Joined Report: A combined report showing the value for series_id = PRS30006032 (period Q01) joined with the population for that same year.
Notebook: 02_data_analytics.py
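For illustration, the Best Year aggregation can be sketched in PySpark as follows; the file name and column names (series_id, year, value) are assumptions that should match the parsed BLS data.

```python
# Sketch of the "Best Year" report in 02_data_analytics.py (illustrative only).
# Assumes the BLS data file is tab-separated with series_id, year, period, value columns.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

bls_df = (spark.read
          .option("sep", "\t").option("header", True)
          .csv("/Volumes/main/hackathon/bls_data/pr.data.0.Current"))

# Sum values across all quarters for each (series_id, year).
yearly = (bls_df
          .groupBy("series_id", "year")
          .agg(F.sum(F.col("value").cast("double")).alias("total_value")))

# For each series_id, keep the year with the highest total.
w = Window.partitionBy("series_id").orderBy(F.col("total_value").desc())
best_year = (yearly
             .withColumn("rn", F.row_number().over(w))
             .filter(F.col("rn") == 1)
             .drop("rn"))
best_year.show()
```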
A Databricks Workflow with serverless compute automates the data pipeline with the following tasks:
- Task 0 (Setup): Runs the SQL script to create schema and volumes.
- Task 1 (Ingestion): Runs the notebook to ingest data from BLS and DataUSA API.
- Task 2 (Analysis): Runs the analytics notebook after ingestion completes.
- Task 3 (ML Training): Trains and registers the sentiment analysis model.
All tasks run on serverless compute, eliminating the need to manage clusters.
A full MLOps lifecycle is implemented for a Sentiment Analysis model:
- Train: A notebook (03_model_training.py) loads a pretrained HuggingFace Transformer model.
- Log: The model is logged to MLflow with experiment tracking.
- Register: The model is registered in the Unity Catalog Model Registry as main.hackathon.sentiment_analysis_model.
- Serve: A Databricks Model Serving Endpoint is created to serve the registered model, providing a REST API for real-time inference.
Notebook: 03_model_training.py
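At its core, the MLflow flow looks roughly like the sketch below; the specific checkpoint and logging arguments are assumptions, and the notebook is the source of truth.

```python
# Sketch of 03_model_training.py: load a pretrained HuggingFace pipeline, log it with
# MLflow, and register it in the Unity Catalog Model Registry (illustrative only).
import mlflow
from transformers import pipeline

mlflow.set_registry_uri("databricks-uc")  # register models into Unity Catalog

sentiment = pipeline("sentiment-analysis")  # default pretrained checkpoint (assumed)

with mlflow.start_run():
    mlflow.transformers.log_model(
        transformers_model=sentiment,
        artifact_path="model",
        input_example=["Databricks is awesome!"],
        registered_model_name="main.hackathon.sentiment_analysis_model",
    )
```

The registered version (e.g., version 1) is what the sentiment-analysis serving endpoint defined in databricks.yml points to.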
The entire project is defined within a databricks.yml file. This allows for one-command deployment of all project resources, including:
- The Databricks Workflow (Job) with serverless compute
- The Model Serving Endpoint
- All associated notebooks and SQL scripts
- Library dependencies
This project is packaged as a Databricks Asset Bundle and is optimized for Databricks Free Edition with serverless compute.
- A Databricks Free Edition account (free tier available since June 2024)
- Databricks CLI installed (v0.205.0 or higher recommended)
- Python 3.8+ installed locally
- Install the Databricks CLI:
  pip install databricks-cli
  Note: the databricks-cli package on PyPI is the legacy CLI. The databricks bundle commands used below need the newer CLI (v0.205.0 or higher), which Databricks distributes separately (e.g., via Homebrew or its official install script).
- Configure the Databricks CLI:
  databricks configure --token
You'll be prompted to enter:
  - Host: Your workspace URL (e.g., https://dbc-abc123-def.cloud.databricks.com)
  - Token: Generate a personal access token from User Settings > Developer > Access Tokens
- Clone this repository:
  git clone [YOUR_REPO_URL]
  cd databricks-hackathon
- Create the src directory structure:
  mkdir -p src
- Move the notebooks and SQL file to the src folder:
  mv 01_data_ingestion.py src/
  mv 02_data_analytics.py src/
  mv 03_model_training.py src/
  mv set_up_schema.sql src/
- Validate the bundle configuration:
  databricks bundle validate
- Deploy the bundle:
  databricks bundle deploy -t dev
This will:
- Upload all notebooks to your workspace
- Create the workflow job with serverless compute
- Set up task dependencies
- Create the model serving endpoint configuration
- Run the complete pipeline:
  databricks bundle run databricks-hackathon-job -t dev
- Monitor the job:
  - Go to your Databricks workspace
  - Navigate to Workflows in the left sidebar
  - Find [Hackathon] End-to-End Pipeline
  - Click on it to view task progress and outputs
  - All tasks run on serverless compute - no cluster management needed!
- Check the Model Serving Endpoint:
  - Navigate to Serving in the left sidebar
  - Find the sentiment-analysis endpoint
  - Wait for it to be in the "Ready" state (may take a few minutes)
  - Click on it to view the endpoint details and get the API URL
After the pipeline completes:
- Open the 02_data_analytics.py notebook output in the workflow run
- View the three analytical reports:
  - Population statistics (2013-2018)
  - Best year by series ID
  - Joined BLS and population data
- Navigate to Machine Learning > Models in the left sidebar
- Find main.hackathon.sentiment_analysis_model
- View model versions, metadata, and lineage
Test via UI:
- Go to Serving > sentiment-analysis
- Click on the Query endpoint tab
- Test with sample input:
{ "inputs": ["Databricks is awesome!", "This project is challenging."] }
Test via API:
# Get your endpoint URL from the Serving UI
curl -X POST \
https://[YOUR-WORKSPACE].cloud.databricks.com/serving-endpoints/sentiment-analysis/invocations \
-H "Authorization: Bearer [YOUR-TOKEN]" \
-H "Content-Type: application/json" \
-d '{
"inputs": ["Databricks makes data engineering easy!"]
}'

Test via Python:
import requests
import os
# Set these from your workspace
DATABRICKS_HOST = "https://[YOUR-WORKSPACE].cloud.databricks.com"
DATABRICKS_TOKEN = "[YOUR-TOKEN]"
endpoint_url = f"{DATABRICKS_HOST}/serving-endpoints/sentiment-analysis/invocations"
headers = {
"Authorization": f"Bearer {DATABRICKS_TOKEN}",
"Content-Type": "application/json"
}
data = {
"inputs": ["I love using Databricks!", "This is frustrating."]
}
response = requests.post(endpoint_url, headers=headers, json=data)
print(response.json())

databricks-hackathon/
├── databricks.yml               # DAB configuration file
├── README.md                    # This file
└── src/                         # Source code directory
    ├── set_up_schema.sql        # Schema and volume creation
    ├── 01_data_ingestion.py     # Data ingestion from BLS and API
    ├── 02_data_analytics.py     # Data analysis and reporting
    └── 03_model_training.py     # ML model training and registration
Issue: Bundle deployment fails with authentication error
Solution: Ensure your Databricks CLI is configured correctly with databricks configure --token. Verify your token is valid and hasn't expired.
Issue: SQL task fails to find set_up_schema.sql
Solution: Ensure the file is in the src/ folder and the artifact_path in databricks.yml is set to src.
Issue: Model serving endpoint stays in "Not Ready" state
Solution: This can take 5-10 minutes on first deployment. Check the endpoint logs for any errors. Ensure the model version "1" exists in the registry.
Issue: "User-Agent required" error when downloading BLS data
Solution: The notebooks already include User-Agent headers. If you still see this error, update the email in the User-Agent string to your own in 01_data_ingestion.py.
Issue: Volume creation fails
Solution: Ensure you have the necessary permissions in your workspace. The main catalog should be available by default in Free Edition.
Issue: Notebook runs out of memory
Solution: Serverless compute automatically scales, but if you encounter memory issues, you can optimize the data processing by filtering or sampling in the notebooks.
- Serverless Compute: All jobs run on serverless compute - no need to create or manage clusters!
- Model Serving: Fully supported with scale-to-zero enabled to optimize costs.
- Unity Catalog: Full Unity Catalog support for data governance and model registry.
- Usage Limits: Free Edition has usage limits. Monitor your usage in the workspace settings.
- Auto-scaling: Serverless compute automatically scales based on workload.
- Idempotent Data Ingestion: The BLS sync only downloads new files and removes stale ones.
- Serverless-First Design: No cluster management required - everything runs on serverless compute.
- End-to-End MLOps: From data ingestion to model serving, all automated.
- Asset Bundle: One-command deployment of the entire project.
- Production-Ready: Uses Unity Catalog for governance and includes proper error handling.
- Real-Time Inference: Model serving endpoint provides REST API for predictions.
- Scheduled Runs: Add a schedule to the workflow to run daily/weekly
- Data Validation: Add data quality checks using Great Expectations
- Model Monitoring: Track model performance and data drift
- CI/CD Integration: Add GitHub Actions workflow for automated deployment
- Alerting: Set up email/Slack notifications for job failures
- Additional Models: Train custom models on the BLS economic data
- Dashboard: Create a Databricks SQL dashboard to visualize insights
- Databricks Free Edition
- Databricks Asset Bundles Documentation
- Serverless Compute Documentation
- Model Serving Documentation
- Unity Catalog Documentation
For questions or feedback about this project:
- Email: zhou.wu@lwtech.edu
- GitHub: https://github.com/zwu-net
License: MIT License
Acknowledgments:
- Bureau of Labor Statistics for the open data
- DataUSA for the population API
- HuggingFace for the pretrained sentiment analysis model
- Databricks for the Free Edition platform and hackathon opportunity