# 🏋️‍♂️ Fitness Lakehouse Project – End-to-End Data & ML Pipeline on `Databricks`

## 📌 Project Summary

This project demonstrates the design and implementation of a full-stack Lakehouse pipeline on Databricks. It simulates real-time ingestion of fitness tracking data, processes it with a streaming Delta Live Tables (DLT) pipeline, visualizes the results in Databricks SQL (DBSQL), and trains a calorie-burn prediction model tracked with MLflow.

The solution is production-ready and orchestrated using a Databricks Job, showcasing structured streaming, automated data quality checks, and model deployment.

## 🎯 Project Goals

- Simulate real-time ingestion of fitness data using structured streaming and Autoloader
- Create a Bronze → Silver → Gold pipeline using Delta Live Tables (DLT)
- Build a dashboard with Databricks SQL for activity insights
- Train and deploy a calorie-burn prediction model using MLflow
- Demonstrate pipeline orchestration using Databricks Workflows (Jobs)

## 🛠️ Technologies & Features Used

- *Databricks Autoloader* – To ingest new files into the Bronze table using structured streaming
- *Delta Live Tables (DLT)* – For building robust, streaming data pipelines with quality constraints
- *Unity Catalog* – For unified governance and managed storage of all Delta tables
- *DBSQL Dashboards* – To visualize fitness patterns, activity levels, and calorie expenditure
- *MLflow* – To track, evaluate, and register ML models predicting total calories burned
- *Databricks Workflows* – To orchestrate a 3-task pipeline job from ingestion to dashboard refresh

## 🗂️ Folder & Notebook Structure

The project notebooks live under the main folder `fitness-lakehouse`, in subdirectories that mirror each stage of development and production:

#### 📂 dev/
Used for exploration and testing before implementing with DLT pipelines.

- **01_ingest_bronze**: Manually ingests the raw dataset and writes to a Delta table.
- **02_transform_silver**: Cleans the data and creates the `activity_level` feature.
- **03_aggregate_gold**: First version of gold table aggregation logic.
- **04_autoloader_bronze**: Sets up Autoloader to read streaming data from the S3 landing path.
- **utils**: Used for ad hoc exploration and testing SQL logic during development.

#### 📂 jobs/
Notebooks used in the production *Databricks Job pipeline*, executed in this order:

- **00_simulate_landing_data**: Adds one new row of activity data to the landing zone per run (simulated ingestion).
- **01_dlt_pipeline_fitness**: Contains the DLT logic for bronze, silver, and gold table creation.
- **02_refresh_dashboard_data**: Refreshes materialized views used by DBSQL dashboards.

#### 📂 ml/
Machine learning notebooks for calorie burn prediction.

- **train_calories_model**: Trains a RandomForestRegressor to predict daily calories burned based on activity patterns. Logs metrics and model to MLflow.
- **predict_calories_from_model**: Loads the trained model and simulates calorie predictions on new user input.

## 🚀 How to run the project

---

#### 1. Initial Dataset Preparation
- *Source Dataset:* The project uses the *"dailyActivity_merged.csv"* file from the [FitBit Fitness Tracker Kaggle dataset](https://www.kaggle.com/datasets/arashnic/fitbit).
- *Subset Used:* Only the dailyActivity_merged.csv file covering **April–May 2016** was used.
- *Upload to S3:* The dataset was manually uploaded to the landing path (`landing/bronze/daily_activity_stream/`), simulating streaming ingestion.
    
---
#### 2. Upload Backup Dataset to Hive Metastore
- The secondary file, dailyActivity_merged.csv from **March–April 2016**, must be uploaded as a *Hive Metastore table* named `default.daily_activity_merged_march_april`.
- This table is used by the *simulate_landing_data* notebook to land one new row of data per job run in a controlled, streaming-like manner.

---

#### 3. Create and run the Job Pipeline
The Databricks Job contains *three tasks* chained in order:

##### Task 1: simulate_landing_data
- Simulates a new row of data being streamed daily into the *Autoloader landing zone*.
- Pulls the next row from a backup table (dailyActivity_merged.csv from **March–April 2016**) and writes it as a new CSV file into the landing S3 path.
- Updates a *control Delta table* (`fitness_dlt.ingestion_control`) to track the next row to insert.
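
The "one row per run" logic above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: a list of dicts stands in for the backup table, a local temp directory stands in for the S3 landing path, and an integer stands in for the control Delta table. The function name `land_next_row`, the file naming scheme, and the sample rows are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def land_next_row(backup_rows, next_index, landing_dir):
    """Write backup_rows[next_index] as a one-row CSV file and return the advanced index.

    Writes nothing and returns the index unchanged once the backup is exhausted.
    """
    if next_index >= len(backup_rows):
        return next_index  # nothing left to simulate
    row = backup_rows[next_index]
    landing_dir = Path(landing_dir)
    landing_dir.mkdir(parents=True, exist_ok=True)
    # One file per run, so a file-based streaming reader sees a new arrival.
    out_file = landing_dir / f"daily_activity_{next_index:05d}.csv"
    with out_file.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(row.keys()))
        writer.writeheader()
        writer.writerow(row)
    return next_index + 1


# Two made-up rows standing in for the backup table.
backup = [
    {"Id": "1", "ActivityDate": "3/12/2016", "TotalSteps": "1000", "Calories": "1800"},
    {"Id": "1", "ActivityDate": "3/13/2016", "TotalSteps": "2000", "Calories": "1900"},
]
landing = Path(tempfile.mkdtemp()) / "daily_activity_stream"
next_index = 0  # stands in for the value stored in the control table
next_index = land_next_row(backup, next_index, landing)  # first job run
next_index = land_next_row(backup, next_index, landing)  # second job run
landed_files = sorted(p.name for p in landing.iterdir())
```

Each call corresponds to one job run. Persisting the returned index (in the project, in the control table) is what keeps consecutive runs from landing the same row twice.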

##### Task 2: run_fitness_dlt_pipeline
- Runs a *Delta Live Tables (DLT)* pipeline with three layers:
  1. *Bronze:* Uses Autoloader to ingest all new CSV files in the landing path.
  2. *Silver:* Cleans, parses, and engineers features (e.g., adds activity_level).
  3. *Gold:* Generates two optimized gold tables:
     - gold_daily_activity_dashboard: for dashboards (aggregated by day).
     - gold_daily_activity_ml: for ML training (aggregated by user & activity level).

All tables are *managed tables* stored under the *Unity Catalog* S3 path.
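
A three-layer pipeline like the one described above might look roughly like the sketch below. It is not the project's actual notebook. The bronze and silver table names, the `valid_steps` expectation, the step thresholds behind `activity_level`, and the aggregated columns are all assumptions for illustration; only the two gold table names come from this README. The `dlt` module is importable only inside a Delta Live Tables pipeline, so the imports are kept inside the function.

```python
def define_fitness_pipeline(spark, landing_path):
    """Register bronze, silver, and gold tables; call once from a DLT notebook."""
    import dlt  # only available inside a Delta Live Tables pipeline
    from pyspark.sql import functions as F

    @dlt.table(name="bronze_daily_activity", comment="Raw CSV files picked up by Auto Loader")
    def bronze_daily_activity():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .option("header", "true")
            .load(landing_path)
        )

    @dlt.table(name="silver_daily_activity", comment="Typed rows with an activity_level feature")
    @dlt.expect_or_drop("valid_steps", "TotalSteps >= 0")  # hypothetical quality rule
    def silver_daily_activity():
        steps = F.col("TotalSteps").cast("int")
        return (
            dlt.read_stream("bronze_daily_activity")
            .withColumn("TotalSteps", steps)
            .withColumn("Calories", F.col("Calories").cast("int"))
            .withColumn("ActivityDate", F.to_date("ActivityDate", "M/d/yyyy"))
            # Hypothetical thresholds; the README does not define activity_level.
            .withColumn(
                "activity_level",
                F.when(steps >= 10000, "high").when(steps >= 5000, "medium").otherwise("low"),
            )
        )

    @dlt.table(name="gold_daily_activity_dashboard", comment="Daily totals for dashboards")
    def gold_daily_activity_dashboard():
        return (
            dlt.read("silver_daily_activity")
            .groupBy("ActivityDate")
            .agg(F.sum("TotalSteps").alias("total_steps"), F.sum("Calories").alias("total_calories"))
        )

    @dlt.table(name="gold_daily_activity_ml", comment="Per-user, per-level averages for ML")
    def gold_daily_activity_ml():
        return (
            dlt.read("silver_daily_activity")
            .groupBy("Id", "activity_level")
            .agg(F.avg("TotalSteps").alias("avg_steps"), F.avg("Calories").alias("avg_calories"))
        )
```

In a pipeline notebook, this would be invoked once at the top level, for example `define_fitness_pipeline(spark, "<landing path>")`, so that the decorators register the four tables with the pipeline.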

##### Task 3: refresh_dashboard_data
- Refreshes the materialized view used by the DBSQL dashboard (gold_daily_activity_dashboard).
- Ensures the dashboard reflects the latest data ingested by the pipeline.
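
The three-task chain above could be expressed as a Jobs API style payload, shown here as a Python dict. This is a sketch under assumptions: the job name, the `<workspace-path>` and `<dlt-pipeline-id>` placeholders are hypothetical and must be replaced with real values, and compute settings are omitted. The small `execution_order` helper is only there to show how `depends_on` determines the run order.

```python
job_spec = {
    "name": "fitness_lakehouse_job",  # hypothetical job name
    "tasks": [
        {
            "task_key": "simulate_landing_data",
            "notebook_task": {
                "notebook_path": "<workspace-path>/fitness-lakehouse/jobs/00_simulate_landing_data"
            },
        },
        {
            "task_key": "run_fitness_dlt_pipeline",
            "depends_on": [{"task_key": "simulate_landing_data"}],
            "pipeline_task": {"pipeline_id": "<dlt-pipeline-id>"},
        },
        {
            "task_key": "refresh_dashboard_data",
            "depends_on": [{"task_key": "run_fitness_dlt_pipeline"}],
            "notebook_task": {
                "notebook_path": "<workspace-path>/fitness-lakehouse/jobs/02_refresh_dashboard_data"
            },
        },
    ],
}


def execution_order(spec):
    """Return task keys in an order that respects every depends_on entry."""
    remaining = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in spec["tasks"]
    }
    order = []
    while remaining:
        # A task is ready once all of its upstream tasks have been scheduled.
        ready = sorted(key for key, deps in remaining.items() if deps <= set(order))
        if not ready:
            raise ValueError("dependency cycle or unknown task in depends_on")
        order.extend(ready)
        for key in ready:
            del remaining[key]
    return order


task_order = execution_order(job_spec)
```

Because each task depends only on the one before it, a failure in the simulation or pipeline step stops the dashboard refresh from running against stale data.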

---

#### 📊 4. Dashboard & ML Integration
- A *DBSQL dashboard* titled Fitness Activity Summary visualizes KPIs like steps, calories, and activity-level breakdowns.
- An ML model is trained on the gold_daily_activity_ml table using *RandomForestRegressor* and tracked via *MLflow*.
- The trained model is *registered in MLflow Model Registry* and can be used to predict calories from test data via a separate prediction notebook.
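
The train, track, and register flow above can be sketched as follows. This is a minimal illustration, not the code in `train_calories_model`: the function name, the 80/20 split, and the hyperparameters are assumptions, and the project's tuned settings are not reproduced here. The imports sit inside the function so the snippet can be pasted anywhere and only requires scikit-learn and MLflow when it is actually called.

```python
def train_and_register(features, target, model_name):
    """Fit a random forest, log parameters and metrics to MLflow, and register the model."""
    import mlflow
    import mlflow.sklearn
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    # Hold out 20% of the rows for evaluation (assumed split, not from the project).
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.2, random_state=42
    )
    n_estimators = 200  # hypothetical value

    with mlflow.start_run():
        model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)

        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_metric("rmse", mean_squared_error(y_test, predictions) ** 0.5)
        mlflow.log_metric("r2", r2_score(y_test, predictions))
        # Logging with registered_model_name also creates a new registry version.
        mlflow.sklearn.log_model(
            model, artifact_path="model", registered_model_name=model_name
        )
    return model
```

A prediction notebook could then load a registered version with `mlflow.pyfunc.load_model("models:/<model-name>/<version>")` and call `.predict()` on new input rows.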

## 🖼️ Sample Outputs

### 📊 Final DBSQL Dashboard:
- Steps & Calories Over Time
- Calories vs Steps Correlation
- Time Spent in Activity Modes
- Weekly Steps Activity Heatmap
- Activity Level Distribution

![Dashboard Screenshot](/Workspace/Users/walidelkhatib@hotmail.com/fitness-lakehouse/README_assets/dbsql_dashboard_summary.png)

### 🤖 ML Model Summary:
- Model type: *Tuned RandomForestRegressor*
- R² Score: ~0.45 on test data
- RMSE: ~450
- Tracked using MLflow and registered to Unity Catalog
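
For readers less familiar with the two metrics above, here is a small worked sketch of how RMSE and R² are computed. The numbers are made up for illustration and are not taken from the project's test set.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: typical prediction error in the target's units."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r2(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up daily calorie values.
actual = [1800, 2200, 2600, 3000]
predicted = [2000, 2000, 2800, 2800]

example_rmse = rmse(actual, predicted)  # every prediction is off by 200 calories
example_r2 = r2(actual, predicted)      # 1 - 160000 / 800000
```

Read this way, an RMSE of about 450 means predictions are typically off by a few hundred calories per day, and an R² of about 0.45 means the model explains a bit less than half of the variance in calories burned.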

![MLflow Experiment Screenshot](/Workspace/Users/walidelkhatib@hotmail.com/fitness-lakehouse/README_assets/mlflow_experiment_overview.png)
![MLflow Model Registry Screenshot](/Workspace/Users/walidelkhatib@hotmail.com/fitness-lakehouse/README_assets/mlflow_model_registry.png)