
Dynamic ETL Workflow with Azure Databricks, Delta Lake and ADF

Project Overview

This project builds a real-world data engineering solution using Azure Databricks, PySpark, Azure Data Lake Storage Gen2, Delta Lake, and Azure Data Factory. It involves analyzing and reporting on Formula 1 motor racing data.

(Project architecture diagram)


Data Source

The data for all Formula 1 races from 1950 onwards is obtained from the open-source Ergast Developer API (https://ergast.com/mrd/). The API provides several tables, and the attributes of each table are described in detail in the F1DB User Guide (https://ergast.com/docs/f1db_user_guide.txt). Each table is provided as a single-line or multi-line CSV or JSON file in the data folder.

The data covers various aspects such as race results, circuits, drivers, constructors, pit stops, lap times, etc. The data source files are: Circuits (CSV), Races (CSV), Constructors (JSON), Drivers (JSON), Results (JSON), Pit Stops (JSON), Lap Times (split CSV files), and Qualifying (split JSON files).


Architecture Explanation & Project Working

The project utilizes Azure Databricks, Delta Lake, and Azure Data Factory. It involves:

  • Connecting Databricks to Azure Data Lake Storage using a Service Principal and Key Vault secrets (see the sketch below)
  • Implementing incremental load using Delta Lake's upsert (merge) capability
  • Designing Ingest, Transform, and Process pipelines
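
As a rough sketch of the first point, the notebooks authenticate to ADLS Gen2 with a service principal whose credentials are read from a Key Vault-backed secret scope. The secret scope name, key names, and storage account name below are hypothetical placeholders.

```python
# Minimal sketch: OAuth access to ADLS Gen2 via a service principal.
# Scope, key names, and storage account are hypothetical -- substitute your own.
client_id     = dbutils.secrets.get(scope="formula1-scope", key="client-id")
tenant_id     = dbutils.secrets.get(scope="formula1-scope", key="tenant-id")
client_secret = dbutils.secrets.get(scope="formula1-scope", key="client-secret")

account = "formula1dl"  # hypothetical storage account name
configs = {
    f"fs.azure.account.auth.type.{account}.dfs.core.windows.net": "OAuth",
    f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net": client_id,
    f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net": client_secret,
    f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net":
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}
for key, value in configs.items():
    spark.conf.set(key, value)

# The raw container can now be read directly via its abfss:// URI.
df = spark.read.csv(f"abfss://raw@{account}.dfs.core.windows.net/circuits.csv", header=True)
```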

Resources


Data Processing Pipeline Overview

The data obtained from the Ergast Developer API is imported into a raw Azure Data Lake Storage (ADLS) container on Azure. Subsequently, this raw data is processed using Databricks notebooks to ingest it into the ingested raw layer.

Ingested Raw Layer

In this layer, a schema is applied to the data and it is stored in Delta format. Where applicable, partitions are created, and additional information, such as ingestion date and data source, is added for audit purposes.
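
A hedged sketch of one such ingestion notebook is shown below, using the circuits file as an example. The schema fields, storage paths, database name, and partition column are assumptions for illustration; the f1_processed database is assumed to exist.

```python
from pyspark.sql.functions import current_timestamp, lit
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

raw_path = "abfss://raw@formula1dl.dfs.core.windows.net"  # hypothetical raw container

# Explicit schema applied to the raw CSV (columns follow the Ergast circuits table).
circuits_schema = StructType([
    StructField("circuitId", IntegerType(), False),
    StructField("circuitRef", StringType(), True),
    StructField("name", StringType(), True),
    StructField("location", StringType(), True),
    StructField("country", StringType(), True),
    StructField("lat", DoubleType(), True),
    StructField("lng", DoubleType(), True),
])

raw_df = spark.read.schema(circuits_schema).option("header", True).csv(f"{raw_path}/circuits.csv")

# Rename columns for consistency and add audit columns: ingestion date and data source.
ingested_df = (raw_df
               .withColumnRenamed("circuitId", "circuit_id")
               .withColumnRenamed("circuitRef", "circuit_ref")
               .withColumn("ingestion_date", current_timestamp())
               .withColumn("data_source", lit("Ergast API")))

# Stored in Delta format; the partition column here is purely illustrative.
(ingested_df.write
 .mode("overwrite")
 .format("delta")
 .partitionBy("country")
 .saveAsTable("f1_processed.circuits"))
```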

Transformation for Presentation Layer

The ingested data is then transformed via Databricks notebooks for the presentation layer. In this layer, dashboards are created to meet the requirements for analysis.

Azure Data Factory Integration

Azure Data Factory is utilized for scheduling and monitoring requirements, orchestrating the entire data processing pipeline.

Delta Lakehouse Architecture

To address additional requirements such as GDPR compliance, time travel, and data versioning, the pipeline is later converted into a Delta Lakehouse architecture.
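
The snippet below is a small sketch of what those Delta Lake capabilities look like in practice; the table names, path, version number, timestamp, and key value are hypothetical placeholders.

```python
# Inspect the commit history of an ingested Delta table.
spark.sql("DESCRIBE HISTORY f1_processed.results")

# Time travel: read an earlier snapshot by version number or by timestamp.
v0_df = (spark.read
         .format("delta")
         .option("versionAsOf", 0)
         .load("abfss://processed@formula1dl.dfs.core.windows.net/results"))
spark.sql("SELECT * FROM f1_processed.results TIMESTAMP AS OF '2021-03-28'")

# GDPR-style delete: the record disappears from the current version immediately;
# VACUUM later purges it from older versions once the retention window passes.
spark.sql("DELETE FROM f1_processed.drivers WHERE driver_id = 42")
spark.sql("VACUUM f1_processed.drivers RETAIN 168 HOURS")
```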

This robust pipeline ensures efficient data processing, from ingestion to presentation, while meeting various compliance and analytical needs.

Incremental Load using Delta Lake

In Azure Databricks, a scalable solution for incremental and full load file processing is implemented using Delta Lake's upsert and merge capabilities. This approach leverages versioning and time-travel features inherent to Delta Lake, ensuring efficient data management and integrity across multiple processing cycles.
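
A minimal sketch of that upsert logic is shown below. The incremental folder name, table name, and join key are hypothetical placeholders; on the first run the table does not yet exist, so the batch is written as a full load instead.

```python
from delta.tables import DeltaTable

# Newly arrived batch for one file date (hypothetical dated folder in the raw container).
incoming_df = spark.read.json("abfss://raw@formula1dl.dfs.core.windows.net/2021-03-28/results.json")

if spark.catalog.tableExists("f1_processed.results"):
    # Incremental load: upsert the new batch into the existing Delta table.
    target = DeltaTable.forName(spark, "f1_processed.results")
    (target.alias("tgt")
     .merge(incoming_df.alias("src"), "tgt.resultId = src.resultId")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())
else:
    # Full load: the first run simply creates the table from the batch.
    incoming_df.write.format("delta").saveAsTable("f1_processed.results")
```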

(Screenshots: scalable code for incremental load and full load files)


Ingest Pipeline

The Ingest Pipeline in Databricks utilizes SQL to create tables with provided schemas for ingesting raw data. It handles various file formats including single-line JSON, CSV, and multi-line JSON files. Additionally, it formats column names and dates to ensure data consistency and usability.
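
The sketch below shows the general idea of creating raw tables with SQL over files in the raw container; the database name, paths, and schemas are assumptions, and the multiLine option handles the multi-line JSON sources such as pit stops.

```python
# Create a raw database and define external tables over the source files.
spark.sql("CREATE DATABASE IF NOT EXISTS f1_raw")

# Single-line CSV source with an explicit schema.
spark.sql("""
    CREATE TABLE IF NOT EXISTS f1_raw.circuits (
        circuitId INT, circuitRef STRING, name STRING,
        location STRING, country STRING, lat DOUBLE, lng DOUBLE
    )
    USING csv
    OPTIONS (path "abfss://raw@formula1dl.dfs.core.windows.net/circuits.csv", header "true")
""")

# Multi-line JSON source.
spark.sql("""
    CREATE TABLE IF NOT EXISTS f1_raw.pit_stops (
        raceId INT, driverId INT, stop INT, lap INT,
        time STRING, duration STRING, milliseconds INT
    )
    USING json
    OPTIONS (path "abfss://raw@formula1dl.dfs.core.windows.net/pit_stops.json", multiLine "true")
""")
```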

(Ingestion pipeline screenshot)


Transform Pipeline

The ingested data is transformed into a presentation format using PySpark in Azure Databricks, where it is structured and prepared for visualization. This process involves shaping the data into a format suitable for dashboard creation. Automation is enabled through Azure Data Factory (ADF), which schedules the transformation pipeline for seamless dashboard generation and analysis.
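
As a hedged example of such a transformation, the sketch below joins the ingested results, races, and drivers tables and aggregates them into driver standings for the presentation layer. Table and column names are assumptions based on the ingested layer described above, and the f1_presentation database is assumed to exist.

```python
from pyspark.sql.functions import col, count, when, sum as _sum

# Join the ingested tables into a race-results view.
race_results_df = (spark.table("f1_processed.results").alias("res")
                   .join(spark.table("f1_processed.races").alias("rac"),
                         col("res.race_id") == col("rac.race_id"))
                   .join(spark.table("f1_processed.drivers").alias("drv"),
                         col("res.driver_id") == col("drv.driver_id"))
                   .select("rac.race_year", "drv.driver_name", "drv.driver_nationality",
                           "res.points", "res.position"))

# Aggregate into driver standings: total points and race wins per season.
driver_standings_df = (race_results_df
                       .groupBy("race_year", "driver_name", "driver_nationality")
                       .agg(_sum("points").alias("total_points"),
                            count(when(col("position") == 1, True)).alias("wins")))

# Written to the presentation layer in Delta format for dashboarding.
(driver_standings_df.write
 .mode("overwrite")
 .format("delta")
 .saveAsTable("f1_presentation.driver_standings"))
```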

(Transform pipeline screenshot)


Process Pipeline

The process pipeline is responsible for orchestrating the execution of both the Ingest Pipeline and Transform Pipeline, ensuring proper dependency management between them. This ensures that data is ingested and transformed in a coordinated manner, leading to accurate and reliable analysis results.
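
The project handles this orchestration with ADF Execute Pipeline activities; the snippet below is only an illustrative Databricks-side equivalent of the same dependency, running the transform notebooks only after ingestion succeeds. Notebook paths, the parameter name, and the "Success" exit value (returned via dbutils.notebook.exit) are hypothetical.

```python
# Hypothetical notebook paths for the two stages of the pipeline.
ingest_notebooks = ["./ingest/1_ingest_circuits", "./ingest/2_ingest_races"]
transform_notebooks = ["./transform/1_race_results", "./transform/2_driver_standings"]

# Run ingestion first; stop if any ingestion notebook does not report success.
for nb in ingest_notebooks:
    result = dbutils.notebook.run(nb, timeout_seconds=600,
                                  arguments={"p_file_date": "2021-03-28"})
    assert result == "Success", f"Ingestion notebook {nb} failed"

# Only then run the transformation notebooks.
for nb in transform_notebooks:
    dbutils.notebook.run(nb, timeout_seconds=600,
                         arguments={"p_file_date": "2021-03-28"})
```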

(Process pipeline screenshot)


Connecting the Databricks Cluster to Power BI for Data Visualization

(Databricks-to-Power BI connection screenshot)

Results Visualisations

  • Drivers by Nationality
  • Team Wins
  • Drivers Wins

