ETL motor racing data project using Azure Databricks, PySpark, and Azure Data Lake Storage

randyroac/azure-databricks-etl-project

azure-databricks-etl-project

ETL motor racing data

The data is from the Ergast website.

The data is available through an API, as downloadable CSVs, and as nested or non-nested JSON files. Azure Databricks on top of Apache Spark, Azure Notebook, and Azure Data Lake Storage are the main tools for this ETL project.

In this project, I focused on extracting from the CSV and JSON files for my ETL. This can be done on the free Azure trial from Microsoft.

Here is a quick diagram of the high-level plan.

etl_motor_racing_1

Quick overview of my ETL processes

Purple blocks show columns that were renamed and/or transformed.
Red blocks show columns that were dropped.
Green blocks show columns that were added.

etl_motor_racing_2

etl_motor_racing_3
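The rename, drop, and add steps shown in the diagrams can be sketched roughly as below. This is a minimal illustration, not the notebooks' actual code: the helper name `apply_column_plan` and the column names (`circuitId`, `url`, `ingestion_date`) are assumptions chosen for the example. The only Spark calls used are `DataFrame.withColumnRenamed`, `DataFrame.drop`, and `DataFrame.withColumn`.

```python
def apply_column_plan(df, renames, drops, additions):
    """Apply a rename / drop / add plan to a Spark DataFrame.

    renames:   {old_name: new_name}          (purple blocks)
    drops:     [column_name, ...]            (red blocks)
    additions: {new_name: Column expression} (green blocks)
    """
    for old_name, new_name in renames.items():
        df = df.withColumnRenamed(old_name, new_name)
    df = df.drop(*drops)
    for new_name, column in additions.items():
        df = df.withColumn(new_name, column)
    return df


# Possible use inside a Databricks notebook (column names are hypothetical):
#
#   from pyspark.sql.functions import current_timestamp
#
#   circuits_final_df = apply_column_plan(
#       circuits_df,
#       renames={"circuitId": "circuit_id"},
#       drops=["url"],
#       additions={"ingestion_date": current_timestamp()},
#   )
```

Keeping the plan as plain dictionaries and lists makes it easy to see, per file, which columns fall into each of the three colour groups.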

Both horizontal and vertical scaling are possible, but a larger budget would be needed to take full advantage of Azure Databricks.

etl_motor_racing_4

Below are a few snapshots. The reproducible Databricks files are available in the folder.

Creating secure secret keys, connecting to storage, and creating and mounting the empty raw folder

etl_adls_notebook_1
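One common way to do this step is to read credentials from a Databricks secret scope and mount the container with OAuth. The sketch below assumes a service principal is used, which the README does not confirm; the function names, the secret scope and key names, and the storage account name are all hypothetical. `dbutils` is the object Databricks notebooks provide, and it is passed in as a parameter here so the helpers stay importable outside a notebook.

```python
def build_oauth_configs(client_id, client_secret, tenant_id):
    """Spark configs for OAuth access to ADLS Gen2 (assumes a service principal)."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def mount_container(dbutils, storage_account, container, configs):
    """Mount a container at /mnt/<account>/<container> unless it is already mounted."""
    mount_point = f"/mnt/{storage_account}/{container}"
    if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        return mount_point  # already mounted, nothing to do
    dbutils.fs.mount(
        source=f"abfss://{container}@{storage_account}.dfs.core.windows.net/",
        mount_point=mount_point,
        extra_configs=configs,
    )
    return mount_point


# Possible use in a notebook (scope, key, and account names are hypothetical):
#
#   configs = build_oauth_configs(
#       dbutils.secrets.get(scope="racing-scope", key="client-id"),
#       dbutils.secrets.get(scope="racing-scope", key="client-secret"),
#       dbutils.secrets.get(scope="racing-scope", key="tenant-id"),
#   )
#   mount_container(dbutils, "racingdatalake", "raw", configs)
```

Reading the values through `dbutils.secrets.get` keeps the credentials out of the notebook source.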

Uploading raw files to the Data Lake Storage raw folder

etl_adls_notebook_2

Reading the JSON file into a Spark DataFrame

etl_adls_notebook_3
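A read like the one in the snapshot might look roughly like this. It is a sketch under assumptions: the helper name `read_json`, the file path, and the schema (including the nested `name` struct) are illustrative and not taken from the notebooks. It relies on `DataFrameReader.schema` accepting a DDL string and on the `multiLine` option for JSON files whose records span several lines.

```python
# Hypothetical schema for a drivers file with a nested "name" object.
DRIVERS_SCHEMA = (
    "driverId INT, driverRef STRING, "
    "name STRUCT<forename: STRING, surname: STRING>, "
    "dob DATE, nationality STRING"
)


def read_json(spark, path, schema_ddl, multiline=False):
    """Read a JSON file into a DataFrame with an explicit schema.

    Set multiline=True when each JSON record spans multiple lines.
    """
    return (
        spark.read
        .schema(schema_ddl)
        .option("multiLine", multiline)
        .json(path)
    )


# Possible use in a notebook, where `spark` is predefined (path is hypothetical):
#
#   drivers_df = read_json(spark, "/mnt/racingdatalake/raw/drivers.json", DRIVERS_SCHEMA)
```

Supplying the schema up front avoids an extra pass over the file for schema inference and keeps column types stable between runs.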

Writing the output to a Parquet file

etl_adls_notebook_4
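The write step could be sketched as follows. The helper name `write_parquet`, the output path, and the optional partition column are assumptions for illustration. The overwrite mode is also an assumption, chosen so that re-running a notebook replaces the previous output instead of failing.

```python
def write_parquet(df, path, partition_cols=()):
    """Write a DataFrame to Parquet, replacing any existing output at `path`."""
    writer = df.write.mode("overwrite")
    if partition_cols:
        # Partitioning creates one sub-folder per distinct value of each column.
        writer = writer.partitionBy(*partition_cols)
    writer.parquet(path)
    return path


# Possible use in a notebook (path and partition column are hypothetical):
#
#   write_parquet(drivers_final_df, "/mnt/racingdatalake/processed/drivers")
#   write_parquet(races_final_df, "/mnt/racingdatalake/processed/races", ["race_year"])
```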
