Azure Synapse Analytics Integrating Serverless SQL, Spark Pools, and Synapse Link

Project Overview

The project begins by sourcing data from the NYC Taxi website (https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page), which provides a diverse range of files crucial for analysis. The files are converted into CSV, JSON, TSV and Parquet formats to facilitate real data engineering tasks, containing detailed information about New York City taxi trips, including trip duration, zones, and payment types. To efficiently query this data, the project utilizes the Serverless SQL pool in Azure Synapse Analytics, leveraging the OpenRowSet function to directly query files stored in the raw folder of the Data Lake. Data pruning techniques are applied using the filename() function to optimize queries and improve performance.

To facilitate effective data processing and management, tables are created using the CETAS (Create Table As) statements for both the silver and gold layers. The silver layer undergoes transformation using the Spark Pool to enhance computing processes, particularly for complex computation and aggregation tasks. Additionally, user-defined stored procedures (USPs) are created to automate the table creation process, ensuring consistency and efficiency. Implemented parametrized pipelines streamlines workflow orchestration by facilitating dynamic folder paths and user-defined stored procedure (USP) names through a JSON configuration file.

The project extends its capabilities by seamlessly querying Cosmos DB data in Azure Synapse Analytics using Serverless SQL Pool and Spark Pool. This approach enhances operational efficiency by leveraging near real-time OLAP and OLTP analytics without the need for complex ETL processes. Finally, Azure Synapse is seamlessly connected with Power BI via SQL Endpoint, enabling the creation of insightful visualizations for analyzing demand patterns and payment types, enhancing the overall data analysis and reporting capabilities of the project.

Utilizing Serverless SQL Pool for Efficient Data Querying

To efficiently query this data, the project utilizes the Serverless SQL pool in Azure Synapse Analytics. Leveraging the OpenRowSet function, to directly queries files stored in the raw folder of the Data Lake, enabling efficient data retrieval and analysis.

Implementing Data Pruning Techniques for Query Optimization

Data pruning techniques are applied using the filename() function to optimize queries and improve performance. This ensures that only relevant data is processed, leading to faster query execution and enhanced efficiency in data analysis.

Transforming Data Using Spark Pool for Enhanced Computing

The silver layer undergoes transformation with multiple join conditions and aggregations to generate tables in the gold layer, leveraging the Spark pool to enhance computing processes. This is especially advantageous for handling intricate computation and aggregation tasks, thereby enhancing the overall efficiency of data processing.

Automating Table Creation with User-Defined Stored Procedures (USPs)

User-defined stored procedures (USPs) are created to automate the table creation process. This ensures consistency and efficiency in managing data tables, reducing manual intervention and streamlining the data processing workflow.

Streamlining Workflow Orchestration with Parametrized Pipelines

Parametrized pipelines are implemented to streamline workflow orchestration. This allows for dynamic folder paths and USP names through a JSON configuration file, enhancing the flexibility and scalability of the data processing pipeline.

Trial Silver Pipeline for USP

Dynamic Silver Pipeline Execution

Firstly, Script Activity is executed for getting all file names, then in For Each Activity we will use stored procedure on this file , then we will create view on top of all files of silver folder

Json File for providing Folder path and USP in parameter value of main pipeline

Executing all Pipelines

Connecting to Cosmos DB with Azure Synapse Link

Connecting to Cosmos DB with Azure Synapse Link for querying Cosmos DB data and integrating it seamlessly within Azure Synapse Analytics offers significant benefits. Azure Synapse Link eliminates the need for complex ETL processes or data replication, allowing for real-time access to operational data. With automatic data synchronization in just a few minutes, it streamlines the conversion of row data into columnar format data. This enables near real-time analytical insights, empowering organizations to make data-driven decisions swiftly and effectively.

Querying Cosmos DB Data in Synapse with Serverless SQL Pool

Querying Cosmos DB Data in Synapse Using Spark Pool (OLAP and OLTP Versions)

Connecting Azure Synapse with Power BI for Data Visualization

Finally, Azure Synapse is seamlessly connected with Power BI via SQL Endpoint, enabling the creation of insightful visualizations for analyzing demand patterns and payment types. This enhances the overall data analysis and reporting capabilities of the project.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
Nyc_Taxi_Scripts		Nyc_Taxi_Scripts
Pipelines ARM template		Pipelines ARM template
Synapse_link_Cosmos_DB		Synapse_link_Cosmos_DB
Azure Synapse Analytics on NYC Taxi Data.pbix		Azure Synapse Analytics on NYC Taxi Data.pbix
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Azure Synapse Analytics Integrating Serverless SQL, Spark Pools, and Synapse Link

Project Overview

Utilizing Serverless SQL Pool for Efficient Data Querying

Implementing Data Pruning Techniques for Query Optimization

Transforming Data Using Spark Pool for Enhanced Computing

Automating Table Creation with User-Defined Stored Procedures (USPs)

Streamlining Workflow Orchestration with Parametrized Pipelines

Trial Silver Pipeline for USP

Dynamic Silver Pipeline Execution

Json File for providing Folder path and USP in parameter value of main pipeline

Executing all Pipelines

Connecting to Cosmos DB with Azure Synapse Link

Querying Cosmos DB Data in Synapse with Serverless SQL Pool

Querying Cosmos DB Data in Synapse Using Spark Pool (OLAP and OLTP Versions)

Connecting Azure Synapse with Power BI for Data Visualization

Taking SQL Endpoint of Servless SQL Pool from Synapse Workspace

Using SQL Endpoint for connecting to Power BI.

Results Visualizations

About

Releases

Packages

Languages

sameerhussai230/Azure-Synapse-Analytics-Integrating-Serverless-SQL-Spark-Pools-and-Synapse-Link

Folders and files

Latest commit

History

Repository files navigation

Azure Synapse Analytics Integrating Serverless SQL, Spark Pools, and Synapse Link

Project Overview

Utilizing Serverless SQL Pool for Efficient Data Querying

Implementing Data Pruning Techniques for Query Optimization

Transforming Data Using Spark Pool for Enhanced Computing

Automating Table Creation with User-Defined Stored Procedures (USPs)

Streamlining Workflow Orchestration with Parametrized Pipelines

Trial Silver Pipeline for USP

Dynamic Silver Pipeline Execution

Json File for providing Folder path and USP in parameter value of main pipeline

Executing all Pipelines

Connecting to Cosmos DB with Azure Synapse Link

Querying Cosmos DB Data in Synapse with Serverless SQL Pool

Querying Cosmos DB Data in Synapse Using Spark Pool (OLAP and OLTP Versions)

Connecting Azure Synapse with Power BI for Data Visualization

Taking SQL Endpoint of Servless SQL Pool from Synapse Workspace

Using SQL Endpoint for connecting to Power BI.

Results Visualizations

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages