Skip to content

sameerhussai230/Azure-Synapse-Analytics-Integrating-Serverless-SQL-Spark-Pools-and-Synapse-Link

Repository files navigation

Azure Synapse Analytics Integrating Serverless SQL, Spark Pools, and Synapse Link

Project Overview

The project begins by sourcing data from the NYC Taxi website (https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page), which provides a diverse range of files crucial for analysis. The files are converted into CSV, JSON, TSV and Parquet formats to facilitate real data engineering tasks, containing detailed information about New York City taxi trips, including trip duration, zones, and payment types. To efficiently query this data, the project utilizes the Serverless SQL pool in Azure Synapse Analytics, leveraging the OpenRowSet function to directly query files stored in the raw folder of the Data Lake. Data pruning techniques are applied using the filename() function to optimize queries and improve performance.

To facilitate effective data processing and management, tables are created using the CETAS (Create Table As) statements for both the silver and gold layers. The silver layer undergoes transformation using the Spark Pool to enhance computing processes, particularly for complex computation and aggregation tasks. Additionally, user-defined stored procedures (USPs) are created to automate the table creation process, ensuring consistency and efficiency. Implemented parametrized pipelines streamlines workflow orchestration by facilitating dynamic folder paths and user-defined stored procedure (USP) names through a JSON configuration file.

The project extends its capabilities by seamlessly querying Cosmos DB data in Azure Synapse Analytics using Serverless SQL Pool and Spark Pool. This approach enhances operational efficiency by leveraging near real-time OLAP and OLTP analytics without the need for complex ETL processes. Finally, Azure Synapse is seamlessly connected with Power BI via SQL Endpoint, enabling the creation of insightful visualizations for analyzing demand patterns and payment types, enhancing the overall data analysis and reporting capabilities of the project.

image


image

Utilizing Serverless SQL Pool for Efficient Data Querying

To efficiently query this data, the project utilizes the Serverless SQL pool in Azure Synapse Analytics. Leveraging the OpenRowSet function, to directly queries files stored in the raw folder of the Data Lake, enabling efficient data retrieval and analysis.

image

Implementing Data Pruning Techniques for Query Optimization

Data pruning techniques are applied using the filename() function to optimize queries and improve performance. This ensures that only relevant data is processed, leading to faster query execution and enhanced efficiency in data analysis.

image

Transforming Data Using Spark Pool for Enhanced Computing

The silver layer undergoes transformation with multiple join conditions and aggregations to generate tables in the gold layer, leveraging the Spark pool to enhance computing processes. This is especially advantageous for handling intricate computation and aggregation tasks, thereby enhancing the overall efficiency of data processing.

image

Automating Table Creation with User-Defined Stored Procedures (USPs)

User-defined stored procedures (USPs) are created to automate the table creation process. This ensures consistency and efficiency in managing data tables, reducing manual intervention and streamlining the data processing workflow.

image

Streamlining Workflow Orchestration with Parametrized Pipelines

Parametrized pipelines are implemented to streamline workflow orchestration. This allows for dynamic folder paths and USP names through a JSON configuration file, enhancing the flexibility and scalability of the data processing pipeline.

Trial Silver Pipeline for USP

image

Dynamic Silver Pipeline Execution

Firstly, Script Activity is executed for getting all file names, then in For Each Activity we will use stored procedure on this file , then we will create view on top of all files of silver folder

image

Json File for providing Folder path and USP in parameter value of main pipeline

image

Executing all Pipelines

image

Connecting to Cosmos DB with Azure Synapse Link

Connecting to Cosmos DB with Azure Synapse Link for querying Cosmos DB data and integrating it seamlessly within Azure Synapse Analytics offers significant benefits. Azure Synapse Link eliminates the need for complex ETL processes or data replication, allowing for real-time access to operational data. With automatic data synchronization in just a few minutes, it streamlines the conversion of row data into columnar format data. This enables near real-time analytical insights, empowering organizations to make data-driven decisions swiftly and effectively.

image

Querying Cosmos DB Data in Synapse with Serverless SQL Pool

image

Querying Cosmos DB Data in Synapse Using Spark Pool (OLAP and OLTP Versions)

image

Connecting Azure Synapse with Power BI for Data Visualization

Finally, Azure Synapse is seamlessly connected with Power BI via SQL Endpoint, enabling the creation of insightful visualizations for analyzing demand patterns and payment types. This enhances the overall data analysis and reporting capabilities of the project.

Taking SQL Endpoint of Servless SQL Pool from Synapse Workspace

image

Using SQL Endpoint for connecting to Power BI.

image

image

Results Visualizations

image

image

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages