The project begins by sourcing data from the NYC Taxi website (https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page), which provides a diverse range of files crucial for analysis. The files are converted into CSV, JSON, TSV and Parquet formats to facilitate real data engineering tasks, containing detailed information about New York City taxi trips, including trip duration, zones, and payment types. To efficiently query this data, the project utilizes the Serverless SQL pool in Azure Synapse Analytics, leveraging the OpenRowSet function to directly query files stored in the raw folder of the Data Lake. Data pruning techniques are applied using the filename() function to optimize queries and improve performance.
To facilitate effective data processing and management, tables are created using the CETAS (Create Table As) statements for both the silver and gold layers. The silver layer undergoes transformation using the Spark Pool to enhance computing processes, particularly for complex computation and aggregation tasks. Additionally, user-defined stored procedures (USPs) are created to automate the table creation process, ensuring consistency and efficiency. Implemented parametrized pipelines streamlines workflow orchestration by facilitating dynamic folder paths and user-defined stored procedure (USP) names through a JSON configuration file.
The project extends its capabilities by seamlessly querying Cosmos DB data in Azure Synapse Analytics using Serverless SQL Pool and Spark Pool. This approach enhances operational efficiency by leveraging near real-time OLAP and OLTP analytics without the need for complex ETL processes. Finally, Azure Synapse is seamlessly connected with Power BI via SQL Endpoint, enabling the creation of insightful visualizations for analyzing demand patterns and payment types, enhancing the overall data analysis and reporting capabilities of the project.
To efficiently query this data, the project utilizes the Serverless SQL pool in Azure Synapse Analytics. Leveraging the OpenRowSet function, to directly queries files stored in the raw folder of the Data Lake, enabling efficient data retrieval and analysis.
Data pruning techniques are applied using the filename() function to optimize queries and improve performance. This ensures that only relevant data is processed, leading to faster query execution and enhanced efficiency in data analysis.
The silver layer undergoes transformation with multiple join conditions and aggregations to generate tables in the gold layer, leveraging the Spark pool to enhance computing processes. This is especially advantageous for handling intricate computation and aggregation tasks, thereby enhancing the overall efficiency of data processing.
User-defined stored procedures (USPs) are created to automate the table creation process. This ensures consistency and efficiency in managing data tables, reducing manual intervention and streamlining the data processing workflow.
Parametrized pipelines are implemented to streamline workflow orchestration. This allows for dynamic folder paths and USP names through a JSON configuration file, enhancing the flexibility and scalability of the data processing pipeline.
Firstly, Script Activity is executed for getting all file names, then in For Each Activity we will use stored procedure on this file , then we will create view on top of all files of silver folder
Connecting to Cosmos DB with Azure Synapse Link for querying Cosmos DB data and integrating it seamlessly within Azure Synapse Analytics offers significant benefits. Azure Synapse Link eliminates the need for complex ETL processes or data replication, allowing for real-time access to operational data. With automatic data synchronization in just a few minutes, it streamlines the conversion of row data into columnar format data. This enables near real-time analytical insights, empowering organizations to make data-driven decisions swiftly and effectively.
Finally, Azure Synapse is seamlessly connected with Power BI via SQL Endpoint, enabling the creation of insightful visualizations for analyzing demand patterns and payment types. This enhances the overall data analysis and reporting capabilities of the project.