# `DP-900`: _Microsoft Azure Data Fundamentals_


## `1.` Data Workloads


### Database


- A `database` is designed for Online Transaction Processing (`OLTP`), which handles day-to-day operations and transactions
- A `database` is optimized for quick, real-time operations like creating, reading, updating, and deleting data. It's the engine behind most applications, from e-commerce websites to banking systems.


#### `Purpose`


To support daily operational tasks and transactions.


#### `Data Type`


Typically stores real-time, current data from a single source or application.


#### `Structure`


Highly normalized to reduce data redundancy, which means data is split into multiple tables with specific relationships. This is efficient for transactional processes but can be complex for large-scale queries.


#### `Queries`


Queries are simple, focused on retrieving a small number of records quickly to support a specific function (e.g., "get the customer's order history").
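
As a minimal sketch of the normalized structure and the small, fast queries described above, the following uses Python's built-in `sqlite3` module as a stand-in for an operational database. The table names, column names, and sample rows are illustrative assumptions, not part of any Azure service.

```python
import sqlite3

# In-memory database; all table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        order_date  TEXT NOT NULL,
        total       REAL NOT NULL
    );
""")

# A small transaction: the customer and both orders are committed together.
with conn:
    conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?)",
        [(100, 1, "2024-01-05", 25.0), (101, 1, "2024-02-10", 40.0)],
    )

# Typical OLTP query: fetch a handful of rows for a single customer.
order_history = conn.execute(
    "SELECT order_id, order_date, total FROM orders "
    "WHERE customer_id = ? ORDER BY order_date",
    (1,),
).fetchall()
print(order_history)
```

Customer details live in one table and orders in another, linked by `customer_id`, so a customer's name is stored only once.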


#### `Users`


Designed to handle a high number of concurrent users, each performing small, fast transactions.


### Data warehouse


- A `data warehouse` is designed for Online Analytical Processing (`OLAP`), which is used for analysis and reporting to support business intelligence and decision-making
- A `data warehouse` is a central repository that consolidates historical data from multiple sources (including various databases). Its primary goal is to provide a unified, long-term view of a business to help analysts identify trends and patterns.


#### `Purpose`


To support business intelligence, data analysis, and strategic decision-making


#### `Data Type`


Stores large volumes of historical data, often aggregated and cleaned from various sources


#### `Structure`


Often denormalized into a `star` schema (or a `snowflake` schema, which partially normalizes the dimension tables) to optimize for fast, complex queries and reporting. This structure allows analysts to easily group and summarize data


#### `Queries`


Queries are complex, often involving large datasets and multiple joins to answer broad business questions (e.g., "what was our total sales revenue by region for the last five years?")
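
The kind of analytical query described above can be sketched against a tiny star schema. This example again uses Python's `sqlite3` as a stand-in for a warehouse engine; the `fact_sales`, `dim_region`, and `dim_date` tables and their rows are made-up assumptions for illustration.

```python
import sqlite3

# One fact table surrounded by two dimension tables (a minimal star schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, region_name TEXT);
    CREATE TABLE dim_date   (date_id   INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales (
        region_id INTEGER REFERENCES dim_region(region_id),
        date_id   INTEGER REFERENCES dim_date(date_id),
        revenue   REAL
    );
    INSERT INTO dim_region VALUES (1, 'East'), (2, 'West');
    INSERT INTO dim_date   VALUES (1, 2023), (2, 2024);
    INSERT INTO fact_sales VALUES
        (1, 1, 100.0), (1, 2, 150.0), (2, 1, 80.0), (2, 2, 120.0), (1, 2, 50.0);
""")

# OLAP-style query: join the fact table to its dimensions, then aggregate.
revenue_by_region = conn.execute("""
    SELECT r.region_name, d.year, SUM(f.revenue) AS total_revenue
    FROM fact_sales AS f
    JOIN dim_region AS r ON r.region_id = f.region_id
    JOIN dim_date   AS d ON d.date_id   = f.date_id
    GROUP BY r.region_name, d.year
    ORDER BY r.region_name, d.year
""").fetchall()

for row in revenue_by_region:
    print(row)
```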


#### `Users`


Typically has a smaller number of users, such as business analysts and data scientists, who run complex, long-running queries


### Data warehouse workload


#### `1.` Process of loading data


This refers to the `ETL` (Extract, Transform, Load) or `ELT` (Extract, Load, Transform) process. It involves getting data from various sources, cleaning and transforming it, and then loading it into the data warehouse for storage.
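
A minimal ETL sketch, assuming a small inline CSV as the source and an in-memory SQLite database as the "warehouse". The function names `extract`, `transform`, and `load`, the `sales` table, and the sample data are all hypothetical; real pipelines would typically use a dedicated service for this.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with inconsistent casing, stray spaces, and a missing amount.
RAW_CSV = """order_id,region,amount
1, east ,100.50
2,WEST,80
3,east,
4,West,19.50
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop incomplete rows and normalize types and casing."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip rows with a missing amount
        region = row["region"].strip().title()
        cleaned.append((int(row["order_id"]), region, float(amount)))
    return cleaned

def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (order_id INTEGER, region TEXT, amount REAL)"
    )
    with conn:
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(RAW_CSV)))

totals = warehouse.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

In an `ELT` variant, the raw rows would be loaded first and the cleaning step would run inside the target system afterwards.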


#### `2.` Performing analysis and reporting


This is the primary function of a data warehouse. It involves running complex queries and using business intelligence (BI) tools to analyze the data and generate reports that help with decision-making.


#### `3.` Managing the data


This includes tasks like data governance, ensuring data quality, and performing administrative tasks such as backups and security management to maintain the integrity and reliability of the data warehouse


#### `4.` Exporting the data


This involves moving data out of the warehouse to other repositories or applications, often for further processing, use in machine learning models, or to be shared with other systems


### Data Warehouse and Data Lake


#### `Data Warehouse`


- A `data warehouse` is a centralized repository of `structured`, `processed data`
- It's designed for Online Analytical Processing (OLAP), which means it's optimized for business intelligence (BI), reporting, and analysis


#### `Data Lake`


- A `data lake` is a vast pool of `raw`, `unprocessed data`
- It's designed to store all types of data (`structured`, `semi-structured`, and `unstructured`) at a massive scale.


### Structured, Semi-structured, and Unstructured Data


#### `Structured` Data


`Structured` data is the most traditional type of data. It resides in a fixed field within a record or file and is easily searchable because its format is well-defined

`Examples`: Relational databases, CSV files, and spreadsheets


#### `Semi-structured` Data


`Semi-structured` data doesn't fit into the rigid rows and columns of structured data, but it does contain tags or other markers to enforce a hierarchy and organize the data

`Examples`: JSON files, XML files, and emails


#### `Unstructured` Data


`Unstructured` data is the opposite of structured data. It has no predefined format, making it difficult to search and analyze with traditional methods. It makes up the vast majority of data in the world today

`Examples`: Text documents, social media posts, images, audio files, and videos.
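
To make the three categories concrete, here is a small sketch of how each kind of data is typically read in Python. The sample values are invented for illustration.

```python
import csv
import io
import json

# Structured: every row has the same fixed fields.
structured_csv = "id,name,city\n1,Ada,London\n2,Linus,Helsinki\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))
print(rows[1]["city"])

# Semi-structured: keys and nesting describe the data, but fields can vary per record.
semi_structured_json = (
    '{"id": 1, "name": "Ada", '
    '"contacts": {"email": "ada@example.com"}, "tags": ["math"]}'
)
doc = json.loads(semi_structured_json)
print(doc["contacts"]["email"])

# Unstructured: no schema, so only generic operations such as keyword search apply.
unstructured_text = "Ada wrote to Linus about the shipment delay in London."
mentions_london = "London" in unstructured_text
print(mentions_london)
```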


### Azure Batch Process


#### `1.` Prepare your data and applications


You start by uploading the input data and the application files (e.g., executables, scripts) that will process this data to an Azure Storage account, typically Azure Blob Storage


#### `2.` Create a Batch pool


A pool is a collection of compute nodes (VMs) that will execute your tasks. When you create a pool, you specify the number of nodes, their size (which determines CPU and memory), the operating system (Windows or Linux), and how applications should be installed on them. You can also configure auto-scaling to automatically adjust the number of nodes based on the workload, which helps optimize costs


#### `3.` Create a job and tasks


A job is a logical container for a collection of tasks. A task is a single unit of work that will run on a compute node. You define the tasks within a job, and each task specifies the command line to be executed and any input files it needs


#### `4.` Execute the tasks


The Batch service automatically schedules the tasks to run on the available compute nodes in the pool. It manages the queueing, execution, and retries of tasks. Before a task runs, it can download the necessary input files and applications from Azure Storage to the assigned node


#### `5.` Monitor and retrieve output


While the tasks are running, you can monitor their progress. Once tasks are complete, they can upload their output data back to Azure Storage. You can then download and process the final results
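
The five steps above can be mirrored with a purely local analogy. This sketch does not use the Azure Batch SDK: a `ThreadPoolExecutor` stands in for the pool of compute nodes, a dictionary stands in for Blob Storage, and `word_count_task` is a hypothetical task.

```python
from concurrent.futures import ThreadPoolExecutor

# Step 1: input data (a stand-in for files uploaded to Blob Storage).
input_files = {"a.txt": "one two three", "b.txt": "four five", "c.txt": "six"}

def word_count_task(name, text):
    """One task: a single unit of work applied to one input file."""
    return name, len(text.split())

# Step 2: a pool with a fixed number of workers (a stand-in for compute nodes).
with ThreadPoolExecutor(max_workers=2) as pool:
    # Steps 3 and 4: the "job" is the collection of tasks; the pool schedules them.
    futures = [
        pool.submit(word_count_task, name, text)
        for name, text in input_files.items()
    ]
    # Step 5: gather each task's output once it completes.
    results = dict(future.result() for future in futures)

print(results)
```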


### Azure Real-time Processing


#### `1.` Data Ingestion Services


These services are designed to handle high-throughput data streams from various sources


##### Azure Event Hubs


This is a fully managed, cloud-native service from Microsoft. It's designed to ingest massive volumes of data from a wide range of sources, like applications, websites, and IoT devices. A key benefit is that it's a platform-as-a-service (PaaS), so you don't have to worry about managing the underlying infrastructure, which makes it easy to set up and scale. Event Hubs also has a native Kafka protocol endpoint, allowing existing Kafka applications to connect without code changes


##### Azure IoT Hub


This service is specifically for Internet of Things (IoT) scenarios. While it also ingests high-volume data streams, its core purpose is to provide a secure and bidirectional communication channel between IoT devices and the cloud. It offers rich features for device management, security, and two-way messaging, allowing you to send commands back to devices. This is what distinguishes it from Event Hubs, which is primarily a one-way data ingestion service


##### Apache Kafka


This is a powerful, open-source, distributed event-streaming platform. It's known for its high throughput and fault tolerance. Unlike Event Hubs and IoT Hub, Kafka is not a managed service by default; you have to set it up and manage it yourself. This gives you a high degree of control and flexibility, but it comes with a significant operational overhead. Many organizations choose to run Kafka on-premises or use a managed Kafka service from a cloud provider (like Azure HDInsight or Confluent Cloud) to reduce the management burden


#### `2.` Processing/Analysis


These services are the heart of the real-time pipeline, where the actual data transformation and analysis take place


##### Azure Stream Analytics


A fully managed, serverless stream processing engine that enables you to run complex, real-time analytics on streaming data. It uses a simple, SQL-like query language, which makes it easy to filter, aggregate, and join data from multiple sources. It's ideal for scenarios like real-time dashboards, alerting, and anomaly detection
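
Stream aggregation is usually done over time windows. The sketch below shows the idea of a fixed, non-overlapping (tumbling) window average in plain Python. It is not Stream Analytics query syntax; the function name and the `(timestamp, value)` event shape are assumptions for illustration.

```python
from collections import defaultdict

def tumbling_window_avg(events, window_seconds):
    """Group (timestamp, value) events into fixed windows and average each one."""
    buckets = defaultdict(list)
    for timestamp, value in events:
        # Every timestamp maps to the start of exactly one window.
        window_start = timestamp - (timestamp % window_seconds)
        buckets[window_start].append(value)
    return {
        start: sum(values) / len(values)
        for start, values in sorted(buckets.items())
    }

# Hypothetical events: (seconds since start, sensor reading).
events = [(0, 10.0), (3, 20.0), (5, 30.0), (9, 50.0), (12, 5.0)]
averages = tumbling_window_avg(events, window_seconds=5)
print(averages)
```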


##### Azure Functions


An event-driven, serverless compute platform. You can use it to process data streams in real-time by triggering functions based on new data arriving in Event Hubs or IoT Hub. Functions are a great choice for implementing custom, lightweight processing logic that doesn't fit a standard SQL-based query


##### Azure Databricks


A fast, powerful, Apache Spark-based analytics platform. While it can be used for batch processing, its Structured Streaming capabilities are excellent for real-time processing of large-scale streaming data. It's a more powerful and flexible option than Stream Analytics for complex machine learning tasks or when you need to write custom logic in Python, Scala, or R


#### `3.` Data Storage Services


After processing, the data needs to be stored for various purposes


##### Azure Cosmos DB


A globally distributed, multi-model database service that's ideal for storing the results of your real-time processing. Its low-latency read and write capabilities make it perfect for powering real-time dashboards and applications


##### Azure Data Lake Storage


A massively scalable and secure data lake service. It's often used as a long-term storage solution for raw or processed streaming data, which can later be used for historical analysis or machine learning


#### `4.` Visualization and Action Services


These services allow you to act on the insights derived from your real-time data


##### Power BI


A business analytics service that can be used to create real-time dashboards and reports from the output of Stream Analytics or other storage services


##### Azure Functions/Logic Apps


You can use these services to trigger actions based on the processed data. For example, a function could send an email or an SMS alert if an anomaly is detected, or a Logic App could initiate a workflow


#### `Example` Real-Time Architecture


`Ingestion`: IoT devices send telemetry data to an Azure IoT Hub

`Processing`: An Azure Stream Analytics job ingests the data from IoT Hub, uses a SQL query to calculate a rolling average, and checks for anomalies

`Storage`: The job outputs the aggregated data to Azure Cosmos DB for a real-time dashboard

`Action`: If an anomaly is detected, the Stream Analytics job can also send an alert to an Azure Functions app, which then sends an email to an operations team.
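
The processing and action steps of this example can be sketched locally. The code below is a plain Python illustration of a per-device rolling average with a simple threshold check, not an actual Stream Analytics job; the function name, window size, threshold, and telemetry values are all assumptions.

```python
from collections import defaultdict, deque

def detect_anomalies(readings, window=3, threshold=10.0):
    """Flag readings that deviate from the device's rolling average by more than threshold."""
    windows = defaultdict(lambda: deque(maxlen=window))
    alerts = []
    for device_id, value in readings:
        recent = windows[device_id]
        # Only compare once the window for this device is full.
        if len(recent) == window:
            rolling_avg = sum(recent) / window
            if abs(value - rolling_avg) > threshold:
                alerts.append((device_id, value, rolling_avg))
        recent.append(value)
    return alerts

# Hypothetical telemetry: (device id, temperature reading).
telemetry = [
    ("dev-1", 20.0),
    ("dev-1", 21.0),
    ("dev-1", 22.0),
    ("dev-1", 40.0),
    ("dev-1", 23.0),
]
alerts = detect_anomalies(telemetry)
print(alerts)
```

In the architecture above, each entry in `alerts` would correspond to a message handed to the alerting function.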


In Azure, a `resource group` is a fundamental organizational unit that serves as a logical container for your Azure resources. Think of it as a folder for your cloud assets. All resources in a resource group are managed as a single unit


## `2.` Data Analytics


### Azure Data Explorer


- `Azure Data Explorer` is a fully managed, high-performance, big data analytics platform that makes it easy to analyze high volumes of data in near real time. The Azure Data Explorer toolbox gives you an end-to-end solution for data ingestion, query, visualization, and management
- By analyzing structured, semi-structured, and unstructured data across time series, and by using Machine Learning, `Azure Data Explorer` makes it simple to extract key insights, spot patterns and trends, and create forecasting models.
- `Azure Data Explorer` uses a traditional relational model, organizing data into tables with strongly typed schemas. Tables are stored within databases, and a cluster can manage multiple databases.
- `Azure Data Explorer` is scalable, secure, robust, and enterprise-ready, and is useful for log analytics, time series analytics, IoT, and general-purpose exploratory analytics


#### When should you use Azure Data Explorer?


Use the following questions to help decide if Azure Data Explorer is right for your use case:

- `Interactive analytics`: Is interactive analysis part of the solution? For example, aggregation, correlation, or anomaly detection.
- `Variety, Velocity, Volume`: Is your schema diverse? Do you need to ingest massive amounts of data in near real-time?
- `Data organization`: Do you want to analyze raw data? For example, data that has not been fully curated into a star schema.
- `Query concurrency`: Will multiple users or processes use Azure Data Explorer?
- `Build vs Buy`: Do you plan on customizing your data platform?


#### Azure Data Explorer flow


![IMAGE](/home/saadkh/dataops-bc/src/Images/Azure_Data_Explorer_flow.png)


### Azure Storage services


#### `1-` Azure Blob Storage


`What it is`: A massively scalable object store for unstructured data

`Use it for`: Storing any kind of text or binary data, such as documents, videos, images, and application backups. It's also the foundation for Azure Data Lake Storage Gen2, making it essential for big data analytics


#### `2-` Azure Files


`What it is`: A fully managed file share in the cloud, accessible via the standard Server Message Block (SMB) protocol.

`Use it for`: "Lifting and shifting" on-premises applications to the cloud that rely on traditional file shares. It works just like a shared network drive


#### `3-` Azure Queue Storage


`What it is`: A messaging store for building scalable and reliable applications

`Use it for`: Decoupling application components. One part of your app can leave a message in the queue (e.g., "process this image"), and another part can pick it up when it has the capacity. It creates a reliable buffer between services. 📬
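
The decoupling pattern can be sketched with Python's standard `queue` module. This is a local analogy rather than the Azure Queue Storage SDK: the producer and consumer share an in-process queue, and the message contents are invented.

```python
import queue
import threading

work_queue = queue.Queue()
processed = []

def worker():
    """Consumer: picks up messages whenever it has capacity."""
    while True:
        message = work_queue.get()
        if message is None:  # sentinel value: no more work
            work_queue.task_done()
            break
        processed.append(f"processed {message}")
        work_queue.task_done()

consumer = threading.Thread(target=worker)
consumer.start()

# Producer: leaves messages in the queue and moves on without waiting.
for image in ["img-1.png", "img-2.png"]:
    work_queue.put(image)
work_queue.put(None)

work_queue.join()   # wait until every message has been handled
consumer.join()
print(processed)
```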


#### `4-` Azure Table Storage


`What it is`: A NoSQL key-attribute store

`Use it for`: Storing large amounts of structured, non-relational data with a flexible schema. It's great for things like user data for web apps, address books, or device information
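
Entities in Table Storage are addressed by a partition key and a row key, and each entity can carry its own set of attributes. The sketch below models that idea with a plain dictionary; it is not the Azure SDK, and the helper names and sample entities are hypothetical.

```python
# In-memory stand-in for a table: (partition key, row key) -> attributes.
table = {}

def upsert_entity(partition_key, row_key, **attributes):
    """Insert or replace the entity identified by the two keys."""
    table[(partition_key, row_key)] = attributes

def get_entity(partition_key, row_key):
    """Point lookup by both keys."""
    return table.get((partition_key, row_key))

def query_partition(partition_key):
    """Return every entity that shares a partition key."""
    return {rk: attrs for (pk, rk), attrs in table.items() if pk == partition_key}

# Flexible schema: entities in the same table can have different attributes.
upsert_entity("devices-eu", "sensor-001", model="T100", firmware="1.2")
upsert_entity("devices-eu", "sensor-002", model="T200", location="Berlin")
upsert_entity("devices-us", "sensor-003", model="T100")

print(get_entity("devices-eu", "sensor-002"))
print(sorted(query_partition("devices-eu")))
```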


#### `5-` Azure Disk Storage


`What it is`: High-performance, persistent block storage for Azure Virtual Machines (VMs)

`Use it for`: Acting as the virtual hard drive (SSD or HDD) for your Azure VMs, where the operating system, applications, and data are stored


### Processing ELT data


### Processing ETL data


## `3.` Relational Data Workloads


## `4.` Relational Data Management


## `5.` Provisioning & Configuring Relational Data Services


## `6.` Azure SQL Querying Techniques


## `7.` Non-relational Data Workloads


## `8.` Azure Cosmos DB


## `9.` Non-relational Data Management


## `10.` Azure Analytics Workloads


## `11.` Modern Data Warehousing


## `12.` Azure Data Ingestion & Processing


## `13.` Azure Data Visualization


# `AZ-900`: _Microsoft Azure Fundamentals_


# `AZ-104`: _Microsoft Azure Administrator_
