# Data Modeling

Data modeling is the process of creating a visual representation of a system's data and its relationships. It defines how data is stored, connected, processed, and retrieved in databases or data warehouses.

Put simply, it is the process of creating a blueprint for how data is stored, connected, and retrieved from a system.

# 📊 Types of Data Models

There are three primary types of data models, each describing the same data at a different level of detail:

---

## 1. 🧠 Conceptual Data Model
- **Purpose**: High-level overview of business data.
- **Audience**: Business stakeholders and analysts.
- **Focus**: What data is important (not how it's stored).
- **Example**: 
  - Entities: Customer, Product, Order
  - No attributes or data types specified.

---

## 2. 📘 Logical Data Model
- **Purpose**: More detailed view including relationships and attributes.
- **Audience**: Data architects, analysts.
- **Focus**: Defines data structure regardless of physical implementation.
- **Includes**:
  - Entities
  - Attributes (with data types)
  - Primary and foreign keys
  - Relationships (1:1, 1:N, M:N)
- **Example**: An ER diagram with entities like `Customer`, `Order`, `Product`, plus their attributes and keys.
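
As a rough illustration of a logical model, the sketch below describes entities, attributes, keys, and relationships without committing to any database. Python dataclasses stand in for entities and type annotations stand in for abstract data types. The attribute names (`customer_id`, `name`, `title`, `quantity`) and the `OrderLine` entity are hypothetical and only serve the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Customer:
    customer_id: int  # primary key
    name: str


@dataclass(frozen=True)
class Product:
    product_id: int  # primary key
    title: str


@dataclass(frozen=True)
class Order:
    order_id: int     # primary key
    customer_id: int  # foreign key -> Customer (1:N, one customer has many orders)
    order_date: date


@dataclass(frozen=True)
class OrderLine:
    # Associative entity that resolves the M:N relationship between Order and Product.
    order_id: int    # foreign key -> Order
    product_id: int  # foreign key -> Product
    quantity: int


alice = Customer(customer_id=1, name="Alice")
book = Product(product_id=7, title="Data Modeling Basics")
order = Order(order_id=100, customer_id=alice.customer_id, order_date=date(2024, 1, 15))
line = OrderLine(order_id=order.order_id, product_id=book.product_id, quantity=2)
```

Nothing here says how the data is stored or indexed; those decisions belong to the physical model.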

---

## 3. 💽 Physical Data Model
- **Purpose**: Actual implementation on a specific database or system.
- **Audience**: DBAs, developers.
- **Focus**: Performance, indexing, storage format.
- **Includes**:
  - Table names, column names
  - Data types, indexes, partitions
  - Constraints (PK, FK, NOT NULL, etc.)
- **Example**: SQL table definitions in Snowflake, SQL Server, Databricks, etc.
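
A physical model could look roughly like the following sketch. It uses SQLite through Python's standard `sqlite3` module purely as a stand-in engine so the example runs anywhere. The table names, column names, and index name are hypothetical, and the DDL for Snowflake, SQL Server, or Databricks would differ in data types and options.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Concrete tables, data types, constraints, and an index for a specific engine.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer (customer_id),
    order_date  TEXT NOT NULL
);
CREATE INDEX idx_order_customer ON customer_order (customer_id);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.execute("INSERT INTO customer_order VALUES (100, 1, '2024-01-15')")

# An order that points at a missing customer violates the FK constraint.
try:
    conn.execute("INSERT INTO customer_order VALUES (101, 999, '2024-01-16')")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

The index on `customer_id` is a performance decision, which is why it appears at this level and not in the logical model.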

---

## 📌 Comparison Table

| Feature             | Conceptual Model | Logical Model       | Physical Model         |
|---------------------|------------------|----------------------|-------------------------|
| Purpose             | Business view     | Logical structure    | Implementation detail   |
| Includes attributes | ❌                | ✅                   | ✅                      |
| Includes data types | ❌                | ✅ (abstract)        | ✅ (actual)             |
| Tied to DBMS        | ❌                | ❌                   | ✅                      |
| Audience            | Business          | Data Architects      | Engineers / DBAs        |

---

## ✅ Summary

- Use **conceptual** to define the *what*.
- Use **logical** to define the *how* (abstractly).
- Use **physical** to define the *exact implementation*.



# 🔄 OLTP vs OLAP

Understanding the difference between **OLTP (Online Transaction Processing)** and **OLAP (Online Analytical Processing)** is key in data engineering and data warehousing.

---

## 🔹 What is OLTP?

**OLTP (Online Transaction Processing)** systems are designed to handle **day-to-day transactional data**.

- 📥 **Use Case**: Inserting, updating, and deleting records in real-time.
- 🏢 **Example**: Banking systems, e-commerce order systems, ticket booking systems.

### ✅ Key Features of OLTP:
- Handles **high volume of short transactions**
- Data is **highly normalized**
- Focuses on **data integrity and speed**
- Supports **CRUD operations** (Create, Read, Update, Delete)
- **Real-time access** to operational data
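
To make the "short transaction with data integrity" idea concrete, here is a minimal sketch of an OLTP-style operation, assuming a hypothetical `account` table in an in-memory SQLite database. The `transfer` helper is illustrative, not part of any library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "account_id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE account_id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE account_id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, 1, 2, 30)         # succeeds
rejected = transfer(conn, 1, 2, 500)  # would overdraw account 1, so it is rolled back
balances = dict(conn.execute("SELECT account_id, balance FROM account"))
```

The second transfer credits account 2 before the debit fails, yet the rollback undoes that credit, which is the integrity guarantee OLTP systems are built around.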

---

## 🔸 What is OLAP?

**OLAP (Online Analytical Processing)** systems are used for **data analysis and reporting**.

- 📊 **Use Case**: Running complex queries, dashboards, historical data analysis.
- 🏢 **Example**: Business intelligence tools, data warehouses, Power BI dashboards.

### ✅ Key Features of OLAP:
- Handles **complex queries over large datasets**
- Data is often **denormalized** into a star schema (a snowflake schema normalizes the dimension tables again)
- Supports **aggregations, drill-downs, slicing/dicing**
- Used for **decision making**
- Data is typically **read-heavy** and not updated frequently
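
The features above can be sketched with a tiny star schema. The example assumes hypothetical `fact_sales` and `dim_product` tables in an in-memory SQLite database; a real warehouse would hold far more rows, but the query shape is similar.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (product_key INTEGER, sale_year INTEGER, amount INTEGER);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
INSERT INTO fact_sales VALUES
    (1, 2023, 10), (1, 2024, 20), (2, 2023, 5), (2, 2024, 15), (2, 2024, 25);
""")

# Aggregation: total sales per category across all years.
by_category = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_key = f.product_key
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()

# Slice: the same aggregation restricted to a single year.
by_category_2024 = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_key = f.product_key
    WHERE f.sale_year = 2024
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
```

Both queries only read data, and each scans many fact rows to produce a few summary rows, which is the typical OLAP access pattern.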

---

## 🆚 OLTP vs OLAP: Comparison Table

| Feature                 | OLTP                                | OLAP                              |
|-------------------------|-------------------------------------|-----------------------------------|
| Stands For              | Online Transaction Processing       | Online Analytical Processing      |
| Purpose                 | Day-to-day operations               | Analysis and decision support     |
| Data Structure          | Normalized                          | Denormalized (Star/Snowflake)     |
| Operations              | Read & Write (Insert/Update/Delete) | Read-heavy (complex queries)      |
| Query Type              | Simple, fast transactions           | Complex analytical queries        |
| Data Volume             | Small transactions, high volume     | Large datasets                    |
| Users                   | End users, clerks, front-line staff | Analysts, managers, data scientists |
| Examples                | ATM systems, online stores          | Sales dashboard, marketing analysis |

---

## ✅ Summary

- Use **OLTP** systems for **transactional** workloads.
- Use **OLAP** systems for **analytical** workloads.
- In a modern data architecture, **data flows from OLTP → OLAP** through ETL/ELT pipelines.



# 💾 Persistent vs ⚡ Transient Data

Understanding the difference between **persistent** and **transient** data is important when designing data pipelines, managing storage, and optimizing performance.

---

## 💾 Persistent Data

- **Stored permanently** until explicitly deleted.
- Survives system crashes, restarts, or job failures.
- Used for long-term storage, compliance, and analytics.

### ✅ Characteristics:
- Stored on disk or cloud (e.g., ADLS, S3, HDFS, databases).
- Can be queried or retrieved multiple times.
- Examples:
  - Tables in Delta Lake
  - Data stored in SQL or NoSQL databases
  - Parquet/CSV files stored in data lakes

---

## ⚡ Transient Data

- **Temporary** data used during processing or within a session.
- Lost when the session ends or the process completes.
- Used for intermediate results, testing, or caching.

### ⚠️ Characteristics:
- Stored in memory or temporary storage.
- Not intended for long-term access.
- Examples:
  - Spark DataFrames (if not saved)
  - Temporary SQL views or tables
  - Session variables or pipeline staging buffers
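
The difference in lifetime can be demonstrated with a small sketch. It assumes a file-backed SQLite database in a temporary directory, with a hypothetical persistent `orders` table and a temporary `staging` table. Closing the connection plays the role of the session ending.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE orders (order_id INTEGER)")        # persistent
conn.execute("CREATE TEMP TABLE staging (order_id INTEGER)")  # transient
conn.execute("INSERT INTO staging VALUES (1), (2)")
conn.execute("INSERT INTO orders SELECT order_id FROM staging")
conn.commit()
conn.close()  # the session ends here

# A new session sees the persistent table but not the temporary one.
conn = sqlite3.connect(db_path)
persistent_rows = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
try:
    conn.execute("SELECT COUNT(*) FROM staging")
    staging_survived = True
except sqlite3.OperationalError:  # no such table: staging
    staging_survived = False
conn.close()
```

The staging table did its job as an intermediate buffer, and only the rows copied into `orders` outlive the session.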

---

## 🔍 Comparison Table

| Feature             | Persistent Data                     | Transient Data                        |
|---------------------|--------------------------------------|----------------------------------------|
| Storage Medium      | Disk, cloud, databases               | Memory, temp disk                      |
| Lifetime            | Until explicitly deleted             | Lost after session/job ends            |
| Use Case            | Final storage, analytics, reporting  | Intermediate transformations           |
| Durability          | High                                 | Low                                    |
| Recovery after crash| Possible                             | Not possible                           |

---

## 🧠 Summary

- Use **persistent storage** when you need durability, compliance, and historical tracking.
- Use **transient storage** to optimize performance and reduce storage costs during processing stages.



# 🔄 Incremental Loading in Data Engineering

**Incremental loading** is the process of loading only **new or updated data** from a source system into your data warehouse or data lake, rather than loading the entire dataset every time.

This approach is **faster, more efficient, and more cost-effective** than a full reload, especially for large datasets.

---

## 🚀 Why Use Incremental Loading?

| Benefit             | Description                                      |
|---------------------|--------------------------------------------------|
| ⏱️ Faster Loads      | Only processes new/changed records               |
| 💰 Lower Cost        | Reduces compute and data transfer per run       |
| 📦 Scalable          | Ideal for large-scale or streaming data systems |
| 🔍 Easier Auditing   | Allows tracking changes over time               |

---

## 🧩 Types of Incremental Loads

### 1. **Append-Only (Insert-only)**
- Load only new rows (e.g., new orders, new log entries)
- Common with time-based or auto-incremented keys
- Simple and efficient
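
One common way to implement an append-only load is a high-water mark: remember the largest key already loaded and take only rows above it. The sketch below uses plain Python lists in place of real source and target tables. The `append_new_rows` helper and the `order_id` key are hypothetical, and keys are assumed to be positive and ever-increasing.

```python
def append_new_rows(source_rows, target_rows, key="order_id"):
    """Append source rows whose key exceeds the target's high-water mark."""
    # default=0 covers an empty target; assumes keys are positive integers.
    watermark = max((row[key] for row in target_rows), default=0)
    new_rows = [row for row in source_rows if row[key] > watermark]
    target_rows.extend(new_rows)
    return len(new_rows)


source = [{"order_id": 1}, {"order_id": 2}, {"order_id": 3}]
target = [{"order_id": 1}]

first_run = append_new_rows(source, target)   # loads order_id 2 and 3
second_run = append_new_rows(source, target)  # nothing new to load
```

Running the load twice without new source rows adds nothing the second time, so the job is safe to re-run.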

### 2. **Upserts (Insert + Update)**
- Detect and load both new and **changed rows**
- Requires a **unique key** (e.g., order_id, customer_id)
- Uses `MERGE`, `UPDATE`, or `DELETE + INSERT` logic
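
As a small, engine-agnostic illustration of upsert logic, the sketch below uses SQLite's `INSERT ... ON CONFLICT ... DO UPDATE` (available in SQLite 3.24 and later) in place of a warehouse `MERGE`. The `customer` table and the email values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'old@example.com')")

# Incoming batch: customer 1 has changed, customer 2 is new.
incoming = [(1, "new@example.com"), (2, "second@example.com")]

# The unique key decides the outcome: update on a match, insert otherwise.
conn.executemany(
    """
    INSERT INTO customer (customer_id, email) VALUES (?, ?)
    ON CONFLICT (customer_id) DO UPDATE SET email = excluded.email
    """,
    incoming,
)
conn.commit()

rows = conn.execute(
    "SELECT customer_id, email FROM customer ORDER BY customer_id"
).fetchall()
```

Without the primary key on `customer_id` there would be no conflict to detect, and every incoming row would be inserted as a duplicate.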

---

## 🔍 How to Identify New/Changed Data

- **Timestamps** (e.g., `last_updated`, `order_date`)
- **Change Data Capture (CDC)** tools or logs
- **Hashing rows** and comparing for changes
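
Row hashing is useful when the source has no reliable timestamp. A minimal sketch, assuming rows are plain dictionaries keyed by a hypothetical `customer_id`; the helper names `row_hash` and `changed_keys` are illustrative only.

```python
import hashlib
import json


def row_hash(row):
    """Stable fingerprint of a row: serialize with sorted keys, then hash."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_keys(previous, current, key="customer_id"):
    """Return keys that are new in `current` or whose row content differs."""
    old_hashes = {row[key]: row_hash(row) for row in previous}
    return [row[key] for row in current if old_hashes.get(row[key]) != row_hash(row)]


previous = [{"customer_id": 1, "city": "Oslo"}, {"customer_id": 2, "city": "Rome"}]
current = [
    {"customer_id": 1, "city": "Oslo"},   # unchanged
    {"customer_id": 2, "city": "Milan"},  # changed
    {"customer_id": 3, "city": "Paris"},  # new
]
delta = changed_keys(previous, current)
```

In practice the previous hashes would be stored alongside the target rows so that only one hash per row has to be compared on each run.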

---

## 🔧 Example: PySpark with Delta Lake

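```python
from delta.tables import DeltaTable

# Assumes an active SparkSession named `spark` (predefined in Databricks notebooks).

# 1. Read new data from source.
#    Assumes the CSV has a header row; without it the columns would be
#    named _c0, _c1, ... and the merge condition below could not resolve customer_id.
new_data = (
    spark.read.format("csv")
    .option("header", "true")
    .load("/mnt/source/data.csv")
)

# 2. Get a handle on the existing target Delta table
delta_table = DeltaTable.forPath(spark, "/mnt/delta/silver_table")

# 3. Perform merge (UPSERT): update matching customers, insert the rest
(
    delta_table.alias("target")
    .merge(new_data.alias("source"), "target.customer_id = source.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```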