- **Name:** 04.4_dataframe_delta_table
- **Author:** Shamas Imran
- **Description:** Creating and managing Delta tables
- **Date:** 15-Oct-2025
<!--
REVISION HISTORY
Version          Date        Author           Description
01           19-Aug-2025   Shamas Imran       Created managed Delta table  
                                              Inserted and queried data  
                                              Showcased time travel with Delta  
-->

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DatapurProgram").getOrCreate()

In [None]:
# Root path under the Lakehouse Files area
rootPath = "Files/client_output_data/delta/"
deltaPath = rootPath + "score_delta"

In [None]:
from pyspark.sql import Row
import random

# --------------------------------
# 1. Generate Score Data (40–80 rows: 20 enrollments x 2–4 semesters)
# --------------------------------
semesters = ["2023-Spring", "2023-Fall", "2024-Spring", "2024-Fall", "2025-Spring"]

score_data = []
score_id = 1

for enrollment_id in range(1, 21):   # 20 enrollments
    for sem in random.sample(semesters, k=random.randint(2, 4)):  # Each enrollment has 2–4 semesters
        score = random.randint(60, 100)  # Random score between 60 and 100
        score_data.append(Row(
            ScoreID=score_id,
            EnrollmentID_FK=enrollment_id,
            Semester=sem,
            Score=score
        ))
        score_id += 1

df_score = spark.createDataFrame(score_data)
df_score.show()

In [None]:
df_score.show(20, truncate=False)
print(f"Total rows generated: {df_score.count()}")
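
The generation loop above yields between 40 and 80 rows (20 enrollments, each with 2 to 4 semesters), and because `random` is not seeded the data changes on every run. The following is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for this demo. The function name `generate_scores` and its `n_enrollments` and `seed` parameters are hypothetical and not part of the original notebook.

```python
import random

SEMESTERS = ["2023-Spring", "2023-Fall", "2024-Spring", "2024-Fall", "2025-Spring"]

def generate_scores(n_enrollments=20, seed=42):
    """Return score records as plain dicts; the same seed gives the same data."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    records = []
    score_id = 1
    for enrollment_id in range(1, n_enrollments + 1):
        # Each enrollment gets 2-4 distinct semesters
        for sem in rng.sample(SEMESTERS, k=rng.randint(2, 4)):
            records.append({
                "ScoreID": score_id,
                "EnrollmentID_FK": enrollment_id,
                "Semester": sem,
                "Score": rng.randint(60, 100),  # inclusive on both ends
            })
            score_id += 1
    return records

records = generate_scores()
print(len(records))  # somewhere between 40 and 80
```

In the notebook, the records could then be turned into a DataFrame with something like `spark.createDataFrame([Row(**r) for r in generate_scores()])`.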

In [None]:
df_score.write.format("delta").mode("overwrite").save(deltaPath)

In [None]:

spark.sql(f"""
CREATE TABLE IF NOT EXISTS score_table_external
USING DELTA
LOCATION '{deltaPath}'
""")

In [None]:
df_score.write.format("delta").mode("overwrite").saveAsTable("score_table")

## 🧩 Managed vs External Tables in Lakehouse

| Feature | **Managed Table** | **External Table** |
|----------|-------------------|--------------------|
| **Storage Location** | Stored **inside the Lakehouse’s managed storage** (default location). | Stored **outside the managed storage**, at a user-defined path (e.g., OneLake, ADLS, Blob). |
| **Creation Syntax** | `CREATE TABLE tableName (...)` | `CREATE TABLE tableName (...) LOCATION 'path'` |
| **Ownership** | Lakehouse (Fabric / Spark) **fully owns** the data and metadata. | Only **metadata** is managed by Lakehouse; **data** remains external. |
| **Data Files** | Stored under the Lakehouse’s internal folder (e.g., `/Tables/`) | Stored in the specified external path. |
| **When Dropped** | Both **data and metadata** are deleted. | Only **metadata** is dropped; **data files remain** in external storage. |
| **Use Case** | Best when Lakehouse should manage the full lifecycle of data. | Best when data is **shared, reused, or managed elsewhere**. |
| **Example – Managed Table** | `CREATE TABLE sales (id INT, amount DECIMAL);` |  |
| **Example – External Table** |  | `CREATE TABLE sales_ext (id INT, amount DECIMAL) USING DELTA LOCATION '/Files/external/sales';` |
| **Default Type** | ✅ Yes (if no LOCATION specified) | ❌ No (requires explicit LOCATION clause) |
| **Typical Storage Path** | `/Tables/<tableName>/` | `/Files/...` or any custom path |
| **Fabric Lakehouse View** | Appears under **Tables** | Appears under **External Tables** section |
| **Backup & Lifecycle** | Managed automatically by Fabric | Managed manually by user |

---

### 🧠 Quick Tip:
- Use **managed tables** for internal, pipeline-generated data.  
- Use **external tables** when linking **existing files** (for example, from a data lake, OneLake, or a shared dataset).
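
To confirm which kind of table each statement in this notebook produced, the catalog can be inspected. This is a minimal sketch that assumes both tables live in the current database and that `spark` is the active session; the helper name `table_types` is hypothetical. It relies on `spark.catalog.listTables()`, whose entries expose `name` and `tableType`.

```python
def table_types(spark, names):
    """Map each requested table name to its catalog type (e.g. MANAGED or EXTERNAL)."""
    wanted = {n.lower() for n in names}
    return {
        t.name: t.tableType
        for t in spark.catalog.listTables()  # tables in the current database
        if t.name.lower() in wanted
    }

# In the notebook, where `spark` is the active session:
# table_types(spark, ["score_table", "score_table_external"])
```

If the tables were created as shown earlier, `score_table` would be expected to report a managed type and `score_table_external` an external one.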


In [None]:
query = """
SELECT Semester, 
       COUNT(*) AS Exams, 
       AVG(Score) AS AvgScore
FROM score_table
GROUP BY Semester
ORDER BY Semester
"""

df_result = spark.sql(query)
df_result.show()

In [None]:
%%sql
SELECT Semester, COUNT(*) AS Exams, AVG(Score) AS AvgScore
FROM score_table
GROUP BY Semester
ORDER BY Semester

In [None]:
%%sql
DESCRIBE HISTORY score_table;
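
The history output can also be read from Python, for example to find the newest version number before time travelling. This is a minimal sketch assuming `spark` is the active session; the helper name `latest_version` is hypothetical. It uses the `version` column that `DESCRIBE HISTORY` returns for Delta tables.

```python
def latest_version(spark, table_name):
    """Return the highest version number recorded in the table's Delta history."""
    rows = spark.sql(f"DESCRIBE HISTORY {table_name}").collect()
    return max(row["version"] for row in rows)

# In the notebook, where `spark` is the active session:
# latest_version(spark, "score_table")
```

Any version between 0 and the returned number could then be used in a `VERSION AS OF` query.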

In [None]:
%%sql
UPDATE score_table
SET Score = Score + 5
WHERE Semester = '2023-Fall';

In [None]:
%%sql
SELECT * FROM score_table VERSION AS OF 0;

-- SELECT * FROM score_table TIMESTAMP AS OF '2025-08-16T10:00:00';
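
Time travel is also available through the DataFrame reader when working with a path instead of a table name. This is a minimal sketch using the Delta `versionAsOf` read option; the helper name `read_delta_version` is hypothetical. Note that `deltaPath` holds the files behind `score_table_external`, which keep their own history, separate from the managed `score_table` updated above.

```python
def read_delta_version(spark, path, version):
    """Load the Delta table stored at `path` as it was at the given version."""
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)  # Delta time-travel read option
        .load(path)
    )

# In the notebook, where `spark` and `deltaPath` are already defined:
# read_delta_version(spark, deltaPath, 0).show()
```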

In [None]:
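%%sql
-- Optional: roll the table back to its first version
-- RESTORE TABLE score_table TO VERSION AS OF 0;

-- Clean up: dropping a managed table removes both its metadata and its data files
DROP TABLE IF EXISTS score_table;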