#Telecom Domain ReadOps Assignment
This notebook contains assignments to practice Spark read options and Databricks volumes. <br>
Sections: Sample data creation, Catalog & Volume creation, Copying data into Volumes, Path glob/recursive reads, toDF() column renaming variants, inferSchema/header/separator experiments, and exercises.<br>

![](https://theciotimes.com/wp-content/uploads/2021/03/TELECOM1.jpg)

##First Import all required libraries & Create spark session object

In [0]:
from pyspark.sql.session import SparkSession
print(spark)
spark = SparkSession.builder.appName("Spark DataFrames").getOrCreate()
print(spark)


##Write SQL statements to create:

###A catalog named telecom_catalog_assign

In [0]:
%sql
CREATE CATALOG IF NOT EXISTS telecom_catalog_assign

### Create A schema landing_zone

In [0]:
%sql
CREATE SCHEMA IF NOT EXISTS telecom_catalog_assign.landing_zone

###Create A volume landing_vol

In [0]:
%sql
CREATE VOLUME IF NOT EXISTS telecom_catalog_assign.landing_zone.landing_vol

### Using dbutils.fs.mkdirs, create folders
/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/
/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/
/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/
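
Every folder in this assignment sits under the same `/Volumes/<catalog>/<schema>/<volume>` root, so the paths can be assembled from their parts rather than typed out each time. The sketch below is plain Python and never touches the workspace; `volume_path` is a hypothetical helper introduced here for illustration, not a Databricks API.

```python
from pathlib import PurePosixPath


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a /Volumes/<catalog>/<schema>/<volume>/... path string (hypothetical helper)."""
    return str(PurePosixPath("/Volumes", catalog, schema, volume, *parts))


# Names taken from the CREATE CATALOG / SCHEMA / VOLUME statements above.
root = ("telecom_catalog_assign", "landing_zone", "landing_vol")

base = volume_path(*root)
folders = [
    volume_path(*root, *sub)
    for sub in (("customer",), ("usage",), ("tower", "region1"), ("tower", "region2"))
]

for folder in folders:
    print(folder)
```

Each resulting string could then be passed to `dbutils.fs.mkdirs` in a notebook cell.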

In [0]:
base_path = "/Volumes/telecom_catalog_assign/landing_zone/landing_vol"

customer_path = f"{base_path}/customer/"
usage_path    = f"{base_path}/usage/"
tower_r1_path = f"{base_path}/tower/region1/"
tower_r2_path = f"{base_path}/tower/region2/"

dbutils.fs.mkdirs(customer_path)
dbutils.fs.mkdirs(usage_path)
dbutils.fs.mkdirs(tower_r1_path)
dbutils.fs.mkdirs(tower_r2_path)


In [0]:
dbutils.fs.ls("/Volumes/telecom_catalog_assign/landing_zone/landing_vol")

#Volume vs DBFS/FileStore<br>

DBFS/FileStore is a legacy, workspace-scoped file system mainly for demos, while Volumes are Unity Catalog-governed, secure, production-ready storage for files.

##Volumes:- Volumes are a Unity Catalog-managed storage layer that provides secure, governed access to files<br>
Governance:-Governance is control + rules + tracking.<br>
It answers:<br>
Who can access the data?<br>
Who changed the data?<br>
When was it changed?<br>
Are rules being followed?<br>

**In Databricks**<br>
**With Unity Catalog (Volumes)**:<br>
Databricks knows who accessed the file<br>
Databricks knows who modified it<br>
Permissions are enforced<br>

**With DBFS/FileStore**:<br>
No tracking<br>
No rules<br>
Anyone with workspace access can see it<br>

**2. Access Control**<br>
Access control = permission system<br>
Examples:<br>
Read only<br>
Write<br>
Full control<br>

a. In Databricks Volumes<br>
*Ex:- Only data_team can read*<br>
GRANT READ VOLUME ON VOLUME main.default.raw_data TO data_team; <br>

b. DBFS/FileStore<br>
No such control<br>
Everyone can access<br>

**3. Security (PROTECTING data)**<br>
Security = preventing unauthorized access or misuse<br>
Includes:<br>
Authentication (who you are)<br>
Authorization (what you can do)<br>
Auditing (what you did)<br>
Real-life analogy<br>
ATM:<br>
Card + PIN<br>
Logs every transaction<br>

a. In Databricks<br>
Volumes<br>
Encrypted<br>
Role-based access<br>
Audited<br>

b. DBFS<br>
Minimal protection<br>
No audit trail<br>

**4. Production Ready (SAFE for real business)**<br>
Simple meaning<br>
Production ready = safe for company-critical data<br>

a. In Databricks<br>
Volumes<br>
Designed for pipelines<br>
Safe for Jobs<br>
Used in production<br>

b. DBFS<br>
Only for testing<br>
Not supported for regulated workloads<br>

**5. SQL Support (Can SQL directly use it?)**<br>
Simple meaning<br>
Can I manage it using SQL?<br>
Example<br>
a. Volumes<br>
CREATE VOLUME main.default.sales_data;
SELECT * FROM csv.`/Volumes/main/default/sales_data/file.csv`;

b. DBFS
❌ SQL cannot CREATE or MANAGE DBFS locations

| Term             | DBFS/FileStore | Volumes  |
| ---------------- | -------------- | -------- |
| Governance       | ❌ None         | ✅ Full   |
| Access Control   | ❌ No           | ✅ Yes    |
| Security         | ❌ Weak         | ✅ Strong |
| Production Ready | ❌ No           | ✅ Yes    |
| SQL Support      | ❌ No           | ✅ Yes    |

Final Memory Trick

DBFS = Development / Demo
Volumes = Production / Governed


#Why production teams prefer Volumes for regulated data

What Is “Regulated Data”?<br>
Regulated data is information that is governed by legal, compliance, or internal policy rules, such as:

- Healthcare data (HIPAA)
- Financial data
- Personally Identifiable Information (PII)
- Customer or employee records
- Enterprise reporting data

For such data, control and traceability are mandatory, not optional.

1. Strong Governance (Mandatory for Regulation)<br>
What regulation requires<br>

- Know who accessed data
- Know who modified data
- Enforce company policies

How Volumes help<br>
Volumes are governed by Unity Catalog, which provides:<br>
- Centralized metadata
- Access tracking
- Audit logs

Why DBFS fails<br>
- No centralized governance
- No audit trail

✔ Regulators demand proof, and Volumes provide it

2. Fine-Grained Access Control (Least Privilege)
Requirement<br>
- Users must access only what they need
- No broad workspace access

Volumes support
GRANT READ VOLUME ON VOLUME main.default.phi_data TO analytics_team;
- Read-only access
- No accidental writes
- Controlled sharing

DBFS limitation
- Either full access or none
- No role-based control

✔ This is critical for compliance frameworks


3. Auditing & Compliance Evidence<br>
Auditors ask:<br>
Who read this file?<br>
When was it changed?<br>
Was access authorized?<br>
Volumes can answer these questions via:<br>
Unity Catalog audit logs<br>
Access history<br>

DBFS cannot.<br>
✔ Without audit logs, compliance fails<br>

| Requirement      | Volumes        | DBFS            |
| ---------------- | -------------- | --------------- |
| Governance       | ✅ Yes          | ❌ No            |
| Access control   | ✅ Fine-grained | ❌ None          |
| Security         | ✅ Strong       | ❌ Weak          |
| Auditing         | ✅ Available    | ❌ Not available |
| Production ready | ✅ Yes          | ❌ No            |
| Compliance ready | ✅ Yes          | ❌ No            |


##Data files to use in this usecase:
customer_csv = '''
101,Arun,31,Chennai,PREPAID
102,Meera,45,Bangalore,POSTPAID
103,Irfan,29,Hyderabad,PREPAID
104,Raj,52,Mumbai,POSTPAID
105,,27,Delhi,PREPAID
106,Sneha,abc,Pune,PREPAID
'''

usage_tsv = '''customer_id\tvoice_mins\tdata_mb\tsms_count
101\t320\t1500\t20
102\t120\t4000\t5
103\t540\t600\t52
104\t45\t200\t2
105\t0\t0\t0
'''

tower_logs_region1 = '''event_id|customer_id|tower_id|signal_strength|timestamp
5001|101|TWR01|-80|2025-01-10 10:21:54
5004|104|TWR05|-75|2025-01-10 11:01:12
'''
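
The three samples differ mainly in their field separator (comma, tab, pipe), which is what the `sep` read option has to match. As a rough illustration outside Spark, the sketch below parses shortened copies of the samples with Python's `csv` module; the helper name `parse` is an assumption made for this example.

```python
import csv
import io

# Shortened copies of the sample data above.
customer_csv = "101,Arun,31,Chennai,PREPAID\n105,,27,Delhi,PREPAID\n"
usage_tsv = "customer_id\tvoice_mins\tdata_mb\tsms_count\n101\t320\t1500\t20\n"
tower_logs = (
    "event_id|customer_id|tower_id|signal_strength|timestamp\n"
    "5001|101|TWR01|-80|2025-01-10 10:21:54\n"
)


def parse(text, sep):
    """Split delimited text into a list of rows using the given separator."""
    return list(csv.reader(io.StringIO(text), delimiter=sep))


customer_rows = parse(customer_csv, ",")
usage_rows = parse(usage_tsv, "\t")
tower_rows = parse(tower_logs, "|")

# Using the wrong separator leaves every line as one single field.
wrong_sep_rows = parse(tower_logs, ",")

print(customer_rows[1])        # the missing name shows up as an empty string
print(len(tower_rows[0]), len(wrong_sep_rows[0]))
```

The last line shows why a mismatched separator is easy to spot: the row collapses into a single column.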

##Create DataFrames from the Given Datasets


In [0]:
customer_df=spark.createDataFrame([(101,"Arun",31,"Chennai","PREPAID"),
    (102,"Meera",45,"Bangalore","POSTPAID"),
    (103,"Irfan",29,"Hyderabad","PREPAID"),
    (104,"Raj",52,"Mumbai","POSTPAID"),
    (105,None,27,"Delhi","PREPAID"),
    (106,"Sneha",None,"Pune","PREPAID")],["customer_id","name","age","city","plan_type"])

usage_df = spark.createDataFrame([
    (101,320,1500,20),
    (102,120,4000,5),
    (103,540,600,52),
    (104,45,200,2),
    (105,0,0,0)
], ["customer_id","voice_mins","data_mb","sms_count"])

tower_r1_df = spark.createDataFrame([
    (5001,101,"TWR01",-80,"2025-01-10 10:21:54"),
    (5004,104,"TWR05",-75,"2025-01-10 11:01:12")
], ["event_id","customer_id","tower_id","signal_strength","timestamp"])

#Tower Logs - Region 2 (empty)
tower_r2_df = spark.createDataFrame([], tower_r1_df.schema)



## Copy (Write) the Data into the Volume Folders

In [0]:
customer_df.write.mode("overwrite").csv(customer_path,header=True)
usage_df.write.mode("overwrite").csv(usage_path,header=True)
tower_r1_df.write.mode("overwrite").csv(tower_r1_path,header=True)
tower_r2_df.write.mode("overwrite").csv(tower_r2_path,header=True)
display(customer_df)
display(usage_df)
display(tower_r1_df)
display(tower_r2_df)

##Validate That Files Were Successfully Copied

In [0]:
dbutils.fs.ls(base_path)

In [0]:
#Check each dataset
dbutils.fs.ls(customer_path)
dbutils.fs.ls(usage_path)
dbutils.fs.ls(tower_r1_path)
dbutils.fs.ls(tower_r2_path)


##Read all tower logs using: 
- Path glob filter (example: *.csv)
- Multiple paths input
- Recursive lookup

In [0]:
%py
##Directory Read Use Cases Using Path Glob Filter (*.csv)
df1=spark.read.option("header","True").option("sep",",").csv(f"{base_path}/tower/*/*.csv")
df1.count()
df1.display(3)
df1.show()

In [0]:
#Multiple paths input
df1=spark.read.options(header="True",sep=",",inferSchema=True).csv(path=["dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/","dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2/"])
df1.count()
df1.display()

In [0]:
%py
##Recursive lookup
df1 = spark.read.csv(
    path = [f"{base_path}/tower/*"],
    inferSchema = True,
    header = True,
    sep = ",",
    pathGlobFilter = "*.csv",
    recursiveFileLookup = True
)
display(df1)

## pathGlobFilter<br>
Reads only files matching a pattern (e.g., *.csv) from subfolders.
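
A glob pattern is matched against the whole file name, so `*.csv` and `.csv` behave very differently. The sketch below uses Python's `fnmatch` module as an approximation of glob matching; Spark relies on Hadoop's glob syntax, so treat this as an illustration of the idea rather than Spark's own matcher. The file names are made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical directory listing; Spark output folders usually hold
# part files alongside marker files such as _SUCCESS.
names = ["part-00000.csv", "_SUCCESS", "_committed_123", "notes.txt"]


def glob_filter(file_names, pattern):
    """Keep only the names that match the glob pattern."""
    return [name for name in file_names if fnmatchcase(name, pattern)]


with_star = glob_filter(names, "*.csv")
without_star = glob_filter(names, ".csv")

print(with_star)
print(without_star)
```

Without the leading `*`, the pattern only matches a file literally named `.csv`, so nothing is selected.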

In [0]:
df1=spark.read.option("header","True").option("sep",",").option("pathGlobFilter","*.csv").option("inferSchema","True").csv(f"{base_path}/tower/*")
df1.count()

##Using List of Paths in spark.read.csv([path1, path2])

In [0]:
list_of_path=f"{base_path}/tower/*"
df1=spark.read.option("header","True").option("sep",",").option("pathGlobFilter","*.csv").csv(list_of_path)
df1.count()

In [0]:
path=["/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/","/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2/"]
df3=spark.read.options(header='True', inferSchema='True',sep=',',pathGlobFilter='*.csv',recursiveFileLookup='True').csv(path)
df3.count()

##Using recursiveFileLookup<br>
What it does<br>
Recursively reads all files in all subdirectories.<br>
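
The idea of a recursive lookup can be mimicked outside Spark with `os.walk`. The sketch below builds a throwaway folder tree shaped like the tower directory and compares a top-level-only listing with a recursive one. It is not Spark's file-listing code, and `list_files` is a hypothetical helper.

```python
import os
import tempfile
from fnmatch import fnmatchcase


def list_files(root, pattern="*", recursive=False):
    """Return matching file paths relative to root, optionally descending into subfolders."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if fnmatchcase(name, pattern):
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                matches.append(rel.replace(os.sep, "/"))
        if not recursive:
            break  # only inspect the top-level directory
    return sorted(matches)


with tempfile.TemporaryDirectory() as root:
    for rel in ("region1/part-0.csv", "region2/part-0.csv", "region1/_SUCCESS", "readme.txt"):
        full = os.path.join(root, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "w").close()

    top_level_only = list_files(root, "*.csv")
    all_levels = list_files(root, "*.csv", recursive=True)

print(top_level_only)
print(all_levels)
```

Combining the pattern with the recursive walk mirrors using `pathGlobFilter` together with `recursiveFileLookup`.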

In [0]:
df_recursive = spark.read.option("header", "true").option("sep", ",").option("pathGlobFilter", "*.csv").option("recursiveFileLookup", "true").csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/")
df_recursive.display()


##4. Schema Inference, Header, and Separator





In [0]:
df1=spark.read.format("csv").options(header="False",inferSchema="False").load("/Volumes/catalog2/database2/volume2/created_folder/custs_header")
df1.printSchema()
df1.limit(20).display()

**From Above results we see**<br>
- Column names are auto-generated (_c0, _c1, ‚Ä¶)<br>
- All columns are STRING<br>
- Header row is treated as data<br>

In [0]:
df1=spark.read.format("csv").options(header="True",inferSchema="True").load("/Volumes/catalog2/database2/volume2/created_folder/custs_header")
df1.printSchema()
df1.limit(20).display()

**From above results we see**<br>
- First row used as column names<br>
- Spark tries to detect data types<br>
- age becomes STRING (explained below)<br>

##What Changed When Using header and inferSchema?<br>
**Header**
| Value | Effect                         |
| ----- | ------------------------------ |
| false | First row treated as data      |
| true  | First row used as column names |

**inferSchema Option**
| Value | Effect                               |
| ----- | ------------------------------------ |
| false | All columns read as STRING           |
| true  | Spark samples data and assigns types |
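
The effect of the `header` option can be imitated with a few lines of plain Python. This is only a sketch of the behaviour described in the tables, assuming the `_c0, _c1, ...` naming seen in the results above; `read_rows` and the sample text are made up for the example.

```python
import csv
import io

sample = "customer_id,name,age\n101,Arun,31\n106,Sneha,abc\n"


def read_rows(text, header):
    """Return (column_names, data_rows), treating the first line as names only when header is True."""
    rows = list(csv.reader(io.StringIO(text)))
    if header:
        return rows[0], rows[1:]
    # No header: generate positional names and keep every line as data.
    names = [f"_c{i}" for i in range(len(rows[0]))]
    return names, rows


names_off, data_off = read_rows(sample, header=False)
names_on, data_on = read_rows(sample, header=True)

print(names_off, len(data_off))
print(names_on, len(data_on))
```

With `header=False` the header line is counted as a data row, which is why the row count is one higher.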

What Spark Does
- Spark samples multiple rows
- Sees numeric values: 31, 45, 29
- Sees non-numeric value: "abc"
- Cannot safely cast entire column to INTEGER
- Spark always chooses the safest common type
- It will not partially fail or cast invalid rows



**How did schema inference handle “abc” in age?**<br>
- How Schema Inference Works (Step by Step)
- Spark samples the data when inferSchema=true.
- It tries to determine a single data type that can hold all values.
- Spark will not partially fail or drop rows during inference.
- Since "abc" cannot be cast to an integer, Spark cannot safely choose INT.
- Spark falls back to STRING for the entire column.
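
The fallback described above can be sketched in plain Python. This is a deliberately simplified model with only three types; Spark's real inference covers more types and options. It also assumes that empty values are treated as nulls and do not influence the result. `infer_type` is a hypothetical name.

```python
def infer_type(values):
    """Pick the narrowest of int / double / string that can hold every non-empty value."""
    non_null = [v for v in values if v not in ("", None)]

    def all_parse(cast):
        try:
            for v in non_null:
                cast(v)
            return True
        except ValueError:
            return False

    if non_null and all_parse(int):
        return "int"
    if non_null and all_parse(float):
        return "double"
    return "string"  # safest common type when anything fails to parse


age_clean = infer_type(["31", "45", "29", "52", "27"])
age_with_abc = infer_type(["31", "45", "29", "52", "27", "abc"])

print(age_clean)
print(age_with_abc)
```

A single unparseable value is enough to push the whole column to string, which matches what the age column shows.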

Which two basic options have we learned so far for providing a schema (column names and data types)? inferSchema and toDF().<br>
There are two more options for defining the schema (column names and data types):<br>
1. Using a simple string format to define the schema.<br>
2. IMPORTANT: Using StructType to define the schema.<br>
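
An explicit schema changes what happens to a bad value such as "abc" in an integer column: instead of widening the column to string, the reader keeps the declared type. The sketch below models that in plain Python, assuming Spark's default PERMISSIVE mode, where a field that cannot be parsed is set to null. `apply_schema` and the tuple-based schema are illustrative stand-ins, not PySpark objects.

```python
# Hypothetical schema: (column name, Python cast) pairs standing in for
# "customer_id integer, name string, age integer, city string, plan_type string".
schema = [
    ("customer_id", int),
    ("name", str),
    ("age", int),
    ("city", str),
    ("plan_type", str),
]


def apply_schema(raw_row, schema):
    """Cast each raw string to its declared type; empty or unparseable values become None."""
    typed = {}
    for (name, cast), raw in zip(schema, raw_row):
        try:
            typed[name] = cast(raw) if raw != "" else None
        except ValueError:
            typed[name] = None
    return typed


good_row = apply_schema(["101", "Arun", "31", "Chennai", "PREPAID"], schema)
bad_age_row = apply_schema(["106", "Sneha", "abc", "Pune", "PREPAID"], schema)

print(good_row)
print(bad_age_row)
```

The trade-off is visible here: the column keeps its integer type, but the invalid value is lost unless it is captured some other way.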

In [0]:
#By default Spark names the columns _c0, _c1, ..., _cn; with toDF(colnames) we can define our own headers.
csv_df1=spark.read.csv("dbfs:///Volumes/catalog2/database2/volume2/created_folder/cust_1.txt").toDF("id","fname","lname","age","prof")
csv_df1.printSchema()
csv_df1.show()
csv_df2=spark.read.csv("dbfs:///Volumes/catalog2/database2/volume2/created_folder/cust_1.txt",inferSchema='True').toDF("id","fname","lname","age","prof")
csv_df2.printSchema()

#1. Using a simple string format to define a custom schema.
str_struct="id integer,fname string,lname string,age integer,prof string"
#header=True keeps the header line from being parsed as data (assumes custs_header has a header line, as the reads above do)
csv_df3=spark.read.schema(str_struct).option("header","True").csv("/Volumes/catalog2/database2/volume2/created_folder/custs_header")
csv_df3.printSchema()

from pyspark.sql.types import StructType,StructField,IntegerType,StringType
#2. Using StructType and StructField to define a custom schema.
schema=StructType([StructField("id",IntegerType(),True),StructField("fname",StringType(),True),StructField("lname",StringType(),True),StructField("age",IntegerType(),True),StructField("prof",StringType(),True)])
csv_df4=spark.read.schema(schema).option("header","True").csv("/Volumes/catalog2/database2/volume2/created_folder/custs_header")
csv_df4.printSchema()


##5. Column Renaming Usecases

1. Apply column names using string using<br> toDF function for customer data<br>

In [0]:
df1=spark.read.options(header='True',inferSchema='True',sep=',').csv("/Volumes/catalog2/database2/volume2/created_folder/custs_header").toDF("ID","FirstName","LastName","Age","Prof")
df1.printSchema()
df1.show(20)

2. Apply column names and datatype using the schema function for usage data

In [0]:
schema_fun="ID integer,FirstName string,LastName string,Age integer,Prof string"
#inferSchema is dropped here because an explicit schema takes precedence over inference
df1=spark.read.schema(schema_fun).options(header='True',sep=',').csv("/Volumes/catalog2/database2/volume2/created_folder/custs_header")
df1.printSchema()
df1.show(20)

3. Apply column names and datatype using the StructType with IntegerType, StringType, TimestampType and other classes for towers data

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema=StructType([StructField("ID",IntegerType(),True),StructField("FirstName",StringType(),True),StructField("LastName",StringType(),True),StructField("Age",IntegerType(),True),StructField("Prof",StringType(),True)])
#header=True keeps the header line from being parsed as data (assumes custs_header has a header line)
df1=spark.read.schema(schema).option("header","True").csv("/Volumes/catalog2/database2/volume2/created_folder/custs_header")
df1.printSchema()
df1.show(20)

We will use format to read the data

In [0]:
df1=spark.read.format("csv").options(header="True",inferSchema="True",sep=',').load("/Volumes/catalog2/database2/volume2/created_folder/patients.csv")
df1.printSchema()
df1.display()

## Spark Write Operations using 
- csv, json, orc, parquet, delta, saveAsTable, insertInto, xml with different write mode, header and sep options

##6. Write Operations (Data Conversion/Schema migration) - CSV Format Usecases
1. Write customer data into CSV format using overwrite mode
2. Write usage data into CSV format using append mode
3. Write tower data into CSV format with header enabled and custom separator (|)
4. Read the tower data in a dataframe and show only 5 rows.
5. Download the file into local from the catalog volume location and see the data of any of the above files opening in a notepad++.

**1. Write customer data into CSV format using overwrite mode**

In [0]:
df1_cust_csv=spark.read.format("csv").options(header="True",sep=',',inferSchema="True").load("/Volumes/workspace/default/processed/CASE3/Customer/customer.csv")

In [0]:
df1_cust_csv.write.format("csv").mode("overwrite").options(header="True",sep=",").save("/Volumes/workspace/default/processed/CASE3/Customer/cust_csv")

**Write usage data into CSV format using append mode**

In [0]:
df1_usage_csv=spark.read.csv("/Volumes/workspace/default/processed/CASE3/Usage/usage_day1.csv",header=True,inferSchema=True,sep=",")

In [0]:

df1_usage_csv.write.format("csv").mode("append").options(header="True",sep=",").save("/Volumes/workspace/default/processed/CASE3/Usage/usage_csv")


**Write tower data into CSV format with header enabled and custom separator (|)**

In [0]:
df_tower_csv=spark.read.csv("/Volumes/workspace/default/processed/CASE3/Tower/",header=True,inferSchema=True,sep=",")

In [0]:
df_tower_csv.write.format("csv").mode("overwrite").options(header="True",sep="|").save("/Volumes/workspace/default/processed/CASE3/Tower/tower_csv")

**4. Read the tower data in a dataframe and show only 5 rows.**

In [0]:
tower_data_read=spark.read.format("csv").options(header="True",inferSchema="True",sep="|").load("/Volumes/workspace/default/processed/CASE3/Tower/tower_csv")
tower_data_read.limit(5).display()

**5. Download the file into local from the catalog volume location and see the data of any of the above files opening in a notepad++.**

##7. Write Operations (Data Conversion/Schema migration) - JSON Format Usecases
1. Write customer data into JSON format using overwrite mode
2. Write usage data into JSON format using append mode and snappy compression format
3. Write tower data into JSON format using ignore mode and observe the behavior of this mode
4. Read the tower data in a dataframe and show only 5 rows.
5. Download the file into local harddisk from the catalog volume location and see the data of any of the above files opening in a notepad++.

**1. Write customer data into JSON format using overwrite mode**

In [0]:
#header is a CSV option and has no effect on JSON output, so it is omitted
df1_cust_csv.write.mode("overwrite").format("json").save("/Volumes/workspace/default/processed/CASE3/Customer/cust_json")

**2. Write usage data into JSON format using append mode and snappy compression format**

In [0]:
df1_usage_csv.write.format("json").mode("append").options(compression="snappy").save("/Volumes/workspace/default/processed/CASE3/Usage/usage_json")

**3. Write tower data into JSON format using ignore mode and observe the behavior of this mode**
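
The save modes used in these exercises differ only in what happens when the target location already exists. The sketch below captures that decision in plain Python as a mental model. `plan_write` is a hypothetical function, and the returned labels are just descriptive strings, not Spark output.

```python
def plan_write(mode, target_exists):
    """Describe what a write does for a given save mode and target state."""
    mode = mode.lower()
    if mode in ("error", "errorifexists"):
        if target_exists:
            raise FileExistsError("target already exists")
        return "write"
    if mode == "ignore":
        # Silently does nothing when data is already there.
        return "skip" if target_exists else "write"
    if mode == "append":
        return "add new files" if target_exists else "write"
    if mode == "overwrite":
        return "replace existing data" if target_exists else "write"
    raise ValueError(f"unknown save mode: {mode}")


for mode in ("overwrite", "append", "ignore"):
    print(mode, "->", plan_write(mode, target_exists=True))
```

Running an `ignore` write twice is therefore a quick way to observe the mode: the second run leaves the folder untouched.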

In [0]:
df_tower_csv.write.format("json").mode("ignore").save("/Volumes/workspace/default/processed/CASE3/Tower/tower_json")

**Read the tower data in a dataframe and show only 5 rows.**

In [0]:
tower_json=spark.read.format("json").load("/Volumes/workspace/default/raw/Data/jsontower/tower_json")
tower_json.limit(5).display()

**5. Download the file into local harddisk from the catalog volume location and see the data of any of the above files opening in a notepad++.**

##8. Write Operations (Data Conversion/Schema migration) - Parquet Format Usecases
1. Write customer data into Parquet format using overwrite mode and in a gzip format
2. Write usage data into Parquet format using error mode
3. Write tower data into Parquet format with gzip compression option
4. Read the usage data in a dataframe and show only 5 rows.
5. Download the file into local harddisk from the catalog volume location and see the data of any of the above files opening in a notepad++.

**Write customer data into Parquet format using overwrite mode and in a gzip format**

In [0]:
#header is a CSV option and has no effect on Parquet output, so it is omitted
df1_cust_csv.write.format("parquet").mode("overwrite").options(compression="gzip").save("/Volumes/workspace/default/processed/CASE3/Customer/cust_parquet")


**2. Write usage data into Parquet format using error mode**

In [0]:
df1_usage_csv.write.format("parquet").mode("error").options(compression="snappy").save("/Volumes/workspace/default/processed/CASE3/Usage/usage_parquet")

**3. Write tower data into Parquet format with gzip compression option**

In [0]:
df_tower_csv.write.format("parquet").mode("append").options(compression="gzip").save("/Volumes/workspace/default/processed/CASE3/Tower/tower_parquet")

**Read the usage data in a dataframe and show only 5 rows.**

In [0]:
df=spark.read.format("parquet").load("/Volumes/workspace/default/processed/CASE3/Usage/usage_parquet")
df.limit(5).display()

**5. Download the file into local harddisk from the catalog volume location and see the data of any of the above files opening in a notepad++.**

##9. Write Operations (Data Conversion/Schema migration) - Orc Format Usecases
1. Write customer data into ORC format using overwrite mode
2. Write usage data into ORC format using append mode
3. Write tower data into ORC format and see the output file structure
4. Read the usage data in a dataframe and show only 5 rows.
5. Download the file into local harddisk from the catalog volume location and see the data of any of the above files opening in a notepad++.

**Write customer data into ORC format using overwrite mode**

In [0]:
df1_cust_csv.write.mode("overwrite").format("orc").save("/Volumes/workspace/default/processed/CASE3/Customer/cust_orc")

**Write usage data into ORC format using append mode**

In [0]:
df1_usage_csv.write.format("orc").mode("append").save("/Volumes/workspace/default/processed/CASE3/Usage/usage_orc")

**Write tower data into ORC format and see the output file structure**

In [0]:
df_tower_csv.write.format("orc").save("/Volumes/workspace/default/processed/CASE3/Tower/tower_orc")

**Read the usage data in a dataframe and show only 5 rows.**

In [0]:
read_orc=spark.read.format("orc").load("/Volumes/workspace/default/processed/CASE3/Usage/usage_orc/")
read_orc.limit(5).display()

##10. Write Operations (Data Conversion/Schema migration) - Delta Format Usecases
1. Write customer data into Delta format using overwrite mode
2. Write usage data into Delta format using append mode
3. Write tower data into Delta format and see the output file structure
4. Read the usage data in a dataframe and show only 5 rows.
5. Download the file into local harddisk from the catalog volume location and see the data of any of the above files opening in a notepad++.
6. Compare the parquet location and delta location and try to understand what is the differentiating factor, as both are parquet files only.

**Write customer data into Delta format using overwrite mode**

In [0]:
df1_cust_csv.write.mode("overwrite").format("delta").save("/Volumes/workspace/default/processed/CASE3/Customer/cust_delta")

**Write usage data into Delta format using append mode**

In [0]:
df1_usage_csv.write.mode("append").format("delta").save("/Volumes/workspace/default/processed/CASE3/Usage/usage_delta")

**Write tower data into Delta format and see the output file structure**

In [0]:
df_tower_csv.write.format("delta").save("/Volumes/workspace/default/processed/CASE3/Tower/tower_delta")

**Read the usage data in a dataframe and show only 5 rows.**

In [0]:
#The task asks for the usage data, so read the usage_delta location written above
read_delta_df=spark.read.format("delta").load("/Volumes/workspace/default/processed/CASE3/Usage/usage_delta")
display(read_delta_df.limit(5))

**5. Download the file into local harddisk from the catalog volume location and see the data of any of the above files opening in a notepad++.**

## **Compare the parquet location and delta location and try to understand what is the differentiating factor, as both are parquet files only.**

-- Parquet Location vs Delta Location
1. The core confusion (and the correct answer)<br>

Yes, Delta tables store data as Parquet files.<br>
But Parquet alone ≠ Delta.<br>

The differentiating factor is the transaction log (_delta_log).<br>
That _delta_log folder is the entire difference.<br>

3. What Parquet-only gives you<br>
Parquet is a file format, nothing more<br>
It provides:<br>
-- Columnar storage<br>
-- Compression<br>
-- Faster reads<br>

It does NOT provide:<br>
-- ACID transactions<br>
-- Versioning<br>
-- Schema enforcement<br>
-- Concurrent write safety<br>
-- Time travel<br>

4. What Delta adds on top of Parquet<br>

Delta = Parquet files + Transaction Log<br>

The _delta_log provides<br>

| Capability              | Parquet Only | Delta |
| ----------------------- | ------------ | ----- |
| Columnar storage        | Yes          | Yes   |
| Compression             | Yes          | Yes   |
| Schema enforcement      | No           | Yes   |
| ACID transactions       | No           | Yes   |
| Time travel             | No           | Yes   |
| MERGE / UPDATE / DELETE | No           | Yes   |
| Concurrent writes       | Unsafe       | Safe  |
| Rollback                | No           | Yes   |
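
To see why a log is enough to provide versioning and rollback, consider a toy model in which every commit records which data files were added and removed. The sketch below is a simplified illustration of that idea. It is not the real Delta protocol, and `ToyDeltaLog` and the file names are invented for the example.

```python
import json


class ToyDeltaLog:
    """Toy stand-in for a _delta_log folder: one JSON document per commit."""

    def __init__(self):
        self.commits = []  # list index doubles as the version number

    def commit(self, add=(), remove=()):
        """Record a commit and return its version number."""
        self.commits.append(json.dumps({"add": list(add), "remove": list(remove)}))
        return len(self.commits) - 1

    def files_at(self, version=None):
        """Replay the log up to a version to find the data files that make up the table."""
        if version is None:
            version = len(self.commits) - 1
        live = set()
        for raw in self.commits[: version + 1]:
            action = json.loads(raw)
            live.update(action["add"])
            live.difference_update(action["remove"])
        return sorted(live)


log = ToyDeltaLog()
v0 = log.commit(add=["part-000.parquet"])                      # initial write
v1 = log.commit(add=["part-001.parquet"])                      # append
v2 = log.commit(add=["part-002.parquet"],
                remove=["part-000.parquet", "part-001.parquet"])  # overwrite

print(log.files_at(v1))   # table as of version 1
print(log.files_at())     # latest version
```

In this model the old Parquet files are never edited; an overwrite simply stops referencing them, which is why an earlier version can still be reconstructed.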


##11. Write Operations (Lakehouse Usecases) - Delta table Usecases
1. Write customer data using saveAsTable() as a managed table
2. Write usage data using saveAsTable() with overwrite mode
3. Drop the managed table and verify data removal
4. Go and check the table overview and realize it is in delta format in the Catalog.
5. Use spark.sql to write some simple queries on the above tables created.


**Write customer data using saveAsTable() as a managed table**

In [0]:
%sql
CREATE CATALOG if not EXISTS training;


In [0]:
%sql
CREATE SCHEMA IF NOT EXISTS training.raw;


In [0]:
df1_cust_csv.write.saveAsTable("training.raw.customers")

**2. Write usage data using saveAsTable() with overwrite mode**

In [0]:
df1_usage_csv.write.mode("overwrite").saveAsTable("training.raw.usage")

**Drop the managed table and verify data removal**

In [0]:
%sql
DROP TABLE training.raw.usage

**4. Go and check the table overview and realize it is in delta format in the Catalog.**<br>
->Yes

**Use spark.sql to write some simple queries on the above tables created.**

In [0]:
display(spark.sql("select * from customers"))

In [0]:
display(spark.sql("select customer_name,city,plan_type from customers where customer_id='C001'"))

In [0]:
spark.sql("select * from customers").show()

##12. Write Operations (Lakehouse Usecases) - Delta table Usecases
1. Write customer data using insertInto() in a new table and find the behavior
2. Write usage data using insertInto() with overwrite mode

**1. Write customer data using insertInto() in a new table and find the behavior**
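
Two behaviours are worth keeping in mind for this exercise: `insertInto()` expects the target table to exist already, and it matches columns by position rather than by name, whereas `saveAsTable()` in append mode matches them by name. The plain-Python sketch below illustrates only the position-versus-name difference; both function names are hypothetical.

```python
def insert_by_position(table_columns, df_columns, df_row):
    """Model of position-based resolution: DataFrame column names are ignored."""
    if len(df_columns) != len(table_columns):
        raise ValueError("column count mismatch")
    return dict(zip(table_columns, df_row))


def insert_by_name(table_columns, df_columns, df_row):
    """Model of name-based resolution: values follow their column names."""
    by_name = dict(zip(df_columns, df_row))
    return {column: by_name[column] for column in table_columns}


table_columns = ["customer_id", "name"]
df_columns = ["name", "customer_id"]   # same columns, different order
df_row = ("Arun", 101)

by_position = insert_by_position(table_columns, df_columns, df_row)
by_name = insert_by_name(table_columns, df_columns, df_row)

print(by_position)   # values land in the wrong columns
print(by_name)
```

When the DataFrame's column order differs from the table's, position-based insertion can silently misplace values, so it is worth selecting columns in table order first.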

In [0]:
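#insertInto() requires the target table to exist already; against a new table such as customers_new this call fails with a table-not-found error.
df1_cust_csv.write.insertInto("training.raw.customers_new",overwrite=True)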

**Write usage data using insertInto() with overwrite mode**

In [0]:
df1_usage_csv.write.insertInto("training.raw.usage",overwrite=True)

##13. Write Operations (Data Conversion/Schema migration) - XML Format Usecases
1. Write customer data into XML format using rowTag as cust
2. Write usage data into XML format using overwrite mode with the rowTag as usage
3. Download the xml data and open the file in notepad++ and see how the xml file looks like.

**Write customer data into XML format using rowTag as cust**

In [0]:
df1_cust_csv.write.mode("overwrite").xml("/Volumes/workspace/default/processed/CASE3/Customer/customer_xml",rowTag="cust")

**Write usage data into XML format using overwrite mode with the rowTag as usage**

In [0]:
df1_usage_csv.write.mode("overwrite").xml("/Volumes/workspace/default/processed/CASE3/Usage/usage_xml",rowTag="usage")

**Download the xml data and open the file in notepad++ and see how the xml file looks like.**
- Yes, Done

##14. Compare all the downloaded files (csv, json, orc, parquet, delta and xml) <br>
1. Capture the size occupied between all of these file formats and list the formats below based on the order of size from small to big.<br>

*Original **customer.csv** file size is 286.00 B*
- cust_csv------>280.00 B--->No compression<br>
- cust_json----->602.00 B--->No compression<br>
- cust_orc------>1 KB------->Snappy compression<br>
- cust_xml------>1.19 KB---->No compression<br>
- cust_delta---->1.69 KB---->Snappy compression<br>
- cust_parquet-->2.04 KB---->gzip compression<br>

Compression reduces file size only when the data volume is large enough to amortize the format overhead.<br>
Format overhead is extra information stored in a file that is not your actual data, such as:<br>
- Schema definitions<br>
- Column metadata: extra information about each column, such as the minimum value, maximum value, and number of nulls<br>
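
The overhead effect is easy to reproduce with text formats and gzip alone. The sketch below serializes a few of the sample customer rows as CSV, JSON lines, and a simple XML layout, then compresses a tiny and a large CSV payload. The XML layout and helper names are assumptions made for the example, and the byte counts will not match the Spark output sizes listed above.

```python
import csv
import gzip
import io
import json

columns = ["customer_id", "name", "age", "city", "plan_type"]
rows = [
    (101, "Arun", 31, "Chennai", "PREPAID"),
    (102, "Meera", 45, "Bangalore", "POSTPAID"),
    (103, "Irfan", 29, "Hyderabad", "PREPAID"),
]


def to_csv(data):
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(data)
    return buf.getvalue().encode("utf-8")


def to_json_lines(data):
    # Column names are repeated on every line.
    return "".join(json.dumps(dict(zip(columns, r))) + "\n" for r in data).encode("utf-8")


def to_xml(data):
    # Every value is wrapped in an opening and a closing tag.
    body = "".join(
        "<cust>" + "".join(f"<{c}>{v}</{c}>" for c, v in zip(columns, r)) + "</cust>"
        for r in data
    )
    return f"<rows>{body}</rows>".encode("utf-8")


sizes = {"csv": len(to_csv(rows)), "json": len(to_json_lines(rows)), "xml": len(to_xml(rows))}
print(sizes)

tiny = to_csv(rows[:1])
large = to_csv(rows * 1000)
print(len(tiny), len(gzip.compress(tiny)))     # compression makes the tiny file bigger
print(len(large), len(gzip.compress(large)))   # and the large file much smaller
```

The gzip header and trailer alone cost a fixed number of bytes, which a one-row file cannot win back.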

3. Format-by-format explanation<br>
**🔹 CSV (286 B → 280 B)**<br>
- Row-based<br>
- No schema storage<br>
- No metadata<br>
- Almost zero overhead<br>
- This is why it stays smallest.<br>

**JSON (602 B, no compression)**<br>
- Why larger than CSV<br>
- Repeats column names on every row<br>
- Uses structural characters { } : , "<br>
- UTF-8 text<br>
- JSON is self-describing, which costs space<br>

**🔹 XML (1.19 KB, no compression)**<br>
- Why even larger<br>
- Opening and closing tags<br>
- Deep verbosity<br>
- Repeated tag names<br>

**🔹 Parquet (2.04 KB, gzip)**<br>
Why bigger despite gzip<br>
- Columnar format<br>
- Schema<br>
- Column metadata<br>

**ORC (1 KB, Snappy)**<br>
Same story as Parquet<br>

**🔹 Delta (1.69 KB, Snappy)**<br>
Delta is not just a file format.<br>

It includes:<br>
- Parquet files<br>
- _delta_log JSON transaction log<br>
- Versioning metadata<br>
- ACID guarantees<br>
- Even for one row, Delta must create logs.<br>
- This is why Delta is never optimal for tiny datasets.<br>

###15. Try to do permutation and combination of performing Schema Migration & Data Conversion operations like...
1. Read any one of the above orc data in a dataframe and write it to dbfs in a parquet format
2. Read any one of the above parquet data in a dataframe and write it to dbfs in a delta format
3. Read any one of the above delta data in a dataframe and write it to dbfs in a xml format
4. Read any one of the above delta table in a dataframe and write it to dbfs in a json format
5. Read any one of the above delta table in a dataframe and write it to another table

**1. Read any one of the above orc data in a dataframe and write it to dbfs in a parquet format**

In [0]:
#Read any one of the above orc data in a dataframe
orc_df=spark.read.format("orc").load("/Volumes/workspace/default/processed/CASE3/Customer/cust_orc")

In [0]:
#write it to dbfs in a parquet format
orc_df.write.parquet("/Volumes/workspace/default/processed/CASE3/Customer/cust_parquet_write")

**2. Read any one of the above parquet data in a dataframe and write it to dbfs in a delta format**

In [0]:
#Read any one of the above parquet data in a dataframe
parquet_df=spark.read.format("parquet").load("/Volumes/workspace/default/processed/CASE3/Customer/cust_parquet/")

In [0]:
#write it to dbfs in a delta format
parquet_df.write.format("delta").save("/Volumes/workspace/default/processed/CASE3/Customer/cust_delta_write")

**3. Read any one of the above delta data in a dataframe and write it to dbfs in a xml format**

In [0]:
#Read any one of the above delta data in a dataframe
delta_df=spark.read.format('delta').load("/Volumes/workspace/default/processed/CASE3/Customer/cust_delta/")

In [0]:
#write it to dbfs in a xml format
delta_df.write.xml("/Volumes/workspace/default/processed/CASE3/Customer/cust_xml_write",rowTag="cust")

**4. Read any one of the above delta table in a dataframe and write it to dbfs in a json format**

In [0]:
# Read any one of the above delta table in a dataframe
delta_df=spark.read.format('delta').load("/Volumes/workspace/default/processed/CASE3/Customer/cust_delta/")

In [0]:
#write it to dbfs in a json format
delta_df.write.json("/Volumes/workspace/default/processed/CASE3/Customer/cust_json_write")

**5. Read any one of the above delta table in a dataframe and write it to another table**


In [0]:
#Read any one of the above delta table in a dataframe
df_delta_src=spark.table("workspace.default.customers")

In [0]:
#write it to another table
df_delta_src.write.format('delta').saveAsTable("workspace.default.customers_delta")

## 16. Final Exercise: When to Use Each File Format (Simple One-Liners)

### 1. CSV (Comma-Separated Values)
**When to use / Benefits:**
Data is organized in rows and columns<br>  
When to use<br>
- Small files<br>
- Learning, testing, or debugging<br>

Benefits<br>
- Very easy to read<br>
- Opens in Excel<br>
- Almost no overhead<br>

---

### 2. JSON (JavaScript Object Notation)
**When to use / Benefits:**  
Use JSON when data is semi-structured or hierarchical (nested). It is flexible, self-describing, and commonly used for APIs and data exchange.<br>
- Data coming from APIs<br>
- Web applications<br>
- Data has nested information (object inside object)<br>

---

### 3. ORC (Optimized Row Columnar)
**When to use / Benefits:**  
- High compression<br>
- Fast reads<br>
- Stores column statistics<br>


---

### 4. Parquet
**When to use / Benefits:**  
- Column-based (reads only required columns)<br>
- Good compression<br>
- Very fast for analytics<br>

---

### 5. Delta (Delta Lake format)
**When to use / Benefits:**  
Use Delta when you need reliability on a data lake. It adds ACID transactions, schema enforcement, time travel, and data versioning on top of Parquet.

---

### 6. XML
**When to use / Benefits:**  
Use XML when working with legacy systems or strict document-based integrations. It is highly structured and self-describing but very verbose and inefficient for large data processing.

---

### 7. Delta Tables
**When to use / Benefits:**  
- ACID transactions<br>
- Schema enforcement<br>
- Handles large-scale data safely<br>
