#Telecom Domain ReadOps Assignment
This notebook contains assignments to practice Spark read options and Databricks volumes. <br>
Sections: Sample data creation, Catalog & Volume creation, Copying data into Volumes, Path glob/recursive reads, toDF() column renaming variants, inferSchema/header/separator experiments, and exercises.<br>

![](https://theciotimes.com/wp-content/uploads/2021/03/TELECOM1.jpg)

##First Import all required libraries & Create spark session object

In [0]:
from pyspark.sql import SparkSession

# Databricks notebooks already expose a `spark` session; getOrCreate() returns that existing session.
spark = SparkSession.builder.appName("Spark DataFrames").getOrCreate()
print(spark)


##Write SQL statements to create:

###A catalog named telecom_catalog_assign

In [0]:
%sql
CREATE CATALOG IF NOT EXISTS telecom_catalog_assign

### Create A schema landing_zone

In [0]:
%sql
CREATE SCHEMA IF NOT EXISTS telecom_catalog_assign.landing_zone

###Create A volume landing_vol

In [0]:
%sql
CREATE VOLUME IF NOT EXISTS telecom_catalog_assign.landing_zone.landing_vol

### Using dbutils.fs.mkdirs, create folders
- /Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/
- /Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/
- /Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/ (with region1/ and region2/ subfolders)

In [0]:
base_path = "/Volumes/telecom_catalog_assign/landing_zone/landing_vol"

customer_path = f"{base_path}/customer/"
usage_path    = f"{base_path}/usage/"
tower_r1_path = f"{base_path}/tower/region1/"
tower_r2_path = f"{base_path}/tower/region2/"

dbutils.fs.mkdirs(customer_path)
dbutils.fs.mkdirs(usage_path)
dbutils.fs.mkdirs(tower_r1_path)
dbutils.fs.mkdirs(tower_r2_path)


In [0]:
dbutils.fs.ls("/Volumes/telecom_catalog_assign/landing_zone/landing_vol")

#Volume vs DBFS/FileStore

DBFS/FileStore is a legacy, workspace-scoped file system mainly for demos, while Volumes are Unity Catalog–governed, secure, production-ready storage for files.

##Volumes
Volumes are a Unity Catalog–managed storage layer that provides secure, governed access to files.<br>

**1. Governance**<br>
Governance is control + rules + tracking.<br>
It answers:<br>
Who can access the data?<br>
Who changed the data?<br>
When was it changed?<br>
Are rules being followed?<br>

**In Databricks**<br>
**With Unity Catalog (Volumes)**:<br>
Databricks knows who accessed the file<br>
Databricks knows who modified it<br>
Permissions are enforced<br>

**With DBFS/FileStore**:<br>
No Unity Catalog tracking of who read or changed a file<br>
No Unity Catalog permission rules<br>
Anyone with workspace access can see files in the DBFS root<br>

**2. Access Control**<br>
Access control = permission system<br>
Examples:<br>
Read only<br>
Write<br>
Full control<br>

a. In Databricks Volumes<br>
*Ex:- Only data_team can read*<br>
GRANT READ VOLUME ON VOLUME main.default.raw_data TO data_team; <br>

b. DBFS/FileStore<br>
No Unity Catalog grants apply<br>
Every workspace user can access the DBFS root<br>

**3. Security (PROTECTING data)**<br>
Security = preventing unauthorized access or misuse<br>
Includes:<br>
Authentication (who you are)<br>
Authorization (what you can do)<br>
Auditing (what you did)<br>
Real-life analogy<br>
ATM:<br>
Card + PIN<br>
Logs every transaction<br>

a. In Databricks<br>
Volumes<br>
Encrypted<br>
Role-based access<br>
Audited<br>

b. DBFS<br>
Minimal protection<br>
No Unity Catalog audit trail for file access<br>

**4. Production Ready (SAFE for real business)**<br>
Simple meaning<br>
Production ready = safe for company-critical data<br>

a. In Databricks<br>
Volumes<br>
Designed for pipelines<br>
Safe for Jobs<br>
Used in production<br>

b. DBFS<br>
Only for testing<br>
Not supported for regulated workloads<br>

**5. SQL Support (Can SQL directly use it?)**<br>
Simple meaning<br>
Can I manage it using SQL?<br>
Example<br>
a. Volumes<br>
CREATE VOLUME main.default.sales_data;
SELECT * FROM csv.`/Volumes/main/default/sales_data/file.csv`;

b. DBFS
❌ SQL cannot CREATE or MANAGE DBFS locations

| Term             | DBFS/FileStore | Volumes  |
| ---------------- | -------------- | -------- |
| Governance       | ❌ None         | ✅ Full   |
| Access Control   | ❌ No           | ✅ Yes    |
| Security         | ❌ Weak         | ✅ Strong |
| Production Ready | ❌ No           | ✅ Yes    |
| SQL Support      | ❌ No           | ✅ Yes    |

**Final Memory Trick**<br>
DBFS = Development / Demo<br>
Volumes = Production / Governed<br>


#Why production teams prefer Volumes for regulated data

What Is “Regulated Data”?<br>
Regulated data is information that is governed by legal, compliance, or internal policy rules, such as:

- Healthcare data (HIPAA)
- Financial data
- Personally Identifiable Information (PII)
- Customer or employee records
- Enterprise reporting data

For such data, control and traceability are mandatory, not optional.

1. Strong Governance (Mandatory for Regulation)<br>
What regulation requires<br>

- Know who accessed data
- Know who modified data
- Enforce company policies

How Volumes help<br>
Volumes are governed by Unity Catalog, which provides:<br>
- Centralized metadata
- Access tracking
- Audit logs

Why DBFS fails<br>
- No centralized governance
- No audit trail

✔ Regulators demand proof — Volumes provide it

2. Fine-Grained Access Control (Least Privilege)
Requirement<br>
- Users must access only what they need
- No broad workspace access

Volumes support<br>
GRANT READ VOLUME ON VOLUME main.default.phi_data TO analytics_team;
- Read-only access
- No accidental writes
- Controlled sharing

DBFS limitation
- Either full access or none
- No role-based control

✔ This is critical for compliance frameworks


3. Auditing & Compliance Evidence<br>
Auditors ask:<br>
Who read this file?<br>
When was it changed?<br>
Was access authorized?<br>
Volumes can answer these questions via:<br>
Unity Catalog audit logs<br>
Access history<br>

DBFS cannot.<br>
✔ Without audit logs, compliance fails<br>

| Requirement      | Volumes        | DBFS            |
| ---------------- | -------------- | --------------- |
| Governance       | ✅ Yes          | ❌ No            |
| Access control   | ✅ Fine-grained | ❌ None          |
| Security         | ✅ Strong       | ❌ Weak          |
| Auditing         | ✅ Available    | ❌ Not available |
| Production ready | ✅ Yes          | ❌ No            |
| Compliance ready | ✅ Yes          | ❌ No            |


##Data files to use in this use case:
customer_csv = '''
101,Arun,31,Chennai,PREPAID
102,Meera,45,Bangalore,POSTPAID
103,Irfan,29,Hyderabad,PREPAID
104,Raj,52,Mumbai,POSTPAID
105,,27,Delhi,PREPAID
106,Sneha,abc,Pune,PREPAID
'''

usage_tsv = '''customer_id\tvoice_mins\tdata_mb\tsms_count
101\t320\t1500\t20
102\t120\t4000\t5
103\t540\t600\t52
104\t45\t200\t2
105\t0\t0\t0
'''

tower_logs_region1 = '''event_id|customer_id|tower_id|signal_strength|timestamp
5001|101|TWR01|-80|2025-01-10 10:21:54
5004|104|TWR05|-75|2025-01-10 11:01:12
'''
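
The raw strings above can also be landed as files exactly as written, which keeps the tab and pipe separators and the `abc` age value available for later read experiments. The following is a minimal sketch using plain Python file I/O. It assumes a temporary directory as a stand-in for the volume path, and the helper `write_text_file`, the variable `landing_root`, and the file names are hypothetical choices made for illustration.

```python
import tempfile
from pathlib import Path

# Raw datasets copied from the assignment text above.
customer_csv = '''
101,Arun,31,Chennai,PREPAID
102,Meera,45,Bangalore,POSTPAID
103,Irfan,29,Hyderabad,PREPAID
104,Raj,52,Mumbai,POSTPAID
105,,27,Delhi,PREPAID
106,Sneha,abc,Pune,PREPAID
'''

usage_tsv = '''customer_id\tvoice_mins\tdata_mb\tsms_count
101\t320\t1500\t20
102\t120\t4000\t5
103\t540\t600\t52
104\t45\t200\t2
105\t0\t0\t0
'''

tower_logs_region1 = '''event_id|customer_id|tower_id|signal_strength|timestamp
5001|101|TWR01|-80|2025-01-10 10:21:54
5004|104|TWR05|-75|2025-01-10 11:01:12
'''


def write_text_file(folder: Path, name: str, content: str) -> Path:
    """Create the folder if needed and write the content to folder/name."""
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / name
    # Drop the leading newline that the triple-quoted customer string starts with.
    target.write_text(content.lstrip("\n"), encoding="utf-8")
    return target


# Assumption: a temporary directory stands in for
# /Volumes/telecom_catalog_assign/landing_zone/landing_vol.
landing_root = Path(tempfile.mkdtemp())

customer_file = write_text_file(landing_root / "customer", "customer.csv", customer_csv)
usage_file = write_text_file(landing_root / "usage", "usage.tsv", usage_tsv)
tower_file = write_text_file(
    landing_root / "tower" / "region1", "tower_logs_region1.csv", tower_logs_region1
)

# List what was written, relative to the stand-in root.
print(sorted(p.relative_to(landing_root).as_posix()
             for p in landing_root.rglob("*") if p.is_file()))
```

Files landed this way would need `sep="\t"` for the usage data and `sep="|"` for the tower logs when read back with Spark.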

##Create DataFrames from the Given Datasets


In [0]:
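# Note: the raw customer data has age "abc" for customer 106. None is used here because
# createDataFrame expects a single type per column when it infers the schema.
customer_df = spark.createDataFrame([
    (101, "Arun", 31, "Chennai", "PREPAID"),
    (102, "Meera", 45, "Bangalore", "POSTPAID"),
    (103, "Irfan", 29, "Hyderabad", "PREPAID"),
    (104, "Raj", 52, "Mumbai", "POSTPAID"),
    (105, None, 27, "Delhi", "PREPAID"),
    (106, "Sneha", None, "Pune", "PREPAID")
], ["customer_id", "name", "age", "city", "plan_type"])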

usage_df = spark.createDataFrame([
    (101,320,1500,20),
    (102,120,4000,5),
    (103,540,600,52),
    (104,45,200,2),
    (105,0,0,0)
], ["customer_id","voice_mins","data_mb","sms_count"])

tower_r1_df = spark.createDataFrame([
    (5001,101,"TWR01",-80,"2025-01-10 10:21:54"),
    (5004,104,"TWR05",-75,"2025-01-10 11:01:12")
], ["event_id","customer_id","tower_id","signal_strength","timestamp"])

#Tower Logs – Region 2 (empty)
tower_r2_df = spark.createDataFrame([], tower_r1_df.schema)



## Copy (Write) the Data into the Volume Folders

In [0]:
customer_df.write.mode("overwrite").csv(customer_path,header=True)
usage_df.write.mode("overwrite").csv(usage_path,header=True)
tower_r1_df.write.mode("overwrite").csv(tower_r1_path,header=True)
tower_r2_df.write.mode("overwrite").csv(tower_r2_path,header=True)
display(customer_df)
display(usage_df)
display(tower_r1_df)
display(tower_r2_df)

##Validate That Files Were Successfully Copied

In [0]:
dbutils.fs.ls(base_path)

In [0]:
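# Check each dataset (a cell renders only its last expression, so wrap each listing in display())
display(dbutils.fs.ls(customer_path))
display(dbutils.fs.ls(usage_path))
display(dbutils.fs.ls(tower_r1_path))
display(dbutils.fs.ls(tower_r2_path))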


##Read all tower logs using: 
- Path glob filter (example: *.csv)
- Multiple paths input
- Recursive lookup

In [0]:
%py
# Directory read using a path glob (*.csv)
df1 = spark.read.option("header", "true").option("sep", ",").csv(f"{base_path}/tower/*/*.csv")
print(df1.count())
display(df1.limit(3))  # show the first three rows
df1.show()

In [0]:
# Multiple paths input: pass a list of folders (the region paths defined earlier)
df1 = spark.read.options(header="true", sep=",", inferSchema="true").csv([tower_r1_path, tower_r2_path])
print(df1.count())
display(df1)

In [0]:
%py
##Recursive lookup
df1 = spark.read.csv(
    path = [f"{base_path}/tower/*"],
    inferSchema = True,
    header = True,
    sep = ",",
    pathGlobFilter = "*.csv",
    recursiveFileLookup = True
)
display(df1)

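## pathGlobFilter<br>
Keeps only the files whose names match a glob pattern (e.g., *.csv). It filters file names only; reaching files in subfolders still needs a glob in the path or recursiveFileLookup.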
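
To see why the pattern has to be a glob such as `*.csv` rather than a bare `.csv`, here is a small plain-Python sketch. It uses `fnmatchcase` from the standard library as a stand-in for Spark's file-name matching, which is an assumption made for illustration only. The names `file_names` and `filter_by_glob` are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical file names as they might appear in a tower folder after a Spark write.
file_names = ["part-00000.csv", "part-00001.csv", "_SUCCESS", "notes.txt"]


def filter_by_glob(names, pattern):
    """Keep only the names that match the glob pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


with_wildcard = filter_by_glob(file_names, "*.csv")
without_wildcard = filter_by_glob(file_names, ".csv")

print(with_wildcard)     # ['part-00000.csv', 'part-00001.csv']
print(without_wildcard)  # []
```

A pattern without a wildcard only matches a file literally named `.csv`, so it selects nothing here.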

In [0]:
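# The option name is pathGlobFilter and the pattern needs a wildcard; Spark silently ignores unknown option names.
df1 = spark.read.option("header", "true").option("sep", ",").option("pathGlobFilter", "*.csv").option("inferSchema", "true").csv(f"{base_path}/tower/*")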
df1.count()

##Using List of Paths in spark.read.csv([path1, path2])

In [0]:
list_of_paths = [tower_r1_path, tower_r2_path]
df1 = spark.read.option("header", "true").option("sep", ",").option("pathGlobFilter", "*.csv").csv(list_of_paths)
df1.count()

In [0]:
path=["/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/","/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2/"]
df3 = spark.read.options(header="true", inferSchema="true", sep=",", pathGlobFilter="*.csv", recursiveFileLookup="true").csv(path)
df3.count()

##Using recursiveFileLookup<br>
What it does<br>
Reads files from the given directory and from every subdirectory beneath it. Enabling it also turns off partition discovery.<br>
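
The difference between a top-level listing and a recursive one can be sketched in plain Python. This is only an analogy for the Spark option, not Spark's own file listing code. It assumes a temporary folder laid out like the tower directory, and the nested `archive` folder and the helper `list_csv` are hypothetical.

```python
import tempfile
from pathlib import Path

# Build a throwaway folder tree shaped like tower/region1 and tower/region2.
root = Path(tempfile.mkdtemp()) / "tower"
for rel in ["region1/part-0000.csv",
            "region2/part-0000.csv",
            "region2/archive/part-0001.csv"]:
    file_path = root / rel
    file_path.parent.mkdir(parents=True, exist_ok=True)
    file_path.write_text("event_id,customer_id\n", encoding="utf-8")


def list_csv(folder, recursive):
    """List *.csv files directly in the folder, or at any depth when recursive=True."""
    matches = folder.rglob("*.csv") if recursive else folder.glob("*.csv")
    return sorted(p.relative_to(folder).as_posix() for p in matches)


top_level = list_csv(root, recursive=False)
all_levels = list_csv(root, recursive=True)

print(top_level)   # []
print(all_levels)  # ['region1/part-0000.csv', 'region2/archive/part-0001.csv', 'region2/part-0000.csv']
```

Nothing sits directly under `tower/`, so the non-recursive listing is empty, while the recursive one reaches files at every depth.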

In [0]:
df_recursive = spark.read.option("header", "true").option("sep", ",").option("pathGlobFilter", "*.csv").option("recursiveFileLookup", "true").csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/")
df_recursive.display()


##4. Schema Inference, Header, and Separator





In [0]:
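# The format name is "csv" (no leading dot); load() reads the path with the chosen format.
df1 = spark.read.format("csv").options(header="false", inferSchema="false").load("/Volumes/catalog2/database2/volume2/created_folder/custs_header")
df1.printSchema()  # printSchema() prints directly and returns None
display(df1.limit(20))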

**From Above results we see**<br>
- Column names are auto-generated (_c0, _c1, …)<br>
- All columns are STRING<br>
- Header row is treated as data<br>

In [0]:
# Same read, now using the first row as column names and letting Spark infer types.
df1 = spark.read.format("csv").options(header="true", inferSchema="true").load("/Volumes/catalog2/database2/volume2/created_folder/custs_header")
df1.printSchema()
display(df1.limit(20))

**From above results we see**<br>
- First row used as column names<br>
- Spark tries to detect data types<br>
- age becomes STRING (explained below)<br>

##What Changed When Using header and inferSchema?

**header option**

| Value | Effect                         |
| ----- | ------------------------------ |
| false | First row treated as data      |
| true  | First row used as column names |

**inferSchema option**

| Value | Effect                                       |
| ----- | -------------------------------------------- |
| false | All columns read as STRING                   |
| true  | Spark reads the data and assigns column types |
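
The effect of the header option can be imitated with Python's `csv` module. This is a simplified sketch rather than Spark's reader. The sample text `raw` and the helper `read_csv` are hypothetical, and the `_c0`, `_c1`, ... names follow the auto-generated names seen in the first result above.

```python
import csv
import io

# Hypothetical three-line file: one header row and two data rows.
raw = "customer_id,name,age\n101,Arun,31\n102,Meera,45\n"


def read_csv(text, header):
    """Return (column_names, data_rows), mimicking header=true/false."""
    rows = list(csv.reader(io.StringIO(text)))
    if header:
        # First row supplies the column names and is removed from the data.
        columns, data = rows[0], rows[1:]
    else:
        # Column names are generated and the first row stays in the data.
        columns = [f"_c{i}" for i in range(len(rows[0]))]
        data = rows
    return columns, data


cols_no_header, data_no_header = read_csv(raw, header=False)
cols_header, data_header = read_csv(raw, header=True)

print(cols_no_header, len(data_no_header))  # ['_c0', '_c1', '_c2'] 3
print(cols_header, len(data_header))        # ['customer_id', 'name', 'age'] 2
```

With `header=False` the header line is counted as a data row, which is why the row count is one higher.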

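**What Spark does**
- Spark makes an extra pass over the data to infer types (all rows by default; the samplingRatio option can reduce this)
- Sees numeric values: 31, 45, 29
- Sees non-numeric value: "abc"
- Cannot safely cast the entire column to INTEGER
- Chooses a type that can hold every value in the column, which here is STRING
- Does not fail or drop the invalid rows during inference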



**How did schema inference handle "abc" in age?**<br>
How schema inference works (step by step):
- With inferSchema=true, Spark reads the data an extra time to work out column types.
- It tries to determine a single data type that can hold all values in a column.
- Spark will not partially fail or drop rows during inference.
- Since "abc" cannot be cast to an integer, Spark cannot safely choose INT.
- Spark falls back to STRING for the entire column.
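
The fallback described above can be modelled in a few lines of plain Python. This is a deliberately simplified sketch, not Spark's inference code: it only tries integer, then floating point, then string, and the function `infer_column_type` and the type labels it returns are hypothetical.

```python
def infer_column_type(values):
    """Return 'int', 'double', or 'string' for a list of raw CSV strings."""

    def fits(cast, value):
        # True when the value can be converted with the given cast.
        try:
            cast(value)
            return True
        except ValueError:
            return False

    # Empty fields are treated as nulls and do not influence the type.
    non_null = [v for v in values if v not in ("", None)]
    if not non_null:
        return "string"
    if all(fits(int, v) for v in non_null):
        return "int"
    if all(fits(float, v) for v in non_null):
        return "double"
    # At least one value is not numeric, so the whole column stays a string.
    return "string"


ages_clean = ["31", "45", "29", "52", "27"]
ages_with_abc = ages_clean + ["abc"]

print(infer_column_type(ages_clean))     # int
print(infer_column_type(ages_with_abc))  # string
```

One bad value is enough to change the type of the whole column. When the column must stay numeric, a common alternative is to pass an explicit schema instead of relying on inference.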