#Telecom Domain ReadOps <br/>Assignment
This notebook contains assignments to practice Spark read options and Databricks volumes. <br>
Sections: Sample data creation, Catalog & Volume creation, Copying data into Volumes, Path glob/recursive reads, toDF() column renaming variants, inferSchema/header/separator experiments, and exercises.<br>

![](https://fplogoimages.withfloats.com/actual/68009c3a43430aff8a30419d.png)
![](https://theciotimes.com/wp-content/uploads/2021/03/TELECOM1.jpg)

##First, import all required libraries and create the Spark session object

In [0]:
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.getOrCreate()

##1. Write SQL statements to create:
1. A catalog named telecom_catalog_assign
2. A schema landing_zone
3. A volume landing_vol
4. Using dbutils.fs.mkdirs, create folders:<br>
/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/
/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/
/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/
5. Explain the difference between the following (research why production-ready systems use the Volume concept):<br>
a. Volume vs DBFS/FileStore<br>
b. Why production teams prefer Volumes for regulated data<br>

In [0]:
%sql
create catalog if not exists telecom_catalog_assign

In [0]:
%sql
create schema if not exists telecom_catalog_assign.landing_zone

In [0]:
%sql
create volume if not exists telecom_catalog_assign.landing_zone.landing_vol

In [0]:
dbutils.fs.mkdirs(
    "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer"
    )
dbutils.fs.mkdirs(
    "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage"
    )
dbutils.fs.mkdirs(
    "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower"
    )

dbutils.fs.mkdirs(
    "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1"
    )

dbutils.fs.mkdirs(
    "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2"
    )


a. Volume vs DBFS/FileStore<br/>

Volume:<br/><br/>
Volumes are Unity Catalog objects that govern access to non-tabular files in cloud storage.<br/>
  1.Centralized Governance:<br/>
  Volumes are fully governed by Unity Catalog, which acts as a single governance layer across all workspaces. This avoids the fragmented access controls and inconsistent security policies that were common with older methods like DBFS mounts.

  2.Data Types: Governs non-tabular data (images, CSV, JSON, libraries, ML models, etc.) in cloud storage.

  3.Path Format: Uses a path built from the three-level namespace: /Volumes/catalog/schema/volume/path.

  4.Databricks Recommendation: Recommended for storing all non-tabular data.

DBFS/FileStore:<br/>

  Relies on legacy workspace-level ACLs or cloud IAM roles, which are complex to manage.<br/>

  Permissions are less granular.<br/>

  Deprecated for most uses and not recommended for important data due to security concerns.

b. Why do production teams prefer Volumes for regulated data?<br/>
  Production teams favor Volumes because Unity Catalog provides the governance framework needed for sensitive, regulated data (e.g., data subject to GDPR, HIPAA, CCPA): access is granted centrally through Unity Catalog privileges rather than through per-workspace settings.
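
The path format described above can be made concrete with a small helper. This is only a sketch: `volume_path` is a hypothetical helper name, not a Databricks API, and it just assembles the string without touching any storage.

```python
import posixpath

def volume_path(catalog, schema, volume, *parts):
    """Build a /Volumes/<catalog>/<schema>/<volume>/<parts...> path string."""
    # posixpath keeps forward slashes regardless of the local operating system.
    return posixpath.join("/Volumes", catalog, schema, volume, *parts)

customer_dir = volume_path("telecom_catalog_assign", "landing_zone", "landing_vol", "customer")
region1_dir = volume_path("telecom_catalog_assign", "landing_zone", "landing_vol", "tower", "region1")
print(customer_dir)
print(region1_dir)
```

Building paths from the catalog, schema, and volume names keeps the three-level namespace visible in code instead of repeating long string literals.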

##Data files to use in this usecase:

In [0]:
customer_csv = '''
101,Arun,31,Chennai,PREPAID
102,Meera,45,Bangalore,POSTPAID
103,Irfan,29,Hyderabad,PREPAID
104,Raj,52,Mumbai,POSTPAID
105,,27,Delhi,PREPAID
106,Sneha,abc,Pune,PREPAID
'''

usage_tsv = '''customer_id\tvoice_mins\tdata_mb\tsms_count
101\t320\t1500\t20
102\t120\t4000\t5
103\t540\t600\t52
104\t45\t200\t2
105\t0\t0\t0
'''

tower_logs_region1 = '''event_id|customer_id|tower_id|signal_strength|timestamp
5001|101|TWR01|-80|2025-01-10 10:21:54
5004|104|TWR05|-75|2025-01-10 11:01:12
'''

##2. Filesystem operations
1. Write code to copy the above datasets into your created Volume folders:
Customer → /Volumes/.../customer/
Usage → /Volumes/.../usage/
Tower (region-based) → /Volumes/.../tower/region1/ and /Volumes/.../tower/region2/

2. Write a command to validate whether files were successfully copied

In [0]:
dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer_csv.csv", customer_csv, overwrite=True)
dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_tsv.csv", usage_tsv, overwrite=True)
dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/tower_logs_region1.csv", tower_logs_region1, overwrite=True)
# No separate region2 dataset is defined above, so the region1 sample is reused for region2.
dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2/tower_logs_region2.csv", tower_logs_region1, overwrite=True)

In [0]:
%sh ls -ltr /Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/ /Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage /Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1 /Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2
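
The `ls` check above works because Volume paths are exposed as ordinary file paths on the cluster, so plain Python can run the same validation. The sketch below is an illustration only: `missing_files` is a hypothetical helper, and a temporary directory stands in for the volume root so the code runs outside Databricks. On Databricks, the root would be the `/Volumes/.../landing_vol` folder.

```python
import tempfile
from pathlib import Path

def missing_files(root, expected):
    """Return the expected relative paths that are absent or empty under root."""
    root = Path(root)
    return [rel for rel in expected
            if not (root / rel).is_file() or (root / rel).stat().st_size == 0]

expected = [
    "customer/customer_csv.csv",
    "usage/usage_tsv.csv",
    "tower/region1/tower_logs_region1.csv",
    "tower/region2/tower_logs_region2.csv",
]

# Temporary directory used as a stand-in for the volume root (assumption for the demo).
with tempfile.TemporaryDirectory() as tmp:
    for rel in expected[:3]:            # deliberately skip the last file
        target = Path(tmp) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("placeholder\n")
    result = missing_files(tmp, expected)

print(result)
```

An empty list means every expected file exists and is non-empty; here the skipped region2 file is reported.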


##3. Directory Read Use Cases
1. Read all tower logs using:
Path glob filter (example: *.csv)
Multiple paths input
Recursive lookup

2. Demonstrate these 3 reads separately:
Using pathGlobFilter
Using list of paths in spark.read.csv([path1, path2])
Using .option("recursiveFileLookup","true")

3. Compare the outputs and understand when each should be used.

In [0]:
# Recursive lookup combined with a file-name filter: picks up every *.csv under tower/, including region1/ and region2/.
tower_df1=spark.read.option("header","True").option("recursiveFileLookup","True").option("pathGlobFilter","*.csv").option("inferSchema","True").option("sep","|").csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower")
display(tower_df1)


In [0]:
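# 1. pathGlobFilter: keep only files whose name matches *.csv in the customer folder.
customer_df1=spark.read.options(pathGlobFilter="*.csv",inferSchema="True").csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer")
display(customer_df1)

# 2. List of paths: read the two region folders explicitly.
tower_df2=spark.read.options(header="True",sep="|").csv(["/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/","/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2/"])
display(tower_df2)

# 3. recursiveFileLookup: walk every subfolder under tower/. Pointing this at landing_vol/ would mix comma-, tab- and pipe-separated files in one DataFrame.
recurs_df1=spark.read.options(header="True",sep="|",recursiveFileLookup="True").csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/")
display(recurs_df1)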
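
To compare the three approaches without a cluster, the file selection can be imitated in plain Python. This is a simplified stand-in, not Spark's implementation: `list_files` and `select_files` are hypothetical helpers, `readme.txt` is a made-up extra file, and the sketch assumes that a non-recursive read only sees files directly inside the given folder and that the glob is matched against the file name alone.

```python
import fnmatch
import os
import tempfile

def list_files(root, recursive):
    """Yield file paths under root, descending into subfolders only if recursive."""
    if recursive:
        for folder, _dirs, files in os.walk(root):
            for name in files:
                yield os.path.join(folder, name)
    else:
        for entry in os.scandir(root):
            if entry.is_file():
                yield entry.path

def select_files(base, roots, glob="*", recursive=False):
    """Return sorted paths (relative to base) that the given settings would pick up."""
    picked = []
    for root in roots:
        for path in list_files(os.path.join(base, root), recursive):
            # The glob is matched against the file name only, not the folder part.
            if fnmatch.fnmatch(os.path.basename(path), glob):
                picked.append(os.path.relpath(path, base).replace(os.sep, "/"))
    return sorted(picked)

with tempfile.TemporaryDirectory() as base:
    for rel in ["tower/region1/tower_logs_region1.csv",
                "tower/region2/tower_logs_region2.csv",
                "tower/region1/readme.txt"]:       # readme.txt is a made-up extra file
        full = os.path.join(base, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "w").close()

    glob_only = select_files(base, ["tower"], glob="*.csv")
    multi_path = select_files(base, ["tower/region1", "tower/region2"])
    recursive_glob = select_files(base, ["tower"], glob="*.csv", recursive=True)

print(glob_only)
print(multi_path)
print(recursive_glob)
```

Under these assumptions, the glob filter alone finds nothing at the `tower` level because the files sit one folder deeper, the path list reads everything in the named folders (including the unwanted text file), and the recursive lookup with a glob returns just the two CSV files. In Spark, enabling `recursiveFileLookup` also turns off partition inference, so it suits plain nested folders rather than `key=value` partition layouts.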

##4. Schema Inference, Header, and Separator
1. Read the Customer and Usage files with both option() and options(), using read.csv and the format function:<br>
header=false, inferSchema=false<br>
or<br>
header=true, inferSchema=true<br>
2. Write a note on what changed when header or inferSchema is set to true/false.<br>
3. How did schema inference handle “abc” in age?<br>

In [0]:
# header=False, inferSchema=False via option() and format("csv").load(): columns are named _c0, _c1, ... and all typed as string.
customer_df1=spark.read.format("csv").option("header","False").option("inferSchema","False").load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer_csv.csv")
display(customer_df1)
customer_df1.printSchema()


usage_df1=spark.read.format("csv").option("header","False").option("inferSchema","False").option("sep","\t").load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_tsv.csv")
display(usage_df1)
usage_df1.printSchema()


# header=True, inferSchema=True via options() and read.csv(). The customer file has no header row, so its first record (101,Arun,...) is consumed as column names.
customer_df2=spark.read.options(header="True",inferSchema="True").csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer_csv.csv")
display(customer_df2)
customer_df2.printSchema()

usage_df2=spark.read.options(header="True",inferSchema="True",sep="\t").csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_tsv.csv")
display(usage_df2)
usage_df2.printSchema()

######What changed when we use header or inferSchema with true/false?
header=False: The first row is treated as data, not as column names. Columns get the default names _c0, _c1, _c2, ...<br/>
inferSchema=False: Every column is read with the default type, string.

header=True: The first row is used as the column names.<br/>
inferSchema=True: Spark scans the values and assigns each column a type that fits them. If a column mixes string values with other data types, the whole column is typed as string.
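
The effect of the header setting can be imitated with the standard `csv` module. This is a plain-Python sketch, not Spark itself: `read_rows` is a hypothetical helper, the `_c0`, `_c1`, ... names copy Spark's default positional naming, and blank lines are assumed to be skipped.

```python
import csv
import io

def read_rows(text, header):
    """Return (column_names, data_rows) for comma-separated text."""
    # Drop blank lines, such as the one produced by the leading newline in customer_csv.
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if header:
        return rows[0], rows[1:]
    # Without a header, fall back to positional names in the _c0, _c1, ... style.
    names = [f"_c{i}" for i in range(len(rows[0]))]
    return names, rows

sample = "\n101,Arun,31,Chennai,PREPAID\n102,Meera,45,Bangalore,POSTPAID\n"

names_no_header, rows_no_header = read_rows(sample, header=False)
names_header, rows_header = read_rows(sample, header=True)
print(names_no_header, len(rows_no_header))
print(names_header, len(rows_header))
```

Because the customer data has no header line, turning the header on uses the first customer record as column names and leaves one fewer data row.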

######How did schema inference handle “abc” in age?
Age should be an integer column, but because the value "abc" appears in it, the whole column is inferred as string.
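
The widening behaviour can be sketched in a few lines of plain Python. This is a deliberately simplified stand-in for the idea, not Spark's actual inference algorithm, and `infer_type` is a hypothetical helper that only distinguishes integers from strings.

```python
def infer_type(values):
    """Return 'int' if every non-empty value parses as an integer, else 'string'."""
    seen_value = False
    for value in values:
        if value == "":
            continue                # empty cells are treated as missing
        seen_value = True
        try:
            int(value)
        except ValueError:
            return "string"         # one non-numeric value widens the whole column
    return "int" if seen_value else "string"

ages = ["31", "45", "29", "52", "27", "abc"]   # age column from the customer data
age_type = infer_type(ages)
age_type_without_abc = infer_type(ages[:-1])
print(age_type)
print(age_type_without_abc)
```

A single value that does not fit the narrower type is enough to change the type of the entire column, which is why the age column comes back as string.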

##5. Column Renaming Usecases
1. Apply column names as strings using the toDF function for customer data
2. Apply column names and datatype using the schema function for usage data
3. Apply column names and datatype using the StructType with IntegerType, StringType, TimestampType and other classes for towers data 

In [0]:
customer_df1=spark.read.options(header="False",inferSchema="False").csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer_csv.csv").toDF("Id","Name","Age","City","Plan")
display(customer_df1)
customer_df1.printSchema()

In [0]:
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
schema_define=StructType([StructField("Customer_id",IntegerType(),True),StructField("Voice_mins",IntegerType(),True),StructField("Data_mb",IntegerType(),True),StructField("SMS_count",IntegerType(),True)])
# inferSchema is dropped: the explicit schema decides the column types, so inference is not needed.
usage_df1=spark.read.options(header="True",sep="\t").schema(schema_define).csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_tsv.csv")
display(usage_df1)


In [0]:
# Same schema expressed as a DDL string; inferSchema is dropped because the explicit schema decides the types.
usage_df1=spark.read.options(header="True",sep="\t").schema("Customer_id int,Voice_mins int,Data_mb int,SMS_count int").csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage_tsv.csv")
display(usage_df1)

In [0]:
from pyspark.sql.types import StructType,StructField,IntegerType,StringType,TimestampType
schema_define=StructType([StructField("event_id",IntegerType(),True),StructField("Customer_id",IntegerType(),True),StructField("tower_id",StringType(),True),StructField("signal_strength",IntegerType(),True),StructField("timestamp",TimestampType(),True)])
# Named tower_df3 because this DataFrame holds tower logs, not usage data; inferSchema is dropped since the schema is explicit.
tower_df3=spark.read.options(header="True",sep="|").schema(schema_define).csv(path=["/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/tower_logs_region1.csv","/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2/tower_logs_region2.csv"])
display(tower_df3)
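
What an explicit schema does to each value can be illustrated in plain Python. This is a sketch under assumptions, not Spark's reader: `TOWER_SCHEMA` and `apply_schema` are hypothetical names, the second input line is a made-up variation of the sample data, and setting an uncastable value to `None` mirrors what Spark's default permissive mode is generally expected to do with a value that does not fit the declared type.

```python
from datetime import datetime

# Column name and converter pairs, mirroring the tower schema fields.
TOWER_SCHEMA = [
    ("event_id", int),
    ("customer_id", int),
    ("tower_id", str),
    ("signal_strength", int),
    ("timestamp", lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S")),
]

def apply_schema(line, schema, sep="|"):
    """Split one line and cast each field to its declared type, or None if it does not fit."""
    record = {}
    for (name, convert), raw in zip(schema, line.split(sep)):
        try:
            record[name] = convert(raw)
        except ValueError:
            record[name] = None     # value cannot be cast to the declared type
    return record

good_row = apply_schema("5001|101|TWR01|-80|2025-01-10 10:21:54", TOWER_SCHEMA)
# Made-up line with a non-numeric customer_id, to show the failed cast.
bad_row = apply_schema("5004|abc|TWR05|-75|2025-01-10 11:01:12", TOWER_SCHEMA)
print(good_row)
print(bad_row)
```

With inference, one bad value changes the type of the whole column; with a declared schema, the column type stays fixed and only the offending value is lost.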

## 6. More to come (stay motivated)....