#Telecom Domain ReadOps Assignment
This notebook contains assignments to practice Spark read options and Databricks volumes. <br>
Sections: Sample data creation, Catalog & Volume creation, Copying data into Volumes, Path glob/recursive reads, toDF() column renaming variants, inferSchema/header/separator experiments, and exercises.<br>

![](https://fplogoimages.withfloats.com/actual/68009c3a43430aff8a30419d.png)
![](https://theciotimes.com/wp-content/uploads/2021/03/TELECOM1.jpg)

##First Import all required libraries & Create spark session object

##1. Write SQL statements to create:
1. A catalog named telecom_catalog_assign
2. A schema landing_zone
3. A volume landing_vol
4. Using dbutils.fs.mkdirs, create folders:<br>
/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/
/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/
/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/
5. Explain the following (research why Volumes are the preferred choice for production-ready systems):<br>
a. Volume vs DBFS/FileStore<br>
b. Why production teams prefer Volumes for regulated data<br>

In [0]:
%sql
create catalog if not exists telecom_catalog_assign;
create schema if not exists telecom_catalog_assign.landing_zone ;
create volume if not exists telecom_catalog_assign.landing_zone.landing_vol;


In [0]:
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/")
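
The three folder paths above share the same `/Volumes/<catalog>/<schema>/<volume>/` prefix. As a minimal sketch, the prefix can be built once in plain Python instead of being retyped. The helper name `volume_path` is hypothetical (it is not part of `dbutils` or Spark), and the snippet only builds strings; it does not create any folder.

```python
import posixpath

def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a /Volumes/<catalog>/<schema>/<volume>/... path string."""
    for name in (catalog, schema, volume):
        # Reject empty names and names that would break the three-level layout.
        if not name or "/" in name:
            raise ValueError(f"invalid Unity Catalog name: {name!r}")
    return posixpath.join("/Volumes", catalog, schema, volume, *parts)

# Hypothetical usage: the three landing folders from this assignment.
landing_dirs = [
    volume_path("telecom_catalog_assign", "landing_zone", "landing_vol", folder)
    for folder in ("customer", "usage", "tower")
]
for d in landing_dirs:
    print(d)
```

Each resulting string could then be passed to `dbutils.fs.mkdirs` in a loop.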

### a. Volume vs DBFS 

**Volumes** are Unity Catalog-governed objects that provide a logical layer for managing non-tabular data (files) within cloud object storage with centralized governance. They are the modern, recommended approach in Databricks and support access control policies and lineage tracking through Unity Catalog.  
**DBFS/FileStore** (Databricks File System) is an older abstraction over cloud storage that allowed users to interact with data using simple paths or mounts without robust, centralized governance. The DBFS root and its mounts are deprecated, and Databricks recommends migrating to volumes or external locations under Unity Catalog.


| Feature                | Volumes (Unity Catalog)                                                                 | DBFS/FileStore (Deprecated)                                  |
|------------------------|----------------------------------------------------------------------------------------|--------------------------------------------------------------|
| Governance             | Centralized, fine-grained access control via Unity Catalog                             | Limited, workspace-level controls                            |
| Data Lineage           | Supported                                                                              | Not supported                                                |
| Recommended for Prod   | Yes                                                                                    | No                                                           |
| Access Control         | Unity Catalog policies                                                                 | ACLs, less granular                                          |
| Usage                  | Non-tabular data, files, ML models, etc.                                               | General file storage                                         |
| Migration              | Modern, recommended approach                                                           | Deprecated, migrate to Volumes or External Locations         |



###b. Why production teams prefer Volumes for regulated data


Production teams prefer Volumes for regulated data primarily due to their integration with Unity Catalog, which provides a robust framework for data governance, security, and access management. 

**Centralized Governance and Auditing:** Unity Catalog provides a single place to manage data access, permissions, and auditing across all Databricks workspaces. This is essential for meeting compliance requirements for regulated data.

**Granular Access Control:** Volumes allow administrators to define precise access controls on specific volumes or subfolders within the cloud storage location, which is critical for restricted and sensitive data.

**Simplified Compliance:** The structured, governed approach of volumes simplifies compliance reporting and ensures data is handled consistently according to organizational policies and industry regulations.

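**Lifecycle Management:** A volume is a Unity Catalog object, so its permissions are granted and revoked in one place, and the files of a managed volume are removed when the volume is dropped. Encryption, backup, and recovery continue to follow the policies of the underlying cloud storage, which helps manage the full data lifecycle in a compliant manner.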

**DBFS** lacks these centralized governance features, making it difficult to enforce consistent, auditable access controls required for sensitive or regulated production data. 

##Data files to use in this usecase:
customer_csv = '''
101,Arun,31,Chennai,PREPAID
102,Meera,45,Bangalore,POSTPAID
103,Irfan,29,Hyderabad,PREPAID
104,Raj,52,Mumbai,POSTPAID
105,,27,Delhi,PREPAID
106,Sneha,abc,Pune,PREPAID
'''

usage_tsv = '''customer_id\tvoice_mins\tdata_mb\tsms_count
101\t320\t1500\t20
102\t120\t4000\t5
103\t540\t600\t52
104\t45\t200\t2
105\t0\t0\t0
'''

tower_logs_region1 = '''event_id|customer_id|tower_id|signal_strength|timestamp
5001|101|TWR01|-80|2025-01-10 10:21:54
5004|104|TWR05|-75|2025-01-10 11:01:12
'''

##2. Filesystem operations

###Loading the datasets into Volume

1. Write code to copy the above datasets into your created Volume folders:
Customer → /Volumes/.../customer/
Usage → /Volumes/.../usage/
Tower (region-based) → /Volumes/.../tower/region1/ and /Volumes/.../tower/region2/


In [0]:
#Loading the customer data into customer.csv file

customer_csv = '''101,Arun,31,Chennai,PREPAID
102,Meera,45,Bangalore,POSTPAID
103,Irfan,29,Hyderabad,
104,Raj,52,Mumbai,
105,,27,Delhi,PREPAID
106,Sneha,abc,Pune,PREPAID '''
dbutils.fs.put("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv",customer_csv,overwrite=True)

#Loading the usage data into usage.tsv file

usage_tsv = '''customer_id\tvoice_mins\tdata_mb\tsms_count
101\t320\t1500\t20
102\t120\t4000\t5
103\t540\t600\t52
104\t45\t200\t2
105\t0\t0\t0
'''
dbutils.fs.put("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv",usage_tsv,overwrite=True)

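#Loading the tower log data into region-wise, pipe-delimited .csv files
#(only region1 sample data is provided, so the same rows are reused for the second folder)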

tower_logs_region1 = '''event_id|customer_id|tower_id|signal_strength|timestamp
5001|101|TWR01|-80|2025-01-10 10:21:54
5004|104|TWR05|-75|2025-01-10 11:01:12
'''
dbutils.fs.put("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower1/tower_logs_region1.csv",tower_logs_region1,overwrite=True)
dbutils.fs.put("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower2/tower_logs_region2.csv",tower_logs_region1,overwrite=True)


In [0]:
dbutils.fs.help()

###Verify if data is loaded
2. Write a command to validate whether files were successfully copied

In [0]:
#validating if customer file is loaded successfully
dbutils.fs.ls("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/")
dbutils.fs.head("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")

In [0]:
#validating if usage file is loaded successfully
dbutils.fs.ls("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/")
dbutils.fs.head("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv")


In [0]:
#validating if tower log file is loaded successfully
dbutils.fs.ls("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower1/")
dbutils.fs.head("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower1/tower_logs_region1.csv")

dbutils.fs.ls("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower2/")
dbutils.fs.head("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower2/tower_logs_region2.csv")
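
The checks above are read by eye. A minimal sketch of an automated check follows, assuming the volume is also reachable as an ordinary file path (`/Volumes/...`) from Python's `os` module. The helper `missing_files` is hypothetical, and the demo runs on a temporary directory that stands in for the volume root.

```python
import os
import tempfile

def missing_files(base_dir, relative_paths):
    """Return the relative paths that are absent or empty under base_dir."""
    missing = []
    for rel in relative_paths:
        full = os.path.join(base_dir, rel)
        if not os.path.isfile(full) or os.path.getsize(full) == 0:
            missing.append(rel)
    return missing

# Self-contained demo: a temporary directory stands in for the volume root.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "customer"))
    with open(os.path.join(root, "customer", "customer.csv"), "w") as f:
        f.write("101,Arun,31,Chennai,PREPAID\n")
    expected = ["customer/customer.csv", "usage/usage.tsv"]
    demo_missing = missing_files(root, expected)
    print(demo_missing)  # only the usage file was never written
```

In the notebook, `base_dir` would be the volume path and an empty result would mean every expected file was copied.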

##3. Directory Read Use Cases


###3.1. Read all tower logs using a path glob filter (example: *.csv), multiple input paths, and recursive lookup

In [0]:
pth_glb_fl_df1=spark.read.csv(path=["dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower1","dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower2"],header=True,inferSchema=True,sep='|',pathGlobFilter="*.csv",recursiveFileLookup=True)
pth_glb_fl_df1.show()
display(pth_glb_fl_df1)





###3.2. Demonstrate these 3 reads separately:
Using pathGlobFilter
Using list of paths in spark.read.csv([path1, path2])
Using .option("recursiveFileLookup","true")

In [0]:
#Using pathGlobFilter

pth_glb_fl_df1=spark.read.csv(path="dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower1",header=True,inferSchema=True,sep='|',pathGlobFilter="*.csv")
pth_glb_fl_df1.show()
display(pth_glb_fl_df1)


In [0]:
#Using list of paths in spark.read.csv([path1, path2]) 

mul_pth_df1=spark.read.csv(path=["dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower1/","dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower2/"],header=True,inferSchema=True, sep='|')
mul_pth_df1.show()
display(mul_pth_df1)

In [0]:
#Using .option("recursiveFileLookup","true")

mul_pth_df2= spark.read.option("header",True).option("delimiter","|").option("recursiveFileLookup",True).format("csv").load(["dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/"])
mul_pth_df2.show()
display(mul_pth_df2)


###3.3 Compare the outputs and understand when each should be used.

- pathGlobFilter="*.csv" - Reads only the files whose names match the pattern (here *.csv) in the specified path/folder. Use it when a folder mixes several file types.
- List of paths - Reads files from multiple paths/sources in a single call. Use it when the exact locations are known up front.
- recursiveFileLookup - Reads files from all the subfolders under the given path. Use it when the folder layout is nested. It is passed here through .option(), and option() calls can be chained to set more than one read option.
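
The difference between the three approaches can be imitated on a plain list of file names. This is only a sketch in ordinary Python, not Spark's implementation: `select_files` is a hypothetical helper, the last two entries of `files` are made-up examples, and it ignores Spark details such as partition-folder discovery.

```python
import fnmatch
import posixpath

# Hypothetical listing; only the first two files exist in this notebook.
files = [
    "/tower/tower1/tower_logs_region1.csv",
    "/tower/tower2/tower_logs_region2.csv",
    "/tower/tower2/notes.txt",
    "/tower/archive/2024/old_logs.csv",
]

def select_files(files, roots, glob="*", recursive=False):
    """Imitate a path list, a file-name glob filter and a recursive lookup."""
    selected = []
    for path in files:
        parent, name = posixpath.split(path)
        for root in roots:
            root = root.rstrip("/")
            direct_child = parent == root
            nested = parent.startswith(root + "/")
            in_scope = direct_child or (recursive and nested)
            # The glob is matched against the file name only, not the full path.
            if in_scope and fnmatch.fnmatchcase(name, glob):
                selected.append(path)
                break
    return selected

glob_only = select_files(files, ["/tower/tower2"], glob="*.csv")
multi_path = select_files(files, ["/tower/tower1", "/tower/tower2"])
recursive = select_files(files, ["/tower"], glob="*.csv", recursive=True)
print(glob_only)
print(multi_path)
print(recursive)
```

The glob filter narrows one folder by file name, the path list widens the read to several known folders, and the recursive lookup reaches files at any depth under a single root.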

##4. Schema Inference, Header, and Separator
1. Try the Customer, Usage files with the option and options using read.csv and format function:<br>
header=false, inferSchema=false<br>
or<br>
header=true, inferSchema=true<br>
2. Write a note on What changed when we use header or inferSchema  with true/false?<br>
3. How schema inference handled “abc” in age?<br>

###4.1.a Try the Customer file with the option and options using read.csv and format function:
header=false, inferSchema=false
or
header=true, inferSchema=true

In [0]:
##Read cust file with option and header=true, inferSchema=true
cust_df1=spark.read.option("header",True).option("inferSchema",True).option("delimiter",",").format("csv").load("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")
display(cust_df1)
print(cust_df1.schema)

##Read cust file with options and header=true, inferSchema=true
cust_df2=spark.read.options(header=True,inferSchema=True,delimiter=",").format("csv").load("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")
display(cust_df2)
print(cust_df2.schema)

#Read cust file with options.csv and header=true, inferSchema=true
cust_df3=spark.read.options(header=True,inferSchema=True,delimiter=",").csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")
display(cust_df3)
print(cust_df3.schema)


In [0]:

##Read cust file with option and header=false, inferSchema=false
cust_df4=spark.read.option("header",False).option("inferSchema",False).option("delimiter",",").format("csv").load("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")
display(cust_df4)
print(cust_df4.schema)

##Read cust file with options and header=false, inferSchema=false
cust_df5=spark.read.options(header=False,inferSchema=False,delimiter=",").format("csv").load("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")
display(cust_df5)
print(cust_df5.schema)

#Read cust file with options.csv and header=False, inferSchema=false
cust_df6=spark.read.options(header=False,inferSchema=False,delimiter=",").csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")
display(cust_df6)
print(cust_df6.schema)

###4.1.b Try the Usage files with the option and options using read.csv and format function:
header=false, inferSchema=false
or
header=true, inferSchema=true

In [0]:
##Read usage file with option and header=True, inferSchema=True
usage_df1=spark.read.option("header",True).option("inferSchema",True).option("delimiter","\t").format("csv").load("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv")
display(usage_df1)
print(usage_df1.schema)

##Read usage file with options and header=True, inferSchema=True
usage_df2=spark.read.options(header=True,inferSchema=True,delimiter="\t").format("csv").load("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv")
display(usage_df2)
print(usage_df2.schema)

##Read usage file with options.csv and header=True, inferSchema=True
usage_df3=spark.read.options(header=True,inferSchema=True,delimiter="\t").csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv")
display(usage_df3)
print(usage_df3.schema) 

##Read usage file with option and header=False, inferSchema=False
usage_df4=spark.read.option("header",False).option("inferSchema",False).option("delimiter","\t").format("csv").load("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv")
display(usage_df4)
print(usage_df4.schema)

##Read usage file with options and header=False, inferSchema=False
usage_df5=spark.read.options(header=False,inferSchema=False,delimiter="\t").format("csv").load("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv")
display(usage_df5)
print(usage_df5.schema)

##Read usage file with options.csv and header=False, inferSchema=False
usage_df6=spark.read.options(header=False,inferSchema=False,delimiter="\t").csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv")
display(usage_df6)
print(usage_df6.schema) 


Write a note on What changed when we use header or inferSchema with true/false?

- **Header = True** - The first row is treated as the header, so its values become the column names. The customer file has no header row, so its first record (101, Arun, ...) is consumed as column names.
- **Header = False** - Every row is treated as data, and Spark assigns the default column names _c0, _c1, _c2, etc.
- **InferSchema = True** - Spark scans the rows and assigns a datatype to each column accordingly. This is fine for small datasets, but on a huge volume of data the extra scan takes time, so it needs to be used carefully. The samplingRatio option can be used to avoid a full scan.
- **InferSchema = False** - All the columns are treated as strings.

How schema inference handled “abc” in age?

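The age column contains the value "abc", which cannot be parsed as a number. Even when inferSchema is True, Spark assigns the string datatype to the whole column, because an inferred type has to fit every value in the column.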
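
The fallback to string can be sketched in plain Python. This is a simplified imitation of the idea, not Spark's actual inference code: `infer_column_type` is a hypothetical helper that tries the narrowest type first and widens until every non-empty value fits.

```python
def infer_column_type(values):
    """Return 'int', 'double' or 'string' for a list of raw CSV strings."""
    non_empty = [v for v in values if v is not None and v.strip() != ""]
    if not non_empty:
        return "string"  # nothing to learn from, keep the safest type
    for type_name, caster in (("int", int), ("double", float)):
        try:
            for v in non_empty:
                caster(v)
            return type_name  # every value parsed with this type
        except ValueError:
            continue  # at least one value did not fit, try a wider type
    return "string"

ages = ["31", "45", "29", "52", "27", "abc"]
print(infer_column_type(ages))       # string, because of "abc"
print(infer_column_type(ages[:-1]))  # int, once "abc" is left out
```

A single unparseable value is enough to widen the whole column, which is why cleaning or an explicit schema is usually preferred for columns like age.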

##5. Column Renaming Usecases
1. Apply column names using string using toDF function for customer data
2. Apply column names and datatype using the schema function for usage data
3. Apply column names and datatype using the StructType with IntegerType, StringType, TimestampType and other classes for towers data 

###5.1 Apply column names using string using toDF function for customer data

In [0]:
#The customer file has no header row, so header=False keeps the first record (101) as data
cust_df=spark.read.csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv",header=False,inferSchema=True).toDF("customer_id","name","age","city","plan")
display(cust_df)
print(cust_df.schema)

###5.2 Apply column names and datatype using the schema function for usage data

In [0]:
str_struct= "customer_id integer,voice_mins integer,data_mb integer,sms_count integer"
usage_df=spark.read.schema(str_struct).csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv",header=True,sep="\t")
display(usage_df)
print(usage_df.schema)

###5.3 Apply column names and datatype using the StructType with IntegerType, StringType, TimestampType and other classes for towers data

In [0]:
from pyspark.sql.types import StructType,StructField,StringType,IntegerType,TimestampType
#Schema for the tower logs: event_id|customer_id|tower_id|signal_strength|timestamp
tower_schema=StructType([StructField("event_id",IntegerType(),True),StructField("customer_id",IntegerType(),True),StructField("tower_id",StringType(),True),StructField("signal_strength",IntegerType(),True),StructField("timestamp",TimestampType(),True)])
tower_df=spark.read.schema(tower_schema).csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/",header=True,sep='|',recursiveFileLookup=True,timestampFormat="yyyy-MM-dd HH:mm:ss")
tower_df.printSchema()
tower_df.show(2)

## Spark Write Operations using 
- csv, json, orc, parquet, delta, saveAsTable, insertInto, xml with different write mode, header and sep options

##6. Write Operations (Data Conversion/Schema migration) – CSV Format Usecases
1. Write customer data into CSV format using overwrite mode
2. Write usage data into CSV format using append mode
3. Write tower data into CSV format with header enabled and custom separator (|)
4. Read the tower data in a dataframe and show only 5 rows.
5. Download the file into local from the catalog volume location and see the data of any of the above files opening in a notepad++.

###6.1 Write customer data into CSV format using overwrite mode

In [0]:
#write.csv() returns None, so the result is not assigned to a variable
cust_df1.write.csv(path="/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/csvout",mode="overwrite",header=True,sep='~')
spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/csvout",header=True,sep='~').show(2)

###6.2 Write usage data into CSV format using append mode

In [0]:
#write.csv() returns None, so the result is not assigned to a variable
usage_df1.write.csv(path="/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/usg_csvout",mode="append",header=True,sep='\t')
spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/usg_csvout",header=True,sep='\t').show(2)


###6.3 Write tower data into CSV format with header enabled and custom separator (|)

In [0]:
mul_pth_df2.write.csv(path="/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/tower_out",mode="overwrite",header=True,sep='|')



###6.4 Read the tower data in a dataframe and show only 5 rows.

In [0]:
tower_csv_out_df=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/tower_out",header=True,sep='|')
tower_csv_out_df.show(5)

###6.5 Download the file into local from the catalog volume location and see the data of any of the above files opening in a notepad++.

Below is the CSV data downloaded and opened in Notepad++; the data is in a structured and readable format.

event_id|customer_id|tower_id|signal_strength|timestamp

5001|101|TWR01|-80|2025-01-10 10:21:54

5004|104|TWR05|-75|2025-01-10 11:01:12


##7. Write Operations (Data Conversion/Schema migration)– JSON Format Usecases
1. Write customer data into JSON format using overwrite mode
2. Write usage data into JSON format using append mode and snappy compression format
3. Write tower data into JSON format using ignore mode and observe the behavior of this mode
4. Read the tower data in a dataframe and show only 5 rows.
5. Download the file into local harddisk from the catalog volume location and see the data of any of the above files opening in a notepad++.

###7.1 Write customer data into JSON format using overwrite mode

In [0]:
#write.json() returns None, so the result is not assigned to a variable
cust_df1.write.json(path="/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/cust_json_out",mode="overwrite")
spark.read.json("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/cust_json_out").show(5)

###7.2 Write usage data into JSON format using append mode and snappy compression format

In [0]:
usage_df1.write.json(path="/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/usage_json_out",mode="append",compression="snappy")
spark.read.json("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/usage_json_out").show(5)

###7.3 Write tower data into JSON format using ignore mode and observe the behavior of this mode


In [0]:
mul_pth_df2.write.json(path="/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/mul_pth_df2",mode="ignore")
#since the directory mul_pth_df2 is already present in target folder the write operation is ignored


###7.4 Read the tower data in a dataframe and show only 5 rows

In [0]:
twr_json_df=spark.read.json("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/mul_pth_df2")
twr_json_df.show(5)

###7.5 Download the file into local harddisk from the catalog volume location and see the data of any of the above files opening in a notepad++.

Below is the JSON output downloaded and opened in Notepad++; the data is in a semi-structured, dictionary-like format (key: value pairs), with one JSON object per line.
{"event_id":"5001","customer_id":"101","tower_id":"TWR01","signal_strength":"-80","timestamp":"2025-01-10 10:21:54"}
{"event_id":"5004","customer_id":"104","tower_id":"TWR05","signal_strength":"-75","timestamp":"2025-01-10 11:01:12"}


##8. Write Operations (Data Conversion/Schema migration) – Parquet Format Usecases
1. Write customer data into Parquet format using overwrite mode and in a gzip format
2. Write usage data into Parquet format using error mode
3. Write tower data into Parquet format with gzip compression option
4. Read the usage data in a dataframe and show only 5 rows.
5. Download the file into local harddisk from the catalog volume location and see the data of any of the above files opening in a notepad++.

###8.1 Write customer data into Parquet format using overwrite mode and in a gzip format

In [0]:
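cust_df.write.parquet(path="/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/cust_parquet_out",mode="overwrite",compression="gzip")  #customer data, as asked in 8.1
usage_df.write.parquet(path="/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/usg_parquet_out",mode="overwrite",compression="gzip")  #usage data, needed for 8.2 and 8.4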


###8.2 Write usage data into Parquet format using error mode

In [0]:
#below is alternative way of writing into parquet file 
usage_df1.write.mode("error").option("compression","gzip").parquet("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/usg_parquet_out/")

#Error message is received since the directory exists already: [PATH_ALREADY_EXISTS] Path dbfs:/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/usg_parquet_out already exists. Set mode as "overwrite" to overwrite the existing path. SQLSTATE: 42K04
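
The four write modes used so far (overwrite, append, ignore, error) differ only in how they react to an existing target path. A minimal sketch of that decision in plain Python follows. `plan_write` is a hypothetical helper that mirrors the behaviour observed in the cells above; it does not call Spark.

```python
def plan_write(mode: str, path_exists: bool) -> str:
    """Describe what a write with the given save mode does to the target path."""
    mode = mode.lower()
    if mode in ("error", "errorifexists"):
        if path_exists:
            # Matches the PATH_ALREADY_EXISTS failure seen in 8.2.
            raise FileExistsError("path already exists")
        return "write new data"
    if mode == "overwrite":
        return "replace existing data" if path_exists else "write new data"
    if mode == "append":
        return "add files next to existing data" if path_exists else "write new data"
    if mode == "ignore":
        # Matches 7.3: the write is silently skipped when the path exists.
        return "skip write" if path_exists else "write new data"
    raise ValueError(f"unknown save mode: {mode!r}")

for m in ("overwrite", "append", "ignore"):
    print(m, "->", plan_write(m, path_exists=True))
```

When the target path does not exist yet, all four modes behave the same and simply write the data.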

###8.3 Write tower data into Parquet format with gzip compression option

In [0]:
mul_pth_df2.write.mode("append").option("compression","gzip").parquet("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/twr_parquet_out")

###8.4 Read the usage data in a dataframe and show only 5 rows

In [0]:
usg_wr_op_df=spark.read.parquet("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/usg_parquet_out")
usg_wr_op_df.show(5)

###8.5 Download the file into local harddisk from the catalog volume location and see the data of any of the above files opening in a notepad++.

Below is what the tower Parquet data file shows when opened in Notepad++. Parquet is a binary, columnar format, so most of the content is not human-readable. The fragments that can still be read are:

- The PAR1 marker at the start and at the end of the file
- The column values, stored column by column (for example 50015004, 101104, TWR01TWR05)
- The footer metadata: spark_schema with the columns event_id, customer_id, tower_id, signal_strength and timestamp (all of type string), org.apache.spark.version 4.0.0, and parquet-mr version 1.15.1-databricks-0001

##9. Write Operations (Data Conversion/Schema migration) – Orc Format Usecases
1. Write customer data into ORC format using overwrite mode
2. Write usage data into ORC format using append mode
3. Write tower data into ORC format and see the output file structure
4. Read the usage data in a dataframe and show only 5 rows.
5. Download the file into local harddisk from the catalog volume location and see the data of any of the above files opening in a notepad++.

###9.1 Write customer data into ORC format using overwrite mode

In [0]:
cust_df1.write.orc("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/cust_orc_out",mode="overwrite")

###9.2 Write usage data into ORC format using append mode

In [0]:
usage_df1.write.orc(path="/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/usage_orc_out",mode="append")

###9.3 Write tower data into ORC format and see the output file structure

In [0]:
mul_pth_df2.write.orc("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/twr_orc_out",mode="overwrite")

###9.4 Read the usage data in a dataframe and show only 5 rows.

In [0]:
us_orc_df=spark.read.orc("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/usage_orc_out")
us_orc_df.show(5)

###9.5 Download the file into local harddisk from the catalog volume location and see the data of any of the above files opening in a notepad++.

Below is what the tower ORC data file shows when opened in Notepad++. ORC is a binary, columnar format that stores the data in stripes, so most of the content is not human-readable. The fragments that can still be read are:

- The ORC marker at the start and at the end of the file
- The column values, stored column by column (for example 50015004, 101104, TWR01TWR05, -75-80)
- The footer metadata: the column names (event_id, customer_id, signal_strength, timestamp), a spark.sql.catalyst.type entry with the value string, and the Spark version 4.0.0

##10. Write Operations (Data Conversion/Schema migration) – Delta Format Usecases
1. Write customer data into Delta format using overwrite mode
2. Write usage data into Delta format using append mode
3. Write tower data into Delta format and see the output file structure
4. Read the usage data in a dataframe and show only 5 rows.
5. Download the file into local harddisk from the catalog volume location and see the data of any of the above files opening in a notepad++.
6. Compare the parquet location and delta location and try to understand what is the differentiating factor, as both are parquet files only.

###10.1 Write customer data into Delta format using overwrite mode

In [0]:
#header is a CSV option and has no meaning for Delta, so only the write mode is set
cust_df4.write.format("delta").mode("overwrite").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/cust_delta_out")


###10.2 Write usage data into Delta format using append mode

In [0]:
#header is a CSV option and has no meaning for Delta, so only the write mode is set
usage_df1.write.format("delta").mode("append").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/usage_delta_out")

###10.3  Write tower data into Delta format and see the output file structure

In [0]:
#header is a CSV option and has no meaning for Delta, so only the write mode is set
mul_pth_df2.write.format("delta").mode("overwrite").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/twr_delta_out")

###10.4 Read the usage data in a dataframe and show only 5 rows.

In [0]:
usg_del_df=spark.read.format("delta").load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/usage_delta_out")
usg_del_df.show(5)

###10.5 Download the file into local harddisk from the catalog volume location and see the data of any of the above files opening in a notepad++.

Below is what a Delta data file for the tower data shows when opened in Notepad++. The data file itself is an ordinary Parquet file, so it looks similar to the Parquet file in 8.5: binary content with only a few readable fragments. Delta Lake is an open table format (it is not specific to Databricks) that enriches Parquet data files with a transaction log. The readable fragments are:

- The PAR1 marker at the start and at the end of the file
- The column values, stored column by column (for example 5001, 5004, TWR01, TWR05)
- The footer metadata: spark_schema with the columns event_id, customer_id, tower_id, signal_strength and timestamp (all of type string), org.apache.spark.version 4.0.0, com.databricks.spark.writeTimestamp 2025-12-17T11:16:11.971238451Z, and parquet-mr compatible Photon version 0.2 (build 17.3)

###10.6 Compare the parquet location and delta location and try to understand what is the differentiating factor, as both are parquet files only.

**Parquet file location:** 
- The data files are compressed and have the .parquet extension
- The .parquet data files and the commit marker files (started, committed) are created directly in the given path

**Delta file location:**
- The data files are also compressed .parquet files, created in the given path
- Additionally, a directory called _delta_log is created, in which the transaction log is maintained
- The _delta_log directory is the differentiating factor: it records every commit, which is what lets Delta provide ACID transactions, versioning (time travel), and schema enforcement on top of plain Parquet files
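
The comparison above can be turned into a small check. The sketch below is plain Python and assumes the output folders are reachable as ordinary file paths. `describe_output_dir` is a hypothetical helper, and the file names in the demo are stand-ins created in a temporary directory, not real Spark output.

```python
import os
import tempfile

def describe_output_dir(path):
    """Report whether a folder looks like a Delta table or plain Parquet output."""
    entries = sorted(os.listdir(path))
    # The _delta_log folder is what separates a Delta table from plain Parquet.
    is_delta = os.path.isdir(os.path.join(path, "_delta_log"))
    data_files = [e for e in entries if e.endswith(".parquet")]
    return {"format": "delta" if is_delta else "parquet", "data_files": data_files}

# Self-contained demo with empty stand-in files.
with tempfile.TemporaryDirectory() as root:
    parquet_dir = os.path.join(root, "twr_parquet_out")
    delta_dir = os.path.join(root, "twr_delta_out")
    os.makedirs(parquet_dir)
    os.makedirs(os.path.join(delta_dir, "_delta_log"))
    for folder in (parquet_dir, delta_dir):
        open(os.path.join(folder, "part-00000.gz.parquet"), "w").close()
    open(os.path.join(parquet_dir, "_SUCCESS"), "w").close()
    open(os.path.join(delta_dir, "_delta_log", "00000000000000000000.json"), "w").close()
    parquet_info = describe_output_dir(parquet_dir)
    delta_info = describe_output_dir(delta_dir)
    print(parquet_info)
    print(delta_info)
```

Both folders hold the same kind of data file; only the presence of `_delta_log` changes the reported format.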

##11. Write Operations (Lakehouse Usecases) – Delta table Usecases
1. Write customer data using saveAsTable() as a managed table
2. Write usage data using saveAsTable() with overwrite mode
3. Drop the managed table and verify data removal
4. Go and check the table overview in the Catalog and confirm that it is in delta format.
5. Use spark.sql to run some simple queries on the tables created above.


In [0]:
#1. Write customer data using saveAsTable() as a managed table

from pyspark.sql.types import StructField,StringType,IntegerType,StructType
cust_schema=StructType([StructField("customer_id",IntegerType(),False),
StructField("name",StringType(),True),
StructField("age",IntegerType(),True),
StructField("city",StringType(),True),
StructField("Plan",StringType(),True)])

cust_df1=spark.read.schema(cust_schema).csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")
cust_df1.printSchema()  #printSchema() prints the schema itself and returns None, so print() is not needed
cust_df1.show()

cust_df1.write.saveAsTable("telecom_catalog_assign.landing_zone.cust_delta_tbl",mode="overwrite")
display(spark.sql("show create table telecom_catalog_assign.landing_zone.cust_delta_tbl"))
display(spark.sql("select * from telecom_catalog_assign.landing_zone.cust_delta_tbl"))

#2. Write usage data using saveAsTable() with overwrite mode
usage_df_wr=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv",header=True,inferSchema=True,sep='\t')
usage_df_wr.display()
usage_df_wr.printSchema()

usage_df_wr.write.saveAsTable("telecom_catalog_assign.landing_zone.usg_tbl",mode="overwrite")
display(spark.sql("select * from telecom_catalog_assign.landing_zone.usg_tbl"))

#3. Drop the managed table and verify data removal
spark.sql("drop table telecom_catalog_assign.landing_zone.usg_tbl")

#since the table is dropped, referencing it again gives the error message below
# [TABLE_OR_VIEW_NOT_FOUND] The table or view `telecom_catalog_assign`.`landing_zone`.`usg_tbl` cannot be found. Verify the spelling and correctness of the schema and catalog.
# If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
# To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS. SQLSTATE: 42P01
                                                            
#4. Go and check the table overview and realize it is in delta format in the Catalog.
# Table Properties (as shown in the Catalog table overview)
# delta:
#   enableDeletionVectors: "true"
#   feature.appendOnly: "supported"
#   feature.deletionVectors: "supported"
#   feature.invariants: "supported"
#   lastCommitTimestamp: "1766129575000"
#   lastUpdateVersion: "0"
#   minReaderVersion: "3"
#   minWriterVersion: "7"
# spark:
#   sql.statistics.auxiliaryInfo: "{\"source\":\"AUTO_STATS\"}"
#   sql.statistics.createdAt: "1766129573415"
#   sql.statistics.createdBy: "root"
#   sql.statistics.numRows: "6"
#   sql.statistics.totalSize: "1578"
#   sql.statistics.version: "2"
# collation:
#   collation: "UTF8_BINARY"

#5. Use spark.sql to run some simple queries on the tables created above
display(spark.sql("show create table telecom_catalog_assign.landing_zone.cust_delta_tbl"))
display(spark.sql("select * from telecom_catalog_assign.landing_zone.cust_delta_tbl"))
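
Steps 3 and 4 above can also be verified from code rather than by reading an error message or the Catalog UI. The sketch below is a hedged illustration: it assumes a live `spark` session is passed in, that `spark.catalog.tableExists` accepts the three-part table name (available in recent PySpark versions), and that `DESCRIBE DETAIL` is supported for the table, as it is for Delta tables. The helper names `table_format` and `drop_and_verify` are hypothetical.

```python
def table_format(spark, table_name):
    """Return the storage format reported by DESCRIBE DETAIL (for example 'delta')."""
    return spark.sql(f"DESCRIBE DETAIL {table_name}").first()["format"]


def drop_and_verify(spark, table_name):
    """Drop the table if it exists and report whether it is gone from the catalog."""
    spark.sql(f"DROP TABLE IF EXISTS {table_name}")
    return not spark.catalog.tableExists(table_name)
```

On a cluster, `table_format(spark, "telecom_catalog_assign.landing_zone.cust_delta_tbl")` would be expected to return `delta`, and `drop_and_verify(spark, "telecom_catalog_assign.landing_zone.usg_tbl")` returns `True` once the table is no longer registered. Using `DROP TABLE IF EXISTS` keeps the cell re-runnable without raising TABLE_OR_VIEW_NOT_FOUND.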


##12. Write Operations (Lakehouse Usecases) – Delta table Usecases
1. Write customer data using insertInto() in a new table and find the behavior
2. Write usage data using insertInto() with overwrite mode

In [0]:
#1 Write customer data using insertInto() in a new table and find the behavior

#creating new table and inserting data
spark.sql("""
    CREATE TABLE IF NOT EXISTS telecom_catalog_assign.landing_zone.cust_insert_test (
        customer_id INT,
        name STRING,
        age INT,
        city STRING,
        Plan STRING
    )
    USING DELTA
    """)

#insertInto() needs the target table to exist already and maps columns by position, not by name
cust_df1.write.insertInto("telecom_catalog_assign.landing_zone.cust_insert_test")
display(spark.sql("select * from telecom_catalog_assign.landing_zone.cust_insert_test"))
#the records got inserted successfully

#2 Write usage data using insertInto() with overwrite mode

#creating a DF for inserting into the cust_insert_test table with overwrite option
from pyspark.sql.types import StructField,StringType,IntegerType,StructType
cust_schema=StructType([StructField("customer_id",IntegerType(),False),
StructField("name",StringType(),True),
StructField("age",IntegerType(),True),
StructField("city",StringType(),True),
StructField("Plan",StringType(),True)])

cust_df_wr=spark.read.schema(cust_schema).csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer1.csv")
cust_df_wr.printSchema()
cust_df_wr.show()

cust_df_wr.write.insertInto("telecom_catalog_assign.landing_zone.cust_insert_test",overwrite=True)
display(spark.sql("select * from telecom_catalog_assign.landing_zone.cust_insert_test"))

#after running insertInto() with overwrite=True, the existing records were replaced by the new records
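
One behavior worth checking before calling insertInto() is that it resolves columns by position and ignores the DataFrame's column names, so a DataFrame whose columns are in a different order than the table can load values into the wrong columns. The sketch below is a minimal, plain-Python illustration of such a check. The helper `positional_mismatches` is hypothetical, and the Spark calls in the trailing comments assume the `cust_df1` DataFrame and the `cust_insert_test` table from the cell above.

```python
def positional_mismatches(df_columns, table_columns):
    """List (position, df_column, table_column) where names differ at the same position."""
    if len(df_columns) != len(table_columns):
        raise ValueError(
            f"column count differs: {len(df_columns)} vs {len(table_columns)}"
        )
    return [
        (i, d, t)
        for i, (d, t) in enumerate(zip(df_columns, table_columns))
        if d.lower() != t.lower()  # compare case-insensitively
    ]


# On a cluster, the inputs would come from the DataFrame and the target table:
#   target = "telecom_catalog_assign.landing_zone.cust_insert_test"
#   table_cols = spark.table(target).columns
#   if positional_mismatches(cust_df1.columns, table_cols):
#       cust_df1 = cust_df1.select(*table_cols)  # reorder by name first
#   cust_df1.write.insertInto(target)
```

An empty list means the DataFrame's column order already matches the table, so a positional insert lands each value in the intended column.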


##13. Write Operations (Lakehouse Usecases) – Delta table Usecases
1. Write customer data into XML format using rowTag as cust
2. Write usage data into XML format using overwrite mode with the rowTag as usage
3. Download the xml data and open the file in notepad++ and see how the xml file looks like.

###13.1 Write customer data into XML format using rowTag as cust

In [0]:
#1. Write customer data into XML format using rowTag as cust
from pyspark.sql.types import StructField,StringType,IntegerType,StructType
cust_schema=StructType([StructField("customer_id",IntegerType(),False),
StructField("name",StringType(),True),
StructField("age",IntegerType(),True),
StructField("city",StringType(),True),
StructField("Plan",StringType(),True)])

cust_df1=spark.read.schema(cust_schema).csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")
cust_df1.printSchema()
cust_df1.show()

#rowTag sets the element name that wraps each record; the task asks for cust
cust_df1.write.xml("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer_xml",mode="overwrite",rowTag="cust")



###13.2 Write usage data into XML format using overwrite mode with the rowTag as usage

In [0]:
usage_df_wr=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv",header=True,inferSchema=True,sep='\t')
usage_df_wr.display()
usage_df_wr.printSchema()

usage_df_wr.write.xml("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/usage_xml",mode="overwrite",rowTag="usage")
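
To get a feel for what the rowTag option does before opening the downloaded file, the element nesting can be mimicked with Python's standard library. This is only a sketch: the function `rows_to_xml` is hypothetical, the sample row is invented rather than taken from customer.csv, the root element name `ROWS` is assumed to be the writer's default, and the real Spark output is pretty-printed across lines.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(rows, row_tag, root_tag="ROWS"):
    """Build an XML string with one <row_tag> element per row dictionary."""
    root = ET.Element(root_tag)
    for row in rows:
        elem = ET.SubElement(root, row_tag)
        for column, value in row.items():
            ET.SubElement(elem, column).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample row, not taken from customer.csv.
sample = [
    {"customer_id": 101, "name": "Asha", "age": 30, "city": "Chennai", "Plan": "PREPAID"}
]
xml_text = rows_to_xml(sample, row_tag="cust")
print(xml_text)

# Reading it back: every <cust> element is one record.
records = ET.fromstring(xml_text).findall("cust")
```

Each record becomes one element named after rowTag, and each column becomes a child element, which is the same structure to look for when reading the file back with `spark.read.xml(..., rowTag=...)`.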

##14. Compare all the downloaded files (csv, json, orc, parquet, delta and xml) 
1. Capture the size occupied by each of these file formats and list the formats below in order of size, from smallest to largest.

The sizes below come from very small sample files, where fixed overhead such as file footers and embedded schema metadata dominates, so the columnar formats look larger than CSV. Larger files would give a more representative comparison.


| File type | File size (KB) | Compression |
| :---      | :---           | :---        |
| CSV       | 0.1484         | None        |
| JSON      | 0.3213         | None        |
| ORC       | 0.9033         | Snappy      |
| XML       | 0.9980         | None        |
| Delta     | 1.5200         | Snappy      |
| Parquet   | 1.8000         | Snappy      |
##15. Try different combinations of Schema Migration & Data Conversion operations, such as...
1. Read any one of the above orc data in a dataframe and write it to dbfs in parquet format
2. Read any one of the above parquet data in a dataframe and write it to dbfs in delta format
3. Read any one of the above delta data in a dataframe and write it to dbfs in xml format
4. Read any one of the above delta tables in a dataframe and write it to dbfs in json format
5. Read any one of the above delta tables in a dataframe and write it to another table

In [0]:
#1. Read any one of the above orc data in a dataframe and write it to dbfs in a parquet format

cust_orc_df1 = spark.read.orc(
    "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/cust_orc_out/part-00000-tid-122224858270013067-137d72c2-bb04-4033-9436-888dc84c4aa8-298-1-c000.snappy.orc"
)
#toDF() returns a new DataFrame, so assign the result back
cust_orc_df1 = cust_orc_df1.toDF("customer_id", "name", "age", "city", "Plan")

cust_orc_df1.printSchema()
cust_orc_df1.display()

cust_orc_df1.write.parquet("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/cust_orc_parquet",mode="overwrite")
spark.read.parquet("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/cust_orc_parquet").show()

#2. Read any one of the above parquet data in a dataframe and write it to dbfs in a delta format
usg_prqt_df= spark.read.parquet("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/usg_parquet_out/part-00000-tid-5588728212722270687-a3887482-9889-4b53-b3f2-82cccf982173-245-1-c000.gz.parquet")
usg_prqt_df.display()                                             

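#write in delta format as the task asks; the usg_prqt_delta_out folder name is an assumed target path
usg_prqt_df.write.format("delta").mode("overwrite").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/usg_prqt_delta_out/")
spark.read.format("delta").load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/usg_prqt_delta_out").display()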

#3. Read any one of the above delta data in a dataframe and write it to dbfs in a xml format

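#state the format explicitly instead of relying on the session's default data source
cust_delta_df=spark.read.format("delta").load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/cust_delta_out")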
cust_delta_df.display()

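#mode="overwrite" lets the cell be re-run without failing because the path already exists
cust_delta_df.write.xml("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/cust_delta_xml",mode="overwrite",rowTag="customer")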
spark.read.xml("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/cust_delta_xml",rowTag="customer").show()

#4. Read any one of the above delta table in a dataframe and write it to dbfs in a json format
usg_del_tabl_df=spark.read.table("telecom_catalog_assign.landing_zone.cust_delta_tbl")
usg_del_tabl_df.show()

usg_del_tabl_df.write.json("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/usg_del1_json",mode="overwrite")
spark.read.json("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/usg_del1_json").display()

#5. Read any one of the above delta table in a dataframe and write it to another table
usg_del_tabl_df=spark.read.table("telecom_catalog_assign.landing_zone.cust_delta_tbl")
usg_del_tabl_df.show()

usg_del_tabl_df.write.saveAsTable("telecom_catalog_assign.landing_zone.cust_delta_tbl_copy",mode="overwrite")
display(spark.sql("select * from telecom_catalog_assign.landing_zone.cust_delta_tbl_copy"))
display(spark.sql("show create table telecom_catalog_assign.landing_zone.cust_delta_tbl_copy"))
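
The path-based conversions in this section all follow the same read-then-write pattern, which can be captured in one small helper. This is a sketch under assumptions rather than a recommended utility: the function name `convert` is hypothetical, it expects a live `spark` session, it always overwrites the destination, and it passes options only to the writer, so sources that need read options (for example rowTag when reading XML) are not covered.

```python
def convert(spark, src_format, src_path, dst_format, dst_path, **write_options):
    """Read src_path as src_format and rewrite it to dst_path as dst_format."""
    df = spark.read.format(src_format).load(src_path)
    (
        df.write.format(dst_format)
        .options(**write_options)  # e.g. rowTag="customer" for xml output
        .mode("overwrite")
        .save(dst_path)
    )
    return df
```

For instance, the delta-to-xml step could be expressed as `convert(spark, "delta", "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/cust_delta_out", "xml", "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/target/cust_delta_xml", rowTag="customer")`. Table-to-table copies still go through `spark.read.table(...)` and `saveAsTable(...)` as in step 5.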



##16. As a final exercise, write a one- or two-line answer for each of the following:
1. When to use / benefits of csv
2. When to use / benefits of json
3. When to use / benefits of orc
4. When to use / benefits of parquet
5. When to use / benefits of delta
6. When to use / benefits of xml
7. When to use / benefits of delta tables


In [0]:
1. When to use / benefits of csv
2. When to use / benefits of json
3. When to use / benefits of orc
4. When to use / benefits of parquet
5. When to use / benefits of delta
6. When to use / benefits of xml
7. When to use / benefits of delta tables