#Telecom Domain Read & Write Ops Assignment - Building Datalake & Lakehouse
This notebook contains assignments to practice Spark read options and Databricks volumes. <br>
Sections: Sample data creation, Catalog & Volume creation, Copying data into Volumes, Path glob/recursive reads, toDF() column renaming variants, inferSchema/header/separator experiments, and exercises.<br>

![](https://fplogoimages.withfloats.com/actual/68009c3a43430aff8a30419d.png)
![](https://theciotimes.com/wp-content/uploads/2021/03/TELECOM1.jpg)

##First Import all required libraries & Create spark session object

##1. Write SQL statements to create:
1. A catalog named telecom_catalog_assign
2. A schema landing_zone
3. A volume landing_vol
4. Using dbutils.fs.mkdirs, create folders:<br>
/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/
/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/
/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/
5. Explain the difference between the following (research why Volumes are preferred for production-ready systems):<br>
a. Volume vs DBFS/FileStore<br>
b. Why production teams prefer Volumes for regulated data<br>

In [0]:
%sql
create catalog if not exists telecom_catalog_assign;
create database if not exists telecom_catalog_assign.landing_zone;
create volume if not exists telecom_catalog_assign.landing_zone.landing_vol;

In [0]:
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/")
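The folder paths above follow the Unity Catalog layout /Volumes/<catalog>/<schema>/<volume>/<path>. As a minimal sketch, the helper below assembles such paths so that the catalog, schema, and volume are stated once. Note that `volume_path` is a hypothetical helper written for this illustration, not a Databricks API.

```python
import posixpath

def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build /Volumes/<catalog>/<schema>/<volume>/<parts...> (hypothetical helper)."""
    return posixpath.join("/Volumes", catalog, schema, volume, *parts)

# Base folder of the landing volume created above.
LANDING = volume_path("telecom_catalog_assign", "landing_zone", "landing_vol")

# Print the three folders that the mkdirs calls create.
for folder in ("customer", "usage", "tower"):
    print(posixpath.join(LANDING, folder))
```

Each printed path could then be passed to dbutils.fs.mkdirs instead of repeating the full string.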

##Data files to use in this usecase:
customer_csv = '''
101,Arun,31,Chennai,PREPAID
102,Meera,45,Bangalore,POSTPAID
103,Irfan,29,Hyderabad,PREPAID
104,Raj,52,Mumbai,POSTPAID
105,,27,Delhi,PREPAID
106,Sneha,abc,Pune,PREPAID
'''

usage_tsv = '''customer_id\tvoice_mins\tdata_mb\tsms_count
101\t320\t1500\t20
102\t120\t4000\t5
103\t540\t600\t52
104\t45\t200\t2
105\t0\t0\t0
'''

tower_logs_region1 = '''event_id|customer_id|tower_id|signal_strength|timestamp
5001|101|TWR01|-80|2025-01-10 10:21:54
5004|104|TWR05|-75|2025-01-10 11:01:12
'''

##2. Filesystem operations
1. Write dbutils.fs code to copy the above datasets into your created Volume folders:
Customer → /Volumes/.../customer/
Usage → /Volumes/.../usage/
Tower (region-based) → /Volumes/.../tower/region1/ and /Volumes/.../tower/region2/

2. Write a command to validate whether files were successfully copied

In [0]:
# No leading or trailing spaces: stray whitespace would become part of the field values.
customer_csv = """101,Arun,31,Chennai,PREPAID
102,Meera,45,Bangalore,POSTPAID
103,Irfan,29,Hyderabad,PREPAID
104,Raj,52,Mumbai,POSTPAID
105,,27,Delhi,PREPAID
106,Sneha,abc,Pune,PREPAID"""

usage_tsv = """customer_id\tvoice_mins\tdata_mb\tsms_count
101\t320\t1500\t20
102\t120\t4000\t5
103\t540\t600\t52
104\t45\t200\t2
105\t0\t0\t0"""

tower_logs_region1 = '''event_id|customer_id|tower_id|signal_strength|timestamp
5001|101|TWR01|-80|2025-01-10 10:21:54
5004|104|TWR05|-75|2025-01-10 11:01:12'''

tower_logs_region2 = '''event_id|customer_id|tower_id|signal_strength|timestamp
5002|101|TWR01|-80|2025-01-10 10:21:54
5003|104|TWR05|-75|2025-01-10 11:01:12'''

In [0]:
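# dbutils.fs.put returns True on success (not a DataFrame); the last argument overwrites an existing file.
dbutils.fs.put("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv", customer_csv, True)
dbutils.fs.put("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv", usage_tsv, True)
dbutils.fs.put("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/tower_logs_region1.csv", tower_logs_region1, True)
dbutils.fs.put("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2/tower_logs_region2.csv", tower_logs_region2, True)

# 2. Validate the copy by listing the file names in each target folder.
for folder in ("customer", "usage", "tower/region1", "tower/region2"):
    print(folder, [f.name for f in dbutils.fs.ls(f"dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/{folder}")])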

##3. Spark Directory Read Use Cases
1. Read all tower logs using:
Path glob filter (example: *.csv)
Multiple paths input
Recursive lookup

2. Demonstrate these 3 reads separately:
Using pathGlobFilter
Using list of paths in spark.read.csv([path1, path2])
Using .option("recursiveFileLookup","true")

3. Compare the outputs and understand when each should be used.

In [0]:
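# Use .csv() rather than .text() so that header, inferSchema and delimiter take effect.
tower_root = "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower"
tower_opts = {"header": True, "inferSchema": True, "delimiter": "|"}

# a. pathGlobFilter: keep only files whose name matches the pattern (use "*.csv" if the file name is unknown)
df_glob = spark.read.options(**tower_opts).option("pathGlobFilter", "tower*").csv(f"{tower_root}/*")

# b. Multiple paths: pass an explicit list of folders
df_paths = spark.read.options(**tower_opts).csv([f"{tower_root}/region1", f"{tower_root}/region2"])

# c. recursiveFileLookup=True: also reads files from the subfolders of the base path
df_multiple_path_files = spark.read.options(**tower_opts).option("recursiveFileLookup", True).csv(tower_root)

display(df_multiple_path_files)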
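To compare which files a name filter and a recursive lookup would pick up, the sketch below rebuilds the tower folder layout in a temporary directory and applies both ideas with the standard library. It is only a model: `fnmatch` and `Path.rglob` stand in for Spark's pathGlobFilter and recursiveFileLookup, and the extra notes.txt file is an assumption added to show the filter excluding something.

```python
import fnmatch
import tempfile
from pathlib import Path

def build_tower_tree(root: Path) -> None:
    """Create tower/region1 and tower/region2 with one log file each, plus a stray file."""
    for region in ("region1", "region2"):
        folder = root / "tower" / region
        folder.mkdir(parents=True)
        (folder / f"tower_logs_{region}.csv").write_text("event_id|customer_id\n")
    (root / "tower" / "region1" / "notes.txt").write_text("not a log\n")

def recursive_files(base: Path) -> list:
    """Names of all files under base at any depth (models recursiveFileLookup)."""
    return sorted(p.name for p in base.rglob("*") if p.is_file())

def glob_filtered(names: list, pattern: str) -> list:
    """Keep only the file names that match the pattern (models pathGlobFilter)."""
    return [name for name in names if fnmatch.fnmatch(name, pattern)]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_tower_tree(root)
    everything = recursive_files(root / "tower")
    only_logs = glob_filtered(everything, "tower*")
    print(everything)   # every file in both region folders
    print(only_logs)    # only the files whose name starts with "tower"
```

A rough rule of thumb from this comparison: a list of paths fits when the folders are known up front, a glob filter fits when unwanted files share the folders, and recursive lookup fits when the nesting depth is not known in advance.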


##4. Schema Inference, Header, and Separator
1. Try the Customer, Usage files with the option and options using read.csv and format function:<br>
header=false, inferSchema=false<br>
or<br>
header=true, inferSchema=true<br>
2. Write a note on what changed when header or inferSchema is set to true vs. false.<br>
3. How did schema inference handle “abc” in age?<br>

In [0]:
df1 = (spark.read.format("csv")
       .options(header="false", inferSchema="true")
       .load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")
       )
df1.printSchema()
display(df1)
     

1. header=True: The first row of the file is used as the column names.
1. header=False: Columns are assigned default names such as _c0, _c1, _c2, and so on.
1. toDF with header=False: The user-specified column names replace the default column names.
1. toDF with header=True: The user-specified column names replace the header names. If the file has no header row (as with the customer file), its first data row is consumed as the header and lost, so this combination is not recommended.
1. inferSchema=True: Spark scans the data and infers each column's data type.
1. inferSchema=False: Every column is read as String.

1. The age column is inferred as String because Sneha’s age is "abc".
1. If all values in the age column were numeric, the data type would be inferred as Integer.
1. The name column contains a null value because customer 105 has an empty name.
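The effect of a single non-numeric value on inference can be modelled in a few lines of plain Python. This is a simplified sketch, not Spark's actual inference algorithm (which also considers doubles, timestamps, booleans, and more); `infer_column_type` is a hypothetical function written for this note.

```python
def infer_column_type(values):
    """Return 'int' if every non-empty value parses as an integer, else 'string'.

    Simplified model of schema inference, not Spark's implementation.
    """
    for value in values:
        if value == "":          # empty fields are treated as nulls and skipped
            continue
        try:
            int(value)
        except ValueError:
            return "string"      # one non-numeric value widens the whole column
    return "int"

# Columns taken from the customer data above.
age_column = ["31", "45", "29", "52", "27", "abc"]
name_column = ["Arun", "Meera", "Irfan", "Raj", "", "Sneha"]

print(infer_column_type(age_column))        # string, because of "abc"
print(infer_column_type(age_column[:-1]))   # int once "abc" is excluded
print(infer_column_type(name_column))       # string
```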

##5. Column Renaming Usecases
1. Apply column names using string using toDF function for customer data
2. Apply column names and datatype using the schema function for usage data
3. Apply column names and datatype using the StructType with IntegerType, StringType, TimestampType and other classes for towers data 

In [0]:
#1. Apply column names using string using toDF function for customer data
df_customer = (
            spark.read.options(inferSchema="true")
            .format("csv")
            .load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")
            .toDF("customer_id","name","age","city","plan")
      )
display(df_customer)

In [0]:
#2. Apply column names and datatype using the schema function for usage data
str_struct="customer_id integer, voice_mins integer, data_mb integer, sms_count string"

df_usage_use_schema = (
               spark.read.schema(str_struct)
               .options(header=True, sep="\t")
               .csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv")
             )
display(df_usage_use_schema)

In [0]:
#3. Apply column names and datatype using the StructType with IntegerType, StringType, TimestampType and other classes for towers data
from pyspark.sql.types import StructType,StructField,IntegerType,StringType, TimestampType

cust_schema=StructType(
    [
        StructField("event_id",IntegerType(),True),
        StructField("customer_id",IntegerType(),True),
        StructField("tower_id",StringType(),True),
        StructField("signal_strength",IntegerType(),True),
        StructField("timestamp",TimestampType(),True)
    ]
)
df_logs_cust_schema = (
                    spark.read.schema(cust_schema)
                    .options(header=True,sep="|")
                    .format("csv")
                    .load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/tower_logs_region1.csv")
                 )
#display(df_logs_cust_schema)


df_logs_cust_schema.printSchema()
df_logs_cust_schema.show(2)

## Spark Write Operations using 
- csv, json, orc, parquet, delta, saveAsTable, insertInto, xml with different write mode, header and sep options

##6. Write Operations (Data Conversion/Schema migration) – CSV Format Usecases
1. Write customer data into CSV format using overwrite mode
2. Write usage data into CSV format using append mode
3. Write tower data into CSV format with header enabled and custom separator (|)
4. Read the tower data in a dataframe and show only 5 rows.
5. Download the file into local from the catalog volume location and see the data of any of the above files opening in a notepad++.

In [0]:
#1. Write customer data into CSV format using overwrite mode

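# Read the landing-zone customer file with an explicit schema
# (assumes the target schema and volume under transform_zone already exist)
cust_schema="id int,name string,age string,city string,plan string"
customer_data_df = spark.read.schema(cust_schema).csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")

# Spark writes a folder of part files at this path, not a single file named customer.csv
output_path = "dbfs:///Volumes/telecom_catalog_assign/transform_zone/csv/customer/customer.csv"

# save() returns None, so there is no result to assign or display
customer_data_df.write.mode("overwrite").format("csv").save(output_path)
# Same write with gzip compression; overwrite mode replaces the uncompressed output written above
customer_data_df.write.mode("overwrite").format("csv").save(output_path,compression='gzip')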


In [0]:
#2. Write usage data into CSV format using append mode

new_usage_data_df = spark.read.csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv",sep='\t',header=True)

output_path = "dbfs:///Volumes/telecom_catalog_assign/transform_zone/csv/usage/usage.csv"

# Write the usage DataFrame to CSV in append mode (save() returns None, so nothing is assigned)
new_usage_data_df.write \
                    .format("csv") \
                    .option("header", "true") \
                    .mode("append") \
                    .save(output_path)

print(f"Usage data appended to {output_path}.")


In [0]:
#3. Write tower data into CSV format with header enabled and custom separator (|)

# Column names and types for the tower logs (uses the pyspark.sql.types imports from section 5)
tower_schema=StructType(
    [
        StructField("event_id",IntegerType(),True),
        StructField("customer_id",IntegerType(),True),
        StructField("tower_id",StringType(),True),
        StructField("signal_strength",StringType(),True),
        StructField("timestamp",StringType(),True)
    ]
)
df_tower_logs = spark.read.csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/tower_logs_region1.csv",schema=tower_schema,sep='|',header=True)

# Define the output path
output_path = "dbfs:///Volumes/telecom_catalog_assign/transform_zone/csv/tower/region1/tower_logs_region1.csv"

# Write the DataFrame to CSV format with a pipe separator.
# Overwrite mode keeps the cell re-runnable; the default mode raises an error if the path exists.
df_tower_logs.write.mode("overwrite").csv(output_path,sep='|',header=True)

print(f"Tower data successfully written to {output_path} with pipe separator.")

#4. Read the tower data back into a dataframe and show only 5 rows
df_tower_logs_transform = spark.read.csv(output_path,sep='|',header=True)
df_tower_logs_transform.show(5)

##7. Write Operations (Data Conversion/Schema migration)– JSON Format Usecases
1. Write customer data into JSON format using overwrite mode
2. Write usage data into JSON format using append mode and snappy compression format
3. Write tower data into JSON format using ignore mode and observe the behavior of this mode
4. Read the tower data in a dataframe and show only 5 rows.
5. Download the file into local harddisk from the catalog volume location and see the data of any of the above files opening in a notepad++.

In [0]:
#1. Write customer data into JSON format using overwrite mode 

#Read customer data:
schema1="id int,name string,age string,city string,plan string"
df1=spark.read.schema(schema1).csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")

#write customer data
df1.write.mode("overwrite").format("json").save("dbfs:///Volumes/telecom_catalog_assign/transform_zone/json/customer/customer.json")
df1.write.mode("overwrite").format("json").save("dbfs:///Volumes/telecom_catalog_assign/transform_zone/json/customer_gzip/customer.json",compression="gzip")


In [0]:
#2. Write usage data into JSON format using append mode and snappy compression format

#Read usage data
df2=spark.read.csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv",sep='\t',header=True)
#write usage data
df2.write.mode("append").format("json").save("dbfs:///Volumes/telecom_catalog_assign/transform_zone/json/usage/usage.json",compression="snappy")


In [0]:
#3. Write tower data into JSON format using ignore mode and observe the behavior of this mode

#read tower data
df3=spark.read.csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/tower_logs_region1.csv",sep='|',header=True)
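#write tower data: ignore mode writes only if the target path does not exist; otherwise it silently skips the write and leaves the existing data untouched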
df3.write.mode("ignore").json("dbfs:///Volumes/telecom_catalog_assign/transform_zone/json/tower/region1/tower_logs_region1.json")
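The four save modes used in these exercises (overwrite, append, ignore, error) differ only in what happens when the target already exists. The sketch below mimics that decision on a single local text file; `save` is a hypothetical stand-in for illustration and does not reproduce how Spark writes part files.

```python
import tempfile
from pathlib import Path

VALID_MODES = {"error", "ignore", "append", "overwrite"}

def save(rows, target: Path, mode: str = "error") -> str:
    """Mimic Spark's save modes on one text file (hypothetical helper)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    if target.exists():
        if mode == "error":
            raise FileExistsError(f"{target} already exists")
        if mode == "ignore":
            return "skipped"                    # existing data is left untouched
        if mode == "append":
            with target.open("a") as fh:
                fh.writelines(f"{row}\n" for row in rows)
            return "appended"
    # Target is missing, or mode is overwrite: write the rows from scratch.
    target.write_text("".join(f"{row}\n" for row in rows))
    return "written"

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "tower.json"
    first = save(["row1"], out, mode="ignore")      # target missing, so data is written
    second = save(["row2"], out, mode="ignore")     # target exists, so the write is skipped
    third = save(["row3"], out, mode="append")      # adds to the existing data
    after_append = out.read_text().splitlines()
    fourth = save(["row4"], out, mode="overwrite")  # replaces the existing data
    after_overwrite = out.read_text().splitlines()
    error_raised = False
    try:
        save(["row5"], out, mode="error")           # target exists, so this raises
    except FileExistsError:
        error_raised = True

print(first, second, third, fourth, error_raised)
```

Running an ignore-mode cell twice and seeing no change in the output folder is therefore expected behaviour, not a failed write.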

##8. Write Operations (Data Conversion/Schema migration) – Parquet Format Usecases
1. Write customer data into Parquet format using overwrite mode and in a gzip format
2. Write usage data into Parquet format using error mode
3. Write tower data into Parquet format with gzip compression option
4. Read the usage data in a dataframe and show only 5 rows.
5. Download the file into local harddisk from the catalog volume location and see the data of any of the above files opening in a notepad++.

In [0]:
#1. Write customer data into Parquet format using overwrite mode and in a gzip format
#Read customer data:
schema1="id int,name string,age string,city string,plan string"
df1=spark.read.schema(schema1).csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")

#write customer data
df1.write.mode("overwrite").format("parquet").save("dbfs:///Volumes/telecom_catalog_assign/transform_zone/parquet/customer/customer.parquet",compression="gzip")

#2. Write usage data into Parquet format using error mode
#read usage data
df2=spark.read.csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv",sep='\t',header=True)

#write usage data
df2.write.mode("error").format("parquet").save("dbfs:///Volumes/telecom_catalog_assign/transform_zone/parquet/usage/usage.parquet")

#3. Write tower data into Parquet format with gzip compression option
#read tower data
df3=spark.read.csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/tower_logs_region1.csv",sep='|',header=True)
#write tower data
df3.write.mode("overwrite").parquet("dbfs:///Volumes/telecom_catalog_assign/transform_zone/parquet/tower/region1/tower_logs_region1.parquet",compression="gzip")


#4. Read the usage data from the transform zone path and show only 5 rows
df4=spark.read.parquet("dbfs:///Volumes/telecom_catalog_assign/transform_zone/parquet/usage/usage.parquet")
df4.show(5)

##9. Write Operations (Data Conversion/Schema migration) – Orc Format Usecases
1. Write customer data into ORC format using overwrite mode
2. Write usage data into ORC format using append mode
3. Write tower data into ORC format and see the output file structure
4. Read the usage data in a dataframe and show only 5 rows.
5. Download the file into local harddisk from the catalog volume location and see the data of any of the above files opening in a notepad++.

In [0]:
#1. Write customer data into ORC format using overwrite mode
schema1="id int,name string,age string,city string,plan string"
df1=spark.read.schema(schema1).csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")

#write customer data
df1.write.mode("overwrite").format("orc").save("dbfs:///Volumes/telecom_catalog_assign/transform_zone/orc/customer/customer.orc")
df1.write.mode("overwrite").format("orc").save("dbfs:///Volumes/telecom_catalog_assign/transform_zone/orc/customer_zstd/customer.orc",compression="zstd")

#2. Write usage data into ORC format using append mode
#read usage data
df2=spark.read.csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv",sep='\t',header=True)
#write usage data
df2.write.mode("append").format("orc").save("dbfs:///Volumes/telecom_catalog_assign/transform_zone/orc/usage/usage.orc")

#3. Write tower data into ORC format and see the output file structure
#read tower logs data
df3=spark.read.csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/tower_logs_region1.csv",sep='|',header=True)
#write tower data
df3.write.mode("overwrite").orc("dbfs:///Volumes/telecom_catalog_assign/transform_zone/orc/tower/region1/tower_logs_region1.orc",compression="snappy")

#4. Read the usage data in a dataframe and show only 5 rows.
df4=spark.read.orc("dbfs:///Volumes/telecom_catalog_assign/transform_zone/orc/usage/usage.orc")
df4.show(5)

##10. Write Operations (Data Conversion/Schema migration) – Delta Format Usecases
1. Write customer data into Delta format using overwrite mode
2. Write usage data into Delta format using append mode
3. Write tower data into Delta format and see the output file structure
4. Read the usage data in a dataframe and show only 5 rows.
5. Download the file into local harddisk from the catalog volume location and see the data of any of the above files opening in a notepad++.
6. Compare the parquet location and delta location and try to understand what is the differentiating factor, as both are parquet files only.
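For point 6, the differentiating factor is the _delta_log folder: a Delta location holds Parquet data files plus a transaction log of JSON commit files, while a plain Parquet location holds only data files. The sketch below checks for that folder on a local path. The folder and part-file names in the demo are made up for illustration, and `is_delta_table_dir` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

def is_delta_table_dir(path: Path) -> bool:
    """True when the folder holds a _delta_log directory with at least one JSON commit file."""
    log_dir = path / "_delta_log"
    return log_dir.is_dir() and any(log_dir.glob("*.json"))

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Plain Parquet output: data files only (empty placeholder file).
    parquet_dir = root / "usage.parquet"
    parquet_dir.mkdir()
    (parquet_dir / "part-00000.snappy.parquet").write_bytes(b"")

    # Delta output: the same kind of data file plus a transaction log.
    delta_dir = root / "usage.delta"
    (delta_dir / "_delta_log").mkdir(parents=True)
    (delta_dir / "part-00000.snappy.parquet").write_bytes(b"")
    (delta_dir / "_delta_log" / "00000000000000000000.json").write_text("{}")

    parquet_is_delta = is_delta_table_dir(parquet_dir)
    delta_is_delta = is_delta_table_dir(delta_dir)

print(parquet_is_delta, delta_is_delta)
```

The same check should also work on a downloaded copy of the two output folders.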

In [0]:

#1. Write customer data into Delta format using overwrite mode
#read cust data
schema1="id int,name string,age string,city string,plan string"
df1=spark.read.schema(schema1).csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")

#write customer data
df1.write.mode("overwrite").format("delta").save("dbfs:///Volumes/telecom_catalog_assign/transform_zone/delta/customer/customer.delta")

#2. Write usage data into Delta format using append mode
#read usage data
schema2="customer_id int,voice_mins int,data_mb int,sms_count int"
df2=spark.read.schema(schema2).csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv",sep='\t',header=True)

#write usage data
df2.write.mode("append").format("delta").save("dbfs:///Volumes/telecom_catalog_assign/transform_zone/delta/usage/usage.delta")

#3. Write tower data into Delta format and see the output file structure
#read tower logs data
schema3="event_id int,customer_id int,tower_id string,signal_strength string, timestamp timestamp"
df3=spark.read.schema(schema3).csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/tower_logs_region1.csv",sep='|',header=True)

#write tower data
df3.write.mode("overwrite").format("delta").save("dbfs:///Volumes/telecom_catalog_assign/transform_zone/delta/tower/region1/tower_logs_region1.delta")

#4. Read the usage data in a dataframe and show only 5 rows.
df4=spark.read.format("delta").load("dbfs:///Volumes/telecom_catalog_assign/transform_zone/delta/usage/usage.delta")
df4.show(5)

##11. Write Operations (Lakehouse Usecases) – Delta table Usecases
1. Write customer data using saveAsTable() as a managed table
2. Write usage data using saveAsTable() with overwrite mode
3. Drop the managed table and verify data removal
4. Go and check the table overview and realize it is in delta format in the Catalog.
5. Use spark.sql to write some simple queries on the tables created above.


In [0]:

#1. Write customer data using saveAsTable() as a managed table
#read cust data
schema1="id int,name string,age string,city string,plan string"
df1=spark.read.schema(schema1).csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")

#write customer data
df1.write.mode("overwrite").saveAsTable("telecom_catalog_assign.transform_zone.customer")

#2. Write usage data using saveAsTable() with overwrite mode
#read usage data (inferSchema is not needed when an explicit schema is supplied)
schema2="customer_id int,voice_mins int,data_mb int,sms_count int"
df2=spark.read.schema(schema2).csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv",sep='\t',header=True)
#write usage data
df2.write.mode("overwrite").saveAsTable("telecom_catalog_assign.transform_zone.usage")

#3. Drop the managed table and verify data removal
#read tower data
schema3="event_id int,customer_id int,tower_id string,signal_strength string, timestamp timestamp"
df3=spark.read.schema(schema3).csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/tower_logs_region1.csv",sep='|',header=True)
#write tower data as a managed table
df3.write.mode("overwrite").saveAsTable("telecom_catalog_assign.transform_zone.tower_logs_region1")
#drop it, then list the schema's tables to confirm that the tower table is gone
spark.sql("drop table if exists telecom_catalog_assign.transform_zone.tower_logs_region1")
spark.sql("show tables in telecom_catalog_assign.transform_zone").show()

#5. Use spark.sql to run a simple query on the tables created above
df4=spark.sql("select sum(voice_mins) as total_voice_mins,sum(data_mb) as total_data_mb, sum(sms_count) as total_sms_count from telecom_catalog_assign.transform_zone.usage")
df4.show()

##12. Write Operations (Lakehouse Usecases) – Delta table Usecases
1. Write customer data using insertInto() in a new table and find the behavior
2. Write usage data using insertInto() with overwrite mode

In [0]:
df1=spark.sql("select sum(voice_mins) as total_voice_mins,sum(data_mb) as total_data_mb, sum(sms_count) as total_sms_count from telecom_catalog_assign.transform_zone.usage") 
df1.write.mode("overwrite").saveAsTable("telecom_catalog_assign.transform_zone.usage_summary")
df1.write.insertInto("telecom_catalog_assign.transform_zone.usage_summary",overwrite=True)

#or

df1.createOrReplaceTempView("usage_summary_temp")
spark.sql("insert overwrite table telecom_catalog_assign.transform_zone.usage_summary select * from usage_summary_temp")
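One behaviour worth observing with insertInto() is that it resolves columns by position and ignores the DataFrame's column names. The plain-Python sketch below models the difference between position-based and name-based assignment using the usage totals from the sample data; both helper functions are hypothetical and only illustrate the idea.

```python
def insert_by_position(table_columns, row):
    """Assign values to table columns in order, ignoring the row's own column names."""
    return dict(zip(table_columns, row.values()))

def insert_by_name(table_columns, row):
    """Assign values by matching column names."""
    return {col: row[col] for col in table_columns}

table_columns = ["total_voice_mins", "total_data_mb", "total_sms_count"]

# A row whose columns arrive in a different order than the table defines them.
row = {"total_data_mb": 6300, "total_sms_count": 79, "total_voice_mins": 1025}

by_position = insert_by_position(table_columns, row)
by_name = insert_by_name(table_columns, row)

print(by_position)   # values land in the wrong columns
print(by_name)       # values land in the matching columns
```

With position-based resolution, keeping the select list in the same order as the target table's columns avoids silently misplaced values.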
     

##13. Write Operations (Data Conversion/Schema migration) – XML Format Usecases
1. Write customer data into XML format using rowTag as cust
2. Write usage data into XML format using overwrite mode with the rowTag as usage
3. Download the xml data and open the file in notepad++ and see how the xml file looks like.

In [0]:

#1. Write customer data into XML format using rowTag as cust
#read cust data
schema1="id int,name string,age string,city string,plan string"
df1=spark.read.schema(schema1).csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")

#write customer data
df1.write.mode("overwrite").option("rowTag","cust").format("xml").save("dbfs:///Volumes/telecom_catalog_assign/transform_zone/xml/customer/customer.xml")
df1.write.mode("overwrite").option("rowTag","cust").format("xml").save("dbfs:///Volumes/telecom_catalog_assign/transform_zone/xml/customer_gzip/customer.xml",compression="gzip")

#2. Write usage data into XML format using overwrite mode with the rowTag as usage
#read usage data
schema2="customer_id int,voice_mins int,data_mb int,sms_count int"
df2=spark.read.schema(schema2).csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv",sep='\t',header=True)

#write usage data
df2.write.mode("overwrite").option("rowTag","usage").format("xml").save("dbfs:///Volumes/telecom_catalog_assign/transform_zone/xml/usage/usage1.xml")

#3. Write tower log data into XML format using overwrite mode with the rowTag as tower_region1_metrics
#read_tower_data
schema3="event_id int,customer_id int,tower_id string,signal_strength string, timestamp timestamp"
df3=spark.read.schema(schema3).csv("dbfs:///Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/tower_logs_region1.csv",sep='|',header=True,inferSchema=True)

#write tower data
df3.write.mode("overwrite").option("rowTag","tower_region1_metrics").format("xml").save("dbfs:///Volumes/telecom_catalog_assign/transform_zone/xml/tower/region1/tower_logs_region1.xml")
     

##14. Compare all the downloaded files (csv, json, orc, parquet, delta and xml) 
1. Capture the size occupied between all of these file formats and list the formats below based on the order of size from small to big.
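Because Spark writes each output as a folder of part files, comparing formats means summing the files in each folder. The sketch below does that and sorts the totals from small to big. The demo folders and byte counts are arbitrary placeholders, not measurements of any format; `folder_size_bytes` and `rank_by_size` are hypothetical helpers.

```python
import os
import tempfile
from pathlib import Path

def folder_size_bytes(folder) -> int:
    """Total size of all files under folder, including nested part files."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(folder):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

def rank_by_size(folders: dict) -> list:
    """Return (label, bytes) pairs sorted from smallest to largest."""
    sizes = {label: folder_size_bytes(path) for label, path in folders.items()}
    return sorted(sizes.items(), key=lambda item: item[1])

# Demo with placeholder folders; the sizes are arbitrary, not real results.
with tempfile.TemporaryDirectory() as tmp:
    demo = {}
    for label, size in {"format_a": 300, "format_b": 500, "format_c": 120}.items():
        folder = Path(tmp) / label
        folder.mkdir()
        (folder / "part-00000").write_bytes(b"x" * size)
        demo[label] = folder
    ranking = rank_by_size(demo)

print(ranking)
```

To use it for the exercise, replace the demo dictionary with the downloaded csv, json, orc, parquet, delta and xml folders.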

##15. Do a final exercise of defining one/two liner of... 
1. When to use / benefits of csv
2. When to use / benefits of json
3. When to use / benefits of orc
4. When to use / benefits of parquet
5. When to use / benefits of delta
6. When to use / benefits of xml
7. When to use / benefits of delta tables


In [0]:
When to use / benefits of csv ->

  storing raw data
  occupies more space than the columnar formats
  gains less from gzip than xml or json, which repeat field names in every record
When to use / benefits of json ->

  storing api logs
  compresses well with gzip
  storing semi-structured data
When to use / benefits of orc ->

  columnar file format
  works well with zstd (smaller files) and, to some extent, snappy (speed)
  hive tables
  default compression is snappy
When to use / benefits of parquet ->

  default compression is snappy
  columnar file format
  faster reads and writes
  comparatively more storage
When to use / benefits of delta ->

  parquet files with snappy compression by default (databricks default)
  columnar file format
  faster reads and writes
  comparatively more storage
  acid transactions
  time travel with the help of the delta log
When to use / benefits of xml ->

  xml with gzip - maximum compression
  used in legacy systems
When to use / benefits of delta tables ->

  tables that receive updates
  tables that need acid guarantees
  workloads that require time travel