## Spark Write Operations
- Covers csv, json, orc, parquet, delta, saveAsTable, insertInto, and xml writes with different write modes and the header and sep options

##1. Write Operations (Data Conversion/Schema migration) – CSV Format Usecases
1. Write customer data into CSV format using overwrite mode
2. Write usage data into CSV format using append mode
3. Write tower data into CSV format with header enabled and custom separator (|)
4. Read the tower data in a dataframe and show only 5 rows.
5. Download a file from the catalog volume location to your local machine and open any of the above files in Notepad++ to inspect the data.

In [0]:
# Reading the CSV file and storing it in a dataframe

from pyspark.sql.types import StructType,StructField, StringType, IntegerType

custom_schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("customer_name", StringType(), True),
    StructField("customer_age", IntegerType(), True),
    StructField("customer_city", StringType(), True),
    StructField("customer_plan_type", StringType(), True)])

read_customer_df = spark.read.schema(custom_schema).csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")

# 1.Write customer data into CSV format using overwrite mode
# DataFrameWriter.csv() returns None, so there is nothing to assign or display after the write.
read_customer_df.write.options(header='true').mode("overwrite").csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/csvout/")

# 2.Write usage data into CSV format using append mode
read_usage_csv_df = spark.read.options(header='true',inferSchema='true',sep='\t').csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.csv")
read_usage_csv_df.write.options(header='true').mode("append").csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/csvout/")

# 3.Write tower data into CSV format with header enabled and custom separator (|)
# pathGlobFilter takes a glob pattern, so '*.csv' keeps only the CSV files.
read_tower_df = spark.read.options(header='true',sep='|',inferSchema='true',pathGlobFilter='*.csv',recursiveFileLookup='true').format('csv').load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region*")
display(read_tower_df)

# The write returns None, so the result is not assigned or displayed.
read_tower_df.write.options(header='true',sep='|').mode("overwrite").csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/csvout/")

# 4.Read the tower data in a dataframe and show only 5 rows.
display(read_tower_df.limit(5))

# 5.Download a file from the catalog volume location to the local machine and open any of the above files in Notepad++ to inspect the data.
'''Yes, I could download the files from the catalog volume location to my local machine and view the data of the above files in Notepad++.'''
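
For a quick look at a downloaded part file without opening an editor, a few lines of plain Python are enough. The sketch below is only an illustration and not part of the original notebook: `peek_delimited_file` is a hypothetical helper, and the demo file name and contents are made up to stand in for a downloaded `|`-separated tower file.

```python
import csv
import os
import tempfile
from itertools import islice


def peek_delimited_file(path, sep="|", n=5):
    """Return (header, first n data rows) of a delimited text file with a header line."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=sep)
        header = next(reader, [])
        rows = list(islice(reader, n))
    return header, rows


# Demo with made-up data standing in for a downloaded part-*.csv file.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "part-00000.csv")
with open(demo_path, "w", newline="", encoding="utf-8") as handle:
    handle.write("tower_id|region\n")
    for i in range(1, 8):
        handle.write(f"T{i}|R{i % 3}\n")

header, rows = peek_delimited_file(demo_path)
print(header)
for row in rows:
    print(row)
```

Pointing `demo_path` at a real downloaded part file should show the header written by `header='true'` followed by the first five records.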


##2. Write Operations (Data Conversion/Schema migration) – JSON Format Usecases
1. Write customer data into JSON format using overwrite mode
2. Write usage data into JSON format using append mode and snappy compression format
3. Write tower data into JSON format using ignore mode and observe the behavior of this mode
4. Read the tower data in a dataframe and show only 5 rows.
5. Download a file from the catalog volume location to your local hard disk and open any of the above files in Notepad++ to inspect the data.

In [0]:
#1.Write customer data into JSON format using overwrite mode
# DataFrameWriter.json() returns None, so the result is not assigned to a variable.
read_customer_df.write.mode("overwrite").json("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/jsonout/")

#2.Write usage data into JSON format using append mode and snappy compression format
read_usage_csv_df = spark.read.options(header= 'true',inferSchema="True",sep ='\t').csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.csv")
display(read_usage_csv_df)

read_usage_csv_df.write.mode("append").option("compression","snappy").json("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/jsonout/")

#3.Write tower data into JSON format using ignore mode and observe the behavior of this mode
# ignore: if the target path already exists, the write is skipped silently and the existing data is left untouched.
read_tower_df.write.mode("ignore").json("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/jsonout/")

#4.Read the tower data in a dataframe and show only 5 rows
display(read_tower_df.limit(5))

#5.Download a file from the catalog volume location to the local hard disk and open any of the above files in Notepad++ to inspect the data.
'''Yes, I was able to download the files locally from the catalog volume location and view the data of all three files using Notepad++. 
But, out of 3 files, only two were in readable JSON format. The 'usage' file was compressed, so I was unable to view the data in a clear format.'''
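
The write modes used so far differ only in what happens when the target path already exists. The following is a plain-Python model of that behavior, written as an illustration under that assumption; `plan_write` is a hypothetical helper, not a Spark API, and it only describes the outcome rather than performing a write.

```python
def plan_write(mode, target_exists):
    """Describe what a DataFrameWriter does for a given save mode and target state."""
    mode = mode.lower()  # Spark treats the mode string case-insensitively
    known_modes = {"append", "overwrite", "ignore", "error", "errorifexists", "default"}
    if mode not in known_modes:
        raise ValueError(f"unknown save mode: {mode}")
    if not target_exists:
        # With no existing target, every mode simply writes the data.
        return "create target and write data"
    if mode == "append":
        return "add new files next to the existing data"
    if mode == "overwrite":
        return "replace the existing data"
    if mode == "ignore":
        return "skip the write silently, existing data untouched"
    # error, errorifexists and default all refuse to touch an existing target.
    return "fail because the target already exists"


for mode in ("append", "overwrite", "ignore", "error"):
    print(mode, "->", plan_write(mode, target_exists=True))
```

This is why rerunning the `ignore` cell leaves `jsonout/` unchanged, while rerunning a cell that uses `error` mode raises an exception on the second run.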


##3. Write Operations (Data Conversion/Schema migration) – Parquet Format Usecases
1. Write customer data into Parquet format using overwrite mode and in a gzip format
2. Write usage data into Parquet format using error mode
3. Write tower data into Parquet format with gzip compression option
4. Read the usage data in a dataframe and show only 5 rows.
5. Download a file from the catalog volume location to your local hard disk and open any of the above files in Notepad++ to inspect the data.

In [0]:
#1.Write customer data into Parquet format using overwrite mode and in a gzip format
from pyspark.sql.types import StructType,StructField, StringType, IntegerType

custom_schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("customer_name", StringType(), True),
    StructField("customer_age", IntegerType(), True),
    StructField("customer_city", StringType(), True),
    StructField("customer_plan_type", StringType(), True)])

read_customer_df = spark.read.schema(custom_schema).csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")
display(read_customer_df)

# DataFrameWriter.parquet() returns None, so the result is not assigned to a variable.
read_customer_df.write.mode("overwrite").option("compression","gzip").parquet("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/parquetout/")

#2.Write usage data into Parquet format using error mode
read_usage_csv_df = spark.read.options(header= 'true',inferSchema="True",sep ='\t').csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.csv")
display(read_usage_csv_df)

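# error (same as errorifexists, the default mode): the write fails if the target path already exists, so rerunning this cell raises an exception.
read_usage_csv_df.write.mode("error").parquet("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/parquetout/")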

#3.Write tower data into Parquet format with gzip compression option
read_tower_df = spark.read.options(header='true',sep='|',inferSchema='true',pathGlobFilter='*.csv',recursiveFileLookup='true').format('csv').load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region*")
display(read_tower_df)

read_tower_df.write.mode("overwrite").option("compression","gzip").parquet("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/parquetout/")

#4.Read the usage data in a dataframe and show only 5 rows.

display(read_usage_csv_df.limit(5))

#5.Download a file from the catalog volume location to the local hard disk and open any of the above files in Notepad++ to inspect the data.
'''Yes, I was able to download the file to my local machine from the catalog volume location, but I couldn't read the data in Notepad++ because Parquet is a compressed binary format, not plain text.'''


##4. Write Operations (Data Conversion/Schema migration) – ORC Format Usecases
1. Write customer data into ORC format using overwrite mode
2. Write usage data into ORC format using append mode
3. Write tower data into ORC format and see the output file structure
4. Read the usage data in a dataframe and show only 5 rows.
5. Download a file from the catalog volume location to your local hard disk and open any of the above files in Notepad++ to inspect the data.

In [0]:
# 1.Write customer data into ORC format using overwrite mode
from pyspark.sql.types import StructType,StructField, StringType, IntegerType

custom_schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("customer_name", StringType(), True),
    StructField("customer_age", IntegerType(), True),
    StructField("customer_city", StringType(), True),
    StructField("customer_plan_type", StringType(), True)])

read_customer_df = spark.read.schema(custom_schema).csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")
display(read_customer_df)

# save() returns None, so the result is not assigned to a variable.
read_customer_df.write.mode("overwrite").format('orc').save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/orcout/")

#2.Write usage data into ORC format using append mode
read_usage_csv_df = spark.read.options(header= 'true',inferSchema="True",sep ='\t').csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.csv")
display(read_usage_csv_df)

read_usage_csv_df.write.mode("append").format('orc').save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/orcout/")

#3.Write tower data into ORC format and see the output file structure
read_tower_csv_df = spark.read.options(header='true',sep='|',inferSchema='true',pathGlobFilter='*.csv',recursiveFileLookup='true').format('csv').load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region*")
display(read_tower_csv_df)

read_tower_csv_df.write.mode("overwrite").orc("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/orcout/")

#4.Read the usage data in a dataframe and show only 5 rows
display(read_usage_csv_df.limit(5))

#5.Download a file from the catalog volume location to the local hard disk and open any of the above files in Notepad++ to inspect the data
'''Yes, I was able to download the file to my local machine from the catalog volume location, but I couldn't read the data in Notepad++ because ORC is a compressed binary format, not plain text.'''


##5. Write Operations (Data Conversion/Schema migration) – Delta Format Usecases
1. Write customer data into Delta format using overwrite mode
2. Write usage data into Delta format using append mode
3. Write tower data into Delta format and see the output file structure
4. Read the usage data in a dataframe and show only 5 rows.
5. Download a file from the catalog volume location to your local hard disk and open any of the above files in Notepad++ to inspect the data.
6. Compare the parquet location and delta location and try to understand what is the differentiating factor, as both are parquet files only.

In [0]:
#1.Write customer data into Delta format using overwrite mode
from pyspark.sql.types import StructType,StructField, StringType, IntegerType

custom_schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("customer_name", StringType(), True),
    StructField("customer_age", IntegerType(), True),
    StructField("customer_city", StringType(), True),
    StructField("customer_plan_type", StringType(), True)])

read_customer_df = spark.read.schema(custom_schema).csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv")
display(read_customer_df)

# save() returns None, so the result is not assigned to a variable.
read_customer_df.write.mode("overwrite").format('delta').save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/deltaout/")


#2.Write usage data into Delta format using append mode
read_usage_csv_df = spark.read.options(header= 'true',inferSchema="True",sep ='\t').csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.csv")
display(read_usage_csv_df)

read_usage_csv_df.write.mode("append").format('delta').save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/deltaout/")

#3.Write tower data into Delta format and see the output file structure
read_tower_csv_df = spark.read.options(header='true',sep='|',inferSchema='true',pathGlobFilter='*.csv',recursiveFileLookup='true').format('csv').load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region*")
display(read_tower_csv_df)

read_tower_csv_df.write.mode("overwrite").option("compression","gzip").format('delta').save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/deltaout/")

#4.Read the usage data in a dataframe and show only 5 rows.
display(read_usage_csv_df.limit(5))

#5.Download a file from the catalog volume location to the local hard disk and open any of the above files in Notepad++ to inspect the data.
'''I downloaded all the files to my local machine from the catalog volume location, but I was unable to read the data because Delta stores it internally as compressed Parquet files.
However, I could see the transaction logs in the Delta output, which are not present in the ORC and Parquet outputs.'''

#6.Compare the parquet location and delta location and try to understand what is the differentiating factor, as both are parquet files only.
'''The main difference is the transaction log: the Delta location contains a _delta_log folder, which the plain Parquet location does not have.
Behind the scenes, Delta stores the data itself as Parquet files.
The Delta format is optimized for data lakes.
With Delta we can run ACID DML operations (update, delete, merge), so it supports write-many-read-many (WMRM) workloads.
Plain ORC and Parquet files do not support these operations; they only suit write-once-read-many (WORM) workloads.'''
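
The difference between the two locations can also be checked programmatically on the downloaded folders. The sketch below is an illustration only: `describe_location` is a hypothetical helper, and the demo folders contain empty placeholder files rather than real Parquet data. It relies on the fact that a Delta location keeps its commits as JSON files inside a `_delta_log` folder.

```python
import os
import tempfile


def describe_location(path):
    """Classify an output folder as 'delta' or 'plain files' and list its commit files."""
    log_dir = os.path.join(path, "_delta_log")
    if os.path.isdir(log_dir):
        commits = sorted(f for f in os.listdir(log_dir) if f.endswith(".json"))
        return "delta", commits
    return "plain files", []


# Demo folders with empty placeholder files (stand-ins for downloaded outputs).
base = tempfile.mkdtemp()
parquet_dir = os.path.join(base, "parquetout")
delta_dir = os.path.join(base, "deltaout")
os.makedirs(parquet_dir)
os.makedirs(os.path.join(delta_dir, "_delta_log"))
open(os.path.join(parquet_dir, "part-00000.gz.parquet"), "w").close()
open(os.path.join(delta_dir, "part-00000.gz.parquet"), "w").close()
open(os.path.join(delta_dir, "_delta_log", "00000000000000000000.json"), "w").close()

print(describe_location(parquet_dir))
print(describe_location(delta_dir))
```

Each JSON file in `_delta_log` corresponds to one committed write, so an append or overwrite on the Delta location should add another numbered commit file.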


##6. Write Operations (Lakehouse Usecases) – Delta table Usecases
1. Write customer data using saveAsTable() as a managed table
2. Write usage data using saveAsTable() with overwrite mode
3. Drop the managed table and verify data removal
4. Check the table overview in the Catalog and confirm that the table is stored in Delta format.
5. Use spark.sql to write some simple queries on the tables created above.
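
The steps above could be sketched as follows. This is a minimal sketch, not the notebook's solution: the helper names and the table names in the commented usage are hypothetical, the catalog and schema are taken from the volume path used earlier, and `spark.catalog.tableExists` assumes PySpark 3.3 or later. The helpers take the dataframe and session as arguments, so they rely on the notebook's existing `spark` and `read_customer_df`.

```python
def save_as_managed_table(df, table_name, mode="overwrite"):
    """Write df as a managed table; no path is given, so the catalog owns the storage."""
    df.write.mode(mode).saveAsTable(table_name)


def drop_and_verify(spark, table_name):
    """Drop the table and report whether the catalog still lists it afterwards."""
    spark.sql(f"DROP TABLE IF EXISTS {table_name}")
    return spark.catalog.tableExists(table_name)


def first_rows(spark, table_name, n=5):
    """Run a simple query against the table with spark.sql."""
    return spark.sql(f"SELECT * FROM {table_name} LIMIT {n}")


# Usage in the notebook (table names are hypothetical):
# save_as_managed_table(read_customer_df, "telecom_catalog_assign.landing_zone.customer_managed")
# save_as_managed_table(read_usage_csv_df, "telecom_catalog_assign.landing_zone.usage_managed", mode="overwrite")
# display(first_rows(spark, "telecom_catalog_assign.landing_zone.customer_managed"))
# still_there = drop_and_verify(spark, "telecom_catalog_assign.landing_zone.customer_managed")
```

On Databricks, `saveAsTable` creates a Delta table by default, which is what the table overview in the Catalog should show; other Spark setups may default to a different format.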


##7. Write Operations (Lakehouse Usecases) – Delta table Usecases
1. Write customer data using insertInto() in a new table and observe the behavior
2. Write usage data using insertInto() with overwrite mode
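
Two behaviors are worth knowing before trying this: `insertInto()` requires the target table to exist already, and it matches columns by position rather than by name. The sketch below is one way to guard against the second point; `insert_by_position` is a hypothetical helper, the table names in the commented usage are hypothetical, and the helper assumes the dataframe has the same column names as the target table.

```python
def insert_by_position(spark, df, table_name, overwrite=False):
    """Reorder df to the table's column order, then insert into the existing table."""
    # spark.table() fails if the table does not exist, which mirrors insertInto() itself.
    target_columns = spark.table(table_name).columns
    # insertInto() ignores column names, so align the order explicitly first.
    df.select(*target_columns).write.insertInto(table_name, overwrite=overwrite)


# Usage in the notebook (table names are hypothetical):
# insert_by_position(spark, read_customer_df, "telecom_catalog_assign.landing_zone.customer_managed")
# insert_by_position(spark, read_usage_csv_df, "telecom_catalog_assign.landing_zone.usage_managed", overwrite=True)
```

Calling `insertInto()` on a table name that has not been created yet is expected to raise an error instead of creating the table, which is the main contrast with `saveAsTable()`.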

##8. Write Operations (Data Conversion/Schema migration) – XML Format Usecases
1. Write customer data into XML format using rowTag as cust
2. Write usage data into XML format using overwrite mode with the rowTag as usage
3. Download the XML data, open the file in Notepad++, and see what the XML file looks like.
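
A possible shape for the XML writes is sketched below. It assumes a runtime where the `xml` data source is available (built in on recent Databricks runtimes and Spark 4.0, or provided by the spark-xml library otherwise). `write_xml` is a hypothetical helper, and the `xmlout/` paths in the commented usage are assumptions that follow the naming pattern of the earlier output folders.

```python
def write_xml(df, path, row_tag, mode="overwrite"):
    """Write df as XML, wrapping every record in an element named row_tag."""
    (df.write
       .format("xml")
       .mode(mode)
       .option("rowTag", row_tag)
       .save(path))


# Usage in the notebook (output paths are assumptions):
# write_xml(read_customer_df, "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/xmlout/", "cust")
# write_xml(read_usage_csv_df, "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/xmlout/", "usage")
```

In the downloaded file, each row should appear as one `<cust>` or `<usage>` element with a child element per column.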

##9. Compare all the downloaded files (csv, json, orc, parquet, delta and xml) 
1. Capture the size occupied by each of these file formats and list the formats below in order of size, from smallest to largest.
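
Sizes can be collected with a short script instead of checking each folder by hand. The sketch below is an illustration only: `folder_size_bytes` and `rank_by_size` are hypothetical helpers, and the demo folders hold dummy bytes with neutral labels, so the printed order says nothing about the real formats.

```python
import os
import tempfile


def folder_size_bytes(path):
    """Sum the sizes of all files under path, including subfolders such as _delta_log."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def rank_by_size(folders):
    """Return (label, size) pairs sorted from smallest to largest."""
    sizes = {label: folder_size_bytes(path) for label, path in folders.items()}
    return sorted(sizes.items(), key=lambda item: item[1])


# Demo with dummy data; replace these with the downloaded output folders.
base = tempfile.mkdtemp()
demo_folders = {}
for label, size in (("large_demo", 500), ("small_demo", 120)):
    folder = os.path.join(base, label)
    os.makedirs(folder)
    with open(os.path.join(folder, "part-00000"), "wb") as handle:
        handle.write(b"x" * size)
    demo_folders[label] = folder

ranking = rank_by_size(demo_folders)
print(ranking)
```

For the real comparison, the dictionary would map each format name (csv, json, orc, parquet, delta, xml) to its downloaded output folder, ideally all written from the same source data so the sizes are comparable.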

##10. Final exercise: write a one- or two-line answer for each of the following
1. When to use csv and its benefits
2. When to use json and its benefits
3. When to use orc and its benefits
4. When to use parquet and its benefits
5. When to use delta and its benefits
6. When to use xml and its benefits
7. When to use delta tables and their benefits
