# By mastering this notebook, we can become an eligible "Data Egress Developer/Engineer"
### We write data in structured (CSV), semi-structured (JSON/XML), and serialized (ORC/Parquet/Delta) file formats to the data lake, and in table (Delta/Hive) format to the lakehouse

### Let's load some data we already have...

In [0]:
%sql
create database if not exists workspace.wd36schema;
create volume if not exists workspace.wd36schema.ingestion_volume;


In [0]:
dbutils.fs.mkdirs("/Volumes/workspace/wd36schema/ingestion_volume/source")

In [0]:
#Extract
ingest_df1=spark.read.csv("/Volumes/workspace/wd36schema/ingestion_volume/source/custs_header",header=True,sep=',',inferSchema=True,samplingRatio=0.10)

### Writing the data in built-in file formats to different targets (Spark can write to practically any target system)

#### 1. Writing in CSV format (structured data: 2D tables/frames with rows and columns) with a few basic options listed below (schema (structure) migration)
custid,fname,lname,age,profession -> custid~fname~lname~prof~age
- header
- sep
- mode
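
To make the three options concrete without a Spark cluster, here is a minimal sketch that mimics the same delimiter and column-order change with Python's standard `csv` module. The helper name `migrate_csv` and the single sample row are illustrative assumptions, not part of the notebook's dataset or of the Spark API.

```python
import csv
import io

def migrate_csv(text, columns, renames=None, sep="~", header=True):
    """Re-emit comma-delimited text with a new delimiter, column order and names."""
    renames = renames or {}
    reader = csv.DictReader(io.StringIO(text))  # source side: header row, sep=','
    out = io.StringIO()
    writer = csv.writer(out, delimiter=sep, lineterminator="\n")
    if header:
        # header=True on the target side: write the (possibly renamed) column names
        writer.writerow([renames.get(c, c) for c in columns])
    for row in reader:
        writer.writerow([row[c] for c in columns])
    return out.getvalue()

# Hypothetical one-row sample in the source layout
sample = "custid,fname,lname,age,profession\n4000001,Kristina,Chung,55,Pilot\n"
result = migrate_csv(
    sample,
    ["custid", "fname", "lname", "profession", "age"],
    renames={"profession": "prof"},
)
print(result)
```

In Spark the same roles are played by `header`, `sep` and the column selection on the DataFrame; `mode` only decides what happens when the target path already exists.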

In [0]:
#Delimiter migration: the source is comma-separated, the target is tilde-separated
ingest_df1.write.csv(path="/Volumes/workspace/wd36schema/ingestion_volume/target/csvout",sep='~',header=True,mode='overwrite')
#4 modes of writing - append,overwrite,ignore,error (error is the default)
display(ingest_df1)

In [0]:
#Schema migration by applying transformations (this is our bread and butter, covered in depth later)
#Transform: reorder the columns and rename profession to prof
transformed_df=ingest_df1.select("custid","fname","lname","profession","age").withColumnRenamed("profession","prof")#DSL transformation (not for now...)
#Load: overwrite the same target, this time with gzip-compressed part files
transformed_df.write.csv(path="/Volumes/workspace/wd36schema/ingestion_volume/target/csvout",sep='~',header=True,mode='overwrite',compression='gzip')

In [0]:
transformed_df.show()


#### 2. Writing in JSON format with a few basic options listed below
Options: path, mode

- We did a schema migration and data conversion from CSV to JSON format (i.e. structured to semi-structured format)
- JSON - we learn a lot more subsequently (nested/hierarchical/complex/multiline...)
- What is JSON - fundamentally it is a set of key-value pairs (like a Python dictionary) whose values can themselves be nested objects or arrays
- JSON - JavaScript Object Notation
- Standard JSON format (can't be changed) - {"k1":"string value","k2":123,"k3":true} where each key should be unique and must be enclosed in double quotes, and a value can be a string, number, boolean, null, object or array
- **When to go with JSON, or its benefits** -
- a. If we have data in a semi-structured format (variable data format with a dynamic schema)
- eg. {"custid":4000001,"profession":"Pilot","age":55,"city":"NY"}
-     {"custid":4000001,"fname":"Kristina","lname":"Chung","prof":"Pilot","age":"55"}
- b. Column names, types, or order can differ from record to record
- c. Sources provide JSON when the data is dynamic in nature (the number or order of columns is not fixed) or when the data is an API response.
- d. JSON is a text-based, widely supported format for exchanging data between applications over a network; it is easy to parse and suits object-by-object operations (row-by-row processing in real time, e.g. Amazon clickstream events)
- e. JSON can be used to group data or build a hierarchy in a complex or nested format, e.g. https://randomuser.me/api/
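
Points a and b can be illustrated with the standard `json` module. This is a small sketch, not the notebook's pipeline: the two records are the examples from the list above, written one object per line, which is the layout Spark's JSON writer also produces.

```python
import json

# The two example records from the list above: different keys, and age as int vs string
records = [
    {"custid": 4000001, "profession": "Pilot", "age": 55, "city": "NY"},
    {"custid": 4000001, "fname": "Kristina", "lname": "Chung", "prof": "Pilot", "age": "55"},
]

# Serialize: one compact JSON object per line
json_lines = "\n".join(json.dumps(r, separators=(",", ":")) for r in records)

# Deserialize: every line parses on its own, so records can be handled one by one
parsed = [json.loads(line) for line in json_lines.splitlines()]

# The "schema" is simply the union of keys seen so far
all_keys = sorted({key for record in parsed for key in record})
age_types = [type(record["age"]).__name__ for record in parsed]

print(json_lines)
print(all_keys)
print(age_types)
```

Because each line carries its own keys, a reader has to reconcile differing columns and types itself; that flexibility is the main trade-off against a fixed-schema format such as CSV.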

In [0]:

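ingest_df1.write.json(path="/Volumes/workspace/wd36schema/ingestion_volume/target/jsonout",mode='append')
#append adds new files on every run, so rerunning this cell duplicates the data
#custid,fname,lname,age,profession -> {"custid":4000001,"fname":"Kristina","lname":"Chung","age":55,"profession":"Pilot"}
#ingest_df1 keeps the original column names and order; the rename to prof exists only in transformed_df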


#### 3. Serialization (encoding in a more optimized fashion) & deserialization file formats (binary/"brainy" file formats)
Data mechanics:
1. Encoding/decoding (machine format) - converting data from a human-readable format to a machine-understandable format for performant data transfer (e.g. data sent over a network is encoded)
2. *Compression/decompression (encoding + space + time) - shrinking the data with a compression library (a trade-off between time and size), e.g. compress before storing or transferring - snappy is a common compression codec on big data platforms
3. Encryption (encoding + security) - in addition to encoding, encryption adds security through a key, so the data is both performant and secured (algorithms such as AES/DES/RSA; note that SHA and MD5 are one-way hash functions and DSA is a signature algorithm, not encryption)
4. *Serialization (more applicable to big data) - serialization is encoding plus space savings plus a processing-aware big data format - fast, compact, interoperable, extensible (additional configs), scalable (cluster compute operations), and not human-readable (binary format, which is not a substitute for encryption)
5. *Masking - obscuring data (in some other, non-machine format) in a way that should not be reversible (used for security purposes)
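
The differences between these mechanics are easier to see side by side. The sketch below uses only the Python standard library; the repeated sample row and the `mask` helper are illustrative assumptions. Encryption is left out on purpose because it needs a key and usually a third-party library.

```python
import base64
import hashlib
import zlib

# Hypothetical payload: one customer row repeated so that compression has something to find
raw = ("4000001,Kristina,Chung,55,Pilot\n" * 200).encode("utf-8")

# 1. Encoding: reversible without any key, and here it even grows the data
encoded = base64.b64encode(raw)
decoded = base64.b64decode(encoded)

# 2. Compression: reversible, trades CPU time for size
compressed = zlib.compress(raw)
restored = zlib.decompress(compressed)

# Hashing (SHA-256): one-way, fixed-size digest, cannot be decoded back
digest = hashlib.sha256(raw).hexdigest()

# 5. Masking: hide all but the last few characters, not meant to be reversed
def mask(value, visible=2):
    keep = min(visible, len(value))
    return "*" * (len(value) - keep) + value[len(value) - keep:]

masked = mask("4000001")

print(len(raw), len(encoded), len(compressed))
print(digest[:16], masked)
```

Serialized formats such as ORC and Parquet combine the first two ideas (a binary encoding plus a compression codec) with metadata that lets a reader skip data it does not need.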

What are the (built-in) serialized file formats we are going to learn?
- orc
- parquet
- delta (open-source format, originally developed by Databricks)

- We did a schema migration and data conversion from CSV/JSON to a serialized data format (i.e. structured to structured, stored internally as binary)
- We learn/use these heavily subsequently
- What is serialized - fundamentally these are encoded binary data formats with many optimization and space-reduction strategies applied (encoded/compressed/"intelligent")
- orc - Optimized Row Columnar format (columnar format)
- parquet - columnar format that stores data in row groups and column chunks
- delta (open source, originally developed by Databricks) - enriched Parquet format: Parquet data files plus a transaction log, so delta (modification/change) operations can be performed (ACID properties, DML)
- format - serialized/encoded; we can't read it with the naked eye, and only after a library deserializes/decodes it can the data be accessed as structured data
- **When to go with serialized formats, or their benefits** -
- a. Storage benefits: e.g. ORC can save roughly 65% of space (1 GB of data may occupy about 350 MB; the actual ratio depends on the data), and compression (snappy) can improve this further...
- b. Processing optimization: ORC/Parquet/Delta return only the required data when queried, through column pruning and predicate pushdown.
- c. Interoperability: these formats are understood in multiple environments, e.g. BigQuery can load ORC and Parquet files.
- d. Not human-readable (binary), which discourages casual inspection but is not a substitute for encryption
- **Which file format to use in which project/environment - we learn this in detail later...**

| Format  | Schema Type           | Storage Efficiency | Analytics Performance | Updates Supported |
|---------|-----------------------|--------------------|-----------------------|-------------------|
| CSV     | Structured            | Low                | Slow                  | No                |
| JSON    | Semi-structured       | Low                | Slow                  | No                |
| ORC     | Structured / Striped  | High               | Fast                  | Limited           |
| Parquet | Structured / Nested   | High               | Very Fast             | Limited           |
| Delta   | Structured / Evolving | High               | Very Fast             | Yes (ACID)        |
| XML     | Semi-structured       | Low                | Slow                  | No                |
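
Benefit b above (returning only the required data) comes from the columnar layout. The following is a toy sketch in plain Python, not how ORC or Parquet are actually implemented: it counts how many values must be touched to read one column from a row layout versus a column layout. Apart from the first row, the sample values are made up for illustration.

```python
names = ("custid", "fname", "lname", "age", "profession")

# Row layout (CSV-like): each record is stored together
rows = [
    (4000001, "Kristina", "Chung", 55, "Pilot"),
    (4000002, "FirstB", "LastB", 41, "Teacher"),   # made-up sample row
    (4000003, "FirstC", "LastC", 29, "Doctor"),    # made-up sample row
]

def ages_from_rows(rows):
    """Read the age column from a row layout; every field of every row is touched."""
    age_index = names.index("age")
    touched = 0
    ages = []
    for row in rows:
        touched += len(row)
        ages.append(row[age_index])
    return ages, touched

# Column layout (ORC/Parquet-like): each column is stored together
columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}

def ages_from_columns(columns):
    """Read the age column from a column layout; only that column is touched."""
    ages = columns["age"]
    return ages, len(ages)

row_ages, row_touched = ages_from_rows(rows)
col_ages, col_touched = ages_from_columns(columns)
print(row_touched, col_touched)
```

Real columnar files add per-chunk statistics such as minimum and maximum values, which is what lets an engine also skip whole chunks when a filter is pushed down.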

In [0]:
#The default codec depends on the Spark version (snappy in Spark 3.x); here we override it with zlib
ingest_df1.write.orc(path="/Volumes/workspace/wd36schema/ingestion_volume/target/orcout",mode='overwrite',compression='zlib')
spark.read.orc("/Volumes/workspace/wd36schema/ingestion_volume/target/orcout").show(2)#decompression + deserialization


In [0]:
#ORC/Parquet follow WORM (Write Once Read Many): files are not modified in place once written
#Set the compression option once; if the same option is set twice, the last value wins
ingest_df1.write.mode("overwrite").option("compression","snappy").parquet(path="/Volumes/workspace/wd36schema/ingestion_volume/target/parquetout")#snappy is Spark's default Parquet codec
spark.read.parquet("/Volumes/workspace/wd36schema/ingestion_volume/target/parquetout").show(2)#decompression + deserialization

In [0]:
#Delta follows WMRM feature (Write Many Read Many)
ingest_df1.write.format("delta").save("/Volumes/workspace/wd36schema/ingestion_volume/target/deltaout",mode='overwrite')
spark.read.format("delta").load("/Volumes/workspace/wd36schema/ingestion_volume/target/deltaout").show(2)

#### 4. Table Load Operations - Building a LAKEHOUSE ON TOP OF THE DATA LAKE
Can we run SQL operations directly on tables, as in a database or data warehouse? Can we build a lakehouse in Databricks?
- We learn/use this heavily subsequently.
- What is a lakehouse - a SQL/data warehouse/query layer on top of the data lake is called a lakehouse
- We are going to learn different lakehouse and SQL query platforms further -
1. Delta tables (lakehouse) in Databricks
2. Hive on-prem
3. BigQuery in GCP
4. Synapse in Azure
5. Athena in AWS
- **When to go with a lakehouse** -
- a. Transformation
- b. Analysis/analytics
- c. AI/BI
- d. Literally, we are going to learn SQL & advanced SQL
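
A table load can be sketched with `DataFrameWriter.saveAsTable`. The helpers below are a hedged illustration: `table_name`, `save_as_delta_table` and the table name `custs_delta` are hypothetical, and the functions only assume they are given a Spark DataFrame such as `ingest_df1` inside a Databricks notebook.

```python
def table_name(catalog, schema, table):
    """Build a three-level Unity Catalog style name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"

def save_as_delta_table(df, full_name, mode="overwrite"):
    """Write a Spark DataFrame as a managed Delta table and return the table name."""
    df.write.format("delta").mode(mode).saveAsTable(full_name)
    return full_name

# custs_delta is a hypothetical table name chosen for this sketch
target = table_name("workspace", "wd36schema", "custs_delta")
print(target)

# In a Databricks notebook where ingest_df1 and spark already exist:
# save_as_delta_table(ingest_df1, target)
# spark.sql(f"SELECT profession, count(*) AS cnt FROM {target} GROUP BY profession").show()
```

Once the data is registered as a table, it can be queried by name with SQL instead of by file path, which is the practical difference between the data lake and the lakehouse layer.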

#### 5. XML Format - semi-structured data format (most JSON features apply to XML as well, but it is less popular than JSON in the DE world)
- Used rarely, on demand (by certain target/source systems, e.g. mainframes)
- Comparable to JSON, but more verbose and less efficient
- Databricks provides XML as a built-in format

In [0]:
ingest_df1.write.xml("/Volumes/workspace/wd36schema/ingestion_volume/target/xmlout",mode="overwrite",rowTag="cust")

### Modes in Writing
1. **Append** - Adds the new data to the existing data. It does not overwrite anything.
2. **Overwrite** - Replaces the existing data entirely at the destination.
3. **ErrorIfExists** (default, also accepted as `error`) - Throws an error if data already exists at the destination.
4. **Ignore** - Skips the write operation if data already exists at the destination.
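
The four modes boil down to one decision: what to do when the target already exists. Here is a minimal sketch of that decision in plain Python; `plan_write` is a hypothetical helper, and `FileExistsError` stands in for the error Spark raises.

```python
def plan_write(mode, target_exists):
    """Return the action a writer takes for a save mode, given whether the target exists."""
    mode = mode.lower()
    if mode not in ("append", "overwrite", "ignore", "error", "errorifexists", "default"):
        raise ValueError(f"unknown save mode: {mode}")
    if not target_exists:
        return "write"  # every mode simply writes when nothing is there yet
    if mode == "append":
        return "append"  # add new files next to the existing ones
    if mode == "overwrite":
        return "replace"  # drop the existing data, then write
    if mode == "ignore":
        return "skip"  # leave the existing data untouched, write nothing
    # error / errorifexists / default
    raise FileExistsError("target already exists")

for mode in ("append", "overwrite", "ignore"):
    print(mode, "->", plan_write(mode, target_exists=True))
```

Note that `ignore` is silent: a rerun that writes nothing looks the same as a successful write, so it is worth checking the target when using it.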

What are the overall options we used in this notebook to learn fundamental Spark DataFrame write operations in different formats and targets?
1. df.write.csv/json/orc/parquet/table/xml... operations & df.write.format('delta').save()
2. A few important write options under CSV such as header, sep, and mode (append/overwrite/error/ignore), plus select/withColumnRenamed to reorder and rename columns.
3. A few additional options such as compression and the different file formats...

In [0]:
def aparna(button:str):
    #Scratch example: returns "hi" only when the input is exactly 'hi', otherwise "hello"
    if button=='hi':
        return "hi"
    else:
        return "hello"
output=aparna('hello')
print(output)#prints hello