# Working through this notebook builds the skills expected of a "Data Egress Developer/Engineer"
### We write data in structured (CSV), semi-structured (JSON/XML), and serialized (ORC/Parquet/Delta) file formats on the data lake, and as tables (Delta/Hive) in the lakehouse

### Let's get some data we have already...

# Important Takeaways: CSV vs JSON

## CSV (Structured Data)
- **Format**: Plain text, tabular (rows & columns), 2D table  
- **Schema**: Header row defines column names; order matters  
- **Limitations**:  
  - No native schema migration  
  - Position-based, no metadata or versioning  
  - Cannot handle column renames, reordering, or type changes automatically  
- **Use Case**: Structured data with fixed schema  

## JSON (Semi-Structured / Dynamic Data)
- **Format**: JavaScript Object Notation (dictionary of key-value pairs)  
- **Standard**: `{"k1":"string","k2":123,"k3":true}`  
  - Keys must be unique and in double quotes  
  - Values can be string, number, boolean, object, array, or null  
- **Advantages**:  
  - Handles semi-structured/dynamic data  
  - Flexible schema (column names, types, order can differ)  
  - Common for API responses or dynamic sources  
  - Efficient for data exchange and parsing  
  - Supports nested/complex/hierarchical data  
- **Use Case**: Dynamic data, API responses, real-time operations (e.g., clickstream)
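
The position-based versus key-based difference described above can be sketched with Python's standard library. This is only an illustration: the sample rows are hypothetical, and the "consumer" is assumed to read the fourth CSV field as `age`.

```python
import csv
import io
import json

# Hypothetical CSV row written by a producer that swapped the last two columns
# (prof before age) while the consumer still assumes the old positions.
csv_reordered = "4000001,Kristina,Chung,Pilot,55\n"

# Position-based read: index 3 is assumed to be "age".
row = next(csv.reader(io.StringIO(csv_reordered)))
age_by_position = row[3]  # silently picks up the profession instead

# Key-based read: JSON carries the field name with every value,
# so the order of keys in the text does not matter.
json_line = '{"profession": "Pilot", "age": 55, "custid": 4000001}'
record = json.loads(json_line)
age_by_key = record["age"]

print(age_by_position)  # Pilot
print(age_by_key)       # 55
```

The CSV read does not fail; it returns the wrong value, which is why column order changes in CSV need an explicit transformation step.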


In [0]:
#Extract
ingest_df1=spark.read.csv("/Volumes/catalog2/database2/volume2/created_folder/custs_header", header=True, inferSchema=True,samplingRatio=0.10)

%md
### Writing the data with built-in writers: different file formats and different targets (Spark can write to many other targets as well)

#### 1. Writing in CSV format (structured data: a 2D table/frame with rows and columns) with the basic options listed below (schema/structure migration)
custid,fname,lname,age,profession -> custid~fname~lname~prof~age
- header
- sep
- mode
# Writing in CSV (Structured Data)

**CSV is:**  
- Structured, tabular data  
- Rows and columns (2D table)  
- Column order matters  
- Column names come from the header row  

**CSV supports schema migration only through explicit transformations, not natively.**

**Why CSV does NOT support schema migration natively:**  
- Plain text format  
- Position-based  
- No embedded schema  
- No metadata or versioning  

**Because of this, CSV cannot automatically handle:**  
- Column renames  
- Column reordering  
- Data type changes  
- Backward compatibility


Correct way to do schema migration with CSV (Spark example)<br>
Original CSV (v1):<br>
custid,fname,lname,age,profession<br>

Target CSV (v2):<br>
custid,fname,lname,prof,age<br>
The code below shows the migration.

In [0]:
# Illustration only: df_v1 stands for a DataFrame already read from the v1 CSV
df_v2 = df_v1 \
  .withColumnRenamed("profession", "prof") \
  .select("custid", "fname", "lname", "prof", "age")

df_v2.write \
  .mode("overwrite") \
  .option("header", "true") \
  .option("sep", "~") \
  .csv("/path/customers_v2")

# This is schema migration done outside the CSV format: Spark renames and reorders the columns before writing.


In [0]:
# Change the delimiter from comma to tilde while writing (the columns themselves are unchanged)
ingest_df1.write.mode("overwrite").csv("/Volumes/catalog2/database2/volume2/created_folder/writing_data/",header=True,sep="~")
# Four write modes: append, overwrite, ignore, error (errorifexists, the default)

In [0]:
# Schema migration by applying transformations (this is our bread and butter, covered in depth later)
transformed_df=ingest_df1.select("custid","fname","lname","age","profession").withColumnRenamed("fname","firstname").withColumnRenamed("lname","lastname")
#Load
transformed_df.write.mode("overwrite").csv("/Volumes/catalog2/database2/volume2/created_folder/writing_data/",header=True,sep="~")

%md
#### 2. Writing in JSON format with the basic options listed below
path<br>
mode
- We did a schema migration and data conversion from CSV to JSON format (i.e., from structured to semi-structured format).
- JSON: we cover much more later (nested/hierarchical/complex/multiline...).
- What is JSON: fundamentally it is a dictionary of key-value pairs, where a value can itself be another dictionary.
- JSON stands for JavaScript Object Notation.
- Standard JSON format (fixed by the specification): {"k1":"string value","k2":123,"k3":true}, where each key should be unique and enclosed in double quotes, and a value can be a string, number, boolean, object, array, or null.
- **When to go with JSON, or its benefits**:
- a. When the data is semi-structured (variable data format with a dynamic schema)
- eg. {"custid":4000001,"profession":"Pilot","age":55,"city":"NY"}
-     {"custid":4000001,"fname":"Kristina","lname":"Chung","prof":"Pilot","age":"55"}
- b. When the columns, column names, types, or order can differ between records
- c. Sources provide JSON when the data is dynamic in nature (the number or order of columns is not known in advance) or when the data is an API response.
- d. JSON is a lightweight, widely supported text format for exchanging data between applications over a network; it is easy to parse and suits object-by-object operations (row-by-row, in a real-time fashion, e.g., Amazon clickstream operations).
- e. JSON can group data or express a hierarchy in a complex or nested form, e.g., https://randomuser.me/api/

#### Writing Data in JSON Format

- **JSON**: JavaScript Object Notation, fundamentally a dictionary of key-value pairs.
- **Standard Format**: `{"k1":"string","k2":123,"k3":true}`  
  - Keys must be unique and in double quotes.  
  - Values can be string, number, boolean, object, array, or null.
- **Use Cases / Benefits**:
  1. Ideal for **semi-structured or dynamic data**.  
     Example:  
     `{"custid":4000001,"profession":"Pilot","age":55}`  
     `{"custid":4000002,"fname":"Kristina","lname":"Chung","prof":"Pilot","age":"55"}`
  2. **Flexible schema**: column names, types, or order can differ.
  3. Common for **API responses** or dynamic sources.
  4. **Efficient for data exchange**: serialized format, easy parsing, supports row-by-row operations (e.g., clickstream data).
  5. Supports **nested/complex/hierarchical data**, e.g., grouping objects.


In [0]:
# Write ingest_df1 to the JSON output path
ingest_df1.write.json(path="/Volumes/catalog2/database2/volume2/created_folder/jsonoutput",mode='append')
# show() returns None, so display the rows directly instead of assigning the result to df1
spark.read.json("/Volumes/catalog2/database2/volume2/created_folder/jsonoutput").show(2)

In [0]:
df1=spark.read.json("/Volumes/catalog2/database2/volume2/Write_bascics/json/Orginal/")
df1.display()

In [0]:
# Write df1 back out as a JSON copy under the writing/ folder
df1.write.mode("overwrite").json("/Volumes/catalog2/database2/volume2/Write_bascics/json/Orginal/writing/")
df1.display()

In [0]:
df3=spark.read.json("/Volumes/catalog2/database2/volume2/Write_bascics/json/Orginal/key_based.json")
df3.display()

In [0]:
df3.write.mode("append").json("/Volumes/catalog2/database2/volume2/Write_bascics/json/Orginal/writing/")

In [0]:
# In df1 the key order is
#{"custid":1,"fname":"John","lname":"Doe","age":30,"profession":"Pilot"}
# In the second dataset (df3 above) the key order is
#{"profession":"Doctor","age":40,"lname":"Smith","fname":"Jane","custid":2}
# The column order differs between the two, but the read still works because JSON fields are matched by key, not by position
spark.read.json("/Volumes/catalog2/database2/volume2/Write_bascics/json/Orginal/writing/").show()

#### 3. Serialization (encoding in a more optimized fashion) & Deserialization File Formats (Binary/"Brainy" File Formats)
Data Mechanics: 
1. Encoding/decoding (machine format) - converting data from a human-readable format to a machine-readable format for performant data transfer (e.g., data sent over a network is encoded).
2. *Compression/decompression (encoding + space + time) - shrinking the data with a compression library (a tradeoff between time and size), e.g., compress before storing or transferring. Snappy is a commonly used compression codec on big data platforms.
3. Encryption (encoding + security) - in addition to encoding, encryption adds security, so only holders of the key can read the data (algorithms such as AES, DES, RSA). SHA and MD5 are one-way hash functions and DSA is a signature algorithm; they support integrity checks and masking rather than encryption.
4. *Serialization (most relevant for big data) - encoding plus space savings plus a processing-friendly layout: fast, compact, interoperable, extensible (additional configs), and scalable (cluster compute operations). A binary format is not human-readable, but that alone does not secure the data.
5. *Masking - transforming data into another form that is not meant to be decoded back (used for security purposes).

What are the (built-in) serialized file formats we are going to learn?
orc
parquet
delta (open-source format originally developed by Databricks)

- We did a schema migration and data conversion from CSV/JSON to a serialized data format (i.e., from structured text to structured data stored internally as binary).
- We use these formats heavily later on.
- What is serialized - fundamentally, these are encoded binary data formats that apply many optimization and space-reduction strategies (encoding, compression, embedded statistics).
- orc - Optimized Row Columnar format (columnar)
- parquet - columnar format that also supports nested data
- delta - Parquet data files plus a transaction log, so delta (modification/change) operations can be performed (ACID properties, DML)
- format - serialized/encoded; it cannot be read with the naked eye, and a library must deserialize/decode it before the data can be accessed as structured data
- **When to go with serialized formats, or their benefits**:
- a. Storage savings: e.g., ORC can save 65% or more of the space, so 1 GB of data may occupy about 350 MB, and compression (Snappy) can improve this further. Actual savings depend on the data.
- b. Processing optimization: ORC/Parquet/Delta return only the required data when a query can use column pruning and predicate pushdown.
- c. Interoperability: the format is understood in multiple environments, e.g., BigQuery can parse this data.
- d. Security: not human-readable and can be combined with encryption, although being binary is not protection by itself.
- **Which file format to use in which project/environment** - we learn this in detail later.

| Format  | Schema Type              | Storage Efficiency | Analytics Performance | Updates Supported |
|--------|--------------------------|--------------------|-----------------------|------------------|
| CSV    | Structured               | Low                | Slow                  | No               |
| JSON   | Semi-structured           | Low                | Slow                  | No               |
| ORC    | Structured / Striped      | High               | Fast                  | Limited          |
| Parquet| Structured / Nested       | High               | Very Fast             | Limited          |
| Delta  | Structured / Evolving     | High               | Very Fast             | High (ACID)      |
| XML    | Semi-structured           | Low                | Slow                  | No               |

#### 3. Serialization & Deserialization (Binary / Optimized File Formats)

## Core Concepts
- **Encoding / Decoding**:  
  Converting human-readable data into machine-readable format for faster data transfer (e.g., network transmission).
- **Compression / Uncompression**:  
  Reduces data size using libraries (trade-off between time and space).  
  Common in big data: **Snappy**.
- **Encryption**:  
  Adds security on top of encoding (e.g., AES, RSA). SHA and MD5 are one-way hash functions, used for integrity checks and masking rather than encryption.
- **Serialization (Big Data Focus)**:  
  - Encoding + compression + intelligent storage  
  - Optimized for **speed, space, and scalability**  
  - Binary, compact, interoperable, and secure
- **Masking**:  
  Data obfuscation for security; not meant to be decoded back.
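
The concepts above can be tried out with Python's standard library. This is a minimal sketch on a hypothetical customer record: it uses `zlib` as a stand-in for a codec such as Snappy (which is not in the standard library) and a SHA-256 hash as one simple form of masking.

```python
import hashlib
import json
import zlib

record = {"custid": 4000001, "fname": "Kristina", "profession": "Pilot"}

# Encoding: text -> bytes (UTF-8) so it can be stored or sent over a network.
encoded = json.dumps(record).encode("utf-8")
decoded = json.loads(encoded.decode("utf-8"))  # decoding restores the record

# Compression: repetitive data shrinks well; decompression restores it exactly.
payload = encoded * 100
compressed = zlib.compress(payload)
restored = zlib.decompress(compressed)

# Masking via a one-way hash: the original value is not meant to be recovered.
masked_name = hashlib.sha256(record["fname"].encode("utf-8")).hexdigest()

print(len(payload), "->", len(compressed), "bytes after compression")
print(masked_name[:16], "...")
```

Encoding and compression are reversible by anyone; masking is deliberately one-way. Encryption, which needs a key to reverse, is not shown because the standard library has no AES or RSA implementation.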

## Built-in Serialized File Formats
- **ORC**
- **Parquet**
- **Delta (open source, originally developed by Databricks)**

## What is Serialized Data?
- Intelligent, binary, encoded formats with heavy optimization
- Not human-readable
- Accessed only through libraries (deserialization)
- Used after converting CSV / JSON into optimized formats

## Format Characteristics
- **ORC**: Optimized Row Columnar format (data stored in stripes)
- **Parquet**: Columnar format that also supports nested data
- **Delta**: Parquet data files plus a transaction log, giving ACID support (DML operations)

## When to Use Serialized Formats (Benefits)
- **Storage Efficiency**:  
  Saves significant space (e.g., ORC can reduce storage by ~65%+)
- **Processing Performance**:  
  Uses predicate pushdown to read only required data
- **Interoperability**:  
  Supported across platforms (e.g., Spark, BigQuery)
- **Security**:  
  Not human-readable and compatible with encryption; binary encoding alone is not a security control
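
How predicate pushdown lets a reader skip data can be sketched without Spark. ORC and Parquet keep min/max statistics for chunks of each column (stripes or row groups); the toy below imitates that idea with plain lists. The chunk size, the `ages` values, and the function name are hypothetical.

```python
# One column of data, split into fixed-size chunks (a stand-in for row groups).
ages = [23, 25, 31, 38, 42, 47, 55, 61]
CHUNK_SIZE = 4
chunks = [ages[i:i + CHUNK_SIZE] for i in range(0, len(ages), CHUNK_SIZE)]

# Statistics written alongside each chunk at write time.
stats = [(min(chunk), max(chunk)) for chunk in chunks]

def scan_greater_than(threshold):
    """Return matching values and how many chunks actually had to be read."""
    hits, chunks_read = [], 0
    for chunk, (low, high) in zip(chunks, stats):
        if high <= threshold:
            continue  # the statistics prove no value in this chunk can match
        chunks_read += 1
        hits.extend(value for value in chunk if value > threshold)
    return hits, chunks_read

hits, chunks_read = scan_greater_than(40)
print(hits, "found by reading", chunks_read, "of", len(chunks), "chunks")
```

A CSV or JSON file has no such statistics, so the same filter would have to parse every row.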

## File Format Comparison

| Format   | Schema Type              | Storage Efficiency | Analytics Performance | Updates Supported |
|---------|--------------------------|--------------------|-----------------------|------------------|
| CSV     | Structured               | Low                | Slow                  | No               |
| JSON    | Semi-structured           | Low                | Slow                  | No               |
| ORC     | Structured / Striped      | High               | Fast                  | Limited          |
| Parquet | Structured / Nested       | High               | Very Fast             | Limited          |
| Delta   | Structured / Evolving     | High               | Very Fast             | High (ACID)      |
| XML     | Semi-structured           | Low                | Slow                  | No               |


# Serialization & Deserialization

## Binary / Optimized ("Brainy") File Formats

---

## 1. What is Serialization?

**Serialization** is the process of converting data from a **human-readable or in-memory representation** into a **compact, machine-friendly (binary) format** so it can be:

* Stored on disk
* Transferred over a network
* Processed efficiently by distributed systems

### Simple Definition

> **Serialization = Object / Data → Byte Stream (Binary Format)**

### Example

**In-memory / human-readable data (JSON-like):**

```json
{
  "id": 101,
  "name": "Sunil",
  "salary": 50000
}
```

**After serialization (binary form – conceptual):**

```
01100101 00000000 01010011 01110101 01101110 01101001 01101100 ...
```

This binary representation is **not human-readable**, but machines can read and process it very fast.

---

## 2. What is Deserialization?

**Deserialization** is the reverse process of serialization.

### Simple Definition

> **Deserialization = Byte Stream (Binary Format) → Object / Data**

### Example

* A Spark job reads a Parquet file
* Spark **deserializes** the binary data
* Converts it back into:

  * Rows
  * Columns
  * DataFrame objects

This allows filtering, aggregation, and transformations.
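
A round trip through serialization and deserialization can be sketched with Python's `struct` module. The fixed byte layout below is a hypothetical one chosen for this record; real formats such as Parquet or ORC are far more elaborate.

```python
import json
import struct

record = {"id": 101, "name": "Sunil", "salary": 50000}

# Hypothetical layout: 4-byte int id, 10-byte name field, 4-byte int salary.
LAYOUT = struct.Struct("<i10si")

def serialize(rec):
    """Object -> byte stream."""
    return LAYOUT.pack(rec["id"], rec["name"].encode("utf-8"), rec["salary"])

def deserialize(blob):
    """Byte stream -> object."""
    rec_id, raw_name, salary = LAYOUT.unpack(blob)
    name = raw_name.rstrip(b"\x00").decode("utf-8")  # drop the padding bytes
    return {"id": rec_id, "name": name, "salary": salary}

blob = serialize(record)
as_text = json.dumps(record).encode("utf-8")

print(len(blob), "bytes in binary vs", len(as_text), "bytes as JSON text")
print(deserialize(blob))
```

The binary form is smaller partly because the field names live in the layout (the schema) rather than being repeated in every record.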

---

## 3. Why Serialization is Needed

Serialization is essential in big data and distributed systems due to the following reasons:

### Key Benefits

1. **Performance**

   * Binary formats are faster than text formats (CSV, JSON)

2. **Smaller Storage Size**

   * Binary data consumes less disk space

3. **Efficient Network Transfer**

   * Less data is transferred between nodes

4. **Schema Awareness**

   * Some formats store column names and data types internally

---

## 4. Binary / "Brainy" File Formats

> "Brainy" is an informal term meaning **intelligent and optimized binary formats**.

These formats are:

* Binary (not human-readable)
* Schema-aware
* Compressed
* Optimized for analytics and performance

---

## 5. Common Binary File Formats

| Format               | Type             | Usage                        |
| -------------------- | ---------------- | ---------------------------- |
| **Parquet**          | Columnar binary  | Analytics, Spark, Databricks |
| **ORC**              | Columnar binary  | Hive, high compression       |
| **Avro**             | Row-based binary | Streaming, Kafka             |
| **Protocol Buffers** | Binary           | Fast network communication   |
| **Thrift**           | Binary           | Cross-language serialization |

---

## 6. Text vs Binary File Formats

| Feature            | CSV / JSON | Parquet / ORC / Avro |
| ------------------ | ---------- | -------------------- |
| Human-readable     | Yes        | No                   |
| File size          | Large      | Small                |
| Read/Write speed   | Slow       | Fast                 |
| Compression        | External   | Built-in             |
| Schema support     | Weak       | Strong               |
| Analytics friendly | No         | Yes                  |

---

## 7. Spark / Databricks Example

### Writing Data (Serialization)

```python
df.write \
  .mode("overwrite") \
  .parquet("/mnt/curated/customer")
```

What happens internally:

* Spark serializes the DataFrame
* Converts rows into Parquet binary format
* Applies compression
* Writes optimized files to storage

---

### Reading Data (Deserialization)

```python
df = spark.read.parquet("/mnt/curated/customer")
```

What happens internally:

* Spark reads Parquet binary files
* Deserializes data into objects
* Creates a DataFrame for processing

---

## 8. Real-Time Project Context

* **Bronze layer**: Raw data (CSV / JSON)
* **Silver layer**: Cleaned & structured (Parquet / ORC)
* **Gold layer**: Aggregated & analytics-ready (Parquet)

Serialization happens while moving from **Bronze → Silver → Gold**.

In [0]:
# zlib is set explicitly here; without the option Spark uses its default codec (snappy for Parquet; for ORC the default depends on the Spark version)
ingest_df1.write.orc("/Volumes/catalog2/database2/volume2/Write_bascics/orcoutput",mode='overwrite',compression='zlib')
# Reading back performs decompression + deserialization
spark.read.orc("/Volumes/catalog2/database2/volume2/Write_bascics/orcoutput").show(2)

In [0]:
spark.read.orc("/Volumes/catalog2/database2/volume2/Write_bascics/orcoutput").explain()

In [0]:
# ORC/Parquet follow the WORM pattern (Write Once Read Many): files are rewritten, not updated in place
# snappy is already the Parquet default, so the option below only makes the choice explicit
ingest_df1.write.mode("overwrite").options(compression="snappy").parquet("/Volumes/catalog2/database2/volume2/Write_bascics/parquetoutput")
# Reading back performs decompression + deserialization
spark.read.parquet("/Volumes/catalog2/database2/volume2/Write_bascics/parquetoutput").show(2)

In [0]:
#Delta follows WMRM feature (Write Many Read Many)
ingest_df1.write.format("delta").mode("overwrite").save("/Volumes/catalog2/database2/volume2/Write_bascics/deltaoutput")
spark.read.format("delta").load("/Volumes/catalog2/database2/volume2/Write_bascics/deltaoutput").show(2)

In [0]:
ingest_df1.write.mode("overwrite").xml("/Volumes/catalog2/database2/volume2/Write_bascics/xmloutput",rowTag="cust")
spark.read.xml("/Volumes/catalog2/database2/volume2/Write_bascics/xmloutput",rowTag="cust").show(2)

#### 4. Table Load Operations - Building a LAKEHOUSE on top of the DATA LAKE
Can we run SQL operations directly on tables, as in a database or data warehouse? Can we build a lakehouse in Databricks?
- We use this heavily later on.
- What is a lakehouse: a SQL/data warehouse/query layer on top of the data lake is called a lakehouse.
- We will learn about different lakehouses later:
1. delta tables (lakehouse) in databricks
2. hive in onprem
3. bigquery in GCP
4. synapse in azure
5. athena in aws
- **when to go with lakehouse** - 
- a. Transformation
- b. Analysis/Analytics
- c. AI/BI
- d. Literally we are going to learn SQL & Advanced SQL

In [0]:
#We are building delta tables in databricks (we are building hive tables in onprem/we are building bq tables in gcp...)
#saveastable (named notation/named arguments)
#Table
#cid,prof,age,fname,lname
#mapping
#cid,prof,age,fname,lname
ingest_df1.write.saveAsTable("catalog2.database2.cust_info",mode='overwrite')
display(spark.sql("show create table catalog2.database2.cust_info"))
display(spark.sql("DESCRIBE EXTENDED catalog2.database2.cust_info"))#Use when:You want schema + location + provider + table properties.
display(spark.sql("DESCRIBE TABLE catalog2.database2.cust_info"))#. Get table schema only (columns & data types)
spark.table("catalog2.database2.cust_info").printSchema()#Understanding column structure
display(
    spark.sql("""
        DESCRIBE HISTORY catalog2.database2.cust_info
    """)
)




In [0]:
#1. insertInto can be used like saveAsTable, with a few differences
#a. It works only if the target table already exists
#b. It runs as a distributed write like saveAsTable, not as row-by-row INSERT statements; the real risk is the positional mapping described next
#c. It loads the data from the DataFrame by column position, not by column name
#insertInto (positional notation/positional arguments)
#Table
#cid,prof,age,fname,lname
#mapping.
#cid,fname,lname,age,prof
ingest_df1.write.insertInto("catalog2.database2.cust_info",overwrite=True)

In [0]:
ingest_df1.write.format("delta").save("location")

In [0]:
##I am using spark engine to pull the data from the lakehouse table backed by dbfs (s3) (datalake) where data in delta format(deltalake) 
display(spark.sql("select * from catalog2.database2.cust_info"))#sparkengine+lakehouse+datalake(deltalake)

# Understanding Spark Engine, Lakehouse, Data Lake, and Delta Lake

---

## 1. The Code You Are Running

```python
display(
    spark.sql("""
        SELECT *
        FROM workspace.wd36schema.lh_custtbl
    """)
)
```

This line **looks simple**, but behind it there are **multiple architectural layers** working together.

You are **not directly reading files**. You are querying a **logical table**, and Spark takes care of the rest.

---

## 2. High-Level Architecture (One Line)

> **Spark Engine** executes a SQL query on a **Lakehouse table**, whose data is stored in a **Data Lake (S3 via DBFS)** using **Delta Lake format**.

---

## 3. Spark Engine (Compute Layer)

### What Spark Engine Is

Spark is the **compute engine** responsible for:

* Parsing SQL
* Validating syntax
* Creating a logical plan
* Optimizing the query
* Executing distributed jobs
* Returning results

Spark **does not store data**.

> Spark’s job is to *process* data, not *own* it.

---

## 4. Lakehouse Table (Logical Abstraction)

```text
workspace.wd36schema.lh_custtbl
```

This is a **logical table**, not a file.

### What the Lakehouse Table Contains (Metadata)

* Table schema (columns, data types)
* Storage location
* File format (Delta)
* Partitioning information
* Table properties
* Access control rules

This metadata is stored in:

* Hive Metastore **or**
* Unity Catalog (modern Databricks setup)

> Lakehouse = Data Lake storage + Warehouse-style tables

---

## 5. Data Lake (Physical Storage Layer)

The actual data lives in a **data lake**, backed by cloud object storage:

* AWS S3
* Azure ADLS
* Google GCS

In Databricks, this is exposed via **DBFS** or, with Unity Catalog, via volumes and external locations.

Example (conceptual):

```text
s3://company-datalake/lakehouse/lh_custtbl/
```

The data lake stores **files**, not tables.

---

## 6. Delta Lake (Storage + Transaction Layer)

The data in the data lake is stored in **Delta format**.

### What Delta Lake Adds

* ACID transactions
* Schema enforcement
* Schema evolution
* Time travel
* Data versioning

### Physical Structure

```text
lh_custtbl/
 ├── _delta_log/
 │    ├── 00000000000000000000.json
 │    ├── 00000000000000000001.json
 ├── part-00000.snappy.parquet
 ├── part-00001.snappy.parquet
```

Spark reads `_delta_log` first to determine:

* Latest table version
* Valid parquet files

---

## 7. What “Lakehouse” Means (Important)

**Lakehouse is not a separate tool.**

It is an **architecture pattern**:

| Feature                | Provided By               |
| ---------------------- | ------------------------- |
| Cheap scalable storage | Data Lake (S3)            |
| Table abstraction      | Metastore / Unity Catalog |
| ACID guarantees        | Delta Lake                |
| SQL processing         | Spark Engine              |

Together, this is called a **Lakehouse**.

---



```text
Spark Engine (Compute)
        ↓
Lakehouse Table (Metadata)
        ↓
Delta Lake (Transaction Layer)
        ↓
Data Lake (S3 via DBFS)
```

---

## 8. One-Line Interview-Ready Explanation

> "Spark uses lakehouse table metadata to locate Delta Lake files stored in the data lake, reads the Delta transaction log for consistency, and executes the query using distributed processing."

---

## 9. Key Takeaway

* You query a **table**
* Spark reads **files**
* Delta Lake ensures **correctness**
* Data Lake provides **scalable storage**

Everything works together under the **Lakehouse architecture**.

---



%md
#### 5. XML Format - semi-structured data format (most JSON features also apply to XML, but it is less popular than JSON in the data engineering world)
- Used rarely, on demand (by certain source/target systems, e.g., mainframes)
- Comparable to JSON, but more verbose and less efficient
- Databricks provides XML as a built-in reader/writer function

%md
### Modes in Writing
1. **Append** - Adds the new data to the existing data. It does not overwrite anything.
2. **Overwrite** - Replaces the existing data entirely at the destination.
3. **ErrorIfExists** (default) - Throws an error if data already exists at the destination. Also accepted as `error`.
4. **Ignore** - Skips the write operation if data already exists at the destination.
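
The four modes can be imitated on a local file to make their semantics concrete. This is a toy sketch, not Spark's implementation: `write_lines` and `read_lines` are hypothetical helpers, and a single text file stands in for a target directory.

```python
import os
import tempfile

VALID_MODES = ("append", "overwrite", "ignore", "errorifexists")

def write_lines(path, lines, mode="errorifexists"):
    """Toy single-file imitation of the four save modes."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    exists = os.path.exists(path)
    if exists and mode == "errorifexists":
        raise FileExistsError(f"{path} already exists")
    if exists and mode == "ignore":
        return False  # existing data is left untouched
    with open(path, "a" if mode == "append" else "w", encoding="utf-8") as handle:
        handle.writelines(line + "\n" for line in lines)
    return True

def read_lines(path):
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()

target = os.path.join(tempfile.mkdtemp(), "custs.txt")
write_lines(target, ["a"])                          # default mode, target is new
write_lines(target, ["b"], mode="append")
after_append = read_lines(target)
wrote = write_lines(target, ["c"], mode="ignore")   # skipped, target exists
write_lines(target, ["d"], mode="overwrite")
after_overwrite = read_lines(target)
try:
    write_lines(target, ["e"])                      # default mode, target exists
    raised = False
except FileExistsError:
    raised = True

print(after_append, wrote, after_overwrite, raised)
```

The default mode is the safe one: a rerun of the same job fails loudly instead of duplicating or replacing data.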

%md
#### What functions/options did we use in this notebook to learn fundamental Spark DataFrame WRITE operations across formats and targets?
1. We learned about a dozen functions (out of 18) in the write module, with minimal options.
2. Functions we learned: data lake functions (csv/json/xml/orc/parquet + delta), lakehouse functions (saveAsTable/insertInto), and additional functions (format/save/option/options/mode).
3. A few more performance-optimization/advanced functions are available (jdbc, which we learn soon under the name of foreign catalog; partitionBy, clusterBy, bucketBy, sortBy, text).
4. A few important write options for CSV, such as header, sep, and mode (append/overwrite/error/ignore); toDF is a DataFrame helper for renaming columns before writing.
5. A few additional options such as compression, and different file formats.