**Next level of SQL (Spark SQL) + Python function-based programming (Spark DSL framework) + Data warehouse (Data lake + Lakehouse) -> Transformation & Analytics**

### Data munging is the process of converting raw data into a usable format by cleaning, transforming, and enriching it.



**Passive Data Munging: performing exploratory data analysis (EDA) of the raw data to identify its attributes and patterns.**
1. Visibly/manually opening the file, we found a couple of data patterns (manual exploratory data analysis)
- It is structured data with a comma separator (CSV)
- No header and no comments; a footer is present in the data
- Total columns = number of separators + 1
- Data quality
  - Null columns are present
  - Duplicate rows
  - Format issues (age is not in number format, e.g. 7-7)
  - Uniformity issues (Artist, artist)
  - Number of columns is more or less than expected
    - e.g. 4000011,Francis,McNamara,47,Therapist,NewYork and 4000014,Beth,Woodard,65
- Identification of data types
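
The manual checks above (column count = separators + 1, rows with too many or too few fields, non-numeric ages) can be sketched in plain Python before any Spark code runs. This is only a minimal illustration: the first two sample lines come from the notes above, while the third line, `EXPECTED_COLUMNS`, and the helper `profile_line` are hypothetical and introduced just for this example.

```python
import csv
import io

EXPECTED_COLUMNS = 5  # assumption: id, firstname, lastname, age, profession

# The first two lines are the examples from the notes; the last one is made up
# to illustrate the "age is not a number" pattern (e.g. 7-7).
sample = """4000011,Francis,McNamara,47,Therapist,NewYork
4000014,Beth,Woodard,65
4000099,Jane,Doe,7-7,Artist
"""

def profile_line(fields):
    """Return a list of data-quality issues found in one parsed CSV row."""
    issues = []
    if len(fields) != EXPECTED_COLUMNS:
        issues.append(f"column count {len(fields)} != {EXPECTED_COLUMNS}")
    # Age sits at index 3 when enough fields are present.
    if len(fields) > 3 and not fields[3].isdigit():
        issues.append(f"non-numeric age {fields[3]!r}")
    return issues

report = {}
for fields in csv.reader(io.StringIO(sample)):
    report[fields[0]] = profile_line(fields)

for row_id, issues in report.items():
    print(row_id, issues or "ok")
```

The check is passive: it only describes the data and changes nothing, which matches the intent of this stage.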

**2. Programmatically, let's find a couple of data patterns by applying EDA passively (without modifying the data, just describing it)**

In [0]:
%sql
create catalog if not exists BreadandButter; 
create database if not exists BreadandButter.Data_Ingestion_DB; 
create volume if not exists BreadandButter.Data_Ingestion_DB.Data_DE_VL;


In [0]:
rawdf1=spark.read.csv("/Volumes/breadandbutter/data_ingestion_db/data_de_vl/custsmodified",header=False,inferSchema=True).toDF("id","firstname","lastname","age","profession")
#rawdf1.show(20,False)
display(rawdf1.take(20))
display(rawdf1.sample(.1))
#inferSchema=True ->Spark scans data and assigns data types automatically
#sample(.1) #Randomly selects ~10% of rows -> Data inspection
#take(20) -> Displaying first 20 rows (Action)

In [0]:
#Important passive EDA structure functions we can use
rawdf1.printSchema() #Shows that id & age were not inferred as numeric, so they contain some non-numeric values (both are supposed to be numeric)

#Column names in exact order
print(rawdf1.columns) #Shows the number, order, and names of the columns

#Column name + datatype mapping
print(rawdf1.dtypes) #Shows the datatype of every column (can also drive programmatic column & type identification for dynamic logic)
  

In [0]:
#Identifying all string columns (dynamic logic)
for i in rawdf1.dtypes:
    if i[1]=='string':
        print(i[0])

#Full structural metadata
print(rawdf1.schema)#To identify the structure of the data in the StructType and StructField format

distinct() and dropDuplicates() both remove duplicate rows across all columns when no subset is provided. 
- The key difference is that dropDuplicates() allows deduplication based on specific columns, making it more flexible for data engineering use cases.   ex: rawdf1.dropDuplicates(['id'])
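
The difference can be modelled without Spark. The sketch below uses plain Python tuples as stand-in rows (the sample values are invented for illustration): full-row deduplication keeps one copy of each identical row, while deduplicating on `id` keeps one row per id. In Spark, which row survives per key in `dropDuplicates(['id'])` is not guaranteed; this model simply keeps the first one seen.

```python
rows = [
    (1, "Ann", "Artist"),
    (1, "Ann", "Artist"),   # exact duplicate of the first row
    (1, "Ann", "Pilot"),    # same id, different profession
    (2, "Bob", "Doctor"),
]

# Model of distinct() / dropDuplicates(): compare whole rows.
# dict.fromkeys keeps insertion order while removing repeats.
dedup_all = list(dict.fromkeys(rows))

# Model of dropDuplicates(['id']): compare only the id (index 0).
seen = {}
for row in rows:
    seen.setdefault(row[0], row)  # keep the first row seen for each id
dedup_by_id = list(seen.values())

print(len(rows), len(dedup_all), len(dedup_by_id))
```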

In [0]:
#Important passive EDA data functions we can use
#We identified a few patterns to check on this data
#1. Deduplication of rows and of given column(s)
#2. Ratio of null values across all columns
#3. Distribution (density) of the data across all numeric columns
#4. Min, Max values
#5. Standard deviation - spread of the values around the mean
#6. Percentiles - distribution from 0 to 100 split into 4 quartiles of 25% each

#Total row count (baseline)
print("actual count of the data",rawdf1.count()) 

#De-duplicate across all columns of the dataframe (SQL-style operation)
print("de-duplicated record (all columns) count sqlstyle",rawdf1.distinct().count())

#De-duplicate across all columns of the dataframe (DataFrame-specific API)
print("de-duplicated record (all columns) count DF api",rawdf1.dropDuplicates().count())

#De-duplicate based on specific column(s), here id
print("de-duplicated given id column count",rawdf1.dropDuplicates(['id']).count())

#describe() provides basic statistics: count, mean, stddev, min, and max
display(rawdf1.describe())

#summary() extends this with percentile-based distribution metrics (25%, 50% i.e. the median, 75%), making it more suitable for deeper data quality analysis
display(rawdf1.summary())

### **Active Data Munging** is the continuous process of structuring, validating, cleansing, scrubbing, deduplicating, and standardizing evolving data to make it analytics-ready.

- Combining Data + Schema Evolution/Merging (Structuring)
- Validation, Cleansing, Scrubbing - Cleansing (removal of unwanted datasets), Scrubbing (convert raw to tidy)
- De-duplication and levels of standardization of data to make it usable (for data engineers/consumers)

1)**Questions related to handling multiple files/paths/sub-paths**
- I have data in different file names in a single location or multiple locations, and I need to read all of it into one df: path=["path1/file1","path1/file2","path2/file3"]
- I have data in files sharing a single name pattern, in a single location, multiple locations, or subfolders, and I need to read all of it into one df: path=["path1/","path1/","path2/"], pathGlobFilter="custsm*", recursiveFileLookup=True
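
How `pathGlobFilter` narrows a directory listing can be pictured with Python's `fnmatch`. This is only a plain-Python analogy, assuming (as the Spark option does) that the pattern is applied to file names rather than full paths; the listing below reuses file names mentioned in these notes and is not a real directory scan.

```python
from fnmatch import fnmatchcase

# Hypothetical directory listing built from file names used in these notes.
files = ["custsmodified", "custsmodified_NY", "custsmodified_TX", "simple_json.txt"]

def glob_filter(names, pattern):
    """Keep only the names that match the glob pattern (case-sensitive)."""
    return [n for n in names if fnmatchcase(n, pattern)]

all_custs = glob_filter(files, "custsm*")
only_ny = glob_filter(files, "custsmodified_N*")

print(all_custs)
print(only_ny)
```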

2)**Questions related to handling an evolving data structure, with data ingested on different days/periods - Ans. Schema Evolution**
Evolution is growth over time (filesystem level). E.g. the source sends data with additional columns week over week in CSV format
1. Read the data and write it in a serialized format (ORC, Parquet)
2. Read the DF with mergeSchema = True

3)**Questions related to handling data from different sources with different but related structures on the same day - Ans. Schema Merging/Melting (DataFrame level)**
E.g. Source1 sends custsmodified_NY with 5 columns and Source2 sends custsmodified_TX with 4 columns
1. Read file1 into DF1, read file2 into DF2
2. Create DF3 by merging DF1 and DF2 using df1.unionByName(df2,allowMissingColumns=True)

In [0]:
#Extraction (Ingestion) methodologies
#1. Single file
struct1="id string, firstname string, lastname string, age string, profession string"
rawdf1=spark.read.schema(struct1).csv(path="/Volumes/breadandbutter/data_ingestion_db/data_de_vl/custsmodified") 
#2. Multiple files (with different names)
rawdf1=spark.read.schema(struct1).csv(path=["/Volumes/we47catalog/we47schema/we47_volume/custsmodified","/Volumes/breadandbutter/data_ingestion_db/data_de_vl/custsmodified_NY"])
#3. Multiple files in multiple paths or sub paths
rawdf1=spark.read.schema(struct1).csv(path=["/Volumes/breadandbutter/data_ingestion_db/data_de_vl/","/Volumes/breadandbutter/data_ingestion_db/data_de_vl/"],recursiveFileLookup=True,pathGlobFilter="custsm*")

- When do you go for Schema Merging/Melting, and when for Schema Evolution?
- Schema Merging/Melting (unionByName, allowMissingColumns) - if we get multiple files with different structures
- Schema Evolution (ORC/Parquet with mergeSchema) - if the source system keeps adding columns
- When the structure of the files is already known - schema merging; when the structure is not known in advance - schema evolution

Schema Evolution:
Handling schema changes over time in the same dataset.

Schema Merging:
Combining different schemas from multiple sources into one structure.

- If multiple files with different structures arrive together → Schema Merging.
- If the same source keeps adding columns over time → Schema Evolution.

### **1. Combining Data + Schema Evolution/Merging (Structuring)**

**Schema Merging**
- Schema Merging is the process of combining data from multiple sources with different schemas at the same point in time into a single unified structure.

**Scope**
- Same-day / same batch
- DataFrame level
- _Typical Scenario_
- Source A (NY): id, name, email
- Source B (TX): id, name, phone

**_How it is handled_**
- Read separately
- Merge using unionByName
- df_all = df_ny.unionByName(df_tx, allowMissingColumns=True)


**What Spark does**
- Matches columns by name
- Adds missing columns as NULL
- Produces unified DataFrame

**Key Point**
- Schema Merging happens because sources differ, not because time changes.
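
The name-based matching described above can be modelled in plain Python to build intuition. This is a sketch of the semantics only, not Spark code: `union_by_name` is a hypothetical helper, rows are plain dicts, and the sample values are invented. It mirrors the NY/TX scenario, with the left side's columns first and right-only columns appended at the end.

```python
def union_by_name(rows_a, cols_a, rows_b, cols_b):
    """Plain-Python model of unionByName(..., allowMissingColumns=True).

    Rows are dicts keyed by column name. The result keeps the left
    side's columns first, then appends columns found only on the right.
    """
    all_cols = list(cols_a) + [c for c in cols_b if c not in cols_a]
    merged = []
    for row in list(rows_a) + list(rows_b):
        # row.get returns None for a column the source did not provide
        merged.append({c: row.get(c) for c in all_cols})
    return all_cols, merged

ny_cols = ["id", "name", "email"]
tx_cols = ["id", "name", "phone"]
ny_rows = [{"id": 1, "name": "Ann", "email": "ann@example.com"}]
tx_rows = [{"id": 2, "name": "Bob", "phone": "555-0100"}]

cols, rows = union_by_name(ny_rows, ny_cols, tx_rows, tx_cols)
print(cols)
for r in rows:
    print(r)
```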

In [0]:
#COMBINING, SCHEMA MERGING or SCHEMA MELTING of data from different sources (an important interview question, just like schema evolution)
#4. Multiple files with different structure in multiple paths or sub paths
strt1="id string, firstname string, lastname string, age string, profession string"
rawdf1=spark.read.schema(strt1).csv(path=["/Volumes/breadandbutter/data_ingestion_db/data_de_vl/"],recursiveFileLookup=True,pathGlobFilter="custsmodified_N*")

strt2="id string, firstname string, age string, profession string,city string"
rawdf2=spark.read.schema(strt2).csv(path=["/Volumes/breadandbutter/data_ingestion_db/data_de_vl/"],recursiveFileLookup=True,pathGlobFilter="custsmodified_T*")
display(rawdf1)
display(rawdf2)

#union() is position-based: use it only when both dataframes have the same number of columns, in the same order, with the same datatypes
#Here the column order differs (lastname vs age in the third position), so values land under the wrong columns
rawdf_merged=rawdf1.union(rawdf2)
display(rawdf_merged)

#Expected right approach to follow #allowMissingColumns=True -> Adds missing columns with NULL
rawdf_merged=rawdf1.unionByName(rawdf2,allowMissingColumns=True) #In unionByName -> Columns matched by name # Missing columns → NULL
display(rawdf_merged) #done




In [0]:
#Above we merged two files that are both in CSV format. If one source is CSV and the other is in a different format (e.g. JSON), nothing changes: once each file is read into a DataFrame, unionByName works the same way regardless of the source format
#rawdf2.write.json("/Volumes/workspace/wd36schema/ingestion_volume/staging/csvjson")
rawdf3=spark.read.json("/Volumes/breadandbutter/data_ingestion_db/data_de_vl/simple_json.txt")
rawdf_merged=rawdf_merged.unionByName(rawdf3,allowMissingColumns=True)
display(rawdf_merged)#Expected dataframe to proceed further munging on a single dataframe

### 2. Validation, Cleansing, Scrubbing - Cleansing (removal of unwanted datasets), Scrubbing (convert raw to tidy)

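**Read modes**

1️⃣ **mode='permissive'** (default)
What it does
- Reads all records
- Corrupt / malformed rows are not dropped; fields that cannot be parsed are set to NULL
- The raw malformed record is kept in a special column called _corrupt_record, provided the schema includes that column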

2️⃣ **mode='dropMalformed'**
 What it does
- Drops malformed rows
- No _corrupt_record column

3️⃣ **mode='failFast'**
 What it does
- Fails immediately when malformed record is found
- Stops job execution
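
The three behaviours can be contrasted with a toy reader in plain Python. This is a simplified model, not how Spark parses CSV: `read_with_mode` and `to_int_or_none` are hypothetical helpers, the 3-column schema (id int, name string, age int) and the sample lines are invented, and the toy treats any unparseable integer or wrong column count as malformed.

```python
def to_int_or_none(text):
    """Return int(text), or None when the value is missing or not numeric."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None

def read_with_mode(lines, mode="permissive"):
    """Toy reader for a 3-column schema: id int, name string, age int."""
    rows = []
    for line in lines:
        parts = line.split(",")
        padded = (parts + [None, None, None])[:3]  # pad short rows, cut long ones
        row = (to_int_or_none(padded[0]), padded[1], to_int_or_none(padded[2]))
        malformed = len(parts) != 3 or row[0] is None or row[2] is None
        if malformed and mode == "failFast":
            raise ValueError(f"malformed record: {line!r}")  # stop immediately
        if malformed and mode == "dropMalformed":
            continue  # skip the whole row
        rows.append(row)  # permissive keeps the row with None where parsing failed
    return rows

lines = ["1,Ann,47", "2,Bob,7-7", "3,Cid"]
permissive = read_with_mode(lines, "permissive")
dropped = read_with_mode(lines, "dropMalformed")

try:
    read_with_mode(lines, "failFast")
    failed = False
except ValueError as err:
    failed = True
    print("failFast stopped:", err)

print(permissive)
print(dropped)
```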

In [0]:
#Validation by doing cleansing
from pyspark.sql.types import StructType,StructField,StringType,ShortType,IntegerType
#print(rawdf1.schema)
struttype1=StructType([StructField('id', IntegerType(), True), StructField('firstname', StringType(), True), StructField('lastname', StringType(), True), StructField('age', ShortType(), True), StructField('profession', StringType(), True)])

#method1 - permissive: keeps all rows, with nulls where values do not match the schema
cleandf1=spark.read.schema(struttype1).csv(path="/Volumes/breadandbutter/data_ingestion_db/data_de_vl/custsmodified",mode='permissive')
print("after keeping nulls on the wrong data format", cleandf1.count())#all rows count
display(cleandf1)#Nulls are set wherever there is a data format mismatch (cutting the muddy portion off the potato)
#or
#method2 - drop malformed rows
cleandf1=spark.read.schema(struttype1).csv(path="/Volumes/breadandbutter/data_ingestion_db/data_de_vl/custsmodified",mode='dropMalformed')
#len(collect()) is used instead of count() here: count() may not parse every column, so it can still report rows that dropMalformed would drop
print("after cleaning wrong data (type mismatch, column number mismatch)", len(cleandf1.collect()))
display(cleandf1)#The entire row is removed wherever there is a data format mismatch (throwing away the entire potato)


### Validation

In [0]:
#method3 - best methodology for applying active data munging
#Validate by cleansing after the DataFrame is created rather than at read time: read every column as string, then clean and scrub subsequently
struttype1 = StructType([StructField('id', StringType(), True), StructField('firstname', StringType(), True), StructField('lastname', StringType(), True), StructField('age', StringType(), True), StructField('profession', StringType(), True)])
#permissive read with an all-string schema, so no value is nulled because of a type mismatch
rawdf1=spark.read.schema(struttype1).csv(path="/Volumes/breadandbutter/data_ingestion_db/data_de_vl/custsmodified",mode='permissive')
print("allow all data showing the real values",rawdf1.count())#all rows count
display(rawdf1)#The real values are shown as-is, e.g. a non-numeric age is kept instead of becoming null

**Cleansing**
It is the process of cleaning/removing/deleting unwanted data
### - 1️⃣ **na.drop(how="any")**
- Drops the entire row if ANY column in that row is NULL
- Only rows with no nulls at all survive

### - 2️⃣ **na.drop(how="any", subset=["id","age"])**
- Drops the row ONLY IF id OR age is NULL
- Other columns are ignored.

| Mode | Row dropped when |
|---|---|
| how="any" | at least one selected column is NULL |
| how="all" | all selected columns are NULL |
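
The `how` and `subset` rules can be checked against a small plain-Python model. This is a sketch of the semantics only: `na_drop` is a hypothetical helper working on dicts, the Gretchen row is the sample from these notes (`4000004,Gretchen,,66,`), and the other two rows are invented.

```python
def na_drop(rows, how="any", subset=None):
    """Plain-Python model of DataFrame.na.drop for rows stored as dicts."""
    kept = []
    for row in rows:
        cols = subset if subset is not None else list(row)
        nulls = [row[c] is None for c in cols]
        # "any": drop if at least one checked column is null
        # "all": drop only if every checked column is null
        drop = any(nulls) if how == "any" else all(nulls)
        if not drop:
            kept.append(row)
    return kept

rows = [
    {"id": "4000001", "firstname": "Ann", "lastname": "Lee", "age": "47", "profession": "Pilot"},
    {"id": "4000004", "firstname": "Gretchen", "lastname": None, "age": "66", "profession": None},
    {"id": "4000005", "firstname": "Bob", "lastname": "Ray", "age": None, "profession": "Artist"},
]

any_all_cols = [r["id"] for r in na_drop(rows, how="any")]
any_id_age = [r["id"] for r in na_drop(rows, how="any", subset=["id", "age"])]
all_name_prof = [r["id"] for r in na_drop(rows, how="all", subset=["lastname", "profession"])]

print(any_all_cols)
print(any_id_age)
print(all_name_prof)
```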


In [0]:
#We already know how to cleanse by applying a strict structure (method1 and method2)
#Important na functions we can use for cleansing
display(rawdf1.where("age is null")) #raw data before cleansing
cleanseddf=rawdf1.na.drop(how="any")#Drops every row that has a null in any column; only rows with no null columns are returned
display(cleanseddf.where("age is null"))#after cleansing, no rows with a null age remain

cleanseddf=rawdf1.na.drop(how="any",subset=["id","age"]) #Drops row ONLY IF id OR age is NULL
display(cleanseddf)
cleanseddf=rawdf1.na.drop(how="all",subset=["lastname","profession"])#Drops rows where both lastname and profession are null, e.g. 4000004,Gretchen,,66,
display(cleanseddf)

**Scrubbing**
It is the process of polishing/fine-tuning the data and converting it meaningfully into a usable format
### - **na.fill()** - Replaces NULL values in the specified column(s) with the given value

In [0]:
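cleanseddf = rawdf1.na.fill('user not provided',subset=["lastname"])#fills a null lastname, e.g. 4000004,Gretchen,,66,
display(cleanseddf)

cleanseddf = rawdf1.na.fill('NA')#replaces null values with 'NA' in all string columns (every column here is string); a string value does not fill non-string columns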
display(cleanseddf)

**Schema Evolution**
- Schema Evolution is the ability of a system to handle changes in data structure over time as new data arrives with additional or modified columns, without breaking existing pipelines.
### Scope
- Time-based (day over day / week over week)
- Filesystem / table level
- _Typical Scenario_
- Week 1 file: id, name
- Week 2 file: id, name, email
- Week 3 file: id, name, email, phone
### How it is handled
- Use schema-aware formats (Parquet / ORC)
- Enable mergeSchema = true at read time
- `spark.read.option("mergeSchema", "true").parquet("/data/customers")`
### What Spark does
- Reads schema from all files
- Merges them into a superset
- Missing columns → NULL
### Key Point
- Schema Evolution happens because data changes over time.
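
The superset behaviour can be pictured with a few lines of plain Python. This is a model of the idea rather than Spark's implementation: `merge_schemas` is a hypothetical helper, the weekly column lists are the ones from the scenario above, and the sample row is invented.

```python
weekly_schemas = {
    "week1": ["id", "name"],
    "week2": ["id", "name", "email"],
    "week3": ["id", "name", "email", "phone"],
}

def merge_schemas(schemas):
    """Build the superset of columns, keeping first-seen order."""
    merged = []
    for cols in schemas:
        for c in cols:
            if c not in merged:
                merged.append(c)
    return merged

merged = merge_schemas(weekly_schemas.values())

# A row written in week 1 has no email or phone, so those read back as None.
week1_row = {"id": 1, "name": "Ann"}
aligned = {c: week1_row.get(c) for c in merged}

print(merged)
print(aligned)
```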