# Key Terminology

#### Validation
- PERMISSIVE
- DROPMALFORMED
- FAILFAST

#### When do you use Schema Merging and Schema Evolution?
 - unionByName, allowMissingColumns (multiple files from different locations)
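
As a rough illustration of what `unionByName` with `allowMissingColumns=True` produces, here is a plain-Python model that runs without Spark. It is only a sketch of the behaviour, not Spark's implementation: the helper `union_by_name` and the sample rows are hypothetical, and `None` stands in for NULL.

```python
def union_by_name(left, right):
    """Model of unionByName(..., allowMissingColumns=True) on lists of dict rows."""
    # Result columns: the left side's columns first, then columns only the right side has
    columns = []
    for row in left + right:
        for name in row:
            if name not in columns:
                columns.append(name)
    # A column missing on one side is filled with None (NULL in Spark)
    return [{name: row.get(name) for name in columns} for row in left + right]


# Hypothetical sample rows: one file has lastname, the other has city instead
ny_rows = [{"id": "1", "firstname": "Ann", "lastname": "Lee", "age": 34, "profession": "Pilot"}]
tx_rows = [{"id": "2", "firstname": "Bob", "age": 41, "profession": "IT", "city": "Austin"}]

merged = union_by_name(ny_rows, tx_rows)
for row in merged:
    print(row)
```

Every output row carries all six columns; `city` is empty for the first row and `lastname` is empty for the second.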

#### Schema Evolution 
- mergeSchema=True

##### Rejection Strategy 
 - columnNameOfCorruptRecord="corruptdata"

##### Multiple files in multiple paths or subpaths
- recursiveFileLookup=True, pathGlobFilter

In [0]:
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.appName("Spark DataFrames").getOrCreate()

# Important passive munging: EDA functions for inspecting schema and structure

In [0]:
dbutils.fs.ls("/Volumes/lakehouse1/dbread/read_volume/sub/")

In [0]:
display(
    spark.read.text("/Volumes/lakehouse1/dbread/read_volume/custsmodified")
)


In [0]:
df1 = spark.read.csv("/Volumes/lakehouse1/dbread/read_volume/custsmodified", header=False, inferSchema=True).toDF("id", "name", "lname", "age", "prof")
display(df1)
# df1.printSchema()
# display([col for col in df1.columns])
# print(df1.schema)
# display(df1.dtypes)
# display(df1.describe())
# display(df1.summary())


In [0]:
print("actual count of the data",df1.count())
# distinct() and dropDuplicates() with no arguments both de-duplicate on all columns
print("de-duplicated record (all columns) count",df1.distinct().count())
print("de-duplicated record (all columns) count",df1.dropDuplicates().count())
# dropDuplicates(['id']) keeps a single row per id value
print("de-duplicated given id column count",df1.dropDuplicates(['id']).count())
display(df1.describe())
display(df1.summary())

In [0]:
#1. Single file
struct1="id string, firstname string, lastname string, age string, profession string"
rawdf1=spark.read.schema(struct1).csv(path="/Volumes/lakehouse1/dbread/read_volume/custs")
#display(rawdf1.count())

#2. Multiple paths, filtered by file name pattern and searched recursively
rawdf1=spark.read.schema(struct1).csv(path=["/Volumes/lakehouse1/dbread/read_volume/custsmodified","/Volumes/lakehouse1/dbread/read_volume/custsmodified_NY","/Volumes/lakehouse1/dbread/read_volume/custs"],pathGlobFilter="custs*",recursiveFileLookup=True)
display(rawdf1.count())

In [0]:
strt1="id string, firstname string, lastname string, age string, profession string"
rawdf1=spark.read.schema(strt1).csv(path=["/Volumes/lakehouse1/dbread/read_volume/"],recursiveFileLookup=True,pathGlobFilter="custsmodified_N*")
display(rawdf1.count())
display(rawdf1)

strt2="id string, firstname string, age string, profession string,city string"
# Use strt2 here: it describes the layout with a city column and no lastname column
rawdf2=spark.read.schema(strt2).csv(path=["/Volumes/lakehouse1/dbread/read_volume/"],recursiveFileLookup=True,pathGlobFilter="custsmodified_T*")
display(rawdf2.count())
display(rawdf2)

In [0]:
rawdf_merged=rawdf1.unionByName(rawdf2,allowMissingColumns=True)
display(rawdf_merged)

### Combining Data + Schema Evolution/Merging (Structuring) - Preliminary Data Munging


#### **Single File**

In [0]:
struct1="id string, firstname string, lastname string, age string, profession string"
rawdf1=spark.read.schema(struct1).csv(path="/Volumes/lakehouse1/dbread/read_volume/custsmodified")
print("Single file total count",rawdf1.count())
#display(rawdf1)

#Multiple files (with different names)
rawdf1=spark.read.schema(struct1).csv(path=["/Volumes/lakehouse1/dbread/read_volume/custsmodified","/Volumes/lakehouse1/dbread/read_volume/custsmodified_NY","/Volumes/lakehouse1/dbread/read_volume/sub/custsmodified_TX"])
print("Multiple files (with different names)",rawdf1.count())
#display(rawdf1)

#Multiple files (with different names, recursive)
rawdf1 = spark.read.schema(struct1).csv(
    path="/Volumes/lakehouse1/dbread/read_volume/",
    pathGlobFilter="custs*",
    recursiveFileLookup=True
)
print("Multiple files (with different names, recursive)",rawdf1.count())
display(rawdf1)



In [0]:
schem1 = "id string, firstname string, lastname string, age int, profession string"
rawdf1=spark.read.csv("/Volumes/lakehouse1/dbread/read_volume/sub/custsmodified_NY",schema=schem1)
display(rawdf1)

schem2 = "id string, firstname string, age int, profession string, city string"
rawdf2 =spark.read.csv("/Volumes/lakehouse1/dbread/read_volume/sub/custsmodified_TX",schema=schem2)
display(rawdf2)

rawdf_merged= rawdf1.unionByName(rawdf2, allowMissingColumns=True)
display(rawdf_merged)
#rawdf_merged=rawdf1.unionByName(rawdf2)  # without allowMissingColumns=True this fails because the two column sets differ

#### Validation – Data Exploration through Cleansing and Scrubbing

- **Scrubbing**: Applied **Permissive mode** to handle unexpected data types by converting invalid values to **NULL**.
- **Cleansing**: Applied **Drop Malformed mode** to eliminate records containing invalid or malformed data.
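
To make the difference between the read modes concrete, the sketch below models PERMISSIVE, DROPMALFORMED and FAILFAST in plain Python (no Spark required). It is a simplified model, not Spark's parser: the helpers `to_int` and `parse_rows`, the three-column layout and the sample lines are all invented for illustration.

```python
from typing import Optional

# Hypothetical sample lines in the layout id,firstname,age
LINES = ["1,Ann,34", "two,Bob,41", "3,Cy,n/a"]


def to_int(text: str) -> Optional[int]:
    """Return the integer value, or None when the text is not an integer."""
    try:
        return int(text)
    except ValueError:
        return None


def parse_rows(lines, mode="PERMISSIVE"):
    """Model the three read modes for an (id int, firstname string, age int) schema."""
    rows = []
    for line in lines:
        raw_id, firstname, raw_age = line.split(",")
        row_id, age = to_int(raw_id), to_int(raw_age)
        malformed = row_id is None or age is None
        if malformed and mode == "FAILFAST":
            # FAILFAST: stop at the first record that does not fit the schema
            raise ValueError(f"malformed record: {line}")
        if malformed and mode == "DROPMALFORMED":
            # DROPMALFORMED: silently skip the record
            continue
        # PERMISSIVE: keep the row, leave bad values as None, keep the raw line for rejects
        rows.append({"id": row_id, "firstname": firstname, "age": age,
                     "corruptdata": line if malformed else None})
    return rows


print(parse_rows(LINES, "PERMISSIVE"))
print(parse_rows(LINES, "DROPMALFORMED"))
```

In this model PERMISSIVE keeps all three rows and fills `corruptdata` for the two bad ones, DROPMALFORMED keeps only the first row, and FAILFAST raises on the second line.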


In [0]:
from pyspark.sql.types import *
strt1="id int, firstname string, lastname string, age int, profession string"

df_raw=spark.read.csv("/Volumes/lakehouse1/dbread/read_volume/sub/custsmodified")
df_raw.show(20)

strt11=StructType([StructField('id', IntegerType(), True), StructField('firstname', StringType(), True), StructField('lastname', StringType(), True), StructField('age', IntegerType(), True), StructField('profession', StringType(), True),StructField("corruptdata",StringType(),True)])

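# PERMISSIVE without columnNameOfCorruptRecord: values that do not fit the schema become NULL,
# and corruptdata stays NULL because it is not registered as the corrupt-record column.
dfmethod1=spark.read.csv("/Volumes/lakehouse1/dbread/read_volume/sub/custsmodified", schema=strt11,mode="PERMISSIVE",header=False)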

print("dfmethod1 entire count of data",dfmethod1.count())
print("dfmethod1 after scrubbing, count of data",len(dfmethod1.collect()))
display(dfmethod1)


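# DROPMALFORMED: rows that do not fit the schema are dropped.
# The 5-column strt1 is used here because the extra corruptdata field in strt11 could make every 5-token row count as malformed.
dfmethod2=spark.read.csv("/Volumes/lakehouse1/dbread/read_volume/sub/custsmodified", schema=strt1,mode="DROPMALFORMED",header=False)

# count() may not parse any column (CSV column pruning), so it can still include malformed rows;
# collect() parses every column and reflects the rows that actually survive.
print("dfmethod2 entire count of data",dfmethod2.count())
print("dfmethod2 after cleansing, count of data",len(dfmethod2.collect()))
display(dfmethod2)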

dfmethod3 = spark.read.csv(
    "/Volumes/lakehouse1/dbread/read_volume/sub/custsmodified",
    schema=strt11,
    mode="PERMISSIVE",
    header=False,
    columnNameOfCorruptRecord="corruptdata"
)

print("dfmethod3 entire count of data",dfmethod3.count())
print("dfmethod3 after scrubbing, count of data",len(dfmethod3.collect()))
display(dfmethod3)


In [0]:
#Before cleansing or scrubbing, set up a rejection strategy: capture malformed rows so they can be reported and corrected at the source
from pyspark.sql.types import (
    StructType,
    StructField,
    IntegerType,
    StringType
)

strt11 = StructType([
    StructField("id", IntegerType(), True),
    StructField("firstname", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("profession", StringType(), True),
    StructField("corruptdata", StringType(), True)
])

dfmethod3 = spark.read.schema(strt11).csv(
    "/Volumes/lakehouse1/dbread/read_volume/sub/custsmodified",
    mode="PERMISSIVE",
    header=False,
    columnNameOfCorruptRecord="corruptdata"
)

display(dfmethod3)


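print("entire count of data", dfmethod3.count())
df_reject = dfmethod3.where("corruptdata is not null")
# Dropping corruptdata writes only the parsed columns; the raw rejected line itself is not stored
df_reject.drop("corruptdata").write.mode("overwrite").option("header", True).option("delimiter", ",").csv("/Volumes/lakehouse1/dbread/read_volume/sub/rejects/")
# len(collect()) rather than count(): Spark can reject a query on a raw CSV read whose only referenced column is the corrupt-record column
print("Data to reject or update the source", len(df_reject.collect()))
display(df_reject)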


In [0]:
df_read_back = spark.read \
    .option("header", True) \
    .csv("/Volumes/lakehouse1/dbread/read_volume/sub/rejects/")

display(df_read_back)


##### Cleansing
Cleansing is the process of removing unusable data so that what remains is clean, e.g. cutting the bad portions out of a potato.
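
The `how` and `subset` arguments of `na.drop` decide which rows count as unusable. The plain-Python model below (no Spark required) sketches that rule; the helper `na_drop` and the sample rows are hypothetical, and `None` stands in for NULL.

```python
def na_drop(rows, how="any", subset=None):
    """Plain-Python model of na.drop on a list of dict rows."""
    kept = []
    for row in rows:
        columns = subset if subset is not None else list(row.keys())
        nulls = [row[name] is None for name in columns]
        # how="any": drop when at least one checked column is null
        # how="all": drop only when every checked column is null
        drop = any(nulls) if how == "any" else all(nulls)
        if not drop:
            kept.append(row)
    return kept


# Hypothetical sample rows
ROWS = [
    {"id": 1, "firstname": "Ann", "profession": "Pilot"},
    {"id": 2, "firstname": None, "profession": "IT"},
    {"id": None, "firstname": "Cy", "profession": None},
]

print(len(na_drop(ROWS, how="any")))
print(len(na_drop(ROWS, how="all")))
print(len(na_drop(ROWS, how="all", subset=["id", "profession"])))
```

With these rows, `how="any"` keeps one row, `how="all"` keeps all three because no row is entirely null, and `how="all"` restricted to `id` and `profession` drops only the last row.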

In [0]:
cleansed_df1=dfmethod3.na.drop(how="any")#drop the row, if any one column in our df row contains null
#cleansed_df1=dfmethod3.na.drop(how="any",subset=["id","age"])#drop the row, if any one column id/age contains null
print("cleansed any DF count",len(cleansed_df1.collect()))
display(cleansed_df1.take(50))

In [0]:
#cleansed_df2=dfmethod3.na.drop(how="all")#drop the row only if every column in the row is null
cleansed_df2=dfmethod3.na.drop(how="all",subset=["id","profession"])#drop the row only if both id and profession are null
print("cleansed all DF count",len(cleansed_df2.collect()))
display(cleansed_df2.take(50))

In [0]:
#Before scrubbing, build the final cleansed data: keep rows that parsed cleanly,
#then remove rows with a null id and rows that are entirely null
cleansed_df = (
    dfmethod3
        .filter("corruptdata is null")
        .na.drop(subset=["id"])
        .na.drop(how="all")
        .drop("corruptdata")
)

print("Final cleansed DF",len(cleansed_df.collect()))
display(cleansed_df.take(15))

In [0]:
scrubbed_df1=cleansed_df.na.fill("na",subset=["firstname","lastname"]).na.fill("not provided",subset=["profession"])
scrubbed_df2=scrubbed_df1.na.replace("IT","Information Technologies",subset=["profession"]).na.replace("Pilot","Aircraft Pilot",subset=["profession"])
display(scrubbed_df2.take(15))

In [0]:
dict1={"IT":"Information Technologies","Pilot":"Doctor","Actor":"Business"}
scrubbed_df=scrubbed_df1.na.replace(dict1,subset=["profession"])
print("scrubbed DF",len(scrubbed_df.collect()))
display(scrubbed_df.take(15))

##### Standardization1 - Column Enrichment (Addition of columns)

In [0]:
from pyspark.sql.functions import lit,initcap
#To add a column with a hardcoded value, wrap the value in lit() so it becomes a literal column
standard_df1=scrubbed_df.withColumn("Source_sys", lit("Retail"))
display(standard_df1)

In [0]:
#Standardization2 - UNIFORMITY of the data
display(standard_df1.groupBy("profession").count())#DSL
#standard_df1.createOrReplaceTempView("view1")
#display(spark.sql("select profession,count(1) from view1 group by profession"))#Declarative lang
standard_df2=standard_df1.withColumn("profession",initcap("profession"))#initcap capitalizes the first letter of each word and lowercases the rest, so "pilot" and "PILOT" both become "Pilot"
#display(standard_df2.take(15))
display(standard_df2.groupBy("profession").count())

In [0]:
from pyspark.sql.functions import regexp_replace

# ------------------------------------------------------------
# Standardization 3 – Format Standardization
# ------------------------------------------------------------

# Mapping for ID standardization
# (Can later be enhanced using GenAI)
cid_standardization = {
    "one": "1",
    "two": "2",
    "ten": "10"
}

# Replace textual IDs with numeric equivalents
# Using NA replace (data munging technique)
standard_df3 = (
    standard_df2
        .na.replace(cid_standardization, subset=["id"])
)

# Standardize age format by removing '-' character
standard_df3 = (
    standard_df3
        .withColumn("age", regexp_replace("age", "-", ""))
)

# Preview standardized data
display(standard_df3.take(15))


In [0]:
from pyspark.sql.functions import col

standard_df3.printSchema()
#display(len(standard_df3.collect()))
#display(standard_df3.where("id like '%trailer%'"))
standard_df3=standard_df3.where("id not rlike '[a-zA-Z]'")#Removed the string data in the id column
#display(standard_df3.where("id='trailer_data:end of file'"))
#display(len(standard_df3.collect()))
standard_df4=standard_df3.withColumn("age",col("age").cast("int")).withColumn("id",col("id").cast("long"))
standard_df4.printSchema()
display(standard_df4.take(15))
#standard_df4.where("id=1000").show()

In [0]:
standard_df5=standard_df4.withColumnsRenamed({"id":"custid","Source_sys":"sourcesystem"})
standard_df5.printSchema()
display(standard_df5.take(15))


In [0]:
standard_df6=standard_df5.select("custid","age","firstname","lastname","profession","sourcesystem")
standard_df6.printSchema()
display(standard_df6.take(15))
standard_df6.write.mode("overwrite").saveAsTable("lakehouse1.dbread.munged_cust_data")

In [0]:
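#Exploration only: inspect two sample custids before de-duplicating
#display(standard_df6.where("custid IN (4000001, 4000003)"))
#Remove rows that are exact duplicates across all columns
dedup_df1 = standard_df6.distinct().orderBy("custid")
display(dedup_df1)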
#standard_df6.write.mode("overwrite")

In [0]:
from pyspark.sql.functions import col

dedup_df2 = (
    dedup_df1
        .orderBy(["custid", "age"], ascending=[True, False])
        .coalesce(1)
        
)

display(dedup_df2)

In [0]:
dedup_df3=dedup_df2.dropDuplicates(["custid"])
display(dedup_df3)
dedup_df4 = (
    dedup_df1
        .dropDuplicates(["custid", "age", "firstname", "lastname"])
)
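
The cells above sort by `custid` and descending `age` and then call `dropDuplicates(["custid"])`, which reads as "keep the row with the highest age for each customer". Spark does not document which row `dropDuplicates` keeps, so it may help to state the intended rule explicitly. The sketch below expresses that rule in plain Python (no Spark required); the helper names and sample rows are hypothetical. In Spark itself, a window with `row_number()` ordered by age is a commonly used way to apply the same rule.

```python
def _age_key(row):
    """Rank a missing age (None) below any real age."""
    return -1 if row["age"] is None else row["age"]


def keep_highest_age(rows):
    """Keep one row per custid: the one with the highest age (first one wins on ties)."""
    best = {}
    for row in rows:
        current = best.get(row["custid"])
        if current is None or _age_key(row) > _age_key(current):
            best[row["custid"]] = row
    # Return the survivors ordered by custid
    return [best[custid] for custid in sorted(best)]


# Hypothetical sample rows with repeated custids
ROWS = [
    {"custid": 4000003, "age": 30, "firstname": "sheri"},
    {"custid": 4000001, "age": None, "firstname": "kristina"},
    {"custid": 4000003, "age": 41, "firstname": "sherri"},
    {"custid": 4000001, "age": 55, "firstname": "kristina"},
]

for row in keep_highest_age(ROWS):
    print(row)
```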

In [0]:
#Before we enrich, let's do some EDA
#At every stage we should do basic EDA (data exploration)
dedup_df3.printSchema()
print("Records removed by cleansing/munging",dfmethod1.count()-len(dedup_df3.collect()))
display(dedup_df3.summary())
display(dedup_df3)

### Data Enrichment


#### WithColumn Add

In [0]:
from pyspark.sql.functions import current_date, lit

datadt = '2026-01-01'
enrich_df1 = dedup_df3.withColumn("loaddt", current_date()).withColumn("datadt", lit(datadt))
display(enrich_df1.limit(2))

enrich_df1 = dedup_df3.withColumns({"loaddt": current_date(), "datadt": lit(datadt)})
display(enrich_df1.limit(2))

enrich_df1 = dedup_df3.select('*',current_date().alias('loaddt'),lit(datadt).alias('datadt'))
display(enrich_df1.limit(2))

enrich_df1 = dedup_df3.selectExpr('*',"current_date() as loaddt","'2026-01-01' as datadt")
display(enrich_df1.limit(2))

In [0]:
from pyspark.sql.functions import col,upper,lower

enrich_df2 = enrich_df1.withColumn(
    "firstnames",
    upper(col("firstname"))).withColumn(
    "firstname",
    lower(col("firstname"))).withColumn(
        "last_name",col("lastname"))

enrich_df2.show()

###### Deriving columns

In [0]:
from pyspark.sql.functions import substring
enrich_df2=enrich_df1.withColumn("profession_flag",substring("profession",1,1))
enrich_df2.show(10)
#or we can achieve using select
enrich_df2=enrich_df1.select("*",substring("profession",1,1).alias("profession_flag"))
enrich_df2.show(10)
#or we can achieve using selectExpr
enrich_df2=enrich_df1.selectExpr("*","substr(profession,1,1) as profession_flag")
enrich_df2.show(10)

###### Renaming columns

In [0]:
enrich_df3=enrich_df2.withColumnRenamed("profession_flag","proflag")#better to use
enrich_df3.show(10)
enrich_df3=enrich_df2.withColumnsRenamed({"profession_flag":"proflag","profession":"prof"})
enrich_df3.show(10)
#or
enrich_df3=enrich_df2.select("*",col("profession").alias("prof"))#This adds a new column prof and keeps profession as well
enrich_df3=enrich_df2.select("custid","age","firstname","lastname",col("profession").alias("prof"),"sourcesystem","loaddt","datadt",col("profession_flag").alias("proflag"))#Listing every column renames in place and keeps the column order; better to use
#enrich_df3=enrich_df3.drop("profession")#column orders are changed, not good to use
enrich_df3.show(10)


###### Modify/replace (withColumn, select/selectExpr)

In [0]:
from pyspark.sql.functions import to_date, upper, col

enrich_df3.printSchema()  # datadt is not in expected date format for further usage and I want to convert sourcesystem into uppercase
enrich_df4 = (
    enrich_df3
    .withColumn("sourcesystem", upper(col("sourcesystem")))
    .withColumn("datadt", to_date(col("datadt"), 'yyyy-MM-dd'))
)
enrich_df4.printSchema()  # datadt is expected date format
display(enrich_df4.limit(10))

###### Remove/Eliminate (drop, select, selectExpr)

In [0]:
from pyspark.sql.functions import col,concat,lit,upper
enrich_df5=enrich_df4.withColumn("fullname",upper(concat(col("firstname"),lit("_") , col("lastname"))))
enrich_df5=enrich_df5.drop("firstname","lastname").select("custid","age","fullname","prof","sourcesystem","loaddt","datadt","proflag")
display(enrich_df5.limit(10))

===============================
PySpark Column Enrichment
Conclusion / Best Practices
===============================

1) select()  ⭐ BEST FOR FINAL SHAPE
----------------------------------
Use when you want:
- Reorder / order columns
- Drop unwanted columns
- Derive / reformat columns
- Rename columns using alias()
- Perform ALL operations in ONE iteration

Notes:
- With "*", alias() adds a renamed copy and keeps the original column (not a true rename)
- Must include "*" if you want to keep all columns
- Best for Gold layer & final output

Example:
df.select("*", col("a").alias("b"))


2) selectExpr()  ⭐ SQL STYLE
----------------------------
Use when you want:
- Same capabilities as select()
- SQL-style expressions instead of DSL
- Multiple operations in ONE iteration

Notes:
- String-based (less type-safe)
- Good for SQL-heavy teams

Example:
df.selectExpr("*", "a as b", "upper(name) as uname")


3) withColumn()  ⭐ INCREMENTAL ENRICHMENT
----------------------------------------
Use when you want:
- Add new columns
- Derive columns
- Modify / replace existing columns

Notes:
- Same column name → replaces column
- New column added at the end
- Renaming via withColumn is NOT recommended
- Dropping columns is NOT possible

Example:
df.withColumn("age2", col("age") + 1)


4) withColumnRenamed()  ⭐ RENAME ONLY
------------------------------------
Use when you want:
- Rename a column cleanly and safely

Notes:
- Does not change column order
- No recomputation

Example:
df.withColumnRenamed("old", "new")


5) drop()  ⭐ REMOVE ONLY
-----------------------
Use when you want:
- Remove one or more columns quickly

Notes:
- No expressions allowed
- Avoid after select if order/schema matters

Example:
df.drop("col1", "col2")


===============================
FINAL STICKY SUMMARY
===============================

Enrich step-by-step  → withColumn
Rename cleanly       → withColumnRenamed
Remove columns       → drop
Finalize schema      → select
SQL users            → selectExpr


ONE-LINE MEMORY:
"Enrich early with withColumn, finalize late with select"


In [0]:
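from pyspark.sql.functions import col, concat, size, split, substring  # explicit imports avoid shadowing Python builtins such as sum and max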
#Splitting of columns
enrich_df6=enrich_df5.withColumn("profsplit",split(col("prof"),' '))
enrich_df6=enrich_df6.withColumns({"proffirst":col("profsplit")[0],"proflast":col("profsplit")[size(col("profsplit"))-1]})
display(enrich_df6.limit(10))

In [0]:
#Merging of columns
enrich_df7=enrich_df6.withColumn("proflag",concat(substring(col("proffirst"),1,1),substring(col("proflast"),1,1))).drop("profsplit")
display(enrich_df7)
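
To see what the split-and-merge steps produce for individual values, here is the same derivation written in plain Python (no Spark required). It is a sketch under the assumption that `prof` holds space-separated words; the helper `profession_flag` is a hypothetical name, and `None` stands in for NULL.

```python
def profession_flag(prof):
    """First letter of the first word plus first letter of the last word."""
    if prof is None:
        return None  # a NULL profession yields a NULL flag
    parts = prof.split(" ")
    return parts[0][:1] + parts[-1][:1]


for value in ["Aircraft Pilot", "Information Technologies", "Doctor", None]:
    print(value, "->", profession_flag(value))
```

Note that a one-word profession uses the same word as both first and last token, so its letter appears twice (for example "Doctor" gives "DD").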