##Process customer transaction data from raw storage, apply validations, and write it to curated (processed) storage

**Folder Structure**<br>
/Volumes/workspace/default/sales_project/<br>
│<br>
├── raw/<br>
│   └── customers.csv<br>
│<br>
├── processed/<br>
│   └── customers_cleaned/<br>
│<br>
├── curated/<br>
│   └── customers/<br>
│<br>
└── logs/<br>


In [0]:
%sql
CREATE VOLUME IF NOT EXISTS workspace.default.sales_project;

In [0]:
base_path = "/Volumes/workspace/default/sales_project"

dbutils.fs.mkdirs(f"{base_path}/raw")
dbutils.fs.mkdirs(f"{base_path}/processed")
dbutils.fs.mkdirs(f"{base_path}/logs")


In [0]:
customer_data = [
    (1, "Sunil", "Asha", "sunil.asha@gmail.com", "India", 32, "2024-10-01"),
    (2, "Ravi", "Kumar", "ravi.kumar@yahoo.com", "India", 45, "2024-10-03"),
    (3, "John", "Smith", "john.smith@gmail.com", "USA", 29, "2024-10-05"),
    (4, "Maria", "Garcia", "maria.garcia@gmail.com", "Spain", 35, "2024-10-07"),
    (5, "Asha", "Patel", "asha.patel@gmail.com", "UK", None, "2024-10-10"),
    (6, "Sunil", "Asha", "sunil.asha@gmail.com", "India", 32, "2024-10-01"),
    (7, "Ravi", "Kumar", "ravi.kumar@yahoo.com", "India", 45, "2024-10-03"),
    (8, "John", "Smith", "john.smith@gmail.com", "USA", 29, "2024-10-05"),
    (9, "Sunil", "Asha", "sunil.asha@gmail.com", "India", 32, "2024-10-01")
]

columns = ["customer_id","first_name","last_name","email","country","age","created_date"]

df = spark.createDataFrame(customer_data, columns)

# Spark writes a directory named customers.csv containing part files, not a single CSV file.
df.write.mode("overwrite").option("header", True).csv("/Volumes/workspace/default/sales_project/raw/customers.csv")


In [0]:
df.display()

##STEP 1: Read Customer Data from RAW (Bronze)
- Reads CSV exactly as received
- No changes to source data
- Schema inferred automatically

In [0]:
df1 = spark.read.options(header=True, inferSchema=True).csv("/Volumes/workspace/default/sales_project/raw/customers.csv")
df1.display()
df1.printSchema()  # printSchema() prints directly and returns None, so it is not wrapped in print()

##STEP 2: Basic Data Validation Checks

###2.1 Check record count

In [0]:
df_raw = spark.read.options(header=True, inferSchema=True).csv("/Volumes/workspace/default/sales_project/raw/customers.csv")
print("Raw count:", df_raw.count())

###2.2 Check duplicate customers

In [0]:
from pyspark.sql.functions import col

df_raw.groupBy("email").count().filter(col("count") > 1).show()
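
The duplicate check above can also be mirrored in plain Python, which is handy for reasoning about the expected result without a Spark session. This is a minimal sketch, not part of the pipeline: `find_duplicates` is a hypothetical helper name, and the email list is copied from the sample rows written to `raw/customers.csv`.

```python
from collections import Counter

# Emails of the nine sample rows, in customer_id order.
emails = [
    "sunil.asha@gmail.com", "ravi.kumar@yahoo.com", "john.smith@gmail.com",
    "maria.garcia@gmail.com", "asha.patel@gmail.com", "sunil.asha@gmail.com",
    "ravi.kumar@yahoo.com", "john.smith@gmail.com", "sunil.asha@gmail.com",
]


def find_duplicates(values):
    """Return {value: count} for every value that occurs more than once."""
    return {value: n for value, n in Counter(values).items() if n > 1}


duplicates = find_duplicates(emails)
for email, n in duplicates.items():
    print(email, n)
```

Three emails repeat in the sample data, so the Spark `groupBy("email").count()` filter should list the same three addresses.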


###2.3 Check missing mandatory fields

In [0]:
# Rows where any mandatory field (email, customer_id, age) is null
df_raw.filter(
    col("email").isNull() | col("customer_id").isNull() | col("age").isNull()
).show()
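
As a rough illustration of the same rule outside Spark, the sketch below reports which mandatory fields are missing per record. It assumes records are plain dictionaries; `MANDATORY_FIELDS` and `missing_fields` are hypothetical names introduced here, and the two records are taken from the sample data.

```python
# Same three columns that the Spark filter treats as mandatory.
MANDATORY_FIELDS = ("customer_id", "email", "age")


def missing_fields(record, required=MANDATORY_FIELDS):
    """Return the names of required fields that are absent or None."""
    return [field for field in required if record.get(field) is None]


sample = [
    {"customer_id": 4, "email": "maria.garcia@gmail.com", "age": 35},
    {"customer_id": 5, "email": "asha.patel@gmail.com", "age": None},
]

# Map each customer_id to the list of fields it is missing.
report = {rec["customer_id"]: missing_fields(rec) for rec in sample}
print(report)
```

In the sample data only customer 5 has a missing value (`age`), so it is the only row the Spark filter should return.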

##STEP 3: Data Cleansing & Transformations (Silver)

###3.1 Remove duplicates
- Keeps the latest customer record per email, based on `created_date`

In [0]:
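from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

# Rank rows within each email, newest created_date first
window_spec = Window.partitionBy("email").orderBy(col("created_date").desc())

# Do not chain .show() onto this expression: show() returns None, which would
# be assigned to df_dedup and break the fillna() call in the next step.
df_dedup = (
    df_raw
    .withColumn("rn", row_number().over(window_spec))
    .filter(col("rn") == 1)
    .drop("rn")
)

df_dedup.show()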
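
In the sample data every duplicate shares the same `created_date`, so ordering by date alone does not determine which row `row_number()` ranks first. The plain-Python sketch below shows one way to make the choice deterministic by breaking ties on the highest `customer_id`. The tie-break is an assumption, not something the original pipeline does, and `latest_per_email` is a hypothetical helper name.

```python
def latest_per_email(records):
    """Keep one record per email: newest created_date, then highest customer_id."""
    best = {}
    for rec in records:
        # ISO-8601 date strings (YYYY-MM-DD) sort correctly as plain strings.
        key = (rec["created_date"], rec["customer_id"])
        current = best.get(rec["email"])
        if current is None or key > (current["created_date"], current["customer_id"]):
            best[rec["email"]] = rec
    return list(best.values())


records = [
    {"customer_id": 1, "email": "sunil.asha@gmail.com", "created_date": "2024-10-01"},
    {"customer_id": 4, "email": "maria.garcia@gmail.com", "created_date": "2024-10-07"},
    {"customer_id": 6, "email": "sunil.asha@gmail.com", "created_date": "2024-10-01"},
    {"customer_id": 9, "email": "sunil.asha@gmail.com", "created_date": "2024-10-01"},
]

deduped = latest_per_email(records)
print(sorted(rec["customer_id"] for rec in deduped))
```

In Spark, a comparable tie-break could be expressed by adding `col("customer_id").desc()` as a second expression in the window's `orderBy`.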


###3.2 Handle missing values

In [0]:
df_clean = df_dedup.fillna({
    "age": 0,
    "country": "UNKNOWN"
})

###3.3 Standardize data

In [0]:
from pyspark.sql.functions import upper, trim

df_clean = (
    df_clean
    .withColumn("country", upper(trim(col("country"))))
    .withColumn("email", trim(col("email")))
)
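
Steps 3.2 and 3.3 together amount to a small per-record rule set. The sketch below restates them in plain Python so the intended result for a single row is easy to check. It assumes dictionary records, and `clean_record` is a hypothetical name, not part of the notebook.

```python
def clean_record(rec):
    """Apply the 3.2 defaults and 3.3 standardization to one record."""
    cleaned = dict(rec)  # copy so the input record is left untouched

    # 3.2: fill missing values with the same defaults as fillna()
    if cleaned.get("age") is None:
        cleaned["age"] = 0
    if cleaned.get("country") is None:
        cleaned["country"] = "UNKNOWN"

    # 3.3: trim whitespace, upper-case the country
    cleaned["country"] = cleaned["country"].strip().upper()
    if cleaned.get("email") is not None:
        cleaned["email"] = cleaned["email"].strip()
    return cleaned


raw = {"email": " asha.patel@gmail.com ", "country": " uk ", "age": None}
cleaned = clean_record(raw)
print(cleaned)
```

Using 0 as the default age keeps the column numeric, but it also means a missing age is indistinguishable from a real value of 0 in later filters.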


###3.4 Add audit columns

In [0]:
from pyspark.sql.functions import current_timestamp

df_silver = df_clean.withColumn("processed_at", current_timestamp())


##STEP 4: Write to PROCESSED Layer (Silver – Delta)

In [0]:
dbutils.fs.mkdirs("/Volumes/workspace/default/sales_project/processed/customers_cleaned")

In [0]:
silver_path = "/Volumes/workspace/default/sales_project/processed/customers_cleaned"

df_silver.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save(silver_path)


Why Delta?
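- ACID transactions, so readers never see a partially written table
- Schema enforcement on write (relaxed here by `overwriteSchema`, which lets an overwrite replace the table schema)
- Default table format on Databricks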

##STEP 5: Create CURATED (Gold) Business Table

###5.1 Read the Silver table

In [0]:
df_silver = spark.read.format("delta").load(silver_path)


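###5.2 Apply business logic
Example:
- Only adults (`age >= 18`). Because missing ages were filled with 0 in step 3.2, customers with an unknown age are excluded as well.
- Only valid countries (listed as an example; the filter below does not apply it)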

In [0]:
df_gold = df_silver.filter(col("age") >= 18)
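
The two example rules can be sketched together in plain Python, using the five rows that remain after deduplication and cleansing of the sample data. The allow-list `VALID_COUNTRIES` is purely hypothetical, since the notebook does not define which countries count as valid, and `passes_business_rules` is likewise an illustrative name.

```python
MIN_AGE = 18
# Hypothetical allow-list: the notebook does not say which countries are valid.
VALID_COUNTRIES = {"INDIA", "USA", "SPAIN", "UK"}


def passes_business_rules(rec):
    """True when the record is an adult from an allowed country."""
    return rec["age"] >= MIN_AGE and rec["country"] in VALID_COUNTRIES


# Silver rows for the sample data: one per email, missing age filled with 0.
silver_rows = [
    {"email": "sunil.asha@gmail.com", "country": "INDIA", "age": 32},
    {"email": "ravi.kumar@yahoo.com", "country": "INDIA", "age": 45},
    {"email": "john.smith@gmail.com", "country": "USA", "age": 29},
    {"email": "maria.garcia@gmail.com", "country": "SPAIN", "age": 35},
    {"email": "asha.patel@gmail.com", "country": "UK", "age": 0},
]

gold_rows = [rec for rec in silver_rows if passes_business_rules(rec)]
print(len(gold_rows))
```

In Spark, the country rule could be added to the existing filter with `col("country").isin(...)`, given an agreed list of countries.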


###5.3 Write curated table

In [0]:
gold_path = "/Volumes/workspace/default/sales_project/curated/customers"

(
    df_gold.write
           .format("delta")
           .mode("overwrite")
           .save(gold_path)
)


##STEP 6: Validation & Checks

In [0]:
spark.read.format("delta").load(gold_path).show()
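
Beyond eyeballing the output, a simple reconciliation of row counts across layers can catch obvious problems. The sketch below is one possible shape for such a check, assuming the counts have already been collected (for example with `count()` on each DataFrame). `check_layer_counts` is a hypothetical helper, and the numbers 9, 5, and 4 are what the sample rows should yield when only the age rule is applied.

```python
def check_layer_counts(raw, silver, gold):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if silver > raw:
        problems.append(f"silver ({silver}) has more rows than raw ({raw})")
    if gold > silver:
        problems.append(f"gold ({gold}) has more rows than silver ({silver})")
    if gold == 0:
        problems.append("gold is empty")
    return problems


# Sample data: 9 raw rows, 5 distinct emails, 4 customers aged 18 or over.
problems = check_layer_counts(raw=9, silver=5, gold=4)
print(problems or "all checks passed")
```

Each layer only filters or deduplicates the previous one, so a row count that grows between layers points to a bug.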


**End-to-End Flow Summary**<br>
RAW CSV<br>
   ↓<br>
Validation<br>
   ↓<br>
Deduplication<br>
   ↓<br>
Cleansing<br>
   ↓<br>
Processed (Silver Delta)<br>
   ↓<br>
Business Rules<br>
   ↓<br>
Curated (Gold Delta)<br>
