
Background reading on Apache Spark architecture: https://www.chaosgenius.io/blog/apache-spark-architecture/


In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import upper

spark = SparkSession.builder.appName("DataTransformation").getOrCreate()

# Sample data: courses with fees and discounts
data = [("Java", 4000, 5), ("Python", 4600, 10), ("Scala", 4100, 15)]
columns = ["CourseName", "fee", "discount"]
df = spark.createDataFrame(data, columns)
df.show(truncate=False)


+----------+----+--------+
|CourseName|fee |discount|
+----------+----+--------+
|Java      |4000|5       |
|Python    |4600|10      |
|Scala     |4100|15      |
+----------+----+--------+



In [2]:
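def to_upper(df):
    # Upper-case the course name in place
    return df.withColumn("CourseName", upper(df.CourseName))

def reduce_price(df, amount):
    # Subtract a flat amount from the fee and store the result as new_fee
    return df.withColumn("new_fee", df.fee - amount)

def apply_discount(df):
    # discount is a percentage (5 means 5% off new_fee); the division makes the result a double
    return df.withColumn("discounted_fee", df.new_fee * (1 - df.discount / 100))

# Chain the steps; passing extra arguments to transform() (the 1000 here) needs PySpark 3.3 or later
result = df.transform(to_upper).transform(reduce_price, 1000).transform(apply_discount)
result.select("CourseName", "discounted_fee").show()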


+----------+--------------+
|CourseName|discounted_fee|
+----------+--------------+
|      JAVA|        2850.0|
|    PYTHON|        3240.0|
|     SCALA|        2635.0|
+----------+--------------+
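
As a cross-check on the output above, the same three steps can be reproduced in plain Python without Spark. This is only a sketch for verifying the arithmetic; the helper name `discounted_fee` is hypothetical and not part of the notebook.

```python
# Plain-Python cross-check of the transform chain (no Spark needed).
# discounted_fee is a hypothetical helper, not part of the notebook.
data = [("Java", 4000, 5), ("Python", 4600, 10), ("Scala", 4100, 15)]

def discounted_fee(fee, discount_pct, amount=1000):
    new_fee = fee - amount                      # mirrors reduce_price
    return new_fee * (1 - discount_pct / 100)   # mirrors apply_discount

# Upper-case the names (mirrors to_upper) and compute the final fee per course
expected = {name.upper(): discounted_fee(fee, pct) for name, fee, pct in data}
for name, value in expected.items():
    print(name, round(value, 2))
```

The values should agree with the `discounted_fee` column shown by Spark, since both compute `(fee - 1000) * (1 - discount / 100)` in double precision.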



**Sales Aggregation Example**

Spark is commonly used for joins, aggregations, and window functions over large datasets. This second PySpark example builds on the fee transformation demo above with a typical retail ETL (Extract, Transform, Load) task: it takes transactional sales records, filters out invalid rows (zero quantity), computes daily revenue per product category, and ranks the categories within each day using a window function.



In [9]:
from pyspark.sql import SparkSession
# Column functions used below; sum is aliased to spark_sum so it does not shadow Python's built-in sum
from pyspark.sql.functions import col, to_date, sum as spark_sum, rank, desc
# Window defines the partitioning and ordering for window functions such as rank()
from pyspark.sql.window import Window

# Create or retrieve the SparkSession, the entry point to Spark functionality
spark = SparkSession.builder.appName("SalesAggregation").getOrCreate()

# Sample sales data; in practice these rows could be loaded from a CSV such as a Kaggle e-commerce dataset
# sales_data is a list of tuples holding raw sales records; columns (below) names the fields,
# and spark.createDataFrame(sales_data, columns) turns both into a Spark DataFrame
sales_data = [
    ("2025-01-01", "Electronics", 100, 2),
    ("2025-01-01", "Clothing", 50, 5),
    ("2025-01-02", "Electronics", 100, 1),
    ("2025-01-02", "Clothing", 50, 3),
    ("2025-01-01", "Books", 20, 10),  # Low price, high volume
    ("2025-01-03", "Books", 20, 0)    # Invalid (zero qty)
]

columns = ["sale_date", "category", "price", "quantity"]
df = spark.createDataFrame(sales_data, columns)

# ETL pipeline: Clean → Transform → Aggregate
# Clean: drop rows with a non-positive quantity, then convert sale_date from a string to a date type
df_clean = df.filter(col("quantity") > 0).withColumn("sale_date", to_date(col("sale_date")))
# Aggregate: group by sale_date and category, and sum price * quantity into a revenue column
df_agg = df_clean.groupBy("sale_date", "category").agg(spark_sum(col("price") * col("quantity")).alias("revenue"))

# Window function: rank each category's revenue within its day
# Partition by sale_date so ranks are computed independently per day, ordered by revenue descending
window_spec = Window.partitionBy("sale_date").orderBy(desc("revenue"))
# rank() gives tied revenues the same rank and skips the rank(s) that follow the tie
df_ranked = df_agg.withColumn("rank", rank().over(window_spec))

# Show the result ordered by sale_date, then by revenue descending
df_ranked.orderBy("sale_date", desc("revenue")).show(truncate=False)

+----------+-----------+-------+----+
|sale_date |category   |revenue|rank|
+----------+-----------+-------+----+
|2025-01-01|Clothing   |250    |1   |
|2025-01-01|Electronics|200    |2   |
|2025-01-01|Books      |200    |2   |
|2025-01-02|Clothing   |150    |1   |
|2025-01-02|Electronics|100    |2   |
+----------+-----------+-------+----+
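
The tie on 2025-01-01 (Electronics and Books both at 200) shows how `rank()` treats equal values: both rows get rank 2, and a further category with lower revenue on that day would get rank 4, not 3. The plain-Python sketch below reproduces the filter, the aggregation, and the ranking without Spark to make that behavior explicit. `rank_with_gaps` is a hypothetical helper written for illustration; its `dense` flag switches to the gap-free numbering that Spark's `dense_rank()` uses.

```python
from collections import defaultdict

sales_data = [
    ("2025-01-01", "Electronics", 100, 2),
    ("2025-01-01", "Clothing", 50, 5),
    ("2025-01-02", "Electronics", 100, 1),
    ("2025-01-02", "Clothing", 50, 3),
    ("2025-01-01", "Books", 20, 10),
    ("2025-01-03", "Books", 20, 0),  # dropped by the quantity > 0 filter
]

# Clean + aggregate: total revenue per (sale_date, category)
revenue = defaultdict(int)
for sale_date, category, price, quantity in sales_data:
    if quantity > 0:
        revenue[(sale_date, category)] += price * quantity

def rank_with_gaps(values, dense=False):
    """Map each value to its rank in descending order; ties share a rank.

    dense=False mimics rank() (gaps after ties); dense=True mimics dense_rank().
    Hypothetical helper for illustration only.
    """
    values = list(values)
    ordered = sorted(values, reverse=True)
    if dense:
        distinct = sorted(set(values), reverse=True)
        return {v: distinct.index(v) + 1 for v in ordered}
    # index() returns the first position of a value, so ties share the lowest rank
    return {v: ordered.index(v) + 1 for v in ordered}

day = "2025-01-01"
day_revenue = {cat: rev for (d, cat), rev in revenue.items() if d == day}
ranks = rank_with_gaps(day_revenue.values())
for cat, rev in sorted(day_revenue.items(), key=lambda kv: -kv[1]):
    print(day, cat, rev, ranks[rev])
```

If consecutive rank numbers are preferred when ties occur, swapping `rank()` for `dense_rank()` (also in `pyspark.sql.functions`) in the window expression is the usual choice.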

