# 1️⃣ Create Sample CSV Files
We will create sample CSV files for products and orders so we can work with them in PySpark.  

- **products.csv** contains: product_id, name, price  
- **orders.csv** contains: order_id, product_id, quantity, customer


In [1]:
# Sample products.csv
products_csv = """
product_id,name,price
1,Pen,10
2,Notebook,50
3,Eraser,5
4,Pencil,8
"""

# strip() removes the blank first line left by the triple-quoted string
with open("products.csv", "w") as f:
    f.write(products_csv.strip() + "\n")

# Sample orders.csv
orders_csv = """
order_id,product_id,quantity,customer
101,1,10,Alice
102,2,5,Bob
103,1,7,Charlie
104,3,20,Alice
105,4,15,Bob
106,2,10,Charlie
"""

# strip() removes the blank first line left by the triple-quoted string
with open("orders.csv", "w") as f:
    f.write(orders_csv.strip() + "\n")

print("CSV files created!")


CSV files created!
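
As a quick sanity check before moving on to Spark, a file like this can be read back with Python's standard `csv` module. The sketch below is an illustration, not part of the original notebook: it writes a copy of the products data to a temporary directory (the `products_check.csv` name and location are assumptions, chosen so the real `products.csv` is left untouched). It also shows that a triple-quoted string opening with a line break starts with an empty line, which `strip()` removes.

```python
import csv
import os
import tempfile

products_csv = """
product_id,name,price
1,Pen,10
2,Notebook,50
3,Eraser,5
4,Pencil,8
"""

# The literal opens with a line break, so the raw text begins with an empty line.
starts_with_blank_line = products_csv.startswith("\n")

# Hypothetical scratch location so the notebook's own files are not overwritten.
path = os.path.join(tempfile.mkdtemp(), "products_check.csv")

# strip() drops the leading and trailing blank lines before writing.
with open(path, "w", newline="") as f:
    f.write(products_csv.strip() + "\n")

# Read the file back; DictReader takes the column names from the first line.
with open(path, newline="") as f:
    rows = list(csv.DictReader(f))

header = list(rows[0].keys())
print(starts_with_blank_line)
print(header)
print(len(rows))
```

Every value comes back as a string at this stage, which is also what Spark does when it reads a CSV file without a schema.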


# 2️⃣ PySpark Setup
Install PySpark, then create a SparkSession, the entry point for reading data and running DataFrame and SQL operations.


In [2]:
!pip install pyspark

from pyspark.sql import SparkSession
# sum is left out of this import because it would shadow Python's built-in sum()
from pyspark.sql.functions import col, avg

# Create SparkSession
spark = SparkSession.builder.appName("CSV_Example").getOrCreate()




# 3️⃣ Load CSV Files
Load the CSV files into PySpark DataFrames. Without a schema, Spark reads every CSV column as a string, so `price` and `quantity` are cast to numeric types.  
Then display the data to verify that it loaded correctly.


In [3]:
products_df = spark.read.option("header", True).csv("products.csv")
orders_df = spark.read.option("header", True).csv("orders.csv")

# Convert numeric columns to appropriate types
products_df = products_df.withColumn("price", col("price").cast("float"))
orders_df = orders_df.withColumn("quantity", col("quantity").cast("int"))

# Display data
products_df.show()
orders_df.show()


+----------+--------+-----+
|product_id|    name|price|
+----------+--------+-----+
|         1|     Pen| 10.0|
|         2|Notebook| 50.0|
|         3|  Eraser|  5.0|
|         4|  Pencil|  8.0|
+----------+--------+-----+

+--------+----------+--------+--------+
|order_id|product_id|quantity|customer|
+--------+----------+--------+--------+
|     101|         1|      10|   Alice|
|     102|         2|       5|     Bob|
|     103|         1|       7| Charlie|
|     104|         3|      20|   Alice|
|     105|         4|      15|     Bob|
|     106|         2|      10| Charlie|
+--------+----------+--------+--------+



# 4️⃣ Basic Operations
Perform basic DataFrame operations:  
- Select specific columns  
- Filter rows based on conditions


In [4]:
# Select specific columns
products_df.select("name", "price").show()

# Filter orders with quantity > 10
orders_df.filter(col("quantity") > 10).show()


+--------+-----+
|    name|price|
+--------+-----+
|     Pen| 10.0|
|Notebook| 50.0|
|  Eraser|  5.0|
|  Pencil|  8.0|
+--------+-----+

+--------+----------+--------+--------+
|order_id|product_id|quantity|customer|
+--------+----------+--------+--------+
|     104|         3|      20|   Alice|
|     105|         4|      15|     Bob|
+--------+----------+--------+--------+



# 5️⃣ Aggregations
Perform aggregation operations on the DataFrames:  
- Total quantity ordered  
- Average product price


In [5]:
# Total quantity ordered
orders_df.groupBy().sum("quantity").show()

# Average product price
products_df.agg(avg("price").alias("average_price")).show()


+-------------+
|sum(quantity)|
+-------------+
|           67|
+-------------+

+-------------+
|average_price|
+-------------+
|        18.25|
+-------------+



# 6️⃣ SQL Operations
Use Spark SQL to perform SQL queries on DataFrames registered as temporary views:  
- Calculate total revenue per product  
- Order results by revenue descending


In [6]:
# Register DataFrames as temp views
products_df.createOrReplaceTempView("products")
orders_df.createOrReplaceTempView("orders")

# Total revenue per product
spark.sql("""
    SELECT o.product_id, p.name, SUM(o.quantity * p.price) AS total_revenue
    FROM orders o
    JOIN products p ON o.product_id = p.product_id
    GROUP BY o.product_id, p.name
    ORDER BY total_revenue DESC
""").show()


+----------+--------+-------------+
|product_id|    name|total_revenue|
+----------+--------+-------------+
|         2|Notebook|        750.0|
|         1|     Pen|        170.0|
|         4|  Pencil|        120.0|
|         3|  Eraser|        100.0|
+----------+--------+-------------+



# 7️⃣ Join Example
Join the orders and products DataFrames:  
- Inner join on `product_id`  
- Each order row gains the matching product's `name` and `price`


In [7]:
# Inner join orders and products
joined_df = orders_df.join(products_df, on="product_id", how="inner")
joined_df.show()


+----------+--------+--------+--------+--------+-----+
|product_id|order_id|quantity|customer|    name|price|
+----------+--------+--------+--------+--------+-----+
|         1|     101|      10|   Alice|     Pen| 10.0|
|         2|     102|       5|     Bob|Notebook| 50.0|
|         1|     103|       7| Charlie|     Pen| 10.0|
|         3|     104|      20|   Alice|  Eraser|  5.0|
|         4|     105|      15|     Bob|  Pencil|  8.0|
|         2|     106|      10| Charlie|Notebook| 50.0|
+----------+--------+--------+--------+--------+-----+



# 8️⃣ Sorting & Grouping
Group and sort data:  
- Group orders by customer and sum the quantities  
- Sort the results in descending order of total quantity


In [8]:
# Group by customer and sum quantities, then sort descending
customer_orders = (
    orders_df.groupBy("customer")
    .sum("quantity")
    .orderBy("sum(quantity)", ascending=False)
)
customer_orders.show()


+--------+-------------+
|customer|sum(quantity)|
+--------+-------------+
|   Alice|           30|
|     Bob|           20|
| Charlie|           17|
+--------+-------------+

