### Install and Initialize PySpark
We will now set up **PySpark** to work with product and order data.

1. **Install PySpark**  
   - Run `!pip install pyspark` in a notebook cell in case PySpark is not already available in your Google Colab runtime.  

2. **Create a SparkSession**  
   - `SparkSession` is the entry point for using PySpark.  
   - We set the application name as `"Product-Order-Example"`.  
   - Once created, the `spark` object will let us work with DataFrames and SQL queries.


In [1]:
!pip install pyspark

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Product-Order-Example").getOrCreate()



### Create Product and Order DataFrames
We will now create two sample datasets in PySpark:

1. **Product Data**  
   - Contains details like `product_id`, `name`, `category`, and `price`.  
   - Example: `(101, "Laptop", "Electronics", 55000)`.  

2. **Order Data**  
   - Contains details like `order_id`, `product_id`, `quantity`, and `customer`.  
   - Example: `(201, 101, 2, "Rahul Sharma")`.  
   - Note that one record (order 207) has a `product_id` of 106, which does not exist in the product catalog. This simulates invalid data.  

3. Convert both lists into PySpark **DataFrames** using `spark.createDataFrame()`.  

4. Use `.show()` to display the DataFrames.


In [2]:
# Product data
product_data = [
    (101, "Laptop", "Electronics", 55000),
    (102, "Mobile Phone", "Electronics", 25000),
    (103, "Chair", "Furniture", 5000),
    (104, "Book", "Stationery", 300),
    (105, "Headphones", "Electronics", 3000)
]

product_cols = ["product_id", "name", "category", "price"]
product_df = spark.createDataFrame(product_data, product_cols)

# Order data
order_data = [
    (201, 101, 2, "Rahul Sharma"),
    (202, 102, 1, "Priya Singh"),
    (203, 103, 4, "Aman Kumar"),
    (204, 104, 10, "Sneha Reddy"),
    (205, 101, 1, "Arjun Mehta"),
    (206, 105, 3, "Rahul Sharma"),
    (207, 106, 1, "Ghost Customer")  # Order with product not in catalog
]

order_cols = ["order_id", "product_id", "quantity", "customer"]
order_df = spark.createDataFrame(order_data, order_cols)

# Show both
product_df.show()
order_df.show()


+----------+------------+-----------+-----+
|product_id|        name|   category|price|
+----------+------------+-----------+-----+
|       101|      Laptop|Electronics|55000|
|       102|Mobile Phone|Electronics|25000|
|       103|       Chair|  Furniture| 5000|
|       104|        Book| Stationery|  300|
|       105|  Headphones|Electronics| 3000|
+----------+------------+-----------+-----+

+--------+----------+--------+--------------+
|order_id|product_id|quantity|      customer|
+--------+----------+--------+--------------+
|     201|       101|       2|  Rahul Sharma|
|     202|       102|       1|   Priya Singh|
|     203|       103|       4|    Aman Kumar|
|     204|       104|      10|   Sneha Reddy|
|     205|       101|       1|   Arjun Mehta|
|     206|       105|       3|  Rahul Sharma|
|     207|       106|       1|Ghost Customer|
+--------+----------+--------+--------------+
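
The invalid record mentioned above can be located with a simple set difference. The following is a plain-Python sketch (not Spark code) over the same sample rows; the variable names `catalog_ids`, `missing_ids`, and `orphan_orders` are illustrative and do not appear in the notebook.

```python
# Plain-Python check: which ordered product_ids are missing from the catalog?
catalog_ids = {101, 102, 103, 104, 105}

order_rows = [
    (201, 101, 2, "Rahul Sharma"),
    (202, 102, 1, "Priya Singh"),
    (203, 103, 4, "Aman Kumar"),
    (204, 104, 10, "Sneha Reddy"),
    (205, 101, 1, "Arjun Mehta"),
    (206, 105, 3, "Rahul Sharma"),
    (207, 106, 1, "Ghost Customer"),
]

# Collect every product_id referenced by an order (second field of each row).
ordered_ids = {product_id for _, product_id, _, _ in order_rows}

# IDs that appear in orders but not in the catalog.
missing_ids = sorted(ordered_ids - catalog_ids)

# The full order rows that point at a missing product.
orphan_orders = [row for row in order_rows if row[1] not in catalog_ids]

print(missing_ids)    # [106]
print(orphan_orders)  # [(207, 106, 1, 'Ghost Customer')]
```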



### Basic Operations on Product DataFrame
We can perform common DataFrame operations in PySpark:

1. **Select specific columns**  
   - `product_df.select("name", "price")` → shows only product name and price.  

2. **Filter rows**  
   - `product_df.filter(product_df["price"] > 10000)` → returns products with price greater than 10,000.  

3. **Sort / Order by column**  
   - `product_df.orderBy(product_df["price"].desc())` → sorts products in descending order of price.


In [3]:
# Select specific columns
product_df.select("name", "price").show()

# Filter products with price > 10,000
product_df.filter(product_df["price"] > 10000).show()

# Order products by price descending
product_df.orderBy(product_df["price"].desc()).show()

+------------+-----+
|        name|price|
+------------+-----+
|      Laptop|55000|
|Mobile Phone|25000|
|       Chair| 5000|
|        Book|  300|
|  Headphones| 3000|
+------------+-----+

+----------+------------+-----------+-----+
|product_id|        name|   category|price|
+----------+------------+-----------+-----+
|       101|      Laptop|Electronics|55000|
|       102|Mobile Phone|Electronics|25000|
+----------+------------+-----------+-----+

+----------+------------+-----------+-----+
|product_id|        name|   category|price|
+----------+------------+-----------+-----+
|       101|      Laptop|Electronics|55000|
|       102|Mobile Phone|Electronics|25000|
|       103|       Chair|  Furniture| 5000|
|       105|  Headphones|Electronics| 3000|
|       104|        Book| Stationery|  300|
+----------+------------+-----------+-----+
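
For readers new to DataFrames, it may help to see what the three operations above correspond to on an ordinary Python list. This is only an analogy in plain Python, assuming the same five product rows; Spark evaluates these operations lazily and in a distributed fashion, which this sketch does not model.

```python
products = [
    (101, "Laptop", "Electronics", 55000),
    (102, "Mobile Phone", "Electronics", 25000),
    (103, "Chair", "Furniture", 5000),
    (104, "Book", "Stationery", 300),
    (105, "Headphones", "Electronics", 3000),
]

# select("name", "price"): keep only two fields of every row.
names_and_prices = [(name, price) for _, name, _, price in products]

# filter(price > 10000): keep only rows that satisfy the condition.
expensive = [row for row in products if row[3] > 10000]

# orderBy(price.desc()): sort rows by price, highest first.
by_price_desc = sorted(products, key=lambda row: row[3], reverse=True)

print(names_and_prices)
print([row[1] for row in expensive])      # ['Laptop', 'Mobile Phone']
print([row[1] for row in by_price_desc])  # ['Laptop', 'Mobile Phone', 'Chair', 'Headphones', 'Book']
```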



### GroupBy and Aggregations in PySpark
We can use **`groupBy()`** along with aggregation functions to analyze the data:

1. **Total quantity ordered per product**  
   - `order_df.groupBy("product_id").sum("quantity")`  
   - Groups orders by product ID and calculates the total quantity ordered.  

2. **Count of orders per customer**  
   - `order_df.groupBy("customer").count()`  
   - Shows how many orders each customer has placed.  

3. **Average price per category**  
   - `product_df.groupBy("category").avg("price")`  
   - Groups products by category and computes the average price within each category.


In [4]:
# Total quantity ordered per product
order_df.groupBy("product_id").sum("quantity").show()

# Count of orders per customer
order_df.groupBy("customer").count().show()

# Average price per category
product_df.groupBy("category").avg("price").show()

+----------+-------------+
|product_id|sum(quantity)|
+----------+-------------+
|       103|            4|
|       101|            3|
|       102|            1|
|       104|           10|
|       106|            1|
|       105|            3|
+----------+-------------+

+--------------+-----+
|      customer|count|
+--------------+-----+
|    Aman Kumar|    1|
|  Rahul Sharma|    2|
|   Priya Singh|    1|
|   Arjun Mehta|    1|
|Ghost Customer|    1|
|   Sneha Reddy|    1|
+--------------+-----+

+-----------+------------------+
|   category|        avg(price)|
+-----------+------------------+
|Electronics|27666.666666666668|
| Stationery|             300.0|
|  Furniture|            5000.0|
+-----------+------------------+
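
The numbers in the three tables above can be cross-checked by hand. The sketch below recomputes them in plain Python (not Spark) using only the standard library; the dictionary names are illustrative. Note that Spark does not guarantee the row order of a `groupBy` result, so only the values are compared, not the order.

```python
from collections import Counter, defaultdict

order_rows = [
    (201, 101, 2, "Rahul Sharma"),
    (202, 102, 1, "Priya Singh"),
    (203, 103, 4, "Aman Kumar"),
    (204, 104, 10, "Sneha Reddy"),
    (205, 101, 1, "Arjun Mehta"),
    (206, 105, 3, "Rahul Sharma"),
    (207, 106, 1, "Ghost Customer"),
]
product_rows = [
    (101, "Laptop", "Electronics", 55000),
    (102, "Mobile Phone", "Electronics", 25000),
    (103, "Chair", "Furniture", 5000),
    (104, "Book", "Stationery", 300),
    (105, "Headphones", "Electronics", 3000),
]

# groupBy("product_id").sum("quantity")
qty_per_product = defaultdict(int)
for _, product_id, quantity, _ in order_rows:
    qty_per_product[product_id] += quantity

# groupBy("customer").count()
orders_per_customer = Counter(customer for _, _, _, customer in order_rows)

# groupBy("category").avg("price")
prices_by_category = defaultdict(list)
for _, _, category, price in product_rows:
    prices_by_category[category].append(price)
avg_price = {cat: sum(p) / len(p) for cat, p in prices_by_category.items()}

print(dict(qty_per_product))
print(dict(orders_per_customer))
print(avg_price)
```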



### Joins in PySpark

We can combine **orders** and **products** using different types of joins:

1. **Inner Join**  
   - Returns only the matching rows (orders that have valid products).  

2. **Left Join**  
   - Returns all rows from the left DataFrame (`order_df`), with matching rows from the right (`product_df`).  
   - If a product is not found, its details will be `null`.  

3. **Right Join**  
   - Returns all rows from the right DataFrame (`product_df`), even if there are no matching orders.  
   - Useful to see which products were never ordered. In this sample data every product has at least one order, so no such rows appear.


In [6]:
# Inner Join: Orders with product details
order_df.join(product_df, order_df.product_id == product_df.product_id, "inner").show()

# Left Join: All orders, even if product not found
order_df.join(product_df, order_df.product_id == product_df.product_id, "left").show()

# Right Join: All products, even if never ordered
order_df.join(product_df, order_df.product_id == product_df.product_id, "right").show()

+--------+----------+--------+------------+----------+------------+-----------+-----+
|order_id|product_id|quantity|    customer|product_id|        name|   category|price|
+--------+----------+--------+------------+----------+------------+-----------+-----+
|     201|       101|       2|Rahul Sharma|       101|      Laptop|Electronics|55000|
|     205|       101|       1| Arjun Mehta|       101|      Laptop|Electronics|55000|
|     202|       102|       1| Priya Singh|       102|Mobile Phone|Electronics|25000|
|     203|       103|       4|  Aman Kumar|       103|       Chair|  Furniture| 5000|
|     204|       104|      10| Sneha Reddy|       104|        Book| Stationery|  300|
|     206|       105|       3|Rahul Sharma|       105|  Headphones|Electronics| 3000|
+--------+----------+--------+------------+----------+------------+-----------+-----+

+--------+----------+--------+--------------+----------+------------+-----------+-----+
|order_id|product_id|quantity|      customer|produc
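
The left and right join outputs are cut off above. As a rough guide to which rows each join type keeps, here is a plain-Python sketch (not Spark code) over the same sample data. `None` stands in for Spark's `null`, and the names `inner`, `left`, `right`, and `never_ordered` are illustrative.

```python
products = {
    101: ("Laptop", "Electronics", 55000),
    102: ("Mobile Phone", "Electronics", 25000),
    103: ("Chair", "Furniture", 5000),
    104: ("Book", "Stationery", 300),
    105: ("Headphones", "Electronics", 3000),
}
orders = [
    (201, 101, 2, "Rahul Sharma"),
    (202, 102, 1, "Priya Singh"),
    (203, 103, 4, "Aman Kumar"),
    (204, 104, 10, "Sneha Reddy"),
    (205, 101, 1, "Arjun Mehta"),
    (206, 105, 3, "Rahul Sharma"),
    (207, 106, 1, "Ghost Customer"),
]

# Inner join: keep only orders whose product_id exists in the catalog.
inner = [(order, products[order[1]]) for order in orders if order[1] in products]

# Left join: keep every order; unmatched orders get None for product details.
left = [(order, products.get(order[1])) for order in orders]

# Right join: keep every product; a product with no orders pairs with None.
right = []
for product_id, details in products.items():
    matching = [order for order in orders if order[1] == product_id]
    if matching:
        right.extend((order, details) for order in matching)
    else:
        right.append((None, details))

# Products that no order refers to (empty for this sample data).
never_ordered = [details[0] for order, details in right if order is None]

print(len(inner), len(left), len(right))  # 6 7 6
print(left[-1])       # ((207, 106, 1, 'Ghost Customer'), None)
print(never_ordered)  # []
```

With this data the right join returns the same six matches as the inner join, because every product in the catalog has been ordered at least once.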

### PySpark SQL Queries

- **Register Temporary Views**  
  - `product_df` and `order_df` are registered as temporary views using `createOrReplaceTempView`.  
  - This allows us to run SQL queries directly on these DataFrames.

- **Total Revenue per Product**  
  - Joins the `orders` and `products` views on `product_id`. This is an inner join, so the order for the unknown product 106 is excluded.  
  - Multiplies `quantity` by `price` for each order.  
  - Sums the results to calculate the total revenue for each product.  
  - Groups by `product_id` and product `name` to get revenue individually for each product.

- **Top 2 Customers by Total Quantity Ordered**  
  - Groups orders by `customer`.  
  - Sums the `quantity` for each customer.  
  - Sorts the results in descending order of total quantity.  
  - Limits the output to the top 2 customers.


In [9]:
# Register as temp views
product_df.createOrReplaceTempView("products")
order_df.createOrReplaceTempView("orders")

# Query: Total revenue per product
spark.sql("""
SELECT o.product_id, p.name, SUM(o.quantity * p.price) AS total_revenue
FROM orders o
JOIN products p ON o.product_id = p.product_id
GROUP BY o.product_id, p.name
""").show()

# Query: Top 2 customers by total quantity
spark.sql("""
SELECT customer, SUM(quantity) AS total_qty
FROM orders
GROUP BY customer
ORDER BY total_qty DESC
LIMIT 2
""").show()

+----------+------------+-------------+
|product_id|        name|total_revenue|
+----------+------------+-------------+
|       101|      Laptop|       165000|
|       102|Mobile Phone|        25000|
|       103|       Chair|        20000|
|       104|        Book|         3000|
|       105|  Headphones|         9000|
+----------+------------+-------------+

+------------+---------+
|    customer|total_qty|
+------------+---------+
| Sneha Reddy|       10|
|Rahul Sharma|        5|
+------------+---------+
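
The two query results above can be verified with a few lines of arithmetic. The following plain-Python sketch (not Spark SQL) mirrors the logic of each query on the same sample rows; the names `revenue` and `top2` are illustrative.

```python
from collections import defaultdict

# product_id -> (name, price), taken from the product data above.
catalog = {
    101: ("Laptop", 55000),
    102: ("Mobile Phone", 25000),
    103: ("Chair", 5000),
    104: ("Book", 300),
    105: ("Headphones", 3000),
}
orders = [
    (201, 101, 2, "Rahul Sharma"),
    (202, 102, 1, "Priya Singh"),
    (203, 103, 4, "Aman Kumar"),
    (204, 104, 10, "Sneha Reddy"),
    (205, 101, 1, "Arjun Mehta"),
    (206, 105, 3, "Rahul Sharma"),
    (207, 106, 1, "Ghost Customer"),
]

# Query 1: SUM(quantity * price) grouped by (product_id, name).
revenue = defaultdict(int)
for _, product_id, quantity, _ in orders:
    if product_id in catalog:  # inner join: orders for unknown products are skipped
        name, price = catalog[product_id]
        revenue[(product_id, name)] += quantity * price

# Query 2: SUM(quantity) per customer, sorted descending, top 2.
qty_per_customer = defaultdict(int)
for _, _, quantity, customer in orders:
    qty_per_customer[customer] += quantity
top2 = sorted(qty_per_customer.items(), key=lambda kv: kv[1], reverse=True)[:2]

print(dict(revenue))
print(top2)  # [('Sneha Reddy', 10), ('Rahul Sharma', 5)]
```

For example, the Laptop revenue is (2 + 1) × 55000 = 165000, which matches the first row of the SQL output.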

