# 🔹 Setup

We first create two DataFrames:
- `customers_df` → contains customer details (`customer_id`, `name`, `city`, `age`)  
- `orders_df` → contains order details (`order_id`, `customer_id`, `product`, `amount`)  


In [9]:
from pyspark.sql import SparkSession
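# Note: max, min, and sum imported here shadow Python's built-ins of the same name for the rest of the notebook.
from pyspark.sql.functions import col, avg, max, min, count, sum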

spark = SparkSession.builder.appName("DataFrame-Exercises").getOrCreate()

# Customers Data
customers_data = [
    (1, "Rahul Sharma", "Bangalore", 28),
    (2, "Priya Singh", "Delhi", 32),
    (3, "Aman Kumar", "Hyderabad", 25),
    (4, "Sneha Reddy", "Chennai", 35),
    (5, "Arjun Mehta", "Mumbai", 30),
    (6, "Divya Nair", "Delhi", 29)
]
customers_cols = ["customer_id", "name", "city", "age"]
customers_df = spark.createDataFrame(customers_data, customers_cols)

# Orders Data
orders_data = [
    (101, 1, "Laptop", 55000),
    (102, 2, "Mobile", 25000),
    (103, 1, "Headphones", 3000),
    (104, 3, "Chair", 5000),
    (105, 5, "Book", 700),
    (106, 2, "Tablet", 20000),
    (107, 6, "Shoes", 2500),
    (108, 7, "Camera", 30000)   # customer_id 7 does not exist in customers_data
]
orders_cols = ["order_id", "customer_id", "product", "amount"]
orders_df = spark.createDataFrame(orders_data, orders_cols)


## 🔹 Basic Operations
Simple filtering, selecting, and counting tasks on the DataFrames.


### 1. Select only name and city


In [10]:
customers_df.select("name", "city").show()


+------------+---------+
|        name|     city|
+------------+---------+
|Rahul Sharma|Bangalore|
| Priya Singh|    Delhi|
|  Aman Kumar|Hyderabad|
| Sneha Reddy|  Chennai|
| Arjun Mehta|   Mumbai|
|  Divya Nair|    Delhi|
+------------+---------+



### 2. Filter customers older than 30


In [11]:
customers_df.filter(col("age") > 30).show()


+-----------+-----------+-------+---+
|customer_id|       name|   city|age|
+-----------+-----------+-------+---+
|          2|Priya Singh|  Delhi| 32|
|          4|Sneha Reddy|Chennai| 35|
+-----------+-----------+-------+---+



### 3. Count customers from Delhi


In [12]:
customers_df.filter(col("city") == "Delhi").count()


2

### 4. Distinct cities


In [13]:
customers_df.select("city").distinct().show()


+---------+
|     city|
+---------+
|Bangalore|
|    Delhi|
|Hyderabad|
|  Chennai|
|   Mumbai|
+---------+
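
To sanity-check exercises 1 to 4 without a Spark session, the same operations can be mirrored in plain Python. This is only a sketch of the semantics on the sample rows, not how Spark executes them; the `customers` list and the result variable names are illustrative and not part of the notebook.

```python
# Plain-Python mirror of exercises 1-4 (illustrative; Spark evaluates these lazily and in parallel).
customers = [
    (1, "Rahul Sharma", "Bangalore", 28),
    (2, "Priya Singh", "Delhi", 32),
    (3, "Aman Kumar", "Hyderabad", 25),
    (4, "Sneha Reddy", "Chennai", 35),
    (5, "Arjun Mehta", "Mumbai", 30),
    (6, "Divya Nair", "Delhi", 29),
]

# 1. select("name", "city"): keep two of the four fields
name_city = [(name, city) for _, name, city, _ in customers]

# 2. filter(col("age") > 30): keep rows that satisfy the predicate
older_than_30 = [row for row in customers if row[3] > 30]

# 3. filter(col("city") == "Delhi").count(): count the matching rows
delhi_count = len([row for row in customers if row[2] == "Delhi"])

# 4. select("city").distinct(): dict.fromkeys drops duplicates, keeping first-seen order
distinct_cities = list(dict.fromkeys(city for _, _, city, _ in customers))

print(older_than_30)
print(delhi_count)
print(distinct_cities)
```

Unlike this list-based version, `distinct()` in Spark makes no promise about the order of the returned rows; add an `orderBy` when the order matters.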



## 🔹 Aggregations
Using aggregate functions like `avg()`, `min()`, `max()`, `count()`, and `sum()` to analyze data.


### 5. Average age of customers


In [14]:
customers_df.agg(avg("age")).show()


+------------------+
|          avg(age)|
+------------------+
|29.833333333333332|
+------------------+



### 6. Maximum & minimum order amount


In [15]:
orders_df.agg(max("amount"), min("amount")).show()


+-----------+-----------+
|max(amount)|min(amount)|
+-----------+-----------+
|      55000|        700|
+-----------+-----------+
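
The results of exercises 5 and 6 can be cross-checked with Python's own aggregates. Because the setup cell imports `max`, `min`, and `sum` from `pyspark.sql.functions`, the bare names no longer refer to the built-ins inside this notebook, so the sketch below reaches them through the `builtins` module. The `ages` and `amounts` lists are copied by hand from the sample data.

```python
import builtins  # the notebook's bare max/min/sum are the PySpark column functions

ages = [28, 32, 25, 35, 30, 29]
amounts = [55000, 25000, 3000, 5000, 700, 20000, 2500, 30000]

# Exercise 5: average age
average_age = builtins.sum(ages) / len(ages)

# Exercise 6: largest and smallest order amount
largest = builtins.max(amounts)
smallest = builtins.min(amounts)

print(average_age, largest, smallest)
```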



### 7. Count number of orders per customer


In [16]:
orders_df.groupBy("customer_id").count().show()


+-----------+-----+
|customer_id|count|
+-----------+-----+
|          1|    2|
|          3|    1|
|          2|    2|
|          7|    1|
|          6|    1|
|          5|    1|
+-----------+-----+



### 8. Total spending per customer


In [17]:
orders_df.groupBy("customer_id").agg(sum("amount").alias("total_spent")).show()


+-----------+-----------+
|customer_id|total_spent|
+-----------+-----------+
|          1|      58000|
|          3|       5000|
|          2|      45000|
|          7|      30000|
|          6|       2500|
|          5|        700|
+-----------+-----------+
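
Exercises 7 and 8 are both single-pass group-by aggregations over the same key. A minimal plain-Python sketch of that idea, using an illustrative `orders` list copied from the sample data, is:

```python
from collections import defaultdict

orders = [
    (101, 1, "Laptop", 55000),
    (102, 2, "Mobile", 25000),
    (103, 1, "Headphones", 3000),
    (104, 3, "Chair", 5000),
    (105, 5, "Book", 700),
    (106, 2, "Tablet", 20000),
    (107, 6, "Shoes", 2500),
    (108, 7, "Camera", 30000),
]

order_count = defaultdict(int)   # customer_id -> number of orders
total_spent = defaultdict(int)   # customer_id -> summed amount

# One pass over the rows updates both aggregates for the row's group key
for _, customer_id, _, amount in orders:
    order_count[customer_id] += 1
    total_spent[customer_id] += amount

for customer_id in sorted(total_spent):
    print(customer_id, order_count[customer_id], total_spent[customer_id])
```

In PySpark the two aggregates can likewise share one `agg` call, for example `orders_df.groupBy("customer_id").agg(count("*").alias("order_count"), sum("amount").alias("total_spent"))`, which also puts the imported `count` function to use.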



## 🔹 Joins
Combining customers and orders with `inner` and `left` joins, then filtering on nulls to find customers without orders and orders without a matching customer.


### 9. Inner join customers & orders


In [18]:
customers_df.join(orders_df, "customer_id", "inner").show()


+-----------+------------+---------+---+--------+----------+------+
|customer_id|        name|     city|age|order_id|   product|amount|
+-----------+------------+---------+---+--------+----------+------+
|          1|Rahul Sharma|Bangalore| 28|     101|    Laptop| 55000|
|          1|Rahul Sharma|Bangalore| 28|     103|Headphones|  3000|
|          2| Priya Singh|    Delhi| 32|     102|    Mobile| 25000|
|          2| Priya Singh|    Delhi| 32|     106|    Tablet| 20000|
|          3|  Aman Kumar|Hyderabad| 25|     104|     Chair|  5000|
|          5| Arjun Mehta|   Mumbai| 30|     105|      Book|   700|
|          6|  Divya Nair|    Delhi| 29|     107|     Shoes|  2500|
+-----------+------------+---------+---+--------+----------+------+



### 10. Left join (all customers, even without orders)


In [19]:
customers_df.join(orders_df, "customer_id", "left").show()


+-----------+------------+---------+---+--------+----------+------+
|customer_id|        name|     city|age|order_id|   product|amount|
+-----------+------------+---------+---+--------+----------+------+
|          1|Rahul Sharma|Bangalore| 28|     103|Headphones|  3000|
|          1|Rahul Sharma|Bangalore| 28|     101|    Laptop| 55000|
|          3|  Aman Kumar|Hyderabad| 25|     104|     Chair|  5000|
|          2| Priya Singh|    Delhi| 32|     106|    Tablet| 20000|
|          2| Priya Singh|    Delhi| 32|     102|    Mobile| 25000|
|          6|  Divya Nair|    Delhi| 29|     107|     Shoes|  2500|
|          5| Arjun Mehta|   Mumbai| 30|     105|      Book|   700|
|          4| Sneha Reddy|  Chennai| 35|    NULL|      NULL|  NULL|
+-----------+------------+---------+---+--------+----------+------+
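
The difference between the inner and left joins above comes down to what happens to a customer with no matching order. The sketch below imitates both with a simple hash join in plain Python; the function name `join_customers_orders` and its `how` parameter are hypothetical and only illustrate the semantics, not Spark's implementation.

```python
from collections import defaultdict

customers = [
    (1, "Rahul Sharma", "Bangalore", 28),
    (2, "Priya Singh", "Delhi", 32),
    (3, "Aman Kumar", "Hyderabad", 25),
    (4, "Sneha Reddy", "Chennai", 35),
    (5, "Arjun Mehta", "Mumbai", 30),
    (6, "Divya Nair", "Delhi", 29),
]
orders = [
    (101, 1, "Laptop", 55000),
    (102, 2, "Mobile", 25000),
    (103, 1, "Headphones", 3000),
    (104, 3, "Chair", 5000),
    (105, 5, "Book", 700),
    (106, 2, "Tablet", 20000),
    (107, 6, "Shoes", 2500),
    (108, 7, "Camera", 30000),
]


def join_customers_orders(customers, orders, how="inner"):
    """Hash-join sketch: index orders by customer_id, then probe once per customer."""
    orders_by_customer = defaultdict(list)
    for order_id, customer_id, product, amount in orders:
        orders_by_customer[customer_id].append((order_id, product, amount))

    rows = []
    for customer in customers:
        matches = orders_by_customer.get(customer[0], [])
        for order in matches:
            rows.append(customer + order)
        if not matches and how == "left":
            # No matching order: keep the customer and fill the order columns with None
            rows.append(customer + (None, None, None))
    return rows


inner_rows = join_customers_orders(customers, orders, "inner")
left_rows = join_customers_orders(customers, orders, "left")
print(len(inner_rows), len(left_rows))  # 7 8
```

Order 108 appears in neither result because the join is driven from the customers side. The row order also differs between the two `show()` outputs above because Spark does not guarantee any ordering without an explicit `orderBy`.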



### 11. Customers who never placed an order


In [20]:
customers_df.join(orders_df, "customer_id", "left") \
    .filter(col("order_id").isNull()).show()


+-----------+-----------+-------+---+--------+-------+------+
|customer_id|       name|   city|age|order_id|product|amount|
+-----------+-----------+-------+---+--------+-------+------+
|          4|Sneha Reddy|Chennai| 35|    NULL|   NULL|  NULL|
+-----------+-----------+-------+---+--------+-------+------+



### 12. Orders with non-existent customers


In [21]:
orders_df.join(customers_df, "customer_id", "left") \
    .filter(col("name").isNull()).show()


+-----------+--------+-------+------+----+----+----+
|customer_id|order_id|product|amount|name|city| age|
+-----------+--------+-------+------+----+----+----+
|          7|     108| Camera| 30000|NULL|NULL|NULL|
+-----------+--------+-------+------+----+----+----+
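
Exercises 11 and 12 are both anti joins: keep the rows on one side whose key has no match on the other. With the keys from the sample data written out by hand (the `customer_ids` set and `order_customer` mapping are illustrative), the two results can be checked as set operations:

```python
customer_ids = {1, 2, 3, 4, 5, 6}

# order_id -> customer_id, copied from the sample orders
order_customer = {101: 1, 102: 2, 103: 1, 104: 3, 105: 5, 106: 2, 107: 6, 108: 7}

# Exercise 11: customers whose id never appears on an order
customers_without_orders = sorted(customer_ids - set(order_customer.values()))

# Exercise 12: orders whose customer_id is not a known customer
orphan_orders = sorted(
    order_id for order_id, customer_id in order_customer.items()
    if customer_id not in customer_ids
)

print(customers_without_orders)  # [4]
print(orphan_orders)             # [108]
```

PySpark also accepts `"left_anti"` as a join type, so `customers_df.join(orders_df, "customer_id", "left_anti")` returns the unmatched customers directly, with only the left-hand columns and no null filter.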



## 🔹 Sorting & Grouping
Sorting by values and grouping data with aggregate calculations.


### 13. Customers ordered by age (descending)


In [22]:
customers_df.orderBy(col("age").desc()).show()


+-----------+------------+---------+---+
|customer_id|        name|     city|age|
+-----------+------------+---------+---+
|          4| Sneha Reddy|  Chennai| 35|
|          2| Priya Singh|    Delhi| 32|
|          5| Arjun Mehta|   Mumbai| 30|
|          6|  Divya Nair|    Delhi| 29|
|          1|Rahul Sharma|Bangalore| 28|
|          3|  Aman Kumar|Hyderabad| 25|
+-----------+------------+---------+---+



### 14. Top 3 highest order amounts


In [23]:
orders_df.orderBy(col("amount").desc()).limit(3).show()


+--------+-----------+-------+------+
|order_id|customer_id|product|amount|
+--------+-----------+-------+------+
|     101|          1| Laptop| 55000|
|     108|          7| Camera| 30000|
|     102|          2| Mobile| 25000|
+--------+-----------+-------+------+
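
Sorting and then limiting is a top-N query, which in principle does not need a full sort of the data. As a rough plain-Python analogue (not a description of Spark's execution plan), `heapq.nlargest` selects the three largest orders from an illustrative copy of the sample rows:

```python
import heapq

orders = [
    (101, 1, "Laptop", 55000),
    (102, 2, "Mobile", 25000),
    (103, 1, "Headphones", 3000),
    (104, 3, "Chair", 5000),
    (105, 5, "Book", 700),
    (106, 2, "Tablet", 20000),
    (107, 6, "Shoes", 2500),
    (108, 7, "Camera", 30000),
]

# Keep the 3 rows with the largest amount (index 3), returned in descending order
top_3 = heapq.nlargest(3, orders, key=lambda order: order[3])

for order in top_3:
    print(order)
```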



### 15. Group customers by city, find average age


In [24]:
customers_df.groupBy("city").agg(avg("age")).show()


+---------+--------+
|     city|avg(age)|
+---------+--------+
|Bangalore|    28.0|
|    Delhi|    30.5|
|Hyderabad|    25.0|
|  Chennai|    35.0|
|   Mumbai|    30.0|
+---------+--------+



### 16. Group orders by product, find total sales amount


In [25]:
orders_df.groupBy("product").agg(sum("amount")).show()


+----------+-----------+
|   product|sum(amount)|
+----------+-----------+
|     Chair|       5000|
|    Laptop|      55000|
|    Mobile|      25000|
|Headphones|       3000|
|      Book|        700|
|    Camera|      30000|
|     Shoes|       2500|
|    Tablet|      20000|
+----------+-----------+



## 🔹 SQL Operations
Register DataFrames as temporary SQL views and perform queries using SQL syntax.


### 17. Register DataFrames as Temp Views


In [26]:
customers_df.createOrReplaceTempView("customers")
orders_df.createOrReplaceTempView("orders")


### 18. Total revenue by city


In [27]:
spark.sql("""
SELECT c.city, SUM(o.amount) as total_revenue
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.city
""").show()


+---------+-------------+
|     city|total_revenue|
+---------+-------------+
|Bangalore|        58000|
|   Mumbai|          700|
|    Delhi|        47500|
|Hyderabad|         5000|
+---------+-------------+
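
The same query can be cross-checked outside Spark with Python's built-in `sqlite3` module. This is a sketch under the assumption that an in-memory SQLite database is an acceptable stand-in for the temp views; the table definitions are illustrative, and an `ORDER BY` is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT, city TEXT, age INTEGER)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, product TEXT, amount INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", [
    (1, "Rahul Sharma", "Bangalore", 28),
    (2, "Priya Singh", "Delhi", 32),
    (3, "Aman Kumar", "Hyderabad", 25),
    (4, "Sneha Reddy", "Chennai", 35),
    (5, "Arjun Mehta", "Mumbai", 30),
    (6, "Divya Nair", "Delhi", 29),
])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (101, 1, "Laptop", 55000),
    (102, 2, "Mobile", 25000),
    (103, 1, "Headphones", 3000),
    (104, 3, "Chair", 5000),
    (105, 5, "Book", 700),
    (106, 2, "Tablet", 20000),
    (107, 6, "Shoes", 2500),
    (108, 7, "Camera", 30000),
])

revenue_by_city = conn.execute("""
    SELECT c.city, SUM(o.amount) AS total_revenue
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.city
    ORDER BY total_revenue DESC
""").fetchall()

print(revenue_by_city)
conn.close()
```

Chennai is absent because its only customer has no orders, and order 108 contributes to no city because customer 7 does not exist; both rows are dropped by the inner join before grouping.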



### 19. Top 2 customers by total spend


In [28]:
spark.sql("""
SELECT c.name, SUM(o.amount) as total_spent
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name
ORDER BY total_spent DESC
LIMIT 2
""").show()


+------------+-----------+
|        name|total_spent|
+------------+-----------+
|Rahul Sharma|      58000|
| Priya Singh|      45000|
+------------+-----------+



### 20. Customers who spent more than 20,000


In [29]:
spark.sql("""
SELECT c.name, SUM(o.amount) as total_spent
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name
HAVING total_spent > 20000
""").show()


+------------+-----------+
|        name|total_spent|
+------------+-----------+
|Rahul Sharma|      58000|
| Priya Singh|      45000|
+------------+-----------+
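
`HAVING` filters groups after aggregation, whereas `WHERE` filters individual rows before it. The sketch below contrasts the two using Python's built-in `sqlite3` module as a stand-in for Spark SQL. The tables are reduced to the columns the query needs, which is an illustrative simplification. `HAVING` repeats the aggregate expression instead of using the alias, since not every SQL engine resolves aliases there, and grouping includes `customer_id` so that two customers who share a name would stay separate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, illustrative schemas: only the columns this query touches
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [
    (1, "Rahul Sharma"), (2, "Priya Singh"), (3, "Aman Kumar"),
    (4, "Sneha Reddy"), (5, "Arjun Mehta"), (6, "Divya Nair"),
])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    (1, 55000), (2, 25000), (1, 3000), (3, 5000),
    (5, 700), (2, 20000), (6, 2500), (7, 30000),
])

# HAVING: aggregate first, then keep customers whose total exceeds 20000
big_spenders = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total_spent
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.name
    HAVING SUM(o.amount) > 20000
    ORDER BY total_spent DESC
""").fetchall()

# WHERE: drop individual orders of 20000 or less first, then aggregate what is left
big_orders_only = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total_spent
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.amount > 20000
    GROUP BY c.customer_id, c.name
    ORDER BY total_spent DESC
""").fetchall()

print(big_spenders)     # [('Rahul Sharma', 58000), ('Priya Singh', 45000)]
print(big_orders_only)  # [('Rahul Sharma', 55000), ('Priya Singh', 25000)]
conn.close()
```

The same two customers appear in both results, but the `WHERE` version reports smaller totals because the 3000 and 20000 orders were discarded before the sum was taken.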

