### Right join in PySpark

You are working as a Data Engineer for an e-commerce company. The company has two datasets:
- Orders Dataset: Contains order details such as order_id, customer_id, and order_status.
- Customers Dataset: Contains information about customers like customer_id, customer_name, and customer_city.

Your task is to generate a report that lists every customer, including those who have not placed any orders. Use a right join with orders_df on the left and customers_df on the right, so that every row of customers_df is kept and the order columns are null for customers without orders.

**Orders Dataset (orders_df):**

# orders_df
orders_data = [
    (1, 101, "Delivered"),
    (2, 102, "Pending"),
    (3, 103, "Shipped"),
    (4, 101, "Cancelled")
]

orders_columns = "order_id int, customer_id int, order_status string"

orders_df = spark.createDataFrame(orders_data, orders_columns)
orders_df.show()

+--------+-----------+------------+
|order_id|customer_id|order_status|
+--------+-----------+------------+
|       1|        101|   Delivered|
|       2|        102|     Pending|
|       3|        103|     Shipped|
|       4|        101|   Cancelled|
+--------+-----------+------------+



**Customers Dataset (customers_df):**

# customers_df
customers_data = [
    (101, "Alice", "New York"),
    (102, "Bob", "Los Angeles"),
    (103, "Charlie", "Chicago"),
    (104, "David", "Houston")
]

customers_columns = "customer_id int, customer_name string, customer_city string"

customers_df = spark.createDataFrame(customers_data, customers_columns)
customers_df.show()

+-----------+-------------+-------------+
|customer_id|customer_name|customer_city|
+-----------+-------------+-------------+
|        101|        Alice|     New York|
|        102|          Bob|  Los Angeles|
|        103|      Charlie|      Chicago|
|        104|        David|      Houston|
+-----------+-------------+-------------+



**Expected Output:**

| customer_id | order_id | order_status | customer_name | customer_city |
|-------------|----------|--------------|---------------|---------------|
| 101         | 1        | Delivered    | Alice         | New York      |
| 101         | 4        | Cancelled    | Alice         | New York      |
| 102         | 2        | Pending      | Bob           | Los Angeles   |
| 103         | 3        | Shipped      | Charlie       | Chicago       |
| 104         | null     | null         | David         | Houston       |
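
To see why David appears with null order fields, the right-join rule can be sketched in plain Python, without Spark. This is only an illustration of the semantics, not how Spark executes a join; the helper name `right_join` and the tuple layout are assumptions made for this sketch.

```python
# Same rows as orders_df and customers_df, as plain tuples.
orders = [
    (1, 101, "Delivered"),
    (2, 102, "Pending"),
    (3, 103, "Shipped"),
    (4, 101, "Cancelled"),
]
customers = [
    (101, "Alice", "New York"),
    (102, "Bob", "Los Angeles"),
    (103, "Charlie", "Chicago"),
    (104, "David", "Houston"),
]


def right_join(left_rows, right_rows):
    """Hypothetical helper: keep every right row, padding with None
    when no left row shares its customer_id."""
    result = []
    for customer_id, name, city in right_rows:
        matches = [o for o in left_rows if o[1] == customer_id]
        if not matches:
            # No order for this customer: the order columns become None.
            result.append((customer_id, None, None, name, city))
        for order_id, _, status in matches:
            result.append((customer_id, order_id, status, name, city))
    return result


report = right_join(orders, customers)
for row in report:
    print(row)
```

Every customer produces at least one output row, and a customer with several orders (Alice here) produces one row per order.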

**PySpark code Solution**

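# Right join keeps every row of customers_df; joining on an expression keeps both customer_id columns.
joined_df = orders_df.join(customers_df, orders_df.customer_id == customers_df.customer_id, "right")

joined_df.orderBy(customers_df.customer_id).show()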

+--------+-----------+------------+-----------+-------------+-------------+
|order_id|customer_id|order_status|customer_id|customer_name|customer_city|
+--------+-----------+------------+-----------+-------------+-------------+
|       4|        101|   Cancelled|        101|        Alice|     New York|
|       1|        101|   Delivered|        101|        Alice|     New York|
|       2|        102|     Pending|        102|          Bob|  Los Angeles|
|       3|        103|     Shipped|        103|      Charlie|      Chicago|
|    null|       null|        null|        104|        David|      Houston|
+--------+-----------+------------+-----------+-------------+-------------+

