## Joins In PySpark
In PySpark, joins are used to combine data from two DataFrames based on a common key.

The `DataFrame.join()` function provides various types of joins similar to SQL joins.

**Syntax:**

**`joined_df = df1.join(df2, join_condition, join_type)`**

- `df1:` The left DataFrame.
- `df2:` The right DataFrame.
- `join_condition:` Column(s) to perform the join (can be a single column, list of columns, or condition).
- `join_type:` The type of join, passed as a string such as `'inner'` or `'left'`. The available join types are described below.


**`result_df = df1.join(df2, on="key", how="join_type")`**
- `on:` Column(s) to perform the join (can be a single column, list of columns, or condition).
- `how:` The type of join, for example `inner`, `left`, `right`, `full`, `left_semi`, `left_anti`, or `cross`. Defaults to `inner`.

In [0]:
# Sample customer data
customer_data = [(1,'Rohish','Pune',"30-05-2022"),
(2,'vikash','kolkata',"12-03-2023"),
(3,'nikita','delhi',"25-06-2023"),
(4,'rahul','ranchi',"24-03-2023"),
(5,'mahesh','jaipur',"22-03-2023"),
(6,'prantosh','kolkata',"18-10-2022"),
(7,'raman','patna',"30-12-2022"),
(8,'prakash','ranchi',"24-02-2023"),
(9,'ragini','kolkata',"03-03-2023"),
(10,'raushan','jaipur',"05-02-2023")]

# customer schema
customer_schema=['customer_id','customer_name','address','date_of_joining']

# customer DataFrame
customer_df = spark.createDataFrame(customer_data, customer_schema)
customer_df.show()

+-----------+-------------+-------+---------------+
|customer_id|customer_name|address|date_of_joining|
+-----------+-------------+-------+---------------+
|          1|       Rohish|   Pune|     30-05-2022|
|          2|       vikash|kolkata|     12-03-2023|
|          3|       nikita|  delhi|     25-06-2023|
|          4|        rahul| ranchi|     24-03-2023|
|          5|       mahesh| jaipur|     22-03-2023|
|          6|     prantosh|kolkata|     18-10-2022|
|          7|        raman|  patna|     30-12-2022|
|          8|      prakash| ranchi|     24-02-2023|
|          9|       ragini|kolkata|     03-03-2023|
|         10|      raushan| jaipur|     05-02-2023|
+-----------+-------------+-------+---------------+



In [0]:
# Sample sales data
sales_data = [(1,22,10,"01-06-2022"),
(1,27,5,"03-02-2023"),
(2,5,3,"01-06-2023"),
(5,22,1,"22-03-2023"),
(7,22,4,"03-02-2023"),
(9,5,6,"03-03-2023"),
(2,1,12,"15-06-2023"),
(1,56,2,"25-06-2023"),
(5,12,5,"15-04-2023"),
(11,12,76,"12-03-2023")]

# sales schema
sales_schema=['customer_id','product_id','quantity','date_of_purchase']

# Sales DataFrame
sales_df = spark.createDataFrame(sales_data, sales_schema)
sales_df.show()

+-----------+----------+--------+----------------+
|customer_id|product_id|quantity|date_of_purchase|
+-----------+----------+--------+----------------+
|          1|        22|      10|      01-06-2022|
|          1|        27|       5|      03-02-2023|
|          2|         5|       3|      01-06-2023|
|          5|        22|       1|      22-03-2023|
|          7|        22|       4|      03-02-2023|
|          9|         5|       6|      03-03-2023|
|          2|         1|      12|      15-06-2023|
|          1|        56|       2|      25-06-2023|
|          5|        12|       5|      15-04-2023|
|         11|        12|      76|      12-03-2023|
+-----------+----------+--------+----------------+



In [0]:
# Sample product data
product_data = [(1, 'fanta',20),
(2, 'dew',22),
(5, 'sprite',40),
(7, 'redbull',100),
(12,'mazza',45),
(22,'coke',27),
(25,'limca',21),
(27,'pepsi',14),
(56,'sting',10)]

# product schema
product_schema=['id','name','price']

# product DataFrame
product_df = spark.createDataFrame(product_data, product_schema)
product_df.show()

+---+-------+-----+
| id|   name|price|
+---+-------+-----+
|  1|  fanta|   20|
|  2|    dew|   22|
|  5| sprite|   40|
|  7|redbull|  100|
| 12|  mazza|   45|
| 22|   coke|   27|
| 25|  limca|   21|
| 27|  pepsi|   14|
| 56|  sting|   10|
+---+-------+-----+



### Types of Joins

#### Inner Join:
- Returns only the rows that have matching keys in both DataFrames.
- Rows with non-matching keys are excluded.

In [0]:
customer_df.join(sales_df, on='customer_id', how='inner').show()

+-----------+-------------+-------+---------------+----------+--------+----------------+
|customer_id|customer_name|address|date_of_joining|product_id|quantity|date_of_purchase|
+-----------+-------------+-------+---------------+----------+--------+----------------+
|          1|       Rohish|   Pune|     30-05-2022|        22|      10|      01-06-2022|
|          1|       Rohish|   Pune|     30-05-2022|        27|       5|      03-02-2023|
|          1|       Rohish|   Pune|     30-05-2022|        56|       2|      25-06-2023|
|          2|       vikash|kolkata|     12-03-2023|         5|       3|      01-06-2023|
|          2|       vikash|kolkata|     12-03-2023|         1|      12|      15-06-2023|
|          5|       mahesh| jaipur|     22-03-2023|        22|       1|      22-03-2023|
|          5|       mahesh| jaipur|     22-03-2023|        12|       5|      15-04-2023|
|          7|        raman|  patna|     30-12-2022|        22|       4|      03-02-2023|
|          9|       r

In [0]:
customer_df.join(sales_df, customer_df.customer_id==sales_df.customer_id, how='inner').show()

+-----------+-------------+-------+---------------+-----------+----------+--------+----------------+
|customer_id|customer_name|address|date_of_joining|customer_id|product_id|quantity|date_of_purchase|
+-----------+-------------+-------+---------------+-----------+----------+--------+----------------+
|          1|       Rohish|   Pune|     30-05-2022|          1|        22|      10|      01-06-2022|
|          1|       Rohish|   Pune|     30-05-2022|          1|        27|       5|      03-02-2023|
|          1|       Rohish|   Pune|     30-05-2022|          1|        56|       2|      25-06-2023|
|          2|       vikash|kolkata|     12-03-2023|          2|         5|       3|      01-06-2023|
|          2|       vikash|kolkata|     12-03-2023|          2|         1|      12|      15-06-2023|
|          5|       mahesh| jaipur|     22-03-2023|          5|        22|       1|      22-03-2023|
|          5|       mahesh| jaipur|     22-03-2023|          5|        12|       5|      15

#### Left Join (Left Outer Join):
- Returns all rows from the left DataFrame and only matching rows from the right DataFrame.
- If there’s no match, columns from the right DataFrame will have `null` values.

In [0]:
result_left_df = customer_df.join(sales_df, on='customer_id', how='left')
result_left_df.show()

+-----------+-------------+-------+---------------+----------+--------+----------------+
|customer_id|customer_name|address|date_of_joining|product_id|quantity|date_of_purchase|
+-----------+-------------+-------+---------------+----------+--------+----------------+
|          1|       Rohish|   Pune|     30-05-2022|        56|       2|      25-06-2023|
|          1|       Rohish|   Pune|     30-05-2022|        27|       5|      03-02-2023|
|          1|       Rohish|   Pune|     30-05-2022|        22|      10|      01-06-2022|
|          2|       vikash|kolkata|     12-03-2023|         1|      12|      15-06-2023|
|          2|       vikash|kolkata|     12-03-2023|         5|       3|      01-06-2023|
|          3|       nikita|  delhi|     25-06-2023|      null|    null|            null|
|          5|       mahesh| jaipur|     22-03-2023|        12|       5|      15-04-2023|
|          5|       mahesh| jaipur|     22-03-2023|        22|       1|      22-03-2023|
|          4|        

#### Right Join (Right Outer Join):
- Returns all rows from the right DataFrame and only matching rows from the left DataFrame.
- If there’s no match, columns from the left DataFrame will have `null` values.


In [0]:
result_right_df = customer_df.join(sales_df, on='customer_id', how='right')
result_right_df.show()

+-----------+-------------+-------+---------------+----------+--------+----------------+
|customer_id|customer_name|address|date_of_joining|product_id|quantity|date_of_purchase|
+-----------+-------------+-------+---------------+----------+--------+----------------+
|          1|       Rohish|   Pune|     30-05-2022|        22|      10|      01-06-2022|
|          1|       Rohish|   Pune|     30-05-2022|        27|       5|      03-02-2023|
|          2|       vikash|kolkata|     12-03-2023|         5|       3|      01-06-2023|
|          7|        raman|  patna|     30-12-2022|        22|       4|      03-02-2023|
|          5|       mahesh| jaipur|     22-03-2023|        22|       1|      22-03-2023|
|          9|       ragini|kolkata|     03-03-2023|         5|       6|      03-03-2023|
|          2|       vikash|kolkata|     12-03-2023|         1|      12|      15-06-2023|
|          1|       Rohish|   Pune|     30-05-2022|        56|       2|      25-06-2023|
|          5|       m

#### Full Join (Full Outer Join):
- Returns all rows from both DataFrames.
- If there’s no match in one DataFrame, the columns from that DataFrame will have `null` values.

In [0]:
result_full_df = customer_df.join(sales_df, on='customer_id', how='full')
result_full_df.show()

+-----------+-------------+-------+---------------+----------+--------+----------------+
|customer_id|customer_name|address|date_of_joining|product_id|quantity|date_of_purchase|
+-----------+-------------+-------+---------------+----------+--------+----------------+
|          1|       Rohish|   Pune|     30-05-2022|        22|      10|      01-06-2022|
|          1|       Rohish|   Pune|     30-05-2022|        27|       5|      03-02-2023|
|          1|       Rohish|   Pune|     30-05-2022|        56|       2|      25-06-2023|
|          2|       vikash|kolkata|     12-03-2023|         5|       3|      01-06-2023|
|          2|       vikash|kolkata|     12-03-2023|         1|      12|      15-06-2023|
|          3|       nikita|  delhi|     25-06-2023|      null|    null|            null|
|          4|        rahul| ranchi|     24-03-2023|      null|    null|            null|
|          5|       mahesh| jaipur|     22-03-2023|        22|       1|      22-03-2023|
|          5|       m

#### Left Semi Join:
- Returns only rows from the left DataFrame where there is a match in the right DataFrame.
- Unlike other joins, it does not include columns from the right DataFrame.

In [0]:
result_left_semi_df = customer_df.join(sales_df, on='customer_id', how='left_semi')
result_left_semi_df.show()

# Explanation: Only rows from customer_df with matching customer_id in sales_df are returned.

+-----------+-------------+-------+---------------+
|customer_id|customer_name|address|date_of_joining|
+-----------+-------------+-------+---------------+
|          1|       Rohish|   Pune|     30-05-2022|
|          2|       vikash|kolkata|     12-03-2023|
|          5|       mahesh| jaipur|     22-03-2023|
|          7|        raman|  patna|     30-12-2022|
|          9|       ragini|kolkata|     03-03-2023|
+-----------+-------------+-------+---------------+



#### Left Anti Join:
- Returns only rows from the left DataFrame that do not have a match in the right DataFrame.
- Like left_semi, no columns from the right DataFrame are included.

In [0]:
result_left_anti_df = customer_df.join(sales_df, on='customer_id', how='left_anti')
result_left_anti_df.show()

# Explanation: Only rows from customer_df with no matching customer_id in sales_df are returned.

+-----------+-------------+-------+---------------+
|customer_id|customer_name|address|date_of_joining|
+-----------+-------------+-------+---------------+
|          3|       nikita|  delhi|     25-06-2023|
|          4|        rahul| ranchi|     24-03-2023|
|          6|     prantosh|kolkata|     18-10-2022|
|          8|      prakash| ranchi|     24-02-2023|
|         10|      raushan| jaipur|     05-02-2023|
+-----------+-------------+-------+---------------+



#### Cross Join (Cartesian Join):
- Returns the Cartesian product of the two DataFrames.
- Each row in the left DataFrame is joined with all rows in the right DataFrame.

**🚨 Caution: Cross joins can generate very large datasets.**

In [0]:
result_cross_df = customer_df.crossJoin(sales_df)
result_cross_df.display()  # display() is a Databricks notebook helper; use .show() in other environments

customer_id,customer_name,address,date_of_joining,customer_id.1,product_id,quantity,date_of_purchase
1,Rohish,Pune,30-05-2022,1,22,10,01-06-2022
1,Rohish,Pune,30-05-2022,1,27,5,03-02-2023
1,Rohish,Pune,30-05-2022,2,5,3,01-06-2023
1,Rohish,Pune,30-05-2022,5,22,1,22-03-2023
1,Rohish,Pune,30-05-2022,7,22,4,03-02-2023
1,Rohish,Pune,30-05-2022,9,5,6,03-03-2023
1,Rohish,Pune,30-05-2022,2,1,12,15-06-2023
1,Rohish,Pune,30-05-2022,1,56,2,25-06-2023
1,Rohish,Pune,30-05-2022,5,12,5,15-04-2023
1,Rohish,Pune,30-05-2022,11,12,76,12-03-2023


#### Summary of Join Types
| **Join Type**       | **Behavior**                                                                 |
|----------------------|----------------------------------------------------------------------------|
| **Inner Join**       | Only matching rows in both DataFrames.                                     |
| **Left Join**        | All rows from the left, matching rows from the right, else null.           |
| **Right Join**       | All rows from the right, matching rows from the left, else null.           |
| **Full Outer Join**  | All rows from both DataFrames, with null for non-matches.                  |
| **Left Semi Join**   | Rows from the left DataFrame with matches in the right (no right columns). |
| **Left Anti Join**   | Rows from the left DataFrame without matches in the right.                 |
| **Cross Join**       | Cartesian product of both DataFrames.                                      |
