# Joins in DataFrame

In PySpark, joins are a fundamental operation used to combine rows from two or more DataFrames based on a common column or key. This allows you to integrate data from different sources and perform complex analyses.

### Syntax
```join(self, other, on=None, how=None)```

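The **join()** method takes the parameters below and returns a new DataFrame.

- param **other**: the DataFrame on the right side of the join
- param **on**: the join key: a column name (string), a list of column names, a join expression (Column), or a list of Columns
- param **how**: the join type, default *inner*. Must be one of *inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti, and left_anti*.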

**Let's understand the join types with examples**

In [1]:
# Prepare data
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('Join Example')
         .getOrCreate())

spark

In [5]:
emp = [
  (1,"Smith",-1,"2018","10","M",3000),
  (2,"Rose",1,"2010","20","M",4000),
  (3,"Williams",1,"2010","10","M",1000),
  (4,"Jones",2,"2005","10","F",2000),
  (5,"Brown",2,"2010","40","",-1),
  (6,"Brown",2,"2010","50","",-1)
]

empColumns = ["emp_id","name","manager_id","doj","dept_id","gender","salary"]

dept = [("Finance",10),("Marketing",20),("Sales",30),("IT",40)]

deptColumns = ["dept_name","dept_id"]

In [4]:
emp_df = spark.createDataFrame(emp, schema=empColumns)
emp_df.printSchema()
emp_df.show()

root
 |-- emp_id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- manager_id: long (nullable = true)
 |-- doj: string (nullable = true)
 |-- dept_id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)

+------+--------+----------+----+-------+------+------+
|emp_id|    name|manager_id| doj|dept_id|gender|salary|
+------+--------+----------+----+-------+------+------+
|     1|   Smith|        -1|2018|     10|     M|  3000|
|     2|    Rose|         1|2010|     20|     M|  4000|
|     3|Williams|         1|2010|     10|     M|  1000|
|     4|   Jones|         2|2005|     10|     F|  2000|
|     5|   Brown|         2|2010|     40|      |    -1|
|     6|   Brown|         2|2010|     50|      |    -1|
+------+--------+----------+----+-------+------+------+



In [7]:
dept_df = spark.createDataFrame(data=dept, schema = deptColumns)
dept_df.printSchema()
dept_df.show()

root
 |-- dept_name: string (nullable = true)
 |-- dept_id: long (nullable = true)

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|  Finance|     10|
|Marketing|     20|
|    Sales|     30|
|       IT|     40|
+---------+-------+



### Inner Join:
- Returns rows that have matching values in both DataFrames.
- **Syntax**: ```df1.join(df2, on='common_column', how='inner')```

**Example**

In [12]:
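# Import only what the examples use; a wildcard import would shadow Python built-ins such as sum and max
from pyspark.sql.functions import col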

In [15]:
(emp_df.join(
    dept_df,
    on = "dept_id",
    how = "inner")
 .select('emp_id', 'name', dept_df.dept_id, 'dept_name')
 .show())

+------+--------+-------+---------+
|emp_id|    name|dept_id|dept_name|
+------+--------+-------+---------+
|     1|   Smith|     10|  Finance|
|     3|Williams|     10|  Finance|
|     4|   Jones|     10|  Finance|
|     2|    Rose|     20|Marketing|
|     5|   Brown|     40|       IT|
+------+--------+-------+---------+
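To make the matching rule concrete, here is a minimal plain-Python sketch of what an inner join computes, using the same `emp` and `dept` rows trimmed to the columns shown above. It does not use Spark and is not how Spark executes a join; the helper name `inner_join` is hypothetical. The sketch converts `dept_id` with `int()` because `emp` stores it as a string while `dept` stores it as an integer.

```python
# Plain-Python model of an inner join (illustration only, not Spark's implementation).
emp = [
    (1, "Smith", "10"),
    (2, "Rose", "20"),
    (3, "Williams", "10"),
    (4, "Jones", "10"),
    (5, "Brown", "40"),
    (6, "Brown", "50"),
]
dept = [("Finance", 10), ("Marketing", 20), ("Sales", 30), ("IT", 40)]


def inner_join(emp_rows, dept_rows):
    """Return (emp_id, name, dept_id, dept_name) for rows whose dept_id matches."""
    # Lookup from dept_id to dept_name (dept_id is unique in this data).
    dept_by_id = {dept_id: dept_name for dept_name, dept_id in dept_rows}
    result = []
    for emp_id, name, dept_id in emp_rows:
        key = int(dept_id)  # emp stores dept_id as a string, dept as an int
        if key in dept_by_id:
            result.append((emp_id, name, key, dept_by_id[key]))
    return result


for row in inner_join(emp, dept):
    print(row)
```

Employee 6 (department 50) and the Sales department (30) both drop out, because neither has a partner on the other side. Row order may differ from the Spark output, since Spark does not guarantee an order without an explicit sort.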




### Left Outer Join:
- Returns all rows from the left DataFrame and the matched rows from the right DataFrame. Left rows with no match get NULL in the right-side columns.
- **Syntax**: ```df1.join(df2, on='common_column', how='left')```

**Example**

In [25]:
(emp_df.alias('e').join(
    dept_df.alias('d'),
    on = 'dept_id',
    how = 'left')
 .select("e.emp_id",'e.name','d.dept_id', 'd.dept_name')
 .show()
)

+------+--------+-------+---------+
|emp_id|    name|dept_id|dept_name|
+------+--------+-------+---------+
|     1|   Smith|     10|  Finance|
|     2|    Rose|     20|Marketing|
|     3|Williams|     10|  Finance|
|     4|   Jones|     10|  Finance|
|     5|   Brown|     40|       IT|
|     6|   Brown|   NULL|     NULL|
+------+--------+-------+---------+



In [None]:
# Sample DataFrames
sales_data = [(1, "2024-12-01", 101, 500),
              (2, "2024-12-02", 102, 300),
              (3, "2024-12-03", 103, 700)]

customer_data = [(101, "Alice"),
                 (102, "Bob"),
                 (104, "Charlie")]

sales_columns = ["sales_id", "date", "customer_id", "amount"]
customer_columns = ["customer_id", "customer_name"]

# Create DataFrames
sales_df = spark.createDataFrame(sales_data, sales_columns)
customer_df = spark.createDataFrame(customer_data, customer_columns)

# Perform Left Outer Join
joined_df = (
    sales_df.alias("s")
    .join(
        customer_df.alias("c"),
        sales_df["customer_id"] == customer_df["customer_id"],
        "leftouter"
    )
    # Take customer_id from the sales side so that unmatched sales rows keep their id
    .select("s.sales_id", "s.date", "s.customer_id", "s.amount", "c.customer_name")
)

# Show the Result
joined_df.show()

+--------+----------+-----------+------+-------------+
|sales_id|      date|customer_id|amount|customer_name|
+--------+----------+-----------+------+-------------+
|       1|2024-12-01|        101|   500|        Alice|
|       2|2024-12-02|        102|   300|          Bob|
|       3|2024-12-03|        103|   700|         NULL|
+--------+----------+-----------+------+-------------+
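The NULL in the last row comes from the left-outer rule: every left row is kept, and missing right-side values are filled in. A minimal plain-Python sketch of that rule, using the same sales and customer rows, is below. It is an illustration without Spark; the helper name `left_outer_join` is hypothetical, and Python's `None` stands in for NULL.

```python
# Plain-Python model of a left outer join (illustration only).
sales = [
    (1, "2024-12-01", 101, 500),
    (2, "2024-12-02", 102, 300),
    (3, "2024-12-03", 103, 700),
]
customers = [(101, "Alice"), (102, "Bob"), (104, "Charlie")]


def left_outer_join(sales_rows, customer_rows):
    """Keep every sales row; attach customer_name, or None when there is no match."""
    name_by_id = dict(customer_rows)  # customer_id -> customer_name
    return [
        # dict.get returns None for an unknown customer_id, mirroring NULL
        (sales_id, date, customer_id, amount, name_by_id.get(customer_id))
        for sales_id, date, customer_id, amount in sales_rows
    ]


for row in left_outer_join(sales, customers):
    print(row)
```

Customer 104 (Charlie) never appears, because a left outer join does not add unmatched rows from the right side.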




### Right Outer Join:
- Returns all rows from the right DataFrame and the matched rows from the left DataFrame. Right rows with no match get NULL in the left-side columns.
- **Syntax**: ```df1.join(df2, on='common_column', how='right')```

**Example**

In [28]:
(emp_df.join(
    dept_df,
    emp_df.dept_id == dept_df.dept_id,
    "right")
   .show())

+------+--------+----------+----+-------+------+------+---------+-------+
|emp_id|    name|manager_id| doj|dept_id|gender|salary|dept_name|dept_id|
+------+--------+----------+----+-------+------+------+---------+-------+
|     4|   Jones|         2|2005|     10|     F|  2000|  Finance|     10|
|     3|Williams|         1|2010|     10|     M|  1000|  Finance|     10|
|     1|   Smith|        -1|2018|     10|     M|  3000|  Finance|     10|
|     2|    Rose|         1|2010|     20|     M|  4000|Marketing|     20|
|  NULL|    NULL|      NULL|NULL|   NULL|  NULL|  NULL|    Sales|     30|
|     5|   Brown|         2|2010|     40|      |    -1|       IT|     40|
+------+--------+----------+----+-------+------+------+---------+-------+
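A right outer join mirrors the left outer join: the right side drives the loop, and the left side is filled with NULL when nothing matches. The sketch below models this in plain Python with the `emp` and `dept` rows trimmed to three employee columns. It is an illustration without Spark, and the helper name `right_outer_join` is hypothetical.

```python
# Plain-Python model of a right outer join (illustration only).
emp = [
    (1, "Smith", "10"),
    (2, "Rose", "20"),
    (3, "Williams", "10"),
    (4, "Jones", "10"),
    (5, "Brown", "40"),
    (6, "Brown", "50"),
]
dept = [("Finance", 10), ("Marketing", 20), ("Sales", 30), ("IT", 40)]


def right_outer_join(emp_rows, dept_rows):
    """Keep every dept row; pair it with each matching employee, or None if none match."""
    # Group employees by dept_id, since one department can have many employees.
    emps_by_dept = {}
    for emp_id, name, dept_id in emp_rows:
        emps_by_dept.setdefault(int(dept_id), []).append((emp_id, name))
    result = []
    for dept_name, dept_id in dept_rows:
        # A department without employees still yields one row, padded with None.
        for emp_id, name in emps_by_dept.get(dept_id, [(None, None)]):
            result.append((emp_id, name, dept_name, dept_id))
    return result


for row in right_outer_join(emp, dept):
    print(row)
```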



### Full Outer Join:
- Returns all rows from both DataFrames. Where a row has no match on the other side, that side's columns are NULL.
- **Syntax**: ```df1.join(df2, on='common_column', how='full')```

**Example**

In [29]:
emp_df.join(
    dept_df,
    emp_df.dept_id == dept_df.dept_id,
    how = "outer") \
    .show()

+------+--------+----------+----+-------+------+------+---------+-------+
|emp_id|    name|manager_id| doj|dept_id|gender|salary|dept_name|dept_id|
+------+--------+----------+----+-------+------+------+---------+-------+
|     1|   Smith|        -1|2018|     10|     M|  3000|  Finance|     10|
|     3|Williams|         1|2010|     10|     M|  1000|  Finance|     10|
|     4|   Jones|         2|2005|     10|     F|  2000|  Finance|     10|
|     2|    Rose|         1|2010|     20|     M|  4000|Marketing|     20|
|  NULL|    NULL|      NULL|NULL|   NULL|  NULL|  NULL|    Sales|     30|
|     5|   Brown|         2|2010|     40|      |    -1|       IT|     40|
|     6|   Brown|         2|2010|     50|      |    -1|     NULL|   NULL|
+------+--------+----------+----+-------+------+------+---------+-------+
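The result above is the matched rows plus the leftovers from each side. A minimal plain-Python sketch of that three-part structure follows, using the `emp` and `dept` rows trimmed to three employee columns. It is an illustration without Spark, and the helper name `full_outer_join` is hypothetical.

```python
# Plain-Python model of a full outer join (illustration only).
emp = [
    (1, "Smith", "10"),
    (2, "Rose", "20"),
    (3, "Williams", "10"),
    (4, "Jones", "10"),
    (5, "Brown", "40"),
    (6, "Brown", "50"),
]
dept = [("Finance", 10), ("Marketing", 20), ("Sales", 30), ("IT", 40)]


def full_outer_join(emp_rows, dept_rows):
    """Return matched pairs, then unmatched rows from either side padded with None."""
    dept_by_id = {dept_id: dept_name for dept_name, dept_id in dept_rows}
    matched_dept_ids = set()
    result = []
    for emp_id, name, dept_id in emp_rows:
        key = int(dept_id)
        if key in dept_by_id:
            matched_dept_ids.add(key)
            result.append((emp_id, name, dept_by_id[key], key))
        else:
            # Left row without a partner: right-side columns become None.
            result.append((emp_id, name, None, None))
    for dept_name, dept_id in dept_rows:
        if dept_id not in matched_dept_ids:
            # Right row without a partner: left-side columns become None.
            result.append((None, None, dept_name, dept_id))
    return result


for row in full_outer_join(emp, dept):
    print(row)
```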



### Left Semi Join:
- Returns only the rows from the left DataFrame that have a match in the right DataFrame. Only the left DataFrame's columns appear in the result.
- **Syntax**: ```df1.join(df2, on='common_column', how='leftsemi')```

**Example**

In [None]:
emp_df.join(
    dept_df,
    on = 'dept_id',
    how = 'leftsemi') \
    .show()

+-------+------+--------+----------+----+------+------+
|dept_id|emp_id|    name|manager_id| doj|gender|salary|
+-------+------+--------+----------+----+------+------+
|     10|     1|   Smith|        -1|2018|     M|  3000|
|     10|     3|Williams|         1|2010|     M|  1000|
|     10|     4|   Jones|         2|2005|     F|  2000|
|     20|     2|    Rose|         1|2010|     M|  4000|
|     40|     5|   Brown|         2|2010|      |    -1|
+-------+------+--------+----------+----+------+------+
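A left semi join acts as a filter on the left side, so it never duplicates a left row, whereas an inner join emits one row per matching pair. The plain-Python sketch below contrasts the two. It is an illustration without Spark: the right-side key list with a repeated `10` is hypothetical data chosen to expose the difference, `dept_id` is simplified to an integer, and the helper name `left_semi_join` is hypothetical.

```python
# Plain-Python contrast between an inner join and a left semi join (illustration only).
emp = [(1, "Smith", 10), (2, "Rose", 20), (6, "Brown", 50)]
# Hypothetical right-side keys; 10 appears twice on purpose.
dept_ids = [10, 10, 20]


def left_semi_join(left_rows, right_keys):
    """Keep each left row at most once if its dept_id exists on the right."""
    keys = set(right_keys)
    return [row for row in left_rows if row[2] in keys]


# Inner join: one output row per (left row, matching right key) pair.
inner_pairs = [(row, key) for row in emp for key in dept_ids if row[2] == key]
semi_rows = left_semi_join(emp, dept_ids)

print(len(inner_pairs))  # Smith pairs with both 10s, Rose with one 20
print(semi_rows)
```

In this sketch the inner join yields three pairs while the semi join yields two rows, because Smith is kept once regardless of how many right-side matches exist.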



### Left Anti Join:
- Returns only the rows from the left DataFrame that **do not** have a match in the right DataFrame. Only the left DataFrame's columns appear in the result.
- **Syntax**: ```df1.join(df2, on='common_column', how='leftanti')```

**Example**

In [34]:
emp_df.join(
   dept_df,
   on = 'dept_id',
   how = 'leftanti') \
   .show()

+-------+------+-----+----------+----+------+------+
|dept_id|emp_id| name|manager_id| doj|gender|salary|
+-------+------+-----+----------+----+------+------+
|     50|     6|Brown|         2|2010|      |    -1|
+-------+------+-----+----------+----+------+------+
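A left anti join is the complement of a left semi join: together they split the left DataFrame into rows with a match and rows without one. The plain-Python sketch below shows that split on the `emp` and `dept` rows trimmed to three employee columns. It is an illustration without Spark; the variable names are hypothetical.

```python
# Plain-Python model of left anti vs. left semi (illustration only).
emp = [
    (1, "Smith", "10"),
    (2, "Rose", "20"),
    (3, "Williams", "10"),
    (4, "Jones", "10"),
    (5, "Brown", "40"),
    (6, "Brown", "50"),
]
dept = [("Finance", 10), ("Marketing", 20), ("Sales", 30), ("IT", 40)]

# Set of keys present on the right side.
dept_ids = {dept_id for _, dept_id in dept}

# Anti: left rows whose key is absent on the right.
anti_rows = [row for row in emp if int(row[2]) not in dept_ids]
# Semi: left rows whose key is present on the right.
semi_rows = [row for row in emp if int(row[2]) in dept_ids]

print(anti_rows)
print(len(semi_rows) + len(anti_rows) == len(emp))
```

A common use is data-quality checking, for example finding employees that reference a department that does not exist.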



### Self Join:
- Joins a DataFrame to itself, typically by aliasing it twice so that the columns from each side can be told apart.

**Example**

In [35]:
emp_df.alias("emp1").join(
    emp_df.alias("emp2"),
    on = col("emp1.manager_id") == col("emp2.emp_id"),
    how = "inner") \
    .select(
        col("emp1.emp_id"),
        col("emp1.name"),
        col("emp2.emp_id").alias("manager_id"),
        col("emp2.name").alias("superior_emp_name")) \
   .show(truncate=False)

+------+--------+----------+-----------------+
|emp_id|name    |manager_id|superior_emp_name|
+------+--------+----------+-----------------+
|2     |Rose    |1         |Smith            |
|3     |Williams|1         |Smith            |
|4     |Jones   |2         |Rose             |
|5     |Brown   |2         |Rose             |
|6     |Brown   |2         |Rose             |
+------+--------+----------+-----------------+
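Smith is missing from the result because his `manager_id` of -1 matches no `emp_id`, and the join is an inner join. The plain-Python sketch below models the same lookup on the employee rows trimmed to three columns. It is an illustration without Spark, and the helper name `self_join_managers` is hypothetical.

```python
# Plain-Python model of the self join on manager_id (illustration only).
emp = [
    (1, "Smith", -1),
    (2, "Rose", 1),
    (3, "Williams", 1),
    (4, "Jones", 2),
    (5, "Brown", 2),
    (6, "Brown", 2),
]


def self_join_managers(rows):
    """Return (emp_id, name, manager_id, manager_name) for employees with a known manager."""
    # The same table plays the "right side": emp_id -> name.
    name_by_id = {emp_id: name for emp_id, name, _ in rows}
    return [
        (emp_id, name, manager_id, name_by_id[manager_id])
        for emp_id, name, manager_id in rows
        if manager_id in name_by_id  # inner-join rule: drop rows without a match
    ]


for row in self_join_managers(emp):
    print(row)
```

Switching the rule to a left outer join would keep Smith with an empty manager name instead of dropping him.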

