#### Joins

**Supported Joins**
- inner join
- outer joins
  - left
  - right
  - full
- semi join (left-semi join)
- anti join (left-anti join)
- cross join
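
To make the differences between these join types concrete, here is a minimal plain-Python sketch that mimics each one on the sample employee and department rows used in this notebook. The helper `join_rows` is hypothetical and exists only for illustration. It is not how Spark executes joins, and it assumes the join keys contain no NULL/None values.

```python
from itertools import product

# Sample rows mirroring the notebook data, trimmed to (id, name, deptid).
employees = [
    (1, "Raju", 101), (2, "Ramesh", 101), (3, "Amrita", 102),
    (4, "Madhu", 102), (5, "Aditya", 102), (6, "Vinay", 103),
    (7, "Smita", 103), (8, "Vinod", 100),
]
# (id, deptname)
departments = [(101, "IT"), (102, "ITES"), (103, "Operation"), (104, "HRD")]


def join_rows(left, right, how):
    """Hypothetical helper: join on left[2] == right[0].

    Assumes no None keys (SQL would not treat NULL = NULL as a match).
    Unmatched sides are padded with None.
    """
    if how == "cross":
        # Cartesian product: every left row paired with every right row.
        return list(product(left, right))
    matches = [(l, r) for l, r in product(left, right) if l[2] == r[0]]
    matched_left = {l for l, _ in matches}
    matched_right = {r for _, r in matches}
    left_only = [(l, None) for l in left if l not in matched_left]
    right_only = [(None, r) for r in right if r not in matched_right]
    if how == "inner":
        return matches
    if how == "left":
        return matches + left_only
    if how == "right":
        return matches + right_only
    if how == "full":
        return matches + left_only + right_only
    if how == "semi":
        # Left rows that have at least one match; left columns only.
        return [l for l in left if l in matched_left]
    if how == "anti":
        # Left rows that have no match; left columns only.
        return [l for l in left if l not in matched_left]
    raise ValueError(f"unsupported join type: {how}")


row_counts = {
    how: len(join_rows(employees, departments, how))
    for how in ("inner", "left", "right", "full", "semi", "anti", "cross")
}
print(row_counts)
```

On this data the only unmatched employee is Vinod (deptid 100) and the only unmatched department is HRD (id 104), which is why the left, right, and full joins each add rows on top of the inner join.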

In [0]:
employee = spark.createDataFrame([
    (1, "Raju", 25, 101),
    (2, "Ramesh", 26, 101),
    (3, "Amrita", 30, 102),
    (4, "Madhu", 32, 102),
    (5, "Aditya", 28, 102),
    (6, "Vinay", 42, 103),
    (7, "Smita", 27, 103),
    (8, "Vinod", 28, 100)])\
  .toDF("id", "name", "age", "deptid")
  
department = spark.createDataFrame([
    (101, "IT", 1),
    (102, "ITES", 2),
    (103, "Operation", 3),
    (104, "HRD", 4)])\
  .toDF("id", "deptname", "locationid")
  
location = spark.createDataFrame([
    (1, 'Hyderabad'),
    (2, 'Chennai'),
    (3, 'Bengaluru'),
    (4, 'Pune')])\
  .toDF("locationid", "location")

In [0]:
display(employee)

In [0]:
display(department)

In [0]:
display(location)

In [0]:
print( [ table.name for table in spark.catalog.listTables() ] )

**Using PySpark SQL method**

In [0]:
employee.createOrReplaceTempView("emp")
department.createOrReplaceTempView("dept")
location.createOrReplaceTempView("loc")

In [0]:
df1 = spark.sql("""select emp.*, dept.deptname, loc.*
         from emp 
            join dept on emp.deptid = dept.id
            join loc on dept.locationid = loc.locationid""")
display(df1)

In [0]:
%sql

select emp.*, dept.deptname, loc.*
from emp 
  join dept on emp.deptid = dept.id
  join loc on dept.locationid = loc.locationid

In [0]:
%sql

select emp.*, dept.*
from emp 
  left join dept on emp.deptid = dept.id

In [0]:
%sql

select emp.*, dept.*
from emp 
  right join dept on emp.deptid = dept.id

In [0]:
%sql

select emp.*, dept.*
from emp 
  full join dept on emp.deptid = dept.id

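**Semi Join (Left Semi Join)**
- Behaves like an inner join, but returns only the columns of the left table
- Each left row appears at most once, even if it matches several rows in the right table
- Equivalent to the following subquery
  - select * from emp where deptid in (select id from dept)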
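
The practical difference between a semi join and an inner join shows up when the right table has duplicate keys. The sketch below uses Python's built-in `sqlite3` module rather than Spark, so it runs without a cluster. The duplicate department row is a hypothetical addition for illustration, and the `in` subquery stands in for the semi join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (id INTEGER, name TEXT, deptid INTEGER);
    CREATE TABLE dept (id INTEGER, deptname TEXT);
    INSERT INTO emp VALUES (1, 'Raju', 101), (8, 'Vinod', 100);
    -- Hypothetical duplicate: two dept rows share id 101.
    INSERT INTO dept VALUES (101, 'IT'), (101, 'IT-Support');
""")

# Inner join: Raju is repeated once per matching dept row.
inner_rows = conn.execute(
    "SELECT emp.name FROM emp JOIN dept ON emp.deptid = dept.id"
).fetchall()

# Semi-join behaviour via IN: Raju appears once, however many matches exist.
semi_rows = conn.execute(
    "SELECT name FROM emp WHERE deptid IN (SELECT id FROM dept)"
).fetchall()

print(inner_rows)
print(semi_rows)
conn.close()
```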


In [0]:
%sql

select emp.*
from emp semi join dept on emp.deptid = dept.id

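**Anti Join (Left Anti Join)**
- Returns only the rows of the left table that have no match in the right table
- Equivalent to the following subquery
  - select * from emp where not exists (select 1 from dept where dept.id = emp.deptid)
- The not in form, select * from emp where deptid not in (select id from dept), gives the same result only when neither emp.deptid nor dept.id contains NULLs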
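
One caveat with the `not in` form is NULL handling: if the subquery returns a NULL, `not in` evaluates to unknown for every row and nothing is returned, whereas an anti join (or `not exists`) still returns the unmatched rows. The sketch below demonstrates this with Python's built-in `sqlite3` module as a stand-in for Spark. The department row with a NULL id is a hypothetical addition, and the assumption is that Spark SQL applies the same three-valued logic to `not in`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (id INTEGER, name TEXT, deptid INTEGER);
    CREATE TABLE dept (id INTEGER, deptname TEXT);
    INSERT INTO emp VALUES (1, 'Raju', 101), (8, 'Vinod', 100);
    -- Hypothetical row with a NULL key.
    INSERT INTO dept VALUES (101, 'IT'), (NULL, 'Unassigned');
""")

# NOT IN: the NULL in the subquery makes the predicate unknown for Vinod.
not_in_rows = conn.execute(
    "SELECT name FROM emp WHERE deptid NOT IN (SELECT id FROM dept)"
).fetchall()

# NOT EXISTS: matches anti-join behaviour and still returns Vinod.
not_exists_rows = conn.execute(
    "SELECT name FROM emp WHERE NOT EXISTS "
    "(SELECT 1 FROM dept WHERE dept.id = emp.deptid)"
).fetchall()

print(not_in_rows)
print(not_exists_rows)
conn.close()
```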

In [0]:
%sql

select emp.*
from emp anti join dept on emp.deptid = dept.id

**Cross Join**

In [0]:
%sql
select * from emp cross join dept

**Using the 'join' Transformation**


In [0]:
spark.catalog.listTables()

In [0]:
spark.catalog.dropTempView("loc")

In [0]:
c = employee.deptid == department.id
type(c)

In [0]:
joined_df = employee.join(department, employee.deptid == department.id) \
                .join(location, department.locationid == location.locationid)

display(joined_df)

In [0]:
joined_df = employee.join(department, employee.deptid == department.id, "left")
display(joined_df)

In [0]:
joined_df = employee.join(department, employee.deptid == department.id, "right")
display(joined_df)

In [0]:
joined_df = employee.join(department, employee.deptid == department.id, "full")
display(joined_df)

In [0]:
joined_df = employee.join(department, employee.deptid == department.id, "semi")
display(joined_df)

In [0]:
joined_df = employee.join(department, employee.deptid == department.id, "anti")
display(joined_df)

In [0]:
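# crossJoin makes the Cartesian product explicit instead of relying on a join with no condition
joined_df = employee.crossJoin(department)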
display(joined_df)