# join():
                    Syntax: RDD.join(<otherRDD>, numPartitions=None)
The join() transformation is an implementation of an inner join, matching two key/value pair
RDDs by their key.

The optional numPartitions argument sets the number of partitions in the resulting RDD.
If it is not specified, Spark uses the value of the spark.default.parallelism configuration
parameter when that parameter is set; otherwise the partition count is derived from the
partitioning of the input RDDs. The numPartitions argument behaves the same way for the
other types of join operations in the Spark API.

Each element of the returned RDD has the form (key, (left_value, right_value)): the matched
key, followed by a tuple holding one value from each RDD. Every matching combination of
records produces its own element. (This may look unfamiliar if you are used to performing
INNER JOIN operations in SQL, which return a flattened list of columns from both entities.)
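
To make that shape concrete without a Spark installation, here is a minimal plain-Python sketch that models the inner-join semantics on ordinary lists of (key, value) pairs. The function name `inner_join` and the sample data are hypothetical and only illustrate the result shape; this is not how Spark implements join(), and Spark does not guarantee any particular element order.

```python
from collections import defaultdict

def inner_join(left, right):
    """Plain-Python model of RDD.join() on lists of (key, value) pairs."""
    # Group the right-hand values by key so each left record can look them up.
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    # Emit one (key, (left_value, right_value)) element per matching pair;
    # keys that appear on only one side produce nothing.
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]

left = [("a", 1), ("a", 2), ("b", 3)]
right = [("a", "x"), ("c", "y")]
result = inner_join(left, right)
print(result)  # [('a', (1, 'x')), ('a', (2, 'x'))]
```

Keys "b" and "c" are dropped because each appears on one side only, and key "a" yields two elements because two left records match it.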

Let's see it through an example:

In [1]:
from pyspark.sql import SparkSession
from pyspark import SparkContext,SparkConf

configure = SparkConf().setAppName("All-Joins").setMaster("local")
sc = SparkContext(conf = configure)

spark = SparkSession.builder \
        .appName("All-Joins") \
        .getOrCreate()
    
spark.sparkContext.getConf().getAll()

[('spark.master', 'local'),
 ('spark.app.id', 'local-1590555401216'),
 ('spark.driver.port', '59330'),
 ('spark.rdd.compress', 'True'),
 ('spark.app.name', 'All-Joins'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.driver.host', 'DESKTOP-I7971JS')]

In [2]:
stores = sc.parallelize(
                        [
                        (100, 'Boca Raton'),
                        (101, 'Columbia'),
                        (102, 'Cambridge'),
                        (103, 'Naperville')
                        ]
)
# stores schema (store_id, store_location)
salespeople = sc.parallelize(
                              [
                                (1, 'Henry', 100),
                                (2, 'Karen', 100),
                                (3, 'Paul', 101),
                                (4, 'Jimmy', 102),
                                (5, 'Janice', None)
                              ]
)
# salespeople schema (salesperson_id, salesperson_name, store_id)

In [3]:
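# keyBy() turns each salesperson tuple into (store_id, salesperson) so that it can
# be joined with stores. The result is a pair RDD (not a DataFrame, despite the name).
joinedDF = salespeople.keyBy(lambda x: x[2]) \
                      .join(stores)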

joinedDF.collect()

[(100, ((1, 'Henry', 100), 'Boca Raton')),
 (100, ((2, 'Karen', 100), 'Boca Raton')),
 (102, ((4, 'Jimmy', 102), 'Cambridge')),
 (101, ((3, 'Paul', 101), 'Columbia'))]

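This join() operation returns every salesperson assigned to a store, keyed by the store ID (the
join key), along with the full salesperson record and the store location. Janice, whose store ID
is None, and store 103 (Naperville), which has no salespeople, are dropped because neither has a
match. Notice that the resultant RDD contains duplicate data: the store ID appears both as the
key and inside each salesperson record. You could (and should in many cases) follow the join()
with a map() transformation to prune fields or project only the fields required for further
processing.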
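
As a rough sketch of such a projection, the function below keeps the store ID as the key and retains only the salesperson name and store location. The name `to_name_and_location` is hypothetical, and the function is applied here to elements copied from the collect() output above so that it runs without Spark.

```python
def to_name_and_location(record):
    """Keep the store ID as key; keep only salesperson name and store location."""
    store_id, (salesperson, store_location) = record
    _salesperson_id, name, _store_id = salesperson
    return (store_id, (name, store_location))

# Elements copied from the joinedDF.collect() output shown earlier.
joined_records = [
    (100, ((1, 'Henry', 100), 'Boca Raton')),
    (101, ((3, 'Paul', 101), 'Columbia')),
]
pruned = [to_name_and_location(r) for r in joined_records]
print(pruned)  # [(100, ('Henry', 'Boca Raton')), (101, ('Paul', 'Columbia'))]
```

With Spark, the same function would be passed to map(), for example `joinedDF.map(to_name_and_location)`.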

# Optimizing Joins in Spark:
Joins involving RDDs that span more than one partition (and many do) require a shuffle.
Spark generally plans and implements this activity to achieve the best performance it can;
however, a simple axiom to remember is “join large by small.” This means referencing
the large RDD (the one with the most elements, if this is known) first, followed by the
smaller of the two RDDs. This may seem strange to users coming from relational database
programming backgrounds, but unlike in relational database systems, joins in Spark are relatively
expensive because of the shuffle. And unlike with most databases, the RDD API has no indexes or
statistics with which to optimize the join, so the optimizations you provide are essential to
maximizing performance.

# leftOuterJoin():
                Syntax:  RDD.leftOuterJoin(<otherRDD>, numPartitions=None)
                
The leftOuterJoin() transformation returns all elements or records from the first RDD referenced.
If keys from the first (or left) RDD are present in the right RDD, then the right RDD record
is returned along with the left RDD record. Otherwise, the right RDD record is None (empty).

The example shown below uses the leftOuterJoin() transformation to identify salespeople
with no stores.

In [7]:
withNoStores = salespeople.keyBy(lambda x: x[2]) \
                          .leftOuterJoin(stores) \
                          .filter(lambda x : x[1][1] is None) \
                          .map(lambda x : "salesperson " + x[1][0][1] + " has no store")
withNoStores.collect()

['salesperson Janice has no store']
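
The nested indexes in the lambdas above (x[1][1] and x[1][0][1]) can be hard to read. One possible alternative, sketched here with hypothetical function names, is to unpack each element by name. The sample records are shaped like leftOuterJoin() elements and are plain Python data, so the sketch runs without Spark.

```python
def has_no_store(record):
    """True when the right-hand (store) side of a left outer join is None."""
    _key, (_salesperson, store_location) = record
    return store_location is None

def describe(record):
    """Build the message from the salesperson's name (second field of the tuple)."""
    _key, (salesperson, _store_location) = record
    _salesperson_id, name, _store_id = salesperson
    return "salesperson " + name + " has no store"

# Records shaped like the elements produced by leftOuterJoin().
sample = [
    (100, ((1, 'Henry', 100), 'Boca Raton')),
    (None, ((5, 'Janice', None), None)),
]
messages = [describe(r) for r in sample if has_no_store(r)]
print(messages)  # ['salesperson Janice has no store']
```

With Spark, these functions could replace the lambdas: `.filter(has_no_store).map(describe)`.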

# rightOuterJoin():
                    Syntax:    RDD.rightOuterJoin(<otherRDD>, numPartitions=None)
The rightOuterJoin() transformation returns all elements or records from the second RDD
referenced. If keys from the second (or right) RDD are present in the left RDD, then the left
RDD record is returned along with the right RDD record. Otherwise, the left RDD record is None
(empty).

The example below uses the rightOuterJoin() transformation to identify stores with no
salespeople.

In [5]:
storeWithoutSalesPeople = salespeople.keyBy(lambda x : x[2]) \
                                     .rightOuterJoin(stores) \
                                     .filter(lambda x: x[1][0] is None) \
                                     .map(lambda x: x[1][1] + " Store has no salesperson")
storeWithoutSalesPeople.collect()

['Naperville Store has no salesperson']

# fullOuterJoin():
                            Syntax:  RDD.fullOuterJoin(<otherRDD>, numPartitions=None)
The fullOuterJoin() transformation returns all elements from both RDDs, whether or not their
keys match. For a key found in only one RDD, the missing left or right value is represented
as None (empty).
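
To illustrate where the None placeholders end up, here is a minimal plain-Python sketch that models full outer join semantics on lists of (key, value) pairs. The function name `full_outer_join` and the simplified sample data are hypothetical; this is a model of the result shape, not Spark's implementation, and Spark does not guarantee element order.

```python
from collections import defaultdict

def full_outer_join(left, right):
    """Plain-Python model of RDD.fullOuterJoin() on lists of (key, value) pairs."""
    left_by_key, right_by_key = defaultdict(list), defaultdict(list)
    for key, value in left:
        left_by_key[key].append(value)
    for key, value in right:
        right_by_key[key].append(value)
    result = []
    # dict.fromkeys() removes duplicate keys while preserving first-seen order.
    for key in dict.fromkeys(list(left_by_key) + list(right_by_key)):
        # A side with no values for this key contributes a single None.
        left_values = left_by_key.get(key) or [None]
        right_values = right_by_key.get(key) or [None]
        result.extend((key, (lv, rv)) for lv in left_values for rv in right_values)
    return result

left = [(100, "Henry"), (None, "Janice")]
right = [(100, "Boca Raton"), (103, "Naperville")]
result = full_outer_join(left, right)
print(result)
# [(100, ('Henry', 'Boca Raton')), (None, ('Janice', None)), (103, (None, 'Naperville'))]
```

Keeping only the elements whose right value may be None gives the left outer join, and keeping only those whose left value may be None gives the right outer join.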

The example below uses the fullOuterJoin() transformation to identify stores with no
salespeople as well as salespeople with no stores.

In [6]:
outerJoin = salespeople.keyBy(lambda x:x[2]) \
                        .fullOuterJoin(stores) \
                        .filter(lambda x: x[1][0] is None or x[1][1] is None) 
                        
outerJoin.collect()

[(None, ((5, 'Janice', None), None)), (103, (None, 'Naperville'))]