# Join Transformations
Join operations are analogous to the JOIN operations you routinely see in SQL programming. Join
functions combine records from two RDDs based on a common field, a key. Because join functions
in Spark require a key to be defined, they operate on key/value pair RDDs.

The following is a quick refresher on joins, which you may want to skip if you have a relational
database background:

<ul>
    <li>
        A <b>join</b> operates on two different datasets, where one field in each dataset is nominated as
a key (a join key). The datasets are referred to in the order in which they are specified. For
instance, the first dataset specified is considered the left entity or dataset, and the second
dataset specified is considered the right entity or dataset.
    </li>
    <li>
       An <b>inner join</b>, often simply called a join (where the “inner” is inferred), returns only the
records whose nominated key is present in both datasets. Records with no matching key in the other dataset are dropped.
    </li>
    <li>
        An <b>outer join</b> does not require keys to match in both datasets. Outer joins are implemented
as either a left outer join, a right outer join, or a full outer join.
    </li>    
    <li>
       A <b>left outer join</b> returns all records from the left (or first) dataset along with matched records
only (by the specified key) from the right (or second) dataset. 
    </li>
    <li>
        A <b>right outer join</b> returns all records from the right (or second) dataset along with matched
records only (by the specified key) from the left (or first) dataset.
    </li>    
    <li>
        A <b>full outer join</b> returns all records from both datasets whether there is a key match or not.
    </li>
</ul>
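
The join types above can be sketched in plain Python to make the differences concrete. This is only an illustration of the semantics, not how Spark implements joins: the helper names (`inner_join`, `left_outer_join`, and so on) and the small sample lists are hypothetical. The result shape `(key, (left_value, right_value))`, with `None` standing in for a missing side, is chosen to mirror what the pair RDD join functions in PySpark return.

```python
from collections import defaultdict


def _group(pairs):
    """Group the values of (key, value) pairs by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def inner_join(left, right):
    # Keep only the keys that appear in both datasets.
    right_by_key = _group(right)
    return [(k, (v, w)) for k, v in left for w in right_by_key.get(k, [])]


def left_outer_join(left, right):
    # Keep every left record; use None when the right side has no match.
    right_by_key = _group(right)
    return [(k, (v, w)) for k, v in left for w in right_by_key.get(k, [None])]


def right_outer_join(left, right):
    # Keep every right record; use None when the left side has no match.
    left_by_key = _group(left)
    return [(k, (v, w)) for k, w in right for v in left_by_key.get(k, [None])]


def full_outer_join(left, right):
    # Every left record, plus the right records whose key never appears on the left.
    left_keys = {k for k, _ in left}
    unmatched_right = [(k, (None, w)) for k, w in right if k not in left_keys]
    return left_outer_join(left, right) + unmatched_right


# Hypothetical sample data, keyed by store_id.
salespeople_by_store = [
    (100, 'Henry'),
    (100, 'Karen'),
    (101, 'Paul'),
    (102, 'Jimmy'),
    (None, 'Janice'),   # salesperson with no store assigned
]
stores = [
    (100, 'Boca Raton'),
    (101, 'Columbia'),
    (102, 'Cambridge'),
    (103, 'Naperville'),  # store with no salespeople
]

inner = inner_join(salespeople_by_store, stores)
left_outer = left_outer_join(salespeople_by_store, stores)
right_outer = right_outer_join(salespeople_by_store, stores)
full_outer = full_outer_join(salespeople_by_store, stores)

for name, rows in [('inner', inner), ('left outer', left_outer),
                   ('right outer', right_outer), ('full outer', full_outer)]:
    print(name, len(rows))
```

In this sketch the inner join drops both Janice (no store) and Naperville (no salespeople), the left outer join keeps Janice, the right outer join keeps Naperville, and the full outer join keeps both.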

Joins are some of the most commonly required transformations in the Spark API, so it is imperative
that you understand these functions and become comfortable using them.

To illustrate the use of the different join types in the Spark RDD API, let’s consider a dataset from
a fictitious retailer that includes an entity containing stores and an entity containing salespeople,
loaded into RDDs, as shown below:

In [1]:
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext

# Run Spark locally under the application name "JOINS"
configure = SparkConf().setAppName("JOINS").setMaster("local")
sc = SparkContext(conf=configure)

# stores schema (store_id, store_location)
stores = sc.parallelize([
    (100, 'Boca Raton'),
    (101, 'Columbia'),
    (102, 'Cambridge'),
    (103, 'Naperville'),
])

# salespeople schema (salesperson_id, salesperson_name, store_id)
salespeople = sc.parallelize([
    (1, 'Henry', 100),
    (2, 'Karen', 100),
    (3, 'Paul', 101),
    (4, 'Jimmy', 102),
    (5, 'Janice', None),
])

In the coming notebook lectures, we will look at the available join transformations in Spark, their usage, and some
examples.