# Chapter 4: Joins (SQL and Core)

In this chapter, we study joins in Spark using both the Core (RDD) API and the SQL (DataFrame) API.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Joins-SQL-core").master("local[*]").getOrCreate()
sc = spark.sparkContext

## Data

First, we create some RDDs and DataFrames that are used in the rest of the sections.

In [2]:
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, FloatType

In [3]:
people_accounts_rdd = sc.parallelize([(1, ("John", 11)), (2, ("Isabelle", 22)), (3, ("Maria", 33)),
                                      (4, ("Peter", 44)), (5, ("Connor", 55)), (6, ("Max", 66))])

people_accounts_schema = StructType([StructField("id", IntegerType(), False),
                                     StructField("Name", StringType(), False),
                                     StructField("account_id", IntegerType(), False)])

people_accounts_df = spark.createDataFrame(people_accounts_rdd.map(lambda x: (x[0], x[1][0], x[1][1])), 
                                           people_accounts_schema)

In [4]:
people_accounts_rdd.take(3)

[(1, ('John', 11)), (2, ('Isabelle', 22)), (3, ('Maria', 33))]

In [5]:
people_accounts_df.show()

+---+--------+----------+
| id|    Name|account_id|
+---+--------+----------+
|  1|    John|        11|
|  2|Isabelle|        22|
|  3|   Maria|        33|
|  4|   Peter|        44|
|  5|  Connor|        55|
|  6|     Max|        66|
+---+--------+----------+



In [6]:
people_emails_rdd = sc.parallelize([(1, "john@gmail.com"), (3, "maria@gmail.com"),
                                    (4, "peter@gmail.com"), (5, "connor@gmail.com")])

people_emails_schema = StructType([StructField("id", IntegerType(), False),
                                   StructField("Email", StringType(), False)])

people_emails_df = spark.createDataFrame(people_emails_rdd, 
                                         people_emails_schema)

In [7]:
people_emails_rdd.take(3)

[(1, 'john@gmail.com'), (3, 'maria@gmail.com'), (4, 'peter@gmail.com')]

In [8]:
people_emails_df.show()

+---+----------------+
| id|           Email|
+---+----------------+
|  1|  john@gmail.com|
|  3| maria@gmail.com|
|  4| peter@gmail.com|
|  5|connor@gmail.com|
+---+----------------+



In [9]:
accounts_balance_type_rdd = sc.parallelize([(11, (152.0, 1)), (22, (3545.3, 2)), (33, (12.5, 1)),
                                            (44, (75.0, 1)), (55, (4853.12, 2)), (66, (47.0, 1))])

accounts_balance_type_schema = StructType([StructField("account_id", IntegerType(), False),
                                           StructField("balance", FloatType(), False),
                                           StructField("account_type_id", IntegerType(), False)])

accounts_balance_type_df = spark.createDataFrame(accounts_balance_type_rdd.map(lambda x: (x[0], x[1][0], x[1][1])),
                                                 accounts_balance_type_schema)

In [10]:
accounts_balance_type_rdd.take(3)

[(11, (152.0, 1)), (22, (3545.3, 2)), (33, (12.5, 1))]

In [11]:
accounts_balance_type_df.show()

+----------+-------+---------------+
|account_id|balance|account_type_id|
+----------+-------+---------------+
|        11|  152.0|              1|
|        22| 3545.3|              2|
|        33|   12.5|              1|
|        44|   75.0|              1|
|        55|4853.12|              2|
|        66|   47.0|              1|
+----------+-------+---------------+



In [12]:
accounts_type_description_rdd = sc.parallelize([(1, "Basic Account"), (2, "Premium Account")])

accounts_type_description_schema = StructType([StructField("account_type_id", IntegerType(), False),
                                               StructField("account_description", StringType(), False)])

accounts_type_description_df = spark.createDataFrame(accounts_type_description_rdd, accounts_type_description_schema)

In [13]:
accounts_type_description_rdd.take(2)

[(1, 'Basic Account'), (2, 'Premium Account')]

In [14]:
accounts_type_description_df.show()

+---------------+-------------------+
|account_type_id|account_description|
+---------------+-------------------+
|              1|      Basic Account|
|              2|    Premium Account|
+---------------+-------------------+



## Core Spark Joins

We start with joins of key/value (pair) RDDs. The main operations are `join` (an inner join), `leftOuterJoin` and `rightOuterJoin`; a `fullOuterJoin` is also available.

`join()`

In [15]:
people_accounts_rdd.join(people_emails_rdd).collect()

[(1, (('John', 11), 'john@gmail.com')),
 (3, (('Maria', 33), 'maria@gmail.com')),
 (4, (('Peter', 44), 'peter@gmail.com')),
 (5, (('Connor', 55), 'connor@gmail.com'))]

In [16]:
people_accounts_rdd.join(people_emails_rdd).map(lambda x: (x[0], x[1][0][0], x[1][1])).collect()

[(1, 'John', 'john@gmail.com'),
 (3, 'Maria', 'maria@gmail.com'),
 (4, 'Peter', 'peter@gmail.com'),
 (5, 'Connor', 'connor@gmail.com')]

`leftOuterJoin()`

In [17]:
people_accounts_rdd.leftOuterJoin(people_emails_rdd).map(lambda x: (x[0], x[1][0][0], x[1][1])).collect()

[(1, 'John', 'john@gmail.com'),
 (2, 'Isabelle', None),
 (3, 'Maria', 'maria@gmail.com'),
 (4, 'Peter', 'peter@gmail.com'),
 (5, 'Connor', 'connor@gmail.com'),
 (6, 'Max', None)]
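
`rightOuterJoin()` is not run in this notebook, so the following is only a plain-Python sketch of how the three operations differ; it does not use Spark at all. The helper names and the unmatched key `7` are hypothetical and exist only to make the right outer case visible.

```python
from collections import defaultdict

def _group(pairs):
    """Group values by key, keeping duplicates."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def inner_join(left, right):
    # Only keys present on both sides survive.
    r = _group(right)
    return [(k, (lv, rv)) for k, lv in left for rv in r.get(k, [])]

def left_outer_join(left, right):
    # Every left record survives; a missing right value becomes None.
    r = _group(right)
    return [(k, (lv, rv)) for k, lv in left for rv in r.get(k, [None])]

def right_outer_join(left, right):
    # Every right record survives; a missing left value becomes None.
    l = _group(left)
    return [(k, (lv, rv)) for k, rv in right for lv in l.get(k, [None])]

people = [(1, "John"), (2, "Isabelle"), (3, "Maria")]
# Key 7 is a hypothetical email without a matching person.
emails = [(1, "john@gmail.com"), (3, "maria@gmail.com"), (7, "nobody@gmail.com")]

print(inner_join(people, emails))
print(left_outer_join(people, emails))
print(right_outer_join(people, emails))
```

In the sketch, key `2` appears only in the left outer result and key `7` only in the right outer result, each paired with `None` on the missing side.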

To speed up joins, especially when the same RDD is going to be joined several times, it is useful to pre-partition the RDDs by key.

Let's check whether `people_accounts_rdd` has a partitioner.

In [18]:
print(people_accounts_rdd.partitioner)

None


We now partition that data by key into 2 partitions.

In [19]:
people_accounts_par = people_accounts_rdd.partitionBy(2)

The new RDD now has a partitioner.

In [20]:
print(people_accounts_par.partitioner)

<pyspark.rdd.Partitioner object at 0x7f588e588be0>


In [21]:
people_emails_par = people_emails_rdd.partitionBy(2)
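
To see why partitioning both RDDs the same way helps, here is a minimal plain-Python sketch of hash partitioning. It assumes the partition index is `hash(key) % num_partitions` and uses Python's built-in `hash()` as a stand-in for Spark's own hashing function; the helper names are hypothetical.

```python
def partition_index(key, num_partitions):
    # Assumption: built-in hash() stands in for the hashing Spark applies to keys.
    return hash(key) % num_partitions

def partition_by(pairs, num_partitions):
    """Distribute (key, value) pairs into num_partitions lists by key hash."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_index(key, num_partitions)].append((key, value))
    return partitions

accounts = [(1, ("John", 11)), (2, ("Isabelle", 22)), (3, ("Maria", 33)),
            (4, ("Peter", 44)), (5, ("Connor", 55)), (6, ("Max", 66))]
emails = [(1, "john@gmail.com"), (3, "maria@gmail.com"),
          (4, "peter@gmail.com"), (5, "connor@gmail.com")]

accounts_parts = partition_by(accounts, 2)
emails_parts = partition_by(emails, 2)

# Print the keys that land in each partition for both datasets.
for idx in range(2):
    print(idx,
          [k for k, _ in accounts_parts[idx]],
          [k for k, _ in emails_parts[idx]])
```

Because both datasets use the same function and the same number of partitions, a given key always lands at the same partition index on both sides, so matching records can be joined partition by partition.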

We first run the same join 10 times without pre-partitioning the data.

In [22]:
import time
for idx in range(10):
    ini_time = time.time()
    people_accounts_rdd.join(people_emails_rdd).collect()
    print("Join time: {0}".format(time.time() - ini_time))

Join time: 0.4817535877227783
Join time: 0.4499666690826416
Join time: 0.45418882369995117
Join time: 0.3062405586242676
Join time: 0.3055107593536377
Join time: 0.3097646236419678
Join time: 0.32379674911499023
Join time: 0.318495512008667
Join time: 0.2895677089691162
Join time: 0.30077052116394043


As we can see, all the times are more or less the same (roughly 0.3 to 0.5 seconds). Now, we repeat the process using the pre-partitioned RDDs.

In [23]:
import time
for idx in range(10):
    ini_time = time.time()
    people_accounts_par.join(people_emails_par).collect()
    print("Join time: {0}".format(time.time() - ini_time))

Join time: 0.24150776863098145
Join time: 0.11041045188903809
Join time: 0.11476993560791016
Join time: 0.11603760719299316
Join time: 0.1255946159362793
Join time: 0.1141667366027832
Join time: 0.12054777145385742
Join time: 0.11200523376464844
Join time: 0.11954164505004883
Join time: 0.11143922805786133


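As we can see, even the first join (about 0.24 seconds) is faster than the unpartitioned ones, and the following joins drop to about 0.11 seconds, roughly a third of the unpartitioned time. With a dataset this small, the timings mostly reflect scheduling and shuffle overhead, so they should be read as indicative only.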
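
Timings like these are noisy, so it can help to separate a warm-up run from the measured runs and to summarise with the median. The helper below is a hypothetical sketch (`time_action` is not part of Spark or of this notebook) and uses a stand-in workload so that it runs without a Spark session.

```python
import statistics
import time

def time_action(action, repeats=10, warmup=1):
    """Run `action` repeatedly and return the elapsed seconds of each timed run."""
    for _ in range(warmup):
        action()  # warm-up runs are executed but not recorded
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return timings

# Stand-in workload; in the notebook this could instead be
# lambda: people_accounts_par.join(people_emails_par).collect()
timings = time_action(lambda: sum(range(10000)), repeats=5)
print("median: {0:.6f}s, min: {1:.6f}s".format(statistics.median(timings), min(timings)))
```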

Finally, when joining a large RDD with a small one, it is quite convenient to "broadcast" the small RDD to all the executors, so that the large RDD can be joined locally in each partition. Let's see an example.

In [24]:
big_rdd = accounts_balance_type_rdd.map(lambda x: (x[1][1], (x[0], x[1][0])))

In [25]:
big_rdd.collect()

[(1, (11, 152.0)),
 (2, (22, 3545.3)),
 (1, (33, 12.5)),
 (1, (44, 75.0)),
 (2, (55, 4853.12)),
 (1, (66, 47.0))]

In [26]:
small_rdd_local = accounts_type_description_rdd.collectAsMap()

In [27]:
small_rdd_local

{1: 'Basic Account', 2: 'Premium Account'}

In [28]:
small_rdd_local_bcast = sc.broadcast(small_rdd_local)

In [29]:
def broadcastJoinFunction(sec_iterator):
    for sec_iter in sec_iterator:
        yield (sec_iter[0], (sec_iter[1][0], sec_iter[1][1], small_rdd_local_bcast.value.get(sec_iter[0])))

In [30]:
big_rdd.mapPartitions(broadcastJoinFunction).collect()

[(1, (11, 152.0, 'Basic Account')),
 (2, (22, 3545.3, 'Premium Account')),
 (1, (33, 12.5, 'Basic Account')),
 (1, (44, 75.0, 'Basic Account')),
 (2, (55, 4853.12, 'Premium Account')),
 (1, (66, 47.0, 'Basic Account'))]
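
Because the lookup uses `dict.get`, a key that is missing from the broadcast dictionary would produce `None`, which corresponds to left outer semantics. The sketch below contrasts that with an inner variant that drops unmatched records. The function names and the unmatched key `3` are hypothetical, and the functions take the lookup dictionary as an explicit argument so they can be run on a plain list without Spark.

```python
def broadcast_left_join(partition, lookup):
    """Keep every record; unmatched keys get None (left outer semantics)."""
    for key, values in partition:
        yield (key, values + (lookup.get(key),))

def broadcast_inner_join(partition, lookup):
    """Drop records whose key is missing from the lookup (inner semantics)."""
    for key, values in partition:
        if key in lookup:
            yield (key, values + (lookup[key],))

lookup = {1: "Basic Account", 2: "Premium Account"}
# Key 3 is a hypothetical account type with no description.
partition = [(1, (11, 152.0)), (2, (22, 3545.3)), (3, (77, 1.0))]

print(list(broadcast_left_join(partition, lookup)))
print(list(broadcast_inner_join(partition, lookup)))
```

With Spark, such a function could presumably be applied as `big_rdd.mapPartitions(lambda it: broadcast_inner_join(it, small_rdd_local_bcast.value))`.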

## Spark SQL 

Joining DataFrames using the SQL API is quite simple and efficient. The most common join modes are `inner`, `left_outer`, `right_outer`, `outer` (also accepted as `full`), `left_semi` and `left_anti`. Let's see some examples of them.

`inner`

In [31]:
people_accounts_df.join(people_emails_df, "id", "inner").show()

+---+------+----------+----------------+
| id|  Name|account_id|           Email|
+---+------+----------+----------------+
|  1|  John|        11|  john@gmail.com|
|  3| Maria|        33| maria@gmail.com|
|  5|Connor|        55|connor@gmail.com|
|  4| Peter|        44| peter@gmail.com|
+---+------+----------+----------------+



`left_outer`

In [32]:
people_accounts_df.join(people_emails_df, "id", "left_outer").show()

+---+--------+----------+----------------+
| id|    Name|account_id|           Email|
+---+--------+----------+----------------+
|  1|    John|        11|  john@gmail.com|
|  6|     Max|        66|            null|
|  3|   Maria|        33| maria@gmail.com|
|  5|  Connor|        55|connor@gmail.com|
|  4|   Peter|        44| peter@gmail.com|
|  2|Isabelle|        22|            null|
+---+--------+----------+----------------+



`right_outer`

In [33]:
people_accounts_df.join(people_emails_df, "id", "right_outer").show()

+---+------+----------+----------------+
| id|  Name|account_id|           Email|
+---+------+----------+----------------+
|  1|  John|        11|  john@gmail.com|
|  3| Maria|        33| maria@gmail.com|
|  5|Connor|        55|connor@gmail.com|
|  4| Peter|        44| peter@gmail.com|
+---+------+----------+----------------+



`full`

In [34]:
people_accounts_df.join(people_emails_df, "id", "full").show()

+---+--------+----------+----------------+
| id|    Name|account_id|           Email|
+---+--------+----------+----------------+
|  1|    John|        11|  john@gmail.com|
|  6|     Max|        66|            null|
|  3|   Maria|        33| maria@gmail.com|
|  5|  Connor|        55|connor@gmail.com|
|  4|   Peter|        44| peter@gmail.com|
|  2|Isabelle|        22|            null|
+---+--------+----------+----------------+



`left_semi`

In [35]:
people_accounts_df.join(people_emails_df, "id", "left_semi").show()

+---+------+----------+
| id|  Name|account_id|
+---+------+----------+
|  1|  John|        11|
|  3| Maria|        33|
|  5|Connor|        55|
|  4| Peter|        44|
+---+------+----------+



`left_anti`

In [36]:
people_accounts_df.join(people_emails_df, "id", "left_anti").show()

+---+--------+----------+
| id|    Name|account_id|
+---+--------+----------+
|  6|     Max|        66|
|  2|Isabelle|        22|
+---+--------+----------+
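
The semi and anti modes act as filters on the left side: they return only left columns and never duplicate a left row, even when several right rows match. A plain-Python sketch of that behaviour follows; the rows are dictionaries, the helper names are hypothetical, and the second email for `id` 1 is invented to show the no-duplication property.

```python
def left_semi(left_rows, right_rows, key):
    """Keep left rows whose key appears at least once on the right."""
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] in right_keys]

def left_anti(left_rows, right_rows, key):
    """Keep left rows whose key never appears on the right."""
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] not in right_keys]

people = [{"id": 1, "Name": "John"},
          {"id": 2, "Name": "Isabelle"},
          {"id": 3, "Name": "Maria"}]
# The second row for id 1 is a hypothetical duplicate match.
emails = [{"id": 1, "Email": "john@gmail.com"},
          {"id": 1, "Email": "john@work.example"},
          {"id": 3, "Email": "maria@gmail.com"}]

print(left_semi(people, emails, "id"))
print(left_anti(people, emails, "id"))
```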



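Another interesting option is the self join, in which a DataFrame is joined with itself. Note that the result contains duplicate column names; aliasing each side with `alias` is the usual way to tell them apart.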

In [37]:
people_accounts_df.join(people_accounts_df, people_accounts_df["id"] == people_accounts_df["id"]).show()

+---+--------+----------+---+--------+----------+
| id|    Name|account_id| id|    Name|account_id|
+---+--------+----------+---+--------+----------+
|  1|    John|        11|  1|    John|        11|
|  6|     Max|        66|  6|     Max|        66|
|  3|   Maria|        33|  3|   Maria|        33|
|  5|  Connor|        55|  5|  Connor|        55|
|  4|   Peter|        44|  4|   Peter|        44|
|  2|Isabelle|        22|  2|Isabelle|        22|
+---+--------+----------+---+--------+----------+



Finally, we can also use broadcast joins, marking the small DataFrame with `broadcast` as a hint that it should be shipped to every executor.

In [38]:
import pyspark.sql.functions as F

In [39]:
accounts_balance_type_df.join(F.broadcast(accounts_type_description_df), "account_type_id").show()

+---------------+----------+-------+-------------------+
|account_type_id|account_id|balance|account_description|
+---------------+----------+-------+-------------------+
|              1|        11|  152.0|      Basic Account|
|              2|        22| 3545.3|    Premium Account|
|              1|        33|   12.5|      Basic Account|
|              1|        44|   75.0|      Basic Account|
|              2|        55|4853.12|    Premium Account|
|              1|        66|   47.0|      Basic Account|
+---------------+----------+-------+-------------------+

