# Understanding Joins

### Introduction

In this lesson, we'll look at one of the more costly SQL operations: the join.  Let's get into it.

## The brute force approach

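The brute force approach compares every row of one table against every row of the other and keeps the pairs that match.  For example, take these two columns: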

* customers.customer_id 
    * 1 
    * 2
    * 4
    * 6

* orders.customer_id 
    * 1
    * 4
    * 6
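As a rough illustration of the brute force idea, here is a minimal Python sketch that compares every value in one column against every value in the other.  The function name `nested_loop_join` and the comparison counter are assumptions made for this example; this is not how a database implements it internally.

```python
# The two columns from the example above.
customer_ids = [1, 2, 4, 6]       # customers.customer_id
order_customer_ids = [1, 4, 6]    # orders.customer_id


def nested_loop_join(left, right):
    """Compare every left value with every right value (brute force)."""
    matches = []
    comparisons = 0
    for left_value in left:
        for right_value in right:
            comparisons += 1
            if left_value == right_value:
                matches.append((left_value, right_value))
    return matches, comparisons


matches, comparisons = nested_loop_join(customer_ids, order_customer_ids)
print(matches)       # [(1, 1), (4, 4), (6, 6)]
print(comparisons)   # 12, one comparison per pair of values (4 * 3)
```

The number of comparisons grows with the product of the two column lengths, which is why this approach gets expensive on large tables.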

### How Joins Normally Work

#### 1.  Sort and Merge


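Both columns are sorted first, and then one pointer walks down each column:

1. First pointer - Moves forward only when the second pointer's value is higher.
2. Second pointer - Moves forward until its value is higher than the first pointer's value, then yields to the first pointer.
* When the two pointers sit on the same value, the match is moved to the work table.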

<img src="./work-table.png" width="60%">
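The two-pointer walk can be sketched in Python as follows.  This is a simplified illustration, not the database's actual implementation: the function name `sort_merge_join` is hypothetical, and the sketch assumes the values in the first column are unique (as a primary key would be), so only the second column may contain duplicates.

```python
def sort_merge_join(left, right):
    """Join two columns by sorting them and walking one pointer down each.

    Assumption for this sketch: values in `left` are unique.
    """
    left = sorted(left)
    right = sorted(right)
    work_table = []
    i = 0  # first pointer, into left
    j = 0  # second pointer, into right
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            # Match: move it to the work table, then advance the second
            # pointer in case the next right value is a duplicate.
            work_table.append((left[i], right[j]))
            j += 1
        elif left[i] < right[j]:
            # The second pointer's value is higher, so the first moves forward.
            i += 1
        else:
            # The second pointer is behind, so it catches up.
            j += 1
    return work_table


print(sort_merge_join([1, 2, 4, 6], [1, 4, 6]))
# [(1, 1), (4, 4), (6, 6)]
```

After the sort, each pointer only ever moves forward, so every value is visited once instead of being compared against every value in the other column.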

#### 2.  Hash Joins

Hash joins are actually the go-to technique in Postgres.  With a hash join, Postgres will *first hash* the values of the smaller table.  Then it will proceed through each of the values of the second, larger table, and for each value look up the corresponding values in the hash.

Let's take our list of values from above.

* customers.id, name
    * 1 sam
    * 2 bob
    * 4 tina
    * 6 clayton

* orders.customer_id, product_name
    * 1               phone
    * 4               camera
    * 6               watch
    * 1               tshirt

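This time, we treat `orders` as the smaller table, so its `customer_id` column is hashed.  Each key in the hash maps to the list of order rows with that `customer_id`.  Then Postgres will scan through the `customers` table, looking up each `id` in the hash to find a match.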

```python
orders_customer_id = {1: [{'customer_id': 1, 'product': 'phone'}, {'customer_id': 1, 'product': 'tshirt'}],
                     4: [{'customer_id': 4, 'product': 'camera'}],
                     6: [{'customer_id': 6, 'product': 'watch'}]}
```

* Sequential scan through the `customers` rows, checking each `id` against the hash

```python
customers = [{'id': 1, 'name': 'sam'},
             {'id': 2, 'name': 'bob'},
             {'id': 4, 'name': 'tina'},
             {'id': 6, 'name': 'clayton'}]
```
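Putting the two steps together, a hash join over this data might look like the sketch below.  A plain Python dictionary stands in for the database's hash table, and the `hash_join` function and its parameter names are assumptions made for illustration only.

```python
from collections import defaultdict

customers = [{'id': 1, 'name': 'sam'},
             {'id': 2, 'name': 'bob'},
             {'id': 4, 'name': 'tina'},
             {'id': 6, 'name': 'clayton'}]

orders = [{'customer_id': 1, 'product': 'phone'},
          {'customer_id': 4, 'product': 'camera'},
          {'customer_id': 6, 'product': 'watch'},
          {'customer_id': 1, 'product': 'tshirt'}]


def hash_join(build_rows, build_key, probe_rows, probe_key):
    """Hash the build rows by key, then scan the probe rows for matches."""
    # Build phase: group the build rows by their join key.
    buckets = defaultdict(list)
    for row in build_rows:
        buckets[row[build_key]].append(row)

    # Probe phase: one sequential scan, one hash lookup per probe row.
    joined = []
    for probe_row in probe_rows:
        for build_row in buckets.get(probe_row[probe_key], []):
            joined.append({**probe_row, **build_row})
    return joined


result = hash_join(orders, 'customer_id', customers, 'id')
for row in result:
    print(row)
# {'id': 1, 'name': 'sam', 'customer_id': 1, 'product': 'phone'}
# {'id': 1, 'name': 'sam', 'customer_id': 1, 'product': 'tshirt'}
# {'id': 4, 'name': 'tina', 'customer_id': 4, 'product': 'camera'}
# {'id': 6, 'name': 'clayton', 'customer_id': 6, 'product': 'watch'}
```

Customer 2 has no entry in the hash, so it produces no joined row.  Each table is read only once: once to build the hash and once to probe it.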

### Seeing it in action

`select * from movie_actors join actors on actors.id = movie_actors.actor_id;`


<img src="./explain_hash.png" width="100%">

### Summary 

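In this lesson, we saw how joins work.  One technique is a sort and merge, where the data is first sorted and then two pointers walk through the sorted columns looking for matches.  The other is a hash join, the go-to technique in Postgres, where the smaller table is hashed and the larger table is scanned for matching values.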

To speed up joins, only load the tables that are necessary, reduce the data as much as possible before joining, join on integer columns (or indexed columns like foreign keys and primary keys), and try to have the smaller table on the left side.

### Resources

[Data with Bert](https://bertwagner.com/posts/hash-match-join-internals/)

[Vertica - Hash Join vs Merge Join](https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/Optimizations/HashJoinsVs.MergeJoins.htm)

[Joins on Postgres](https://www.cybertec-postgresql.com/en/join-strategies-and-performance-in-postgresql)