# Understanding Joins

### Introduction

In this lesson, we'll look at one of the more costly SQL operations: the join.  Let's get into it.

### How Joins Work

This lesson covers two of the join strategies that Postgres can choose from (a third, the nested loop join, is not covered here):
    
1. A sort and merge join (AKA merge join) and
2. A hash join

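A sort and merge join is typically faster than a hash join when both inputs are already sorted on the join key.  When they are not, the extra sort can make the hash join the cheaper choice.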

### Sort and Merge

With a sort and merge join, both of the joined tables are sorted by the joining key, and then merged.  For example, if we are joining customers and orders, and we have the following values:

* customers.customer_id 
    * 1 
    * 2
    * 4
    * 6

* orders.customer_id 
    * 1
    * 4
    * 6
    
Postgres will do the following.  First, it starts with the first `customers.customer_id` value, 1, and compares it with the first `orders.customer_id` value.  There is a match, so it lines up the two rows (this is the merge).  Moving to the second record, with `customers.customer_id = 2`, it immediately sees that `orders.customer_id` does not have a 2, because the next highest value is 4.  Since both sides are sorted, it can move on to the next customer without scanning the rest of orders.
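
The walk described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Postgres implements it: the function name `merge_join` is made up for this example, and it assumes both inputs are already sorted and that the left side (customers) has unique keys.

```python
def merge_join(left_keys, right_keys):
    """Return the matching keys from two sorted lists.

    Assumes both lists are sorted ascending and left_keys has no duplicates.
    """
    matches = []
    i = j = 0
    while i < len(left_keys) and j < len(right_keys):
        if left_keys[i] == right_keys[j]:
            # Match: line up the two rows, then look at the next right row.
            matches.append(left_keys[i])
            j += 1
        elif left_keys[i] < right_keys[j]:
            # No (more) matches for this left key: move to the next one.
            i += 1
        else:
            # Right key has no partner on the left: skip it.
            j += 1
    return matches


customers = [1, 2, 4, 6]  # customers.customer_id
orders = [1, 4, 6]        # orders.customer_id
result = merge_join(customers, orders)
print(result)
```

Each list is read only once from start to end, which is why the merge step itself is cheap once the inputs are sorted.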

Postgres automatically creates an index on primary keys (and unique constraints), but not on foreign keys.  If the foreign key column has an index as well, both tables can be read in sorted order, and so the only real step is the merge operation.  

So when both inputs are already sorted, a merge join is a fast option for Postgres.

### Hash Joins

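With a hash join, Postgres will first build a hash table on the join key of the inner table (the planner picks this table, usually the smaller of the two, regardless of where it appears in the join syntax), and then scan the other table, probing the hash table for any matches.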

The cost of the hash join is low if the hash table for the inner table's join column fits into memory, but is significantly higher if it needs to be written to disk.
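
To make the two phases concrete, here is a minimal Python sketch of an in-memory hash join. It is an illustration only: the function name `hash_join`, the row layout, and the sample names and order ids are all assumptions made for this example, and it ignores the case where the hash table spills to disk.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, key):
    """Join two lists of dict rows on `key` using an in-memory hash table."""
    # Build phase: hash the join key of the (ideally smaller) table.
    buckets = defaultdict(list)
    for row in build_rows:
        buckets[row[key]].append(row)

    # Probe phase: scan the other table once and look up each key.
    joined = []
    for row in probe_rows:
        for match in buckets.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical sample data for illustration.
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
    {"customer_id": 4, "name": "Cy"},
    {"customer_id": 6, "name": "Di"},
]
orders = [
    {"customer_id": 1, "order_id": 10},
    {"customer_id": 4, "order_id": 11},
    {"customer_id": 6, "order_id": 12},
]

joined = hash_join(customers, orders, "customer_id")
for row in joined:
    print(row)
```

The memory used grows with the size of `build_rows`, which is why it helps when the hashed table is the smaller one.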

### Techniques to Speed Up Joins

Now that we understand a bit about how joins work, let's talk through some techniques for reducing the cost of joins.  

1. Only join tables that are necessary

> This makes sense -- the fewer tables we need to load and perform a join operation on, the faster.

2. Reduce the data as much as possible before any joins.
> For example, we can optimize our joins with a where clause when joining the tables.  The optimizer will apply the where clause before joining the tables. 

> However, it will generally not perform the group by first, so you can move that to a subquery before joining.  [See more](https://www.cybertec-postgresql.com/en/postgresql-speeding-up-group-by-and-joins/).

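3. When joining, try to put smaller tables on the left side of the join syntax, as smaller tables are easier to store in memory.  Note that for inner joins the Postgres planner usually chooses the join order itself, so this matters most when the planner cannot reorder the join.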

4. Prefer joining on INT columns over other types, as integer comparisons are faster.
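
As a sketch of technique 2, the example below writes the same query two ways: joining and then grouping, and grouping in a subquery before joining. To keep it runnable without a Postgres server it uses SQLite through Python's built-in `sqlite3` module, so it only shows that the two forms return the same rows; it does not measure the speedup, which depends on the Postgres planner and your data. The table contents, names, and amounts are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (4, 'Cy'), (6, 'Di');
    INSERT INTO orders VALUES
        (10, 1, 5.0), (11, 1, 7.0), (12, 4, 3.0), (13, 6, 9.0);
""")

# Join first, then group: every order row takes part in the join.
join_then_group = """
    SELECT c.customer_id, c.name, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY c.customer_id
"""

# Group first in a subquery: only one row per customer reaches the join.
group_then_join = """
    SELECT c.customer_id, c.name, t.total
    FROM customers c
    JOIN (
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
    ) t ON t.customer_id = c.customer_id
    ORDER BY c.customer_id
"""

rows_a = conn.execute(join_then_group).fetchall()
rows_b = conn.execute(group_then_join).fetchall()
print(rows_a == rows_b)
```

In Postgres, running each form under `EXPLAIN ANALYZE` is a reasonable way to check whether the rewrite actually helps for a given dataset.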

### Summary 

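In this lesson, we saw how joins work.  We looked at two strategies: the sort and merge join, where the data is first sorted and then the algorithm walks both tables looking for matches, and the hash join, where the keys of one table are hashed and the other table is scanned for matches.  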

To speed up joins, only join tables that are necessary, reduce the data as much as possible before joining, join on int columns (or indexed columns, such as primary keys and indexed foreign keys), and try to have the smaller table on the left side.

### Resources

[Vertica - Hash join vs Merge Join](https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/Optimizations/HashJoinsVs.MergeJoins.htm)

[Joins on Postgres](https://www.cybertec-postgresql.com/en/join-strategies-and-performance-in-postgresql)