# Optimizing Joins

### Introduction

Ok, so now that we understand a bit about why joins can take a while to perform, let's move through some strategies to reduce the cost of joins.

Just as a reminder, our hash join strategy has a time complexity of O(n + k + n), which simplifies to O(n) because k is at most n.  The three terms correspond to the steps of building the dictionary (n), looping through the smaller table (k), and then moving through the matching records in the corresponding hash, which will cost at most n.
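To make those steps concrete, here is a minimal sketch of a hash join in plain Python. The function name `hash_join`, the sample tables, and the choice of which table gets hashed are all assumptions made for illustration; a real database engine is far more involved.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Join two lists of dicts where build_rows[build_key] == probe_rows[probe_key]."""
    # Build step (n): one pass over the hashed table, grouping rows by join key.
    buckets = defaultdict(list)
    for row in build_rows:
        buckets[row[build_key]].append(row)

    # Probe step (k): one pass over the smaller table, looking up each key.
    joined = []
    for probe in probe_rows:
        # Matching step: walk only the rows stored under that key.
        for match in buckets.get(probe[probe_key], []):
            joined.append({**match, **probe})
    return joined


# Hypothetical sample data.
orders = [
    {"order_id": 1, "customer_id": 10, "total": 25},
    {"order_id": 2, "customer_id": 11, "total": 40},
    {"order_id": 3, "customer_id": 10, "total": 15},
]
customers = [
    {"customer_id": 10, "name": "Ada"},
    {"customer_id": 12, "name": "Grace"},
]

rows = hash_join(orders, customers, "customer_id", "customer_id")
```

Only the two orders for customer 10 end up in `rows`. Customer 12 has no orders and order 2 has no matching customer, so both are dropped, just as in an inner join.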

In practice, the larger cost is building the hashed data and potentially storing it on disk.  Remember that if we are hashing an entire table, this can become quite costly.

So now let's move through some steps to speed up our joins.

### Techniques to Speed Up Joins

Now that we understand a bit about how joins work, here are some techniques for reducing their cost.

1. Only join tables that are necessary

> This makes sense -- the fewer tables we need to load and perform a join operation on, the faster.

2. Reduce the data as much as possible before any joins.

> For example, we can optimize our joins with a `WHERE` clause when joining the tables.  The optimizer will generally apply the `WHERE` filter before joining the tables, so fewer rows take part in the join.

3. Perform group bys first to reduce the data

> The optimizer generally *will not* know to perform a group by first.  But remember, group bys will reduce our data.  So you can move the group by to a subquery or CTE to reduce the data before joining.  [See more](https://www.cybertec-postgresql.com/en/postgresql-speeding-up-group-by-and-joins/).

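4. Index columns that are frequently joined.

> For example, a primary key is indexed automatically.  Foreign key columns are indexed automatically in some databases but not in others (PostgreSQL, for instance, does not), so you may need to create that index yourself.  With an index in place, the database can look up matching rows directly, so the hash does not need to be built during each join.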

5. Join on integer columns where possible.

> Integers are generally cheaper to hash and compare than larger types such as strings, which makes the join faster.

6. Use an outer join where necessary, but know that it's generally less performant than an inner join.

> Remember that an outer join returns rows even when there is no matching id on the joined table (as opposed to an inner join, which only returns data when there is a match).  Returning these extra rows is generally slower.
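As a sketch of techniques 2 and 3 above, the snippet below uses Python's built-in `sqlite3` module with an in-memory database. The `customers` and `orders` tables, their columns, and the sample rows are all hypothetical. SQLite stands in for whatever database you are using, so treat this as an illustration of the query shapes rather than a benchmark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER);
    INSERT INTO customers VALUES (1, 'Ada', 'UK'), (2, 'Grace', 'US'), (3, 'Linus', 'FI');
    INSERT INTO orders VALUES (1, 1, 25), (2, 1, 15), (3, 2, 40), (4, 3, 10), (5, 2, 5);
""")

# Technique 2: filter with WHERE so that fewer rows take part in the join.
filtered = conn.execute("""
    SELECT c.name, o.total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    WHERE c.country = 'US'
    ORDER BY o.id
""").fetchall()

# Technique 3: aggregate orders in a CTE first, then join one row per customer.
aggregated = conn.execute("""
    WITH order_totals AS (
        SELECT customer_id, SUM(total) AS total_spent
        FROM orders
        GROUP BY customer_id
    )
    SELECT c.name, t.total_spent
    FROM customers AS c
    JOIN order_totals AS t ON t.customer_id = c.id
    ORDER BY c.id
""").fetchall()

conn.close()
```

In the second query, the join only sees one pre-aggregated row per customer instead of every individual order, which is the point of moving the group by into the CTE.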
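For technique 4, here is a small sketch of adding an index on a join column, again using Python's `sqlite3` module. The table names and the index name `idx_orders_customer_id` are assumptions for illustration. `EXPLAIN QUERY PLAN` is SQLite-specific, and the plan text it returns varies between SQLite versions, so the snippet collects the plan without assuming what it says.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 25), (2, 2, 40), (3, 1, 15);
""")

query = """
    SELECT c.name, o.total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    ORDER BY o.id
"""


def plan(connection, sql):
    # The last column of each EXPLAIN QUERY PLAN row describes one step of the plan.
    return [row[-1] for row in connection.execute("EXPLAIN QUERY PLAN " + sql)]


plan_before = plan(conn, query)
rows_before = conn.execute(query).fetchall()

# Index the foreign key column used in the join condition.
conn.execute("CREATE INDEX idx_orders_customer_id ON orders (customer_id)")

plan_after = plan(conn, query)
rows_after = conn.execute(query).fetchall()

conn.close()
```

The index changes how the database may find matching rows, not which rows come back, so `rows_before` and `rows_after` are identical. Printing `plan_before` and `plan_after` shows whether the planner chose to use the new index for this query.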

### Resources

* [Faster Joins](https://crate.io/blog/joins-faster-part-one)
* [Part 2](https://crate.io/blog/lab-notes-how-we-made-joins-23-thousand-times-faster-part-two)
* [Part 3](https://crate.io/blog/lab-notes-how-we-made-joins-23-thousand-times-faster-part-three)