# Self Joins

### Introduction

In this lesson, we'll work through self joins.  A self join is when a table is joined with itself.

### Loading our data

In [15]:
import sqlite3
conn = sqlite3.connect('users.db')

In [16]:
import pandas as pd
root_url = "https://raw.githubusercontent.com/jigsawlabs-student/curriculum-images/main/has-many-through-bar/data/"
names = ['customers']
loaded_dfs = [pd.read_csv(f'{root_url}{name}.csv') for name in names]

In [28]:
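# The customers data doubles as our students table
students_df = loaded_dfs[0]
# Give each student a tutor_id that points at another student's id
students_df = students_df.assign(tutor_id = [3, 1, 2])
students_df.to_sql('students', conn, index = False,
                   if_exists = 'replace')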

3

### Getting to the self-joins

Now let's take a look at our data.

In [29]:
query = """
select * from students
"""
pd.read_sql(query, conn)

|   | id | name | hometown | birthyear | tutor_id |
|---|----|------|----------|-----------|----------|
| 0 | 1 | bart simpson | springfield | 2008 | 3 |
| 1 | 2 | maggie simpson | milwaukee | 2016 | 1 |
| 2 | 3 | lisa simpson | philly | 2006 | 2 |


As we can see, each student in our table now has a `tutor_id`.  Now what if we wanted to see each student paired with their tutor?

In [33]:
query = """
select s.name, tutors.name tutor_name from students s
join students  as tutors
on s.tutor_id = tutors.id
"""
pd.read_sql(query, conn)

|   | name | tutor_name |
|---|------|------------|
| 0 | bart simpson | lisa simpson |
| 1 | maggie simpson | bart simpson |
| 2 | lisa simpson | maggie simpson |


So we can see from the above that bart gets tutored by lisa.

Notice that to accomplish this we `select ... from students` and give it the alias `s`, and then we join that same `students` table under a different alias, `tutors`.  With two aliases, we can treat the one table as if it were two separate tables and join them on `s.tutor_id = tutors.id`.

### Using self joins

There are various use cases for self joins.

* Self-referential tables

Above we see that the table itself has a self-referential foreign key.  So that's one use case -- when the "foreign key" and primary key are on the same table.
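As a minimal sketch of what a self-referential foreign key looks like when declared explicitly, the block below builds a small `students` table in an in-memory database. The `create table` statement and the `:memory:` connection are assumptions for illustration (the lesson itself creates the table with `to_sql`, which declares no keys), and SQLite does not enforce the reference unless `pragma foreign_keys` is turned on.

```python
import sqlite3

# In-memory database so this sketch does not touch users.db
conn = sqlite3.connect(":memory:")

# tutor_id refers back to the id column of the same table.
# Note: SQLite only enforces this when `pragma foreign_keys = on`;
# here the declaration simply documents the relationship.
conn.execute("""
create table students (
    id integer primary key,
    name text,
    tutor_id integer references students(id)
)
""")
conn.executemany(
    "insert into students (id, name, tutor_id) values (?, ?, ?)",
    [(1, "bart simpson", 3), (2, "maggie simpson", 1), (3, "lisa simpson", 2)],
)

# Same self join as in the lesson: follow tutor_id back into students
rows = conn.execute("""
select s.name, tutors.name as tutor_name
from students s
join students as tutors on s.tutor_id = tutors.id
order by s.id
""").fetchall()

for name, tutor_name in rows:
    print(name, "is tutored by", tutor_name)
```

Because the key and the column it refers to live in one table, the only way to follow the relationship is to join the table to itself.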

* Sequences - Performing Lags

Another case is when we are asked to compare each row of data with the row before or after it in a sequence.

In [91]:
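import pandas as pd
df = pd.read_csv('./weather_central_park.csv')
# Keep only the first two columns: the date and the maximum temperature
df = df.iloc[:, :2]
df = df.rename(columns = {'maximum temperature': 'max_temp'})
# Drop two rows, which leaves gaps in the index that becomes our id
df = df.drop(labels = [9, 13])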

In [92]:
df.to_sql('temperatures', conn, index = True, index_label = 'id', if_exists = 'replace')

364

In [93]:
temperature_df = pd.read_sql('select * from temperatures limit 3', conn)
temperature_df

|   | id | date | max_temp |
|---|----|------|----------|
| 0 | 0 | 1-1-2016 | 42 |
| 1 | 1 | 2-1-2016 | 40 |
| 2 | 2 | 3-1-2016 | 45 |


Now what if we want to find the difference between each day's temperature and the previous day's?  One way is with the `lag` window function.  But another way is with a self join.

In [94]:
pd.read_sql("""select t1.id, t1.date t1_date, t2.date t2_date,
t1.max_temp t1_max_temp, t2.max_temp t2_max_temp
from temperatures t1 
join temperatures t2 on t1.id = t2.id - 1 limit 3""", conn)

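|   | id | t1_date | t2_date | t1_max_temp | t2_max_temp |
|---|----|---------|---------|-------------|-------------|
| 0 | 0 | 1-1-2016 | 2-1-2016 | 42 | 40 |
| 1 | 1 | 2-1-2016 | 3-1-2016 | 40 | 45 |
| 2 | 2 | 3-1-2016 | 4-1-2016 | 45 | 36 |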


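So we can see above that we joined a table to itself, for the purpose of pairing each row with the row that follows it.

And we did so by aligning the rows where `t1.id = t2.id - 1`, which only occurs where the `t2` id is one greater than the `t1` id.  So in each result row, `t1` holds one day and `t2` holds the next.  Because an inner join only keeps matches, a row whose neighboring id is missing (we dropped ids 9 and 13 above) does not get a pair.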
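The query above lines the rows up but stops short of subtracting them. The sketch below computes the day-over-day change both ways, with a self join and with `lag`, on a small made-up table. The readings, the `day N` labels, and the `cur` / `prev` aliases are all hypothetical, id 2 is left out on purpose to mimic the dropped rows, and `lag` assumes an SQLite build with window function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table temperatures (id integer primary key, date text, max_temp integer)"
)
# Made-up readings; id 2 is deliberately missing to mimic a dropped row
conn.executemany(
    "insert into temperatures values (?, ?, ?)",
    [(0, "day 0", 42), (1, "day 1", 40), (3, "day 3", 36), (4, "day 4", 50)],
)

# Self join: cur is the current row, prev is the row whose id is exactly one lower
self_join_rows = conn.execute("""
select cur.id, cur.max_temp - prev.max_temp as change
from temperatures cur
join temperatures prev on prev.id = cur.id - 1
order by cur.id
""").fetchall()

# lag: compares with the previous row in id order, whatever its id happens to be
lag_rows = conn.execute("""
select id, max_temp - lag(max_temp) over (order by id) as change
from temperatures
order by id
""").fetchall()

print(self_join_rows)  # [(1, -2), (4, 14)]
print(lag_rows)        # [(0, None), (1, -2), (3, -4), (4, 14)]
```

The two approaches agree wherever the ids are consecutive. They differ at the gap: the self join has no row for id 3 because there is no id 2 to match, while `lag` compares id 3 with id 1. The first row also differs, since `lag` keeps it with a null change and the inner join drops it.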

### Summary

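In this lesson, we saw how to use a self join, where we join a table to itself under two different aliases.  We used it to follow a self-referential foreign key (students and their tutors) and to line up each row with its neighbor in a sequence (daily temperatures).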