# Gaps Self Joins Lab

### Introduction

In this lesson, we take on a classic problem: finding the gaps in a sequence of numbers.  Let's get started.

### Loading data

We can begin by loading our sequence of numbers.

In [15]:
import pandas as pd
import sqlite3
conn = sqlite3.connect('users.db')
root_url = "https://raw.githubusercontent.com/tech-interviews-jigsaw/sql-advanced-joins/main/6-common-strategies/"
custom = "simple_sequence.csv"
df = pd.read_csv(f"{root_url}/{custom}")

In [16]:
df.to_sql('numbers', conn, index = True,
          index_label = 'id', if_exists = 'replace')

14

In [4]:
pd.read_sql("select * from numbers", conn)

Unnamed: 0,id,number
0,0,1
1,1,2
2,2,3
3,3,5
4,4,6
5,5,7
6,6,8
7,7,9
8,8,10
9,9,12


If we look at the sequence, we can see that the numbers 4, 11, and 16 are missing.  (The output above only displays rows up to 12, so the gap at 16 is not visible there.)  Your task is to return a list of these missing numbers.

### Using self joins

We accomplish this by performing a left self join where the left table's number minus one equals the other table's number.  That is, we line up each number with its preceding number.  Then, wherever there is no preceding number, the left join returns null.
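Before adding any filter, it can help to look at what the left self join produces on its own. The sketch below is only an illustration: it uses an in-memory SQLite database, a short made-up sequence with a single gap at 4, and a column alias `preceding` that is not part of the lab.

```python
import sqlite3

# Assumed stand-in data (not the lab's CSV): a short sequence with one gap at 4.
conn = sqlite3.connect(":memory:")
conn.execute("create table numbers (id integer primary key, number integer)")
conn.executemany("insert into numbers (number) values (?)",
                 [(n,) for n in [1, 2, 3, 5, 6]])

# Left self join with no where clause: every number is kept,
# paired with its preceding number when that number exists.
rows = conn.execute("""
select n1.number, n2.number as preceding
from numbers n1
left join numbers n2 on n1.number - 1 = n2.number
order by n1.number
""").fetchall()

for number, preceding in rows:
    print(number, preceding)
```

In this toy sequence, the rows for 1 and 5 come back with `None` in the `preceding` column, and those are the rows that a `where n2.number is null` filter would keep.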

In [22]:
query = """
select n1.number, n2.number from numbers n1 
left join numbers n2 on n1.number - 1 = n2.number
where n2.number is null
"""
pd.read_sql(query, conn)

# Target output:
#    gap_num
# 0  4
# 1  11
# 2  16

Unnamed: 0,number,number.1
0,1,
1,5,
2,12,
3,17,


This initial query is close, but we want to remove that initial number of 1.  Yes, there's no matching preceding number there, but it's not because of a gap -- it's just the lower bound.

So we update the query to find the lower bound number and exclude it.

In [20]:
query = """
select n1.number, n2.number from numbers n1 
left join numbers n2 on n1.number - 1 = n2.number
where n2.number is null and 
n1.number != (select min(numbers.number) from numbers)
"""
pd.read_sql(query, conn)

# Target output:
#    gap_num
# 0  4
# 1  11
# 2  16

Unnamed: 0,number,number.1
0,5,
1,12,
2,17,


Notice that our numbers are close, but we are selecting the number right after each gap -- not the gap itself.  So we fix this by subtracting 1 in the select.

In [23]:
query = """
select n1.number - 1 gap_num from numbers n1 
left join numbers n2 on n1.number - 1 = n2.number
where n2.number is null and 
n1.number != (select min(numbers.number) from numbers)
"""
pd.read_sql(query, conn)

# Target output:
#    gap_num
# 0  4
# 1  11
# 2  16

Unnamed: 0,gap_num
0,4
1,11
2,16
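One way to sanity-check the result is to compute the missing numbers directly in Python and compare them with what the query returns. The sketch below assumes a 14-number sequence consistent with the gaps reported in this lesson (the rows past 12 are an assumption, since they are not displayed above) and loads it into an in-memory database rather than `users.db`.

```python
import sqlite3

# Assumed sequence: matches the gaps reported in the lesson (4, 11, 16),
# but the values after 12 are not shown in the lab output.
numbers = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 17]

conn = sqlite3.connect(":memory:")
conn.execute("create table numbers (id integer primary key, number integer)")
conn.executemany("insert into numbers (number) values (?)",
                 [(n,) for n in numbers])

query = """
select n1.number - 1 as gap_num from numbers n1
left join numbers n2 on n1.number - 1 = n2.number
where n2.number is null
and n1.number != (select min(number) from numbers)
order by gap_num
"""
sql_gaps = [row[0] for row in conn.execute(query)]

# Independent check: everything between min and max that is not in the sequence.
expected = sorted(set(range(min(numbers), max(numbers) + 1)) - set(numbers))

print(sql_gaps)
print(expected)
```

The two lists agree here because every gap is exactly one number wide. If a gap spanned several numbers, the self join as written would report only the missing number directly below an existing row, so the set difference would list more values than the query.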


### The other way

Notice that if, instead of joining on the preceding number, we join on the succeeding number, our statement looks almost the same.

The main difference is that the unfiltered query would now return the last number in the sequence, because this time it is the upper bound that has no succeeding number.  So let's remove that last number by excluding the max.

In [27]:
query = """
select n1.number from numbers n1 
left join numbers n2 on n1.number + 1 = n2.number
where n2.number is null and 
n1.number != (select max(numbers.number) from numbers)
"""
pd.read_sql(query, conn)


Unnamed: 0,number
0,3
1,10
2,15


Finally, notice that we are returning the number just below each gap, not the gap itself.  So we add one in the select.

In [29]:
query = """
select n1.number + 1 gap_num from numbers n1 
left join numbers n2 on n1.number + 1 = n2.number
where n2.number is null and 
n1.number != (select max(numbers.number) from numbers)
"""
pd.read_sql(query, conn)

Unnamed: 0,gap_num
0,4
1,11
2,16
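As a rough check that the two directions are interchangeable, the sketch below runs both variants against the same table and compares the results. The data is a made-up sequence with single-number gaps at 4 and 8, and the variable names `via_preceding` and `via_succeeding` are only for illustration.

```python
import sqlite3

# Assumed stand-in data with single-number gaps at 4 and 8.
numbers = [1, 2, 3, 5, 6, 7, 9]

conn = sqlite3.connect(":memory:")
conn.execute("create table numbers (id integer primary key, number integer)")
conn.executemany("insert into numbers (number) values (?)",
                 [(n,) for n in numbers])

# Join on the preceding number, exclude the lower bound, subtract 1.
preceding_query = """
select n1.number - 1 as gap_num from numbers n1
left join numbers n2 on n1.number - 1 = n2.number
where n2.number is null
and n1.number != (select min(number) from numbers)
order by gap_num
"""

# Join on the succeeding number, exclude the upper bound, add 1.
succeeding_query = """
select n1.number + 1 as gap_num from numbers n1
left join numbers n2 on n1.number + 1 = n2.number
where n2.number is null
and n1.number != (select max(number) from numbers)
order by gap_num
"""

via_preceding = [row[0] for row in conn.execute(preceding_query)]
via_succeeding = [row[0] for row in conn.execute(succeeding_query)]

print(via_preceding)
print(via_succeeding)
```

Both lists match for this data because each gap is one number wide. With wider gaps the two variants would diverge: the preceding join reports the top of each gap and the succeeding join reports the bottom.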
