# Identifying Gaps - With Islands

### Introduction

In this lesson, we'll see yet another way to identify gaps, and that is by using the islands technique.

### Result

Let's start by loading our data.

In [1]:
import pandas as pd
import sqlite3
conn = sqlite3.connect('users.db')
url = "https://raw.githubusercontent.com/tech-interviews-jigsaw/sql-advanced-joins/main/6-common-strategies/island_sequence.csv"

df = pd.read_csv(url)

In [2]:
df.to_sql('numbers', conn, index = True,
          index_label = 'id', if_exists = 'replace')

13

In [3]:
pd.read_sql("select * from numbers", conn)

Unnamed: 0,id,number
0,0,1
1,1,2
2,2,5
3,3,6
4,4,7
5,5,8
6,6,9
7,7,10
8,8,12
9,9,13


Ok, so the table holds 13 rows (the output above shows only the first 10), and if you look at the numbers, you'll see that they form 4 islands, where an island is a run of consecutive values. We need to write a SQL query to identify them.

This is what the end result should look like.

<img src="./island-answer.png" width="40%">

### Comparing to gaps

Notice that this is different from the gaps problem. In the gaps problem, there are only 3 gaps in the data:
* 2 to 5
* 10 to 12
* 15 to 20

But there are four islands, because the final value, 20, forms an island of its own (20 to 20).

So the first island is 1 to 2, because that is where the first set of consecutive numbers is.  Then we have 5 to 10, and so on. 
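To make the islands and gaps concrete before turning to SQL, here is a minimal plain-Python sketch. The `numbers` list is an assumption: it is the full 13-row sequence inferred from the row count and the final result in this lesson, and `find_islands` and `find_gaps` are hypothetical helper names used only for illustration.

```python
# Assumed full contents of the `number` column (13 rows), inferred from
# the row count and the island results shown in this lesson.
numbers = [1, 2, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 20]


def find_islands(values):
    """Return (start, end) pairs, one per run of consecutive integers."""
    values = sorted(values)
    if not values:
        return []
    islands = []
    start = prev = values[0]
    for value in values[1:]:
        if value != prev + 1:  # a jump means the current island has ended
            islands.append((start, prev))
            start = value
        prev = value
    islands.append((start, prev))  # close the final island
    return islands


def find_gaps(islands):
    """A gap runs from the end of one island to the start of the next."""
    return [(end, nxt_start)
            for (_, end), (nxt_start, _) in zip(islands, islands[1:])]


islands = find_islands(numbers)
gaps = find_gaps(islands)
print(islands)
print(gaps)
```

Under that assumption, the sketch finds four islands and three gaps, one gap between each pair of neighboring islands.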

We can take a different approach for identifying our four islands.

### Our first steps

The first step is to add a row number column.

In [5]:
query = """SELECT number
    ,row_number() OVER (
        ORDER BY number
        ) AS row_num
FROM numbers limit 4"""

pd.read_sql(query, conn)

Unnamed: 0,number,row_num
0,1,1
1,2,2
2,5,3
3,6,4


Now look at what happens if we subtract row number from number.

In [8]:
query = """with differences as (
SELECT number,
row_number() OVER (ORDER BY number) AS row_num 
FROM numbers)
select number, row_num, number - row_num as diff from differences
limit 6"""

pd.read_sql(query, conn)

Unnamed: 0,number,row_num,diff
0,1,1,0
1,2,2,0
2,5,3,2
3,6,4,2
4,7,5,2
5,8,6,2


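Ok, so you can see that our data is put into groups, and these groups are our islands. Within an island, both `number` and `row_num` increase by 1 from one row to the next, so the difference between them stays constant. When a gap occurs, `number` jumps by more than 1 while `row_num` still increases by only 1, so the difference grows.

So from our second to third row, the difference jumps from 0 to 2. This is because our `number` increases from 2 to 5 instead of from 2 to 3, as it would if the sequence were consecutive.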
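The same arithmetic can be sketched in plain Python to see which numbers share a difference. This is only an illustration: the `numbers` list is an assumption (the full 13-row sequence inferred from this lesson's results), and `enumerate(..., start=1)` stands in for `row_number()`.

```python
from collections import defaultdict

# Assumed full contents of the `number` column, inferred from the lesson.
numbers = [1, 2, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 20]

# Collect the numbers that share the same (number - row_num) difference.
groups = defaultdict(list)
for row_num, number in enumerate(sorted(numbers), start=1):
    diff = number - row_num  # constant inside an island, larger after a gap
    groups[diff].append(number)

for diff, members in groups.items():
    print(diff, members)
```

Each distinct difference collects exactly one run of consecutive numbers, so the difference can serve as a label for the island.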

### Next step

So how do we get back to identifying the beginning and end of our islands?

Well, we can just group by `diff`, which acts as the island identifier, and find the minimum and maximum `number` in each group.

In [11]:
query = """with differences as (
SELECT number,
row_number() OVER (ORDER BY number) AS row_num 
FROM numbers), 
islands as(
select number, row_num, number - row_num as diff
from differences
)
select min(number) island_start, max(number) island_end from islands group by diff order by island_start

"""

pd.read_sql(query, conn)

Unnamed: 0,island_start,island_end
0,1,2
1,5,10
2,12,15
3,20,20


So by grouping by `diff` and taking the minimum and maximum `number` in each group, we get the start and end of every island.
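For readers who want to try the whole technique without the CSV file, here is a self-contained sketch using an in-memory SQLite database. The hard-coded `numbers` list is an assumption (the 13 values inferred from this lesson's results), and the `island_size` column is an addition for illustration that is not part of the lesson's query. Window functions such as `row_number()` require SQLite 3.25 or newer.

```python
import sqlite3

# Assumed full contents of the `number` column, inferred from the lesson.
numbers = [1, 2, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 20]

conn = sqlite3.connect(":memory:")  # throwaway database for the sketch
conn.execute("CREATE TABLE numbers (id INTEGER PRIMARY KEY, number INTEGER)")
conn.executemany("INSERT INTO numbers (number) VALUES (?)",
                 [(n,) for n in numbers])

query = """
WITH differences AS (
    SELECT number,
           number - ROW_NUMBER() OVER (ORDER BY number) AS diff
    FROM numbers
)
SELECT MIN(number) AS island_start,
       MAX(number) AS island_end,
       COUNT(*)    AS island_size   -- illustrative extra column
FROM differences
GROUP BY diff
ORDER BY island_start               -- make the row order deterministic
"""

islands = conn.execute(query).fetchall()
conn.close()

for island_start, island_end, island_size in islands:
    print(island_start, island_end, island_size)
```

One caveat: subtracting `row_number()` works here because every value in `number` is distinct. If the column could contain duplicates, `dense_rank()` is the usual substitute, since it gives equal values the same rank.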