# Cross Joins Lab

### Introduction

Let's review our cross joins problem: pairing each customer with every other customer and comparing their birth years.

In [2]:
import sqlite3
conn = sqlite3.connect('users.db')

In [3]:
import pandas as pd
root_url = "https://raw.githubusercontent.com/jigsawlabs-student/curriculum-images/main/has-many-through-bar/data/"
names = ['bartenders', 'customers', 'drinks', 'orders', 'ingredients', 'ingredients_drinks']
loaded_dfs = [pd.read_csv(f'{root_url}{name}.csv') for name in names]

In [4]:
# Write each DataFrame to a table named after its CSV file
for name, df in zip(names, loaded_dfs):
    df.to_sql(name, conn, index=False, if_exists='replace')

### Performing A Cross Join

In [20]:
import pandas as pd

query = """
with all_differences as
(select c1.id id_1, c1.name name_1, c1.birthyear as birthyear_1, 
c2.id as id_2, c2.name as name_2, c2.birthyear as birthyear_2,
(c1.birthyear - c2.birthyear) as diff 
from customers c1 join customers c2 where c1.id <> c2.id and c1.id > c2.id
)
select abs(min(diff)) as minimum_diff from all_differences
"""

pd.read_sql(query, conn)

| | minimum_diff |
|---|---|
| 0 | 10 |


To understand the above, let's look at just the `all_differences` CTE on its own, without the final `select`.

In [22]:
query = """
select c1.id id_1, c1.name name_1, c1.birthyear as birthyear_1, 
c2.id as id_2, c2.name as name_2, c2.birthyear as birthyear_2,
(c1.birthyear - c2.birthyear) as diff 
from customers c1 join customers c2 where c1.id <> c2.id
and c1.id > c2.id
"""

pd.read_sql(query, conn)

| | id_1 | name_1 | birthyear_1 | id_2 | name_2 | birthyear_2 | diff |
|---|---|---|---|---|---|---|---|
| 0 | 2 | maggie simpson | 2016 | 1 | bart simpson | 2008 | 8 |
| 1 | 3 | lisa simpson | 2006 | 1 | bart simpson | 2008 | -2 |
| 2 | 3 | lisa simpson | 2006 | 2 | maggie simpson | 2016 | -10 |

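To see what the `where` clause is filtering, it can help to count the rows at each stage. In SQLite, a `join` with no `on` clause pairs every row with every row, which is what makes this a cross join. The following is a minimal sketch, not part of the lab's notebook: it assumes the `customers` table holds only the three rows visible in the output above, and rebuilds them in a hypothetical in-memory database (`demo_conn`) with a hypothetical helper `count_pairs`.

```python
import sqlite3

# Hypothetical in-memory copy of the three customers shown in the output above.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("create table customers (id integer, name text, birthyear integer)")
demo_conn.executemany(
    "insert into customers values (?, ?, ?)",
    [(1, "bart simpson", 2008), (2, "maggie simpson", 2016), (3, "lisa simpson", 2006)],
)

def count_pairs(where_clause=""):
    """Count the rows produced by joining customers to itself."""
    sql = f"select count(*) from customers c1 join customers c2 {where_clause}"
    return demo_conn.execute(sql).fetchone()[0]

all_pairs = count_pairs()                             # every row paired with every row
no_self_pairs = count_pairs("where c1.id <> c2.id")   # drops pairs like (bart, bart)
unique_pairs = count_pairs("where c1.id > c2.id")     # keeps one ordering of each pair

print(all_pairs, no_self_pairs, unique_pairs)
```

With three customers, the unfiltered join has 3 x 3 = 9 rows, excluding self-pairs leaves 6, and keeping only one ordering of each pair leaves 3, matching the three rows in the output above.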

Here we joined the table with itself and calculated the difference between birth years. The clause `c1.id <> c2.id` ensures the ids are never equal, so no person is ever paired with themselves.

On its own, that clause would still leave each pairing in the result twice. In other words, we would see bart and maggie, and then also maggie and bart.

To remove this duplication, the where clause is:

`where c1.id <> c2.id and c1.id > c2.id`

This way we only keep one of the duplicate pairings: those where the higher id is on the left. Notice that we can get the same result by removing the `<>` clause, because `c1.id > c2.id` already rules out equal ids.

In [23]:
import pandas as pd

query = """
with all_differences as
(select c1.id id_1, c1.name name_1, c1.birthyear as birthyear_1, 
c2.id as id_2, c2.name as name_2, c2.birthyear as birthyear_2,
(c1.birthyear - c2.birthyear) as diff 
from customers c1 join customers c2 where c1.id > c2.id
)
select abs(min(diff)) as minimum_diff from all_differences
"""

pd.read_sql(query, conn)

| | minimum_diff |
|---|---|
| 0 | 10 |

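One detail worth checking is the order of `abs` and `min`. `abs(min(diff))` first finds the most negative difference and then takes its magnitude, while `min(abs(diff))` finds the smallest gap between any two customers. The sketch below compares the two; it is an illustration only, assuming the `customers` table contains just the three rows shown earlier, rebuilt here in a hypothetical in-memory database (`demo_conn`).

```python
import sqlite3

# Hypothetical in-memory copy of the three customers shown earlier.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("create table customers (id integer, name text, birthyear integer)")
demo_conn.executemany(
    "insert into customers values (?, ?, ?)",
    [(1, "bart simpson", 2008), (2, "maggie simpson", 2016), (3, "lisa simpson", 2006)],
)

query = """
with all_differences as (
    select (c1.birthyear - c2.birthyear) as diff
    from customers c1 join customers c2 where c1.id > c2.id
)
select abs(min(diff)) as abs_of_min,   -- magnitude of the most negative diff
       min(abs(diff)) as min_of_abs,   -- smallest gap between two customers
       max(abs(diff)) as max_of_abs    -- largest gap between two customers
from all_differences
"""

abs_of_min, min_of_abs, max_of_abs = demo_conn.execute(query).fetchone()
print(abs_of_min, min_of_abs, max_of_abs)
```

For these three rows the differences are 8, -2 and -10, so `abs(min(diff))` is 10 (the gap between lisa and maggie), whereas `min(abs(diff))` is 2 (the gap between lisa and bart). Which one is wanted depends on whether the question asks for the smallest or the largest gap.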

### Resources

[Stack Overflow - Cross join](https://stackoverflow.com/questions/219716/what-are-the-uses-for-cross-join)