# SQL Joins

We introduced joins in Module 1. This lecture revisits them in more detail.  
![Joins](../assets/sql-joins.png)


- LEFT JOIN returns all records from the left table, plus the matching records from the right table; left-table records with no match get NULLs in the right table's columns
- RIGHT JOIN returns all records from the right table, plus the matching records from the left table (supported in SQLite only from version 3.39.0)
- INNER JOIN returns only records that have a match in both tables
- FULL JOIN returns all records from both tables, matched where possible (supported in SQLite only from version 3.39.0)
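
To make the difference between LEFT JOIN and INNER JOIN concrete, here is a minimal sketch on an in-memory SQLite database. The `customers` and `orders` tables and the `join_demo` connection are hypothetical toy names, not part of the lecture's data.

```python
import sqlite3

# Hypothetical toy tables in a throwaway in-memory database
join_demo = sqlite3.connect(":memory:")
join_demo.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 10), (1, 20), (4, 99);
""")

# LEFT JOIN: every customer is kept; customers without orders get NULL (None)
left_rows = join_demo.execute("""
    SELECT c.name, o.amount
    FROM customers c
    LEFT JOIN orders o ON c.id = o.customer_id
    ORDER BY c.id, o.amount
""").fetchall()

# INNER JOIN: only customers that have at least one matching order
inner_rows = join_demo.execute("""
    SELECT c.name, o.amount
    FROM customers c
    INNER JOIN orders o ON c.id = o.customer_id
    ORDER BY c.id, o.amount
""").fetchall()

print(left_rows)
print(inner_rows)
```

In both results the order for customer 4 is dropped, because no row in `customers` has that id.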


When joining tables together, it's important to understand the type of relationship between them, because it determines how many rows the join returns.

- One-to-one
- One-to-many
- Many-to-many
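
One way to check which relationship you are dealing with is to count how many rows share each key value on both sides of the join. The sketch below assumes hypothetical `people` and `purchases` tables and a hypothetical helper `max_rows_per_key`; a maximum of 1 on both sides suggests one-to-one, and more than 1 on one side suggests one-to-many.

```python
import sqlite3

# Hypothetical toy tables in a throwaway in-memory database
rel_demo = sqlite3.connect(":memory:")
rel_demo.executescript("""
    CREATE TABLE people (person_id INTEGER, name TEXT);
    CREATE TABLE purchases (person_id INTEGER, item TEXT);
    INSERT INTO people VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO purchases VALUES (1, 'pen'), (1, 'ink'), (2, 'cup');
""")

def max_rows_per_key(connection, table, key):
    """Return the largest number of rows that share a single key value."""
    # table and key are trusted identifiers here; never interpolate user input
    query = f"SELECT MAX(n) FROM (SELECT COUNT(*) AS n FROM {table} GROUP BY {key})"
    return connection.execute(query).fetchone()[0]

people_max = max_rows_per_key(rel_demo, "people", "person_id")
purchases_max = max_rows_per_key(rel_demo, "purchases", "person_id")

# 1 on the people side and 2 on the purchases side: a one-to-many relationship
print(people_max, purchases_max)
```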

Here is the standard SQL syntax for joining two tables:

<code> SELECT a.col1, a.col2, b.col3, b.col4
       FROM table1 a
       _LEFT_/_RIGHT_/_INNER_/_FULL_ JOIN table2 b
       ON a.col1 = b.col2 </code>

More than two tables can be joined in a single query. The columns used in the join condition don't need to share a name (in pandas, `merge` handles this case with `left_on` and `right_on`). You can also join on more than one column:

<code> SELECT a.col1, a.col2, b.col3, b.col4
       FROM table1 a
       _LEFT_/_RIGHT_/_INNER_/_FULL_ JOIN table2 b
       ON a.col1 = b.col2 and a.col3 = b.col3 </code>
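
As an illustration of a join on two columns whose names differ between tables, here is a minimal sketch. The `targets` and `actuals` tables and their columns are hypothetical, chosen only for the example.

```python
import sqlite3

# Hypothetical toy tables: the key columns are named differently on each side
multi_demo = sqlite3.connect(":memory:")
multi_demo.executescript("""
    CREATE TABLE targets (store TEXT, yr INTEGER, target INTEGER);
    CREATE TABLE actuals (shop TEXT, fiscal_year INTEGER, sales INTEGER);
    INSERT INTO targets VALUES ('A', 2020, 100), ('A', 2021, 120), ('B', 2020, 80);
    INSERT INTO actuals VALUES ('A', 2020, 95), ('A', 2021, 130), ('B', 2021, 70);
""")

# Both conditions must hold for a row to match
multi_rows = multi_demo.execute("""
    SELECT t.store, t.yr, t.target, a.sales
    FROM targets t
    LEFT JOIN actuals a
    ON t.store = a.shop AND t.yr = a.fiscal_year
    ORDER BY t.store, t.yr
""").fetchall()

print(multi_rows)
```

Store B has a target for 2020 but actual sales only for 2021, so its `sales` value comes back as NULL.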


In [14]:
import pandas as pd
import sqlite3

In [15]:
# create sqlite db
info = pd.read_csv("../data/customer-info.csv")
loyalty = pd.read_csv("../data/customer-loyalty.csv")
sales2019 = pd.read_csv("../data/sales2019.csv")
sales2020 = pd.read_csv("../data/sales2020.csv")
sales2021 = pd.read_csv("../data/sales2021.csv")
zipcode = pd.read_csv("../data/Zipcode-ZCTA-Population-Density-And-Area-Unsorted.csv")
conn = sqlite3.connect('../data/generalstore.sqlite') 
info.to_sql('info',con=conn,index=False,if_exists='replace')
loyalty.to_sql('loyalty',con=conn,index=False,if_exists='replace')
sales2019.to_sql('sales2019',con=conn,index=False,if_exists='replace')
sales2020.to_sql('sales2020',con=conn,index=False,if_exists='replace')
sales2021.to_sql('sales2021',con=conn,index=False,if_exists='replace')
zipcode.to_sql('zipcode',con=conn,index=False,if_exists='replace')

33144

In [16]:
# Returns the table names, including tables created in previous lessons.

pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'",con=conn)

Unnamed: 0,name
0,SALES_ALL
1,Output
2,info
3,loyalty
4,sales2019
5,sales2020
6,sales2021
7,zipcode


In [17]:
pd.read_sql("""SELECT *
                FROM info
                LIMIT 5""", conn)

Unnamed: 0,CustomerID,Name,CustomerSince,Age,State,ZipCode,Occupation
0,1001,Harry Potter,2020,20,NJ,7960,Teacher
1,1002,Hermione Granger,2021,20,NY,10025,Lawyer
2,1003,Draco Malfoy,2019,20,CT,6807,Manager
3,1004,Albus Dumbledore,2019,70,NJ,7030,Professor
4,1005,Severus Snape,2019,50,NJ,7960,Professor


In [18]:
pd.read_sql("""SELECT *
                FROM zipcode
                LIMIT 5""", conn) # not ideal naming convention

Unnamed: 0,Zip/ZCTA,2010 Population,Land-Sq-Mi,Density Per Sq Mile
0,601,0,64.348,0.0
1,602,0,30.613,0.0
2,603,0,31.616,0.0
3,606,0,42.309,0.0
4,610,0,35.916,0.0


In [19]:
## Left join info with zip code
pd.read_sql("""
            SELECT t1.*, t2.*
            FROM info t1
            LEFT JOIN zipcode t2
            ON t1.ZipCode = t2.[Zip/ZCTA]
            """, conn)

Unnamed: 0,CustomerID,Name,CustomerSince,Age,State,ZipCode,Occupation,Zip/ZCTA,2010 Population,Land-Sq-Mi,Density Per Sq Mile
0,1001,Harry Potter,2020,20,NJ,7960,Teacher,7960.0,43747.0,35.019,1249.236129
1,1002,Hermione Granger,2021,20,NY,10025,Lawyer,10025.0,94600.0,0.752,125797.8723
2,1003,Draco Malfoy,2019,20,CT,6807,Manager,6807.0,7150.0,3.145,2273.449921
3,1004,Albus Dumbledore,2019,70,NJ,7030,Professor,7030.0,50005.0,1.264,39560.91772
4,1005,Severus Snape,2019,50,NJ,7960,Professor,7960.0,43747.0,35.019,1249.236129
5,1006,Ron Weasley,2020,20,PA,19019,Teacher,,,,
6,1007,Dobby,2021,150,NY,10014,House Elf,10014.0,31959.0,0.55,58107.27273
7,1008,Minerva McGonagall,2019,68,NY,10013,Principal,10013.0,27700.0,0.55,50363.63636
8,1009,Neville Longbottom,2020,20,NJ,7035,Botanist,7035.0,10607.0,6.716,1579.362716
9,1010,Luna Lovegood,2020,19,MA,2101,Cryptozoologist,,,,
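
In the output above, customers whose zip code has no match get empty values on the right side, and pandas shows the matched numbers as floats because a column containing missing values cannot stay an integer column. A common follow-up is to list only the unmatched rows by filtering on `IS NULL` after the LEFT JOIN. The sketch below uses hypothetical `info_demo` and `zip_demo` tables rather than the lecture's database.

```python
import sqlite3

# Hypothetical miniature versions of the info and zipcode tables
anti_demo = sqlite3.connect(":memory:")
anti_demo.executescript("""
    CREATE TABLE info_demo (CustomerID INTEGER, ZipCode INTEGER);
    CREATE TABLE zip_demo ("Zip/ZCTA" INTEGER, Population INTEGER);
    INSERT INTO info_demo VALUES (1, 7960), (2, 19019), (3, 10025);
    INSERT INTO zip_demo VALUES (7960, 43747), (10025, 94600);
""")

# Keep only left-table rows for which the join found no partner
unmatched = anti_demo.execute("""
    SELECT t1.CustomerID, t1.ZipCode
    FROM info_demo t1
    LEFT JOIN zip_demo t2
    ON t1.ZipCode = t2.[Zip/ZCTA]
    WHERE t2.[Zip/ZCTA] IS NULL
""").fetchall()

print(unmatched)
```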


In [20]:
## Inner join info with zip code
pd.read_sql("""
            SELECT t1.*, t2.*
            FROM info t1
            JOIN zipcode t2
            ON t1.ZipCode = t2.[Zip/ZCTA]
            """, conn)

Unnamed: 0,CustomerID,Name,CustomerSince,Age,State,ZipCode,Occupation,Zip/ZCTA,2010 Population,Land-Sq-Mi,Density Per Sq Mile
0,1001,Harry Potter,2020,20,NJ,7960,Teacher,7960,43747,35.019,1249.236129
1,1002,Hermione Granger,2021,20,NY,10025,Lawyer,10025,94600,0.752,125797.8723
2,1003,Draco Malfoy,2019,20,CT,6807,Manager,6807,7150,3.145,2273.449921
3,1004,Albus Dumbledore,2019,70,NJ,7030,Professor,7030,50005,1.264,39560.91772
4,1005,Severus Snape,2019,50,NJ,7960,Professor,7960,43747,35.019,1249.236129
5,1007,Dobby,2021,150,NY,10014,House Elf,10014,31959,0.55,58107.27273
6,1008,Minerva McGonagall,2019,68,NY,10013,Principal,10013,27700,0.55,50363.63636
7,1009,Neville Longbottom,2020,20,NJ,7035,Botanist,7035,10607,6.716,1579.362716
8,1011,Katniss Everdeen,2019,25,WA,98101,Crossfit Instructor,98101,10238,0.519,19726.39692
9,1012,Peeta Mellark,2019,25,WA,98101,Baker,98101,10238,0.519,19726.39692


In [21]:
# One-to-many join: each customer in info can match several rows in Sales2019
pd.read_sql("""
            SELECT * 
            FROM info t1
            LEFT JOIN Sales2019 t2
            ON t1.CustomerID = t2.CustomerID""", conn) 

Unnamed: 0,CustomerID,Name,CustomerSince,Age,State,ZipCode,Occupation,CUSTOMERID,Sales2019,item_description,Date
0,1001,Harry Potter,2020,20,NJ,7960,Teacher,,,,
1,1002,Hermione Granger,2021,20,NY,10025,Lawyer,,,,
2,1003,Draco Malfoy,2019,20,CT,6807,Manager,1003.0,5.0,Bed bug spray,11/15/2019
3,1003,Draco Malfoy,2019,20,CT,6807,Manager,1003.0,5.0,Giant roach spray,12/15/2019
4,1003,Draco Malfoy,2019,20,CT,6807,Manager,1003.0,10.0,Hammer,11/16/2019
5,1004,Albus Dumbledore,2019,70,NJ,7030,Professor,1004.0,50.0,Chocolate,12/20/2019
6,1004,Albus Dumbledore,2019,70,NJ,7030,Professor,1004.0,50.0,Chocolate,12/30/2019
7,1005,Severus Snape,2019,50,NJ,7960,Professor,1005.0,20.0,Potions,11/15/2019
8,1006,Ron Weasley,2020,20,PA,19019,Teacher,,,,
9,1007,Dobby,2021,150,NY,10014,House Elf,,,,


In [22]:
# One-to-one join: sales are aggregated to one row per customer before joining
pd.read_sql("""
            SELECT * 
            FROM info t1
            LEFT JOIN (
                    SELECT CUSTOMERID, sum(Sales2019) as Sales2019, count(item_description) as itemcnt
                    FROM Sales2019
                    GROUP BY CUSTOMERID) t2
            ON t1.CustomerID = t2.CustomerID""", conn) 

Unnamed: 0,CustomerID,Name,CustomerSince,Age,State,ZipCode,Occupation,CUSTOMERID,Sales2019,itemcnt
0,1001,Harry Potter,2020,20,NJ,7960,Teacher,,,
1,1002,Hermione Granger,2021,20,NY,10025,Lawyer,,,
2,1003,Draco Malfoy,2019,20,CT,6807,Manager,1003.0,20.0,3.0
3,1004,Albus Dumbledore,2019,70,NJ,7030,Professor,1004.0,100.0,2.0
4,1005,Severus Snape,2019,50,NJ,7960,Professor,1005.0,20.0,1.0
5,1006,Ron Weasley,2020,20,PA,19019,Teacher,,,
6,1007,Dobby,2021,150,NY,10014,House Elf,,,
7,1008,Minerva McGonagall,2019,68,NY,10013,Principal,1008.0,15.0,1.0
8,1009,Neville Longbottom,2020,20,NJ,7035,Botanist,,,
9,1010,Luna Lovegood,2020,19,MA,2101,Cryptozoologist,,,
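
The two cells above can be compared side by side on a small example. This sketch uses hypothetical `cust` and `sales` tables; it shows the one-to-many join repeating customers, while aggregating first keeps one row per customer.

```python
import sqlite3

# Hypothetical toy tables in a throwaway in-memory database
agg_demo = sqlite3.connect(":memory:")
agg_demo.executescript("""
    CREATE TABLE cust (CustomerID INTEGER, Name TEXT);
    CREATE TABLE sales (CustomerID INTEGER, Amount INTEGER);
    INSERT INTO cust VALUES (1003, 'Draco'), (1004, 'Albus'), (1006, 'Ron');
    INSERT INTO sales VALUES (1003, 5), (1003, 5), (1003, 10), (1004, 50), (1004, 50);
""")

# One-to-many: a customer appears once per matching sales row
one_to_many = agg_demo.execute("""
    SELECT t1.CustomerID, t2.Amount
    FROM cust t1
    LEFT JOIN sales t2 ON t1.CustomerID = t2.CustomerID
""").fetchall()

# One-to-one: aggregate sales to one row per customer, then join
one_to_one = agg_demo.execute("""
    SELECT t1.CustomerID, t2.Total, t2.ItemCnt
    FROM cust t1
    LEFT JOIN (
        SELECT CustomerID, SUM(Amount) AS Total, COUNT(*) AS ItemCnt
        FROM sales
        GROUP BY CustomerID) t2
    ON t1.CustomerID = t2.CustomerID
    ORDER BY t1.CustomerID
""").fetchall()

print(len(one_to_many), len(one_to_one))
print(one_to_one)
```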


In [30]:
# Show customer IDs 1001 through 1010 - will fail
# pd.read_sql("""
#             SELECT * 
#             FROM info t1
#             LEFT JOIN (
#                     SELECT CUSTOMERID, SUM(Sales2019) AS Sales2019, COUNT(item_description) AS itemcnt
#                     FROM Sales2019
#                     GROUP BY CUSTOMERID) t2
#             ON t1.CustomerID = t2.CustomerID
#             WHERE CustomerID between 1001 and 1010""", conn) # fails: CustomerID exists in both t1 and t2, so the name is ambiguous

In [24]:
# Show customer IDs 1001 through 1010 (the column is qualified with t1 to remove the ambiguity)
pd.read_sql("""
            SELECT * 
            FROM info t1
            LEFT JOIN (
                    SELECT CUSTOMERID, SUM(Sales2019) AS Sales2019, count(item_description) AS itemcnt
                    FROM Sales2019
                    GROUP BY CUSTOMERID) t2
            ON t1.CustomerID = t2.CustomerID
            WHERE t1.CustomerID between 1001 and 1010""", conn) 

Unnamed: 0,CustomerID,Name,CustomerSince,Age,State,ZipCode,Occupation,CUSTOMERID,Sales2019,itemcnt
0,1001,Harry Potter,2020,20,NJ,7960,Teacher,,,
1,1002,Hermione Granger,2021,20,NY,10025,Lawyer,,,
2,1003,Draco Malfoy,2019,20,CT,6807,Manager,1003.0,20.0,3.0
3,1004,Albus Dumbledore,2019,70,NJ,7030,Professor,1004.0,100.0,2.0
4,1005,Severus Snape,2019,50,NJ,7960,Professor,1005.0,20.0,1.0
5,1006,Ron Weasley,2020,20,PA,19019,Teacher,,,
6,1007,Dobby,2021,150,NY,10014,House Elf,,,
7,1008,Minerva McGonagall,2019,68,NY,10013,Principal,1008.0,15.0,1.0
8,1009,Neville Longbottom,2020,20,NJ,7035,Botanist,,,
9,1010,Luna Lovegood,2020,19,MA,2101,Cryptozoologist,,,
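
The ambiguity error and its fix can be reproduced safely by catching the exception. This is a minimal sketch with hypothetical tables `a` and `b`; SQLite treats column names case-insensitively, so `CustomerID` and `CUSTOMERID` collide.

```python
import sqlite3

# Hypothetical toy tables whose key columns differ only in letter case
amb_demo = sqlite3.connect(":memory:")
amb_demo.executescript("""
    CREATE TABLE a (CustomerID INTEGER, Name TEXT);
    CREATE TABLE b (CUSTOMERID INTEGER, Amount INTEGER);
    INSERT INTO a VALUES (1001, 'Harry'), (1011, 'Katniss');
    INSERT INTO b VALUES (1001, 10);
""")

# Unqualified column in WHERE: SQLite cannot tell which table is meant
try:
    amb_demo.execute("""
        SELECT * FROM a t1
        LEFT JOIN b t2 ON t1.CustomerID = t2.CUSTOMERID
        WHERE CustomerID BETWEEN 1001 AND 1010
    """)
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)

# Qualifying the column with its table alias resolves the ambiguity
fixed_rows = amb_demo.execute("""
    SELECT * FROM a t1
    LEFT JOIN b t2 ON t1.CustomerID = t2.CUSTOMERID
    WHERE t1.CustomerID BETWEEN 1001 AND 1010
""").fetchall()

print(error_message)
print(fixed_rows)
```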


In [25]:
## ADDING WHERE, GROUP BY, and ORDER BY

# show 2021 sales by Occupation outside of NY, sort by sales in desc order
pd.read_sql("""
            SELECT Occupation, sum(Sales2021) as Sales
            FROM info t1
            LEFT JOIN (
                    SELECT CUSTOMER_ID, SUM(Sales2021) AS Sales2021, COUNT(item_description) AS itemcnt
                    FROM Sales2021
                    GROUP BY CUSTOMER_ID) t2
            ON t1.CustomerID = t2.Customer_ID
            WHERE State <> 'NY'
            GROUP BY Occupation
            ORDER BY Sales desc""", conn) 

Unnamed: 0,Occupation,Sales
0,Crossfit Instructor,36037
1,Professor,30593
2,Designer,22865
3,Botanist,21587
4,Teacher,19135
5,Bartender,12665
6,Cryptozoologist,10255
7,Baker,9790
8,Doctor,9445
9,Politician,9339
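
The same pattern (join, then WHERE, then GROUP BY, then ORDER BY) is sketched below on hypothetical `cust_demo` and `sales_demo` tables. It also shows a detail worth knowing: a group whose customers have no sales gets a NULL total, which SQLite sorts last in descending order.

```python
import sqlite3

# Hypothetical toy tables in a throwaway in-memory database
grp_demo = sqlite3.connect(":memory:")
grp_demo.executescript("""
    CREATE TABLE cust_demo (CustomerID INTEGER, State TEXT, Occupation TEXT);
    CREATE TABLE sales_demo (customer_id INTEGER, amount INTEGER);
    INSERT INTO cust_demo VALUES
        (1, 'NJ', 'Teacher'), (2, 'NY', 'Lawyer'), (3, 'NJ', 'Teacher'),
        (4, 'WA', 'Baker'), (5, 'PA', 'Doctor');
    INSERT INTO sales_demo VALUES (1, 10), (1, 5), (3, 15), (4, 20), (2, 100);
""")

# Aggregate sales per customer, join, filter out NY, then total by occupation
occupation_sales = grp_demo.execute("""
    SELECT Occupation, SUM(Total) AS Sales
    FROM cust_demo t1
    LEFT JOIN (
        SELECT customer_id, SUM(amount) AS Total
        FROM sales_demo
        GROUP BY customer_id) t2
    ON t1.CustomerID = t2.customer_id
    WHERE State <> 'NY'
    GROUP BY Occupation
    ORDER BY Sales DESC
""").fetchall()

print(occupation_sales)
```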


# UNION

UNION also combines two or more tables, but unlike JOIN, which places columns side by side, it stacks the rows of one table on top of another. For UNION to work correctly, each SELECT must return the same number of columns in the same order, and corresponding columns should have compatible data types.

UNION is helpful when similar data is stored in separate tables. For example, if each year's sales live in their own table, UNION combines them into a single result.
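
The column requirements can be seen in a small sketch. The `y2019` and `y2020` tables below are hypothetical: they deliberately have different column counts, so `SELECT *` fails, while listing the columns explicitly works. Columns are matched by position, and the result takes its column names from the first SELECT.

```python
import sqlite3

# Hypothetical yearly tables with a different number of columns
union_demo = sqlite3.connect(":memory:")
union_demo.executescript("""
    CREATE TABLE y2019 (CUSTOMERID INTEGER, Sales2019 INTEGER);
    CREATE TABLE y2020 (customer_id INTEGER, Sales2020 INTEGER, Channel TEXT);
    INSERT INTO y2019 VALUES (1003, 5);
    INSERT INTO y2020 VALUES (1001, 417, 'web');
""")

# Mismatched column counts raise an error
try:
    union_demo.execute("SELECT * FROM y2019 UNION ALL SELECT * FROM y2020")
    union_error = None
except sqlite3.OperationalError as exc:
    union_error = str(exc)

# Listing the same number of columns on each side makes the UNION valid
cursor = union_demo.execute("""
    SELECT CUSTOMERID, Sales2019 FROM y2019
    UNION ALL
    SELECT customer_id, Sales2020 FROM y2020
""")
column_names = [description[0] for description in cursor.description]
stacked_rows = cursor.fetchall()

print(union_error)
print(column_names, stacked_rows)
```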

## UNION vs UNION ALL

- UNION: keeps only unique records (duplicate rows are removed)
- UNION ALL: keeps all records, including duplicates
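
A minimal sketch of the difference, using hypothetical `jan` and `feb` tables that share one identical row:

```python
import sqlite3

# Hypothetical monthly tables; the row (1, 'Tools') appears in both
dedupe_demo = sqlite3.connect(":memory:")
dedupe_demo.executescript("""
    CREATE TABLE jan (customer_id INTEGER, item TEXT);
    CREATE TABLE feb (customer_id INTEGER, item TEXT);
    INSERT INTO jan VALUES (1, 'Tools'), (2, 'Books');
    INSERT INTO feb VALUES (1, 'Tools'), (3, 'Grocery');
""")

# UNION removes the duplicated row; UNION ALL keeps both copies
union_rows = dedupe_demo.execute(
    "SELECT * FROM jan UNION SELECT * FROM feb").fetchall()
union_all_rows = dedupe_demo.execute(
    "SELECT * FROM jan UNION ALL SELECT * FROM feb").fetchall()

print(len(union_rows), len(union_all_rows))
```

Because UNION has to compare rows to remove duplicates, UNION ALL is usually the cheaper choice when you know the inputs do not overlap.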

In [31]:
# UNION with SELECT * - will fail
# pd.read_sql("""
#             SELECT * FROM Sales2019
#             UNION
#             SELECT * from Sales2020""", conn) 

In [27]:
pd.read_sql("""

            SELECT CUSTOMERID, Sales2019, item_description, Date FROM Sales2019
            UNION ALL
            SELECT customer_id, Sales2020, item_description, YearMonth FROM Sales2020""", conn)

# Note: on this data, UNION ALL and UNION return the same number of records, just in a different order

Unnamed: 0,CUSTOMERID,Sales2019,item_description,Date
0,1003,5,Bed bug spray,11/15/2019
1,1003,10,Hammer,11/16/2019
2,1003,5,Giant roach spray,12/15/2019
3,1004,50,Chocolate,12/20/2019
4,1004,50,Chocolate,12/30/2019
...,...,...,...,...
218,1017,301,Tools,202011
219,1018,541,Tools,202010
220,1019,340,Books,202001
221,1020,440,Grocery,202002


In [28]:
pd.read_sql("""

            SELECT CUSTOMERID, Sales2019, item_description, Date FROM Sales2019
            UNION ALL
            SELECT customer_id, Sales2020, item_description, YearMonth FROM Sales2020
            UNION ALL
            SELECT customer_id, Sales2021, item_description, YearMonth FROM Sales2021""", conn)

Unnamed: 0,CUSTOMERID,Sales2019,item_description,Date
0,1003,5,Bed bug spray,11/15/2019
1,1003,10,Hammer,11/16/2019
2,1003,5,Giant roach spray,12/15/2019
3,1004,50,Chocolate,12/20/2019
4,1004,50,Chocolate,12/30/2019
...,...,...,...,...
648,1002,700,Grocery,202102
649,1007,365,Cleaning Supplies,202106
650,1013,528,Tools,202111
651,1014,484,Tools,202109


In [29]:
## Clean output

pd.read_sql("""

            SELECT CUSTOMERID AS customer_id, 
                    Sales2019 AS Sales, 
                    item_description, 
                    substr(Date,7,4)||substr(Date,1,2) AS YearMonth
            FROM Sales2019
                UNION ALL
            SELECT customer_id, 
                    Sales2020 AS Sales, 
                    item_description, 
                    YearMonth 
            FROM Sales2020
                UNION ALL
            SELECT customer_id, 
                    Sales2021 AS Sales, 
                    item_description, 
                    YearMonth 
            FROM Sales2021
            ORDER BY customer_id, YearMonth""", conn)

Unnamed: 0,customer_id,Sales,item_description,YearMonth
0,1001,417,Tools,202001
1,1001,234,Sewing Supplies,202002
2,1001,118,Pest Control,202005
3,1001,470,Produce,202005
4,1001,825,Tote bag,202005
...,...,...,...,...
648,1021,476,Grocery,202111
649,1021,193,Tools,202111
650,1021,560,Cleaning Supplies,202111
651,1021,358,Grocery,202112
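
The `substr` expression in the cell above turns an `MM/DD/YYYY` string into `YYYYMM`. The sketch below runs it on a single literal date to show what each piece extracts; the `CAST` to INTEGER is an optional extra, shown under the assumption that you want the value to be numeric like the `YearMonth` columns.

```python
import sqlite3

date_demo = sqlite3.connect(":memory:")

# substr is 1-indexed: characters 7-10 are the year, characters 1-2 the month
year_month_text, year_month_int = date_demo.execute("""
    SELECT substr(:d, 7, 4) || substr(:d, 1, 2) AS ym_text,
           CAST(substr(:d, 7, 4) || substr(:d, 1, 2) AS INTEGER) AS ym_int
""", {"d": "11/15/2019"}).fetchone()

print(year_month_text, year_month_int)
```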


# Fuzzy Matching

Sometimes you may have two tables that need to be joined on a string, such as a name, instead of an ID. This can be very painful, because small spelling differences prevent exact matches!

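Some SQL dialects have built-in functions that measure how similar two strings are. For example, SAS provides COMPGED, which computes a generalized edit distance, and COMPLEV, which computes the Levenshtein distance (more on this in NLP); several dialects also offer a SOUNDEX function for phonetic matching. SQLite does not ship with edit-distance functions, and its `soundex()` function is only available when enabled at compile time.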
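
Python's `sqlite3` module lets you register your own functions with `create_function`, so one workaround is to supply an edit-distance function from Python. The sketch below is an illustration under stated assumptions: the `levenshtein` implementation, the `crm` and `billing` tables, and the distance threshold of 3 are all hypothetical choices, not part of the lecture.

```python
import sqlite3

def levenshtein(s, t):
    """Edit distance between two strings via row-by-row dynamic programming."""
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        current = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

fuzzy_demo = sqlite3.connect(":memory:")
# Make the Python function callable from SQL as levenshtein(x, y)
fuzzy_demo.create_function("levenshtein", 2, levenshtein)
fuzzy_demo.executescript("""
    CREATE TABLE crm (name TEXT);
    CREATE TABLE billing (name TEXT);
    INSERT INTO crm VALUES ('Hermione Granger'), ('Ron Weasley'), ('Draco Malfoy');
    INSERT INTO billing VALUES ('Hermione Grainger'), ('Ronald Weasley'), ('Luna Lovegood');
""")

# Join on "close enough" names instead of exact equality
fuzzy_rows = fuzzy_demo.execute("""
    SELECT c.name, b.name, levenshtein(lower(c.name), lower(b.name)) AS dist
    FROM crm c
    JOIN billing b
    ON levenshtein(lower(c.name), lower(b.name)) <= 3
    ORDER BY c.name
""").fetchall()

print(fuzzy_rows)
```

This compares every row of one table against every row of the other, so it is only practical for small tables.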

[Fuzzywuzzy](https://pypi.org/project/fuzzywuzzy/) is a great Python alternative for fuzzy matching.
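
If you would rather not install anything, the standard library's `difflib` offers a similar idea. Note that this sketch swaps in `difflib.get_close_matches` instead of Fuzzywuzzy, and that the name lists and the 0.8 cutoff are hypothetical choices for illustration.

```python
from difflib import get_close_matches

# Hypothetical name lists from two systems that spell names differently
crm_names = ["Hermione Granger", "Ron Weasley", "Draco Malfoy"]
billing_names = ["Hermione Grainger", "Ronald Weasley", "Luna Lovegood"]

# For each name, keep the single best candidate above the similarity cutoff
name_matches = {}
for name in crm_names:
    best = get_close_matches(name, billing_names, n=1, cutoff=0.8)
    name_matches[name] = best[0] if best else None

print(name_matches)
```

Names with no candidate above the cutoff map to `None`, which makes it easy to review the unmatched ones by hand.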

There are also NoSQL databases that are well suited to entity resolution, such as [Elasticsearch](https://www.elastic.co/elastic-stack/).
