<a href="https://colab.research.google.com/github/zackives/upenn-cis-2450/blob/main/7_Module_2_Part_III_Better_SQL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Better SQL - How to Understand, Write, and Debug

It's easy to write SQL queries that become very challenging to debug.

In this Notebook, we'll try to summarize some of the subtleties of different SQL constructs, how they relate, and how we might debug.

In [1]:
!wget -nc https://storage.googleapis.com/penn-cis5450/linkedin_anon.jsonl

--2024-09-21 15:17:44--  https://storage.googleapis.com/penn-cis5450/linkedin_anon.jsonl
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.199.207, 74.125.142.207, 74.125.195.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.199.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 179851696 (172M) [application/octet-stream]
Saving to: ‘linkedin_anon.jsonl’


2024-09-21 15:17:45 (112 MB/s) - ‘linkedin_anon.jsonl’ saved [179851696/179851696]



In [2]:
!pip3 install lxml
!pip3 install duckdb



## Setting up a sample database

All of this is to load up our LinkedIn data...

In [3]:
import pandas as pd
import numpy as np

# JSON parsing
import json

# HTML parsing
from lxml import etree
import urllib

# DuckDB RDBMS
import duckdb

# Polars big data package
import polars

# Time conversions
import time

In [4]:
'''
Simple code to pull out data from JSON and load into DuckDB.
'''
import ast

# pd.read_json opens the file itself, so no separate file handle is needed
people_df = pd.read_json('linkedin_anon.jsonl', lines=True)

In [5]:
def get_nested_dict(rel, name):
  # Expand a column of dictionaries into one column per dictionary key.
  # (pd.read_json has already parsed the nested JSON, so the values are dicts
  # and no ast.literal_eval step is needed.)
  ret = rel.copy()
  ret = ret.dropna()
  # Reuse ret's index so the expanded columns still line up after dropna()
  return ret.drop(columns=name).join(pd.DataFrame(ret[name].tolist(), index=ret.index))

def get_nested_list(rel, name):
  ret = rel.copy()
  ret = ret.dropna().explode(name).dropna()
  ret = ret.join(pd.DataFrame(ret[name].tolist())).drop(columns=name).drop_duplicates()
  return ret.rename(columns={0: name})

def get_nested_list_dict(rel, name):
  ret = rel.copy()

  ret = ret.dropna().explode(name)

  exploded_pairs = pd.DataFrame(ret.apply(lambda x: {'_id': x['_id']} | x[name] if isinstance(x[name], dict) else {'_id': x['_id']}, axis=1).tolist())

  return ret.merge(exploded_pairs, on='_id').drop(columns=name)
  #pd.DataFrame(ret[name].tolist())).drop(columns=name).drop_duplicates()

# Take the lists, drop any blank strings
specialties_df = people_df[['_id','specilities']].explode('specilities').rename(columns={'_id': 'person'})
specialties_df.dropna(inplace=True)
interests_df = people_df[['_id','interests']].explode('interests').rename(columns={'_id': 'person'})
interests_df.dropna(inplace=True)

names_df = get_nested_dict(people_df[['_id','name']], 'name')

education_df = get_nested_list_dict(people_df[['_id','education']], 'education')
experience_df = get_nested_list_dict(people_df[['_id','experience']], 'experience')
skills_df = get_nested_list(people_df[['_id','skills']], 'skills')
honors_df = get_nested_list(people_df[['_id','honors']], 'honors')
events_df = get_nested_list_dict(people_df[['_id','events']], 'events')

groups_df = get_nested_dict(people_df[['_id','group']], 'group')

people_only_df = people_df.drop(columns=['name','education','group','skills','experience','honors','events','specilities','interests']).\
  merge(names_df, on='_id')

In [6]:
## This is just to reset things so we don't have an index
conn = duckdb.connect('linkedin.db')
conn.execute('BEGIN TRANSACTION')
conn.execute('DROP TABLE IF EXISTS people')
conn.execute('DROP TABLE IF EXISTS education')
conn.execute('DROP TABLE IF EXISTS experience')
conn.execute('DROP TABLE IF EXISTS skills')
conn.execute('DROP TABLE IF EXISTS honors')
conn.execute('DROP TABLE IF EXISTS events')
conn.execute('DROP TABLE IF EXISTS groups')
conn.execute('DROP TABLE IF EXISTS specialties')
conn.execute('DROP TABLE IF EXISTS interests')
conn.execute('DROP INDEX IF EXISTS people_industry')
conn.execute('CREATE TABLE people AS SELECT * FROM people_only_df')
conn.execute('CREATE TABLE education AS SELECT * FROM education_df')
conn.execute('CREATE TABLE experience AS SELECT * FROM experience_df')
conn.execute('CREATE TABLE skills AS SELECT * FROM skills_df')
conn.execute('CREATE TABLE honors AS SELECT * FROM honors_df')
conn.execute('CREATE TABLE events AS SELECT * FROM events_df')
conn.execute('CREATE TABLE groups AS SELECT * FROM groups_df')
conn.execute('CREATE TABLE specialties AS SELECT * FROM specialties_df')
conn.execute('CREATE TABLE interests AS SELECT * FROM interests_df')
conn.execute('COMMIT')


<duckdb.duckdb.DuckDBPyConnection at 0x7f2ff1643530>

## Penngrader setup

In [7]:
%%writefile notebook-config.yaml

grader_api_url: 'https://23whrwph9h.execute-api.us-east-1.amazonaws.com/default/Grader23'
grader_api_key: 'flfkE736fA6Z8GxMDJe2q8Kfk8UDqjsG3GVqOFOa'

Writing notebook-config.yaml


In [8]:
!pip3 install penngrader-client

Collecting penngrader-client
  Downloading penngrader_client-0.5.2-py3-none-any.whl.metadata (15 kB)
Collecting dill (from penngrader-client)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Downloading penngrader_client-0.5.2-py3-none-any.whl (10 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Installing collected packages: dill, penngrader-client
Successfully installed dill-0.3.8 penngrader-client-0.5.2


In [9]:
# PLEASE ENSURE YOUR PENN-ID IS ENTERED CORRECTLY. IF NOT, THE AUTOGRADER WON'T KNOW
# WHOM TO ASSIGN POINTS TO IN OUR BACKEND.
STUDENT_ID = 99999999 # YOUR PENN-ID GOES HERE AS AN INTEGER

In [10]:
%set_env HW_ID=cis2450_fall24_HW9

env: HW_ID=cis2450_fall24_HW9


In [11]:
import os
from penngrader.grader import *

grader = PennGrader('notebook-config.yaml', os.environ['HW_ID'], STUDENT_ID, STUDENT_ID)

PennGrader initialized with Student ID: 99999999

Make sure this correct or we will not be able to store your grade


## Validating the database setup

In [12]:
conn.sql("""
SELECT DISTINCT industry
FROM people
WHERE industry IS NOT NULL
""").df()

Unnamed: 0,industry
0,Medical Devices
1,Pharmaceuticals
2,Research
3,Information Technology and Services
4,Telecommunications
...,...
1504,Çevre Koruma Hizmetleri
1505,Învățământ superior
1506,Biyoteknoloji
1507,Медицинская практика


In [13]:
conn.sql("""
SELECT DISTINCT skills
FROM skills
""").df()

Unnamed: 0,skills
0,Strategic Planning
1,Market Planning
2,Negotiation
3,Forecasting
4,Sales Management
...,...
8451,Free to Play
8452,Time Series Analysis
8453,Company Valuation
8454,SAS Base


### Queries with Joins

How many people have skills related to biology and work in tech? To answer that, we first look at what the `people` table contains.

In [14]:
conn.sql('select * from people').df()

Unnamed: 0,_id,locality,industry,summary,url,interval,family_name,given_name
0,moist-vodka,United States,Medical Devices,SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJ...,https://www.linkedin.com/in/moist-vodka,,Post,Belvedere
1,adagio-catalyst,"Antwerp Area, Belgium",Pharmaceuticals,Ph.D. scientist with background in cancer rese...,https://www.linkedin.com/in/adagio-catalyst,20.0,Watt,Brunton
2,tart-acorn,"San Francisco, California",Research,I am interested in inventing new methods to co...,https://www.linkedin.com/in/tart-acorn,0.0,Hannay,Passepartout
3,objective-riesling,San Francisco Bay Area,Information Technology and Services,OBJECTIVE<Primary> Work on an interesting and ...,https://www.linkedin.com/in/objective-riesling,5.0,Carnegie,Passepartout
4,generative-amberjack,"Chennai Area, India",Aviation & Aerospace,"Experience in Avionics Systems, Embedded Syste...",https://www.linkedin.com/in/generative-amberjack,,Duncan,Merriman
...,...,...,...,...,...,...,...,...
51817,glowing-flush,Greater Chicago Area,Marketing and Advertising,Sales and marketing professional specializing ...,https://www.linkedin.com/in/glowing-flush,5.0,Kincaid,Passepartout
51818,grouchy-flight,Greater Atlanta Area,Financial Services,Accomplished business development manager expe...,https://www.linkedin.com/in/grouchy-flight,14.0,Ogilvy,Bullimore
51819,dense-bell,"Calgary, Canada Area",Design,Brad Gibson is a recognized expert in power qu...,https://www.linkedin.com/in/dense-bell,42.0,Macdougall,Barrymore
51820,brave-hoops,San Francisco Bay Area,Public Policy,Brad Kane's multi-faceted career in the govern...,https://www.linkedin.com/in/brave-hoops,26.0,Forsyth,Cadbury


# Understanding UNION, INTERSECTION, Cartesian Product, and JOIN

Recall that relations are *sets of tuples* and that the Relational Algebra is a set of operations over sets of tuples. SELECT filters tuples (SQL's `WHERE`), PROJECT keeps only some of the attributes in their schema (SQL's `SELECT` list), etc.

Recall that sets have three common operators:

1. UNION
2. INTERSECTION
3. CARTESIAN PRODUCT

(Recall that by default database tables allow duplicates.  We can always use `SELECT DISTINCT` to get true sets.)

## UNION and WHERE c1() OR c2()

In a way, UNION is the simplest operation, conceptually. We take two sets of tuples (with the same schema!) and put them together.

We can do this, for instance, to collect sets of items that satisfy either of two different conditions.

In [15]:
display(conn.sql("""
  SELECT DISTINCT _id, given_name, family_name
  FROM people
  WHERE lower(industry) LIKE '%bio%'
  UNION
  SELECT DISTINCT _id, given_name, family_name
  FROM people
  WHERE lower(industry) LIKE '%tech%'
""").df())

Unnamed: 0,_id,given_name,family_name
0,rosy-nucleus,Leporello,Forrester
1,quick-subfloor,Belvedere,Howard
2,intense-phalanx,Figaro,Hannay
3,boxy-cove,Figaro,Riddell
4,abstract-nickel,Passepartout,Campbell
...,...,...,...
6556,novel-century,Bertuccio,Fergusson
6557,scared-luminosity,Cadbury,Maclean
6558,wise-bight,Figaro,Greg
6559,tender-peak,Brunton,Akins


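Because UNION combines sets, a query with several UNION "branches" can be rewritten as a single query that combines the branch conditions by *disjunction* (`OR`), and vice versa. Both forms return the same set of rows; since UNION also removes duplicates, add `DISTINCT` to the `OR` form if S may contain duplicate rows.

```
SELECT *
FROM S
WHERE c1(S)
UNION
SELECT *
FROM S
WHERE c2(S)
```

vs

```
SELECT *
FROM S
WHERE c1(S) OR c2(S)
```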
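
As a rough check of this equivalence, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for DuckDB. The table `s`, its integer column `x`, and the two conditions are invented for illustration; they are not part of the LinkedIn data.

```python
import sqlite3

# Hypothetical toy table: the integers 1..10 in a single column x.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE s (x INTEGER)")
demo.executemany("INSERT INTO s VALUES (?)", [(n,) for n in range(1, 11)])

# Two UNION branches, one per condition.
union_rows = demo.execute("""
    SELECT x FROM s WHERE x < 4
    UNION
    SELECT x FROM s WHERE x % 2 = 0
""").fetchall()

# The same conditions combined with OR in a single SELECT.
or_rows = demo.execute("""
    SELECT DISTINCT x FROM s WHERE x < 4 OR x % 2 = 0
""").fetchall()

# Compare as sets, since neither query promises a row order.
same_result = set(union_rows) == set(or_rows)
print(sorted(union_rows), same_result)
```

The values 2 satisfies both conditions, yet it appears only once in either result, because `UNION` and `DISTINCT` both eliminate duplicates.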

### Exercise

Try rewriting the bio/tech UNION query above as a single SELECT with an OR:

In [None]:
# TODO
result_df = # TODO

display(result_df)

In [35]:
if not isinstance(result_df, pd.DataFrame):
  raise TypeError("Value in result_df must be a pandas DataFrame")
grader.grade('sql_or', result_df)

Correct! You earned 1/1 points. You are a star!

Your submission has been successfully recorded in the gradebook.


## Multisets and UNION

What if I want to *count* how many rows there are with people who work in these industries? If so, I may want *multisets* or *bags*.  Here I don't use `DISTINCT`.

In [18]:
display(conn.sql("""
  SELECT count(_id)
  FROM people
  WHERE lower(industry) LIKE '%bio%' OR lower(industry) LIKE '%tech%'
""").df())

Unnamed: 0,count(_id)
0,6594


I can also assemble sub-results via UNION, but if I want "bag union" I need to say UNION ALL.

In [19]:
display(conn.sql("""
  SELECT COUNT(*)
  FROM (
    SELECT _id, given_name, family_name
    FROM people
    WHERE lower(industry) LIKE '%bio%'
    UNION ALL
    SELECT _id, given_name, family_name
    FROM people
    WHERE lower(industry) LIKE '%tech%'
  )
""").df())

Unnamed: 0,count_star()
0,6857


### Question for discussion

*Why is this not the same?  What could we do to fix it?  You are welcome to discuss with your peers or on Ed Discussion.*

## JOIN is a Cartesian Product

Again connecting to set operations: JOIN is a special case of CARTESIAN PRODUCT, namely a CARTESIAN PRODUCT followed by SELECT (which is the join condition).

In [20]:
display(conn.sql("""
  SELECT DISTINCT _id, given_name, family_name
  FROM people NATURAL JOIN skills
""").df())

display(conn.sql("""
  SELECT DISTINCT people._id, given_name, family_name
  FROM people CROSS JOIN skills
  WHERE people._id = skills._id
""").df())

Unnamed: 0,_id,given_name,family_name
0,tart-acorn,Passepartout,Hannay
1,generative-amberjack,Merriman,Duncan
2,arctic-investor,Merriman,Rollo
3,cold-reveal,Barrymore,Kincaid
4,afraid-collision,Merriman,Hunter
...,...,...,...
38142,joint-atoll,Cadbury,Macalister
38143,resultant-coach,Bertuccio,Kincaid
38144,glowing-flush,Passepartout,Kincaid
38145,dense-bell,Barrymore,Macdougall


Unnamed: 0,_id,given_name,family_name
0,moist-vodka,Belvedere,Post
1,adagio-catalyst,Brunton,Watt
2,salty-section,Figaro,Watt
3,sensitive-estuary,Merriman,Macneil
4,rich-laser,Figaro,Spalding
...,...,...,...
38142,happy-circuit,Simonides,Brown
38143,calm-block,Figaro,Macnab
38144,piercing-tetrad,Brunton,Hunter
38145,espressivo-media,Brunton,Dundas


### Join and Intersection

Sometimes you are asked for sets of items that jointly satisfy conditions.  If you just want to return the basic items, this can be accomplished with an `INTERSECT`:

In [21]:
conn.sql("""
  SELECT DISTINCT _id, given_name, family_name
  FROM people NATURAL JOIN skills
  WHERE lower(skills.skills) LIKE '%bio%'
  INTERSECT
  SELECT DISTINCT _id, given_name, family_name
  FROM people NATURAL JOIN experience
  WHERE lower(industry) LIKE '%technology%'
""").df()

Unnamed: 0,_id,given_name,family_name
0,foggy-frontlist,Simonides,Scott
1,meek-pear,Poole,Hunter
2,vertical-alignment,Poole,Lockhart
3,meek-pear,Cadbury,Haldane
4,extremal-generator,Barrymore,Haldane


But also with a join:

In [22]:
conn.sql("""
  SELECT DISTINCT _id, given_name, family_name
  FROM people NATURAL JOIN skills NATURAL JOIN experience
  WHERE lower(skills.skills) LIKE '%bio%' AND
   lower(industry) LIKE '%technology%'
""").df()

Unnamed: 0,_id,given_name,family_name
0,foggy-frontlist,Simonides,Scott
1,meek-pear,Poole,Hunter
2,meek-pear,Cadbury,Haldane
3,vertical-alignment,Poole,Lockhart
4,extremal-generator,Barrymore,Haldane



So why would I use one vs the other?

1. Remember that INTERSECT only returns rows that appear in *both* inputs, so both sides must have the same schema. You can't, for instance, also return the fields that produced the match (such as which skill or which industry matched).
2. If your checks-for-relationships span multiple tables, then that is *inherently* a join.
3. Remember that JOIN will by default produce a multiset (you can use SELECT DISTINCT to remove duplicates).

In [24]:
conn.sql("""
  SELECT DISTINCT _id, given_name, family_name, skills, industry
  FROM people NATURAL JOIN skills NATURAL JOIN experience
  WHERE lower(skills.skills) LIKE '%bio%' AND
   lower(industry) LIKE '%technology%'
""").df()

Unnamed: 0,_id,given_name,family_name,skills,industry
0,meek-pear,Poole,Hunter,Biotechnology,Information Technology and Services
1,meek-pear,Cadbury,Haldane,Biotechnology,Information Technology and Services
2,extremal-generator,Barrymore,Haldane,Biomedical Engineering,Information Technology and Services
3,vertical-alignment,Poole,Lockhart,Computational Biology,Information Technology and Services
4,foggy-frontlist,Simonides,Scott,Biotechnology,Information Technology and Services


## Python Comprehensions vs SQL ... and Conditionals

SQL is a bit like Python [list comprehensions](https://www.programiz.com/python-programming/list-comprehension).  In Python, we can create a new list from the members of another collection, using list-builder (square-bracket) notation, and the *for* keyword.

Intuitively, the list comprehension is heavily inspired by set-builder notation in discrete mathematics. Perhaps you've seen mathematical expressions like this:

$$\{x | x \in S \wedge x < 5\}$$

Suppose $S$ were a list and not a set.  You could imagine extending to a list-builder notation like this:

$$[x | x \in S \wedge x < 5]$$

Indeed, that's roughly what Python does as a list comprehension:

```
[x for x in S if x < 5]
```
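
As a small runnable illustration (the list `S` here is just an example value), the filter in the comprehension plays the role of the condition $x < 5$ in the builder notation:

```python
S = [7, 2, 9, 4, 1]

# List-builder: keep the elements of S that are below 5, in their original order.
small = [x for x in S if x < 5]

# Set-builder: the same condition, but the result is an unordered set.
small_set = {x for x in S if x < 5}

print(small)
```

In SQL terms this is roughly `SELECT x FROM S WHERE x < 5`: the `for` clause corresponds to `FROM`, and the `if` clause to `WHERE`.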

Now let's connect this to DataFrames, which are really lists of tuples. We can, if we want to, iterate over the set of rows in a dataframe, and pull out the name from the content in a list:

In [25]:
[x[1]['name'] for x in people_df.iterrows()]

[{'family_name': 'Post', 'given_name': 'Belvedere'},
 {'family_name': 'Watt', 'given_name': 'Brunton'},
 {'family_name': 'Hannay', 'given_name': 'Passepartout'},
 {'family_name': 'Carnegie', 'given_name': 'Passepartout'},
 {'family_name': 'Duncan', 'given_name': 'Merriman'},
 {'family_name': 'Watt', 'given_name': 'Figaro'},
 {'family_name': 'Graham', 'given_name': 'Leporello'},
 {'family_name': 'Macneil', 'given_name': 'Merriman'},
 {'family_name': 'Rollo', 'given_name': 'Merriman'},
 {'family_name': 'Spalding', 'given_name': 'Figaro'},
 {'family_name': 'Kincaid', 'given_name': 'Barrymore'},
 {'family_name': 'Hunter', 'given_name': 'Merriman'},
 {'family_name': 'Burnett', 'given_name': 'Merriman'},
 {'family_name': 'Haig', 'given_name': 'Belvedere'},
 {'family_name': 'Howard', 'given_name': 'Poole'},
 {'family_name': 'Primrose', 'given_name': 'Belvedere'},
 {'family_name': 'Hamilton', 'given_name': 'Brunton'},
 {'family_name': 'Barclay', 'given_name': 'Cadbury'},
 {'family_name': 'Buch

This is basically the same as, in DuckDB:

In [26]:
duckdb.sql('select name from people_df')

┌───────────────────────────────────────────────────────┐
│                         name                          │
│    struct(family_name varchar, given_name varchar)    │
├───────────────────────────────────────────────────────┤
│ {'family_name': Post, 'given_name': Belvedere}        │
│ {'family_name': Watt, 'given_name': Brunton}          │
│ {'family_name': Hannay, 'given_name': Passepartout}   │
│ {'family_name': Carnegie, 'given_name': Passepartout} │
│ {'family_name': Duncan, 'given_name': Merriman}       │
│ {'family_name': Watt, 'given_name': Figaro}           │
│ {'family_name': Graham, 'given_name': Leporello}      │
│ {'family_name': Macneil, 'given_name': Merriman}      │
│ {'family_name': Rollo, 'given_name': Merriman}        │
│ {'family_name': Spalding, 'given_name': Figaro}       │
│                        ·                              │
│                        ·                              │
│                        ·                              │
│ {'family_nam

We can also add conditionals on this.  Python has a bit of a weird syntax: each value of `x` in the collection needs something to be output, and we can output a different value depending on whether a condition is satisfied.

In [27]:
[x if not pd.isna(x) and x.find('-') < 6 else '(special)' for x in list(people_df['_id'])]

['moist-vodka',
 '(special)',
 'tart-acorn',
 '(special)',
 '(special)',
 'salty-section',
 '(special)',
 '(special)',
 '(special)',
 'rich-laser',
 'cold-reveal',
 '(special)',
 'oily-sport',
 '(special)',
 '(special)',
 '(special)',
 'plain-torpedo',
 '(special)',
 'coped-wash',
 'hard-node',
 'exact-novel',
 '(special)',
 'tart-velocity',
 '(special)',
 'fancy-notch',
 '(special)',
 '(special)',
 '(special)',
 'worn-swallow',
 '(special)',
 'pale-pot',
 'all-architecture',
 'dry-participant',
 '(special)',
 '(special)',
 '(special)',
 'sweet-clef',
 '(special)',
 'dull-bezier',
 'nutty-profile',
 '(special)',
 'rigid-coconut',
 '(special)',
 '(special)',
 '(special)',
 '(special)',
 'each-publicist',
 '(special)',
 'thin-level',
 '(special)',
 '(special)',
 '(special)',
 '(special)',
 '(special)',
 '(special)',
 '(special)',
 'basic-wax',
 '(special)',
 '(special)',
 'rude-turbine',
 'rosy-specialist',
 '(special)',
 '(special)',
 '(special)',
 '(special)',
 '(special)',
 '(special)

SQL also allows for conditions using a `CASE WHEN` syntax.  Note the `position` function in DuckDB SQL returns 1-based, as opposed to 0-based, positions.

In [28]:
duckdb.sql("""
  SELECT CASE WHEN _id IS NOT NULL AND position('-' IN _id) < 7 then _id else '(special)' end
  FROM people_df
""")


┌──────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ CASE  WHEN (((_id IS NOT NULL) AND (main."position"(_id, '-') < 7))) THEN (_id) ELSE '(special)' END │
│                                               varchar                                                │
├──────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ moist-vodka                                                                                          │
│ (special)                                                                                            │
│ tart-acorn                                                                                           │
│ (special)                                                                                            │
│ (special)                                                                                            │
│ salty-section                                        
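
To see the 0-based versus 1-based difference side by side, here is a minimal sketch on a handful of IDs. It uses Python's built-in `sqlite3` module as a stand-in for DuckDB, and SQLite's `instr(string, substring)` in place of `position`; like `position`, `instr` is 1-based and returns 0 when the substring is absent. The table name `ids` is an assumption made for this example.

```python
import sqlite3

ids = ["moist-vodka", "adagio-catalyst", "tart-acorn", None]

# Python: str.find is 0-based and returns -1 when the substring is missing.
py_result = [x if x is not None and x.find("-") < 6 else "(special)" for x in ids]

# SQL: instr is 1-based, so the threshold shifts from 6 to 7.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE ids (_id TEXT)")
demo.executemany("INSERT INTO ids VALUES (?)", [(x,) for x in ids])
sql_result = [
    row[0]
    for row in demo.execute("""
        SELECT CASE WHEN _id IS NOT NULL AND instr(_id, '-') < 7
                    THEN _id ELSE '(special)' END
        FROM ids
        ORDER BY rowid
    """)
]

print(py_result)
print(sql_result)
```

Both versions keep an ID only when its first hyphen falls within the first six characters, and both map the missing ID to `'(special)'`.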

### Exercise

Write a SQL query that replaces all industries without "tech" as a substring with NULL.  Be sure you are case-agnostic in your query, but don't change the case in the result.  Make sure the column is called "industry." Return the results as a dataframe.

In [49]:
results_df = # TODO


In [None]:
results_df.dropna()

In [60]:
if not isinstance(results_df, pd.DataFrame):
  raise TypeError("Value in results_df must be a pandas DataFrame")
elif len(results_df.dropna()) == len(results_df):
  raise RuntimeError('We would expect some nulls!')
grader.grade('sql_case', results_df)

Correct! You earned 1/1 points. You are a star!

Your submission has been successfully recorded in the gradebook.


## *Uncorrelated* Subqueries / Table Expressions

Generally, in SQL we can write a *table expression* anywhere we could use a table.  Maybe the simplest way is to consider an expression within the FROM clause.

In [29]:
conn.sql(
    """
  SELECT *
  FROM (
    SELECT _id, given_name, family_name
    FROM people JOIN (SELECT _id FROM skills WHERE lower(skills) LIKE '%bio%') USING (_id)
  )
"""
)

┌─────────────────────┬──────────────┬─────────────┐
│         _id         │  given_name  │ family_name │
│       varchar       │   varchar    │   varchar   │
├─────────────────────┼──────────────┼─────────────┤
│ proper-strain       │ Jenkins      │ Strange     │
│ honest-loop         │ Passepartout │ Nesbitt     │
│ drab-birch          │ Bullimore    │ Stuart      │
│ proud-brig          │ Simonides    │ Murdoch     │
│ cold-pivot          │ Merriman     │ Kennedy     │
│ cheerful-davit      │ Poole        │ Crichton    │
│ shabby-shepherd     │ Bertuccio    │ Armstrong   │
│ regular-airwatt     │ Merriman     │ Watt        │
│ tossed-wave         │ Cadbury      │ Oliphant    │
│ fermented-coastie   │ Bullimore    │ Lennox      │
│        ·            │    ·         │   ·         │
│        ·            │    ·         │   ·         │
│        ·            │    ·         │   ·         │
│ rosy-covariance     │ Barkley      │ Drummond    │
│ global-kayak        │ Alfred       │ Armstro

And SQL also allows for IN {table expression} or EXISTS({table expression}) within the WHERE clause.

In [30]:
conn.sql(
    """
  SELECT *
  FROM (
    SELECT _id, given_name, family_name
    FROM people
    WHERE _id IN (SELECT _id FROM skills WHERE lower(skills) LIKE '%bio%')
  )
"""
)

┌─────────────────────┬──────────────┬─────────────┐
│         _id         │  given_name  │ family_name │
│       varchar       │   varchar    │   varchar   │
├─────────────────────┼──────────────┼─────────────┤
│ proper-strain       │ Jenkins      │ Strange     │
│ honest-loop         │ Passepartout │ Nesbitt     │
│ drab-birch          │ Bullimore    │ Stuart      │
│ proud-brig          │ Simonides    │ Murdoch     │
│ cold-pivot          │ Merriman     │ Kennedy     │
│ cheerful-davit      │ Poole        │ Crichton    │
│ shabby-shepherd     │ Bertuccio    │ Armstrong   │
│ regular-airwatt     │ Merriman     │ Watt        │
│ tossed-wave         │ Cadbury      │ Oliphant    │
│ fermented-coastie   │ Bullimore    │ Lennox      │
│        ·            │    ·         │   ·         │
│        ·            │    ·         │   ·         │
│        ·            │    ·         │   ·         │
│ rosy-covariance     │ Barkley      │ Drummond    │
│ global-kayak        │ Alfred       │ Armstro

Notice that both of the above queries find the same people! (The JOIN form may list a person once per matching skill, so add `DISTINCT` if you need exactly the same rows.) In fact, we can generally express either of the above two uncorrelated query forms as a single query with a JOIN.

*Lesson here: think about whether you can simplify your nested queries into a single query that is easier to write / reason about!*

In [31]:
conn.sql(
    """
  SELECT _id, given_name, family_name
    FROM people JOIN skills USING (_id)
    WHERE lower(skills) LIKE '%bio%'
"""
)

┌─────────────────────┬──────────────┬─────────────┐
│         _id         │  given_name  │ family_name │
│       varchar       │   varchar    │   varchar   │
├─────────────────────┼──────────────┼─────────────┤
│ proper-strain       │ Jenkins      │ Strange     │
│ honest-loop         │ Passepartout │ Nesbitt     │
│ drab-birch          │ Bullimore    │ Stuart      │
│ proud-brig          │ Simonides    │ Murdoch     │
│ cold-pivot          │ Merriman     │ Kennedy     │
│ cheerful-davit      │ Poole        │ Crichton    │
│ shabby-shepherd     │ Bertuccio    │ Armstrong   │
│ regular-airwatt     │ Merriman     │ Watt        │
│ tossed-wave         │ Cadbury      │ Oliphant    │
│ fermented-coastie   │ Bullimore    │ Lennox      │
│        ·            │    ·         │   ·         │
│        ·            │    ·         │   ·         │
│        ·            │    ·         │   ·         │
│ rosy-covariance     │ Barkley      │ Drummond    │
│ global-kayak        │ Alfred       │ Armstro
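
The relationship between the three forms can be checked on a toy database. This is a minimal sketch using Python's built-in `sqlite3` module in place of DuckDB; the three people and four skill rows are invented for illustration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE people (_id TEXT, given_name TEXT);
    CREATE TABLE skills (_id TEXT, skills TEXT);
    INSERT INTO people VALUES ('p1', 'Ada'), ('p2', 'Grace'), ('p3', 'Edsger');
    INSERT INTO skills VALUES
        ('p1', 'Biology'), ('p1', 'Bioinformatics'),
        ('p2', 'Welding'), ('p3', 'Microbiology');
""")

# Form 1: join against a table expression in the FROM clause.
join_subquery = demo.execute("""
    SELECT _id FROM people
    JOIN (SELECT _id FROM skills WHERE lower(skills) LIKE '%bio%') AS b USING (_id)
""").fetchall()

# Form 2: IN with an uncorrelated subquery.
in_subquery = demo.execute("""
    SELECT _id FROM people
    WHERE _id IN (SELECT _id FROM skills WHERE lower(skills) LIKE '%bio%')
""").fetchall()

# Form 3: a single flat join.
flat_join = demo.execute("""
    SELECT _id FROM people JOIN skills USING (_id)
    WHERE lower(skills) LIKE '%bio%'
""").fetchall()

print(len(join_subquery), len(in_subquery), len(flat_join))
```

All three forms identify the same people, but the two join forms return `p1` twice (once per matching skill) while the `IN` form returns each person once. Adding `DISTINCT` to the join forms removes that difference.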

## Correlated Subqueries

Recall that one way of thinking about SQL queries is that they iterate over all of the tuples in each of the tables in the FROM clause. Each table iterator is given a variable name (if you don't specify one it will be the table's name):

```
SELECT *
FROM people A, skills B
```

would iterate over all A and B tuples and consider their combinations. In fact you would get a Cartesian product as a result of this.

We may want to write *subqueries* that test against the values in the iterators.


### EXISTS

Perhaps the simplest subquery uses the `EXISTS` predicate in the `WHERE` clause. Within the predicate, we can compute any set-style expression, including a SQL query that returns results.  Naturally, `EXISTS` tests whether we have an empty set or not.

Example: For each person in an industry, we can see if there exists at least one other person with the same first name, in the same industry.

In [67]:
%%time
conn.sql("""
  SELECT DISTINCT A.family_name, A.given_name, A.industry
  FROM people A
  WHERE EXISTS (
    SELECT *
    FROM people B
    WHERE A._id != B._id AND
      A.given_name = B.given_name AND
      A.industry = B.industry
  )
""").df()

CPU times: user 96.8 ms, sys: 20 ms, total: 117 ms
Wall time: 89.5 ms


Unnamed: 0,family_name,given_name,industry
0,Watt,Brunton,Pharmaceuticals
1,Hannay,Passepartout,Research
2,Graham,Leporello,Nonprofit Organization Management
3,Rollo,Merriman,Telekomünikasyon
4,Hunter,Merriman,Human Resources
...,...,...,...
37496,Wheatley,Barrymore,Public Relations and Communications
37497,Kincaid,Bertuccio,Real Estate
37498,Macdougall,Barrymore,Design
37499,Forsyth,Cadbury,Public Policy


### IN

One can also test whether a value appears in the results of a (correlated) subquery.  This should mirror your intuitions for the membership test $x \in S$ from logic.

Here's a version of the previous query, looking for *all people who aren't the current person in the parent query but have a matching first name*.  Then we test if the industry matches.
Observe that the results are the same but the query looks very different. What can you say about the *execution time*?

In [69]:
%%time
conn.sql("""
  SELECT DISTINCT A.family_name, A.given_name, A.industry
  FROM people A
  WHERE A.industry IN (
    SELECT industry
    FROM people B
    WHERE A._id != B._id
    AND A.given_name = B.given_name
  )
""").df()

CPU times: user 23 s, sys: 48 ms, total: 23.1 s
Wall time: 23.2 s


Unnamed: 0,family_name,given_name,industry
0,Strange,Jenkins,Internet
1,Burnett,Belvedere,Government Relations
2,Crichton,Belvedere,Bankwezen
3,Kennedy,Merriman,Banking
4,Dunbar,Cadbury,Financial Services
...,...,...,...
37496,Crichton,Merriman,Tehnologia informației și servicii informatice
37497,Maclean,Alfred,Computer Software
37498,Ogilvy,Barrymore,Marketing and Advertising
37499,Harris,Leporello,E-Learning


### Question for discussion

Can you think of a way to use a *join* to capture the same result above? Hint: you may need to also use `DISTINCT`.  Among the three options which is fastest?

### Test against ALL

The `>= ALL()` predicate in the `WHERE` clause allows us to compare against the results of a set -- computed in a subquery.  (In most cases that subquery returns a collection of unary tuples, since we will be comparing a single scalar such as a string or int.)

Example: Let's find, for each industry, the **person/people with the lexicographically greatest last name**. We can do this by seeing if the iterator's last name matches or exceeds *all* last names of people in the same industry.

In [63]:
conn.sql("""
   SELECT industry, _id, given_name, family_name
   FROM people A
   WHERE family_name >= ALL(
    SELECT family_name
    FROM people B
    WHERE A.industry = B.industry)
  ORDER BY A.industry
""").df()

Unnamed: 0,industry,_id,given_name,family_name
0,Accessori e moda,solid-change,Jeeves,Wheatley
1,Accessori e moda,camel-diff,Cadbury,Wheatley
2,Accounting,short-plan,Poole,Wood
3,Accounting,shy-fluid,Poole,Wood
4,Accounting,rainy-content,Poole,Wood
...,...,...,...,...
1837,,juicy-repeat,Alfred,Mackenzie
1838,,bounded-lattice,Figaro,Beckham
1839,,courtly-front,Bertuccio,Guthrie
1840,,generous-caption,Simonides,Cumming


Of course, if there is `>=ALL()`, you can imagine that there are *other* conditionals that can be tested against `ALL`.
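
In Python terms, `>= ALL(subquery)` behaves much like the built-in `all()` over the rows the subquery would return. The sketch below uses invented rows, and it models SQL's rule that `A.industry = B.industry` never matches when the industry is NULL by skipping rows whose industry is `None`.

```python
# Hypothetical miniature version of the people table.
people_rows = [
    {"_id": "p1", "family_name": "Wood", "industry": "Accounting"},
    {"_id": "p2", "family_name": "Hunter", "industry": "Accounting"},
    {"_id": "p3", "family_name": "Wood", "industry": "Accounting"},
    {"_id": "p4", "family_name": "Akins", "industry": "Design"},
    {"_id": "p5", "family_name": "Beckham", "industry": None},
    {"_id": "p6", "family_name": "Guthrie", "industry": None},
]

greatest = [
    (a["industry"], a["_id"], a["family_name"])
    for a in people_rows                          # FROM people A
    if all(                                       # family_name >= ALL(...)
        a["family_name"] >= b["family_name"]
        for b in people_rows                      # FROM people B
        # Correlated condition; a NULL industry matches nothing, as in SQL.
        if a["industry"] is not None and b["industry"] == a["industry"]
    )
]
print(greatest)
```

Note that `all()` over an empty collection is `True`, just as `>= ALL` over an empty subquery result is true in SQL. That is why every row with a missing industry passes the test here, which appears consistent with the many different last names shown for the blank industry at the end of the output above.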

## Grouping, HAVING vs WHERE

Sometimes we need to do SQL *grouping* as well as *filtering*.  What's important is to understand whether we are filtering *before* the aggregation (e.g., we eliminate folks we don't want to count) or *after* the aggregation (e.g., we eliminate aggregate groups).

For the former we use `WHERE` as usual.  But if we want to filter the *results* of a `GROUP BY` we need to either (1) feed the results into another query as a source in the `FROM` clause and filter there, which is often painful; or (2) use the optional SQL `HAVING` clause.

So, if we want to show in descending order all industries by popularity, as long as the industry is *not* Biotechnology and there are more than 10 people in the industry, we can do the following.

In [73]:
conn.sql("""
   SELECT industry, COUNT(_id) AS popularity
   FROM people
   WHERE industry <> 'Biotechnology'
   GROUP BY industry
   HAVING COUNT(_id) > 10
   ORDER BY COUNT(_id) DESC
""").df()

Unnamed: 0,industry,popularity
0,Information Technology and Services,5627
1,Computer Software,3512
2,Marketing and Advertising,2659
3,Internet,2159
4,Financial Services,1541
...,...,...
413,Telekomunikace,11
414,Alimenti e bevande,11
415,Arquitetura e planejamento,11
416,Fotografie,11
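To make the two options concrete, here is a minimal sketch that runs the `HAVING` form and the nested-`FROM` form side by side and checks that they agree. It is an illustration under assumptions: the `people` rows are made up, the threshold is lowered to 2 to suit the toy data, and it uses Python's built-in `sqlite3` rather than DuckDB so that it runs on its own.

```python
import sqlite3

# Toy stand-in for the notebook's `people` table (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (_id TEXT, industry TEXT)")
toy = ([("it%d" % i, "Internet") for i in range(4)]
       + [("bio%d" % i, "Biotechnology") for i in range(5)]
       + [("fin%d" % i, "Financial Services") for i in range(3)]
       + [("art%d" % i, "Fine Art") for i in range(1)])
conn.executemany("INSERT INTO people VALUES (?, ?)", toy)

# Option (2): WHERE filters rows before grouping, HAVING filters groups after.
with_having = conn.execute("""
    SELECT industry, COUNT(_id) AS popularity
    FROM people
    WHERE industry <> 'Biotechnology'
    GROUP BY industry
    HAVING COUNT(_id) > 2
    ORDER BY popularity DESC
""").fetchall()

# Option (1): aggregate in a subquery, then filter its output with WHERE.
with_subquery = conn.execute("""
    SELECT industry, popularity
    FROM (SELECT industry, COUNT(_id) AS popularity
          FROM people
          WHERE industry <> 'Biotechnology'
          GROUP BY industry) AS counts
    WHERE popularity > 2
    ORDER BY popularity DESC
""").fetchall()

print(with_having)
print(with_subquery)
```

Both queries drop Biotechnology before counting and drop the one-person group after counting; the `HAVING` version simply avoids the extra level of nesting.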


# SQL Debugging

## Debugging by Refactoring

Let's take a complex query about transitive relationships: say, pairs of people who each have substantial experience (3+ rows in `experience`) and who share a company, where one is in tech and the other is in marketing.  This is essentially a "two-hop neighbor" query on a kind of graph (person p1 --> company <-- p2, where both p1 and p2 have certain constraints).

Suppose we tried to write all of this out, maybe like this.  It has a lot of the right form, e.g., we know we are looking for people `p1` and `p2`, it looks for people with at least 3 experiences, etc.  But it returns no results!

In [97]:
conn.sql("""
  SELECT DISTINCT p1._id, p1.family_name, p1.given_name, p2._id, p2.family_name, p2.given_name
  FROM people p1 JOIN skills s ON p1._id = s._id JOIN experience e ON p1._id = e._id
     JOIN people p2 ON e._id = p2._id JOIN skills s2 ON p2._id = s2._id
  WHERE lower(s.skills) LIKE '%tech%' AND EXISTS (
    SELECT _id FROM experience e
    WHERE p1._id = e._id
    GROUP BY e._id
    HAVING COUNT(*) >= 3)
  AND lower(s2.skills) = 'Marketing' AND EXISTS (
    SELECT _id FROM experience e
    WHERE p2._id = e._id
    GROUP BY e._id
    HAVING COUNT(*) >= 3
  )
""").df()

Unnamed: 0,_id,family_name,given_name,_id_1,family_name_1,given_name_1


What if we break it into its two pieces first, and check each one on its own?

In [98]:
conn.sql("""
  SELECT DISTINCT _id, family_name, given_name, locality, skills, org
  FROM people p JOIN skills s USING (_id) JOIN experience USING (_id)
  WHERE lower(skills) LIKE '%tech%' AND EXISTS (
    SELECT _id FROM experience e
    WHERE p._id = e._id
    GROUP BY _id
    HAVING COUNT(*) >= 3)
""").df()

Unnamed: 0,_id,family_name,given_name,locality,skills,org
0,olive-shore,Ogilvy,Bertuccio,Greater Chicago Area,Technically Strong,Heineken
1,olive-shore,Ogilvy,Bertuccio,Greater Chicago Area,Technically Strong,Cervejarias Kaiser
2,olive-shore,Ogilvy,Bertuccio,Greater Chicago Area,Technically Strong,Viação Santa Catarina
3,olive-shore,Ogilvy,Bertuccio,Greater Chicago Area,Technically Strong,Federação de Cooperativas Agropecuárias
4,olive-shore,Ogilvy,Bertuccio,Greater Chicago Area,Technically Strong,Sonata Ind Apar Eletrônicos
...,...,...,...,...,...,...
2039,relaxed-ragged,Keith,Poole,"Chennai Area, India",Technological Innovation,Pace Automation Ltd
2040,zesty-collie,Livingstone,Leporello,"Copenhagen Area, Denmark",Technology Policy,Accenture
2041,zesty-collie,Livingstone,Leporello,"Copenhagen Area, Denmark",Technology Policy,Devoteam
2042,zesty-collie,Livingstone,Leporello,"Copenhagen Area, Denmark",Technology Policy,Rambøll Management


In [99]:
conn.sql("""
  SELECT DISTINCT _id, family_name, given_name, skills, org
  FROM people p JOIN skills s USING (_id) JOIN experience USING (_id)
  WHERE skills = 'Marketing' AND EXISTS (
    SELECT _id FROM experience e
    WHERE p._id = e._id
    GROUP BY _id
    HAVING COUNT(*) >= 3)
""").df()

Unnamed: 0,_id,family_name,given_name,skills,org
0,sour-database,Donald,Belvedere,Marketing,Steris
1,humid-paprika,Lindsay,Simonides,Marketing,Amanda Stevens Consulting
2,humid-paprika,Lindsay,Simonides,Marketing,Magellan Vacations
3,humid-paprika,Lindsay,Simonides,Marketing,Botanical PaperWorks
4,humid-paprika,Lindsay,Simonides,Marketing,RBC Royal Bank
...,...,...,...,...,...
463,compact-firmware,Finlay,Jenkins,Marketing,SIDEA Consultores
464,compact-firmware,Finlay,Jenkins,Marketing,Gobierno de Córdoba
465,compact-firmware,Finlay,Jenkins,Marketing,CEDI Consulting & Training
466,compact-firmware,Finlay,Jenkins,Marketing,Apex a SYKES Company


### Question

Those look good.  How would I join them together to actually solve the problem?

In [100]:
# TODO
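One possible way to approach this (a sketch, not the only answer) is to wrap each working piece in a `WITH` clause and join the two on the shared company (`org`) rather than on `_id`. The example below is an illustration under assumptions: the rows are made up, the table layouts mimic `people`, `skills`, and `experience`, and it uses Python's built-in `sqlite3` rather than DuckDB so that it runs without the LinkedIn data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (_id TEXT, family_name TEXT);
    CREATE TABLE skills (_id TEXT, skills TEXT);
    CREATE TABLE experience (_id TEXT, org TEXT);
""")
# Made-up rows: p1 (tech) and p2 (marketing) both worked at Acme.
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("p1", "Tess"), ("p2", "Mark"), ("p3", "Theo"), ("p4", "Mona")])
conn.executemany("INSERT INTO skills VALUES (?, ?)",
                 [("p1", "Technology Policy"), ("p2", "Marketing"),
                  ("p3", "Technical Writing"), ("p4", "Marketing")])
conn.executemany("INSERT INTO experience VALUES (?, ?)",
                 [("p1", "Acme"), ("p1", "Globex"), ("p1", "Initech"),
                  ("p2", "Acme"), ("p2", "Hooli"), ("p2", "Umbrella"),
                  ("p3", "Acme"), ("p3", "Hooli"),            # only 2 rows
                  ("p4", "Stark"), ("p4", "Wayne"), ("p4", "Wonka")])

rows = conn.execute("""
    WITH tech AS (          -- first piece: experienced tech people + orgs
        SELECT DISTINCT p._id, p.family_name, e.org
        FROM people p JOIN skills s ON p._id = s._id
             JOIN experience e ON p._id = e._id
        WHERE lower(s.skills) LIKE '%tech%' AND EXISTS (
            SELECT x._id FROM experience x
            WHERE x._id = p._id
            GROUP BY x._id
            HAVING COUNT(*) >= 3)
    ),
    marketing AS (          -- second piece: experienced marketing people + orgs
        SELECT DISTINCT p._id, p.family_name, e.org
        FROM people p JOIN skills s ON p._id = s._id
             JOIN experience e ON p._id = e._id
        WHERE s.skills = 'Marketing' AND EXISTS (
            SELECT x._id FROM experience x
            WHERE x._id = p._id
            GROUP BY x._id
            HAVING COUNT(*) >= 3)
    )
    SELECT DISTINCT t._id, t.family_name, m._id, m.family_name, t.org
    FROM tech t JOIN marketing m ON t.org = m.org   -- the shared company
    WHERE t._id <> m._id
    ORDER BY t._id, m._id
""").fetchall()

print(rows)
```

Because each piece was already checked on its own, any remaining surprise is most likely in the final join condition, which is now only one line long.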

## Debugging over Samples

For really big datasets, you may find that it takes forever to run the query.  One approach is to *sample* results.

*Caveat*: sampling independently from different tables that join is very risky. Each additional table you sample multiplies down the proportion of tuples in your query that still join ("selectivity"): keeping a fraction $p$ of each of $k$ joined tables keeps only about $p^k$ of the join results. This happens because the real values are *correlated*, but you are instead sampling *independently*.
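The effect is easy to see on toy data. The sketch below is plain Python with made-up "tables" (no DuckDB involved): it compares joining two independent 10% samples against taking a 10% sample of the full join result. The exact counts depend on the random seed, so only the rough sizes matter.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Two toy "tables" that join one-to-one on the keys 0..9999.
left = list(range(10_000))
right = list(range(10_000))
full_join = [(k, k) for k in left]  # 10,000 joined rows

rate = 0.10

# (a) Sample each input independently, then join the samples.
left_sample = {k for k in left if random.random() < rate}
right_sample = {k for k in right if random.random() < rate}
joined_samples = left_sample & right_sample  # keys that survive in both

# (b) Join first, then sample the join result.
sampled_join = [row for row in full_join if random.random() < rate]

print(len(joined_samples), "rows from joining independent samples")
print(len(sampled_join), "rows from sampling the join result")
```

Approach (a) should keep roughly 1% of the joined rows (10% of 10%), while approach (b) keeps roughly 10%, which is what we actually asked for.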

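DuckDB addresses this by allowing you to *sample over a query result* instead of over the input tables: its `USING SAMPLE` clause samples the rows produced *after* the joins in the `FROM` clause. (Per the DuckDB documentation, the sample is taken before `WHERE` and any aggregation are applied.)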

In [105]:
conn.sql("""
  SELECT DISTINCT _id, family_name, given_name, skills, org
  FROM people p
  JOIN skills s USING (_id) JOIN experience USING (_id)
  WHERE skills = 'Marketing' AND EXISTS (
    SELECT _id FROM experience e
    WHERE p._id = e._id
    GROUP BY _id
    HAVING COUNT(*) >= 3)
   USING SAMPLE 10 PERCENT
""").df()

Unnamed: 0,_id,family_name,given_name,skills,org
0,pureed-caramel,Henry,Merriman,Marketing,Easy Internet Marketing ( Tribal Holding )
1,pureed-caramel,Henry,Merriman,Marketing,Tribal People
2,pureed-caramel,Henry,Merriman,Marketing,Easy Internet Marketing
3,pureed-caramel,Henry,Merriman,Marketing,Grace Industries
4,pureed-caramel,Henry,Merriman,Marketing,The Designer-III Company
5,pureed-caramel,Henry,Merriman,Marketing,Joy Mining Machinery
6,pureed-caramel,Henry,Merriman,Marketing,Tribal Internet Solutions (Shanghai) LTD
7,pureed-caramel,Henry,Merriman,Marketing,Matric Limited
8,pureed-caramel,Henry,Merriman,Marketing,Circuit Cellar Magazine
