# Discussion 10: SQL

## Setup

In [1]:
import pandas as pd
import numpy as np
import duckdb

In [2]:
# Run this cell to set up SQL.
%load_ext sql

In [3]:
# Run this cell to connect to duckdb
conn = duckdb.connect()
conn.query("INSTALL sqlite")
%sql conn --alias duckdb

In [4]:
# Run this cell to create the survey table for Q1
data = {'j_name': ['Llama Technician','Software Engineer','Open Source Maintainer','Big Data Engineer', 'Data Analyst', 'Analyst Intern'],
        'c_name': ["Google","Salesforce", "Github","Microsoft","Startup","Google"],
        'c_location' : ['Mountain View', 'SF', 'SF', 'Redmond', 'Berkeley', 'SF'],
        'm_name': ["Applied Math","ORMS","Computer Science", "Data Science", "Data Science","Philosophy"]
        }

survey = pd.DataFrame(data, columns = list(data.keys()))

In [5]:
# Run this cell to create the tables for Q3
homes_data = {'home_id': [1,2,3,4,5,6],
        'city': ["Berkeley","San Jose","Berkeley","Berkeley","Berkeley", "Sunnyvale"],
        'bedrooms': [2,1,5,3,4,1],
        'bathrooms': [2,2,1,1,3,2],
        'area': [str(i) for i in [500,750,1000,1500,500,1000]] 
        }

homes = pd.DataFrame(homes_data, columns = list(homes_data.keys()))

transactions_data = {'home_id': [1,2,3,5],
        'buyer_id': [5,6,7,8],
        'seller_id': [8,7,6,5],
        'transaction_date': ['1/12/2001','4/14/2001','8/11/2001','12/21/2001'],
        'sale_price': [1000,500,750,1200]
        }

transactions = pd.DataFrame(transactions_data, columns = list(transactions_data.keys()))


buyers_data = {'buyer_id': [5,6,7,8],
        'name': ["Xiaorui","Conan","Rose","Brandon"],
        }

buyers = pd.DataFrame(buyers_data, columns = list(buyers_data.keys()))

seller_data = {'seller_id': [8,7,6,5],
        'name': ["Shreya","Emrie","Jake","Sam"],
        }

sellers = pd.DataFrame(seller_data, columns = list(seller_data.keys()))

In [6]:
# Run this cell to create the tables for Q4
cat_owners_data = {'id': [10,11,12],
        'name': ["Alice","Bob","Candice"],
        }

cat_owners = pd.DataFrame(cat_owners_data, columns = list(cat_owners_data.keys()))

cats_data = {'id': [51,52,53,54,55],
        'owner_id': [10, 10, 11, 11, 12],
        'name': ["Mittens","Whisker","Pishi","Lucky","Fluffy"],
        'breed' : ["Tabby","Black","Orange","Tabby","Black"],
        'age': [2,3,1,2,16]
        }

cats = pd.DataFrame(cats_data, columns = list(cats_data.keys()))

## SQL Syntax
### Q1 
For this question, we will be working with the UC Berkeley Undergraduate Career Survey
dataset, named `survey`. Each year, the UC Berkeley Career Center surveys graduating seniors about their plans after graduation. Below is a sample of the full dataset, which contains many
thousands of rows.

![](survey_table.png)

Each record of the `survey` table is an entry corresponding to a student. We have the job title,
company information, and the student’s major.

#### 1a
Write an SQL query that selects all data science major graduates that got jobs in Berkeley.
The result generated by your query should include all 4 columns.

In [7]:
%%sql 
-- write your query here --
SELECT * FROM survey
WHERE m_name = 'Data Science' 
AND c_location = 'Berkeley';

j_name,c_name,c_location,m_name
Data Analyst,Startup,Berkeley,Data Science


#### 1b
Write an SQL query to find the top 2 most popular companies that data science graduates will work at, from most popular to 2nd most popular.

In [8]:
%%sql 
-- write your query here --
SELECT c_name, COUNT(*) AS count
FROM survey
WHERE m_name = 'Data Science'
GROUP BY c_name
ORDER BY count DESC
LIMIT 2;

c_name,count
Microsoft,1
Startup,1


## Joins 

![](joins.png)

Note: You do not need the JOIN keyword to join SQL tables. The following are equivalent:

    SELECT column1, column2
    FROM table1, table2
    WHERE table1.id = table2.id;
    
    SELECT column1, column2
    FROM table1 JOIN table2 
    ON table1.id = table2.id;
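
To make the equivalence concrete, here is a minimal sketch that runs both forms and compares the results. It assumes Python's standard-library `sqlite3` module rather than the DuckDB connection used in this notebook, and the contents of `table1` and `table2` are made up for illustration.

```python
import sqlite3

# Hypothetical two-table setup; the rows are illustrative only.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table1 (id INTEGER, column1 TEXT);
    CREATE TABLE table2 (id INTEGER, column2 TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO table2 VALUES (2, 'x'), (3, 'y'), (4, 'z');
""")

# Join written with a comma-separated FROM list and a WHERE condition.
implicit = con.execute("""
    SELECT column1, column2
    FROM table1, table2
    WHERE table1.id = table2.id
    ORDER BY column1
""").fetchall()

# The same join written with the JOIN ... ON syntax.
explicit = con.execute("""
    SELECT column1, column2
    FROM table1 JOIN table2
    ON table1.id = table2.id
    ORDER BY column1
""").fetchall()

print(implicit)              # [('b', 'x'), ('c', 'y')]
print(implicit == explicit)  # True
```

Only ids 2 and 3 appear in both tables, so both queries return the same two rows.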

### Q2 

In the figure above, assume `table1` has $m$ records, while `table2` has $n$ records. Describe which records are returned from each type of join. What is the **maximum** possible number of records returned in each join? Consider the cases where on the joined field, (1) both tables have unique values, and (2) both tables have duplicated values. Finally, what is the **minimum** possible number of records returned in each join?


**Solution**

(INNER) JOIN: Returns records that have matching values in both tables. The maximum number of rows is $\min(m, n)$ if the joined field has unique values in each table. The minimum number of rows is 0.

LEFT (OUTER) JOIN: Returns all records from the left table, and the matched records from the right table. The maximum number of rows is $m$ if the joined field has unique values. The minimum number of rows is $m$.

RIGHT (OUTER) JOIN: Returns all records from the right table, and the matched records from the left table. The maximum number of rows is $n$ if the joined field has unique values. The minimum number of rows is $n$.

FULL (OUTER) JOIN: Returns all records from both tables, matched where possible. The maximum number of rows is $m + n$ if the joined field has unique values. The minimum number of rows is $\max(m, n)$.

CROSS JOIN: Returns all pairs of records between the left and right tables. The maximum number of rows is $m \times n$. The minimum number of rows is $m \times n$.

All joins have a maximum of $m \times n$ rows if duplicates are allowed.
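
These counts can be checked empirically. The sketch below assumes Python's standard-library `sqlite3` module instead of this notebook's DuckDB connection, and it covers only INNER, LEFT, and CROSS joins because older SQLite versions lack RIGHT and FULL joins. The helper name `join_counts` and the key values are hypothetical.

```python
import sqlite3

def join_counts(left_keys, right_keys):
    """Return row counts for INNER, LEFT, and CROSS joins on the given key lists."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE table1 (id INTEGER)")
    con.execute("CREATE TABLE table2 (id INTEGER)")
    con.executemany("INSERT INTO table1 VALUES (?)", [(k,) for k in left_keys])
    con.executemany("INSERT INTO table2 VALUES (?)", [(k,) for k in right_keys])
    queries = {
        "inner": "SELECT COUNT(*) FROM table1 JOIN table2 ON table1.id = table2.id",
        "left": "SELECT COUNT(*) FROM table1 LEFT JOIN table2 ON table1.id = table2.id",
        "cross": "SELECT COUNT(*) FROM table1 CROSS JOIN table2",
    }
    counts = {name: con.execute(sql).fetchone()[0] for name, sql in queries.items()}
    con.close()
    return counts

# In every case below, m = 3 and n = 2.
unique_overlap = join_counts([1, 2, 3], [2, 3])  # unique keys, every right key matches
no_overlap = join_counts([1, 2, 3], [8, 9])      # unique keys, nothing matches
all_duplicates = join_counts([7, 7, 7], [7, 7])  # every row shares one key

print(unique_overlap)  # {'inner': 2, 'left': 3, 'cross': 6}
print(no_overlap)      # {'inner': 0, 'left': 3, 'cross': 6}
print(all_duplicates)  # {'inner': 6, 'left': 6, 'cross': 6}
```

With unique keys the inner join tops out at $\min(m, n) = 2$ and bottoms out at 0, while the left join always keeps $m = 3$ rows. When every row shares the same key, both joins reach $m \times n = 6$ rows.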

### Q3



Consider the following real estate schema (underlined column names have unique values and no duplicates):

* <code> homes(<u>home_id int</u>, city text, bedrooms int, bathrooms int,
area text) </code>
* <code> transactions(<u>home_id int, buyer_id int, seller_id int, transaction_date date</u>, sale_price int) </code>
* <code> buyers(<u>buyer_id int</u>, name text) </code>
* <code> sellers(<u>seller_id int</u>, name text) </code>

Fill in the blanks in the SQL query to find the `home_id`, `sale_price`, and `area` for each home in Berkeley with an area greater than 600. A home that meets these criteria but has not been sold yet should still be included in the table, with **the price as None**.


In [9]:
%%sql 
-- fill in the blanks --
SELECT H.home_id, T.sale_price, H.area
FROM homes AS H
LEFT JOIN transactions AS T
ON H.home_id = T.home_id
WHERE H.city = 'Berkeley'
AND CAST(H.area AS INT) > 600;

home_id,sale_price,area
3,750.0,1000
4,,1500


In [10]:
%%sql 
-- alternate solution using RIGHT JOIN; the cast is done once in SELECT and its alias is reused in WHERE --
SELECT H.home_id, T.sale_price, 
    CAST(H.area as INT) as area_int
FROM transactions AS T
RIGHT JOIN homes AS H
ON H.home_id = T.home_id
WHERE H.city = 'Berkeley'
AND area_int > 600;

home_id,sale_price,area_int
3,750.0,1000
4,,1500
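
Both solutions cast `area` before comparing because the column is stored as text. As a rough illustration of what can go wrong otherwise, the sketch below uses plain Python string comparison. This is an analogy only: how DuckDB itself handles a text-versus-integer comparison may differ.

```python
# Areas stored as strings, matching the `area` column of the `homes` table above.
areas = ["500", "750", "1000", "1500", "500", "1000"]

# Lexicographic comparison of digit strings (plain Python, not DuckDB semantics).
as_text = [a for a in areas if a > "600"]

# Numeric comparison after converting, analogous to CAST(area AS INT) > 600.
as_int = [a for a in areas if int(a) > 600]

print(as_text)  # ['750']
print(as_int)   # ['750', '1000', '1500', '1000']
```

Compared as text, "1000" sorts before "600" because "1" comes before "6", so the larger homes are dropped. Converting to integers first restores the intended numeric ordering.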


## More SQL Queries
### Q4

Examine this schema for these two tables:

    CREATE TABLE cat_owners (
        id integer, 
        name text, 
        age integer,
        PRIMARY KEY (id)
    ); 

    CREATE TABLE cats (
        id integer,
        owner_id integer, 
        name text, 
        breed text, 
        age integer, 
        PRIMARY KEY (id),
        FOREIGN KEY (owner_id) REFERENCES cat_owners
    );


#### 4a
Write an SQL query to create a table almost identical to `cats`, except with an additional
column `Nickname` that has the value "Kitten" for cats aged 1 or younger,
"Catto" for cats aged strictly between 1 and 15, and "Wise One" for cats aged 15 or older.

In [11]:
%%sql 
-- write your query here --
SELECT id, owner_id, name, breed, age,
    CASE
        WHEN age <= 1 THEN 'Kitten'
        WHEN age >= 15 THEN 'Wise One'
        ELSE 'Catto'
    END AS Nickname
FROM cats AS C;
-- the explicit column list can also be written as SELECT *, followed by the CASE expression --

id,owner_id,name,breed,age,Nickname
51,10,Mittens,Tabby,2,Catto
52,10,Whisker,Black,3,Catto
53,11,Pishi,Orange,1,Kitten
54,11,Lucky,Tabby,2,Catto
55,12,Fluffy,Black,16,Wise One


#### 4b
Considering only cats with ages strictly greater than 1, write an SQL query that returns the `owner_id` of each owner that owns more than one such cat.

In [12]:
%%sql 
-- write your query here --
SELECT owner_id
FROM cats
WHERE age > 1
GROUP BY owner_id
HAVING COUNT(*) > 1;

owner_id
10


#### 4c
Write an SQL query that returns the total number of cats each `owner_id` owns sorted by the number of cats in descending order. There should be two columns (`owner_id` and `num_cats`).

In [13]:
%%sql 
-- write your query here --
SELECT owner_id, COUNT(*) AS num_cats
FROM cats 
GROUP BY owner_id
ORDER BY num_cats DESC;

owner_id,num_cats
11,2
10,2
12,1


#### 4d
Write an SQL query to figure out the names of all of the cat owners who have a cat
named Pishi. 

In [14]:
%%sql 
-- write your query here --
SELECT O.name 
FROM cats AS C
JOIN cat_owners AS O
ON C.owner_id = O.id 
WHERE C.name = 'Pishi';

name
Bob


#### 4e
Is it possible to have a cat with an owner not in the `cat_owners` table? Explain your answer.

**Solution**: No. Since the `cats` table has a FOREIGN KEY constraint on `owner_id`, every `owner_id` in `cats` must have a corresponding `id` entry in `cat_owners`.
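
As a minimal sketch of this constraint in action, the example below uses Python's standard-library `sqlite3` module rather than this notebook's DuckDB connection. The schema is a trimmed version of the one above, and the cat named 'Stray' with `owner_id` 42 is a made-up row. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    CREATE TABLE cat_owners (
        id INTEGER PRIMARY KEY,
        name TEXT
    );
    CREATE TABLE cats (
        id INTEGER PRIMARY KEY,
        owner_id INTEGER,
        name TEXT,
        FOREIGN KEY (owner_id) REFERENCES cat_owners (id)
    );
    INSERT INTO cat_owners VALUES (10, 'Alice');
""")

# Owner 10 exists, so this insert is accepted.
con.execute("INSERT INTO cats VALUES (51, 10, 'Mittens')")

# Owner 42 does not exist, so the foreign key constraint rejects the row.
try:
    con.execute("INSERT INTO cats VALUES (99, 42, 'Stray')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

num_cats = con.execute("SELECT COUNT(*) FROM cats").fetchone()[0]
print(rejected)  # True
print(num_cats)  # 1
```

The second insert fails with an integrity error, so only the cat with a valid owner ends up in the table.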

#### 4f
Write an SQL query to select all rows from the `cats` table that have cats of
the top 2 most popular cat breeds.

In [15]:
%%sql 
-- write your query here --
SELECT *
FROM cats WHERE breed IN
    (SELECT breed
    FROM cats
    GROUP BY breed
    ORDER BY COUNT(*) DESC
    LIMIT 2);

id,owner_id,name,breed,age
51,10,Mittens,Tabby,2
52,10,Whisker,Black,3
54,11,Lucky,Tabby,2
55,12,Fluffy,Black,16
