In [1]:
%load_ext sql
%sql sqlite:///PS2.db

Problem Set #2
=======

### Deliverables:

Submit your answers using the `submission_template.txt` file that is posted on the class website. Follow the instructions in the file! Upload the file to Canvas (under PS2).


### Instructions / Notes:

**_Read these carefully_**

* You **may** create new IPython notebook cells for testing, debugging, exploring, etc. This is encouraged, in fact! **Just make sure that your final answer for each question is _in its own cell_ and _clearly indicated_.**
* When you see `In [*]:` to the left of the cell you are executing, this means that the code / query is _running_.
    * **If the cell is hanging, i.e. running for too long: to restart the SQL connection, you must restart the entire Python kernel.**
    * To restart the kernel using the menu bar, choose "Kernel >> Restart >> Clear all outputs & restart", then re-execute the SQL connection cell at the top.
    * You will also need to restart the connection if you want to load a different version of the database file
* Remember:
    * `%sql [SQL]` is for _single line_ SQL queries
    * `%%sql [SQL]` is for _multi line_ SQL queries
* _Have fun!_

Problem 1: Verifying Functional Dependencies [24 points]
---------

For this part, you will need to provide a _single_ SQL query which will check whether a certain condition holds on the **hospital** table in the provided database:

In [2]:
%sql select * from hospital limit 2;

 * sqlite:///PS2.db
Done.


provider,hospital,address,city,state,zip,county,phone_number,hospital_type,hospital_owner,emergency_service,condition,measure_code
10018,CALLAHAN EYE FOUNDATION HOSPITAL,1720 UNIVERSITY BLVD,BIRMINGHAM,AL,35233,JEFFERSON,2053258100,Acute Care Hospitals,Voluntary non-profit - Private,Yes,Surgical Infection Prevention,SCIP-CARD-2
10018,CALLAHAN EYE FOUNDATION HOSPITAL,1720 UNIVERSITY BLVD,BIRMINGHAM,AL,35233,JEFFERSON,2053258100,Acute Care Hospitals,Voluntary non-profit - Private,Yes,Surgical Infection Prevention,SCIP-INF-1


You need to evaluate any requested conditions in the following way: **your query should return an empty result if and only if the condition holds on the instance.**  If the condition doesn't hold, your query should return something non-empty, but it doesn't matter what this is.

Note our language here: the conditions that we specify cannot be proved to hold **in general** without knowing the externally-defined functional dependencies; so what we mean is, _check whether they **are not violated** for the provided instance_.

You may assume that there are no `NULL` values in the tables.
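The "empty result if and only if the condition holds" convention can be tried out on a small table first. The following is a minimal sketch using Python's built-in `sqlite3` module; the table `t`, its columns `k` and `v`, and the helper `violations` are hypothetical and are not part of the PS2 database. It assumes the table contains no fully duplicated rows, since a repeated row would also be reported.

```python
import sqlite3

# Hypothetical toy table (not the PS2 data): k = 1 appears with two different v.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k INTEGER, v TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "x"), (1, "y"), (2, "z")])


def violations(conn, table, cols):
    """Return one row per value combination of `cols` that occurs more than once.

    Assuming no duplicated rows, the result is empty exactly when `cols`
    is a superkey of this instance.
    """
    col_list = ", ".join(cols)
    query = (
        f"SELECT {col_list}, COUNT(*) FROM {table} "
        f"GROUP BY {col_list} HAVING COUNT(*) > 1"
    )
    return conn.execute(query).fetchall()


print(violations(conn, "t", ["k"]))       # non-empty: {k} is not a superkey here
print(violations(conn, "t", ["k", "v"]))  # empty: {k, v} is a superkey here
```

The same `GROUP BY ... HAVING` shape can be written directly in a `%%sql` cell against the `hospital` table.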

### Part (a)  [14 points]

Is $\{provider\}$ a **superkey** for relation $Hospital$?

In [7]:
%%sql
select count(*) as 'count', h.provider
from hospital h
group by h.provider
having count(*) > 1
;

 * sqlite:///PS2.db
Done.
Done.


count,provider
25,10001
25,10005
25,10006
25,10007
25,10008
25,10009
25,10010
25,10011
25,10012
25,10015


### Part (b) [10 points]

Does $\{Zip\} \rightarrow \{City, State\}$ hold for relation $Hospital$?

In [8]:
%%sql
with zcs as (
    select h.zip, h.city, h.state
    from hospital h),
zz as (
    select tmp.zip
    from zcs tmp),
cs as (
    select tmp.city, tmp.state
    from zcs tmp),
zzcount as (
    select count(*) as 'count'
    from zz),
cscount as (
    select count(*) as 'count'
    from cs)
select *
from zzcount, cscount
where zzcount.count != cscount.count
;

 * sqlite:///PS2.db
Done.
Done.


count,count_1
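
A caveat worth checking for the query above: `zz` and `cs` are both projections of the same rows without `DISTINCT`, so their row counts always match and the final result is empty on any instance. A check that depends on the data groups by the left-hand side of the FD and looks for groups with more than one distinct right-hand side. The sketch below illustrates this with Python's `sqlite3` module on a hypothetical table `toy` with made-up rows; it is not run against `PS2.db`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy (zip TEXT, city TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO toy VALUES (?, ?, ?)",
    [
        ("35233", "BIRMINGHAM", "AL"),
        ("35233", "BIRMINGHAM", "AL"),  # repeated row: not a violation
        ("53703", "MADISON", "WI"),
        ("53703", "MADISON", "WA"),     # same zip, different state: a violation
    ],
)

# Empty result if and only if zip -> city, state holds on the instance.
FD_CHECK = """
    SELECT zip
    FROM (SELECT DISTINCT zip, city, state FROM toy)
    GROUP BY zip
    HAVING COUNT(*) > 1
"""

bad_zips = conn.execute(FD_CHECK).fetchall()
print(bad_zips)  # lists the zips that map to more than one (city, state) pair
```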


Problem 2: Superkeys & Decompositions [40 points]
---------

Consider a relation $S(A,B,C,D,E,F)$ with the following functional dependencies:

* $\{A\} \rightarrow \{D\}$
* $\{A\} \rightarrow \{E\}$
* $\{D\} \rightarrow \{C\}$
* $\{D\} \rightarrow \{F\}$

In each part of this problem, we will examine different properties of the provided schema.

To answer **yes**, provide python code that assigns the variable ```answer``` to ```True``` and assigns ```explanation``` to be a python string which contains a (short!) explanation of why.  For example:

```python
answer = True
explanation = "All keys are superkeys."
```

To answer **no**, provide python code that assigns the variable ```answer``` to ```False``` and assigns ```explanation``` to be a python string which contains a (short!) explanation of why.  For example:

```python
answer = False
explanation = "D is not a superkey because its closure is {D,C,F}."
```

### Part (a) [8 points]

Is it correct that $\{A,B\}$ is a superkey?

In [28]:
answer = True
explanation = "Closure of A, B yields all attributes of relation S."
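
The closure argument can be checked mechanically. Below is a minimal sketch of the standard attribute-closure computation; the function name `closure` and the encoding of each FD as a pair of attribute strings are choices made here for illustration.

```python
def closure(attrs, fds):
    """Return the closure of `attrs` under `fds`, a list of (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire an FD when its whole left-hand side is already in the closure.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


# The FDs of S(A,B,C,D,E,F) given in this problem.
FDS = [("A", "D"), ("A", "E"), ("D", "C"), ("D", "F")]

print(sorted(closure("AB", FDS)))  # ['A', 'B', 'C', 'D', 'E', 'F']
print(sorted(closure("D", FDS)))   # ['C', 'D', 'F']
```

A set of attributes is a superkey exactly when its closure contains every attribute of the relation.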

### Part (b) [8 points]

Is it correct that the decomposition $ABC$, $CDE$, $EFA$ is lossless-join?

In [29]:
answer = False
explanation = "The chase ends with no row made up entirely of distinguished (unsubscripted) symbols."
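
The chase can also be run programmatically. The following is a sketch of the tableau test under one common convention: `("a", X)` stands for the distinguished symbol of column `X`, and `("b", i, X)` for a symbol specific to row `i`. The function name `is_lossless` and this encoding are assumptions made for illustration.

```python
def is_lossless(attrs, fds, decomposition):
    """Chase test: True if the decomposition is lossless-join under `fds`."""
    # One tableau row per schema; a column gets the distinguished symbol
    # when the schema contains that attribute.
    rows = [
        {x: ("a", x) if x in schema else ("b", i, x) for x in attrs}
        for i, schema in enumerate(decomposition)
    ]
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            for r1 in rows:
                for r2 in rows:
                    if r1 is r2 or any(r1[x] != r2[x] for x in lhs):
                        continue
                    for y in rhs:
                        if r1[y] != r2[y]:
                            # Equate the two symbols everywhere in the column,
                            # keeping the distinguished one ("a" sorts first).
                            keep, drop = sorted((r1[y], r2[y]))
                            for r in rows:
                                if r[y] == drop:
                                    r[y] = keep
                            changed = True
    # Lossless exactly when some row is fully distinguished.
    return any(all(r[x] == ("a", x) for x in attrs) for r in rows)


FDS = [("A", "D"), ("A", "E"), ("D", "C"), ("D", "F")]

print(is_lossless("ABCDEF", FDS, ["ABC", "CDE", "EFA"]))  # False
print(is_lossless("ABCDEF", FDS, ["ADE", "DCF", "AB"]))   # True
```

The second call uses a different, hypothetical decomposition only to show a case where the test succeeds.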

### Part (c) [8 points]

Is it correct that the decomposition $ABC$, $CDE$, $EFA$ is dependency preserving?

In [30]:
answer = False
explanation = "A -> D and D -> F are not preserved."
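
Dependency preservation can be tested without computing the projected FDs explicitly. The sketch below follows the usual textbook procedure: start from the left-hand side of an FD and repeatedly add whatever each decomposed schema can derive locally. The helper names `closure` and `is_preserved` are hypothetical.

```python
def closure(attrs, fds):
    """Return the closure of `attrs` under `fds`, a list of (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_preserved(fd, fds, decomposition):
    """True if `fd` follows from the FDs projected onto the decomposed schemas."""
    lhs, rhs = fd
    z = set(lhs)
    changed = True
    while changed:
        changed = False
        for schema in decomposition:
            # What this schema alone can derive from the attributes known so far.
            gained = closure(z & set(schema), fds) & set(schema)
            if not gained <= z:
                z |= gained
                changed = True
    return set(rhs) <= z


FDS = [("A", "D"), ("A", "E"), ("D", "C"), ("D", "F")]
DECOMPOSITION = ["ABC", "CDE", "EFA"]

for fd in FDS:
    print(fd, is_preserved(fd, FDS, DECOMPOSITION))
# ('A', 'D') False
# ('A', 'E') True
# ('D', 'C') True
# ('D', 'F') False
```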

### Part (d) [8 points]

Is the functional dependency $\{A\} \rightarrow \{E,F\}$ logically implied by FDs present in the relation?

In [31]:
answer = True
explanation = "Closure of A is {A, C, D, E, F}, which contains E and F, so {A} -> {E, F} is implied."

### Part (e) [8 points]

Is it correct that relation $S$ is in BCNF? 

In [32]:
answer = False
explanation = "Neither A nor D is a superkey of S, so the nontrivial FDs A -> D and D -> C violate BCNF."

Problem 3: Relational Algebra [36 points]
---------

Consider the following relational schema for conference publications:
*  `Article(aid, title, year, confid, numpages)`
*  `Conference(confid, name, impact)`
*  `Author(aid, pid)`
*  `Person(pid, name, affiliation)`

Express the following queries in the extended Relational Algebra (you can also use the aggregation operator if necessary). To write the RA expression, use the LaTeX mode that the IPython notebook provides. For example:

$$\pi_{name}(\sigma_{affiliation="UW-Madison"}(Person))$$ 

### Part (a) [9 points]

Output the name of every person affiliated with `UW-Madison` who has submitted at least one article in 2019.

$$\pi_{name}(\sigma_{affiliation="UW-Madison"}((\pi_{aid}(\sigma_{year=2019}(Article))\bowtie Author)\bowtie\pi_{pid, name, affiliation}(Person)))$$

### Part (b) [9 points]

Output the names of the people who coauthored an article with `John Doe`. Be careful: a person cannot be coauthor with herself!

$$\begin{aligned}
S_1 &= \pi_{pid}(\sigma_{name="John Doe"}(Person))\bowtie Author \\
S_2 &= \rho_{pid \to npid}(\pi_{pid}(\sigma_{name\neq"John Doe"}(Person))\bowtie Author) \\
&\pi_{name}(\rho_{npid \to pid}(\pi_{npid}(S_1 \bowtie S_2)) \bowtie Person)
\end{aligned}$$

### Part (c) [9 points]

Count how many articles were published during 2010-2020 by `John Doe`.

$$\gamma_{\text{COUNT}(aid)}(\pi_{aid}(\pi_{pid}(\sigma_{name="John Doe"}(Person))\bowtie Author)\bowtie \sigma_{year \geq 2010 \,\wedge\, year \leq 2020}(Article))$$

### Part (d) [9 points]

Output the names of everyone who published an article in the conference `SIGMOD` in 2018, but not in 2019. 

$$\begin{aligned}
S_1 &= \pi_{pid}(\pi_{aid}(\sigma_{year=2018}(\pi_{confid}(\sigma_{name="SIGMOD"}(Conference)) \bowtie Article)) \bowtie Author) \\
S_2 &= \pi_{pid}(\pi_{aid}(\sigma_{year=2019}(\pi_{confid}(\sigma_{name="SIGMOD"}(Conference)) \bowtie Article)) \bowtie Author) \\
&\pi_{name}((S_1 - S_2) \bowtie Person)
\end{aligned}$$
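
One way to sanity-check a difference-based expression such as $S_1 - S_2$ is to evaluate it on a tiny instance. The sketch below builds the four tables in an in-memory SQLite database with made-up rows (all values are hypothetical), computes the two `pid` sets, takes their difference, and then looks up the names in `Person`. The helper `sigmod_authors` is an illustrative name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Conference (confid INTEGER, name TEXT, impact REAL);
    CREATE TABLE Article (aid INTEGER, title TEXT, year INTEGER,
                          confid INTEGER, numpages INTEGER);
    CREATE TABLE Author (aid INTEGER, pid INTEGER);
    CREATE TABLE Person (pid INTEGER, name TEXT, affiliation TEXT);

    -- Hypothetical rows: Ann publishes in SIGMOD in 2018 and 2019,
    -- Bob in SIGMOD in 2018 and in VLDB in 2019.
    INSERT INTO Conference VALUES (1, 'SIGMOD', 1.0), (2, 'VLDB', 1.0);
    INSERT INTO Article VALUES (10, 't1', 2018, 1, 12), (11, 't2', 2019, 1, 12),
                               (12, 't3', 2018, 1, 12), (13, 't4', 2019, 2, 12);
    INSERT INTO Author VALUES (10, 100), (11, 100), (12, 101), (13, 101);
    INSERT INTO Person VALUES (100, 'Ann', 'X'), (101, 'Bob', 'Y');
""")


def sigmod_authors(year):
    """Set of pids with at least one SIGMOD article in `year` (S_1 or S_2)."""
    rows = conn.execute(
        """SELECT au.pid
           FROM Conference c
           JOIN Article ar ON ar.confid = c.confid
           JOIN Author au ON au.aid = ar.aid
           WHERE c.name = 'SIGMOD' AND ar.year = ?""",
        (year,),
    )
    return {pid for (pid,) in rows}


# Set difference on pids first, then the join with Person to obtain names.
only_2018 = sigmod_authors(2018) - sigmod_authors(2019)
names = sorted(
    name
    for pid, name, _ in conn.execute("SELECT pid, name, affiliation FROM Person")
    if pid in only_2018
)
print(names)  # ['Bob']
```

Taking the difference on `pid` rather than on `name` keeps two different people who share a name from cancelling each other out.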