# SQL subqueries

In this Notebook, we will explore how the three types of subquery (*scalar*, *row* and *table*) can be used to 
compare data from two or more tables.

A `SELECT` statement can be embedded within another `SELECT` statement: the embedded query is called a 
subquery (or inner query), and the query that contains it is called the outer query. 
The results of the subquery are used by the outer query to help determine the content of the resultant table.

There are three types of subquery:

- A *scalar subquery* returns a single column and a single row, that is, a single value. 
  A *scalar subquery* can be used whenever a single value is needed, for example in the `SELECT` clause or in a `WHERE` or `HAVING` condition.
- A *row subquery* returns two or more columns as a single row. 
  A *row subquery* can be used whenever a single row is needed, for example in a `WHERE` or `HAVING` condition.
- A *table subquery* returns one or more columns and one or more rows, that is, a table. 
  A *table subquery* can be used whenever a table is needed, for example in the `FROM` clause 
  or in a `WHERE` or `HAVING` condition.
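
As a rough illustration of the three result shapes, the sketch below runs one query of each kind from Python against a tiny in-memory SQLite database. SQLite and the three-row toy `patient` table are assumptions made so that the example is self-contained; they are not the PostgreSQL database used in the rest of this notebook.

```python
import sqlite3

# Toy data (hypothetical), not the notebook's tm351test database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (patient_id TEXT, weight REAL, doctor_id TEXT)")
conn.executemany(
    "INSERT INTO patient VALUES (?, ?, ?)",
    [("p001", 70.0, "d06"), ("p007", 80.0, "d07"), ("p008", 60.0, "d07")],
)

# Scalar shape: one row, one column.
scalar = conn.execute("SELECT AVG(weight) FROM patient").fetchall()

# Row shape: one row, two or more columns.
row = conn.execute(
    "SELECT weight, doctor_id FROM patient WHERE patient_id = 'p001'"
).fetchall()

# Table shape: one or more rows, one or more columns.
table = conn.execute(
    "SELECT patient_id, weight FROM patient "
    "WHERE doctor_id = 'd07' ORDER BY patient_id"
).fetchall()

print(scalar)  # a single row holding a single value
print(row)     # a single row holding two values
print(table)   # two rows of two values each
conn.close()
```

Each of these queries could be placed in parentheses inside an outer query wherever a value, a row or a table of that shape is expected.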

This notebook contains several exercises or activities, which are presented with a space for you to try your own solution. In each case, you can see our solution by clicking on the small triangle next to the text "**our solution**", but in all cases, you should attempt the questions yourself before looking at our proposed solutions.



## Set up the Database

First, enable access to the PostgreSQL database engine via [SQL Cell Magic](https://pypi.python.org/pypi/ipython-sql).

In [2]:
%load_ext sql

%sql postgresql://test:test@localhost:5432/tm351test

'Connected: test@tm351test'

As the `doctor` and `patient` tables may have been updated by another Notebook, we will first restore them to their original state:

In [3]:
%%sql
DROP TABLE IF EXISTS patient CASCADE;
DROP TABLE IF EXISTS doctor CASCADE;

CREATE TABLE doctor (
 doctor_id CHAR(3) NOT NULL
  CHECK (doctor_id SIMILAR TO 'd[0-9][0-9]'),
 doctor_name VARCHAR(20) NOT NULL,
 date_of_birth DATE NOT NULL,
 PRIMARY KEY (doctor_id)
 );

CREATE TABLE patient (
  patient_id CHAR(4) NOT NULL
    CHECK (patient_id SIMILAR TO 'p[0-9][0-9][0-9]'),
  patient_name VARCHAR(20) NOT NULL,
  date_of_birth DATE NOT NULL,
  gender CHAR(1) NOT NULL
    CHECK (gender = 'F' OR gender = 'M'),
  height DECIMAL(4,1)
    CHECK (height > 0),
  weight DECIMAL(4,1)
    CHECK (weight > 0),
  doctor_id CHAR(3),
 PRIMARY KEY (patient_id),
 FOREIGN KEY (doctor_id) REFERENCES doctor(doctor_id)
 );

Done.
Done.
Done.
Done.


[]

Populate the tables from files using [Psycopg](http://initd.org/psycopg/docs/index.html), 
a PostgreSQL database adapter for Python.

In [5]:
import psycopg2 as pg
import pandas as pd
import pandas.io.sql as psqlg

In [6]:
# open a connection to the PostgreSQL database tm351test
conn = pg.connect(dbname='tm351test', host='localhost', user='test', password='test', port=5432)
# create a cursor
c = conn.cursor()

# open doctor.dat
io = open('data/doctor.dat', 'r')
# execute the PostgreSQL copy command
c.copy_from(io, 'doctor')
# close doctor.dat
io.close()
# commit transaction
conn.commit()

# open patient+doctor_id.dat
io = open('data/patient+doctor_id.dat', 'r')
# execute the PostgreSQL copy command
c.copy_from(io, 'patient')
# close patient+doctor_id.dat
io.close()
# commit transaction
conn.commit()

# close cursor
c.close()
# close database connection
conn.close()

Check that everything has worked; each of the next two queries should return 5 rows:

In [7]:
%sql SELECT * FROM doctor ORDER BY doctor_id;

5 rows affected.


doctor_id,doctor_name,date_of_birth
d06,Gibson,1954-02-24
d07,Paxton,1960-05-23
d09,Tamblin,1972-12-22
d10,Rampton,1980-09-25
d11,Nolan,1988-04-01


In [8]:
%sql SELECT * FROM patient ORDER BY patient_id LIMIT 5;

5 rows affected.


patient_id,patient_name,date_of_birth,gender,height,weight,doctor_id
p001,Thornton,1980-01-22,F,162.3,71.6,d06
p007,Tennent,1980-04-01,M,176.8,70.9,d07
p008,James,1980-07-08,M,167.9,70.5,d07
p009,Kay,1980-09-25,F,164.7,53.2,d06
p015,Harris,1980-12-04,M,180.6,64.3,d06


## Scalar subqueries

A *scalar subquery* returns a single column and a single row, that is, a single value. 
A *scalar subquery* can be used whenever a single value is needed. 
For example, in the `SELECT` clause, or in the condition of a `WHERE` or `HAVING` clause with a comparison 
operator (`=, <, >, <>`). 
In the latter case, the resultant value can be compared with a value from the outer query.

The following query returns a single column and a single row – the average weight of patients registered at the 
doctors’ surgery who have been weighed:

In [9]:
%sql SELECT AVG(weight) FROM patient;

1 rows affected.


avg
67.47333333333333
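
The phrase "who have been weighed" matters because `AVG` skips `NULL` values instead of treating them as zero. A minimal sketch of that behaviour follows, using an in-memory SQLite database and made-up weights (both are assumptions for illustration only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (patient_id TEXT, weight REAL)")
conn.executemany(
    "INSERT INTO patient VALUES (?, ?)",
    [("p001", 60.0), ("p002", 80.0), ("p003", None)],  # p003 has not been weighed
)

# AVG(weight) uses only the non-NULL weights: (60 + 80) / 2.
# COUNT(*) counts rows, whereas COUNT(weight) counts non-NULL weights.
avg_weight, rows, weighed = conn.execute(
    "SELECT AVG(weight), COUNT(*), COUNT(weight) FROM patient"
).fetchone()

print(avg_weight, rows, weighed)
conn.close()
```

Comparing `COUNT(*)` with `COUNT(weight)` is a quick way to see how many rows were left out of the average.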


We can exert some control over the presentation of the numerical result using the `CAST( ... AS DECIMAL(precision, scale))` formulation, where the *precision* is the total number of significant digits in the whole number (that is, counting digits on both sides of the decimal point) and the *scale* is the number of those digits in the fractional part *after* the decimal point.

In [10]:
%sql SELECT CAST(AVG(weight) AS DECIMAL(4,2)) FROM patient;

1 rows affected.


avg
67.47
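
To make the roles of the two `DECIMAL` arguments concrete, the sketch below models `DECIMAL(precision, scale)` in plain Python using the standard `decimal` module. It takes the precision to be the total number of digits and the scale to be the number of digits after the decimal point. The helper name `to_decimal` is hypothetical, and the rounding mode (ties away from zero) is an assumption about how the database rounds; treat this as an approximation of the cast, not a reimplementation of it.

```python
from decimal import Decimal, ROUND_HALF_UP

def to_decimal(value, precision, scale):
    """Round value to `scale` decimal places and check it fits in `precision` digits."""
    quantum = Decimal(1).scaleb(-scale)  # scale=2 gives a quantum equal to 0.01
    rounded = Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)
    # Only (precision - scale) digits are available before the decimal point.
    if abs(rounded) >= Decimal(10) ** (precision - scale):
        raise OverflowError(f"{rounded} does not fit DECIMAL({precision},{scale})")
    return rounded

print(to_decimal(67.47333333333333, 4, 2))  # 67.47
print(to_decimal(67.47333333333333, 4, 1))  # 67.5
```

Under this model, a value such as 167.47 does not fit `DECIMAL(4,2)`, because only two of the four digits are available before the decimal point.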


In [11]:
%sql SELECT CAST(AVG(weight) AS INTEGER) FROM patient;

1 rows affected.


avg
67


This query can be used, for example, as a *scalar subquery* in the following three queries, where it is used 
respectively in the `SELECT` clause, in the condition of a `WHERE` clause, and in the condition of a `HAVING` clause.

### Scalar subqueries as part of a SELECT clause

Suppose that we want to find the difference between a particular value associated with each record and the *average* (mean) value taken over all the records.

For example, we might ask: *"what is the difference between the individual weights of a set of patients and the average weight of those patients?"*

To answer this question, we need to subtract the average weight from each individual weight:

In [12]:
%%sql

SELECT patient_id, patient_name, weight - (SELECT AVG(weight) 
                                           FROM patient) AS weight_difference
FROM patient
ORDER BY patient_id LIMIT 3;

3 rows affected.


patient_id,patient_name,weight_difference
p001,Thornton,4.126666666666666
p007,Tennent,3.4266666666666667
p008,James,3.026666666666667


### Exercise 1

How might you improve the presentation of that result, for example, by using a `CAST(... AS ...)` construction?

In [14]:
%%sql
SELECT patient_id, patient_name, CAST(weight - (SELECT AVG(weight) FROM patient) AS DECIMAL(4,2)) AS weight_difference

FROM patient
ORDER BY patient_id LIMIT 3;

3 rows affected.


patient_id,patient_name,weight_difference
p001,Thornton,4.13
p007,Tennent,3.43
p008,James,3.03


#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

The presentation of the result could be improved by wrapping the difference calculation in a `CAST` function:

In [13]:
%%sql

SELECT patient_id, patient_name, CAST(weight - (SELECT AVG(weight)
                                                FROM patient) AS DECIMAL(4,1)) AS weight_difference
FROM patient 
ORDER BY patient_id 
LIMIT 5;


5 rows affected.


patient_id,patient_name,weight_difference
p001,Thornton,4.1
p007,Tennent,3.4
p008,James,3.0
p009,Kay,-14.3
p015,Harris,-3.2


### Scalar subqueries as part of a WHERE clause

If we want to display rows where a property of each individual row stands in comparison to some property calculated over all the rows, we can use a *scalar subquery* as part of a `WHERE` clause to return a numerical value against which we can make a comparison.

For example, suppose that we want to know which patients weigh less than the average patient weight. That is, we want to select rows where the individual weight is less than the overall average weight.

In [15]:
%%sql

SELECT patient_id, patient_name
FROM patient
WHERE weight < (SELECT AVG(weight) FROM patient)
ORDER BY patient_id;

7 rows affected.


patient_id,patient_name
p009,Kay
p015,Harris
p068,Monroe
p079,Dixon
p080,Bell
p087,Reed
p089,Jarvis


### Scalar subqueries in the conditional part of a HAVING clause

If we want to make decisions based on group properties, rather than across all records, we need to make use of a `HAVING` clause.

Recall that the `HAVING` clause lets you select groups based on a test against a property of the group as a whole.

### Exercise 2

Write a query that can identify the doctors having more than three patients on their books. 

In [16]:
%%sql

SELECT doctor_id, COUNT(*) AS roll_count FROM patient
GROUP BY doctor_id
HAVING COUNT(*) > 3

2 rows affected.


doctor_id,roll_count
d07,4
d06,5


#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

<div class='answer'>To identify doctors with more than three patients on their books, we need to group the patient records by doctor, and then keep only those groups (using <tt>HAVING</tt>) that contain more than three patient records.</div>

In [18]:
%%sql

SELECT doctor_id, COUNT(*) AS roll_count FROM patient
GROUP BY doctor_id
HAVING COUNT(*) > 3

2 rows affected.


doctor_id,roll_count
d07,4
d06,5


When used in conjunction with a `HAVING` clause, a scalar subquery lets us compare a group property with a property of another set of records, or with a property of the records as a whole from which the grouped rows are taken.

For example, suppose that you wanted to know the birth years for which the average weight of the patients born in that year is less than the average weight of all the patients, along with those yearly average weights.

First, we need to find a way of extracting the birth year from the date of birth. The [`EXTRACT`](http://www.postgresql.org/docs/9.3/static/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT) function can be used to retrieve subfields such as year from *date* values.

In [19]:
%%sql
SELECT patient_id, date_of_birth, EXTRACT(YEAR FROM date_of_birth) AS year_of_birth 
FROM patient WHERE doctor_id='d10' LIMIT 5

3 rows affected.


patient_id,date_of_birth,year_of_birth
p068,1981-10-21,1981.0
p071,1981-12-12,1981.0
p078,1982-02-25,1982.0


Note that as the [`EXTRACT`](http://www.postgresql.org/docs/9.3/static/functions-datetime.html) function returns values as *double precision*, [`CAST`](http://www.postgresql.org/docs/9.3/static/sql-expressions.html#SQL-SYNTAX-TYPE-CASTS) should be used to convert the value to an *integer*.

Now we can find those year groups "having" an average weight less than the overall average weight, along with those yearly average weights.

In [20]:
%%sql
SELECT CAST(EXTRACT(YEAR FROM date_of_birth) AS INTEGER) AS year_of_birth, 
       AVG(weight) AS average_weight
FROM patient
GROUP BY year_of_birth
HAVING AVG(weight) < (SELECT AVG(weight) FROM patient);

2 rows affected.


year_of_birth,average_weight
1980,66.1
1982,63.98333333333333


Once again, we might care to tidy up the result using a `CAST()`:

In [21]:
%%sql
SELECT CAST(EXTRACT(YEAR FROM date_of_birth) AS INTEGER) AS year_of_birth, 
       CAST(AVG(weight) AS DECIMAL(4,1)) AS average_weight
FROM patient
GROUP BY year_of_birth
HAVING AVG(weight) < (SELECT AVG(weight) FROM patient);

2 rows affected.


year_of_birth,average_weight
1980,66.1
1982,64.0


### Exercise 3

Scalar subqueries can also be combined in a single query. How might you combine scalar subqueries to report the birth year and yearly average weight for those birth years in which the average patient weight is greater than the average weight of all the patients, together with the overall average weight?

In [23]:
%%sql
SELECT CAST(EXTRACT(YEAR FROM date_of_birth) AS INTEGER) AS year_of_birth,
        CAST(AVG(weight) AS DECIMAL(4,1)) AS average_weight,
        CAST((SELECT AVG(weight) FROM patient) AS DECIMAL(4,1)) AS overall_average_weight
FROM patient
GROUP BY year_of_birth
HAVING AVG(weight) > (SELECT AVG(weight) FROM patient);

1 rows affected.


year_of_birth,average_weight,overall_average_weight
1981,74.4,67.5


#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

To answer this question, we need to find the average weight by group and compare this to the overall average (using a scalar subquery in the <tt>HAVING</tt> clause), but also report the overall average as part of the <tt>SELECT</tt> clause:

In [24]:
%%sql

SELECT CAST(EXTRACT(YEAR FROM date_of_birth) AS INTEGER) AS year_of_birth,
       CAST(AVG(weight) AS DECIMAL(4,1)) AS average_weight,
       CAST((SELECT AVG(weight) FROM patient) AS DECIMAL(4,1)) AS overall_average_weight
FROM patient
GROUP BY year_of_birth
HAVING AVG(weight) > (SELECT AVG(weight) FROM patient);


1 rows affected.


year_of_birth,average_weight,overall_average_weight
1981,74.4,67.5


## Row subqueries

Suppose that there has been a mix up with some patient records relating to a particular doctor's patients born in a particular year. To resolve the issue, we need to find all the patients who have the same year of birth and doctor responsible for their care as the patient identified by a particular patient_id (*p015*).

A query that returns two columns as a single row, the year of birth and the doctor responsible for the 
patient identified as <tt>p015</tt> can be given as:

In [36]:
%%sql

SELECT EXTRACT(YEAR FROM date_of_birth), doctor_id
FROM patient
WHERE patient_id ='p015';

1 rows affected.


date_part,doctor_id
1980.0,d06


Identifying the birth year and doctor for the patient is one thing, but how can we now use this information to find other patients with the same doctor and year of birth?

A *row subquery* returns two or more columns as a single row. The result of a row subquery can be used in the condition of the outer query `WHERE` or `HAVING` clause with a comparison operator (`=, <, >, <>`). That is, it can be compared with values (a row) from the outer query.

### Row subqueries as part of a WHERE clause
Let's remind ourselves of the question we're trying to answer: *which patients have the same year of birth and doctor responsible for their care as the patient identified by p015?*

We've already found the birth year and doctor identifier for that patient, but how might we use that information as a *row subquery* in the context of a `WHERE` clause?

The answer is to test a condition defined as a two-tuple against the row subquery; the two-tuple columns should match the two columns returned from the *row subquery*:

In [25]:
%%sql
SELECT patient_id, patient_name, gender
FROM patient
WHERE (EXTRACT(YEAR FROM date_of_birth), doctor_id) =
      (SELECT EXTRACT(YEAR FROM date_of_birth), doctor_id
              FROM patient WHERE patient_id = 'p015');

3 rows affected.


patient_id,patient_name,gender
p001,Thornton,F
p009,Kay,F
p015,Harris,M
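
The row comparison above is true only when *both* columns match. As a sketch of that behaviour, the example below runs a row subquery against an in-memory SQLite database and compares it with an equivalent formulation that uses one scalar subquery per column. The toy rows, the precomputed `birth_year` column (used in place of `EXTRACT`) and the use of SQLite (which needs version 3.15 or later for row values) are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient (patient_id TEXT, birth_year INTEGER, doctor_id TEXT)"
)
conn.executemany(
    "INSERT INTO patient VALUES (?, ?, ?)",
    [
        ("p001", 1980, "d06"),
        ("p007", 1980, "d07"),  # same year as p015, different doctor
        ("p015", 1980, "d06"),
        ("p038", 1981, "d06"),  # same doctor as p015, different year (made-up row)
    ],
)

# Row subquery: both columns are compared together as a two-tuple.
row_form = conn.execute("""
    SELECT patient_id FROM patient
    WHERE (birth_year, doctor_id) =
          (SELECT birth_year, doctor_id FROM patient WHERE patient_id = 'p015')
    ORDER BY patient_id
""").fetchall()

# Equivalent form: one scalar subquery per column, combined with AND.
scalar_form = conn.execute("""
    SELECT patient_id FROM patient
    WHERE birth_year = (SELECT birth_year FROM patient WHERE patient_id = 'p015')
      AND doctor_id  = (SELECT doctor_id  FROM patient WHERE patient_id = 'p015')
    ORDER BY patient_id
""").fetchall()

print(row_form)
print(scalar_form)
conn.close()
```

Both forms select only the patients that share the year *and* the doctor; the row subquery simply states the condition once instead of repeating the inner query for each column.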


### Row subqueries as part of a HAVING clause
Row subqueries can also be used to filter a set of grouped results based on a group property by using them as part of a `HAVING` clause.

For example, suppose we now want to break out patients by gender to see how many male and female patients there are with the same year of birth and doctor responsible for their care as the patient identified by `p015`.

We could group the rows after filtering them via a `WHERE` clause:

In [26]:
%%sql
SELECT gender, COUNT(*) AS number FROM patient
WHERE (EXTRACT(YEAR FROM date_of_birth), doctor_id) =
      (SELECT EXTRACT(YEAR FROM date_of_birth), doctor_id
              FROM patient WHERE patient_id = 'p015')
GROUP BY gender, EXTRACT(YEAR FROM date_of_birth), doctor_id

2 rows affected.


gender,number
F,2
M,1


Alternatively, we could create multiple groups and then filter on just the groups meeting the desired condition.

In [27]:
%%sql
SELECT gender, COUNT(*) AS number
FROM patient
GROUP BY gender, EXTRACT(YEAR FROM date_of_birth), doctor_id
HAVING (EXTRACT(YEAR FROM date_of_birth), doctor_id) = 
       (SELECT EXTRACT(YEAR FROM date_of_birth), doctor_id
               FROM patient WHERE patient_id = 'p015');

2 rows affected.


gender,number
F,2
M,1


## Table subqueries
A *table subquery* returns one or more columns and one or more rows, that is, a table.  A table subquery can be used whenever a table is needed, such as in the `FROM` clause or in the condition of a `WHERE` or `HAVING` clause with a `[NOT] IN`, `[NOT] EXISTS`, `ALL` or `ANY` predicate. In the case of a `WHERE` or `HAVING` clause, the resultant table can be compared with the values from the outer query.

An SQL set operation that returns the identifiers of those doctors who are currently treating both male and female patients could use the <tt>INTERSECT</tt> operator:

In [28]:
%%sql

SELECT doctor_id 
FROM patient 
WHERE gender = 'F'

INTERSECT

SELECT doctor_id 
FROM patient 
WHERE gender = 'M'

4 rows affected.


doctor_id
d06
d10
d07
d11


### A *table subquery* in the `FROM` clause

How can we count the number of doctors who are currently treating both male and female patients?

One way is to use the query that finds doctors treating both male and female patients as a *table subquery* as part of the FROM clause in the following query:

In [29]:
%%sql
SELECT COUNT(*) AS number_of_doctors
FROM (SELECT doctor_id FROM patient WHERE gender = 'F'
      INTERSECT
      SELECT doctor_id FROM patient WHERE gender = 'M') AS doctor_ids;

1 rows affected.


number_of_doctors
4


Note that PostgreSQL requires that a *table subquery* in a `FROM` clause is given a table alias via an `AS` clause - `doctor_ids` in this example.
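
As a cross-check on the `INTERSECT` approach, the same count can be obtained from a table subquery that groups by doctor and keeps the groups containing two distinct genders. The sketch below compares the two formulations on an in-memory SQLite database; the toy rows and the use of SQLite are assumptions for illustration, and the grouped version relies on `gender` taking only the values `'F'` and `'M'`, as the `CHECK` constraint on the `patient` table requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (patient_id TEXT, gender TEXT, doctor_id TEXT)")
conn.executemany(
    "INSERT INTO patient VALUES (?, ?, ?)",
    [  # made-up assignments, not the notebook's data
        ("p001", "F", "d06"), ("p002", "M", "d06"),  # d06 treats both genders
        ("p003", "M", "d07"), ("p004", "M", "d07"),  # d07 treats male patients only
        ("p005", "F", "d10"), ("p006", "M", "d10"),  # d10 treats both genders
    ],
)

# Table subquery built with INTERSECT, as in the query above.
intersect_count = conn.execute("""
    SELECT COUNT(*) FROM
      (SELECT doctor_id FROM patient WHERE gender = 'F'
       INTERSECT
       SELECT doctor_id FROM patient WHERE gender = 'M') AS doctor_ids
""").fetchone()[0]

# Alternative table subquery: group by doctor, keep groups with two genders.
grouped_count = conn.execute("""
    SELECT COUNT(*) FROM
      (SELECT doctor_id FROM patient
       GROUP BY doctor_id
       HAVING COUNT(DISTINCT gender) = 2) AS doctor_ids
""").fetchone()[0]

print(intersect_count, grouped_count)
conn.close()
```

In both cases the outer query only ever sees the table produced by the subquery, so either inner query can be swapped in without changing the outer `SELECT COUNT(*)`.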

The previous query simply reports on the count of rows in the table returned from the table subquery. But how might we extract information from that table? For example, how might we report on the *doctor_id* of the doctors who are currently treating both male and female patients?

In [30]:
%%sql
SELECT doctor_id
FROM (SELECT doctor_id FROM patient WHERE gender = 'F'
      INTERSECT
      SELECT doctor_id FROM patient WHERE gender = 'M') AS doctor_ids;

4 rows affected.


doctor_id
d06
d10
d07
d11


We can now go one step further, using the `doctor_id` to pull in information from the `doctor` table:

In [31]:
%%sql
SELECT doctor_name
FROM (SELECT doctor_id FROM patient WHERE gender = 'F'
      INTERSECT
      SELECT doctor_id FROM patient WHERE gender = 'M') AS doctor_ids
NATURAL JOIN doctor;

4 rows affected.


doctor_name
Gibson
Rampton
Paxton
Nolan


### A *table subquery* in a `WHERE` clause using (NOT) IN, (NOT) EXISTS, ALL, ANY

The `IN` predicate enables us to check whether a column (or row) value exists in a table, and can be used with a 
*table subquery*.

### Exercise 4 

Write a query that finds the identity of doctors who are responsible for one or more patients:

In [51]:
%%sql

SELECT DISTINCT doctor_id
FROM patient
WHERE doctor_id IS NOT NULL;

4 rows affected.


doctor_id
d11
d07
d06
d10


#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

One way of doing this is to find all non-null references to the `doctor_id` column in the `patient` table:

In [52]:
%%sql

SELECT DISTINCT doctor_id 
FROM patient 
WHERE doctor_id IS NOT NULL;


4 rows affected.


doctor_id
d11
d07
d06
d10


### IN and NOT IN
How might we use that query as a *table subquery* to help us find the identifiers and names of doctors who are responsible for one or more patients?

One way is to report on `doctor_id` values where the `doctor_id` is *IN* the table subquery.

In [32]:
%%sql
SELECT doctor_id, doctor_name
FROM doctor
WHERE doctor_id IN (SELECT DISTINCT doctor_id
                    FROM patient
                    WHERE doctor_id IS NOT NULL);

4 rows affected.


doctor_id,doctor_name
d06,Gibson
d07,Paxton
d10,Rampton
d11,Nolan


In [34]:
%%sql
SELECT doctor_name
FROM doctor
WHERE doctor_id IN (SELECT DISTINCT doctor_id
                   FROM patient
                   WHERE doctor_id IS NOT NULL);

4 rows affected.


doctor_name
Gibson
Paxton
Rampton
Nolan


Note that it is the column position we are testing against, *not* the column name, as demonstrated by renaming the column in the table subquery. (Typically, we would *not* do this.)

In [54]:
%%sql
SELECT doctor_id, doctor_name
FROM doctor
WHERE doctor_id IN (SELECT DISTINCT doctor_id AS arbitrary_name
                    FROM patient
                    WHERE doctor_id IS NOT NULL);

4 rows affected.


doctor_id,doctor_name
d06,Gibson
d07,Paxton
d10,Rampton
d11,Nolan


As the *inner query* (the table subquery) does not include any columns from the *outer query*, it is evaluated first, and returns a table with a single column, `doctor_id`, which is the *foreign key* representing the relationship between the `doctor` and `patient` tables.

The *outer query* is evaluated next, checking for each row of the `doctor` table whether its `doctor_id` exists in the table created by the *inner query*. If so, that row of the `doctor` table is included in the resultant table.

We can find the identifiers and names of doctors who are *not* responsible for any patients by using the `NOT IN` rather than the `IN` predicate:

In [55]:
%%sql
SELECT doctor_id, doctor_name
FROM doctor
WHERE doctor_id NOT IN (SELECT doctor_id
                        FROM patient
                        WHERE doctor_id IS NOT NULL);

1 rows affected.


doctor_id,doctor_name
d09,Tamblin
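
The `WHERE doctor_id IS NOT NULL` filter in the subquery is worth keeping when using `NOT IN`. If the subquery returns even one `NULL`, the `NOT IN` test can never evaluate to true, so no rows are returned. The sketch below shows the effect on an in-memory SQLite database; the toy rows (including the patient `p999` with no doctor) and the use of SQLite are assumptions for illustration, and whether the notebook's own `patient` table contains such a row is not something this example establishes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doctor (doctor_id TEXT, doctor_name TEXT)")
conn.execute("CREATE TABLE patient (patient_id TEXT, doctor_id TEXT)")
conn.executemany(
    "INSERT INTO doctor VALUES (?, ?)",
    [("d06", "Gibson"), ("d09", "Tamblin")],
)
conn.executemany(
    "INSERT INTO patient VALUES (?, ?)",
    [("p001", "d06"), ("p999", None)],  # p999 is hypothetical: no doctor assigned
)

# NULLs filtered out of the subquery: d09 is correctly reported.
with_filter = conn.execute("""
    SELECT doctor_id FROM doctor
    WHERE doctor_id NOT IN (SELECT doctor_id FROM patient
                            WHERE doctor_id IS NOT NULL)
""").fetchall()

# NULL left in the subquery: 'd09' NOT IN ('d06', NULL) is unknown, not true.
without_filter = conn.execute("""
    SELECT doctor_id FROM doctor
    WHERE doctor_id NOT IN (SELECT doctor_id FROM patient)
""").fetchall()

print(with_filter)
print(without_filter)
conn.close()
```

The second query returns no rows at all, which is easy to mistake for "every doctor has a patient".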


### EXISTS and NOT EXISTS
The `EXISTS` predicate enables us to check whether a table contains at least one row, and can be used with a 
*table subquery* to answer similar requests as the `IN` predicate.

For example, we can display the identifiers and names of doctors who are responsible for one or more patients by checking, for each row of the `doctor` table in the *outer* query, whether the *inner* table subquery finds at least one row of the `patient` table with a matching `doctor_id` value.

In [56]:
%%sql
SELECT doctor_id, doctor_name
FROM doctor
WHERE EXISTS (SELECT *
              FROM patient
              WHERE doctor.doctor_id = patient.doctor_id);

4 rows affected.


doctor_id,doctor_name
d06,Gibson
d07,Paxton
d10,Rampton
d11,Nolan


As the *inner query* includes a column from the outer query, `doctor.doctor_id`, the *outer query* is evaluated first. 

The *outer query* evaluates the *inner query* for each row of the `doctor` table, matching the primary key (`doctor.doctor_id`) and foreign key (`patient.doctor_id`) values. If the resultant table from the *inner query* contains at least one row (i.e. a row exists), that row of the `doctor` table is used to populate the resultant table.
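
The evaluation order described above can be modelled in plain Python as a loop over the outer rows that re-runs the inner query for each one. This is only a conceptual sketch with made-up rows and a hypothetical `inner_query` helper; a real database engine is free to choose a different execution strategy that gives the same result.

```python
# Toy rows (hypothetical): (doctor_id, doctor_name) and (patient_id, doctor_id).
doctors = [("d06", "Gibson"), ("d07", "Paxton"), ("d09", "Tamblin")]
patients = [("p001", "d06"), ("p007", "d07"), ("p009", "d06")]

def inner_query(outer_doctor_id):
    """Rows of patient whose doctor_id matches the current outer row."""
    return [p for p in patients if p[1] == outer_doctor_id]

result = []
for doctor_id, doctor_name in doctors:   # outer query: one pass per doctor row
    matching = inner_query(doctor_id)    # inner query re-evaluated for this row
    if len(matching) > 0:                # EXISTS: at least one row came back
        result.append((doctor_id, doctor_name))

print(result)
```

Changing the test to `len(matching) == 0` gives the behaviour of `NOT EXISTS`.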

In the same way that the `IN` predicate can be complemented or negated using `NOT IN`, so too can we use a `NOT EXISTS` predicate.

### Exercise 5

Write a query to display the identifiers and names of doctors who are *not* responsible for any patients:

In [35]:
%%sql

SELECT doctor_id, doctor_name
FROM doctor
WHERE NOT EXISTS (SELECT *
              FROM patient
              WHERE doctor.doctor_id = patient.doctor_id);

1 rows affected.


doctor_id,doctor_name
d09,Tamblin


#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

<div class='answer'>Reuse the previous query but replace <tt>EXISTS</tt> with <tt>NOT EXISTS</tt>:<br/><br/>
<tt>SELECT doctor_id, doctor_name<br/>
FROM doctor<br/>
WHERE NOT EXISTS (SELECT *<br/>
                  FROM patient<br/>
                  WHERE doctor.doctor_id = patient.doctor_id);</tt>
</div>

### Exercise 6

The `EXISTS` predicate also allows us to test table subqueries in which we can make comparisons between several pairs of columns using the comparison operators (`=`, `<`, `>`, `<>`).

For example, the following query will display the identifiers and names of doctors who care for at least one patient who is older than the doctor:

In [59]:
%%sql
SELECT doctor_id, doctor_name
FROM doctor
WHERE EXISTS (SELECT *
              FROM patient
              WHERE doctor.doctor_id = patient.doctor_id
                AND doctor.date_of_birth > patient.date_of_birth);

1 rows affected.


doctor_id,doctor_name
d11,Nolan


Write a query to display the identifiers and names of doctors who care for at least one patient who is *younger* than the doctor.

In [66]:
%%sql
SELECT doctor_id, doctor_name, date_of_birth
FROM doctor
WHERE EXISTS (SELECT *
              FROM patient
              WHERE doctor.doctor_id = patient.doctor_id
                 AND doctor.date_of_birth < patient.date_of_birth)


3 rows affected.


doctor_id,doctor_name,date_of_birth
d06,Gibson,1954-02-24
d07,Paxton,1960-05-23
d10,Rampton,1980-09-25


#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

<div class='answer'>You might think that one way of doing this is to complement the previous query by using <tt>NOT EXISTS</tt> rather than <tt>EXISTS</tt>:<br/><br/>
<tt>SELECT doctor_id, doctor_name FROM doctor<br/>
WHERE NOT EXISTS (SELECT * FROM patient<br/>
&nbsp;&nbsp;&nbsp;&nbsp;WHERE doctor.doctor_id = patient.doctor_id<br/>
&nbsp;&nbsp;&nbsp;&nbsp;AND doctor.date_of_birth > patient.date_of_birth);</tt><br/><br/>
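However, keen readers may notice that this is not equivalent: it would also admit doctors who have the same date of birth as a patient and doctors who have no patients at all, and it would reject any doctor who cares for both older and younger patients. Instead, we can retain the <tt>EXISTS</tt> predicate and change the &gt; comparison operator to &lt;.</div>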
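
To see how the two candidate queries differ, the sketch below runs both against an in-memory SQLite database containing three made-up doctors chosen to expose the differences. The identifiers, dates and the use of SQLite are assumptions for illustration; dates are stored as ISO-format text, which compares correctly as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doctor (doctor_id TEXT, date_of_birth TEXT)")
conn.execute(
    "CREATE TABLE patient (patient_id TEXT, date_of_birth TEXT, doctor_id TEXT)"
)
conn.executemany("INSERT INTO doctor VALUES (?, ?)", [
    ("d01", "1970-01-01"),  # has one older and one younger patient
    ("d02", "1970-01-01"),  # has no patients at all
    ("d03", "1970-01-01"),  # has one patient born on the same day
])
conn.executemany("INSERT INTO patient VALUES (?, ?, ?)", [
    ("p101", "1960-01-01", "d01"),
    ("p102", "1990-01-01", "d01"),
    ("p103", "1970-01-01", "d03"),
])

# Negating the "older patient" query: doctors with NO patient older than them.
not_exists_older = conn.execute("""
    SELECT doctor_id FROM doctor
    WHERE NOT EXISTS (SELECT * FROM patient
                      WHERE doctor.doctor_id = patient.doctor_id
                        AND doctor.date_of_birth > patient.date_of_birth)
    ORDER BY doctor_id
""").fetchall()

# Reversing the comparison: doctors with AT LEAST ONE patient younger than them.
exists_younger = conn.execute("""
    SELECT doctor_id FROM doctor
    WHERE EXISTS (SELECT * FROM patient
                  WHERE doctor.doctor_id = patient.doctor_id
                    AND doctor.date_of_birth < patient.date_of_birth)
    ORDER BY doctor_id
""").fetchall()

print(not_exists_older)
print(exists_younger)
conn.close()
```

With this data the two queries return disjoint sets of doctors, which shows that negating `EXISTS` is not the same as reversing the comparison inside it.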

### ALL
The `ALL` predicate enables us to inspect column (or row) values in a table to determine whether a condition is `true` for every row in a table, and can be used with a *table subquery*.

For example, suppose we have a query that finds information about patients who have a non-null *weight* value:

In [36]:
%%sql
SELECT patient_id, patient_name, weight
FROM patient
WHERE weight IS NOT NULL
ORDER BY weight LIMIT 3;

3 rows affected.


patient_id,patient_name,weight
p080,Bell,49.2
p009,Kay,53.2
p089,Jarvis,53.4


One way of finding the heaviest patient is to order all the patients with a recorded weight by decreasing weight and pick the first one:

In [37]:
%%sql
SELECT patient_id, patient_name, gender, weight FROM patient
WHERE weight IS NOT NULL
ORDER BY weight DESC LIMIT 1

1 rows affected.


patient_id,patient_name,gender,weight
p088,Boswell,M,91.4


Another way of approaching this query is to generate a table that contains *all* the patients *except* for a tested patient, and then see if the weight of the tested patient is greater than the weights of *all* the other patients:

In [38]:
%%sql
SELECT patient_id, patient_name, gender, weight
FROM patient a 
WHERE weight > ALL (SELECT weight
                    FROM patient b
                    WHERE weight IS NOT NULL
                      AND a.patient_id <> b.patient_id);

1 rows affected.


patient_id,patient_name,gender,weight
p088,Boswell,M,91.4


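The query above compares the weight of each patient (*outer query*) in turn with the weights of all the other patients 
(*inner query*). The heaviest patient will be the patient that is heavier than ALL the other patients, excluding 
themselves. Note that because the test is a strict inequality, no row is returned if two or more patients tie for the 
greatest weight; using `>=` in place of `>` would return all of the tied patients.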

We can find the details of the lightest patient(s) simply by reversing the inequality test:

In [39]:
%%sql
SELECT patient_id, patient_name, gender, weight
FROM patient a 
WHERE weight < ALL (SELECT weight
                    FROM patient b
                    WHERE weight IS NOT NULL
                      AND a.patient_id <> b.patient_id);

1 rows affected.


patient_id,patient_name,gender,weight
p080,Bell,F,49.2
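
One consequence of the strict `>` (or `<`) test is that tied extremes are not reported: if two patients share the greatest weight, neither is heavier than all the others. The sketch below illustrates this on an in-memory SQLite database with made-up weights. SQLite does not provide the `ALL` predicate, so the strict test is written here as an equivalent `NOT EXISTS` condition; that rewrite, the toy rows and the use of SQLite are all assumptions for illustration. A scalar subquery using `MAX` is shown alongside as a formulation that does return ties.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (patient_id TEXT, weight REAL)")
conn.executemany("INSERT INTO patient VALUES (?, ?)", [
    ("p1", 60.0), ("p2", 91.4), ("p3", 91.4),  # p2 and p3 tie for heaviest
    ("p4", None),                              # p4 has no recorded weight
])

# Scalar subquery with MAX: every patient at the maximum weight is returned.
heaviest = conn.execute("""
    SELECT patient_id FROM patient
    WHERE weight = (SELECT MAX(weight) FROM patient)
    ORDER BY patient_id
""").fetchall()

# "weight > ALL (other patients' weights)" rewritten with NOT EXISTS:
# there must be no other patient whose weight is greater than or equal to ours.
strictly_heaviest = conn.execute("""
    SELECT a.patient_id FROM patient a
    WHERE a.weight IS NOT NULL
      AND NOT EXISTS (SELECT * FROM patient b
                      WHERE b.weight IS NOT NULL
                        AND a.patient_id <> b.patient_id
                        AND b.weight >= a.weight)
""").fetchall()

print(heaviest)
print(strictly_heaviest)
conn.close()
```

The `MAX` version reports both tied patients, while the strict version reports none.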


### ANY
The `ANY` predicate enables us to inspect column (or row) values in a table to determine whether a condition is true for *at least* one row in a table.

For example, having already found the details of the heaviest patient(s), we can modify the `ALL` query to find all patients *excluding* the heaviest patient: that is, every patient whose weight is *less than* that of `ANY` other patient.

In [40]:
%%sql
SELECT patient_id, patient_name, gender, weight
FROM patient a 
WHERE weight < ANY (SELECT weight
                    FROM patient b
                    WHERE weight IS NOT NULL
                      AND a.patient_id <> b.patient_id)
ORDER BY weight;

14 rows affected.


patient_id,patient_name,gender,weight
p080,Bell,F,49.2
p009,Kay,F,53.2
p089,Jarvis,F,53.4
p079,Dixon,F,56.5
p087,Reed,F,59.1
p068,Monroe,F,62.6
p015,Harris,M,64.3
p008,James,M,70.5
p007,Tennent,M,70.9
p001,Thornton,F,71.6


The query above compares the weight of each patient (*outer query*) in turn with the weights of all the other patients  (*inner query*). The heaviest patient(s) will be excluded as they are not lighter than `ANY` other patient, excluding themselves.
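
The difference between `ANY` and `ALL` can be modelled with Python's built-in `any()` and `all()` functions. The sketch below uses three of the weights shown in the outputs above and a hypothetical helper, `others`, that plays the part of the inner query; it ignores `NULL` handling and is meant only as a conceptual model.

```python
# Three of the weights from the outputs above, keyed by patient_id.
weights = {"p080": 49.2, "p009": 53.2, "p088": 91.4}

def others(pid):
    """Weights of every patient except pid (the role of the inner query)."""
    return [w for other, w in weights.items() if other != pid]

# weight < ANY (...): lighter than at least one other patient.
lighter_than_any = sorted(
    p for p, w in weights.items() if any(w < o for o in others(p))
)

# weight > ALL (...): heavier than every other patient.
heavier_than_all = sorted(
    p for p, w in weights.items() if all(w > o for o in others(p))
)

print(lighter_than_any)
print(heavier_than_all)

# Edge case: when the inner query returns no rows, ALL is true and ANY is false.
print(all(70.0 > o for o in []), any(70.0 < o for o in []))
```

The edge case mirrors SQL: a comparison with `ALL` over an empty subquery result is true, whereas a comparison with `ANY` over an empty result is false.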

Again, we can modify the previous query to exclude the lightest patient(s) simply by reversing the inequality test.

In [41]:
%%sql
SELECT patient_id, patient_name, gender, weight
FROM patient a 
WHERE weight > ANY (SELECT weight
                    FROM patient b
                    WHERE weight IS NOT NULL
                      AND a.patient_id <> b.patient_id)
ORDER BY weight;

14 rows affected.


patient_id,patient_name,gender,weight
p009,Kay,F,53.2
p089,Jarvis,F,53.4
p079,Dixon,F,56.5
p087,Reed,F,59.1
p068,Monroe,F,62.6
p015,Harris,M,64.3
p008,James,M,70.5
p007,Tennent,M,70.9
p001,Thornton,F,71.6
p039,Maher,F,73.0


### Optional Exercise - Comparing the Performance of SQL subquery operations with Other Queries

Using the `EXPLAIN` technique used in notebook *10.4 Normalised v. unnormalised data* for profiling queries, write some equivalent queries using SQL subqueries and some of the other SQL constructs reviewed in this notebook (for example, *IN, NOT IN, EXISTS, NOT EXISTS, ANY, ALL*) and compare the performance of the different query styles.

---

### Optional Exercise

If you have time, you might like to revisit the *movies* dataset to see what sorts of questions you can now turn into queries using the additional SQL constructs reviewed in this notebook.

## Summary
In this Notebook you have seen how the three types of subquery (scalar, row and table) can be used to compare data from two or more tables.

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `11.3 Recommender systems`.