# SQL subqueries

In this Notebook, we will explore how the three types of subquery (*scalar*, *row* and *table*) can be used to 
compare data from two or more tables.

A `SELECT` statement can be embedded within another `SELECT` statement: one query, the subquery (or inner query), 
sits inside another query, the outer query. 
The results of the subquery are used by the outer query to help determine the content of the resultant table.

There are three types of subquery:

- A *scalar subquery* returns a single column and a single row, that is, a single value. 
It can be used wherever a single value is needed, for example in the `SELECT` clause or in a `WHERE` or `HAVING` condition.
- A *row subquery* returns two or more columns as a single row. 
It can be used wherever a single row is needed, for example in a `WHERE` or `HAVING` condition.
- A *table subquery* returns one or more columns and one or more rows, that is, a table. 
It can be used wherever a table is needed, for example in the `FROM` clause 
or in a `WHERE` or `HAVING` condition.

Enable access to the PostgreSQL database engine via [SQL Cell Magic](https://pypi.python.org/pypi/ipython-sql).

In [None]:
%load_ext sql
%sql postgresql://test:test@localhost:5432/tm351test

As the `doctor` and `patient` tables may have been updated by another Notebook, recreate them.

In [None]:
%%sql
DROP TABLE IF EXISTS patient CASCADE;
DROP TABLE IF EXISTS doctor CASCADE;

CREATE TABLE doctor (
 doctor_id CHAR(3) NOT NULL
  CHECK (doctor_id SIMILAR TO 'd[0-9][0-9]'),
 doctor_name VARCHAR(20) NOT NULL,
 date_of_birth DATE NOT NULL,
 PRIMARY KEY (doctor_id)
 );

CREATE TABLE patient (
  patient_id CHAR(4) NOT NULL
    CHECK (patient_id SIMILAR TO 'p[0-9][0-9][0-9]'),
  patient_name VARCHAR(20) NOT NULL,
  date_of_birth DATE NOT NULL,
  gender CHAR(1) NOT NULL
    CHECK (gender = 'F' OR gender = 'M'),
  height DECIMAL(4,1)
    CHECK (height > 0),
  weight DECIMAL(4,1)
    CHECK (weight > 0),
  doctor_id CHAR(3),
 PRIMARY KEY (patient_id),
 FOREIGN KEY (doctor_id) REFERENCES doctor(doctor_id)
 );

Populate the tables from files using [Psycopg](http://initd.org/psycopg/docs/index.html), 
a PostgreSQL database adapter for Python.

In [None]:
import psycopg2 as pg
import pandas as pd
import pandas.io.sql as psqlg

In [3]:
# open a connection to the PostgreSQL database tm351test
conn = pg.connect(dbname='tm351test', host='localhost', user='test', password='test', port=5432)
# create a cursor
c = conn.cursor()

# open doctor.dat; the with statement closes the file even if the copy fails
with open('data/doctor.dat', 'r') as doctor_file:
    # execute the PostgreSQL copy command
    c.copy_from(doctor_file, 'doctor')
# commit transaction
conn.commit()

# open patient+doctor_id.dat
with open('data/patient+doctor_id.dat', 'r') as patient_file:
    # execute the PostgreSQL copy command
    c.copy_from(patient_file, 'patient')
# commit transaction
conn.commit()

# close cursor
c.close()
# close database connection
conn.close()

In [None]:
%%sql
SELECT * 
FROM doctor
ORDER BY doctor_id;

In [None]:
%%sql
SELECT * 
FROM patient
ORDER BY patient_id;

## Scalar subquery

A *scalar subquery* returns a single column and a single row, that is, a single value. 
A *scalar subquery* can be used whenever a single value is needed. 
For example, in the `SELECT` clause, or in the condition of a `WHERE` or `HAVING` clause with a comparison 
operator (`=, <, >, <>`). 
In the latter case, the resultant value can be compared with a value from the outer query.

The following query returns a single column and a single row – the average weight of patients registered at the 
doctors’ surgery who have been weighed:

In [None]:
%%sql
SELECT AVG(weight)
FROM patient;

This query can be used, for example, as a *scalar subquery* in the following three queries, where it is used 
respectively in the `SELECT` clause, in the condition of a `WHERE` clause, and in the condition of a `HAVING` clause:

'Display the identifiers and names of patients, and the difference between their weight and the average weight 
of patients.'

In [None]:
%%sql
SELECT patient_id, patient_name,
 CAST(weight - (SELECT AVG(weight)
                FROM patient)
  AS DECIMAL(4,1)) AS weight_difference
FROM patient
ORDER BY patient_id;

'Display the identifiers and names of patients who weigh less than the average weight of patients.'

In [None]:
%%sql
SELECT patient_id, patient_name
FROM patient
WHERE weight < (SELECT AVG(weight) 
                FROM patient)
ORDER BY patient_id;

'Display the years where the average weight of patients born in that year is less than the average weight of all patients.'

In [None]:
%%sql
SELECT CAST(EXTRACT(YEAR FROM date_of_birth) AS INTEGER) AS year_of_birth, 
       CAST(AVG(weight) AS DECIMAL(4,1)) AS average_weight
FROM patient
GROUP BY year_of_birth
HAVING AVG(weight) < (SELECT AVG(weight) 
                      FROM patient);

Notes:
    
The [`EXTRACT`](http://www.postgresql.org/docs/9.3/static/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT) 
function retrieves subfields such as year from *date* values.
            
As the [`EXTRACT`](http://www.postgresql.org/docs/9.3/static/functions-datetime.html) 
function returns values as *double precision*, 
[`CAST`](http://www.postgresql.org/docs/9.3/static/sql-expressions.html#SQL-SYNTAX-TYPE-CASTS) 
is used to convert the value to *integer*.

## Row subquery

A *row subquery* returns two or more columns as a single row. 
The result of a row subquery can be used in the condition of the outer query's `WHERE` or `HAVING` clause with a comparison operator (`=`, `<`, `>`, `<>`). That is, it can be compared with a row of values from the outer query.
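
As a minimal, self-contained sketch of such a row comparison, the following Python uses the standard-library `sqlite3` module rather than the PostgreSQL database used in this Notebook. The toy `patient` table, its values and the `birth_year` column (standing in for an `EXTRACT` expression) are hypothetical, and the sketch assumes an SQLite version with row-value support (3.15 or later).

```python
import sqlite3

# Hypothetical toy data, not the Notebook's patient table
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (
  patient_id TEXT PRIMARY KEY,
  birth_year INTEGER,
  doctor_id  TEXT
);
INSERT INTO patient VALUES
  ('p001', 1980, 'd01'),
  ('p002', 1980, 'd01'),
  ('p003', 1980, 'd02'),
  ('p004', 1975, 'd01');
""")

# The subquery returns one row of two columns; each outer row
# (birth_year, doctor_id) is compared with that row as a whole.
rows = conn.execute("""
SELECT patient_id
FROM patient
WHERE (birth_year, doctor_id) = (SELECT birth_year, doctor_id
                                 FROM patient
                                 WHERE patient_id = 'p001')
ORDER BY patient_id
""").fetchall()

matching_ids = [row[0] for row in rows]
print(matching_ids)
conn.close()
```

Only rows that match on both columns are returned, so `p003` (same year, different doctor) and `p004` (same doctor, different year) are excluded.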

The following query returns two columns as a single row – the year of birth and the doctor responsible for the 
patient identified by `p015`.

In [None]:
%%sql
SELECT EXTRACT(YEAR FROM date_of_birth), doctor_id
FROM patient
WHERE patient_id ='p015';

This query can be used, for example, as a *row subquery* in the following two queries, where it is used respectively 
in the conditions of a `WHERE` clause, and a `HAVING` clause:

'Display the identifiers, names and genders of patients who have the same year of birth and doctor 
responsible for their care as the patient identified by `p015`.'

In [None]:
%%sql
SELECT patient_id, patient_name, gender
FROM patient
WHERE (EXTRACT(YEAR FROM date_of_birth), doctor_id) =  (SELECT EXTRACT(YEAR FROM date_of_birth), doctor_id
                                                        FROM patient
                                                        WHERE patient_id = 'p015');

'Display the numbers of male and female patients who have the same year of birth and doctor responsible for their 
care as the patient identified by `p015`.'

In [None]:
%%sql
SELECT gender, COUNT(*) AS number
FROM patient
GROUP BY gender, EXTRACT(YEAR FROM date_of_birth), doctor_id
HAVING (EXTRACT(YEAR FROM date_of_birth), doctor_id) =  (SELECT EXTRACT(YEAR FROM date_of_birth), doctor_id
                                                         FROM patient
                                                         WHERE patient_id = 'p015');

Notes:

SQL requires that the `GROUP BY` clause includes all the columns and expressions that appear in the `SELECT` and 
`HAVING` clauses unless they are arguments of an *aggregate* function.

## Table subquery
A *table subquery* returns one or more columns and one or more rows, that is, a table. 
A table subquery can be used whenever a table is needed, for example in the `FROM` clause, or in the condition of 
a `WHERE` or `HAVING` clause with a `[NOT] IN`, `[NOT] EXISTS`, `ALL` or `ANY` predicate. 
In the case of a `WHERE` or `HAVING` clause, the resultant table can be compared with the values from the outer query.

### A *table subquery* in the `FROM` clause

The following query returns the identifiers of those doctors who are currently treating both male and female patients. 

In [None]:
%%sql
SELECT doctor_id
FROM patient
WHERE gender = 'F'
INTERSECT
SELECT doctor_id
FROM patient
WHERE gender = 'M';

This query is used as a *table subquery* in the `FROM` clause of the following two examples:

'Give the number of doctors who are currently treating both male and female patients.'

In [None]:
%%sql
SELECT COUNT(*) AS number_of_doctors
FROM (SELECT doctor_id
      FROM patient
      WHERE gender = 'F'
      INTERSECT
      SELECT doctor_id
      FROM patient
      WHERE gender = 'M') AS doctor_ids;

Notes:

PostgreSQL requires that a *table subquery* in a `FROM` clause is given a table alias via an `AS` clause – `doctor_ids` in this example.

'Give the names of the doctors who are currently treating both male and female patients.'

In [None]:
%%sql
SELECT doctor_name
FROM (SELECT doctor_id
      FROM patient
      WHERE gender = 'F'
      INTERSECT
      SELECT doctor_id
      FROM patient
      WHERE gender = 'M') AS doctor_ids
      NATURAL JOIN doctor;

### A *table subquery* in a `WHERE` clause

The `IN` predicate enables us to check whether a column (or row) value exists in a table, and can be used with a 
*table subquery* as illustrated in the following two queries.

'Display the identifiers and names of doctors who are responsible for one or more patients.'

In [None]:
%%sql
SELECT doctor_id, doctor_name
FROM doctor
WHERE doctor_id IN (SELECT doctor_id
                    FROM patient
                    WHERE doctor_id IS NOT NULL);

Notes:

As the *inner query* (subquery) does not include any columns from the *outer query* it is evaluated first, and returns a table with a single column, `doctor_id`, which is the *foreign key* representing the relationship between the `doctor` and `patient` tables – 'A doctor is responsible for one or more patients'.

The *outer query* is evaluated next, checking, for each row of the `doctor` table, whether the `doctor_id` exists in the table created by the *inner query*. If so, that row of the `doctor` table is included in the resultant table.

'Display the identifiers and names of doctors who are not responsible for any patients.'

In [None]:
%%sql
SELECT doctor_id, doctor_name
FROM doctor
WHERE doctor_id NOT IN (SELECT doctor_id
                        FROM patient
                        WHERE doctor_id IS NOT NULL);
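
Notes:

The `WHERE doctor_id IS NOT NULL` condition in the subquery matters for `NOT IN`: if the subquery returns a `NULL`, the comparison for a non-matching doctor evaluates to unknown rather than true, so that doctor is not returned. The following minimal sketch illustrates this with the standard-library `sqlite3` module and hypothetical toy tables (SQLite is an assumption made to keep the example self-contained; PostgreSQL treats `NULL` in `NOT IN` the same way).

```python
import sqlite3

# Hypothetical toy data: patient p002 has no doctor (NULL foreign key)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE doctor (doctor_id TEXT PRIMARY KEY);
CREATE TABLE patient (patient_id TEXT PRIMARY KEY, doctor_id TEXT);
INSERT INTO doctor VALUES ('d01'), ('d02');
INSERT INTO patient VALUES ('p001', 'd01'), ('p002', NULL);
""")

# Without the filter, the subquery returns ('d01', NULL);
# 'd02' NOT IN ('d01', NULL) is unknown, so no row qualifies.
unfiltered = conn.execute("""
SELECT doctor_id
FROM doctor
WHERE doctor_id NOT IN (SELECT doctor_id
                        FROM patient)
""").fetchall()

# With the filter, the subquery returns only ('d01'), so 'd02' qualifies.
filtered = conn.execute("""
SELECT doctor_id
FROM doctor
WHERE doctor_id NOT IN (SELECT doctor_id
                        FROM patient
                        WHERE doctor_id IS NOT NULL)
""").fetchall()

print(unfiltered)
print(filtered)
conn.close()
```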

The `EXISTS` predicate enables us to check whether a table contains at least one row, and can be used with a 
*table subquery* to answer requests similar to those answered with the `IN` predicate.

'Display the identifiers and names of doctors who are responsible for one or more patients.'

In [None]:
%%sql
SELECT doctor_id, doctor_name
FROM doctor
WHERE EXISTS (SELECT *
              FROM patient
              WHERE doctor.doctor_id = patient.doctor_id);

Notes:
    
As the *inner query* includes a column from the outer query, `doctor.doctor_id`, the *outer query* is evaluated first. 

The *outer query* evaluates the *inner query* for each row of the `doctor` table, matching the primary key 
(`doctor.doctor_id`) and foreign key (`patient.doctor_id`) values. If the resultant table from the *inner query* 
contains at least one row, that row of the `doctor` table is included in the resultant table.
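
As a rough check that the `IN` and `EXISTS` formulations answer the same request, the sketch below runs both against hypothetical toy tables using the standard-library `sqlite3` module (an assumption made to keep the example self-contained; it is not the PostgreSQL database used in this Notebook).

```python
import sqlite3

# Hypothetical toy data: d02 has no patients, p004 has no doctor
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE doctor (doctor_id TEXT PRIMARY KEY);
CREATE TABLE patient (patient_id TEXT PRIMARY KEY, doctor_id TEXT);
INSERT INTO doctor VALUES ('d01'), ('d02'), ('d03');
INSERT INTO patient VALUES
  ('p001', 'd01'), ('p002', 'd01'), ('p003', 'd03'), ('p004', NULL);
""")

# Uncorrelated subquery: the inner query does not refer to the outer query
in_rows = conn.execute("""
SELECT doctor_id
FROM doctor
WHERE doctor_id IN (SELECT doctor_id
                    FROM patient
                    WHERE doctor_id IS NOT NULL)
ORDER BY doctor_id
""").fetchall()

# Correlated subquery: the inner query refers to doctor.doctor_id,
# so it is evaluated for each row of the doctor table
exists_rows = conn.execute("""
SELECT doctor_id
FROM doctor
WHERE EXISTS (SELECT *
              FROM patient
              WHERE doctor.doctor_id = patient.doctor_id)
ORDER BY doctor_id
""").fetchall()

print(in_rows)
print(exists_rows)
conn.close()
```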

'Display the identifiers and names of doctors who are not responsible for any patients.'

In [None]:
%%sql
SELECT doctor_id, doctor_name
FROM doctor
WHERE NOT EXISTS (SELECT *
                  FROM patient
                  WHERE doctor.doctor_id = patient.doctor_id);

The `EXISTS` predicate also allows us to make comparisons between several pairs of columns using the comparison 
operators (`=`, `<`, `>`, `<>`).

'Display the identifiers and names of doctors who care for at least one patient who is older than the doctor.'

In [None]:
%%sql
SELECT doctor_id, doctor_name
FROM doctor
WHERE EXISTS (SELECT *
              FROM patient
              WHERE doctor.doctor_id = patient.doctor_id
                AND doctor.date_of_birth > patient.date_of_birth);

'Display the identifiers and names of doctors who do not care for any patient who is older than the doctor.'

In [None]:
%%sql
SELECT doctor_id, doctor_name
FROM doctor
WHERE NOT EXISTS (SELECT *
                  FROM patient
                  WHERE doctor.doctor_id = patient.doctor_id
                    AND doctor.date_of_birth > patient.date_of_birth);

The `ALL` predicate enables us to inspect column (or row) values in a table to determine whether a condition is 
`true` for every row in a table, and can be used with a *table subquery* as illustrated in the following queries about 
patients' weights.

In [None]:
%%sql
SELECT patient_id, patient_name, weight
FROM patient
WHERE weight IS NOT NULL
ORDER BY weight;

'Display details of the heaviest patient(s).'

In [None]:
%%sql
SELECT patient_id, patient_name, gender, weight
FROM patient a 
WHERE weight >= ALL (SELECT weight
                     FROM patient b
                     WHERE weight IS NOT NULL
                       AND a.patient_id <> b.patient_id);

The query above compares the weight of each patient (*outer query*) in turn with the weights of all the other patients 
(*inner query*). A heaviest patient is one who is at least as heavy as `ALL` the other patients, excluding 
themselves. Using `>=` rather than `>` means that patients who tie for the greatest weight are all returned; with `>`, a tie would return no rows.
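
An alternative formulation, not used in this Notebook, compares each weight with a *scalar subquery* that returns the maximum weight. The sketch below assumes hypothetical toy data and uses the standard-library `sqlite3` module (SQLite has no `ALL` predicate, so only the `MAX` form is shown); patients who tie for the greatest weight are all returned, and patients with no recorded weight are ignored.

```python
import sqlite3

# Hypothetical toy data: p002 and p003 tie, p004 has not been weighed
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (patient_id TEXT PRIMARY KEY, weight REAL);
INSERT INTO patient VALUES
  ('p001', 70.5), ('p002', 92.0), ('p003', 92.0), ('p004', NULL);
""")

# MAX ignores NULL weights, and weight = NULL is never true,
# so only patients with the greatest recorded weight are returned.
heaviest = conn.execute("""
SELECT patient_id, weight
FROM patient
WHERE weight = (SELECT MAX(weight)
                FROM patient)
ORDER BY patient_id
""").fetchall()

print(heaviest)
conn.close()
```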

'Display details of the lightest patient(s).'

In [None]:
%%sql
SELECT patient_id, patient_name, gender, weight
FROM patient a 
WHERE weight <= ALL (SELECT weight
                     FROM patient b
                     WHERE weight IS NOT NULL
                       AND a.patient_id <> b.patient_id);

The `ANY` predicate enables us to inspect column (or row) values in a table to determine whether a condition is true for at least one row in a table, and can be used with a table subquery as illustrated in the following queries about patients' weights.

'Display details of all patients excluding the heaviest patient(s).'

In [None]:
%%sql
SELECT patient_id, patient_name, gender, weight
FROM patient a 
WHERE weight < ANY (SELECT weight
                    FROM patient b
                    WHERE weight IS NOT NULL
                      AND a.patient_id <> b.patient_id)
ORDER BY weight;

The query above compares the weight of each patient (*outer query*) in turn with the weights of all the other patients 
(*inner query*). The heaviest patient(s) will be excluded as they are not lighter than `ANY` other patient, excluding 
themselves.
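
In a `WHERE` clause, a condition of the form `weight < ANY (subquery)` selects the same rows as an `EXISTS` test for at least one row that satisfies the comparison. The sketch below illustrates the `EXISTS` form on hypothetical toy data using the standard-library `sqlite3` module (an assumption made to keep the example self-contained; SQLite does not support the `ANY` predicate itself).

```python
import sqlite3

# Hypothetical toy data: p002 is the heaviest, p004 has not been weighed
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (patient_id TEXT PRIMARY KEY, weight REAL);
INSERT INTO patient VALUES
  ('p001', 70.5), ('p002', 92.0), ('p003', 81.0), ('p004', NULL);
""")

# A patient qualifies if at least one other patient is heavier.
# Comparisons involving a NULL weight are never true, so p004 is excluded.
not_heaviest = conn.execute("""
SELECT patient_id
FROM patient a
WHERE EXISTS (SELECT *
              FROM patient b
              WHERE a.patient_id <> b.patient_id
                AND a.weight < b.weight)
ORDER BY weight
""").fetchall()

not_heaviest_ids = [row[0] for row in not_heaviest]
print(not_heaviest_ids)
conn.close()
```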

'Display details of all patients excluding the lightest patient(s).'

In [None]:
%%sql
SELECT patient_id, patient_name, gender, weight
FROM patient a 
WHERE weight > ANY (SELECT weight
                    FROM patient b
                    WHERE weight IS NOT NULL
                      AND a.patient_id <> b.patient_id)
ORDER BY weight;

## Summary
In this Notebook you have seen how the three types of subquery (scalar, row and table) can be used to compare data from two or more tables.

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `11.3 Recommender systems`.