# SQL DML
In this Notebook you will use the SQL SELECT statement to answer queries about the data recorded in

  (a) the `patient` table
    
  (b) the Movies dataset.

This Notebook contains several exercises, each with a space for you to try your own solution. You can see our solution by clicking on the small triangle next to the text "**our solution**", but you should attempt each question yourself before looking at our proposed solution.



Enable access to the PostgreSQL database engine via SQL cell magic.

In [1]:
%load_ext sql
%sql postgresql://test:test@localhost:5432/tm351test

'Connected: test@tm351test'

## (a) the `patient` table

As the `patient` table is updated by other Notebooks, recreate it.

In [2]:
%%sql
DROP TABLE IF EXISTS patient CASCADE;

CREATE TABLE patient (
  patient_id CHAR(4) NOT NULL
    CHECK (patient_id SIMILAR TO 'p[0-9][0-9][0-9]'),
  patient_name VARCHAR(20) NOT NULL,
  date_of_birth DATE NOT NULL,
  gender CHAR(1) NOT NULL
    CHECK (gender = 'F' OR gender = 'M'),
  height DECIMAL(4,1)
    CHECK (height > 0),
  weight DECIMAL(4,1)
    CHECK (weight > 0),
 PRIMARY KEY (patient_id)
 );

Done.
Done.


[]

Populate the `patient` table from a CSV file named `patient.csv` using [Psycopg](http://initd.org/psycopg/docs/index.html), 
a PostgreSQL database adapter for Python.

In [3]:
import psycopg2 as pg
import pandas as pd
import pandas.io.sql as psqlg

In [4]:
# open a connection to the PostgreSQL database tm351test
conn = pg.connect(dbname='tm351test', host='localhost', user='test', password='test', port=5432)
# create a cursor
c = conn.cursor()
# open patient.csv
io = open('data/patient.csv', 'r')
# execute the PostgreSQL copy command
c.copy_from(io, 'patient', sep=',', null='')
# close patient.csv
io.close()
# commit transaction
conn.commit()
# close cursor
c.close()
# close database connection
conn.close()
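The `null=''` argument in the `copy_from` call tells PostgreSQL to treat an empty field as `NULL`. The following is a minimal sketch of that idea in plain Python, not Psycopg's actual implementation: `parse_copy_line` is a hypothetical helper, and the sample line is an assumption based on the row recorded for patient `p031`, whose height and weight are missing.

```python
def parse_copy_line(line, sep=",", null=""):
    """Split one line of the data file and map the null marker to None.

    Hypothetical helper that only mimics the effect of
    copy_from(..., sep=',', null=''); it does not handle escaping.
    """
    fields = line.rstrip("\n").split(sep)
    return [None if field == null else field for field in fields]


# Assumed file line for a patient with no recorded height or weight.
row = parse_copy_line("p031,Rubinstein,1980-12-23,F,,\n")
print(row)
```

The two trailing empty fields come back as `None`, which is why `height` and `weight` appear blank for some patients when the table is queried.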

In [5]:
%%sql
SELECT * 
FROM patient
ORDER BY patient_id;

17 rows affected.


patient_id,patient_name,date_of_birth,gender,height,weight
p001,Thornton,1980-01-22,F,162.3,71.6
p007,Tennent,1980-04-01,M,176.8,70.9
p008,James,1980-07-08,M,167.9,70.5
p009,Kay,1980-09-25,F,164.7,53.2
p015,Harris,1980-12-04,M,180.6,64.3
p031,Rubinstein,1980-12-23,F,,
p037,Boswell,1981-06-11,F,,
p038,Ming,1981-09-23,M,186.3,85.4
p039,Maher,1981-10-09,F,161.9,73.0
p068,Monroe,1981-10-21,F,165.0,62.6


### Exercise 1 - `patient` table

Execute SQL `SELECT` statements to answer the following queries about patients:
1. Give the details of female patients who were born before 1981.
2. For each birth year, give the number of patients who were born that year, the number whose weight has been 
recorded, and the minimum, maximum and average weights.
3. Give the number of female patients and male patients who are 'overweight' according to their 
[BMI (Body Mass Index)](https://en.wikipedia.org/wiki/Body_mass_index).

### My answers:
1) Give details of female patients who were born before 1981.

In [8]:
%%sql
SELECT *
FROM patient
WHERE gender = 'F' AND EXTRACT(YEAR FROM date_of_birth) < 1981

3 rows affected.


patient_id,patient_name,date_of_birth,gender,height,weight
p001,Thornton,1980-01-22,F,162.3,71.6
p009,Kay,1980-09-25,F,164.7,53.2
p031,Rubinstein,1980-12-23,F,,


2) For each birth year, give the number of patients who were born that year, the number whose weight has been recorded, and the minimum, maximum and average weights.

In [12]:
%%sql
SELECT CAST(EXTRACT(YEAR FROM date_of_birth) AS INTEGER) AS birth_year,
        COUNT(*) AS number_of_patients,
        COUNT(weight) AS number_weighed,
        MIN(weight) AS min_weight,
        MAX(weight) AS max_weight,
        CAST(AVG(weight) AS DECIMAL(4,1)) AS average_weight
FROM patient
GROUP BY birth_year
ORDER BY birth_year;

3 rows affected.


birth_year,number_of_patients,number_weighed,min_weight,max_weight,average_weight
1980,6,5,53.2,71.6,66.1
1981,5,4,62.6,85.4,74.4
1982,6,6,49.2,91.4,64.0


3) Give the number of female patients and male patients who are 'overweight' according to their BMI.

In [18]:
%%sql
SELECT gender, COUNT(*)
FROM patient
WHERE weight/(height*height/10000) > 24
GROUP BY gender

2 rows affected.


gender,count
F,2
M,3


#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.


1\. Give the details of female patients who were born before 1981.

In [None]:
%%sql
SELECT *
FROM patient
WHERE gender = 'F' AND EXTRACT(YEAR FROM date_of_birth) < 1981
ORDER BY patient_id;

Notes:
    
The [`EXTRACT`](http://www.postgresql.org/docs/9.3/static/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT) 
[DATE/TIME function](http://www.postgresql.org/docs/9.3/static/functions-datetime.html) 
retrieves subfields such as year or hour from date/time values.
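As a rough illustration of what the year test does, the sketch below applies the same condition in plain Python to six of the rows listed earlier. Here `date.year` stands in for `EXTRACT(YEAR FROM date_of_birth)`. The second filter compares whole dates instead, which corresponds to a predicate such as `date_of_birth < DATE '1981-01-01'`; that alternative is our suggestion and not part of the solution above.

```python
from datetime import date

# (patient_id, date_of_birth, gender) for six of the rows listed earlier
patients = [
    ("p001", date(1980, 1, 22), "F"),
    ("p007", date(1980, 4, 1), "M"),
    ("p009", date(1980, 9, 25), "F"),
    ("p031", date(1980, 12, 23), "F"),
    ("p037", date(1981, 6, 11), "F"),
    ("p038", date(1981, 9, 23), "M"),
]

# date.year plays the role of EXTRACT(YEAR FROM date_of_birth)
by_year = [pid for pid, dob, gender in patients
           if gender == "F" and dob.year < 1981]

# Comparing whole dates selects the same rows
by_date = [pid for pid, dob, gender in patients
           if gender == "F" and dob < date(1981, 1, 1)]

print(by_year)
print(by_year == by_date)
```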

2\. For each birth year, give the number of patients who were born that year, the number whose weight has been 
recorded, and the minimum, maximum and average weights.

In [None]:
%%sql
SELECT CAST(EXTRACT(YEAR FROM date_of_birth) AS INTEGER) AS birth_year,
       COUNT(*) AS number_of_patients,
       COUNT(weight) AS number_weighed,
       MIN(weight) AS minimum_weight,
       MAX(weight) AS maximum_weight,
       CAST(AVG(weight) AS DECIMAL(4,1)) AS average_weight
FROM patient
GROUP BY birth_year
ORDER BY birth_year;

Notes:
    
The derived column `birth_year`, defined in the `SELECT` clause, is used in the `GROUP BY` and `ORDER BY` clauses.

The `GROUP BY` and `ORDER BY` clauses could have been written as

* `GROUP BY EXTRACT(YEAR FROM date_of_birth)`
* `ORDER BY EXTRACT(YEAR FROM date_of_birth)`

The first form, which refers to the derived column `birth_year` by name in the `GROUP BY` clause, is not accepted by all SQL implementations.
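The sketch below illustrates grouping by an expression and the difference between `COUNT(*)` and `COUNT(weight)`. It is an approximation under stated assumptions: it uses Python's built-in `sqlite3` module instead of PostgreSQL so that it runs without a database server, it substitutes SQLite's `strftime('%Y', ...)` for `EXTRACT`, and it loads only the ten rows displayed earlier, so the counts differ from those for the full table.

```python
import sqlite3

# (patient_id, date_of_birth, weight) for the ten rows displayed earlier;
# None becomes NULL in the database.
rows = [
    ("p001", "1980-01-22", 71.6), ("p007", "1980-04-01", 70.9),
    ("p008", "1980-07-08", 70.5), ("p009", "1980-09-25", 53.2),
    ("p015", "1980-12-04", 64.3), ("p031", "1980-12-23", None),
    ("p037", "1981-06-11", None), ("p038", "1981-09-23", 85.4),
    ("p039", "1981-10-09", 73.0), ("p068", "1981-10-21", 62.6),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient (patient_id TEXT PRIMARY KEY, date_of_birth TEXT, weight REAL)"
)
conn.executemany("INSERT INTO patient VALUES (?, ?, ?)", rows)

# Group by the expression itself rather than by its alias.
# COUNT(*) counts every row in the group; COUNT(weight) skips NULL weights.
result = conn.execute("""
    SELECT CAST(strftime('%Y', date_of_birth) AS INTEGER) AS birth_year,
           COUNT(*) AS number_of_patients,
           COUNT(weight) AS number_weighed
    FROM patient
    GROUP BY CAST(strftime('%Y', date_of_birth) AS INTEGER)
    ORDER BY birth_year
""").fetchall()
conn.close()

print(result)
```

For 1980 the two counts differ because one of the six patients in the sample has no recorded weight.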

 

**Resultant table**

The resultant table from the execution of an SQL `SELECT` statement can be put into a DataFrame 
(see TM351 VM Installation Test Notebook, Database tests, PostgreSQL).

In [None]:
from sqlalchemy import create_engine
engine = create_engine("postgresql://test:test@localhost:5432/tm351test")
from pandas import read_sql_query as psql

In [None]:
resultant_table = psql("""
    SELECT CAST(EXTRACT(YEAR FROM date_of_birth) AS INTEGER) AS birth_year,
           COUNT(*) AS number_of_patients,
           COUNT(weight) AS number_weighed,
           MIN(weight) AS minimum_weight,
           MAX(weight) AS maximum_weight,
           CAST(AVG(weight) AS DECIMAL(4,1)) AS average_weight
    FROM patient
    GROUP BY birth_year
    ORDER BY birth_year;
""", engine)
resultant_table

The resultant DataFrame can subsequently be manipulated using 
[tools](http://pandas.pydata.org/pandas-docs/version/0.17.1/api.html#dataframe) that you have used previously. 
For example, [plotting](http://pandas.pydata.org/pandas-docs/version/0.17.1/api.html#api-dataframe-plotting) the results.


In [None]:
resultant_table.plot.bar('birth_year')

3\. Give the number of female patients and male patients who are 'overweight' according to their 
[BMI (Body Mass Index)](https://en.wikipedia.org/wiki/Body_mass_index).

In [None]:
%%sql
SELECT gender, COUNT(*)
FROM patient
WHERE weight/(height*height/10000) > 24
GROUP BY gender;
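The expression `weight/(height*height/10000)` computes the BMI, which is weight in kilograms divided by the square of height in metres. The sketch below repeats the calculation in plain Python. It assumes that `height` is stored in centimetres and `weight` in kilograms, as the divisor 10000 suggests, and it uses only the ten rows displayed earlier, so the counts differ from those for the full table. The threshold of 24 is taken from the query above.

```python
# (gender, height in cm, weight in kg) for the ten rows displayed earlier;
# None marks a value that is NULL in the table.
patients = [
    ("F", 162.3, 71.6), ("M", 176.8, 70.9), ("M", 167.9, 70.5),
    ("F", 164.7, 53.2), ("M", 180.6, 64.3), ("F", None, None),
    ("F", None, None), ("M", 186.3, 85.4), ("F", 161.9, 73.0),
    ("F", 165.0, 62.6),
]


def bmi(height_cm, weight_kg):
    """BMI = weight (kg) / height (m) squared, with height given in cm."""
    return weight_kg / (height_cm * height_cm / 10000)


counts = {}
for gender, height, weight in patients:
    if height is None or weight is None:
        # In SQL a comparison involving NULL is never true,
        # so the WHERE clause filters these rows out.
        continue
    if bmi(height, weight) > 24:
        counts[gender] = counts.get(gender, 0) + 1

print(counts)
```

Patients with no recorded height or weight are left out of both counts, just as they are by the `WHERE` clause.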

## (b) the Movies dataset

This Notebook uses only the `movie` table from the Movies dataset.

`movie (movie_id, title, year, rt_all_critics_rating, rt_top_critics_rating, rt_audience_rating, ml_user_rating)`

Each row records the following data about a particular movie identified by the `movie_id` primary key (PK) column.

column | description
------ | -----------
movie_id  (PK) | movie identifier
title | movie title
year | year of release
rt_all_critics_rating | RottenTomatoes - all critics: average rating
rt_top_critics_rating | RottenTomatoes - top critics: average rating
rt_audience_rating | RottenTomatoes - audience: average rating
ml_user_rating | MovieLens - users: average rating



In [19]:
%%sql
DROP TABLE IF EXISTS movie;

CREATE TABLE movie(
 movie_id INTEGER NOT NULL,
 title VARCHAR(250) NOT NULL,
 year INTEGER NOT NULL,
 rt_all_critics_rating REAL,
 rt_top_critics_rating REAL,
 rt_audience_rating REAL,
 ml_user_rating REAL,
 PRIMARY KEY (movie_id)
);

Done.
Done.


[]

Populate the `movie` table from the file named `movie.dat` using Psycopg.

In [20]:
# open a connection to the PostgreSQL database tm351test
conn = pg.connect(dbname='tm351test', host='localhost', user='test', password='test', port=5432)
# create a cursor
c = conn.cursor()
# open movie.dat
io = open('data/movie.dat', 'r')
# execute the PostgreSQL copy command
c.copy_from(io, 'movie')
# close movie.dat
io.close()
# commit transaction
conn.commit()
# close cursor
c.close()
# close database connection
conn.close()

In [21]:
%%sql
SELECT * 
FROM movie
ORDER BY movie_id
LIMIT 10;

10 rows affected.


movie_id,title,year,rt_all_critics_rating,rt_top_critics_rating,rt_audience_rating,ml_user_rating
1,Toy Story,1995,9.0,8.5,3.7,3.9
2,Jumanji,1995,5.6,5.8,3.2,3.2
3,Grumpier Old Men,1995,5.9,7.0,3.2,3.2
4,Waiting to Exhale,1995,5.6,5.5,3.3,2.9
5,Father of the Bride Part II,1995,5.3,5.4,3.0,3.1
6,Heat,1995,7.7,7.2,3.9,3.8
7,Sabrina,1995,7.4,7.2,3.8,3.4
8,Tom and Huck,1995,4.2,0.0,2.7,3.1
9,Sudden Death,1995,5.2,5.6,2.6,3.0
10,GoldenEye,1995,6.8,6.2,3.4,3.4


### Exercise 2 - Movies dataset I

Characterise the data in the `movie` table by executing SQL `SELECT` statements to answer the following questions: 

    1 How many movies are there?
    2 How many unique movie titles are there?
    3 What are the earliest and latest years of release?
    4 What are the ranges of values for critics, audience and user ratings?
    5 Missing data - How many movies are recorded without:
        5.1 a title?
        5.2 a year of release?
        5.3 critics, audience or user ratings?

Compare your answers with those from the same questions asked in the `08.1 Movies dataset` Notebook.  

1) How many movies are there?

In [23]:
%%sql
SELECT COUNT(*)
FROM movie;

1 rows affected.


count
10681


2) How many unique movie titles are there?

In [24]:
%%sql
SELECT COUNT(DISTINCT title)
FROM movie;

1 rows affected.


count
10410


3) What are the earliest and latest years of release?

In [25]:
%%sql
SELECT MIN(year) AS earliest, MAX(year) AS latest
FROM movie;


1 rows affected.


earliest,latest
1915,2008


4) What are the ranges of values for critics, audience and user ratings?

In [27]:
%%sql
SELECT MIN(rt_all_critics_rating) AS min_all_critics_rating, MAX(rt_all_critics_rating) AS max_rt_all_critics_rating,
        MIN(rt_top_critics_rating) AS min_top_critics_rating, MAX(rt_top_critics_rating) AS max_top_critics_rating,
        MIN(rt_audience_rating) AS min_audience_rating, MAX(rt_audience_rating) AS max_audience_rating,
        MIN(ml_user_rating) AS min_ml_user_rating, MAX(ml_user_rating) AS max_ml_user_rating
FROM movie;

1 rows affected.


min_all_critics_rating,max_rt_all_critics_rating,min_top_critics_rating,max_top_critics_rating,min_audience_rating,max_audience_rating,min_ml_user_rating,max_ml_user_rating
0.0,9.6,0.0,10.0,0.0,5.0,0.5,5.0


5) Missing data - How many movies are recorded without:

5.1) a title?

In [28]:
%%sql
SELECT COUNT(*)
FROM movie
WHERE title IS NULL;

1 rows affected.


count
0


5.2) a year of release?

In [29]:
%%sql
SELECT COUNT(*)
FROM movie
WHERE year IS NULL;

1 rows affected.


count
0


5.3) critics, audience or user ratings?

In [30]:
%%sql
SELECT COUNT(*)
FROM movie
WHERE rt_all_critics_rating IS NULL;

1 rows affected.


count
714


In [31]:
%%sql
SELECT COUNT(*)
FROM movie
WHERE rt_top_critics_rating IS NULL;

1 rows affected.


count
714


In [32]:
%%sql
SELECT COUNT(*)
FROM movie
WHERE rt_audience_rating IS NULL;

1 rows affected.


count
714


In [33]:
%%sql
SELECT COUNT(*)
FROM movie
WHERE ml_user_rating IS NULL

1 rows affected.


count
4


#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.


1\. How many movies are there?

In [35]:
%%sql
SELECT COUNT(*)
FROM movie;

1 rows affected.


count
10681


2\. How many unique movie titles are there?

In [36]:
%%sql
SELECT COUNT(DISTINCT title)
FROM movie;

1 rows affected.


count
10410


3\. What are the earliest and latest years of release?

In [37]:
%%sql
SELECT MIN(year), MAX(year)
FROM movie;

1 rows affected.


min,max
1915,2008


4\. What are the ranges of values for critics, audience and user ratings?

In [38]:
%%sql
SELECT MIN(rt_all_critics_rating) AS min_rt_all_critics_rating, MAX(rt_all_critics_rating) AS max_rt_all_critics_rating, 
       MIN(rt_top_critics_rating) AS min_rt_top_critics_rating, MAX(rt_top_critics_rating) AS max_rt_top_critics_rating, 
       MIN(rt_audience_rating) AS min_rt_audience_rating, MAX(rt_audience_rating) AS max_rt_audience_rating, 
       MIN(ml_user_rating) AS min_ml_user_rating, MAX(ml_user_rating) AS max_ml_user_rating
FROM movie;

1 rows affected.


min_rt_all_critics_rating,max_rt_all_critics_rating,min_rt_top_critics_rating,max_rt_top_critics_rating,min_rt_audience_rating,max_rt_audience_rating,min_ml_user_rating,max_ml_user_rating
0.0,9.6,0.0,10.0,0.0,5.0,0.5,5.0


5.1 How many movies are recorded without a title?

In [39]:
%%sql
SELECT COUNT(*)
FROM movie
WHERE title IS NULL;

1 rows affected.


count
0


5.2 How many movies are recorded without a year of release?

In [40]:
%%sql
SELECT COUNT(*)
FROM movie
WHERE year IS NULL;

1 rows affected.


count
0


5.3 How many movies are recorded without critics, audience or user ratings?

In [41]:
%%sql
SELECT COUNT(*)
FROM movie
WHERE rt_all_critics_rating IS NULL;

1 rows affected.


count
714


In [42]:
%%sql
SELECT COUNT(*)
FROM movie
WHERE rt_top_critics_rating IS NULL;

1 rows affected.


count
714


In [43]:
%%sql
SELECT COUNT(*)
FROM movie
WHERE rt_audience_rating IS NULL;

1 rows affected.


count
714


In [44]:
%%sql
SELECT COUNT(*)
FROM movie
WHERE ml_user_rating IS NULL;

1 rows affected.


count
4


### Exercise 3 - Movies dataset II

Execute SQL `SELECT` statements to answer the following queries about movies: 

    1 How many movies have the word 'Dog' in their title?
    2 Movies are often remade and released with the same name. Which movies have been made more than 3 times?
    3 How many movies have been released each decade? Plot the results as a histogram.

In [45]:
# Try your code here

In [51]:
%%sql
SELECT COUNT(*)
FROM movie
WHERE title LIKE '%Dog%';

1 rows affected.


count
55


In [59]:
%%sql
SELECT COUNT(*) AS number_of_versions, title
FROM movie
GROUP BY title
HAVING COUNT(*) > 3;

1 rows affected.


number_of_versions,title
5,Hamlet


In [63]:
%%sql
SELECT (year/10)*10 AS decade, COUNT(*) AS number_released
FROM movie
GROUP BY decade
ORDER BY decade;

10 rows affected.


decade,number_released
1910,11
1920,83
1930,230
1940,379
1950,521
1960,690
1970,784
1980,1712
1990,3022
2000,3249


#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.


1\. How many movies have the word 'Dog' in their title?

In [None]:
%%sql
SELECT COUNT(*)
FROM movie
WHERE title LIKE '%Dog%';

2\. Movies are often remade and released with the same name. Which movies have been made more than 3 times?

In [None]:
%%sql
SELECT title, COUNT(*) AS number_of_versions
FROM movie
GROUP BY title
HAVING COUNT(*) > 3;
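The `HAVING` clause filters whole groups after aggregation, whereas `WHERE` filters individual rows before grouping. The sketch below shows this on a small, made-up set of titles. It uses Python's built-in `sqlite3` module instead of PostgreSQL so that it runs without a database server, and the sample data is hypothetical, not taken from `movie.dat`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (movie_id INTEGER PRIMARY KEY, title TEXT NOT NULL)")

# Hypothetical sample: one title with four versions, one with two, one with one.
titles = ["Hamlet", "Hamlet", "Hamlet", "Hamlet", "Sabrina", "Sabrina", "Heat"]
conn.executemany("INSERT INTO movie (title) VALUES (?)", [(t,) for t in titles])

# Only groups with more than three rows survive the HAVING filter.
remakes = conn.execute("""
    SELECT title, COUNT(*) AS number_of_versions
    FROM movie
    GROUP BY title
    HAVING COUNT(*) > 3
""").fetchall()
conn.close()

print(remakes)
```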

3\. How many movies have been released each decade? Plot the results as a histogram.

In [None]:
%%sql
SELECT (year/10)*10 AS decade, COUNT(*) AS no_of_films
FROM movie
GROUP BY decade
ORDER BY decade;

Notes:

The [PostgreSQL mathematical operator](http://www.postgresql.org/docs/9.3/static/functions-math.html) 
`/` performs integer division when both operands are integers, truncating the result.
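The effect of `(year/10)*10` can be checked with a small Python sketch. The function name `decade` is ours, and Python's `//` operator is used because it gives the same result as truncating integer division for the positive years in this dataset.

```python
def decade(year):
    """Return the first year of the decade containing `year`.

    Mirrors (year/10)*10 evaluated on integers: the division drops
    the last digit and the multiplication restores the magnitude.
    """
    return (year // 10) * 10


decades = [decade(y) for y in (1915, 1995, 2008)]
print(decades)
```

The years 1915 and 2008 are the earliest and latest years of release found in Exercise 2.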

In [None]:
resultant_table = psql("""
    SELECT (year/10)*10 AS decade, COUNT(*) AS no_of_films
    FROM movie
    GROUP BY decade
    ORDER BY decade;
""", engine)
resultant_table

In [None]:
resultant_table.plot.bar('decade')

## Summary
In this Notebook you have used the SQL SELECT statement to answer queries about the data recorded in

  (a) the `patient` table
    
  (b) the Movies dataset.

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `09.3 SQL views`.