# SQL DML
In this Notebook you will use the SQL `SELECT` statement to answer queries about the data recorded in

  (a) the `patient` table
    
  (b) the Movies dataset.

Enable access to the PostgreSQL database engine via SQL cell magic.

In [None]:
%load_ext sql
%sql postgresql://test:test@localhost:5432/tm351test

## (a) the `patient` table

Because other Notebooks update the `patient` table, drop and recreate it here so that it starts in a known state.

In [None]:
%%sql
DROP TABLE IF EXISTS patient CASCADE;

CREATE TABLE patient (
  patient_id CHAR(4) NOT NULL
    CHECK (patient_id SIMILAR TO 'p[0-9][0-9][0-9]'),
  patient_name VARCHAR(20) NOT NULL,
  date_of_birth DATE NOT NULL,
  gender CHAR(1) NOT NULL
    CHECK (gender = 'F' OR gender = 'M'),
  height DECIMAL(4,1)
    CHECK (height > 0),
  weight DECIMAL(4,1)
    CHECK (weight > 0),
 PRIMARY KEY (patient_id)
 );
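
The `CHECK` constraint on `patient_id` uses `SIMILAR TO`, which succeeds only if the pattern matches the whole value. As a rough illustration of which identifiers the pattern accepts, the sketch below uses Python's `re.fullmatch` as the nearest analogue. The helper name `looks_like_patient_id` and the candidate values are invented for this example; the database itself remains the authority on what is accepted.

```python
import re

# Hypothetical helper mirroring the pattern 'p[0-9][0-9][0-9]' from the CHECK constraint.
# SIMILAR TO must match the entire value, so re.fullmatch is the closest analogue.
PATIENT_ID_PATTERN = re.compile(r"p[0-9][0-9][0-9]")

def looks_like_patient_id(value):
    """Return True if value is a lower-case 'p' followed by exactly three digits."""
    return PATIENT_ID_PATTERN.fullmatch(value) is not None

# Invented candidate values, for illustration only.
for candidate in ["p001", "p12", "P001", "x001"]:
    print(candidate, looks_like_patient_id(candidate))
```

Only the first candidate matches: the others are too short, use an upper-case letter, or start with the wrong letter.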

Populate the `patient` table from a CSV file named `patient.csv` using [Psycopg](http://initd.org/psycopg/docs/index.html), 
a PostgreSQL database adapter for Python.

In [None]:
import psycopg2 as pg
import pandas as pd
import pandas.io.sql as psqlg

In [None]:
# open a connection to the PostgreSQL database tm351test
conn = pg.connect(dbname='tm351test', host='localhost', user='test', password='test', port=5432)
# create a cursor
c = conn.cursor()
# open patient.csv and execute the PostgreSQL copy command;
# fields are separated by commas and empty fields are loaded as NULL.
# The with block closes patient.csv afterwards
with open('data/patient.csv', 'r') as f:
    c.copy_from(f, 'patient', sep=',', null='')
# commit transaction
conn.commit()
# close cursor
c.close()
# close database connection
conn.close()
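
The `sep=','` and `null=''` arguments tell the copy command how to split each line into columns and which text stands for a missing value. The sketch below imitates that interpretation in plain Python, so no database is needed. It is a simplification: the function `parse_copy_line` and the sample rows are invented for this example, and it ignores the backslash escapes that PostgreSQL's text format also recognises.

```python
import io

def parse_copy_line(line, sep=",", null=""):
    """Split one line of copy input into column values.

    Simplified illustration only: a field equal to the null marker
    becomes None, and every other field is kept as text.
    """
    fields = line.rstrip("\n").split(sep)
    return [None if field == null else field for field in fields]

# Invented sample rows (not taken from data/patient.csv).
sample = io.StringIO(
    "p001,Example One,1980-01-31,F,165.0,60.5\n"
    "p002,Example Two,1975-06-15,M,,\n"
)
rows = [parse_copy_line(line) for line in sample]
print(rows[1])
```

In the second sample row the two trailing empty fields become `None`, which corresponds to `NULL` in the nullable `height` and `weight` columns. When `sep` and `null` are omitted, Psycopg's `copy_from` defaults to a tab separator and `\N` as the null marker.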

In [None]:
%%sql
SELECT * 
FROM patient
ORDER BY patient_id;

## Activity 1 - `patient` table
Execute SQL `SELECT` statements to answer the following queries about patients:
1. Give the details of female patients who were born before 1981.
2. For each birth year, give the number of patients who were born that year, the number whose weight has been 
recorded, and the minimum, maximum and average weights.
3. Give the number of female patients and male patients who are 'overweight' according to their 
[BMI (Body Mass Index)](https://en.wikipedia.org/wiki/Body_mass_index).
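
For query 3, BMI is weight in kilograms divided by the square of height in metres. The sketch below works through that arithmetic in Python rather than SQL. It assumes that `height` is recorded in centimetres and `weight` in kilograms, which the table definition does not state, so check the data before relying on it. The function name `bmi` and the example values are invented; a BMI of 25 or more is the commonly used threshold for 'overweight'.

```python
def bmi(weight_kg, height_cm):
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    height_m = height_cm / 100  # assumption: height is given in centimetres
    return weight_kg / (height_m ** 2)

# Invented example values: 80 kg at 175 cm.
value = bmi(80, 175)
print(round(value, 1))
```

With these invented values the result is about 26.1, which is above the threshold of 25.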

In [None]:
# Try your code here

Solutions can be found in the `09.2.soln SQL DML` Notebook, 
but please DO attempt the activity yourself before looking at these solutions.

## (b) the Movies dataset

This Notebook uses only the `movie` table from the Movies dataset.

`movie (movie_id, title, year, rt_all_critics_rating, rt_top_critics_rating, rt_audience_rating, ml_user_rating)`

Each row records the following data about a particular movie identified by the `movie_id` primary key (PK) column.

column | description
------ | -----------
movie_id  (PK) | movie identifier
title | movie title
year | year of release
rt_all_critics_rating | RottenTomatoes - all critics: average rating
rt_top_critics_rating | RottenTomatoes - top critics: average rating
rt_audience_rating | RottenTomatoes - audience: average rating
ml_user_rating | MovieLens - users: average rating



In [None]:
%%sql
DROP TABLE IF EXISTS movie;

CREATE TABLE movie(
 movie_id INTEGER NOT NULL,
 title VARCHAR(250) NOT NULL,
 year INTEGER NOT NULL,
 rt_all_critics_rating REAL,
 rt_top_critics_rating REAL,
 rt_audience_rating REAL,
 ml_user_rating REAL,
 PRIMARY KEY (movie_id)
);

Populate the `movie` table from the file named `movie.dat` using Psycopg.

In [None]:
# open a connection to the PostgreSQL database tm351test
conn = pg.connect(dbname='tm351test', host='localhost', user='test', password='test', port=5432)
# create a cursor
c = conn.cursor()
# open movie.dat and execute the PostgreSQL copy command;
# the with block closes movie.dat afterwards
with open('data/movie.dat', 'r') as f:
    c.copy_from(f, 'movie')
# commit transaction
conn.commit()
# close cursor
c.close()
# close database connection
conn.close()

In [None]:
%%sql
SELECT * 
FROM movie
ORDER BY movie_id
LIMIT 10;

## Activity 2 - Movies dataset I
Characterise the data in the `movie` table by executing SQL `SELECT` statements to answer the following questions: 

    1 How many movies are there?
    2 How many unique movie titles are there?
    3 What are the earliest and latest years of release?
    4 What are the ranges of values for critics, audience and user ratings?
    5 Missing data - How many movies are recorded without:
        5.1 a title?
        5.2 a year of release?
        5.3 critics, audience or user ratings?

Compare your answers with those from the same questions asked in the `08.1 Movies dataset` Notebook.  

In [None]:
# Try your code here

Solutions can be found in the `09.2.soln SQL DML` Notebook, 
but please DO attempt the activity yourself before looking at these solutions.

## Activity 3 - Movies dataset II
Execute SQL `SELECT` statements to answer the following queries about movies: 

    1 How many movies have the word 'Dog' in their title?
    2 Movies are often remade and released with the same name. Which movies have been made more than 3 times?
    3 How many movies have been released each decade? Plot the results as a histogram.
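
For question 3, a release year can be mapped to its decade with integer arithmetic: divide by 10, discard the remainder, then multiply by 10. The sketch below shows that arithmetic in Python on a few invented years. It is not the SQL solution, and the helper name `decade` is hypothetical.

```python
from collections import Counter

def decade(year):
    """Return the first year of the decade containing year, e.g. 1994 gives 1990."""
    # Integer division drops the final digit; multiplying by 10 restores the scale.
    return (year // 10) * 10

# Invented release years, for illustration only.
years = [1994, 1999, 2001, 2008, 2008]
per_decade = Counter(decade(y) for y in years)
print(sorted(per_decade.items()))
```

Here two of the invented years fall in the 1990s and three in the 2000s.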

In [None]:
# Try your code here

Solutions can be found in the `09.2.soln SQL DML` Notebook, 
but please DO attempt the activity yourself before looking at these solutions.

## Summary
In this Notebook you have used the SQL `SELECT` statement to answer queries about the data recorded in

  (a) the `patient` table
    
  (b) the Movies dataset.

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `09.3 SQL views`.