# Recommender systems

In this Notebook, we will demonstrate how we can develop SQL queries for a content-based recommender system using the *Movies dataset* that would help you (hopefully) to find movies you will like.

**NOTE: This Notebook has been provided solely as a demonstration of how SQL queries can be used to interrogate the Movies dataset as a relational database in order to determine similarities between movies; in particular, how missing data has to be handled using SQL.**

**You are not expected to be able to write SQL code of such complexity after studying TM351, but you should try to understand how the data is being manipulated by the SQL statements in order to answer the questions asked about the data.**

Enable access to the PostgreSQL database engine via [SQL Cell Magic](https://pypi.python.org/pypi/ipython-sql).

In [None]:
%load_ext sql
%sql postgresql://test:test@localhost:5432/tm351test

Create the `movie`, `movie_actor`, `movie_country`, `movie_director` and `movie_genre` tables.

In [None]:
%%sql
DROP TABLE IF EXISTS movie_actor CASCADE;
DROP TABLE IF EXISTS movie_country CASCADE;
DROP TABLE IF EXISTS movie_director CASCADE;
DROP TABLE IF EXISTS movie_genre CASCADE;
DROP TABLE IF EXISTS movie CASCADE;

CREATE TABLE movie (
 movie_id INTEGER NOT NULL,
 title VARCHAR(250) NOT NULL,
 year INTEGER NOT NULL,
 rt_all_critics_rating REAL,
 rt_top_critics_rating REAL,
 rt_audience_rating REAL,
 ml_user_rating REAL,
 PRIMARY KEY (movie_id)
);

CREATE TABLE movie_actor (
 movie_id INTEGER NOT NULL,
 actor_name VARCHAR(50) NOT NULL,
 ranking INTEGER NOT NULL,
 PRIMARY KEY (movie_id, actor_name),
 FOREIGN KEY (movie_id) REFERENCES movie(movie_id)
);

CREATE TABLE movie_country (
 movie_id INTEGER NOT NULL,
 country VARCHAR(30) NOT NULL,
 PRIMARY KEY (movie_id),
 FOREIGN KEY (movie_id) REFERENCES movie(movie_id)
);

CREATE TABLE movie_director (
 movie_id INTEGER NOT NULL,
 director_name VARCHAR(50) NOT NULL,
 PRIMARY KEY (movie_id),
 FOREIGN KEY (movie_id) REFERENCES movie(movie_id)
);

CREATE TABLE movie_genre (
 movie_id INTEGER NOT NULL,
 genre VARCHAR(20) NOT NULL,
 PRIMARY KEY (movie_id, genre),
 FOREIGN KEY (movie_id) REFERENCES movie(movie_id)
);

Populate the tables from the Movies dataset using [Psycopg](http://initd.org/psycopg/docs/index.html), 
a PostgreSQL database adapter for Python.

In [None]:
import psycopg2 as pg
import pandas as pd
import pandas.io.sql as psqlg

In [None]:
# open a connection to the PostgreSQL database tm351test
conn = pg.connect(dbname='tm351test', host='localhost', user='test', password='test', port=5432)
# create a cursor
c = conn.cursor()

# populate each table from its data file (copy_from expects tab-separated columns by default)
for table in ('movie', 'movie_actor', 'movie_country', 'movie_director', 'movie_genre'):
    # the 'with' block closes the file even if the copy fails
    with open('data/' + table + '.dat', 'r') as f:
        c.copy_from(f, table)
    conn.commit()

# close cursor
c.close()
# close database connection
conn.close()

Display all of the data associated with one particular movie.

In [None]:
%%sql
SELECT *
FROM movie
WHERE movie_id = 807;

In [None]:
%%sql
SELECT *
FROM movie_actor
WHERE movie_id = 807
ORDER BY ranking;

In [None]:
%%sql
SELECT *
FROM movie_country
WHERE movie_id = 807;

In [None]:
%%sql
SELECT *
FROM movie_director
WHERE movie_id = 807;

In [None]:
%%sql
SELECT *
FROM movie_genre
WHERE movie_id = 807
ORDER BY genre;

## Activity
Let us assume that you have just watched *Toy Story* (`movie_id` = 1) and, having enjoyed it so much, you want to identify similar movies.

We will focus on looking for similarities in directors, actors, genres and movie ratings between *Toy Story* and other movies.

### Similarity measure

The simplest approach to content-based recommendation is to compute the similarity of the features of a favourite item 
(or user profile) with each other item. There are a number of similarity measures that are employed by recommender 
systems, the simplest being as follows:
    
Given two items, `x` (favourite item or user profile) and `y` (another item), with `n` features, 
the similarity of `x` to `y`, `similarity(x, y)` expressed as a value in the range `[0..1]`, can be computed as:
    
`similarity(x, y) = (F1 + F2 + … + Fn) / n`

If `Fi` is a multi-valued property then `Fi` will be the fraction of the values of `x` that also occur in `y`, 
which gives a value in the range `[0..1]`.

If `Fi` is a single-valued numerical property then `Fi` will be the normalised difference in values between `x` and `y`, 
which is expressed as a value in the range `[0..1]`.

If `Fi` is a single-valued non-numerical property then `Fi` will be 1 if the values of `x` and `y` are identical, 
otherwise 0.
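
As a rough illustration of the three rules above, the following Python sketch computes `similarity(x, y)` for items represented as dictionaries. The item layout, the feature names, the helper `feature_term` and the `ranges` argument are assumptions made for this example only; they are not part of the Movies dataset. For a multi-valued feature the sketch takes the fraction of the values of `x` that also occur in `y`, which is how the SQL queries later in this Notebook compute the term, and it treats a missing value as contributing 0.

```python
def feature_term(x_value, y_value, value_range=None):
    """Return the similarity term Fi, a value in the range [0..1]."""
    if x_value is None or y_value is None:
        return 0.0  # missing data contributes nothing
    if isinstance(x_value, (set, frozenset)):
        # multi-valued: fraction of x's values that also occur in y
        return len(x_value & y_value) / len(x_value) if x_value else 0.0
    if isinstance(x_value, (int, float)):
        # single-valued numerical: normalised difference
        return (value_range - abs(x_value - y_value)) / value_range
    # single-valued non-numerical: 1 if identical, otherwise 0
    return 1.0 if x_value == y_value else 0.0


def similarity(x, y, ranges):
    """Average the similarity terms over the n features of x."""
    terms = [feature_term(x[f], y.get(f), ranges.get(f)) for f in x]
    return sum(terms) / len(terms)


# hypothetical items, invented for illustration
x = {"director": "A", "genres": {"Comedy", "Animation"}, "rating": 4.0}
y = {"director": "A", "genres": {"Comedy", "Drama"}, "rating": 3.0}
print(similarity(x, y, {"rating": 4.0}))  # (1.0 + 0.5 + 0.75) / 3 = 0.75
```

Here the director matches (1.0), one of the two genres of `x` is shared (0.5), and the ratings differ by 1.0 on a range of 4.0 (0.75).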

### Directors
The Movies dataset records information about directors in the `movie_director` table as follows:

`movie_director (movie_id, director_name)`

Each movie has one director. Each row records the director of a particular movie identified by the `movie_id` 
primary key (PK) column.

column | description
------ | -----------
movie_id  (PK) | movie identifier
director_name | director's name

As the director of a movie is a single-valued non-numerical property, then `F(director)` for a movie will be 1 
if a movie has the same director as *Toy Story*, otherwise 0.

We can use the following `SELECT` statement to determine `F(director)` for each movie:

In [None]:
%%sql
-- F(director)
SELECT movie_id,
 CASE WHEN (director_name = (SELECT director_name FROM movie_director WHERE movie_id = 1)) THEN 1
      ELSE 0
 END AS F_director
FROM movie_director
ORDER BY F_director DESC, movie_id
LIMIT 10;

Notes:

The [`CASE`](http://www.postgresql.org/docs/9.3/static/functions-conditional.html) 
expression provides conditional evaluation based on the truth of *Boolean* expressions within a `SELECT` clause.

The *logical processing* (see Part 9, Section 3.2, SQL `SELECT` statement processing) of the SQL statement is as follows:

- First, the *scalar subquery* `(SELECT director_name FROM movie_director WHERE movie_id = 1)` is processed resulting in a table with a single row and single column recording the name of the director of the *Toy Story* movie.

- Second, the `FROM` clause is processed. It outputs the first intermediate table which is a copy of the whole `movie_director` table.

- Third, the `SELECT` clause is processed. It takes the first intermediate table as input and outputs a second intermediate table containing only the columns specified on the `SELECT` clause. That is, `movie_id` and a `1` if the movie has the same director as *Toy Story*, otherwise `0`.

- Finally, the `ORDER BY` clause is processed. It takes the second intermediate table as input and outputs the resultant table, ordering the rows by the values specified by the `ORDER BY` clause.
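
The same `CASE` pattern can be tried out on a small scale without the PostgreSQL server. The sketch below uses Python's built-in `sqlite3` module and an in-memory table with invented rows (the director names are hypothetical, not taken from the Movies dataset). SQLite is assumed here only as a convenient stand-in; for this query it behaves as described in the steps above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movie_director (movie_id INTEGER PRIMARY KEY, director_name TEXT NOT NULL)"
)
# hypothetical rows: movies 1 and 3 share a director
conn.executemany(
    "INSERT INTO movie_director VALUES (?, ?)",
    [(1, "Director A"), (2, "Director B"), (3, "Director A")],
)

f_director = conn.execute("""
    SELECT movie_id,
           CASE WHEN director_name = (SELECT director_name
                                      FROM movie_director
                                      WHERE movie_id = 1) THEN 1
                ELSE 0
           END AS F_director
    FROM movie_director
    ORDER BY F_director DESC, movie_id
""").fetchall()
conn.close()

print(f_director)  # [(1, 1), (3, 1), (2, 0)]
```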

### Actors
The Movies dataset records information about actors in the `movie_actor` table as follows:
    
`movie_actor (movie_id, actor_name, ranking)`

Each movie features one or more actors. Each row records a particular actor featuring in a particular movie 
identified by the `movie_id` and `actor_name` primary key columns.

column | description
------ | -----------
movie_id  (PK) | movie identifier
actor_name  (PK) | actor's name
ranking | position of actor on the movie's cast list

As the actors appearing in a movie are a multi-valued property, `F(actors)` for a movie will be the fraction of 
the actors appearing in *Toy Story* who also appear in that movie.

We can use the following `SELECT` statement to determine `F(actors)` for each movie:

In [None]:
%%sql
-- F(actors)
SELECT movie_id, 
       CAST(COUNT(*) AS REAL)/(SELECT COUNT(*) FROM movie_actor WHERE movie_id = 1) AS F_actors
FROM movie_actor
WHERE actor_name IN (SELECT actor_name FROM movie_actor WHERE movie_id = 1)
GROUP BY movie_id
ORDER BY F_actors DESC, movie_id
LIMIT 10;

Notes:
    
The *logical processing* (see Part 9, Section 3.2, SQL `SELECT` statement processing) of the SQL statement is as follows:
    
- First, the *scalar subquery* `(SELECT COUNT(*) FROM movie_actor WHERE movie_id = 1)` is processed, resulting in a table with a single row and single column recording the number of actors featuring in *Toy Story*.
    
- Second, the *table subquery* `(SELECT actor_name FROM movie_actor WHERE movie_id = 1)` is processed, resulting in a table with a single column recording the actors featuring in *Toy Story*.

- Third, the `FROM` clause is processed. It outputs the first intermediate table which is a copy of the whole `movie_actor` table.

- Fourth, the `WHERE` clause is processed. It takes the first intermediate table as input and outputs a second intermediate table containing only rows where the actor has also appeared in *Toy Story*.

- Fifth, the `GROUP BY` clause is processed. It partitions the second intermediate table into groups with the same value of `movie_id`, creating the third intermediate table.

- Sixth, the `SELECT` clause is processed. It takes the third intermediate table as input and, as this comprises groups of rows, for each group a single row of a fourth intermediate table is output, comprising `movie_id` and the number of actors in that movie who have also featured in *Toy Story* as a fraction of the total number of actors featuring in *Toy Story*.

- Finally, the `ORDER BY` clause is processed. It takes the fourth intermediate table as input and outputs the resultant table, ordering the rows by the values specified by the `ORDER BY` clause.

The [`CAST`](http://www.postgresql.org/docs/9.3/static/sql-expressions.html#SQL-SYNTAX-TYPE-CASTS)`( ... AS REAL)` 
is required to convert the result into a real number so that the following `/` operator 
does not result in an *integer division* truncating the result.

The resultant table only includes a movie if there is at least one actor who also appears in *Toy Story*.
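
To see why the `CAST` matters, the sketch below runs the `F(actors)` query twice against a tiny in-memory SQLite database from Python, once with and once without the cast. The rows and actor names are invented for illustration, and SQLite is only a stand-in for PostgreSQL; both engines truncate when one integer is divided by another.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movie_actor (movie_id INTEGER, actor_name TEXT, ranking INTEGER, "
    "PRIMARY KEY (movie_id, actor_name))"
)
# hypothetical cast lists: movie 2 shares two of movie 1's four actors, movie 3 shares none
conn.executemany(
    "INSERT INTO movie_actor VALUES (?, ?, ?)",
    [(1, "Actor A", 1), (1, "Actor B", 2), (1, "Actor C", 3), (1, "Actor D", 4),
     (2, "Actor A", 1), (2, "Actor B", 2), (2, "Actor X", 3),
     (3, "Actor Y", 1)],
)

query = """
    SELECT movie_id,
           {numerator}/(SELECT COUNT(*) FROM movie_actor WHERE movie_id = 1) AS F_actors
    FROM movie_actor
    WHERE actor_name IN (SELECT actor_name FROM movie_actor WHERE movie_id = 1)
    GROUP BY movie_id
    ORDER BY F_actors DESC, movie_id
"""
with_cast = conn.execute(query.format(numerator="CAST(COUNT(*) AS REAL)")).fetchall()
without_cast = conn.execute(query.format(numerator="COUNT(*)")).fetchall()
conn.close()

print(with_cast)     # [(1, 1.0), (2, 0.5)]
print(without_cast)  # [(1, 1), (2, 0)]
```

Without the cast, the fraction for movie 2 is truncated from 0.5 to 0. Movie 3 is absent from both results because it shares no actors with movie 1.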

### Movie ratings
The Movies dataset records four movie ratings in the `movie` table as follows:
    
`movie (movie_id, title, year, rt_all_critics_rating, rt_top_critics_rating, rt_audience_rating, ml_user_rating)`

Each row records the following data about a particular movie identified by the `movie_id` primary key (PK) column.

column | description
------ | -----------
movie_id  (PK) | movie identifier
title | movie title
year | year of release
rt_all_critics_rating | RottenTomatoes - all critics: average rating
rt_top_critics_rating | RottenTomatoes - top critics: average rating
rt_audience_rating | RottenTomatoes - audience: average rating
ml_user_rating | MovieLens - users: average rating

As each movie rating is a single-valued numerical property, then `F(rating)` for a movie will be the normalised 
difference in ratings between that movie and *Toy Story*, expressed as a value in the range `[0..1]`.

`F(rating) = (rating(range) - |`*Toy Story*`(rating) - movie(rating)|) / rating(range)`
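
As a worked check of the formula, here is a minimal Python sketch. The function name `f_rating` and the sample ratings are hypothetical; the range of 4.5 assumes a rating scale running from 0.5 to 5.0.

```python
def f_rating(favourite_rating, movie_rating, rating_range):
    """Normalised rating similarity in [0..1]; 1 means identical ratings."""
    return (rating_range - abs(favourite_rating - movie_rating)) / rating_range


print(f_rating(4.0, 4.0, 4.5))  # identical ratings give 1.0
print(f_rating(5.0, 0.5, 4.5))  # ratings at opposite ends of the scale give 0.0
print(f_rating(4.0, 2.5, 4.5))  # (4.5 - 1.5) / 4.5, roughly 0.67
```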

First, we need to calculate the range of each movie rating by determining the minimum and maximum values:

In [None]:
%%sql
SELECT MIN(rt_all_critics_rating) AS min_rt_all_critics, MAX(rt_all_critics_rating) AS max_rt_all_critics, 
       MIN(rt_top_critics_rating) AS min_rt_top_critics, MAX(rt_top_critics_rating) AS max_rt_top_critics, 
       MIN(rt_audience_rating) AS min_rt_audience_rating, MAX(rt_audience_rating) AS max_rt_audience_rating, 
       MIN(ml_user_rating) AS min_ml_user, MAX(ml_user_rating) AS max_ml_user
FROM movie;

rating | range
------ | -----
rt_all_critics_rating | 9.6 - 0.0 = 9.6
rt_top_critics_rating | 10.0 - 0.0 = 10.0
rt_audience_rating | 5.0 - 0.0 = 5.0
ml_user_rating | 5.0 - 0.5 = 4.5

Now, we can use the following `SELECT` statements to determine `F(rt_all_critics_rating)`, `F(rt_top_critics_rating)`, 
`F(rt_audience_rating)` and `F(ml_user_rating)` respectively for each movie.

In [None]:
%%sql
SELECT movie_id, 
 COALESCE ((9.6 - ABS(rt_all_critics_rating - (SELECT rt_all_critics_rating FROM movie WHERE movie_id = 1)))/9.6,0)
  AS F_rt_all_critics_rating
FROM movie
ORDER BY F_rt_all_critics_rating DESC
LIMIT 10;

Notes:

For those movies where `rt_all_critics_rating` is `null`, the expression calculating `F_rt_all_critics_rating` will 
also evaluate to `null` rather than `0`. To assign these movies an `F_rt_all_critics_rating` of `0` we can use the 
[`COALESCE`](http://www.postgresql.org/docs/9.3/static/functions-conditional.html#FUNCTIONS-COALESCE-NVL-IFNULL) 
function in the form `COALESCE(<expression>,0)`, which returns `0` if the `<expression>` evaluates to `null`.
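
The effect of `COALESCE` on a missing rating can be seen in isolation with the following sketch, which uses Python's built-in `sqlite3` module and three invented rows (movie 3 has no rating). SQLite is assumed only as a stand-in for PostgreSQL here; `null` propagates through the arithmetic in the same way in both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (movie_id INTEGER PRIMARY KEY, ml_user_rating REAL)")
# hypothetical ratings; None is stored as null
conn.executemany(
    "INSERT INTO movie VALUES (?, ?)",
    [(1, 4.0), (2, 2.5), (3, None)],
)

rows = conn.execute("""
    SELECT movie_id,
           (4.5 - ABS(ml_user_rating - (SELECT ml_user_rating FROM movie WHERE movie_id = 1)))/4.5
             AS without_coalesce,
           COALESCE((4.5 - ABS(ml_user_rating - (SELECT ml_user_rating FROM movie WHERE movie_id = 1)))/4.5, 0)
             AS with_coalesce
    FROM movie
    ORDER BY movie_id
""").fetchall()
conn.close()

for movie_id, without_coalesce, with_coalesce in rows:
    print(movie_id, without_coalesce, with_coalesce)
```

For movie 3 the unwrapped expression comes back as `None` (that is, `null`), while the `COALESCE` version comes back as `0`.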

In [None]:
%%sql
SELECT movie_id,
 COALESCE ((10.0 - ABS(rt_top_critics_rating - (SELECT rt_top_critics_rating FROM movie WHERE movie_id = 1)))/10.0,0)
  AS F_rt_top_critics_rating
FROM movie
ORDER BY F_rt_top_critics_rating DESC
LIMIT 10;

In [None]:
%%sql
SELECT movie_id, 
 COALESCE ((5.0 - ABS(rt_audience_rating - (SELECT rt_audience_rating FROM movie WHERE movie_id = 1)))/5.0,0)
  AS F_rt_audience_rating
FROM movie
ORDER BY F_rt_audience_rating DESC
LIMIT 10;

In [None]:
%%sql
SELECT movie_id, 
 COALESCE ((4.5 - ABS(ml_user_rating - (SELECT ml_user_rating FROM movie WHERE movie_id = 1)))/4.5,0)
  AS F_ml_user_rating
FROM movie
ORDER BY F_ml_user_rating DESC
LIMIT 10;

The above four `SELECT` statements can be combined into a single statement as follows:

In [None]:
%%sql
-- F(rt_all_critics_rating), F(rt_top_critics_rating), F(rt_audience_rating), F(ml_user_rating)
SELECT movie_id, 
 COALESCE ((9.6 - ABS(rt_all_critics_rating - (SELECT rt_all_critics_rating FROM movie WHERE movie_id = 1)))/9.6,0)
  AS F_rt_all_critics_rating,
 COALESCE ((10.0 - ABS(rt_top_critics_rating - (SELECT rt_top_critics_rating FROM movie WHERE movie_id = 1)))/10.0,0)
  AS F_rt_top_critics_rating,
 COALESCE ((5.0 - ABS(rt_audience_rating - (SELECT rt_audience_rating FROM movie WHERE movie_id = 1)))/5.0,0)
  AS F_rt_audience_rating,
 COALESCE ((4.5 - ABS(ml_user_rating - (SELECT ml_user_rating FROM movie WHERE movie_id = 1)))/4.5,0)
  AS F_ml_user_rating
FROM movie
ORDER BY movie_id
LIMIT 10;

### Genres
The Movies dataset records information about genres in the `movie_genre` table as follows:

`movie_genre (movie_id, genre)`

Each movie is categorised as belonging to one or more movie genres. Each row records a particular genre that 
categorises a particular movie identified by the `movie_id` and `genre` primary key columns.

column | description
------ | -----------
movie_id  (PK) | movie identifier
genre  (PK) | movie genre

As the genres that categorise a movie are a multi-valued property, `F(genres)` for a movie will be the fraction of the genres 
that categorise *Toy Story* that also categorise that movie.

We can use the following `SELECT` statement to determine `F(genres)` for each movie:

In [None]:
%%sql
-- F(genres)
SELECT movie_id,
       CAST(COUNT(*) AS REAL)/(SELECT COUNT(*) FROM movie_genre WHERE movie_id = 1) AS F_genres
FROM movie_genre
WHERE genre IN (SELECT genre FROM movie_genre WHERE movie_id = 1)
GROUP BY movie_id
ORDER BY F_genres DESC, movie_id
LIMIT 10;

Notes:

[`COUNT`](http://www.postgresql.org/docs/9.3/static/functions-aggregate.html)`(*)` returns the number of genres categorising a movie that also categorise *Toy Story*. 

The [`CAST`](http://www.postgresql.org/docs/9.3/static/sql-expressions.html#SQL-SYNTAX-TYPE-CASTS)
`( ... AS REAL)` is required to convert the result into a real number so that the following `/` operator does not result in *integer division* truncating the result.

The *scalar subquery* `(SELECT COUNT(*) FROM movie_genre WHERE movie_id = 1)` returns a table with a single row and a single column recording the number of genres categorising *Toy Story*.

The *table subquery* `(SELECT genre FROM movie_genre WHERE movie_id = 1)` returns a table of the genres categorising *Toy Story*.

The resultant table only includes a movie if there is at least one genre in common with *Toy Story*.

### Combining the similarity terms F1 + F2 + … + Fn

So far, we have written `SELECT` statements to determine similarity terms for directors, actors, genres and movie ratings between *Toy Story* and other movies. 

To compute the *similarity* of the *Toy Story* movie with each other movie we now have to combine these terms. To achieve this we must *join* together the resultant tables from these `SELECT` statements. 

Although we have dealt with those movies where critics', audience and user ratings are missing (`null`), when joining the resultant tables we must also deal with those movies where no actors, directors or genres have been recorded, or have no actors or genres in common.

As joining the resultant tables is a complex process, we will complete the process in a stepwise fashion, starting with the movie ratings terms: `F(rt_all_critics_rating)`, `F(rt_top_critics_rating)`, `F(rt_audience_rating)` and `F(ml_user_rating)`.

In [None]:
%%sql
SELECT movie.movie_id,
       title, 
       CAST((F_rt_all_critics_rating+F_rt_top_critics_rating+F_rt_audience_rating+F_ml_user_rating)/4 AS DECIMAL(4,2)) 
        AS similarity
FROM movie JOIN

-- F(rt_all_critics_rating), F(rt_top_critics_rating), F(rt_audience_rating), F(ml_user_rating)
(SELECT movie_id, 
  COALESCE ((9.6 - ABS(rt_all_critics_rating - (SELECT rt_all_critics_rating FROM movie WHERE movie_id = 1)))/9.6,0)
   AS F_rt_all_critics_rating,
  COALESCE ((10.0 - ABS(rt_top_critics_rating - (SELECT rt_top_critics_rating FROM movie WHERE movie_id = 1)))/10.0,0)
   AS F_rt_top_critics_rating,
  COALESCE ((5.0 - ABS(rt_audience_rating - (SELECT rt_audience_rating FROM movie WHERE movie_id = 1)))/5.0,0)
   AS F_rt_audience_rating,
  COALESCE ((4.5 - ABS(ml_user_rating - (SELECT ml_user_rating FROM movie WHERE movie_id = 1)))/4.5,0)
   AS F_ml_user_rating
 FROM movie) AS movie_F_ratings
ON movie.movie_id = movie_F_ratings.movie_id

ORDER BY similarity DESC, movie_id
LIMIT 10;

Notes:
    
The above query joins the `movie` table with the resultant table (named `movie_F_ratings`) from the execution of the 
`SELECT` statement used to determine `F(rt_all_critics_rating)`, `F(rt_top_critics_rating)`, `F(rt_audience_rating)` 
and `F(ml_user_rating)` for each movie.

Next, we will include the `F(director)` term.

In [None]:
%%sql
SELECT movie.movie_id,
       title, 
       CAST((F_rt_all_critics_rating+F_rt_top_critics_rating+F_rt_audience_rating+F_ml_user_rating+
             COALESCE(F_director,0))/5 AS DECIMAL(4,2)) AS similarity
FROM (movie JOIN

-- F(rt_all_critics_rating), F(rt_top_critics_rating), F(rt_audience_rating), F(ml_user_rating)
(SELECT movie_id, 
  COALESCE ((9.6 - ABS(rt_all_critics_rating - (SELECT rt_all_critics_rating FROM movie WHERE movie_id = 1)))/9.6,0)
   AS F_rt_all_critics_rating,
  COALESCE ((10.0 - ABS(rt_top_critics_rating - (SELECT rt_top_critics_rating FROM movie WHERE movie_id = 1)))/10.0,0)
   AS F_rt_top_critics_rating,
  COALESCE ((5.0 - ABS(rt_audience_rating - (SELECT rt_audience_rating FROM movie WHERE movie_id = 1)))/5.0,0)
   AS F_rt_audience_rating,
  COALESCE ((4.5 - ABS(ml_user_rating - (SELECT ml_user_rating FROM movie WHERE movie_id = 1)))/4.5,0)
   AS F_ml_user_rating
 FROM movie) AS movie_F_ratings
ON movie.movie_id = movie_F_ratings.movie_id) LEFT OUTER JOIN

-- F(director)
(SELECT movie_id,
  CASE WHEN (director_name = (SELECT director_name FROM movie_director WHERE movie_id = 1)) THEN 1
       ELSE 0
  END AS F_director
 FROM movie_director) AS movie_F_director
ON movie.movie_id = movie_F_director.movie_id

ORDER BY similarity DESC, movie_id
LIMIT 10;

Notes:

The above query joins the resultant table from the previous query with the resultant table (named `movie_F_director`) from the execution of the `SELECT` statement used to determine `F_director`.

As there are movies where no director has been recorded, we have to use a `LEFT OUTER JOIN` to ensure that all movies are included in the resultant table. 
Using [`COALESCE`](http://www.postgresql.org/docs/9.3/static/functions-conditional.html#FUNCTIONS-COALESCE-NVL-IFNULL)`(F_director,0)` when calculating `similarity` ensures that the 
`F(director)` term is `0` rather than `null` for those movies where no director has been recorded, otherwise `similarity` will compute to *unknown* (`null`).
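
The difference between the two kinds of join can be sketched with a few invented rows. The example below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL; the titles and director names are hypothetical, and no director is recorded for movie 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (movie_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE movie_director (movie_id INTEGER PRIMARY KEY, director_name TEXT NOT NULL);
    INSERT INTO movie VALUES (1, 'Movie one'), (2, 'Movie two'), (3, 'Movie three');
    -- no director is recorded for movie 3
    INSERT INTO movie_director VALUES (1, 'Director A'), (2, 'Director A');
""")

# the F(director) subquery, reused in both joins
f_director = """
    (SELECT movie_id,
            CASE WHEN director_name = (SELECT director_name FROM movie_director WHERE movie_id = 1) THEN 1
                 ELSE 0
            END AS F_director
     FROM movie_director) AS movie_F_director
"""
inner = conn.execute(
    "SELECT movie.movie_id, F_director FROM movie JOIN" + f_director +
    "ON movie.movie_id = movie_F_director.movie_id ORDER BY movie.movie_id"
).fetchall()
outer = conn.execute(
    "SELECT movie.movie_id, F_director, COALESCE(F_director, 0) FROM movie LEFT OUTER JOIN" + f_director +
    "ON movie.movie_id = movie_F_director.movie_id ORDER BY movie.movie_id"
).fetchall()
conn.close()

print(inner)  # [(1, 1), (2, 1)]
print(outer)  # [(1, 1, 1), (2, 1, 1), (3, None, 0)]
```

The inner join drops movie 3 altogether. The left outer join keeps it with a `null` `F_director`, which `COALESCE` then turns into `0`.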

Next, we will include the `F(actors)` term.

In [None]:
%%sql
SELECT movie.movie_id,
       title, 
       CAST((F_rt_all_critics_rating+F_rt_top_critics_rating+F_rt_audience_rating+F_ml_user_rating+
             COALESCE(F_director,0)+
             COALESCE(F_actors,0))/6 AS DECIMAL(4,2)) AS similarity
FROM ((movie JOIN

-- F(rt_all_critics_rating), F(rt_top_critics_rating), F(rt_audience_rating), F(ml_user_rating)
(SELECT movie_id, 
  COALESCE ((9.6 - ABS(rt_all_critics_rating - (SELECT rt_all_critics_rating FROM movie WHERE movie_id = 1)))/9.6,0)
   AS F_rt_all_critics_rating,
  COALESCE ((10.0 - ABS(rt_top_critics_rating - (SELECT rt_top_critics_rating FROM movie WHERE movie_id = 1)))/10.0,0)
   AS F_rt_top_critics_rating,
  COALESCE ((5.0 - ABS(rt_audience_rating - (SELECT rt_audience_rating FROM movie WHERE movie_id = 1)))/5.0,0)
   AS F_rt_audience_rating,
  COALESCE ((4.5 - ABS(ml_user_rating - (SELECT ml_user_rating FROM movie WHERE movie_id = 1)))/4.5,0)
   AS F_ml_user_rating
 FROM movie) AS movie_F_ratings
ON movie.movie_id = movie_F_ratings.movie_id) LEFT OUTER JOIN

-- F(director)
(SELECT movie_id,
  CASE WHEN (director_name = (SELECT director_name FROM movie_director WHERE movie_id = 1)) THEN 1
       ELSE 0
  END AS F_director
 FROM movie_director) AS movie_F_director
ON movie.movie_id = movie_F_director.movie_id) LEFT OUTER JOIN
      
-- F(actors)
(SELECT movie_id, 
        CAST(COUNT(*) AS REAL)/(SELECT COUNT(*) FROM movie_actor WHERE movie_id = 1) AS F_actors
 FROM movie_actor
 WHERE actor_name IN (SELECT actor_name FROM movie_actor WHERE movie_id = 1)
 GROUP BY movie_id) AS movie_F_actors
ON movie.movie_id = movie_F_actors.movie_id
      
ORDER BY similarity DESC, movie_id
LIMIT 10;

Notes:

The above query joins the resultant table from the previous query with the resultant table (named `movie_F_actors`) from the execution of the `SELECT` statement used to determine `F_actors`.

As there are movies where no actors have been recorded and the query that determines `F(actors)` for each movie only 
includes a movie if there is at least one actor who also appears in *Toy Story*, we have to use a 
`LEFT OUTER JOIN` to ensure that all movies are included in the resultant table. 
Using [`COALESCE`](http://www.postgresql.org/docs/9.3/static/functions-conditional.html#FUNCTIONS-COALESCE-NVL-IFNULL)
`(F_actors,0)` when calculating `similarity` ensures that the `F(actors)` term is `0` rather than `null` for these movies, otherwise `similarity` will compute to *unknown* (`null`).

Finally, we will include the `F(genres)` term.

In [None]:
%%sql
SELECT movie.movie_id,
       title, 
       CAST((F_rt_all_critics_rating+F_rt_top_critics_rating+F_rt_audience_rating+F_ml_user_rating+
             COALESCE(F_director,0)+
             COALESCE(F_actors,0)+
             COALESCE(F_genres,0))/7 AS DECIMAL(4,2)) AS similarity
FROM (((movie JOIN

-- F(rt_all_critics_rating), F(rt_top_critics_rating), F(rt_audience_rating), F(ml_user_rating)
(SELECT movie_id, 
  COALESCE ((9.6 - ABS(rt_all_critics_rating - (SELECT rt_all_critics_rating FROM movie WHERE movie_id = 1)))/9.6,0)
   AS F_rt_all_critics_rating,
  COALESCE ((10.0 - ABS(rt_top_critics_rating - (SELECT rt_top_critics_rating FROM movie WHERE movie_id = 1)))/10.0,0)
   AS F_rt_top_critics_rating,
  COALESCE ((5.0 - ABS(rt_audience_rating - (SELECT rt_audience_rating FROM movie WHERE movie_id = 1)))/5.0,0)
   AS F_rt_audience_rating,
  COALESCE ((4.5 - ABS(ml_user_rating - (SELECT ml_user_rating FROM movie WHERE movie_id = 1)))/4.5,0)
   AS F_ml_user_rating
 FROM movie) AS movie_F_ratings
ON movie.movie_id = movie_F_ratings.movie_id) LEFT OUTER JOIN

-- F(director)
(SELECT movie_id,
  CASE WHEN (director_name = (SELECT director_name FROM movie_director WHERE movie_id = 1)) THEN 1
       ELSE 0
  END AS F_director
 FROM movie_director) AS movie_F_director
ON movie.movie_id = movie_F_director.movie_id) LEFT OUTER JOIN
      
-- F(actors)
(SELECT movie_id, 
        CAST(COUNT(*) AS REAL)/(SELECT COUNT(*) FROM movie_actor WHERE movie_id = 1) AS F_actors
 FROM movie_actor
 WHERE actor_name IN (SELECT actor_name FROM movie_actor WHERE movie_id = 1)
 GROUP BY movie_id) AS movie_F_actors
ON movie.movie_id = movie_F_actors.movie_id) LEFT OUTER JOIN

-- F(genres)
(SELECT movie_id,
        CAST(COUNT(*) AS REAL)/(SELECT COUNT(*) FROM movie_genre WHERE movie_id = 1) AS F_genres
 FROM movie_genre
 WHERE genre IN (SELECT genre FROM movie_genre WHERE movie_id = 1)
 GROUP BY movie_id) AS movie_F_genres
ON movie.movie_id = movie_F_genres.movie_id
      
ORDER BY similarity DESC, movie_id
LIMIT 10;

Notes:

The above query joins the resultant table from the previous query with the resultant table (named `movie_F_genres`) from the execution of the `SELECT` statement used to determine `F_genres`.

As the query that determines `F(genres)` for each movie only includes a movie if there is at least one genre in common with *Toy Story*, we have to use a `LEFT OUTER JOIN` to ensure that all movies are included in the resultant table. Using [`COALESCE`](http://www.postgresql.org/docs/9.3/static/functions-conditional.html#FUNCTIONS-COALESCE-NVL-IFNULL)
`(F_genres,0)` when calculating `similarity` ensures that the `F(genres)` term is `0` 
rather than `null` for these movies, otherwise `similarity` will compute to *unknown* (`null`).

## Discussion

The SQL code we have developed so far only answers one specific question: "Which movies are *similar* to *Toy Story*?". 

To answer the general question, for example, "Which movies are *similar* to a specified movie?", 
we need to write an [`SQL function`](http://www.postgresql.org/docs/9.3/static/xfunc-sql.html) which takes a single argument (a movie) and returns a table of movies with similar features to that movie (as in the above queries).

In [None]:
%%sql
CREATE OR REPLACE FUNCTION you_might_also_like (p_movie_id INTEGER)
RETURNS TABLE (movie_ids INTEGER, titles VARCHAR(250), similarity DECIMAL(4,2)) AS $$
BEGIN
 RETURN QUERY
    
SELECT movie.movie_id,
       title, 
       CAST((F_rt_all_critics_rating+F_rt_top_critics_rating+F_rt_audience_rating+F_ml_user_rating+
             COALESCE(F_director,0)+
             COALESCE(F_actors,0)+
             COALESCE(F_genres,0))/7 AS DECIMAL(4,2)) AS similarity
FROM (((movie JOIN

-- F(rt_all_critics_rating), F(rt_top_critics_rating), F(rt_audience_rating), F(ml_user_rating)
(SELECT movie_id, 
  COALESCE ((9.6 - ABS(rt_all_critics_rating - (SELECT rt_all_critics_rating FROM movie WHERE movie_id = p_movie_id)))/9.6,0)
   AS F_rt_all_critics_rating,
  COALESCE ((10.0 - ABS(rt_top_critics_rating - (SELECT rt_top_critics_rating FROM movie WHERE movie_id = p_movie_id)))/10.0,0)
   AS F_rt_top_critics_rating,
  COALESCE ((5.0 - ABS(rt_audience_rating - (SELECT rt_audience_rating FROM movie WHERE movie_id = p_movie_id)))/5.0,0)
   AS F_rt_audience_rating,
  COALESCE ((4.5 - ABS(ml_user_rating - (SELECT ml_user_rating FROM movie WHERE movie_id = p_movie_id)))/4.5,0)
   AS F_ml_user_rating
 FROM movie) AS movie_F_ratings
ON movie.movie_id = movie_F_ratings.movie_id) LEFT OUTER JOIN

-- F(director)
(SELECT movie_id,
  CASE WHEN (director_name = (SELECT director_name FROM movie_director WHERE movie_id = p_movie_id)) THEN 1
       ELSE 0
  END AS F_director
 FROM movie_director) AS movie_F_director
ON movie.movie_id = movie_F_director.movie_id) LEFT OUTER JOIN
      
-- F(actors)
(SELECT movie_id, 
        CAST(COUNT(*) AS REAL)/(SELECT COUNT(*) FROM movie_actor WHERE movie_id = p_movie_id) AS F_actors
 FROM movie_actor
 WHERE actor_name IN (SELECT actor_name FROM movie_actor WHERE movie_id = p_movie_id)
 GROUP BY movie_id) AS movie_F_actors
ON movie.movie_id = movie_F_actors.movie_id) LEFT OUTER JOIN

-- F(genres)
(SELECT movie_id,
        CAST(COUNT(*) AS REAL)/(SELECT COUNT(*) FROM movie_genre WHERE movie_id = p_movie_id) AS F_genres
 FROM movie_genre
 WHERE genre IN (SELECT genre FROM movie_genre WHERE movie_id = p_movie_id)
 GROUP BY movie_id) AS movie_F_genres
ON movie.movie_id = movie_F_genres.movie_id
      
ORDER BY similarity DESC, movie_id
LIMIT 10;
    
END;
$$ LANGUAGE plpgsql;

Notes:
    
This [`SQL function`](http://www.postgresql.org/docs/9.3/static/xfunc-sql.html), 
which returns a table, has been constructed simply by incorporating the `SELECT` query we developed 
earlier into the following framework after replacing `movie_id = 1` throughout with `movie_id = p_movie_id`, where `p_movie_id` is the function 
parameter:
    
```
CREATE FUNCTION <name> (<parameters>)
RETURNS TABLE (<columns>) AS $$
BEGIN
 RETURN QUERY

<query>
                      
END;
$$ LANGUAGE plpgsql;
```

This function could be adapted, for example, to accept the movie title (`title`) rather than `movie_id`.

In [None]:
%%sql
-- Toy Story
SELECT *
FROM you_might_also_like(1);

In [None]:
%%sql
-- Shrek
SELECT *
FROM you_might_also_like(4306);

## CAUTION!
As we noted in the module text, this is the simplest approach to compute the similarity between two items. 
We would probably need at least to consider applying different weightings to each feature. 
For example, if we felt that the director of a movie had more influence on us liking a movie than the cast, then we would apply a higher weighting to the director than to the actors featuring in the movie.
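
One possible way to apply such weightings is a weighted average of the similarity terms, sketched below in Python. The function name `weighted_similarity`, the term values and the weights are all hypothetical, chosen only to show the effect; this is not a scheme taken from the module text.

```python
def weighted_similarity(terms, weights):
    """Weighted average of similarity terms; the result stays in [0..1]."""
    total_weight = sum(weights[name] for name in terms)
    return sum(terms[name] * weights[name] for name in terms) / total_weight


# hypothetical terms for one movie and hypothetical weights
terms = {"director": 1.0, "actors": 0.25, "genres": 0.5}
equal = {"director": 1, "actors": 1, "genres": 1}
director_heavy = {"director": 2, "actors": 1, "genres": 1}

print(weighted_similarity(terms, equal))           # (1.0 + 0.25 + 0.5) / 3, roughly 0.58
print(weighted_similarity(terms, director_heavy))  # (2.0 + 0.25 + 0.5) / 4 = 0.6875
```

With equal weights this reduces to the unweighted average used in this Notebook. Doubling the director's weight raises the score of a movie that shares a director.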

While developing SQL queries for a content-based recommender system using the Movies dataset it has become apparent that the [hetrec2011-movielens-2k dataset](http://grouplens.org/datasets/hetrec-2011/), from which the Movies dataset is derived, is inconsistent with one of its primary data sources, the [Internet Movie Database (IMDb)](http://www.imdb.com/), with respect to the actors recorded as featuring in some movies and the directors.

The hetrec2011-movielens-2k dataset was made available at the *2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems* ([HetRec 2011](http://ir.ii.uam.es/hetrec2011)) as a benchmark dataset to facilitate the evaluation of recommendation approaches. At the time of writing (April 2016), these inconsistencies had not been reported.

## Summary
In this Notebook, you have seen how we can develop SQL queries for a content-based recommender system using the Movies dataset.

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, you've completed the Part 11 Notebooks. It's time to move on to Part 12.