# intro

This notebook will walk through some of the basics of retrieving data from a SQL relational database. We will be using the SQLite database we designed in our previous discussion on creating SQL databases, and likewise will be using the `sqlite3` and `pandas` Python modules to facilitate our queries. As with the previous notebook, the focus will be on the SQL commands; detailed explanations of the Python script fall outside the scope of this discussion, but efforts have been made to comment the code should you wish to understand the logic behind it.

As we already discussed the basic theory in the last notebook, we'll save the preamble and dive straight into querying our database. We'll start small, but hopefully by the end we'll have built up to some interesting and complex queries:

In [2]:
## setting up our notebook to be able to query the database

import pandas as pd
import sqlite3
from IPython.display import display

# objects used to configure and communicate with database
conn = sqlite3.connect("movies.db")
c = conn.cursor()
c.execute("PRAGMA foreign_keys = ON")

# set float format in pandas display:
pd.options.display.float_format = '{:,.2f}'.format

# function to display results of database queries
def display_results(command, display_all=False):
    # returns results of sql command as list of tuples
    results = c.execute(command).fetchall()
    # collect names of columns requested in query
    column_names = [feature_tuple[0] for feature_tuple in c.description]
    # build a pandas dataframe of results
    results_df = pd.DataFrame(results, columns=column_names)
    if display_all:
        # display dataframe as cell output with max_colwidth set to 1000 chars
        with pd.option_context("display.max_colwidth", 1000):
            display(results_df)
    # otherwise display dataframe as cell output with default display settings
    else:
        display(results_df)

## SELECT

Every query asking for data from a SQL database has at its core a `SELECT` statement. A basic `SELECT` statement takes the form `SELECT [column_names] FROM [table_name]`. Let's take a look at an example:

In [7]:
display_results(command = """
                          SELECT title, runtime, release_date 
                          FROM movies 
                          LIMIT 5
                          """)

Unnamed: 0,title,runtime,release_date
0,Ariel,69,1988-10-21
1,Shadows in Paradise,76,1986-10-16
2,Four Rooms,98,1995-12-09
3,Judgment Night,110,1993-10-15
4,Star Wars,121,1977-05-25


As with pretty much all basic SQL commands, the above `SELECT` statement should be fairly easy to parse. We're `SELECT`ing the `title`, `runtime`, and `release_date` columns `FROM` the `movies` table, and `LIMIT`ing the number of results to `5` (without this `LIMIT` clause, the query would return these columns for every entry in the table). Pretty straightforward, and there aren't really any surprises to be found with this query type - you can follow this logic to take information from any of the tables in our database.
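For readers without the `movies.db` file to hand, the same `SELECT ... LIMIT` pattern can be tried on a throwaway in-memory database. This is only a minimal sketch: the three-row table below is a hypothetical stand-in for the real `movies` table, not the notebook's actual data.

```python
import sqlite3

# Hypothetical three-row stand-in for the movies table (not the real movies.db)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT, "
    "runtime INTEGER, release_date TEXT)"
)
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?, ?)",
    [
        (2, "Ariel", 69, "1988-10-21"),
        (3, "Shadows in Paradise", 76, "1986-10-16"),
        (5, "Four Rooms", 98, "1995-12-09"),
    ],
)

# Only the named columns come back, and LIMIT caps the number of rows
cur = conn.execute("SELECT title, runtime FROM movies LIMIT 2")
column_names = [d[0] for d in cur.description]
rows = cur.fetchall()
print(column_names)
print(len(rows))
```

The table has four columns and three rows, but the query returns just the two requested columns and at most two rows.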

If you wish, you can use a `*` wildcard character to signify "all columns" like so:  

In [8]:
display_results(command = """
                          SELECT * 
                          FROM credits 
                          LIMIT 5
                          """)

Unnamed: 0,movie_id,character,actor_id,appearance_order
0,862,Woody (voice),31,0
1,862,Buzz Lightyear (voice),12898,1
2,862,Mr. Potato Head (voice),7167,2
3,862,Slinky Dog (voice),12899,3
4,862,Rex (voice),12900,4


Often, we can leave our queries at that - take all the information from a table and then work with it from there. Easy! But, as you might guess, that isn't a terribly efficient solution. If we're querying a table that contains a huge amount of data, just asking for all the data therein is very computationally expensive and memory intensive (if the data fits in memory at all!). As an example, take the `ratings` data - asking for all 26M entries will take a considerable time to load and will likely consume a large portion (if not all) of our local memory. Also, consider that many modern cloud database solutions charge users based on the size of queries, and so we'll rack up some pretty huge bills if this is the only tool we have to get at our data!
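When a full table scan really is needed, one way to keep memory use under control is to read the results in batches rather than calling `fetchall()`. The sketch below assumes a made-up 1,000-row `ratings` table (with a hypothetical `user_id` column) in place of the real 26M-row one, and uses the standard `fetchmany` method of `sqlite3` cursors.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating REAL)")
# 1,000 made-up ratings standing in for the real 26M-row table
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(i, i % 10, 0.5 + (i % 10) * 0.5) for i in range(1000)],
)

cur = conn.execute("SELECT rating FROM ratings")
n_rows = 0
total = 0.0
while True:
    batch = cur.fetchmany(100)  # at most 100 rows are held in memory at a time
    if not batch:
        break
    n_rows += len(batch)
    total += sum(row[0] for row in batch)

mean_rating = total / n_rows
print(n_rows, mean_rating)
```

In practice it is usually better still to let the database do the work (as the aggregate functions later in this notebook do), but batching is a useful fallback when the rows themselves are needed.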

Thankfully, SQLite provides us with a lot of tools to refine our search. The `LIMIT` clause is a crude example of this - we're reducing our query from all the data in a table to a specified number of rows. Let's take a look at some other clauses that might help us make some more interesting queries:

### ORDER BY

If we combine our `LIMIT` clause with an `ORDER BY` clause, we have a tool to query the top "n" of a particular feature of our data. For example, we can ask the database for the top 5 longest movies like so:

In [13]:
display_results(command = """
                          SELECT title, runtime, release_date 
                          FROM movies 
                          ORDER BY runtime DESC
                          LIMIT 5
                          """)

Unnamed: 0,title,runtime,release_date
0,Centennial,1256,1978-10-01
1,Baseball,1140,1994-09-18
2,Jazz,1140,2001-01-09
3,Berlin Alexanderplatz,931,1980-08-28
4,Heimat: A Chronicle of Germany,925,1984-09-16


Setting aside for the moment that anyone would be willing to watch 19 continuous hours of "Baseball", we have asked the database to return the same info as before, but to `ORDER` the results `BY runtime` in `DESC`ending order before selecting the first 5 results. The `ORDER BY` clause sorts in ascending order by default, but you can stipulate `ASC` if you wish to be more explicit.
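The default sort direction is easy to confirm on a small example. The sketch below uses a hypothetical three-row table rather than the real `movies` data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, runtime INTEGER)")
# Made-up sample rows, deliberately inserted out of order
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("Four Rooms", 98), ("Ariel", 69), ("Star Wars", 121)],
)

def titles(order_clause):
    """Return the titles sorted by the given ORDER BY clause."""
    query = f"SELECT title FROM movies ORDER BY {order_clause}"
    return [row[0] for row in conn.execute(query)]

default_order = titles("runtime")      # no direction given
ascending = titles("runtime ASC")      # explicit ascending
descending = titles("runtime DESC")    # explicit descending
print(default_order, ascending, descending)
```

Omitting the direction and writing `ASC` give the same ordering; `DESC` reverses it.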

As another example, we can ask for the oldest 5 movies like so:

In [24]:
display_results(command = """
                          SELECT title, runtime, release_date 
                          FROM movies 
                          ORDER BY release_date ASC
                          LIMIT 5
                          """)

Unnamed: 0,title,runtime,release_date
0,Passage of Venus,1,1874-12-09
1,Sallie Gardner at a Gallop,1,1878-06-14
2,Buffalo Running,1,1883-11-19
3,Man Walking Around a Corner,1,1887-08-18
4,Accordion Player,1,1888-01-01


Those are some old movies!

### WHERE

The `WHERE` clause allows us to ask the database only for data that meets a certain condition (or set of conditions). For example, we can find the entry relating to the first Toy Story movie like so:

In [21]:
display_results(command = """
                          SELECT * 
                          FROM movies 
                          WHERE title = "Toy Story"
                          """,
               display_all=True)

Unnamed: 0,id,title,original_title,runtime,budget,original_language,overview,release_date,revenue,tagline
0,862,Toy Story,Toy Story,81,30000000,en,"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",1995-10-30,373554033,


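We're asking the database for any entries from the `movies` table `WHERE` the `title` column equals "Toy Story". (SQLite accepts the double-quoted `"Toy Story"` as a string here, but standard SQL reserves double quotes for identifiers and uses single quotes for string literals, so `'Toy Story'` is the more portable form.)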

The `WHERE` clause is very powerful, and offers the use of the classic comparison operators `=`, `!=`, `<`, `>`, `<=`, `>=` and a host of logical operators such as `AND`, `OR`, `BETWEEN`, `IN`, `LIKE`, `NOT` (these vary depending on the particular SQL engine). We could spend a very long time discussing the features of different `WHERE` clauses (anyone interested should read [this excellent tutorial](https://www.sqlitetutorial.net/sqlite-where/)), so it's much better to explore these features as we go. We'll take a look at one slightly more complex example here and then move on. Worry not though, there will be plenty of conditional queries to follow.
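As a quick taste of a few of these operators, here is a minimal sketch on a hypothetical four-row table (the rows and the `titles` helper are made up for illustration; the helper builds SQL with string formatting, which is only acceptable here because the conditions are hard-coded):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, runtime INTEGER, original_language TEXT)")
# Made-up sample rows for illustration only
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Ariel", 69, "fi"),
        ("Four Rooms", 98, "en"),
        ("Judgment Night", 110, "en"),
        ("Star Wars", 121, "en"),
    ],
)

def titles(condition):
    """Return the titles matching a hard-coded WHERE condition, sorted by title."""
    query = f"SELECT title FROM movies WHERE {condition} ORDER BY title"
    return [row[0] for row in conn.execute(query)]

# BETWEEN is inclusive at both ends
between = titles("runtime BETWEEN 90 AND 115")
# IN tests membership of a list of values
in_list = titles("original_language IN ('fi', 'sv')")
# NOT negates a whole parenthesised condition
negated = titles("NOT (runtime < 100 OR original_language = 'fi')")
print(between, in_list, negated)
```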

Imagine we really wanted to watch a Harry Potter movie but had less than 2 hours 30 minutes to do so. We can query our database for movies that meet these criteria as follows:

In [14]:
display_results(command = """
                          SELECT title, runtime, release_date 
                          FROM movies 
                          WHERE title LIKE "%harry potter%" 
                              AND runtime < 150
                          """)

Unnamed: 0,title,runtime,release_date
0,Harry Potter and the Prisoner of Azkaban,141,2004-05-31
1,Harry Potter and the Order of the Phoenix,138,2007-06-28
2,Harry Potter and the Deathly Hallows: Part 1,146,2010-10-17
3,Harry Potter and the Deathly Hallows: Part 2,130,2011-07-07


The above query contains two `WHERE` conditions joined with an `AND` logical operator - each condition must be met for a result to be returned: 

* The first condition is that the `title` is `LIKE "%harry potter%"` - the `LIKE` condition returns any entries with a `title` that matches the `%harry potter%` pattern. The `%` characters in the pattern are wildcards that match any sequence of zero or more characters. So, we are effectively asking for any titles that contain the string "harry potter" within them. Note that, by default in SQLite, the `LIKE` condition isn't case sensitive (for ASCII characters), whereas when testing for equality (i.e. our "Toy Story" query above) strings are case sensitive.
* The second condition is that the `runtime` is less than (`<`) 150 minutes. 

From the output, we have 4 of the 8 movies from the Harry Potter franchise to choose from.
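The difference in case sensitivity between `LIKE` and `=` can be checked directly. This sketch uses a hypothetical two-row table and SQLite's default settings (no custom collation, and the `case_sensitive_like` pragma left off):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT)")
# Two sample titles, enough to show the behaviour
conn.executemany(
    "INSERT INTO movies VALUES (?)",
    [("Harry Potter and the Prisoner of Azkaban",), ("Toy Story",)],
)

def titles(condition):
    """Return the titles matching a hard-coded WHERE condition."""
    return [row[0] for row in conn.execute(f"SELECT title FROM movies WHERE {condition}")]

like_hits = titles("title LIKE '%harry potter%'")  # lower-case pattern still matches
exact_case = titles("title = 'Toy Story'")         # equality with matching case
wrong_case = titles("title = 'toy story'")         # equality with different case
print(like_hits, exact_case, wrong_case)
```

The lower-case `LIKE` pattern matches the capitalised title, while the lower-case equality test returns nothing.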

You'll note that the `SELECT` statements that we've devised so far are returning entries that are identical to those we entered. SQL also offers us the tools to manipulate and perform calculations on database entries at the point we make a query. This allows us to ask for even more specific information from our database, without being limited to the format in which the data was entered:

## GROUP BY

The `GROUP BY` clause allows us to ask for entries grouped by one of the columns or features of our data. If we take the `ratings` table as an example, we can use the `movie_id` feature to group together all the ratings for the same movie. You might then ask yourself, what happens with the other features? We need some way of combining the information stored in the other columns. `GROUP BY` queries are therefore usually combined with **aggregate functions**, which tell the database what to do with the other features you're querying. Let's ask the database for the average rating of each movie:

In [34]:
# NOTE: queries of the ratings table will take much longer
# than the other tables as it has 26M entries
display_results(command = """
                          SELECT movie_id, AVG(rating) 
                          FROM ratings 
                          GROUP BY movie_id
                          LIMIT 5
                          """)

Unnamed: 0,movie_id,AVG(rating)
0,2,3.673664
1,3,3.770115
2,5,3.409031
3,6,2.924862
4,11,4.132299


Here we have asked the database for the `movie_id` column and an `AVG` (average) of the `rating` column from the `ratings` table `GROUP`ed `BY` the `movie_id` column. This gathers together every rating for the same movie, and averages the individual ratings for each movie. Note that, unlike the other queries we've run, we didn't provide the database with this information when it was created - the database engine calculated this information for us.
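The grouping logic can be verified by hand on a table small enough to average mentally. A minimal sketch, assuming five made-up ratings for two hypothetical movie ids:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (movie_id INTEGER, rating REAL)")
# Made-up ratings: movie 2 has two ratings, movie 3 has three
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [(2, 3.0), (2, 4.0), (3, 5.0), (3, 4.0), (3, 3.0)],
)

# One output row per distinct movie_id, with the ratings in each group averaged
rows = conn.execute(
    "SELECT movie_id, AVG(rating) FROM ratings GROUP BY movie_id"
).fetchall()
averages = dict(rows)
print(averages)
```

Five input rows collapse to two output rows, one per movie.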

If we use different aggregate functions, unsurprisingly, we get different information:

In [36]:
display_results(command = """
                          SELECT movie_id, 
                                 AVG(rating), 
                                 SUM(rating), 
                                 COUNT(rating) 
                          FROM ratings 
                          GROUP BY movie_id
                          LIMIT 5
                          """)

Unnamed: 0,movie_id,AVG(rating),SUM(rating),COUNT(rating)
0,2,3.673664,962.5,262
1,3,3.770115,328.0,87
2,5,3.409031,20761.0,6090
3,6,2.924862,3717.5,1271
4,11,4.132299,318373.0,77045


Here we have not only the average, but also a `SUM` of the individual ratings and a `COUNT` of all the ratings in each group (`COUNT` skips NULL values, so counting any other feature in the ratings table would give the same result as long as that feature contains no NULLs). If you want to, you can use this information to double-check SQLite is doing its work correctly - you can divide the `SUM(rating)` column by the `COUNT(rating)` column to calculate your own average.
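The double-check described above, and the way the aggregate functions treat missing values, can be sketched as follows. The table is hypothetical and deliberately includes one NULL rating, which the real `ratings` table may not contain:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (movie_id INTEGER, rating REAL)")
# Three made-up rows for one movie; the last has no rating (NULL)
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [(2, 3.0), (2, 4.0), (2, None)],
)

average, total, n_rated, n_rows = conn.execute(
    "SELECT AVG(rating), SUM(rating), COUNT(rating), COUNT(*) FROM ratings"
).fetchone()

# AVG, SUM and COUNT(rating) all ignore the NULL; COUNT(*) counts every row
print(average, total, n_rated, n_rows)
print(total / n_rated == average)
```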

Note that, if we use the aggregate functions **without** a `GROUP BY` statement, the database will aggregate over the whole table (rather than the groups):

In [15]:
display_results(command = """
                          SELECT AVG(rating), SUM(rating), COUNT(rating) 
                          FROM ratings 
                          """)

Unnamed: 0,AVG(rating),SUM(rating),COUNT(rating)
0,3.527854,91642196.0,25976754


### HAVING

There may be instances where we wish to narrow our grouped search results, like we did above with the `WHERE` clause. SQL database engines reserve the `WHERE` clause for conditions on individual rows, which are applied before any grouping takes place - it can't be used with aggregated data. Instead, conditions can be placed on aggregated data using the `HAVING` clause - other than this distinction, the two function in the same way. Using the `HAVING` clause we can, for example, return the top 5 rated movies with more than 100 ratings:

In [18]:
display_results(command = """
                          SELECT movie_id, AVG(rating), COUNT(rating) 
                          FROM ratings 
                          GROUP BY movie_id
                          HAVING COUNT(rating) > 100
                          ORDER BY AVG(rating) DESC
                          LIMIT 5
                          """)

Unnamed: 0,movie_id,AVG(rating),COUNT(rating)
0,192040,4.47878,754
1,278,4.429015,91082
2,331214,4.394366,284
3,238,4.339811,57070
4,629,4.300189,59271


Note that our commands are beginning to become more complex, but the logic behind each component is as simple as it was above. Here we're asking the database to group the `ratings` table by the `movie_id` feature, calculate an `AVG` and `COUNT` of the `rating` feature, and return the highest 5 average ratings of movies with a `COUNT` of `rating`s that is greater than 100.
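The distinction between `WHERE` (applied to rows before grouping) and `HAVING` (applied to groups after aggregation) is easiest to see when both appear in one query. A minimal sketch with six made-up ratings:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (movie_id INTEGER, rating REAL)")
# Made-up ratings for three hypothetical movies
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [(1, 5.0), (1, 5.0), (1, 1.0), (2, 4.0), (2, 4.0), (3, 5.0)],
)

# HAVING only: every rating is averaged, then groups with < 2 ratings are dropped
having_only = [row[0] for row in conn.execute("""
    SELECT movie_id FROM ratings
    GROUP BY movie_id
    HAVING COUNT(rating) >= 2
    ORDER BY AVG(rating) DESC
""")]

# WHERE + HAVING: ratings below 4 are discarded first, so movie 1's 1.0 never
# reaches the average and movie 1 now outranks movie 2
where_then_having = [row[0] for row in conn.execute("""
    SELECT movie_id FROM ratings
    WHERE rating >= 4
    GROUP BY movie_id
    HAVING COUNT(rating) >= 2
    ORDER BY AVG(rating) DESC
""")]
print(having_only, where_then_having)
```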

### AS

When looking at the output from these grouped queries, you may note the somewhat "unattractive" column names given to our aggregated features. Most SQL engines (including SQLite of course) support "aliasing" - basically, renaming a feature to something a little less verbose. This is done using the `AS` clause with the format `[feature] AS [alias]`. In SQLite we can then refer to our new alias elsewhere in the query (for example in the `HAVING` and `ORDER BY` clauses) if required; some other SQL engines are stricter about where an alias may be used. If we take the above example, we can use aliasing to make the SQL command and the output a little more readable:

In [19]:
display_results(command = """
                          SELECT movie_id, 
                                 AVG(rating) AS average_rating, 
                                 COUNT(rating) AS n_ratings 
                          FROM ratings 
                          GROUP BY movie_id
                          HAVING n_ratings > 100
                          ORDER BY average_rating DESC
                          LIMIT 5
                          """)

Unnamed: 0,movie_id,average_rating,n_ratings
0,192040,4.47878,754
1,278,4.429015,91082
2,331214,4.394366,284
3,238,4.339811,57070
4,629,4.300189,59271


Here we are taking the `AVG(rating)` and renaming/aliasing it `AS average_rating` and renaming the `COUNT(rating)` feature `AS n_ratings`. We then later refer to these aliases when conditioning the results as those `HAVING n_ratings > 100` and `ORDER`ing `BY average_rating`. Great!
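Aliases also determine the column names reported by the cursor, which is what the `display_results` helper reads from `c.description` to label the DataFrame. A minimal sketch on made-up data (SQLite's documentation notes that the names of un-aliased result columns are not guaranteed, which is another reason to alias them):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (movie_id INTEGER, rating REAL)")
# Made-up ratings for three hypothetical movies
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [(1, 5.0), (1, 4.0), (2, 3.0), (2, 2.0), (3, 5.0)],
)

cur = conn.execute("""
    SELECT movie_id,
           AVG(rating) AS average_rating,
           COUNT(rating) AS n_ratings
    FROM ratings
    GROUP BY movie_id
    HAVING n_ratings >= 2
    ORDER BY average_rating DESC
""")
# The first element of each description entry is the (aliased) column name
column_names = [d[0] for d in cur.description]
rows = cur.fetchall()
print(column_names)
print(rows)
```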

So far, we've learned a number of ways to ask for specific information from one table at a time. But what if the information we need is contained in multiple tables? Take for example the many-to-many relationship between movies and genres - each genre relates to multiple movies and many movies fit in multiple genres. We have movies, movie_genres and genres tables but currently we can only query these tables separately, which isn't much use. This leads us nicely to...

## JOIN

If we want to combine information from our database contained in multiple tables, we use the `JOIN` command. The join command takes a join condition, a feature common to the tables that can be used to match entries to one another - for example we can join the `movies` data with the `ratings` data using the `id` / `movie_id` features in the respective tables. We must also select a specific **type** of join - in SQLite these break down as follows:

* `INNER JOIN`: this returns all the rows that are common to both tables based on the join clause, and the columns requested from each table. In our movies / ratings example, this would be any movies that have at least one rating.
* `LEFT JOIN`: this returns all the rows from the "LEFT" table (the primary table you SELECT from at the start of your query) whether there are matching entries in the "RIGHT" table or not. Each row is matched with corresponding rows from the "RIGHT" table if available, otherwise the columns from the "RIGHT" table are recorded as NULL. In our movies / ratings example, this would return all the movies, with ratings info if available and NULL values for the ratings otherwise.
* `CROSS JOIN`: this join doesn't require a join condition, and combines the "n" entries of the LEFT table with the "m" entries of the RIGHT table to form an "n x m" table. There isn't a good use-case for this join type with the movies database. As a general example, you might use a `CROSS JOIN` if you had a table of "products" and another table of "stores" and wanted a table with each unique combination of "products" and "stores" to record individual sales. 
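The three join types can be compared side by side by counting the rows each one returns. This sketch uses two tiny hypothetical tables: two movies, only one of which has any ratings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE ratings (movie_id INTEGER, rating REAL);
""")
# Made-up data: "Star Wars" has two ratings, "Ariel" has none
conn.executemany("INSERT INTO movies VALUES (?, ?)", [(1, "Star Wars"), (2, "Ariel")])
conn.executemany("INSERT INTO ratings VALUES (?, ?)", [(1, 4.0), (1, 5.0)])

select = "SELECT movies.title, ratings.rating FROM movies "

# INNER JOIN: only movies with at least one matching rating
inner = conn.execute(select + "INNER JOIN ratings ON ratings.movie_id = movies.id").fetchall()
# LEFT JOIN: every movie, with NULL (None in Python) where no rating exists
left = conn.execute(select + "LEFT JOIN ratings ON ratings.movie_id = movies.id").fetchall()
# CROSS JOIN: every movie paired with every rating, no join condition
cross = conn.execute(select + "CROSS JOIN ratings").fetchall()

print(len(inner), len(left), len(cross))
```

With n = 2 movies and m = 2 ratings, the cross join returns n x m = 4 rows, the inner join 2, and the left join 3 (the extra row being the unrated movie).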

Starting with a simple example, we can take the ratings table and pair each rating with the title of the movie it is rating:

In [6]:
display_results(command = """
                          SELECT ratings.rating, movies.title 
                          FROM ratings 
                          LEFT JOIN movies 
                              ON ratings.movie_id = movies.id
                          LIMIT 5
                          """)

Unnamed: 0,rating,title
0,1.0,Braveheart
1,4.5,The Basketball Diaries
2,5.0,The Godfather
3,5.0,The Godfather: Part II
4,5.0,Dead Poets Society


The above asks for the `rating` column from the `ratings` table and the `title` column from the `movies` table, and takes this information from the `ratings` table `LEFT JOIN`ed with the `movies` table `ON` the `id` feature in the `movies` table and the `movie_id` feature in the `ratings` table. A bit of a mouthful, but it should hopefully be relatively easy to parse this from the SQL command. Now that we are using columns from more than one table, note that it is good practice to specify `[table_name].[column_name]` each time we reference a column; this is required whenever the same column name appears in more than one of the joined tables.
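To see why qualifying column names matters, consider what happens when two joined tables share a column name. The sketch below creates empty tables with assumed schemas (both `movies` and `genres` have an `id` column, as the queries in this notebook suggest) and asks for an unqualified `id`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schemas; the real tables have more columns
conn.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE genres (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE movie_genres (movie_id INTEGER, genre_id INTEGER);
""")

joins = """
    FROM movies
    JOIN movie_genres ON movie_genres.movie_id = movies.id
    JOIN genres ON movie_genres.genre_id = genres.id
"""

try:
    # "id" exists in both movies and genres, so SQLite cannot tell which is meant
    conn.execute("SELECT id " + joins)
    error_message = None
except sqlite3.OperationalError as err:
    error_message = str(err)

# Qualifying the shared name resolves the ambiguity; "title" and "name" are
# unique across the three tables, so they work unqualified
qualified_rows = conn.execute("SELECT movies.id, title, name " + joins).fetchall()
print(error_message)
```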

Now that we have the tools to combine data from different tables, we can take our "top 5 movies with more than 100 ratings" query above and ask for the movie title rather than the `movie_id` - this is bound to make it more readable:

In [22]:
display_results(command = """
                          SELECT movies.title, 
                              AVG(ratings.rating) AS average_rating,
                              COUNT(ratings.rating) AS n_ratings
                          FROM ratings 
                          LEFT JOIN movies 
                              ON ratings.movie_id = movies.id 
                          GROUP BY ratings.movie_id
                          HAVING n_ratings > 100
                          ORDER BY average_rating DESC
                          LIMIT 5
                          """)

Unnamed: 0,title,average_rating,n_ratings
0,Planet Earth,4.47878,754
1,The Shawshank Redemption,4.429015,91082
2,Band of Brothers,4.394366,284
3,The Godfather,4.339811,57070
4,The Usual Suspects,4.300189,59271


Now we can simply read which movies the ratings relate to.

So far, we've only merged pairs of tables. With the many-to-many relationships between our `movies` table and the `actors` and `genres` tables, we have three different tables to keep track of for each many-to-many relationship (e.g. `movies` -> `movie_genres` -> `genres`). Thankfully SQL allows us to "stack" join commands, so we can request a list of genre names and the movie titles they correspond with like so:

In [6]:
display_results(command = """
                          SELECT movies.title, genres.name 
                          FROM movies 
                          LEFT JOIN movie_genres 
                              ON movie_genres.movie_id = movies.id 
                          LEFT JOIN genres 
                              ON movie_genres.genre_id = genres.id
                          LIMIT 5
                          """)

Unnamed: 0,title,name
0,Ariel,Drama
1,Ariel,Crime
2,Shadows in Paradise,Comedy
3,Shadows in Paradise,Drama
4,Four Rooms,Comedy


We can also create a list of the genres that correspond to each movie, by grouping each movie and using the `GROUP_CONCAT` aggregate function:

In [7]:
display_results(command = """
                          SELECT movies.title, 
                                 GROUP_CONCAT(genres.name, ", ") AS genres 
                          FROM movies 
                          LEFT JOIN movie_genres 
                              ON movie_genres.movie_id = movies.id 
                          LEFT JOIN genres 
                              ON movie_genres.genre_id = genres.id
                          GROUP BY movies.id
                          LIMIT 5
                          """)

Unnamed: 0,title,genres
0,Ariel,"Drama, Crime"
1,Shadows in Paradise,"Comedy, Drama"
2,Four Rooms,"Comedy, Crime"
3,Judgment Night,"Action, Crime, Thriller"
4,Star Wars,"Adventure, Action, Science Fiction"
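The stacked `LEFT JOIN` plus `GROUP_CONCAT` pattern can be reproduced on a toy database. In this sketch the genre ids and the "No Genre Movie" row are made up; the latter is included to show what the `LEFT JOIN` does for a movie with no genre entries:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE genres (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE movie_genres (movie_id INTEGER, genre_id INTEGER);
""")
# Hypothetical sample data
conn.executemany("INSERT INTO movies VALUES (?, ?)",
                 [(2, "Ariel"), (5, "Four Rooms"), (99, "No Genre Movie")])
conn.executemany("INSERT INTO genres VALUES (?, ?)",
                 [(18, "Drama"), (80, "Crime"), (35, "Comedy")])
conn.executemany("INSERT INTO movie_genres VALUES (?, ?)",
                 [(2, 18), (2, 80), (5, 35), (5, 80)])

rows = conn.execute("""
    SELECT movies.title,
           GROUP_CONCAT(genres.name, ', ') AS genres
    FROM movies
    LEFT JOIN movie_genres ON movie_genres.movie_id = movies.id
    LEFT JOIN genres ON movie_genres.genre_id = genres.id
    GROUP BY movies.id
""").fetchall()
genres_by_title = dict(rows)
print(genres_by_title)
```

The movie without genres is still returned, with a NULL (`None`) genre list. The order of names inside each concatenated string is not guaranteed unless the query imposes one.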


Now that we're starting to stack `JOIN`s on top of one another, our queries will start becoming more and more complex. The more tables we join together in one go, the harder any conditions (`WHERE` and `HAVING`) become to write and reason about. We'll also start to face issues with efficiency - stitching table after table together to form one gigantic table, and then having the database engine evaluate conditions on this giant table, is not very efficient. It would be great if we had a way of splitting up our queries so we can place conditions on a group of results **before** we then join them with others. Fortunately, SQL has us covered: 

## WITH

The `WITH` command allows us to create a **common table expression** (or CTE). This is effectively a "temporary table" that we can create and use during a SQL command to help us take information from multiple tables, or just generally clean up our queries to make them more readable. 
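In its simplest form, a CTE just names the result of one query so that a second query can treat it like a table. A minimal sketch, assuming a made-up `ratings` table and a hypothetical CTE name `movie_stats`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (movie_id INTEGER, rating REAL)")
# Made-up ratings for three hypothetical movies
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [(1, 5.0), (1, 4.0), (2, 2.0), (2, 3.0), (3, 4.0)],
)

well_rated = [row[0] for row in conn.execute("""
    WITH movie_stats AS
    (
        SELECT movie_id, AVG(rating) AS average_rating
        FROM ratings
        GROUP BY movie_id
    )
    SELECT movie_id
    FROM movie_stats
    WHERE average_rating >= 4
    ORDER BY movie_id
""")]
print(well_rated)
```

Because `average_rating` is an ordinary column of `movie_stats`, the outer query can filter on it with a plain `WHERE` clause.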

As an example, imagine we wanted to add average ratings to the table we created above. The information required is contained in the `movies`, `movie_genres`, `genres` and `ratings` tables; this will be a really difficult query to write if we want to stack all of these tables together in one go. More importantly, in joining all these tables we would create a **huge** table - each of the 26M ratings would be replicated **per genre** of each movie - before we grouped the results together, which would be horribly inefficient. Instead, we can split our query into two: we can first join all the movies with their ratings, and then use a `WITH` command to join these results with a list of the genres (like we created above). Let's start by writing a command to return all the movie titles, along with their average rating and number of ratings:

In [16]:
# this will take a while to run
display_results(command = """
                          SELECT movies.id,
                                 movies.title, 
                                 AVG(ratings.rating) AS average_rating,
                                 COUNT(ratings.rating) AS n_ratings
                          FROM movies
                          JOIN ratings 
                              ON movies.id = ratings.movie_id
                          GROUP BY movies.id
                          LIMIT 5
                          """)

Unnamed: 0,id,title,average_rating,n_ratings
0,2,Ariel,3.67,262
1,3,Shadows in Paradise,3.77,87
2,5,Four Rooms,3.41,6090
3,6,Judgment Night,2.92,1271
4,11,Star Wars,4.13,77045


Now, we'll use this table as a temporary set of results, and join this with our command from before that creates a list of genres. To do this, we write the above command in parentheses `()` and prefix it with `WITH [temporary table name] AS`, and then follow it with our command to join this table with the genres list, like so:

In [17]:
# this will take even longer to run
display_results(command = """
                          WITH movie_reviews AS
                          (
                              SELECT movies.id,
                                     movies.title, 
                                     AVG(ratings.rating) AS average_rating,
                                     COUNT(ratings.rating) AS n_ratings
                              FROM movies
                              JOIN ratings 
                                  ON ratings.movie_id = movies.id
                              GROUP BY movies.id
                          )
                          SELECT movie_reviews.title,
                                 movie_reviews.average_rating,
                                 movie_reviews.n_ratings,
                                 GROUP_CONCAT(genres.name, ", ") AS genres 
                          FROM movie_reviews 
                          LEFT JOIN movie_genres 
                              ON movie_reviews.id = movie_genres.movie_id
                          LEFT JOIN genres 
                              ON movie_genres.genre_id = genres.id
                          GROUP BY movie_reviews.id
                          LIMIT 5
                          """)

Unnamed: 0,title,average_rating,n_ratings,genres
0,Ariel,3.67,262,"Drama, Crime"
1,Shadows in Paradise,3.77,87,"Comedy, Drama"
2,Four Rooms,3.41,6090,"Comedy, Crime"
3,Judgment Night,2.92,1271,"Action, Crime, Thriller"
4,Star Wars,4.13,77045,"Adventure, Action, Science Fiction"


CPU times: user 28.9 s, sys: 9.12 s, total: 38 s
Wall time: 38 s


This command may look a little daunting at first but, at its heart, it's simply the last two commands we wrote stitched together. The first command in parentheses is identical to our command collecting the rated movies above (minus the `LIMIT` constraint). Using the results of this query, called `movie_reviews`, we then `JOIN` them with the command that creates a list of genres.
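The reason for aggregating the ratings before joining the genres can be demonstrated on a toy database. In this sketch (one hypothetical movie with three ratings and three genres) joining everything in one pass replicates each rating once per genre, which inflates the rating count, whereas the CTE approach does not:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE ratings (movie_id INTEGER, rating REAL);
    CREATE TABLE movie_genres (movie_id INTEGER, genre_id INTEGER);
""")
# Made-up data: one movie, three ratings, three (arbitrary) genre ids
conn.execute("INSERT INTO movies VALUES (11, 'Star Wars')")
conn.executemany("INSERT INTO ratings VALUES (?, ?)", [(11, 5.0), (11, 4.0), (11, 3.0)])
conn.executemany("INSERT INTO movie_genres VALUES (?, ?)", [(11, 1), (11, 2), (11, 3)])

# One-pass join: 3 ratings x 3 genres = 9 joined rows before grouping
naive_count = conn.execute("""
    SELECT COUNT(ratings.rating)
    FROM movies
    JOIN ratings ON ratings.movie_id = movies.id
    JOIN movie_genres ON movie_genres.movie_id = movies.id
    GROUP BY movies.id
""").fetchone()[0]

# CTE: collapse ratings to one row per movie first, then join the genres
cte_count, n_genres = conn.execute("""
    WITH movie_reviews AS
    (
        SELECT movies.id, COUNT(ratings.rating) AS n_ratings
        FROM movies
        JOIN ratings ON ratings.movie_id = movies.id
        GROUP BY movies.id
    )
    SELECT movie_reviews.n_ratings, COUNT(movie_genres.genre_id)
    FROM movie_reviews
    LEFT JOIN movie_genres ON movie_reviews.id = movie_genres.movie_id
    GROUP BY movie_reviews.id
""").fetchone()
print(naive_count, cte_count, n_genres)
```

On the real 26M-row `ratings` table the same multiplication applies to the size of the intermediate result, which is where the efficiency concern comes from.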

And that pretty much does it for the concepts we're going to cover. To demonstrate the power of these tools in combination, we'll finish by taking a look at a couple more complex examples:

## Complex Examples

First, let's imagine we want to compare the profitability of the actors Scarlett Johansson, Brad Pitt, and Tom Hanks. We have the `revenue` of each movie in the `movies` table, and the `credits` table tells us which of the actors (in the `actors` table) starred in these movies, and their order of appearance. To exclude any minor roles played by the actors, we can limit the movies we consider to those where the actor's appearance order in the credits is in the top 3 (the `appearance_order` column starts at 0, hence the `<= 2` condition in the query). As such, we can calculate an average revenue for the movies these actors played a starring role in like so:

In [27]:
display_results(command = """
                          WITH credits_list AS
                          (
                              SELECT actors.name, credits.movie_id 
                              FROM actors
                              JOIN credits ON actors.id = credits.actor_id
                              WHERE actors.name IN ("Scarlett Johansson", 
                                                    "Tom Hanks", 
                                                    "Brad Pitt")
                                    AND credits.appearance_order <= 2 
                          )
                          SELECT credits_list.name, 
                                 AVG(movies.revenue) AS mean_revenue,
                                 COUNT(movies.revenue) AS n_credits
                          FROM credits_list
                          JOIN movies ON credits_list.movie_id = movies.id
                          GROUP BY credits_list.name
                          ORDER BY mean_revenue DESC
                          """)

Unnamed: 0,name,mean_revenue,n_credits
0,Tom Hanks,168708254.79,53
1,Brad Pitt,156051320.41,41
2,Scarlett Johansson,102423013.4,30


It seems that Tom is the actor you should choose if you're looking to make some money! (obviously this is a very crude way of evaluating profitability)

Here we start by creating a temporary `credits_list` table, which features all the credits with the corresponding `actors.name` appearing `IN` the list `("Scarlett Johansson", "Tom Hanks", "Brad Pitt")` and an `appearance_order` of 2 or less. We then join this list with the `movies` table to add the `revenue` feature, and group the results by each actor to calculate the average revenue per movie they starred in, along with a count of these movies.

We can use similar logic to find the top 5 most profitable actors like so (we'll limit the results to those actors that have played a starring role in more than 5 movies):

In [26]:
display_results(command = """
                          WITH credits_list AS
                          (
                              SELECT actors.name, credits.movie_id 
                              FROM actors
                              JOIN credits ON actors.id = credits.actor_id
                              WHERE credits.appearance_order <= 2
                          )
                          SELECT credits_list.name, 
                                 AVG(movies.revenue) AS mean_revenue,
                                 COUNT(movies.revenue) AS n_credits
                          FROM credits_list
                          JOIN movies ON credits_list.movie_id = movies.id
                          GROUP BY credits_list.name
                          HAVING n_credits > 5
                          ORDER BY mean_revenue DESC
                          LIMIT 5
                          """)

Unnamed: 0,name,mean_revenue,n_credits
0,Emma Watson,557535535.47,17
1,Rupert Grint,481928848.25,16
2,Chris Pratt,465556990.78,9
3,Daniel Radcliffe,437859132.61,18
4,Taylor Lautner,387699683.62,8


Here we simply remove our conditional `IN` statement from the temporary table, and instead condition the results as `HAVING` an `n_credits` greater than 5.

As a final example, let's imagine we're looking for the best comedy film to watch. We have already written commands to return the average ratings of each movie, and the genres for each movie - so all we need to do is combine these into a single command:

In [26]:
# this will take a while to run
display_results(command = """
                          WITH comedies AS
                          (
                              SELECT movie_genres.movie_id 
                              FROM movie_genres
                              JOIN genres ON movie_genres.genre_id = genres.id
                              WHERE genres.name = "Comedy"
                          )
                          SELECT movies.title, 
                                 AVG(ratings.rating) AS average_rating,
                                 COUNT(ratings.rating) AS n_ratings
                          FROM comedies
                          JOIN movies ON comedies.movie_id = movies.id
                          JOIN ratings ON movies.id = ratings.movie_id
                          GROUP BY movies.id
                          HAVING n_ratings > 50
                          ORDER BY average_rating DESC
                          LIMIT 5
                          """)

Unnamed: 0,title,average_rating,n_ratings
0,Dr. Strangelove or: How I Learned to Stop Worr...,4.21,28280
1,A Dog's Will,4.19,154
2,Life Is Beautiful,4.17,25245
3,Monty Python and the Holy Grail,4.16,39058
4,The Thin Man,4.15,3628


It should hopefully be possible to parse the above by now. We're creating a list of all the comedies and then joining all of the `movies` and `ratings` data to this, and then grouping by each movie to get an average rating. We're limiting the results to those with more than 50 ratings to ensure we have a good range of opinion before we spend our valuable time and money on a movie.

We might not be in the mood for a classic, which several of the above seem to be, so we could further limit our results to movies released after 1 January 2000 like so:

In [28]:
# this will take a while to run
display_results(command = """
                          WITH comedies AS
                          (
                              SELECT movie_genres.movie_id 
                              FROM movie_genres
                              JOIN genres ON movie_genres.genre_id = genres.id
                              WHERE genres.name = "Comedy"
                          )
                          SELECT movies.title, 
                                 AVG(ratings.rating) AS average_rating,
                                 COUNT(ratings.rating) AS n_ratings
                          FROM comedies
                          JOIN movies ON comedies.movie_id = movies.id
                          JOIN ratings ON movies.id = ratings.movie_id
                          WHERE movies.release_date > "2000-01-01"
                          GROUP BY movies.id
                          HAVING n_ratings > 50
                          ORDER BY average_rating DESC
                          LIMIT 5
                          """)

Unnamed: 0,title,average_rating,n_ratings
0,A Dog's Will,4.19,154
1,Amélie,4.13,34430
2,The Intouchables,4.12,11253
3,Wild Tales,4.11,1522
4,Bill Burr: Why Do I Do This?,4.09,85
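The date filter above works because the release dates appear to be stored as text in `YYYY-MM-DD` form, where alphabetical order and chronological order coincide. A minimal sketch of that behaviour, using made-up titles and assuming text storage (the real schema may differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, release_date TEXT)")
# Hypothetical titles; dates stored as ISO-8601 text
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [
        ("Old Film", "1977-05-25"),
        ("Boundary Film", "2000-01-01"),
        ("Later Film", "2000-06-01"),
    ],
)

def titles(condition):
    """Return the titles matching a hard-coded condition, oldest first."""
    query = f"SELECT title FROM movies WHERE {condition} ORDER BY release_date"
    return [row[0] for row in conn.execute(query)]

after = titles("release_date > '2000-01-01'")         # strict: excludes 1 Jan 2000
on_or_after = titles("release_date >= '2000-01-01'")  # includes the boundary date
from_2001 = titles("release_date >= '2001-01-01'")    # excludes all of the year 2000
print(after, on_or_after, from_2001)
```

Note that the strict `>` comparison still admits films released later in 2000; to exclude that year entirely, compare against `'2001-01-01'` instead.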


And with that we'll call it a day! We now have the tools at our disposal to write SQL commands to request pretty much anything we could want from our database. 