# Intro

This notebook walks through some of the basics of retrieving data from a SQL relational database. We will be using the SQLite database we designed in our previous discussion on creating SQL databases, and likewise will be using the `sqlite3` and `pandas` Python modules to facilitate our queries. As with the previous notebook, the focus will be on the SQL commands; detailed explanations of the Python script fall outside the scope of this discussion, but efforts have been made to comment the code should you wish to understand the logic behind it.

As we already discussed the basic theory in the last notebook, we'll skip the preamble and dive straight into querying our database, picking up the theory as we go:


In [2]:
## setting up our notebook to be able to query the database

import pandas as pd
import sqlite3
# display() lives in IPython.display; importing it from IPython.core.display is deprecated
from IPython.display import display

# objects used to configure and communicate with database
conn = sqlite3.connect("movies.db")
c = conn.cursor()
c.execute("PRAGMA foreign_keys = ON")

# function to display results of database queries
def display_results(command, display_all=False):
    # returns results of sql command as list of tuples
    results = c.execute(command).fetchall()
    # collect names of columns requested in query
    column_names = [feature_tuple[0] for feature_tuple in c.description]
    # build a pandas dataframe of results
    results_df = pd.DataFrame(results, columns=column_names)
    if display_all:
        # display dataframe as cell output with max_colwidth set to 1000 chars
        with pd.option_context("display.max_colwidth", 1000):
            display(results_df)
    # otherwise display dataframe as cell output with default display settings
    else:
        display(results_df)

## SELECT

Every query asking for data from a SQL database has at its core a `SELECT` statement. A basic `SELECT` statement takes the form `SELECT [column_names] FROM [table_name]`. Let's take a look at an example:

In [7]:
display_results(command = """
                          SELECT title, runtime, release_date FROM movies 
                          LIMIT 5
                          """)

Unnamed: 0,title,runtime,release_date
0,Ariel,69,1988-10-21
1,Shadows in Paradise,76,1986-10-16
2,Four Rooms,98,1995-12-09
3,Judgment Night,110,1993-10-15
4,Star Wars,121,1977-05-25


As with pretty much all basic SQL commands, the above `SELECT` statement should be fairly easy to parse. We're `SELECT`ing the `title`, `runtime`, and `release_date` columns `FROM` the `movies` table, and `LIMIT`ing the number of results to `5` (without this `LIMIT` clause, the query would return these columns for every entry in the table). Pretty straightforward, and there aren't really any surprises to be found with this query type - you can follow this logic to take information from any of the tables in our database.

If you wish, you can use a `*` wildcard character to signify "all columns" like so:  

In [8]:
display_results(command = """
                          SELECT * FROM credits 
                          LIMIT 5
                          """)

Unnamed: 0,movie_id,character,actor_id,appearance_order
0,862,Woody (voice),31,0
1,862,Buzz Lightyear (voice),12898,1
2,862,Mr. Potato Head (voice),7167,2
3,862,Slinky Dog (voice),12899,3
4,862,Rex (voice),12900,4


Often, we can leave our queries at that - take all the information from a table and then work with it from there. Easy! But, as you might guess, that isn't a terribly efficient solution. If we're querying a table that contains a huge amount of data, just asking for all the data therein is very computationally expensive and memory intensive (if the data fits in memory at all!). Indeed, many modern cloud database solutions charge users based on the size of queries, and so we'll rack up some pretty huge bills if this is the only tool we have to get at our data!

Thankfully, there are a lot of tools SQLite provides us to refine our search. The `LIMIT` clause is a crude example of this - we're reducing our query from all the data in a table to a specified number. The following are some examples of other clauses we can use to refine our basic search:

### ORDER BY

If we combine our `LIMIT` clause with an `ORDER BY` clause, we have a tool to query the top "n" entries for a particular feature of our data. For example, we can ask the database for the top 5 longest movies like so:

In [13]:
display_results(command = """
                          SELECT title, runtime, release_date FROM movies 
                          ORDER BY runtime DESC
                          LIMIT 5
                          """)

Unnamed: 0,title,runtime,release_date
0,Centennial,1256,1978-10-01
1,Baseball,1140,1994-09-18
2,Jazz,1140,2001-01-09
3,Berlin Alexanderplatz,931,1980-08-28
4,Heimat: A Chronicle of Germany,925,1984-09-16


Setting aside for a moment that anyone would be willing to watch 19 continuous hours of "Baseball", we have asked the database to return the same info as before, but to `ORDER` the results `BY runtime` in `DESC`ending order before selecting the first 5 results. The `ORDER BY` clause sorts in ascending order by default, but you can stipulate `ASC` if you wish to be explicit.
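If you'd like to see the default sort direction for yourself without touching `movies.db`, here is a minimal sketch using a throwaway in-memory database. The table contents are made-up toy values, not rows from our real database:

```python
import sqlite3

# Hypothetical toy data (an assumption for illustration, not taken from movies.db)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, runtime INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("Short", 5), ("Medium", 95), ("Long", 180), ("Epic", 240)],
)

# No direction given: the sort is ascending, so LIMIT 2 keeps the two shortest
default_order = conn.execute(
    "SELECT title FROM movies ORDER BY runtime LIMIT 2"
).fetchall()

# DESC reverses the sort, so LIMIT 2 now keeps the two longest
longest = conn.execute(
    "SELECT title FROM movies ORDER BY runtime DESC LIMIT 2"
).fetchall()

print(default_order)  # [('Short',), ('Medium',)]
print(longest)        # [('Epic',), ('Long',)]
conn.close()
```

Note that the sort happens before the `LIMIT` is applied, which is what makes the "top n" pattern work.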

As another example, we can ask for the oldest 5 movies like so:

In [18]:
display_results(command = """
                          SELECT title, runtime, release_date FROM movies 
                          ORDER BY release_date ASC
                          LIMIT 5
                          """)

Unnamed: 0,title,runtime,release_date
0,Passage of Venus,1,1874-12-09
1,Sallie Gardner at a Gallop,1,1878-06-14
2,Buffalo Running,1,1883-11-19
3,Man Walking Around a Corner,1,1887-08-18
4,Accordion Player,1,1888-01-01


Those are some old movies!

### WHERE

The `WHERE` clause allows us to ask the database only for data that meets a certain condition (or set of conditions). Take the following example:

In [21]:
display_results(command = """
                          SELECT * FROM movies 
                          WHERE title = 'Toy Story'
                          """,
               display_all=True)

Unnamed: 0,id,title,original_title,runtime,budget,original_language,overview,release_date,revenue,tagline
0,862,Toy Story,Toy Story,81,30000000,en,"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",1995-10-30,373554033,


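We're asking the database for any entries from the `movies` table `WHERE` the `title` column equals "Toy Story". Note that the `=` comparison on text is case-sensitive by default in SQLite, so "toy story" would not match.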

The `WHERE` clause is very powerful: it offers the classic comparison operators `=`, `!=`, `<`, `>`, `<=`, `>=` and a host of other operators such as `AND`, `OR`, `NOT`, `BETWEEN`, `IN` and `LIKE` (these vary depending on the particular SQL engine). We could spend a very long time discussing the features of different `WHERE` clauses (anyone interested should read [this excellent tutorial](https://www.sqlitetutorial.net/sqlite-where/)), but it's much better to explore these features by making your own queries. We'll take a look at one more complex example and then move on:

In [26]:
display_results(command = """
                          SELECT title, runtime, release_date FROM movies 
                          WHERE title LIKE '%harry potter%'
                              AND release_date > '2005-01-01'
                          """)

Unnamed: 0,title,runtime,release_date
0,Harry Potter and the Goblet of Fire,157,2005-11-05
1,Harry Potter and the Order of the Phoenix,138,2007-06-28
2,Harry Potter and the Half-Blood Prince,153,2009-07-07
3,Harry Potter and the Deathly Hallows: Part 1,146,2010-10-17
4,Harry Potter and the Deathly Hallows: Part 2,130,2011-07-07


The above query contains two `WHERE` conditions joined with an `AND` logical operator - each condition must be met for a result to be returned: 

* The first condition is that the `title` is `LIKE` the pattern `%harry potter%` - the `LIKE` operator returns any entries with a `title` that matches the pattern. The `%` characters in the pattern are wildcards that match any sequence of zero or more characters, and in SQLite `LIKE` is case-insensitive for ASCII characters by default. So, we are effectively asking for any titles that contain the string "harry potter", regardless of capitalisation.
* The second condition is that the `release_date` is greater than (`>`) `2005-01-01`. SQLite has no dedicated date type; the dates here are text in `YYYY-MM-DD` format, and because strings in that format sort in chronological order, a plain text comparison does the right thing. As such, this will return entries that have a release date after 1 January 2005.

The results are all the Harry Potter films released from 2005 onwards. Exciting stuff!
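To make the behaviour of these conditions a little more concrete, here is a small sketch against a throwaway in-memory table. The titles and dates are invented for illustration, and the last query uses `BETWEEN`, one of the other operators mentioned earlier:

```python
import sqlite3

# Hypothetical toy data (an assumption for illustration, not taken from movies.db)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, release_date TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [
        ("Harry Potter and the Test Case", "2004-05-31"),
        ("HARRY POTTER: The Sequel", "2005-11-05"),
        ("An Unrelated Film", "2010-01-01"),
    ],
)

# LIKE ignores ASCII case by default, so both capitalisations match
like_rows = conn.execute(
    "SELECT title FROM movies WHERE title LIKE '%harry potter%' ORDER BY release_date"
).fetchall()

# The dates are compared as text; YYYY-MM-DD strings sort chronologically
date_rows = conn.execute(
    "SELECT title FROM movies WHERE release_date > '2005-01-01' ORDER BY release_date"
).fetchall()

# BETWEEN is inclusive at both ends: everything released during 2005
between_rows = conn.execute(
    "SELECT title FROM movies "
    "WHERE release_date BETWEEN '2005-01-01' AND '2005-12-31'"
).fetchall()

# Because this is a text comparison, dates that are not zero-padded misbehave:
# '2005-9-01' sorts after '2005-10-01' even though September precedes October
unpadded = conn.execute("SELECT '2005-9-01' > '2005-10-01'").fetchone()[0]

print(like_rows)     # [('Harry Potter and the Test Case',), ('HARRY POTTER: The Sequel',)]
print(date_rows)     # [('HARRY POTTER: The Sequel',), ('An Unrelated Film',)]
print(between_rows)  # [('HARRY POTTER: The Sequel',)]
print(unpadded)      # 1
conn.close()
```

The last result is the reason it pays to store dates consistently in zero-padded `YYYY-MM-DD` form.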

You'll note that the `SELECT` statements that we've devised so far return entries exactly as we entered them. SQL also offers us the tools to manipulate and perform calculations on database entries at the point we make a query. This allows us to ask for even more specific information from our database, without being limited to the format in which the data was entered:

## GROUP BY

The `GROUP BY` clause allows us to ask for entries grouped by one of the columns or features of our data. If we take the `ratings` table as an example, we can use the `movie_id` feature to group together all the ratings for the same movie. You might then ask yourself, what happens with the other features? We need some way of combining the information stored in the other columns. `GROUP BY` queries are almost always combined with **aggregate functions**, which tell the database what to do with the other features you're querying. Let's take a look at an example of the query described above: 

In [34]:
# this query will take longer than the others 
# as the ratings table has 26M entries
display_results(command = """
                          SELECT movie_id, AVG(rating) FROM ratings 
                          GROUP BY movie_id
                          LIMIT 5
                          """)

Unnamed: 0,movie_id,AVG(rating)
0,2,3.673664
1,3,3.770115
2,5,3.409031
3,6,2.924862
4,11,4.132299


Here we have asked the database for the `movie_id` column and an `AVG` (average) of the `rating` column from the `ratings` table `GROUP`ed `BY` the `movie_id` column. This gathers together every rating for the same movie, and averages the individual ratings for each movie. Note that, unlike the other queries we've run, we didn't provide the database with this information when it was created - the database engine calculated the information for us based on the entries in the database.
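One way to think about what the database is doing here is to write the same grouping by hand. The following is a rough sketch on a handful of made-up ratings (the values are assumptions for illustration only), comparing the SQL result with a plain Python equivalent:

```python
import sqlite3
from collections import defaultdict

# Hypothetical (movie_id, rating) pairs, not taken from the real ratings table
ratings = [(10, 4.0), (20, 2.0), (10, 5.0), (20, 3.0), (10, 3.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (movie_id INTEGER, rating REAL)")
conn.executemany("INSERT INTO ratings VALUES (?, ?)", ratings)

# Let the database group the rows and average each group
sql_result = dict(
    conn.execute(
        "SELECT movie_id, AVG(rating) FROM ratings GROUP BY movie_id"
    ).fetchall()
)

# The same idea by hand: collect the ratings per movie, then average each list
groups = defaultdict(list)
for movie_id, rating in ratings:
    groups[movie_id].append(rating)
manual_result = {m: sum(values) / len(values) for m, values in groups.items()}

print(sql_result)     # {10: 4.0, 20: 2.5}
print(manual_result)  # {10: 4.0, 20: 2.5}
conn.close()
```

The manual version needs every row loaded into Python first, whereas the SQL version only hands back one row per group, which matters a great deal for a table with millions of entries.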

If we use different aggregate functions, unsurprisingly we get different information:

In [36]:
# this query will take longer than the others 
# as the ratings table has 26M entries
display_results(command = """
                          SELECT movie_id, AVG(rating), SUM(rating), COUNT(rating) FROM ratings 
                          GROUP BY movie_id
                          LIMIT 5
                          """)

Unnamed: 0,movie_id,AVG(rating),SUM(rating),COUNT(rating)
0,2,3.673664,962.5,262
1,3,3.770115,328.0,87
2,5,3.409031,20761.0,6090
3,6,2.924862,3717.5,1271
4,11,4.132299,318373.0,77045


Here we have not only the average, but also a `SUM` of the individual ratings and a `COUNT` of all the ratings in each group (this count would work for any of the features in the `ratings` table, not just `rating`, since `COUNT` tallies the non-NULL values in whichever column it is given). If you want to double-check SQLite is doing its work correctly, you can divide the `SUM(rating)` column by the `COUNT(rating)` column to calculate your own average.
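The double-check suggested above can be sketched in a few lines. This uses a tiny made-up `ratings` table (an assumption for illustration) and deliberately includes one missing rating to show how the aggregates treat NULL values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating REAL)")
# Hypothetical toy rows; the last one has no rating (NULL)
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(1, 10, 4.0), (2, 10, 3.0), (3, 10, 5.0), (1, 20, 2.0), (2, 20, 3.0), (3, 20, None)],
)

rows = conn.execute("""
    SELECT movie_id, AVG(rating), SUM(rating), COUNT(rating), COUNT(*)
    FROM ratings
    GROUP BY movie_id
    ORDER BY movie_id
""").fetchall()

for movie_id, average, total, n_ratings, n_rows in rows:
    # AVG should equal SUM divided by the number of non-NULL ratings
    assert abs(average - total / n_ratings) < 1e-9
    print(movie_id, average, total, n_ratings, n_rows)
# 10 4.0 12.0 3 3
# 20 2.5 5.0 2 3
conn.close()
```

For movie 20, `COUNT(rating)` is 2 while `COUNT(*)` is 3: the NULL rating is skipped by `AVG`, `SUM` and `COUNT(rating)`, but the row itself is still counted by `COUNT(*)`.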

Note that, if we use the aggregate functions without a `GROUP BY` clause, the database will aggregate over the whole table (rather than the groups):

In [37]:
# this query will take longer than the others 
# as the ratings table has 26M entries
display_results(command = """
                          SELECT AVG(rating), SUM(rating), COUNT(rating) FROM ratings 
                          """)

Unnamed: 0,AVG(rating),SUM(rating),COUNT(rating)
0,3.527854,91642196.0,25976754


### HAVING

There may be instances where we wish to narrow our grouped search results, like we did above with the `WHERE` clause. The `WHERE` clause filters individual rows before they are grouped, so it can't be used with aggregated data. Instead, conditions can be placed on aggregated data using the `HAVING` clause, which is applied after grouping - other than this distinction, the two function in the same way. Take the following for example:

In [40]:
# this query will take longer than the others 
# as the ratings table has 26M entries
display_results(command = """
                          SELECT movie_id, AVG(rating) 
                          FROM ratings 
                          GROUP BY movie_id
                          HAVING AVG(rating) > 4
                          LIMIT 5
                          """)

Unnamed: 0,movie_id,AVG(rating)
0,11,4.132299
1,13,4.052926
2,14,4.130704
3,15,4.094047
4,28,4.106593


This shows the first 5 movies that have an average rating greater than 4 out of 5.
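The difference between filtering rows and filtering groups is easiest to see side by side. Below is a minimal sketch on a made-up `ratings` table (the values are assumptions for illustration), running a similar condition once in `HAVING` and once in `WHERE`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (movie_id INTEGER, rating REAL)")
# Hypothetical toy rows, not taken from the real ratings table
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [(10, 4.0), (10, 5.0), (10, 1.0), (20, 4.5), (20, 4.5), (30, 2.0)],
)

# HAVING filters the groups: keep movies whose average rating exceeds 4
having_rows = conn.execute("""
    SELECT movie_id, AVG(rating) FROM ratings
    GROUP BY movie_id
    HAVING AVG(rating) > 4
    ORDER BY movie_id
""").fetchall()

# WHERE filters the rows first: low ratings are discarded before averaging
where_rows = conn.execute("""
    SELECT movie_id, AVG(rating) FROM ratings
    WHERE rating > 4
    GROUP BY movie_id
    ORDER BY movie_id
""").fetchall()

# Putting an aggregate function inside WHERE is rejected by SQLite
try:
    conn.execute(
        "SELECT movie_id FROM ratings WHERE AVG(rating) > 4 GROUP BY movie_id"
    )
    aggregate_in_where_failed = False
except sqlite3.Error:
    aggregate_in_where_failed = True

print(having_rows)                # [(20, 4.5)]
print(where_rows)                 # [(10, 5.0), (20, 4.5)]
print(aggregate_in_where_failed)  # True
conn.close()
```

Movie 10 has a true average of about 3.33, so `HAVING` excludes it, whereas the `WHERE` version reports an average of 5.0 for it because only its single rating above 4 survives the row filter.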

### AS

When looking at the output from these grouped queries, you may note the somewhat "unattractive" column names given to our aggregated features. Most SQL engines (including SQLite of course) support "aliasing" - basically, renaming a feature to something a little more readable. This is done using the `AS` keyword with the format `[feature] AS [alias]`. We can then refer to our new alias in the query itself if required (SQLite accepts aliases in the `HAVING` clause, though not every SQL engine does). If we take the above example, we can use aliasing to make the SQL command and the output a little more readable:

In [42]:
# this query will take longer than the others 
# as the ratings table has 26M entries
display_results(command = """
                          SELECT movie_id, AVG(rating) AS average_rating 
                          FROM ratings 
                          GROUP BY movie_id
                          HAVING average_rating > 4
                          LIMIT 5
                          """)

Unnamed: 0,movie_id,average_rating
0,11,4.132299
1,13,4.052926
2,14,4.130704
3,15,4.094047
4,28,4.106593


Here we are taking the `AVG(rating)` and renaming/aliasing it `AS average_rating`. We then later refer to the alias when conditioning the results as those `HAVING average_rating > 4`. Great!
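Aliases can also be reused in `ORDER BY`, and they become the column names reported by the cursor (which is where our display helper gets its DataFrame headers). A small sketch, again on invented toy data with a hypothetical `n_ratings` alias:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (movie_id INTEGER, rating REAL)")
# Hypothetical toy rows, not taken from the real ratings table
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [(10, 4.0), (10, 5.0), (20, 3.0), (20, 3.0), (20, 3.0), (30, 5.0)],
)

cursor = conn.execute("""
    SELECT movie_id,
           AVG(rating) AS average_rating,
           COUNT(*)    AS n_ratings
    FROM ratings
    GROUP BY movie_id
    HAVING n_ratings >= 2
    ORDER BY average_rating DESC
""")

# The aliases replace the raw expressions as the reported column names
column_names = [column[0] for column in cursor.description]
rows = cursor.fetchall()

print(column_names)  # ['movie_id', 'average_rating', 'n_ratings']
print(rows)          # [(10, 4.5, 2), (20, 3.0, 3)]
conn.close()
```

Movie 30 is dropped because it has only one rating, and the remaining groups are sorted by the aliased average.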

So far, we've learned a number of ways to ask for specific information from one table at a time. But what if the information we need is contained in multiple tables? Take for example the many-to-many relationship between movies and genres - each genre relates to multiple movies and many movies fit in multiple genres. We have `movies`, `movie_genres` and `genres` tables, but so far we have only queried tables one at a time, which isn't much use here. This leads us nicely to...

## JOIN

If we want to combine information from our database contained in multiple tables, we use the `JOIN` clause. A join takes a join condition based on a feature common to the tables that can be used to match entries to one another - for example we can join all the `movies` data with the `ratings` data using the `id` feature in the `movies` table and the `movie_id` feature in the `ratings` table. We must also select a specific **type** of join - in SQLite these break down as follows:

* `INNER JOIN`: this returns only the rows that have a match in both tables based on the join condition, along with the columns requested from each table. In our movies / ratings example, this would be any movies that have at least one rating.
* `LEFT JOIN`: this returns all the rows from the "LEFT" table (the primary table you SELECT from at the start of your query) whether there are matching entries in the "RIGHT" table or not. Each row is matched with corresponding rows from the "RIGHT" table if available; otherwise the columns from the "RIGHT" table are recorded as NULL. In our movies / ratings example, this would return all the movies, with ratings info if available and NULL values for the ratings otherwise.
* `CROSS JOIN`: this join doesn't require a join condition, and combines the "n" entries of the LEFT table with the "m" entries of the RIGHT table to form an "n x m" table. There isn't a good use-case for this join type with the movies database. More generally you might, for example, use a `CROSS JOIN` if you had a table of "products" and another table of "stores" and wanted a table with each unique combination of "products" and "stores" to record individual sales. 
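The three join types above can be sketched on a pair of tiny made-up tables. Everything below is an assumption for illustration: the rows are invented, and the column names of `genres` and `movie_genres` (`id`, `name`, `movie_id`, `genre_id`) are guesses rather than the actual schema of our database. The final query shows how two joins chained together can walk the many-to-many relationship between movies and genres:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy schema and rows, not the real movies.db
conn.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE ratings (movie_id INTEGER, rating REAL);
    CREATE TABLE genres (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE movie_genres (movie_id INTEGER, genre_id INTEGER);

    INSERT INTO movies VALUES (1, 'Rated Movie'), (2, 'Unrated Movie');
    INSERT INTO ratings VALUES (1, 4.0), (1, 5.0);
    INSERT INTO genres VALUES (1, 'Comedy'), (2, 'Drama');
    INSERT INTO movie_genres VALUES (1, 1), (1, 2), (2, 2);
""")

# INNER JOIN: only movies with at least one matching rating appear
inner_rows = conn.execute("""
    SELECT m.title, r.rating
    FROM movies AS m
    INNER JOIN ratings AS r ON m.id = r.movie_id
    ORDER BY m.id, r.rating
""").fetchall()

# LEFT JOIN: every movie appears; unmatched ratings come back as NULL (None)
left_rows = conn.execute("""
    SELECT m.title, r.rating
    FROM movies AS m
    LEFT JOIN ratings AS r ON m.id = r.movie_id
    ORDER BY m.id, r.rating
""").fetchall()

# CROSS JOIN: no join condition, every movie paired with every rating (2 x 2)
cross_count = conn.execute(
    "SELECT COUNT(*) FROM movies CROSS JOIN ratings"
).fetchone()[0]

# Many-to-many: hop from movies to genres through the linking table
genre_rows = conn.execute("""
    SELECT m.title, g.name
    FROM movies AS m
    INNER JOIN movie_genres AS mg ON m.id = mg.movie_id
    INNER JOIN genres AS g ON g.id = mg.genre_id
    ORDER BY m.title, g.name
""").fetchall()

print(inner_rows)   # [('Rated Movie', 4.0), ('Rated Movie', 5.0)]
print(left_rows)    # [('Rated Movie', 4.0), ('Rated Movie', 5.0), ('Unrated Movie', None)]
print(cross_count)  # 4
print(genre_rows)   # [('Rated Movie', 'Comedy'), ('Rated Movie', 'Drama'), ('Unrated Movie', 'Drama')]
conn.close()
```

The only difference between the first two results is the extra `('Unrated Movie', None)` row, which is exactly the "keep everything from the LEFT table" behaviour described above.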
