# Intro

This notebook will walk through some of the basics of retrieving data from a SQL relational database. We will be using the SQLite database we designed in our previous discussion on creating SQL databases, and likewise will be using the `sqlite3` and `pandas` Python modules to facilitate our queries. As with the previous notebook, the focus will be on the SQL commands; detailed explanations of the Python script fall outside the scope of this discussion, but efforts have been made to comment the code should you wish to understand the logic behind it.

As we already discussed the basic theory in the last notebook, we'll skip the preamble and dive straight into querying our database, picking up the theory as we go:


In [2]:
## setting up our notebook to be able to query the database

import pandas as pd
import sqlite3
# display lives in IPython.display; importing it from IPython.core.display is deprecated
from IPython.display import display

# objects used to configure and communicate with database
conn = sqlite3.connect("movies.db")
c = conn.cursor()
c.execute("PRAGMA foreign_keys = ON")

# function to display results of database queries
def display_results(command, display_all=False):
    # returns results of sql command as list of tuples
    results = c.execute(command).fetchall()
    # collect names of columns requested in query
    column_names = [feature_tuple[0] for feature_tuple in c.description]
    # build a pandas dataframe of results
    results_df = pd.DataFrame(results, columns=column_names)
    if display_all:
        # display dataframe as cell output with max_colwidth set to 1000 chars
        with pd.option_context("display.max_colwidth", 1000):
            display(results_df)
    # otherwise display dataframe as cell output with default display settings
    else:
        display(results_df)

## SELECT

Every query asking for data from a SQL database has at its core a `SELECT` statement. A basic `SELECT` statement takes the form `SELECT [column_names] FROM [table_name]`. Let's take a look at an example:

In [7]:
display_results(command = """
                          SELECT title, runtime, release_date FROM movies 
                          LIMIT 5
                          """)

Unnamed: 0,title,runtime,release_date
0,Ariel,69,1988-10-21
1,Shadows in Paradise,76,1986-10-16
2,Four Rooms,98,1995-12-09
3,Judgment Night,110,1993-10-15
4,Star Wars,121,1977-05-25


As with pretty much all basic SQL commands, the above `SELECT` statement should be fairly easy to parse. We're `SELECT`ing the `title`, `runtime`, and `release_date` columns `FROM` the `movies` table, and `LIMIT`ing the number of results to `5` (without this `LIMIT` clause, the query would return these columns for every row in the table). Pretty straightforward, and there aren't really any surprises to be found with this query type: you can follow this logic to take information from any of the tables in our database.

If you wish, you can use a `*` wildcard character to signify "all columns" like so:  

In [8]:
display_results(command = """
                          SELECT * FROM credits 
                          LIMIT 5
                          """)

Unnamed: 0,movie_id,character,actor_id,appearance_order
0,862,Woody (voice),31,0
1,862,Buzz Lightyear (voice),12898,1
2,862,Mr. Potato Head (voice),7167,2
3,862,Slinky Dog (voice),12899,3
4,862,Rex (voice),12900,4


Often, we can leave our queries at that - take all the information from a table and then work with it from there. Easy! But, as you might guess, that isn't a terribly efficient solution. If we're querying a table that contains a huge amount of data, just asking for all the data therein is very computationally expensive and memory intensive (if the data fits in memory at all!). Indeed, many modern cloud database solutions charge users based on the size of queries, and so we'll rack up some pretty huge bills if this is the only tool we have to get at our data!
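When reading a whole table really is unavoidable, one way to keep memory use in check is to pull rows in batches rather than all at once. The following is a minimal sketch, assuming a throwaway in-memory table called `numbers` (it is not part of `movies.db`), that uses the `fetchmany` method of `sqlite3` cursors instead of `fetchall`:

```python
import sqlite3

# Hypothetical in-memory table, used only for illustration.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE numbers (n INTEGER)")
demo_conn.executemany("INSERT INTO numbers VALUES (?)", [(i,) for i in range(10)])

cursor = demo_conn.execute("SELECT n FROM numbers")
batch_sizes = []
while True:
    # fetchmany returns at most 4 rows here, and an empty list once the results run out
    batch = cursor.fetchmany(4)
    if not batch:
        break
    batch_sizes.append(len(batch))

print(batch_sizes)  # [4, 4, 2]
demo_conn.close()
```

Batching only helps with memory on the Python side; the database still has to scan every row, which is why narrowing the query itself is usually the better option.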

Thankfully, SQLite provides us with a lot of tools to refine our search. The `LIMIT` clause is a crude example of this: we're reducing our query from all the data in a table to a specified number of rows. The following are some examples of other clauses we can use to refine our basic search:

## ORDER BY

If we combine our `LIMIT` clause with an `ORDER BY` clause, we have a tool to query the top "n" rows for a particular feature of our data. For example, we can ask the database for the top 5 longest movies like so:

In [13]:
display_results(command = """
                          SELECT title, runtime, release_date FROM movies 
                          ORDER BY runtime DESC
                          LIMIT 5
                          """)

Unnamed: 0,title,runtime,release_date
0,Centennial,1256,1978-10-01
1,Baseball,1140,1994-09-18
2,Jazz,1140,2001-01-09
3,Berlin Alexanderplatz,931,1980-08-28
4,Heimat: A Chronicle of Germany,925,1984-09-16


Setting aside whether anyone would be willing to watch 19 continuous hours of "Baseball", we have asked the database to return the same columns as before, but to `ORDER` the results `BY runtime` in `DESC`ending order before keeping the first 5 rows. The `ORDER BY` clause sorts in ascending order by default, but you can specify `ASC` if you wish to be explicit.
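"Baseball" and "Jazz" share the same runtime in the output above, which raises the question of how ties are ordered. `ORDER BY` accepts several comma-separated sort keys, each with its own direction. Here is a small sketch, assuming an in-memory stand-in for the `movies` table that holds just three of the rows shown above:

```python
import sqlite3

# In-memory stand-in for the movies table (illustration only).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE movies (title TEXT, runtime INTEGER)")
demo_conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("Jazz", 1140), ("Centennial", 1256), ("Baseball", 1140)],
)

# No direction given: both keys sort in ascending order.
ascending = [row[0] for row in demo_conn.execute(
    "SELECT title FROM movies ORDER BY runtime, title")]

# Longest first; ties on runtime are broken alphabetically by title.
descending = [row[0] for row in demo_conn.execute(
    "SELECT title FROM movies ORDER BY runtime DESC, title ASC")]

print(ascending)   # ['Baseball', 'Jazz', 'Centennial']
print(descending)  # ['Centennial', 'Baseball', 'Jazz']
demo_conn.close()
```

Without a second sort key, the relative order of tied rows is not something to rely on.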

As another example, we can ask for the oldest 5 movies like so:

In [18]:
display_results(command = """
                          SELECT title, runtime, release_date FROM movies 
                          ORDER BY release_date ASC
                          LIMIT 5
                          """)

Unnamed: 0,title,runtime,release_date
0,Passage of Venus,1,1874-12-09
1,Sallie Gardner at a Gallop,1,1878-06-14
2,Buffalo Running,1,1883-11-19
3,Man Walking Around a Corner,1,1887-08-18
4,Accordion Player,1,1888-01-01


Those are some old movies!

## WHERE

The `WHERE` clause allows us to ask the database only for data that meets a certain condition (or set of conditions). Take the following example:

In [21]:
display_results(command = """
                          SELECT * FROM movies 
                          WHERE title = 'Toy Story'
                          """,
               display_all=True)

Unnamed: 0,id,title,original_title,runtime,budget,original_language,overview,release_date,revenue,tagline
0,862,Toy Story,Toy Story,81,30000000,en,"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",1995-10-30,373554033,


We're asking the database for any rows from the `movies` table `WHERE` the `title` column equals "Toy Story".
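Capitalisation matters here: in SQLite, `=` on text is case-sensitive by default, whereas `LIKE` ignores case for ASCII characters. The following is a minimal sketch, assuming a throwaway in-memory table with only a `title` column (not the real `movies.db`); the `count_matches` helper is hypothetical and exists only for this demonstration:

```python
import sqlite3

# Hypothetical in-memory table for illustration; not the real movies.db.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE movies (title TEXT)")
demo_conn.execute("INSERT INTO movies VALUES ('Toy Story')")

def count_matches(condition):
    # condition is a trusted, hard-coded SQL fragment used only in this demo
    query = f"SELECT COUNT(*) FROM movies WHERE {condition}"
    return demo_conn.execute(query).fetchone()[0]

# Default text comparison is case-sensitive, so the lower-case string does not match.
exact_lower = count_matches("title = 'toy story'")
# The built-in NOCASE collation makes the comparison case-insensitive.
nocase_lower = count_matches("title = 'toy story' COLLATE NOCASE")
# LIKE is case-insensitive for ASCII characters by default.
like_lower = count_matches("title LIKE 'toy story'")

print(exact_lower, nocase_lower, like_lower)  # 0 1 1
```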

The `WHERE` clause is very powerful. It supports the classic comparison operators `=`, `!=`, `<`, `>`, `<=`, `>=` and a host of logical operators such as `AND`, `OR`, `BETWEEN`, `IN`, `LIKE`, and `NOT` (the exact set varies with the SQL engine). We could spend a very long time discussing everything a `WHERE` clause can do (anyone interested should read [this excellent tutorial](https://www.sqlitetutorial.net/sqlite-where/)), but it's much better to explore these features by making your own queries. We'll take a look at one more complex example and then move on:

In [26]:
display_results(command = """
                          SELECT title, runtime, release_date FROM movies 
                          WHERE title LIKE '%harry potter%'
                              AND release_date > '2005-01-01'
                          """)

Unnamed: 0,title,runtime,release_date
0,Harry Potter and the Goblet of Fire,157,2005-11-05
1,Harry Potter and the Order of the Phoenix,138,2007-06-28
2,Harry Potter and the Half-Blood Prince,153,2009-07-07
3,Harry Potter and the Deathly Hallows: Part 1,146,2010-10-17
4,Harry Potter and the Deathly Hallows: Part 2,130,2011-07-07
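
The date filter in the query above works because the `release_date` values appear to be stored as ISO 8601 text (`YYYY-MM-DD`), which sorts the same way alphabetically as it does chronologically. The sketch below tries a few of the other operators as well. It assumes a small in-memory copy of five rows shown earlier in this notebook, and the `titles` helper is hypothetical:

```python
import sqlite3

# In-memory copy of five rows shown earlier (illustration only).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE movies (title TEXT, runtime INTEGER, release_date TEXT)")
demo_conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Ariel", 69, "1988-10-21"),
        ("Shadows in Paradise", 76, "1986-10-16"),
        ("Four Rooms", 98, "1995-12-09"),
        ("Judgment Night", 110, "1993-10-15"),
        ("Star Wars", 121, "1977-05-25"),
    ],
)

def titles(query):
    # return just the first column of each result row
    return [row[0] for row in demo_conn.execute(query)]

# BETWEEN is inclusive at both ends, so the 110-minute movie is kept.
mid_length = titles(
    "SELECT title FROM movies WHERE runtime BETWEEN 70 AND 110 ORDER BY runtime")

# IN tests membership in a list of values.
picked = titles(
    "SELECT title FROM movies WHERE title IN ('Ariel', 'Star Wars') ORDER BY title")

# ISO dates compare correctly as plain text; NOT negates a condition.
nineties_short = titles(
    "SELECT title FROM movies "
    "WHERE release_date > '1990-01-01' AND NOT runtime > 100 ORDER BY title")

print(mid_length)      # ['Shadows in Paradise', 'Four Rooms', 'Judgment Night']
print(picked)          # ['Ariel', 'Star Wars']
print(nineties_short)  # ['Four Rooms']
```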


Plan:

1. Intro
2. SELECT
   * FROM
   * WHERE
3. GROUP BY
   * introduce AS and ORDER BY
4. Aggregate functions: COUNT, MIN, MAX
5. JOIN
In [64]:
# display_results has no n_results parameter, so the row cap goes in the SQL itself
display_results(command = """
                          SELECT * FROM credits
                          WHERE movie_id = 16
                          LIMIT 10
                          """,
                display_all=True)

Unnamed: 0,movie_id,character,actor_id,appearance_order
0,16,Selma Jezkova,47,0
1,16,Kathy,50,1
2,16,Bill Houston,52,2
3,16,Jeff,53,3
4,16,Doctor,1640,19
5,16,Norman,1642,7
6,16,Dr. Porkorny,1646,11
7,16,Linda Houston,2617,5
8,16,Woman on Night Shift,4458,20
9,16,Morty,6121,12


In [23]:
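# unpack the column name from the first description tuple, discarding the other fields
first_column_name, *_ = c.description[0]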

In [37]:
c.description

(('id', None, None, None, None, None, None),
 ('title', None, None, None, None, None, None),
 ('original_title', None, None, None, None, None, None),
 ('runtime', None, None, None, None, None, None),
 ('budget', None, None, None, None, None, None),
 ('original_language', None, None, None, None, None, None),
 ('overview', None, None, None, None, None, None),
 ('release_date', None, None, None, None, None, None),
 ('revenue', None, None, None, None, None, None),
 ('tagline', None, None, None, None, None, None))
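
As the output above shows, each entry in `c.description` is a 7-item tuple whose first item is the column name. In `sqlite3` the remaining six items are always `None`, which is why `display_results` only reads index `0`. Here is a minimal sketch, assuming an empty in-memory table and a `minutes` alias chosen purely for illustration:

```python
import sqlite3

# Empty in-memory table: description is populated even when no rows come back.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE movies (title TEXT, runtime INTEGER)")

# The alias after AS replaces the column name reported in description.
cursor = demo_conn.execute("SELECT title, runtime AS minutes FROM movies")

column_names = [column[0] for column in cursor.description]
tuple_lengths = {len(column) for column in cursor.description}

print(column_names)   # ['title', 'minutes']
print(tuple_lengths)  # {7}
demo_conn.close()
```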