# Querying the Movies Database

In this notebook, you will have the opportunity to practise converting questions into queries. We will give a series of questions which can be answered using the database, and give some hints as to how you might consider answering them.

For each of the tables we looked at in notebook *08.1 Movies dataset* (`movie`, `movie_actor`, `movie_director` and `movie_genre`), we will give a question which could be answered with the table, some hints about what is needed to answer it, and our own code to answer it. In each case, of course, you should attempt to answer the question yourself before looking at our solutions.

If you are not sure how to construct a query to answer one of the questions, do try to understand our description of how the query is constructed, rather than focussing excessively on SQL or pandas syntax. Understanding the *structure* of database-style queries is much more important than the syntax of the different implementations, and if you have a clear idea of how you think the query should be constructed, then you will find that your ability to write the particular SQL or pandas will improve over the next few weeks.

When formulating your queries, you may use either pandas dataframe methods or pandasql (as introduced in notebook *03.2*) to run SQL queries over the dataframes.

Remember that the dataframes have a suffix `_df`, so that the `movie_df` and `movieActor_df` dataframes (for example) contain the contents of the `movie` and `movie_actor` database tables respectively.


This notebook contains several exercises or activities, which are presented with a space for you to try your own solution. In each case, you can see our solution by clicking on the small triangle next to the text "**our solution**", but in all cases, you should attempt the questions yourself before looking at our proposed solutions.



In [1]:
# This cell imports the pandas and pandasql modules, and imports
# the database tables as dataframes.

import pandas as pd
from pandasql import sqldf

# To make it a bit easier to apply the sqldf function, we will create a 
# simple wrapper function to allow us to supply the query 'q' without the 
# surrounding syntax of the function call.
pysqldf = lambda q: sqldf(q, globals())

# Create the DataFrame 'movie' from the CSV data file 'movie.csv'.
movie_df = pd.read_csv('data/movie.csv')

# Create the DataFrame 'movie_actor' from the CSV data file 'movie_actor.csv'.
movieActor_df = pd.read_csv('data/movie_actor.csv')

# Create the DataFrame 'movie_country' from the CSV data file 'movie_country.csv'.
movieCountry_df = pd.read_csv('data/movie_country.csv')

# Create the DataFrame 'movie_director' from the CSV data file 'movie_director.csv'.
movieDirector_df = pd.read_csv('data/movie_director.csv')

# Create the DataFrame 'movie_genre' from the CSV data file 'movie_genre.csv'.
movieGenre_df = pd.read_csv('data/movie_genre.csv')

## Getting Started - Simple Questions into Queries

Here are some simple questions to get you started, along with some hints on how to turn them into queries; the answers provide examples of queries that can be used to answer the questions.

### Exercise 1: The `movie` table - *How many movies are there?*

This is a simple counting question that asks you to find the length of the table or the number of records in it.

In [2]:
# Remind ourselves what columns are available
list(movie_df.columns)

['movie_id',
 'title',
 'year',
 'rt_all_critics_rating',
 'rt_top_critics_rating',
 'rt_audience_rating',
 'ml_user_rating']

In [4]:
# Enter your query here
len(movie_df)

10681

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.


The Python built-in function <tt>len()</tt> returns the number of rows in the specified DataFrame (see the *01.3 Basic python data structures* Notebook):


In [None]:
# Find the number of rows in the movie_df dataframe

len(movie_df)

Alternatively, the SQL `COUNT()` function counts the rows returned by a query:

In [None]:
pysqldf('''
        SELECT COUNT(*) AS number_of_titles
        FROM movie_df
        ''')
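
The two counting approaches agree on row counts, but they can differ once missing values are involved. The following is a minimal sketch on a small made-up table (the `toy_df` dataframe and its values are hypothetical, not taken from the movies dataset). It uses the standard-library `sqlite3` module rather than pandasql so that it runs without the course data files; as far as we are aware, pandasql also runs its queries through SQLite by default.

```python
import sqlite3
import pandas as pd

# Hypothetical toy data, NOT part of the movies dataset.
toy_df = pd.DataFrame({
    "movie_id": [1, 2, 3, 4],
    "title": ["Alpha", "Beta", "Gamma", "Delta"],
    "rating": [7.5, None, 6.0, None],
})

# pandas: len() counts rows; Series.count() counts non-null values only.
n_rows = len(toy_df)
n_rated = int(toy_df["rating"].count())

# SQL: COUNT(*) counts rows; COUNT(rating) skips NULL ratings.
conn = sqlite3.connect(":memory:")
toy_df.to_sql("toy", conn, index=False)
sql_counts = pd.read_sql_query(
    "SELECT COUNT(*) AS n_rows, COUNT(rating) AS n_rated FROM toy", conn
)
conn.close()

print(n_rows, n_rated)
print(sql_counts)
```

In other words, `COUNT(*)` corresponds to `len()`, whereas `COUNT(column)` corresponds more closely to the pandas `.count()` method.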

__Related questions:__ *How many actors appear in the database? How many directors? How many countries?*

### Exercise 2: The `movie` table - *How many **unique** movie titles are there?*
This is a more refined/exact counting question that asks you to identify the *unique* or *distinct* elements in a column and then count them.

In [8]:
# Enter your query here
# len(movie_df['title'].unique())
movie_df['title'].nunique()

10410

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.


`movie_df['title']` returns a Series containing the values in the 'title' column. The `.unique()` method returns an array of unique values from the Series (see the *02.1 Pandas Dataframes* Notebook):

In [9]:
len(movie_df["title"].unique())

10410

In SQL, if we select the `DISTINCT` titles, we can then count them:

In [10]:
pysqldf('''
        SELECT COUNT (DISTINCT title) AS number_of_distinct_titles
        FROM movie_df
        ''')

Unnamed: 0,number_of_distinct_titles
0,10410
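
One detail worth knowing when counting distinct values is how missing values are treated. The sketch below uses a small made-up Series (`toy_titles` is hypothetical) to show that `.unique()` keeps a missing value as one of the distinct items, whereas `.nunique()` ignores it by default. SQL's `COUNT(DISTINCT ...)` also skips NULLs, so it lines up with `.nunique()`.

```python
import pandas as pd

# Hypothetical titles, including one duplicate and one missing value.
toy_titles = pd.Series(["Hamlet", "Hamlet", "Heat", None, "Alien"])

# .unique() returns every distinct item, including the missing value.
n_unique_with_missing = len(toy_titles.unique())

# .nunique() drops missing values unless told otherwise.
n_unique_default = toy_titles.nunique()
n_unique_keep_missing = toy_titles.nunique(dropna=False)

# Titles that occur more than once (keep=False marks every repeat).
repeated_titles = toy_titles[toy_titles.duplicated(keep=False)]

print(n_unique_with_missing, n_unique_default, n_unique_keep_missing)
print(repeated_titles)
```

If a column has no missing values, the two approaches give the same answer; they only diverge when nulls are present.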


__Related questions:__ *How many uniquely named actors are there? How many uniquely named directors? Countries? Genres?*

In [17]:
# number of unique actors
print("Unique actors: {0}".format(movieActor_df["actor_name"].nunique()))

# number of unique directors
print("Unique directors: {0}".format(movieDirector_df["director_name"].nunique()))

# number of unique countries
print("Unique countries: {0}".format(movieCountry_df["country"].nunique()))

# number of unique genres
print("Unique genres: {0}".format(movieGenre_df["genre"].nunique()))

Unique actors: 95187
Unique directors: 4052
Unique countries: 71
Unique genres: 20


### Exercise 3: The `movie` table - *What release date period does the dataset cover?*


This question can be answered by thinking about the range (maximum and minimum) of values in a particular column of the dataset.

In [26]:
# Enter your query here
print("Earliest movie: {0}".format(movie_df['year'].min()))

print("Newest movie: {}". format(movie_df['year'].max()))

print("({0} - {1})".format(movie_df['year'].min(), movie_df['year'].max()))
# movie_df['year'].max() - movie_df['year'].min()

Earliest movie: 1915
Newest movie: 2008
(1915 - 2008)


#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.


The DataFrame column methods `.min()` and `.max()` return the minimum and maximum of the values in the column specified (see the *04.2 Descriptive statistics in pandas* Notebook):

In [27]:
movie_df['year'].min(), movie_df['year'].max()

(1915, 2008)

The SQL `MIN()` and `MAX()` functions find the smallest and largest values in a column:

In [28]:
pysqldf('''
        SELECT MIN(year) AS minimum_year, MAX(year) AS maximum_year
        FROM movie_df
        ''')

Unnamed: 0,minimum_year,maximum_year
0,1915,2008


__Related questions:__ *What are the ranges of values for critics, audience and user ratings?*
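
When several columns need the same summary, one option is to apply `.agg()` with a list of function names rather than calling `.min()` and `.max()` on each column in turn. This is a sketch on made-up values (the `toy_df` dataframe is hypothetical, although its column names follow those of `movie_df`):

```python
import pandas as pd

# Hypothetical ratings; the column names mirror movie_df.
toy_df = pd.DataFrame({
    "rt_all_critics_rating": [9.0, 5.6, None, 0.0],
    "rt_audience_rating": [3.7, 3.2, None, 3.0],
    "ml_user_rating": [3.9, 3.2, 2.7, 3.1],
})

# One row per aggregate, one column per rating; missing values are skipped.
rating_ranges = toy_df.agg(["min", "max"])

print(rating_ranges)
```

When looking at ranges like these, an extreme such as `0.0` may deserve a closer look: it could be a genuine score or a placeholder, and the range alone cannot tell you which.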

### Exercise 4: The `movie_genre` table - *How many movies are classified under each genre, sorted according to decreasing count?*


This query requires a couple of steps: first, group items into particular sets, and second, count the number of items in each set.

When developing your queries, pay particular attention to the range of genres listed. Are there any notable or distinguished values listed there?

In [29]:
# Remind ourselves what columns are available

list(movieGenre_df.columns)

['movie_id', 'genre']

In [39]:
# Enter your query here
movieGenre_df.groupby('genre').size().sort_values(ascending=False)

genre
Drama                 5339
Comedy                3703
Thriller              1706
Romance               1685
Action                1473
Crime                 1118
Adventure             1025
Horror                1013
Sci-Fi                 754
Fantasy                543
Children               528
War                    511
Mystery                509
Documentary            482
Musical                436
Animation              286
Western                275
Film-Noir              148
IMAX                    29
(no genres listed)       1
dtype: int64

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.


<div class='answer'>There are many different ways of formulating queries to address this question. For example, we can use <tt>.groupby()</tt> on the `genre` column, and then use <tt>.size()</tt> to find the size of each group. The results can then be sorted in order of decreasing count value:

In [40]:
movieGenre_df.groupby('genre').size().sort_values(ascending=False)

genre
Drama                 5339
Comedy                3703
Thriller              1706
Romance               1685
Action                1473
Crime                 1118
Adventure             1025
Horror                1013
Sci-Fi                 754
Fantasy                543
Children               528
War                    511
Mystery                509
Documentary            482
Musical                436
Animation              286
Western                275
Film-Noir              148
IMAX                    29
(no genres listed)       1
dtype: int64

Alternatively, we could generate a pivot table indexed on *genre* and aggregated using a `size` function (as described in the *04.1 Crosstabs and pivot tables* Notebook):

In [41]:
movieGenre_df.pivot_table(index=['genre'], aggfunc='size').sort_values(ascending=False)

genre
Drama                 5339
Comedy                3703
Thriller              1706
Romance               1685
Action                1473
Crime                 1118
Adventure             1025
Horror                1013
Sci-Fi                 754
Fantasy                543
Children               528
War                    511
Mystery                509
Documentary            482
Musical                436
Animation              286
Western                275
Film-Noir              148
IMAX                    29
(no genres listed)       1
dtype: int64

In SQL terms, the grouping approach, using `GROUP BY`, is probably the simplest way, ordering the final result by count using `DESC`:

In [42]:
pysqldf('''
        SELECT genre, COUNT(*) AS number_in_genre 
        FROM movieGenre_df 
        GROUP BY genre 
        ORDER BY number_in_genre DESC
        ''')

Unnamed: 0,genre,number_in_genre
0,Drama,5339
1,Comedy,3703
2,Thriller,1706
3,Romance,1685
4,Action,1473
5,Crime,1118
6,Adventure,1025
7,Horror,1013
8,Sci-Fi,754
9,Fantasy,543



Of the *genre* values, one notable value is the *(no genres listed)* value which explicitly identifies a movie with no associated genres, rather than representing that information with a NULL value or by omitting the particular movie from the table altogether.
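
A further pandas option, not used above, is the Series method `.value_counts()`, which groups, counts and sorts in decreasing order in a single step. The sketch below uses a made-up genre table (`toy_genre_df` is hypothetical) and also shows one way of setting aside the *(no genres listed)* placeholder before reporting the counts:

```python
import pandas as pd

# Hypothetical genre table with the same columns as movieGenre_df.
toy_genre_df = pd.DataFrame({
    "movie_id": [1, 1, 2, 3, 4],
    "genre": ["Drama", "Comedy", "Drama", "(no genres listed)", "Drama"],
})

# value_counts() returns counts sorted from most to least frequent.
genre_counts = toy_genre_df["genre"].value_counts()

# Drop the placeholder label; errors="ignore" avoids a KeyError if it is absent.
NO_GENRE = "(no genres listed)"
real_genre_counts = genre_counts.drop(NO_GENRE, errors="ignore")

print(genre_counts)
print(real_genre_counts)
```

Whether the placeholder should be excluded depends on the question being asked; for "how many movies have no genre?" it is exactly the row you want.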

## More complex questions: missing data, and using multiple tables

We will now try to answer some questions which seek to identify the number of records with a missing value in a particular column, or that appear in one table but not another. Different strategies may be required to calculate these numbers.

### Exercise 5: How many movies don't have an audience rating?
This query requires us to identify which records are missing a value in a particular column.

In [43]:
# Start by checking which column of the movie table contains audience ratings:
movie_df.head()

Unnamed: 0,movie_id,title,year,rt_all_critics_rating,rt_top_critics_rating,rt_audience_rating,ml_user_rating
0,1,Toy Story,1995,9.0,8.5,3.7,3.9
1,2,Jumanji,1995,5.6,5.8,3.2,3.2
2,3,Grumpier Old Men,1995,5.9,7.0,3.2,3.2
3,4,Waiting to Exhale,1995,5.6,5.5,3.3,2.9
4,5,Father of the Bride Part II,1995,5.3,5.4,3.0,3.1


In [46]:
# Enter your query here
len(movie_df[movie_df['rt_audience_rating'].isnull()])

714

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.


The Series and DataFrame data structures have an <tt>.isnull()</tt> method which returns a boolean for each element, indicating whether the value is null (see the *03.4 Handling missing data* Notebook). These boolean values can be used to select those rows where <tt>.isnull()</tt> returns `True` (see the *03.2 Selecting and projecting, sorting and limiting* Notebook):


In [47]:
movie_df[movie_df['rt_audience_rating'].isnull()]

Unnamed: 0,movie_id,title,year,rt_all_critics_rating,rt_top_critics_rating,rt_audience_rating,ml_user_rating
50,51,Guardian Angel,1994,,,,2.7
106,108,Catwalk,1996,,,,3.1
107,109,Headless Body in Topless Bar,1995,,,,1.5
125,127,"Silence of the Palace, The (Saimt el Qusur)",1994,,,,3.3
131,133,Nueba Yol,1995,,,,2.8
136,138,"Neon Bible, The",1995,,,,3.2
140,142,Shadows (Cienie),1988,,,,2.9
141,143,Gospa,1995,,,,2.3
190,192,"Show, The",1995,,,,3.0
208,210,Wild Bill,1995,,,,2.9


We can then use the `len()` function to count the number of such movies:

In [48]:
len(movie_df[movie_df['rt_audience_rating'].isnull()])

714

In SQL, we can select rows where a specified column value contains NULL:

In [49]:
pysqldf('''
        SELECT COUNT(*) AS number_of_movies
        FROM movie_df
        WHERE rt_audience_rating IS NULL
        ''')

Unnamed: 0,number_of_movies
0,714
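
If only the count is needed, rather than the rows themselves, a shorter route is to sum the booleans returned by `.isnull()`, since `True` counts as 1. The sketch below uses made-up data (`toy_df` is hypothetical) and also shows the same idea applied to every column at once:

```python
import pandas as pd

# Hypothetical data with some missing ratings.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rt_audience_rating": [3.7, None, 3.2, None],
    "ml_user_rating": [3.9, 2.7, None, 3.1],
})

# Summing booleans counts the True values, i.e. the missing entries.
n_missing_audience = int(toy_df["rt_audience_rating"].isnull().sum())

# Applied to the whole dataframe, this gives a count per column.
missing_per_column = toy_df.isnull().sum()

print(n_missing_audience)
print(missing_per_column)
```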


### Exercise 6: How many movies don't have a country listed?

This question requires us to compare the records that appear in one table (which we might expect to contain a complete set of records) with those that appear in another.

In [50]:
# Remind ourselves what columns are available in the two tables:
list(movie_df.columns)

['movie_id',
 'title',
 'year',
 'rt_all_critics_rating',
 'rt_top_critics_rating',
 'rt_audience_rating',
 'ml_user_rating']

In [51]:
list(movieCountry_df.columns)

['movie_id', 'country']

In [56]:
# Enter your query here

# assuming that each movie has at most one row in movie_country,
# the difference between the table lengths is the number of movies with no country
len(movie_df['movie_id']) - len(movieCountry_df['movie_id'])

484

In [57]:
# better approach
len( set(movie_df['movie_id']).difference(set(movieCountry_df['movie_id'])) )

484

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.


If we assume that the `movie` table contains the complete set of movies, and that each movie has at most one row in the `movie_country` table, we could find the difference between the lengths of the two tables:


In [53]:
len(movie_df)-len(movieCountry_df)

484

A better approach (that is, one that is more generalisable, and which is more closely related to the particular question) might be to find the length of the set of movies which appear in the `movie` table but not the `movie_country` table:

In [58]:
len( set(movie_df['movie_id']).difference(set(movieCountry_df['movie_id'])) )

484

In SQL, we can use a subquery to find the set of movies which are associated with a country, and then use a `WHERE ... NOT IN ...` to select and count the movies which are *not* in that collection:

In [59]:
pysqldf('''
        SELECT COUNT(*) AS number_of_movies 
        FROM movie_df
        WHERE movie_id NOT IN (SELECT movie_id 
                               FROM movieCountry_df)
        ''')

Unnamed: 0,number_of_movies
0,484


__Related questions:__ *The same question could be asked of the directors, which also have at most one entry for each particular `movie_id`.*
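
To see why comparing table lengths is the more fragile approach in Exercise 6, consider a made-up case in which one movie has two country rows (the `toy_movie_df` and `toy_country_df` dataframes are hypothetical and deliberately break the one-row-per-movie pattern). The sketch also shows a pandas "anti-join", using a left merge with `indicator=True`, which returns the unmatched rows themselves rather than just a count:

```python
import pandas as pd

# Hypothetical tables: movie 1 has two countries; movies 3 and 4 have none.
toy_movie_df = pd.DataFrame({
    "movie_id": [1, 2, 3, 4],
    "title": ["A", "B", "C", "D"],
})
toy_country_df = pd.DataFrame({
    "movie_id": [1, 1, 2],
    "country": ["USA", "UK", "France"],
})

# Comparing lengths undercounts here, because movie 1 appears twice.
length_difference = len(toy_movie_df) - len(toy_country_df)

# Anti-join: keep the movies with no matching movie_id in the country table.
merged = toy_movie_df.merge(
    toy_country_df[["movie_id"]].drop_duplicates(),
    on="movie_id",
    how="left",
    indicator=True,  # adds a '_merge' column describing where each row matched
)
no_country_df = merged[merged["_merge"] == "left_only"]

print(length_difference)
print(no_country_df[["movie_id", "title"]])
```

Here the length difference reports one movie, whereas two movies actually lack a country. The set-difference and `NOT IN` approaches are not affected by repeated rows.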

### Exercise 7: Combining the `movie` and `movie_actor` tables: which films released in 1995 did Tom Hanks appear in?

To frame this query, we need to join two tables, and in each case, filter on one of the elements.

In [60]:
# Enter your query here


In [63]:
pd.merge( movie_df[movie_df['year']==1995], movieActor_df[movieActor_df['actor_name']=="Tom Hanks"],on='movie_id')['title']

0                Toy Story
1                Apollo 13
2    Celluloid Closet, The
Name: title, dtype: object

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.


The pandas <tt>merge()</tt> function can be used to merge dataframes which share one or more common column values. In this case, we want to merge only the rows in the `movie_df` dataframe which have a value of *1995* in the `year` column, and only those rows in the `movieActor_df` dataframe which have an `actor_name` value of *Tom Hanks*:

In [64]:
pd.merge(movie_df[movie_df['year']==1995],
         movieActor_df[movieActor_df['actor_name']=='Tom Hanks'],
         on='movie_id')['title']

0                Toy Story
1                Apollo 13
2    Celluloid Closet, The
Name: title, dtype: object

For the SQL query, we can do a simple `JOIN` on the `movie` and `movie_actor` tables on the `movie_id` column, and then filter the rows as required in each separate table:

In [65]:
pysqldf('''
        SELECT title
        FROM movie_df JOIN movieActor_df
            ON movie_df.movie_id=movieActor_df.movie_id
        WHERE year=1995 AND actor_name='Tom Hanks'
        ''')

Unnamed: 0,title
0,Toy Story
1,Apollo 13
2,"Celluloid Closet, The"
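
The pandas solution filters each dataframe *before* merging, whereas the SQL query is written as a join followed by a `WHERE` filter. For an inner join like this one, the two orderings give the same rows. The sketch below checks this on made-up data (the dataframes and the actor names are hypothetical):

```python
import pandas as pd

# Hypothetical tables with the same key column as the movies dataset.
toy_movie_df = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "year": [1995, 1995, 1996],
})
toy_actor_df = pd.DataFrame({
    "movie_id": [1, 2, 3, 3],
    "actor_name": ["Pat Example", "Sam Sample", "Pat Example", "Sam Sample"],
})

# Option 1: filter each table first, then merge the smaller tables.
filtered_first = pd.merge(
    toy_movie_df[toy_movie_df["year"] == 1995],
    toy_actor_df[toy_actor_df["actor_name"] == "Pat Example"],
    on="movie_id",
)["title"]

# Option 2: merge everything, then filter the combined rows.
joined = pd.merge(toy_movie_df, toy_actor_df, on="movie_id")
filtered_last = joined[
    (joined["year"] == 1995) & (joined["actor_name"] == "Pat Example")
]["title"]

print(list(filtered_first), list(filtered_last))
```

Filtering first keeps the intermediate tables small, which can matter for large dataframes; filtering last is sometimes easier to read when the conditions span both tables.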


### Exercise 8: Combining the `movie` and `movie_actor` tables: which of Tom Hanks' lead billing movies was least highly rated by all critics and in what year was it released?


To create a query corresponding to this question, we need to:
1. identify the tables containing the information we need,
2. filter the data in the tables so that it contains only what is needed,
3. join the tables,
4. rank the result on non-null values and finally
5. choose one of the extreme ranked values.

In [67]:
# Enter your query here
list(movie_df.columns)

['movie_id',
 'title',
 'year',
 'rt_all_critics_rating',
 'rt_top_critics_rating',
 'rt_audience_rating',
 'ml_user_rating']

In [70]:
list(movieActor_df.columns)

['movie_id', 'actor_name', 'ranking']

In [74]:
tomHanksLead_df = movieActor_df[(movieActor_df['actor_name']=='Tom Hanks') & (movieActor_df['ranking']==1)]
tomHanksLead_df.head()

Unnamed: 0,movie_id,actor_name,ranking
22,1,Tom Hanks,1
3384,150,Tom Hanks,1
8127,356,Tom Hanks,1
11976,508,Tom Hanks,1
12947,539,Tom Hanks,1


In [77]:
allRatings_df = movie_df[~movie_df['rt_all_critics_rating'].isnull()]
allRatings_df.head()

Unnamed: 0,movie_id,title,year,rt_all_critics_rating,rt_top_critics_rating,rt_audience_rating,ml_user_rating
0,1,Toy Story,1995,9.0,8.5,3.7,3.9
1,2,Jumanji,1995,5.6,5.8,3.2,3.2
2,3,Grumpier Old Men,1995,5.9,7.0,3.2,3.2
3,4,Waiting to Exhale,1995,5.6,5.5,3.3,2.9
4,5,Father of the Bride Part II,1995,5.3,5.4,3.0,3.1


In [79]:
tomHanks_rated = pd.merge(allRatings_df,tomHanksLead_df,on='movie_id')
tomHanks_rated

Unnamed: 0,movie_id,title,year,rt_all_critics_rating,rt_top_critics_rating,rt_audience_rating,ml_user_rating,actor_name,ranking
0,1,Toy Story,1995,9.0,8.5,3.7,3.9,Tom Hanks,1
1,150,Apollo 13,1995,8.0,7.9,3.6,3.9,Tom Hanks,1
2,356,Forrest Gump,1994,6.9,7.0,4.1,4.0,Tom Hanks,1
3,508,Philadelphia,1993,6.5,6.7,3.7,3.8,Tom Hanks,1
4,539,Sleepless in Seattle,1993,6.5,5.6,3.3,3.5,Tom Hanks,1
5,1827,"Big One, The",1997,7.9,6.8,3.4,3.9,Tom Hanks,1
6,2028,Saving Private Ryan,1998,8.2,8.1,4.0,4.1,Tom Hanks,1
7,2072,"'burbs, The",1989,5.2,0.0,3.3,3.0,Tom Hanks,1
8,2100,Splash,1984,6.9,5.6,3.0,3.3,Tom Hanks,1
9,2418,Nothing in Common,1986,5.8,0.0,2.8,3.0,Tom Hanks,1


In [None]:
tomHanks_rated.sort_values('rt_all_critics_rating').head(1)

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.


When generating complex queries, it can often be useful to split the problem into several separate pieces. For example, we can start by generating a dataframe that finds Tom Hanks movies where the actor had top billing:

In [None]:
tomHanksLead_df=movieActor_df[(movieActor_df['actor_name']=='Tom Hanks') & (movieActor_df['ranking']==1)]

tomHanksLead_df.head()

We can also find movies where the "all critics" rating is not null:

In [None]:
ratedMovies_df=movie_df[~movie_df['rt_all_critics_rating'].isnull()]

ratedMovies_df.head()

We can then merge these two dataframes by matching the values in their `movie_id` columns:

In [None]:
tomHanksRated_df=pd.merge(ratedMovies_df,tomHanksLead_df,on='movie_id')

tomHanksRated_df

We can then sort the merged dataframe by the critics' ratings, take the first row (the one with the lowest rating), and project the `title`, `year` and `rt_all_critics_rating` columns to produce our final result:

In [None]:
tomHanksRated_df.sort_values('rt_all_critics_rating').head(1)[['title','year','rt_all_critics_rating']]

For an SQL query, the `SELECT` clause provides the projection, the `ON` clause specifies how the `JOIN` matches rows, and the `WHERE` clause filters them. The `ORDER BY` clause then sorts the result, and `LIMIT 1` returns just the first (lowest-rated) row:

In [None]:
pysqldf('''
        SELECT title, year, rt_all_critics_rating 
        FROM movie_df JOIN movieActor_df 
            ON movie_df.movie_id=movieActor_df.movie_id
        WHERE movieActor_df.actor_name='Tom Hanks' 
            AND movieActor_df.ranking=1 
            AND movie_df.rt_all_critics_rating IS NOT NULL
        ORDER BY rt_all_critics_rating
        LIMIT 1
        ''')
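
The non-null filter in Exercise 8 matters more in SQL than in pandas, because the two treat missing values differently when sorting. In SQLite, NULLs sort *before* all other values in ascending order, so an unfiltered `ORDER BY ... LIMIT 1` would return an unrated movie; pandas `sort_values()` places missing values *last* by default. The sketch below illustrates this on made-up data (`toy_df` is hypothetical), again using the standard-library `sqlite3` module instead of pandasql so that it is self-contained:

```python
import sqlite3
import pandas as pd

# Hypothetical movies; movie "B" has no rating.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "year": [1990, 1991, 1992],
    "rating": [6.5, None, 5.2],
})

# pandas: missing values are sorted to the end by default.
lowest_by_sort = toy_df.sort_values("rating").head(1)["title"].iloc[0]

# nsmallest() is a shortcut for "sort ascending and take the first n rows".
lowest_by_nsmallest = toy_df.nsmallest(1, "rating")["title"].iloc[0]

conn = sqlite3.connect(":memory:")
toy_df.to_sql("toy", conn, index=False)

# Without the filter, the NULL rating sorts first in SQLite.
unfiltered = pd.read_sql_query(
    "SELECT title FROM toy ORDER BY rating LIMIT 1", conn
)
# With the filter, we get the lowest genuine rating.
filtered = pd.read_sql_query(
    "SELECT title FROM toy WHERE rating IS NOT NULL ORDER BY rating LIMIT 1",
    conn,
)
conn.close()

print(lowest_by_sort, lowest_by_nsmallest)
print(unfiltered["title"].iloc[0], filtered["title"].iloc[0])
```

Keeping the explicit non-null filter in both versions, as our solution does, makes the intent clear and avoids relying on either default.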

### Activity: Your own questions here...

If you would like to try to turn some of your own questions, or questions posted by other students on the forums, into queries, add them here. Feel free to share your queries on the course forums.


In [None]:
# YOUR OWN QUESTIONS INTO QUERIES...


## Summary
In this notebook you have had an opportunity to practise converting *questions* into *queries*, and to revise how to manipulate and query datasets using both native *pandas* functions and simple SQL queries.

## What next?

This completes the practical notebook activities for this week - return to the course materials on the VLE.