# Analysing Hollywood - The Movies Dataset

In this notebook you will explore the contents of the Movies dataset using the tools you have used previously to 
manipulate `DataFrames`. 
You will be using this dataset, which contains data about movies, their actors and directors, and audience and 
critics' ratings, as a relational database in some of the practical activities in Parts 9-12.

This dataset is derived from the [MovieLens + IMDb/Rotten Tomatoes](http://grouplens.org/datasets/hetrec-2011/) dataset made available at the *2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems* 
([HetRec 2011](http://ir.ii.uam.es/hetrec2011)) at the *5th ACM Conference on Recommender Systems* 
([RecSys 2011](http://recsys.acm.org/2011)). 
It is an extension of the [MovieLens 10M](http://grouplens.org/datasets/movielens/) 
dataset containing additional data from the 
[Internet Movie Database (IMDb)](http://www.imdb.com/) and the [RottenTomatoes (RT)](http://www.rottentomatoes.com/) 
movie review system.

This dataset comprises five individual datasets: `movie`, `movie_actor`, `movie_country`, `movie_director` and `movie_genre`.

The data for each table is contained in a correspondingly named CSV file in the `./data` subdirectory.

This notebook contains several exercises or activities, which are presented with a space for you to try your own solution. In each case, you can see our solution by clicking on the small triangle next to the text "**our solution**" or "**discussion**", but in all cases, you should attempt the questions yourself before looking at our proposed solutions.


In [1]:
!ls data

movie_actor.csv    movie.csv	       movie_genre.csv
movie_country.csv  movie_director.csv


Each table contains a `movie_id` corresponding to a particular movie. The `movie_id` can be used to pull information about the same movie from different tables. To simplify the retrieval of rows corresponding to a movie by its `movie_id`, we can define a simple helper function.
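A minimal sketch of such a helper is given below. The function name `rows_for_movie` and the small sample table are our own illustrative choices, not part of the dataset; the sketch only assumes that the table has a `movie_id` column.

```python
import pandas as pd


def rows_for_movie(df, movie_id):
    """Return the rows of df whose 'movie_id' column equals movie_id."""
    return df[df['movie_id'] == movie_id]


# Tiny made-up table standing in for one of the CSV files in ./data.
# The values are placeholders, not real records.
sample_df = pd.DataFrame({
    'movie_id': [1, 1, 2],
    'genre': ['Genre A', 'Genre B', 'Genre A'],
})

matching_rows = rows_for_movie(sample_df, 1)
print(matching_rows)
```

The same function should work unchanged on any of the `DataFrames` created later in this notebook, because each of them has a `movie_id` column.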

## From Questions to Queries

When asking a question of a dataset, we are actually trying to think of questions whose answers are contained within a single row, maybe even a single cell, of a data table, or which are the result of sorting or processing an aggregation of the data.

To ask a *question* of a dataset, we must turn it into a *query* that we can apply to the actual data. In other words, we don't just run queries over a dataset out of the blue. The queries are motivated by a question we want to know the answer to. This should then help us tell (or find) a particular story, and so provide us with information that might help us make a particular decision.

The query itself may be made up of several distinct processing steps, such as aggregation and sorting, and may require data from several sources to be combined. If you are to work effectively with data, you need to develop a range of skills. First, you need to be able to generate interesting questions that can reasonably be asked of the data - that is, you need to think like an *investigator*; second, you need to be able to transform those questions into queries, or processing steps, that can be applied to the data at hand - a *programming* step; third, you need to be able to implement the query, for example by writing some Python/*pandas* code or an SQL query that manipulates the data contained in a *pandas* `DataFrame` or a relational database - a *coding* step. This final step may also require you to do some prior work on the data, such as cleaning (or *cleansing*) it to get it into a state where you can reliably write the queries that will enable you to ask the right questions of it.

In this notebook, you will have an opportunity to develop your question formulating skills. In the next, you will have a chance to turn some of these questions into queries.

### Questions and Stories, Decisions and Discovery

As you familiarise yourself with the data contained in each table, start to ask yourself what sorts of __question__ you might be able to ask of the data contained in just that table, as well as the questions you might be able to ask from a combination of tables. Be as ambitious as you can in formulating the questions. Don't concern yourself for now with how you might actually write a query over the data to answer those questions.

Also start to think about what sorts of __story__ you might be able to tell using the data. Often, coming up with the idea of a story in the form of a high level question, such as *"How did an actor's career develop?"*, will suggest some deeper questions (for example, more specific or refined questions). These in turn may lead to yet more questions...

Alternatively, you can start to think about what sorts of __decision__ might be informed by data contained in the dataset, or information derived from it, or what sorts of thing you might be able to __discover__ within the dataset.

While looking at the data, also try to identify any possible problems or potential data quality issues with it.

## Exploring the Movies Dataset
Let's use *pandas* DataFrames to explore the data contained in the CSV data files.

In [None]:
import pandas as pd

###  The `movie` table
`movie (movie_id, title, year, rt_all_critics_rating, rt_top_critics_rating, rt_audience_rating, ml_user_rating)`

Each row records the following data about a particular movie identified by the `movie_id` primary key (PK) column.

column | description
------ | -----------
movie_id  (PK) | movie identifier
title | movie title
year | year of release
rt_all_critics_rating | RottenTomatoes - all critics: average rating
rt_top_critics_rating | RottenTomatoes - top critics: average rating
rt_audience_rating | RottenTomatoes - audience: average rating
ml_user_rating | MovieLens - users: average rating

In [None]:
# Create the DataFrame 'movie' from the CSV data file 'movie.csv'.
movie_df = pd.read_csv('data/movie.csv')

movie_df.head()

In [None]:
# Display data about the movie with the movie identifier of '1'.

movie_df[movie_df['movie_id']==1]

To simplify some of the next cells, we will choose a specific movie to focus on, and store its identifier in a variable:

In [None]:
# For convenience, let's use the same movie_id throughout.
#
# To run the rest of the notebook for a different movie, just
# change the value of this variable to a different identifier.

MOVIE_ID = 1

In [None]:
movie_df[movie_df['movie_id']== MOVIE_ID ]

### Activity: What might we learn from this table?

By considering the table as a whole, and the column names in particular, what might you be able to learn from the data contained in this table?

In what ways might the table suffer from *data quality* issues? For example, how might it contain *incomplete*, *incorrect* or otherwise *dirty* data?

Spend a few moments thinking about these questions before revealing our discussion below.

#### Discussion

To reveal some of our own answers to these questions, click on the triangle symbol on the left-hand end of this cell.

<div class='answer'>Trivially, we can look up the name of a movie or the year it was released from its identifier, which appears in the `movie_id` column of the `movie` table. Or we can find the identifier for a movie with a particular name, or with a particular name that was released in a particular year.<br/>
<br/>
We can also count the number of movies that are contained in the database, or narrow this down to find the number of movies released by year; plotting the number of movies by year might allow us to see trends in the volume of movies released over time.<br/>
<br/>
Another thing we could do is find the period over which the database applies by checking the range of years associated with movies.<br/>
<br/>
We could use the data to find the top or bottom ranked movies of all time (at least, all time according to the dataset!). Ratings are provided by critics, audiences and MovieLens users, so we could sort the table by the appropriate ratings column and then look at the top N or bottom M rows to find out how each group ranked the movies. We could also find the most highly (or lowly) rated movies for a particular year or for each year.<br/>
<br/>
Grouping by year, we could find the mean ratings to see if one year or another appeared to be a particularly "good" or "bad" year for movie releases. Looking at the range or standard deviation of ratings within each year might indicate whether the movies released that year were of a particularly consistent or inconsistent quality.<br/>
<br/>
Noting that there are several columns relating to reviews by different sets of people, we could generate a *derived* data column (the mean or median rating across the different ratings columns, for example) or look to see whether the scores from the different classes of reviewer are correlated with each other.<br/>
<br/>
As far as *data quality* goes, one issue relates to the comprehensiveness of the data: for any year referenced in the database, to what extent does it represent a complete set of records according to whatever collection policy is applied to it? Also in terms of *incomplete data*, this time at the record level, one possible problem is missing values; for example, a movie record may be missing the title of the movie, or one or more of the review scores. In terms of *incorrect data*, one possible problem is the misspelling of a title or an incorrect year associated with a movie. If searches on movie titles are case sensitive, incorrect capitalisation of the movie title may cause problems, which we might class as a *dirty data* problem.
</div>
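As a hedged illustration of how a few of these questions might later be turned into *pandas* steps, consider the sketch below. It runs on a handful of made-up rows that only mimic some of the columns of `movie.csv`; the titles and ratings are placeholders, and the variable names are our own.

```python
import pandas as pd

# Made-up rows that mimic part of the column layout of movie.csv.
# All values are placeholders, not real records.
movie_sample = pd.DataFrame({
    'movie_id': [1, 2, 3, 4],
    'title': ['Movie A', 'Movie B', 'Movie C', None],
    'year': [1995, 1995, 1996, 1996],
    'rt_audience_rating': [3.5, 4.0, 2.5, None],
    'ml_user_rating': [3.0, 4.5, 2.0, 3.5],
})

# How many movies were released in each year?
movies_per_year = movie_sample.groupby('year')['movie_id'].count()

# Which movie do MovieLens users rate most highly?
top_rated = movie_sample.sort_values('ml_user_rating', ascending=False).head(1)

# What is the mean MovieLens rating for each year of release?
mean_rating_by_year = movie_sample.groupby('year')['ml_user_rating'].mean()

# Do audience and MovieLens ratings tend to move together?
# (Rows with a missing value in either column are ignored by corr.)
rating_corr = movie_sample['rt_audience_rating'].corr(movie_sample['ml_user_rating'])

# Data quality: how many values are missing in each column?
missing_counts = movie_sample.isna().sum()

print(movies_per_year)
print(top_rated)
print(mean_rating_by_year)
print(rating_corr)
print(missing_counts)
```

To try the same steps on the real data, you could replace `movie_sample` with the `movie_df` DataFrame created earlier in this notebook.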

###  The `movie_actor` table
`movie_actor (movie_id, actor_name, ranking)`

Each movie features one or more actors. Each row records a particular actor featuring in a particular movie 
identified by the `movie_id` and `actor_name` primary key columns.


column | description
------ | -----------
movie_id  (PK) | movie identifier
actor_name  (PK) | actor's name
ranking | position of actor on the movie's cast list

In [None]:
# Create the DataFrame 'movie_actor' from the CSV data file 'movie_actor.csv'.
movieActor_df = pd.read_csv('data/movie_actor.csv')

movieActor_df.head()

In [None]:
# Display the top 5 actors featuring in our chosen movie, ordered according to the cast list ranking.

movieActor_df[movieActor_df['movie_id']==MOVIE_ID].sort_values('ranking').head(5)

### Activity: What might we learn from this table?

Looking at the range of column names, what might you be able to learn from this table?

See if you can rephrase each of those discoveries in terms of a __question__ you can ask of the data.

What additional questions might you be able to ask if you combine data from this table with data from the `movie` table? What sort of __story__ might you be able to start to tell based on such combinations, or what sort of decision might information derived from these tables help you make?

#### Discussion

To reveal some of our own answers to these questions, click on the triangle symbol on the left-hand end of this cell.

<div class="answer">For a given movie, we could find the size of the cast, or the names of the top-billed actors. In terms of questions, we might ask: *"What was the size of the cast for a particular movie?"* or *"Who was the top-billed actor in a particular movie?"*<br/> 
<br/>
We could also use the <tt>actor_name</tt> column to find the number of movies an actor has appeared in, as well as their average ranking. In terms of specific questions, we could ask *"How many movies has Tom Hanks appeared in?"* or *"How many movies has Tom Hanks had the top billing/ranking in?"* Spinning that question around allows us to ask things like *"Which actor has had the greatest number of top billings?"*<br/>
<br/>
As each movie has multiple actors, we can look for the number of times in which two actors have appeared together. More ambitiously, we should also be able to work out the [Bacon number](https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon#Bacon_numbers) of any given actor...<br/>
<br/>
If we cross-reference tables, we can use the year of movies' releases (obtained from the `movie` table) to find the number of movies an actor appeared in by year of release (*"How many movies did Tom Hanks appear in that were released in 2005?"*) as well as their average bill ranking by year.<br/>
<br/>The `movie` table also contains movie ratings, so we could track the popularity of movies a particular actor has appeared in (*"How well received by the top critics were Tom Hanks' top billing movies between 1990 and 2005?"*).<br/>
<br/>Putting all this together, we could start to address a question of the form *"How did Tom Hanks' career evolve?"* in order to tell a story about that actor's career path.<br/>
<br/></div>
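The sketch below shows, under stated assumptions, how some of these actor questions might be expressed in *pandas*. The two sample tables mimic the columns of `movie.csv` and `movie_actor.csv`, but every name and value in them is invented for illustration.

```python
import pandas as pd

# Placeholder rows mimicking movie.csv and movie_actor.csv.
# Names and values are invented, not taken from the dataset.
movie_sample = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'year': [1995, 1995, 1996],
})
actor_sample = pd.DataFrame({
    'movie_id': [1, 1, 2, 2, 3],
    'actor_name': ['Actor X', 'Actor Y', 'Actor X', 'Actor Z', 'Actor X'],
    'ranking': [1, 2, 2, 1, 1],
})

# What was the size of the cast for each movie?
cast_size = actor_sample.groupby('movie_id')['actor_name'].count()

# Which actor has had the greatest number of top billings?
top_billed = actor_sample[actor_sample['ranking'] == 1]
top_billing_counts = top_billed['actor_name'].value_counts()

# How many movies did each actor appear in, by year of release?
# This needs the year from the movie table, so we join on movie_id.
actor_movies = actor_sample.merge(movie_sample, on='movie_id')
appearances_by_year = actor_movies.groupby(['actor_name', 'year'])['movie_id'].count()

print(cast_size)
print(top_billing_counts)
print(appearances_by_year)
```

The `merge` on `movie_id` is what lets a question about actors draw on the release year, which is only recorded in the `movie` table.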

### The `movie_director` table
`movie_director (movie_id, director_name)`

Each movie has one director. Each row in the `movie_director` table records the director of a particular movie, 
identified by the `movie_id` primary key column.


column | description
------ | -----------
movie_id  (PK) | movie identifier
director_name | director's name

In [None]:
# Create the DataFrame 'movie_director' from the CSV data file 'movie_director.csv'.
movieDirector_df = pd.read_csv('data/movie_director.csv')

movieDirector_df[movieDirector_df['movie_id']==MOVIE_ID]

### Activity: What might we learn from this table?

Building on the way you considered the previous two tables, what might you learn from this table on its own, and in conjunction with the previously mentioned tables?

What problems or issues might arise from the use of this table?


#### Discussion

To reveal some of our own answers to these questions, click on the triangle symbol on the left-hand end of this cell.

<div class="answer">On its own, the `movie_director` table allows us to ask questions of the form *"How many movies were directed by a particular director?"*.<br/>
<br/>
When combined with other tables, we can start to ask questions of the form *"How many movies did a particular person both feature in and direct?"*, or *"How many times did a particular actor work with a particular director?"*. We can also ask more convoluted questions of the form *"Of all the movies directed by a particular director, and with a particular actor in a top N billing, which were the most and least highly ranked by audiences?"*<br/>
<br/>
A possible issue with data quality in this table, as with the `movie_actor` table, is that there may be cases where different actors share the same name, or where an individual actor or director is credited under different names in the database. No unique identifiers are provided for actors or directors - instead we rely solely on the name. For example, the `director_name` column in the `movie_director` table contains the entries `Charlie S. Chaplin` and `Charles Chaplin`, both of which refer to the same individual, [Charlie Chaplin](https://en.wikipedia.org/wiki/Charlie_Chaplin). If the database had unique identifiers for people, in the way that it does for movies, we could use these to refer unambiguously to a particular person, even if they appear in the same database under different names.</div>
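The following sketch suggests one way of counting movies per director and of spotting name variants such as the two Chaplin spellings mentioned in the discussion. The sample rows are invented: only the two Chaplin spellings come from the discussion, while the `movie_id` values and `Director Q` are placeholders.

```python
import pandas as pd

# Placeholder rows mimicking movie_director.csv.
# The two Chaplin spellings are those quoted in the discussion;
# the movie_id values and 'Director Q' are invented.
director_sample = pd.DataFrame({
    'movie_id': [1, 2, 3, 4],
    'director_name': ['Charlie S. Chaplin', 'Charles Chaplin',
                      'Director Q', 'Director Q'],
})

# How many movies were directed by each director (as spelled in the table)?
movies_per_director = director_sample['director_name'].value_counts()

# Which distinct spellings contain a given surname?
# A plain substring match is crude, but it is a quick way to surface variants.
has_surname = director_sample['director_name'].str.contains('Chaplin', regex=False)
chaplin_variants = sorted(director_sample.loc[has_surname, 'director_name'].unique())

# Does any movie_id appear more than once? If each movie really has
# one director, this selection should be empty.
repeated_ids = director_sample[director_sample['movie_id'].duplicated(keep=False)]

print(movies_per_director)
print(chaplin_variants)
print(repeated_ids)
```

Note that counting by name treats the two spellings as two different directors, which is exactly the kind of data quality problem that unique person identifiers would avoid.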



### The `movie_country` table
`movie_country (movie_id, country)`

Each movie has one country of origin. Each row records the country of origin of a particular movie 
identified by the `movie_id` primary key column.

column | description
------ | -----------
movie_id  (PK) | movie identifier
country | country of origin

In [None]:
# Create the DataFrame 'movie_country' from the CSV data file 'movie_country.csv'.
movieCountry_df = pd.read_csv('data/movie_country.csv')

movieCountry_df[movieCountry_df['movie_id']==MOVIE_ID]

### Activity: What might we learn from this table?

As before, consider some questions you could address using the data in this table, but also try to think of some *decisions* that this data may be able to provide supporting evidence for.

#### Discussion

To reveal some of our own answers to these questions, click on the triangle symbol on the left-hand end of this cell.

<div class="answer">Using this table on its own, we can count the number of movies made in each country, and find the most popular country for making movies by volume.<br/>
<br/>In combination with other tables, we can ask about the number of movies released by year in different countries, or the countries that different directors or actors appear to like working in the most. We could also start to think about the relationship between the country that a movie was filmed in, and the review rankings of those movies.<br/>
<br/>
As far as decisions go, we can often frame these in terms of a question, or come up with a question that can provide evidence to help us make the decision. For example, if you are making a movie in Japan, and are wondering whether or not Tom Hanks will star in it, it may be worth asking the question *"Has Tom Hanks already made any movies in Japan?"*.</div>
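A question of the form *"Has a particular actor already made any movies in a particular country?"* might be sketched as follows. The sample tables mimic the columns of `movie_actor.csv` and `movie_country.csv`, but the actor and country names are placeholders, and the helper name `has_appeared_in` is our own. Bear in mind that the table records a movie's country of origin, so this is only a rough proxy for where an actor has worked.

```python
import pandas as pd

# Placeholder rows mimicking movie_actor.csv and movie_country.csv.
# All names and values are invented for illustration.
actor_sample = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'actor_name': ['Actor X', 'Actor X', 'Actor Y'],
    'ranking': [1, 1, 1],
})
country_sample = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'country': ['Country P', 'Country Q', 'Country P'],
})

# How many movies does the table record for each country?
movies_per_country = country_sample['country'].value_counts()

# Join the two tables so each actor row carries its movie's country.
actor_countries = actor_sample.merge(country_sample, on='movie_id')


def has_appeared_in(actor_name, country):
    """Return True if the actor features in any movie from the given country."""
    match = actor_countries[
        (actor_countries['actor_name'] == actor_name)
        & (actor_countries['country'] == country)
    ]
    return not match.empty


print(movies_per_country)
print(has_appeared_in('Actor X', 'Country Q'))
print(has_appeared_in('Actor Y', 'Country Q'))
```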

### The `movie_genre` table
`movie_genre (movie_id, genre)`

Each movie is categorised as belonging to one or more movie genres. Each row records a particular genre that 
categorises a particular movie identified by the `movie_id` and `genre` primary key columns.


column | description
------ | -----------
movie_id  (PK) | movie identifier
genre  (PK) | movie genre

In [None]:
# Create the DataFrame 'movie_genre' from the CSV data file 'movie_genre.csv'.
movieGenre_df = pd.read_csv('data/movie_genre.csv')

movieGenre_df[movieGenre_df['movie_id']==MOVIE_ID]


### Activity: What might we learn from this table?

As with the `movie_actor` table, this table has multiple entries for each individual movie. So what sorts of question might we ask around this data and what sorts of decision might it help inform?

#### Discussion

To reveal some of our own answers to these questions, click on the triangle symbol on the left-hand end of this cell.

<div class="answer">Using this table on its own, we can try to find the most popular genres, or which combinations of genre appear together most frequently. We could also look to see which genre combinations are the most unlikely.<br/>
<br/>
In combination with other tables, we could ask which genres are favoured by a particular director or actor, or which are the most popular movies in a particular genre or combination of genres, either overall, or by year of release.<br/>
<br/>
As far as decisions go, we might be interested in knowing which jobbing actors (lower ranked in terms of movie billing) may be willing to take on a role in a comedy adventure movie in Iceland, based on the genres and locations of movies they have participated in to date.</div>
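One possible way to count genres and genre pairs is sketched below. The sample rows mimic the columns of `movie_genre.csv`; the `movie_id` values and genre labels are illustrative placeholders rather than real records. Pairs are found by joining the table to itself on `movie_id`, which is one technique among several.

```python
import pandas as pd

# Placeholder rows mimicking movie_genre.csv.
# The movie_id values and genre labels are illustrative only.
genre_sample = pd.DataFrame({
    'movie_id': [1, 1, 2, 2, 3],
    'genre': ['Adventure', 'Comedy', 'Adventure', 'Comedy', 'Drama'],
})

# Which genres are attached to the most movies?
genre_counts = genre_sample['genre'].value_counts()

# Which pairs of genres appear together most often?
# Self-join on movie_id to get every pair of genres within a movie...
pairs = genre_sample.merge(genre_sample, on='movie_id', suffixes=('_a', '_b'))
# ...then keep each unordered pair once and drop a genre paired with itself.
pairs = pairs[pairs['genre_a'] < pairs['genre_b']]
pair_counts = (pairs.groupby(['genre_a', 'genre_b'])
                    .size()
                    .sort_values(ascending=False))

print(genre_counts)
print(pair_counts)
```

Filtering on `genre_a < genre_b` relies on alphabetical ordering of the labels, so that a pair such as Adventure and Comedy is counted once rather than twice.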

---

## Activity
Try to come up with some of your own questions relating to the Movies dataset, either applied to one table or shared across many of them, and share them in the course forums.

In later notebooks, where you will have a chance to practise turning questions into queries, you may want to pick up on the challenge of answering particular questions shared in the forums by turning them into queries that you can apply to the Movies dataset.

## Summary
In this notebook you have familiarised yourself with the data contained in the Movies dataset and started to formulate your own questions around the dataset.

You should also have started to think about any data quality issues and the extent to which these may affect the results of any query you run on the database.

Although this preliminary work has been done in the context of *pandas* dataframes, you will be using this dataset in your exploration of the PostgreSQL relational database management system.

## What next?

Move on to the notebook [08.2 Querying the Movies Database](http://127.0.0.1:35180/notebooks/Part%2008%20Notebooks/08.2%20Querying%20the%20Movies%20Database.ipynb) to practise converting some of your questions to queries.