# ML: Movie Learning

We're going to do some form of movie text wrangling using Python. First, we need to collect some data.

## 1. Data Collection

### 1.1. The OMDb API

We're going to use the OMDb API, which is free once again (http://www.omdbapi.com/). To use it, you need an API key. A free key is limited to 1,000 requests per day. More requests (and a poster API) are available if you support OMDb on Patreon. The OMDb website explains how the API works pretty well. We'll use the `requests` package to make calls to the OMDb API.

There are two ways we can get movie data using this API: by movie title or by IMDb ID. We'll want to be able to handle either, as I have a feeling that it will be easier to get a random list of IMDb IDs than a random list of movie titles. It is also more exact to use IDs, since movie titles aren't unique (e.g., "The Mummy" can refer either to the Brendan Fraser masterpiece or to the Tom Cruise dumpster fire).
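Here is a minimal sketch of how the two lookup modes could map onto query parameters. The helper name `build_omdb_params` is hypothetical and the ID in the example is a made-up placeholder; `t`, `i`, `y`, and `plot` are the parameter names listed in the OMDb documentation.

```python
def build_omdb_params(title=None, imdb_id=None, year=None, full_plot=False):
    """Build query parameters for a title or IMDb ID lookup (hypothetical helper)."""
    # Exactly one of the two identifiers must be given
    if (title is None) == (imdb_id is None):
        raise ValueError('Pass exactly one of title or imdb_id.')
    # 'i' looks up by IMDb ID, 't' looks up by title
    params = {'i': imdb_id} if imdb_id is not None else {'t': title}
    # The year only disambiguates title lookups; an ID is already unique
    if year is not None and imdb_id is None:
        params['y'] = year
    if full_plot:
        params['plot'] = 'full'
    return params


print(build_omdb_params(title='The Mummy', year=1999))
print(build_omdb_params(imdb_id='tt0000000'))  # placeholder ID, not a real film
```

The resulting dict can be handed to `requests.get(..., params=...)`, which takes care of URL encoding.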

Note, I'm using format strings (`f'some {text}'`), which are a Python 3.6 feature (equivalent to `'some {}'.format(text)`).
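A quick check of that equivalence, with a made-up value for `text`:

```python
text = 'movie'  # made-up value for illustration

with_fstring = f'some {text}'
with_format = 'some {}'.format(text)

# Both spellings build the same string
print(with_fstring == with_format)
```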

In [1]:
import os
import json

import requests

from dotenv import load_dotenv, find_dotenv
#find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()
# load up the entries as environment variables
load_dotenv(dotenv_path)

True

In [2]:
API_KEY = os.environ.get('OMDB_API_KEY')

def get_movie_data(name, year=None, api_key=API_KEY, full_plot=False):
    """Returns json from OMDb API for movie."""
    
    api_url = f'http://www.omdbapi.com/?apikey={api_key}'
    # `requests` URL-encodes anything passed via `params`, so the title
    # needs no manual escaping (replacing ' ' with '+' by hand would send
    # literal plus signs, encoded as %2B)
    
    # Can either manually extend the url with parameters or...
    #api_url += f'&t={name}'
    # if year is not None:
    #     api_url += f'&y={year}'
    # if full_plot:
    #     api_url += '&plot=full'
    # response = requests.get(api_url)
    
    # ... have `requests` do it for you!
    body = {'t': name}
    if year is not None:
        body['y'] = year
    if full_plot:
        body['plot'] = 'full'
    response = requests.get(api_url, params=body)
    
    # Throw error if API call has an error
    if response.status_code != 200:
        raise requests.HTTPError(
            f'Couldn\'t call API. Error {response.status_code}.'
        )
     
    # Parse the body once, then throw error if movie not found
    data = response.json()
    if data['Response'] == 'False':
        raise ValueError(data['Error'])

    return data

Let's test it out, and see what kind of information we get from our request.

In [3]:
response_json = get_movie_data('Snakes on a Plane')
print(json.dumps(response_json, indent=4))

{
    "Title": "Snakes on a Plane",
    "Year": "2006",
    "Rated": "R",
    "Released": "18 Aug 2006",
    "Runtime": "105 min",
    "Genre": "Action, Adventure, Crime",
    "Director": "David R. Ellis",
    "Writer": "John Heffernan (screenplay), Sebastian Gutierrez (screenplay), David Dalessandro (story), John Heffernan (story)",
    "Actors": "Samuel L. Jackson, Julianna Margulies, Nathan Phillips, Rachel Blanchard",
    "Plot": "An F.B.I. Agent takes on a plane full of deadly venomous snakes, deliberately released to kill a witness being flown from Honolulu to Los Angeles to testify against a mob boss.",
    "Language": "English",
    "Country": "Germany, USA, Canada",
    "Awards": "3 wins & 7 nominations.",
    "Poster": "https://m.media-amazon.com/images/M/MV5BZDY3ODM2YTgtYTU5NC00MTE4LTkzNjktMzNhZWZmMzJjMWRjXkEyXkFqcGdeQXVyMTQxNzMzNDI@._V1_SX300.jpg",
    "Ratings": [
        {
            "Source": "Internet Movie Database",
            "Value": "5.5/10"
        },
        {


There are all sorts of fun things we could do with this data, such as:
1. Predict something based off of the description, like the Rotten Tomatoes score (or just certified fresh or rotten), or the genre, or if Samuel L. Jackson is in the movie or not.
2. Train a sentiment analysis algorithm using the score as a proxy for positivity.
3. Predict the BoxOffice based off of engineered features, like the cast, genre, whether or not Sam Jack is in it...
4. Generate a short synopsis based off of the movie poster.

Genre classification might be a fun one, but it is well-trodden territory at this point. Maybe we'll train a classifier to determine, based on a movie pitch (aka plot summary), whether or not Samuel L. Jackson would be in it. This seems like a nice balance between interesting and easy to do.

Note there is a `Response` field in the response, which is `"True"` if the movie is found, and `"False"` if not. At the moment I feel like it would be better to catch this early on in the process, so we added a few lines to throw an Exception when the movie is not found in the database.

In [4]:
try:
    get_movie_data('Snakes on a Plan')
except Exception as e:
    print('Exception:', e)

Exception: Movie not found!


Okay, now we just need a giant list of either movie names or IMDb IDs...

### 1.2. Getting a List of Movies

Using `requests` we now have a way of extracting data about a film, given its title. Now all we need is a list of movies over which to iterate. After a bit of googling, I found a Wikipedia page listing movies, but it is organized through several subpages. We'll use the BeautifulSoup package to crawl these pages to generate our list of movies.

There are two paths of attack:
1. `https://en.wikipedia.org/wiki/Lists_of_films` aims to catalog every movie ever made, classified under a few different hierarchies.
2. `https://en.wikipedia.org/wiki/Lists_of_American_films` is just American films, organized by year. This is maybe easier to work with, since the organization is more orderly.

Since this project is ultimately stupid, I'll just do the easier thing and use the American movie list. Each year has its own page, and the URLs for these pages are very consistent, so they are easy to crawl. Each year's page lists its movies in tables, which appear to always have the same columns.

For this part I used this tutorial https://goo.gl/Bm1cdD (sort of).

A web page is basically a tree with nodes labeled by tags. BeautifulSoup packages this tree in an object that is easy to navigate and search, based off of the html tags attached to the nodes. Here's how we are going to pull the movie data for any given year.
1. Pull the raw HTML by passing the URL to a `GET` request with `requests`.
2. Dump the html string into a BeautifulSoup object.
3. From inspecting the raw html it looks like the data we want is always in a table, which we can find with the `table` tag. There are some other tables, such as the footers and page navigation that we don't want. Again from inspection, it looks like the tables we do want always have `class=wikitable`.
4. The tables are then split into rows. A row either consists of headers (`th` table header tag) or data (`td` table data tag). We pull this, put it in a data frame if we actually get data, and skip if we get an error putting it into the data frame.
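Steps 2 to 4 can be tried on a small hand-written HTML string before touching Wikipedia. The table contents below are made up for illustration; the `BeautifulSoup` calls are the standard `find_all` lookups.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page (step 1 is skipped here)
html = """
<table class="navbox"><tr><td>navigation</td></tr></table>
<table class="wikitable">
  <tr><th>Title</th><th>Director</th></tr>
  <tr><td>Film A</td><td>Director A</td></tr>
  <tr><td>Film B</td><td>Director B</td></tr>
</table>
"""

# Step 2: dump the HTML string into a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')
# Step 3: keep only tables with class "wikitable", skipping the navbox
tables = soup.find_all('table', {'class': 'wikitable'})
# Step 4: first row holds the headers (th), the rest hold the data (td)
rows = tables[0].find_all('tr')
columns = [th.text.strip() for th in rows[0].find_all('th')]
data = [[td.text.strip() for td in row.find_all('td')] for row in rows[1:]]

print(columns)
print(data)
```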

When we first wrote the `fetch_movie_data` function below, it had neither the `drop_colspan` feature nor the 1919 correction. When it was run on all years, there were a few problems:
1. Starting in 2014, the movie lists have an "Opening" data column, whose entries span multiple rows. Even though our script didn't throw any errors for 2015, the data gathered is wrong: actors are listed as movie titles, titles are listed as opening dates, and so on. The header cells for these columns have `colspan` attributes larger than 1. To fix this we added the `drop_colspan` flag, which leaves out columns that span multiple cells.
2. There is a "typo" on the page for 1919: the very last row of the table has an extra cell. I'm not sure what the best way to deal with this one edge case is. I ended up counting the number of columns read, and only taking that many cells from the data rows.

There are definitely better ways to handle these problems. In particular, my solution assumes the multi-column headers are all up front in the table, and that the multi-column headers and the 1919 typo never coincide on a page.
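One way to drop the "all up front" assumption is to track each header's position in the row. This is a rough sketch on plain lists, not the approach used in this notebook: the helper `keep_single_span_columns` and the sample rows are made up, and it assumes every data row has one cell per column position (so it does not handle `rowspan`).

```python
def keep_single_span_columns(headers, rows):
    """Keep only the cells that sit under single-column headers.

    headers: list of (name, colspan) pairs, in page order.
    rows: list of lists of cell strings.
    """
    columns, keep = [], []
    position = 0
    for name, colspan in headers:
        if colspan == 1:
            columns.append(name)
            keep.append(position)
        # Advance past every column position this header covers
        position += colspan
    # Stray trailing cells (like the extra cell on the 1919 page) are
    # ignored because their positions are never in `keep`
    trimmed = [[row[i] for i in keep if i < len(row)] for row in rows]
    return columns, trimmed


headers = [('Title', 1), ('Opening', 2), ('Director', 1)]
rows = [
    ['Film A', 'JAN', '5', 'Director A'],
    ['Film B', 'JAN', '12', 'Director B', 'stray cell'],
]
columns, trimmed = keep_single_span_columns(headers, rows)
print(columns)
print(trimmed)
```

Because positions are computed from the header row, the multi-column header can sit anywhere in the table, and the same trimming also covers a row with one cell too many.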

In [5]:
from bs4 import BeautifulSoup
import pandas as pd
import time

In [6]:
def fetch_movie_data(url, drop_colspan=True):
    # Get the response from GET request
    response = requests.get(url)
    # Throw error if the page request fails
    if response.status_code != 200:
        raise requests.HTTPError(
            f'Couldn\'t fetch {url}. Error {response.status_code}.'
        )
  
    soup = BeautifulSoup(response.text, 'html.parser')

    # Get all tables on the site where the class is wikitable
    # This prevents it from including layout tables, like the Wikipedia footer etc
    tables = soup.find_all('table', {'class': 'wikitable'})

    fetched = []
    for table in tables:
        # Get all table rows (tr) tag
        rows = table.find_all('tr')
        # Assuming first row is headers, look for table header (th) tags
        if drop_colspan:
            columns = []
            drop = 0
            for col in rows[0].find_all('th'):
                # Skip headers spanning more than one column, but count
                # how many columns they cover
                colspan = int(col.attrs.get('colspan', 1))
                if colspan > 1:
                    drop += colspan
                else:
                    columns.append(col.text.strip())
        else:
            columns = [x.text.strip() for x in rows[0].find_all('th')]
        
        # Assuming remaining rows are data, look for table data (td) tags
        data = [[x.text.strip() for x in row.find_all('td')] for row in rows[1:]]
        
        if drop_colspan:
            # Assuming the multicols are all in front
            if drop > 0:
                data = [row[-len(columns):] for row in data]
        # Deal with error in the 1919 Wikipedia page: keep only as many
        # cells per row as there are columns
        data = [row[:len(columns)] for row in data]
        # Make sure we got data. 
        # If for whatever reason there is an error, don't include
        if data and columns:
            try:
                df = pd.DataFrame(data, columns=columns)
            except:
                continue
            fetched.append(df)
    return pd.concat(fetched, sort=False)

In [7]:
movies_1952 = fetch_movie_data('https://en.wikipedia.org/wiki/List_of_American_films_of_1952')

In [8]:
movies_1952.tail()

| | Title | Director | Cast | Genre | Notes/Studio | Notes |
|---|---|---|---|---|---|---|
| 91 | Yankee Buccaneer | Frederick de Cordova | Jeff Chandler, Scott Brady, Suzan Ball | Adventure | | Universal |
| 92 | You for Me | Don Weis | Jane Greer, Peter Lawford | Romance | | MGM |
| 93 | Young Man with Ideas | Mitchell Leisen | Glenn Ford, Ruth Roman | Comedy | | MGM |
| 94 | Yukon Gold | Frank McDonald | Kirby Grant, Martha Hyer | Western | | Monogram |
| 95 | Zombies of the Stratosphere | Fred C. Brannon | | Serial | | Republic |


Now that we've gotten things to appear to work for one year, let's creep that net. The above tutorial suggests using a timer to prevent us from sending too many requests to Wikipedia and getting blocked. I'm expecting the `fetch_movie_data` to fail sometimes, so let's collect our data in a dict. That way we can know exactly what years failed and what years we have data for.

In [None]:
# For each year from 1900 to 2017
movie_data = {}
for year in range(1900, 2018):
    url = f'https://en.wikipedia.org/wiki/List_of_American_films_of_{year}'
    try:
        data = fetch_movie_data(url)
        movie_data[year] = data
    except Exception as e:
        # Catch Exception rather than everything, so Ctrl-C still stops the loop
        print(f'Something went wrong fetching data for {year}: {e}')
    # Pause for 2 sec to not overwhelm Wikipedia
    time.sleep(2)

This took maybe 10 minutes to run. 

We really only want the title and the year, so let's get that. We could also get the cast and genre from here, but I feel like the OMDb versions of this will be more consistent.

In [None]:
for year, movies in movie_data.items():
    movies['Year'] = year
movies_full = pd.concat([movies[['Title', 'Year']] for movies in movie_data.values()])

In [None]:
len(movies_full)

We have ~26K movies. At 1,000 requests per day on the free OMDb key, it would take about 26 days to get data for all of them. Maybe we should kick them a buck to do this faster.
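As a quick back-of-the-envelope check, assuming roughly 26,000 titles and the 1,000 requests per day quoted for the free key (both round numbers):

```python
import math

n_movies = 26_000          # approximate number of rows in movies_full
requests_per_day = 1_000   # free-key limit quoted earlier

# One request per movie
best_case_days = math.ceil(n_movies / requests_per_day)
# Two requests per movie if every lookup with a year fails and is retried
worst_case_days = math.ceil(2 * n_movies / requests_per_day)

print(best_case_days, worst_case_days)  # 26 52
```

The worst case only applies if failed lookups are retried without the year, as the loop in the next section does.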

### 1.3. Combining Everything

Note to future self: when running this, we got fewer results when requesting the full synopsis. When the full plot is requested, the API appears to return only movies that have a full synopsis.

In [None]:
all_teh_jsons = []
for row in movies_full.itertuples(index=False, name=None):
    try:
        # First try the title and year together
        response_json = get_movie_data(*row)
    except Exception:
        # If can't find with year, retry with the title alone
        try:
            response_json = get_movie_data(row[0])
        except Exception:
            # Continue if no data found
            continue
    # Keep every movie that comes back with a plot, whichever lookup found it
    if response_json['Plot'] != 'N/A':
        all_teh_jsons.append(response_json)