### Movie Recommender System Data Scraping

In [25]:
import pandas as pd
import numpy as np

### MovieLens

#### movies.dat

The file contains data on 3,883 movies in the format **MovieID::Title::Genres**.


|Item    | Description    |
|------ | ------|
|MovieID    | an integer, ranging from 1 to 3952, that identifies a movie   |
|Title    | a string that concatenates the movie title and year of release (in parentheses)   |
|Genres    | a list of genres    |

The first step is to read the movies dataset into a pandas dataframe.

In [30]:
# Load in the dataset
movies = pd.read_csv('movies.dat', delimiter = '::', names = ['movieId', 'title', 'genres'],engine = 'python')

In [31]:
print('There are', movies.shape[0], 'movies in the dataset.')

There are 3883 movies in the dataset.
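
A note on encoding: the failed titles later in this notebook include a mangled accented character, which suggests the file is not UTF-8. As a minimal sketch, assuming the file is Latin-1 encoded (an assumption to check against the dataset's README), passing `encoding` to `read_csv` keeps such characters intact. The temporary file, its two rows and the name `movies_demo` are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Two illustrative rows in the MovieID::Title::Genres layout (not the real file)
sample = (
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "73::Misérables, Les (1995)::Drama|Musical\n"
)

# Write the sample as Latin-1 text to mimic the assumed on-disk encoding
with tempfile.NamedTemporaryFile('w', suffix='.dat', delete=False,
                                 encoding='latin-1') as f:
    f.write(sample)
    path = f.name

# engine='python' is needed because '::' is a multi-character separator
movies_demo = pd.read_csv(path, sep='::', names=['movieId', 'title', 'genres'],
                          engine='python', encoding='latin-1')
os.remove(path)

print(movies_demo)
```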


In [32]:
# Look at the dataframe
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [33]:
# Check the data types
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3883 entries, 0 to 3882
Data columns (total 3 columns):
movieId    3883 non-null int64
title      3883 non-null object
genres     3883 non-null object
dtypes: int64(1), object(2)
memory usage: 91.1+ KB


I will remove the year of release from the movie title in the movies dataframe and move it to a newly created column.

In [34]:
# Add the release year (the four digits in the trailing parentheses) to a new column
movies['year'] = movies['title'].str.extract(r'\((\d{4})\)\s*$', expand=False)
# Remove the release year from the title
movies['title'] = movies['title'].astype(str).str[:-6]
# Remove any remaining parenthesised text (e.g. alternative titles) from the title
movies['title'] = movies['title'].str.replace(r"\(.*\)", "", regex=True)

In [35]:
# Look at the first few rows again
movies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Animation|Children's|Comedy,1995
1,2,Jumanji,Adventure|Children's|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama,1995
4,5,Father of the Bride Part II,Comedy,1995
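
The year and title can also be split in one pass. The sketch below is an alternative, not the approach used above: it assumes every title ends with a four-digit year in parentheses and uses a single anchored regular expression, which avoids picking up digits that belong to the title itself. The sample titles and the names `sample` and `parts` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({'title': ['Toy Story (1995)',
                                 '2001: A Space Odyssey (1968)',
                                 'Postino, Il (The Postman) (1994)']})

# Group 1: everything before the final "(YYYY)"; group 2: the year itself
parts = sample['title'].str.extract(r'^(.*?)\s*\((\d{4})\)\s*$')
sample['clean_title'] = parts[0]
sample['year'] = parts[1].astype(int)

print(sample[['clean_title', 'year']])
```

Unlike the slicing approach, this leaves no trailing space on the title and keeps any alternative title in parentheses, which can then be handled separately.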


### Data Scraping

In [36]:
import omdb
import requests
import re
import json
import numpy as np
import pandas as pd

To build models for content-based filtering I need information about the movies beyond the user ratings. I will extract this from the Internet Movie Database (IMDb), an online database of movie and television data that catalogues all relevant information about movies, e.g. actors, director, runtime, plot summary, awards and trivia.

## OMDb Scraping

The OMDb API is a free web service for obtaining movie information. It will be used to scrape the relevant data from IMDb.

First I will create a list of movies from the movies dataframe.

In [37]:
# Create a list
movies_list = list(movies.title)

print('There are', len(movies_list), 'movies in the list.\n')

# Let's look at the first few items in the list
movies_list[:10]

There are 3883 movies in the list.



['Toy Story ',
 'Jumanji ',
 'Grumpier Old Men ',
 'Waiting to Exhale ',
 'Father of the Bride Part II ',
 'Heat ',
 'Sabrina ',
 'Tom and Huck ',
 'Sudden Death ',
 'GoldenEye ']

I will create empty lists to store all the scraped information as well as just the movie information I am looking for. Storing everything is useful: if I want to add new features in the future I will not have to scrape again.

I will also create an empty list to store movie titles that do not return a result. The information provided with the dataset states that the data may be incorrect: misspelled, incomplete, duplicated entries, etc.

In [38]:
movie_info = []
feature_info = []
bad_movies = []

I am looking for the following IMDb information: movie title, plot summary, top actors, director, runtime, certificate and IMDb score.

In [39]:
# OMDb url
url = 'http://www.omdbapi.com/?apikey=e3bdfa2c&t={}'


# Loop through the movies in the list
for movie in movies_list:
    
    movie = movie.replace(' ', '+')

    try:
        
        response = requests.get(url.format(movie))
        info = response.json()
        movie_info.append([info])
        
        title = info['Title']
        plot = info['Plot']
        actors = info['Actors']
        director = info['Director']
        runtime = int(info['Runtime'][:-4])
        cert = info['Rated']
        imdb = float(info['imdbRating'])
        
        feature_info.append([title, plot, actors, director, runtime, cert, imdb])
            
    except:
        
        # Movies that cannot be scraped are saved to bad_movies for analysis
        bad_movies.append([movie])
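
Replacing spaces with `+` leaves other characters (apostrophes, ampersands, accented letters) unescaped. As a hedged sketch, the query string can instead be built with the standard library's `urlencode`, which escapes every reserved character. The helper name `build_omdb_url` and the placeholder key are hypothetical; only the `apikey` and `t` parameters are taken from the URL above.

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.omdbapi.com/'

def build_omdb_url(title, api_key):
    """Return a title-search URL with the title safely percent-encoded."""
    # strip() drops the trailing space left behind when the year was removed
    query = urlencode({'apikey': api_key, 't': title.strip()})
    return BASE_URL + '?' + query

# 'YOUR_KEY' is a placeholder, not a working key
url_plain = build_omdb_url('Toy Story ', 'YOUR_KEY')
url_accent = build_omdb_url('Misérables, Les', 'YOUR_KEY')
print(url_plain)
print(url_accent)
```

Passing the same dictionary as `params=` to `requests.get` applies equivalent encoding without building the URL by hand.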

In [41]:
print(len(feature_info), 'movies have been successfully scraped.')
print(len(bad_movies), 'movies have not been scraped.')

2978 movies have been successfully scraped.
905 movies have not been scraped.


After the initial scrape using the OMDb API, 2978 movies were successfully scraped and 905 were not. Let's look at the first few failures to try to determine why they couldn't be scraped.

In [42]:
bad_movies[:10]

[['American+President,+The+'],
 ['City+of+Lost+Children,+The+'],
 ['Usual+Suspects,+The+'],
 ['Big+Green,+The+'],
 ['Postino,+Il++'],
 ['Confessional,+The++'],
 ['Indian+in+the+Cupboard,+The+'],
 ['Mis\xe9rables,+Les+'],
 ['Crossing+Guard,+The+'],
 ['Juror,+The+']]

The failures fall into a few groups:

- titles with transposed articles, i.e. titles ending in
    - , The
    - , Les
    - , L'
    - , La
    - etc.
- titles containing both English and foreign names
- titles with misspellings or mis-encoded characters, e.g. Mis\xe9rables
- titles with numbers in place of letters, e.g. Se7en

I will define a function that removes the transposed articles from the end of the movie titles and prepends them to the beginning.

In [43]:
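# Transposed articles to move from the end of a title to the front
def_list = [", The", ", Las", ", La", ", Les", ", An", ", A", ", Le", ", El", ", Der", ", Das"]

def amendTitle(title, article):
    # Leave the title unchanged (apart from trailing spaces) if the article is absent
    if title.find(article) == -1:
        return title.rstrip()
    # Otherwise drop the article and prepend it without its leading ", "
    # Note: find/replace match the article anywhere in the title, not only at the end
    title = title.replace(article, '')
    return (article[2:] + ' ' + title).rstrip()

# Apply the clean-up once per article, passing the article explicitly
for article in def_list:
    movies['title'] = movies['title'].apply(amendTitle, args=(article,))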

# Create a new list of the movie titles and movieId
movies_list = list(movies.title)
movieID_list = list(movies.movieId)

print('Confirm that there are', len(movies_list), 'movies in the list.')

Confirm that there are 3883 movies in the list.
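
Because `find` and `replace` match the article anywhere in the title, a title containing a comma followed by a word that starts with "A" or "An" could be altered by mistake. A stricter variant, sketched below under the assumption that a transposed article always sits at the very end of the title, uses one anchored regular expression. `move_article`, the article list and the sample titles are illustrative and would need extending for other languages.

```python
import re

# An article is only moved when it follows the final comma at the end of the title
ARTICLE_RE = re.compile(r"^(.*), (The|Las|Les|La|Le|An|A|El|Der|Das|L')\s*$")

def move_article(title):
    """Move a trailing ', Article' to the front of the title."""
    match = ARTICLE_RE.match(title)
    if match is None:
        return title.strip()
    body, article = match.groups()
    # "L'" attaches directly to the next word; other articles take a space
    separator = '' if article.endswith("'") else ' '
    return article + separator + body.strip()

for raw in ['Usual Suspects, The ', 'Misérables, Les ',
            "Avventura, L' ", 'Cats, Dogs, And Birds ']:
    print(repr(move_article(raw)))
```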


I will perform another scrape with the cleaned movie titles.

In [44]:
# Create new blank lists to store the scraped movie information
movie_info = []
feature_info = []
bad_movies = []

# Loop through the movies in the list
for movie in movies_list:
    
    movie = movie.replace(' ', '+')

    try:
        
        response = requests.get(url.format(movie))
        info = response.json()
        movie_info.append([info])
        
        title = info['Title']
        plot = info['Plot']
        actors = info['Actors']
        director = info['Director']
        runtime = int(info['Runtime'][:-4])
        cert = info['Rated']
        imdb = float(info['imdbRating'])
        
        feature_info.append([title, plot, actors, director, runtime, cert, imdb])
            
    except:
        
        # Movies that cannot be scraped are saved to bad_movies for analysis
        bad_movies.append([movie])

In [45]:
print(len(feature_info), 'movies have been successfully scraped.')
print(len(bad_movies), 'movies have not been scraped.')

3699 movies have been successfully scraped.
184 movies have not been scraped.


These 184 movies have to be amended manually in the MovieLens dataset. Some correctly spelled titles also failed to scrape. The raw JSON returned by the OMDb API shows that these movies are missing one or more of the features: title, plot, actors, director, runtime, cert, imdb_rating, imdb_id or poster.


To deal with these movies I wrapped each feature in its own try/except, replacing missing information with None. I have also added two features to the scrape, as they will be useful later:

- IMDb Id - a unique identifier for each movie that will be useful for future searches
- poster - a link to the poster artwork of the film, useful for displaying recommended movies in a more user-friendly way

I need to import the amended dataset to ensure that movies are scraped accurately. I will apply all the functions to clean the titles and create a new movies list.

In [46]:
# Create new blank lists to store the scraped movie information
movie_info = []
feature_info = []
bad_movies = []

# Loop through the movies in the list
for movie, ID in zip(movies_list, movieID_list):

    movie = movie.replace(' ', '+')
    
    try:
        
        response = requests.get(url.format(movie))
        info = response.json()
        movie_info.append([info])
        
        try:
            title = info['Title']
        except:
            title = None
        try:
            plot = info['Plot']
        except:
            plot = None
        try:
            actors = info['Actors']
        except:
            actors = None
        try:
            director = info['Director']
        except:
            director = None
        try:
            runtime = int(info['Runtime'][:-4])
        except:
            runtime = None
        try:
            cert = info['Rated']
        except:
            cert = None
        try:
            imdb = float(info['imdbRating'])
        except:
            imdb = None
        try:
            imdb_id = (info['imdbID'])
        except:
            imdb_id = None    
        try:
            poster = (info['Poster'])
        except:
            poster = None
        
        feature_info.append([ID, title, plot, actors, director, runtime, cert, imdb, imdb_id, poster])
            
    except:
        
        bad_movies.append(movie)

In [48]:
print(len(feature_info), 'movies have been successfully scraped.')
print(len(bad_movies), 'movies have not been scraped.')

3883 movies have been successfully scraped.
0 movies have not been scraped.
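
The per-feature try/except blocks can be collapsed into a small helper. The sketch below is one possible shape, not the code that produced the results above. It assumes that OMDb signals a failed lookup with a `Response` field equal to `"False"` and marks missing values as `"N/A"`; both are assumptions to verify against real responses, and `extract_features` and the sample payloads are hypothetical.

```python
def extract_features(info, movie_id):
    """Return one feature row, or None when the lookup itself failed."""
    if info.get('Response') == 'False':
        return None

    def field(key):
        # Treat absent keys and the 'N/A' placeholder as missing
        value = info.get(key)
        return None if value in (None, 'N/A') else value

    runtime = field('Runtime')      # expected to look like '81 min'
    rating = field('imdbRating')    # expected to look like '8.3'
    return [
        movie_id,
        field('Title'),
        field('Plot'),
        field('Actors'),
        field('Director'),
        int(runtime.split()[0]) if runtime else None,
        field('Rated'),
        float(rating) if rating else None,
        field('imdbID'),
        field('Poster'),
    ]

# Hand-written payloads for illustration only
found = {'Response': 'True', 'Title': 'Toy Story', 'Runtime': '81 min',
         'imdbRating': '8.3', 'imdbID': 'tt0114709', 'Plot': 'N/A'}
missing = {'Response': 'False', 'Error': 'Movie not found!'}

row_found = extract_features(found, 1)
row_missing = extract_features(missing, 2)
print(row_found)
print(row_missing)
```

Returning `None` for a failed lookup lets the calling loop append that title to `bad_movies` instead of counting it as scraped.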


All movies have now been scraped from the OMDb API, with any missing features stored as None. Next I will convert the feature information into a new dataframe for further analysis.

In [49]:
# Create a dataframe from the scraped movie information
imdb = pd.DataFrame(feature_info, columns=['movieId', 'title', 'Plot', 'actors', 
                                           'director', 'runtime', 'cert', 'imdb_rating', 'imdb_id', 'poster'])

In [51]:
# Look at the dataframe in more detail
imdb

Unnamed: 0,movieId,title,Plot,actors,director,runtime,cert,imdb_rating,imdb_id,poster
0,1,Toy Story,A cowboy doll is profoundly threatened and jea...,"Tom Hanks, Tim Allen, Don Rickles, Jim Varney",John Lasseter,81.0,G,8.3,tt0114709,https://m.media-amazon.com/images/M/MV5BMDU2ZW...
1,2,Jumanji,When two kids find and play a magical board ga...,"Robin Williams, Jonathan Hyde, Kirsten Dunst, ...",Joe Johnston,104.0,PG,6.9,tt0113497,https://m.media-amazon.com/images/M/MV5BZTk2Zm...
2,3,Grumpier Old Men,John and Max resolve to save their beloved bai...,"Walter Matthau, Jack Lemmon, Sophia Loren, Ann...",Howard Deutch,101.0,PG-13,6.6,tt0113228,https://m.media-amazon.com/images/M/MV5BMjQxM2...
3,4,Waiting to Exhale,"Based on Terry McMillan's novel, this film fol...","Whitney Houston, Angela Bassett, Loretta Devin...",Forest Whitaker,124.0,R,5.8,tt0114885,https://m.media-amazon.com/images/M/MV5BYzcyMD...
4,5,Father of the Bride Part II,George Banks must deal not only with the pregn...,"Steve Martin, Diane Keaton, Martin Short, Kimb...",Charles Shyer,106.0,PG,6.0,tt0113041,https://m.media-amazon.com/images/M/MV5BOTEyNz...
5,6,Heat,A group of professional bank robbers start to ...,"Al Pacino, Robert De Niro, Val Kilmer, Jon Voight",Michael Mann,170.0,R,8.2,tt0113277,https://m.media-amazon.com/images/M/MV5BNDc0YT...
6,7,Sabrina,A playboy becomes interested in the daughter o...,"Humphrey Bogart, Audrey Hepburn, William Holde...",Billy Wilder,113.0,Not Rated,7.7,tt0047437,https://m.media-amazon.com/images/M/MV5BYmFlNT...
7,8,Tom and Huck,Two best friends witness a murder and embark o...,"Jonathan Taylor Thomas, Brad Renfro, Eric Schw...",Peter Hewitt,97.0,PG,5.6,tt0112302,https://m.media-amazon.com/images/M/MV5BN2ZkZT...
8,9,Sudden Death,A former fireman takes on a group of terrorist...,"Jean-Claude Van Damme, Powers Boothe, Raymond ...",Peter Hyams,111.0,R,5.7,tt0114576,https://m.media-amazon.com/images/M/MV5BN2NjYW...
9,10,GoldenEye,James Bond teams up with the lone survivor of ...,"Pierce Brosnan, Sean Bean, Izabella Scorupco, ...",Martin Campbell,130.0,PG-13,7.2,tt0113189,https://m.media-amazon.com/images/M/MV5BMzk2OT...
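
The scraped frame can then be integrated with the MovieLens data through a join on `movieId`. This sketch assumes both frames share that key, as they do above; the two small frames are illustrative stand-ins for `movies` and `imdb`, and the left join is a choice that keeps every MovieLens row even when no IMDb data was found.

```python
import pandas as pd

movies_demo = pd.DataFrame({'movieId': [1, 2, 3],
                            'title': ['Toy Story', 'Jumanji', 'Grumpier Old Men'],
                            'genres': ["Animation|Children's|Comedy",
                                       "Adventure|Children's|Fantasy",
                                       'Comedy|Romance']})
imdb_demo = pd.DataFrame({'movieId': [1, 2],
                          'title': ['Toy Story', 'Jumanji'],
                          'imdb_rating': [8.3, 6.9]})

# Suffixes distinguish the MovieLens title from the title returned by OMDb
merged = movies_demo.merge(imdb_demo, on='movieId', how='left',
                           suffixes=('_ml', '_imdb'))

# Rows with no scraped match have NaN in the IMDb columns
unmatched = merged[merged['imdb_rating'].isna()]
print(merged)
print(len(unmatched), 'movie(s) without IMDb data')
```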


I am going to save the scraped data above to a CSV file named movies_full.csv.

In [None]:
#imdb.to_csv('movies_full.csv')
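
When saving, note that `to_csv` writes the row index as an extra unnamed column by default, and that column comes back as `Unnamed: 0` when the file is read again. A minimal sketch of the round trip with `index=False`, using in-memory buffers and a made-up two-row frame in place of `imdb` and `movies_full.csv`:

```python
import io

import pandas as pd

demo = pd.DataFrame({'movieId': [1, 2], 'title': ['Toy Story', 'Jumanji']})

# With the default index=True the index becomes an unnamed first column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False writes only the real columns
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```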

### Summary

I successfully scraped the movie data from the Internet Movie Database using the OMDb API and integrated it with the MovieLens datasets. I described the process for data extraction, data cleaning and integration.

Some of the major difficulties encountered were:

- IMDb data extraction - needed to develop pre-processing routines to clean up the movie titles at several stages
- Missing data - several titles were missing data and the scraping code had to be amended to handle the missing fields
- Matching IMDb and MovieLens movie titles - had to deal with special characters, misspellings, transposed articles, foreign and alternative titles. Some of these had to be amended manually, which was time-consuming.