# Data Collection and Analysis of Movie Data
### Authors: Jan Hanzal, Eric Zila

First, we import the required packages:
* package `re` used for operations with regular expressions
* package `requests` used for communication with HTTP servers
* class `BeautifulSoup` from package `bs4` used for scraping
* package `pandas` used for data frame manipulation

In [16]:
import re
import requests

from bs4 import BeautifulSoup
import pandas as pd

When downloading data about movies from IMDb, we need an object that stores the scraped fields in an organized manner. We therefore define a separate class that accepts the downloaded values as parameters and stores them as instance variables, which can be retrieved later for analysis. The information needed for a given task is then converted into a data frame for easier use.

In [106]:
"""
Encapsulates information about a movie.
"""
class Movie:
    """
    Initiates Movie class.
    
    @param title -> title of the movie
    @param original_title -> original title of the movie
    @param worldwide_gross -> worldwide gross of the movie
    @param rating -> rating of the movie
    @param rating_count -> total number of ratings of the movie
    @param runtime -> total length of the movie
    @param director -> director of the movie
    @param release_date -> release date of the movie
    @param genres -> genres of the movie
    @param countries -> countries of the movie
    @param languages -> languages of the movie
    @param budget -> budget of the movie
    """
    def __init__(self, title, original_title, worldwide_gross, rating, 
                 rating_count, runtime, director, release_date, genres,
                 countries, languages, budget):
        self.title = title
        self.original_title = original_title
        self.worldwide_gross = worldwide_gross
        self.rating = rating
        self.rating_count = rating_count
        self.runtime = runtime
        self.director = director
        self.release_date = release_date
        self.genres = genres
        self.countries = countries
        self.languages = languages
        self.budget = budget
        
    """
    Builds a string containing all available information about the movie.
    
    @return String object containing all information
    """
    def to_string(self):
        string = "title: " + self.title + "\n"
        
        if self.original_title is not None:
            string += "original title: " + self.original_title + "\n"
            
        if self.worldwide_gross is not None:
            string += "worldwide gross: " + self.worldwide_gross + "\n"
            
        string += "rating: " + self.rating + "\n"
        string += "rating count: " + self.rating_count + "\n" 
        string += "runtime: " + self.runtime + "\n"
        
        if self.director is not None:
            string += "director: " + self.director + "\n"
        
        if self.release_date is not None:
            string += "release date: " + self.release_date + "\n" 
            
        if self.genres is not None:
            string += "genres: "
            for genre in self.genres:
                string += genre + " "
            string += "\n"
            
        if self.countries is not None:
            string += "countries: "
            for country in self.countries:
                string += country + " "
            string += "\n"
        
        if self.languages is not None:
            string += "languages: "
            for language in self.languages:
                string += language + " "
            string += "\n"
            
        if self.budget is not None:
            string += "budget: " + self.budget + "\n"
        
        return string
    
    """
    Builds a dictionary containing all available information about the movie.
    
    @return Dictionary object containing all information
    """    
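    def to_dict(self):
        # Scraped values are strings, or None when a field is missing on the page.
        # Convert defensively: float(None) and int("107,285") would both raise.
        def to_number(value, cast):
            if value is None:
                return None
            return cast(str(value).replace(",", ""))

        movie_dict = {
        'title' : self.title,
        'original_title' : self.original_title,
        'worldwide_gross' : to_number(self.worldwide_gross, float),
        'rating' : to_number(self.rating, float),
        'rating_count' : to_number(self.rating_count, int),
        'runtime' : to_number(self.runtime, float),
        'director' : self.director,
        'release_date' : self.release_date,
        'genres' : self.genres,
        'countries' : self.countries,
        'languages' : self.languages,
        'budget' : to_number(self.budget, float)
        }
        return movie_dict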
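
As a point of comparison, a container like `Movie` could also be written with the standard library's `dataclasses` module, which generates the constructor and a dictionary conversion automatically. The sketch below is an alternative illustration, not the class used in this notebook: `MovieRecord` is a hypothetical name and carries only four of the twelve fields.

```python
from dataclasses import dataclass, asdict, field
from typing import List, Optional


@dataclass
class MovieRecord:
    """Hypothetical, reduced stand-in for the Movie class (only a few fields)."""
    title: str
    rating: float
    genres: List[str] = field(default_factory=list)
    budget: Optional[float] = None  # None when the page lists no budget


record = MovieRecord(title="6 Underground", rating=6.1,
                     genres=["Action", "Comedy", "Thriller"])
row = asdict(record)  # plain dict, usable as one record of a data frame
print(row)
```

Optional fields default to `None`, which mirrors how the scraper reports values that are missing from a movie page.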

Our IMDb downloader class expects a link to an IMDb list page when initialized. Given such a link, it downloads information about every movie listed on that page, then follows the link to the next page and repeats the process. An optional limit caps the number of list pages processed; without it, the downloader continues until no next page is found.

In [18]:
"""
Downloader of movie data starting from an IMDb list page.
"""
class ImdbDownloader:
    """
    Initiates ImdbDownloader class.
    
    @param link -> link to the first IMDb list page to be downloaded from
    """
    def __init__(self, link):
        self.original_link = link # original link passed to the downloader
        self.movies = []
    
    """
    Controls the downloading process.
    
    @param pages -> number of pages to be downloaded, if not defined, all pages are downloaded
    """
    def start(self, pages=-1):
        cont = True # used for indication of reaching the last page
        count = 0 # counter of the number of movies downloaded
        cur_link = self.original_link # link to a website that is currently downloaded from
        
        while cont and pages != 0:
            cur_soup = self.get_soup(cur_link)
            ls = ImdbListScraper(cur_soup)
            cur_movie_links = ls.get_movie_links()
            
            for cur_movie_link in cur_movie_links:
                cur_movie_soup = self.get_soup(cur_movie_link)
                ms = ImdbMovieScraper(cur_movie_soup)
                
                title = ms.get_title()
                original_title = ms.get_original_title()
                worldwide_gross = ms.get_worldwide_gross()
                rating = ms.get_rating()
                rating_count = ms.get_rating_count()
                runtime = ms.get_runtime()
                director = ms.get_director()
                release_date = ms.get_release_date()
                genres = ms.get_genres()
                countries = ms.get_countries()
                languages = ms.get_languages()
                budget = ms.get_budget()
                
                movie = Movie(title, original_title, worldwide_gross, rating, 
                              rating_count, runtime, director, release_date,
                              genres, countries, languages, budget)
                self.movies.append(movie)
                
                count += 1
                if (count % 50) == 0 :
                    print("We have downloaded " + str(count) + " movies so far!")

            cur_link = ls.get_next_page_link()
            if cur_link is None:
                cont = False
                
            pages = pages - 1
    
    """
    Obtains the soup of a web page.
    
    @param link -> link to the web page a soup should be downloaded from
    
    @return BeautifulSoup object containing requested soup
    """
    def get_soup(self, link):
        request = requests.get(link)
        request.encoding = 'UTF-8'
        # name the parser explicitly; otherwise bs4 guesses one and emits a warning
        soup = BeautifulSoup(request.text, 'html.parser')
        
        return soup
    
    """
    Prints information about all movies that were scraped so far.
    """
    def print_movies(self):
        print("\nAbout to print " + str(len(self.movies)) + " movies!\n")
        
        for movie in self.movies:
            print(movie.to_string())
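
The loop in `start` stops on either of two conditions: the page counter reaches zero, or no next-page link is found. The sketch below reproduces only that control flow against a hypothetical in-memory "site" (`FAKE_PAGES` and `crawl` are made up for illustration), so it runs without any network access.

```python
# Hypothetical in-memory "site": each page lists its movies and the next page (or None).
FAKE_PAGES = {
    "page1": {"movies": ["A", "B"], "next": "page2"},
    "page2": {"movies": ["C"], "next": "page3"},
    "page3": {"movies": ["D"], "next": None},
}


def crawl(start_link, pages=-1):
    """Collect movies page by page; a negative `pages` means no limit."""
    collected = []
    cur_link = start_link
    while cur_link is not None and pages != 0:
        page = FAKE_PAGES[cur_link]
        collected.extend(page["movies"])
        cur_link = page["next"]  # None once the last page is reached
        pages -= 1               # a negative counter never hits 0, so it never stops the loop
    return collected


print(crawl("page1", pages=2))  # ['A', 'B', 'C']
print(crawl("page1"))           # ['A', 'B', 'C', 'D']
```

The default of `-1` works as "no limit" only because the counter is compared with `!= 0`; a comparison with `> 0` would end the loop immediately.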

In [19]:
"""
Scraper of information from an IMDb list page.
"""
class ImdbListScraper:
    """
    Initiates ImdbListScraper class.
    
    @param soup -> soup of an IMDb list page to be scraped
    """
    def __init__(self, soup):
        self.soup = soup
        
    """
    Obtains all movie links from an IMDb list soup.
    
    @return List object containing movie links
    """
    def get_movie_links(self):
        movies = self.soup.find_all('img', {'class':'loadlate'})
        links = ['https://www.imdb.com' + movie.parent['href'] for movie in movies]
        
        return links
    
    """
    Obtains a link to the next page from an IMDb list soup.
    
    @return String object containing link to the next page, None object if not found
    """
    def get_next_page_link(self):
        next_page = self.soup.find('a', {'class':'lister-page-next next-page'})
        
        if next_page is not None:
            return 'https://www.imdb.com' + next_page['href']
        else:
            return None
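
Both methods build absolute URLs by prepending `https://www.imdb.com` to a relative `href`. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves an `href` against a base URL and leaves an already absolute `href` unchanged. The paths below are made-up examples, not real IMDb links.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.imdb.com"

# One relative and one already absolute href (both hypothetical).
hrefs = ["/title/tt0000001/", "https://www.imdb.com/title/tt0000002/"]
links = [urljoin(BASE_URL, href) for href in hrefs]
print(links)
```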

In [20]:
"""
Scraper of information from an IMDb movie page.
"""
class ImdbMovieScraper:
    """
    Initiates ImdbMovieScraper class.
    
    @param soup -> soup of an IMDb movie page to be scraped
    """
    def __init__(self, soup):
        self.soup = soup
        
    """
    Obtains the title of a movie on an IMDb movie page.
    
    @return String object containing title of a movie
    """
    def get_title(self):
        title = self.soup.find('div', {'class':'title_wrapper'})
        
        return title.h1.find(text=True, recursive=False)
    
    """
    Obtains the original title of a movie on an IMDb movie page.
    
    @return String object containing original title of a movie, None object if not found
    """
    def get_original_title(self):
        original_title = self.soup.find('div', {'class':'originalTitle'})
        
        if original_title is not None:
            return original_title.find(text=True, recursive=False)
        else:
            return None
        
    """
    Obtains the worldwide gross of a movie on an IMDb movie page.
    
    @return String object containing worldwide gross of a movie, None object if not found
    """
    def get_worldwide_gross(self):
        worldwide_gross = self.soup.find('h4', text='Cumulative Worldwide Gross:')
        
        if worldwide_gross is not None:
            return re.sub("[^0-9]", "", worldwide_gross.parent.text)
        else:
            return None
        
    """
    Obtains the rating of a movie on an IMDb movie page.
    
    @return String object containing rating of a movie
    """
    def get_rating(self):
        rating = self.soup.find('span', {'itemprop':'ratingValue'})
        
        return rating.text
    
    """
    Obtains the total number of ratings of a movie on an IMDb movie page.
    
    @return String object containing the number of ratings of a movie
    """
    def get_rating_count(self):
        rating_count = self.soup.find('span', {'class':'small', 'itemprop':'ratingCount'})
        
        return rating_count.text
    
    """
    Obtains the runtime of a movie on an IMDb movie page.
    
    @return String object containing the runtime of a movie
    """
    def get_runtime(self):
        runtime = self.soup.find('div', {'class':'subtext'})
        
        return re.sub("[^0-9]", "", runtime.time['datetime'])
    
    """
    Obtains the director of a movie on an IMDb movie page.
    
    @return String object containing the director of a movie, None object if not found
    """
    def get_director(self):
        director = self.soup.find('h4', text='Director:')
        
        if director is not None:
            return director.parent.a.text
        else:
            return None
        
    """
    Obtains the release date of a movie on an IMDb movie page.
    
    @return String object containing the release date of a movie, None object if not found
    """
    def get_release_date(self):
        release_date = self.soup.find('a', {'title':'See more release dates'})
        
        if release_date is not None:
            return re.sub("\n", "", release_date.text)
        else:
            return None
    
    """
    Obtains the genres of a movie on an IMDb movie page.
    
    @return List object containing genres of a movie, None object if not found
    """
    def get_genres(self):
        header = self.soup.find('h4', text='Genres:')
        
        # find() returns None when the section is missing, so check before using .parent
        if header is not None:
            return [re.sub(" ", "", genre.text) for genre in header.parent.find_all('a')]
        else:
            return None
        
    """
    Obtains the countries of a movie on an IMDb movie page.
    
    @return List object containing countries of a movie, None object if not found
    """
    def get_countries(self):
        header = self.soup.find('h4', text='Country:')
        
        # find() returns None when the section is missing, so check before using .parent
        if header is not None:
            return [country.text for country in header.parent.find_all('a')]
        else:
            return None
        
    """
    Obtains the languages of a movie on an IMDb movie page.
    
    @return List object containing languages of a movie, None object if not found
    """
    def get_languages(self):
        header = self.soup.find('h4', text='Language:')
        
        # find() returns None when the section is missing, so check before using .parent
        if header is not None:
            return [language.text for language in header.parent.find_all('a')]
        else:
            return None
        
    """
    Obtains the budget of a movie on an IMDb movie page.
    
    @return String object containing budget of a movie, None object if not found
    """
    def get_budget(self):
        budget = self.soup.find('h4', text='Budget:')
        
        if budget is not None:
            return re.sub("[^0-9]", "", budget.parent.text)
        else:
            return None
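
Several getters above reduce a text fragment to its digits with `re.sub("[^0-9]", "", ...)`. The sketch below applies the same substitution to a few sample strings; the sample inputs are assumptions made for illustration, and `digits_only` is a hypothetical helper.

```python
import re


def digits_only(text):
    """Strip every character that is not a digit (same pattern as the scraper)."""
    return re.sub("[^0-9]", "", text)


print(digits_only("Budget: $84,500,000 (estimated)"))  # 84500000
print(digits_only("PT109M"))                            # 109
print(digits_only("6.2"))                               # 62
```

The last call shows the limitation: the pattern also removes decimal points, so it suits whole-number amounts and runtimes but not ratings.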

The following class takes the downloaded `Movie` objects and transforms them into a pandas data frame. It also provides methods that filter the data frame on several parameters, giving easy access to subsets of the data.

In [116]:
"""
Stores the scraped data in a pandas dataframe.
For easier handling, includes several filtering methods.
Later on, more methods will be added when needed.
"""
class MovieDataFrame:
    """
    Initiates MovieDataFrame class.
    
    @param downloader -> instance of a downloader object holding the scraped movies
    """
    def __init__(self, downloader):
        self.data = downloader.movies
        self.df = self.to_data_frame()
         
    """
    Converts movie dictionaries into a data frame.
    
    @return Pandas data frame object with every movie contained in a given downloader instance
    """  
    def to_data_frame(self):
        return pd.DataFrame.from_records([movie.to_dict() for movie in self.data])

    """
    Filters data frame based on a column of lists

    @return This MovieDataFrame instance (with its data frame filtered), so calls can be chained
    """  
    def filter_df_list(self, column, element):
        mask = self.df[column].apply(lambda x: element in x)
        self.df = self.df.loc[mask,:]
        return self

    """
    Filters data frame based on a column of strings

    @return This MovieDataFrame instance (with its data frame filtered), so calls can be chained
    """      
    def filter_df_string(self, column, element):
        self.df = self.df.loc[self.df[column] == element,:]
        return self

    """
    Transforms a column into floats if possible. Filters data frame based on a column of floats.

    @return This MovieDataFrame instance (with its data frame filtered), so calls can be chained
    """      
    def filter_df_float(self, column, higher_than, lower_than, equal_to):
        self.df[column] = pd.to_numeric(self.df[column])
        if equal_to is None:
            if higher_than is not None:
                self.df = self.df.loc[self.df[column] > higher_than,:]
            if lower_than is not None:
                self.df = self.df.loc[self.df[column] < lower_than,:]
        else:
            self.df = self.df.loc[self.df[column] == equal_to,:]
        return self

    """
    Following methods filter data frame based on the indicated column using above methods.
    Can be easily chained.

    @return This MovieDataFrame instance (with its data frame filtered), so calls can be chained
    """              
    def filter_genre(self, genre):
        return self.filter_df_list('genres', genre)   

    def filter_country(self, country):
        return self.filter_df_list('countries', country) 

    def filter_language(self, language):
        return self.filter_df_list('languages', language)   

    def filter_director(self, director):
        return self.filter_df_string('director', director)  

    # The first bound is the lower one (rows must be strictly above it),
    # the second is the upper one (rows must be strictly below it).
    def filter_budget(self, min_budget=None, max_budget=None, exact_budget=None):
        return self.filter_df_float('budget', higher_than=min_budget, lower_than=max_budget,
                                    equal_to=exact_budget)

    def filter_rating(self, min_rating=None, max_rating=None, exact_rating=None):
        return self.filter_df_float('rating', higher_than=min_rating, lower_than=max_rating,
                                    equal_to=exact_rating)

    def filter_runtime(self, min_runtime=None, max_runtime=None, exact_runtime=None):
        return self.filter_df_float('runtime', higher_than=min_runtime, lower_than=max_runtime,
                                    equal_to=exact_runtime)

    def filter_gross(self, min_gross=None, max_gross=None, exact_gross=None):
        return self.filter_df_float('worldwide_gross', higher_than=min_gross, lower_than=max_gross,
                                    equal_to=exact_gross)
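
The numeric filters combine three optional arguments: an exact value takes precedence, otherwise a strict lower bound and a strict upper bound are applied. The following pandas-free sketch restates that logic on a plain list of dictionaries to make the precedence explicit. `filter_range` is a hypothetical function, and the sample rows are simplified for illustration.

```python
def filter_range(rows, key, higher_than=None, lower_than=None, equal_to=None):
    """Keep rows whose `key` value passes the bounds; `equal_to` overrides both bounds."""
    if equal_to is not None:
        return [row for row in rows if row[key] == equal_to]
    if higher_than is not None:
        rows = [row for row in rows if row[key] > higher_than]
    if lower_than is not None:
        rows = [row for row in rows if row[key] < lower_than]
    return rows


movies = [
    {"title": "Birds of Prey", "budget": 84500000.0},
    {"title": "The Gentlemen", "budget": 22000000.0},
]
cheap = filter_range(movies, "budget", higher_than=0, lower_than=84500000)
print([m["title"] for m in cheap])  # ['The Gentlemen']
```

Because both comparisons are strict, a movie whose budget equals a bound is excluded.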

To show what our program is currently capable of, we run it on a list of feature films and TV movies with ratings from at least 100,000 users on IMDb. First, we download the first page of the list (50 movies) and print the content of every movie downloaded.

In [22]:
imdb1 = ImdbDownloader('https://www.imdb.com/search/title/?title_type=feature,tv_movie&num_votes=100000,&sort=release_date,desc&view=simple')
imdb1.start(1)
imdb1.print_movies()

We have downloaded 50 movies so far!

About to print 50 movies!

title: Birds of Prey (Podivuhodná proměna Harley Quinn) 
original title: Birds of Prey: And the Fantabulous Emancipation of One Harley Quinn
worldwide gross: 201858461
rating: 6.2
rating count: 107,285
runtime: 109
director: Cathy Yan
release date: 6 February 2020 (Czech Republic)
genres: Action Adventure Crime 
countries: USA 
languages: English Chinese 
budget: 84500000

title: Star Wars: Vzestup Skywalkera 
original title: Star Wars: Episode IX - The Rise of Skywalker
worldwide gross: 1074144248
rating: 6.7
rating count: 318,022
runtime: 142
director: J.J. Abrams
release date: 19 December 2019 (Czech Republic)
genres: Action Adventure Fantasy Sci-Fi 
countries: USA 
languages: English 
budget: 275000000

title: 6 Underground 
rating: 6.1
rating count: 119,565
runtime: 128
director: Michael Bay
release date: 13 December 2019 (USA)
genres: Action Comedy Thriller 
countries: USA 
languages: English Turkmen Cantonese Itali

In [117]:
df_obj = MovieDataFrame(imdb1)

df_obj.filter_genre('Crime').filter_genre('Action').filter_country('USA').filter_language('English').filter_budget(0, 84500000).filter_director('Guy Ritchie')
df_obj.df



| | title | original_title | worldwide_gross | rating | rating_count | runtime | director | release_date | genres | countries | languages | budget |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | The Gentlemen | | 114996853 | 7.9 | 129105 | 113 | Guy Ritchie | 1 January 2020 (UK) | [Action, Comedy, Crime] | [UK, USA] | [English] | 22000000.0 |


Now, let's see what happens if the program does not receive any limit on the number of pages to be downloaded.

In [7]:
imdb2 = ImdbDownloader('https://www.imdb.com/search/title/?title_type=feature,tv_movie&num_votes=100000,&sort=release_date,desc&view=simple')
imdb2.start()
imdb2.print_movies()

We have downloaded 50 movies so far!
We have downloaded 100 movies so far!
We have downloaded 150 movies so far!
We have downloaded 200 movies so far!
We have downloaded 250 movies so far!
We have downloaded 300 movies so far!
We have downloaded 350 movies so far!
We have downloaded 400 movies so far!
We have downloaded 450 movies so far!
We have downloaded 500 movies so far!
We have downloaded 550 movies so far!
We have downloaded 600 movies so far!
We have downloaded 650 movies so far!
We have downloaded 700 movies so far!
We have downloaded 750 movies so far!
We have downloaded 800 movies so far!
We have downloaded 850 movies so far!
We have downloaded 900 movies so far!
We have downloaded 950 movies so far!
We have downloaded 1000 movies so far!
We have downloaded 1050 movies so far!
We have downloaded 1100 movies so far!
We have downloaded 1150 movies so far!
We have downloaded 1200 movies so far!
We have downloaded 1250 movies so far!
We have downloaded 1300 movies so far!
We hav