### IMDb and Criterion Channel Website Scraper
##### By Walter Goedecke, February 15, 2025

This Jupyter Notebook features two URL web-scraping examples using the Cinemagoer and BeautifulSoup Python modules: gathering movie lists from the Criterion Channel, and then determining the Internet Movie Database (IMDb) rating of those movies. 

The method is featured twice: once as an integrated function, and then piecemeal, to show the individual functions and allow experimentation. 

### Table of Contents

* #### [Initialization](#initialization)
    * [Import Python packages](#import_python_packages)
    * [Assign directory](#assign_directory)

* #### [Functions](#functions)

* #### [Integrated web-scraping function](#integrated_web-scraping_function)
    * [Main program](#main_program)
    * [Program input](#program_input)

* #### [Some Criterion Channel example links](#example_links)

* #### [Piecemeal web-scraping functions](#piecemeal_operation)
    * [Input movie URL](#input_url)
    * [Access IMDb for movie specifications](#access_imdb)    
    * [Create output file and output results](#output_results)
    
* #### [Description of program](#description)
    * [Function of program](#function_of_program)
    * [Reason for developing](#reason_for_developing)
    * [How it works](#how_it_works)
    * [Improvements](#improvements)

### Initialization <a class="anchor" id="initialization"></a>

#### Import Python packages <a class="anchor" id="import_python_packages"></a>
Import Python packages used for calculations and web scraping. 

In [1]:
# Import standard python modules.
import os, sys
import math # Used for chopping long titles into segments. 

# Import requests to fetch web pages, and Cinemagoer, the IMDb Python module.
import requests
from imdb import Cinemagoer

# Creating an instance of Cinemagoer.
ia = Cinemagoer()

# Use beautifulsoup to parse html text.
from bs4 import BeautifulSoup

#### Assign output directory <a class="anchor" id="assign_directory"></a>
A directory is created for the output of the movie list and their ratings; the directory is not created if it already exists. 

In [2]:
''' Create directories for storing movie rating results. '''
# Assign the home directory; expanduser() also works where HOME is not set, e.g., on Windows.
home_dir = os.path.expanduser("~")

# Assign the output directory - customize as needed.
movie_dir = os.path.join(home_dir, "Documents", "entertainment-movies")
print("Movie directory: {}".format(movie_dir))

# Create the output directory, unless it already exists.
if os.path.isdir(movie_dir):
    print("{} exists".format(movie_dir))
else: 
    os.makedirs(movie_dir)
    print("{} created".format(movie_dir))


Movie directory: /home/walter/Documents/entertainment-movies
/home/walter/Documents/entertainment-movies exists
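
The same create-if-missing step can also be sketched with `pathlib`, assuming that module is an acceptable substitute for the `os` calls above. The helper name `ensure_output_dir` is hypothetical and not part of this notebook.

```python
from pathlib import Path

def ensure_output_dir(base_dir, *parts):
    """Create base_dir/parts... if it is missing, and return it as a Path."""
    target = Path(base_dir).joinpath(*parts)
    # exist_ok=True makes the call a no-op when the directory already exists.
    target.mkdir(parents=True, exist_ok=True)
    return target

# Example call, commented out so that running the sketch creates nothing:
# movie_dir = ensure_output_dir(Path.home(), "Documents", "entertainment-movies")
```

Passing the base directory as an argument keeps the helper easy to try in a scratch location before pointing it at the home directory.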


### Functions <a class="anchor" id="functions"></a>
Here are the principal Python functions: *criterion_list*, *selection_choice*, and *output_ratings*, with another, *chop_string*, to chop movie titles to fit the output file column.

In [25]:
''' 
This function accesses a Criterion Channel URL, e.g., 
https://www.criterionchannel.com/british-noir.

The HTML page is returned, and BeautifulSoup parses the page, seeking 
specific information, such as 'div' tagged blocks with class attributes 
that contain the movie title and release year. 

From these blocks the title and year are extracted, and appended to the lists.
'''
def criterion_list(url):
    # Use BeautifulSoup to parse HTML information.
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    #print(soup)
    #print(soup.prettify()) # "prettify" formats the code.
    
    # Set movie list.
    movie_list = []
    # Set year of release list.
    year_list = []

    # Zone in on HTML segments that have both movie titles and year listed. This will 
    # omit film teasers that have no official release date, and can't be rated on IMDb.
    blocks = soup.find_all('div', attrs = {'class':'tooltip background-white'})
    for block in blocks:
        # Will print formatted block of code for demonstration.
        #print(block.prettify(), '\n')
        
        # Blocks with "•" characters in them will have official movie dates, e.g., " • 1947 • "
        if "•" in block.text: 
            #print(block.prettify(), '\n')

            # Zone in on the line showing the movie, looking for <strong> tags surrounding 
            # the movie title, while also setting title, style, and class keys to null.
            title = block.find('strong', attrs = {'title':'', 'style':'', 'class':''})
            movie_list.append(title.text)
            #print(title.text)
            
            # Extract the movie release year, between the " • " characters.
            year = block.text.split("•")[1].strip()
            year_list.append(year)
            #print(year, '\n')

    return movie_list, year_list

In [21]:
'''
Function that finds the movie from the IMDb selection, which is usually several movie choices. 
From these choices, match the movie release year to find the correct movie.
'''
def selection_choice(movie, target_yr):
    # selection is a list of Cinemagoer movie objects whose titles match the search string.
    selection = ia.search_movie(movie)
    #print(selection, '\n')

    # Initialize the results before the loop, so they are defined even when the search 
    # returns nothing or when no candidate matches the target year. 
    year = target_yr
    movieID = None
    rating = "Not listed"

    # Sort thru the movie selection until the correct movie is found by year, usually the first one, 
    # but not always. 
    for line in selection:
        # Get the movie ID, which will generate a link to a specific movie, e.g., 
        # movieID = "0032484" will generate URL "https://www.imdb.com/title/tt0032484/"
        candidate_id = line.movieID

        # With movie ID, go back to IMDb and find year of movie.
        fetched_movie = ia.get_movie(candidate_id)
        try: 
            candidate_yr = fetched_movie['year']
        except Exception: # If not possible to get a year, jump to the next line and movie.
            continue

        # Compare movie year with target year, if identical, get rating and break out of loop.
        if candidate_yr == target_yr:
            movieID = candidate_id
            try:
                rating = fetched_movie['rating']
            except Exception: # If not listed or some other error. 
                rating = "Not listed"

            # With correct year found, break out of loop.
            break

    return movie, year, movieID, rating

In [11]:
'''
Routine that chops string into segments of specified width to fit into columns, e.g., 
"It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness,"
becomes: 
"It was the best of times, it
 was the worst of times, it
 was the age of wisdom, it was
 the age of foolishness,"
'''
def chop_string(string, width):
    pieces = []
    remainder = string

    # Peel off one piece at a time until the remainder fits within the width.
    while len(remainder) > width:
        # Break at the last space that keeps the piece within the width; 
        # if there is no usable space, break mid-word at the width.
        cut = remainder.rfind(' ', 0, width + 1)
        if cut <= 0:
            cut = width

        pieces.append(remainder[:cut])
        # The remainder keeps its leading space, as in the example above.
        remainder = remainder[cut:]

    pieces.append(remainder)

    return pieces

In [22]:
'''
Function will output formatted movie, year, and rating to screen and to file defined by "filename."
'''
def output_ratings(movies_specs_list, filename):
    # Build each output line once, so that the screen and the file receive identical text.
    lines = ["  {:40s} {:s}".format("Movie (Year)", "Rating")]

    for movie_specs in movies_specs_list:
        movie = movie_specs[0]
        year = int(movie_specs[1])
        rating = str(movie_specs[3])

        # Combine title and year as a single string for the list, e.g., 
        # "Green for Danger", "(1946)" becomes "Green for Danger (1946)"
        movie_year = "{} ({})".format(movie, year)

        # Output movie, date, and rating, e.g., 
        # "Topaz (1945)                               6.3"
        # or if the title is too long for list, chop, e.g., 
        # The Lodger: A Story of the London Fog 
        # (1927)                                     7.3
        # chop_string() returns a single piece when the string already fits.
        pieces = chop_string(movie_year, 42)

        # All pieces but the last go on lines of their own; the last one carries the rating.
        lines.extend(pieces[:-1])
        lines.append("{:43s} {:s}".format(pieces[-1], rating))

    # Write every line, including chopped titles, to both the screen and the file.
    with open(filename, "w") as fout:
        for line in lines:
            print(line)
            fout.write(line + "\n")

### Integrated web-scraping function <a class="anchor" id="integrated_web-scraping_function"></a>
This is the main program that combines all of the functions to extract Criterion Channel movie lists, and then rate them at IMDb.

#### Main program <a class="anchor" id="main_program"></a>

In [23]:
# Main program, with the three principal functions: criterion_list(), selection_choice(), and output_ratings().
def main(args):
    #######################################
    # Input the Criterion Channel movie URL. 
    input_var = input("Enter the URL: ")
    #print ("you entered " + input_var) 
    category_url = input_var

    #######################################
    # Call the Criterion Channel access function. 
    movie_list, year_list = criterion_list(category_url)

    print("\nMovie list and year of release:")
    for i in range(len(movie_list)):
        print("{} ({})".format(movie_list[i], year_list[i]))

    #######################################
    # Find IMDb specifications for movies.
    ''' With the movie list, find the IMDb movie ID for each movie, then its rating. 
        The lookup is repeated for movies whose rating did not come back, since 
        IMDb no longer reliably returns every rating on the first request.
    '''
    print("\nIMDb specifications for movie list:")
    # Create movie ID list.
    movies_specifications_list = []
    # Print header for list.
    print("Initial processing:")
    print("(movie, year, movieID, rating)")
    print("------------------------------")

    # The movieID will access the movie at IMDb, e.g., 
    # "0032484" links to "https://www.imdb.com/title/tt0032484/" for 
    # "Foreign Correspondent"

    for i in range(len(movie_list)):
        movie = movie_list[i]
        year = int(year_list[i]) # Change to integer.

        # With the movie and the year, find the movie specifications, and 
        # append to list.
        movie_specs = selection_choice(movie, year)
        movies_specifications_list.append(movie_specs)
        print(movie_specs)

    # Iterate the process to eliminate the returned "Not listed" results.
    # Do up to 7 times; can be changed to a larger value.
    max_iterations = 7
    count = 0
    while count < max_iterations:
        # Increment count.
        count += 1
        # Check to see if any movies were unrated, if so, create unrated movie list.
        if ([x for x in movies_specifications_list if "Not listed" in x]):
            movies_specifications_list_unrated = []
            for i in range(len(movies_specifications_list)):
                if movies_specifications_list[i][3] == 'Not listed':
                    movies_specifications_list_unrated.append(movies_specifications_list[i])

            print()
            if movies_specifications_list_unrated:
                print("Iteration {}:".format(count))
                # Print header for list.
                print("(movie, year, movieID, rating)")
                print("------------------------------")

                movies_specifications_list_rated = []
                for i in range(len(movies_specifications_list_unrated)):
                    movie = movies_specifications_list_unrated[i][0]
                    year = int(movies_specifications_list_unrated[i][1])

                    # With the movie and the year, find the movie specifications, and 
                    # append to list.
                    movie_specs = selection_choice(movie, year)
                    movies_specifications_list_rated.append(movie_specs)
                    print(movie_specs)

                # Copy newly rated movies in this iteration to the movies_specifications_list. 
                for i, line in enumerate(movies_specifications_list):
                    for j, subline in enumerate(movies_specifications_list_rated):
                        if line[0] == subline[0]:
                            movies_specifications_list[i] = subline
                            break
        else:
            break

    print("\nFinal list:")
    print("(movie, year, movieID, rating)")
    print("------------------------------")
    for line in movies_specifications_list:
        print(line)

    #######################################
    # Create movie-type filename to store ratings. This will be in the movie 
    # directory, "movie_dir," setup earlier. Then output results to screen and file.
    movie_file = os.path.split(category_url)[1] + ".txt"
    filepath = os.path.join(movie_dir, movie_file)
    print("\nFilepath where ratings are written:\n{}".format(filepath))

    # Output movie, year, and rating.
    output_ratings(movies_specifications_list, filepath)

    return 0

#### Program input <a class="anchor" id="program_input"></a>

In [24]:
# Program access point.
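if __name__ == '__main__':
    try:
        sys.exit(main(sys.argv))
    except SystemExit:
        # Suppress only the exit request, which Jupyter would otherwise report 
        # as an exception; genuine errors still surface.
        pass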

Enter the URL: https://www.criterionchannel.com/directed-by-alfred-hitchcock

Movie list and year of release:
Foreign Correspondent (1940)
The Lady Vanishes (1938)
Young and Innocent (1937)
Sabotage (1936)
The 39 Steps (1935)
The Man Who Knew Too Much (1934)
Downhill (1927)
The Lodger: A Story of the London Fog (1927)

IMDb specifications for movie list:
(movie, year, movieID, rating)
------------------------------
('Foreign Correspondent', 1940, '0032484', 7.4)
('The Lady Vanishes', 1938, '0030341', 7.7)
('Young and Innocent', 1937, '0029811', 6.8)
('Sabotage', 1936, '0028212', 7.0)
('The 39 Steps', 1935, '0026029', 7.6)
('The Man Who Knew Too Much', 1934, '0025452', 6.7)
('Downhill', 1927, '0017825', 6.0)
('The Lodger: A Story of the London Fog', 1927, '0017075', 7.3)

Filepath where ratings are written:
/home/walter/Documents/entertainment-movies/directed-by-alfred-hitchcock.txt
  Movie (Year)                             Rating
Foreign Correspondent (1940)                7.4
The Lad

### Some Criterion Channel example links <a class="anchor" id="example_links"></a>
Some example links to try. Copy link to insert into the *program input* to rate the movies.

**Notes:** 
* To copy a link, break out of the rendered markdown in this cell by hitting ***enter***; clicking the active link instead will take you to the Criterion Channel. <br>
* Also, some of these links may not work, since the Criterion Channel rotates selections periodically. 

https://www.criterionchannel.com/british-noir

https://www.criterionchannel.com/japanese-noir

https://www.criterionchannel.com/three-noirs-by-john-farrow

https://www.criterionchannel.com/directed-by-alfred-hitchcock

https://www.criterionchannel.com/douglas-sirk-noirs

https://www.criterionchannel.com/hollywood-hits

### Piecemeal web-scraping functions <a class="anchor" id="piecemeal_operation"></a>
This section features the individual modules to show the process of extracting Criterion Channel movie categories and then rating them at IMDb.

For example, Alfred Hitchcock's ***Shadow of a Doubt*** **(1943)** has an IMDb rating of **7.8**.

#### Input movie URL <a class="anchor" id="input_url"></a>
Input the Criterion Channel movie URL and gather the movie list.

In [26]:
'''
Input the Criterion Channel movie URL, e.g., "https://www.criterionchannel.com/directed-by-alfred-hitchcock"
The output will be the movie and the year of release, i.e., title (year)
'''
input_var = input("Enter the URL: ")
#print ("you entered " + input_var) 
category_url = input_var
    
movie_list, year_list = criterion_list(category_url)
for i in range(len(movie_list)):
    print("{} ({})".format(movie_list[i], year_list[i]))

Enter the URL: https://www.criterionchannel.com/british-noir
Green for Danger (1946)
Odd Man Out (1947)
Obsession (1949)
The Small Back Room (1949)
The Woman in Question (1950)
Hell Drivers (1957)
Time Without Pity (1957)
All Night Long (1962)


#### Access IMDb for movie specifications <a class="anchor" id="access_imdb"></a>
Access IMDb for the movie identification numbers, then again for the movie specifications.

In [27]:
''' With the movie list, find the IMDb movie ID for each movie, then its rating. 
    The lookup is repeated for movies whose rating did not come back, since 
    IMDb no longer reliably returns every rating on the first request.
'''
# Create movie ID list.
movies_specifications_list = []
# Print header for list.
print("Initial processing:")
print("(movie, year, movieID, rating)")
print("------------------------------")

# The movieID will access the movie at IMDb, e.g., 
# "0032484" links to "https://www.imdb.com/title/tt0032484/" for 
# "Foreign Correspondent"

for i in range(len(movie_list)):
    movie = movie_list[i]
    year = int(year_list[i]) # Change to integer.

    # With the movie and the year, find the movie specifications, and 
    # append to list.
    movie_specs = selection_choice(movie, year)
    movies_specifications_list.append(movie_specs)
    print(movie_specs)

# Iterate the process to eliminate the returned "Not listed" results.
# Do up to 7 times; can be changed to a larger value.
max_iterations = 7
count = 0
while count < max_iterations:
    # Increment count.
    count += 1
    # Check to see if any movies were unrated, if so, create unrated movie list.
    if ([x for x in movies_specifications_list if "Not listed" in x]):
        movies_specifications_list_unrated = []
        for i in range(len(movies_specifications_list)):
            if movies_specifications_list[i][3] == 'Not listed':
                movies_specifications_list_unrated.append(movies_specifications_list[i])

        print()
        if movies_specifications_list_unrated:
            print("Iteration {}:".format(count))
            # Print header for list.
            print("(movie, year, movieID, rating)")
            print("------------------------------")

            movies_specifications_list_rated = []
            for i in range(len(movies_specifications_list_unrated)):
                movie = movies_specifications_list_unrated[i][0]
                year = int(movies_specifications_list_unrated[i][1])

                # With the movie and the year, find the movie specifications, and 
                # append to list.
                movie_specs = selection_choice(movie, year)
                movies_specifications_list_rated.append(movie_specs)
                print(movie_specs)

            # Copy newly rated movies in this iteration to the movies_specifications_list. 
            for i, line in enumerate(movies_specifications_list):
                for j, subline in enumerate(movies_specifications_list_rated):
                    if line[0] == subline[0]:
                        movies_specifications_list[i] = subline
                        break
    else:
        break

print("\nFinal list:")
print("(movie, year, movieID, rating)")
print("------------------------------")
for line in movies_specifications_list:
    print(line)

(movie, year, movieID, rating)
------------------------------
('Green for Danger', 1946, '0038577', 7.4)
('Odd Man Out', 1947, '0039677', 7.6)
('Obsession', 1949, '0041460', 7.3)
('The Small Back Room', 1949, '0041886', 7.1)
('The Woman in Question', 1950, '0043140', 6.8)
('Hell Drivers', 1957, '0051713', 7.2)
('Time Without Pity', 1957, '0049856', 6.8)
('All Night Long', 1962, '0054614', 7.1)


#### Create output file and output results <a class="anchor" id="output_results"></a>
Create an output file with a file name based on the movie category, and write the results to it.

In [28]:
# Create movie-type filename to store ratings.
movie_file = os.path.split(category_url)[1] + ".txt"
print("Movie file: {}".format(movie_file))

filepath = os.path.join(movie_dir, movie_file)
print("filepath: {}".format(filepath))

Movie file: british-noir.txt
filepath: /home/walter/Documents/entertainment-movies/british-noir.txt
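
One caveat with `os.path.split(category_url)[1]` is that it returns an empty string when the URL ends with a slash. A more forgiving variant is sketched below, assuming the category name is always the last path segment of the URL. The helper name `category_filename` is hypothetical and not part of this notebook.

```python
import posixpath
from urllib.parse import urlparse

def category_filename(url, extension=".txt"):
    """Return a filename built from the last path segment of a URL."""
    # Keep only the path, so that query strings and fragments are ignored.
    path = urlparse(url).path
    # Drop any trailing slash before taking the final segment.
    slug = posixpath.basename(path.rstrip("/"))
    if not slug:
        raise ValueError("No category name found in URL: {}".format(url))
    return slug + extension

print(category_filename("https://www.criterionchannel.com/british-noir"))   # british-noir.txt
print(category_filename("https://www.criterionchannel.com/british-noir/"))  # british-noir.txt
```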


In [29]:
# Call the output function that will format the results here and to the file.
output_ratings(movies_specifications_list, filepath)

  Movie (Year)                             Rating
Green for Danger (1946)                     7.4
Odd Man Out (1947)                          7.6
Obsession (1949)                            7.3
The Small Back Room (1949)                  7.1
The Woman in Question (1950)                6.8
Hell Drivers (1957)                         7.2
Time Without Pity (1957)                    6.8
All Night Long (1962)                       7.1


### Description of program <a class="anchor" id="description"></a>

#### Function of program <a class="anchor" id="function_of_program"></a>
This program determines the rating of movies, particularly groups of movies listed on the Criterion Channel (https://www.criterionchannel.com/), with ratings provided by the Internet Movie Database (IMDb - https://www.imdb.com/).

#### Reason for developing <a class="anchor" id="reason_for_developing"></a>
I enjoy watching movies, especially film noir movies above a certain caliber, such as an IMDb rating of at least about 7.0 out of 10.0. I've found it tedious to look up each movie's rating individually. This app takes a listing of movies featured on the Criterion Channel, or a supplied list of movies and their years of release, and seeks their ratings. 

#### How it works <a class="anchor" id="how_it_works"></a>
The program first accepts a URL that links to a listing of movies, e.g., *https://www.criterionchannel.com/british-noir.* The function **criterion_list(url)** returns a listing of movies under the British noir category with the year of release, in this case these movies: 

    Green for Danger (1946)
    Odd Man Out (1947)
    Obsession (1949)
    The Small Back Room (1949)
    The Woman in Question (1950)
    Hell Drivers (1957)
    Time Without Pity (1957)
    All Night Long (1962)

It does so by parsing the HTML page with BeautifulSoup. The movie title and year of release are found in *div-tagged* blocks of HTML code, with class attributes *tooltip background-white*: 

    blocks = soup.find_all('div', attrs = {'class':'tooltip background-white'})

A returned block of code with a movie title and year of release is shown:

    <div class="tooltip background-white" id="collection-tooltip-455466">
     <h3 class="tooltip-item-title site-font-primary-family">
      <strong>
       Green for Danger
      </strong>
     </h3>
     <h4 class="transparent">
      <span class="media-identifier media-episode">
       Episode 2
      </span>
     </h4>
     <div class="transparent padding-top-medium">
      <p>
       Directed by Sidney Gilliat • 1946 • United Kingdom
       <br/>
       Starring Trevor Howard, Sally Gray, Alastair Sim
      </p>
      <p>
       In the midst of Nazi air raids, a postman dies on the operating table at a rural English hospital. But was the death accidental? A delightful and wholly unexpected murder mystery, British writer/d...
      </p>
     </div>
    </div>

If the block contains a "•" character it will have both a movie title and a year; otherwise it probably only has a teaser describing the featured movies, and thus is itself not a movie that can be rated. The movie title is extracted from the code block by seeking the *strong-tag* with no title, style, or class attributes:
    
    title = block.find('strong', attrs = {'title':'', 'style':'', 'class':''})

In this example the movie title is *Green for Danger*. 

The year of release is extracted by a split("•") method since the year resides within two "•" symbols, e.g., *" • 1946 • "*. The title and year are then appended to the movie and year lists, and returned by the function. 
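
The split step can be tried on its own with the credit line from the block above. This is a minimal sketch, assuming the year is always a field made up of exactly four digits; the helper name `extract_year` is hypothetical. It checks every field rather than relying on the year being the second one.

```python
import re

def extract_year(credit_text):
    """Return the four-digit year found between "•" separators, or None."""
    for field in credit_text.split("•"):
        field = field.strip()
        # Accept only fields that consist of exactly four digits.
        if re.fullmatch(r"\d{4}", field):
            return int(field)
    return None

print(extract_year("Directed by Sidney Gilliat • 1946 • United Kingdom"))  # 1946
print(extract_year("A teaser with no release date"))                       # None
```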

Next, the function **selection_choice(movie, target_yr)** operates on each movie and year in the movie and year lists, returning the movie specifications, i.e., the title and year again, along with the movie ID and the rating; e.g., for the *British noir* selection: 

    ('Green for Danger', 1946, '0038577', 7.4)
    ('Odd Man Out', 1947, '0039677', 7.6)
    ('Obsession', 1949, '0041460', 7.3)
    ('The Small Back Room', 1949, '0041886', 7.1)
    ('The Woman in Question', 1950, '0043140', 6.8)
    ('Hell Drivers', 1957, '0051713', 7.2)
    ('Time Without Pity', 1957, '0049856', 6.8)
    ('All Night Long', 1962, '0054614', 7.1)

The **selection_choice()** function relies on the IMDbPY package, now known as *Cinemagoer*, which is designed for retrieving data from the Internet Movie Database (IMDb). With the instance *ia = Cinemagoer()*, a selection of movies is returned from IMDb by:

    selection = ia.search_movie(movie)

Several movies are returned, since the movie title is generally not unique. The movie ID for each movie in the list is sought, and then with the movie ID, each movie's specifications are fetched from the IMDb archives: 

    fetched_movie = ia.get_movie(movieID)

Each fetched movie's release year is compared to the desired one from the Criterion Channel, and if a match is found, it is assumed that the correct movie has been picked, and the rating is extracted, 

    rating = fetched_movie['rating']

and the next movie in the Criterion Channel's list is examined. There are a few error-trapping lines in the **selection_choice()** function since anomalies can occur now and then. The movie title, year of release, movie ID, and rating are then appended to the movies specifications list. 
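
The year-matching logic can be explored without a network connection. The sketch below uses plain dictionaries as stand-ins for the fetched Cinemagoer movie objects. The helper name `pick_by_year` is hypothetical, and the first two candidates carry placeholder IDs and a placeholder rating that are not real IMDb data.

```python
def pick_by_year(candidates, target_yr):
    """Return the first candidate whose 'year' equals target_yr, else None."""
    for candidate in candidates:
        # Candidates without a year cannot be matched, so they never compare equal.
        if candidate.get("year") == target_yr:
            return candidate
    return None

# Stand-in search results for one title; only the last entry comes from the listing above.
candidates = [
    {"movieID": "9999999", "title": "Obsession"},                               # placeholder, no year
    {"movieID": "9999998", "title": "Obsession", "year": 1976, "rating": 5.0},  # placeholder values
    {"movieID": "0041460", "title": "Obsession", "year": 1949, "rating": 7.3},
]

match = pick_by_year(candidates, 1949)
rating = match.get("rating", "Not listed") if match else "Not listed"
print(match["movieID"], rating)  # 0041460 7.3
```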

Recently, IMDb has not returned ratings for every movie on the first request, so I repeat the lookup for movies that came back with no rating. The loop runs up to seven times, breaking out as soon as all the ratings have been returned. 
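
The retry idea can be condensed into a small helper. This is a sketch rather than the notebook's own loop: the names `retry_unrated` and `flaky_lookup` are hypothetical, and `flaky_lookup` only imitates a lookup that fails on its first call. Entries are updated by index here, which avoids mixing up two movies that share a title.

```python
def retry_unrated(specs_list, lookup, max_iterations=7):
    """Re-run lookup(movie, year) for entries whose rating is 'Not listed'."""
    for _ in range(max_iterations):
        unrated = [i for i, specs in enumerate(specs_list) if specs[3] == "Not listed"]
        if not unrated:
            break  # Every movie has a rating, so stop early.
        for i in unrated:
            movie, year = specs_list[i][0], specs_list[i][1]
            specs_list[i] = lookup(movie, year)
    return specs_list

# Stand-in for selection_choice(): returns no rating on the first call only.
calls = {"count": 0}

def flaky_lookup(movie, year):
    calls["count"] += 1
    rating = 7.4 if calls["count"] >= 2 else "Not listed"
    return (movie, year, "0038577", rating)

specs = [("Green for Danger", 1946, "0038577", "Not listed")]
print(retry_unrated(specs, flaky_lookup))
```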

Next, the movie results are formatted and output to a file with the function **output_ratings(movies_specs_list, filename)**, where *filename* specifies the destination to write to. For the *Directed by Alfred Hitchcock* selection, the output is: 

      Movie (Year)                             Rating
    Foreign Correspondent (1940)                7.4
    The Lady Vanishes (1938)                    7.7
    Young and Innocent (1937)                   6.8
    Sabotage (1936)                             7.0
    The 39 Steps (1935)                         7.6
    The Man Who Knew Too Much (1934)            6.7
    Downhill (1927)                             6.0
    The Lodger: A Story of the London Fog 
     (1927)                                     7.3

The function **chop_string(string, width)** will chop the movie title to fit into the column, as seen by the last entry. 
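
For comparison, Python's standard `textwrap` module offers similar word-boundary wrapping. The sketch below is an alternative to **chop_string()**, not the notebook's own routine; the one-space `subsequent_indent` is an assumption chosen to imitate the indented continuation line shown above.

```python
import textwrap

movie_year = "The Lodger: A Story of the London Fog (1927)"

# Wrap at 42 characters, indenting continuation lines by one space.
pieces = textwrap.wrap(movie_year, width=42, subsequent_indent=" ")
for piece in pieces:
    print(piece)
```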

While the program was inspired by Criterion Channel movies, which requires a subscription fee of about \\$100 yearly or about \\$10 monthly, any list of movies passed to the **selection_choice()** function will return the movie specifications, so the functions can be tailored to one's needs. 

These algorithms do not always work: one error is that the movie's year of release may be off by a year, e.g., the Criterion year is *1962*, while the IMDb year is *1961*, which I've seen for some foreign films. At other times the movie is not found at the IMDb site; this again usually happens with foreign films whose titles do not match, e.g., when one title is in English and the other is in the language of the country the movie was made in. I've found this movie rating algorithm to be at least 90% effective with foreign movies, and nearly 100% when rating English language films. 
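
One possible mitigation for the off-by-one-year case is to compare years with a small tolerance instead of strict equality. This is only a sketch, assuming a one-year tolerance is acceptable; it could pick the wrong film when two movies with the same title were released in consecutive years. The helper name `year_matches` is hypothetical.

```python
def year_matches(candidate_yr, target_yr, tolerance=1):
    """Return True when the two years differ by at most `tolerance`."""
    if candidate_yr is None:
        return False
    return abs(int(candidate_yr) - int(target_yr)) <= tolerance

print(year_matches(1961, 1962))  # True
print(year_matches(1957, 1962))  # False
```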

#### Improvements <a class="anchor" id="improvements"></a>
While this GitHub site is mostly for cinephile experimentation, I may add some features to help streamline the movie rating process. 

Since the site works mostly with Criterion Channel movie selections, and most users would probably not be subscribers, I may change the output of movie results from the **criterion_list()** function to be stored as a file, and require the input of movie titles and corresponding release years to **selection_choice()** be read from a file rather than passing the list directly between functions. This would allow a file of movie titles and years composed by another method to be passed to the **selection_choice()** function. 
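
A file-based hand-off of this kind might look like the sketch below, assuming a CSV file with `title` and `year` columns. The function names and the file layout are hypothetical, since no format has been settled on.

```python
import csv

def write_movie_list(path, movie_list, year_list):
    """Write titles and release years to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as fout:
        writer = csv.writer(fout)
        writer.writerow(["title", "year"])
        writer.writerows(zip(movie_list, year_list))

def read_movie_list(path):
    """Read the CSV file back as a list of (title, year) tuples."""
    with open(path, newline="", encoding="utf-8") as fin:
        return [(row["title"], int(row["year"])) for row in csv.DictReader(fin)]

# Example use, commented out so that running the sketch writes nothing:
# write_movie_list("british-noir.csv", ["Green for Danger"], ["1946"])
# for movie, year in read_movie_list("british-noir.csv"):
#     print(movie, year)
```

The `csv` module quotes titles that contain commas, so a list typed by hand in a spreadsheet or text editor can be read back the same way.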

I would also entertain any thoughts from users to improve the process.