# Scraping Top Movies for Genres from IMDB Using Python


![IMDB.png](https://i.postimg.cc/5yKDDZrx/IMDB.png)

Web scraping extracts data from websites into a structured format, such as a CSV file, that can be used for further analysis. Many tools, including Scrapy, pyspider, and various SaaS platforms, are available for this purpose. We'll use the following Python libraries for our project:  

* requests
* BeautifulSoup
* pandas

### Problem Statement
IMDB (The Internet Movie Database), owned by Amazon, is one of the most comprehensive movie databases on the internet. It is popularly browsed for movie reviews/ratings and other TV/celebrity content. We would like to create an automated way of scraping the top movies for any given genre from IMDB with the following details:
* Movie Name
* IMDB URL
* Year of release
* Duration
* Genre
* IMDB Rating
* Director
* Star cast
* Number of Votes

Once we have this information, we will write it to a csv file so that it can be used for further analysis.

### Step-by-Step Solution

* Download the page from URL using requests library and convert its contents into a BeautifulSoup Object
* Use the BeautifulSoup library to parse required information
* Write the gathered information to a csv
* Use the pandas library to show the information in a dataframe

## Downloading the Web Page

We'll write two functions that download the web page for a given genre and convert it into a BeautifulSoup object. Since IMDB shows only 50 movies per page, the second function downloads the content of the Nth page.

IMDB has the following genres listed for which webpages can be downloaded:  
* action, adventure, animation, biography, comedy, crime, documentary, drama, family, fantasy, history, horror, music, musical, mystery, romance, sci-fi, short film, sport, superhero, thriller, war, western

In [1]:
import requests
from bs4 import BeautifulSoup

def get_genre_first_page(genre):
    start = 'https://www.imdb.com/search/title/?genres='
    end = '&title_type=feature&explore=genres&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=facfbd0c-6f3d-4c05-9348-22eebd58852e&pf_rd_r=VAYJ7QXVS6BEX8377ERW&pf_rd_s=center-6&pf_rd_t=15051&pf_rd_i=genre&ref_=ft_gnr_mvpop_1'
    genre_url = start + genre + end
    response = requests.get(genre_url)
    
    if not response.ok:
        print('Status Code:', response.status_code)
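        raise Exception('Failed to fetch web page ' + genre_url)
    
    # Name the parser explicitly so the result does not depend on which parsers are installed
    doc = BeautifulSoup(response.text, 'html.parser')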
    return doc


def get_genre_nth_page(genre, n):
    start = 'https://www.imdb.com/search/title/?title_type=feature&genres='
    mid = '&start='
    num = str((n-1)*50 + 1)
    end = '&explore=genres&ref_=adv_nxt'
    genre_url = start + genre + mid + num + end
    response = requests.get(genre_url)
    
    if not response.ok:
        print('Status Code:', response.status_code)
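        raise Exception('Failed to get web page ' + genre_url)
    
    # Name the parser explicitly so the result does not depend on which parsers are installed
    doc = BeautifulSoup(response.text, 'html.parser')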
    
    return doc

The functions defined above do the following:

* Create the page URL by substituting the genre in the generic URL taken from IMDB
* Use **requests** library to download the HTML source code as **response**
* Check the status of the response object: `response.ok` is true only for status codes below 400 (a successful fetch normally returns **200**), and an exception is raised otherwise
* Use **BeautifulSoup** to create an object called **doc** which can be used to scrape required information

Below is a demo using 'action' genre:

In [2]:
doc1 = get_genre_first_page('action')
doc2 = get_genre_nth_page('action', 2)

In [3]:
type(doc1)

bs4.BeautifulSoup

In [4]:
doc1.title.text

'Top 50 Action Movies - IMDb'
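The two download functions differ only in the `start` offset sent to IMDB. As a minimal sketch, a single URL builder could cover any page. The helper name `build_genre_url` is hypothetical and not part of the notebook, and the sketch assumes that the tracking parameters (such as `pf_rd_m`) are optional and that `start=1` returns the first page:

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.imdb.com/search/title/'
MOVIES_PER_PAGE = 50  # each results page lists 50 movies

def build_genre_url(genre, page=1):
    """Hypothetical helper: build the search URL for a 1-indexed results page."""
    if page < 1:
        raise ValueError('page must be 1 or greater')
    params = {
        'title_type': 'feature',
        'genres': genre,
        # Page 1 starts at movie 1, page 2 at movie 51, and so on
        'start': (page - 1) * MOVIES_PER_PAGE + 1,
        'explore': 'genres',
    }
    return BASE_URL + '?' + urlencode(params)

print(build_genre_url('action', 2))
```

The offset arithmetic is the same as `(n-1)*50 + 1` in `get_genre_nth_page`; only the string concatenation is replaced by `urlencode`.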

## Using BeautifulSoup to Parse Information

We will now define a few functions to get the required information from the below web page:

https://www.imdb.com/search/title/?genres=action&title_type=feature&explore=genres&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=facfbd0c-6f3d-4c05-9348-22eebd58852e&pf_rd_r=H7SMTEDNFWKPADVAM0ZG&pf_rd_s=center-6&pf_rd_t=15051&pf_rd_i=genre&ref_=ft_gnr_mvpop_1

![Movie Tag](https://i.postimg.cc/7Yk472v9/Movie-Object.png)

### Movie Tags
In the code given below, we obtain a list of movie tags on the web page by filtering on the tag `div` and the class `lister-item mode-advanced`. We will use these movie tags to extract information about each of the 50 movies.

In [5]:
movie_tags = doc1.find_all('div', class_ = 'lister-item mode-advanced')
len(movie_tags)

50

### Movie Name & URL

The movie name is enclosed in an `a` tag, which sits inside an `h3` tag. We will first obtain the `h3` tag and then use it to extract the movie link and its text (the movie name):

![](https://i.postimg.cc/bwk0Fr3m/Movie-Name-URL.png)

In [6]:
sample_h3_tag = movie_tags[0].find('h3')
sample_h3_tag

<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt1630029/">Avatar: The Way of Water</a>
<span class="lister-item-year text-muted unbold">(2022)</span>
</h3>

In [7]:
sample_name = sample_h3_tag.find('a').text.strip()
sample_name

'Avatar: The Way of Water'

In [8]:
imdb_url = 'https://www.imdb.com'
sample_url = imdb_url + sample_h3_tag.find('a')['href']
sample_url

'https://www.imdb.com/title/tt1630029/'

### Year of Release

We can obtain the release year using the tag: `span` and class: `lister-item-year text-muted unbold`. The release date is not given on the website for a few movies since they are still in production. So we write a function to take the value as `None` where no release date is available:

![](https://i.postimg.cc/44qkG2Qm/Year-of-release.png)



In [9]:
def find_year(movie_tag):
    try:
        year = int(movie_tag.find('span', class_ = 'lister-item-year text-muted unbold').text[-5:-1])
    except AttributeError:
        year = None
    except ValueError:
        year = None
        
    return year

In [10]:
find_year(movie_tags[0])

2022
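The slice `[-5:-1]` relies on the year being the last four characters before the closing parenthesis. As an alternative sketch, a regular expression can pick out the year directly. The helper name `parse_year` is hypothetical, and the sample strings in the usage lines are assumed examples of the span text, not values taken from the page:

```python
import re

def parse_year(text):
    """Return the last four-digit year that follows '(' in the text, or None."""
    matches = re.findall(r'\((\d{4})', text)
    return int(matches[-1]) if matches else None

print(parse_year('(2022)'))
print(parse_year('(I) (2019)'))
print(parse_year(''))
```

Returning `None` when nothing matches mirrors what `find_year` does for movies without a release year.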

### Genre

We can obtain the Genre using the tag: `span` and class: `genre`

![](https://i.postimg.cc/y62fNkpT/genre.png)

In [11]:
sample_genre = movie_tags[0].find('span', class_ = 'genre').text.strip()
sample_genre

'Action, Adventure, Fantasy'

### Duration

We can obtain the duration in minutes using the tag `span` and the class `runtime`. The duration is not given on the website for a few movies since they are still in production, so we write a function that returns `None` where no duration is available:

![](https://i.postimg.cc/k5XQXmQC/duration.png)


In [12]:
def find_duration(movie_tag):
    try:
        duration = movie_tag.find('span', class_ = 'runtime').text.strip()
    except AttributeError:
        duration = None
        
    return duration

In [13]:
find_duration(movie_tags[0])

'192 min'
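`find_duration` returns the duration as a string such as `'192 min'`. For numeric analysis it can help to convert it to an integer number of minutes. The following is a minimal sketch; the helper name `duration_to_minutes` is hypothetical, and it assumes the string always starts with the number:

```python
def duration_to_minutes(duration):
    """Convert a string such as '192 min' to the integer 192; pass None through."""
    if duration is None:
        return None
    parts = duration.split()
    if not parts:
        return None  # empty string: treat as missing
    # Remove any thousands separator before converting
    return int(parts[0].replace(',', ''))

print(duration_to_minutes('192 min'))
```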

### IMDB Rating

We can obtain the IMDB rating using the tag: `div` and class: `inline-block ratings-imdb-rating`. The rating is not given on the website for a few movies since they are still in production. So we write a function to take the value as `None` where no rating is available:

![](https://i.postimg.cc/qvGF5pBH/imdb-rating.png)

In [14]:
def find_imdb_rating(movie_tag):
    try:
        imdb_rating = float(movie_tag.find('div', class_ = 'inline-block ratings-imdb-rating').text.strip())
    except AttributeError:
        imdb_rating = None
        
    return imdb_rating

In [15]:
find_imdb_rating(movie_tags[0])

7.9

### Directors & Star Cast

We can obtain the directors and star cast using the tag `p` with an empty class. Since the paragraph contains a lot of text, we split it into a list of lines and select only the ones we need. The directors and actors are not given on the website for a few movies, so we write functions that return `None` where no director or cast information is available:

![](https://i.postimg.cc/RVx4rVmW/Directors-and-Stars.png)

In [16]:
def find_directors(movie_tag):
    try:
        paragraph = movie_tag.find('p', class_ = '').get_text().strip().split('\n')
        directors = ''.join(paragraph[1 : paragraph.index('| ')])
        directors = directors.replace(',', ';')
    except ValueError:
        directors = None
        
    return directors

def find_actors(movie_tag):
    try:
        paragraph = movie_tag.find('p', class_ = '').get_text().strip().split('\n')
        actors = ''.join(paragraph[paragraph.index('    Stars:')+1 :])
        actors = actors.replace(',', ';')
    except ValueError:
        actors = None
        
    return actors

In [17]:
find_directors(movie_tags[0])

'James Cameron'

In [18]:
find_actors(movie_tags[0])

'Sam Worthington; Zoe Saldana; Sigourney Weaver; Stephen Lang'
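To see why the list slicing works, it helps to run the same steps on a plain string. The `sample_text` below is an assumed approximation of what `get_text().strip()` returns for the first movie; the whitespace on the real page may differ:

```python
# Assumed shape of the paragraph text after get_text().strip()
sample_text = (
    'Director:\n'
    'James Cameron\n'
    '| \n'
    '    Stars:\n'
    'Sam Worthington, \n'
    'Zoe Saldana, \n'
    'Sigourney Weaver, \n'
    'Stephen Lang'
)

paragraph = sample_text.split('\n')

# Everything between the 'Director:' label and the '| ' separator
directors = ''.join(paragraph[1:paragraph.index('| ')]).replace(',', ';')

# Everything after the '    Stars:' label
actors = ''.join(paragraph[paragraph.index('    Stars:') + 1:]).replace(',', ';')

print(directors)
print(actors)
```

If the `'| '` separator or the `'    Stars:'` label is missing, `list.index` raises `ValueError`, which is the exception `find_directors` and `find_actors` catch.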

### Number of Votes

We can obtain the votes using the tag `p` and the class `sort-num_votes-visible`. Since we are going to export the data into a CSV file, we remove the commas from the number of votes. The votes information is not given on the website for a few movies since they are still in production, so we write a function that returns `None` where no votes information is available:

![](https://i.postimg.cc/W3t0nM8C/votes.png)

In [19]:
def find_votes(movie_tag):
    try:
        votes = movie_tag.find('p', class_ = 'sort-num_votes-visible').get_text().strip().split('\n')[1]
        # replace() is a no-op when there is no comma, so no branch is needed
        votes = int(votes.replace(',', ''))
    except AttributeError:
        votes = None
    except IndexError:
        votes = None
        
    return votes

In [20]:
find_votes(movie_tags[0])

220530
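The comma handling can be tried in isolation on plain strings. This is a small sketch with a hypothetical helper name, `parse_votes`, that also returns `None` for text that is not a number:

```python
def parse_votes(text):
    """Convert a vote count such as '1,299,311' to an int, or return None."""
    digits = text.replace(',', '').strip()
    return int(digits) if digits.isdigit() else None

print(parse_votes('1,299,311'))
print(parse_votes('987'))
```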

### Function to Parse top movies into a Python List

Now that we have functions to extract every required field from a movie tag, we will write a function that collects them into a dictionary. Applying it to each movie tag gives a list of dictionaries, one per movie (50 for a single results page).

In [21]:
# Following is the information extracted for the first movie tag:
print('Movie Name:', sample_name)
print("IMDB URL:", sample_url)
print('Year of Release:', find_year(movie_tags[0]))
print('Duration:', find_duration(movie_tags[0]))
print('Genre:', sample_genre)
print('IMDB Rating:', find_imdb_rating(movie_tags[0]))
print('Director:', find_directors(movie_tags[0]))
print('Actors:', find_actors(movie_tags[0]))
print('Votes:', find_votes(movie_tags[0]))

Movie Name: Avatar: The Way of Water
IMDB URL: https://www.imdb.com/title/tt1630029/
Year of Release: 2022
Duration: 192 min
Genre: Action, Adventure, Fantasy
IMDB Rating: 7.9
Director: James Cameron
Actors: Sam Worthington; Zoe Saldana; Sigourney Weaver; Stephen Lang
Votes: 220530


In [22]:
#Function to write all the information into a dictionary

def parse_movie(movie_tag):
    
    h3_tag = movie_tag.find('h3')
    name = h3_tag.find('a').text.strip().replace(',', '')
    imdb_url = 'https://www.imdb.com'
    url = imdb_url + h3_tag.find('a')['href']
    year = find_year(movie_tag)
    
    duration = find_duration(movie_tag)
    genre = movie_tag.find('span', class_ = 'genre').text.strip().replace(',', ';')
    imdb_rating = find_imdb_rating(movie_tag)
    
    directors = find_directors(movie_tag)
    actors = find_actors(movie_tag)
    votes = find_votes(movie_tag)
    
    #Return a Dictionary
    
    return{
        'Movie Name': name,
        'IMDB URL': url,
        'Year of Release': year,
        'Duration': duration,
        'Genre': genre,
        'IMDB Rating': imdb_rating,
        'Directors': directors,
        'Actors': actors,
        'Votes': votes
    }

In [23]:
parse_movie(movie_tags[2])

{'Movie Name': 'Bullet Train',
 'IMDB URL': 'https://www.imdb.com/title/tt12593682/',
 'Year of Release': 2022,
 'Duration': '127 min',
 'Genre': 'Action; Comedy; Thriller',
 'IMDB Rating': 7.3,
 'Directors': 'David Leitch',
 'Actors': 'Brad Pitt; Joey King; Aaron Taylor-Johnson; Brian Tyree Henry',
 'Votes': 258485}

In [24]:
top_action_movies = [parse_movie(tag) for tag in movie_tags]
top_action_movies[:3]

[{'Movie Name': 'Avatar: The Way of Water',
  'IMDB URL': 'https://www.imdb.com/title/tt1630029/',
  'Year of Release': 2022,
  'Duration': '192 min',
  'Genre': 'Action; Adventure; Fantasy',
  'IMDB Rating': 7.9,
  'Directors': 'James Cameron',
  'Actors': 'Sam Worthington; Zoe Saldana; Sigourney Weaver; Stephen Lang',
  'Votes': 220530},
 {'Movie Name': 'Avatar',
  'IMDB URL': 'https://www.imdb.com/title/tt0499549/',
  'Year of Release': 2009,
  'Duration': '162 min',
  'Genre': 'Action; Adventure; Fantasy',
  'IMDB Rating': 7.9,
  'Directors': 'James Cameron',
  'Actors': 'Sam Worthington; Zoe Saldana; Sigourney Weaver; Michelle Rodriguez',
  'Votes': 1299311},
 {'Movie Name': 'Bullet Train',
  'IMDB URL': 'https://www.imdb.com/title/tt12593682/',
  'Year of Release': 2022,
  'Duration': '127 min',
  'Genre': 'Action; Comedy; Thriller',
  'IMDB Rating': 7.3,
  'Directors': 'David Leitch',
  'Actors': 'Brad Pitt; Joey King; Aaron Taylor-Johnson; Brian Tyree Henry',
  'Votes': 258485}]

Below is a function that takes a BeautifulSoup object as input and returns the information for every movie listed on that page (50 per page):

In [25]:
def get_top_movies(doc):
    
    # Obtain a list of movie tags from the html source code
    movie_tags = doc.find_all('div', class_ = 'lister-item mode-advanced')
    
    # Create a list of dictionaries containing movie information
    top_movies = [parse_movie(tag) for tag in movie_tags]
    return top_movies

Finally, the function below combines all the functions defined above to return the top movies for any genre across the first `n` pages (50 movies per page):

In [26]:
def get_n_pages(genre, n):
    
    #Getting a list of top 50 movies from first page
    doc1 = get_genre_first_page(genre)
    top_movies = get_top_movies(doc1)
    
    #Getting a list of movies from the next (n-1) pages
    for i in range(2,n+1):
        doc = get_genre_nth_page(genre, i)
        top_movies_n = get_top_movies(doc)
        top_movies += top_movies_n
    
    return top_movies

In [27]:
top_200_comedy_movies = get_n_pages('comedy', 4)
len(top_200_comedy_movies)

200

In [28]:
top_200_comedy_movies[:5]

[{'Movie Name': 'Glass Onion',
  'IMDB URL': 'https://www.imdb.com/title/tt11564570/',
  'Year of Release': 2022,
  'Duration': '139 min',
  'Genre': 'Comedy; Crime; Drama',
  'IMDB Rating': 7.2,
  'Directors': 'Rian Johnson',
  'Actors': 'Daniel Craig; Edward Norton; Kate Hudson; Dave Bautista',
  'Votes': 255998},
 {'Movie Name': 'The Menu',
  'IMDB URL': 'https://www.imdb.com/title/tt9764362/',
  'Year of Release': 2022,
  'Duration': '107 min',
  'Genre': 'Comedy; Horror; Thriller',
  'IMDB Rating': 7.3,
  'Directors': 'Mark Mylod',
  'Actors': 'Ralph Fiennes; Anya Taylor-Joy; Nicholas Hoult; Hong Chau',
  'Votes': 115650},
 {'Movie Name': 'White Noise',
  'IMDB URL': 'https://www.imdb.com/title/tt6160448/',
  'Year of Release': 2022,
  'Duration': '136 min',
  'Genre': 'Comedy; Drama; Horror',
  'IMDB Rating': 5.7,
  'Directors': 'Noah Baumbach',
  'Actors': 'Adam Driver; Greta Gerwig; Don Cheadle; Madison Gaughan',
  'Votes': 24039},
 {'Movie Name': 'Knives Out',
  'IMDB URL': 'h

## Write Information to CSV

Below is a function which can be used to write all the extracted information into a csv file:

In [29]:
def write_csv(items, path):
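    # Write UTF-8 explicitly so names such as 'Jaume Balagueró' survive on any platform
    with open(path, 'w', encoding='utf-8') as f: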
        if len(items) == 0:
            return
        
        headers = ','.join(list(items[0].keys()))
        f.write(headers + '\n')
        
        for item in items:
            values = []
            for value in item.values():
                values.append(str(value))
            f.write(','.join(values) + '\n')

In [30]:
write_csv(top_200_comedy_movies, 'top-200-comedy-movies.csv')
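`write_csv` joins values with plain commas, which is why commas had to be stripped from movie names and replaced with semicolons elsewhere. As an alternative sketch, Python's standard `csv` module quotes such values automatically. The function name `write_csv_quoted` is hypothetical and not part of the notebook:

```python
import csv

def write_csv_quoted(items, path):
    """Hypothetical variant of write_csv built on the standard csv module."""
    if not items:
        return  # nothing to write
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(items[0].keys()))
        writer.writeheader()
        # Values containing commas are quoted; None is written as an empty field
        writer.writerows(items)
```

With this variant a title that contains a comma round-trips unchanged through `pd.read_csv`, at the cost of depending on one more module.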

### Use Pandas to Analyse the Data

We can now view all the extracted information as a data frame using the pandas library.

In [31]:
import pandas as pd

In [32]:
pd.read_csv('top-200-comedy-movies.csv').head()

| | Movie Name | IMDB URL | Year of Release | Duration | Genre | IMDB Rating | Directors | Actors | Votes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Glass Onion | https://www.imdb.com/title/tt11564570/ | 2022 | 139 min | Comedy; Crime; Drama | 7.2 | Rian Johnson | Daniel Craig; Edward Norton; Kate Hudson; Dave... | 255998 |
| 1 | The Menu | https://www.imdb.com/title/tt9764362/ | 2022 | 107 min | Comedy; Horror; Thriller | 7.3 | Mark Mylod | Ralph Fiennes; Anya Taylor-Joy; Nicholas Hoult... | 115650 |
| 2 | White Noise | https://www.imdb.com/title/tt6160448/ | 2022 | 136 min | Comedy; Drama; Horror | 5.7 | Noah Baumbach | Adam Driver; Greta Gerwig; Don Cheadle; Madiso... | 24039 |
| 3 | Knives Out | https://www.imdb.com/title/tt8946378/ | 2019 | 130 min | Comedy; Crime; Drama | 7.9 | Rian Johnson | Daniel Craig; Chris Evans; Ana de Armas; Jamie... | 677801 |
| 4 | Babylon | https://www.imdb.com/title/tt10640346/ | 2022 | 189 min | Comedy; Drama; History | 7.4 | Damien Chazelle | Brad Pitt; Margot Robbie; Jean Smart; Olivia W... | 10490 |


## The Last Leg

We can now modify our `get_n_pages` function to do all of the following:

* Download the web page from IMDB
* Extract data for the top movies from the first `n` pages (50 movies per page)
* Write the data into a CSV file
* Show the data as a pandas dataframe

In [33]:
def get_n_pages(genre, n):
    
    #Getting a list of top 50 movies from first page
    doc1 = get_genre_first_page(genre)
    top_movies = get_top_movies(doc1)
    
    #Getting a list of movies from the next (n-1) pages
    for i in range(2,n+1):
        doc = get_genre_nth_page(genre, i)
        top_movies_n = get_top_movies(doc)
        top_movies += top_movies_n
    
    #Writing the data into a csv file
    write_csv(top_movies, f'top-{genre}-movies.csv')
    
    return pd.read_csv(f'top-{genre}-movies.csv')

In [34]:
get_n_pages('action', 4)

| | Movie Name | IMDB URL | Year of Release | Duration | Genre | IMDB Rating | Directors | Actors | Votes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Avatar: The Way of Water | https://www.imdb.com/title/tt1630029/ | 2022 | 192 min | Action; Adventure; Fantasy | 7.9 | James Cameron | Sam Worthington; Zoe Saldana; Sigourney Weaver... | 220530 |
| 1 | Avatar | https://www.imdb.com/title/tt0499549/ | 2009 | 162 min | Action; Adventure; Fantasy | 7.9 | James Cameron | Sam Worthington; Zoe Saldana; Sigourney Weaver... | 1299311 |
| 2 | Bullet Train | https://www.imdb.com/title/tt12593682/ | 2022 | 127 min | Action; Comedy; Thriller | 7.3 | David Leitch | Brad Pitt; Joey King; Aaron Taylor-Johnson; Br... | 258485 |
| 3 | Top Gun: Maverick | https://www.imdb.com/title/tt1745960/ | 2022 | 130 min | Action; Drama | 8.3 | Joseph Kosinski | Tom Cruise; Jennifer Connelly; Miles Teller; V... | 489346 |
| 4 | Everything Everywhere All at Once | https://www.imdb.com/title/tt6710474/ | 2022 | 139 min | Action; Adventure; Comedy | 8.1 | Dan Kwan; Daniel Scheinert | Michelle Yeoh; Stephanie Hsu; Jamie Lee Curtis... | 280153 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | The Vault | https://www.imdb.com/title/tt9742794/ | 2021 | 118 min | Action; Adventure; Thriller | 6.4 | Jaume Balagueró | Freddie Highmore; Astrid Bergès-Frisbey; Sam R... | 26543 |
| 196 | Mission: Impossible - Rogue Nation | https://www.imdb.com/title/tt2381249/ | 2015 | 131 min | Action; Adventure; Thriller | 7.4 | Christopher McQuarrie | Tom Cruise; Rebecca Ferguson; Jeremy Renner; S... | 376668 |
| 197 | Oblivion | https://www.imdb.com/title/tt1483013/ | 2013 | 124 min | Action; Adventure; Sci-Fi | 7.0 | Joseph Kosinski | Tom Cruise; Morgan Freeman; Andrea Riseborough... | 529804 |
| 198 | From Dusk Till Dawn | https://www.imdb.com/title/tt0116367/ | 1996 | 108 min | Action; Crime; Horror | 7.2 | Robert Rodriguez | Harvey Keitel; George Clooney; Juliette Lewis;... | 318283 |


## References & Future Work

### Summary

Following is a brief summary of what we did:

* Downloaded the page from URL using requests library and converted its contents into a BeautifulSoup Object
* Used the BeautifulSoup library to parse required information
* Wrote the gathered information to a csv file
* Used the pandas library to show the information in a dataframe

### References

* YouTube Video on Python Web Scraping by Jovian: https://www.youtube.com/watch?v=RKsLLG-bzEY&t=3568s
* BeautifulSoup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

### Ideas for Future Work

* Create a live dashboard which updates at a regular frequency using the data extracted from IMDB
* Analyse the extracted data using the pandas library for insights on the directors and actors featured in the list. An interesting analysis would be to assign a score to each actor and director based on the ranking of their movies.
* Python libraries like matplotlib and seaborn can be used for visualisation of the data we have scraped
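
As a rough sketch of the scoring idea above, each actor could earn points according to the position of their movies in the list. The scheme (the first movie earns the most points, the last earns one) and the function name `score_actors` are assumptions made for illustration, not something the notebook defines:

```python
from collections import defaultdict

def score_actors(movies):
    """Give each actor points based on list position: the first movie scores highest."""
    scores = defaultdict(int)
    total = len(movies)
    for rank, movie in enumerate(movies):
        actors = movie.get('Actors')
        if not actors:
            continue  # cast information was missing for this movie
        for actor in actors.split(';'):
            name = actor.strip()
            if name:
                scores[name] += total - rank
    # Highest score first
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
```

It expects the list of dictionaries produced by `parse_movie`, where the `'Actors'` value is a semicolon-separated string or `None`.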
