# Web Scraping with Beautiful Soup (bs4 Library): Scraping the Top-Rated Movies from the IMDb Website

In [50]:
# Required libraries (numpy is not used anywhere in this notebook, so it is not imported)
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [59]:
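# Fetch the chart page and parse its HTML with BeautifulSoup
try:
    source = requests.get('https://www.imdb.com/chart/top/', timeout=10)
    source.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
    soup = BeautifulSoup(source.text, 'html.parser')
except requests.exceptions.RequestException as e:
    # If the request fails, soup is never defined and the cells below will not run
    print(e)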
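The fetch step above can also be wrapped in a small helper. The sketch below is only an illustration: the name `fetch_html`, the `DEFAULT_HEADERS` value, and the 10-second timeout are assumptions, not part of the original notebook. Some sites may reject requests that carry the default client User-Agent, so the helper sends an explicit one and raises on failure instead of printing the error and continuing.

```python
import requests

# Hypothetical header value; any descriptive User-Agent string could be used here.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; imdb-top-chart-tutorial)"}


def fetch_html(url, session=None, timeout=10):
    """Return the body of `url` as text, raising on network or HTTP errors."""
    http = session if session is not None else requests.Session()
    response = http.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # surface 4xx/5xx responses as exceptions
    return response.text


# Usage (performs a real network request, so it is left commented out):
# html = fetch_html("https://www.imdb.com/chart/top/")
```

Letting the exception propagate stops the notebook at the failing cell, which avoids a confusing `NameError` on `soup` later on.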


The soup object holds the HTML of the complete page. Let's extract only the part that contains the movie list.

In [61]:
movies = soup.find('tbody', class_='lister-list').find_all('tr')
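Chaining `find(...).find_all(...)` raises an `AttributeError` when the first lookup matches nothing, because `find` returns `None` in that case. The sketch below shows a guarded version on a small inline HTML sample, so it runs without a network request. The helper name `find_movie_rows` and the sample markup are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the chart page, using the same tbody class as the notebook.
SAMPLE_HTML = """
<table>
  <tbody class="lister-list">
    <tr><td class="titleColumn">1. <a>First Movie</a></td></tr>
    <tr><td class="titleColumn">2. <a>Second Movie</a></td></tr>
  </tbody>
</table>
"""


def find_movie_rows(html, list_class="lister-list"):
    """Return the <tr> rows inside the movie list, or an empty list if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    table_body = soup.find("tbody", class_=list_class)
    if table_body is None:
        # The expected container is missing, for example after a page redesign.
        return []
    return table_body.find_all("tr")


rows = find_movie_rows(SAMPLE_HTML)
print(len(rows))
print(len(find_movie_rows("<p>layout changed</p>")))
```

An empty result is a useful signal that the page structure no longer matches the selectors.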

In [62]:
# Print the HTML of one movie to see what it looks like
movies[0]

<tr>
<td class="posterColumn">
<span data-value="1" name="rk"></span>
<span data-value="9.221243521693388" name="ir"></span>
<span data-value="7.791552E11" name="us"></span>
<span data-value="2532245" name="nv"></span>
<span data-value="-1.7787564783066117" name="ur"></span>
<a href="/title/tt0111161/"> <img alt="The Shawshank Redemption" height="67" src="https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UY67_CR0,0,45,67_AL_.jpg" width="45"/>
</a> </td>
<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
<span class="secondaryInfo">(1994)</span>
</td>
<td class="ratingColumn imdbRating">
<strong title="9.2 based on 2,532,245 user ratings">9.2</strong>
</td>
<td class="ratingColumn">
<div class="seen-widget seen-widget-tt0111161 pending" data-titleid="tt0111161">
<div class="boundary">
<div class="popover">
<span c

We now have the list of the 250 top-rated movies. Let's extract four columns from it: rank, name, year, and rating.

In [63]:
# Print the length of the movies list to see how many movies were extracted
print(len(movies))

250


In [64]:
movies_data = []
for movie in movies:
    try:
        # The title cell holds the rank, the linked name, and the year
        title_col = movie.find('td', class_='titleColumn')
        name = title_col.a.text
        rank = title_col.get_text(strip=True).split('.')[0]
        year = title_col.span.text.strip('()')
        rating = movie.find('td', class_='ratingColumn imdbRating').strong.text
        movies_data.append([rank, name, year, rating])
    except Exception as e:
        # A missing tag makes find() return None; report the error and skip the row
        print(e)
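The loop stores every field as a string. As a minimal sketch, the same extraction can be applied to a single row and the values converted to numbers. The sample row below is a trimmed copy of the markup printed earlier, and the function name `parse_movie_row` is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one chart row, kept inline so the example runs offline.
SAMPLE_ROW = """
<tr>
<td class="titleColumn">
      1.
      <a href="/title/tt0111161/">The Shawshank Redemption</a>
<span class="secondaryInfo">(1994)</span>
</td>
<td class="ratingColumn imdbRating">
<strong title="9.2 based on 2,532,245 user ratings">9.2</strong>
</td>
</tr>
"""


def parse_movie_row(row):
    """Return (rank, name, year, rating) for one <tr> element, with numeric types."""
    title_col = row.find("td", class_="titleColumn")
    name = title_col.a.get_text(strip=True)
    # The cell text starts with "<rank>.", so everything before the first dot is the rank.
    rank = int(title_col.get_text(strip=True).split(".")[0])
    year = int(title_col.span.get_text(strip=True).strip("()"))
    # class_ matches any one of the element's classes, so "imdbRating" is enough.
    rating = float(row.find("td", class_="imdbRating").strong.get_text(strip=True))
    return rank, name, year, rating


sample_row = BeautifulSoup(SAMPLE_ROW, "html.parser").find("tr")
parsed = parse_movie_row(sample_row)
print(parsed)
```

Converting at extraction time means later sorting or filtering compares numbers rather than text.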

In [65]:
# Print the first 5 rows of the list
movies_data[:5]

[['1', 'The Shawshank Redemption', '1994', '9.2'],
 ['2', 'The Godfather', '1972', '9.1'],
 ['3', 'The Godfather: Part II', '1974', '9.0'],
 ['4', 'The Dark Knight', '2008', '9.0'],
 ['5', '12 Angry Men', '1957', '8.9']]

The first five rows look right: each one holds a rank, name, year, and rating.

In [66]:
# Convert the list to a DataFrame
top_movies = pd.DataFrame(movies_data, columns=['Rank', 'Name', 'Year of Release', 'IMDB Rating'])
top_movies.head()

  Rank                      Name Year of Release IMDB Rating
0    1  The Shawshank Redemption            1994         9.2
1    2             The Godfather            1972         9.1
2    3    The Godfather: Part II            1974         9.0
3    4           The Dark Knight            2008         9.0
4    5              12 Angry Men            1957         8.9
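Because the scraped values are strings, the DataFrame columns have the `object` dtype. The sketch below converts them to numeric dtypes with `DataFrame.astype` on two sample rows copied from the output above. The variable names `sample_df` and `typed_df` are assumptions for illustration.

```python
import pandas as pd

# Two rows in the same shape as movies_data: every value is a string.
sample_rows = [
    ["1", "The Shawshank Redemption", "1994", "9.2"],
    ["2", "The Godfather", "1972", "9.1"],
]
sample_df = pd.DataFrame(
    sample_rows, columns=["Rank", "Name", "Year of Release", "IMDB Rating"]
)

# Map each numeric column to the type it should hold.
typed_df = sample_df.astype(
    {"Rank": int, "Year of Release": int, "IMDB Rating": float}
)
print(typed_df.dtypes)
```

With numeric dtypes, operations such as `typed_df["IMDB Rating"].mean()` or sorting by year behave as expected.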


In [67]:
# Save the DataFrame as a CSV file; index=False leaves out the row index,
# since Rank already identifies each row
top_movies.to_csv('top_imdb_movies.csv', index=False)
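By default, `to_csv` writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below compares both behaviors using in-memory buffers instead of files. The sample data and variable names are assumptions for illustration.

```python
import io

import pandas as pd

sample_df = pd.DataFrame(
    {"Rank": [1, 2], "Name": ["The Shawshank Redemption", "The Godfather"]}
)

# Default behavior: the index is written as an extra, unnamed column.
with_index = io.StringIO()
sample_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False writes only the DataFrame's own columns.
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```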

This is a short, clean way to scrape the top 250 movies list from IMDb using the BeautifulSoup library. Happy coding!