# Scraping IMDb Top 250 Movies




## What is web scraping?
![](https://www.edureka.co/blog/wp-content/uploads/2018/11/Untitled-1.jpg)
Web scraping is an automated method for extracting large amounts of data from websites. The data on websites is mostly unstructured; web scraping collects it and stores it in a structured form.

## What is IMDb?

![](https://upload.wikimedia.org/wikipedia/commons/6/69/IMDB_Logo_2016.svg)

**The Internet Movie Database (IMDb) is an online database containing information and statistics about movies, TV shows and video games as well as actors, directors and other film industry professionals. This information can include lists of cast and crew members, movie release dates and box office information, plot summaries, trailers, actor and director biographies and other trivia.**

**Information on IMDb comes from a variety of sources, such as filmmakers, film studios, on-screen credits and other official sources. However, much of the information comes from IMDb users themselves, who can submit facts in a wiki-style format. Unlike traditional wiki sites, IMDb always authenticates information before it appears online -- although errors do show up, and the website allows users to report possible mistakes so they can be fixed.**

![](https://i.imgur.com/8Ks5sn8.png)




### **Libraries Used**
Here, we will use the following Python libraries:
- **requests** to download a web page,
- **beautifulsoup4** to extract data from the web page, and
- **pandas**, which provides data structures and operations for manipulating numerical data and time series.



## **Project Outline**

1. Download the web page https://www.imdb.com/chart/top/ using 'requests'.
2. Parse the HTML source code using 'beautifulSoup4'.
3. Extract each movie's name, URL, release year and rating.
4. Compile the extracted information into Python lists and a dictionary.
5. Convert the Python dictionary into a pandas DataFrame.
6. Finally, create a CSV file in the following format:

```
Movie Name, Url, Released On, Rating
The Shawshank Redemption, https://www.imdb.com/title/tt0111161/, 1994, 9.2
```



## 1. Use the requests library to download web pages
- Use requests to download the page
- Use bs4 to parse and extract information
- Convert to a pandas dataframe

In [1]:
# Install the library using !pip
!pip install requests --upgrade --quiet

In [2]:
# Import library
import requests

In [3]:
# Copy the link and give it a variable name
topics_url='https://www.imdb.com/chart/top/'

In [5]:
response=requests.get(topics_url)

**requests.get**: `requests.get()` sends an HTTP GET request to the URL and returns a response object that holds the downloaded page.

In [6]:
response.status_code

200

**status_code**: This attribute tells us whether sending the HTTP request and getting an HTTP response back succeeded. If the request was successful, `response.status_code` is a value between 200 and 299.
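
The 2xx check described above can be written out as a tiny helper. This is only a minimal sketch: `is_success` is a hypothetical name, not part of requests. The requests library also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.

```python
def is_success(status_code):
    """Return True when an HTTP status code is in the 2xx (success) range."""
    return 200 <= status_code <= 299


# Try a few common status codes
for code in (200, 204, 301, 404, 500):
    print(code, is_success(code))
```

In the notebook, this would be called as `is_success(response.status_code)`.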

In [7]:
# len gives the number of characters in the response text
len(response.text)

588376

The **len function** shows how many characters the response contains.

**_The HTML that we have just downloaded has 588376 characters in total._**

In [8]:
page_contents =response.text

**.text**: returns the body of the response, here the HTML document, as a string.

In [9]:
# Display the first 1000 characters of `page_contents`, which is written in a language called HTML
page_contents[:1000] 

'\n\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n         \n\n        <meta charset="utf-8">\n\n    \n    \n    \n\n    \n    \n    \n\n\n\n\n        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:\'java\'};</script>\n\n<script>\n    if (typeof uet == \'function\') {\n      uet("bb", "LoadTitle", {wb: 1});\n    }\n</script>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>\n        <title>Top 250 Movies - IMDb</title>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>\n<script>\n    if (typeof uet == \'function\') {\n      uet("be", "LoadTitle", {wb: 1});\n    }\n</script>\n<script>\n    if (typeof uex == \'function\') {\n      uex("ld", "LoadTitle", {wb: 1});\n    }\n</script>\n\n        <link rel="canonical" href="https://www.imd

In [10]:
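# Write with an explicit encoding so non-ASCII characters are saved correctly on any platform
with open('movies.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)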

**_We have now saved the page as movies.html using the with open statement, and we can open that file to view it._**
![](https://i.imgur.com/nn83zg7.png)
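
To check that a saved page can be read back unchanged, a round trip like the one below can help. It is a sketch under assumptions: `sample_html` is a made-up stand-in for `page_contents`, and the file is written to a temporary directory rather than the notebook folder.

```python
import os
import tempfile

# Made-up stand-in for page_contents, with a non-ASCII character in it
sample_html = "<title>Léon: The Professional</title>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "movies.html")

    # Write with an explicit encoding so the text survives on any platform
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample_html)

    # Read it back with the same encoding
    with open(path, "r", encoding="utf-8") as f:
        restored = f.read()

print(restored == sample_html)
```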

## 2. Parse the HTML source code using 'beautifulSoup4'


In [11]:
# Install beautifulsoup4
!pip install beautifulsoup4 --upgrade --quiet

In [12]:
# Import BeautifulSoup
from bs4 import BeautifulSoup

In [13]:
# Parse the page contents and store the result in the doc variable
doc = BeautifulSoup(page_contents,'html.parser')

In [14]:
type(doc)

bs4.BeautifulSoup

**type()**: Here we check the type of `doc`, which is a BeautifulSoup object.

## 3. We will get movie names, URLs, years and ratings

In [15]:
td_tags = doc.find_all('td')

**_The movie details sit in table cells, so we start by picking all the 'td' tags._**


In [16]:
len(td_tags)

1250

In [17]:
td_tags[:5]

[<td class="posterColumn">
 <span data-value="1" name="rk"></span>
 <span data-value="9.234585334105239" name="ir"></span>
 <span data-value="7.791552E11" name="us"></span>
 <span data-value="2627252" name="nv"></span>
 <span data-value="-1.7654146658947614" name="ur"></span>
 <a href="/title/tt0111161/"> <img alt="The Shawshank Redemption" height="67" src="https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UY67_CR0,0,45,67_AL_.jpg" width="45"/>
 </a> </td>,
 <td class="titleColumn">
       1.
       <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
 <span class="secondaryInfo">(1994)</span>
 </td>,
 <td class="ratingColumn imdbRating">
 <strong title="9.2 based on 2,627,252 user ratings">9.2</strong>
 </td>,
 <td class="ratingColumn">
 <div class="seen-widget seen-widget-tt0111161 pending" data-titleid="tt0111161">
 <div class="boundary">
 <div class=

In [18]:
# With doc.find_all we can pick only the 'td' tags that have the class 'titleColumn'
movies_name= doc.find_all('td',{'class':'titleColumn'})

In [19]:
len(movies_name)

250

We now have the 250 title cells. Let's print the first five.

In [20]:
# we can print some results 
movies_name[:5]

[<td class="titleColumn">
       1.
       <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
 <span class="secondaryInfo">(1994)</span>
 </td>,
 <td class="titleColumn">
       2.
       <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>
 <span class="secondaryInfo">(1972)</span>
 </td>,
 <td class="titleColumn">
       3.
       <a href="/title/tt0468569/" title="Christopher Nolan (dir.), Christian Bale, Heath Ledger">The Dark Knight</a>
 <span class="secondaryInfo">(2008)</span>
 </td>,
 <td class="titleColumn">
       4.
       <a href="/title/tt0071562/" title="Francis Ford Coppola (dir.), Al Pacino, Robert De Niro">The Godfather Part II</a>
 <span class="secondaryInfo">(1974)</span>
 </td>,
 <td class="titleColumn">
       5.
       <a href="/title/tt0050083/" title="Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb">12 Angry Men</a>
 <span class="secondaryInfo">(

This is hard to read because the movie names are still wrapped in HTML. Next, we'll extract them as plain text with the help of another function.

### Let's get the list of movie names from the Top 250 movies

![](https://i.imgur.com/iLIhJh1.png)

In [21]:
# First, get the tags that contain the movie names
tag = doc.find_all('td', {'class' : 'titleColumn'}) 
movie_names = []
for item in tag:
    movie_names.append(item.find('a').text)
movie_names[:5]

['The Shawshank Redemption',
 'The Godfather',
 'The Dark Knight',
 'The Godfather Part II',
 '12 Angry Men']

After finding the tags with that class, we use a **for** loop to go through each one, take the text of its `a` tag, and **append** it to the **movie_names** list.
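
To make clearer what is being pulled out of each cell, the sketch below extracts the same two pieces (the link text and the `href`) from one sample row. It deliberately uses Python's built-in `html.parser` instead of BeautifulSoup so that it runs without any installs; `TitleLinkParser` and `SAMPLE` are illustrative names, not part of the notebook.

```python
from html.parser import HTMLParser

# One row copied from the output above
SAMPLE = '''
<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
<span class="secondaryInfo">(1994)</span>
</td>
'''


class TitleLinkParser(HTMLParser):
    """Collect (text, href) for every <a> tag in the HTML it is fed."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a> tag
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text), self._href))
            self._href = None


parser = TitleLinkParser()
parser.feed(SAMPLE)
print(parser.links)
```

BeautifulSoup does this bookkeeping for us, which is why `item.find('a').text` is all the notebook needs.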

### Now we'll get all the URLs

In [22]:
# Similarly, build each movie URL from the base URL and the href, then print the first five URLs
urls = []
base_url = 'https://www.imdb.com/'
for item in tag:
    a = item.find('a')
    urls.append(base_url + a['href'])
urls[:5]

['https://www.imdb.com//title/tt0111161/',
 'https://www.imdb.com//title/tt0068646/',
 'https://www.imdb.com//title/tt0468569/',
 'https://www.imdb.com//title/tt0071562/',
 'https://www.imdb.com//title/tt0050083/']

**Each URL comes from the `href` attribute of the `a` tag.**
![](https://i.imgur.com/DkrEllb.png)
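
The URLs printed above contain a doubled slash (`https://www.imdb.com//title/...`) because the base URL ends with `/` and each `href` starts with `/`. One possible way to avoid this is `urljoin` from the standard library, sketched here on two sample `href` values copied from the output:

```python
from urllib.parse import urljoin

base_url = "https://www.imdb.com/"
hrefs = ["/title/tt0111161/", "/title/tt0068646/"]  # sample hrefs from the output above

# urljoin resolves each href against the base, so the slash is not doubled
clean_urls = [urljoin(base_url, href) for href in hrefs]
print(clean_urls)
```

In the loop above, this would mean appending `urljoin(base_url, a['href'])` instead of concatenating the two strings.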

## Ratings

In [23]:
# The rating shows how highly users rate each movie
rate = doc.find_all('td', class_= 'ratingColumn imdbRating')
ratings=[]
for item in rate: 
    ratings.append(item.text.strip())
ratings[:5]
    

['9.2', '9.2', '9.0', '9.0', '8.9']
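
The ratings are collected as strings. If they are needed as numbers later (for sorting or averaging, say), they can be converted with `float`. A minimal sketch, using the five sample values from the output above under the assumed name `sample_ratings`:

```python
sample_ratings = ['9.2', '9.2', '9.0', '9.0', '8.9']  # copied from the output above

# Convert each rating string to a float
numeric_ratings = [float(r) for r in sample_ratings]

print(numeric_ratings)
print(max(numeric_ratings))
```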

## Year

In [24]:
# Year is when the movie was released
year_tag = doc.find_all('span',class_= 'secondaryInfo')
years=[]
for item in year_tag:
    years.append(item.text.strip("()"))
years[:6]

['1994', '1972', '2008', '1974', '1957', '1993']
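
The call `strip("()")` removes any leading and trailing `(` or `)` characters, which is how `(1994)` becomes `1994`. The sketch below shows this on a few sample strings and also converts the result to an integer; `raw_years` is an assumed name for the text of the `secondaryInfo` spans.

```python
raw_years = ['(1994)', '(1972)', '(2008)']  # text of the secondaryInfo spans

# Remove the surrounding parentheses, then convert to int
years_int = [int(text.strip('()')) for text in raw_years]
print(years_int)
```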

## 4. Compile the extracted information into Python lists and a dictionary

Now that we have the necessary details, let's put them in a pandas DataFrame.

In [25]:
# Install pandas
!pip install pandas --upgrade --quiet

In [26]:
import pandas as pd

We can put this information into a pandas DataFrame using a dictionary, so first we have to create the dictionary.

In [27]:
movies_dict= {
    'Movie Name':movie_names,
    'Url': urls,
    'Released On': years,
    'Rating': ratings
}

The dictionary is ready, so let's put it into a pandas DataFrame.
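
pandas can build a DataFrame from a dictionary of lists only when every list has the same length; otherwise it raises a `ValueError`. A quick check beforehand can make such a mismatch easier to diagnose. The sketch below is plain Python, and `check_equal_lengths` is a hypothetical helper, not a pandas function:

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in a dict of columns differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return True


# Small sample in the same shape as movies_dict
sample = {
    'Movie Name': ['The Shawshank Redemption', 'The Godfather'],
    'Released On': ['1994', '1972'],
    'Rating': ['9.2', '9.2'],
}
print(check_equal_lengths(sample))
```

In the notebook, this would be called as `check_equal_lengths(movies_dict)` before creating the DataFrame.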

## 5. Convert the Python dictionary into Pandas DataFrame

In [28]:
# We'll store the DataFrame in the movies_df variable
movies_df = pd.DataFrame(movies_dict)

**_Let's check how it looks..._**

In [29]:
movies_df

| | Movie Name | Url | Released On | Rating |
|---|---|---|---|---|
| 0 | The Shawshank Redemption | https://www.imdb.com//title/tt0111161/ | 1994 | 9.2 |
| 1 | The Godfather | https://www.imdb.com//title/tt0068646/ | 1972 | 9.2 |
| 2 | The Dark Knight | https://www.imdb.com//title/tt0468569/ | 2008 | 9.0 |
| 3 | The Godfather Part II | https://www.imdb.com//title/tt0071562/ | 1974 | 9.0 |
| 4 | 12 Angry Men | https://www.imdb.com//title/tt0050083/ | 1957 | 8.9 |
| ... | ... | ... | ... | ... |
| 245 | Jai Bhim | https://www.imdb.com//title/tt15097216/ | 2021 | 8.0 |
| 246 | Aladdin | https://www.imdb.com//title/tt0103639/ | 1992 | 8.0 |
| 247 | Gandhi | https://www.imdb.com//title/tt0083987/ | 1982 | 8.0 |
| 248 | The Help | https://www.imdb.com//title/tt1454029/ | 2011 | 8.0 |


In [30]:
# Import jovian to save our work
import jovian

In [31]:
jovian.commit()

[jovian] Updating notebook "sritu1510/scraping-imdb-top-250-movies" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/sritu1510/scraping-imdb-top-250-movies

## 6. Finally we'll create a CSV file

All the information has been parsed from the main page. Let's save it as comma-separated values (CSV).

In [32]:
movies_df.to_csv('movies.csv', index=False)

**Our work is now saved, and we can check it in the file movies.csv.**
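
One way to check the layout of the saved file without pandas is the standard-library `csv` module. The sketch below parses a small in-memory sample in the same layout as `movies.csv` (the single data row is copied from the table above); for the real file, `io.StringIO(csv_text)` would be replaced by `open('movies.csv', newline='', encoding='utf-8')`.

```python
import csv
import io

# Sample text in the same layout as movies.csv
csv_text = (
    "Movie Name,Url,Released On,Rating\n"
    "The Shawshank Redemption,https://www.imdb.com//title/tt0111161/,1994,9.2\n"
)

# DictReader uses the header row as the keys of each record
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]['Movie Name'], rows[0]['Rating'])
```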

### Getting information from each movie page

Now we'll get more information from each movie's own page, in the same way as we did before...

In [33]:
# Get the page URL of the first movie in the Top 250 list
movie_page = urls[0] 

In [34]:
# Download the page and check the status code to confirm that the download succeeded
response = requests.get(movie_page)
response.status_code

200

In [35]:
# Check the length
len(response.text)

895878

In [36]:
# Parse it into a second document, doc2
doc2 = BeautifulSoup(response.text, 'html.parser')


**Genre**

In [37]:
# The genre helps viewers find movies that match their interests
gen = doc2.find('a', {'class' : 'sc-16ede01-3 bYNgQ ipc-chip ipc-chip--on-baseAlt'})  #This is to fetch the Genre of the movie
genre = gen.text

**About**

In [38]:
# About is a short summary of the movie
about = doc2.find('span', class_= "sc-16ede01-0 fMPjMP").text
about

'Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.'

**Rating**

In [39]:
# Fetch the rating of the movie
rating1 = doc2.find('span', {'class':'sc-7ab21ed2-1 jGRxWM'})
if rating1 is None:
    rating = 'NA'
else:
    rating = rating1.text
rating

'9.3'

Similarly, we'll get **viewers**, the number of people who reviewed the movie.

In [40]:
viewers = doc2.find('span', {'class' : "score"})   # Get the number of people who reviewed the movie
if viewers is None:
    viewers = 'NA'
else:
    viewers = viewers.text.strip()
viewers

'10.3K'
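
The count comes back as text such as `'10.3K'`, which cannot be sorted or summed directly. The sketch below shows one possible conversion to an integer. `parse_count` is a hypothetical helper, and the `M` suffix is an assumption, since only `K` values and plain numbers appear in this notebook's output.

```python
def parse_count(text):
    """Convert a count such as '10.3K' or '381' to an int; return None for 'NA'."""
    multipliers = {'K': 1_000, 'M': 1_000_000}  # 'M' is assumed, not seen in the data
    text = text.strip()
    if text == 'NA':
        return None
    suffix = text[-1].upper() if text else ''
    if suffix in multipliers:
        # round() guards against floating-point results like 8199.999...
        return int(round(float(text[:-1]) * multipliers[suffix]))
    return int(text)


for sample in ('10.3K', '1.3K', '381', 'NA'):
    print(sample, parse_count(sample))
```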

In [54]:
# Define helper functions so that all the details of a movie can be fetched with a single call

def get_movie_page(url):
    # Download the page
    response = requests.get(url)

    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))

    # Parse the HTML using BeautifulSoup
    return BeautifulSoup(response.text, 'html.parser')


def get_movie_details(movie_url):
    # Download and parse the movie page
    doc2 = get_movie_page(movie_url)

    # Find the genre; fall back to 'NA' if the tag is missing
    gen = doc2.find('a', {'class' : 'sc-16ede01-3 bYNgQ ipc-chip ipc-chip--on-baseAlt'})
    genre = gen.text if gen is not None else 'NA'

    # Find the summary; fall back to 'NA' if the tag is missing
    about_tag = doc2.find('span', class_= "sc-16ede01-0 fMPjMP")
    about = about_tag.text if about_tag is not None else 'NA'

    # Find the number of people who reviewed the movie
    viewers = doc2.find('span', {'class' : "score"})
    viewers = viewers.text.strip() if viewers is not None else 'NA'

    # Return all values as a tuple
    return genre, about, viewers


In [43]:
# Get all the details with a single function call
get_movie_details(urls[0])

('Drama',
 'Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.',
 '10.3K')

**This function brings together all the details that we extracted separately before.**


In [61]:
# Collect the details of every movie in a dictionary of lists
final_dict = {
    'Viewers':[],
    'genre':[],
    'summary':[]
}

# Loop over the movie URLs directly and append each detail to its list
for url in urls:
    details = get_movie_details(url)

    final_dict['genre'].append(details[0])
    final_dict['summary'].append(details[1])
    final_dict['Viewers'].append(details[2])
df = pd.DataFrame(final_dict)
df

| | Viewers | genre | summary |
|---|---|---|---|
| 0 | 10.3K | Drama | Two imprisoned men bond over a number of years... |
| 1 | 5.1K | Crime | The aging patriarch of an organized crime dyna... |
| 2 | 8.2K | Action | When the menace known as the Joker wreaks havo... |
| 3 | 1.3K | Crime | The early life and career of Vito Corleone in ... |
| 4 | 1.9K | Crime | The jury in a New York City murder trial is fr... |
| ... | ... | ... | ... |
| 245 | 3.3K | Crime | When a tribal man is arrested for a case of al... |
| 246 | 381 | Animation | A kindhearted street urchin and a power-hungry... |
| 247 | 328 | Biography | The life of the lawyer who became the famed le... |
| 248 | 642 | Drama | An aspiring author during the civil rights mov... |
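
The loop above sends 250 requests back to back, and a single failed page stops the whole run. A possible variation, sketched below, pauses between requests and records `'NA'` values when a page fails. `collect_details`, `fake_fetch` and the one-second default delay are assumptions for illustration; the demonstration uses a fake fetch function so that it runs without network access.

```python
import time


def collect_details(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between calls and recording failures as 'NA'."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # pause between requests to be polite to the server
        try:
            results.append(fetch(url))
        except Exception as error:
            print('Skipping {}: {}'.format(url, error))
            results.append(('NA', 'NA', 'NA'))
    return results


# Fake fetch function standing in for get_movie_details
def fake_fetch(url):
    if 'bad' in url:
        raise ValueError('page not available')
    return ('Drama', 'A summary', '1.0K')


demo = collect_details(
    ['https://example.com/good', 'https://example.com/bad'],
    fake_fetch,
    delay_seconds=0,
)
print(demo)
```

In the notebook, the call would be `collect_details(urls, get_movie_details)`, and each returned tuple would be appended to `final_dict` as before.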


**Now we have two different DataFrames. Let's combine them with the help of concat.**

In [62]:
df1= movies_df
df2 = df
final_df = pd.concat([df1,df2], axis = 1)

### Let's check how final_df looks

In [63]:
final_df

| | Movie Name | Url | Released On | Rating | Viewers | genre | summary |
|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | https://www.imdb.com//title/tt0111161/ | 1994 | 9.2 | 10.3K | Drama | Two imprisoned men bond over a number of years... |
| 1 | The Godfather | https://www.imdb.com//title/tt0068646/ | 1972 | 9.2 | 5.1K | Crime | The aging patriarch of an organized crime dyna... |
| 2 | The Dark Knight | https://www.imdb.com//title/tt0468569/ | 2008 | 9.0 | 8.2K | Action | When the menace known as the Joker wreaks havo... |
| 3 | The Godfather Part II | https://www.imdb.com//title/tt0071562/ | 1974 | 9.0 | 1.3K | Crime | The early life and career of Vito Corleone in ... |
| 4 | 12 Angry Men | https://www.imdb.com//title/tt0050083/ | 1957 | 8.9 | 1.9K | Crime | The jury in a New York City murder trial is fr... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 245 | Jai Bhim | https://www.imdb.com//title/tt15097216/ | 2021 | 8.0 | 3.3K | Crime | When a tribal man is arrested for a case of al... |
| 246 | Aladdin | https://www.imdb.com//title/tt0103639/ | 1992 | 8.0 | 381 | Animation | A kindhearted street urchin and a power-hungry... |
| 247 | Gandhi | https://www.imdb.com//title/tt0083987/ | 1982 | 8.0 | 328 | Biography | The life of the lawyer who became the famed le... |
| 248 | The Help | https://www.imdb.com//title/tt1454029/ | 2011 | 8.0 | 642 | Drama | An aspiring author during the civil rights mov... |


**Finally, we have the details of each movie. Let's save our work in the final CSV file, _IMDb Top 250 Movies_.**

In [65]:
final_df.to_csv('IMDb Top 250 Movies.csv', index=False)

### All the information is now saved in the CSV file, which we can check

## Summary
So, we have successfully parsed the IMDb Top 250 web page and extracted a lot of interesting and useful information from it.
All our work is saved in a CSV file, which gives us convenient access to the data.
 

![](https://comptiacdn.azureedge.net/webcontent/images/default-source/researchreports/data-analytics-vs.-data-science/data-analytics-vs-data-science.png?sfvrsn=28434515_0)

## Future Work
Now we can easily look up the top-rated movies and find movies in our favorite genre.
This data can be used for further analysis, and it can be refreshed by running all the cells again.

## References

1. Python documentation (https://www.python.org/)
2. Requests library (https://pypi.org/project/requests/)
3. Beautiful Soup (https://beautiful-soup-4.readthedocs.io/en/latest/)
4. Aakash NS Web Scraping Tutorial (https://www.youtube.com/watch?v=RKsLLG-bzEY&t=8940s)

