# Web Scraping IMDB Top 250

![title](img/crawling.jpg)

## Part 1 - Web Scraping Basics

### 1 - What is web scraping and web crawling?

**Web scraping** is the use of an automated program that queries a web server, requests data (usually in the form of HTML), and then parses that data to extract the needed information.

**Web crawling** refers to downloading and storing the contents of a large number of websites, by following links in web pages. Web crawlers are called such because they crawl across the web.

### 2 - Uses of Web Scraping/Crawling

**2.1 - Search Engines** – Search engines such as Google are among the largest companies whose whole business is based on web scraping. It is hard to imagine going a single day without using Google.

-  <font color=red>Googlebot</font> - _Google's web crawling bot. Crawling is the process by which Googlebot discovers new and updated pages to be added to the Google index. We use a huge set of computers to fetch (or "crawl") billions of pages on the web. Googlebot uses an algorithmic process: computer programs determine which sites to crawl, how often, and how many pages to fetch from each site._

**2.2 - Content Aggregators** – Almost all content aggregators use web scraping. Job aggregators, for example, scrape job boards and company websites to collect the latest job openings.

**2.3 - Other Applications** – Price monitoring, training datasets for machine learning, etc.

### 3 - How does a web scraper work?

-  Step 1: Download content of web pages
-  Step 2: Parse and extract data
-  Step 3: Store data as txt, csv, json or in database, etc.
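
The three steps above can be sketched end to end in a few lines. This is only a minimal sketch under stated assumptions: the page content is a hard-coded string (`SAMPLE_HTML`, a made-up example) rather than a real download, so it runs offline, and the standard library's `html.parser` stands in for a full parsing library. The names `ListItemParser` and `scrape_to_csv` are hypothetical helpers for illustration.

```python
import csv
import io
from html.parser import HTMLParser

# Step 1 (stand-in): a real scraper would download this text from a web server.
SAMPLE_HTML = "<ul><li>Movie A</li><li>Movie B</li></ul>"


class ListItemParser(HTMLParser):
    """Collect the text found inside every <li> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self.items[-1] += data


def scrape_to_csv(html):
    # Step 2: parse the HTML and extract the data we need.
    parser = ListItemParser()
    parser.feed(html)
    parser.close()

    # Step 3: store the data as CSV (in memory here; a file works the same way).
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Rank", "Name"])
    for rank, name in enumerate(parser.items, start=1):
        writer.writerow([rank, name.strip()])
    return buffer.getvalue()


print(scrape_to_csv(SAMPLE_HTML))
```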

### 4 - HTML Introduction

**4.1 What is HTML?**   

-  HTML is the standard markup language for creating Web pages.
-  HTML elements are the building blocks of HTML pages and are represented by tags.
-  HTML tags label pieces of content such as "heading", "paragraph", "table", and so on.
-  Browsers do not display the HTML tags, but use them to render the content of the page.

**4.2 A Simple HTML Document**

The HTML source of a simple page:

    <!DOCTYPE html>
    <html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>My First Heading</h1>
        <p>My first paragraph.</p>
    </body>
    </html>

**Example Explained**

-  The `<!DOCTYPE html>` declaration defines this document to be HTML5.
-  The `<html>` element is the root element of an HTML page.
-  The `<head>` element contains meta information about the document.
-  The `<title>` element specifies a title for the document.
-  The `<body>` element contains the visible page content.
-  The `<h1>` element defines a large heading.
-  The `<p>` element defines a paragraph.

**4.3 HTML Tags**

-  HTML tags are element names surrounded by angle brackets: `<tagname>content goes here...</tagname>`
-  HTML tags normally come in pairs, like `<p>` and `</p>`.
-  The first tag in a pair is the start tag; the second tag is the end tag.
-  The end tag is written like the start tag, but with a forward slash inserted before the tag name.
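
To see start and end tags as a parser sees them, the following minimal sketch feeds a fragment of the simple document to the standard library's `html.parser` and records every tag it meets. The class name `TagLogger` is a hypothetical helper; the HTML fragment is taken from the example above.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record each start tag and end tag in the order they appear."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append("<" + tag + ">")

    def handle_endtag(self, tag):
        # End tags carry a forward slash before the tag name.
        self.events.append("</" + tag + ">")


logger = TagLogger()
logger.feed("<body><h1>My First Heading</h1><p>My first paragraph.</p></body>")
logger.close()
print(logger.events)
```

Each start tag in the printed list is followed later by its matching end tag, which is the pairing described above.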

**4.4 Web Browsers**

-  The purpose of a web browser (Chrome, IE, Firefox, Safari) is to read HTML documents and display them.
-  The browser does not display the HTML tags, but uses them to determine how to display the document.
-  Note: Only the content inside the `<body>` section is displayed in a browser.

## Part 2 - Scraping IMDB Top 250

### 1 - Packages

In [1]:
import requests
print('Requests version: ' + requests.__version__)

import bs4
print('Beautiful Soup version: ' + bs4.__version__)
from bs4 import BeautifulSoup

Requests version: 2.23.0
Beautiful Soup version: 4.9.0


### 2 - Send Requests

In [2]:
# Send a request to https://www.imdb.com/chart/top and download the HTML Content of the page
r = requests.get('https://www.imdb.com/chart/top')
page_html = r.text
# Optionally save a local copy of the page source:
# f = open('source_code.txt','w')
# f.write(page_html)
# f.close()

# If the request above does not work, load the saved copy instead:
# with open('source_code.txt', 'r') as myfile:
#     page_html = myfile.read().replace('\n', '')
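
The save-and-reload idea in the commented lines can be wrapped in one helper, so the page is downloaded only when no local copy exists. This is a sketch, not part of the original notebook: `load_or_download` and its `download` parameter (any function that returns the page's HTML as a string) are hypothetical names, and the demonstration uses a stand-in downloader so that it runs without network access.

```python
import os
import tempfile
from pathlib import Path


def load_or_download(path, download):
    """Return the HTML saved at `path`, downloading and saving it first if needed."""
    saved = Path(path)
    if saved.exists():
        return saved.read_text(encoding="utf-8")
    html = download()
    saved.write_text(html, encoding="utf-8")
    return html


# Offline demonstration with a stand-in downloader (hypothetical).
calls = []


def fake_download():
    calls.append("hit")
    return "<html><body>demo</body></html>"


demo_path = os.path.join(tempfile.mkdtemp(), "source_code.txt")
first = load_or_download(demo_path, fake_download)   # downloads and saves
second = load_or_download(demo_path, fake_download)  # reads the saved copy
print(first == second, len(calls))
```

With the request from the cell above, the call might look like `load_or_download('source_code.txt', lambda: requests.get('https://www.imdb.com/chart/top').text)`.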

In [3]:
page_html[:500]

'\n\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n         \n        <meta charset="utf-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=edge">\n\n    \n    \n    \n\n    \n    \n    \n\n    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">\n            <style>\n                body#styleguide-v2 {\n                    background: no-repeat fixed center top #000;\n                }\n           '

### 3 - Pass the HTML Content to BeautifulSoup and construct a tree (BS object) to parse

In [4]:
### START CODE HERE ###
page_soup = BeautifulSoup(page_html, "html.parser")
### END CODE HERE ###

### 4 - Find all the tags inside the tree that include top 250 movies' information 

In [5]:
### START CODE HERE ###
movies = page_soup.find_all(name = "tr")
movies[:3]
### END CODE HERE ###

[<tr>
 <th></th>
 <th>Rank &amp; Title</th>
 <th>IMDb Rating</th>
 <th>Your Rating</th>
 <th></th>
 </tr>, <tr>
 <td class="posterColumn">
 <span data-value="1" name="rk"></span>
 <span data-value="9.222907958359116" name="ir"></span>
 <span data-value="7.791552E11" name="us"></span>
 <span data-value="2312962" name="nv"></span>
 <span data-value="-1.7770920416408842" name="ur"></span>
 <a href="/title/tt0111161/"> <img alt="The Shawshank Redemption" height="67" src="https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UY67_CR0,0,45,67_AL_.jpg" width="45"/>
 </a> </td>
 <td class="titleColumn">
       1.
       <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
 <span class="secondaryInfo">(1994)</span>
 </td>
 <td class="ratingColumn imdbRating">
 <strong title="9.2 based on 2,312,962 user ratings">9.2</strong>
 </td>
 <td class="ratingColumn">
 <div cla

**Exercise: Print out movie name, year, rating, number of user ratings for the highest ranking movie**

In [6]:
# Get bs4.element.Tag that includes the highest ranking movie info
### START CODE HERE ###
movie = movies[1]
movie
### END CODE HERE ###

<tr>
<td class="posterColumn">
<span data-value="1" name="rk"></span>
<span data-value="9.222907958359116" name="ir"></span>
<span data-value="7.791552E11" name="us"></span>
<span data-value="2312962" name="nv"></span>
<span data-value="-1.7770920416408842" name="ur"></span>
<a href="/title/tt0111161/"> <img alt="The Shawshank Redemption" height="67" src="https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UY67_CR0,0,45,67_AL_.jpg" width="45"/>
</a> </td>
<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
<span class="secondaryInfo">(1994)</span>
</td>
<td class="ratingColumn imdbRating">
<strong title="9.2 based on 2,312,962 user ratings">9.2</strong>
</td>
<td class="ratingColumn">
<div class="seen-widget seen-widget-tt0111161 pending" data-titleid="tt0111161">
<div class="boundary">
<div class="popover">
<span c

In [7]:
# Check the type
### START CODE HERE ###
type(movie)
### END CODE HERE ###

bs4.element.Tag

In [8]:
# Print out
### START CODE HERE ###
print(movie.prettify())
### END CODE HERE ###

<tr>
 <td class="posterColumn">
  <span data-value="1" name="rk">
  </span>
  <span data-value="9.222907958359116" name="ir">
  </span>
  <span data-value="7.791552E11" name="us">
  </span>
  <span data-value="2312962" name="nv">
  </span>
  <span data-value="-1.7770920416408842" name="ur">
  </span>
  <a href="/title/tt0111161/">
   <img alt="The Shawshank Redemption" height="67" src="https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UY67_CR0,0,45,67_AL_.jpg" width="45"/>
  </a>
 </td>
 <td class="titleColumn">
  1.
  <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">
   The Shawshank Redemption
  </a>
  <span class="secondaryInfo">
   (1994)
  </span>
 </td>
 <td class="ratingColumn imdbRating">
  <strong title="9.2 based on 2,312,962 user ratings">
   9.2
  </strong>
 </td>
 <td class="ratingColumn">
  <div class="seen-widget seen-widget-tt0111161 pending" data-titleid="tt0111

In [9]:
# Get movie name
### START CODE HERE ###
name = movie.find(name="td",attrs={"class":"titleColumn"}).find(name="a").string
# Replace commas so the title cannot break the comma-separated file written later
name = name.replace(",","|").strip()
### END CODE HERE ###
print(name)

The Shawshank Redemption


In [10]:
# Get movie year
### START CODE HERE ###
year = movie.find(name="td",attrs={"class":"titleColumn"}).find(name="span").string
year = year.replace(")", "").replace("(", "").strip()
### END CODE HERE ###
print(year)

1994


In [11]:
# Get movie rating
### START CODE HERE ###
rating = movie.find(name="td",attrs={"class":"ratingColumn imdbRating"}).find(name="strong").string
rating = rating.strip()
### END CODE HERE ###
print(rating)

9.2


In [12]:
# Get number of user ratings
### START CODE HERE ###
num_user_rating = movie.find(name="td",attrs={"class":"ratingColumn imdbRating"}).find(name="strong").attrs['title']
# The title reads "9.2 based on 2,312,962 user ratings", so the count is the fourth word
num_user_rating = num_user_rating.split(" ")[3].replace(",","")
### END CODE HERE ###
print(num_user_rating)

2312962
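
Splitting on spaces works as long as the title text keeps exactly this wording. A regular expression is one possible alternative that fails loudly when the format is not what we expect. The sketch below assumes the title always has the form shown in the output above; `TITLE_PATTERN` and `parse_rating_title` are hypothetical names.

```python
import re

# Assumed format, taken from the page source above:
#   "9.2 based on 2,312,962 user ratings"
TITLE_PATTERN = re.compile(
    r"^(?P<rating>\d+(?:\.\d+)?) based on (?P<votes>[\d,]+) user ratings$"
)


def parse_rating_title(title):
    """Return (rating, number_of_votes) parsed from the <strong> title text."""
    match = TITLE_PATTERN.match(title.strip())
    if match is None:
        raise ValueError("Unexpected title format: " + repr(title))
    rating = float(match.group("rating"))
    votes = int(match.group("votes").replace(",", ""))
    return rating, votes


print(parse_rating_title("9.2 based on 2,312,962 user ratings"))
```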


### 5 - Extract movie features and save data in a csv file

In [13]:
# File name 'imdb_top_250.csv'
### START CODE HERE ###
filename = "imdb_top_250.csv"
### END CODE HERE ###

In [14]:
# Open the file above in write mode
### START CODE HERE ###
f = open(filename, "w", encoding='utf-8')
### END CODE HERE ###

In [15]:
# Define header name
# Rank, Name, Year, Rating, Num_user_rating
### START CODE HERE ###
headers = "Rank,Name,Year,Rating,Num_user_rating\n"
### END CODE HERE ###

In [16]:
# Write header in csv
### START CODE HERE ###
f.write(headers)
### END CODE HERE ###

38

**Extract movie features and save data in a csv file**

In [17]:
### START CODE HERE ###
# movies[0] is the table header row, so the 250 movies are movies[1:251]
for rank, movie in enumerate(movies[1:251], start=1):

    # Look up each cell once and reuse it
    title_cell = movie.find(name="td", attrs={"class": "titleColumn"})
    rating_tag = movie.find(name="td", attrs={"class": "ratingColumn imdbRating"}).find(name="strong")

    # Replace commas in the title so they do not break the CSV columns
    name = title_cell.find(name="a").string.replace(",", "|").strip()
    year = title_cell.find(name="span").string.replace(")", "").replace("(", "").strip()
    rating = rating_tag.string.strip()
    # The title attribute reads like "9.2 based on 2,312,962 user ratings"
    num_user_rating = rating_tag.attrs['title'].split(" ")[3].replace(",", "")

    f.write(str(rank) + "," + name + "," + year + "," + rating + "," + num_user_rating + "\n")
### END CODE HERE ###
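
The loop above replaces commas in titles so that they do not split a row into extra columns. Another option, sketched here, is Python's built-in `csv` module, which quotes such fields automatically. The first row uses the values scraped earlier; the second row is a made-up title with a comma, included only to show the quoting. `rows_to_csv` is a hypothetical helper, and it writes to an in-memory buffer rather than to `imdb_top_250.csv`.

```python
import csv
import io

rows = [
    (1, "The Shawshank Redemption", "1994", "9.2", "2312962"),
    (2, "A Title, With a Comma", "2000", "8.0", "1000"),  # hypothetical row
]


def rows_to_csv(rows):
    """Write a header plus the given rows as CSV text and return it."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Rank", "Name", "Year", "Rating", "Num_user_rating"])
    writer.writerows(rows)
    return buffer.getvalue()


text = rows_to_csv(rows)
print(text)

# Reading the text back recovers the title with its comma intact.
parsed = list(csv.reader(io.StringIO(text)))
print(parsed[2][1])
```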

**Don't forget the last step!!! -- close the file**

In [18]:
### START CODE HERE ###
f.close()
### END CODE HERE ###
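
The closing step can also be made automatic. A `with` block closes the file when the block ends, even if an error occurs inside it. The sketch below writes to a temporary location instead of `imdb_top_250.csv`; `write_lines` is a hypothetical helper.

```python
import os
import tempfile


def write_lines(path, lines):
    """Write each line to `path`; the file is closed when the with block ends."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
    # Outside the with block the file object is already closed.
    return f.closed


path = os.path.join(tempfile.mkdtemp(), "demo.csv")
closed = write_lines(path, ["Rank,Name", "1,The Shawshank Redemption"])
print(closed)
```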