# Web Scraping IMDB Top 250

![title](img/crawling.jpg)

## Part 1 - Web Scraping Basics

### 1 - What is web scraping and web crawling?

**Web scraping** is the practice of using an automated program to query a web server, request data (usually in the form of HTML), and then parse that data to extract the needed information.

**Web crawling** refers to downloading and storing the contents of a large number of websites, by following links in web pages. Web crawlers are called such because they crawl across the web.

### 2 - Uses of Web Scraping/Crawling

**2.1 - Search Engines** – Search engines are among the largest businesses built on web crawling and scraping. It is hard to imagine going a single day without using Google.

-  <font color=red>Googlebot</font> - _Google's web crawling bot. Crawling is the process by which Googlebot discovers new and updated pages to be added to the Google index. We use a huge set of computers to fetch (or "crawl") billions of pages on the web. Googlebot uses an algorithmic process: computer programs determine which sites to crawl, how often, and how many pages to fetch from each site._

**2.2 - Content Aggregators** – Almost all content aggregators use web scraping. Job aggregators, for example, scrape job boards and company websites to collect the latest job openings.

**2.3 - Other Applications** – Price monitoring, training datasets for machine learning, etc.

### 3 - How does a web scraper work?

-  Step 1: Download content of web pages
-  Step 2: Parse and extract data
-  Step 3: Store data as txt, csv, or json files, or in a database, etc.
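
The three steps above can be sketched with only the Python standard library. This is a minimal illustration, not the approach used later in this notebook: the download step is replaced by an inline HTML string (an assumption, so the example runs without a network connection), and the class name `HeadingParser` is made up for this sketch.

```python
import csv
import io
from html.parser import HTMLParser

# Step 1 (stand-in): a real scraper would download this text from a web server.
PAGE_HTML = "<html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"


class HeadingParser(HTMLParser):
    """Hypothetical helper that collects the text inside <h1> elements."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.headings.append(data.strip())


# Step 2: parse the HTML and extract the data we care about.
parser = HeadingParser()
parser.feed(PAGE_HTML)

# Step 3: store the data as CSV (an in-memory buffer stands in for a file).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["heading"])
for heading in parser.headings:
    writer.writerow([heading])
csv_text = buffer.getvalue()
print(csv_text)
```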

### 4 - HTML Introduction

**4.1 What is HTML?**  

-  HTML is the standard markup language for creating Web pages.
-  HTML elements are the building blocks of HTML pages and are represented by tags.
-  HTML tags label pieces of content such as "heading", "paragraph", "table", and so on.
-  Browsers do not display the HTML tags, but use them to render the content of the page.

**4.2 A Simple HTML Document**

A minimal HTML document, shown as source:

    <!DOCTYPE html>
    <html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>My First Heading</h1>
        <p>My first paragraph.</p>
    </body>
    </html>

**Example Explained**   
-  The `<!DOCTYPE html>` declaration defines this document to be HTML5
-  The `<html>` element is the root element of an HTML page
-  The `<head>` element contains meta information about the document
-  The `<title>` element specifies a title for the document
-  The `<body>` element contains the visible page content
-  The `<h1>` element defines a large heading
-  The `<p>` element defines a paragraph

**4.3 HTML Tags**

-  HTML tags are element names surrounded by angle brackets: `<tagname>content goes here...</tagname>`
-  HTML tags normally come in pairs, like `<p>` and `</p>`.
-  The first tag in a pair is the start tag; the second is the end tag.
-  The end tag is written like the start tag, but with a forward slash inserted before the tag name.
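
To see the start tag / content / end tag structure from a program's point of view, here is a small sketch using Python's built-in `html.parser` module. The class name `TagLogger` is made up for illustration; it only records the order in which the parser meets each piece.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper that records start tags, text, and end tags in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.events.append(("data", data.strip()))


logger = TagLogger()
logger.feed("<p>My first paragraph.</p>")
for event in logger.events:
    print(event)
```

The parser reports the start tag `p`, then the text, then the end tag `p`, which is exactly the pairing described in the list above.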

**4.4 Web Browsers**

-  The purpose of a web browser (Chrome, IE, Firefox, Safari) is to read HTML documents and display them.
-  The browser does not display the HTML tags, but uses them to determine how to display the document.
-  Note: Only the content inside the `<body>` section is displayed in a browser.

## Part 2 - Scraping IMDB Top 250

### 1 - Packages

In [1]:
import requests
print('Requests version: ' + requests.__version__)

import bs4
print('Beautiful Soup version: ' + bs4.__version__)
from bs4 import BeautifulSoup

Requests version: 2.24.0
Beautiful Soup version: 4.7.1


### 2 - Send Requests

In [None]:
# Send a request to https://www.imdb.com/chart/top and download the HTML Content of the page
r = requests.get('https://www.imdb.com/chart/top')
page_html = r.text
# f = open('source_code.txt','w')
# f.write(page_html)
# f.close()

# If the request above does not work, load a previously saved copy of the page instead.
# with open('source_code.txt', 'r') as myfile:
#     page_html = myfile.read().replace('\n', '')
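
Some sites answer with an error status when a request does not carry a browser-like `User-Agent` header, and `r.text` then holds an error page rather than the content you wanted. The sketch below is one possible way to guard against that. The function name `fetch_html`, the header value, and the 10-second timeout are assumptions made for illustration; the `get` parameter exists only so the function can be tried without a network connection.

```python
import requests

# Assumed header value; any descriptive User-Agent string could be used here.
REQUEST_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; tutorial-scraper)"}


def fetch_html(url, get=requests.get, timeout=10):
    """Download a page and return its HTML, raising an error on a 4xx/5xx status."""
    response = get(url, headers=REQUEST_HEADERS, timeout=timeout)
    # raise_for_status() turns a failed response (e.g. 403, 404) into an exception
    # instead of silently handing back the error page.
    response.raise_for_status()
    return response.text
```

With this helper, `page_html = fetch_html('https://www.imdb.com/chart/top')` would either return the page source or fail loudly, which makes it clear when to fall back to the saved copy.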

In [None]:
page_html[:500]

### 3 - Pass the HTML Content to BeautifulSoup and construct a tree (BS object) to parse

In [None]:
### START CODE HERE ###
page_soup = None
### END CODE HERE ###
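
As a hint for the cell above, here is a minimal sketch of building a BeautifulSoup tree. It uses the simple HTML document from Part 1 rather than the IMDB page, and it picks `"html.parser"` (Python's built-in parser) so nothing extra needs to be installed; other parsers would work as well.

```python
from bs4 import BeautifulSoup

sample_html = """
<html>
<head><title>Page Title</title></head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
"""

# Build the parse tree from the HTML string.
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Tags can be reached as attributes of the tree.
title_text = sample_soup.title.string
heading_text = sample_soup.h1.get_text()
print(title_text)
print(heading_text)
```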

### 4 - Find all the tags inside the tree that include top 250 movies' information 

In [None]:
### START CODE HERE ###
movies = None
movies[:3]
### END CODE HERE ###
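
Finding every tag that matches a pattern is typically done with `find_all` or with a CSS selector via `select`. The sketch below runs on made-up markup: the tag and class names (`movie-row`, `name`, `score`) are hypothetical and are not IMDB's actual page structure, which you should inspect in your browser before writing your own query.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
sample_html = """
<table>
  <tr class="movie-row"><td class="name">Film A</td><td class="score">9.1</td></tr>
  <tr class="movie-row"><td class="name">Film B</td><td class="score">8.7</td></tr>
  <tr class="other"><td>Not a movie</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Match by tag name and class attribute (note the trailing underscore in class_).
rows = soup.find_all("tr", class_="movie-row")

# The same query written as a CSS selector.
same_rows = soup.select("tr.movie-row")

names = [row.find("td", class_="name").get_text() for row in rows]
print(len(rows), names)
```

Both calls return a list-like result, so slicing such as `rows[:3]` works the same way as on a regular list.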

**Exercise: Print out the movie name, year, rating, and number of user ratings for the highest-ranking movie**

In [None]:
# Get bs4.element.Tag that includes the highest ranking movie info
### START CODE HERE ###
movie = None
movie
### END CODE HERE ###

In [None]:
# Check the type
### START CODE HERE ###
None
### END CODE HERE ###

In [None]:
# Print out
### START CODE HERE ###
None
### END CODE HERE ###

In [None]:
# Get movie name
### START CODE HERE ###
name = None
name = None
### END CODE HERE ###
print(name)

In [None]:
# Get movie year
### START CODE HERE ###
year = None
year = None
### END CODE HERE ###
print(year)

In [None]:
# Get movie rating
### START CODE HERE ###
rating = None
rating = None
### END CODE HERE ###
print(rating)

In [None]:
# Get number of user ratings
### START CODE HERE ###
num_user_rating = None
num_user_rating = None
### END CODE HERE ###
print(num_user_rating)
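
The cells above follow a two-step pattern: first locate the tag, then turn it into a clean value. Here is a sketch of that pattern on a single made-up table row. All tag names, class names, and values are hypothetical and chosen only for illustration; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup and values for illustration only.
row_html = """
<tr class="movie-row">
  <td class="name"><a href="/title/0000001/">Film A</a> <span class="year">(1999)</span></td>
  <td class="score"><strong title="9.1 based on 1,234 user ratings">9.1</strong></td>
</tr>
"""
row = BeautifulSoup(row_html, "html.parser").find("tr")

# Name: locate the link, then take its text.
name = row.find("td", class_="name").a
name = name.get_text(strip=True)

# Year: locate the span, then strip the parentheses and convert to int.
year = row.find("span", class_="year")
year = int(year.get_text().strip("()"))

# Rating: locate the <strong> tag, then convert its text to float.
rating = row.find("strong")
rating = float(rating.get_text())

# Number of ratings: read an attribute with tag["attr"], then clean the number.
num_user_rating = row.find("strong")["title"]
num_user_rating = int(num_user_rating.split()[3].replace(",", ""))

print(name, year, rating, num_user_rating)
```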

### 5 - Extract movie features and save data in a csv file

In [None]:
# File name 'imdb_top_250.csv'
### START CODE HERE ###
filename = None
### END CODE HERE ###

In [None]:
# Open the file named above in write mode
### START CODE HERE ###
f = None
### END CODE HERE ###

In [None]:
# Define header name
# Rank, Name, Year, Rating, Num_user_rating
### START CODE HERE ###
headers = None
### END CODE HERE ###

In [None]:
# Write header in csv
### START CODE HERE ###
None
### END CODE HERE ###

**Extract movie features and save data in a csv file**

In [None]:
### START CODE HERE ###
Rank = None
for ...:
    
    Rank = None
    
    Name = None
    Name = None
    
    Year = None
    Year = None

    Rating = None
    Rating = None
        
    Num_user_rating = None
    Num_user_rating = None
    
    None
### END CODE HERE ###

**Don't forget the last step!!! -- close the file**

In [None]:
### START CODE HERE ###
None
### END CODE HERE ###
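
An alternative to opening and closing the file by hand is a `with` block combined with the standard `csv` module. This is a sketch of a different technique from the one the exercise asks for, not its expected solution: the function name `save_movies` and the sample rows are made up for illustration.

```python
import csv

HEADERS = ["Rank", "Name", "Year", "Rating", "Num_user_rating"]


def save_movies(filename, movies):
    """Write (name, year, rating, num_user_rating) tuples to a CSV file in rank order."""
    # newline="" lets the csv module control line endings itself.
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(HEADERS)
        # enumerate(..., start=1) supplies the rank without a manual counter.
        for rank, (name, year, rating, num) in enumerate(movies, start=1):
            writer.writerow([rank, name, year, rating, num])
    # Leaving the with block closes the file, even if an error occurred.


# Made-up sample data; the first name contains a comma on purpose.
sample_movies = [
    ("Film A, Part 2", 1999, 9.1, 1234),
    ("Film B", 2005, 8.7, 987),
]
```

Calling `save_movies('imdb_top_250.csv', sample_movies)` would write a header line followed by two rows. Because `csv.writer` quotes fields that contain commas, a name like `Film A, Part 2` stays in a single column, which a plain `f.write` with comma-joined strings would not guarantee.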