# Web Scraping

- Web scraping is a general term for techniques involving automating the gathering of data from a website.

- Prerequisites for web scraping:

    - Basic knowledge of **HTML, CSS**.
    
    - **Inspecting** a webpage for the required components.
    
- We should keep certain things in mind before doing web scraping:

    - Websites and their content are often protected by **copyright**. So, before scraping a website, review its *policies, rules, guidelines*, and the applicable *laws*. If required, get *permission* to scrape the website.
    
    - If you make too many requests to scrape a website, your IP address might be blocked for a certain amount of time. This is due to **Rate Limiting Mechanisms** in place.
    
    - Some websites may automatically **block** web scraping software.
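- One common way to check a site's scraping rules is its `robots.txt` file. The sketch below parses a hypothetical `robots.txt` (hard-coded here so it runs offline) with the standard library's `urllib.robotparser`. The user agent name `my-scraper` and the rules themselves are assumptions made up for illustration; a real script would download the file from the target site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real script would download it from
# https://<site>/robots.txt instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_robot_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


robots = build_robot_parser(ROBOTS_TXT)

# can_fetch() reports whether the given user agent may request the URL
allowed = robots.can_fetch('my-scraper', 'https://example.com/index.html')
blocked = robots.can_fetch('my-scraper', 'https://example.com/private/data.html')

# crawl_delay() returns the requested pause between requests, in seconds
delay = robots.crawl_delay('my-scraper')

print(allowed, blocked, delay)
```

- Waiting `delay` seconds between requests (for example with `time.sleep`) is one simple way to stay under a rate limit. `robots.txt` is only a convention, so it does not replace reading the site's terms.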
    
- Before going into web scraping, set your expectations right. 
    
    - Each website is **unique**. Therefore a web scraping script you develop will be unique to that website.
    
    - Websites often change their HTML. When that happens, an existing web scraping script might **break** and need to be rewritten.
    
- Now, for basic web scraping, there are 2 fundamental libraries in Python:

    - `requests` library: used to make an HTTP request to the desired website and get the HTML content of the webpage.
    
    - `BeautifulSoup` library: used to parse and grab things from the HTML content.
    
- `BeautifulSoup` requires a parser to parse the HTML. A common parser is the `lxml` parser.

- So, you need to install the following libraries (using the `pip install` command):

    - `requests`
    
    - `beautifulsoup4` (the official package name; it provides the `bs4` module)
    
    - `lxml`
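- After installing, a quick check like the sketch below can confirm that the modules are importable. It uses only the standard library; the helper name `find_missing` is an assumption introduced here, and the list uses import names (`bs4`), not pip package names.

```python
import importlib.util

# Import names of the modules this notebook relies on
REQUIRED_MODULES = ['requests', 'bs4', 'lxml']


def find_missing(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
print(missing or 'all required modules are importable')
```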
    
- There's actually a website designed specifically for practicing web scraping: [toscrape.com](https://toscrape.com/). We will use it in another notebook.

## Importing the required libraries

- The first step is to import the required libraries, that is, `requests` and `bs4`.

- `lxml` will be used by `bs4` in the backend, and we specify this later on in the process.

In [1]:
# import libraries
import requests
import bs4

## Basic Syntax

- The `requests` module has a `get()` function, which makes an HTTP GET request to a website and returns a `Response` object that holds the HTML content.

- To get a response from a website, we use the following syntax: `res = requests.get('WEBSITE_URL')`

- When specifying the website's URL, make sure you include the scheme, either `https` or `http`.

- If you print the `res` object, you also see the status code of the response, for example `200` or `404`.

- Now, make a request to [example.com](https://www.example.com).

- To view the HTML content of a response, we can use the `text` property available on the response object. It returns a string of HTML content.
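- A slightly more defensive version of the request might look like the sketch below. The helper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration. `raise_for_status()` turns error status codes such as `404` into exceptions. The last lines show why the scheme matters: `requests` rejects a URL without `http` or `https` before any network traffic happens, so this part runs offline.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the HTML of a page, raising an exception on HTTP error codes."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    return response.text


# A URL without a scheme fails while the request is being prepared,
# so no network connection is attempted here.
try:
    fetch_html('example.com')
except requests.exceptions.MissingSchema as error:
    scheme_error = type(error).__name__
    print(scheme_error)
```

- With a full URL, usage would be `html = fetch_html('https://example.com/')`.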

In [2]:
# make a request to example.com
response = requests.get('https://example.com/')
print(type(response))
print(response)

<class 'requests.models.Response'>
<Response [200]>


In [3]:
# .text property of the response object is the HTML document, as a string
print(type(response.text))
print(response.text)

<class 'str'>
<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
   

## Parsing Response HTML

- Now, we use the `BeautifulSoup` library to parse the HTML string.

- In the `bs4` package, there is a `BeautifulSoup` class which has the following syntax: `bs4.BeautifulSoup(HTML_String, parser)`.

- The second argument of the constructor is the HTML Parser we want to use, in our case, `lxml` parser.

- After parsing, we get a special `soup` object, which provides the functionality we need to scrape content from the webpage's HTML.
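- If the requested parser is not installed, `BeautifulSoup` raises `bs4.FeatureNotFound`. The sketch below falls back to Python's built-in `html.parser` in that case. The helper name `make_soup` and the inline HTML string are assumptions made up for illustration.

```python
import bs4

# Hypothetical HTML, hard-coded so the example runs without a network request
HTML = '<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>'


def make_soup(html):
    """Parse HTML with lxml if available, otherwise with the built-in parser."""
    try:
        return bs4.BeautifulSoup(html, 'lxml')
    except bs4.FeatureNotFound:
        # lxml is not installed; html.parser ships with Python
        return bs4.BeautifulSoup(html, 'html.parser')


soup_demo = make_soup(HTML)
print(soup_demo.title.getText())
```

- Different parsers can build slightly different trees from malformed HTML, so it is usually safer to settle on one parser per script.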

In [4]:
# parse the HTML using BeautifulSoup, using lxml parser as the parsing engine
# returns a special object
soup = bs4.BeautifulSoup(response.text, 'lxml')
print(type(soup))
print(soup)

<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for 

## Grabbing elements

- The `soup` object has a `select()` method, which grabs the HTML elements that match a **CSS selector**. Its syntax is `soup.select('CSS_SELECTOR')`.

- It returns a list-like `ResultSet` containing every element that matches the selector.

- In the example below, we grab the `title` of the webpage using a **type (tag name) selector**. Note that the return type is a `ResultSet`, which behaves like a `list`.
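- Beyond tag names, `select()` accepts other common CSS selector forms. The sketch below tries a few of them on a small hypothetical HTML snippet (made up for illustration) and uses the built-in `html.parser` so it runs even without `lxml`.

```python
import bs4

# Hypothetical HTML snippet, made up for illustration
HTML = """
<html><body>
  <div id="main">
    <p class="intro">First paragraph</p>
    <p>Second paragraph</p>
  </div>
  <p class="intro">Outside the div</p>
</body></html>
"""

demo_soup = bs4.BeautifulSoup(HTML, 'html.parser')

by_tag = demo_soup.select('p')               # every <p> element
by_class = demo_soup.select('.intro')        # elements with class="intro"
by_id = demo_soup.select('#main')            # the element with id="main"
nested = demo_soup.select('#main p')         # <p> elements inside #main
tag_and_class = demo_soup.select('p.intro')  # <p> elements with class="intro"

print(len(by_tag), len(by_class), len(by_id), len(nested), len(tag_and_class))
```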

In [5]:
# to grab the elements, use the .select() method
print(type(soup.select('title')))
print(soup.select('title'))

<class 'bs4.element.ResultSet'>
[<title>Example Domain</title>]


- Since we want a single element, we can use **indexing** on the `ResultSet` to grab it.

- In the code below, we grab the `title` element and extract the text inside the tag with the `getText()` method of the `Tag` object, the type `BeautifulSoup` uses to represent each element.
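- Indexing with `[0]` raises an `IndexError` when nothing matches, which is a typical symptom of a site changing its HTML. The sketch below shows two ways to guard against that, using a small hypothetical HTML string: `select_one()`, which returns `None` when there is no match, and an emptiness check on the result of `select()`.

```python
import bs4

# Hypothetical HTML with a <title> but no <h1>
demo_soup = bs4.BeautifulSoup(
    '<html><head><title>Demo</title></head><body></body></html>', 'html.parser'
)

# select_one() returns the first match, or None if nothing matches
title_el = demo_soup.select_one('title')
title_text = title_el.getText() if title_el is not None else None

missing_el = demo_soup.select_one('h1')  # no <h1> in the document

# select() returns an empty result when nothing matches, so check before indexing
matches = demo_soup.select('h1')
first_heading = matches[0].getText() if matches else None

print(title_text, missing_el, first_heading)
```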

In [6]:
page_title_element = soup.select('title')[0]
print(type(page_title_element))
print(page_title_element)
page_title = page_title_element.getText()
print(type(page_title))
print(page_title)

<class 'bs4.element.Tag'>
<title>Example Domain</title>
<class 'str'>
Example Domain


## Example: Grabbing elements based on class name

- As an argument to the `.select()` method of the `soup` object, we simply pass a CSS Selector.

- In the example below, let's grab all the table of contents items from [this Wikipedia article](https://en.wikipedia.org/wiki/Lee_Chong_Wei).

> **NOTE:** Wikipedia's text is published under a Creative Commons Attribution-ShareAlike license, which allows reuse with attribution.

In [7]:
# grab all the table of contents from a wikipedia article
response = requests.get('https://en.wikipedia.org/wiki/Lee_Chong_Wei')

soup = bs4.BeautifulSoup(response.text, 'lxml')

table_of_contents = soup.select('span.toctext')

for item in table_of_contents:
    print(item.getText())

Early life
Career
2002–2007
2008
2009
2010
2011
2012
2013
2014
Doping
2015
2016
2017
2018
Retirement
Personal life
Awards
Honours
Achievements
Career finals (69 titles, 34 runners-up)
In popular culture
See also
References
External links


## Example: Grabbing & Downloading Images

- Typically, all images on a webpage have their own URL.

- So, once we parse the `img` tags and get their `src` attributes, we make another request to each image URL.

- When we make a request to an image, the `response` has a `content` property, which holds the binary content of the image.

- So, to locally save that image, we open a file in `wb` (Write Binary) mode and write the content.

- In the example below, we grab and download 3 images from [this Wikipedia article](https://en.wikipedia.org/wiki/Lin_Dan).
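- A `src` attribute is not always a full URL. The sketch below uses the standard library's `urljoin` to resolve three hypothetical `src` values against the page URL, then writes placeholder bytes to a temporary file in `wb` mode. The `src` values, the helper name `save_binary`, and the placeholder bytes are assumptions made up for illustration; no request is made.

```python
import tempfile
from pathlib import Path
from urllib.parse import urljoin

PAGE_URL = 'https://en.wikipedia.org/wiki/Lin_Dan'

# Hypothetical src values covering three common forms
src_values = [
    '//upload.example.org/images/a.jpg',  # protocol-relative
    '/static/b.png',                      # relative to the site root
    'https://cdn.example.org/c.gif',      # already absolute
]

# urljoin() resolves each src against the URL of the page it came from
image_urls = [urljoin(PAGE_URL, src) for src in src_values]
print(image_urls)


def save_binary(path, data):
    """Write bytes to a file; the with-block closes it even on errors."""
    with open(path, 'wb') as f:
        f.write(data)


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / 'image_1.jpg'
    save_binary(target, b'\xff\xd8\xff')  # placeholder bytes, not a real image
    saved_size = target.stat().st_size

print(saved_size)
```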

> Before downloading & using the image, make sure you check the copyright guidelines and obtain the required permission.


In [8]:
response = requests.get('https://en.wikipedia.org/wiki/Lin_Dan')

soup = bs4.BeautifulSoup(response.text, 'lxml')

image_elements = soup.select('#bodyContent img.thumbimage')

print(len(image_elements))

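for ind, image_el in enumerate(image_elements):
    # assumes src is protocol-relative (starts with '//'), so prepend the scheme
    image_url = 'https:' + image_el['src']
    image_response = requests.get(image_url)
    binary_content = image_response.content
    # the with-block closes the file even if the write fails
    with open(f'image_{ind+1}.jpg', 'wb') as f:
        f.write(binary_content)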

3
