# 14 - Web Scraping

## What is Web Scraping?
**Web Scraping** is a technique used to automatically extract information from websites. It involves accessing the HTML content of a web page, parsing that content, and extracting the data that we're interested in. This can include collecting text, images, links, and structured data such as tables.

<p align="center">
  <figure align="center">
    <img src="imgs/web_scraping1.png" alt="Alt text">
    <figcaption>Source: WebHarvey</figcaption>
  </figure>
</p>

### Why is Web Scraping Useful?
- **Automated Data Collection**: Often, the data we need is scattered across various web pages, and manually collecting this data would be inefficient and error-prone. Web scraping automates this process.

- **Access to Unstructured Data**: Many websites display unstructured data that is not readily available through traditional APIs. Web scraping helps convert this unstructured data into a structured format, making it easier to analyze and use in various applications.

- **Generating Datasets**: Machine learning models often require large datasets, and scraping can be an effective way to collect training data from the web.

### Real-world applications of web scraping:
- **Price Comparison**: Collecting product prices from multiple e-commerce platforms to compare prices and identify the best deals.

- **Employment Trends Analysis**: Scraping job postings to analyze industry demand, skill requirements, and employment patterns over time.

- **Sentiment Analysis**: Gathering content from news outlets or social media platforms to assess public sentiment and trends based on text data.

## Introduction to HTTP and HTML
To scrape a website, it’s essential to understand two key concepts: **HTTP** and **HTML**.

### What is HTTP?
**HTTP (HyperText Transfer Protocol)** is the protocol used to transfer data over the web. 

It is a request-response protocol, meaning the client (your browser or Python code) sends a request to a web server, and the server responds with data.

<p align="center">
  <figure align="center">
    <img src="imgs/web_scraping2.webp" alt="Alt text" width="550" height="250">
    <figcaption>Source: JC Chouinard</figcaption>
  </figure>
</p>

#### How HTTP Works:
- **Client-Server Communication**: The client sends an HTTP request, and the server sends back an HTTP response. This response could be HTML (which we will scrape), JSON, or another type of data.
    
    - **Example**: When you enter "https://example.com" into your browser, your browser sends a GET request to the server hosting that website. The server responds with the website’s HTML.

In [1]:
# Example: Making an HTTP request using the requests library
import requests

response = requests.get("https://example.com")
print(response.text[:500])  # Output the first 500 characters of the HTML content

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
    


### What is HTML?
**HTML (HyperText Markup Language)** is the language used to create and structure web pages. When you visit a website, the browser receives an HTML document from the server and renders it into a web page. An HTML document consists of a series of **tags** that define elements such as headings, paragraphs, links, and tables.

HTML structure is hierarchical: elements nest inside one another, and each element is described by its tag and attributes:
- **Tags**: Tags are the building blocks of HTML (e.g., `<p>` for paragraphs, `<h1>` for headings).

- **Attributes**: Tags can have attributes (e.g., `<a href="https://example.com">Link</a>` where `href` is an attribute).

<p align="center">
    <img src="imgs/web_scraping3.png" alt="Alt text"  width="550" height="450">
</p>

#### Why HTML is Important for Web Scraping:
Since the content of most websites is written in HTML, we need to understand its structure to extract the information we need efficiently.
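To make the hierarchy concrete, here is a minimal sketch that walks a small, made-up HTML snippet with Python's built-in `html.parser` module and records each opening tag, its nesting depth, and its attributes. The `OutlineParser` class and `sample_html` string are illustrative names, not part of the lesson's code.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Record every opening tag with its nesting depth and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.outline.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


# Hypothetical sample document used only for this illustration
sample_html = """
<html>
  <body>
    <h1>Quotes</h1>
    <p>Read more at <a href="https://example.com">Example</a>.</p>
  </body>
</html>
"""

parser = OutlineParser()
parser.feed(sample_html)

# Indent each tag by its depth to show the tree structure
for depth, tag, attrs in parser.outline:
    print("  " * depth + f"<{tag}> {attrs}")
```

The indentation in the printed outline mirrors the nesting: `<a>` sits inside `<p>`, which sits inside `<body>`, and the `href` attribute appears only on the `<a>` tag.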

## Types of HTTP Requests
The type of HTTP request made depends on the action you want to perform. Here are the four most commonly used methods:

### `GET`
- **Purpose**: Retrieves data from a server.

- **Common Usage**: Loading a webpage, downloading a file, or requesting specific resources (like images or documents).

- **How it works**: The client sends a `GET` request asking the server for data. The request carries no body; any parameters are passed in the URL's query string.

- **Example**: A `GET` request is made to the URL https://example.com. The server responds, and the status code `200` indicates that the request was successful.

In [2]:
response = requests.get("https://example.com")
print(response.status_code)  # 200 indicates success

200


### `POST`
- **Purpose**: Sends data to a server (e.g., submitting a form).

- **Common Usage**: Used when you need to upload data to a server or send information, like login credentials or form data.

- **How it works**: The client sends data (like a form submission) along with the `POST` request.

- **Example**: A `POST` request is sent to the URL https://example.com. The data being sent includes a username and password, which are submitted to the server.

In [3]:
response = requests.post("https://example.com", data={"username": "user", "password": "pass"})

### `PUT`
- **Purpose**: Updates a resource on the server.

- **Common Usage**: Used for updating information that already exists on the server.

- **How it works**: The client sends data along with the `PUT` request to replace or update a resource.

- **Example**: The `PUT` request is used to update a resource (in this case, an item with ID `1`) with new data (a new name and email).

In [4]:
response = requests.put("https://example.com/update/1", data={"name": "New Name", "email": "newemail@example.com"})

### `DELETE`
- **Purpose**: Deletes a specified resource from the server.

- **Common Usage**: Removing records or data from a server.

- **How it works**: The client sends a `DELETE` request to remove specific data from the server.

- **Example**: A `DELETE` request is sent to https://example.com/delete/1, where the resource with ID `1` is deleted from the server.

In [5]:
response = requests.delete("https://example.com/delete/1")

### HTTP Response Codes:
HTTP responses come with status codes that indicate the result of the request. Here are a few common status codes:

- `200`: OK (the request was successful)

- `404`: Not Found (the requested resource was not found on the server)

- `500`: Internal Server Error (the server encountered an error)

- For a complete list of HTTP status codes, visit [this link](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).

In [4]:
# Example of checking the status code
response = requests.get("https://example.com")

if response.status_code == 200:
    print("Request was successful")
else:
    print(f"Failed with status code: {response.status_code}")

Request was successful
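Status codes are grouped by their first digit, which is often more useful to a scraper than the exact code. As a small sketch, the helper below (the name `describe_status` is made up for this example) uses Python's built-in `http.HTTPStatus` enum to look up the standard phrase for a code and label its class.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Code is not one of the standard statuses known to the enum
        phrase = "Unknown"
    category = categories.get(code // 100, "unknown class")
    return f"{code} {phrase} ({category})"


for code in (200, 404, 500):
    print(describe_status(code))
```

In `requests`, calling `response.raise_for_status()` is a related shortcut: it raises an exception for any 4xx or 5xx response instead of requiring a manual comparison.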


Understanding these request types will help in choosing the right method when scraping data or interacting with APIs.
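One way to see how `GET` and `POST` differ without contacting any server is to build the requests and inspect them before they are sent. The sketch below uses `requests.Request(...).prepare()` for that purpose; the `/search` and `/login` paths and the parameter names are hypothetical.

```python
import requests

# Build (but do not send) a GET request with query parameters
get_request = requests.Request(
    "GET",
    "https://example.com/search",  # hypothetical endpoint
    params={"q": "web scraping", "page": 2},
).prepare()

# Build (but do not send) a POST request with form data
post_request = requests.Request(
    "POST",
    "https://example.com/login",  # hypothetical endpoint
    data={"username": "user", "password": "pass"},
).prepare()

# GET: the parameters are encoded into the URL and there is no body
print(get_request.method, get_request.url)
print(get_request.body)

# POST: the URL is unchanged and the data travels in the request body
print(post_request.method, post_request.url)
print(post_request.body)
print(post_request.headers["Content-Type"])
```

Nothing here touches the network, so it is safe to run while experimenting with how parameters and form fields are encoded.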

## Setting Up for Web Scraping: Installing Python Libraries
To scrape web pages efficiently, we need two Python libraries:
- `requests`: This library allows us to send HTTP requests to download web pages.

- `BeautifulSoup`: This library allows us to parse HTML and extract specific content from it.

### Installation:
Before we begin, install these libraries using `pip`:

In [None]:
!pip install requests beautifulsoup4

Once installed, we can begin scraping static web pages.

## Fetching Web Page Content
The first step in web scraping is to fetch the HTML content of the page we want to scrape. We use the `requests` library to make an HTTP request and retrieve the page’s HTML.

### Steps to Fetch a Web Page:
1. **Make a GET Request**: Send a `GET` request to the URL of the webpage.

In [6]:
# The `requests` library was imported at the top of this notebook
# import requests

# Step 1: Fetch the HTML content of a webpage
url = "https://quotes.toscrape.com/"
response = requests.get(url)

2. **Check the Response Status**: Ensure that the request was successful (status code `200`).

In [7]:
# Step 2: Check if the request was successful
if response.status_code == 200:
    print("Page fetched successfully!")
else:
    print(f"Failed with status code: {response.status_code}")

Page fetched successfully!


3. **Extract the HTML**: Once the request is successful, extract the HTML content for further processing.

In [8]:
# Step 3: Print the first 500 characters of the HTML content
print(response.text[:500])

<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
    
    
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div cla


### What Happens Behind the Scenes:
1. A `GET` request is sent to the server hosting https://quotes.toscrape.com/.

2. The server responds with the HTML of the page.

3. The HTML is stored in the `text` attribute of the response object (`response.text`), and we can now parse it.
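The three steps above can be wrapped in a single reusable function. This is a minimal sketch rather than the lesson's own code: the function name `fetch_html`, the 10-second timeout, and the `User-Agent` string are all assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as a string, or None if the request fails."""
    # Hypothetical User-Agent string that identifies the scraper to the server
    headers = {"User-Agent": "web-scraping-lesson/0.1 (educational example)"}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        # Raise an exception for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response.text
```

A call such as `html = fetch_html("https://quotes.toscrape.com/")` would then return the page's HTML, or `None` if the server could not be reached or responded with an error status. Setting a timeout keeps the script from hanging indefinitely on an unresponsive server.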

## Parsing HTML with BeautifulSoup
Once the HTML content is fetched, the next step is to parse the document so we can extract specific data. 

**BeautifulSoup** provides an intuitive interface to navigate through the HTML tree structure.

### How BeautifulSoup Works:
- **Tree Structure**: BeautifulSoup treats the HTML document as a nested tree of elements. For example, an `<html>` tag contains a `<body>` tag, which may contain a series of `<div>` or `<p>` tags.

- **Navigating the Tree**: We can navigate through this tree structure and extract the text or attributes of the elements we’re interested in.

In [9]:
from bs4 import BeautifulSoup

# Step 4: Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Pretty-print the first 500 characters of the parsed HTML
print(soup.prettify()[:500])

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
   


The `.prettify()` method returns the HTML as an indented string, which makes the hierarchy of elements easier to read.

### Common Methods in BeautifulSoup:
- `.find()`: Finds the first matching element.

- `.find_all()`: Finds all elements that match the criteria.

- `.select()`: Selects elements using CSS selectors.
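To compare the three methods side by side, the following sketch runs each of them on a small, made-up HTML string. The variable name `soup_demo` and the sample markup are illustrative assumptions, chosen so the example does not depend on a live website.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this comparison
sample_html = """
<div class="quote">
  <span class="text">First quote</span>
  <small class="author">Author A</small>
</div>
<div class="quote">
  <span class="text">Second quote</span>
  <small class="author">Author B</small>
</div>
"""

soup_demo = BeautifulSoup(sample_html, "html.parser")

# .find() returns only the first matching element (or None if nothing matches)
first_quote = soup_demo.find("span", class_="text")

# .find_all() returns a list of every matching element
all_quotes = soup_demo.find_all("span", class_="text")

# .select() takes a CSS selector and also returns a list
authors = soup_demo.select("div.quote small.author")

print(first_quote.text)
print([quote.text for quote in all_quotes])
print([author.text for author in authors])
```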

## Extracting Specific Information from Web Pages
After parsing the HTML, we can start extracting specific data, such as quotes, authors, or any other element that interests us.

### Extracting Data by Tag Name:
We can use the `.find()` or `.find_all()` methods to extract elements based on their tag names (like `<h1>`, `<p>`, etc.).

#### Example: Extract all quotes from the page
Let's say we want to extract all quotes from the page. After inspecting the HTML, we see that each quote sits in a `<span>` tag with `class='text'`. We can use `.find_all()` to extract every element that matches these criteria.

<p align="center">
    <img src="imgs/web_scraping4.png" alt="Alt text">
</p>

In [11]:
# Extract all quotes from the page
quotes = soup.find_all('span', class_='text')

# Display the first 5 quotes
for i, quote in enumerate(quotes[:5]):
    print(f"Quote {i+1}: {quote.text}")

Quote 1: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Quote 2: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Quote 3: “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Quote 4: “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Quote 5: “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”


### Extracting Links
Links are defined with the `<a>` (anchor) tag, so we can collect them by searching for that tag.

#### Example: Extract all links from page

<p align="center">
    <img src="imgs/web_scraping5.png" alt="Alt text">
</p>

In [13]:
# Extract all links (anchor tags) from the page
links = soup.find_all('a')

# Display the first 5 links
for link in links[:5]:
    print(link.get('href'))

/
/login
/author/Albert-Einstein
/tag/change/page/1/
/tag/deep-thoughts/page/1/


The `get('href')` method returns the value of each link's `href` attribute. As the output shows, these values can be relative paths rather than full URLs.
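Relative paths such as `/login` need to be combined with the site's base URL before they can be requested. A minimal sketch using the standard library's `urllib.parse.urljoin`, with a few of the paths from the output above hard-coded for illustration:

```python
from urllib.parse import urljoin

base_url = "https://quotes.toscrape.com/"

# Relative paths copied from the output above
relative_links = ["/", "/login", "/author/Albert-Einstein", "/tag/change/page/1/"]

# urljoin resolves each path against the base URL
absolute_links = [urljoin(base_url, href) for href in relative_links]

for link in absolute_links:
    print(link)
```

`urljoin` also leaves links that are already absolute unchanged, so it is safe to apply to every `href` on a page.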

### Using CSS Selectors:
CSS selectors provide a more flexible way to extract data by matching patterns in the HTML structure.

#### Example:
<p align="center">
    <img src="imgs/web_scraping6.png" alt="Alt text">
</p>

In [14]:
# Extracting elements using CSS selectors
tags = soup.select('div.tags a.tag')
for tag in tags[:5]:
    print(tag.text)

change
deep-thoughts
thinking
world
abilities


CSS selectors allow you to target elements more precisely when tags alone are not enough.
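Selectors are also useful for keeping related fields together. The sketch below first selects each quote block and then searches inside it, so every quote stays paired with its own author and tags. The HTML is a simplified stand-in loosely modeled on the practice site's structure, and the names `page` and `records` are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup for one quote block
sample_html = """
<div class="quote">
  <span class="text">A sample quote.</span>
  <span>by <small class="author">Sample Author</small></span>
  <div class="tags">
    <a class="tag" href="/tag/change/">change</a>
    <a class="tag" href="/tag/world/">world</a>
  </div>
</div>
"""

page = BeautifulSoup(sample_html, "html.parser")

records = []
for block in page.select("div.quote"):
    # Search within this block only, so fields from different quotes never mix
    records.append(
        {
            "text": block.select_one("span.text").get_text(strip=True),
            "author": block.select_one("small.author").get_text(strip=True),
            "tags": [a.get_text(strip=True) for a in block.select("div.tags a.tag")],
        }
    )

print(records)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame(records)` if a tabular view is needed.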

## Extracting and Storing Tables in Pandas
Many websites display data in tables (e.g., product listings, stock prices). These tables can be extracted and stored in a Pandas DataFrame for analysis.

### Steps to Extract a Table:
For this example, we will use [the Wikipedia page on world population](https://en.wikipedia.org/wiki/World_population) and extract the "Historical estimates of world population" table.

1. **Identify the Table**: Open the page in your browser and locate the table of world population estimates that we'll scrape.

<p align="center">
    <img src="imgs/web_scraping7.png" alt="Alt text">
</p>

2. **Fetch the Web Page Content**: Use the `requests` library to fetch the HTML content of the Wikipedia page.

In [15]:
# Wikipedia page with the table
url = "https://en.wikipedia.org/wiki/World_population"

# Send an HTTP request to fetch the page content
response = requests.get(url)

# Check the response status
if response.status_code == 200:
    print("Page fetched successfully!")
else:
    print(f"Failed to fetch page: {response.status_code}")

Page fetched successfully!


3. **Parse the HTML Content**: After fetching the page, use **BeautifulSoup** to parse the HTML and locate the table.

In [16]:
# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Pretty-print the HTML to visualize the structure
print(soup.prettify()[:1000])  # Show the first 1000 characters of the page

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   World population - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-c

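4. **Find and Extract the Table**: Now we need to find the table on the page. Tables are defined with the `<table>` tag, and here we search for one with the class `wikitable`. Because `.find()` returns only the first match, the code below extracts the first `wikitable` on the page. Let's locate the table and extract its rows.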

In [18]:
# Find the table by its class name or id (using the class "wikitable" for Wikipedia tables)
table = soup.find('table', {'class': 'wikitable'})

# Extract all rows in the table
rows = table.find_all('tr')

# Print the first row (header row) to inspect it
print(rows[0])

<tr>
<th scope="row">Population
</th>
<th scope="col">1
</th>
<th scope="col">2
</th>
<th scope="col">3
</th>
<th scope="col">4
</th>
<th scope="col">5
</th>
<th scope="col">6
</th>
<th scope="col">7
</th>
<th scope="col">8
</th>
<th scope="col">9
</th>
<th scope="col">10
</th></tr>


5. **Extract the Data from Each Row**: We need to loop through the table rows (`<tr>`) and extract the text from each cell (`<td>` for data cells and `<th>` for header cells).

In [19]:
# Initialize an empty list to hold the rows of data
table_data = []

# Loop through the rows in the table
for row in rows:
    # Extract the cells from the row
    cells = row.find_all(['td', 'th'])
    
    # Extract the text from each cell and strip any surrounding whitespace
    row_data = [cell.text.strip() for cell in cells]
    
    # Append the row data to our list
    table_data.append(row_data)

# Print the first few rows of extracted data
print(table_data[:5])

[['Population', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10'], ['Year', '1804', '1927', '1960', '1974', '1987', '1999', '2011', '2022', '2037', '2057'], ['Years elapsed', '200,000+', '123', '33', '14', '13', '12', '12', '11', '15', '20']]


6. **Convert the Data into a Pandas DataFrame**: Once we've extracted the table data, we can convert it into a Pandas `DataFrame`. This allows us to easily manipulate, analyze, or store the data for later use.

In [20]:
import pandas as pd

# Create a DataFrame from the table data
df = pd.DataFrame(table_data)

# Rename the columns based on the header row (assuming the first row is the header)
df.columns = df.iloc[0]  # Set the first row as column headers
df = df.drop(0)  # Drop the header row from the data

# Display the first few rows of the DataFrame
df.head()

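|   | Population | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Year | 1804 | 1927 | 1960 | 1974 | 1987 | 1999 | 2011 | 2022 | 2037 | 2057 |
| 2 | Years elapsed | 200,000+ | 123 | 33 | 14 | 13 | 12 | 12 | 11 | 15 | 20 |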


7. **Save the Data to a CSV File**: We can save the extracted table to a CSV file for further analysis or use.

    - You can also save the DataFrame as a table in a database via `SQLAlchemy`, `SQLite`, etc.

In [21]:
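import os

# Create the output folder if it does not exist yet, so to_csv does not fail
os.makedirs('data', exist_ok=True)

# Save the DataFrame to a CSV file
df.to_csv('data/world_population.csv', index=False)

print("Table data has been saved to 'world_population.csv'")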

Table data has been saved to 'world_population.csv'
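As a sketch of the database option mentioned above, the example below writes a small DataFrame to SQLite with `DataFrame.to_sql` and reads part of it back with `pd.read_sql`. The in-memory database, the table name `population_milestones`, and the column names are assumptions made for illustration; the three rows reuse milestone values from the scraped table.

```python
import sqlite3

import pandas as pd

# Small illustrative DataFrame (values taken from the scraped milestones)
milestones = pd.DataFrame(
    {"population_billions": [1, 2, 3], "year": [1804, 1927, 1960]}
)

# An in-memory database keeps the example self-contained;
# pass a file path such as 'data/scraped.db' to persist the data instead
conn = sqlite3.connect(":memory:")

# Write the DataFrame to a table, replacing it if it already exists
milestones.to_sql("population_milestones", conn, index=False, if_exists="replace")

# Read back only the rows after 1900 using plain SQL
loaded = pd.read_sql("SELECT * FROM population_milestones WHERE year > 1900", conn)
conn.close()

print(loaded)
```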


## Handling Dynamic Websites
Some websites load content dynamically using JavaScript. In these cases, the content may not be immediately available in the HTML source code. To handle this, we can use **Selenium**, a browser automation tool that interacts with web pages like a user.

### Key Features of Selenium:
- **JavaScript Rendering**: Runs the page's JavaScript in a real browser and can wait for dynamic content to load before scraping.

- **User Interaction Simulation**: Can mimic actions like clicks, form submissions, and scrolling.

- **Multi-step Navigation**: Automates workflows that require user actions, such as logging into websites.
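The features above might be combined as in the following sketch. It rests on several assumptions: Selenium 4 and Google Chrome are installed, and the target is the JavaScript-rendered variant of the practice site at `https://quotes.toscrape.com/js/`, where quotes are expected to appear in `span.text` elements. The function name `scrape_js_quotes` is made up for this example.

```python
def scrape_js_quotes(url="https://quotes.toscrape.com/js/", timeout=10):
    """Load a JavaScript-rendered page in headless Chrome and return the quote texts."""
    # Imported inside the function so the definition loads even if Selenium is missing
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a visible window

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Explicitly wait until the quote elements are present in the page
        elements = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "span.text"))
        )
        return [element.text for element in elements]
    finally:
        # Always close the browser, even if the wait times out
        driver.quit()
```

Calling `scrape_js_quotes()` launches a browser, so it is noticeably slower than a plain `requests.get` call. The explicit wait matters here because the quotes are added to the page only after its JavaScript has run.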

### Pros of Selenium:
- Works with multiple browsers (Chrome, Firefox, etc.).

- Supports **headless mode** for environments without a graphical interface.

### Limitations:
- Resource-intensive and slower compared to static scrapers.

- Running in environments like Codespaces can be challenging due to GUI limitations.

### Alternatives:
- **Playwright**: A robust alternative that also supports headless browsing and handles dynamic content.

- **Splash**: A lightweight option for rendering JavaScript-heavy pages.

## Ethics of Web Scraping
While web scraping is a powerful tool, it is essential to follow ethical guidelines and avoid scraping data without permission.

### Key Ethical Considerations:
1. **Check the `robots.txt` File**: Many websites specify scraping permissions in their `robots.txt` file. Always check this file before scraping.

In [25]:
# URL of Wikipedia's robots.txt file
robots_url = "https://en.wikipedia.org/robots.txt"

# Send a GET request to fetch the robots.txt file
response = requests.get(robots_url)

# Check if the request was successful
if response.status_code == 200:
    print("robots.txt file fetched successfully!")
else:
    print(f"Failed to fetch robots.txt file: {response.status_code}")

# Print the contents of the robots.txt file
print(response.text[:500])

robots.txt file fetched successfully!
# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapa


The `robots.txt` file for Wikipedia contains rules that control which sections of the website can be crawled or scraped. Let's break down some of the key parts of the file.

- **User-agent: \***: This means the rules apply to all web crawlers and scrapers.

- **Disallow**: These are paths that crawlers are not allowed to access. For example, crawlers are instructed not to scrape URLs starting with `/wiki/Special:` or `/w/`.

- **Allow**: The `/wiki/` path is allowed for crawling, meaning the general pages of Wikipedia can be accessed by crawlers.
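Rather than reading the rules by eye, a scraper can check them programmatically with the standard library's `urllib.robotparser`. The sketch below parses a short, simplified excerpt of rules modeled on the ones described above instead of downloading the real file; the `MyScraper` user agent is a hypothetical name.

```python
from urllib.robotparser import RobotFileParser

# Simplified excerpt for illustration, not the complete Wikipedia file
rules = """
User-agent: MJ12bot
Disallow: /

User-agent: *
Disallow: /w/
Disallow: /wiki/Special:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

article_url = "https://en.wikipedia.org/wiki/World_population"
script_url = "https://en.wikipedia.org/w/index.php"

# A generic scraper falls under the "User-agent: *" rules
print(parser.can_fetch("MyScraper", article_url))
print(parser.can_fetch("MyScraper", script_url))

# MJ12bot is blocked from the entire site
print(parser.can_fetch("MJ12bot", article_url))
```

To check a live site instead, call `parser.set_url("https://en.wikipedia.org/robots.txt")` followed by `parser.read()` in place of `parser.parse(...)`.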

2. **Avoid Overloading Servers**: Sending too many requests in a short period can overwhelm a server. Be sure to include delays between requests to avoid overloading the server.

3. **Legal Implications**: Scraping copyrighted or sensitive data without permission can lead to legal consequences. Always ensure you are complying with the website’s terms of service.
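For the second point, a simple way to avoid overloading a server is to pause between consecutive requests. The sketch below shows the idea with a stand-in fetch function so that no real server is contacted; `fetch_politely`, `fake_fetch`, and the delay values are assumptions for illustration.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        pages[url] = fetch(url)
    return pages


# Stand-in for a real download function, used only for this demonstration
def fake_fetch(url):
    return f"<html>content of {url}</html>"


demo_urls = ["/page/1/", "/page/2/", "/page/3/"]

start = time.monotonic()
demo_pages = fetch_politely(demo_urls, fake_fetch, delay_seconds=0.1)
elapsed = time.monotonic() - start

print(f"Fetched {len(demo_pages)} pages in about {elapsed:.1f} seconds")
```

In a real scraper, the `fetch` argument could be something like `lambda url: requests.get(url, timeout=10).text`, and a delay of a second or more is a reasonable starting point unless the site states its own limit.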

## Conclusion
This lesson covers the full process of web scraping, from understanding HTTP and HTML to scraping static and dynamic websites. You should now have an understanding of how to:
- Fetch and parse web pages.

- Extract specific data elements.

- Store data in structured formats like Pandas DataFrames.

- Scrape websites ethically and legally, respecting the rules set by each site.

## Additional Resources
- [Selenium Web Automation Demo](https://www.youtube.com/watch?v=h8E_beaVYgs)

- [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

- [Requests Documentation](https://docs.python-requests.org/en/latest/)

- [List of HTTP Status Codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)