# Web Scraping using Python




## 1. Introduction

Web scraping, also called web data extraction, is a technique for extracting data from websites. It is an actively developing field that shares a goal with the semantic web vision, an ambitious initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence, and human-computer interaction. Current web scraping solutions range from ad hoc approaches that require human effort to fully automated systems that can convert entire websites into structured information, within limits.



## 2. Web Scraping using Python

### 2.1 Getting the data from websites

HTTP (HyperText Transfer Protocol) is the protocol used to access websites. The client (your computer) sends a request to the server (the computer that hosts the website), and the server sends back a response, usually an HTML file. HTML is a markup language that describes web pages: it contains the information displayed on the page along with other information about it. A browser parses the HTML to display the page to the user; a program can parse the same HTML to extract data from it, which is what web scraping does.
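
The request/response exchange can be observed without touching the internet. The following is a minimal sketch using only the standard library: it starts a throwaway local server and then acts as the client. The handler class `PageHandler` and the page content are hypothetical, introduced only for illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

PAGE = b"<html><body><h1>Hello, scraper</h1></body></html>"


class PageHandler(BaseHTTPRequestHandler):
    """Server side: answer every GET request with the same small HTML page."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), PageHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: send a GET request and read the response (proxies disabled for localhost).
opener = build_opener(ProxyHandler({}))
with opener.open(f"http://127.0.0.1:{server.server_port}/", timeout=5) as response:
    status = response.status
    content_type = response.headers["Content-Type"]
    html = response.read().decode("utf-8")

server.shutdown()
server.server_close()

print(status, content_type)
print(html)
```

The client receives a status code, headers, and an HTML body, which are the same three pieces a scraper works with when it talks to a real website.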

In Python, there are three popular ways to get the data from websites:

1. Using the `requests` library

`requests` is one of the most popular Python libraries. It sends HTTP requests to a server and returns the response, and its API is simple to use.

2. Using Selenium

Selenium is an open-source web automation framework, originally built for automated website testing, that drives a real browser. For web scraping, Selenium is typically used when a website is dynamic and requires user interaction to load the data. For example, when a website uses JavaScript to load data after a button is clicked, Selenium can automate the click so that the data is loaded.

3. Using Scrapy

Scrapy is a fast, high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Unlike the previous two approaches, Scrapy both downloads and processes the data, saving the user from having to do it manually.
   

### 2.2 Processing the data (HTML)

After getting the data from the website, we need to process it to extract the information we want. The data is usually HTML, so this step means parsing the HTML and locating the elements that hold the data.

There are two popular ways to process the data:

1. Using the `BeautifulSoup` library
2. Using the `Scrapy` library

Because Scrapy handles both data collection and data processing, it does not need Beautiful Soup or Selenium. Beautiful Soup, by contrast, only processes data: it relies on the `requests` library or Selenium to fetch the page first. It is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree, and it commonly saves programmers hours or days of work.

## 3.0 Examples using BeautifulSoup

### 3.1 Example 1: Scraping a static website

In this example, we will scrape the data from a static website. A static website is a website that doesn't use JavaScript to load the data. The data is already in the HTML file. We can use the requests library to get the data from the website and then use the BeautifulSoup library to process the data.

The website we will scrape is https://en.wikipedia.org/wiki/Main_Page, the main page of the English Wikipedia. We will extract the links that appear in its "In the news" section.

First, we need to import the requests library and the BeautifulSoup library.

In [45]:
import requests
from bs4 import BeautifulSoup

Next, we fetch the page. Passing the URL to the `get()` function of the `requests` library returns a response object, and the `text` attribute of that object holds the page's HTML as a string.

In [46]:
url = 'https://en.wikipedia.org/wiki/Main_Page'
response = requests.get(url)
html = response.text

html[:100]

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-la'

Next, we parse the HTML. The `BeautifulSoup()` function takes the HTML string and the name of a parser; here we use Python's built-in `html.parser`. On the resulting object, `find()` returns the first element that matches a tag (and optional attributes), `find_all()` returns every match, `get_text()` returns the text of an element, and `get()` returns the value of one of its attributes.

In [47]:
soup = BeautifulSoup(html, 'html.parser')
news = soup.find('div', {'id': 'mp-itn'})
for link in news.find_all('a'):
    print(link.get('href'))

/wiki/File:RA-02795_(51956663093)_(cropped).jpg
/wiki/2023_Tver_plane_crash
/wiki/Tver_Oblast
/wiki/Wagner_Group
/wiki/Yevgeny_Prigozhin
/wiki/Chandrayaan-3
/wiki/Lunar_south_pole
/wiki/Pragyan_(rover)
/wiki/Srettha_Thavisin
/wiki/Prime_Minister_of_Thailand
/wiki/2023_Thai_general_election
/wiki/Hun_Manet
/wiki/Prime_Minister_of_Cambodia
/wiki/Hun_Sen
/wiki/Portal:Current_events
/wiki/2023_Canadian_wildfires
/wiki/2023_Nigerien_crisis
/wiki/Russian_invasion_of_Ukraine
/wiki/Timeline_of_the_Russian_invasion_of_Ukraine_(8_June_2023_%E2%80%93_present)
/wiki/2023_Sudan_conflict
/wiki/Deaths_in_2023
/wiki/Uteng_Suryadiyatna
/wiki/Abe_Jacobs
/wiki/Gloria_Coates
/wiki/Bob_Barker
/wiki/Isabel_Crook
/wiki/Bray_Wyatt
/wiki/Wikipedia:In_the_news/Candidates


### Planned Sections for This Notebook

#### Section 1: Making HTTP Requests

This section will introduce how to use the `requests` library to make HTTP GET requests to fetch the HTML content of Wikipedia's homepage. Topics will include:

- Simple GET requests
- Handling HTTP status codes
- Adding headers to requests
  
#### Section 2: Introduction to Beautiful Soup

Here, we will get acquainted with the `Beautiful Soup` library for HTML parsing. We will discuss:

- Creating a Beautiful Soup object
- Navigating the HTML tree structure
- Searching for tags and text
  
#### Section 3: Data Extraction Techniques

This section will delve into various techniques for extracting data from the HTML document:

- Finding single and multiple elements
- Searching by tag name, classes, and IDs
- Traversing parent-child and sibling relationships
  
#### Section 4: Practical Examples with Wikipedia

We will apply the concepts learned to scrape different sections of Wikipedia's homepage:

- Extracting the "Today's Featured Article" text
- Scraping items from the "In the News" section
- Enumerating various language options available
  
#### Section 5: Data Storage

Once data is scraped, it's important to store it effectively for further analysis. We'll explore:

- Saving data in CSV format
- Storing data in a SQLite database
  
#### Section 6: Best Practices and Ethical Considerations

The notebook will conclude with an overview of ethical considerations and best practices in web scraping, such as:

- Complying with a website’s `robots.txt` file
- Implementing rate limiting to prevent overloading servers
- Ensuring data privacy and legal considerations
  
---

### Hands-on Activities and Assessments

To ensure that the learning objectives are met, the notebook will also include hands-on activities and assessments:

- **Interactive Quizzes**: To assess the grasp of key concepts.
- **Mini-Projects**: Practical tasks to solidify understanding and skills.


## Section 1: Making HTTP Requests

### Introduction to HTTP Requests

HTTP (HyperText Transfer Protocol) is the foundation for any data exchange on the Web. HTTP requests are how we ask for information from a web server. There are several types of HTTP requests, but the most common one is the GET request, which we use to retrieve data.

#### Objectives of this Section

- Understand what HTTP requests are and the role they play in web scraping.
- Use the `requests` library to make simple GET requests.
- Handle different HTTP status codes.
- Learn how to add headers to your HTTP requests.

### 1.1 Simple GET Requests

A GET request is a way to fetch data from a server. We can make a GET request using the `requests.get()` method.



In [48]:
# Importing the requests library
import requests

# Making a GET request to Wikipedia's homepage
response = requests.get('https://www.wikipedia.org/')

# Output the status code
print("Status Code: ", response.status_code)


Status Code:  200


#### Note:

- `200`: OK. The GET request was successful.
- Other common status codes: `404` (Not Found), `403` (Forbidden), `500` (Internal Server Error).
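
The first digit of a status code identifies its family. The following is a minimal sketch using the standard library's `http.HTTPStatus`; the function name `describe_status` is a hypothetical helper introduced for illustration and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short description such as '404 Not Found (client error)'."""
    families = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # code not defined in the standard library's table
    family = families.get(code // 100, "unknown family")
    return f"{code} {phrase} ({family})"


for code in (200, 403, 404, 500):
    print(describe_status(code))
```

With a real response, the same helper could be called as `describe_status(response.status_code)`.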

### 1.2 Handling HTTP Status Codes

Different status codes indicate the outcome of the HTTP request. It's crucial to handle these appropriately in your code.



In [49]:
# Handling status codes
if response.status_code == 200:
    print("Success!")
elif response.status_code == 404:
    print("Not Found.")

Success!


### 1.3 Viewing the Response Content

You can view the HTML content of the page by accessing the `.text` attribute of the response object.

In [50]:
# Output the first 500 characters of the HTML content
print(response.text[:500])

<!DOCTYPE html>
<html lang="en" class="no-js">
<head>
<meta charset="utf-8">
<title>Wikipedia</title>
<meta name="description" content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.">
<script>
document.documentElement.className = document.documentElement.className.replace( /(^|\s)no-js(\s|$)/, "$1js-enabled$2" );
</script>
<meta name="viewport" content="initial-scale=1,user-scalable=yes">
<link rel="apple-touch-


### 1.4 Adding Headers to Requests

Some websites may block automated requests. By adding headers, we can make our request look more like it's coming from an actual browser.

In [51]:
# Specifying headers
headers = {'User-Agent': 'Mozilla/5.0'}

# Making a GET request with headers
response_with_headers = requests.get('https://www.wikipedia.org/', headers=headers)

# Output the status code
print("Status Code with Headers: ", response_with_headers.status_code)

Status Code with Headers:  200
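
When several requests share the same headers, a `requests.Session` can hold them as defaults. The sketch below prepares a request without sending it, so the headers that would be transmitted can be inspected offline. The User-Agent string and contact address are hypothetical placeholders.

```python
import requests

# A Session applies the same default headers to every request it sends.
session = requests.Session()
session.headers.update({
    # Hypothetical identifier and contact address; replace with your own.
    "User-Agent": "my-scraper/0.1 (contact@example.com)",
    "Accept-Language": "en",
})

# Prepare (but do not send) a request to see which headers would be used.
request = requests.Request("GET", "https://www.wikipedia.org/")
prepared = session.prepare_request(request)

print(prepared.method, prepared.url)
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

To actually send the request, `session.get('https://www.wikipedia.org/')` would use the same default headers.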


## Section 2: Introduction to Beautiful Soup

### Overview

Beautiful Soup is a Python library for pulling data out of HTML and XML files, which makes it well suited to web scraping. It builds a parse tree from the page source, so data can be extracted in a hierarchical and more readable manner.

#### Objectives of this Section

- Understand what Beautiful Soup is and why it is essential in web scraping.
- Learn how to create a Beautiful Soup object.
- Explore basic functionalities to navigate and search through an HTML document.

### 2.1 Creating a Beautiful Soup Object

The first step in using Beautiful Soup is to create an object and initialize it with the HTML content you have fetched using the `requests` library.

In [52]:
from bs4 import BeautifulSoup

# Initialize the object with the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Pretty-print the HTML content
print(soup.prettify()[:500])

<!DOCTYPE html>
<html class="no-js" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Wikipedia
  </title>
  <meta content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation." name="description"/>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)no-js(\s|$)/, "$1js-enabled$2" );
  </script>
  <meta content="initial-scale=1,user-scalable=yes" name="viewport"


#### Note:

- The argument `'html.parser'` specifies the parser to be used. You can also use other parsers like `'lxml'` or `'html5lib'`.

### 2.2 Navigating the HTML Tree Structure

Beautiful Soup allows you to navigate the HTML tree structure and access parent, sibling, or descendant tags.

In [53]:
# Find title tag
title_tag = soup.title

title_tag

<title>Wikipedia</title>

In [54]:
# Access the parent of the title tag
parent_tag = title_tag.find_parent()

parent_tag

<head>
<meta charset="utf-8"/>
<title>Wikipedia</title>
<meta content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation." name="description"/>
<script>
document.documentElement.className = document.documentElement.className.replace( /(^|\s)no-js(\s|$)/, "$1js-enabled$2" );
</script>
<meta content="initial-scale=1,user-scalable=yes" name="viewport"/>
<link href="/static/apple-touch/wikipedia.png" rel="apple-touch-icon"/>
<link href="/static/favicon/wikipedia.ico" rel="shortcut icon"/>
<link href="//creativecommons.org/licenses/by-sa/4.0/" rel="license"/>
<style>
.sprite{background-image:linear-gradient(transparent,transparent),url(portal/wikipedia.org/assets/img/sprite-8bb90067.svg);background-repeat:no-repeat;display:inline-block;vertical-align:middle}.svg-Commons-logo_sister{background-position:0 0;width:47px;height:47px}.svg-MediaWiki-logo_sister{background-position:0 -47px;width:42px;height:42px}.svg-Me

In [55]:
# Access the siblings of the title tag
sibling_tags = title_tag.find_next_siblings() # see https://www.geeksforgeeks.org/find-the-siblings-of-tags-using-beautifulsoup/

sibling_tags

[<meta content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation." name="description"/>,
 <script>
 document.documentElement.className = document.documentElement.className.replace( /(^|\s)no-js(\s|$)/, "$1js-enabled$2" );
 </script>,
 <meta content="initial-scale=1,user-scalable=yes" name="viewport"/>,
 <link href="/static/apple-touch/wikipedia.png" rel="apple-touch-icon"/>,
 <link href="/static/favicon/wikipedia.ico" rel="shortcut icon"/>,
 <link href="//creativecommons.org/licenses/by-sa/4.0/" rel="license"/>,
 <style>
 .sprite{background-image:linear-gradient(transparent,transparent),url(portal/wikipedia.org/assets/img/sprite-8bb90067.svg);background-repeat:no-repeat;display:inline-block;vertical-align:middle}.svg-Commons-logo_sister{background-position:0 0;width:47px;height:47px}.svg-MediaWiki-logo_sister{background-position:0 -47px;width:42px;height:42px}.svg-Meta-Wiki-logo_sister{background-position:

### 2.3 Searching for Tags and Text

Beautiful Soup offers a variety of methods to find tags and text within an HTML document.

#### 2.3.1 Finding Single Elements

In [56]:
# Find the first occurrence of a specific tag
first_paragraph = soup.find('p')

first_paragraph

<p class="jsl10n" data-jsl10n="portal.app-links.description">
Save your favorite articles to read offline, sync your reading lists across devices and customize your reading experience with the official Wikipedia app.
</p>

In [57]:
# Find all <h1> headers and print their text
h1_headers = soup.find_all('h1')

for h1 in h1_headers:
    print(h1.text)




Wikipedia

The Free Encyclopedia



#### 2.3.2 Finding Multiple Elements

In [58]:
# Find all occurrences of a specific tag
all_paragraphs = soup.find_all('p')

all_paragraphs

[<p class="jsl10n" data-jsl10n="portal.app-links.description">
 Save your favorite articles to read offline, sync your reading lists across devices and customize your reading experience with the official Wikipedia app.
 </p>,
 <p class="site-license">
 <small class="jsl10n" data-jsl10n="license">This page is available under the <a href="https://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike License</a></small>
 <small class="jsl10n" data-jsl10n="terms"><a href="https://meta.wikimedia.org/wiki/Terms_of_use">Terms of Use</a></small>
 <small class="jsl10n" data-jsl10n="privacy-policy"><a href="https://meta.wikimedia.org/wiki/Privacy_policy">Privacy Policy</a></small>
 </p>]

In [59]:
# Print the href attribute of the first 21 links on the page
for link in soup.find_all('a', limit=21):
    print(link.get('href'))

//en.wikipedia.org/
//ja.wikipedia.org/
//es.wikipedia.org/
//ru.wikipedia.org/
//de.wikipedia.org/
//fr.wikipedia.org/
//it.wikipedia.org/
//zh.wikipedia.org/
//pt.wikipedia.org/
//fa.wikipedia.org/
//pl.wikipedia.org/
//ar.wikipedia.org/
//de.wikipedia.org/
//en.wikipedia.org/
//es.wikipedia.org/
//fr.wikipedia.org/
//it.wikipedia.org/
//arz.wikipedia.org/
//nl.wikipedia.org/
//ja.wikipedia.org/
//pt.wikipedia.org/




### 2.4 Exercise: Extracting 'Today's Featured Article' from Wikipedia

As a hands-on exercise, try to extract the "Today's Featured Article" section from Wikipedia's homepage. The objective is to apply the Beautiful Soup methods discussed so far to locate and extract this specific piece of information.

```python
# Your exercise code will go here
```



## Section 3: Data Extraction Techniques

### Overview

Data extraction is a crucial step in web scraping. While Beautiful Soup makes it easy to create a parse tree, knowing how to navigate this tree to extract the exact data you need is a skill that requires practice. This section covers several techniques that allow you to do just that.

#### Objectives of this Section

- Learn how to find single and multiple elements within an HTML document.
- Understand how to search by tag name, classes, and IDs.
- Explore techniques to traverse parent-child and sibling relationships in the HTML tree.

### 3.1 Finding Single and Multiple Elements

Beautiful Soup provides the `find()` and `find_all()` methods to locate single or multiple elements, respectively.

#### 3.1.1 Finding Single Elements

```python
# Find the first occurrence of a specific tag
first_paragraph = soup.find('p')

# Find the first occurrence of a tag with specific attributes
first_header = soup.find('h1', {'class': 'special'})
```

#### 3.1.2 Finding Multiple Elements


```python
# Find all occurrences of a specific tag
all_paragraphs = soup.find_all('p')
```

```python
# Find all occurrences of tags with specific attributes
special_headers = soup.find_all('h1', {'class': 'special'})
```

### 3.2 Searching by Tag Name, Classes, and IDs

Beautiful Soup allows you to be very specific in your search queries.

#### 3.2.1 By Tag Name

```python
# Find the first 'table' tag
first_table = soup.find('table')
```

#### 3.2.2 By Classes

```python
# Find all elements with a specific class
elements_with_class = soup.find_all(class_='target_class')
```

#### 3.2.3 By IDs


```python
# Find an element by its ID
element_with_id = soup.find(id='target_id')
```
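
The three search styles above can be tried on a small inline document. This is a minimal sketch: the HTML, the class names, and the IDs are hypothetical and exist only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, written only for this illustration.
html = """
<div id="movies">
  <h1 class="special">Top Movies</h1>
  <table>
    <tr class="movie"><td>Movie A</td><td>1994</td></tr>
    <tr class="movie"><td>Movie B</td><td>1972</td></tr>
  </table>
  <p class="note">Ratings are updated daily.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# By tag name: the first <table> in the document.
first_table = soup.find("table")

# By class: every element whose class list contains "movie".
movie_rows = soup.find_all(class_="movie")

# By ID: IDs should be unique, so find() is enough.
container = soup.find(id="movies")

# Tag name and class combined.
heading = soup.find("h1", class_="special")

print(heading.get_text())
print([row.find("td").get_text() for row in movie_rows])
print(soup.find(id="missing"))  # find() returns None when nothing matches
```

Because `find()` returns `None` for a missing element, it is worth checking the result before calling methods on it.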

### 3.3 Traversing Parent-Child and Sibling Relationships

Navigating relationships within the HTML tree is often necessary for complex scraping tasks.

#### 3.3.1 Parent-Child Relationships

```python
# Access the parent of a tag
parent_tag = soup.find('span').find_parent()
```

```python
# Find the direct children of a tag (omit recursive=False to get all descendants)
children_tags = soup.find('div').find_all(recursive=False)
```

#### 3.3.2 Sibling Relationships

```python
# Find next sibling of a tag
next_sibling = soup.find('p').find_next_sibling()
```

```python
# Find all next siblings of a tag
all_next_siblings = soup.find('p').find_next_siblings()
```
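
The traversal methods are easier to follow on a document small enough to read at a glance. The sketch below uses a hypothetical one-line HTML snippet with no whitespace between tags, so that only tags appear as siblings.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, written only for this illustration.
html = "<div id='box'><h2>Title</h2><p>First</p><p>Second</p><span>Note</span></div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")

# Parent: .parent is the attribute form of find_parent().
print(first_p.parent.name, first_p.parent["id"])

# Direct children only (recursive=False skips deeper descendants).
box = soup.find("div")
print([child.name for child in box.find_all(recursive=False)])

# Siblings after and before the first <p>.
print([tag.name for tag in first_p.find_next_siblings()])
print(first_p.find_previous_sibling().name)
```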

## Section 4: Practical Examples with Wikipedia

### Overview

In this section, we will apply the concepts and techniques learned in the previous sections to scrape different portions of Wikipedia's homepage. These exercises should reinforce your understanding and offer a realistic glimpse into the types of data extraction tasks you might encounter.

#### Objectives of this Section

- Extract text from the "Today's Featured Article" section.
- Scrape items listed in the "In the News" section.
- Enumerate various language options available on the Wikipedia homepage.

### 4.1 Extracting the "Today's Featured Article" Text

Let’s apply our skills to fetch and display the text content of "Today's Featured Article" from Wikipedia's homepage.

In [60]:
# Fetch the homepage content
response = requests.get('https://en.wikipedia.org/wiki/Main_Page/')
soup = BeautifulSoup(response.text, 'html.parser')

# Extract 'Today's Featured Article'
featured_article = soup.find('div', {'id': 'mp-upper'})

if featured_article:
    featured_text = featured_article.get_text().strip()
    print("Today's Featured Article:", featured_text)
else:
    print("Section not found.")

Today's Featured Article: From today's featured article




USS Marmora was a stern­wheel steamer serving in the Union Navy from 1862 to 1865 in the American Civil War. Built in 1862 as a civilian vessel, she was bought for military service in September, and converted into a tinclad warship. Commissioned on October 21, she served on the Yazoo River and was on the Yazoo during the Battle of Chickasaw Bayou in December. She was assigned in 1863 to a fleet operating against Fort Hindman, but was absent when the fort surrendered on January 11. From February to April, she participated in the Yazoo Pass expedition, and in June burned two Arkansas settlements. In August, she saw action on the White River when the Little Rock campaign was beginning, and patrolled on the Mississippi River late that year. She fought in the Battle of Yazoo City on March 5. She was declared surplus in May 1865 and put in reserve status at Mound City, Illinois. She was decommissioned in July, and sold at auction on

### 4.2 Scraping Items from the "In the News" Section

This exercise focuses on scraping bullet points from the "In the News" section of Wikipedia's homepage.

In [61]:
# Locate 'In the News' section
news_section = soup.find('div', {'id': 'mp-itn'})

# Extract list items
if news_section:
    news_list = news_section.find('ul')
    news_items = news_list.find_all('li')
    for index, item in enumerate(news_items):
        print(f"News {index + 1}: {item.get_text().strip()}")
else:
    print("Section not found.")

News 1: A business jet (pictured) crashes in Tver Oblast, Russia, killing Wagner Group  leader Yevgeny Prigozhin and nine others.
News 2: Indian spacecraft Chandrayaan-3 lands near the lunar south pole, carrying the Pragyan rover.
News 3: Thailand's parliament elects Srettha Thavisin as prime minister following general elections in May.
News 4: Hun Manet is sworn in as Prime Minister of Cambodia, succeeding his father Hun Sen's 38-year term.


## Section 5: Data Storage

### Overview

After successfully scraping data, the next vital step is to store it effectively for future analysis and data manipulation. In this section, we will explore two popular methods of storing scraped data: saving it in a CSV (Comma-Separated Values) file and storing it in an SQLite database.

#### Objectives of this Section

- Learn how to save scraped data in CSV format.
- Understand how to store scraped data in an SQLite database.

### 5.1 Saving Data in CSV Format

CSV files are simple and widely supported, making them a popular choice for storing tabular data.

#### 5.1.1 Basic CSV Storage

In [62]:
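import csv
import os

# Make sure the output folder exists, then create or overwrite the CSV file
os.makedirs('./data', exist_ok=True)
with open('./data/scraped_data.csv', 'w', newline='', encoding='utf-8') as csvfile: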
    writer = csv.writer(csvfile)
    
    # Write header
    writer.writerow(["Column1", "Column2"])
    
    # Write data rows
    writer.writerow(["Data1", "Data2"])
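
Scraped results are often collected as a list of dictionaries, and `csv.DictWriter` maps them to columns by key. The following is a minimal sketch with hypothetical records; it writes to an in-memory buffer so it runs without creating files.

```python
import csv
import io

# Hypothetical records standing in for scraped results.
records = [
    {"title": "Example headline, with a comma", "url": "/wiki/Example_one"},
    {"title": "Another headline", "url": "/wiki/Example_two"},
]

# io.StringIO stands in for a file opened with open(path, "w", newline="", encoding="utf-8").
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["title", "url"])
writer.writeheader()
writer.writerows(records)

print(buffer.getvalue())

# Reading the data back shows that the comma inside the first title was quoted safely.
rows = list(csv.DictReader(io.StringIO(buffer.getvalue(), newline="")))
print(rows[0]["title"])
```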

#### 5.1.2 Saving DataFrames to CSV

For those who are comfortable with Pandas, DataFrame objects can be easily saved to CSV files.

In [63]:
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'Column1': ['Data1'],
    'Column2': ['Data2']
})

# Save to CSV
df.to_csv('./data/scraped_data_dataframe.csv', index=False)

### 5.2 Storing Data in an SQLite Database

SQLite offers a more structured storage solution and is particularly useful for larger datasets and for relational data storage.

#### 5.2.1 Creating and Connecting to SQLite Database

In [64]:
import sqlite3

# Create or connect to a database
conn = sqlite3.connect('./data/scraped_data.db')

# Create a table
conn.execute('''CREATE TABLE IF NOT EXISTS DATA
                (ID INTEGER PRIMARY KEY AUTOINCREMENT,
                COLUMN1 TEXT NOT NULL,
                COLUMN2 TEXT NOT NULL);''')

<sqlite3.Cursor at 0x219a9cead40>

#### 5.2.2 Inserting Data into SQLite Database

In [65]:
# Insert a row of data; the "?" placeholders let sqlite3 handle quoting
conn.execute("INSERT INTO DATA (COLUMN1, COLUMN2) VALUES (?, ?);", ('Data1', 'Data2'))

# Commit changes and close connection
conn.commit()
conn.close()
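
Scraped text frequently contains quotes, so parameterized queries are a safer habit than building SQL strings by hand. The sketch below inserts several hypothetical rows at once with `executemany()` and reads them back; it uses an in-memory database so that it leaves no files behind.

```python
import sqlite3

# Hypothetical scraped rows; the second title contains a quote on purpose.
rows = [
    ("Example headline", "/wiki/Example_one"),
    ("It's another headline", "/wiki/Example_two"),
]

# ":memory:" keeps the sketch self-contained; use a file path to persist data.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS DATA
       (ID INTEGER PRIMARY KEY AUTOINCREMENT,
        COLUMN1 TEXT NOT NULL,
        COLUMN2 TEXT NOT NULL);"""
)

# "?" placeholders let sqlite3 handle quoting instead of string formatting.
with conn:  # commits on success, rolls back on an exception
    conn.executemany("INSERT INTO DATA (COLUMN1, COLUMN2) VALUES (?, ?);", rows)

stored = conn.execute("SELECT ID, COLUMN1, COLUMN2 FROM DATA ORDER BY ID;").fetchall()
conn.close()
print(stored)
```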

## 6.0 Key Considerations Before Web Scraping

The following is a sample of considerations for web scraping:

* Website's Terms of Service: Always read the website's terms of service to make sure you are allowed to scrape it. Websites may have specific rules against scraping, so it's important to be compliant.

* Rate Limiting: Sending too many requests in a short period can overload the server and may result in your IP being blocked. Consider implementing delays in your scraping script, or better yet, check if the website offers API access for the data you need.

* Web Page Structure: Websites can change their HTML structure over time, which can break your scraping code. It's crucial to make your code robust enough to handle minor changes in the webpage's structure.
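
One way to tolerate small layout changes is to try several selectors in order and to fail with a clear message when none match. This is a sketch under stated assumptions: `select_first` is a hypothetical helper, and the HTML and selectors are invented for illustration.

```python
from bs4 import BeautifulSoup


def select_first(soup, selectors):
    """Return the first element matched by any CSS selector, or None."""
    for selector in selectors:
        element = soup.select_one(selector)
        if element is not None:
            return element
    return None


# Hypothetical page where an old ID "news" was replaced by a class.
html = '<div class="news-box"><ul><li>Item one</li><li>Item two</li></ul></div>'
soup = BeautifulSoup(html, "html.parser")

section = select_first(soup, ["div#news", "div.news-box", "section.news"])
if section is None:
    print("Section not found; the page layout may have changed.")
else:
    print([li.get_text(strip=True) for li in section.find_all("li")])
```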

In the following sections we will demonstrate the use of Python to deal with these considerations.


### 6.1 Checking Robots.txt

Before you scrape a website, it's essential to check its robots.txt file to understand what you're allowed and not allowed to scrape. Additionally, it's good practice to not overload a server with too many requests in a short period.

In [66]:
# Example: Checking the robots.txt file for Wikipedia
robots_txt = requests.get("https://www.wikipedia.org/robots.txt").text
print(robots_txt[:500])  # Output the first 500 characters of the robots.txt file

# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapa
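
Reading `robots.txt` by eye does not scale, and the standard library's `urllib.robotparser` can answer the question programmatically. The sketch below parses hypothetical rules from a string so that it runs offline; the user-agent name `my-scraper` and the `example.org` URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, parsed from memory so the sketch runs offline.
# For a live site, call parser.set_url(".../robots.txt") and parser.read() instead.
rules = """
User-agent: MJ12bot
Disallow: /

User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("MJ12bot", "https://example.org/wiki/Python"))
print(parser.can_fetch("my-scraper", "https://example.org/wiki/Python"))
print(parser.can_fetch("my-scraper", "https://example.org/private/data"))
```

A scraper can call `can_fetch()` before each request and skip any URL for which it returns `False`.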


### 6.2 Dealing with Dynamic Websites
Some websites use JavaScript to load content dynamically. In such cases, a simple GET request may not retrieve all the content you see when browsing the site manually. Selenium is a popular library to handle dynamic content.

In [67]:
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager 
# note1: to install webdriver_manager, use the package name webdriver-manager (with a hyphen) on the command line
# note2: you may need to find and download Geckodriver for your OS. To make this easier to set up, place
# the downloaded driver in the same folder as your notebook. This is not the best practice, but it will work.
# Best practice would be to put it in a location, and then add that location to your PATH variable.
# note3: You will need to have Firefox installed on your computer for this to work. (You are welcome
# to try it with Chrome, but you will need to download the Chrome driver and change the code below to use it.)

driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()))

url = "https://www.britannica.com/topic/Presidents-of-the-United-States-1846696"
driver.get(url)
driver.implicitly_wait(10) # wait up to 10 seconds when looking up elements that are not present yet

driver.page_source



'<html data-ytrk-page="TOPIC PAGINATED SMALL" lang="en" class="topic-desktop ui-firefox116 ui-firefox" style="--100vh: 955px;"><head prefix="og: https://ogp.me/ns# fb: https://ogp.me/ns/fb#">\n\n    <meta charset="utf-8">\n    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n\n    <meta property="fb:pages" content="74442380906">\n\n\t<link rel="dns-prefetch" href="https://cdn.britannica.com/mendel-resources/3-102">\n\t<link rel="preconnect" href="https://cdn.britannica.com/mendel-resources/3-102">\n\n    <link rel="preload" as="script" href="https://www.googletagservices.com/tag/js/gpt.js">\n\n    <link rel="icon" href="/favicon.ico">\n\n    <meta name="description" content="As the head of the government of the United States, the president is arguably the most powerful government official in the world. The president is elected to a four-year term via an electoral college system. Since the Tw

### 6.3 Pacing Your Requests

Frequent requests to a server can lead to your IP being blocked. It's advisable to pace your requests by adding delays. Python’s time.sleep() function can be used for this purpose.

In [68]:
import time

responses = []
# Make a request, then sleep for 5 seconds before the next request
responses.append(requests.get('https://en.wikipedia.org/wiki/CBS_Building'))
time.sleep(5)
responses.append(requests.get('https://en.wikipedia.org/wiki/Midtown_Manhattan'))

responses[0].text[:500]


'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>CBS Building - Wikipedia</'
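
For more than two URLs, the sleep can be folded into a small loop. The following is a minimal sketch: `fetch_politely` and `fake_fetch` are hypothetical helpers, and the fetch function is passed in so that the example runs offline. For real use, `requests.get` could be passed instead.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=5.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # no pause needed before the first request
        results.append(fetch(url))
    return results


# Stand-in fetcher so the sketch runs offline; pass requests.get for real use.
def fake_fetch(url):
    return f"fetched {url}"


start = time.monotonic()
pages = fetch_politely(
    ["https://example.org/a", "https://example.org/b"],
    fake_fetch,
    delay_seconds=0.2,  # shortened for the demonstration
)
elapsed = time.monotonic() - start
print(pages, f"{elapsed:.1f}s")
```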