In [None]:
%cd -

In [None]:
!pwd

In [None]:
%cd drive/MyDrive/ams_595_python_teaching

# **Lecture 12 Introduction to Web Scraping with Python**

## HTML Structures

**Hypertext Markup Language (HTML)** is the standard markup language for creating web documents viewable in a browser. It defines the structure of a webpage and works together with **Cascading Style Sheets (CSS)** and scripting languages like **JavaScript** to build interactive websites. HTML consists of elements that tell the browser how to display content. These elements are written with **tags**.

- HTML provides the basic structure of web pages, defining elements like headings, paragraphs, links, and images. Most elements are written as a pair of tags, `<tag>` and `</tag>`, with the content nested in between. The first tag is the opening (start) tag and the second is the closing (end) tag, which is written like the opening tag but with a slash before the tag name. A few elements, such as `<br>` and `<img>`, have no content and therefore no closing tag.

![Diagram of an HTML element](https://blog.hubspot.com/hs-fs/hubfs/html-elements-diagram.png?width=650&name=html-elements-diagram.png)
 [Source](https://blog.hubspot.com/hs-fs/hubfs/html-elements-diagram.png?width=650&name=html-elements-diagram.png)

- CSS is used to style the presentation of HTML elements on a web page. It controls the layout, appearance, and design, allowing developers to customize the look and feel of the web pages. By separating the content from its presentation, CSS enhances the visual aesthetics and user experience of a website.

- JavaScript is a scripting language that adds interactivity and dynamic behavior to web pages. It enables the creation of responsive features, such as interactive forms, animations, and content updates, enhancing the overall functionality of a website. JavaScript is commonly used alongside HTML and CSS to create rich and engaging web applications.

Here is an explanation of the basic tags of an HTML page:

- `<!DOCTYPE html>`: This declaration tells the browser which HTML version the document uses. `<!DOCTYPE html>` is the HTML5 doctype used by modern web pages.

- `<html>`: This tag represents the root of an HTML document and contains two main sections: `<head>` and `<body>`.

- `<head>`: This section includes metadata, such as the page title, character encoding, and links to external resources like CSS stylesheets and JavaScript files. It doesn't display content directly on the web page.

  - `<title>`: The title of the web page, which is displayed in the browser's title bar or tab.

  - Other elements like `<meta>`, `<link>`, and `<script>` are commonly included within the `<head>` for defining additional information or linking external resources.

- `<body>`: This section contains the visible content of the web page, including text, images, videos, and other multimedia elements. It represents the main content area that users see and interact with.

  - Structural elements like headings (`<h1>`, `<h2>`, etc.), paragraphs (`<p>`), and lists (`<ul>`, `<ol>`, `<li>`) are used to organize and structure the content.

  - Semantic elements such as `<header>`, `<footer>`, `<nav>`, `<main>`, and `<article>` help define different parts of the page and their purposes.

  - Interactive elements like `<a>` (anchor tags for hyperlinks), `<button>`, and `<input>` are used for creating links, buttons, and forms for user interaction.

  - The `<div>` element in HTML is a block-level container that is commonly used to group elements for styling purposes or to create sections on a web page. It does not inherently represent any specific semantic meaning and is often styled with CSS to provide layout and structure to the content within it.



HTML has a hierarchical (tree-like) structure, which browsers expose through the Document Object Model (DOM), a cross-platform and language-independent interface. Below is an illustration of a basic HTML tree for reference.

<img src="https://github.com/nestauk/im-tutorials/blob/3-ysi-tutorial/figures/Web-Scraping/dom_tree.gif?raw=1">  [Source](https://github.com/nestauk/im-tutorials/blob/3-ysi-tutorial/figures/Web-Scraping/dom_tree.gif?raw=1)
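The tree structure can also be seen from Python. The sketch below uses the standard library's `html.parser` to print each tag indented by its nesting depth. The class name `OutlineParser`, the `VOID_TAGS` set, and the sample HTML are made up for illustration, and the parser assumes well-formed HTML in which every non-void tag is closed.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not increase the depth.
VOID_TAGS = {"meta", "link", "br", "img", "input", "hr"}


class OutlineParser(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


# Hypothetical sample page, not taken from a real website.
sample_html = """
<html>
  <head><title>Sample</title></head>
  <body>
    <h1>Heading</h1>
    <p>Some <a href="https://example.com">link</a> text.</p>
  </body>
</html>
"""

parser = OutlineParser()
parser.feed(sample_html)
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

Each level of indentation in the printed outline corresponds to one level of the DOM tree: `head` and `body` are children of `html`, and the `a` tag is a child of `p`.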

## Creating an HTML Page

In [None]:
# display and HTML live in IPython.display; importing them from IPython.core.display is deprecated
from IPython.display import display, HTML

In [None]:
display(HTML("""
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Sample HTML Page</title>
</head>
<body>
  <h1>Welcome to My Blog</h1>
  <p>This is a sample paragraph demonstrating the use of <strong>HTML</strong>.</p>
  <p>You can also create <em>emphasized</em> text, <mark>highlighted</mark> sections, <b>bold</b>, <ins>underlined</ins>, <del>strikethrough</del>, and <i>italic</i> words.</p>
  <p>Here is a link to <a href="https://www.google.com">Google</a>.</p>
  <p>This paragraph has a <del>strikethrough</del> text and an <ins>underlined</ins> text.</p>
  <p style="color: blue;">You can style text in different colors.</p>
  <p>Below is a sample unordered list:</p>
  <ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
  </ul>

   <p>Ordered list:</p>
  <ol>
    <li>A</li>
    <li>B</li>
    <li>C</li>
    <li>D</li>
  </ol>

   <form>
    <p>Enter your name: <input type="text" id="username" name="username"></p>
    <input type="button" value="Submit" onclick="displayMessage()">
  </form>

  <p id="greeting"></p>

  <script>
    function displayMessage() {
      var name = document.getElementById("username").value;
      document.getElementById("greeting").innerHTML = "Thank you, " + name + ", for visiting my page.";
    }
  </script>
</body>
</html>
"""))

## Web Scraping with requests and BeautifulSoup

Web scraping is the process of extracting and parsing data from websites. It involves fetching the HTML content of a web page and then extracting specific information from it. Python offers powerful libraries such as `requests` and `BeautifulSoup` that facilitate web scraping tasks efficiently.

### Using the requests Library

The `requests` library in Python is used for making HTTP requests. It allows users to send GET and POST requests to a URL, retrieve data from web pages, and handle HTTP response codes. Here's an example of how to use the `requests` library:

In [None]:
import requests

# Send a GET request to a URL
url = 'https://www.stonybrook.edu/commcms/ams/graduate/_courses/ams595.php'
response = requests.get(url)

# Check the status code
if response.status_code == 200:
    # Process the response
    print(response.content)
else:
    print("Failed to retrieve the webpage.")

b'<!DOCTYPE html><html lang="en"><head>\n   <meta charset="utf-8"></meta>\n   <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"></meta>\n   <meta name="viewport" content="width=device-width, initial-scale=1.0"></meta>\n   <title>AMS 595 | Applied Mathematics & Statistics</title><meta name="keywords" content="Applied Math, Statistics"/><meta name="description" content="Applied Math and Statistics at Stony Brook University"/><meta property="og:title" content="AMS 595 | Applied Mathematics & Statistics"/><meta property="og:description" content="Applied Math and Statistics at Stony Brook University"/><meta property="og:type" content="website"/><meta property="og:url" content="https://www.stonybrook.edu/commcms/ams/graduate/_courses/ams595.php"/><meta property="og:image" content="https://www.stonybrook.edu/commcms/_resources/images/branding/rays/red-rays-full-2.png"/>   <link rel="apple-touch-icon" href="/commcms/_resources/favicon/apple-touch-icon-144x144-precomposed.png"></lin

In [None]:
display(HTML(response.text))
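The status check above can be wrapped in a small helper. This is a minimal sketch, not the only way to do it: `fetch_html` is a hypothetical name, and the 10-second timeout is an arbitrary choice. It relies on `raise_for_status()`, which raises an exception for 4xx and 5xx responses, and returns `response.text` (a decoded `str`) rather than `response.content` (raw `bytes`, which is why the output above starts with `b'`).

```python
import requests


def fetch_html(url, session=None, timeout=10.0):
    """Return the decoded HTML of `url`, raising an exception on failure."""
    # Accepting an optional session allows connection reuse across calls.
    session = session or requests.Session()
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx status codes
    return response.text         # str; response.content would be the raw bytes
```

A call such as `fetch_html('https://www.stonybrook.edu/commcms/ams/graduate/_courses/ams595.php')` would then either return the page as a string or raise an error that names the failing status code.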


### Using the BeautifulSoup Library

The `BeautifulSoup` library provides tools for scraping information from web pages. It helps parse HTML and XML documents, making it easy to extract data from complex web pages. Here's an example of how to use the `BeautifulSoup` library:

In [None]:
import requests
from bs4 import BeautifulSoup

# 1. Fetch the HTML content of the page
url = 'https://www.stonybrook.edu/commcms/ams/graduate/_courses/ams595.php'
response = requests.get(url)  # response.text holds the HTML as a string

# 2. Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# 3. Find all paragraph (<p>) elements
ps = soup.find_all('p')

# 4. Print the text of each paragraph
for p in ps:
    print(p.text)

AMS 595, Fundamentals of Computing This course provides an introduction to several modern approaches for developing computer
                     programs and their use to solve mathematical problems. It will cover the fundamentals
                     of programming in MATLAB, Python, and C/C++, including scripting, basic data structures,
                     algorithms, scientific computing, performance optimization, software engineering and
                     program development tools. No previous programming experience is required.
Fall semester, 1-9 credits, ABCF grading 
Prerequisite:  Familiarity of linear algebra and discrete mathematics at undergraduate
                        level are required.  No previous programming experience is required.
Antirequisite:  AMS 561
 
Learning Outcomes:Knowledge of the basic elements of computer programming languages. An understanding
                     of the use of several different programming interfaces. Practical programming skills


Complete Web Scraping Workflow:

1. Use the `requests` library to fetch the HTML content of a web page.

2. Create a `BeautifulSoup` object to parse the HTML content.

3. Use the BeautifulSoup object to find specific elements based on their tags, classes, or other attributes.

4. Extract and manipulate the desired data from the web page.
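Step 3 mentions finding elements by tag, class, or other attributes. The sketch below shows one possible way to do each on a small inline HTML string, so it runs without a network request. The HTML, the `id` and `class` values, and the `/courses/...` paths are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, standing in for response.text (steps 1 and 2).
sample_html = """
<html><body>
  <h1 id="page-title">Course Catalog</h1>
  <p class="intro">Welcome to the catalog.</p>
  <p>Contact the department for details.</p>
  <ul class="courses">
    <li><a href="/courses/ams595.php">AMS 595</a></li>
    <li><a href="/courses/ams561.php">AMS 561</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Step 3: find by id, by tag plus class, and by nesting.
title = soup.find(id="page-title").get_text()
intro = [p.get_text() for p in soup.find_all("p", class_="intro")]
course_list = soup.find("ul", class_="courses")

# Step 4: extract the attribute values we care about.
course_links = [a["href"] for a in course_list.find_all("a", href=True)]

print(title)
print(intro)
print(course_links)
```

Note the trailing underscore in `class_`: `class` is a reserved word in Python, so BeautifulSoup uses `class_` as the keyword argument.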

## Example Web Scraping Script

In [None]:
import requests
from bs4 import BeautifulSoup

# Example: Extracting links from a webpage
url = 'https://www.stonybrook.edu/commcms/ams/graduate/offerings.php'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
links = soup.find_all('a')  # contains all the hyperlink reference information in this webpage
for link in links:
    print(link.get('href'))


#main-site-content
/commcms/ams/about/job-opportunities.php
https://www.stonybrook.edu/
http://www.stonybrook.edu/ceas
https://www.stonybrook.edu/commcms/ams/
/commcms/ams/
/commcms/ams/about/index.php
/commcms/ams/about/welcome-from-the-chair.php
/commcms/ams/about/index.php
/commcms/ams/about/leadership.php
/commcms/ams/about/directions.php
/commcms/ams/about/give.php
/commcms/ams/about/job-opportunities.php
/commcms/ams/people/index.php#CoreFaculty
/commcms/ams/people/index.php#CoreFaculty
/commcms/ams/people/affiliatedfaculty.php
/commcms/ams/people/postdoc.php
/commcms/ams/people/staff.php
/commcms/ams/people/grad-students.php
/commcms/ams/people/ams-alumnus.php
/commcms/ams/undergraduate/index.php
/commcms/ams/undergraduate/index.php
/commcms/ams/undergraduate/course-offerings.php
/commcms/ams/undergraduate/schedules.php
/commcms/ams/undergraduate/actuary.php
/commcms/ams/graduate/index.php
/commcms/ams/graduate/index.php
/commcms/ams/graduate/offerings.php
/commcms/ams/graduate/
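The output above contains repeated links, and `link.get('href')` returns `None` for any `<a>` tag without an `href`. A possible way to handle both is sketched below on a small inline HTML string (the HTML is invented for illustration, with paths modeled on the output above): `href=True` keeps only tags that have the attribute, and `dict.fromkeys` removes duplicates while preserving order.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with a duplicate link and an <a> tag that has no href.
sample_html = """
<a href="#main-site-content">Skip Navigation</a>
<a href="/commcms/ams/about/index.php">About Us</a>
<a href="/commcms/ams/about/index.php">Overview</a>
<a name="top">Anchor without href</a>
<a href="/commcms/ams/graduate/index.php">Graduate</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# href=True matches only <a> tags that actually carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# dict keys are unique and keep insertion order, so this drops repeats.
unique_hrefs = list(dict.fromkeys(hrefs))

for href in unique_hrefs:
    print(href)
```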

We also want to see the text of each link, which usually serves as the title of the page it points to.

In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Example 1: Extracting links and their titles from a webpage
url = 'https://www.stonybrook.edu/commcms/ams/graduate/offerings.php'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
links = soup.find_all('a')
for link in links:    # link.text is the visible text of the link; link.get('href') is the address it points to
    print(f"Title: {link.text}, Link: {link.get('href')}")

Title: Skip Navigation, Link: #main-site-content
Title: Job Opportunities, Link: /commcms/ams/about/job-opportunities.php
Title: , Link: https://www.stonybrook.edu/
Title: College of Engineering and Applied Sciences, Link: http://www.stonybrook.edu/ceas
Title: Applied Mathematics & Statistics, Link: https://www.stonybrook.edu/commcms/ams/
Title: Home, Link: /commcms/ams/
Title: About Us, Link: /commcms/ams/about/index.php
Title: Welcome from the Chair, Link: /commcms/ams/about/welcome-from-the-chair.php
Title: Overview, Link: /commcms/ams/about/index.php
Title: Leadership, Link: /commcms/ams/about/leadership.php
Title: Directions, Link: /commcms/ams/about/directions.php
Title: Giving Back, Link: /commcms/ams/about/give.php
Title: Job Opportunities, Link: /commcms/ams/about/job-opportunities.php
Title: People, Link: /commcms/ams/people/index.php#CoreFaculty
Title: Faculty, Link: /commcms/ams/people/index.php#CoreFaculty
Title: Affiliated Faculty, Link: /commcms/ams/people/affiliatedfacu

Some of the links are displayed as relative paths (/commcms/ams/graduate/_courses/ams526.php) because the get('href') method returns the href attribute as it appears in the HTML, without the domain name included. To display the complete absolute URLs, we can join the base URL of the webpage with the href attribute.

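However, some of the links are already complete URLs, and we want to preserve them as they are, so we add a condition that checks whether the href is already a complete URL. (`urljoin` itself returns an href that is already absolute as is, so the check mainly makes the intent explicit.)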
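To see how `urljoin` treats the different kinds of `href` values, here is a short sketch that needs no network access. The base URL is the page used in this lecture; the relative path in the last case is an illustrative example.

```python
from urllib.parse import urljoin

base = 'https://www.stonybrook.edu/commcms/ams/graduate/offerings.php'

# A root-relative path replaces the whole path of the base URL.
root_relative = urljoin(base, '/commcms/ams/about/index.php')

# An href that is already absolute is returned as it is.
already_absolute = urljoin(base, 'http://www.stonybrook.edu/ceas')

# A fragment is appended to the base URL itself.
fragment = urljoin(base, '#main-site-content')

# A path without a leading slash is resolved against the base URL's directory.
relative = urljoin(base, '_courses/ams595.php')

for result in (root_relative, already_absolute, fragment, relative):
    print(result)
```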

In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://www.stonybrook.edu/commcms/ams/graduate/offerings.php'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
links = soup.find_all('a', href=True)  # skip <a> tags without an href, for which link.get('href') would be None
for link in links:
    href = link.get('href')
    absolute_link = urljoin(url, href) if not href.startswith('http') else href
    print(f"Title: {link.text}, Link: {absolute_link}")

Title: Skip Navigation, Link: https://www.stonybrook.edu/commcms/ams/graduate/offerings.php#main-site-content
Title: Job Opportunities, Link: https://www.stonybrook.edu/commcms/ams/about/job-opportunities.php
Title: , Link: https://www.stonybrook.edu/
Title: College of Engineering and Applied Sciences, Link: http://www.stonybrook.edu/ceas
Title: Applied Mathematics & Statistics, Link: https://www.stonybrook.edu/commcms/ams/
Title: Home, Link: https://www.stonybrook.edu/commcms/ams/
Title: About Us, Link: https://www.stonybrook.edu/commcms/ams/about/index.php
Title: Welcome from the Chair, Link: https://www.stonybrook.edu/commcms/ams/about/welcome-from-the-chair.php
Title: Overview, Link: https://www.stonybrook.edu/commcms/ams/about/index.php
Title: Leadership, Link: https://www.stonybrook.edu/commcms/ams/about/leadership.php
Title: Directions, Link: https://www.stonybrook.edu/commcms/ams/about/directions.php
Title: Giving Back, Link: https://www.stonybrook.edu/commcms/ams/about/give.ph

Now, we only want the titles and links for the webpages of the graduate courses that are offered. We can use a regular expression to match the pattern of three uppercase letters followed by a space and then three digits (possibly with more text after it, such as " Webpage").
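Before applying the pattern to a live page, it can help to try it on a few strings. The sketch below uses sample link texts modeled on the output above; the lowercase variant is made up to show a non-match.

```python
import re

# Three uppercase letters, one whitespace character, three digits, anchored at the start.
course_pattern = re.compile(r'^[A-Z]{3}\s\d{3}')

link_texts = [
    "AMS 500 Webpage",                    # matches; extra text after the number is allowed
    "AMS 501",                            # matches
    "Home",                               # no match
    "ams 502 Webpage",                    # no match: lowercase letters
    "Applied Mathematics & Statistics",   # no match
]

matches = [text for text in link_texts if course_pattern.match(text)]
print(matches)
```

Because `pattern.match` only checks the beginning of the string, anything that follows the course number is accepted.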

In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re

url = 'https://www.stonybrook.edu/commcms/ams/graduate/offerings.php'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
links = soup.find_all('a', href=True)  # only <a> tags that have an href attribute
pattern = re.compile(r'^[A-Z]{3}\s\d{3}')  # Pattern for three uppercase letters, a whitespace character, and three digits
for link in links:
    if pattern.match(link.text):
        href = link.get('href')
        absolute_link = urljoin(url, href) if not href.startswith('http') else href
        print(f"Title: {link.text}, Link: {absolute_link}")

Title: AMS 500 Webpage, Link: https://www.stonybrook.edu/commcms/ams/graduate/_courses/ams500.php
Title: AMS 501 Webpage, Link: https://www.stonybrook.edu/commcms/ams/graduate/_courses/ams501.php
Title: AMS 501, Link: https://www.stonybrook.edu/commcms/ams/graduate/_courses/ams501.php
Title: AMS 502 Webpage , Link: https://www.stonybrook.edu/commcms/ams/graduate/_courses/ams502.php
Title: AMS 503 Webpage, Link: https://www.stonybrook.edu/commcms/ams/graduate/_courses/ams503.php
Title: AMS 504 Webpage, Link: http://www.stonybrook.edu/commcms/ams/graduate/_courses/ams504.php
Title: AMS 505 Webpage , Link: http://www.stonybrook.edu/commcms/ams/graduate/_courses/ams505.php
Title: AMS 506 Webpage, Link: https://www.stonybrook.edu/commcms/ams/graduate/_courses/ams506.php
Title: AMS 507 Webpage , Link: https://www.stonybrook.edu/commcms/ams/graduate/_courses/ams507.php
Title: AMS 510 Webpage, Link: https://www.stonybrook.edu/commcms/ams/graduate/_courses/ams510.php
Title: AMS 511 Webpage , Li

Now that we have a list of course titles and course websites, we want to get information about the textbooks. We visit each course page and search its text for ISBNs.

In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re

base_url = 'https://www.stonybrook.edu/commcms/ams/graduate/offerings.php'  # Replace this with the base URL of the target website
response = requests.get(base_url)
soup = BeautifulSoup(response.text, 'html.parser')
links = soup.find_all('a', href=True)

pattern_title = re.compile(r'^[A-Z]{3}\s\d{3}')  # Pattern for three letters, a space, and three digits
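# 'ISBN', an optional '-10' or '-13', an optional colon, one whitespace character, then the rest
# of the line up to its last digit that is followed by whitespace (that whitespace is part of the match)
pattern_ISBN = re.compile(r'ISBN(?:-1[03])?:?\s.*\d\s')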

for link in links:
    if pattern_title.match(link.text):
        absolute_link = urljoin(base_url, link['href'])
        link_response = requests.get(absolute_link)
        link_soup = BeautifulSoup(link_response.text, 'html.parser')
        text = link_soup.get_text()
        ISBN = pattern_ISBN.findall(text)
        print(f"Title: {link.text}, Link: {absolute_link}, ISBN: {', '.join(ISBN)}")

Title: AMS 500 Webpage, Link: https://www.stonybrook.edu/commcms/ams/graduate/_courses/ams500.php, ISBN: 
Title: AMS 501 Webpage, Link: https://www.stonybrook.edu/commcms/ams/graduate/_courses/ams501.php, ISBN: ISBN #9780387989310

Title: AMS 501, Link: https://www.stonybrook.edu/commcms/ams/graduate/_courses/ams501.php, ISBN: ISBN #9780387989310

Title: AMS 502 Webpage , Link: https://www.stonybrook.edu/commcms/ams/graduate/_courses/ams502.php, ISBN: ISBN: 978-3-319-30769-5

Title: AMS 503 Webpage, Link: https://www.stonybrook.edu/commcms/ams/graduate/_courses/ams503.php, ISBN: ISBN: 978-0070006577
, ISBN: 978-0-716728771

Title: AMS 504 Webpage, Link: http://www.stonybrook.edu/commcms/ams/graduate/_courses/ams504.php, ISBN: 
Title: AMS 505 Webpage , Link: http://www.stonybrook.edu/commcms/ams/graduate/_courses/ams505.php, ISBN: 
Title: AMS 506 Webpage, Link: https://www.stonybrook.edu/commcms/ams/graduate/_courses/ams506.php, ISBN: ISBN #9780321794772

Title: AMS 507 Webpage , Link: 
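The matches above keep the `ISBN` label and a trailing line break, which is why the output is spread over several lines. As an alternative sketch (not the pattern used above), the code below captures only the number and strips the hyphens. The pattern, the helper name `extract_isbns`, and the sample text are assumptions for illustration; the first two ISBNs are copied from the output above.

```python
import re

# 'ISBN', optional '-10'/'-13', optional ':' or '#', then 10 to 17 digits or hyphens (captured).
isbn_pattern = re.compile(r'ISBN(?:-1[03])?\s*[:#]?\s*([\d-]{10,17})')


def extract_isbns(text):
    """Return the ISBNs found in `text` as digit strings without hyphens."""
    return [match.replace('-', '') for match in isbn_pattern.findall(text)]


# Hypothetical page text, with the label written in different styles.
sample_text = """
Required text: ISBN #9780387989310
Recommended: ISBN: 978-3-319-30769-5
Also useful: ISBN-13: 978-0070006577
No identifier on this line.
"""

isbns = extract_isbns(sample_text)
print(isbns)
```

This sketch does not validate check digits; it only normalizes the formatting so that the same book written in different styles compares equal.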

## Yet Another Example

We can also extract tables from a website and then process the data with Python tools such as `pandas`.

In [None]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
%matplotlib inline
url = 'https://www.stonybrook.edu/commcms/ams/graduate/_courses/ams500.php'

In [None]:
# Access the webpage content
r = requests.get(url)
# Parse the HTML page
soup = BeautifulSoup(r.text, 'html.parser')
# Choose the relevant table
tables = soup.find_all('table')
print(len(tables))
table = tables[0]  # reuse the list found above instead of searching the page again

2


In [None]:
# Parse and store the data of every table row
lst = []
for row in table.find_all('tr'):
    s = pd.Series([data.text for data in row.find_all('td')])
    lst.append(s)

In [None]:
df = pd.DataFrame(lst[1:])
df.columns = lst[0]
df

| | Class Meeting | Faculty | Week | Location |
|---|---|---|---|---|
| 0 | August 30, 2023 | Joseph Mitchell | 1 | Remote |
| 1 | September 6, 2023 | Song WuWei Zhu | 2 | Remote |
| 2 | September 13, 2023 | Pawel PolakStan Uryasev | 3 | Remote |
| 3 | September 20, 2023 | Haipeng XingYan Yu | 4 | Remote |
| 4 | September 27, 2023 | Eugene FeinbergRoman Samulyak | 5 | Remote |
| 5 | October 4, 2023 | Xiangmin JiaoPeiFen Kuan | 6 | Remote |
| 6 | October 11, 2023 | James GlimmHyunkyung Lim | 7 | Remote |
| 7 | October 18, 2023 | Jiaqiao HuXiaolin Li | 8 | Remote |
| 8 | October 25, 2023 | Yuefan DengEvangelos Coutsias | 9 | Remote |
| 9 | November 1, 2023 | NO CLASS | Makeup | Remote |
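In the Faculty column, names such as "Song WuWei Zhu" run together because `.text` concatenates the pieces of text inside a cell without any separator. One possible fix is `get_text` with a separator, sketched below on an inline table. The sample HTML assumes that the names are separated by `<br>` tags, which may not be how the real page is written.

```python
from bs4 import BeautifulSoup

# Hypothetical table; the <br> between names is an assumption about the markup.
sample_table = """
<table>
  <tr><td>Class Meeting</td><td>Faculty</td><td>Week</td></tr>
  <tr><td>September 6, 2023</td><td>Song Wu<br>Wei Zhu</td><td>2</td></tr>
  <tr><td>September 13, 2023</td><td>Pawel Polak<br>Stan Uryasev</td><td>3</td></tr>
</table>
"""

soup = BeautifulSoup(sample_table, "html.parser")

# separator is placed between the text pieces of a cell; strip=True trims whitespace.
rows = [
    [cell.get_text(separator="; ", strip=True) for cell in row.find_all("td")]
    for row in soup.find_all("tr")
]

for row in rows:
    print(row)
```

The first row holds the headers, so `pd.DataFrame(rows[1:], columns=rows[0])` would turn the result into a DataFrame as in the cells above.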


## Ethics in Web Scraping

Ethical practices in web scraping are crucial to ensure the responsible and fair acquisition of data from websites.

[This website](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01) summarizes the best practices to consider when conducting web scraping activities.

1. Terms of Service and Robots.txt: Always review a website's Terms of Service and adhere to the rules outlined in the `robots.txt` file. Respect the directives provided in these documents, as they outline the permissions and restrictions for web scraping. Example: https://www.google.com/robots.txt

2. Use Legal Data: Ensure that the data you collect is legally available for scraping. Avoid scraping any confidential, proprietary, or copyrighted information that is not meant for public distribution.

3. Respect Rate Limits and Delays: Implement appropriate rate limits and delays between requests to avoid overwhelming the website's server. Excessive scraping can cause server overload and may be perceived as a denial-of-service attack.

4. Identify Yourself: Provide a valid User-Agent in your HTTP header that clearly identifies your web scraper. This helps website administrators identify the source of the traffic and reach out in case of any issues.

5. Observe Crawl Frequency: Crawl only the necessary data and avoid scraping the same website too frequently. Respect the website's bandwidth and server resources, as excessive traffic can lead to performance issues and affect the experience of other users.

6. Data Privacy and Security: Ensure that the data you collect is handled securely and is not misused or shared without consent. Respect the privacy of the users whose data you are scraping.

7. Scrape Publicly Available Data: Collect data that is publicly available and does not require authentication or bypassing security measures. Avoid using any means to bypass login screens or security measures, as it can be considered unauthorized access.

8. Transparency and Attribution: If you intend to publish or share the scraped data, provide proper attribution to the source website. Clearly mention the source of the data and any transformations or interpretations you've made.
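Several of these practices (checking `robots.txt`, identifying yourself, and pausing between requests) can be sketched with the standard library. Everything specific below is an assumption for illustration: the `USER_AGENT` string, the sample `robots.txt` rules, the `example.edu` URLs, and the helper name `polite_fetch_all`. The rules are parsed from a string so the example runs offline; for a real site, `RobotFileParser` can instead load the file with `set_url(...)` and `read()`.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical identity for the scraper, including a way to contact its owner.
USER_AGENT = "ams595-teaching-bot/0.1 (contact: instructor@example.edu)"

# Invented rules standing in for a site's robots.txt file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def polite_fetch_all(urls, fetch, delay_seconds):
    """Fetch the allowed URLs one by one, pausing between requests."""
    pages = {}
    for url in urls:
        if not robots.can_fetch(USER_AGENT, url):
            continue  # respect the Disallow rules
        if pages:
            time.sleep(delay_seconds)  # pause before every request after the first
        pages[url] = fetch(url)
    return pages


urls = [
    "https://example.edu/courses/ams595.php",
    "https://example.edu/private/grades.php",
]

# Stand-in for a real download so the example makes no network requests.
def fake_fetch(url):
    return f"<html>content of {url}</html>"

pages = polite_fetch_all(urls, fake_fetch, delay_seconds=0)
print(list(pages))                      # only the allowed URL is fetched
print(robots.crawl_delay(USER_AGENT))   # delay requested by the sample rules
```

With `requests`, the `fetch` argument could be a function that calls `requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)` and returns `.text`, and `delay_seconds` could be taken from `robots.crawl_delay(USER_AGENT)` when the site specifies one.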