# Parsing with `beautifulsoup4`

Previously, we saw how to obtain the HTML content of a webpage using the `requests` library. Raw HTML is not very useful on its own, though: we need to parse it to extract the information we want. This is where the `beautifulsoup4` library comes in.

## What is `beautifulsoup4`?

`beautifulsoup4` is a Python library for parsing HTML and XML documents. It builds a parse tree from the page source, which you can then search and navigate by tag name, attribute, or position to extract the data you need.
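
As a minimal sketch of what the parse tree gives you, the snippet below parses a small hand-written HTML string (made up for illustration, not taken from any real page) and reads a few elements by tag name.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document used only for illustration.
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Hello</h1>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

page_title = soup.title.string   # text inside <title>
heading = soup.h1.get_text()     # text inside the first <h1>
paragraphs = [p.get_text() for p in soup.find_all("p")]

print(page_title)
print(heading)
print(paragraphs)
```

Attribute-style access such as `soup.title` returns the first matching tag, while `find_all()` returns every match.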

## Parsing using regular expressions

Let's try to parse the HTML content of a webpage without using `beautifulsoup4`. We will use the `requests` library to obtain the HTML content of the webpage and then use the `re` library to parse the content.

I assume that you are familiar with regular expressions, perhaps from an automata theory class. If not, you can learn about them from the [Python documentation](https://docs.python.org/3/library/re.html).

In [1]:
import re
import requests

url = "https://bicol-u.edu.ph/?s=computer+science"
response = requests.get(url)
html = response.text
html[:1000]

'<!DOCTYPE html>\n<html lang="en-US">\n    <head>\n        <meta charset="UTF-8">\n        <meta name="viewport" content="width=device-width, initial-scale=1">\n        <link rel="icon" href="/wp-content/uploads/2022/11/Bicol_University-1.png" sizes="any">\n                <link rel="apple-touch-icon" href="/wp-content/themes/yootheme/packages/theme-wordpress/assets/images/apple-touch-icon.png">\n                <title>Search Results for &#8220;computer science&#8221; &#8211; Official Website of Bicol University</title>\n<meta name=\'robots\' content=\'noindex, follow, max-image-preview:large\' />\n<link rel=\'dns-prefetch\' href=\'//translate.google.com\' />\n<link rel="alternate" type="application/rss+xml" title="Official Website of Bicol University &raquo; Feed" href="https://bicol-u.edu.ph/feed/" />\n<link rel="alternate" type="application/rss+xml" title="Official Website of Bicol University &raquo; Comments Feed" href="https://bicol-u.edu.ph/comments/feed/" />\n<link rel="alternat

Let's extract all links from the webpage using regular expressions.

In [2]:
pattern = re.compile(r'<a class="uk-link-reset" href="(.+?)">')
links = pattern.findall(html)
links[:5]

['https://bicol-u.edu.ph/dswd-v-to-offer-internship-program-for-buenos/',
 'https://bicol-u.edu.ph/ceng-mentors-off-to-international-academic-visit-cultural-immersion-in-taiwan/',
 'https://bicol-u.edu.ph/buenos-return-after-a-successful-international-student-exchange-program-in-korea-taiwan/',
 'https://bicol-u.edu.ph/bucs-comsci-it-dept-goes-benchmarking/',
 'https://bicol-u.edu.ph/bu-nqu-language-exchange-project-commences-20-buenos-to-learn-chinese-mandarin/']

Let me explain the code:

1. A regular expression pattern is created to match the links of the website posts.

    - `<a class="uk-link-reset"`: This is the start of the anchor tag. Reviewing the HTML content of the webpage, we can see that all links are inside an anchor tag with this class.
    - `href="`: This is the start of the `href` attribute of the anchor tag.
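    - `(.+?)`: This is a non-greedy capture group for the value of the `href` attribute. It matches as few characters as possible before the `">` that follows.
    - `">`: This is the closing quote of the `href` attribute, followed by the end of the opening tag.

2. The `findall()` method is used to find all matches of the pattern in the HTML content. Because the pattern contains one capture group, it returns only the captured `href` values.

3. The first 5 links are displayed.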
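
To see why the non-greedy `(.+?)` matters, here is a small sketch on a hand-written string with two links on one line (the markup is made up for illustration). The greedy variant `(.+)` is shown only for contrast.

```python
import re

# Hand-written snippet with two links on a single line (illustrative only).
sample_html = (
    '<a class="uk-link-reset" href="/first/">One</a> '
    '<a class="uk-link-reset" href="/second/">Two</a>'
)

non_greedy = re.compile(r'<a class="uk-link-reset" href="(.+?)">')
greedy = re.compile(r'<a class="uk-link-reset" href="(.+)">')

non_greedy_links = non_greedy.findall(sample_html)
greedy_links = greedy.findall(sample_html)

print(non_greedy_links)  # each href is captured separately
print(greedy_links)      # one long match that runs to the last `">`
```

The greedy pattern swallows everything between the first `href="` and the last `">` on the line, so the two links collapse into a single, useless match.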

Does this look complicated? Yes, it does. Regular expressions are powerful, but they are not the best tool for parsing HTML content. They are hard to read, difficult to maintain, and break as soon as the markup changes slightly, for example when the attributes appear in a different order. Thankfully, we have `beautifulsoup4` to help us with this.

## Parsing using `beautifulsoup4`

The library makes it very easy to parse HTML content. Let's see how we can extract all links from the webpage using `beautifulsoup4`.

In [3]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
a_tags = soup.find_all("a", class_="uk-link-reset")
for a in a_tags:
    print(a["href"])

https://bicol-u.edu.ph/dswd-v-to-offer-internship-program-for-buenos/
https://bicol-u.edu.ph/ceng-mentors-off-to-international-academic-visit-cultural-immersion-in-taiwan/
https://bicol-u.edu.ph/buenos-return-after-a-successful-international-student-exchange-program-in-korea-taiwan/
https://bicol-u.edu.ph/bucs-comsci-it-dept-goes-benchmarking/
https://bicol-u.edu.ph/bu-nqu-language-exchange-project-commences-20-buenos-to-learn-chinese-mandarin/
https://bicol-u.edu.ph/student/


The code works as follows:

1. `BeautifulSoup` is imported from the `bs4` library.
2. The HTML content is passed to the `BeautifulSoup` constructor, along with the parser to be used. In this case, we are using the `html.parser` parser.
3. The `find_all()` method is used to find all anchor tags with the class `uk-link-reset`.
4. A `for` loop iterates over the matching anchor tags.
5. The `href` attribute of each anchor tag is read with `a["href"]` and printed.
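
A variant of the cell above, sketched on a hand-written snippet rather than the live page: it collects the links in a list and reads the attribute with `get()`, which returns `None` when `href` is missing instead of raising `KeyError`. The markup below is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written snippet: attribute order and quoting differ between the tags.
sample_html = """
<a class="uk-link-reset" href="/first/">One</a>
<a href='/second/' class="uk-link-reset">Two</a>
<a class="uk-link-reset">No href here</a>
<a class="other" href="/ignored/">Other</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

links = []
for a in soup.find_all("a", class_="uk-link-reset"):
    href = a.get("href")  # None when the attribute is absent
    if href is not None:
        links.append(href)

print(links[:5])
```

Unlike the regular expression, the lookup does not care about attribute order or quote style, because it works on the parsed tree rather than on the raw text.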

## Extracting meaningful information

We have seen how `beautifulsoup4` can simplify our parsing tasks. But we have only extracted the links from the webpage. What if we want to extract more meaningful information, such as the title of the posts, metadata, and summary? We can do that using `beautifulsoup4` as well.

In [4]:
url = "https://bicol-u.edu.ph/?s=computer+science"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Restrict the search to the <main> element, which holds the search results.
main_content = soup.find("main")
articles = main_content.find_all("article")

for article in articles:
    title = article.find("h2").text.strip()
    link = article.find("a").get("href")  # first anchor inside the article
    summary = article.find("div", class_="uk-margin-medium").text.strip()
    print(f"{title}\n{link}\n{summary}\n")

DSWD V to offer internship program for BUeños
https://bicol-u.edu.ph/dswd-v-to-offer-internship-program-for-buenos/
To learn through actual work experience, the Department of Social Welfare and Development (DSWD) Field Office V will provide an on-the-job training (OJT) program to select BUeños, formalized through a Memorandum of Agreement signed with Bicol University (BU).  DSWD will accommodate BUeños who will undergo field instruction (FI), field observation (FO), and/or OJT from the […]

CENG Mentors Off to International Academic Visit, Cultural Immersion in Taiwan
https://bicol-u.edu.ph/ceng-mentors-off-to-international-academic-visit-cultural-immersion-in-taiwan/


BUeños return after a successful international student exchange program in Korea & Taiwan
https://bicol-u.edu.ph/buenos-return-after-a-successful-international-student-exchange-program-in-korea-taiwan/
Three Bicol University students return, having completed two international student exchange programs for the Global Kor

These are the steps followed in the code:

1. The HTML content of the webpage is obtained using the `requests` library (`requests` and `BeautifulSoup` were imported in the earlier cells).
2. A `BeautifulSoup` object is created to parse the HTML content.
3. The main content of the webpage is extracted using the `find()` method.
4. All articles containing the posts are extracted using the `find_all()` method.
5. For each article in the list of articles, the title, link, and summary are extracted and printed.
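
Note that `find()` returns `None` when nothing matches, so chaining `.text` onto it raises `AttributeError` if an article lacks one of the expected elements. The sketch below shows one defensive way to handle that. It runs on hand-written markup loosely modelled on the search results, and `text_of` is a hypothetical helper introduced here, not part of `beautifulsoup4`.

```python
from bs4 import BeautifulSoup

# Hand-written markup; the second article deliberately lacks a summary block.
sample_html = """
<main>
  <article>
    <h2><a href="/post-one/">Post one</a></h2>
    <div class="uk-margin-medium">Summary of post one.</div>
  </article>
  <article>
    <h2><a href="/post-two/">Post two</a></h2>
  </article>
</main>
"""


def text_of(tag, default=""):
    """Return the stripped text of a tag, or a default when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else default


soup = BeautifulSoup(sample_html, "html.parser")
posts = []
for article in soup.find("main").find_all("article"):
    link_tag = article.find("a")
    posts.append(
        {
            "title": text_of(article.find("h2")),
            "link": link_tag.get("href") if link_tag is not None else None,
            "summary": text_of(article.find("div", class_="uk-margin-medium")),
        }
    )

for post in posts:
    print(post)
```

Storing each post as a dictionary also makes it easier to write the results to a file later than printing them directly would.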

## Example: Getting to Philosophy

**How many pages does it take to get to the "Philosophy" page on Wikipedia?**

This is a fun experiment that you can do using `beautifulsoup4`. The experiment is based on the observation that if you click on the first link in the main text of a Wikipedia page and then repeat the process for subsequent pages, you will eventually reach the "Philosophy" page. This is known as the "Getting to Philosophy" phenomenon. There is even a dedicated [Wikipedia page](https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy) for this!

In [5]:
def is_valid_link(a_tag):
    """
    Checks if an anchor tag is a valid link to follow.

    A link is valid if it is not empty, does not contain parentheses, and is not italicized.

    Args:
        `a_tag` (bs4.element.Tag): The anchor tag to check.
    """
    return (
        a_tag.text != ""
        and "(" not in a_tag.text
        and ")" not in a_tag.text
        # Italic text is wrapped in <i> or <em>; "italic" is not an HTML attribute.
        and a_tag.find_parent(["i", "em"]) is None
    )


def get_first_valid_link(url):
    """
    Scrapes a Wikipedia page and returns the URL of the first valid link on the page.

    Args:
        `url` (str): The URL of the Wikipedia page to scrape.

    Returns:
        `str` or `None`: The URL of the first valid link on the page, or None if no valid links are found.
    """
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    content = soup.find(id="mw-content-text")
    paragraphs = content.find_all("p")

    for paragraph in paragraphs:
        links = paragraph.find_all("a", href=True)
        for link in links:
            if not is_valid_link(link):
                continue
            return "https://en.wikipedia.org" + link["href"]

    return None


def scrape_until_philosophy(start_url, max_steps=50):
    """
    Scrapes Wikipedia pages, following the first valid link on each page, until the Philosophy page is reached.

    Args:
        `start_url` (str): The URL of the Wikipedia page to start scraping from.
        `max_steps` (int): The maximum number of pages to visit before giving up.

    Returns:
        `int`: The number of steps it took to get to the Philosophy page, or -1 if the step limit is reached or a page has no valid link.
    """
    goal_link = "https://en.wikipedia.org/wiki/Philosophy"
    next_link = start_url
    for i in range(max_steps):
        print(f"Scraping: {next_link}")
        if next_link == goal_link:
            return i
        next_link = get_first_valid_link(next_link)
        if next_link is None:
            return -1
    return -1


start_link = "https://en.wikipedia.org/wiki/Independent_film"
visit_count = scrape_until_philosophy(start_link)
print(f"Philosophy reached in {visit_count} steps")

Scraping: https://en.wikipedia.org/wiki/Independent_film
Scraping: https://en.wikipedia.org/wiki/Feature_film
Scraping: https://en.wikipedia.org/wiki/Narrative_film
Scraping: https://en.wikipedia.org/wiki/Motion_picture
Scraping: https://en.wikipedia.org/wiki/Visual_art
Scraping: https://en.wikipedia.org/wiki/Art#Forms,_genres,_media,_and_styles
Scraping: https://en.wikipedia.org/wiki/Human_behavior
Scraping: https://en.wikipedia.org/wiki/Energy_(psychological)
Scraping: https://en.wikipedia.org/wiki/Psychology
Scraping: https://en.wikipedia.org/wiki/Mind
Scraping: https://en.wikipedia.org/wiki/Thought
Scraping: https://en.wikipedia.org/wiki/Consciousness
Scraping: https://en.wikipedia.org/wiki/Awareness
Scraping: https://en.wikipedia.org/wiki/Philosophy
Philosophy reached in 13 steps


I will leave the explanation of the code as an exercise for you. Try changing some parts of the code and see what happens.

## Coding Challenge

The code for the "Getting to Philosophy" experiment doesn't always work. For example, it can fail when started from a random page: [https://en.wikipedia.org/wiki/Special:Random](https://en.wikipedia.org/wiki/Special:Random).

It seems that there are some pages on Wikipedia that don't lead to the "Philosophy" page. Can you modify the code to handle this? Begin by opening the link in a web browser and inspecting the HTML content. Is there a pattern that you can use to identify pages that don't lead to the "Philosophy" page? 
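
One possible starting point, sketched on a toy link table rather than live Wikipedia pages, is to remember which pages have already been visited and stop as soon as one repeats. The page names and the `follow_links` function below are hypothetical. The sketch only illustrates the bookkeeping, not the link filtering the challenge asks you to investigate.

```python
# Toy stand-in for "first valid link" lookups; the page names are made up.
first_link = {
    "A": "B",
    "B": "C",
    "C": "D",
    "D": "B",  # D links back to B, forming a loop
    "X": "Philosophy",
}


def follow_links(start, goal="Philosophy", max_steps=50):
    """Follow first links from `start`; return the path and why the walk stopped."""
    path = []
    seen = set()
    current = start
    for _ in range(max_steps):
        if current is None:
            return path, "dead end"
        path.append(current)
        if current == goal:
            return path, "reached goal"
        if current in seen:
            return path, "loop detected"
        seen.add(current)
        current = first_link.get(current)  # None when the page has no link
    return path, "step limit reached"


print(follow_links("A"))
print(follow_links("X"))
```

The same idea could be carried over to `scrape_until_philosophy` by keeping a set of visited URLs, although that alone does not explain why a page ends up in a loop.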

In [6]:
random_page = "https://en.wikipedia.org/wiki/Special:Random"
visit_count = scrape_until_philosophy(random_page)
print(f"Philosophy reached in {visit_count} steps")

Scraping: https://en.wikipedia.org/wiki/Special:Random
Scraping: https://en.wikipedia.org/wiki/Great_Britain
Scraping: https://en.wikipedia.org/wiki/Island
Scraping: https://en.wikipedia.org/wiki/Atoll
Scraping: https://en.wikipedia.org/wiki/Help:IPA/English
Scraping: https://en.wikipedia.org/wiki/International_Phonetic_Alphabet
Scraping: https://en.wikipedia.org/wiki/Alphabet
Scraping: https://en.wikipedia.org/wiki/Letter_(alphabet)
Scraping: https://en.wikipedia.org/wiki/Symbol
Scraping: https://en.wikipedia.org/wiki/Sign_(semiotics)
Scraping: https://en.wikipedia.org/wiki/Semiotics
Scraping: https://en.wikipedia.org/wiki/Help:IPA/English
Scraping: https://en.wikipedia.org/wiki/International_Phonetic_Alphabet
Scraping: https://en.wikipedia.org/wiki/Alphabet
Scraping: https://en.wikipedia.org/wiki/Letter_(alphabet)
Scraping: https://en.wikipedia.org/wiki/Symbol
Scraping: https://en.wikipedia.org/wiki/Sign_(semiotics)
Scraping: https://en.wikipedia.org/wiki/Semiotics
Scraping: https://