# Class 10: Web scraping with Python

## Learning outcomes

At the completion of this unit students should be able to:
1.   Understand how to scrape web pages using the `BeautifulSoup` library

## 10.1 Scraping the internet using the `BeautifulSoup` library

Scraping a website means visiting its pages and downloading a copy of each one. Typically, a scraping algorithm navigates the website by following the links on each page, then visiting and downloading the HTML of the linked pages.
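
The navigation described above can be sketched as a small breadth-first crawl. This is only an illustration under stated assumptions: the three-page site `FAKE_SITE`, its `example.test` addresses, and the `crawl()` helper are all hypothetical, and the pages are held in a dictionary so that the sketch runs without a network connection. It assumes the `bs4` package is installed.

```python
from collections import deque
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical three-page site held in memory, standing in for real HTTP responses.
FAKE_SITE = {
    "https://example.test/": '<a href="/about">About</a> <a href="/contact">Contact</a>',
    "https://example.test/about": '<a href="/">Home</a> <a href="/contact">Contact</a>',
    "https://example.test/contact": '<a href="/">Home</a>',
}


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl: download a page, then queue the links it contains."""
    downloaded = {}              # url -> HTML source of the page
    queue = deque([start_url])   # pages waiting to be visited
    while queue and len(downloaded) < max_pages:
        url = queue.popleft()
        if url in downloaded:
            continue             # already visited
        html = fetch(url)
        if html is None:
            continue             # page not available
        downloaded[url] = html
        soup = BeautifulSoup(html, "html.parser")
        for link in soup.find_all("a", href=True):
            # Resolve relative links such as "/about" against the current page.
            queue.append(urljoin(url, link["href"]))
    return downloaded


pages = crawl("https://example.test/", FAKE_SITE.get)
print(sorted(pages))
```

To crawl a real site, the `fetch` argument could be replaced by a function that calls `requests.get(url).text`. The `max_pages` limit is there so that the crawl always stops.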

Let's start with a simple example. First, we try to extract the text of an HTML tag using plain string replacement. Then we download a web page that has a link.

In [3]:
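a = "<h1 class='some class'>I need to extract this text.</h1>"

# replace() removes exact matches only. The opening tag carries a class
# attribute, so '<h1>' never matches and only '</h1>' is removed.
a = a.replace('<h1>','')
a = a.replace('</h1>','')
print(a)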



<h1 class='some class'>I need to extract this text.
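
The output above still contains the opening tag, because `replace('<h1>', '')` looks for the exact string `<h1>`, which does not occur when the tag has attributes. A minimal sketch of the parser-based alternative follows, assuming the `bs4` package is installed and using Python's built-in `html.parser`:

```python
from bs4 import BeautifulSoup

html = "<h1 class='some class'>I need to extract this text.</h1>"

# Parse the fragment instead of matching tag strings by hand.
soup = BeautifulSoup(html, "html.parser")
heading = soup.find("h1")

# get_text() returns the text inside the tag, whatever its attributes are.
text = heading.get_text()
print(text)
```

Because the parser understands the tag structure, the same two lines work for any attributes the tag happens to carry.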


In [4]:
import requests
response = requests.get("https://www.mathpractice.xyz")
data = response.text
print(data)


<!DOCTYPE html><html lang="en" data-critters-container=""><head><meta name="generator" content="Scully 0.0.0">
    <!-- Google tag (gtag.js) -->
    <script async="" src="https://www.googletagmanager.com/gtag/js?id=G-LWXMJ0KBMX"></script>
    <script>
        window.dataLayer = window.dataLayer || [];
        function gtag() { dataLayer.push(arguments); }
        gtag('js', new Date());

        gtag('config', 'G-LWXMJ0KBMX');
    </script>
    <script type="application/ld+json">
    {
      "@context": "https://schema.org/",
      "@type": "Quiz",
      "typicalAgeRange": "11-12",
      "educationalLevel": "intermediate",
      "assesses": "Algebra",
      "educationalAlignment": [
        {
          "@type": "AlignmentObject",
          "alignmentType": "educationalSubject",
          "targetName": "Mathematics"
        },
        {
          "@type": "AlignmentObject",
          "alignmentType": "educationalSubject",
          "targetName": "Physics"
        }
      ],
      "name"

Next, we are going to extract the address of each link (its `href` attribute) so that we can visit it. To do that, we use `BeautifulSoup`. Its documentation is at https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

Let's take a look at an example.



In [15]:

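from bs4 import BeautifulSoup
import requests

response = requests.get("https://abc.com")
data = response.text

# Name the parser explicitly; otherwise BeautifulSoup guesses one and warns
data_bs = BeautifulSoup(data, "html.parser")

# find() returns only the first matching tag
a = data_bs.find('a')
print("First link using find():", a)

# find_all() returns a list of every matching tag
list_of_links = data_bs.find_all('a')
print('The list of links:', list_of_links)

# get() reads an attribute value, or returns None if the attribute is missing
for link in list_of_links:
    print(link.get('href'))

print(len(list_of_links))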


First link using find(): <a class="navigation__skipnav__button__link" href="#" id="skipnav" tabindex="0">Skip to Content</a>
The list of links: [<a class="navigation__skipnav__button__link" href="#" id="skipnav" tabindex="0">Skip to Content</a>, <a aria-label="ABC" class="AnchorLink navButton__link dib logo tc fitt-tracker" data-track-collection_name="none" data-track-content_language="none" data-track-cta_text="abc.com" data-track-global.authenticated_user_flag="false" data-track-global.personalization="false" data-track-global.tagid="f_click01" data-track-link_name_custom="abc:home - full view:main_menu:abc.com" data-track-module_position_number="none" data-track-position_number="none" data-track-section_page="main_menu" href="/" tabindex="0"><div aria-hidden="true" class="navButton__icon dib logo"><img alt="ABC" class="sitelogo" src="https://assets-cdn.watchdisneyfe.com/delta/assets/abc/abc-nav.png"/></div><div aria-hidden="true" class="abc__link"><span class="navButton__text ttc tc

In [None]:

from bs4 import BeautifulSoup

import requests
response = requests.get("http://abc.com")
data = response.text
print(data)


Several things happened in these lines of code:
- First of all, to use `BeautifulSoup`, we import it from the `bs4` package and create a `BeautifulSoup` object from the HTML of the page.
- Note that the `response` object gives access to the web page content via the `text` attribute. You can also convert JSON content into a dictionary by using the `json()` method, as we did in Class 9.
- You can get a list of all HTML tags with a given name (here, `a`) using the `find_all()` method in `BeautifulSoup`.
- In each link tag, you can obtain the content of the `href` attribute using the `get()` method.
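
An `href` value is often relative (for example `/`) or just a same-page anchor (`#`), so it usually needs some cleaning before the link can be visited. The sketch below shows one possible way to do this with `urljoin` from the standard library. The page address `base_url` and the HTML snippet are hypothetical and replace a live request.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page address and HTML, used in place of a live request.
base_url = "https://example.test/shows/"
html = """
<a href="#" id="skipnav">Skip to Content</a>
<a href="/">Home</a>
<a href="episode-1.html">Episode 1</a>
<a name="top">Anchor without href</a>
"""

soup = BeautifulSoup(html, "html.parser")

absolute_links = []
for link in soup.find_all("a"):
    href = link.get("href")      # None when the tag has no href attribute
    if href is None or href.startswith("#"):
        continue                 # skip missing hrefs and same-page anchors
    # Turn relative addresses into full URLs that requests.get() can visit.
    absolute_links.append(urljoin(base_url, href))

print(absolute_links)
```

Using `get('href')` instead of `link['href']` avoids a `KeyError` for tags that have no `href` attribute.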

## Features of the `BeautifulSoup` library

You can dig into web pages using a wide range of methods and properties provided by `BeautifulSoup`, which we will use in the lab:

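- `findChildren`: finds the tags nested inside an HTML tag (a legacy name for `find_all`; pass `recursive=False` to get direct children only)
- `find`: finds the first HTML tag with the specified name (use `find_all` to get all of them)
- `get`: gets a particular attribute value from an element object, or `None` if the attribute is missing
- `attrs`: retrieves the attributes of an element as a dictionary, and allows you to set attribute values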
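
A small sketch of the four features listed above, run on a hypothetical HTML snippet rather than a real page. It assumes the `bs4` package is installed. Recent releases of the library may show a deprecation warning for `findChildren`, since it is kept as an older name for `find_all`.

```python
from bs4 import BeautifulSoup

# Hypothetical menu, used only for illustration.
html = """
<ul id="menu">
  <li><a href="/home" class="nav">Home</a></li>
  <li><a href="/labs" class="nav">Labs</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

menu = soup.find("ul")                   # first <ul> tag in the document
items = menu.findChildren("li")          # every <li> nested inside the menu
first_link = menu.find("a")              # only the first <a>, not all of them

original_href = first_link.get("href")   # read one attribute value
print(len(items), original_href)
print(first_link.attrs)                  # all attributes as a dictionary

first_link.attrs["href"] = "/start"      # attribute values can be reassigned
print(first_link.get("href"))
```

Note that `class` comes back as a list inside `attrs`, because an HTML element can have several classes.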

**GOTO Lab exercises 1-3**
