# Class 10: Web scraping with Python

## Learning outcomes

At the completion of this unit students should be able to:
1. Understand how to scrape web pages using the `BeautifulSoup` library

## 10.1 Scraping the internet using the `BeautifulSoup` library

Scraping a website means visiting its pages and downloading a copy of each one. Typically, the scraping algorithm navigates the website by following the links on each page, then visiting the linked pages and downloading their code.

Let's start with a simple example. We are going to download a web page that contains a single link.

In [None]:
import requests
response = requests.get("http://www.sheriftawfikabbas.com/blog/basic_home.html")
data = response.text
print(data)


<html>
  <head>
    <title>I am a very simple web page</title>
  </head>
  <body>
    This is the home page of the website.
    Please visit this page: <a href='basic.html'>Basic</a>
  </body>

</html>



Next, we are going to extract the link's target (the value of its `href` attribute) so that we can visit the linked page. To do that, we use `BeautifulSoup`. Its documentation is at https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

Let's take a look at an example.



In [2]:

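from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = "http://www.mathpractice.xyz"
response = requests.get(base_url)
data = response.text

# Name the parser explicitly to avoid a warning and get consistent results
data_bs = BeautifulSoup(data, "html.parser")

# find() returns only the first matching tag
first_link = data_bs.find('a')
print("First link using find():", first_link, first_link.text)

# find_all() returns a list of every matching tag
list_of_links = data_bs.find_all('a')
print('The list of links:', list_of_links)

for l in list_of_links:
    link = l.get('href')
    if link is None:  # skip <a> tags that have no href attribute
        continue
    print(link)
    # urljoin() resolves relative paths such as "/llms" against the base URL
    # and leaves absolute URLs (e.g. the Facebook link) unchanged
    new_url = urljoin(base_url, link)
    response = requests.get(new_url)
    new_page = response.text
    print("New page:", new_url)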


First link using find(): <a _ngcontent-ng-c2373332088="" class="active" href="/" routerlink="/" routerlinkactive="active"><img _ngcontent-ng-c2373332088="" align="left" src="assets/images/logo.png" width="200px"/></a> 
The list of links: [<a _ngcontent-ng-c2373332088="" class="active" href="/" routerlink="/" routerlinkactive="active"><img _ngcontent-ng-c2373332088="" align="left" src="assets/images/logo.png" width="200px"/></a>, <a _ngcontent-ng-c2373332088="" class="nav-link float-right" href="https://www.facebook.com/mathpractice.xyz"><img _ngcontent-ng-c2373332088="" src="assets/images/Facebook_Logo_Primary.png" style="width: 1em; margin-right: 0.5em;"/></a>, <a _ngcontent-ng-c2373332088="" class="nav-link float-right" href="/llms" routerlink="/llms" routerlinkactive="active">ChatGPT etc.</a>, <a _ngcontent-ng-c2373332088="" class="nav-link float-right" href="/why-us" routerlink="/why-us" routerlinkactive="active">Why us?</a>, <a _ngcontent-ng-c2373332088="" class="nav-link float-ri

Several things happened in these lines of code:
- First, to use `BeautifulSoup`, we import it from the `bs4` package and create a `BeautifulSoup` object from the HTML of the page.
- The `response` object gives access to the web page content via its `text` attribute. If the content is JSON, you can convert it into a dictionary using the `json()` method, as we did in Class 9.
- The `find()` method returns the first HTML tag with the given name, while `find_all()` returns a list of all tags with that name (here, all `a` tags).
- For each link tag, you can obtain the value of the `href` attribute using the `get()` method.
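The steps above can be tried without a network connection. The following is a minimal sketch that parses a static copy of the simple home page shown earlier. The `base_url` value and the use of `urljoin()` to turn the relative `href` into a full address are assumptions added for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Static copy of the simple home page shown earlier, so no download is needed
html = """
<html>
  <head>
    <title>I am a very simple web page</title>
  </head>
  <body>
    This is the home page of the website.
    Please visit this page: <a href='basic.html'>Basic</a>
  </body>
</html>
"""

# Assumption: the address the page was downloaded from
base_url = "http://www.sheriftawfikabbas.com/blog/basic_home.html"

soup = BeautifulSoup(html, "html.parser")

links = []
for a in soup.find_all("a"):
    href = a.get("href")  # get() returns None when the attribute is missing
    if href is None:
        continue
    # Store the link text together with the absolute URL it points to
    links.append((a.text, urljoin(base_url, href)))

print(links)
```

Resolving each `href` against the page address is one way to obtain a URL that can be passed directly to `requests.get()`.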

## 10.2 Features of the `BeautifulSoup` library

`BeautifulSoup` provides a wide range of methods and properties for digging into web pages. We will be using the following in the lab:

- `findChildren`: finds the tags nested inside an HTML tag (it is an older name for `find_all`)
- `find`: returns the first HTML tag with the specified name (use `find_all` to get all of them)
- `attrs`: retrieves the attributes of an element as a dictionary, and allows you to set attribute values
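As a rough illustration of these three features, the sketch below runs them on a small, made-up HTML snippet. The menu markup, the `id` values and the variable names are hypothetical and are not taken from the lab.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for this illustration
html = """
<ul id="menu">
  <li><a href="/home" id="home-link">Home</a></li>
  <li><a href="/about" id="about-link">About</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag
first_link = soup.find("a")
first_text = first_link.text

# findChildren() searches the tags nested inside the <ul> tag
menu = soup.find("ul")
items = menu.findChildren("li")
item_texts = [li.text for li in items]

# recursive=False limits the search to direct children of the tag
direct_links = menu.findChildren("a", recursive=False)

# attrs is a dictionary of the tag's attributes; copy it before changing it
attrs_before = dict(first_link.attrs)
first_link.attrs["href"] = "/index"

print(first_text, item_texts, attrs_before, first_link.get("href"))
```

Note that the `a` tags are not direct children of the `ul` tag, so the search with `recursive=False` finds none of them, while the default search looks through all nested levels.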

**GOTO Lab exercises 1-3**
