# Class 10: Web scraping with Python

## Learning outcomes

At the completion of this unit, students should be able to:
1. Understand how to scrape web pages using the `BeautifulSoup` library

## 10.1 Scraping the internet using the `BeautifulSoup` library

Scraping a website means visiting its pages and downloading a copy of each one. Typically, the scraping algorithm navigates the website by following the links on each page, then visiting the linked pages and downloading their HTML code.
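The link-following idea described above can be sketched as follows. This is only a minimal illustration, not a production crawler: `FAKE_SITE`, its `example.com` URLs, and the `crawl()` helper are all hypothetical names made up for this sketch, and the pages are stored in a dictionary so that the code runs without a network connection. It uses `BeautifulSoup`, which is introduced later in this section, to find the links on each page.

```python
from collections import deque
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical in-memory "website": URL -> HTML source.
# A real scraper would download each page, e.g. with requests.get(url).text.
FAKE_SITE = {
    "http://example.com/home.html":
        "<html><body><a href='a.html'>A</a> <a href='b.html'>B</a></body></html>",
    "http://example.com/a.html":
        "<html><body><a href='home.html'>Home</a></body></html>",
    "http://example.com/b.html":
        "<html><body>No links here.</body></html>",
}


def crawl(start_url, fetch):
    """Visit start_url and every page reachable from it; return {url: html}."""
    downloaded = {}
    to_visit = deque([start_url])
    while to_visit:
        url = to_visit.popleft()
        if url in downloaded:
            continue  # already visited; prevents endless loops on cyclic links
        html = fetch(url)
        if html is None:
            continue  # the page could not be fetched
        downloaded[url] = html
        soup = BeautifulSoup(html, "html.parser")
        # Queue every link on this page, resolved against the page's own URL
        for tag in soup.find_all("a", href=True):
            to_visit.append(urljoin(url, tag["href"]))
    return downloaded


pages = crawl("http://example.com/home.html", FAKE_SITE.get)
print(sorted(pages))
```

The set of already downloaded pages matters: `a.html` links back to `home.html`, so without that check the loop would never finish.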

Let's start with a simple example: we visit a web page that contains a link.

In [None]:
import requests
response = requests.get("http://www.sheriftawfikabbas.com/blog/basic_home.html")
data = response.text
print(data)


<html>
  <head>
    <title>I am a very simple web page</title>
  </head>
  <body>
    This is the home page of the website.
    Please visit this page: <a href='basic.html'>Basic</a>
  </body>

</html>



Next, we are going to extract the link's target (the value of its `href` attribute) so that we can visit the linked page. To do that, we use `BeautifulSoup`. Its documentation is at https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

Let's take a look at an example.



In [None]:

from bs4 import BeautifulSoup
import requests

base_url = "http://www.sheriftawfikabbas.com/blog/"
response = requests.get(base_url + "basic_home.html")
data = response.text

# Name the parser explicitly; omitting it makes BeautifulSoup emit a warning
data_bs = BeautifulSoup(data, "html.parser")

# find() returns only the first matching tag (or None if nothing matches)
first_link = data_bs.find('a', {'href': "basic.html"})
print("First link using find():", first_link, first_link.text)

# find_all() returns a list of every matching tag
list_of_links = data_bs.find_all('a')
print('The list of links:', list_of_links)

# Visit each linked page and print its HTML
for l in list_of_links:
    link = l.get('href')
    print(link)
    response = requests.get(base_url + link)
    new_page = response.text
    print("New page:\n" + new_page)


First link using find(): <a href="basic.html">Basic</a> Basic
The list of links: [<a href="basic.html">Basic</a>]
basic.html
New page:
<html>
  <head>
    <title>I am a very simple web page</title>
  </head>
  <body>
    This is the stuff in the body of this web page.
  </body>

</html>



Several things happened in these lines of code:
- First, to use `BeautifulSoup`, we import it from the `bs4` package and create a `BeautifulSoup` object from the page's HTML.
- The `response` object gives access to the web page content via its `text` attribute. You can also convert JSON content into a dictionary using the `json()` method, as we did in Class 9.
- The `find()` method returns the first tag that matches, while `find_all()` returns a list of all matching tags. Here, `find_all('a')` returns every link tag on the page.
- For each link tag, you can obtain the value of the `href` attribute using the `get()` method.
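To experiment with these methods without downloading anything, you can parse an HTML string directly. The snippet below is a small sketch under stated assumptions: the HTML and the `base_url` value are made up for illustration and do not belong to the website used above. It also shows `urljoin()` from the standard library as one possible way to turn a relative `href` into a full URL, as an alternative to concatenating strings.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page content and address, used only for this illustration
html = """
<html>
  <body>
    <a href='basic.html'>Basic</a>
    <a href='about.html'>About</a>
    <a>No target</a>
  </body>
</html>
"""
base_url = "http://example.com/blog/home.html"

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag
first = soup.find("a")
print(first.text)         # Basic
print(first.get("href"))  # basic.html

# find_all() returns every matching tag; get() returns None if the
# attribute is missing, so tags without an href are skipped
links = []
for tag in soup.find_all("a"):
    href = tag.get("href")
    if href is not None:
        links.append(urljoin(base_url, href))
print(links)
```

Checking for `None` is worthwhile on real pages, where some `<a>` tags have no `href` attribute and string concatenation with `None` would raise an error.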

**GOTO Lab exercises 1-3**
