### Web Scraping

##### What is Beautiful Soup?
Beautiful Soup is a Python library for parsing HTML and XML and pulling data out of them. It is a beginner-friendly option for parsing web content: you search the document by tag name, text content, or attribute, which makes navigating the HTML tree much easier. Put simply, it is a tool that helps you pull structured data out of web pages.

##### Main features
- Dealing with poorly formatted HTML

    In most situations, Beautiful Soup can parse even badly malformed HTML. In extreme cases you may need to switch the underlying parser or adjust Beautiful Soup’s parameters.

- Encoding conversion

    Beautiful Soup can detect a document's encoding automatically and convert the text to Unicode. If detection fails, you can specify the encoding yourself with the from_encoding argument.

- Integration with parsing libraries

    Beautiful Soup sits on top of parsers such as lxml and html5lib (as well as Python's built-in html.parser), so you can choose the trade-off between speed and leniency that suits your project.

- Excellent error handling

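    Beautiful Soup tolerates parsing mistakes instead of failing on them: the underlying parser recovers from broken markup as best it can, and a search that matches nothing returns None or an empty list rather than raising an error. As a result, the parsing process becomes much more manageable.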
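
The features above can be tried on a few throwaway snippets. The following is a minimal sketch rather than part of the original walkthrough: the HTML fragments and the helper name make_soup are illustrative assumptions, and the fallback relies on Beautiful Soup raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse with the preferred parser, falling back to the built-in one.

    make_soup is a hypothetical helper for this example, not a bs4 API.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; html.parser ships with Python.
        return BeautifulSoup(markup, "html.parser")


# 1. Poorly formatted HTML: neither <p> nor <b> is ever closed.
broken = BeautifulSoup("<p>Unclosed <b>bold", "html.parser")
bold_text = broken.b.get_text()        # text inside the repaired <b> tag
paragraph_text = broken.p.get_text()   # text of the whole paragraph

# 2. Encoding conversion: UTF-8 bytes in, Unicode text out.
raw = "<p>café</p>".encode("utf-8")
decoded = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")
cafe = decoded.p.get_text()

# 3. Parser integration: the same call works with whichever parser is available.
soup = make_soup("<ul><li>one</li><li>two</li></ul>")
items = [li.get_text() for li in soup.find_all("li")]

print(bold_text, "|", paragraph_text, "|", cafe, "|", items)
```

Passing bytes rather than an already decoded string is what gives Beautiful Soup the chance to handle the encoding itself.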

##### Advantages of using Beautiful Soup
- Beginner friendly
- Open-source and free
- Simple to implement
- Flexible parsing options

##### Disadvantages of using Beautiful Soup
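- Relies on other libraries for fetching pages (such as requests) and for faster parsing (such as lxml)
- Not very scalable
- No built-in networking, so proxies must be handled by the HTTP client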

##### Scrapy vs. Beautiful Soup

|Criteria|Scrapy|Beautiful Soup|
|--------|------|--------------|
|Purpose|Web scraping and crawling|Parsing|
|Speed|Fast|Average|
|Scraping projects|Small to large scale|Small to medium scale|
|Scalability|Highly scalable and can handle large-scale projects|Not as suitable for large-scale projects|
|Asynchronous|Yes|No|
|Crawling|Designed for web scraping and crawling|Focused on parsing and manipulating HTML|
|Extensions|High|Limited|
|Browser support|No|No (it parses markup and does not drive a browser)|
|Headless execution|No|No|
|Browser interaction|No|No|

In [2]:
!pip install beautifulsoup4 requests



Beautiful Soup is ideal for lightweight, memory-efficient scraping, especially when you don’t need the full crawling power of Scrapy. It transforms raw HTML into a navigable parse tree, allowing you to extract elements using Pythonic syntax, and it is often paired with requests for fetching pages. Let’s walk through the entire process, from setup to advanced parsing, with modular code examples.

##### Step-by-Step Scraping Workflow
1. Fetch the Web Page
2. Parse HTML with Beautiful Soup
3. Extract Data
4. Handle Pagination

In [8]:
import requests
from bs4 import BeautifulSoup

# 1. Fetch the webpage
url = "http://books.toscrape.com/"
response = requests.get(url, timeout=10)  # time out rather than hang on a stalled connection

# Check for a successful response
if response.status_code == 200:
    html = response.text
else:
    raise RuntimeError(f"Failed to fetch page: {response.status_code}")

In [9]:
print(html)

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->
    <head>
        <title>
    All products | Books to Scrape - Sandbox
</title>

        <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
        <meta name="created" content="24th Jun 2016 09:29" />
        <meta name="description" content="" />
        <meta name="viewport" content="width=device-width" />
        <meta name="robots" content="NOARCHIVE,NOCACHE" />

        <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
        <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->

        
            <link rel="shortcut icon" href="static/oscar/favicon.

In [10]:
# 2. Parse HTML with Beautiful Soup
soup = BeautifulSoup(html, 'html.parser')
print(soup)

<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="s

In [11]:
soup.title

<title>
    All products | Books to Scrape - Sandbox
</title>

In [14]:
soup.title.string

'\n    All products | Books to Scrape - Sandbox\n'

In [15]:
# All links in the page
nb_links = len(soup.find_all("a"))
print(f"There are {nb_links} links in this page")

There are 94 links in this page


In [16]:
# Text from the page (runs of blank lines in the output are collapsed here)
print(soup.get_text())

All products | Books to Scrape - Sandbox
Books to Scrape We love being scraped!
Home
All products
Books
Travel
Mystery
Historical Fiction
Sequential Art
Classics
Philosophy

In [17]:
# You can also use 'lxml' or 'html5lib' for faster or more lenient parsing.

# 3. Extract Data
# response.text was most likely decoded as ISO-8859-1 (the fallback requests
# uses when the HTTP header names no charset), which renders '£' as 'Â£'.
# Parsing the raw bytes lets Beautiful Soup read the UTF-8 charset declared
# in the page's <meta> tag.
soup = BeautifulSoup(response.content, 'html.parser')

# Let’s extract book titles and prices:
books = soup.select('article.product_pod')

for book in books:
    title = book.h3.a['title']
    price = book.select_one('.price_color').text
    print(f"{title} - {price}")

A Light in the Attic - £51.77
Tipping the Velvet - £53.74
Soumission - £50.10
Sharp Objects - £47.82
Sapiens: A Brief History of Humankind - £54.23
The Requiem Red - £22.65
The Dirty Little Secrets of Getting Your Dream Job - £33.34
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull - £17.93
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics - £22.60
The Black Maria - £52.15
Starving Hearts (Triangular Trade Trilogy, #1) - £13.99
Shakespeare's Sonnets - £20.66
Set Me Free - £17.46
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1) - £52.29
Rip it Up and Start Again - £35.02
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991 - £57.25
Olio - £23.88
Mesaerion: The Best Science Fiction Stories 1800-1849 - £37.59
Libertarianism for Beginners - £51.33
It's Only the Himalayas - £45.17
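
The printed strings can also be collected into structured records. The following is a minimal sketch, assuming markup shaped like the product_pod cards selected above; the sample HTML, the helper name parse_books, and the choice of Decimal for prices are illustrative assumptions rather than part of the original notebook.

```python
from decimal import Decimal

from bs4 import BeautifulSoup

# Simplified stand-in for two product cards (an assumption, not the live page)
sample_html = """
<article class="product_pod">
  <h3><a href="a.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
<article class="product_pod">
  <h3><a href="b.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>
  <p class="price_color">£53.74</p>
</article>
"""


def parse_books(markup):
    """Return a list of {'title', 'price'} dicts, one per product card."""
    soup = BeautifulSoup(markup, "html.parser")
    records = []
    for card in soup.select("article.product_pod"):
        link = card.select_one("h3 a")
        price_tag = card.select_one(".price_color")
        if link is None or price_tag is None:
            continue  # skip cards that lack the expected pieces
        price_text = price_tag.get_text(strip=True)
        records.append({
            # prefer the full title attribute over the shortened link text
            "title": link.get("title", link.get_text(strip=True)),
            # drop the currency symbol before converting to a number
            "price": Decimal(price_text.lstrip("£")),
        })
    return records


books_data = parse_books(sample_html)
print(books_data)
```

Keeping the parsing in one function makes it easy to reuse the same logic on every page of a paginated listing.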


##### Targeting DOM elements

When using Beautiful Soup for web scraping, one of the most important tasks is targeting and extracting specific DOM (Document Object Model) elements. The DOM is a programming interface for web documents that represents a page as a tree. Imagine the HTML code of a webpage as an upside-down tree: each HTML element (a heading, a paragraph, a link) is a node in this tree.

![DOM](dom_tree.png)
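
To make the tree picture concrete, here is a minimal sketch that walks a small document and lists each element indented by its depth. The HTML snippet and the helper name outline are assumptions made for illustration; the traversal relies on the children and parents attributes of Beautiful Soup's Tag objects.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = """
<html>
  <head><title>Demo</title></head>
  <body>
    <h1>Heading</h1>
    <p>Some <a href="/more">link</a></p>
  </body>
</html>
"""


def outline(node, depth=0):
    """Return one indented line per element, parents before children."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):  # skip text nodes, keep elements only
            lines.append("  " * depth + child.name)
            lines.extend(outline(child, depth + 1))
    return lines


soup = BeautifulSoup(html_doc, "html.parser")
tree = outline(soup)
print("\n".join(tree))

# Each node also knows its parents, so you can move up the tree as well as down.
link = soup.find("a")
ancestors = [parent.name for parent in link.parents]
print(ancestors)
```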

BS4 allows you to quickly and elegantly target the DOM elements you need. Here are the different ways Beautiful Soup provides to target these elements within the DOM:

##### Finding by Tag

To find elements by their tag name in Beautiful Soup, you have two main options: the find method and the find_all method.

- find: This method searches the parsed HTML document from top to bottom and returns the first occurrence of the tag you specify.
- find_all: This method searches the entire parsed HTML document and returns a list containing all instances of the specified tag or set of tags.
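
The difference between the two methods is easiest to see side by side. This is a minimal sketch on a made-up HTML fragment (the markup and variable names are assumptions for illustration); it also shows what each method returns when nothing matches.

```python
from bs4 import BeautifulSoup

html_doc = """
<ul>
  <li><a href="/a">First</a></li>
  <li><a href="/b">Second</a></li>
  <li><a href="/c">Third</a></li>
</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.find("a")               # a single Tag: the first <a>
all_links = soup.find_all("a")            # list-like result with every <a>
two_links = soup.find_all("a", limit=2)   # stop after two matches
several = soup.find_all(["a", "li"])      # a list of names matches any of them
missing = soup.find("table")              # no match: None, not an exception
no_tables = soup.find_all("table")        # no match: an empty list

hrefs = [a["href"] for a in all_links]
print(first_link.get_text(), hrefs, len(two_links), len(several), missing, no_tables)
```

Because find can return None, check its result before reading attributes or text from it.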

In [18]:
soup.find_all("a")

[<a href="index.html">Books to Scrape</a>,
 <a href="index.html">Home</a>,
 <a href="catalogue/category/books_1/index.html">
                             
                                 Books
                             
                         </a>,
 <a href="catalogue/category/books/travel_2/index.html">
                             
                                 Travel
                             
                         </a>,
 <a href="catalogue/category/books/mystery_3/index.html">
                             
                                 Mystery
                             
                         </a>,
 <a href="catalogue/category/books/historical-fiction_4/index.html">
                             
                                 Historical Fiction
                             
                         </a>,
 <a href="catalogue/category/books/sequential-art_5/index.html">
                             
                                 Sequential Art
            

The find_all('span') call returns every <span> tag in the parsed HTML. This page has none, so the result is an empty list (find('span') would return None instead):

In [20]:
soup.find_all("span")

[]

##### Finding by Class or ID
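BS4 provides methods to locate elements by their class or ID attributes, using the class_ and id keyword arguments. The examples below search for a class (athing) and an ID (pagespace) that do not occur on this page, so find returns None (the notebook shows no output) and find_all returns an empty list: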

In [21]:
soup.find(class_="athing")

In [22]:
soup.find_all(class_="athing")

[]

In [23]:
soup.find(id="pagespace")
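
For a case where the lookups do match, here is a minimal sketch on a small hand-written fragment. The markup, class names, and ID are assumptions chosen for illustration, not taken from the page scraped above.

```python
from bs4 import BeautifulSoup

html_doc = """
<div id="content">
  <p class="intro lead">Welcome</p>
  <p class="note">First note</p>
  <p class="note">Second note</p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# class is a reserved word in Python, hence the trailing underscore in class_
first_note = soup.find(class_="note")
all_notes = soup.find_all("p", class_="note")

# class_ matches any one of an element's classes
intro = soup.find(class_="lead")

# IDs should be unique within a page, so find() is usually enough
content = soup.find(id="content")

# The same lookups written as CSS selectors
notes_css = soup.select("p.note")
content_css = soup.select_one("#content")

print(first_note.get_text(), len(all_notes), intro["class"], content.name)
```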

In [5]:
# 4. Handle Pagination
# To scrape multiple pages, follow the "next" link until there is none.
import time
from urllib.parse import urljoin

page_url = url
while page_url:
    response = requests.get(page_url, timeout=10)
    response.raise_for_status()
    # Parse the raw bytes so the charset declared in the page is honoured
    # (response.text renders '£' as 'Â£' on this site)
    soup = BeautifulSoup(response.content, 'html.parser')

    books = soup.select('article.product_pod')
    for book in books:
        title = book.h3.a['title']
        price = book.select_one('.price_color').text
        print(f"{title} - {price}")

    # The href is relative to the current page, so resolve it against page_url
    next_link = soup.select_one('li.next a')
    page_url = urljoin(page_url, next_link['href']) if next_link else None
    time.sleep(1)  # pause between requests to avoid hammering the server

A Light in the Attic - £51.77
Tipping the Velvet - £53.74
Soumission - £50.10
Sharp Objects - £47.82
Sapiens: A Brief History of Humankind - £54.23
The Requiem Red - £22.65
The Dirty Little Secrets of Getting Your Dream Job - £33.34
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull - £17.93
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics - £22.60
The Black Maria - £52.15
Starving Hearts (Triangular Trade Trilogy, #1) - £13.99
Shakespeare's Sonnets - £20.66
Set Me Free - £17.46
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1) - £52.29
Rip it Up and Start Again - £35.02
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991 - £57.25
Olio - £23.88
Mesaerion: The Best Science Fiction Stories 1800-1849 - £37.59
Libertarianism for Beginners - £51.33
It's Only the Himalayas - £45.17
[output truncated]
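
A relative "next" link is resolved against the URL of the page it appears on, not against the start URL. The following minimal sketch shows the difference with urljoin from the standard library; the two href values are assumptions about what such links might contain.

```python
from urllib.parse import urljoin

base = "http://books.toscrape.com/"

# From the front page, a relative href resolves under the site root.
page_2 = urljoin(base, "catalogue/page-2.html")

# From page 2, the same kind of href resolves next to the current page.
page_3 = urljoin(page_2, "page-3.html")

# Gluing every href onto the start URL loses the "catalogue/" directory.
naive_page_3 = base.rsplit("/", 1)[0] + "/" + "page-3.html"

print(page_2)        # http://books.toscrape.com/catalogue/page-2.html
print(page_3)        # http://books.toscrape.com/catalogue/page-3.html
print(naive_page_3)  # http://books.toscrape.com/page-3.html
```

A pagination loop therefore needs to keep track of the current page URL and join each new href onto it.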

In [6]:
# Advanced Techniques
# Find by Tag and Attribute
soup.find_all('a', class_='active')


# CSS Selectors
soup.select('div.nav ul li.blog a.active')


# Extract Metadata
# find() returns None when no tag matches, so check before reading attributes
meta = soup.find('meta', attrs={'name': 'description'})
description = meta.get('content') if meta else None

In [7]:
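# Export to JSON
import json

data = []

# `books` holds the product cards from the most recently parsed page
for book in books:
    data.append({
        'title': book.h3.a['title'],
        'price': book.select_one('.price_color').text
    })

with open('books.json', 'w', encoding='utf-8') as f:
    # ensure_ascii=False writes '£' as-is instead of the escape \u00a3
    json.dump(data, f, indent=4, ensure_ascii=False)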

##### 🛡️ Ethics & Robustness
- Respect robots.txt
- Use time.sleep() to avoid hammering servers
- Handle exceptions and broken HTML gracefully
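
The three points above can be combined into one small helper. This is a minimal sketch under stated assumptions: the user-agent string, the robots.txt rules, the example.com URLs, and the helper name polite_get are all hypothetical, and the rules are parsed from a string so that the example runs without network access.

```python
import time
from urllib.robotparser import RobotFileParser

import requests

USER_AGENT = "example-scraper"  # hypothetical name; use one that identifies you

# Rules are parsed from a string here so the example runs offline. Against a
# real site you would call set_url(".../robots.txt") followed by read().
robots = RobotFileParser()
robots.parse("""
User-agent: *
Disallow: /private/
""".splitlines())


def polite_get(url, delay=1.0, fetch=requests.get):
    """Fetch url if robots.txt allows it; return None when skipped or failed."""
    if not robots.can_fetch(USER_AGENT, url):
        return None  # disallowed by robots.txt
    time.sleep(delay)  # pause before every request
    try:
        response = fetch(url, timeout=10, headers={"User-Agent": USER_AGENT})
        response.raise_for_status()
    except requests.RequestException:
        return None  # network error or non-2xx status
    return response


allowed = robots.can_fetch(USER_AGENT, "http://example.com/catalogue/page-2.html")
blocked = robots.can_fetch(USER_AGENT, "http://example.com/private/data.html")
print(allowed, blocked)
```

Returning None for both skipped and failed requests keeps the calling loop simple; a larger scraper would probably log the reason as well.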