# Beautiful Soup Introduction: Examples

This notebook introduces the basic functionality of the **`Beautiful Soup`** library.

The goals are to:

1. **parse HTML reliably**
2. **locate elements**
3. **extract clean text/links**

---
## 1. Setup and Introduction

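The **`Beautiful Soup`** library is not part of the Python standard library, so we need to install it first, together with `requests` (to download pages) and `lxml` (the parser used in this notebook):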

```bash
pip install requests beautifulsoup4 lxml
```

In [4]:
!pip install requests beautifulsoup4 lxml



In [5]:
import requests
from bs4 import BeautifulSoup

In [6]:
# Helper: create a soup object from a URL
def get_soup(url, *, parser="lxml"):
    resp = requests.get(url, timeout=20)
    resp.raise_for_status()
    return BeautifulSoup(resp.text, parser)

Beautiful Soup turns your HTML into a **tree**. You can:
1. **Find one** element (`find`)
2. **Find many** elements (`find_all` / `select`)
3. **Filter** by tag name, attributes, text, and CSS selectors
4. **Extract** cleaned text (`.get_text`) and attribute values (`.get('href')`)
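
To make the four operations above concrete without a network call, here is a minimal sketch that parses a small inline HTML string. The HTML snippet and the name `demo_soup` are made up for illustration, and the built-in `html.parser` is used only so the example has no extra dependency; `lxml` should behave the same way here.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, so the example runs offline
html = """
<html><body>
  <h1 id="title">Countries</h1>
  <ul>
    <li class="country"><a href="/ad/">Andorra</a></li>
    <li class="country"><a href="/al/">Albania</a></li>
  </ul>
</body></html>
"""

# "html.parser" ships with Python, so nothing else needs to be installed
demo_soup = BeautifulSoup(html, "html.parser")

first_item = demo_soup.find("li")                        # find one
all_items = demo_soup.find_all("li", class_="country")   # find many
first_link = demo_soup.select_one("li.country a")        # filter with a CSS selector

print(first_item.get_text(strip=True))   # text of the first <li>
print(len(all_items))                    # number of matching <li> elements
print(first_link.get("href"))            # value of the href attribute
```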


#### 1.1 `find` - grab the **first** match

**Signature:** `soup.find(tag_name, attrs={...}, string=..., class_="...")`

- Returns a **single element** (or `None` if not found)
- Great when you expect exactly one main region or header

**Example (Simple page):**

In [10]:
url = "https://www.scrapethissite.com/pages/simple/"
soup = get_soup(url)
# Find the first <h3> with class "country-name"
country_h3 = soup.find("h3", class_="country-name")
country_h3.get_text(strip=True) if country_h3 else None

'Andorra'

**Tips**
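- Prefer `class_="foo"` over `attrs={'class': 'foo'}`; the trailing underscore is there because `class` is a reserved word in Python
- For IDs, use `id="main"` or `attrs={'id': 'main'}`
- Use `.find(...).get_text(strip=True)` to get trimmed text in one go, but only when the element is guaranteed to exist: if `find` returns `None`, the chained call raises `AttributeError`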
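
The tips above can be checked on a small offline snippet. This is only a sketch: the HTML string and the variable names are invented for the example, and `html.parser` is used to avoid depending on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, loosely modelled on the Simple page's markup
html = '<div id="main"><h3 class="country-name"> Andorra </h3></div>'
demo_soup = BeautifulSoup(html, "html.parser")

by_keyword = demo_soup.find("h3", class_="country-name")
by_attrs = demo_soup.find("h3", attrs={"class": "country-name"})
main_div = demo_soup.find(id="main")
missing = demo_soup.find("h4")   # there is no <h4>, so this is None

# Both spellings locate the same element in the tree
same_element = by_keyword is by_attrs

# Guard against None before chaining .get_text()
name = by_keyword.get_text(strip=True) if by_keyword else None
missing_text = missing.get_text(strip=True) if missing else None

print(same_element, name, main_div.name, missing_text)
```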


---
## 2. `find_all` - grab **all** matches (list)

**Signature:** `soup.find_all(tag_name, attrs={...}, limit=None)`

- Returns a **list** of elements (a list-like `ResultSet`, possibly empty)
- Combine with list comprehensions for fast extraction

In [14]:
# All countries listed on the Simple page
countries = soup.find_all("div", class_="country")
len(countries), [c.find("h3", class_="country-name").get_text(strip=True) for c in countries[:15]]

(250,
 ['Andorra',
  'United Arab Emirates',
  'Afghanistan',
  'Antigua and Barbuda',
  'Anguilla',
  'Albania',
  'Armenia',
  'Angola',
  'Antarctica',
  'Argentina',
  'American Samoa',
  'Austria',
  'Australia',
  'Aruba',
  'Åland'])

**Filtering by attributes**

Any HTML attribute can be matched via `attrs`. For example: `attrs={'data-role': 'row'}`
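
As a small illustration of attribute filtering, the sketch below runs `find_all` with `attrs` on an invented table. The `data-role` values and the HTML are assumptions made for the example, not markup taken from the site. Hyphenated names such as `data-role` are not valid Python keyword arguments, which is one reason to go through `attrs`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a custom data-* attribute
html = """
<table>
  <tr data-role="header"><th>Team</th></tr>
  <tr data-role="row"><td>Boston Bruins</td></tr>
  <tr data-role="row"><td>Buffalo Sabres</td></tr>
</table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Match only the <tr> elements whose data-role is exactly "row"
rows = demo_soup.find_all("tr", attrs={"data-role": "row"})
team_names = [r.get_text(strip=True) for r in rows]

# limit=1 stops the search after the first match (still returns a list)
first_only = demo_soup.find_all("tr", attrs={"data-role": "row"}, limit=1)

print(team_names, len(first_only))
```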

---
## 3. `select` - use **CSS selectors** (power move)

**Signature:** `soup.select("css selector")` (many) and `soup.select_one("css selector")` (first)

- Readable and **very flexible**
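- Supports descendant selectors, classes, IDs, attribute selectors, and pseudo-classes such as `:has()` and `:-soup-contains()` (the older `:contains()` spelling is deprecated) **(support for pseudo-classes is limited in BS4)**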

**Example (Hockey Teams page):**

In [18]:
teams_url = "https://www.scrapethissite.com/pages/forms/?page_num=1"
teams_soup = get_soup(teams_url)

# Each team is a row with class 'team'
team_rows = teams_soup.select(".team")
len(team_rows)

25

In [19]:
# Extract a few columns with CSS selectors
def parse_team_row(row):
    return {
        "Team": row.select_one(".name").get_text(strip=True) if row.select_one(".name") else None,
        "Year": row.select_one(".year").get_text(strip=True) if row.select_one(".year") else None,
        "Wins": row.select_one(".wins").get_text(strip=True) if row.select_one(".wins") else None,
        "Losses": row.select_one(".losses").get_text(strip=True) if row.select_one(".losses") else None,
    }

[parse_team_row(r) for r in team_rows[:5]]

[{'Team': 'Boston Bruins', 'Year': '1990', 'Wins': '44', 'Losses': '24'},
 {'Team': 'Buffalo Sabres', 'Year': '1990', 'Wins': '31', 'Losses': '30'},
 {'Team': 'Calgary Flames', 'Year': '1990', 'Wins': '46', 'Losses': '26'},
 {'Team': 'Chicago Blackhawks', 'Year': '1990', 'Wins': '49', 'Losses': '23'},
 {'Team': 'Detroit Red Wings', 'Year': '1990', 'Wins': '34', 'Losses': '38'}]
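
`parse_team_row` calls `select_one` twice per column. One possible way to avoid the repetition is a small helper, sketched here on an invented row; `text_or_none`, `demo_row`, and the HTML are hypothetical names introduced for this example only.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, selector):
    """Return the stripped text of the first match for `selector`, or None."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None

# Hypothetical row; the "losses" cell is left out on purpose
html = """
<table>
  <tr class="team">
    <td class="name"> Boston Bruins </td>
    <td class="year">1990</td>
    <td class="wins">44</td>
  </tr>
</table>
"""
demo_row = BeautifulSoup(html, "html.parser").select_one("tr.team")

# Map output labels to the CSS selectors that locate each value
fields = {"Team": ".name", "Year": ".year", "Wins": ".wins", "Losses": ".losses"}
parsed = {label: text_or_none(demo_row, css) for label, css in fields.items()}

print(parsed)
```

The missing cell comes back as `None` instead of raising, which matches the behaviour of the conditional expressions in `parse_team_row`.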

**Selector mini-cheat sheet**

- `.class` — elements with class
- `#id` — element with ID
- `tag[attr="value"]` — attribute equals value (quotes optional for simple values)
- `a[href^="/pages/"]` — attribute **starts with** `/pages/`
- `div.country .country-name` — descendant selection
- `div.country > h3.country-name` — **direct child** only
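
The selectors in the cheat sheet can be tried on a small offline document. The HTML below is invented for the example (including the nested `div.info`), so the counts describe this snippet only, not the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one country block with an extra nested heading
html = """
<div id="content">
  <div class="country">
    <h3 class="country-name">Andorra</h3>
    <div class="info"><h3 class="country-name">nested</h3></div>
    <a href="/pages/simple/">Simple</a>
    <a href="https://example.com/">External</a>
  </div>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

selectors = [
    ".country-name",                   # by class
    "#content",                        # by ID
    'a[href^="/pages/"]',              # attribute starts with
    "div.country .country-name",       # any descendant
    "div.country > h3.country-name",   # direct child only
]

# Count how many elements each selector matches in this snippet
counts = {css: len(demo_soup.select(css)) for css in selectors}
for css, n in counts.items():
    print(f"{css!r}: {n}")
```

The last two selectors differ because the nested heading is a descendant of `div.country` but a direct child of `div.info`.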

---
## 4. `get_text` - extract **clean text**

**Signature:** `element.get_text(separator="", strip=False)`

- Joins the text of the element and all of its nested tags into a single string
- `strip=True` trims whitespace
- Use `separator=" "` to insert spaces between nodes if needed

In [23]:
# Get country block text and clean it
block = countries[0] if countries else None
block.get_text(" ", strip=True) if block else None

'Andorra Capital: Andorra la Vella Population: 84000 Area (km 2 ): 468.0'
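
To see what `separator` and `strip` each contribute, the sketch below calls `get_text` three ways on an invented fragment. The HTML and the name `demo_block` are assumptions for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with no whitespace between the <span> and the <b>
html = "<div><h3> Andorra </h3><span>Capital:</span><b>Andorra la Vella</b></div>"
demo_block = BeautifulSoup(html, "html.parser").div

raw = demo_block.get_text()                     # defaults: separator="", strip=False
stripped = demo_block.get_text(strip=True)      # each piece trimmed, then glued together
spaced = demo_block.get_text(" ", strip=True)   # each piece trimmed, joined with a space

print(repr(raw))
print(repr(stripped))
print(repr(spaced))
```

With `strip=True` alone the pieces run together, which is why a separator is usually passed along with it.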

---
## 5. Attributes - grab links, titles, etc.

Use `element.get('href')` (safe; returns `None` if missing) or `element['href']` (raises KeyError if missing).

**Example:**

In [26]:
# On the sandbox index page, collect the text and href of the first 10 links
index_url = "https://www.scrapethissite.com/pages/"
index_soup = get_soup(index_url)

links = [(a.get_text(strip=True), a.get('href')) for a in index_soup.select('a')[:10]]
links

[('Scrape This Site', '/'),
 ('Sandbox', '/pages/'),
 ('Lessons', '/lessons/'),
 ('FAQ', '/faq/'),
 ('Login', '/login/'),
 ('Countries of the World: A Simple Example', '/pages/simple/'),
 ('Hockey Teams: Forms, Searching and Pagination', '/pages/forms/'),
 ('Oscar Winning Films: AJAX and Javascript', '/pages/ajax-javascript/'),
 ('Turtles All the Way Down: Frames & iFrames', '/pages/frames/'),
 ("Advanced Topics: Real World Challenges You'll Encounter",
  '/pages/advanced/')]
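
The `href` values above are relative paths. The sketch below shows the difference between `.get('href')` and `['href']` on a link that has no `href`, and one common way to turn a relative path into an absolute URL with the standard library's `urljoin`. The inline HTML is invented for the example; the base URL is the page scraped above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://www.scrapethissite.com/pages/"

# Hypothetical snippet: one normal link and one anchor without an href
html = '<a href="/pages/simple/">Simple</a><a name="top">No link</a>'
demo_soup = BeautifulSoup(html, "html.parser")
with_href, without_href = demo_soup.find_all("a")

# .get() returns None for a missing attribute
safe_value = without_href.get("href")

# Indexing raises KeyError for a missing attribute
try:
    without_href["href"]
    raised = False
except KeyError:
    raised = True

# Resolve the relative path against the page it came from
absolute = urljoin(base_url, with_href["href"])

print(safe_value, raised, absolute)
```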

---
## 6. Pulling it together - small scraping routine

We'll scrape the **Simple** countries and build a structured list.

In [29]:
def scrape_simple_countries(page_url="https://www.scrapethissite.com/pages/simple/"):
    # Download and parse the page with the helper defined in section 1
    soup = get_soup(page_url)
    items = []
    # Each country sits in its own element with class "country"
    for block in soup.select(".country"):
        name = block.select_one(".country-name").get_text(strip=True)
        capital = block.select_one(".country-capital").get_text(strip=True)
        population = block.select_one(".country-population").get_text(strip=True)
        area_km2 = block.select_one(".country-area").get_text(strip=True)
        items.append({
            "name": name,
            "capital": capital,
            "population": population,
            "area_km2": area_km2,
        })
    return items

countries_data = scrape_simple_countries()
countries_data[:5]  # peek at the first few records (5 is an arbitrary choice)
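
Every field collected by the routine above is a string, including the numbers. The sketch below shows one possible way to convert population and area to numeric types. It runs on an invented block that reuses the class names from the routine, and `parse_country` and `demo_html` are hypothetical names. It assumes the values are plain digits, as in the Andorra output shown earlier; real pages may need extra cleaning.

```python
from bs4 import BeautifulSoup

# Hypothetical block reusing the class names from the scraping routine
demo_html = """
<div class="country">
  <h3 class="country-name"> Andorra </h3>
  <span class="country-capital">Andorra la Vella</span>
  <span class="country-population">84000</span>
  <span class="country-area">468.0</span>
</div>
"""

def parse_country(block):
    """Turn one country block into a dict with numeric population and area."""
    def text(css):
        return block.select_one(css).get_text(strip=True)

    return {
        "name": text(".country-name"),
        "capital": text(".country-capital"),
        "population": int(text(".country-population")),   # assumes digits only
        "area_km2": float(text(".country-area")),          # assumes a plain decimal
    }

demo_soup = BeautifulSoup(demo_html, "html.parser")
parsed = [parse_country(b) for b in demo_soup.select(".country")]
print(parsed)
```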

---
## 7. Pagination pattern (when pages = 1, 2, 3, …)

Some pages paginate via `?page_num=...`. You can loop until no results are returned.


In [None]:
def iterate_hockey_pages(base="https://www.scrapethissite.com/pages/forms/?page_num={}"):
    page = 1  # first page, as in the page_num=1 URL used earlier
    while True:
        soup = get_soup(base.format(page))
        rows = soup.select('.team')
        # An empty page means we have gone past the last page
        if not rows:
            break
        for r in rows:
            yield parse_team_row(r)
        page += 1

# Peek at the first 8 rows across pages (this may take a moment to run)
from itertools import islice
list(islice(iterate_hockey_pages(), 8))
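
The generator above needs network access, which makes it slow to experiment with. As a rough sketch of the same "loop until a page is empty" idea, the version below takes the page-fetching function as a parameter so it can be exercised against in-memory HTML. `FAKE_PAGES`, `fake_fetch`, `iterate_pages`, and the `max_pages` safety cap are all hypothetical names introduced here, not part of the notebook or of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical in-memory "site": page number -> HTML
FAKE_PAGES = {
    1: '<table><tr class="team"><td class="name">Boston Bruins</td></tr>'
       '<tr class="team"><td class="name">Buffalo Sabres</td></tr></table>',
    2: '<table><tr class="team"><td class="name">Calgary Flames</td></tr></table>',
}

def fake_fetch(page):
    """Stand-in for a network call; unknown pages return an empty table."""
    return BeautifulSoup(FAKE_PAGES.get(page, "<table></table>"), "html.parser")

def iterate_pages(fetch, *, start=1, max_pages=50):
    """Yield team names page by page until a page has no '.team' rows."""
    # The range acts as a cap, so a site that never returns an empty page
    # cannot keep the loop running forever
    for page in range(start, start + max_pages):
        rows = fetch(page).select(".team")
        if not rows:
            break
        for row in rows:
            yield row.select_one(".name").get_text(strip=True)

names = list(iterate_pages(fake_fetch))
print(names)
```

Passing a function that wraps `get_soup` in place of `fake_fetch` should give the same behaviour against the live site.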