# 🥣 Mastering Web Scraping with BeautifulSoup
---
## 1. Introduction
**BeautifulSoup** is a Python library for parsing HTML and XML documents. It builds a parse tree from a page's markup that you can search and navigate to extract data, which makes it a common choice for web scraping.

### Why BeautifulSoup?
* **Robustness**: It handles poorly formatted HTML (missing tags, bad nesting) without crashing.
* **Simplicity**: It provides Pythonic ways to search and navigate the DOM tree.
* **Compatibility**: It works with different parsers like `lxml` and `html5lib`.
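
To illustrate the robustness point, here is a minimal sketch that parses a fragment with unclosed tags. The `broken_html` string is made up for illustration, and the built-in `html.parser` is used so the snippet runs without extra parsers; `lxml` and `html5lib` may repair the same markup slightly differently.

```python
from bs4 import BeautifulSoup

# Hypothetical malformed fragment: neither <p> nor <b> is ever closed
broken_html = "<p>Unclosed <b>bold text"

broken_soup = BeautifulSoup(broken_html, "html.parser")

# The open tags are closed at the end of the document instead of raising an error
bold_text = broken_soup.b.get_text()
paragraph_text = broken_soup.p.get_text()

print(bold_text)
print(paragraph_text)
```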

## 2. Installation
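You need `beautifulsoup4` for parsing and `requests` to fetch web pages. The command below also installs `lxml`, the parser used in Section 3, and `pandas`, which Section 6 uses for the CSV export.

In [None]:
!pip install beautifulsoup4 requests lxml pandas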

## 3. Understanding the HTML Tree Structure
Every HTML page is a hierarchical tree of tags: `<html>` contains `<head>` and `<body>`, which in turn contain further nested elements.

To extract data, you need to navigate this tree. Let's start with a sample string.

In [None]:
from bs4 import BeautifulSoup

html_content = """
<html>
    <head><title>The Python Store</title></head>
    <body>
        <h1 id="main-title">Welcome to the Store</h1>
        <p class="description">We sell high-quality Python scripts.</p>
        <ul class="item-list">
            <li class="product" price="$10">Web Scraper Tool</li>
            <li class="product" price="$20">Data Cleaner</li>
            <li class="product" price="$50">Auto-Blogger</li>
        </ul>
        <a href="https://example.com/contact" id="contact-link">Contact Us</a>
    </body>
</html>
"""

# Create the 'soup' object
soup = BeautifulSoup(html_content, 'lxml')

print("Soup Object Created!")
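
Once a soup object exists, moving between parents, siblings, and children is a matter of attribute access. The sketch below uses a trimmed-down, hypothetical copy of the sample list (written without whitespace between tags so that no stray text nodes appear) and a separate variable name, `tree`, so it does not overwrite `soup`.

```python
from bs4 import BeautifulSoup

# Trimmed-down, hypothetical version of the sample document
snippet = (
    '<ul class="item-list">'
    '<li class="product">Web Scraper Tool</li>'
    '<li class="product">Data Cleaner</li>'
    '</ul>'
)
tree = BeautifulSoup(snippet, "html.parser")

first_item = tree.find("li")

# Move up: the tag that directly contains the first <li>
parent_name = first_item.parent.name

# Move sideways: the next <li> at the same level
next_item = first_item.find_next_sibling("li")

# Move down: count only the direct <li> children of <ul>
child_count = len(tree.ul.find_all("li", recursive=False))

print(parent_name, next_item.get_text(), child_count)
```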

## 4. Searching the Tree
### A. `find()` vs `find_all()`
* `find()`: Returns the first matching element, or `None` if nothing matches.
* `find_all()`: Returns a list-like `ResultSet` of all matching elements, which is empty if nothing matches.

In [None]:
# Find the main heading
title = soup.find('h1')
print(f"Title Tag: {title.text}")

# Find all product items
products = soup.find_all('li', class_='product')
for item in products:
    print(f"Product Name: {item.text} | Price: {item['price']}")
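
Because `find()` can come back empty, calling `.text` on its result fails with an `AttributeError` when the tag is missing. A minimal sketch of a guard, using a made-up one-line document and a hypothetical `"N/A"` fallback value:

```python
from bs4 import BeautifulSoup

page_soup = BeautifulSoup("<h1>Welcome to the Store</h1>", "html.parser")

# find() returns None when there is no match, so check before using the result
subtitle_tag = page_soup.find("h2")
subtitle = subtitle_tag.get_text(strip=True) if subtitle_tag is not None else "N/A"

# find_all() returns an empty result instead, so looping over it is always safe
missing_items = page_soup.find_all("li", class_="product")

print(subtitle, len(missing_items))
```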

### B. Selecting by ID and Attributes
You can target elements precisely using their unique IDs or specific attributes.

In [None]:
# Finding by ID
contact = soup.find(id="contact-link")
print(f"Link URL: {contact['href']}")

# Using CSS selectors (the same selector syntax used in CSS and JavaScript's querySelector)
desc = soup.select_one(".description")
print(f"Description: {desc.get_text()}")
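
CSS selectors can also combine tag names, classes, attributes, and nesting in one expression. The following sketch reuses part of the sample markup in a separate variable (`shop`, a name chosen for this example) and shows `select()`, an attribute selector, and `.get()` for reading an attribute.

```python
from bs4 import BeautifulSoup

# Part of the sample document, repeated here so the snippet runs on its own
shop_html = (
    '<ul class="item-list">'
    '<li class="product" price="$10">Web Scraper Tool</li>'
    '<li class="product" price="$20">Data Cleaner</li>'
    '</ul>'
    '<a href="https://example.com/contact" id="contact-link">Contact Us</a>'
)
shop = BeautifulSoup(shop_html, "html.parser")

# select() returns every element that matches the CSS selector
names = [li.get_text() for li in shop.select("ul.item-list > li.product")]

# Attribute selector: the <li> whose price attribute equals "$20"
twenty = shop.select_one('li[price="$20"]')

# ID selector, then .get() to read an attribute (returns None if it is absent)
link = shop.select_one("#contact-link").get("href")

print(names, twenty.get_text(), link)
```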

## 5. Real-World Project: Scraping a Live Website
We will scrape quotes from `quotes.toscrape.com`, a sandbox site built for scraping practice, and store them in a list of dictionaries.

In [None]:
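import requests

URL = "http://quotes.toscrape.com/"
# Fail fast on network problems and HTTP errors instead of parsing an error page
page = requests.get(URL, timeout=10)
page.raise_for_status()
soup = BeautifulSoup(page.content, "html.parser")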

# Locate all quote containers
quote_divs = soup.find_all('div', class_='quote')

scraped_data = []

for div in quote_divs:
    text = div.find('span', class_='text').text
    author = div.find('small', class_='author').text
    tags = [tag.text for tag in div.find_all('a', class_='tag')]
    
    scraped_data.append({
        'quote': text,
        'author': author,
        'tags': tags
    })

# Print the first result
print(scraped_data[0])
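
One possible refinement is to move the extraction loop into a function that takes an HTML string, so it can be tried on a saved snippet without hitting the network. The function name `parse_quotes` and the sample markup below are hypothetical; the sample only mimics the class names used above and is not copied from the live site.

```python
from bs4 import BeautifulSoup


def parse_quotes(html):
    """Return a list of quote dicts from HTML shaped like the quotes page."""
    quote_soup = BeautifulSoup(html, "html.parser")
    results = []
    for div in quote_soup.find_all("div", class_="quote"):
        text_tag = div.find("span", class_="text")
        author_tag = div.find("small", class_="author")
        if text_tag is None or author_tag is None:
            continue  # skip containers that lack the expected fields
        results.append({
            "quote": text_tag.get_text(strip=True),
            "author": author_tag.get_text(strip=True),
            "tags": [a.get_text(strip=True) for a in div.find_all("a", class_="tag")],
        })
    return results


# Made-up sample that mimics the class names used above
sample_html = """
<div class="quote">
  <span class="text">A sample quote.</span>
  <small class="author">Jane Doe</small>
  <a class="tag" href="#">example</a>
  <a class="tag" href="#">test</a>
</div>
<div class="quote"><span class="text">No author here.</span></div>
"""

parsed = parse_quotes(sample_html)
print(parsed)
```

With the live page, the same function could be called as `parse_quotes(page.content)`.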

## 6. Saving Data to CSV
Finally, export the scraped data to a CSV file with `pandas`, which must be installed separately from BeautifulSoup.

In [None]:
import pandas as pd

df = pd.DataFrame(scraped_data)
df.to_csv('quotes.csv', index=False)
print("Data saved to quotes.csv successfully!")
df.head()
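
Note that the `tags` column holds Python lists, which `to_csv` writes as their string representation (for example `['example', 'test']`). If a flatter file is preferred, one option is to join the tags before writing. The sketch below does this with the standard-library `csv` module as an alternative to `pandas`; the `rows` data and the `;` separator are assumptions for illustration, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Hypothetical rows in the same shape as scraped_data
rows = [
    {"quote": "A sample quote.", "author": "Jane Doe", "tags": ["example", "test"]},
]

# In-memory buffer; swap for open("quotes.csv", "w", newline="", encoding="utf-8")
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["quote", "author", "tags"])
writer.writeheader()
for row in rows:
    # Join the tag list so the cell holds "example;test" rather than a list repr
    writer.writerow({**row, "tags": ";".join(row["tags"])})

csv_text = buffer.getvalue()
print(csv_text)
```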

---
## 🎯 Summary Checklist
1. **Inspect**: Use Browser DevTools (F12) to find the tags/classes you need.
2. **Request**: Use `requests.get()` to fetch the HTML.
3. **Parse**: Use `BeautifulSoup(html, 'lxml')`, or the built-in `'html.parser'` as in Section 5.
4. **Extract**: Use `find()`, `find_all()`, or `select()`.
5. **Clean**: Use `.text` with `.strip()`, or `get_text(strip=True)`, to remove surrounding whitespace.
6. **Store**: Use `pandas` for easy CSV/Excel export.
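
As a small illustration of the cleaning step in the checklist, the sketch below compares `.text` with `get_text()` on a made-up paragraph that contains extra spaces. Passing a separator together with `strip=True` strips each text fragment and joins the non-empty ones.

```python
from bs4 import BeautifulSoup

# Made-up markup with stray spaces around and inside the tags
messy = BeautifulSoup("<p>  Hello <b> world </b> </p>", "html.parser")

# .text keeps every space exactly as it appears in the markup
raw = messy.p.text

# Strip each text fragment, drop empty ones, and join the rest with one space
cleaned = messy.p.get_text(" ", strip=True)

print(repr(raw))
print(repr(cleaned))
```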