```{contents}
```
## Web Scraping


**Web scraping** is the process of **programmatically extracting data from websites** by fetching HTML pages and parsing the required content.

In LLM / RAG systems, web scraping is used to:

* Collect external knowledge
* Build knowledge bases
* Keep documents up to date
* Ingest web pages for search and QA

---

### Where Web Scraping Fits in LLM Pipelines

```
Web Page
  ↓
HTTP Request
  ↓
HTML Content
  ↓
Parser / Scraper
  ↓
Clean Text
  ↓
Chunking → Embeddings → Vector DB → RAG
```

Scraping happens **before** chunking, embeddings, and retrieval.

---

### Types of Web Scraping

| Type               | Description                                            |
| ------------------ | ------------------------------------------------------ |
| Static scraping    | Content is already in the HTML the server returns      |
| Dynamic scraping   | Content is rendered by JavaScript after the page loads |
| API-based scraping | Data comes from the backend APIs the web app calls     |


---

### Basic Web Scraping with `requests` + `BeautifulSoup`

#### Install Dependencies

```bash
pip install requests beautifulsoup4
```

---

#### Fetch a Web Page

```python
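import requests

url = "https://example.com"

# A timeout keeps the call from hanging indefinitely on a slow server.
response = requests.get(url, timeout=10)
response.raise_for_status()  # raise an exception on 4xx/5xx responses

html = response.text
print(html[:500])  # preview the first 500 characters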
```

---

#### Parse HTML Content

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

text = soup.get_text()
print(text[:300])
```

This strips the HTML tags and returns the page's text. Menu, sidebar, and footer text is still included, so further cleaning is usually needed.
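By default `get_text()` joins text nodes with no separator, so adjacent elements can run together. A minimal sketch of the difference, using an inline HTML sample (hypothetical, so the snippet runs without a network call):

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; two <p> tags sit next to each other with no whitespace.
sample_html = """
<html><body>
  <h1>Pricing</h1>
  <p>Basic plan</p><p>Pro plan</p>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Default call: text nodes are concatenated as-is.
raw_text = sample_soup.get_text()

# One text node per line, with surrounding whitespace stripped
# and whitespace-only nodes dropped.
clean_text = sample_soup.get_text(separator="\n", strip=True)
lines = clean_text.split("\n")

print(repr(raw_text))
print(lines)
```

Passing a separator keeps element boundaries visible, which tends to matter later when the text is split into chunks.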

---

### Targeted Scraping (Specific Elements)

#### Extract Headings and Paragraphs

```python
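# Collect h1-h3 headings; strip() trims the whitespace around each element's text
headings = [h.get_text().strip() for h in soup.find_all(["h1", "h2", "h3"])]
paragraphs = [p.get_text().strip() for p in soup.find_all("p")]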

print("Headings:", headings)
print("Paragraphs:", paragraphs[:3])
```

---

#### Extract Links (Metadata)

```python
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links[:5])
```

Useful for:

* Crawling
* Source tracking
* Metadata enrichment
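Raw `href` values are often relative (`/docs`), carry fragments (`#intro`), or use non-HTTP schemes (`mailto:`). Before using them for crawling or source tracking, they usually need resolving against the page URL. A minimal sketch with the standard library; `normalize_links` and the sample hrefs are hypothetical:

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url; drop fragments, non-HTTP schemes, and duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        # urljoin resolves relative paths; urldefrag removes the "#..." part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, tel:, ...
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


sample_hrefs = [
    "/docs",
    "guide.html#intro",
    "https://other.example.org/a",
    "mailto:team@example.com",
    "/docs",  # duplicate
]
resolved = normalize_links("https://example.com/blog/post", sample_hrefs)
print(resolved)
```

The resolved list can then be filtered by domain if the crawl should stay on one site.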

---

### Web Scraping for RAG (Clean + Structured)

#### Convert Scraped Content into Documents

```python
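# In current LangChain releases Document lives in langchain_core
# (older releases exposed it as langchain.schema.Document).
from langchain_core.documents import Document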

documents = [
    Document(
        page_content=p.text,
        metadata={"source": url}
    )
    for p in soup.find_all("p")
]
```

These `Document` objects are now ready for:

* Chunking
* Embeddings
* Vector storage
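The comprehension above turns every `<p>` into a document, including empty paragraphs and short fragments. A minimal sketch of filtering those out and recording each paragraph's position; plain dicts stand in for `Document` so the snippet has no LangChain dependency, and `paragraphs_to_records` and the `min_chars` threshold are hypothetical choices:

```python
def paragraphs_to_records(paragraphs, source_url, min_chars=20):
    """Turn paragraph strings into document-like dicts, skipping near-empty ones."""
    records = []
    for index, raw in enumerate(paragraphs):
        text = " ".join(raw.split())  # collapse runs of whitespace and newlines
        if len(text) < min_chars:
            continue  # likely an empty paragraph or a menu label
        records.append(
            {
                "page_content": text,
                "metadata": {"source": source_url, "paragraph_index": index},
            }
        )
    return records


sample_paragraphs = [
    "   ",
    "Home",
    "Web scraping extracts   data\nfrom HTML pages.",
]
records = paragraphs_to_records(sample_paragraphs, "https://example.com")
print(records)
```

Each dict maps directly onto `Document(page_content=..., metadata=...)` if LangChain objects are needed downstream.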

---

### Using LangChain Web Loaders (Recommended)

#### WebBaseLoader (Simple & Clean)

```python
# Requires: pip install langchain-community
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://example.com")
documents = loader.load()

print(documents[0].page_content)
```

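The loader handles:

* Fetching the page (with `requests`)
* Stripping HTML tags (with BeautifulSoup)
* Attaching metadata such as the source URL and page title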

---

### Scraping Multiple Pages

#### Bulk Web Scraping

```python
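from langchain_community.document_loaders import WebBaseLoader

urls = [
    "https://example.com/page1",
    "https://example.com/page2"
]

# WebBaseLoader accepts a single URL or a list of URLs
loader = WebBaseLoader(urls)
documents = loader.load()  # one Document per URL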
```
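When pages are fetched by hand instead of through a loader, it helps to pause between requests and to keep going when one URL fails. A minimal sketch under those assumptions: `fetch_all` and `fake_fetch` are hypothetical names, and the fetch function is passed in so the example runs offline.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Fetch each URL in turn, pausing between requests and collecting failures."""
    pages, failures = {}, {}
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # simple fixed delay between requests
        try:
            pages[url] = fetch(url)
        except Exception as error:  # broad on purpose: this is only a sketch
            failures[url] = str(error)
    return pages, failures


def fake_fetch(url):
    """Stand-in for a real HTTP call; fails for one URL to show error handling."""
    if url.endswith("/missing"):
        raise ValueError("404 Not Found")
    return f"<html>{url}</html>"


pages, failures = fetch_all(
    ["https://example.com/page1", "https://example.com/missing"],
    fake_fetch,
    delay_seconds=0,  # no waiting in the demo
)
print(sorted(pages), failures)
```

With real pages, something like `lambda u: requests.get(u, timeout=10).text` could be passed as `fetch`.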

---

### Handling Dynamic Websites (JavaScript)

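Static scraping fails when content is rendered by JavaScript: `requests` downloads the initial HTML but never runs the page's scripts.

**Solution options**:

* Render the page in a headless browser (Playwright or Selenium)
* Call the hidden API that the page loads its data from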

Example (conceptual):

```python
from playwright.sync_api import sync_playwright
```

Use a headless browser only when required: it is far slower and more resource-hungry than a plain HTTP request.
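The hidden-API option avoids the browser entirely: if the page loads its data from a JSON endpoint, that response can be parsed directly. A minimal sketch, assuming a made-up payload shape (`items`, `title`, `body`, `url` are all hypothetical field names); a real endpoint has to be discovered in the browser's network tab and will look different:

```python
import json

# Hypothetical response body, shaped like what a page's background request might return.
payload = """
{
  "items": [
    {"title": "Release notes", "body": "Version 2 adds export.", "url": "/notes/2"},
    {"title": "FAQ", "body": "", "url": "/faq"}
  ]
}
"""


def records_from_payload(raw_json):
    """Convert the (assumed) JSON structure into text records with a source field."""
    data = json.loads(raw_json)
    records = []
    for item in data.get("items", []):
        body = item.get("body", "").strip()
        if not body:
            continue  # nothing worth embedding
        text = f"{item.get('title', '')}\n{body}".strip()
        records.append({"text": text, "source": item.get("url")})
    return records


api_records = records_from_payload(payload)
print(api_records)
```

Because the data arrives already structured, there is no HTML boilerplate to strip.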

---

### Best Practices (Important)

| Practice                          | Reason                                         |
| --------------------------------- | ---------------------------------------------- |
| Respect robots.txt                | Legal and ethical                              |
| Rate-limit requests               | Avoids overloading servers and getting blocked |
| Strip boilerplate (nav, footers)  | Better embeddings                              |
| Store source URLs                 | Traceability                                   |
| Avoid scraping pages behind login | Security risk                                  |
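The robots.txt check can be done with Python's standard `urllib.robotparser`. A minimal sketch: the rules are parsed from an inline string so it runs offline, and the user agent `my-rag-bot` and the sample rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

user_agent = "my-rag-bot"  # hypothetical crawler name
allowed = parser.can_fetch(user_agent, "https://example.com/docs/intro")
blocked = parser.can_fetch(user_agent, "https://example.com/private/report")
delay = parser.crawl_delay(user_agent)  # seconds requested by the site, or None

print(allowed, blocked, delay)
```

Against a live site, `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` would download the file instead of parsing a string.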

---

### Common Scraping Pitfalls

* Scraping JS-heavy pages with a static fetcher and getting empty or partial content
* Including navigation/footer noise
* Ignoring a website's terms of service
* Over-fetching pages
* Not normalizing text (whitespace, Unicode)
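Two of these pitfalls, boilerplate noise and unnormalized text, can be handled in one cleaning step. A minimal sketch, assuming the noise lives in the tags listed in `NOISE_TAGS` (a hypothetical list that will need tuning per site) and that NFKC normalization is acceptable for the content:

```python
import re
import unicodedata

from bs4 import BeautifulSoup

# Hypothetical list of tags treated as boilerplate.
NOISE_TAGS = ["script", "style", "nav", "header", "footer", "aside", "form"]


def clean_page_text(html):
    """Remove boilerplate tags, then normalize Unicode and whitespace."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(NOISE_TAGS):
        tag.extract()  # detach the element (and its children) from the tree
    text = soup.get_text(separator="\n", strip=True)
    text = unicodedata.normalize("NFKC", text)  # e.g. non-breaking space -> space
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


sample_page = """
<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<main><h1>Scraping   Guide</h1><p>Fetch&nbsp;pages   politely.</p></main>
<footer>Copyright Example Inc.</footer>
</body></html>
"""

cleaned = clean_page_text(sample_page)
print(cleaned)
```

NFKC also rewrites characters such as ligatures and superscripts, so a lighter normalization form may suit text where those distinctions matter.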

---

### Mental Model

```
Web Scraping =
Fetch → Parse → Clean → Structure → Ingest
```

If the cleaning step is poor, **RAG quality drops**, because boilerplate gets embedded and retrieved alongside the real content.

---

### Key Takeaways

* Web scraping is the foundation of web-based RAG
* Use `requests + BeautifulSoup` for static pages
* Use LangChain loaders for faster integration
* Convert scraped data into structured documents
* Always clean and preserve metadata

