# Web Scraping

## Introduction

### Definition

Web scraping is the process of automatically extracting data from websites.

### Why is it important in data science?

- Collect raw data for analysis (reviews, prices, news, research, etc.)
- Build datasets where APIs are not available
- Automate repetitive data collection

## How Web Scraping Works

- Send an HTTP request to a web server (using a tool such as `requests`).
- Receive the response (usually HTML or JSON).
- Parse the response to extract meaningful data (using `BeautifulSoup`, `lxml`, or regular expressions via `re`).
- Store the data in CSV, JSON, or a database.
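When the response is JSON rather than HTML, the parse step is just `json.loads`. The sketch below walks through steps 3 and 4 offline, using only the standard library. The `response_text` string and its `products`/`name`/`price` fields are made up for illustration; a real endpoint will have its own shape.

```python
import csv
import io
import json

# Stand-in for response.text from step 2 (hypothetical payload).
response_text = (
    '{"products": ['
    '{"name": "Widget", "price": 9.99, "sku": "W-1"}, '
    '{"name": "Gadget", "price": 24.5, "sku": "G-7"}]}'
)

# Step 3: parse the response and keep only the fields of interest.
payload = json.loads(response_text)
rows = [{"name": p["name"], "price": p["price"]} for p in payload["products"]]

# Step 4: store the rows as CSV. An in-memory buffer is used here;
# writing to a real file works the same way.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```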

## Key Python Libraries

- requests → Fetch HTML pages from websites.
- BeautifulSoup4 (bs4) → Parse and navigate HTML/XML.
- lxml → Fast HTML/XML parser.
- selenium → Automate browsers for dynamic (JavaScript) websites.
- pandas → Store and analyze structured data.
- re (regex) → Pattern matching in text.

## Basic Workflow

```python
import requests
from bs4 import BeautifulSoup

# Step 1: Request the page
url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses

# Step 2: Parse HTML (the "lxml" parser requires the lxml package)
soup = BeautifulSoup(response.text, "lxml")

# Step 3: Extract data
title = soup.title.text
links = [a["href"] for a in soup.find_all("a", href=True)]

print(title, links)
```

Output:

```
Example Domain ['https://www.iana.org/domains/example']
```


## Extracting Data

- Tags: `soup.find("h1")`, `soup.find_all("p")`
- Attributes: `element["src"]`, `element.get("href")`
- CSS selectors: `soup.select("div.article h2")`
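The three access styles above can be tried on a small inline document without touching the network. This is a minimal sketch: the HTML snippet is made up, and the built-in `"html.parser"` backend is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
html = """
<div class="article">
  <h1>Scraping 101</h1>
  <h2>Getting started</h2>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <a href="/next">Next</a>
  <img src="/img/cover.png">
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Tags: find() returns the first match, find_all() returns every match.
heading = soup.find("h1").get_text(strip=True)
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

# Attributes: element["src"] raises KeyError when the attribute is absent,
# while element.get("href") returns None instead.
image_src = soup.find("img")["src"]
link_href = soup.find("a").get("href")
link_title = soup.find("a").get("title")  # the <a> has no title attribute

# CSS selectors: select() returns a list of all matching elements.
subheadings = [h.get_text(strip=True) for h in soup.select("div.article h2")]

print(heading, paragraphs, image_src, link_href, link_title, subheadings)
```

Prefer `.get()` when an attribute may be missing, since a `KeyError` halfway through a long crawl is easy to avoid.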

## Handling Tables

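`pd.read_html` returns a list of DataFrames, one for each `<table>` it finds, so index into the list to pick a table. Recent pandas versions deprecate passing the HTML as a plain string, so wrap it in `io.StringIO`.

`import io`

`import pandas as pd`

`df = pd.read_html(io.StringIO(response.text))[0]`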

## Dynamic Websites

- Many sites use JavaScript to load data.
- Use:
    - selenium → Automates a browser to wait for content to load.
    - requests_html / playwright → For JavaScript rendering.
    - Or directly hit the API endpoint (preferred if available).
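Calling the API endpoint directly might look like the sketch below. Everything specific in it is an assumption: the `API_URL`, the `page` query parameter, and the `"items"` key are hypothetical, and the real values have to be read from the browser's network tab for the site in question. The function accepts any session-like object with a `get` method, such as a `requests.Session`.

```python
API_URL = "https://example.com/api/items"  # hypothetical endpoint


def fetch_items(session, url=API_URL, page=1):
    """Fetch one page of JSON and return its list of items.

    `session` is expected to behave like requests.Session. The "page"
    parameter and the "items" key are assumptions about the API.
    """
    response = session.get(url, params={"page": page}, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    payload = response.json()
    return payload.get("items", [])


# Usage (performs a real request, so it is left commented out):
# import requests
# with requests.Session() as session:
#     print(fetch_items(session, page=1))
```

Because the endpoint already returns structured data, no HTML parsing or browser automation is needed, which is why this route is preferred when it exists.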

## Storing Scraped Data

- CSV: `df.to_csv("output.csv", index=False)`
- JSON: `json.dump(data, f)`
- Database: `sqlite3`, `SQLAlchemy`, or MongoDB (via `pymongo`).
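As a rough illustration of the three options, the sketch below writes the same records to CSV, JSON, and SQLite using only the standard library. It works on plain dictionaries rather than a DataFrame, so the `csv` module stands in for `df.to_csv`. The sample `data`, the file names, and the `pages` table are made up.

```python
import csv
import json
import sqlite3
import tempfile
from pathlib import Path

# Made-up scraped records.
data = [
    {"title": "First post", "url": "https://example.com/1"},
    {"title": "Second post", "url": "https://example.com/2"},
]

out_dir = Path(tempfile.mkdtemp())  # temporary folder for this demo

# CSV: one row per record, with a header line.
with open(out_dir / "output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(data)

# JSON: the list of dictionaries is dumped as-is.
with open(out_dir / "output.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

# SQLite: named placeholders map directly onto the dictionary keys.
conn = sqlite3.connect(out_dir / "output.db")
conn.execute("CREATE TABLE IF NOT EXISTS pages (title TEXT, url TEXT)")
conn.executemany("INSERT INTO pages (title, url) VALUES (:title, :url)", data)
conn.commit()
row_count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
conn.close()

print(f"Stored {row_count} rows in {out_dir}")
```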

## Ethical & Legal Considerations

- Check robots.txt before scraping (`https://example.com/robots.txt`).
- Do not overload servers → use `time.sleep()` or rate limiting.
- Respect copyright and terms of service.
- Prefer official APIs when available.
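The first two points can be sketched with the standard library's `urllib.robotparser` and `time.sleep`. The rules are parsed from an inline string here so the example runs offline; against a real site one would call `parser.set_url(...)` followed by `parser.read()` instead. The `USER_AGENT` name, the sample rules, and the `polite_fetch` helper are hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-scraper"  # hypothetical bot name

# Sample rules; a real scraper would download the site's robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

urls = [
    "https://example.com/articles",
    "https://example.com/private/admin",
]

# Keep only the URLs that the rules allow this user agent to fetch.
allowed = [u for u in urls if parser.can_fetch(USER_AGENT, u)]

# Honor Crawl-delay when the site declares one, else fall back to 1 second.
delay = parser.crawl_delay(USER_AGENT) or 1


def polite_fetch(urls, fetch, delay_seconds, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # rate limit: wait before every request but the first
        results.append(fetch(url))
    return results


print(allowed, delay)
```

In real use, `fetch` would be something like `lambda u: requests.get(u, timeout=10)`. Passing `sleep` as a parameter only makes the helper easy to exercise without actually waiting.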

## Challenges in Web Scraping

- Dynamic content (needs JavaScript rendering).
- Captcha / Anti-bot systems.
- Frequent site changes (your scraper may break).
- IP blocking (solution: proxies, rotating headers).
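For the site-change problem, one modest safeguard is to extract defensively, so that a missing element is recorded rather than crashing the run. The sketch below assumes a hypothetical `extract_text` helper and made-up selectors; it is one possible approach, not a complete answer to layout changes.

```python
from bs4 import BeautifulSoup


def extract_text(soup, selector, default=None):
    """Return the stripped text of the first match, or `default` if nothing matches."""
    element = soup.select_one(selector)
    return element.get_text(strip=True) if element is not None else default


# Made-up page in which the price element has disappeared after a redesign.
html = "<html><body><h1 class='headline'>Price update</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

record = {
    "headline": extract_text(soup, "h1.headline"),
    "price": extract_text(soup, "span.price"),  # selector no longer matches
}

# Fields that came back empty hint that the page layout has changed.
missing = [field for field, value in record.items() if value is None]

print(record, missing)
```

Logging the `missing` list on each run makes a broken selector visible early, instead of surfacing later as an `AttributeError` or as silently empty data.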

## ✅ Summary

Web scraping is a core data collection skill in data science. The workflow involves **requesting → parsing → extracting → storing** data. Use `requests` with `BeautifulSoup` for static sites, and `selenium` or direct API calls for dynamic sites. Always check `robots.txt`, respect each site's terms of service, and stay within applicable law.