# Introduction to Web Crawling

- **Web crawling**: Navigating and collecting data from across multiple web pages  
- **Web scraping**: Extracting specific information from a single page  
- Depending on how the target page delivers its content, crawling is either **static** (the data is already in the HTML the server returns) or **dynamic** (the data is rendered in the browser by JavaScript)
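
The difference between the two terms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory `SITE` dictionary, its URLs, and the helper names `scrape_title` and `crawl` are hypothetical, and the standard library's `html.parser` stands in for a real parsing package so that no network access or installation is needed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical three-page site held in memory instead of fetched over HTTP.
SITE = {
    "https://example.com/": '<title>Home</title><a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<title>Page A</title><a href="/b">B</a>',
    "https://example.com/b": '<title>Page B</title><a href="/">Home</a>',
}


class PageParser(HTMLParser):
    """Collects the <title> text and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse(html):
    parser = PageParser()
    parser.feed(html)
    return parser


def scrape_title(url):
    """Scraping: extract one specific field from a single page."""
    return parse(SITE[url]).title


def crawl(start_url):
    """Crawling: follow links breadth-first and scrape every page reached."""
    seen, queue, titles = {start_url}, deque([start_url]), {}
    while queue:
        url = queue.popleft()
        page = parse(SITE[url])
        titles[url] = page.title
        for href in page.links:
            target = urljoin(url, href)  # resolve relative links against the page URL
            if target in SITE and target not in seen:
                seen.add(target)
                queue.append(target)
    return titles
```

Here `scrape_title` touches exactly one page, while `crawl` keeps a queue of discovered links and a set of visited URLs so that each page is processed once.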

# Web Crawling with R – Summary

## Static Web Crawling (`rvest`)

- Use the `rvest` package to extract content directly from HTML pages  
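- `read_html()` downloads and parses a page, `html_elements()` (called `html_nodes()` before rvest 1.0.0) selects elements by CSS selector or XPath, and `html_text()` / `html_attr()` extract their text or attributes such as link URLs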
- Tables built with `<table>` tags can be converted into data frames using `html_table()`  
- Use `dplyr` for cleaning the data and `write.csv()` to save it

## API-Based Data Collection (`httr`, `jsonlite`)

- APIs provide structured data (JSON/XML) directly from the server  
- Use `httr::GET()` to request data and `jsonlite::fromJSON()` to parse the JSON response  
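- API-based collection is usually faster and more stable than HTML scraping, because the response format is defined by the API and does not change when the page layout does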
- You can extract fields like article titles, links, and dates, and save them as CSV

## Dynamic Crawling with JavaScript (`RSelenium`)

- Content rendered by JavaScript is missing from the raw HTML that `read_html()` receives, so `rvest` alone usually cannot extract it
- Use `RSelenium` to automate a browser and interact with JavaScript-based pages  
- Simulate user actions like clicking buttons, scrolling down, or handling infinite scroll  
- Requires a browser and a matching web driver; setup varies with the operating system and browser version

## Handling Login and Sessions (`httr::POST()`)

- Some websites require login before data can be accessed  
- Simulate login by submitting the form fields with `httr::POST()`, then reuse the session cookies from the response in later requests
- Note: Many modern websites use CAPTCHA or 2FA, so automatic login may be blocked  
- When possible, prefer official APIs with proper authentication

## Automation and Real-World Projects

- To run crawlers regularly, you need automation tools  
- **Linux/macOS**: use `cronR` (a wrapper around cron); **Windows**: use `taskscheduleR`
- Example projects:
  - Daily crawling of trending news, saved to a cumulative CSV
  - Monitoring product price changes or trending keywords  
- When automating, also consider logging, date-tracking, and duplicate removal

## Key Takeaways

- **Static scraping**: `rvest`, `html_table()`, `dplyr`
- **API-based**: `httr::GET()`, `jsonlite::fromJSON()`  
- **Dynamic pages**: `RSelenium` for JavaScript interactions  
- **Login/session handling**: `httr::POST()`  
- **Automation**: `cronR` (Linux/macOS), `taskscheduleR` (Windows)
- All techniques come together in real-world data collection projects

# Web Crawling with Python – Summary

## Static Web Crawling (`requests`, `BeautifulSoup`)

- Use `requests` to fetch HTML pages and `BeautifulSoup` to parse them  
- Extract text, links, or tables based on HTML tags and class structures  
- Use `pandas.read_html()` for tables, or manually parse with `BeautifulSoup`  
- Clean data using `pandas` and save with `.to_csv()`
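
The steps above can be sketched as follows. This is a minimal example, assuming the `beautifulsoup4` package is installed; `SAMPLE_HTML`, the `headline` class, and the helper names `extract` and `rows_to_csv` are hypothetical. An inline HTML string replaces the `requests.get(url).text` call so the snippet runs without network access, and the standard `csv` module is used for output instead of `pandas`.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical page; in practice this would be requests.get(url, timeout=10).text
SAMPLE_HTML = """
<html><body>
  <h1 class="headline">Daily Prices</h1>
  <a href="/items/1">Item one</a>
  <a href="/items/2">Item two</a>
  <table>
    <tr><th>Product</th><th>Price</th></tr>
    <tr><td>Pen</td><td>1.50</td></tr>
    <tr><td>Notebook</td><td>3.20</td></tr>
  </table>
</body></html>
"""


def extract(html):
    """Return the headline text, all link targets, and the table rows."""
    soup = BeautifulSoup(html, "html.parser")
    # Select by tag and class, then strip surrounding whitespace.
    headline = soup.find("h1", class_="headline").get_text(strip=True)
    # Keep only <a> tags that actually carry an href attribute.
    links = [a["href"] for a in soup.find_all("a", href=True)]
    # Manual table parsing: one list of cell texts per <tr>.
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in soup.find_all("tr")
    ]
    return headline, links, rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text (header row first)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


headline, links, rows = extract(SAMPLE_HTML)
```

With `pandas` available, `pandas.DataFrame(rows[1:], columns=rows[0]).to_csv("prices.csv", index=False)` would be one way to produce the same file from the parsed rows.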

## API-Based Data Collection (`requests`, `json`, `pandas`)

- REST APIs provide structured data (JSON/XML) directly  
- Use `requests.get()` to fetch the response and `response.json()` (or `json.loads(response.text)`) to parse it
- Convert JSON to `pandas.DataFrame` for analysis and export  
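- Usually faster and more stable than HTML scraping, since no HTML parsing is involved and the response format does not change with the site's layout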
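
A rough sketch of the parsing step is shown below. The response body, its field names (`articles`, `title`, `url`, `published`), and the helper `parse_articles` are assumptions made for illustration, not the format of any particular API; the body is hard-coded so the example runs offline.

```python
import json

# Hypothetical API response body; a real one would be response.text
# after response = requests.get(api_url, params=..., timeout=10).
SAMPLE_BODY = """
{
  "status": "ok",
  "articles": [
    {"title": "First story", "url": "https://example.com/1", "published": "2024-01-02"},
    {"title": "Second story", "url": "https://example.com/2", "published": "2024-01-03"},
    {"title": "No date yet", "url": "https://example.com/3"}
  ]
}
"""

FIELDS = ("title", "url", "published")


def parse_articles(body):
    """Return one flat dict per article, using None for missing fields."""
    payload = json.loads(body)
    return [
        {field: item.get(field) for field in FIELDS}
        for item in payload.get("articles", [])
    ]


records = parse_articles(SAMPLE_BODY)
```

Against a live endpoint, `response.json()` would replace the `json.loads(body)` call, and `pandas.DataFrame(records)` would turn the list of dictionaries into a table ready for `.to_csv()`. Using `.get()` for every field keeps one incomplete record from aborting the whole run.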

## JavaScript-Based Dynamic Crawling (`Selenium`)

- JavaScript-rendered content can't be scraped with `requests` and `BeautifulSoup` alone, because `requests` returns the HTML before any script has run
- Use `Selenium` to simulate browser actions (clicks, scrolling, etc.)  
- Extract the final content from `driver.page_source` once the page has finished rendering (an explicit wait such as `WebDriverWait` helps here)
- Ideal for infinite scrolls, pop-ups, and dynamic interactions
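
One possible way to handle infinite scroll is sketched below. It assumes Selenium 4 with a local Chrome installation; the helper names `scroll_until_stable` and `fetch_rendered_html`, the pause length, and the round limit are illustrative choices. The scrolling logic is kept separate from the browser so it can be checked without launching one.

```python
import time


def scroll_until_stable(get_height, scroll_down, max_rounds=20, pause=0.0):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls that loaded new content.
    """
    loaded = 0
    last_height = get_height()
    for _ in range(max_rounds):
        scroll_down()
        time.sleep(pause)  # give the page time to load more items
        new_height = get_height()
        if new_height == last_height:
            break  # nothing new appeared, so the end was reached
        last_height = new_height
        loaded += 1
    return loaded


def fetch_rendered_html(url, pause=1.0):
    """Open url in headless Chrome, exhaust infinite scroll, return the HTML."""
    # Imported lazily: this part needs the selenium package and Chrome.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumes a recent Chrome version
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        scroll_until_stable(
            get_height=lambda: driver.execute_script(
                "return document.body.scrollHeight"),
            scroll_down=lambda: driver.execute_script(
                "window.scrollTo(0, document.body.scrollHeight);"),
            pause=pause,
        )
        return driver.page_source  # HTML after JavaScript has run
    finally:
        driver.quit()  # always release the browser process
```

The returned HTML can then be handed to `BeautifulSoup` exactly as in the static case. Comparing page heights is a simple stopping rule; pages that load content slowly may need a longer `pause` or an explicit wait on a specific element.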

## Login and Session Management (`requests.Session()` or `Selenium`)

- Use `requests.Session()` to persist cookies, and with them the logged-in state, across requests
- For complex interactions, automate login via `Selenium`  
- Beware of CAPTCHA or multi-factor authentication, which can block automated login; prefer APIs if available
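
The session-based approach might look roughly like this. The URLs, the form field names (`username`, `password`), the `User-Agent` string, and the helper names are all hypothetical; a real site defines its own login form, which has to be inspected first. The example assumes the `requests` package is installed.

```python
import requests

LOGIN_URL = "https://example.com/login"        # hypothetical endpoint
DATA_URL = "https://example.com/account/data"  # hypothetical endpoint


def build_session():
    """Create a session whose headers and cookies persist across requests."""
    session = requests.Session()
    session.headers.update({"User-Agent": "example-crawler/0.1"})
    return session


def login(session, username, password):
    """POST the credentials; the session stores any cookies the server sets."""
    response = session.post(
        LOGIN_URL,
        data={"username": username, "password": password},  # assumed field names
        timeout=10,
    )
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response


def preview_request(session, url):
    """Build, but do not send, a GET request to inspect what would go out."""
    return session.prepare_request(requests.Request("GET", url))
```

After a successful `login(session, ...)`, later calls such as `session.get(DATA_URL, timeout=10)` send the stored cookies automatically. `preview_request` is only a debugging aid for checking which cookies and headers the session would attach.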

## Automation and Real-World Projects

- Use schedulers or job runners to automate crawling  
- Tools: `schedule`, `APScheduler`, `cron` (Linux), Task Scheduler (Windows)  
- Example use cases:
  - Daily news scraping into CSV  
  - Monitoring product price trends or real-time keywords  
- Include logging, duplicate checks, and date-stamping for reliability
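
The logging, duplicate-check, and date-stamping ideas can be combined in a small helper like the one below. It is a sketch using only the standard library; the column names, the choice of the URL as the duplicate key, and the function name `append_new_rows` are assumptions for illustration.

```python
import csv
import logging
from datetime import date
from pathlib import Path

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("crawler")

FIELDS = ["collected_on", "title", "url"]


def append_new_rows(csv_path, items, today=None):
    """Append items whose URL is not yet in the CSV; return how many were added."""
    path = Path(csv_path)
    stamp = (today or date.today()).isoformat()  # date stamp for this run

    # Load the URLs already stored so duplicates can be skipped.
    seen = set()
    if path.exists():
        with path.open(newline="", encoding="utf-8") as fh:
            seen = {row["url"] for row in csv.DictReader(fh)}

    write_header = not path.exists()  # checked before append mode creates the file
    added = 0
    with path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        for item in items:
            if item["url"] in seen:
                continue  # already collected on an earlier run
            writer.writerow({"collected_on": stamp,
                             "title": item["title"],
                             "url": item["url"]})
            seen.add(item["url"])
            added += 1

    log.info("added %d of %d items to %s", added, len(items), path)
    return added
```

A scheduler then only needs to call the scraping step and pass its results to `append_new_rows`. With the `schedule` package this could be `schedule.every().day.at("09:00").do(job)` followed by a loop calling `schedule.run_pending()`; with cron, an entry such as `0 9 * * * python crawl.py` (script name hypothetical) would run it daily.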

## Key Takeaways

- **Static scraping**: `requests`, `BeautifulSoup`, `pandas`  
- **API collection**: `requests`, `json`, `pandas`  
- **Dynamic scraping**: `Selenium` for full browser interaction  
- **Login/session**: `requests.Session()` or automated login via `Selenium`  
- **Automation**: `schedule`, `APScheduler`, `cron`, Task Scheduler  
- Together, these tools can power scalable, real-world web crawling workflows

# Web Crawling: R vs Python

| Feature              | R                                                | Python                                                 |
|----------------------|--------------------------------------------------|---------------------------------------------------------|
| **Static Crawling**  | `rvest`, `html_table()`, `dplyr`                 | `requests`, `BeautifulSoup`, `pandas`                  |
| **Table Extraction** | `html_table()` + `dplyr` for cleaning            | `pandas.read_html()` or manual parsing with BeautifulSoup |
| **API Requests**     | `httr::GET()` + `jsonlite::fromJSON()`          | `requests.get()` + `response.json()` or `json.loads()` |
| **Dynamic Crawling** | `RSelenium`                                      | `Selenium`                                              |
| **Login Handling**   | `httr::POST()` + session cookies                 | `requests.Session()` or login via `Selenium`           |
| **Automation Tools** | `cronR` (Linux/macOS), `taskscheduleR` (Windows) | `cron`, `schedule`, `APScheduler`, Windows Task Scheduler |
| **Strengths**        | Strong for statistics and quick data cleaning    | Broad ecosystem, powerful libraries, large community   |
| **Weaknesses**       | Setup can be tricky; less real-time flexibility  | Steeper learning curve when combining with web tech    |

## Summary

- **R** is great for statistical workflows and lightweight crawling combined with data wrangling.
- **Python** excels in scalable automation, complex website handling, and API-based data pipelines.