# Introduction to Web Scraping

Across 10 notebooks, I will introduce you to web scraping using Python. We will start with the basics and gradually move to more advanced topics. I assume that you have a basic understanding of Python and HTML, but don't worry if you don't: I will explain everything in detail.

## What is web scraping?

Web scraping is a technique for extracting data from websites. It transforms unstructured data on the web into structured data that can be stored on your local computer or in a database.

<div style="text-align:center">
    <img src="https://topwebscrapingservice.files.wordpress.com/2016/06/custom-web-scraping-624x301.png" width=600>
</div>

A typical scraping workflow has four steps:

1. **Request.** Send an HTTP request to the target website's server.

2. **Response.** The server processes your request and responds with the website's HTML content.

3. **Parsing.** Analyze the HTML content to locate and extract the necessary information using a parser, a tool designed to understand HTML structure.

4. **Output.** Save the extracted data into a structured format, such as a spreadsheet or database, for further analysis or visualization.
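
The four steps above can be sketched with nothing but the Python standard library. This is a minimal illustration, not a recommended production setup: the names `fetch_html`, `ProductParser`, `to_csv`, and `SAMPLE_HTML`, as well as the HTML structure (`li.product`, `span.name`, `span.price`), are assumptions made up for this example. To keep the cell runnable offline, the parser is fed a sample string instead of a live response.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url: str) -> str:
    """Steps 1 and 2: send an HTTP GET request and return the response body as text."""
    request = Request(url, headers={"User-Agent": "intro-scraper-example"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


class ProductParser(HTMLParser):
    """Step 3: collect (name, price) pairs from <li class="product"> elements."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # "name" or "price" while inside a matching <span>
        self._current = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self._current = {}
        elif tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            self.rows.append((self._current.get("name"), self._current.get("price")))
            self._current = {}


def to_csv(rows) -> str:
    """Step 4: write the extracted rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "price"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical page content standing in for a real server response.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Keyboard</span> <span class="price">49.90</span></li>
  <li class="product"><span class="name">Mouse</span> <span class="price">19.50</span></li>
</ul>
"""

parser = ProductParser()
parser.feed(SAMPLE_HTML)  # with a live site: parser.feed(fetch_html(url))
csv_text = to_csv(parser.rows)
print(csv_text)
```

In practice you would usually swap the hand-written parser for a dedicated library, but the shape of the workflow stays the same.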

## Why web scraping?

Imagine you are a data scientist and you want to analyze the prices of products on an e-commerce website. You can manually copy and paste the prices of products into a spreadsheet, but this is not feasible for large websites with thousands of products. This is where web scraping comes in. It allows you to automate the process of extracting data from websites.

## Is it legal?

Web scraping is a legal gray area. No single law governs it, and the rules vary by jurisdiction. Courts have ruled on web scraping cases over the years, but the rulings are not consistent. Here are some legal considerations to keep in mind:

- **Terms of Service (ToS).** Many websites explicitly forbid web scraping in their ToS. Ignoring this can lead to a lawsuit.

- **robots.txt.** Respect a website's `robots.txt` file, which tells automated clients which parts of the site they may and may not access.

- **Copyright Law.** Ensure you're legally allowed to scrape and reuse the website's content, particularly if it's copyrighted.

- **Data Protection Regulations.** Be mindful of regulations like the GDPR when handling personal or sensitive data.
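
As a small illustration of the `robots.txt` point, Python's standard library ships `urllib.robotparser`, which can answer whether a given user agent may fetch a given URL. The sketch below parses a made-up `robots.txt` from a string so that it runs offline; the rules, the `example.com` URLs, and the user agent name `intro-scraper-example` are all assumptions for this example. Passing this check says nothing about the ToS, copyright, or data protection points above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. With a live site you would instead call
# robots.set_url("https://example.com/robots.txt") followed by robots.read().
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
Allow: /
"""

robots = RobotFileParser()
robots.parse(SAMPLE_ROBOTS_TXT.splitlines())

user_agent = "intro-scraper-example"

# Ask whether this user agent may fetch each URL under the rules above.
allowed_public = robots.can_fetch(user_agent, "https://example.com/products")
allowed_private = robots.can_fetch(user_agent, "https://example.com/private/report.html")

# Seconds the site asks crawlers to wait between requests (None if unspecified).
delay = robots.crawl_delay(user_agent)

print(allowed_public, allowed_private, delay)
```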

## What tools can I use?

Web scraping tools fall roughly into two categories: visual tools and programming tools.

### Visual Tools

Visual tools are designed for non-programmers who want to scrape websites without writing code. These tools allow you to create scraping agents by pointing and clicking. <span style="color: darkorange">These tools, however, are greatly limited when using the free version.</span> Nonetheless, they can be useful for small-scale web scraping projects.

- **[Octoparse](https://www.octoparse.com/)** can help you with automatic pagination, extracting data behind a login, and ad blocking.

<div style="text-align:center">
    <img src="https://rpa.octoparse.com/static/home/swiper/1.png" width=600>
</div>

- **[ParseHub](https://www.parsehub.com/)** supports scheduling, automatic IP rotation, and handling infinite scrolling.

<div style="text-align:center">
    <img src="https://www.kdnuggets.com/wp-content/uploads/parsehub_img1.jpg" width=600>
</div>

- **[WebHarvy](https://www.webharvy.com/)** offers intelligent pattern detection, privacy protection, and category scraping.

<div style="text-align:center">
    <img src="https://www.webharvy.com/images/tour/8.png" width=600>
</div>

### Programming Tools

Programming tools are designed for programmers who want to scrape websites using code. These tools allow you to write scripts to extract data from websites. You can try these tools if you are comfortable with programming:

- **[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)** can pull data out of HTML and XML files. It provides methods for navigating, searching, and modifying the parse tree. It is a popular choice for web scraping because of its simplicity and ease of use.

- **[Selenium](https://www.selenium.dev/)** is a browser automation tool that lets you control a real web browser from code. It is a popular choice for web scraping because it can handle JavaScript-heavy websites. I usually use Selenium for web scraping projects that require interaction with web pages.

- **[Scrapy](https://scrapy.org/)** is an open-source web crawling and web scraping framework for Python. It provides a set of tools for extracting data from websites. It is a popular choice for web scraping because of its speed and scalability. I usually use Scrapy for large-scale web scraping projects.

## Ready to start?

With an understanding of what web scraping entails, why it's valuable, the legalities to keep in mind, and the tools at your disposal, you're now equipped to dive into the world of web scraping. Experiment with the different tools discussed and explore the vast possibilities that web scraping opens up for data collection and analysis.