```python
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
```

In this [post](https://www.scrapingbee.com/blog/web-scraping-101-with-python/), which can be read as a follow-up to our guide about [web scraping without getting blocked](https://www.scrapingbee.com/blog/web-scraping-without-getting-blocked/), we will cover almost all of the tools Python offers to scrape the web. We will move from the basic tools to the advanced ones, covering the pros and cons of each. Of course, we won't be able to cover every aspect of every tool we discuss, but this post should give you a good idea of what each tool does and when to use it.

*Note: When I talk about Python in this blog post, you should assume that I mean Python 3.*

## 0. Web Fundamentals

The Internet is **complex**: many underlying technologies and concepts are involved in displaying even a simple web page in your browser. The goal of this article is not to go into excruciating detail on every single one of those aspects, but to provide you with the most important parts for extracting data from the web with Python.

### HyperText Transfer Protocol

HyperText Transfer Protocol (HTTP) uses a **client/server** model. An HTTP client (a browser, your Python program, cURL, libraries such as Requests...) opens a connection and sends a message (“I want to see that page: /product”) to an HTTP server (Nginx, Apache...). The server then answers with a response (the HTML code, for example) and closes the connection.

HTTP is called a ***stateless protocol*** because each transaction (request/response) is independent. FTP, for example, is stateful because it maintains the connection.

Basically, when you type a website address in your browser, the HTTP request looks like this:
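As a stand-in for a captured request, here is a minimal sketch that assembles one by hand in Python. The host `example.com` and the header values are assumptions chosen purely for illustration; a real browser sends many more headers.

```python
# Illustrative only: the host and header values are assumed, not captured traffic.
request_lines = [
    "GET /product HTTP/1.1",   # request line: method, path, protocol version
    "Host: example.com",
    "Accept: text/html",
    "Connection: keep-alive",
    "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

# Lines are separated by CRLF; an empty line marks the end of the headers.
raw_request = "\r\n".join(request_lines) + "\r\n\r\n"
print(raw_request)
```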

The first line of this request contains the following:

- The **HTTP method** or verb. In our case `GET`, indicating that we would like to fetch data. There are quite a few other HTTP methods available as well (e.g. for uploading data), and a full list is available [here](https://www.w3schools.com/tags/ref_httpmethods.asp).
- The **path of the file, directory, or object** we would like to interact with. In this case, the directory `product` directly beneath the root directory.
- The **version of the HTTP protocol**. In this tutorial we will focus on HTTP/1.

The first line is followed by multiple **header fields**: Connection, User-Agent... Here is an exhaustive list of [HTTP headers](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers).
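To make these parts concrete, the following sketch splits a raw request into its method, path, version, and header fields. The helper name `parse_request` and the sample request (host `example.com`) are hypothetical, introduced only for this example.

```python
def parse_request(raw: str):
    """Split a raw HTTP request into (method, path, version, headers)."""
    head = raw.split("\r\n\r\n", 1)[0]            # keep everything before the blank line
    request_line, *header_lines = head.split("\r\n")
    method, path, version = request_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")      # split at the first colon only
        headers[name.strip()] = value.strip()
    return method, path, version, headers


# Hypothetical request; example.com and the header values are assumptions.
raw = "GET /product HTTP/1.1\r\nHost: example.com\r\nConnection: keep-alive\r\n\r\n"
method, path, version, headers = parse_request(raw)
print(method, path, version)
print(headers)
```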

Here are the most important header fields:

- **Host:** This header indicates the hostname for which you are sending the request. This header is particularly important for name-based [virtual hosting](https://en.wikipedia.org/wiki/Virtual_hosting#Name-based), which is the standard in today's hosting world.
- **User-Agent:** This contains information about the client originating the request, including the OS. In this case, it is my web browser (Chrome) on macOS. This header is important because it is either used for statistics (how many users visit my website on mobile vs desktop) or to prevent violations by bots. Because these headers are sent by the clients, they can be modified (*“Header Spoofing”*). This is exactly what we will do with our scrapers - **make our scrapers look like a regular web browser**.
- **Accept:** This is a list of [MIME types](https://en.wikipedia.org/wiki/Media_type) that the client will accept as a response from the server. There are lots of different content types and sub-types: **text/plain, text/html, image/jpeg, application/json** ...
- **Cookie:** This header field contains a list of name-value pairs (name1=value1;name2=value2). Cookies are one way websites can store data on your machine, either up to a certain expiration date (standard cookies) or only temporarily until you close your browser (session cookies). Cookies are used for a number of different purposes, ranging from authentication information, to user preferences, to more nefarious things such as user tracking with personalised, unique user identifiers. However, they are a **vital browser feature** for the authentication just mentioned. When you submit a login form, the server will verify your credentials and, if you provided a valid login, issue a session cookie, which clearly identifies the user session for your particular user account. Your browser will receive that cookie and will pass it along with all subsequent requests.
- **Referer**: The referrer header (please note [the typo](https://en.wikipedia.org/wiki/HTTP_referer#Etymology)) contains the URL from which the actual URL has been requested. This header is important because websites use this header to change their behavior based on where the user came from. For example, lots of news websites have a paying subscription and let you view only 10% of a post, but if the user comes from a news aggregator like Reddit, they let you view the full content. They use the referrer to check this. Sometimes we will have to spoof this header to get to the content we want to extract.
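The header fields above can be set from Python as well. The sketch below uses only the standard library (`urllib.request.Request` and `http.cookies.SimpleCookie`); the URL, User-Agent string, Referer, and cookie values are all assumptions picked for illustration, and nothing is sent over the network.

```python
from http.cookies import SimpleCookie
from urllib.request import Request

# All values are assumptions for illustration (URL, user agent string, cookie).
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Accept": "text/html",
    "Referer": "https://www.reddit.com/",
    "Cookie": "name1=value1; name2=value2",
}

# Building the Request object does not send anything over the network.
request = Request("https://example.com/product", headers=headers)
print(request.header_items())

# The Cookie header is a list of name-value pairs.
cookies = SimpleCookie()
cookies.load(headers["Cookie"])
print({name: morsel.value for name, morsel in cookies.items()})
```

Because the client decides what goes into these fields, a scraper can present the same headers a regular browser would.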

And the list goes on... You can find the full header list [here](https://en.wikipedia.org/wiki/List_of_HTTP_header_fields).

A server will respond with something like this:
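For illustration, here is a minimal sketch of such a response, written as a Python string and split into its parts. The server name, headers, and body are assumptions, not output captured from a real server.

```python
# Hypothetical response; the server name, headers and body are assumptions.
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Server: nginx\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "\r\n"
    "<!DOCTYPE html><html><body><p>Price : 19.99$</p></body></html>"
)

# The blank line separates the status line and headers from the body.
head, _, body = raw_response.partition("\r\n\r\n")
status_line, *header_lines = head.split("\r\n")
version, status_code, reason = status_line.split(" ", 2)

print(version, status_code, reason)
print(header_lines)
print(body)
```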

On the first line, we have a new piece of information, the HTTP code `200 OK`. A code of 200 means the request was properly handled. You can find a full list of all available codes on [Wikipedia](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes). Following the status line, you have the response headers, which serve the same purpose as the request headers we just discussed. After the response headers, you will have a blank line, followed by the actual data sent with this response.

Once your browser has received that response, it will parse the HTML code, fetch all embedded assets (JavaScript and CSS files, images, videos), and render the result into the main window.

We will go through the different ways of performing HTTP requests with Python and extracting the data we want from the responses.

## 1. Manually Opening a Socket and Sending the HTTP Request

### Socket

The most basic way to perform an HTTP request in Python is to open a [TCP socket](https://docs.python.org/3/howto/sockets.html) and manually send the HTTP request.

```python
import socket

HOST = 'www.google.com'    # Server hostname or IP address
PORT = 80                  # The standard port for HTTP is 80; for HTTPS it is 443

# The request is sent as bytes; the empty line (\r\n\r\n) ends the headers
request_header = b'GET / HTTP/1.0\r\nHost: www.google.com\r\n\r\n'

# The with-block closes the socket even if an error occurs
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client_socket:
    client_socket.connect((HOST, PORT))
    client_socket.sendall(request_header)

    # Read until the server closes the connection (recv returns b'')
    chunks = []
    while True:
        recv = client_socket.recv(1024)
        if not recv:
            break
        chunks.append(recv)

# Join the raw bytes once instead of concatenating str() representations
response = b''.join(chunks)
print(response)
```

```
b'HTTP/1.0 200 OK\r\nDate: Sun, 24 Jul 2022 14:52:36 GMT\r\nExpires: -1\r\nCache-Control: private, max-age=0\r\nContent-Type: text/html; charset=ISO-8859-1\r\nP3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."\r\nServer: gws\r\nX-XSS-Protection: 0\r\nX-Frame-Options: SAMEORIGIN\r\nSet-Cookie: 1P_JAR=2022-07-24-14; expires=Tue, 23-Aug-2022 14:52:36 GMT; path=/; domain=.google.com; Secure\r\nSet-Cookie: AEC=AakniGN3NTL0SwC53UQ_GBxAnEbuqXA-N49yXKF9QThoILLWgaGmEuIrFg; expires=Fri, 20-Jan-2023 14:52:36 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax\r\nSet-Cookie: NID=511=psfxVXfbM2MnCpvZd1Ip2__5RZQTtMXhnnrtm8n17ECGgbVh0OPqyo9hGRlBodU3GY9UVpzGGAkQ3PWGiyx6ReJ2HNFQuRgW-NH8x4H5yOx008N7RB8UP3PwpWlA88rkSlMm1ROYmVvT4yqCAMnCiHbn5T93NTvj_92eADAtFRA; expires=Mon, 23-Jan-2023 14:52:36 GMT; path=/; domain=.google.com; HttpOnly\r\nAccept-Ranges: none\r\nVary: Accept-Encoding\r\n\r\n<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="ko"><head><m
```

Now that we have the HTTP response, the most basic way to extract data from it is to use regular expressions.

### Regular Expressions

Regular expressions (regex for short) are an extremely versatile tool for handling, parsing, and validating arbitrary text. A regular expression is essentially a string that defines a search pattern using a standard syntax. For example, you could quickly identify all phone numbers in a web page.
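As a rough sketch of the phone number case, the pattern below matches US-style numbers made of 3, 3, and 4 digits. Both the sample text and the pattern are assumptions for illustration; real pages need a pattern tuned to the formats they actually contain.

```python
import re

# Hypothetical page text; the pattern only covers US-style 3-3-4 digit numbers.
page_text = "Call us at 555-123-4567 or (555) 987-6543 for support."

phone_pattern = re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}")
phone_numbers = phone_pattern.findall(page_text)
print(phone_numbers)
```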

Combined with classic *search and replace*, regular expressions also allow you to perform string substitution on dynamic strings in a relatively straightforward fashion. The easiest example, in a web scraping context, may be to replace uppercase tags in a poorly formatted HTML document with the proper lowercase counterparts.
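The tag example could be sketched as follows with `re.sub` and a replacement function. The sample HTML is an assumption, and the pattern only lowercases tag names, not attribute names.

```python
import re

# Hypothetical, poorly formatted HTML with uppercase tag names.
html = "<P>Hello <B>world</B></P>"

# Match "<" or "</" followed by an uppercase tag name.
tag_pattern = re.compile(r"</?[A-Z][A-Z0-9]*")

# The replacement function lowercases each matched tag opening.
fixed_html = tag_pattern.sub(lambda m: m.group(0).lower(), html)
print(fixed_html)
```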

You may now be wondering why it is important to understand regular expressions when doing web scraping. That's a fair question; after all, there are many different Python modules for parsing HTML with XPath and CSS selectors.

In an ideal [semantic world](https://en.wikipedia.org/wiki/Semantic_Web), data is easily machine-readable, and the information is embedded inside relevant HTML elements with meaningful attributes. But the real world is messy. You will often find huge amounts of text inside a `<p>` element. If you want to extract specific data from such a large text (a price, a date, a name...), you will have to use regular expressions.

**Note:** Here is a great website to test your regex: https://regex101.com/. Also, here is an [awesome blog](https://www.rexegg.com/) to learn more about them. This post will only cover a small fraction of what you can do with regex.

Regular expressions can be useful when you have this kind of data, for example a paragraph such as `<p>Price : 19.99$</p>`.

We could select this text node with an XPath expression and then use this kind of regex to extract the price:
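A pattern along these lines would do. This is a sketch: the pattern is an assumption, and `text_node` stands in for the string an XPath query would return.

```python
import re

# Stand-in for the text node selected with XPath (assumed value).
text_node = "Price : 19.99$"

# Capture the digits, the decimal point and two decimals before the "$" sign.
price_pattern = re.compile(r"^Price\s*:\s*(\d+\.\d{2})\$")

match = price_pattern.match(text_node)
if match:
    price = float(match.group(1))
    print(price)
```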

If you only have the HTML, it is a bit trickier, but not by much. You can simply include the tag in your expression and then use a capturing group for the text.

```python
import re

html_content = '<p>Price : 19.99$</p>'

# Match the whole <p> element and capture its text content
m = re.match(r'<p>(.+)</p>', html_content)
if m:
    print(m.group(1))
```

```
Price : 19.99$
```


As you can see, manually sending the HTTP request with a socket and parsing the response with regular expressions can be done, but it's complicated, and there are higher-level APIs that make this task easier.

## 2. urllib3 & LXML