# Web Crawler

### 1. Introduction to Python Web Scraping

#### 1.1 Definition of a Web Scraper

- **Web Scraper**: An automated network program that can access web pages on the World Wide Web according to specific rules, extract information, and store the data locally.

#### 1.2 Applications of Web Scrapers

- Search engines (e.g., Google, Baidu)
- Data analysis (market analysis, competitor analysis, etc.)
- Monitoring changes in website content (such as stock prices, news updates)

#### 1.3 Basic Networking Knowledge and HTTP Protocol Introduction

##### 1.3.1 Network Basics

- **World Wide Web (WWW)**: A system comprised of many interconnected web pages accessible via the Internet.
- **Internet**: A global network of computers connected to exchange data.

##### 1.3.2 IP Addresses and Domain Names

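- **IP Address**: A numeric address that identifies a device on a network, e.g., `192.168.1.1` (a private address used on local networks; hosts reachable from the Internet use public addresses).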
- **Domain Name**: An easier-to-remember address, such as `www.example.com`, resolved to an IP address via the Domain Name System (DNS).
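
As a small illustration of the two definitions above, the sketch below uses Python's standard `ipaddress` module to tell private addresses from public ones, and wraps `socket.gethostbyname` in a helper that performs a DNS lookup. The helper names `describe_ip` and `resolve` are ours, and `resolve` is only defined, not called, because it needs network access.

```python
import ipaddress
import socket


def describe_ip(text):
    """Return 'private' or 'public' for an IP address given as a string."""
    address = ipaddress.ip_address(text)
    return "private" if address.is_private else "public"


def resolve(hostname):
    """Ask the system resolver (DNS) for an IPv4 address of hostname."""
    return socket.gethostbyname(hostname)


print(describe_ip("192.168.1.1"))  # private: reserved for local networks
print(describe_ip("8.8.8.8"))      # public
```

Calling `resolve("www.example.com")` on a machine with Internet access would return the address that DNS currently maps to that name.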

##### 1.3.3 HTTP Protocol

- **HTTP (Hypertext Transfer Protocol)**: Defines the format and rules for exchanging information between clients and servers.
- **Requests and Responses**: Clients send HTTP requests to servers, which return responses.
- **Methods**: Major HTTP methods include GET (request resources), POST (submit data for processing), PUT (replace all current representations of the target resource), DELETE (remove the specified resource), etc.

##### 1.3.4 HTTPS Protocol

- **HTTPS (Hypertext Transfer Protocol Secure)**: HTTP sent over an SSL/TLS connection, which encrypts communication between client and server.
- **SSL/TLS**: Protocols that encrypt data exchanged between client and server; TLS is the successor to SSL.
- **Encryption Process**: Keys are used to encrypt and decrypt the data, so that it cannot be read or altered unnoticed during transmission.
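
To make the TLS settings a little more concrete, here is a minimal sketch using Python's standard `ssl` module. It only inspects the default client configuration and opens no connection.

```python
import ssl

# The default client context verifies the server certificate and its hostname.
context = ssl.create_default_context()

print(context.verify_mode == ssl.CERT_REQUIRED)  # the certificate must validate
print(context.check_hostname)                    # the hostname must match the certificate
```

These two checks are what lets a client detect an impostor server before any application data is sent.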

##### 1.3.5 URL Structure

- **URL (Uniform Resource Locator)**: An address on the Internet for a resource, including protocol, domain name, port (optional), resource path, and query parameters.

  - Example: `https://www.example.com:443/path/to/file?query=value`
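
The parts listed above can be pulled out of the example URL with the standard `urllib.parse` module. This is a minimal sketch; the variable names are ours.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.example.com:443/path/to/file?query=value"
parts = urlsplit(url)

print(parts.scheme)           # https
print(parts.hostname)         # www.example.com
print(parts.port)             # 443
print(parts.path)             # /path/to/file
print(parse_qs(parts.query))  # {'query': ['value']}
```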

#### 1.4 Common Web Servers and Client Tools

- **Web Servers**: Such as Apache, Nginx, Microsoft IIS.
- Client Tools:
  - Browsers (Chrome, Firefox)
  - Command-line tools (curl, wget)
  - **Fiddler**: An HTTP debugging tool that captures both HTTP and HTTPS traffic, allowing users to monitor, modify, and replay inbound and outbound data.
  - **Charles**: A proxy server that enables developers to view all HTTP and SSL/HTTPS traffic, including requests and responses, headers, and metadata.

#### 1.5 Initial Test Sites

- **HTTPBin** ([http://httpbin.org](http://httpbin.org/)): A simple service that receives HTTP requests and echoes back sent information. It supports various request methods such as GET, POST, PUT, DELETE, etc., and can be used to test HTTP headers, response data, and status codes.
- **Reqres** ([https://reqres.in](https://reqres.in/)): A lightweight mock REST API that provides various API response simulations including user registration, user information retrieval, data updating, as well as simulation of HTTPS information.

### Examples of HTTP Requests and Responses

#### HTTP Requests:

- **Example HTTP Request**:

  ```
  GET /api/users HTTP/1.1
  Host: example.com
  User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36
  Accept: application/json
  Cookie: session_id=abc123; language=en-US
  Authorization: Bearer <access_token>
  ```

- **Request Headers**:

  - **Cookie**: Sends cookies previously set by the server back with each request.
  - **Session**: Not a header itself. The server keeps the session state and identifies it through a session cookie (here `session_id`), so the user does not have to send login credentials with every page request.

#### HTTPS Responses:

- Example HTTPS Response:

  ```
  HTTP/1.1 200 OK
  Content-Type: application/json
  Server: Apache/2.4.41 (Unix)
  Set-Cookie: session_id=def456; Expires=Sun, 14 May 2023 23:59:59 GMT; Secure; HttpOnly
  Cache-Control: max-age=3600
  Content-Length: 45
  
  {
    "id": 123,
    "name": "John Doe",
    "email": "johndoe@example.com"
  }
  ```



Understanding GET and POST requests:

GET is used to retrieve information, while POST is used to send data to a server.

Status codes and their meanings:

Each HTTP response comes with a status code. For example, 200 means the request was successful, while 404 means the requested resource was not found.

HTTP Request:

1. Request line: Specifies the HTTP method (e.g., GET, POST), the target URL, and the HTTP version.
2. Headers:
   - Host: Specifies the domain name or IP address of the server.
   - User-Agent: Identifies the client making the request (e.g., browser or software).
   - Accept: Specifies the desired content type for the response (e.g., text/html, application/json).
   - Content-Type: Indicates the format of the data included in the request body (e.g., application/json, multipart/form-data).
   - Cookie: Contains cookies that the server set in earlier responses and the client stored.
   - Authorization: Provides credentials for accessing protected resources (e.g., API keys, access tokens).
   - Other headers: Additional information, such as Accept-Language, Referer, Accept-Encoding, etc.
3. Body (optional): Contains the payload or data sent with the request, such as form data or JSON payload.

HTTP Response:

1. Status line: Specifies the HTTP version, the status code indicating the outcome of the request (e.g., 200 OK, 404 Not Found), and a brief reason phrase.
2. Headers:
   - Content-Type: Indicates the format of the response content (e.g., text/html, application/json).
   - Set-Cookie: Sets a cookie on the client's side for future requests.
   - Server: Identifies the software or server handling the request.
   - Cache-Control: Controls caching behavior on the client or intermediate proxies.
   - Content-Length: Specifies the length of the response body in bytes.
   - Other headers: Vary, Expires, Last-Modified, etc., providing additional information about the response.
3. Body (optional): Contains the actual content of the response, such as HTML, JSON, or binary data.
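
The three-part structure described above (status line, headers, body) can be sketched as a tiny parser. This is a simplified illustration with a function name of our own (`parse_response`), not how a real HTTP client works: it assumes the whole response is already available as text and keeps only the last value when a header repeats.

```python
def parse_response(raw):
    """Split a raw HTTP/1.1 response into status code, reason, headers, body."""
    head, _, body = raw.partition("\r\n\r\n")        # blank line ends the headers
    status_line, *header_lines = head.split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return int(code), reason, headers, body


raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/json\r\n"
    "Content-Length: 11\r\n"
    "\r\n"
    '{"id": 123}'
)
code, reason, headers, body = parse_response(raw)
print(code, reason)             # 200 OK
print(headers["content-type"])  # application/json
print(body)                     # {"id": 123}
```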

HTTP response codes and their meanings:

- **1xx (Informational)**:
  - 100 Continue: The server has received the request headers and the client should proceed to send the request body.
  - 101 Switching Protocols: The server is switching protocols according to the request.
- **2xx (Success)**:
  - 200 OK: Standard response for successful HTTP requests, typically used for GET and POST requests.
  - 201 Created: The request has been fulfilled and resulted in the creation of a new resource.
  - 204 No Content: The server successfully processed the request but is not returning any content.
- **3xx (Redirection)**:
  - 301 Moved Permanently: The requested resource has been permanently moved to a new URI.
  - 302 Found: The requested resource resides temporarily under a different URI.
  - 304 Not Modified: The resource has not been modified since the last request.
- **4xx (Client Error)**:
  - 400 Bad Request: The server cannot process the request due to a client error in syntax.
  - 401 Unauthorized: The request requires user authentication.
  - 403 Forbidden: The server understood the request but refuses to authorize it.
  - 404 Not Found: The server cannot find the requested resource.
- **5xx (Server Error)**:
  - 500 Internal Server Error: A generic error message, given when an unexpected condition was encountered and no more specific message is suitable.
  - 502 Bad Gateway: The server, while acting as a gateway or proxy, received an invalid response from the upstream server it accessed while attempting to fulfill the request.
  - 503 Service Unavailable: The server is currently unable to handle the request due to a temporary overload or maintenance.
  - 504 Gateway Timeout: The server, while acting as a gateway or proxy, did not receive a timely response from the upstream server it accessed in attempting to complete the request.
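
Python's standard library already knows these codes. The sketch below looks up reason phrases with `http.HTTPStatus` and groups codes by their first digit; the helper `status_class` is ours.

```python
from http import HTTPStatus


def status_class(code):
    """Map a status code to the name of its class (1xx to 5xx)."""
    classes = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    return classes.get(code // 100, "Unknown")


for code in (200, 301, 404, 503):
    print(code, HTTPStatus(code).phrase, "-", status_class(code))
```

A scraper can use the same first-digit test to decide whether to parse a response, follow a redirect, or retry later.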

In [9]:
!ping www.baidu.com


Pinging www.a.shifen.com [110.242.68.4] with 32 bytes of data:
Reply from 110.242.68.4: bytes=32 time=23ms TTL=53
Reply from 110.242.68.4: bytes=32 time=19ms TTL=53
Reply from 110.242.68.4: bytes=32 time=20ms TTL=53
Reply from 110.242.68.4: bytes=32 time=20ms TTL=53

Ping statistics for 110.242.68.4:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 19ms, Maximum = 23ms, Average = 20ms


### 2. Legal and Ethical Issues of Python Web Scraping

#### 2.1 Understanding the robots.txt File

- **Purpose of robots.txt**: Websites use this file to tell web crawlers which parts of the site they may or may not crawl. It is located at the root of the website (for example, `https://www.example.com/robots.txt`).
- **Content Structure**: The file contains `User-agent` lines specifying different web crawlers, followed by `Disallow` or `Allow` directives to restrict or grant access to specific paths of the website.
- **Respecting robots.txt**: Ethical web scraping involves adhering to the restrictions specified in the `robots.txt` file. Ignoring this can lead to legal actions and being banned from websites.
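
Checking these rules can be automated with the standard `urllib.robotparser` module. The sketch below parses a hypothetical `robots.txt` supplied inline (so no download is needed) and asks whether a crawler, here given the made-up name `my-crawler`, may fetch two URLs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used instead of downloading a real file.
rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("my-crawler", "https://www.example.com/index.html"))    # True
print(parser.can_fetch("my-crawler", "https://www.example.com/private/data"))  # False
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would load the real file before the same `can_fetch` calls.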

#### 2.2 Legal Guidelines for Using Web Scrapers

- **Compliance with Laws**: The legality of web scraping depends on the jurisdiction and the specific laws of the country. In general, accessing publicly available data is often legal, but scraping data without permission from protected areas or in violation of terms of service can lead to legal consequences.
- **Terms of Service (ToS)**: Many websites include clauses in their ToS that restrict or prohibit scraping. It’s important to review and comply with these terms before scraping data.
- **Avoiding System Overload**: Legal issues can also arise from overloading a website’s server by sending too many requests in a short period. This can be considered a denial-of-service attack.
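
One simple way to avoid overloading a server is to enforce a minimum delay between requests. The class below is a minimal sketch of that idea; the name `PoliteThrottle` and the 0.2 second interval are our own choices, and a suitable delay depends on the site.

```python
import time


class PoliteThrottle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds between requests
        self._last = None                 # time of the previous request

    def wait(self):
        """Sleep just long enough to respect min_interval, then record the time."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = PoliteThrottle(min_interval=0.2)
start = time.monotonic()
for page in range(3):
    throttle.wait()  # call this before each request
    # ... send the request for `page` here ...
elapsed = time.monotonic() - start
print(f"3 simulated requests took {elapsed:.2f}s")
```

The first call returns immediately, and each later call waits until the interval has passed, so three requests take at least two intervals.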

#### 2.3 Ethical Considerations in Data Use

- **Privacy Concerns**: When scraping data, it's crucial to consider the privacy of individuals. Personal data should be handled with care, and it’s best to anonymize data when possible.
- **Data Use**: Ethically, the data collected through scraping should be used responsibly. Misusing data can lead to ethical breaches and damage to individuals or organizations.
- **Transparency and Consent**: Whenever possible, obtaining consent for data use and being transparent about how the data will be used can help mitigate ethical risks.

#### 2.4 Case Studies and Examples

- **Positive Example**: Academic researchers scraping data for analyzing market trends, where they comply with robots.txt, use data ethically, and publish their findings for public benefit.
- **Negative Example**: A business scraping contact information from a competitor’s website without consent and using it for spam marketing campaigns, violating privacy and legal guidelines.

By adhering to these legal and ethical guidelines, developers of Python web scrapers can ensure that their work is both effective and respectful of the rights and rules of the online environment. Discussing real-world cases is a useful way to illustrate what these legal and ethical considerations mean in practice.

### 3. Basic Components of Python Web Scrapers

#### 3.1 Request Library: Requests

##### Introduction

Requests is a popular Python HTTP library designed to make HTTP requests simple and intuitive. It is built around the motto "HTTP for Humans" and supports features such as session objects, persistent connections, and persistent cookies.

##### Installation

To install the Requests library, enter the following command in the command line or terminal:

```
pip install requests
```

##### Basic Usage

Using Requests to send HTTP requests is straightforward. Here are some basic examples:

###### Sending GET Requests

GET requests are used to retrieve data from a specified URL. The following example shows how to send a GET request and print the response content:

In [1]:
!pip install requests



In [2]:
import requests

# Send a GET request
response = requests.get('https://httpbin.org/get')
print(response.text)  # Print the text content of the response
print(type(response))

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.28.1", 
    "X-Amzn-Trace-Id": "Root=1-662f4726-39e30f5464c628db5c0e3204"
  }, 
  "origin": "122.206.190.72", 
  "url": "https://httpbin.org/get"
}

<class 'requests.models.Response'>


###### Sending POST Requests

POST requests are commonly used to send data to the server. Here is how to send a POST request and handle a JSON response:

In [3]:
import requests

# Send a POST request
response = requests.post('https://httpbin.org/post', data={'key': 'value'})
print(response)
print(response.json())  # Print the JSON content of the response


<Response [200]>
{'args': {}, 'data': '', 'files': {}, 'form': {'key': 'value'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Content-Length': '9', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.28.1', 'X-Amzn-Trace-Id': 'Root=1-662f47a6-793ff5536cec543b5f41542f'}, 'json': None, 'origin': '122.206.190.72', 'url': 'https://httpbin.org/post'}


##### Handling Query Parameters

When sending GET requests, you often need to include query parameters in the URL. Requests allows you to provide these parameters as a dictionary, as shown below:

In [27]:
import requests

# Define query parameters
params = {
    'key1': 'value1',
    'key2': 'value2'
}

# Send the request
response = requests.get('https://httpbin.org/get', params=params)
print(response.url)  # View the actual URL requested
print(response.text)  # Print the text content of the response


https://httpbin.org/get?key1=value1&key2=value2
{
  "args": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.31.0", 
    "X-Amzn-Trace-Id": "Root=1-662e77a6-23ad6b8727d27d7a5412860e"
  }, 
  "origin": "221.15.159.208", 
  "url": "https://httpbin.org/get?key1=value1&key2=value2"
}
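
To see how a dictionary becomes a query string, the sketch below builds a similar URL with the standard `urllib.parse` module and no network access. The extra parameter `q` is our own addition to show how a space is escaped; the exact escaping Requests applies may differ in such details.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

params = {"key1": "value1", "key2": "value2", "q": "web scraping"}

# urlencode joins key=value pairs with '&' and escapes special characters.
url = "https://httpbin.org/get?" + urlencode(params)
print(url)

# Round trip: parse the query string back into a dictionary of lists.
print(parse_qs(urlsplit(url).query))
```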



Some websites reject requests that do not look like they come from a browser. In the example below, a request to bilibili sent with the default Requests headers is answered with status code 412 and an error page:

In [5]:
import requests

# URL of bilibili
url = 'https://www.bilibili.com/'

# Send a GET request to bilibili
response = requests.get(url)
print(response)
print(response.text)

<Response [412]>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="zh-cn">

<head>
    <meta http-equiv="Access-Control-Allow-Origin" content="*" />
    <meta http-equiv="Page-Enter" content="blendTrans(Duration=0.5)">
    <meta http-equiv="Page-Exit" content="blendTrans(Duration=0.5)">
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <meta name="viewport" content="width=device-width, user-scalable=no, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0">
    <meta name="spm_prefix" content="333.937">
    <title>出错啦! - bilibili.com</title>
    <link rel="shortcut icon" href="//static.hdslb.com/images/favicon.ico">
    <script type="text/javascript" src="//s1.hdslb.com/bfs/static/jinkela/long/js/jquery/jquery1.7.2.min.js"></script>
    
</head>

<body>
    <div class="error-container">
        <div class="txt-item err-code">错误号:412</div>
        <div class="txt-item err-text">由于触发哔哩哔哩安全风控策略，
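
When a response body prints as garbled characters (accented Latin letters where Chinese text was expected), a likely cause is that UTF-8 bytes were decoded with the wrong codec, typically ISO-8859-1. The sketch below reproduces the effect with a sample string of our own choosing and then reverses it.

```python
original = "错误号"  # "error number", as it might appear on an error page

# Decoding UTF-8 bytes with the wrong codec produces mojibake.
garbled = original.encode("utf-8").decode("latin-1")

# Reversing the wrong step recovers the text, as long as no bytes were lost.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repaired == original)  # True
```

With Requests, a common remedy is to set `response.encoding = 'utf-8'` (or `response.encoding = response.apparent_encoding`) before reading `response.text`. Whether a wrong encoding guess is really the cause should be checked case by case.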

##### Handling Request Headers

If you need to customize HTTP headers, such as setting a `User-Agent` or an authentication token, you can pass a dictionary to the `headers` parameter:

In [7]:
import requests

# URL of bilibili
url = 'https://www.bilibili.com/'

# Define a dictionary containing the headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
}

# Send a GET request to bilibili with the specified headers
response = requests.get(url, headers=headers)
# print(response.text)
with open('bilibili_page.html', 'w', encoding='utf-8') as file:
    file.write(response.text)

print("\nThe HTML content has been saved to 'bilibili_page.html'.")



The HTML content has been saved to 'bilibili_page.html'.


In [9]:
import requests

# Test a GET request
def test_get():
    response = requests.get('https://reqres.in/api/users?page=1')
    print("GET Response:", response.json())

# Test a POST request
def test_post():
    data = {
        "name": "morpheus",
        "job": "leader"
    }
    response = requests.post('https://reqres.in/api/users', data=data)
    print("POST Response:", response.json())

# Test a PUT request
def test_put():
    data = {
        "name": "morpheus",
        "job": "zion resident"
    }
    response = requests.put('https://reqres.in/api/users/2', data=data)
    print("PUT Response:", response.json())

# Test a DELETE request
def test_delete():
    response = requests.delete('https://reqres.in/api/users/2')
    print("DELETE Response:", response.status_code)  # A successful delete usually returns 204

# Run the tests
test_get()
print('\n')

test_post()
print('\n')
test_put()
print('\n')
test_delete()


GET Response: {'page': 1, 'per_page': 6, 'total': 12, 'total_pages': 2, 'data': [{'id': 1, 'email': 'george.bluth@reqres.in', 'first_name': 'George', 'last_name': 'Bluth', 'avatar': 'https://reqres.in/img/faces/1-image.jpg'}, {'id': 2, 'email': 'janet.weaver@reqres.in', 'first_name': 'Janet', 'last_name': 'Weaver', 'avatar': 'https://reqres.in/img/faces/2-image.jpg'}, {'id': 3, 'email': 'emma.wong@reqres.in', 'first_name': 'Emma', 'last_name': 'Wong', 'avatar': 'https://reqres.in/img/faces/3-image.jpg'}, {'id': 4, 'email': 'eve.holt@reqres.in', 'first_name': 'Eve', 'last_name': 'Holt', 'avatar': 'https://reqres.in/img/faces/4-image.jpg'}, {'id': 5, 'email': 'charles.morris@reqres.in', 'first_name': 'Charles', 'last_name': 'Morris', 'avatar': 'https://reqres.in/img/faces/5-image.jpg'}, {'id': 6, 'email': 'tracey.ramos@reqres.in', 'first_name': 'Tracey', 'last_name': 'Ramos', 'avatar': 'https://reqres.in/img/faces/6-image.jpg'}], 'support': {'url': 'https://reqres.in/#support-heading', '

In [62]:
import requests

# API base URL
base_url = "https://reqres.in/api"

def register_user(email, password):
    """Function to register a user."""
    url = f"{base_url}/register"
    data = {
        "email": email,
        "password": password
    }
    response = requests.post(url, json=data)
    return response.json(), response.status_code

def login_user(email, password):
    """Function to login a user."""
    url = f"{base_url}/login"
    data = {
        "email": email,
        "password": password
    }
    response = requests.post(url, json=data)
    return response.json(), response.status_code

# Test registration success
print("Testing registration success:")
print(register_user("eve.holt@reqres.in", "pistol"))

# Test registration failure (missing password)
print("\nTesting registration failure (missing password):")
print(register_user("eve.holt@reqres.in", ""))

# Test login success
print("\nTesting login success:")
print(login_user("eve.holt@reqres.in", "cityslicka"))

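# Test login with a wrong password. In the recorded output below, Reqres still
# returns a token with status 200, so this mock API does not appear to check the password.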
print("\nTesting login failure (wrong password):")
print(login_user("eve.holt@reqres.in", "wrongpassword"))


Testing registration success:
({'id': 4, 'token': 'QpwL5tke4Pnpja7X4'}, 200)

Testing registration failure (missing password):
({'error': 'Missing password'}, 400)

Testing login success:
({'token': 'QpwL5tke4Pnpja7X4'}, 200)

Testing login failure (wrong password):
({'token': 'QpwL5tke4Pnpja7X4'}, 200)




In web development, `session` and `cookie` are technologies used for storing information, primarily for maintaining the state of users between the browser and the server. While they serve similar purposes, they operate differently and are used for distinct reasons.

### Cookie

A cookie is a small piece of data sent from a server and stored on the user's browser. Whenever the same user makes a request to the server again, the browser sends the cookie back to the server along with the request. This way, the server can recognize the user and remember information about them, such as their login status, preferences, etc.

**Key features include**:

- **Persistence**: Cookies can be set with an expiration date. If an expiration date is set, the information remains saved even after the browser is closed; if not set, it becomes a session cookie, which expires when the browser is closed.
- **Limited size**: Each cookie is limited to about 4KB, and there is a limit to the number of cookies stored per domain.
- **Security**: Cookie data is stored locally, so users can read and modify it. The `HttpOnly` flag hides a cookie from JavaScript, which limits theft through cross-site scripting (XSS) attacks, and the `Secure` flag ensures the cookie is sent only over encrypted (HTTPS) connections.
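
The attributes described above can be inspected with the standard `http.cookies` module. The header string below is a made-up example in the style of a `Set-Cookie` value.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie value with a lifetime and both security flags.
header = "session_id=def456; Max-Age=3600; Path=/; Secure; HttpOnly"

cookie = SimpleCookie()
cookie.load(header)

morsel = cookie["session_id"]
print(morsel.value)              # def456
print(morsel["max-age"])         # 3600
print(bool(morsel["secure"]))    # True
print(bool(morsel["httponly"]))  # True
```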

### Session

A session is a server-side mechanism for storing information about a user's interaction with a site. The server assigns a unique identifier, usually called a session ID, to each user's session. This identifier is stored in a cookie or passed through URL rewriting. Each time the user interacts with the server, the server can recognize the user by the session ID and access the data stored on the server about that user.

**Key features include**:

- **Increased security**: Since session data is stored on the server side, it cannot be accessed directly by users, making it more secure than cookies.
- **No size limit**: Sessions can store a larger amount of data without the size limitations of cookies.
- **Dependent on cookies**: Although session information is stored on the server, the session ID is typically managed via cookies. If cookies are disabled by the user, other methods (such as URL rewriting) need to be used to pass the session ID.
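
The mechanism can be sketched in a few lines: the server keeps the data in its own storage and hands the client only a random ID. This is a toy illustration with names of our own (`SESSIONS`, `create_session`, `load_session`); real frameworks add expiry, persistence, and signing.

```python
import secrets

# Server-side storage: session ID -> data about that user.
SESSIONS = {}


def create_session(username):
    """Create a session and return the ID the server would send as a cookie."""
    session_id = secrets.token_hex(16)  # hard-to-guess random identifier
    SESSIONS[session_id] = {"username": username, "cart": []}
    return session_id


def load_session(session_id):
    """Look up session data from the ID sent back by the browser."""
    return SESSIONS.get(session_id)


sid = create_session("alice")
load_session(sid)["cart"].append("book")
print(load_session(sid))           # {'username': 'alice', 'cart': ['book']}
print(load_session("unknown-id"))  # None
```

Only `sid` travels between browser and server; the cart itself never leaves the server.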

### Summary

In summary, cookies are a way to store data in the user's browser, mainly used for tracking and identifying users. Sessions are a server-side solution that provides a method to store user-specific data, managed through session IDs to recognize and manage the user's state. In web applications, both are often used together to implement user authentication, state management, and other functions.



In [73]:
import requests

# Create a Session object
session = requests.Session()

# Add a cookie named 'sessioncookie' to the session
session.cookies.set('sessioncookie', '12345')

# Send a GET request
response = session.get('https://httpbin.org/get')

# Print the response text; it shows the cookie that was sent with the request
print(response.text)

# Print all cookies currently stored in the session
print(session.cookies.get_dict())


{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Cookie": "sessioncookie=12345", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.31.0", 
    "X-Amzn-Trace-Id": "Root=1-662e82fe-412039c85d09c5491dfedd0b"
  }, 
  "origin": "221.15.159.208", 
  "url": "https://httpbin.org/get"
}

{'sessioncookie': '12345'}


In [10]:
import requests

kw = input("Please enter the content you want to search: ")
response = requests.get(f"https://www.sogou.com/web?query={kw}")  # Send GET request

with open("search_sogou.html", mode="w", encoding="utf-8") as f:
    f.write(response.text)

Please enter the content you want to search: nih 


In [11]:
import requests

# Ask the user for the search term
kw = input("Please enter the content you want to search: ")

# Build custom request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5'
}

# Send a GET request with the custom headers
response = requests.get(f"https://www.sogou.com/web?query={kw}", headers=headers)

# Write the response content to a file
with open("search_sogou1.html", mode="w", encoding="utf-8") as f:
    f.write(response.text)


Please enter the content you want to search: hello


In [12]:
import requests

# Request URL
url = 'https://fanyi.baidu.com/sug'

# Prompt the user to enter the text to translate
text = input("Please enter the text to translate: ")

# Build the request data
data = {
    'kw': text,    # Text to translate
    'from': 'auto',   # Source language automatically detected
    'to': 'zh'      # Target language is Chinese
}

# Send a POST request
response = requests.post(url, data=data)

# Get the response result
result = response.json()

# Print the translation result
print(result)


Please enter the text to translate: apple
{'errno': 0, 'data': [{'k': 'Apple', 'v': 'n. 苹果公司，原称苹果电脑公司'}, {'k': 'apple', 'v': 'n. 苹果; 苹果公司; 苹果树'}, {'k': 'APPLE', 'v': 'n. 苹果'}, {'k': 'apples', 'v': 'n. 苹果，苹果树( apple的名词复数 ); [美国口语]棒球; [美国英语][保龄球]坏球; '}, {'k': 'Apples', 'v': '[地名] [瑞士] 阿普勒'}], 'logid': 1192332965}


In [13]:
import requests

# Prompt the user to enter the text to translate
kw = input("Please enter the text to translate:")

# Prepare the request data
dic = {
    "kw": kw   # The key must match the form field name the site expects (visible in the browser's developer tools)
}

# Send a POST request to Baidu Translate's 'sug' endpoint
resp = requests.post("https://fanyi.baidu.com/sug", data=dic)

# The response is JSON, so parse it directly
resp_json = resp.json()

# Extract the translation from the response
# Here we're assuming the translation we want is always in the first dictionary in the 'data' list
# If this is not the case, you might need to modify this part
print(resp_json['data'][0]['v'])


Please enter the text to translate:apple
n. 苹果公司，原称苹果电脑公司
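
Indexing `['data'][0]` raises an `IndexError` when the `data` list is empty. The sketch below shows a more defensive way to read the same structure. It works on a sample payload with two entries copied from the response shown earlier, so it runs without network access; the helper name `extract_suggestions` is ours.

```python
import json

# Sample payload: two entries copied from the recorded response.
sample = (
    '{"errno": 0, "data": ['
    '{"k": "Apple", "v": "n. 苹果公司，原称苹果电脑公司"}, '
    '{"k": "APPLE", "v": "n. 苹果"}]}'
)


def extract_suggestions(payload):
    """Return (word, meaning) pairs, or an empty list if 'data' is missing."""
    return [(item["k"], item["v"]) for item in payload.get("data", [])]


payload = json.loads(sample)
for word, meaning in extract_suggestions(payload):
    print(word, "->", meaning)

print(extract_suggestions({"errno": 1}))  # []
```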


In [16]:
import requests
import json

def fetch_douban_movies():
    url = 'https://movie.douban.com/j/chart/top_list'

    # Get user input for the start position and limit
    start = input("Enter the start position (from which movie to start): ")
    limit = input("Enter the number of movies to fetch: ")

    param = {
        'type': '24',
        'interval_id': '100:90',
        'action':'',
        'start': start,  # From which movie to fetch
        'limit': limit,  # Number of movies to fetch
    }

    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }

    # Make the request and handle any exceptions
    try:
        response = requests.get(url=url,params=param,headers=headers)
        response.raise_for_status()
        response.encoding = response.apparent_encoding

        # Convert the response to JSON
        list_data = response.json()
        print(list_data)

        # Write the data to a file
        with open('./douban.json','w',encoding='utf-8') as fp:
            json.dump(list_data, fp, ensure_ascii=False)

        print('Successfully fetched the movie data!')

    except Exception as e:
        print(f"Failed to fetch the movie data: {e}")

# Call the function
fetch_douban_movies()

Enter the start position (from which movie to start): 1
Enter the number of movies to fetch: 10
[{'rating': ['9.3', '50'], 'rank': 2, 'cover_url': 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2553104888.jpg', 'is_playable': False, 'id': '1291858', 'types': ['剧情', '喜剧'], 'regions': ['中国大陆'], 'title': '鬼子来了', 'url': 'https://movie.douban.com/subject/1291858/', 'release_date': '2000-05-12', 'actor_count': 30, 'vote_count': 655098, 'score': '9.3', 'actors': ['姜文', '香川照之', '袁丁', '姜宏波', '丛志军', '李丛喜', '泽田谦也', '李海滨', '蔡卫东', '陈述', '陈莲梅', '史建全', '陈强', '宫路佳具', '吴大维', '梶冈润一', '石山雄大', '述平', '姜武', '姜金才', '石山雄太', '山田将之', '贾幼然', '王义和', '杜世儒', '周海超', '白云生', '徐海东', '长野客弘', '鱼见亮介'], 'is_watched': False}, {'rating': ['9.3', '50'], 'rank': 3, 'cover_url': 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p1454261925.jpg', 'is_playable': True, 'id': '6786002', 'types': ['剧情', '喜剧'], 'regions': ['法国'], 'title': '触不可及', 'url': 'https://movie.douban.com/subject/6786002/', 'release_dat

In [17]:
import requests

def fetch_douban_movies():
    url = 'https://movie.douban.com/j/chart/top_list'

    # Get user input for the start position and limit
    start = input("Enter the start position (from which movie to start): ")
    limit = input("Enter the number of movies to fetch: ")

    param = {
        'type': '24',   # Category ID of the chart; '24' appears to correspond to comedies (喜剧)
        'interval_id': '100:90',   # Ranking band; '100:90' appears to select the top 10% of the category
        'action':'',  
        'start': start,  # This represents the starting index for the movies to fetch
        'limit': limit,  # This represents the number of movies to fetch
    }

    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
        # This is just a common user agent string. Some websites require this to allow the request
    }

    # Make the request and handle any exceptions
    try:
        response = requests.get(url=url,params=param,headers=headers)
        response.raise_for_status()
        response.encoding = response.apparent_encoding

        # Convert the response to JSON
        list_data = response.json()

        # Print the movie names
        for movie in list_data:
            print(movie['title'])

    except Exception as e:
        print(f"Failed to fetch the movie data: {e}")

# Call the function
fetch_douban_movies()

Enter the start position (from which movie to start): 2
Enter the number of movies to fetch: 5
触不可及
摩登时代
大话西游之大圣娶亲
疯狂动物城
三傻大闹宝莱坞


In [64]:
import requests

# URL of Baidu
url = 'https://www.baidu.com'

# Send a GET request to Baidu
response = requests.get(url)

# Ensure the response encoding is set correctly
response.encoding = 'utf-8'

# Print the request details
print("Request Details:")
print("URL:", response.request.url)
print("Method:", response.request.method)
print("Headers:", response.request.headers)

# Print the response
print("\nResponse Details:")
print("Status Code:", response.status_code)
print("Headers:", response.headers)
print("\nResponse Body:")
print(response.text[:10000])  # Print the first 10000 characters of the response

# Save the response content as an HTML file
with open('baidu_page.html', 'w', encoding='utf-8') as file:
    file.write(response.text)

print("\nThe HTML content has been saved to 'baidu_page.html'.")


Request Details:
URL: https://www.baidu.com/
Method: GET
Headers: {'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive'}

Response Details:
Status Code: 200
Headers: {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Sun, 28 Apr 2024 16:52:03 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:23:46 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}

Response Body:
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head

### 4. Data Parsing with Python

In the previous chapter, we essentially mastered the fundamental skills of scraping an entire webpage. However, in most cases, we don't need the entire webpage content; we only require a small portion of it. So, what can we do? This brings us to the issue of data extraction.

Data extraction involves retrieving specific data elements or information from a larger dataset or webpage. Instead of dealing with the entire page, we can use various techniques and tools to extract only the relevant data we need. This allows us to focus on the specific information of interest, making our scraping process more efficient and targeted.

There are different methods and approaches to perform data extraction during web scraping. Some common techniques include using BeautifulSoup, Regular Expressions (re), and XPath.

BeautifulSoup: It is a Python library that provides a convenient way to parse HTML and XML documents. With BeautifulSoup, we can navigate the HTML structure and extract specific elements or data based on their tags, attributes, or other patterns.

Regular Expressions (re): Regular Expressions offer a powerful and flexible approach for pattern matching and text manipulation. Using regex, we can define specific patterns and extract data that matches those patterns from the webpage content.

XPath: XPath is a query language used to navigate and select elements in XML or HTML documents. It provides a way to traverse the document structure and select specific nodes or data based on their location or attributes.

By employing these techniques, we can efficiently and precisely extract the desired data from webpages, focusing only on the relevant information needed for our analysis or application.

#### 4.1 Regular Expressions (re)

**Step 1: Importing the `re` Module.** To use regular expressions in Python, import the `re` module:

```python
import re
```

**Step 2: Basic Pattern Matching.** The most basic use of regular expressions is to match a specific pattern in a string. Here's an example:

```python
pattern = r"apple"
text = "I have an apple and a banana."

match = re.search(pattern, text)
if match:
    print("Pattern found!")
else:
    print("Pattern not found.")
```

In this example, we define a pattern using a raw string `r"apple"`. We then use `re.search()` to search for that pattern within the `text` string. If the pattern is found, we print "Pattern found!"; otherwise, we print "Pattern not found."

**Step 3: Metacharacters and Special Sequences.** Regular expressions have special characters, called metacharacters, that carry special meaning. Here are a few commonly used ones:

*   `.`: Matches any character except a newline.
*   `^`: Matches the start of a string.
*   `$`: Matches the end of a string.
*   `[]`: Matches any single character within the brackets.
*   `|`: Matches either the expression before or after the pipe.
*   `*`: Matches zero or more occurrences of the preceding pattern.
*   `+`: Matches one or more occurrences of the preceding pattern.
*   `?`: Matches zero or one occurrence of the preceding pattern.
*   `()`: Creates a capturing group.

Special sequences are shorthand codes that represent common patterns:

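*   `\d`: Matches any decimal digit (0-9; for `str` patterns this also includes other Unicode digits unless `re.ASCII` is set).
*   `\w`: Matches any word character: letters, digits, and the underscore (for `str` patterns this includes Unicode letters unless `re.ASCII` is set).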
*   `\s`: Matches any whitespace character (space, tab, newline).
*   `\b`: Matches a word boundary.
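
To make the lists above concrete, here is a minimal sketch that applies each metacharacter and special sequence to one short string. The sample text and variable names are made up for illustration.

```python
import re

text = "cat bat rat\ncar 42 cats"

any_char = re.findall(r"c.t", text)          # `.`: any character except a newline
char_class = re.findall(r"[bcr]at", text)    # `[]`: one character from the set
alternation = re.findall(r"cat|car", text)   # `|`: either alternative
optional_s = re.findall(r"cats?", text)      # `?`: zero or one "s"
digits = re.findall(r"\d+", text)            # `\d` with `+`: one or more digits
whole_word = re.findall(r"\bcat\b", text)    # `\b`: "cat" only as a whole word
words = re.split(r"\s+", text)               # `\s`: split on runs of whitespace
starts_with_cat = bool(re.search(r"^cat", text))   # `^`: start of the string
ends_with_cats = bool(re.search(r"cats$", text))   # `$`: end of the string

print(any_char)      # ['cat', 'cat']
print(char_class)    # ['cat', 'bat', 'rat', 'cat']
print(alternation)   # ['cat', 'car', 'cat']
print(optional_s)    # ['cat', 'cats']
print(digits)        # ['42']
print(whole_word)    # ['cat']
print(words)         # ['cat', 'bat', 'rat', 'car', '42', 'cats']
print(starts_with_cat, ends_with_cats)  # True True
```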

**Step 4: Using Patterns with Functions.** The `re` module provides various functions for working with regular expressions. Here are some commonly used functions:

*   `re.search(pattern, string)`: Searches for a pattern match anywhere in the string.
*   `re.match(pattern, string)`: Checks for a match only at the beginning of the string (it does not scan forward as `re.search` does).
*   `re.findall(pattern, string)`: Returns all non-overlapping matches of the pattern in the string.
*   `re.split(pattern, string)`: Splits the string by the occurrences of the pattern.
*   `re.sub(pattern, repl, string)`: Replaces occurrences of the pattern in the string with the replacement string.
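
The functions listed above can be compared side by side on a single input. This is a small sketch with an invented sample string; the difference between `re.match` and `re.search` is the part most worth noticing.

```python
import re

text = "2023-05-14, 2024-01-02"

# match only succeeds at position 0; search scans the whole string
match_at_start = re.match(r"\d{4}", text).group()
match_later = re.match(r"05", text)           # "05" is not at the start
search_later = re.search(r"05", text).group()

years = re.findall(r"\d{4}", text)            # all non-overlapping matches
dates = re.split(r",\s*", text)               # split on a comma plus optional spaces
slashed = re.sub(r"-", "/", text)             # replace every hyphen

print(match_at_start)  # 2023
print(match_later)     # None
print(search_later)    # 05
print(years)           # ['2023', '2024']
print(dates)           # ['2023-05-14', '2024-01-02']
print(slashed)         # 2023/05/14, 2024/01/02
```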

**Step 5: Capturing Groups and Backreferences.** Capturing groups allow you to extract specific parts of a matched pattern. Here's an example:

```python
pattern = r"(\d+)-(\d+)-(\d+)"
text = "Date: 2023-05-14"

match = re.search(pattern, text)
if match:
    year = match.group(1)
    month = match.group(2)
    day = match.group(3)
    print("Year:", year)
    print("Month:", month)
    print("Day:", day)
```

In this example, the pattern `(\d+)-(\d+)-(\d+)` captures the year, month, and day from a date string. We use the `match.group()` method to access the captured groups and print them.
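
The example above covers capturing groups only. As a complementary sketch, the snippet below shows backreferences: `\1` inside a pattern refers to whatever group 1 matched, and `\1` or `\g<name>` inside a replacement string reuses captured text. The sample strings are invented for illustration.

```python
import re

# Backreference inside the pattern: find a word that is immediately repeated
doubled = re.search(r"\b(\w+) \1\b", "This is is a test")
repeated_word = doubled.group(1)

# Backreferences inside the replacement: reorder year-month-day to day/month/year
reordered = re.sub(r"(\d+)-(\d+)-(\d+)", r"\3/\2/\1", "Date: 2023-05-14")

# The same reordering with named groups, which are easier to read
named = re.sub(
    r"(?P<year>\d+)-(?P<month>\d+)-(?P<day>\d+)",
    r"\g<day>/\g<month>/\g<year>",
    "Date: 2023-05-14",
)

print(repeated_word)  # is
print(reordered)      # Date: 14/05/2023
print(named)          # Date: 14/05/2023
```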

These are just some of the basics of using regular expressions in Python. Regular expressions offer a powerful and flexible way to search, match, and manipulate text patterns. For more detail, refer to the official Python documentation for the `re` module.

In [19]:
import re
pattern = r"apple"
text = "I have an apple and a banana."

match = re.search(pattern, text)
if match:
    print("Pattern found!")
else:
    print("Pattern not found.")

Pattern found!


In [21]:
import re
pattern = r"(\d+)-(\d+)-(\d+)"
text = "Date: 2023-05-14"

match = re.search(pattern, text)
if match:
    year = match.group(1)
    month = match.group(2)
    day = match.group(3)
    print("Year:", year)
    print("Month:", month)
    print("Day:", day)

Year: 2023
Month: 05
Day: 14


In [108]:
import re

# findall: Matches all occurrences of the pattern in the string
lst = re.findall(r"\d+", "My phone number is: 10086, and my girlfriend's phone number is: 10010")
print(lst)

# finditer: Matches all occurrences of the pattern in the string [returns an iterator], accessing the content from the iterator requires .group()
it = re.finditer(r"\d+", "My phone number is: 10086, and my girlfriend's phone number is: 10010")
for i in it:
    print(i.group())

# search: Returns the first occurrence of a match, the result is a match object, accessing the data requires .group()
s = re.search(r"\d+", "My phone number is: 10086, and my girlfriend's phone number is: 10010")
print(s.group())

# match: Matches from the beginning of the string
s = re.match(r"\d+", "10086, and my girlfriend's phone number is: 10010")
print(s.group())

# Precompile regular expression
obj = re.compile(r"\d+")

ret = obj.finditer("My phone number is: 10086, and my girlfriend's phone number is: 10010")
for it in ret:
    print(it.group())

ret = obj.findall("Hahaha, I don't believe you won't change me 1000000000")
print(ret)

s = """
<div class='jay'><span id='1'>Guo Qilin</span></div>
<div class='jj'><span id='2'>Song Tie</span></div>
<div class='jolin'><span id='3'>Da Congming</span></div>
<div class='sylar'><span id='4'>Fan Sizhe</span></div>
<div class='tory'><span id='5'>Hu Shuo Badao</span></div>
"""

# (?P<group_name>regex) can be used to further extract content from the matched content
obj = re.compile(r"<div class='.*?'><span id='(?P<id>\d+)'>(?P<wahaha>.*?)</span></div>", re.S)  # re.S: allows . to match newline characters

result = obj.finditer(s)
for it in result:
    print(it.group("wahaha"))
    print(it.group("id"))


['10086', '10010']
10086
10010
10086
10086
10086
10010
['1000000000']
Guo Qilin
1
Song Tie
2
Da Congming
3
Fan Sizhe
4
Hu Shuo Badao
5



In [110]:
import requests
import re
import csv

url = "https://movie.douban.com/top250"
headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.192 Safari/537.36"
}
resp = requests.get(url, headers=headers)
page_content = resp.text

# Parse data
pattern = re.compile(r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)'
                     r'</span>.*?<p class="">.*?<br>(?P<year>.*?)&nbsp.*?<span '
                     r'class="rating_num" property="v:average">(?P<score>.*?)</span>.*?'
                     r'<span>(?P<num>.*?)人评价</span>', re.S)


# Start matching
result = pattern.finditer(page_content)

# Create and write to CSV file
# newline="" prevents the csv module from writing blank rows on Windows
with open("data.csv", mode="w", encoding="utf-8", newline="") as f:
    csvwriter = csv.writer(f)
    for item in result:
        dic = item.groupdict()
        dic['year'] = dic['year'].strip()
        csvwriter.writerow(dic.values())

print("Data extraction and writing to CSV complete!")



Data extraction and writing to CSV complete!


1.  `<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)</span>`

    *   `<li>`: Matches the opening `<li>` tag.
    *   `.*?`: Matches any characters zero or more times, non-greedily (as few as possible). Because the pattern is compiled with `re.S`, `.` also matches newlines, so the match can span several lines of HTML.
    *   `<div class="item">`: Matches the `<div>` tag whose class attribute is "item".
    *   `<span class="title">`: Matches the opening `<span>` tag whose class attribute is "title".
    *   `(?P<name>.*?)`: Named capturing group "name". It captures the movie name, stopping at the next `</span>`.
    *   `</span>`: Matches the closing `</span>` tag.
2.  `.*?<p class="">.*?<br>(?P<year>.*?)&nbsp.*?<span class="rating_num" property="v:average">(?P<score>.*?)</span>`

    *   `.*?<p class="">`: Skips ahead, non-greedily, to the `<p>` tag whose class attribute is the empty string.
    *   `.*?<br>`: Skips ahead, non-greedily, to the next `<br>` tag.
    *   `(?P<year>.*?)`: Named capturing group "year". It captures the text between `<br>` and `&nbsp`, which is the release year plus surrounding whitespace (removed later with `strip()`).
    *   `&nbsp`: Matches the literal text `&nbsp`, the beginning of the HTML entity for a non-breaking space.
    *   `.*?<span class="rating_num" property="v:average">`: Skips ahead, non-greedily, to the `<span>` tag with the class attribute "rating_num" and the property attribute "v:average".
    *   `(?P<score>.*?)`: Named capturing group "score". It captures the movie's rating.
    *   `</span>`: Matches the closing `</span>` tag.
3.  `.*?<span>(?P<num>.*?)人评价</span>`

    *   `.*?<span>`: Skips ahead, non-greedily, to the next plain `<span>` tag.
    *   `(?P<num>.*?)`: Named capturing group "num". It captures the number of people who rated the movie.
    *   `人评价</span>`: Matches the text "人评价" ("people rated") followed by the closing `</span>` tag.

The regular expression pattern is designed to match the relevant information for each movie on the Douban top 250 page. 
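
Two details of that pattern are easy to get wrong: the non-greedy `.*?` and the `re.S` flag. The sketch below shows their effect on a small, hypothetical HTML fragment (it is not copied from the real page, and the class names are simplified).

```python
import re

# Hypothetical sample markup used only for this demonstration
html = """<li>
  <span class="title">Movie A</span>
  <span class="rating_num">9.7</span>
</li>
<li>
  <span class="title">Movie B</span>
  <span class="rating_num">9.6</span>
</li>"""

# Non-greedy groups plus re.S: one match per movie, spanning several lines
lazy = re.compile(
    r'<span class="title">(?P<name>.*?)</span>.*?'
    r'<span class="rating_num">(?P<score>.*?)</span>',
    re.S,
)
rows = [m.groupdict() for m in lazy.finditer(html)]

# Same pattern without re.S: `.` cannot cross a newline, so nothing matches
no_dotall = re.findall(
    r'<span class="title">(.*?)</span>.*?<span class="rating_num">(.*?)</span>',
    html,
)

# Greedy `.*` with re.S: a single match that swallows everything up to the LAST </span>
greedy = re.findall(r'<span class="title">(.*)</span>', html, re.S)

print(rows)         # [{'name': 'Movie A', 'score': '9.7'}, {'name': 'Movie B', 'score': '9.6'}]
print(no_dotall)    # []
print(len(greedy))  # 1
```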

#### 4.2 BeautifulSoup library

*   Parsing HTML using BeautifulSoup
*   Navigating parse tree with BeautifulSoup

BeautifulSoup (often abbreviated as bs4) is a valuable Python library when it comes to handling web pages or HTML files. It offers a simple and flexible way to parse HTML and extract data from it. Here are some key functionalities of the BeautifulSoup library:

1.  HTML Parsing: BeautifulSoup can parse HTML content into a Python object called a "BeautifulSoup object." This object represents the structure of the entire HTML document, allowing you to easily traverse and manipulate it.

2.  Navigating the Parse Tree: BeautifulSoup provides a range of methods to navigate the HTML parse tree. You can search for specific elements based on tags, attributes, or hierarchical relationships, or iterate through the entire tree structure to retrieve the desired data.

3.  Data Extraction: With BeautifulSoup, you can effortlessly extract data from HTML documents. You can access the content, attributes, and text of individual tags, as well as extract multiple elements based on specific selectors.

4.  Modifying Documents: BeautifulSoup also allows you to modify HTML documents. You can add, delete, and modify tags, change tag attributes and text content, and restructure the document as needed.

5.  Handling Complex HTML: BeautifulSoup is powerful in handling complex HTML documents. It can handle incomplete tags, nested tag structures, and other HTML errors, ensuring that you can parse and extract data correctly.

In summary, BeautifulSoup is a powerful library that is useful for extracting data from HTML, handling web pages, and performing web scraping tasks. It provides a simple and flexible API, making HTML parsing and manipulation easier. Whether it's web scraping, data extraction, or web page analysis, BeautifulSoup is a valuable tool.

In [116]:
!pip install beautifulsoup4



In [27]:
from bs4 import BeautifulSoup

html_content = '''
<html>
  <head>
    <title>Example Page</title>
  </head>
  <body>
    <h1>Welcome to the Example Page</h1>
    <div class="content">
      <p>This is some example content.</p>
      <ul>
        <li>Item 1</li>
        <li>Item 2</li>
        <li>Item 3</li>
      </ul>
    </div>
  </body>
</html>
'''

# Create BeautifulSoup object
soup = BeautifulSoup(html_content, 'html.parser')

# Access tag content
title = soup.title
print(title.text)  # Output: Example Page

h1 = soup.h1
print(h1.text)  # Output: Welcome to the Example Page

# Find elements by tag name
div = soup.find('div')
print(div)

# Access element attributes
div_class = div['class']
print(div_class)  # Output: ['content']

# Iterate through tag elements
ul = soup.find('ul')
for li in ul.find_all('li'):
    print(li.text)

# Select elements using CSS selectors
p = soup.select_one('.content p')
print(p.text)  # Output: This is some example content.

items = soup.select('.content li')
for item in items:
    print(item.text)



Example Page
Welcome to the Example Page
<div class="content">
<p>This is some example content.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
</div>
['content']
Item 1
Item 2
Item 3
This is some example content.
Item 1
Item 2
Item 3
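
BeautifulSoup can also modify a parsed document, not just read it. The following is a minimal sketch on an invented HTML fragment; the tag names, class names, and URL are placeholders. It uses `new_tag()`, `append()`, and `decompose()` from the `bs4` package.

```python
from bs4 import BeautifulSoup

# Invented fragment for illustration
html = '<div class="content"><p>Old text</p><span class="ad">Advert</span></div>'
soup = BeautifulSoup(html, "html.parser")

p = soup.find("p")
p.string = "New text"      # replace the tag's text content
p["class"] = "intro"       # add (or change) an attribute

# Remove an unwanted element from the tree entirely
soup.find("span", class_="ad").decompose()

# Create a new tag and append it to the <div>
link = soup.new_tag("a", href="https://example.com")
link.string = "Link"
soup.div.append(link)

print(soup.p.get_text())   # New text
print(soup.find("span"))   # None
print(soup.a["href"])      # https://example.com
```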


In [120]:
import requests
from bs4 import BeautifulSoup

url = "https://movie.douban.com/top250"
headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.192 Safari/537.36"
}
resp = requests.get(url, headers=headers)
page_content = resp.text

# Parse the page content with BeautifulSoup
soup = BeautifulSoup(page_content, "html.parser")

# Find movie titles using CSS selector
movie_titles = soup.select("#content > div > div.article > ol > li > div > div.info > div.hd > a > span.title")

# Print the movie titles
for title in movie_titles:
    print(title.get_text())

print("Movie title list printed!")


肖申克的救赎
 / The Shawshank Redemption
霸王别姬
阿甘正传
 / Forrest Gump
泰坦尼克号
 / Titanic
千与千寻
 / 千と千尋の神隠し
这个杀手不太冷
 / Léon
美丽人生
 / La vita è bella
星际穿越
 / Interstellar
盗梦空间
 / Inception
楚门的世界
 / The Truman Show
辛德勒的名单
 / Schindler's List
忠犬八公的故事
 / Hachi: A Dog's Tale
海上钢琴师
 / La leggenda del pianista sull'oceano
三傻大闹宝莱坞
 / 3 Idiots
放牛班的春天
 / Les choristes
机器人总动员
 / WALL·E
疯狂动物城
 / Zootopia
无间道
 / 無間道
控方证人
 / Witness for the Prosecution
大话西游之大圣娶亲
 / 西遊記大結局之仙履奇緣
熔炉
 / 도가니
教父
 / The Godfather
触不可及
 / Intouchables
当幸福来敲门
 / The Pursuit of Happyness
寻梦环游记
 / Coco
Movie title list printed!


In [122]:
import os
import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urljoin

url = "https://www.umei.cc/bizhitupian/weimeibizhi/"
resp = requests.get(url)
resp.encoding = 'utf-8'  # Handle encoding issues

# Pass the response content to BeautifulSoup
main_page = BeautifulSoup(resp.text, "html.parser")
items = main_page.find_all("div", class_="item")

# Create 'imgC' directory if it doesn't exist
os.makedirs("imgC", exist_ok=True)

for item in items:
    # Find the link to the child page
    link = item.find("a", href=True)
    if link is None:
        continue  # Skip items that contain no link
    href = link["href"]
    
    # Check if the URL has a scheme
    if not href.startswith("http"):
        href = urljoin(url, href)

    # Get the content of the child page
    child_page_resp = requests.get(href)
    child_page_resp.encoding = 'utf-8'
    child_page_text = child_page_resp.text

    # Extract the image download URL from the child page
    child_page = BeautifulSoup(child_page_text, "html.parser")
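    img = child_page.find("img", class_="lazy")
    if img is None or not img.get("data-original"):
        continue  # Skip child pages without a lazily loaded image
    src = img["data-original"]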

    # Check if the URL has a scheme
    if not src.startswith("http"):
        src = urljoin(url, src)

    # Download the image
    img_resp = requests.get(src)
    img_name = src.split("/")[-1]  # Extract the image name from the URL

    with open("imgC/" + img_name, mode="wb") as f:
        f.write(img_resp.content)

    print("Downloaded:", img_name)
    time.sleep(1)

print("All images downloaded!")


Downloaded: zgghaapkfhy.jpg
Downloaded: zyecmylhfrn.jpg
Downloaded: gsyxb1o4gdq.jpg
Downloaded: zgghaapkfhy.jpg
Downloaded: ihk3g03psgi.jpg
Downloaded: t1ouhdmbhjo.jpg
Downloaded: ap2c1vg3whm.jpg
Downloaded: zj2ggdrhl44.jpg
Downloaded: dnvk3qz2ocy.jpg
Downloaded: cyhlqhlylep.jpg
Downloaded: ql23ngdggqt.jpg
Downloaded: fhkfzrkfyyv.jpg
Downloaded: vxrtmf3rnig.jpg
Downloaded: xbz4cl1lhtg.jpg
Downloaded: yotyomy0svb.jpg
Downloaded: 5g54nolova5.jpg
Downloaded: y1mahuysmqw.jpg
Downloaded: u0ffxygdpgk.jpg
Downloaded: epb4dxkxtlz.jpg
Downloaded: f3ypjikdmf0.jpg
Downloaded: ghj2jfe5twm.jpg
Downloaded: oxxnb3niz1h.jpg
Downloaded: c3td3px1qvo.jpg
Downloaded: nbuidh0n0cj.jpg
Downloaded: af2f41cry2n.jpg
Downloaded: hv1yg315qua.jpg
Downloaded: hpfrstvqizi.jpg
Downloaded: w2yy320dp5b.jpg
Downloaded: srpiuysntej.jpg
Downloaded: j32saeaez3h.jpg
All images downloaded!


#### 4.3 XPath library

XPath is a powerful query language used for selecting elements and navigating XML and HTML documents. It allows you to traverse the structure of the document and extract specific data based on patterns and conditions.

To use the XPath library in Python, you need to install the `lxml` library, which provides XPath functionality. You can install it using pip:

`pip install lxml`

Once installed, you can import the necessary modules to work with XPath:

```python
from lxml import etree
```

Now, let's go through the key concepts and techniques of XPath:

1.  Selecting Elements: XPath expressions are used to select elements in an XML or HTML document. You can specify the elements you want to target by their tag names, attributes, or their position in the document's structure.

2.  XPath Axes: Axes allow you to navigate the document relative to the current element. Common axes include `child`, `parent`, `descendant`, `ancestor`, `following-sibling`, and `preceding-sibling`. They help you select elements based on their relationship to other elements.

3.  Predicates: Predicates are conditions that further refine the element selection. You can use predicates to filter elements based on their attributes, values, or positions.

4.  XPath Functions: XPath provides a range of built-in functions for working with nodes and values. Functions like `contains()`, `starts-with()`, `position()`, `last()`, and `not()` are commonly used in XPath expressions, often together with the `text()` node test, which selects an element's text nodes.

5.  XPath Operators: XPath supports various operators such as `|` (union), `+`, `-`, `*`, `div`, `mod`, `=`, `!=`, `<`, `>`, `<=`, `>=`, `and`, and `or`. These operators allow you to combine expressions and compare values.

6.  Using XPath in Python: With the `lxml` library, you can parse an XML or HTML document using the `etree` module. Once parsed, you can use the `xpath()` method to execute XPath expressions and retrieve the matching elements or values.

XPath is a versatile tool for extracting data from XML and HTML documents. It provides a precise and flexible way to navigate the document structure and target specific elements. By mastering XPath, you can efficiently extract the data you need from complex documents.

Note: Although XPath is primarily designed for XML, it can also be used with HTML documents. However, HTML documents may have structural differences that could affect the accuracy and reliability of XPath expressions. In such cases, it is recommended to use libraries specifically designed for parsing HTML, such as BeautifulSoup.
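
The concepts listed above (predicates, functions, and axes) can be sketched on a small invented document. This assumes `lxml` is installed; the `id` and `class` values are made up for the example.

```python
from lxml import etree

# Invented markup for illustration
html = """
<ul id="chapters">
  <li class="done">Introduction</li>
  <li class="done">Basic Syntax</li>
  <li class="todo">Expressions and Predicates</li>
  <li class="todo">Functions</li>
</ul>
"""
root = etree.HTML(html)

# Positional predicates (XPath positions start at 1)
first = root.xpath("//li[1]/text()")
last = root.xpath("//li[last()]/text()")

# Attribute predicate
todo = root.xpath("//li[@class='todo']/text()")

# Functions inside predicates
has_syntax = root.xpath("//li[contains(text(), 'Syntax')]/text()")
intro = root.xpath("//li[starts-with(text(), 'Intro')]/text()")

# Axes: siblings after a given node, and the parent of a node
after_syntax = root.xpath("//li[text()='Basic Syntax']/following-sibling::li/text()")
parent_id = root.xpath("//li[1]/parent::ul/@id")

print(first)         # ['Introduction']
print(last)          # ['Functions']
print(todo)          # ['Expressions and Predicates', 'Functions']
print(has_syntax)    # ['Basic Syntax']
print(intro)         # ['Introduction']
print(after_syntax)  # ['Expressions and Predicates', 'Functions']
print(parent_id)     # ['chapters']
```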

In [125]:
!pip install lxml



In [127]:
from lxml import etree

# Create an HTML document
html_content = """
<html>
    <body>
        <h1>Welcome to the XPath Tutorial</h1>
        <div class="content">
            <p>Learn XPath for web scraping</p>
            <ul>
                <li>Introduction</li>
                <li>Basic Syntax</li>
                <li>Expressions and Predicates</li>
                <li>Functions</li>
            </ul>
        </div>
    </body>
</html>
"""

# Parse the HTML document
root = etree.HTML(html_content)

# Select elements using XPath
headings = root.xpath("//h1")
for heading in headings:
    print(heading.text)

paragraph = root.xpath("//p")[0]
print(paragraph.text)

list_items = root.xpath("//ul/li/text()")
for item in list_items:
    print(item)

# Use predicates to filter elements
div = root.xpath("//div[@class='content']")[0]
print(div.tag)

# Access parent and child elements
ul = root.xpath("//ul")[0]
parent_div = ul.getparent()
print(parent_div.tag)

# Evaluate XPath expressions with namespaces (not applicable for HTML)



Welcome to the XPath Tutorial
Learn XPath for web scraping
Introduction
Basic Syntax
Expressions and Predicates
Functions
div
div


In [129]:
import requests
from lxml import etree

url = "https://movie.douban.com/top250"
headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.192 Safari/537.36"
}
resp = requests.get(url, headers=headers)
page_content = resp.text

# Parse data
tree = etree.HTML(page_content)

# Find movie names using XPath
movie_names = tree.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li/div/div[2]/div[1]/a/span[1]")

# Print the movie names
for name in movie_names:
    print(name.text)

print("Movie list printed!")


肖申克的救赎
霸王别姬
阿甘正传
泰坦尼克号
千与千寻
这个杀手不太冷
美丽人生
星际穿越
盗梦空间
楚门的世界
辛德勒的名单
忠犬八公的故事
海上钢琴师
三傻大闹宝莱坞
放牛班的春天
机器人总动员
疯狂动物城
无间道
控方证人
大话西游之大圣娶亲
熔炉
教父
触不可及
当幸福来敲门
寻梦环游记
Movie list printed!


### 5. Anti-Anti-Crawling Strategies

*   Common anti-crawling techniques
*   Ways to bypass these techniques

Web scraping is subject to various anti-crawling techniques that websites employ to protect their data and control access. These techniques are designed to detect and prevent automated scraping. Understanding them, and knowing how to work around them, can improve the success rate and reliability of your web scraping projects. Let's explore some common anti-crawling techniques and methods to bypass them:

Robots.txt: Websites often use a robots.txt file to specify which parts of their website should not be accessed by web crawlers. It is a standard method for communicating the crawling permissions to search engine bots. To bypass this, you can choose to ignore the robots.txt file and proceed with scraping the desired content. However, be cautious and respectful of website policies when doing so.
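
Before deciding how to proceed, it helps to see what a site's robots.txt actually permits. The sketch below uses the standard library's `urllib.robotparser` on hypothetical rules supplied as a list of lines, so it runs without network access. The crawler name `MyCrawler` and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would load the site's own file,
# for example with parser.set_url("https://example.com/robots.txt"); parser.read()
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(robots_lines)

allowed_public = parser.can_fetch("MyCrawler", "https://example.com/articles/1")
allowed_private = parser.can_fetch("MyCrawler", "https://example.com/private/data")
delay = parser.crawl_delay("MyCrawler")

print(allowed_public)   # True
print(allowed_private)  # False
print(delay)            # 2
```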

User-Agent Restrictions: Websites may block requests that do not have a valid User-Agent header or have suspicious User-Agent values. To bypass this, you can set a User-Agent header in your scraping code to mimic a legitimate web browser. You can find popular User-Agent strings for various browsers and set them in your requests headers to make your scraper appear more like a regular user.
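
As a rough sketch of this idea, the helper below picks a browser-like User-Agent at random for each request. The function name `build_headers` and the `Accept-Language` value are assumptions made for illustration; the first User-Agent string is the one used in this tutorial's other examples, and the second is an illustrative Firefox-style value.

```python
import random

# Example browser-like User-Agent strings (illustrative values)
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/88.0.4324.192 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
]


def build_headers():
    """Return request headers with a randomly chosen browser-like User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }


headers = build_headers()
print(headers["User-Agent"])
# The dictionary can then be passed to requests, e.g. requests.get(url, headers=headers)
```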

Captcha Challenges: Captchas are used to differentiate between humans and bots. Websites may employ captchas to prevent automated scraping. To bypass captchas, you can use third-party services or libraries that can automatically solve captchas, such as CAPTCHA-solving APIs. These services typically require an API key and can handle the captcha challenges on your behalf.

IP Blocking: Websites may block IP addresses that make too many requests within a short time frame. To bypass IP blocking, you can use rotating proxies or proxy services. Proxies allow you to make requests from different IP addresses, making it difficult for websites to track and block your scraping activities. Be sure to choose reliable and reputable proxy providers.

Dynamic Website Content: Websites that heavily rely on client-side rendering using JavaScript may present challenges for scraping. To bypass this, you can use headless browsers or scraping frameworks that can render JavaScript, such as Puppeteer or Selenium. These tools simulate a real browser environment and allow you to interact with dynamically loaded content.

Session Management: Websites may use cookies or sessions to track user activity and prevent scraping. To bypass session-based protections, you can maintain and manage cookies in your scraping code. You can extract cookies from initial requests and include them in subsequent requests to maintain a session with the website.

Rate Limiting: Websites may implement rate limiting mechanisms to restrict the number of requests made by a single user within a given time period. To bypass rate limiting, you can introduce delays between requests or use intelligent scraping techniques like adaptive rate limiting, where you adjust the scraping speed dynamically based on the website's response times.
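
One way to introduce delays is a fixed pause between requests, combined with a longer, growing pause after a failed attempt. The sketch below is one possible arrangement rather than a standard recipe. `fetch_politely` and its parameters are hypothetical names, and `fetch` stands for any function that returns a `(status_code, body)` pair. A fake fetch function is used so the example runs without network access.

```python
import random
import time


def fetch_politely(urls, fetch, base_delay=1.0, max_retries=3, sleep=time.sleep):
    """Fetch each URL with a pause between requests and exponential backoff on failure.

    `fetch` is a hypothetical callable returning (status_code, body).
    """
    results = {}
    for url in urls:
        delay = base_delay
        for _ in range(max_retries):
            status, body = fetch(url)
            if status == 200:
                results[url] = body
                break
            # e.g. 429 Too Many Requests: wait (with some jitter), then double the delay
            sleep(delay + random.uniform(0, 0.5))
            delay *= 2
        # Fixed pause before moving on to the next URL
        sleep(base_delay)
    return results


# Demo with a fake fetch function; the first request pretends to be rate limited
attempts = []


def fake_fetch(url):
    attempts.append(url)
    return (429, "") if len(attempts) == 1 else (200, f"body of {url}")


pages = fetch_politely(["/a", "/b"], fake_fetch, sleep=lambda seconds: None)
print(pages)     # {'/a': 'body of /a', '/b': 'body of /b'}
print(attempts)  # ['/a', '/a', '/b']
```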

Honeypot Traps: Websites may employ hidden links or form fields that are invisible to human users but detectable by bots. Submitting requests to these traps can lead to IP blocking or other countermeasures. To bypass honeypot traps, you can inspect the HTML structure of the web page, analyze form fields, or avoid clicking on suspicious links.
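
A simple first line of defence is to skip links that are hidden with inline markup. The sketch below uses the standard library's `html.parser`; the class name `VisibleLinkParser` and the sample HTML are invented. It only detects the `hidden` attribute and inline `display:none` or `visibility:hidden` styles, so links hidden through external CSS or JavaScript would not be caught.

```python
from html.parser import HTMLParser


class VisibleLinkParser(HTMLParser):
    """Collect hrefs from <a> tags, skipping links hidden with common inline tricks."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # Normalize the inline style so "display: none" and "display:none" both match
        style = (attr.get("style") or "").replace(" ", "").lower()
        hidden = (
            "hidden" in attr
            or "display:none" in style
            or "visibility:hidden" in style
        )
        if attr.get("href") and not hidden:
            self.links.append(attr["href"])


# Invented sample page with two trap links
html = """
<a href="/page/1">Page 1</a>
<a href="/trap" style="display: none">Do not follow</a>
<a href="/trap2" hidden>Hidden</a>
<a href="/page/2">Page 2</a>
"""

parser = VisibleLinkParser()
parser.feed(html)
print(parser.links)  # ['/page/1', '/page/2']
```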

It's important to note that while these methods can help bypass common anti-crawling techniques, they should be used responsibly and in compliance with the website's terms of service. It's always a good practice to respect website policies, limit the scraping rate, and avoid putting excessive load on the target website's servers.

#### 5.1 Asynchronous Scraping

*   Understanding asynchronous crawling

Asynchronous crawling, also known as asynchronous scraping or concurrent scraping, is a technique used in web scraping to improve the efficiency and speed of data extraction from multiple web pages. In traditional scraping, requests are sent and processed synchronously, which means that each request must wait for a response before the next request is made. This can lead to significant delays and reduced performance, especially when dealing with a large number of web pages.

Asynchronous crawling addresses this problem by issuing multiple requests concurrently: while one request is waiting for its response, the program can start or process others. Because scraping is mostly I/O-bound (the program spends most of its time waiting on the network), overlapping these waits can significantly improve the overall scraping speed.
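
The idea can be sketched with the standard library's `asyncio`. To keep the example runnable without network access, `fake_fetch` simulates a request with `asyncio.sleep`; in a real scraper it would be replaced by an asynchronous HTTP client call. The names `fake_fetch`, `crawl`, and `max_concurrency`, and the example.com URLs, are assumptions made for illustration.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.2):
    """Stand-in for a network request: wait `delay` seconds, then return a fake body."""
    await asyncio.sleep(delay)
    return f"<html>content of {url}</html>"


async def crawl(urls, max_concurrency=5):
    """Fetch all URLs concurrently, with at most `max_concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        async with semaphore:
            return await fake_fetch(url)

    # gather() runs the coroutines concurrently and keeps results in input order
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


urls = [f"https://example.com/page/{i}" for i in range(5)]

start = time.perf_counter()
pages = asyncio.run(crawl(urls))
elapsed = time.perf_counter() - start

# Five simulated 0.2 s requests overlap, so the total is far below the sequential 1.0 s
print(len(pages), f"pages in {elapsed:.2f}s")
```

Inside a Jupyter notebook an event loop is already running, so `asyncio.run()` raises an error there; use `pages = await crawl(urls)` in a cell instead. The semaphore caps concurrency so that faster crawling does not turn into excessive load on the target site.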

#### 5.2 Selenium

Selenium is a popular open-source library that provides a programming interface for automating web browsers. It enables developers to automate browser actions, interact with web elements, and perform various tasks on web pages. Selenium supports multiple programming languages, including Python, Java, C#, and more. In this description, we'll focus on Selenium with Python.

Key features and components of the Selenium library include:

WebDriver: WebDriver is the core component of Selenium that provides a programming interface to interact with web browsers. It allows you to automate browser actions such as navigating to URLs, filling forms, clicking buttons, and extracting data from web elements.

Selenium WebDriver APIs: Selenium WebDriver provides APIs to interact with different browsers, including Chrome, Firefox, Safari, Edge, and more. Each browser requires a specific WebDriver, which acts as a bridge between the Selenium library and the browser.

Locating Elements: Selenium provides various methods to locate elements on a web page, such as finding elements by their ID, class name, tag name, CSS selector, or XPath. These methods enable you to identify and interact with specific elements on a web page.

Interacting with Elements: Selenium allows you to interact with web elements by performing actions like clicking buttons, filling forms, selecting options from dropdowns, submitting forms, or even simulating keyboard input. You can also retrieve element attributes, text, or perform other manipulations.

Navigating and Manipulating Browser Windows: Selenium provides methods to handle multiple browser windows or tabs. You can switch between windows, open new windows, or close existing ones. It also allows you to control the browser's size, position, and perform scrolling operations.

Advanced Interactions: Selenium supports advanced interactions with web elements, such as hovering over elements, double-clicking, dragging and dropping, and executing JavaScript code within the browser.

Waiting for Elements: Selenium provides explicit and implicit wait mechanisms to handle dynamic web pages. You can wait for specific conditions to be met before performing actions, such as waiting for an element to be visible, clickable, or present on the page.

Selenium is widely used for various purposes, including web scraping, automated testing, browser automation, and web application development. It offers flexibility and compatibility across different browsers and platforms, making it a versatile tool for automating browser interactions.

In Python, install Selenium with `pip install selenium`. You also need the WebDriver that matches the browser you intend to automate, for example ChromeDriver for Chrome (https://chromedriver.chromium.org/downloads). Selenium 4.6 and later ship with Selenium Manager, which can usually download a matching driver automatically.

Selenium documentation, tutorials, and community resources are available on the official Selenium website (https://www.selenium.dev/). These resources provide detailed information, examples, and best practices to help you make the most out of the Selenium library for your web automation needs.


In [135]:
!pip install selenium



In [143]:
from time import sleep

import pandas as pd
from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


def first(results, default=None):
    """Return the first XPath result, or default when nothing matched."""
    return results[0] if results else default


class Jdmobile:
    """Search JD.com for a keyword and collect product data from the result pages."""

    def __init__(self, pages=2):
        self.url = 'https://www.jd.com/'
        self.pages = pages
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)
        self.rows = []  # one dict per product; converted to a DataFrame in run()
        self.data = pd.DataFrame()

    def open_html(self):
        self.driver.get(self.url)

    def search_product(self, key):
        # Type the keyword into the search box, then click the search button.
        self.wait.until(EC.presence_of_element_located((By.ID, 'key'))).send_keys(key)
        self.wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'button'))).click()

    def scrape_pages(self):
        for page in range(self.pages):
            self.scroll_down()
            self.get_content()
            if page < self.pages - 1:  # no need to click "next" after the last page
                self.next_page()

    def scroll_down(self):
        # Scroll to the bottom twice so that lazily loaded products are rendered.
        for _ in range(2):
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            sleep(3)

    def get_content(self):
        html = etree.HTML(self.driver.page_source)
        for item in html.xpath('//div[@class="gl-i-wrap"]'):
            href = first(item.xpath('.//div[@class="p-commit"]/strong/a/@href'))
            title = first(item.xpath('.//div[@class="p-name p-name-type-2"]/a/em'))
            # data-lazy-img holds the image URL until the image has loaded ('done').
            image = first(item.xpath('.//div[@class="p-img"]/a/img/@data-lazy-img'))
            if image in (None, 'done'):
                image = first(item.xpath('.//div[@class="p-img"]/a/img/@src'))
            # Missing fields become None instead of raising IndexError.
            self.rows.append({
                'price': first(item.xpath('.//div[@class="p-price"]/strong/i/text()')),
                'comment': first(item.xpath('.//div[@class="p-commit"]/strong/a/text()')),
                'shopname': first(item.xpath('.//div[@class="p-shop"]/span/a/text()')),
                'URL': 'https:' + href if href else None,
                'title': title.xpath('string(.)').strip() if title is not None else None,
                'pnglink': 'https:' + image if image else None,
            })

    def next_page(self):
        # Positional XPath for the "next page" link; it breaks if the pager layout changes.
        next_button = self.wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="J_bottomPage"]/span[1]/a[9]')))
        self.driver.execute_script("arguments[0].click();", next_button)
        sleep(4)

    def run(self, key):
        try:
            self.open_html()
            self.search_product(key)
            self.scrape_pages()
        finally:
            self.driver.quit()  # always close the browser, even if scraping fails
        self.data = pd.DataFrame(self.rows)
        return self.data


if __name__ == '__main__':
    mql = Jdmobile()
    data = mql.run('手机')  # search keyword: "mobile phone"
    print(data)


NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=122.0.6261.112)
