# Web Scraping

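A Python web scraper (crawler) is a program written in Python that automatically collects data from the internet. It can mimic a browser: request pages, parse the HTML, and extract the information it needs, such as text, images, and links. Commonly used libraries include requests for sending HTTP requests, BeautifulSoup and lxml (with XPath queries) for parsing HTML, and Scrapy for building more complex crawling systems. Typical applications include data collection, search engine indexing, price monitoring, and public opinion analysis. Python is easy to learn and flexible, which is why it is widely used for data mining and information retrieval.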
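The fetch, parse, extract cycle described above can be tried without any network access. The sketch below is only an illustration: it uses the standard library's `html.parser` in place of BeautifulSoup or lxml, and the `LinkCollector` class and the sample HTML string are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper that collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Stand-in for a downloaded page (assumed content, not a real site)
sample_html = (
    '<html><body>'
    '<a href="/page/1/">One</a> <a href="/page/2/">Two</a>'
    '<p>No link here</p>'
    '</body></html>'
)

collector = LinkCollector()
collector.feed(sample_html)
links = collector.links
print(links)
```

In a real scraper the `sample_html` string would be replaced by the body of an HTTP response.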

In [17]:
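# Import libraries
import urllib.request  # urlopen, Request
import urllib.parse    # urlencode
import requests
from lxml import etree
from pathlib import Path
import scrapy

# Browser-like request headers, used to get past basic anti-scraping checks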
headers = {
    "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.58",
}
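To make the role of these headers concrete, the sketch below attaches a shortened User-Agent to a `urllib.request.Request` and compares it with the default that urllib would otherwise send. Nothing is sent over the network; `demo_headers` and the target URL are assumptions chosen for illustration.

```python
import urllib.request

# Shortened, illustrative header set (assumed values)
demo_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

# Building a Request does not contact the server
req = urllib.request.Request("https://quotes.toscrape.com/", headers=demo_headers)

# urllib stores header names capitalized as "User-agent"
user_agent = req.get_header("User-agent")

# Default headers an opener adds when none are supplied
default_agent = urllib.request.build_opener().addheaders

print(user_agent)
print(default_agent)
```

The default value identifies the client as Python's urllib, which is what many sites use to reject scripted requests.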

## Scrapy Module

In [None]:
class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)
        self.log(f"Saved file {filename}")
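The `parse` method above derives the output file name from the URL. The sketch below isolates that one step so it can be checked without running Scrapy; `page_filename` is a hypothetical helper introduced only for this illustration.

```python
def page_filename(url: str) -> str:
    """Mirror the spider's naming: take the second-to-last '/' segment."""
    page = url.split("/")[-2]
    return f"quotes-{page}.html"


urls = [
    "https://quotes.toscrape.com/page/1/",
    "https://quotes.toscrape.com/page/2/",
]
names = [page_filename(u) for u in urls]
print(names)

# Without the trailing slash the same index picks a different segment
no_slash = page_filename("https://quotes.toscrape.com/page/1")
print(no_slash)
```

The approach depends on the trailing slash in the URL. The spider itself is usually started from the command line, for example with `scrapy runspider` followed by the file that contains the class.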

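## Making Requests

### The urllib library

urllib is part of the Python standard library and handles URLs and network requests. It provides a simple interface for opening URLs, sending requests, and processing response data.

- Fetching a web page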

In [2]:
# Fetch the page
response = urllib.request.urlopen('https://www.baidu.com/')
# Read the raw HTML bytes
html = response.read()
# Print the decoded content
print(html.decode('utf-8'))

<html>
<head>
	<script>
		location.replace(location.href.replace("https://","http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>


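- Simulating a Baidu search

    The base URL below uses http rather than https. As the output above shows, a plain request to the https URL returns only a small stub page that redirects to the http address, so fetching the https URL directly yields almost no useful data.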

In [None]:
# Base URL (use http here, not https)
url_pre = 'http://www.baidu.com/s'

# Build the query parameters
params = {}
params['wd'] = 'python'
url_params = urllib.parse.urlencode(params)

# Build the request
url = url_pre + '?' + url_params
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request)

# Read the HTML
html = response.read()

# Print the content
print(html.decode('utf-8'))
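The query-string step above can be inspected on its own. This sketch shows how `urlencode` percent-encodes a non-ASCII search term and how the result round-trips through `parse_qs`; the search term is an assumption chosen for illustration, and no request is sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base = "http://www.baidu.com/s"

# urlencode percent-encodes the UTF-8 bytes of non-ASCII values
query = urlencode({"wd": "爬虫"})
full_url = f"{base}?{query}"

# Decode the query string back into a dict of lists
decoded = parse_qs(urlsplit(full_url).query)

print(full_url)
print(decoded)
```

Because `urlencode` accepts plain strings, the value does not need to be encoded to bytes first.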

- POST request example

In [8]:
# Build the form data (placeholder credentials; substitute your own)
values = {}
values['usr'] = "your_username"
values['pwd'] = "your_password"
data = urllib.parse.urlencode(values)

# Build the request; passing data makes urllib send a POST
url = 'http://passport.csdn.net/account/login'
request = urllib.request.Request(url, data=data.encode('utf-8'), headers=headers)
response = urllib.request.urlopen(request)

# Read the HTML
html = response.read()

# Print the content
print(html.decode('utf-8'))

<!DOCTYPE html><html><head><meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1"><meta name="viewport" content="width=device-width,initial-scale=1"><meta name="referrer" content="always"><meta name="renderer" content="webkit"><meta name="force-rendering" content="webkit"><meta charset="utf-8"><meta name="keyword" content="CSDN登录"><meta name="description" content="CSDN桌面端登录"><meta name="google" value="notranslate"><meta name="report" content='{"spm":"1031.2352"}'><link type="image/x-icon" href="https://g.csdnimg.cn/static/logo/favicon32.ico" rel="SHORTCUT ICON"><title>CSDN-专业IT技术社区-登录</title><!--[if lte IE 9]>
       <script>window.location.href="https://g.csdnimg.cn/browser_upgrade/1.0.2/index.html";</script>
    <![endif]--><script src="//g.csdnimg.cn/tingyun/1.8.5/passport.js" type="text/javascript"></script><link href="https://csdnimg.cn/release/passport_fe/assets/css/loginv3.7dbb4731a9963f240fea6261f5f9c0e2.css" rel="stylesheet"></head><body><div id="app"></div><script type
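What turns the request above into a POST is the `data` argument. The sketch below makes that visible by comparing two `Request` objects without sending either one; the URL and form values are placeholders assumed for illustration.

```python
import urllib.parse
import urllib.request

# Placeholder form fields (assumed values)
form = {"usr": "example_user", "pwd": "example_password"}

# The request body must be bytes, so encode the urlencoded string
body = urllib.parse.urlencode(form).encode("utf-8")

get_req = urllib.request.Request("http://example.com/login")
post_req = urllib.request.Request("http://example.com/login", data=body)

# get_method() reports the HTTP verb urllib would use
methods = (get_req.get_method(), post_req.get_method())

print(body)
print(methods)
```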

### The requests library


In [14]:
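# URL of the page we want to scrape
url = 'https://book.douban.com/subject/1084336/comments/'

# Download the page with requests.get, sending the browser-like headers.
# The recorded output below comes from a run without headers, where parsing returned None.
r = requests.get(url, headers=headers)
r.raise_for_status()

# Parse the page and locate the short comments
s = etree.HTML(r.text)
print(s)
if s is None:
    raise ValueError('Empty or unparseable response body')
file = s.xpath('//*[@id="comments"]/ul/li/div[2]/p/text()')

# Print the scraped text
print(file)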

None


AttributeError: 'NoneType' object has no attribute 'xpath'
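To see what the XPath expression in the cell above is meant to select without depending on the live site, the sketch below runs the same path against a small inline snippet. It is an illustration under stated assumptions: the markup is invented to match the path, and the standard library's `xml.etree.ElementTree` stands in for lxml, so the input must be well-formed and only a limited XPath subset is available.

```python
import xml.etree.ElementTree as ET

# Invented markup shaped to match the path used in the notebook
snippet = """
<html><body>
<div id="comments">
  <ul>
    <li><div>avatar</div><div><p>First comment</p></div></li>
    <li><div>avatar</div><div><p>Second comment</p></div></li>
  </ul>
</div>
</body></html>
""".strip()

root = ET.fromstring(snippet)

# Same structure as the notebook's XPath, minus the text() step,
# which ElementTree does not support; read .text instead
paragraphs = root.findall(".//*[@id='comments']/ul/li/div[2]/p")
comments = [p.text for p in paragraphs]

print(comments)
```

If the real page nests the comments differently, the path returns an empty list rather than raising an error, so an empty result is a hint to re-check the page structure.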