In Python, the third-party library requests makes it easy to send HTTP requests.

# Installing requests

requests is a third-party Python library, so it must be installed before use.

In [1]:
pip install requests

Note: you may need to restart the kernel to use updated packages.


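# A first example

The first step in writing a crawler in Python is to send a request and fetch a page's source code. We will practice on a [crawler practice platform](https://scrape.center/) published by an experienced developer.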

In [1]:
import requests

r = requests.get("https://ssr1.scrape.center/")
# print(r.text)

Running the script above returns the page source. Extracting the data we want from it completes the scrape.

# Requests

## GET requests

In [1]:
import requests  

r = requests.get('http://httpbin.org/get')  
print(r.text)

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.24.0", 
    "X-Amzn-Trace-Id": "Root=1-5f47cd87-4df3df14a43ba1b0154899b2"
  }, 
  "origin": "114.84.169.66", 
  "url": "http://httpbin.org/get"
}



In [3]:
# Pass query parameters in the URL
import requests  

data = {  
    'name': 'lin',  
    'age': 27
}  
r = requests.get('http://httpbin.org/get', params=data)  
print(r.text)

{
  "args": {
    "age": "27", 
    "name": "lin"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.24.0", 
    "X-Amzn-Trace-Id": "Root=1-5f47ce1c-38b6686251d5ab2b18fa9267"
  }, 
  "origin": "114.84.169.66", 
  "url": "http://httpbin.org/get?name=lin&age=27"
}
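
To see how params is encoded without sending anything over the network, one option is to build a prepared request and inspect its URL. This is a minimal sketch; the helper name build_url is made up for illustration.

```python
import requests


def build_url(base_url, params):
    """Return the final URL requests would use, without sending the request."""
    prepared = requests.Request('GET', base_url, params=params).prepare()
    return prepared.url


# A plain dict becomes key=value pairs joined with '&'.
url = build_url('http://httpbin.org/get', {'name': 'lin', 'age': 27})
print(url)

# A list value repeats the key once per item.
multi_url = build_url('http://httpbin.org/get', {'tag': ['a', 'b']})
print(multi_url)
```

The first URL matches the url field in the httpbin output above.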



In [4]:
# The response body is a JSON string; calling json() parses it into a dict
import requests  

r = requests.get('http://httpbin.org/get')  
print(type(r.text))  
print(type(r.json()))
print(r.json())  

<class 'str'>
<class 'dict'>
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.24.0', 'X-Amzn-Trace-Id': 'Root=1-5f47cea7-949008827e48ba1c4413839c'}, 'origin': '114.84.169.66', 'url': 'http://httpbin.org/get'}


### Fetching a web page

In [3]:
import requests
import re

r = requests.get('https://static1.scrape.cuiqingcai.com/')
titles = re.findall('<h2.*?>(.*?)</h2>', r.text)
print(titles)

['霸王别姬 - Farewell My Concubine', '这个杀手不太冷 - Léon', '肖申克的救赎 - The Shawshank Redemption', '泰坦尼克号 - Titanic', '罗马假日 - Roman Holiday', '唐伯虎点秋香 - Flirting Scholar', '乱世佳人 - Gone with the Wind', '喜剧之王 - The King of Comedy', '楚门的世界 - The Truman Show', '狮子王 - The Lion King']
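
The regular expression can be tried offline on a small HTML fragment. The fragment below is a hypothetical stand-in shaped like the listing page's title markup, not the site's actual source.

```python
import re

# Hypothetical HTML fragment; the real page's markup may differ.
sample_html = '''
<a href="/detail/1"><h2 class="m-b-sm">霸王别姬 - Farewell My Concubine</h2></a>
<a href="/detail/2"><h2 class="m-b-sm">这个杀手不太冷 - Léon</h2></a>
'''

# <h2.*?> matches the opening tag together with any attributes;
# (.*?) lazily captures the text up to the first closing </h2>.
titles = re.findall(r'<h2.*?>(.*?)</h2>', sample_html)
print(titles)
```

Because . does not match newlines by default, a title that spans several lines would need the re.S flag.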


### Fetching binary data

In [5]:
import requests

r = requests.get('https://github.com/favicon.ico')
with open('E:/favicon.ico', 'wb') as f:
    f.write(r.content)

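r.content holds the response body as raw bytes (r.text is the decoded string), so writing it to a file in binary mode saves the icon as an image. Audio and video files can be fetched the same way.

### Adding headers

Every HTTP request carries request headers. They can be set with the headers parameter.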
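
For larger files such as audio or video, holding the whole body in memory may be undesirable. One possible approach is to stream the response and write it in chunks. This is a sketch, not the notebook's method; the helper names write_chunks and download and the 10-second timeout are assumptions made for illustration.

```python
import requests


def write_chunks(chunks, path):
    """Write an iterable of byte chunks to path; return the number of bytes written."""
    written = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            f.write(chunk)
            written += len(chunk)
    return written


def download(url, path, chunk_size=8192):
    """Stream the body at url to path without loading it all into memory."""
    # stream=True defers downloading the body until it is iterated.
    with requests.get(url, stream=True, timeout=10) as r:
        r.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
        return write_chunks(r.iter_content(chunk_size=chunk_size), path)
```

Calling download('https://github.com/favicon.ico', 'favicon.ico') would save the same icon as the cell above.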

In [7]:
import requests


headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
r = requests.get('https://static1.scrape.cuiqingcai.com/', headers=headers)
# print(r.text)
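
To check what a custom header looks like on the outgoing request, the request can be prepared without being sent. A small sketch; the User-Agent string here is a placeholder, not a real browser value.

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0 (illustrative value)'}

# Build the request without sending it, then inspect the outgoing headers.
prepared = requests.Request('GET', 'http://httpbin.org/get', headers=headers).prepare()
print(prepared.headers['User-Agent'])

# Header names are looked up case-insensitively.
print(prepared.headers['user-agent'])

# Without a custom header, requests identifies itself like this:
default_ua = requests.utils.default_user_agent()
print(default_ua)
```

The default value is the python-requests/x.y.z string visible in the httpbin outputs earlier, which some sites reject; that is the usual reason for overriding it.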

## POST requests

In [8]:
import requests

data = {'name': 'lin', 'age': '27'}
r = requests.post("http://httpbin.org/post", data=data)
print(r.text)

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "age": "27", 
    "name": "lin"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "15", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.24.0", 
    "X-Amzn-Trace-Id": "Root=1-5f4b4302-ab2a5a923e96ffd793b0dcef"
  }, 
  "json": null, 
  "origin": "114.84.169.66", 
  "url": "http://httpbin.org/post"
}
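
In the output above the data landed in form and the Content-Type was application/x-www-form-urlencoded, because data= sends a form-encoded body. requests also accepts a json= argument that sends a JSON body instead. The sketch below compares the two by preparing the requests without sending them.

```python
import json

import requests

payload = {'name': 'lin', 'age': '27'}
url = 'http://httpbin.org/post'

# data= form-encodes the dict; json= serializes it as JSON.
form_req = requests.Request('POST', url, data=payload).prepare()
json_req = requests.Request('POST', url, json=payload).prepare()

print(form_req.headers['Content-Type'], form_req.body)
print(json_req.headers['Content-Type'], json.loads(json_req.body))
```

With json=, httpbin would report the payload under its json field rather than form.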



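## Responses

So far we have used text and content to read the response body. Many other attributes and methods expose further information, such as the status code, response headers, and cookies.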

In [10]:
import requests

r = requests.get('https://static1.scrape.cuiqingcai.com/')
print(type(r.status_code), r.status_code)   # status code
print(type(r.headers), r.headers)  # response headers
print(type(r.cookies), r.cookies)  # cookies
print(type(r.url), r.url)  # URL
print(type(r.history), r.history) # request history (redirects)

<class 'int'> 200
<class 'requests.structures.CaseInsensitiveDict'> {'Server': 'nginx/1.17.8', 'Date': 'Sun, 30 Aug 2020 06:14:20 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'X-Frame-Options': 'DENY', 'X-Content-Type-Options': 'nosniff', 'Strict-Transport-Security': 'max-age=15724800; includeSubDomains', 'Content-Encoding': 'gzip'}
<class 'requests.cookies.RequestsCookieJar'> <RequestsCookieJar[]>
<class 'str'> https://static1.scrape.cuiqingcai.com/
<class 'list'> []


requests also provides a built-in status code lookup object, requests.codes.

In [13]:
import requests

r = requests.get('https://static1.scrape.cuiqingcai.com/')
# Exit unless the status code equals requests.codes.ok (200)
if r.status_code != requests.codes.ok:
    exit()
print('Request Successfully')

Request Successfully


**Status codes and their corresponding lookup names:**

# Advanced usage

## File upload
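
A few of these lookup names can be printed directly from requests.codes. The selection below is illustrative and far from the full table.

```python
import requests

# Each name is an attribute of requests.codes holding the numeric status code.
names = ('ok', 'created', 'moved_permanently', 'not_found', 'internal_server_error')
for name in names:
    print(name, getattr(requests.codes, name))
```

Comparing against these names, as in r.status_code == requests.codes.not_found, tends to read better than bare numbers.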

In [15]:
import requests

# Open the file in a with block so it is closed after the upload
with open('E:/favicon.ico', 'rb') as f:
    files = {'file': f}
    r = requests.post('http://httpbin.org/post', files=files)
# print(r.text)
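
Passing files= makes requests encode the body as multipart/form-data. The sketch below prepares such a request without sending it, using a few in-memory bytes as a stand-in for a real file. The filename, bytes, and content type are placeholders.

```python
import requests

# In-memory stand-in for a real file: (filename, content, content type).
files = {'file': ('favicon.ico', b'\x00\x00\x01\x00', 'image/x-icon')}

prepared = requests.Request('POST', 'http://httpbin.org/post', files=files).prepare()

# The Content-Type carries a generated multipart boundary.
print(prepared.headers['Content-Type'])
# The filename travels in the part's Content-Disposition header.
print(b'filename="favicon.ico"' in prepared.body)
```

The dictionary key ('file' here) becomes the form field name that the server sees.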

## Cookies

In [16]:
import requests

r = requests.get('http://www.baidu.com')
print(r.cookies)
for key, value in r.cookies.items():
    print(key + '=' + value)

<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
BDORZ=27315


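Example: using a cookie directly to stay logged in. First log in to GitHub and copy the Cookie value from the request headers in the browser. Then put your own cookie into the headers and send the request.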

In [19]:
import requests

headers = {
    'Cookie':'_gh_sess=504H5EmtQ0KO4CEcBPd9zp2Dh859H7sL4SMcXKh1X%2FBQBDbNRB7szpki2mpFL1%2BWXrzhcAS9dhbWO2%2B4uJJGCartkm9S4el5zObLYtmFPgpSbfwP5km8yL440BRbBovtGyoqBzlulidoiOCEFqVEKwKxFcWJcKXvNtwqcuSnR7i26APuHmuEdGM8ur3uiy23w6rfMN3d4u7zPIlVXSjeEG%2FSoOCKGuNeV48iwrkAKktsC%2BYELMgLIBP7PwqjRBe1zsbnyGIeKdwKQC%2B%2Bz%2BU5l0vTx%2FdWHSeTQXaNl4OOLmGs6sxUFaaDbiTshVUan9y0%2FjbZEAjyeWfab1c%2FYNpx2VfOhC%2Bg3ZCKBOrkirWrKzUIpWpkrIdmG7nCy6RUCgSJROMFHyL%2BPXk6armpXtL37McZlyd4qx%2Fno4g33K%2Bi%2BkyHLvscW%2Fay0XYH74Y9XLZ%2BVfxUFuJQEztSd%2BH9%2BegjU9k5b5ZpsobbUVhfh8oL10VVi7VYhGI2rgGGxqvMztBIGRS2dXBb1RvWwLR5QLcAVGWAKp0zOVN0hZnUDya0b7qnrhGoVHDeIYvosNgOWjdx088AV89X8coZyQZ0gjBgfxI8dQFIq08qdcpKqZ3WWGmZSmxrvJUuvkTdEK%2BqSsHiULXpI%2B1tdFFtezrGyyQ6ilH0Qimiux9xBiCk5xow4ZzDTBj0cD7400LF2Qpgzjoz--SwEzziq7am4ijVOQ--3M4XpxgeOLbZvl3uxzSvUw%3D%3D; path=/; secure; HttpOnly; SameSite=Lax',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
}
r = requests.get('https://github.com/', headers=headers)
# print(r.text)
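
Instead of writing the Cookie header by hand, cookies can also be passed through the cookies parameter, which builds the header for you. A minimal sketch with a placeholder cookie name and value, prepared but not sent:

```python
import requests

# Placeholder cookie; a real session cookie would be copied from the browser.
cookies = {'session_id': 'abc123'}

prepared = requests.Request('GET', 'https://github.com/', cookies=cookies).prepare()
print(prepared.headers['Cookie'])
```

This prints session_id=abc123, the same name=value form the browser sends.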

## Session persistence

In [20]:
import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)

{
  "cookies": {
    "number": "123456789"
  }
}
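
A Session carries more than cookies set by the server: default headers and cookies stored on the session are merged into every request made through it. The sketch below shows that merge offline by preparing a request through the session. The User-Agent value is illustrative.

```python
import requests

s = requests.Session()
s.headers.update({'User-Agent': 'my-crawler/0.1'})  # illustrative value
s.cookies.set('number', '123456789')

# prepare_request merges session-level headers and cookies into the request.
prepared = s.prepare_request(requests.Request('GET', 'http://httpbin.org/cookies'))
print(prepared.headers['User-Agent'])
print(prepared.headers['Cookie'])
```

A plain requests.get call starts from scratch each time, which is why the cookie set in one call would not be visible in the next without a Session.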



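## SSL certificate verification

In a browser, certificate verification can be bypassed through certain settings. In requests, the verify parameter controls whether certificates are verified: setting it to False skips verification for that request. If verify is omitted, it defaults to True and certificates are verified automatically.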

In [24]:
import requests
import urllib3

# Suppress the InsecureRequestWarning that verify=False would otherwise trigger
urllib3.disable_warnings()
response = requests.get('https://static1.scrape.cuiqingcai.com/', verify=False)
print(response.status_code)

200


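## Timeouts

When the local network is poor, or the server is slow or unresponsive, a request may wait a long time for a response, or eventually fail without one. To avoid waiting indefinitely, set a timeout in seconds: if no response arrives within that time, an exception is raised.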

In [25]:
import requests

r = requests.get('https://httpbin.org/get', timeout=1)
print(r.status_code)

200
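
Since a timeout surfaces as an exception, a crawler usually wants to catch it. The helper below is one way to do that; the name fetch_status and the default timeout values are assumptions for illustration. timeout may also be given as a (connect, read) tuple.

```python
import requests


def fetch_status(url, timeout=(3.05, 10)):
    """Return the HTTP status code, or None if the request times out or fails.

    timeout is a single number or a (connect, read) tuple, in seconds.
    """
    try:
        r = requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Covers both connect timeouts and read timeouts.
        print('Timed out:', url)
        return None
    except requests.exceptions.RequestException as exc:
        # Any other requests error, such as a refused connection.
        print('Request failed:', exc)
        return None
    return r.status_code
```

For example, fetch_status('https://httpbin.org/get', timeout=1) returns the status code when the server answers within one second and None otherwise.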
