## Using urllib

In [1]:
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(type(response))
# Note: response is an http.client.HTTPResponse object with methods such as read() and readinto().
print(response.read().decode('utf-8')[:500])

<class 'http.client.HTTPResponse'>
<!doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->

<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jqu


In [2]:
response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

200
[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'DENY'), ('Via', '1.1 vegur'), ('Via', '1.1 varnish'), ('Content-Length', '48360'), ('Accept-Ranges', 'bytes'), ('Date', 'Thu, 25 Jul 2019 16:46:21 GMT'), ('Via', '1.1 varnish'), ('Age', '1640'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2141-IAD, cache-sin18027-SIN'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '3, 15'), ('X-Timer', 'S1564073181.336745,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
nginx
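
The cells above hard-code `utf-8` when decoding the body, while the `Content-Type` header already names the charset. A minimal sketch of reading it from the headers follows; `decode_body` is a hypothetical helper, and the `data:` URL is only there so the example runs without network access.

```python
import urllib.request

def decode_body(response, default="utf-8"):
    """Decode a response body using the charset from Content-Type, if present."""
    # response.headers is an email.message.Message, which can parse the charset.
    charset = response.headers.get_content_charset() or default
    return response.read().decode(charset)

# Offline demonstration: urlopen() also understands data: URLs.
# The payload is the single byte 0xE9, which is not valid UTF-8 on its own.
with urllib.request.urlopen("data:text/plain;charset=iso-8859-1,caf%E9") as resp:
    text = decode_body(resp)

print(ascii(text))
```

With a real page, the same helper would be called as `decode_body(urllib.request.urlopen('https://www.python.org'))`.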


In [3]:
# urlopen() covers the most basic case: fetching a simple page with a GET request.
help(urllib.request.urlopen)

Help on function urlopen in module urllib.request:

urlopen(url, data=None, timeout=<object object at 0x000001997FA4D5B0>, *, cafile=None, capath=None, cadefault=False, context=None)
    Open the URL url, which can be either a string or a Request object.
    
    *data* must be an object specifying additional data to be sent to
    the server, or None if no such data is needed.  See Request for
    details.
    
    urllib.request module uses HTTP/1.1 and includes a "Connection:close"
    header in its HTTP requests.
    
    The optional *timeout* parameter specifies a timeout in seconds for
    blocking operations like the connection attempt (if not specified, the
    global default timeout setting will be used). This only works for HTTP,
    HTTPS and FTP connections.
    
    If *context* is specified, it must be a ssl.SSLContext instance describing
    the various SSL options. See HTTPSConnection for more details.
    
    The optional *cafile* and *capath* parameters specify a se

- data parameter (optional). If supplied, it must be of type bytes, so a str has to be converted first, for example with bytes(). Once this parameter is passed, the request is sent as POST instead of GET.
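
The switch from GET to POST can be checked without sending anything, by building `Request` objects and asking them for their method. This is only a sketch; the httpbin.org URLs are the ones used elsewhere in this notebook and are never contacted here.

```python
import urllib.parse
import urllib.request

# urlencode() returns a str; urlopen() and Request need bytes for data.
form = urllib.parse.urlencode({"word": "hello"})
data = form.encode("utf-8")

get_req = urllib.request.Request("http://httpbin.org/get")
post_req = urllib.request.Request("http://httpbin.org/post", data=data)

# With no explicit method, the presence of data decides between GET and POST.
print(get_req.get_method())
print(post_req.get_method())
```

`form.encode("utf-8")` is equivalent to `bytes(form, encoding="utf8")` as used in the next cell.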

In [4]:
import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'word':'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "word": "hello"\n  }, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Content-Length": "10", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.6"\n  }, \n  "json": null, \n  "origin": "119.95.60.62, 119.95.60.62", \n  "url": "https://httpbin.org/post"\n}\n'


- timeout parameter
timeout sets the timeout in seconds: if no response has arrived within that time, an exception is raised.

In [5]:
import urllib.request
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
    print(response.read())
except Exception as e:
    print(e)
# The timeout is set to 0.1 seconds here, so the request times out.

<urlopen error timed out>


In [6]:
# Skip a page that takes too long to respond; a try/except statement is enough for this.
import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

TIME OUT
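
A timeout does not always arrive wrapped in `URLError`: if the connection succeeds but the server is slow to answer, `socket.timeout` can be raised directly while the response is being read. The sketch below handles both cases; `fetch_or_skip` is a hypothetical helper name, not part of urllib.

```python
import socket
import urllib.error
import urllib.request

def fetch_or_skip(url, timeout):
    """Return the response body, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # Timeout while connecting: wrapped in URLError, details in e.reason.
        if isinstance(e.reason, socket.timeout):
            return None
        raise  # any other URL error is not a timeout, so let it propagate
    except socket.timeout:
        # Timeout while waiting for or reading the response: raised unwrapped.
        return None
```

A crawler loop could then call `fetch_or_skip('http://httpbin.org/get', timeout=0.1)` and move on to the next URL whenever it gets `None`.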


### The Request class

In [7]:
import urllib.request

request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8')[:500])

<!doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->

<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jqu


As you can see, we still call urlopen() to send the request, but this time its argument is a Request object rather than a URL.

In [8]:
# Build a request from several arguments
from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Host': 'httpbin.org'
}
# Named form rather than dict, so the built-in dict type is not shadowed
form = {
    'name':'Germey'
}

data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "Germey"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
  }, 
  "json": null, 
  "origin": "119.95.60.62, 119.95.60.62", 
  "url": "https://httpbin.org/post"
}



The output shows that data, headers, and method were all set successfully.

Headers can also be added with the add_header() method:

In [9]:
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent',"Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
print(req)

<urllib.request.Request object at 0x0000019903EFD4E0>
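
Printing the `Request` only shows its repr, so here is a small sketch of how its contents can be inspected before sending. One detail worth knowing: `Request` stores header names with `str.capitalize()`, so `User-Agent` is kept as `User-agent`. The URL and form data repeat the values from the cells above; nothing is sent.

```python
from urllib import request

req = request.Request("http://httpbin.org/post", data=b"name=Germey", method="POST")
req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")

print(req.full_url)        # the target URL
print(req.get_method())    # the HTTP method that will be used
print(req.data)            # the request body as bytes
print(req.header_items())  # list of (name, value) pairs

# Lookups must use the capitalised form of the header name.
print(req.has_header("User-agent"))
print(req.has_header("User-Agent"))
```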


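### Advanced usage
More advanced operations, such as handling cookies or configuring a proxy, are done with Handlers.

The BaseHandler class in the urllib.request module is the parent class of all other Handlers.

Another important class is OpenerDirector, which we can simply call an Opener. urlopen() is in fact a thin wrapper around a default Opener that urllib builds for us. An Opener has an open() method, and what it returns is the same kind of object that urlopen() returns.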
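
As a rough illustration of how Handlers and an Opener fit together, the sketch below builds an Opener with one extra Handler and lists what it contains. The `data:` URL is an assumption made so that `open()` can be exercised without network access; the User-Agent value is a placeholder.

```python
import http.cookiejar
import urllib.request

# build_opener() combines the default handlers with the ones passed in.
cookie_handler = urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar())
opener = urllib.request.build_opener(cookie_handler)

# Headers listed in addheaders are sent with every HTTP request from this opener.
opener.addheaders = [("User-Agent", "Mozilla/5.0")]

print(type(opener).__name__)
print([type(h).__name__ for h in opener.handlers])

# open() is used exactly like urlopen().
with opener.open("data:text/plain,hello") as resp:
    body = resp.read()
print(body)
```

Calling `urllib.request.install_opener(opener)` afterwards would make plain `urlopen()` calls go through this Opener as well.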

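- Authentication
Some sites pop up a prompt as soon as they are opened, asking for a username and password; the page can only be viewed once authentication succeeds.

    Such pages can be handled with HTTPBasicAuthHandler.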
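
To make the mechanism less opaque, here is a sketch of the two pieces involved: the password manager lookup, and the `Authorization` header that HTTP Basic authentication boils down to. The credentials and URL are the placeholders used in this notebook; the `/dashboard` path and the realm string are made up for illustration.

```python
import base64
from urllib.request import HTTPPasswordMgrWithDefaultRealm

username = "username"
password = "password"
url = "http://localhost:5000"

manager = HTTPPasswordMgrWithDefaultRealm()
# realm=None registers these credentials as the default for any realm under url.
manager.add_password(None, url, username, password)

# The handler performs a lookup like this when the server answers 401.
found = manager.find_user_password("any realm", url + "/dashboard")
print(found)

# Basic auth sends "user:password", base64-encoded, in the Authorization header.
token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
authorization = "Basic " + token
print(authorization)
```

Because base64 is an encoding and not encryption, Basic authentication is only safe over HTTPS.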

In [10]:
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000'

p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html[:500])
except URLError as e:
    print(e.reason)

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Dashboard - pyspider</title>
    <!--[if lt IE 9]>
      <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
    <![endif]-->

    <meta name="description" content="pyspider dashboard">
    <meta name="author" content="binux">
    <link href="//cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/3.1.1/css/bootstrap.min.css" rel="stylesheet">
    <link href="//cdnjs.cloudflare.com/ajax/libs/x-edita


- Proxies
Crawlers inevitably need proxies. A proxy can be added like this:

In [11]:
from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy_handler = ProxyHandler({
    'http':'http://127.0.0.1:9743',
    'https':'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

[WinError 10061] No connection could be made because the target machine actively refused it


This requires a proxy set up locally and listening on port 9743.

The error above occurs because no proxy is running, so the connection is refused.
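
One way to switch between a proxied and a direct connection is sketched below. `opener_for` is a hypothetical helper, and it assumes the local proxy speaks plain HTTP for both schemes. Passing an empty dict to `ProxyHandler` disables the proxies that urllib would otherwise pick up from the environment.

```python
from urllib.request import OpenerDirector, ProxyHandler, build_opener, getproxies

def opener_for(proxy_url=None):
    """Return an opener that uses proxy_url for http/https, or no proxy at all."""
    if proxy_url is None:
        handler = ProxyHandler({})  # empty mapping: ignore environment proxies
    else:
        handler = ProxyHandler({"http": proxy_url, "https": proxy_url})
    return build_opener(handler)

proxied = opener_for("http://127.0.0.1:9743")
direct = opener_for()

# Proxies detected from the environment (often an empty dict).
print(getproxies())
# The mapping actually configured on the proxied opener.
print([h.proxies for h in proxied.handlers if isinstance(h, ProxyHandler)])
```

If the proxy is not running, `direct.open(...)` can serve as a fallback in the `except URLError` branch.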

- Cookies

In [12]:
import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
# Cookies can only be read after connecting to the server
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + ' = '+item.value)

BAIDUID = 3D66A92FDD4BE44C3FDFA49DFDD7687D:FG=1
BIDUPSID = 3D66A92FDD4BE44C3FDFA49DFDD7687D
H_PS_PSSID = 1453_21106_18559_29519_28519_29098_29568_28833_29220_26350_29072_22158
PSTM = 1564073187
delPer = 0
BDSVRTM = 0
BD_HOME = 0


In [13]:
# Save cookies as a text file in the Mozilla (Netscape) format
filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

In [14]:
!more cookies.txt

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com      TRUE    /       FALSE   3711556831      BAIDUID 3D66A92FDD4BE44CCADC58E49A51D138:FG=1
.baidu.com      TRUE    /       FALSE   3711556831      BIDUPSID        3D66A92FDD4BE44CCADC58E49A51D138
.baidu.com      TRUE    /       FALSE           H_PS_PSSID      1993_1461_21081_18559_29521_28518_29099_29568_28839_29220
.baidu.com      TRUE    /       FALSE   3711556831      PSTM    1564073187
.baidu.com      TRUE    /       FALSE           delPer  0
www.baidu.com   FALSE   /       FALSE           BDSVRTM 0
www.baidu.com   FALSE   /       FALSE           BD_HOME 0
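
Some rows in the file above have an empty expiry column; those are session cookies, and they are only written because `ignore_discard=True` was passed to `save()`. The sketch below reproduces this offline with two hand-built cookies; `make_cookie`, `saved_names`, the cookie names, and the `.example.com` domain are all made up for illustration.

```python
import http.cookiejar
import os
import tempfile

def make_cookie(name, value, domain, expires=None):
    """Build a Cookie by hand; expires=None makes it a session cookie."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=False, expires=expires, discard=expires is None,
        comment=None, comment_url=None, rest={},
    )

def saved_names(jar_cls, ignore_discard):
    """Save two cookies with jar_cls, reload the file, return the names found."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cookies.txt")
        jar = jar_cls(path)
        jar.set_cookie(make_cookie("session_id", "abc", ".example.com"))
        # 4102444800 is 2100-01-01 00:00:00 UTC, i.e. a persistent cookie.
        jar.set_cookie(make_cookie("lang", "en", ".example.com", expires=4102444800))
        jar.save(ignore_discard=ignore_discard, ignore_expires=True)

        loaded = jar_cls()
        loaded.load(path, ignore_discard=True, ignore_expires=True)
        return sorted(cookie.name for cookie in loaded)

for flag in (False, True):
    print(flag, saved_names(http.cookiejar.MozillaCookieJar, flag))
```

The same helper accepts `http.cookiejar.LWPCookieJar`, since both classes share the `save()` and `load()` interface.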


LWPCookieJar can also read and save cookies, but its format differs from that of MozillaCookieJar: it writes a cookie file in libwww-perl (LWP) format.

In [15]:
# Save cookies in LWP format
filename = 'lwpcookies.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

In [16]:
!more lwpcookies.txt

#LWP-Cookies-2.0
Set-Cookie3: BAIDUID="3D66A92FDD4BE44C3E5A11C617521CBD:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2087-08-12 20:00:31Z"; version=0
Set-Cookie3: BIDUPSID=3D66A92FDD4BE44C3E5A11C617521CBD; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2087-08-12 20:00:31Z"; version=0
Set-Cookie3: H_PS_PSSID=1431_21090_29519_28519_29099_29568_28838_29221; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: PSTM=1564073187; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2087-08-12 20:00:31Z"; version=0
Set-Cookie3: delPer=0; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: BDSVRTM=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0
Set-Cookie3: BD_HOME=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0


In [22]:
# How to use saved cookies
cookie = http.cookiejar.LWPCookieJar()
cookie.load('lwpcookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8')[:40])

<!DOCTYPE html>
<!--STATUS OK-->






### Handling exceptions