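# Accessing Web Pages with Python and the HTTP Protocol
## I. The Web and the HTTP Protocol
- 1990: Tim Berners-Lee, inventor of the World Wide Web (WWW), carried out the first communication with a server over the HTTP protocol.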
- 1994: Netscape released the Mosaic Netscape 0.9 beta browser, later renamed Netscape Navigator.

![](images/web.jpg)

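### 1. HTTP Basics
HTTP (Hypertext Transfer Protocol) is an application-layer protocol built on top of TCP/IP. It is not concerned with how packets are transported; it mainly defines the format of the messages exchanged between client and server, and uses port 80 by default.
HTTPS (Hypertext Transfer Protocol Secure) communicates over HTTP but encrypts the traffic with SSL/TLS, and uses port 443 by default.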
#### URL
A URL (Uniform Resource Locator) uniquely identifies a resource on the network, for example: http://quote.eastmoney.com/sh600019.html
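
As a small illustration, the standard library's `urllib.parse` can split a URL into its components. The helper `describe_url` and its fallback port table are assumptions made for this sketch, not part of the original notes.

```python
from urllib.parse import urlsplit

# Default ports for the http and https schemes.
DEFAULT_PORTS = {'http': 80, 'https': 443}


def describe_url(url: str) -> dict:
    """Split a URL into scheme, host, port, path and query (hypothetical helper)."""
    parts = urlsplit(url)
    # parts.port is None when the URL does not name a port explicitly,
    # so fall back to the scheme's default.
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return {
        'scheme': parts.scheme,
        'host': parts.hostname,
        'port': port,
        'path': parts.path,
        'query': parts.query,
    }


info = describe_url('http://quote.eastmoney.com/sh600019.html')
print(info)
```
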
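#### HTTP Requests and Responses
- An HTTP request consists of three parts: the request line, the message headers, and the request body. Common request methods:
    - GET
    - POST
- An HTTP response consists of three parts: the status line, the message headers, and the response body. Common status codes: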
    - 200 OK
    - 400 Bad Request
    - 401 Unauthorized
    - 403 Forbidden
    - 404 Not Found
    - 500 Internal Server Error
    - 503 Service Unavailable
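
To make the three-part layout concrete, the sketch below assembles a minimal request as text and splits a hand-written response back into status line, headers, and body. Both messages are invented for illustration; `build_request` and `parse_response` are hypothetical helpers, and real code would normally leave this work to a library such as `urllib` or `requests`.

```python
from http import HTTPStatus


def build_request(method: str, path: str, host: str, body: str = '') -> str:
    """Assemble request line + headers + blank line + body."""
    lines = [f'{method} {path} HTTP/1.1', f'Host: {host}']
    if body:
        lines.append(f'Content-Length: {len(body.encode("utf-8"))}')
    # An empty line separates the headers from the body.
    return '\r\n'.join(lines) + '\r\n\r\n' + body


def parse_response(raw: str):
    """Split a response into (status code, reason, headers, body)."""
    head, _, body = raw.partition('\r\n\r\n')
    status_line, *header_lines = head.split('\r\n')
    _version, code, reason = status_line.split(' ', 2)
    headers = dict(line.split(': ', 1) for line in header_lines)
    return int(code), reason, headers, body


print(build_request('GET', '/sh600019.html', 'quote.eastmoney.com'))

raw = 'HTTP/1.1 404 Not Found\r\nContent-Type: text/plain\r\n\r\nmissing'
code, reason, headers, body = parse_response(raw)
# HTTPStatus maps a numeric code to its standard reason phrase.
print(code, reason, HTTPStatus(code).phrase, headers, body)
```
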

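### 2. HTML, CSS, and JavaScript
The body of an HTTP response is usually one of two kinds: text content, such as HTML, CSS, and JavaScript files, or any other text format such as XML and JSON; and binary content, such as images and other arbitrary binary files.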
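
A client usually decides how to treat the body from the `Content-Type` response header. The following is a minimal sketch of that decision using only the standard library; the function name `decode_body`, the list of text-like types, and the UTF-8 fallback are assumptions for illustration.

```python
import json
from email.message import Message

# Media types treated as text in this sketch (an assumed, incomplete list).
TEXT_TYPES = ('application/xml', 'application/javascript')


def decode_body(content_type: str, raw: bytes):
    """Return parsed data for JSON, str for text-like bodies, bytes otherwise."""
    # email.message.Message can parse a Content-Type header value.
    msg = Message()
    msg['Content-Type'] = content_type
    media_type = msg.get_content_type()             # e.g. 'text/html'
    charset = msg.get_content_charset() or 'utf-8'  # assumed fallback
    if media_type == 'application/json':
        return json.loads(raw.decode(charset))
    if media_type.startswith('text/') or media_type in TEXT_TYPES:
        return raw.decode(charset)
    return raw  # images and other binary content stay as bytes


print(decode_body('text/html; charset=utf-8', '<p>你好</p>'.encode('utf-8')))
print(decode_body('application/json; charset=utf-8', b'{"code": "600019"}'))
print(decode_body('image/png', b'\x89PNG'))
```
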

![](images/web.png)

Of these, HTML, CSS, and JavaScript are the core content and technologies for building web pages.

![](images/html.png)

## II. Accessing Web Pages with Python

### 1. Using urllib

In [1]:
from urllib import request  # this module was called urllib2 in Python 2
resp = request.urlopen('https://www.baidu.com')
print(resp.geturl(), resp.status, resp.reason, sep=' ')
print(resp.read().decode('utf-8'))

https://www.baidu.com 200 OK
<html>
<head>
	<script>
		location.replace(location.href.replace("https://","http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>


In [2]:
resp = request.urlopen('http://quote.eastmoney.com/newapi/getlrqs?code=600019')
print('Status:', resp.status, resp.reason)

Status: 200 OK


In [3]:
{k: v for k, v in resp.getheaders()}

{'Date': 'Thu, 03 Jun 2021 13:56:31 GMT',
 'Content-Type': 'application/json; charset=utf-8',
 'Content-Length': '657',
 'Connection': 'close',
 'Server': 'Tengine',
 'Age': '1',
 'X-Via': '1.1 PS-WNZ-01c1W35:12 (Cdn Cache Server V2.0), 1.1 PS-FOC-01SoY26:24 (Cdn Cache Server V2.0)',
 'X-Ws-Request-Id': '60b8df8f_oudxin23_26989-20527'}

In [4]:
import json
json.loads(resp.read().decode('utf-8'))

[{'code': '600019', 'Profit': 2725832313.54, 'RptDate': '2019-03-31T00:00:00'},
 {'code': '600019', 'Profit': 3613532099.7, 'RptDate': '2019-06-30T00:00:00'},
 {'code': '600019', 'Profit': 2679718306.09, 'RptDate': '2019-09-30T00:00:00'},
 {'code': '600019', 'Profit': 3543935484.89, 'RptDate': '2019-12-31T00:00:00'},
 {'code': '600019', 'Profit': 1540624304.54, 'RptDate': '2020-03-31T00:00:00'},
 {'code': '600019', 'Profit': 2461743613.88, 'RptDate': '2020-06-30T00:00:00'},
 {'code': '600019', 'Profit': 3856596954.59, 'RptDate': '2020-09-30T00:00:00'},
 {'code': '600019', 'Profit': 4817810871.38, 'RptDate': '2020-12-31T00:00:00'},
 {'code': '600019', 'Profit': 5359008715.48, 'RptDate': '2021-03-31T00:00:00'}]
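
The query string in the URL above can also be built programmatically, and a `Request` object lets you attach headers before sending. This sketch only constructs the request and does not contact the network; `build_profit_request` is a hypothetical helper and the `User-Agent` value is a placeholder.

```python
from urllib import parse, request

BASE_URL = 'http://quote.eastmoney.com/newapi/getlrqs'


def build_profit_request(code: str) -> request.Request:
    """Build (but do not send) a GET request for one stock code."""
    # urlencode escapes the parameters and joins them with '&'.
    query = parse.urlencode({'code': code})
    return request.Request(
        f'{BASE_URL}?{query}',
        headers={'User-Agent': 'Mozilla/5.0 (placeholder)'},
    )


req = build_profit_request('600019')
print(req.get_method(), req.full_url)

# Sending it would look like this (requires network access):
# with request.urlopen(req, timeout=10) as resp:
#     data = resp.read().decode('utf-8')
```

Passing `timeout` to `urlopen` is worth the habit: without it, a stalled server can block the call indefinitely.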

### 2. Using requests

In [5]:
import requests
resp = requests.get('https://www.baidu.com')
print(resp.url, resp.status_code, resp.reason, sep=' ')

https://www.baidu.com/ 200 OK


In [6]:
import json
resp = requests.get('http://quote.eastmoney.com/newapi/getlrqs?code=600019')
print('Status:', resp.status_code, resp.reason)
json.loads(resp.text)

Status: 200 OK


[{'code': '600019', 'Profit': 2725832313.54, 'RptDate': '2019-03-31T00:00:00'},
 {'code': '600019', 'Profit': 3613532099.7, 'RptDate': '2019-06-30T00:00:00'},
 {'code': '600019', 'Profit': 2679718306.09, 'RptDate': '2019-09-30T00:00:00'},
 {'code': '600019', 'Profit': 3543935484.89, 'RptDate': '2019-12-31T00:00:00'},
 {'code': '600019', 'Profit': 1540624304.54, 'RptDate': '2020-03-31T00:00:00'},
 {'code': '600019', 'Profit': 2461743613.88, 'RptDate': '2020-06-30T00:00:00'},
 {'code': '600019', 'Profit': 3856596954.59, 'RptDate': '2020-09-30T00:00:00'},
 {'code': '600019', 'Profit': 4817810871.38, 'RptDate': '2020-12-31T00:00:00'},
 {'code': '600019', 'Profit': 5359008715.48, 'RptDate': '2021-03-31T00:00:00'}]

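### 3. Using pyspider and scrapy
pyspider is a crawler framework written by a Chinese developer. It supports multithreading and dynamic JavaScript rendering, comes with a web UI, and works with several databases.
http://docs.pyspider.org/

scrapy is another crawler framework for Python.
https://scrapy-chs.readthedocs.io/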

```python
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2021-05-01 12:03:36
# Project: test

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
    }
    
    
    def __init__(self):
        super().__init__()
        self.start_url = 'https://www.baidu.com'
        self.request_headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.96 Safari/537.36'
        }         

    @every(minutes=24 * 60)
    def on_start(self):  
        save = {'id': 0}
        self.crawl(self.start_url, callback=self.index_page, headers=self.request_headers, save=save, retries=3)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):        
        response.save['id'] += 1 
        items = response.doc('a[href^="http"]').items()
        for item in items:
            self.crawl(item.attr.href, callback=self.detail_page, save=response.save, retries=3)
        return {'id': response.save['id'], 'url': response.url, 'status_code': response.status_code, 
                'title': response.doc('title').text()}

    @config(priority=2)
    def detail_page(self, response):        
        response.save['id'] += 1
        return {'id': response.save['id'], 'url': response.url, 'status_code': response.status_code, 
                'title': response.doc('title').text()}
```

### 4. Using pyquery and BeautifulSoup
- pyquery https://pythonhosted.org/pyquery/api.html
- BeautifulSoup https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
- xpath https://www.w3school.com.cn/xpath/xpath_syntax.asp
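
The crawler examples in these notes select links with the CSS selector `a[href^="http"]`. As a dependency-free approximation of that selector, the sketch below uses the standard library's `html.parser` instead of pyquery or BeautifulSoup; the class name `LinkCollector` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose href starts with 'http'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; value may be None.
        href = dict(attrs).get('href')
        if href and href.startswith('http'):
            self.links.append(href)


sample_html = '''
<html><head><title>Demo</title></head><body>
  <a href="https://www.baidu.com">absolute link</a>
  <a href="/relative/path">relative link</a>
  <a>no href</a>
</body></html>
'''

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```
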

### 5. Programming Practice: Try Fetching Financial Data Pages from Eastmoney and Sina Finance

## III. Code Reading: A Small Crawler Framework Built on requests

In [19]:
import re  # used by SearchCrawler.redirect_page
import time
from functools import wraps
from typing import Mapping, Dict, List, Callable, Optional, Any
import requests
from requests import Response
import pandas as pd
import logging
from pyquery import PyQuery

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(thread)d %(levelname)s %(module)s - %(message)s')
logger = logging.getLogger(__name__)


def retry(num: int):
    """
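    Retry decorator: if the decorated function raises, call it again up to num more times.
    If every attempt fails, the wrapper returns None.
    :param num: number of retries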
    """

    def retry_decorator(func: Callable):
        @wraps(func)
        def retry_wrapper(*args, **kwargs):
            for i in range(num + 1):
                if i > 0:
                    logger.info('Retry: %d/%d' % (i, num))
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    logger.error('Error: %s' % e, exc_info=True)
                    time.sleep(1)

        return retry_wrapper

    return retry_decorator


class CrawlerBase:
    """
    Base class of the crawler framework
    """

    def __init__(self):
        self.request_headers: Dict = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.96 Safari/537.36'
        }
        self.results: Optional[List] = []
        self.methods: Mapping[str, Callable] = {
            'get': self.get,
            'post': self.post
        }
        self.jobs = []

    @retry(3)
    def get(self, url: str, headers: Optional[Dict] = None, *args, **kwargs) -> Response:
        """
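        Send a GET request
        :param url: request URL
        :param headers: request headers
        :param args: positional arguments passed through to requests
        :param kwargs: keyword arguments passed through to requests
        :return: the Response object
        """
        _headers: Dict = headers.copy() if headers else {}
        # The instance-wide defaults are applied last, so they take precedence over per-call headers
        _headers.update(self.request_headers)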
        response: Response = requests.get(url, headers=_headers, *args, **kwargs)
        logger.info('[GET %d] %s' % (response.status_code, response.url))
        return response

    @retry(3)
    def post(self, url: str, data: Optional[Any] = None, headers: Optional[Dict] = None, *args, **kwargs) -> Response:
        """
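        Send a POST request
        :param url: request URL
        :param data: request body
        :param headers: request headers
        :param args: positional arguments passed through to requests
        :param kwargs: keyword arguments passed through to requests
        :return: the Response object
        """
        _headers: Dict = headers.copy() if headers else {}
        # The instance-wide defaults are applied last, so they take precedence over per-call headers
        _headers.update(self.request_headers)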
        response: Response = requests.post(url, data=data, headers=_headers, *args, **kwargs)
        logger.info('[POST %d] %s' % (response.status_code, response.url))
        return response

    def crawl(self, url: str, callback: Callable, save: Optional[Mapping] = None,
              method: Optional[str] = 'get', wait: int = 0, *args, **kwargs) -> None:
        """
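        Send a request and handle the response with a callback
        :param url: request URL
        :param callback: callback function
        :param save: data passed on to the callback
        :param method: request method, 'get' or 'post'; defaults to 'get'
        :param wait: seconds to wait before sending the request
        :param args: positional arguments passed through to requests
        :param kwargs: keyword arguments passed through to requests
        """

        if method not in self.methods.keys():
            raise ValueError('Method not allowed: %s' % method)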
        try:
            if url not in self.jobs:
                time.sleep(wait)
                response = self.methods[method](url, *args, **kwargs)
                self.jobs.append(url)
                callback(response, save=save)
        except Exception as e:
            logger.error('Crawl [%s] error' % url, exc_info=True)

    def save_data(self, record: Any) -> None:
        """
        Cache a single record in memory
        :param record: the record object
        """
        if self.results is None:
            self.results = []
        self.results.append(record)
        logger.debug('Saved: %s' % record)

    def to_excel(self, filename: Optional[str] = 'data', sheet_name: Optional[str] = 'data') -> None:
        """
        Export all cached records to an Excel file
        :param filename: file name without the extension
        :param sheet_name: name of the Excel worksheet
        """
        try:
            result_df = pd.DataFrame.from_records(data=self.results)
            result_df.to_excel("%s.xlsx" % filename, sheet_name=sheet_name, index=False)
            logger.info('Save to excel successfully')
        except Exception as e:
            logger.error('Save to excel failed', exc_info=True)
        return None
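
The `retry` decorator above logs each failure and returns `None` once every attempt has failed. The variant below is a self-contained sketch of the same idea that re-raises the last exception instead, so the caller can tell a failure from an empty result; the `delay` parameter and the `flaky` test function are additions made for this illustration.

```python
import time
from functools import wraps
from typing import Callable


def retry(num: int, delay: float = 0.0):
    """Retry the wrapped function up to `num` extra times, then re-raise."""

    def decorator(func: Callable):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(num + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == num:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay)

        return wrapper

    return decorator


calls = {'count': 0}


@retry(3)
def flaky():
    """Fail on the first two calls, succeed on the third."""
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('temporary failure')
    return 'ok'


result = flaky()
print(result, calls['count'])
```
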

### 1. Fetching All Pages Linked from the Baidu Home Page

In [20]:
class TestCrawler(CrawlerBase):

    def __init__(self):
        super().__init__()
        self.start_url = 'https://www.baidu.com'

    def run(self):
        save = {'id': 0}
        self.crawl(self.start_url, self.index_page, save=save, method='get')

    def index_page(self, response: Response, save: Optional[Mapping] = None):
        doc = PyQuery(response.text)
        save['id'] += 1
        self.save_data(
            {'id': save['id'], 'url': response.url, 'status_code': response.status_code, 'title': doc('title').text()})
        items = doc('a[href^="http"]').items()
        for item in items:
            self.crawl(item.attr.href, self.detail_page, save=save)

    def detail_page(self, response: Response, save: Optional[Mapping] = None):
        doc = PyQuery(response.text)
        save['id'] += 1
        self.save_data(
            {'id': save['id'], 'url': response.url, 'status_code': response.status_code, 'title': doc('title').text()})


In [None]:
c = TestCrawler()
c.run()
print(c.results)

### 2. Fetching Baidu Search Results

In [21]:
class SearchCrawler(CrawlerBase):

    def __init__(self, keys):
        super().__init__()
        self.keys = keys
        self.start_url = 'https://www.baidu.com/s?wd=%s%%20site:open.163.com'
        self.flv_url = 'http://www.flvcd.com/parse.php?format=&kw=%s'
        self.open_url = 'https://open.163.com/newview/movie/free?pid=%s'
        self.request_headers['Accept'] = '*/*'
        self.request_headers['Accept-Encoding'] = 'gzip, deflate, br'
        self.request_headers['Accept-Language'] = 'zh-CN,zh;q=0.9,en;q=0.8'

    def run(self):
        for key in self.keys:
            self.crawl(self.start_url % key, self.search_page, method='get', wait=3)

    def search_page(self, response: Response, save: Optional[Mapping] = None):
        logger.info('Get search page: %d %s', response.status_code, response.url)
        doc = PyQuery(response.text)
        items = [(item('a').attr('href'), item('div.c-span-last').text()) for item in
                 doc('div.c-container div.c-gap-top-small').items()]
        if len(items) > 0:
            self.crawl(items[0][0], self.redirect_page, method='get', allow_redirects=False, wait=3)

    def redirect_page(self, response: Response, save: Optional[Mapping] = None):
        logger.info('Get redirect page: %d %s', response.status_code, response.url)
        redirect_url = ''
        if response.status_code == 302:
            redirect_url = response.headers.get('location')
        elif response.status_code == 200:
            match = re.search(r'URL=\'(.*?)\'', response.text, re.S)
            # The page may not contain a refresh URL at all
            if match:
                redirect_url = match.group(1)
        if redirect_url:
            self.crawl(self.flv_url % redirect_url, self.video_page, method='get')

    def video_page(self, response: Response, save: Optional[Mapping] = None):
        logger.info('Get video page: %d %s', response.status_code, response.url)
        doc = PyQuery(response.text)
        items = [item.attr('href') for item in doc('a[href^="http"].link').items()]
        if len(items) > 0 and items[0]:
            video_url = items[0]
            self.save_data(video_url)
            logger.info('Finally video url is: %s', video_url)

In [22]:
c = SearchCrawler(['如何掌控你的自由时间', '阅读全世界', '如何做得更好'])
c.run()
print(c.results)

2021-06-04 08:58:12,253 4502040064 DEBUG connectionpool - Starting new HTTPS connection (1): www.baidu.com:443
2021-06-04 08:58:12,437 4502040064 DEBUG connectionpool - https://www.baidu.com:443 "GET /s?wd=%E5%A6%82%E4%BD%95%E6%8E%8C%E6%8E%A7%E4%BD%A0%E7%9A%84%E8%87%AA%E7%94%B1%E6%97%B6%E9%97%B4%20site:open.163.com HTTP/1.1" 200 None
2021-06-04 08:58:13,143 4502040064 INFO <ipython-input-19-4da83ac8ec60> - [GET 200] https://www.baidu.com/s?wd=%E5%A6%82%E4%BD%95%E6%8E%8C%E6%8E%A7%E4%BD%A0%E7%9A%84%E8%87%AA%E7%94%B1%E6%97%B6%E9%97%B4%20site:open.163.com
2021-06-04 08:58:13,145 4502040064 INFO <ipython-input-21-71bd470c92a0> - Get search page: 200 https://www.baidu.com/s?wd=%E5%A6%82%E4%BD%95%E6%8E%8C%E6%8E%A7%E4%BD%A0%E7%9A%84%E8%87%AA%E7%94%B1%E6%97%B6%E9%97%B4%20site:open.163.com
2021-06-04 08:58:16,185 4502040064 DEBUG connectionpool - Starting new HTTP connection (1): www.baidu.com:80
2021-06-04 08:58:16,233 4502040064 DEBUG connectionpool - http://www.baidu.com:80 "GET /link?url=GMi

['http://mov.bn.netease.com/open-movie/nos/mp4/2017/01/03/SC8U8K7BC_sd.mp4', 'http://mov.bn.netease.com/open-movie/nos/mp4/2015/12/08/SB9D26LE6_sd.mp4', 'http://mov.bn.netease.com/open-movie/nos/mp4/2017/02/10/SCC01VIDC_sd.mp4']
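
The `redirect_page` callback above has to cope with two kinds of redirect: an HTTP 302 with a `Location` header, and a 200 page whose body carries a refresh URL. The sketch below isolates that logic in a pure function so it can be exercised without network access; `extract_redirect` and the sample inputs are hypothetical.

```python
import re
from typing import Mapping, Optional

# Matches URL='...' as it appears in a meta-refresh style page.
REFRESH_RE = re.compile(r"URL='(.*?)'", re.S)


def extract_redirect(status_code: int, headers: Mapping[str, str],
                     body: str) -> Optional[str]:
    """Return the redirect target, or None when there is none."""
    if status_code == 302:
        # requests' header mapping is case-insensitive; a plain dict,
        # as used in this sketch, is not, so the key must be lowercase.
        return headers.get('location')
    if status_code == 200:
        match = REFRESH_RE.search(body)
        # Guard against pages that contain no refresh URL at all.
        return match.group(1) if match else None
    return None


print(extract_redirect(302, {'location': 'https://open.163.com/a'}, ''))
page = '<meta http-equiv="refresh" content="0;URL=\'https://open.163.com/b\'">'
print(extract_redirect(200, {}, page))
print(extract_redirect(200, {}, '<html></html>'))
```
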


## IV. Programming Practice: Use the Crawler Framework to Fetch Profit Data of Listed Companies
- http://quote.eastmoney.com/newapi/getlrqs?code=600019
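
As a possible starting point for the exercise, the sketch below processes a response of the shape shown earlier for this endpoint. The two records are copied from the sample output in these notes; `parse_profits` is a hypothetical helper, and fetching the JSON with the crawler framework is left to the reader.

```python
import json
from datetime import datetime

# Two records copied from the sample output earlier in these notes.
sample = '''[
 {"code": "600019", "Profit": 2725832313.54, "RptDate": "2019-03-31T00:00:00"},
 {"code": "600019", "Profit": 3613532099.7, "RptDate": "2019-06-30T00:00:00"}
]'''


def parse_profits(text: str) -> list:
    """Turn the raw JSON text into sorted (report date, profit) tuples."""
    rows = []
    for item in json.loads(text):
        report_date = datetime.strptime(item['RptDate'], '%Y-%m-%dT%H:%M:%S').date()
        rows.append((report_date, item['Profit']))
    return sorted(rows)


for report_date, profit in parse_profits(sample):
    # Divide by 1e8 only to keep the printed numbers short.
    print(report_date.isoformat(), f'{profit / 1e8:.2f}')
```
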