# Parsing the Web

## Web Server Frameworks

### Bottle

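`Bottle` consists of a single `Python` file, so it is easy to try out and easy to deploy. Running `./Web/bottle1.py`, shown below, starts a test web server; if you visit `http://localhost:9999/` in a browser, the server returns one line of text.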

In [1]:
from bottle import route, run

@route('/')
def home():
    return "It isn't fancy, but it's my home page"

run(host='localhost', port=9999)

Bottle v0.12.19 server starting up (using WSGIRefServer())...
Listening on http://localhost:9999/
Hit Ctrl-C to quit.

127.0.0.1 - - [21/Mar/2022 12:41:47] "GET / HTTP/1.1" 200 37
127.0.0.1 - - [21/Mar/2022 12:41:47] "GET /favicon.ico HTTP/1.1" 404 742
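
The startup message says Bottle is using `WSGIRefServer`, which wraps the standard library's `wsgiref` server. The `200 37` at the end of the first log line is the status code followed by the response size in bytes, which matches the 37-character string returned by `home()`. As a rough sketch of what sits beneath the `@route` decorator, the same page can be written as a plain WSGI callable and invoked in-process. The names `app` and `call_app` are illustrative assumptions, not part of Bottle.

```python
from wsgiref.util import setup_testing_defaults

BODY = b"It isn't fancy, but it's my home page"

def app(environ, start_response):
    """Plain WSGI callable: '/' returns the home page text, anything else a 404."""
    if environ.get('PATH_INFO', '/') == '/':
        start_response('200 OK', [('Content-Type', 'text/html; charset=UTF-8'),
                                  ('Content-Length', str(len(BODY)))])
        return [BODY]
    start_response('404 Not Found', [('Content-Type', 'text/plain')])
    return [b'Not found']

def call_app(path):
    """Invoke the app with a fake request; no socket is opened."""
    environ = {}
    setup_testing_defaults(environ)   # fills in the required WSGI keys
    environ['PATH_INFO'] = path
    captured = {}

    def start_response(status, headers):
        captured['status'] = status

    body = b''.join(app(environ, start_response))
    return captured['status'], body

if __name__ == '__main__':
    print(call_app('/'))
    print(call_app('/favicon.ico'))
```

Calling the application directly like this is also a convenient way to test a handler without binding a port.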


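You can also keep the page content in a separate `HTML` file and serve it with `static_file`. Example: `./Web/bottle2.py`. `static_file` looks for `index.html` in the directory given by `root` (here the current working directory); the 404 responses in the log below suggest that the file was not there when the cell was run.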

In [1]:
from bottle import route, run, static_file

@route('/')
def main():
    return static_file('index.html', root='.')

run(host='localhost', port=9999)

Bottle v0.12.19 server starting up (using WSGIRefServer())...
Listening on http://localhost:9999/
Hit Ctrl-C to quit.

127.0.0.1 - - [21/Mar/2022 13:34:45] "GET / HTTP/1.1" 404 716
127.0.0.1 - - [21/Mar/2022 13:35:23] "GET / HTTP/1.1" 404 716
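
A 404 like the one in this log is what a static-file helper typically answers when the requested file is missing under `root`. The sketch below is not Bottle's code; it is a simplified standard-library illustration of the kind of checks such a helper performs (stay inside `root`, then confirm the file exists). The function name `resolve_static` and its return shape are assumptions made for this example.

```python
import os
import tempfile

def resolve_static(filename, root):
    """Return (status, path) for a static file request, in the spirit of static_file."""
    root = os.path.abspath(root) + os.sep
    path = os.path.abspath(os.path.join(root, filename.strip('/\\')))
    if not path.startswith(root):
        return 403, None          # request tried to escape the root directory
    if not os.path.isfile(path):
        return 404, None          # nothing to serve: the case seen in the log
    return 200, path

if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as root:
        print(resolve_static('index.html', root))       # file not created yet
        with open(os.path.join(root, 'index.html'), 'w') as f:
            f.write('<html><body>Hello</body></html>')
        print(resolve_static('index.html', root))       # now it resolves
```

In practice, the fix for the 404 is to create `index.html` in the directory the server is started from, or to pass the correct directory as `root`.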


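A route can declare parameters in the `URL` path; Bottle passes them to the handler function as arguments of the same name (here `thing`). Example: `./Web/bottle3.py`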

In [None]:
from bottle import route, run, static_file

@route('/')
def home():
    return static_file('index.html', root='.')

@route('/echo/<thing>')
def echo(thing):
    return "Say hello to my little friend: %s!" % thing

run(host='localhost', port=9999)
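
To make the `<thing>` placeholder less magical, here is a minimal sketch of how a route pattern could be turned into a regular expression whose named groups become handler arguments. This is a simplified illustration, not Bottle's actual router; `compile_route` and `dispatch` are hypothetical names.

```python
import re

def compile_route(pattern):
    """Turn '/echo/<thing>' into a regex with one named group per <placeholder>."""
    # re.split with a capturing group alternates literal text and placeholder names
    parts = re.split(r'<(\w+)>', pattern)
    regex = ''
    for i, part in enumerate(parts):
        if i % 2:
            regex += '(?P<%s>[^/]+)' % part   # placeholder: one path segment
        else:
            regex += re.escape(part)          # literal text
    return re.compile('^' + regex + '$')

def echo(thing):
    return "Say hello to my little friend: %s!" % thing

ROUTE = compile_route('/echo/<thing>')

def dispatch(path):
    """Call echo() with the captured segment, or return None if the path does not match."""
    match = ROUTE.match(path)
    if match is None:
        return None
    return echo(**match.groupdict())

if __name__ == '__main__':
    print(dispatch('/echo/Mothra'))
    print(dispatch('/other'))
```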

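Use the client library `requests` to check that the server is working; the server from `bottle3.py` must already be running. Example: `./Web/bottle_test.py`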

In [None]:
import requests

resp = requests.get('http://localhost:9999/echo/Mothra')
if resp.status_code == 200 and \
    resp.text == 'Say hello to my little friend: Mothra!':
    print('It worked!')
else:
    print('No, got this:', resp.text)
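
The test above only works while a server is listening on port 9999. As a self-contained variation, the sketch below starts a throwaway server on a free port, sends one request, and performs the same check. It uses only the standard library (`wsgiref` and `http.client`) rather than Bottle and `requests`, and `echo_app`, `QuietHandler`, and `check_echo` are names invented for this example.

```python
import http.client
import threading
from wsgiref.simple_server import WSGIRequestHandler, make_server

def echo_app(environ, start_response):
    """Stand-in for the echo route: answers GET /echo/<thing>."""
    prefix = '/echo/'
    path = environ.get('PATH_INFO', '')
    if path.startswith(prefix) and len(path) > len(prefix):
        text = 'Say hello to my little friend: %s!' % path[len(prefix):]
        start_response('200 OK', [('Content-Type', 'text/plain; charset=utf-8')])
        return [text.encode('utf-8')]
    start_response('404 Not Found', [('Content-Type', 'text/plain')])
    return [b'Not found']

class QuietHandler(WSGIRequestHandler):
    """Request handler that skips the per-request access log."""
    def log_message(self, format, *args):
        pass

def check_echo(thing):
    """Serve exactly one request on a free port and return (status, text)."""
    # Port 0 asks the operating system for any unused port
    with make_server('127.0.0.1', 0, echo_app, handler_class=QuietHandler) as server:
        worker = threading.Thread(target=server.handle_request, daemon=True)
        worker.start()
        conn = http.client.HTTPConnection('127.0.0.1', server.server_port, timeout=5)
        try:
            conn.request('GET', '/echo/' + thing)
            resp = conn.getresponse()
            return resp.status, resp.read().decode('utf-8')
        finally:
            conn.close()
            worker.join(timeout=5)

if __name__ == '__main__':
    status, text = check_echo('Mothra')
    if status == 200 and text == 'Say hello to my little friend: Mothra!':
        print('It worked!')
    else:
        print('No, got this:', text)
```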

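### Flask

The `Bottle` example rewritten with `Flask`. Setting `static_folder='.'` and `static_url_path=''` makes Flask serve static files from the current directory, mirroring `root='.'` in the Bottle version.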

In [None]:
from flask import Flask

app = Flask(__name__, static_folder='.', static_url_path='')

@app.route('/')
def home():
    return app.send_static_file('index.html')

@app.route('/echo/<thing>')
def echo(thing):
    return "Say hello to my little friend: %s" % thing

app.run(port=9999, debug=True)

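The next version uses the `jinja2` template engine bundled with `Flask`. Here the arguments are passed as GET query parameters (for example `/echo/?thing=Gorgo&place=Wilmerding`) and read through `request.args`, rather than being taken from the `URL` path. `render_template` looks for `flask2.html` in the `templates` directory next to the script.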

In [None]:
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route('/')
def home():
    return app.send_static_file('index.html')

@app.route('/echo/')
def echo():
    thing = request.args.get('thing')
    place = request.args.get('place')
    return render_template('flask2.html', thing=thing, place=place)

app.run(port=9999, debug=True)
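
The two steps in `echo()`, reading query parameters and filling a template, can be illustrated without Flask. The sketch below uses `urllib.parse` in place of `request.args` and `string.Template` in place of Jinja2. The page text is hypothetical, since the real `flask2.html` is not shown in this section, and `first_arg` and `render_echo` are names invented for the example.

```python
from html import escape
from string import Template
from urllib.parse import parse_qs, urlsplit

# Hypothetical page text; the real flask2.html is not shown in this section
PAGE = Template('<p>thing: $thing, place: $place</p>')

def first_arg(query, name):
    """Return the first value for name, or None if absent (like request.args.get)."""
    values = query.get(name)
    return values[0] if values else None

def render_echo(url):
    """Parse the query string of url and substitute the values into PAGE."""
    query = parse_qs(urlsplit(url).query)
    thing = first_arg(query, 'thing')
    place = first_arg(query, 'place')
    # Escape the values so user-supplied text cannot inject markup
    return PAGE.substitute(thing=escape(str(thing)), place=escape(str(place)))

if __name__ == '__main__':
    print(render_echo('http://localhost:9999/echo/?thing=Gorgo&place=Wilmerding'))
```

Flask's Jinja2 setup escapes values in `.html` templates automatically; the sketch does the escaping by hand because `string.Template` does not.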

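## Scraping Data

### Scrapy

If you need an industrial-strength crawler, `Scrapy` (`http://scrapy.org/`) is strongly recommended. It is a full framework rather than a single module.

### Scraping `HTML` with `BeautifulSoup`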

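The example below uses `BeautifulSoup` to collect all the links on a web page. The `HTML` `a` element represents a link, and its `href` attribute holds the link's target address.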

In [None]:
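import requests
from bs4 import BeautifulSoup as soup

def get_links(url):
    """Return the href of every <a> element on the page at url."""
    result = requests.get(url)
    page = result.text
    # Name the parser explicitly; otherwise bs4 guesses one and emits a warning
    doc = soup(page, 'html.parser')
    links = [element.get('href') for element in doc.find_all('a')]
    return links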

if __name__ == '__main__':
    import sys
    for url in sys.argv[1:]:
        print('Links in', url)
        for num, link in enumerate(get_links(url), start=1):
            print(num, link)
        print()
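
For comparison, the same link extraction can be sketched with only the standard library's `html.parser`, which helps show what `find_all('a')` and `get('href')` are doing. This is a dependency-free illustration that works on an `HTML` string rather than fetching a `URL`; `LinkCollector` and `get_links_from_html` are names invented for the example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this
        if tag == 'a':
            # An <a> without href yields None, like element.get('href') in bs4
            self.links.append(dict(attrs).get('href'))

def get_links_from_html(page):
    parser = LinkCollector()
    parser.feed(page)
    parser.close()
    return parser.links

if __name__ == '__main__':
    sample = '<p><a href="http://example.com/">one</a> <a href="/two">two</a></p>'
    for num, link in enumerate(get_links_from_html(sample), start=1):
        print(num, link)
```

`BeautifulSoup` remains the more forgiving choice for messy real-world pages; the sketch is only meant to make the mechanics visible.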