
# <center>Web Data Acquisition: Course Practice</center>

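## Course Contents

* Downloading linked files
* Scraping data by calling an API
* Scraping data by parsing web pages
* Scraping data from dynamic pages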

### Downloading Linked Files

* Evaluation paper collection of CCKS 2019 (China Conference on Knowledge Graph and Semantic Computing) https://conference.bj.bcebos.com/ccks2019/eval/webpage/index.html

In [None]:
import os
import urllib.request

import requests

# Save the PDFs into a "papers" folder under the current working directory
cur_dir = os.getcwd()
paper_dir = os.path.join(cur_dir, 'papers')
os.makedirs(paper_dir, exist_ok=True)

for page in range(1, 4):
    try:
        _page = str(page)
        url = 'https://conference.bj.bcebos.com/ccks2019/eval/webpage/pdfs/eval_paper_1_1_{}.pdf'.format(_page)
        file_name = os.path.join(paper_dir, _page + '.pdf')
        print(file_name)
        # urlretrieve downloads the URL straight to the given file path
        urllib.request.urlretrieve(url, file_name)

    except Exception as e:
        print(e)

#### Basic Download
```
r = requests.get(url)
# The with statement closes the file even if the write fails
with open(file_name, 'wb') as f:
    f.write(r.content)
```

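#### Efficient Download
Calling `f.write(r.content)` directly loads the entire response body into memory before writing it to disk, which is inefficient and memory-hungry for large files. A more efficient approach is to download the file as a stream:
* Set the `stream` parameter of `get()` to `True`
* Write the file chunk by chunk in a `for ... in` loop
```
r = requests.get(url, stream=True)
with open(file_name, 'wb') as f:
    for chunk in r.iter_content(chunk_size=32):  # iter_content yields the body piece by piece
        f.write(chunk)
```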
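
The streaming pattern above can be wrapped in a small helper. The following is a minimal sketch rather than part of the original course code: the names `save_stream` and `FakeResponse` and the 8192-byte default chunk size are assumptions made for illustration. The helper accepts any object that exposes `iter_content(chunk_size=...)`, which is what `requests.get(url, stream=True)` returns, and reports how many bytes were written.

```python
import os
import tempfile


def save_stream(response, file_name, chunk_size=8192):
    """Write a streamed response body to file_name and return the byte count.

    `response` is assumed to expose iter_content(chunk_size=...), as the
    object returned by requests.get(url, stream=True) does.
    """
    written = 0
    with open(file_name, 'wb') as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


class FakeResponse:
    """Hypothetical stand-in for a streamed response, used for an offline check."""

    def __init__(self, body):
        self.body = body

    def iter_content(self, chunk_size=1):
        # Yield the body in slices of at most chunk_size bytes
        for start in range(0, len(self.body), chunk_size):
            yield self.body[start:start + chunk_size]


if __name__ == '__main__':
    target = os.path.join(tempfile.mkdtemp(), 'demo.bin')
    print(save_stream(FakeResponse(b'x' * 100), target, chunk_size=32))
```

With a real download, the call would look like `save_stream(requests.get(url, stream=True), file_name)`. A chunk size of a few kilobytes keeps memory use low while avoiding the overhead of many tiny writes.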

In [None]:
# Exercise
# Download the "Chinese Knowledge Graph Question Answering" papers from the proceedings page:
# https://conference.bj.bcebos.com/ccks2019/eval/webpage/index.html
import os
import urllib.request

# Save the PDFs into a "papers" folder under the current working directory
cur_dir = os.getcwd()
paper_dir = os.path.join(cur_dir, 'papers')
os.makedirs(paper_dir, exist_ok=True)

for page in range(1, 5):
    try:
        _page = str(page)
        url = 'https://conference.bj.bcebos.com/ccks2019/eval/webpage/pdfs/eval_paper_6_{}.pdf'.format(_page)
        file_name = os.path.join(paper_dir, _page + '_6.pdf')
        print(file_name)
        urllib.request.urlretrieve(url, file_name)

    except Exception as e:
        print(e)

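### Scraping Static Data
* Scraping data by calling an API
* Scraping data by parsing web pages

#### Scraping Data by Calling an API
* Scrape the popular recommended movies from the "选电影" (movie picker) section of **Douban Movie** https://movie.douban.com

#### Requesting the API URL
* https://movie.douban.com/j/search_subjects?type=movie&tag=热门&sort=recommend&page_limit=50&page_start=0
* The API returns movie data according to the supplied tag, sort order, page size, and starting offset. This request sorts by recommendation, starts at offset 0, and returns 50 movies under the 热门 ("popular") tag.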
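
To make the role of each query parameter concrete, here is a minimal sketch that assembles the request URL with the standard library. The helper name `build_search_url` and its default values are assumptions made for illustration; only the endpoint and the parameter names come from the URL above.

```python
from urllib.parse import urlencode

API_URL = "https://movie.douban.com/j/search_subjects"


def build_search_url(tag, page_start, page_limit=50, sort="recommend"):
    """Return the full request URL for one page of results."""
    params = {
        "type": "movie",
        "tag": tag,
        "sort": sort,
        "page_limit": page_limit,
        "page_start": page_start,
    }
    # urlencode percent-encodes non-ASCII values such as the Chinese tag (as UTF-8)
    return API_URL + "?" + urlencode(params)


# URLs for the first two pages of 50 movies each
page_urls = [build_search_url("热门", start) for start in range(0, 100, 50)]
for page_url in page_urls:
    print(page_url)
```

Paging works by advancing `page_start` in steps of `page_limit`. When `requests.get` is given a `params` dictionary it performs the same encoding internally, so building the URL by hand is mainly useful for inspecting or logging the exact request.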

#### Returned JSON Data
* Opening the link above in a browser returns a JSON-formatted string. Convert it into a Python dictionary and then process it.
```
{
	"subjects": [{
		"rate": "8.3",
		"cover_x": 5906,
		"title": "调音师",
		"url": "https:\/\/movie.douban.com\/subject\/30334073\/",
		"playable": true,
		"cover": "https://img1.doubanio.com\/view\/photo\/s_ratio_poster\/public\/p2551995207.webp",
		"id": "30334073",
		"cover_y": 8268,
		"is_new": false
	}, {
		"rate": "8.9",
		"cover_x": 2000,
		"title": "绿皮书",
		"url": "https:\/\/movie.douban.com\/subject\/27060077\/",
		"playable": true,
		"cover": "https://img3.doubanio.com\/view\/photo\/s_ratio_poster\/public\/p2549177902.webp",
		"id": "27060077",
		"cover_y": 3167,
		"is_new": false
	}]
}
```
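
The conversion from the JSON string to Python data can be tried offline. The sketch below uses a shortened copy of the response shown above, keeping only a few fields; the variable names are chosen for illustration. It parses the string with `json.loads` and collects the title, rating, and link of each movie.

```python
import json

# Shortened copy of the response shown above (only the fields used below)
raw = r'''
{
    "subjects": [
        {"rate": "8.3", "title": "调音师", "id": "30334073",
         "url": "https:\/\/movie.douban.com\/subject\/30334073\/"},
        {"rate": "8.9", "title": "绿皮书", "id": "27060077",
         "url": "https:\/\/movie.douban.com\/subject\/27060077\/"}
    ]
}
'''

data = json.loads(raw)  # JSON object -> Python dict

movies = []
for subject in data["subjects"]:
    movies.append({
        "title": subject["title"],
        "rate": float(subject["rate"]),  # the rating arrives as a string
        "url": subject["url"],           # "\/" in JSON decodes to a plain "/"
    })

# Print the movies from highest to lowest rating
for movie in sorted(movies, key=lambda m: m["rate"], reverse=True):
    print(movie["title"], movie["rate"], movie["url"])
```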

In [None]:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import requests
import json

# Request URL
url = "https://movie.douban.com/j/search_subjects"
# Request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
}
# Build the query parameters for each page and send the request
for page_start in range(0, 100, 50):
    params = {
        "type": "movie",
        "tag": "热门",
        "sort": "recommend",
        "page_limit": "50",
        "page_start": page_start
    }
    response = requests.get(
        url=url,
        headers=headers,
        params=params
    )
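    # Option 1: let requests decode the JSON body
    results = response.json()
    # Option 2: decode manually
#     # Get the raw bytes
#     content = response.content
#     # Decode the bytes into a string
#     string = content.decode('utf-8')
#     # Parse the string into Python data
#     results = json.loads(string)
    # Process the results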
    for movie in results["subjects"]:
        print(movie["title"], movie["rate"])

In [None]:
# Exercise
# Scrape the recommended movies under the 华语 ("Chinese-language") tag from Douban Movie https://movie.douban.com

#!/usr/bin/python3
# -*- coding: utf-8 -*-
import requests
import json

# Request URL
url = "https://movie.douban.com/j/search_subjects"
# Request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
}
# Build the query parameters for each page and send the request
for page_start in range(0, 10, 5):
    params = {
        "type": "movie",
        "tag": "华语",
        "sort": "recommend",
        "page_limit": "5",
        "page_start": page_start
    }
    response = requests.get(
        url=url,
        headers=headers,
        params=params
    )
    # Decode the JSON body into Python data
    results = response.json()
    # Process the results
    for movie in results["subjects"]:
        print(movie["title"], movie["rate"])

#### Scraping Data by Parsing Web Pages

* Jianshu https://www.jianshu.com/

#### Analyzing the Page to Locate the Data
```
<div id="list-container">
<!-- 文章列表模块 -->
<ul class="note-list" infinite-scroll-url="/">  
<li id="note-42616373" data-note-id="42616373" class="have-img">
    <a class="wrap-img" href="/p/85732e50fbf8" target="_blank">
      <img class="  img-blur-done" src="//upload-images.jianshu.io/upload_images/2376211-5ae568461a3c3308.jpg?imageMogr2/auto-orient/strip|imageView2/1/w/360/h/240" alt="120">
    </a>
  <div class="content">
    <a class="title" target="_blank" href="/p/85732e50fbf8">随想：卖基金浅谈＂投资＂</a>
    <p class="abstract">
      今天，我把唯一养的基金给卖了。这只基金，我从2016年11月就开始定投了，一月定投100元，期间全部卖出过几次，但我没有终止定投就又继续定投，累...
    </p>
    <div class="meta">
        <span class="jsd-meta">
          <i class="iconfont ic-paid1"></i> 15.5
        </span>
      <a class="nickname" target="_blank" href="/u/c2969a4ab893">沐滢</a>
        <a target="_blank" href="/p/85732e50fbf8#comments">
          <i class="iconfont ic-list-comments"></i> 23
</a>      <span><i class="iconfont ic-list-like"></i> 43</span>
    </div>
  </div>
</li>  

</ul>
<!-- 文章列表模块 -->
</div>
```
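
In the snippet above, each article's link and title sit in the `<a class="title">` element and its summary in `<p class="abstract">`; the `href` values are relative paths. As a minimal offline sketch, the code below pulls these three fields out of a shortened copy of the snippet and resolves the link against the site root. The function name `extract_articles` is hypothetical, and the pattern assumes the attribute order shown in the snippet, which the live page may not keep.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.jianshu.com/"

# Shortened copy of the list item shown above
SAMPLE = '''
<li id="note-42616373" data-note-id="42616373" class="have-img">
  <div class="content">
    <a class="title" target="_blank" href="/p/85732e50fbf8">随想：卖基金浅谈＂投资＂</a>
    <p class="abstract">
      今天，我把唯一养的基金给卖了。
    </p>
  </div>
</li>
'''

# Capture href, title, and abstract; re.S lets "." match across newlines
PATTERN = re.compile(
    r'<a class="title" target="_blank" href="(.*?)">(.*?)</a>'
    r'.*?<p class="abstract">(.*?)</p>',
    re.S,
)


def extract_articles(html, base_url=BASE_URL):
    """Return one dict per article with an absolute URL, title, and abstract."""
    articles = []
    for href, title, abstract in PATTERN.findall(html):
        articles.append({
            "url": urljoin(base_url, href),     # "/p/..." -> absolute URL
            "name": title.strip(),
            "text": " ".join(abstract.split()),  # collapse newlines and indentation
        })
    return articles


articles = extract_articles(SAMPLE)
print(articles)
```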

#### Regular Expressions

In [None]:
import requests
import re

In [None]:
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0"}
url = 'https://www.jianshu.com/'
response = requests.get(url, headers=headers)
html = response.text
html

In [None]:
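# Capture the link, title, and abstract of each article; re.S lets "." match newlines.
# The quotes around href stay outside the group so the captured link has no quote marks.
pattern = re.compile(r'<a class="title" target="_blank" href="(.*?)">(.*?)</a>.*?<p class="abstract">(.*?)</p>', re.S)
items = re.findall(pattern, html)
items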

In [None]:
for item in items:
    print({'url': item[0].strip(), 'name': item[1].strip(), 'text': item[2].replace("\n", "").strip()})

#### BeautifulSoup

In [None]:
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0"}
url = 'https://www.jianshu.com/'
r = requests.get(url=url, headers=headers)
html = r.text        
html

In [None]:
soup = BeautifulSoup(html, 'html.parser')
con = soup.find(id='list-container')
con_list = con.find_all('div', class_="content")
con_list

In [None]:
for i in con_list:
    title_link = i.find('a', class_='title')
    url = title_link['href']
    # get_text() also works when the title contains nested tags, where .string is None
    name = title_link.get_text()
    text = i.find('p', class_='abstract').get_text()
    print({'url': url.strip(), 'name': name.strip(), 'text': text.strip()})

### Scraping Dynamic Data
* Scrape air quality data https://www.aqistudy.cn/historydata/daydata.php?city=北京&month=2015-07

#### Method 1

```py
import os

from bs4 import BeautifulSoup
import requests

base_url = 'https://www.aqistudy.cn/historydata/daydata.php?city='
city = '北京'

# The twelve months of 2018, formatted as YYYY-MM
month_set = ['2018-%02d' % i for i in range(1, 13)]
print(month_set)

# Make sure the output folder exists before opening the CSV file
os.makedirs('tmp', exist_ok=True)
file_name = city + '.csv'
fp = open('tmp/' + file_name, 'w', encoding='utf-8')
for i in range(len(month_set)):
    str_month = month_set[i]
    weburl = ('%s%s&month=%s' % (base_url, city, str_month))
    response = requests.get(weburl).content
    soup = BeautifulSoup(response, 'html.parser', from_encoding='utf-8')
    result = soup.find_all('td', attrs={'align': 'center'}, recursive=True)

    for j in range(0, len(result), 9):
        record_day = result[j].get_text().strip()
        record_aqi = result[j + 1].get_text().strip()
        fp.write(('%s,%s\n' % (record_day, record_aqi)))
    print('%s,%s---DONE' % (city, str_month))
fp.close()
```
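
The month list and the per-month URLs can also be generated in one step. The following is a minimal sketch, assuming a hypothetical helper `month_urls`; it uses `urlencode` so that the Chinese city name is percent-encoded explicitly rather than left to the HTTP library.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.aqistudy.cn/historydata/daydata.php'


def month_urls(city, year):
    """Return (month, url) pairs for the twelve months of the given year."""
    pairs = []
    for m in range(1, 13):
        month = '%d-%02d' % (year, m)  # zero-padded, e.g. 2018-01
        query = urlencode({'city': city, 'month': month})
        pairs.append((month, BASE_URL + '?' + query))
    return pairs


urls_2018 = month_urls('北京', 2018)
print(urls_2018[0])
```

The `%02d` format replaces the two separate loops for single-digit and double-digit months.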

#### Method 2

```py
import csv
import os
import time

from bs4 import BeautifulSoup as BS
from selenium import webdriver

base_url = 'https://www.aqistudy.cn/historydata/daydata.php?city='
city = '北京'

# The twelve months of 2018, formatted as YYYY-MM
month_set = ['2018-%02d' % i for i in range(1, 13)]
print(month_set)

# Make sure the output folder exists before opening the CSV file
os.makedirs('tmp', exist_ok=True)
file_name = city + '.csv'
fp = open('tmp/' + file_name, 'w', newline='', encoding='utf-8')
fieldnames = ['city', 'date', 'AQI', 'LEVEL', 'PM2_5', 'PM10', 'SO2', 'CO', 'NO2', 'O3_8h']
writer = csv.DictWriter(fp, fieldnames=fieldnames)
writer.writeheader()
browser = webdriver.Firefox()
browser.maximize_window()
for i in range(len(month_set)):
    str_month = month_set[i]
    weburl = ('%s%s&month=%s' % (base_url, city, str_month))
    browser.get(weburl)
    time.sleep(1)
    # Grab the HTML after the browser has rendered the page
    html = browser.page_source
    # page_source is already a str, so no from_encoding argument is needed
    soup = BS(html, 'html.parser')
    result = soup.find_all('td', attrs={'align': 'center'}, recursive=True)

    # Each table row has 9 cells; stop before a trailing incomplete group
    for j in range(0, len(result) - 8, 9):
        item = {}
        # City
        item['city'] = city
        # Date
        item['date'] = result[j].get_text().strip()
        # AQI
        item['AQI'] = result[j+1].get_text().strip()
        # Air quality level
        item['LEVEL'] = result[j+2].get_text().strip()
        # PM2.5
        item['PM2_5'] = result[j+3].get_text().strip()
        # PM10
        item['PM10'] = result[j+4].get_text().strip()
        # SO2
        item['SO2'] = result[j+5].get_text().strip()
        # CO
        item['CO'] = result[j+6].get_text().strip()
        # NO2
        item['NO2'] = result[j+7].get_text().strip()
        # O3_8h
        item['O3_8h'] = result[j+8].get_text().strip()
        writer.writerow(item)
        print(item)
    print('%s,%s---DONE' % (city, str_month))
fp.close()
browser.quit()
```
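
The row-building loop above maps every group of nine table cells onto the CSV columns. The same idea can be sketched offline with plain strings in place of the parsed `<td>` elements. The helper name `rows_from_cells` and the sample values are made up for illustration and are not real measurements.

```python
import csv
import io

FIELDNAMES = ['city', 'date', 'AQI', 'LEVEL', 'PM2_5', 'PM10', 'SO2', 'CO', 'NO2', 'O3_8h']
CELLS_PER_ROW = 9  # every field except 'city' comes from one table cell


def rows_from_cells(cells, city):
    """Group a flat list of cell texts into one dict per complete table row."""
    rows = []
    # Stop before a trailing incomplete group so no index runs past the end
    for start in range(0, len(cells) - CELLS_PER_ROW + 1, CELLS_PER_ROW):
        values = [text.strip() for text in cells[start:start + CELLS_PER_ROW]]
        row = {'city': city}
        row.update(zip(FIELDNAMES[1:], values))
        rows.append(row)
    return rows


# Made-up cell texts for illustration only
sample_cells = ['2018-01-01', ' 60 ', 'level', '1', '2', '3', '4', '5', '6',
                '2018-01-02']  # the last, incomplete row is ignored

rows = rows_from_cells(sample_cells, '北京')

# Write to an in-memory buffer instead of a file to keep the sketch self-contained
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

With the Selenium code above, the input list would be `[td.get_text() for td in result]`.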

In [None]:
import csv
import os
import time

from bs4 import BeautifulSoup as BS
from selenium import webdriver

base_url = 'https://www.aqistudy.cn/historydata/daydata.php?city='
city = '北京'

# The twelve months of 2018, formatted as YYYY-MM
month_set = ['2018-%02d' % i for i in range(1, 13)]
print(month_set)

# Make sure the output folder exists before opening the CSV file
os.makedirs('tmp', exist_ok=True)
file_name = city + '.csv'
fp = open('tmp/' + file_name, 'w', newline='', encoding='utf-8')
fieldnames = ['city', 'date', 'AQI', 'LEVEL', 'PM2_5', 'PM10', 'SO2', 'CO', 'NO2', 'O3_8h']
writer = csv.DictWriter(fp, fieldnames=fieldnames)
writer.writeheader()
browser = webdriver.Chrome()
browser.maximize_window()
for i in range(len(month_set)):
    str_month = month_set[i]
    weburl = ('%s%s&month=%s' % (base_url, city, str_month))
    browser.get(weburl)
    time.sleep(10)
    # Grab the HTML after the browser has rendered the page
    html = browser.page_source
    # page_source is already a str, so no from_encoding argument is needed
    soup = BS(html, 'html.parser')
    result = soup.find_all('td', attrs={'align': 'center'}, recursive=True)

    # Each table row has 9 cells; stop before a trailing incomplete group
    for j in range(0, len(result) - 8, 9):
        item = {}
        # City
        item['city'] = city
        # Date
        item['date'] = result[j].get_text().strip()
        # AQI
        item['AQI'] = result[j+1].get_text().strip()
        # Air quality level
        item['LEVEL'] = result[j+2].get_text().strip()
        # PM2.5
        item['PM2_5'] = result[j+3].get_text().strip()
        # PM10
        item['PM10'] = result[j+4].get_text().strip()
        # SO2
        item['SO2'] = result[j+5].get_text().strip()
        # CO
        item['CO'] = result[j+6].get_text().strip()
        # NO2
        item['NO2'] = result[j+7].get_text().strip()
        # O3_8h
        item['O3_8h'] = result[j+8].get_text().strip()
        writer.writerow(item)
        print(item)
    print('%s,%s---DONE' % (city, str_month))
fp.close()
browser.quit()

# Any Questions?