# Big Data Research Club: 1450's Hands-On Web Scraping

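Does learning to scrape the web mean you can join the cyber army? Will you get paid as a 1450? Since the title already lured you here, you might as well keep reading ヾ(●゜▽゜●)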

## What is a web scraper
A web scraper (crawler) is a program that automatically fetches the content of web pages.

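Most of us have needed to collect information from web pages at some point, perhaps for a report or out of personal curiosity about a topic. The simplest approach is to copy the entries one by one, paste them into Excel or a text editor, save them, and analyze them afterwards.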

That is fine for a few dozen entries, but what if there are hundreds or thousands?

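## Background

A web scraper is a program that imitates a browser: it sends an HTTP(S) request to a server to obtain the **page source**, then collects the **information we want** from that source. Scraping can be very simple; the difficulty depends on how the target site is designed and how detailed the data needs to be.

Web scraping can be split into two steps:
- Step one is fetching the page source (HTML)
- Step two is parsing the HTML source to find the information we want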

![](https://drive.google.com/uc?export=download&id=1UiDUXriZVo83ePdiSQTLS4QWnCN_YsVu)

![](https://scontent.frmq2-2.fna.fbcdn.net/v/t1.0-9/p960x960/41313754_306610563225314_4813346634328965120_o.jpg?_nc_cat=109&_nc_oc=AQlhNysrF9hrCiLlCwn2iOtHOZu1J8e051CaXd13EC5L_YyewojFAdO95M4LJyLz1ic&_nc_ht=scontent.frmq2-2.fna&_nc_tp=6&oh=346011528de001b6dd81937095240e45&oe=5EBD24E1)
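The two steps can be sketched with nothing but the Python standard library. This is only a rough illustration, not the approach used in the rest of this tutorial (which relies on Requests and BeautifulSoup): `fetch_html`, `HeadingCollector`, and `extract_headings` are hypothetical names, the UTF-8 decoding is an assumption, and the sample HTML string stands in for a downloaded page.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Step 1: download the page source (assumes the page is UTF-8)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


class HeadingCollector(HTMLParser):
    """Step 2: collect the text found inside <h1> and <h2> tags."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._inside_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self._inside_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag in ("h1", "h2"):
            self._inside_heading = False

    def handle_data(self, data):
        if self._inside_heading:
            self.headings[-1] += data


def extract_headings(html: str):
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


# An inline sample, so the parsing step can be tried without network access.
sample = "<html><body><h1>War and Peace</h1><h2>Chapter 1</h2><p>...</p></body></html>"
print(extract_headings(sample))  # ['War and Peace', 'Chapter 1']
```

With network access, `extract_headings(fetch_html(url))` chains the two steps. Writing a parser class by hand gets tedious quickly, which is the motivation for dedicated packages.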

## Packages

* Requests [official documentation](https://requests.readthedocs.io/en/master/user/quickstart/)
* BeautifulSoup [official documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/)

### 1. The Requests package
Requests is a simple and elegant HTTP package for Python that can make an HTTP request in a single line.

Install command: `$ pip install requests`

```python
import requests

requests.get("http://pythonscraping.com/pages/warandpeace.html")
```

```text
<Response [200]>
```

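We successfully received a response from the server, with HTTP status code 200 (OK).
Next, store the Response object in a variable named `res` and read the HTML page source from it.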

```python
res = requests.get("http://pythonscraping.com/pages/warandpeace.html")
print(res.text)
```

```html
<html>
<head>
<style>
.green{
	color:#55ff55;
}
.red{
	color:#ff5555;
}
#text{
	width:50%;
}
</style>
</head>
<body>
<h1>War and Peace</h1>
<h2>Chapter 1</h2>
<div id="text">
"<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>"
<p/>
It was in July, 1805, and the speaker was the well-known <span class="green">Anna
Pavlovna Scherer</span>, maid of honor and favorite of the <span class="green">Empress Marya
Fedorovna</span>. With these words she greeted <span class="green">Prince Vasili Kuragin</span>, a man
of high rank and importance, who was the first t
```
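A request does not always come back with status 200, so it can help to check the response before using `res.text`. The sketch below is one possible way to do that: `raise_for_status()` and the `timeout` argument are part of Requests, while the helper names `html_from_response` and `fetch_html` and the 10-second timeout are assumptions made for illustration.

```python
import requests


def html_from_response(res: requests.Response) -> str:
    """Return the page source, raising requests.HTTPError on 4xx/5xx responses."""
    res.raise_for_status()
    return res.text


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch a page, giving up if the server takes longer than `timeout` seconds."""
    res = requests.get(url, timeout=timeout)
    return html_from_response(res)
```

For example, `fetch_html("http://pythonscraping.com/pages/warandpeace.html")` would return the same source as `res.text` above, while a missing page would raise an exception instead of silently handing back an error page.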

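### 2. The BeautifulSoup package

Beautiful Soup [official documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/) is a Python library for extracting data from HTML or XML documents. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document, and it can save you hours or even days of work. The name comes from a song of the same title in Alice's Adventures in Wonderland, sung by the Mock Turtle, an allusion to the British dish mock turtle soup...

Install command: `$ pip install beautifulsoup4`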

```python
from bs4 import BeautifulSoup
import requests

res = requests.get("http://pythonscraping.com/pages/warandpeace.html")
soup = BeautifulSoup(res.text, "html.parser")
soup
```

```html
<html>
<head>
<style>
.green{
	color:#55ff55;
}
.red{
	color:#ff5555;
}
#text{
	width:50%;
}
</style>
</head>
<body>
<h1>War and Peace</h1>
<h2>Chapter 1</h2>
<div id="text">
"<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>"
<p></p>
It was in July, 1805, and the speaker was the well-known <span class="green">Anna
Pavlovna Scherer</span>, maid of honor and favorite of the <span class="green">Empress Marya
Fedorovna</span>. With these words she greeted <span class="green">Prince Vasili Kuragin</span>, a man
of high rank and importance, who was the firs
```

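Great! We have parsed the sample page with bs4. By convention, the parsed object is stored in a variable named `soup`. Now let's try to record all the **character names**, which this page wraps in `<span class="green">` tags.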
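Before searching, it may help to see what the `soup` object allows. The sketch below uses a small inline HTML string (an assumption, shaped like the sample page so that it runs without network access) and reaches tags by attribute-style access, which returns the first tag with that name.

```python
from bs4 import BeautifulSoup

# A shortened stand-in for the downloaded page.
sample_html = """
<html><body>
<h1>War and Peace</h1>
<h2>Chapter 1</h2>
<div id="text"><span class="green">Anna Pavlovna</span> greeted
<span class="green">Prince Vasili</span>.</div>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

print(soup.h1.text)                    # War and Peace
print(soup.h2.text)                    # Chapter 1
print(soup.div["id"])                  # text
print(soup.find(id="text").span.text)  # Anna Pavlovna
```

Attribute-style access only ever reaches the first matching tag, so collecting every name needs the search functions.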

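### Functions bs4 provides for selecting tags

`findAll()` finds **all** tags that match the condition (current bs4 spells it `find_all()`; `findAll()` is the older alias)

`find()` finds the **first** tag that matches the condition

`select()` takes a CSS selector and returns all matching tags; it is the convenient choice when chaining tags or matching `tag.classname`. See [Beautifulsoup : Is there a difference between .find() and .select()](https://stackoverflow.com/questions/38028384/beautifulsoup-is-there-a-difference-between-find-and-select-python-3-x)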
```python
# Both lines return every <span> whose class is "green".
soup.findAll("span", { "class": "green"})
soup.select('span.green')
```
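To make the difference between the three functions concrete, here is a small sketch on an inline HTML string (an assumption, so it runs without network access). It uses the `find_all()` spelling and also shows what comes back when nothing matches.

```python
from bs4 import BeautifulSoup

sample_html = (
    '<div id="text">'
    '<span class="red">Well, Prince</span>'
    '<span class="green">Anna Pavlovna</span>'
    '<span class="green">Prince Vasili</span>'
    '</div>'
)
soup = BeautifulSoup(sample_html, "html.parser")

first_green = soup.find("span", {"class": "green"})    # a single Tag
all_green = soup.find_all("span", {"class": "green"})  # a list of Tags
selected = soup.select("div#text span.green")          # CSS selector, also a list

print(first_green.text)                          # Anna Pavlovna
print([s.text for s in all_green])               # ['Anna Pavlovna', 'Prince Vasili']
print([s.text for s in selected])                # ['Anna Pavlovna', 'Prince Vasili']

# When nothing matches, find() gives None and find_all() gives an empty list.
print(soup.find("span", {"class": "blue"}))      # None
print(soup.find_all("span", {"class": "blue"}))  # []
```

Because `find()` can return `None`, calling `.text` on its result without checking is a common source of `AttributeError` in scrapers.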

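### Getting the text inside a tag

`.text` returns **all** the text inside a tag, including the text of every tag nested beneath it

`.string` returns the text of **this one tag** only: it gives the text when the tag holds a single string, and `None` when the tag has more than one child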
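The difference shows up as soon as a tag has nested tags. A minimal sketch on an inline HTML string (an assumption, not taken from the sample page):

```python
from bs4 import BeautifulSoup

sample_html = (
    '<div id="text">'
    '<span class="green">Anna</span> met '
    '<span class="green">Prince Vasili</span>'
    '</div>'
)
soup = BeautifulSoup(sample_html, "html.parser")

span = soup.find("span")  # contains a single string
div = soup.find("div")    # contains two spans plus loose text

print(span.text)    # Anna
print(span.string)  # Anna
print(div.text)     # Anna met Prince Vasili
print(div.string)   # None
```

For a leaf tag such as the name spans, the two give the same result; for a container such as the `div`, only `.text` is useful.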

```python
name_list = soup.findAll("span", { "class": "green"})
for n in name_list:
    print(n.text)
```

```text
Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna
```
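Some names in the output are split across two lines because the HTML source itself contains a line break inside the `<span>`. One possible cleanup, sketched below on an inline HTML string (an assumption standing in for the real page), is to collapse whitespace and then count how often each name appears. The helper name `clean` is hypothetical.

```python
from collections import Counter

from bs4 import BeautifulSoup

sample_html = """
<div id="text">
<span class="green">Anna
Pavlovna Scherer</span> greeted <span class="green">Prince Vasili</span>.
<span class="green">Anna Pavlovna</span> smiled at <span class="green">the prince</span>,
and <span class="green">The prince</span> bowed to <span class="green">Anna Pavlovna</span>.
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")


def clean(text: str) -> str:
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return " ".join(text.split())


names = [clean(span.get_text()) for span in soup.find_all("span", class_="green")]
counts = Counter(names)

print(names[0])               # Anna Pavlovna Scherer
print(counts.most_common(1))  # [('Anna Pavlovna', 2)]
```

Note that "the prince" and "The prince" are still counted separately here; whether to merge them (for example with `.lower()`) depends on the analysis.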


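One final reminder, though: scraping is a back-door approach, and scrapers are both fragile and tedious to maintain. Here are some problems a scraper will run into:
- The target site uses a modern front-end framework that renders its content in the browser with JavaScript (for example React.js apps or Gmail), so the HTML returned by a plain HTTP request contains little of the data and this approach does not work
- Whenever the target site is maintained or redesigned, the scraper has to be updated as well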
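Both problems tend to show up the same way: the selector suddenly matches nothing. A minimal defensive sketch, assuming we would rather fail loudly than silently record empty data, is shown below. The function name `extract_green_names` and the `js_shell` string are hypothetical; the latter only imitates the kind of near-empty HTML a JavaScript-rendered site might return.

```python
from bs4 import BeautifulSoup


def extract_green_names(html: str):
    """Return the text of every <span class="green">, or raise if none exist."""
    soup = BeautifulSoup(html, "html.parser")
    spans = soup.select("span.green")
    if not spans:
        # Typical causes: the site was redesigned, or the content is rendered
        # by JavaScript and is missing from the raw HTML.
        raise ValueError("no span.green found; the page layout may have changed")
    return [" ".join(span.get_text().split()) for span in spans]


static_page = '<div id="text"><span class="green">Anna Pavlovna</span></div>'
js_shell = '<div id="root"></div><script src="app.js"></script>'  # hypothetical

print(extract_green_names(static_page))  # ['Anna Pavlovna']

try:
    extract_green_names(js_shell)
except ValueError as err:
    print("scraper needs attention:", err)
```

A check like this does not fix a broken scraper, but it makes the breakage visible on the first run after the site changes.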

So if important data is available through an API, paying a little for it is still worth it ┐(´д`)┌