# Web Crawlers: Scraping Web Page Data

- Fetch web pages with the [`requests`](https://docs.python-requests.org/en/latest/) module
  - requests.get()
  - requests.post()
- Parse web pages with the [`Beautiful Soup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/) module
  - soup.find()
  - soup.find_all()
  - soup.select()

# Requests

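### Commonly used response attributes in Requests
- `response.status_code`
  - 200 OK
  - 403 Forbidden
  - 404 Not Found
- `response.encoding`
  - Pay particular attention to encoding on Chinese-language websites
  - UTF-8 is the most common encoding; on Windows you may run into CP950, Big5, and similar encodings
- `response.text`
  - The response body decoded as text; for a web page this is the HTML source, where the target information sits inside tags
- `response.json()`
  - If the response is JSON, this parses it into Python lists and dicts


- The documentation is a good place to learn; read these notes alongside the [official requests quickstart](https://requests.readthedocs.io/en/latest/user/quickstart/).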
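
The encoding note above can be tried offline. The following is a minimal sketch, assuming a hypothetical helper `decode_with_fallback` (it is not part of `requests`) that tries a list of candidate encodings on raw bytes such as those in `response.content`:

```python
def decode_with_fallback(raw, encodings=("utf-8", "big5", "cp950")):
    """Try each candidate encoding in order and return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway and mark the bytes that could not be decoded.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


# Simulate the bytes a Big5-encoded page would deliver in response.content.
raw = "台灣高鐵".encode("big5")
text, used = decode_with_fallback(raw)
print(text, used)
```

With `requests` itself, assigning the right codec to `response.encoding` before reading `response.text` should have the same effect.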

### Inspecting the connection

In [None]:
import requests

url="https://api.github.com/events"
r = requests.get(url)

r.json()

In [None]:
# connection status
r.status_code

In [None]:
# encoding
r.encoding

In [None]:
# content (raw bytes)
r.content

In [None]:
# raise an exception for an error status (4xx/5xx); does nothing on success
r.raise_for_status()

In [None]:
# cookies
r.cookies

In [None]:
# headers (HTTP header names are case-insensitive)
r.headers

In [None]:
r.headers['Content-Type']

- For binary data such as images, audio, and video, read the raw bytes from `r.content`

### A quick look at `GET`, `POST`, `PUT`, and `DELETE`

In [None]:
!curl -X GET "https://api.github.com/events"

In [None]:
# GET
r = requests.get('https://httpbin.org/get')
r.json()

In [None]:
# POST
r = requests.post('https://httpbin.org/post', data={'key': 'value'})
r.json()

In [None]:
# PUT
r = requests.put('https://httpbin.org/put', data={'key': 'value2'})
r.json()

In [None]:
# DELETE
r = requests.delete('https://httpbin.org/delete')
r.json()

In [None]:
r.text

### `GET` and how to add query parameters

In [None]:
!curl -X GET "https://httpbin.org/get?k1=v1&k2=v2"

In [None]:
import requests

payload = {'k1': 'v1', 'k2': 'v2'}
r = requests.get('https://httpbin.org/get', params=payload)
print(r.url)

In [None]:
# GET parameters are appended to the URL. Remember HW2?
import requests

payload  = {"api":"1", "map_action":"map", "zoom":"16", "query":"24.149660,120.684166"}
r = requests.get('https://www.google.com/maps/search/', params=payload )
print(r.url)
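
To see how a `params` dict ends up in the URL, here is a small offline sketch. It uses only the standard library's `urllib.parse` and is meant as an illustration of the encoding step, not as the internals of `requests`:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base = "https://httpbin.org/get"
payload = {"k1": "v1", "k2": "v2"}

# Encode the dict as a query string and append it after "?".
url = f"{base}?{urlencode(payload)}"
print(url)

# Going the other way: split a URL and read its query parameters back.
query = parse_qs(urlsplit(url).query)
print(query)

# Characters such as "," are percent-encoded along the way.
print(urlencode({"query": "24.149660,120.684166"}))
```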

- Downloading large files with `requests.get`; this also works for images, audio/video files, and other binary files

In [None]:
# Source: https://stackoverflow.com/questions/16694907/download-large-file-in-python-with-requests
import requests

def download_file(url):
  """Download a file; the local filename is url.split('/')[-1]."""
  local_filename = url.split('/')[-1]
  with requests.get(url, stream=True) as r:
      r.raise_for_status()
      with open(local_filename, 'wb') as f:
          for chunk in r.iter_content(chunk_size=8192):
              f.write(chunk)
  return local_filename

if __name__=='__main__':
  download_file("https://api.github.com/events")
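
Taking `url.split('/')[-1]` as the filename keeps any query string and yields an empty name for URLs that end in `/`. A possible refinement, sketched with a hypothetical helper `filename_from_url` and a made-up default name:

```python
import posixpath
from urllib.parse import urlsplit, unquote


def filename_from_url(url, default="download.bin"):
    """Derive a local filename from the path part of a URL."""
    path = urlsplit(url).path                  # drops "?query" and "#fragment"
    name = posixpath.basename(unquote(path))   # last path segment, percent-decoded
    return name or default                     # fall back when the path ends in "/"


print(filename_from_url("https://example.com/img/logo.png?size=large"))
print(filename_from_url("https://example.com/"))
```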

### `POST`: a POST request with data

In [None]:
!curl -X POST -d "key1=value1&key2=value2" "https://httpbin.org/post"

In [None]:
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("https://httpbin.org/post", data=payload)

print(r.text)

- Using `payload_tuples` and `payload_dict`: the two forms below are equivalent

In [None]:
import requests

payload_tuples = [('key1', 'value1'), ('key1', 'value2')]
requests.post('https://httpbin.org/post', data=payload_tuples).json()

In [None]:
import requests

payload_dict = {'key1': ['value1', 'value2']}
requests.post('https://httpbin.org/post', data=payload_dict).json()
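
Both payloads should end up as the same form-encoded body. The sketch below shows that body using only `urllib.parse.urlencode` from the standard library, on the assumption that `requests` encodes form data the same way:

```python
from urllib.parse import urlencode

payload_tuples = [("key1", "value1"), ("key1", "value2")]
payload_dict = {"key1": ["value1", "value2"]}

# doseq=True expands a list value into one key=value pair per item.
body_from_tuples = urlencode(payload_tuples)
body_from_dict = urlencode(payload_dict, doseq=True)

print(body_from_tuples)
print(body_from_dict)
```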

- `post(url, data=None, json=None, **kwargs)`: the example below passes a dict converted to JSON, or the dict directly through `json=`


In [None]:
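import json
import requests

url = 'https://httpbin.org/post'
payload = {'some': 'data'}

# Both calls send the same JSON body;
# json=payload additionally sets the Content-Type header to application/json
r = requests.post(url, data=json.dumps(payload))
r = requests.post(url, json=payload)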

r.json()
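
The difference between a form body and a JSON body is easy to inspect without sending anything. A minimal sketch using only the standard library, assuming the default serialisation settings:

```python
import json
from urllib.parse import urlencode

payload = {"some": "data"}

json_body = json.dumps(payload)   # the kind of body a JSON request carries
form_body = urlencode(payload)    # the kind of body data=payload carries

print(json_body)
print(form_body)

# The receiving side can parse the JSON body back into the original dict.
print(json.loads(json_body) == payload)
```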

- Sending `Cookies`

In [None]:
url = 'http://www.google.com'
r = requests.get(url)
r.cookies

In [None]:
!curl --cookie "my_cookie=" https://httpbin.org/cookies

In [None]:
url = 'https://httpbin.org/cookies'
cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)
r.json()

- Timeouts

In [None]:
requests.get('https://github.com/', timeout=0.001)
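
A timeout that small is expected to raise an exception. One common way to cope with timeouts is to retry a few times; the sketch below uses a hypothetical helper `fetch_with_retry` and a simulated request so that it runs offline:

```python
import time


def fetch_with_retry(fetch, retries=3, backoff=0.1, retry_on=(TimeoutError,)):
    """Call fetch() until it succeeds or the retries run out."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == retries:
                raise                      # give up: re-raise the last error
            time.sleep(backoff * attempt)  # wait a little longer each time


# Stand-in for a flaky request: fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

print(fetch_with_retry(flaky, backoff=0.01))
```

With `requests`, one would pass something like `lambda: requests.get(url, timeout=5)` as `fetch` and `retry_on=(requests.exceptions.Timeout,)`.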

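## `GET`: example.com as an example


- First, look at the target page: http://www.example.com/
- Fetch the page source with `requests.get` and print the result
- At this stage, getting the page back is all that is needed.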

In [None]:
import requests

url = 'http://www.example.com/'
response = requests.get(url)
print(response.text)

In [None]:
dir(requests)

In [None]:
response.status_code

In [None]:
response.encoding

## `POST`: [Taiwan High Speed Rail ticket booking](https://www.thsrc.com.tw/ArticleContent/a3b630bb-1066-4352-a1ef-58c7b4e8ef7c) as an example

In [None]:
import requests

url= 'https://www.thsrc.com.tw/TimeTable/Search'

data={
    'SearchType': 'S',
    'Lang': 'TW',
    'StartStation': 'NanGang',
    'EndStation': 'ZuoYing',
    'OutWardSearchDate': '2022/11/05',
    'OutWardSearchTime': '16:00',
    'ReturnSearchDate': '2022/11/05',
    'ReturnSearchTime': '16:00',
    'DiscountType': None
}

res = requests.post(url, data=data)

In [None]:
res.json()

In [None]:
res.text

In [None]:
res.headers

In [None]:
r = res.json()
r['data']

# BeautifulSoup

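## Reading and parsing HTML with Beautiful Soup


- [Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/)
- Beautiful Soup is an HTML parsing library that turns a web page into a `bs4.BeautifulSoup` object.
- A `bs4.BeautifulSoup` object is a tree that mirrors the document structure (like the DOM); you locate targets by walking the tree or by calling its search methods.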
```
!pip3 install beautifulsoup4
```

In [None]:
from bs4 import BeautifulSoup

html_doc="""<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>"""


soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


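The table below lists the main parsers with their advantages and disadvantages:

Parser|Usage|Advantages|Disadvantages
-|-|-|-
html.parser|`BeautifulSoup(markup, "html.parser")`|Part of the Python standard library; decent speed; tolerant of malformed markup|Poor tolerance of malformed markup in Python versions before 2.7.3 and 3.2.2
lxml HTML parser|`BeautifulSoup(markup, "lxml")`|Fast; tolerant of malformed markup (the usual choice)|Requires an external C library
lxml XML parser|`BeautifulSoup(markup, "xml")`|Fast; the only parser that supports XML|Requires an external C library
html5lib|`BeautifulSoup(markup, "html5lib")`|Most tolerant; parses pages the way a browser does; produces valid HTML5|Slow; requires an external Python package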


### Simple document parsing
- Accessing the tree with `.` attribute access is quick but error-prone
- The `find()`, `find_all()`, and `select()` methods are more robust

In [None]:
soup.title

<title>The Dormouse's story</title>

In [None]:
soup.title.name

'title'

In [None]:
soup.title.string

"The Dormouse's story"

In [None]:
soup.title.text

"The Dormouse's story"

In [None]:
soup.title.parent.name

'head'

In [None]:
soup.p

<p class="title"><b>The Dormouse's story</b></p>

In [None]:
soup.p['class']

['title']

In [None]:
soup.p.get("class") # `.get()` is the recommended way to read an attribute

['title']

In [None]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [None]:
soup.a["href"]

'http://example.com/elsie'

In [None]:
soup.a.get('href')

'http://example.com/elsie'

In [None]:
print(type(soup.title))
print(type(soup.p))
print(type(soup.a))

<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>


In [None]:
for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


In [None]:
[ i.get("href") for i in soup.find_all('a') ]

['http://example.com/elsie',
 'http://example.com/lacie',
 'http://example.com/tillie']

### `soup.find()`



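- Returns the first matching tag, together with everything it encloses
- The first argument is usually the tag name. A second positional argument given as a plain string is treated as a class name; you can also locate a block directly by `id` or other attributes. Once a block is located, you can read its attributes and the text it contains

  ```python
  soup.find(name=None,    # tag name to search for
      attrs={},      # {"attribute name": "attribute value"}
      recursive=True,  # search all descendants, not only direct children
      text=None,    # match on text content (called `string` in newer versions)
      **kwargs)
  ```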

In [None]:
help(soup.find)
#soup.find(name=None, attrs={}, recursive=True, text=None, **kwargs)

In [None]:
print(soup.find('p'))
print(soup.find("a"))

# get the text inside <a>...</a>
print(soup.find("a").string)
print(soup.find("a").text)

# get the text inside <title>...</title>
print(soup.title.string)
print(soup.title.text)

<p class="title"><b>The Dormouse's story</b></p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
Elsie
Elsie
The Dormouse's story
The Dormouse's story


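### `soup.find().get(attribute)`

- `.get("attribute")` is the better way to read an attribute from a node
  - With `get()`, a missing attribute returns `None`.
  - Square-bracket access also works, but it raises an error when the attribute is missing, which can stop the rest of the crawl.
  - For more detail, see the [official BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)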

In [None]:
# a missing attribute raises an error (common attributes: id, class, href, src)
soup.find("p")['style']

KeyError: 'style'

In [None]:
# a missing attribute returns None
print(soup.find('p').get('style'))

None
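
A `Tag` keeps its attributes in a dictionary (`tag.attrs`), so the two access styles behave like the two ways of reading a dict. The sketch below uses a plain dict as a stand-in for the attributes of `<p class="title">`, so it runs without Beautiful Soup:

```python
# Stand-in for tag.attrs of <p class="title">: a plain dict.
attrs = {"class": ["title"]}

print(attrs.get("class"))         # ['title']
print(attrs.get("style"))         # None: missing key, no error
print(attrs.get("style", ""))     # a default can be supplied instead of None

try:
    attrs["style"]                # square brackets raise on a missing key
except KeyError as err:
    print("KeyError:", err)
```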


### `soup.find_all()`





- Returns every match, as a `bs4.element.ResultSet` object
  ```python
  soup.find_all(name=None,     # tag name to search for
         attrs={},      # {"attribute name": "attribute value"}
         text=None,     # match on text content
         limit=None,     # maximum number of results
         **kwargs)
  ```

In [None]:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.python.org/')
soup = BeautifulSoup(res.text, "lxml")

In [None]:
p_tags = soup.find_all("p")
p_tags

[<p><strong>Notice:</strong> While JavaScript is not essential for this website, your interaction with the content will be limited. Please turn JavaScript on for the full experience. </p>,
 <p>The core of extensible programming is defining functions. Python allows mandatory and optional arguments, keyword arguments, and even arbitrary argument lists. <a href="//docs.python.org/3/tutorial/controlflow.html#defining-functions">More about defining functions in Python 3</a></p>,
 <p>Lists (known as arrays in other languages) are one of the compound data types that Python understands. Lists can be indexed, sliced and manipulated with other built-in functions. <a href="//docs.python.org/3/tutorial/introduction.html#lists">More about lists in Python 3</a></p>,
 <p>Calculations are simple with Python, and expression syntax is straightforward: the operators <code>+</code>, <code>-</code>, <code>*</code> and <code>/</code> work as expected; parentheses <code>()</code> can be used for grouping. <a

In [None]:
type(p_tags)

bs4.element.ResultSet

In [None]:
# find every text node whose content equals the given string
print(soup.find_all(text="Latest News"))

['Latest News']


- Iterate over a `bs4.element.ResultSet` with a for loop

In [None]:
for tag in p_tags:
  print(tag)
  print(type(tag)) # each item taken out is a `bs4.element.Tag` object

<p><strong>Notice:</strong> While JavaScript is not essential for this website, your interaction with the content will be limited. Please turn JavaScript on for the full experience. </p>
<class 'bs4.element.Tag'>
<p>The core of extensible programming is defining functions. Python allows mandatory and optional arguments, keyword arguments, and even arbitrary argument lists. <a href="//docs.python.org/3/tutorial/controlflow.html#defining-functions">More about defining functions in Python 3</a></p>
<class 'bs4.element.Tag'>
<p>Lists (known as arrays in other languages) are one of the compound data types that Python understands. Lists can be indexed, sliced and manipulated with other built-in functions. <a href="//docs.python.org/3/tutorial/introduction.html#lists">More about lists in Python 3</a></p>
<class 'bs4.element.Tag'>
<p>Calculations are simple with Python, and expression syntax is straightforward: the operators <code>+</code>, <code>-</code>, <code>*</code> and <code>/</code> wor

In [None]:
for tag in p_tags:
  print(tag.text)
  print(type(tag.text)) # the parsed text content, a str

Notice: While JavaScript is not essential for this website, your interaction with the content will be limited. Please turn JavaScript on for the full experience. 
<class 'str'>
The core of extensible programming is defining functions. Python allows mandatory and optional arguments, keyword arguments, and even arbitrary argument lists. More about defining functions in Python 3
<class 'str'>
Lists (known as arrays in other languages) are one of the compound data types that Python understands. Lists can be indexed, sliced and manipulated with other built-in functions. More about lists in Python 3
<class 'str'>
Calculations are simple with Python, and expression syntax is straightforward: the operators +, -, * and / work as expected; parentheses () can be used for grouping. More about simple math functions in Python 3.
<class 'str'>
Python knows the usual control flow statements that other languages speak — if, for, while and range — with some of its own twists, of course. More control flo

In [None]:
# read an attribute from each node

a_tags = soup.find_all("a")
for tag in a_tags:
  print(tag.get('href'))

#content
#python-network
/
/psf-landing/
https://docs.python.org
https://pypi.org/
/jobs/
/community-landing/
#top
/
https://psfmember.org/civicrm/contribute/transact?reset=1&id=2
#site-map
#
javascript:;
javascript:;
javascript:;
#
https://www.facebook.com/pythonlang?fref=ts
https://twitter.com/ThePSF
/community/irc/
/about/
/about/apps/
/about/quotes/
/about/gettingstarted/
/about/help/
http://brochure.getpython.info/
/downloads/
/downloads/
/downloads/source/
/downloads/windows/
/downloads/macos/
/download/other/
https://docs.python.org/3/license.html
/download/alternatives
/doc/
/doc/
/doc/av
https://wiki.python.org/moin/BeginnersGuide
https://devguide.python.org/
https://docs.python.org/faq/
http://wiki.python.org/moin/Languages
http://python.org/dev/peps/
https://wiki.python.org/moin/PythonBooks
/doc/essays/
/community/
/community/diversity/
/community/lists/
/community/irc/
/community/forums/
/psf/annual-report/2021/
/community/workshops/
/community/sigs/
/community/logos/
https

- Pass a list `[]` to `soup.find_all()` to search for several tags at once

In [None]:
from pprint import pprint

tags = soup.find_all(["a", "b", "p"]) # search for all links, bold text, and paragraphs
pprint(tags)

[<p><strong>Notice:</strong> While JavaScript is not essential for this website, your interaction with the content will be limited. Please turn JavaScript on for the full experience. </p>,
 <a href="#content" title="Skip to content">Skip to content</a>,
 <a aria-hidden="true" class="jump-link" href="#python-network" id="close-python-network">
<span aria-hidden="true" class="icon-arrow-down"><span>▼</span></span> Close
                </a>,
 <a class="current_item selectedcurrent_branch selected" href="/" title="The Python Programming Language">Python</a>,
 <a href="/psf-landing/" title="The Python Software Foundation">PSF</a>,
 <a href="https://docs.python.org" title="Python Documentation">Docs</a>,
 <a href="https://pypi.org/" title="Python Package Index">PyPI</a>,
 <a href="/jobs/" title="Python Job Board">Jobs</a>,
 <a href="/community-landing/">Community</a>,
 <a aria-hidden="true" class="jump-link" href="#top" id="python-network">
<span aria-hidden="true" class="icon-arrow-up"><sp

In [None]:
tags = soup.find_all(["a", "p"], limit=2) # limit caps the number of results
pprint(tags)

[<p><strong>Notice:</strong> While JavaScript is not essential for this website, your interaction with the content will be limited. Please turn JavaScript on for the full experience. </p>,
 <a href="#content" title="Skip to content">Skip to content</a>]


### `soup.select()`


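- Takes a CSS selector and returns a list
- Items in the list are still `bs4` `Tag` objects; extract their text or attributes before working with them as ordinary Python values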

```python
select(selector, _candidate_generator=None, limit=None)

```

In [None]:
from bs4 import BeautifulSoup
import requests

res = requests.get('http://www.example.com/')
soup = BeautifulSoup(res.text, "lxml")

In [None]:
select_a = soup.select("a")

In [None]:
print(type(select_a))
print(select_a)

<class 'list'>
[<a href="https://www.iana.org/domains/example">More information...</a>]


In [None]:
print(type(select_a[0]))
print(select_a[0])

<class 'bs4.element.Tag'>
<a href="https://www.iana.org/domains/example">More information...</a>


In [None]:
# parse out the text content
print(type(select_a[0].text))
print(select_a[0].text)

<class 'str'>
More information...


In [None]:
select_href1 = soup.select('[href]')

In [None]:
print(type(select_href1))
print(select_href1)

<class 'list'>
[<a href="https://www.iana.org/domains/example">More information...</a>]


In [None]:
print(type(select_href1[0]))
print(select_href1[0])

<class 'bs4.element.Tag'>
<a href="https://www.iana.org/domains/example">More information...</a>


In [None]:
# use `.get(attribute)` to read an attribute
print(type(select_href1[0].get('href')))
print(select_href1[0].get('href'))

<class 'str'>
https://www.iana.org/domains/example


In [None]:
import requests
from bs4 import BeautifulSoup

res = requests.get('http://python.org/')
soup = BeautifulSoup(res.text, "lxml")
a1=soup.select("#touchnav-wrapper > header > div > h1 > a > img")

In [None]:
a1

[<img alt="python™" class="python-logo" src="/static/img/python-logo.png"/>]

In [None]:
a1[0].get("src")

'/static/img/python-logo.png'

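## Searching with regular expressions


- Regular expressions are very useful for precisely extracting tags and text from a web page. They cover many cases that XPath and CSS selectors cannot pin down exactly, so they are worth learning well.
- You can test patterns against sample text at [regex101.com](https://regex101.com/).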


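|Symbol|Meaning|Example|Matching strings
|-|-|-|-
|`*`|The preceding character, expression, or `[]` character set occurs zero or more times|`a*b*`|aaaa, aaabb, bbbb
|`+`|The preceding character, expression, or `[]` character set occurs one or more times|`a+b+`|aaab, aabbb, abbb
|`?`|The preceding character, expression, or `[]` character set occurs zero or one time|`a?b?`|ab, b
|`[]`|Any one of the characters inside `[]`|`[A-Z]*`|ALLPE, CAP, QWER
|`()`|Grouping; the group is evaluated first|`(a*b)*`|aabaab, abaab, ababab
|`{m,n}`|The preceding character, expression, or `[]` set occurs between m and n times (inclusive)|`a{2,3}b{2,3}`|aabbb, aaabbb, aabb
|`[^]`|Any one character not inside `[]`|`[^A-Z]*`|apple, banana, cat
|`\|`|Either of the characters, strings, or expressions separated by `\|`|`b(a\|i\|e)d`|bad, bid, bed
|`.`|Any single character (letters, symbols, digits, spaces; not a newline by default)|`b.d`|bsd, bid, bed
|`^`|Anchors the match at the start of the string|`^a`|apple, afk
|`$`|Anchors the match at the end of the string|`[A-Z]*[a-z]*$`|Aab, zzz
|`\d`|Any digit|`\d+`|455, 5566
|`\w`|Any word character (letter, digit, or underscore)|`\w+`|123ABC, C8763
|`\s`|Any whitespace character|`\s`|space, tab, newline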
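
Several rows of the table can be checked directly with `re.fullmatch`, which succeeds only when the whole string matches the pattern. A small self-checking sketch (the non-matching strings are extra examples chosen for illustration):

```python
import re

# (pattern, strings that should match in full, strings that should not)
cases = [
    (r"a*b*",         ["aaaa", "aaabb", "bbbb", ""], ["ba"]),
    (r"a+b+",         ["aaab", "abbb"],              ["aaaa", "b"]),
    (r"a?b?",         ["ab", "b", "a"],              ["aab"]),
    (r"(a*b)*",       ["aabaab", "ababab"],          ["aba"]),
    (r"a{2,3}b{2,3}", ["aabb", "aaabbb"],            ["ab", "aaaabb"]),
    (r"b(a|i|e)d",    ["bad", "bid", "bed"],         ["bud"]),
    (r"[^A-Z]*",      ["apple", "cat"],              ["Apple"]),
]

for pattern, should_match, should_fail in cases:
    for s in should_match:
        assert re.fullmatch(pattern, s), (pattern, s)
    for s in should_fail:
        assert not re.fullmatch(pattern, s), (pattern, s)
    print(f"{pattern!r}: ok")
```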



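#### Python's re module
- Try patterns out at [regex101](https://regex101.com/)
- To avoid confusion with escape sequences in ordinary strings, write regular expression patterns as raw strings, that is, prefix the string literal with `r`, as in `r'...'`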
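
The raw-string advice matters most for sequences such as `\b`, which mean one thing to Python strings and another to `re`. A short illustration with a made-up sample sentence:

```python
import re

# "\b" in a normal string is the backspace character (one character),
# while r"\b" is a backslash followed by "b" (two characters).
print(len("\b"), len(r"\b"))

# Only the raw-string pattern reaches re as the word-boundary token \b.
text = "cat concat cat."
print(re.findall(r"\bcat\b", text))
print(re.findall("\bcat\b", text))
```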

##### Reference usage
```python
import re

# find every substring of `string` that matches `pattern`
pattern = "the regular expression I wrote"
string = "the string I want to search"
re.findall(pattern, string)
```

In [None]:
import re

pattern = "我"
string = "我想要找的字串我我"
re.findall(pattern, string)

['我', '我', '我']

In [None]:
import re

pattern = r"^[a-zA-Z0-9\._-]+@[a-zA-Z0-9\._-]+$"
string = "willismax.com@gmail.com"
re.findall(pattern, string)

In [None]:
import requests
import re

res = requests.get('http://python.org/')

pattern = r'h[1-6]' # headings h1-h6
string = res.text
re.findall(pattern, string)

['h1',
 'h1',
 'h1',
 'h1',
 'h1',
 'h1',
 'h1',
 'h1',
 'h1',
 'h1',
 'h1',
 'h1',
 'h2',
 'h2',
 'h2',
 'h2',
 'h2',
 'h2',
 'h2',
 'h2',
 'h2',
 'h2',
 'h2',
 'h2',
 'h2',
 'h2',
 'h2',
 'h2',
 'h2',
 'h2',
 'h2',
 'h2']

In [None]:
import re

res = requests.get('http://python.org/')

pattern = r'"\S*\.png"' # quoted strings ending in .png
string = res.text
re.findall(pattern, string)

['"/static/apple-touch-icon-144x144-precomposed.png"',
 '"/static/apple-touch-icon-114x114-precomposed.png"',
 '"/static/apple-touch-icon-72x72-precomposed.png"',
 '"/static/apple-touch-icon-precomposed.png"',
 '"/static/apple-touch-icon-precomposed.png"',
 '"/static/metro-icon-144x144-precomposed.png"',
 '"https://www.python.org/static/opengraph-icon-200x200.png"',
 '"https://www.python.org/static/opengraph-icon-200x200.png"',
 '"/static/img/python-logo.png"']
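
To match more than one extension, the pattern needs alternation, and the kind of group used changes what `re.findall` returns. A sketch on a made-up HTML snippet (the paths are illustrative only):

```python
import re

html = '<img src="/static/img/logo.png"> <img src="/media/photo.jpg"> <a href="/about/">'

# A capturing group makes findall return only the group's text.
print(re.findall(r'"\S*\.(jpg|png)"', html))

# A non-capturing group (?:...) returns the whole match instead.
print(re.findall(r'"\S*\.(?:jpg|png)"', html))

# Capture just the path, without the surrounding quotes.
print(re.findall(r'"(\S*\.(?:jpg|png))"', html))
```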

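## Web scraping examples


### PTT as an example


- From here on, we use the Chrome developer tools to locate elements
- First, look at the target page: https://www.ptt.cc/bbs/StupidClown/index.html
- In Chrome, right-click and choose "Inspect"; on Windows the shortcut is Ctrl+Shift+I or F12

- If you would rather use a crawler someone else has written, see https://dotblogs.com.tw/codinghouse/2018/10/22/pttcrawler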

In [None]:
//*[@id="main-container"]/div[2]/div[2]/div[2]/a
#main-container > div.r-list-container.action-bar-margin.bbs-screen > div:nth-child(2) > div.title > a

![](https://i.imgur.com/K55v4SH.png)


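- The article list shows the push count, title, author, date, and link of each article
- First look at the tree structure and the corresponding tags and attributes
- Record the location with "Copy XPath"

|Name|selector|
-|-
Title|`//*[@id="main-container"]/div[2]/div[4]/div[2]/a`
Link|`//*[@id="main-container"]/div[2]/div[4]/div[2]/a`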

In [None]:
# target URL: https://www.ptt.cc/bbs/StupidClown/index.html
import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.ptt.cc/bbs/StupidClown/index.html')
soup = BeautifulSoup(res.text ,"lxml")
print(res.text)

<!DOCTYPE html>
<html>
	<head>
		<meta charset="utf-8">
		

<meta name="viewport" content="width=device-width, initial-scale=1">

<title>看板 StupidClown 文章列表 - 批踢踢實業坊</title>

<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-common.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-base.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-custom.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/pushstream.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-print.css" media="print">




	</head>
    <body>
		
<div id="topbar-container">
	<div id="topbar" class="bbs-content">
		<a id="logo" href="/bbs/">批踢踢實業坊</a>
		<span>&rsaquo;</span>
		<a class="board" href="/bbs/StupidClown/index.html"><span class="board-label">看板 </span>StupidClown</a>
		<a class="right small" href="/about.html">關於我們</a>
		<a class="right small" hre

In [None]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   看板 StupidClown 文章列表 - 批踢踢實業坊
  </title>
  <link href="//images.ptt.cc/bbs/v2.27/bbs-common.css" rel="stylesheet" type="text/css"/>
  <link href="//images.ptt.cc/bbs/v2.27/bbs-base.css" media="screen" rel="stylesheet" type="text/css"/>
  <link href="//images.ptt.cc/bbs/v2.27/bbs-custom.css" rel="stylesheet" type="text/css"/>
  <link href="//images.ptt.cc/bbs/v2.27/pushstream.css" media="screen" rel="stylesheet" type="text/css"/>
  <link href="//images.ptt.cc/bbs/v2.27/bbs-print.css" media="print" rel="stylesheet" type="text/css"/>
 </head>
 <body>
  <div id="topbar-container">
   <div class="bbs-content" id="topbar">
    <a href="/bbs/" id="logo">
     批踢踢實業坊
    </a>
    <span>
     ›
    </span>
    <a class="board" href="/bbs/StupidClown/index.html">
     <span class="board-label">
      看板
     </span>
     StupidClown
    </a>
    <a class="r

- The page was fetched. Looking only at links and titles, both sit inside `div` tags with `class='title'`

In [None]:
# #main-container > div.r-list-container.action-bar-margin.bbs-screen > div:nth-child(5) > div.title > a
results = soup.select("div.title > a")
print(results)
print(type(results))

[<a href="/bbs/StupidClown/M.1668829121.A.8E6.html">Re: [無言] 下錯月台</a>, <a href="/bbs/StupidClown/M.1668831384.A.31C.html">Re: [無言] 電腦白痴</a>, <a href="/bbs/StupidClown/M.1158735717.A.828.html">[公告] 笨板板規</a>, <a href="/bbs/StupidClown/M.1435710970.A.31E.html">[公告]本板即日起不可PO問卷文</a>, <a href="/bbs/StupidClown/M.1667491247.A.0B7.html">[公告] 11月份置底閒聊文 </a>]
<class 'list'>


In [None]:
article_href = soup.select("div.title a")
article_href

[<a href="/bbs/StupidClown/M.1668829121.A.8E6.html">Re: [無言] 下錯月台</a>,
 <a href="/bbs/StupidClown/M.1668831384.A.31C.html">Re: [無言] 電腦白痴</a>,
 <a href="/bbs/StupidClown/M.1158735717.A.828.html">[公告] 笨板板規</a>,
 <a href="/bbs/StupidClown/M.1435710970.A.31E.html">[公告]本板即日起不可PO問卷文</a>,
 <a href="/bbs/StupidClown/M.1667491247.A.0B7.html">[公告] 11月份置底閒聊文 </a>]

In [None]:
# take out each title and build the full link
for a in article_href:
  print(f'{a.text}')
  print(f'href: https://www.ptt.cc{a.get("href")}')

  # fetch the linked page and save it to a file
  content_url = f'https://www.ptt.cc{a.get("href")}'
  r = requests.get(content_url)
  with open(f'{a.text}.html', 'w', encoding='utf-8') as f:
    f.write(r.text)
    print('saved')

Re: [無言] 下錯月台
href: https://www.ptt.cc/bbs/StupidClown/M.1668829121.A.8E6.html
saved
Re: [無言] 電腦白痴
href: https://www.ptt.cc/bbs/StupidClown/M.1668831384.A.31C.html
saved
[公告] 笨板板規
href: https://www.ptt.cc/bbs/StupidClown/M.1158735717.A.828.html
saved
[公告]本板即日起不可PO問卷文
href: https://www.ptt.cc/bbs/StupidClown/M.1435710970.A.31E.html
saved
[公告] 11月份置底閒聊文 
href: https://www.ptt.cc/bbs/StupidClown/M.1667491247.A.0B7.html
saved
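
Gluing the site root onto each `href` works here because every link is root-relative. As a more general alternative, `urllib.parse.urljoin` from the standard library resolves a link against the page it came from. In the sketch below, `index2.html` is a hypothetical relative link used only for illustration:

```python
from urllib.parse import urljoin

page_url = "https://www.ptt.cc/bbs/StupidClown/index.html"

# Root-relative, page-relative, and already-absolute links all resolve correctly.
print(urljoin(page_url, "/bbs/StupidClown/M.1668829121.A.8E6.html"))
print(urljoin(page_url, "index2.html"))
print(urljoin(page_url, "https://www.python.org/"))
```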


In [None]:
%ls

'Re: [無言] 下錯月台.html'  '[公告] 11月份置底閒聊文 .html'
'Re: [無言] 電腦白痴.html'  '[公告]本板即日起不可PO問卷文.html'
 sample_data/               '[公告] 笨板板規.html'


In [None]:
# boards restricted to users aged 18+ require a cookie
import requests

def fetch(url):
    # the over18 cookie tells the server on every request that the user is over 18
    response = requests.get(url, cookies={'over18': '1'})
    return response

url = 'https://www.ptt.cc/bbs/Gossiping/index.html'
resp = fetch(url)  # step-1

print(resp.text) # result of step-1

<!DOCTYPE html>
<html>
	<head>
		<meta charset="utf-8">
		

<meta name="viewport" content="width=device-width, initial-scale=1">

<title>看板 Gossiping 文章列表 - 批踢踢實業坊</title>

<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-common.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-base.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-custom.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/pushstream.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-print.css" media="print">




	</head>
    <body>
		
<div id="topbar-container">
	<div id="topbar" class="bbs-content">
		<a id="logo" href="/bbs/">批踢踢實業坊</a>
		<span>&rsaquo;</span>
		<a class="board" href="/bbs/Gossiping/index.html"><span class="board-label">看板 </span>Gossiping</a>
		<a class="right small" href="/about.html">關於我們</a>
		<a class="right small" href="/co

- For more, see [爬蟲教學 CrawlerTutorial](https://github.com/leVirve/CrawlerTutorial)

### Wikipedia's list of Asian countries as an example

- Reference: [Web Scraping Wikipedia Tables using BeautifulSoup and Python](https://medium.com/analytics-vidhya/web-scraping-wiki-tables-using-beautifulsoup-and-python-6b9ea26d8722)

In [None]:
import requests

url='https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area'
website_url = requests.get(url).text

from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')
print(soup.prettify()[:500])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of Asian countries by area - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"c2a


![](https://miro.medium.com/max/740/1*NyaaGqqHnemKSWu8DQqUHQ.png)

In [None]:
My_table = soup.find("table",{"class":"wikitable sortable"})
My_table

<table class="wikitable sortable">
<tbody><tr>
<th rowspan="2">Rank
</th>
<th rowspan="2">Country
</th>
<th colspan="2">Area
</th>
<th class="unsortable" rowspan="2">Notes
</th>
<th rowspan="2">Facts
</th></tr>
<tr>
<th>km²
</th>
<th>sq mi
</th></tr>
<tr>
<td>1
</td>
<td><span class="flagicon" style="display:inline-block;width:25px;"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/23px-Flag_of_Russia.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/35px-Flag_of_Russia.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/45px-Flag_of_Russia.svg.png 2x" width="23"/></span> <a href="/wiki/Russia" title="Russia">Russia</a>*
</td>
<td data-sort-value="7013131291420000000" style="text-align:right;">13,129,142
</td>
<td data-sort-value="7013131291420000000" style="text-align:right;">5,069,190
</td>
<td>17,

In [None]:
links = My_table.find_all('a')
links

[<a href="/wiki/Russia" title="Russia">Russia</a>,
 <a href="/wiki/European_Russia" title="European Russia">European Russia</a>,
 <a href="#cite_note-russiaTotalAreaByCIA-1">[1]</a>,
 <a href="/wiki/China" title="China">China</a>,
 <a href="/wiki/Taiwan" title="Taiwan">Taiwan</a>,
 <a href="/wiki/Hong_Kong" title="Hong Kong">Hong Kong</a>,
 <a href="/wiki/Macau" title="Macau">Macau</a>,
 <a href="/wiki/India" title="India">India</a>,
 <a href="/wiki/Kazakhstan" title="Kazakhstan">Kazakhstan</a>,
 <a href="/wiki/Saudi_Arabia" title="Saudi Arabia">Saudi Arabia</a>,
 <a href="/wiki/Iran" title="Iran">Iran</a>,
 <a href="/wiki/Mongolia" title="Mongolia">Mongolia</a>,
 <a href="/wiki/Indonesia" title="Indonesia">Indonesia</a>,
 <a href="/wiki/Western_New_Guinea" title="Western New Guinea">Indonesian Papua</a>,
 <a href="/wiki/Oceania" title="Oceania">Oceania</a>,
 <a href="/wiki/Pakistan" title="Pakistan">Pakistan</a>,
 <a href="/wiki/Turkey" title="Turkey">Turkey</a>,
 <a href="/wiki/East_

In [None]:
country = [
        link.get('title')
        for link in links
        if link.get('title') is not None
        ]

In [None]:
country

['Russia',
 'European Russia',
 'China',
 'Taiwan',
 'Hong Kong',
 'Macau',
 'India',
 'Kazakhstan',
 'Saudi Arabia',
 'Iran',
 'Mongolia',
 'Indonesia',
 'Western New Guinea',
 'Oceania',
 'Pakistan',
 'Turkey',
 'East Thrace',
 'Myanmar',
 'Afghanistan',
 'Yemen',
 'Thailand',
 'Turkmenistan',
 'Uzbekistan',
 'Iraq',
 'Japan',
 'Vietnam',
 'Malaysia',
 'Oman',
 'Philippines',
 'Laos',
 'Kyrgyzstan',
 'Syria',
 'Golan Heights',
 'Cambodia',
 'Bangladesh',
 'Nepal',
 'Tajikistan',
 'North Korea',
 'South Korea',
 'Jordan',
 'United Arab Emirates',
 'Azerbaijan',
 'Caucasus',
 'Europe',
 'Asia',
 'Georgia (country)',
 'Caucasus',
 'Europe',
 'Asia',
 'Sri Lanka',
 'Egypt',
 'Bhutan',
 'Taiwan',
 'Free area of the Republic of China',
 'Armenia',
 'Armenian highlands',
 'Caucasus',
 'Europe',
 'Asia',
 'Israel',
 'West Bank',
 'Gaza Strip',
 'Golan Heights',
 'Kuwait',
 'East Timor',
 'Qatar',
 'Lebanon',
 'Cyprus',
 'Northern Cyprus',
 'State of Palestine',
 'West Bank',
 'Gaza Strip',
 

In [None]:
country=[]
for link in links:
  if link.get("title") is not None:
    country.append(link.get("title"))

country
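
The scraped list contains repeats (for example `Caucasus`, `Europe`, and `Asia` appear several times). If each title should appear only once, one option is to remove duplicates before building the table. A sketch on a short sample of the list:

```python
# A short sample of the scraped titles, including repeats.
country = ["Russia", "China", "Taiwan", "Caucasus", "Europe", "Asia",
           "Caucasus", "Europe", "Asia", "Taiwan"]

# dict keys are unique and keep insertion order (Python 3.7+).
unique_country = list(dict.fromkeys(country))
print(unique_country)

# sorted(set(...)) also removes repeats when alphabetical order is fine.
print(sorted(set(country)))
```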

In [None]:
import pandas as pd

df = pd.DataFrame()
df['Country'] = country
df

Unnamed: 0,Country
0,Russia
1,European Russia
2,China
3,Taiwan
4,Hong Kong
...,...
71,Gaza Strip
72,Brunei
73,Bahrain
74,Singapore


In [None]:
df = df.sort_values(by="Country").reset_index(drop = True)
df

Unnamed: 0,Country
0,Afghanistan
1,Armenia
2,Armenian highlands
3,Asia
4,Asia
...,...
71,Vietnam
72,West Bank
73,West Bank
74,Western New Guinea


# Exercises

##  Exercise 1

- Try to read, run, and take apart the following program
- Source: https://github.com/jwlin/web-crawler-tutorial/blob/master/ch3/ptt_gossiping.py

In [None]:
import requests
import time
import json
from bs4 import BeautifulSoup


PTT_URL = 'https://www.ptt.cc'


def get_web_page(url):
  resp = requests.get(
    url=url,
    cookies={'over18': '1'}
  )
  if resp.status_code != 200:
    print(f'Invalid url: {resp.url}')
    return None
  else:
    return resp.text


def get_articles(dom, date):
  soup = BeautifulSoup(dom, 'html5lib')

  # get the link to the previous page
  paging_div = soup.find('div', 'btn-group btn-group-paging')
  prev_url = paging_div.find_all('a')[1]['href']

  articles = []  # holds the article data collected
  divs = soup.find_all('div', 'r-ent')
  for d in divs:
    if d.find('div', 'date').text.strip() == date:  # posted on the requested date
      # get the push count
      push_count = 0
      push_str = d.find('div', 'nrec').text
      if push_str:
        try:
          push_count = int(push_str)  # convert the string to a number
        except ValueError:
          # if conversion fails, the value may be '爆' or 'X1', 'X2', ...
          # otherwise do nothing; push_count stays 0
          if push_str == '爆':
            push_count = 99
          elif push_str.startswith('X'):
            push_count = -10

        # get the article link and title
        if d.find('a'):  # a link means the article exists and has not been deleted
          href = d.find('a')['href']
          title = d.find('a').text
          author = ''  # author = d.find('div', 'author').text if d.find('div', 'author') else ''
          articles.append({
            'title': title,
            'href': href,
            'push_count': push_count,
            'author': author
          })
          # [ {'title': __, 'href:__'}, ]
  return articles, prev_url


def get_author_ids(posts, pattern):
  ids = set()
  for post in posts:
    if pattern in post['author']:
      ids.add(post['author'])
  return ids

if __name__ == '__main__':
  current_page = get_web_page(PTT_URL + '/bbs/Gossiping/index.html')
  if current_page:
    articles = []  # all of today's articles
    today = time.strftime("%m/%d").lstrip('0')  # today's date, leading '0' removed to match PTT's format
    current_articles, prev_url = get_articles(current_page, today)  # today's articles on the current page
    while current_articles:  # while the current page has today's articles, collect them and go back one page
      articles += current_articles
      current_page = get_web_page(PTT_URL + prev_url)
      if current_page is None:  # stop if the previous page could not be fetched
        break
      current_articles, prev_url = get_articles(current_page, today)

    # Print all distinct author IDs containing '5566'
    # print(get_author_ids(articles, '5566'))

    # Save or process the article data
    print(f'{len(articles)} articles today')
    threshold = 50
    print(f'Popular articles (>{threshold} pushes):')
    for a in articles:
      if int(a['push_count']) > threshold:
        print(a)
    with open('gossiping.json', 'w', encoding='utf-8') as f:
      json.dump(articles, f, indent=2, sort_keys=True, ensure_ascii=False)

531 articles today
Popular articles (>50 pushes):
{'title': '[新聞] 北市動物園含淚放手 大貓熊「團團」心跳1', 'href': '/bbs/Gossiping/M.1668838683.A.2B2.html', 'push_count': 98, 'author': ''}
{'title': '[新聞] 快訊／「向團團珍重再見」\u3000柯文哲悼：', 'href': '/bbs/Gossiping/M.1668838392.A.C36.html', 'push_count': 71, 'author': ''}
{'title': '[新聞] 快訊／高雄市區又無預警停電\u3000877戶傳災情...台電緊急搶修中', 'href': '/bbs/Gossiping/M.1668836369.A.C06.html', 'push_count': 79, 'author': ''}
{'title': '[新聞] 快訊／大貓熊團團驚傳逝世！\u3000醫療團隊沉', 'href': '/bbs/Gossiping/M.1668835302.A.A0D.html', 'push_count': 99, 'author': ''}
{'title': '[新聞] 高虹安扯「死亡之握」惹怒藍營 馬英九：', 'href': '/bbs/Gossiping/M.1668835304.A.535.html', 'push_count': 80, 'author': ''}
{'title': '[新聞] 快訊／大貓熊團團驚傳逝世！\u3000醫療團隊沉', 'href': '/bbs/Gossiping/M.1668835423.A.951.html', 'push_count': 99, 'author': ''}
{'title': '[爆卦] 疑似烏克蘭軍人稱陣亡台灣人為狗', 'href': '/bbs/Gossiping/M.1668834804.A.B46.html', 'push_count': 99, 'author': ''}
{'title': '[新聞] 回應高虹安「死亡之握」 馬英九：事情已經解決 她道歉了', 'href': '/bbs/Gossiping/M.1668833633.A.E04.html', 'push_count': 
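
One way to start breaking the program down is to pull the push-count conversion out of `get_articles` into a small function that can be tried without any network access. The sketch below mirrors the rules used in the program above ('爆' becomes 99, values starting with 'X' become -10, anything else unparsable becomes 0). The function name `parse_push_count` and the extra `strip()` call are illustrative additions, not part of the original source.

```python
def parse_push_count(push_str):
    """Convert the text of a PTT 'nrec' cell into an integer score."""
    push_str = push_str.strip()  # added here for robustness; the original does not strip
    if not push_str:
        return 0  # no pushes yet
    try:
        return int(push_str)
    except ValueError:
        if push_str == '爆':
            return 99  # same placeholder value as the program above
        if push_str.startswith('X'):
            return -10  # e.g. 'X1', 'X2', ...
        return 0  # anything unexpected is treated as 0


for sample in ['', '12', '爆', 'X3', '??']:
    print(repr(sample), '->', parse_push_count(sample))
```

Separating this step makes it easy to check each branch by hand before running the full crawler.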

## Exercise 2

- Try to read, run, and break down the following program.
- Source: https://github.com/jwlin/web-crawler-tutorial/blob/master/ch3/yahoo_movie.py

In [None]:
import requests
import re
import json
from bs4 import BeautifulSoup


Y_MOVIE_URL = 'https://tw.movies.yahoo.com/movie_thisweek.html'

# Appending "/id=MOVIE_ID" to the URLs below gives the corresponding page for that movie
Y_INTRO_URL = 'https://tw.movies.yahoo.com/movieinfo_main.html'  # details
Y_PHOTO_URL = 'https://tw.movies.yahoo.com/movieinfo_photos.html'  # stills
Y_TIME_URL = 'https://tw.movies.yahoo.com/movietime_result.html'  # showtimes


def get_web_page(url):
  resp = requests.get(url)
  if resp.status_code != 200:
    print(f'Invalid url: {resp.url}')
    return None
  else:
    return resp.text


def get_movies(dom):
  soup = BeautifulSoup(dom, 'html5lib')
  movies = []
  rows = soup.find_all('div', 'release_info_text')
  for row in rows:
    movie = dict()
    movie['expectation'] = row.find('div', 'leveltext').span.text.strip()
    movie['ch_name'] = row.find('div', 'release_movie_name').a.text.strip()
    movie['eng_name'] = row.find('div', 'release_movie_name').find('div', 'en').a.text.strip()
    movie['movie_id'] = get_movie_id(row.find('div', 'release_movie_name').a.get('href'))
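    # NOTE: an <a> tag normally has no 'src' attribute, so this evaluates to None (see 'poster_url' in the output);
    # the image URL is more likely on the <img> nested inside the link.
    movie['poster_url'] = row.parent.find_previous_sibling('div', 'release_foto').a.get('src')
    movie['release_date'] = get_date(row.find('div', 'release_movie_time').text)
    movie['intro'] = row.find('div', 'release_text').text.replace('詳全文', '').strip()  # drop the "read more" link text
    trailer_a = row.find_next_sibling('div', 'release_btn color_btnbox').find_all('a')[1]
    movie['trailer_url'] = trailer_a.get('href', '')  # empty string when the link has no href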
    movies.append(movie)
  return movies


def get_date(date_str):
  # e.g. "上映日期：2022-11-04" (release date) -> match.group(0): "2022-11-04"
  pattern = r'\d+-\d+-\d+'  # raw string, so '\d' is not read as a string escape
  match = re.search(pattern, date_str)
  if match is None:
    return date_str
  else:
    return match.group(0)


def get_movie_id(url):
  # 2018-05-15: the URL format changed, e.g.
  #   'https://movies.yahoo.com.tw/movieinfo_main/%E6%AD%BB%E4%BE%8D2-deadpool-2-7820.html' -> '7820'
  # Old format (the split below does not extract an ID from it):
  #   'https://tw.rd.yahoo.com/referurl/movie/thisweek/info/*https://tw.movies.yahoo.com/movieinfo_main.html/id=6707'
  try:
    movie_id = url.split('.html')[0].split('-')[-1]
  except AttributeError:  # url is None when the <a> tag has no href
    movie_id = url
  return movie_id


def get_trailer_url(url):
  # e.g., 'https://tw.rd.yahoo.com/referurl/movie/thisweek/trailer/*https://tw.movies.yahoo.com/video/美女與野獸-最終版預告-024340912.html'
  return url.split('*')[1]


def get_complete_intro(movie_id):
  page = get_web_page(Y_INTRO_URL + '/id=' + movie_id)
  if page:
    soup = BeautifulSoup(page, 'html5lib')
    infobox = soup.find('div', 'gray_infobox_inner')
    print(infobox.text.strip())


def main():
  page = get_web_page(Y_MOVIE_URL)
  if page:
    movies = get_movies(page)
    for movie in movies:
      print(movie)
      # get_complete_intro(movie["movie_id"])
    with open('movie.json', 'w', encoding='utf-8') as f:
      json.dump(movies, f, indent=2, sort_keys=True, ensure_ascii=False)


if __name__ == '__main__':
  main()

{'expectation': '100%', 'ch_name': '2022 TEFF歐洲影展', 'eng_name': 'Taiwan European Film Festival', 'movie_id': '14318', 'poster_url': None, 'release_date': '2022-11-17', 'intro': '一、關於影展\t\nTEFF歐洲影展（Taiwan European Film Festival）自2005年起，於全台各地播放歐洲電影並提供觀眾免費入場觀看，希望台灣民眾藉由欣賞歐洲電影的過程中，認識歐洲國家的文化、藝術、和語言的多樣性，目前總觀影人次已超過 180,000 名。 \n\n2022年第18屆台灣歐洲影展由歐洲經貿辦事處主辦，外交部、文化部、台北市文化局合辦，與歐盟駐台各代表處協辦，邀請17個歐洲國家個別推選出一部電影參展，於2022年11月17日至2023年01月31日期間，在全台各地藝文空間、表演空間、戲院、大專院校等超過20個場地隆重巡迴放映。 \n\n今年影展片單，由歐盟駐台代表高哲夫(Filip Grzegorzeski)處長領軍選片，各歐盟國駐台代表，也由該國選一部最能代表該國的影片，總集成17國、17部影片，可謂是年度歐洲電影一時之選。也透過各歐盟成員國的選片觀點，呈現當下歐洲電影現況。\n\n 綜觀今年17部電影中，有9部是移民、難民議題電影，反映現今歐洲面臨的嚴正課題。另外8部，有溫馨的親情、懸疑的苦戀、清新的愛情、懵懂的青春、與成長的記憶等等。透過影片，書寫生活日常的歐洲，也刻劃永恆情愛的人性深層命題。\n\n二、影展主題\n《邊界‧無界》------ 有線的邊界，無界的愛。\n在邊界的內外，認同與原鄉、回歸與新生，一再糾結不斷。\n在愛的本質上，家庭與親情，男女與愛情，永遠無法割裂。\n電影影像與故事文本，如同法國哲學家德勒茲 (Gilles Deleuze) 的空間、時間、影像的辯證關係般。無限循環，永恆回溯，在愛的框內與邊界的框外，一再去框化、再框化……。\n\n三、影展時間、地點\n1、開幕日：2022年11月17日(四) 光點華山電影館\n2、全國巡迴放映：2022年11月18日~2023年1月31日，全國各公、私立大專院校、及藝文影視展演空間。\n\n四、主協辦單位\n
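
The two string helpers in this program can be checked offline by calling them on the sample strings quoted in their own comments. The sketch below re-declares `get_date` and `get_movie_id` in compact form so that it runs on its own. `get_movie_id_strict` is a hypothetical variant, not part of the original source, that returns `None` when the URL does not end in `-<digits>.html`.

```python
import re


def get_date(date_str):
    # Return the first YYYY-MM-DD style date, or the input unchanged if none is found
    match = re.search(r'\d+-\d+-\d+', date_str)
    return match.group(0) if match else date_str


def get_movie_id(url):
    # The ID is whatever follows the last '-' and precedes '.html'
    return url.split('.html')[0].split('-')[-1]


def get_movie_id_strict(url):
    # Hypothetical stricter variant: only accept URLs ending in '-<digits>.html'
    match = re.search(r'-(\d+)\.html$', url)
    return match.group(1) if match else None


NEW_URL = 'https://movies.yahoo.com.tw/movieinfo_main/%E6%AD%BB%E4%BE%8D2-deadpool-2-7820.html'
OLD_URL = ('https://tw.rd.yahoo.com/referurl/movie/thisweek/info/'
           '*https://tw.movies.yahoo.com/movieinfo_main.html/id=6707')

print(get_date('上映日期：2022-11-04'))  # 2022-11-04
print(get_movie_id(NEW_URL))             # 7820
print(get_movie_id_strict(NEW_URL))      # 7820
print(get_movie_id_strict(OLD_URL))      # None
```

Returning `None` for an unrecognized URL makes a format change on the site easier to notice than silently passing a non-ID string along.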

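## Exercise 3


- Fetch and parse an article from PTT's joke board.
- Practice with the following steps:
    - Use the GET method to read the page source of https://www.ptt.cc/bbs/joke/M.1571755669.A.663.html
    - Following the steps covered above, parse out each push (comment) and the user who posted it
    - Print them neatly with a for loop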
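
The parsing and printing steps can be sketched offline before the real page is fetched. The example below runs on a small hand-written HTML excerpt instead of the live article. The class names (`push`, `push-userid`, `push-content`) are assumptions about how PTT marks up comments and should be confirmed in the browser's developer tools. The sample users and texts are made up, and `parse_pushes` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical excerpt imitating the comment markup of a PTT article page
SAMPLE_HTML = """
<div id="main-content">
  <div class="push"><span class="hl push-tag">推 </span><span class="f3 hl push-userid">alice</span><span class="f3 push-content">: haha</span><span class="push-ipdatetime"> 10/22 23:01</span></div>
  <div class="push"><span class="f1 hl push-tag">噓 </span><span class="f3 hl push-userid">bob</span><span class="f3 push-content">: not funny</span><span class="push-ipdatetime"> 10/22 23:05</span></div>
</div>
"""


def parse_pushes(html):
    """Return a list of (user, content) tuples, one per comment."""
    soup = BeautifulSoup(html, 'html.parser')
    pushes = []
    for div in soup.find_all('div', 'push'):
        user = div.find('span', 'push-userid')
        content = div.find('span', 'push-content')
        if user is None or content is None:
            continue  # skip rows that do not look like a regular comment
        # The content text starts with ': ', which is stripped here
        pushes.append((user.text.strip(), content.text.lstrip(': ').strip()))
    return pushes


pushes = parse_pushes(SAMPLE_HTML)
for user, content in pushes:
    print(f'{user:<12} {content}')  # left-align the user ID in a 12-character column
```

To run it against the real article, pass the `text` of the response from `requests.get()` on the exercise URL to `parse_pushes` in place of `SAMPLE_HTML`.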