<a href="https://colab.research.google.com/github/willismax/MediaSystem-Python-Course/blob/main/03.Request/%E7%B6%B2%E9%A0%81%E6%93%B7%E5%8F%96_Request.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Crawling: Retrieving Web Page Data

- Fetch web pages with the [`requests`](https://docs.python-requests.org/en/latest/) module
  - requests.get()
  - requests.post()
- Parse web pages with the [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/) module
  - soup.find()
  - soup.find_all()
  - soup.select()

# Requests


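#### HTTP response status codes
When we send a request to a website (for example, by clicking a link), the server returns a status code that reports the outcome. Think of it as a small shared vocabulary between the site and us:
- `200 OK`: everything went fine, and the page you asked for is returned.
- `403 Forbidden`: the server refuses to let you in.
- `404 Not Found`: the page you asked for does not exist.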
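The mapping from numeric code to reason phrase can be sketched with the standard library alone. This is only an illustration: `describe_status` and `is_success` are hypothetical helper names, and with `requests` you would pass `r.status_code` to them.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a label such as '404 Not Found' for a known status code."""
    status = HTTPStatus(code)  # raises ValueError for an unknown code
    return f"{status.value} {status.phrase}"


def is_success(code: int) -> bool:
    """Treat every 2xx code as a successful response."""
    return 200 <= code < 300


for code in (200, 403, 404):
    print(describe_status(code), "->", "ok" if is_success(code) else "failed")
```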

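#### Page content encoding
Sometimes a page uses a different text encoding from the one we assume, and we have to set the encoding explicitly to read the content correctly. For example:
- `UTF-8`: the most common encoding; it supports many languages.
- `Big5`: an encoding that some Traditional Chinese websites still use.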
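A minimal sketch of what goes wrong when the codec does not match, using only built-in string methods. The sample text is an arbitrary assumption; with `requests`, the usual remedy is to set `r.encoding = "big5"` before reading `r.text`.

```python
text = "高鐵時刻表"            # sample Traditional Chinese text (assumed for illustration)
raw = text.encode("big5")      # the bytes a Big5 server would send

# Decoding with the wrong codec produces replacement characters instead of the text.
wrong = raw.decode("utf-8", errors="replace")
# Decoding with the matching codec restores the original text.
right = raw.decode("big5")

print(wrong)
print(right)
```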

#### Response content
- `response.text`: the response body decoded as text; for a web page this is the HTML source.
- `response.json()`: if the response is JSON, this parses it into Python objects (a list or a dict).
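To make the difference concrete, the sketch below parses a hand-written JSON string with the standard `json` module, which is roughly what `response.json()` does with the response body. The `body` value is made-up sample data, not a real API response.

```python
import json

# A JSON array as it would arrive in response.text (sample data).
body = '[{"id": "1", "type": "PushEvent"}, {"id": "2", "type": "WatchEvent"}]'

events = json.loads(body)                      # JSON array -> Python list of dicts
types = [event["type"] for event in events]    # ordinary Python access from here on

print(type(body).__name__, "->", type(events).__name__)
print(types)
```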


- The official [requests Quickstart](https://requests.readthedocs.io/en/latest/user/quickstart/) is a good companion to this section.

### Inspecting connection information

In [None]:
import requests  # import the requests module

url = "https://api.github.com/events"  # the URL to request
r = requests.get(url)  # send a GET request to that URL

r.json()  # parse the JSON response body into Python objects


In [None]:
# Status code
r.status_code

200

In [None]:
# Encoding
r.encoding

'utf-8'

In [None]:
# Raw content (bytes)
r.content[:500]

b'[{"id":"37222617124","type":"PushEvent","actor":{"id":41898282,"login":"github-actions[bot]","display_login":"github-actions","gravatar_id":"","url":"https://api.github.com/users/github-actions[bot]","avatar_url":"https://avatars.githubusercontent.com/u/41898282?"},"repo":{"id":726740836,"name":"43rnb7/auto","url":"https://api.github.com/repos/43rnb7/auto"},"payload":{"repository_id":726740836,"push_id":17875500905,"size":1,"distinct_size":1,"ref":"refs/heads/master","head":"ce4e4f33c2d0d823c2fe'

In [None]:
# Raise an HTTPError for a 4xx/5xx response (does nothing when the request succeeded)
r.raise_for_status()

In [None]:
# cookies
r.cookies

<RequestsCookieJar[]>

In [None]:
# Headers (HTTP header names are case-insensitive.)
r.headers

{'Server': 'GitHub.com', 'Date': 'Sat, 06 Apr 2024 05:51:18 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Cache-Control': 'public, max-age=60, s-maxage=60', 'Vary': 'Accept, Accept-Encoding, Accept, X-Requested-With', 'ETag': 'W/"e034336abcddba2032d4ac6387d9c3ebf0628623d096c1dee8c1427319500360"', 'Last-Modified': 'Sat, 06 Apr 2024 05:46:18 GMT', 'X-Poll-Interval': '60', 'X-GitHub-Media-Type': 'github.v3; format=json', 'Link': '<https://api.github.com/events?page=2>; rel="next", <https://api.github.com/events?page=10>; rel="last"', 'x-github-api-version-selected': '2022-11-28', 'Access-Control-Expose-Headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset', 'Access-Control-Allow-Origin': '*', 'Strict-Transport-Security': 'max-age=31536000; 

In [None]:
r.headers['Content-Type']

'application/json; charset=utf-8'
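The case-insensitive lookup can be tried offline with `requests.structures.CaseInsensitiveDict`, the mapping type that `requests` uses for headers. The snippet is a sketch; the header value is copied from the output above, and the way the charset is split out is just one simple approach.

```python
from requests.structures import CaseInsensitiveDict

headers = CaseInsensitiveDict({"Content-Type": "application/json; charset=utf-8"})

# Any capitalization reaches the same entry.
content_type = headers["content-type"]

# Split "type; charset=..." into the media type and the charset value.
media_type, _, param = content_type.partition(";")
charset = param.strip().split("=", 1)[1]

print(media_type, charset)
```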

- For binary content such as images, audio, and video, read the raw bytes from `r.content` rather than `r.text`.

### A quick look at `GET`, `POST`, `PUT`, and `DELETE`

In [None]:
!curl -X GET "https://api.github.com/events"

[
  {
    "id": "33414269492",
    "type": "PushEvent",
    "actor": {
      "id": 236117,
      "login": "sidnair",
      "display_login": "sidnair",
      "gravatar_id": "",
      "url": "https://api.github.com/users/sidnair",
      "avatar_url": "https://avatars.githubusercontent.com/u/236117?"
    },
    "repo": {
      "id": 34536227,
      "name": "sidnair/sidnair.github.io",
      "url": "https://api.github.com/repos/sidnair/sidnair.github.io"
    },
    "payload": {
      "repository_id": 34536227,
      "push_id": 15856394215,
      "size": 2,
      "distinct_size": 2,
      "ref": "refs/heads/master",
      "head": "19486018911cf776fa0e431669a5ec3b88d413e3",
      "before": "858ee38a8ed6ed9c98ec9d9de49101c22602abde",
      "commits": [
        {
          "sha": "759a371914b6ffe1534982b2984cc49567d62c69",
          "author": {
            "email": "sidnair09@gmail.com",
            "name": "Sid Nair"
          },
          "message": "gallery css cleanup",
          "distinct

In [None]:
# GET
r = requests.get('https://httpbin.org/get')
r.json()

{'args': {},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.31.0',
  'X-Amzn-Trace-Id': 'Root=1-65569d4d-1c18cc592dbe941931614262'},
 'origin': '34.125.60.156',
 'url': 'https://httpbin.org/get'}

In [None]:
# POST
r = requests.post('https://httpbin.org/post', data={'key': 'value'})
r.json()

{'args': {},
 'data': '',
 'files': {},
 'form': {'key': 'value'},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate',
  'Content-Length': '9',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.31.0',
  'X-Amzn-Trace-Id': 'Root=1-65569d55-46f508af7bc9388d1067b1d1'},
 'json': None,
 'origin': '34.125.60.156',
 'url': 'https://httpbin.org/post'}

In [None]:
# PUT
r = requests.put('https://httpbin.org/put', data={'key': 'value2'})
r.json()

{'args': {},
 'data': '',
 'files': {},
 'form': {'key': 'value2'},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate',
  'Content-Length': '10',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.31.0',
  'X-Amzn-Trace-Id': 'Root=1-65569d6b-19f44f8f58da821e6684a83d'},
 'json': None,
 'origin': '34.125.60.156',
 'url': 'https://httpbin.org/put'}

In [None]:
# DELETE
r = requests.delete('https://httpbin.org/delete')
r.json()

{'args': {},
 'data': '',
 'files': {},
 'form': {},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate',
  'Content-Length': '0',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.31.0',
  'X-Amzn-Trace-Id': 'Root=1-65569d7e-1451580b5227830c0fdac947'},
 'json': None,
 'origin': '34.125.60.156',
 'url': 'https://httpbin.org/delete'}

In [None]:
r.text

'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {}, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Content-Length": "0", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.31.0", \n    "X-Amzn-Trace-Id": "Root=1-65569d7e-1451580b5227830c0fdac947"\n  }, \n  "json": null, \n  "origin": "34.125.60.156", \n  "url": "https://httpbin.org/delete"\n}\n'

### `GET` and how to add query parameters

In [None]:
!curl -X GET "https://httpbin.org/get?k1=v1&k2=v2"

{
  "args": {
    "k1": "v1", 
    "k2": "v2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Host": "httpbin.org", 
    "User-Agent": "curl/7.81.0", 
    "X-Amzn-Trace-Id": "Root=1-65569d91-3604f75e0786577856cbdf28"
  }, 
  "origin": "34.125.60.156", 
  "url": "https://httpbin.org/get?k1=v1&k2=v2"
}


In [None]:
import requests

payload = {'k1': 'v1', 'k2': 'v2'}
r = requests.get('https://httpbin.org/get', params=payload)
print(r.url)

https://httpbin.org/get?k1=v1&k2=v2


In [None]:
# GET parameters are appended to the URL after the `?`
import requests

payload  = {"api":"1", "map_action":"map", "zoom":"16", "query":"24.149660,120.684166"}
r = requests.get('https://www.google.com/maps/search/', params=payload )
print(r.url)

https://www.google.com/maps/search/?api=1&map_action=map&zoom=16&query=24.149660%2C120.684166
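How a `params` dict turns into a query string can be reproduced with `urllib.parse.urlencode` from the standard library. This is a sketch of the encoding step only, not of what `requests` does internally in every case.

```python
from urllib.parse import urlencode

base_url = "https://httpbin.org/get"
payload = {"k1": "v1", "k2": "v2"}

query = urlencode(payload)            # key=value pairs joined with "&"
full_url = f"{base_url}?{query}"

# Reserved characters such as "," are percent-encoded.
coords = urlencode({"query": "24.149660,120.684166"})

print(full_url)
print(coords)
```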


- Downloading large files with `requests.get` (works for images, audio/video, and other binary files)

In [None]:
# Source: https://stackoverflow.com/questions/16694907/download-large-file-in-python-with-requests
import requests

def download_file(url):
    """Download a file and save it under the name url.split('/')[-1]."""
    local_filename = url.split('/')[-1]
    # stream=True avoids loading the whole body into memory at once
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return local_filename

if __name__ == '__main__':
    download_file("https://api.github.com/events")
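Taking the file name with `url.split('/')[-1]` keeps any query string and yields an empty name for URLs that end in `/`. A slightly more careful variant might look like the sketch below; `filename_from_url` and its `default` argument are hypothetical names introduced here for illustration.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url: str, default: str = "download.bin") -> str:
    """Derive a local file name from the path part of a URL."""
    path = urlsplit(url).path                  # drops the query string and fragment
    name = posixpath.basename(unquote(path))   # last path segment, percent-decoded
    return name or default                     # fall back when the path ends in "/"


print(filename_from_url("https://example.com/img/logo.png?size=large"))
print(filename_from_url("https://example.com/"))
```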

### `POST`: a POST request that carries data

In [None]:
!curl -X POST -d "key1=value1&key2=value2" "https://httpbin.org/post"

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "curl/7.81.0", 
    "X-Amzn-Trace-Id": "Root=1-65569dd6-01fd49686dff16536c429835"
  }, 
  "json": null, 
  "origin": "34.125.60.156", 
  "url": "https://httpbin.org/post"
}


- `payload_tuples` and `payload_dict` both send several values under the same key; the two forms are equivalent

In [None]:
import requests

payload_tuples = [('key1', 'value1'), ('key1', 'value2')]
# payload_dict = {'key1': ['value1', 'value2']}
# payload = {'key1': 'value1', 'key2': 'value2'}

requests.post('https://httpbin.org/post', data=payload_tuples).json()
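The equivalence of the two payload forms can be checked without a network call by encoding both with `urllib.parse.urlencode`. This is a stand-in for the form encoding that `requests` performs, shown under that assumption.

```python
from urllib.parse import urlencode

payload_tuples = [("key1", "value1"), ("key1", "value2")]
payload_dict = {"key1": ["value1", "value2"]}

body_from_tuples = urlencode(payload_tuples)
# doseq=True expands a list value into one key=value pair per item.
body_from_dict = urlencode(payload_dict, doseq=True)

print(body_from_tuples)
print(body_from_dict)
```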

-  `post(url, data=None, json=None, **kwargs)`: the example below sends a dict either serialized by hand with `json.dumps` or passed directly through `json=`


In [None]:
import json

url = 'https://httpbin.org/post'
payload = {'some': 'data'}

# Both calls send the same JSON body; json= additionally sets the Content-Type header to application/json
r = requests.post(url, data=json.dumps(payload))
r = requests.post(url, json=payload)

r.json()
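The two calls can be compared without sending anything by building the requests with `requests.Request(...).prepare()`. The sketch below only inspects the prepared headers and body; the URL is the same httpbin endpoint used above.

```python
import json

import requests

url = "https://httpbin.org/post"
payload = {"some": "data"}

# prepare() builds the outgoing request without sending it.
as_data = requests.Request("POST", url, data=json.dumps(payload)).prepare()
as_json = requests.Request("POST", url, json=payload).prepare()

# A plain string passed through data= gets no Content-Type header.
print(as_data.headers.get("Content-Type"))
# json= serializes the dict and labels the body as JSON.
print(as_json.headers.get("Content-Type"))
```

If the server depends on the `Content-Type` header, `json=` is the safer choice; with `data=json.dumps(...)` the header would have to be added by hand through `headers=`.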

- Sending `Cookies`

In [None]:
!curl --cookie "my_cookie=22222" https://httpbin.org/cookies

{
  "cookies": {
    "my_cookie": "22222"
  }
}


In [None]:
url = 'https://httpbin.org/cookies'
cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)
r.json()

{'cookies': {'cookies_are': 'working'}}

- Timeouts

In [None]:
requests.get('https://github.com/', timeout=0.001)
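With such a short timeout the call above raises an exception. In a crawler it is usually preferable to catch it and move on; the sketch below shows one way to do that. `fetch_or_none` is a hypothetical helper name, and the 3-second default is an arbitrary assumption.

```python
import requests


def fetch_or_none(url, timeout=3.0):
    """Return the response text, or None if the request fails or times out."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        return resp.text
    except requests.exceptions.Timeout:
        print(f"Timed out after {timeout} s: {url}")
    except requests.exceptions.RequestException as exc:
        # Covers connection errors, HTTP errors, invalid URLs, and so on.
        print(f"Request failed: {exc}")
    return None
```

`Timeout` is listed first because it is a subclass of `RequestException`; the second handler catches every other failure raised by `requests`.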

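## `GET` example: the example.com page


- First look at the target page: http://www.example.com/
- Fetch the page source with `requests.get` and print the result
- Once the page has been fetched, this stage is done!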

In [None]:
import requests  # import the requests module

url = "http://www.example.com/"  # the URL to request
r = requests.get(url)  # send a GET request to that URL

print(r.text)  # the HTML source of the page


In [None]:
dir(requests)

['ConnectTimeout',
 'ConnectionError',
 'HTTPError',
 'JSONDecodeError',
 'NullHandler',
 'PreparedRequest',
 'ReadTimeout',
 'Request',
 'RequestException',
 'Response',
 'Session',
 'Timeout',
 'TooManyRedirects',
 'URLRequired',
 '__author__',
 '__author_email__',
 '__build__',
 '__builtins__',
 '__cached__',
 '__cake__',
 '__copyright__',
 '__description__',
 '__doc__',
 '__file__',
 '__license__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__title__',
 '__url__',
 '__version__',
 '_check_cryptography',
 '_internal_utils',
 'adapters',
 'api',
 'auth',
 'certs',
 'chardet_version',
 'charset_normalizer_version',
 'check_compatibility',
 'codes',
 'compat',
 'cookies',
 'delete',
 'exceptions',
 'get',
 'head',
 'hooks',
 'logging',
 'models',
 'options',
 'packages',
 'patch',
 'post',
 'put',
 'request',
 'session',
 'sessions',
 'ssl',
 'status_codes',
 'structures',
 'urllib3',
 'utils',

In [None]:
r.status_code

200

In [None]:
r.encoding

'utf-8'

## `POST` example: [Taiwan High Speed Rail ticket booking](https://www.thsrc.com.tw/ArticleContent/a3b630bb-1066-4352-a1ef-58c7b4e8ef7c)

![image.png](https://hackmd.io/_uploads/SyFC06yma.png)


![image](https://hackmd.io/_uploads/SyCjZ7VV6.png)

In [None]:
import requests

url= 'https://www.thsrc.com.tw/TimeTable/Search'

data={
    'SearchType': 'S',
    'Lang': 'TW',
    'StartStation': 'NanGang',
    'EndStation': 'ZuoYing',
    'OutWardSearchDate': '2023/11/18',
    'OutWardSearchTime': '16:00',
    'ReturnSearchDate': '2023/11/18',
    'ReturnSearchTime': '16:00',
    'DiscountType': None
}

res = requests.post(url, data=data)

In [None]:
res.json()

In [None]:
res.text

In [None]:
res.headers

In [None]:
r = res.json()
r['data']

# BeautifulSoup

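## Reading and parsing HTML with Beautiful Soup


- [Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/)
- Beautiful Soup is an HTML parser that turns a web page into a `bs4.BeautifulSoup` object.
- A `bs4.BeautifulSoup` object is a tree that mirrors the document structure (similar to the DOM); you locate targets by navigating that structure or by calling its search methods.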
```
!pip3 install beautifulsoup4
```

In [None]:
from bs4 import BeautifulSoup

html_doc="""<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>"""


soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

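The table below lists the main parsers with their advantages and disadvantages:

Parser|Usage|Advantages|Disadvantages|
-|-|-|-
html.parser|BeautifulSoup(markup, "html.parser")|Part of the Python standard library; decent speed; lenient with malformed documents|Less lenient in Python versions before 2.7.3 and 3.2.2
lxml HTML parser|BeautifulSoup(markup, "lxml")|Fast; lenient with malformed documents (the usual choice)|Requires an external C library
lxml XML parser|BeautifulSoup(markup, "xml")|Fast; the only parser that supports XML|Requires an external C library
html5lib|BeautifulSoup(markup, "html5lib")|Most lenient; parses documents the way a browser does; produces HTML5 output|Slow; requires an external Python package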


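### Simple document navigation
- Accessing the structure with `.` attribute access is quick but error-prone
- The `find()`, `find_all()`, and `select()` methods are more robust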
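Why dot access is error-prone can be seen on a tiny document: a missing tag yields `None`, and chaining another attribute onto it fails. The HTML string below is a made-up sample, and the explicit `None` check is just one defensive pattern.

```python
from bs4 import BeautifulSoup

html = "<html><body><p class='title'>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# There is no <h1>, so soup.h1 is None and .text raises AttributeError.
error = None
try:
    soup.h1.text
except AttributeError as exc:
    error = str(exc)

# find() also returns None for a missing tag; checking it keeps the crawl running.
heading = soup.find("h1")
heading_text = heading.text if heading is not None else ""
paragraph_text = soup.find("p").text

print(error)
print(repr(heading_text), repr(paragraph_text))
```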

In [None]:
soup.title

<title>The Dormouse's story</title>

In [None]:
soup.title.name

'title'

In [None]:
soup.title.string

"The Dormouse's story"

In [None]:
soup.title.text

"The Dormouse's story"

In [None]:
soup.title.parent.name

'head'

In [None]:
soup.p

<p class="title"><b>The Dormouse's story</b></p>

In [None]:
soup.p['class']

['title']

In [None]:
soup.p.get("class")  # .get() is the recommended way to read an attribute

['title']

In [None]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [None]:
soup.a["href"]

'http://example.com/elsie'

In [None]:
soup.a.get('href')

'http://example.com/elsie'

In [None]:
print(type(soup.title))
print(type(soup.p))
print(type(soup.a))

<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>


In [None]:
for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


In [None]:
[ i.get("href") for i in soup.find_all('a') ]

['http://example.com/elsie',
 'http://example.com/lacie',
 'http://example.com/tillie']

### `soup.find()`



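-  Returns the first element that matches the given tag
- The first argument is usually the tag name. If the second argument is a plain string rather than an attribute dict, it is treated as a class name; you can also locate an element directly by attributes such as id. Once the element is found, you can read its attributes and the text it contains

  ```python
  soup.find(name=None,    # tag name to match
      attrs={},      # {"attribute name": "attribute value"}
      recursive=True,  # search all descendants, not only direct children
      text=None,    # match on text content (named `string` in newer bs4 releases)
      **kwargs)
  ```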

In [None]:
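help(soup.find())  # find() with no arguments returns the first Tag, so this prints the help for Tag; use help(soup.find) to see the method itself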
#soup.find(name=None, attrs={}, recursive=True, text=None, **kwargs)

Help on Tag in module bs4.element object:

class Tag(PageElement)
 |  Tag(parser=None, builder=None, name=None, namespace=None, prefix=None, attrs=None, parent=None, previous=None, is_xml=None, sourceline=None, sourcepos=None, can_be_empty_element=None, cdata_list_attributes=None, preserve_whitespace_tags=None, interesting_string_types=None, namespaces=None)
 |  
 |  Represents an HTML or XML tag that is part of a parse tree, along
 |  with its attributes and contents.
 |  
 |  When Beautiful Soup parses the markup <b>penguin</b>, it will
 |  create a Tag object representing the <b> tag.
 |  
 |  Method resolution order:
 |      Tag
 |      PageElement
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __bool__(self)
 |      A tag is non-None even if it has no contents.
 |  
 |  __call__(self, *args, **kwargs)
 |      Calling a Tag like a function is the same as calling its
 |      find_all() method. Eg. tag('a') returns a list of all the A tags
 |      found within this t

In [None]:
print(soup.find('p'))
print(soup.find("a"))

# Get the content of <a>...</a>
print(soup.find("a").string)
print(soup.find("a").text)

# Get the content of <title>...</title>
print(soup.title.string)
print(soup.title.text)

<p class="title"><b>The Dormouse's story</b></p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
Elsie
Elsie
The Dormouse's story
The Dormouse's story


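### `soup.find().get(attribute)`

- `.get("attribute")` is the better way to read an attribute from a node
  - With `get()`, a missing attribute returns `None`.
  - Subscripting (`tag["attribute"]`) also works, but raises a `KeyError` when the attribute is missing, which interrupts the rest of the crawl.
  - For more details, see the [official BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)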

In [None]:
# A missing attribute raises an error! Typical attributes: id, class, href, src
soup.find("p")['style']

KeyError: ignored

In [None]:
# A missing attribute returns None
print(soup.find('p').get('style'))

None


### `soup.find_all()`





- Returns every match, as a `bs4.element.ResultSet` object
  ```python
  soup.find_all(name=None,     # tag name to match
         attrs={},      # {"attribute name": "attribute value"}
         text=None,     # match on text content
         limit=None,     # maximum number of results
         **kwargs)
  ```

In [None]:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.python.org/')
soup = BeautifulSoup(res.text, "lxml")

In [None]:
p_tags = soup.find_all("p")
p_tags

In [None]:
type(p_tags)

In [None]:
# Find every string whose content equals "Latest News"
print(soup.find_all(text="Latest News"))

- Use a for loop to take the items out of a `bs4.element.ResultSet`

In [None]:
for tag in p_tags:
  print(tag)
  print(type(tag)) # each item taken out is a `bs4.element.Tag` object

In [None]:
for tag in p_tags:
  print(tag.text)
  print(type(tag.text)) # the parsed text content, a str

In [None]:
# Read an attribute from each node

a_tags = soup.find_all("a")
for tag in a_tags:
  print(tag.get('href'))

- Pass a list `[]` to `soup.find_all()` to search for several tags at once

In [None]:
from pprint import pprint

tags = soup.find_all(["a", "b", "p"]) # find all links, bold text, and paragraphs
pprint(tags)

In [None]:
tags = soup.find_all(["a", "p"], limit=2) # limit caps the number of results
pprint(tags)
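Besides the tag name, `find_all()` accepts filters on class and other attributes. The sketch below runs on a small inline document so that it needs no network access; the HTML is adapted from the earlier sample, with one class changed to `brother` purely for illustration.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
  <a href="http://example.com/tillie" class="brother" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

sisters = soup.find_all("a", class_="sister")       # filter by CSS class
by_id = soup.find_all("a", attrs={"id": "link3"})   # filter with an attribute dict
first_only = soup.find_all("a", limit=1)            # stop after the first match

names = [a.text for a in sisters]
print(names, by_id[0].text, len(first_only))
```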

### `soup.select()`


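- Takes a CSS selector and returns the matches as a list
- Items in the list that are still tags remain `bs4` objects; extract their text or attributes before continuing with ordinary Python operations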

```python
select(selector, namespaces=None, limit=None, **kwargs)
```

In [None]:
from bs4 import BeautifulSoup
import requests

res = requests.get('http://www.example.com/')
soup = BeautifulSoup(res.text, "lxml")

In [None]:
select_a = soup.select("a")

In [None]:
print(type(select_a))
select_a

<class 'bs4.element.ResultSet'>


[<a href="https://www.iana.org/domains/example">More information...</a>]

In [None]:
print(type(select_a[0]))
print(select_a[0])

<class 'bs4.element.Tag'>
<a href="https://www.iana.org/domains/example">More information...</a>


In [None]:
# Parse the text content
print(type(select_a[0].text))

select_a[0].text

<class 'str'>


'More information...'

In [None]:
select_href1 = soup.select('[href]')

In [None]:
print(type(select_href1))

print(select_href1)

<class 'bs4.element.ResultSet'>
[<a href="https://www.iana.org/domains/example">More information...</a>]


In [None]:
print(type(select_href1[0]))
print(select_href1[0])

<class 'bs4.element.Tag'>
<a href="https://www.iana.org/domains/example">More information...</a>


In [None]:
# Use `.get(attribute)` to read the attribute value
print(type(select_href1[0].get('href')))
print(select_href1[0].get('href'))

<class 'str'>
https://www.iana.org/domains/example


In [None]:
import requests
from bs4 import BeautifulSoup

res = requests.get('http://python.org/')
soup = BeautifulSoup(res.text, "lxml")
a1=soup.select("#touchnav-wrapper > header > div > h1 > a > img")

In [None]:
a1

[<img alt="python™" class="python-logo" src="/static/img/python-logo.png"/>]

In [None]:
a1[0].get("src")

'/static/img/python-logo.png'
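A few common selector forms, tried on an inline document so the results are reproducible. The HTML is a made-up sample; `select_one()` is the single-result counterpart of `select()`.

```python
from bs4 import BeautifulSoup

html = """
<div id="nav"><a href="/about">About</a></div>
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select("a.sister")            # tag name plus class
by_id = soup.select("#link2")                 # id
children = soup.select("p.story > a")         # direct children of p.story
by_attr = soup.select('a[href^="http://"]')   # href starts with "http://"
first = soup.select_one("#nav a")             # first match, or None

print([a.text for a in children], by_id[0].text, first.get("href"))
```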

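## Searching with regular expressions


- Regular expressions are very useful for precisely extracting tags and text from a page. They handle many cases that XPath and CSS selectors cannot capture precisely, so they are worth learning well.
- You can test patterns against sample text at [regex101.com](https://regex101.com/).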


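|Symbol|Meaning|Example|Sample matches
|-|-|-|-
|`*`|The preceding character, expression, or `[]` set occurs zero or more times|`a*b*`|aaaa, aaabb, bbbb
|`+`|The preceding character, expression, or `[]` set occurs one or more times|`a+b+`|aaab, aabbb, abbb
|`?`|The preceding character, expression, or `[]` set occurs zero or one time|`a?b?`|ab, b
|`[]`|Any single character listed inside the brackets|`[A-Z]*`|ALLPE, CAP, QWER
|`()`|Grouping; the group is evaluated first as a unit|`(a*b)*`|aabaab, abaab, ababab
|`{m,n}`|The preceding character, expression, or `[]` set occurs m to n times (inclusive)|`a{2,3}b{2,3}`|aabbb, aaabbb, aabb
|`[^]`|Any single character not listed inside the brackets|`[^A-Z]*`|apple, banana, cat
|`\|`|Any one of the characters, strings, or expressions separated by `\|`|`b(a\|i\|e)d`|bad, bid, bed
|`.`|Any single character (symbols, digits, spaces, and so on) except a newline|`b.d`|bsd, bid, bed
|`^`|Anchors the match at the start of the string|`^a`|apple, afk
|`$`|Anchors the match at the end of the string|`[A-Z]*[a-z]*$`|Aab, zzz
|`\d`|Any digit|`\d+`|455, 5566
|`\w`|Any word character (letter, digit, or underscore)|`\w+`|123ABC, C8763
|`\s`|Any whitespace character|`\s`|tab, space, newline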
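Several rows of the table can be checked directly with `re.fullmatch`, which succeeds only when the whole string matches the pattern. The sketch below reuses patterns and sample strings from the table; the `cases` and `results` names are introduced here for illustration.

```python
import re

# (pattern, strings that should match in full), taken from the table above
cases = [
    (r"a*b*", ["aaaa", "aaabb", "bbbb"]),
    (r"a+b+", ["aaab", "aabbb", "abbb"]),
    (r"a?b?", ["ab", "b"]),
    (r"(a*b)*", ["aabaab", "abaab", "ababab"]),
    (r"a{2,3}b{2,3}", ["aabbb", "aaabbb", "aabb"]),
    (r"b(a|i|e)d", ["bad", "bid", "bed"]),
    (r"b.d", ["bsd", "bid", "bed"]),
]

results = {
    pattern: all(re.fullmatch(pattern, s) is not None for s in samples)
    for pattern, samples in cases
}

for pattern, ok in results.items():
    print(f"{pattern:15} {'all match' if ok else 'mismatch'}")
```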



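#### Python's re module
- You can experiment at [regex101](https://regex101.com/)
- To avoid confusion with escape sequences in ordinary string literals, define regular expression patterns as raw strings, that is, put an `r` in front of the string as in `r''`

##### Reference usage
```python
import re

# Find every piece of text that matches the pattern
pattern = r"the regular expression I wrote"
string = "the string I want to search"
re.findall(pattern, string)
```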

In [None]:
import re

pattern = "我"
string = "我想要找的字串我我"
re.findall(pattern, string)

In [None]:
import re

pattern = r"^[a-zA-Z0-9\._-]+@[a-zA-Z0-9\._-]+$"  # raw string, so \. reaches the regex engine unchanged
string = "willismax.com@gmail.com"
re.findall(pattern, string)

In [None]:
import requests
import re

res = requests.get('http://python.org/')

pattern = r'h[1-6]'  # headings h1 to h6
string = res.text
re.findall(pattern, string)

In [None]:
import re
import requests

res = requests.get('http://python.org/')

pattern = r'"\S*\.(?:jpg|png)"'  # quoted strings ending in .jpg or .png
string = res.text
re.findall(pattern, string)
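When a pattern contains capture groups, `re.findall` returns the captured parts instead of the whole match. The sketch below applies this to two PTT-style article links; the `href` values are taken from the PTT example later in this notebook, while the link texts are placeholders.

```python
import re

html = (
    '<a href="/bbs/StupidClown/M.1699458156.A.E91.html">post 1</a>'
    '<a href="/bbs/StupidClown/M.1699512149.A.B73.html">post 2</a>'
)

# One capture group: findall returns only the captured path of each match.
pattern = re.compile(r'href="(/bbs/\w+/M\.\d+\.A\.[0-9A-F]{3}\.html)"')
links = pattern.findall(html)

# Two capture groups: findall returns one tuple per match.
board_and_time = re.findall(r'/bbs/(\w+)/M\.(\d+)\.', html)

print(links)
print(board_and_time)
```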

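## Web scraping examples


### PTT as an example


- From here on we use Chrome DevTools to locate elements
- First look at the target page: https://www.ptt.cc/bbs/StupidClown/index.html
- In Chrome, right-click and choose "Inspect"; on Windows the shortcut is Ctrl+Shift+I or F12

- If you would rather use an existing crawler, see https://dotblogs.com.tw/codinghouse/2018/10/22/pttcrawler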

```
//*[@id="main-container"]/div[2]/div[2]/div[2]/a
#main-container > div.r-list-container.action-bar-margin.bbs-screen > div:nth-child(2) > div.title > a
```

![](https://i.imgur.com/K55v4SH.png)


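- The article list shows the push count, title, author, date, and link of each article
- First examine the tree structure and the corresponding tags and attributes
- Record the location with Copy XPath
- The title and the link come from the same `<a>` element (its text and its `href`), so they share one XPath

|Name|selector|
-|-
Title|`//*[@id="main-container"]/div[2]/div[4]/div[2]/a`
Link|`//*[@id="main-container"]/div[2]/div[4]/div[2]/a`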
In [None]:
# Target URL: https://www.ptt.cc/bbs/StupidClown/index.html
import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.ptt.cc/bbs/StupidClown/index.html')
soup = BeautifulSoup(res.text, "html.parser")
print(res.text[:500])

<!DOCTYPE html>
<html>
	<head>
		<meta charset="utf-8">
		

<meta name="viewport" content="width=device-width, initial-scale=1">

<title>看板 StupidClown 文章列表 - 批踢踢實業坊</title>

<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-common.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-base.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-custom.css">
<link rel="stylesheet" type="text/css" href="//i


In [None]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   看板 StupidClown 文章列表 - 批踢踢實業坊
  </title>
  <link href="//images.ptt.cc/bbs/v2.27/bbs-common.css" rel="stylesheet" type="text/css"/>
  <link href="//images.ptt.cc/bbs/v2.27/bbs-base.css" media="screen" rel="stylesheet" type="text/css"/>
  <link href="//images.ptt.cc/bbs/v2.27/bbs-custom.css" rel="stylesheet" type="text/css"/>
  <link href="//images.ptt.cc/bbs/v2.27/pushstream.css" media="screen" rel="stylesheet" type="text/css"/>
  <link href="//images.ptt.cc/bbs/v2.27/bbs-print.css" media="print" rel="stylesheet" type="text/css"/>
 </head>
 <body>
  <div id="topbar-container">
   <div class="bbs-content" id="topbar">
    <a href="/bbs/" id="logo">
     批踢踢實業坊
    </a>
    <span>
     ›
    </span>
    <a class="board" href="/bbs/StupidClown/index.html">
     <span class="board-label">
      看板
     </span>
     StupidClown
    </a>
    <a class="r

- The page was fetched. If we only want the links and titles, inspection shows they all sit inside `div` tags with class='title'

In [None]:
# #main-container > div.r-list-container.action-bar-margin.bbs-screen > div:nth-child(5) > div.title > a

results = soup.select("div.title > a")
print(results)
print(type(results))

[<a href="/bbs/StupidClown/M.1699458156.A.E91.html">[童年] 第一次知道「妊娠」的意思是幾歲</a>, <a href="/bbs/StupidClown/M.1699512149.A.B73.html">[無言] 微軟的陰謀</a>, <a href="/bbs/StupidClown/M.1699526028.A.3F4.html">[恍神] 早起媽媽真的很ㄎㄧㄤ</a>, <a href="/bbs/StupidClown/M.1699611553.A.82F.html">[無言] 中樂透了！？</a>, <a href="/bbs/StupidClown/M.1699719321.A.E4C.html">[恍神] 愛美的少年</a>, <a href="/bbs/StupidClown/M.1699887631.A.E20.html">[眼殘] 老饕最愛的日本料理</a>, <a href="/bbs/StupidClown/M.1699934550.A.5E2.html">[健忘] 荒謬死竟然忘記鎖門</a>, <a href="/bbs/StupidClown/M.1700102542.A.B1D.html">[眼殘] 老鼠屎</a>, <a href="/bbs/StupidClown/M.1700105800.A.4AA.html">[無言] 茶泡飯</a>, <a href="/bbs/StupidClown/M.1700201666.A.184.html">[聽錯] 就不該跟學生聊遊戲</a>, <a href="/bbs/StupidClown/M.1700210033.A.23D.html">[童年] 地瓜三人組</a>, <a href="/bbs/StupidClown/M.1700246364.A.B59.html">[眼殘] 朋友緬懷阿姨的肉粽 但阿姨還活著啊…</a>, <a href="/bbs/StupidClown/M.1700451692.A.041.html">[耍笨] 走路回家路上屎在滾</a>, <a href="/bbs/StupidClown/M.1700541263.A.AE3.html">[眼殘] 想說這個表特正妹怎麼有三隻手</a>, <a href="/b

In [None]:
article_href = soup.select("div.title a")
article_href

[<a href="/bbs/StupidClown/M.1699458156.A.E91.html">[童年] 第一次知道「妊娠」的意思是幾歲</a>,
 <a href="/bbs/StupidClown/M.1699512149.A.B73.html">[無言] 微軟的陰謀</a>,
 <a href="/bbs/StupidClown/M.1699526028.A.3F4.html">[恍神] 早起媽媽真的很ㄎㄧㄤ</a>,
 <a href="/bbs/StupidClown/M.1699611553.A.82F.html">[無言] 中樂透了！？</a>,
 <a href="/bbs/StupidClown/M.1699719321.A.E4C.html">[恍神] 愛美的少年</a>,
 <a href="/bbs/StupidClown/M.1699887631.A.E20.html">[眼殘] 老饕最愛的日本料理</a>,
 <a href="/bbs/StupidClown/M.1699934550.A.5E2.html">[健忘] 荒謬死竟然忘記鎖門</a>,
 <a href="/bbs/StupidClown/M.1700102542.A.B1D.html">[眼殘] 老鼠屎</a>,
 <a href="/bbs/StupidClown/M.1700105800.A.4AA.html">[無言] 茶泡飯</a>,
 <a href="/bbs/StupidClown/M.1700201666.A.184.html">[聽錯] 就不該跟學生聊遊戲</a>,
 <a href="/bbs/StupidClown/M.1700210033.A.23D.html">[童年] 地瓜三人組</a>,
 <a href="/bbs/StupidClown/M.1700246364.A.B59.html">[眼殘] 朋友緬懷阿姨的肉粽 但阿姨還活著啊…</a>,
 <a href="/bbs/StupidClown/M.1700451692.A.041.html">[耍笨] 走路回家路上屎在滾</a>,
 <a href="/bbs/StupidClown/M.1700541263.A.AE3.html">[眼殘] 想說這個表特正妹怎麼有三隻手</a>

In [None]:
# Print each title and build the full link
for a in article_href:
  print(f'{a.text}')
  print(f'href: https://www.ptt.cc{a.get("href")}')

  # Open the linked page and save it to a file
  content_url = f'https://www.ptt.cc{a.get("href")}'
  r = requests.get(content_url)
  # encoding='utf-8' is needed here: on Windows the default codec (cp950) raises
  # UnicodeEncodeError for characters it cannot encode, such as '\u59d9'
  with open(f'{a.text}.html', 'w', encoding='utf-8') as f:
    f.write(r.text)
    print('saved')

[童年] 第一次知道「妊娠」的意思是幾歲
href: https://www.ptt.cc/bbs/StupidClown/M.1699458156.A.E91.html
saved
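Two details of the loop above can be made more robust with the standard library: joining the relative `href` to the page URL, and turning a title into a file name that the file system accepts. This is a sketch; `safe_filename` is a hypothetical helper, and the sample title is made up to include characters that are invalid in Windows file names.

```python
import re
from urllib.parse import urljoin


def safe_filename(title, replacement="_"):
    """Replace characters that Windows or POSIX file systems reject."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n]', replacement, title).strip()
    return cleaned or "untitled"


page_url = "https://www.ptt.cc/bbs/StupidClown/index.html"
href = "/bbs/StupidClown/M.1699458156.A.E91.html"

article_url = urljoin(page_url, href)                   # resolves the relative link
filename = safe_filename("Re: [問卦] a/b?") + ".html"   # made-up title

print(article_url)
print(filename)
```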

In [None]:
%ls

- For more examples, see [CrawlerTutorial](https://github.com/leVirve/CrawlerTutorial)

### Wikipedia's list of Asian countries as an example

- Reference: [Web Scraping Wikipedia Tables using BeautifulSoup and Python](https://medium.com/analytics-vidhya/web-scraping-wiki-tables-using-beautifulsoup-and-python-6b9ea26d8722)

In [None]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area"  # the URL to request
res = requests.get(url)  # send a GET request to that URL
soup = BeautifulSoup(res.text, "html.parser")

res.text

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-typography-survey-disabled vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>List of Asian countries by area - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-featu

![](https://miro.medium.com/max/740/1*NyaaGqqHnemKSWu8DQqUHQ.png)

In [None]:
# Select the links inside the table; the country names are in their title attributes
table_href = soup.select("table.wikitable.sortable a")
table_href

In [None]:
country = [
        link.get('title')
        for link in table_href
        if link.get('title') is not None
        ]

In [None]:
country

In [None]:
import pandas as pd

df = pd.DataFrame()
df['Country'] = country
df

In [None]:
df = df.sort_values(by="Country").reset_index(drop = True)
df

# Exercises

##  Exercise 1

- Read, run, and take apart the following program

In [None]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import json

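# Base URL of PTT
PTT_URL = 'https://www.ptt.cc'

def get_web_page(url):
    """
    Fetch the content of a web page by URL.
    Uses the requests library for the HTTP request and handles possible exceptions.
    """
    try:
        resp = requests.get(url, cookies={'over18': '1'})  # set the cookie to pass the age check
        resp.raise_for_status()  # raise an exception if the request did not succeed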
        return resp.text
    except requests.RequestException as e:
        print(f'Error fetching {url}: {e}')
        return None

def parse_articles(dom, date):
    """
    Parse the HTML document and extract the articles posted on the given date.
    """
    soup = BeautifulSoup(dom, 'html5lib')
    articles = []
    for d in soup.find_all('div', class_='r-ent'):
        post_date = d.find('div', class_='date').text.strip()
        if post_date == date:
            push_count = get_push_count(d)
            if link := d.find('a'):
                articles.append({
                    'title': link.text,
                    'href': PTT_URL + link['href'],
                    'push_count': push_count
                })
    prev_url = get_prev_page_url(soup)
    return articles, prev_url

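def get_push_count(div):
    """
    Parse the push count from an article block.
    The value may be a number, '爆' for a very popular article, or a string starting with 'X' for negative pushes.
    """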
    push_str = div.find('div', class_='nrec').text
    try:
        return int(push_str) if push_str else 0
    except ValueError:
        return 99 if push_str == '爆' else -10

def get_prev_page_url(soup):
    """
    Extract the URL of the previous page from the paging block.
    """
    paging_div = soup.find('div', 'btn-group btn-group-paging')
    return paging_div.find_all('a')[1]['href']

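def fetch_today_articles(url):
    """
    Fetch today's article list.
    Starting from the given URL, keep following the previous-page link until a page has no article with today's date.
    """
    articles = []
    date_today = datetime.now().strftime('%m/%d').lstrip('0')  # today's date in the list-page format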
    while True:
        page = get_web_page(url)
        if not page:
            break
        current_articles, prev_url = parse_articles(page, date_today)
        if not current_articles:
            break
        articles.extend(current_articles)
        url = PTT_URL + prev_url
    return articles

def main():
    """
    Main function: fetch today's articles on the PTT Gossiping board and print the popular ones.
    """
    start_url = PTT_URL + '/bbs/Gossiping/index.html'
    articles = fetch_today_articles(start_url)

    print(f'{len(articles)} articles today')
    threshold = 50  # push-count threshold for a popular article
    print(f'Popular articles (>{threshold} pushes):')
    for article in filter(lambda a: a['push_count'] > threshold, articles):
        print(article)

    # Save the results as a JSON file
    with open('gossiping.json', 'w', encoding='utf-8') as f:
        json.dump(articles, f, indent=2, sort_keys=True, ensure_ascii=False)

if __name__ == '__main__':
    main()


660 articles today
Popular articles (>50 pushes):
{'title': '[問卦] 這次地震要捐給哪一個帳號阿? (發錢)', 'href': 'https://www.ptt.cc/bbs/Gossiping/M.1712377788.A.AEE.html', 'push_count': 99}
{'title': '[問卦] 新青安加碼？10年寬限1500萬', 'href': 'https://www.ptt.cc/bbs/Gossiping/M.1712377139.A.197.html', 'push_count': 73}
{'title': '[新聞] 花蓮地震／失聯英國補習班師與台籍女找到', 'href': 'https://www.ptt.cc/bbs/Gossiping/M.1712374950.A.8A9.html', 'push_count': 99}
{'title': '[爆卦] 地震', 'href': 'https://www.ptt.cc/bbs/Gossiping/M.1712375621.A.B89.html', 'push_count': 99}
{'title': '[新聞] 不敵少子化！全台14大專院校停辦，新北', 'href': 'https://www.ptt.cc/bbs/Gossiping/M.1712373364.A.280.html', 'push_count': 53}
{'title': '[新聞] 快訊／台南惡火噬民宅！她「燒柴洗澡」出', 'href': 'https://www.ptt.cc/bbs/Gossiping/M.1712373114.A.A27.html', 'push_count': 88}
{'title': '[新聞] 國軍楷模變詐團車手頭！假合約轉匯上億\u3000', 'href': 'https://www.ptt.cc/bbs/Gossiping/M.1712371102.A.B63.html', 'push_count': 99}
{'title': '[問卦] 美國：一中政策+不支持台獨 什麼意思？', 'href': 'https://www.ptt.cc/bbs/Gossiping/M.1712369127.A.1AB.html', 'push_count': 72}
{
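Two small pieces of the program can be tried in isolation, without any network access. The sketch below restates them on plain values, assuming the date column reads like `4/06` once stripped; `ptt_date` and `parse_push_count` are hypothetical names that mirror the logic of `fetch_today_articles` and `get_push_count`.

```python
from datetime import datetime


def ptt_date(moment):
    """Format a datetime as month/day with the month's leading zero removed."""
    return moment.strftime("%m/%d").lstrip("0")


def parse_push_count(push_str):
    """Same mapping as get_push_count, applied to a plain string."""
    try:
        return int(push_str) if push_str else 0
    except ValueError:
        # Non-numeric values: '爆' marks a very popular post, anything else is negative.
        return 99 if push_str == "爆" else -10


print(ptt_date(datetime(2024, 4, 6)), ptt_date(datetime(2024, 10, 5)))
print([parse_push_count(s) for s in ("", "12", "爆", "X3")])
```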

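## Exercise 2


- Fetch and parse one article from the PTT joke board
- Practice with the following steps:
    - Read the page source of https://www.ptt.cc/bbs/joke/M.1571755669.A.663.html with a GET request
    - Following the steps above, parse out each push comment and its author
    - Print them neatly with a for loop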