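# Reading Local HTML with BeautifulSoup and Extracting Quotes with CSS Selectors and Regex

This notebook demonstrates how to:

- Download the https://quotes.toscrape.com page with `requests` and save it as a local file
- Read the local HTML file
- Extract data with BeautifulSoup and CSS selectors
- Filter text with regular expressions (regex)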


In [1]:
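import requests

# Download the page and save it as the local file quotes.html
url = "https://quotes.toscrape.com/"
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop on an HTTP error instead of saving an error page
with open("quotes.html", "w", encoding="utf-8") as f:
    f.write(response.text)

print("Downloaded and saved quotes.html")

Downloaded and saved quotes.html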


In [None]:
# Check that the local file was downloaded successfully
!head -20 quotes.html
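
The `!head` line is an IPython shell escape, so it depends on a `head` command being available in the underlying shell, which may not be the case on Windows. A minimal pure-Python sketch of the same check follows. The helper name `head` and the throwaway demo file are assumptions made for illustration; in the notebook you would call `head("quotes.html")` instead.

```python
import tempfile
from itertools import islice
from pathlib import Path


def head(path, n=20, encoding="utf-8"):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path, "r", encoding=encoding) as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Demo on a small throwaway file (hypothetical) so the sketch runs on its own.
demo_path = Path(tempfile.mkdtemp()) / "head_demo.html"
demo_path.write_text(
    "\n".join(f"<p>line {i}</p>" for i in range(1, 31)), encoding="utf-8"
)

preview = head(demo_path, n=3)
for line in preview:
    print(line)
```

`islice` stops reading after `n` lines, so the whole file is never loaded into memory.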

In [None]:
from bs4 import BeautifulSoup

# Read the local HTML file
with open("quotes.html", "r", encoding="utf-8") as f:
    html = f.read()

# Parse the HTML
soup = BeautifulSoup(html, "html.parser")

# Select every quote block
quotes = soup.select(".quote")
print(f"Found {len(quotes)} quotes")

In [None]:
# Example: extract the text, author, and tags of the first quote
first = quotes[0]
text = first.select_one(".text").get_text()
author = first.select_one(".author").get_text()
tags = [tag.get_text() for tag in first.select(".tags .tag")]

print("Quote:", text)
print("Author:", author)
print("Tags:", tags)
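
To see what each selector picks up without touching the downloaded file, the same calls can be tried on a small inline fragment. This is only a sketch: the HTML below is hand-written to imitate the class names used in the cells above (`quote`, `text`, `author`, `tags`, `tag`), and the real page's markup may differ in its details.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that imitates one quote block (an assumption, not the real page).
sample_html = """
<div class="quote">
  <span class="text">“A sample quote.”</span>
  <span>by <small class="author">Jane Doe</small></span>
  <div class="tags">
    <a class="tag" href="/tag/sample/">sample</a>
    <a class="tag" href="/tag/demo/">demo</a>
  </div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
block = sample_soup.select_one(".quote")  # first element with class "quote"

sample_text = block.select_one(".text").get_text(strip=True)
sample_author = block.select_one(".author").get_text(strip=True)
# ".tags .tag" means: elements with class "tag" nested inside an element with class "tags"
sample_tags = [a.get_text(strip=True) for a in block.select(".tags .tag")]

# select_one returns None when nothing matches, so check before calling get_text
missing = block.select_one(".no-such-class")

print(sample_text, sample_author, sample_tags, missing)
```

Because `select_one` returns `None` on a miss, calling `.get_text()` directly on its result raises `AttributeError` if the page structure changes.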

## Filtering quote text with a regular expression

Find every quote whose text contains "life" (case-insensitive):

In [None]:
import re

# Find every quote that contains "life" (case-insensitive substring match)
pattern = re.compile("life", re.IGNORECASE)

life_quotes = []
for quote in quotes:
    text = quote.select_one(".text").get_text()
    if pattern.search(text):
        author = quote.select_one(".author").get_text()
        life_quotes.append({"text": text, "author": author})

print(f"Found {len(life_quotes)} quotes containing 'life'\n")

for i, q in enumerate(life_quotes[:5], 1):
    print(f"{i}. {q['text']} -- {q['author']}")
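
The pattern above is a plain substring match, so it also accepts words such as "lifetime". If only the standalone word is wanted, a word-boundary pattern is one option. The sketch below compares the two on made-up sentences; the sample strings and variable names are illustrative only.

```python
import re

# Made-up sample sentences, not taken from the site.
samples = [
    "Life is short.",
    "A lifetime of work.",
    "The afterlife awaits.",
    "Nothing relevant here.",
]

substring_pattern = re.compile(r"life", re.IGNORECASE)       # matches "life" anywhere
whole_word_pattern = re.compile(r"\blife\b", re.IGNORECASE)  # matches "life" as a whole word

substring_hits = [s for s in samples if substring_pattern.search(s)]
whole_word_hits = [s for s in samples if whole_word_pattern.search(s)]

print(substring_hits)
print(whole_word_hits)
```

`\b` matches the position between a word character and a non-word character, so "lifetime" and "afterlife" no longer qualify.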

## Filtering author names with a regular expression

Find every quote whose author name contains the letter 'a' (case-insensitive):

In [None]:
# Match any author name that contains the letter "a", in either case
pattern_author = re.compile("a", re.IGNORECASE)

author_a_quotes = []
for quote in quotes:
    author = quote.select_one(".author").get_text()
    if pattern_author.search(author):
        text = quote.select_one(".text").get_text()
        author_a_quotes.append({"author": author, "text": text})

# This counts quotes, not distinct authors: an author with several quotes appears several times
print(f"Found {len(author_a_quotes)} quotes whose author name contains 'a'\n")

for i, q in enumerate(author_a_quotes[:5], 1):
    print(f"{i}. {q['author']}: {q['text']}")
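
A single-letter pattern is very permissive, so most names will pass. Anchors make the filter more selective. The sketch below shows three variants on a short hand-typed list of names; the list is sample input only and is not read from the page.

```python
import re

# Hand-typed sample names used only to illustrate the patterns.
sample_authors = ["Albert Einstein", "Jane Austen", "Bob Smith", "Marilyn Monroe"]

contains_a = re.compile(r"a", re.IGNORECASE)      # "a" anywhere in the name
starts_with_a = re.compile(r"^a", re.IGNORECASE)  # the name itself begins with "a"
word_starts_a = re.compile(r"\ba", re.IGNORECASE) # any word in the name begins with "a"

any_a = [name for name in sample_authors if contains_a.search(name)]
first_a = [name for name in sample_authors if starts_with_a.search(name)]
word_a = [name for name in sample_authors if word_starts_a.search(name)]

print(any_a)
print(first_a)
print(word_a)
```

For the plain "contains a letter" case, `"a" in name.lower()` gives the same result without a regex; the regex earns its keep once anchors or alternation are involved.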

## Collecting all quotes into a structured list

Store each quote's text, author, and tags in a dictionary and append it to a list:

In [None]:
all_quotes = []
for quote in quotes:
    text = quote.select_one(".text").get_text()
    author = quote.select_one(".author").get_text()
    tags = [tag.get_text() for tag in quote.select(".tags .tag")]
    all_quotes.append({
        "text": text,
        "author": author,
        "tags": tags
    })

print(f"Collected {len(all_quotes)} quotes")
print(all_quotes[:2])  # show the first two records
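
Once the quotes are plain dictionaries, simple queries need no further HTML parsing. The sketch below uses made-up records with the same keys as `all_quotes` (`text`, `author`, `tags`) to count tags and filter by author or tag; the values are placeholders, not scraped data.

```python
from collections import Counter

# Stand-in records with the same keys as all_quotes; the values are made up.
sample_quotes = [
    {"text": "“First sample.”", "author": "Jane Doe", "tags": ["life", "humor"]},
    {"text": "“Second sample.”", "author": "John Roe", "tags": ["life"]},
    {"text": "“Third sample.”", "author": "Jane Doe", "tags": []},
]

# How often each tag occurs across all records
tag_counts = Counter(tag for q in sample_quotes for tag in q["tags"])

# Texts of the quotes by one author
by_jane = [q["text"] for q in sample_quotes if q["author"] == "Jane Doe"]

# Texts of the quotes carrying a given tag
tagged_life = [q["text"] for q in sample_quotes if "life" in q["tags"]]

print(tag_counts.most_common())
print(by_jane)
print(tagged_life)
```

Replacing `sample_quotes` with `all_quotes` runs the same queries on the scraped data.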