---
title: "Building My Own Song Ci Handwriting Practice Book"
author: "Simon Chiu"
execute: 
  warning: false
---

# 1. Introduction

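Lately, both study and entertainment have kept me glued to electronic devices, and my eyes have been getting tired. Over the Lunar New Year I happened to watch Empresses in the Palace (甄嬛傳) and came to appreciate the beauty of Song ci poetry. So I decided to use web scraping to build a handwriting practice book automatically. During breaks I can step away from the screen, practice my handwriting, and enjoy the beauty of classical Chinese writing at the same time.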

# 2. Code demo

1. Install and import required packages

In [8]:
# pip install requests beautifulsoup4 python-docx

# import required packages
import requests
from bs4 import BeautifulSoup
from docx import Document

2. Parse the scraped web page with BeautifulSoup

In [10]:
# get soup
# URL of the webpage containing the poem
url = "https://zh.wikisource.org/zh-hant/%E5%AE%8B%E8%A9%9E%E4%B8%89%E7%99%BE%E9%A6%96"

# Send a GET request to the URL
response = requests.get(url)

# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')
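
The request above assumes the download always succeeds. As a minimal sketch of a more defensive version, the helper below adds a timeout, an explicit `User-Agent` header, and an HTTP status check. The name `fetch_soup`, the `get` parameter, and the header string are all hypothetical and not part of the original notebook. Wikimedia sites ask clients to identify themselves with a descriptive `User-Agent`, so the placeholder should be replaced with your own details.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical User-Agent string; replace the contact details with your own.
HEADERS = {"User-Agent": "ci-practice-book/0.1 (personal script; you@example.com)"}


def fetch_soup(url, get=requests.get, timeout=10):
    """Download `url` and return the parsed page, raising on HTTP errors."""
    response = get(url, headers=HEADERS, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

Calling `fetch_soup(url)` would then stand in for the `requests.get` and `BeautifulSoup` lines above. The `get` parameter exists only so the function can be exercised without network access.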

3. Inspecting the wiki page's source shows that the poems are mostly stored in div class="poem" elements, so use find_all to collect all of them

<img src="image/picture1.png" width="800" height="300">

<img src="image/picture2.png" width="500" height="100">

In [11]:
# Find all divs with class 'poem'
poem_div = soup.find_all('div', class_='poem')
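
To see what `find_all` returns without hitting the network, here is a small sketch run on a made-up HTML fragment that mimics the structure described above. The fragment and the variable names are assumptions for illustration only. It also shows `select`, which takes a CSS selector and matches the same elements in this case.

```python
from bs4 import BeautifulSoup

# A made-up fragment that mimics the page structure described above.
sample_html = """
<div class="poem"><p>碧雲天，黃葉地。</p></div>
<div class="note"><p>not a poem</p></div>
<div class="poem"><p>城上風光鶯語亂。</p></div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list-like collection of every matching tag.
by_find_all = sample_soup.find_all("div", class_="poem")
# select takes a CSS selector; here it matches the same two divs.
by_select = sample_soup.select("div.poem")

poem_texts = [div.find("p").get_text() for div in by_find_all]
print(len(by_find_all), poem_texts)
```

Only the two `div` elements whose class is `poem` are returned; the `note` div is skipped.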

4. The first three poems are used below to demo the preprocessing

In [26]:
# select the first 3 poems for the demo
test_poem_div = poem_div[:3]
print(test_poem_div)

[<div class="poem">
<p>裁翦冰綃，輕疊數重，淡著燕脂勻注。新樣靚妝，艷溢香融，羞殺蕊珠宮女。易得凋零，更多少、無情風雨。愁苦。問院落淒涼，幾番春暮？　　憑寄離恨重重，者雙燕何曾，會人言語？天遙地遠，萬水千山，知他故宮何處？怎不思量？除夢裏、有時曾去。無據。和夢也、新來不做。
</p>
</div>, <div class="poem">
<p>城上風光鶯語亂。城下煙波春拍岸。綠楊芳草幾時休？淚眼愁腸先已斷。　　情懷漸覺成衰晚。鸞鏡朱顏驚暗換。昔年多病厭芳尊，今日芳尊惟恐淺。
</p>
</div>, <div class="poem">
<p>碧雲天，黃葉地，秋色連波，波上寒煙翠。<br/>
山映斜陽天接水，芳草無情，更在斜陽外。<br/>
<br/>
黯鄉魂，追旅思，夜夜除非，好夢留人睡。<br/>
明月樓高休獨倚，酒入愁腸，化作相思淚。
</p>
</div>]


First, use find('p') to extract the body of each poem, then use replace to remove any line breaks or whitespace.

In [30]:
# create test_list to store processed poems
test_list = []
# loop through the poems, extract the text of each <p>, then remove extra whitespace
# (the loop variable is named `poem` so it does not overwrite the `poem_div` list)
for poem in test_poem_div:
    text = ''.join(poem.find('p').strings)
    text = text.strip().replace('\n', '').replace('\t', '').replace(' ', '').replace('\u3000', '')  # Remove newlines, tabs, spaces, and full-width spaces
    test_list.append(text)
# show result
test_list

['裁翦冰綃，輕疊數重，淡著燕脂勻注。新樣靚妝，艷溢香融，羞殺蕊珠宮女。易得凋零，更多少、無情風雨。愁苦。問院落淒涼，幾番春暮？憑寄離恨重重，者雙燕何曾，會人言語？天遙地遠，萬水千山，知他故宮何處？怎不思量？除夢裏、有時曾去。無據。和夢也、新來不做。',
 '城上風光鶯語亂。城下煙波春拍岸。綠楊芳草幾時休？淚眼愁腸先已斷。情懷漸覺成衰晚。鸞鏡朱顏驚暗換。昔年多病厭芳尊，今日芳尊惟恐淺。',
 '碧雲天，黃葉地，秋色連波，波上寒煙翠。山映斜陽天接水，芳草無情，更在斜陽外。黯鄉魂，追旅思，夜夜除非，好夢留人睡。明月樓高休獨倚，酒入愁腸，化作相思淚。']
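
The chained `replace` calls can also be collapsed into a single regular expression. This is a sketch of an alternative, not the method used in the notebook; `clean_poem_text` is a hypothetical helper name. It relies on the fact that, for `str` patterns in Python 3, `\s` matches Unicode whitespace, which includes the full-width space `\u3000`.

```python
import re


def clean_poem_text(text):
    """Remove every whitespace character, including the full-width space U+3000."""
    # For str patterns, \s matches Unicode whitespace, so it also covers \u3000.
    return re.sub(r"\s+", "", text)


raw = "\n碧雲天，黃葉地。\n\u3000\u3000黯鄉魂， 追旅思。\t\n"
print(clean_poem_text(raw))
# → 碧雲天，黃葉地。黯鄉魂，追旅思。
```

Full-width punctuation such as `，` and `。` is not whitespace, so it is left in place.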

Finally, the processed poems can be saved to a Word document.

In [32]:
# Write the text into a Word document
doc = Document()
for poem in test_list:
    doc.add_paragraph(poem)
# Save the Word document
doc.save('test_poems.docx')

# 3. Formatting in Word

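All that is left is to set the text to a font you like, switch to a lighter color, and adjust the font size. The document can then be printed and used as a practice book.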

<img src="image/picture3.png" width="700" height="500">

<img src="image/picture4.png" width="800" height="500">

<img src="image/picture5.png" width="700" height="500">

# 4. Complete code

In [None]:
# import required packages
import requests
from bs4 import BeautifulSoup
from docx import Document

# get soup
# URL of the webpage containing the poem
url = "https://zh.wikisource.org/zh-hant/%E5%AE%8B%E8%A9%9E%E4%B8%89%E7%99%BE%E9%A6%96"

# Send a GET request to the URL
response = requests.get(url)

# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find all divs with class 'poem'
poem_div = soup.find_all('div', class_='poem')

# create result_list to store processed poems
result_list = []
# loop through the poems, extract the text of each <p>, then remove extra whitespace
for poem in poem_div:
    text = ''.join(poem.find('p').strings)
    text = text.strip().replace('\n', '').replace('\t', '').replace(' ', '').replace('\u3000', '')  # Remove leading/trailing whitespace and other possible whitespace
    result_list.append(text)

# Write the text into a Word document
doc = Document()
for poem in result_list:
    doc.add_paragraph(poem)
# Save the Word document
doc.save('all_poems.docx')