# BeautifulSoup
## Common Workflow and Functions


In [1]:
# Example html code
html_doc = """
<html><head><title>The NCDR history</title></head>
<body>
<p class="title"><b>The NCDR history1</b><b>The NCDR history2</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

## The Standard Opening Move
![](https://i.imgur.com/cqoMN6s.png)

In [2]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,"lxml")
print(soup.prettify())

<html>
 <head>
  <title>
   The NCDR history
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The NCDR history1
   </b>
   <b>
    The NCDR history2
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


In [3]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
1
soup.title # <title>The NCDR history</title>
2
soup.title.name # 'title'; .name => the tag name
3
soup.title.string # 'The NCDR history'; .string => the tag content
4
soup.title.parent.name # 'head'
5
soup.p # <p class="title"><b>The NCDR history1</b><b>The NCDR history2</b></p>
6
soup.p['class'] # ['title'] (class is a multi-valued attribute, so a list comes back)
7
soup.a # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
8
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
9
soup.find(id="link3")  # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
10
print(soup.get_text()) # print all the text in the document

The NCDR history

The NCDR history1The NCDR history2
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...



In [5]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

### Exercise 1:
Using the parsed data above, extract the hyperlinks in the document. <br>
Hints: <br>
.find_all() <br>
XXX[] => reads an attribute value from a matched tag, or use .get('href')

In [13]:
for link in soup.find_all('a'):
    print(link['href'])
#[link['href'] for link in soup.find_all('a')]

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
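
The two hints in the exercise behave differently when an attribute is missing. Below is a minimal sketch of that difference. It assumes the standard-library "html.parser" backend instead of "lxml" so that it runs without extra packages, and the names `snippet` and `demo` are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second <a> tag has no href attribute.
snippet = (
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a id="no-href">Nobody</a>'
)
demo = BeautifulSoup(snippet, "html.parser")

# .get() returns None when the attribute is absent.
links = [a.get("href") for a in demo.find_all("a")]
print(links)  # ['http://example.com/elsie', None]

# Square brackets raise KeyError when the attribute is absent.
try:
    demo.find(id="no-href")["href"]
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```

On real pages, where some links may lack an href, .get('href') is usually the safer of the two.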


## BeautifulSoup Architecture
Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds:
* Tag
* NavigableString
* BeautifulSoup 
* Comment
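
To see the four kinds side by side, here is a small sketch. It assumes the built-in "html.parser" backend and a made-up one-line document; `demo` and `p` are illustrative names only.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical document: one tag holding a string and a comment.
demo = BeautifulSoup("<p>text<!-- note --></p>", "html.parser")
p = demo.p

kinds = [
    type(demo).__name__,           # the whole document
    type(p).__name__,              # a tag
    type(p.contents[0]).__name__,  # the string inside the tag
    type(p.contents[1]).__name__,  # the comment inside the tag
]
print(kinds)  # ['BeautifulSoup', 'Tag', 'NavigableString', 'Comment']

# BeautifulSoup is a subclass of Tag, and Comment is a subclass of NavigableString.
print(isinstance(demo, Tag))                       # True
print(isinstance(p.contents[1], NavigableString))  # True
```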

### Tag => think of it as a tree
Basic operations
1. Get a tag
2. Get the tag's content
3. Get its attributes
4. Get an attribute value

In [16]:
soup1 = BeautifulSoup('<b class="boldest">Extremely bold</b>', "lxml")
1
tag = soup1.b
type(tag)
tag.name # the name can be reassigned, but that affects later lookups
2
tag.attrs # get all attributes
3
tag['class'] # get an attribute value

1

bs4.element.Tag

'b'

2

{'class': ['boldest']}

3

['boldest']

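More on the tree structure: only the child-related attributes are covered here. For parent, sibling, and element navigation, see the official documentation when you need them.

Single level
* **.contents**: a Tag attribute that returns the tag's direct children as a list.
* **.children**: an iterator over the tag's direct children, handy in a for loop.

Multiple levels

* **.descendants**: iterates recursively over all of the tag's descendants (children, grandchildren, and so on).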

In [17]:
# .contents
1
head_tag = soup.p
head_tag 
2
head_tag.contents
3
head_tag.contents[0] 
4
title_tag = soup.title
title_tag.contents 
5 # the soup object itself also has children
soup.contents[0].name

1

<p class="title"><b>The NCDR history1</b><b>The NCDR history2</b></p>

2

[<b>The NCDR history1</b>, <b>The NCDR history2</b>]

3

<b>The NCDR history1</b>

4

['The NCDR history']

5

'html'

In [18]:
# .children 
type(title_tag.children)
for child in title_tag.children:
    print(child)
print("length of .children = ",len(list(soup.children)))

# .descendants
type(title_tag.descendants)
for child in head_tag.descendants:
    print(child)
print("length of .descendants = ",len(list(soup.descendants)))

list_iterator

The NCDR history
length of .children =  1


generator

<b>The NCDR history1</b>
The NCDR history1
<b>The NCDR history2</b>
The NCDR history2
length of .descendants =  28


### NavigableString
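A NavigableString is the string inside a tag. It behaves like a Python Unicode string, and **it also carries some of the features used for navigating and searching the tree**.
To use a NavigableString outside Beautiful Soup, convert it to a plain string with str() (unicode() in Python 2). Otherwise the object keeps a reference to the whole parse tree even after you are done with Beautiful Soup, which wastes memory.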

In [19]:
tag
1 # NavigableString
tag.string # 'Extremely bold'
type(tag.string) # <class 'bs4.element.NavigableString'>
2 # plain str
unicode_string = str(tag.string)  # use unicode() in Python 2
unicode_string # 'Extremely bold'
type(unicode_string) # <class 'str'>

<b class="boldest">Extremely bold</b>

1

'Extremely bold'

bs4.element.NavigableString

2

'Extremely bold'

str

In [10]:
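1 # .stripped_strings skips whitespace-only strings and strips the rest
for string in soup.stripped_strings:
    print(repr(string))  # repr() shows the quotes and escape sequences such as \n
2 # .strings iterates over every string under a tag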
for string in soup.strings:
    print(repr(string))

1

'The NCDR history'
'The NCDR history1'
'The NCDR history2'
'Once upon a time there were three little sisters; and their names were'
'Elsie'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'
'...'


2

'The NCDR history'
'\n'
'\n'
'The NCDR history1'
'The NCDR history2'
'\n'
'Once upon a time there were three little sisters; and their names were\n'
'Elsie'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'


### BeautifulSoup
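Represents the entire content of a document. It behaves much like a Tag but is not exactly the same; for example, its .name is the special value '[document]'.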

In [20]:
soup.name

'[document]'

### Comment
Anything that does not fit the three types above. <br>
&lt;!--XXXXX--&gt; is a comment in HTML

In [21]:
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup2 = BeautifulSoup(markup,"lxml")
comment = soup2.b.string
type(comment)
comment

bs4.element.Comment

'Hey, buddy. Want to buy a used parser?'

## Searching with BeautifulSoup
* find()
* find_all()
* Custom search: a function that returns True or False

### Filter types
* str
* list of str
* regular expression


In [25]:
# Kinds of filters that can be passed to a search
import re
1
soup.find('a')
2
soup.find_all('a')
3
soup.find_all(['a','p'])
4  # find tags whose name contains the letter t
for tag in soup.find_all(re.compile("t")):
    print(tag.name)

1

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

2

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

3

[<p class="title"><b>The NCDR history1</b><b>The NCDR history2</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
 <p class="story">...</p>]

4

html
title
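
The cell above covers str, list, and regular-expression filters but not the custom True/False search. Here is a minimal sketch of that last kind. It assumes the built-in "html.parser" backend and a made-up snippet; `has_class_and_id`, `snippet`, and `demo` are illustrative names, not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: only the two <a> tags carry both class and id.
snippet = """
<p class="title"><b>The NCDR history1</b></p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<div id="footer">no class here</div>
"""
demo = BeautifulSoup(snippet, "html.parser")


def has_class_and_id(tag):
    """Return True for tags that define both a class and an id."""
    return tag.has_attr("class") and tag.has_attr("id")


# find_all() calls the function on every tag and keeps those returning True.
matched_ids = [tag["id"] for tag in demo.find_all(has_class_and_id)]
print(matched_ids)  # ['link1', 'link2']
```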


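### find_all( name , attrs , recursive , string , limit , **kwargs )
* name: tag name
* attrs: attributes
* string: text content (called text before Beautiful Soup 4.4.0; the cell below uses the old name)
* recursive = False restricts the search to direct children
* limit = 2 stops the search after 2 results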

In [14]:
1 # <p> tags whose CSS class is "title"
soup.find_all("p", "title") 
2
soup.find_all(id="link2") # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
3 # class is a reserved word in Python, so append an underscore
soup.find_all(class_="sister") 

4 # special case: data-* attributes
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', "lxml")
# data_soup.find_all(data-foo="value")  # SyntaxError: not a valid keyword argument name
data_soup.find_all(attrs={"data-foo": "value"})

5 # the text argument

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
soup.find_all(text=re.compile("NCDR"))

6 # recursive
soup.html.find_all("title") # [<title>The NCDR history</title>]
soup.html.find_all("title", recursive=False) # []

1

[<p class="title"><b>The NCDR history1</b><b>The NCDR history2</b></p>]

2

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

3

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

4

[<div data-foo="value">foo!</div>]

5

['Elsie', 'Lacie', 'Tillie']

['The NCDR history', 'The NCDR history1', 'The NCDR history2']

6

[<title>The NCDR history</title>]

[]
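
The limit parameter is listed above but not exercised in the cell, so here is a rough sketch that compares limit, recursive, and string on one small tree. It assumes the built-in "html.parser" backend and Beautiful Soup 4.4.0 or later (for the string argument). The snippet and the names `demo` and `box` are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: link2 sits one level deeper than link1 and link3.
snippet = (
    '<div id="box">'
    '<a id="link1">Elsie</a>'
    '<p><a id="link2">Lacie</a></p>'
    '<a id="link3">Tillie</a>'
    '</div>'
)
demo = BeautifulSoup(snippet, "html.parser")
box = demo.div

all_links = [a["id"] for a in box.find_all("a")]
first_two = [a["id"] for a in box.find_all("a", limit=2)]
direct_only = [a["id"] for a in box.find_all("a", recursive=False)]
by_string = [a["id"] for a in box.find_all("a", string="Tillie")]

print(all_links)    # ['link1', 'link2', 'link3']
print(first_two)    # ['link1', 'link2']  (stops after two matches)
print(direct_only)  # ['link1', 'link3']  (skips the nested link2)
print(by_string)    # ['link3']           (<a> tags whose text is "Tillie")
```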

In [15]:
# A specific path can be searched this way (a better path-based search is shown below)
soup.head.title
soup.find("head").find("title")

<title>The NCDR history</title>

<title>The NCDR history</title>

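### CSS Selectors: select()
The argument can be copied from Chrome's Inspect panel with Copy selector; see the official documentation for other usage. <br>
Let's try it out on the NCDR website!<br>
See how to grab the service hours directly!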

In [34]:
import requests
ncdr = requests.get("https://www.ncdr.nat.gov.tw/")
soup_ncdr = BeautifulSoup(ncdr.text,"lxml")
# replace nth-child in the copied selector with nth-of-type (older Beautiful Soup versions only support nth-of-type)
soup_ncdr.select("#Service_box > table > tbody > tr:nth-of-type(1) > td:nth-of-type(2)")

[<td>星期一至五早上8:30至下午5:30</td>]
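
The live page may change or be unreachable, so here is an offline sketch of the same selector pattern. The HTML below is a made-up stand-in for the page structure, not the real NCDR markup. It assumes the built-in "html.parser" backend and a Beautiful Soup version that ships with the soupsieve package (4.7 or later).

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the service-hours table on the page.
snippet = """
<div id="Service_box">
  <table>
    <tbody>
      <tr><td>Service hours</td><td>Monday to Friday, 8:30 to 17:30</td></tr>
      <tr><td>Row two</td><td>placeholder</td></tr>
    </tbody>
  </table>
</div>
"""
demo = BeautifulSoup(snippet, "html.parser")

selector = "#Service_box > table > tbody > tr:nth-of-type(1) > td:nth-of-type(2)"

# select() always returns a list; select_one() returns the first match or None.
cells = demo.select(selector)
first = demo.select_one(selector)

print([td.get_text() for td in cells])  # ['Monday to Friday, 8:30 to 17:30']
print(first.get_text())                 # Monday to Friday, 8:30 to 17:30
```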

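### Exercise 2:
Using the select approach shown above, (1) pick out the title of the latest newsletter published on the NCDR home page and (2) print it as plain text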

In [35]:
for string in soup_ncdr.select("#ctl08_lbtnNewsletter")[0].stripped_strings:
    print(repr(string))
print(soup_ncdr.select("#ctl08_lbtnNewsletter")[0].string)

'第151期 災防告警細胞廣播服務民眾滿意度調查'
　　　　　　　　　第151期 災防告警細胞廣播服務民眾滿意度調查


### When a tag contains other tags and .string cannot return the text
* Remove the nested tags with .extract()
* get_text(strip=True)

In [36]:
# Get the content of the latest news announcement on the NCDR site
with_span = soup_ncdr.select("#ctl08_Repeater_Type_LinkButton_Link_0")[0]
1 # before extract
with_span
print("Printed content: ",with_span.string)
span = with_span.span.extract()
img = with_span.img.extract()
2 # after extract
with_span
# Now the text can be retrieved without trouble!
print("Printed content: ",with_span.string)

1

<a href="javascript:__doPostBack('ctl08$Repeater_Type$ctl01$LinkButton_Link','')" id="ctl08_Repeater_Type_LinkButton_Link_0"><span id="ctl08_Repeater_Type_Label_ModuleName_0">[中心公告]</span>即時示警資訊分享~歡迎加入帳號，國家災害防救科技中心將會提供您各...
                  <img id="ctl08_Repeater_Type_Image_News_0" src="App_Themes/new2.gif"/></a>

Printed content:  None


2

<a href="javascript:__doPostBack('ctl08$Repeater_Type$ctl01$LinkButton_Link','')" id="ctl08_Repeater_Type_LinkButton_Link_0">即時示警資訊分享~歡迎加入帳號，國家災害防救科技中心將會提供您各...
                  </a>

Printed content:  即時示警資訊分享~歡迎加入帳號，國家災害防救科技中心將會提供您各...
                  


In [31]:
with_span = soup_ncdr.select("#ctl08_Repeater_Type_LinkButton_Link_0")[0]
with_span

<a href="javascript:__doPostBack('ctl08$Repeater_Type$ctl01$LinkButton_Link','')" id="ctl08_Repeater_Type_LinkButton_Link_0">即時示警資訊分享~歡迎加入帳號，國家災害防救科技中心將會提供您各...
                  <img id="ctl08_Repeater_Type_Image_News_0" src="App_Themes/new2.gif"/></a>

In [19]:
soup_ncdr.select("#ctl08_Repeater_Type_LinkButton_Link_2")[0].get_text()
soup_ncdr.select("#ctl08_Repeater_Type_LinkButton_Link_2")[0].get_text(strip=True)

'[活動訊息]災害管理科普演講系列活動(03)_防災社區: 『防汛深根 在地傳...\r\n                \xa0\xa0'

'[活動訊息]災害管理科普演講系列活動(03)_防災社區: 『防汛深根 在地傳...'
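
The two approaches can also be compared without hitting the network. This is a minimal sketch on a made-up link that mimics the span + text + img layout seen above. It assumes the built-in "html.parser" backend; `snippet`, `demo`, and `link` are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical link shaped like the news item: <span>, then text, then <img>.
snippet = (
    '<a id="news"><span>[Notice]</span>Real-time alerts are available...\n'
    '      <img src="new.gif"/></a>'
)
demo = BeautifulSoup(snippet, "html.parser")
link = demo.find(id="news")

# .string is None because the tag has more than one child.
before = link.string

# get_text(strip=True) strips every string and joins them, nested tags included.
full_text = link.get_text(strip=True)

# extract() removes the nested tags from the tree, leaving a single string child.
link.span.extract()
link.img.extract()
after = link.string

print(before)        # None
print(full_text)     # [Notice]Real-time alerts are available...
print(after.strip()) # Real-time alerts are available...
```

Note that extract() modifies the tree in place, while get_text() leaves it untouched, so the order of the calls matters if both are used on the same tag.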