# BeautifulSoup

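- `pip install beautifulsoup4` (`pip install bs4` also works; it is a placeholder package that pulls in `beautifulsoup4`)
- Pass the response to `BeautifulSoup()` to turn it into a BeautifulSoup object
- Both HTML and XML can be parsed into a BeautifulSoup object
- Then locate tags with the `find` or `select` methods
    - First find the tag's position with the browser's developer tools
    - Then use `find` or `select` to filter for that specific tag
    - Attributes on the tag can help narrow the match
    - Prefer locating by `id` when possible, since an `id` should be unique within a page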
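
The workflow above can be sketched without any network access. This is a minimal illustration, assuming a made-up HTML snippet (the `main` id and `intro` class are hypothetical and do not come from the PTT page): parse the markup, locate a container by `id`, then narrow down by tag name and attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
html = """
<div id="main">
  <p class="intro">Hello</p>
  <p class="intro">World</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")         # parse the string into a BeautifulSoup object
main = soup.find(id="main")                       # locate by id first
paragraphs = main.find_all("p", class_="intro")   # then narrow down by tag and attribute
texts = [p.get_text() for p in paragraphs]
print(texts)
```

Starting from the `id` keeps the later searches scoped to one small subtree instead of the whole page.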

In [1]:
import os
from dotenv import load_dotenv
from urllib import request
from bs4 import BeautifulSoup

In [18]:
url = 'https://www.ptt.cc/bbs/joke/index.html'
# url = 'http://httpbin.org/get'

In [None]:
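load_dotenv()                                    # read variables from a local .env file
useragent = os.getenv("USER_AGENT")              # User-Agent string stored in .env
headers = {'User-Agent': useragent}
req = request.Request(url=url, headers=headers)  # attach the header to the request
res = request.urlopen(req)
res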

<http.client.HTTPResponse at 0x299c114b430>

In [21]:
# The result is a BeautifulSoup object, not a string
soup = BeautifulSoup(res, 'html.parser')
# soup = BeautifulSoup(res)       # emits a warning because no parser is specified
print(soup)

<!DOCTYPE html>

<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>看板 joke 文章列表 - 批踢踢實業坊</title>
<link href="//images.ptt.cc/bbs/v2.27/bbs-common.css" rel="stylesheet" type="text/css"/>
<link href="//images.ptt.cc/bbs/v2.27/bbs-base.css" media="screen" rel="stylesheet" type="text/css"/>
<link href="//images.ptt.cc/bbs/v2.27/bbs-custom.css" rel="stylesheet" type="text/css"/>
<link href="//images.ptt.cc/bbs/v2.27/pushstream.css" media="screen" rel="stylesheet" type="text/css"/>
<link href="//images.ptt.cc/bbs/v2.27/bbs-print.css" media="print" rel="stylesheet" type="text/css"/>
</head>
<body>
<div id="topbar-container">
<div class="bbs-content" id="topbar">
<a href="/bbs/" id="logo">批踢踢實業坊</a>
<span>›</span>
<a class="board" href="/bbs/joke/index.html"><span class="board-label">看板 </span>joke</a>
<a class="right small" href="/about.html">關於我們</a>
<a class="right small" href="/contact.html">聯絡資訊</a>
</div>
</div>
<div id="ma

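## find(), findAll()
- syntax: `find(tag, attributes)` / `findAll(tag, attributes)`
- `findAll` is the legacy alias of `find_all`; newer code usually spells it `find_all`
- find: returns only the first match (a `Tag`), or `None` when nothing matches
    - `soup.find('div', {'id' : 'action-bar-container'})`
    - `soup.find('div', id = 'action-bar-container')`
- findAll: returns every tag that matches the condition
    - return type → `ResultSet`, which behaves like a list; each element is a `Tag`, so you can keep calling `find` on it
    - `soup.findAll('div', {'id' : 'action-bar-container'})`
    - `soup.findAll('div', id = 'action-bar-container')`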
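
The difference in return types matters most when nothing matches. The sketch below uses a shortened, hypothetical version of the action bar markup and the modern `find_all` spelling to show what each call hands back.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, hypothetical stand-in for the action bar markup.
html = '<div id="action-bar-container"><a href="/a">A</a><a href="/b">B</a></div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")              # first matching Tag
every = soup.find_all("a")          # list-like ResultSet of Tags
missing = soup.find("span")         # no match: None
none_found = soup.find_all("span")  # no match: empty ResultSet

print(type(first).__name__, len(every), missing, len(none_found))
```

Because `find` can return `None`, chaining another call on its result without a check raises `AttributeError` when the tag is absent; iterating over an empty `find_all` result simply does nothing.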

In [56]:
# Contents of <div id="action-bar-container"> ... </div> in the HTML
action_bar_find = soup.find('div', {'id' : 'action-bar-container'})
print(f"type: {type(action_bar_find)}")
action_bar_find     # a single Tag, not a list

type: <class 'bs4.element.Tag'>


<div id="action-bar-container">
<div class="action-bar">
<div class="btn-group btn-group-dir">
<a class="btn selected" href="/bbs/joke/index.html">看板</a>
<a class="btn" href="/man/joke/index.html">精華區</a>
</div>
<div class="btn-group btn-group-paging">
<a class="btn wide" href="/bbs/joke/index1.html">最舊</a>
<a class="btn wide" href="/bbs/joke/index8611.html">‹ 上頁</a>
<a class="btn wide disabled">下頁 ›</a>
<a class="btn wide" href="/bbs/joke/index.html">最新</a>
</div>
</div>
</div>

In [53]:
action_bar = soup.findAll('div', {'id' : 'action-bar-container'})
print(f"type: {type(action_bar)}")
action_bar  # list-like ResultSet
# type(action_bar)

type: <class 'bs4.element.ResultSet'>


[<div id="action-bar-container">
 <div class="action-bar">
 <div class="btn-group btn-group-dir">
 <a class="btn selected" href="/bbs/joke/index.html">看板</a>
 <a class="btn" href="/man/joke/index.html">精華區</a>
 </div>
 <div class="btn-group btn-group-paging">
 <a class="btn wide" href="/bbs/joke/index1.html">最舊</a>
 <a class="btn wide" href="/bbs/joke/index8611.html">‹ 上頁</a>
 <a class="btn wide disabled">下頁 ›</a>
 <a class="btn wide" href="/bbs/joke/index.html">最新</a>
 </div>
 </div>
 </div>]

In [52]:
# Find the first <div> inside action_bar
tmp_div = action_bar[0].find('div')
print(f"type: {type(tmp_div)}")
tmp_div


type: <class 'bs4.element.Tag'>


<div class="action-bar">
<div class="btn-group btn-group-dir">
<a class="btn selected" href="/bbs/joke/index.html">看板</a>
<a class="btn" href="/man/joke/index.html">精華區</a>
</div>
<div class="btn-group btn-group-paging">
<a class="btn wide" href="/bbs/joke/index1.html">最舊</a>
<a class="btn wide" href="/bbs/joke/index8611.html">‹ 上頁</a>
<a class="btn wide disabled">下頁 ›</a>
<a class="btn wide" href="/bbs/joke/index.html">最新</a>
</div>
</div>

In [31]:
tmp_div_div = action_bar[0].div.div
tmp_div_div

<div class="btn-group btn-group-dir">
<a class="btn selected" href="/bbs/joke/index.html">看板</a>
<a class="btn" href="/man/joke/index.html">精華區</a>
</div>
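
Dotted access such as `action_bar[0].div.div` is shorthand for chained `find` calls: each `.div` returns the first `<div>` descendant. A small sketch with hypothetical nested markup (the `outer`, `inner`, and `leaf` names are made up) shows the two forms landing on the same element.

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup; the ids and classes are illustrative only.
html = '<div id="outer"><div class="inner"><div class="leaf">x</div></div></div>'
soup = BeautifulSoup(html, "html.parser")

outer = soup.find("div", id="outer")
via_attr = outer.div.div                       # attribute-style navigation
via_find = outer.find("div").find("div")       # equivalent chained find calls
no_span = outer.span                           # no <span> descendant: None

print(via_attr["class"], via_attr is via_find, no_span)
```

The dotted form is convenient for short paths, but it only ever reaches the first matching descendant at each step and offers no way to filter by attribute.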

In [57]:
# Find every <a> inside action_bar
tmp_a_all = action_bar[0].findAll('a')
print(f"type: {type(tmp_a_all)}")
tmp_a_all

type: <class 'bs4.element.ResultSet'>


[<a class="btn selected" href="/bbs/joke/index.html">看板</a>,
 <a class="btn" href="/man/joke/index.html">精華區</a>,
 <a class="btn wide" href="/bbs/joke/index1.html">最舊</a>,
 <a class="btn wide" href="/bbs/joke/index8611.html">‹ 上頁</a>,
 <a class="btn wide disabled">下頁 ›</a>,
 <a class="btn wide" href="/bbs/joke/index.html">最新</a>]

In [58]:
# Find the first <a> inside action_bar
tmp_a = action_bar[0].find('a')
print(f"type: {type(tmp_a)}")
tmp_a

type: <class 'bs4.element.Tag'>


<a class="btn selected" href="/bbs/joke/index.html">看板</a>

In [59]:
tmp_text_in_a = tmp_a.text
print(f"type: {type(tmp_text_in_a)}")
tmp_text_in_a

type: <class 'str'>


'看板'

In [None]:
tmp_string_in_a = tmp_a.string
print(f"type: {type(tmp_string_in_a)}")
tmp_string_in_a

type: <class 'bs4.element.NavigableString'>


'看板'
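
For a tag with a single text child, `.text` and `.string` show the same characters, as in the two cells above. They diverge once a tag has more than one child. The sketch below reuses the `board` link from the page's top bar, which contains a `<span>` plus trailing text.

```python
from bs4 import BeautifulSoup

# The board link as it appears in the page's top bar.
html = '<a class="board" href="/bbs/joke/index.html"><span class="board-label">看板 </span>joke</a>'
soup = BeautifulSoup(html, "html.parser")

board = soup.a
board_text = board.text            # concatenates all descendant strings
board_string = board.string        # None, because <a> has two children
label_string = board.span.string   # the <span> has exactly one string child

print(repr(board_text), board_string, repr(label_string))
```

In practice `.text` (or `get_text()`) is the safer choice when the inner structure of a tag is not known in advance.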

In [46]:
# Read an attribute from the tag; the syntax looks like a dict lookup, but a Tag is not a dict
tmp_url = tmp_a['href']
tmp_url
"https://www.ptt.cc" + tmp_url

'https://www.ptt.cc/bbs/joke/index.html'
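
String concatenation works here because the `href` is an absolute path. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves a link against the page URL, and `Tag.get` avoids a `KeyError` on links without an `href`, such as the disabled "下頁 ›" button. The `base_url` and `links` names below are illustrative.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://www.ptt.cc/bbs/joke/index.html"

# Two links copied from the paging group; the second has no href.
html = """
<a class="btn wide" href="/bbs/joke/index8611.html">‹ 上頁</a>
<a class="btn wide disabled">下頁 ›</a>
"""
soup = BeautifulSoup(html, "html.parser")

links = []
for a in soup.find_all("a"):
    href = a.get("href")          # None instead of KeyError when the attribute is missing
    if href is not None:
        links.append(urljoin(base_url, href))

print(links)
```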

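## select_one(), select()
- syntax: `select('tag[attribute]')`, where the argument is a CSS selector
- select_one(): returns only the first match
- select(): returns every tag that matches the condition
    - return type → `ResultSet`, which behaves like a list; each element is a `Tag`, so you can keep calling `find` on it
    - `soup.select('div[id="action-bar-container"]')`
    - `soup.select('div#action-bar-container')`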
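
Since the argument is a CSS selector, descendant and class selectors can replace several chained `find` calls. The following sketch runs against a trimmed copy of the action bar markup shown earlier; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the action bar markup from the page.
html = """
<div id="action-bar-container">
<div class="action-bar">
<div class="btn-group btn-group-paging">
<a class="btn wide" href="/bbs/joke/index1.html">最舊</a>
<a class="btn wide" href="/bbs/joke/index8611.html">‹ 上頁</a>
<a class="btn wide disabled">下頁 ›</a>
<a class="btn wide" href="/bbs/joke/index.html">最新</a>
</div>
</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

container = soup.select_one("div#action-bar-container")     # single Tag (or None)
paging_links = soup.select("div.btn-group-paging a.btn")    # descendant + class selectors
labels = [a.get_text() for a in paging_links]
with_href = soup.select("a[href]")                          # only <a> tags that have an href

print(labels, len(with_href))
```

Like `find`, `select_one` returns `None` when nothing matches, so the same check applies before chaining.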

In [61]:
action_bar_sel = soup.select('div[id="action-bar-container"]')
# action_bar_sel = soup.select('div#action-bar-container')
print(f"type: {type(action_bar_sel)}")
action_bar_sel

type: <class 'bs4.element.ResultSet'>


[<div id="action-bar-container">
 <div class="action-bar">
 <div class="btn-group btn-group-dir">
 <a class="btn selected" href="/bbs/joke/index.html">看板</a>
 <a class="btn" href="/man/joke/index.html">精華區</a>
 </div>
 <div class="btn-group btn-group-paging">
 <a class="btn wide" href="/bbs/joke/index1.html">最舊</a>
 <a class="btn wide" href="/bbs/joke/index8611.html">‹ 上頁</a>
 <a class="btn wide disabled">下頁 ›</a>
 <a class="btn wide" href="/bbs/joke/index.html">最新</a>
 </div>
 </div>
 </div>]

In [65]:
action_bar_sel[0].div

<div class="action-bar">
<div class="btn-group btn-group-dir">
<a class="btn selected" href="/bbs/joke/index.html">看板</a>
<a class="btn" href="/man/joke/index.html">精華區</a>
</div>
<div class="btn-group btn-group-paging">
<a class="btn wide" href="/bbs/joke/index1.html">最舊</a>
<a class="btn wide" href="/bbs/joke/index8611.html">‹ 上頁</a>
<a class="btn wide disabled">下頁 ›</a>
<a class="btn wide" href="/bbs/joke/index.html">最新</a>
</div>
</div>

In [75]:
action_bar_sel[0].div.div.next_sibling.next_sibling

<div class="btn-group btn-group-paging">
<a class="btn wide" href="/bbs/joke/index1.html">最舊</a>
<a class="btn wide" href="/bbs/joke/index8611.html">‹ 上頁</a>
<a class="btn wide disabled">下頁 ›</a>
<a class="btn wide" href="/bbs/joke/index.html">最新</a>
</div>

In [62]:
tmp_next_sibling = action_bar_sel[0].div.div.next_sibling.next_sibling
print(f"type: {type(tmp_next_sibling)}")
tmp_next_sibling

type: <class 'bs4.element.Tag'>


<div class="btn-group btn-group-paging">
<a class="btn wide" href="/bbs/joke/index1.html">最舊</a>
<a class="btn wide" href="/bbs/joke/index8611.html">‹ 上頁</a>
<a class="btn wide disabled">下頁 ›</a>
<a class="btn wide" href="/bbs/joke/index.html">最新</a>
</div>

In [63]:
tmp_siblings = tmp_next_sibling.a.next_siblings
for i in tmp_siblings:
    print(i.text)



‹ 上頁


下頁 ›


最新
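
The blank lines in the output above come from the newline text nodes that sit between the `<a>` tags: `next_siblings` yields every node, not only tags. One way to skip them, sketched here on a copy of the paging group, is `find_next_siblings`, which returns only matching tags.

```python
from bs4 import BeautifulSoup

# Copy of the paging group, including the newlines between tags.
html = """<div class="btn-group btn-group-paging">
<a class="btn wide" href="/bbs/joke/index1.html">最舊</a>
<a class="btn wide" href="/bbs/joke/index8611.html">‹ 上頁</a>
<a class="btn wide disabled">下頁 ›</a>
<a class="btn wide" href="/bbs/joke/index.html">最新</a>
</div>"""
soup = BeautifulSoup(html, "html.parser")

first_a = soup.find("a")
all_nodes = list(first_a.next_siblings)        # tags and whitespace strings alike
tag_only = first_a.find_next_siblings("a")     # only the following <a> tags
labels = [a.get_text() for a in tag_only]

print(len(all_nodes), labels)
```

The same whitespace nodes explain why the earlier cell needed `.next_sibling.next_sibling` to reach the second `<div>`.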


