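## The three building blocks of a web page

- ### HTML: builds the page's main structure
- ### CSS: determines the page's visual style
- ### JavaScript: controls the page's dynamic behavior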

## API

- ### Application Programming Interface
- ### How a data owner exposes data to others over a network protocol (typically HTTP)
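
As a rough illustration, many web APIs return their data as JSON text, which a program can decode directly without parsing any HTML. The sketch below uses a made-up payload in place of a real API response; the `books` key and its fields are hypothetical.

```python
import json

# Hypothetical response body; a real API defines its own fields.
payload = '{"books": [{"title": "A Light in the Attic", "price": 51.77}]}'

data = json.loads(payload)        # decode JSON text into Python objects
first_book = data["books"][0]     # plain dict and list access from here on
print(first_book["title"], first_book["price"])
```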

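## Web crawlers

- ### A program that automatically fetches web page content on the user's behalf
- ### Parses the page's HTML to extract data
- ### Uses HTML tags and CSS selectors to locate elements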

## CSS selector

- ### A tool for locating elements within a web page
- ### Usable both in code and when inspecting page source in the browser
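
To make the selector syntax concrete, here is a minimal sketch of a few common selector forms, run through BeautifulSoup's `select()` (the library covered later in this notebook). The markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = '''
<div id="main">
  <p class="note">first</p>
  <section><p class="note warn">second</p></section>
  <p>third</p>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

by_tag = soup.select('p')            # every <p>
by_class = soup.select('p.note')     # <p> whose classes include "note"
by_id = soup.select('#main')         # the element with id="main"
descendants = soup.select('div p')   # <p> anywhere inside a <div>
children = soup.select('div > p')    # <p> that is a direct child of a <div>

print(len(by_tag), len(by_class), len(by_id), len(descendants), len(children))
# prints: 3 2 1 3 2
```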

## Requests

- ### Sends HTTP requests, as a browser would, to retrieve page content

- ### Install Requests

In [None]:
conda install -c anaconda requests

- ### Import the package

In [32]:
import requests

- ### Fetch a page's HTML

In [36]:
# Fetch the Google homepage
google = requests.get('https://www.google.com/?hl=zh_tw')

In [37]:
google.text

'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="zh-TW"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="0685EbYttpJPY3onKpgGBQ==">(function(){window.google={kEI:\'zLqZX_KZKeKFr7wPkdawiAs\',kEXPI:\'0,202162,1151585,5662,730,224,5104,207,2414,790,10,1226,364,1499,611,92,114,383,246,5,1354,648,3451,315,3,66,768,216,284,867,114,99,352,733,19,2571,265,7,131,1116857,1197747,536,328984,13677,4855,32692,16114,17444,11240,9188,8384,4858,1362,9291,3026,7391,8383,1808,4998,7626,306,5296,2054,920,873,4192,6430,14528,4517,2777,919,2277,8,87,2709,885,708,1279,2212,530,149,1103,840,518,1465,56,4258,109,203,1137,2,2669,2023,1777,520,1704,243,2229,93,328,1284,16,2927,2247,1812,1787,3227,2845,7,6068,6286,4455,641,7539,338,4928,108,3407,908,2,941,2614,2397,7468,3277,3,576,970,865,4625,148,3501,2489,7985,4,1252,196,80,2304,1
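
A request can fail or hang, so it may help to set a timeout and check the status before reading `.text`. The sketch below is one possible wrapper, not part of the original notebook: the function name `fetch_html` and the 10-second default timeout are assumptions. It also shows that query parameters can be passed as a dict rather than typed into the URL.

```python
import requests


def fetch_html(url, params=None, timeout=10):
    """Return the response body as text, raising on HTTP error statuses."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()   # raises requests.HTTPError for 4xx/5xx
    return response.text


# Preparing a request builds the final URL without sending anything.
prepared = requests.Request(
    'GET', 'https://www.google.com/', params={'hl': 'zh_tw'}
).prepare()
print(prepared.url)   # https://www.google.com/?hl=zh_tw
```

Calling `fetch_html('https://www.google.com/', params={'hl': 'zh_tw'})` would then perform the actual download.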

## BeautifulSoup

- ### A library for parsing HTML

- ### Install BeautifulSoup

In [None]:
conda install -c anaconda beautifulsoup4

- ### Import the package

In [1]:
from bs4 import BeautifulSoup

- ### Sample HTML

In [2]:
html = '''<html>
<head></head>
<body>
<h1>This is a title</h1>
<p class="subtitle">Lorem ipsum dolor sit amet. Consectetur edipiscim elit.</p>
<p>Here's another p without a class</p>
<ul>
    <li>Rolf</li>
    <li>Charlie</li>
    <li>Jen</li>
    <li>Jose</li>
</ul>
</body>
</html>'''

- ### Parse the page

In [3]:
soup = BeautifulSoup(html, 'html.parser')

- ### Get a single element

In [5]:
ele = soup.find('h1')

In [6]:
ele

<h1>This is a title</h1>
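
One detail worth knowing: `find()` returns `None` when nothing matches, so accessing `.text` on the result without a check raises an `AttributeError`. A minimal sketch, using a trimmed-down copy of the sample markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><h1>This is a title</h1></body></html>',
                     'html.parser')

missing = soup.find('h2')   # there is no <h2> in this markup
print(missing)              # None

# Check for None before using the result, falling back to <h1> here.
title_tag = soup.find('h2')
if title_tag is None:
    title_tag = soup.find('h1')
print(title_tag.text)       # This is a title
```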

- ### Get the text content

In [22]:
ele.text

'This is a title'

In [23]:
ele.string

'This is a title'
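
`.text` and `.string` agree here only because the `<h1>` contains a single piece of text. When a tag has several children, `.text` concatenates all nested text while `.string` returns `None`. A small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <b>world</b></p>', 'html.parser')
p = soup.find('p')

print(p.text)               # Hello world  (all nested text joined)
print(p.string)             # None         (<p> has more than one child)
print(p.find('b').string)   # world        (<b> has exactly one text child)
```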

- ### Get multiple elements

In [7]:
eles = soup.find_all('li')

In [8]:
eles

[<li>Rolf</li>, <li>Charlie</li>, <li>Jen</li>, <li>Jose</li>]
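
The result of `find_all()` behaves like a list of tags, so a common next step is to loop over it and pull out the text. A minimal sketch, repeating only the list portion of the sample markup:

```python
from bs4 import BeautifulSoup

html = '<ul><li>Rolf</li><li>Charlie</li><li>Jen</li><li>Jose</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# Extract the text of every <li>.
names = [li.text for li in soup.find_all('li')]
print(names)   # ['Rolf', 'Charlie', 'Jen', 'Jose']

# limit= stops the search after the given number of matches.
first_two = soup.find_all('li', limit=2)
print(len(first_two))   # 2
```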

- ### Get attributes

In [9]:
parag = soup.find('p')

In [10]:
parag.attrs

{'class': ['subtitle']}
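
Individual attributes can also be read from a tag with dictionary-style access. Note that `class` comes back as a list, because an element may carry several classes. A short sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="btn primary" href="/home">Home</a>',
                     'html.parser')
link = soup.find('a')

print(link['href'])        # /home
print(link['class'])       # ['btn', 'primary']
print(link.get('title'))   # None; link['title'] would raise KeyError
```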

- ### Find an element by its attributes

In [11]:
# Get the tag whose class attribute is 'subtitle'
soup.find('p', {'class':'subtitle'})

<p class="subtitle">Lorem ipsum dolor sit amet. Consectetur edipiscim elit.</p>
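
BeautifulSoup also accepts a `class_` keyword argument as an alternative to the attribute dictionary (the trailing underscore avoids Python's reserved word `class`). A class filter matches any tag whose class list contains the given name. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

html = '''<p class="subtitle">First</p>
<p>Second</p>
<p class="subtitle small">Third</p>'''
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('p', class_='subtitle')
all_subtitles = soup.find_all('p', class_='subtitle')
texts = [p.text for p in all_subtitles]

print(first.text)   # First
print(texts)        # ['First', 'Third']
```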

- ### Specify a selection path

In [12]:
html = '''<html><head></head><body>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
    <article class="product_pod">
            <div class="image_container">
                    <a href="catalogue/a-light-in-the-attic_1000/index.html"><img src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg" alt="A Light in the Attic" class="thumbnail"></a>
            </div>
                <p class="star-rating Three">
                    <i class="icon-star"></i>
                    <i class="icon-star"></i>
                    <i class="icon-star"></i>
                    <i class="icon-star"></i>
                    <i class="icon-star"></i>
                </p>
            <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
            <div class="product_price">
        <p class="price_color">£51.77</p>
<p class="instock availability">
    <i class="icon-ok"></i>
        In stock
</p>
    <form>
        <button type="submit" class="btn btn-primary btn-block" data-loading-text="Adding...">Add to basket</button>
    </form>
            </div>
    </article>
</li>
</body></html>
'''

In [20]:
soup = BeautifulSoup(html, 'html.parser')

In [24]:
# Select article -> h3 -> a
soup.select('article h3 a')

[<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>]

- ### Add an attribute (class) filter

In [25]:
soup.select('article.product_pod h3 a')

[<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>]
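
Beyond a single class, selectors can chain several classes and test for attributes. The sketch below uses a small made-up snippet rather than the book listing above, so the class names and paths are hypothetical.

```python
from bs4 import BeautifulSoup

html = '''<div class="card featured">
  <a href="/item/1" title="Item one">One</a>
  <a href="/item/2">Two</a>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

with_title = soup.select('div.card a[title]')      # <a> that has a title attribute
both_classes = soup.select('div.card.featured a')  # <div> must carry both classes
exact = soup.select('a[href="/item/2"]')           # attribute equals a value

print([a.text for a in with_title])     # ['One']
print([a.text for a in both_classes])   # ['One', 'Two']
print(exact[0].text)                    # Two
```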

In [None]:
# select_one() returns only the first matching element
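
As a minimal sketch of the difference: `select()` always returns a list, while `select_one()` returns the first matching tag, or `None` when nothing matches. The markup here is made up for illustration.

```python
from bs4 import BeautifulSoup

html = '<ul><li class="name">Rolf</li><li class="name">Charlie</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.select_one('li.name')       # a single Tag, not a list
nothing = soup.select_one('li.missing')  # no match

print(first.text)   # Rolf
print(nothing)      # None
```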

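[--QUIZ--]

Select the value of the href attribute of article -> h3 -> a

[--QUIZ--]

Try to extract the value £51.77
and represent it as a float

[--QUIZ--]

Select the p tag under the article tag whose class includes 'star-rating', and print its second class value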