# Python Web Scraping Notes
Source: the course "Python and Data Science Applications" (Python與資料科學應用), taught by 郭耀仁

## Web Data Extraction

- How to fetch the data: [`requests`](<https://2.python-requests.org//en/master/>)

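- How to parse the data: `lxml`, `beautifulsoup`, `pyquery`
    - `pyquery` is built on top of `lxml`; `beautifulsoup` can use `lxml` as its underlying parser, but it also works with Python's built-in `html.parser` or with `html5lib`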

### Web API format

- Use [JSONView](https://chrome.google.com/webstore/detail/jsonview/chklaanhfefbnpoihckbnefhakgolnmc) to inspect what the data looks like
   - It presents JSON in an easy-to-read interface
   - Most responses are structured as a list of dictionaries
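
As a minimal sketch of what "a list of dictionaries" means in practice, the snippet below parses a small JSON string with the standard `json` module. The payload is a hypothetical two-record sample, not real API output.

```python
import json

# Hypothetical payload shaped like a typical Web API response
payload = '[{"SiteName": "A", "AQI": "53"}, {"SiteName": "B", "AQI": "40"}]'

records = json.loads(payload)   # JSON array -> list, JSON object -> dict
print(type(records))            # <class 'list'>
print(type(records[0]))         # <class 'dict'>
print(records[0]["AQI"])        # 53
```

`r.json()` in the `requests` examples performs the same conversion on the response body.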


### Example: AQI

In [0]:
import requests

aqi_url = "https://opendata.epa.gov.tw/ws/Data/AQI/?$format=json"
r = requests.get(aqi_url, verify=False)  # verify=False skips TLS certificate verification (insecure; avoid outside of quick tests)



In [0]:
print(type(r))
print(r.status_code)
# status_code 200: OK
# status_code 404: Not Found
# status_code 403: Forbidden (no permission)

<class 'requests.models.Response'>
200
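
The meaning of a status code can also be looked up with the standard library's `http.HTTPStatus` enum. This is only a small sketch; the helper name `describe_status` is made up for illustration and is not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description such as '404 Not Found' (hypothetical helper)."""
    status = HTTPStatus(code)          # raises ValueError for unknown codes
    return "{} {}".format(status.value, status.phrase)

for code in (200, 403, 404):
    print(describe_status(code))
# 200 OK
# 403 Forbidden
# 404 Not Found
```

With `requests`, calling `r.raise_for_status()` raises an exception for 4xx and 5xx responses, which avoids checking `r.status_code` by hand.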


In [0]:
print(type(r.text))
print(len(r.text))

<class 'str'>
29050


In [0]:
print(type(r.json()))
print(len(r.json()))  # the list has length 81, i.e. it contains 81 dictionaries

<class 'list'>
81


In [0]:
list_of_dicts = r.json()
len(list_of_dicts[0])  # each dictionary has length 23, i.e. 23 key-value pairs

23

In [0]:
for k, v in list_of_dicts[0].items():  # k = key, v = value; dict.items() returns an iterable view of (key, value) tuples
    print(k, v, type(v)) 

SiteName 屏東(琉球) <class 'str'>
County 屏東縣 <class 'str'>
AQI 53 <class 'str'>
Pollutant 細懸浮微粒 <class 'str'>
Status 普通 <class 'str'>
SO2 3.9 <class 'str'>
CO 0.48 <class 'str'>
CO_8hr 0.3 <class 'str'>
O3 18 <class 'str'>
O3_8hr 22 <class 'str'>
PM10 34 <class 'str'>
PM2.5 21 <class 'str'>
NO2 18 <class 'str'>
NOx 21 <class 'str'>
NO 2.9 <class 'str'>
WindSpeed 2 <class 'str'>
WindDirec 315 <class 'str'>
PublishTime 2019-04-21 14:00 <class 'str'>
PM2.5_AVG 16 <class 'str'>
PM10_AVG 31 <class 'str'>
SO2_AVG 2 <class 'str'>
Longitude 120.377222 <class 'str'>
Latitude 22.352222 <class 'str'>
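
Every value in the record above is a `str`, including the measurements. A possible way to convert them before doing arithmetic is sketched below; the helper `to_number` and the choice of which keys count as numeric are assumptions for illustration.

```python
def to_number(value):
    """Convert a numeric string to float; return None when it is empty or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

# A few fields copied from the first record above
record = {"SiteName": "屏東(琉球)", "AQI": "53", "PM2.5": "21", "Pollutant": "細懸浮微粒"}

numeric_keys = ["AQI", "PM2.5"]   # which fields to treat as numbers (an assumption)
converted = {k: to_number(record[k]) for k in numeric_keys}
print(converted)        # {'AQI': 53.0, 'PM2.5': 21.0}
print(to_number(""))    # None
```

Returning `None` instead of raising keeps the conversion usable on fields that arrive as empty strings, as `Pollutant` does for some stations.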


In [0]:
for i in list_of_dicts:
  print(i)

{'SiteName': '屏東(琉球)', 'County': '屏東縣', 'AQI': '53', 'Pollutant': '細懸浮微粒', 'Status': '普通', 'SO2': '3.9', 'CO': '0.48', 'CO_8hr': '0.3', 'O3': '18', 'O3_8hr': '22', 'PM10': '34', 'PM2.5': '21', 'NO2': '18', 'NOx': '21', 'NO': '2.9', 'WindSpeed': '2', 'WindDirec': '315', 'PublishTime': '2019-04-21 14:00', 'PM2.5_AVG': '16', 'PM10_AVG': '31', 'SO2_AVG': '2', 'Longitude': '120.377222', 'Latitude': '22.352222'}
{'SiteName': '苗栗(後龍)', 'County': '苗栗縣', 'AQI': '40', 'Pollutant': '', 'Status': '良好', 'SO2': '1', 'CO': '0.17', 'CO_8hr': '0.2', 'O3': '36', 'O3_8hr': '32', 'PM10': '16', 'PM2.5': '13', 'NO2': '5.1', 'NOx': '6.2', 'NO': '1.1', 'WindSpeed': '3.1', 'WindDirec': '199', 'PublishTime': '2019-04-21 14:00', 'PM2.5_AVG': '12', 'PM10_AVG': '13', 'SO2_AVG': '2', 'Longitude': '120.786028', 'Latitude': '24.616369'}
{'SiteName': '彰化(大城)', 'County': '彰化縣', 'AQI': '38', 'Pollutant': '', 'Status': '良好', 'SO2': '1.6', 'CO': '0.19', 'CO_8hr': '0.2', 'O3': '39', 'O3_8hr': '36', 'PM10': '16', 'PM2.5': '
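
Because each record is a dictionary, filtering and ranking the stations takes only a comprehension or a `key` function. The sketch below uses abbreviated copies of the three records printed above and assumes every `AQI` value is a numeric string, which may not hold for the full data set.

```python
# Abbreviated copies of the three records printed above
sample = [
    {"SiteName": "屏東(琉球)", "AQI": "53", "Status": "普通"},
    {"SiteName": "苗栗(後龍)", "AQI": "40", "Status": "良好"},
    {"SiteName": "彰化(大城)", "AQI": "38", "Status": "良好"},
]

# Keep only the stations whose status is '良好'
good_sites = [d["SiteName"] for d in sample if d["Status"] == "良好"]
print(good_sites)            # ['苗栗(後龍)', '彰化(大城)']

# Find the station with the highest AQI (values are strings, so convert first)
worst = max(sample, key=lambda d: int(d["AQI"]))
print(worst["SiteName"])     # 屏東(琉球)
```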

In [0]:
for k in list_of_dicts[0].keys():
    print(k)

SiteName
County
AQI
Pollutant
Status
SO2
CO
CO_8hr
O3
O3_8hr
PM10
PM2.5
NO2
NOx
NO
WindSpeed
WindDirec
PublishTime
PM2.5_AVG
PM10_AVG
SO2_AVG
Longitude
Latitude


In [0]:
for v in list_of_dicts[0].values():
    print(v)

屏東(琉球)
屏東縣
53
細懸浮微粒
普通
3.9
0.48
0.3
18
22
34
21
18
21
2.9
2
315
2019-04-21 14:00
16
31
2
120.377222
22.352222


In [0]:
list(list_of_dicts[0].items())  # the result is a list of (key, value) tuples

#for k, v in list_of_dicts[0].items():
#    print(k)
#    print(v)
#    print("===")

[('SiteName', '屏東(琉球)'),
 ('County', '屏東縣'),
 ('AQI', '53'),
 ('Pollutant', '細懸浮微粒'),
 ('Status', '普通'),
 ('SO2', '3.9'),
 ('CO', '0.48'),
 ('CO_8hr', '0.3'),
 ('O3', '18'),
 ('O3_8hr', '22'),
 ('PM10', '34'),
 ('PM2.5', '21'),
 ('NO2', '18'),
 ('NOx', '21'),
 ('NO', '2.9'),
 ('WindSpeed', '2'),
 ('WindDirec', '315'),
 ('PublishTime', '2019-04-21 14:00'),
 ('PM2.5_AVG', '16'),
 ('PM10_AVG', '31'),
 ('SO2_AVG', '2'),
 ('Longitude', '120.377222'),
 ('Latitude', '22.352222')]

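#### Supplement: the `enumerate` and `zip` built-in functions
- `enumerate()`: numbers each value (starting from 0 by default) and yields `(index, value)` tuples
- `zip()`: pairs up the elements of two or more iterables position by position and yields them as tuples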

[Built-in Functions](https://docs.python.org/3/library/functions.html)

In [0]:
cities = ['Taipei', 'New York', 'London', 'Tokyo']
for c in cities:
    print(c)

Taipei
New York
London
Tokyo


In [0]:
list(enumerate(cities))

[(0, 'Taipei'), (1, 'New York'), (2, 'London'), (3, 'Tokyo')]

In [0]:
countries = ['Taiwan', 'United States', 'United Kingdom', 'Japan']
list(zip(cities, countries))

[('Taipei', 'Taiwan'),
 ('New York', 'United States'),
 ('London', 'United Kingdom'),
 ('Tokyo', 'Japan')]

In [0]:
capitals = {
    "Taiwan": "Taipei",
    "United States": "Washington DC",
    "United Kingdom": "London"
}
list(capitals.items())

[('Taiwan', 'Taipei'),
 ('United States', 'Washington DC'),
 ('United Kingdom', 'London')]

In [0]:
continents = ["Asia", "North America", "Europe", "Asia"]
list(zip(cities, countries, continents))

[('Taipei', 'Taiwan', 'Asia'),
 ('New York', 'United States', 'North America'),
 ('London', 'United Kingdom', 'Europe'),
 ('Tokyo', 'Japan', 'Asia')]
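
A few more behaviours of these two built-ins that tend to come up when scraping are sketched here, reusing the `cities` and `countries` lists from the cells above.

```python
cities = ['Taipei', 'New York', 'London', 'Tokyo']
countries = ['Taiwan', 'United States', 'United Kingdom', 'Japan']

# enumerate() accepts a start value
numbered = list(enumerate(cities, start=1))
print(numbered[0])             # (1, 'Taipei')

# dict() turns the (key, value) tuples from zip() into a dictionary
country_of = dict(zip(cities, countries))
print(country_of['London'])    # United Kingdom

# zip() stops at the shortest input
print(list(zip(cities, ['Asia', 'North America'])))
# [('Taipei', 'Asia'), ('New York', 'North America')]
```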

### Example: PCHome shopping site

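To tell whether a page's data is loaded by reference (fetched by JavaScript after the page loads) or written directly into the HTML, turn off JavaScript for the site: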

[Quick JavaScript Switcher](https://chrome.google.com/webstore/detail/quick-javascript-switcher/geddoclleiomckbhadiaipdggiiccfje)

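- The Web API data is directly visible: use `requests`
- The Web API data is not directly visible
    - The data is loaded by reference (it disappears from the page once JavaScript is turned off): use Chrome DevTools to find the underlying Web API
      - Open DevTools > Network and reload the page so that requests show up in the Network panel. The data is usually under XHR or Doc; check the entries one by one to find which one carries it
    - The data is written into the page (it is still there after JavaScript is turned off): use a parser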
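
The same check can be roughly approximated in code. `requests` does not execute JavaScript, so `r.text` corresponds to what the page contains with JavaScript turned off. The sketch below looks for a value you can see in the browser inside that raw HTML; the helper name and both HTML strings are hypothetical.

```python
def appears_in_static_html(html_text, needle):
    """Return True when the text seen in the browser is already in the raw HTML."""
    return needle in html_text

# Hypothetical pages: one with the price written in, one filled in later by a script
static_page = "<html><body><span class='price'>29900</span></body></html>"
dynamic_page = "<html><body><div id='app'></div><script src='app.js'></script></body></html>"

print(appears_in_static_html(static_page, "29900"))    # True  -> use a parser
print(appears_in_static_html(dynamic_page, "29900"))   # False -> look for a Web API
```

This is only a heuristic: a value may be embedded in the HTML in a different format than the one displayed.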


In [0]:
import requests

pchome_search_api = "https://ecshweb.pchome.com.tw/search/v3.3/all/results?q=macbook&page=1&sort=sale/dc"
r = requests.get(pchome_search_api)
print(r.status_code)

200
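
The search URL carries its parameters (`q`, `page`, `sort`) in the query string, so URLs for other keywords or pages can be assembled with `urllib.parse.urlencode`. The helper below is a hypothetical sketch; whether the API accepts arbitrary page numbers is an assumption, not something verified here.

```python
from urllib.parse import urlencode

base_url = "https://ecshweb.pchome.com.tw/search/v3.3/all/results"

def build_search_url(keyword, page):
    """Build the search URL for one result page (hypothetical helper)."""
    params = {"q": keyword, "page": page, "sort": "sale/dc"}
    # safe="/" keeps the slash in "sale/dc" unescaped, matching the URL used above
    return base_url + "?" + urlencode(params, safe="/")

urls = [build_search_url("macbook", p) for p in range(1, 4)]
print(urls[0])
# https://ecshweb.pchome.com.tw/search/v3.3/all/results?q=macbook&page=1&sort=sale/dc
```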


In [0]:
prod_dict = r.json()
prod_dict.keys()

dict_keys(['QTime', 'totalRows', 'totalPage', 'range', 'cateName', 'q', 'subq', 'token', 'prods'])

In [0]:
for i in prod_dict["prods"]:
    print(i)

{'Id': 'DYAJBG-19009S5Q7', 'cateId': 'DYAJCW', 'picS': '/items/DYAJBG19009S5Q7/000002_1553827164.jpg', 'picB': '/items/DYAJBG19009S5Q7/000001_1554886523.jpg', 'name': 'MacBook Air 13-inch: 1.8GHz dual-core Intel Core i5, 128GB (MQD32TA/A)', 'describe': '降2千★再搭原廠滑鼠MacBook Air 13第五代 i5 / 8GB / 128GB / 1.8GHz dual\\r\\n降2千★再搭原廠滑鼠活動日期：2019 04 12(五) 15:00 -2019 06 21(五) 23:59\r\nmacbook air  128gb (市價$31900) + magic mouse 2 (市價$2290) \r\n數量有限，售完為止\r\n網路價$34190．驚喜優惠價↘$２９９００\r\n\r\n● intel core  i5 處理器\r\n● intel hd graphics 6000\r\n● ssd 儲存裝置\r\n● 長達 12 小時電池續航力\r\n● 802.11 ac wi-fi\r\n● multi - touch 觸控式軌跡板\r\n● 最長可達 30 天待機時間\r\n● 節能低耗又兼具高效能的設計\r\n\r\n注意事項', 'price': 29900, 'originPrice': 29900, 'author': '', 'brand': '', 'publishDate': '', 'sellerId': '', 'isPChome': 1, 'isNC17': 0, 'couponActid': [], 'BU': 'ec'}
{'Id': 'DYAJBD-A9008Z17R', 'cateId': 'DYAJBD', 'picS': '/items/DYAJBDA9008Z17R/000002_1524644532.png', 'picB': '/items/DYAJBDA9008Z17R/000001_1554889556.jpg', 'name': 'MacBook Pro 

In [0]:
prod_dict["prods"][0].keys()

dict_keys(['Id', 'cateId', 'picS', 'picB', 'name', 'describe', 'price', 'originPrice', 'author', 'brand', 'publishDate', 'sellerId', 'isPChome', 'isNC17', 'couponActid', 'BU'])
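
With the keys known, the fields of interest can be pulled out of each product. The sketch below works on a two-item stand-in for `prod_dict["prods"]`: the first item is abbreviated from the output above, the second is made up so that there is something to compare.

```python
# First record abbreviated from the output above; the second is hypothetical
prods = [
    {"Id": "DYAJBG-19009S5Q7",
     "name": "MacBook Air 13-inch: 1.8GHz dual-core Intel Core i5, 128GB (MQD32TA/A)",
     "price": 29900, "originPrice": 29900},
    {"Id": "HYPOTHETICAL-0001", "name": "Hypothetical laptop",
     "price": 45000, "originPrice": 50000},
]

# (name, price) pairs for every product
summary = [(p["name"], p["price"]) for p in prods]

# The cheapest product, and the ids of products sold below their original price
cheapest = min(prods, key=lambda p: p["price"])
discounted = [p["Id"] for p in prods if p["price"] < p["originPrice"]]

print(cheapest["Id"])    # DYAJBG-19009S5Q7
print(discounted)        # ['HYPOTHETICAL-0001']
```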

### Non-Web-API format

HTML page:
- `<head></head>`
- `<body></body>`
- `<style></style>`
- `<script></script>`

Ways to locate an element:
- tag name
- the element's id
- the element's class
- **the element's CSS selector**
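
The four ways of locating an element can be sketched with Beautiful Soup on a tiny document. The HTML snippet is made up for illustration, and `html.parser` is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = """
<div id="main">
  <p class="note">first</p>
  <p class="note">second</p>
  <span>third</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = [p.text for p in soup.find_all("p")]                 # by tag name
by_id = soup.find(id="main").name                             # by id
by_class = [e.text for e in soup.find_all(class_="note")]     # by class
by_css = soup.select("#main span")[0].text                    # by CSS selector

print(by_tag)     # ['first', 'second']
print(by_id)      # div
print(by_class)   # ['first', 'second']
print(by_css)     # third
```

A CSS selector can combine the other three (`#main` is an id, `span` a tag, `.note` a class), which is why it is the most flexible option.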

A Chrome extension that helps find CSS selectors:
<https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb?hl=zh-TW>

In [0]:
import requests

r = requests.get('https://www.imdb.com/title/tt4154796')
print(r.status_code)
print(type(r))

200
<class 'requests.models.Response'>


In [0]:
html_doc = r.text
print(type(html_doc))
print(len(html_doc))
print(html_doc)

<class 'str'>
247351








<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///title/tt4154796?src=mdot">



        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>

<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
  <script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
        <title>Avengers: Endgame (2019) - IMDb</title>
  <script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 

#### Parsers

- Beautiful Soup <https://www.crummy.com/software/BeautifulSoup/bs4/doc/#>
- PyQuery <https://pythonhosted.org/pyquery/>

```bash
pip install beautifulsoup4 
```

##### BeautifulSoup

In [0]:
from bs4 import BeautifulSoup

rating_css = "strong span"
soup = BeautifulSoup(html_doc, "html.parser")  # name the parser explicitly to avoid a warning and parser-dependent results
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [0]:
soup.select(rating_css)

[<span itemprop="ratingValue">9.2</span>]

In [0]:
type(soup.select(rating_css)[0])

bs4.element.Tag

In [0]:
elem = soup.select(rating_css)[0]

In [0]:
rating = float(elem.text)
print(rating)

9.2


In [0]:
soup.select(".ratingValue span")

[<span itemprop="ratingValue">9.2</span>,
 <span class="grey">/</span>,
 <span class="grey" itemprop="bestRating">10</span>]

In [0]:
rating_values = [i.text for i in soup.select(".ratingValue span")]
print(rating_values)

['9.2', '/', '10']


In [0]:
# Following the code above, extract:
# runtime
# release date
movie_time = [i.text.strip() for i in soup.select("time")]
movie_genre = [i.text for i in soup.select(".subtext a")]
movie_release_date = movie_genre.pop()
movie_release_date = movie_release_date.strip()
#movie_cast = [i.text for i in soup.select(".primary_photo+ td")]
print(movie_time)
print(movie_genre)
print(movie_release_date)
#print(movie_cast)

[<a href="/search/title?genres=action&amp;explore=title_type,genres">Action</a>,
 <a href="/search/title?genres=adventure&amp;explore=title_type,genres">Adventure</a>,
 <a href="/search/title?genres=fantasy&amp;explore=title_type,genres">Fantasy</a>,
 <a href="/title/tt4154796/releaseinfo" title="See more release dates">26 April 2019 (USA)
 </a>,
 <time datetime="PT181M">
                         3h 1min
                     </time>,
 <time datetime="PT181M">181 min</time>]

In [0]:
# Extract the movie poster
elem = soup.select(".poster img")[0]  # <img> is a void element: it has no closing tag
print(type(elem))
print(elem.get("alt"))
print(elem.get("src"))

<class 'bs4.element.Tag'>
Avengers: Endgame Poster
https://m.media-amazon.com/images/M/MV5BMTc5MDE2ODcwNV5BMl5BanBnXkFtZTgwMzI2NzQ2NzM@._V1_UX182_CR0,0,182,268_AL_.jpg
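
An `src` or `href` attribute may hold an absolute URL, as the poster `src` above does, or a path relative to the site, as the `/name/...` cast links later in these notes do. `urllib.parse.urljoin` handles both cases, so it is a safer choice than string concatenation. The image URL in this sketch is a shortened, hypothetical one.

```python
from urllib.parse import urljoin

base = "https://www.imdb.com"

# A relative path is resolved against the base
actor_url = urljoin(base, "/name/nm0000375/")
print(actor_url)
# https://www.imdb.com/name/nm0000375/

# An absolute URL is returned unchanged (shortened, hypothetical image URL)
poster_url = urljoin(base, "https://m.media-amazon.com/images/M/example.jpg")
print(poster_url)
# https://m.media-amazon.com/images/M/example.jpg
```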


##### pyquery

```bash
pip install pyquery
```

In [0]:
from pyquery import PyQuery as pq

d = pq(html_doc)
print(type(d))

<class 'pyquery.pyquery.PyQuery'>


In [0]:
print(d(".primary_photo+ td a"))

<a href="/name/nm0000375/"> Robert Downey Jr.
</a>          <a href="/name/nm0262635/"> Chris Evans
</a>          <a href="/name/nm0749263/"> Mark Ruffalo
</a>          <a href="/name/nm1165110/"> Chris Hemsworth
</a>          <a href="/name/nm0424060/"> Scarlett Johansson
</a>          <a href="/name/nm0719637/"> Jeremy Renner
</a>          <a href="/name/nm0000332/"> Don Cheadle
</a>          <a href="/name/nm0748620/"> Paul Rudd
</a>          <a href="/name/nm1212722/"> Benedict Cumberbatch
</a>          <a href="/name/nm1569276/"> Chadwick Boseman
</a>          <a href="/name/nm0488953/"> Brie Larson
</a>          <a href="/name/nm4043618/"> Tom Holland
</a>          <a href="/name/nm2394794/"> Karen Gillan
</a>          <a href="/name/nm0757855/"> Zoe Saldana
</a>          <a href="/name/nm1431940/"> Evangeline Lilly
</a>          


In [0]:
for i in d(".primary_photo+ td a").items():
  print(i.text())

Robert Downey Jr.
Chris Evans
Mark Ruffalo
Chris Hemsworth
Scarlett Johansson
Jeremy Renner
Don Cheadle
Paul Rudd
Benedict Cumberbatch
Chadwick Boseman
Brie Larson
Tom Holland
Karen Gillan
Zoe Saldana
Evangeline Lilly


In [0]:
movie_cast = [i.text() for i in d(".primary_photo+ td a").items()]
print(movie_cast)

['Robert Downey Jr.', 'Chris Evans', 'Mark Ruffalo', 'Chris Hemsworth', 'Scarlett Johansson', 'Jeremy Renner', 'Don Cheadle', 'Paul Rudd', 'Benedict Cumberbatch', 'Chadwick Boseman', 'Brie Larson', 'Tom Holland', 'Karen Gillan', 'Zoe Saldana', 'Evangeline Lilly']


In [0]:
# Extract the links to each actor's profile page
for i in d(".primary_photo+ td a").items():
  print("http://www.imdb.com"+i.attr("href"))

http://www.imdb.com/name/nm0000375/
http://www.imdb.com/name/nm0262635/
http://www.imdb.com/name/nm0749263/
http://www.imdb.com/name/nm1165110/
http://www.imdb.com/name/nm0424060/
http://www.imdb.com/name/nm0719637/
http://www.imdb.com/name/nm0000332/
http://www.imdb.com/name/nm0748620/
http://www.imdb.com/name/nm1212722/
http://www.imdb.com/name/nm1569276/
http://www.imdb.com/name/nm0488953/
http://www.imdb.com/name/nm4043618/
http://www.imdb.com/name/nm2394794/
http://www.imdb.com/name/nm0757855/
http://www.imdb.com/name/nm1431940/
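
The names and the links come out of the same selector in the same order, so they can be paired with `zip()`. The sketch below uses the first two entries from the outputs above as plain lists instead of a live `PyQuery` object.

```python
# First two entries copied from the outputs above
names = ['Robert Downey Jr.', 'Chris Evans']
hrefs = ['/name/nm0000375/', '/name/nm0262635/']

# Map each actor name to the full profile URL
cast_links = {name: "http://www.imdb.com" + href for name, href in zip(names, hrefs)}
print(cast_links['Chris Evans'])    # http://www.imdb.com/name/nm0262635/
```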


In [0]:
# Write a function get_movie_spec(movie_url) that returns:
# movie_rating
# movie_length
# movie_poster_link
# movie_genre
# movie_release_date
import requests
from bs4 import BeautifulSoup

def get_movie_spec(movie_url):
    r = requests.get(movie_url)
    html_doc = r.text
    soup = BeautifulSoup(html_doc, "html.parser")
    movie_rating = [float(i.text) for i in soup.select("strong span")][0]
    movie_time = [i.text.strip() for i in soup.select("time")]
    movie_genre = [i.text for i in soup.select(".subtext a")]
    movie_release_date = movie_genre.pop()
    movie_release_date = movie_release_date.strip()
    movie_poster_link = [i.get("src") for i in soup.select(".poster img")][0]  # src is already an absolute URL, so no prefix is needed
    movie_spec = {
        "movieRating": movie_rating,
        "movieTime": movie_time,
        "movieGenre": movie_genre,
        "movieReleaseDate": movie_release_date,
        "moviePosterLink": movie_poster_link
    }
    return movie_spec

In [0]:
get_movie_spec("https://www.imdb.com/title/tt4154796")

{'movieGenre': ['Action', 'Adventure', 'Fantasy'],
 'moviePosterLink': 'https://m.media-amazon.com/images/M/MV5BMTc5MDE2ODcwNV5BMl5BanBnXkFtZTgwMzI2NzQ2NzM@._V1_UX182_CR0,0,182,268_AL_.jpg',
 'movieRating': 9.1,
 'movieReleaseDate': '26 April 2019 (USA)',
 'movieTime': ['3h 1min', '181 min']}

In [0]:
'''
When the approach above was used to scrape the cast list from www.imdb.com,
it turned out not to work, so the data is extracted here with BeautifulSoup's basic find / find_all methods.
The condensed one-liner is:
cast = [row.find_all('td')[1].find('a').text.strip() for row in soup.select(".cast_list")[0].find_all('tr')[1:]]
The step-by-step version follows.
'''
cast_list_table = soup.select(".cast_list")[0]
#print(cast_list_table)
cast_list_obs = cast_list_table.find_all("tr")[1:]
#print(cast_list_obs)
cast_list_dim = [i.find_all("td")[1] for i in cast_list_obs]
#print(cast_list_dim)
cast_list = [i.find("a").text.strip() for i in cast_list_dim]
print(cast_list)

In [0]:
# urllib package:
# https://docs.python.org/3/library/urllib.html
from urllib.parse import quote_plus

movie_title = "Captain Marvel"
query_str = quote_plus(movie_title)
query_url = "https://www.imdb.com/find?q={}&s=tt&ttype=ft&ref_=fn_ft".format(query_str)
print(query_url)

https://www.imdb.com/find?q=Captain+Marvel&s=tt&ttype=ft&ref_=fn_ft


In [0]:
from urllib.parse import quote_plus

def get_movie_spec_from_titles(movie_titles):
    movie_info = {}
    for movie_title in movie_titles:
        query_str = quote_plus(movie_title)
        query_url = "https://www.imdb.com/find?q={}&s=tt&ttype=ft&ref_=fn_ft".format(query_str)
        r = requests.get(query_url)
        soup = BeautifulSoup(r.text, "html.parser")
        movie_links = ["https://www.imdb.com" + i.get("href") for i in soup.select(".result_text a")]
        movie_url = movie_links[0]
        movie_spec = get_movie_spec(movie_url)
        movie_info[movie_title] = movie_spec
    return movie_info

In [0]:
x = ["Captain Marvel", "Avengers: Endgame"]
get_movie_spec_from_titles(x)

{'Avengers: Endgame': {'movieGenre': ['Action', 'Adventure', 'Fantasy'],
  'moviePosterLink': 'https://m.media-amazon.com/images/M/MV5BMTc5MDE2ODcwNV5BMl5BanBnXkFtZTgwMzI2NzQ2NzM@._V1_UX182_CR0,0,182,268_AL_.jpg',
  'movieRating': 9.1,
  'movieReleaseDate': '26 April 2019 (USA)',
  'movieTime': ['3h 1min', '181 min']},
 'Captain Marvel': {'movieGenre': ['Action', 'Adventure', 'Sci-Fi'],
  'moviePosterLink': 'https://m.media-amazon.com/images/M/MV5BMTE0YWFmOTMtYTU2ZS00ZTIxLWE3OTEtYTNiYzBkZjViZThiXkEyXkFqcGdeQXVyODMzMzQ4OTI@._V1_UX182_CR0,0,182,268_AL_.jpg',
  'movieRating': 7.1,
  'movieReleaseDate': '8 March 2019 (USA)',
  'movieTime': ['2h 3min', '123 min']}}
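
The `movieTime` values are still strings such as `'2h 3min'` and `'123 min'`, and the `<time>` tag on the page also carries a `datetime` attribute like `PT181M`. A possible way to turn any of these into a number of minutes is sketched below; the function name is hypothetical, and only these three formats are handled.

```python
import re

def runtime_minutes(text):
    """Return the runtime in minutes for strings like '3h 1min', '181 min' or 'PT181M'."""
    iso = re.fullmatch(r"PT(\d+)M", text.strip())
    if iso:                                   # duration written as minutes only, e.g. PT181M
        return int(iso.group(1))
    hours = re.search(r"(\d+)\s*h", text)
    minutes = re.search(r"(\d+)\s*min", text)
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total

print(runtime_minutes("3h 1min"))   # 181
print(runtime_minutes("123 min"))   # 123
print(runtime_minutes("PT181M"))    # 181
```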

In [0]:
import time
from random import randint

def get_movie_spec_from_titles(*args):
    movie_info = {}
    for movie_title in args:
        query_str = quote_plus(movie_title)
        query_url = "https://www.imdb.com/find?q={}&s=tt&ttype=ft&ref_=fn_ft".format(query_str)
        r = requests.get(query_url)
        soup = BeautifulSoup(r.text, "html.parser")
        movie_links = ["https://www.imdb.com" + i.get("href") for i in soup.select(".result_text a")]
        movie_url = movie_links[0]
        movie_spec = get_movie_spec(movie_url)
        movie_info[movie_title] = movie_spec
        print("Now parsing {}...".format(movie_title))
        sleep_sec = randint(2, 5)
        print("Now sleeping for {} secs...".format(sleep_sec))
        time.sleep(sleep_sec)
    return movie_info

In [0]:
get_movie_spec_from_titles("Captain Marvel", "Avengers: Endgame", "Iron Man 3")
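
The random pause between requests can be separated from the scraping logic. The sketch below is one possible arrangement, not the course's method: `polite_map` is a hypothetical helper, and the `sleep` parameter exists only so that the example can run instantly with a stand-in instead of `time.sleep`.

```python
import time
from random import randint

def polite_map(func, items, min_sec=2, max_sec=5, sleep=time.sleep):
    """Apply func to each item, pausing a random number of seconds between calls."""
    results = {}
    for n, item in enumerate(items):
        if n > 0:                      # no need to wait before the first request
            sleep(randint(min_sec, max_sec))
        results[item] = func(item)
    return results

# Stand-ins so that the example runs without network access or real waiting:
# len() plays the role of the scraping function, waits.append() records each pause
waits = []
out = polite_map(len, ["Captain Marvel", "Iron Man 3"], sleep=waits.append)
print(out)          # {'Captain Marvel': 14, 'Iron Man 3': 10}
print(len(waits))   # 1
```

In real use, `func` would be a function that takes a title and returns its movie spec, and `sleep` would be left at its default.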