# Simplest scraping
**This example opens a simple web page (https://taipeicity.github.io/traffic_realtime/) and prints the first 500 bytes of its content.**

In [22]:
import urllib.request
# urlopen returns a response object; read(500) returns the first 500 bytes as a bytes object
with urllib.request.urlopen('https://taipeicity.github.io/traffic_realtime/') as html:
    print(html.read(500))

b'<!doctype html>\n<html>\n  <head>\n    <meta charset="utf-8">\n    <meta http-equiv="X-UA-Compatible" content="chrome=1">\n    <title>\xe8\x87\xba\xe5\x8c\x97\xe5\xb8\x82\xe6\x94\xbf\xe5\xba\x9c \xe4\xba\xa4\xe9\x80\x9a\xe5\x8d\xb3\xe6\x99\x82\xe8\xb3\x87\xe6\x96\x99 \xe9\x96\x8b\xe6\x94\xbe\xe8\xb3\x87\xe6\x96\x99\xe5\xb0\x88\xe5\x8d\x80 by taipeicity</title>\n\n    <link rel="stylesheet" href="stylesheets/styles.css">\n    <link rel="stylesheet" href="stylesheets/github-light.css">\n    <script src="javascripts/scale.fix.js"></script>\n\t<script type="text/javascript"  src="javascripts/normal.js"></script>\n    <meta name="viewport" content="width=device-'
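
The `b'...'` prefix and the `\x..` escapes above show that `read()` returns raw bytes, so the text has to be decoded with the right encoding. That encoding is usually declared in the HTTP `Content-Type` header. The sketch below parses such a header value with the standard library; the helper name `charset_from_content_type` and the UTF-8 fallback are assumptions made for illustration, not part of the original notebook.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or a default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset in lower case, or None if absent
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

With a live response from `urlopen`, the same information should be available through `response.headers.get_content_charset()`, since the headers object is a subclass of `email.message.Message`.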


**If the page is encoded in UTF-8, call decode('utf-8') on the bytes to get readable text.**

In [24]:
with urllib.request.urlopen('https://taipeicity.github.io/traffic_realtime/') as html:
    print(html.read(300).decode('utf-8'))

<!doctype html>
<html>
  <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="chrome=1">
    <title>臺北市政府 交通即時資料 開放資料專區 by taipeicity</title>

    <link rel="stylesheet" href="stylesheets/styles.css">
    <link rel="stylesheet" href="sty
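
Reading a fixed number of bytes and then decoding can fail if the cut falls inside a multi-byte UTF-8 character (each Chinese character in the title takes three bytes). The following is a minimal sketch of one way to handle that with an incremental decoder; the helper name `decode_prefix` is hypothetical.

```python
import codecs


def decode_prefix(data, encoding="utf-8"):
    """Decode bytes that may end in the middle of a multi-byte character."""
    decoder = codecs.getincrementaldecoder(encoding)()
    # final=False makes the decoder hold back an incomplete trailing sequence
    # instead of raising UnicodeDecodeError
    return decoder.decode(data, final=False)


sample = "臺北市政府".encode("utf-8")  # 15 bytes, 3 bytes per character
truncated = sample[:7]                 # two whole characters plus one stray byte

print(decode_prefix(truncated))        # prints the two complete characters
```

Calling `truncated.decode('utf-8')` directly would raise `UnicodeDecodeError`, which is why the plain `decode` call only works when the cut happens to land on a character boundary.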


To handle possible failures, such as a missing page or a site that is temporarily down, catch the exceptions defined in the urllib.error module.

In [25]:
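from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def getTitle(url):
    """Return the first <h1> inside <body>, or None if it cannot be retrieved."""
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        # HTTPError: the server answered with an error status (e.g. 404)
        # URLError: the server could not be reached at all
        print(e)
        return None
    try:
        bsObj = BeautifulSoup(html, "html.parser")
        title = bsObj.body.h1
    except AttributeError:
        # bsObj.body is None when the page has no <body> tag
        return None
    return title

title = getTitle("https://taipeicity.github.io/traffic_realtime/")
if title is None:
    print("Title could not be found")
else:
    print(title)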

<h1 class="header">臺北市政府 交通即時資料 開放資料專區</h1>
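
The two failure cases mentioned above map to different exception classes in `urllib.error`. As a rough sketch, the helper below (the name `describe_error` is made up for this example) turns a caught exception into a short message; it only inspects the exception object, so it can be tried without a network connection.

```python
from urllib.error import HTTPError, URLError


def describe_error(exc):
    """Summarise an exception raised by urlopen."""
    # HTTPError is a subclass of URLError, so it has to be checked first
    if isinstance(exc, HTTPError):
        return f"server responded with HTTP {exc.code}: {exc.reason}"
    if isinstance(exc, URLError):
        return f"could not reach server: {exc.reason}"
    return f"unexpected error: {exc!r}"


# Exceptions built by hand for illustration; urlopen would normally raise them
print(describe_error(HTTPError("http://example.invalid/", 404, "Not Found", None, None)))
print(describe_error(URLError("no route to host")))
```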


**Next, use regular expressions (the re module) to match the contents of HTML tags.**

In [26]:
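import re
from urllib.request import urlopen
# re.findall needs text, so download the whole page and decode it to str first
html = urlopen("https://taipeicity.github.io/traffic_realtime/").read().decode("utf-8")
res = re.findall(r"<title>(.+?)</title>", html)
print("\nPage title is: ", res[0])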


Page title is:  臺北市政府 交通即時資料 開放資料專區 by taipeicity
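
Indexing `res[0]` raises `IndexError` when a page has no `<title>` tag, and the pattern above misses upper-case tags or titles that span several lines. A slightly more defensive variant might look like the sketch below; `extract_title` is a hypothetical helper, and the sample strings are invented for the demonstration.

```python
import re

# Case-insensitive, allows attributes on the tag, and lets . match newlines
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(page_text):
    """Return the page title with surrounding whitespace removed, or None."""
    match = TITLE_RE.search(page_text)
    return match.group(1).strip() if match else None


print(extract_title("<html><head><TITLE>\n Demo page </TITLE></head></html>"))
print(extract_title("<p>no title here</p>"))
```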


**Another example: extracting a paragraph from the page content.**

In [27]:
res = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)    # re.DOTALL lets . match newlines, so multi-line paragraphs are found
print("\nPage paragraph is: ", res[0])


Page paragraph is:  若您有任何問題，歡迎來信 <a href="mailto:services@mail.taipei.gov.tw">services@mail.taipei.gov.tw</a> 或來電(02)2720-8889#2858（李先生），感謝您！
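
The matched paragraph still contains an inline `<a>` tag. The sketch below shows the effect of `re.DOTALL` on a multi-line paragraph and one simple way to reduce a fragment to plain text. The helper names and the sample string are assumptions for illustration, and a tag-stripping regex like this is only adequate for simple, well-formed fragments.

```python
import re
from html import unescape


def find_paragraphs(page_text):
    """Return the inner HTML of every <p>...</p>, including multi-line ones."""
    return re.findall(r"<p>(.*?)</p>", page_text, flags=re.DOTALL)


def paragraph_text(fragment):
    """Drop inline tags and decode entities such as &amp;."""
    without_tags = re.sub(r"<[^>]+>", "", fragment)
    return unescape(without_tags).strip()


sample = '<p>Contact\n<a href="mailto:someone@example.com">someone@example.com</a> &amp; thanks</p>'

# Without re.DOTALL the newline inside the paragraph prevents any match
print(re.findall(r"<p>(.*?)</p>", sample))

for fragment in find_paragraphs(sample):
    print(paragraph_text(fragment))
```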


**Regular expressions can also be used to find all the links (href values).**

In [28]:
res = re.findall(r'href="(.*?)"', html)
print("\nAll links: ", res)


All links:  ['stylesheets/styles.css', 'stylesheets/github-light.css', 'https://github.com/taipeicity/traffic_realtime/zipball/master', 'https://github.com/taipeicity/traffic_realtime/tarball/master', 'https://github.com/taipeicity/traffic_realtime', 'https://github.com/taipeicity', '#%E8%B3%87%E6%96%99%E9%9B%86%E5%88%97%E8%A1%A8-%E8%B3%87%E6%96%99%E7%82%BA-json-%E6%A0%BC%E5%BC%8F', 'mailto:services@mail.taipei.gov.tw', '#%E5%85%AC%E8%BB%8A%E5%8D%B3%E6%99%82%E8%B3%87%E6%96%99-%E8%AA%AA%E6%98%8E%E6%96%87%E4%BB%B6', 'https://drive.google.com/file/d/0BzL9ldn5Fg6dcVZ3eUgybkdiTXc/view?usp=sharing', 'https://tcgbusfs.blob.core.windows.net/blobbus/GetPathDetail.gz', 'http://data.taipei/opendata/datalist/datasetMeta?oid=174d780f-6e87-45d8-b779-c608c6f01432', 'https://tcgbusfs.blob.core.windows.net/ntpcbus/GetPathDetail.gz', 'http://data.taipei/opendata/datalist/datasetMeta?oid=2a8f5f42-942c-4974-8366-b44e90ad9701', 'https://tcgbusfs.blob.core.windows.net/blobbus/GetCarInfo.gz', 'http://data.t
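
The list above mixes relative paths, same-page anchors (`#...`), `mailto:` addresses, and absolute URLs. If only fetchable web links are wanted, one possible follow-up step is to resolve each value against the page URL and keep the HTTP(S) ones. This is a sketch; the function name `absolute_http_links` is hypothetical and the input list is a shortened version of the output above.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://taipeicity.github.io/traffic_realtime/"


def absolute_http_links(hrefs, base_url=BASE_URL):
    """Resolve href values against base_url and keep only http/https links."""
    links = []
    for href in hrefs:
        if href.startswith("#"):
            continue  # same-page anchor, not a separate resource
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links


hrefs = [
    "stylesheets/styles.css",
    "#top",
    "mailto:services@mail.taipei.gov.tw",
    "https://github.com/taipeicity",
]
for link in absolute_http_links(hrefs):
    print(link)
```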