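Before writing a crawler, check whether someone has already built a tool for the site you want to scrape; if so, you can skip writing one yourself.
If nothing exists, you will have to write your own.
When you do have to write a crawler, work through the following questions in order:
1. Does the target site or service offer an API? (FB, Twitter, Google, etc.)
2. Do the URLs/links follow a pattern? (code, date, number, etc.)
3. Is the response parseable JSON?
4. If the page is too complex, try its "print this page" view or the mobile version (m.xxx.xxx.com).
In short, scraping the full page right away should always be the last resort.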
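
The URL-pattern and JSON checks above can be sketched with the standard library alone. This is a minimal sketch, not a recipe for any particular site: the `example.com` endpoint, the `{day}` placeholder, and the helper names `build_dated_urls` and `parse_json_or_none` are all made up for illustration.

```python
import json
from datetime import date, timedelta


def build_dated_urls(template, start, days):
    """Expand a URL template for `days` consecutive dates starting at `start`."""
    return [
        template.format(day=(start + timedelta(days=offset)).strftime('%Y%m%d'))
        for offset in range(days)
    ]


def parse_json_or_none(text):
    """Return the decoded JSON value, or None when `text` is not valid JSON.

    Note that a JSON `null` also decodes to None.
    """
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


# Hypothetical endpoint whose URLs differ only by date.
urls = build_dated_urls('https://example.com/report/{day}.json', date(2024, 1, 30), 3)
print(urls)

# A JSON body can be used directly; an HTML body would need a parser instead.
print(parse_json_or_none('{"title": "hello", "price": 1490}'))
print(parse_json_or_none('<html><body>not json</body></html>'))
```

If the URLs can be generated and the body decodes as JSON, there is usually no need to parse HTML at all.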

### Example 1:

In [1]:
import requests
from bs4 import BeautifulSoup


def main():
    url = 'http://blog.castman.net/web-crawler-tutorial/ch2/blog/blog.html'
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')

    # soup.find('h4') and soup.h4 are equivalent; both return the first <h4> tag.
    # print(soup.find('h4'))
    print('Content of the first h4:')
    print(soup.h4)

    # Get the text of the <a> inside the first <h4>.
    print('\nText content of the first h4:')
    print(soup.h4.a.text)

    print('\nTo find all the h4 text content:')
    h4_tags = soup.find_all('h4')
    for h4 in h4_tags:
        print(h4.a.text)

    print('\nTo find all the h4 text content with class named \'card-title\' :')
    # The following three calls are equivalent.
    # h4_tags = soup.find_all('h4', {'class': 'card-title'})
    # h4_tags = soup.find_all('h4', 'card-title')
    h4_tags = soup.find_all('h4', class_='card-title')
    for h4 in h4_tags:
        print(h4.a.text)

    print('\nTo find elements with id attribute: ')
    print(soup.find(id='mac-p').text.strip())
    # A keyword argument cannot contain a hyphen, so the following line raises a SyntaxError:
    # print(soup.find(data-foo='mac-foo').text.strip())
    # Pass such attributes in a dict through attrs instead:
    print(soup.find_all(attrs={'data-foo': 'mac-foo'}))

    print('\nTo retrieve all the blog post\'s information:')
    divs = soup.find_all('div', 'content')
    for div in divs:
        # print(div.text) would return all the text in one lump, which is hard to process.
        # Reading each child tag separately keeps the fields apart:
        print(div.h6.text.strip(), div.h4.a.text.strip(), div.p.text.strip())

    # Another way to retrieve the blog info is the stripped_strings property.
    # It yields every text fragment under the parent tag, including text nested in child tags,
    # with surrounding whitespace removed. Because it is a generator, printing it directly
    # shows only the generator object, so collect the items into a list first:
    print('\nTo find all blog contents via stripped_strings function:')
    for div in divs:
        # This is a list comprehension; list(div.stripped_strings) gives the same result.
        print([s for s in div.stripped_strings])


if __name__ == '__main__':
    main()

Content of the first h4:
<h4 class="card-title">
<a href="http://www.pycone.com/blogs#pablo">Mac使用者</a>
</h4>

Text content of the first h4:
Mac使用者

To find all the h4 text content:
Mac使用者
給初學者的 Python 網頁爬蟲與資料分析
給初學者的 Python 網頁爬蟲與資料分析
給初學者的 Python 網頁爬蟲與資料分析
給初學者的 Python 網頁爬蟲與資料分析
給初學者的 Python 網頁爬蟲與資料分析

To find all the h4 text content with class named 'card-title' :
Mac使用者
給初學者的 Python 網頁爬蟲與資料分析
給初學者的 Python 網頁爬蟲與資料分析
給初學者的 Python 網頁爬蟲與資料分析
給初學者的 Python 網頁爬蟲與資料分析
給初學者的 Python 網頁爬蟲與資料分析

To find elements with id attribute: 
在Mac環境下安裝Python與Sublime Text3 Read More
[<a data-foo="mac-foo" href="http://www.pycone.com/blogs/mac-python-environment"> <br/>Read More </a>]

To retrieve all the blog post's information:
開發環境設定 Mac使用者 在Mac環境下安裝Python與Sublime Text3 Read More
資料科學 給初學者的 Python 網頁爬蟲與資料分析 (1) 前言 Read More
資料科學 給初學者的 Python 網頁爬蟲與資料分析 (2) 套件安裝與啟動網頁爬蟲 Read More
資料科學 給初學者的 Python 網頁爬蟲與資料分析 (3) 解構並擷取網頁資料 Read More
資料科學 給初學者的 Python 網頁爬蟲與資料分析 (4) 擷取資料及下載圖片 Read More
資料科學 給初學者的 Python 網頁爬蟲與資料
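
The attribute lookup and the `stripped_strings` step from Example 1 can also be tried offline. The sketch below assumes a made-up HTML snippet that only imitates one blog card; it is not the real page, and the text and `href` values in it are invented.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates a single blog card (an assumption, not the real page).
html = """
<div class="content">
  <h6>Setup</h6>
  <h4 class="card-title"><a href="#post">Mac users</a></h4>
  <p id="mac-p">Install Python <a data-foo="mac-foo" href="#more">Read More</a></p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# data-foo cannot be written as a keyword argument, so it goes through attrs.
links = soup.find_all(attrs={'data-foo': 'mac-foo'})
hrefs = [a['href'] for a in links]
print(hrefs)

# stripped_strings yields each non-empty text fragment under the tag,
# skipping the whitespace between the child tags.
card = soup.find('div', class_='content')
fragments = list(card.stripped_strings)
print(fragments)
```

Working on an inline string like this makes it easy to check a selector before pointing it at a live page.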

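### Example 2:
Compared with the previous example, find() and find_all() are not always the most convenient tools for a page like this one. When walking the page structure, parent, children, and next/previous siblings can work just as well.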
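
As a small warm-up for this kind of navigation, here is a sketch on a made-up table (the course names, prices, and links are invented). It assumes the markup contains whitespace between tags, which is common in hand-written HTML; in that case `previous_sibling` may return a whitespace text node, so the sketch uses `find_previous_sibling('td')` and `find_all('td', recursive=False)`, which only consider tags.

```python
from bs4 import BeautifulSoup

# Made-up table; the whitespace between the tags is deliberate.
html = """
<table class="table">
  <tbody>
    <tr>
      <td>Intro course</td>
      <td>1490</td>
      <td><a href="#a">Details</a></td>
    </tr>
    <tr>
      <td>Crawler course</td>
      <td>1890</td>
      <td><a href="#b">Details</a></td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

prices = []
for link in soup.find_all('a'):
    cell = link.parent                             # the <td> that wraps the link
    price_cell = cell.find_previous_sibling('td')  # nearest preceding <td>, ignoring text nodes
    prices.append(int(price_cell.text))
print(prices)

first_row = soup.find('tr')
# Direct <td> children only; unlike .children, this leaves out whitespace strings.
cells = first_row.find_all('td', recursive=False)
cell_texts = [c.text for c in cells]
print(cell_texts)

# Move sideways from one row to the next.
next_row_title = first_row.find_next_sibling('tr').td.text
print(next_row_title)
```

The plain `previous_sibling` and `children` attributes are shorter to write, but they depend on the exact whitespace in the markup.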

In [2]:
import requests
from bs4 import BeautifulSoup

# Structure of the example html page:
#  body
#   - div
#     - h2
#     - p
#     - table.table
#       - thead
#         - tr
#           - th
#           - th
#           - th
#           - th
#       - tbody
#         - tr
#           - td
#           - td
#           - td
#           - td
#             - a
#               - img
#         - tr
#         - ...


def main():
    url = 'http://blog.castman.net/web-crawler-tutorial/ch2/table/table.html'
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')

    count_course_number(soup)
    calculate_course_average_price1(soup)
    calculate_course_average_price2(soup)
    retrieve_all_tr_contents(soup)


def count_course_number(soup):
    print('Total course count: ' + str(len(soup.find('table', 'table').tbody.find_all('tr'))) + '\n')


def calculate_course_average_price1(soup):
    # Compute the average course price.
    # Pick the price cell by its index within each row:
    prices = []
    rows = soup.find('table', 'table').tbody.find_all('tr')
    for row in rows:
        price = row.find_all('td')[2].text
        print(price)
        prices.append(int(price))
    print('Average course price: ' + str(sum(prices) / len(prices)) + '\n')


def calculate_course_average_price2(soup):
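    # Reach the price cell through siblings: each link's parent <td> sits right after the price <td>.
    # previous_sibling returns the adjacent node, so this relies on there being no text node
    # (such as whitespace) between the two <td> tags.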
    prices = []
    links = soup.find_all('a')
    for link in links:
        price = link.parent.previous_sibling.text
        prices.append(int(price))
    print('Average course price: ' + str(sum(prices) / len(prices)) + '\n')


def retrieve_all_tr_contents(soup):
    # Print the contents of every row:
    rows = soup.find('table', 'table').tbody.find_all('tr')
    for row in rows:
        # Instead of all_tds = row.find_all('td'), you can also collect the cells from row.children:
        all_tds = [td for td in row.children]
        if 'href' in all_tds[3].a.attrs:
            href = all_tds[3].a['href']
        else:
            href = None
        print(all_tds[0].text, all_tds[1].text, all_tds[2].text, href, all_tds[3].a.img['src'])


if __name__ == '__main__':
    main()

Total course count: 6

1490
1890
1890
1890
1890
1890
Average course price: 1823.3333333333333

Average course price: 1823.3333333333333

初心者 - Python入門 初學者 1490 http://www.pycone.com img/python-logo.png
Python 網頁爬蟲入門實戰 有程式基礎的初學者 1890 http://www.pycone.com img/python-logo.png
Python 機器學習入門實戰 (預計) 有程式基礎的初學者 1890 http://www.pycone.com img/python-logo.png
Python 資料科學入門實戰 (預計) 有程式基礎的初學者 1890 http://www.pycone.com img/python-logo.png
Python 資料視覺化入門實戰 (預計) 有程式基礎的初學者 1890 http://www.pycone.com img/python-logo.png
Python 網站架設入門實戰 (預計) 有程式基礎的初學者 1890 None img/python-logo.png
