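Once you have crawled a pile of data, you need somewhere to keep it for later processing. Sometimes you will want to download what you found as files; sometimes, to save storage space, you will keep only the URLs; and of course you can also store the scraped information in a database. This part shares a few ways of storing data and looks at how a crawler can persist its results in different situations.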

### Saving to a CSV file
This crawler searches ezprice for a given product keyword, collects the product information, and saves what it finds to a CSV file.
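
Before the full crawler, here is a minimal sketch of how a search keyword turns into the list of result-page URLs. The helper name `build_search_urls` is hypothetical, and the URL pattern is simply the one the crawler uses; it may not match the site's current routes.

```python
import urllib.parse

EZPRICE_URL = 'https://ezprice.com.tw'


def build_search_urls(query, pages):
    """Return the search URL for each result page from 1 to pages."""
    # Percent-encode the keyword so non-ASCII text and spaces are URL-safe.
    encoded_query = urllib.parse.quote(query)
    return ['%s/s/%s/price/?q=%s&p=%d' % (EZPRICE_URL, encoded_query, encoded_query, page_no)
            for page_no in range(1, pages + 1)]


if __name__ == '__main__':
    for url in build_search_urls('吉胖喵', 2):
        print(url)
```

Building the URLs separately from fetching them makes it easy to check the encoding without touching the network.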

In [1]:
import requests
import urllib.parse
import csv
import os
from bs4 import BeautifulSoup


EZPRICE_URL = 'https://ezprice.com.tw'
CSV_FILE_NAME = 'ezprice.csv'


def get_web_content(url):
    resp = requests.get(url)
    if resp.status_code != 200:
        print('Invalid url: ' + resp.url)
        return None
    else:
        return resp.text


def get_price_info(query, page):
    encoded_query = urllib.parse.quote(query)
    doms = list()
    # Fetch result pages 1..page; skip any page that failed to download.
    for page_no in range(1, page + 1):
        url = EZPRICE_URL + '/s/%s/price/?q=%s&p=%s' % (encoded_query, encoded_query, str(page_no))
        result_page = get_web_content(url)
        if result_page is None:
            continue
        doms.append(BeautifulSoup(result_page, 'html5lib'))
    return doms


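def extract_results(dom):
    items = list()
    for div in dom.find_all('div', 'search-rst clearfix'):
        # Skip result blocks that lack the expected title link or price tag;
        # accessing div.h4.a on such a block raises AttributeError.
        title_link = div.h4.a if div.h4 else None
        price_tag = div.find(itemprop='price')
        if title_link is None or price_tag is None:
            continue
        item = list()
        item.append(title_link['title'])
        item.append(price_tag['content'])
        platform = div.find('span', 'platform-name')
        item.append(platform.text.strip() if platform else 'N/A')
        items.append(item)
    return items, len(items)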


def show_results(items):
    for item in items:
        print(item)


def write_to_csv_file(is_first_page, items):
    with open(CSV_FILE_NAME, 'a', encoding='UTF-8', newline='') as file:
        writer = csv.writer(file)
        if is_first_page:
            writer.writerow(('Item', 'Price', 'Store'))
        for item in items:
            writer.writerow(item)


def read_from_csv_file():
    print('\nRead from csv file: ' + CSV_FILE_NAME)
    with open(CSV_FILE_NAME, 'r', encoding='UTF-8') as file:
        reader = csv.DictReader(file)
        for row in reader:
            print(row['Item'], row['Price'], row['Store'])


def main():
    query = '吉胖喵'
    page = 5
    doms = get_price_info(query, page)
    is_first_page = True
    total_item_count = 0
    for dom in doms:
        items, count = extract_results(dom)
        total_item_count += count
        show_results(items)
        write_to_csv_file(is_first_page, items)
        is_first_page = False
    print('There are %s items in %d page(s).' % (total_item_count, page))
    read_from_csv_file()
    # Uncomment this if you don't want to keep the data in csv file.
    # os.remove(CSV_FILE_NAME)


if __name__ == '__main__':
    main()

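The recorded run of this cell ended with `AttributeError: 'NoneType' object has no attribute 'a'`. The only `.a` access in the cell is `div.h4.a` in `extract_results`, so the error means a matched result block had no `<h4>` element, for example because the site's markup differs from what the selectors expect.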
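
Because `write_to_csv_file` opens the file in append mode, running the crawler a second time puts another header row in the middle of the data. One way around this, sketched below, is to write the header only when the file is missing or empty. The helper names `append_rows` and `read_rows` and the sample rows are assumptions made for illustration; they are not part of the crawler.

```python
import csv
import os
import tempfile

HEADER = ('Item', 'Price', 'Store')


def append_rows(csv_path, rows):
    """Append rows to csv_path, writing the header only for a new or empty file."""
    needs_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, 'a', encoding='UTF-8', newline='') as file:
        writer = csv.writer(file)
        if needs_header:
            writer.writerow(HEADER)
        writer.writerows(rows)


def read_rows(csv_path):
    """Read the file back as a list of dicts keyed by the header names."""
    with open(csv_path, 'r', encoding='UTF-8', newline='') as file:
        return list(csv.DictReader(file))


if __name__ == '__main__':
    # Made-up sample data, written to a temporary directory.
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, 'sample.csv')
        append_rows(path, [['Cat pillow', '399', 'Shop A']])
        append_rows(path, [['Cat mug, large', '250', 'N/A']])  # a second "run"
        for row in read_rows(path):
            print(row['Item'], row['Price'], row['Store'])
```

The `csv` module quotes fields that contain commas on its own, so product names like the second sample row survive the round trip intact.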

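### Saving to SQLite
This continues the ezprice crawler from the previous section. Sometimes you may prefer keeping the data in a database rather than a CSV file, in which case you can follow the approach of the crawler below. It takes the CSV file produced in the previous section as input, reads the data out of it, and stores it in a database. SQLite is used as the example database here.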
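
The cell below imports `Item` from `ch5.domain.item`, a module that is not shown in this section. Judging only from how the object is used (`Item(name, price, shop)` and the `name`, `price`, `shop` attributes), a stand-in could look like the following sketch. The dataclass form is an assumption; the real class may be defined differently.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical stand-in for ch5.domain.item.Item."""
    name: str
    price: int
    shop: str


if __name__ == '__main__':
    # Made-up sample values.
    item = Item('Sample pillow', 999, 'Sample shop')
    print(item.name, item.price, item.shop)
```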

In [None]:
from ch5.domain.item import Item
import sqlite3
import csv


DB_NAME = 'db.sqlite'
DROP_TABLE_COMMAND = 'DROP TABLE %s'
CHECK_TABLE_COMMAND = 'SELECT name FROM sqlite_master WHERE type=\'table\' AND name=\'%s\';'
FETCH_ALL_RECORD_COMMAND = 'SELECT * FROM %s;'


def connect_db(db_file):
    return sqlite3.connect(db_file)


def execute_command(connection, sql_cmd):
    cursor = connection.cursor()
    cursor.execute(sql_cmd)
    connection.commit()


def table_exists(connection, table_name):
    cursor = connection.cursor()
    cursor.execute(CHECK_TABLE_COMMAND % table_name)
    # fetchone() returns None when no table with that name exists.
    return cursor.fetchone() is not None


def create_table(connection, table_name):
    create_table_cmd = 'CREATE TABLE %s (id INTEGER PRIMARY KEY AUTOINCREMENT, item TEXT, price INTEGER, shop TEXT)' % table_name
    if not table_exists(connection, table_name):
        print('Table \'%s\' does not exist, creating...' % table_name)
        execute_command(connection, create_table_cmd)
        print('Table \'%s\' created.' % table_name)
    else:
        execute_command(connection, DROP_TABLE_COMMAND % table_name)
        print('Table \'%s\' already exists, initializing...' % table_name)
        execute_command(connection, create_table_cmd)
        print('Table \'%s\' created.' % table_name)


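def insert_data(connection, table_name, item):
    # Use ? placeholders so quotes in the values cannot break the statement;
    # only the table name (an identifier) is formatted into the SQL text.
    insert_record_cmd = 'INSERT INTO %s (item, price, shop) VALUES (?, ?, ?)' % table_name
    cursor = connection.cursor()
    cursor.execute(insert_record_cmd, (item.name, item.price, item.shop))
    connection.commit()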


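def update_data(connection, table_name):
    # Bind the shop names as parameters instead of double-quoted literals.
    update_record_cmd = 'UPDATE %s SET shop = ? WHERE shop = ?' % table_name
    cursor = connection.cursor()
    cursor.execute(update_record_cmd, ('udn買東西2', 'udn買東西'))
    connection.commit()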


def insert_bulk_record(connection, table_name, input_file):
    insert_record_cmd = 'INSERT INTO %s (item, price, shop) VALUES (?, ?, ?)' % table_name
    with open(input_file, 'r', encoding='UTF-8', newline='') as file:
        reader = csv.DictReader(file)
        rows = [(row['Item'], row['Price'], row['Store']) for row in reader]
    # Insert every row with placeholders and commit once at the end.
    connection.cursor().executemany(insert_record_cmd, rows)
    connection.commit()


def fetch_all_record_from_db(connection, sql_cmd):
    cursor = connection.cursor()
    cursor.execute(sql_cmd)
    rows = cursor.fetchall()
    for row in rows:
        print(row)


def main():
    connection = connect_db(DB_NAME)
    table_name = 'record'
    input_file = 'ezprice.csv'
    item = Item('嚕嚕抱枕', 999, '嚕嚕小朋友')
    try:
        create_table(connection, table_name)
        insert_data(connection, table_name, item)
        insert_bulk_record(connection, table_name, input_file)
        update_data(connection, table_name)
        fetch_all_record_from_db(connection, FETCH_ALL_RECORD_COMMAND % table_name)
    except Exception as exception:
        print('Encountered an exception while executing DB tasks, closing the connection...')
        print('Exception message: ' + str(exception))
    finally:
        # Close the connection whether or not the tasks succeeded.
        connection.close()

if __name__ == '__main__':
    main()
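
Product names scraped from the web can contain quotes, so values are safest passed to SQLite through `?` placeholders rather than formatted into the SQL text. The sketch below runs the same CSV-to-SQLite flow against an in-memory database. The function name `load_csv_into_table`, the sample CSV text, and the use of `:memory:` are assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Made-up sample data; the first item name contains a double quote.
SAMPLE_CSV = 'Item,Price,Store\r\n"15"" cat pillow",399,Shop A\r\nCat mug,250,N/A\r\n'


def load_csv_into_table(connection, table_name, csv_file):
    """Create the table if needed and bulk-insert rows from an open CSV file."""
    # Identifiers cannot be bound as parameters, so only use trusted table names.
    connection.execute(
        'CREATE TABLE IF NOT EXISTS %s '
        '(id INTEGER PRIMARY KEY AUTOINCREMENT, item TEXT, price INTEGER, shop TEXT)'
        % table_name)
    rows = [(row['Item'], row['Price'], row['Store'])
            for row in csv.DictReader(csv_file)]
    # "with connection" commits on success and rolls back if an insert fails.
    with connection:
        connection.executemany(
            'INSERT INTO %s (item, price, shop) VALUES (?, ?, ?)' % table_name,
            rows)
    return len(rows)


if __name__ == '__main__':
    connection = sqlite3.connect(':memory:')
    try:
        count = load_csv_into_table(connection, 'record', io.StringIO(SAMPLE_CSV))
        print('Inserted %d rows' % count)
        for row in connection.execute('SELECT item, price, shop FROM record'):
            print(row)
    finally:
        connection.close()
```

Collecting the rows first and calling `executemany` once keeps the whole import in a single transaction, which is usually much faster than committing after every row.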