# Quotes Survey

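Use Scrapy to crawl every quote, together with its author and tag information, from the scraping practice site http://quotes.toscrape.com, then:

- Sort all quotes by author age
- Count the tags each author uses and sort them from most to least frequent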

## Prepare the Pipeline

In [1]:
import json


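class JsonWriterPipeline:
    def open_spider(self, spider):
        # Open the output file once, when the spider starts.
        self.file = open("quotes.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Save each item as one JSON line.
        line = json.dumps(item)
        self.file.write(f"{line}\n")
        # Return the item so that any later pipeline still receives it.
        return item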
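
The pipeline above stores one JSON object per line (the JSON Lines layout). The following is a minimal, Scrapy-free sketch of that round trip, so the file format can be checked in isolation. The helper names `write_json_lines` and `read_json_lines` and the sample items are made up for illustration and are not part of the notebook.

```python
import io
import json


def write_json_lines(items, fp):
    """Write each item as a single JSON document followed by a newline."""
    for item in items:
        fp.write(json.dumps(item) + "\n")


def read_json_lines(fp):
    """Parse every non-empty line back into a Python object."""
    return [json.loads(line) for line in fp if line.strip()]


# Hypothetical sample items shaped like the spider's output.
items = [
    {"text": "sample quote", "author": "Sample Author", "tags": ["a", "b"]},
    {"text": "another quote", "author": "Other Author", "tags": []},
]

buffer = io.StringIO()  # stands in for the "quotes.jl" file
write_json_lines(items, buffer)
buffer.seek(0)
loaded = read_json_lines(buffer)
```

Because every line is an independent JSON document, the file can be appended to while the crawl is running and read back line by line afterwards.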


## Scrape the quote and author information, using any approach

In [3]:
import scrapy


class QuoteAuthorSpider(scrapy.Spider):
    name = "quote-author"
    start_urls = ["http://quotes.toscrape.com/"]
    custom_settings = {"ITEM_PIPELINES": {JsonWriterPipeline: 400}}

    def parse(self, response):
        for quote in response.css("div.quote"):
            text = quote.css("span.text::text").get()
            author = quote.css("small.author::text").get()
            tags = quote.css("div.tags a.tag::text").getall()
            item = dict(text=text, author=author, tags=tags)

            author_page_link = quote.css(".author + a::attr(href)").get()
            yield response.follow(
                author_page_link,
                callback=self.parse_author,
                cb_kwargs=dict(quote=item),
                dont_filter=True,
            )

        pagination_links = response.css("li.next a")
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response, quote):
        def extract_with_css(query):
            return response.css(query).get(default="").strip()

        quote.update(
            {
                "author": extract_with_css("h3.author-title::text"),
                "birthdate": extract_with_css(".author-born-date::text"),
                "bio": extract_with_css(".author-description::text"),
            }
        )
        yield quote
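
Several quotes share the same author page, and Scrapy's default duplicate filter drops repeated requests to an already seen URL unless `dont_filter=True` is set. The toy simulation below sketches why the spider passes that flag. It is plain Python, not Scrapy's actual filter, and the function `schedule` and the sample URLs are hypothetical.

```python
def schedule(requests, dont_filter):
    """Toy stand-in for a duplicate filter: return the requests that would be fetched."""
    seen = set()
    fetched = []
    for url, quote_text in requests:
        if not dont_filter and url in seen:
            # Dropped as a duplicate, so this quote would never be emitted.
            continue
        seen.add(url)
        fetched.append((url, quote_text))
    return fetched


# Hypothetical author-page requests, each carrying one quote.
requests = [
    ("/author/example-a", "quote 1"),
    ("/author/example-a", "quote 2"),
    ("/author/example-b", "quote 3"),
]

filtered = schedule(requests, dont_filter=False)   # second quote is lost
unfiltered = schedule(requests, dont_filter=True)  # all three quotes survive
```

The trade-off is that each author page is downloaded once per quote. Caching the author details would avoid the repeated downloads, at the cost of extra bookkeeping.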


## Start the crawler

In [5]:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()

process.crawl(QuoteAuthorSpider)
process.start()

2024-10-22 14:28:58 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: scrapybot)
2024-10-22 14:28:58 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.4, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1i  8 Dec 2020), cryptography 3.3.2, Platform Windows-10-10.0.22631-SP0
2024-10-22 14:28:58 [scrapy.crawler] INFO: Overridden settings:
{}


See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2024-10-22 14:28:58 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-10-22 14:28:58 [scrapy.extensions.telnet] INFO: Telnet Password: 23070148bbfddcbd
2024-10-22 14:28:58 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 '

## Load the data and sort by author age

In [11]:
import pandas as pd
from datetime import datetime

# hint: datetime.strptime(date_time_str, "%B %d, %Y") -> (datetime object) 

# YOUR CODE HERE
quotes = pd.read_json("quotes.jl", lines=True)
quotes["birthdate"] = pd.to_datetime(quotes["birthdate"], format="%B %d, %Y")
quotes = quotes.sort_values("birthdate")
quotes


| | text | author | tags | birthdate | bio |
|---|---|---|---|---|---|
| 99 | “A lady's imagination is very rapid; it jumps ... | Jane Austen | [humor, love, romantic, women] | 1775-12-16 | Jane Austen was an English novelist whose work... |
| 94 | “I declare after all there is no enjoyment lik... | Jane Austen | [books, library, reading] | 1775-12-16 | Jane Austen was an English novelist whose work... |
| 53 | “There is nothing I would not do for those who... | Jane Austen | [friendship, love] | 1775-12-16 | Jane Austen was an English novelist whose work... |
| 86 | “There are few people whom I really love, and ... | Jane Austen | [elizabeth-bennet, jane-austen] | 1775-12-16 | Jane Austen was an English novelist whose work... |
| 5 | “The person, be it gentleman or lady, who has ... | Jane Austen | [aliteracy, books, classic, humor] | 1775-12-16 | Jane Austen was an English novelist whose work... |
| ... | ... | ... | ... | ... | ... |
| 95 | “Remember, if the time should come when you ha... | J.K. Rowling | [integrity] | 1965-07-31 | See also: Robert GalbraithAlthough she writes ... |
| 32 | “To the well-organized mind, death is but the ... | J.K. Rowling | [death, inspirational] | 1965-07-31 | See also: Robert GalbraithAlthough she writes ... |
| 25 | “It takes a great deal of bravery to stand up ... | J.K. Rowling | [courage, friends] | 1965-07-31 | See also: Robert GalbraithAlthough she writes ... |
| 7 | “It is our choices, Harry, that show what we t... | J.K. Rowling | [abilities, choices] | 1965-07-31 | See also: Robert GalbraithAlthough she writes ... |


Example result:

![image.png](attachment:49a7cc37-d80d-4fed-a5e8-025ee79cb22b.png)
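
The hint in the cell above mentions `datetime.strptime`. As a rough alternative to the pandas version, the same ordering can be sketched with the standard library alone. The two records reuse birthdates from the output above, written in the `"%B %d, %Y"` form that the format string implies. The helper `parse_birthdate` is a name introduced here, and parsing month names with `%B` assumes an English locale.

```python
from datetime import datetime


def parse_birthdate(date_str):
    """Parse a string such as 'December 16, 1775' into a datetime object."""
    return datetime.strptime(date_str, "%B %d, %Y")


# Two sample records; only the fields needed for sorting are kept.
records = [
    {"author": "J.K. Rowling", "birthdate": "July 31, 1965"},
    {"author": "Jane Austen", "birthdate": "December 16, 1775"},
]

# The earliest birthdate comes first, so the oldest author leads the list.
oldest_first = sorted(records, key=lambda r: parse_birthdate(r["birthdate"]))
```

Sorting on the parsed value rather than on the raw string matters here, since the strings would otherwise be ordered alphabetically by month name.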

## Count the tags each author uses, sorted from most to least frequent

In [21]:
import pandas as pd
from collections import Counter

# hint: Counter(a_list).most_common() -> [(word, count), ...] sorted by count

# YOUR CODE HERE

def word_count(tags):
    # Count each tag and return (tag, count) pairs, most frequent first.
    return Counter(tags).most_common()


# "sum" concatenates the tag lists of each author; the tags are then counted.
new_df = quotes[["author", "tags"]]
new_df = new_df.groupby("author").aggregate({"tags": "sum"})
new_df["tags"] = new_df["tags"].apply(word_count)
new_df

| author | tags |
|---|---|
| Albert Einstein | [(life, 2), (inspirational, 1), (live, 1), (mi... |
| Alexandre Dumas-fils | [(misattributed-to-einstein, 1)] |
| Alfred Tennyson | [(friendship, 1), (love, 1)] |
| Allen Saunders | [(fate, 1), (life, 1), (misattributed-john-len... |
| André Gide | [(life, 1), (love, 1)] |
| Ayn Rand | [] |
| Bob Marley | [(friendship, 1), (love, 1), (music, 1)] |
| C.S. Lewis | [(christianity, 1), (faith, 1), (religion, 1),... |
| Charles Bukowski | [(alcohol, 1), (humor, 1)] |
| Charles M. Schulz | [(chocolate, 1), (food, 1), (humor, 1)] |


Example result:


![image.png](attachment:8aaaa26d-9815-4df7-a8d9-bdcbd7b6ecbf.png)
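
The per-author tag count can also be sketched without pandas, which makes the role of `Counter.most_common()` from the hint easier to see. This is an illustrative alternative, not the notebook's solution. The rows reuse a few author and tag combinations from the output above, and the variable names are introduced here.

```python
from collections import Counter, defaultdict

# (author, tags) pairs, one per quote.
rows = [
    ("Jane Austen", ["humor", "love", "romantic", "women"]),
    ("Jane Austen", ["friendship", "love"]),
    ("J.K. Rowling", ["integrity"]),
]

# Accumulate one Counter per author across all of that author's quotes.
tag_counts = defaultdict(Counter)
for author, tags in rows:
    tag_counts[author].update(tags)

# most_common() returns (tag, count) pairs sorted by count, highest first.
ranked = {author: counts.most_common() for author, counts in tag_counts.items()}
```

Updating a `Counter` in place avoids building one long concatenated tag list per author, which is what the `"sum"` aggregation does in the pandas version.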