## Data Collection with Scrapy

https://scrapy.org/

![Alt text](image.png)

In [None]:
%pip install scrapy "cryptography<3.4" "pyopenssl<22"

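Note: when running Scrapy in a notebook, a second run may fail with `ReactorNotRestartable`. The Twisted reactor cannot be restarted within the same process, so restart the kernel before calling `process.start()` again.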

Example site:
http://books.toscrape.com/

Goal: scrape information about the books.

In [3]:
import scrapy


class BookSpider(scrapy.Spider):
    name = "book"

    start_urls = [
        'http://books.toscrape.com/index.html',
    ]

    def parse(self, response):
        print(response.url)
        print(response.status)
        print(response.headers)
        print(response.body)


In [5]:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(BookSpider)
process.start()

2023-10-09 10:57:30 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: scrapybot)
2023-10-09 10:57:30 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.13, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.4.0, Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:38:29) [Clang 13.0.1 ], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 37.0.4, Platform macOS-14.0-arm64-arm-64bit
2023-10-09 10:57:30 [scrapy.crawler] INFO: Overridden settings:
{}
2023-10-09 10:57:30 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2023-10-09 10:57:30 [scrapy.extensions.telnet] INFO: Telnet Password: 0ae6736fa79e4375
2023-10-09 10:57:30 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2023-10-09 10:57:30 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.

http://books.toscrape.com/index.html
200
{b'Content-Length': [b'51294'], b'Date': [b'Mon, 09 Oct 2023 02:57:31 GMT'], b'Content-Type': [b'text/html'], b'Last-Modified': [b'Wed, 08 Feb 2023 21:02:32 GMT'], b'Etag': [b'"63e40de8-c85e"'], b'Accept-Ranges': [b'bytes']}


```html
<article class="product_pod">
  <div class="image_container">
    <a href="a-light-in-the-attic_1000/index.html"
      ><img
        src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"
        alt="A Light in the Attic"
        class="thumbnail"
    /></a>
  </div>

  <p class="star-rating Three">
    <i class="icon-star"></i>
    <i class="icon-star"></i>
    <i class="icon-star"></i>
    <i class="icon-star"></i>
    <i class="icon-star"></i>
  </p>

  <h3>
    <a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic"
      >A Light in the ...</a
    >
  </h3>

  <div class="product_price">
    <p class="price_color">£51.77</p>

    <p class="instock availability">
      <i class="icon-ok"></i>

      In stock
    </p>

    <form>
      <button
        type="submit"
        class="btn btn-primary btn-block"
        data-loading-text="Adding..."
      >
        Add to basket
      </button>
    </form>
  </div>
</article>
```
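The markup above encodes the rating in a class name (`star-rating Three`) rather than in visible text. A minimal sketch of turning that class string into a number, assuming the site only uses the words One through Five. The helper name `parse_rating` is hypothetical, and in a spider the class string would come from a selector such as `item.css('p.star-rating::attr(class)').get()`:

```python
# Assumed vocabulary: the rating words the site is expected to use.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_rating(class_attr):
    """Return the rating encoded in a class string such as 'star-rating Three', or None."""
    # Split the class attribute into tokens and look for a known rating word.
    for token in (class_attr or "").split():
        if token in RATING_WORDS:
            return RATING_WORDS[token]
    return None


print(parse_rating("star-rating Three"))
```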

In [1]:
import scrapy


class BookSpider(scrapy.Spider):
    name = "book"

    start_urls = [
        'http://books.toscrape.com/index.html',
    ]

    def parse(self, response):
        # css selector
        print(response.css('h3 a'))
        print(response.css('h3 a').getall())
        print(response.css('h3 a::text').getall())
        print(response.css('h3 a::text').get())  # returns the first element

        # regex
        print(response.css('h3 a::text').re(r'[^\.]+'))

        # xpath
        print(response.xpath('//h3/a/text()').get())

        # final version
        # print(response.css('h3 a::attr(title)').getall())
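The `.re()` call in the cell above applies a regular expression to each selected string. As a rough stand-in that uses only the standard `re` module (not Scrapy itself), this sketch shows what the pattern `[^\.]+` does to the truncated link text `A Light in the ...` from the HTML shown earlier. It splits the text at every dot, which drops the trailing ellipsis but would also break up a title that contains a period.

```python
import re

link_text = "A Light in the ..."  # truncated link text from the HTML above

# [^\.]+ matches each run of characters that contains no dot.
pieces = re.findall(r"[^\.]+", link_text)
print(pieces)  # ['A Light in the ']
```

This is why the final version in the cell reads the full title from the `title` attribute instead of cleaning up the link text.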

In [2]:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(BookSpider)
process.start()

2023-10-09 11:30:47 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: scrapybot)
2023-10-09 11:30:47 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.13, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.4.0, Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:38:29) [Clang 13.0.1 ], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 37.0.4, Platform macOS-14.0-arm64-arm-64bit
2023-10-09 11:30:47 [scrapy.crawler] INFO: Overridden settings:
{}
2023-10-09 11:30:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2023-10-09 11:30:47 [scrapy.extensions.telnet] INFO: Telnet Password: 91aae72c26491261
2023-10-09 11:30:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2023-10-09 11:30:47 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.

[<Selector xpath='descendant-or-self::h3/descendant-or-self::*/a' data='<a href="catalogue/a-light-in-the-att...'>, <Selector xpath='descendant-or-self::h3/descendant-or-self::*/a' data='<a href="catalogue/tipping-the-velvet...'>, <Selector xpath='descendant-or-self::h3/descendant-or-self::*/a' data='<a href="catalogue/soumission_998/ind...'>, <Selector xpath='descendant-or-self::h3/descendant-or-self::*/a' data='<a href="catalogue/sharp-objects_997/...'>, <Selector xpath='descendant-or-self::h3/descendant-or-self::*/a' data='<a href="catalogue/sapiens-a-brief-hi...'>, <Selector xpath='descendant-or-self::h3/descendant-or-self::*/a' data='<a href="catalogue/the-requiem-red_99...'>, <Selector xpath='descendant-or-self::h3/descendant-or-self::*/a' data='<a href="catalogue/the-dirty-little-s...'>, <Selector xpath='descendant-or-self::h3/descendant-or-self::*/a' data='<a href="catalogue/the-coming-woman-a...'>, <Selector xpath='descendant-or-self::h3/descendant-or-self::*/a' data='<a href=

More pages:
- http://books.toscrape.com/catalogue/page-1.html
- http://books.toscrape.com/catalogue/page-2.html
- ...
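Since the page URLs differ only in the page number, they can be generated rather than typed out. A minimal sketch, assuming five pages are wanted (the count is an arbitrary choice for illustration, and `page_urls` is a hypothetical name):

```python
URL_TEMPLATE = "http://books.toscrape.com/catalogue/page-{}.html"

# Build the URLs for pages 1 through 5; widen the range for more pages.
page_urls = [URL_TEMPLATE.format(i) for i in range(1, 6)]

for url in page_urls:
    print(url)
```

A list built this way could be assigned to a spider's `start_urls`.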

In [1]:
import scrapy


class BookSpider(scrapy.Spider):
    name = "book"

    start_urls = [
        'http://books.toscrape.com/catalogue/page-1.html',
        'http://books.toscrape.com/catalogue/page-2.html',
        # ...
    ]

    # def start_requests(self):
    #     url_template = 'http://books.toscrape.com/catalogue/page-{}.html'
    #     for i in range(1, 6):
    #         url = url_template.format(i)
    #         yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        items = response.css("article.product_pod")
        for item in items:
            title = item.css('h3 a::attr(title)').get()
            price = item.css('div.product_price p.price_color::text').get()
            print(title, price)
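The price comes back as a string such as `£51.77`, so it has to be converted before it can be compared or summed. A minimal sketch using only the standard library. `parse_price` is a hypothetical helper, and it assumes the price is a currency symbol followed by a plain decimal number:

```python
import re
from decimal import Decimal


def parse_price(text):
    """Extract the numeric part of a price string like '£51.77' as a Decimal."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return Decimal(match.group())


print(parse_price("£51.77"))  # 51.77
```

`Decimal` is used instead of `float` here so that prices keep their exact two-digit values.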


In [2]:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(BookSpider)
process.start()

2023-10-09 11:44:01 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: scrapybot)
2023-10-09 11:44:01 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.13, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.4.0, Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:38:29) [Clang 13.0.1 ], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 37.0.4, Platform macOS-14.0-arm64-arm-64bit
2023-10-09 11:44:01 [scrapy.crawler] INFO: Overridden settings:
{}
2023-10-09 11:44:01 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2023-10-09 11:44:01 [scrapy.extensions.telnet] INFO: Telnet Password: 8d5f90336c765d74
2023-10-09 11:44:01 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']


2023-10-09 11:44:01 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-10-09 11:44:01 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.ref

A Light in the Attic £51.77
Tipping the Velvet £53.74
Soumission £50.10
Sharp Objects £47.82
Sapiens: A Brief History of Humankind £54.23
The Requiem Red £22.65
The Dirty Little Secrets of Getting Your Dream Job £33.34
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull £17.93
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics £22.60
The Black Maria £52.15
Starving Hearts (Triangular Trade Trilogy, #1) £13.99
Shakespeare's Sonnets £20.66
Set Me Free £17.46
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1) £52.29
Rip it Up and Start Again £35.02
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991 £57.25
Olio £23.88
Mesaerion: The Best Science Fiction Stories 1800-1849 £37.59
Libertarianism for Beginners £51.33
It's Only the Himalayas £45.17
The Nameless City (The Nameless City #1) £38.16
The Murder That Never Was (Forensic Instincts #5) £54.11
The Most Perfect Thing: Insi

2023-10-09 11:44:28 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force 
2023-10-09 11:44:28 [scrapy.core.engine] INFO: Closing spider (shutdown)
2023-10-09 11:44:54 [scrapy.crawler] INFO: Received SIGINT twice, forcing unclean shutdown
2023-10-09 11:44:54 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://books.toscrape.com/catalogue/page-3.html> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>, <twisted.python.failure.Failure twisted.web.http._DataLoss: >]
2023-10-09 11:44:54 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://books.toscrape.com/catalogue/page-2.html> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>, <twisted.python.failure.Failure twisted.web.http._DataLoss: >]
2023-10

Saving data with a Pipeline

In [1]:
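import json


class JsonWriterPipeline:
    """Write each scraped item as one JSON object per line (JSON Lines)."""

    def open_spider(self, spider):
        # Called once when the spider starts; UTF-8 keeps characters such as "£" readable.
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # json.dumps works directly here because the items are plain dicts.
        line = json.dumps(item, ensure_ascii=False)
        self.file.write(f"{line}\n")
        return item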
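The pipeline writes one JSON object per line, a layout often called JSON Lines. A minimal sketch of reading such a file back, using a temporary file with two sample records taken from the earlier output so that it runs without a crawl (`read_json_lines` is a hypothetical helper):

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Return a list with one parsed object per non-empty line of the file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Sample records with title and price fields, as printed in the earlier output.
sample = [
    {"title": "A Light in the Attic", "price": "£51.77"},
    {"title": "Tipping the Velvet", "price": "£53.74"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "items.jl")
    # Write the samples the same way the pipeline does: one JSON object per line.
    with open(path, "w", encoding="utf-8") as f:
        for record in sample:
            f.write(json.dumps(record) + "\n")
    books = read_json_lines(path)

print(books[0]["title"], books[0]["price"])
```

After a real crawl, the same helper could be pointed at `items.jl` instead of the temporary file.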


In [2]:
import scrapy



class BookSpider(scrapy.Spider):
    name = "book"

    def start_requests(self):
        url_template = 'http://books.toscrape.com/catalogue/page-{}.html'
        for i in range(1, 6):
            url = url_template.format(i)
            yield scrapy.Request(url, callback=self.parse)

    custom_settings = {
        'ITEM_PIPELINES': {
            JsonWriterPipeline: 400
        }
    }

    def parse(self, response):
        for item in response.css("article.product_pod"):
            title = item.css('h3 a::attr(title)').get()
            price = item.css('div.product_price p.price_color::text').get()
            yield dict(title=title, price=price)


In [3]:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(BookSpider)
process.start()

2023-10-09 11:50:40 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: scrapybot)
2023-10-09 11:50:40 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.13, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.4.0, Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:38:29) [Clang 13.0.1 ], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 37.0.4, Platform macOS-14.0-arm64-arm-64bit
2023-10-09 11:50:40 [scrapy.crawler] INFO: Overridden settings:
{}
2023-10-09 11:50:40 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor


2023-10-09 11:50:40 [scrapy.extensions.telnet] INFO: Telnet Password: 13eba7e8fef641dc
2023-10-09 11:50:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2023-10-09 11:50:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downlo