In [1]:
from scrapy import Spider, Request
from scrapy.crawler import CrawlerProcess

#### Spiders are classes that define how a site (or a group of sites) is scraped: how to perform the crawl (i.e. follow links) and how to extract structured data from the pages (i.e. scrape items). In other words, a Spider is where you define the custom crawling and parsing behaviour for a particular site (or, in some cases, a group of sites).

##### For spiders, the scraping cycle goes through something like this:

###### 1- You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.

###### 2- The first requests to perform are obtained by calling the start_requests() method, which (by default) generates a Request for each URL listed in start_urls, with the parse method as the callback function for those Requests.

###### 3- In the callback function, you parse the response (web page) and return item objects, Request objects, or an iterable of these objects. Those Requests also carry a callback (possibly the same one); Scrapy downloads them and passes each response to the specified callback.

###### 4- In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.

###### 5- Finally, the items returned from the spider are typically persisted to a database (in an Item Pipeline) or written to a file using Feed exports.

Even though this cycle applies (more or less) to any kind of spider, there are different kinds of default spiders bundled into Scrapy for different purposes. We will talk about those types here.

#### name
##### A string which defines the name for this spider. The spider name is how the spider is located (and instantiated) by Scrapy, so it must be unique. However, nothing prevents you from instantiating more than one instance of the same spider. This is the most important spider attribute and it’s required.

##### If the spider scrapes a single domain, a common practice is to name the spider after the domain, with or without the TLD. So, for example, a spider that crawls mywebsite.com would often be called mywebsite.

In [2]:
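class WikipediaSpider(Spider):
    # Unique name Scrapy uses to locate this spider
    name = 'wiki_spider'

    def start_requests(self):
        # Generate the initial Requests; each response is handed to parse()
        urls = ['https://en.wikipedia.org/wiki/Web_scraping']
        for url in urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response, **kwargs):
        # Save the raw page body (bytes) to a local HTML file
        html_file = 'DC_Courses.html'
        with open(html_file, 'wb') as fout:
            fout.write(response.body)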

In [3]:
process = CrawlerProcess()

2022-02-16 19:02:58 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-02-16 19:02:58 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.10.0 (default, Dec 21 2021, 13:36:04) [GCC 7.5.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.0, Platform Linux-5.4.0-96-generic-x86_64-with-glibc2.27
2022-02-16 19:02:58 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor


In [4]:
process.crawl(WikipediaSpider)

2022-02-16 19:03:01 [scrapy.crawler] INFO: Overridden settings:
{}
2022-02-16 19:03:01 [scrapy.extensions.telnet] INFO: Telnet Password: 15993518ff3dcd14
2022-02-16 19:03:01 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-02-16 19:03:01 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrap

<Deferred at 0x7f1bd696fc10>

In [5]:
process.start()

2022-02-16 19:03:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Web_scraping> (referer: None)
2022-02-16 19:03:05 [scrapy.core.engine] INFO: Closing spider (finished)
2022-02-16 19:03:05 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 237,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 29040,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 3.699937,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 2, 16, 17, 3, 5, 401315),
 'httpcompression/response_bytes': 110438,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 1,
 'log_count/INFO': 10,
 'memusage/max': 88592384,
 'memusage/startup': 88592384,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 

In [38]:
process.stop()

2022-02-15 22:25:05 [scrapy.core.engine] INFO: Closing spider (shutdown)
2022-02-15 22:25:05 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 11.318816,
 'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2022, 2, 15, 20, 25, 5, 65914),
 'log_count/INFO': 10,
 'memusage/max': 136728576,
 'memusage/startup': 136728576,
 'start_time': datetime.datetime(2022, 2, 15, 20, 24, 53, 747098)}
2022-02-15 22:25:05 [scrapy.core.engine] INFO: Spider closed (shutdown)


<DeferredList at 0x7f851ebfead0 current result: [(True, None)]>