# Scrapy spider trial

#### Date: Feb 23 2021, Tuesday

This notebook runs some trials with the Scrapy module in a Python 3 environment. Scrapy is a fast, high-level web crawling and web scraping framework used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Data collection for the Enigma search engine project will be done with a simple scraper built on a Scrapy-based web spider. To learn more about Scrapy, please visit:<br />
<a href="https://scrapy.org/">Scrapy official website</a>
<br />
<a href="https://docs.scrapy.org/en/latest/">Documentation for Scrapy 2.4.1</a>

First, we import the libraries needed to build a simple scraper that scrapes a single page, <a href="https://quotes.toscrape.com/">QuotesToScrape.com</a>.<br />
We import InteractiveShell from IPython.core.interactiveshell so that the notebook displays the result of every expression in a cell, not only the last one, and platform to report the Python version in use. Most importantly, we import scrapy to build the crawler and the json module to load the crawled items from a JSON file and print them with indentation so that they are readable.

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import platform
platform.python_version()

import json
import scrapy

Let's first build a spider class that contains our crawler. The crawler uses the necessary settings to send a request to the server and obtains the corresponding response, which for now is parsed in the parse method of this class. Later, when we build the actual spider, the parsed response will pass through several other classes in a pipeline before the processed data is finally stored in a database.

In [2]:
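class EnigmaSpider(scrapy.Spider):
    # Name used on the command line: scrapy crawl demoenigma
    name = "demoenigma"
    start_urls = [
        'https://quotes.toscrape.com/'
    ]

    def parse(self, response):
        # Yield one item (a plain dict) per downloaded page.
        yield {
            'url': response.url,
            'response_headers': {
                # Header values are bytes, so str() keeps the b'...' wrapper.
                'content_type': str(response.headers['content-type']),
                #'last_modified': str(response.headers['last-modified']),
            },
            'response_body': {
                # string(...) collapses each element to its text content.
                'head': response.xpath('string(//head)').getall(),
                'body': response.xpath('string(//body)').getall(),
                # href and src attributes of every <a> and <img> element.
                'links': response.css('a::attr(href)').getall(),
                'images': response.css('img::attr(src)').getall(),
            },
        }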
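
Because the spider passes a bytes header value to `str()`, the stored `content_type` keeps a literal `b'...'` wrapper. The following is a minimal sketch of one way to turn header values into plain text instead. It is not part of the original spider: `header_to_text` is a hypothetical helper, and `fake_headers` is a plain dict standing in for `response.headers` so that the example runs without Scrapy.

```python
def header_to_text(raw, encoding="utf-8"):
    """Return a header value as text (hypothetical helper).

    bytes are decoded, None (missing header) becomes an empty string.
    """
    if raw is None:
        return ""
    if isinstance(raw, bytes):
        return raw.decode(encoding, errors="replace")
    return str(raw)


# Assumption: a plain dict with bytes values stands in for response.headers.
fake_headers = {"content-type": b"text/html; charset=UTF-8"}

# What the spider currently stores: the repr of the bytes object.
as_repr = str(fake_headers["content-type"])

# Decoded alternative, plus a header that is absent from the response.
as_text = header_to_text(fake_headers.get("content-type"))
missing = header_to_text(fake_headers.get("last-modified"))

print(as_repr)
print(as_text)
```

Using `.get()` rather than indexing lets an absent header fall through to an empty string instead of stopping the parse.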

The spider class extracts the necessary fields from the response headers and the response body using a few simple CSS and XPath selectors. It also collects the lists of all images and links to other pages found in the HTML document held in the response body. All the parsed data is then yielded; yield is similar to return, but it turns the method into a Python generator that hands back results one at a time. Find more about generators here:<br />
<a href="https://docs.python.org/3/glossary.html#term-generator">Python generators</a>
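
To make the difference between `yield` and `return` concrete, here is a small sketch in plain Python with no Scrapy involved. The function `parse_pages` and the page titles are made up for illustration. It mirrors the shape of the `parse` method: each loop iteration hands one dict to the caller and then pauses until the next item is requested.

```python
def parse_pages(pages):
    """Yield one dict per page instead of building and returning a list."""
    for url, title in pages:
        yield {"url": url, "title": title}


# Illustrative input only; the titles are placeholders.
pages = [
    ("https://quotes.toscrape.com/", "first page"),
    ("https://quotes.toscrape.com/page/2/", "second page"),
]

gen = parse_pages(pages)   # nothing has run yet; gen is a generator object
first = next(gen)          # runs the body up to the first yield
rest = list(gen)           # drains the remaining items

print(first)
print(rest)
```

Calling the function does not execute its body. Work happens only as items are consumed, which is why Scrapy can process each yielded item as soon as it is produced.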

In [3]:
!scrapy crawl demoenigma -o storage.json

2021-02-23 15:47:35 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: enigmaScraper)
2021-02-23 15:47:35 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:58:18) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1j  16 Feb 2021), cryptography 3.4.6, Platform Windows-10-10.0.18362-SP0
2021-02-23 15:47:35 [scrapy.crawler] INFO: Overridden settings:
{'AJAXCRAWL_ENABLED': True,
 'AUTOTHROTTLE_ENABLED': True,
 'BOT_NAME': 'enigmaScraper',
 'CONCURRENT_REQUESTS': 100,
 'COOKIES_ENABLED': False,
 'DEPTH_PRIORITY': 1,
 'DOWNLOAD_TIMEOUT': 15,
 'LOG_LEVEL': 'INFO',
 'NEWSPIDER_MODULE': 'enigmaScraper.spiders',
 'REACTOR_THREADPOOL_MAXSIZE': 20,
 'REDIRECT_ENABLED': False,
 'RETRY_ENABLED': False,
 'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
 'SCHEDULER_PRIORITY_QUEUE': 'scrapy.pq

After the crawler ran in the virtual environment, it returned the response, which was then processed by the parse method inside our spider class. The parsed results were stored in a JSON file. Now we are going to view the results by dumping the JSON data with indentation to make it more readable. To find out more about the json module, visit:<br />
<a href="https://docs.python.org/3/library/json.html">JSON module</a>

In [4]:
with open('storage.json', 'r') as handle:
    data = json.load(handle)
print(json.dumps(data, indent=4))

[
    {
        "url": "https://en.wikipedia.org/wiki/Apple_Inc.",
        "response_headers": {
            "content_type": "b'text/html; charset=UTF-8'",
            "last_modified": "b'Tue, 23 Feb 2021 09:17:23 GMT'"
        },
        "response_body": {
            "head": [
                "\n\nApple Inc. - Wikipedia\ndocument.documentElement.className=\"client-js\";RLCONF={\"wgBreakFrames\":!1,\"wgSeparatorTransformTable\":[\"\",\"\"],\"wgDigitTransformTable\":[\"\",\"\"],\"wgDefaultDateFormat\":\"dmy\",\"wgMonthNames\":[\"\",\"January\",\"February\",\"March\",\"April\",\"May\",\"June\",\"July\",\"August\",\"September\",\"October\",\"November\",\"December\"],\"wgRequestId\":\"YDTINJFVeZyKXre69wy4XAAAAA8\",\"wgCSPNonce\":!1,\"wgCanonicalNamespace\":\"\",\"wgCanonicalSpecialPageName\":!1,\"wgNamespaceNumber\":0,\"wgPageName\":\"Apple_Inc.\",\"wgTitle\":\"Apple Inc.\",\"wgCurRevisionId\":1007083200,\"wgRevisionId\":1007083200,\"wgArticleId\":856,\"wgIsArticle\":!0,\"wgIsRedirect\":!


 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-02-23 15:47:35 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-02-23 15:47:35 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-02-23 15:47:35 [scrapy.core.engine] INFO: Spider opened
2021-02-23 15:47:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-02-23 15:47:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-02-23 15:47:37 [scrapy.core.engine] INFO: Closing spider (finished)
2021-02-23 15:47:37 [scrapy.

This concludes the making of a simple Scrapy spider that crawls a single webpage. Next, we will apply this knowledge to build the actual spider, which will be used to accumulate a large amount of data. To collect data for the search engine, we are going to use a different kind of crawling setup called a 'Broad Crawl', which is tuned to crawl many domains concurrently and is therefore faster and more robust for this purpose than a spider focused on a single site.
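
The "Overridden settings" entry in the crawl log above already shows the kind of configuration a broad crawl relies on. As a rough sketch, the same values could sit in the project's `settings.py` as below. The values are copied from that log entry. The comments are an interpretation, not the author's notes, and `SCHEDULER_PRIORITY_QUEUE` is left out because its value is cut off in the log.

```python
# settings.py (sketch): values copied from the "Overridden settings" log entry.
BOT_NAME = 'enigmaScraper'
NEWSPIDER_MODULE = 'enigmaScraper.spiders'

CONCURRENT_REQUESTS = 100        # many requests in flight across domains
REACTOR_THREADPOOL_MAXSIZE = 20  # larger Twisted thread pool
AUTOTHROTTLE_ENABLED = True      # adapt request rate to server load
LOG_LEVEL = 'INFO'               # less logging overhead than DEBUG

COOKIES_ENABLED = False          # no per-site session state
RETRY_ENABLED = False            # do not spend time on failing pages
REDIRECT_ENABLED = False         # do not follow redirects
DOWNLOAD_TIMEOUT = 15            # give up on slow pages after 15 seconds
AJAXCRAWL_ENABLED = True

# Positive depth priority with FIFO queues gives breadth-first crawl order.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
```

Most of these settings trade completeness on any single site for throughput across many sites, which suits search engine data collection.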

<br /><br />
<div>
    <p>Rakib Md Abdur<br />
        <span style="align-text: right; color: darkgray;">
            Student ID: 178801037<br />
            Yangzhou University
        </span>
    </p>
</div>