## Introduction

This tutorial is an introduction to the popular Python library Scrapy, which is used primarily for web scraping. Scrapy lets the programmer choose between gathering data through APIs and writing a general-purpose web crawler.
Scrapy also provides tools for requesting source HTML and parsing the returned data, both of which are covered in this tutorial.

What is a web crawler? A web crawler collects information from a web page (e.g. its URL, page content, and meta tag information), adds the links it finds to a queue of pages to visit next, and then repeats the process on the next page in the queue. Crawlers are key components of web search engines, web archiving, and data mining for various statistics.

![Web Crawler Pipeline](https://photos-6.dropbox.com/t/2/AACswTUCLUi0yyL30XTVeKUS14ivT_VjVnfcAjfL4ZzxeQ/12/136140040/png/32x32/1/_/1/2/AAEAAQAAAAAAAAf4AAAAJGQ1YWYzOWQ1LWJjNDItNDU0Ni05ZWQwLThmMTNiNDI4ODYwNg.png/EP2prGgYzzkgAigC/y4lK_d-PcmUrHehYzQGGtb7e7XAZjy6WHhOo6xmUUcY?preserve_transparency=1&size=1024x768&size_mode=3)

The diagram does not show how the crawler decides which scheduling policy to use when picking the next URL to request.
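The crawl loop described above, together with the scheduling choice the diagram leaves out, can be sketched in plain Python. This is only an illustration and not how Scrapy's scheduler is implemented: the `FAKE_WEB` dictionary, the `crawl` function, and the page names are hypothetical stand-ins for real pages and HTTP requests.

```python
from collections import deque

# Hypothetical in-memory "web": each page maps to the links found on it.
FAKE_WEB = {
    "/a": ["/b", "/c"],
    "/b": ["/d"],
    "/c": ["/d", "/a"],
    "/d": [],
}


def crawl(start_url, fetch_links, breadth_first=True):
    """Visit every reachable page once and return the pages in visit order."""
    frontier = deque([start_url])  # queue of pages still to visit
    seen = {start_url}             # pages already queued, to avoid loops
    visited = []
    while frontier:
        # The scheduling policy: FIFO gives breadth-first, LIFO depth-first.
        url = frontier.popleft() if breadth_first else frontier.pop()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


bfs_order = crawl("/a", FAKE_WEB.__getitem__)
dfs_order = crawl("/a", FAKE_WEB.__getitem__, breadth_first=False)
print(bfs_order)  # ['/a', '/b', '/c', '/d']
print(dfs_order)  # ['/a', '/c', '/d', '/b']
```

The only difference between the two runs is which end of the queue the next URL is taken from, which is exactly the scheduling decision a crawler has to make.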

## Installing the libraries

To install Scrapy using `pip`, run:

    $ pip install Scrapy
    
Alternatively, you can install through `conda` by running:

    $ conda install -c conda-forge scrapy
    
After installing, make sure the following commands work for you:

In [None]:
import scrapy

from scrapy.selector import Selector
from scrapy.http import FormRequest, HtmlResponse, Request, Response

## Request and Response

In previous homeworks we used the third-party library `requests` to request source HTML from websites. Switching to the functionality provided by Scrapy lets us fetch and parse the data faster and more easily.

These requests and responses are __only used in the context of a Spider__ and cannot be used stand-alone: Scrapy executes each request internally and hands the response back to the Spider, like a black box. The code in this section therefore produces no output. To fetch HTML source outside of this context we must use the Scrapy shell.

Let's get started with the basic request:

In [49]:
request = Request(url='http://www.example.com')

Scrapy has the ability to pass a callback function on requests like so:

In [53]:
def parse(self, response):
    return Request(url='http://www.example.com',
                   callback=self.parse_logger)

def parse_logger(self, response):
    self.logger.info('Visited %s', response.url)

The same applies to the `errback` argument, which names a function to call if the request fails. `Request` accepts further useful parameters, which are described in the Scrapy documentation.

`Request` has an extremely helpful subclass, `FormRequest`, which adds functionality for dealing with HTML forms. Form data can be filled in automatically and used, for example, to log in.
Its constructor takes one additional argument:
* __formdata__ (dict or iterable of tuples) - Contains HTML Form data to be assigned to the body of the request.

In [54]:
def parse(self, response):
    return FormRequest(url='https://s3.andrew.cmu.edu/sio/',
                       formdata={'AndrewID': 'yourandrewid', 'Password': 'yourpassword'},
                       callback=self.login_attempt)

def login_attempt(self, response):
    # response.body is bytes, so compare against a bytes literal
    if b'Authentication Failed' in response.body:
        self.logger.info('Failed login.')
    else:
        self.logger.info('Successful login.')
        # continue scraping otherwise...
        

After the Spider makes the request, the Downloader executes and generates a `Response`. The most important fields are: 

* __url__ (str): URL of the response.
* __headers__ (dict-like): headers of the response; values can be single strings or lists, depending on how many values a header has.
* __body__ (bytes): raw body of the response; text subclasses such as `TextResponse` also expose the decoded body as `text`.

Generally you would want to work with one of the subclasses of `Response`, as the base class is meant to be used only for binary data. Most often this is `TextResponse` (or its subclass `HtmlResponse`), which adds encoding capabilities. Otherwise you can make your own subclass!

## Relearning Parsing

Scrapy also has its own methods for extracting data from HTML source. Previously in the course we used BeautifulSoup, but it has the downside of being slow. Web scraping is a very time-intensive task, so we will relearn parsing through Scrapy for better performance.

Scrapy extracts data via so-called "selectors", named because they select certain parts of the HTML document specified by either XPath or CSS expressions. XPath is a language for selecting nodes in XML documents, and CSS is a language for applying styles to HTML documents.


Constructing a `Selector` from text:

In [119]:
body = '<html><body><span>Data Science is not Statistics</span></body></html>'
selector = Selector(text=body)

Constructing a `Selector` from a response:

In [120]:
response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
selector = Selector(response=response)

In [121]:
# More complicated body for parsing
body = '''<html>
           <head>
            <base href='http://scrapytutorial.com/' />
            <title>Parsing is fun!</title>
           </head>
           <body>
            <div id='images'>
             <a href='image1.html'>Name: Spiderup <br /><img src='spider_thumb.jpg' /></a>
             <a href='image2.html'>Name: Pipeline <br /><img src='pipeline_thumb.jpg' /></a>
             <a href='image3.html'>Name: Crawling <br /><img src='crawlin_thumb.jpg' /></a>
             <a href='image4.html'>Name: Crawlers <br /><img src='crawler_thumb.jpg' /></a>
             <a href='image5.html'>Name: Crawfish <br /><img src='crawfish_thumb.jpg' /></a>
             <a href='extra.html'> Name: notimage <br /><gif src='gotem.gif' /></a>
            </div>
           </body>
          </html>'''
response = HtmlResponse(url='http://scrapytutorial.com', body=body, encoding='utf-8')

Responses offer two convenience shortcuts for querying with XPath and CSS: `response.xpath()` and `response.css()`. Each takes an expression in its respective language as its argument.

For a quick review of XPath, below are the most common and important expressions (taken from https://www.w3schools.com/xml/xpath_syntax.asp):

| __Expression__ | Description                                  |
|--------------|--------------------------------------------|
| __nodename__ | Selects all nodes with the name "nodename" |
| __/__        | Selects from the root node                 |
| __//__       | Selects nodes in the document from the current node that match the selection no matter where they are                              |
| __.__        | Selects the current node                   |
| __..__       | Selects the parent of the current node     |
| __@__        | Selects attributes                         |

Predicates are another useful tool: they select nodes that satisfy a condition, such as containing a specific value. They are enclosed in square brackets and are powerful (e.g. for finding the last element of a kind, or elements with a specific attribute). One such use, seen later, is `contains()`, which finds entries that contain a specific substring.

For a quick review of CSS, below are the most common and important expressions (taken from https://www.w3schools.com/cssref/css_selectors.asp):

| __Selector__ | Description                                  |
|--------------|--------------------------------------------|
| __.class__   | Selects all elements with specified class  |
| __#id__      | Selects the element with specified id      |
| __*__        | Selects all elements                       |
| __element__  | Selects all `<`element`>` elements             |
| __element1 element2__  | Selects all `<`element2`>` elements inside `<`element1`>` elements |
| __[attribute=value]__  | Selects all elements with attribute="value" |

In [122]:
response.xpath('//title')

[<Selector xpath='//title' data='<title>Parsing is fun!</title>'>]

`.xpath()` and `.css()` both return a `SelectorList`, so the two can be chained on each other's results:

In [123]:
response.css('img').xpath('@src')

[<Selector xpath='@src' data='spider_thumb.jpg'>,
 <Selector xpath='@src' data='pipeline_thumb.jpg'>,
 <Selector xpath='@src' data='crawlin_thumb.jpg'>,
 <Selector xpath='@src' data='crawler_thumb.jpg'>,
 <Selector xpath='@src' data='crawfish_thumb.jpg'>]

`.extract()` returns the text data of each `Selector` in a `SelectorList` as a list of strings. The following examples should help you get more familiar with these functions:

In [124]:
response.xpath('//a[contains(@href, "image")]/@href').extract()

['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [125]:
response.xpath('//a[contains(@href, "image")]/img/@src').extract()

['spider_thumb.jpg',
 'pipeline_thumb.jpg',
 'crawlin_thumb.jpg',
 'crawler_thumb.jpg',
 'crawfish_thumb.jpg']

Additionally, selectors have a `re()` method that takes a regular expression. Because it returns a list of strings rather than a `SelectorList`, its result cannot be queried any further.

In [126]:
craw_stuff = response.xpath('//a[contains(@href, "image")]/img/@src').re(r'(craw\s*.*)')
print(craw_stuff)

['crawlin_thumb.jpg', 'crawler_thumb.jpg', 'crawfish_thumb.jpg']


## Writing your first spider

Now that we know how to make a `Request` and parse through a `Response`, we can write a basic spider.

A subclass of `scrapy.Spider` uses the following attributes and methods:
* __name__: Unique string id for the spider within the project.
* __start_urls__: An iterable of URL strings to start crawling from.
* __start_requests()__: Returns an iterable of `scrapy.Request` objects (e.g. a list or a generator) from which the Spider starts crawling.
* __parse()__: Handles the response downloaded for each request made. The response parameter is an instance of `scrapy.http.Response` (usually one of its text subclasses) that holds the page content. We can simply print information about the page or save it elsewhere.

Only one of `start_requests()` and `start_urls` needs to be provided.

In [130]:
class BasicSpider(scrapy.Spider):
    name = "basic"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Get the page number from a URL such as .../page/1/
        page = response.url.split("/")[-2]
        filename = 'quotes-' + str(page) + '.html'
        # response.body is bytes, so the file is opened in binary mode
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file ' + filename)

This spider does the bare minimum by only parsing the starting URLs and saving the HTML source of each one. Our next step should be making the Spider go to the next page and store some specific information.

In the following example we want to store information about each author, so we find each author link and follow it. On the resulting author page, the name, birthdate, and bio fields are easy to find.

In [None]:
class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            # default='' avoids calling strip() on None when nothing matches
            return response.css(query).extract_first(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

## How do I run the Spider?

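To actually run the spiders that we've made, we use the Scrapy command-line tool (not to be confused with the interactive Scrapy shell). From inside the Scrapy project that contains the spider, run:

    scrapy crawl <name>

where `<name>` is the unique id of the spider. A spider saved as a single file outside of a project can instead be run with `scrapy runspider <file>.py`.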

## More with the Scrapy command line

When crawling, we can also pass arguments to the spider from the command line by using the `-a` option:

    scrapy crawl <name> -a field=attribute
    
If we run `scrapy crawl quotes -a tag=love`, the `QuotesSpider` instance will have a `tag` attribute set to `"love"`.

In [None]:
class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

## Next Steps

This tutorial covered only the basics of Scrapy, but it should be a good foundation for building more complex Spiders. A very helpful way to keep learning is to experiment in the Scrapy shell, which lets you use `Request` and `Response` outside of the context of a `Spider` (https://doc.scrapy.org/en/latest/topics/shell.html).

On the parsing side, extracting data can be difficult and tedious even with Scrapy's helper functions. It is very common to rely on outside libraries inside `parse()` for specific uses. For example, to extract microdata items we can use the library Extruct (https://github.com/scrapinghub/extruct).

Additional pieces of data can be stored as the values of query parameters in the URLs listed on a page, such as links to its users. Instead of using regular expressions to pull them out, we can rely on the w3lib library (https://github.com/scrapinghub/w3lib). You should always be on the lookout for libraries that make parsing easier, rather than relying solely on the functions provided by Scrapy.