# Scrapy Tutorial
##### by Malvin De Nunez Estevez - mdenunez@andrew.cmu.edu

Scrapy is primarily a web scraping library that allows the extraction of structured data from websites for different applicatons. It can also be used to collect data using APIs and as web crawler.

Technically one could do all scraping using simply libraries such as Requests and Beatiful Soup. But Scrapy provides lots of built-in features that would be tedious to implement and a great deal of modularity for extensive scraping tasks. You can read some arguments of why one would use Scrapy over an HTTP library plus parsing library in this post: 
https://www.quora.com/Why-would-some-use-scrapy-instead-of-just-crawling-with-requests-or-urllib2

This tutorial will go over the basics of how to set up Scrapy and then will illustrate its usage with two examples.

##### Disclaimer

The examples below include links of different websites that may change over time. Also, the code that parses the websites may become useless if the HTML layouts were to change. The examples are current as of 11/02/2016. Please update as necessary to prevent errors or just have as reference. Make sure logging is enabled to use it to debug any errors.

## Installation

Scrapy can be easily installed through a pip install with the command below. It can be also be donwloaded at: https://scrapy.org/download/

pip install scrapy

In [1]:
import scrapy

## Setting up the environment
The traditional way of using Scrapy would be through a "Scrapy Project", which contains different directories and files for every step of the scraping pipeline. The modularity and neatness of this approach requires to create and modify different files and run things using the command prompt. Instead of this, this tutorial will use the Scrapy API methodology trying to keep everything within the notebook. 

The official documentation of the API can be found here:
https://doc.scrapy.org/en/latest/topics/api.html

### Scrapy Spider
A Scrapy spider is a class that will dictate how we are going to crawl through the websites and scrape the data in them. All spider objects will subclass scrapy.Spider. Below is the general structure of a spider.

In [None]:
class OurSpider(scrapy.Spider):
    name = ""
    start_urls = [
        '',
        '',
    ]
    def start_requests(self):
        pass
    def parse(self, response):
        pass

#### Structure Overview
* name: identifier for the Spider. 
* start_urls: list of urls to be scraped. 
* start_requests(): specifies how to perform HTTP requests. When the request is done here one specifies how to handle the response (e.g., parse() method for successful requests).
* parse(): method that will handle the parsing of the response and will also define how to crawl the websites, if applicable. 

### Scrapy Item
A Scrapy item is a class where the data collected from websites can be stored. This class effectively recreates dictionaries. The biggest difference is that an Item by default does not allow you to assign a value to a key that was not declared as a field. It follows the following format:

In [None]:
class First_scrapyItem(scrapy.Item):
    name = scrapy.Field() #attribute "name" to be extracted 
    address = scrapy.Field() #attribute "address" to be extracted
    #...

The advantage of defining an item class is that it allows to easily modify or validate the collected items (as shown in the second application) and to export them to files such as JSON and CSV.

### Application: Yelp restaurant reviews

To start off let's create a spider that will crawl through all pages of a specific Yelp restaurant and parse the review comments. We are interested in the author, date, rating, and content (text) of the reviews. For this example, we will choose the restaurant Gaucho Parrila Argentina: https://www.yelp.com/biz/gaucho-parrilla-argentina-pittsburgh

Before building the spider we need to know the HTML structure of the website, and especially of the comments to be parsed. The individual reviews in the Yelp websites are within 'div' tags that have attribute itemprop="review". Inside these tags the features of interest can be found by searching for the right "itemprop" attribute, as done in the code below.

Although Scrapy has its own built in methodology to browse through HTML trees with the scrapy.selector class, the library Beatiful Soup will be used in this tutorial instead. Beautiful Soup is fairly user-friendly and intuitive.

#### Review_Item (scrapy.item object)
The Review_Item class will indicate the fields that we are interested in collecting.

In [2]:
class Review_Item(scrapy.Item):
    author = scrapy.Field() #author of review
    date = scrapy.Field() #date published
    rating = scrapy.Field() #rating (1-5)
    text = scrapy.Field() #content of review

#### YelpSpider (scrapy.spider object)

The YelpSpider class will follow the structure above for the most part. Here are some changes and pointers: 
* errback_httpbin method: method that will report in the log requests that fail and the reason
* In the HTTP requests the 'dont_obey_robotstxt' attribute is set to true. This will prevent the scrapy bot from being blocked. Only information publicly displayed is being extracted so this should be fine. 
* The parse method will scrape and parse the reviews as well as collect the link to the next page. After getting the following link, if any, the parse method will call itself. Once there are no more pages left in that restaurant, the following link in the start_urls list would be scraped (but here there is only one).




In [2]:
from bs4 import BeautifulSoup

#needed for errback_httpbin method
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

In [4]:
class YelpSpider(scrapy.Spider):
    name = "Yelp_Spider"
    start_urls = ['https://www.yelp.com/biz/gaucho-parrilla-argentina-pittsburgh']
  
    #get HTTP request for every link in start_urls
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.errback_httpbin, meta={'dont_obey_robotstxt':True})
    
    #parse HTTP response
    def parse(self, response):
        #create soup and find all reviews
        soup = BeautifulSoup(response.text, 'lxml')
        raw_reviews = soup.findAll(itemprop="review")
        #initiliaze review item and find desired information
        review = Review_Item()
        for each_review in raw_reviews:
            review['author'] = each_review.find(itemprop='author')['content']
            review['date'] = each_review.find(itemprop='datePublished')['content']
            review['rating'] = each_review.find(itemprop='ratingValue')['content']
            review['text'] = each_review.find(itemprop='description').text
            yield review

        # find link to next page. If exists, call parse method again. 
        # Otherwise, make request for next link in start_urls, if any.
        next_page = soup.find("a", class_="u-decoration-none next pagination-links_anchor")
        if next_page is not None:
            next_page = response.urljoin(next_page["href"])
            yield scrapy.Request(next_page, callback=self.parse,meta={'dont_obey_robotstxt':True})

    #handle HTTP request failures
    def errback_httpbin(self, failure):
        # logs failures
        self.logger.error(repr(failure))
        if failure.check(HttpError):
            response = failure.value.response
            self.logger.error("HttpError occurred on %s", response.url)
        elif failure.check(DNSLookupError):
            request = failure.request
            self.logger.error("DNSLookupError occurred on %s", request.url)    
        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error("TimeoutError occurred on %s", request.url)

Now that the spider is created, let's set up the API to perform the crawl. A few pointers:

* We are initilizing a CrawlerProcess object that will manage the crawls. Although CrawlerProcess could directly take a Scrapy Spider, we are first creating a Crawler object from our YelpSpider that will help us to save the items without externally reading them. Read more about the API here: https://doc.scrapy.org/en/latest/topics/api.html#scrapy.crawler
* Settings: settings of the crawl. In this case the output file will be written into a .json file "result.json", the log is disabled (otherwise would have shown in the output), and a download delay is introduced to put less pressure on websites. Here one can see a list of all the settings to play around with: https://doc.scrapy.org/en/latest/topics/settings.html#topics-settings-ref


In [3]:
#needed to run Scrapy API
from scrapy.settings import Settings
from scrapy.crawler import CrawlerProcess, Crawler
from scrapy import signals

In [None]:
#store reviews in this list
crawled_reviews = []

#add item to items list when signaled
def add_item(item):
    crawled_reviews.append(item)

#intialize settings
settings = Settings()
settings.set("FEED_FORMAT",'json')
settings.set("FEED_URI",'result.json')
settings.set("LOG_ENABLED",False)
settings.set("DOWNLOAD_DELAY",1)
#settings.set("LOG_FILE",'logfile.txt')

#initiliaze process
process = CrawlerProcess(settings)
#create crawler. Needed to collect items in code
crawler = Crawler(YelpSpider, settings)
#when item is collected, add it to list
crawler.signals.connect(add_item, signals.item_scraped)
#add crawler to process
process.crawl(crawler)
#start process
process.start()

Just for illustration the reviews were exported into a .json file. But let's take a look at the first review that was stored in the crawled_reviews list:

In [7]:
print crawled_reviews[0]

{'author': 'Ana G.',
 'date': '2016-09-01',
 'rating': '5.0',
 'text': u"So after waiting so long to come to this wonderful establishment I am glad to say I finally made it!  I ended up paying this place a visit over the weekend at around 3 pm when the lunch crowds had dissipated and the dinner crowd was yet to show up.   Came here with hubby and the in-laws when they were visiting from GA. Anyway I was initially overwhelmed by their menu but we started with the empanadas and fungos (mushrooms). For mains, my husband had the seafood soup and I had the Argentinian salad with filet. \n\nThe empanadas were delicious, and were highly complimented by the various chimichurris they had. Chicken went really great with ajo (garlic sauce)) and the beef went great with the regular chimichurri. \n\nMy salad was excellent, it had greens, tomatos, mushrooms, onion, bell peppers, and some other stuff I dont remember along with a perfectly cooked filet. It was beautiful, it was tasty, and it was a gre

### Important Note

Scrapy runs inside the Twisted asynchronous networking library, which is a bit tricky to deal with especially when running Scrapy from a script. Even following the official documentation and different posts on Stackoverflow, stopping the reactor for later crawls was not accomplished. It's worth saying that if one were to start the crawls simultaneously this would not be a problem. 

To keep things simple, please **restart** the kernel and output to shut off the Twisted reactor and avoid errors in the following part. Also, make sure to **run the cells that do imports**.

## Application: Online Shopping - Nike Shoes

So far Scrapy really hasn't brought much to the table. Let's see how we can use Scrapy for more demanding tasks. 

Let's say one is looking for shoes in the Nike.com website. It is painful having  to click at every link to see if one may like the shoe and at the end they may not even have the right size. This example will create a spider that will search through every shoe in the Men Basketball Shoes section and export the results to a .csv file. Using Beatiful Soup as before, the following information will be collected from every shoe:
* Name
* Size (the desired size is specified by user inside spider)
* Price
* Rating
* Number of reviews
* Reviews
* Colors available
* Link

Just for fun, the shoes in the Men Running Shoes section will be scraped as well to make some illustrations.

### Shoe Item

In [4]:
class Shoe_Item(scrapy.Item):
    name = scrapy.Field() 
    size = scrapy.Field() 
    price = scrapy.Field()
    colors = scrapy.Field()
    rating = scrapy.Field() #rating (1-5)
    reviews_count = scrapy.Field() #number of reviews
    reviews = scrapy.Field() #individual reviews
    link = scrapy.Field()

### ShoeSpider

In [5]:
class ShoeSpider(scrapy.Spider):
    name = "Shoe_Spider"
    shoe_size = 10  #int or floating point ending in .5
    start_urls = [
    'http://store.nike.com/us/en_us/pw/mens-basketball-shoes/7puZ8r1Zoi3',
    'http://store.nike.com/us/en_us/pw/mens-running-shoes/7puZ8yzZoi3?ipp=120'
]
     
    #get HTTP request for every link in start_urls
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.errback_httpbin, meta={'dont_obey_robotstxt':True})

    #obtain link from all shoes and pass them one at a time for parsing
    def parse(self, response):
        #create soup and get all links
        main_soup = BeautifulSoup(response.text, 'lxml')        
        all_shoes = main_soup.find_all('div',class_="grid-item-box")
        for element in all_shoes:
            shoe_link = element.a['href']
            yield scrapy.Request(shoe_link, callback=self.parse_shoe, errback=self.errback_httpbin,meta={'dont_obey_robotstxt':True}, dont_filter=True)
    
    #parse individual shoes
    def parse_shoe(self, response):
        #create soup of individual shoe
        soup = BeautifulSoup(response.text, 'lxml') 
        #create shoe item
        shoe = Shoe_Item()
        
        #see if shoe size is unavailable. If unavailable, don't check anything else
        not_avail = soup.find_all("option", class_="exp-pdp-size-not-in-stock selectBox-disabled")
        if(float(self.shoe_size) in (float(element.text) for element in not_avail)):
            shoe['size'] = None
        else: 
            #see if shoe size exist for the shoe
            if(type(self.shoe_size) is float and soup.find(attrs={"data-label": "(%f)"%self.shoe_size}) is not None):
                shoe['size'] = self.shoe_size
            elif(type(self.shoe_size) is int and soup.find(attrs={"data-label": "(%d)"%self.shoe_size}) is not None):
                shoe['size'] = self.shoe_size
            #get name
            try:
                shoe['name'] = soup.h1.text
            except:
                shoe['name'] = None

            #see if colors are available
            try:
                shoe['colors'] = [element.get('alt') for element in soup.find("div",class_="color-chips").find_all("img")]
            except:
                shoe['colors'] = None

            #get price
            try:
                shoe['price'] = soup.find(itemprop='price').text 
            except:
                shoe['price'] = None

            #check if shoe has yet been rated. Collect rating and reviews, 0 otherwise
            try:
                shoe['rating'] = float(soup.find(itemprop='ratingValue')['content'][:4])
                shoe['reviews_count'] = int(soup.find(itemprop='reviewCount')['content'])
                shoe['reviews'] = [review.text for review in soup.find_all("div", class_="reviewText")]
            except:
                shoe['rating'] = 0
                shoe['reviews_count'] = 0
                shoe['reviews'] = 0
            #get shoe link
            shoe['link'] = response.url
        yield shoe

    def errback_httpbin(self, failure):
        # logs failures
        self.logger.error(repr(failure))
        if failure.check(HttpError):
            response = failure.value.response
            self.logger.error("HttpError occurred on %s", response.url)
        elif failure.check(DNSLookupError):
            request = failure.request
            self.logger.error("DNSLookupError occurred on %s", request.url)
        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error("TimeoutError occurred on %s", request.url)

##### Spider Item Pipeline
For this task we will introduce the Scrapy Item Pipeline. This functionality will help to process the collected items. For example, in the pipeline one may decide to keep only the keywords of the reviews. In this example, we will use the pipeline to simply drop shoes without enough reviews (min of 10) or rated too low (below 4/5), duplicates, and shoes that don't have the desired size available.

Typically you would add the pipeline to a file in the project directory. Here, instead, the settings in the crawler will be adjusted accordingly.

In [6]:
#supports dropping items
from scrapy.exceptions import DropItem

class ShoePipeline(object):
    #set to keep track of unique elements
    def __init__(self):
        self.ids_seen = set()
        self.reviews_count_min = 10
        self.rating_min = 4

    def process_item(self, item, spider):
        #removes element if already seen
        if item['name'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(item['name']) #add to set if not seen
        #drops item if conditions not met
        if(item['size'] is None):
            raise DropItem("Shoe %s does not have your size" % item)
        elif(item['rating'] < self.rating_min):
            raise DropItem("Shoe %s is rated below 4" % item)
        elif(item['reviews_count'] < self.reviews_count_min):
            raise DropItem("Shoe %s does not enough reviews" % item)
        else:
            return item

This time around, the settings will be adjusted to create a text file "logfile.txt" for logging and to export the results in the file "shoes.csv".

In [7]:
#intialize settings
settings = Settings()
#output to csv file
settings.set("FEED_FORMAT",'csv')
settings.set("FEED_URI",'shoes.csv')
settings.set("DOWNLOAD_DELAY",1)
#log in text file
settings.set("LOG_FILE",'logfile.txt')

#add ShoePipeline. The number (100 in this case) just indicates order but there
#are no more pipelines in this case. Find more info here: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
settings.set('ITEM_PIPELINES', {
    '__main__.ShoePipeline': 100
})

#initiliaze process and crawler
process = CrawlerProcess(settings)
crawler = Crawler(ShoeSpider, settings)
#add crawler to process
process.crawl(crawler)
#start process
process.start()

Now, let's take a look at the summary of statistics of the crawl:

In [8]:
for k, v in crawler.stats.get_stats().iteritems():
    print k, v

log_count/INFO 14
downloader/response_count 224
downloader/response_bytes 7514424
finish_reason finished
item_dropped_count 101
item_dropped_reasons_count/DropItem 101
log_count/ERROR 54
spider_exceptions/ValueError 23
log_count/DEBUG 250
scheduler/dequeued 224
request_depth_max 1
start_time 2016-11-02 23:19:28.326000
downloader/request_method_count/GET 224
downloader/request_bytes 104462
downloader/response_status_count/200 180
response_received_count 180
scheduler/enqueued/memory 224
finish_time 2016-11-02 23:23:59.615000
item_scraped_count 25
scheduler/dequeued/memory 224
scheduler/enqueued 224
downloader/request_count 224
downloader/response_status_count/301 44


Here is why there are some errors (found by looking through the log file): 
* There are some shoes linked that are "customizeable" and have different HTML layout. These shoes come in defined colors so they are accounted for ultimately.
* The Men Running Shoes that we added to scrape has shoes that are "unisex", which again vary in HTML layout. These shoes have list sizes for both Men and Women in a different format. 

It took less than 5 minutes to go through 182 websites for a final count of 25 shoes. The results now are conveniently stored in a CSV file that could simply be opened for inspection. If the results would have been more and of more complexity, one could have imported them for further analysis through the Pandas library, for example.

Below is a screenshot of the results:

<img src="output.png">

## Conclusion

Srapy provides cool features that are great if one has to repeatedly scrape a large number of websites, such as a summary of stats and error handling capabilities. It would be a great tool for someone who does business off of resaling Amazon or Ebay articles, for example. However, for someone who has does not scrape often, setting up the whole Scrapy environment is probably not worth the hassle. Instead, this person should consider simply using the Requests and Beatiful Soup libraries.

### Sources:

The Scrapy Documentation: https://doc.scrapy.org/en/latest/index.html

Stack Overflow: http://stackoverflow.com/questions/13437402/how-to-run-scrapy-from-within-a-python-script 

Scrapy Tutorial - Tutorials Point: https://www.tutorialspoint.com/scrapy/index.htm