[Scrapy](https://scrapy.org/) is an application framework for crawling websites and extracting structured data. It is widely used for data collection, data processing, and data mining.

## Installing Scrapy

Scrapy requires Python 2.7 or above. There are three common ways to install it; if Scrapy is not yet installed on your machine, the first option is recommended.  
1) You can use [Anaconda](https://www.continuum.io/) to install it (recommended option):

In [1]:
!conda install -c scrapinghub scrapy

Fetching package metadata .........
Solving package specifications: ..........

# All requested packages already installed.
# packages in environment at /Users/apple/anaconda2:
#
scrapy                    1.2.0                    py27_0    scrapinghub


2) You can also install Scrapy and its dependencies with [pip](https://pypi.python.org/pypi/pip):  
`pip install Scrapy`  

3) You can also use [virtualenv](https://virtualenv.pypa.io/en/stable/installation/) to install Scrapy so that you may avoid possible conflicts with your system packages.  

## A skim through Scrapy
With Scrapy installed, let's start working with it!

### Create a project
The first step is to create a new Scrapy project. The cell below first removes any existing `demo` directory so that it can be re-run safely.

In [2]:
!rm -rf demo || true
!scrapy startproject demo

New Scrapy project 'demo', using template directory '/Users/apple/anaconda2/lib/python2.7/site-packages/scrapy/templates/project', created in:
    /Users/apple/Desktop/PDS/tutorial/demo

You can start your first spider with:
    cd demo
    scrapy genspider example example.com


The command above creates a directory named `demo` with the following structure:  

```
demo
├── demo
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
```
### Create an infant spider
A "spider" is a class that defines which requests to make and how to parse the pages that come back. It must extend the `scrapy.Spider` class and be stored under the project's `demo/demo/spiders` directory.

Below is a simple example of a spider. Suppose we want to look at recent updates about Google products: we send a request to the products page of the Google blog and, for simplicity, save the entire contents of that page to `Google-products.html`. This example is only meant to introduce the idea of a spider.

In [3]:
import scrapy

class SimpleSpider(scrapy.Spider):
    name = "simplespider" # name must be unique within project.
    
    def start_requests(self):
        urls = [
            'https://blog.google/products/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
            
    def parse(self, response):
        category = response.url.split('/')[-2]
        file_name = 'Google-%s.html' % category
        with open(file_name, 'wb') as f:
            f.write(response.body) # simply record all the contents of page
        self.log('%s is saved.' % file_name)

For the reader's convenience, the code above is already saved in a Python file, `SimpleSpider.py`, so simply run the command below to copy it to where it belongs in the project folder.

In [4]:
%cp SimpleSpider.py demo/demo/spiders
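
As a quick aside, the output file name in `parse()` comes from the second-to-last piece of the URL once it is split on `/`. The sketch below repeats that logic outside Scrapy; `output_name` is a hypothetical helper used only for illustration, not part of the spider. It also shows that the approach depends on the trailing slash in the URL.

```python
def output_name(url, prefix='Google-'):
    # Same logic as parse(): take the second-to-last piece after splitting on '/'.
    category = url.split('/')[-2]
    return '%s%s.html' % (prefix, category)

# With a trailing slash, the last piece is '' and the one before it is the category.
print(output_name('https://blog.google/products/'))   # Google-products.html

# Without the trailing slash, the same index picks up the host name instead.
print(output_name('https://blog.google/products'))    # Google-blog.google.html
```

If the start URL might lack a trailing slash, stripping it first (for example with `url.rstrip('/')` and index `-1`) would probably be a safer choice.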

### Run a spider
After creating the spider, we run it with the following commands. Note that `scrapy crawl` must be run from inside the project folder rather than from the current directory. Once the spider has finished, we change back to the previous directory so that we do not get lost.

In [5]:
%cd demo
!scrapy crawl simplespider
!cat Google-products.html | head -n 100 # show top 100 lines of extracted html page
%cd -

/Users/apple/Desktop/PDS/tutorial/demo
2016-11-02 02:18:59 [scrapy] INFO: Scrapy 1.2.0 started (bot: demo)
2016-11-02 02:18:59 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'demo.spiders', 'SPIDER_MODULES': ['demo.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'demo'}
2016-11-02 02:18:59 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-11-02 02:18:59 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloader

### Simplify a spider class
In fact, we can spare ourselves the effort of implementing `start_requests()`: instead, we define a `start_urls` class attribute, which is a list of URLs. This works because Scrapy's default `start_requests()` generates a request for each URL in `start_urls`, and `parse()` is Scrapy's default callback method.

In [6]:
import scrapy

class SimplifiedSimpleSpider(scrapy.Spider):
    name = "simplifiedsimplespider" # name must be unique within project.
    
    # remove start_requests(), use start_urls instead
    start_urls = [
        'https://blog.google/topics/',
    ]
            
    def parse(self, response):
        category = response.url.split('/')[-2]
        file_name = 'Google-simplified-%s.html' % category
        with open(file_name, 'wb') as f:
            f.write(response.body)
        self.log('%s is saved.' % file_name)
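
To see why dropping `start_requests()` works, here is a rough mental model written in plain Python. `MiniSpider` and `TopicsSpider` are hypothetical stand-ins; this is not Scrapy's actual implementation, only a sketch of the idea that the base class turns each entry of `start_urls` into a request whose callback is `parse()`.

```python
class MiniSpider(object):
    """Toy stand-in for scrapy.Spider (an assumption, not Scrapy's real code)."""
    start_urls = []

    def start_requests(self):
        # Default behaviour: one (url, callback) pair per start URL,
        # with parse() as the callback.
        for url in self.start_urls:
            yield (url, self.parse)

    def parse(self, response):
        raise NotImplementedError


class TopicsSpider(MiniSpider):
    # Only the URLs and parse() are defined; start_requests() is inherited.
    start_urls = ['https://blog.google/topics/']

    def parse(self, response):
        return 'parsed %s' % response


spider = TopicsSpider()
requests = list(spider.start_requests())
url, callback = requests[0]
print(url)             # https://blog.google/topics/
print(callback(url))   # parsed https://blog.google/topics/
```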

In [7]:
%cp SimplifiedSimpleSpider.py demo/demo/spiders
%cd demo
!scrapy crawl simplifiedsimplespider
!cat Google-simplified-topics.html | head -n 100 # show top 100 lines of extracted html page
%cd -

/Users/apple/Desktop/PDS/tutorial/demo
2016-11-02 02:19:01 [scrapy] INFO: Scrapy 1.2.0 started (bot: demo)
2016-11-02 02:19:01 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'demo.spiders', 'SPIDER_MODULES': ['demo.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'demo'}
2016-11-02 02:19:01 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-11-02 02:19:01 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloader

### Use a spider to do something more complicated
So far, our spider has only fetched whole pages for us; we have not done anything specific with the pages we retrieved. To make the spider do something more interesting, let's start by modifying the `parse(response)` method.  
In the example below, we crawl the website of the National Park Service, which has comprehensive information about all the national parks in the United States. We want the names (and short descriptions) of all national parks located in Alaska (AK) and California (CA). We first send requests to the pages listing the national parks in Alaska and California; we then analyze the structure of those two pages and use simple CSS selectors to extract the information we want.

In [8]:
import scrapy

class FindParkSpider(scrapy.Spider):
    name = "findparkspider"
    
    # Listing pages for the two states of interest: Alaska and California.
    start_urls = [
        'https://www.nps.gov/state/ak/index.htm',
        'https://www.nps.gov/state/ca/index.htm',
    ]

    def parse(self, response):
        for r in response.css('div#parkListResultsArea'):
            yield {
                'park_name': r.css('h3 a::text').extract(),
                'park_description': r.css('div.list_left p::text').extract(),
            }

After extracting the information we are interested in, we store it in JSON format using the following command:  

In [9]:
%cp FindParkSpider.py demo/demo/spiders
%cd demo
!scrapy crawl findparkspider -o parks-ak-ca.json
!cat parks-ak-ca.json | head -n 100 # show top 100 lines of extracted json file
%cd -

/Users/apple/Desktop/PDS/tutorial/demo
2016-11-02 02:19:03 [scrapy] INFO: Scrapy 1.2.0 started (bot: demo)
2016-11-02 02:19:03 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'demo.spiders', 'FEED_URI': 'parks-ak-ca.json', 'SPIDER_MODULES': ['demo.spiders'], 'BOT_NAME': 'demo', 'ROBOTSTXT_OBEY': True, 'FEED_FORMAT': 'json'}
2016-11-02 02:19:03 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-11-02 02:19:03 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.

By appending the `-o` parameter to the command, we get an output file (here `parks-ak-ca.json`) that contains all the scraped data in JSON format.  
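
Because `parse()` yields one dictionary per state page, each holding a list of names and a list of descriptions, the exported JSON is a list of such objects. The sketch below pairs each name with its description. It assumes the two lists line up one-to-one, which may not hold on the real pages. The sample data is illustrative only (the descriptions are placeholders, not crawl output), and `flatten` is a hypothetical helper.

```python
import json

# Illustrative sample in the shape the spider yields: one object per state page,
# each holding parallel lists. Descriptions are placeholders, not real crawl output.
sample = '''
[
  {"park_name": ["Denali", "Katmai"],
   "park_description": ["placeholder description 1", "placeholder description 2"]},
  {"park_name": ["Yosemite"],
   "park_description": ["placeholder description 3"]}
]
'''

def flatten(pages):
    # Turn the per-page parallel lists into one record per park.
    rows = []
    for page in pages:
        for name, description in zip(page['park_name'], page['park_description']):
            rows.append({'park_name': name, 'park_description': description})
    return rows

rows = flatten(json.loads(sample))
for row in rows:
    print(row['park_name'])
```

To apply the same idea to the real export, the sample string could be replaced by the contents of `parks-ak-ca.json` read with `json.load`.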

### Extend the reach of the spider to the "next" page
In most cases, extracting one or two pages is not enough. If a page links to the next page, the spider can follow that link and keep going.  
In the following example, we send our spider to Yelp to find out which restaurants are popular in Pittsburgh. We ask the spider to work on the page that lists popular restaurants ranked from high to low, and we extract the name, average rating, and review count of each listed restaurant. One page is clearly not enough to hold all the popular restaurants in Pittsburgh, so how can we find them all? We can take advantage of the "next" anchor at the end of each page. The example below shows how the spider finds its next destination. **Note that when you crawl some websites, you may run into a problem related to the robots.txt setting. To get around it, edit the project settings (in this case `demo/demo/settings.py`) and set `ROBOTSTXT_OBEY = False` instead of the default `True`. Also note that this process can take up to 3 minutes, since there are so many next pages.**

In [17]:
import scrapy

class YelpSpider(scrapy.Spider):
    name = "yelpspider"
    
    start_urls = [
        'https://www.yelp.com/search?find_desc=Restaurants&find_loc=Pittsburgh,+PA&start=0',
    ]

    def parse(self, response):
        for r in response.css('ul.ylist.ylist-bordered.search-results'):
            yield {
                'restaurant_name': r.css('span.indexed-biz-name a.biz-name.js-analytics-click span::text').extract(),
                'rating': [float(x.split(' ')[0]) for x in r.css('div.rating-large i.star-img::attr(title)').extract()],
                'review_count': [int(x.strip().split(' ')[0]) for x in r.css('span.review-count::text').extract()],
            }
        next_page = response.css('a.u-decoration-none.next.pagination-links_anchor::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
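
The list comprehensions in `parse()` turn the text that the selectors return into numbers. The sketch below isolates that step. The sample strings are assumptions about what the page text looked like (a rating title such as `4.5 star rating` and a review count padded with whitespace), and the two helper names are hypothetical.

```python
def parse_rating(title):
    # '4.5 star rating' -> 4.5 (the number is the first space-separated token)
    return float(title.split(' ')[0])

def parse_review_count(text):
    # Strip surrounding whitespace first, then take the leading number.
    return int(text.strip().split(' ')[0])

print(parse_rating('4.5 star rating'))            # 4.5
print(parse_review_count('\n    297 reviews\n'))  # 297
```

Both helpers raise `ValueError` if the text does not start with a number, so a real spider might want to guard against entries that have no rating yet.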

In [25]:
%cp YelpSpider.py demo/demo/spiders
%cd demo
!scrapy crawl yelpspider -o yelp-pittsburgh-restaurant.json
!cat yelp-pittsburgh-restaurant.json | head -n 100 # show top 100 lines of extracted json file
%cd -

/Users/apple/Desktop/PDS/tutorial/demo
2016-11-02 12:45:46 [scrapy] INFO: Scrapy 1.2.0 started (bot: demo)
2016-11-02 12:45:46 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'demo.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['demo.spiders'], 'FEED_URI': 'yelp-pittsburgh-restaurant.json', 'BOT_NAME': 'demo'}
2016-11-02 12:45:46 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-11-02 12:45:46 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMid

The `urljoin()` method above builds a full absolute URL, and a new request for the next page is then yielded.  
Here is how Scrapy follows links: when you yield a `Request` in a callback method, Scrapy schedules that request to be sent and registers a callback to be executed when the request finishes.
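
The loop below is a toy model of that mechanism, using only the standard library. It is not how Scrapy's engine is implemented. `FAKE_SITE`, `parse`, and `crawl` are hypothetical names, and the in-memory "site" stands in for real HTTP responses. A callback yields either an item (a dict) or the absolute URL of the next page, and the loop schedules every new URL until no "next" link remains.

```python
from collections import deque

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

# Hypothetical in-memory site: URL -> (names on the page, relative "next" href or None)
FAKE_SITE = {
    'https://example.com/search?start=0': (['A', 'B'], '/search?start=10'),
    'https://example.com/search?start=10': (['C', 'D'], '/search?start=20'),
    'https://example.com/search?start=20': (['E'], None),
}

def parse(url):
    names, next_href = FAKE_SITE[url]
    yield {'names': names}                 # an extracted item
    if next_href is not None:
        yield urljoin(url, next_href)      # a "request" for the next page

def crawl(start_url):
    queue = deque([start_url])             # scheduled requests
    seen = set(queue)                      # avoid visiting a URL twice
    items = []
    while queue:
        url = queue.popleft()
        for result in parse(url):          # run the callback on the "response"
            if isinstance(result, dict):
                items.append(result)
            elif result not in seen:
                seen.add(result)
                queue.append(result)
    return items

items = crawl('https://example.com/search?start=0')
print(items)
```

The `urljoin` call plays the role of `response.urljoin()`: it resolves the relative `href` against the URL of the page it was found on.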

### Run Scrapy using a python script
Besides the typical way of running Scrapy via `scrapy crawl`, we can also run Scrapy from a script, which is essential for showing our work inside this notebook.  
Because Scrapy is built on the Twisted asynchronous networking library, we need to run it inside the Twisted reactor.
Let's start with a single spider.  
**Please note that since the Twisted reactor is not restartable, you need to relaunch the Jupyter notebook if you want to run the following script a second time. You are also welcome to find other ways to tackle this.**
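
One possible workaround, offered as a sketch rather than as the recommended approach: run each crawl in its own child process, so that every run starts with a fresh reactor. The helper `run_in_fresh_process` and the one-line stand-in script are hypothetical; in practice the child would run a file containing the `CrawlerProcess` code from the next cell.

```python
import subprocess
import sys

def run_in_fresh_process(script_source):
    """Run Python source in a child interpreter and return its stdout as text."""
    output = subprocess.check_output([sys.executable, '-c', script_source])
    return output.decode('utf-8')

# Stand-in for a crawl script (an assumption for illustration only).
crawl_script = "print('crawl finished')"

# Running it twice works, because each run gets a brand-new process.
for attempt in range(2):
    print(run_in_fresh_process(crawl_script).strip())
```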

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess

class ScriptRunSpider(scrapy.Spider):
    name = "scriptrunspider"
    
    def start_requests(self):
        urls = [
            'https://www.yelp.com/search?find_desc=Restaurants&find_loc=Pittsburgh,+PA&start=0',
            'https://www.yelp.com/search?find_desc=Restaurants&find_loc=Pittsburgh,+PA&start=10',
            'https://www.yelp.com/search?find_desc=Restaurants&find_loc=Pittsburgh,+PA&start=20',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
            
    def parse(self, response):
        page_num = response.url.split('/')[-1].split('=')[-1]
        file_name = 'script-yelp-pittsburgh-restaurants-%d.txt' % (int(page_num) // 10)
        with open(file_name, 'wb') as f:
            f.write(str(response.css('ul.ylist.ylist-bordered.search-results span.indexed-biz-name a.biz-name.js-analytics-click span::text').extract()))
        self.log('%s is saved.' % file_name)

process = CrawlerProcess({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})

process.crawl(ScriptRunSpider)
process.start(stop_after_crawl=True)

2016-11-02 13:28:06 [scrapy] INFO: Scrapy 1.2.0 started (bot: scrapybot)
2016-11-02 13:28:06 [scrapy] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2016-11-02 13:28:06 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-11-02 13:28:06 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewar

<DeferredList at 0x10568a050 current result: []>

In [2]:
!cat script-yelp-pittsburgh-restaurants-0.txt
!cat script-yelp-pittsburgh-restaurants-1.txt
!cat script-yelp-pittsburgh-restaurants-2.txt

[u'Gaucho Parrilla Argentina', u't\xe4k\u014d', u'Las Palmas', u'Butcher and the Rye', u'Altius', u'Umbrella Cafe', u'The Twisted Frenchman', u'Legume', u'Smallman Galley', u'The Foundry Table & Tap'][u'Las Palmas Carniceria', u'Smallman Galley', u'Cop Out Pierogies', u'Eleven', u'B52 Cafe', u'Meat & Potatoes', u'Noodlehead', u'The Foundry Table & Tap', u'Cure', u'Penn Ave Fish Company'][u'Point Brugge Caf\xe9', u'Bakersfield Penn Ave', u'Alla Famiglia', u'Cafe\u2019 33', u'Apteka', u'Edgar Tacos Stand', u'Amazing Cafe', u'Proper Brick Oven & Tap Room', u'Cafe Du Jour', u'Peppi\u2019s']
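
The `parse()` method of `ScriptRunSpider` reads the page offset by splitting the URL on `=` and taking the last piece, which only works while `start` is the final query parameter. A slightly more defensive sketch uses the standard library's URL parser. `page_index` and its `page_size` default of 10 are assumptions for illustration (10 matches the step between the `start` values in the URLs above).

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2

def page_index(url, page_size=10):
    # Look up the 'start' parameter by name instead of by position.
    query = parse_qs(urlparse(url).query)
    start = int(query.get('start', ['0'])[0])
    return start // page_size

url = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=Pittsburgh,+PA&start=20'
print(page_index(url))  # 2

# The order of the parameters no longer matters.
print(page_index('https://www.yelp.com/search?start=10&find_desc=Restaurants'))  # 1
```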

### Run multiple spiders in the same process
Above, we ran a single spider per process; now we will run multiple spiders in the same process.  

The example below runs multiple spiders at the same time. **Also, please note that if you want to execute the following script after running the code above, you need to relaunch the Jupyter notebook first, since the reactor is not restartable.**

In [1]:
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class Spider1(scrapy.Spider):
    name = "spider1" # name must be unique within project.
    
    def start_requests(self):
        urls = [
            'https://www.yelp.com/search?find_desc=Restaurants&find_loc=Pittsburgh,+PA&start=0',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
            
    def parse(self, response):
        page_num = response.url.split('/')[-1].split('=')[-1]
        file_name = 'script-parallel-spider1.txt'
        with open(file_name, 'wb') as f:
            f.write(str(response.css('ul.ylist.ylist-bordered.search-results span.indexed-biz-name a.biz-name.js-analytics-click span::text').extract()))
        self.log('%s is saved.' % file_name)
        
class Spider2(scrapy.Spider):
    name = "spider2" # name must be unique within project.
    
    def start_requests(self):
        urls = [
            'https://www.yelp.com/search?find_desc=Restaurants&find_loc=Pittsburgh,+PA&start=10',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
            
    def parse(self, response):
        page_num = response.url.split('/')[-1].split('=')[-1]
        file_name = 'script-parallel-spider2.txt'
        with open(file_name, 'wb') as f:
            f.write(str(response.css('ul.ylist.ylist-bordered.search-results span.indexed-biz-name a.biz-name.js-analytics-click span::text').extract()))
        self.log('%s is saved.' % file_name)

configure_logging()
runner = CrawlerRunner()
runner.crawl(Spider1)
runner.crawl(Spider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()

2016-11-02 14:59:11 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-11-02 14:59:11 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-11-02 14:59:11 [scrapy] INFO: Enabled sp

In [2]:
!cat script-parallel-spider1.txt
!cat script-parallel-spider2.txt

[u'Gaucho Parrilla Argentina', u't\xe4k\u014d', u'Las Palmas', u'Butcher and the Rye', u'Altius', u'Umbrella Cafe', u'The Twisted Frenchman', u'Legume', u'Smallman Galley', u'The Foundry Table & Tap'][u'Doce Taqueria', u'Meat & Potatoes', u'Eleven', u'Noodlehead', u'Cop Out Pierogies', u'Penn Ave Fish Company', u'Cure', u'Bakersfield Penn Ave', u'Las Palmas Carniceria', u'Ting\u2019s Kitchen']
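
The files written by these spiders contain the `str()` of a Python 2 list, which is why the names appear as `u'...'` with escape sequences. If the names are needed later as a real list, one option is `ast.literal_eval` from the standard library. The sketch below uses the first three entries of the output above as sample input.

```python
import ast

# First three entries of the saved text shown above, kept as a raw string so the
# backslash escapes stay literal, exactly as they appear in the file.
saved_text = r"[u'Gaucho Parrilla Argentina', u't\xe4k\u014d', u'Las Palmas']"

names = ast.literal_eval(saved_text)  # parses the text back into a list of strings
print(len(names))   # 3
print(names[0])     # Gaucho Parrilla Argentina
```

Writing the list with `json.dumps` in the spider instead of `str()` would likely make the files easier to read back, at the cost of changing the saved format shown here.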

### Resources

If you are interested in learning more about Scrapy, you can refer to the following site. Some of the content covered in this material was also inspired by this documentation.

1. Official Scrapy documentation: https://doc.scrapy.org/en/latest/index.html