### Introduction ###

###### Some tips:

(*This tutorial switches between several files. If you want to try it yourself, you can create all the files from scratch by following the steps below. If you do not need to run it, you can simply open the 'scrapy_tutorial' directory that I have already created to see the details. Good luck!*)

This tutorial introduces **Scrapy**, a free and open-source web crawling framework.

Web crawling is often the first step of data analysis. Previously, we could do this job with packages such as urllib2 and BeautifulSoup. However, BeautifulSoup is only a parsing library: it parses content that has already been fetched from a URL and extracts certain parts of it. It cannot crawl on its own unless you write the crawling loop yourself.

Compared with BeautifulSoup, Scrapy is a complete framework: you give it a root URL and then specify how many URLs you want to crawl.
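
To make the "loop" mentioned above concrete, here is a minimal sketch of a hand-written crawl loop. It is not how Scrapy works internally. The `FAKE_SITE` dictionary and the `crawl` function are hypothetical names invented for this illustration, and the "site" lives in memory so the sketch runs without any network access.

```python
from collections import deque

# Hypothetical in-memory "website": each URL maps to the links found on that page.
# A real crawler would download each page and parse the links out of the HTML.
FAKE_SITE = {
    "/": ["/academics", "/research"],
    "/academics": ["/academics/programs", "/"],
    "/research": ["/"],
    "/academics/programs": [],
}

def crawl(root_url, max_pages):
    """Breadth-first crawl from root_url, visiting at most max_pages pages."""
    seen = {root_url}            # URLs already queued, so no page is visited twice
    frontier = deque([root_url])
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in FAKE_SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

print(crawl("/", max_pages=3))
```

A framework like Scrapy takes care of this bookkeeping (the queue, the duplicate check, the page limit) for you, which is the main difference from using a parser alone.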

Scrapy is written in Python. You can use it to scrape the web or to extract data through APIs. It is becoming more and more popular these days. Let's start!

### Tutorial content ###
In this tutorial, we will show a brief step-by-step example of using Scrapy to scrape our school website http://www.cmu.edu/.


We will cover the following topics in this tutorial:
* [Installation](#1)
* [Start a New Scrapy Project](#2)
* [Create the first Spider](#3)
* [Define Item](#4)
* [Extract Data using Xpath](#5)
   * [Extract html using scrapy shell](#5.1)
   * [Extract html by function](#5.2)
* [Improve the xpath search](#6)
   * [Using Rules](#6.1)
   * [Define attribute in the selector](#6.2)
* [Store data in JSON form](#7)
* [Item pipelines](#8)
   * [Pipeline of drop and change data](#8.1)
   * [Pipeline of storing data](#8.2)
   * [Pipeline of dropping duplicate data](#8.3)
   * [Activate the pipelines](#8.4)






<h3 id="1">1 Installation</h3>

If you have Python 2.7 or above installed, you can install Scrapy and its dependencies from PyPI with:

            

In [None]:
 $ pip install Scrapy

In [12]:
### run the above command line (the commands module is Python 2 only)
import commands
(status, output) = commands.getstatusoutput('pip install Scrapy')
print output


You are using pip version 8.1.2, however version 9.0.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
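
The `commands` module used in the cell above exists only in Python 2. If you run this notebook on Python 3 (3.5 or later), a similar helper can be sketched with the standard `subprocess` module. The function name `run_command` is an assumption made for this example, and the demo call runs a harmless Python one-liner instead of `pip`.

```python
import subprocess
import sys

def run_command(args):
    """Run a command given as a list of arguments; return (exit status, output)."""
    completed = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # merge stderr into stdout, like commands.getstatusoutput
        universal_newlines=True,    # decode the output bytes to text
    )
    return completed.returncode, completed.stdout.strip()

# Harmless demo command. For this tutorial you would pass, for example,
# ["pip", "install", "Scrapy"] or ["scrapy", "crawl", "cmu"] instead.
status, output = run_command([sys.executable, "-c", "print('hello from a subprocess')"])
print(status, output)
```

Passing the command as a list avoids shell quoting problems, which matters later when a URL is passed to `scrapy shell`.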


<h3 id="2">2 Start a New Scrapy Project</h3>

Before you start scraping, you need to create a new Scrapy project. Enter a directory where you would like to store your code and run:
            

In [None]:
 $ scrapy startproject scrapy_tutorial

In [27]:
import commands
### run the above command line
(status, output) = commands.getstatusoutput('scrapy startproject scrapy_tutorial')
print output

New Scrapy project 'scrapy_tutoria', using template directory '//anaconda/lib/python2.7/site-packages/scrapy/templates/project', created in:
    /Users/star/Desktop/CMU/Practical data science/tutorial/scrapy_tutorial/scrapy_tutoria

You can start your first spider with:
    cd scrapy_tutoria
    scrapy genspider example example.com


The script creates a directory like this:
    

In [None]:
scrapy_tutorial/
    scrapy.cfg
    scrapy_tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

<h3 id="3">3 Create the first Spider</h3>

Spiders are classes that you define. They subclass scrapy.Spider and define how to follow links and how to parse page content to extract data.

The spider subclass defines some attributes and methods, one of which is the parse() method.
It parses the response, extracts data, and finds new URLs to create new requests from.

Under the **scrapy_tutorial/spiders** directory in your project, create a new file. In this example, we create a new file named **"cmuspider.py"** to hold our first spider. You can also put this notebook under the scrapy_tutorial/ directory to run the commands from it.

In this example, we start from the school's interdisciplinary programs page and crawl all the program titles and links, as an exercise to show you how to use Scrapy.

In the cmuspider.py file, we can write our spider like this:



In [3]:
import scrapy

class CmuSpider(scrapy.Spider):
    name = "cmu"                     # the name used by "scrapy crawl cmu"
    allowed_domains = ["cmu.edu"]    # do not follow links outside this domain

    def start_requests(self):
        # We start from the CMU interdisciplinary programs page.
        start_urls = ['http://www.cmu.edu/academics/interdisciplinary-programs.html']
        # If there are several start URLs, the loop requests each of them.
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Use the last part of the URL as the file name.
        page = response.url.split("/")[-1]
        with open(page, 'wb') as f:  # save the crawled HTML to a file
            f.write(response.body)


Then, after putting the above code into the spider file, we can get the HTML data!

Go to the scrapy_tutorial root directory and run:

In [None]:
    $ scrapy crawl cmu

In [3]:
### run the above command line
(status, output) = commands.getstatusoutput('scrapy crawl cmu')
print output

2016-11-04 15:20:20 [scrapy] INFO: Scrapy 1.2.0 started (bot: scrapy_tutorial)
2016-11-04 15:20:20 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_tutorial.spiders', 'SPIDER_MODULES': ['scrapy_tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'scrapy_tutorial'}
2016-11-04 15:20:20 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-11-04 15:20:20 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downl

After this, you have the whole HTML of the start URL. It is saved as "interdisciplinary-programs.html", as expected.

Part of the HTML file should look like this:

![](http://i.imgur.com/6jNaagt.png)

<h3 id="4">4 Define Item</h3>

So far we have only fetched the whole HTML page. But most of the time, the goal of scraping is to extract structured data from typically unstructured sources.

Scrapy provides the Item class to return data in a dictionary-like form. Unlike plain dicts, where it is easy to make a typo in a field name and end up with inconsistent data, a Scrapy Item only accepts the fields you declare, which makes it a better fit for big projects with many spiders.
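
The difference between a plain dict and a container with declared fields can be sketched without Scrapy. `StrictItem` below is a hypothetical stand-in written only for this illustration; it is not Scrapy's implementation, it merely imitates the way an Item rejects a field name that was never declared.

```python
class StrictItem(object):
    """Stand-in for an Item: accepts only the declared field names."""
    fields = ("program_title",)

    def __init__(self):
        self._values = {}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s does not support field: %s" % (type(self).__name__, key))
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]

plain = {}
plain["program_titel"] = "Program A"       # the typo goes unnoticed in a dict

strict = StrictItem()
strict["program_title"] = "Program A"
try:
    strict["program_titel"] = "Program A"  # the same typo is rejected immediately
    typo_caught = False
except KeyError:
    typo_caught = True
print(typo_caught)
```

With many spiders writing to the same fields, failing early on a misspelled name is usually cheaper than discovering an empty column in the exported data.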

In this example, we need to edit the **items.py** file under the scrapy_tutorial/ directory to define the fields we want from the CMU programs page: program_title and program_link.

In [None]:
# paste the following class into the file "scrapy_tutorial/items.py"
import scrapy

class CmuItem(scrapy.Item):
    program_title = scrapy.Field()
    program_link = scrapy.Field()  # named program_link to match the spider and pipelines

<h3 id="5">5 Extract data by Xpath</h3>


After defining the items, we need to consider how to extract the data that fills the item fields.
With BeautifulSoup, we navigate the HTML tree ourselves. In Scrapy, we use **XPath**, which uses path expressions to select nodes or node-sets in an XML or HTML document.
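
To get a feeling for path expressions before opening the Scrapy shell, here is a small sketch that uses Python's built-in `xml.etree.ElementTree`. This is an assumption-laden stand-in: ElementTree supports only a subset of XPath (for example, it has no `text()` or `@href` steps), and the `PAGE` snippet is made-up sample markup, not the real CMU page.

```python
import xml.etree.ElementTree as ET

# Small, well-formed snippet standing in for a downloaded page (hypothetical content).
PAGE = """
<html><body>
  <ul>
    <li><a href="/programs/a.html">Program A</a></li>
    <li><a href="/programs/b.html">Program B</a></li>
  </ul>
</body></html>
"""
root = ET.fromstring(PAGE)

# './/ul/li' selects every <li> that is a direct child of any <ul> in the tree.
rows = []
for li in root.findall(".//ul/li"):
    a = li.find("a")                     # the <a> child of this list item
    rows.append((a.text, a.get("href")))
print(rows)
```

In Scrapy the same selection is written as `response.xpath('//ul/li')`, and the text and link are read with the relative expressions `a/text()` and `a/@href`.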


<h4 id="5.1">5.1 Extract html by scrapy shell</h4>

Instead of writing a Python file, we can try things out in the **Scrapy shell** first. Open your terminal and pass the URL we want to parse:

In [None]:
$ scrapy shell "http://www.cmu.edu/academics/interdisciplinary-programs.html"

In [4]:
### run the above command line
(status, output) = commands.getstatusoutput('scrapy shell "http://www.cmu.edu/academics/interdisciplinary-programs.html"')
print output

2016-11-04 15:36:20 [scrapy] INFO: Scrapy 1.2.0 started (bot: scrapy_tutorial)
2016-11-04 15:36:20 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'SPIDER_MODULES': ['scrapy_tutorial.spiders'], 'BOT_NAME': 'scrapy_tutorial', 'LOGSTATS_INTERVAL': 0}
2016-11-04 15:36:20 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-11-04 15:36:20 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redire



Once we are in the **Scrapy shell**, we can use **XPath** to parse the HTML. For example, we can get all **li** elements using: **$ response.xpath('//li')**


The output would look like this:

![](http://i.imgur.com/TzW0ZBW.png)


Get the content of these elements by using:

In [None]:
$ response.xpath('//li').extract()

The output would look like this:

![](http://i.imgur.com/hPtme8X.png)

<h4 id="5.2">5.2 Extract html by function</h4>

Now we can change the parse function of the spider to get the program titles and links using **XPath**.

Change the parse function in the cmuspider.py file under the spiders/ directory to the following:

In [7]:
## change the parse function in cmuspider.py to
def parse(self, response):
    for sel in response.xpath('//ul/li'):
        title = sel.xpath('a/text()').extract()
        link = sel.xpath('a/@href').extract()
        print title, link  # print inside the loop so every list item is shown

In [None]:
Run it again:
    $ scrapy crawl cmu

In [9]:
### run the above command line
(status, output) = commands.getstatusoutput('scrapy crawl cmu')
print output

2016-11-04 16:18:56 [scrapy] INFO: Scrapy 1.2.0 started (bot: scrapy_tutorial)
2016-11-04 16:18:56 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_tutorial.spiders', 'SPIDER_MODULES': ['scrapy_tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'scrapy_tutorial'}
2016-11-04 16:18:56 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-11-04 16:18:56 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downl

From the output, we can see the titles and links, but we still need to return the extracted data as items. The Item class we defined earlier is a simple container for the collected data, and unlike a plain dict it guards against typos in field names.

The next step is to change the spider to return the data in Item fields:

In [13]:
# change the cmuspider.py file again to store item field by the following code:
import scrapy

from scrapy_tutorial.items import CmuItem  # don't forget to import CmuItem we defined before

class CmuSpider(scrapy.Spider):
    name = "cmu"
    allowed_domains = ["cmu.edu"]
    start_urls = [ "http://www.cmu.edu/academics/interdisciplinary-programs.html"]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = CmuItem()
            item['program_title'] = sel.xpath('a/text()').extract()
            item['program_link'] = sel.xpath('a/@href').extract()
            print "program_title", item['program_title'] # just print to make it clear, you can delete it too.
            print "program_link",item['program_link']
            yield item

In [None]:
Run it again:
    $ scrapy crawl cmu

In [14]:
### run the above command line
(status, output) = commands.getstatusoutput('scrapy crawl cmu')
print output


2016-11-04 17:16:42 [scrapy] INFO: Scrapy 1.2.0 started (bot: scrapy_tutorial)
2016-11-04 17:16:42 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_tutorial.spiders', 'SPIDER_MODULES': ['scrapy_tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'scrapy_tutorial'}
2016-11-04 17:16:42 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-11-04 17:16:42 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downl

<h3 id="6">6 Improve the xpath search</h3>

From the output, we can see that the extracted data still needs some work:

    a. The programs we get contain navigation entries from the website, such as "Centers & Institutes\n  "
    b. The programs we get contain footer text from the website, such as "CMU facebook".
    
Since we only need the pure program list, we need to improve our **XPath** expression.
Normally, there are several ways to do this:
<h4 id="6.1">6.1 Using [Rules](https://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules)</h4>



  

In [None]:
# example of using rules:
# Since our example only crawls the interdisciplinary-programs page, we do not need this;
# you can try this code by yourself if interested.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CmuSpider(CrawlSpider):
    name = 'cmu'
    allowed_domains = ['cmu.edu']
    start_urls = ['http://www.cmu.edu/academics/']
    rules = (
        # Follow links matching 'interdisciplinary-programs.html'
        # (but not those matching 'learning-for-a-lifetime.html')
        Rule(LinkExtractor(allow=(r'interdisciplinary-programs\.html', ),
                           deny=(r'learning-for-a-lifetime\.html', ))),  # trailing comma keeps rules a tuple
    )


<h4 id="6.2">6.2 Define attribute in the selector</h4>

  On the website, all the 'real' program titles and links are inside a div with class "". So we only need to change the selector line in the cmuspider.py file in the spiders/ directory:

In [None]:
# change the cmuspider.py file again to use a better selector with the following code:
for sel in response.xpath('//div[@class=""]/ul/li'):
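
The effect of adding an attribute predicate can be sketched with the standard library alone. Everything in the snippet below is an assumption made for illustration: the markup is invented, and the class names `nav`, `program-list` and `footer` are hypothetical placeholders, not the class names used on the CMU page. ElementTree is used here only because it understands the same `[@class='...']` predicate.

```python
import xml.etree.ElementTree as ET

# Invented page with a navigation block, a program list and a footer.
PAGE = """
<html><body>
  <div class="nav"><ul><li><a href="/centers/">Centers</a></li></ul></div>
  <div class="program-list"><ul>
    <li><a href="/programs/a.html">Program A</a></li>
    <li><a href="/programs/b.html">Program B</a></li>
  </ul></div>
  <div class="footer"><ul><li><a href="/social/">Social</a></li></ul></div>
</body></html>
"""
root = ET.fromstring(PAGE)

# Without a predicate, every list item on the page matches.
everything = [li.find("a").text for li in root.findall(".//ul/li")]

# With a predicate on the enclosing div, only the program list matches.
programs = [
    li.find("a").text
    for li in root.findall(".//div[@class='program-list']/ul/li")
]
print(everything)
print(programs)
```

The narrower expression drops the navigation and footer entries without any post-processing, which is the same idea as restricting the `div` class in the spider.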

In [None]:
Run it again:
    $ scrapy crawl cmu

In [15]:
### run the above command line
(status, output) = commands.getstatusoutput('scrapy crawl cmu')
print output

2016-11-04 17:55:50 [scrapy] INFO: Scrapy 1.2.0 started (bot: scrapy_tutorial)
2016-11-04 17:55:50 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_tutorial.spiders', 'SPIDER_MODULES': ['scrapy_tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'scrapy_tutorial'}
2016-11-04 17:55:50 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-11-04 17:55:50 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downl

<h3 id="7">7 Store data in JSON form</h3>


So, after we get the data, how can we store the items?

We can simply store the data as JSON. Enter the command:

In [None]:
$ scrapy crawl cmu -o programs.json

In [18]:
### run the above command line
(status, output) = commands.getstatusoutput('scrapy crawl cmu -o programs.json')
print output

2016-11-04 18:26:08 [scrapy] INFO: Scrapy 1.2.0 started (bot: scrapy_tutorial)
2016-11-04 18:26:08 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_tutorial.spiders', 'FEED_URI': 'programs.json', 'SPIDER_MODULES': ['scrapy_tutorial.spiders'], 'BOT_NAME': 'scrapy_tutorial', 'ROBOTSTXT_OBEY': True, 'FEED_FORMAT': 'json'}
2016-11-04 18:26:08 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-11-04 18:26:08 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retr

Then we get a JSON file named **programs.json** with all the program titles and links that we need!


![](http://i.imgur.com/KejvrvP.png)


<h3 id='8'>8 Item pipelines</h3>

Above, we actually used **Feed exports**, which rely on Item exporters to serialize the scraped data. This is easy, and we can also store data as XML, CSV, or JSON lines.

Besides **Feed exports**, Scrapy also provides **Item pipelines** for storing data. Each item pipeline is a Python class that implements a process_item method. We can also use pipelines to drop duplicate items or to change the data we get:

<h4 id='8.1'>8.1 Pipeline of drop and change data</h4>
Here is an example that changes the data by adding "CMU_program: " to every program title. Let's find out how a pipeline can do this task!


In [None]:
# Change your pipelines.py file to:

import json
from scrapy.exceptions import DropItem

class AddPipeline(object):
    def process_item(self, item, spider):
        prefix = "CMU_program: "  # the prefix we add before each program title
        if item['program_title']:
            if item['program_link']:
                item['program_title'][0] = prefix + item['program_title'][0]
            return item
        else:
            # items without a title are dropped
            raise DropItem("Missing program title in %s" % item)


<h4 id='8.2'>8.2 Pipeline of storing data</h4>
With pipelines, we can also process or store items in different ways:
* write them to a JSON file
* write them to a database
* take a screenshot of the item

Here is an example that stores items in a JSON lines file named 'cmu.jl':


In [None]:
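# add these to your pipelines.py file:
class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # called when the spider starts: open the output file
        self.file = open('cmu.jl', 'w')
    def close_spider(self, spider):
        # called when the spider finishes: close the file so all lines are flushed
        self.file.close()
    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"  # one JSON object per line
        self.file.write(line)
        return item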
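
A JSON lines file holds one JSON object per line, so it can be read back line by line. The sketch below shows one way to do that; `read_json_lines` is a hypothetical helper name, and the two sample lines are made-up values in the shape the pipeline writes, held in memory so the example runs without a crawl.

```python
import io
import json

# Two lines in the shape JsonWriterPipeline writes (hypothetical sample values).
sample = io.StringIO(
    u'{"program_title": ["Program A"], "program_link": ["/programs/a.html"]}\n'
    u'{"program_title": ["Program B"], "program_link": ["/programs/b.html"]}\n'
)

def read_json_lines(handle):
    """Parse every non-empty line of an open file as one JSON object."""
    return [json.loads(line) for line in handle if line.strip()]

items = read_json_lines(sample)
print(len(items), items[0]["program_title"][0])
```

To read the real output you would pass an open file instead, for example `with open('cmu.jl') as f: items = read_json_lines(f)`.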

<h4 id='8.3'>8.3 Pipeline of dropping duplicate data</h4>

With pipelines, you can also drop duplicate items by using a pipeline as a filter. Every time the pipeline gets a duplicate item, it drops it.
Here we can add a new pipeline:

In [None]:
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.ids_seen = set()
    def process_item(self, item, spider):
        if item['program_title'][0] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item['program_title'])
        else:
            self.ids_seen.add(item['program_title'][0])
            return item


<h4 id='8.4'>8.4 Activate the pipelines</h4>

Then you need to change the **settings.py** file to make our pipelines work.

The integer values determine the order in which the pipelines run: from the lowest number to the highest.
In the example below, AddPipeline runs first, then DuplicatesPipeline, then JsonWriterPipeline.

Add the following lines to your settings.py file under the scrapy_tutorial/ directory:

In [21]:
# add these lines to your settings.py file
ITEM_PIPELINES = {
    'scrapy_tutorial.pipelines.AddPipeline': 200,
    'scrapy_tutorial.pipelines.DuplicatesPipeline': 300,
    'scrapy_tutorial.pipelines.JsonWriterPipeline': 400,
}
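
How the priority numbers translate into an execution order can be sketched in plain Python. This is only a simplified model under stated assumptions: the `DropItem` class here is a local stand-in for `scrapy.exceptions.DropItem`, the classes `AddPrefix`, `DropDuplicates` and `Collect` and the function `run_pipelines` are hypothetical names, and Scrapy's real engine does considerably more than this loop.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""

class AddPrefix(object):
    def process_item(self, item, spider):
        item["program_title"][0] = "CMU_program: " + item["program_title"][0]
        return item

class DropDuplicates(object):
    def __init__(self):
        self.seen = set()
    def process_item(self, item, spider):
        title = item["program_title"][0]
        if title in self.seen:
            raise DropItem("Duplicate item found: %s" % title)
        self.seen.add(title)
        return item

class Collect(object):
    """Keeps items in a list instead of writing them to a file."""
    def __init__(self):
        self.stored = []
    def process_item(self, item, spider):
        self.stored.append(item)
        return item

def run_pipelines(items, priorities):
    """Pass each item through the pipelines, lowest priority number first."""
    ordered = sorted(priorities, key=priorities.get)
    dropped = []
    for item in items:
        try:
            for pipeline in ordered:
                item = pipeline.process_item(item, spider=None)
        except DropItem as exc:
            dropped.append(str(exc))   # a dropped item skips the remaining pipelines
    return dropped

collector = Collect()
priorities = {AddPrefix(): 200, collector: 400, DropDuplicates(): 300}
items = [
    {"program_title": ["Program A"]},
    {"program_title": ["Program B"]},
    {"program_title": ["Program A"]},
]
dropped = run_pipelines(items, priorities)
print([i["program_title"][0] for i in collector.stored])
print(dropped)
```

Because the storing step has the highest number, it only ever sees items that were already prefixed and survived the duplicate filter.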

In [None]:
We can test whether these pipelines work by running the spider again:
    $ scrapy crawl cmu

In [26]:
### run the above command line
(status, output) = commands.getstatusoutput('scrapy crawl cmu')
print output

2016-11-04 19:41:17 [scrapy] INFO: Scrapy 1.2.0 started (bot: scrapy_tutorial)
2016-11-04 19:41:17 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_tutorial.spiders', 'SPIDER_MODULES': ['scrapy_tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'scrapy_tutorial'}
2016-11-04 19:41:17 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-11-04 19:41:17 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downl

After adding these three pipelines, we get different data from before.

We drop the duplicate ones (shown in the log):
    
  ![](http://i.imgur.com/kwrvJoZ.png)
 
 Also, "CMU_program: " is added before every title, as we hoped:


  ![](http://i.imgur.com/sJ66zrH.png)

The pipeline above stores data in JSON form, but if you are interested, you can also try storing data in MongoDB or MySQL.

### Summary ###

After this small example of using the Scrapy library to crawl data from the CMU website, we have a basic knowledge of how to get data from complex HTML, how to improve our extraction results, how to store data as simple JSON, and how to drop or change data by adding pipelines. This basic knowledge should be enough for you to try a crawling project by yourself. If you want to learn more about Scrapy, it is helpful to read the official Scrapy tutorial in detail.

Try to use scrapy and have fun!

### References ###

Scrapy Tutorial https://doc.scrapy.org/en/latest/intro/tutorial.html

A quick introduction to web crawling using Scrapy https://amaral.northwestern.edu/blog/quick-introduction-web-crawling-using-scrapy-part-