## Introduction

This tutorial introduces the basics of Scrapy, a web crawling framework in which a developer writes a spider: code that defines how a certain site (or a group of sites) will be scraped. Its biggest feature is that it is built on Twisted, an asynchronous networking library, which lets Scrapy download many pages concurrently and so improves its performance relative to scraping tools that fetch one page at a time. Before we begin, let's answer some questions you may have.

![alt text](https://www.22nds.com/wp-content/uploads/2017/07/scrapy-e1501276846765.png "Scrappy")

What is **asynchronous**?

Put simply, if you have a phone book and need to call everyone in it, calling them one by one, and waiting for each call to finish before dialing the next, is **synchronous**. Calling multiple phone numbers at the same time is **asynchronous**.
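
The phone book analogy can be sketched in a few lines of Python. This is only an illustration: it uses the standard library's `asyncio` rather than Twisted (which is what Scrapy actually builds on), and the `call` function and contact names are made up, with a short sleep standing in for time spent waiting on the line.

```python
import asyncio

async def call(name, log):
    """Pretend to phone one contact; the sleep stands in for waiting on the line."""
    log.append(("dial", name))
    await asyncio.sleep(0.01)  # while this call waits, other calls may proceed
    log.append(("hangup", name))

async def call_one_by_one(names):
    """Synchronous style: finish each call before dialing the next."""
    log = []
    for name in names:
        await call(name, log)
    return log

async def call_all_at_once(names):
    """Asynchronous style: dial everyone, then wait for all calls together."""
    log = []
    await asyncio.gather(*(call(name, log) for name in names))
    return log

contacts = ["Ann", "Bob", "Cy"]
sequential_log = asyncio.run(call_one_by_one(contacts))
concurrent_log = asyncio.run(call_all_at_once(contacts))

print(sequential_log)  # dial/hangup pairs, one contact at a time
print(concurrent_log)  # all dials first, then the hangups
```

In the second log every "dial" comes before any "hangup", which is the same overlap that lets a crawler keep many downloads in flight at once.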


What is the difference between **Scrapy** and **Beautiful Soup**?

As mentioned before, Scrapy is a web scraping framework. In Scrapy, you provide a root URL to start crawling from, then specify how many URLs you want to crawl and parse.

**On the other hand...**

Beautiful Soup is a tool for quickly extracting data from web pages. It is also very friendly for beginners, as it is easy to learn. However, in most cases Beautiful Soup alone cannot get the job done: you need another package such as urllib2 or requests to download the web page, and only then can you use Beautiful Soup to parse the HTML source code. It also only handles the contents of the one URL that you give it and then stops. It does not crawl unless you manually put it inside a loop that follows links according to certain criteria.

In conclusion, the difference is that Beautiful Soup is a library while Scrapy is a complete framework. We could use Beautiful Soup, together with a downloader and a crawling loop, to build something similar to Scrapy.
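
To make the "parse one page" versus "crawl" distinction concrete, here is a minimal sketch using only the standard library. Everything in it is hypothetical: the three-page site lives in a dictionary so no network access is needed, `html.parser` stands in for a parsing library, and `crawl` is a toy version of the loop that a framework like Scrapy provides for you.

```python
from collections import deque
from html.parser import HTMLParser

# A made-up three-page site, held in memory so no network access is needed.
PAGES = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a>',
    "/b": '<a href="/">home</a>',
}

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for key, value in attrs if key == "href")

def fetch_links(url):
    """Parse a single page: roughly what a parsing library gives you."""
    parser = LinkParser()
    parser.feed(PAGES[url])
    return parser.links

def crawl(start_url, max_pages=10):
    """Follow links breadth-first until no unseen pages remain."""
    seen, queue, order = {start_url}, deque([start_url]), []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(fetch_links("/"))  # links found on one page only
print(crawl("/"))        # every page reachable from the start page
```

A real crawler also has to handle scheduling, retries, politeness delays, and duplicate filtering, which is the part Scrapy takes off your hands.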


### Tutorial content

In this tutorial, we will show how to do some basic web scraping with the Scrapy framework.

While we have learned how to scrape data from the web using Beautiful Soup in class, this tutorial will show you an alternative way to scrape data that is much quicker. We will first be scraping the /r/funny page on Reddit, https://www.reddit.com/r/funny/, and converting the up/down votes and content into a .csv file.

We will cover the following topics in this tutorial:

* Overview of Scrapy
* Write your first Web Scraping code with Scrapy
* Set up your system
* Scraping Reddit: Fast Experimenting with Scrapy Shell
* Writing Custom Scrapy Spiders
* Case Studies using Scrapy
* Scraping an E-Commerce site
* Scraping Techcrunch: Create your own RSS Feed Reader

- [Installing Scrapy](#Installing-Scrapy)
- [Scraping  Reddit: Intro to Scrapy using Scrapy Shell](#Scrapy-Shell)
- [Writing Custom Spiders](#Writing-Custom-Spiders)


## Installing Scrapy

Before we can use Scrapy, we must first install it. Scrapy 1.5, the version used in this tutorial, supports both Python 2 and Python 3. If you’re using Anaconda, you can install the package from the conda-forge channel, which has up-to-date packages for Linux, Windows and OS X.

To install using `conda`:

    conda install -c conda-forge scrapy

Linux and OS X users can also use `pip`.

To install using `pip`:

    pip install scrapy
    
**NOTE**: I ran into some errors after installing Scrapy, specifically:

    ImportError: libiconv.so.2: cannot open shared object file: No such file or directory
    
If this is the case, make sure you run:

    conda update --all

and errors such as these should be fixed.
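
One quick way to confirm the installation worked is to ask Python whether it can locate the package. The helper below is a small sketch using only the standard library; the name `is_installed` is made up for this example.

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can locate a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None

print("scrapy installed:", is_installed("scrapy"))
```

If this prints `False`, the package was most likely installed into a different Python environment than the one you are running.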



## Scrapy Shell

In this part of the tutorial, we will primarily use the terminal to get used to the syntax of Scrapy. To do so, we will scrape the /r/datascience page on Reddit. Create a new folder on your computer and start the Scrapy shell in that folder by typing the following on your command line:

    scrapy shell
    
If Scrapy just wrote a bunch of stuff, don't worry, you are on the right track. To get information from /r/datascience on Reddit, you first have to run a crawler on it. A crawler is a program that browses web sites and downloads content. Crawlers are sometimes also referred to as spiders.

To run the scrapy crawler, type the following into the shell:

    fetch("https://www.reddit.com/r/datascience/")
    
It should return something like this:

![alt text](https://image.ibb.co/fKkEX7/1.png "Scrappy")

When you crawl something with Scrapy, it returns a “response” object that contains the downloaded information. If done correctly, the /r/datascience page has now been downloaded to your computer. To check, type `view(response)` in the shell, which opens the downloaded copy in your browser, and compare it with the actual /r/datascience page.

If you type `print(response.text)` into the shell, you should see the HTML source that makes up the web page.

For now, we simply want to collect the following for each post:
* Title of each post
* Number of votes it has
* Number of comments
* Time of post creation

### Extracting Title of each post
Similar to Beautiful Soup, Scrapy provides ways to extract information from HTML based on CSS selectors such as class and id. Let’s find the CSS selector for the title: right-click on any post’s title and select “Inspect” or “Inspect Element”:

![alt text](https://image.ibb.co/fA3Uzn/2.png "Scrappy")

You should see that the p tag has the class `title`. The titles can then be extracted by typing the following:

    response.css(".title::text").extract()
    
Here `response.css(..)` is a function that pulls content based on the CSS selector passed to it. The ‘.’ before title is needed because `.title` selects by CSS class. We use `::text` to extract only the text content of the matching elements; otherwise, the HTML tags would be returned as well. The following two images show an example:

   ![alt text](https://image.ibb.co/c5Rk5S/3.png "Scrappy")

The same idea can be applied to extracting the number of votes, the number of comments, and the time of post creation.

### Extracting Number of votes

To extract the number of votes, we again use Inspect Element on the page. From inspecting the elements, we see that `.score.unvoted` corresponds directly to the vote count shown on the page. After finding this class, we then type:

    response.css(".score.unvoted::text").extract()
    
### Extracting time of post creation

For the time of post creation, we cannot simply extract the text, as the text displays how long ago the post was created rather than the actual time. As a result, we must use another method.

First take a look at the Scrapy documentation on the `::attr()` selector and see if you can apply it to extract the time of post creation. If not, follow along below:

   ![alt text](https://image.ibb.co/jTa1en/4.png "Scrappy")

Since we need to extract the contents of the `title` attribute, we use `::attr(...)` as follows:

    response.css("time::attr(title)").extract()

The `::attr(attributename)` pseudo-element returns the value of the specified attribute of each matching element.

### Extracting Number of comments

Extracting the number of comments should be relatively easy based on what we have learned so far and is left as an exercise for the reader.

As a recap, the key pieces so far are:

* response – An object that the scrapy crawler returns. This object contains all the information about the downloaded content.
* response.css(..) – Matches the element with the given CSS selectors.
* extract_first(..) – Extracts the “first” element that matches the given criteria.
* extract(..) – Extracts “all” the elements that match the given criteria.


## Writing Custom Spiders

As we have stated, a spider/scraper is a program that pulls content from the web starting from a given URL. Normally, you would also need to write code to convert the extracted data to a structured format and store it in a reusable format like CSV or JSON, as we have seen with Beautiful Soup. However, that is tedious and takes a lot of code. Fortunately, Scrapy does most of this work for us, as it comes with most of the functionality needed to scrape data and convert it to a usable format.

Before we can write custom spiders, we must set things up to work in Jupyter Notebook. Because Scrapy is a framework and is not usually run from a notebook, there is some configuration to apply first. We start by importing `InteractiveShell` and setting `ast_node_interactivity` to `"all"`, which makes Jupyter display the value of every expression in a cell rather than only the last one.

    # Settings for notebook
    from IPython.core.interactiveshell import InteractiveShell
    InteractiveShell.ast_node_interactivity = "all"

Now we import scrapy and its `CrawlerProcess` class, which lets us run a spider from a script or notebook instead of the command line.

    from scrapy.crawler import CrawlerProcess
    import scrapy

In the next step, we import json and create a simple pipeline that writes all found items to a JSON file, where each line contains one JSON element.

    import json

    class JsonWriterPipeline(object):

        def open_spider(self, spider):
            self.file = open('quoteresult.jl', 'w')

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            line = json.dumps(dict(item)) + "\n"
            self.file.write(line)
            return item
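
The "one JSON element per line" layout is often called JSON Lines. As a rough sketch of what the pipeline produces and how to read it back, the example below writes two made-up items to a temporary file in the same way `process_item` does, then loads them line by line. The item values and file name are placeholders.

```python
import json
import os
import tempfile

# Made-up items standing in for what a spider would yield.
items = [{"title": "first", "vote": "13"}, {"title": "second", "vote": "39"}]

# Write one JSON object per line, as the pipeline's process_item does.
path = os.path.join(tempfile.mkdtemp(), "demo.jl")
with open(path, "w") as f:
    for item in items:
        f.write(json.dumps(item) + "\n")

# Read the file back: each non-empty line parses independently.
with open(path) as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded)
```

Because every line is a complete JSON document, the file can be appended to and processed as a stream, which a single large JSON array does not allow.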

Next, we define the spider itself. It starts from the /r/datascience page and, for each post, yields a dictionary with the title, votes, creation time, and comment count.

    import logging

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'https://www.reddit.com/r/datascience/',
        ]
        custom_settings = {
            'LOG_LEVEL': logging.WARNING,
            # Option 1: the custom JSON Lines pipeline defined above
            'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},
            # Option 2: Scrapy's built-in feed export
            'FEED_FORMAT': 'json',
            'FEED_URI': 'result.json'
        }

        def parse(self, response):
            # Each call returns one page-wide list of strings
            titles = response.css('.title.may-blank::text').extract()
            votes = response.css('.score.unvoted::text').extract()
            times = response.css('time::attr(title)').extract()
            comments = response.css('.comments::text').extract()

            # Combine the extracted lists row by row, one tuple per post
            for item in zip(titles, votes, times, comments):
                # Create a dictionary to store the scraped info
                scraped_info = {
                    'title': item[0],
                    'vote': item[1],
                    'created_at': item[2],
                    'comments': item[3],
                }

                # Yield the scraped info to Scrapy
                yield scraped_info

Finally, we create a crawler process with a browser-like user agent and run the spider:

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(QuotesSpider)
    process.start()


The cell prints Scrapy's start-up log:

    2018-03-31 23:57:47 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
    2018-03-31 23:57:47 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 18:21:58) - [GCC 7.2.0], pyOpenSSL 17.5.0 (OpenSSL 1.0.2o  27 Mar 2018), cryptography 2.2.1, Platform Linux-4.13.0-37-generic-x86_64-with-debian-stretch-sid
    2018-03-31 23:57:47 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'json', 'FEED_URI': 'result.json', 'LOG_LEVEL': 30, 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}


    <Deferred at 0x7fca206816d8>

    ReactorNotRestartable:

The `ReactorNotRestartable` error appears when `process.start()` is run a second time in the same kernel. Twisted's reactor cannot be restarted once it has stopped, so restart the notebook kernel before re-running the crawl.

We can list the file written by the pipeline:

    ll quoteresult.*

    -rw-rw-r-- 1 andy 0 Mar 31 23:57 quoteresult.jl

and look at its last two lines:

    !tail -n 2 quoteresult.jl

The feed export in `result.json` can be loaded into a pandas DataFrame:

    import pandas as pd
    dfjson = pd.read_json('result.json')
    dfjson

| | comments | created_at | title | vote |
|---|---|---|---|---|
| 0 | 77 comments | Sat Aug 6 11:43:42 2011 UTC | Weekly 'Entering & Transitioning' Thread. Ques... | 13 |
| 1 | 21 comments | Sun Mar 25 17:09:02 2018 UTC | How much math is really needed for DS? | 39 |
| 2 | 1 comment | Sat Mar 31 17:58:33 2018 UTC | Autodidacts, how do you attack a textbook on y... | 4 |
| 3 | 3 comments | Sun Apr 1 01:41:02 2018 UTC | Validate significance of classification of unb... | 17 |
| 4 | 1 comment | Sat Mar 31 14:58:40 2018 UTC | Data Science Intermediate Projects | 4 |
| 5 | 2 comments | last edited 11 hours ago | Create internal blog from Notebooks | 2 |
| 6 | comment | Sat Mar 31 20:33:52 2018 UTC | Best courses to take if I hope to be a Data Sc... | • |
| 7 | comment | Sat Mar 31 23:38:08 2018 UTC | Word Embeddings : Word2Vec and Latent Semantic... | 3 |
| 8 | comment | Sun Apr 1 03:20:42 2018 UTC | Before Sunrise Text Classification: Who Said It? | 3 |
| 9 | 50 comments | Sat Mar 31 20:56:01 2018 UTC | What is the easiest way for one to store a dat... | 61 |
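
The scraped columns are still strings such as "77 comments" or "•", so a natural next step is to turn them into numbers. The sketch below is one possible cleanup, run on a few made-up rows shaped like the output above. It assumes that a bare "comment" means zero comments and that a "•" vote means the score is hidden, neither of which is confirmed by the scrape itself.

```python
import pandas as pd

# A few made-up rows shaped like the scraped output.
df = pd.DataFrame({
    "comments": ["77 comments", "1 comment", "comment"],
    "vote": ["13", "4", "•"],
})

# Pull the leading number out of the comments text.
# Assumption: a bare "comment" with no number means zero comments.
df["n_comments"] = (
    df["comments"].str.extract(r"(\d+)", expand=False)
    .fillna("0")
    .astype(int)
)

# Convert votes to numbers; anything non-numeric (such as "•") becomes NaN.
df["n_votes"] = pd.to_numeric(df["vote"], errors="coerce")

print(df)
```

Keeping the original string columns next to the cleaned ones makes it easy to spot rows where the cleanup assumptions did not hold.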
