![Test](https://4qr7k2a2xza2vctux33bisalkw-wpengine.netdna-ssl.com/wp-content/uploads/2013/09/Scrapy_logo.jpg)

# Web Scraping
    
The saying "Information is Power" is especially true for data scientists, whose work is worthless without the right data. The entire process of data science and analytics starts with obtaining the right data and involves interacting with it at every step.

Data collection is one of the most important aspects of data science. Statistical evaluation and machine learning models all come after one has successfully collected and cleaned up the required data. Often, this data has not already been collected and cleaned up, so the data scientist has to do that work first.

In today's digital age, this data is usually scattered all over the web, and unless someone has taken the pains to create a usable API, collecting it requires web scraping. This is why every data scientist should be familiar with at least the basics of web scraping: the need to collect data from websites arises quite often.

Data collection from the internet usually involves three steps - 
1. Crawling - this refers to browsing the internet in a systematic and automated way. An example of the power of large-scale web crawling is Google, whose web crawlers have essentially indexed the World Wide Web, putting it at everyone's fingertips.
2. Scraping - technically, this refers to extracting some part of the data from a particular domain or website. Colloquially, however, it refers to the entire process of data collection (all three steps as a whole).
3. Parsing - this refers to interpreting the data extracted from webpages or domains and converting it into usable data.
    
This entire process leaves one with data which one can then analyze.
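
To make the three steps concrete, here is a minimal sketch that uses only Python's standard library on a hard-coded page. The `PAGE` string and the `PageParser` class are made up for illustration; a real crawler would download the page first, and Scrapy handles all of this far more conveniently.

```python
from html.parser import HTMLParser

# Hypothetical page content standing in for a downloaded response.
PAGE = """
<html><body>
  <h3>First post</h3>
  <h3>Second post</h3>
  <a href="/page/2">next</a>
</body></html>
"""


class PageParser(HTMLParser):
    """Collects <h3> text (scraping) and <a href> values (crawl targets)."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.links = []
        self._in_h3 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False

    def handle_data(self, data):
        # Only keep text that appears inside an <h3> element.
        if self._in_h3 and data.strip():
            self.headings.append(data.strip())


parser = PageParser()
parser.feed(PAGE)

# Parsing: turn the raw extracted strings into structured records.
records = [
    {"position": i, "title": title}
    for i, title in enumerate(parser.headings, start=1)
]

print(parser.links)   # links a crawler could visit next
print(records)        # structured data ready for analysis
```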

# Scrapy 

Now that we can better appreciate the importance of web scraping, we can also see the need to understand a technology like Scrapy. 

An open source framework for web scraping, Scrapy makes it painless to obtain data from all kinds of websites. A robust and well-documented back-end, combined with Scrapy's user friendliness and relatively gentle learning curve, makes it one of the most useful frameworks for anyone looking to get into web scraping. 

Scrapy is intended to serve as a complete data collection toolkit and allows one to complete all three steps of data collection, which is why it is the framework of choice for this tutorial. 

# Tutorial Structure

Now that we have hopefully motivated the need to learn and understand the basics of a web scraping framework, we will begin our tutorial. This tutorial will guide the reader through the basics of Scrapy while introducing core concepts that translate to any web scraper in general. It will end with a Reddit meme scraper that scrapes the names and URLs of the memes on a user-specified number of pages of Reddit and stores them in a JSON file. The tutorial is structured as follows - 

1. [Installing Scrapy](#one)
2. [Spiders](#two)
    * Basic Scraping
    * Following Links
3. [Selectors (a Deeper Discussion)](#three)
4. [Items](#four)
5. [Basics of Reddit Scraper](#five)
6. [Feed Exports](#six)
7. [Spider Arguments](#seven)


<a id="one"> </a> 
## 1. Installing Scrapy
Scrapy works with Python 2.7 and Python 3.4 and above. 

For Anaconda users, 

In [None]:
conda install -c conda-forge scrapy

Alternatively, one can use `pip` to install scrapy as well, 

In [1]:
pip install Scrapy


The following command must be run outside of the IPython shell:

    $ pip install Scrapy

The Python package manager (pip) can only be used from outside of IPython.
Please reissue the `pip` command in a separate terminal or command prompt.

See the Python documentation for more informations on how to install packages:

    https://docs.python.org/3/installing/


According to Scrapy's official documentation, it is strongly recommended to install Scrapy in its own `virtualenv` because of the dependency issues it might otherwise create. This would require running the Jupyter notebook inside that virtual environment, which is unnecessary for the purposes of this tutorial and will not be covered here. 

In [2]:
import scrapy
import scrapy.cmdline
import sys

<a id="two"> </a> 
## 2. Spiders
Spiders are custom classes which define how a particular group of sites is to be scraped (think complex web parser), including instructions on how the websites should be crawled and how data should be parsed.

They must subclass `scrapy.Spider` and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.

In basic terms, spiders are like web crawlers which, when set loose on a set of `start_urls`, keep scraping data and following new links according to a pre-defined set of rules. They essentially crawl the internet, hence the name.

Here's the code for a very bare-bones spider,

In [4]:
class MySpider(scrapy.Spider):
    name = 'your_url.com'
    allowed_domains = ['my_url.com']
    start_urls = [
        'http://www.my_url.com/1.html',
        'http://www.my_url.com/2.html',
        'http://www.my_url.com/3.html',
    ]

    def parse(self, response):
        # Called once for each downloaded page in start_urls
        self.logger.info('Response received!')

There are a few observations one should make here. 
`allowed_domains` and `start_urls` are self-explanatory and serve as a starting point for our spider. 
Note that the `parse` method is, by default, the registered callback for the `start_urls` and is called with the response for each of the URLs in `start_urls`. Our `parse` method currently just logs a message, but in general parse methods can `yield` either dicts (data extracted from the webpage) or `Request` objects (which make the spider follow the requested URL).

Now, observe that this isn't very different from what you can do with just the popular `requests` package in Python, or its equivalent in any other language for that matter. (Send a few GET requests to a few webpages and store their responses.)

The real magic of a web scraper like Scrapy comes into play when you actually make your spider crawl, i.e., you yield multiple items from your parse function (a combination of Requests and data), making your spider follow links according to your rules. 
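
As a toy model of this idea, the sketch below runs a `parse` generator over a made-up in-memory "website" and feeds every yielded link back into a queue. The `SITE` dictionary and the `crawl` helper are hypothetical and only imitate, very loosely, what Scrapy's engine does on our behalf; this is not how Scrapy is implemented.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (headings, outgoing links).
SITE = {
    "/1.html": (["Welcome"], ["/2.html", "/3.html"]),
    "/2.html": (["News"], ["/1.html"]),
    "/3.html": (["About"], []),
}


def parse(url):
    """Yield scraped data (dicts) and new pages to visit (strings)."""
    headings, links = SITE[url]
    for heading in headings:
        yield {"url": url, "heading": heading}
    for link in links:
        yield link


def crawl(start_url):
    """Keep calling parse() until no unvisited pages remain."""
    queue = deque([start_url])
    seen = {start_url}
    items = []
    while queue:
        for result in parse(queue.popleft()):
            if isinstance(result, dict):
                items.append(result)      # scraped data
            elif result not in seen:      # skip pages already scheduled
                seen.add(result)
                queue.append(result)
    return items


items = crawl("/1.html")
print(items)
```

Note how `/2.html` links back to `/1.html` without causing an endless loop, because pages that were already scheduled are skipped.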

For instance, if you want your spider to follow every link in the webpages in `start_urls` and return all `h3` headings in each page, you could modify your spider as follows, 

In [6]:
class MySpider(scrapy.Spider):
    name = 'your_url.com'
    allowed_domains = ['my_url.com']
    start_urls = [
        'http://www.my_url.com/1.html',
        'http://www.my_url.com/2.html',
        'http://www.my_url.com/3.html',
    ]

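    def parse(self, response):
        # Yield the text of every <h3> heading as a small dict
        for heading in response.css('h3::text').extract():
            yield {'heading': heading}

        # Follow every link; Request needs absolute URLs, hence urljoin()
        for url in response.css('a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(url), callback=self.parse)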

As can be seen, Scrapy supports CSS selectors through `response.css()`, and the method `.extract()` is required to convert its output (a `SelectorList`) to a list of strings (Scrapy has a custom `Selector` class that provides the mechanism required to extract data from returned webpages). Scrapy's selectors are discussed in more detail below.

Scrapy also provides the convenience method `.follow()` on the `response` object, which can be used as follows, 

In [52]:
class MySpider(scrapy.Spider):
    name = 'your_url.com'
    allowed_domains = ['my_url.com']
    start_urls = [
        'http://www.my_url.com/1.html',
        'http://www.my_url.com/2.html',
        'http://www.my_url.com/3.html',
    ]

    def parse(self, response):
        for heading in response.css('h3::text').extract():
            yield {'heading': heading}

        # follow() accepts the selector itself and resolves relative URLs
        for url in response.css('a::attr(href)'):
            yield response.follow(url, callback=self.parse)


The advantage of this convenience method is that it handles relative URLs automatically and also accepts selectors in addition to strings, eliminating the need to convert one's selector object into a string.
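
To see why relative URLs need special handling, here is a minimal sketch of the resolution step using only the standard library. The page URL and the `hrefs` list are made-up values, and this only illustrates the idea; it is not Scrapy's internal implementation.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page that was just downloaded.
page_url = "http://www.my_url.com/posts/1.html"

# Links as they might appear in href attributes on that page.
hrefs = ["2.html", "/about.html", "http://www.my_url.com/3.html"]

# Resolve each href against the page it was found on.
absolute = [urljoin(page_url, href) for href in hrefs]

for url in absolute:
    print(url)
```

Only the last `href` could be requested as-is; the first two are meaningless without the URL of the page they came from.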

<a id="three"> </a> 
## 3. Selectors

Let us now briefly discuss selectors before moving ahead since selectors form an integral part of the entire scraping process. 

As mentioned earlier, selectors are essentially the mechanism for extracting data from a received response. Scrapy supports "selecting" different parts of the HTML response using either XPath or CSS expressions. XPath is a language for selecting nodes in XML documents and is outside the scope of this tutorial. We will instead focus on the widely popular CSS selectors.

As we saw earlier, one can query a response using the `.css()` method, which returns a `SelectorList` instance (a list of new selectors). Scrapy provides the `.extract()` and `.extract_first()` methods to convert this `SelectorList` instance to a normal list of matches or the first match in the document, respectively. Note that one added benefit of `.extract_first()` over manually indexing the first element of the `SelectorList` is that `.extract_first()` returns `None` if no match is found, whereas indexing an empty list raises an `IndexError`. 

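Also, CSS selectors can select text or attribute nodes through the `::text` and `::attr(name)` pseudo-elements (`.css('h3::text')` or `.css('a::attr(href)')`). Note that these two are Scrapy-specific extensions rather than standard CSS.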

Note that while collecting data, especially with a web crawler, it is important to make one's code resilient to the wasteland of badly organized CSS and HTML that is the internet, in order to maximize data collection.

Now, while Selectors by themselves are quite a vast topic, we now have enough knowledge about them to scrape most well-created websites. Since we are more focused on learning Scrapy as a whole, we encourage the interested reader to look through Scrapy's own (quite extensive) documentation on Selectors. 

<a id="four"> </a> 
## 4. Items

Now that we know the basics of traversing webpages and of selecting and parsing data, we will talk about returning this data in a structured form. We could certainly use Python dictionaries for this, but in bigger projects with multiple spiders (following complicated sets of rules), this can very easily lead to type inconsistencies or typos that cause various issues. To resolve this, Scrapy provides a common data output format in the form of the `Item` class. Items are simple data containers that provide a unified way of structuring data and give the user a dictionary-like API with a convenient syntax for declaring fields. 


In [7]:
class Meme(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()

While we have no need of Items in our small-scale example, we will still use a `Meme` Item to illustrate its utility, which will help the reader during independent larger projects. 

Note: The `Field` objects are used to specify metadata for each field of the item. Since we will not be needing this for our purposes, we invite the interested reader to learn more about these through our follow-up reading. 

As mentioned earlier, scrapy provides a dictionary-like interface to use the `Item` class. 

In [37]:
meme = Meme(name="the ultimate", link="www.xkcd.com")
print(meme)
print(meme.get('name'))

meme["name"] = "the ultimate 2"
print(meme.get('name'))

print(meme.keys())
print(meme.items())

{'link': 'www.xkcd.com', 'name': 'the ultimate'}
the ultimate
the ultimate 2
dict_keys(['name', 'link'])
ItemsView({'link': 'www.xkcd.com', 'name': 'the ultimate 2'})


Note that it is also easy to convert an `Item` into a normal Python dictionary, 

In [38]:
print(dict(meme))

{'name': 'the ultimate 2', 'link': 'www.xkcd.com'}


<a id="five"> </a> 
## 5. Basics of Reddit Scraper

Believe it or not, we now have enough tools to create a pretty robust example, a Reddit meme scraper! The beauty of Scrapy, in my opinion, is that it abstracts away different amounts of functionality for different kinds of users, and one can very easily peel away its layers to go deeper and obtain more and more advanced features. 

While this tutorial is meant as an introduction to web scraping and Scrapy in general, it will leave the reader with enough tools to confidently dive into Scrapy's documentation and explore its several intricacies. 

We will now combine the concepts we have talked about earlier to create a basic Reddit meme scraper. 

Let us first start off with a simple Spider that simply visits https://www.reddit.com/r/memes/

In [99]:
%%writefile spider.py
import scrapy
class RedditScraper(scrapy.Spider):
    name = "reddit_scraper"
    start_urls = ['https://www.reddit.com/r/memes/']
    
    def parse(self, response):
        pass

Overwriting spider.py


Scrapy officially recommends using its CLI tool to run the spiders one writes so that is the route we'll be following. (Note that there is a way to run your spiders programmatically in a python script as well) 

Note that achieving this in a Jupyter notebook requires the magic commands `%%bash` (runs the cell as bash commands) and `%%writefile` (writes Python code out to a file, which can then be run by our bash commands). On an actual development machine, as one would imagine, the workflow is more streamlined. Also, the `2>&1` at the end of our bash commands can be safely ignored: it redirects standard error to standard output, so that Scrapy's log messages show up in the cell output.

Running a spider is as simple as

In [100]:
%%bash 
venv/Scripts/scrapy.exe runspider spider.py  2>&1 

2018-03-30 22:19:17 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-03-30 22:19:17 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.3 (v3.6.3:2c5fed8, Oct  3 2017, 18:11:49) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0g  2 Nov 2017), cryptography 2.2.1, Platform Windows-10-10.0.16299-SP0
2018-03-30 22:19:17 [scrapy.crawler] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2018-03-30 22:19:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-03-30 22:19:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 

Amidst this jungle of INFO messages, one can see `Crawled (200) <GET https://www.reddit.com/r/memes/> (referer: None)`, which signifies that we successfully visited the Reddit memes page! 

We can now modify our `parse` function to actually extract the titles of the posts and their links, as well as follow the links to the next pages. (Note that we have also added our `Meme` class from earlier to `spider.py`.)

In [56]:
%%writefile spider.py

import scrapy

class Meme(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()

class RedditScraper(scrapy.Spider):
    no_pages = 2
    name = "reddit_scraper"
    start_urls = ['https://www.reddit.com/r/memes/']
    
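    def parse(self, response):
        # extract_first() returns None (instead of raising IndexError)
        # when the page has no "next" button
        next_link = response.css('span.next-button').css('a::attr(href)').extract_first()
        if next_link and self.no_pages:
            self.no_pages -= 1
            yield response.follow(next_link, callback=self.parse)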
        
        links = response.css('div.thing')
        for link in links:
            meme = Meme()
            meme["link"] = link.css('::attr(data-url)').extract_first()
            meme["name"] = link.css('a.title::text').extract_first()
            yield meme
        

Overwriting spider.py


In [42]:
%%bash 
venv/Scripts/scrapy.exe runspider spider.py  2>&1 

2018-03-30 19:48:40 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-03-30 19:48:40 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.3 (v3.6.3:2c5fed8, Oct  3 2017, 18:11:49) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0g  2 Nov 2017), cryptography 2.2.1, Platform Windows-10-10.0.16299-SP0
2018-03-30 19:48:40 [scrapy.crawler] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2018-03-30 19:48:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-03-30 19:48:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 


Note: The `no_pages` parameter decides the number of pages to scrape. (It is simply a counter that decides how many times a `Request` with the next page's url is yielded) 
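
To check how this counter translates into the number of pages actually parsed, here is a small standard-library simulation of the same logic. The `pages_visited` helper and its `pages_available` parameter are hypothetical; they only imitate the spider's control flow and make no requests.

```python
def pages_visited(no_pages, pages_available):
    """Imitate the spider: follow "next" while the counter is non-zero."""
    visited = 0
    while True:
        visited += 1                          # parse() runs on this page
        has_next = visited < pages_available  # is there a "next" button?
        if has_next and no_pages:
            no_pages -= 1                     # yield one more Request
        else:
            break
    return visited


print(pages_visited(2, 10))
print(pages_visited(2, 2))
print(pages_visited(0, 10))
```

Since the start page is always parsed, a counter of `no_pages` leads to at most `no_pages + 1` parsed pages.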

Our scraper works! Amidst the INFO messages, one can see that our program has crawled, scraped, parsed and output, in a structured format, the names and links of the memes on the first few pages of the memes subreddit (the start page plus up to `no_pages` follow-up pages). 




<a id="six"> </a> 
## 6. Feed Exports

Now that our scraper is working, we need to store the parsed data in a usable format such as JSON or CSV. Scrapy, once again, makes this extremely simple through Feed Exports. Feed Exports allow one to create a feed of the scraped data using a wide array of serialization formats and storage options. Scrapy supports JSON, JSON Lines, CSV and XML out of the box, but can easily be extended to support other serialization formats.

Scrapy feed exports also support a variety of storage options in addition to one's local filesystem, such as FTP, S3, and standard output. For our purposes, we will create a JSON feed on our local filesystem, but the reader should note that it is extremely easy to switch these up and use whichever option one's project requires. 

All one needs to do to create a feed export on the local system is add the `-o` switch to one's CLI command and supply the name of the file one wants to feed into. (Note that Scrapy feed exports append to a pre-existing file rather than overwriting it.)


In [None]:
%%bash 
venv/Scripts/scrapy.exe runspider spider.py -o results.json 2>&1 

In [64]:
# The with-block closes the file automatically once we are done
with open("results.json", encoding="utf8") as file:
    print(file.read())

[
{"link": "https://i.redd.it/srm1pqxkrxo01.jpg", "name": "Could've been worse."},
{"link": "https://i.imgur.com/vxkEdC7.jpg", "name": "Mmmhm..."},
{"link": "https://i.redd.it/t8g1spx9mxo01.jpg", "name": "Hide your girlfriend"},
{"link": "https://i.imgur.com/gdxbQRN.jpg", "name": "Watch it happen"},
{"link": "https://i.redd.it/9m2i6rzsxwo01.jpg", "name": "I only kinda like Reddit"},
{"link": "https://i.redd.it/zkz8wqit7zo01.jpg", "name": "Rip Gru figurine"},
{"link": "https://i.redd.it/yvjd3n2szxo01.jpg", "name": "My favorite"},
{"link": "https://i.redd.it/v27u2zy30xo01.jpg", "name": "considering their past history, maybe they shouldn't have invited her over"},
{"link": "https://i.redd.it/y8eu4bk8eyo01.jpg", "name": "Happens all the time"},
{"link": "https://i.redd.it/kihfuavylyo01.jpg", "name": "I excuse you, not the bell"},
{"link": "https://i.redd.it/45oykye1xyo01.jpg", "name": "Who said that *look right*"},
{"link": "https://i.redd.it/ymng5u49kxo01.jpg", "name": "When someone asks 

And just like that, we have a cleanly formatted JSON file with the links and names of all the memes on the scraped pages of the memes subreddit. 
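
Once the feed exists, loading it back for analysis takes only a few lines. The sketch below is self-contained: instead of reading `results.json`, it parses two records copied from the output above, and the variable names are our own.

```python
import json
from urllib.parse import urlparse

# Two records copied from the output above, standing in for results.json.
feed_text = """[
{"link": "https://i.redd.it/srm1pqxkrxo01.jpg", "name": "Could've been worse."},
{"link": "https://i.imgur.com/vxkEdC7.jpg", "name": "Mmmhm..."}
]"""

# Each record becomes a regular Python dictionary.
memes = json.loads(feed_text)

names = [meme["name"] for meme in memes]

# Which hosts serve the images? urlparse() splits a URL into its parts.
hosts = sorted({urlparse(meme["link"]).netloc for meme in memes})

print(names)
print(hosts)
```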


<a id="seven"> </a> 
## 7. Spider Arguments

Wouldn't it be useful if we could actually specify the number of pages we want to scrape while running our spider? This issue of user-controlled variables in one's code arises frequently during the data scraping process since one usually needs to experiment with different sized data-sets and such. 

Scrapy makes this simple through spider arguments, i.e., by using the `-a` option while running our CLI command, with a `name=value` pair for each argument. We will now implement this in our existing meme scraper to make it possible to change the number of pages scraped each time we run our spider. 

In [80]:
%%writefile spider.py

import scrapy

class Meme(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()

class RedditScraper(scrapy.Spider):
 
    name = "reddit_scraper"
    start_urls = ['https://www.reddit.com/r/memes/']
    
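    def parse(self, response):
        # Spider arguments arrive as strings, so convert before testing;
        # follow no extra pages if -a no_pages=... was not supplied
        self.no_pages = int(getattr(self, 'no_pages', 0))
        next_link = response.css('span.next-button').css('a::attr(href)').extract_first()
        if next_link and self.no_pages > 0:
            self.no_pages -= 1
            yield response.follow(next_link, callback=self.parse)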
        
        links = response.css('div.thing')
        for link in links:
            meme = Meme()
            meme["link"] = link.css('::attr(data-url)').extract_first()
            meme["name"] = link.css('a.title::text').extract_first()
            yield meme
        

Overwriting spider.py


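Note that we do not define a `start_requests` method here; Scrapy's default implementation simply creates a request for each URL in `start_urls`. The arguments passed to our spider are accessible in our code as attributes named after each argument (here, `self.no_pages`). Their values always arrive as strings, which is why `parse` converts `no_pages` with `int()`. 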
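
Because these values arrive as strings, it is worth converting and validating them in one place. The helper below is a hypothetical sketch (`parse_page_count` is not part of Scrapy) of how one might do this for `no_pages`, using only plain Python.

```python
def parse_page_count(raw, default=0):
    """Convert a raw spider argument to a non-negative int.

    `raw` is the string given on the command line, or None if the
    argument was omitted.
    """
    if raw is None:
        return default
    count = int(raw)  # raises ValueError for input such as "five"
    if count < 0:
        raise ValueError("no_pages must not be negative")
    return count


print(parse_page_count("5"))
print(parse_page_count(None))

# The pitfall this guards against: the string "0" is truthy,
# so testing the raw argument directly would not stop the crawl.
print(bool("0"), bool(parse_page_count("0")))
```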

In [81]:
%%bash 
venv/Scripts/scrapy.exe runspider spider.py -o results2.json -a no_pages=5 2>&1 

2018-03-30 21:07:34 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-03-30 21:07:34 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.3 (v3.6.3:2c5fed8, Oct  3 2017, 18:11:49) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0g  2 Nov 2017), cryptography 2.2.1, Platform Windows-10-10.0.16299-SP0
2018-03-30 21:07:34 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'json', 'FEED_URI': 'results2.json', 'SPIDER_LOADER_WARN_ONLY': True}
2018-03-30 21:07:34 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2018-03-30 21:07:34 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.Downl

In [97]:
# The with-block closes the file automatically once we are done
with open("results2.json", encoding="utf8") as file:
    print(file.read())

[
{"link": "https://i.redd.it/srm1pqxkrxo01.jpg", "name": "Could've been worse."},
{"link": "https://i.imgur.com/vxkEdC7.jpg", "name": "Mmmhm..."},
{"link": "https://i.redd.it/zkz8wqit7zo01.jpg", "name": "Rip Gru figurine"},
{"link": "https://i.redd.it/t8g1spx9mxo01.jpg", "name": "Hide your girlfriend"},
{"link": "https://i.redd.it/45oykye1xyo01.jpg", "name": "Who said that *look right*"},
{"link": "https://i.imgur.com/gdxbQRN.jpg", "name": "Watch it happen"},
{"link": "https://i.redd.it/kihfuavylyo01.jpg", "name": "I excuse you, not the bell"},
{"link": "https://i.redd.it/9m2i6rzsxwo01.jpg", "name": "I only kinda like Reddit"},
{"link": "https://i.redd.it/yvjd3n2szxo01.jpg", "name": "My favorite"},
{"link": "https://i.redd.it/y8eu4bk8eyo01.jpg", "name": "Happens all the time"},
{"link": "https://i.redd.it/v27u2zy30xo01.jpg", "name": "considering their past history, maybe they shouldn't have invited her over"},
{"link": "https://i.redd.it/ymng5u49kxo01.jpg", "name": "When someone asks 

As one can see, we now have a working Reddit meme scraper that cleanly outputs to a JSON file and allows us to scrape a variable number of pages. It has all the basic features one might need for such a task and leaves the user with data that is ready to be analyzed. It is worth noting that all this functionality has been achieved with fewer than 30 lines of code in our final version of `spider.py`, which goes to show how easy it is for a data scientist to quickly scrape some interesting data from a blog, website, news aggregator or anything of the sort and analyse it. 

# Summary and References

As is clear, this tutorial just highlighted and guided the reader through the basic concepts and functionality of scrapy and web scraping in general.  We even referred the reader to look in one of our suggested readings to gain further insight on some topics. 

Here are some readings we suggest to understand this extremely powerful framework better - 
   * Official Documentation for Scrapy (MOST IMPORTANT) https://doc.scrapy.org/en/latest/index.html#section-basics
   * An extremely useful beginner tutorial on Scrapy (similar to this one) https://www.digitalocean.com/community/tutorials/how-to-crawl-a-web-page-with-scrapy-and-python-3
   * A GitHub repo with various useful examples of Scrapy usage https://github.com/feiskyer/scrapy-examples
   * An advanced tutorial for Scrapy (make sure you're very familiar with the basics before reading this) http://sangaline.com/post/advanced-web-scraping-tutorial/
   * A page with several nifty tricks and common practices https://doc.scrapy.org/en/latest/topics/practices.html