# Web Scraping with Scrapy
## Introduction
This tutorial provides an introduction to data collection through web scraping using the Python library Scrapy. 
Data science depends on having data to analyze, and with the abundance of data available on the internet, web scraping (fetching and extracting data from webpages) has become a popular and effective way to collect it. Scrapy provides a convenient API for programmatically extracting data from connected web pages in relatively few lines of code.

![Scrapy](https://scrapy.org/img/scrapylogo.png)

Scrapy is a Python library that deals with web pages, so readers who are new to Python 3 or are unfamiliar with the structure of web pages may want to first go through [the official Python Tutorial](https://docs.python.org/3/tutorial/) or brush up on [HTML](https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduction_to_HTML) and [CSS](https://developer.mozilla.org/en-US/docs/Learn/CSS/Introduction_to_CSS).

### Table of Contents
1. [Installation](#Installation)
2. [Getting Started with Spiders](#Getting-Started-with-Spiders)
3. [Running Spiders](#Running-Spiders)
4. [Following Links](#Following-Links)
5. [Making Requests and Using APIs](#Making-Requests-and-Using-APIs)
6. [Summary and Additional Resources](#Summary-and-Additional-Resources)

---

## Installation
For Anaconda users, the recommended way to install Scrapy is to open an Anaconda prompt and run the command

<code>$ conda install -c conda-forge scrapy</code>

You can also install using pip by running

<code>$ pip install Scrapy</code>

If you have any issues while installing, you can refer to the [platform-specific instructions](https://doc.scrapy.org/en/latest/intro/install.html#platform-specific-installation-notes) in Scrapy's documentation.

After installation, the imports below should work.

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess

---

## Getting Started with Spiders
The main component of a Scrapy web scraper is the **Spider**. Spiders are user-defined classes that extend [`scrapy.Spider`](https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy-spider) and have methods specifying which web requests to make, which pages and links to follow, and what data on each page to extract. Here's a simple example spider that scrapes and prints a few tweets from a page on [Twitter](https://twitter.com/BarackObama).

In [2]:
import scrapy
from scrapy.crawler import CrawlerProcess

class SimpleExampleSpider(scrapy.Spider):
    name = "simple example"
    start_urls = ['https://twitter.com/BarackObama']

    def parse(self, response):
        tweets = dict()
        for tweet in response.css('div.content'):
            time = tweet.css('span._timestamp::text').extract_first()
            text = tweet.css('p.tweet-text::text').extract()
            tweets[time] = ''.join(text)
            if self.settings['FEED_URI'] is None:  # no output file configured, so print instead
                print('{}: {}\n'.format(time, ''.join(text)))  # format() also copes with a missing timestamp
        yield tweets

Our `SimpleExampleSpider` defines some required attributes and methods. The `name` attribute is a unique identifier for our spider, `start_urls` is a list of urls to start scraping from, and `parse()` is the method that parses the page and extracts the data we want.

When the spider is run, HTTP requests are made to each of the urls in `start_urls`, and `parse()` is used as a callback method: it's called after each request with the HTTP response as an argument to extract data from the response. 

A convenient way to extract data from the response is with [CSS selectors](https://www.w3.org/TR/selectors/) and the `css()` method. In the above example, `response.css('div.content')` returns all `<div>` tags in the page that have class `content`. The next lines then select the `<span>` and `<p>` tags with the `_timestamp` and `tweet-text` classes, respectively, and select the text inside the tags with Scrapy's `::text` pseudo-element.

The `extract_first()` and `extract()` methods get strings out of the list of [`Selector`](https://doc.scrapy.org/en/latest/topics/selectors.html#selector-objects) objects returned by the `css()` method: `extract_first()` returns only the first match (or `None` if nothing matched), while `extract()` returns a list of all matches. Recent Scrapy versions also offer `get()` and `getall()` as equivalent, shorter names for these two methods.

CSS selectors offer many other ways to select elements in a page. Here are some of the more commonly used selectors.

For an HTML element of type `E`:

|Selector|Meaning|
|:---|:---|
|`E.myclass`|Specifies an element of class `myclass`|
|`E#myid`|Specifies an element with id `myid`|
|`E[attr]`|Specifies an element with attribute `attr`|
|`E[attr=val]`|Specifies an element with attribute `attr` having value `val`|
|`E C`|Specifies an element of type `C` that is a descendant of an element of type `E`|

---

## Running Spiders
Scrapy offers two main ways to run crawlers: a command line interface and a Python API. In this tutorial, we'll be using the API to run crawlers from Python scripts, but you can visit [this page](https://doc.scrapy.org/en/latest/topics/commands.html) for more information about the command line tools. 

Spiders can be run in Python by creating a [`CrawlerProcess`](https://doc.scrapy.org/en/latest/topics/api.html#scrapy.crawler.CrawlerProcess), which launches a Twisted [`reactor`](https://twistedmatrix.com/documents/current/core/howto/reactor-basics.html) to handle network communication.

In [3]:
crawler = CrawlerProcess({
    'LOG_ENABLED': False,
})

crawler.crawl(SimpleExampleSpider)
crawler.start() # blocks until done

Mar 25: Incredible to have a Chicago team in the Final Four. I’ll take that over an intact bracket any day! Congratulations to everybody  - let’s keep it going!

Mar 24: Michelle and I are so inspired by all the young people who made today’s marches happen. Keep at it. You’re leading us forward. Nothing can stand in the way of millions of voices calling for change.

Mar 19: Our most important task as a nation is to make sure all our young people can achieve their dreams. We’ve started this work with , but there’s so much more all of us have to do—government, private sector, academia & community leaders—to change the odds for our kids.

Mar 19: In Singapore with young people who are advocating for education, empowering young women, and getting involved all over Southeast Asia with a profoundly optimistic commitment to building the world they want to see.

Mar 15: 41: I like the competition. And the loyalty to the home team. - 44

Mar 15: Congrats to  and Sister Jean for a last-second up

In this example, we instantiate a `CrawlerProcess` with settings for the crawler, call `crawl()` with our spider, and `start()` the crawler. 

The constructor for `CrawlerProcess` takes in a `dict` of settings. A full list of the various settings can be found [here](https://doc.scrapy.org/en/latest/topics/settings.html). Some of the more important settings include `FEED_URI` and `FEED_FORMAT`, which can be used to specify an output file and format for your spider. Calling the constructor like this:

In [3]:
crawler = CrawlerProcess({
    'LOG_ENABLED': False,
    'FEED_URI': 'out.json',
    'FEED_FORMAT': 'json',
})

crawler.crawl(SimpleExampleSpider)
crawler.start()

would write the spider output in JSON format to a file called `out.json`. Settings for the export feed (the output of the crawler) are described on [this page](https://doc.scrapy.org/en/latest/topics/feed-exports.html#feed-exports).
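In Scrapy 2.1 and later, `FEED_URI` and `FEED_FORMAT` are deprecated in favor of a single `FEEDS` setting that maps each output path to its options. Assuming a recent Scrapy version, the settings above could be written along these lines:

```python
# Settings dict equivalent to the FEED_URI / FEED_FORMAT pair above,
# assuming Scrapy 2.1 or later, where the FEEDS setting is available.
settings = {
    'LOG_ENABLED': False,
    'FEEDS': {
        'out.json': {'format': 'json'},
    },
}

# The dict is passed to the constructor exactly as before:
# crawler = CrawlerProcess(settings)
```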

One thing to note is that crawlers cannot be started more than once in the lifetime of a process. The Twisted `reactor` that `CrawlerProcess` uses is started when `crawler.start()` is called and stopped after the crawler finishes, and a `reactor`, once stopped, cannot be restarted. This isn't an issue when running a crawler from a Python script, but Jupyter Notebooks like this one cannot run consecutive cells containing crawlers without restarting the kernel. To run the example code here, restart the kernel before each cell that contains a `CrawlerProcess`, and run those cells one at a time.
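One way around this limit, sketched below under the assumption that the crawler code has been saved to a standalone script, is to start each crawl in a fresh Python interpreter so that every run gets its own reactor. The script name `run_spider.py` is hypothetical.

```python
import subprocess
import sys

def run_in_fresh_interpreter(args, timeout=600):
    """Run `python <args>` in a new process and return its exit code."""
    completed = subprocess.run([sys.executable, *args], timeout=timeout)
    return completed.returncode

# Usage (hypothetical script containing the CrawlerProcess code shown above):
# exit_code = run_in_fresh_interpreter(['run_spider.py'])
```

Since the crawl happens in a separate process, its results have to be read back from the output file rather than from variables in the notebook.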

---

## Following Links
The example spider we've used so far has shown us how to extract data from a single page, but in order to extract data from multiple pages on a site, we'll need to follow links. Following links allows our spiders to deal with pagination and enables data collection from any page connected to one of the start urls. With Scrapy, following links is as simple as calling `response.follow()`. The next example scrapes data from IMDb about the [top 250 movies](http://www.imdb.com/search/title?groups=top_250&sort=user_rating).

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess

class IMDbSpider(scrapy.Spider):
    name = 'IMDb'
    start_urls = ['http://www.imdb.com/search/title?groups=top_250&sort=user_rating']
    
    def parse(self, response):
        for movie_link in response.css('h3.lister-item-header a::attr(href)').extract():
            yield response.follow(movie_link, callback=self.parse_movie)
        
        next_link = response.css('div.desc a.lister-page-next::attr(href)').extract_first()
        if next_link is not None:  # the last page has no "next" link
            yield response.follow(next_link, callback=self.parse)
    
    def parse_movie(self, response):
        name = response.css('h1[itemprop=name]::text').extract_first()
        date = response.css('a meta[itemprop=datePublished]::attr(content)').extract_first()
        rating = response.css('span[itemprop=ratingValue]::text').extract_first()
        rating_count = response.css('span.small[itemprop=ratingCount]::text').extract_first()
        genres = response.css('span.itemprop[itemprop=genre]::text').extract()
        content_rating = response.css('div.subtext meta[itemprop=contentRating]::attr(content)').extract_first()
        length = response.css('time[itemprop=duration]::text').extract_first()
        storyline = ''.join(response.css('div[itemprop=description] p::text').extract())
        keywords = response.css('span[itemprop=keywords]::text').extract()
        yield {
            'name': name.strip(),
            'date': date.strip(),
            'rating': rating.strip(),
            'rating_count': rating_count.strip(),
            'genres': genres,
            'content_rating': content_rating.strip(),
            'length': length.strip(),
            'storyline': storyline.strip(),
            'keywords': keywords,
        }
        
crawler = CrawlerProcess({
    'LOG_ENABLED': False,
    'FEED_URI': 'imdb_top_250.json',
    'FEED_FORMAT': 'json',
})

crawler.crawl(IMDbSpider)
crawler.start()

Here, instead of directly parsing the response passed to `parse()`, we iterate through the links to movies found in the page, which we extract using the `a::attr(href)` selector, and call `response.follow()` on each one. The `follow()` method then makes an HTTP request in the same way requests are made to start the spider, with the `callback` keyword argument specifying a method to parse the response. After following all the links to movies on the page, we follow the link to the next page of the listing, and the callback `parse()` is called again to recursively parse the rest of the pages.
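The control flow described above can be modeled without any network access. The following is a toy sketch, not Scrapy's actual scheduler: the `SITE` dictionary stands in for a website, and `crawl()` is a made-up loop that runs each callback and queues whatever it asks to follow. Real crawls run requests concurrently, so the order of results is not guaranteed the way it is here.

```python
from collections import deque

# Hypothetical site: two listing pages that link to three movie pages.
SITE = {
    '/list/1': {'movies': ['/movie/a', '/movie/b'], 'next': '/list/2'},
    '/list/2': {'movies': ['/movie/c'], 'next': None},
    '/movie/a': {'name': 'A'},
    '/movie/b': {'name': 'B'},
    '/movie/c': {'name': 'C'},
}

def parse(page):
    # Ask for every movie page, then for the next listing page if there is one.
    for link in page['movies']:
        yield ('follow', link, parse_movie)
    if page['next'] is not None:
        yield ('follow', page['next'], parse)

def parse_movie(page):
    yield ('item', {'name': page['name']})

def crawl(start_url):
    """Toy scheduler: fetch a page, run its callback, queue what it yields."""
    queue = deque([(start_url, parse)])
    seen = {start_url}
    items = []
    while queue:
        url, callback = queue.popleft()
        for result in callback(SITE[url]):
            if result[0] == 'follow':
                _, link, next_callback = result
                if link not in seen:  # skip pages that were already requested
                    seen.add(link)
                    queue.append((link, next_callback))
            else:
                items.append(result[1])
    return items

names = [item['name'] for item in crawl('/list/1')]
print(names)  # ['A', 'B', 'C']
```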

The output of the above example, a file named 'imdb_top_250.json', should look something like this:
```
[
    {
        "name": "The Matrix",
        "date": "1999-03-31",
        "rating": "8.7",
        "rating_count": "1,388,432",
        "genres": [
            "Action",
            "Sci-Fi"
        ],
        "content_rating": "R",
        "length": "2h 16min",
        "storyline": "Thomas A. Anderson is a man living two lives. By day he is an average computer programmer and by night a hacker known as Neo. Neo has always questioned his reality, but the truth is far beyond his imagination. Neo finds himself targeted by the police when he is contacted by Morpheus, a legendary computer hacker branded a terrorist by the government. Morpheus awakens Neo to the real world, a ravaged wasteland where most of humanity have been captured by a race of machines that live off of the humans' body heat and electrochemical energy and who imprison their minds within an artificial reality known as the Matrix. As a rebel against the machines, Neo must return to the Matrix and confront the agents: super-powerful computer programs devoted to snuffing out Neo and the entire human rebellion.",
        "keywords": [
            "artificial reality",
            "simulated reality",
            "post apocalypse",
            "questioning reality",
            "war with machines"
        ]
    },
    {
        "name": "One Flew Over the Cuckoo's Nest",
        "date": "1975-11-19",
        "rating": "8.7",
        "rating_count": "771,037",
        "genres": [
            "Drama"
        ],
        "content_rating": "R",
        "length": "2h 13min",
        "storyline": "McMurphy has a criminal past and has once again gotten himself into trouble and is sentenced by the court. To escape labor duties in prison, McMurphy pleads insanity and is sent to a ward for the mentally unstable. Once here, McMurphy both endures and stands witness to the abuse and degradation of the oppressive Nurse Ratched, who gains superiority and power through the flaws of the other inmates. McMurphy and the other inmates band together to make a rebellious stance against the atrocious Nurse.",
        "keywords": [
            "mental institution",
            "escape",
            "evil woman",
            "psychiatric examination",
            "mental illness"
        ]
    },
    ....
]
```
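Every scraped value arrives as a string, so some cleanup is usually needed before analysis. The sketch below works on two trimmed records in the same shape as the output above. The `to_minutes` helper is made up for this example and assumes lengths always look like `2h 16min`.

```python
import json
import re

# Two trimmed records in the same shape as imdb_top_250.json. In practice the
# text would be read from that file instead of being written inline.
raw = """
[
  {"name": "The Matrix", "rating": "8.7", "rating_count": "1,388,432", "length": "2h 16min"},
  {"name": "One Flew Over the Cuckoo's Nest", "rating": "8.7", "rating_count": "771,037", "length": "2h 13min"}
]
"""

def to_minutes(length):
    """Convert a string such as '2h 16min' to a number of minutes."""
    match = re.fullmatch(r'(?:(\d+)h)?\s*(?:(\d+)min)?', length.strip())
    if match is None:
        raise ValueError('unexpected length format: %r' % length)
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return 60 * hours + minutes

movies = [
    {
        'name': m['name'],
        'rating': float(m['rating']),
        'rating_count': int(m['rating_count'].replace(',', '')),
        'minutes': to_minutes(m['length']),
    }
    for m in json.loads(raw)
]

print(movies[0]['rating_count'], movies[0]['minutes'])  # 1388432 136
```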

---

## Making Requests and Using APIs
So far, we've seen how requests are made to the targets in `start_urls` and to links passed as arguments to `response.follow()`, but Scrapy also allows us to make custom requests with more options. The [`scrapy.http.Request`](https://doc.scrapy.org/en/latest/topics/request-response.html#request-objects) object has attributes and methods that can be used to attach data to each request. This data can be in the form of request headers, cookies, or a Python `dict` and is stored in the `headers`, `cookies`, and `meta` attributes, respectively, attached to the `Request` object.

To use requests with data attached, we'll need to make requests a little differently from before. Instead of using `start_urls` and `response.follow()` to automatically make requests, we'll need to create our own request objects. To demonstrate the differences, we'll modify the `IMDbSpider` from the previous section to manually create `Request` objects:

In [5]:
class ModifiedIMDbSpider(IMDbSpider): # extends IMDbSpider to inherit parse_movie method
    name = 'modifiedIMDb'
    
    # start_urls = ['http://www.imdb.com/search/title?groups=top_250&sort=user_rating']
    # becomes this method
    def start_requests(self):
        urls = ['http://www.imdb.com/search/title?groups=top_250&sort=user_rating']
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)
    
    def parse(self, response):
        for movie_link in response.css('h3.lister-item-header a::attr(href)').extract():
            # yield response.follow(movie_link, callback=self.parse_movie)
            # becomes
            link = response.urljoin(movie_link)
            yield scrapy.Request(link, callback=self.parse_movie)
        
        next_link = response.css('div.desc a.lister-page-next::attr(href)').extract_first()
        # yield response.follow(next_link, callback=self.parse)
        # becomes
        if next_link is not None:  # the last page has no "next" link
            next_link = response.urljoin(next_link)
            yield scrapy.Request(next_link, callback=self.parse)

A key difference between calling `response.follow()` and manually creating `Request`s is that the url passed to the `Request` constructor should be the result of a call to `response.urljoin()`. The `urljoin()` method converts relative urls to absolute urls, which is necessary because a `Request` cannot resolve relative urls on its own.
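The resolution step can be previewed with the standard library's `urljoin`, which behaves much like `response.urljoin()` when given the page's url as the base. The `href` values below are hypothetical examples of what a link might contain.

```python
from urllib.parse import urljoin

# URL of the page being parsed (the spider's start url).
page_url = 'http://www.imdb.com/search/title?groups=top_250&sort=user_rating'

# Hypothetical href values of the kind found in <a> tags.
absolute_path = urljoin(page_url, '/title/tt0000001/')
query_only = urljoin(page_url, '?groups=top_250&page=2')
already_absolute = urljoin(page_url, 'https://example.com/other')

print(absolute_path)     # http://www.imdb.com/title/tt0000001/
print(query_only)        # http://www.imdb.com/search/title?groups=top_250&page=2
print(already_absolute)  # https://example.com/other
```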

Running the `ModifiedIMDbSpider` with these modifications should produce the exact same output as above.

---

Now that we've seen how to create `Request`s, we can add headers to our requests and supplement our web scraping with calls to a web API. The next example uses the Yelp API to search for businesses and scrapes each business's page for reviews. It includes an API key in the HTTP request headers for requests made to API endpoints and uses the `meta` attribute of `Request` and `Response` objects to pass information between the two parse methods.

In [3]:
import scrapy
from scrapy.crawler import CrawlerProcess
import json

class YelpSpider(scrapy.Spider):
    name = 'yelp'
    url = 'https://api.yelp.com/v3/businesses/search?categories=restaurants&location='
    offset = 0

    def start_requests(self):
        self.url += self.location
        self.headers = {'Authorization': 'Bearer %s' % self.api_key,}
        yield scrapy.Request(self.url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        response_json = json.loads(response.text) # Yelp API returns a json response
        total = response_json['total']
        businesses = response_json['businesses']
        self.offset += len(businesses)
        for business in businesses:
            business_request = scrapy.Request(business['url'], callback=self.parse_business)
            business_request.meta['info'] = business # store the API response in the request
            yield business_request
        # request the next page only while the API still has results to return
        if businesses and self.offset < min(self.max_results, total):
            yield scrapy.Request(self.url + '&offset=' + str(self.offset),
                                 headers=self.headers, callback=self.parse)

    def parse_business(self, response):
        reviews = []
        for review in response.css('div.review-content'):
            rating = review.css('div.i-stars.rating-large::attr(title)').extract_first(default='')[:3]
            date = review.css('span.rating-qualifier::text').extract_first()
            text = ''.join(review.css('p::text').extract())
            reviews.append({
                'rating': rating,
                'date': date,
                'text': text
            })
            
        business_info = response.meta['info'] # extract the API response for the business
        yield {'info': business_info, 'reviews': reviews,} # output both the API response and the scraped reviews

When we run the spider below, we pass in the location, API key, and a limit on the number of results as arguments. These arguments are converted to attributes of our spider by the default `__init__()` method of `scrapy.Spider`. More information on passing arguments to spiders can be found [here](https://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments).
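As a rough stand-in for that behavior, the class below copies its keyword arguments onto the instance in the same spirit as `scrapy.Spider`. The class name `ArgsDemo` is made up for this sketch and is not part of Scrapy.

```python
class ArgsDemo:
    """Stand-in that mimics how scrapy.Spider stores keyword arguments."""

    def __init__(self, name=None, **kwargs):
        self.name = name
        # Each keyword argument becomes an attribute of the instance.
        self.__dict__.update(kwargs)

spider = ArgsDemo(name='yelp', location='Polish Hill, Pittsburgh', max_results=60)
print(spider.location)     # Polish Hill, Pittsburgh
print(spider.max_results)  # 60

# Arguments given on the command line with -a arrive as strings, so a
# numeric limit would need an explicit conversion before comparisons.
max_results = int('60')
```

When arguments are passed from Python, as in this tutorial, they keep their original types, so `max_results=60` stays an integer.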

In [4]:
def get_key(path):
    with open(path, 'r') as file:
        return file.read().replace('\n','')

api_key = get_key('api_key.txt')

crawler = CrawlerProcess({
    'LOG_ENABLED': False,
    'FEED_URI': 'yelp.json',
    'FEED_FORMAT': 'json',
})

crawler.crawl(YelpSpider, location='Polish Hill, Pittsburgh', api_key=api_key, max_results=60)
crawler.start()

The output of the above code, located in 'yelp.json', should be something like this:
```
    [
        {
        "info": {
            "id": "mount-everest-sushi-pittsburgh",
            "name": "Mount Everest Sushi",
            "image_url": "https://s3-media4.fl.yelpcdn.com/bphoto/Vx_hhEFamDFpnTF-lnG_aQ/o.jpg",
            "is_closed": false,
            "url": "https://www.yelp.com/biz/mount-everest-sushi-pittsburgh?adjust_creative=XWzn1hOw6xktvH04lFxiXA&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=XWzn1hOw6xktvH04lFxiXA",
            "review_count": 125,
            "categories": [
                {
                    "alias": "asianfusion",
                    "title": "Asian Fusion"
                },
    ...
        },
        "reviews": [
            {
                "rating": "5.0",
                "date": "\n        3/2/2018\n            ",
                "text": "I called to order and as busy as they were (everyone was scrambling around trying to get orders done), they had my order done in the ten minutes from when I called to the time it took me to walk over there! It's pretty impressive considering their phones were ringing off the hook order after order.I ordered 2 house special poké bowls and an order of sushi tacos (which came with 2 tacos) and miso soup. They've raised their prices a bit since I've been there last and compared to the menu photos here on yelp so it's a little more pricey (I think my total came to around $41) but still, you get a ton of sashimi and it tastes super fresh. They were all packed with sashimi, especially the sushi tacos, it was amazing. Delicious too!"
            },
            {
                "rating": "4.0",
                "date": "\n        3/27/2018\n    ",
                "text": "Don't be fooled by the small/sketchy appearance when you first walk in. They actually have a pretty decent seating area upstairs. You order your food at the counter in the entrance and can then head upstairs to enjoyI had the salmon poke bowl, and I thought it was delicious!"
            },
    ...
```
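`YelpSpider` builds its search url by concatenating strings, which leaves the escaping of characters such as spaces and commas to be handled elsewhere. A more defensive variant, sketched here with the standard library, encodes the query parameters explicitly. The helper name `build_search_url` is made up for this example.

```python
from urllib.parse import urlencode

API_URL = 'https://api.yelp.com/v3/businesses/search'

def build_search_url(location, offset=0, categories='restaurants'):
    """Return a search url with explicitly encoded query parameters."""
    params = {'categories': categories, 'location': location}
    if offset:
        params['offset'] = offset
    return API_URL + '?' + urlencode(params)

print(build_search_url('Polish Hill, Pittsburgh', offset=20))
# https://api.yelp.com/v3/businesses/search?categories=restaurants&location=Polish+Hill%2C+Pittsburgh&offset=20
```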

## Summary and Additional Resources
This tutorial covered the basics of web scraping with Python scripts using Scrapy. You can learn more about Scrapy, or topics related to web scraping in general, from these links:
* [Scrapy's official documentation](https://doc.scrapy.org/en/latest/index.html) 
* [CSS selectors](https://www.w3.org/TR/selectors/)
* [XPath expressions](https://www.w3.org/TR/xpath/), a powerful alternative to CSS selectors in Scrapy
* [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), another library for parsing web pages
