# Advanced Web Scraping - Scrapy 

## What you'll learn in this course 

Now that you have learned how to parse HTML pages, it is time to go to the next level and scrape websites automatically. The best way to do so is to use spiders from Scrapy. In this course, you'll learn how to: 

* Create basic crawlers 
* Target specific tags and attributes in a webpage 
* Follow links 
* Simulate user log-in
* Create Item Pipelines 
* Deploy Scrapy to scrapinghub

## Install Scrapy 🚧🚧

Scrapy is usually not pre-installed in your environment. We strongly advise you to install it in a virtual environment. To do so, make sure you have `virtualenv` installed. If you don't, run: 

```shell
pip install virtualenv
```

Then create a directory: 

```shell
mkdir scrapy
```
```shell
cd scrapy
```

Then create a virtual environment within this directory: 

```shell
virtualenv env
```
```shell
source env/bin/activate
```

(NB: You should see a `(env)` in your terminal next to your username if everything worked well)

You can now run:

```shell
pip install Scrapy
```

This installs `scrapy` within your virtual environment. Once you are done playing around with the tool, you can run: 

```shell 
deactivate
```

This will deactivate your environment. 

## Explore with Scrapy shell 🔍

Let's take the website that scrapy uses in its documentation: http://quotes.toscrape.com/
To help you select the right element, you can use scrapy shell:   
```shell
scrapy shell "https://quotes.toscrape.com"
```
You can start by typing `view(response)` to visualize the website you'll be scraping. It should open an HTML page. 

#### Select elements from a webpage 

If you want to select elements from a webpage, you can do it in two ways: 

1. With CSS selectors 
2. With XPaths 

CSS selectors are easier to use, even though Scrapy encourages you to learn XPath (we've provided a crash course on XPath in the appendix of the course).

Let's select all the quotes from our page. We can do it simply by typing: 

```shell 
response.css('.text').getall()
```

As you can see, we selected the class `text`, which corresponds to the HTML tag where the quotes are. Then we called the `.getall()` method to get all the elements in the webpage that have the class `text`. 

The output should look like this: 

```shell 
['<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>', '<span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>', '<span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>', '<span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>', '<span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it\'s better to be absolutely ridiculous than absolutely boring.”</span>', '<span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>', '<span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>', '<span class="text" itemprop="text">“I have not failed. I\'ve just found 10,000 ways that won\'t work.”</span>', '<span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it\'s in hot water.”</span>', '<span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>']
```

If you don't want to get the HTML tag, but simply the text, you can specify it in your `.css()` method. 

```shell
response.css('.text::text').getall()
```

The output should look like this: 

```shell 
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", '“Try not to become a man of success. Rather become a man of value.”', '“It is better to be hated for what you are than to be loved for what you are not.”', "“I have not failed. I've just found 10,000 ways that won't work.”", "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", '“A day without sunshine is like, you know, night.”']
```

An equivalent in XPath would be: 

```shell 
response.xpath('//span[@class="text"]/text()').getall()
```

This XPath looks for every `span` in the HTML page whose `class` attribute is `text` and gets its text. 
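
Note that `//span` on its own matches every `<span>` on a page, not only the ones holding quotes; an attribute predicate such as `[@class="text"]` narrows the selection. Here is a minimal sketch of the difference, using the standard library's `xml.etree.ElementTree` (which supports a small subset of XPath) on a made-up, well-formed snippet rather than the real page:

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet that mimics the layout of two quotes (not the real page).
HTML = """
<div>
  <div class="quote">
    <span class="text">First quote</span>
    <span>by <small class="author">Author A</small></span>
  </div>
  <div class="quote">
    <span class="text">Second quote</span>
    <span>by <small class="author">Author B</small></span>
  </div>
</div>
"""

root = ET.fromstring(HTML)

# Every <span>, including the "by ..." spans
all_spans = [span.text for span in root.findall(".//span")]

# Only the spans whose class attribute is "text"
quote_spans = [span.text for span in root.findall(".//span[@class='text']")]

print(all_spans)    # ['First quote', 'by ', 'Second quote', 'by ']
print(quote_spans)  # ['First quote', 'Second quote']
```

In the Scrapy shell you would write the same predicate inside `response.xpath(...)`.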

#### Select an attribute of an HTML tag 

Sometimes you don't want the text contained in an HTML tag but rather one of its attributes. For example, you might want to get the URL of a link. You can do so by using the `.attrib` attribute. 



```shell 
response.css("small+a").attrib['href']
```

You should get the following output:

```
'/author/Albert-Einstein'
```

If you want to get the full URL, you can do: 

```shell 
response.urljoin(response.css("small+a").attrib["href"])
```

You would get the following output: 

```shell 
'https://quotes.toscrape.com/author/Albert-Einstein'
```
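
`response.urljoin` resolves a relative link against the URL of the page, much like `urllib.parse.urljoin` from the standard library. A minimal sketch outside of Scrapy, assuming the page URL used in the shell command above:

```python
from urllib.parse import urljoin

# URL of the page we opened in the Scrapy shell
page_url = "https://quotes.toscrape.com/"

# Relative link as returned by .attrib["href"]
relative_link = "/author/Albert-Einstein"

full_url = urljoin(page_url, relative_link)
print(full_url)  # https://quotes.toscrape.com/author/Albert-Einstein

# A link that is already absolute is returned unchanged
absolute = urljoin(page_url, "https://example.com/about")
print(absolute)  # https://example.com/about
```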




## Create your first project 💼💼

To create a project with Scrapy, go to your terminal and simply run: 

```shell 
scrapy startproject my_first_project
```

You should see the following output: 

```shell 
New Scrapy project 'my_first_project', using template directory '/Users/antoinekrajnc/Desktop/test_scrapy/virt/lib/python3.7/site-packages/scrapy/templates/project', created in:
    /Users/antoinekrajnc/Desktop/test_scrapy/my_first_project

You can start your first spider with:
    cd my_first_project
    scrapy genspider example example.com
```

`cd` into your project, then into the inner `my_first_project` package, and you should see the following tree: 

```shell 
.
├── __init__.py
├── __pycache__
├── items.py
├── middlewares.py
├── pipelines.py
├── settings.py
└── spiders
    ├── __init__.py
    └── __pycache__

```

We will learn the purpose of the main files as we progress in this course. For the moment, we'll simply be working inside the `spiders` folder. 

## Working with Spiders 🕷️🕷️

Spiders are the crawlers that will actually scrape data from a website.

### Basic Example 

Run 
```shell
scrapy genspider my_first_spider "https://quotes.toscrape.com"
```
It creates a new file `my_first_spider.py` within the `spiders` folder: 

```shell
├── __init__.py
├── items.py
├── middlewares.py
├── pipelines.py
├── settings.py
└── spiders
    ├── __init__.py
    └── my_first_spider.py
```

You can modify `my_first_spider.py` with the code below:

```python
import scrapy


class QuotesSpider(scrapy.Spider):

    # Name of your spider
    name = "quotes"

    # URL to start your spider from
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    # Callback that gets text, author and tags of the webpage
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Select the link of the NEXT button and store it in next_page
        # (.get() returns None when there is no NEXT button, i.e. on the last page)
        next_page = response.css('li.next a::attr(href)').get()

        # Check if next_page exists
        if next_page is not None:
            # Follow the next page and use the callback parse
            yield response.follow(next_page, callback=self.parse)
```

#### Run Spider 

Now if you want to run your spider, the simplest way is to run: 

```shell 
scrapy crawl quotes -o quotes.json
```

Once the command has finished, you should see a file called `quotes.json` that contains: 

```json 
[
{"text": "\u201cThis life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.\u201d", "author": "Marilyn Monroe", "tags": ["friends", "heartbreak", "inspirational", "life", "love", "sisters"]},
{"text": "\u201cIt takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.\u201d", "author": "J.K. Rowling", "tags": ["courage", "friends"]},
{"text": "\u201cIf you can't explain it to a six year old, you don't understand it yourself.\u201d", "author": "Albert Einstein", "tags": ["simplicity", "understand"]},
{"text": "\u201cYou may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect\u2014you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break\u2014her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.\u201d", "author": "Bob Marley", "tags": ["love"]},
{"text": "\u201cI like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.\u201d", "author": "Dr. Seuss", "tags": ["fantasy"]},
{"text": "\u201cI may not have gone where I intended to go, but I think I have ended up where I needed to be.\u201d", "author": "Douglas Adams", "tags": ["life", "navigation"]},
{"text": "\u201cThe opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.\u201d", "author": "Elie Wiesel", "tags": ["activism", "apathy", "hate", "indifference", "inspirational", "love", "opposite", "philosophy"]},
{"text": "\u201cIt is not a lack of love, but a lack of friendship that makes unhappy marriages.\u201d", "author": "Friedrich Nietzsche", "tags": ["friendship", "lack-of-friendship", "lack-of-love", "love", "marriage", "unhappy-marriage"]},
{"text": "\u201cGood friends, good books, and a sleepy conscience: this is the ideal life.\u201d", "author": "Mark Twain", "tags": ["books", "contentment", "friends", "friendship", "life"]},
{"text": "\u201cLife is what happens to us while we are making other plans.\u201d", "author": "Allen Saunders", "tags": ["fate", "life", "misattributed-john-lennon", "planning", "plans"]},
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": ["abilities", "choices"]},
{"text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein", "tags": ["inspirational", "life", "live", "miracle", "miracles"]},
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen", "tags": ["aliteracy", "books", "classic", "humor"]},
{"text": "\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d", "author": "Marilyn Monroe", "tags": ["be-yourself", "inspirational"]},
{"text": "\u201cTry not to become a man of success. Rather become a man of value.\u201d", "author": "Albert Einstein", "tags": ["adulthood", "success", "value"]},
{"text": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d", "author": "Andr\u00e9 Gide", "tags": ["life", "love"]},
{"text": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d", "author": "Thomas A. Edison", "tags": ["edison", "failure", "inspirational", "paraphrased"]},
{"text": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d", "author": "Eleanor Roosevelt", "tags": ["misattributed-eleanor-roosevelt"]},
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin", "tags": ["humor", "obvious", "simile"]}
]
```

Now here, pay attention to the different elements in this command: 

* `scrapy crawl` tells scrapy to start running a spider 

* `quotes` is the name of the spider we specified when building our class 

* `-o quotes.json` stands for output in a file called `quotes.json`
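
Once the export exists, you can load it back with the standard `json` module. The sketch below assumes records shaped like the ones above (`text`, `author`, `tags`); the two shortened sample records and the helper name `count_by_author` are only there for illustration.

```python
import json
from collections import Counter

# Two records shaped like the spider's output (curly quotes omitted for brevity).
SAMPLE_EXPORT = """
[
{"text": "Try not to become a man of success. Rather become a man of value.", "author": "Albert Einstein", "tags": ["adulthood", "success", "value"]},
{"text": "A day without sunshine is like, you know, night.", "author": "Steve Martin", "tags": ["humor", "obvious", "simile"]}
]
"""


def count_by_author(items):
    """Count how many quotes each author has in a list of scraped items."""
    return Counter(item["author"] for item in items)


# With a real export you would read the file instead:
#     with open("quotes.json", encoding="utf-8") as f:
#         items = json.load(f)
items = json.loads(SAMPLE_EXPORT)

counts = count_by_author(items)
print(counts.most_common())
```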


And that's it! 🤩 You have scraped data from the web. Now let's check out more complex features. 

Let's break down this code a little bit:

```python
import scrapy


class QuotesSpider(scrapy.Spider):

    # Name of your spider
    name = "quotes"

    # Url to start your spider from 
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    # Callback that gets text, author and tags of the webpage
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

```

There are now several things to understand in this code: 

* `QuotesSpider(scrapy.Spider)`: We create a class that inherits from `scrapy.Spider`. This allows us to use all of its methods and attributes, and therefore to create our spider. 

* `name = "quotes"`: This is the name of your spider. It is the name you pass to `scrapy crawl`. 

* `start_urls`: Must be a list of URLs, because `scrapy` needs to receive an iterable here. 

* `parse`: this function is what we call a [`callback`](https://stackoverflow.com/questions/1319074/parallel-python-what-is-a-callback): it is the function that actually does the job of getting the data. 

* `yield`: a keyword commonly used when you need to produce a long series of values one at a time. Please check out our appendix if you want to learn more. 

In this code, we create dictionaries that store: 

* The text of a quote 
* Its author 
* The tags 

## Following links 

Very often, you want all the pages of a website. In this case, you won't be able to hard-code each page you want your crawler to visit. You will need to follow links. Here is how you'd do it. 

A lot of code should be familiar by now. The only thing new is the following: 

```python 
        # Select the link of the NEXT button and store it in next_page
        # (.get() returns None when there is no NEXT button, i.e. on the last page)
        next_page = response.css('li.next a::attr(href)').get()

        # Check if next_page exists
        if next_page is not None:
            # Follow the next page and use the callback parse
            yield response.follow(next_page, callback=self.parse)
```


* `response.css('li.next a::attr(href)').get()` gets the link to the next page of the website, or `None` if there is no NEXT button 

* `if next_page is not None:` means that we execute the code only if another page exists

* `yield response.follow(next_page, callback=self.parse)` means that we follow the link stored in `next_page` and, once the spider is on that page, it executes the function passed in the `callback` argument, i.e. `parse`. `parse` is therefore called again on each new page, once for every page of the website, until there is no NEXT button left. 
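
To make the control flow concrete, here is a network-free sketch of the same "parse, then follow the next link" pattern. Everything in it is a stand-in: `FAKE_SITE`, `parse` and `crawl` are hypothetical names, and the small queue only imitates the idea that followed links are scheduled and handed back to the callback later.

```python
from collections import deque

# Hypothetical stand-in for a paginated website: each "page" lists its
# quotes and the relative URL of the next page (None on the last page).
FAKE_SITE = {
    "/page/1/": {"quotes": ["quote A", "quote B"], "next": "/page/2/"},
    "/page/2/": {"quotes": ["quote C"], "next": "/page/3/"},
    "/page/3/": {"quotes": ["quote D"], "next": None},
}


def parse(url):
    """Yield the items found on one page, then the next URL to visit (if any)."""
    page = FAKE_SITE[url]
    for text in page["quotes"]:
        yield {"text": text}
    if page["next"] is not None:
        yield page["next"]  # plays the role of response.follow(...)


def crawl(start_url):
    """Tiny scheduler: pop a URL, run the callback, queue any new URLs."""
    pending = deque([start_url])
    items = []
    while pending:
        for result in parse(pending.popleft()):
            if isinstance(result, dict):
                items.append(result)    # scraped item
            else:
                pending.append(result)  # link to follow later
    return items


items = crawl("/page/1/")
print(items)
```

The callback runs once per page and stops producing new links on the last page, which is why the crawl ends on its own.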

We can even shorten that code by combining `.follow_all` with `yield from`:

```python
import scrapy


class QuotesSpider(scrapy.Spider):

    # Name of your spider
    name = "quotes"

    # URL to start your spider from
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    # Callback that gets text, author and tags of the webpage
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Follow every link matched by the selector (none on the last page)
        yield from response.follow_all(css="li.next a", callback=self.parse)
```

## Simulate user login 

Very often, you will need to log into a given website before being able to scrape it. A convenient way to do so is to use `FormRequest.from_response`, which fills in and submits the login form for you. 

```python
import scrapy


class Login(scrapy.Spider):
    # Name of your spider
    name = "login"

    # Starting URL
    start_urls = ['http://quotes.toscrape.com/login']

    # Parse function for login
    def parse(self, response):
        # FormRequest used to log in
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    # Callback used after login
    def after_login(self, response):

        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        yield from response.follow_all(css="li.next a", callback=self.after_login)
```
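
Part of what `FormRequest.from_response` does for you is pre-fill the request with the fields already present in the page's form (for example a hidden token), then apply your `formdata` on top. The sketch below imitates only that merging step with the standard library; the HTML form, the `csrf_token` value and the helper names are hypothetical, and Scrapy's real implementation handles many more cases.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect the name/value pairs of the <input> tags found in some HTML."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        # Inputs without a name (such as the submit button here) are ignored
        if name:
            self.fields[name] = attributes.get("value") or ""


# Hypothetical login form, loosely modelled on a typical login page
LOGIN_FORM = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Login">
</form>
"""


def build_formdata(html, overrides):
    """Start from the fields found in the form, then apply our own values."""
    collector = FormFieldCollector()
    collector.feed(html)
    return {**collector.fields, **overrides}


payload = build_formdata(LOGIN_FORM, {"username": "john", "password": "secret"})
print(payload)
# {'csrf_token': 'abc123', 'username': 'john', 'password': 'secret'}
```

This is why the spider only needs to pass the username and password: any other field in the form is carried over as is.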

## AutoThrottle extension

As you become a pro in web scraping, you may have a lot of requests to make. If websites are well protected, they might ban you from making any more requests. 

One way to avoid that is to automatically slow down your requests by using the AutoThrottle extension. 

As the documentation states, AutoThrottle extension is designed to: 

* *be nicer to sites instead of using default download delay of zero*

* *automatically adjust Scrapy to the optimum crawling speed, so the user doesn’t have to tune the download delays to find the optimum one. The user only needs to specify the maximum concurrent requests it allows, and the extension does the rest.*

If you want to be able to use the AutoThrottle Extension, you will need to edit your `settings.py` within your project. 

```shell 
├── __init__.py
├── items.py
├── middlewares.py
├── pipelines.py
├── quotes.json
├── settings.py
└── spiders
    ├── __init__.py
    ├── login.py
    └── my_first_spider.py
```

In particular, you will need to edit: 

* `AUTOTHROTTLE_ENABLED`

* `AUTOTHROTTLE_START_DELAY`

* `AUTOTHROTTLE_MAX_DELAY`

* `AUTOTHROTTLE_TARGET_CONCURRENCY`

* `AUTOTHROTTLE_DEBUG`

* `CONCURRENT_REQUESTS_PER_DOMAIN`

* `CONCURRENT_REQUESTS_PER_IP`

* `DOWNLOAD_DELAY`


In your `settings.py`, simply uncomment the above variables and the AutoThrottle extension will be enabled. 
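
As a rough guide, an excerpt of `settings.py` with AutoThrottle switched on could look like the sketch below. The values are illustrative assumptions, not recommendations; tune them for the site you are crawling.

```python
# settings.py (excerpt), illustrative values only

# Turn the AutoThrottle extension on
AUTOTHROTTLE_ENABLED = True

# Initial download delay, in seconds
AUTOTHROTTLE_START_DELAY = 5

# Maximum download delay, in seconds, used when latency is high
AUTOTHROTTLE_MAX_DELAY = 60

# Average number of requests to send in parallel to each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Log throttling stats for every response received (handy while tuning)
AUTOTHROTTLE_DEBUG = True

# Lower bound for the delay: AutoThrottle will not go below this value
DOWNLOAD_DELAY = 1
```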

## Feed Exports 

One extremely useful feature is having control over how and where your output data is stored. To do this, you will use what Scrapy calls `FEEDS`.

In `settings.py`, add another parameter called `FEEDS`. It should have the following structure: 

```python 
FEEDS = {
    "URI_or_FILE": {
        "format": "format_of_file",
        "encoding":"encoding_format",
        "fields": "fields_you_want_to_export",
        "indent": "indentation_space_number",
        "store_empty": "boolean_to_export_empty_feeds"
        
    }
}
```

For example, you can have: 

```python 
FEEDS = {
    "items.json": {
        
        'format': 'json',
        'encoding': 'utf8',
        'store_empty': False,
        'fields': None,
        'indent': 4,
    }
}
```

Once this is set, each time you run a spider without specifying an output, the scraped data will be exported there. 
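
`FEEDS` can hold several entries, so a single run can write more than one file. A hedged sketch of a `settings.py` excerpt follows; the file names and the selected fields are assumptions based on the quotes spider above.

```python
# settings.py (excerpt), hypothetical file names
FEEDS = {
    # Full items, pretty-printed JSON
    "quotes.json": {
        "format": "json",
        "encoding": "utf8",
        "indent": 4,
    },
    # Only two columns, exported as CSV
    "quotes.csv": {
        "format": "csv",
        "fields": ["text", "author"],
    },
}
```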


## Deploy your Spiders 

Finally, deploying your spiders is straightforward. You will need to: 

1. Create an account on https://app.scrapinghub.com/

2. Create a project in your account 

3. Install shub: `pip install shub`

4. Run `shub login` and provide the API key that was given to you when you created a repository in your account. 

5. Run `shub deploy project_name` 

6. That's it!




You can now run spiders directly from your account or simply through an API call. Check out the documentation for more information. 

https://doc.scrapinghub.com/scrapy-cloud.html

## Appendix 1 - What is the `yield` keyword for?

You might have noticed that we used the `yield` keyword in Scrapy, which may be new and confusing. 

In a nutshell, `yield` is a very useful keyword for returning a series of values without taking up too much of the machine's memory. 

Let's look at an example with two functions: 

```python
# Simple function with return keyword
# (note that it doubles the values of a_list in place)
def return_list(a_list):
    for i in range(len(a_list)):
        a_list[i] *= 2
    return a_list

# Function with yield keyword
def return_with_yield(a_list):
    for i in range(len(a_list)):
        yield a_list[i] * 2
```


Now let's apply these two functions to a `range_list`

```python
range_list = [x for x in range(10)]
```

```python
# Returns a list
return_list(range_list)
```

```
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

```python
# Function with yield
return_with_yield(range_list)
```

```
<generator object return_with_yield at 0x7fd5dc411a98>
```

As you can see, the full list is returned in the first example whereas the second example returned a `generator`. Generators are useful because the loop has not actually been executed yet: values are computed one at a time, only when they are requested. Therefore, very little memory is used. 

So if, instead of a list of 10 items, you had 1,000,000 items, it would make a huge difference in terms of memory usage. 
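
A quick way to see the difference is to compare the size of a fully built list with the size of a generator over the same values. This is a minimal sketch; `doubled` is a hypothetical helper, and `sys.getsizeof` only measures the container object itself, not the integers it refers to.

```python
import sys


def doubled(values):
    """Yield each value multiplied by two, one at a time."""
    for value in values:
        yield value * 2


numbers = range(1_000_000)

# The list holds a reference to every one of the million results
as_list = [n * 2 for n in numbers]

# The generator has not computed anything yet
as_generator = doubled(numbers)

print(sys.getsizeof(as_list))       # several megabytes
print(sys.getsizeof(as_generator))  # a few hundred bytes at most

# Values are produced only when we ask for them
first_three = [next(as_generator) for _ in range(3)]
print(first_three)  # [0, 2, 4]
```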

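Now if you need to get the actual values of your generator, you can simply iterate over it, for example with a list comprehension. (The values below are doubled a second time because `return_list` already modified `range_list` in place.)

```python
[i for i in return_with_yield(range_list)]
```

```
[0, 4, 8, 12, 16, 20, 24, 28, 32, 36]
```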

If you simply need to yield from a list without doing any manipulation, you can use `yield from` instead of creating a loop. 

```python
# yield from
def return_with_yield(a_list):
    yield from a_list
```

```python
# Iterate through the generator 
[i * 2 for i in return_with_yield(range_list)]
```

```
[0, 4, 8, 12, 16, 20, 24, 28, 32, 36]
```
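
`yield from` is also handy when a generator has to pass along the values of several inner iterables, which is close to how `yield from response.follow_all(...)` hands every followed request back to Scrapy. A small sketch with made-up data (`quotes_from_pages` and `pages` are hypothetical):

```python
def quotes_from_pages(pages):
    """Yield every quote of every page, one page after the other."""
    for page in pages:
        # Delegate to the inner iterable instead of writing a nested loop
        yield from page


# Three "pages" of quotes; the last one is empty
pages = [["quote A", "quote B"], ["quote C"], []]

all_quotes = list(quotes_from_pages(pages))
print(all_quotes)  # ['quote A', 'quote B', 'quote C']
```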

## Appendix 2 - Crash course on XPath 

Even though we can use CSS selectors, it is worth learning XPath. 

The best way to learn XPath is to follow this great tutorial from https://Zvon.org :

* Start here : http://zvon.org/comp/r/tut-XPath_1.html#Pages~List_of_XPaths

## Resources 

* [FormRequest.from_response()](https://docs.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-userlogin)
* [Downloading and processing files](https://docs.scrapy.org/en/latest/topics/media-pipeline.html)
* [Avoid Getting Banned](https://docs.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned)
* [What's YIELD in Python](https://www.youtube.com/watch?v=akqjaqUzdnA)
* [When to use yields in python](https://www.geeksforgeeks.org/use-yield-keyword-instead-return-keyword-python/#:~:text=When%20to%20use%20yield%20instead%20of%20return%20in%20Python%3F,where%20it%20is%20left%20off.&text=If%20the%20body%20of%20a,automatically%20becomes%20a%20generator%20function.)
* [In practice, what are the main uses for the new “yield from” syntax in Python 3.3?](https://stackoverflow.com/questions/9708902/in-practice-what-are-the-main-uses-for-the-new-yield-from-syntax-in-python-3)
* [AutoThrottle Extension](https://docs.scrapy.org/en/latest/topics/autothrottle.html)
* [Feed Exports](https://docs.scrapy.org/en/latest/topics/feed-exports.html)
* [Scrapinghub Reference Documentation](https://doc.scrapinghub.com/scrapy-cloud.html?_ga=2.217147093.2103382170.1591108132-2059253340.1591108132)