# Advanced scraping with Scrapy

## What you will learn in this course 🧐🧐

Now that you know how to parse HTML pages, it's time to go one step further and scrape websites automatically. A convenient way to do so is with Scrapy spiders. In this course, you'll learn how to:

* Create basic crawlers
* Target specific tags and attributes in a web page
* Follow links to scrape multiple pages
* Simulate a user login
* Run multiple crawlers at the same time
* Avoid being banned from websites

If Scrapy isn't installed yet in your environment, just execute the cell below:

In [1]:
# Add '!' only if you are running this command in a notebook
## It tells Jupyter to run the line as a shell command
!pip install Scrapy

Collecting Scrapy
  Using cached Scrapy-2.4.1-py2.py3-none-any.whl (239 kB)
Collecting service-identity>=16.0.0
  Using cached service_identity-18.1.0-py2.py3-none-any.whl (11 kB)
Collecting w3lib>=1.17.0
  Using cached w3lib-1.22.0-py2.py3-none-any.whl (20 kB)
Collecting queuelib>=1.4.2
  Using cached queuelib-1.5.0-py2.py3-none-any.whl (13 kB)
Collecting parsel>=1.5.0
  Using cached parsel-1.6.0-py2.py3-none-any.whl (13 kB)
Collecting zope.interface>=4.1.3
  Using cached zope.interface-5.2.0-cp38-cp38-manylinux2010_x86_64.whl (244 kB)
Collecting PyDispatcher>=2.0.5
  Downloading PyDispatcher-2.0.5.tar.gz (34 kB)
Processing /home/jovyan/.cache/pip/wheels/91/64/36/bd0d11306cb22a78c7f53d603c7eb74ebb6c211703bc40b686/Protego-0.1.16-py3-none-any.whl
Collecting itemloaders>=1.0.1
  Using cached itemloaders-1.0.4-py3-none-any.whl (11 kB)
Processing /home/jovyan/.cache/pip/wheels/f2/36/1b/99fe6d339e1559e421556c69ad7bc8c869145e86a756c403f4/Twisted-20.3.0-cp38-cp38-linux_x86_64.whl
Collecting c

## Create your first spider 🕷️🕷️

Scrapy works with *spiders*: classes that describe the successive steps needed to get the data you're interested in, starting from a given URL. To build a scraping engine, you need to:

- declare your own class that inherits from `scrapy.Spider`,
- declare two attributes: the `name` of your spider and `start_urls`, the list of URLs at which crawling starts,
- declare a `parse` method with an argument called `response`, which holds the HTML response downloaded from each URL in `start_urls`. This method describes all the steps required to extract the desired data from the HTML elements, for example with CSS selectors.

Let's begin with a simple example:

In [2]:
# Import os => Library used to interact with the operating system (files, paths, ...)
## More info => https://docs.python.org/3/library/os.html
import os 

# Import logging => Library used for logs manipulation 
## More info => https://docs.python.org/3/library/logging.html
import logging

# Import scrapy and scrapy.crawler 
import scrapy
from scrapy.crawler import CrawlerProcess

In [3]:
class RandomQuoteSpider(scrapy.Spider):
    # Name of your spider
    name = "randomquote"

    # Url to start your spider from 
    start_urls = [
        'http://quotes.toscrape.com/random',
    ]

    # Callback function that will be called when starting your spider
    # It will get text, author and tags of the first <div> with class="quote"
    def parse(self, response):
        quote = response.css('div.quote')
        return {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('span small.author::text').get(),
            'tags': quote.css('div.tags a.tag::text').getall(),
        }

Next, declare a `CrawlerProcess` that runs the spider and saves the results to a JSON file (Scrapy calls this output a "feed"):

In [5]:
# Name of the file where the results will be saved
filename = "1_randomquote.json"

# If the file already exists, delete it before crawling (otherwise Scrapy
# appends the new results to the previous ones)
if os.path.exists('src/' + filename):
    os.remove('src/' + filename)

# Declare a new CrawlerProcess with some settings
## USER_AGENT => Simulates a browser on an OS
## LOG_LEVEL => Minimal Level of Log 
## FEEDS => Where the file will be stored 
## More info on built-in settings => https://docs.scrapy.org/en/latest/topics/settings.html?highlight=settings#settings
process = CrawlerProcess(settings = {
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        'src/' + filename : {"format": "json"},
    }
})

# Start the crawling using the spider you defined above
process.crawl(RandomQuoteSpider)
process.start()

2020-12-12 14:04:48 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scrapybot)
2020-12-12 14:04:48 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.6 | packaged by conda-forge | (default, Oct  7 2020, 19:08:05) - [GCC 7.5.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform Linux-4.19.112+-x86_64-with-glibc2.10
2020-12-12 14:04:48 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 20,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2020-12-12 14:04:48 [scrapy.extensions.telnet] INFO: Telnet Password: afe766bdc57fd3ef
2020-12-12 14:04:48 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2020-12-12 14:04:48 [scrapy.middleware] INFO: Ena
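
As an alternative to deleting the previous feed file by hand, the feed itself can be asked to overwrite it. The settings below are a sketch that assumes a Scrapy version supporting the `overwrite` feed option (to my knowledge, 2.4 and later); check the documentation of your installed version before relying on it.

```python
import logging

filename = "1_randomquote.json"

# Same settings as in the cell above, plus the "overwrite" feed option
settings = {
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        'src/' + filename: {
            "format": "json",
            # Assumption: replaces the file instead of appending to it
            "overwrite": True,
        },
    },
}

print(settings["FEEDS"])
```

This dictionary would be passed as `CrawlerProcess(settings=settings)`, exactly like the inline dictionary used in this lecture.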

**WARNING**: `process.start()` can only be called once per Python process, because the underlying Twisted reactor cannot be restarted. For this reason, restart your notebook's kernel before declaring a new `CrawlerProcess`; otherwise a `ReactorNotRestartable` error is raised and the crawl won't run.


## Scraping multiple items per page 🛍️🛍️

Let's see an example where we parse multiple elements with a `for` loop and Python's `yield` keyword (see Appendix 1 of this lecture for details):

In [2]:
class QuotesSpider(scrapy.Spider):

    # Name of your spider
    name = "quotes"

    # Url to start your spider from 
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    # Callback function that will be called when starting your spider
    # It will get text, author and tags of all the <div> with class="quote"
    def parse(self, response):
        quotes = response.css('div.quote')
        for quote in quotes:
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

In [3]:
# Name of the file where the results will be saved
filename = "2_quotes.json"

# If the file already exists, delete it before crawling (otherwise Scrapy
# appends the new results to the previous ones)
if os.path.exists('src/' + filename):
    os.remove('src/' + filename)

# Declare a new CrawlerProcess with some settings
## USER_AGENT => Simulates a browser on an OS
## LOG_LEVEL => Minimal Level of Log 
## FEEDS => Where the file will be stored 
## More info on built-in settings => https://docs.scrapy.org/en/latest/topics/settings.html?highlight=settings#settings
process = CrawlerProcess(settings = {
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        'src/' + filename : {"format": "json"},
    }
})

# Start the crawling using the spider you defined above
process.crawl(QuotesSpider)
process.start()

2020-11-13 10:24:35 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: scrapybot)
2020-11-13 10:24:35 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (default, Sep  4 2020, 02:22:02) - [Clang 10.0.0 ], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform macOS-10.14.6-x86_64-i386-64bit
2020-11-13 10:24:35 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 20,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2020-11-13 10:24:35 [scrapy.extensions.telnet] INFO: Telnet Password: 5b91904e46afd900
2020-11-13 10:24:35 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2020-11-13 10:24:35 [scrapy.middleware] INFO: Enabled downloader middlewares:
[

## Following pagination links 📄📄

The example below shows how to use links to iterate over multiple pages:

In [2]:
class QuotesMultipleSpider(scrapy.Spider):

    # Name of your spider
    name = "quotesmultiplepages"

    # Url to start your spider from 
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    # Callback function that will be called when starting your spider
    # It will get text, author and tags of the <div> with class="quote"
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        try:
            # Select the NEXT button and store it in next_page
            next_page = response.css('li.next a').attrib["href"]
        except KeyError:
            # In the last page, there won't be any "href" and a KeyError will be raised
            logging.info('No next page. Terminating crawling process.')
        else:
            # If a next page is found, execute the parse method once again
            yield response.follow(next_page, callback=self.parse)
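
The `href` of the "Next" button is a relative link, and `response.follow` resolves it against the URL of the current page before scheduling the request. The same resolution can be reproduced with the standard library, as sketched below; the `href` value is an assumed example of what the button contains.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed
current_url = 'http://quotes.toscrape.com/page/1/'

# Assumed value of the "Next" link's href attribute (a relative link)
next_href = '/page/2/'

# Resolve the relative link against the current page URL
absolute_url = urljoin(current_url, next_href)
print(absolute_url)
```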

In [3]:
# Name of the file where the results will be saved
filename = "3_quotesmultiplepages.json"

# If the file already exists, delete it before crawling (otherwise Scrapy
# appends the new results to the previous ones)
if os.path.exists('src/' + filename):
    os.remove('src/' + filename)

# Declare a new CrawlerProcess with some settings
## USER_AGENT => Simulates a browser on an OS
## LOG_LEVEL => Minimal Level of Log 
## FEEDS => Where the file will be stored 
## More info on built-in settings => https://docs.scrapy.org/en/latest/topics/settings.html?highlight=settings#settings
process = CrawlerProcess(settings = {
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        'src/' + filename: {"format": "json"},
    }
})

# Start the crawling using the spider you defined above
process.crawl(QuotesMultipleSpider)
process.start()

2020-11-13 10:25:11 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: scrapybot)
2020-11-13 10:25:11 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (default, Sep  4 2020, 02:22:02) - [Clang 10.0.0 ], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform macOS-10.14.6-x86_64-i386-64bit
2020-11-13 10:25:11 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 20,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2020-11-13 10:25:11 [scrapy.extensions.telnet] INFO: Telnet Password: b090b7f78c6cb76e
2020-11-13 10:25:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2020-11-13 10:25:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
[

## Authentication on a website 🔐🔐

Another very useful feature of Scrapy is that it can log in to a website for you.

This is done with `scrapy.FormRequest.from_response()`, which sends a POST request containing your login and password to the website:

In [2]:
class QuotesLogin(scrapy.Spider):
    # Name of your spider
    name = "login"

    # Starting URL
    start_urls = ['http://quotes.toscrape.com/login']

    # Parse function for login
    def parse(self, response):
        # FormRequest used to login
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},

            # Function to be called once logged in
            callback=self.after_login
        )

    # Callback used after login
    def after_login(self, response):

        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        
        # Select the NEXT button and store it in next_page
        try:
            next_page = response.css('li.next a').attrib["href"]
        except KeyError:
            # In the last page, there won't be any "href" and a KeyError will be raised
            logging.info('No next page. Terminating crawling process.')
        else:
            # If a next page is found, execute the parse method once again
            yield response.follow(next_page, callback=self.after_login)

In [3]:
# Name of the file where the results will be saved
filename = "4_quotesauthentication.json"

# If the file already exists, delete it before crawling (otherwise Scrapy
# appends the new results to the previous ones)
if os.path.exists('src/' + filename):
    os.remove('src/' + filename)

# Declare a new CrawlerProcess with some settings
## USER_AGENT => Simulates a browser on an OS
## LOG_LEVEL => Minimal Level of Log 
## FEEDS => Where the file will be stored 
## More info on built-in settings => https://docs.scrapy.org/en/latest/topics/settings.html?highlight=settings#settings
process = CrawlerProcess(settings = {
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        'src/' + filename: {"format": "json"},
    }
})

# Start the crawling using the spider you defined above
process.crawl(QuotesLogin)
process.start()

2020-11-13 10:26:29 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: scrapybot)
2020-11-13 10:26:29 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (default, Sep  4 2020, 02:22:02) - [Clang 10.0.0 ], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform macOS-10.14.6-x86_64-i386-64bit
2020-11-13 10:26:29 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 20,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2020-11-13 10:26:29 [scrapy.extensions.telnet] INFO: Telnet Password: 60f7861552914475
2020-11-13 10:26:29 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2020-11-13 10:26:29 [scrapy.middleware] INFO: Enabled downloader middlewares:
[

## Running multiple spiders simultaneously 🕸️ 🕷️

As stated before, only one `CrawlerProcess` can be started per Python process. If you'd like to crawl different pages in parallel, you can instead declare several spiders and run them all in that single process.

Call `process.crawl()` once per spider before calling `process.start()`, as illustrated below. All the spiders write their results to the same feed file:

In [2]:
class QuotesSpiderPage1(scrapy.Spider):

    # Name of your spider (give each spider its own name)
    name = "quotespage1"

    # Url to start your spider from 
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    # Callback function that will be called when starting your spider
    # It will get text, author and tags of all <div> with class="quote"
    def parse(self, response):
        quotes = response.css('div.quote')
        for quote in quotes:
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
            

class QuotesSpiderPage2(scrapy.Spider):

    # Name of your spider (give each spider its own name)
    name = "quotespage2"

    # Url to start your spider from 
    start_urls = [
        'http://quotes.toscrape.com/page/2/',
    ]

    # Callback function that will be called when starting your spider
    # It will get text, author and tags of the <div> with class="quote"
    def parse(self, response):
        quotes = response.css('div.quote')
        for quote in quotes:
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

In [3]:
# Name of the file where the results will be saved
filename = "5_quotesmultiplespiders.json"

# If the file already exists, delete it before crawling (otherwise Scrapy
# appends the new results to the previous ones)
if os.path.exists('src/' + filename):
    os.remove('src/' + filename)

# Declare a new CrawlerProcess with some settings
## USER_AGENT => Simulates a browser on an OS
## LOG_LEVEL => Minimal Level of Log 
## FEEDS => Where the file will be stored 
## More info on built-in settings => https://docs.scrapy.org/en/latest/topics/settings.html?highlight=settings#settings
process = CrawlerProcess(settings = {
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        'src/' + filename: {"format": "json"},
    }
})

# Start the crawling using the spider you defined above
process.crawl(QuotesSpiderPage1)
process.crawl(QuotesSpiderPage2)
process.start()

2020-11-13 10:27:54 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: scrapybot)
2020-11-13 10:27:54 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (default, Sep  4 2020, 02:22:02) - [Clang 10.0.0 ], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform macOS-10.14.6-x86_64-i386-64bit
2020-11-13 10:27:54 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 20,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2020-11-13 10:27:54 [scrapy.extensions.telnet] INFO: Telnet Password: 226a569a2ba7e62a
2020-11-13 10:27:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2020-11-13 10:27:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
[

## Avoid being banned: autothrottle 🚫🚫

The more you scrape, the more requests you send. Well-protected websites may ban you if you exceed their request limits.

You can reduce that risk by letting the `AutoThrottle` extension adjust the delay between requests automatically.

As stated in the documentation, the `AutoThrottle` extension is designed to:

- *Be nicer to sites instead of using default download delay of zero.*
- *Automatically adjust Scrapy to the optimum crawling speed, so the user doesn’t have to tune the download delays to find the optimum one. The user only needs to specify the maximum concurrent requests it allows, and the extension does the rest.*

Enabling AutoThrottle is as simple as adding `"AUTOTHROTTLE_ENABLED": True` to your crawler's settings:

In [2]:
class QuotesMultipleSpider(scrapy.Spider):

    # Name of your spider
    name = "quotesmultiplepages"

    # Url to start your spider from 
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    # Callback that gets text, author and tags of the webpage
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Select the NEXT button and store it in next_page
        try:
            next_page = response.css('li.next a').attrib["href"]
        except KeyError:
            # In the last page, there won't be any "href" and a KeyError will be raised
            logging.info('No next page. Terminating crawling process.')
        else:
            # If a next page is found, execute the parse method once again
            yield response.follow(next_page, callback=self.parse)

In [3]:
# Name of the file where the results will be saved
filename = "6_quotesautothrottle.json"

# If the file already exists, delete it before crawling (otherwise Scrapy
# appends the new results to the previous ones)
if os.path.exists('src/' + filename):
    os.remove('src/' + filename)

# Declare a new CrawlerProcess with some settings
## USER_AGENT => Simulates a browser on an OS
## LOG_LEVEL => Minimal Level of Log 
## FEEDS => Where the file will be stored 
## More info on built-in settings => https://docs.scrapy.org/en/latest/topics/settings.html?highlight=settings#settings
process = CrawlerProcess(settings = {
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        'src/' + filename: {"format": "json"},
    },
    "AUTOTHROTTLE_ENABLED": True  # AutoThrottle Here!
})

# Start the crawling using the spider you defined above
process.crawl(QuotesMultipleSpider)
process.start()

2020-11-13 10:31:19 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: scrapybot)
2020-11-13 10:31:19 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (default, Sep  4 2020, 02:22:02) - [Clang 10.0.0 ], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform macOS-10.14.6-x86_64-i386-64bit
2020-11-13 10:31:19 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'LOG_LEVEL': 20,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2020-11-13 10:31:19 [scrapy.extensions.telnet] INFO: Telnet Password: 86b452237fed67cf
2020-11-13 10:31:19 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.throttle.AutoThrottle']
2020
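
Beyond the on/off switch, AutoThrottle exposes a few settings to tune its behaviour. The dictionary below is a sketch of how they could be combined; the numeric values are illustrative assumptions, not recommendations from this lecture.

```python
import logging

# Illustrative AutoThrottle tuning (values are assumptions for the example)
settings = {
    'LOG_LEVEL': logging.INFO,
    # Turn the extension on
    'AUTOTHROTTLE_ENABLED': True,
    # Initial download delay, in seconds
    'AUTOTHROTTLE_START_DELAY': 1.0,
    # Upper bound for the delay when the server responds slowly, in seconds
    'AUTOTHROTTLE_MAX_DELAY': 10.0,
    # Average number of requests sent in parallel to each remote website
    'AUTOTHROTTLE_TARGET_CONCURRENCY': 1.0,
    # Log throttling statistics for every response received
    'AUTOTHROTTLE_DEBUG': True,
}

print(settings)
```

With `AUTOTHROTTLE_DEBUG` enabled, the log shows how the delay evolves during the crawl, which helps when choosing the other values.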

## Appendix 1 - What is the `yield` keyword for? 💐

You might have noticed that we used the `yield` keyword in our spiders, which may be new and confusing. Technically speaking, a function that contains `yield` is a *generator function*: calling it returns a *generator* instead of running the function body right away.

In a nutshell, `yield` lets you produce the items of a collection one at a time, without holding the whole collection in memory.

Let's look at an example with two functions:

In [1]:
# Simple function with the return keyword (note: it doubles the items of a_list in place)
def return_list(a_list):
    for i in range(len(a_list)):
        a_list[i] *= 2
    return a_list

# Function with yield keyword
def return_with_yield(a_list):
    for i in range(len(a_list)):
        yield a_list[i] * 2

Now let's apply these two functions to our `random_list`:

In [2]:
# Create a list of numbers from 0 to 9
random_list = [x for x in range(10)]
# Returns a list
return_list(random_list)

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

In [3]:
# Create a list of numbers from 0 to 9
random_list = [x for x in range(10)]
# Function with yield
return_with_yield(random_list)

<generator object return_with_yield at 0x10a14a190>

In the first example, `return_list` directly returned the full list, whereas in the second example `return_with_yield` returned a generator. At this point the loop inside the generator has not run at all: values are computed only when something asks for them, so no memory has been spent on storing the results.

If, instead of a list of 10 items, you had one of 1,000,000 items, this would make a huge difference in memory usage, and you would only pay the computing time for the items you actually consume.
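
The difference can be observed with `sys.getsizeof`. The sketch below uses two hypothetical helper functions, `doubled_list` and `doubled_generator`; note that `sys.getsizeof` only measures the list object itself (its array of references), not the integers it points to, so the real gap is even larger.

```python
import sys


def doubled_list(n):
    # Builds and stores all n results at once
    return [i * 2 for i in range(n)]


def doubled_generator(n):
    # Produces the results one at a time, on demand
    for i in range(n):
        yield i * 2


n = 1_000_000
as_list = doubled_list(n)
as_generator = doubled_generator(n)

# Size in bytes of each object (exact numbers depend on the Python version)
list_size = sys.getsizeof(as_list)
generator_size = sys.getsizeof(as_generator)
print(list_size, generator_size)

# Only the values actually requested are computed
first_three = [next(as_generator) for _ in range(3)]
print(first_three)
```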

When you need the actual values of your generator, you can iterate over it with a `for` loop or a list comprehension:

In [4]:
# Using a for loop will just print the output:
for number in return_with_yield(random_list):
    print("output", number)

# Using a list comprehension will create a list:
[i for i in return_with_yield(random_list)]

output 0
output 2
output 4
output 6
output 8
output 10
output 12
output 14
output 16
output 18


[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

If you simply need to yield every item of an iterable without transforming it, you can use `yield from` instead of writing a loop.
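
As a minimal illustration, the two generator functions below (hypothetical names) produce exactly the same values; the second one replaces the explicit loop with `yield from`.

```python
def passthrough_with_loop(numbers):
    # Explicit loop: yield each item one by one
    for number in numbers:
        yield number


def passthrough_with_yield_from(numbers):
    # Equivalent: delegate the iteration to the iterable itself
    yield from numbers


values = [0, 1, 2, 3]
with_loop = list(passthrough_with_loop(values))
with_yield_from = list(passthrough_with_yield_from(values))
print(with_loop, with_yield_from)
```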

## Appendix 2 - Crash course on XPath ⚔️

In this lecture, you've learned how to use CSS selectors with Scrapy. Another way of selecting elements with Scrapy is to use XPath expressions.

The best way to learn XPath is to follow this great tutorial from <a href="http://zvon.org/comp/r/tut-XPath_1.html#Pages~List_of_XPaths" target="_blank">http://Zvon.org</a>.

## Resources 📚📚

* <a href="https://docs.scrapy.org/en/latest/index.html" target="_blank">Scrapy documentation</a>
* <a href="https://docs.python.org/3/library/logging.html" target="_blank">Python logging module</a>
* <a href="https://docs.scrapy.org/en/latest/topics/logging.html#topics-logging" target="_blank">Logging in Scrapy</a>
