# Advanced Web Scraping Lab

In this lab you will first study the code snippet below, a simple web spider class that scrapes paginated webpages. Read the code, run it, and make sure you understand how it works. The challenges in this lab guide you through extending this class so that you end up with a more robust web spider that you can build on in the Web Scraping Project.

In [186]:
import requests
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    A simple spider class. The constructor accepts the parameters below and
    stores them as instance variables so that the class methods can access
    them later.
    
    url_pattern: a %-style format string for the urls to scrape; %s is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url.
    """
    def scrape_url(self, url):
        response = requests.get(url)
        result = self.content_parser(response.content)
        self.output_results(result)
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # %-style format string for the urls to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

"""
This is a custom parser function you will complete in the challenge.
Right now it simply returns the string passed to it. But in this lab
you will complete this function so that it extracts the quotes.
This function will be passed to the IronhackSpider class.
"""
def quotes_parser(content):
    return content

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\xe2\x80\

## Challenge 1 - Custom Parser Function

In this challenge, complete the custom `quotes_parser()` function so that it returns the quote strings instead of the entire HTML page content.

In the cell below, write your updated `quotes_parser()` function and kickstart the spider. Make sure the results being printed contain a list of quote strings extracted from the html content.

In [187]:
def quotes_parser(content):
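    # Name the parser explicitly so BeautifulSoup does not have to guess one
    soup=BeautifulSoup(content,'html.parser')
    quotes=[x.text for x in soup.select('.text')]
    authors=[x.text for x in soup.select('.author')]
    data=list(zip(authors,quotes))
    # pandas is imported as pd in the next cell, before this function is first called
    df=pd.DataFrame(data,columns=['Author','Quote'])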
    return df

In [188]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    This is the constructor class to which you can pass a bunch of parameters. 
    These parameters are stored to the class instance variables so that the
    class functions can access them later.
    
    url_pattern: a %-style format string for the urls to scrape; %s is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url.
    """
    def scrape_url(self, url):
        response = requests.get(url)
        result = self.content_parser(response.content)
        self.output_results(result)
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # %-style format string for the urls to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

"""
This is a custom parser function you will complete in the challenge.
Right now it simply returns the string passed to it. But in this lab
you will complete this function so that it extracts the quotes.
This function will be passed to the IronhackSpider class.
"""
def quotes_parser(content):
    soup=BeautifulSoup(content,'html.parser')
    quotes=[x.text for x in soup.select('.text')]
    authors=[x.text for x in soup.select('.author')]
    data=list(zip(authors,quotes))
    df=pd.DataFrame(data,columns=['Author','Quote'])
    return df

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

              Author                                              Quote
0    Albert Einstein  “The world as we have created it is a process ...
1       J.K. Rowling  “It is our choices, Harry, that show what we t...
2    Albert Einstein  “There are only two ways to live your life. On...
3        Jane Austen  “The person, be it gentleman or lady, who has ...
4     Marilyn Monroe  “Imperfection is beauty, madness is genius and...
5    Albert Einstein  “Try not to become a man of success. Rather be...
6         André Gide  “It is better to be hated for what you are tha...
7   Thomas A. Edison  “I have not failed. I've just found 10,000 way...
8  Eleanor Roosevelt  “A woman is like a tea bag; you never know how...
9       Steve Martin  “A day without sunshine is like, you know, nig...


## Challenge 2 - Error Handling

In `IronhackSpider.scrape_url()`, catch any error that might occur when you make requests to scrape the webpage. This includes checking the response status code and catching http request errors such as timeout, SSL, and too many redirects.
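
One way to keep the status-code check readable is to separate classifying a code from reporting it. The sketch below is a minimal illustration using only the standard library; `classify_status` and `describe_status` are hypothetical helper names, not part of the lab's starter code or of `requests`.

```python
from http import HTTPStatus

def classify_status(code):
    """Return a coarse category for an HTTP status code."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "unknown"

def describe_status(code):
    """Build a short human-readable message for a status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the registered HTTP status codes
        phrase = "Unrecognized status"
    return "{} {} ({})".format(code, phrase, classify_status(code))

print(describe_status(404))
print(describe_status(503))
```

A helper like this could be called with `response.status_code` inside `scrape_url()`. It also keeps 3xx redirects apart from 5xx server errors, which a plain `else` branch would lump together.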

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [190]:
def scrape_url(self,url):
    try:
        response=requests.get(url,timeout=10)  # requests never times out unless a timeout is set
        if response.status_code<300:                    
            result = self.content_parser(response.content)
            self.output_results(result)
        elif response.status_code in range(400,500):
            print('Error {} in "{}", request failed because the resource either does not exist or is forbidden.'
                  .format(response.status_code,url))
        else:
            print('Error {}, in "{}", request failed because the response server encountered an error.'
                  .format(response.status_code,url))
    except requests.exceptions.Timeout:
        print('Error= Timeout, in {}.'.format(url))
    except requests.exceptions.TooManyRedirects:
        print('Error= TooManyRedirects, in {}.'.format(url))
    except requests.exceptions.SSLError:
        print('Error= SSLError, in {}.'.format(url))
    except requests.exceptions.RequestException as e:
        print('Error= {}, in {}.'.format(e,url))

In [191]:
import requests
import re
import pandas as pd
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    This is the constructor class to which you can pass a bunch of parameters. 
    These parameters are stored to the class instance variables so that the
    class functions can access them later.
    
    url_pattern: a %-style format string for the urls to scrape; %s is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url.
    """
    def scrape_url(self,url):
        try:
            response=requests.get(url,timeout=10)  # requests never times out unless a timeout is set
            if response.status_code<300:                    
                result = self.content_parser(response.content)
                self.output_results(result)
            elif response.status_code in range(400,500):
                print('Error {} in "{}", request failed because the resource either does not exist or is forbidden.'
                      .format(response.status_code,url))
            else:
                print('Error {}, in "{}", request failed because the response server encountered an error.'
                      .format(response.status_code,url))
        except requests.exceptions.Timeout:
            print('Error= Timeout, in {}.'.format(url))
        except requests.exceptions.TooManyRedirects:
            print('Error= TooManyRedirects, in {}.'.format(url))
        except requests.exceptions.SSLError:
            print('Error= SSLError, in {}.'.format(url))
        except requests.exceptions.RequestException as e:
            print('Error= {}, in {}.'.format(e,url))
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # %-style format string for the urls to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

"""
This is a custom parser function you will complete in the challenge.
Right now it simply returns the string passed to it. But in this lab
you will complete this function so that it extracts the quotes.
This function will be passed to the IronhackSpider class.
"""
def quotes_parser(content):
    soup=BeautifulSoup(content,'html.parser')
    quotes=[x.text for x in soup.select('.text')]
    authors=[x.text for x in soup.select('.author')]
    data=list(zip(authors,quotes))
    df=pd.DataFrame(data,columns=['Author','Quote'])
    return df

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

              Author                                              Quote
0    Albert Einstein  “The world as we have created it is a process ...
1       J.K. Rowling  “It is our choices, Harry, that show what we t...
2    Albert Einstein  “There are only two ways to live your life. On...
3        Jane Austen  “The person, be it gentleman or lady, who has ...
4     Marilyn Monroe  “Imperfection is beauty, madness is genius and...
5    Albert Einstein  “Try not to become a man of success. Rather be...
6         André Gide  “It is better to be hated for what you are tha...
7   Thomas A. Edison  “I have not failed. I've just found 10,000 way...
8  Eleanor Roosevelt  “A woman is like a tea bag; you never know how...
9       Steve Martin  “A day without sunshine is like, you know, nig...


## Challenge 3 - Sleep Interval

In `IronhackSpider.kickstart()`, implement `sleep_interval`. You will check if `self.sleep_interval` is larger than 0. If so, tell the FOR loop to sleep the given amount of time before making the next request.
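
The pausing logic can be tried out without any network access. The sketch below is one possible arrangement, not the lab's reference solution: `run_with_delay` is a hypothetical helper, and it pauses only between jobs so that no time is spent sleeping after the last one.

```python
import time

def run_with_delay(jobs, interval, sleep=time.sleep):
    """Call each job in order, pausing `interval` seconds between jobs.

    `sleep` is injectable so the behavior can be checked without waiting.
    """
    results = []
    for position, job in enumerate(jobs):
        # Pause before every job except the first one, and only
        # when a positive interval was requested.
        if position > 0 and interval > 0:
            sleep(interval)
        results.append(job())
    return results

# Three stand-in "requests" separated by a short pause
pages = run_with_delay([lambda: 'page 1', lambda: 'page 2', lambda: 'page 3'], 0.1)
print(pages)
```

In the spider, each job would correspond to one `self.scrape_url(...)` call and `interval` to `self.sleep_interval`.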

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [194]:
def kickstart(self):
    interval=0
    if self.sleep_interval>0:
        interval=self.sleep_interval
    for i in range(1, self.pages_to_scrape+1):
        self.scrape_url(self.url_pattern % i)
        time.sleep(interval)

In [195]:
import requests
import re
import time
import pandas as pd
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    This is the constructor class to which you can pass a bunch of parameters. 
    These parameters are stored to the class instance variables so that the
    class functions can access them later.
    
    url_pattern: a %-style format string for the urls to scrape; %s is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url.
    """
    def scrape_url(self,url):
        try:
            response=requests.get(url,timeout=10)  # requests never times out unless a timeout is set
            if response.status_code<300:                    
                result = self.content_parser(response.content)
                self.output_results(result)
            elif response.status_code in range(400,500):
                print('Error {} in "{}", request failed because the resource either does not exist or is forbidden.'
                      .format(response.status_code,url))
            else:
                print('Error {}, in "{}", request failed because the response server encountered an error.'
                      .format(response.status_code,url))
        except requests.exceptions.Timeout:
            print('Error= Timeout, in {}.'.format(url))
        except requests.exceptions.TooManyRedirects:
            print('Error= TooManyRedirects, in {}.'.format(url))
        except requests.exceptions.SSLError:
            print('Error= SSLError, in {}.'.format(url))
        except requests.exceptions.RequestException as e:
            print('Error= {}, in {}.'.format(e,url))
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        interval=0
        if self.sleep_interval>0:
            interval=self.sleep_interval
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)
            time.sleep(interval)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # %-style format string for the urls to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

"""
This is a custom parser function you will complete in the challenge.
Right now it simply returns the string passed to it. But in this lab
you will complete this function so that it extracts the quotes.
This function will be passed to the IronhackSpider class.
"""
def quotes_parser(content):
    soup=BeautifulSoup(content,'html.parser')
    quotes=[x.text for x in soup.select('.text')]
    authors=[x.text for x in soup.select('.author')]
    data=list(zip(authors,quotes))
    df=pd.DataFrame(data,columns=['Author','Quote'])
    return df

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

              Author                                              Quote
0    Albert Einstein  “The world as we have created it is a process ...
1       J.K. Rowling  “It is our choices, Harry, that show what we t...
2    Albert Einstein  “There are only two ways to live your life. On...
3        Jane Austen  “The person, be it gentleman or lady, who has ...
4     Marilyn Monroe  “Imperfection is beauty, madness is genius and...
5    Albert Einstein  “Try not to become a man of success. Rather be...
6         André Gide  “It is better to be hated for what you are tha...
7   Thomas A. Edison  “I have not failed. I've just found 10,000 way...
8  Eleanor Roosevelt  “A woman is like a tea bag; you never know how...
9       Steve Martin  “A day without sunshine is like, you know, nig...


## Challenge 4 - Test Batch Scraping

Change the `PAGES_TO_SCRAPE` value from `1` to `10`. Check whether your code still works as intended when scraping 10 webpages. If there are errors in your code, fix them.
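
Before running a larger batch, it can help to preview the urls the spider will request. The sketch below mirrors the `%` formatting used in `kickstart()`; `build_urls` is a hypothetical helper introduced here for illustration only.

```python
URL_PATTERN = 'http://quotes.toscrape.com/page/%s/'

def build_urls(url_pattern, pages_to_scrape):
    """Return the list of urls for pages 1..pages_to_scrape."""
    # %s in the pattern is replaced by each page number in turn
    return [url_pattern % page for page in range(1, pages_to_scrape + 1)]

urls = build_urls(URL_PATTERN, 10)
print(len(urls))
print(urls[0])
print(urls[-1])
```

Printing the first and last url is a quick check that the range covers pages 1 through 10 and does not stop one page short.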

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [196]:
import requests
import re
import time
import pandas as pd
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    This is the constructor class to which you can pass a bunch of parameters. 
    These parameters are stored to the class instance variables so that the
    class functions can access them later.
    
    url_pattern: a %-style format string for the urls to scrape; %s is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url.
    """
    def scrape_url(self,url):
        try:
            response=requests.get(url,timeout=10)  # requests never times out unless a timeout is set
            if response.status_code<300:                    
                result = self.content_parser(response.content)
                self.output_results(result)
            elif response.status_code in range(400,500):
                print('Error {} in "{}", request failed because the resource either does not exist or is forbidden.'
                      .format(response.status_code,url))
            else:
                print('Error {}, in "{}", request failed because the response server encountered an error.'
                      .format(response.status_code,url))
        except requests.exceptions.Timeout:
            print('Error= Timeout, in {}.'.format(url))
        except requests.exceptions.TooManyRedirects:
            print('Error= TooManyRedirects, in {}.'.format(url))
        except requests.exceptions.SSLError:
            print('Error= SSLError, in {}.'.format(url))
        except requests.exceptions.RequestException as e:
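            # Fall back to the full message when it has no "Caused by" part
            causes=re.findall('Caused by ([^(]*)',str(e))
            error=causes[0] if causes else str(e)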
            print('Error= {}, in {}.'.format(error,url))
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        interval=0
        if self.sleep_interval>0:
            interval=self.sleep_interval
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)
            time.sleep(interval)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # %-style format string for the urls to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 10 # how many webpages to scrape

"""
This is a custom parser function you will complete in the challenge.
Right now it simply returns the string passed to it. But in this lab
you will complete this function so that it extracts the quotes.
This function will be passed to the IronhackSpider class.
"""
def quotes_parser(content):
    soup=BeautifulSoup(content,'html.parser')
    quotes=[x.text for x in soup.select('.text')]
    authors=[x.text for x in soup.select('.author')]
    data=list(zip(authors,quotes))
    df=pd.DataFrame(data,columns=['Author','Quote'])
    return df

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

              Author                                              Quote
0    Albert Einstein  “The world as we have created it is a process ...
1       J.K. Rowling  “It is our choices, Harry, that show what we t...
2    Albert Einstein  “There are only two ways to live your life. On...
3        Jane Austen  “The person, be it gentleman or lady, who has ...
4     Marilyn Monroe  “Imperfection is beauty, madness is genius and...
5    Albert Einstein  “Try not to become a man of success. Rather be...
6         André Gide  “It is better to be hated for what you are tha...
7   Thomas A. Edison  “I have not failed. I've just found 10,000 way...
8  Eleanor Roosevelt  “A woman is like a tea bag; you never know how...
9       Steve Martin  “A day without sunshine is like, you know, nig...
                Author                                              Quote
0       Marilyn Monroe  “This life is what you make it. No matter what...
1         J.K. Rowling  “It takes a great deal of bravery to

## Challenge 5 - Scrape a Different Website

Update the parameters passed to the `IronhackSpider` constructor so that your code can crawl [books.toscrape.com](http://books.toscrape.com/). You will need to use a different `URL_PATTERN` (figure out the new url pattern by yourself) and write another parser function to be passed to `IronhackSpider`.
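
A parser for this site will most likely return prices and ratings as raw strings. The sketch below shows one optional way to turn them into numbers. It assumes prices look like `£51.77` and that the rating is an English number word such as `Three`; `clean_price`, `clean_rating`, and `RATING_WORDS` are hypothetical names introduced for illustration.

```python
import re

# Assumed mapping from the rating word found in the page markup to a number
RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

def clean_price(price_text):
    """Extract the numeric part of a price string, or None if there is none."""
    match = re.search(r'\d+(?:\.\d+)?', price_text)
    return float(match.group()) if match else None

def clean_rating(rating_word):
    """Convert a rating word to an integer, or None if it is not recognized."""
    return RATING_WORDS.get(rating_word)

print(clean_price('£51.77'))   # 51.77
print(clean_rating('Three'))   # 3
```

Applying these helpers inside the parser, or to the resulting DataFrame columns, makes the scraped values sortable and comparable.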

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [3]:
import requests
import re
import time
import pandas as pd
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    This is the constructor class to which you can pass a bunch of parameters. 
    These parameters are stored to the class instance variables so that the
    class functions can access them later.
    
    url_pattern: a %-style format string for the urls to scrape; %s is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url.
    """
    def scrape_url(self,url):
        try:
            response=requests.get(url,timeout=10)  # requests never times out unless a timeout is set
            if response.status_code<300:                    
                result = self.content_parser(response.content)
                self.output_results(result)
            elif response.status_code in range(400,500):
                print('Error {} in "{}", request failed because the resource either does not exist or is forbidden.'
                      .format(response.status_code,url))
            else:
                print('Error {}, in "{}", request failed because the response server encountered an error.'
                      .format(response.status_code,url))
        except requests.exceptions.Timeout:
            print('Error= Timeout, in {}.'.format(url))
        except requests.exceptions.TooManyRedirects:
            print('Error= TooManyRedirects, in {}.'.format(url))
        except requests.exceptions.SSLError:
            print('Error= SSLError, in {}.'.format(url))
        except requests.exceptions.RequestException as e:
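            # Fall back to the full message when it has no "Caused by" part
            causes=re.findall('Caused by ([^(]*)',str(e))
            error=causes[0] if causes else str(e)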
            print('Error= {}, in {}.'.format(error,url))
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        interval=0
        if self.sleep_interval>0:
            interval=self.sleep_interval
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)
            time.sleep(interval)


URL_PATTERN = 'http://books.toscrape.com/catalogue/page-%s.html' # %-style format string for the urls to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

"""
This is the custom parser function for books.toscrape.com.
It extracts the title, price, availability, and star rating of each
book on the page and returns them as a DataFrame.
This function will be passed to the IronhackSpider class.
"""
def books_parser(content):
    soup=BeautifulSoup(content,'html.parser')
    titles=[x.a['title'] for x in soup.find_all('h3')]
    prices=[x.text for x in soup.select('.price_color')]
    in_stock=[re.sub('\n','',x.text).strip() for x in soup.select('.instock')]
    ratings=[x['class'][1] for x in soup.select('.star-rating')]
    df=pd.DataFrame(zip(titles,prices,in_stock,ratings),columns=['Title','Price','Availability','Stars'])
    return df

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=books_parser)

# Start scraping jobs
my_spider.kickstart()

                                                Title   Price Availability  \
0                                A Light in the Attic  £51.77     In stock   
1                                  Tipping the Velvet  £53.74     In stock   
2                                          Soumission  £50.10     In stock   
3                                       Sharp Objects  £47.82     In stock   
4               Sapiens: A Brief History of Humankind  £54.23     In stock   
5                                     The Requiem Red  £22.65     In stock   
6   The Dirty Little Secrets of Getting Your Dream...  £33.34     In stock   
7   The Coming Woman: A Novel Based on the Life of...  £17.93     In stock   
8   The Boys in the Boat: Nine Americans and Their...  £22.60     In stock   
9                                     The Black Maria  £52.15     In stock   
10     Starving Hearts (Triangular Trade Trilogy, #1)  £13.99     In stock   
11                              Shakespeare's Sonnets  £20.66   

## Bonus Challenge 1 - Making Your Spider Unblockable

Use techniques such as randomizing user agents and referers in your requests to reduce the likelihood that your spider is blocked by websites. [Here](http://blog.adnansiddiqi.me/5-strategies-to-write-unblock-able-web-scrapers-in-python/) is a great article to learn these techniques.
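
As a rough illustration of the idea, the sketch below picks a user agent and a referer at random for each request. The strings in `USER_AGENTS` and `REFERERS` are placeholder examples, and `random_headers` is a hypothetical helper, not something taken from the linked article.

```python
import random

# Illustrative values only; a real list would be longer and kept up to date.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0',
]
REFERERS = [
    'https://www.google.com/',
    'https://www.bing.com/',
    'https://duckduckgo.com/',
]

def random_headers(rng=random):
    """Build request headers with a randomly chosen user agent and referer."""
    return {
        'User-Agent': rng.choice(USER_AGENTS),
        'Referer': rng.choice(REFERERS),
    }

headers = random_headers()
print(headers)
```

The resulting dictionary can be passed as the `headers` argument of `requests.get()`. Passing a seeded `random.Random` instance as `rng` makes the choice reproducible while testing.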

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [128]:
import requests
import re
import time
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    This is the constructor class to which you can pass a bunch of parameters. 
    These parameters are stored to the class instance variables so that the
    class functions can access them later.
    
    url_pattern: a %-style format string for the urls to scrape; %s is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    get_agent: a function reference that returns the user-agent string to send with each request.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None, get_agent=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
        self.get_agent=get_agent
    
    """
    Scrape the content of a single url.
    """
    def scrape_url(self,url):
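        # Fall back to the requests default user agent when no get_agent function is given
        agent=self.get_agent() if self.get_agent else requests.utils.default_user_agent()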
        headers = {
            'user-agent': agent,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'en-US,en;q=0.9,es;q=0.8,fr;q=0.7,gl;q=0.6',
            'Pragma': 'no-cache',
        } 
        try:
            response=requests.get(url,headers=headers,timeout=10)  # requests never times out unless a timeout is set
            if response.status_code<300:                    
                result = self.content_parser(response.content)
                self.output_results(result)
            elif response.status_code in range(400,500):
                print('Error {} in "{}", request failed because the resource either does not exist or is forbidden.'
                      .format(response.status_code,url))
            else:
                print('Error {}, in "{}", request failed because the response server encountered an error.'
                      .format(response.status_code,url))
        except requests.exceptions.Timeout:
            print('Error= Timeout, in {}.'.format(url))
        except requests.exceptions.TooManyRedirects:
            print('Error= TooManyRedirects, in {}.'.format(url))
        except requests.exceptions.SSLError:
            print('Error= SSLError, in {}.'.format(url))
        except requests.exceptions.RequestException as e:
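            # Fall back to the full message when it has no "Caused by" part
            causes=re.findall('Caused by ([^(]*)',str(e))
            error=causes[0] if causes else str(e)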
            print('Error= {}, in {}.'.format(error,url))
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        interval=0
        if self.sleep_interval>0:
            interval=self.sleep_interval
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)
            time.sleep(interval)
            
URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format pattern for the urls to scrape (%s is replaced by the page number)
PAGES_TO_SCRAPE = 10 # how many webpages to scrape

"""
This is the custom parser function passed to the IronhackSpider class.
It extracts the quotes and their authors from the scraped HTML content
and returns them as a DataFrame with the columns 'Author' and 'Quote'.
"""
def quotes_parser(content):
    # Name the parser explicitly so BeautifulSoup does not guess one and emit a warning
    soup=BeautifulSoup(content,'html.parser')
    quotes=[x.text for x in soup.select('.text')]
    authors=[x.text for x in soup.select('.author')]
    data=list(zip(authors,quotes))
    df=pd.DataFrame(data,columns=['Author','Quote'])
    return df

def random_agent():
    with open('agents.txt',encoding="utf8") as f:
        lines = f.readlines()
        # Pick a random line; randint's upper bound is exclusive, so every line can be chosen
        index = np.random.randint(len(lines))
        agent = lines[int(index)].strip()
        return agent

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser, get_agent=random_agent,sleep_interval=1)

# Start scraping jobs
my_spider.kickstart()

              Author                                              Quote
0    Albert Einstein  “The world as we have created it is a process ...
1       J.K. Rowling  “It is our choices, Harry, that show what we t...
2    Albert Einstein  “There are only two ways to live your life. On...
3        Jane Austen  “The person, be it gentleman or lady, who has ...
4     Marilyn Monroe  “Imperfection is beauty, madness is genius and...
5    Albert Einstein  “Try not to become a man of success. Rather be...
6         André Gide  “It is better to be hated for what you are tha...
7   Thomas A. Edison  “I have not failed. I've just found 10,000 way...
8  Eleanor Roosevelt  “A woman is like a tea bag; you never know how...
9       Steve Martin  “A day without sunshine is like, you know, nig...
                Author                                              Quote
0       Marilyn Monroe  “This life is what you make it. No matter what...
1         J.K. Rowling  “It takes a great deal of bravery to

# Bonus Challenge 2 - Making Asynchronous Calls

Implement asynchronous calls to `IronhackSpider`. You will make requests in parallel to complete your tasks faster.

In the cell below, place your entire code, including the updated `IronhackSpider` class and the code to kickstart the spider.
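
As a point of reference for what "requests in parallel" can look like with `asyncio`, here is a minimal sketch. It is not the lab's solution: `fake_fetch`, `fetch_all`, `run_demo` and the `max_concurrent` parameter are hypothetical names, and the network request is simulated with `asyncio.sleep` so the example runs without internet access. It starts all requests with `asyncio.gather` and uses a semaphore to cap how many are in flight at once.

```python
import asyncio
import time

async def fake_fetch(page, delay=0.1):
    """Hypothetical stand-in for a network request: waits, then returns a label."""
    await asyncio.sleep(delay)
    return 'page %s' % page

async def fetch_all(pages, max_concurrent=5):
    """Run fake_fetch for every page, with at most max_concurrent running at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(page):
        # The semaphore blocks here once max_concurrent fetches are in flight
        async with semaphore:
            return await fake_fetch(page)

    # gather returns the results in the order the awaitables were passed in
    return await asyncio.gather(*(bounded(p) for p in pages))

def run_demo(n_pages=10):
    """Time the concurrent run from a plain script (not from inside a running loop)."""
    start = time.perf_counter()
    results = asyncio.run(fetch_all(range(1, n_pages + 1)))
    elapsed = time.perf_counter() - start
    return results, elapsed

if __name__ == '__main__':
    results, elapsed = run_demo()
    print(results)
    print('elapsed: %.2f s' % elapsed)
```

Because the simulated requests overlap, the total time should be well below the sum of the individual delays. `asyncio.run` is meant for plain scripts; inside a notebook whose kernel already runs an event loop, you would typically write `await fetch_all(range(1, 11))` in a cell instead.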

In [182]:
import asyncio
import re
import time
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import aiohttp

class IronhackSpider:
    """
    This is the constructor class to which you can pass a bunch of parameters. 
    These parameters are stored to the class instance variables so that the
    class functions can access them later.
    
    url_pattern: the format pattern of the web urls to scrape (%s is replaced by the page number)
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between batches of requests. If <=0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    get_agent: a function reference that returns the user-agent string to send with each request.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None, get_agent=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
        self.get_agent=get_agent
        
    async def fetch_headers(self,url,headers):
        async with aiohttp.ClientSession() as s, s.get(url,headers=headers) as res:
            ret=await res.read()
            status=res.status
            return status,ret
    
    """
    Scrape the content of a single url.
    """
    async def scrape_url(self,url):
        agent=self.get_agent()
        headers = {
            'user-agent': agent,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'en-US,en;q=0.9,es;q=0.8,fr;q=0.7,gl;q=0.6',
            'Pragma': 'no-cache',
        } 
        try:
            status,content= await self.fetch_headers(url,headers)
            if status<300:                    
                result = self.content_parser(content)
                self.output_results(result)
            elif status in range(400,500):
                print('Error {} in "{}", request failed because the resource either does not exist or is forbidden.'
                      .format(status,url))
            else:
                print('Error {}, in "{}", request failed because the response server encountered an error.'
                      .format(status,url))
        except aiohttp.ServerTimeoutError:
            print('Error= Timeout, in {}.'.format(url))
        except aiohttp.TooManyRedirects:
            print('Error= TooManyRedirects, in {}.'.format(url))
        except aiohttp.ClientSSLError:
            print('Error= SSLError, in {}.'.format(url))
        except aiohttp.ClientError as e:
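            # aiohttp messages usually have no "Caused by ..." part, so fall back to the full message
            match=re.search('Caused by ([^(]*)',str(e))
            error=match.group(1) if match else str(e)
            print('Error= {}, in {}.'.format(error,url))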
    
    """
    Export the scraped content. Right now it simply print out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    The pages are split into about sqrt(pages_to_scrape) batches. The urls of a batch
    are requested concurrently, and the spider sleeps between batches if sleep_interval>0.
    """
    async def kickstart(self):
        n_batches=max(1,int(np.sqrt(self.pages_to_scrape)))
        for j in range(n_batches):
            # gather accepts coroutines directly; passing them to asyncio.wait fails on Python 3.11+
            await asyncio.gather(*[self.scrape_url(self.url_pattern % i)
                                   for i in range(1, self.pages_to_scrape+1) if i%n_batches==j])
            if self.sleep_interval>0:
                await asyncio.sleep(self.sleep_interval)
            
    def async_kickstart(self):
        try:
            loop=asyncio.get_event_loop()
            loop.run_until_complete(self.kickstart())
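        except RuntimeError as e:
            # Raised, for example, when an event loop is already running; report it instead of hiding it
            print('Error= {}.'.format(e))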
        
URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format pattern for the urls to scrape (%s is replaced by the page number)
PAGES_TO_SCRAPE = 10 # how many webpages to scrape


"""
This is the custom parser function passed to the IronhackSpider class.
It extracts the quotes and their authors from the scraped HTML content
and returns them as a DataFrame with the columns 'Author' and 'Quote'.
"""
def quotes_parser(content):
    # Name the parser explicitly so BeautifulSoup does not guess one and emit a warning
    soup=BeautifulSoup(content,'html.parser')
    quotes=[x.text for x in soup.select('.text')]
    authors=[x.text for x in soup.select('.author')]
    data=list(zip(authors,quotes))
    df=pd.DataFrame(data,columns=['Author','Quote'])
    return df

def random_agent():
    with open('agents.txt',encoding="utf8") as f:
        lines = f.readlines()
        # Pick a random line; randint's upper bound is exclusive, so every line can be chosen
        index = np.random.randint(len(lines))
        agent = lines[int(index)].strip()
        return agent

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser, get_agent=random_agent,sleep_interval=2)

# Start scraping jobs
my_spider.async_kickstart()

            Author                                              Quote
0  Albert Einstein  “Anyone who has never made a mistake has never...
1      Jane Austen  “A lady's imagination is very rapid; it jumps ...
2     J.K. Rowling  “Remember, if the time should come when you ha...
3      Jane Austen  “I declare after all there is no enjoyment lik...
4      Jane Austen  “There are few people whom I really love, and ...
5       C.S. Lewis  “Some day you will be old enough to start read...
6       C.S. Lewis  “We are not necessarily doubting that God will...
7       Mark Twain  “The fear of death follows from the fear of li...
8       Mark Twain  “A lie can travel half way around the world wh...
9       C.S. Lewis  “I believe in Christianity as I believe that t...
                 Author                                              Quote
0           Jane Austen  “There is nothing I would not do for those who...
1     Eleanor Roosevelt          “Do one thing every day that scares you.”
2    