# Advanced Web Scraping Lab

In this lab you will first study the code snippet below, a simple web spider class that scrapes paginated webpages. Read the code, run it, and make sure you understand how it works. The challenges in this lab guide you through building up this class so that you end up with a more robust web spider that you can keep developing in the Web Scraping Project.

In [1]:
import requests
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    Constructor: accepts the parameters below and stores them as instance
    variables so that the other methods can access them later.
    
    url_pattern: a format string for the web urls to scrape; %s is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <=0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url.
    """
    def scrape_url(self, url):
        response = requests.get(url)
        result = self.content_parser(response.content)
        self.output_results(result)
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i) 


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format string for the urls to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

"""
This is a custom parser function you will complete in the challenge.
Right now it simply returns the string passed to it. But in this lab
you will complete this function so that it extracts the quotes.
This function will be passed to the IronhackSpider class.
"""
def quotes_parser(content):
    return content

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\xe2\x80\
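
`URL_PATTERN` is used as a printf-style format string: `kickstart()` evaluates `self.url_pattern % i`, which substitutes the page number for `%s`. The sketch below shows that expansion on its own, without any network access. The helper name `build_page_urls` is hypothetical and is not part of the lab code.

```python
def build_page_urls(url_pattern, pages_to_scrape):
    """Expand a '%s' URL template into one URL per page, starting at page 1."""
    return [url_pattern % page for page in range(1, pages_to_scrape + 1)]


# Same template as in the lab; three pages are enough to see the pattern
urls = build_page_urls('http://quotes.toscrape.com/page/%s/', 3)
for url in urls:
    print(url)
```

Because the template is an ordinary string, pointing the spider at another paginated site only requires a different template with `%s` where the page number goes.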

## Challenge 1 - Custom Parser Function

In this challenge, complete the custom `quotes_parser()` function so that the returned result contains the quote string instead of the whole html page content.

In the cell below, write your updated `quotes_parser()` function and kickstart the spider. Make sure the results being printed contain a list of quote strings extracted from the html content.

In [7]:
def quotes_parser(content):
    
    soup = BeautifulSoup(content,'html.parser')
    
    quotes = soup.find_all('div',class_='quote')
    
    quote_list = []
    
    for quote in quotes:
        
        quote_list.append(quote.span.get_text().strip())
        
    return quote_list

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", '“Try not to become a man of success. Rather become a man of value.”', '“It is better to be hated for what you are than to be loved for what you are not.”', "“I have not failed. I've just found 10,000 ways that won't work.”", "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", '“A day without sunshine is like, you know, night.”']


## Challenge 2 - Error Handling

In `IronhackSpider.scrape_url()`, catch any error that might occur when you make requests to scrape the webpage. This includes checking the response status code and catching http request errors such as timeout, SSL, and too many redirects.

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [9]:
import requests
from bs4 import BeautifulSoup

class IronhackSpdier:
    """
    Constructor: accepts the parameters below and stores them as instance
    variables so that the other methods can access them later.
    
    url_pattern: a format string for the web urls to scrape; %s is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <=0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url.
    """
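    def scrape_url(self, url):
        try:
            # A timeout (10 seconds here, an arbitrary choice) keeps a stalled server from hanging the spider
            response = requests.get(url, timeout=10)
            # Raise HTTPError when the status code is 4xx or 5xx
            response.raise_for_status()
            result = self.content_parser(response.content)
            self.output_results(result)
        # The exception classes live in requests.exceptions; SSLError is listed
        # before ConnectionError because it is a subclass of ConnectionError
        except requests.exceptions.Timeout as err:
            print('\n Timeout error. The server took too long to respond \n')
            print(err)
        except requests.exceptions.HTTPError as err:
            print('\n HTTP error. The HTTP request returned an unsuccessful status code \n')
            print(err)
        except requests.exceptions.TooManyRedirects as err:
            print('\n TooManyRedirects error. The request exceeded the configured number of maximum redirections \n')
            print(err)
        except requests.exceptions.SSLError as err:
            print('\n SSL error. The secure connection could not be established \n')
            print(err)
        except requests.exceptions.ConnectionError as err:
            print('\n Connection error. There was a problem with your connection \n')
            print(err)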
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i) 


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format string for the urls to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

"""
Custom parser function. It extracts the quote strings from the page
content and returns them as a list.
This function will be passed to the IronhackSpider class.
"""
def quotes_parser(content):
    
    soup = BeautifulSoup(content,'html.parser')
    
    quotes = soup.find_all('div',class_='quote')
    
    quote_list = []
    
    for quote in quotes:
        
        quote_list.append(quote.span.get_text().strip())
        
    return quote_list

# Instantiate the IronhackSpider class
my_spider = IronhackSpdier(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", '“Try not to become a man of success. Rather become a man of value.”', '“It is better to be hated for what you are than to be loved for what you are not.”', "“I have not failed. I've just found 10,000 ways that won't work.”", "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", '“A day without sunshine is like, you know, night.”']
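
The error handling this challenge asks for can also be kept in one small helper that is easy to test on its own. The following is a minimal sketch under stated assumptions, not the lab's reference solution: the names `classify_error` and `fetch` and the 10-second timeout are hypothetical choices. It relies on two documented `requests` behaviours: `raise_for_status()` raises `HTTPError` for 4xx/5xx responses, and every exception the library raises derives from `requests.exceptions.RequestException`.

```python
import requests


def classify_error(err):
    """Map a requests exception to a short label (labels are illustrative)."""
    # Order matters: SSLError is a subclass of ConnectionError, so test it first
    if isinstance(err, requests.exceptions.Timeout):
        return 'timeout'
    if isinstance(err, requests.exceptions.TooManyRedirects):
        return 'too many redirects'
    if isinstance(err, requests.exceptions.SSLError):
        return 'ssl'
    if isinstance(err, requests.exceptions.HTTPError):
        return 'bad status code'
    if isinstance(err, requests.exceptions.ConnectionError):
        return 'connection'
    return 'other request error'


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request failed."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into HTTPError
        return response.content
    except requests.exceptions.RequestException as err:
        print(f'Request to {url} failed ({classify_error(err)}): {err}')
        return None
```

Inside the spider, `scrape_url()` could call `fetch(url)` and skip parsing when it returns `None`, so that one failing page does not stop the whole run.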


## Challenge 3 - Sleep Interval

In `IronhackSpider.kickstart()`, implement `sleep_interval`. You will check if `self.sleep_interval` is larger than 0. If so, tell the FOR loop to sleep the given amount of time before making the next request.

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [10]:
import time
import requests
from bs4 import BeautifulSoup

class IronhackSpdier:
    """
    Constructor: accepts the parameters below and stores them as instance
    variables so that the other methods can access them later.
    
    url_pattern: a format string for the web urls to scrape; %s is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <=0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url.
    """
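    def scrape_url(self, url):
        try:
            # A timeout (10 seconds here, an arbitrary choice) keeps a stalled server from hanging the spider
            response = requests.get(url, timeout=10)
            # Raise HTTPError when the status code is 4xx or 5xx
            response.raise_for_status()
            result = self.content_parser(response.content)
            self.output_results(result)
        # The exception classes live in requests.exceptions; SSLError is listed
        # before ConnectionError because it is a subclass of ConnectionError
        except requests.exceptions.Timeout as err:
            print('\n Timeout error. The server took too long to respond \n')
            print(err)
        except requests.exceptions.HTTPError as err:
            print('\n HTTP error. The HTTP request returned an unsuccessful status code \n')
            print(err)
        except requests.exceptions.TooManyRedirects as err:
            print('\n TooManyRedirects error. The request exceeded the configured number of maximum redirections \n')
            print(err)
        except requests.exceptions.SSLError as err:
            print('\n SSL error. The secure connection could not be established \n')
            print(err)
        except requests.exceptions.ConnectionError as err:
            print('\n Connection error. There was a problem with your connection \n')
            print(err)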
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            
            self.scrape_url(self.url_pattern % i)
            if self.sleep_interval>0:
                time.sleep(self.sleep_interval)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format string for the urls to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

"""
Custom parser function. It extracts the quote strings from the page
content and returns them as a list.
This function will be passed to the IronhackSpider class.
"""
def quotes_parser(content):
    
    soup = BeautifulSoup(content,'html.parser')
    
    quotes = soup.find_all('div',class_='quote')
    
    quote_list = []
    
    for quote in quotes:
        
        quote_list.append(quote.span.get_text().strip())
        
    return quote_list

# Instantiate the IronhackSpider class
my_spider = IronhackSpdier(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", '“Try not to become a man of success. Rather become a man of value.”', '“It is better to be hated for what you are than to be loved for what you are not.”', "“I have not failed. I've just found 10,000 ways that won't work.”", "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", '“A day without sunshine is like, you know, night.”']
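
The delay logic can be tried out without touching the network. The sketch below is one possible variant, not the lab's reference solution: it sleeps only between requests, so no time is wasted after the last page. The function name `crawl` and the `scrape` callback are hypothetical stand-ins for `kickstart()` and `scrape_url()`.

```python
import time


def crawl(urls, scrape, sleep_interval=-1):
    """Call scrape(url) for each URL, pausing sleep_interval seconds between calls."""
    for index, url in enumerate(urls):
        # Skip the pause before the first request; pause before every later one
        if index > 0 and sleep_interval > 0:
            time.sleep(sleep_interval)
        scrape(url)


# Stand-in for scrape_url(): record which URLs were visited
visited = []
start = time.monotonic()
crawl(['page-1', 'page-2', 'page-3'], visited.append, sleep_interval=0.05)
elapsed = time.monotonic() - start
print(visited, round(elapsed, 2))
```

With three URLs and a 0.05 second interval, only two pauses occur. Passing the default `sleep_interval=-1` disables the delay entirely.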


## Challenge 4 - Test Batch Scraping

Change the `PAGES_TO_SCRAPE` value from `1` to `10`. Check whether your code still works as intended when it scrapes 10 webpages. If there are errors in your code, fix them.

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [11]:
import time
import requests
from bs4 import BeautifulSoup

class IronhackSpdier:
    """
    Constructor: accepts the parameters below and stores them as instance
    variables so that the other methods can access them later.
    
    url_pattern: a format string for the web urls to scrape; %s is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <=0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url.
    """
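    def scrape_url(self, url):
        try:
            # A timeout (10 seconds here, an arbitrary choice) keeps a stalled server from hanging the spider
            response = requests.get(url, timeout=10)
            # Raise HTTPError when the status code is 4xx or 5xx
            response.raise_for_status()
            result = self.content_parser(response.content)
            self.output_results(result)
        # The exception classes live in requests.exceptions; SSLError is listed
        # before ConnectionError because it is a subclass of ConnectionError
        except requests.exceptions.Timeout as err:
            print('\n Timeout error. The server took too long to respond \n')
            print(err)
        except requests.exceptions.HTTPError as err:
            print('\n HTTP error. The HTTP request returned an unsuccessful status code \n')
            print(err)
        except requests.exceptions.TooManyRedirects as err:
            print('\n TooManyRedirects error. The request exceeded the configured number of maximum redirections \n')
            print(err)
        except requests.exceptions.SSLError as err:
            print('\n SSL error. The secure connection could not be established \n')
            print(err)
        except requests.exceptions.ConnectionError as err:
            print('\n Connection error. There was a problem with your connection \n')
            print(err)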
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            
            self.scrape_url(self.url_pattern % i)
            if self.sleep_interval>0:
                time.sleep(self.sleep_interval)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format string for the urls to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 10 # how many webpages to scrape

"""
Custom parser function. It extracts the quote strings from the page
content and returns them as a list.
This function will be passed to the IronhackSpider class.
"""
def quotes_parser(content):
    
    soup = BeautifulSoup(content,'html.parser')
    
    quotes = soup.find_all('div',class_='quote')
    
    quote_list = []
    
    for quote in quotes:
        
        quote_list.append(quote.span.get_text().strip())
        
    return quote_list

# Instantiate the IronhackSpider class
my_spider = IronhackSpdier(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", '“Try not to become a man of success. Rather become a man of value.”', '“It is better to be hated for what you are than to be loved for what you are not.”', "“I have not failed. I've just found 10,000 ways that won't work.”", "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", '“A day without sunshine is like, you know, night.”']
["“This life is what you make it. No matter what, you're going t

['“If I had a flower for every time I thought of you...I could walk through my garden forever.”', '“Some people never go crazy. What truly horrible lives they must lead.”', '“The trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.”', '“Think left and think right and think low and think high. Oh, the thinks you can think up if only you try!”', "“What really knocks me out is a book that, when you're all done reading it, you wish the author that wrote it was a terrific friend of yours and you could call him up on the phone whenever you felt like it. That doesn't happen much, though.”", '“The reason I talk to myself is because I’m the only one whose answers I accept.”', "“You may say I'm a dreamer, but I'm not the only one. I hope someday you'll join us. And the world will live as one.”", '“I am free of all prejudice. I hate everyone equally. ”', "“The question isn't who is going to let me; it's who is going to stop me.”",
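
A batch run prints one list per page. A possible refinement, which the challenge does not require, is to gather the per-page results into a single list and to stop early when a page yields nothing. The sketch below illustrates the idea with a fake in-memory site. The names `scrape_pages`, `fetch_page`, and `fake_site` are hypothetical, and treating an empty result as "no more pages" is an assumption about how the target site behaves.

```python
def scrape_pages(fetch_page, max_pages):
    """Collect items from pages 1..max_pages, stopping at the first empty page."""
    all_items = []
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            # Assumption: an empty page means we ran past the last real page
            break
        all_items.extend(items)
    return all_items


# Fake site with two pages, standing in for real HTTP requests plus parsing
fake_site = {1: ['quote a', 'quote b'], 2: ['quote c']}
results = scrape_pages(lambda page: fake_site.get(page, []), max_pages=10)
print(results)
```

In the spider itself, the same effect could be achieved by having `scrape_url()` return the parsed result and letting `kickstart()` accumulate it.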

## Challenge 5 - Scrape a Different Website

Update the parameters passed to the `IronhackSpider` constructor so that your code can crawl [books.toscrape.com](http://books.toscrape.com/). You will need to use a different `URL_PATTERN` (figure out the new url pattern by yourself) and write another parser function to be passed to `IronhackSpider`.

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [14]:
import time
import requests
from bs4 import BeautifulSoup
import pandas as pd

class IronhackSpdier:
    """
    Constructor: accepts the parameters below and stores them as instance
    variables so that the other methods can access them later.
    
    url_pattern: a format string for the web urls to scrape; %s is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <=0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url.
    """
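    def scrape_url(self, url):
        try:
            # A timeout (10 seconds here, an arbitrary choice) keeps a stalled server from hanging the spider
            response = requests.get(url, timeout=10)
            # Raise HTTPError when the status code is 4xx or 5xx
            response.raise_for_status()
            result = self.content_parser(response.content)
            self.output_results(result)
        # The exception classes live in requests.exceptions; SSLError is listed
        # before ConnectionError because it is a subclass of ConnectionError
        except requests.exceptions.Timeout as err:
            print('\n Timeout error. The server took too long to respond \n')
            print(err)
        except requests.exceptions.HTTPError as err:
            print('\n HTTP error. The HTTP request returned an unsuccessful status code \n')
            print(err)
        except requests.exceptions.TooManyRedirects as err:
            print('\n TooManyRedirects error. The request exceeded the configured number of maximum redirections \n')
            print(err)
        except requests.exceptions.SSLError as err:
            print('\n SSL error. The secure connection could not be established \n')
            print(err)
        except requests.exceptions.ConnectionError as err:
            print('\n Connection error. There was a problem with your connection \n')
            print(err)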
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            
            self.scrape_url(self.url_pattern % i)
            if self.sleep_interval>0:
                time.sleep(self.sleep_interval)


URL_PATTERN = 'http://books.toscrape.com/catalogue/page-%s.html' # format string for the urls to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 50 # how many webpages to scrape

"""
Custom parser function for books.toscrape.com. It extracts the name, price,
and availability of every book on a page and returns them as a DataFrame.
This function will be passed to the IronhackSpider class.
"""
def books_parser(content):
    
    soup = BeautifulSoup(content,'html.parser')
    
    books = soup.find_all('article',class_='product_pod')
    
    book_dict = {'Name':[],'Price':[],'Availability':[]}

    for book in books:

        # The link text is the shortened title (long titles end in '...')
        book_dict['Name'].append(book.h3.a.get_text())

        # The second <div> of each article holds the price and availability paragraphs
        book_dict['Price'].append(book.select('div')[1].p.get_text())

        book_dict['Availability'].append(book.select('div')[1].select('p')[1].get_text().strip())

    bookspd = pd.DataFrame(book_dict)
        
    return bookspd

# Instantiate the IronhackSpider class
my_spider = IronhackSpdier(URL_PATTERN, PAGES_TO_SCRAPE, sleep_interval=0.1, content_parser=books_parser)

# Start scraping jobs
my_spider.kickstart()

                                     Name   Price Availability
0                      A Light in the ...  £51.77     In stock
1                      Tipping the Velvet  £53.74     In stock
2                              Soumission  £50.10     In stock
3                           Sharp Objects  £47.82     In stock
4            Sapiens: A Brief History ...  £54.23     In stock
5                         The Requiem Red  £22.65     In stock
6            The Dirty Little Secrets ...  £33.34     In stock
7                 The Coming Woman: A ...  £17.93     In stock
8                     The Boys in the ...  £22.60     In stock
9                         The Black Maria  £52.15     In stock
10  Starving Hearts (Triangular Trade ...  £13.99     In stock
11                  Shakespeare's Sonnets  £20.66     In stock
12                            Set Me Free  £17.46     In stock
13    Scott Pilgrim's Precious Little ...  £52.29     In stock
14                      Rip it Up and ...  £35.02     I

                                      Name   Price Availability
0                 The Shadow Hero (The ...  £33.14     In stock
1               The Secret (The Secret ...  £27.37     In stock
2               The Regional Office Is ...  £51.36     In stock
3               The Psychopath Test: A ...  £36.00     In stock
4                              The Project  £10.65     In stock
5                    The Power of Now: ...  £43.54     In stock
6            The Omnivore's Dilemma: A ...  £38.21     In stock
7          The Nerdy Nummies Cookbook: ...  £37.34     In stock
8                  The Murder of Roger ...  £44.10     In stock
9              The Mistake (Off-Campus #2)  £43.29     In stock
10  The Matchmaker's Playbook (Wingmen ...  £55.85     In stock
11                 The Love and Lemons ...  £37.60     In stock
12                  The Long Shadow of ...  £10.97     In stock
13                         The Kite Runner  £41.82     In stock
14                    The House by the .

                                  Name   Price Availability
0                  My Name Is Lucy ...  £41.56     In stock
1                        My Mrs. Brown  £24.48     In stock
2                     My Kind of Crazy  £40.36     In stock
3        Mr. Mercedes (Bill Hodges ...  £28.90     In stock
4         More Than Music (Chasing ...  £37.61     In stock
5               Made to Stick: Why ...  £38.85     In stock
6                Luis Paints the World  £53.95     In stock
7                  Luckiest Girl Alive  £49.83     In stock
8          Lowriders to the Center ...  £51.51     In stock
9                    Love Is a Mix ...  £18.03     In stock
10  Looking for Lovely: Collecting ...  £29.14     In stock
11   Living Leadership by Insight: ...  £46.91     In stock
12                   Let It Out: A ...  £26.79     In stock
13         Lady Midnight (The Dark ...  £16.28     In stock
14         It's All Easy: Healthy, ...  £19.55     In stock
15    Island of Dragons (Unwanteds ...  

                                   Name   Price Availability
0                         Without Shame  £48.27     In stock
1                              Watchmen  £58.05     In stock
2               Unlimited Intuition Now  £58.87     In stock
3                      Underlying Notes  £11.82     In stock
4                             The Shack  £28.03     In stock
5                The New Brand You: ...  £44.05     In stock
6   The Moosewood Cookbook: Recipes ...  £12.34     In stock
7                      The Flowers Lied  £16.68     In stock
8                 The Fabric of the ...  £55.91     In stock
9                    The Book of Mormon  £24.57     In stock
10              The Art and Science ...  £52.98     In stock
11                       The Alien Club  £54.40     In stock
12   Suzie Snowflake: One beautiful ...  £54.81     In stock
13                            Nap-a-Roo  £25.08     In stock
14           NaNo What Now? Finding ...  £10.41     In stock
15                    Mo

                                 Name   Price Availability
0      The Barefoot Contessa Cookbook  £59.92     In stock
1             Tell the Wolves I'm ...  £50.96     In stock
2      Ship Leaves Harbor: Essays ...  £30.60     In stock
3                 Pride and Prejudice  £19.27     In stock
4    Musicophilia: Tales of Music ...  £46.58     In stock
5                   Mere Christianity  £48.51     In stock
6               Me Before You (Me ...  £19.02     In stock
7            In the Woods (Dublin ...  £38.38     In stock
8                       In Cold Blood  £49.98     In stock
9            How to Stop Worrying ...  £46.49     In stock
10                       Give It Back  £18.32     In stock
11                  Girl, Interrupted  £42.14     In stock
12             Fun Home: A Family ...  £56.59     In stock
13          Fruits Basket, Vol. 6 ...  £20.96     In stock
14                    Deception Point  £40.32     In stock
15            Death Note, Vol. 6: ...  £36.39     In sto

                                      Name   Price Availability
0               Walk the Edge (Thunder ...  £32.36     In stock
1                   Voyager (Outlander #3)  £21.07     In stock
2                 Very Good Lives: The ...  £50.66     In stock
3    Vegan Vegetarian Omnivore: Dinner ...  £13.66     In stock
4   Unstuffed: Decluttering Your Home, ...  £58.09     In stock
5                  Under the Banner of ...  £30.00     In stock
6                         Two Boys Kissing  £32.74     In stock
7                   Twilight (Twilight #1)  £41.93     In stock
8                            Twenties Girl  £42.80     In stock
9      Trespassing Across America: One ...  £53.51     In stock
10                     Three-Martini Lunch  £23.21     In stock
11                 Thinking, Fast and Slow  £21.14     In stock
12                          The Wild Robot  £56.07     In stock
13                    The Wicked + The ...  £14.41     In stock
14                  The Undomestic Godde

                                 Name   Price Availability
0                Girl in the Blue ...  £46.83     In stock
1           Fruits Basket, Vol. 3 ...  £45.17     In stock
2          Friday Night Lights: A ...  £51.22     In stock
3   Fire Bound (Sea Haven/Sisters ...  £21.28     In stock
4       Fifty Shades Freed (Fifty ...  £15.36     In stock
5                            Fellside  £38.62     In stock
6   Extreme Prey (Lucas Davenport ...  £25.40     In stock
7   Eragon (The Inheritance Cycle ...  £43.87     In stock
8               Eclipse (Twilight #3)  £18.74     In stock
9                      Dune (Dune #1)  £54.86     In stock
10                            Dracula  £52.62     In stock
11           Do Androids Dream of ...  £51.48     In stock
12  Disrupted: My Misadventure in ...  £15.28     In stock
13            Dead Wake: The Last ...  £39.24     In stock
14  David and Goliath: Underdogs, ...  £17.81     In stock
15               Darkfever (Fever #1)  £56.02     In sto

                                    Name   Price Availability
0                           Frankenstein  £38.00     In stock
1        Forever Rockers (The Rocker ...  £28.80     In stock
2            Fighting Fate (Fighting #6)  £39.24     In stock
3                                   Emma  £32.93     In stock
4                        Eat, Pray, Love  £51.32     In stock
5        Deep Under (Walker Security ...  £47.09     In stock
6         Choosing Our Religion: The ...  £28.42     In stock
7          Charlie and the Chocolate ...  £22.85     In stock
8     Charity's Cross (Charles Towne ...  £41.24     In stock
9                           Bright Lines  £39.07     In stock
10    Bridget Jones's Diary (Bridget ...  £29.82     In stock
11         Bounty (Colorado Mountain #7)  £37.26     In stock
12  Blood Defense (Samantha Brinkman ...  £20.30     In stock
13        Bleach, Vol. 1: Strawberry ...  £34.65     In stock
14                  Beyond Good and Evil  £43.38     In stock
15      

# Bonus Challenge 1 - Making Your Spider Unblockable

Use techniques such as randomizing user agents and referers in your requests to reduce the likelihood that your spider is blocked by websites. [Here](http://blog.adnansiddiqi.me/5-strategies-to-write-unblock-able-web-scrapers-in-python/) is a great article to learn these techniques.
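
As a rough illustration of the randomization idea described above, the sketch below picks a user agent and a referer at random for each request. The names `USER_AGENTS`, `REFERERS`, and `random_headers` are hypothetical, and the header values are illustrative examples rather than values taken from the lab or the linked article.

```python
import random

# Hypothetical pools of header values (illustrative, not from the lab material).
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/14.0.3 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0",
]
REFERERS = [
    "http://books.toscrape.com/index.html",
    "https://www.google.com/",
    "https://duckduckgo.com/",
]


def random_headers(rng=random):
    """Return request headers with a randomly chosen user agent and referer."""
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Referer": rng.choice(REFERERS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }


# Each call may return a different combination of user agent and referer.
headers = random_headers()
print(headers["User-Agent"])
```

One way to use this in the spider would be to call `requests.get(url, headers=random_headers())` inside `scrape_url()`, so that every page request carries a freshly drawn set of headers.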

In the cell below, place your entire code including the updated `IronhackSpdier` class and the code to kickstart the spider.

In [18]:
import time
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

class IronhackSpdier:
    """
    A simple spider for scraping paginated webpages. The parameters passed to
    the constructor are stored as instance variables so that the other
    methods can access them later.
    
    url_pattern: a %-format string for the urls to scrape, with one placeholder for the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: if True, sleep for a random fraction of a second (0 to 1 s) between requests. If False, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=False, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url.
    """
    def scrape_url(self, url):
        # Browser-like headers so the request looks less like a script
        headers = {
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15',
            'referer': 'http://books.toscrape.com/index.html',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Upgrade-Insecure-Requests': '1',
            'Host': 'books.toscrape.com',
            'Accept-Language': 'en-gb',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
        }

        try:
            # Without a timeout, requests can wait indefinitely (10 s is an arbitrary choice)
            response = requests.get(url, headers=headers, timeout=10)
            # Raise HTTPError for 4xx/5xx responses instead of parsing an error page
            response.raise_for_status()
            result = self.content_parser(response.content)
            self.output_results(result)
        # Timeout is caught first because ConnectTimeout is also a ConnectionError
        except requests.exceptions.Timeout as err:
            print('\n Timeout error. The server took too long to respond \n')
            print(err)
        except requests.exceptions.HTTPError as err:
            print('\n HTTP error. The HTTP request returned an unsuccessful status code \n')
            print(err)
        except requests.exceptions.TooManyRedirects as err:
            print('\n TooManyRedirects error. The request exceeded the configured number of maximum redirections \n')
            print(err)
        except requests.exceptions.ConnectionError as err:
            print('\n Connection error. There was a problem with your connection \n')
            print(err)
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
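    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            # Fill the page number into the url pattern and scrape that page
            self.scrape_url(self.url_pattern % i)
            # sleep_interval is used as an on/off flag in this version of the class
            if self.sleep_interval:
                # Pause for a random duration in [0, 1) seconds between requests
                nap = np.random.random_sample()
                time.sleep(nap)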


URL_PATTERN = 'http://books.toscrape.com/catalogue/page-%s.html' # %-format pattern for the urls to scrape (%s is the page number)
PAGES_TO_SCRAPE = 50 # how many webpages to scrape

"""
This is the custom parser function for the Books to Scrape catalogue pages.
It extracts the name, price, and availability of each book on a page
and returns them as a pandas DataFrame.
This function will be passed to the spider class as `content_parser`.
"""
def quotes_parser(content):
    soup = BeautifulSoup(content, 'html.parser')

    # Each book on a catalogue page sits in an <article class="product_pod">
    books = soup.find_all('article', class_='product_pod')

    book_dict = {'Name': [], 'Price': [], 'Availability': []}

    for book in books:
        # Link text inside <h3>; the site shortens long titles with "..."
        book_dict['Name'].append(book.h3.a.get_text())

        # The second <div> of the article holds the price and availability paragraphs
        price_block = book.select('div')[1]
        book_dict['Price'].append(price_block.p.get_text())
        book_dict['Availability'].append(price_block.select('p')[1].get_text().strip())

    return pd.DataFrame(book_dict)

# Instantiate the IronhackSpider class
my_spider = IronhackSpdier(URL_PATTERN, PAGES_TO_SCRAPE,sleep_interval=True, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

                                     Name   Price Availability
0                      A Light in the ...  £51.77     In stock
1                      Tipping the Velvet  £53.74     In stock
2                              Soumission  £50.10     In stock
3                           Sharp Objects  £47.82     In stock
4            Sapiens: A Brief History ...  £54.23     In stock
5                         The Requiem Red  £22.65     In stock
6            The Dirty Little Secrets ...  £33.34     In stock
7                 The Coming Woman: A ...  £17.93     In stock
8                     The Boys in the ...  £22.60     In stock
9                         The Black Maria  £52.15     In stock
10  Starving Hearts (Triangular Trade ...  £13.99     In stock
11                  Shakespeare's Sonnets  £20.66     In stock
12                            Set Me Free  £17.46     In stock
13    Scott Pilgrim's Precious Little ...  £52.29     In stock
14                      Rip it Up and ...  £35.02     I

                                      Name   Price Availability
0                 The Shadow Hero (The ...  £33.14     In stock
1               The Secret (The Secret ...  £27.37     In stock
2               The Regional Office Is ...  £51.36     In stock
3               The Psychopath Test: A ...  £36.00     In stock
4                              The Project  £10.65     In stock
5                    The Power of Now: ...  £43.54     In stock
6            The Omnivore's Dilemma: A ...  £38.21     In stock
7          The Nerdy Nummies Cookbook: ...  £37.34     In stock
8                  The Murder of Roger ...  £44.10     In stock
9              The Mistake (Off-Campus #2)  £43.29     In stock
10  The Matchmaker's Playbook (Wingmen ...  £55.85     In stock
11                 The Love and Lemons ...  £37.60     In stock
12                  The Long Shadow of ...  £10.97     In stock
13                         The Kite Runner  £41.82     In stock
14                    The House by the .

                                  Name   Price Availability
0                  My Name Is Lucy ...  £41.56     In stock
1                        My Mrs. Brown  £24.48     In stock
2                     My Kind of Crazy  £40.36     In stock
3        Mr. Mercedes (Bill Hodges ...  £28.90     In stock
4         More Than Music (Chasing ...  £37.61     In stock
5               Made to Stick: Why ...  £38.85     In stock
6                Luis Paints the World  £53.95     In stock
7                  Luckiest Girl Alive  £49.83     In stock
8          Lowriders to the Center ...  £51.51     In stock
9                    Love Is a Mix ...  £18.03     In stock
10  Looking for Lovely: Collecting ...  £29.14     In stock
11   Living Leadership by Insight: ...  £46.91     In stock
12                   Let It Out: A ...  £26.79     In stock
13         Lady Midnight (The Dark ...  £16.28     In stock
14         It's All Easy: Healthy, ...  £19.55     In stock
15    Island of Dragons (Unwanteds ...  

                                   Name   Price Availability
0                         Without Shame  £48.27     In stock
1                              Watchmen  £58.05     In stock
2               Unlimited Intuition Now  £58.87     In stock
3                      Underlying Notes  £11.82     In stock
4                             The Shack  £28.03     In stock
5                The New Brand You: ...  £44.05     In stock
6   The Moosewood Cookbook: Recipes ...  £12.34     In stock
7                      The Flowers Lied  £16.68     In stock
8                 The Fabric of the ...  £55.91     In stock
9                    The Book of Mormon  £24.57     In stock
10              The Art and Science ...  £52.98     In stock
11                       The Alien Club  £54.40     In stock
12   Suzie Snowflake: One beautiful ...  £54.81     In stock
13                            Nap-a-Roo  £25.08     In stock
14           NaNo What Now? Finding ...  £10.41     In stock
15                    Mo

                                 Name   Price Availability
0      The Barefoot Contessa Cookbook  £59.92     In stock
1             Tell the Wolves I'm ...  £50.96     In stock
2      Ship Leaves Harbor: Essays ...  £30.60     In stock
3                 Pride and Prejudice  £19.27     In stock
4    Musicophilia: Tales of Music ...  £46.58     In stock
5                   Mere Christianity  £48.51     In stock
6               Me Before You (Me ...  £19.02     In stock
7            In the Woods (Dublin ...  £38.38     In stock
8                       In Cold Blood  £49.98     In stock
9            How to Stop Worrying ...  £46.49     In stock
10                       Give It Back  £18.32     In stock
11                  Girl, Interrupted  £42.14     In stock
12             Fun Home: A Family ...  £56.59     In stock
13          Fruits Basket, Vol. 6 ...  £20.96     In stock
14                    Deception Point  £40.32     In stock
15            Death Note, Vol. 6: ...  £36.39     In sto

                                      Name   Price Availability
0               Walk the Edge (Thunder ...  £32.36     In stock
1                   Voyager (Outlander #3)  £21.07     In stock
2                 Very Good Lives: The ...  £50.66     In stock
3    Vegan Vegetarian Omnivore: Dinner ...  £13.66     In stock
4   Unstuffed: Decluttering Your Home, ...  £58.09     In stock
5                  Under the Banner of ...  £30.00     In stock
6                         Two Boys Kissing  £32.74     In stock
7                   Twilight (Twilight #1)  £41.93     In stock
8                            Twenties Girl  £42.80     In stock
9      Trespassing Across America: One ...  £53.51     In stock
10                     Three-Martini Lunch  £23.21     In stock
11                 Thinking, Fast and Slow  £21.14     In stock
12                          The Wild Robot  £56.07     In stock
13                    The Wicked + The ...  £14.41     In stock
14                  The Undomestic Godde

                                 Name   Price Availability
0                Girl in the Blue ...  £46.83     In stock
1           Fruits Basket, Vol. 3 ...  £45.17     In stock
2          Friday Night Lights: A ...  £51.22     In stock
3   Fire Bound (Sea Haven/Sisters ...  £21.28     In stock
4       Fifty Shades Freed (Fifty ...  £15.36     In stock
5                            Fellside  £38.62     In stock
6   Extreme Prey (Lucas Davenport ...  £25.40     In stock
7   Eragon (The Inheritance Cycle ...  £43.87     In stock
8               Eclipse (Twilight #3)  £18.74     In stock
9                      Dune (Dune #1)  £54.86     In stock
10                            Dracula  £52.62     In stock
11           Do Androids Dream of ...  £51.48     In stock
12  Disrupted: My Misadventure in ...  £15.28     In stock
13            Dead Wake: The Last ...  £39.24     In stock
14  David and Goliath: Underdogs, ...  £17.81     In stock
15               Darkfever (Fever #1)  £56.02     In sto

                                    Name   Price Availability
0                           Frankenstein  £38.00     In stock
1        Forever Rockers (The Rocker ...  £28.80     In stock
2            Fighting Fate (Fighting #6)  £39.24     In stock
3                                   Emma  £32.93     In stock
4                        Eat, Pray, Love  £51.32     In stock
5        Deep Under (Walker Security ...  £47.09     In stock
6         Choosing Our Religion: The ...  £28.42     In stock
7          Charlie and the Chocolate ...  £22.85     In stock
8     Charity's Cross (Charles Towne ...  £41.24     In stock
9                           Bright Lines  £39.07     In stock
10    Bridget Jones's Diary (Bridget ...  £29.82     In stock
11         Bounty (Colorado Mountain #7)  £37.26     In stock
12  Blood Defense (Samantha Brinkman ...  £20.30     In stock
13        Bleach, Vol. 1: Strawberry ...  £34.65     In stock
14                  Beyond Good and Evil  £43.38     In stock
15      