# Advanced Web Scraping Lab

In this lab you will first study the code snippet below, a simple web spider class that lets you scrape paginated webpages. Read the code, run it, and make sure you understand how it works. The challenges in this lab guide you through building up this class so that you end up with a more robust web spider that you can keep developing in the Web Scraping Project.

In [4]:
import requests
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    Constructor parameters, stored as instance variables so that the
    class methods can access them later.
    
    url_pattern: printf-style URL template of the pages to scrape; the %s placeholder is replaced with the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the delay between requests, in seconds. If not greater than 0, requests are not delayed.
    content_parser: a function reference that extracts the intended info from the scraped content.
    """
    
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single URL.
    """
    def scrape_url(self, url):
        response = requests.get(url)
        result = self.content_parser(response.content)
        self.output_results(result)
    
    """
    Export the scraped content. Right now it simply prints the results,
    but in the future you can export them to a text file or a database.
     """
    
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    It uses a for loop to call `scrape_url()` once per page URL.
     """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # URL template; %s is replaced with the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

"""
This is a custom parser function you will complete in the challenge.
Right now it simply returns the content passed to it. In this lab
you will complete this function so that it extracts the quotes.
This function will be passed to the IronhackSpider class.
"""

def quotes_parser(content):
    return content

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\xe2\x80\

## Challenge 1 - Custom Parser Function

In this challenge, complete the custom `quotes_parser()` function so that the returned result contains the quote string instead of the whole html page content.
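
As one possible starting point, the sketch below parses a small hand-written HTML sample instead of the live page. It assumes that each quote sits in a `span` with class `text` inside a `div` with class `quote`, which matches the raw output printed above. The name `quote_text_parser` and the sample markup are illustrative and not part of the lab code.

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics the structure of a quotes page (illustrative only)
SAMPLE_HTML = """
<div class="quote">
  <span class="text">First quote.</span>
  <small class="author">Author One</small>
</div>
<div class="quote">
  <span class="text">Second quote.</span>
  <small class="author">Author Two</small>
</div>
"""

def quote_text_parser(content):
    """Return only the quote strings, without author or tags."""
    soup = BeautifulSoup(content, "html.parser")
    # The CSS selector keeps only the text span nested inside each quote block
    return [span.get_text(strip=True) for span in soup.select("div.quote span.text")]

print(quote_text_parser(SAMPLE_HTML))
```

Because the function accepts either bytes or a string, it can be passed to the spider as `content_parser` in the same way as `quotes_parser`.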

In the cell below, write your updated `quotes_parser()` function and kickstart the spider. Make sure the results being printed contain a list of quote strings extracted from the html content.

In [3]:
# your code here
import requests
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    Constructor parameters, stored as instance variables so that the
    class methods can access them later.
    
    url_pattern: printf-style URL template of the pages to scrape; the %s placeholder is replaced with the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the delay between requests, in seconds. If not greater than 0, requests are not delayed.
    content_parser: a function reference that extracts the intended info from the scraped content.
    """
    
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single URL.
    """
    def scrape_url(self, url):
        response = requests.get(url)
        result = self.content_parser(response.content)
        self.output_results(result)
    
    """
    Export the scraped content. Right now it simply prints the results,
    but in the future you can export them to a text file or a database.
     """
    
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    It uses a for loop to call `scrape_url()` once per page URL.
     """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # URL template; %s is replaced with the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

"""
This is the custom parser function passed to the IronhackSpider class.
It extracts every quote block from the page content and returns
the blocks as a list of strings.
"""

def quotes_parser(content):
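    # Name the parser explicitly so bs4 does not have to guess one
    sopa_git = BeautifulSoup(content, 'html.parser')
    # Each quote (text, author, tags) lives in its own <div class="quote"> block
    bloques = sopa_git.select('div.quote')
    respuesta = []
    for item in bloques:
        # Flatten the text of each block onto a single line
        respuesta.append(item.text.replace('\n', ' '))
    return respuesta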

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

[' “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” by Albert Einstein (about)               Tags:              change deep-thoughts thinking world  ', ' “It is our choices, Harry, that show what we truly are, far more than our abilities.” by J.K. Rowling (about)               Tags:              abilities choices  ', ' “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” by Albert Einstein (about)               Tags:              inspirational life live miracle miracles  ', ' “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.” by Jane Austen (about)               Tags:              aliteracy books classic humor  ', " “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.” by Marilyn Monroe (about)               Tags:              be-yo

## Challenge 2 - Error Handling

In `IronhackSpider.scrape_url()`, catch any error that might occur when you make requests to scrape the webpage. This includes checking the response status code and catching http request errors such as timeout, SSL, and too many redirects.
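
One way to combine the status check with the exception handling is sketched below. It relies on `Response.raise_for_status()`, which raises `requests.exceptions.HTTPError` for 4xx and 5xx responses, so failed requests never reach the parser. The helper name `fetch`, its `getter` parameter, and the simulated failure are assumptions made for illustration; they are not part of the lab code.

```python
import requests

def fetch(url, getter=requests.get, timeout=10):
    """Return (content, error); exactly one of the two is None."""
    try:
        response = getter(url, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx and 5xx status codes
        return response.content, None
    except requests.exceptions.Timeout:
        return None, "timeout"
    except requests.exceptions.TooManyRedirects:
        return None, "too many redirects"
    except requests.exceptions.SSLError:
        return None, "ssl error"
    except requests.exceptions.HTTPError as err:
        return None, f"http error {err.response.status_code}"
    except requests.exceptions.RequestException as err:
        # Base class of all requests exceptions, so it must come last
        return None, f"request failed: {err}"

# Simulate a timeout without touching the network (hypothetical stand-in for requests.get)
def always_times_out(url, timeout):
    raise requests.exceptions.Timeout("simulated")

print(fetch("http://example.invalid/", getter=always_times_out))
```

Inside `scrape_url()`, the same idea would mean calling the parser only when no error was reported.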

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [16]:
import requests
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    Constructor parameters, stored as instance variables so that the
    class methods can access them later.
    
    url_pattern: printf-style URL template of the pages to scrape; the %s placeholder is replaced with the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the delay between requests, in seconds. If not greater than 0, requests are not delayed.
    content_parser: a function reference that extracts the intended info from the scraped content.
    """
    
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single URL.
    """
    def scrape_url(self, url):
        try:
            response = requests.get(url, timeout=10)
            
            
            if response.status_code < 300:
                print('request was successful \n')
            elif response.status_code >= 400 and response.status_code < 500:
                print('request failed because the resource either does not exist or is forbidden \n')
            else:
                print('request failed because the response server encountered an error \n')
                
            result = self.content_parser(response.content)
            self.output_results(result)
            
        except requests.exceptions.Timeout:
            print ('TIMEOUT')
            # timeout error... do something
        except requests.exceptions.TooManyRedirects:
            print ('MANY REDIRECTS')
            # redirect error... do something
        except requests.exceptions.SSLError:
            print ('SSL ERROR')
            # ssl error... do something
        except requests.exceptions.RequestException as e:
            print ('REQUESTS EXCEPTION')
            # other unknown errors... do something
        
    
    """
    Export the scraped content. Right now it simply prints the results,
    but in the future you can export them to a text file or a database.
     """
    
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    It uses a for loop to call `scrape_url()` once per page URL.
     """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)




"""
This is the custom parser function passed to the IronhackSpider class.
It extracts every quote block from the page content and returns
the blocks as a list of strings.
"""

def quotes_parser(content):
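    # Name the parser explicitly so bs4 does not have to guess one
    sopa_git = BeautifulSoup(content, 'html.parser')
    # Each quote (text, author, tags) lives in its own <div class="quote"> block
    bloques = sopa_git.select('div.quote')
    respuesta = []
    for item in bloques:
        # Flatten the text of each block onto a single line
        respuesta.append(item.text.replace('\n', ' '))
    return respuesta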


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # URL template; %s is replaced with the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

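# Call scrape_url() directly with the raw URL_PATTERN: the %s placeholder is not
# filled in, so the request fails and exercises the status-code check.
# Call my_spider.kickstart() instead to scrape the numbered pages.
my_spider.scrape_url(URL_PATTERN)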


request failed because the response server encountered an error 

[]


## Challenge 3 - Sleep Interval

In `IronhackSpider.kickstart()`, implement `sleep_interval`. You will check if `self.sleep_interval` is larger than 0. If so, tell the FOR loop to sleep the given amount of time before making the next request.
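
The pacing logic can be tried on its own, without any network traffic, as in the sketch below. The function `crawl` and its `scrape` and `sleep` parameters are hypothetical stand-ins for `kickstart()`, `scrape_url()`, and `time.sleep()`. Passing the sleep function as a parameter makes it possible to observe the delays instead of waiting for them.

```python
import time

def crawl(urls, scrape, sleep_interval=-1, sleep=time.sleep):
    """Call scrape(url) for each URL, pausing between consecutive requests."""
    for index, url in enumerate(urls):
        # Pause before every request except the first, and only for a positive interval
        if index > 0 and sleep_interval > 0:
            sleep(sleep_interval)
        scrape(url)

# Record what happens instead of really scraping or sleeping
events = []
crawl(
    ["page-1", "page-2", "page-3"],
    scrape=lambda url: events.append(url),
    sleep_interval=2,
    sleep=lambda seconds: events.append(f"sleep {seconds}s"),
)
print(events)
# prints ['page-1', 'sleep 2s', 'page-2', 'sleep 2s', 'page-3']
```

Note that the delay uses the configured interval rather than a fixed number of seconds, and that no pause is added after the last page.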

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [12]:
# your code here
import requests
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    Constructor parameters, stored as instance variables so that the
    class methods can access them later.
    
    url_pattern: printf-style URL template of the pages to scrape; the %s placeholder is replaced with the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the delay between requests, in seconds. If not greater than 0, requests are not delayed.
    content_parser: a function reference that extracts the intended info from the scraped content.
    """
    
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single URL.
    """
    def scrape_url(self, url):
        try:
            response = requests.get(url, timeout=10)
            
            if response.status_code < 300:
                print('request was successful \n')
            elif 400 <= response.status_code < 500:
                print('request failed because the resource either does not exist or is forbidden \n')
            else:
                print('request failed because the response server encountered an error \n')
                
            result = self.content_parser(response.content)
            self.output_results(result)
            
        except requests.exceptions.Timeout:
            print ('TIMEOUT')
            # timeout error... do something
        except requests.exceptions.TooManyRedirects:
            print ('MANY REDIRECTS')
            # redirect error... do something
        except requests.exceptions.SSLError:
            print ('SSL ERROR')
            # ssl error... do something
        except requests.exceptions.RequestException as e:
            print ('REQUESTS EXCEPTION')
            # other unknown errors... do something
        
    
    """
    Export the scraped content. Right now it simply prints the results,
    but in the future you can export them to a text file or a database.
     """
    
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    It uses a for loop to call `scrape_url()` once per page URL.
     """
    def kickstart(self):
        import time
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)
            if self.sleep_interval > 0:
                print('\nSLEEP INTERVAL \n')
                time.sleep(self.sleep_interval)  # wait the configured number of seconds
            


    

"""
This is the custom parser function passed to the IronhackSpider class.
It extracts every quote block from the page content and returns
the blocks as a list of strings.
"""

def quotes_parser(content):
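    # Name the parser explicitly so bs4 does not have to guess one
    sopa_git = BeautifulSoup(content, 'html.parser')
    # Each quote (text, author, tags) lives in its own <div class="quote"> block
    bloques = sopa_git.select('div.quote')
    respuesta = []
    for item in bloques:
        # Flatten the text of each block onto a single line
        respuesta.append(item.text.replace('\n', ' '))
    return respuesta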


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # URL template; %s is replaced with the page number
PAGES_TO_SCRAPE = 2 # how many webpages to scrape

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE,  sleep_interval=1, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

request was successful 

[' “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” by Albert Einstein (about)               Tags:              change deep-thoughts thinking world  ', ' “It is our choices, Harry, that show what we truly are, far more than our abilities.” by J.K. Rowling (about)               Tags:              abilities choices  ', ' “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” by Albert Einstein (about)               Tags:              inspirational life live miracle miracles  ', ' “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.” by Jane Austen (about)               Tags:              aliteracy books classic humor  ', " “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.” by Marilyn Monroe (about)              

## Challenge 4 - Test Batch Scraping

Change the `PAGES_TO_SCRAPE` value from `1` to `10`. Check whether your code still works as intended when scraping 10 webpages. If there are errors in your code, fix them.
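
Before sending ten real requests, it can help to list the URLs that the loop in `kickstart()` will generate. The sketch below uses the same `%` string formatting as the spider. The helper name `build_urls` is an assumption for illustration.

```python
URL_PATTERN = 'http://quotes.toscrape.com/page/%s/'

def build_urls(url_pattern, pages_to_scrape):
    """Expand the URL template once per page number, starting at page 1."""
    return [url_pattern % page for page in range(1, pages_to_scrape + 1)]

urls = build_urls(URL_PATTERN, 10)
print(len(urls))   # prints 10
print(urls[0])     # prints http://quotes.toscrape.com/page/1/
print(urls[-1])    # prints http://quotes.toscrape.com/page/10/
```

The `%s` placeholder is plain printf-style string formatting, not a regular expression, so each page number is simply substituted into the template.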

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [20]:
# your code here
import requests
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    Constructor parameters, stored as instance variables so that the
    class methods can access them later.
    
    url_pattern: printf-style URL template of the pages to scrape; the %s placeholder is replaced with the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the delay between requests, in seconds. If not greater than 0, requests are not delayed.
    content_parser: a function reference that extracts the intended info from the scraped content.
    """
    
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single URL.
    """
    def scrape_url(self, url):
        try:
            response = requests.get(url, timeout=10)
            
            if response.status_code < 300:
                print('request was successful \n')
            elif 400 <= response.status_code < 500:
                print('request failed because the resource either does not exist or is forbidden \n')
            else:
                print('request failed because the response server encountered an error \n')
                
            result = self.content_parser(response.content)
            self.output_results(result)
            
        except requests.exceptions.Timeout:
            print ('TIMEOUT')
            # timeout error... do something
        except requests.exceptions.TooManyRedirects:
            print ('MANY REDIRECTS')
            # redirect error... do something
        except requests.exceptions.SSLError:
            print ('SSL ERROR')
            # ssl error... do something
        except requests.exceptions.RequestException as e:
            print ('REQUESTS EXCEPTION')
            # other unknown errors... do something
        
    
    """
    Export the scraped content. Right now it simply prints the results,
    but in the future you can export them to a text file or a database.
     """
    
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    It uses a for loop to call `scrape_url()` once per page URL.
     """
    def kickstart(self):
        import time
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)
            if self.sleep_interval > 0:
                print(f'\nSLEEP INTERVAL page {i}\n')
                time.sleep(self.sleep_interval)  # wait the configured number of seconds
            


    

"""
This is the custom parser function passed to the IronhackSpider class.
It extracts every quote block from the page content and returns
the blocks as a list of strings.
"""

def quotes_parser(content):
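    # Name the parser explicitly so bs4 does not have to guess one
    sopa_git = BeautifulSoup(content, 'html.parser')
    # Each quote (text, author, tags) lives in its own <div class="quote"> block
    bloques = sopa_git.select('div.quote')
    respuesta = []
    for item in bloques:
        # Flatten the text of each block onto a single line
        respuesta.append(item.text.replace('\n', ' '))
    return respuesta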


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # URL template; %s is replaced with the page number
PAGES_TO_SCRAPE = 10 # how many webpages to scrape

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE,  sleep_interval=1, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

request was successful 

[' “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” by Albert Einstein (about)               Tags:              change deep-thoughts thinking world  ', ' “It is our choices, Harry, that show what we truly are, far more than our abilities.” by J.K. Rowling (about)               Tags:              abilities choices  ', ' “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” by Albert Einstein (about)               Tags:              inspirational life live miracle miracles  ', ' “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.” by Jane Austen (about)               Tags:              aliteracy books classic humor  ', " “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.” by Marilyn Monroe (about)              

## Challenge 5 - Scrape a Different Website

Update the parameters passed to the `IronhackSpider` constructor so that your code can crawl [books.toscrape.com](http://books.toscrape.com/). You will need a different `URL_PATTERN` (work out the new URL pattern yourself) and another parser function to pass to `IronhackSpider`.
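
A parser for a different site only needs to accept the page content and return a list, so the spider class itself does not have to change. The sketch below runs against a hand-written sample. It assumes that each book link sits inside an `h3` and carries the full title in a `title` attribute; check the real page markup before relying on this. The name `books_parser` and the sample HTML are illustrative.

```python
from bs4 import BeautifulSoup

# Hand-written sample; the real catalogue markup may differ (assumption)
SAMPLE_HTML = """
<ol>
  <li><article><h3><a href="book-1.html" title="A Long Book Title">A Long Book ...</a></h3></article></li>
  <li><article><h3><a href="book-2.html">Short Title</a></h3></article></li>
</ol>
"""

def books_parser(content):
    """Return one title per book link, preferring the title attribute."""
    soup = BeautifulSoup(content, "html.parser")
    titles = []
    for link in soup.select("h3 a"):
        # Fall back to the visible link text when there is no title attribute
        titles.append(link.get("title", link.get_text(strip=True)))
    return titles

print(books_parser(SAMPLE_HTML))
```

Preferring the attribute over the visible text avoids collecting titles that the page has shortened for display.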

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [21]:
# your code here

In [22]:
import requests
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    Constructor parameters, stored as instance variables so that the
    class methods can access them later.
    
    url_pattern: printf-style URL template of the pages to scrape; the %s placeholder is replaced with the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the delay between requests, in seconds. If not greater than 0, requests are not delayed.
    content_parser: a function reference that extracts the intended info from the scraped content.
    """
    
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single URL.
    """
    def scrape_url(self, url):
        try:
            response = requests.get(url, timeout=10)
            
            if response.status_code < 300:
                print('request was successful \n')
            elif 400 <= response.status_code < 500:
                print('request failed because the resource either does not exist or is forbidden \n')
            else:
                print('request failed because the response server encountered an error \n')
                
            result = self.content_parser(response.content)
            self.output_results(result)
            
        except requests.exceptions.Timeout:
            print ('TIMEOUT')
            # timeout error... do something
        except requests.exceptions.TooManyRedirects:
            print ('MANY REDIRECTS')
            # redirect error... do something
        except requests.exceptions.SSLError:
            print ('SSL ERROR')
            # ssl error... do something
        except requests.exceptions.RequestException as e:
            print ('REQUESTS EXCEPTION')
            # other unknown errors... do something
        
    
    """
    Export the scraped content. Right now it simply prints the results,
    but in the future you can export them to a text file or a database.
     """
    
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        import time
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)
            if self.sleep_interval > 0:
                print('\nSLEEP INTERVAL \n')
                time.sleep(self.sleep_interval)  # honor the configured delay instead of a fixed 1 second
            


    

"""
This is the custom parser function passed to the IronhackSpider class.
It parses the HTML of a scraped page with BeautifulSoup and returns
the list of book titles found on that page.
"""

def quotes_parser(content):
    from bs4 import BeautifulSoup
    # Name the parser explicitly so BeautifulSoup does not have to guess one
    sopa_git = BeautifulSoup(content, 'html.parser')
    # Each book title is the link text inside the <h3> of a product block
    bloques = sopa_git.select('li.col-xs-6.col-sm-4.col-md-3.col-lg-3 h3 a')
    libros=[]
    for item in bloques:
        libros.append(item.text)
    return libros


In [23]:
URL_PATTERN = 'http://books.toscrape.com/catalogue/page-%s.html' # %-format pattern for the urls to scrape (%s is replaced by the page number)
PAGES_TO_SCRAPE = 3 # how many webpages to scrape

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE,  sleep_interval=1, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

request was successful 

['A Light in the ...', 'Tipping the Velvet', 'Soumission', 'Sharp Objects', 'Sapiens: A Brief History ...', 'The Requiem Red', 'The Dirty Little Secrets ...', 'The Coming Woman: A ...', 'The Boys in the ...', 'The Black Maria', 'Starving Hearts (Triangular Trade ...', "Shakespeare's Sonnets", 'Set Me Free', "Scott Pilgrim's Precious Little ...", 'Rip it Up and ...', 'Our Band Could Be ...', 'Olio', 'Mesaerion: The Best Science ...', 'Libertarianism for Beginners', "It's Only the Himalayas"]

SLEEP INTERVAL 

request was successful 

['In Her Wake', 'How Music Works', 'Foolproof Preserving: A Guide ...', 'Chase Me (Paris Nights ...', 'Black Dust', 'Birdsong: A Story in ...', "America's Cradle of Quarterbacks: ...", 'Aladdin and His Wonderful ...', 'Worlds Elsewhere: Journeys Around ...', 'Wall and Piece', 'The Four Agreements: A ...', 'The Five Love Languages: ...', 'The Elephant Tree', 'The Bear and the ...', "Sophie's World", 'Penny Maybe', 'Maude (1883-1993):

# Bonus Challenge 1 - Making Your Spider Unblockable

Use techniques such as randomizing user agents and referers in your requests to reduce the likelihood that your spider is blocked by websites. [Here](http://blog.adnansiddiqi.me/5-strategies-to-write-unblock-able-web-scrapers-in-python/) is a great article to learn these techniques.

In the cell below, place your entire code, including the updated `IronhackSpider` class and the code to kickstart the spider.
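
As a rough illustration of the idea above, the sketch below picks a random user agent and referer for each request. The names `USER_AGENTS`, `REFERERS`, and `random_headers` are hypothetical and not part of the lab; the two user agent strings are taken from the sample output in this notebook, and the referer URLs are placeholder assumptions.

```python
import random

# Hypothetical pools for illustration; in practice they could be loaded from a file.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36',
]
REFERERS = [
    'https://www.google.com/',
    'https://www.bing.com/',
]

def random_headers(user_agents=USER_AGENTS, referers=REFERERS):
    """Return request headers with a randomly chosen User-Agent and Referer."""
    return {
        'User-Agent': random.choice(user_agents),
        'Referer': random.choice(referers),
    }
```

One possible use inside `scrape_url()` would be `requests.get(url, headers=random_headers(), timeout=10)`, so that each request carries a different header combination.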

In [24]:
# your code here

import requests
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    This is the spider class. You can pass a bunch of parameters to its constructor.
    These parameters are stored in the class instance variables so that the
    class functions can access them later.
    
    url_pattern: a %-style format string for the web urls to scrape (%s is replaced by the page number)
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <=0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url, sending a randomly chosen user agent.
    """
    def scrape_url(self, url):
        try:
            user_agent = get_random_ua()
            user_agent = user_agent.strip('\n')
            headers = {'user-agent': user_agent}
            print(user_agent)
            response = requests.get(url, headers=headers, timeout=10)
            
            if response.status_code < 300:
                print('request was successful \n')
            elif 400 <= response.status_code < 500:
                print('request failed because the resource either does not exist or is forbidden \n')
            else:
                print('request failed because the response server encountered an error \n')
                
            result = self.content_parser(response.content)
            self.output_results(result)
            
        except requests.exceptions.Timeout:
            print('TIMEOUT')
            # timeout error... do something
        except requests.exceptions.TooManyRedirects:
            print('MANY REDIRECTS')
            # redirect error... do something
        except requests.exceptions.SSLError:
            print('SSL ERROR')
            # ssl error... do something
        except requests.exceptions.RequestException:
            print('REQUESTS EXCEPTION')
            # other unknown errors... do something
        
    
    """
    Export the scraped content. Right now it simply prints the results.
    But in the future you can export the results to a text file or database.
    """
    
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        import time
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)
            if self.sleep_interval > 0:
                print('\nSLEEP INTERVAL \n')
                time.sleep(self.sleep_interval)  # honor the configured delay instead of a fixed 1 second
            


    

"""
This is the custom parser function passed to the IronhackSpider class.
It parses the HTML of a scraped page with BeautifulSoup and returns
the list of book titles found on that page.
"""

def quotes_parser(content):
    from bs4 import BeautifulSoup
    # Name the parser explicitly so BeautifulSoup does not have to guess one
    sopa_git = BeautifulSoup(content, 'html.parser')
    # Each book title is the link text inside the <h3> of a product block
    bloques = sopa_git.select('li.col-xs-6.col-sm-4.col-md-3.col-lg-3 h3 a')
    libros=[]
    for item in bloques:
        libros.append(item.text)
    return libros

import random

def get_random_ua():
    # Return a random user agent read from agents.txt (one user agent per line),
    # or an empty string if the file cannot be read or is empty.
    random_ua = ''
    ua_file = 'agents.txt'
    try:
        with open(ua_file) as f:
            lines = [line.strip() for line in f if line.strip()]
        if lines:
            # random.choice draws from every line, including the last one,
            # and avoids the deprecated np.integer dtype
            random_ua = random.choice(lines)
    except OSError as ex:
        print('Exception in random_ua')
        print(str(ex))
    return random_ua
    

In [25]:
URL_PATTERN = 'http://books.toscrape.com/catalogue/page-%s.html' # %-format pattern for the urls to scrape (%s is replaced by the page number)
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE,  sleep_interval=1, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

Mozilla/5.0 (LG-T375 AppleWebkit/531 Browser/Phantom/V2.0 Widget/LGMW/3.0 MMS/LG-MMS-V1.0/1.2 Java/ASVM/1.1 Profile/MIDP-2.1 Configuration/CLDC-1.1)
request was successful 

['A Light in the ...', 'Tipping the Velvet', 'Soumission', 'Sharp Objects', 'Sapiens: A Brief History ...', 'The Requiem Red', 'The Dirty Little Secrets ...', 'The Coming Woman: A ...', 'The Boys in the ...', 'The Black Maria', 'Starving Hearts (Triangular Trade ...', "Shakespeare's Sonnets", 'Set Me Free', "Scott Pilgrim's Precious Little ...", 'Rip it Up and ...', 'Our Band Could Be ...', 'Olio', 'Mesaerion: The Best Science ...', 'Libertarianism for Beginners', "It's Only the Himalayas"]

SLEEP INTERVAL 



In [26]:
URL_PATTERN = 'http://books.toscrape.com/catalogue/page-%s.html' # %-format pattern for the urls to scrape (%s is replaced by the page number)
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE,  sleep_interval=1, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36
request was successful 

['A Light in the ...', 'Tipping the Velvet', 'Soumission', 'Sharp Objects', 'Sapiens: A Brief History ...', 'The Requiem Red', 'The Dirty Little Secrets ...', 'The Coming Woman: A ...', 'The Boys in the ...', 'The Black Maria', 'Starving Hearts (Triangular Trade ...', "Shakespeare's Sonnets", 'Set Me Free', "Scott Pilgrim's Precious Little ...", 'Rip it Up and ...', 'Our Band Could Be ...', 'Olio', 'Mesaerion: The Best Science ...', 'Libertarianism for Beginners', "It's Only the Himalayas"]

SLEEP INTERVAL 



In [27]:
URL_PATTERN = 'http://books.toscrape.com/catalogue/page-%s.html' # %-format pattern for the urls to scrape (%s is replaced by the page number)
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE,  sleep_interval=1, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36
request was successful 

['A Light in the ...', 'Tipping the Velvet', 'Soumission', 'Sharp Objects', 'Sapiens: A Brief History ...', 'The Requiem Red', 'The Dirty Little Secrets ...', 'The Coming Woman: A ...', 'The Boys in the ...', 'The Black Maria', 'Starving Hearts (Triangular Trade ...', "Shakespeare's Sonnets", 'Set Me Free', "Scott Pilgrim's Precious Little ...", 'Rip it Up and ...', 'Our Band Could Be ...', 'Olio', 'Mesaerion: The Best Science ...', 'Libertarianism for Beginners', "It's Only the Himalayas"]

SLEEP INTERVAL 



# Bonus Challenge 2 - Making Asynchronous Calls

Implement asynchronous calls to `IronhackSpider`. You will make requests in parallel to complete your tasks faster.

In the cell below, place your entire code, including the updated `IronhackSpider` class and the code to kickstart the spider.
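
One possible starting point, offered only as a sketch, is to run the per-page calls in a thread pool from the standard library rather than with `asyncio`. The helper name `scrape_pages_concurrently` and its `fetch` and `max_workers` parameters are hypothetical and not part of the lab; `fetch` stands for any callable that takes a url, for example a spider's `scrape_url` method.

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_pages_concurrently(url_pattern, pages_to_scrape, fetch, max_workers=4):
    """Call fetch(url) for pages 1..pages_to_scrape in parallel threads.

    Results are returned in page order, whatever order the calls finish in.
    """
    # Build the page urls the same way kickstart() does
    urls = [url_pattern % i for i in range(1, pages_to_scrape + 1)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input urls
        return list(pool.map(fetch, urls))
```

With the spider above, a call such as `scrape_pages_concurrently(URL_PATTERN, PAGES_TO_SCRAPE, my_spider.scrape_url)` would issue the requests in parallel. Because `scrape_url()` prints its results instead of returning them, output from different pages may interleave, and the sleep interval between requests no longer applies.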

In [None]:
# your code here