# Web Crawler

Our web crawler will navigate the pages of the website **http://quotes.toscrape.com**.

This site was built specifically for practicing **web scraping**, which makes it a convenient target for our exercise.

To build our crawler we will use the **[Requests](https://requests.kennethreitz.org/en/master/)** and **[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)** libraries.

We start by importing the required libraries.

In [1]:
from bs4 import BeautifulSoup
import requests

We will define a function called **spider()** that:

- Visits pages up to the maximum number we pass as an argument
- Fetches the HTML of each page
- Uses our soup object to find the elements:
    - Representing the author of the quote
    - Representing the text of the quote
- Finally increments the page variable until the page limit is reached

In [2]:
def spider(max_pages):
    page = 1
    # Visit pages 1..max_pages (inclusive)
    while page < (max_pages + 1):
        url = f'http://quotes.toscrape.com/page/{page}/'
        source_code = requests.get(url)
        plain_text = source_code.text
        # The 'lxml' parser requires the lxml package to be installed
        soup = BeautifulSoup(plain_text, 'lxml')
        # Print every author on the page, then every quote text
        for autor in soup.find_all('small', class_='author'):
            print(autor.text)
        for quote in soup.find_all('span', class_='text'):
            print(quote.text)
        page += 1

We run our function, passing the value **2** as the argument.

- The spider will visit the pages **http://quotes.toscrape.com/page/1/** and **http://quotes.toscrape.com/page/2/**
- All quotes and their respective authors are extracted from the pages we visit

In [3]:
spider(2)

Albert Einstein
J.K. Rowling
Albert Einstein
Jane Austen
Marilyn Monroe
Albert Einstein
André Gide
Thomas A. Edison
Eleanor Roosevelt
Steve Martin
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
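
The two loops above print all authors first and all quotes afterwards, so the link between a quote and its author is only implied by position. A minimal sketch of one way to keep each pair together follows. It assumes that every quote on the page is wrapped in a `div` with class `quote`, which is not shown in this notebook, and it parses a small hand-written HTML sample instead of fetching the real site. The sample markup, the variable `pairs`, and the use of the built-in `html.parser` are illustrative choices, not part of the original code.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (hypothetical), assumed to resemble one page of the site.
html = """
<div class="quote">
  <span class="text">“First sample quote.”</span>
  <small class="author">Author One</small>
</div>
<div class="quote">
  <span class="text">“Second sample quote.”</span>
  <small class="author">Author Two</small>
</div>
"""

# 'html.parser' ships with Python, so this sketch does not need lxml installed.
soup = BeautifulSoup(html, "html.parser")

# Walk each quote container and read the author and the text from the same block.
pairs = []
for block in soup.find_all("div", class_="quote"):
    autor = block.find("small", class_="author").text
    quote = block.find("span", class_="text").text
    pairs.append((autor, quote))

for autor, quote in pairs:
    print(f"{autor}: {quote}")
```

Reading both fields from the same container means a missing author or text shows up as an error in that block rather than silently shifting every later pair.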

### Improving our Web Crawler

By turning our **spider()** function into a generator function, we can hand back our data as a dictionary in which the **author's name** is the **key** and the **quote** is the **value**.

To do that, we create two lists, one to store the **authors** and another to store the **quotes**.

Finally, we use the **yield** keyword so that the function becomes a generator, which gives us a dictionary with our data mapped as **key-value** pairs through the **zip()** function. Because the yield sits after the loop, the generator produces a single dictionary.

**Important**: dictionary keys are unique, so we end up with only one quote per author: each later quote overwrites the earlier one stored under the same name.

In [12]:
def spider(max_pages):
    page = 1
    autores = []
    quotes = []
    while page < (max_pages + 1):
        url = f'http://quotes.toscrape.com/page/{page}/'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'lxml')
        # Accumulate authors and quotes from every page, in page order
        for autor in soup.find_all('small', class_='author'):
            autores.append(autor.text)
        for quote in soup.find_all('span', class_='text'):
            quotes.append(quote.text)
        page += 1
    # Yield once, after the loop: repeated authors keep only their last quote
    yield dict(zip(autores, quotes))

Getting the generator object

In [13]:
crawler = spider(7)
print(crawler)

<generator object spider at 0x7fe66c37c150>


With a **for loop** we can iterate over the values produced by our generator (here, a single dictionary)

In [14]:
for c in crawler:
    print(c)

{'Albert Einstein': '“If I were not a physicist, I would probably be a musician. I often think in music. I live my daydreams in music. I see my life in terms of music.”', 'J.K. Rowling': '“Do not pity the dead, Harry. Pity the living, and, above all those who live without love.”', 'Jane Austen': '“There is nothing I would not do for those who are really my friends. I have no notion of loving people by halves, it is not my nature.”', 'Marilyn Monroe': '“I am good, but not an angel. I do sin, but I am not the devil. I am just a small girl in a big world trying to find someone to love.”', 'André Gide': '“It is better to be hated for what you are than to be loved for what you are not.”', 'Thomas A. Edison': "“I have not failed. I've just found 10,000 ways that won't work.”", 'Eleanor Roosevelt': '“Do one thing every day that scares you.”', 'Steve Martin': '“A day without sunshine is like, you know, night.”', 'Bob Marley': '“The truth is, everyone is going to hurt you. You just got to find 
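
The `spider()` generator yields exactly once, after its loop has finished, and a generator can only be consumed once. The sketch below illustrates both points without any network access, and contrasts them with a variant that yields one result per page. The functions `collect_all` and `collect_per_page` and the `fake_pages` data are hypothetical stand-ins, not part of the notebook.

```python
def collect_all(pages):
    """Same shape as spider(): accumulate everything, then yield once."""
    results = {}
    for items in pages.values():
        results.update(items)
    yield results


def collect_per_page(pages):
    """Alternative: yield one dictionary per page as soon as it is ready."""
    for number, items in pages.items():
        yield number, dict(items)


# Hypothetical stand-in for scraped pages (page number -> {author: quote}).
fake_pages = {
    1: {"Author A": "first quote"},
    2: {"Author B": "second quote"},
}

gen = collect_all(fake_pages)
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing left to consume

print(len(first_pass))                          # 1
print(len(second_pass))                         # 0
print(len(list(collect_per_page(fake_pages))))  # 2
```

The empty second pass is the reason a fresh generator has to be created with a new call before iterating again.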

### Looking up a Quote by Author

In [15]:
crawler = spider(1)

for c in crawler:
    print(c['Albert Einstein'])

“Try not to become a man of success. Rather become a man of value.”
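
Only one quote comes back for an author who appears several times on the page, because `dict(zip(...))` keeps the last value seen for a repeated key. If every quote should be kept, one possible approach is to group the quotes into a list per author. The sketch below uses short placeholder strings in place of scraped data, and `by_author` is a hypothetical name.

```python
from collections import defaultdict

# Placeholder data in the same shape as the `autores` and `quotes` lists.
autores = ["Albert Einstein", "J.K. Rowling", "Albert Einstein"]
quotes = ["quote 1", "quote 2", "quote 3"]

# dict(zip(...)) keeps only the last value seen for a repeated key.
last_only = dict(zip(autores, quotes))

# Grouping into lists preserves every quote for every author.
by_author = defaultdict(list)
for autor, quote in zip(autores, quotes):
    by_author[autor].append(quote)

print(last_only["Albert Einstein"])  # quote 3
print(by_author["Albert Einstein"])  # ['quote 1', 'quote 3']
```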


### Converting our results to JSON

For this we need to import the **json** library, which is part of Python's standard library
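
As a small illustration of the `json.dumps` options used further down in this section (`sort_keys`, `indent`, `ensure_ascii`), the sketch below serializes a tiny hand-made dictionary twice. The `data` dictionary and its placeholder values are assumptions made for the example, not scraped results.

```python
import json

# Hand-made sample with a non-ASCII character in one key.
data = {"André Gide": "sample quote", "Albert Einstein": "another sample"}

# Default settings: non-ASCII characters are escaped as \uXXXX sequences.
escaped = json.dumps(data)

# Sorted keys, 4-space indentation, and non-ASCII characters kept readable.
readable = json.dumps(data, sort_keys=True, indent=4, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode back to the same dictionary with `json.loads`; the options only change how the text looks.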

Once again, we obtain the generator object

In [16]:
crawler = spider(3)

Now we iterate over the values of our generator with the **for loop** and print the data in JSON format

In [17]:
import json

for c in crawler:
    print(json.dumps(c, sort_keys=True, indent=4, ensure_ascii=False))

{
    "Albert Einstein": "“Logic will get you from A to Z; imagination will get you everywhere.”",
    "Allen Saunders": "“Life is what happens to us while we are making other plans.”",
    "André Gide": "“It is better to be hated for what you are than to be loved for what you are not.”",
    "Bob Marley": "“One good thing about music, when it hits you, you feel no pain.”",
    "Douglas Adams": "“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”",
    "Dr. Seuss": "“Today you are You, that is truer than true. There is no one alive who is Youer than You.”",
    "Eleanor Roosevelt": "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
    "Elie Wiesel": "“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”",
    "Friedrich Nietzsche": "“It