# Web scraping with Python

This notebook introduces the basic concepts of web scraping and proposes a few exercises.
We will use Python concepts such as functions and control flow, Internet concepts such as the HTTP protocol and URLs, and the fundamental building blocks of the web: HTML, CSS, JavaScript etc.

## Bio and contacts
Vítor Mussa | software developer [@basedosdados](https://basedosdados.org/) | data and research engineer [@labhdufba](http://www.labhd.ufba.br/) | computational social science researcher [@ppgsaufrj](http://ppgsa.ifcs.ufrj.br/)

 twitter [@vitormussa](https://twitter.com/vitormussa) | linkedin [https://www.linkedin.com/in/vmussa/](https://www.linkedin.com/in/vmussa/) | github [@vmussa](https://github.com/vmussa)

## How does the World Wide Web (or simply the Web) work?

### The Internet as a global system of interconnected computer networks
#### Computer networks
![The worldwide network of computers](internet.png "Internet")

#### Infrastructure
![Submarine cables that connect the computers](internet2.png "Internet infrastructure")

#### URLs and the HTTP protocol
* URL: Uniform Resource Locator -> web address
* HTTP: Hypertext Transfer Protocol -> foundation of data communication on the web

![The HTTP protocol and a URL being used in the browser](http.png "HTTP/URL in the browser")
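
To make the idea of a URL as a web address more concrete, the sketch below splits the example URL used later in this notebook into its main parts with `urlsplit` from Python's standard library. The variable names are only illustrative.

```python
from urllib.parse import urlsplit

# The example page used later in this notebook
example_url = "http://pythonscraping.com/pages/page1.html"

parts = urlsplit(example_url)

print(parts.scheme)  # protocol used to talk to the server
print(parts.netloc)  # host that serves the resource
print(parts.path)    # location of the resource on that host
```

The scheme tells the browser which protocol to use (here HTTP), the host identifies the server, and the path identifies the resource on that server.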





### How does the browser turn the data received over HTTP into visual elements?

#### The source code of websites: HTML, CSS and JavaScript

Example from the page [http://pythonscraping.com/pages/page1.html](http://pythonscraping.com/pages/page1.html)

```html
<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>
```

#### A more complex example, with CSS: [https://quotes.toscrape.com](https://quotes.toscrape.com); and another, with JavaScript: [https://www.globo.com/](https://www.globo.com/)
Right-click on the page and choose `View page source`.

## From the browser to code: how to read the web with Python?

In [1]:
!pip install requests
!pip install bs4



### `requests`: making HTTP requests with Python

In [2]:
import requests

r = requests.get('http://pythonscraping.com/pages/page1.html')

In [3]:
# module from Python's standard library for prettier printing
from pprint import pprint

pprint(r.content)

(b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Int'
 b'eresting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipis'
 b'icing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqu'
 b'a. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi u'
 b't aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in'
 b' voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint'
 b' occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit '
 b'anim id est laborum.\n</div>\n</body>\n</html>\n')
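
The `b'...'` prefix in the output above shows that `r.content` holds raw bytes rather than text. The sketch below uses a shortened stand-in for those bytes (it does not hit the network) to show how decoding turns them into a regular string. `requests` also exposes the decoded body as `r.text`.

```python
# A shortened stand-in for the bytes shown above (not fetched from the network)
raw_bytes = b"<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n"

# Decoding turns the raw bytes into a regular Python string
decoded_html = raw_bytes.decode("utf-8")

print(type(raw_bytes).__name__)
print(type(decoded_html).__name__)
print(decoded_html)
```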


### `BeautifulSoup`: turning HTML into structured data

In [4]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.content, 'html.parser')

In [5]:
soup.prettify()

'<html>\n <head>\n  <title>\n   A Useful Page\n  </title>\n </head>\n <body>\n  <h1>\n   An Interesting Title\n  </h1>\n  <div>\n   Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n  </div>\n </body>\n</html>\n'

#### Navigating the HTML tree

In [6]:
soup.head

<head>
<title>A Useful Page</title>
</head>

In [7]:
soup.title

<title>A Useful Page</title>

In [8]:
soup.h1

<h1>An Interesting Title</h1>

In [9]:
soup.div

<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>

## Scraping the web: an introductory example

Let's combine the explanations above with our Python knowledge to scrape the following page: [https://quotes.toscrape.com/](https://quotes.toscrape.com/). This page was created by ScrapingHub, the company behind the advanced web scraping library `Scrapy`, to introduce beginners to web scraping.

In [10]:
r = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(r.content, 'html.parser')

#### Reusing the code above in a function

In [11]:
def get_soup(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

soup = get_soup("https://quotes.toscrape.com/")
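
The function above assumes every request succeeds. As a possible refinement, the sketch below defines a hypothetical variant, `get_soup_checked` (the name and the default timeout are assumptions, not part of the original notebook), that sets a timeout and raises an error when the server answers with a 4xx or 5xx status.

```python
import requests
from bs4 import BeautifulSoup


def get_soup_checked(url, timeout=10):
    """Hypothetical variant of `get_soup` that fails loudly on HTTP errors."""
    # `timeout` (in seconds) keeps the call from waiting forever
    response = requests.get(url, timeout=timeout)
    # raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

Calling `get_soup_checked("https://quotes.toscrape.com/")` should behave like `get_soup`, except that a failed request stops the program instead of silently producing an empty or misleading `soup`.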

#### How do we get the Einstein quote?

In [12]:
soup.span.text

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

In [13]:
soup.small.text

'Albert Einstein'

#### The `find` and `find_all` methods of `BeautifulSoup`

In [14]:
soup.find('span', class_='text').text

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

In [15]:
soup.find('small', class_="author").text

'Albert Einstein'
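
One detail worth keeping in mind: `find` returns `None` when nothing matches, so reading `.text` on the result would raise an `AttributeError`. The sketch below shows this on a small inline document (the HTML and the names `demo_soup`, `found` and `missing` are made up for illustration) and one way to guard against it.

```python
from bs4 import BeautifulSoup

# Small inline document that imitates one quote block from the page
demo_html = """
<div class="quote">
  <span class="text">An example quote.</span>
  <small class="author">Example Author</small>
</div>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

found = demo_soup.find("small", class_="author")
missing = demo_soup.find("small", class_="does-not-exist")

print(found.text)
print(missing)

# Guard against None before reading `.text`
missing_text = missing.text if missing is not None else ""
```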

#### Getting a list of elements with `find_all`

In [16]:
soup.find_all('span')

[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>,
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
 <span>by <small class="author" itemprop="author">J.K. Rowling</small>
 <a href="/author/J-K-Rowling">(about)</a>
 </span>,
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>,
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,

In [17]:
elements = soup.find_all('span', class_="text")
elements

[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,
 <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>,
 <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>,
 <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.

#### Transforming our list with Python's list comprehensions

In [18]:
elements = [element.text for element in soup.find_all('span', class_='text')]
elements

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 '“Try not to become a man of success. Rather become a man of value.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“I have not failed. I've just found 10,000 ways that won't work.”",
 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 '“A day without sunshine is like, you know, night.”']
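
For readers new to list comprehensions, the sketch below compares an explicit `for` loop with the equivalent comprehension, using plain strings instead of HTML elements. The sample values are only illustrative.

```python
sample_words = ["change", "deep-thoughts", "thinking", "world"]

# Version with an explicit loop
upper_with_loop = []
for word in sample_words:
    upper_with_loop.append(word.upper())

# Same result with a list comprehension
upper_with_comprehension = [word.upper() for word in sample_words]

print(upper_with_loop == upper_with_comprehension)
```

The comprehension reads as "the transformed item, for each item in the list", which is the same shape used above with `element.text`.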

#### Building a `list` of quotes and authors

In [19]:
authors = [author.text for author in soup.find_all('small', class_='author')]
quotes = [quote.text for quote in soup.find_all('span', class_='text')]

data = list(zip(authors, quotes))
data

[('Albert Einstein',
  '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'),
 ('J.K. Rowling',
  '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'),
 ('Albert Einstein',
  '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”'),
 ('Jane Austen',
  '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”'),
 ('Marilyn Monroe',
  "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"),
 ('Albert Einstein',
  '“Try not to become a man of success. Rather become a man of value.”'),
 ('André Gide',
  '“It is better to be hated for what you are than to be loved for what you are not.”'),
 ('Thomas A. Edison',
  "“I have not failed. I've just found 10,000 ways that won't work.”"),
 ('Eleanor Roosevelt',
  "“A

#### Building a function that returns the quote records of a page (authors, quote etc.)

In [20]:
def get_quote_records(soup):
    quotes = [quote.text for quote in soup.find_all("span", class_="text")]
    authors = [author.text for author in soup.find_all("small", class_="author")]

    data = list(zip(authors, quotes))

    return data


get_quote_records(soup)

[('Albert Einstein',
  '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'),
 ('J.K. Rowling',
  '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'),
 ('Albert Einstein',
  '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”'),
 ('Jane Austen',
  '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”'),
 ('Marilyn Monroe',
  "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"),
 ('Albert Einstein',
  '“Try not to become a man of success. Rather become a man of value.”'),
 ('André Gide',
  '“It is better to be hated for what you are than to be loved for what you are not.”'),
 ('Thomas A. Edison',
  "“I have not failed. I've just found 10,000 ways that won't work.”"),
 ('Eleanor Roosevelt',
  "“A

#### How do we do this for every page?

In [21]:
count = 1
all_data = []

while True:
    # build the `soup` object
    r = requests.get(f"https://quotes.toscrape.com/page/{count}")
    soup = BeautifulSoup(r.content, "html.parser")

    # append the page's records to the list `all_data`
    page_data = get_quote_records(soup)
    all_data += page_data

    # increment the counter
    count += 1

    # stop condition: when there is no more data
    if page_data == []:
        break

# show the data
print(all_data)


[('Albert Einstein', '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'), ('J.K. Rowling', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'), ('Albert Einstein', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”'), ('Jane Austen', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”'), ('Marilyn Monroe', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"), ('Albert Einstein', '“Try not to become a man of success. Rather become a man of value.”'), ('André Gide', '“It is better to be hated for what you are than to be loved for what you are not.”'), ('Thomas A. Edison', "“I have not failed. I've just found 10,000 ways that won't work.”"), ('Eleanor Roosevelt', "“A woman is like a tea bag; 
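
The loop above keeps requesting pages until one of them returns no records. The sketch below reproduces the same stopping logic offline, with a hypothetical dictionary (`FAKE_PAGES`) standing in for the website, and adds an optional pause between requests, which is a common courtesy toward the server being scraped. All names here are assumptions made for illustration.

```python
import time

# Hypothetical stand-in for the website: page number -> records on that page
FAKE_PAGES = {
    1: [("Author A", "Quote 1"), ("Author B", "Quote 2")],
    2: [("Author C", "Quote 3")],
}


def fetch_page_records(page_number):
    """Pretend to download and parse one page; unknown pages are empty."""
    return FAKE_PAGES.get(page_number, [])


def collect_all(delay_seconds=0.0):
    collected = []
    page_number = 1
    while True:
        page_records = fetch_page_records(page_number)
        # stop as soon as a page comes back empty
        if not page_records:
            break
        collected += page_records
        page_number += 1
        # pause between requests to avoid overloading the server
        time.sleep(delay_seconds)
    return collected


print(collect_all())
```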

#### Adding the `tags` data

In [22]:
# first, we reset the state of the `soup` object
soup = get_soup("https://quotes.toscrape.com/")

# now we can extract the tag data
tag_divs = soup.find_all("div", class_="tags")

all_tags = []
for tag_div in tag_divs:
    tags = tag_div.find_all("a")
    tags = [tag.text for tag in tags]
    all_tags.append(tags)

all_tags

[['change', 'deep-thoughts', 'thinking', 'world'],
 ['abilities', 'choices'],
 ['inspirational', 'life', 'live', 'miracle', 'miracles'],
 ['aliteracy', 'books', 'classic', 'humor'],
 ['be-yourself', 'inspirational'],
 ['adulthood', 'success', 'value'],
 ['life', 'love'],
 ['edison', 'failure', 'inspirational', 'paraphrased'],
 ['misattributed-eleanor-roosevelt'],
 ['humor', 'obvious', 'simile']]
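
The same list of lists can also be built with a nested list comprehension. The sketch below shows the idea on a small inline document. The HTML and the names `tags_demo_soup` and `nested_tags` are made up for illustration.

```python
from bs4 import BeautifulSoup

tags_demo_html = """
<div class="tags"><a class="tag">change</a><a class="tag">world</a></div>
<div class="tags"><a class="tag">abilities</a></div>
"""
tags_demo_soup = BeautifulSoup(tags_demo_html, "html.parser")

# outer comprehension: one list per `div.tags`; inner: the text of each link
nested_tags = [
    [link.text for link in tag_div.find_all("a")]
    for tag_div in tags_demo_soup.find_all("div", class_="tags")
]
print(nested_tags)
```

Whether the loop or the comprehension reads better is largely a matter of taste; the loop is often easier to follow when the inner step grows more complex.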

#### Wrapping everything in functions

In [23]:
def get_tags(soup):
    tag_divs = soup.find_all("div", class_="tags")

    all_tags = []
    for tag_div in tag_divs:
        tags = tag_div.find_all("a")
        tags = [tag.text for tag in tags]
        all_tags.append(tags)

    return all_tags


def get_quote_records(soup):
    quotes = [quote.text for quote in soup.find_all("span", class_="text")]
    authors = [author.text for author in soup.find_all("small", class_="author")]
    tags = get_tags(soup)

    data = list(zip(authors, quotes, tags))

    return data


get_quote_records(soup)


[('Albert Einstein',
  '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
  ['change', 'deep-thoughts', 'thinking', 'world']),
 ('J.K. Rowling',
  '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
  ['abilities', 'choices']),
 ('Albert Einstein',
  '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
  ['inspirational', 'life', 'live', 'miracle', 'miracles']),
 ('Jane Austen',
  '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
  ['aliteracy', 'books', 'classic', 'humor']),
 ('Marilyn Monroe',
  "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
  ['be-yourself', 'inspirational']),
 ('Albert Einstein',
  '“Try not to become a man of success. Rather become a man of value.”',
  ['ad

In [24]:
def get_all_quotes(base_url):
    count = 1
    all_data = []

    while True:
        # build the `soup` object
        soup = get_soup(f"{base_url}page/{count}")

        # append the page's records to the list `all_data`
        page_data = get_quote_records(soup)
        all_data += page_data

        # increment the counter
        count += 1

        # loop stop condition: when there is no more data
        if page_data == []:
            break

    return all_data


get_all_quotes("https://quotes.toscrape.com/")


[('Albert Einstein',
  '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
  ['change', 'deep-thoughts', 'thinking', 'world']),
 ('J.K. Rowling',
  '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
  ['abilities', 'choices']),
 ('Albert Einstein',
  '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
  ['inspirational', 'life', 'live', 'miracle', 'miracles']),
 ('Jane Austen',
  '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
  ['aliteracy', 'books', 'classic', 'humor']),
 ('Marilyn Monroe',
  "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
  ['be-yourself', 'inspirational']),
 ('Albert Einstein',
  '“Try not to become a man of success. Rather become a man of value.”',
  ['ad

### **CSS selectors** and **XPath**: two flexible ways of selecting elements in HTML

In [25]:
soup.select("div.tags-box span a")

[<a class="tag" href="/tag/love/" style="font-size: 28px">love</a>,
 <a class="tag" href="/tag/inspirational/" style="font-size: 26px">inspirational</a>,
 <a class="tag" href="/tag/life/" style="font-size: 26px">life</a>,
 <a class="tag" href="/tag/humor/" style="font-size: 24px">humor</a>,
 <a class="tag" href="/tag/books/" style="font-size: 22px">books</a>,
 <a class="tag" href="/tag/reading/" style="font-size: 14px">reading</a>,
 <a class="tag" href="/tag/friendship/" style="font-size: 10px">friendship</a>,
 <a class="tag" href="/tag/friends/" style="font-size: 8px">friends</a>,
 <a class="tag" href="/tag/truth/" style="font-size: 8px">truth</a>,
 <a class="tag" href="/tag/simile/" style="font-size: 6px">simile</a>]

In [26]:
top_tags_elements = soup.select("div.tags-box span a")
top_tags = [element.text for element in top_tags_elements]
top_tags

['love',
 'inspirational',
 'life',
 'humor',
 'books',
 'reading',
 'friendship',
 'friends',
 'truth',
 'simile']
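
The cells above cover CSS selectors. `BeautifulSoup` does not evaluate XPath expressions itself, so as a rough illustration of the syntax the sketch below uses `xml.etree.ElementTree` from the standard library, which supports a limited subset of XPath and requires well-formed markup. The fragment is a made-up imitation of the tags box. For real-world HTML, a library such as `lxml` is a more common choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment that imitates the tags box of the page
fragment = """
<body>
  <div class="tags-box">
    <span><a class="tag" href="/tag/love/">love</a></span>
    <span><a class="tag" href="/tag/life/">life</a></span>
  </div>
</body>
"""
root = ET.fromstring(fragment)

# XPath counterpart of the CSS selector "div.tags-box span a" for this fragment
xpath_links = root.findall(".//div[@class='tags-box']/span/a")
xpath_tags = [link.text for link in xpath_links]
print(xpath_tags)
```

Note that the XPath step `/span/a` only matches direct children, whereas the space in the CSS selector matches any descendant; the two coincide for this fragment.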

## Scraping pages that require interaction with the browser

In this section we will use the `helium` library, which lets us interact with web pages directly from Python. With it we can click buttons, fill in forms and much more, in a much simpler way than with the better-known `selenium`.

#### Installing `helium` and starting the browser

In [27]:
!pip install helium



In [28]:
from helium import start_chrome; start_chrome()

<selenium.webdriver.chrome.webdriver.WebDriver (session="48551d32eec30b420cf117054f32e499")>

In [29]:
from helium import kill_browser; kill_browser()

#### Logging in to the website with `helium`

In [30]:
from helium import (
    start_chrome,
    write,
    click,
    press,
    TAB,
    ENTER,
    kill_browser,
)

driver = start_chrome("https://quotes.toscrape.com/")
click("Login")
write("a", into="Username")
press(TAB)
write("b")
press(ENTER)


#### Loading the page rendered by `helium` into a `BeautifulSoup` object

In [31]:
soup = BeautifulSoup(driver.page_source, 'html.parser')

# now we can close the browser
kill_browser()

soup.title

<title>Quotes to Scrape</title>

#### Extracting the new data rendered after login

In [32]:
about_elements = soup.select("div.quote span a:nth-of-type(1)")
about_elements

[<a href="/author/Albert-Einstein">(about)</a>,
 <a href="/author/J-K-Rowling">(about)</a>,
 <a href="/author/Albert-Einstein">(about)</a>,
 <a href="/author/Jane-Austen">(about)</a>,
 <a href="/author/Marilyn-Monroe">(about)</a>,
 <a href="/author/Albert-Einstein">(about)</a>,
 <a href="/author/Andre-Gide">(about)</a>,
 <a href="/author/Thomas-A-Edison">(about)</a>,
 <a href="/author/Eleanor-Roosevelt">(about)</a>,
 <a href="/author/Steve-Martin">(about)</a>]

In [33]:
about_elements[0]['href']

'/author/Albert-Einstein'

In [34]:
about_urls = [element['href'] for element in about_elements]
about_urls

['/author/Albert-Einstein',
 '/author/J-K-Rowling',
 '/author/Albert-Einstein',
 '/author/Jane-Austen',
 '/author/Marilyn-Monroe',
 '/author/Albert-Einstein',
 '/author/Andre-Gide',
 '/author/Thomas-A-Edison',
 '/author/Eleanor-Roosevelt',
 '/author/Steve-Martin']
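
These `href` values are relative to the site root. If absolute URLs are needed, for instance to request each author page later, `urljoin` from the standard library can combine them with the base address. A minimal sketch, with a shortened list of paths:

```python
from urllib.parse import urljoin

site_root = "https://quotes.toscrape.com/"
relative_paths = ["/author/Albert-Einstein", "/author/J-K-Rowling"]

absolute_urls = [urljoin(site_root, path) for path in relative_paths]
print(absolute_urls)
```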

In [35]:
goodreads_elements = soup.select("div.quote span a:nth-of-type(2)")
goodreads_elements

[<a href="http://goodreads.com/author/show/9810.Albert_Einstein">(Goodreads page)</a>,
 <a href="http://goodreads.com/author/show/1077326.J_K_Rowling">(Goodreads page)</a>,
 <a href="http://goodreads.com/author/show/9810.Albert_Einstein">(Goodreads page)</a>,
 <a href="http://goodreads.com/author/show/1265.Jane_Austen">(Goodreads page)</a>,
 <a href="http://goodreads.com/author/show/82952.Marilyn_Monroe">(Goodreads page)</a>,
 <a href="http://goodreads.com/author/show/9810.Albert_Einstein">(Goodreads page)</a>,
 <a href="http://goodreads.com/author/show/7617.Andr_Gide">(Goodreads page)</a>,
 <a href="http://goodreads.com/author/show/3091287.Thomas_A_Edison">(Goodreads page)</a>,
 <a href="http://goodreads.com/author/show/44566.Eleanor_Roosevelt">(Goodreads page)</a>,
 <a href="http://goodreads.com/author/show/7103.Steve_Martin">(Goodreads page)</a>]

In [36]:
goodreads_urls = [element['href'] for element in goodreads_elements]
goodreads_urls

['http://goodreads.com/author/show/9810.Albert_Einstein',
 'http://goodreads.com/author/show/1077326.J_K_Rowling',
 'http://goodreads.com/author/show/9810.Albert_Einstein',
 'http://goodreads.com/author/show/1265.Jane_Austen',
 'http://goodreads.com/author/show/82952.Marilyn_Monroe',
 'http://goodreads.com/author/show/9810.Albert_Einstein',
 'http://goodreads.com/author/show/7617.Andr_Gide',
 'http://goodreads.com/author/show/3091287.Thomas_A_Edison',
 'http://goodreads.com/author/show/44566.Eleanor_Roosevelt',
 'http://goodreads.com/author/show/7103.Steve_Martin']

#### Updating the `get_all_quotes` function to extract the new data

In [37]:
from helium import go_to


def get_rendered_soup(url):
    # render the page with the driver
    driver = start_chrome(
        url, headless=True
    )  # this time we start the browser in the background
    click("Login")
    write("a", into="Username")
    press(TAB)
    write("b", into="Password")
    press(ENTER)
    go_to(url)

    # build the `soup` object from the page rendered by the driver
    soup = BeautifulSoup(driver.page_source, "html.parser")

    # close the browser
    kill_browser()

    return soup


def get_quote_full_records(soup):
    quotes = [quote.text for quote in soup.find_all("span", class_="text")]
    authors = [author.text for author in soup.find_all("small", class_="author")]
    tags = get_tags(soup)
    about_urls = [
        element["href"] for element in soup.select("div.quote span a:nth-of-type(1)")
    ]
    goodreads_urls = [
        element["href"] for element in soup.select("div.quote span a:nth-of-type(2)")
    ]

    data = list(zip(authors, quotes, tags, about_urls, goodreads_urls))

    return data


def get_all_quotes_full(base_url):
    all_data = []

    count = 1
    while True:
        # build the rendered `soup` object
        soup = get_rendered_soup(f"{base_url}page/{count}")

        # append the page's records to the list `all_data`
        page_data = get_quote_full_records(soup)
        all_data += page_data

        # increment the counter
        count += 1

        # loop stop condition: when there is no more data
        if page_data == []:
            break

    return all_data


In [38]:
quotes_data = get_all_quotes_full("https://quotes.toscrape.com/")
quotes_data

[('Albert Einstein',
  '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
  ['change', 'deep-thoughts', 'thinking', 'world'],
  '/author/Albert-Einstein',
  'http://goodreads.com/author/show/9810.Albert_Einstein'),
 ('J.K. Rowling',
  '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
  ['abilities', 'choices'],
  '/author/J-K-Rowling',
  'http://goodreads.com/author/show/1077326.J_K_Rowling'),
 ('Albert Einstein',
  '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
  ['inspirational', 'life', 'live', 'miracle', 'miracles'],
  '/author/Albert-Einstein',
  'http://goodreads.com/author/show/9810.Albert_Einstein'),
 ('Jane Austen',
  '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
  ['aliteracy', 'books', 'classic', 'humor'],
  '/author/Jane-

## Exporting everything to a CSV table with `pandas`

In [39]:
!pip install pandas



#### Gathering the data in a `dict`, so that `pandas` can load it into a `DataFrame`

We can use `dict` and `zip` for this task. Here is an example of how they work:

In [40]:
list(zip([1,2], [3,4]))

[(1, 3), (2, 4)]

In [41]:
list(zip(*[['autor1',1,3,4,5],['autor2',2,3,4,5],['autor3',0,0,3,4],['autor4',1,1,4,5],['autor5',3,3,4,6]]))

[('autor1', 'autor2', 'autor3', 'autor4', 'autor5'),
 (1, 2, 0, 1, 3),
 (3, 3, 0, 1, 3),
 (4, 4, 3, 4, 4),
 (5, 5, 4, 5, 6)]

In [42]:
dict(
    zip(
        ["author", "quote", "tags", "about_url", "goodreads_url"],
        zip(*[['autor1',1,3,4,5],['autor2',2,3,4,5],['autor3',0,0,3,4],['autor4',1,1,4,5],['autor5',3,3,4,6]])
    )
)

{'author': ('autor1', 'autor2', 'autor3', 'autor4', 'autor5'),
 'quote': (1, 2, 0, 1, 3),
 'tags': (3, 3, 0, 1, 3),
 'about_url': (4, 4, 3, 4, 4),
 'goodreads_url': (5, 5, 4, 5, 6)}

In [43]:
data = dict(zip(['authors', 'quotes', 'tags', 'about_relative_url', 'goodreads_url'], zip(*quotes_data)))
data

{'authors': ('Albert Einstein',
  'J.K. Rowling',
  'Albert Einstein',
  'Jane Austen',
  'Marilyn Monroe',
  'Albert Einstein',
  'André Gide',
  'Thomas A. Edison',
  'Eleanor Roosevelt',
  'Steve Martin',
  'Marilyn Monroe',
  'J.K. Rowling',
  'Albert Einstein',
  'Bob Marley',
  'Dr. Seuss',
  'Douglas Adams',
  'Elie Wiesel',
  'Friedrich Nietzsche',
  'Mark Twain',
  'Allen Saunders',
  'Pablo Neruda',
  'Ralph Waldo Emerson',
  'Mother Teresa',
  'Garrison Keillor',
  'Jim Henson',
  'Dr. Seuss',
  'Albert Einstein',
  'J.K. Rowling',
  'Albert Einstein',
  'Bob Marley',
  'Dr. Seuss',
  'J.K. Rowling',
  'Bob Marley',
  'Mother Teresa',
  'J.K. Rowling',
  'Charles M. Schulz',
  'William Nicholson',
  'Albert Einstein',
  'Jorge Luis Borges',
  'George Eliot',
  'George R.R. Martin',
  'C.S. Lewis',
  'Marilyn Monroe',
  'Marilyn Monroe',
  'Albert Einstein',
  'Marilyn Monroe',
  'Marilyn Monroe',
  'Martin Luther King Jr.',
  'J.K. Rowling',
  'James Baldwin',
  'Jane Austen

In [44]:
import pandas as pd

df = pd.DataFrame(data)
df

Unnamed: 0,authors,quotes,tags,about_relative_url,goodreads_url
0,Albert Einstein,“The world as we have created it is a process ...,"[change, deep-thoughts, thinking, world]",/author/Albert-Einstein,http://goodreads.com/author/show/9810.Albert_E...
1,J.K. Rowling,"“It is our choices, Harry, that show what we t...","[abilities, choices]",/author/J-K-Rowling,http://goodreads.com/author/show/1077326.J_K_R...
2,Albert Einstein,“There are only two ways to live your life. On...,"[inspirational, life, live, miracle, miracles]",/author/Albert-Einstein,http://goodreads.com/author/show/9810.Albert_E...
3,Jane Austen,"“The person, be it gentleman or lady, who has ...","[aliteracy, books, classic, humor]",/author/Jane-Austen,http://goodreads.com/author/show/1265.Jane_Austen
4,Marilyn Monroe,"“Imperfection is beauty, madness is genius and...","[be-yourself, inspirational]",/author/Marilyn-Monroe,http://goodreads.com/author/show/82952.Marilyn...
...,...,...,...,...,...
95,Harper Lee,“You never really understand a person until yo...,[better-life-empathy],/author/Harper-Lee,http://goodreads.com/author/show/1825.Harper_Lee
96,Madeleine L'Engle,“You have to write the book that wants to be w...,"[books, children, difficult, grown-ups, write,...",/author/Madeleine-LEngle,http://goodreads.com/author/show/106.Madeleine...
97,Mark Twain,“Never tell the truth to people who are not wo...,[truth],/author/Mark-Twain,http://goodreads.com/author/show/1244.Mark_Twain
98,Dr. Seuss,"“A person's a person, no matter how small.”",[inspirational],/author/Dr-Seuss,http://goodreads.com/author/show/61105.Dr_Seuss


In [45]:
df.to_csv('scraped_data.csv', index=False)
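
One caveat about the export: the `tags` column holds Python lists, and a CSV cell can only store text, so each list is written as its string representation. If the file will be read by other tools, one option is to join the tags with a separator before writing. The sketch below illustrates the idea with the standard `csv` module and an in-memory buffer. The sample rows, the `|` separator and the variable names are assumptions for illustration.

```python
import csv
import io

# Two sample records in the same shape as the scraped data (values shortened)
sample_rows = [
    ("Albert Einstein", "Quote 1", ["change", "world"], "/author/Albert-Einstein"),
    ("J.K. Rowling", "Quote 2", ["abilities", "choices"], "/author/J-K-Rowling"),
]

buffer = io.StringIO()  # in-memory stand-in for a file on disk
writer = csv.writer(buffer)
writer.writerow(["author", "quote", "tags", "about_relative_url"])
for row_author, row_quote, row_tags, row_about in sample_rows:
    # join the list into a single "a|b" string so it is easy to split later
    writer.writerow([row_author, row_quote, "|".join(row_tags), row_about])

csv_text = buffer.getvalue()
print(csv_text)
```

When reading the file back, `cell.split("|")` recovers the original list of tags.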

## Using `pandas` to scrape tables from websites

In [46]:
!pip install html5lib



In [47]:
data = pd.read_html('https://en.wikipedia.org/wiki/List_of_Copa_Libertadores_finals', flavor='html5lib')

In [48]:
data[2].head(15)

Unnamed: 0,Year,Country,Winner,Score,Runner-up,Country.1,Venue,Attendance
0,1960,Uruguay,Peñarol,1–0,Olimpia,Paraguay,"Estadio Centenario, Montevideo",44690
1,1960,Uruguay,Peñarol,1–1,Olimpia,Paraguay,"Estadio de Puerto Sajonia, Asunción",35000
2,1961,Uruguay,Peñarol,1–0,Palmeiras,Brazil,"Estadio Centenario, Montevideo",64376
3,1961,Uruguay,Peñarol,1–1,Palmeiras,Brazil,"Estádio do Pacaembu, São Paulo",50000
4,1962,Brazil,Santos,2–1,Peñarol,Uruguay,"Estadio Centenario, Montevideo",48105
5,1962,Brazil,Santos,2–3,Peñarol,Uruguay,"Vila Belmiro, Santos",18000
6,1962,Brazil,Santos,3–0,Peñarol,Uruguay,"Estadio Monumental, Buenos Aires",60000
7,1963,Brazil,Santos,3–2,Boca Juniors,Argentina,"Maracanã, Rio de Janeiro",100000
8,1963,Brazil,Santos,2–1,Boca Juniors,Argentina,"Estadio Boca Juniors, Buenos Aires",50000
9,1964,Argentina,Independiente,0–0,Nacional,Uruguay,"Estadio Centenario, Montevideo",60000


In [49]:
data[3].head(15)

Unnamed: 0,Club,Titles,Runners-up,Seasons won,Seasons runner-up
0,Independiente,7,0,"1964, 1965, 1972, 1973, 1974, 1975, 1984",—
1,Boca Juniors,6,5,"1977, 1978, 2000, 2001, 2003, 2007","1963, 1979, 2004, 2012, 2018"
2,Peñarol,5,5,"1960, 1961, 1966, 1982, 1987","1962, 1965, 1970, 1983, 2011"
3,River Plate,4,3,"1986, 1996, 2015, 2018","1966, 1976, 2019"
4,Estudiantes,4,1,"1968, 1969, 1970, 2009",1971
5,Olimpia,3,4,"1979, 1990, 2002","1960, 1989, 1991, 2013"
6,Nacional,3,3,"1971, 1980, 1988","1964, 1967, 1969"
7,São Paulo,3,3,"1992, 1993, 2005","1974, 1994, 2006"
8,Santos,3,2,"1962, 1963, 2011","2003, 2020"
9,Grêmio,3,2,"1983, 1995, 2017","1984, 2007"


In [50]:
data[4]

Unnamed: 0,Nation,Won,Lost
0,Argentina,25,12
1,Brazil,20,16
2,Uruguay,8,8
3,Colombia,3,7
4,Paraguay,3,5
5,Chile,1,5
6,Ecuador,1,3
7,Mexico,0,3
8,Peru,0,2


# Next steps

* Wrap the last functions we wrote in a class
* Use `Session` from `requests` to reuse connections and reduce the time spent on each HTTP request
* Use functions that return a `bool` to choose HTML elements in `BeautifulSoup`
* Get to know the `Scrapy` library
* Learn the `XPath` syntax
* Schedule periodic runs of this script with `Apache Airflow`
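
As a starting point for the item about functions that return a `bool`: `find_all` also accepts a function, which it calls on each element and which decides whether that element is kept. The sketch below uses a small inline document; the HTML and the helper name `is_tag_link` are made up for illustration.

```python
from bs4 import BeautifulSoup

filter_html = """
<a class="tag" href="/tag/love/">love</a>
<a href="/author/Albert-Einstein">(about)</a>
<a class="tag" href="/tag/life/">life</a>
"""
filter_soup = BeautifulSoup(filter_html, "html.parser")


def is_tag_link(element):
    """Return True for <a> elements whose href starts with /tag/."""
    return element.name == "a" and element.get("href", "").startswith("/tag/")


tag_links = filter_soup.find_all(is_tag_link)
print([link.text for link in tag_links])
```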

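# References

## Web technologies
* [https://en.wikipedia.org/wiki/Internet](https://en.wikipedia.org/wiki/Internet)
* [https://en.wikipedia.org/wiki/URL](https://en.wikipedia.org/wiki/URL)
* [https://en.wikipedia.org/wiki/World_Wide_Web](https://en.wikipedia.org/wiki/World_Wide_Web)
* [https://en.wikipedia.org/wiki/HTML](https://en.wikipedia.org/wiki/HTML)
* [https://en.wikipedia.org/wiki/CSS](https://en.wikipedia.org/wiki/CSS)
* [https://en.wikipedia.org/wiki/JavaScript](https://en.wikipedia.org/wiki/JavaScript)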

## Web scraping with Python
* MITCHELL, R. [Web Scraping with Python](https://www.oreilly.com/library/view/web-scraping-with/9781491985564/). 2. ed. Sebastopol, CA, O’Reilly Media, Inc., 2018.