# Web scraping with Python

This notebook covers the basic concepts of web scraping and proposes a few exercises.
We will use Python concepts such as functions and control flow, Internet concepts such as the HTTP protocol and URLs, and the fundamental building blocks of the web: HTML, CSS, and JavaScript.

## How does the web work?

### The Internet as a global system of interconnected computer networks
#### Computer networks
![The worldwide computer network](internet.png "Internet")

#### Infrastructure
![Submarine cables that connect computers](internet2.png "Internet infrastructure")

#### URLs and the HTTP protocol
* URL: Uniform Resource Locator -> web address
* HTTP: Hypertext Transfer Protocol -> the foundation of data communication on the web

![The HTTP protocol and a URL in use in the browser](http.png "HTTP/URL in the browser")
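
To make the anatomy of a URL concrete, here is a minimal sketch that splits the example address used later in this notebook into its parts with the standard-library `urllib.parse` module. The `?page=2` query string is a hypothetical addition, included only so that every part of the URL has something to show.

```python
from urllib.parse import urlsplit, parse_qs

# The "?page=2" query string is a hypothetical addition for illustration.
url = "http://pythonscraping.com/pages/page1.html?page=2"
parts = urlsplit(url)

print(parts.scheme)           # protocol used to talk to the server
print(parts.netloc)           # host that serves the resource
print(parts.path)             # location of the resource on that host
print(parse_qs(parts.query))  # extra parameters sent along with the request
```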





### How does the browser turn data received over HTTP into visual elements?

#### The source code of websites: HTML, CSS, and JavaScript

Example from the page [http://pythonscraping.com/pages/page1.html](http://pythonscraping.com/pages/page1.html)

```html
<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>
```

#### A more complex example, with CSS: [https://quotes.toscrape.com](https://quotes.toscrape.com); and another, with JavaScript: [https://www.globo.com/](https://www.globo.com/)
To see the source, right-click the page and choose `View page source`.

## From the browser to code: how to read the web with Python?

In [4]:
!pip install requests
!pip install bs4



### `requests`: making HTTP requests with Python

In [5]:
import requests

r = requests.get('http://pythonscraping.com/pages/page1.html')

In [6]:
# pprint is a standard-library module for pretty-printing
from pprint import pprint

pprint(r.content)

(b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Int'
 b'eresting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipis'
 b'icing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqu'
 b'a. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi u'
 b't aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in'
 b' voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint'
 b' occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit '
 b'anim id est laborum.\n</div>\n</body>\n</html>\n')
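
The bytes printed above are only the body of the HTTP response; a response also carries a status code and headers. The sketch below is one way to look at all three without depending on an external site: it serves a shortened copy of the example page from a throwaway local server and fetches it with the standard-library `http.client` module rather than `requests`. The `PageHandler` class, the `PAGE` constant, and the request path are illustrative names, not part of the original notebook.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Shortened, hand-written copy of the example page (illustrative only).
PAGE = b"<html><head><title>A Useful Page</title></head><body><h1>An Interesting Title</h1></body></html>"


class PageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # status line, headers, then the body: the three parts of a response
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the notebook output free of per-request logs


# port 0 asks the operating system for any free port
server = HTTPServer(("127.0.0.1", 0), PageHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# send a GET request for a path, just as the browser does for a URL
conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
conn.request("GET", "/pages/page1.html")
response = conn.getresponse()
status = response.status
content_type = response.getheader("Content-Type")
body = response.read()
conn.close()

server.shutdown()
server.server_close()

# the body arrives as bytes; decoding turns it into a Python string
html = body.decode("utf-8")
print(status, content_type)
print(html)
```

With `requests`, the same pieces are exposed as `r.status_code`, `r.headers`, `r.content` (bytes), and `r.text` (decoded string).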


### `BeautifulSoup`: turning HTML into structured data

In [7]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.content, 'html.parser')

In [13]:
soup.prettify()

'<html>\n <head>\n  <title>\n   A Useful Page\n  </title>\n </head>\n <body>\n  <h1>\n   An Interesting Title\n  </h1>\n  <div>\n   Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n  </div>\n </body>\n</html>\n'

#### Navigating the HTML tree

In [14]:
soup.head

<head>
<title>A Useful Page</title>
</head>

In [15]:
soup.title

<title>A Useful Page</title>

In [16]:
soup.h1

<h1>An Interesting Title</h1>

In [17]:
soup.div

<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>

## Scraping the web: an introductory example

Let's combine the explanations above with our Python knowledge to scrape the following page: [https://quotes.toscrape.com/](https://quotes.toscrape.com/). The page was created by ScrapingHub, the company behind the advanced web scraping library `Scrapy`, to introduce beginners to web scraping.

In [19]:
r = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(r.content, 'html.parser')

#### How do we get Einstein's quote?

In [20]:
soup.span.text

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

In [21]:
soup.small.text

'Albert Einstein'

#### The `find` and `find_all` methods of `BeautifulSoup`

In [84]:
soup.find('span', class_='text').text

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

In [85]:
soup.find('small', class_="author").text

'Albert Einstein'
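
One detail worth keeping in mind: `find` returns `None` when nothing matches, so reading `.text` directly on a failed lookup raises an `AttributeError`. The sketch below illustrates this on a small hand-written HTML fragment that mimics one quote from the site, and also shows the same lookup written as a CSS selector with `select_one`. The fragment and the name `soup_demo` are illustrative, chosen so the notebook's own `soup` is left untouched.

```python
from bs4 import BeautifulSoup

# A hand-written fragment that mimics the markup of one quote on the site.
html = """
<div class="quote">
  <span class="text">“A day without sunshine is like, you know, night.”</span>
  <span>by <small class="author">Steve Martin</small></span>
</div>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# `find` with a tag name and a class, as in the cells above
quote_text = soup_demo.find("span", class_="text").text

# the same kind of lookup written as a CSS selector
author_name = soup_demo.select_one("small.author").text

# `find` returns None when nothing matches, so check before reading `.text`
missing = soup_demo.find("span", class_="does-not-exist")
missing_text = missing.text if missing is not None else None

print(quote_text)
print(author_name)
print(missing_text)
```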

#### Getting a list of elements with `find_all`

In [86]:
soup.find_all('span')

[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>,
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
 <span>by <small class="author" itemprop="author">J.K. Rowling</small>
 <a href="/author/J-K-Rowling">(about)</a>
 </span>,
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>,
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,

In [87]:
elements = soup.find_all('span')
elements

[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>,
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
 <span>by <small class="author" itemprop="author">J.K. Rowling</small>
 <a href="/author/J-K-Rowling">(about)</a>
 </span>,
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>,
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,

In [88]:
elements = [element.text for element in soup.find_all('span', class_='text')]
elements

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 '“Try not to become a man of success. Rather become a man of value.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“I have not failed. I've just found 10,000 ways that won't work.”",
 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 '“A day without sunshine is like, you know, night.”']

#### Building a `list` of quotes and authors

In [89]:
authors = [author.text for author in soup.find_all('small', class_='author')]
quotes = [quote.text for quote in soup.find_all('span', class_='text')]

data = list(zip(authors, quotes))
data

[('Albert Einstein',
  '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'),
 ('J.K. Rowling',
  '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'),
 ('Albert Einstein',
  '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”'),
 ('Jane Austen',
  '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”'),
 ('Marilyn Monroe',
  "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"),
 ('Albert Einstein',
  '“Try not to become a man of success. Rather become a man of value.”'),
 ('André Gide',
  '“It is better to be hated for what you are than to be loved for what you are not.”'),
 ('Thomas A. Edison',
  "“I have not failed. I've just found 10,000 ways that won't work.”"),
 ('Eleanor Roosevelt',
  "“A

#### Building a function that returns the quote records of a page (authors, quote, etc.)

In [97]:
def get_quote_records(soup):
    quotes = [quote.text for quote in soup.find_all("span", class_="text")]
    authors = [author.text for author in soup.find_all("small", class_="author")]

    data = list(zip(authors, quotes))

    return data


get_quote_records(soup)


[('Albert Einstein',
  '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'),
 ('J.K. Rowling',
  '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'),
 ('Albert Einstein',
  '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”'),
 ('Jane Austen',
  '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”'),
 ('Marilyn Monroe',
  "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"),
 ('Albert Einstein',
  '“Try not to become a man of success. Rather become a man of value.”'),
 ('André Gide',
  '“It is better to be hated for what you are than to be loved for what you are not.”'),
 ('Thomas A. Edison',
  "“I have not failed. I've just found 10,000 ways that won't work.”"),
 ('Eleanor Roosevelt',
  "“A

In [70]:
count = 1
all_data = []

while True:
    # build the `soup` object for the current page
    r = requests.get(f"https://quotes.toscrape.com/page/{count}")
    soup = BeautifulSoup(r.content, "html.parser")

    # append this page's records to the `all_data` list
    page_data = get_quote_records(soup)
    all_data += page_data

    # increment the page counter
    count += 1

    # stop condition: the page returned no more data
    if page_data == []:
        break

# show the data
print(all_data)


[('Albert Einstein', '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'), ('J.K. Rowling', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'), ('Albert Einstein', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”'), ('Jane Austen', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”'), ('Marilyn Monroe', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"), ('Albert Einstein', '“Try not to become a man of success. Rather become a man of value.”'), ('André Gide', '“It is better to be hated for what you are than to be loved for what you are not.”'), ('Thomas A. Edison', "“I have not failed. I've just found 10,000 ways that won't work.”"), ('Eleanor Roosevelt', "“A woman is like a tea bag; 
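
The counter-based loop above always makes one extra request, for the first page that comes back empty. An alternative is to follow the pagination link itself. The sketch below assumes the site marks that link as `<li class="next"><a href="...">`, which should be confirmed in the page source before relying on it; the helper name `next_page_url` and the two HTML fragments are hypothetical, written only for this illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(soup, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    # Assumption: the pagination link lives inside <li class="next">.
    link = soup.select_one("li.next a")
    if link is None or not link.get("href"):
        return None
    # hrefs are usually relative, so resolve them against the current URL
    return urljoin(current_url, link["href"])


# Hand-written fragments standing in for a middle page and the last page.
page_with_next = BeautifulSoup(
    '<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>',
    "html.parser",
)
last_page = BeautifulSoup(
    '<ul class="pager"><li class="previous"><a href="/page/9/">Previous</a></li></ul>',
    "html.parser",
)

first = next_page_url(page_with_next, "https://quotes.toscrape.com/")
last = next_page_url(last_page, "https://quotes.toscrape.com/page/10/")
print(first)
print(last)
```

In a scraping loop, the returned URL would replace the counter: keep requesting pages until the helper returns `None`.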

#### Adding the `tags` data

In [99]:
tag_divs = soup.find_all("div", class_="tags")

all_tags = []
for tag_div in tag_divs:
    tags = tag_div.find_all("a")
    tags = [tag.text for tag in tags]
    all_tags.append(tags)

all_tags

[['change', 'deep-thoughts', 'thinking', 'world'],
 ['abilities', 'choices'],
 ['inspirational', 'life', 'live', 'miracle', 'miracles'],
 ['aliteracy', 'books', 'classic', 'humor'],
 ['be-yourself', 'inspirational'],
 ['adulthood', 'success', 'value'],
 ['life', 'love'],
 ['edison', 'failure', 'inspirational', 'paraphrased'],
 ['misattributed-eleanor-roosevelt'],
 ['humor', 'obvious', 'simile']]

#### Wrapping everything in functions

In [101]:
def get_tags(soup):
    tag_divs = soup.find_all("div", class_="tags")

    all_tags = []
    for tag_div in tag_divs:
        tags = tag_div.find_all("a")
        tags = [tag.text for tag in tags]
        all_tags.append(tags)

    return all_tags


def get_quote_records(soup):
    quotes = [quote.text for quote in soup.find_all("span", class_="text")]
    authors = [author.text for author in soup.find_all("small", class_="author")]
    tags = get_tags(soup)

    data = list(zip(authors, quotes, tags))

    return data


get_quote_records(soup)


[('Albert Einstein',
  '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
  ['change', 'deep-thoughts', 'thinking', 'world']),
 ('J.K. Rowling',
  '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
  ['abilities', 'choices']),
 ('Albert Einstein',
  '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
  ['inspirational', 'life', 'live', 'miracle', 'miracles']),
 ('Jane Austen',
  '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
  ['aliteracy', 'books', 'classic', 'humor']),
 ('Marilyn Monroe',
  "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
  ['be-yourself', 'inspirational']),
 ('Albert Einstein',
  '“Try not to become a man of success. Rather become a man of value.”',
  ['ad
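
Zipping three independently collected lists works as long as every quote has exactly one author and one tags block. If a single element were missing, `zip` would silently pair authors with the wrong quotes. A more defensive variant, sketched below, walks each quote container and reads its fields from inside it. It assumes every quote is wrapped in a `<div class="quote">`; the function name `get_quote_records_by_container` and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written sample: the second quote deliberately has no tags block.
html = """
<div class="quote">
  <span class="text">“Quote one.”</span>
  <span>by <small class="author">Author One</small></span>
  <div class="tags"><a class="tag">first</a> <a class="tag">second</a></div>
</div>
<div class="quote">
  <span class="text">“Quote two.”</span>
  <span>by <small class="author">Author Two</small></span>
</div>
"""


def get_quote_records_by_container(soup):
    records = []
    # Assumption: each quote lives inside its own <div class="quote">.
    for block in soup.find_all("div", class_="quote"):
        author = block.find("small", class_="author").text
        quote = block.find("span", class_="text").text
        tags_div = block.find("div", class_="tags")
        # a quote without a tags block gets an empty list instead of shifting rows
        tags = [a.text for a in tags_div.find_all("a")] if tags_div is not None else []
        records.append((author, quote, tags))
    return records


records = get_quote_records_by_container(BeautifulSoup(html, "html.parser"))
print(records)
```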

In [103]:
def get_all_quotes(soup=None):
    # `soup` is not used: every page is fetched again inside the loop.
    # The parameter is kept (now optional) so that existing calls still work.
    count = 1
    all_data = []

    while True:
        # build the `soup` object for the current page
        r = requests.get(f"https://quotes.toscrape.com/page/{count}")
        soup = BeautifulSoup(r.content, "html.parser")

        # append this page's records to the `all_data` list
        page_data = get_quote_records(soup)
        all_data += page_data

        # increment the page counter
        count += 1

        # loop stop condition: the page returned no more data
        if page_data == []:
            break

    return all_data


get_all_quotes(soup)


[('Albert Einstein',
  '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
  ['change', 'deep-thoughts', 'thinking', 'world']),
 ('J.K. Rowling',
  '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
  ['abilities', 'choices']),
 ('Albert Einstein',
  '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
  ['inspirational', 'life', 'live', 'miracle', 'miracles']),
 ('Jane Austen',
  '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
  ['aliteracy', 'books', 'classic', 'humor']),
 ('Marilyn Monroe',
  "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
  ['be-yourself', 'inspirational']),
 ('Albert Einstein',
  '“Try not to become a man of success. Rather become a man of value.”',
  ['ad

## Scraping pages that require interaction with the browser

In this section we will use the `helium` library, which lets us interact with web pages directly from Python. With it you can click buttons, fill in forms, and much more, in a much simpler way than with the better-known `selenium`.

#### Logging in to the page with `helium`

In [105]:
!pip install helium

Collecting helium
  Downloading helium-3.0.8.tar.gz (26.1 MB)
Collecting selenium==3.141.0
  Downloading selenium-3.141.0-py2.py3-none-any.whl (904 kB)
Using legacy 'setup.py install' for helium, since package 'wheel' is not installed.
Installing collected packages: selenium, helium
    Running setup.py install for helium ... done
Successfully installed helium-3.0.8 selenium-3.141.0


In [130]:
from helium import (
    start_chrome,
    write,
    click,
    press,
    TAB,
    ENTER,
    kill_browser,
)

driver = start_chrome("https://quotes.toscrape.com/")
click("Login")
write("a", into="Username")
press(TAB)
write("b", into="Password")
press(ENTER)


In [132]:
soup = BeautifulSoup(driver.page_source, 'html.parser')

soup.title

<title>Quotes to Scrape</title>

In [133]:
quotes_data = get_all_quotes(soup)
quotes_data


[('Albert Einstein',
  '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
  ['change', 'deep-thoughts', 'thinking', 'world']),
 ('J.K. Rowling',
  '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
  ['abilities', 'choices']),
 ('Albert Einstein',
  '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
  ['inspirational', 'life', 'live', 'miracle', 'miracles']),
 ...

## Exporting everything to a CSV table with `pandas`

In [135]:

data = dict(zip(['authors', 'quotes', 'tags'], zip(*quotes_data)))
data

{'authors': ('Albert Einstein',
  'J.K. Rowling',
  'Albert Einstein',
  'Jane Austen',
  'Marilyn Monroe',
  'Albert Einstein',
  'André Gide',
  'Thomas A. Edison',
  'Eleanor Roosevelt',
  'Steve Martin',
  'Marilyn Monroe',
  'J.K. Rowling',
  'Albert Einstein',
  'Bob Marley',
  'Dr. Seuss',
  'Douglas Adams',
  'Elie Wiesel',
  'Friedrich Nietzsche',
  'Mark Twain',
  'Allen Saunders',
  'Pablo Neruda',
  'Ralph Waldo Emerson',
  'Mother Teresa',
  'Garrison Keillor',
  'Jim Henson',
  'Dr. Seuss',
  'Albert Einstein',
  'J.K. Rowling',
  'Albert Einstein',
  'Bob Marley',
  'Dr. Seuss',
  'J.K. Rowling',
  'Bob Marley',
  'Mother Teresa',
  'J.K. Rowling',
  'Charles M. Schulz',
  'William Nicholson',
  'Albert Einstein',
  'Jorge Luis Borges',
  'George Eliot',
  'George R.R. Martin',
  'C.S. Lewis',
  'Marilyn Monroe',
  'Marilyn Monroe',
  'Albert Einstein',
  'Marilyn Monroe',
  'Marilyn Monroe',
  'Martin Luther King Jr.',
  'J.K. Rowling',
  'James Baldwin',
  'Jane Austen
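
The expression `zip(*quotes_data)` in the cell above transposes the data: it turns a list of rows into a group of columns. A minimal sketch with three records in the same `(author, quote, tags)` shape, where the quote text is replaced by short placeholders and the names `records`, `columns`, and `data_demo` are illustrative:

```python
# Three records in the same (author, quote, tags) shape used above;
# the quote text is replaced by placeholders to keep the example compact.
records = [
    ("Albert Einstein", "quote 1", ["change", "deep-thoughts", "thinking", "world"]),
    ("J.K. Rowling", "quote 2", ["abilities", "choices"]),
    ("Jane Austen", "quote 3", ["aliteracy", "books", "classic", "humor"]),
]

# zip(*records) regroups the rows by position: all authors, all quotes, all tags
columns = list(zip(*records))

# pairing each column with a name gives a dict of columns for pandas.DataFrame
data_demo = dict(zip(["authors", "quotes", "tags"], columns))

print(data_demo["authors"])
```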

In [137]:
!pip install pandas

Collecting pandas
  Downloading pandas-1.3.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.5 MB)
Collecting pytz>=2017.3
  Using cached pytz-2021.3-py2.py3-none-any.whl (503 kB)
Collecting numpy>=1.17.3
  Downloading numpy-1.21.3-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Installing collected packages: pytz, numpy, pandas
Successfully installed numpy-1.21.3 pandas-1.3.4 pytz-2021.3


In [138]:
import pandas as pd

df = pd.DataFrame(data)
df

| | authors | quotes | tags |
|---|---|---|---|
| 0 | Albert Einstein | “The world as we have created it is a process ... | [change, deep-thoughts, thinking, world] |
| 1 | J.K. Rowling | “It is our choices, Harry, that show what we t... | [abilities, choices] |
| 2 | Albert Einstein | “There are only two ways to live your life. On... | [inspirational, life, live, miracle, miracles] |
| 3 | Jane Austen | “The person, be it gentleman or lady, who has ... | [aliteracy, books, classic, humor] |
| 4 | Marilyn Monroe | “Imperfection is beauty, madness is genius and... | [be-yourself, inspirational] |
| ... | ... | ... | ... |
| 95 | Harper Lee | “You never really understand a person until yo... | [better-life-empathy] |
| 96 | Madeleine L'Engle | “You have to write the book that wants to be w... | [books, children, difficult, grown-ups, write,... |
| 97 | Mark Twain | “Never tell the truth to people who are not wo... | [truth] |
| 98 | Dr. Seuss | “A person's a person, no matter how small.” | [inspirational] |


In [139]:
df.to_csv('scraped_data.csv', index=False)
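
One caveat about the export: a CSV cell can only hold text, so each list in the `tags` column is written as its string representation and comes back as a string, not a list, when the file is read again. One possible workaround is to join the tags with a separator before writing and split them after reading. The sketch below illustrates the idea with the standard-library `csv` module and an in-memory buffer instead of `pandas` and a file on disk; the `;` separator and the sample records are assumptions made for the example.

```python
import csv
import io

# Sample records in the same (author, quote, tags) shape; quote text is a placeholder.
records = [
    ("Albert Einstein", "quote 1", ["change", "deep-thoughts"]),
    ("J.K. Rowling", "quote 2", ["abilities", "choices"]),
]

buffer = io.StringIO(newline="")  # in-memory stand-in for a CSV file on disk
writer = csv.writer(buffer)
writer.writerow(["authors", "quotes", "tags"])
for author, quote, tags in records:
    # store the list as a single ";"-separated string
    writer.writerow([author, quote, ";".join(tags)])

# reading the data back and splitting the tags restores the lists
buffer.seek(0)
rows = list(csv.DictReader(buffer))
restored_tags = [row["tags"].split(";") for row in rows]
print(restored_tags)
```

This only works if no tag contains the separator itself, which is worth checking against the scraped data first.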

# References

## Web technologies
* [https://en.wikipedia.org/wiki/Internet](https://en.wikipedia.org/wiki/Internet)
* [https://en.wikipedia.org/wiki/URL](https://en.wikipedia.org/wiki/URL)
* [https://en.wikipedia.org/wiki/World_Wide_Web](https://en.wikipedia.org/wiki/World_Wide_Web)
* [https://en.wikipedia.org/wiki/HTML](https://en.wikipedia.org/wiki/HTML)
* [https://en.wikipedia.org/wiki/CSS](https://en.wikipedia.org/wiki/CSS)
* [https://en.wikipedia.org/wiki/JavaScript](https://en.wikipedia.org/wiki/JavaScript)

## Web scraping with Python
* MITCHELL, R. [Web Scraping with Python](https://www.oreilly.com/library/view/web-scraping-with/9781491985564/). 2nd ed. Sebastopol, CA: O’Reilly Media, Inc., 2018.