<h1 align= center>Data Acquisition and Extraction</h1>
<p align= center><img src=https://www.bmr.it/bmr18/bmr-cont/uploads/2018/08/software.png width=500></p>

## How to parse websites and navigate the DOM using BeautifulSoup

In [1]:
import requests
from bs4 import BeautifulSoup

html = requests.get('http://localhost:8080/planets.html').text

soup = BeautifulSoup(html, 'lxml')

In [2]:
soup.html.body.div.tr

<tr id="planetHeader">
<th>
</th>
<th>
                    Name
                </th>
<th>
                    Mass (10^24kg)
                </th>
<th>
                    Diameter (km)
                </th>
<th>
                    How it got its Name
                </th>
<th>
                    More Info
                </th>
</tr>

In [3]:
soup.html.body.div.table.children

<list_iterator at 0x206345957b0>

In [4]:
# Using a list comprehension to preview each child node
[ str(c)[:45] for c in soup.html.body.div.table.children]

['\n',
 '<tr id="planetHeader">\n<th>\n</th>\n<th>\r\n     ',
 '\n',
 '<tr class="planet" id="planet1" name="Mercury',
 '\n',
 '<tr class="planet" id="planet2" name="Venus">',
 '\n',
 '<tr class="planet" id="planet3" name="Earth">',
 '\n',
 '<tr class="planet" id="planet4" name="Mars">\n',
 '\n',
 '<tr class="planet" id="planet5" name="Jupiter',
 '\n',
 '<tr class="planet" id="planet6" name="Saturn"',
 '\n',
 '<tr class="planet" id="planet7" name="Uranus"',
 '\n',
 '<tr class="planet" id="planet8" name="Neptune',
 '\n',
 '<tr class="planet" id="planet9" name="Pluto">',
 '\n']
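The `'\n'` entries in the listing above are whitespace text nodes (`NavigableString` objects) that sit between the rows. A minimal sketch of keeping only the element children, assuming a small inline table (the `sample_html` string is made up for illustration and stands in for `planets.html`):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical stand-in for planets.html, small enough to read at a glance.
sample_html = """
<table id="planetsTable">
<tr id="planetHeader"><th>Name</th></tr>
<tr class="planet" id="planet1" name="Mercury"><td>Mercury</td></tr>
<tr class="planet" id="planet2" name="Venus"><td>Venus</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# .children yields both Tag objects and NavigableString whitespace nodes;
# isinstance(child, Tag) keeps only the elements.
row_ids = [child["id"] for child in sample_soup.table.children if isinstance(child, Tag)]
print(row_ids)
```

Calling `find_all('tr', recursive=False)` on the table is another way to get only the direct `tr` children.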

In [5]:
# Using .parent to move from a node up to its parent

str(soup.html.body.div.table.tr.parent)[:200]

'<table border="1" id="planetsTable">\n<tr id="planetHeader">\n<th>\n</th>\n<th>\r\n                    Name\r\n                </th>\n<th>\r\n                    Mass (10^24kg)\r\n                </th>\n<th>\r\n     '

## Searching the DOM with Beautiful Soup's find methods

In [6]:
table = soup.find('table')
str(table)[:100]

'<table border="1" id="planetsTable">\n<tr id="planetHeader">\n<th>\n</th>\n<th>\r\n                    Nam'

In [7]:
[str(tr)[:50] for tr in table.find_all('tr')]

['<tr id="planetHeader">\n<th>\n</th>\n<th>\r\n          ',
 '<tr class="planet" id="planet1" name="Mercury">\n<t',
 '<tr class="planet" id="planet2" name="Venus">\n<td>',
 '<tr class="planet" id="planet3" name="Earth">\n<td>',
 '<tr class="planet" id="planet4" name="Mars">\n<td>\n',
 '<tr class="planet" id="planet5" name="Jupiter">\n<t',
 '<tr class="planet" id="planet6" name="Saturn">\n<td',
 '<tr class="planet" id="planet7" name="Uranus">\n<td',
 '<tr class="planet" id="planet8" name="Neptune">\n<t',
 '<tr class="planet" id="planet9" name="Pluto">\n<td>']

In [8]:
table.find("tr", {"id": "planet3"})

<tr class="planet" id="planet3" name="Earth">
<td>
<img src="img/earth-150x150.png"/>
</td>
<td>
                    Earth
                </td>
<td>
                    5.97
                </td>
<td>
                    12756
                </td>
<td>
                    The name Earth comes from the Indo-European base 'er,'which produced the Germanic noun 'ertho,' and ultimately German 'erde,' Dutch 'aarde,' Scandinavian 'jord,' and English 'earth.' Related forms include Greek 'eraze,' meaning 'on the ground,' and Welsh 'erw,' meaning 'a piece of land.'
                </td>
<td>
<a href="https://en.wikipedia.org/wiki/Earth">Wikipedia</a>
</td>
</tr>

In [9]:
items = dict()
# Each planet row has class="planet"; the 2nd cell holds the name, the 3rd the mass.
planet_rows = table.find_all('tr', {'class': 'planet'})
for row in planet_rows:
    tds = row.find_all('td')
    items[tds[1].text.strip()] = tds[2].text.strip()

items

{'Mercury': '0.330',
 'Venus': '4.87',
 'Earth': '5.97',
 'Mars': '0.642',
 'Jupiter': '1898',
 'Saturn': '568',
 'Uranus': '86.8',
 'Neptune': '102',
 'Pluto': '0.0146'}
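The masses in `items` are still strings. A short sketch of converting them to floats for numeric work; the dictionary literal below copies a few of the values shown above so that the snippet runs on its own (`items_sample` is a name introduced here, not part of the notebook):

```python
# A few entries copied from the `items` output above (masses in 10^24 kg).
items_sample = {"Mercury": "0.330", "Earth": "5.97", "Jupiter": "1898"}

# Convert every mass string to a float.
masses = {name: float(mass) for name, mass in items_sample.items()}

# With numbers in hand, comparisons and aggregates behave numerically.
heaviest = max(masses, key=masses.get)
print(masses)
print(heaviest)
```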

## Querying the DOM with XPath and lxml



Some benefits of using **XPath**:
* Easier navigation through the DOM
* More sophisticated and more powerful than CSS selectors and regular expressions
* More built-in functions, and it is extensible
* Widely supported by other scraping libraries and platforms

The XPath data model defines seven kinds of node:
* root node (the top of the tree)
* element node (\<a>..\</a>)
* attribute node (href='example.html')
* text node ('this is a text')
* comment node (`<!-- a comment -->`)
* namespace node
* processing instruction node

An XPath expression can return different data types:
* strings
* booleans
* numbers
* node-sets (the most common)
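A minimal sketch of those four return types with lxml, assuming a made-up two-row table rather than `planets.html`:

```python
from lxml import html

# Hypothetical fragment used only to illustrate the return types.
doc = html.fromstring(
    '<table><tr name="Earth"><td>5.97</td></tr>'
    '<tr name="Mars"><td>0.642</td></tr></table>'
)

nodes = doc.xpath("//tr")                             # node-set -> list of elements
row_count = doc.xpath("count(//tr)")                  # number   -> float
has_mars = doc.xpath("boolean(//tr[@name='Mars'])")   # boolean  -> bool
first_name = doc.xpath("string(//tr[1]/@name)")       # string   -> str

print(len(nodes), row_count, has_mars, first_name)
```

In lxml a node-set comes back as a Python list, which is why the cells in this section iterate over the result of `tree.xpath(...)`.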

In [10]:
from lxml import html
import requests

page_html = requests.get('http://localhost:8080/planets.html').text

# Load the HTML into an lxml element tree

tree = html.fromstring(page_html)


[tr for tr in tree.xpath("/html/body/div/table/tr")]

[<Element tr at 0x20634513c90>,
 <Element tr at 0x20634513f10>,
 <Element tr at 0x20634513e70>,
 <Element tr at 0x206345135b0>,
 <Element tr at 0x20634510590>,
 <Element tr at 0x20634510540>,
 <Element tr at 0x20634511da0>,
 <Element tr at 0x20631864d60>,
 <Element tr at 0x20631864db0>,
 <Element tr at 0x20631864e50>,
 <Element tr at 0x20631864ea0>]

In [11]:
from lxml import etree
[etree.tostring(tr)[:50] for tr in tree.xpath('/html/body/div/table/tr')]

[b'<tr id="planetHeader">&#13;\n                <th>&#',
 b'<tr id="planet1" class="planet" name="Mercury">&#1',
 b'<tr id="planet2" class="planet" name="Venus">&#13;',
 b'<tr id="planet3" class="planet" name="Earth">&#13;',
 b'<tr id="planet4" class="planet" name="Mars">&#13;\n',
 b'<tr id="planet5" class="planet" name="Jupiter">&#1',
 b'<tr id="planet6" class="planet" name="Saturn">&#13',
 b'<tr id="planet7" class="planet" name="Uranus">&#13',
 b'<tr id="planet8" class="planet" name="Neptune">&#1',
 b'<tr id="planet9" class="planet" name="Pluto">&#13;',
 b'<tr id="footerRow">&#13;\n                <td>&#13;']

In [12]:
[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div/table/tr[@class='planet']")] # Select only the planet rows.

[b'<tr id="planet1" class="planet" name="Mercury">&#1',
 b'<tr id="planet2" class="planet" name="Venus">&#13;',
 b'<tr id="planet3" class="planet" name="Earth">&#13;',
 b'<tr id="planet4" class="planet" name="Mars">&#13;\n',
 b'<tr id="planet5" class="planet" name="Jupiter">&#1',
 b'<tr id="planet6" class="planet" name="Saturn">&#13',
 b'<tr id="planet7" class="planet" name="Uranus">&#13',
 b'<tr id="planet8" class="planet" name="Neptune">&#1',
 b'<tr id="planet9" class="planet" name="Pluto">&#13;']

In [13]:
[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div[1]/table/tr")] # Select only the rows inside the first div.

[b'<tr id="planetHeader">&#13;\n                <th>&#',
 b'<tr id="planet1" class="planet" name="Mercury">&#1',
 b'<tr id="planet2" class="planet" name="Venus">&#13;',
 b'<tr id="planet3" class="planet" name="Earth">&#13;',
 b'<tr id="planet4" class="planet" name="Mars">&#13;\n',
 b'<tr id="planet5" class="planet" name="Jupiter">&#1',
 b'<tr id="planet6" class="planet" name="Saturn">&#13',
 b'<tr id="planet7" class="planet" name="Uranus">&#13',
 b'<tr id="planet8" class="planet" name="Neptune">&#1',
 b'<tr id="planet9" class="planet" name="Pluto">&#13;']

In [14]:
[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div[2]/table/tr")] # Select only the rows inside the second div.

[b'<tr id="footerRow">&#13;\n                <td>&#13;']

The first \<div> in this document also has an `id` attribute, so it can be selected by that attribute instead of by position:

 `<div id="planets">`


In [15]:
# Select only the <div> with id="planets"
[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div[@id='planets']/table/tr")]

[b'<tr id="planetHeader">&#13;\n                <th>&#',
 b'<tr id="planet1" class="planet" name="Mercury">&#1',
 b'<tr id="planet2" class="planet" name="Venus">&#13;',
 b'<tr id="planet3" class="planet" name="Earth">&#13;',
 b'<tr id="planet4" class="planet" name="Mars">&#13;\n',
 b'<tr id="planet5" class="planet" name="Jupiter">&#1',
 b'<tr id="planet6" class="planet" name="Saturn">&#13',
 b'<tr id="planet7" class="planet" name="Uranus">&#13',
 b'<tr id="planet8" class="planet" name="Neptune">&#1',
 b'<tr id="planet9" class="planet" name="Pluto">&#13;']

In [16]:
# Rows can be excluded with !=

[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div[@id='planets']/table/tr[@id!='planetHeader']")]

[b'<tr id="planet1" class="planet" name="Mercury">&#1',
 b'<tr id="planet2" class="planet" name="Venus">&#13;',
 b'<tr id="planet3" class="planet" name="Earth">&#13;',
 b'<tr id="planet4" class="planet" name="Mars">&#13;\n',
 b'<tr id="planet5" class="planet" name="Jupiter">&#1',
 b'<tr id="planet6" class="planet" name="Saturn">&#13',
 b'<tr id="planet7" class="planet" name="Uranus">&#13',
 b'<tr id="planet8" class="planet" name="Neptune">&#1',
 b'<tr id="planet9" class="planet" name="Pluto">&#13;']
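One caveat with `!=`: in XPath 1.0 a comparison against an attribute that is absent is always false, so `tr[@id!='planetHeader']` also drops any row that has no `id` at all. A sketch of the difference, using a hypothetical table in which one row lacks an `id`:

```python
from lxml import html

# Hypothetical table: the last row deliberately has no id attribute.
sample_table = html.fromstring(
    '<table>'
    '<tr id="planetHeader"><th>Name</th></tr>'
    '<tr id="planet1"><td>Mercury</td></tr>'
    '<tr><td>Ceres</td></tr>'
    '</table>'
)

# @id != '...' needs an id attribute to compare against.
with_neq = [tr.text_content() for tr in sample_table.xpath("./tr[@id!='planetHeader']")]

# not(@id = '...') is also true when the attribute is absent.
with_not = [tr.text_content() for tr in sample_table.xpath("./tr[not(@id='planetHeader')]")]

print(with_neq)
print(with_not)
```

In `planets.html` every row carries an `id`, so both forms would select the same rows there.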

In [17]:
# When the rows have no attributes or header markers to filter on, we can use their position with position()

[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div[@id='planets']/table/tr[position() >1]")]

# Selected by position, skipping the first row.

[b'<tr id="planet1" class="planet" name="Mercury">&#1',
 b'<tr id="planet2" class="planet" name="Venus">&#13;',
 b'<tr id="planet3" class="planet" name="Earth">&#13;',
 b'<tr id="planet4" class="planet" name="Mars">&#13;\n',
 b'<tr id="planet5" class="planet" name="Jupiter">&#1',
 b'<tr id="planet6" class="planet" name="Saturn">&#13',
 b'<tr id="planet7" class="planet" name="Uranus">&#13',
 b'<tr id="planet8" class="planet" name="Neptune">&#1',
 b'<tr id="planet9" class="planet" name="Pluto">&#13;']

In [18]:
# We can navigate to parent nodes with the parent::* axis

[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div/table/tr/parent::*")]

[b'<table id="planetsTable" border="1">&#13;\n        ',
 b'<table id="footerTable">&#13;\n            <tr id="']

In [19]:
# * is a wildcard that matches a parent with any element name. We can be more specific by naming the element.

[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div/table/tr/parent::table")] # Naming the table element

[b'<table id="planetsTable" border="1">&#13;\n        ',
 b'<table id="footerTable">&#13;\n            <tr id="']

In [20]:
# The shortcut for the parent node is ..
# The shortcut for the current node is .

[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div/table/tr/..")]

[b'<table id="planetsTable" border="1">&#13;\n        ',
 b'<table id="footerTable">&#13;\n            <tr id="']

In [21]:
# Finding the mass of Earth:

mass = tree.xpath("/html/body/div[1]/table/tr[@name='Earth']/td[3]/text()[1]")[0].strip()
mass

'5.97'
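Indexing the result with `[0]` raises `IndexError` when the XPath matches nothing. A small sketch of a guard; `first_text` is a hypothetical helper name, not part of lxml, and the fragment is made up for illustration:

```python
from lxml import html

def first_text(node, path, default=None):
    """Return the first XPath string result, stripped, or `default` if there is none."""
    results = node.xpath(path)
    return results[0].strip() if results else default

# Hypothetical fragment mirroring the column order of planets.html.
row_doc = html.fromstring(
    '<table><tr name="Earth"><td></td><td>Earth</td><td> 5.97 </td></tr></table>'
)

earth_mass = first_text(row_doc, "./tr[@name='Earth']/td[3]/text()")
vulcan_mass = first_text(row_doc, "./tr[@name='Vulcan']/td[3]/text()")
print(earth_mass, vulcan_mass)
```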

<h3>Querying data with XPath and CSS selectors</h3>

* All tags: `*`
* A specific tag (for example, `tr`): `tr`
* A class (for example, `planet`): `.planet`, or `tr.planet` to match only `tr` elements with that class
* A tag with `id` "`planet3`": `tr#planet3`
* A `tr` that is a direct child of the table: `table > tr`
* A `tr` that is a descendant of the table: `table tr`
* A tag with an attribute (here a `tr` with `id="planet4"`): `tr[id=planet4]`

In [22]:
from lxml import html
import requests

page_html = requests.get("http://localhost:8080/planets.html").text
tree = html.fromstring(page_html)

In [23]:
# All <tr> elements with class='planet'
[(v, v.xpath("@name")) for v in tree.cssselect('tr.planet')]

[(<Element tr at 0x2063326c9f0>, ['Mercury']),
 (<Element tr at 0x206318671a0>, ['Venus']),
 (<Element tr at 0x20631888950>, ['Earth']),
 (<Element tr at 0x20631888ef0>, ['Mars']),
 (<Element tr at 0x20631888bd0>, ['Jupiter']),
 (<Element tr at 0x206318aa750>, ['Saturn']),
 (<Element tr at 0x206318aa570>, ['Uranus']),
 (<Element tr at 0x206318aa890>, ['Neptune']),
 (<Element tr at 0x206318aa8e0>, ['Pluto'])]

In [24]:
# The Earth row can be found in several ways.

tr = tree.cssselect("tr#planet3")
tr[0], tr[0].xpath("./td[2]/text()")[0].strip()

(<Element tr at 0x20631888950>, 'Earth')

In [25]:
# Select by an attribute with a specific value
tr = tree.cssselect("tr[name='Pluto']") # Unlike XPath, CSS attribute selectors do not need the @ prefix
tr[0], tr[0].xpath("./td[2]/text()")[0].strip()

(<Element tr at 0x206318aa8e0>, 'Pluto')

<h3>Using Scrapy selectors</h3>

In [26]:
from scrapy.selector import Selector
import requests

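response = requests.get("https://stackoverflow.com/questions")
# Build the Selector from the page text, so it does not depend on Scrapy accepting a requests.Response object
selector = Selector(text=response.text)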
selector

<Selector xpath=None data='<html class="html__responsive " lang=...'>

In [27]:
classes = selector.xpath('//div[@class]/h3')
classes[0:5]

[<Selector xpath='//div[@class]/h3' data='<h3 class="flex--item">\r\n            ...'>,
 <Selector xpath='//div[@class]/h3' data='<h3>\r\nyour communities            </h3>'>,
 <Selector xpath='//div[@class]/h3' data='<h3><a href="https://stackexchange.co...'>,
 <Selector xpath='//div[@class]/h3' data='<h3 class="s-post-summary--content-ti...'>,
 <Selector xpath='//div[@class]/h3' data='<h3 class="s-post-summary--content-ti...'>]

In [28]:
classes = selector.xpath('//div[@class="s-post-summary--content"]/h3/a/text()').getall()
classes[:10]


['Powershell pass variable to ArgumentList',
 'stratified sampling with priors in python',
 'How to fill column using values in 2 other columns in PROC SQL / SAS code in SAS Enterprise Guide?',
 'python google cloud list vm in json format',
 'pandas: get count within each column based on different arithmetic condition',
 'Usage of Synaptics RMI4 f54',
 'Draw.io not have scratchpad and menu',
 'Test connection and output in GUI screen in Powershell',
 'how can use throw new error to send a error response in postman',
 'Dealing with spring PropertySource']

<h3>Loading data in Unicode/UTF-8</h3>

In [29]:
from urllib.request import urlopen

page = urlopen('http://localhost:8080/unicode.html')
content = page.read()
content[840:1280]

b'\r\n    </table>\r\n\r\n    <p><strong>Cyrillic</strong> &nbsp; U+0400 \xe2\x80\x93 U+04FF &nbsp; (1024\xe2\x80\x931279)</p>\r\n    <table class="unicode">\r\n        <tbody>\r\n            <tr valign="top">\r\n                <td width="50">&nbsp;</td>\r\n                <td class="b" width="50">\xd0\x89</td>\r\n                <td class="b" width="50">\xd0\xa9</td>\r\n                <td class="b" width="50">\xd1\x89</td>\r\n                <td class="b" width="50">\xd3\x83</td>\r\n            </tr>'

In [30]:
# Note that sequences such as \xd0\x89 are shown as raw multi-byte UTF-8 bytes.
# To fix this, decode the bytes into a Python str

str(content,'utf-8')[837:1280]

# The output is now correct: the letters of the Cyrillic alphabet are visible.

'\n    </table>\r\n\r\n    <p><strong>Cyrillic</strong> &nbsp; U+0400 – U+04FF &nbsp; (1024–1279)</p>\r\n    <table class="unicode">\r\n        <tbody>\r\n            <tr valign="top">\r\n                <td width="50">&nbsp;</td>\r\n                <td class="b" width="50">Љ</td>\r\n                <td class="b" width="50">Щ</td>\r\n                <td class="b" width="50">щ</td>\r\n                <td class="b" width="50">Ӄ</td>\r\n            </tr>\r\n        </'

In [37]:
# The same can be done more concisely with requests
import requests
response = requests.get('http://localhost:8080/unicode.html').content
str(response,'utf-8')[837:1280]

'\n    </table>\r\n\r\n    <p><strong>Cyrillic</strong> &nbsp; U+0400 – U+04FF &nbsp; (1024–1279)</p>\r\n    <table class="unicode">\r\n        <tbody>\r\n            <tr valign="top">\r\n                <td width="50">&nbsp;</td>\r\n                <td class="b" width="50">Љ</td>\r\n                <td class="b" width="50">Щ</td>\r\n                <td class="b" width="50">щ</td>\r\n                <td class="b" width="50">Ӄ</td>\r\n            </tr>\r\n        </'
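The relationship between the byte sequences and the characters can be checked without a server. A sketch using byte values copied from the raw output above (the surrounding `<td>` markup is shortened for illustration):

```python
# Bytes copied from the raw response above: two Cyrillic letters in UTF-8.
raw = b'<td class="b">\xd0\x89</td><td class="b">\xd0\xa9</td>'

# Decoding turns each two-byte sequence into a single character.
text = raw.decode("utf-8")      # equivalent to str(raw, "utf-8")
print(text)

# Each letter is one code point but two bytes when encoded.
print(len("Љ"), len("Љ".encode("utf-8")))

# Decoding with the wrong codec does not fail here; it silently yields mojibake.
wrong = raw.decode("latin-1")
print(wrong != text)
```

With requests, `response.text` performs this decoding using `response.encoding`, which can be set explicitly (for example to `'utf-8'`) before reading `.text`.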