<h1 align= center>Data Acquisition and Extraction</h1>
<p align= center><img src=https://www.bmr.it/bmr18/bmr-cont/uploads/2018/08/software.png width=500></p>

## How to parse websites and navigate the DOM using BeautifulSoup

In [2]:
import requests
from bs4 import BeautifulSoup

html = requests.get('http://localhost:8080/planets.html').text

soup = BeautifulSoup(html, 'lxml')

In [3]:
soup.html.body.div.tr

<tr id="planetHeader">
<th>
</th>
<th>
                    Name
                </th>
<th>
                    Mass (10^24kg)
                </th>
<th>
                    Diameter (km)
                </th>
<th>
                    How it got its Name
                </th>
<th>
                    More Info
                </th>
</tr>

In [4]:
soup.html.body.div.table.children

<list_iterator at 0x2d85adf3f70>

In [5]:
# Using a list comprehension
[ str(c)[:45] for c in soup.html.body.div.table.children]

['\n',
 '<tr id="planetHeader">\n<th>\n</th>\n<th>\r\n     ',
 '\n',
 '<tr class="planet" id="planet1" name="Mercury',
 '\n',
 '<tr class="planet" id="planet2" name="Venus">',
 '\n',
 '<tr class="planet" id="planet3" name="Earth">',
 '\n',
 '<tr class="planet" id="planet4" name="Mars">\n',
 '\n',
 '<tr class="planet" id="planet5" name="Jupiter',
 '\n',
 '<tr class="planet" id="planet6" name="Saturn"',
 '\n',
 '<tr class="planet" id="planet7" name="Uranus"',
 '\n',
 '<tr class="planet" id="planet8" name="Neptune',
 '\n',
 '<tr class="planet" id="planet9" name="Pluto">',
 '\n']
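The `'\n'` entries in the output above are the whitespace text nodes that sit between the rows; `.children` yields them alongside the tags. A minimal sketch of filtering them out follows. It assumes a small inline table (`SAMPLE_HTML` is a hypothetical stand-in for `planets.html`) and uses the built-in `html.parser` so it runs without a local server.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical miniature of the planets table, used only for illustration.
SAMPLE_HTML = """
<html><body><div>
<table id="planetsTable">
<tr id="planetHeader"><th>Name</th></tr>
<tr class="planet" id="planet1" name="Mercury"><td>Mercury</td></tr>
<tr class="planet" id="planet2" name="Venus"><td>Venus</td></tr>
</table>
</div></body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table")

# .children mixes text nodes (the newlines) with tags; keep only the tags.
row_ids = [child["id"] for child in table.children if isinstance(child, Tag)]
print(row_ids)
```

Checking `isinstance(child, Tag)` is one option; calling `table.find_all("tr", recursive=False)` should give the same rows when only direct children are wanted.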

In [6]:
# Using .parent to reach the parent element

str(soup.html.body.div.table.tr.parent)[:200]

'<table border="1" id="planetsTable">\n<tr id="planetHeader">\n<th>\n</th>\n<th>\r\n                    Name\r\n                </th>\n<th>\r\n                    Mass (10^24kg)\r\n                </th>\n<th>\r\n     '

## Searching the DOM with Beautiful Soup's find methods

In [7]:
table = soup.find('table')
str(table)[:100]

'<table border="1" id="planetsTable">\n<tr id="planetHeader">\n<th>\n</th>\n<th>\r\n                    Nam'

In [8]:
[str(tr)[:50] for tr in table.find_all('tr')]

['<tr id="planetHeader">\n<th>\n</th>\n<th>\r\n          ',
 '<tr class="planet" id="planet1" name="Mercury">\n<t',
 '<tr class="planet" id="planet2" name="Venus">\n<td>',
 '<tr class="planet" id="planet3" name="Earth">\n<td>',
 '<tr class="planet" id="planet4" name="Mars">\n<td>\n',
 '<tr class="planet" id="planet5" name="Jupiter">\n<t',
 '<tr class="planet" id="planet6" name="Saturn">\n<td',
 '<tr class="planet" id="planet7" name="Uranus">\n<td',
 '<tr class="planet" id="planet8" name="Neptune">\n<t',
 '<tr class="planet" id="planet9" name="Pluto">\n<td>']

In [9]:
table.find("tr", {"id": "planet3"})

<tr class="planet" id="planet3" name="Earth">
<td>
<img src="img/earth-150x150.png"/>
</td>
<td>
                    Earth
                </td>
<td>
                    5.97
                </td>
<td>
                    12756
                </td>
<td>
                    The name Earth comes from the Indo-European base 'er,'which produced the Germanic noun 'ertho,' and ultimately German 'erde,' Dutch 'aarde,' Scandinavian 'jord,' and English 'earth.' Related forms include Greek 'eraze,' meaning 'on the ground,' and Welsh 'erw,' meaning 'a piece of land.'
                </td>
<td>
<a href="https://en.wikipedia.org/wiki/Earth">Wikipedia</a>
</td>
</tr>

In [10]:
items = {}
planet_rows = table.find_all('tr', {'class': 'planet'})
for row in planet_rows:
    tds = row.find_all('td')
    # Column 1 holds the planet name, column 2 its mass.
    items[tds[1].text.strip()] = tds[2].text.strip()

items

{'Mercury': '0.330',
 'Venus': '4.87',
 'Earth': '5.97',
 'Mars': '0.642',
 'Jupiter': '1898',
 'Saturn': '568',
 'Uranus': '86.8',
 'Neptune': '102',
 'Pluto': '0.0146'}
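The masses in the dictionary above are still strings. If they are meant for arithmetic, one possible follow-up is to convert them while building the dictionary. The sketch below assumes a two-row inline table (`SAMPLE_HTML` and the name `masses` are hypothetical) with the same column layout as the planets page.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the planets table: empty image cell, name, mass.
SAMPLE_HTML = """
<table id="planetsTable">
<tr id="planetHeader"><th></th><th>Name</th><th>Mass (10^24kg)</th></tr>
<tr class="planet" id="planet3" name="Earth"><td></td><td> Earth </td><td> 5.97 </td></tr>
<tr class="planet" id="planet4" name="Mars"><td></td><td> Mars </td><td> 0.642 </td></tr>
</table>
"""

table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table")

masses = {}
for row in table.find_all("tr", class_="planet"):
    cells = row.find_all("td")
    name = cells[1].get_text(strip=True)
    # float() raises ValueError on non-numeric text, which surfaces bad cells early.
    masses[name] = float(cells[2].get_text(strip=True))

print(masses)
```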

## Querying the DOM with XPath and lxml



Some benefits of using **XPath**:
* It makes navigating the DOM easier
* It is more sophisticated and more powerful than CSS selectors and regular expressions
* It has more built-in functions and is extensible
* It is widely supported by other scraping libraries and platforms

The XPath data model defines seven kinds of nodes:
* root node (the top of the tree)
* element node (`<a>...</a>`)
* attribute node (`href='example.html'`)
* text node (`'this is a text'`)
* comment node (`<!-- a comment -->`)
* namespace node
* processing-instruction node

An XPath expression can return different data types:
* strings
* booleans
* numbers
* node-sets (the most common)
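To make the four return types concrete, here is a small sketch using `lxml` on an inline document. `SAMPLE_HTML` is a hypothetical stand-in for the planets page; the XPath functions `count()`, `boolean()` and `string()` are standard XPath 1.0.

```python
from lxml import html

# Hypothetical miniature of the planets page.
SAMPLE_HTML = """
<html><body><div id="planets"><table>
<tr id="planetHeader"><th>Name</th></tr>
<tr class="planet" name="Earth"><td>Earth</td></tr>
<tr class="planet" name="Mars"><td>Mars</td></tr>
</table></div></body></html>
"""
tree = html.fromstring(SAMPLE_HTML)

# Node-set: lxml returns a Python list of elements.
node_set = tree.xpath("//tr[@class='planet']")
# Number: XPath numbers come back as Python floats.
planet_count = tree.xpath("count(//tr[@class='planet'])")
# Boolean: True when the node-set is non-empty.
has_earth = tree.xpath("boolean(//tr[@name='Earth'])")
# String: string() takes the first node of the set in document order.
first_name = tree.xpath("string(//tr[@class='planet'][1]/@name)")

print(type(node_set), planet_count, has_earth, first_name)
```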

In [18]:
from lxml import html
import requests

page_html = requests.get('http://localhost:8080/planets.html').text

# Load the HTML into an lxml element tree

tree = html.fromstring(page_html)


[tr for tr in tree.xpath("/html/body/div/table/tr")]

[<Element tr at 0x2d85ba836f0>,
 <Element tr at 0x2d85ba833d0>,
 <Element tr at 0x2d85ba835b0>,
 <Element tr at 0x2d85baadbc0>,
 <Element tr at 0x2d85baae570>,
 <Element tr at 0x2d85baae5c0>,
 <Element tr at 0x2d85baae660>,
 <Element tr at 0x2d85baae6b0>,
 <Element tr at 0x2d85baae110>,
 <Element tr at 0x2d85baae340>,
 <Element tr at 0x2d85baae2a0>]

In [20]:
from lxml import etree
[etree.tostring(tr)[:50] for tr in tree.xpath('/html/body/div/table/tr')]

[b'<tr id="planetHeader">&#13;\n                <th>&#',
 b'<tr id="planet1" class="planet" name="Mercury">&#1',
 b'<tr id="planet2" class="planet" name="Venus">&#13;',
 b'<tr id="planet3" class="planet" name="Earth">&#13;',
 b'<tr id="planet4" class="planet" name="Mars">&#13;\n',
 b'<tr id="planet5" class="planet" name="Jupiter">&#1',
 b'<tr id="planet6" class="planet" name="Saturn">&#13',
 b'<tr id="planet7" class="planet" name="Uranus">&#13',
 b'<tr id="planet8" class="planet" name="Neptune">&#1',
 b'<tr id="planet9" class="planet" name="Pluto">&#13;',
 b'<tr id="footerRow">&#13;\n                <td>&#13;']

In [21]:
[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div/table/tr[@class='planet']")] # Select only the planet rows.

[b'<tr id="planet1" class="planet" name="Mercury">&#1',
 b'<tr id="planet2" class="planet" name="Venus">&#13;',
 b'<tr id="planet3" class="planet" name="Earth">&#13;',
 b'<tr id="planet4" class="planet" name="Mars">&#13;\n',
 b'<tr id="planet5" class="planet" name="Jupiter">&#1',
 b'<tr id="planet6" class="planet" name="Saturn">&#13',
 b'<tr id="planet7" class="planet" name="Uranus">&#13',
 b'<tr id="planet8" class="planet" name="Neptune">&#1',
 b'<tr id="planet9" class="planet" name="Pluto">&#13;']

In [23]:
[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div[1]/table/tr")] # Select only the rows inside the first <div>.

[b'<tr id="planetHeader">&#13;\n                <th>&#',
 b'<tr id="planet1" class="planet" name="Mercury">&#1',
 b'<tr id="planet2" class="planet" name="Venus">&#13;',
 b'<tr id="planet3" class="planet" name="Earth">&#13;',
 b'<tr id="planet4" class="planet" name="Mars">&#13;\n',
 b'<tr id="planet5" class="planet" name="Jupiter">&#1',
 b'<tr id="planet6" class="planet" name="Saturn">&#13',
 b'<tr id="planet7" class="planet" name="Uranus">&#13',
 b'<tr id="planet8" class="planet" name="Neptune">&#1',
 b'<tr id="planet9" class="planet" name="Pluto">&#13;']

In [24]:
[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div[2]/table/tr")] # Select only the rows inside the second <div>.

[b'<tr id="footerRow">&#13;\n                <td>&#13;']

The first `<div>` in this document also has an `id` attribute, so it can be selected by that attribute instead of by position:

 `<div id="planets">`


In [30]:
[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div[@id='planets']/table/tr")]

[b'<tr id="planetHeader">&#13;\n                <th>&#',
 b'<tr id="planet1" class="planet" name="Mercury">&#1',
 b'<tr id="planet2" class="planet" name="Venus">&#13;',
 b'<tr id="planet3" class="planet" name="Earth">&#13;',
 b'<tr id="planet4" class="planet" name="Mars">&#13;\n',
 b'<tr id="planet5" class="planet" name="Jupiter">&#1',
 b'<tr id="planet6" class="planet" name="Saturn">&#13',
 b'<tr id="planet7" class="planet" name="Uranus">&#13',
 b'<tr id="planet8" class="planet" name="Neptune">&#1',
 b'<tr id="planet9" class="planet" name="Pluto">&#13;']

In [31]:
# Rows can be excluded with the != operator

[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div[@id='planets']/table/tr[@id!='planetHeader']")]

[b'<tr id="planet1" class="planet" name="Mercury">&#1',
 b'<tr id="planet2" class="planet" name="Venus">&#13;',
 b'<tr id="planet3" class="planet" name="Earth">&#13;',
 b'<tr id="planet4" class="planet" name="Mars">&#13;\n',
 b'<tr id="planet5" class="planet" name="Jupiter">&#1',
 b'<tr id="planet6" class="planet" name="Saturn">&#13',
 b'<tr id="planet7" class="planet" name="Uranus">&#13',
 b'<tr id="planet8" class="planet" name="Neptune">&#1',
 b'<tr id="planet9" class="planet" name="Pluto">&#13;']

In [34]:
# When the rows have no attributes or header markers to filter on, we can select by position with position()

[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div[@id='planets']/table/tr[position() >1]")]

# Selected by position, skipping the first row.

[b'<tr id="planet1" class="planet" name="Mercury">&#1',
 b'<tr id="planet2" class="planet" name="Venus">&#13;',
 b'<tr id="planet3" class="planet" name="Earth">&#13;',
 b'<tr id="planet4" class="planet" name="Mars">&#13;\n',
 b'<tr id="planet5" class="planet" name="Jupiter">&#1',
 b'<tr id="planet6" class="planet" name="Saturn">&#13',
 b'<tr id="planet7" class="planet" name="Uranus">&#13',
 b'<tr id="planet8" class="planet" name="Neptune">&#1',
 b'<tr id="planet9" class="planet" name="Pluto">&#13;']

In [35]:
# The parent axis (parent::*) navigates up to the parent nodes

[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div/table/tr/parent::*")]

[b'<table id="planetsTable" border="1">&#13;\n        ',
 b'<table id="footerTable">&#13;\n            <tr id="']

In [39]:
# * is a wildcard that matches a parent with any tag name. We can be more specific by naming the element.

[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div/table/tr/parent::table")] # Naming the <table> element

[b'<table id="planetsTable" border="1">&#13;\n        ',
 b'<table id="footerTable">&#13;\n            <tr id="']

In [43]:
# The shortcut for the parent node is ..
# The shortcut for the current node is .

[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div/table/tr/..")]

[b'<table id="planetsTable" border="1">&#13;\n        ',
 b'<table id="footerTable">&#13;\n            <tr id="']
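The parent queries above return only two tables although many rows were matched, because an XPath node-set holds each node once. The sketch below illustrates this on a reduced document (`SAMPLE_HTML` is a hypothetical stand-in for `planets.html`) and compares `..` with lxml's `getparent()` method.

```python
from lxml import html

# Hypothetical stand-in for planets.html, reduced to two tables.
SAMPLE_HTML = """
<html><body>
<div id="planets"><table id="planetsTable"><tr id="planetHeader"><th>Name</th></tr>
<tr id="planet1" class="planet" name="Mercury"><td>Mercury</td></tr></table></div>
<div id="footer"><table id="footerTable"><tr id="footerRow"><td>Footer</td></tr></table></div>
</body></html>
"""
tree = html.fromstring(SAMPLE_HTML)

rows = tree.xpath("/html/body/div/table/tr")

# Evaluating ".." per row gives one parent per row, so duplicates appear.
parent_ids_per_row = [row.xpath("..")[0].get("id") for row in rows]

# getparent() is the element-level equivalent of the ".." step.
parent_ids_via_api = [row.getparent().get("id") for row in rows]

# A single expression returns a node-set, so each table appears only once.
unique_parent_ids = [t.get("id") for t in tree.xpath("/html/body/div/table/tr/..")]

print(parent_ids_per_row, unique_parent_ids)
```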

In [76]:
# Finding the mass of Earth:

mass = tree.xpath("/html/body/div[1]/table/tr[@name='Earth']/td[3]/text()[1]")[0].strip()
mass

'5.97'
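Indexing the result with `[0]` raises `IndexError` when the expression matches nothing, for example if a row name is misspelled. One possible guard is a small helper that returns a default instead. The name `first_text` and the inline `SAMPLE_HTML` are hypothetical and not part of lxml.

```python
from lxml import html

# Hypothetical miniature of the planets page: empty image cell, name, mass.
SAMPLE_HTML = """
<html><body><div><table>
<tr class="planet" name="Earth"><td></td><td> Earth </td><td> 5.97 </td></tr>
</table></div></body></html>
"""
tree = html.fromstring(SAMPLE_HTML)


def first_text(node, expression, default=None):
    """Return the first stripped text result of an XPath query, or default."""
    results = node.xpath(expression)
    return results[0].strip() if results else default


earth_mass = first_text(tree, "//tr[@name='Earth']/td[3]/text()")
missing_mass = first_text(tree, "//tr[@name='Vulcan']/td[3]/text()")

print(earth_mass, missing_mass)
```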

<h3>Querying data with XPath and CSS selectors</h3>

* All tags: `*`
* A specific tag (for example, `tr`): `tr`
* A class (for example, `planet`): `.planet`
* A tag with `id` "`planet3`": `tr#planet3`
* A child `tr` of the table: `table > tr`
* A descendant `tr` of the table: `table tr`
* A tag with a given attribute value (for example, a `tr` with `id="planet4"`): `tr[id='planet4']`
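The selector forms listed above can be tried on a small inline document. This sketch uses BeautifulSoup's `select()` method rather than lxml's `cssselect()`, simply to avoid the extra `cssselect` dependency; `SAMPLE_HTML` is a hypothetical stand-in for the planets page.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the planets page.
SAMPLE_HTML = """
<div id="planets">
<table id="planetsTable">
<tr id="planetHeader"><th>Name</th></tr>
<tr class="planet" id="planet3" name="Earth"><td>Earth</td></tr>
<tr class="planet" id="planet4" name="Mars"><td>Mars</td></tr>
</table>
</div>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

by_tag = soup.select("tr")                      # every <tr>
by_class = soup.select(".planet")               # every element with class "planet"
by_id = soup.select("tr#planet3")               # the <tr> whose id is "planet3"
children = soup.select("table > tr")            # <tr> directly under <table>
descendants = soup.select("div tr")             # <tr> anywhere under <div>
by_attribute = soup.select("tr[id='planet4']")  # <tr> with a given attribute value

print(len(by_tag), len(by_class), by_id[0]["name"], by_attribute[0]["name"])
```

Note that `table > tr` depends on the parser: browsers and some parsers insert a `<tbody>` between the table and its rows, in which case only the descendant form matches.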

In [77]:
from lxml import html
import requests

page_html = requests.get("http://localhost:8080/planets.html").text
tree = html.fromstring(page_html)

In [80]:
# All <tr> elements with class='planet'
[(v, v.xpath("@name")) for v in tree.cssselect('tr.planet')]

[(<Element tr at 0x2d85b9fbfb0>, ['Mercury']),
 (<Element tr at 0x2d85bb09f30>, ['Venus']),
 (<Element tr at 0x2d85babbb50>, ['Earth']),
 (<Element tr at 0x2d85baaf2e0>, ['Mars']),
 (<Element tr at 0x2d85ba67b50>, ['Jupiter']),
 (<Element tr at 0x2d85bb3fba0>, ['Saturn']),
 (<Element tr at 0x2d85bb659e0>, ['Uranus']),
 (<Element tr at 0x2d85bb65a80>, ['Neptune']),
 (<Element tr at 0x2d85bb65300>, ['Pluto'])]

In [87]:
# The Earth row can be located in several ways.

tr = tree.cssselect("tr#planet3")
tr[0], tr[0].xpath("./td[2]/text()")[0].strip()

(<Element tr at 0x2d85babbb50>, 'Earth')

In [89]:
# Select by an attribute with a specific value
tr = tree.cssselect("tr[name='Pluto']") # Unlike XPath, CSS attribute selectors do not use the @ prefix
tr[0], tr[0].xpath("./td[2]/text()")[0].strip()

(<Element tr at 0x2d85bb65300>, 'Pluto')

<h3>Using Scrapy selectors</h3>

In [92]:
from scrapy.selector import Selector
import requests

response = requests.get("https://stackoverflow.com/questions")
selector = Selector(response)
selector

<Selector xpath=None data='<html class="html__responsive " lang=...'>

In [124]:
classes = selector.xpath('//div[@class]/h3')
classes[0:5]

[<Selector xpath='//div[@class]/h3' data='<h3 class="flex--item">\r\n            ...'>,
 <Selector xpath='//div[@class]/h3' data='<h3>\r\nyour communities            </h3>'>,
 <Selector xpath='//div[@class]/h3' data='<h3><a href="https://stackexchange.co...'>,
 <Selector xpath='//div[@class]/h3' data='<h3 class="s-post-summary--content-ti...'>,
 <Selector xpath='//div[@class]/h3' data='<h3 class="s-post-summary--content-ti...'>]