# Parsing Data from Websites

In [10]:
import pandas as pd
import requests
import lxml
from lxml import html

## Using lxml to Parse HTML 

The `lxml` library, a Python binding for the libxml2 and libxslt C libraries, provides sophisticated, fairly low-level functionality for traversing XML and HTML documents. In this example we use its specialized `html` submodule.

We download some data from Wikipedia:

In [2]:
url = "https://en.wikipedia.org/wiki/United_States_presidential_election_in_Virginia,_2004"
resp = requests.get(url)
resp

<Response [200]>

## Construct the DOM

Parse the response content into a document object model (DOM):

In [11]:
dom = lxml.html.document_fromstring(resp.content)

In [12]:
dom

<Element html at 0x1fc5975c318>
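To see what the parse step returns without hitting the network, here is a minimal sketch on a small inline page. The `sample` markup and the `doc` name are made up for illustration; only `html.document_fromstring` comes from the cells above.

```python
from lxml import html

# Hypothetical miniature page standing in for resp.content (bytes, like the real response body).
sample = b"<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# document_fromstring always returns the root <html> element of the parsed tree.
doc = html.document_fromstring(sample)

print(doc.tag)                     # tag name of the root element
print(doc.findtext("head/title"))  # text of the <title> element, located by a relative path
```

The root element is the entry point for everything that follows: children, `find`, and XPath queries are all evaluated relative to it.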

## Traversing the DOM

In [13]:
dom.getchildren()

[<Element head at 0x1fc5989d598>, <Element body at 0x1fc5989d958>]

## Jumping directly to an Element

In [14]:
body = dom.find("body")

In [15]:
body.getchildren()

[<Element div at 0x1fc5989db38>,
 <Element div at 0x1fc5989d908>,
 <Element div at 0x1fc5989ddb8>,
 <Element div at 0x1fc5989de08>,
 <Element div at 0x1fc5989de58>,
 <Element div at 0x1fc5989dea8>,
 <Element script at 0x1fc5989def8>,
 <Element script at 0x1fc5989df48>,
 <Element script at 0x1fc5989df98>]
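The same traversal can be sketched on a small inline document. Note that lxml documents `getchildren()` as deprecated and recommends `list(element)` or plain iteration instead, which is what this sketch uses. The `sample` markup and variable names are illustrative assumptions, not part of the Wikipedia page.

```python
from lxml import html

# Hypothetical page with a body holding two divs and a script.
sample = (
    "<html><head><title>Demo</title></head>"
    "<body><div id='a'></div><div id='b'></div><script></script></body></html>"
)
doc = html.document_fromstring(sample)

# Iterating over an element yields its direct children, in document order.
child_tags = [child.tag for child in doc]

# find() returns the first direct child matching the tag (or None if there is none).
body = doc.find("body")
body_tags = [child.tag for child in body]

# findall() returns every matching direct child; get() reads an attribute.
div_ids = [div.get("id") for div in body.findall("div")]

print(child_tags, body_tags, div_ids)
```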

## Using XPath to query HTML 

The following XPath query starts by matching `table` elements anywhere in the tree (`//table`). It then descends through the table body (`/tbody`), a row (`/tr`), and a data cell (`/td`) to a link (`/a`) whose `title` attribute (`@title`) has the value `"Accomack County, Virginia"`. From that link, each `..` step moves up one level: to the `td`, then the `tr`, then the `tbody`, and finally the `table`, which is what the query returns.

In [16]:
tables = dom.xpath('//table/tbody/tr/td/a[@title="Accomack County, Virginia"]/../../../..')

Printing the returned table:

In [18]:
tables

[<Element table at 0x1fc5989d778>]

print(html.tostring(tables[0], pretty_print=True).decode('UTF8'))
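As a small, self-contained illustration of this kind of query, the sketch below runs it against a made-up two-table document (the `sample` markup and the `id` attributes are assumptions for the demo). It also shows one possible alternative, the `ancestor::` axis, which selects the enclosing table without counting `..` steps.

```python
from lxml import html

# Hypothetical document: only the second table contains the link we are looking for.
sample = """
<html><body>
<table id="other"><tbody><tr><td><a title="Elsewhere">x</a></td></tr></tbody></table>
<table id="results"><tbody>
<tr><td><a title="Accomack County, Virginia">Accomack</a></td><td>41.3%</td></tr>
</tbody></table>
</body></html>
"""
doc = html.document_fromstring(sample)

# Same shape as the query above: walk down to the link, then up four levels
# (td, tr, tbody, table).
by_parents = doc.xpath(
    '//table/tbody/tr/td/a[@title="Accomack County, Virginia"]/../../../..'
)

# Alternative: find the link anywhere, then take its nearest enclosing table.
by_ancestor = doc.xpath(
    '//a[@title="Accomack County, Virginia"]/ancestor::table[1]'
)

print([t.get("id") for t in by_parents])
print([t.get("id") for t in by_ancestor])
```

The `ancestor::` form keeps working if the nesting between the link and the table changes, whereas the `..` form depends on the exact depth.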

Building a DataFrame from the table:

In [19]:
df = pd.read_html(html.tostring(tables[0]))[0]
df.head()

Unnamed: 0,County or City,Kerry %,Kerry #,Bush %,Bush #,Other %,Other #
0,Accomack,41.3%,5518,57.8%,7726,0.8%,112
1,Albemarle,50.5%,22088,48.5%,21189,1.0%,449
2,Alleghany,44.5%,3203,55.1%,3962,0.4%,30
3,Amelia,34.5%,1862,64.8%,3499,0.7%,36
4,Amherst,38.3%,4866,61.1%,7758,0.6%,71
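The percentage columns come back as strings such as `41.3%`. A possible follow-up step, sketched here on a two-row inline table whose values are copied from the output above, is to strip the `%` sign and convert to floats. The `sample` and `demo` names are illustrative, and wrapping the markup in `StringIO` is a choice made for this sketch rather than something the notebook does.

```python
from io import StringIO

import pandas as pd

# Hypothetical cut-down table using the first two rows shown above.
sample = """
<table>
<tr><th>County or City</th><th>Kerry %</th><th>Kerry #</th></tr>
<tr><td>Accomack</td><td>41.3%</td><td>5518</td></tr>
<tr><td>Albemarle</td><td>50.5%</td><td>22088</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found; take the first.
demo = pd.read_html(StringIO(sample))[0]

# Drop the trailing percent sign and convert the column to floating point.
demo["Kerry %"] = demo["Kerry %"].str.rstrip("%").astype(float)

print(demo)
```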
