# Introduction
In homework 1, we were required to implement a simple parser for XML files. In the real world, however, there are mature libraries for this task. This tutorial introduces lxml, a powerful library for processing XML and HTML files in Python. It wraps the C libraries libxml2 and libxslt behind a native Python API, so it is easy to use: you don't need to manage memory manually.

First, let's take a quick look at XML and HTML files.


## XML File
Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.

XML is used very widely. Although its design goals focus on a textual data format with Unicode support, the language is also widely used to represent arbitrary data structures, such as those used in web services.

## HTML File
HyperText Markup Language (HTML) is the standard markup language for creating web pages and web applications. Together with Cascading Style Sheets (CSS) and JavaScript, it forms a triad of cornerstone technologies for the World Wide Web.

## Tutorial Content

* [Install lxml](#Install-lxml)
* [Parse XML document](#Parse-XML-document)
 * [Output the XML document](#Output-the-XML-document)
 * [Access to tag and attributes](#Access-to-tag-and-attributes)
 * [Traversing child nodes](#Traversing-child-nodes)
 * [Access to parent node and sibling nodes](#Access-to-parent-node-and-sibling-nodes)
 * [Access to text](#Access-to-text)
 * [Search tag](#Search-tag)
 * [Tree iteration](#Tree-iteration)
* [Modify existing XML document](#Modify-existing-XML-document)
* [Parse HTML document](#Parse-HTML-document)
* [Simple Example : Extract CD Information](#Simple-Example-:-Extract-CD-Information)
* [Complex Example: W3C.xml](#Complex-Example:-W3C.xml)
* [Further resources](#Further-resources)

## Install lxml
There are several ways to install lxml. The simplest one is through `pip`.

	$ pip install lxml

Also, you can specify the version to install.

	$ pip install lxml==3.4.2
	
Since lxml is open source, you can manually download the code and build it. Click [Here](http://lxml.de/build.html) for more information.

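After installing, import the `etree` module with the following statement. The code cells in this tutorial use Python 2 syntax (for example, `print` statements).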

In [1]:
from lxml import etree
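
To confirm that the installation worked, one option is to print the version information that lxml exposes. This is a minimal sketch; `LXML_VERSION` and `LIBXML_VERSION` are version tuples provided by `lxml.etree`, and the code uses `print()` calls so that it also runs on Python 3.

```python
from lxml import etree

# version of lxml itself, as a tuple of integers
print(etree.LXML_VERSION)

# version of the underlying libxml2 C library in use
print(etree.LIBXML_VERSION)
```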

## Parse XML document

Here is an example of an XML document.

	<?xml version="1.0" encoding="UTF-8"?>
	<!-- This is a comment -->
	<note date="8/31/12">
	    <to>Tove</to>
	    <from>Jani</from>
	    <heading type="Reminder"/>
	    <body>Don't forget me this weekend!</body>
	    <!-- This is a multiline comment, which take a bit of care to parse -->
	</note>

If you did homework 1, it may look familiar :-)



The `parse()` function takes a file-like object (or a filename) as input and returns an `lxml.etree._ElementTree` object. This object holds all the information in the XML document in a tree structure.

Document information such as `xml_version` and `encoding` is available through the tree's `docinfo` attribute.

In [2]:
import io
document = '''<?xml version="1.0" encoding="UTF-8"?><!-- This is a comment --><note date="8/31/12"><to>Tove</to><from>Jani</from><heading type="Reminder"/><body>Don't forget me this weekend!</body><!-- This is a multiline comment, which take a bit of care to parse --></note>'''
tree = etree.parse(io.BytesIO(document))
print type(tree)
print tree.docinfo.xml_version
print tree.docinfo.encoding

<type 'lxml.etree._ElementTree'>
1.0
UTF-8
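
When the document is already in memory, `etree.fromstring()` is a shorter route than wrapping it in `io.BytesIO`, and it returns the root element directly. The sketch below also shows what happens with malformed XML; the `good` and `bad` inputs are made up for illustration.

```python
from lxml import etree

good = b'<?xml version="1.0" encoding="UTF-8"?><note date="8/31/12"><to>Tove</to></note>'
bad = b'<note><to>Tove</note>'  # <to> is never closed

# fromstring() parses bytes and returns the root element
root = etree.fromstring(good)
print(root.tag)

# malformed XML raises XMLSyntaxError instead of returning a partial tree
try:
    etree.fromstring(bad)
    parsed_bad = True
except etree.XMLSyntaxError as err:
    parsed_bad = False
    print("parse failed: %s" % err)
```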


As shown below, you can use the `getroot()` function to get the root node of this tree structure. Every node in this tree is an `lxml.etree._Element` object.

In [3]:
root = tree.getroot()
print type(root)

<type 'lxml.etree._Element'>


### Output the XML document
The `etree.tostring()` function converts an `lxml.etree._Element` object to a string. With the `pretty_print` argument set, the function indents the output so that it is easier to read.

You can also specify the `method` argument to set the output format, for example `html` or `xml`.

In [4]:
print etree.tostring(root)

# set the pretty_print argument
print etree.tostring(root, pretty_print=True)

# output in xml format
print etree.tostring(root, method='xml')

# output in html format
print etree.tostring(root, method='html')

<note date="8/31/12"><to>Tove</to><from>Jani</from><heading type="Reminder"/><body>Don't forget me this weekend!</body><!-- This is a multiline comment, which take a bit of care to parse --></note>
<note date="8/31/12">
  <to>Tove</to>
  <from>Jani</from>
  <heading type="Reminder"/>
  <body>Don't forget me this weekend!</body>
  <!-- This is a multiline comment, which take a bit of care to parse -->
</note>

<note date="8/31/12"><to>Tove</to><from>Jani</from><heading type="Reminder"/><body>Don't forget me this weekend!</body><!-- This is a multiline comment, which take a bit of care to parse --></note>
<note date="8/31/12"><to>Tove</to><from>Jani</from><heading type="Reminder"></heading><body>Don't forget me this weekend!</body><!-- This is a multiline comment, which take a bit of care to parse --></note>
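
Two further `tostring()` arguments are often handy when saving a document: `xml_declaration` and `encoding`. The following is a small sketch on a shortened version of the note document.

```python
from lxml import etree

root = etree.fromstring(b'<note date="8/31/12"><to>Tove</to></note>')

# include an XML declaration and choose the byte encoding of the output
as_bytes = etree.tostring(root, xml_declaration=True, encoding="UTF-8")
print(as_bytes)

# encoding="unicode" returns a text string instead of bytes
as_text = etree.tostring(root, encoding="unicode")
print(as_text)
```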


### Access to tag and attributes
There are several useful functions for accessing the tag and attributes of a node, including reading attribute values and creating new attributes.

In [5]:
print root.tag

# to get one attribute
print root.get("date"), '\n'

# to set another attribute
root.set("new_attribute", "value")
print root.get("new_attribute"), '\n'

# iterate attributes
for key, value in root.items():
    print key, value
print '\n'
    
# another way to iterate attributes
for key in root.attrib:
    print key, root.attrib[key]

note
8/31/12 

value 

date 8/31/12
new_attribute value


date 8/31/12
new_attribute value
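
Because `attrib` behaves like a dictionary, the usual dictionary idioms apply. The sketch below uses a made-up `priority` attribute to show a default value for a missing attribute, a membership test, and deletion.

```python
from lxml import etree

node = etree.fromstring(b'<note date="8/31/12" priority="high"/>')

# get() accepts a default for attributes that are missing
author = node.get("author", "unknown")
print(author)

# attrib can be copied into a plain dict and tested for membership
snapshot = dict(node.attrib)
print("date" in node.attrib)

# remove an attribute
del node.attrib["priority"]
print(node.keys())
```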


### Traversing child nodes
Traversing child nodes is easy with lxml. Every node is a list-like object whose items are its child nodes, which are themselves node objects (usually `lxml.etree._Element`). It supports the usual operations of a Python list, such as indexing, iterating and slicing.

In [6]:
print len(root)

# print the type of each child
for child in root:
    print type(child)
    
# note that the last child is the comment from our XML document,
# so skip it and print the tag of each remaining child node
for child in root[:-1]:
    print child.tag

5
<type 'lxml.etree._Element'>
<type 'lxml.etree._Element'>
<type 'lxml.etree._Element'>
<type 'lxml.etree._Element'>
<type 'lxml.etree._Comment'>
to
from
heading
body
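
Slicing with `root[:-1]` works here only because the comment happens to be the last child. A more general sketch, assuming comments may appear anywhere, filters on the node's `tag`, which for a comment is the `etree.Comment` factory rather than a string.

```python
from lxml import etree

root = etree.fromstring(
    b"<note><to>Tove</to><from>Jani</from><!-- a comment --><body>Hi</body></note>"
)

# keep only real elements, wherever the comments are
element_children = [child for child in root if child.tag is not etree.Comment]
tags = [child.tag for child in element_children]
print(tags)

# list-style helpers also work, e.g. the position of the last child
last_position = root.index(root[-1])
print(last_position)
```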


### Access to parent node and sibling nodes

Navigating the tree often requires access to the parent node. lxml provides the `getparent()` function to find the parent of the current node, and `getprevious()` and `getnext()` to get its sibling nodes.

With access to parent, sibling and child nodes, navigation becomes much easier.

In [7]:
# find the first and second child of root
child1 = root[0]
child2 = child1.getnext()
print child2.getprevious() is child1

# get the parent node of the first child
print root[0].getparent()

# check if they are the same object
print root is root[0].getparent()

True
<Element note at 0x1044bef80>
True
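
These functions return `None` at the edges of the tree, which makes simple loops possible. The sketch below walks a chain of siblings with `getnext()` on a reduced version of the note document.

```python
from lxml import etree

root = etree.fromstring(b"<note><to/><from/><heading/><body/></note>")

# follow getnext() from the first child until it returns None
tags = []
node = root[0]
while node is not None:
    tags.append(node.tag)
    node = node.getnext()
print(tags)

# the root has no parent, and the first child has no previous sibling
print(root.getparent() is None)
print(root[0].getprevious() is None)
```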


### Access to text
Nodes often contain text. You can extract it through the `text` attribute.

In [8]:
for child in root[:-1]:
    print child.text

Tove
Jani
None
Don't forget me this weekend!


The `xpath()` function is useful when you want to extract all the text, including the text in child nodes. Depending on the XPath expression, you get a list of strings (`//text()`) or a single concatenated string (`string()`).

In [9]:
# to get a list
print root.xpath("//text()")

# to get a string
print root.xpath("string()")

['Tove', 'Jani', "Don't forget me this weekend!"]
ToveJaniDon't forget me this weekend!
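
One detail worth knowing when text and child elements are mixed: `text` only holds the text before the first child, while text that follows a child's closing tag is stored in that child's `tail` attribute. A small sketch with a made-up `<p>` element:

```python
from lxml import etree

p = etree.fromstring(b"<p>Hello <b>world</b>!</p>")
b = p[0]

print(repr(p.text))  # text before the first child
print(repr(b.text))  # text inside <b>
print(repr(b.tail))  # text after </b>, stored on the <b> element

# itertext() yields every text piece in document order
full = "".join(p.itertext())
print(full)
```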


### Search tag
Sometimes you are interested only in particular tags. `find()` and `findall()` are useful here: `find()` returns the first matching child and `findall()` returns a list of all matching children. The examples below show how to use each.

In [10]:
# print the type of heading
print root.find('heading').get('type')

# print the text of body
print root.find('body').text

print ''

# If there are multiple matching tags, findall() returns them all
temp = etree.XML("<root><a>Text1</a><a>Text2</a><a>Text3</a></root>")
print etree.tostring(temp, pretty_print=True)
for node in temp.findall('a'):
    print node.text

# a path expression can find tags recursively
temp = etree.XML("<root><a1><b>in a1</b></a1><a2><b>in a2</b></a2><a3><b>in a3</b></a3></root>")
print etree.tostring(temp, pretty_print=True)

# note the call below returns 0 matches, because 'b' only matches direct children;
# use the './/b' path to search all descendants
print len(temp.findall('b'))

print len(temp.findall('.//b'))

Reminder
Don't forget me this weekend!

<root>
  <a>Text1</a>
  <a>Text2</a>
  <a>Text3</a>
</root>

Text1
Text2
Text3
<root>
  <a1>
    <b>in a1</b>
  </a1>
  <a2>
    <b>in a2</b>
  </a2>
  <a3>
    <b>in a3</b>
  </a3>
</root>

0
3
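
Since `find()` returns `None` when nothing matches, chained calls such as `root.find('cc').text` fail on missing tags. The sketch below shows two ways to handle that, plus an `xpath()` predicate; the `cc` tag is a made-up example of a missing element.

```python
from lxml import etree

root = etree.fromstring(b'<note><to>Tove</to><heading type="Reminder"/></note>')

# find() returns None when nothing matches, so check before using the result
missing = root.find("cc")
print(missing is None)

# findtext() returns the text of the first match, or the given default
to_text = root.findtext("to")
cc_text = root.findtext("cc", default="n/a")
print(to_text)
print(cc_text)

# xpath() supports predicates, e.g. selecting by attribute value
reminders = root.xpath("//heading[@type='Reminder']")
print(len(reminders))
```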


### Tree iteration
If you want to traverse the whole tree recursively, use the `iter()` function. It returns an iterator over the node and all its descendants in document order, which is the order of a depth-first (pre-order) traversal.

If you are only interested in some elements, pass their tag names to the function and it filters the result automatically.

In [11]:
# iterate the whole tree; note that a comment has no tag name (its tag is the Comment function), only text
for node in root.iter():
    print node.tag, node.text
    
print ''

# only interested in the body part
for node in root.iter("body"):
    print node.tag, node.text

note None
to Tove
from Jani
heading None
body Don't forget me this weekend!
<cyfunction Comment at 0x1044c1110>  This is a multiline comment, which take a bit of care to parse 

body Don't forget me this weekend!
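
The filter argument of `iter()` is more flexible than a single tag name. As a sketch, passing `tag=etree.Element` skips comments, and several tag names can be given at once.

```python
from lxml import etree

root = etree.fromstring(
    b"<note><to>Tove</to><!-- c --><from>Jani</from><body>Hi</body></note>"
)

# tag=etree.Element keeps elements and skips comments
only_elements = [node.tag for node in root.iter(tag=etree.Element)]
print(only_elements)

# several tag names can be passed at once
some = [node.tag for node in root.iter("to", "body")]
print(some)
```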


Given a node in the tree, there are other useful functions for traversing from it. See the examples below.

Suppose the XML document looks like this:  `<root><a/><b/><c><d/><e/></c></root>`

In [12]:
r = etree.fromstring("<root><a/><b/><c><d/><e/></c></root>")
print etree.tostring(r, pretty_print=True)
a = r[0]
e = r[2][1]

# given a, traverse the siblings of a
for sibling in a.itersiblings():
    print sibling.tag
print ''

# given e, traverse the ancestors
for ancestor in e.iterancestors():
    print ancestor.tag
print ''

<root>
  <a/>
  <b/>
  <c>
    <d/>
    <e/>
  </c>
</root>

b
c

c
root
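
A few related iterators follow the same pattern. The sketch below, on the same small document, walks backwards through siblings with `preceding=True` and lists children and descendants.

```python
from lxml import etree

r = etree.fromstring(b"<root><a/><b/><c><d/><e/></c></root>")
c = r[2]

# siblings before c, starting right before c and going backwards
before_c = [s.tag for s in c.itersiblings(preceding=True)]
print(before_c)

# direct children of c only
children_of_c = [n.tag for n in c.iterchildren()]
print(children_of_c)

# all descendants of the root, excluding the root itself
descendants = [n.tag for n in r.iterdescendants()]
print(descendants)
```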



## Modify existing XML document

Configuration files are often stored in XML format, so the ability to modify existing XML documents is in high demand.

In this section, we modify our previous example. We need to learn how to create new nodes, add them to an existing XML document, and modify the attributes, text or tag of existing nodes.

To create a new node, we only need to create an `lxml.etree._Element` object with the `etree.Element()` factory.

Suppose its tag is "newEle", it has the attribute `attr="1"` and its text is "text!". It should then look like  `<newEle attr="1">text!</newEle>`

In [13]:
ele = etree.Element("newEle")
ele.set("attr", "1")
ele.text = "text!"
print etree.tostring(ele)

<newEle attr="1">text!</newEle>
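
Attributes can also be passed as keyword arguments when the element is created, and the tag, text and attributes of an existing node can be reassigned later. A minimal sketch; the name `renamed` is only illustrative.

```python
from lxml import etree

# attributes can be passed as keyword arguments at creation time
ele = etree.Element("newEle", attr="1")
ele.text = "text!"
created = etree.tostring(ele)
print(created)

# rename the tag and overwrite an attribute on the existing node
ele.tag = "renamed"
ele.set("attr", "2")
modified = etree.tostring(ele)
print(modified)
```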


There are two ways to add a node to an existing XML document.

First, as mentioned before, a node is a list-like object, so you can use `append()` to add a child at the end or `insert()` to add it at a specific position.

Second, lxml provides an `etree.SubElement()` function that creates a new node and appends it to an existing node in one step.

In [14]:
# method 1
print "Before add,", len(root)
root.append(ele)
print "After add,", len(root)

# method 2
print "Before add,", len(root)
etree.SubElement(root, "newEle2")
print "After add,", len(root)

# print the current children of root
# note that the comment has no tag name
for node in root:
    print node.tag

Before add, 5
After add, 6
Before add, 6
After add, 7
to
from
heading
body
<cyfunction Comment at 0x1044c1110>
newEle
newEle2
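
The other list-style operations work as well. The sketch below inserts at a position, shows that appending a node that is already in the tree moves it rather than copying it, removes a node, and finally writes the result to a file-like object (an in-memory buffer here, standing in for a real file).

```python
import io
from lxml import etree

root = etree.fromstring(b"<root><a/><b/></root>")

# insert a new node at index 1
root.insert(1, etree.Element("mid"))
after_insert = [child.tag for child in root]

# appending an element that is already in the tree moves it to the end
root.append(root[0])
after_move = [child.tag for child in root]

# remove the first child
root.remove(root[0])
after_remove = [child.tag for child in root]
print(after_remove)

# write the modified document, with an XML declaration, to a file-like object
buf = io.BytesIO()
etree.ElementTree(root).write(buf, xml_declaration=True, encoding="UTF-8")
print(buf.getvalue())
```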


## Parse HTML document

Parsing an HTML document is similar to parsing an XML document. You only need to pass an additional HTML parser to the same `parse()` function.

The HTML parser can be created by the following code.

    parser = etree.HTMLParser()

`parse()` again returns an `lxml.etree._ElementTree` object, so the same operations apply as for an XML document.

Note that if the HTML is poorly formatted, the `parse()` function does not throw an exception. Instead, it tries to recover, for example by filling in missing tags.

In [15]:
parser = etree.HTMLParser()
HTML='''<HTML>
<HEAD>
<TITLE>This is title.</TITLE>
</HEAD>
<BODY BGCOLOR="FFFFFF">
<a href="http://somegreatsite.com">Link Name</a>is a link to another nifty site
<H1>This is a Header</H1>
<H2>This is a Medium Header</H2>
</BODY>
</HTML>'''
tree = etree.parse(io.BytesIO(HTML), parser)
print type(tree)

# Here you can do everything like before, such as navigating, modifying nodes

print etree.tostring(tree, method='HTML', pretty_print=True)

<type 'lxml.etree._ElementTree'>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>This is title.</title>
</head>
<body bgcolor="FFFFFF">
<a href="http://somegreatsite.com">Link Name</a>is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
</body>
</html>
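
As one illustration of "everything like before", the sketch below extracts the link target and the header text from a shortened version of this page. It uses `etree.HTML()`, a shortcut that parses a string with the HTML parser and returns the root element. Note that the parser lowercases tag names, as the output above shows.

```python
from lxml import etree

html = b'''<HTML><BODY>
<a href="http://somegreatsite.com">Link Name</a>is a link to another nifty site
<H1>This is a Header</H1>
</BODY></HTML>'''

# parse with the HTML parser and get the root <html> element
root = etree.HTML(html)

# attribute values of all <a> tags
links = root.xpath("//a/@href")
print(links)

# text of all <h1> tags (lowercase after parsing)
headers = [h.text for h in root.iter("h1")]
print(headers)
```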



Here is an example of a poorly formatted HTML file. The opening tag `<H2>` is closed by a mismatched `</H3>`, and the parser corrects it to `</H2>`.

In [16]:
parser = etree.HTMLParser()
HTML='''<HTML>
<BODY>
<H2></H3>
</BODY>
</HTML>'''
tree = etree.parse(io.BytesIO(HTML), parser)
print etree.tostring(tree, method='HTML', pretty_print=True)

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<h2>
</h2>
</body>
</html>
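
To see the difference from strict XML parsing, the following sketch feeds the same broken fragment (a made-up example with unclosed `<p>` tags) to both parsers. The HTML parser repairs it, while the XML parser rejects it.

```python
from lxml import etree

broken = b"<p>one<p>two"

# the HTML parser adds <html> and <body> and closes both paragraphs
root = etree.HTML(broken)
texts = [p.text for p in root.findall(".//p")]
print(root.tag)
print(texts)

# the XML parser refuses the same input
try:
    etree.fromstring(broken)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False
print(xml_ok)
```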



## Simple Example : Extract CD Information

`http://www.xmlfiles.com/examples/cd_catalog.xml`

The above URL points to an XML file that contains information about several CDs. Write a function to extract the title, artist, country, company, price and year of each CD.

Specification:
* Your function should take the above URL and return a list of tuples; the field order is specified in the docstring below


In [17]:
import urllib
def extract_CD_Information(url):
    '''
    Argument:
        url (str): the XML file location
    Returns:
        CDs (list): a list of tuples, one per CD, each of the form
                     (title, artist, country, company, price, year)
    '''
    xml = urllib.urlopen(url)
    root = etree.parse(xml).getroot()
    CDs = []
    for CD in root:
        cd = [info.text for info in CD]
        CDs.append(tuple(cd))
    return CDs

extract_CD_Information('http://www.xmlfiles.com/examples/cd_catalog.xml')


[('Empire Burlesque', 'Bob Dylan', 'USA', 'Columbia', '10.90', '1985'),
 ('Hide your heart', 'Bonnie Tylor', 'UK', 'CBS Records', '9.90', '1988'),
 ('Greatest Hits', 'Dolly Parton', 'USA', 'RCA', '9.90', '1982'),
 ('Still got the blues', 'Gary More', 'UK', 'Virgin redords', '10.20', '1990'),
 ('Eros', 'Eros Ramazzotti', 'EU', 'BMG', '9.90', '1997'),
 ('One night only', 'Bee Gees', 'UK', 'Polydor', '10.90', '1998'),
 ('Sylvias Mother', 'Dr.Hook', 'UK', 'CBS', '8.10', '1973'),
 ('Maggie May', 'Rod Stewart', 'UK', 'Pickwick', '8.50', '1990'),
 ('Romanza', 'Andrea Bocelli', 'EU', 'Polydor', '10.80', '1996'),
 ('When a man loves a woman',
  'Percy Sledge',
  'USA',
  'Atlantic',
  '8.70',
  '1987'),
 ('Black angel', 'Savage Rose', 'EU', 'Mega', '10.90', '1995'),
 ('1999 Grammy Nominees', 'Many', 'USA', 'Grammy', '10.20', '1999'),
 ('For the good times', 'Kenny Rogers', 'UK', 'Mucik Master', '8.70', '1995'),
 ('Big Willie style', 'Will Smith', 'USA', 'Columbia', '9.90', '1997'),
 ('Tupelo Ho
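
The solution above relies on the child elements of every CD appearing in the same order. A more defensive sketch looks fields up by tag name instead. It runs offline on a small inline sample: the tag names (`CATALOG`, `CD`, `TITLE`, and so on) are an assumption about the file's structure, and the second record is invented to show how a missing field is handled.

```python
from lxml import etree

# hypothetical sample mirroring the assumed structure of cd_catalog.xml
SAMPLE = b"""<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>Columbia</COMPANY>
    <PRICE>10.90</PRICE>
    <YEAR>1985</YEAR>
  </CD>
  <CD>
    <TITLE>Untitled demo</TITLE>
    <YEAR>2000</YEAR>
  </CD>
</CATALOG>"""

FIELDS = ("TITLE", "ARTIST", "COUNTRY", "COMPANY", "PRICE", "YEAR")

def extract_cds(root, fields=FIELDS):
    """Return one tuple per CD; a missing field becomes None."""
    return [tuple(cd.findtext(name) for name in fields)
            for cd in root.findall("CD")]

cds = extract_cds(etree.fromstring(SAMPLE))
for cd in cds:
    print(cd)
```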

## Complex Example: W3C.xml
The file named `W3C.xml` is the XML specification published by the W3C. It contains a lot of information about the document itself as well as the XML specification. Use `lxml` to complete the tasks below.

Specification:
* In `findAuthors`, find all authors of this document and return a list of tuples `(author_name, affiliation, email)`. If a piece of information is not listed, use `None`.
* I am curious about the design goals of XML. Can you help me find all the design goals? (Please return a list of str)
* There are so many new terms in this document. Can you give me the definitions of all the terms? (Please return a dict)
* I forgot the form of a comment in an XML file, but this document includes it. Can you help me find it? (Please return a str)

Hint:
* Open `W3C.xml` first. It is a complex file, so observe its structure, tags and text carefully. The code is short, but finding the right tags takes some time.

In [18]:
w3c = open('W3C.xml', 'r')
root = etree.parse(w3c).getroot()

In [19]:
def findAuthors(root):
    res = []
    for author in root.find('.//authlist'):
        name = author.find('name')
        if name is not None:
            name = name.text
        aff = author.find('affiliation')
        if aff is not None:
            aff = aff.text
        email = author.find('email')
        if email is not None:
            email = email.text
        res.append((name, aff, email))
    return res

findAuthors(root)

[('Tim Bray', 'Textuality and Netscape', 'tbray@textuality.com'),
 ('Jean Paoli', 'Microsoft', 'jeanpa@microsoft.com'),
 ('C. M. Sperberg-McQueen', 'W3C', 'cmsmcq@w3.org'),
 ('Eve Maler', 'Sun Microsystems, Inc.', 'eve.maler@east.sun.com'),
 (u'Fran\xe7ois Yergeau', None, None)]

In [20]:
def findDesignGoals(root):
    res = []
    olist = root.xpath("//div2[@id='sec-origin-goals']/olist")
    for item in olist[0]:
        res.append(item[0].text)
    return res

findDesignGoals(root)

['XML shall be straightforwardly usable over the Internet.',
 'XML shall support a wide variety of applications.',
 'XML shall be compatible with SGML.',
 'It shall be easy to write programs which process XML documents.',
 'The number of optional features in XML is to be kept to the absolute\nminimum, ideally zero.',
 'XML documents should be human-legible and reasonably clear.',
 'The XML design should be prepared quickly.',
 'The design of XML shall be formal and concise.',
 'XML documents shall be easy to create.',
 'Terseness in XML markup is of minimal importance.']

In [21]:
def definition_of_terms(root):
    res = {}
    div2 = root.xpath("//div2[@id='sec-terminology']")
    for item in div2[0][1][-1]:
        term = item.find('def')[0][0]
        key = term.get('term')
        value = term.text.strip()
        res[key] = value
    return res

definition_of_terms(root)

{'At user option': 'Conforming software',
 'Error': 'A violation of the rules of this specification;\nresults are undefined. Unless otherwise specified, failure to observe a prescription of this specification indicated by one of the keywords',
 'Fatal Error': 'An error which a conforming',
 'For Compatibility': 'Marks\na sentence describing a feature of XML included solely to ensure\nthat XML remains compatible with SGML.',
 'For interoperability': 'Marks\na sentence describing a non-binding recommendation included to increase\nthe chances that XML documents can be processed by the existing installed\nbase of SGML processors which predate the WebSGML Adaptations Annex to ISO 8879.',
 'Validity constraint': 'A rule which applies to\nall',
 'Well-formedness constraint': 'A rule which applies\nto all',
 'match': '(Of strings or names:) Two strings\nor names being compared are identical. Characters with multiple possible\nrepresentations in ISO/IEC 10646 (e.g. characters with both precompo
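
Several definitions above are cut short (for example `'An error which a conforming'`) because `term.text` only holds the text before the first child element. One possible fix is to take the full string value of the node. The sketch below uses the XPath function `normalize-space()`, which also collapses line breaks; the element names `termdef` and `termref` and the sentence itself are hypothetical stand-ins, not taken from `W3C.xml`.

```python
from lxml import etree

# hypothetical definition with a nested element in the middle of the sentence
termdef = etree.fromstring(
    b'<termdef term="Fatal Error">An error which a conforming\n'
    b'<termref>XML processor</termref> must detect.</termdef>'
)

# .text stops at the first child element
short = termdef.text.strip()
print(short)

# normalize-space() uses the full string value and collapses whitespace
full = termdef.xpath("normalize-space()")
print(full)
```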

In [22]:
def find_comment_form(root):
    div2 = root.xpath("//div2[@id='sec-comments']")
    return div2[0].find('scrap').find('prod').find('rhs').xpath("string()").replace('\n', "")
    
find_comment_form(root)

"'<!--' ((Char - '-') | ('-'(Char - '-')))* '-->'"

## Further resources

The XML standard contains many details. If you are interested, see the `W3C.xml` file used in the example above.

Also, the [lxml official site](http://lxml.de/index.html) provides more tutorials and API documentation for your reference.

Finally, [W3Schools](http://www.w3schools.com/xml/) has a series of great XML tutorials covering basics such as tags and text as well as advanced topics such as validation.

## References
* https://en.wikipedia.org/wiki/XML
* https://en.wikipedia.org/wiki/HTML
* http://lxml.de/index.html