# Python Element Tree

In natural language processing, you will work with XML regularly. Luckily, Python provides powerful tools for processing and manipulating XML documents. In this exercise we look at the `xml.etree.ElementTree` module. As with many of the previous exercises, we will only look at a subset of the features available. Please read the [full documentation](https://docs.python.org/2/library/xml.etree.elementtree.html) for a deeper understanding.

## Parsing a document

Parsing is very easy. All you need is the filename of the document you want to read.

We will parse a scientific paper from the [ART corpus](https://www.aber.ac.uk/en/cs/research/cb/projects/art/art-corpus/) that has been annotated by hand and split into a series of sentences. Each sentence is wrapped in an `<s>` element. If you want to examine the paper, you can view it [here](assets/b414459g_mode2.xml).


In [None]:
import xml.etree.ElementTree as ET

# Open and parse the paper
tree = ET.parse("assets/b414459g_mode2.xml")
root = tree.getroot()

# Count every <s> element, at any depth in the tree
sentences = 0
for sentEl in root.iter("s"):
    sentences += 1

print("There are {} sentences in this document".format(sentences))
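
If you do not have the corpus file to hand, the same calls can be tried on XML held in a string. The following is a minimal sketch: `SAMPLE_XML` and its element layout are made up for illustration and are not the real structure of the ART corpus file. `ET.fromstring()` returns the root element directly, so there is no separate `getroot()` step.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the real corpus file has many more elements
SAMPLE_XML = """
<PAPER>
  <TITLE><s sid="1">A title sentence.</s></TITLE>
  <BODY>
    <s sid="2">First body sentence.</s>
    <s sid="3">Second body sentence.</s>
  </BODY>
</PAPER>
"""

# fromstring() parses text already in memory and returns the root element
sample_root = ET.fromstring(SAMPLE_XML.strip())

# iter("s") walks the whole tree, so <s> elements are found at any depth
sentence_count = sum(1 for _ in sample_root.iter("s"))
print("There are {} sentences in this sample".format(sentence_count))
```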

## Getting text

### text attribute
Text inside an element is available through the `element.text` attribute. Note that `element.text` only holds the text that precedes the element's first child. Let's find out what some of the sentences in our paper say.

As you may have noticed when looking at the [paper](assets/b414459g_mode2.xml) earlier, the actual text of each sentence is wrapped inside an element called `<annotationART>`.

Don't worry about what this means for now. However, we have to look up the `annotationART` child of each sentence in order to get hold of the text inside.

In [None]:
# Let's work with the first 5 sentences for demo purposes
for sent in list(root.iter("s"))[0:5]:
    annoArt = sent.find('annotationART')
    print("-----------------")
    print(annoArt.text)
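
One thing to be aware of is that `text` stops at the first child element. The sketch below uses a made-up sentence (the `<b>` child and the `type` value are assumptions, not taken from the corpus) to show where the remaining text ends up: inside the child's own `text`, and in the child's `tail`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence with an inline child element (not from the corpus)
sample_sent = ET.fromstring(
    '<s sid="1"><annotationART type="Bac">'
    'Water is <b>very</b> polar.'
    '</annotationART></s>'
)
anno = sample_sent.find("annotationART")
bold = anno.find("b")

print(repr(anno.text))  # text before the first child
print(repr(bold.text))  # text inside the child
print(repr(bold.tail))  # text between the child's end tag and the next tag
```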

### Itertext

There is in fact an easier way to get all the text inside an element and its descendants. The `itertext()` method walks the element and all of its child nodes in document order and yields every piece of text it finds, which you can then join into a single string.

In [None]:
# Let's work with the first 5 sentences for demo purposes
for sent in list(root.iter("s"))[0:5]:
    print("".join(sent.itertext()))
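
To see exactly what `itertext()` yields, it can help to look at the individual pieces before joining them. This is a small sketch on a made-up sentence; the `<b>` child and the `type` value are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence with an inline child element (not from the corpus)
sample_sent = ET.fromstring(
    '<s sid="1"><annotationART type="Bac">'
    'Water is <b>very</b> polar.'
    '</annotationART></s>'
)

# itertext() yields each text fragment in document order
pieces = list(sample_sent.itertext())
print(pieces)

# Joining the fragments gives the full sentence text
full_text = "".join(pieces)
print(full_text)
```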

## Element attributes

Attributes of an element are stored as a dict in `element.attrib`. You can read a single attribute using `element.get()`, as in the following example, where we print the IDs of the first 5 sentences:

In [None]:
# Let's work with the first 5 sentences for demo purposes
for sent in list(root.iter("s"))[0:5]:
    print(sent.get("sid"))
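
A short sketch of how `get()` behaves when an attribute is missing may be useful here. The element below is made up, and the `lang` attribute is a hypothetical name chosen because the element does not have it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence element (not from the corpus)
sample_sent = ET.fromstring(
    '<s sid="7"><annotationART type="Hyp">Some text.</annotationART></s>'
)

# attrib is a plain dict of attribute names to string values
attributes = sample_sent.attrib
print(attributes)

# get() returns the value as a string, or None if the attribute is absent
sid = sample_sent.get("sid")
missing = sample_sent.get("lang")

# A second argument supplies a default instead of None
fallback = sample_sent.get("lang", "unknown")
print(sid, missing, fallback)
```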

## Putting it all together

Remember `annotationART` from earlier? That element has a `type` attribute that holds one of 11 possible values, which tell you what role the sentence plays in a scientific paper. For example, the sentence could be background information, telling you more about previous work. It could be a hypothesis, explaining what result is expected after the experiment is carried out. For more information on this annotation scheme, CoreSC, you can read [this paper by Liakata et al. 2010](http://www.lrec-conf.org/proceedings/lrec2010/pdf/644_Paper.pdf).

Let's extract a list of sentences and their respective CoreSC designations using ElementTree.

In [None]:
sents = []

for sent in root.iter("s"):
    annoArt = sent.find('annotationART')
    # Use "sid" rather than "id" to avoid shadowing the built-in id()
    sid = sent.get("sid")
    text = "".join(annoArt.itertext())
    coreSC = annoArt.get("type")
    sents.append((sid, text, coreSC))

# Let's work with the first 5 sentences for demo purposes
for sent_tuple in sents[0:5]:
    print("Sentence {0}. Type: {2}. Text: {1}\n".format(*sent_tuple))
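
Once each sentence is paired with its type, a natural next step is to count how often each type occurs. The sketch below does this on a made-up miniature paper: `SAMPLE_XML` and the `type` values in it are placeholders rather than real corpus data. It also skips sentences that have no `annotationART` child, since `find()` returns `None` in that case.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical miniature paper; the "type" values here are placeholders
SAMPLE_XML = """
<PAPER>
  <s sid="1"><annotationART type="Bac">Earlier work studied X.</annotationART></s>
  <s sid="2"><annotationART type="Hyp">We expect Y to increase.</annotationART></s>
  <s sid="3"><annotationART type="Bac">X is widely used.</annotationART></s>
  <s sid="4">A sentence without an annotation.</s>
</PAPER>
"""
sample_root = ET.fromstring(SAMPLE_XML.strip())

type_counts = Counter()
for sample_sent in sample_root.iter("s"):
    anno = sample_sent.find("annotationART")
    if anno is None:
        # find() returns None when there is no matching child, so skip it
        continue
    type_counts[anno.get("type")] += 1

# Print the types from most to least frequent
for core_sc, count in type_counts.most_common():
    print("{}: {}".format(core_sc, count))
```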

## Conclusion

ElementTree presents you with a "Swiss army knife" of tools for working with XML in Python. In this example we only discussed reading existing documents, but the library can also be used to generate new XML documents. To find out how to do this, read the [full documentation](https://docs.python.org/2/library/xml.etree.elementtree.html).
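
As a brief taste of the writing side, here is a minimal sketch that builds a tiny document in memory and serialises it to a string. The element names mirror the ones used in this exercise, but the content and the `PAPER` root are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Build a small tree: a root, one sentence, and its annotation child
new_root = ET.Element("PAPER")
new_sent = ET.SubElement(new_root, "s", {"sid": "1"})
new_anno = ET.SubElement(new_sent, "annotationART", {"type": "Bac"})
new_anno.text = "Hello."

# encoding="unicode" makes tostring() return a str rather than bytes
xml_string = ET.tostring(new_root, encoding="unicode")
print(xml_string)
```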

Next, we look at [matplotlib](matplotlib.ipynb).