# XML

XML stands for 'extensible markup language'. XML files can be generic or follow a specific document type. For example, GraphML is really just XML with a specific schema that is used to describe social network graphs.

Like HTML, XML is a markup language that uses the less-than ```<``` and greater-than ```>``` signs to enclose element tags. Within the text between tags, a few special characters (notably ```<``` and ```&```) must be escaped.
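As a minimal sketch of what escaping looks like in practice, Python's standard library includes `xml.sax.saxutils.escape`, which replaces `&`, `<`, and `>` with their entity references. The sample string and the `note` element name are made up for illustration.

```python
from xml.sax.saxutils import escape, unescape

raw = "Profit & loss: 3 < 5 > 2"   # hypothetical text to store in an element
safe = escape(raw)                  # & becomes &amp;, < becomes &lt;, > becomes &gt;
element = "<note>{}</note>".format(safe)

print(element)

# Unescaping recovers the original string
print(unescape(safe) == raw)
```

Most XML libraries do this escaping for you when they write a document, so it mainly matters when you assemble XML by hand.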

~~~ xml 
<start> 
    <middle>
        <end1>   Here is an element! </end1>
        <end2>   Here is an element! </end2>
    </middle>
</start>
~~~

The elements form an "element tree". Above, ```start``` is the root node, ```middle``` is a child of ```start```, and ```end1``` is a child of ```middle```. ```end1``` and ```end2``` are siblings.
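To make these relationships concrete, here is a small sketch that parses the same snippet with ElementTree from Python's standard library. The variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

doc = """
<start>
    <middle>
        <end1>   Here is an element! </end1>
        <end2>   Here is an element! </end2>
    </middle>
</start>
"""

root = ET.fromstring(doc.strip())   # the root node, <start>
middle = root[0]                    # the first (and only) child of the root
siblings = [child.tag for child in middle]   # the children of <middle>

print(root.tag)
print(middle.tag)
print(siblings)
print(middle[0].text.strip())       # text of <end1> without the padding
```

Indexing an element gives its children in document order, which is why `root[0]` is `middle` and `middle[0]` is `end1`.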

XML is self-documenting, which means that you can insert details about the elements into the document itself. For example, open up the included Canada.xml file in a text editor.

~~~ xml 
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>

~~~ 

Here you can see the link to the schema. This is a file with a whole ton of details I've never really needed in research. These details are, however, important for specifying what is standard for that type of XML (in this case the MediaWiki export). That's good because it means that well-formatted XML should be reliable for its type and easy to manage.
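If you ever do need those details, the following sketch shows one way to read them with ElementTree from the standard library. The opening tag is copied from the export above, but the fragment is closed off by hand so that it parses on its own; it is not the full file. ElementTree writes namespaced names in a `{uri}name` form, which is why the lookups look a little unusual.

```python
import xml.etree.ElementTree as ET

# Opening tag copied from the export; closing tags added by hand
# so that this fragment is well formed on its own.
fragment = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ '
    'http://www.mediawiki.org/xml/export-0.10.xsd" '
    'version="0.10" xml:lang="en">'
    '<siteinfo><sitename>Wikipedia</sitename><dbname>enwiki</dbname></siteinfo>'
    '</mediawiki>'
)

root = ET.fromstring(fragment)

XSI = "http://www.w3.org/2001/XMLSchema-instance"
# schemaLocation holds a pair: the namespace, then where its schema lives
namespace, schema_url = root.get("{%s}schemaLocation" % XSI).split()

print(root.tag)
print(schema_url)

# Child elements inherit the default namespace, so lookups must include it
ns = {"mw": namespace}
print(root.find("mw:siteinfo/mw:sitename", ns).text)
```

The `mw` prefix is an arbitrary label chosen here; only the namespace URI it maps to has to match the document.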

Most of the time, we will not be so concerned with the top of an XML document. Rather, we will simply want to navigate the element tree to get to the element(s) that concern us. Sometimes a parser will already have been written that takes the XML and loads it into a data structure for us. This is the case with GraphML, the common format for social network data. To load GraphML files you can use either the networkx or igraph packages. I mention these more in Chapter XX on networks. Newer Excel files (with an extension ending in x, as in .xlsx) are also built on XML: each one is a zipped bundle of XML files. Later in this chapter we will use pandas to parse those directly.
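To give a rough sense of what a GraphML parser does on our behalf, here is a sketch that pulls the nodes and edges out of a tiny hand-written GraphML document using only the standard library. The three-person network is invented for illustration. In practice a function such as networkx's `read_graphml` handles this, along with attributes and data types, in a single call.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written GraphML document (hypothetical data)
graphml = """
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="undirected">
    <node id="alice"/>
    <node id="bob"/>
    <node id="carol"/>
    <edge source="alice" target="bob"/>
    <edge source="bob" target="carol"/>
  </graph>
</graphml>
"""

# GraphML elements live in this namespace, so searches must name it
ns = {"g": "http://graphml.graphdrawing.org/xmlns"}
root = ET.fromstring(graphml.strip())

# Collect node ids and (source, target) pairs in document order
nodes = [n.get("id") for n in root.findall("g:graph/g:node", ns)]
edges = [(e.get("source"), e.get("target"))
         for e in root.findall("g:graph/g:edge", ns)]

print(nodes)
print(edges)
```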

In fact, there is a nice Python library aptly called ```wikipedia``` that can make navigating the XML structure easy and allow for direct querying of all kinds of elements. We are not using that library here, however: although our data comes from Wikipedia, the point of this section is to illustrate navigating XML.

In the script below, we will load the XML as a string. Then we will use Beautiful Soup to navigate the document and return parts of the XML.

In [4]:
# Load the XML file as a string
import bs4, os

# Build the path portably rather than formatting os.sep into a string
path = os.path.join("..", "Data", "Canada.xml")

# The context manager closes the file for us. Stating the encoding avoids
# decoding errors on systems whose default encoding is not UTF-8.
with open(path, "r", encoding="utf-8") as infile:
    wikitext = infile.read()

# The "lxml" parser treats the document as HTML, so it wraps everything
# in <html><body> and lowercases the tag names.
soup = bs4.BeautifulSoup(wikitext, "lxml")

print(soup.mediawiki.page.revision.id)

<id>864119742</id>


## Navigating XML

Navigating XML involves moving up and down or sideways along the element tree. In the case above I already knew where to go for the text I wanted (```mediawiki.page.revision.id```). In general, however, navigating to the right element is a bit tedious. Some people prefer to use Python's built-in ElementTree package. In either case, what you will be doing with your code is navigating a tree structure. Trees tend to use the following nomenclature, which borrows from both natural trees and family trees:
- **Root**: The base or primary node is called the root node.
- **Parent and child**: A parent is a node that has nodes nested within it, like ```id``` nested within ```revision``` above. In that case, ```revision``` is the parent node and ```id``` is the child node.
- **Sibling**: Two child nodes with the same parent, like how ```sitename``` and ```dbname``` are both children of ```siteinfo```.
- **Leaf**: A term sometimes used to indicate a child node with no children of its own.
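For those who prefer ElementTree, the terms above can be sketched as follows. The document here is a simplified, hand-written stand-in for the export (the namespace and most elements are left out), so treat it as an illustration of the vocabulary rather than of the real file.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written stand-in for the export (namespace omitted)
doc = """
<mediawiki>
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
  </siteinfo>
  <page>
    <title>Canada</title>
    <revision>
      <id>864119742</id>
    </revision>
  </page>
</mediawiki>
"""

root = ET.fromstring(doc.strip())

# Child: walk down with a path relative to the root
rev_id = root.find("page/revision/id")

# Siblings: the children of one parent
siblings = [child.tag for child in root.find("siteinfo")]

# Parent: ElementTree elements keep no link upward, so build a lookup
parent_of = {child: parent for parent in root.iter() for child in parent}

# Leaves: elements with no children of their own
leaves = [el.tag for el in root.iter() if len(el) == 0]

print(root.tag, rev_id.text, parent_of[rev_id].tag)
print(siblings)
print(leaves)
```

Unlike BeautifulSoup, ElementTree does not let you jump straight to a deeply nested tag by attribute access; you spell out the path, which is more verbose but also more explicit.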

Below I use BeautifulSoup to navigate through the tags so that I can get to the data I want. Normally one would explore like this and then clean up so that only the proper working code remains. Notice that even though ```mediawiki``` is actually at:
~~~ html
<html>
    <body>
        <mediawiki>
            ...
~~~ 

We do not need the full path to access it, similar to how it was done with HTML. BeautifulSoup will find ```mediawiki``` directly if we ask for ```soup.mediawiki```. Note, however, that ```soup.mediawiki.text``` is not just the text of the Canada page. Instead it is all of the text nested anywhere under the ```mediawiki``` node, including the site information. To get the text of the page from this schema, we would go to ```soup.mediawiki.page``` and get the text from there.

In [14]:
# Uncomment one line at a time to explore each level of the tree:
# for i in soup.children: print(i.name)

# for i in soup.html.children: print(i.name)

# for i in soup.html.body.children: print(i.name)

# for i in soup.mediawiki.children: print(i.name)

# for i in soup.mediawiki.page.children: print(i.name)

# It turns out that soup.page alone is enough to reach the page element.
y = soup.page.text

print(soup.page.text == soup.html.body.mediawiki.page.text)
print(y[:100])

True

Canada
0
5042916

864119742
864118763
2018-10-15T06:33:55Z

Moxy
8729451

/* Government and politic


At the moment there is not much to do with this text. We can split it up or count the number of characters. Perhaps you could compare the length of the text for Canada to that of other countries. In the exercise I show how to download this data directly from the special export page. But counting characters will only get us so far in answering questions. In the next chapter we will start parsing this text and adding it to DataFrames. Then in later chapters we will look at including even more data in our DataFrames by comparing data from different topics, sources, accounts, or time periods. First, however, we should look at a couple more data structures. The next one, CSV, is one of the most common formats around.