# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [2]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html

In [3]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [4]:
# print names of all countries
for child in document_tree.getroot():
    print child.find('name').text

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [5]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print '* ' + element.find('name').text + ':',
    cities_string = ''
    # iter() replaces getiterator(), which is deprecated since Python 2.7
    for subelement in element.iter('city'):
        cities_string += subelement.find('name').text + ', '
    print cities_string[:-2]

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella
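
The loop above builds each line by string concatenation and then trims the trailing separator. As a rough sketch of the same traversal, the snippet below collects the names into a list and uses `str.join` instead. The inline `SAMPLE_XML` and the helper name `cities_by_country` are made up for illustration and only mimic the shape of the Mondial file; the code is written to run under both Python 2.7 and Python 3.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document; the real file has many more elements.
SAMPLE_XML = """
<mondial>
  <country>
    <name>Macedonia</name>
    <city><name>Skopje</name></city>
    <city><name>Kumanovo</name></city>
  </country>
  <country>
    <name>Montenegro</name>
    <city><name>Podgorica</name></city>
  </country>
</mondial>
"""

def cities_by_country(root):
    """Return a list of (country name, [city names]) pairs."""
    result = []
    for country in root.iterfind('country'):
        # iter() walks all descendants, so cities nested deeper than
        # one level below the country are picked up as well
        cities = [city.find('name').text for city in country.iter('city')]
        result.append((country.find('name').text, cities))
    return result

sample_root = ET.fromstring(SAMPLE_XML)
for name, cities in cities_by_country(sample_root):
    # join() inserts the separator only between items, so no trimming is needed
    print('* ' + name + ': ' + ', '.join(cities))
```

Returning the pairs from a function, rather than printing inside the loop, keeps the traversal reusable for the exercise questions further down.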


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)

4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

*************

## KLA code from here on in ##

*************

In [10]:
import pandas as pd
document = ET.parse( './data/mondial_database.xml' )
root = document.getroot()
root
    

<Element 'mondial' at 0x108ea63d0>

In [None]:
# print the tag and attributes of every child and grandchild
# for the first 10 countries only
for country in root.findall('country')[:10]:
    for node in country:
        print node.tag, node.attrib
        for newnode in node:
            print "     ", newnode.tag, newnode.attrib
    print "\n", "\n"

testlist = []

for country in root.findall('country'):
    try:
        testlist.append(country.find('infant_mortality').text)
    except AttributeError:
        # find() returned None: this country has no infant_mortality child
        testlist.append(None)
    
print len(testlist)

-------
### it took a little while to figure that one out ###


While remembering the right syntax for walking down the levels of an XML tree was slow on its own (that will no doubt come with practice), I didn't realize the loop would fail as soon as a `country` element had no `infant_mortality` child (some countries don't have this data listed). In that case `find` returns `None`, and asking `None` for `.text` raises an error. Wrapping the lookup in `try`/`except` eventually solved the problem and gave me a list of the correct length, so I carried that approach into building the dictionary below.
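
For comparison, here is a minimal sketch of handling the missing child without `try`/`except`, using `Element.findtext`, which returns `None` when the child is absent. The inline `SAMPLE_XML`, the country names `A` and `B`, and the helper `infant_mortality_or_none` are invented for illustration; it assumes the element, when present, holds a parseable number.

```python
from xml.etree import ElementTree as ET

# Hypothetical two-country sample: one with the value, one without.
SAMPLE_XML = """
<mondial>
  <country><name>A</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>B</name></country>
</mondial>
"""

def infant_mortality_or_none(country):
    """Return the infant mortality rate as a float, or None if it is missing."""
    # findtext() gives None when there is no infant_mortality child
    text = country.findtext('infant_mortality')
    return float(text) if text is not None else None

sample_root = ET.fromstring(SAMPLE_XML)
rates = [infant_mortality_or_none(c) for c in sample_root.findall('country')]
```

An explicit `None` check only covers the absent-element case; an element that exists but is empty or non-numeric would still raise `ValueError` in `float`, which may or may not be what you want.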

------

In [62]:
countryname = []
infmr = []

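for country in root.findall('country'):
    countryname.append(country.find('name').text)
    try:
        infmr.append(float(country.find('infant_mortality').text))
    except (AttributeError, TypeError, ValueError):
        # no infant_mortality child, or its text is empty / not a number
        infmr.append(None)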

countryinfmr = dict(zip(countryname, infmr))

df = pd.DataFrame(countryinfmr.items(), columns = ['country', 'infmr'])

df = df.sort_values('infmr')

print df.head(10)

            country  infmr
37           Monaco   1.81
225           Japan   2.13
77           Norway   2.48
70          Bermuda   2.48
82        Singapore   2.53
115          Sweden   2.60
61   Czech Republic   2.63
154       Hong Kong   2.73
57            Macao   3.13
203         Iceland   3.15


-------

*Voilà, the ten lowest infant mortality rates recorded in `mondial_database.xml`. I'm kind of surprised by the Czech Republic.*

-------

It seems like the next exercise should be roughly the same -- just a couple layers deeper in iterating through children -- but looking at the raw XML shows that

1. many cities have multiple population values from different years listed,
2. not all cities have the same years listed, and
3. there are a lot of missing values. (naturally.)
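
One possible way to cope with all three points is to reduce each city to its most recent population entry and to return `None` when there is none. The sketch below assumes each `<population>` element carries a numeric `year` attribute and an integer as its text; the inline `SAMPLE_XML`, the city names, and the helper `latest_population` are hypothetical.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: multiple years, differing years, and a missing value.
SAMPLE_XML = """
<country>
  <city>
    <name>Alpha</name>
    <population year="1991">1000</population>
    <population year="2011">1500</population>
  </city>
  <city>
    <name>Beta</name>
    <population year="2001">700</population>
  </city>
  <city>
    <name>Gamma</name>
  </city>
</country>
"""

def latest_population(city):
    """Return (year, population) for the most recent entry, or None if there is none."""
    entries = [
        (int(p.get('year')), int(p.text))
        for p in city.findall('population')
        # skip entries without a year attribute or without a value
        if p.get('year') is not None and p.text
    ]
    # tuples compare by their first item, so max() picks the latest year
    return max(entries) if entries else None

sample_root = ET.fromstring(SAMPLE_XML)
latest = dict(
    (city.find('name').text, latest_population(city))
    for city in sample_root.iter('city')
)
```

Keeping the year alongside the value makes it easy to check afterwards how stale each figure is.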

In order to get useful information, then, it'll be necessary to make sure that the 