# XML example and exercise
****
+ study examples of accessing nodes in an XML tree structure  
+ work on an exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [4]:
# print names of all countries
for child in document_tree.getroot():
    print child.find('name').text

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [4]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print '* ' + element.find('name').text + ':',
    # iter() walks the whole subtree of each country;
    # it replaces getiterator(), which is deprecated since Python 2.7
    city_names = [city.find('name').text for city in element.iter('city')]
    print ', '.join(city_names)

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella
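
The two examples above rely on different lookups: `find` and `findall` match direct children only, while `iter` visits every descendant. A minimal sketch of that difference follows, using a made-up miniature document (the `Exampleland` data and its province nesting are assumptions for illustration, not taken from the Mondial files):

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document: one city sits directly under its country,
# the other is nested inside a province.
SAMPLE = """
<mondial>
  <country>
    <name>Exampleland</name>
    <city><name>Alpha</name></city>
    <province>
      <name>North</name>
      <city><name>Beta</name></city>
    </province>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)
country = root.find('country')

# findall('city') matches direct children of the country only
direct = [c.find('name').text for c in country.findall('city')]

# iter('city') walks the whole subtree, so the nested city is included
nested = [c.find('name').text for c in country.iter('city')]

# find() returns None when no direct child matches, so check before using .text
missing = country.find('infant_mortality')

print(direct)
print(nested)
print(missing is None)
```

The `None` check matters for the exercise below, where some elements may lack the child being looked up.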


****
## XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [55]:
document = ET.parse( './data/mondial_database.xml' )

1) 10 countries with the lowest infant mortality rates

In [37]:
countries = []
mortality = []
root = document.getroot()
for child in root.iter('country'):
    countries.append(child.find('name').text)
    mortality_node = child.find('infant_mortality')
    if mortality_node is not None:
        mortality.append(float(mortality_node.text))
    else: # missing value for mortality rates
        mortality.append(-1) # placeholder; sorts last in a descending sort

country_mortality = zip(countries, mortality)

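# NOTE: reverse=True sorts in descending order, so this prints the 10 HIGHEST rates;
# the lowest rates need an ascending sort with the -1 placeholders excluded
country_mortality = sorted(country_mortality, key=lambda tup: tup[1], reverse=True)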
for i, tup in enumerate(country_mortality[:10]):
    print "{} = {}, {}".format(i+1, tup[0], tup[1]) 

1 = Western Sahara, 145.82
2 = Afghanistan, 117.23
3 = Mali, 104.34
4 = Somalia, 100.14
5 = Central African Republic, 92.86
6 = Guinea-Bissau, 90.92
7 = Chad, 90.3
8 = Niger, 86.27
9 = Angola, 79.99
10 = Burkina Faso, 76.8
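
The output above lists the highest rates, while the task asks for the lowest. One possible way to get the lowest rates is to skip countries without an `infant_mortality` element and sort in ascending order. The sketch below uses a made-up inline sample (countries `A` to `D` and their rates are hypothetical) and a helper name, `lowest_infant_mortality`, that is not part of the original notebook:

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data; country B has no infant_mortality element.
SAMPLE = """
<mondial>
  <country><name>A</name><infant_mortality>12.5</infant_mortality></country>
  <country><name>B</name></country>
  <country><name>C</name><infant_mortality>3.1</infant_mortality></country>
  <country><name>D</name><infant_mortality>45.0</infant_mortality></country>
</mondial>
"""

def lowest_infant_mortality(root, n=10):
    """Return up to n (name, rate) pairs with the lowest rates, ascending."""
    rates = []
    for country in root.iter('country'):
        node = country.find('infant_mortality')
        if node is None:
            continue  # skip missing values instead of storing a placeholder
        rates.append((country.find('name').text, float(node.text)))
    return sorted(rates, key=lambda pair: pair[1])[:n]

result = lowest_infant_mortality(ET.fromstring(SAMPLE), n=2)
for rank, (name, rate) in enumerate(result, 1):
    print("{} = {}, {}".format(rank, name, rate))
```

Applied to the full data set, the same helper could be called as `lowest_infant_mortality(document.getroot())`.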


2) 10 cities with the largest population

In [65]:
cities = []
populations = []
root = document.getroot()
for elt_city in root.iter('city'):
    cityname = elt_city.find('name').text.encode('utf8')
    cities.append(cityname)
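    # use the largest value among all recorded population figures for this city
    max_pop = 0  # stays 0 when the city has no population element
    for elt_pop in elt_city.findall('population'):
        pop = int(elt_pop.text)
        if pop > max_pop:
            max_pop = pop
    populations.append(max_pop)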
    
city_population = zip(cities, populations)
city_population = sorted(city_population, key=lambda tup: tup[1], reverse=True)
for i, tup in enumerate(city_population[:10]):
    print "{} = {}, {}".format(i+1, tup[0], tup[1]) 

1 = Shanghai, 22315474
2 = Istanbul, 13710512
3 = Delhi, 12877470
4 = Mumbai, 12442373
5 = Moskva, 11979529
6 = Beijing, 11716620
7 = São Paulo, 11152344
8 = Tianjin, 11090314
9 = Guangzhou, 11071424
10 = Shenzhen, 10358381
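
The cell above ranks each city by the largest figure ever recorded for it, which can differ from its most recent figure. An alternative is to pick the latest measurement. This sketch assumes each `population` element carries a `year` attribute. The city `Shrinkville`, its numbers, and the helper name `latest_population` are hypothetical:

```python
from xml.etree import ElementTree as ET

# Hypothetical city whose population fell between two measurements.
SAMPLE = """
<city>
  <name>Shrinkville</name>
  <population year="1990">500000</population>
  <population year="2011">420000</population>
</city>
"""

def latest_population(city):
    """Return the population from the most recent year, or None if absent."""
    entries = city.findall('population')
    if not entries:
        return None
    # Assumption: the 'year' attribute is present; a missing one is treated as 0
    newest = max(entries, key=lambda e: int(e.get('year', 0)))
    return int(newest.text)

city = ET.fromstring(SAMPLE)
largest = max(int(p.text) for p in city.findall('population'))
latest = latest_population(city)

print("largest recorded: {}".format(largest))
print("latest recorded: {}".format(latest))
```

Which of the two figures is appropriate depends on how "largest population" is read in the task.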


3) 10 ethnic groups with the largest overall populations (sum of latest estimates over all countries)