# XML example and exercise
****
+ study examples of accessing nodes in an XML tree structure  
+ work on the exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET
import pandas as pd

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
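
As a rough illustration of the difference between the traversal calls used in this notebook, the sketch below parses a small inline document. The XML string is a hypothetical miniature, not taken from the Mondial data; only the `find`, `iterfind`, `iter` and `get` calls come from `xml.etree.ElementTree`.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document; not taken from the Mondial data set.
SAMPLE = """
<mondial>
  <country car_code="AL">
    <name>Albania</name>
    <province>
      <name>Tirana County</name>
      <city><name>Tirana</name></city>
    </province>
    <city><name>Durres</name></city>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)
country = root.find('country')  # first direct child with this tag

# find() returns the first matching direct child, here the country's own name
country_name = country.find('name').text

# iterfind() with a bare tag name only matches direct children
direct_cities = [c.find('name').text for c in country.iterfind('city')]

# iter() walks the whole subtree in document order, at any depth
all_cities = [c.find('name').text for c in country.iter('city')]

# attributes are read with get()
code = country.get('car_code')
```

Here `direct_cities` misses the city nested inside `province`, while `all_cities` includes it, which is why the city listing in this notebook walks the whole subtree of each country.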

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print child.find('name').text

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [4]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print '* ' + element.find('name').text + ':',
    cities_string = ''
    # iter() walks the whole subtree, so cities nested at any depth are included
    for subelement in element.iter('city'):
        cities_string += subelement.find('name').text + ', '
    print cities_string[:-2]

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella


****
## XML exercise

Using the data in 'data/mondial_database.xml' and the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find:

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [5]:
document = ET.parse( './data/mondial_database.xml' )

In [6]:
# Question 1
countries = []
rates = []
for elm in document.iterfind('country'):
    if elm.find('infant_mortality') is not None:
        rates.append(float(elm.find('infant_mortality').text))
        countries.append(elm.find('name').text)
count_and_rate = zip(countries,rates)

In [7]:
df = pd.DataFrame(count_and_rate,columns=['Country','Infant mortality rate'])
df.sort_values(['Infant mortality rate']).head(10)

index,Country,Infant mortality rate
36,Monaco,1.81
90,Japan,2.13
109,Bermuda,2.48
34,Norway,2.48
98,Singapore,2.53
35,Sweden,2.6
8,Czech Republic,2.63
72,Hong Kong,2.73
73,Macao,3.13
39,Iceland,3.15
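
The same "lowest n" selection can also be sketched without pandas, using `heapq.nsmallest` from the standard library. The inline XML and the function name `lowest_infant_mortality` are hypothetical and only mirror the element names used in Question 1.

```python
import heapq
from xml.etree import ElementTree as ET

# Hypothetical sample; country "B" deliberately has no infant_mortality element.
SAMPLE = """
<mondial>
  <country><name>A</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>B</name></country>
  <country><name>C</name><infant_mortality>1.8</infant_mortality></country>
  <country><name>D</name><infant_mortality>2.1</infant_mortality></country>
</mondial>
"""

def lowest_infant_mortality(root, n):
    """Return up to n (rate, country name) pairs, lowest rate first."""
    pairs = []
    for country in root.iterfind('country'):
        # findtext() returns None when the child element is absent
        rate = country.findtext('infant_mortality')
        if rate is not None:
            pairs.append((float(rate), country.findtext('name')))
    return heapq.nsmallest(n, pairs)

lowest = lowest_infant_mortality(ET.fromstring(SAMPLE), 2)
```

Putting the rate first in each tuple lets the default tuple ordering do the ranking, so no key function is needed.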


In [8]:
# Question 2
cities = []
population = []
for country in document.iterfind('country'):
    for city in country.iter('city'):
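        # keep the most recent estimate; both values are reset for every city so
        # that a city without any estimate does not reuse the previous city's value
        max_year = 0
        max_pop = 0
        for pop in city.iterfind('population'):
            year = int(pop.get('year'))
            if year > max_year:
                max_year = year
                max_pop = int(pop.text)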
        cities.append(city.find('name').text)
        population.append(max_pop)
table = zip(cities,population)

In [9]:
df = pd.DataFrame(table,columns=['City','Population'])
df.sort_values(['Population'],ascending=False).head(10)

index,City,Population
1341,Shanghai,22315474
771,Istanbul,13710512
1527,Mumbai,12442373
479,Moskva,11979529
1340,Beijing,11716620
2810,São Paulo,11152344
1342,Tianjin,11090314
1064,Guangzhou,11071424
1582,Delhi,11034555
1067,Shenzhen,10358381
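
The "latest estimate" selection in Question 2 could be factored into a small helper. This is a minimal sketch on hypothetical inline data; the name `latest_population` is not part of the notebook, and it assumes every `population` element carries a numeric `year` attribute.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: one city with three estimates, one with none.
SAMPLE = """
<country>
  <city>
    <name>Oldtown</name>
    <population year="1990">1000</population>
    <population year="2011">1500</population>
    <population year="2001">1200</population>
  </city>
  <city>
    <name>Nodata</name>
  </city>
</country>
"""

def latest_population(element):
    """Return the population with the greatest 'year' attribute, or None."""
    estimates = element.findall('population')
    if not estimates:
        return None
    latest = max(estimates, key=lambda pop: int(pop.get('year')))
    return int(latest.text)

root = ET.fromstring(SAMPLE)
latest = dict((city.findtext('name'), latest_population(city))
              for city in root.iter('city'))
```

Returning `None` for a city without estimates makes the missing data explicit instead of silently carrying a value over from another city.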


In [10]:
# Question 3
ethnic_groups = {}
for country in document.iterfind('country'):
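    # latest population estimate for this country; both values are reset per
    # country so that a country without any estimate contributes nothing
    max_year = 0
    max_pop = 0
    for pop in country.iterfind('population'):
        year = int(pop.get('year'))
        if year > max_year:
            max_year = year
            max_pop = int(pop.text)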
    for ethnic in country.iterfind('ethnicgroup'):
        group = ethnic.text
        frac = (float(ethnic.get('percentage'))*max_pop)/100
        if group in ethnic_groups:
            ethnic_groups[group] += frac
        else:
            ethnic_groups[group] = frac

In [11]:
df = pd.DataFrame.from_dict(ethnic_groups,orient='index')
df.columns = ['number']
df.sort_values(['number'],ascending=False).head(10)

Ethnic group,number
Han Chinese,1245059000.0
Indo-Aryan,871815600.0
European,494872200.0
African,318325100.0
Dravidian,302713700.0
Mestizo,157734400.0
Bengali,146776900.0
Russian,131857000.0
Japanese,126534200.0
Malay,121993600.0
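
The accumulation in Question 3 can be written a little more compactly with `collections.defaultdict`, which removes the explicit membership test. The sketch below runs on hypothetical inline data and assumes, as the notebook does, that each `ethnicgroup` element has a `percentage` attribute.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical sample: group "Alpha" appears in both countries.
SAMPLE = """
<mondial>
  <country>
    <name>X</name>
    <population year="2010">1000</population>
    <ethnicgroup percentage="60">Alpha</ethnicgroup>
    <ethnicgroup percentage="40">Beta</ethnicgroup>
  </country>
  <country>
    <name>Y</name>
    <population year="2000">400</population>
    <population year="2012">500</population>
    <ethnicgroup percentage="50">Alpha</ethnicgroup>
  </country>
</mondial>
"""

totals = defaultdict(float)  # missing groups start at 0.0
for country in ET.fromstring(SAMPLE).iterfind('country'):
    estimates = country.findall('population')
    if not estimates:
        continue  # no population estimate, nothing to scale by
    latest = max(estimates, key=lambda pop: int(pop.get('year')))
    population = int(latest.text)
    for group in country.iterfind('ethnicgroup'):
        totals[group.text] += float(group.get('percentage')) * population / 100

# largest groups first
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
```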


In [13]:
# Question 4
for tup in [('river','length','longest'),('lake','area','largest'),('airport','elevation','highest')]:
    max_measure = 0
    for element in document.iterfind(tup[0]):
        try:
            measure = float(element.find(tup[1]).text)
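        except (AttributeError, TypeError, ValueError):
            # the element has no such child, or the child is empty or non-numeric
            continue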
        if measure > max_measure:
            max_measure = measure
            element_name = element.find('name').text
            countries_id = element.get('country').split()
            countries_name = []
            for country_id in countries_id:
                for country in document.iterfind('country'):
                    if country.get('car_code') == country_id:
                        countries_name.append(country.find('name').text)
                        break
    print 'The ' + tup[2] + ' ' + tup[0] + ' is "' + element_name + '" in the countries of: ' + ', '.join(countries_name)

The longest river is "Amazonas" in the countries of: Colombia, Brazil, Peru
The largest lake is "Caspian Sea" in the countries of: Russia, Azerbaijan, Kazakhstan, Iran, Turkmenistan
The highest airport is "El Alto Intl" in the countries of: Bolivia
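
The country lookup in Question 4 rescans every country for each code. One possible alternative, sketched here on hypothetical inline data, builds a `car_code` to name dictionary once and picks the extreme element with `max`. The function name `largest` and the sample names are illustrative only.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample; the third river has no length and is skipped.
SAMPLE = """
<mondial>
  <country car_code="N"><name>Northland</name></country>
  <country car_code="S"><name>Southland</name></country>
  <river country="N S"><name>Long</name><length>900</length></river>
  <river country="S"><name>Short</name><length>120</length></river>
  <river country="N"><name>Unmeasured</name></river>
</mondial>
"""

def largest(root, tag, measure):
    """Return (name, [country names]) for the element with the greatest measure."""
    # one pass over the countries instead of one scan per country code
    names_by_code = dict((c.get('car_code'), c.findtext('name'))
                         for c in root.iterfind('country'))
    # findtext() gives None or '' when the measure is missing; both are skipped
    measured = [e for e in root.iterfind(tag) if e.findtext(measure)]
    best = max(measured, key=lambda e: float(e.findtext(measure)))
    countries = [names_by_code[code] for code in best.get('country').split()]
    return best.findtext('name'), countries

result = largest(ET.fromstring(SAMPLE), 'river', 'length')
```

Note that `max` raises `ValueError` when no element has the requested measure, so a caller working on real data may want to handle that case.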
