# XML example and exercise
****
+ study examples of accessing nodes in an XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format
import numpy as np

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [4]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    # iter() walks the whole subtree, so cities nested at any depth are included
    city_names = [city.find('name').text for city in element.iter('city')]
    print('* ' + element.find('name').text + ': ' + ', '.join(city_names))

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella
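
The two loops above use different traversal calls: `iterfind` with a bare tag name matches only direct children, while `iter` walks every descendant. A minimal sketch of the difference on a made-up inline document (the `province` nesting and all names here are illustrative assumptions, not taken from the data file):

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document; the structure is assumed for illustration only.
SAMPLE = """
<mondial>
  <country>
    <name>Aland</name>
    <city><name>Port A</name></city>
    <province>
      <name>North</name>
      <city><name>Hilltown</name></city>
    </province>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE.strip())
country = root.find('country')

# iterfind with a bare tag only looks at direct children of the element
direct_cities = [c.find('name').text for c in country.iterfind('city')]

# iter walks the whole subtree, so the city nested in <province> is included
all_cities = [c.find('name').text for c in country.iter('city')]

print(direct_cities)
print(all_cities)
```

If only top-level cities are wanted, `iterfind('city')` is the narrower choice; `iter('city')` is the one to use when cities may sit inside intermediate elements.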


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [5]:
document = ET.parse( './data/mondial_database.xml' )

In [6]:
names, mortalities = [], []
for country in document.iterfind('country'):
    name = country.find('name').text
    mort = country.find('infant_mortality')
    names.append(name)
    # .text is a string; convert to float so the sort below is numeric,
    # not lexicographic (as text, '10.16' sorts ahead of '2.5')
    mortalities.append(float(mort.text) if mort is not None else np.nan)
morts = pd.Series(mortalities, index=names)
morts = morts.dropna().sort_values()
morts.head(10)
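
`element.text` always returns a string, and strings compare character by character, so rates sorted as text do not come out in numeric order. A minimal sketch with made-up values (not taken from the data set) that shows both orderings side by side:

```python
# Made-up rates, as they would come out of element.text (always strings)
raw = ['10.16', '2.5', '1.81', '10.2']

# lexicographic order: compares character by character
as_text = sorted(raw)

# numeric order: convert to float before sorting
as_numbers = sorted(float(v) for v in raw)

print(as_text)
print(as_numbers)
```

A `dtype: object` on a pandas Series of supposedly numeric values is a quick hint that the values are still strings.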

In [7]:
names, populations = [], []
for city in document.iter('city'):
    # reset for every city so that a city without a 2011 figure
    # does not inherit the value of the previous city
    pop = None
    for p in city.findall('population'):
        if p.get('year') == '2011':
            pop = int(p.text)
    if pop is not None:
        names.append(city.find('name').text)
        populations.append(pop)

pops = pd.Series(populations, index=names)
pops.sort_values(ascending=False).head(10)
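
Filtering on a single census year leaves out every city that has no figure for that year. One alternative is to take the most recent figure available, sketched here under the assumption that each `population` element carries a numeric `year` attribute, as the cell above expects. The helper name `latest_population` and the sample values are hypothetical.

```python
from xml.etree import ElementTree as ET

def latest_population(element):
    """Return the most recent <population> value of element as int, or None."""
    figures = element.findall('population')
    if not figures:
        return None
    # assumes every <population> has a numeric 'year' attribute
    newest = max(figures, key=lambda p: int(p.get('year')))
    return int(newest.text)

# Hypothetical sample; names and numbers are invented for illustration
sample = ET.fromstring(
    '<country>'
    '<city><name>Oldport</name>'
    '<population year="2001">1000</population>'
    '<population year="2011">1500</population></city>'
    '<city><name>Newtown</name>'
    '<population year="2009">900</population></city>'
    '<city><name>Nodata</name></city>'
    '</country>'
)

result = {c.find('name').text: latest_population(c) for c in sample.iter('city')}
print(result)
```

Cities for which the helper returns `None` can then be dropped before ranking. Note that this mixes figures from different years, which may or may not be acceptable for a given comparison.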

In [22]:
country_name, ethnicity, population = [], [], []
for country in document.iterfind('country'):
    name = country.find('name').text
    # reset for every country so that a country without a 2011 figure
    # is not weighted by the population of the previous country
    pop = None
    for p in country.findall('population'):
        if p.get('year') == '2011':
            pop = int(p.text)
    if pop is None:
        continue
    for e in country.findall('ethnicgroup'):
        country_name.append(name)
        ethnicity.append(e.text)
        # the percentage attribute is the group's share of the country's population
        population.append(float(e.attrib['percentage']) / 100 * pop)
data = {'country_names': country_name, 'ethnicgroups': ethnicity, 'population': population}
df = pd.DataFrame(data)
# sum only the numeric column; country names are not meaningful to add up
ethnic = df.groupby('ethnicgroups')[['population']].sum()
ethnic.sort_values('population', ascending=False).head(10)
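
Item 4 of the exercise asks for the largest element of a given type together with its country. The sketch below shows one way this could be approached on a made-up inline sample. The tag and attribute names used (`car_code` on `country`, a space-separated `country` attribute and a `length` child on `river`) are assumptions about the Mondial layout and should be checked against the actual file; the helper `largest` is hypothetical.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample. Tag and attribute names are assumed, not verified
# against data/mondial_database.xml.
SAMPLE = """<mondial>
  <country car_code="AA"><name>Aland</name></country>
  <country car_code="BB"><name>Bland</name></country>
  <river country="AA BB"><name>Long River</name><length>1200</length></river>
  <river country="BB"><name>Short River</name><length>300</length></river>
  <river country="AA"><name>Unmeasured River</name></river>
</mondial>"""

root = ET.fromstring(SAMPLE)

# map country codes to names so the codes on a river can be resolved
code_to_name = {c.get('car_code'): c.find('name').text
                for c in root.iterfind('country')}

def largest(tag, measure):
    """Return (name, [country names], value) for the `tag` element with the
    largest numeric `measure` child; elements lacking the measure are skipped."""
    candidates = [e for e in root.iter(tag)
                  if e.find(measure) is not None and e.find(measure).text]
    best = max(candidates, key=lambda e: float(e.find(measure).text))
    countries = [code_to_name.get(code, code)
                 for code in best.get('country', '').split()]
    return best.find('name').text, countries, float(best.find(measure).text)

print(largest('river', 'length'))
```

If the file follows the same pattern for the other element types, calls such as `largest('lake', 'area')` and `largest('airport', 'elevation')` would cover parts b) and c); those tag names are likewise assumptions.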
