# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [104]:
import xml.etree.ElementTree as ET
import numpy as np
import pandas as pd

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
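
As a quick illustration of the traversal methods used in the cells that follow, the sketch below parses a tiny inline document and contrasts `find`, `findall`, and `iter`. The XML string and all names in it are made up for illustration and only loosely mirror the Mondial layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document, not taken from the Mondial data
sample_xml = """
<mondial>
  <country>
    <name>Alphaland</name>
    <city><name>Port A</name></city>
    <province>
      <name>North</name>
      <city><name>Port B</name></city>
    </province>
  </country>
  <country>
    <name>Betaland</name>
  </country>
</mondial>
"""
sample_root = ET.fromstring(sample_xml)

# find() returns the first matching direct child (or None if there is none)
first_country = sample_root.find('country')
first_name = first_country.find('name').text

# findall() returns matching direct children only
direct_cities = [c.find('name').text for c in first_country.findall('city')]

# iter() walks the whole subtree, so cities nested in provinces are included
all_cities = [c.find('name').text for c in first_country.iter('city')]

print(first_name, direct_cities, all_cities)
```

The difference between `findall` and `iter` matters whenever cities may be nested inside intermediate elements such as provinces.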

In [105]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [106]:
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [107]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    # collect the name of every city anywhere below this country
    city_names = [city.find('name').text for city in element.iter('city')]
    print('* ' + element.find('name').text + ': ' + ', '.join(city_names))

****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation
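
One possible way to approach the "largest of its kind" questions in item 4 is a small helper that scans all elements with a given tag and keeps the one with the largest numeric child. This is only a sketch: the helper name `largest_by`, the sample XML, and the assumption that each `river` has `name` and `length` children plus a `country` attribute holding country codes are illustrative and should be checked against the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the structure is assumed to resemble Mondial river entries
rivers_xml = """
<mondial>
  <river country="AA"><name>Short River</name><length>120</length></river>
  <river country="BB CC"><name>Long River</name><length>4500.5</length></river>
  <river country="DD"><name>Unmeasured River</name></river>
</mondial>
"""
rivers_root = ET.fromstring(rivers_xml)

def largest_by(root, tag, measure):
    """Return (name, country codes, value) for the element with the largest measure."""
    best = None
    for elem in root.iter(tag):
        value_text = elem.findtext(measure)
        if value_text is None:
            continue  # skip entries that have no measurement
        value = float(value_text)
        if best is None or value > best[2]:
            best = (elem.findtext('name'), elem.get('country'), value)
    return best

longest_river = largest_by(rivers_root, 'river', 'length')
print(longest_river)
```

Under the same assumptions, the helper could be reused for lakes and airports by changing the tag and measure names. Translating country codes into country names would be a separate lookup step.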

In [115]:
document = ET.parse( './data/mondial_database.xml' )
root = document.getroot()

In [116]:
countries = []
infmort = []
for child in document.findall('country'):
    countries.append(child.find('name').text)
    # record NaN when a country has no infant_mortality element
    node = child.find('infant_mortality')
    infmort.append(node.text if node is not None else str(np.nan))

In [117]:
for i in range(len(infmort)):
    infmort[i] = float(infmort[i])

In [173]:
df_infmort = pd.DataFrame({'Countries': countries, 'Mortality': infmort})
# DataFrame.sort was removed from pandas; sort_values is its replacement
lowest_infmort = df_infmort.sort_values('Mortality', ascending=True).head(10)
print('The 10 countries with the lowest infant mortality:')
lowest_infmort

The 10 countries with the lowest infant mortality:


Unnamed: 0,Countries,Mortality
38,Monaco,1.81
98,Japan,2.13
117,Bermuda,2.48
36,Norway,2.48
106,Singapore,2.53
37,Sweden,2.6
10,Czech Republic,2.63
78,Hong Kong,2.73
79,Macao,3.13
44,Iceland,3.15


In [176]:
populations = []
for country in document.findall('country'):
    # keep the last population node, taken here as the latest estimate
    curr_pop = country.findall('population')[-1]
    populations.append([country.find('name').text, float(curr_pop.text)])

df_pop = pd.DataFrame(populations, columns=['Country', 'Population'])
print("The 10 countries with the largest populations are:")
df_pop.sort_values('Population', ascending=False).head(10)

The 10 countries with the largest populations are:


Unnamed: 0,Country,Population
55,China,1360720000
67,India,1210854977
120,United States,318857056
88,Indonesia,252124458
176,Brazil,202768562
57,Pakistan,173149306
202,Nigeria,164294516
65,Bangladesh,149772364
23,Russia,143666931
98,Japan,127298000
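
The cell above keeps the last `population` node of each country, which gives the latest estimate only if the nodes appear in chronological order. A minimal sketch of a more explicit choice follows, assuming each `population` element carries a `year` attribute. That attribute, the helper name `latest_population`, and the sample element are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical country element with population nodes in non-chronological order
country_xml = """
<country>
  <name>Exampleland</name>
  <population year="2011">1200</population>
  <population year="1991">900</population>
  <population year="2001">1000</population>
</country>
"""
sample_country = ET.fromstring(country_xml)

def latest_population(element):
    """Return the population with the highest year attribute, or None if absent."""
    nodes = element.findall('population')
    if not nodes:
        return None
    # nodes without a year attribute are treated as year 0 (oldest)
    latest = max(nodes, key=lambda node: int(node.get('year', 0)))
    return float(latest.text)

print(latest_population(sample_country))
```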


### Find all ethnic group populations

In [180]:
ethnic = []
for country in document.findall('country'):
    # keep the last population node, taken here as the latest estimate
    curr_pop = None
    for node_pop in country.findall('population'):
        curr_pop = float(node_pop.text)
    for node_eth in country.findall('ethnicgroup'):
        # the percentage attribute is in percent, so convert it to a fraction
        share = float(node_eth.get('percentage')) / 100
        ethnic.append([node_eth.text, curr_pop * share])

In [181]:
df_ethnic = pd.DataFrame(ethnic, columns=['Ethnic Group', 'Population'])

In [204]:
df_ethnic.sort_values('Ethnic Group')

Unnamed: 0,Ethnic Group,Population
609,Acholi,1.394273e+06
579,Afar,1.433457e+06
563,Afar,2.919126e+05
323,African,2.222568e+05
169,African,1.282113e+06
600,African,5.694456e+06
598,African,1.570758e+06
589,African,1.863626e+06
354,African,1.431936e+05
346,African,4.097313e+07
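
The exercise asks for overall totals per ethnic group, so the per-country rows still need to be summed by group name. A minimal sketch of that aggregation with `groupby` is shown below on a small made-up frame with the same column names. The rows are illustrative only and are not Mondial values.

```python
import pandas as pd

# Illustrative rows only; not taken from the Mondial data
df_example = pd.DataFrame(
    [['Group A', 100.0], ['Group B', 50.0], ['Group A', 25.0], ['Group C', 60.0]],
    columns=['Ethnic Group', 'Population'],
)

# sum the per-country rows for each group, then rank the totals
totals = (
    df_example.groupby('Ethnic Group')['Population']
    .sum()
    .sort_values(ascending=False)
)
top_groups = totals.head(10)
print(top_groups)
```

Applied to `df_ethnic`, the same chain would presumably yield the ten largest groups, provided that group names are spelled consistently across countries.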
