# XML example and exercise
****
+ study examples of accessing nodes in an XML tree structure
+ work on the exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print (child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra
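The loop above calls `.text` on the result of `find('name')`, which works here because every child of the root has a `<name>`. A minimal sketch of a more defensive variant follows, using a small made-up document (`SAMPLE_XML` and `child_names` are hypothetical and not part of the Mondial data or the notebook):

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document, not taken from the Mondial data set.
SAMPLE_XML = """
<mondial>
  <country car_code="AL">
    <name>Albania</name>
    <city><name>Tirana</name></city>
  </country>
  <country car_code="AND">
    <name>Andorra</name>
  </country>
  <organization>
    <abbrev>EU</abbrev>
  </organization>
</mondial>
"""

root = ET.fromstring(SAMPLE_XML)


def child_names(parent):
    """Return the <name> text of each direct child, skipping children without one."""
    names = []
    for child in parent:
        name = child.find('name')  # find() returns None when there is no match
        if name is not None:
            names.append(name.text)
    return names


country_names = child_names(root)
print(country_names)
```

Because `find('name')` only looks at direct children, the city name nested inside the first country is not picked up, and the `<organization>` element without a `<name>` is skipped instead of raising an `AttributeError`.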


In [4]:
# print the name of each country followed by its cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':')
    # iter() replaces getiterator(), which was removed in Python 3.9
    city_names = [city.find('name').text for city in element.iter('city')]
    print(', '.join(city_names))

* Albania:
Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:
Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:
Skopje, Kumanovo
* Serbia:
Beograd, Novi Sad, Niš
* Montenegro:
Podgorica
* Kosovo:
Prishtine
* Andorra:
Andorra la Vella
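The city listing relies on visiting every `<city>` below a country, at any depth. As a rough illustration of how that differs from looking only at direct children, here is a small sketch on a made-up country element (the element names mirror the examples above, but `SAMPLE_XML` and its contents are hypothetical):

```python
from xml.etree import ElementTree as ET

# Hypothetical country with one top-level city and one city inside a province.
SAMPLE_XML = """
<country>
  <name>Exampleland</name>
  <city><name>Capital City</name></city>
  <province>
    <name>North</name>
    <city><name>Northtown</name></city>
  </province>
</country>
"""

country = ET.fromstring(SAMPLE_XML)

# findall('city') matches direct children only.
direct = [city.find('name').text for city in country.findall('city')]

# iter('city') walks the whole subtree in document order.
nested = [city.find('name').text for city in country.iter('city')]

print(direct)
print(nested)
```

Which of the two is appropriate depends on whether nested elements should count. For listing all cities of a country the subtree walk is the natural choice.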


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [5]:
document = ET.parse( './data/mondial_database.xml' )

In [6]:
type(document)
document

<xml.etree.ElementTree.ElementTree at 0xb66cf47eb8>

# 10 countries with the lowest infant mortality rates

In [7]:
import pandas as pd

countries = []
infant_mortality = []
for element in document.iterfind('country'):
    # skip countries that have no <infant_mortality> element
    rate = element.find('infant_mortality')
    if rate is not None:
        countries.append(element.find('name').text)
        infant_mortality.append(float(rate.text))
a = pd.DataFrame({'countries': countries, 'infant_mortality': infant_mortality})
a = a.set_index('countries')
b = a['infant_mortality'].nsmallest(10).index.values
b

array(['Monaco', 'Japan', 'Norway', 'Bermuda', 'Singapore', 'Sweden',
       'Czech Republic', 'Hong Kong', 'Macao', 'Iceland'], dtype=object)
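The same extract-then-rank pattern can be written with the standard library alone. The sketch below swaps pandas for `heapq.nsmallest` and runs on a made-up document. The sample data and the variable names are hypothetical, so treat it as an illustration rather than a replacement for the cell above:

```python
import heapq
from xml.etree import ElementTree as ET

# Hypothetical sample: country C has no <infant_mortality> element.
SAMPLE_XML = """
<mondial>
  <country><name>A</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>B</name><infant_mortality>1.8</infant_mortality></country>
  <country><name>C</name></country>
  <country><name>D</name><infant_mortality>2.1</infant_mortality></country>
</mondial>
"""

root = ET.fromstring(SAMPLE_XML)

rates = {}
for country in root.iterfind('country'):
    # findtext() returns None when the child element is missing
    rate = country.findtext('infant_mortality')
    if rate is not None:
        rates[country.findtext('name')] = float(rate)

# Pick the two smallest rates, keeping the (name, rate) pairs together.
lowest_two = heapq.nsmallest(2, rates.items(), key=lambda item: item[1])
print(lowest_two)
```

Keeping the name and the rate in one pair avoids having to look the rate up again after ranking.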

# 10 cities with the largest population


In [8]:
# compare cities by their 2011 population figure only;
# cities without a 2011 figure are left out of the ranking
cities = []
population = []
for element in document.iterfind('country'):
    for subelement in element.iter('city'):
        for pop in subelement.iter('population'):
            if pop.get('year') == '2011':
                cities.append(subelement.find('name').text)
                population.append(int(pop.text))
a = pd.DataFrame({'cities': cities, 'population': population})
a = a.set_index('cities')
b = a['population'].nlargest(10).index.values
b

array(['Mumbai', 'Delhi', 'Bangalore', 'London', 'Tehran', 'Dhaka',
       'Hyderabad', 'Ahmadabad', 'Luanda', 'Chennai'], dtype=object)
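Restricting the comparison to 2011 drops every city that has no figure for that year. One possible alternative, sketched here on made-up data, is to take the most recent `<population>` entry of each city. The sketch assumes every `<population>` element carries a `year` attribute, and `latest_population` is a hypothetical helper, not part of the notebook:

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: the cities have figures for different years, one has none.
SAMPLE_XML = """
<country>
  <city>
    <name>Oldtown</name>
    <population year="2001">500</population>
    <population year="2011">650</population>
  </city>
  <city>
    <name>Newtown</name>
    <population year="2009">900</population>
  </city>
  <city>
    <name>Ghosttown</name>
  </city>
</country>
"""


def latest_population(city):
    """Return the most recent population figure of a city, or None if it has none."""
    pops = city.findall('population')
    if not pops:
        return None
    # Assumption: each <population> has a numeric 'year' attribute.
    newest = max(pops, key=lambda p: int(p.get('year')))
    return int(newest.text)


country = ET.fromstring(SAMPLE_XML)

latest = {}
for city in country.iter('city'):
    pop = latest_population(city)
    if pop is not None:
        latest[city.findtext('name')] = pop

ranked = sorted(latest.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

The trade-off is that figures from different years are compared against each other, so the ranking is only as consistent as the underlying estimates.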

# 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)

In [9]:
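# again use 2011 population figures for comparison
# note: iter('population') also visits population elements nested deeper
# in the country, such as those of its cities
# note: 'percentage' is used as a raw weight (not divided by 100), which
# scales every total equally and does not change the ranking
group = {}
for element in document.iterfind('country'):
    for pop in element.iter('population'):
        if pop.get('year') == '2011' and element.find('ethnicgroup') is not None:
            popN = int(pop.text)
            for eth in element.iter('ethnicgroup'):
                group[eth.text] = group.get(eth.text, 0) + popN * float(eth.get('percentage'))

a = pd.Series(group)
a.nlargest(10).index.values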

array(['Indo-Aryan', 'Dravidian', 'African', 'Bengali', 'German',
       'English', 'Mediterranean Nordic', 'Persian', 'Mongol', 'European'], dtype=object)
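A different reading of "sum of best/latest estimates" is to take, per country, only its own most recent population figure and weight it by each group's share. The sketch below shows that variant on made-up data. It assumes `<population>` and `<ethnicgroup>` are direct children of `<country>` and that `percentage` is on a 0 to 100 scale. The sample document and all figures are hypothetical:

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical sample: country A also contains a city-level population
# that should not be counted as a country total.
SAMPLE_XML = """
<mondial>
  <country>
    <name>A</name>
    <population year="2001">900</population>
    <population year="2011">1000</population>
    <ethnicgroup percentage="60">X</ethnicgroup>
    <ethnicgroup percentage="40">Y</ethnicgroup>
    <city><name>A1</name><population year="2011">300</population></city>
  </country>
  <country>
    <name>B</name>
    <population year="2010">500</population>
    <ethnicgroup percentage="50">X</ethnicgroup>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE_XML)

totals = defaultdict(float)
for country in root.iterfind('country'):
    # findall() only sees direct children, so city populations are ignored.
    pops = country.findall('population')
    if not pops:
        continue
    newest = max(pops, key=lambda p: int(p.get('year')))
    people = int(newest.text)
    for eth in country.findall('ethnicgroup'):
        totals[eth.text] += people * float(eth.get('percentage')) / 100

ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Dividing by 100 turns the totals into head counts, and using `findall` rather than a subtree walk keeps nested city or province figures out of the sum.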

# name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [10]:
# longest river
countries = []
river = []
length = []
for element in document.iterfind('river'):
    # skip rivers without a <length> element
    if element.find('length') is not None:
        countries.append(element.get('country'))
        river.append(element.find('name').text)
        length.append(float(element.find('length').text))
a = pd.DataFrame({'countries': countries, 'river': river, 'length': length})
longest = a.loc[a['length'].idxmax()]
print('The longest river is {}, it is located in {}'.format(longest['river'], longest['countries']))
        

The longest river is Amazonas, it is located in CO BR PE


In [11]:
# largest lake
countries = []
lake = []
area = []
for element in document.iterfind('lake'):
    # skip lakes without an <area> element
    if element.find('area') is not None:
        countries.append(element.get('country'))
        lake.append(element.find('name').text)
        area.append(float(element.find('area').text))
a = pd.DataFrame({'countries': countries, 'lake': lake, 'area': area})
largest = a.loc[a['area'].idxmax()]
print('The largest lake is {}, it is located in {}'.format(largest['lake'], largest['countries']))

The largest lake is Caspian Sea, it is located in R AZ KAZ IR TM


In [12]:
# airport at highest elevation
countries = []
airport = []
elevation = []
for element in document.iterfind('airport'):
    # skip airports whose <elevation> element is missing or empty
    elev = element.find('elevation')
    if elev is not None and elev.text is not None:
        countries.append(element.get('country'))
        airport.append(element.find('name').text)
        elevation.append(int(elev.text))
a = pd.DataFrame({'countries': countries, 'airport': airport, 'elevation': elevation})
highest = a.loc[a['elevation'].idxmax()]
print('The highest elevation airport is {}, it is located in {}'.format(highest['airport'], highest['countries']))

The highest elevation airport is El Alto Intl, it is located in BOL
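The three answers above report country codes (for example `CO BR PE`) rather than country names, because the `country` attribute holds space-separated codes. A possible way to resolve them is sketched below on made-up data. It assumes each `<country>` element exposes its code in a `car_code` attribute, which should be checked against the actual file. `largest_by` and `code_to_name` are hypothetical names introduced for this illustration:

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: one river has no <length> and must be skipped.
SAMPLE_XML = """
<mondial>
  <country car_code="CO"><name>Colombia</name></country>
  <country car_code="BR"><name>Brazil</name></country>
  <river country="CO BR"><name>Long</name><length>6000</length></river>
  <river country="BR"><name>Short</name><length>120.5</length></river>
  <river country="CO"><name>Unknown</name></river>
</mondial>
"""

root = ET.fromstring(SAMPLE_XML)

# Assumption: the country code is stored in the 'car_code' attribute.
code_to_name = {
    country.get('car_code'): country.findtext('name')
    for country in root.iterfind('country')
}


def largest_by(parent, tag, measure):
    """Return the <tag> child of parent with the largest numeric <measure> value."""
    # findtext() gives None for a missing element and '' for an empty one;
    # both are falsy, so such entries are left out.
    candidates = [e for e in parent.iterfind(tag) if e.findtext(measure)]
    return max(candidates, key=lambda e: float(e.findtext(measure)))


longest = largest_by(root, 'river', 'length')

# Fall back to the raw code when it is not in the lookup table.
countries = [code_to_name.get(code, code) for code in longest.get('country').split()]
print(longest.findtext('name'), countries)
```

The same helper could be called with `('lake', 'area')` or `('airport', 'elevation')`, since all three questions share the "largest value of one numeric child" shape.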
