# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html

In [4]:
document_tree = ET.parse( 'data/mondial_database_less.xml' )

In [4]:
# print names of all countries
for child in document_tree.getroot():
    print child.find('name').text

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [10]:
for element in document_tree.iter('country'):
    print (element.find('population').text)

1214489
1096810
808724
6732256
311341
1584440
6197


In [13]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':', end = '');
    capitals_string = ''
    for subelement in element.getiterator('city'):
        capitals_string += subelement.find('name').text + ', '
    print (capitals_string[:-2])

* Albania:Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:Skopje, Kumanovo
* Serbia:Beograd, Novi Sad, Niš
* Montenegro:Podgorica
* Kosovo:Prishtine
* Andorra:Andorra la Vella


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and refering to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

### 1. 10 countries with the lowest infant mortality rates

To answer this question, it is more efficient to import the data from xml to pandas data frame format in order to sort and query the answers.

In [84]:
import pandas as pd
document = ET.parse( 'data/mondial_database.xml' )

country = []
infant_m = []
for element in document.iter('country'):
    try :
        c = element.find('name').text
        i = element.find('infant_mortality').text
    except AttributeError:
        continue
    country.append(c)
    infant_m.append(float(i))
        
df1 = pd.DataFrame(infant_m, index = country, columns = ['infantMortalityRate'])
df1.sort_values(by = 'infantMortalityRate',ascending= False).head(10)

Unnamed: 0,infantMortalityRate
Western Sahara,145.82
Afghanistan,117.23
Mali,104.34
Somalia,100.14
Central African Republic,92.86
Guinea-Bissau,90.92
Chad,90.3
Niger,86.27
Angola,79.99
Burkina Faso,76.8


### 2. 10 cities with the largest population

In [87]:
cities = []
population = []

for element in document.iter('city'):
    try:
        city = element.find('name').text
        p = []
        
        for subelement in element.iter('population'):
            p.append(float(subelement.text))
        
        maxp = max(p)
    except (AttributeError, ValueError):
        continue
    cities.append(city)
    population.append(maxp)

df2 = pd.DataFrame(data = population, index = cities, columns = ['population'])
df2.sort_values(by = 'population', ascending=False).head(10)

Unnamed: 0,population
Shanghai,22315474.0
Istanbul,13710512.0
Delhi,12877470.0
Mumbai,12442373.0
Moskva,11979529.0
Beijing,11716620.0
São Paulo,11152344.0
Tianjin,11090314.0
Guangzhou,11071424.0
Shenzhen,10358381.0


### 3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)

In [107]:
import numpy as np
country = []
ethnic = []
population = []
for element in document.iter('country'):
    pop = 0
    try:
        for subelement in element.iter('population'):
            if float(subelement.text) > pop:
                pop = float(subelement.text)
    except:
        print('no population found for ', element.text)
        continue
    
    for subelement in element.iter('ethnicgroup'):
        try:
            c = element.find('name').text
            e = subelement.text
            perc = float(subelement.attrib['percentage'])
        except ValueError:
            continue
        
        country.append(c)
        ethnic.append(e)
        population.append(perc*pop/100)

pd.set_option('display.float_format', lambda x: '%.0f' % x)
df3 = pd.DataFrame({'ethnic': ethnic, 'population':population, 'country':country})
df3 = df3.groupby(by = 'ethnic').sum()
df3.sort_values(by = 'population', ascending=False).head(10)


Unnamed: 0_level_0,population
ethnic,Unnamed: 1_level_1
Han Chinese,1245058800
Indo-Aryan,871815583
European,494939516
African,318359698
Dravidian,302713744
Mestizo,157855273
Bengali,146776917
Russian,136866551
Japanese,127289008
Malay,121993620


### 4. Name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [115]:
country = []
river = []
riverLength = []



for element in document.iter('river'):
    try:
        r = element.find('name').text
        c = element.attrib['country']
        l = float(element.find('length').text)
    except:
        continue
    country.append(c)
    river.append(r)
    riverLength.append(l)
    
df4 = pd.DataFrame({'country': country, 'river':river, 'riverLength':riverLength})
df4.sort_values(by = 'riverLength', ascending=False).head(1)

Unnamed: 0,country,river,riverLength
174,CO BR PE,Amazonas,6448


In [116]:
country = []
lake = []
lakeArea = []

for element in document.iter('lake'):
    try:
        l = element.find('name').text
        c = element.attrib['country']
        a = float(element.find('area').text)
    except:
        continue
    country.append(c)
    lake.append(l)
    lakeArea.append(a)
    
df5 = pd.DataFrame({'country': country, 'lake':lake, 'lakeArea':lakeArea})
df5.sort_values(by = 'lakeArea', ascending=False).head(1)

Unnamed: 0,country,lake,lakeArea
54,R AZ KAZ IR TM,Caspian Sea,386400


In [117]:
country = []
airport = []
airportElevation = []

for element in document.iter('airport'):
    try:
        a = element.find('name').text
        c = element.attrib['country']
        e = float(element.find('elevation').text)
    except:
        continue
    country.append(c)
    airport.append(a)
    airportElevation.append(e)
    
df6 = pd.DataFrame({'country': country, 'airport':airport, 'airportElevation':airportElevation})
df6.sort_values(by = 'airportElevation', ascending=False).head(1)

Unnamed: 0,airport,airportElevation,country
80,El Alto Intl,4063,BOL
