# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [2]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html

In [3]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [4]:
# print names of all countries
for child in document_tree.getroot():
    print (child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [5]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':')
    # iter() walks all descendants, so cities nested at any depth are included
    city_names = [city.find('name').text for city in element.iter('city')]
    print(', '.join(city_names))

* Albania:
Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:
Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:
Skopje, Kumanovo
* Serbia:
Beograd, Novi Sad, Niš
* Montenegro:
Podgorica
* Kosovo:
Prishtine
* Andorra:
Andorra la Vella


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [6]:
import xml.etree.ElementTree as ET
import pandas as pd
import numpy as np

In [7]:
# parse the xml, and to make things cleaner, just assign the root to a variable

document = ET.parse( './data/mondial_database.xml' )
root = document.getroot()

In [8]:
# Part 1 - 10 countries with the lowest infant mortality rates
# Create two lists to collect the country names and infant mortality rates

country_name = []
infant_mortality = []

# Iterate through the XML and fill the lists; elements without an
# infant_mortality entry contribute nothing
for element in root:
    for subelement in element.iter('infant_mortality'):
        country_name.append(element.find('name').text)
        infant_mortality.append(float(subelement.text))

# Build the data frame from the two lists, one column per list
definition = {'country_name':country_name, 'infant_mortality':infant_mortality}
bottom10mortality = pd.DataFrame(data=definition)

# Now sort by infant mortality rate and take the first 10
bottom10mortality.sort_values('infant_mortality', ascending=True).head(10)


index,country_name,infant_mortality
36,Monaco,1.81
90,Japan,2.13
109,Bermuda,2.48
34,Norway,2.48
98,Singapore,2.53
35,Sweden,2.6
8,Czech Republic,2.63
72,Hong Kong,2.73
73,Macao,3.13
39,Iceland,3.15
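
The loop in Part 1 silently skips countries that have no `infant_mortality` element. A minimal sketch of the same idea using `findtext`, which returns `None` when the child is missing, is shown below. The inline XML is a made-up sample, and a plain list of tuples stands in for the data frame.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: country B has no infant_mortality element.
sample = ET.fromstring(
    "<mondial>"
    "<country><name>A</name><infant_mortality>4.5</infant_mortality></country>"
    "<country><name>B</name></country>"
    "<country><name>C</name><infant_mortality>1.2</infant_mortality></country>"
    "</mondial>"
)

rates = []
for country in sample.iterfind('country'):
    value = country.findtext('infant_mortality')  # None when the element is absent
    if value is not None:
        rates.append((country.findtext('name'), float(value)))

# Sort ascending by rate; slicing [:10] would give the ten lowest
lowest = sorted(rates, key=lambda pair: pair[1])
print(lowest)
```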


In [74]:
# Part 2 - 10 cities with the largest population

# This is trickier than Part 1: instead of a 1:1 relationship, a city may have
# one or more population observations, so we collect one row per observation.

# A city or country might also have more than one name element;
# find('name') returns only the first one.

city_name = []
population_number = []
population_type = []
year = []

# Cities are nested below the root's children and populations are nested
# below the cities, so the loops have to be nested as well.
# The assignment isn't clear about which population measure to take, so the
# measurement type is recorded too.
# Note that not all of the population elements have a "measured" attribute.

for element in root:
    for city in element.iter('city'):
        for population in city.iter('population'):
            city_name.append(city.find('name').text)
            population_number.append(int(population.text))
            year.append(int(population.get('year')))
            if 'measured' in population.attrib:
                population_type.append(population.get('measured'))
            else:
                population_type.append(np.nan)
                

# now define the end result data frame

definition = {'city_name':city_name, 'year':year, 'population_type':population_type,'population':population_number}
citypopulation = pd.DataFrame(data=definition)

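# and pull out the 10 cities with the largest population, using each city's most recent observation
# (note: cities that share a name are grouped together here)

latest_rows = citypopulation.groupby('city_name')['year'].idxmax()
citypopulation.loc[latest_rows].sort_values('population', ascending=False).head(10)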

index,city_name,population,population_type,year
3750,Shanghai,22315474,census,2010
2607,Istanbul,13710512,admin.,2012
4303,Mumbai,12442373,census,2011
1546,Moskva,11979529,estimate,2013
3746,Beijing,11716620,census,2010
8208,São Paulo,11152344,census,2010
3754,Tianjin,11090314,census,2010
3364,Guangzhou,11071424,census,2010
4399,Delhi,11034555,census,2011
3371,Shenzhen,10358381,census,2010
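
Part 2 picks the most recent observation per city by grouping on the city name alone, so two cities with the same name in different countries would be merged. The sketch below shows one way to avoid that by keying on the (country, city) pair. It uses only the standard library; the sample XML is invented for illustration, and every population element is assumed to carry a `year` attribute.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: two different cities share the name "Springfield".
sample = ET.fromstring(
    "<mondial>"
    "<country><name>X</name>"
    "<city><name>Springfield</name>"
    "<population year='2000'>100</population>"
    "<population year='2010'>150</population>"
    "</city></country>"
    "<country><name>Y</name>"
    "<city><name>Springfield</name>"
    "<population year='2005'>900</population>"
    "</city></country>"
    "</mondial>"
)

latest = {}
for country in sample.iterfind('country'):
    country_name = country.findtext('name')
    for city in country.iter('city'):
        observations = city.findall('population')
        if not observations:  # plain list, so an emptiness check is safe here
            continue
        # Keep only the observation with the highest year
        newest = max(observations, key=lambda p: int(p.get('year')))
        key = (country_name, city.findtext('name'))
        latest[key] = (int(newest.get('year')), int(newest.text))

# Rank by population, largest first
ranked = sorted(latest.items(), key=lambda item: item[1][1], reverse=True)
print(ranked)
```

In pandas, the equivalent change would be to add a country column and group on both columns instead of `city_name` alone.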


In [162]:
# Part 3 - 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)

# ethnicgroup is a child of each country element, and a country can have several
# population is also a child of each country element - and it looks like we need measured = est.
# I'm guessing we'll have to multiply the ethnicgroup percentage by the population of the country
# to get the population of the ethnic group

ethnic_population = []
ethnic_group = []
ethnic_percent = []
year = []


# Exploratory step: for every country with at least one ethnicgroup, print each
# population entry next to the country's first ethnic group and its percentage
for country in root.iterfind('country'):
    first_group = country.find('ethnicgroup')
    if first_group is None:
        continue
    for population in country.findall('population'):
        print(population.text, country.find('name').text, population.get('year'),
              first_group.text, first_group.get('percentage'))
        


1214489 Albania 1950 Albanian 95
1618829 Albania 1960 Albanian 95
2138966 Albania 1970 Albanian 95
2734776 Albania 1980 Albanian 95
3446882 Albania 1990 Albanian 95
3249136 Albania 1997 Albanian 95
3304948 Albania 2000 Albanian 95
3069275 Albania 2001 Albanian 95
2800138 Albania 2011 Albanian 95
1096810 Greece 1861 Greek 93
1457894 Greece 1870 Greek 93
1679470 Greece 1879 Greek 93
2433806 Greece 1896 Greek 93
2631592 Greece 1907 Greek 93
5016889 Greece 1920 Greek 93
6204684 Greece 1928 Greek 93
7344860 Greece 1940 Greek 93
7632801 Greece 1951 Greek 93
8388553 Greece 1961 Greek 93
8768372 Greece 1971 Greek 93
9739589 Greece 1981 Greek 93
10217335 Greece 1991 Greek 93
10934097 Greece 2001 Greek 93
10816286 Greece 2011 Greek 93
808724 Macedonia 1921 Macedonian 64.2
949958 Macedonia 1931 Macedonian 64.2
1152986 Macedonia 1948 Macedonian 64.2
1304514 Macedonia 1953 Macedonian 64.2
1406003 Macedonia 1961 Macedonian 64.2
1647308 Macedonia 1971 Macedonian 64.2
1909136 Macedonia 1981 Macedonian
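
The exploration above stops at printing the first ethnic group against every population entry. One possible way to finish the calculation is sketched below, under two assumptions that the exercise text does not confirm: the "best/latest estimate" is taken to be the population entry with the highest year, and every `ethnicgroup` is assumed to have a `percentage` attribute. The inline XML is a made-up sample, not Mondial data.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical sample with two countries sharing one ethnic group.
sample = ET.fromstring(
    "<mondial>"
    "<country><name>X</name>"
    "<population year='2001'>1000</population>"
    "<population year='2011'>2000</population>"
    "<ethnicgroup percentage='50'>Alpha</ethnicgroup>"
    "<ethnicgroup percentage='25'>Beta</ethnicgroup>"
    "</country>"
    "<country><name>Y</name>"
    "<population year='2010'>400</population>"
    "<ethnicgroup percentage='100'>Alpha</ethnicgroup>"
    "</country>"
    "</mondial>"
)

totals = defaultdict(float)
for country in sample.iterfind('country'):
    populations = country.findall('population')
    if not populations:
        continue
    # Assumption: the entry with the highest year is the best/latest estimate
    latest = max(populations, key=lambda p: int(p.get('year')))
    for group in country.findall('ethnicgroup'):
        share = float(group.get('percentage')) / 100.0
        totals[group.text] += share * int(latest.text)

# Largest groups first; slicing [:10] would give the top ten
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Looping over `findall('ethnicgroup')` rather than `find('ethnicgroup')` matters here, since `find` would only ever return the first group of each country.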

In [62]:
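# note from the ElementTree docs: an element with no subelements tests as false,
# so compare against None to check whether find() actually matched something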

# element = root.find('foo')

# if not element:  # careful!
#    print("element not found, or element has no subelements")

# if element is None:
#    print("element not found")
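
The commented snippet above can be tried out on a tiny document. The sketch below, using an invented two-element sample, separates the two cases explicitly: `is None` answers "was anything found?", while `len(element)` answers "does it have subelements?".

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: <leaf> exists but has no children; <foo> does not exist.
sample = ET.fromstring("<root><leaf>text</leaf></root>")

missing = sample.find('foo')   # no such child, so find() returns None
leaf = sample.find('leaf')     # found, but without subelements

missing_found = missing is not None
leaf_found = leaf is not None
leaf_child_count = len(leaf)   # number of direct subelements

print(missing_found, leaf_found, leaf_child_count)
```

This is why the Part 3 loop tests `find('ethnicgroup')` against `None` instead of relying on its truth value.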