# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [2]:
from xml.etree import ElementTree as ET

import pandas as pd

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
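
The difference between the two traversal styles used in this notebook can be seen on a tiny document. This is a minimal sketch: the XML string and every name in it are made up for illustration and are not taken from the Mondial data.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document, not the real Mondial file.
SAMPLE = """
<mondial>
  <country>
    <name>Aland</name>
    <city><name>Port A</name></city>
    <province>
      <name>North</name>
      <city><name>Hill Town</name></city>
    </province>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)
country = root.find('country')

# iterfind('city') matches only direct children of the country element.
direct = [c.findtext('name') for c in country.iterfind('city')]

# iter('city') walks the whole subtree, so it also finds nested cities.
nested = [c.findtext('name') for c in country.iter('city')]

print(direct)  # ['Port A']
print(nested)  # ['Port A', 'Hill Town']
```

In short, `iterfind` with a plain tag name is the right choice for direct children, while `iter` is needed when the elements of interest may sit deeper in the tree.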

In [3]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [4]:
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [5]:
# print the name of each country, followed by the names of its cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':')
    cities_string = ''
    # iter() replaces getiterator(), which was removed in Python 3.9
    for subelement in element.iter('city'):
        cities_string += subelement.find('name').text + ', '
    print(cities_string[:-2])

* Albania:
Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:
Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:
Skopje, Kumanovo
* Serbia:
Beograd, Novi Sad, Niš
* Montenegro:
Podgorica
* Kosovo:
Prishtine
* Andorra:
Andorra la Vella


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [6]:
document = ET.parse( './data/mondial_database.xml' )
root = document.getroot()



In [7]:
root.tag

'mondial'

In [8]:
root.attrib

{}

In [9]:
#Solution 1: 10 countries with the lowest infant mortality rates
names = []
rates = []
# The XPath predicate [infant_mortality] selects only elements that have an infant_mortality child
for country in root.findall(".//*[infant_mortality]"):
    names.append(country.find('name').text)
    rates.append(float(country.find('infant_mortality').text))

ratesDF = pd.DataFrame([names,rates])
ratesDF = ratesDF.transpose()

In [10]:
ratesDF.columns = ['names', 'rates'] # a flat list of labels; a nested list would create a MultiIndex
ratesDF.sort_values(by='rates').head(10)

,names,rates
36,Monaco,1.81
90,Japan,2.13
109,Bermuda,2.48
34,Norway,2.48
98,Singapore,2.53
35,Sweden,2.6
8,Czech Republic,2.63
72,Hong Kong,2.73
73,Macao,3.13
39,Iceland,3.15
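
The XPath predicate `.//*[infant_mortality]` used in Solution 1 keeps only elements that have an `infant_mortality` child, which is why countries without a recorded rate never reach the `float()` call. A minimal sketch of that behaviour, using made-up country names and rates and plain `sorted` in place of pandas:

```python
from xml.etree import ElementTree as ET

# Hypothetical data: the middle country has no infant_mortality element.
SAMPLE = """
<mondial>
  <country><name>Aland</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>Borduria</name></country>
  <country><name>Cascadia</name><infant_mortality>2.1</infant_mortality></country>
</mondial>
"""
root = ET.fromstring(SAMPLE)

# [infant_mortality] keeps only elements that have such a child,
# so Borduria is skipped instead of raising an error.
rates = [
    (c.findtext('name'), float(c.findtext('infant_mortality')))
    for c in root.findall(".//*[infant_mortality]")
]

# Sort by rate and keep the two lowest.
lowest = sorted(rates, key=lambda pair: pair[1])[:2]
print(lowest)  # [('Cascadia', 2.1), ('Aland', 4.5)]
```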


Solution 2: 10 Cities with the highest populations

The challenge here is that the cities are nested under provinces, and there are multiple population numbers for each city, plus each city has a different set of years for which population estimates are available.

Let's compare the populations based on the latest estimate available for each city.
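
The "latest estimate" rule can be isolated in a small helper. The sketch below is only an illustration: `latest_population` is a hypothetical name, the sample city is made up, and it assumes every `population` element carries an integer `year` attribute and integer text.

```python
from xml.etree import ElementTree as ET

# Hypothetical city with estimates listed out of chronological order.
SAMPLE = """
<city>
  <name>Port A</name>
  <population year="1991">1000</population>
  <population year="2011">1500</population>
  <population year="2001">1200</population>
</city>
"""

def latest_population(element):
    """Return the population with the most recent year, or None if absent."""
    estimates = element.findall('population')
    if not estimates:
        return None
    # Compare years numerically rather than as strings.
    newest = max(estimates, key=lambda p: int(p.get('year', 0)))
    return int(newest.text)

city = ET.fromstring(SAMPLE)
print(latest_population(city))  # 1500
```

Returning `None` for a city without estimates keeps such cities distinguishable from cities with a genuine population of zero.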

In [13]:
#This solution closely follows github user farfan92's solution, which I referred to

#A dictionary holds the results, since we are collecting city-population (key-value) pairs
citypop = dict()

#Iterate over the countries with iterfind, which yields matches one at a time instead of
# building the full list that findall would return.
for country in document.iterfind('country'):
    #iter() searches the whole subtree, so it also finds cities nested under provinces
    for city in country.iter('city'):
        #Track the latest estimate seen so far, starting from zero for every city
        current_pop = 0
        current_pop_year = 0

        #The year is stored as an attribute of each population element, not as a child node.
        # Cycle through the estimates and keep the one with the latest year.
        for subelement in city.iterfind('population'):
            if int(subelement.attrib['year']) > current_pop_year:
                current_pop = int(subelement.text)
                current_pop_year = int(subelement.attrib['year'])

        #Cities that share a name overwrite one another here, because the name is the dictionary key
        citypop[city.findtext('name')] = current_pop

#Create a dataframe from the dictionary of city-population pairs for easy sorting
citypop_df = pd.DataFrame.from_dict(citypop, orient ='index')
citypop_df.columns = ['population']
citypop_df.index.names = ['city']
citypop_df.sort_values(by = 'population', ascending = False).head(10)

city,population
Shanghai,22315474
Istanbul,13710512
Mumbai,12442373
Moskva,11979529
Beijing,11716620
São Paulo,11152344
Tianjin,11090314
Guangzhou,11071424
Delhi,11034555
Shenzhen,10358381


Solution 3: Finding the ethnic groups with the largest populations. Ethnic groups are not restricted to a single country, so we keep a running total for each group while cycling through all countries that have population data.
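
The running-total idea can be sketched on a small, made-up document. Everything in the sample (country names, group names, figures) is hypothetical, and the sketch assumes that each country's own `population` elements are its direct children and that `percentage` is always present.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical data: "Group X" appears in both countries.
SAMPLE = """
<mondial>
  <country>
    <name>Aland</name>
    <population year="2010">1000</population>
    <ethnicgroup percentage="60">Group X</ethnicgroup>
    <ethnicgroup percentage="40">Group Y</ethnicgroup>
  </country>
  <country>
    <name>Borduria</name>
    <population year="2000">400</population>
    <population year="2012">500</population>
    <ethnicgroup percentage="50">Group X</ethnicgroup>
  </country>
</mondial>
"""
root = ET.fromstring(SAMPLE)
totals = defaultdict(float)  # missing groups start at 0.0

for country in root.iterfind('country'):
    estimates = country.findall('population')  # direct children only
    if not estimates:
        continue  # no population data, nothing to add
    latest = max(estimates, key=lambda p: int(p.get('year')))
    population = int(latest.text)
    for group in country.iterfind('ethnicgroup'):
        # Convert the percentage share into a head count and accumulate it.
        totals[group.text] += population * float(group.get('percentage')) / 100

print(dict(totals))  # {'Group X': 850.0, 'Group Y': 400.0}
```

Using `defaultdict(float)` removes the need to check whether a group has been seen before.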

In [14]:
#This solution closely follows github user farfan92's solution, which I referred to

#This time, we find all population nodes for each country first, and pick the one for the latest year

#Also, each ethnic group is given as a percentage of the population, so we have to calculate the population of each group in each country

ethn = dict()
for country in document.iterfind('country'):
    current_pop = 0
    current_pop_year = 0 #reset the running variables for every country
    #iter() replaces getiterator(), which was removed in Python 3.9. Like getiterator(), it walks the
    # whole subtree, so population elements nested under provinces and cities are visited as well.
    for population in country.iter('population'):
        if int(population.attrib['year']) > current_pop_year: #getting the latest data
            current_pop = int(population.text) #converting text to integer
            current_pop_year = int(population.attrib['year'])
    for ethn_gp in country.iterfind('ethnicgroup'): #loop through all the ethnic groups in the country
        group_pop = current_pop*float(ethn_gp.attrib['percentage'])/100
        ethn[ethn_gp.text] = ethn.get(ethn_gp.text, 0) + group_pop #add to the running total, starting from 0 for a new group

ethnic_df = pd.DataFrame.from_dict(ethn, orient ='index')
ethnic_df.columns = ['population']
ethnic_df.index.names = ['ethnic_group']
ethnic_df.groupby(ethnic_df.index).sum().sort_values(by = 'population', ascending = False).head(10)

Unnamed: 0_level_0,population
ethnic_group,Unnamed: 1_level_1
Han Chinese,1245059000.0
Indo-Aryan,871815600.0
European,494872200.0
African,318325100.0
Dravidian,302713700.0
Mestizo,157734400.0
Bengali,146776900.0
Russian,130484000.0
Japanese,126534200.0
Malay,121993600.0


Solution 4: Finding the name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [32]:
#Initialize some variables
rivername = None
riverct = None
riverlen= 0

lakename = None
lakect = None
lakearea = 0

arptname = None
arptct = None
arptel = 0
# Rivers, lakes and airports are direct children of the root element, not of the country elements used above
# As we cycle through all the rivers, we keep the length, name and country codes of the longest one seen so far

for river in document.iterfind('river'):
    for length in river.iterfind('length'):
        if float(length.text)>riverlen:
            riverlen = float(length.text)
            riverct = river.attrib['country']
            rivername = river.findtext('name')

#Similarly for the largest lake
for lake in document.iterfind('lake'):
    for area in lake.iterfind('area'):
        if float(area.text)>lakearea:
            lakearea = float(area.text)
            lakect = lake.attrib['country']
            lakename = lake.findtext('name')

#Finally for airports and elevations
for airport in document.iterfind('airport'):
    print(airport.findtext('name')) #This allows us to see that Xiangfan doesn't have an elevation
    for elevation in airport.iterfind('elevation'):
        if ((elevation.text is not None) and (float(elevation.text)>arptel) ):
            arptel = float(elevation.text)
            arptct = airport.attrib['country']
            arptname = airport.findtext('name')


Herat
Kabul Intl
Tirana Rinas
Cheikh Larbi Tebessi
Batna Airport
Soummam
Tamanrasset
Biskra
Mohamed Boudiaf Intl
Ain Arnat Airport
Es Senia
Noumerat
Annaba
Houari Boumediene
Zenata
Pago Pago Intl
Lubango
Cabinda
Menongue
Luanda 4 De Fevereiro
Huambo
Wallblake
V C Bird Intl
La Rioja
Jujuy
Comandante Espora
Teniente Benjamin Matienzo
San Luis
Santiago del Estero
Sauce Viejo
Corrientes
Presidente Peron
Salta
Aeroparque Jorge Newbery
Ministro Pistarini
Ushuaia Malvinas Argentinas
Formosa
Posadas
Rosario
Resistencia
Rio Gallegos
Comodoro Rivadavia
Mar Del Plata
El Plumerillo
Ambrosio L V Taravella
Zvartnots
Reina Beatrix Intl
Melbourne Intl
Sydney Intl
Cairns Intl
Townsville
Brisbane Intl
Canberra
Adelaide Intl
Newcastle Airport
Darwin Intl
Perth Intl
Hobart
Salzburg
Linz
Innsbruck
Schwechat
Graz
Woerthersee International Airport
Heydar Aliyev
Lynden Pindling Intl
Bahrain Intl
Zia Intl
Shah Amanat Intl
Grantley Adams Intl
Minsk 2
Deurne
Brussels South
Aéroport de Liège
Brussels Natl
Philip 

In [49]:
print('\t','Longest River','\t', 'Largest Lake','\t','Highest Airport')
print('Name','\t',rivername,'\t',lakename,'\t',arptname)
print('Country',riverct,'\t',lakect,'  ',arptct)
print('Value','\t',riverlen,'\t',lakearea,'\t',arptel)

	 Longest River 	 Largest Lake 	 Highest Airport
Name 	 Amazonas 	 Caspian Sea 	 El Alto Intl
Country CO BR PE 	 R AZ KAZ IR TM    BOL
Value 	 6448.0 	 386400.0 	 4063.0
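
The three loops in Solution 4 share one pattern: scan the elements with a given tag, skip those whose measurement is missing or empty, and remember the largest value. A minimal sketch of that pattern as a single helper follows. The function name `largest` and the sample data are hypothetical; the sketch only assumes the layout used above, with `name` as a child element and `country` as an attribute.

```python
from xml.etree import ElementTree as ET

# Hypothetical data, including a river with no length and an airport
# whose elevation element is empty.
SAMPLE = """
<mondial>
  <river country="AA BB"><name>Long River</name><length>900</length></river>
  <river country="AA"><name>Short River</name><length>120.5</length></river>
  <river country="BB"><name>Unmeasured River</name></river>
  <airport country="AA"><name>High Field</name><elevation>2500</elevation></airport>
  <airport country="BB"><name>Blank Field</name><elevation/></airport>
</mondial>
"""

def largest(root, tag, field):
    """Return (name, country codes, value) for the <tag> with the largest <field>."""
    best = None
    for element in root.iterfind(tag):
        text = element.findtext(field)
        if not text:  # field is missing (None) or empty ('')
            continue
        value = float(text)
        if best is None or value > best[2]:
            best = (element.findtext('name'), element.get('country'), value)
    return best

root = ET.fromstring(SAMPLE)
print(largest(root, 'river', 'length'))       # ('Long River', 'AA BB', 900.0)
print(largest(root, 'airport', 'elevation'))  # ('High Field', 'AA', 2500.0)
```

Starting from `best = None` instead of 0 also keeps the helper correct if every value happens to be negative, for example elevations below sea level.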
