# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
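
As a rough sketch of the access patterns used in this notebook, the snippet below parses a small inline XML string and contrasts `find`, `iterfind`, and `iter`. The string `SAMPLE_XML` and the names `sample_root`, `direct_cities`, and `all_cities` are made up for illustration; the element names imitate the Mondial layout, but this is not the real file.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document; only an illustration of the layout.
SAMPLE_XML = """
<mondial>
  <country car_code="AL">
    <name>Albania</name>
    <city><name>Tirana</name></city>
    <province>
      <name>Shkoder</name>
      <city><name>Shkoder</name></city>
    </province>
  </country>
</mondial>
"""

sample_root = ET.fromstring(SAMPLE_XML)
country = sample_root.find('country')   # first direct child tagged 'country'

# find() returns the first direct child with the given tag
country_name = country.find('name').text

# iterfind() yields matching direct children only
direct_cities = [c.find('name').text for c in country.iterfind('city')]

# iter() walks the whole subtree, so cities nested in provinces are included
all_cities = [c.find('name').text for c in country.iter('city')]

print(country_name)
print(direct_cities)
print(all_cities)
```

The difference between `iterfind` and `iter` matters whenever cities are nested below an intermediate element, as in this sample.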

In [24]:
from xml.etree import ElementTree as ET
# raw string so the backslash in the Windows path is taken literally
document = ET.parse( r'C:\data_wrangling_xml/mondial_database.xml' )

In [3]:
# print the name of each country followed by the names of its cities
for element in document.iterfind('country'):
    print '* ' + element.find('name').text + ':',
    cities_string = ''
    # iter() walks the whole subtree, so nested cities are included
    for subelement in element.iter('city'):
        cities_string += subelement.find('name').text + ', '
    print cities_string[:-2]

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [4]:
document = ET.parse( './data/mondial_database.xml' )

In [3]:
# Get the root element of the parsed tree
root = document.getroot()

### Question 1

In [6]:
# Collect (country, infant mortality) pairs; countries without a rate are skipped
country_mortality = []
for child in root.findall('country'):
    rate = child.find('infant_mortality')
    if rate is not None:
        # .text is a string; convert to float so the sort below is numeric
        country_mortality.append((child.find('name').text, float(rate.text)))
# Sort ascending by rate and keep the first 10
country_mortality.sort(key=lambda x: x[1])
print "Top 10 countries with lowest infant mortality rates are \n"
print map(lambda x: x[0], country_mortality[:10])
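
Values read from an element's `.text` are strings, so a sort key has to convert them before comparing. The sketch below uses made-up names and rates (`rates`, `as_text`, and `as_number` are hypothetical) to show how a text sort and a numeric sort can disagree.

```python
# Made-up rates stored as text, the way ElementTree returns them
rates = [('A', '10.16'), ('B', '2.13'), ('C', '1.81')]

# Comparing strings goes character by character, so '10.16' sorts before '2.13'
as_text = sorted(rates, key=lambda pair: pair[1])

# Converting to float compares the actual magnitudes
as_number = sorted(rates, key=lambda pair: float(pair[1]))

print([name for name, _ in as_text])
print([name for name, _ in as_number])
```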


### Question 2

In [29]:
# Rank countries by their most recent population figure
country_pop = []
for child in root.findall('country'):
    country = child.find('name').text
    # Map census year -> population for this country
    year_pop = {}
    for subelement in child.iterfind('population'):
        year_pop[int(subelement.attrib['year'])] = int(subelement.text)
    if year_pop:
        latest_pop = year_pop[max(year_pop)]
        country_pop.append((country, latest_pop))
country_pop.sort(key=lambda x: x[1])
print "Top 10 countries by most recent population figure are \n"
print map(lambda x: x[0], country_pop[-10:])

Top 10 countries by most recent population figure are 

['Japan', 'Russia', 'Bangladesh', 'Nigeria', 'Pakistan', 'Brazil', 'Indonesia', 'United States', 'India', 'China']
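
The question asks about cities, so the same "latest figure" idea can be applied one level down. The following is only a sketch under assumptions: the inline `SAMPLE_XML` is invented, the helper name `latest_city_populations` is hypothetical, and it assumes each city's `population` element carries a `year` attribute in the same way the country-level elements used above do.

```python
from xml.etree import ElementTree as ET

# Invented sample; cities may sit directly under a country or under a province.
SAMPLE_XML = """
<mondial>
  <country car_code="X">
    <name>Xland</name>
    <city>
      <name>Alpha</name>
      <population year="2001">100</population>
      <population year="2011">150</population>
    </city>
    <province>
      <name>North</name>
      <city>
        <name>Beta</name>
        <population year="2011">900</population>
      </city>
      <city><name>Gamma</name></city>
    </province>
  </country>
</mondial>
"""

def latest_city_populations(root):
    """Return (city, population) pairs, largest first, using each city's latest year."""
    result = []
    for city in root.iter('city'):
        figures = [(int(p.attrib['year']), int(p.text))
                   for p in city.findall('population')]
        if figures:  # skip cities that have no population data
            # max() on (year, population) tuples picks the most recent year
            result.append((city.find('name').text, max(figures)[1]))
    return sorted(result, key=lambda pair: pair[1], reverse=True)

city_ranking = latest_city_populations(ET.fromstring(SAMPLE_XML))
print(city_ranking[:10])
```

Run against the real tree, the call would be `latest_city_populations(root)[:10]`, provided the assumption about the `year` attribute holds there.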


### Question 3

In [41]:
# Estimate each ethnic group's head count as (share of country) * (latest population)
ethnic_groups = {}
for child in root.findall('country'):
    year_pop = {}
    for subelement in child.iterfind('population'):
        year_pop[int(subelement.attrib['year'])] = int(subelement.text)
    if not year_pop:
        continue
    latest_pop = year_pop[max(year_pop)]
    for ethnic_element in child.iterfind('ethnicgroup'):
        # 'percentage' is on a 0-100 scale, so divide by 100 to get a fraction
        new_pop = float(ethnic_element.attrib['percentage']) / 100 * latest_pop
        ethnic_groups[ethnic_element.text] = ethnic_groups.get(ethnic_element.text, 0) + new_pop
print "The top 10 ethnicities with largest overall populations\n"
print sorted(ethnic_groups, key=ethnic_groups.get)[-10:]

The top 10 ethnicities with largest overall populations

['Malay', 'Japanese', 'Russian', 'Bengali', 'Mestizo', 'Dravidian', 'African', 'European', 'Indo-Aryan', 'Han Chinese']
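
The accumulation step can also be written with `collections.defaultdict`, which removes the need to check whether a group has been seen before. The figures below are made up, and `countries_sample`, `group_totals`, and `ranking` are hypothetical names used only for this sketch.

```python
from collections import defaultdict

# Made-up data: (latest population, {group: percentage}) for each country
countries_sample = [
    (1000, {'GroupA': 60.0, 'GroupB': 40.0}),
    (500,  {'GroupA': 10.0, 'GroupC': 90.0}),
]

group_totals = defaultdict(float)  # missing keys start at 0.0
for population, shares in countries_sample:
    for group, percentage in shares.items():
        # percentage is on a 0-100 scale, so divide by 100 to get head counts
        group_totals[group] += population * percentage / 100.0

# Largest group first
ranking = sorted(group_totals.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```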


### Question 4 Hint

For all the parts: 
Create a country code to country name mapping by iterating over the children of root, using attrib and find
* Rivers
    1. Extract the rivers by iterating over root, for example with root.findall('river')
    2. Extract lengths and country codes
    3. Map the country codes to country names using the mapping generated above

The same analysis applies to airports and lakes
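
The hint above can be sketched end to end on a tiny invented document. Everything named here (`SAMPLE_XML`, `code_to_name`, `largest_by`, and the river "Shortriver") is hypothetical; the sketch assumes, as the hint does, that each feature lists its countries as space-separated car codes in a `country` attribute.

```python
from xml.etree import ElementTree as ET

# Invented sample that imitates the layout described in the hint.
SAMPLE_XML = """
<mondial>
  <country car_code="BR"><name>Brazil</name></country>
  <country car_code="PE"><name>Peru</name></country>
  <river country="BR PE">
    <name>Amazonas</name>
    <length>6448</length>
  </river>
  <river country="PE">
    <name>Shortriver</name>
    <length>120</length>
  </river>
</mondial>
"""

sample_root = ET.fromstring(SAMPLE_XML)

# Step 1: car code -> country name
code_to_name = {c.attrib['car_code']: c.find('name').text
                for c in sample_root.findall('country')}

def largest_by(root, tag, measure):
    """Return (name, value, [country names]) for the element with the largest measure."""
    # Step 2: consider only elements that actually carry the measure
    best = max(
        (e for e in root.findall(tag) if e.find(measure) is not None),
        key=lambda e: float(e.find(measure).text),
    )
    # Step 3: translate each car code, falling back to the code itself if unknown
    codes = best.attrib['country'].split()
    return (best.find('name').text,
            float(best.find(measure).text),
            [code_to_name.get(code, code) for code in codes])

longest_river = largest_by(sample_root, 'river', 'length')
print(longest_river)
```

Under the same assumptions, the helper could be called with `('lake', 'area')` or `('airport', 'elevation')` for the other two parts.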


In [138]:
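# Build (country name, car_code) pairs for every country
countries = []
for child in root.findall('country'):
    try:
        country = child.find('name').text
        code = child.attrib['car_code']
        countries.append((country, code))
    except (AttributeError, KeyError):
        continue
print countries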

[('Senegal', 'SN'), ('Guinea-Bissau', 'GNB'), ('Sierra Leone', 'WAL'), ('Uganda', 'EAU'), ('Lesotho', 'LS'), ('Madagascar', 'RM'), ('Malawi', 'MW'), ('Mozambique', 'MOC'), ('Mauritius', 'MS'), ('Mayotte', 'MAYO'), ('Swaziland', 'SD'), ('Reunion', 'REUN'), ('Saint Helena', 'HELX'), ('Sao Tome and Principe', 'STP'), ('Seychelles', 'SY')]


In [180]:
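# Collect [country codes, length, name] for every river with an integer length
rivers = []
for child in root.findall('river'):
    try:
        river_name = child.find('name').text
        code = child.attrib['country']
        length = int(child.find('length').text)
        rivers.append([code, length, river_name])
    except (AttributeError, KeyError, TypeError, ValueError):
        # skip rivers with a missing name, country, or usable length
        continue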
rivers.sort(key=lambda x: x[1], reverse=True)
print rivers[0]

['CO BR PE', 6448, 'Amazonas']


In [181]:
lakes = []
for child in root.findall('lake'):
    try:
        lake_name = child.find('name').text
        code = child.attrib['country']
        area = int(child.find('area').text)
        lakes.append([code, area, lake_name])
    except:
        continue
lakes.sort(key=lambda x: x[1], reverse=True)
print lakes[0]

['R AZ KAZ IR TM', 386400, 'Caspian Sea']


In [182]:
airports = []
for child in root.findall('airport'):
    try:
        airport_name = child.find('name').text
        code = child.attrib['country']
        elevation = int(child.find('elevation').text)
        airports.append([code, elevation, airport_name])
    except:
        continue
airports.sort(key=lambda x: x[1], reverse=True)
print airports[0]

['BOL', 4063, 'El Alto Intl']


In [189]:
# Resolve the car codes found above to country names
wanted_codes = set('CO BR PE R AZ KAZ IR TM BOL'.split())
answer = []
for country in countries:
    if country[1] in wanted_codes:
        answer.append(country[0])
print answer

['Russia', 'Iran', 'Turkmenistan', 'Azerbaijan', 'Kazakhstan', 'Colombia', 'Bolivia', 'Brazil', 'Peru']


In [201]:
print "The longest river is the " + rivers[0][2] + " located in " + rivers[0][0] + " (Colombia, Brazil, Peru)."
print "The length of the " + rivers[0][2] + " is " + str(rivers[0][1]) + "."
print "The largest lake is the " + lakes[0][2] + " located in " + lakes[0][0] + " (Russia, Azerbaijan, Kazakhstan, Iran, Turkmenistan)."
print "The area of the " + lakes[0][2] + " is " + str(lakes[0][1]) + "."
print "The airport with highest elevation is the " + airports[0][2] + " located in " + airports[0][0] + " (Bolivia)."
print "The elevation of the " + airports[0][2] + " is " + str(airports[0][1]) + "."

The longest river is the Amazonas located in CO BR PE (Colombia, Brazil, Peru).
The length of the Amazonas is 6448.
The largest lake is the Caspian Sea located in R AZ KAZ IR TM (Russia, Azerbaijan, Kazakhstan, Iran, Turkmenistan).
The area of the Caspian Sea is 386400.
The airport with highest elevation is the El Alto Intl located in BOL (Bolivia).
The elevation of the El Alto Intl is 4063.
