# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
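
As a minimal sketch of the traversal calls used in this notebook, the following parses a small inline document and contrasts `find`, `iterfind`, `iter`, and `findtext`. The XML string and its nesting are hypothetical, only loosely shaped like the Mondial data, so the real file may be structured differently.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document, loosely shaped like the Mondial data.
SAMPLE_XML = """
<mondial>
  <country car_code="AL">
    <name>Albania</name>
    <city><name>Tirana</name></city>
    <province>
      <name>Elbasan</name>
      <city><name>Elbasan</name></city>
    </province>
  </country>
  <country car_code="AND">
    <name>Andorra</name>
    <city><name>Andorra la Vella</name></city>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE_XML)

# find() returns the first matching direct child, or None if there is none.
first_country = root.find('country')
first_name = first_country.find('name').text

# iterfind() with a bare tag name yields matching direct children only.
direct_cities = [c.find('name').text for c in first_country.iterfind('city')]

# iter() walks the whole subtree, so it also reaches nested cities.
all_cities = [c.find('name').text for c in first_country.iter('city')]

# findtext() returns None when the requested child element is absent.
missing = first_country.findtext('infant_mortality')

print(first_name, direct_cities, all_cities, missing)
```

The difference between `iterfind('city')` and `iter('city')` matters whenever cities can be nested below another element, as in the hypothetical `province` above.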

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    #print child.find('name').text
    print(child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [4]:
# print names of all countries and their cities
# Python 2 version
#for element in document_tree.iterfind('country'):
#    print '* ' + element.find('name').text + ':',
#    capitals_string = ''
#    for subelement in element.getiterator('city'):
#        capitals_string += subelement.find('name').text + ', '
#    print capitals_string[:-2]
# Python 3 version
# (Element.getiterator() was removed in Python 3.9; use iter() instead)
for element in document_tree.iterfind('country'):
    # Collect the names of every city anywhere below this country
    city_names = [city.find('name').text for city in element.iter('city')]
    print('* ' + element.find('name').text + ': ' + ', '.join(city_names))

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella


****
## XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and the documentation at https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [5]:
document = ET.parse( './data/mondial_database.xml' )

In [6]:
document

<xml.etree.ElementTree.ElementTree at 0x105ed2438>

In [7]:
root = document.getroot()

In [8]:
root

<Element 'mondial' at 0x105ecc778>

In [9]:
# Create a dictionary mapping each country to its infant mortality rate
im_dict = {}
for country in document.iterfind('country'):
    name = country.find('name').text
    infant_mortality = country.findtext('infant_mortality')
    # Ignore countries that did not report an infant mortality rate
    if infant_mortality is not None:
        im_dict[name] = float(infant_mortality)

In [10]:
# Find the 10 countries with the lowest reported infant mortality rates
im_10 = sorted(im_dict.items(), key=lambda x: x[1])[:10]
print('The 10 countries with the lowest reported infant mortality rates are:')
print('Country: Rate')
for country_name, rate in im_10:
    print(country_name + ':', rate)

The 10 countries with the lowest reported infant mortality rates are:
Country: Rate
Monaco: 1.81
Japan: 2.13
Bermuda: 2.48
Norway: 2.48
Singapore: 2.53
Sweden: 2.6
Czech Republic: 2.63
Hong Kong: 2.73
Macao: 3.13
Iceland: 3.15
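
The ranking above sorts the whole dictionary and then slices it. As an alternative sketch, the standard-library `heapq` module can select the smallest or largest few entries directly. The `rates` mapping below is a hypothetical stand-in for `im_dict`, not real data.

```python
import heapq

# Hypothetical country -> rate mapping standing in for im_dict.
rates = {'A': 3.2, 'B': 1.8, 'C': 2.5, 'D': 4.1}

# nsmallest/nlargest return the n best items ordered by the key function.
lowest_two = heapq.nsmallest(2, rates.items(), key=lambda kv: kv[1])
highest_two = heapq.nlargest(2, rates.items(), key=lambda kv: kv[1])

print(lowest_two)
print(highest_two)
```

For a dictionary of a few hundred countries either approach is fine; `sorted(...)[:10]` is arguably the more readable of the two.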


In [11]:
# Make a dictionary mapping each city with a recorded population
# to one of its recorded population figures
city_pop_dict = {}
for country in document.iterfind('country'):
    # iterfind('city') only visits cities that are direct children of
    # the country element; country.iter('city') would also reach
    # cities nested deeper in the tree
    for city in country.iterfind('city'):
        city_name = city.find('name').text
        pop_dict = {}
        # Populate the population dictionary for this city with the 
        # population for each year it's available, i.e. year:population
        for pop in city.iterfind('population'):
            year = int(pop.get('year'))
            pop_dict[year] = int(pop.text)
        # Skip cities with no recorded population. Note that
        # max(pop_dict, key=pop_dict.get) selects the year with the
        # largest recorded population, which is not necessarily the
        # most recent year; max(pop_dict) would select the latest year.
        if pop_dict:
            peak_year = max(pop_dict, key=pop_dict.get)
            city_pop_dict[city_name] = pop_dict[peak_year]
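
The cell above chooses a year with `max(pop_dict, key=pop_dict.get)`. As a small sketch with made-up numbers (the dictionary below is hypothetical, not taken from the Mondial data), the following contrasts that call with plain `max(pop_dict)` for a city whose population has declined.

```python
# Hypothetical year -> population records for a single shrinking city.
pop_dict = {1990: 500000, 2000: 450000, 2010: 400000}

# max() over a dict compares its keys, so this is the most recent year.
latest_year = max(pop_dict)

# With key=pop_dict.get, max() compares the values instead and returns
# the year whose recorded population is largest.
peak_year = max(pop_dict, key=pop_dict.get)

print(latest_year, pop_dict[latest_year])
print(peak_year, pop_dict[peak_year])
```

The two calls agree only when the largest figure also belongs to the latest year, so which one is appropriate depends on whether "most recent" or "largest recorded" is the intended measure.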

In [12]:
# Find the 10 cities with the largest recorded populations
pop_10 = sorted(city_pop_dict.items(), reverse=True, key=lambda x: x[1])[:10]
print('The 10 cities with the greatest populations are:')
print('City: Population')
for i in pop_10:
    print(i[0]+':',i[1])

The 10 cities with the greatest populations are:
City: Population
Seoul: 10229262
Al Qahirah: 8471859
Bangkok: 7506700
Hong Kong: 7055071
Ho Chi Minh: 5968384
Singapore: 5076700
Al Iskandariyah: 4123869
New Taipei: 3939305
Busan: 3813814
Pyongyang: 3255288


In [13]:
# Create a dictionary mapping each ethnic group to its estimated
# population, summed over all countries
egroup_dict = {}
for country in document.iterfind('country'):
    pop_dict = {}
    for pop in country.iterfind('population'):
        year = int(pop.get('year'))
        pop_dict[year] = int(pop.text)
    # Note: key=pop_dict.get selects the year with the largest recorded
    # population; max(pop_dict) would select the most recent year instead
    peak_year = max(pop_dict, key=pop_dict.get)
    country_pop = pop_dict[peak_year]
    for eg in country.iterfind('ethnicgroup'):
        egroup = eg.text
        # Convert the group's percentage share into a head count
        pop = round(country_pop * float(eg.get('percentage')) / 100.)
        egroup_dict[egroup] = egroup_dict.get(egroup, 0) + pop
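
The per-group accumulation above can also be sketched with `collections.Counter`, which starts missing keys at zero and offers `most_common` for ranking. The `records` list is hypothetical illustration data, not values from the Mondial file.

```python
from collections import Counter

# Hypothetical (group, country population, percentage) records.
records = [
    ('Group A', 1000000, 60.0),
    ('Group B', 1000000, 40.0),
    ('Group A', 500000, 10.0),
]

totals = Counter()
for group, country_pop, percentage in records:
    # Missing keys start at 0, so no membership check is needed.
    totals[group] += round(country_pop * percentage / 100.)

# most_common(n) returns the n largest totals in descending order.
top = totals.most_common(2)
print(top)
```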

In [14]:
# Find the 10 most populous ethnic groups, based on the estimates computed above
eg_10 = sorted(egroup_dict.items(), reverse=True, key=lambda x: x[1])[:10]
print('The 10 most populous ethnic groups are:')
print('Group: Population')
for i in eg_10:
    print(i[0]+':',i[1])

The 10 most populous ethnic groups are:
Group: Population
Han Chinese: 1245058800
Indo-Aryan: 871815583
European: 494939517
African: 318359698
Dravidian: 302713744
Mestizo: 157855272
Bengali: 146776917
Russian: 136866551
Japanese: 127289008
Malay: 121993620


In [15]:
# Create a dictionary mapping car codes to country names
country_dict = {}
for country in document.iterfind('country'):
    country_name = country.find('name').text
    country_code = country.get('car_code')
    country_dict[country_code] = country_name

In [16]:
# Find the length and country of origin of each river
river_country = {}
river_length = {}
for river in document.iterfind('river'):
    river_name = river.find('name').text
    # If more than one source country is listed, take the first one
    country_code = river.find('source').get('country').split()[0]
    length = river.findtext('length')
    # Ignore any rivers whose lengths are not recorded
    if length is not None:
        river_length[river_name] = float(length)
        river_country[river_name] = country_dict[country_code]

In [17]:
# Find the name and country of origin of the longest river
lr = max(river_length, key=river_length.get)
print('The longest river is the {}.'.format(lr))
print('It is {} km long and originates in {}.'.format(river_length[lr], river_country[lr]))

The longest river is the Amazonas.
It is 6448.0 km long and originates in Peru.


In [18]:
# Find the area and country of each lake
lake_country = {}
lake_area = {}
for lake in document.iterfind('lake'):
    lake_name = lake.find('name').text
    # If more than one country is listed, take the first one
    country_code = lake.get('country').split()[0]
    area = lake.findtext('area')
    # Ignore any lakes whose areas are not recorded
    if area is not None:
        lake_area[lake_name] = float(area)
        lake_country[lake_name] = country_dict[country_code]

In [19]:
# Find the name and country of the largest lake
ll = max(lake_area, key=lake_area.get)
print('The largest lake is the {}.'.format(ll))
print('It has an area of {} km**2 and is located in {}.'.format(lake_area[ll], lake_country[ll]))

The largest lake is the Caspian Sea.
It has an area of 386400.0 km**2 and is located in Russia.


In [20]:
# Find the elevation and country of each airport
ap_country = {}
ap_elevation = {}
for ap in document.iterfind('airport'):
    ap_name = ap.find('name').text
    country_code = ap.get('country')
    el = ap.find('elevation').text
    # Ignore any airports whose elevations are not recorded
    if el is not None:
        ap_elevation[ap_name] = int(el)
        ap_country[ap_name] = country_dict[country_code]
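
The airport loop above assumes every airport has an `elevation` child. As a hedged sketch of a more defensive variant, `findtext` returns `None` when the element is absent and an empty string when it is present but empty, so a single truthiness check covers both cases. The sample XML and airport names below are hypothetical.

```python
from xml.etree import ElementTree as ET

# Hypothetical airports: one with a value, one empty, one without the tag.
SAMPLE_XML = """
<mondial>
  <airport country="X"><name>High Field</name><elevation>4000</elevation></airport>
  <airport country="X"><name>Empty Field</name><elevation/></airport>
  <airport country="X"><name>Unknown Field</name></airport>
</mondial>
"""

root = ET.fromstring(SAMPLE_XML)

elevations = {}
for ap in root.iterfind('airport'):
    # None if <elevation> is absent, '' if it is present but empty.
    el = ap.findtext('elevation')
    if el:
        elevations[ap.findtext('name')] = int(el)

print(elevations)
```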

In [21]:
# Find the name and country of the airport at the greatest elevation
ge = max(ap_elevation, key=ap_elevation.get)
print('The airport at the greatest elevation is {}.'.format(ge))
print('Its elevation is {} m and it is located in {}.'.format(ap_elevation[ge], ap_country[ge]))

The airport at the greatest elevation is El Alto Intl.
Its elevation is 4063 m and it is located in Bolivia.
