# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [2]:
import xml.etree.ElementTree as ET
import numpy as np
import pandas as pd

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
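
As a minimal sketch of the traversal calls used in this notebook, the snippet below parses a small inline XML string and contrasts `find`, `findall`, and `iter`. The sample data and the names `SAMPLE_XML` and `sample_root` are made up for illustration; they are not part of the Mondial file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, shaped loosely like the Mondial file.
SAMPLE_XML = """
<mondial>
  <country car_code="AL">
    <name>Albania</name>
    <city><name>Tirana</name></city>
    <province>
      <city><name>Shkoder</name></city>
    </province>
  </country>
  <country car_code="AND">
    <name>Andorra</name>
    <city><name>Andorra la Vella</name></city>
  </country>
</mondial>
"""

sample_root = ET.fromstring(SAMPLE_XML)

# find() returns the first matching direct child (or None if there is none).
first_country = sample_root.find('country')
first_name = first_country.find('name').text

# findall() returns all matching direct children as a list.
direct_cities = [c.find('name').text for c in first_country.findall('city')]

# iter() walks the whole subtree, so cities nested in provinces are included.
all_cities = [c.find('name').text for c in first_country.iter('city')]

print(first_name)
print(direct_cities)
print(all_cities)
```

The difference between `findall` and `iter` matters whenever elements of interest can sit at several depths of the tree.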

In [105]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [106]:
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [107]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    # iter() walks the whole subtree, so cities nested at any depth are found
    city_names = [city.find('name').text for city in element.iter('city')]
    print('* ' + element.find('name').text + ': ' + ', '.join(city_names))

****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [3]:
document = ET.parse( './data/mondial_database.xml' )
root = document.getroot()

In [4]:
countries = []
infmort = []
for child in document.findall('country'):
    countries.append(child.find('name').text)
    # infant_mortality is missing for some countries; record NaN for those
    node = child.find('infant_mortality')
    infmort.append(node.text if node is not None else str(np.nan))

In [5]:
for i in range(len(infmort)):
    infmort[i] = float(infmort[i])

In [6]:
df_infmort = pd.DataFrame([countries, infmort]).T
df_infmort.columns = ['Countries', 'Mortality']
lowest_infmort = df_infmort.sort_values('Mortality', ascending=True).head(10)
print('The 10 countries with the lowest infant mortality:')
lowest_infmort

The 10 countries with the lowest infant mortality:


|     | Countries | Mortality |
| --- | --- | --- |
| 38 | Monaco | 1.81 |
| 98 | Japan | 2.13 |
| 117 | Bermuda | 2.48 |
| 36 | Norway | 2.48 |
| 106 | Singapore | 2.53 |
| 37 | Sweden | 2.6 |
| 10 | Czech Republic | 2.63 |
| 78 | Hong Kong | 2.73 |
| 79 | Macao | 3.13 |
| 44 | Iceland | 3.15 |
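
The string-to-float conversion and the missing-value handling above can also be left to pandas. The following is a minimal sketch on made-up values (the frame `sample` and its contents are hypothetical), using `pd.to_numeric` with `errors='coerce'` so that missing or unparsable entries become NaN.

```python
import pandas as pd

# Hypothetical values for illustration only; None stands for a missing element.
sample = pd.DataFrame({
    'Countries': ['A', 'B', 'C', 'D'],
    'Mortality': ['4.2', None, '1.5', '3.0'],
})

# Convert text to floats; anything that cannot be parsed becomes NaN.
sample['Mortality'] = pd.to_numeric(sample['Mortality'], errors='coerce')

# Drop rows without a rate, then take the smallest values.
lowest_two = (
    sample.dropna(subset=['Mortality'])
          .sort_values('Mortality', ascending=True)
          .head(2)
)
print(lowest_two)
```

Dropping the NaN rows explicitly makes it clear that countries without a reported rate are excluded from the ranking rather than sorted to the end.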


In [7]:
# note: this ranks countries (not cities) by their latest population figure
populations = []
for country in document.findall('country'):
    # the last <population> entry is taken as the latest estimate
    pop_nodes = country.findall('population')
    if not pop_nodes:
        continue
    populations.append([country.find('name').text, float(pop_nodes[-1].text)])

df_pop = pd.DataFrame(populations, columns=['Country', 'Population'])
print("The countries with the greatest 10 populations are:")
df_pop.sort_values('Population', ascending=False).head(10)

The countries with the greatest 10 populations are:


|     | Country | Population |
| --- | --- | --- |
| 55 | China | 1360720000 |
| 67 | India | 1210854977 |
| 120 | United States | 318857056 |
| 88 | Indonesia | 252124458 |
| 176 | Brazil | 202768562 |
| 57 | Pakistan | 173149306 |
| 202 | Nigeria | 164294516 |
| 65 | Bangladesh | 149772364 |
| 23 | Russia | 143666931 |
| 98 | Japan | 127298000 |
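
The table above ranks countries, while item 2 of the exercise asks for cities. A possible approach for cities is sketched below on a small inline sample. It assumes that each `city` element carries `population` children with a `year` attribute, and that cities may be nested inside provinces; these are assumptions about the file layout, and the sample data and the helper `latest_population` are hypothetical.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Hypothetical sample; element and attribute names are assumed to mirror Mondial.
SAMPLE_XML = """
<mondial>
  <country car_code="X">
    <name>Xland</name>
    <city>
      <name>Alpha</name>
      <population year="2001">100</population>
      <population year="2011">150</population>
    </city>
    <province>
      <city>
        <name>Beta</name>
        <population year="2011">300</population>
      </city>
      <city>
        <name>Gamma</name>
      </city>
    </province>
  </country>
</mondial>
"""

def latest_population(element):
    """Return the population entry with the highest year, or None if absent."""
    nodes = element.findall('population')
    if not nodes:
        return None
    latest = max(nodes, key=lambda node: int(node.get('year', 0)))
    return float(latest.text)

sample_root = ET.fromstring(SAMPLE_XML)
rows = []
for country in sample_root.iterfind('country'):
    # iter() also reaches cities nested inside provinces
    for city in country.iter('city'):
        population = latest_population(city)
        if population is not None:
            rows.append([city.find('name').text, country.find('name').text, population])

df_cities = pd.DataFrame(rows, columns=['City', 'Country', 'Population'])
top_cities = df_cities.sort_values('Population', ascending=False).head(10)
print(top_cities)
```

Selecting by the `year` attribute instead of by position avoids relying on the order in which the estimates happen to be listed.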


### Find all ethnic group populations

In [8]:
ethnic = []
for country in document.findall('country'):
    # the last <population> entry is taken as the latest estimate
    pop_nodes = country.findall('population')
    if not pop_nodes:
        continue
    curr_pop = float(pop_nodes[-1].text)
    for node_eth in country.findall('ethnicgroup'):
        # 'percentage' is given in percent, so divide by 100 to get a head count
        ethnic.append([node_eth.text, curr_pop * float(node_eth.get('percentage')) / 100])

In [9]:
df_ethnic = pd.DataFrame(ethnic, columns=['Ethnic Group', 'Population'])

In [24]:
print('The top 10 Ethnic Group populations are:')
df_ethnic.groupby('Ethnic Group').sum().sort_values('Population', ascending=False).head(10)

The top 10 Ethnic Group populations are:


| Ethnic Group | Population |
| --- | --- |
| Han Chinese | 1245059000.0 |
| Indo-Aryan | 871815600.0 |
| European | 494872200.0 |
| African | 318325100.0 |
| Dravidian | 302713700.0 |
| Mestizo | 157734400.0 |
| Bengali | 146776900.0 |
| Russian | 131857000.0 |
| Japanese | 126534200.0 |
| Malay | 121993600.0 |
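
The aggregation step can be checked on a tiny hand-made example. The sketch below uses hypothetical figures and assumes the `percentage` attribute is expressed in percent (0 to 100), so each share is divided by 100 before the per-group totals are summed.

```python
import pandas as pd

# Hypothetical figures: (group, country population, percentage of that population).
records = [
    ('Group A', 1000.0, 60.0),
    ('Group B', 1000.0, 40.0),
    ('Group A', 500.0, 10.0),
]

# Convert each percentage share into an absolute head count.
rows = [[group, population * percentage / 100.0]
        for group, population, percentage in records]
df = pd.DataFrame(rows, columns=['Ethnic Group', 'Population'])

# Sum the head counts of each group over all countries, largest first.
totals = (
    df.groupby('Ethnic Group')['Population']
      .sum()
      .sort_values(ascending=False)
)
print(totals)
```

A quick sanity check on real data is that no group total should exceed the world population.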


In [169]:
# Get all rivers that have a recorded length
rivers = []
for river in document.findall('river'):
    if river.find('length') is not None:
        rivers.append([river.get('country'), river.find('name').text, float(river.find('length').text)])

In [170]:
# Get all airports that have a non-empty elevation
airports = []
for airport in document.findall('airport'):
    elevation = airport.find('elevation')
    if elevation is not None and elevation.text is not None:
        airports.append([airport.get('country'), airport.find('name').text, float(elevation.text)])

In [171]:
# Get all lakes that have a recorded area
lakes = []
for lake in document.findall('lake'):
    if lake.find('area') is not None:
        lakes.append([lake.get('country'), lake.find('name').text, float(lake.find('area').text)])

In [172]:
# Lookup table mapping country codes (car_code) to country names
country_codes = []
for country in document.findall('country'):
    country_codes.append([country.get('car_code'), country.find('name').text])

In [180]:
df_rivers = pd.DataFrame(rivers, columns=['Country', 'River', 'Length'])
df_airports = pd.DataFrame(airports, columns=['Country', 'Airport', 'Elevation'])
df_lakes = pd.DataFrame(lakes, columns=['Country', 'Lake', 'Area'])
df_countries = pd.DataFrame(country_codes, columns=['Code', 'Country'])
df_countries = df_countries.set_index('Code')
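
The `country` attribute of rivers and lakes holds space-separated codes rather than names. One way the lookup table could be used to translate them is sketched below; the helper `codes_to_names` and the three-row table `lookup` are hypothetical stand-ins, built the same way as `df_countries`.

```python
import pandas as pd

# Hypothetical lookup table, indexed by code in the same way as df_countries.
lookup = pd.DataFrame(
    [['CO', 'Colombia'], ['BR', 'Brazil'], ['PE', 'Peru']],
    columns=['Code', 'Country'],
).set_index('Code')

def codes_to_names(codes, table):
    """Translate a space-separated code string into comma-separated names.

    Codes missing from the table are passed through unchanged.
    """
    names = [table['Country'].get(code, code) for code in codes.split()]
    return ', '.join(names)

readable = codes_to_names('CO BR PE', lookup)
print(readable)
```

Applied to a results frame, this could look like `df_rivers['Country'].apply(codes_to_names, table=df_countries)`.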

In [194]:
longest_riv = df_rivers.sort_values('Length', ascending=False).head(1).reset_index()
longest_riv = longest_riv[['Country', 'River', 'Length']]
print("The longest river in the world:")
longest_riv

The longest river in the world:


|     | Country | River | Length |
| --- | --- | --- | --- |
| 0 | CO BR PE | Amazonas | 6448 |


In [195]:
largest_lake = df_lakes.sort_values('Area', ascending=False).head(1)
largest_lake.reset_index(inplace=True)
largest_lake = largest_lake[['Country', 'Lake', 'Area']]
print('The largest lake in the world:')
largest_lake

The largest lake in the world:


|     | Country | Lake | Area |
| --- | --- | --- | --- |
| 0 | R AZ KAZ IR TM | Caspian Sea | 386400 |


In [196]:
highest_airport = df_airports.sort_values('Elevation', ascending=False).head(1).reset_index()
highest_airport = highest_airport[['Country', 'Airport', 'Elevation']]
print('The highest airport in the world:')
highest_airport

The highest airport in the world:


|     | Country | Airport | Elevation |
| --- | --- | --- | --- |
| 0 | BOL | El Alto Intl | 4063 |
