# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html

+ for the Python 3 version of the same reference, see https://docs.python.org/3.6/library/xml.etree.elementtree.html
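
As a minimal sketch of the traversal methods those references describe, the block below contrasts `find()`, `findall()` and `iter()`. The inline XML is a hypothetical miniature that only loosely imitates the Mondial layout; it is not the real data file.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document (an assumption, not the real Mondial file)
SAMPLE_XML = """
<mondial>
  <country>
    <name>Albania</name>
    <city><name>Tirana</name></city>
    <city><name>Shkodër</name></city>
  </country>
  <country>
    <name>Andorra</name>
    <city><name>Andorra la Vella</name></city>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE_XML.strip())

# find() returns the first matching direct child, or None if there is none
first_name = root.find('country').find('name').text

# findall() returns a list of all matching direct children
country_names = [c.find('name').text for c in root.findall('country')]

# iter() walks the whole subtree in document order, at any depth
city_names = [city.find('name').text for city in root.iter('city')]

print(first_name)
print(country_names)
print(city_names)
```

Note that `find()` and `findall()` with a bare tag name only look at direct children, whereas `iter()` descends through every level.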

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [4]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':', end=' ')
    cities_string = ''
    for subelement in element.iter('city'):    # iter() replaces getiterator(), which was removed in Python 3.9
        cities_string += subelement.find('name').text + ', '
    print(cities_string[:-2])                  # drop the trailing ', '

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella
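
The same listing can be built without trimming a trailing separator. The sketch below uses `str.join` on a list of city names. The helper name `cities_by_country` and the inline sample XML are hypothetical, chosen only for illustration.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data (an assumption, not the real Mondial file)
SAMPLE_XML = """
<mondial>
  <country>
    <name>Serbia</name>
    <city><name>Beograd</name></city>
    <city><name>Novi Sad</name></city>
  </country>
  <country>
    <name>Montenegro</name>
    <city><name>Podgorica</name></city>
  </country>
</mondial>
"""


def cities_by_country(root):
    """Return one '* Country: City, City' line per country element."""
    lines = []
    for country in root.iterfind('country'):
        # Collect the city names first, then join them in a single step
        cities = [city.find('name').text for city in country.iter('city')]
        lines.append('* ' + country.find('name').text + ': ' + ', '.join(cities))
    return lines


lines = cities_by_country(ET.fromstring(SAMPLE_XML.strip()))
for line in lines:
    print(line)
```

Joining a list avoids the `[:-2]` slice and also behaves sensibly when a country has no cities, since the result is then an empty string.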


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [18]:
document = ET.parse( './data/mondial_database.xml' )
import numpy as np

### Exercise 1: 10 Countries with lowest infant mortality

#### Attempted Steps:
1. Extract a list of tuples, each containing a country and its infant mortality rate
    Note --> countries with no infant mortality rate are recorded as NaN
2. Load the list into a pandas data frame and remove missing data
3. Sort the data frame and print the 10 countries with the lowest infant mortality

In [92]:
data = document.getroot() # root element of the parsed tree

# Step 1: extract data as a list of tuples

countries = []
for country in data.findall('country'):                    # iterate through the countries, building one tuple each
    name = country.find('name').text                       # country name
    try:                                                   # only some countries record an infant mortality rate
        mortality = country.find('infant_mortality').text  # the rate, still as a string
    except AttributeError:                                 # find() returned None, so .text raised AttributeError
        mortality = np.nan                                 # record the missing value as NaN

    ob = ( name, mortality )
    countries.append(ob)                                   # append tuple to the list of countries
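
As an alternative to the `try`/`except` pattern, a minimal sketch using `Element.findtext()` with a default value is shown below. It also converts the rate to a float straight away. The inline XML is hypothetical sample data, not the real file.

```python
import math
from xml.etree import ElementTree as ET

# Hypothetical sample data: one country with a rate, one without
SAMPLE_XML = """
<mondial>
  <country><name>Monaco</name><infant_mortality>1.81</infant_mortality></country>
  <country><name>Kosovo</name></country>
</mondial>
"""

root = ET.fromstring(SAMPLE_XML.strip())

countries = []
for country in root.findall('country'):
    name = country.findtext('name')
    # findtext() returns the default when the child element is missing;
    # float('nan') then yields a NaN that pandas can drop later
    mortality = float(country.findtext('infant_mortality', default='nan'))
    countries.append((name, mortality))

print(countries)
```

Converting at extraction time means the resulting column is numeric from the start, so a later sort compares numbers rather than strings.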


In [93]:
# Step 2: generate pandas data frame

import pandas as pd

labels = ['country', 'infant_mortality']                  # generate column names
mortality_data = pd.DataFrame.from_records(countries, columns = labels).dropna() #generate data frame while removing NaN values

mortality_data.head(10)

| | country | infant_mortality |
|---|---|---|
| 0 | Albania | 13.19 |
| 1 | Greece | 4.78 |
| 2 | Macedonia | 7.9 |
| 3 | Serbia | 6.16 |
| 6 | Andorra | 3.69 |
| 7 | France | 3.31 |
| 8 | Spain | 3.33 |
| 9 | Austria | 4.16 |
| 10 | Czech Republic | 2.63 |
| 11 | Germany | 3.46 |


In [104]:
# Step 3.1: Confirm proper data structures

print(mortality_data.dtypes)                                                           # demonstrate type of data to be sorted
mortality_data['infant_mortality'] = pd.to_numeric(mortality_data['infant_mortality']) # coerce data to numeric data
print(mortality_data.dtypes)                                                           # confirm correct data type

country              object
infant_mortality    float64
dtype: object
country              object
infant_mortality    float64
dtype: object
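
The conversion step can be illustrated in isolation. This is a sketch on a small hypothetical frame, where the `'n/a'` entry is an invented placeholder for an unparseable value. It assumes that passing `errors='coerce'` to `pd.to_numeric` is acceptable, which turns such entries into NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical frame whose rates are still strings, as extracted from XML text
raw = pd.DataFrame({
    'country': ['Monaco', 'Japan', 'Nowhere'],
    'infant_mortality': ['1.81', '2.13', 'n/a'],
})

# errors='coerce' converts unparseable strings to NaN rather than raising
raw['infant_mortality'] = pd.to_numeric(raw['infant_mortality'], errors='coerce')
converted_dtype = raw['infant_mortality'].dtype

# Rows whose rate could not be parsed are dropped afterwards
clean = raw.dropna(subset=['infant_mortality'])

print(converted_dtype)
print(clean)
```

Without the numeric conversion, sorting a string column orders values lexicographically, so `'13.19'` would sort before `'2.13'`.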


In [105]:
# Step 3.2: Print output
mortality_data.sort_values(by = 'infant_mortality').head(10)

| | country | infant_mortality |
|---|---|---|
| 38 | Monaco | 1.81 |
| 98 | Japan | 2.13 |
| 117 | Bermuda | 2.48 |
| 36 | Norway | 2.48 |
| 106 | Singapore | 2.53 |
| 37 | Sweden | 2.6 |
| 10 | Czech Republic | 2.63 |
| 78 | Hong Kong | 2.73 |
| 79 | Macao | 3.13 |
| 44 | Iceland | 3.15 |
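
For a "lowest N" query like this one, `DataFrame.nsmallest` is a possible shortcut for sorting and then taking the head. The sketch below runs on a small hypothetical frame that reuses a few of the values shown above. It is offered as an alternative, not as the method used in this notebook.

```python
import pandas as pd

# Small illustrative frame (a hand-picked subset, not the full data set)
sample = pd.DataFrame({
    'country': ['Albania', 'Monaco', 'Greece', 'Japan', 'Singapore'],
    'infant_mortality': [13.19, 1.81, 4.78, 2.13, 2.53],
})

# nsmallest(n, column) returns the n rows with the smallest column values,
# ordered from smallest to largest
lowest = sample.nsmallest(3, 'infant_mortality')

print(lowest)
```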


### Exercise 2: 10 cities with the largest population

#### Attempted Steps: