# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

****
## XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and the ElementTree documentation at https://docs.python.org/2.7/library/xml.etree.elementtree.html (updated Python 3 version: https://docs.python.org/3.6/library/xml.etree.elementtree.html), find:

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [1]:
# setup packages
from xml.etree import ElementTree as ET
import numpy as np
import pandas as pd

# import data
document = ET.parse( './data/mondial_database.xml' )
data = document.getroot() # transform tree into readable data

### Exercise 1: 10 Countries with lowest infant mortality

#### Attempted Steps:
1. Extract a list of tuples, each containing a country and its infant mortality rate
    Note: countries with no infant mortality rate are recorded as `NaN`
2. Merge the list into a pandas data frame and remove missing data
3. Sort the data frame and print the 10 countries with the lowest infant mortality

In [2]:
data = document.getroot() # transform tree into readable data

# Step 1: extract data as a list of tuples

countries = []
for country in data.findall('country'):                    # iterate through the countries in the data, building tuples
    name = country.find('name').text                       # record the country name
    try:                                                   # only some countries have an infant_mortality element
        mortality = country.find('infant_mortality').text  # if the element exists, record its text
    except AttributeError:                                 # find() returns None when the element is missing, so .text raises AttributeError
        mortality = np.nan                                 # record the missing data as NaN in the tuple

    ob = ( name, mortality )
    countries.append(ob)                                   # append tuple to the list of countries
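
As an alternative to the `try`/`except` above, `Element.findtext` returns `None` when a child element is missing, so the missing-data case can be handled with a plain check. A minimal sketch on a made-up two-country document (the names and values are placeholders, not Mondial data):

```python
from xml.etree import ElementTree as ET

# Tiny stand-in for the Mondial file (placeholder values, for illustration only)
sample = ET.fromstring(
    "<mondial>"
    "<country><name>A</name><infant_mortality>4.5</infant_mortality></country>"
    "<country><name>B</name></country>"
    "</mondial>"
)

rows = []
for country in sample.findall('country'):
    name = country.findtext('name')
    mortality = country.findtext('infant_mortality')   # None when the element is missing
    value = float(mortality) if mortality is not None else float('nan')
    rows.append((name, value))

print(rows)
```

Converting to `float` at extraction time also means the later `pd.to_numeric` step would not be needed for this column.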


In [3]:
# Step 2: generate pandas data frame

labels = ['country', 'infant_mortality']                  # generate column names
mortality_data = pd.DataFrame.from_records(countries, columns = labels).dropna() #generate data frame while removing NaN values

mortality_data.head(10)

index,country,infant_mortality
0,Albania,13.19
1,Greece,4.78
2,Macedonia,7.9
3,Serbia,6.16
6,Andorra,3.69
7,France,3.31
8,Spain,3.33
9,Austria,4.16
10,Czech Republic,2.63
11,Germany,3.46


In [4]:
# Step 3.1: Confirm proper data structures

print(mortality_data.dtypes)                                                           # demonstrate type of data to be sorted
mortality_data['infant_mortality'] = pd.to_numeric(mortality_data['infant_mortality']) # coerce data to numeric data
print(mortality_data.dtypes)                                                           # confirm correct data type

country             object
infant_mortality    object
dtype: object
country              object
infant_mortality    float64
dtype: object


In [5]:
# Step 3.2: Print output
mortality_data.sort_values(by = 'infant_mortality').head(10)

index,country,infant_mortality
38,Monaco,1.81
98,Japan,2.13
117,Bermuda,2.48
36,Norway,2.48
106,Singapore,2.53
37,Sweden,2.6
10,Czech Republic,2.63
78,Hong Kong,2.73
79,Macao,3.13
44,Iceland,3.15
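
Steps 2 and 3 can also be written more compactly: `pd.to_numeric` with `errors='coerce'` turns unparseable values into `NaN`, and `DataFrame.nsmallest` returns the lowest rows directly. A sketch on made-up values (the frame below is illustrative, not the Mondial data):

```python
import pandas as pd

# Placeholder data: mortality rates stored as strings, one of them missing
df = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'infant_mortality': ['4.5', '2.1', None, '3.0'],
})

# Coerce to float; anything that cannot be parsed becomes NaN
df['infant_mortality'] = pd.to_numeric(df['infant_mortality'], errors='coerce')

# Drop the missing rows, then take the two lowest rates
lowest = df.dropna(subset=['infant_mortality']).nsmallest(2, 'infant_mortality')
print(lowest)
```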


### Exercise 2: 10 cities with the largest population

#### Attempted Steps:
1. Make a list of tuples, where each tuple is the name of the city, most recent population, year of estimate/census
2. Use the list to create a PD data frame and ensure that the data are proper numeric values
3. Print the 10 cities with the largest populations

In [6]:
# Step 1: make the list of city_name-population tuples

cities = []                                              # empty list of city tuples, to be filled

for country in data.findall('country'):                  # iterate list of all countries
    for city in country.iter('city'):                    # iterate list of all cities in the country
        
        city_name = city.find('name').text               # record the city name
        last_year = 0                                    # set a base for tracking the last year tested
        for pop_count in city.findall('population'):     # make a list of all population measurements/estimates for a city
            year = int(pop_count.attrib['year'])         # record the year of the population measurement/estimate
            if (year >= last_year):                      # test if current year or more recent than last year
                pop = int(pop_count.text)                # if more recent, update population measurement/estimate
                last_year = year                         # and update last_year to current year
        
        ob = (city_name, pop, year)                      # make 3-item tuple with city name, population, and year of population
        cities.append(ob)                                # append tuple to the list of cities using the most recent population
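
One thing to watch in the loop above: `pop` and `year` are not reset for each city, so a city with no `population` element would reuse the values left over from the previous city, and the tuple stores the last `year` iterated rather than `last_year`. The sketch below is one possible variant, not the original method: it picks the newest measurement with `max` and skips cities without any. The helper name `latest_population` and the sample document are assumptions made for illustration.

```python
from xml.etree import ElementTree as ET

# Placeholder document: one city with two measurements, one with none,
# and one nested inside a province
sample = ET.fromstring(
    "<mondial><country><name>X</name>"
    "<city><name>Alpha</name>"
    "<population year='2001'>100</population>"
    "<population year='2011'>150</population></city>"
    "<city><name>Beta</name></city>"
    "<province><city><name>Gamma</name>"
    "<population year='2005'>70</population></city></province>"
    "</country></mondial>"
)

def latest_population(city):
    """Return (population, year) for the newest <population> child, or None."""
    counts = city.findall('population')
    if not counts:
        return None
    newest = max(counts, key=lambda p: int(p.attrib['year']))
    return int(newest.text), int(newest.attrib['year'])

cities = []
for country in sample.findall('country'):
    for city in country.iter('city'):          # iter() also reaches cities nested in provinces
        latest = latest_population(city)
        if latest is None:                      # no measurement: skip instead of reusing stale values
            continue
        cities.append((city.findtext('name'), *latest))

print(cities)
```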


In [7]:
# Step 2.1: Use the list to create a PD data frame

labels = ['city', 'population', 'year']
city_pops = pd.DataFrame.from_records(cities, columns = labels)

city_pops.head(10)

index,city,population,year
0,Tirana,418495,2011
1,Shkodër,77075,2011
2,Durrës,113249,2011
3,Vlorë,79513,2011
4,Elbasan,78703,2011
5,Korçë,51152,2011
6,Komotini,51152,2011
7,Kavala,58790,2011
8,Athina,664046,2011
9,Peiraias,163688,2011


In [8]:
# Step 2.2: ensure that the data are proper numeric values

# Step 2.2.1: test data types
print(city_pops.dtypes)

city          object
population     int64
year           int64
dtype: object


In [9]:
# check passed; moving to step 3

# Step 3: Sort and print list of top 10 cities by most recent population
city_pops.sort_values(by = 'population', ascending = False).head(10)

index,city,population,year
1341,Shanghai,22315474,2010
771,Istanbul,13710512,2012
1527,Mumbai,12442373,2011
479,Moskva,11979529,2013
1340,Beijing,11716620,2010
2810,São Paulo,11152344,2010
1342,Tianjin,11090314,2010
1064,Guangzhou,11071424,2010
1582,Delhi,11034555,2011
1067,Shenzhen,10358381,2010


### Exercise 3: 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)

#### Steps Attempted:
1. Extract list of ethnic group population percentages by country and country populations
2. Build a PD data frame and convert from country populations and ethnicity percentages to ethnicity population counts
3. Merge data into single counts by ethnic group
4. Sort and print the 10 most populous groups

In [10]:
# Step 1: extract ethnic group populations by country

country_ethnicity = []

for country in data.findall('country'):                              # iterate list of all countries
    country_name = country.find('name').text                         # record the country name
    
    # find the newest/best population estimate
    last_year = 0                                                    # set base for year updates
    last_est_type = ""                                               # set estimate type base ("est." v. "census")
    
    for pop_est in country.findall('population'):                    # iterate through population measurements for the country
        
        year = int(pop_est.attrib['year'])                           # record year of current measurement
        try:
            est_type = pop_est.attrib['measured']                    # record current measurement type ("est." v. "census")
        except KeyError:                                             # unless est. type is missing, then enter as blank string
            est_type = ""
        
        # determine if the current population measurement is better than the previous estimate
        newer = year > last_year                                     # newer measurement if year is greater
        better = est_type == "census" and last_est_type != "census"  # better estimate if change from estimate to census
        worse = est_type != "census." and last_est_type == "census"  # worse if change from census to estimate
        if( better or ( newer and not worse ) ):                     # if the estimate is better or newer (but not worse!)...
            last_year = year                                         # ...then update the year
            last_est_type = est_type                                 # ...update the measurement type
            pop = int(pop_est.text)                                  # ...and update the measurement!
        
    # get ethnicities and their population percentages
    for ethnicity in country.iter('ethnicgroup'):                    # iterate list of all ethnicities in the country
        group = ethnicity.text                                       # record ethnic group name
        pop_p100 = float(ethnicity.attrib['percentage'])             # record population percentage as a float
        
        ob = (country_name, year, pop, est_type, group, pop_p100)    # make 6-item tuple with the recorded data
        country_ethnicity.append(ob)                                 # append tuple to the list of country-ethnic groups
        #print(ob) #--> removed; used to test if the algorithm worked
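
The selection rule (prefer a census over an estimate, then prefer the newer year) can also be isolated in a small function. Note that the cell above compares `est_type` against `"census."` with a trailing period, which never equals the `measured` values seen in the data (`census`, `estimate`), so `worse` is true whenever a census has already been recorded; the tuple also stores the last `year` and `est_type` iterated rather than those of the chosen measurement. The sketch below is a swapped-in approach using `max` with a tuple key, not the original loop, and `best_population` is a hypothetical helper name.

```python
from xml.etree import ElementTree as ET

# Placeholder country with census and non-census measurements
sample = ET.fromstring(
    "<country><name>X</name>"
    "<population year='1990'>900</population>"
    "<population year='2001' measured='census'>1000</population>"
    "<population year='2011' measured='census'>1200</population>"
    "<population year='2013' measured='estimate'>1250</population>"
    "</country>"
)

def best_population(country):
    """Return (population, year, measured) for the preferred measurement, or None.

    Census figures beat non-census figures; within the same kind, newer wins.
    """
    candidates = country.findall('population')
    if not candidates:
        return None
    best = max(
        candidates,
        key=lambda p: (p.get('measured') == 'census', int(p.get('year'))),
    )
    return int(best.text), int(best.get('year')), best.get('measured', '')

print(best_population(sample))
```

Returning the year and type together with the count keeps the three values consistent with each other when they are later stored in the tuple.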

In [11]:
# Step 2: Build data frame

# Step 2.1: build dataframe
labels = ['country', 'year', 'total_pop', 'est_type', 'ethnicity', 'percentage']
ethnic_groups = pd.DataFrame.from_records(country_ethnicity, columns = labels)

# Step 2.2: test datatypes
print(ethnic_groups.dtypes)
ethnic_groups.head(10)

country        object
year            int64
total_pop       int64
est_type       object
ethnicity      object
percentage    float64
dtype: object


index,country,year,total_pop,est_type,ethnicity,percentage
0,Albania,2011,3069275,census,Albanian,95.0
1,Albania,2011,3069275,census,Greek,3.0
2,Greece,2011,1096810,census,Greek,93.0
3,Macedonia,2011,808724,estimate,Macedonian,64.2
4,Macedonia,2011,808724,estimate,Albanian,25.2
5,Macedonia,2011,808724,estimate,Turkish,3.9
6,Macedonia,2011,808724,estimate,Gypsy,2.7
7,Macedonia,2011,808724,estimate,Serb,1.8
8,Serbia,2011,7620531,census,Serb,82.9
9,Serbia,2011,7620531,census,Montenegrin,0.9


In [12]:
# Step 2.3: append population counts for ethnic groups
ethnic_groups['pop_by_ethnicity'] = (ethnic_groups['total_pop'] * ethnic_groups['percentage'] / 100).round(0)
ethnic_groups.head(10)

index,country,year,total_pop,est_type,ethnicity,percentage,pop_by_ethnicity
0,Albania,2011,3069275,census,Albanian,95.0,2915811.0
1,Albania,2011,3069275,census,Greek,3.0,92078.0
2,Greece,2011,1096810,census,Greek,93.0,1020033.0
3,Macedonia,2011,808724,estimate,Macedonian,64.2,519201.0
4,Macedonia,2011,808724,estimate,Albanian,25.2,203798.0
5,Macedonia,2011,808724,estimate,Turkish,3.9,31540.0
6,Macedonia,2011,808724,estimate,Gypsy,2.7,21836.0
7,Macedonia,2011,808724,estimate,Serb,1.8,14557.0
8,Serbia,2011,7620531,census,Serb,82.9,6317420.0
9,Serbia,2011,7620531,census,Montenegrin,0.9,68585.0


In [13]:
# Step 3: Group by ethnic group and report total populations for the 10 largest groups

# print out the dataset with the following arrangements:
# 1. group data by ethnicity and sum the population counts
# 2. sort the totals by ethnicity in descending order
# 3. print the top 10 ethnicities, only including the ethnicity name (index) and the population value
ethnic_groups.groupby('ethnicity')[['pop_by_ethnicity']].sum().sort_values('pop_by_ethnicity', ascending = False).head(10)

ethnicity,pop_by_ethnicity
Han Chinese,1136990000.0
European,390938900.0
African,179149400.0
Indo-Aryan,171645400.0
Russian,127431500.0
Japanese,127001400.0
Mestizo,82388520.0
Mulatto,73703500.0
German,68526010.0
Viet/Kinh,65413010.0
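
The percentage-to-count conversion and the grouping can be checked by hand on a small frame. A sketch with made-up countries and groups (all values are placeholders):

```python
import pandas as pd

# Placeholder data: group 'a' appears in two countries
df = pd.DataFrame({
    'country':    ['X', 'X', 'Y'],
    'total_pop':  [1000, 1000, 500],
    'ethnicity':  ['a', 'b', 'a'],
    'percentage': [60.0, 40.0, 10.0],
})

# Convert each percentage into a head count for that country
df['pop_by_ethnicity'] = (df['total_pop'] * df['percentage'] / 100).round(0)

# Sum the head counts per group across countries, largest first
totals = (
    df.groupby('ethnicity')['pop_by_ethnicity']
      .sum()
      .sort_values(ascending=False)
)
print(totals)
```

Here group `a` totals 600 + 50 = 650 and group `b` totals 400, which matches what the grouped sum should return.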


### Exercise 4: Name and country of a) longest river, b) largest lake and c) airport at highest elevation

#### Steps Attempted, for each desired item:
1. Iterate through the data to find the longest river, largest lake, or highest airport and record length/area/elevation, and country/countries
2. Go through the countries and find the names of all matching countries
3. Print needed output:
   4.1) Longest river
   4.2) Largest lake
   4.3) Airport at highest elevation

In [14]:
# Step 1.1: iterate through rivers to find the longest one

river_length = 0                                        # set some bases for later comparison: river length
name = ""                                               # river name
for river in data.findall('river'):                     # iterate through the river elements
    try:                                                # try to extract the needed numbers
        test_length = float(river.find('length').text)  # if the length element was found, record the length
    except AttributeError:                              # account for the error when the element's missing
        test_length = 0                                 # if not found, assume no usable data and return to base

    if test_length > river_length:                      # compare current river element to any previously-found longest river
        name = river.find('name').text                  # if current river is longer, replace the previous river name
        river_length = test_length                      # if longer, replace the current length with the new one
        river_codes = river.attrib['country'].split()   # if longer, replace the list of countries with the new one

river = [name, river_length]                            # once the data is reviewed, store the final result for longest river

In [15]:
# Step 1.2: find largest lake

# the same process is followed for the largest lake, with lake elements replacing river elements and area replacing length

lake_area = 0
name = ""
for lake in data.findall('lake'):
    try:
        test_area = float(lake.find('area').text)
    except AttributeError:
        test_area = 0

    if test_area > lake_area:
        name = lake.find('name').text
        lake_area = test_area
        lake_codes = lake.attrib['country'].split()

lake = [name, lake_area]

In [16]:
# Step 1.3: find highest airport

# the same process as the previous two is also followed for the airport, but with some unique aspects:
# - no airport is missing its elevation element, but at least one is missing the measurement
# - an airport can only be in one country, so its country code is recorded in the list with the airport name and elevation

airport_elevation = 0
name = ""
for airport in data.findall('airport'):
    try:
        test_elevation = float(airport.find('elevation').text)
    except TypeError:                                           # unique error: element present but empty, so .text is None
        test_elevation = 0                                      # if error is found, use zero elevation

    if test_elevation > airport_elevation:
        name = airport.find('name').text
        airport_elevation = test_elevation
        airport_country_code = airport.attrib['country']

airport = [name, airport_elevation, airport_country_code]
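
Since the river, lake, and airport searches follow the same pattern, they could share one helper. The sketch below is an assumption about how that might look, not the code used above: `largest` is a hypothetical function, and it relies on `findtext` returning `None` for a missing child and an empty string for an empty one, so both cases are skipped with a single check.

```python
from xml.etree import ElementTree as ET

# Placeholder document: a river without a length and an airport with an empty elevation
sample = ET.fromstring(
    "<mondial>"
    "<river country='A B'><name>R1</name><length>120.5</length></river>"
    "<river country='C'><name>R2</name></river>"
    "<river country='A'><name>R3</name><length>80</length></river>"
    "<airport country='A'><name>P1</name><elevation/></airport>"
    "<airport country='B'><name>P2</name><elevation>300</elevation></airport>"
    "</mondial>"
)

def largest(root, tag, measure):
    """Return (name, value, country_codes) for the <tag> element with the largest <measure>."""
    best, best_value = None, float('-inf')
    for element in root.findall(tag):
        text = element.findtext(measure)   # None if the child is missing, '' if it is empty
        if not text:
            continue                       # no usable measurement for this element
        value = float(text)
        if value > best_value:
            best, best_value = element, value
    if best is None:
        return None
    return best.findtext('name'), best_value, best.attrib['country'].split()

print(largest(sample, 'river', 'length'))
print(largest(sample, 'airport', 'elevation'))
```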

In [17]:
# Step 2: find country locations for the river, lake, and airport

#car_code attribute matches country codes from river, lake, and airport elements

river_loc = []                                         # empty list for the countries where the longest river is
for country in data.findall('country'):                # search through country elements
    if country.attrib['car_code'] in river_codes:      # record any country names for car_codes in the river_codes list
        river_loc.append( country.find('name').text )  # country names are added to a river location list

lake_loc = []                                          # empty list for the countries where the largest lake is
for country in data.findall('country'):                # search through country elements
    if country.attrib['car_code'] in lake_codes:       # record any country names for car_codes in the lake_codes list
        lake_loc.append( country.find('name').text )   # country names are added to a lake location list

for country in data.findall('country'):                # search through country elements for the highest airport
    if country.attrib['car_code'] == airport[2]:       # match the airport's country code to a country's car_code
        airport.append( country.find('name').text )    # store name as additional value in the airport list
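
The three lookups above each scan the full country list. One alternative is to build a `car_code` to name dictionary once and reuse it; this is a sketch on placeholder data, and `code_to_name` is an illustrative name:

```python
from xml.etree import ElementTree as ET

# Placeholder countries with made-up codes
sample = ET.fromstring(
    "<mondial>"
    "<country car_code='AA'><name>Aland</name></country>"
    "<country car_code='BB'><name>Borduria</name></country>"
    "</mondial>"
)

# Build the lookup table once
code_to_name = {
    country.attrib['car_code']: country.findtext('name')
    for country in sample.findall('country')
}

river_codes = ['BB', 'AA']   # e.g. the split 'country' attribute of a river
river_loc = [code_to_name[code] for code in river_codes if code in code_to_name]
print(river_loc)
```

With this approach the names come out in the order of the codes on the river element, whereas the loops above list them in the order the countries appear in the file.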

In [18]:
# Step 3: Print final results

# complete sentences are constructed to report results

print("The longest river is the " + river[0] + ", running for " + str(int(river[1]))
      + "km, and crossing " + str(len(river_codes)) + " countries:", end = " ")
for i in range(len(river_loc)):                       # iterate through list to print all values
    if i < len(river_loc) - 1:                        # select the appropriate punctuation
        print( river_loc[i] + ",", end = " ")         # if not at end, then print comma and space
    else:
        print( river_loc[i] + ".")                    # if at end, print period and default new line

print("")

print("The largest lake is the " + lake[0] + ", covering " + str(int(lake[1]))
      + " square km, and located in " + str(len(lake_codes)) + " countries:", end = " ")
for i in range(len(lake_loc)):                       # iterate through list to print all values
    if i < len(lake_loc) - 1:                        # select the appropriate punctuation
        print( lake_loc[i] + ",", end = " " )        # if not at end, then print comma and space
    else:
        print( lake_loc[i] + "." )                   # if at end, print period and default new line
        
print("")

print("The highest airport is " + airport[0] + ", at " + str(int(airport[1]))
      + "m above sea level, and located in " + airport[3] + ".")

The longest river is the Amazonas, running for 6448km, and crossing 3 countries: Colombia, Brazil, Peru.

The largest lake is the Caspian Sea, covering 386400 square km, and located in 5 countries: Russia, Iran, Turkmenistan, Azerbaijan, Kazakhstan.

The highest airport is El Alto Intl, at 4063m above sea level, and located in Bolivia.
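
The comma-and-period loops used for printing can be replaced with `str.join`, which handles the separators in one call. A small sketch; `location_sentence` and its parameters are hypothetical names, and the lake below is a placeholder:

```python
def location_sentence(subject, name, measure, countries):
    """Build one report sentence from a feature name and its list of country names."""
    noun = "country" if len(countries) == 1 else "countries"
    return (f"The {subject} is the {name}, {measure}, and located in "
            f"{len(countries)} {noun}: {', '.join(countries)}.")

sentence = location_sentence(
    "largest lake", "Example Lake", "covering 1200 square km", ["Aland", "Borduria"]
)
print(sentence)
```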
