# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [4]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    # iter('city') walks the whole country subtree, so cities nested in provinces are included
    city_names = [city.find('name').text for city in element.iter('city')]
    print('* ' + element.find('name').text + ': ' + ', '.join(city_names))

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella
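
The city listing above depends on walking all descendants of a country, not just its direct children. The following is a minimal sketch of that difference, using a small invented document (the names `Exampleland`, `Alpha`, `Beta` and `North` are placeholders, not Mondial data):

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document: one city sits directly under the country,
# the other is nested inside a province.
sample = ET.fromstring("""
<mondial>
  <country>
    <name>Exampleland</name>
    <city><name>Alpha</name></city>
    <province>
      <name>North</name>
      <city><name>Beta</name></city>
    </province>
  </country>
</mondial>
""")

country = sample.find('country')

# findall('city') matches direct children only
direct = [c.find('name').text for c in country.findall('city')]

# iter('city') walks the whole subtree, so nested cities are included
nested = [c.find('name').text for c in country.iter('city')]

print(direct)  # ['Alpha']
print(nested)  # ['Alpha', 'Beta']
```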


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [5]:
document = ET.parse( './data/mondial_database.xml' )

In [6]:
import pandas as pd

Q1: 10 countries with the lowest infant mortality rates

In [7]:
#create lists for country names and infant mortality rates (imr)
countries = []
imr = []

In [8]:
#loop over the countries and append each name and infant mortality rate to the lists
for element in document.iterfind('country'):
    countries.append(element.find('name').text)
    try:
        imr.append(float(element.find('infant_mortality').text))
    except (AttributeError, TypeError, ValueError):
        #no usable rate for this country: store NaN so both lists stay the same length
        imr.append(float('nan'))
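
As a cross-check that does not need pandas, the same "lowest rates" idea can be sketched with the standard library alone. This is only an illustration under assumptions: the countries `A`, `B`, `C` and their rates are invented, and countries without a rate are skipped rather than stored as NaN.

```python
from xml.etree import ElementTree as ET

# Hypothetical data: the names and rates are invented for illustration.
sample = ET.fromstring("""
<mondial>
  <country><name>A</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>B</name></country>
  <country><name>C</name><infant_mortality>2.1</infant_mortality></country>
</mondial>
""")

rates = []
for country in sample.iterfind('country'):
    # findtext returns None when the child element is absent
    text = country.findtext('infant_mortality')
    if text is not None:
        rates.append((float(text), country.findtext('name')))

# tuples sort by their first item, so the lowest rates come first
lowest = sorted(rates)[:2]
print(lowest)  # [(2.1, 'C'), (4.5, 'A')]
```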

## ANSWER to question 1

In [9]:
#create dataframe to figure out 10 countries with lowest infant mortality rates
df = pd.DataFrame({'Country': countries, 'Infant Mortality Rate': imr})
df.sort_values(['Infant Mortality Rate'], ascending=True).head(10)

Unnamed: 0,Country,Infant Mortality Rate
38,Monaco,1.81
98,Japan,2.13
117,Bermuda,2.48
36,Norway,2.48
106,Singapore,2.53
37,Sweden,2.6
10,Czech Republic,2.63
78,Hong Kong,2.73
79,Macao,3.13
44,Iceland,3.15


Q2: 10 cities with the largest population

- the question does not say which year's population to use
- the figures below therefore come from different census years
- duplicate city names are removed, keeping only the most recent figure (so different cities that share a name are treated as one)

In [10]:
#create lists for city, population and year
cities = []
population = []
year = []

In [11]:
#loop over the data and fill the following lists: cities, population and year
for element in document.iterfind('country'): # go to country
    for child in element.iter('city'): # go over the cities in the country
        city = child.find('name').text # look for city name
        for pop in child.iter('population'): # take each population figure of the city
            cities.append(city) # fill the cities list
            population.append(int(pop.text)) # fill the corresponding population
            year.append(int(pop.get('year'))) # fill the corresponding year

In [12]:
#create dataframe with city, population and year
df_cities = pd.DataFrame({'City': cities, 'Population': population, 'Year': year})
#drop duplicate cities
df_cpy = df_cities.sort_values(['Year'], ascending=True).drop_duplicates(['City'], keep='last')
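
The "keep the most recent figure" step could also be done per city element, before any dataframe is built. The following is a minimal sketch, assuming each `population` element carries a `year` attribute as in the loop above. The sample city and its numbers are invented.

```python
from xml.etree import ElementTree as ET

# Hypothetical city with two census figures (numbers invented).
city = ET.fromstring("""
<city>
  <name>Alpha</name>
  <population year="2001">900</population>
  <population year="2011">1200</population>
</city>
""")

# pick the population element with the greatest year attribute
latest = max(city.findall('population'), key=lambda p: int(p.get('year')))
record = (city.findtext('name'), int(latest.text), int(latest.get('year')))
print(record)  # ('Alpha', 1200, 2011)
```

Selecting per element like this would also keep same-named cities in different countries apart, which deduplicating on the city name does not.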

## ANSWER to question 2

In [13]:
#sort for top 10 largest cities by population
df_cpy.sort_values(['Population'], ascending=False).head(10)

Unnamed: 0,City,Population,Year
3750,Shanghai,22315474,2010
2607,Istanbul,13710512,2012
4303,Mumbai,12442373,2011
1546,Moskva,11979529,2013
3746,Beijing,11716620,2010
8208,São Paulo,11152344,2010
3754,Tianjin,11090314,2010
3364,Guangzhou,11071424,2010
4399,Delhi,11034555,2011
3371,Shenzhen,10358381,2010


Q3: 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)

In [14]:
#create lists for country, population, year, ethnic group and percentage
countries = []
country_population = []
population = []
year = []
ethnic_group = []
percentage = []

In [15]:
#loop over the data and fill the following lists: countries, country_population, population, year, ethnic_group and percentage
for element in document.iter('country'): #go to country
    country = element.find('name').text #look at country name
    for pop in element.iter():
        if pop.tag == 'population': #take the population
            country_population.append(country) #fill country_population list
            year.append(int(pop.get('year'))) #fill the year list
            population.append(int(pop.text)) #fill the population list
        elif pop.tag in ('city', 'province'): #stop here so that city and province populations are not included
            break
    for items in element.iter('ethnicgroup'): #take the ethnic groups
        countries.append(country) #put the country name in the proper list
        percentage.append(float(items.get('percentage'))) #fill in the percentage
        ethnic_group.append(items.text) #fill the ethnic_group list

In [16]:
#build two dataframes to be merged later on
#Ethnic DataFrame
Edf = pd.DataFrame({'Country': countries, 'Ethnicity': ethnic_group, 'Percentage': percentage})
#Population DataFrame
Pdf = pd.DataFrame ({'Country': country_population, 'Population': population, 'Year': year})

In [17]:
#Population DataFrame Cleaning, taking only the latest information available
Latest_Pdf = Pdf.sort_values(['Year']).drop_duplicates(['Country'], keep='last')

In [18]:
#merge the dataframes on their shared 'Country' column
df = Edf.merge(Latest_Pdf, on='Country')

In [19]:
#add column for Ethnic Population by applying the percentage to the total population
df['Ethnic Population'] = (df['Percentage'] / 100) * df['Population']
#new data frame taking only needed information
df = df[['Ethnicity', 'Ethnic Population']]

In [20]:
#group by Ethnicity, then sum the Ethnic Population within each group
df = df.groupby("Ethnicity")[["Ethnic Population"]].sum()
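
The arithmetic behind this step (percentage of the latest country population, summed per group across countries) can be sketched without pandas. This is an illustration only: the countries `A` and `B`, the groups `X` and `Y`, and all numbers are invented, and `totals` and `ranked` are names chosen for this sketch.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical data: populations and percentages are invented.
sample = ET.fromstring("""
<mondial>
  <country>
    <name>A</name>
    <population year="2001">800</population>
    <population year="2011">1000</population>
    <ethnicgroup percentage="75">X</ethnicgroup>
    <ethnicgroup percentage="25">Y</ethnicgroup>
  </country>
  <country>
    <name>B</name>
    <population year="2011">500</population>
    <ethnicgroup percentage="50">X</ethnicgroup>
  </country>
</mondial>
""")

totals = defaultdict(float)
for country in sample.iterfind('country'):
    # findall matches direct children only, so these are country-level figures
    latest = max(country.findall('population'), key=lambda p: int(p.get('year')))
    population = int(latest.text)
    for group in country.findall('ethnicgroup'):
        share = float(group.get('percentage')) / 100
        totals[group.text] += share * population

# largest groups first
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)  # [('X', 1000.0), ('Y', 250.0)]
```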

## ANSWER to question 3

In [21]:
#sort for 10 ethnic groups with the largest population
df.sort_values('Ethnic Population', ascending=False).head(10)

Ethnicity,Ethnic Population
Han Chinese,1245059000.0
Indo-Aryan,871815600.0
European,494872200.0
African,318325100.0
Dravidian,302713700.0
Mestizo,157734400.0
Bengali,146776900.0
Russian,131857000.0
Japanese,126534200.0
Malay,121993600.0


In [22]:
#create a dictionary mapping country codes (car_code) to country names
country_codes = {}
for country in document.findall('country'):
    country_codes[country.attrib['car_code']] = country.find('name').text
#Country Code dataframe
CCdf = pd.DataFrame(data=list(country_codes.items()), columns=['Country Code', 'Country'])

Q4a: Name and Country of longest river

In [23]:
#create list to add rivers and other data
rivers = []
attributes = []

In [24]:
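#add attributes (Country, River and Length) to list
for river in document.findall('river'):
    try:
        attributes = [river.attrib['country'], river.find('name').text, int(river.find('length').text)]
        rivers.append(attributes)
    except (KeyError, AttributeError, TypeError, ValueError):
        #skip rivers with a missing country, name or length, or a length that int() cannot parse
        pass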
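
One thing to keep in mind with this pattern: `int()` rejects decimal strings, so an entry whose value is written as, say, `12.5` would be skipped silently. Whether the Mondial file contains such values is not checked here. A hedged sketch of a more explicit parse, using a hypothetical helper `to_number` that is not part of the notebook:

```python
def to_number(text):
    """Hypothetical helper: return text as a float, or None if it is missing or malformed."""
    if text is None:
        return None
    try:
        return float(text)
    except ValueError:
        return None

print(to_number('6448'))  # 6448.0
print(to_number('12.5'))  # 12.5, whereas int('12.5') raises ValueError
print(to_number(None))    # None
print(to_number('n/a'))   # None
```

Returning `None` instead of passing silently makes it possible to count how many entries were dropped.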

In [25]:
#create a rivers dataframe
col = ['Country Code', 'River Name', 'River Length'] #create column names
river_df = pd.DataFrame(columns=col) #blank data frame with columns
river_df = river_df.append(pd.DataFrame(rivers, columns=col)) #fill in data to dataframe
river_df #notice some rows list multiple country codes

Unnamed: 0,Country Code,River Name,River Length
0,IS,Thjorsa,230.0
1,IS,Joekulsa a Fjoellum,206.0
2,N,Glomma,604.0
3,N,Lagen,322.0
4,S,Goetaaelv,93.0
5,N S,Klaraelv,460.0
6,S,Umeaelv,470.0
7,S,Dalaelv,520.0
8,S,Vaesterdalaelv,320.0
9,S,Oesterdalaelv,241.0


In [26]:
#create a list that will hold one entry per country code
expanded_rivers = []
#split rows with several country codes into one entry per code
for row in river_df.itertuples():
    #split(' ') returns a one-item list for a single code, so one loop covers both cases
    for code in row[1].split(' '):
        expanded_rivers.append([code, row[2], row[3]])
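
The splitting step can be sketched on its own with plain tuples. The two rows below are taken from the river table above, and `rows` and `expanded` are names chosen for this sketch:

```python
# (country codes, river name, length) as shown in the river table above
rows = [('N S', 'Klaraelv', 460.0), ('S', 'Umeaelv', 470.0)]

# str.split() with no argument splits on any run of whitespace,
# so a single code simply yields a one-item list
expanded = [(code, name, length)
            for codes, name, length in rows
            for code in codes.split()]

print(expanded)
# [('N', 'Klaraelv', 460.0), ('S', 'Klaraelv', 460.0), ('S', 'Umeaelv', 470.0)]
```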

In [27]:
#create a new expanded dataframe to have one country code per row
exp_river_df = pd.DataFrame(columns=col)
exp_river_df = exp_river_df.append(pd.DataFrame(expanded_rivers, columns=col))

In [28]:
#merge country names onto the expanded dataframe
CC_Exp_df = exp_river_df.merge(CCdf, on='Country Code')
answer = CC_Exp_df.sort_values('River Length', ascending=False)
answer.head() #to check which ones are at the top of the list

Unnamed: 0,Country Code,River Name,River Length,Country
295,CO,Amazonas,6448.0,Colombia
306,PE,Amazonas,6448.0,Peru
298,BR,Amazonas,6448.0,Brazil
223,CN,Jangtse,6380.0,China
222,CN,Hwangho,4845.0,China
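
Because several rows can share the top value, reading only the first row would miss the other countries the river runs through. A small sketch of collecting every row that ties for the maximum, using the five rows shown above (`rows`, `longest` and `winners` are names chosen for this sketch):

```python
# (country, river, length) copied from the table above
rows = [
    ('Colombia', 'Amazonas', 6448.0),
    ('Peru', 'Amazonas', 6448.0),
    ('Brazil', 'Amazonas', 6448.0),
    ('China', 'Jangtse', 6380.0),
    ('China', 'Hwangho', 4845.0),
]

# find the maximum length, then keep every row that reaches it
longest = max(length for _, _, length in rows)
winners = [(country, river) for country, river, length in rows if length == longest]

print(winners)
# [('Colombia', 'Amazonas'), ('Peru', 'Amazonas'), ('Brazil', 'Amazonas')]
```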


## ANSWER to question 4a

In [29]:
print('The longest river is the Amazonas, found in Colombia, Peru and Brazil.')
answer.head(3)

The longest river is the Amazonas, found in Colombia, Peru and Brazil.


Unnamed: 0,Country Code,River Name,River Length,Country
295,CO,Amazonas,6448.0,Colombia
306,PE,Amazonas,6448.0,Peru
298,BR,Amazonas,6448.0,Brazil


Q4b: Name and Country of largest lake

In [30]:
#create list to add lakes and other data
#basically, this is similar to question 4a with some modifications to the code
lakes = []
attributes = []

In [31]:
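#add attributes (Country, Lake and Area) to list
for lake in document.findall('lake'):
    try:
        attributes = [lake.attrib['country'], lake.find('name').text, int(lake.find('area').text)]
        lakes.append(attributes)
    except (KeyError, AttributeError, TypeError, ValueError):
        #skip lakes with a missing country, name or area, or an area that int() cannot parse
        pass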

In [32]:
#create a lakes dataframe
col = ['Country Code', 'Lake Name', 'Lake Area'] #create column names
lake_df = pd.DataFrame(columns=col) #blank data frame with columns
lake_df = lake_df.append(pd.DataFrame(lakes, columns=col)) #fill in data to dataframe
lake_df #notice some rows list multiple country codes

Unnamed: 0,Country Code,Lake Name,Lake Area
0,SF,Inari,1040.0
1,SF,Oulujaervi,928.0
2,SF,Kallavesi,472.0
3,SF,Saimaa,4370.0
4,SF,Paeijaenne,1118.0
5,N,Mjoesa-See,368.0
6,S,Storuman,173.0
7,S,Siljan,290.0
8,S,Maelaren,1140.0
9,S,Vaenern,5648.0


In [33]:
#expand the dataframe to have one country code per row
expanded_lakes = []
#split rows with several country codes into one entry per code
for row in lake_df.itertuples():
    #split(' ') returns a one-item list for a single code, so one loop covers both cases
    for code in row[1].split(' '):
        expanded_lakes.append([code, row[2], row[3]])

In [34]:
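#create a new expanded dataframe to have one country code per row
exp_lake_df = pd.DataFrame(columns=col)
#note: this appends to exp_river_df rather than the empty exp_lake_df above, so the river rows
#and river columns carry over into the result; their Lake Area is NaN and sorts last, so the
#ranking is unaffected (appending to exp_lake_df instead would avoid the extra columns)
exp_lake_df = exp_river_df.append(pd.DataFrame(expanded_lakes, columns=col))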

In [35]:
#merge country names onto the expanded dataframe
CC_Exp_df = exp_lake_df.merge(CCdf, on='Country Code')
answer = CC_Exp_df.sort_values('Lake Area', ascending=False)
answer.head(10) #to check which ones are at the top of the list

Unnamed: 0,Country Code,Lake Area,Lake Name,River Length,River Name,Country
66,R,386400.0,Caspian Sea,,,Russia
310,TM,386400.0,Caspian Sea,,,Turkmenistan
274,KAZ,386400.0,Caspian Sea,,,Kazakhstan
264,IR,386400.0,Caspian Sea,,,Iran
255,AZ,386400.0,Caspian Sea,,,Azerbaijan
370,USA,82103.0,Lake Superior,,,United States
346,CDN,82103.0,Lake Superior,,,Canada
510,EAT,68870.0,Lake Victoria,,,Tanzania
508,EAU,68870.0,Lake Victoria,,,Uganda
579,EAK,68870.0,Lake Victoria,,,Kenya


## ANSWER to question 4b

In [36]:
print('The Caspian Sea is the largest lake. It is found in Azerbaijan, Russia, Iran, Kazakhstan, and Turkmenistan.')
answer.head(5)

The Caspian Sea is the largest lake. It is found in Azerbaijan, Russia, Iran, Kazakhstan, and Turkmenistan.


Unnamed: 0,Country Code,Lake Area,Lake Name,River Length,River Name,Country
66,R,386400.0,Caspian Sea,,,Russia
310,TM,386400.0,Caspian Sea,,,Turkmenistan
274,KAZ,386400.0,Caspian Sea,,,Kazakhstan
264,IR,386400.0,Caspian Sea,,,Iran
255,AZ,386400.0,Caspian Sea,,,Azerbaijan


Q4c: Name and Country of airport at highest elevation

In [37]:
#create list to add airports and other data
#basically, this is similar to question 4a and 4b without the need to separate country codes
airports = []
attributes = []

In [38]:
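#add attributes (Country, Airport and Elevation) to list
for airport in document.findall('airport'):
    try:
        attributes = [airport.attrib['country'], airport.find('name').text, int(airport.find('elevation').text)]
        airports.append(attributes)
    except (KeyError, AttributeError, TypeError, ValueError):
        #skip airports with a missing country, name or elevation, or an elevation that int() cannot parse
        pass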

In [39]:
#create an airport dataframe
col = ['Country Code', 'Airport Name', 'Airport Elevation'] #create column names
airport_df = pd.DataFrame(columns=col) #blank data frame with columns
airport_df = airport_df.append(pd.DataFrame(airports, columns=col)) #fill in data to dataframe
airport_df

Unnamed: 0,Country Code,Airport Name,Airport Elevation
0,AFG,Herat,977.0
1,AFG,Kabul Intl,1792.0
2,AL,Tirana Rinas,38.0
3,DZ,Cheikh Larbi Tebessi,811.0
4,DZ,Batna Airport,822.0
5,DZ,Soummam,6.0
6,DZ,Tamanrasset,1377.0
7,DZ,Biskra,88.0
8,DZ,Mohamed Boudiaf Intl,691.0
9,DZ,Ain Arnat Airport,1024.0


In [40]:
#merge country names onto the airport dataframe
Air_df = airport_df.merge(CCdf, on='Country Code')
answer = Air_df.sort_values('Airport Elevation', ascending=False)
answer.head() #to check which ones are at the top of the list

Unnamed: 0,Country Code,Airport Name,Airport Elevation,Country
80,BOL,El Alto Intl,4063.0,Bolivia
212,CN,Lhasa-Gonggar,4005.0,China
230,CN,Yushu Batang,3963.0,China
787,PE,Juliaca,3827.0,Peru
789,PE,Teniente Alejandro Velasco Astete Intl,3311.0,Peru
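
The whole lookup (code to country name, then highest elevation) can also be sketched directly on the XML without dataframes. This is an illustration under assumptions: the miniature document below is invented, including the codes `AA` and `BB`, and only mirrors the `car_code`, `country` and `elevation` names used in the cells above.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document; names, codes and elevations are invented.
sample = ET.fromstring("""
<mondial>
  <country car_code="AA"><name>Exampleland</name></country>
  <country car_code="BB"><name>Samplestan</name></country>
  <airport country="AA"><name>Low Field</name><elevation>120</elevation></airport>
  <airport country="BB"><name>High Field</name><elevation>3100</elevation></airport>
  <airport country="BB"><name>No Data Field</name></airport>
</mondial>
""")

# car_code -> country name, like the country_codes dictionary above
names = {c.get('car_code'): c.findtext('name') for c in sample.iterfind('country')}

candidates = []
for airport in sample.iterfind('airport'):
    elevation = airport.findtext('elevation')
    if elevation:  # skip airports with a missing or empty elevation
        candidates.append((int(elevation),
                           airport.findtext('name'),
                           names[airport.get('country')]))

# tuples compare by their first item, so max() picks the highest elevation
highest = max(candidates)
print(highest)  # (3100, 'High Field', 'Samplestan')
```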


## ANSWER to question 4c

In [41]:
print ('The airport with the highest elevation according to the data is El Alto Intl, which can be found in Bolivia.')

The airport with the highest elevation according to the data is El Alto Intl, which can be found in Bolivia.


## Note:

A quick web search shows that four airports at higher elevations are missing from the data:
- Daocheng Yading Airport
- Qamdo Bamda Airport
- Kangding Airport
- Ngari Gunsa Airport

All four of these airports are in China.

Reference: https://en.wikipedia.org/wiki/List_of_highest_airports