# XML example and exercise
****
+ study examples of accessing nodes in an XML tree structure  
+ work on the exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [2]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html

In [3]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [4]:
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [5]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':')
    cities_string = ''
    # iter() walks the whole subtree (getiterator() was removed in Python 3.9)
    for subelement in element.iter('city'):
        cities_string += subelement.find('name').text + ', '
    # drop the trailing ', '
    print(cities_string[:-2])

* Albania:
Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:
Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:
Skopje, Kumanovo
* Serbia:
Beograd, Novi Sad, Niš
* Montenegro:
Podgorica
* Kosovo:
Prishtine
* Andorra:
Andorra la Vella


In [6]:
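# find() only looks at direct children, so it returns None for a country
# whose cities are nested deeper in the tree (for example inside a province)
for child in document_tree.getroot():
    c = child.find('city')
    print(c)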

<Element 'city' at 0x10d87f458>
None
<Element 'city' at 0x10d8bec28>
<Element 'city' at 0x10d8c2ea8>
<Element 'city' at 0x10d8d2278>
<Element 'city' at 0x10d8d2c28>
<Element 'city' at 0x10d8d8a98>
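
The `None` in the output above comes from the difference between `find()`, which only inspects direct children, and `iter()`, which walks the whole subtree. A minimal sketch of that difference follows; the inline document and its names (`Flatland`, `Nestedland`, `Alpha`, `Beta`) are made up for illustration and are not taken from the Mondial data.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document: one country lists its city directly,
# the other nests its city inside a <province> element.
sample = ET.fromstring("""
<mondial>
  <country>
    <name>Flatland</name>
    <city><name>Alpha</name></city>
  </country>
  <country>
    <name>Nestedland</name>
    <province>
      <name>North</name>
      <city><name>Beta</name></city>
    </province>
  </country>
</mondial>
""")

direct = {}   # first city found among direct children only
nested = {}   # all cities found anywhere below the country

for country in sample.iterfind('country'):
    name = country.find('name').text
    first_direct = country.find('city')
    direct[name] = None if first_direct is None else first_direct.find('name').text
    nested[name] = [c.find('name').text for c in country.iter('city')]

print(direct)
print(nested)
```

Here `find('city')` misses the city of `Nestedland`, while `iter('city')` reaches it.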


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

## 10 Countries with Lowest Infant Mortality Rates

We need to find the 10 countries with the lowest infant mortality rates.

The first step is to load the XML data and parse it using the xml.etree library from the Python standard library. We then examine the XML file to discover its structure. We find that 'country' is an immediate child of the root and 'infant_mortality' is an immediate child of 'country'.

We, therefore, traverse into the country and then extract the infant mortality child from it.

Finally, we sort our list of (country, rate) tuples by mortality rate and output the 10 countries with the lowest rates. We discover that Monaco has the lowest infant mortality rate, followed by Japan, Norway and Bermuda.
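
One way to carry out the structure check described above is to count the tags that appear directly under the root and directly under each country. The sketch below does this on a small made-up document (the element values and the `car_code` attributes are hypothetical); the same two counters could be run on the parsed Mondial tree.

```python
from collections import Counter
from xml.etree import ElementTree as ET

# Hypothetical sample, shaped like the structure the text describes.
sample = ET.fromstring("""
<mondial>
  <country car_code="AA">
    <name>Flatland</name>
    <infant_mortality>4.5</infant_mortality>
  </country>
  <country car_code="BB">
    <name>Nestedland</name>
  </country>
  <river country="AA"><name>Long Water</name></river>
</mondial>
""")

# Tags of the root's immediate children
root_children = Counter(child.tag for child in sample)

# Tags of each country's immediate children
country_children = Counter(
    sub.tag for country in sample.iterfind('country') for sub in country
)

print(root_children)
print(country_children)
```

A count lower than the number of countries, as for `infant_mortality` here, signals that the field is optional and must be checked for `None` before use.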

In [7]:
document = ET.parse( './data/mondial_database.xml' )

infant_mortality = []
# only <country> elements carry an infant_mortality child
for child in document.iterfind('country'):
    country = child.find('name').text
    mortality = child.find('infant_mortality')
    # skip countries that do not report a rate
    if mortality is not None:
        infant_mortality.append((country, float(mortality.text)))
    

print("Top Ten Countries with Lowest Infant Mortality Rates\n")

top_ten = sorted(infant_mortality, key=lambda x: x[1])[:10]
for count, tup in enumerate(top_ten):
    print(str(count+1) + ". " + tup[0])

Top Ten Countries with Lowest Infant Mortality Rates

1. Monaco
2. Japan
3. Norway
4. Bermuda
5. Singapore
6. Sweden
7. Czech Republic
8. Hong Kong
9. Macao
10. Iceland


## 10 Cities with the Largest Population

To find the cities with the largest population, we visit every city of every country and record the last population figure listed for it. The resulting list of (city, population) tuples is then sorted by population and the ten most populous cities are printed.

We discover that Shanghai is the most populous city in the data set, followed by Istanbul, Mumbai and Moskva.

In [8]:
populations = []
for element in document.iterfind('country'):
    # iter() also reaches cities nested inside provinces
    for subelement in element.iter('city'):
        city = subelement.find('name').text
        population = subelement.findall('population')
        # use the last population entry listed; cities without one count as 0
        if population:
            population = int(population[-1].text)
        else:
            population = 0
        populations.append((city, population))


print("Top Ten Cities with the Largest Population\n")
# sort in descending order so that rank 1 is the largest city
top_ten = sorted(populations, key=lambda x: x[1], reverse=True)[:10]
for count, tup in enumerate(top_ten):
    print(str(count+1) + ". " + tup[0])

Top Ten Cities with the Largest Population

1. Shanghai
2. Istanbul
3. Mumbai
4. Moskva
5. Beijing
6. São Paulo
7. Tianjin
8. Guangzhou
9. Delhi
10. Shenzhen
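
The cell above takes the last `population` element of each city, which is only the most recent figure if the entries are listed in chronological order. A more defensive variant is sketched below. It assumes that each `population` element carries a `year` attribute, which should be verified against the data file; the helper name `latest_population` and the sample city are hypothetical.

```python
from xml.etree import ElementTree as ET

# Hypothetical city whose population entries are not in chronological order.
city = ET.fromstring("""
<city>
  <name>Alpha</name>
  <population year="2011">120</population>
  <population year="1991">90</population>
  <population year="2001">100</population>
</city>
""")

def latest_population(element):
    """Return the population with the largest year attribute, or None if absent."""
    entries = element.findall('population')
    if not entries:
        return None
    # Assumption: 'year' holds an integer; entries without it are treated as year 0.
    newest = max(entries, key=lambda p: int(p.get('year', 0)))
    return int(newest.text)

print(latest_population(city))
```

On this sample, indexing with `[-1]` would pick the 2001 figure, whereas the helper picks the 2011 one.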


## 10 Ethnic Groups with the Largest Overall Population

To find the top 10 ethnic groups, we maintain a dictionary that maps each ethnic group to its total population across all countries.

We then iterate over every country and calculate the population of each group based on the total population and the percentage of each ethnic group. We then add this population to the corresponding field in our ethnic group list.

Finally, we output the 10 largest ethnic groups by sorting them by population. We discover that the Han Chinese are the largest ethnic group, followed by Indo-Aryans, Europeans and Africans.
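
The aggregation described above can be illustrated on toy numbers before applying it to the XML. In this sketch the countries, group names and percentages are invented; only the arithmetic (percentage times population, divided by 100, summed per group) mirrors the approach in the text.

```python
from collections import defaultdict

# Hypothetical input: (country population, {ethnic group: percentage})
countries = [
    (1000, {'A': 60.0, 'B': 40.0}),
    (500, {'A': 10.0, 'C': 90.0}),
]

totals = defaultdict(int)
for population, groups in countries:
    for group, percentage in groups.items():
        # head count of this group in this country
        totals[group] += round(percentage * population / 100)

# largest group first
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Group `A` appears in both countries, so its total is 600 + 50 = 650, which puts it ahead of `C` (450) and `B` (400).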

In [24]:
import operator

ethnicity = {}

for element in document.iterfind('country'):
    country = element.find('name').text
    population = element.findall('population')
    if population is not None and len(population) > 0:
        population = int(population[-1].text)
    else:
        population = 0
    
    ethnicgroups = element.findall('ethnicgroup')
    for ethnicgroup in ethnicgroups:
        group = ethnicgroup.text
        percentage = float(ethnicgroup.get('percentage'))
        group_pop = round((percentage * population) / 100)
        if group in ethnicity:
            ethnicity[group] = ethnicity[group] + group_pop
        else:
            ethnicity[group] = group_pop
            
print("Top Ten Ethnic Groups with Largest Overall Population\n")
# sort groups by total population, largest first, and keep the first ten
top_ten = sorted(ethnicity.items(), key=operator.itemgetter(1), reverse=True)[:10]

for count, tup in enumerate(top_ten):
    print(str(count+1) + ". " + tup[0] + " (" + str(tup[1]) + ")")

Top Ten Ethnic Groups with Largest Overall Population

1. Han Chinese (1245058800)
2. Indo-Aryan (871815583)
3. European (494872221)
4. African (318325121)
5. Dravidian (302713744)
6. Mestizo (157734355)
7. Bengali (146776917)
8. Russian (131856994)
9. Japanese (126534212)
10. Malay (121993550)


## Longest River, Largest Lake and Highest Airport

The final part of the exercise asks us to find the name and the country of the longest river, the largest lake and the airport at the highest elevation.

The process is largely the same for all three. We iterate through all the instances and extract the required value. We then put these values in a list of tuples and sort them based on the value (length, area, elevation). Finally, we output the result with its corresponding country.

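We discover that the highest airport is El Alto International, the longest river is the Amazonas and the largest lake is the Caspian Sea. The Country column of the table shows the country codes stored in each element's 'country' attribute rather than full country names; a river or lake that spans several countries lists several codes.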

In [34]:
from prettytable import PrettyTable

airports = []
rivers = []
lakes = []

for element in document.iterfind('airport'):
    country = element.get('country')
    name = element.find('name')
    elevation = element.find('elevation')
    if name is not None and elevation is not None and elevation.text is not None:
        name = name.text
        elevation = int(elevation.text)
        airports.append((country, name, elevation))

for element in document.iterfind('river'):
    country = element.get('country')
    name = element.find('name')
    length = element.find('length')
    if name is not None and length is not None and length.text is not None:
        name = name.text
        length = round(float(length.text))
        rivers.append((country, name, length))

for element in document.iterfind('lake'):
    country = element.get('country')
    name = element.find('name')
    area = element.find('area')
    if name is not None and area is not None and area.text is not None:
        name = name.text
        area = round(float(area.text))
        lakes.append((country, name, area))

# max() returns the entry with the largest value without sorting the whole list
highest_airport = max(airports, key=lambda x: x[2])
longest_river = max(rivers, key=lambda x: x[2])
largest_lake = max(lakes, key=lambda x: x[2])

table = PrettyTable(['Type','Country', 'Name', 'Value'])
table.add_row(['Highest Airport'] + list(highest_airport))
table.add_row(['Longest River'] + list(longest_river))
table.add_row(['Largest Lake'] + list(largest_lake))

print(table)

+-----------------+----------------+--------------+--------+
|       Type      |    Country     |     Name     | Value  |
+-----------------+----------------+--------------+--------+
| Highest Airport |      BOL       | El Alto Intl |  4063  |
|  Longest River  |    CO BR PE    |   Amazonas   |  6448  |
|   Largest Lake  | R AZ KAZ IR TM | Caspian Sea  | 386400 |
+-----------------+----------------+--------------+--------+
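
Since the three loops above share the same shape, they could be folded into one helper. The following is a sketch rather than the notebook's method: the function name `largest_by` and the sample rivers are hypothetical, and it keeps the value as a float instead of rounding it.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample; the third river has no <length> and must be skipped.
sample = ET.fromstring("""
<mondial>
  <river country="AA BB"><name>Long Water</name><length>1200.5</length></river>
  <river country="CC"><name>Short Creek</name><length>80</length></river>
  <river country="DD"><name>Unmeasured</name></river>
</mondial>
""")

def largest_by(root, tag, field):
    """Return (country codes, name, value) for the `tag` element with the largest `field`."""
    best = None
    for element in root.iterfind(tag):
        name = element.find('name')
        value = element.find(field)
        # skip entries with a missing name, a missing field or an empty field
        if name is None or value is None or value.text is None:
            continue
        candidate = (element.get('country'), name.text, float(value.text))
        if best is None or candidate[2] > best[2]:
            best = candidate
    return best

print(largest_by(sample, 'river', 'length'))
```

With the parsed Mondial tree, calls such as `largest_by(document.getroot(), 'lake', 'area')` would follow the same pattern.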
