# XML exercise

Using data from the [**mondial database**](https://drive.google.com/file/d/14lFT4nWHgwN36ij4XZh6OUuup-K9qLgR/view?usp=sharing), find the answers to the following questions:

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. the name and country of a) the longest river, b) the largest lake, and c) the airport at the highest elevation

In [1]:
import pandas as pd
import xml.etree.ElementTree as ET

In [2]:
tree = ET.parse('mondial.xml')

In [3]:
root = tree.getroot()
root

<Element 'mondial' at 0x7ff1192c6900>

In [4]:
print(root.tag)
print(root.attrib)
print(len(root))

mondial
{}
3403


In [5]:
countries = root.findall('country')
print(len(countries))
print(countries[243])

244
<Element 'country' at 0x7ff11aba1a90>


In [6]:
print(countries[0].attrib)
print(countries[243].attrib)

{'car_code': 'AL', 'area': '28750', 'capital': 'cty-Albania-Tirane', 'memberships': 'org-BSEC org-CEI org-CD org-SELEC org-CE org-EAPC org-EBRD org-EITI org-FAO org-IPU org-IAEA org-IBRD org-ICC org-ICAO org-ICCt org-Interpol org-IDA org-IFRCS org-IFC org-IFAD org-ILO org-IMO org-IMF org-IOC org-IOM org-ISO org-OIF org-ITU org-ITUC org-IDB org-MIGA org-NATO org-OSCE org-OPCW org-OAS org-OIC org-PCA org-UN org-UNCTAD org-UNESCO org-UNIDO org-UPU org-WCO org-WFTU org-WHO org-WIPO org-WMO org-UNWTO org-WTO'}
{'car_code': 'SY', 'area': '455', 'capital': 'cty-Seychelles-Victoria', 'memberships': 'org-AfDB org-AU org-ACP org-AOSIS org-COMESA org-C org-CD org-FAO org-G-77 org-InOC org-IPU org-IAEA org-IBRD org-ICAO org-ICCt org-Interpol org-IFRCS org-IFC org-IFAD org-ILO org-IMO org-IMF org-IOC org-IOM org-ISO org-OIF org-ITU org-MIGA org-NAM org-OPCW org-SADC org-UN org-UNCTAD org-UNESCO org-UNIDO org-UPU org-WCO org-WHO org-WIPO org-WMO org-UNWTO org-WTO'}


# We will try to answer question 1.

In [7]:
country1 = root[0]


In [8]:
print(country1[0])
print(country1[0].tag)
print(country1[0].text)

<Element 'name' at 0x7ff1192c69f0>
name
Albania


In [9]:
print(country1.find('infant_mortality').text)

13.19


In [10]:
country = root.find('country')
country.find('infant_mortality').text

'13.19'

In [11]:
countries = []
infant_mortality_rates = []


In [12]:
for country in root.findall(".//infant_mortality/.."):
    countries.append(country.find('name').text)
    infant_mortality_rates.append(country.find('infant_mortality').text)

In [13]:
print(len(countries))
print(countries[0])

228
Albania


In [14]:
print(len(infant_mortality_rates))
print(infant_mortality_rates[0])

228
13.19
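The path `.//infant_mortality/..` used in the loop above first finds every `infant_mortality` element and then steps back up to its parent, so countries without that child are skipped (228 of the 244 countries here). A minimal sketch of the same idea on a made-up three-country document; the names `sample`, `with_rate` and `explicit` and the country data are hypothetical and only serve the illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the mondial structure, for illustration only.
sample = ET.fromstring("""
<mondial>
  <country car_code="A1"><name>Alpha</name><infant_mortality>4.5</infant_mortality></country>
  <country car_code="B2"><name>Beta</name></country>
  <country car_code="C3"><name>Gamma</name><infant_mortality>12.0</infant_mortality></country>
</mondial>
""")

# ".//infant_mortality" finds the child elements; "/.." steps back up to
# their parents, so the country without the child is not returned.
with_rate = sample.findall(".//infant_mortality/..")
names = [c.find("name").text for c in with_rate]

# The same selection written as an explicit filter over all countries.
explicit = [
    c.find("name").text
    for c in sample.findall("country")
    if c.find("infant_mortality") is not None
]

print(names)     # ['Alpha', 'Gamma']
print(explicit)  # ['Alpha', 'Gamma']
```

The explicit filter is longer but makes the skipped countries easier to inspect if needed.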


In [15]:
df = pd.DataFrame({"Country": countries, "Infant Mortality Rate": infant_mortality_rates})
df.head()

| | Country | Infant Mortality Rate |
|---|---|---|
| 0 | Albania | 13.19 |
| 1 | Greece | 4.78 |
| 2 | North Macedonia | 7.9 |
| 3 | Serbia | 6.16 |
| 4 | Andorra | 3.69 |


In [16]:
df.dtypes

Country                  object
Infant Mortality Rate    object
dtype: object

In [17]:
df_new = df.astype({"Infant Mortality Rate": 'float'})
df_new.dtypes

Country                   object
Infant Mortality Rate    float64
dtype: object

The "Infant Mortality Rate" column is converted to float because the values read from the XML are strings, and strings sort as text rather than as numbers (see the note under question 2).

In [18]:
df_new.sort_values(by = 'Infant Mortality Rate').head(10)

| | Country | Infant Mortality Rate |
|---|---|---|
| 36 | Monaco | 1.81 |
| 90 | Japan | 2.13 |
| 109 | Bermuda | 2.48 |
| 34 | Norway | 2.48 |
| 98 | Singapore | 2.53 |
| 35 | Sweden | 2.6 |
| 8 | Czech Republic | 2.63 |
| 6 | Spain | 2.7 |
| 72 | Hong Kong | 2.73 |
| 73 | Macao | 3.13 |


# We will try to answer question 2.

In [19]:
cities = root.findall(".//city")
print(len(cities))

3404


In [20]:
cities[0]

<Element 'city' at 0x7ff1192cd5e0>

In [21]:
cities[1].find('name').text

'Shkodër'

In [22]:
cities[3403].find('name').text

'Victoria'

In [23]:
cities[3403].get('id')

'cty-Seychelles-Victoria'

In [24]:
cities_2 = root.findall(".//city/population/..")
print(len(cities_2))

3061


In [25]:
cities_2[0].find('name').text

'Tirana'

In [26]:
cities_2[0].find('population[last()]').text

'418495'

In [27]:
cities_2[0].find('population[last()]').get('year')

'2011'

In [28]:
cities_2[3060].find('name').text

'Victoria'

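Note that for each city, only the most recent population figure is used. The expression `population[last()]` selects the last `population` element of a city, which is assumed here to be the most recent measurement.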
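Taking the last `population` element relies on the entries being listed in chronological order. If that ordering cannot be relied on, the entry with the largest `year` attribute can be picked explicitly. The sketch below uses a made-up city whose entries are deliberately out of order; the helper `latest_population` and the sample data are hypothetical, and entries without a `year` attribute are assumed to rank lowest:

```python
import xml.etree.ElementTree as ET

# Hypothetical city with population entries deliberately out of order.
city = ET.fromstring("""
<city id="cty-example">
  <name>Exampleville</name>
  <population year="2011">300</population>
  <population year="1990">100</population>
  <population year="2001">200</population>
</city>
""")

def latest_population(city_elem):
    """Return (population, year) for the entry with the largest year."""
    entries = city_elem.findall("population")
    # Entries without a year attribute are treated as year 0 (assumption).
    newest = max(entries, key=lambda p: int(p.get("year", 0)))
    return int(newest.text), int(newest.get("year", 0))

print(latest_population(city))               # (300, 2011)
print(city.find("population[last()]").text)  # 200
```

On this sample the two approaches disagree, which is the case the explicit comparison guards against.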


In [29]:
my_dict = {'City': [], 'Population': [], 'Year': []}

In [30]:
for city in root.findall(".//city/population/.."):
    name_value = city.find('name').text
    my_dict['City'].append(name_value)
    
    
    population_value = city.find('population[last()]').text
    my_dict['Population'].append(population_value)
    
    year_value = city.find('population[last()]').get('year')
    my_dict['Year'].append(year_value)

In [31]:
print(len(my_dict['City']))
print(len(my_dict['Population']))
print(len(my_dict['Year']))

3061
3061
3061


In [32]:
df_1 = pd.DataFrame(my_dict) 
df_1.head()

| | City | Population | Year |
|---|---|---|---|
| 0 | Tirana | 418495 | 2011 |
| 1 | Shkodër | 77075 | 2011 |
| 2 | Durrës | 113249 | 2011 |
| 3 | Vlorë | 79513 | 2011 |
| 4 | Elbasan | 78703 | 2011 |


In [33]:
df_1.dtypes

City          object
Population    object
Year          object
dtype: object

In [34]:
df_2 = df_1.astype({'Population': 'int'})
df_2.dtypes

City          object
Population     int64
Year          object
dtype: object

In [35]:
df_2.sort_values(by='Population', ascending=False).head(10)

| | City | Population | Year |
|---|---|---|---|
| 1251 | Shanghai | 22315474 | 2010 |
| 1334 | Karachi | 14916456 | 2017 |
| 2866 | Lagos | 13745000 | 2016 |
| 712 | Istanbul | 13710512 | 2012 |
| 1422 | Mumbai | 12442373 | 2011 |
| 448 | Moskva | 11979529 | 2013 |
| 1250 | Beijing | 11716620 | 2010 |
| 2811 | Kinshasa | 11575000 | 2015 |
| 2596 | São Paulo | 11152344 | 2010 |
| 1313 | Lahore | 11126285 | 2017 |


Note that I had to convert the Population column to integers: it was originally of object (string) dtype, so sorting compared the values as text rather than as numbers. The same kind of conversion is needed for question 1.
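The sorting problem can be reproduced without pandas, since strings in an object column are compared in the same character-by-character way as plain Python strings. A minimal sketch with four population values taken from the tables above (the variable names are only for illustration):

```python
# Population values as they come out of the XML: strings, not numbers.
populations = ["418495", "77075", "113249", "22315474"]

# Sorting the strings compares them character by character,
# so "77075" ranks above "22315474" because "7" > "2".
as_text = sorted(populations, reverse=True)

# Converting with int for the comparison gives the numeric order.
as_numbers = sorted(populations, key=int, reverse=True)

print(as_text)     # ['77075', '418495', '22315474', '113249']
print(as_numbers)  # ['22315474', '418495', '113249', '77075']
```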


# We will try to answer question 3)a).

In [36]:
rivers = root.findall(".//river[@country]/length/..")
print(len(rivers))

438


In [37]:
rivers[0]

<Element 'river' at 0x7ff11ac46c20>

In [38]:
print(rivers[0].find('name').text)
print(rivers[0].get('country'))
print(rivers[0].find('length').text)

Thjorsa
IS
230


In [39]:
rivers[437]

<Element 'river' at 0x7ff11af9b310>

In [40]:
print(rivers[437].find('name').text)
print(rivers[437].get('country'))
print(rivers[437].find('length').text)

Clutha River
NZ
338


In [41]:
my_dict_a = {'River': [], 'Country': [], 'Length': []}

In [42]:
for river in root.findall(".//river[@country]/length/.."):
    name_value = river.find('name').text
    my_dict_a['River'].append(name_value)
    
    
    country_value = river.get('country')
    my_dict_a['Country'].append(country_value)
    
    length_value = river.find('length').text
    my_dict_a['Length'].append(length_value)

In [43]:
print(len(my_dict_a['River']))
print(len(my_dict_a['Country']))
print(len(my_dict_a['Length']))

438
438
438


In [44]:
df_3a = pd.DataFrame(my_dict_a) 
df_3a.head()

| | River | Country | Length |
|---|---|---|---|
| 0 | Thjorsa | IS | 230 |
| 1 | Jökulsa a Fjöllum | IS | 206 |
| 2 | Thames | GB | 346 |
| 3 | Severn | GB | 354 |
| 4 | Trent | GB | 298 |


In [45]:
df_3a_new = df_3a.astype({'Length': 'float'})
df_3a_new.dtypes

River       object
Country     object
Length     float64
dtype: object

In [46]:
df_3a_new.head()

| | River | Country | Length |
|---|---|---|---|
| 0 | Thjorsa | IS | 230.0 |
| 1 | Jökulsa a Fjöllum | IS | 206.0 |
| 2 | Thames | GB | 346.0 |
| 3 | Severn | GB | 354.0 |
| 4 | Trent | GB | 298.0 |


In [47]:
df_3a_new["Length"].max()

6380.0

In [48]:
df_3a_new["Length"].idxmax()

214

In [49]:
# idxmax() returns an index label, so .loc is the matching accessor.
df_3a_new.loc[df_3a_new["Length"].idxmax()]

River      Yangtze
Country         CN
Length        6380
Name: 214, dtype: object

In [50]:
# This is the longest river.
df_3a_new.loc[df_3a_new["Length"].idxmax()]["River"]

'Yangtze'

In [51]:
df_3a_new.loc[df_3a_new["Length"].idxmax()]["Country"]

'CN'

In [52]:
# This is the name of the country with the longest river.
country_longest = root.find("country[@car_code='CN']")
print(country_longest.find('name').text)

China


In [53]:
df_3a_new.sort_values(by='Length', ascending=False).head(10)

| | River | Country | Length |
|---|---|---|---|
| 214 | Yangtze | CN | 6380.0 |
| 211 | Hwangho | CN | 4845.0 |
| 181 | Lena | R | 4400.0 |
| 404 | Zaire | RCB ZRE | 4374.0 |
| 224 | Mekong | CN MYA LAO THA K VN | 4350.0 |
| 168 | Irtysch | R KAZ CN | 4248.0 |
| 381 | Niger | RMM RN WAN RG | 4184.0 |
| 289 | Missouri | USA | 4130.0 |
| 173 | Jenissej | R | 4092.0 |
| 331 | Amazonas | CO BR PE | 3778.0 |


# We will try to answer question 3)b).

In [54]:
lakes = root.findall(".//lake[@country]/area/..")
print(len(lakes))

189


In [55]:
lakes[0]

<Element 'lake' at 0x7ff11af9b810>

In [56]:
print(lakes[0].find('name').text)
print(lakes[0].get('country'))
print(lakes[0].find('area').text)

Inarijärvi
SF
1040


In [57]:
print(lakes[188].find('name').text)
print(lakes[188].get('country'))
print(lakes[188].find('area').text)

Lake Wanaka
NZ
192


In [58]:
my_dict_b = {'Lake': [], 'Country': [], 'Area': []}

In [59]:
for lake in root.findall(".//lake[@country]/area/.."):
    name_value = lake.find('name').text
    my_dict_b['Lake'].append(name_value)
    
    
    country_value = lake.get('country')
    my_dict_b['Country'].append(country_value)
    
    area_value = lake.find('area').text
    my_dict_b['Area'].append(area_value)

In [60]:
print(len(my_dict_b['Lake']))
print(len(my_dict_b['Country']))
print(len(my_dict_b['Area']))

189
189
189


In [61]:
df_3b = pd.DataFrame(my_dict_b) 
df_3b.head()

| | Lake | Country | Area |
|---|---|---|---|
| 0 | Inarijärvi | SF | 1040 |
| 1 | Oulujärvi | SF | 928 |
| 2 | Saimaa | SF | 4370 |
| 3 | Päijänne | SF | 1118 |
| 4 | Mjoesa-See | N | 368 |


In [62]:
df_3b_new = df_3b.astype({'Area': 'float'})
df_3b_new.dtypes

Lake        object
Country     object
Area       float64
dtype: object

In [63]:
df_3b_new.head()

| | Lake | Country | Area |
|---|---|---|---|
| 0 | Inarijärvi | SF | 1040.0 |
| 1 | Oulujärvi | SF | 928.0 |
| 2 | Saimaa | SF | 4370.0 |
| 3 | Päijänne | SF | 1118.0 |
| 4 | Mjoesa-See | N | 368.0 |


In [64]:
df_3b_new["Area"].max()

386400.0

In [65]:
df_3b_new["Area"].idxmax()

59

In [66]:
# idxmax() returns an index label, so .loc is the matching accessor.
df_3b_new.loc[df_3b_new["Area"].idxmax()]

Lake          Caspian Sea
Country    R AZ KAZ IR TM
Area               386400
Name: 59, dtype: object

In [67]:
# This is the largest lake.
df_3b_new.loc[df_3b_new["Area"].idxmax()]["Lake"]

'Caspian Sea'

In [68]:
df_3b_new.loc[df_3b_new["Area"].idxmax()]["Country"]

'R AZ KAZ IR TM'

Note that the largest lake borders five countries, so the Country value holds five space-separated car codes.

In [69]:
countries_largest = df_3b_new.loc[df_3b_new["Area"].idxmax()]["Country"].split()
print(countries_largest)

['R', 'AZ', 'KAZ', 'IR', 'TM']


In [70]:
countries_largest_2 = []

country_largest_0 = root.find("country[@car_code='R']")
countries_largest_2.append(country_largest_0.find('name').text)

country_largest_1 = root.find("country[@car_code='AZ']")
countries_largest_2.append(country_largest_1.find('name').text)

country_largest_2 = root.find("country[@car_code='KAZ']")
countries_largest_2.append(country_largest_2.find('name').text)

country_largest_3 = root.find("country[@car_code='IR']")
countries_largest_2.append(country_largest_3.find('name').text)

country_largest_4 = root.find("country[@car_code='TM']")
countries_largest_2.append(country_largest_4.find('name').text)
    

In [71]:
# These are the names of the countries with the largest lake.
print(countries_largest_2)

['Russia', 'Azerbaijan', 'Kazakhstan', 'Iran', 'Turkmenistan']


In [72]:
df_3b_new.sort_values(by='Area', ascending=False).head(10)

| | Lake | Country | Area |
|---|---|---|---|
| 59 | Caspian Sea | R AZ KAZ IR TM | 386400.0 |
| 142 | Lake Superior | CDN USA | 82103.0 |
| 110 | Lake Victoria | EAT EAK EAU | 68870.0 |
| 138 | Lake Huron | CDN USA | 59600.0 |
| 141 | Lake Michigan | USA | 57800.0 |
| 50 | Dead Sea | IL JOR WEST | 41650.0 |
| 112 | Lake Tanganjika | ZRE Z BI EAT | 32893.0 |
| 129 | Great Bear Lake | CDN | 31792.0 |
| 47 | Ozero Baikal | R | 31492.0 |
| 118 | Lake Malawi | MW MOC EAT | 29600.0 |


In [73]:
# This is another way to find the countries.

countries_largest_3 = []

for country in countries_largest:
    country_value = root.find("country[@car_code= '%s']" % country) 
    countries_largest_3.append(country_value.find('name').text)
    
print(countries_largest_3)   

['Russia', 'Azerbaijan', 'Kazakhstan', 'Iran', 'Turkmenistan']
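Each `root.find("country[@car_code=...]")` call above searches the country elements again. When many codes have to be translated, for example for every row of the rivers or lakes table, one option is to build a code-to-name dictionary once and reuse it. The sketch below runs on a reduced sample rather than the full file; `code_to_name` and `country_names` are hypothetical names, and unknown codes are assumed to be passed through unchanged:

```python
import xml.etree.ElementTree as ET

# Reduced sample in the shape of the mondial file, for illustration only.
sample = ET.fromstring("""
<mondial>
  <country car_code="R"><name>Russia</name></country>
  <country car_code="AZ"><name>Azerbaijan</name></country>
  <country car_code="KAZ"><name>Kazakhstan</name></country>
  <lake country="R AZ KAZ"><name>Caspian Sea</name><area>386400</area></lake>
</mondial>
""")

# Build the lookup once: car code -> country name.
code_to_name = {
    c.get("car_code"): c.find("name").text
    for c in sample.findall("country")
}

def country_names(codes):
    """Translate a space-separated string of car codes into names."""
    # Codes missing from the lookup are returned as-is (assumption).
    return [code_to_name.get(code, code) for code in codes.split()]

lake = sample.find("lake")
print(country_names(lake.get("country")))  # ['Russia', 'Azerbaijan', 'Kazakhstan']
```

The same helper also covers single-country values such as `'CN'` or `'BOL'`, since splitting a one-code string yields a one-element list.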


# We will try to answer question 3)c).

In [74]:
airports = root.findall(".//airport[@country]/elevation/..")
print(len(airports))

1292


In [75]:
print(airports[0].find('name').text)
print(airports[0].get('country'))
print(airports[0].find('elevation').text)

Herat
AFG
977


In [76]:
print(airports[1291].find('name').text)
print(airports[1291].get('country'))
print(airports[1291].find('elevation').text)

Harare Intl
ZW
1490


In [77]:
my_dict_c = {'Airport': [], 'Country': [], 'Elevation': []}

In [78]:
for airport in root.findall(".//airport[@country]/elevation/.."):
    name_value = airport.find('name').text
    my_dict_c['Airport'].append(name_value)
    
    
    country_value = airport.get('country')
    my_dict_c['Country'].append(country_value)
    
    elevation_value = airport.find('elevation').text
    my_dict_c['Elevation'].append(elevation_value)

In [79]:
print(len(my_dict_c['Airport']))
print(len(my_dict_c['Country']))
print(len(my_dict_c['Elevation']))

1292
1292
1292


In [80]:
df_3c = pd.DataFrame(my_dict_c) 
df_3c.head()

| | Airport | Country | Elevation |
|---|---|---|---|
| 0 | Herat | AFG | 977 |
| 1 | Kabul Intl | AFG | 1792 |
| 2 | Tirana Rinas | AL | 38 |
| 3 | Cheikh Larbi Tebessi | DZ | 811 |
| 4 | Batna Airport | DZ | 822 |


In [81]:
df_3c_new = df_3c.astype({'Elevation': 'int'})
df_3c_new.dtypes

Airport      object
Country      object
Elevation     int64
dtype: object

In [82]:
df_3c_new.head()

| | Airport | Country | Elevation |
|---|---|---|---|
| 0 | Herat | AFG | 977 |
| 1 | Kabul Intl | AFG | 1792 |
| 2 | Tirana Rinas | AL | 38 |
| 3 | Cheikh Larbi Tebessi | DZ | 811 |
| 4 | Batna Airport | DZ | 822 |


In [83]:
df_3c_new["Elevation"].max()

4063

In [84]:
df_3c_new["Elevation"].idxmax()

81

In [85]:
# idxmax() returns an index label, so .loc is the matching accessor.
df_3c_new.loc[df_3c_new["Elevation"].idxmax()]

Airport      El Alto Intl
Country               BOL
Elevation            4063
Name: 81, dtype: object

In [86]:
# This is the airport at the highest elevation.
df_3c_new.loc[df_3c_new["Elevation"].idxmax()]["Airport"]

'El Alto Intl'

In [87]:
df_3c_new.loc[df_3c_new["Elevation"].idxmax()]["Country"]

'BOL'

In [88]:
# This is the name of the country with the airport at the highest elevation.
country_highest = root.find("country[@car_code='BOL']")
print(country_highest.find('name').text)

Bolivia


In [89]:
df_3c_new.sort_values(by='Elevation', ascending=False).head(10)

| | Airport | Country | Elevation |
|---|---|---|---|
| 81 | El Alto Intl | BOL | 4063 |
| 216 | Lhasa-Gonggar | CN | 4005 |
| 233 | Yushu Batang | CN | 3963 |
| 790 | Juliaca | PE | 3827 |
| 792 | Teniente Alejandro Velasco Astete Intl | PE | 3311 |
| 83 | Juana Azurduy De Padilla | BOL | 2905 |
| 312 | Mariscal Sucre Intl | EC | 2813 |
| 782 | Coronel Fap Alfredo Mendivil Duarte | PE | 2719 |
| 784 | Mayor General FAP Armando Revoredo Iglesias Ai... | PE | 2677 |
| 671 | Licenciado Adolfo Lopez Mateos Intl | MEX | 2581 |
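The three parts of question 3 follow the same pattern: select the elements that have a `country` attribute and a numeric child, then collect name, country and value. As a possible refactoring, that pattern could be wrapped in one helper. The sketch below is not the code used above; the function `collect` and the sample document are hypothetical, and every measurement is assumed to parse as a float:

```python
import xml.etree.ElementTree as ET

# Small made-up document; two rivers are incomplete on purpose.
sample = ET.fromstring("""
<mondial>
  <river country="IS"><name>Thjorsa</name><length>230</length></river>
  <river country="GB"><name>Severn</name><length>354</length></river>
  <river><name>No Country</name><length>999</length></river>
  <river country="GB"><name>No Length</name></river>
</mondial>
""")

def collect(root, tag, measure):
    """Collect name, country and a numeric child for every `tag` element
    that has both a country attribute and a `measure` child."""
    rows = []
    for elem in root.findall(f".//{tag}[@country]/{measure}/.."):
        rows.append({
            "name": elem.find("name").text,
            "country": elem.get("country"),
            measure: float(elem.find(measure).text),
        })
    return rows

rivers = collect(sample, "river", "length")
longest = max(rivers, key=lambda row: row["length"])
print(len(rivers))  # 2
print(longest)      # {'name': 'Severn', 'country': 'GB', 'length': 354.0}
```

Against the real file the calls would be `collect(root, "river", "length")`, `collect(root, "lake", "area")` and `collect(root, "airport", "elevation")`, and the resulting list of dictionaries can be passed to `pd.DataFrame` directly.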
