Country Profile Data Project

In this project, I analyzed country profile data from the MONDIAL Database. I used ElementTree and Pandas to process its XML dataset.

Source: http://www.dbis.informatik.uni-goettingen.de/Mondial/

I conducted five tasks on this dataset:

1) I found the top 10 countries with the highest infant mortality rates. 

2) I found the top 10 most populous cities.

3) I found the 10 largest ethnic groups in the world. 

4) I found the name and country for the largest river (by length) and the largest lake (by area).

5) I found the top 10 airports with the highest elevation. 

Introduction:

In [66]:
# Imports
from xml.etree import ElementTree as et
import pandas as pd

In [68]:
# Parses the XML data file and names it document
document = et.parse( './mondial_database.xml' )

In [69]:
# countries is a list of element values for each country 
countries = document.findall('country')

In [70]:
'''
Creates a DataFrame that stores the name, ID, population, infant mortality, and ethnic groups of each country. 
This will be used later to analyze the data. 
'''

# country_list is a list of relevant information for each country that will be used to create the country DataFrame
country_list = [ ] 
# country_labels are the columns of the country DataFrame
country_labels = ['Country', 'Country ID', 'Population', 'Infant Mortality', 'Ethnicities']

# Loops through each country in the dataset. 
for country in countries:
        '''
        name is the name of the country as an element and n is the name as text. 
        Note: The code tests to make sure the name element is not empty before obtaining its text. 
        If the element is empty, the name is empty as well. This is done for population, infant mortality, and ethnic groups as well. 
        '''
        name = country.find('name')
        try: 
            assert name != None
            n = name.text
        except AssertionError:
            n = ""
        
        # id is the country's ID
        id = country.get('car_code')
        
        # pop is the list of the country's population records, one per recorded year. p is the most recent population as an integer. 
        pop = country.findall('population')
        try:
            # An empty list means the country has no population records, so pop[-1] would raise an IndexError
            assert pop != []
            p = int(pop[-1].text)
        except AssertionError:
            p = None
        
        # inf is the infant mortality rate stored as an element. i is the infant mortality stored as a decimal. 
        inf = country.find('infant_mortality')
        try: 
            assert inf != None
            i = float(inf.text)
        except AssertionError:
            i = None
            
        # ethnic is the list of all ethnicgroup elements in the particular country
        ethnic = country.findall('ethnicgroup')
        # e is the list of all ethnic group names and populations for the particular country
        e = [ ] 
        try:
            # Skips the calculation when there are no ethnic groups or no population figure to scale by
            assert ethnic != [ ] and p is not None
            # This loops through each ethnic group for the particular country.
            for group in ethnic:
                # For the specific ethnic group, eth_dic stores its name and total population. 
                eth_dic = { }
                eth_dic['name'] = group.text
                # This calculates the total population of the ethnic group based on the percentage and most recent population estimate given in the dataset. 
                eth_dic['population'] = int(float(group.get('percentage')) * p / 100) 
                e.append(eth_dic)
        except AssertionError:
            e = []
        
        # new_country is all the relevant country information stored as a tuple. It is then added to country_list. 
        new_country = ( n, id, p, i, e)
        country_list.append(new_country)

# country_df is the DataFrame that stores relevant information about each country. 
country_df = pd.DataFrame.from_records(country_list, columns=country_labels)
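
The cell above repeats the same "check the element, then read its text" pattern for every field. As a rough sketch only, that pattern could be folded into one helper that does not rely on assert (assert statements are skipped when Python runs with the -O flag). The helper name child_text, its parameters, and the sample record below are all hypothetical and are not part of the original notebook.

```python
from xml.etree import ElementTree as et

def child_text(parent, tag, default=None, cast=str, last=False):
    """Return the converted text of a `tag` child, or `default` if it is missing or empty."""
    matches = parent.findall(tag)
    if not matches:
        return default
    # Take the last match for fields with one record per year, otherwise the first
    element = matches[-1] if last else matches[0]
    if element.text is None:
        return default
    return cast(element.text)

# Hypothetical miniature country record, not taken from the real dataset
sample = et.fromstring(
    '<country car_code="XX">'
    '<name>Examplia</name>'
    '<population year="2001">900</population>'
    '<population year="2011">1000</population>'
    '</country>'
)

country_name = child_text(sample, 'name', default="")
latest_population = child_text(sample, 'population', cast=int, last=True)
infant_mortality = child_text(sample, 'infant_mortality', cast=float)
print(country_name, latest_population, infant_mortality)
```

With a helper like this, a missing infant_mortality element simply yields None instead of needing its own try/except block.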

Task 1:

Find the top 10 countries with the highest infant mortality rates

In [71]:
# Top 10 countries with the highest infant mortality rates
country_df.sort_values('Infant Mortality', ascending = False).head(10)

Unnamed: 0,Country,Country ID,Population,Infant Mortality,Ethnicities
194,Western Sahara,WSA,554795,145.82,[]
54,Afghanistan,AFG,26023100,117.23,"[{'name': 'Tajik', 'population': 6505775}, {'n..."
189,Mali,RMM,13985961,104.34,"[{'name': 'Mande', 'population': 6992980}, {'n..."
226,Somalia,SP,9636173,100.14,"[{'name': 'Somali', 'population': 8190747}]"
213,Central African Republic,RCA,4349921,92.86,"[{'name': 'Baya', 'population': 1478973}, {'na..."
230,Guinea-Bissau,GNB,1586624,90.92,"[{'name': 'European', 'population': 15866}, {'..."
214,Chad,TCH,11720781,90.3,[]
192,Niger,RN,17138707,86.27,"[{'name': 'Fula', 'population': 1456790}, {'na..."
195,Angola,ANG,24383301,79.99,"[{'name': 'European', 'population': 243833}, {..."
201,Burkina Faso,BF,17322796,76.8,"[{'name': 'Mossi', 'population': 4157471}]"


Task 2:

Find the top 10 most populous cities

In [77]:
# Creates a DataFrame that stores name, country, and population of every city. 

# city_list is a list of relevant information for each city that will be used to create the city DataFrame
city_list = []

# city_labels are the columns of the city DataFrame
city_labels = ['City', 'Country', 'Population']

for country in countries: 
    # country_cities is the list of all cities in that particular country 
    country_cities = country.findall('city')
    
    '''
    Some cities are only listed under a specific province. 
    This searches through all provinces to find these cities and add them to the country_cities list. 
    '''
    provinces = country.findall('province')
    
    # Loops through each province to find their city
    for province in provinces:
        # Adds the cities in this province to the country_cities list
        country_cities += province.findall('city')
    
    for city in country_cities: 
        # city_name is the city's name
        city_name = city.find('name').text
        
        #pop is the list of the city's population records for every year stored. p is the most recent population.  
        pop = city.findall('population')
        try:
            assert pop != []
            p = int(pop[-1].text)
            new_city = (city_name, country.find('name').text, p)
            city_list.append(new_city)
        except AssertionError:
            continue
    
    

# city_df is the DataFrame of all cities     
city_df = pd.DataFrame.from_records(city_list, columns = city_labels)
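
The cell above collects cities in two passes: direct children of the country, then children of each province. One possible alternative, shown here only as a sketch, is Element.iter, which walks the whole subtree and returns matching elements at any depth. The sample XML is invented for illustration, and whether the real file nests cities anywhere other than under provinces is not something I checked.

```python
from xml.etree import ElementTree as et

# Hypothetical country with one top-level city and one city nested in a province
sample = et.fromstring(
    '<country><name>Examplia</name>'
    '<city><name>Alpha</name><population>10</population></city>'
    '<province><name>North</name>'
    '<city><name>Beta</name><population>20</population></city>'
    '</province></country>'
)

# findall('city') only sees direct children of the country element
direct_only = [city.find('name').text for city in sample.findall('city')]

# iter('city') sees every city element below the country, in document order
all_depths = [city.find('name').text for city in sample.iter('city')]

print(direct_only)
print(all_depths)
```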

In [80]:
# Top 10 most populous cities
city_df.sort_values('Population', ascending = False).head(10)

Unnamed: 0,City,Country,Population
1251,Shanghai,China,22315474
707,Istanbul,Turkey,13710512
1421,Mumbai,India,12442373
443,Moskva,Russia,11979529
1250,Beijing,China,11716620
2594,São Paulo,Brazil,11152344
1252,Tianjin,China,11090314
974,Guangzhou,China,11071424
1467,Delhi,India,11034555
977,Shenzhen,China,10358381



Task 3: 

Find the 10 largest ethnic groups in the world

In [86]:
# Sums the population of each ethnic group across the world and creates a DataFrame with those values 

# ethnic_dic stores each ethnic group's name and running total population
ethnic_dic = { }
# ethnic_labels are the labels for the new ethnic group DataFrame. 
ethnic_labels = ['Ethnicity', 'Global Population']

# Loops through each country and adds its ethnic groups to the ethnic_dic's current global total
for country in country_df.Ethnicities:
    try: 
        assert country != [ ]
        
        '''
        Loops through each ethnic group in each country
        If the ethnic group has not been added to ethnic_dic, it creates a new key for that group.
        Otherwise, it adds the population of that ethnic group to ethnic_dic's running total. 
        '''
        for e in country:
            if e['name'] in ethnic_dic:
                ethnic_dic[e['name']] = ethnic_dic[e['name']] + e['population']
            else:
                ethnic_dic[e['name']] = e['population']
                
    except AssertionError:
        continue

# ethnic_list stores the global populations of each ethnic group as a list
ethnic_list = [ ] 
for e in ethnic_dic.items():
    ethnic_list.append(e)

# ethnic_df stores the global populations of each ethnic group as a DataFrame
ethnic_df = pd.DataFrame.from_records(ethnic_list, columns=ethnic_labels)
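
The running-total logic above (create the key if it is new, otherwise add to it) is what collections.Counter does by default. The following is a small sketch of the same summation on made-up data shaped like the Ethnicities column; the group names and numbers are placeholders, not values from the dataset.

```python
from collections import Counter

# Hypothetical per-country lists in the same shape as the Ethnicities column
ethnicities_by_country = [
    [{'name': 'A', 'population': 100}, {'name': 'B', 'population': 50}],
    [],  # a country with no ethnic group records contributes nothing
    [{'name': 'A', 'population': 25}],
]

global_totals = Counter()
for groups in ethnicities_by_country:
    for group in groups:
        # Missing keys start at 0, so no membership test is needed
        global_totals[group['name']] += group['population']

largest = global_totals.most_common(1)
print(largest)
```

An empty list needs no special handling here because the inner loop just does not run.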

In [22]:
# Finds the 10 largest ethnic groups in the world
ethnic_df.sort_values('Global Population', ascending = False).head(10)

Unnamed: 0,Ethnicity,Global Population
80,Han Chinese,1245058800
106,Indo-Aryan,871815583
128,European,494872201
16,African,318325104
105,Dravidian,302713744
150,Mestizo,157734349
98,Bengali,146776916
33,Russian,131856989
139,Japanese,126534212
110,Malay,121993548


Task 4: 

Find the name and country for the largest river and largest lake.

In [32]:
# Creates a waters DataFrame for all lakes and rivers and uses it to find the largest of each

# water_list is the list of bodies of water that will eventually be made into the waters DataFrame. 
water_list = []

# water_labels are the columns of the waters DataFrame
water_labels = ['Name', 'Type', 'Size', 'Country ID'] 

# rivers is the list of all rivers, and lakes the list of all lakes from the dataset. 
rivers = document.findall('river')
lakes = document.findall('lake')

In [88]:
# Adds all rivers in the dataset to water_list 

# Loops through each river to find each's name, size, and location
for river in rivers:
    '''
    name is the name of the river as an element and n is the name as text. 
    Note: The code tests to make sure the name element is not empty before obtaining its text. 
    If the element is empty, the name is empty as well. The same test is applied to the length and source elements. 
    '''
    name = river.find('name')
    try: 
        assert name != None
        n = name.text
    except AssertionError:
        n = "" 
    
    # t is the type of water body it is, which in this case is 'River'. 
    t = 'River'
    
    # size is the length of the river as an element, and s is the size as a decimal (in km) 
    size = river.find('length')
    try: 
        assert size != None
        s = float(size.text)
    except AssertionError:
        s = None
    
    # loc is the river's source as an element, and a is the country of the source (by country ID) as a string
    loc = river.find('source')
    try:
        assert loc != None
        a = loc.get('country')
    except AssertionError:
        a = ""
    
    # new_water is all the relevant information for the new body of water stored as a tuple. It is then added to water_list. 
    new_water = (n, t, s, a)
    water_list.append(new_water)

In [89]:
# Adds all lakes in the dataset to water_list 

# Loops through each lake to find each's name, size, and location
for lake in lakes:
    '''
    name is the name of the lake as an element and n is the name as text. 
    Note: The code tests to make sure the name element is not empty before obtaining its text. 
    If the element is empty, the name is empty as well. The same test is applied to the area and located elements. 
    '''
    name = lake.find('name')
    try: 
        assert name != None
        n = name.text
    except AssertionError:
        n = "" 
    
    # t is the type of water body it is, which in this case is 'Lake'. 
    t = 'Lake'
    
    # size is the area of the lake as an element, and s is the size as a decimal
    size = lake.find('area')
    try: 
        assert size != None
        s = float(size.text)
    except AssertionError:
        s = None
    
    # loc is the lake's location as an element, and a is the country where it is located (by country ID) as a string
    loc = lake.find('located')
    try:
        assert loc != None
        a = loc.get('country')
    except AssertionError:
        a = ""
    
    # new_water is all the relevant information for the new body of water stored as a tuple. It is then added to water_list. 
    new_water = (n, t, s, a)
    water_list.append(new_water)

In [90]:
# Creates a DataFrame for all lakes and rivers
waters_df = pd.DataFrame(water_list, columns = water_labels)

In [91]:
# Creates a country_key, which stores the name and Country ID for each country

country_key = pd.DataFrame({"Country": country_df['Country'].tolist(), "Country ID": country_df['Country ID'].tolist() })

In [92]:
# Adds the country name to each body of water. 

waters_df = waters_df.merge(country_key)
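
Called with no extra arguments, merge performs an inner join on the shared 'Country ID' column, so any body of water whose country ID is empty or has no match in country_key is dropped from the result. The sketch below, on invented rows, shows one way to keep those rows and flag them instead, using the how and indicator parameters of merge. The names waters and key are stand-ins for waters_df and country_key.

```python
import pandas as pd

# Hypothetical rows: the second one has no usable country ID
waters = pd.DataFrame({'Name': ['River One', 'Lake Two'], 'Country ID': ['XX', '']})
key = pd.DataFrame({'Country': ['Examplia'], 'Country ID': ['XX']})

# Default behaviour: inner join, unmatched rows disappear
inner = waters.merge(key)

# Left join keeps every body of water; _merge records whether a country was found
left = waters.merge(key, on='Country ID', how='left', indicator=True)
unmatched = left[left['_merge'] == 'left_only']

print(len(inner), len(left))
print(unmatched['Name'].tolist())
```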

In [93]:
# Largest River
waters_df[waters_df.Type == 'River'].sort_values('Size', ascending = False).head(1)

# Note: The length of the Nile is missing from the dataset, which is why it does not appear here. 

Unnamed: 0,Name,Type,Size,Country ID,Country
687,Amazonas,River,6448.0,PE,Peru
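
A river with no recorded length is stored with a Size of None, which Pandas treats as NaN. By default sort_values places NaN rows last regardless of sort direction, so such a river can never appear at the top of the ranking. A minimal illustration on made-up rows (the names and lengths are placeholders):

```python
import pandas as pd

# Hypothetical rivers; the second has no recorded length
rivers = pd.DataFrame({
    'Name': ['Short', 'Unmeasured', 'Long'],
    'Size': [10.0, None, 200.0],
})

# NaN sorts to the end by default (na_position='last'), even when descending
ranked = rivers.sort_values('Size', ascending=False)
ranked_names = ranked['Name'].tolist()

# Listing the rows with missing sizes makes the gap visible
missing_names = rivers[rivers['Size'].isna()]['Name'].tolist()

print(ranked_names)
print(missing_names)
```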


In [94]:
# Largest Lake
waters_df[waters_df.Type == 'Lake'].sort_values('Size', ascending = False).head(1)

Unnamed: 0,Name,Type,Size,Country ID,Country
467,Caspian Sea,Lake,386400.0,R,Russia


Task 5:

Find the top 10 airports with the highest elevation.

In [61]:
# airport_list is a list of relevant information for each airport that will be used to create the airport DataFrame
airport_list = []
# airport_labels are the columns of the airport DataFrame
airport_labels = ['Name', 'Elevation', 'Country ID']

# airports is the list of all airports
airports = document.findall('airport')

# Loops through each airport in the dataset
for airport in airports:
    '''
    name is the name of the airport as an element and n is the name as text. 
    Note: The code tests to make sure the name element is not empty before obtaining its text. 
    If the element is empty, the name is empty as well. The same test is applied to elevation. 
    '''
    name = airport.find('name')
    try:
        assert name != None
        n = name.text
    except AssertionError:
        n = ""
        
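    # elevation is the airport's elevation as an element, and e is the elevation as an integer
    elevation = airport.find('elevation')
    try:
        # Guards against both a missing elevation element and an element with no text
        assert elevation is not None and elevation.text is not None
        e = int(elevation.text)
    except AssertionError:
        e = None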
        
    # id is the country id for the airport
    id = airport.get('country')
    
    # new_airport is all the relevant airport information stored as a tuple. It is then added to airport_list. 
    new_airport = (n, e, id)
    airport_list.append(new_airport)

# airports_df is the DataFrame that stores relevant information about each airport. 
airports_df = pd.DataFrame(airport_list, columns=airport_labels)

In [95]:
# Adds the country name to each airport row. 

airports_df = airports_df.merge(country_key)

In [96]:
# 10 airports with the highest elevation
airports_df.sort_values('Elevation', ascending = False).head(10)

Unnamed: 0,Name,Elevation,Country ID,Country
80,El Alto Intl,4063.0,BOL,Bolivia
219,Lhasa-Gonggar,4005.0,CN,China
241,Yushu Batang,3963.0,CN,China
813,Juliaca,3827.0,PE,Peru
815,Teniente Alejandro Velasco Astete Intl,3311.0,PE,Peru
82,Juana Azurduy De Padilla,2905.0,BOL,Bolivia
334,Mariscal Sucre Intl,2813.0,EC,Ecuador
805,Coronel Fap Alfredo Mendivil Duarte,2719.0,PE,Peru
807,Mayor General FAP Armando Revoredo Iglesias Ai...,2677.0,PE,Peru
692,Licenciado Adolfo Lopez Mateos Intl,2581.0,MEX,Mexico
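
The elevations above print as 4063.0 rather than 4063 even though each value was converted with int. The most likely reason is that some airports have no elevation, and a numeric column containing None is stored as float64 with NaN. The sketch below reproduces the effect on invented rows and shows the nullable 'Int64' dtype as one possible way to keep whole numbers. I am assuming a reasonably recent Pandas version for the astype('Int64') conversion.

```python
import pandas as pd

# Hypothetical airports; the second has no recorded elevation
elevations = pd.DataFrame({
    'Name': ['High Field', 'Unknown Field'],
    'Elevation': [4063, None],
})

# The None forces the column to float64, so 4063 is displayed as 4063.0
print(elevations['Elevation'].dtype)

# The nullable integer dtype keeps 4063 as an integer and marks the gap as <NA>
nullable = elevations['Elevation'].astype('Int64')
print(nullable.dtype)
```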
