# Exercise - Data Exchange Formats with Python

In the following three exercises, you are asked to write Python code for reading
data from XML, JSON and RDF files and for querying the data using the XPath
and SPARQL query languages. Each subsection is dedicated to one of the three
data exchange formats. The tasks are rather basic; the goal is to refresh
your knowledge of Python in general and of parsing these formats in particular.

## 1 XML

This subsection is dedicated to the XML format. In particular, you are asked
to perform XPath queries on the Mondial dataset. This dataset includes world
geographic information integrated from the CIA World Factbook, the International
Atlas and the TERRA database, to name just the predominant sources.

Please inspect the documents in 'input/' manually (using a text editor) to explore
their structure.

You can also consult the [W3Schools XPath tutorial](https://www.w3schools.com/xml/xpath_intro.asp) when solving the following tasks.

### 1.1 Load the dataset and inspect the schema

We use the [pandas](https://pandas.pydata.org/) library to load and process XML files in Python.

Pandas offers the function [read_xml](https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html) to read XML documents into a pandas DataFrame object.

In this first task, load the dataset from 'input/mondial-3.0.xml' and print the names of the nodes below the root node.

In [2]:
import pandas as pd

# Load the file and return the columns. The columns of the dataframe represent nodes of the input XML.
df_nodes = pd.read_xml("input/mondial-3.0.xml", xpath = "/*")
df_nodes.columns

Index(['continent', 'country', 'organization', 'mountain', 'desert', 'island',
       'river', 'sea', 'lake'],
      dtype='object')
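
To cross-check the column list without pandas, the child element names can also be collected with the standard library. The following is a minimal sketch on a hypothetical miniature document: `SAMPLE_XML` is made up for illustration and only mimics the top-level layout suggested by the output above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the real file is much larger.
SAMPLE_XML = """<mondial>
  <continent id="c1" name="Europe"/>
  <continent id="c2" name="Asia"/>
  <country id="k1"/>
  <river id="r1"/>
</mondial>"""

root = ET.fromstring(SAMPLE_XML)

# Collect the distinct child tag names in document order.
child_tags = list(dict.fromkeys(child.tag for child in root))
print(child_tags)
```

For the real dataset, `ET.parse("input/mondial-3.0.xml").getroot()` would take the place of `ET.fromstring(SAMPLE_XML)`.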

### 1.2 Basic XPath

Adapt the solution of the previous task so that it prints
the names of all countries which belong to the continent with the name Europe.

Hints: 
- Have a look at the schema of the node country to see how it is linked to the continent.
- The xpath parameter of [read_xml](https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html) should return a collection of elements and not a single element. Select the 'name' column from the resulting DataFrame using pandas syntax.
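
One way to look at the schema of a `country` node programmatically is to list its attributes and child elements. The sketch below does this with the standard library on a hypothetical excerpt; the ids, attribute names and values in `SAMPLE_XML` are assumptions for illustration, so check them against the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt; ids, attributes and values are assumptions for illustration.
SAMPLE_XML = """<mondial>
  <continent id="c1" name="Europe"/>
  <country id="k1" capital="city1">
    <name>Albania</name>
    <encompassed continent="c1" percentage="100"/>
  </country>
</mondial>"""

root = ET.fromstring(SAMPLE_XML)
country = root.find("country")  # first country element below the root

# Attribute names of the country element itself.
country_attributes = sorted(country.attrib)
# Tag and attributes of each child element; references to other nodes show up here.
child_overview = [(child.tag, dict(child.attrib)) for child in country]

print(country_attributes)
print(child_overview)
```

In this sample, the `continent` attribute of `encompassed` holds the `id` of a `continent` element, which is the kind of link an XPath predicate can follow.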

In [10]:
# Select every country whose encompassed/@continent equals the id of the continent named 'Europe'.
df_europe = pd.read_xml("input/mondial-3.0.xml",
                        xpath="/mondial/country[encompassed/@continent=/mondial/continent[@name='Europe']/@id]")
df_europe['name'].head()

0    \n       Albania\n     
1    \n       Andorra\n     
2    \n       Austria\n     
3    \n       Belarus\n     
4    \n       Belgium\n     
Name: name, dtype: object
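
The names in the output above still carry the line breaks and indentation from the XML source. One way to clean them is the pandas string accessor `str.strip()`, sketched here on a small hand-made Series; `raw_names` is a hypothetical stand-in for `df_europe['name']`.

```python
import pandas as pd

# Stand-in for df_europe['name']; values copied from the output above.
raw_names = pd.Series(["\n       Albania\n     ", "\n       Andorra\n     "], name="name")

# str.strip() removes leading and trailing whitespace, including newlines.
clean_names = raw_names.str.strip()
print(clean_names.tolist())
```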

### 1.3 Basic XPath

Extend the XPath from the previous task to retrieve only countries that are part of both Europe and Asia.

In [4]:
# Select countries that reference both the continent named 'Europe' and the one named 'Asia'.
df_europe_and_asia = pd.read_xml(
    "input/mondial-3.0.xml",
    xpath="/mondial/country[encompassed/@continent=/mondial/continent[@name='Europe']/@id"
          " and encompassed/@continent=/mondial/continent[@name='Asia']/@id]")
df_europe_and_asia['name']

0    \n       Russia\n     
1    \n       Turkey\n     
Name: name, dtype: object
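
The predicate works because, in XPath 1.0, comparing node-sets with `=` is true as soon as any pair of nodes matches, so a country with several `encompassed` children can satisfy both conditions at once. The same logic can be sketched in plain Python with sets. The sample document and its ids are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; ids and percentages are made up for illustration.
SAMPLE_XML = """<mondial>
  <continent id="c1" name="Europe"/>
  <continent id="c2" name="Asia"/>
  <country id="k1"><name>Albania</name>
    <encompassed continent="c1" percentage="100"/></country>
  <country id="k2"><name>Russia</name>
    <encompassed continent="c1" percentage="20"/>
    <encompassed continent="c2" percentage="80"/></country>
</mondial>"""

root = ET.fromstring(SAMPLE_XML)

# Map continent names to ids, mirroring /mondial/continent[@name='...']/@id.
continent_ids = {c.get("name"): c.get("id") for c in root.findall("continent")}
required = {continent_ids["Europe"], continent_ids["Asia"]}

both = []
for country in root.findall("country"):
    # All continent ids referenced by this country's encompassed children.
    referenced = {e.get("continent") for e in country.findall("encompassed")}
    if required <= referenced:  # both ids must be referenced
        both.append(country.findtext("name").strip())

print(both)
```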

## 2 JSON

### 2.1 Load the dataset

We use the [pandas](https://pandas.pydata.org/) library to load and process JSON files in Python.

Pandas offers the function [read_json](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html) to read JSON documents into a pandas DataFrame object.

Load the dataset from 'input/mondial-3.0-europe-countries.json' and calculate the total number of inhabitants of all countries in the file.

Hint:
- Double-check the format of the JSON document before loading the file.
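
A JSON file can hold a single document (for example one array of records) or one JSON object per line (JSON Lines); `read_json` needs `lines=True` only for the latter. A rough way to tell the two apart with the standard library is sketched below. The helper name `looks_like_json_lines` and the sample strings are hypothetical.

```python
import json

def looks_like_json_lines(text: str) -> bool:
    """Return True if text is not one JSON document but every non-empty line is."""
    try:
        json.loads(text)
        return False  # the whole text parses as a single JSON document
    except json.JSONDecodeError:
        pass
    lines = [line for line in text.splitlines() if line.strip()]
    try:
        for line in lines:
            json.loads(line)
    except json.JSONDecodeError:
        return False
    return len(lines) > 0

single_document = '[{"name": "Albania"}, {"name": "Andorra"}]'
json_lines = '{"name": "Albania"}\n{"name": "Andorra"}\n'

print(looks_like_json_lines(single_document))
print(looks_like_json_lines(json_lines))
```

A file that contains exactly one object on one line parses as a single document, so this heuristic reports it as not being JSON Lines.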

In [5]:
# Load and inspect the dataset.
df_countries = pd.read_json('input/mondial-3.0-europe-countries.json', lines=True)
df_countries.head()

| | id | total_area | infant_mortality | datacode | name | indep_date | gdp_total | population_growth | inflation | government | gdp_agri | car_code | capital | population | gdp_serv | gdp_ind |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | f0_136 | 28750.0 | 49.2 | AL | Albania | 28 11 1912 | 4100.0 | 1.34 | 16.0 | emerging democracy | 55.0 | AL | f0_1461 | 3249136 | | |
| 1 | f0_144 | 450.0 | 2.2 | AN | Andorra | | 1000.0 | 2.96 | | parliamentary democracy that retains as i... | | AND | f0_1464 | 72766 | | |
| 2 | f0_149 | 83850.0 | 6.2 | AU | Austria | 12 11 1918 | 152000.0 | 0.41 | 2.3 | federal republic | 2.0 | A | f0_1467 | 8023244 | 64.0 | 34.0 |
| 3 | f0_157 | 207600.0 | 13.4 | BO | Belarus | 25 08 1991 | 49200.0 | 0.2 | 244.0 | republic | 21.0 | BY | f0_1474 | 10415973 | 30.0 | 49.0 |
| 4 | f0_162 | 30510.0 | 6.4 | BE | Belgium | 04 10 1830 | 197000.0 | 0.33 | 1.6 | constitutional monarchy | 2.0 | B | f0_1477 | 10170241 | 70.0 | 28.0 |


In [6]:
# Calculate the total population.
df_countries['population'].sum()

792002189
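
As a sanity check, the same kind of sum can be computed without pandas by parsing each line with the `json` module. This sketch assumes the one-object-per-line layout implied by `lines=True` above and uses a two-record sample (population values taken from the table above) in place of the real file.

```python
import io
import json

# Two records in JSON Lines layout; population values from the table above.
sample_file = io.StringIO(
    '{"name": "Albania", "population": 3249136}\n'
    '{"name": "Andorra", "population": 72766}\n'
)

# Parse one JSON object per non-empty line and add up the population field.
total_population = sum(
    json.loads(line)["population"] for line in sample_file if line.strip()
)
print(total_population)
```

To run this against the dataset, an `open('input/mondial-3.0-europe-countries.json')` file handle would take the place of `sample_file`.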