# Exercise - Data Exchange Formats with Python

In the following three exercises, you are asked to write Python code for reading
data from XML, JSON and RDF files and for querying the data using the XPath
and SPARQL query languages. Each section is dedicated to one of the three
data exchange formats. The tasks are rather basic; the goal is to refresh
your knowledge of Python in general and of parsing these formats in particular.

## 1 XML

This section is dedicated to the XML format. In particular, you are asked
to perform XPath queries on the Mondial dataset. This dataset includes world
geographic information integrated from the CIA World Factbook, the International
Atlas and the TERRA database, to name just the predominant sources.

Please inspect the documents in 'input/' manually (using a text editor) to explore
their structure.

You can also consult the [W3Schools XPath tutorial](https://www.w3schools.com/xml/xpath_intro.asp) when solving the following tasks.

### 1.1 Load the dataset and inspect the schema

We use the [pandas](https://pandas.pydata.org/) library to load and process XML files in Python.

Pandas offers the function [read_xml](https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html) to read XML documents into a pandas DataFrame object.
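As a rough illustration of how `read_xml` maps XML to a DataFrame, here is a minimal sketch on a made-up document. The `toy_xml` string and its `library`/`book` structure are assumptions for illustration only and are not the Mondial schema. The sketch passes `parser="etree"` so that it runs without `lxml`; the default parser is `lxml`.

```python
from io import StringIO

import pandas as pd

# Hypothetical toy document; it is NOT the Mondial schema.
toy_xml = """<library>
  <book title="Dune" year="1965" lang="en"/>
  <book title="Solaris" year="1961" lang="pl"/>
  <book title="Neuromancer" year="1984" lang="en"/>
</library>"""

# With the default xpath ("./*"), every child of the root becomes one row.
# Attributes and child elements of those nodes become the columns.
books = pd.read_xml(StringIO(toy_xml), parser="etree")

print(books.columns.tolist())
print(books.shape)
```

For a file on disk, the path string can be passed in place of the `StringIO` object.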

In this first task, load the dataset from 'input/mondial-3.0.xml' and print the names of the nodes below the root node.

In [None]:
import pandas as pd

# Load the file and return the columns. The columns of the dataframe represent nodes of the input XML.


### 1.2 Basic XPath

Adapt the solution of the previous task so that it prints
the names of all countries that belong to the continent named Europe.

Hints: 
- Have a look at the schema of the node country to see how it is linked to the continent.
- The xpath parameter of [read_xml](https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html) should return a collection of elements and not a single element. Select the 'name' attribute using the pandas syntax.
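The second hint can be sketched on a made-up document: the `xpath` argument selects a collection of elements, and the attribute of interest is then picked as a DataFrame column. The `toy_xml` string and its attribute names are hypothetical and differ from the Mondial schema, so the expression needs to be adapted to the structure you find in the real file.

```python
from io import StringIO

import pandas as pd

# Hypothetical toy document; it is NOT the Mondial schema.
toy_xml = """<library>
  <book title="Dune" year="1965" lang="en"/>
  <book title="Solaris" year="1961" lang="pl"/>
  <book title="Neuromancer" year="1984" lang="en"/>
</library>"""

# The predicate [@lang='en'] keeps only <book> elements whose
# lang attribute equals "en"; the result is a collection of elements.
english = pd.read_xml(
    StringIO(toy_xml),
    xpath=".//book[@lang='en']",
    parser="etree",  # standard library parser, supports a subset of XPath
)

# Attributes are ordinary columns, so pandas indexing selects them.
titles = english["title"].tolist()
print(titles)
```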

In [None]:
# Load the XML into a pandas dataframe


### 1.3 Basic XPath

Extend the XPath expression from the previous task to retrieve only countries that are part of both Europe and Asia.
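One way to require two conditions at once is to chain predicates, since an element must satisfy every predicate in the chain. The sketch below shows this on a made-up document with hypothetical `book` and `shelf` elements. It does not mirror how Mondial links countries to continents, so treat it as an illustration of the technique only. With the default `lxml` parser, XPath 1.0 also allows combining conditions with `and` inside a single predicate.

```python
from io import StringIO

import pandas as pd

# Hypothetical toy document; it is NOT the Mondial schema.
toy_xml = """<library>
  <book title="Dune"><shelf>fiction</shelf><shelf>classics</shelf></book>
  <book title="Solaris"><shelf>fiction</shelf></book>
  <book title="Walden"><shelf>classics</shelf></book>
</library>"""

# Each predicate tests for a <shelf> child with the given text.
# Chaining them means both must hold for the same <book>.
both = pd.read_xml(
    StringIO(toy_xml),
    xpath=".//book[shelf='fiction'][shelf='classics']",
    parser="etree",
)

both_titles = both["title"].tolist()
print(both_titles)
```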

In [None]:
# Load the XML into a pandas dataframe


## 2 JSON

### 2.1 Load the dataset

We use the [pandas](https://pandas.pydata.org/) library to load and process JSON files in Python.

Pandas offers the function [read_json](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html) to read JSON documents into a pandas DataFrame object.

Load the dataset from 'input/mondial-3.0-europe-countries.json' and calculate the total number of inhabitants of all countries in the file.

Hint:
- Double-check the format of the JSON document before loading the file.
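One format variation worth knowing about is line-delimited JSON, where each line holds a separate object and there is no enclosing array. Whether the exercise file uses this layout is something to verify by opening it. The sketch below uses made-up records with hypothetical `city` and `inhabitants` fields to show how `read_json` handles such input via `lines=True`, followed by a column sum.

```python
from io import StringIO

import pandas as pd

# Hypothetical line-delimited JSON: one object per line, no enclosing array.
toy_json_lines = """{"city": "Aville", "inhabitants": 1200}
{"city": "Btown", "inhabitants": 800}
{"city": "Cburg", "inhabitants": 500}"""

# lines=True tells pandas to parse each line as its own JSON object.
cities = pd.read_json(StringIO(toy_json_lines), lines=True)

# Summing a numeric column gives the total over all rows.
total = int(cities["inhabitants"].sum())
print(total)
```

If the file is instead a single JSON array of objects, the same call without `lines=True` is the usual starting point.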

In [None]:
# Load and inspect the dataset.


In [None]:
# Calculate the total population.


## 3 RDF

In the last part of this exercise session, we focus on RDF and SPARQL.
On the course web page you can find the European countries with their name,
population and spoken languages stored as an RDF file. The file was generated
from the original Mondial XML file.

In the following, you will be asked to formulate SPARQL queries that answer questions about the dataset using the [rdflib](https://rdflib.readthedocs.io/en/stable/index.html) library.

In addition to the lecture, the [W3C SPARQL Query Language specification](https://www.w3.org/TR/rdf-sparql-query/) can help you answer the questions.

In [None]:
# Install rdflib
#!pip install rdflib

### 3.1 Load the graph and query with SPARQL I

Load the graph and formulate a SPARQL query which returns the name and id of all countries within the dataset ordered by the name.

What is the last country on this list?

Hints:
- Explore the property names and namespaces in the RDF file using a text editor.
- Parse the SPARQL query result into a pandas dataframe for simple processing.

In [None]:
import rdflib

# Load the graph from 'input/mondial-3.0-europe-countries.rdf'.


In [None]:
# Define the query


# Execute query


# Convert result and load with pandas


# Print tail


### 3.2 Query with SPARQL II

Query for the largest countries in the dataset by population and return the countries ranked 6 to 10, i.e. the second group of five.

In [None]:
# Define the query

# Execute query


# Convert result and load with pandas


# Print head


### 3.3 Query with SPARQL III

Query for all German-speaking countries.

In [None]:
# Define the query


# Execute query


# Convert result and load with pandas


# Print head
