<a href="https://colab.research.google.com/github/sreent/data-management-intro/blob/main/past-exam-papers/september-2024/notebook-september-2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CM3010 September 2024 - Practice Notebook

This notebook provides hands-on practice for the September 2024 exam.

**Exam Structure:**
- Section A: 10 MCQs (Q1a-j) - 40 marks (on Inspera)
- Section B: Answer 2 of 3 questions - 60 marks
  - Q2: Historical Lute Music Database
  - Q3: Poetry Contest XML/TEI
  - Q4: Wikidata SPARQL / Belgian Artists / MongoDB

**Instructions:**
1. Run the Setup cells first
2. Write your answers in the empty code cells
3. Check your answers against the solution sheet

---

# 1. Environment Setup

Run these cells first to set up MySQL, MongoDB, xmllint, and the SPARQL helper function.

In [None]:
# === MySQL Setup ===
!apt -qq update > /dev/null
!apt -y -qq install mysql-server > /dev/null
!service mysql start

# Create user and database
!mysql -e "CREATE USER IF NOT EXISTS 'examuser'@'localhost' IDENTIFIED BY 'exampass';"
!mysql -e "CREATE DATABASE IF NOT EXISTS exam_db;"
!mysql -e "GRANT ALL PRIVILEGES ON *.* TO 'examuser'@'localhost';"

# === xmllint Setup (for XML/XPath exercises) ===
!apt -y -qq install libxml2-utils > /dev/null

# === Python libraries ===
!pip install -q sqlalchemy==2.0.20 ipython-sql==0.5.0 pymysql==1.1.0 prettytable==2.0.0 lxml sparqlwrapper

%reload_ext sql
%sql mysql+pymysql://examuser:exampass@localhost/exam_db

print("MySQL ready!")
print("xmllint ready!")

In [None]:
# === MongoDB Setup ===
!wget -q http://archive.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.1f-1ubuntu2_amd64.deb
!dpkg -i libssl1.1_1.1.1f-1ubuntu2_amd64.deb > /dev/null 2>&1
!wget -qO - https://www.mongodb.org/static/pgp/server-4.4.asc | apt-key add - > /dev/null 2>&1
!echo "deb [ arch=amd64,arm64 ] http://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.4 multiverse" | tee /etc/apt/sources.list.d/mongodb-org-4.4.list > /dev/null
!apt-get update -qq > /dev/null
!apt-get install -y -qq mongodb-org > /dev/null
!mkdir -p /data/db
!mongod --fork --logpath /var/log/mongodb.log --dbpath /data/db

# Test MongoDB is running
!mongo --quiet --eval 'print("MongoDB ready!")'

In [None]:
# === SPARQL Setup (for Wikidata queries) ===
from SPARQLWrapper import SPARQLWrapper, JSON

def run_sparql(query, endpoint="https://query.wikidata.org/sparql"):
    """Run a SPARQL query against Wikidata and print results."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    try:
        results = sparql.query().convert()
        for result in results["results"]["bindings"]:
            print(result)
        return results
    except Exception as e:
        print(f"Error: {e}")
        return None

print("SPARQL ready!")
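
`run_sparql` prints each raw binding, which is a nested dictionary in the SPARQL JSON results layout. A minimal sketch of flattening those bindings into plain rows is shown here. The helper name `bindings_to_rows` and the sample data are illustrative assumptions, not part of SPARQLWrapper or of real Wikidata output.

```python
def bindings_to_rows(results):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    if not results:
        # run_sparql returns None when the query fails
        return []
    rows = []
    for binding in results["results"]["bindings"]:
        # Each cell looks like {"type": ..., "value": ...}; keep only the value
        rows.append({var: cell["value"] for var, cell in binding.items()})
    return rows


# Hand-written sample in the SPARQL JSON results layout (not real Wikidata output)
sample = {
    "head": {"vars": ["item", "itemLabel"]},
    "results": {"bindings": [
        {"item": {"type": "uri", "value": "http://example.org/e1"},
         "itemLabel": {"type": "literal", "xml:lang": "en", "value": "Example one"}},
        # Unbound variables are simply absent from a binding
        {"item": {"type": "uri", "value": "http://example.org/e2"}},
    ]},
}

for row in bindings_to_rows(sample):
    print(row)
```

In the notebook, the same helper could be applied to the value returned by `run_sparql(...)`.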

---

# Question 2: Historical Lute Music Database [30 marks]

## Context

An enthusiast website for historical lute music stores data in CSV files:

**Sources file** (library references, names, dates, instruments):
```
Ref_Short;Ref_Long;Library;Name_German;Name_English;Date;Instruments
NL-At;Ms. 205.B.32;Amsterdam, Toonkunst-Bibliotheek;;;1600-1680;Baroque Lute
D-DI_M297;Mscr Dresd. M. 297;Staats- und Universitätsbibliothek Dresden;Liederbuch eines Jenenser Studenten;Songbook of a student from Jena;1603;Renaissance Lute
```

**Concordances file** (work IDs, composers, locations):
```
Conc_no;Composer;Concordances
Conc_51;V. Gaultier;NL-At/2v – F-PnVmb7/188 – D-B40068/59r
Conc_15;V. Gaultier or D. Gaultier;NL-At/24v – GB-Balcarres/86
```

**Individual source file** (NL-At.csv - pieces in that source):
```
Piece_no;Key;Page_no;Title;Composer;Conc_no
4;c minor;2v;sans titre;V. Gaultier;Conc_51
17;d minor;24v;Caprice;D. Gaultier;Conc_15
```
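
To explore the sample rows before answering, the files can be read with Python's `csv` module. The sketch below uses only the rows quoted above and assumes that `;` is the field delimiter, that an en dash separates the locations in `Concordances`, and that `Date` holds either one year or a `start-end` range. The helper names (`read_rows`, `split_locations`, `date_span`) are hypothetical.

```python
import csv
import io

SOURCES_CSV = """Ref_Short;Ref_Long;Library;Name_German;Name_English;Date;Instruments
NL-At;Ms. 205.B.32;Amsterdam, Toonkunst-Bibliotheek;;;1600-1680;Baroque Lute
D-DI_M297;Mscr Dresd. M. 297;Staats- und Universitätsbibliothek Dresden;Liederbuch eines Jenenser Studenten;Songbook of a student from Jena;1603;Renaissance Lute
"""

CONCORDANCES_CSV = """Conc_no;Composer;Concordances
Conc_51;V. Gaultier;NL-At/2v – F-PnVmb7/188 – D-B40068/59r
Conc_15;V. Gaultier or D. Gaultier;NL-At/24v – GB-Balcarres/86
"""


def read_rows(text):
    """Parse semicolon-delimited text into a list of dicts keyed by header."""
    return list(csv.DictReader(io.StringIO(text), delimiter=";"))


def split_locations(field):
    """Split 'SOURCE/page – SOURCE/page' into (source, page) pairs."""
    pairs = []
    for part in field.split("\u2013"):  # en dash between locations
        source, _, page = part.strip().rpartition("/")
        pairs.append((source, page))
    return pairs


def date_span(field):
    """Return (first_year, last_year); a single year gives equal ends."""
    start, _, end = field.partition("-")
    return int(start), int(end or start)


for row in read_rows(CONCORDANCES_CSV):
    print(row["Conc_no"], split_locations(row["Concordances"]))

for row in read_rows(SOURCES_CSV):
    print(row["Ref_Short"], date_span(row["Date"]))
```

Looking at which fields need extra parsing like this may be useful when thinking about the questions that follow.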

## Q2(a): Database vs File-Based Approach [6 marks]

**Question:** What might be the advantages and disadvantages of a database approach compared to the file-based approach used here? Where relevant, refer to specific examples in the data above.

In [None]:
# Q2(a) YOUR ANSWER:
# Advantages of database approach:
#
# Disadvantages of database approach:
#

## Q2(b): Recommended Database Model [2 marks]

**Question:** What database model would you recommend as the easiest to use, given the current state of the data? Why?

In [None]:
# Q2(b) YOUR ANSWER:


## Q2(c): Relational Database Model [12 marks]

**Question:** It has been decided to use a relational database. Propose a model, listing the tables, fields, and keys. List any concerns you have about the data and your model.

In [None]:
# Q2(c) YOUR ANSWER - Describe your model:
# Tables:
#
# Concerns:
#

In [None]:
%%sql
-- Q2(c) Create your tables:
DROP TABLE IF EXISTS ConcordanceLocation;
DROP TABLE IF EXISTS Piece;
DROP TABLE IF EXISTS Concordance;
DROP TABLE IF EXISTS Source;
DROP TABLE IF EXISTS Composer;

-- Add your CREATE TABLE statements here:


## Q2(d): Query for Ungrouped Lachrimae Pieces [6 marks]

**Question:** Write a query that finds all pieces with 'lachrimae' or 'flow' in their names that are not included in a Concordance associated with composer 'John Dowland'.

In [None]:
%%sql
-- Q2(d) YOUR SQL:


## Q2(e): GRANT Command for Web Application [4 marks]

**Question:** Write an appropriate GRANT command for the account that the web application will use (for adding sources and their contents).

In [None]:
# Q2(e) YOUR ANSWER - Write the GRANT command:


---

# Question 3: Poetry Contest XML/TEI [30 marks]

## Context

XML file collecting poetry contest entries:

In [None]:
%%writefile poetry_contest.xml
<?xml version="1.0" encoding="UTF-8"?>
<contests xmlns:tei="http://www.tei-c.org/ns/1.0">
  <competition theme="limericks" date="2024-01-03">
    <entry>
      <authors>
        <author viaf="23156">Edward Lear</author>
      </authors>
      <poem>
        <tei:lg type="stanza">
          <tei:l>There was an old man of Dumbree</tei:l>
          <tei:l>Who taught little owls to drink tea</tei:l>
          <tei:l>For he said, "To eat mice is not proper or nice"</tei:l>
          <tei:l>That amiable man of Dumbree</tei:l>
        </tei:lg>
      </poem>
    </entry>
    <entry>
      <authors>
        <author viaf="12345">Anonymous</author>
      </authors>
      <poem>
        <tei:lg type="stanza">
          <tei:l>A wonderful bird is the pelican</tei:l>
          <tei:l>His bill can hold more than his belican</tei:l>
          <tei:l>He can take in his beak enough food for a week</tei:l>
          <tei:l>But I'm darned if I see how the helican</tei:l>
        </tei:lg>
      </poem>
    </entry>
  </competition>
  <competition theme="haiku" date="2024-02-15">
    <entry>
      <authors>
        <author viaf="67890">Matsuo Basho</author>
      </authors>
      <poem>
        <tei:lg type="stanza">
          <tei:l>An old silent pond</tei:l>
          <tei:l>A frog jumps into the pond</tei:l>
          <tei:l>Splash! Silence again</tei:l>
        </tei:lg>
      </poem>
    </entry>
  </competition>
</contests>

In [None]:
# Verify XML is well-formed
!xmllint --noout poetry_contest.xml && echo "XML is well-formed!"
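
`xmllint --noout` only checks that the file is well-formed (tags properly nested, prefixes declared). It does not validate against any schema. As a rough illustration of the same check in Python, the sketch below uses the standard library parser. The helper name `is_well_formed` is an assumption for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<a><b>text</b></a>"))  # properly nested
print(is_well_formed("<a><b>text</a></b>"))  # overlapping tags
print(is_well_formed("<a><x:b/></a>"))       # prefix x is never declared
```

A document can pass this check and still fail to follow the rules of a particular vocabulary, since no schema is consulted.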

## Q3(a): File Format [1 mark]

**Question:** What is the format of this file?

In [None]:
# Q3(a) YOUR ANSWER:


## Q3(b): TEI Claim Assessment [3 marks]

**Question:** The competition website says they save data as Text Encoding Initiative files. Are they correct? Give a more specific (and accurate) statement.

In [None]:
# Q3(b) YOUR ANSWER:


## Q3(c): XPath for First Lines [3 marks]

**Question:** Write a simple XPath expression to retrieve the first line from all entries to competitions with theme 'limericks'. Note: the precise syntax is less important than the logic.

In [None]:
# Q3(c) YOUR ANSWER - Write the XPath expression:


In [None]:
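# Test your XPath with lxml
from lxml import etree

doc = etree.parse('poetry_contest.xml')
namespaces = {'tei': 'http://www.tei-c.org/ns/1.0'}

# Your XPath here:
xpath_expr = ""  # Fill in your expression

# An empty string is not a valid XPath expression, so only evaluate once it is filled in
if xpath_expr:
    result = doc.xpath(xpath_expr, namespaces=namespaces)
    print("Result:", result)
else:
    print("Fill in xpath_expr first.")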
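
Prefixed element names only match when the prefix is mapped to its namespace URI, which is why the test cell passes a `namespaces` dictionary. The sketch below shows the idea on an unrelated, made-up document. It uses the standard library's `ElementTree`, which supports a subset of XPath, so that it runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Made-up document, unrelated to the exam data
SAMPLE = """<library xmlns:dc="http://purl.org/dc/elements/1.1/">
  <shelf topic="maps">
    <book><dc:title>Atlas One</dc:title></book>
    <book><dc:title>Atlas Two</dc:title></book>
  </shelf>
  <shelf topic="music">
    <book><dc:title>Lute Tablature</dc:title></book>
  </shelf>
</library>"""

root = ET.fromstring(SAMPLE)
ns = {"dc": "http://purl.org/dc/elements/1.1/"}

# Attribute predicate on an unprefixed element, then a prefixed child element
maps_titles = [t.text for t in root.findall("./shelf[@topic='maps']/book/dc:title", ns)]
print(maps_titles)

# Without the prefix the path matches nothing, because the elements live in the dc namespace
print(len(root.findall(".//title")))
```

The prefix used in the expression only has to agree with the dictionary passed to the query; it need not be the same prefix the document itself uses.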

## Q3(d): Relational Model with Judging [12 marks]

**Question:** Design a relational model for the files, adding the ability for judges to give numerical assessments of each entry (usually, three judges score each). Give the CREATE commands needed to build a MySQL database. Explain your choices and show what Normal Forms your solution is in.

In [None]:
# Q3(d) YOUR ANSWER - Explain your design choices:
# Tables:
#
# Normal Forms:
#

In [None]:
%%sql
-- Q3(d) Create your tables:
DROP TABLE IF EXISTS Assessment;
DROP TABLE IF EXISTS EntryAuthor;
DROP TABLE IF EXISTS Entry;
DROP TABLE IF EXISTS Author;
DROP TABLE IF EXISTS Judge;
DROP TABLE IF EXISTS Competition;

-- Add your CREATE TABLE statements here:


## Q3(e): Winning Entry Query [5 marks]

**Question:** Give a query for your database that retrieves the winning (highest scoring) entry for the Limerick challenge of 3 Jan 2024.

In [None]:
%%sql
-- Q3(e) YOUR SQL:


## Q3(f): XML vs Relational Comparison [6 marks]

**Question:** How do the XML document-based approach and the relational model compare for this use case? What works best in each? Would there be any benefit to a hybrid approach?

In [None]:
# Q3(f) YOUR ANSWER:
# XML approach:
#   Pros:
#   Cons:
#
# Relational approach:
#   Pros:
#   Cons:
#
# Hybrid approach benefits:
#

---

# Question 4: Wikidata SPARQL / Belgian Artists / MongoDB [30 marks]

## Context

A researcher queries Wikidata for Belgian artists born before 1600:

```sparql
SHOW person ?personLabel ?placeLabel ?dob 
{{
  BIND (wd:Q31 as ?country) # Q31 is Belgium
  BIND (wd:Q483501 as ?job) # Q483501 is Artist
  person wdt:P19 ?place ,    # P19 is place of birth
         wdt:P569 ?dob ,     # P569 is date of birth
         wdt:P106 ?job ;     # P106 is occupation
  place wdt:P17 ?country .   # P17 is country
  FILTER( YEAR(?dob) < 1600 )
  
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}}
```

## Q4(a): Query Language [1 mark]

**Question:** What language does this query use?

In [None]:
# Q4(a) YOUR ANSWER:


## Q4(b): Syntax Corrections [2 marks]

**Question:** Some of the syntax of this query is wrong. Correct it.

In [None]:
# Q4(b) YOUR CORRECTED QUERY:
corrected_query = """

"""
print(corrected_query)

In [None]:
# Test your corrected query (optional - may take time)
# run_sparql(corrected_query)

## Q4(c): Retrieval Less Than 100% [4 marks]

**Question:** What might cause the retrieval of this query to be less than 100%?

In [None]:
# Q4(c) YOUR ANSWER:
# Reasons for incomplete retrieval:
#

## Q4(d): Unwanted Results [3 marks]

**Question:** Given the purpose of the query, what results might be returned that are not wanted?

In [None]:
# Q4(d) YOUR ANSWER:
# Unwanted results might include:
#

## Q4(e): Place of Birth vs Country of Citizenship [4 marks]

**Question:** The researcher originally considered querying by country of citizenship (P27) rather than place of birth (P19). Since Belgium didn't exist before 1830, this was unsuccessful. How does the query above avoid this problem?

In [None]:
# Q4(e) YOUR ANSWER:


## Q4(f): MongoDB Query for Artworks [6 marks]

**Question:** The researcher creates a MongoDB database of artworks. Give a query that returns all artworks made between 1520 and 1530 by artists born in Antwerp.

In [None]:
# Q4(f) YOUR MONGODB QUERY:
mongo_query = """

"""
print(mongo_query)

In [None]:
# Test with pymongo (requires the pymongo package and some sample data)
# from pymongo import MongoClient
# client = MongoClient('localhost', 27017)
# db = client['belgian_art']
# 
# # Your query here
# results = db.artworks.find({...})
# for r in results:
#     print(r)
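
The exam does not specify a document structure, so testing a query needs some invented data. The sketch below is one possible layout. Every field name (`title`, `year`, `artist.birthplace`) and every document is a made-up assumption, and the helper names are hypothetical. `delete_many` and `insert_many` are standard pymongo collection methods.

```python
def sample_artworks():
    """Return made-up artwork documents; every field name here is an assumption."""
    return [
        {"title": "Sample Work A", "year": 1525,
         "artist": {"name": "Artist One", "birthplace": "Antwerp"}},
        {"title": "Sample Work B", "year": 1540,
         "artist": {"name": "Artist One", "birthplace": "Antwerp"}},
        {"title": "Sample Work C", "year": 1522,
         "artist": {"name": "Artist Two", "birthplace": "Ghent"}},
    ]


def load_sample(collection):
    """Empty the collection, insert the sample documents, and return the count."""
    docs = sample_artworks()
    collection.delete_many({})
    collection.insert_many(docs)
    return len(docs)


# With MongoDB running (see Setup) and pymongo installed, uncomment:
# from pymongo import MongoClient
# db = MongoClient("localhost", 27017)["belgian_art"]
# print(load_sample(db.artworks))
```

If your own answer assumes a different structure, adjust the sample documents to match it before testing.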

## Q4(g): Database Model Evaluation [10 marks]

**Question:** Do you think the researcher was right to use an object database for this? Evaluate a graph, object, and relational model in this context. What is special about this case that makes these work better or worse?

In [None]:
# Q4(g) YOUR ANSWER:
# Graph model (RDF/SPARQL):
#   Pros:
#   Cons:
#
# Object/Document model (MongoDB):
#   Pros:
#   Cons:
#
# Relational model (MySQL):
#   Pros:
#   Cons:
#
# Special considerations for this case:
#
# Conclusion:
#

---

# Done!

Check your answers against the **solution sheet**.