<a href="https://colab.research.google.com/github/sreent/data-management-intro/blob/main/past-exam-papers/march-2025/notebook-march-2025.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CM3010 March 2025 - Practice Notebook

This notebook provides hands-on practice for the March 2025 exam.

**Exam Structure:**
- Section A: MCQs (taken separately on VLE)
- Section B: Answer 2 of 3 questions - 60 marks
  - Q2: Mortality Bills Dataset (SQL/Database Design)
  - Q3: BeerXML (XML/XPath, ML Classification)
  - Q4: MusicBrainz JSON-LD/RDF (SPARQL Queries)

**Instructions:**
1. Run the Setup cells first
2. Write your answers in the empty code cells
3. Check your answers against the solution sheet

---

# 1. Environment Setup

Run these cells first to set up MySQL, xmllint, jing, rapper, and rdflib.
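
The setup cell below prints its "ready" messages unconditionally, so a failed install can go unnoticed. A minimal sketch of a follow-up check, assuming the installed packages put executables named `mysql`, `xmllint`, `jing`, and `rapper` on the PATH (`find_missing` is a hypothetical helper, not part of the notebook):

```python
import shutil

def find_missing(tools):
    """Return the tool names that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]

# Assumed executable names for the packages installed in the setup cell
missing = find_missing(["mysql", "xmllint", "jing", "rapper"])
print("All tools found" if not missing else f"Missing tools: {missing}")
```

If anything is reported missing, re-running the setup cell without the `> /dev/null` redirects should show the underlying install error.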

In [None]:
# === MySQL Setup ===
!apt -qq update > /dev/null
!apt -y -qq install mysql-server > /dev/null
!service mysql start

# Create user and database
!mysql -e "CREATE USER IF NOT EXISTS 'examuser'@'localhost' IDENTIFIED BY 'exampass';"
!mysql -e "CREATE DATABASE IF NOT EXISTS exam_db;"
!mysql -e "GRANT ALL PRIVILEGES ON exam_db.* TO 'examuser'@'localhost';"

# === xmllint Setup (for XML/XPath, DTD and XSD validation) ===
# xmllint can validate: well-formedness, DTD (--dtdvalid), XSD (--schema), RelaxNG (--relaxng)
!apt -y -qq install libxml2-utils > /dev/null

# === jing Setup (for RelaxNG validation) ===
# Why jing instead of xmllint --relaxng?
# 1. Better, more descriptive error messages for validation failures
# 2. Supports RelaxNG Compact Syntax (.rnc) which xmllint doesn't
# 3. More complete RelaxNG implementation (handles some edge cases better)
# 4. Commonly used in TEI/digital humanities contexts
# Note: xmllint --relaxng works for simple cases if you prefer fewer dependencies
!apt -y -qq install jing > /dev/null

# === rapper Setup (for RDF/Turtle validation) ===
!apt -y -qq install raptor2-utils > /dev/null

# === Python libraries ===
!pip install -q sqlalchemy==2.0.20 ipython-sql==0.5.0 pymysql==1.1.0 prettytable==2.0.0 lxml rdflib

%reload_ext sql
%sql mysql+pymysql://examuser:exampass@localhost/exam_db

print("MySQL ready!")
print("xmllint ready (DTD: --dtdvalid, XSD: --schema, RelaxNG: --relaxng)!")
print("jing ready (preferred for RelaxNG - better error messages)!")
print("rapper ready!")

---

# Question 2: Mortality Bills Dataset [30 marks]

## Context

A historical dataset of London mortality bills (1644-1849) contains weekly death counts by parish, age group, and cause of death.

**Data files:**
- `ages.txt` - Weekly death counts by age group (1729-1849)
- `counts.txt` - Weekly parish-level plague death counts (1644-1849)
- `ParcodeDict.txt` - Parish code dictionary
- City-wide cause-of-death file: `codID|weekID|cod|codn`
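
The city-wide file is pipe-delimited, which Python's `csv` module can read directly. A rough sketch, assuming `codn` holds an integer count (the source lists only the field names); the two sample rows, including the `weekID` values, are invented for illustration and are not taken from the dataset:

```python
import csv
import io

FIELDS = ["codID", "weekID", "cod", "codn"]

# Invented sample rows in the codID|weekID|cod|codn layout
sample = "1|101|Consumption|85\n2|101|Fever|61\n"

rows = []
for record in csv.DictReader(io.StringIO(sample), fieldnames=FIELDS, delimiter="|"):
    record["codn"] = int(record["codn"])  # assumption: codn is a count
    rows.append(record)

print(rows)
```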

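**Key challenge:** In September 1752 Britain switched from the Julian to the Gregorian calendar. Wednesday 2 September was followed directly by Thursday 14 September, so 11 calendar dates do not exist and the weekly series for that year is irregular.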
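
The size of the gap can be checked with a little calendar arithmetic. Python's `datetime.date` uses the proleptic Gregorian calendar, so a Julian-calendar date has to be converted via its Julian Day Number first. A sketch using the standard integer formula (`julian_to_gregorian` is a hypothetical helper name):

```python
from datetime import date

# Julian Day Number of the day before proleptic Gregorian 0001-01-01
JDN_OFFSET = 1721425

def julian_to_gregorian(year, month, day):
    """Convert a Julian-calendar date to a proleptic Gregorian datetime.date."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    return date.fromordinal(jdn - JDN_OFFSET)

last_julian_day = julian_to_gregorian(1752, 9, 2)  # last day under the old calendar
first_gregorian_day = date(1752, 9, 14)            # first day under the new calendar

print(last_julian_day)                               # 1752-09-13
print((first_gregorian_day - last_julian_day).days)  # 1
```

The two dates are consecutive days, yet the written day numbers jump from 2 to 14, so any week-numbering or date-difference logic has to decide which calendar a stored date belongs to.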

## Q2(a): Logical schema for MySQL [12 marks]

**Question:** Design a logical schema for MySQL. List tables, fields, keys, and state what normal forms they satisfy. What fields have you removed/added compared to the original files?

In [None]:
# Q2(a) YOUR ANSWER - Describe your schema design:
# Tables:
#
# Normal forms:
#
# Removed/added fields:
#

In [None]:
%%sql
-- Q2(a) Create your tables:
DROP TABLE IF EXISTS city_weekly_cause_count;
DROP TABLE IF EXISTS age_weekly_count;
DROP TABLE IF EXISTS parish_weekly_count;
DROP TABLE IF EXISTS cause;
DROP TABLE IF EXISTS age_group;
DROP TABLE IF EXISTS parish;
DROP TABLE IF EXISTS week;

-- Add your CREATE TABLE statements here:


## Q2(b): Date representation & the 1752 skip [3 marks]

**Question:** How would you represent the date in MySQL, and what issues are raised by the 1752 skip?

In [None]:
# Q2(b) YOUR ANSWER:
# Date representation:
#
# Issues with 1752 skip:
#

## Q2(c): MySQL query - plague deaths in St Dunstan, Stepney, week 2 of 1729 [2 marks]

**Question:** Write a query to find plague deaths in St Dunstan, Stepney (parcode 'STEP') for week 2 of 1729.

In [None]:
%%sql
-- Q2(c) YOUR SQL:


## Q2(d): MySQL query - annual deaths by age group, 1760-1790 [4 marks]

**Question:** Write a query to show annual deaths by age group for years 1760-1790.

In [None]:
%%sql
-- Q2(d) YOUR SQL:


## Q2(e): Adding city-wide causes & parity check [5 marks]

**Question:** How would you add the city-wide cause-of-death data? How would you check if parish totals match city totals?

In [None]:
# Q2(e) YOUR ANSWER - Describe approach:
#

In [None]:
%%sql
-- Q2(e) Parity check query:


## Q2(f): Using the dataset for population-health trends [4 marks]

**Question:** What issues would need to be addressed to use this data for population-health trends? What other data would help?

In [None]:
# Q2(f) YOUR ANSWER:
# Issues:
#
# Helpful external data:
#

---

# Question 3: BeerXML [30 marks]

## Context

A BeerXML file containing brewing recipe data:

In [None]:
%%writefile beerxml_sample.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- BeerXML format -->
<RECIPES>
  <RECIPE>
    <NAME>Burton Ale</NAME>
    <TYPE>All Grain</TYPE>
    <BREWER>John Smith</BREWER>
    <BATCH_SIZE>20.0</BATCH_SIZE>
    <BOIL_SIZE>25.0</BOIL_SIZE>
    <BOIL_TIME>60</BOIL_TIME>
    <STYLE>
      <NAME>English IPA</NAME>
      <CATEGORY>India Pale Ale</CATEGORY>
      <OG_MIN>1.050</OG_MIN>
      <OG_MAX>1.075</OG_MAX>
    </STYLE>
    <HOPS>
      <HOP>
        <NAME>East Kent Goldings</NAME>
        <ALPHA>5.0</ALPHA>
        <AMOUNT>0.050</AMOUNT>
        <USE>Boil</USE>
        <TIME>60</TIME>
      </HOP>
      <HOP>
        <NAME>Fuggle</NAME>
        <ALPHA>4.5</ALPHA>
        <AMOUNT>0.030</AMOUNT>
        <USE>Boil</USE>
        <TIME>15</TIME>
      </HOP>
    </HOPS>
    <FERMENTABLES>
      <FERMENTABLE>
        <NAME>Maris Otter</NAME>
        <TYPE>Grain</TYPE>
        <AMOUNT>5.0</AMOUNT>
      </FERMENTABLE>
      <FERMENTABLE>
        <NAME>Crystal 60L</NAME>
        <TYPE>Grain</TYPE>
        <AMOUNT>0.5</AMOUNT>
      </FERMENTABLE>
    </FERMENTABLES>
    <YEASTS>
      <YEAST>
        <NAME>English Ale</NAME>
        <TYPE>Ale</TYPE>
        <ATTENUATION>75</ATTENUATION>
      </YEAST>
    </YEASTS>
  </RECIPE>
  <RECIPE>
    <NAME>Hefeweizen</NAME>
    <TYPE>All Grain</TYPE>
    <BREWER>Hans Mueller</BREWER>
    <BATCH_SIZE>20.0</BATCH_SIZE>
    <STYLE>
      <NAME>Weissbier</NAME>
      <CATEGORY>German Wheat Beer</CATEGORY>
    </STYLE>
    <HOPS>
      <HOP>
        <NAME>Hallertau</NAME>
        <ALPHA>4.0</ALPHA>
        <AMOUNT>0.025</AMOUNT>
        <USE>Boil</USE>
        <TIME>60</TIME>
      </HOP>
    </HOPS>
    <FERMENTABLES>
      <FERMENTABLE>
        <NAME>Wheat Malt</NAME>
        <TYPE>Grain</TYPE>
        <AMOUNT>2.5</AMOUNT>
      </FERMENTABLE>
      <FERMENTABLE>
        <NAME>Pilsner Malt</NAME>
        <TYPE>Grain</TYPE>
        <AMOUNT>2.5</AMOUNT>
      </FERMENTABLE>
    </FERMENTABLES>
  </RECIPE>
</RECIPES>

In [None]:
# Verify XML is well-formed
!xmllint --noout beerxml_sample.xml && echo "XML is well-formed!"

## Q3(a): What format is this? [1 mark]

In [None]:
# Q3(a) YOUR ANSWER:


## Q3(b): What is the root node? [1 mark]

In [None]:
# Q3(b) YOUR ANSWER:


## Q3(c): Schema and validation [3 marks]

**Question:** Does this instance reference a schema? How could you validate it?

In [None]:
# Q3(c) YOUR ANSWER:


## Q3(d): XPath - names of all hops in recipe "Burton Ale" [4 marks]
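
XPath steps and predicates can be tried out on a different target first. The sketch below uses the standard-library `xml.etree.ElementTree`, which supports only a subset of XPath (paths are relative to the root element, and there is no `text()` step), on a cut-down copy of the sample file. The test cell in this notebook uses lxml instead, which accepts full XPath 1.0 expressions.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the sample file: two recipes, one with a nested STYLE/NAME
xml_text = """
<RECIPES>
  <RECIPE>
    <NAME>Burton Ale</NAME>
    <BREWER>John Smith</BREWER>
    <STYLE><NAME>English IPA</NAME></STYLE>
  </RECIPE>
  <RECIPE>
    <NAME>Hefeweizen</NAME>
    <BREWER>Hans Mueller</BREWER>
  </RECIPE>
</RECIPES>
""".strip()
root = ET.fromstring(xml_text)

# Child steps only: recipe names, not the STYLE name nested one level deeper
recipe_names = [e.text for e in root.findall("./RECIPE/NAME")]

# Descendant step: picks up every NAME element at any depth
all_names = [e.text for e in root.findall(".//NAME")]

# Predicate: keep only the RECIPE whose NAME child equals the given text
brewers = [e.text for e in root.findall("./RECIPE[NAME='Hefeweizen']/BREWER")]

print(recipe_names)  # ['Burton Ale', 'Hefeweizen']
print(all_names)     # ['Burton Ale', 'English IPA', 'Hefeweizen']
print(brewers)       # ['Hans Mueller']
```

Because `NAME` is reused at several levels of BeerXML, the difference between a child step and a descendant step matters when writing the expression for this question.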

In [None]:
# Q3(d) YOUR XPATH EXPRESSION:
xpath_expr = ""  # Fill in your expression

In [None]:
# Test your XPath with lxml
from lxml import etree

doc = etree.parse('beerxml_sample.xml')
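if not xpath_expr.strip():
    # lxml raises XPathEvalError on an empty expression
    print("Set xpath_expr in the cell above first.")
else:
    result = doc.xpath(xpath_expr)
    # Elements expose .text; text() results are already strings
    print("Hop names in Burton Ale:", [r.text if hasattr(r, 'text') else r for r in result])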

## Q3(e): 10-fold cross-validation - what does it mean? [3 marks]
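
The mechanics of the split can be sketched in plain Python. This is an illustration only: `k_fold_indices` is a hypothetical helper, the sample count of 25 is arbitrary, and a real experiment would normally shuffle (and often stratify) the data before splitting.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists for k roughly equal, non-overlapping folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(25, k=10))
print(len(folds))                          # 10
print([len(test) for _, test in folds])    # [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
```

Each sample lands in the test set exactly once and in the training set for the other nine rounds.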

In [None]:
# Q3(e) YOUR ANSWER:


## Q3(f): 50% accuracy - is it good? What else to know? [6 marks]

**Question:** A classifier for beer styles (15 styles) achieves 50% accuracy. Is this good? What else would you want to know?
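
An accuracy figure is easier to judge next to simple baselines. The arithmetic below compares uniform random guessing over 15 styles with always predicting the most common style. The class counts are invented for illustration, since the question does not give the real distribution.

```python
n_styles = 15
chance_accuracy = 1 / n_styles  # uniform random guessing

# Hypothetical class counts (invented): one dominant style and 14 rarer ones
counts = [400] + [50] * 10 + [25] * 4
assert len(counts) == n_styles

majority_accuracy = max(counts) / sum(counts)  # always predict the largest class

print(f"Chance baseline:   {chance_accuracy:.1%}")    # 6.7%
print(f"Majority baseline: {majority_accuracy:.1%}")  # 40.0%
```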

In [None]:
# Q3(f) YOUR ANSWER:
# Is 50% good?
#
# What else to know:
#

## Q3(g): Document DB vs data interchange? [3 marks]

**Question:** Is BeerXML primarily a document database format or a data interchange format?

In [None]:
# Q3(g) YOUR ANSWER:


## Q3(h): Tree vs graph vs relational for this domain [9 marks]

**Question:** Compare tree (XML/JSON), graph, and relational models for storing brewing recipe data.

In [None]:
# Q3(h) YOUR ANSWER:
# Tree (XML/JSON):
#   Pros:
#   Cons:
#
# Relational:
#   Pros:
#   Cons:
#
# Graph:
#   Pros:
#   Cons:
#
# Recommendation:
#

---

# Question 4: MusicBrainz JSON-LD / RDF [30 marks]

## Context

JSON-LD data from MusicBrainz was converted to RDF/Turtle. Here's a sample of the resulting triples:

In [None]:
%%writefile musicbrainz.ttl
@prefix schema: <http://schema.org/> .
@prefix mbartist: <http://musicbrainz.org/artist/> .
@prefix mbarea: <http://musicbrainz.org/area/> .
@prefix mbrelease: <http://musicbrainz.org/release-group/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Areas (location hierarchy: United States > Massachusetts > Lincoln)
mbarea:489ce91b-6658-3307-9877-795b68554c98
    a schema:Country ;
    schema:name "United States" .

mbarea:05f68b4c-10f3-49b5-b28c-260a1b707043
    a schema:AdministrativeArea ;
    schema:name "Massachusetts" ;
    schema:containedIn mbarea:489ce91b-6658-3307-9877-795b68554c98 .

mbarea:11c4099a-ff61-45a3-ada4-23ac7a25d111
    a schema:City ;
    schema:name "Lincoln" ;
    schema:containedIn mbarea:05f68b4c-10f3-49b5-b28c-260a1b707043 .

# A music group (They Might Be Giants)
mbartist:183d6ef6-e161-47ff-9085-063c8b897e97
    a schema:MusicGroup ;
    schema:name "They Might Be Giants" ;
    schema:foundingDate "1982"^^xsd:gYear ;
    schema:groupOrigin mbarea:11c4099a-ff61-45a3-ada4-23ac7a25d111 ;
    schema:member [
        a schema:OrganizationRole ;
        schema:startDate "1982"^^xsd:gYear ;
        schema:member mbartist:person-john-flansburgh
    ] ;
    schema:member [
        a schema:OrganizationRole ;
        schema:startDate "1982"^^xsd:gYear ;
        schema:member mbartist:person-john-linnell
    ] .

# Person members
mbartist:person-john-flansburgh
    a schema:Person ;
    schema:name "John Flansburgh" .

mbartist:person-john-linnell
    a schema:Person ;
    schema:name "John Linnell" .

# Albums
mbrelease:album-flood
    a schema:MusicAlbum ;
    schema:name "Flood" ;
    schema:byArtist mbartist:183d6ef6-e161-47ff-9085-063c8b897e97 ;
    schema:albumProductionType <http://schema.org/StudioAlbum> ;
    schema:datePublished "1990"^^xsd:gYear .

mbrelease:album-lincoln
    a schema:MusicAlbum ;
    schema:name "Lincoln" ;
    schema:byArtist mbartist:183d6ef6-e161-47ff-9085-063c8b897e97 ;
    schema:albumProductionType <http://schema.org/StudioAlbum> ;
    schema:datePublished "1988"^^xsd:gYear .

In [None]:
# Validate the Turtle and count triples
!rapper -i turtle -c musicbrainz.ttl

In [None]:
# Set up rdflib for SPARQL queries
import rdflib

g = rdflib.Graph()
g.parse('musicbrainz.ttl', format='turtle')
print(f"Loaded {len(g)} triples")

def query_local(sparql_query):
    """Run SPARQL query against local rdflib graph and print results."""
    try:
        results = g.query(sparql_query)
        for row in results:
            print(row)
        return results
    except Exception as e:
        print(f"Error: {e}")
        return None

## Q4(a): What did it convert into? Relation to JSON-LD [2 marks]

In [None]:
# Q4(a) YOUR ANSWER:
# Converted to:
#
# Relation to JSON-LD:
#

## Q4(b): Which ontology is used? [1 mark]

In [None]:
# Q4(b) YOUR ANSWER:


## Q4(c): Example triple where the requested URL does not occur [1 mark]

In [None]:
# Q4(c) YOUR ANSWER:


## Q4(d): Bug in schema:MusicAlbum export - what & why [2 marks]

In [None]:
# Q4(d) YOUR ANSWER:
# Bug:
#
# Why:
#

## Q4(e): "Two members" vs "impossible to know how many" - who's right? [2 marks]

In [None]:
# Q4(e) YOUR ANSWER:


## Q4(f): SPARQL - all groups founded in the United States [4 marks]
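
In the sample data the group's origin is a city, and the country is two `schema:containedIn` hops away from it. The traversal a query has to express can be pictured in plain Python (an illustration of the idea only, not SPARQL; `ancestors` is a hypothetical helper and the dictionary simply restates the area names from the Turtle sample):

```python
# Area names and containedIn links copied from the sample Turtle
contained_in = {
    "Lincoln": "Massachusetts",
    "Massachusetts": "United States",
}

def ancestors(area, links):
    """Yield every area reachable from `area` by following containedIn links."""
    seen = set()
    while area in links and area not in seen:  # `seen` guards against cycles
        seen.add(area)
        area = links[area]
        yield area

print(list(ancestors("Lincoln", contained_in)))  # ['Massachusetts', 'United States']
```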

In [None]:
# Q4(f) YOUR SPARQL QUERY:
query_f = """

"""
query_local(query_f)

## Q4(g): Ensure results are real groups, not persons [2 marks]

In [None]:
# Q4(g) YOUR ANSWER:


## Q4(h): SPARQL - list all albums made by bands of which John Linnell has been a member [4 marks]

In [None]:
# Q4(h) YOUR SPARQL QUERY:
query_h = """

"""
query_local(query_h)

## Q4(i): Why no public SPARQL endpoint? [2 marks]

In [None]:
# Q4(i) YOUR ANSWER:


## Q4(j): Relational schema mirroring the RDF view [10 marks]

In [None]:
# Q4(j) YOUR ANSWER - Describe your schema:
# Tables:
#
# Keys:
#
# Normalization:
#

In [None]:
%%sql
-- Q4(j) Create your relational schema:


---

# Done!

Check your answers against the **solution sheet**.