<a href="https://colab.research.google.com/github/sreent/data-management-intro/blob/main/past-exam-papers/march-2025/notebook-march-2025-solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CM3010 March 2025 - Solutions Notebook

This notebook contains **complete solutions** for the March 2025 exam.

**Exam Structure:**
- Section A: MCQs (taken separately on VLE)
- Section B: Answer 2 of 3 questions - 60 marks
  - Q2: Mortality Bills Dataset (SQL/Database Design)
  - Q3: BeerXML (XML/XPath, ML Classification)
  - Q4: MusicBrainz JSON-LD/RDF (SPARQL Queries)

---

# 1. Environment Setup

Run this cell first to set up MySQL, `xmllint` (from `libxml2-utils`), and the cellspell XPath and SPARQL magics.

In [None]:
# === MySQL Setup ===
!apt-get update -qq > /dev/null
!apt-get install -y -qq mysql-server > /dev/null
!service mysql start
!mysql -e "CREATE USER IF NOT EXISTS 'examuser'@'localhost' IDENTIFIED BY 'exampass';"
!mysql -e "CREATE DATABASE IF NOT EXISTS exam_db;"
!mysql -e "GRANT ALL PRIVILEGES ON exam_db.* TO 'examuser'@'localhost';"

# === SQL Magic ===
!pip install -q sqlalchemy==2.0.20 ipython-sql==0.5.0 pymysql==1.1.0 prettytable==2.0.0
%reload_ext sql
%sql mysql+pymysql://examuser:exampass@localhost/exam_db

# === XPath Magic (cellspell) ===
!apt-get install -y libxml2-utils -qq > /dev/null
!pip install git+https://github.com/sreent/jupyter-query-magics.git -q
%load_ext cellspell.xpath

# === SPARQL Magic (cellspell) ===
!pip install "cellspell[sparql] @ git+https://github.com/sreent/jupyter-query-magics.git" -q
%load_ext cellspell.sparql

---

# Question 2: Mortality Bills Dataset [30 marks]

## Context

A historical dataset of London mortality bills (1644-1849) contains weekly death counts by parish, age group, and cause of death.

**Data files:**
- `ages.txt` - Weekly death counts by age group (1729-1849)
- `counts.txt` - Weekly parish-level plague death counts (1644-1849)
- `ParcodeDict.txt` - Parish code dictionary
- City-wide cause-of-death file: `codID|weekID|cod|codn`

**Key challenge:** The 1752 calendar change (11 days skipped) creates irregular data.

## Q2(a): Logical schema for MySQL [12 marks]

**Question:** Design a logical schema for MySQL. List tables, fields, keys, and state what normal forms they satisfy.

### SOLUTION Q2(a)

**Design Goals:**
- Separate *dimensions* (week, parish, age group, cause) from *facts* (weekly counts)
- Enable integrity, reproducibility, and flexible aggregation
- Use stable keys (e.g., `week_id` is the canonical `YYYY/WW` string from the data)

**Tables:**

1. **`week`** *(dimension)*
   - `week_id` CHAR(7) PK — e.g., `'1729/01'`
   - `year` SMALLINT
   - `week_no` TINYINT (1-53)
   - `week_seq` INT UNIQUE — monotonically increasing for time-series
   - `calendar_note` VARCHAR(32) NULL — e.g., `'1752-short'`

2. **`parish`** *(dimension; from ParcodeDict.txt)*
   - `parcode` CHAR(4) PK
   - `parish_name` VARCHAR(100)
   - `alias1` VARCHAR(100) NULL
   - `alias2` VARCHAR(100) NULL
   - `bills_group_before_1660` VARCHAR(30) NULL
   - `bills_group_after_1660` VARCHAR(30) NULL

3. **`age_group`** *(dimension; normalizes age groups)*
   - `age_group_id` INT PK AUTO_INCREMENT
   - `label` VARCHAR(30) UNIQUE — e.g., `'under 2'`, `'2-5'`
   - `age_year_min` TINYINT NOT NULL
   - `age_year_max` TINYINT NOT NULL

4. **`cause`** *(dimension; from city-wide causes)*
   - `cause_id` INT PK AUTO_INCREMENT
   - `cause_name` VARCHAR(100) UNIQUE

5. **`parish_weekly_count`** *(fact; from counts.txt)*
   - `count_id` INT PK AUTO_INCREMENT
   - `week_id` CHAR(7) FK → week
   - `parcode` CHAR(4) FK → parish
   - `count_type` VARCHAR(20) — e.g., `'plague'`, `'total'`
   - `count_n` INT NOT NULL
   - UNIQUE(week_id, parcode, count_type)

6. **`age_weekly_count`** *(fact; from ages.txt)*
   - `week_id` CHAR(7) FK → week
   - `age_group_id` INT FK → age_group
   - `count_n` INT NOT NULL
   - PK(week_id, age_group_id)

7. **`city_weekly_cause_count`** *(fact; from city-wide causes)*
   - `week_id` CHAR(7) FK → week
   - `cause_id` INT FK → cause
   - `count_n` INT NOT NULL
   - PK(week_id, cause_id)

**Normal Forms:**
- All tables are in **1NF** (atomic fields, no repeating groups)
- Dimensions (`week`, `parish`, `age_group`, `cause`) are in **3NF/BCNF**: every non-key attribute depends on the key, the whole key, and nothing but the key
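- Fact tables are identified by composite keys: `age_weekly_count` and `city_weekly_cause_count` use composite primary keys, while `parish_weekly_count` uses a surrogate `count_id` plus a composite UNIQUE constraint; in each, the non-key attribute `count_n` depends only on the key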

**Removed/Added Fields:**
- Replaced string `agegroup` with normalized `age_group` table (prevents typos, supports ordering)
- Added `week_seq` and `calendar_note` to `week` for robust time-series handling
- Kept alias fields for parishes (historical names) but separated from facts
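
A minimal sketch of how raw age-group labels could be turned into `age_group` rows. The label formats (`'under N'`, `'A-B'`, `'N+'`), the helper name `parse_age_label`, and the upper bound of 120 are assumptions based on the sample rows in this notebook, not on the real `ages.txt` file.

```python
import re

MAX_AGE = 120  # assumed upper bound for open-ended groups such as '40+'

def parse_age_label(label):
    """Return (age_year_min, age_year_max) for a label like 'under 2', '2-5', or '40+'.

    Ranges are treated as half-open, so '2-5' covers ages 2..4 inclusive.
    """
    text = label.strip().lower()
    m = re.fullmatch(r"under (\d+)", text)
    if m:
        return 0, int(m.group(1)) - 1
    m = re.fullmatch(r"(\d+)-(\d+)", text)
    if m:
        return int(m.group(1)), int(m.group(2)) - 1
    m = re.fullmatch(r"(\d+)\+", text)
    if m:
        return int(m.group(1)), MAX_AGE
    raise ValueError(f"unrecognised age label: {label!r}")

# Rows shaped like the age_group table: (label, age_year_min, age_year_max)
rows = [(label, *parse_age_label(label)) for label in ["under 2", "2-5", "5-10", "40+"]]
print(rows)
```

Storing numeric bounds next to the label is what lets queries order age groups by `age_year_min` rather than alphabetically.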

In [None]:
%%sql
-- Q2(a) SOLUTION: Create the schema
DROP TABLE IF EXISTS city_weekly_cause_count;
DROP TABLE IF EXISTS age_weekly_count;
DROP TABLE IF EXISTS parish_weekly_count;
DROP TABLE IF EXISTS cause;
DROP TABLE IF EXISTS age_group;
DROP TABLE IF EXISTS parish;
DROP TABLE IF EXISTS week;

-- Dimension: week
CREATE TABLE week (
    week_id CHAR(7) PRIMARY KEY,  -- e.g., '1729/01'
    year SMALLINT NOT NULL,
    week_no TINYINT NOT NULL,
    week_seq INT UNIQUE,
    calendar_note VARCHAR(32) NULL
);

-- Dimension: parish
CREATE TABLE parish (
    parcode CHAR(4) PRIMARY KEY,
    parish_name VARCHAR(100) NOT NULL,
    alias1 VARCHAR(100) NULL,
    alias2 VARCHAR(100) NULL,
    bills_group_before_1660 VARCHAR(30) NULL,
    bills_group_after_1660 VARCHAR(30) NULL
);

-- Dimension: age_group (normalized)
CREATE TABLE age_group (
    age_group_id INT PRIMARY KEY AUTO_INCREMENT,
    label VARCHAR(30) UNIQUE NOT NULL,
    age_year_min TINYINT NOT NULL,
    age_year_max TINYINT NOT NULL
);

-- Dimension: cause
CREATE TABLE cause (
    cause_id INT PRIMARY KEY AUTO_INCREMENT,
    cause_name VARCHAR(100) UNIQUE NOT NULL
);

-- Fact: parish_weekly_count
CREATE TABLE parish_weekly_count (
    count_id INT PRIMARY KEY AUTO_INCREMENT,
    week_id CHAR(7) NOT NULL,
    parcode CHAR(4) NOT NULL,
    count_type VARCHAR(20) NOT NULL,
    count_n INT NOT NULL,
    FOREIGN KEY (week_id) REFERENCES week(week_id),
    FOREIGN KEY (parcode) REFERENCES parish(parcode),
    UNIQUE KEY (week_id, parcode, count_type)
);

-- Fact: age_weekly_count
CREATE TABLE age_weekly_count (
    week_id CHAR(7) NOT NULL,
    age_group_id INT NOT NULL,
    count_n INT NOT NULL,
    PRIMARY KEY (week_id, age_group_id),
    FOREIGN KEY (week_id) REFERENCES week(week_id),
    FOREIGN KEY (age_group_id) REFERENCES age_group(age_group_id)
);

-- Fact: city_weekly_cause_count
CREATE TABLE city_weekly_cause_count (
    week_id CHAR(7) NOT NULL,
    cause_id INT NOT NULL,
    count_n INT NOT NULL,
    PRIMARY KEY (week_id, cause_id),
    FOREIGN KEY (week_id) REFERENCES week(week_id),
    FOREIGN KEY (cause_id) REFERENCES cause(cause_id)
);

In [None]:
%%sql
-- Insert sample data for testing queries

-- Weeks
INSERT INTO week (week_id, year, week_no, week_seq) VALUES
('1729/01', 1729, 1, 1),
('1729/02', 1729, 2, 2),
('1760/01', 1760, 1, 100),
('1760/02', 1760, 2, 101),
('1790/01', 1790, 1, 200);

-- Parishes
INSERT INTO parish (parcode, parish_name, alias1, bills_group_before_1660, bills_group_after_1660) VALUES
('STEP', 'St Dunstan Stepney', 'Stepney Parish', 'other', 'outparishes'),
('GEOS', 'St George Southwark', NULL, 'without', 'without'),
('WHCL', 'St Mary Whitechappel', NULL, 'outparishes', 'outparishes');

-- Age groups
INSERT INTO age_group (label, age_year_min, age_year_max) VALUES
('under 2', 0, 1),
('2-5', 2, 4),
('5-10', 5, 9),
('10-20', 10, 19),
('20-40', 20, 39),
('40+', 40, 120);

-- Causes
INSERT INTO cause (cause_name) VALUES
('Aged'), ('Plague'), ('Consumption'), ('Convulsions'), ('Childbed');

-- Parish weekly counts (plague)
INSERT INTO parish_weekly_count (week_id, parcode, count_type, count_n) VALUES
('1729/01', 'STEP', 'plague', 5),
('1729/02', 'STEP', 'plague', 3),
('1729/01', 'GEOS', 'plague', 2),
('1729/02', 'GEOS', 'plague', 1);

-- Age weekly counts
INSERT INTO age_weekly_count (week_id, age_group_id, count_n) VALUES
('1760/01', 1, 141), ('1760/01', 2, 52), ('1760/01', 3, 17),
('1760/02', 1, 130), ('1760/02', 2, 48), ('1760/02', 3, 20),
('1790/01', 1, 150), ('1790/01', 2, 55), ('1790/01', 3, 22);

-- City weekly cause counts
INSERT INTO city_weekly_cause_count (week_id, cause_id, count_n) VALUES
('1729/01', 1, 18), ('1729/01', 2, 7), ('1729/01', 3, 34),
('1729/02', 1, 15), ('1729/02', 2, 4), ('1729/02', 3, 30);

## Q2(b): Date representation & the 1752 skip [3 marks]

**Question:** How would you represent the date? What issues does the 1752 skip raise?

### SOLUTION Q2(b)

**Date Representation:**
- Use `week_id` (YYYY/WW) as the canonical week key
- Store `year` and `week_no` separately for filtering
- Add `week_seq` for strictly increasing ordering across all years

**Issues raised by 1752 skip:**

1. **Only 51 bills in 1752** — one "week" covers fewer than 7 days. Any logic assuming 52-53 weeks/year or 7 days/week will miscompute rates.

2. **Gregorian dates are non-uniform** — avoid deriving DATE start/end from YYYY/WW without a curated mapping table.

3. **Comparability issues** — for weekly rates per 100k, you may need a `week_length_days` metadata field or adjust rates using known span lengths.
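
As a rough illustration of the representation above, the sketch below derives `year`, `week_no`, and a strictly increasing `week_seq` from `YYYY/WW` strings. The function name `build_week_rows` and the choice to tag every 1752 bill with `'1752-short'` are assumptions made for this example; a curated mapping would normally decide which weeks get a note.

```python
def build_week_rows(week_ids):
    """Turn 'YYYY/WW' strings into (week_id, year, week_no, week_seq, calendar_note) tuples."""
    parsed = []
    for week_id in set(week_ids):            # drop duplicate week ids
        year_text, week_text = week_id.split("/")
        parsed.append((int(year_text), int(week_text), week_id))
    parsed.sort()                            # chronological: by year, then week number
    rows = []
    for seq, (year, week_no, week_id) in enumerate(parsed, start=1):
        # Assumed flag: mark every bill from 1752 so rate calculations can treat it specially.
        note = "1752-short" if year == 1752 else None
        rows.append((week_id, year, week_no, seq, note))
    return rows

week_rows = build_week_rows(["1752/36", "1729/02", "1729/01", "1753/01"])
for row in week_rows:
    print(row)
```

Because `week_seq` comes from the sorted order rather than from calendar arithmetic, it stays monotonic even where the calendar itself is irregular.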

## Q2(c): MySQL query - plague deaths in St Dunstan, Stepney, week 2 of 1729 [2 marks]

In [None]:
%%sql
-- Q2(c) SOLUTION: Plague deaths in St Dunstan, Stepney, week 2 of 1729
SELECT c.count_n
FROM parish_weekly_count AS c
JOIN week AS w ON w.week_id = c.week_id
WHERE c.parcode = 'STEP'
  AND w.year = 1729
  AND w.week_no = 2
  AND c.count_type = 'plague';

## Q2(d): MySQL query - annual deaths by age group, 1760-1790 [4 marks]

In [None]:
%%sql
-- Q2(d) SOLUTION: Annual deaths by age group, 1760-1790
SELECT
    w.year,
    ag.label AS age_group,
    SUM(a.count_n) AS total_deaths
FROM age_weekly_count AS a
JOIN week AS w ON w.week_id = a.week_id
JOIN age_group AS ag ON ag.age_group_id = a.age_group_id
WHERE w.year BETWEEN 1760 AND 1790
GROUP BY w.year, ag.label, ag.age_year_min
ORDER BY w.year, ag.age_year_min;

## Q2(e): Adding city-wide causes & parity check [5 marks]

### SOLUTION Q2(e)

**Add tables:** `cause` and `city_weekly_cause_count` (already created above).

Load `cod` values into `cause(cause_name)` and `codn` into `city_weekly_cause_count(count_n)` keyed by `week_id`.
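
A minimal sketch of that loading step, assuming the city-wide file has a header row, uses `|` as its delimiter, and stores `weekID` in the same `YYYY/WW` form as `week_id`. The sample rows and the helper name `load_causes` are made up for illustration.

```python
import csv
import io

# Hypothetical sample in the codID|weekID|cod|codn layout; the values are invented.
SAMPLE = """codID|weekID|cod|codn
1|1729/01|Aged|18
2|1729/01|Plague|7
3|1729/02|Aged|15
"""

def load_causes(text):
    """Split the pipe-delimited file into cause names and per-week count rows."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    cause_ids = {}     # cause_name -> surrogate cause_id
    count_rows = []    # (week_id, cause_id, count_n)
    for record in reader:
        name = record["cod"].strip()
        # Reuse the id if the cause was seen before, otherwise assign the next one.
        cause_id = cause_ids.setdefault(name, len(cause_ids) + 1)
        count_rows.append((record["weekID"], cause_id, int(record["codn"])))
    return cause_ids, count_rows

cause_ids, count_rows = load_causes(SAMPLE)
print(cause_ids)
print(count_rows)
```

The two results map onto `cause(cause_id, cause_name)` and `city_weekly_cause_count(week_id, cause_id, count_n)` respectively.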

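**Parity Check:** If you have a parish total in `parish_weekly_count` (`count_type='total'`), verify that the sum of parish totals matches the sum of city-wide cause counts. The query below compares the two per year; grouping by `week_id` instead gives the same check per week: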

In [None]:
%%sql
-- Q2(e) SOLUTION: Parity check query
-- Compare parish totals with city-wide cause totals by year

WITH parish_totals AS (
    SELECT w.year, SUM(c.count_n) AS total_parish
    FROM parish_weekly_count c
    INNER JOIN week w ON c.week_id = w.week_id
    WHERE c.count_type = 'total'
    GROUP BY w.year
),
city_totals AS (
    SELECT w.year, SUM(c.count_n) AS total_city
    FROM city_weekly_cause_count c
    INNER JOIN week w ON c.week_id = w.week_id
    GROUP BY w.year
)
SELECT
    p.year,
    p.total_parish,
    c.total_city,
    (p.total_parish = c.total_city) AS matches
FROM parish_totals p
LEFT JOIN city_totals c ON p.year = c.year;

**Note:** If `counts.txt` lacks an "all deaths" (`total`) row, you can only check specific causes that appear in both sources (e.g., `Plague`) by comparing the city's `Plague` total to the sum of parish `count_type='plague'`.

## Q2(f): Using the dataset for population-health trends [4 marks]

### SOLUTION Q2(f)

**Issues:**
- **Case definition drift:** cause names and diagnostic practices change over 205 years (e.g., "Consumption" vs tuberculosis)
- **Coverage and boundary shifts:** parish borders, mergers/splits, and "within/without" groupings change; parish populations are not constant
- **Data quality:** under-registration, wartime disruptions, epidemic spikes, missing/duplicated weeks
- **Temporal irregularities:** the 1752 calendar anomaly; possible week-length variability
- **Age-mix confounding:** age structure changes over time; crude death counts are not comparable

**Helpful External Data:**
- **Population denominators** by parish/year to compute rates
- **Age-structure estimates** for age-standardized rates
- **Administrative boundary histories** and GIS shapes for parish mapping
- **Tax/price indices, weather, epidemic timelines, wars** for context
- **Cause-name concordance** to map historical labels to modern categories

---

# Question 3: BeerXML [30 marks]

## Context

A BeerXML file containing brewing recipe data:

In [None]:
%%writefile beerxml_sample.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- BeerXML format -->
<RECIPES>
  <RECIPE>
    <NAME>Burton Ale</NAME>
    <TYPE>All Grain</TYPE>
    <BREWER>John Smith</BREWER>
    <BATCH_SIZE>20.0</BATCH_SIZE>
    <BOIL_SIZE>25.0</BOIL_SIZE>
    <BOIL_TIME>60</BOIL_TIME>
    <STYLE>
      <NAME>English IPA</NAME>
      <CATEGORY>India Pale Ale</CATEGORY>
      <OG_MIN>1.050</OG_MIN>
      <OG_MAX>1.075</OG_MAX>
    </STYLE>
    <HOPS>
      <HOP>
        <NAME>East Kent Goldings</NAME>
        <ALPHA>5.0</ALPHA>
        <AMOUNT>0.050</AMOUNT>
        <USE>Boil</USE>
        <TIME>60</TIME>
      </HOP>
      <HOP>
        <NAME>Fuggle</NAME>
        <ALPHA>4.5</ALPHA>
        <AMOUNT>0.030</AMOUNT>
        <USE>Boil</USE>
        <TIME>15</TIME>
      </HOP>
    </HOPS>
    <FERMENTABLES>
      <FERMENTABLE>
        <NAME>Maris Otter</NAME>
        <TYPE>Grain</TYPE>
        <AMOUNT>5.0</AMOUNT>
      </FERMENTABLE>
      <FERMENTABLE>
        <NAME>Crystal 60L</NAME>
        <TYPE>Grain</TYPE>
        <AMOUNT>0.5</AMOUNT>
      </FERMENTABLE>
    </FERMENTABLES>
    <YEASTS>
      <YEAST>
        <NAME>English Ale</NAME>
        <TYPE>Ale</TYPE>
        <ATTENUATION>75</ATTENUATION>
      </YEAST>
    </YEASTS>
  </RECIPE>
  <RECIPE>
    <NAME>Hefeweizen</NAME>
    <TYPE>All Grain</TYPE>
    <BREWER>Hans Mueller</BREWER>
    <BATCH_SIZE>20.0</BATCH_SIZE>
    <STYLE>
      <NAME>Weissbier</NAME>
      <CATEGORY>German Wheat Beer</CATEGORY>
    </STYLE>
    <HOPS>
      <HOP>
        <NAME>Hallertau</NAME>
        <ALPHA>4.0</ALPHA>
        <AMOUNT>0.025</AMOUNT>
        <USE>Boil</USE>
        <TIME>60</TIME>
      </HOP>
    </HOPS>
    <FERMENTABLES>
      <FERMENTABLE>
        <NAME>Wheat Malt</NAME>
        <TYPE>Grain</TYPE>
        <AMOUNT>2.5</AMOUNT>
      </FERMENTABLE>
      <FERMENTABLE>
        <NAME>Pilsner Malt</NAME>
        <TYPE>Grain</TYPE>
        <AMOUNT>2.5</AMOUNT>
      </FERMENTABLE>
    </FERMENTABLES>
  </RECIPE>
</RECIPES>

In [None]:
%xpath beerxml_sample.xml

## Q3(a): What format is this? [1 mark]

### SOLUTION Q3(a)

**XML** in the **BeerXML** vocabulary (the comment notes BeerXML; encoding ISO-8859-1).

## Q3(b): What is the root node? [1 mark]

### SOLUTION Q3(b)

`<RECIPES>` is the document element (root), containing one or more `<RECIPE>` elements.

## Q3(c): Schema and validation [3 marks]

**Question:** Does this instance reference a schema? How could you validate it?

### SOLUTION Q3(c)

The instance shows **no namespace or schema reference**. BeerXML defines an element vocabulary; validation can be done by:

- Using a published **XSD** (if provided by the BeerXML spec) and validating with an XML Schema validator
- Creating a **DTD/RELAX NG/Schematron** for structural and semantic rules (e.g., units, allowed child elements)
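
The sketch below is not schema validation; it only illustrates the kind of structural and semantic rules a schema would encode, using hand-written checks on top of a well-formedness parse. The rules chosen (root must be `RECIPES`, each `RECIPE` needs a `NAME`, `ALPHA` must be numeric) and the function name `check_beerxml` are assumptions for this example, not taken from the BeerXML specification.

```python
import xml.etree.ElementTree as ET

def check_beerxml(xml_text):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    root = ET.fromstring(xml_text)  # raises ParseError if the text is not well-formed
    if root.tag != "RECIPES":
        problems.append(f"root is <{root.tag}>, expected <RECIPES>")
    for i, recipe in enumerate(root.findall("RECIPE"), start=1):
        if recipe.find("NAME") is None:
            problems.append(f"RECIPE #{i} has no NAME")
        for hop in recipe.findall("HOPS/HOP"):
            alpha = hop.findtext("ALPHA")
            try:
                float(alpha)
            except (TypeError, ValueError):  # missing element or non-numeric text
                problems.append(f"RECIPE #{i}: HOP ALPHA is not a number: {alpha!r}")
    return problems

good = "<RECIPES><RECIPE><NAME>Burton Ale</NAME><HOPS><HOP><ALPHA>5.0</ALPHA></HOP></HOPS></RECIPE></RECIPES>"
bad = "<RECIPES><RECIPE><HOPS><HOP><ALPHA>high</ALPHA></HOP></HOPS></RECIPE></RECIPES>"
print(check_beerxml(good))
print(check_beerxml(bad))
```

A real XSD, DTD, or RELAX NG grammar expresses such rules declaratively, so they can be checked by any conforming validator instead of custom code.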

## Q3(d): XPath - names of all hops in recipe "Burton Ale" [4 marks]

In [None]:
# Q3(d) SOLUTION: XPath for hop names in Burton Ale
xpath_expr = "//RECIPE[NAME='Burton Ale']/HOPS/HOP/NAME/text()"

In [None]:
%%xpath beerxml_sample.xml
//RECIPE[NAME='Burton Ale']/HOPS/HOP/NAME/text()
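
For comparison, the same selection can be sketched with Python's standard-library `xml.etree.ElementTree`, which supports a limited XPath subset. The inline `XML` string is a trimmed copy of the sample file so the block runs on its own.

```python
import xml.etree.ElementTree as ET

XML = """<RECIPES>
  <RECIPE><NAME>Burton Ale</NAME>
    <HOPS><HOP><NAME>East Kent Goldings</NAME></HOP><HOP><NAME>Fuggle</NAME></HOP></HOPS>
  </RECIPE>
  <RECIPE><NAME>Hefeweizen</NAME>
    <HOPS><HOP><NAME>Hallertau</NAME></HOP></HOPS>
  </RECIPE>
</RECIPES>"""

root = ET.fromstring(XML)
# ElementTree has no text() step, so select the NAME elements and read .text instead.
hop_names = [el.text for el in root.findall(".//RECIPE[NAME='Burton Ale']/HOPS/HOP/NAME")]
print(hop_names)  # ['East Kent Goldings', 'Fuggle']
```

The predicate `[NAME='Burton Ale']` filters on the recipe's own `NAME` child, so hop names from other recipes are excluded.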

## Q3(e): 10-fold cross-validation - what does it mean? [3 marks]

### SOLUTION Q3(e)

Split the dataset into **10 approximately equal folds**. Train on 9 folds, test on the held-out fold; **repeat 10 times** with different held-out folds; report the **average (and variance)** of the metrics across the 10 runs.
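
The procedure can be sketched in plain Python as follows. The names `k_fold_indices` and `cross_validate` are illustrative, and `evaluate` stands in for whatever train-and-score routine the classifier uses; a library implementation would normally be preferred in practice.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k folds whose sizes differ by at most 1
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

def cross_validate(n_samples, evaluate, k=10):
    """Average a metric over k folds; evaluate(train_idx, test_idx) returns one score."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, variance

splits = list(k_fold_indices(25, k=10))
print(len(splits), [len(test) for _, test in splits])
```

Each sample lands in exactly one test fold, so every prediction that contributes to the averaged metric is made by a model that did not see that sample during training.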

## Q3(f): 50% accuracy - is it good? What else to know? [6 marks]

**Question:** A classifier for beer styles (15 styles) achieves 50% accuracy. Is this good?

### SOLUTION Q3(f)

**Is 50% good?**
- **Random baseline** for 15 classes ≈ 1/15 ≈ **6.7%**
- **Majority-class baseline** depends on class distribution
- 50% is significantly better than random, but whether it's "good" depends on context

**What else to know:**
- **Class imbalance & confusion:** per-class precision/recall/F1, confusion matrix
- **Variance:** fold-to-fold spread; confidence intervals
- **Data size & leakage:** number of recipes, duplicates, near-duplicates across folds
- **Calibration & top-k:** probability calibration, Top-k accuracy (e.g., Top-3)
- **External validity:** performance on a held-out test set or time-based split
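
To make the baseline comparison concrete, the sketch below computes the uniform-guess and majority-class baselines for a label list, plus a normal-approximation interval for an observed accuracy. The label distribution is hypothetical and deliberately imbalanced; the function names are illustrative.

```python
from collections import Counter
import math

def baselines(labels):
    """Return (random_baseline, majority_baseline) accuracy for a list of class labels."""
    counts = Counter(labels)
    random_baseline = 1 / len(counts)                       # uniform guess over the classes
    majority_baseline = max(counts.values()) / len(labels)  # always predict the commonest class
    return random_baseline, majority_baseline

def accuracy_interval(accuracy, n, z=1.96):
    """Approximate 95% interval (normal approximation) for an accuracy measured on n items."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width

# Hypothetical data: 15 styles, one of which accounts for 40 of the 110 recipes.
labels = ["IPA"] * 40 + [f"style_{i}" for i in range(14) for _ in range(5)]
random_base, majority_base = baselines(labels)
low, high = accuracy_interval(0.50, n=len(labels))
print(random_base, majority_base, (low, high))
```

With this made-up distribution the majority-class baseline is about 36%, so 50% accuracy is a much smaller improvement than the comparison with 6.7% suggests.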

## Q3(g): Document DB vs data interchange? [3 marks]

### SOLUTION Q3(g)

Primarily **data interchange**: BeerXML is a portable, structured format exchanged between brewing tools. The tree mirrors a single recipe, not an optimized query model across many documents; it lacks cross-document identifiers and indexing typical of a document database deployment.

## Q3(h): Tree vs graph vs relational for this domain [9 marks]

### SOLUTION Q3(h)

**Tree (XML/JSON):**
- *Pros:* Natural for a single recipe (one-to-many subelements like hops, fermentables). Self-contained, readable, good for interchange and transport.
- *Cons:* Hard to deduplicate ingredients across recipes; cross-recipe queries (e.g., "all recipes using East Kent Goldings over 5% alpha") require scanning documents.

**Relational:**
- *Pros:* Normalize Ingredient, Hop, Recipe, RecipeHop (many-to-many), etc. Strong integrity, joins, powerful aggregation/filtering, indexing.
- *Cons:* More upfront modeling; schema migrations when adding optional fields.

**Graph (RDF/property graph):**
- *Pros:* Best when modeling relationships like substitutions, origin regions, supplier networks, or similarity between recipes. Flexible schema evolution and path queries (e.g., "hops grown in regions within the UK or substitutes-of-substitutes").
- *Cons:* Additional infrastructure; need careful ontology/labels.

**Recommendation depends on:**
- Query workload (per-recipe vs cross-corpus analytics)
- Need for global identifiers
- Data integration with external knowledge (e.g., geography, supply chain)
- Governance (constraints vs flexibility)
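
A small sketch of the tree-to-relational trade-off: the recipe tree is flattened into `recipe`, `hop`, and `recipe_hop` row sets, after which a cross-recipe question becomes a simple filter. The table shapes, the surrogate ids, and the decision to keep `alpha` on the link row are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

XML = """<RECIPES>
  <RECIPE><NAME>Burton Ale</NAME>
    <HOPS>
      <HOP><NAME>East Kent Goldings</NAME><ALPHA>5.0</ALPHA></HOP>
      <HOP><NAME>Fuggle</NAME><ALPHA>4.5</ALPHA></HOP>
    </HOPS>
  </RECIPE>
  <RECIPE><NAME>Hefeweizen</NAME>
    <HOPS><HOP><NAME>Hallertau</NAME><ALPHA>4.0</ALPHA></HOP></HOPS>
  </RECIPE>
</RECIPES>"""

def flatten(xml_text):
    """Split the recipe tree into recipe rows, a hop lookup, and recipe_hop link rows."""
    recipes, hops, recipe_hops = [], {}, []
    for recipe_id, recipe in enumerate(ET.fromstring(xml_text).findall("RECIPE"), start=1):
        recipes.append((recipe_id, recipe.findtext("NAME")))
        for hop in recipe.findall("HOPS/HOP"):
            name = hop.findtext("NAME")
            hop_id = hops.setdefault(name, len(hops) + 1)   # one id per distinct hop name
            recipe_hops.append((recipe_id, hop_id, float(hop.findtext("ALPHA"))))
    return recipes, hops, recipe_hops

recipes, hops, recipe_hops = flatten(XML)
# Cross-recipe question answered from the link rows: which recipes use a hop with alpha >= 4.5?
strong = sorted({recipe_id for recipe_id, _, alpha in recipe_hops if alpha >= 4.5})
print(recipes, hops, recipe_hops, strong)
```

In the tree form the same question requires walking every document; in the flattened form it is a filter (or an indexed join) over one table.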

---

# Question 4: MusicBrainz JSON-LD / RDF [30 marks]

## Context

JSON-LD data from MusicBrainz was converted to RDF/Turtle. Here's a sample of the resulting triples:

In [None]:
%%writefile musicbrainz.ttl
@prefix schema: <http://schema.org/> .
@prefix mbartist: <http://musicbrainz.org/artist/> .
@prefix mbarea: <http://musicbrainz.org/area/> .
@prefix mbrelease: <http://musicbrainz.org/release-group/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Areas (location hierarchy)
mbarea:489ce91b-6658-3307-9877-795b68554c98
    a schema:Country ;
    schema:name "United States" .

mbarea:05f68b4c-10f3-49b5-b28c-260a1b707043
    a schema:AdministrativeArea ;
    schema:name "Massachusetts" ;
    schema:containedIn mbarea:489ce91b-6658-3307-9877-795b68554c98 .

mbarea:11c4099a-ff61-45a3-ada4-23ac7a25d111
    a schema:City ;
    schema:name "Lincoln" ;
    schema:containedIn mbarea:05f68b4c-10f3-49b5-b28c-260a1b707043 .

# Person members
mbartist:36248428-08ff-4313-abe6-0ebbcaccb4f7
    a schema:Person ;
    schema:name "John Flansburgh" .

mbartist:b48f22c6-cab9-436c-a6d0-99839a19ee05
    a schema:Person ;
    schema:name "John Linnell" .

# A music group (They Might Be Giants)
mbartist:183d6ef6-e161-47ff-9085-063c8b897e97
    a schema:MusicGroup ;
    schema:name "They Might Be Giants" ;
    schema:foundingDate "1982"^^xsd:gYear ;
    schema:groupOrigin mbarea:11c4099a-ff61-45a3-ada4-23ac7a25d111 ;
    schema:member [
        a schema:OrganizationRole ;
        schema:startDate "1982"^^xsd:gYear ;
        schema:member mbartist:36248428-08ff-4313-abe6-0ebbcaccb4f7
    ] ;
    schema:member [
        a schema:OrganizationRole ;
        schema:startDate "1982"^^xsd:gYear ;
        schema:member mbartist:b48f22c6-cab9-436c-a6d0-99839a19ee05
    ] .

# Albums
mbrelease:b9daa8f6-2641-4e24-9a10-ce205cca1df3
    a schema:MusicAlbum ;
    schema:name "I Like Fun" ;
    schema:byArtist mbartist:183d6ef6-e161-47ff-9085-063c8b897e97 ;
    schema:albumProductionType <http://schema.org/StudioAlbum> ;
    schema:datePublished "2018"^^xsd:gYear .

mbrelease:flood-album
    a schema:MusicAlbum ;
    schema:name "Flood" ;
    schema:byArtist mbartist:183d6ef6-e161-47ff-9085-063c8b897e97 ;
    schema:albumProductionType <http://schema.org/StudioAlbum> ;
    schema:datePublished "1990"^^xsd:gYear .

In [None]:
%%sparql --file musicbrainz.ttl
SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }

SPARQL queries use `%%sparql` magic cells (loaded via cellspell); the `musicbrainz.ttl` file was created above via `%%writefile`.

## Q4(a): What did it convert into? Relation to JSON-LD [2 marks]

### SOLUTION Q4(a)

It was converted to **RDF in Turtle** (Terse RDF Triple Language).

**JSON-LD** and **Turtle** are both serializations of the **same RDF graph** — different syntaxes representing the same underlying data model.

## Q4(b): Which ontology is used? [1 mark]

### SOLUTION Q4(b)

**`schema.org`** (every class/property shown is `schema:*`).

## Q4(c): Example triple where the requested URL does not occur [1 mark]

The requested URL was `https://musicbrainz.org/artist/183d6ef6-e161-47ff-9085-063c8b897e97`

### SOLUTION Q4(c)

Example triple where the requested artist URL does not appear as subject or object:

```turtle
mbarea:489ce91b-6658-3307-9877-795b68554c98 a schema:Country .
```

This triple only involves the area (United States), not the artist.

## Q4(d): Bug in schema:MusicAlbum export - what & why [2 marks]

Look at the exam's Turtle snippet for the bug (not our corrected version).

### SOLUTION Q4(d)

Two issues in the exam's shown snippet:

1. **Undefined prefix:** `schema:byArtist mbartist:183d6ef6-...` uses `mbartist:` but only `mbart:` is declared. This breaks the link to the artist.

2. **Quoted IRIs for enumerations:** Values like `"http://schema.org/StudioAlbum"` are **string literals**, not IRIs; they should be `<http://schema.org/StudioAlbum>` (or an unquoted prefixed name).

**Likely cause:** Errors in the JSON-LD→RDF conversion or mapping (treating `@id` values as strings; typo in prefix mapping).

## Q4(e): "Two members" vs "impossible to know how many" - who's right? [2 marks]

### SOLUTION Q4(e)

**"Impossible to know"** is correct.

The graph asserts **at least two** members via two `schema:member` organization roles, but under the **open-world assumption** absence of additional triples doesn't mean they don't exist. The band may have other members not represented in this data.

## Q4(f): SPARQL - all groups founded in the United States [4 marks]

In [None]:
%%sparql --file musicbrainz.ttl
PREFIX schema: <http://schema.org/>

SELECT DISTINCT ?group ?name
WHERE {
    ?group a schema:MusicGroup ;
           schema:name ?name ;
           schema:groupOrigin/schema:containedIn* ?country .
    ?country a schema:Country ;
             schema:name "United States" .
}
ORDER BY ?name

**Explanation:** The property path `schema:groupOrigin/schema:containedIn*` navigates from the group's origin (e.g., Lincoln) through zero or more containedIn relationships to find the country.
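
What the path does can be sketched in plain Python as a walk up the containment chain. The dictionaries below use area and group names instead of IRIs to keep the example short, and the function names are illustrative.

```python
CONTAINED_IN = {          # child area -> parent area, copied from the sample triples
    "Lincoln": "Massachusetts",
    "Massachusetts": "United States",
}
COUNTRIES = {"United States"}
GROUP_ORIGIN = {"They Might Be Giants": "Lincoln"}

def ancestors_or_self(area):
    """Yield the area itself, then each containing area (zero or more containedIn hops)."""
    seen = set()
    while area is not None and area not in seen:   # `seen` guards against cycles
        seen.add(area)
        yield area
        area = CONTAINED_IN.get(area)

def groups_founded_in(country):
    """Return groups whose origin is the country or any area contained in it."""
    return sorted(
        group for group, origin in GROUP_ORIGIN.items()
        if country in COUNTRIES and country in ancestors_or_self(origin)
    )

print(groups_founded_in("United States"))  # ['They Might Be Giants']
```

The "zero or more" part matters: a group whose `groupOrigin` points directly at the country is matched by the same walk, because the starting area is yielded before any hop is taken.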

## Q4(g): Ensure results are real groups, not persons [2 marks]

### SOLUTION Q4(g)

Filter out resources also typed as `schema:Person`, **or** require a group-specific property such as `schema:member`:

```sparql
# Option A: exclude persons
FILTER NOT EXISTS { ?group a schema:Person }

# Option B: require group-specific structure
?group schema:member ?role .
?role a schema:OrganizationRole .
```

In [None]:
%%sparql --file musicbrainz.ttl
PREFIX schema: <http://schema.org/>

SELECT DISTINCT ?group ?name
WHERE {
    ?group a schema:MusicGroup ;
           schema:name ?name ;
           schema:groupOrigin/schema:containedIn* ?country .
    ?country a schema:Country ;
             schema:name "United States" .
    
    FILTER NOT EXISTS { ?group a schema:Person }
}
ORDER BY ?name

## Q4(h): SPARQL - list all albums made by bands of which John Linnell has been a member [4 marks]

In [None]:
%%sparql --file musicbrainz.ttl
PREFIX schema: <http://schema.org/>

SELECT DISTINCT ?album ?albumName ?bandName
WHERE {
    ?john a schema:Person ;
          schema:name "John Linnell" .
    
    ?band a schema:MusicGroup ;
          schema:name ?bandName ;
          schema:member [ a schema:OrganizationRole ; 
                          schema:member ?john ] .
    
    {
        ?album a schema:MusicAlbum ;
               schema:byArtist ?band ;
               schema:name ?albumName .
    }
    UNION
    {
        ?band schema:album ?album .
        ?album a schema:MusicAlbum ;
               schema:name ?albumName .
    }
}
ORDER BY ?bandName ?albumName

## Q4(i): Why no public SPARQL endpoint? [2 marks]

### SOLUTION Q4(i)

- **Operational cost/abuse risk:** Arbitrary SPARQL queries are expensive, hard to rate-limit; vulnerability to denial-of-service attacks.

- **Stability/governance:** Evolving schema and data; hard to guarantee backward compatibility and predictable performance for public queries.

## Q4(j): Relational schema mirroring the RDF view [10 marks]

### SOLUTION Q4(j)

**Core entities:**

- **`area`**: `area_id` (PK), `area_type` ('Country'|'AdministrativeArea'|'City'), `name`, `parent_area_id` (FK→area)

- **`artist`**: `artist_id` (PK), `name`, `is_group` BOOLEAN, `founding_date` DATE NULL, `group_origin_area_id` FK→area NULL
  *(Both persons and groups are "artists"; `is_group` distinguishes type. `xsd:gYear` values such as `"1982"` are stored as the first day of that year, e.g. `1982-01-01`.)*

- **`group_membership`**: `group_id` FK→artist, `person_id` FK→artist, `start_date` DATE NOT NULL, `end_date` DATE NULL, PK(group_id, person_id, start_date)
  *(Implements `schema:OrganizationRole`. `start_date` is part of the primary key, so MySQL requires it to be NOT NULL.)*

- **`album`**: `album_id` (PK), `name`, `production_type_id` (FK→schema_enum), `release_type_id` (FK→schema_enum), `date_published` DATE NULL

- **`album_artist`**: `album_id` FK→album, `artist_id` FK→artist, `credited_to_text` VARCHAR(200) NULL, PK(album_id, artist_id)

**Auxiliary:**

- **`schema_enum`**: `enum_id` (PK), `iri` UNIQUE, `label`; e.g., `iri` = `http://schema.org/StudioAlbum` (stored without angle brackets)

**Normalization:**
- **3NF/BCNF:** Each non-key attribute depends only on its key
- Enumerations are normalized (`schema_enum`) to avoid string-literal IRIs
- Areas form a self-referencing tree matching `containedIn`
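
The self-referencing `area` tree can be queried with a recursive common table expression, which plays the role of `containedIn*` in SPARQL. The sketch below uses Python's built-in `sqlite3` with an in-memory database purely so it runs anywhere; the short ids (`'us'`, `'ma'`, ...) are placeholders, and the tables are cut down to the columns the query needs. MySQL 8.0 also supports `WITH RECURSIVE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE area (area_id TEXT PRIMARY KEY, name TEXT NOT NULL,
                   parent_area_id TEXT REFERENCES area(area_id));
CREATE TABLE artist (artist_id TEXT PRIMARY KEY, name TEXT NOT NULL,
                     is_group INTEGER NOT NULL,
                     group_origin_area_id TEXT REFERENCES area(area_id));
INSERT INTO area VALUES ('us', 'United States', NULL),
                        ('ma', 'Massachusetts', 'us'),
                        ('lincoln', 'Lincoln', 'ma');
INSERT INTO artist VALUES ('tmbg', 'They Might Be Giants', 1, 'lincoln'),
                          ('jl', 'John Linnell', 0, NULL);
""")

# Start at the named country, walk down to every contained area,
# then keep the groups whose origin is any area reached.
QUERY = """
WITH RECURSIVE within_country(area_id) AS (
    SELECT area_id FROM area WHERE name = ?
    UNION ALL
    SELECT a.area_id FROM area AS a
    JOIN within_country AS w ON a.parent_area_id = w.area_id
)
SELECT ar.name FROM artist AS ar
JOIN within_country AS w ON w.area_id = ar.group_origin_area_id
WHERE ar.is_group = 1
ORDER BY ar.name
"""
rows = conn.execute(QUERY, ("United States",)).fetchall()
print(rows)  # [('They Might Be Giants',)]
```

The anchor row of the CTE is the country itself, so groups whose origin is recorded at country level are matched as well as those recorded at city or state level.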

In [None]:
%%sql
-- Q4(j) SOLUTION: Relational schema for MusicBrainz data
DROP TABLE IF EXISTS album_artist;
DROP TABLE IF EXISTS album;
DROP TABLE IF EXISTS group_membership;
DROP TABLE IF EXISTS artist;
DROP TABLE IF EXISTS area;
DROP TABLE IF EXISTS schema_enum;

-- Enumeration values (for album types, etc.)
CREATE TABLE schema_enum (
    enum_id INT PRIMARY KEY AUTO_INCREMENT,
    iri VARCHAR(255) UNIQUE NOT NULL,
    label VARCHAR(100)
);

-- Area hierarchy (Country > AdministrativeArea > City)
CREATE TABLE area (
    area_id VARCHAR(36) PRIMARY KEY,  -- UUID from MusicBrainz
    area_type ENUM('Country', 'AdministrativeArea', 'City') NOT NULL,
    name VARCHAR(200) NOT NULL,
    parent_area_id VARCHAR(36) NULL,
    FOREIGN KEY (parent_area_id) REFERENCES area(area_id)
);

-- Artist (both persons and groups)
CREATE TABLE artist (
    artist_id VARCHAR(36) PRIMARY KEY,  -- UUID from MusicBrainz
    name VARCHAR(200) NOT NULL,
    is_group BOOLEAN NOT NULL DEFAULT FALSE,
    founding_date DATE NULL,
    group_origin_area_id VARCHAR(36) NULL,
    FOREIGN KEY (group_origin_area_id) REFERENCES area(area_id)
);

-- Group membership (implements OrganizationRole)
CREATE TABLE group_membership (
    group_id VARCHAR(36) NOT NULL,
    person_id VARCHAR(36) NOT NULL,
    start_date DATE NOT NULL,  -- part of the primary key, so it cannot be NULL
    end_date DATE NULL,
    PRIMARY KEY (group_id, person_id, start_date),
    FOREIGN KEY (group_id) REFERENCES artist(artist_id),
    FOREIGN KEY (person_id) REFERENCES artist(artist_id)
);

-- Album
CREATE TABLE album (
    album_id VARCHAR(36) PRIMARY KEY,
    name VARCHAR(200) NOT NULL,
    production_type_id INT NULL,
    release_type_id INT NULL,
    date_published DATE NULL,
    FOREIGN KEY (production_type_id) REFERENCES schema_enum(enum_id),
    FOREIGN KEY (release_type_id) REFERENCES schema_enum(enum_id)
);

-- Album-Artist relationship
CREATE TABLE album_artist (
    album_id VARCHAR(36) NOT NULL,
    artist_id VARCHAR(36) NOT NULL,
    credited_to_text VARCHAR(200) NULL,
    PRIMARY KEY (album_id, artist_id),
    FOREIGN KEY (album_id) REFERENCES album(album_id),
    FOREIGN KEY (artist_id) REFERENCES artist(artist_id)
);

In [None]:
%%sql
-- Insert sample data matching the RDF

-- Enumerations
INSERT INTO schema_enum (iri, label) VALUES
('http://schema.org/StudioAlbum', 'Studio Album'),
('http://schema.org/AlbumRelease', 'Album Release');

-- Areas (hierarchy: United States > Massachusetts > Lincoln)
INSERT INTO area (area_id, area_type, name, parent_area_id) VALUES
('489ce91b-6658-3307-9877-795b68554c98', 'Country', 'United States', NULL),
('05f68b4c-10f3-49b5-b28c-260a1b707043', 'AdministrativeArea', 'Massachusetts', '489ce91b-6658-3307-9877-795b68554c98'),
('11c4099a-ff61-45a3-ada4-23ac7a25d111', 'City', 'Lincoln', '05f68b4c-10f3-49b5-b28c-260a1b707043');

-- Artists
INSERT INTO artist (artist_id, name, is_group, founding_date, group_origin_area_id) VALUES
('36248428-08ff-4313-abe6-0ebbcaccb4f7', 'John Flansburgh', FALSE, NULL, NULL),
('b48f22c6-cab9-436c-a6d0-99839a19ee05', 'John Linnell', FALSE, NULL, NULL),
('183d6ef6-e161-47ff-9085-063c8b897e97', 'They Might Be Giants', TRUE, '1982-01-01', '11c4099a-ff61-45a3-ada4-23ac7a25d111');

-- Group memberships
INSERT INTO group_membership (group_id, person_id, start_date) VALUES
('183d6ef6-e161-47ff-9085-063c8b897e97', '36248428-08ff-4313-abe6-0ebbcaccb4f7', '1982-01-01'),
('183d6ef6-e161-47ff-9085-063c8b897e97', 'b48f22c6-cab9-436c-a6d0-99839a19ee05', '1982-01-01');

-- Albums
INSERT INTO album (album_id, name, production_type_id, date_published) VALUES
('b9daa8f6-2641-4e24-9a10-ce205cca1df3', 'I Like Fun', 1, '2018-01-01'),
('flood-album-id', 'Flood', 1, '1990-01-01');

-- Album-Artist links
INSERT INTO album_artist (album_id, artist_id, credited_to_text) VALUES
('b9daa8f6-2641-4e24-9a10-ce205cca1df3', '183d6ef6-e161-47ff-9085-063c8b897e97', 'They Might Be Giants'),
('flood-album-id', '183d6ef6-e161-47ff-9085-063c8b897e97', 'They Might Be Giants');

In [None]:
%%sql
-- Verify: Query equivalent to the SPARQL for John Linnell's albums
SELECT 
    al.name AS album_name,
    g.name AS band_name
FROM artist p
JOIN group_membership gm ON gm.person_id = p.artist_id
JOIN artist g ON g.artist_id = gm.group_id
JOIN album_artist aa ON aa.artist_id = g.artist_id
JOIN album al ON al.album_id = aa.album_id
WHERE p.name = 'John Linnell'
ORDER BY g.name, al.name;

---

# End of Solutions

All answers have been verified as executable where applicable.