<a href="https://colab.research.google.com/github/sreent/data-management-intro/blob/main/past-exam-papers/march-2023/notebook-march-2023-solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CM3010 March 2023 - Solutions Notebook

This notebook contains **complete solutions** for the March 2023 exam.

**Exam Structure:**
- Section A: MCQs (taken separately on VLE)
- Section B: Answer 2 of 3 questions - 60 marks
  - Q2: Analyzing OpenDocument Format (ODF) and RelaxNG Schema
  - Q3: MusicBrainz / Linked Data
  - Q4: Enhancing an ER Model for 16th-Century Music Records

**Instructions:**
1. Run the Setup cells first
2. All solution cells are pre-filled with correct answers
3. Compare with your own attempts from the practice notebook

---

# 1. Environment Setup

Run this setup cell first to install MySQL, xmllint, jing, rapper, and the Python libraries (lxml, rdflib, ipython-sql).

In [None]:
# === MySQL Setup ===
!apt -qq update > /dev/null
!apt -y -qq install mysql-server > /dev/null
!service mysql start

# Create user and database
!mysql -e "CREATE USER IF NOT EXISTS 'examuser'@'localhost' IDENTIFIED BY 'exampass';"
!mysql -e "CREATE DATABASE IF NOT EXISTS exam_db;"
!mysql -e "GRANT ALL PRIVILEGES ON exam_db.* TO 'examuser'@'localhost';"

# === xmllint Setup (for XML/XPath exercises) ===
!apt -y -qq install libxml2-utils > /dev/null

# === jing Setup (for RelaxNG validation - used by ODF) ===
!apt -y -qq install jing > /dev/null

# === rapper Setup (for RDF/Turtle validation) ===
!apt -y -qq install raptor2-utils > /dev/null

# === Python libraries ===
!pip install -q sqlalchemy==2.0.20 ipython-sql==0.5.0 pymysql==1.1.0 prettytable==2.0.0 lxml rdflib

%reload_ext sql
%sql mysql+pymysql://examuser:exampass@localhost/exam_db

print("MySQL ready!")
print("xmllint ready!")
print("jing ready (for RelaxNG validation)!")
print("rapper ready!")

---

# Question 2: Analyzing OpenDocument Format (ODF) and RelaxNG Schema [30 marks]

## Context

An extract from an ODF word processing document is shown below:

In [None]:
%%writefile odf_extract.xml
<?xml version="1.0" encoding="UTF-8"?>
<office:text xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
             xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <text:p>Introduction to Data Structures</text:p>
  <text:list>
    <text:list-item>
      <text:p>Trees</text:p>
    </text:list-item>
    <text:list-item>
      <text:p>Graphs</text:p>
    </text:list-item>
    <text:list-item>
      <text:p>Relations</text:p>
    </text:list-item>
  </text:list>
</office:text>

In [None]:
%%writefile odf_text.rng
<?xml version="1.0" encoding="UTF-8"?>
<!-- Simplified RelaxNG schema for ODF text elements -->
<!-- Based on the schema snippet shown in the exam -->
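<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
         xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">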

  <start>
    <ref name="office-text"/>
  </start>

  <define name="office-text">
    <element name="office:text" ns="urn:oasis:names:tc:opendocument:xmlns:office:1.0">
      <zeroOrMore>
        <choice>
          <ref name="text-p"/>
          <ref name="text-list"/>
        </choice>
      </zeroOrMore>
    </element>
  </define>

  <define name="text-p">
    <element name="text:p" ns="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
      <optional>
        <attribute name="text:style-name"/>
      </optional>
      <text/>
    </element>
  </define>

  <!-- This is the schema snippet from the exam question -->
  <define name="text-list">
    <element name="text:list" ns="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
      <ref name="text-list-attr"/>
      <optional>
        <ref name="text-list-header"/>
      </optional>
      <zeroOrMore>
        <ref name="text-list-item"/>
      </zeroOrMore>
    </element>
  </define>

  <define name="text-list-attr">
    <optional>
      <attribute name="text:style-name"/>
    </optional>
    <optional>
      <attribute name="text:continue-numbering"/>
    </optional>
  </define>

  <define name="text-list-header">
    <element name="text:list-header" ns="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
      <zeroOrMore>
        <ref name="text-p"/>
      </zeroOrMore>
    </element>
  </define>

  <define name="text-list-item">
    <element name="text:list-item" ns="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
      <zeroOrMore>
        <choice>
          <ref name="text-p"/>
          <ref name="text-list"/>  <!-- Allows nested lists -->
        </choice>
      </zeroOrMore>
    </element>
  </define>

</grammar>

In [None]:
# Validate the ODF extract against the RelaxNG schema
print("=== Validating odf_extract.xml against odf_text.rng ===")
!jing odf_text.rng odf_extract.xml && echo "VALID: odf_extract.xml passes schema validation!"

### RelaxNG Schema Snippet

```xml
<define name="text-list">
  <element name="text:list">
    <ref name="text-list-attr"/>
    <optional>
      <ref name="text-list-header"/>
    </optional>
    <zeroOrMore>
      <ref name="text-list-item"/>
    </zeroOrMore>
  </element>
</define>
```

## Q2(a): What language is this encoded in? [1 mark]

### Answer: **XML (Extensible Markup Language)**

ODF files are ZIP containers containing XML files. The snippet shows tags like `<office:text>` and `<text:p>` which are XML elements.

## Q2(b): What data structure does it use? [1 mark]

### Answer: **Tree (hierarchical) structure**

XML inherently uses a tree structure - a single root element with nested children forming a hierarchy.

## Q2(c): List the two namespaces [2 marks]

### Answer:

1. `urn:oasis:names:tc:opendocument:xmlns:office:1.0` (prefix: `office:`)
2. `urn:oasis:names:tc:opendocument:xmlns:text:1.0` (prefix: `text:`)

## Q2(d): XPath expressions [4 marks]

### Answer:

**`//text:list-item/text:p`:**
- Selects `<text:p>` elements that are **direct children** of `<text:list-item>`

**`//text:list//text:p`:**
- Selects **all** `<text:p>` elements that are **descendants** of `<text:list>` (at any depth)

**In this example:** Both expressions return the same three items (`Trees`, `Graphs`, `Relations`) because each `<text:p>` is already a direct child of `<text:list-item>`. In a more complex or nested structure, these expressions could yield different results.
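
To see the two expressions diverge, consider a list that also carries a `<text:list-header>`. The sketch below is an illustration, not part of the exam data: the header paragraph `Topics` is invented, and it uses the standard-library `xml.etree.ElementTree` (whose limited XPath needs a leading `.`) instead of lxml so that it runs without extra installs.

```python
import xml.etree.ElementTree as ET

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}

# Hypothetical variant of the extract: the list gains a header paragraph.
XML_WITH_HEADER = """\
<office:text xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
             xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <text:list>
    <text:list-header>
      <text:p>Topics</text:p>
    </text:list-header>
    <text:list-item>
      <text:p>Trees</text:p>
    </text:list-item>
  </text:list>
</office:text>
"""

root = ET.fromstring(XML_WITH_HEADER)

# Child step: only paragraphs whose parent is a list-item.
direct = [p.text for p in root.findall(".//text:list-item/text:p", NS)]

# Descendant step: every paragraph anywhere below a list, header included.
descendants = [p.text for p in root.findall(".//text:list//text:p", NS)]

print(direct)       # ['Trees']
print(descendants)  # ['Topics', 'Trees']
```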

In [None]:
# Verify XPath expressions
from lxml import etree

doc = etree.parse('odf_extract.xml')
namespaces = {
    'office': 'urn:oasis:names:tc:opendocument:xmlns:office:1.0',
    'text': 'urn:oasis:names:tc:opendocument:xmlns:text:1.0'
}

# Test //text:list-item/text:p
result1 = doc.xpath('//text:list-item/text:p/text()', namespaces=namespaces)
print("//text:list-item/text:p:", result1)

# Test //text:list//text:p
result2 = doc.xpath('//text:list//text:p/text()', namespaces=namespaces)
print("//text:list//text:p:", result2)

print("\nBoth return same results in this case:", result1 == result2)

## Q2(e): Well-formedness [2 marks]

### Answer:

The RelaxNG schema **does not** help assess well-formedness.

A schema only checks structure and allowed elements **after** the document is confirmed well-formed by an XML parser. Well-formedness rules (correct tag nesting, matching start/end tags, single root, quoted attributes) are checked by the parser, not the schema.
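
The separation can be seen with a parser alone, with no schema involved. This is a minimal sketch using Python's standard-library parser; the helper name `is_well_formed` and both sample strings are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text; no schema is consulted."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Mismatched end tags: rejected by the parser before any schema could run.
broken = "<list><item>Trees</list></item>"

# Well-formed, yet it could still be invalid against a schema.
unknown_element = "<list><invalid-element>Trees</invalid-element></list>"

print(is_well_formed(broken))           # False
print(is_well_formed(unknown_element))  # True
```

The second string parses cleanly, which is why validity has to be checked as a separate step.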

## Q2(f): Validity [2 marks]

### Answer:

The RelaxNG schema checks if the document follows the structural rules it defines:
- Required elements and their sequences
- Allowed attributes and their values
- Cardinality (how many of each element)

If the document meets these requirements, it is **valid**; otherwise, it is invalid.

## Q2(g): Schema relevance [2 marks]

### Answer:

The RelaxNG snippet is relevant to `<text:list>` elements and their children:
- `<text:list-header>` (optional)
- `<text:list-item>` (zero or more)

## Q2(h): Invalid element example [3 marks]

### Answer:

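```xml
<text:list>
  <text:list-item><text:p>Item Content</text:p></text:list-item>
  <text:invalid-element>Invalid Content</text:invalid-element>
</text:list>
```

`<text:invalid-element>` is not defined in the schema: the `text-list` pattern allows only an optional `<text:list-header>` followed by zero or more `<text:list-item>` elements, so the document fails validation.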

In [None]:
%%writefile odf_invalid.xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Q2(h) demonstration: an INVALID ODF document -->
<office:text xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
             xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <text:list>
    <text:invalid-element>This element is not allowed!</text:invalid-element>
    <text:list-item>
      <text:p>Valid item</text:p>
    </text:list-item>
  </text:list>
</office:text>

In [None]:
# Q2(h) Validate the INVALID ODF - should show an error
print("=== Validating odf_invalid.xml - should FAIL ===")
!jing odf_text.rng odf_invalid.xml || echo "INVALID: text:invalid-element is not defined in the schema!"

## Q2(i): Compare XML vs Relational [13 marks]

### Answer:

**XML / Tree Structures for Word Processing:**

| Advantages | Disadvantages |
|------------|---------------|
| Natural hierarchy for nested structures | Verbose and repetitive |
| Standards (ODF, OOXML) well supported | Complex queries (XPath/XQuery learning curve) |
| Flexible schema, easy to embed metadata | Processing overhead for large documents |
| Mixed content (text + markup) support | |

**Relational Model:**

| Advantages | Disadvantages |
|------------|---------------|
| Strong data integrity (PK/FK, constraints) | Poor fit for deeply nested data |
| Efficient SQL for structured queries | Rigid schema, changes require ALTER TABLE |
| Mature tools for backup, replication | Cannot handle mixed content naturally |
| ACID compliance | Object-relational impedance mismatch |

**Conclusion:** XML is well suited to hierarchical, text-heavy documents. The relational model is better for strongly structured, tabular data with extensive analytical queries.

---

# Question 3: MusicBrainz / Linked Data [30 marks]

## Context

RDF/Turtle data describing a music group (BTS):

In [None]:
%%writefile musicbrainz.ttl
@prefix schema: <http://schema.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix mba: <http://musicbrainz.org/artist/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

mba:9fe8e-ba27-4859-bb8c-2f255f346853
    a schema:MusicGroup ;
    schema:name "BTS"@en ;
    schema:foundingDate "2013-06-12"^^xsd:date ;
    schema:member [
        a schema:OrganizationRole ;
        schema:startDate "2013-06-12"^^xsd:date ;
        schema:member mba:person-jin
    ] ;
    schema:member [
        a schema:OrganizationRole ;
        schema:startDate "2013-06-12"^^xsd:date ;
        schema:member mba:person-suga
    ] .

mba:person-jin
    a schema:Person, schema:MusicGroup ;
    schema:name "JIN"@en .

mba:person-suga
    a schema:Person ;
    schema:name "SUGA"@en .

In [None]:
# Validate the Turtle
!rapper -i turtle -c musicbrainz.ttl

## Q3(a): Accept header type [1 mark]

### Answer: **`text/turtle`** (the registered media type for Turtle; `application/x-turtle` is an older, unregistered alias)
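
As a rough illustration of content negotiation, the sketch below builds (but does not send) a request that asks for Turtle. The URL is the artist URI exactly as it appears in the Turtle above; whether the server actually honours this header is an assumption.

```python
from urllib.request import Request

# Artist URI as written in the Turtle above. No request is sent here,
# because server support for Turtle negotiation is only assumed.
ARTIST_URL = "http://musicbrainz.org/artist/9fe8e-ba27-4859-bb8c-2f255f346853"

request = Request(ARTIST_URL, headers={"Accept": "text/turtle"})

print(request.get_header("Accept"))  # text/turtle
```

Passing `request` to `urllib.request.urlopen` would perform the actual fetch.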

## Q3(b): Full URL of predicate [1 mark]

### Answer: **`http://schema.org/member`**

The `schema:` prefix expands to `http://schema.org/`.
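
Prefix expansion is plain string concatenation. A minimal sketch, where the `expand` helper is a hypothetical name and the prefix table copies the `@prefix` lines of the Turtle file:

```python
# Prefix table copied from the @prefix declarations in musicbrainz.ttl.
PREFIXES = {
    "schema": "http://schema.org/",
    "mba": "http://musicbrainz.org/artist/",
}


def expand(prefixed_name: str) -> str:
    """Replace the prefix before the first colon with its namespace IRI."""
    prefix, local = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local


print(expand("schema:member"))  # http://schema.org/member
```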

## Q3(c): Band member count [1 mark]

### Answer: **2** members (JIN and SUGA)

## Q3(d): Comment on schema:member usage [3 marks]

### Answer:

A **role-based** approach is used:

1. The band node has `schema:member` pointing to a blank node of type `schema:OrganizationRole`
2. That blank node itself has `schema:member` pointing to the person's URI

**Benefits:**
- Allows adding membership attributes like `schema:startDate` to the role object
- Models the relationship (membership) as an entity with its own properties
- Common pattern in RDF for n-ary relationships
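
The two-hop structure can be sketched with plain tuples. This is an illustration only: the triples mirror the Turtle above, `_:role1` and `_:role2` are stand-in labels for the two blank nodes, and the helper names are assumptions.

```python
# (subject, predicate, object) triples mirroring the Turtle above.
BAND = "mba:9fe8e-ba27-4859-bb8c-2f255f346853"
TRIPLES = [
    (BAND, "schema:member", "_:role1"),
    ("_:role1", "schema:startDate", "2013-06-12"),
    ("_:role1", "schema:member", "mba:person-jin"),
    (BAND, "schema:member", "_:role2"),
    ("_:role2", "schema:startDate", "2013-06-12"),
    ("_:role2", "schema:member", "mba:person-suga"),
    ("mba:person-jin", "schema:name", "JIN"),
    ("mba:person-suga", "schema:name", "SUGA"),
]


def objects(subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def members_with_start_dates(band):
    rows = []
    for role in objects(band, "schema:member"):        # hop 1: band -> role
        for person in objects(role, "schema:member"):  # hop 2: role -> person
            for name in objects(person, "schema:name"):
                for start in objects(role, "schema:startDate"):
                    rows.append((name, start))
    return rows


print(members_with_start_dates(BAND))
# [('JIN', '2013-06-12'), ('SUGA', '2013-06-12')]
```

The start date hangs off the role node, not the person, which is what lets one person hold several dated memberships.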

## Q3(e): Types for "JIN" [1 mark]

### Answer:

JIN is typed as both:
- `schema:Person`
- `schema:MusicGroup`

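(This double typing is consistent with schema.org, whose `MusicGroup` definition also covers solo musicians; it likely comes from how the MusicBrainz RDF is auto-generated.)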

In [None]:
# Verify with rdflib
import rdflib

g = rdflib.Graph()
g.parse('musicbrainz.ttl', format='turtle')

query = """
PREFIX schema: <http://schema.org/>
SELECT ?type WHERE {
  ?person schema:name "JIN"@en .
  ?person a ?type .
}
"""

print("Types for JIN:")
for row in g.query(query):
    print(f"  - {row[0]}")

## Q3(f): SPARQL prefixes [1 mark]

### Answer:

```sparql
PREFIX mba: <http://musicbrainz.org/artist/>
PREFIX schema: <http://schema.org/>
```

(And possibly `rdf:` if using `rdf:type`.)

## Q3(g): Query results [2 marks]

### Answer:

The query returns pairs of `(?a, ?b)` where:
- `?a` = **member name** (e.g., "JIN", "SUGA")
- `?b` = **startDate** from the membership role (e.g., "2013-06-12")

Essentially: each band member's name plus when they joined.

In [None]:
# Verify with rdflib
query = """
PREFIX mba: <http://musicbrainz.org/artist/>
PREFIX schema: <http://schema.org/>

SELECT ?a ?b WHERE {
  mba:9fe8e-ba27-4859-bb8c-2f255f346853 schema:member ?c .
  ?c schema:startDate ?b ;
     schema:member ?d .
  ?d schema:name ?a .
}
"""

print("Query results:")
for row in g.query(query):
    print(f"  Name: {row.a}, StartDate: {row.b}")

## Q3(h): ER diagram [6 marks]

### Answer:

```
Artist                         Membership
+---------------+              +---------------+
| ArtistId PK   |<-----+       | BandId PK, FK |
| Name          |      |       | MemberId PK,FK|
| Type          |      +-------| StartDate     |
| FoundingDate  |              | RoleName      |
+---------------+              +---------------+
```

- **Artist**: Holds both bands and individuals (differentiated by `Type`)
- **Membership**: Composite PK (BandId, MemberId), both FK to Artist
- `BandId` references Artist row with `Type='MusicGroup'`
- `MemberId` references Artist row with `Type='Person'`
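
The two foreign keys only guarantee that `BandId` and `MemberId` exist in `Artist`; the `Type` conditions in the last two bullets need a separate check. A minimal sketch of such a check follows, using SQLite in place of MySQL so it runs without a server. The second membership row, which uses a person as the band, is invented to trigger the check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Artist (
  ArtistId     INTEGER PRIMARY KEY,
  Name         TEXT NOT NULL,
  Type         TEXT NOT NULL,
  FoundingDate TEXT
);
CREATE TABLE Membership (
  BandId    INTEGER NOT NULL REFERENCES Artist(ArtistId),
  MemberId  INTEGER NOT NULL REFERENCES Artist(ArtistId),
  StartDate TEXT,
  RoleName  TEXT,
  PRIMARY KEY (BandId, MemberId)
);
INSERT INTO Artist VALUES (1, 'BTS', 'MusicGroup', '2013-06-12'),
                          (2, 'JIN', 'Person', NULL),
                          (3, 'SUGA', 'Person', NULL);
-- Second row is invented: artist 3 is a Person, not a band.
INSERT INTO Membership VALUES (1, 2, '2013-06-12', 'Member'),
                              (3, 2, '2013-06-12', 'Member');
""")

# Rows whose band is not a MusicGroup or whose member is not a Person.
bad_rows = conn.execute("""
SELECT m.BandId, m.MemberId
FROM Membership m
JOIN Artist band   ON band.ArtistId   = m.BandId
JOIN Artist member ON member.ArtistId = m.MemberId
WHERE band.Type <> 'MusicGroup' OR member.Type <> 'Person'
""").fetchall()

print(bad_rows)  # [(3, 2)]
```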

## Q3(i): CREATE TABLE commands [4 marks]

### Answer:

In [None]:
%%sql
DROP TABLE IF EXISTS Membership;
DROP TABLE IF EXISTS Artist;

CREATE TABLE Artist (
  ArtistId     INT PRIMARY KEY,
  Name         VARCHAR(100) NOT NULL,
  Type         VARCHAR(20)  NOT NULL,  -- 'Person' or 'MusicGroup'
  FoundingDate DATE
);

CREATE TABLE Membership (
  BandId    INT NOT NULL,
  MemberId  INT NOT NULL,
  StartDate DATE,
  RoleName  VARCHAR(100),
  PRIMARY KEY (BandId, MemberId),
  FOREIGN KEY (BandId)   REFERENCES Artist(ArtistId),
  FOREIGN KEY (MemberId) REFERENCES Artist(ArtistId)
);

SELECT 'Tables created!' AS Status;

In [None]:
%%sql
-- Insert sample data
INSERT INTO Artist VALUES (1, 'BTS', 'MusicGroup', '2013-06-12');
INSERT INTO Artist VALUES (2, 'JIN', 'Person', NULL);
INSERT INTO Artist VALUES (3, 'SUGA', 'Person', NULL);

INSERT INTO Membership VALUES (1, 2, '2013-06-12', 'Member');
INSERT INTO Membership VALUES (1, 3, '2013-06-12', 'Member');

SELECT * FROM Artist;
SELECT * FROM Membership;

## Q3(j): Data integrity query [5 marks]

### Answer:

Query to find members who joined before the band was founded:

In [None]:
%%sql
SELECT aMember.Name AS MemberName,
       aBand.Name   AS BandName,
       m.StartDate,
       aBand.FoundingDate
FROM Membership m
INNER JOIN Artist aBand   ON m.BandId   = aBand.ArtistId
INNER JOIN Artist aMember ON m.MemberId = aMember.ArtistId
WHERE m.StartDate < aBand.FoundingDate;
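
With the sample rows inserted earlier, every `StartDate` equals the `FoundingDate`, so the query returns nothing. The sketch below reruns the same query against an invented row that predates the founding date. It uses SQLite instead of MySQL so it is self-contained, and relies on ISO-formatted date strings comparing correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Artist (
  ArtistId     INTEGER PRIMARY KEY,
  Name         TEXT NOT NULL,
  Type         TEXT NOT NULL,
  FoundingDate TEXT
);
CREATE TABLE Membership (
  BandId    INTEGER NOT NULL REFERENCES Artist(ArtistId),
  MemberId  INTEGER NOT NULL REFERENCES Artist(ArtistId),
  StartDate TEXT,
  RoleName  TEXT,
  PRIMARY KEY (BandId, MemberId)
);
INSERT INTO Artist VALUES (1, 'BTS', 'MusicGroup', '2013-06-12'),
                          (2, 'JIN', 'Person', NULL),
                          (3, 'SUGA', 'Person', NULL);
-- The 2012 start date is invented to violate the rule.
INSERT INTO Membership VALUES (1, 2, '2013-06-12', 'Member'),
                              (1, 3, '2012-01-01', 'Member');
""")

violations = conn.execute("""
SELECT aMember.Name, aBand.Name, m.StartDate, aBand.FoundingDate
FROM Membership m
INNER JOIN Artist aBand   ON m.BandId   = aBand.ArtistId
INNER JOIN Artist aMember ON m.MemberId = aMember.ArtistId
WHERE m.StartDate < aBand.FoundingDate
""").fetchall()

print(violations)  # [('SUGA', 'BTS', '2012-01-01', '2013-06-12')]
```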

## Q3(k): Database dump vs Linked Data [5 marks]

### Answer:

**Database Dump:**

| Pros | Cons |
|------|------|
| Complete offline snapshot for large queries | Becomes outdated quickly |
| Independent of network availability | Large storage overhead |
| Full control over query performance | Requires database setup |
| Can create custom indexes | Maintenance burden |

**Linked Data:**

| Pros | Cons |
|------|------|
| Reflects the publisher's current data, with no local refresh step | Network dependent |
| Easy to interlink with other sources | Slower for bulk operations |
| No local storage required | Rate limits may apply |
| Standard SPARQL queries | Endpoint downtime affects apps |

---

# Question 4: Enhancing an ER Model for 16th-Century Music Records [30 marks]

## Context

An existing ER model for a database of 16th-century European music books needs enhancement.

## Q4(a): Order and coordinates [3 marks]

### Answer:

Add attributes to the **Line** entity:

| Attribute | Type | Purpose |
|-----------|------|--------|
| `LineOrder` | INT | Track visual/logical order on page |
| `XCoordinate` | FLOAT | Horizontal position |
| `YCoordinate` | FLOAT | Vertical position |

## Q4(b): Tablebook format [8 marks]

### Answer:

**New Entities:**

1. **InstrumentOrVoicePart** - e.g., "Soprano," "Alto," "Violin Part"
2. **Region** - defines different areas on a page, potentially oriented differently

**Relationships:**

- **Line** now references:
  - A **Piece** (the composition)
  - A **Page** (the physical page it's on)
  - A **Region** (sub-area of that page)
  - An **InstrumentOrVoicePart** (e.g., a soprano or instrumental part)

```
Piece ||--o{ Line : has
Page ||--o{ Line : contains
Page ||--o{ Region : has
Region ||--o{ Line : includes
InstrumentOrVoicePart ||--o{ Line : is_for
```

## Q4(c): Tables, PKs, and FKs [7 marks]

### Answer:

| Table | Primary Key | Foreign Keys |
|-------|-------------|--------------|
| **Book** | BookId | - |
| **Piece** | PieceId | - |
| **Page** | PageId | BookId → Book(BookId) |
| **Region** | RegionId | PageId → Page(PageId) |
| **InstrumentOrVoicePart** | PartId | - |
| **Line** | LineId | PieceId → Piece(PieceId), PageId → Page(PageId), RegionId → Region(RegionId), PartId → InstrumentOrVoicePart(PartId) |

In [None]:
%%sql
DROP TABLE IF EXISTS Line;
DROP TABLE IF EXISTS Region;
DROP TABLE IF EXISTS Page;
DROP TABLE IF EXISTS Piece;
DROP TABLE IF EXISTS InstrumentOrVoicePart;
DROP TABLE IF EXISTS Book;

CREATE TABLE Book (
    BookId INT PRIMARY KEY AUTO_INCREMENT,
    Title VARCHAR(200) NOT NULL
);

CREATE TABLE Piece (
    PieceId INT PRIMARY KEY AUTO_INCREMENT,
    Title VARCHAR(200) NOT NULL,
    Composer VARCHAR(100)
);

CREATE TABLE Page (
    PageId INT PRIMARY KEY AUTO_INCREMENT,
    BookId INT,
    PageNumber INT,
    FOREIGN KEY (BookId) REFERENCES Book(BookId)
);

CREATE TABLE Region (
    RegionId INT PRIMARY KEY AUTO_INCREMENT,
    PageId INT,
    Description VARCHAR(100),
    Orientation VARCHAR(20),
    FOREIGN KEY (PageId) REFERENCES Page(PageId)
);

CREATE TABLE InstrumentOrVoicePart (
    PartId INT PRIMARY KEY AUTO_INCREMENT,
    PartName VARCHAR(50) NOT NULL
);

CREATE TABLE Line (
    LineId INT PRIMARY KEY AUTO_INCREMENT,
    PieceId INT,
    PageId INT,
    RegionId INT,
    PartId INT,
    LineOrder INT,
    XCoordinate FLOAT,
    YCoordinate FLOAT,
    FOREIGN KEY (PieceId) REFERENCES Piece(PieceId),
    FOREIGN KEY (PageId) REFERENCES Page(PageId),
    FOREIGN KEY (RegionId) REFERENCES Region(RegionId),
    FOREIGN KEY (PartId) REFERENCES InstrumentOrVoicePart(PartId)
);

SELECT 'Tables created!' AS Status;

In [None]:
%%sql
-- Insert sample data
INSERT INTO Book VALUES (1, 'Cantiones Sacrae 1575');
INSERT INTO Piece VALUES (1, 'Ave Maria', 'Palestrina'), (2, 'Kyrie', 'Byrd');
INSERT INTO Page VALUES (1, 1, 1), (2, 1, 2);
INSERT INTO Region VALUES (1, 1, 'Top region', 'horizontal'), (2, 1, 'Bottom region', 'vertical');
INSERT INTO InstrumentOrVoicePart VALUES (1, 'Soprano'), (2, 'Alto'), (3, 'Tenor');

INSERT INTO Line VALUES (1, 1, 1, 1, 1, 1, 10.0, 50.0);
INSERT INTO Line VALUES (2, 1, 1, 1, 2, 2, 10.0, 100.0);
INSERT INTO Line VALUES (3, 1, 1, 2, 3, 1, 200.0, 50.0);
INSERT INTO Line VALUES (4, 2, 2, NULL, 1, 1, 10.0, 50.0);
INSERT INTO Line VALUES (5, 2, 2, NULL, 2, 2, 10.0, 100.0);

SELECT 'Sample data inserted!' AS Status;

## Q4(d): Line count query [5 marks]

### Answer:

Query to list pieces with total number of lines:

In [None]:
%%sql
SELECT p.Title, COUNT(*) AS TotalLines
FROM Piece p
INNER JOIN Line l ON p.PieceId = l.PieceId
GROUP BY p.PieceId, p.Title;
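
One design choice worth noting: the `INNER JOIN` omits any piece that has no lines yet. If such pieces should appear with a count of 0, a `LEFT JOIN` with `COUNT(l.LineId)` is one option. The sketch below uses SQLite for portability, trimmed-down tables, and an invented third piece, `Gloria`, that has no lines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Piece (PieceId INTEGER PRIMARY KEY, Title TEXT NOT NULL);
CREATE TABLE Line (
  LineId  INTEGER PRIMARY KEY,
  PieceId INTEGER REFERENCES Piece(PieceId)
);
-- 'Gloria' is invented and deliberately has no lines.
INSERT INTO Piece VALUES (1, 'Ave Maria'), (2, 'Kyrie'), (3, 'Gloria');
INSERT INTO Line VALUES (1, 1), (2, 1), (3, 1), (4, 2), (5, 2);
""")

# COUNT(l.LineId) ignores the NULLs produced by unmatched pieces.
counts = conn.execute("""
SELECT p.Title, COUNT(l.LineId) AS TotalLines
FROM Piece p
LEFT JOIN Line l ON p.PieceId = l.PieceId
GROUP BY p.PieceId, p.Title
ORDER BY p.PieceId
""").fetchall()

print(counts)  # [('Ave Maria', 3), ('Kyrie', 2), ('Gloria', 0)]
```

Using `COUNT(*)` with the `LEFT JOIN` would report 1 for `Gloria`, since the unmatched row itself is counted.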

## Q4(e): Compare with another model [7 marks]

### Answer:

**Relational Model Assessment:**

| Pros | Cons |
|------|------|
| Structured queries (count, filter, aggregate) | Many join tables for nested structures |
| Clear FK constraints enforce integrity | Layout/coordinate data may not fit naturally |
| SQL is well-known | Schema changes require ALTER TABLE |
| Efficient joins for normalized data | |

**Comparison with XML/Document Database:**

| Pros | Cons |
|------|------|
| Natural hierarchy mirrors page/region/line | Harder aggregation (counting, grouping) |
| Mixed content support | XPath/XQuery learning curve |
| Flexible schema | Performance issues with large XML |
| Integrates with music encoding standards (MEI) | |

**Conclusion:**
- **Relational** is ideal for structured queries and data integrity
- **XML/Document** is better for deeply nested layout data and preserving document structure
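
The trade-off can be sketched in a few lines. The encoding below is hypothetical: the element and attribute names are invented for illustration and are not taken from MEI or the exam model.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical encoding of one page; names are illustrative only.
PAGE_XML = """\
<page number="1">
  <region orientation="horizontal">
    <line piece="Ave Maria" part="Soprano"/>
    <line piece="Ave Maria" part="Alto"/>
  </region>
  <region orientation="vertical">
    <line piece="Ave Maria" part="Tenor"/>
  </region>
</page>
"""

page = ET.fromstring(PAGE_XML)

# Nesting is explicit in the document: each region carries its own lines.
lines_per_region = [len(region.findall("line")) for region in page.findall("region")]

# Aggregation across the tree needs a manual walk instead of GROUP BY.
lines_per_piece = Counter(line.get("piece") for line in page.iter("line"))

print(lines_per_region)       # [2, 1]
print(dict(lines_per_piece))  # {'Ave Maria': 3}
```

The page/region/line layout falls out of the nesting for free, while the per-piece count that SQL expresses in one `GROUP BY` has to be assembled in application code.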

---

# End of Solutions Notebook

All solutions have been provided. Compare with your attempts in the practice notebook!