<a href="https://colab.research.google.com/github/sreent/data-management-intro/blob/main/past-exam-papers/september-2022/notebook-september-2022-solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CM3010 September 2022 - Solutions Notebook

This notebook contains **complete solutions** for the September 2022 exam.

**Exam Structure:**
- Section A: 10 MCQs (Q1a-j) - 40 marks
- Section B: Answer 2 of 3 questions - 60 marks
  - Q2: Database Design and Querying
  - Q3: XML, XPath, and Relational Models
  - Q4: RDF, Ontologies, and Linked Data

**Instructions:**
1. Run the Setup cells first
2. All solution cells are pre-filled with correct answers
3. Compare with your own attempts from the practice notebook

---

# 1. Environment Setup

Run these cells first to set up MySQL, MongoDB, xmllint, jing, rapper, lxml, and rdflib.

In [None]:
# === MySQL Setup ===
!apt -qq update > /dev/null
!apt -y -qq install mysql-server > /dev/null
!service mysql start

# Create user and database
!mysql -e "CREATE USER IF NOT EXISTS 'examuser'@'localhost' IDENTIFIED BY 'exampass';"
!mysql -e "CREATE DATABASE IF NOT EXISTS exam_db;"
!mysql -e "GRANT ALL PRIVILEGES ON *.* TO 'examuser'@'localhost';"

# === xmllint Setup (for XML/XPath and schema validation) ===
!apt -y -qq install libxml2-utils > /dev/null

# === jing Setup (for RelaxNG validation - used by TEI) ===
!apt -y -qq install jing > /dev/null

# === rapper Setup ===
!apt -y -qq install raptor2-utils > /dev/null

# === Python libraries ===
!pip install -q sqlalchemy==2.0.20 ipython-sql==0.5.0 pymysql==1.1.0 prettytable==2.0.0 lxml rdflib

%reload_ext sql
%sql mysql+pymysql://examuser:exampass@localhost/exam_db

print("MySQL ready!")
print("xmllint ready!")
print("jing ready (for RelaxNG validation)!")
print("rapper ready!")

In [None]:
# === MongoDB Setup ===
!wget -q http://archive.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.1f-1ubuntu2_amd64.deb
!dpkg -i libssl1.1_1.1.1f-1ubuntu2_amd64.deb > /dev/null 2>&1
!wget -qO - https://www.mongodb.org/static/pgp/server-4.4.asc | apt-key add - > /dev/null 2>&1
!echo "deb [ arch=amd64,arm64 ] http://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.4 multiverse" | tee /etc/apt/sources.list.d/mongodb-org-4.4.list > /dev/null
!apt-get update -qq > /dev/null
!apt-get install -y -qq mongodb-org > /dev/null
!mkdir -p /data/db
!mongod --fork --logpath /var/log/mongodb.log --dbpath /data/db

!mongo --quiet --eval 'print("MongoDB ready!")'

---

# Section A: MCQ Solutions

Complete solutions for all 10 MCQs.

In [None]:
print("""SECTION A: MCQ SOLUTIONS
========================

Q1(a) SQL Transactions - What is missing?
Answer: iv. COMMIT;
- Without COMMIT, updates remain uncommitted and could be lost
- ROLLBACK undoes changes, END TRANSACTION is non-standard

Q1(b) SPARQL Query - Why doesn't it return Ronaldo's birth city?
Answer: ii. "Cristiano Ronaldo"@en is a string, not a URL.
- Literals cannot be subjects in RDF triples
- Must use a URI like dbr:Cristiano_Ronaldo

Q1(c) RDF Predicates - How many predicates?
Answer: i. 4
- a (rdf:type), foaf:family_name, foaf:givenname, foaf:title

Q1(d) XPath Query - How many results?
Answer: vi. None
- The XPath uses /track (direct child of disk)
- But <track> elements are inside <tracks>, not direct children of <disk>
- Therefore the query returns NOTHING
- If it were //disk/tracks/track[@duration>150]/*, it would return 4

Q1(e) Precision/Recall - Best parameter setting?
Answer: ii. Right of graph before drop (68% precision, 90% recall)
- ~45 min for 3 missed items + ~6 min for false positives
- Better than 100% precision with manual search of remaining 83%

Q1(f) Normal Forms - Which does the table satisfy?
Answer: iv. 1NF only
- Has partial dependencies (Date of Birth depends only on Artist)
- Because it fails 2NF, it cannot satisfy 3NF or BCNF either

Q1(g) E/R Diagram Issues - Why is it not good?
Answer: i, ii, iii, iv, vi, viii
- Cardinality between entity/attribute, no explicit relationships
- Arrow meaningless, spaces in names, invalid cardinalities (21, ß, x)

Q1(h) SQL JOINs - Which query continuations work?
Answer: iii and iv
- Both properly join Client -> Meeting -> Employee
- Filter by client name "Shug Avery"

Q1(i) MongoDB Query - Which finds actors born before 1957?
Answer: vii. db.actors.find({"dateOfBirth": {$lt: ISODate("1957-01-01")}});
- find() returns all matches (not findOne)
- $lt is correct operator (not "<")
- ISODate() for proper date comparison

Q1(j) RecipeML DTD - Which statements are true?
Answer: i, ii (and arguably iv)
- i: Must have exactly one <ingredients> (no modifier)
- ii: Order matters in DTD sequences
- iv: "Can have one" is technically true
""")

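The Q1(d) reasoning can be checked with a small sketch. The XML below is a hypothetical stand-in for the exam document (the element names other than `disk`, `tracks`, `track` and `duration` are assumptions), and it uses the standard-library `xml.etree.ElementTree` rather than a full XPath engine, so the `duration` comparison is done in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the exam document: <track> elements are wrapped
# in <tracks>, so they are grandchildren of <disk>, not children.
SAMPLE = """
<collection>
  <disk>
    <tracks>
      <track duration="140"><title>Short One</title><artist>A</artist></track>
      <track duration="200"><title>Long One</title><artist>B</artist></track>
      <track duration="310"><title>Longer One</title><artist>C</artist></track>
    </tracks>
  </disk>
</collection>
"""

root = ET.fromstring(SAMPLE)

# Direct-child step, as in the exam query: no <track> is a child of <disk>.
direct = root.findall(".//disk/track")

# Path that goes through <tracks>. The duration test is done in Python
# because ElementTree's XPath subset has no numeric comparison predicates.
nested = root.findall(".//disk/tracks/track")
long_tracks = [t for t in nested if int(t.get("duration")) > 150]
children = [child for t in long_tracks for child in t]

print(len(direct))    # 0
print(len(children))  # 4
```

With this sample data, two tracks exceed 150 and each has two child elements, which is why the corrected path yields four nodes; the real exam document may differ in detail.
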
## Q1(j) DTD Validation Demonstration

Let's validate the RecipeML DTD answers by actually running validation tests:

In [None]:
%%writefile recipeml.dtd
<!-- RecipeML DTD (simplified for exam practice) -->
<!-- Based on the DTD shown in Q1(j) -->

<!ELEMENT recipe (head, description*, equipment?, ingredients, directions, nutrition?, diet-exchanges?)>

<!-- Required elements (no modifier = exactly one) -->
<!ELEMENT head (#PCDATA)>
<!ELEMENT ingredients (#PCDATA)>
<!ELEMENT directions (#PCDATA)>

<!-- Optional elements (? = zero or one) -->
<!ELEMENT equipment (#PCDATA)>
<!ELEMENT nutrition (#PCDATA)>
<!ELEMENT diet-exchanges (#PCDATA)>

<!-- Zero or more (* = zero or more) -->
<!ELEMENT description (#PCDATA)>

In [None]:
%%writefile recipe_valid.xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Valid recipe - demonstrates Q1(j)(i): "must have exactly one ingredients" -->
<!DOCTYPE recipe SYSTEM "recipeml.dtd">
<recipe>
  <head>Chocolate Chip Cookies</head>
  <description>A classic family recipe.</description>
  <ingredients>2 cups flour, 1 cup sugar, 1 cup chocolate chips</ingredients>
  <directions>Mix ingredients. Bake at 350F for 12 minutes.</directions>
</recipe>

In [None]:
# Validate the valid recipe
print("=== Testing Q1(j)(i): Recipe MUST have exactly one ingredients ===")
!xmllint --dtdvalid recipeml.dtd recipe_valid.xml --noout && echo "VALID!"

In [None]:
%%writefile recipe_wrong_order.xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- INVALID recipe - wrong element order (demonstrates Q1(j)(ii): order matters) -->
<!DOCTYPE recipe SYSTEM "recipeml.dtd">
<recipe>
  <head>Bad Recipe</head>
  <directions>Cook somehow!</directions>
  <ingredients>Some stuff</ingredients>
</recipe>

In [None]:
# Validate - should FAIL (Q1(j)(ii): ingredients must come BEFORE directions)
print("=== Testing Q1(j)(ii): Order matters in DTD - ingredients before directions ===")
!xmllint --dtdvalid recipeml.dtd recipe_wrong_order.xml --noout 2>&1 || echo "INVALID: Order matters in DTD sequences!"

In [None]:
%%writefile recipe_multi_ingredients.xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- INVALID recipe - multiple ingredients (demonstrates Q1(j)(v) is FALSE) -->
<!DOCTYPE recipe SYSTEM "recipeml.dtd">
<recipe>
  <head>Over-specified Recipe</head>
  <ingredients>First batch of ingredients</ingredients>
  <ingredients>Second batch of ingredients</ingredients>
  <directions>Mix everything!</directions>
</recipe>

In [None]:
# Validate - should FAIL (Q1(j)(v) is FALSE: can only have ONE ingredients)
print("=== Testing Q1(j)(v) is FALSE: Cannot have multiple ingredients ===")
!xmllint --dtdvalid recipeml.dtd recipe_multi_ingredients.xml --noout 2>&1 || echo "INVALID: DTD allows only ONE ingredients element (no * or + modifier)!"

---

# Question 2: Database Design and Querying [30 marks]

## Database Setup

In [None]:
%%sql
DROP TABLE IF EXISTS Tests;
DROP TABLE IF EXISTS Students;

CREATE TABLE Students (
    Id INT PRIMARY KEY,
    GivenName VARCHAR(50) NOT NULL,
    FamilyName VARCHAR(50) NOT NULL,
    Gender VARCHAR(10) NOT NULL,
    BirthDate DATE NOT NULL,
    School VARCHAR(130),
    City VARCHAR(130)
);

CREATE TABLE Tests (
    TestId INT PRIMARY KEY,
    StudentId INT,
    TestDate DATE,
    Score DOUBLE,
    FOREIGN KEY (StudentId) REFERENCES Students(Id)
);

INSERT INTO Students VALUES
(1, 'Alice', 'Smith', 'F', '2005-05-10', 'Birmingham High', 'Birmingham'),
(2, 'Bob', 'Jones', 'M', '2005-06-12', 'Berlin Academy', 'Berlin'),
(3, 'Charlie', 'Brown', 'M', '2004-03-20', 'Seoul International', 'Seoul'),
(4, 'Diana', 'Miles', 'F', '2005-01-01', 'Birmingham High', 'Birmingham');

INSERT INTO Tests VALUES
(101, 1, '2019-01-10', 50.5),
(102, 1, '2019-09-10', 55.0),
(103, 2, '2019-01-10', 80.9),
(104, 2, '2019-09-15', 77.2),
(105, 3, '2019-05-01', 91.0),
(106, 4, '2019-01-10', 63.0);

SELECT 'Database ready!' AS Status;

## Q2(a): Which aggregate function is used? [1 mark]

### Solution

In [None]:
# Q2(a) SOLUTION
print("""Answer: AVG()

The AVG() function calculates the arithmetic mean of Score values.

In the query:
  SELECT AVG(Score) AS Average, ...

Other common aggregate functions:
- COUNT() - count rows
- SUM() - sum values
- MIN() - minimum value
- MAX() - maximum value
""")

In [None]:
%%sql
-- Demonstrate AVG function
SELECT AVG(Score) AS AverageScore,
       YEAR(TestDate) AS TestYear,
       S.Gender,
       S.City
FROM Tests T
INNER JOIN Students S ON T.StudentId = S.Id
GROUP BY YEAR(TestDate), S.Gender, S.City;

## Q2(b): Database design problem [6 marks]

### Solution

In [None]:
# Q2(b) SOLUTION
print("""Problems with the database design:

1. CITY STORED AS FREE TEXT:
   - "Birmingham, AL" vs "Birmingham" vs "Birmingham, USA"
   - These are treated as different cities in GROUP BY
   - Leads to incorrect aggregation results

2. SCHOOL STORED AS FREE TEXT:
   - Same issue - inconsistent naming
   - "Birmingham High" vs "Birmingham High School"

3. DATA DUPLICATION:
   - City and School info repeated for each student
   - Update anomalies possible

SOLUTIONS:

1. Normalize City and School into separate tables:
   - Cities(CityId PK, CityName, Country)
   - Schools(SchoolId PK, SchoolName, CityId FK)
   - Students references CityId and SchoolId

2. Use foreign keys for referential integrity:
   - Ensures consistent values
   - Prevents typos and variations

3. Benefits:
   - Accurate aggregations
   - Single point of update
   - Data consistency
""")

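The aggregation problem described in Q2(b) can be reproduced in a few lines. This is a minimal sketch using an in-memory SQLite database in place of MySQL; the table names `Scores` and `ScoresNorm` and the score values are hypothetical, and it assumes the three spellings are meant to refer to the same city.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Free-text design: the same city is typed three different ways.
conn.execute("CREATE TABLE Scores (City TEXT, Score REAL)")
conn.executemany(
    "INSERT INTO Scores VALUES (?, ?)",
    [("Birmingham", 50.0), ("Birmingham, AL", 60.0), ("Birmingham, USA", 70.0)],
)
free_text_groups = conn.execute(
    "SELECT City, AVG(Score) FROM Scores GROUP BY City ORDER BY City"
).fetchall()

# Normalized design: one Cities row, referenced by key.
conn.execute("CREATE TABLE Cities (CityId INTEGER PRIMARY KEY, CityName TEXT NOT NULL)")
conn.execute("CREATE TABLE ScoresNorm (CityId INTEGER REFERENCES Cities(CityId), Score REAL)")
conn.execute("INSERT INTO Cities VALUES (1, 'Birmingham')")
conn.executemany("INSERT INTO ScoresNorm VALUES (?, ?)", [(1, 50.0), (1, 60.0), (1, 70.0)])
normalized_groups = conn.execute(
    "SELECT C.CityName, AVG(S.Score) FROM ScoresNorm S "
    "JOIN Cities C ON S.CityId = C.CityId "
    "GROUP BY C.CityId, C.CityName"
).fetchall()

print(len(free_text_groups))  # 3 separate groups, one per spelling
print(normalized_groups)      # [('Birmingham', 60.0)]
```

The free-text table produces three one-row groups, while the normalized tables produce a single group with the intended average.
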
## Q2(c): Minimal read-only access [4 marks]

### Solution

In [None]:
# Q2(c) SOLUTION
print("""SQL command for read-only access:

CREATE USER 'researcher'@'localhost' IDENTIFIED BY 'securepassword';
GRANT SELECT ON exam_db.* TO 'researcher'@'localhost';

Or in older MySQL syntax (5.7 and earlier; GRANT ... IDENTIFIED BY was removed in MySQL 8.0):
GRANT SELECT ON exam_db.*
      TO 'researcher'@'localhost'
      IDENTIFIED BY 'securepassword';

Key Points:
- SELECT only - no INSERT, UPDATE, DELETE
- Follows principle of least privilege
- Read-only access as requested
""")

## Q2(d): Aggregated data access [4 marks]

### Solution

In [None]:
# Q2(d) SOLUTION
print("""Approach: Create a VIEW that exposes only aggregated data.

This protects individual student records while allowing research.
""")

In [None]:
%%sql
-- Q2(d) SOLUTION: Create aggregated view
DROP VIEW IF EXISTS AggregatedTestData;

CREATE VIEW AggregatedTestData AS
SELECT S.Gender,
       S.City,
       AVG(T.Score) AS AvgScore,
       COUNT(*) AS SampleSize,
       YEAR(T.TestDate) AS TestYear
FROM Tests T
INNER JOIN Students S ON T.StudentId = S.Id
GROUP BY S.Gender, S.City, YEAR(T.TestDate)
HAVING COUNT(DISTINCT S.Id) >= 2;  -- Minimum number of distinct students per group, for privacy

SELECT 'View created!' AS Status;

In [None]:
%%sql
-- View the aggregated data
SELECT * FROM AggregatedTestData;

In [None]:
# Grant access only to the view
print("""Then grant access only to the view:

GRANT SELECT ON exam_db.AggregatedTestData TO 'researcher'@'localhost';

Benefits:
- Researcher cannot see individual student records
- Only aggregated statistics visible
- HAVING clause ensures minimum group sizes
""")

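When a minimum group size is used for privacy, it matters whether the threshold counts test rows or distinct students. The sketch below loads the notebook's sample rows into an in-memory SQLite database (an assumption made so the example is self-contained; `strftime('%Y', ...)` stands in for MySQL's `YEAR()`) and reports both counts per group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (Id INTEGER PRIMARY KEY, Gender TEXT, City TEXT);
CREATE TABLE Tests (TestId INTEGER PRIMARY KEY, StudentId INTEGER,
                    TestDate TEXT, Score REAL);
INSERT INTO Students VALUES
  (1, 'F', 'Birmingham'), (2, 'M', 'Berlin'),
  (3, 'M', 'Seoul'),      (4, 'F', 'Birmingham');
INSERT INTO Tests VALUES
  (101, 1, '2019-01-10', 50.5), (102, 1, '2019-09-10', 55.0),
  (103, 2, '2019-01-10', 80.9), (104, 2, '2019-09-15', 77.2),
  (105, 3, '2019-05-01', 91.0), (106, 4, '2019-01-10', 63.0);
""")

# Per group: number of test rows versus number of distinct students.
groups = conn.execute("""
SELECT S.Gender, S.City, COUNT(*) AS TestRows, COUNT(DISTINCT S.Id) AS StudentCount
FROM Tests T JOIN Students S ON T.StudentId = S.Id
GROUP BY S.Gender, S.City, strftime('%Y', T.TestDate)
ORDER BY S.City
""").fetchall()

# Cities that pass a threshold of 2 under each counting rule.
by_rows = [city for _, city, rows, _ in groups if rows >= 2]
by_students = [city for _, city, _, students in groups if students >= 2]

print(by_rows)      # ['Berlin', 'Birmingham']
print(by_students)  # ['Birmingham']
```

Berlin passes a row-based threshold even though the group contains a single student with two tests, so a row count alone would expose that student's average.
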
## Q2(e): Limitation for researcher [1 mark]

### Solution

In [None]:
# Q2(e) SOLUTION
print("""Limitations of aggregated-only access:

1. Cannot access individual-level records
2. Cannot perform outlier detection or analysis
3. Cannot conduct correlation studies at individual level
4. Cannot verify data quality or identify data entry errors
5. Cannot analyze small subgroups that get filtered out
6. Cannot track individual student progress over time
7. Limited ability to investigate unexpected patterns
""")

## Q2(f): Student table problems [8 marks]

### Solution

In [None]:
# Q2(f) SOLUTION
print("""Problems with the Student table and solutions:

| Problem                  | Issue                           | Resolution                        |
|--------------------------|---------------------------------|-----------------------------------|
| VARCHAR(25) as PK        | Inefficient indexing            | Use INT AUTO_INCREMENT            |
| Gender ENUM('M','F')     | Binary-only, not inclusive      | Use VARCHAR or add more options   |
| School as free text      | Inconsistent entries            | Create Schools table, use FK      |
| City as free text        | Duplicates, typos possible      | Create Cities table, use FK       |
| No referential integrity | Can't ensure valid school/city  | Add foreign key constraints       |
| School/City can be NULL  | Data quality issues             | Consider NOT NULL or defaults     |
""")

In [None]:
%%sql
-- Q2(f) SOLUTION: Improved normalized schema
DROP TABLE IF EXISTS TestsNorm;
DROP TABLE IF EXISTS StudentsNorm;
DROP TABLE IF EXISTS Schools;
DROP TABLE IF EXISTS Cities;

CREATE TABLE Cities (
    CityId INT PRIMARY KEY AUTO_INCREMENT,
    CityName VARCHAR(100) NOT NULL,
    Country VARCHAR(50)
);

CREATE TABLE Schools (
    SchoolId INT PRIMARY KEY AUTO_INCREMENT,
    SchoolName VARCHAR(130) NOT NULL,
    CityId INT,
    FOREIGN KEY (CityId) REFERENCES Cities(CityId)
);

CREATE TABLE StudentsNorm (
    Id INT PRIMARY KEY AUTO_INCREMENT,
    ExternalId VARCHAR(25) UNIQUE,
    GivenName VARCHAR(80) NOT NULL,
    FamilyName VARCHAR(80) NOT NULL,
    Gender VARCHAR(20),
    BirthDate DATE NOT NULL,
    SchoolId INT,
    CityId INT,
    FOREIGN KEY (SchoolId) REFERENCES Schools(SchoolId),
    FOREIGN KEY (CityId) REFERENCES Cities(CityId)
);

SELECT 'Normalized schema created!' AS Status;

## Q2(g): MongoDB analysis [6 marks]

### Solution

In [None]:
# Q2(g) SOLUTION
print("""MongoDB Analysis for Student/Test Data:

ADVANTAGES:
+------------------+-----------------------------------------------+
| Advantage        | Explanation                                   |
+------------------+-----------------------------------------------+
| Flexible schema  | Can add fields without ALTER TABLE            |
| Embedded docs    | Tests can be embedded in student documents    |
| No rigid struct  | Easy to handle varying data formats           |
| Horizontal scale | Built for distributed systems                 |
| JSON-like docs   | Natural fit for modern web applications       |
+------------------+-----------------------------------------------+

DISADVANTAGES:
+----------------------+---------------------------------------------+
| Disadvantage         | Explanation                                 |
+----------------------+---------------------------------------------+
| No traditional JOINs | Cross-collection queries more complex       |
| Data duplication     | City/School info repeated in each document  |
| Weak ref integrity   | No enforced foreign keys                    |
| Update anomalies     | Changing school name requires many updates  |
| Aggregation complex  | GROUP BY operations less straightforward    |
+----------------------+---------------------------------------------+

Example MongoDB Document:
{
  "_id": "student123",
  "givenName": "Alice",
  "familyName": "Smith",
  "gender": "F",
  "birthDate": ISODate("2005-05-10"),
  "school": "Birmingham High School",
  "city": "Birmingham",
  "tests": [
    {"date": ISODate("2019-01-10"), "score": 50.5},
    {"date": ISODate("2019-09-10"), "score": 55.0}
  ]
}

CONCLUSION: Relational model is better for this use case due to:
- Need for consistent aggregations
- Importance of data integrity
- Complex analytical queries
""")

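To make the "aggregation" row concrete, here is a plain-Python sketch of what averaging scores by city involves when tests are embedded in student documents. It does not use MongoDB itself; the two steps loosely mirror the `$unwind` and `$group` stages of an aggregation pipeline, and the documents are hypothetical, shaped like the example above with dates omitted.

```python
from collections import defaultdict

# Hypothetical documents shaped like the example above (dates omitted).
students = [
    {"givenName": "Alice", "city": "Birmingham",
     "tests": [{"score": 50.5}, {"score": 55.0}]},
    {"givenName": "Diana", "city": "Birmingham",
     "tests": [{"score": 63.0}]},
    {"givenName": "Bob", "city": "Berlin",
     "tests": [{"score": 80.9}, {"score": 77.2}]},
]

# "Unwind": flatten each embedded test into one (city, score) pair.
unwound = [(doc["city"], test["score"]) for doc in students for test in doc["tests"]]

# "Group": collect the scores per city, then average them.
scores_by_city = defaultdict(list)
for city, score in unwound:
    scores_by_city[city].append(score)
averages = {city: sum(values) / len(values) for city, values in scores_by_city.items()}

for city, average in sorted(averages.items()):
    print(city, round(average, 2))
```

In SQL the same result is a single `GROUP BY` over a join; with embedded documents the array has to be flattened first, which is the extra step the table refers to.
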
---

# Question 3: XML, XPath, and Relational Models [30 marks]

## XML Setup

In [None]:
%%writefile manuscript.xml
<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:id="manuscript_3945" xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader xmlns:tei="http://www.tei-c.org/ns/1.0">
    <fileDesc>
      <titleStmt>
        <title>Christ Church MS. 341</title>
        <title type="collection">Christ Church MSS.</title>
        <respStmt>
          <resp>Cataloguer</resp>
          <persName>Ralph Hanna</persName>
          <persName>David Rundle</persName>
        </respStmt>
      </titleStmt>
    </fileDesc>
  </teiHeader>
</TEI>

In [None]:
%%writefile msdesc.rng
<?xml version="1.0" encoding="UTF-8"?>
<!-- Simplified RelaxNG schema for TEI manuscript descriptions -->
<!-- Based on TEI P5 msdesc module - simplified for exam practice -->
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:tei="http://www.tei-c.org/ns/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">

  <start>
    <ref name="TEI"/>
  </start>

  <define name="TEI">
    <element name="TEI" ns="http://www.tei-c.org/ns/1.0">
      <attribute name="xml:id"/>
      <ref name="teiHeader"/>
    </element>
  </define>

  <define name="teiHeader">
    <element name="teiHeader" ns="http://www.tei-c.org/ns/1.0">
      <!-- No attribute pattern is needed for xmlns:tei: namespace declarations are not attributes in the RELAX NG data model -->
      <ref name="fileDesc"/>
    </element>
  </define>

  <define name="fileDesc">
    <element name="fileDesc" ns="http://www.tei-c.org/ns/1.0">
      <ref name="titleStmt"/>
    </element>
  </define>

  <define name="titleStmt">
    <element name="titleStmt" ns="http://www.tei-c.org/ns/1.0">
      <!-- At least one title is REQUIRED -->
      <oneOrMore>
        <ref name="title"/>
      </oneOrMore>
      <!-- respStmt is OPTIONAL (zero or more) -->
      <zeroOrMore>
        <ref name="respStmt"/>
      </zeroOrMore>
    </element>
  </define>

  <define name="title">
    <element name="title" ns="http://www.tei-c.org/ns/1.0">
      <optional>
        <attribute name="type"/>
      </optional>
      <text/>
    </element>
  </define>

  <define name="respStmt">
    <element name="respStmt" ns="http://www.tei-c.org/ns/1.0">
      <ref name="resp"/>
      <oneOrMore>
        <ref name="persName"/>
      </oneOrMore>
    </element>
  </define>

  <define name="resp">
    <element name="resp" ns="http://www.tei-c.org/ns/1.0">
      <text/>
    </element>
  </define>

  <define name="persName">
    <element name="persName" ns="http://www.tei-c.org/ns/1.0">
      <text/>
    </element>
  </define>

</grammar>

In [None]:
# Validate manuscript.xml against the RelaxNG schema
print("=== Validating manuscript.xml against msdesc.rng ===")
!jing msdesc.rng manuscript.xml && echo "VALID: manuscript.xml passes schema validation!"

## Q3(a): Markup language and root node [2 marks]

### Solution

In [None]:
# Q3(a) SOLUTION
print("""Answer:
- Markup Language: XML (specifically TEI - Text Encoding Initiative)
- Root Node: <TEI>

Note: TEI is an XML application/vocabulary designed for
encoding texts in the humanities.
""")

## Q3(b): Well-formedness [3 marks]

### Solution

In [None]:
# Q3(b) SOLUTION
print("""Answer: YES, this fragment is well-formed.

Well-formedness requirements met:
1. Exactly one root element (<TEI>) ✓
2. All tags properly opened and closed ✓
3. Proper nesting (no overlapping tags) ✓
4. Attribute values in quotes ✓
5. Valid characters in element/attribute names ✓

Note: The exam question may show a truncated fragment.
If closing tags are missing, it would NOT be well-formed.
""")

In [None]:
# Verify well-formedness
!xmllint --noout manuscript.xml && echo "XML is well-formed!"

## Q3(c): XPath expression //fileDesc//title/@type [2 marks]

### Solution

In [None]:
# Q3(c) SOLUTION
print("""Answer: The query selects the 'type' attribute value from
<title> elements under <fileDesc>.

Result: "collection"

This is from: <title type="collection">Christ Church MSS.</title>
""")

In [None]:
# Demonstrate with lxml
from lxml import etree

doc = etree.parse('manuscript.xml')
namespaces = {'tei': 'http://www.tei-c.org/ns/1.0'}
result = doc.xpath('//tei:fileDesc//tei:title/@type', namespaces=namespaces)
print("Result:", result)

## Q3(d): XPath //resp[text()='Cataloguer']/../persName [2 marks]

### Solution

In [None]:
# Q3(d) SOLUTION
print("""Answer: All <persName> elements that are siblings of a <resp>
element whose text is exactly "Cataloguer".

Result:
- <persName>Ralph Hanna</persName>
- <persName>David Rundle</persName>

How it works:
1. //resp[text()='Cataloguer'] - Find <resp> with text "Cataloguer"
2. /.. - Navigate UP to parent (<respStmt>)
3. /persName - Select <persName> children
""")

In [None]:
# Test with lxml
result = doc.xpath("//tei:resp[text()='Cataloguer']/../tei:persName/text()", namespaces=namespaces)
print("Result:", result)
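
The lxml cells above pass a `tei` prefix mapping because every element in the document is in the TEI namespace. A minimal sketch of why that matters, using the standard-library `xml.etree.ElementTree` and a trimmed-down fragment (both are assumptions made to keep the example self-contained):

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Trimmed-down fragment: the default namespace applies to every element.
SAMPLE = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <respStmt>
    <resp>Cataloguer</resp>
    <persName>Ralph Hanna</persName>
    <persName>David Rundle</persName>
  </respStmt>
</TEI>
"""

root = ET.fromstring(SAMPLE)

# Un-prefixed names look for elements in no namespace and match nothing.
unprefixed = root.findall(".//persName")

# Prefixed names resolved through the mapping match the TEI elements.
prefixed = [p.text for p in root.findall(".//tei:persName", TEI_NS)]

print(unprefixed)  # []
print(prefixed)    # ['Ralph Hanna', 'David Rundle']
```

The exam writes the paths without prefixes for readability; a namespace-aware processor needs the prefixed form.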

## Q3(e): Why use complex XPath? [4 marks]

### Solution

In [None]:
# Q3(e) SOLUTION
print("""Why use the complex XPath instead of just //persName?

Situation 1: DISAMBIGUATION
- Document may have multiple <persName> elements in different contexts
- Authors, editors, scribes, cataloguers, etc.
- This XPath specifically gets people with "Cataloguer" role

Situation 2: ROLE-SPECIFIC QUERIES
- You only want cataloguers, not all named people
- Other <respStmt> might have roles like "Editor" or "Transcriber"
- The complex XPath filters to only the relevant role

Situation 3: CONTEXT PRESERVATION
- Maintains relationship between person and their role
- You KNOW these people ARE cataloguers

Situation 4: PRECISION IN LARGE DOCUMENTS
- In a catalogue with thousands of entries, //persName returns everyone
- Complex XPath returns only those in cataloguer role
""")

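The difference between the two queries shows up as soon as a second role is present. In this sketch the "Editor" entry and its name are hypothetical additions, and the role filter is written in Python on top of the standard-library `xml.etree.ElementTree` rather than as a single XPath expression.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# The second <respStmt> is a hypothetical addition for illustration.
SAMPLE = """
<titleStmt xmlns="http://www.tei-c.org/ns/1.0">
  <respStmt>
    <resp>Cataloguer</resp>
    <persName>Ralph Hanna</persName>
    <persName>David Rundle</persName>
  </respStmt>
  <respStmt>
    <resp>Editor</resp>
    <persName>Example Editor</persName>
  </respStmt>
</titleStmt>
"""

root = ET.fromstring(SAMPLE)

# Equivalent of //persName: every named person, whatever their role.
everyone = [p.text for p in root.findall(".//tei:persName", NS)]

# Equivalent of //resp[text()='Cataloguer']/../persName:
# only people whose <respStmt> carries the Cataloguer role.
cataloguers = []
for stmt in root.findall(".//tei:respStmt", NS):
    resp = stmt.find("tei:resp", NS)
    if resp is not None and resp.text == "Cataloguer":
        cataloguers.extend(p.text for p in stmt.findall("tei:persName", NS))

print(everyone)     # ['Ralph Hanna', 'David Rundle', 'Example Editor']
print(cataloguers)  # ['Ralph Hanna', 'David Rundle']
```
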
## Q3(f): Relational model for manuscript contents [8 marks]

### Solution

In [None]:
# Q3(f) SOLUTION
print("""Problems with n="2" attribute in relational model:

1. ORDERING NOT IMPLICIT
   - Relational tables have no inherent row order
   - Must store sequence explicitly

2. ATTRIBUTE VS COLUMN
   - The n attribute needs explicit storage as a column

3. NESTED STRUCTURE
   - Manuscripts contain items which may have sub-items
   - Requires careful modeling

SOLUTION: Store sequence number explicitly
""")

In [None]:
%%sql
-- Q3(f) SOLUTION: Relational schema for manuscript contents
DROP TABLE IF EXISTS ManuscriptItems;
DROP TABLE IF EXISTS Manuscripts;

CREATE TABLE Manuscripts (
    ManuscriptId VARCHAR(50) PRIMARY KEY,
    Title VARCHAR(200)
);

CREATE TABLE ManuscriptItems (
    ItemId INT PRIMARY KEY AUTO_INCREMENT,
    ManuscriptId VARCHAR(50),
    ItemNumber INT NOT NULL,  -- Explicit ordering
    Incipit TEXT,
    Explicit TEXT,
    Notes TEXT,
    FOREIGN KEY (ManuscriptId) REFERENCES Manuscripts(ManuscriptId),
    UNIQUE (ManuscriptId, ItemNumber)  -- No duplicate item numbers
);

-- Insert sample data
INSERT INTO Manuscripts VALUES ('manuscript_3945', 'Christ Church MS. 341');
INSERT INTO ManuscriptItems (ManuscriptId, ItemNumber, Incipit) VALUES
('manuscript_3945', 1, 'First textual item...'),
('manuscript_3945', 2, 'Seynt austyn sei in e secounde boke...');

SELECT 'Schema created!' AS Status;

In [None]:
%%sql
-- Query items in correct order
SELECT ManuscriptId, ItemNumber, Incipit
FROM ManuscriptItems
WHERE ManuscriptId = 'manuscript_3945'
ORDER BY ItemNumber;
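
The two design points, explicit ordering and the uniqueness constraint, can be exercised in a short sketch. It uses an in-memory SQLite database in place of MySQL and a reduced `ManuscriptItems` table with placeholder incipits, all of which are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE ManuscriptItems (
    ItemId INTEGER PRIMARY KEY,
    ManuscriptId TEXT NOT NULL,
    ItemNumber INTEGER NOT NULL,
    Incipit TEXT,
    UNIQUE (ManuscriptId, ItemNumber)
)
""")

insert_sql = (
    "INSERT INTO ManuscriptItems (ManuscriptId, ItemNumber, Incipit) "
    "VALUES (?, ?, ?)"
)

# Rows are deliberately inserted out of document order.
conn.executemany(insert_sql, [
    ("manuscript_3945", 2, "Second item"),
    ("manuscript_3945", 1, "First item"),
])

# Document order is recovered only because ItemNumber is stored and sorted on.
ordered = conn.execute(
    "SELECT ItemNumber, Incipit FROM ManuscriptItems "
    "WHERE ManuscriptId = ? ORDER BY ItemNumber",
    ("manuscript_3945",),
).fetchall()

# A second item numbered 2 in the same manuscript violates the UNIQUE constraint.
try:
    conn.execute(insert_sql, ("manuscript_3945", 2, "Duplicate"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(ordered)             # [(1, 'First item'), (2, 'Second item')]
print(duplicate_rejected)  # True
```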

## Q3(g): What is msdesc.rng? [3 marks]

### Solution

In [None]:
# Q3(g) SOLUTION
print("""What is msdesc.rng?

A Relax NG schema file (.rng extension) that:
1. Defines the structure, elements, attributes for valid TEI manuscript descriptions
2. Specifies which elements are required, optional, and their allowed order

Why is it referenced?
1. VALIDATION: Allows XML documents to be validated against the schema
2. STRUCTURE: Specifies constraints for valid documents
3. INTEROPERABILITY: Ensures all catalogue entries follow the same structure
4. DOCUMENTATION: Machine-readable specification of the format

The <?xml-model?> processing instruction tells validators which schema to use.
""")

## Q3(h): Valid vs Well-formed [2 marks]

### Solution

In [None]:
# Q3(h) SOLUTION
print("""Valid vs Well-Formed XML:

| Aspect       | Well-Formed           | Valid                          |
|--------------|-----------------------|--------------------------------|
| Definition   | Follows XML syntax    | Well-formed AND conforms to    |
|              | rules                 | a schema/DTD                   |
| Requirements | Proper nesting, one   | All schema constraints         |
|              | root, closed tags     | satisfied                      |
| Can check    | By any XML parser     | Only with schema available     |
| Relationship | Prerequisite for      | Subset of well-formed          |
|              | validity              | documents                      |

Examples:
- Well-formed but not valid: Syntactically correct but missing required elements
- Valid: Syntactically correct AND follows schema rules
- Not well-formed: Has syntax errors (can never be valid)
""")

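The first column of the table can be checked with any XML parser. The sketch below uses the standard-library `xml.etree.ElementTree`; the helper name `is_well_formed` and the three sample strings are hypothetical. It tests well-formedness only, since validity needs a schema and a validator such as the `jing` and `xmllint` commands used elsewhere in this notebook.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


good = "<titleStmt><title>Christ Church MS. 341</title></titleStmt>"
# Syntactically fine, although a schema requiring <title> would reject it.
no_title = "<titleStmt><respStmt/></titleStmt>"
# Overlapping tags: a syntax error, so it can never be valid either.
overlapping = "<titleStmt><title>Christ Church MS. 341</titleStmt></title>"

print(is_well_formed(good))         # True
print(is_well_formed(no_title))     # True
print(is_well_formed(overlapping))  # False
```
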
## Q3(i) & Q3(j): Omitting elements [2 marks]

### Solution

In [None]:
# Q3(i) and Q3(j) SOLUTION
print("""Q3(i): If respStmt was omitted, would the XML be legal?

Answer:
- Well-formed: YES (if tags still properly closed)
- Valid: DEPENDS on schema
  - Schema shows <zeroOrMore><ref name="model.respLike"/></zeroOrMore>
  - This means respStmt is OPTIONAL, so likely still valid

---

Q3(j): If title elements were omitted, would the XML be legal?

Answer:
- Well-formed: YES (syntactically correct)
- Valid: NO
  - Schema shows <oneOrMore><ref name="title"/></oneOrMore>
  - At least one <title> is REQUIRED in <titleStmt>
  - Without it, validation fails
""")

In [None]:
%%writefile manuscript_no_respstmt.xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Q3(i) VALIDATION TEST: XML without respStmt (should PASS - it's optional) -->
<TEI xml:id="manuscript_3945" xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader xmlns:tei="http://www.tei-c.org/ns/1.0">
    <fileDesc>
      <titleStmt>
        <title>Christ Church MS. 341</title>
        <title type="collection">Christ Church MSS.</title>
        <!-- respStmt is OMITTED - but this is OK because it's optional -->
      </titleStmt>
    </fileDesc>
  </teiHeader>
</TEI>

In [None]:
print("=== Q3(i) Test: Validating XML without respStmt ===")
!jing msdesc.rng manuscript_no_respstmt.xml && echo "VALID: respStmt is OPTIONAL - omitting it is allowed!"

In [None]:
%%writefile manuscript_no_title.xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Q3(j) VALIDATION TEST: XML without title (should FAIL - title is required) -->
<TEI xml:id="manuscript_3945" xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader xmlns:tei="http://www.tei-c.org/ns/1.0">
    <fileDesc>
      <titleStmt>
        <!-- NO title elements - this violates the schema! -->
        <respStmt>
          <resp>Cataloguer</resp>
          <persName>Ralph Hanna</persName>
        </respStmt>
      </titleStmt>
    </fileDesc>
  </teiHeader>
</TEI>

In [None]:
print("=== Q3(j) Test: Validating XML without title elements ===")
!jing msdesc.rng manuscript_no_title.xml || echo "INVALID: title is REQUIRED - at least one must exist!"

## Q3(k): XML to HTML conversion [2 marks]

### Solution

In [None]:
# Q3(k) SOLUTION
print("""Two technologies for automatic XML to HTML conversion:

1. XSLT (Extensible Stylesheet Language Transformations)
   - Purpose-built for transforming XML to other formats
   - Uses template matching to process XML nodes
   - Declarative language

2. XSLT Processor (e.g., Saxon, Xalan, libxslt)
   - Executes the XSLT transformations
   - Can be integrated into automated pipelines
   - Available as command-line tools or libraries

Alternative answer: XQuery could also be used for XML-to-HTML transformation.
""")

---

# Question 4: RDF, Ontologies, and Linked Data [30 marks]

## RDF Setup

In [None]:
%%writefile annotation.ttl
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix oa: <http://www.w3.org/ns/oa#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix myrdf: <http://example.org/> .
@prefix armadale: <https://literary-greats.com/WCollins/Armadale/> .

myrdf:anno-001 a oa:Annotation ;
    dcterms:created "2015-10-13T13:00:00+00:00"^^xsd:dateTime ;
    dcterms:creator myrdf:DL192 ;
    oa:hasBody [
        a oa:TextualBody ;
        rdf:value "Note the use of visual language here."
    ] ;
    oa:hasTarget [
        a oa:SpecificResource ;
        oa:hasSelector [
            a oa:TextPositionSelector ;
            oa:start 235 ;
            oa:end 300
        ] ;
        oa:hasSource armadale:Chapter3
    ] ;
    oa:motivatedBy oa:commenting .

myrdf:DL192 a foaf:Person ;
    foaf:name "David Lewis" .

In [None]:
# Validate the Turtle
!rapper -i turtle -c annotation.ttl

## Q4(a): Model and serialization [2 marks]

### Solution

In [None]:
# Q4(a) SOLUTION
print("""Q4(a)(i) - What is the model?
Answer: RDF (Resource Description Framework)

Q4(a)(ii) - What is the serialisation format?
Answer: Turtle (Terse RDF Triple Language)

Other RDF serialization formats include:
- RDF/XML
- N-Triples
- JSON-LD
- N3 (Notation3)
""")

## Q4(b): Name two ontologies [3 marks]

### Solution

In [None]:
# Q4(b) SOLUTION
print("""Two ontologies used in this document:

1. Dublin Core (dcterms:)
   - http://purl.org/dc/terms/
   - Standard for metadata elements

2. FOAF (foaf:)
   - http://xmlns.com/foaf/0.1/
   - Friend of a Friend - describes people and relationships

Also acceptable:
3. Open Annotation (oa:)
   - http://www.w3.org/ns/oa#
   - Web Annotation vocabulary
""")

## Q4(c): Properties from each ontology [5 marks]

### Solution

In [None]:
# Q4(c) SOLUTION
print("""Properties from each ontology used in this document:

Dublin Core (dcterms:):
- dcterms:created - date/time of creation
- dcterms:creator - person who created the annotation

FOAF (foaf:):
- foaf:name - name of the person

Open Annotation (oa:):
- oa:hasBody - the content of the annotation
- oa:hasTarget - what is being annotated
- oa:hasSelector - specific selection within target
- oa:hasSource - the source document
- oa:motivatedBy - reason for annotation
- oa:start - start position
- oa:end - end position
""")

## Q4(d): Fix the SPARQL query [7 marks]

### Solution

In [None]:
# Q4(d) SOLUTION - Show the broken and fixed queries
print("""BROKEN QUERY:

SELECT ?body ?creator
WHERE {
  ?annotation a oa:Annotation .
  ?creator ;
  oa:hasBody body .
  hasSource armadale:Chapter3 }

PROBLEMS:
1. Missing PREFIX declarations
2. ?creator has no predicate connecting it
3. 'body' should be '?body' (variable)
4. 'hasSource' should be 'oa:hasSource'
5. Need to navigate through oa:hasTarget to get to oa:hasSource
6. Need to get actual text value and creator name

---

CORRECTED QUERY:

PREFIX oa: <http://www.w3.org/ns/oa#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX armadale: <https://literary-greats.com/WCollins/Armadale/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?bodyText ?creatorName
WHERE {
  ?annotation a oa:Annotation ;
              dcterms:creator ?creator ;
              oa:hasBody ?body ;
              oa:hasTarget ?target .
  ?body rdf:value ?bodyText .
  ?target oa:hasSource armadale:Chapter3 .
  ?creator foaf:name ?creatorName .
}
""")

In [None]:
# Test the corrected query with rdflib
import rdflib

g = rdflib.Graph()
g.parse('annotation.ttl', format='turtle')

query = """
PREFIX oa: <http://www.w3.org/ns/oa#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX armadale: <https://literary-greats.com/WCollins/Armadale/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?bodyText ?creatorName
WHERE {
  ?annotation a oa:Annotation ;
              dcterms:creator ?creator ;
              oa:hasBody ?body ;
              oa:hasTarget ?target .
  ?body rdf:value ?bodyText .
  ?target oa:hasSource armadale:Chapter3 .
  ?creator foaf:name ?creatorName .
}
"""

print("Query Results:")
for row in g.query(query):
    print(f"  Body: {row.bodyText}")
    print(f"  Creator: {row.creatorName}")
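
To see what the corrected query is asking for, its seven triple patterns can be matched by hand over a small list of triples. The sketch below is only an illustration: the triples are a hand-copied stand-in for `annotation.ttl` (an assumption, the real file may contain more), prefixed names are treated as plain strings, and `match` and `solve` are hypothetical helpers, not part of rdflib.

```python
# Hand-copied sample triples (assumed stand-in for annotation.ttl).
TRIPLES = [
    ("myrdf:anno-001", "rdf:type", "oa:Annotation"),
    ("myrdf:anno-001", "dcterms:creator", "myrdf:DL192"),
    ("myrdf:anno-001", "oa:hasBody", "_:body1"),
    ("myrdf:anno-001", "oa:hasTarget", "_:target1"),
    ("_:body1", "rdf:value", "Note the use of visual language here."),
    ("_:target1", "oa:hasSource", "armadale:Chapter3"),
    ("myrdf:DL192", "rdf:type", "foaf:Person"),
    ("myrdf:DL192", "foaf:name", "David Lewis"),
]

# The WHERE clause of the corrected query, one tuple per triple pattern.
PATTERNS = [
    ("?annotation", "rdf:type", "oa:Annotation"),
    ("?annotation", "dcterms:creator", "?creator"),
    ("?annotation", "oa:hasBody", "?body"),
    ("?annotation", "oa:hasTarget", "?target"),
    ("?body", "rdf:value", "?bodyText"),
    ("?target", "oa:hasSource", "armadale:Chapter3"),
    ("?creator", "foaf:name", "?creatorName"),
]


def match(pattern, triple, binding):
    """Unify one pattern with one triple; return the extended binding or None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            # A constant must match the triple exactly.
            return None
    return new


def solve(patterns, triples):
    """Extend the candidate bindings pattern by pattern (a naive nested-loop join)."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match(pattern, triple, binding)) is not None
        ]
    return bindings


results = [(b["?bodyText"], b["?creatorName"]) for b in solve(PATTERNS, TRIPLES)]
print(results)
```

Because `?annotation`, `?body`, `?target` and `?creator` each appear in more than one pattern, they act as the join conditions: a binding survives only if every pattern agrees on the value.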

## Q4(e): E/R diagram [5 marks]

### Solution

In [None]:
# Q4(e) SOLUTION
print("""E/R Diagram for Web Annotations:

Option 1: TRIPLE STORE APPROACH (Simple)

┌─────────────────────────────────────────┐
│                Triples                  │
├─────────────────────────────────────────┤
│ Subject   VARCHAR(256)  PK              │
│ Predicate VARCHAR(256)  PK              │
│ Object    VARCHAR(512)  PK              │
└─────────────────────────────────────────┘

Option 2: TRADITIONAL E/R APPROACH

┌──────────────┐         ┌──────────────┐
│  Annotations │         │   Persons    │
├──────────────┤         ├──────────────┤
│ AnnotationId │    ┌───▶│ PersonId PK  │
│ Created      │    │    │ Name         │
│ Motivation   │    │    └──────────────┘
│ CreatorId FK │────┘
└──────┬───────┘
       │ 1:1 (to Bodies and to Targets)
       ├────────────────────────┐
       ▼                        ▼
┌──────────────┐         ┌──────────────┐
│    Bodies    │         │   Targets    │
├──────────────┤         ├──────────────┤
│ BodyId PK    │         │ TargetId PK  │
│ AnnotationId │         │ AnnotationId │
│ BodyType     │         │ SourceURI    │
│ Value        │         │ StartPos     │
└──────────────┘         │ EndPos       │
                         └──────────────┘
""")

## Q4(f): Tables and keys [5 marks]

### Solution

In [None]:
# Q4(f) SOLUTION
print("""Tables for relational implementation:

OPTION 1: SINGLE TRIPLE TABLE

| Table   | Primary Key                     | Foreign Keys |
|---------|---------------------------------|--------------|
| Triples | (Subject, Predicate, Object)    | None         |

OPTION 2: TRADITIONAL RELATIONAL

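| Table       | Primary Key   | Foreign Keys                              |
|-------------|---------------|-------------------------------------------|
| Persons     | PersonId      | -                                         |
| Annotations | AnnotationId  | CreatorId -> Persons(PersonId)            |
| Bodies      | BodyId        | AnnotationId -> Annotations(AnnotationId) |
| Targets     | TargetId      | AnnotationId -> Annotations(AnnotationId) |
| Sources     | SourceId      | -                                         |
| Selectors   | SelectorId    | TargetId -> Targets(TargetId)             |

Sources and Selectors normalize the SourceURI and StartPos/EndPos columns
that the Q4(e) diagram keeps inside Targets. If Sources is split out,
Targets also needs SourceId -> Sources(SourceId).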
""")

In [None]:
%%sql
-- Q4(f) SOLUTION: Create Triple Store table
DROP TABLE IF EXISTS Triples;

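-- MySQL 8 defaults to utf8mb4 (up to 4 bytes per character) and InnoDB caps an
-- index key at 3072 bytes, so a full-width key (1024 chars = 4096 bytes) is
-- rejected. The key therefore uses a 255-character prefix of Object:
-- (256 + 256 + 255) * 4 = 3068 bytes. Uniqueness is checked on that prefix.
CREATE TABLE Triples (
    Subject VARCHAR(256),
    Predicate VARCHAR(256),
    Object VARCHAR(512),
    PRIMARY KEY (Subject, Predicate, Object(255))
);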

-- Insert sample annotation data
INSERT INTO Triples VALUES
('myrdf:anno-001', 'rdf:type', 'oa:Annotation'),
('myrdf:anno-001', 'dcterms:creator', 'myrdf:DL192'),
('myrdf:anno-001', 'oa:hasBody', '_:body1'),
('myrdf:anno-001', 'oa:hasTarget', '_:target1'),
('_:body1', 'rdf:value', 'Note the use of visual language here.'),
('_:target1', 'oa:hasSource', 'armadale:Chapter3'),
('myrdf:DL192', 'rdf:type', 'foaf:Person'),
('myrdf:DL192', 'foaf:name', 'David Lewis');

SELECT 'Triple store created!' AS Status;
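
For comparison with the triple table, Option 2 can be sketched in the same way. The example below uses Python's built-in `sqlite3` rather than MySQL (an assumption, chosen so it runs without a server) and follows the Q4(e) diagram, keeping `SourceURI`, `StartPos` and `EndPos` inside `Targets`. The column types, the `'TextualBody'` value and the NULLs are illustrative placeholders, not values taken from the exam data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Schema following the Option 2 E/R diagram (types are assumptions).
conn.executescript("""
CREATE TABLE Persons (
    PersonId TEXT PRIMARY KEY,
    Name     TEXT NOT NULL
);
CREATE TABLE Annotations (
    AnnotationId TEXT PRIMARY KEY,
    Created      TEXT,
    Motivation   TEXT,
    CreatorId    TEXT NOT NULL REFERENCES Persons(PersonId)
);
CREATE TABLE Bodies (
    BodyId       INTEGER PRIMARY KEY,
    AnnotationId TEXT NOT NULL REFERENCES Annotations(AnnotationId),
    BodyType     TEXT,
    Value        TEXT
);
CREATE TABLE Targets (
    TargetId     INTEGER PRIMARY KEY,
    AnnotationId TEXT NOT NULL REFERENCES Annotations(AnnotationId),
    SourceURI    TEXT,
    StartPos     INTEGER,
    EndPos       INTEGER
);
INSERT INTO Persons VALUES ('DL192', 'David Lewis');
INSERT INTO Annotations VALUES ('anno-001', NULL, NULL, 'DL192');
INSERT INTO Bodies VALUES
    (1, 'anno-001', 'TextualBody', 'Note the use of visual language here.');
INSERT INTO Targets VALUES (1, 'anno-001', 'armadale:Chapter3', NULL, NULL);
""")

# The foreign key rejects an annotation whose creator is not in Persons.
try:
    conn.execute("INSERT INTO Annotations VALUES ('anno-002', NULL, NULL, 'nobody')")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True

# Same question as the SPARQL query: body text and creator name for Chapter 3.
rows = conn.execute("""
    SELECT b.Value, p.Name
    FROM Annotations a
    JOIN Persons p ON p.PersonId = a.CreatorId
    JOIN Bodies  b ON b.AnnotationId = a.AnnotationId
    JOIN Targets t ON t.AnnotationId = a.AnnotationId
    WHERE t.SourceURI = 'armadale:Chapter3'
""").fetchall()
print(rows, fk_enforced)
```

With dedicated tables the question needs three ordinary joins on keys, at the cost of a schema that must change whenever a new kind of property is added.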

## Q4(g): MySQL equivalent query [3 marks]

### Solution

In [None]:
# Q4(g) SOLUTION explanation
print("""MySQL query equivalent for the SPARQL query:

Using the triple store design, we need multiple self-joins:
""")

In [None]:
%%sql
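-- Q4(g) SOLUTION: MySQL equivalent query
-- tAnno has no predicate filter, so it matches every triple whose subject is
-- the annotation; DISTINCT collapses the duplicate rows this would produce.
SELECT DISTINCT tBodyVal.Object AS BodyText,
       tCreatorName.Object AS CreatorName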
FROM Triples tAnno
INNER JOIN Triples tType
    ON tAnno.Subject = tType.Subject
INNER JOIN Triples tBody
    ON tAnno.Subject = tBody.Subject
INNER JOIN Triples tBodyVal
    ON tBody.Object = tBodyVal.Subject
INNER JOIN Triples tCreator
    ON tAnno.Subject = tCreator.Subject
INNER JOIN Triples tCreatorName
    ON tCreator.Object = tCreatorName.Subject
INNER JOIN Triples tTarget
    ON tAnno.Subject = tTarget.Subject
INNER JOIN Triples tSource
    ON tTarget.Object = tSource.Subject
WHERE tType.Predicate = 'rdf:type'
  AND tType.Object = 'oa:Annotation'
  AND tBody.Predicate = 'oa:hasBody'
  AND tBodyVal.Predicate = 'rdf:value'
  AND tCreator.Predicate = 'dcterms:creator'
  AND tCreatorName.Predicate = 'foaf:name'
  AND tTarget.Predicate = 'oa:hasTarget'
  AND tSource.Predicate = 'oa:hasSource'
  AND tSource.Object = 'armadale:Chapter3';
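
A self-join query of this shape can also be checked without a MySQL server. The sketch below is an assumption-laden stand-in: it uses Python's built-in `sqlite3` with `TEXT` columns, reloads the eight sample triples, and gives each of the seven SPARQL triple patterns exactly one alias. It also shows why every alias should be pinned to a predicate: an extra unfiltered alias joined on the annotation subject multiplies the result by the number of triples that subject has (four in the sample data) unless `DISTINCT` is used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Triples (
        Subject TEXT, Predicate TEXT, Object TEXT,
        PRIMARY KEY (Subject, Predicate, Object)
    )
""")
conn.executemany("INSERT INTO Triples VALUES (?, ?, ?)", [
    ("myrdf:anno-001", "rdf:type", "oa:Annotation"),
    ("myrdf:anno-001", "dcterms:creator", "myrdf:DL192"),
    ("myrdf:anno-001", "oa:hasBody", "_:body1"),
    ("myrdf:anno-001", "oa:hasTarget", "_:target1"),
    ("_:body1", "rdf:value", "Note the use of visual language here."),
    ("_:target1", "oa:hasSource", "armadale:Chapter3"),
    ("myrdf:DL192", "rdf:type", "foaf:Person"),
    ("myrdf:DL192", "foaf:name", "David Lewis"),
])

# One alias per triple pattern; tType doubles as the anchor for the annotation.
FROM_CLAUSE = "FROM Triples tType"
QUERY = f"""
SELECT tBodyVal.Object, tCreatorName.Object
{FROM_CLAUSE}
JOIN Triples tBody        ON tBody.Subject = tType.Subject
JOIN Triples tBodyVal     ON tBodyVal.Subject = tBody.Object
JOIN Triples tCreator     ON tCreator.Subject = tType.Subject
JOIN Triples tCreatorName ON tCreatorName.Subject = tCreator.Object
JOIN Triples tTarget      ON tTarget.Subject = tType.Subject
JOIN Triples tSource      ON tSource.Subject = tTarget.Object
WHERE tType.Predicate = 'rdf:type'
  AND tType.Object = 'oa:Annotation'
  AND tBody.Predicate = 'oa:hasBody'
  AND tBodyVal.Predicate = 'rdf:value'
  AND tCreator.Predicate = 'dcterms:creator'
  AND tCreatorName.Predicate = 'foaf:name'
  AND tTarget.Predicate = 'oa:hasTarget'
  AND tSource.Predicate = 'oa:hasSource'
  AND tSource.Object = 'armadale:Chapter3'
"""
rows = conn.execute(QUERY).fetchall()

# Same query with an extra alias (tAnno) that has no predicate filter.
unfiltered_from = (
    "FROM Triples tAnno\n"
    "JOIN Triples tType ON tType.Subject = tAnno.Subject"
)
dup_rows = conn.execute(QUERY.replace(FROM_CLAUSE, unfiltered_from)).fetchall()

print(len(rows), len(dup_rows))
```

The joins on `Subject = Object` are what stand in for SPARQL's shared variables: `tBody.Object` plays the role of `?body`, `tTarget.Object` of `?target`, and `tCreator.Object` of `?creator`.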

In [None]:
# Q4(g) Explanation
print("""Note: This demonstrates why SPARQL is more natural for RDF data.

The SQL query references the Triples table EIGHT times (seven self-joins)
to traverse the graph structure, whereas SPARQL expresses the same
traversal directly as triple patterns with shared variables.

Each alias represents finding a specific triple pattern:
- tAnno: Base annotation (any triple whose subject is the annotation)
- tType: Verify it's an oa:Annotation
- tBody: Find oa:hasBody
- tBodyVal: Get rdf:value from body
- tCreator: Find dcterms:creator
- tCreatorName: Get foaf:name from creator
- tTarget: Find oa:hasTarget
- tSource: Verify oa:hasSource is armadale:Chapter3
""")

---

# End of Solutions Notebook

All solutions have been provided. Compare with your attempts in the practice notebook!