<a href="https://colab.research.google.com/github/sreent/data-management-intro/blob/main/past-exam-papers/september-2022/notebook-september-2022.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CM3010 September 2022 - Practice Notebook

This notebook provides hands-on practice for the September 2022 exam.

**Exam Structure:**
- Section A: 10 MCQs (Q1a-j) - 40 marks
- Section B: Answer 2 of 3 questions - 60 marks
  - Q2: Database Design and Querying
  - Q3: XML, XPath, and Relational Models
  - Q4: RDF, Ontologies, and Linked Data

**Instructions:**
1. Run the Setup cells first
2. Write your answers in the empty code cells
3. Check your answers against the solution sheet

---

# 1. Environment Setup

Run these cells first to set up MySQL, MongoDB, xmllint, rapper, and rdflib.

In [None]:
# === MySQL Setup ===
!apt -qq update > /dev/null
!apt -y -qq install mysql-server > /dev/null
!service mysql start

# Create user and database
!mysql -e "CREATE USER IF NOT EXISTS 'examuser'@'localhost' IDENTIFIED BY 'exampass';"
!mysql -e "CREATE DATABASE IF NOT EXISTS exam_db;"
!mysql -e "GRANT ALL PRIVILEGES ON *.* TO 'examuser'@'localhost';"

# === xmllint Setup (for XML/XPath exercises) ===
!apt -y -qq install libxml2-utils > /dev/null

# === rapper Setup (for RDF/Turtle validation) ===
!apt -y -qq install raptor2-utils > /dev/null

# === Python libraries ===
!pip install -q sqlalchemy==2.0.20 ipython-sql==0.5.0 pymysql==1.1.0 prettytable==2.0.0 lxml rdflib

%reload_ext sql
%sql mysql+pymysql://examuser:exampass@localhost/exam_db

# Confirm each tool actually responds rather than printing unconditionally
!xmllint --version 2>&1 | head -n 1
!rapper --version
%sql SELECT 'MySQL ready!' AS Status;

In [None]:
# === MongoDB Setup ===
!wget -q http://archive.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.1f-1ubuntu2_amd64.deb
!dpkg -i libssl1.1_1.1.1f-1ubuntu2_amd64.deb > /dev/null 2>&1
!wget -qO - https://www.mongodb.org/static/pgp/server-4.4.asc | apt-key add - > /dev/null 2>&1
!echo "deb [ arch=amd64,arm64 ] http://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.4 multiverse" | tee /etc/apt/sources.list.d/mongodb-org-4.4.list > /dev/null
!apt-get update -qq > /dev/null
!apt-get install -y -qq mongodb-org > /dev/null
!mkdir -p /data/db
!mongod --fork --logpath /var/log/mongodb.log --dbpath /data/db

# Test MongoDB is running
!mongo --quiet --eval 'print("MongoDB ready!")'

---

# Section A: Multiple Choice Questions

Answer the MCQs by writing your choice (e.g., "iv") in the answer cells.

## Q1(a): SQL Transactions [4 marks]

**Question:** What is missing from the following set of commands?

```sql
START TRANSACTION;
UPDATE Account SET Balance = Balance-100 WHERE AccNo=21430885;
UPDATE Account SET Balance = Balance+100 WHERE AccNo=29584776;
SELECT SUM(Balance) FROM Account;
```

**Options:**
- i. ROLLBACK;
- ii. INSERT INTO Account VALUES (100);
- iii. END TRANSACTION;
- iv. COMMIT;
- v. UPDATE Account SET Balance = Balance+100 WHERE AccNo=21430885;

In [None]:
# Q1(a) YOUR ANSWER:
answer_1a = ""  # Enter your choice: i, ii, iii, iv, or v

## Q1(b): SPARQL Query Issue [4 marks]

**Question:** The following query should return the name of the city of Cristiano Ronaldo's birth. Why doesn't it?

```sparql
SELECT DISTINCT *
WHERE {
  "Cristiano Ronaldo"@en dbo:birthPlace
    [
      a dbo:City ;
      rdfs:label ?cityName
    ] .
  FILTER ( LANG(?cityName) = 'en' )
}
```

**Options:**
- i. The city is not in England, so the filter removes it.
- ii. "Cristiano Ronaldo"@en is a string, not a URL. It can't be the subject of a triple.
- iii. The first part of the WHERE clause is a duple, not a triple.
- iv. Ronaldo's place of birth is not in Wikipedia in a way that dbpedia can access.

In [None]:
# Q1(b) YOUR ANSWER:
answer_1b = ""  # Enter your choice: i, ii, iii, or iv

## Q1(c): RDF Predicate Count [4 marks]

**Question:** How many predicates does this extract contain?

```turtle
card:I a :Male;
       foaf:family_name "Berners-Lee";
       foaf:givenname "Timothy";
       foaf:title "Sir".
```

**Options:** i. 4, ii. 7, iii. 5, iv. 8, v. 9, vi. 1, vii. 12, viii. 2, ix. None

In [None]:
# Q1(c) YOUR ANSWER:
answer_1c = ""  # Enter your choice

## Q1(d): XPath Query Count [4 marks]

**Question:** How many results does this XPath query select?

```xpath
//disk[@xml:id="1847336"]/track[@duration>150]/*
```

Given XML:
```xml
<collection>
  <disk xml:id="1847336">
    <title>The Greatest Hits Ever: Volume 123</title>
    <tracks>
      <track no="1" duration="193">
        <title>What is wrong with parsley?</title>
        <artist>Herbal Reasoning</artist>
      </track>
      <track no="2" duration="167">
        <title>Love threw me a googly</title>
        <artist>Botham and the Fielders</artist>
      </track>
      <track no="3" duration="121">
        <title>Comedy farm</title>
        <artist>Just weird</artist>
      </track>
    </tracks>
  </disk>
</collection>
```

**Options:** i. 5, ii. 4, iii. 1, iv. 6, v. 3, vi. None, vii. 2

In [None]:
# Q1(d) YOUR ANSWER:
answer_1d = ""  # Enter your choice

## Q1(e): Precision/Recall Optimization [4 marks]

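**Question:** An archive search tool trades precision against recall as its parameter is varied (the options below refer to the graph in the exam paper). Which parameter setting minimises the total time spent?

- Archive: 50,000 items, of which 30 are relevant
- Time to find a missed item manually: 15 minutes
- Time wasted on each false positive: 30 seconds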

**Options:**
- i. Just right of centre - about 12 false negatives and 5 false positives
- ii. Right of graph before drop - 68% precision and 90% recall
- iii. Do not use this tool - find manually
- iv. Left of graph - 100% precision with 17% recall, find rest at 15 min each
- v. Left of graph - 100% precision with 17% recall, find rest at 30 seconds each
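
The comparison between options comes down to arithmetic: each false negative costs 15 minutes and each false positive costs 30 seconds. The helper below is a small sketch of that cost model. The function names are illustrative, and the example operating point is hypothetical rather than one of the exam options.

```python
MISS_COST_MIN = 15.0            # minutes to find one missed relevant item manually
FALSE_POSITIVE_COST_MIN = 0.5   # 30 seconds wasted per false positive
RELEVANT_ITEMS = 30             # relevant items in the archive


def total_cost_minutes(false_negatives, false_positives):
    """Total time lost to misses and false alarms, in minutes."""
    return (false_negatives * MISS_COST_MIN
            + false_positives * FALSE_POSITIVE_COST_MIN)


def counts_from_rates(precision, recall, relevant=RELEVANT_ITEMS):
    """Convert precision and recall into (false_negatives, false_positives)."""
    true_positives = recall * relevant
    false_negatives = relevant - true_positives
    # precision = TP / (TP + FP)  =>  FP = TP / precision - TP
    false_positives = true_positives / precision - true_positives
    return false_negatives, false_positives


# Hypothetical operating point (not one of the exam options)
fn, fp = counts_from_rates(precision=0.5, recall=0.8)
print(f"FN={fn:.1f}, FP={fp:.1f}, cost={total_cost_minutes(fn, fp):.1f} min")
```

Plug in the figures from each option and compare the resulting totals before choosing.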

In [None]:
# Q1(e) YOUR ANSWER:
answer_1e = ""  # Enter your choice

## Q1(f): Normal Forms [4 marks]

**Question:** Which normal forms does the Music Singles table satisfy? (Select ALL that apply)

| Chart | Date | Position | Title | Artist | Date of Birth |
|-------|------|----------|-------|--------|---------------|
| RIAS | 2022-04-14 | 1 | As It Was | Harry Styles | 1994-02-01 |
| ... | ... | ... | ... | ... | ... |

**Options:**
- i. 2NF
- ii. 3NF
- iii. 5NF
- iv. 1NF
- v. 4NF
- vi. None of them
- vii. SNCF
- viii. BCNF

In [None]:
# Q1(f) YOUR ANSWER:
answer_1f = ""  # Enter your choice(s), e.g., "iv" or "i, ii"

## Q1(g): E/R Diagram Issues [4 marks]

**Question:** Why is the plant identification E/R diagram (shown in the exam paper, not reproduced here) a poor design? (Select ALL that apply)

**Options:**
- i. By convention, cardinality is only given between entities (not attributes)
- ii. Entities are connected without explicit relationships
- iii. The arrow is meaningless
- iv. Spaces are not permitted in attribute names
- v. An attribute can't be shared between entities
- vi. Cardinalities like '21' are not allowed
- vii. There is a ternary relationship
- viii. Cardinalities like ß and x are inadvisable

In [None]:
# Q1(g) YOUR ANSWER:
answer_1g = ""  # Enter your choice(s)

## Q1(h): SQL JOINs [4 marks]

**Question:** Find staff who had interactions with client "Shug Avery". Which query continuations work?

```sql
SELECT Employee.givenName, Employee.familyName
```

**Options:**
- i. FROM Employee LEFT JOIN Client ON (Client.name="Shug Avery");
- ii. FROM Meeting LEFT JOIN Client ON (Meeting.ID = Client.ID)...
- iii. FROM Client INNER JOIN Meeting ON (Meeting.ClientID = Client.ID)...
- iv. FROM Employee, Client, Meeting WHERE Employee.ID = Meeting.EmployeeID...
- v. FROM Client NATURAL JOIN Employee...

In [None]:
# Q1(h) YOUR ANSWER:
answer_1h = ""  # Enter your choice(s)

## Q1(i): MongoDB Query [4 marks]

**Question:** Which query finds actors born before 1957?

**Options:**
- i. db.actors.findOne({"dateOfBirth": {$lt: ISODate("1957-01-01")}});
- ii. db.actors.findOne({"dateOfBirth": {$lt: 1957}});
- iii-vi. (various incorrect queries)
- vii. db.actors.find({"dateOfBirth": {$lt: ISODate("1957-01-01")}});
- viii. db.actors.findOne({"dateOfBirth": {"<": ISODate("1957-01-01")}});

In [None]:
# Q1(i) YOUR ANSWER:
answer_1i = ""  # Enter your choice

## Q1(j): RecipeML DTD [4 marks]

**Question:** Given the RecipeML DTD:
```dtd
<!ELEMENT recipe (head, description*, equipment?, ingredients, directions, nutrition?, diet-exchanges?)>
```

Select ALL true statements:
- i. `<recipe>` must have one `<ingredients>` element as a direct child
- ii. The `<ingredients>` element must come before the `<directions>` element
- iii. The order of children of `<recipe>` is not important
- iv. `<recipe>` can have one `<ingredients>` element as a direct child
- v. `<recipe>` can have multiple `<ingredients>` elements as direct children
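
A DTD sequence content model behaves like a regular expression over the ordered list of child element names. As a rough way to test candidate child sequences once you have answered, the sketch below rewrites the `recipe` content model as a Python regex. The function name and the candidate lists are illustrative, and this is not a substitute for a real DTD validator.

```python
import re

# The <recipe> content model, written as a regex over space-terminated names:
# "," becomes concatenation, "*" and "?" keep their usual meaning.
CONTENT_MODEL = re.compile(
    r"head "
    r"(description )*"
    r"(equipment )?"
    r"ingredients "
    r"directions "
    r"(nutrition )?"
    r"(diet-exchanges )?"
)


def matches_recipe_model(children):
    """Return True if the ordered child names satisfy the content model."""
    sequence = "".join(name + " " for name in children)
    return CONTENT_MODEL.fullmatch(sequence) is not None


# Hypothetical child sequences to try; add your own
candidates = [
    ["head", "ingredients", "directions"],
    ["head", "directions", "ingredients"],
    ["head", "ingredients", "ingredients", "directions"],
    ["head", "description", "description", "ingredients", "directions", "nutrition"],
]
for children in candidates:
    print(children, "->", matches_recipe_model(children))
```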

In [None]:
# Q1(j) YOUR ANSWER:
answer_1j = ""  # Enter your choice(s)

---

# Question 2: Database Design and Querying [30 marks]

## Context

An organisation monitoring non-verbal reasoning tests maintains a database. A sociologist runs:

```sql
SELECT AVG(Score) AS Average,
       Year(TestDate) AS TestYear,
       Gender,
       TIMESTAMPDIFF(YEAR, BirthDate, TestDate) AS Age,
       Student.City as City
FROM Test INNER JOIN Student ON Test.Student=Student.ID
GROUP BY TestYear, Gender, Age, City
```

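## Database Setup

The practice tables below are named `Students` and `Tests`, and the foreign key is `Tests.StudentId`, whereas the exam query refers to `Student`, `Test`, and `Test.Student`. Adjust the table and column names before running the exam query against this sample data.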

In [None]:
%%sql
DROP TABLE IF EXISTS Tests;
DROP TABLE IF EXISTS Students;

CREATE TABLE Students (
    Id INT PRIMARY KEY,
    GivenName VARCHAR(50) NOT NULL,
    FamilyName VARCHAR(50) NOT NULL,
    Gender VARCHAR(10) NOT NULL,
    BirthDate DATE NOT NULL,
    School VARCHAR(130),
    City VARCHAR(130)
);

CREATE TABLE Tests (
    TestId INT PRIMARY KEY,
    StudentId INT,
    TestDate DATE,
    Score DOUBLE,
    FOREIGN KEY (StudentId) REFERENCES Students(Id)
);

-- Insert sample data
INSERT INTO Students VALUES
(1, 'Alice', 'Smith', 'F', '2005-05-10', 'Birmingham High', 'Birmingham'),
(2, 'Bob', 'Jones', 'M', '2005-06-12', 'Berlin Academy', 'Berlin'),
(3, 'Charlie', 'Brown', 'M', '2004-03-20', 'Seoul International', 'Seoul'),
(4, 'Diana', 'Miles', 'F', '2005-01-01', 'Birmingham High', 'Birmingham');

INSERT INTO Tests VALUES
(101, 1, '2019-01-10', 50.5),
(102, 1, '2019-09-10', 55.0),
(103, 2, '2019-01-10', 80.9),
(104, 2, '2019-09-15', 77.2),
(105, 3, '2019-05-01', 91.0),
(106, 4, '2019-01-10', 63.0);

SELECT 'Database ready!' AS Status;
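
The exam query names the tables `Test` and `Student`, while the setup cell creates `Tests` and `Students`. As a rough, self-contained way to see what the grouped query returns on the sample rows, the sketch below loads the same data into an in-memory SQLite database with Python's `sqlite3` module. Using SQLite is an assumption made for portability and is not the notebook's MySQL setup: SQLite has no `YEAR()` or `TIMESTAMPDIFF()`, so `strftime` expressions stand in for them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the notebook's MySQL database
conn.executescript("""
CREATE TABLE Students (
    Id INTEGER PRIMARY KEY, GivenName TEXT, FamilyName TEXT,
    Gender TEXT, BirthDate TEXT, School TEXT, City TEXT);
CREATE TABLE Tests (
    TestId INTEGER PRIMARY KEY, StudentId INTEGER REFERENCES Students(Id),
    TestDate TEXT, Score REAL);
INSERT INTO Students VALUES
 (1, 'Alice', 'Smith', 'F', '2005-05-10', 'Birmingham High', 'Birmingham'),
 (2, 'Bob', 'Jones', 'M', '2005-06-12', 'Berlin Academy', 'Berlin'),
 (3, 'Charlie', 'Brown', 'M', '2004-03-20', 'Seoul International', 'Seoul'),
 (4, 'Diana', 'Miles', 'F', '2005-01-01', 'Birmingham High', 'Birmingham');
INSERT INTO Tests VALUES
 (101, 1, '2019-01-10', 50.5), (102, 1, '2019-09-10', 55.0),
 (103, 2, '2019-01-10', 80.9), (104, 2, '2019-09-15', 77.2),
 (105, 3, '2019-05-01', 91.0), (106, 4, '2019-01-10', 63.0);
""")

# The sociologist's query adapted to the practice schema.
# Age = year difference, minus 1 if the birthday has not yet occurred that year.
query = """
SELECT AVG(t.Score) AS Average,
       CAST(strftime('%Y', t.TestDate) AS INTEGER) AS TestYear,
       s.Gender AS Gender,
       CAST(strftime('%Y', t.TestDate) AS INTEGER)
         - CAST(strftime('%Y', s.BirthDate) AS INTEGER)
         - (strftime('%m-%d', t.TestDate) < strftime('%m-%d', s.BirthDate)) AS Age,
       s.City AS City
FROM Tests AS t INNER JOIN Students AS s ON t.StudentId = s.Id
GROUP BY TestYear, Gender, Age, City
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

In the notebook itself, the same adaptation can be run in a `%%sql` cell with MySQL's own `YEAR()` and `TIMESTAMPDIFF()` functions.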

## Q2(a): Which aggregate function is used? [1 mark]

In [None]:
# Q2(a) YOUR ANSWER:
# Which aggregate function is used in the query above?


## Q2(b): Database design problem [6 marks]

**Question:** There is a problem with the database design that risks making the aggregation incorrect. What is it, and how could it be resolved?

In [None]:
# Q2(b) YOUR ANSWER:
# Describe the problem and solution:


## Q2(c): Minimal read-only access [4 marks]

**Question:** For security reasons, the researcher should be given minimal, read-only access. Give a suitable command.

In [None]:
# Q2(c) YOUR ANSWER:
# Write the SQL GRANT command:


## Q2(d): Aggregated data access [4 marks]

**Question:** How would you give access only to aggregated data to protect confidential information about minors?

In [None]:
# Q2(d) YOUR ANSWER:
# Describe your approach (creating a VIEW):


In [None]:
%%sql
-- Q2(d) Create the VIEW:


## Q2(e): Limitation for researcher [1 mark]

**Question:** What limitation would giving only aggregated access create for the researcher?

In [None]:
# Q2(e) YOUR ANSWER:


## Q2(f): Student table problems [8 marks]

**Question:** The Student table is defined as:

```sql
CREATE TABLE Student(
  ID VARCHAR(25) PRIMARY KEY,
  GivenName VARCHAR(80) NOT NULL,
  FamilyName VARCHAR(80) NOT NULL,
  Gender ENUM('M','F') NOT NULL,
  BirthDate DATE NOT NULL,
  School VARCHAR(130),
  City VARCHAR(130));
```

What problems can you see, and how would you resolve them?

In [None]:
# Q2(f) YOUR ANSWER:
# List problems and solutions:


## Q2(g): MongoDB analysis [6 marks]

**Question:** How well would this data work in an object database like MongoDB? What would be the advantages or disadvantages?

In [None]:
# Q2(g) YOUR ANSWER:
# Advantages:
#
# Disadvantages:
#

---

# Question 3: XML, XPath, and Relational Models [30 marks]

## Context

An entry in the Oxford Medieval Manuscript catalogue:

In [None]:
%%writefile manuscript.xml
<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:id="manuscript_3945" xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader xmlns:tei="http://www.tei-c.org/ns/1.0">
    <fileDesc>
      <titleStmt>
        <title>Christ Church MS. 341</title>
        <title type="collection">Christ Church MSS.</title>
        <respStmt>
          <resp>Cataloguer</resp>
          <persName>Ralph Hanna</persName>
          <persName>David Rundle</persName>
        </respStmt>
      </titleStmt>
    </fileDesc>
  </teiHeader>
</TEI>

## Q3(a): Markup language and root node [2 marks]

**Question:** What markup language is being used? And what is the root node?

In [None]:
# Q3(a) YOUR ANSWER:
# Markup language:
# Root node:


## Q3(b): Well-formedness [3 marks]

**Question:** Is this fragment well-formed? Justify your answer.

In [None]:
# Q3(b) YOUR ANSWER:


In [None]:
# Test well-formedness with xmllint
!xmllint --noout manuscript.xml && echo "XML is well-formed!"
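
Well-formedness can also be checked from Python, since a conforming XML parser must reject input that is not well-formed. The sketch below uses the standard library's `xml.etree.ElementTree`. The helper name `is_well_formed` and the two sample strings are illustrative and are not taken from the exam fragment.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        print("Not well-formed:", err)
        return False
    return True


# Hypothetical samples: properly nested tags, then overlapping tags
print(is_well_formed("<note><to>Ada</to></note>"))
print(is_well_formed("<note><to>Ada</note></to>"))
```

To check the exam fragment the same way, call `ET.parse('manuscript.xml')` and see whether it raises `ET.ParseError`.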

## Q3(c): XPath expression [2 marks]

**Question:** What would be selected by `//fileDesc//title/@type`?

In [None]:
# Q3(c) YOUR ANSWER:


In [None]:
# Test with lxml (the document declares a default namespace, so the XPath needs a bound prefix)
from lxml import etree

doc = etree.parse('manuscript.xml')
namespaces = {'tei': 'http://www.tei-c.org/ns/1.0'}
result = doc.xpath('//tei:fileDesc//tei:title/@type', namespaces=namespaces)
print("Result:", result)
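
The reason a prefix is needed is that an un-prefixed name in XPath refers to an element in no namespace, so it never matches elements that inherit a default namespace. The sketch below illustrates this with the standard library's `xml.etree.ElementTree` on a small hypothetical document. The namespace URI `http://example.org/ns` and the prefix `ex` are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document (not the exam fragment) with a default namespace
xml_text = """<catalogue xmlns="http://example.org/ns">
  <entry><title type="main">First</title></entry>
  <entry><title>Second</title></entry>
</catalogue>"""

root = ET.fromstring(xml_text)
ns = {"ex": "http://example.org/ns"}  # any prefix works if the URI matches

# Un-prefixed name: looks for <title> in no namespace
plain = root.findall(".//title")
# Prefixed name bound to the document's namespace URI
qualified = root.findall(".//ex:title", ns)
# Only the titles that carry a type attribute
typed = [t.get("type") for t in root.findall(".//ex:title[@type]", ns)]

print("Un-prefixed matches:", len(plain))
print("Prefixed matches:", len(qualified))
print("type attribute values:", typed)
```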

## Q3(d): XPath with text() [2 marks]

**Question:** What would be selected by `//resp[text()='Cataloguer']/../persName`?

In [None]:
# Q3(d) YOUR ANSWER:


In [None]:
# Test with lxml (reuses doc and namespaces from the Q3(c) test cell; /text() is appended to print the names)
result = doc.xpath("//tei:resp[text()='Cataloguer']/../tei:persName/text()", namespaces=namespaces)
print("Result:", result)

## Q3(e): Why use complex XPath? [4 marks]

**Question:** Why might you choose the expression in (d) rather than the simpler `persName`? Give two situations.

In [None]:
# Q3(e) YOUR ANSWER:
# Situation 1:
#
# Situation 2:
#

## Q3(f): Relational model for manuscript contents [8 marks]

**Question:** In the full catalogue entry (not reproduced in the fragment above), the `<msItem n="2">` element uses `n="2"` to indicate item order. How well would this work in a relational model? How would you approach the problem?

In [None]:
# Q3(f) YOUR ANSWER:


In [None]:
%%sql
-- Q3(f) Create your relational schema:


## Q3(g): What is msdesc.rng? [3 marks]

**Question:** What is the file `msdesc.rng`, and why is it referenced in the full catalogue entry (the reference is not reproduced in the fragment above)?

In [None]:
# Q3(g) YOUR ANSWER:


## Q3(h): Valid vs Well-formed [2 marks]

**Question:** What is the difference between valid and well-formed XML?

In [None]:
# Q3(h) YOUR ANSWER:


## Q3(i) & Q3(j): Omitting elements [2 marks]

**Questions:**
- (i) If `respStmt` was omitted, would the XML be legal?
- (j) If `title` elements were omitted, would the XML be legal?

In [None]:
# Q3(i) YOUR ANSWER:

# Q3(j) YOUR ANSWER:


## Q3(k): XML to HTML conversion [2 marks]

**Question:** What two technologies would be used for automatic XML to HTML conversion?

In [None]:
# Q3(k) YOUR ANSWER:
# 1.
# 2.

---

# Question 4: RDF, Ontologies, and Linked Data [30 marks]

## Context - Web Annotation Data

In [None]:
%%writefile annotation.ttl
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix oa: <http://www.w3.org/ns/oa#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix myrdf: <http://example.org/> .
@prefix armadale: <https://literary-greats.com/WCollins/Armadale/> .

myrdf:anno-001 a oa:Annotation ;
    dcterms:created "2015-10-13T13:00:00+00:00"^^xsd:dateTime ;
    dcterms:creator myrdf:DL192 ;
    oa:hasBody [
        a oa:TextualBody ;
        rdf:value "Note the use of visual language here."
    ] ;
    oa:hasTarget [
        a oa:SpecificResource ;
        oa:hasSelector [
            a oa:TextPositionSelector ;
            oa:start 235 ;
            oa:end 300
        ] ;
        oa:hasSource armadale:Chapter3
    ] ;
    oa:motivatedBy oa:commenting .

myrdf:DL192 a foaf:Person ;
    foaf:name "David Lewis" .

In [None]:
# Validate the Turtle
!rapper -i turtle -c annotation.ttl

## Q4(a): Model and serialisation [2 marks]

**Questions:**
- (i) What is the model?
- (ii) What is the serialisation format?

In [None]:
# Q4(a)(i) YOUR ANSWER - Model:

# Q4(a)(ii) YOUR ANSWER - Serialisation format:


## Q4(b): Name two ontologies [3 marks]

In [None]:
# Q4(b) YOUR ANSWER:
# 1.
# 2.

## Q4(c): Properties from each ontology [5 marks]

In [None]:
# Q4(c) YOUR ANSWER:
# Ontology 1 properties:
#
# Ontology 2 properties:
#

## Q4(d): Fix the SPARQL query [7 marks]

**Question:** A scholar wants annotations about `armadale:Chapter3`. The following SPARQL doesn't work:

```sparql
SELECT ?body ?creator
WHERE {
  ?annotation a oa:Annotation .
  ?creator ;
  oa:hasBody body .
  hasSource armadale:Chapter3 }
```

Write a correct version.

In [None]:
# Q4(d) YOUR CORRECTED SPARQL:


In [None]:
# Test your query with rdflib
import rdflib

g = rdflib.Graph()
g.parse('annotation.ttl', format='turtle')

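# Paste your corrected SPARQL query between the triple quotes:
query = """

"""

# An empty string is not valid SPARQL, so only run the query once it is filled in
if query.strip():
    for row in g.query(query):
        print(row)
else:
    print("Add your SPARQL query above, then re-run this cell.")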

## Q4(e): E/R diagram [5 marks]

**Question:** Draw an E/R diagram for web annotations, for a backend database.

In [None]:
# Q4(e) YOUR ANSWER - Describe or draw the E/R diagram:


## Q4(f): Tables and keys [5 marks]

**Question:** Identify the tables needed for a relational implementation and list the keys.

In [None]:
# Q4(f) YOUR ANSWER:
# Table 1:
#   Primary Key:
#   Foreign Keys:
#
# Table 2:
#   ...

In [None]:
%%sql
-- Q4(f) Create your tables:


## Q4(g): MySQL equivalent query [3 marks]

**Question:** Give a MySQL query equivalent for the scholar's corrected SPARQL query from Q4(d).

In [None]:
%%sql
-- Q4(g) YOUR MySQL QUERY:


---

# Done!

Check your answers against the **solution sheet**.