<a href="https://colab.research.google.com/github/sreent/data-management-intro/blob/main/past-exam-papers/september-2023/notebook-september-2023-solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CM3010 September 2023 - Solutions Notebook

This notebook contains **complete solutions** for the September 2023 exam.

**Exam Structure:**
- Section A: 10 MCQs - 40 marks
- Section B: Answer 2 of 3 questions - 60 marks
  - Q1: Linked Data (RDF + SPARQL)
  - Q2: ER Question - Estate Agency
  - Q3: IR/Document DB - Hathi Trust
- Both parts completed together on Inspera (4 hours total)

**Instructions:**
1. Run the Setup cells first
2. All solution cells are pre-filled with correct answers
3. Compare with your own attempts from the practice notebook

---

# 1. Environment Setup

Run these cells first to set up MySQL, MongoDB, and RDF tools.

In [None]:
# === MySQL Setup ===
!apt -qq update > /dev/null
!apt -y -qq install mysql-server > /dev/null
!service mysql start

!mysql -e "CREATE USER IF NOT EXISTS 'examuser'@'localhost' IDENTIFIED BY 'exampass';"
!mysql -e "CREATE DATABASE IF NOT EXISTS exam_db;"
!mysql -e "GRANT ALL PRIVILEGES ON *.* TO 'examuser'@'localhost';"

!pip install -q sqlalchemy==2.0.20 ipython-sql==0.5.0 pymysql==1.1.0 prettytable==2.0.0 rdflib

%reload_ext sql
%sql mysql+pymysql://examuser:exampass@localhost/exam_db

print("MySQL ready!")

In [None]:
# === MongoDB Setup ===
!wget -q http://archive.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.1f-1ubuntu2_amd64.deb
!dpkg -i libssl1.1_1.1.1f-1ubuntu2_amd64.deb > /dev/null 2>&1
!wget -qO - https://www.mongodb.org/static/pgp/server-4.4.asc | apt-key add - > /dev/null 2>&1
!echo "deb [ arch=amd64,arm64 ] http://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.4 multiverse" | tee /etc/apt/sources.list.d/mongodb-org-4.4.list > /dev/null
!apt-get update -qq > /dev/null
!apt-get install -y -qq mongodb-org > /dev/null
!mkdir -p /data/db
!mongod --fork --logpath /var/log/mongodb.log --dbpath /data/db

!pip install -q pymongo

!mongo --quiet --eval 'print("MongoDB ready!")'

In [None]:
# === RDF/SPARQL Setup ===
import rdflib
from rdflib.plugins.sparql import prepareQuery

print("RDFLib ready for SPARQL queries!")

---

# Question 1: Linked Data (RDF + SPARQL) [30 marks]

## Setup: Load Sample RDF Data

In [None]:
# Load sample RDF data for testing
turtle_data = '''
@prefix bn: <http://babelnet.org/rdf/> .
@prefix lemon: <http://www.lemon-model.net/lemon#> .
@prefix lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#> .

bn:post_n_EN a lemon:LexicalEntry ;
    lemon:canonicalForm bn:post_n_EN_form ;
    lemon:language "EN" ;
    lexinfo:partOfSpeech lexinfo:noun .

bn:post_n_EN_form lemon:writtenRep "post" .

bn:run_v_EN a lemon:LexicalEntry ;
    lemon:canonicalForm bn:run_v_EN_form ;
    lemon:language "EN" ;
    lexinfo:partOfSpeech lexinfo:verb .

bn:run_v_EN_form lemon:writtenRep "run" .

bn:house_n_EN a lemon:LexicalEntry ;
    lemon:canonicalForm bn:house_n_EN_form ;
    lemon:language "EN" ;
    lexinfo:partOfSpeech lexinfo:noun .

bn:house_n_EN_form lemon:writtenRep "house" .
'''

g = rdflib.Graph()
g.parse(data=turtle_data, format="turtle")
print(f"Loaded {len(g)} triples into the graph.")

## Q1(a)(i): Generic Data Model [1 mark]

### Solution

In [None]:
# Q1(a)(i) SOLUTION
print("Answer: RDF (Resource Description Framework)")
print("")
print("RDF represents data as subject-predicate-object triples.")
print("It's the foundation for Linked Data and the Semantic Web.")

## Q1(a)(ii): Serialization Format [1 mark]

### Solution

In [None]:
# Q1(a)(ii) SOLUTION
print("Answer: Turtle (Terse RDF Triple Language)")
print("")
print("Evidence:")
print("- @prefix declarations")
print("- Semicolon (;) to continue same subject")
print("- Period (.) to end statements")

## Q1(b): Interpretation Debate [4 marks]

### Solution

In [None]:
# Q1(b) SOLUTION
print("""BOTH FRIENDS HAVE A POINT:

Friend 1 (Skeptic) - Partially Correct:
- The actual word "post" isn't in these triples
- Only a URI reference to canonicalForm is shown
- Strictly speaking, need to dereference the link

Friend 2 (Pragmatist) - Practically Correct:
- URI "post_n_EN" strongly suggests "post"
- Language = "EN" is explicit
- partOfSpeech = noun is explicit

What's Missing:
- The writtenRep property with value "post"
- Located in the linked canonicalForm document

Further Information Needed:
- Fetch <http://babelnet.org/rdf/post_n_EN/canonicalForm>
- This would show: lemon:writtenRep "post"
""")

## Q1(c)(i): SPARQL Query for All Nouns [6 marks]

### Solution

In [None]:
# Q1(c)(i) SOLUTION
query_nouns = prepareQuery('''
PREFIX lemon:   <http://www.lemon-model.net/lemon#>
PREFIX lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#>

SELECT ?writtenRep ?lang
WHERE {
  ?lexEntry a lemon:LexicalEntry ;
            lemon:canonicalForm ?form ;
            lemon:language ?lang ;
            lexinfo:partOfSpeech lexinfo:noun .

  ?form lemon:writtenRep ?writtenRep .
}
''')

print("All Nouns in the Graph:")
for row in g.query(query_nouns):
    print(f"  Word: {row.writtenRep} | Language: {row.lang}")
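
To see what the graph pattern in the query is doing, the sketch below evaluates the same match in plain Python over tuples that mirror part of the sample graph. It only illustrates how the shared `?form` variable joins the two patterns; it is not how rdflib evaluates SPARQL internally. The `triples` list and the `match` and `objects` helpers are hypothetical names introduced here, and prefixed names are kept as plain strings.

```python
# Triples mirroring part of the sample graph, written as (subject, predicate, object).
triples = [
    ("bn:post_n_EN", "rdf:type", "lemon:LexicalEntry"),
    ("bn:post_n_EN", "lemon:canonicalForm", "bn:post_n_EN_form"),
    ("bn:post_n_EN", "lemon:language", "EN"),
    ("bn:post_n_EN", "lexinfo:partOfSpeech", "lexinfo:noun"),
    ("bn:post_n_EN_form", "lemon:writtenRep", "post"),
    ("bn:run_v_EN", "rdf:type", "lemon:LexicalEntry"),
    ("bn:run_v_EN", "lemon:canonicalForm", "bn:run_v_EN_form"),
    ("bn:run_v_EN", "lemon:language", "EN"),
    ("bn:run_v_EN", "lexinfo:partOfSpeech", "lexinfo:verb"),
    ("bn:run_v_EN_form", "lemon:writtenRep", "run"),
]


def match(s=None, p=None, o=None):
    """Return the triples that fit the pattern; None plays the role of a variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def objects(s, p):
    """Return every object found for a fixed subject and predicate."""
    return [o for _, _, o in match(s=s, p=p)]


nouns = []
# First pattern block: find entries whose part of speech is noun.
for entry, _, _ in match(p="lexinfo:partOfSpeech", o="lexinfo:noun"):
    if not match(s=entry, p="rdf:type", o="lemon:LexicalEntry"):
        continue
    for lang in objects(entry, "lemon:language"):
        for form in objects(entry, "lemon:canonicalForm"):    # binds ?form
            for written in objects(form, "lemon:writtenRep"):  # reuses ?form
                nouns.append((written, lang))

print(nouns)
```

Only `post` comes back because the verb entry fails the part-of-speech pattern, which is the same reason `run` is absent from the SPARQL result.
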

## Q1(c)(ii): SPARQL Query for "post" [4 marks]

### Solution

In [None]:
# Q1(c)(ii) SOLUTION
query_post = prepareQuery('''
PREFIX lemon:   <http://www.lemon-model.net/lemon#>
PREFIX lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#>

SELECT ?language ?pos
WHERE {
  ?lexEntry a lemon:LexicalEntry ;
            lemon:canonicalForm ?form ;
            lemon:language ?language ;
            lexinfo:partOfSpeech ?pos .

  ?form lemon:writtenRep "post" .
}
''')

print("Details for words written 'post':")
for row in g.query(query_post):
    print(f"  Language: {row.language} | Part of Speech: {row.pos}")

## Q1(d): Ontology Questions [7 marks]

### Solution

In [None]:
# Q1(d) SOLUTION
print("""(i) Role of the document:
    ONTOLOGY/SCHEMA DEFINITION
    - Defines classes (LexicalSense, SenseDefinition)
    - Defines properties (definition, value)
    - Structures the Lemon lexical model

(ii) Format:
    TURTLE (or RDF/XML depending on content negotiation)

(iii) OWL prefix refers to:
    WEB ONTOLOGY LANGUAGE (OWL)
    - Namespace: http://www.w3.org/2002/07/owl#
    - Provides expressive ontology constructs
    - e.g., owl:Class, owl:ObjectProperty, owl:disjointWith

(iv) Definition triples for English noun "post":
""")

definition_turtle = '''
@prefix lemon: <http://www.lemon-model.net/lemon#> .
@prefix bn: <http://babelnet.org/rdf/> .
@prefix ex: <http://example.org/> .

bn:post_n_EN_sense a lemon:LexicalSense ;
    lemon:definition ex:post_n_EN_def .

ex:post_n_EN_def a lemon:SenseDefinition ;
    lemon:value "A piece of wood or metal set upright to support something."@en .
'''
print(definition_turtle)

## Q1(e): ER Diagram for Relational Implementation [7 marks]

### Solution

In [None]:
# Q1(e) SOLUTION
print("""ER DIAGRAM FOR LEMON MODEL:

Entities and Relationships:

LexicalEntry (1) ----< (M) Form
     |
     | (1)
     v
    (M)
LexicalSense (1) ----< (M) SenseDefinition

Tables:

| Table            | Columns                                | Keys                   |
|------------------|----------------------------------------|------------------------|
| LexicalEntry     | LexicalEntryId, Language, PartOfSpeech | PK: LexicalEntryId     |
| Form             | FormId, LexicalEntryId, WrittenRep     | PK: FormId, FK->Entry  |
| LexicalSense     | SenseId, LexicalEntryId                | PK: SenseId, FK->Entry |
| SenseDefinition  | DefId, SenseId, TextValue              | PK: DefId, FK->Sense   |

Cardinalities:
- One LexicalEntry can have MANY Forms (1:M)
- One LexicalEntry can have MANY LexicalSenses (1:M)
- One LexicalSense can have MANY SenseDefinitions (1:M)
""")

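As a quick check that the relational design can answer the same question as the SPARQL query for nouns, here is a minimal sketch using Python's built-in `sqlite3` module instead of MySQL. Only the `LexicalEntry` and `Form` tables are created, the column types are assumptions, and the rows are copied from the sample RDF data. The join on `LexicalEntryId` plays the role that the `lemon:canonicalForm` link plays in the graph.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to

conn.executescript("""
CREATE TABLE LexicalEntry (
    LexicalEntryId TEXT PRIMARY KEY,
    Language       TEXT NOT NULL,
    PartOfSpeech   TEXT NOT NULL
);
CREATE TABLE Form (
    FormId         TEXT PRIMARY KEY,
    LexicalEntryId TEXT NOT NULL REFERENCES LexicalEntry(LexicalEntryId),
    WrittenRep     TEXT NOT NULL
);
INSERT INTO LexicalEntry VALUES ('post_n_EN',  'EN', 'noun');
INSERT INTO LexicalEntry VALUES ('run_v_EN',   'EN', 'verb');
INSERT INTO LexicalEntry VALUES ('house_n_EN', 'EN', 'noun');
INSERT INTO Form VALUES ('post_n_EN_form',  'post_n_EN',  'post');
INSERT INTO Form VALUES ('run_v_EN_form',   'run_v_EN',   'run');
INSERT INTO Form VALUES ('house_n_EN_form', 'house_n_EN', 'house');
""")

# Relational equivalent of the "all nouns" query: join entry to form, filter on POS.
rows = conn.execute("""
    SELECT f.WrittenRep, e.Language
    FROM LexicalEntry e
    JOIN Form f ON f.LexicalEntryId = e.LexicalEntryId
    WHERE e.PartOfSpeech = 'noun'
    ORDER BY f.WrittenRep
""").fetchall()

print(rows)
```
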
---

# Question 2: ER Question - Estate Agency [30 marks]

## Q2(a): Add Cardinality [3 marks]

### Solution

In [None]:
# Q2(a) SOLUTION
print("""CARDINALITY INDICATIONS:

| Relationship           | Cardinality | Explanation                          |
|------------------------|-------------|--------------------------------------|
| Seller - Property      | 1:M         | One seller owns many properties      |
| Estate Agent - Property| 1:M         | One agent handles many properties    |
| Property - Offers      | 1:M         | One property can have many offers    |
| Property - Views       | 1:M         | One property can have many viewings  |
| Buyer - Offers         | 1:M         | One buyer can make many offers       |
| Buyer - Views          | 1:M         | One buyer can have many viewings     |
""")

## Q2(b): Adapt to Relational Model [5 marks]

### Solution

In [None]:
# Q2(b) SOLUTION
print("""ADAPTATIONS TO RELATIONAL MODEL:

1. Convert diamond relationships to tables:
   - Offers becomes a table with FKs to Property and Buyer
   - Views becomes a table with FKs to Property and Buyer

2. Add surrogate keys (recommended):
   - PropertyId, SellerId, AgentId, BuyerId, OfferId, ViewId
   - Or use natural keys (Address, Name) as shown in diagram

3. Add foreign keys:
   - Property references Seller and EstateAgent
   - Offers references Property and Buyer
   - Views references Property and Buyer

4. Add timestamp columns:
   - OfferDate, ViewDate for tracking events
""")

## Q2(c): List Tables, Primary and Foreign Keys [6 marks]

### Solution

In [None]:
# Q2(c) SOLUTION
print("""TABLES, PRIMARY KEYS, AND FOREIGN KEYS:

| Table       | Columns                                    | PK                            | FKs                      |
|-------------|--------------------------------------------|-------------------------------|--------------------------|
| Seller      | Name, Address, PhoneNumber                 | Name                          | -                        |
| EstateAgent | Name                                       | Name                          | -                        |
| Buyer       | Name, Address, PhoneNumber                 | Name                          | -                        |
| Property    | Address, Type, Bedrooms, AskingPrice,      | Address                       | SellerName -> Seller,    |
|             | SellerName, AgentName                      |                               | AgentName -> EstateAgent |
| Offers      | PropertyAddress, BuyerName, OfferDate,     | (PropertyAddress, BuyerName,  | PropertyAddress->Property|
|             | OfferStatus, OfferValue                    |  OfferDate)                   | BuyerName -> Buyer       |
| Views       | PropertyAddress, BuyerName, ViewDate       | (PropertyAddress, BuyerName,  | PropertyAddress->Property|
|             |                                            |  ViewDate)                    | BuyerName -> Buyer       |
""")

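One consequence of the composite key on `Offers` is worth seeing in action: the same buyer can make a second offer on the same property on a different date, but not on the same date. The sketch below demonstrates this with Python's built-in `sqlite3` module rather than MySQL. The foreign keys are left out for brevity and the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Offers (
        PropertyAddress TEXT,
        BuyerName       TEXT,
        OfferDate       TEXT,
        OfferValue      REAL,
        PRIMARY KEY (PropertyAddress, BuyerName, OfferDate)
    )
""")

insert = "INSERT INTO Offers VALUES (?, ?, ?, ?)"
conn.execute(insert, ("10 Main Street", "Charlie Buyer", "2023-01-05", 240000))

# A later offer by the same buyer on the same property is simply a new row.
conn.execute(insert, ("10 Main Street", "Charlie Buyer", "2023-01-12", 245000))

# A second offer on the same day repeats all three key columns and is rejected.
try:
    conn.execute(insert, ("10 Main Street", "Charlie Buyer", "2023-01-05", 250000))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM Offers").fetchone()[0]
print(duplicate_rejected, row_count)
```

If same-day revised offers need to be recorded, a surrogate `OfferId` or a timestamp with finer resolution than a date would lift the restriction.
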
## Q2(d): MySQL CREATE Command [3 marks]

### Solution

In [None]:
%%sql
-- Q2(d) SOLUTION - Create the tables
DROP TABLE IF EXISTS Views;
DROP TABLE IF EXISTS Offers;
DROP TABLE IF EXISTS Property;
DROP TABLE IF EXISTS Seller;
DROP TABLE IF EXISTS EstateAgent;
DROP TABLE IF EXISTS Buyer;

CREATE TABLE Seller (
    Name VARCHAR(100) PRIMARY KEY,
    Address VARCHAR(200),
    PhoneNumber VARCHAR(50)
);

CREATE TABLE EstateAgent (
    Name VARCHAR(100) PRIMARY KEY
);

CREATE TABLE Buyer (
    Name VARCHAR(100) PRIMARY KEY,
    Address VARCHAR(200),
    PhoneNumber VARCHAR(50)
);

CREATE TABLE Property (
    Address VARCHAR(200) PRIMARY KEY,
    Type VARCHAR(50),
    Bedrooms INT,
    AskingPrice DECIMAL(12, 2),
    SellerName VARCHAR(100) NOT NULL,
    AgentName VARCHAR(100) NOT NULL,
    FOREIGN KEY (SellerName) REFERENCES Seller(Name),
    FOREIGN KEY (AgentName) REFERENCES EstateAgent(Name)
);

CREATE TABLE Offers (
    PropertyAddress VARCHAR(200),
    BuyerName VARCHAR(100),
    OfferDate DATE,
    OfferStatus VARCHAR(50),
    OfferValue DECIMAL(12, 2),
    PRIMARY KEY (PropertyAddress, BuyerName, OfferDate),
    FOREIGN KEY (PropertyAddress) REFERENCES Property(Address),
    FOREIGN KEY (BuyerName) REFERENCES Buyer(Name)
);

CREATE TABLE Views (
    PropertyAddress VARCHAR(200),
    BuyerName VARCHAR(100),
    ViewDate DATE,
    PRIMARY KEY (PropertyAddress, BuyerName, ViewDate),
    FOREIGN KEY (PropertyAddress) REFERENCES Property(Address),
    FOREIGN KEY (BuyerName) REFERENCES Buyer(Name)
);

SELECT 'Tables created!' AS Status;

In [None]:
%%sql
-- Insert sample data
INSERT INTO Seller VALUES ('Alice Seller', '1 Seller St', '555-111');
INSERT INTO Seller VALUES ('Bob Seller', '2 Seller Rd', '555-222');
INSERT INTO EstateAgent VALUES ('AgentGrace');
INSERT INTO EstateAgent VALUES ('AgentHeidi');
INSERT INTO Buyer VALUES ('Charlie Buyer', '99 Buyer Rd', '555-333');
INSERT INTO Buyer VALUES ('Doris Buyer', '100 Buyer Ln', '555-444');
INSERT INTO Property VALUES ('10 Main Street', 'Flat', 2, 250000, 'Alice Seller', 'AgentGrace');
INSERT INTO Property VALUES ('20 Baker Avenue', 'Terraced House', 3, 350000, 'Bob Seller', 'AgentHeidi');
INSERT INTO Offers VALUES ('10 Main Street', 'Charlie Buyer', '2023-01-05', 'sale completed', 240000);
INSERT INTO Offers VALUES ('10 Main Street', 'Doris Buyer', '2023-01-10', 'rejected', 230000);
INSERT INTO Offers VALUES ('20 Baker Avenue', 'Doris Buyer', '2023-02-01', 'sale completed', 340000);

SELECT 'Sample data inserted!' AS Status;

## Q2(e)(i): Commission Query [6 marks]

### Solution

In [None]:
%%sql
-- Q2(e)(i) SOLUTION: Commission per agent since Jan 2023
SELECT
    p.AgentName AS EstateAgent,
    SUM(o.OfferValue * 0.01) AS TotalCommission
FROM Property p
INNER JOIN Offers o ON p.Address = o.PropertyAddress
WHERE o.OfferStatus = 'sale completed'
  AND o.OfferDate >= '2023-01-01'
GROUP BY p.AgentName;

## Q2(e)(ii): Top Earning Agent [2 marks]

### Solution

In [None]:
%%sql
-- Q2(e)(ii) SOLUTION: Top earning agent
SELECT
    p.AgentName AS EstateAgent,
    SUM(o.OfferValue * 0.01) AS TotalCommission
FROM Property p
INNER JOIN Offers o ON p.Address = o.PropertyAddress
WHERE o.OfferStatus = 'sale completed'
  AND o.OfferDate >= '2023-01-01'
GROUP BY p.AgentName
ORDER BY TotalCommission DESC
LIMIT 1;

## Q2(f): Document Database Consideration [5 marks]

### Solution

In [None]:
# Q2(f) SOLUTION
print("""DOCUMENT DATABASE CONSIDERATIONS (Specific to Estate Agency):

REASONS FOR Document Database:
1. Flexible property details - different property types have different attributes
   (flats have floor number, houses have garden size, etc.)
2. Embedded media - store property photos and descriptions as embedded documents
3. Variable offer history - embed all offers within property document

REASONS AGAINST Document Database:
1. Commission queries harder - aggregating across agents requires complex pipelines
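2. Transactional integrity - offer status changes (made -> accepted -> completed)
   touch several entities; relational DBs give ACID guarantees across tables by
   default, whereas MongoDB needs explicit multi-document transactions
3. Cross-entity queries - finding all properties viewed by a buyer requires
   $lookup stages or application-side joins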
4. Data duplication - agent info repeated in each property document
5. Consistency issues - updating agent name requires updating all property documents

CONCLUSION: Relational is better for this use case due to:
- Structured relationships between entities
- Transactional requirements for offer processing
- Need for aggregate queries (commission calculations)
""")

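To make the trade-off concrete, the sketch below models two property documents as plain Python dictionaries, with the agent and the offers embedded, and computes the same commission totals as Q2(e)(i). The document shape and field names are hypothetical; the values are taken from the sample data, and the 1% rate follows the SQL solution. The inner loop over embedded offers corresponds to what a MongoDB aggregation would typically do with `$unwind` followed by `$group`. Note also that the agent name is repeated inside every property document.

```python
from collections import defaultdict

# Hypothetical property documents with the agent and offers embedded.
properties = [
    {
        "address": "10 Main Street",
        "agent": {"name": "AgentGrace"},
        "offers": [
            {"buyer": "Charlie Buyer", "date": "2023-01-05",
             "status": "sale completed", "value": 240000},
            {"buyer": "Doris Buyer", "date": "2023-01-10",
             "status": "rejected", "value": 230000},
        ],
    },
    {
        "address": "20 Baker Avenue",
        "agent": {"name": "AgentHeidi"},
        "offers": [
            {"buyer": "Doris Buyer", "date": "2023-02-01",
             "status": "sale completed", "value": 340000},
        ],
    },
]

commission = defaultdict(int)
for prop in properties:
    for offer in prop["offers"]:  # flatten the embedded array (the "unwind" step)
        completed = offer["status"] == "sale completed"
        recent = offer["date"] >= "2023-01-01"  # ISO dates compare correctly as strings
        if completed and recent:
            # 1% commission, kept in whole currency units (the "group" step)
            commission[prop["agent"]["name"]] += offer["value"] // 100

print(dict(commission))
```
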
---

# Question 3: IR/Document DB - Hathi Trust [30 marks]

## Q3(a): Precision Calculation [2 marks]

### Solution

In [None]:
# Q3(a) SOLUTION
listed_as_german = 2_200_000
precision = 0.80

true_positives = listed_as_german * precision

print(f"Listed as German: {listed_as_german:,}")
print(f"Precision: {precision}")
print(f"")
print(f"True Positives = Listed × Precision")
print(f"              = {listed_as_german:,} × {precision}")
print(f"              = {true_positives:,.0f} books are actually German")

## Q3(b): Total German Books [3 marks]

### Solution

In [None]:
# Q3(b) SOLUTION
recall = 0.88

# Recall = True Positives / All Actual German
# Therefore: All Actual German = True Positives / Recall
total_german = true_positives / recall

print(f"True Positives: {true_positives:,.0f}")
print(f"Recall: {recall}")
print(f"")
print(f"All German Books = True Positives / Recall")
print(f"                 = {true_positives:,.0f} / {recall}")
print(f"                 = {total_german:,.0f} books in the collection are German")
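
The two answers above also determine the false positives and false negatives, which helps when checking the arithmetic. The following sketch recomputes all four quantities with exact fractions so that no floating-point rounding creeps in. The variable names are chosen here and are independent of the solution cells.

```python
from fractions import Fraction

listed = 2_200_000              # books labeled German
precision = Fraction(80, 100)   # share of labeled books that really are German
recall = Fraction(88, 100)      # share of German books that were labeled German

true_pos = listed * precision    # correctly labeled German books
false_pos = listed - true_pos    # labeled German but not German
actual = true_pos / recall       # all German books in the collection
false_neg = actual - true_pos    # German books that were missed

print(f"True positives:  {int(true_pos):,}")
print(f"False positives: {int(false_pos):,}")
print(f"Actual German:   {int(actual):,}")
print(f"False negatives: {int(false_neg):,}")
```

So about 440,000 listed books are not German, and about 240,000 German books are missing from the listing.
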

## Q3(c): Why Danish Precision is Better for ML [5 marks]

### Solution

In [None]:
# Q3(c) SOLUTION
print("""WHY 100% PRECISION IS BETTER FOR ML TRAINING:

1. DATA PURITY:
   - 100% precision = every labeled Danish book IS Danish
   - No noise or mislabeled examples in the training set

2. TRAINING QUALITY:
   - ML models learn wrong patterns from mislabeled examples
   - Even 20% noise (German's 80% precision) can significantly hurt model performance

3. SMALLER BUT CLEAN > LARGER BUT NOISY:
   - Better to have only 76% of the Danish books, all correctly labeled
   - Than 88% of the German books, mixed into a set where 20% of the labels are wrong

4. FALSE POSITIVES HURT MORE:
   - A non-Danish book labeled Danish teaches wrong language patterns
   - Missing some Danish books (lower recall) just means less data
   - Less data is better than wrong data for training

TRADE-OFF:
   - Danish: Miss 24% of books, but 100% of what you have is correct
   - German: Get more books, but 20% are mislabeled noise
""")

## Q3(d): F1 Measure Definition [2 marks]

### Solution

In [None]:
# Q3(d) SOLUTION
print("""F1 MEASURE:

F1 = Harmonic mean of Precision and Recall

Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)

Properties:
- Ranges from 0 to 1 (higher is better)
- Balances precision and recall into single metric
- Penalizes extreme imbalance between P and R
- F1 = 1 only when both P and R are perfect
""")

# Calculate F1 for German and Danish
p_german, r_german = 0.80, 0.88
p_danish, r_danish = 1.00, 0.76

f1_german = 2 * (p_german * r_german) / (p_german + r_german)
f1_danish = 2 * (p_danish * r_danish) / (p_danish + r_danish)

print(f"German F1 = 2 × (0.80 × 0.88) / (0.80 + 0.88) = {f1_german:.3f}")
print(f"Danish F1 = 2 × (1.00 × 0.76) / (1.00 + 0.76) = {f1_danish:.3f}")
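
The claim that F1 penalizes imbalance is easiest to see next to the ordinary arithmetic mean. The sketch below compares the two for the German and Danish figures and for one deliberately lopsided pair (P = 1.00, R = 0.10), which is a made-up illustration and not part of the exam data.

```python
def f1(p, r):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


def arithmetic_mean(p, r):
    """Plain average, shown only for comparison."""
    return (p + r) / 2


pairs = [
    ("German", 0.80, 0.88),
    ("Danish", 1.00, 0.76),
    ("Lopsided (hypothetical)", 1.00, 0.10),
]

for label, p, r in pairs:
    print(f"{label:<24} mean={arithmetic_mean(p, r):.3f}  F1={f1(p, r):.3f}")
```

For the lopsided pair the arithmetic mean is 0.550 while F1 drops to about 0.182, so a classifier cannot score well on F1 by maximizing only one of the two measures.
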

## Q3(e): MongoDB Find Command [1 mark]

### Solution

In [None]:
# Q3(e) SOLUTION
print("""db.books.find({ lang: "German" })

This command:
- Queries the 'books' collection
- Returns ALL documents where the 'lang' field equals "German"
- Returns complete documents (all fields)
""")

## Q3(f): 19th Century Query [5 marks]

### Solution

In [None]:
# Setup MongoDB with sample data
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['hathi_trust']
collection = db['books']

collection.delete_many({})
collection.insert_many([
    {"title": "Book1", "lang": "German", "year": 1850, "text": "Ein Wort Strudel..."},
    {"title": "Book2", "lang": "German", "year": 1905, "text": "Keine Erwähnung"},
    {"title": "Book3", "lang": "English", "year": 1845, "text": "Something about strudel."},
    {"title": "Book4", "lang": "German", "year": 1830, "text": "No mention of desserts"},
    {"title": "Book5", "lang": "German", "year": 1880, "text": "STRUDEL mania!"}
])
print(f"Inserted {collection.count_documents({})} books")

In [None]:
# Q3(f) SOLUTION: 19th century German books
print("Query: db.books.find({ lang: 'German', year: { $gte: 1800, $lt: 1900 } })")
print("")

query = {
    "lang": "German",
    "year": {"$gte": 1800, "$lt": 1900}
}

print("Results:")
for doc in collection.find(query):
    print(f"  {doc['title']}: {doc['year']}")

## Q3(g): Add Text Search for "Strudel" [2 marks]

### Solution

In [None]:
# Q3(g) SOLUTION: Add text search for "Strudel"

print("""Query: db.books.find({
    lang: "German",
    year: { $gte: 1800, $lt: 1900 },
    text: { $regex: /Strudel/i }
})""")
print("")

query = {
    "lang": "German",
    "year": {"$gte": 1800, "$lt": 1900},
    "text": {"$regex": "Strudel", "$options": "i"}  # case-insensitive
}

print("Results (19th century German books containing 'Strudel'):")
for doc in collection.find(query):
    print(f"  {doc['title']}: {doc['text'][:50]}...")
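
If MongoDB is not available, the meaning of the filter can still be checked in plain Python. The sketch below applies the same three conditions (equality on `lang`, a half-open range on `year`, and a case-insensitive regex on `text`) to a copy of the sample documents. The `matches` helper is a hypothetical name and says nothing about how MongoDB evaluates queries internally.

```python
import re

books = [
    {"title": "Book1", "lang": "German", "year": 1850, "text": "Ein Wort Strudel..."},
    {"title": "Book2", "lang": "German", "year": 1905, "text": "Keine Erwähnung"},
    {"title": "Book3", "lang": "English", "year": 1845, "text": "Something about strudel."},
    {"title": "Book4", "lang": "German", "year": 1830, "text": "No mention of desserts"},
    {"title": "Book5", "lang": "German", "year": 1880, "text": "STRUDEL mania!"},
]


def matches(doc):
    """Mirror the query: lang == German, 1800 <= year < 1900, text ~ /Strudel/i."""
    return (
        doc["lang"] == "German"
        and 1800 <= doc["year"] < 1900
        and re.search("Strudel", doc["text"], re.IGNORECASE) is not None
    )


titles = [doc["title"] for doc in books if matches(doc)]
print(titles)
```

Book3 mentions strudel but is English, Book2 is outside the date range, and Book4 never mentions the word, so only Book1 and Book5 remain.
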

## Q3(h): Document DB vs XML/TEI Decision [10 marks]

### Solution

In [None]:
# Q3(h) SOLUTION
print("""FACTORS FOR DOCUMENT DB vs XML/TEI DECISION:

| Factor              | Document DB (MongoDB)     | XML/TEI Database          |
|---------------------|---------------------------|---------------------------|
| Structural encoding | Limited - flat/nested JSON| Excellent - hierarchical  |
|                     |                           | markup for chapters, etc  |
| Query capability    | Simple field queries,     | XQuery/XPath for fine-    |
|                     | aggregation pipeline      | grained text queries      |
| Standards           | No text-encoding standard | TEI is scholarly standard |
| Interoperability    | Need custom APIs          | Share with TEI community  |
| Scalability         | Excellent horizontal      | Can be challenging for    |
|                     | scaling                   | very large corpora        |
| Schema flexibility  | Easy changes              | More rigid but defined    |
| Existing tools      | Many general-purpose      | Specialized TEI tools     |
| Preservation        | Format may change         | TEI is archival standard  |
| Mixed content       | Harder to represent       | Natural fit for inline    |
|                     |                           | markup in text            |
| Full-text search    | Good with text indexes    | Native in XML databases   |

RECOMMENDATION DEPENDS ON:

1. Primary use case:
   - Detailed textual analysis -> TEI
   - Simple metadata queries -> MongoDB

2. Scale:
   - Millions of books, simple queries -> MongoDB
   - Smaller corpus, complex text analysis -> TEI

3. Interoperability:
   - Sharing with other scholars -> TEI standard
   - Internal tool -> MongoDB flexible

4. Development resources:
   - General developers -> MongoDB familiar
   - Digital humanities specialists -> TEI expertise

5. Long-term preservation:
   - Academic/library context -> TEI for archival
   - Commercial/temporary -> MongoDB acceptable
""")

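The "mixed content" and "query capability" rows can be illustrated with a tiny TEI-style fragment and Python's built-in `xml.etree.ElementTree`. The fragment is invented for this sketch; `foreign` and `xml:lang` are genuine TEI markup, but the sentence and chapter are made up. The inline tag marks one word inside running text, and a path expression retrieves just that word, while a JSON document holding the paragraph as a single string would keep the sentence but lose the inline annotation unless offsets were stored separately.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

tei_fragment = """
<text xmlns="http://www.tei-c.org/ns/1.0">
  <body>
    <div type="chapter" n="1">
      <head>Desserts</head>
      <p>The baker praised the <foreign xml:lang="de">Strudel</foreign> of Vienna.</p>
    </div>
  </body>
</text>
""".strip()

root = ET.fromstring(tei_fragment)

# Fine-grained query: foreign-language words inside chapter paragraphs.
foreign = root.findall(".//tei:div[@type='chapter']/tei:p/tei:foreign", TEI_NS)
foreign_words = [(el.text, el.get(XML_LANG)) for el in foreign]

# The plain reading text is still recoverable by concatenating all text nodes.
paragraph = root.find(".//tei:p", TEI_NS)
paragraph_text = "".join(paragraph.itertext())

print(foreign_words)
print(paragraph_text)
```
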
---

# End of Solutions Notebook

All solutions have been provided. Compare with your attempts in the practice notebook!