<a href="https://colab.research.google.com/github/sreent/data-management-intro/blob/main/past-exam-papers/september-2023/notebook-september-2023.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CM3010 September 2023 - Practice Notebook

This notebook provides hands-on practice for the September 2023 exam.

**Exam Structure:**
- Section A: 10 MCQs - 40 marks
- Section B: Answer 2 of 3 questions - 60 marks
  - Q1: Linked Data (RDF + SPARQL)
  - Q2: ER Question - Estate Agency
  - Q3: IR/Document DB - Hathi Trust
- Both sections completed together on Inspera (4 hours total)

**Instructions:**
1. Run the Setup cells first
2. Write your answers in the empty code cells
3. Check your answers against the solution sheet

---

# 1. Environment Setup

Run these cells first to set up MySQL, MongoDB, and RDF tools.

In [None]:
# === MySQL Setup ===
!apt -qq update > /dev/null
!apt -y -qq install mysql-server > /dev/null
!service mysql start

# Create user and database
!mysql -e "CREATE USER IF NOT EXISTS 'examuser'@'localhost' IDENTIFIED BY 'exampass';"
!mysql -e "CREATE DATABASE IF NOT EXISTS exam_db;"
!mysql -e "GRANT ALL PRIVILEGES ON *.* TO 'examuser'@'localhost';"

# === Python libraries ===
!pip install -q sqlalchemy==2.0.20 ipython-sql==0.5.0 pymysql==1.1.0 prettytable==2.0.0 rdflib

%reload_ext sql
%sql mysql+pymysql://examuser:exampass@localhost/exam_db

print("MySQL ready!")

In [None]:
# === MongoDB Setup ===
!wget -q http://archive.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.1f-1ubuntu2_amd64.deb
!dpkg -i libssl1.1_1.1.1f-1ubuntu2_amd64.deb > /dev/null 2>&1
!wget -qO - https://www.mongodb.org/static/pgp/server-4.4.asc | apt-key add - > /dev/null 2>&1
!echo "deb [ arch=amd64,arm64 ] http://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.4 multiverse" | tee /etc/apt/sources.list.d/mongodb-org-4.4.list > /dev/null
!apt-get update -qq > /dev/null
!apt-get install -y -qq mongodb-org > /dev/null
!mkdir -p /data/db
!mongod --fork --logpath /var/log/mongodb.log --dbpath /data/db

!pip install -q pymongo

!mongo --quiet --eval 'print("MongoDB ready!")'

In [None]:
# === RDF/SPARQL Setup ===
import rdflib
from rdflib.plugins.sparql import prepareQuery

print("RDFLib ready for SPARQL queries!")

---

# Question 1: Linked Data (RDF + SPARQL) [30 marks]

## Context

Consider the document below, retrieved from http://babelnet.org/rdf/post_n_EN:

```turtle
@prefix bn: <http://babelnet.org/rdf/> .
@prefix lemon: <http://www.lemon-model.net/lemon#> .
@prefix lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#> .

bn:post_n_EN a lemon:LexicalEntry ;
    lemon:canonicalForm <http://babelnet.org/rdf/post_n_EN/canonicalForm> ;
    lemon:language "EN" ;
    lexinfo:partOfSpeech lexinfo:noun .
```

The document at `/canonicalForm` includes: `lemon:writtenRep "post"`

In [None]:
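# Load sample RDF data for testing.
# Note: here the canonical form is a local resource (e.g. bn:post_n_EN_form)
# rather than the separate /canonicalForm document from the exam text.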
turtle_data = '''
@prefix bn: <http://babelnet.org/rdf/> .
@prefix lemon: <http://www.lemon-model.net/lemon#> .
@prefix lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#> .

bn:post_n_EN a lemon:LexicalEntry ;
    lemon:canonicalForm bn:post_n_EN_form ;
    lemon:language "EN" ;
    lexinfo:partOfSpeech lexinfo:noun .

bn:post_n_EN_form lemon:writtenRep "post" .

bn:run_v_EN a lemon:LexicalEntry ;
    lemon:canonicalForm bn:run_v_EN_form ;
    lemon:language "EN" ;
    lexinfo:partOfSpeech lexinfo:verb .

bn:run_v_EN_form lemon:writtenRep "run" .

bn:house_n_EN a lemon:LexicalEntry ;
    lemon:canonicalForm bn:house_n_EN_form ;
    lemon:language "EN" ;
    lexinfo:partOfSpeech lexinfo:noun .

bn:house_n_EN_form lemon:writtenRep "house" .
'''

g = rdflib.Graph()
g.parse(data=turtle_data, format="turtle")
print(f"Loaded {len(g)} triples into the graph.")

## Question 1(a)(i) [1 mark]

**Question:** What is the generic data model this information is represented in?

In [None]:
# Q1(a)(i) YOUR ANSWER:


## Question 1(a)(ii) [1 mark]

**Question:** What is the serialisation format used for the data model?

In [None]:
# Q1(a)(ii) YOUR ANSWER:


## Question 1(b) [4 marks]

**Question:** One friend says it's impossible to know what word this RDF is talking about without more triples. Another says it's clearly the English word "post" as a noun. To what extent is either right? What further information would help?

In [None]:
# Q1(b) YOUR ANSWER:


## Question 1(c)(i) [6 marks]

**Question:** Write a SPARQL query that finds the written representation and language for all nouns.

In [None]:
# Q1(c)(i) YOUR SPARQL QUERY:
query_nouns = prepareQuery('''
# Write your query here (this cell raises a parse error until you do)

''')

print("All Nouns:")
for row in g.query(query_nouns):
    print(f"  {row}")

## Question 1(c)(ii) [4 marks]

**Question:** Write a SPARQL query that finds the language and part of speech for all words whose canonical form is written "post".

In [None]:
# Q1(c)(ii) YOUR SPARQL QUERY:
query_post = prepareQuery('''
# Write your query here (this cell raises a parse error until you do)

''')

print("Details for 'post':")
for row in g.query(query_post):
    print(f"  {row}")

## Question 1(d) [7 marks total]

**(i)** What is the role of the lemon ontology document? [1 mark]

**(ii)** What format is it in? [1 mark]

**(iii)** To what does the 'owl' prefix refer? [1 mark]

**(iv)** Write triples to provide one definition for the English noun "post". [4 marks]

In [None]:
# Q1(d) YOUR ANSWERS:
# (i) Role:
#
# (ii) Format:
#
# (iii) OWL prefix:
#
# (iv) Definition triples (in Turtle):


## Question 1(e) [7 marks]

**Question:** Sketch an ER diagram for a relational implementation of this model. Include cardinality.

In [None]:
# Q1(e) YOUR ANSWER - describe the tables and relationships:


---

# Question 2: ER Question - Estate Agency [30 marks]

## Context

An estate agency database tracks:
- **Seller** (Name, Address, Phone Number)
- **Estate Agent** (Name)
- **Property** (Address, #bedrooms, Type, Asking price)
- **Buyer** (Name, Address, Phone number)
- **Offers** (Offer date, Offer status, Offer value)
- **Views** (Date)

Relationships: Seller owns Property, Agent sells Property, Property has Offers/Views from Buyers.

## Question 2(a) [3 marks]

**Question:** Add cardinality indications for the ER diagram.

In [None]:
# Q2(a) YOUR ANSWER - list the cardinalities:


## Question 2(b) [5 marks]

**Question:** How would you adapt this to a relational model? Be specific about new entities, relations, or attributes.

In [None]:
# Q2(b) YOUR ANSWER:


## Question 2(c) [6 marks]

**Question:** List the tables, primary and foreign keys for a relational implementation.

In [None]:
# Q2(c) YOUR ANSWER - list the tables:


## Question 2(d) [3 marks]

**Question:** Give the MySQL command for creating one of those tables.

In [None]:
%%sql
-- Q2(d) YOUR CREATE TABLE:


## Question 2(e) [8 marks total]

Agents are paid 1% commission on completed sales.

**(i)** Write a MySQL query to calculate commission earned since 1 January 2023 for each agent. [6 marks]

**(ii)** Modify to list just the top earning agent. [2 marks]
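
The general pattern of joining two tables, filtering by date, and summing per group can be sketched on unrelated data. The example below uses hypothetical `Author` and `Book` tables and Python's built-in `sqlite3` module, chosen only so that it runs without a database server. The `SELECT` statement itself should also be valid in MySQL, but treat that as an assumption to check.

```python
import sqlite3

# In-memory database with hypothetical tables (not the estate agency schema).
conn = sqlite3.connect(":memory:")
conn.executescript('''
CREATE TABLE Author (Name TEXT PRIMARY KEY);
CREATE TABLE Book (
    Title TEXT PRIMARY KEY,
    AuthorName TEXT REFERENCES Author(Name),
    Published DATE,
    Pages INTEGER
);
INSERT INTO Author VALUES ('Ann'), ('Ben');
INSERT INTO Book VALUES
    ('A1', 'Ann', '2021-03-01', 100),
    ('A2', 'Ann', '2022-06-15', 250),
    ('B1', 'Ben', '2022-01-20', 300),
    ('B2', 'Ben', '2019-11-05', 400);
''')

# Join each book to its author, keep only books published from 2021 onwards,
# then add up the page counts separately for each author.
totals = conn.execute('''
SELECT a.Name, SUM(b.Pages) AS TotalPages
FROM Author a
JOIN Book b ON b.AuthorName = a.Name
WHERE b.Published >= '2021-01-01'
GROUP BY a.Name
ORDER BY a.Name
''').fetchall()

print(totals)
conn.close()
```

The `WHERE` clause filters rows before grouping, so the book published in 2019 never reaches the `SUM`.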

In [None]:
%%sql
-- First, let's create the tables and sample data
DROP TABLE IF EXISTS Views;
DROP TABLE IF EXISTS Offers;
DROP TABLE IF EXISTS Property;
DROP TABLE IF EXISTS Seller;
DROP TABLE IF EXISTS EstateAgent;
DROP TABLE IF EXISTS Buyer;

CREATE TABLE Seller (Name VARCHAR(100) PRIMARY KEY, Address VARCHAR(200), PhoneNumber VARCHAR(50));
CREATE TABLE EstateAgent (Name VARCHAR(100) PRIMARY KEY);
CREATE TABLE Buyer (Name VARCHAR(100) PRIMARY KEY, Address VARCHAR(200), PhoneNumber VARCHAR(50));

CREATE TABLE Property (
    Address VARCHAR(200) PRIMARY KEY,
    Type VARCHAR(50),
    Bedrooms INT,
    AskingPrice DECIMAL(12,2),
    SellerName VARCHAR(100),
    AgentName VARCHAR(100),
    FOREIGN KEY (SellerName) REFERENCES Seller(Name),
    FOREIGN KEY (AgentName) REFERENCES EstateAgent(Name)
);

CREATE TABLE Offers (
    PropertyAddress VARCHAR(200),
    BuyerName VARCHAR(100),
    OfferDate DATE,
    OfferStatus VARCHAR(50),
    OfferValue DECIMAL(12,2),
    PRIMARY KEY (PropertyAddress, BuyerName, OfferDate),
    FOREIGN KEY (PropertyAddress) REFERENCES Property(Address),
    FOREIGN KEY (BuyerName) REFERENCES Buyer(Name)
);

-- Insert sample data
INSERT INTO Seller VALUES ('Alice Seller', '1 Seller St', '555-111');
INSERT INTO Seller VALUES ('Bob Seller', '2 Seller Rd', '555-222');
INSERT INTO EstateAgent VALUES ('AgentGrace');
INSERT INTO EstateAgent VALUES ('AgentHeidi');
INSERT INTO Buyer VALUES ('Charlie Buyer', '99 Buyer Rd', '555-333');
INSERT INTO Buyer VALUES ('Doris Buyer', '100 Buyer Ln', '555-444');
INSERT INTO Property VALUES ('10 Main Street', 'Flat', 2, 250000, 'Alice Seller', 'AgentGrace');
INSERT INTO Property VALUES ('20 Baker Avenue', 'Terraced House', 3, 350000, 'Bob Seller', 'AgentHeidi');
INSERT INTO Offers VALUES ('10 Main Street', 'Charlie Buyer', '2023-01-05', 'sale completed', 240000);
INSERT INTO Offers VALUES ('10 Main Street', 'Doris Buyer', '2023-01-10', 'rejected', 230000);
INSERT INTO Offers VALUES ('20 Baker Avenue', 'Doris Buyer', '2023-02-01', 'sale completed', 340000);

SELECT 'Sample data created!' AS Status;

In [None]:
%%sql
-- Q2(e)(i) YOUR COMMISSION QUERY:


In [None]:
%%sql
-- Q2(e)(ii) YOUR TOP AGENT QUERY:


## Question 2(f) [5 marks]

**Question:** Give reasons specific to this use case for why a document database might be good or bad.

In [None]:
# Q2(f) YOUR ANSWER:
# Reasons FOR document database:
#
# Reasons AGAINST document database:


---

# Question 3: IR/Document DB - Hathi Trust [30 marks]

## Context

The Hathi Trust Digital Library uses ML to classify book languages.
- German classifier: **80% precision**, **88% recall**
- Danish classifier: **100% precision**, **76% recall**
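
As a refresher on the two measures, here is a minimal sketch with hypothetical counts for an imaginary classifier. The numbers are invented for illustration and are not the exam figures.

```python
# Hypothetical counts (assumptions for illustration only).
flagged = 200           # items the classifier labelled as positive
true_positives = 150    # flagged items that really are positive
actual_positives = 250  # all positive items in the collection, flagged or not

# Precision: the fraction of flagged items that are correct.
precision = true_positives / flagged

# Recall: the fraction of the real positives that the classifier found.
recall = true_positives / actual_positives

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Both measures share the same numerator, so knowing one of them together with the number of flagged items or of actual positives is enough to recover the number of true positives.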

## Question 3(a) [2 marks]

**Question:** If the system lists 2,200,000 books as being in German, how many are likely to be in German?

In [None]:
# Q3(a) YOUR CALCULATION:


## Question 3(b) [3 marks]

**Question:** How many books in the whole collection are likely to be in German (including those not classified as German)?

In [None]:
# Q3(b) YOUR CALCULATION:


## Question 3(c) [5 marks]

**Question:** Danish is identified with 100% precision and 76% recall. Why might this be more useful for ML training than German's 80% precision?

In [None]:
# Q3(c) YOUR ANSWER:


## Question 3(d) [2 marks]

**Question:** What is an F1-measure?

In [None]:
# Q3(d) YOUR ANSWER:


## Question 3(e) [1 mark]

**Question:** What does `db.books.find({ lang: "German" })` do?

In [None]:
# Q3(e) YOUR ANSWER:


## Question 3(f) [5 marks]

**Question:** Rewrite the command to include only volumes published in the nineteenth century.

In [None]:
# First, let's set up MongoDB with sample data
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['hathi_trust']
collection = db['books']

# Clear and insert sample data
collection.delete_many({})
collection.insert_many([
    {"title": "Book1", "lang": "German", "year": 1850, "text": "Ein Wort Strudel..."},
    {"title": "Book2", "lang": "German", "year": 1905, "text": "Keine Erwähnung"},
    {"title": "Book3", "lang": "English", "year": 1845, "text": "Something about strudel."},
    {"title": "Book4", "lang": "German", "year": 1830, "text": "No mention of desserts"},
    {"title": "Book5", "lang": "German", "year": 1880, "text": "STRUDEL mania!"}
])
print(f"Inserted {collection.count_documents({})} books")

In [None]:
# Q3(f) YOUR MONGODB QUERY for 19th century German books:
query = {}  # Fill in your query

for doc in collection.find(query):
    print(doc)

## Question 3(g) [2 marks]

**Question:** How would you adjust your query to include only books containing the word "Strudel"?
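
Matching a whole word regardless of case is usually done with a regular expression. The sketch below uses Python's `re` module on a few hypothetical strings, not the exam data, to show the effect of word boundaries and the case-insensitive flag. PyMongo accepts a compiled pattern like this as a field value in a filter document.

```python
import re

# Hypothetical strings, unrelated to the exam collection.
texts = [
    "Chocolate CAKE recipe",
    "Pancakes for breakfast",
    "no dessert here",
    "a cake, please",
]

# \b marks a word boundary, so "Pancakes" does not count as the word "cake".
# re.IGNORECASE makes the match case-insensitive.
pattern = re.compile(r"\bcake\b", re.IGNORECASE)

matches = [t for t in texts if pattern.search(t)]
print(matches)
```

Without the `\b` anchors the pattern would also match "Pancakes", which may or may not be what a researcher wants when searching for a word.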

In [None]:
# Q3(g) YOUR MONGODB QUERY with text search:
import re

query = {}  # Fill in your query

for doc in collection.find(query):
    print(doc)

## Question 3(h) [10 marks]

**Question:** What factors should the researcher consider when choosing between enriching the document database or switching to XML/TEI?

In [None]:
# Q3(h) YOUR ANSWER:
# Factors to consider:


---

# Done!

Check your answers against the **solution sheet**.