<a href="https://colab.research.google.com/github/sreent/data-management-intro/blob/main/past-exam-papers/march-2024/notebook-march-2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CM3010 March 2024 - Practice Notebook

This notebook provides hands-on practice for the March 2024 exam.

**Exam Structure:**
- Section A: 10 MCQs - 40 marks
- Section B: Answer 2 of 3 questions - 60 marks
  - Q2: Carnegie Hall RDF/Linked Data
  - Q3: UK Government Exam Attainment Data
  - Q4: MongoDB Document Database
- Both parts completed together on Inspera (4 hours total)

**Instructions:**
1. Run the Setup cells first
2. Write your answers in the empty code cells
3. Check your answers against the solution sheet

---

# 1. Environment Setup

Run these cells first to install and start MySQL and MongoDB, and to define the SPARQL helper function.

In [None]:
# === MySQL Setup ===
!apt -qq update > /dev/null
!apt -y -qq install mysql-server > /dev/null
!service mysql start

# Create user and database
!mysql -e "CREATE USER IF NOT EXISTS 'examuser'@'localhost' IDENTIFIED BY 'exampass';"
!mysql -e "CREATE DATABASE IF NOT EXISTS exam_db;"
!mysql -e "GRANT ALL PRIVILEGES ON *.* TO 'examuser'@'localhost';"

# === Python libraries ===
!pip install -q sqlalchemy==2.0.20 ipython-sql==0.5.0 pymysql==1.1.0 prettytable==2.0.0 sparqlwrapper

%reload_ext sql
%sql mysql+pymysql://examuser:exampass@localhost/exam_db

print("MySQL ready!")

In [None]:
# === MongoDB Setup ===
!wget -q http://archive.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.1f-1ubuntu2_amd64.deb
!dpkg -i libssl1.1_1.1.1f-1ubuntu2_amd64.deb > /dev/null 2>&1
!wget -qO - https://www.mongodb.org/static/pgp/server-4.4.asc | apt-key add - > /dev/null 2>&1
!echo "deb [ arch=amd64,arm64 ] http://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.4 multiverse" | tee /etc/apt/sources.list.d/mongodb-org-4.4.list > /dev/null
!apt-get update -qq > /dev/null
!apt-get install -y -qq mongodb-org > /dev/null
!mkdir -p /data/db
!mongod --fork --logpath /var/log/mongodb.log --dbpath /data/db

# Test MongoDB is running
!mongo --quiet --eval 'print("MongoDB ready!")'

In [None]:
# === SPARQL Setup (for Wikidata and Carnegie Hall queries) ===
from SPARQLWrapper import SPARQLWrapper, JSON

def run_sparql(query, endpoint="https://query.wikidata.org/sparql"):
    """Run a SPARQL query against an endpoint and print results."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    try:
        results = sparql.query().convert()
        for result in results["results"]["bindings"]:
            print(result)
        return results
    except Exception as e:
        print(f"Error: {e}")
        return None

print("SPARQL ready!")
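
`run_sparql` prints each binding as a nested dictionary, which becomes hard to read for wider result sets. The sketch below flattens results in the SPARQL JSON results format into plain rows. The names `bindings_to_rows` and `sample_results` are hypothetical and introduced only for illustration, and the sample data is hand-written rather than output from a real endpoint.

```python
def bindings_to_rows(results):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts.

    Variables that are unbound in a given solution are mapped to None.
    """
    if not results:  # run_sparql returns None when the query fails
        return []
    variables = results["head"]["vars"]
    rows = []
    for binding in results["results"]["bindings"]:
        row = {}
        for var in variables:
            # Each bound variable is an object with "type" and "value" keys
            row[var] = binding[var]["value"] if var in binding else None
        rows.append(row)
    return rows


# Hand-written sample in the SPARQL JSON results shape (not real query output)
sample_results = {
    "head": {"vars": ["person", "label"]},
    "results": {
        "bindings": [
            {
                "person": {"type": "uri", "value": "http://example.org/person/1"},
                "label": {"type": "literal", "value": "Example Person"},
            },
            # Second solution leaves ?label unbound
            {"person": {"type": "uri", "value": "http://example.org/person/2"}},
        ]
    },
}

rows = bindings_to_rows(sample_results)
for row in rows:
    print(row)
```

With the helper above in scope, a call such as `bindings_to_rows(run_sparql(query))` would be one way to use it. Treating a missing result as an empty list keeps that call safe when `run_sparql` returns `None` after an error.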

---

# Question 2: Carnegie Hall RDF/Linked Data [30 marks]

## Context

RDF data from the Carnegie Hall data lab describing Maria Callas:

```turtle
@prefix schema: <http://schema.org/> .
@prefix gnd: <http://d-nb.info/standards/elementset/gnd#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix chm: <http://data.carnegiehall.org/model/> .
@prefix chi: <http://data.carnegiehall.org/instruments/> .
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .

<http://data.carnegiehall.org/names/18065> a chm:Entity, schema:Person ;
    rdfs:label "Maria Callas" ;
    gnd:playedInstrument chi:61 ;
    schema:birthDate "1923-12-02"^^xsd:date ;
    schema:birthPlace <http://sws.geonames.org/5128581/> ;
    schema:deathDate "1977-09-16"^^xsd:date ;
    schema:name "Maria Callas" ;
    skos:exactMatch <http://dbpedia.org/resource/Maria_Callas>,
        <http://id.loc.gov/authorities/names/n50032183>,
        wd:Q128297,
        <https://musicbrainz.org/artist/9dee40b2-25ad-404c-9c9a-139feffd4b57> .
```

Additional triples obtained by following the URLs for chi:61 and wd:Q128297:

```turtle
chi:61 a <http://purl.org/ontology/mo/Instrument> ;
    rdfs:label "soprano" .

wd:Q128297 wdt:P1477 "Maria Anna Cecilia Sofia Kalogeropoulou"@en,
    "Μαρία Άννα Καικιλία Σοφία Καλογεροπούλου"@el .

wd:P1477 schema:description "full name of a person at birth, if different from their current, generally used name"@en .
```

## Question 2(a)(i) [1 mark]

**Question:** Which RDF serialisation is this?

In [None]:
# Q2(a)(i) YOUR ANSWER:


## Question 2(a)(ii) [2 marks]

**Question:** Name ONE other serialisation and, briefly, describe the difference.

In [None]:
# Q2(a)(ii) YOUR ANSWER:


## Question 2(a)(iii) [1 mark]

**Question:** How many triples are shown here?

In [None]:
# Q2(a)(iii) YOUR ANSWER:
# Count the triples in the first RDF block:


## Question 2(b)(i) [1 mark]

**Question:** What is the full URL of wd:Q128297?

In [None]:
# Q2(b)(i) YOUR ANSWER:


## Question 2(b)(ii) [5 marks]

**Question:** Given a triplestore with the RDF from these resources and a SPARQL endpoint, what query would list the birth name of all Sopranos?

In [None]:
# Q2(b)(ii) YOUR SPARQL QUERY:
soprano_birthnames_query = """

"""
print(soprano_birthnames_query)

## Question 2(b)(iii) [5 marks]

**Question:** Both Wikidata and Carnegie Hall have SPARQL endpoints, but the Carnegie Hall triplestore does not include Wikidata's triples, and Wikidata does not have Carnegie Hall data. Give TWO ways that queries like the one you give in (ii) could still be carried out.

In [None]:
# Q2(b)(iii) YOUR ANSWER:
# Method 1:
#
# Method 2:
#

## Question 2(c) [9 marks]

**Question:** Your project wants to use biographical data from Wikidata, concert listings from Carnegie Hall, and MusicBrainz discographies. Consider the relative merits and practicality of using the THREE existing resources as live Linked Open Data as opposed to downloading the data from each and creating a relational database for the data you need.

In [None]:
# Q2(c) YOUR ANSWER:
# Live Linked Open Data approach:
#   Pros:
#   Cons:
#
# Download to Relational Database approach:
#   Pros:
#   Cons:
#
# Recommendation:
#

## Question 2(d) [6 marks]

**Question:** Wikidata uses almost exclusively their own ontology with a bespoke set of properties and classes. Carnegie Hall Data Labs primarily use ontologies from other projects, especially schema.org. Why might they have chosen different approaches? What are the benefits of each?

In [None]:
# Q2(d) YOUR ANSWER:
# Wikidata's bespoke ontology:
#   Why:
#   Benefits:
#
# Carnegie Hall's reuse of existing ontologies:
#   Why:
#   Benefits:
#

---

# Question 3: UK Government Exam Attainment Data [30 marks]

## Context

The UK Government issues data on exam attainment for 16-18 year-olds in CSV files. Here is an extract, rotated to fit: each row of the table below is a column of the CSV, and each data column is one CSV row.

| CSV column | Row 1 (Gender/Male) | Row 2 (Gender/Female) | Row 3 (All/State-funded) | Row 4 (FSM/Eligible) |
|------------|---------------------|-----------------------|--------------------------|----------------------|
| Characteristic type | Gender | Gender | All students | Free School Meals |
| Characteristic | Male | Female | State-funded students | Eligible for FSM |
| Subject name | Additional Mathematics | Classical Greek | Textiles Technology | Total STEM subjects |
| Subject Area | Maths | Classical Studies | Design and Technology | All STEM subjects |
| Total Students | z | 100 | 661 | 6084 |
| Total Students all subjects | z | 145989 | 228782 | 14865 |
| Number at grade A* | z | 27 | 27 | 372 |
| Number at grade A | z | 54 | 85 | 932 |
| Number at grade B | z | 9 | 164 | 1067 |
| Number at grade C | z | 9 | 199 | 1204 |
| Number at grade D | z | 0 | 115 | 1231 |
| Number at grade E | z | 1 | 58 | 847 |
| Number at grade U | z | 0 | 13 | 431 |
| Number achieving grade A*-A | z | 81 | 112 | 1304 |
| Number achieving grade A*-B | z | 90 | 276 | 2371 |
| Number achieving grade A*-C | z | 99 | 475 | 3575 |
| ... | ... | ... | ... | ... |
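
Because the extract is rotated, it can help to turn it back into one record per CSV row before thinking about a schema. The sketch below does this with `zip` on a few lines copied from the table above. The names `rotated` and `unrotate` are hypothetical, only a subset of the fields is included, and every value is kept as the string that appears in the extract.

```python
# A few lines of the rotated extract: first item is the CSV field name,
# the remaining items are the values for each of the four CSV rows.
rotated = [
    ["Characteristic type", "Gender", "Gender", "All students", "Free School Meals"],
    ["Characteristic", "Male", "Female", "State-funded students", "Eligible for FSM"],
    ["Subject name", "Additional Mathematics", "Classical Greek",
     "Textiles Technology", "Total STEM subjects"],
    ["Total Students", "z", "100", "661", "6084"],
    ["Number at grade A*", "z", "27", "27", "372"],
]


def unrotate(lines):
    """Transpose a rotated extract into a list of dicts, one per CSV row."""
    field_names = [line[0] for line in lines]
    # zip(*...) regroups the values so each tuple holds one CSV row
    value_rows = zip(*(line[1:] for line in lines))
    return [dict(zip(field_names, values)) for values in value_rows]


records = unrotate(rotated)
for record in records:
    print(record)
```

Each dictionary then corresponds to one line of the original CSV file, which is the shape an import script or a `LOAD DATA` step would see.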

## Question 3(a) [2 marks]

**Question:** What Normal Forms (if any) is this table in? Justify your answer.

In [None]:
# Q3(a) YOUR ANSWER:


## Question 3(b) [3 marks]

**Question:** The CSV uses "z" to indicate "not applicable". What problems might this create for SQL implementations? How would you avoid them?

In [None]:
# Q3(b) YOUR ANSWER:
# Problems:
#
# Solutions:
#

## Question 3(c) [15 marks]

**Question:** Design a relational model for the files, and give the CREATE commands needed. Explain your choices and show what Normal Forms your solution is in.

In [None]:
# Q3(c) YOUR ANSWER - Explain your design choices:
# Tables:
#
# Normal Forms:
#

In [None]:
%%sql
-- Q3(c) Create your tables:
DROP TABLE IF EXISTS Attainment;
DROP TABLE IF EXISTS GradeMetric;
DROP TABLE IF EXISTS Subject;
DROP TABLE IF EXISTS SubjectArea;
DROP TABLE IF EXISTS Characteristic;
DROP TABLE IF EXISTS CharacteristicType;

-- Add your CREATE TABLE statements here:


## Question 3(d) [4 marks]

**Question:** Give a query for your database that retrieves the percentage of A*-C grades for Classical Studies for each 'Characteristic' that the files track.

In [None]:
%%sql
-- Q3(d) YOUR SQL:


## Question 3(e) [6 marks]

**Question:** Is a relational model the best approach for this sort of data? Evaluate (briefly) this approach and at least two alternative models.

In [None]:
# Q3(e) YOUR ANSWER:
# Relational model:
#   Pros:
#   Cons:
#
# Alternative 1:
#   Pros:
#   Cons:
#
# Alternative 2:
#   Pros:
#   Cons:
#
# Conclusion:
#

---

# Question 4: MongoDB Document Database [30 marks]

## Context

The MongoDB website gives the following as an example of a JSON document for a document database:

```json
{
    "_id": 1,
    "first_name": "Tom",
    "email": "tom@example.com",
    "cell": "765-555-5555",
    "likes": [
        "fashion",
        "spas",
        "shopping"
    ],
    "businesses": [
        {
            "name": "Entertainment 1080",
            "partner": "Jean",
            "status": "Bankrupt",
            "date_founded": {
                "$date": "2012-05-19T04:00:00Z"
            }
        },
        {
            "name": "Swag for Tweens",
            "date_founded": {
                "$date": "2012-11-01T04:00:00Z"
            }
        }
    ]
}
```
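
The `{"$date": ...}` wrapper in `date_founded` is MongoDB Extended JSON: plain JSON has no date type, so the date is written as an object holding an ISO 8601 string. As a minimal sketch of what that means for client code, the example below decodes the wrapper into a Python `datetime` using only the standard library. The function name `decode_extended_dates` is hypothetical, and only this one string form of `$date` is handled.

```python
import json
from datetime import datetime, timezone


def decode_extended_dates(obj):
    """json object_hook: turn {"$date": "<ISO 8601 string>"} into a datetime."""
    if set(obj) == {"$date"} and isinstance(obj["$date"], str):
        # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
        # so spell the UTC offset out explicitly.
        return datetime.fromisoformat(obj["$date"].replace("Z", "+00:00"))
    return obj


# One business entry copied from the document above
business_json = """
{
    "name": "Swag for Tweens",
    "date_founded": {"$date": "2012-11-01T04:00:00Z"}
}
"""

business = json.loads(business_json, object_hook=decode_extended_dates)
print(business["name"], business["date_founded"])
```

Inside the `mongo` shell the same value is written as a `Date` object rather than a wrapped string, which is why date comparisons in queries are made against `Date` values.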

In [None]:
# Set up sample data in MongoDB: Tom's document from above plus two extra practice records (Jane and Bob)
!mongo exam_db --eval '
db.people.drop();
db.people.insertMany([
  {
    "_id": 1,
    "first_name": "Tom",
    "email": "tom@example.com",
    "cell": "765-555-5555",
    "likes": ["fashion", "spas", "shopping"],
    "businesses": [
      {"name": "Entertainment 1080", "partner": "Jean", "status": "Bankrupt", "date_founded": new Date("2012-05-19T04:00:00Z")},
      {"name": "Swag for Tweens", "date_founded": new Date("2012-11-01T04:00:00Z")}
    ]
  },
  {
    "_id": 2,
    "first_name": "Jane",
    "email": "jane@example.com",
    "cell": "555-123-4567",
    "likes": ["travel", "fashun", "reading"],
    "businesses": [
      {"name": "Tech Solutions", "status": "Active", "date_founded": new Date("2019-03-15")}
    ]
  },
  {
    "_id": 3,
    "first_name": "Bob",
    "email": "bob@example.com",
    "likes": ["spas", "golf"],
    "businesses": [
      {"name": "Old Venture", "status": "Bankrupt", "date_founded": new Date("2015-01-10")},
      {"name": "New Hope Ltd", "status": "Active", "date_founded": new Date("2021-06-01")}
    ]
  }
]);
print("Sample data inserted!");
'

## Question 4(a)(i) [2 marks]

**Question:** What query would return documents for people who like spas?

In [None]:
# Q4(a)(i) YOUR MONGODB QUERY:
!mongo exam_db --quiet --eval '
// Write your query here:
db.people.find({ /* YOUR QUERY */ }).pretty();
'

## Question 4(a)(ii) [4 marks]

**Question:** What query would find individuals with businesses founded before the first of March, 2020, who also have at least one business with the status of "Bankrupt"?

In [None]:
# Q4(a)(ii) YOUR MONGODB QUERY:
!mongo exam_db --quiet --eval '
// Write your query here:
db.people.find({ /* YOUR QUERY */ }).pretty();
'

## Question 4(b)(i) [4 marks]

**Question:** A bug in the data entry form for this database created several records with "likes" including "fashun" rather than "fashion". How would you construct a query that would fix entries with the wrong data? (Explain in words - you do not need to know the full syntax).

In [None]:
# Q4(b)(i) YOUR ANSWER:
# Explain the approach:
#
# Conceptual query:
#

In [None]:
# Test your fix query (optional):
!mongo exam_db --quiet --eval '
// First, check for documents with "fashun":
print("Before fix:");
db.people.find({ likes: "fashun" }).forEach(function(doc) {
    print(doc.first_name + " likes: " + doc.likes);
});

// Your update query here:
// db.people.updateMany(...);

// Verify the fix:
// print("After fix:");
// db.people.find({ likes: "fashion" }).forEach(...);
'

## Question 4(b)(ii) [4 marks]

**Question:** A colleague argues that this is a problem of referential integrity, and that you would be able to avoid this issue using a Linked Data database or a relational database. In each case, what strategy would you use?

In [None]:
# Q4(b)(ii) YOUR ANSWER:
# Relational database strategy:
#
# Linked Data database strategy:
#

## Question 4(b)(iii) [8 marks]

**Question:** List all the tables you would need for a relational model of this data, including primary and foreign keys for each.

In [None]:
# Q4(b)(iii) YOUR ANSWER - List the tables:
# Table 1:
#   Primary Key:
#   Foreign Keys:
#
# Table 2:
#   Primary Key:
#   Foreign Keys:
#
# (continue for all tables...)

In [None]:
%%sql
-- Q4(b)(iii) Optional: Create your tables to test
DROP TABLE IF EXISTS PersonInterest;
DROP TABLE IF EXISTS Business;
DROP TABLE IF EXISTS Interest;
DROP TABLE IF EXISTS Person;

-- Add your CREATE TABLE statements here:


## Question 4(b)(iv) [8 marks]

**Question:** Evaluate these THREE models (document, relational and Linked Data/graph) for this sort of data. What would you need to know about the intended application to decide between them?

In [None]:
# Q4(b)(iv) YOUR ANSWER:
# Document model (MongoDB):
#   Pros:
#   Cons:
#
# Relational model (MySQL):
#   Pros:
#   Cons:
#
# Graph/Linked Data model:
#   Pros:
#   Cons:
#
# Questions to ask about the application:
#
# Recommendation:
#

---

# Done!

Check your answers against the **solution sheet**.