<a href="https://colab.research.google.com/github/sreent/data-management-intro/blob/main/past-exam-papers/march-2024/notebook-march-2024-solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CM3010 March 2024 - Solutions Notebook

This notebook contains **complete solutions** for the March 2024 exam.

**Exam Structure:**
- Section A: 10 MCQs - 40 marks
- Section B: Answer 2 of 3 questions - 60 marks
  - Q2: Carnegie Hall RDF/Linked Data
  - Q3: UK Government Exam Attainment Data
  - Q4: MongoDB Document Database
- Both parts completed together on Inspera (4 hours total)

**Instructions:**
1. Run the Setup cells first
2. All solution cells are pre-filled with correct answers
3. Compare with your own attempts from the practice notebook

---

# 1. Environment Setup

Run these cells first to set up MySQL, MongoDB, and SPARQL.

In [None]:
# === MySQL Setup ===
!apt-get update -qq > /dev/null
!apt-get install -y -qq mysql-server > /dev/null
!service mysql start
!mysql -e "CREATE USER IF NOT EXISTS 'examuser'@'localhost' IDENTIFIED BY 'exampass';"
!mysql -e "CREATE DATABASE IF NOT EXISTS exam_db;"
!mysql -e "GRANT ALL PRIVILEGES ON *.* TO 'examuser'@'localhost';"

# === SQL Magic ===
!pip install -q sqlalchemy==2.0.20 ipython-sql==0.5.0 pymysql==1.1.0 prettytable==2.0.0
%reload_ext sql
%sql mysql+pymysql://examuser:exampass@localhost/exam_db

# === SPARQL Magic (cellspell) ===
!pip install "cellspell[sparql] @ git+https://github.com/sreent/jupyter-query-magics.git" -q
%load_ext cellspell.sparql

In [None]:
# === MongoDB Setup ===
!wget -q http://archive.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.1f-1ubuntu2_amd64.deb
!dpkg -i libssl1.1_1.1.1f-1ubuntu2_amd64.deb > /dev/null 2>&1
!wget -qO - https://www.mongodb.org/static/pgp/server-4.4.asc | apt-key add - > /dev/null 2>&1
!echo "deb [ arch=amd64,arm64 ] http://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.4 multiverse" | tee /etc/apt/sources.list.d/mongodb-org-4.4.list > /dev/null
!apt-get update -qq > /dev/null
!apt-get install -y -qq mongodb-org > /dev/null
!mkdir -p /data/db
!mongod --fork --logpath /var/log/mongodb.log --dbpath /data/db

!mongo --quiet --eval 'print("MongoDB ready!")'

# === MongoDB Magic (cellspell) ===
!pip install "cellspell[mongodb] @ git+https://github.com/sreent/jupyter-query-magics.git" -q
%load_ext cellspell.mongodb
%mongodb mongodb://localhost:27017/exam_db

SPARQL queries use `%%sparql` magic cells (loaded via cellspell).

In [None]:
%%writefile carnegie_hall.ttl
@prefix schema: <http://schema.org/> .
@prefix gnd: <http://d-nb.info/standards/elementset/gnd#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix chm: <http://data.carnegiehall.org/model/> .
@prefix chi: <http://data.carnegiehall.org/instruments/> .
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
@prefix mo: <http://purl.org/ontology/mo/> .

<http://data.carnegiehall.org/names/18065> a chm:Entity, schema:Person ;
    rdfs:label "Maria Callas" ;
    gnd:playedInstrument chi:61 ;
    schema:birthDate "1923-12-02"^^xsd:date ;
    schema:birthPlace <http://sws.geonames.org/5128581/> ;
    schema:deathDate "1977-09-16"^^xsd:date ;
    schema:name "Maria Callas" ;
    skos:exactMatch <http://dbpedia.org/resource/Maria_Callas>,
        <http://id.loc.gov/authorities/names/n50032183>,
        wd:Q128297,
        <https://musicbrainz.org/artist/9dee40b2-25ad-404c-9c9a-139feffd4b57> .

chi:61 a mo:Instrument ;
    rdfs:label "soprano" .

wd:Q128297 wdt:P1477 "Maria Anna Cecilia Sofia Kalogeropoulou"@en,
    "Μαρία Άννα Καικιλία Σοφία Καλογεροπούλου"@el .

wdt:P1477 schema:description "full name of a person at birth, if different from their current, generally used name"@en .

<http://data.carnegiehall.org/names/52432> a chm:Entity, schema:Person ;
    rdfs:label "Joan Sutherland" ;
    gnd:playedInstrument chi:61 ;
    schema:birthDate "1926-11-07"^^xsd:date ;
    schema:name "Joan Sutherland" ;
    skos:exactMatch wd:Q229444 .

wd:Q229444 wdt:P1477 "Joan Alston Sutherland"@en .

<http://data.carnegiehall.org/names/12345> a chm:Entity, schema:Person ;
    rdfs:label "Luciano Pavarotti" ;
    gnd:playedInstrument chi:62 ;
    schema:birthDate "1935-10-12"^^xsd:date ;
    schema:name "Luciano Pavarotti" ;
    skos:exactMatch wd:Q36767 .

chi:62 a mo:Instrument ;
    rdfs:label "tenor" .

wd:Q36767 wdt:P1477 "Luciano Pavarotti"@en .

---

# Question 2: Carnegie Hall RDF/Linked Data [30 marks]

## Context

RDF data from the Carnegie Hall data lab describing Maria Callas.

## Question 2(a)(i) [1 mark]

### Solution

**Q2(a)(i) SOLUTION**

Answer: Turtle (Terse RDF Triple Language)

Key indicators:
- @prefix declarations at the start
- Semicolons (;) to continue same subject with different predicates
- Commas (,) to continue same subject and predicate with different objects
- Periods (.) to end each statement (the last triple for a subject)
- Compact, human-readable syntax

## Question 2(a)(ii) [2 marks]

### Solution

**Q2(a)(ii) SOLUTION**

Answer: RDF/XML

Difference: RDF/XML uses XML syntax with nested elements like `<rdf:RDF>`
and `<rdf:Description>`. It is more verbose than Turtle but integrates
better with XML tools and existing XML infrastructure.

Other acceptable answers:
- N-Triples: One triple per line, no prefixes, very simple but verbose
- JSON-LD: JSON syntax, better for web APIs and JavaScript
- N-Quads: Adds graph name as fourth element for named graphs

## Question 2(a)(iii) [1 mark]

### Solution

**Q2(a)(iii) SOLUTION**

Answer: 12 triples

Breakdown:
1. <.../names/18065> a chm:Entity
2. <.../names/18065> a schema:Person
3. <.../names/18065> rdfs:label "Maria Callas"
4. <.../names/18065> gnd:playedInstrument chi:61
5. <.../names/18065> schema:birthDate "1923-12-02"
6. <.../names/18065> schema:birthPlace <.../5128581/>
7. <.../names/18065> schema:deathDate "1977-09-16"
8. <.../names/18065> schema:name "Maria Callas"
9. <.../names/18065> skos:exactMatch <.../Maria_Callas> (dbpedia)
10. <.../names/18065> skos:exactMatch <.../n50032183> (loc.gov)
11. <.../names/18065> skos:exactMatch wd:Q128297 (wikidata)
12. <.../names/18065> skos:exactMatch <musicbrainz...>

Counting explanation:
- 'a chm:Entity, schema:Person' uses comma = same predicate, different objects = 2 triples
- Other predicates with single objects = 6 triples  
- 'skos:exactMatch' with FOUR objects (comma-separated) = 4 triples

Total = 2 + 6 + 4 = 12 triples
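
The counting rule above can be checked mechanically. The sketch below is only an illustration: it copies the predicate/object lists for the Maria Callas subject from the Turtle excerpt into a plain dictionary (no RDF library is used) and counts one triple per predicate/object pair. The helper name `count_triples` is hypothetical.

```python
# Predicate -> objects for <http://data.carnegiehall.org/names/18065>,
# copied by hand from the Turtle excerpt ("a" is shorthand for rdf:type).
callas = {
    "rdf:type": ["chm:Entity", "schema:Person"],
    "rdfs:label": ['"Maria Callas"'],
    "gnd:playedInstrument": ["chi:61"],
    "schema:birthDate": ['"1923-12-02"^^xsd:date'],
    "schema:birthPlace": ["<http://sws.geonames.org/5128581/>"],
    "schema:deathDate": ['"1977-09-16"^^xsd:date'],
    "schema:name": ['"Maria Callas"'],
    "skos:exactMatch": [
        "<http://dbpedia.org/resource/Maria_Callas>",
        "<http://id.loc.gov/authorities/names/n50032183>",
        "wd:Q128297",
        "<https://musicbrainz.org/artist/9dee40b2-25ad-404c-9c9a-139feffd4b57>",
    ],
}


def count_triples(predicate_objects):
    """One subject: every (predicate, object) pair is exactly one triple."""
    return sum(len(objects) for objects in predicate_objects.values())


triple_count = count_triples(callas)
print(triple_count)  # 12
```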

## Question 2(b)(i) [1 mark]

### Solution

**Q2(b)(i) SOLUTION**

Answer: http://www.wikidata.org/entity/Q128297

Explanation:
The prefix declaration is:
  @prefix wd: <http://www.wikidata.org/entity/> .

So wd:Q128297 expands to:
  <http://www.wikidata.org/entity/> + Q128297
  = <http://www.wikidata.org/entity/Q128297>
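
As a minimal sketch of the same expansion in code, assuming a plain dictionary of the prefixes declared in `carnegie_hall.ttl` (the `expand` helper is hypothetical, not part of any RDF library):

```python
# Prefix table copied from the @prefix declarations in carnegie_hall.ttl.
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
    "chi": "http://data.carnegiehall.org/instruments/",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'wd:Q128297' into a full IRI."""
    prefix, _, local_part = prefixed_name.partition(":")
    return prefixes[prefix] + local_part


full_iri = expand("wd:Q128297")
print(full_iri)  # http://www.wikidata.org/entity/Q128297
```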

## Question 2(b)(ii) [5 marks]

### Solution

In [None]:
%%sparql --file carnegie_hall.ttl
PREFIX gnd: <http://d-nb.info/standards/elementset/gnd#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX chi: <http://data.carnegiehall.org/instruments/>

SELECT ?person ?personLabel ?birthName
WHERE {
    ?person gnd:playedInstrument chi:61 .
    OPTIONAL { ?person rdfs:label ?personLabel }
    ?person skos:exactMatch ?wdEntity .
    ?wdEntity wdt:P1477 ?birthName .
}

## Question 2(b)(iii) [5 marks]

### Solution

**Q2(b)(iii) SOLUTION**

TWO ways to query across Carnegie Hall and Wikidata:

METHOD 1: Federated SPARQL Query (SERVICE keyword)
============================================
Use SPARQL 1.1 federated queries to query both endpoints:

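PREFIX gnd: <http://d-nb.info/standards/elementset/gnd#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX chi: <http://data.carnegiehall.org/instruments/>

SELECT ?person ?birthName
WHERE {
    ?person gnd:playedInstrument chi:61 .
    ?person skos:exactMatch ?wdEntity .

    SERVICE <https://query.wikidata.org/sparql> {
        ?wdEntity wdt:P1477 ?birthName .
    }
}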

METHOD 2: Data Integration/ETL
==============================
Download data from both sources and load into single triplestore:
1. Export RDF from Carnegie Hall Data Labs
2. Query Wikidata for relevant entities and export
3. Load both into local triplestore (e.g., Apache Jena Fuseki)
4. Run queries against the combined dataset

OTHER ACCEPTABLE METHODS:
- Link Traversal: Client follows URIs to dereference and fetch data
- Application-level JOIN: Query each endpoint, join in code
- Linked Data Fragments: Use TPF for client-side query processing
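
The application-level JOIN option can be sketched in a few lines of Python. This is an illustration under stated assumptions: the two lists stand in for result rows already fetched from each endpoint (the values are taken from `carnegie_hall.ttl`), no network call is made, and `join_on` is a hypothetical helper implementing a simple hash join on the shared `wdEntity` IRI.

```python
# Rows as they might come back from the Carnegie Hall endpoint.
carnegie_rows = [
    {"person": "http://data.carnegiehall.org/names/18065",
     "wdEntity": "http://www.wikidata.org/entity/Q128297"},
    {"person": "http://data.carnegiehall.org/names/52432",
     "wdEntity": "http://www.wikidata.org/entity/Q229444"},
]

# Rows as they might come back from the Wikidata endpoint.
wikidata_rows = [
    {"wdEntity": "http://www.wikidata.org/entity/Q128297",
     "birthName": "Maria Anna Cecilia Sofia Kalogeropoulou"},
    {"wdEntity": "http://www.wikidata.org/entity/Q229444",
     "birthName": "Joan Alston Sutherland"},
]


def join_on(left, right, key):
    """Hash join: index the right-hand rows by key, then probe with the left."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


joined = join_on(carnegie_rows, wikidata_rows, "wdEntity")
for row in joined:
    print(row["person"], "->", row["birthName"])
```

Compared with a federated `SERVICE` query, this keeps each endpoint's query simple, at the cost of moving the join logic (and any pagination or retry handling) into application code.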

## Question 2(c) [9 marks]

### Solution

**Q2(c) SOLUTION**

LIVE LINKED OPEN DATA vs RELATIONAL DATABASE

| Aspect              | Live LOD              | Relational DB          |
|---------------------|----------------------|------------------------|
| Data Freshness      | Always current       | Stale (needs refresh)  |
| Query Speed         | Slow (network)       | Fast (local)           |
| Availability        | Depends on 3 services| Self-hosted, reliable  |
| Schema Flexibility  | Each has own ontology| Unified schema         |
| Data Volume         | Query what you need  | Must download subset   |
| Relationships       | Natural URI links    | Foreign keys           |

ARGUMENTS FOR LIVE LINKED DATA:
1. Wikidata updates constantly (births, deaths, discoveries)
2. No storage infrastructure needed
3. Links between sources already exist via skos:exactMatch
4. Exploratory queries discover unexpected connections
5. Legal simplicity - no need to store copies

ARGUMENTS FOR RELATIONAL DATABASE:
1. Performance - federated queries across 3 endpoints very slow
2. Reliability - not dependent on external availability
3. Complex analytics - aggregations easier in SQL
4. Data quality control - can clean imported data
5. Schema optimization - design tables for specific queries

RECOMMENDATION: Hybrid approach
1. Use LOD for exploration and discovery
2. Cache frequently-used data locally
3. Store project-specific data in relational DB
4. Periodic sync to update cached data
5. Keep URIs as identifiers to link back to sources

## Question 2(d) [6 marks]

### Solution

**Q2(d) SOLUTION**

WIKIDATA vs CARNEGIE HALL ONTOLOGY APPROACHES

WHY DIFFERENT APPROACHES?

| Factor       | Wikidata               | Carnegie Hall           |
|--------------|------------------------|-------------------------|
| Scope        | Universal knowledge    | Domain-specific (music) |
| Contributors | Millions of volunteers | Small professional team |
| Data Model   | Represent ANYTHING     | Focus on performers     |
| Governance   | Community consensus    | Institutional decisions |
| History      | Started from scratch   | Built on existing standards|

WIKIDATA'S BESPOKE ONTOLOGY:

Benefits:
1. Flexibility - add properties for any domain without approval
2. Consistency - all properties follow same design patterns
3. Qualifier support - properties can have metadata (dates, sources)
4. Community control - no external dependency
5. Language neutrality - P569 works across all languages

CARNEGIE HALL'S REUSE OF EXISTING ONTOLOGIES:

Benefits:
1. Interoperability - schema.org understood by search engines
2. Established semantics - schema:birthDate is well-defined
3. Tooling support - libraries, validators already exist
4. Discoverability - standard properties improve SEO
5. Credibility - using music ontology signals domain expertise

SUMMARY:
- Wikidata: Custom ontology for maximum flexibility at scale
- Carnegie Hall: Reuse ontologies for interoperability

---

# Question 3: UK Government Exam Attainment Data [30 marks]

## Question 3(a) [2 marks]

### Solution

**Q3(a) SOLUTION**

Answer: This table is NOT in First Normal Form (1NF)

Reasons:

1. NOT A PROPER RELATION:
   - The table is pivoted/transposed
   - Rows contain different types of data (metadata vs actual values)
   - First rows are category names, not data values

2. REPEATING GROUPS:
   - Each column represents a different combination of
     characteristic + subject
   - This is a classic repeating group pattern

3. NO CLEAR PRIMARY KEY:
   - Row labels like "Number at grade A*" are not proper attributes
   - Cannot uniquely identify rows with a key

4. MIXED DATA TYPES:
   - "z" used alongside numbers, so a column cannot hold a single numeric type

To be in 1NF, restructure as:
| CharType | Characteristic | Subject | Metric | Value |
|----------|---------------|---------|--------|-------|
| Gender   | Female        | Greek   | A*     | 27    |

## Question 3(b) [3 marks]

### Solution

**Q3(b) SOLUTION**

PROBLEMS WITH "Z" FOR NOT APPLICABLE:

1. Type Mismatch:
   - Numeric columns must be VARCHAR to store "Z"
   - Prevents mathematical operations (SUM, AVG)

2. Aggregation Errors:
   - MySQL silently coerces "Z" to 0 (with a warning), which skews AVG();
     stricter systems raise an error instead
   - May need CASE statements everywhere

3. Comparison Issues:
   - WHERE value > 100 won't work with mixed types
   - Type coercion may cause unexpected results

4. Sorting Problems:
   - "Z" sorts alphabetically, not as missing data
   - Appears after digits in ASCII order, and numbers stored as text
     sort lexically ("100" before "27")

SOLUTIONS:

1. Use NULL instead of "Z":
   LOAD DATA ... (@raw_value) SET value = NULLIF(@raw_value, 'Z');

2. Separate validity column:
   CREATE TABLE Results (
       Value INT,
       IsApplicable BOOLEAN DEFAULT TRUE
   );

3. Use NULL with view for display:
   CREATE VIEW Display AS
   SELECT COALESCE(CAST(Value AS CHAR), 'N/A') AS Val
   FROM Results;

BEST PRACTICE: Use NULL for missing data - SQL is designed for it.
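
A small sketch of why NULL behaves better than a placeholder string. It uses Python's built-in `sqlite3` module as a stand-in for MySQL (an assumption made so the example is self-contained), and the raw values are illustrative: 27 comes from the restructured table in Q3(a), while 33 is a hypothetical second value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (value INTEGER)")

# Raw cells as they might appear in the source file; "Z" means not applicable.
raw_values = ["27", "Z", "33"]

# NULLIF turns the placeholder into NULL at load time.
conn.executemany(
    "INSERT INTO results (value) VALUES (NULLIF(?, 'Z'))",
    [(v,) for v in raw_values],
)

avg_value, non_null_count, row_count = conn.execute(
    "SELECT AVG(value), COUNT(value), COUNT(*) FROM results"
).fetchone()

# Aggregates skip NULL, so the average is taken over the two real values only.
print(avg_value, non_null_count, row_count)  # 30.0 2 3
```

`COUNT(value)` versus `COUNT(*)` also makes the number of "not applicable" cells easy to report without any string comparison.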

## Question 3(c) [15 marks]

### Solution

**Q3(c) SOLUTION - Model explanation**

RELATIONAL MODEL DESIGN:

Tables:
1. CharacteristicType - Gender, FSM, All students, etc.
2. Characteristic - Male, Female, Eligible for FSM, etc.
3. SubjectArea - Maths, Classical Studies, etc.
4. Subject - Additional Mathematics, Classical Greek, etc.
5. GradeMetric - Total Students, Number at grade A*, etc.
6. Attainment - Fact table with values

Design Choices:
- Separate CharacteristicType: Normalizes type/characteristic hierarchy
- Subject linked to SubjectArea: Enforces categorization
- GradeMetric table: Allows adding metrics without schema change
- NULL for Value: Handles "not applicable" properly
- Decimal for Value: Handles counts and percentages

Normal Forms:
- 1NF: All atomic values, proper primary keys
- 2NF: No partial dependencies
- 3NF: No transitive dependencies
- BCNF: All determinants are candidate keys

In [None]:
%%sql
-- Q3(c) SOLUTION - CREATE TABLE statements
DROP TABLE IF EXISTS Attainment;
DROP TABLE IF EXISTS GradeMetric;
DROP TABLE IF EXISTS Subject;
DROP TABLE IF EXISTS SubjectArea;
DROP TABLE IF EXISTS Characteristic;
DROP TABLE IF EXISTS CharacteristicType;

CREATE TABLE CharacteristicType (
    CharTypeId INT PRIMARY KEY AUTO_INCREMENT,
    TypeName VARCHAR(50) NOT NULL UNIQUE
);

CREATE TABLE Characteristic (
    CharId INT PRIMARY KEY AUTO_INCREMENT,
    CharTypeId INT NOT NULL,
    CharName VARCHAR(100) NOT NULL,
    FOREIGN KEY (CharTypeId) REFERENCES CharacteristicType(CharTypeId),
    UNIQUE (CharTypeId, CharName)
);

CREATE TABLE SubjectArea (
    SubjectAreaId INT PRIMARY KEY AUTO_INCREMENT,
    AreaName VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE Subject (
    SubjectId INT PRIMARY KEY AUTO_INCREMENT,
    SubjectName VARCHAR(100) NOT NULL UNIQUE,
    SubjectAreaId INT NOT NULL,
    FOREIGN KEY (SubjectAreaId) REFERENCES SubjectArea(SubjectAreaId)
);

CREATE TABLE GradeMetric (
    MetricId INT PRIMARY KEY AUTO_INCREMENT,
    MetricName VARCHAR(50) NOT NULL UNIQUE,
    MetricType ENUM('count', 'cumulative', 'percentage') NOT NULL
);

CREATE TABLE Attainment (
    AttainmentId INT PRIMARY KEY AUTO_INCREMENT,
    CharId INT NOT NULL,
    SubjectId INT NOT NULL,
    MetricId INT NOT NULL,
    Value DECIMAL(10,4),
    AcademicYear VARCHAR(9),
    FOREIGN KEY (CharId) REFERENCES Characteristic(CharId),
    FOREIGN KEY (SubjectId) REFERENCES Subject(SubjectId),
    FOREIGN KEY (MetricId) REFERENCES GradeMetric(MetricId),
    UNIQUE (CharId, SubjectId, MetricId, AcademicYear)
);

SELECT 'Tables created!' AS Status;

In [None]:
%%sql
-- Insert sample data for testing
INSERT INTO CharacteristicType (TypeName) VALUES
('Gender'), ('All students'), ('Free School Meals');

INSERT INTO Characteristic (CharTypeId, CharName) VALUES
(1, 'Male'), (1, 'Female'),
(2, 'State-funded students'),
(3, 'Eligible for FSM');

INSERT INTO SubjectArea (AreaName) VALUES
('Maths'), ('Classical Studies'), ('Design and Technology'), ('All STEM subjects');

INSERT INTO Subject (SubjectName, SubjectAreaId) VALUES
('Additional Mathematics', 1),
('Classical Greek', 2),
('Textiles Technology', 3),
('Total STEM subjects', 4);

INSERT INTO GradeMetric (MetricName, MetricType) VALUES
('Total Students', 'count'),
('Number at grade A*', 'count'),
('Number achieving grade A*-C', 'cumulative'),
('Percent achieving grade A*-C', 'percentage');

-- Sample attainment data
INSERT INTO Attainment (CharId, SubjectId, MetricId, Value, AcademicYear) VALUES
(2, 2, 1, 100, '2023-2024'),      -- Female, Classical Greek, Total Students
(2, 2, 3, 99, '2023-2024'),       -- Female, Classical Greek, A*-C count
(3, 3, 1, 661, '2023-2024'),      -- State-funded, Textiles, Total Students
(3, 3, 3, 475, '2023-2024');      -- State-funded, Textiles, A*-C count

SELECT 'Sample data inserted!' AS Status;

## Question 3(d) [4 marks]

### Solution

In [None]:
%%sql
-- Q3(d) SOLUTION: Percentage of A*-C for Classical Studies by Characteristic
-- This calculates the actual percentage from count / total
SELECT
    ct.TypeName AS CharacteristicType,
    c.CharName AS Characteristic,
    ROUND(a_ac.Value / a_total.Value * 100, 2) AS PercentAStarToC
FROM Attainment a_ac
INNER JOIN Attainment a_total
    ON a_ac.CharId = a_total.CharId
    AND a_ac.SubjectId = a_total.SubjectId
    AND a_ac.AcademicYear = a_total.AcademicYear
INNER JOIN Characteristic c ON a_ac.CharId = c.CharId
INNER JOIN CharacteristicType ct ON c.CharTypeId = ct.CharTypeId
INNER JOIN Subject s ON a_ac.SubjectId = s.SubjectId
INNER JOIN SubjectArea sa ON s.SubjectAreaId = sa.SubjectAreaId
INNER JOIN GradeMetric m_ac ON a_ac.MetricId = m_ac.MetricId
INNER JOIN GradeMetric m_total ON a_total.MetricId = m_total.MetricId
WHERE sa.AreaName = 'Classical Studies'
  AND m_ac.MetricName = 'Number achieving grade A*-C'
  AND m_total.MetricName = 'Total Students'
  AND a_total.Value > 0
ORDER BY ct.TypeName, c.CharName;
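
The self-join at the heart of the query above can be tried in isolation. The following sketch is a simplification, not the notebook's schema: it uses Python's built-in `sqlite3` instead of MySQL and a single flattened, hypothetical `attainment` table, loaded with the same four sample values inserted earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attainment (
    characteristic TEXT, subject TEXT, metric TEXT, value REAL
);
INSERT INTO attainment VALUES
    ('Female', 'Classical Greek', 'Total Students', 100),
    ('Female', 'Classical Greek', 'Number achieving grade A*-C', 99),
    ('State-funded students', 'Textiles Technology', 'Total Students', 661),
    ('State-funded students', 'Textiles Technology', 'Number achieving grade A*-C', 475);
""")

# Pair each A*-C count row with the total row for the same
# characteristic and subject, then divide.
rows = conn.execute("""
    SELECT ac.characteristic,
           ac.subject,
           ROUND(ac.value / total.value * 100, 2) AS percent_a_star_to_c
    FROM attainment AS ac
    JOIN attainment AS total
      ON ac.characteristic = total.characteristic
     AND ac.subject = total.subject
    WHERE ac.metric = 'Number achieving grade A*-C'
      AND total.metric = 'Total Students'
      AND total.value > 0
    ORDER BY ac.characteristic
""").fetchall()

for row in rows:
    print(row)
```

Unlike the notebook query, this sketch does not filter on subject area, so both sample subjects appear in the output.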

## Question 3(e) [6 marks]

### Solution

**Q3(e) SOLUTION**

IS RELATIONAL MODEL BEST FOR THIS DATA?

| Model      | Pros                      | Cons                        |
|------------|---------------------------|-----------------------------|
| Relational | Aggregations, integrity   | Rigid schema, many JOINs    |
| Document   | Flexible, easy import     | Poor analytics, duplication |
| Columnar   | Fast aggregations         | Overkill for small data     |
| OLAP Cube  | Pre-computed, fast BI     | Complex setup, inflexible   |

RELATIONAL IS APPROPRIATE because:
1. Analytical queries need JOINs and aggregations
2. Data integrity - ensure characteristics link to valid types
3. Historical tracking - add academic year dimension easily
4. Moderate size - not "big data"
5. Reporting tools expect SQL databases

WHEN ALTERNATIVES WOULD BE BETTER:
- Very large scale (billions of rows) -> Columnar database
- Exploratory analysis only -> Keep as CSV with pandas
- API-first access -> Document database
- Self-service BI with complex drilling -> OLAP cube

CONCLUSION: Relational is a good fit for this analytical use case.

---

# Question 4: MongoDB Document Database [30 marks]

In [None]:
%%mongodb
db.people.drop()
db.people.insertMany([
  {
    "_id": 1,
    "first_name": "Tom",
    "email": "tom@example.com",
    "cell": "765-555-5555",
    "likes": ["fashion", "spas", "shopping"],
    "businesses": [
      {"name": "Entertainment 1080", "partner": "Jean", "status": "Bankrupt", "date_founded": new Date("2012-05-19")},
      {"name": "Swag for Tweens", "date_founded": new Date("2012-11-01")}
    ]
  },
  {
    "_id": 2,
    "first_name": "Jane",
    "email": "jane@example.com",
    "cell": "555-123-4567",
    "likes": ["travel", "fashun", "reading"],
    "businesses": [
      {"name": "Tech Solutions", "status": "Active", "date_founded": new Date("2019-03-15")}
    ]
  },
  {
    "_id": 3,
    "first_name": "Bob",
    "email": "bob@example.com",
    "likes": ["spas", "golf"],
    "businesses": [
      {"name": "Old Venture", "status": "Bankrupt", "date_founded": new Date("2015-01-10")},
      {"name": "New Hope Ltd", "status": "Active", "date_founded": new Date("2021-06-01")}
    ]
  }
])

## Question 4(a)(i) [2 marks]

### Solution

**Q4(a)(i) SOLUTION**

MongoDB Query for people who like spas:

db.people.find({ likes: "spas" })

Explanation: MongoDB automatically searches within arrays.
When likes is an array, { likes: "spas" } matches documents
where "spas" is an element of the array.
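
To make the matching rule concrete, here is a rough Python model of it over the sample documents (only the fields needed are copied). The `matches_equality` helper is hypothetical and deliberately simplified: it covers a scalar query value against a scalar or array field, which is all this question needs.

```python
people = [
    {"first_name": "Tom", "likes": ["fashion", "spas", "shopping"]},
    {"first_name": "Jane", "likes": ["travel", "fashun", "reading"]},
    {"first_name": "Bob", "likes": ["spas", "golf"]},
]


def matches_equality(doc, field, value):
    """Model of { field: value }: arrays match if any element equals value."""
    stored = doc.get(field)
    if isinstance(stored, list):
        return value in stored
    return stored == value


spa_fans = [p["first_name"] for p in people if matches_equality(p, "likes", "spas")]
print(spa_fans)  # ['Tom', 'Bob']
```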

In [None]:
%%mongodb
db.people.find({"likes": "spas"})

## Question 4(a)(ii) [4 marks]

### Solution

**Q4(a)(ii) SOLUTION**

MongoDB query for people with a business founded before March 2020 and at least one business with status "Bankrupt":

db.people.find({
    "businesses.date_founded": { $lt: ISODate("2020-03-01") },
    "businesses.status": "Bankrupt"
})

IMPORTANT NOTE:
This finds documents where:
- At least one business was founded before March 1, 2020, AND
- At least one business has status "Bankrupt"

These don't have to be the SAME business element.

If you need BOTH conditions on the SAME business, use $elemMatch:

db.people.find({
    businesses: {
        $elemMatch: {
            date_founded: { $lt: ISODate("2020-03-01") },
            status: "Bankrupt"
        }
    }
})
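
With the notebook's three sample people, both queries return the same documents (Tom and Bob), so the difference is easy to miss. The sketch below models the two semantics in plain Python and adds one hypothetical person, "Ann", who is not in the notebook data, to show where they diverge. The function names are illustrative.

```python
from datetime import date

CUTOFF = date(2020, 3, 1)

people = [
    {"first_name": "Tom", "businesses": [
        {"status": "Bankrupt", "date_founded": date(2012, 5, 19)},
        {"date_founded": date(2012, 11, 1)},
    ]},
    {"first_name": "Jane", "businesses": [
        {"status": "Active", "date_founded": date(2019, 3, 15)},
    ]},
    {"first_name": "Bob", "businesses": [
        {"status": "Bankrupt", "date_founded": date(2015, 1, 10)},
        {"status": "Active", "date_founded": date(2021, 6, 1)},
    ]},
    # Hypothetical: the old business is Active, the Bankrupt one is recent.
    {"first_name": "Ann", "businesses": [
        {"status": "Active", "date_founded": date(2018, 1, 1)},
        {"status": "Bankrupt", "date_founded": date(2021, 1, 1)},
    ]},
]


def dotted_match(doc):
    """Dot notation: each condition may be met by a different element."""
    businesses = doc["businesses"]
    return (any(b["date_founded"] < CUTOFF for b in businesses)
            and any(b.get("status") == "Bankrupt" for b in businesses))


def elem_match(doc):
    """$elemMatch: one element must satisfy both conditions."""
    return any(b["date_founded"] < CUTOFF and b.get("status") == "Bankrupt"
               for b in doc["businesses"])


dotted_names = [p["first_name"] for p in people if dotted_match(p)]
elem_names = [p["first_name"] for p in people if elem_match(p)]
print(dotted_names)  # ['Tom', 'Bob', 'Ann']
print(elem_names)    # ['Tom', 'Bob']
```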

In [None]:
%%mongodb
db.people.find({
    "businesses.date_founded": {"$lt": new Date("2020-03-01")},
    "businesses.status": "Bankrupt"
})

## Question 4(b)(i) [4 marks]

### Solution

**Q4(b)(i) SOLUTION**

FIX "fashun" -> "fashion" IN MONGODB:

Approach:
Use updateMany() with $set and positional $ operator to update array elements.

Query:
db.people.updateMany(
    { likes: "fashun" },           // Find documents with "fashun"
    { $set: { "likes.$": "fashion" } }  // Replace matched element
)

Step-by-step:
1. updateMany() - Updates all matching documents (not just first)
2. { likes: "fashun" } - Filter for documents where likes contains "fashun"
3. $set - The update operator to set a value
4. "likes.$" - Positional operator refers to first matched array element
5. "fashion" - The new value to set
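
The effect of the positional operator can be modelled in plain Python. This is only a sketch of the semantics, not MongoDB's implementation; `update_many_positional` is a hypothetical helper, and the documents are trimmed copies of the sample data.

```python
people = [
    {"first_name": "Tom", "likes": ["fashion", "spas", "shopping"]},
    {"first_name": "Jane", "likes": ["travel", "fashun", "reading"]},
    {"first_name": "Bob", "likes": ["spas", "golf"]},
]


def update_many_positional(docs, field, old, new):
    """Model of updateMany({field: old}, {$set: {"field.$": new}})."""
    modified = 0
    for doc in docs:
        values = doc[field]
        if old in values:                    # the filter matched this document
            values[values.index(old)] = new  # "$": first matching position only
            modified += 1
    return modified


modified_count = update_many_positional(people, "likes", "fashun", "fashion")
print(modified_count)      # 1
print(people[1]["likes"])  # ['travel', 'fashion', 'reading']
```

Because only the first matching element in each array is replaced, an array containing the typo twice would need a second pass (or a different update strategy).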

Alternative using $addToSet and $pull (two steps, because a single update
cannot apply both operators to the same field):
// 1. Add the correct value, if not already present, to documents with the typo
db.people.updateMany(
    { likes: "fashun" },
    { $addToSet: { likes: "fashion" } }
);
// 2. Remove the wrong value
db.people.updateMany(
    { likes: "fashun" },
    { $pull: { likes: "fashun" } }
);

In [None]:
%%mongodb
db.people.updateMany(
    {"likes": "fashun"},
    {"$set": {"likes.$": "fashion"}}
)

In [None]:
%%mongodb
db.people.find({"first_name": "Jane"})

## Question 4(b)(ii) [4 marks]

### Solution

**Q4(b)(ii) SOLUTION**

REFERENTIAL INTEGRITY STRATEGIES:

RELATIONAL DATABASE:

1. Create a lookup table for allowed values:
   CREATE TABLE Interest (
       InterestId INT PRIMARY KEY AUTO_INCREMENT,
       InterestName VARCHAR(50) NOT NULL UNIQUE
   );
   INSERT INTO Interest (InterestName) VALUES
   ('fashion'), ('spas'), ('shopping'), ('travel');

2. Reference with foreign key:
   CREATE TABLE PersonInterest (
       PersonId INT,
       InterestId INT,
       PRIMARY KEY (PersonId, InterestId),
       FOREIGN KEY (InterestId) REFERENCES Interest(InterestId)
   );

3. Result: Cannot insert "fashun" - it doesn't exist in Interest table
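
A runnable sketch of this behaviour, using Python's built-in `sqlite3` as a stand-in for MySQL (an assumption for self-containment; SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` is set, and the ID value 99 is an arbitrary non-existent key):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE Interest (
    InterestId INTEGER PRIMARY KEY,
    InterestName TEXT NOT NULL UNIQUE
);
CREATE TABLE PersonInterest (
    PersonId INTEGER,
    InterestId INTEGER,
    PRIMARY KEY (PersonId, InterestId),
    FOREIGN KEY (InterestId) REFERENCES Interest(InterestId)
);
INSERT INTO Interest (InterestName) VALUES
    ('fashion'), ('spas'), ('shopping'), ('travel');
""")

# The misspelling has no row in the lookup table, so there is no ID to use.
fashun_row = conn.execute(
    "SELECT InterestId FROM Interest WHERE InterestName = 'fashun'"
).fetchone()

# Linking to an existing interest works (1 is the ID of 'fashion').
conn.execute("INSERT INTO PersonInterest VALUES (1, 1)")

# Linking to an ID that is not in Interest violates the foreign key.
try:
    conn.execute("INSERT INTO PersonInterest VALUES (1, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(fashun_row, rejected)  # None True
```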


LINKED DATA / GRAPH DATABASE:

1. Define interests as resources with URIs:
   <http://example.org/interests/fashion> a :Interest ;
       rdfs:label "fashion" .

2. Reference by URI, not string:
   <http://example.org/person/1> :likes
       <http://example.org/interests/fashion> .

3. Use SHACL or OWL constraints:
   :PersonShape a sh:NodeShape ;
       sh:targetClass :Person ;
       sh:property [
           sh:path :likes ;
           sh:class :Interest ;  # Must be an Interest
       ] .

4. Result: Misspelled URIs either don't resolve or fail validation

## Question 4(b)(iii) [8 marks]

### Solution

**Q4(b)(iii) SOLUTION**

RELATIONAL MODEL TABLES:

| Table          | Primary Key            | Foreign Keys               |
|----------------|------------------------|----------------------------|
| Person         | PersonId               | -                          |
| Interest       | InterestId             | -                          |
| PersonInterest | (PersonId, InterestId) | PersonId -> Person         |
|                |                        | InterestId -> Interest     |
| Business       | BusinessId             | PersonId -> Person         |
| Partner (opt)  | (BusinessId, PartnerId)| BusinessId -> Business     |
|                |                        | PartnerId -> Person        |

In [None]:
%%sql
-- Q4(b)(iii) SOLUTION - CREATE TABLE statements
DROP TABLE IF EXISTS PersonInterest;
DROP TABLE IF EXISTS Business;
DROP TABLE IF EXISTS Interest;
DROP TABLE IF EXISTS Person;

-- 1. Person table
CREATE TABLE Person (
    PersonId INT PRIMARY KEY AUTO_INCREMENT,
    FirstName VARCHAR(100),
    Email VARCHAR(255) UNIQUE,
    Cell VARCHAR(20)
);

-- 2. Interest table (lookup for likes)
CREATE TABLE Interest (
    InterestId INT PRIMARY KEY AUTO_INCREMENT,
    InterestName VARCHAR(50) NOT NULL UNIQUE
);

-- 3. PersonInterest junction table
CREATE TABLE PersonInterest (
    PersonId INT,
    InterestId INT,
    PRIMARY KEY (PersonId, InterestId),
    FOREIGN KEY (PersonId) REFERENCES Person(PersonId) ON DELETE CASCADE,
    FOREIGN KEY (InterestId) REFERENCES Interest(InterestId)
);

-- 4. Business table
CREATE TABLE Business (
    BusinessId INT PRIMARY KEY AUTO_INCREMENT,
    PersonId INT NOT NULL,
    BusinessName VARCHAR(200) NOT NULL,
    PartnerName VARCHAR(100),  -- Simple string for partner
    Status VARCHAR(50),
    DateFounded DATE,
    FOREIGN KEY (PersonId) REFERENCES Person(PersonId) ON DELETE CASCADE
);

SELECT 'Tables created!' AS Status;

## Question 4(b)(iv) [8 marks]

### Solution

**Q4(b)(iv) SOLUTION**

DATABASE MODEL EVALUATION:

| Aspect           | Document (MongoDB) | Relational (MySQL) | Graph/Linked Data  |
|------------------|--------------------|--------------------|--------------------|  
| Schema flex      | Excellent          | Poor               | Moderate           |
| Query complexity | Simple reads       | Complex queries OK | Pattern matching   |
| Data integrity   | Application        | Database enforced  | SHACL/ontology     |
| Scalability      | Horizontal         | Vertical           | Varies             |
| Joins/relations  | Embedding/$lookup  | Native JOINs       | Natural traversal  |

DOCUMENT (MongoDB):
  Pros: Natural fit, fast profile reads, flexible schema
  Cons: Hard cross-entity queries, no referential integrity

RELATIONAL (MySQL):
  Pros: Easy analytics, data integrity, familiar tooling
  Cons: More tables, joins for every query, rigid schema

GRAPH/LINKED DATA:
  Pros: Network queries, LOD integration, flexible
  Cons: Overkill for CRUD, fewer developers know SPARQL

QUESTIONS TO DECIDE:
| Question                              | If Yes -> Model |
|---------------------------------------|----------------|
| Complex relationship queries?         | Graph          |
| Stable, well-defined schema?          | Relational     |
| Frequently read full profiles?        | Document       |
| Need transactions across entities?    | Relational     |
| Integrate external linked data?       | Graph          |
| Rapid development priority?           | Document       |
| Need BI/reporting?                    | Relational     |

FOR THIS DATA:
- Social networking app -> Document (fast profile reads)
- Business analytics -> Relational (aggregations)
- Investment network analysis -> Graph (relationships)

---

# End of Solutions Notebook

All solutions have been provided. Compare with your attempts in the practice notebook!