<a href="https://colab.research.google.com/github/sreent/data-management-intro/blob/main/Lectures/CM3010%20March%202023.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Section 0: Environment Setup (Colab + MySQL + Python Packages)

In [1]:
# Install MySQL (if in Colab/Ubuntu environment), start the service
!apt -qq update > /dev/null
!apt -y -qq install mysql-server > /dev/null
!service mysql start

# Create user & DB for demonstration
!mysql -e "CREATE USER IF NOT EXISTS 'musicuser'@'localhost' IDENTIFIED BY 'musicpass';"
!mysql -e "CREATE DATABASE IF NOT EXISTS MusicDemo;"
!mysql -e "GRANT ALL PRIVILEGES ON MusicDemo.* TO 'musicuser'@'localhost';"

# Install Python libs for SQL, lxml (XML parsing), rdflib (RDF)
!pip install -q sqlalchemy==2.0.20 ipython-sql==0.5.0 pymysql==1.1.0 prettytable==2.0.0 lxml rdflib

# Load SQL extension
%reload_ext sql

# Connect to MusicDemo DB
%sql mysql+pymysql://musicuser:musicpass@localhost/MusicDemo
print("MySQL environment ready. Connected to MusicDemo database.")



 * Starting MySQL database server mysqld
   ...done.
MySQL environment ready. Connected to MusicDemo database.


## Section 1: Relational DB (Band Membership) - Q3

**Question 3** models an **Artist** (a band or a person) and a **Membership** table that links members to bands. We store sample data, then run a query that finds anomalies (members whose start date precedes the band’s founding date). The same band/member structure reappears in the MusicBrainz RDF example in Section 4.

### 1.1 Create Tables for Band Membership

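We use a **composite primary key**, `(BandID, MemberID)`, in the `Membership` table, so each band/member pair can appear at most once. Both columns are also foreign keys into `Artist`. With this key, a member who leaves and later rejoins the same band cannot be stored as two rows.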

In [None]:
%%sql
DROP TABLE IF EXISTS Membership;
DROP TABLE IF EXISTS Artist;

CREATE TABLE Artist (
  ArtistID     INT PRIMARY KEY,
  Name         VARCHAR(100) NOT NULL,
  Type         VARCHAR(20)  NOT NULL,  -- 'Person' or 'MusicGroup'
  FoundingDate DATE
);

CREATE TABLE Membership (
  BandID   INT NOT NULL,
  MemberID INT NOT NULL,
  StartDate DATE,
  RoleName  VARCHAR(100),
  PRIMARY KEY (BandID, MemberID),
  FOREIGN KEY (BandID)   REFERENCES Artist(ArtistID),
  FOREIGN KEY (MemberID) REFERENCES Artist(ArtistID)
);
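
To see what the composite key and the foreign keys actually enforce, here is a minimal sketch that recreates the two tables in an in-memory SQLite database. SQLite stands in for MySQL purely as an assumption for illustration (column types are simplified to SQLite's), and `try_insert` is a hypothetical helper, not part of the notebook.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL tables (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE Artist (
  ArtistID     INTEGER PRIMARY KEY,
  Name         TEXT NOT NULL,
  Type         TEXT NOT NULL,
  FoundingDate TEXT
);
CREATE TABLE Membership (
  BandID    INTEGER NOT NULL,
  MemberID  INTEGER NOT NULL,
  StartDate TEXT,
  RoleName  TEXT,
  PRIMARY KEY (BandID, MemberID),
  FOREIGN KEY (BandID)   REFERENCES Artist(ArtistID),
  FOREIGN KEY (MemberID) REFERENCES Artist(ArtistID)
);
INSERT INTO Artist VALUES (1, 'BTS', 'MusicGroup', '2013-06-13');
INSERT INTO Artist VALUES (2, 'JIN', 'Person', NULL);
INSERT INTO Membership VALUES (1, 2, '2013-06-13', 'Vocalist');
""")

def try_insert(row):
    """Hypothetical helper: True if the Membership row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO Membership VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_pair_ok = try_insert((1, 2, "2015-01-01", "Dancer"))   # same (BandID, MemberID) pair
unknown_member_ok = try_insert((1, 99, "2013-06-13", "Rapper"))  # no Artist row with ID 99
print("Duplicate pair accepted:", duplicate_pair_ok)
print("Unknown member accepted:", unknown_member_ok)
```

Both inserts are rejected: the first by the composite primary key, the second by the foreign key on `MemberID`.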

### 1.2 Insert Sample Data

We’ll insert **one** band `BTS` (ID=1) founded `2013-06-13`, another band `AnotherBand` (ID=3) founded `2020-01-01`, plus two people (`JIN` ID=2, `Alice` ID=4). Then we link them via `Membership`.

In [None]:
%%sql
-- Insert sample artists
INSERT INTO Artist (ArtistID, Name, Type, FoundingDate)
VALUES
  (1, 'BTS', 'MusicGroup', '2013-06-13'),
  (2, 'JIN', 'Person', NULL),
  (3, 'AnotherBand', 'MusicGroup', '2020-01-01'),
  (4, 'Alice', 'Person', NULL);

-- Insert memberships
INSERT INTO Membership (BandID, MemberID, StartDate, RoleName)
VALUES
  (1, 2, '2013-06-13', 'Vocalist'),  -- JIN in BTS
  (3, 4, '2019-12-31', 'Guitarist'); -- Alice in AnotherBand (slightly before founding)

### 1.3 Verify Data

In [None]:
%%sql
SELECT * FROM Artist;

SELECT * FROM Membership;

We should see:

- **Artist**:

| ArtistID | Name         | Type        | FoundingDate |
|----------|------------- |------------ |------------- |
| 1        | BTS          | MusicGroup  | 2013-06-13   |
| 2        | JIN          | Person      | NULL         |
| 3        | AnotherBand  | MusicGroup  | 2020-01-01   |
| 4        | Alice        | Person      | NULL         |

- **Membership**:

| BandID | MemberID | StartDate   | RoleName   |
|--------|----------|------------ |----------- |
| 1      | 2        | 2013-06-13  | Vocalist   |
| 3      | 4        | 2019-12-31  | Guitarist  |

### 1.4 Query: Check for Anomalies (StartDate < FoundingDate)

**Exam Q3** sometimes asks for a query that finds members who joined before the band’s official founding date:

In [None]:
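%%sql
-- Artist is joined twice: once as the band (b), once as the member (m)
SELECT m.Name  AS MemberName,
       b.Name  AS BandName,
       ms.StartDate,
       b.FoundingDate
FROM Membership ms
JOIN Artist b ON ms.BandID   = b.ArtistID
JOIN Artist m ON ms.MemberID = m.ArtistID
WHERE ms.StartDate < b.FoundingDate;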

We expect to see:

| MemberName | BandName      | StartDate   | FoundingDate |
|------------|-------------- |------------ |------------- |
| Alice      | AnotherBand   | 2019-12-31  | 2020-01-01   |

This flags a **potential** data error (Alice can’t join a band that didn’t exist yet).
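
The anomaly check (a membership whose `StartDate` precedes the band's `FoundingDate`) can be sketched as a self-join on `Artist`. The version below runs against an in-memory SQLite database rather than MySQL, which is an assumption made so the join logic can be tried without a server. Dates are stored as ISO-8601 text, which compares correctly as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MusicDemo (assumption)
conn.executescript("""
CREATE TABLE Artist (ArtistID INTEGER PRIMARY KEY, Name TEXT, Type TEXT, FoundingDate TEXT);
CREATE TABLE Membership (BandID INTEGER, MemberID INTEGER, StartDate TEXT, RoleName TEXT,
                         PRIMARY KEY (BandID, MemberID));
INSERT INTO Artist VALUES
  (1, 'BTS', 'MusicGroup', '2013-06-13'), (2, 'JIN', 'Person', NULL),
  (3, 'AnotherBand', 'MusicGroup', '2020-01-01'), (4, 'Alice', 'Person', NULL);
INSERT INTO Membership VALUES
  (1, 2, '2013-06-13', 'Vocalist'), (3, 4, '2019-12-31', 'Guitarist');
""")

# Artist appears twice: alias b is the band, alias m is the member.
anomaly_sql = """
SELECT m.Name AS MemberName, b.Name AS BandName, ms.StartDate, b.FoundingDate
FROM Membership ms
JOIN Artist b ON ms.BandID   = b.ArtistID
JOIN Artist m ON ms.MemberID = m.ArtistID
WHERE ms.StartDate < b.FoundingDate
"""
anomalies = conn.execute(anomaly_sql).fetchall()
for row in anomalies:
    print(row)
```

JIN's membership starts on the founding date itself, so the strict `<` comparison leaves it out.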

## Section 2: (Optional) 16th-Century Music Model - Q4

**Question 4** focuses on storing **pages, lines, coordinates, regions,** etc. We’ll illustrate a minimal schema that parallels that scenario. If you prefer to focus only on band membership (Q3), skip this section.

### 2.1 Create Tables (Piece, Page, Region, Part, Line)

In [None]:
%%sql
DROP TABLE IF EXISTS Line;
DROP TABLE IF EXISTS InstrumentOrVoicePart;
DROP TABLE IF EXISTS Region;
DROP TABLE IF EXISTS Page;
DROP TABLE IF EXISTS Piece;

CREATE TABLE Piece (
  PieceID  INT PRIMARY KEY,
  Title    VARCHAR(100)
);

CREATE TABLE Page (
  PageID   INT PRIMARY KEY,
  BookID   VARCHAR(50)
);

CREATE TABLE Region (
  RegionID INT PRIMARY KEY,
  PageID   INT,
  Description VARCHAR(100),
  FOREIGN KEY (PageID) REFERENCES Page(PageID)
);

CREATE TABLE InstrumentOrVoicePart (
  PartID   INT PRIMARY KEY,
  PartName VARCHAR(100)
);

CREATE TABLE Line (
  LineID    INT PRIMARY KEY,
  PieceID   INT,
  PageID    INT,
  RegionID  INT,
  PartID    INT,
  LineOrder INT,
  XCoord    FLOAT,
  YCoord    FLOAT,
  FOREIGN KEY (PieceID)  REFERENCES Piece(PieceID),
  FOREIGN KEY (PageID)   REFERENCES Page(PageID),
  FOREIGN KEY (RegionID) REFERENCES Region(RegionID),
  FOREIGN KEY (PartID)   REFERENCES InstrumentOrVoicePart(PartID)
);

### 2.2 Insert Sample Data

Let’s simulate one **Piece** whose lines span multiple pages and regions.

In [None]:
%%sql
INSERT INTO Piece (PieceID, Title)
VALUES (101, 'Renaissance Madrigal');

INSERT INTO Page (PageID, BookID) VALUES
 (501, 'BookA'),
 (502, 'BookA'); -- Two pages from same book

INSERT INTO Region (RegionID, PageID, Description) VALUES
 (601, 501, 'Top half'),
 (602, 501, 'Bottom half'),
 (603, 502, 'Full page');

INSERT INTO InstrumentOrVoicePart (PartID, PartName)
VALUES (701, 'Soprano'), (702, 'Tenor');

INSERT INTO Line (LineID, PieceID, PageID, RegionID, PartID, LineOrder, XCoord, YCoord)
VALUES
  (801, 101, 501, 601, 701, 1,  10, 100),
  (802, 101, 501, 601, 702, 2,  15, 105),
  (803, 101, 501, 602, 701, 3,  5,  200),
  (804, 101, 502, 603, 702, 1,  20, 300);

We have a single piece (`PieceID=101`), multiple lines across pages (501, 502), some lines in top/bottom region, and two voice parts (Soprano, Tenor).

### 2.3 Query: List Pieces with Total Number of Lines

Mimicking **Q4(d)**, we want to see how many lines each piece has:


In [None]:
%%sql
SELECT p.Title, COUNT(*) AS TotalLines
FROM Piece p
JOIN Line l ON p.PieceID = l.PieceID
GROUP BY p.PieceID, p.Title;  -- group by the key so pieces sharing a title stay separate

Expected result:

| Title                 | TotalLines |
|-----------------------|----------- |
| Renaissance Madrigal  | 4         |
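
One thing the inner join hides is that a piece with no lines drops out of the result entirely. The sketch below contrasts it with a `LEFT JOIN`, using in-memory SQLite as a stand-in for MySQL (an assumption) and a second, hypothetical piece `'Untranscribed Motet'` that has no lines. The tables are cut down to the columns the count needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in (assumption)
conn.executescript("""
CREATE TABLE Piece (PieceID INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE Line  (LineID INTEGER PRIMARY KEY, PieceID INTEGER REFERENCES Piece(PieceID));
INSERT INTO Piece VALUES (101, 'Renaissance Madrigal'), (102, 'Untranscribed Motet');
INSERT INTO Line  VALUES (801, 101), (802, 101), (803, 101), (804, 101);
""")

# Inner join: only pieces that have at least one line appear.
inner_counts = conn.execute("""
    SELECT p.Title, COUNT(*) AS TotalLines
    FROM Piece p JOIN Line l ON p.PieceID = l.PieceID
    GROUP BY p.PieceID, p.Title
    ORDER BY p.PieceID
""").fetchall()

# LEFT JOIN keeps every piece; COUNT(l.LineID) skips the NULLs of unmatched rows.
outer_counts = conn.execute("""
    SELECT p.Title, COUNT(l.LineID) AS TotalLines
    FROM Piece p LEFT JOIN Line l ON p.PieceID = l.PieceID
    GROUP BY p.PieceID, p.Title
    ORDER BY p.PieceID
""").fetchall()

print(inner_counts)
print(outer_counts)
```

Which form is wanted depends on whether "pieces with zero lines" should be reported.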

### 2.4 Discussion: Relational vs. XML / Document DB (Q4(e))

- **Relational** is great for counting, grouping, strong constraints (PK, FK).  
- **XML** or document DB might be better if the layout is deeply nested or if structure is highly variable.
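
For comparison, the same sample data can be written as one nested document. The sketch below is only an illustration: the element and attribute names (`piece`, `page`, `region`, `line`, `part`, `x`, `y`) are hypothetical and not taken from the exam, and it uses the standard library's `xml.etree.ElementTree` rather than lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested layout: piece > page > region > line.
piece_xml = """\
<piece id="101" title="Renaissance Madrigal">
  <page id="501" book="BookA">
    <region id="601" description="Top half">
      <line id="801" part="Soprano" x="10" y="100"/>
      <line id="802" part="Tenor" x="15" y="105"/>
    </region>
    <region id="602" description="Bottom half">
      <line id="803" part="Soprano" x="5" y="200"/>
    </region>
  </page>
  <page id="502" book="BookA">
    <region id="603" description="Full page">
      <line id="804" part="Tenor" x="20" y="300"/>
    </region>
  </page>
</piece>
"""

piece = ET.fromstring(piece_xml)

# Counting means walking the tree instead of using GROUP BY.
total_lines = len(piece.findall(".//line"))
lines_per_page = {
    page.get("id"): len(page.findall(".//line"))
    for page in piece.findall("page")
}
print(total_lines, lines_per_page)
```

The nesting makes the page/region/line containment explicit. Nothing like a foreign key checks that a `part` value refers to a known voice part, though.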

## Section 3: XML Parsing & RelaxNG (Q2)

**Question 2** deals with XML well‐formedness vs. validity, namespaces, etc. We demonstrate **lxml** usage for:

1. Parsing an XML snippet with ODF-like namespaces.  
2. Checking well‐formedness.  
3. Validating with a RelaxNG schema.  
4. Demonstrating an **XPath** query to show how `//text:list-item/text:p` differs from `//text:list//text:p`.

### 3.1 Minimal RelaxNG Schema

We define a small `.rng` schema in which `<office:document>` contains zero or more `<text:list>` elements, each `<text:list>` contains zero or more `<text:list-item>` elements, and each `<text:list-item>` contains zero or more `<text:p>` elements holding text.

In [None]:
mini_schema_content = """\
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
         xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <start>
    <element name="office:document">
      <zeroOrMore>
        <ref name="listElement"/>
      </zeroOrMore>
    </element>
  </start>

  <define name="listElement">
    <element name="text:list">
      <zeroOrMore>
        <element name="text:list-item">
          <zeroOrMore>
            <element name="text:p">
              <text/>
            </element>
          </zeroOrMore>
        </element>
      </zeroOrMore>
    </element>
  </define>
</grammar>
"""

with open("mini_schema.rng", "w", encoding="utf-8") as f:
    f.write(mini_schema_content)

print("Wrote mini_schema.rng with both office: and text: namespaces.")

### 3.2 Sample Valid XML (Q2 Example)

In [None]:
xml_data_valid = """\
<office:document xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
                 xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <text:list>
    <text:list-item>
      <text:p>Trees</text:p>
      <text:p>Graphs</text:p>
    </text:list-item>
    <text:list-item>
      <text:p>Relations</text:p>
    </text:list-item>
  </text:list>
</office:document>
"""

print(xml_data_valid)


### 3.3 Parsing (Well‐Formedness)

In [None]:
from lxml import etree

try:
    xml_root = etree.fromstring(xml_data_valid.encode("utf-8"))
    print("XML is well-formed. Root tag =", xml_root.tag)
except etree.XMLSyntaxError as e:
    print("XML is NOT well-formed:", e)
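
The cell above only exercises the success path. As a minimal sketch of the failure path, the snippet below feeds two malformed inputs to a parser. It uses the standard library's `xml.etree.ElementTree` (whose error class is `ParseError`) instead of lxml, purely so it runs without extra packages. `is_well_formed` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Hypothetical helper: return (True, None) if the text parses, else (False, message)."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as exc:
        return False, str(exc)

good = "<list><item>Trees</item></list>"
bad_nesting = "<list><item>Trees</list></item>"  # tags closed in the wrong order
bad_prefix = "<text:p>Trees</text:p>"            # prefix used without an xmlns:text declaration

for label, doc in [("good", good), ("bad_nesting", bad_nesting), ("bad_prefix", bad_prefix)]:
    ok, message = is_well_formed(doc)
    print(label, "->", "well-formed" if ok else f"NOT well-formed ({message})")
```

Both failures are well-formedness errors: the parser rejects them before any schema is consulted, which is the distinction Q2 draws between well-formed and valid.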

### 3.4 RelaxNG Validation (Check Validity)

In [None]:
rng_doc = etree.parse("mini_schema.rng")
relaxng = etree.RelaxNG(rng_doc)
print("Loaded mini_schema.rng successfully!")

if relaxng.validate(xml_root):
    print("Document is VALID according to RelaxNG.")
else:
    print("INVALID. Errors:")
    for err in relaxng.error_log:
        print(err.message, err.line)

### 3.5 Invalid Element Example (Q2(h))

In [None]:
xml_data_invalid = """\
<office:document xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
                 xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <text:list>
    <text:list-item>
      <text:p>Trees</text:p>
    </text:list-item>
    <text:invalid-element>Whoops</text:invalid-element>
  </text:list>
</office:document>
"""

xml_bad_root = etree.fromstring(xml_data_invalid.encode("utf-8"))
if relaxng.validate(xml_bad_root):
    print("Unexpectedly valid!")
else:
    print("As expected, document is INVALID.")
    for err in relaxng.error_log:
        print("*", err.message, "(line:", err.line, ")")

We see an error because `<text:invalid-element>` is not defined by the schema.

### 3.6 XPath Demonstration (Q2(d))

In [None]:
tree = etree.fromstring(xml_data_valid.encode("utf-8"))

# Direct children of <text:list-item>
xpath1 = tree.xpath("//text:list-item/text:p",
                    namespaces={"text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0"})
print("Results of //text:list-item/text:p => direct child <text:p> of <text:list-item>")
for node in xpath1:
    print("-", node.text)

# All descendant <text:p> under <text:list>
xpath2 = tree.xpath("//text:list//text:p",
                    namespaces={"text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0"})
print("\nResults of //text:list//text:p => all <text:p> descendants of <text:list>")
for node in xpath2:
    print("-", node.text)

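Since every `<text:p>` in this sample is a direct child of a `<text:list-item>`, both queries return **Trees, Graphs, Relations**. They differ as soon as a `<text:p>` sits somewhere under `<text:list>` without being a direct child of a `<text:list-item>`: only `//text:list//text:p` matches it.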
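
Here is a small sketch of a case where the two expressions give different results. It adds a `<text:list-header>` wrapper around one paragraph; that element is not part of `mini_schema.rng`, so this document would not validate against the schema above and is used only to show the XPath difference. The sketch uses the standard library's `xml.etree.ElementTree`, whose limited XPath needs paths relative to the root (`.//` instead of `//`).

```python
import xml.etree.ElementTree as ET

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}

xml_with_header = """\
<office:document xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
                 xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <text:list>
    <text:list-header>
      <text:p>Topics</text:p>
    </text:list-header>
    <text:list-item>
      <text:p>Trees</text:p>
    </text:list-item>
  </text:list>
</office:document>
"""

root = ET.fromstring(xml_with_header)

# Child axis: only <text:p> elements whose parent is a <text:list-item>.
direct = [p.text for p in root.findall(".//text:list-item/text:p", NS)]

# Descendant axis: every <text:p> anywhere below a <text:list>.
descendants = [p.text for p in root.findall(".//text:list//text:p", NS)]

print("list-item/p :", direct)
print("list//p     :", descendants)
```

The header paragraph is reached only through the descendant step, so the second list has one more entry than the first.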

## Section 4: RDF (MusicBrainz BTS Example) & SPARQL (Q3)

**Question 3** also discusses Linked Data. Let’s show a small **Turtle** snippet that references **BTS** with founding date, members JIN & SUGA, using `schema:member` in a **role-based** structure:

### 4.1 BTS Turtle Snippet

In [None]:
import rdflib

bts_ttl_data = """\
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix schema: <http://schema.org/> .
@prefix mba: <http://musicbrainz.org/artist/> .

mba:bts a schema:MusicGroup ;
  schema:foundingDate "2013-06-13"^^xsd:date ;
  schema:member [
    a schema:OrganizationRole ;
    schema:member mba:jin ;
    schema:startDate "2013-06-13"^^xsd:date
  ],
  [
    a schema:OrganizationRole ;
    schema:member mba:suga ;
    schema:startDate "2013-06-13"^^xsd:date
  ] ;
  schema:name "BTS" .

mba:jin a schema:MusicGroup, schema:Person ;
  schema:name "JIN" .

mba:suga a schema:MusicGroup, schema:Person ;
  schema:name "SUGA" .
"""

g = rdflib.Graph()
g.parse(data=bts_ttl_data, format="turtle")
print("RDF graph loaded with", len(g), "triples.")


Here, **BTS** is typed as a `schema:MusicGroup`. Its `schema:member` property points to blank nodes of type `schema:OrganizationRole`. Each role node carries a `schema:startDate` and its own `schema:member` pointing to `mba:jin` or `mba:suga`. `JIN` and `SUGA` are typed as both `schema:MusicGroup` and `schema:Person` (as in MusicBrainz RDF exports).

### 4.2 SPARQL Query (Q3(f,g))

We want each **member name** plus **start date**:

In [None]:
q_sparql = """
PREFIX mba: <http://musicbrainz.org/artist/>
PREFIX schema: <http://schema.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?memberName ?start
WHERE {
  mba:bts schema:member ?role .
  ?role schema:startDate ?start ;
        schema:member ?person .
  ?person schema:name ?memberName .
}
"""

for row in g.query(q_sparql):
    print("Member name:", row.memberName, "| Start date:", row.start)

Expected (row order may vary, since the query has no `ORDER BY`):

```
Member name: JIN | Start date: 2013-06-13
Member name: SUGA | Start date: 2013-06-13
```
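
To make the role-based pattern concrete, here is a plain-Python sketch of how the four triple patterns chain through the intermediate `?role` node. It does not use rdflib or a SPARQL engine. The triples are copied by hand from the Turtle snippet (type triples omitted), the blank nodes are given the placeholder labels `_:r1` and `_:r2`, and `objects` is a hypothetical helper.

```python
SCHEMA = "http://schema.org/"
MBA = "http://musicbrainz.org/artist/"

# (subject, predicate, object) triples; "_:r1" and "_:r2" stand in for the blank role nodes.
triples = [
    (MBA + "bts", SCHEMA + "member", "_:r1"),
    (MBA + "bts", SCHEMA + "member", "_:r2"),
    ("_:r1", SCHEMA + "member", MBA + "jin"),
    ("_:r1", SCHEMA + "startDate", "2013-06-13"),
    ("_:r2", SCHEMA + "member", MBA + "suga"),
    ("_:r2", SCHEMA + "startDate", "2013-06-13"),
    (MBA + "jin", SCHEMA + "name", "JIN"),
    (MBA + "suga", SCHEMA + "name", "SUGA"),
]

def objects(subject, predicate):
    """Hypothetical helper: all objects o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in triples if s == subject and p == predicate]

rows = []
for role in objects(MBA + "bts", SCHEMA + "member"):       # mba:bts schema:member ?role
    for start in objects(role, SCHEMA + "startDate"):      # ?role schema:startDate ?start
        for person in objects(role, SCHEMA + "member"):    # ?role schema:member ?person
            for name in objects(person, SCHEMA + "name"):  # ?person schema:name ?memberName
                rows.append((name, start))

rows.sort()  # impose an order for display; SPARQL itself guarantees none here
for name, start in rows:
    print("Member name:", name, "| Start date:", start)
```

Note that `schema:member` is followed twice: once from the band to the role node, and once from the role node to the person.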

### 4.3 Database Dumps vs. Linked Data (Q3(k))

- **Database Dump**:  
  - **Pros**: Complete offline snapshot, no reliance on remote servers, can do big local queries.  
  - **Cons**: Can become outdated quickly, large disk usage.  

- **Linked Data**:  
  - **Pros**: Reflects the publisher’s current data at query time, easy to integrate with other RDF graphs.  
  - **Cons**: Dependent on network availability/performance, less local control.  