# Exploring the Movie XML Questions

This notebook demonstrates:

- **Why** the sample `movies.xml` is *not well-formed*.
- **How** to parse the XML and see the error.
- **How** well-formedness differs from schema validity (checked against `movies.xsd`).


## 1) The XML Snippet (movies.xml)

```xml
<movie>
  <title>Citizen Kane</title>
  <cast>
    <actor>Orson Welles</actor>
    <actor role="Jebediah Leland">Joseph Cotton</actor>
</movie>
```

*The snippet never closes `<cast>`: the `</cast>` end-tag is missing before `</movie>`.*

**Question (e):** “Look at the data and associated XML schema fragments below. The XML below is not well‐formed. Why not?”

**Short Answer:** The `<cast>` element is not closed. That alone breaks well‐formedness.

## Let’s Try Parsing This XML With `lxml`

In [1]:
!pip install lxml

from lxml import etree



In [2]:
# We'll store the snippet in a variable
xml_snippet = """
<movie>
  <title>Citizen Kane</title>
  <cast>
    <actor>Orson Welles</actor>
    <actor role="Jebediah Leland">Joseph Cotton</actor>
</movie>
"""

try:
    root = etree.fromstring(xml_snippet)
    print("This should never print, because we expect an error about unclosed <cast>.")
except etree.XMLSyntaxError as e:
    print("XMLSyntaxError caught!")
    print("Reason:", e)

XMLSyntaxError caught!
Reason: Opening and ending tag mismatch: cast line 4 and movie, line 7, column 9 (<string>, line 7)


**Explanation**  
- The parser reaches `</movie>` while `<cast>` (opened on line 4 of the string) is still open, so it reports a tag mismatch. The missing `</cast>` violates well-formedness, and parsing stops there.

## 2) Well-Formedness Explanation

An XML document is well-formed if:

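1. Every start-tag has a matching end-tag (or the element is written as an empty-element tag such as `<cast/>`).
2. Elements nest properly (no overlapping).
3. There is exactly one root element, which contains all other elements.
4. Attribute values are quoted, and no attribute name is repeated within the same tag.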

**In our snippet**:  
- `<cast>` is never properly closed with `</cast>`, hence it is not well-formed.
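The first three rules can be tried out with a small sketch. It uses the standard library's `xml.etree.ElementTree` rather than `lxml` so that it runs without installing anything; the helper name `is_well_formed` and the sample strings are illustrative assumptions, not part of the exam question.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        return False, str(err)


# Hypothetical samples, one per rule, plus a well-formed control.
samples = {
    "unclosed start-tag": "<movie><cast><actor>Orson Welles</actor></movie>",
    "overlapping elements": "<movie><title><cast></title></cast></movie>",
    "two root elements": "<movie/><movie/>",
    "well-formed": "<movie><cast><actor>Orson Welles</actor></cast></movie>",
}

results = {}
for label, text in samples.items():
    ok, message = is_well_formed(text)
    results[label] = ok
    detail = "" if ok else f" ({message})"
    print(f"{label}: {'well-formed' if ok else 'NOT well-formed'}{detail}")
```

Only the last sample parses; the other three each break one of the rules listed above.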

## 3) The Provided movies.xsd

```xml
<xs:element name="movie">
  <xs:complexType>
    <xs:all maxOccurs="unbounded">
      <xs:element ref="cast"/>
      <xs:element ref="releaseYear"/>
      <xs:element ref="title"/>
    </xs:all>
  </xs:complexType>
</xs:element>

<xs:element name="cast">
  <xs:complexType>
    <xs:sequence>
      <xs:element maxOccurs="unbounded" ref="actor"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="actor">
  <xs:complexType mixed="true">
    <xs:attribute name="role"/>
  </xs:complexType>
</xs:element>

<xs:element name="releaseYear" type="xs:integer"/>
<xs:element name="title">
  <xs:complexType mixed="true">
    <xs:attribute name="lang" use="required"/>
  </xs:complexType>
</xs:element>
```

**Observations**:
- The `<title>` element must have a `lang` attribute (`use="required"`). Omitting it breaks *validity*, but not well-formedness.
- The schema also expects a `<releaseYear>` element, so omitting that also breaks *validity*.
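To make these two observations concrete, here is a rough hand-written check, assuming the snippet has already been made well-formed by closing `<cast>`. It is not XSD validation: the helper `find_schema_problems` is hypothetical and only mirrors the `lang` and `releaseYear` requirements from the schema, using the standard library's `xml.etree.ElementTree`.

```python
import re
import xml.etree.ElementTree as ET

# The original snippet with </cast> added, so it is well-formed.
well_formed_xml = """
<movie>
  <title>Citizen Kane</title>
  <cast>
    <actor>Orson Welles</actor>
    <actor role="Jebediah Leland">Joseph Cotton</actor>
  </cast>
</movie>
"""


def find_schema_problems(movie):
    """Hypothetical helper: hand-check two requirements taken from movies.xsd."""
    problems = []

    title = movie.find("title")
    if title is None:
        problems.append("missing <title>")
    elif "lang" not in title.attrib:
        problems.append("<title> lacks the required 'lang' attribute")

    release_year = movie.find("releaseYear")
    if release_year is None:
        problems.append("missing <releaseYear>")
    elif not re.fullmatch(r"[+-]?[0-9]+", (release_year.text or "").strip()):
        # Rough approximation of the xs:integer lexical form.
        problems.append("<releaseYear> is not an integer")

    return problems


problems = find_schema_problems(ET.fromstring(well_formed_xml))
for problem in problems:
    print("-", problem)
```

For this input the helper reports the missing `lang` attribute and the missing `<releaseYear>`, which are the two validity problems noted above. A real validator checks far more (content models, element order, unexpected elements), so this is only a teaching aid.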

The cells below show that XML can be well-formed and still fail **schema** validation.

In [3]:
# Let's define a corrected but incomplete XML:
corrected_xml = """
<movie>
  <title lang="en">Citizen Kane</title>
  <cast>
    <actor>Orson Welles</actor>
    <actor role="Jebediah Leland">Joseph Cotton</actor>
  </cast>
</movie>
"""

xsd_content = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  elementFormDefault="qualified">

  <xsd:element name="movie">
    <xsd:complexType>
      <xsd:all maxOccurs="unbounded">
        <xsd:element ref="cast"/>
        <xsd:element ref="releaseYear"/>
        <xsd:element ref="title"/>
      </xsd:all>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="cast">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="actor" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="actor">
    <xsd:complexType mixed="true">
      <xsd:attribute name="role"/>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="releaseYear" type="xsd:integer"/>
  <xsd:element name="title">
    <xsd:complexType mixed="true">
      <xsd:attribute name="lang" use="required"/>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
"""

In [4]:
from lxml import etree

# Parse the corrected XML
xml_doc = etree.fromstring(corrected_xml)

# Parse the XSD
xsd_doc = etree.fromstring(xsd_content)
schema = etree.XMLSchema(xsd_doc)

# Now let's see if it is valid
if schema.validate(xml_doc):
    print("XML is valid according to movies.xsd!")
else:
    print("XML is NOT valid according to movies.xsd.")
    for error in schema.error_log:
        print(error.message)

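XMLSchemaParseError: Element '{http://www.w3.org/2001/XMLSchema}all', attribute 'maxOccurs': The value 'unbounded' is not valid. Expected is '1'., line 7

**Explanation**  
- The error is raised while compiling the schema, before any document is validated. `lxml` relies on libxml2, which implements XSD 1.0, and XSD 1.0 only allows `maxOccurs="1"` on `<xs:all>`.
- The next cell replaces `<xsd:all maxOccurs="unbounded">` with `<xsd:sequence maxOccurs="unbounded">` so that the schema compiles. Note that a sequence, unlike `all`, fixes the order of the child elements.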

In [5]:
xsd_content = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  elementFormDefault="qualified">

  <xsd:element name="movie">
    <xsd:complexType>
      <xsd:sequence maxOccurs="unbounded">
        <xsd:element ref="cast"/>
        <xsd:element ref="releaseYear"/>
        <xsd:element ref="title"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="cast">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="actor" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="actor">
    <xsd:complexType mixed="true">
      <xsd:attribute name="role"/>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="releaseYear" type="xsd:integer"/>
  <xsd:element name="title">
    <xsd:complexType mixed="true">
      <xsd:attribute name="lang" use="required"/>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
"""

In [6]:
# Parse the XSD
xsd_doc = etree.fromstring(xsd_content)
schema = etree.XMLSchema(xsd_doc)

# Now let's see if it is valid
if schema.validate(xml_doc):
    print("XML is valid according to movies.xsd!")
else:
    print("XML is NOT valid according to movies.xsd.")
    for error in schema.error_log:
        print(error.message)

XML is NOT valid according to movies.xsd.
Element 'title': This element is not expected. Expected is ( cast ).


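**Explanation**  
- We made the XML well-formed by closing `<cast>`, and added `lang="en"` to `<title>`.
- Validation still fails, but not with the message one might expect. Because the repaired schema uses `<xsd:sequence>`, the children of `<movie>` must appear in the order `cast`, `releaseYear`, `title`. The validator stops at the first violation: `<title>` appears where `<cast>` is expected.
- The document also omits `<releaseYear>`, which the schema requires, so fixing the order alone would not make it valid.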
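For contrast, the sketch below validates a document that should satisfy the repaired (`xsd:sequence`) schema, alongside one that has every required element but puts `<title>` first. It assumes `lxml` is installed; the `check` helper is hypothetical, and the release year 1941 is added only for illustration.

```python
from lxml import etree

# The repaired schema from the cell above (xsd:sequence instead of xsd:all).
XSD = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  elementFormDefault="qualified">
  <xsd:element name="movie">
    <xsd:complexType>
      <xsd:sequence maxOccurs="unbounded">
        <xsd:element ref="cast"/>
        <xsd:element ref="releaseYear"/>
        <xsd:element ref="title"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="cast">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="actor" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="actor">
    <xsd:complexType mixed="true">
      <xsd:attribute name="role"/>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="releaseYear" type="xsd:integer"/>
  <xsd:element name="title">
    <xsd:complexType mixed="true">
      <xsd:attribute name="lang" use="required"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
"""

CAST = """
  <cast>
    <actor>Orson Welles</actor>
    <actor role="Jebediah Leland">Joseph Cotton</actor>
  </cast>"""
YEAR = "\n  <releaseYear>1941</releaseYear>"
TITLE = '\n  <title lang="en">Citizen Kane</title>'

# Same elements, different order.
in_schema_order = f"<movie>{CAST}{YEAR}{TITLE}\n</movie>"
title_first = f"<movie>{TITLE}{CAST}{YEAR}\n</movie>"

schema = etree.XMLSchema(etree.fromstring(XSD))


def check(label, xml_text):
    """Hypothetical helper: validate one document and print any errors."""
    ok = schema.validate(etree.fromstring(xml_text))
    print(f"{label}: {'valid' if ok else 'NOT valid'}")
    if not ok:
        for error in schema.error_log:
            print("   ", error.message)
    return ok


order_ok = check("cast, releaseYear, title", in_schema_order)
title_first_ok = check("title, cast, releaseYear", title_first)
```

The two documents differ only in element order, which shows that swapping `all` for `sequence` changes what the schema accepts, not just whether it compiles.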

## Summary

**(e) Why is the original snippet not well-formed?**
- Because `<cast>` is not closed.

**(f) Why is the XML not valid (excluding the well-formedness problem)?**
- The schema requires a `lang` attribute on `<title>`, which the original snippet omits.
- The schema requires a `<releaseYear>` element, which is also missing.
- Element order matters only when the content model is `<xs:sequence>`; with `<xs:all>` the children may appear in any order. As written, `<xs:all maxOccurs="unbounded">` is itself rejected under XSD 1.0.

In short:

- Not well-formed: the `<cast>` tag is never closed.
- Not valid: the required `lang` attribute on `<title>` and the required `<releaseYear>` element are missing.