# Semistructured Data

## XML Data

- Standard for data representation and exchange;
- Document format is similar to HTML, but tags describe content instead of formatting;
- Streaming format;

### Basic Constructs

- Tagged elements (nested);
- Attributes;
- Text;
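
The three constructs can be seen by parsing a small document. The sketch below uses Python's standard `xml.etree.ElementTree`; the one-book document is a hypothetical sample modeled on the bookstore data used later in these notes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one element with attributes, nested elements, and text.
doc = """
<Book ISBN="ISBN-0-13-713526-2" Price="85">
    <Title>A First Course in Database Systems</Title>
    <Remark>A great deal!</Remark>
</Book>
""".strip()

book = ET.fromstring(doc)

print(book.tag)                        # element name: Book
print(book.attrib["Price"])            # attribute value, always a string: 85
print([child.tag for child in book])   # nested elements, in document order
print(book.find("Title").text)         # text content of a nested element
```

Note that the parser hands back attribute values as plain strings; nothing at this level says that `Price` is a number.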

### Relational Model vs XML

Aspect        |Relational            |XML
--------------|----------------------|--------------------------
Structure     |Tables                |Hierarchical (tree, graph)
Schema        |Fixed in advance      |Flexible, self-describing
Queries       |Simple, nice languages|Less so
Ordering      |None                  |Implied
Implementation|Native                |Add-on

### Well-Formed XML

- Adheres to basic structural requirements:
    - single root element;
    - matched tags, proper nesting;
    - unique attributes within elements;
    
```
XML Document => XML Parser => Parsed XML or "not well-formed"
```
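
As a minimal sketch of the pipeline above, a non-validating parser can be used as a well-formedness check: either it returns a parsed tree or it raises an error. The helper name `is_well_formed` is made up for this example; the parser is Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<a><b x='1'>hi</b></a>"))  # True: single root, proper nesting
print(is_well_formed("<a><b></a></b>"))          # False: tags not properly nested
print(is_well_formed("<a x='1' x='2'/>"))        # False: duplicate attribute
print(is_well_formed("<a/><b/>"))                # False: two root elements
```

Each rejected input breaks exactly one of the three requirements listed above.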

### Displaying XML

- Use rule-based language to translate to HTML:
    - Cascading Style Sheets (CSS);
    - Extensible Stylesheet Language (XSL);
    
```
Parsed XML Document => CSS/XSL interpreter => HTML document
                            ^
                   rules____|                 
```
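
To illustrate the idea of rules applied to a parsed document, here is a toy translator in Python. It is neither CSS nor XSL: the `RULES` table and the `to_html` function are invented for this sketch, and the "rules" are just a mapping from XML element names to HTML tags.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical rules: XML element name -> HTML tag used to display it.
RULES = {"Bookstore": "ul", "Book": "li", "Title": "b", "Remark": "i"}

def to_html(elem: ET.Element) -> str:
    """Recursively translate an element; elements without a rule become <span>."""
    tag = RULES.get(elem.tag, "span")
    parts = [escape(elem.text or "")]
    for child in elem:
        parts.append(to_html(child))
        parts.append(escape(child.tail or ""))  # text that follows the child
    return f"<{tag}>{''.join(parts)}</{tag}>"

doc = "<Bookstore><Book><Title>A First Course in Database Systems</Title></Book></Bookstore>"
print(to_html(ET.fromstring(doc)))
# <ul><li><b>A First Course in Database Systems</b></li></ul>
```

A real stylesheet language offers much richer matching (paths, conditions, attribute access), but the flow is the same: parsed XML in, rules applied, HTML out.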

### DTDs, IDs, IDREFs

#### Valid XML

- Adheres to basic structural requirements;
- Adheres to content-specific specification:
    - Document Type Definition (DTD);
    - XML Schema (XSD);
    
```
XML Document => Validating XML Parser => Parsed XML or "not valid"
                            ^
             DTD or XSD ____|  
```

#### Document Type Definition (DTD)

- Grammar-like language for specifying elements, attributes, nesting, ordering, and number of occurrences;
- Also provides the special attribute types ID and IDREF(S);


- Document with a DTD (the document first, then its DTD):

```xml
<!--  Bookstore with DTD                     -->
<!--  Try changes:                           -->
<!--    Make Edition required                -->
<!--    Swap order of First_Name, Last_Name  -->
<!--    Add empty Remark                     -->
<!--    Add Magazine, omit closing tag       -->
<Bookstore>
    <Book ISBN="ISBN-0-13-713526-2" Price="85" Edition="3rd">
        <Title>A First Course in Database Systems</Title>
        <Authors>
            <Author>
                <First_Name>Jeffrey</First_Name>
                <Last_Name>Ullman</Last_Name>
            </Author>
            <Author>
                <First_Name>Jennifer</First_Name>
                <Last_Name>Widom</Last_Name>
            </Author>
        </Authors>
    </Book>
    <Book ISBN="ISBN-0-13-815504-6" Price="100">
        <Title>Database Systems: The Complete Book</Title>
        <Authors>
            <Author>
                <First_Name>Hector</First_Name>
                <Last_Name>Garcia-Molina</Last_Name>
            </Author>
            <Author>
                <First_Name>Jeffrey</First_Name>
                <Last_Name>Ullman</Last_Name>
            </Author>
            <Author>
                <First_Name>Jennifer</First_Name>
                <Last_Name>Widom</Last_Name>
            </Author>
        </Authors>
        <Remark> Buy this book bundled with "A First Course" - a great deal! </Remark>
    </Book>
</Bookstore>
```

```xml
<!--  Bookstore DTD               -->
<!DOCTYPE Bookstore [
    <!ELEMENT Bookstore (Book | Magazine)*>
    <!ELEMENT Book (Title, Authors, Remark?)>
    <!ATTLIST Book ISBN CDATA #REQUIRED
                   Price CDATA #REQUIRED
                   Edition CDATA #IMPLIED
    >
    <!ELEMENT Magazine (Title)>
    <!ATTLIST Magazine Month CDATA #REQUIRED
                       Year CDATA #REQUIRED
    >
    <!ELEMENT Title (#PCDATA)>
    <!ELEMENT Authors (Author+)>
    <!ELEMENT Remark (#PCDATA)>
    <!ELEMENT BookRef EMPTY>
    <!ATTLIST BookRef book IDREF #REQUIRED>
    <!ELEMENT Author (First_Name, Last_Name)>
    <!ELEMENT First_Name (#PCDATA)>
    <!ELEMENT Last_Name (#PCDATA)>
]>
```


- Using ID/IDREF(S) (the document first, then its DTD):

```xml
<!--  Bookstore using ID/IDREF(S)     -->
<!--  Try changes:                    -->
<!--    Ident JU to HG                -->
<!--    BookRef to HG                 -->
<!--    Add second BookRef in Remark  -->
<Bookstore>
    <Book ISBN="ISBN-0-13-713526-2" Price="100" Authors="JU JW">
        <Title>A First Course in Database Systems</Title>
    </Book>
    <Book ISBN="ISBN-0-13-815504-6" Price="85" Authors="HG JU JW">
        <Title>Database Systems: The Complete Book</Title>
        <Remark>
            Amazon.com says: Buy this book bundled with
            <BookRef book="ISBN-0-13-713526-2"/>
            - a great deal!
        </Remark>
    </Book>
    <Author Ident="HG">
        <First_Name>Hector</First_Name>
        <Last_Name>Garcia-Molina</Last_Name>
    </Author>
    <Author Ident="JU">
        <First_Name>Jeffrey</First_Name>
        <Last_Name>Ullman</Last_Name>
    </Author>
    <Author Ident="JW">
        <First_Name>Jennifer</First_Name>
        <Last_Name>Widom</Last_Name>
    </Author>
</Bookstore>
```

```xml
<!--  Bookstore with ID/IDREF(S)                -->
<!DOCTYPE Bookstore [
    <!ELEMENT Bookstore (Book*, Author*)>
    <!ELEMENT Book (Title, Remark?)>
    <!ATTLIST Book ISBN ID #REQUIRED
                   Price CDATA #REQUIRED
                   Authors IDREFS #REQUIRED
    >
    <!ELEMENT Title (#PCDATA)>
    <!ELEMENT Remark (#PCDATA | BookRef)*>
    <!ELEMENT BookRef EMPTY>
    <!ATTLIST BookRef book IDREF #REQUIRED>
    <!ELEMENT Author (First_Name, Last_Name)>
    <!ATTLIST Author Ident ID #REQUIRED>
    <!ELEMENT First_Name (#PCDATA)>
    <!ELEMENT Last_Name (#PCDATA)>
]>
```

```bash
xmllint --valid --noout Bookstore.xml
```
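
The ID/IDREF(S) rules that a validating parser enforces can be sketched by hand. The code below is not a DTD validator: the sets `ID_ATTRS`, `IDREF_ATTRS`, and `IDREFS_ATTRS` are copied manually from the DTD above, the function name `check_ids` is made up, and the sample document is a trimmed version of the bookstore.

```python
import xml.etree.ElementTree as ET

# (element, attribute) pairs, transcribed by hand from the DTD above.
ID_ATTRS = {("Book", "ISBN"), ("Author", "Ident")}
IDREF_ATTRS = {("BookRef", "book")}
IDREFS_ATTRS = {("Book", "Authors")}

def check_ids(root: ET.Element) -> list:
    """Return a list of ID/IDREF violations (empty if there are none)."""
    errors, ids, refs = [], set(), []
    for elem in root.iter():
        for name, value in elem.attrib.items():
            key = (elem.tag, name)
            if key in ID_ATTRS:
                if value in ids:            # IDs must be unique document-wide
                    errors.append(f"duplicate ID {value!r}")
                ids.add(value)
            elif key in IDREF_ATTRS:
                refs.append(value)
            elif key in IDREFS_ATTRS:       # whitespace-separated list of IDs
                refs.extend(value.split())
    errors += [f"dangling reference {r!r}" for r in refs if r not in ids]
    return errors

doc = """
<Bookstore>
    <Book ISBN="ISBN-0-13-713526-2" Price="100" Authors="JU JW">
        <Title>A First Course in Database Systems</Title>
    </Book>
    <Author Ident="JU"><First_Name>Jeffrey</First_Name><Last_Name>Ullman</Last_Name></Author>
    <Author Ident="JW"><First_Name>Jennifer</First_Name><Last_Name>Widom</Last_Name></Author>
</Bookstore>
""".strip()

print(check_ids(ET.fromstring(doc)))  # []
broken = doc.replace('Authors="JU JW"', 'Authors="JU HG"')
print(check_ids(ET.fromstring(broken)))  # ["dangling reference 'HG'"]
```

All IDs share one pool in this sketch, as they do in a DTD: a reference only has to match some ID in the document, so a `BookRef` pointing at an author's `Ident` would pass.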

#### XML Schema

- Extensive language;
- Like DTDs, can specify elements, attributes, nesting, ordering, and number of occurrences;
- Also data types, keys, (typed) pointers, and more;
- Is written in XML;

```xml
<!--  Bookstore using XML Schema (Bookstore.xsd)              -->
<!--  Notice / try changes:                                   -->
<!--    Price is integer (search 'Price')                     -->
<!--      Make it not an integer                              -->
<!--    Key declarations (search 'key ')                      -->
<!--      Change JU to HG                                     -->
<!--      Change second ISBN to HG                            -->
<!--    References (search 'keyref')                          -->
<!--      Change first authIdent JU to foo                    -->
<!--      Change first authIdent JU to JW                     -->
<!--      Change BookRef to JW                                -->
<!--    Occurrence constraints (search 'occurs', default=1)   -->
<!--      Change to 0 authors                                 -->
<!--      Change to 2 remarks                                 -->
<Bookstore>
    <Book ISBN="ISBN-0-13-713526-2" Price="100">
        <Title>A First Course in Database Systems</Title>
        <Authors>
            <Auth authIdent="JU"/>
            <Auth authIdent="JW"/>
        </Authors>
    </Book>
    <Book ISBN="ISBN-0-13-815504-6" Price="85">
        <Title>Database Systems: The Complete Book</Title>
        <Authors>
            <Auth authIdent="HG"/>
            <Auth authIdent="JU"/>
            <Auth authIdent="JW"/>
        </Authors>
        <Remark>
            Amazon.com says: Buy this book bundled with
            <BookRef book="ISBN-0-13-713526-2"/>
            - a great deal!
        </Remark>
    </Book>
    <Author Ident="HG">
        <First_Name>Hector</First_Name>
        <Last_Name>Garcia-Molina</Last_Name>
    </Author>
    <Author Ident="JU">
        <First_Name>Jeffrey</First_Name>
        <Last_Name>Ullman</Last_Name>
    </Author>
    <Author Ident="JW">
        <First_Name>Jennifer</First_Name>
        <Last_Name>Widom</Last_Name>
    </Author>
</Bookstore>
```

```xml
<?xml version="1.0" ?>
<!-- XSD for Bookstore-XSD.xml -->

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
   <xsd:element name="Bookstore">
      <xsd:complexType>
         <xsd:sequence>
            <xsd:element name="Book" type="BookType"
                         minOccurs="0" maxOccurs="unbounded" />
            <xsd:element name="Author" type="AuthorType"
                         minOccurs="0" maxOccurs="unbounded" />
         </xsd:sequence>
      </xsd:complexType>
      <xsd:key name="BookKey">
         <xsd:selector xpath="Book" />
         <xsd:field xpath="@ISBN" />
      </xsd:key>
      <xsd:key name="AuthorKey">
         <xsd:selector xpath="Author" />
         <xsd:field xpath="@Ident" />
      </xsd:key>
      <xsd:keyref name="AuthorKeyRef" refer="AuthorKey">
         <xsd:selector xpath="Book/Authors/Auth" />
         <xsd:field xpath="@authIdent" />
      </xsd:keyref>
      <xsd:keyref name="BookKeyRef" refer="BookKey">
         <xsd:selector xpath="Book/Remark/BookRef" />
         <xsd:field xpath="@book" />
      </xsd:keyref>
   </xsd:element>
   <xsd:complexType name="BookType">
      <xsd:sequence>
         <xsd:element name="Title" type="xsd:string" />
         <xsd:element name="Authors">
            <xsd:complexType>
               <xsd:sequence>
                  <xsd:element name="Auth" maxOccurs="unbounded">
                     <xsd:complexType>
                        <xsd:attribute name="authIdent" type="xsd:string"
                                       use="required" />
                     </xsd:complexType>
                  </xsd:element>
               </xsd:sequence>
            </xsd:complexType>
         </xsd:element>
         <xsd:element name="Remark" minOccurs="0">
            <xsd:complexType mixed="true">
               <xsd:sequence>
                  <xsd:element name="BookRef" minOccurs="0"
                               maxOccurs="unbounded">
                     <xsd:complexType>
                        <xsd:attribute name="book" type="xsd:string"
                                       use="required" />
                     </xsd:complexType>
                  </xsd:element>
               </xsd:sequence>
            </xsd:complexType>
         </xsd:element>
      </xsd:sequence>
      <xsd:attribute name="ISBN" type="xsd:string" use="required" />
      <xsd:attribute name="Price" type="xsd:integer" use="required" />
   </xsd:complexType>
   <xsd:complexType name="AuthorType">
      <xsd:sequence>
         <xsd:element name="First_Name" type="xsd:string" />
         <xsd:element name="Last_Name" type="xsd:string" />
      </xsd:sequence>
      <xsd:attribute name="Ident" type="xsd:string" use="required" />
   </xsd:complexType>
</xsd:schema>

```

```bash
xmllint --schema Bookstore.xsd --noout Bookstore-XSD.xml
```
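
To make the extra checks of XML Schema concrete, the sketch below hand-codes a small subset of the XSD above: the integer type of `Price`, the two keys, and the two keyrefs. It is an illustration only, not a schema validator; the function name `check_schema_rules` and the trimmed sample document are assumptions of this example.

```python
import re
import xml.etree.ElementTree as ET

def check_schema_rules(root: ET.Element) -> list:
    """Check integer Price, key uniqueness, and keyref targets by hand."""
    errors = []

    # Price is declared as xsd:integer.
    for book in root.findall("Book"):
        price = book.get("Price", "")
        if not re.fullmatch(r"[+-]?[0-9]+", price.strip()):
            errors.append(f"Price {price!r} is not an integer")

    def key_values(path: str, attr: str, name: str) -> set:
        """Collect the values of one key and report duplicates."""
        seen = set()
        for elem in root.findall(path):
            value = elem.get(attr)
            if value in seen:
                errors.append(f"{name}: duplicate key {value!r}")
            seen.add(value)
        return seen

    book_keys = key_values("Book", "ISBN", "BookKey")
    author_keys = key_values("Author", "Ident", "AuthorKey")

    # Each keyref must point at a value of its own key, not at any ID.
    for auth in root.findall("Book/Authors/Auth"):
        if auth.get("authIdent") not in author_keys:
            errors.append(f"AuthorKeyRef: {auth.get('authIdent')!r} is not an AuthorKey value")
    for ref in root.findall("Book/Remark/BookRef"):
        if ref.get("book") not in book_keys:
            errors.append(f"BookKeyRef: {ref.get('book')!r} is not a BookKey value")
    return errors

doc = """
<Bookstore>
    <Book ISBN="ISBN-0-13-713526-2" Price="100">
        <Title>A First Course in Database Systems</Title>
        <Authors><Auth authIdent="JU"/></Authors>
    </Book>
    <Book ISBN="ISBN-0-13-815504-6" Price="85">
        <Title>Database Systems: The Complete Book</Title>
        <Authors><Auth authIdent="JU"/></Authors>
        <Remark>Bundle with <BookRef book="ISBN-0-13-713526-2"/></Remark>
    </Book>
    <Author Ident="JU"><First_Name>Jeffrey</First_Name><Last_Name>Ullman</Last_Name></Author>
</Bookstore>
""".strip()

print(check_schema_rules(ET.fromstring(doc)))  # []
bad_ref = doc.replace('book="ISBN-0-13-713526-2"', 'book="JU"')
print(check_schema_rules(ET.fromstring(bad_ref)))
# ["BookKeyRef: 'JU' is not a BookKey value"]
```

The second call shows what "typed pointers" means: `JU` is a valid author key, yet it is rejected as a book reference, whereas an untyped IDREF would accept any ID in the document.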