# Evaluation: Conformance and Consistency
Part III of the computational evaluation of AI-generated linked data for [Linking Anthropology's Data and Archives (LADA)](https://ischool.umd.edu/projects/building-a-sustainable-future-for-anthropologys-archives-researching-primary-source-data-lifecycles-infrastructures-and-reuse/). This part focuses on conformance to ontologies (i.e., CIDOC-CRM, Schema.org, Dublin Core) and on how consistently the generated data points conform (e.g., is all data on one or two lines, or does each tag, subtag, etc. appear on its own line?).

---

**Table of Contents:**

I. [Data Loading](#data-loading)

II. [Dublin Core](#dublin-core)

III. [Schema.org](#schemaorg)

IV. [CIDOC-CRM](#cidoc-crm)

---

## Data Loading

In [None]:
import utils
import config
import pandas as pd
import numpy as np
import urllib.request
import urllib
import xml.etree.ElementTree as ET
import json
from lxml import etree
import rdflib
from rdflib.namespace import DC, SDO # Dublin Core, Schema.org
from pathlib import Path
import os
import re

# Candidate tools for validation:
# xml.sax - to check that XML is well-formed
# xml.etree.ElementTree - to check the text between tags
# xml.etree.ElementTree + an XMLSchema class's validate() (e.g., lxml's etree.XMLSchema; the standard library has none) - to validate XML against a schema
# lxml etree.XMLParser - to check well-formedness, optionally validating against an input XML schema
# json_checker - to validate Python data types (incl. but not limited to those obtained from JSON)
# jsonschema.validate - to validate JSON against a JSON Schema
# ShEx - for RDF graphs, ShExJ for JSON - NOTE: couldn't install package
# OntoME - for CIDOC-CRM ontology alignment
# Try PyLD if RDFLib is not working or not doing what is needed

Create variables to reference existing directories and files.

In [2]:
dublin_path = "corrected/dublin_core/"  # XML data files
schema_path = "corrected/schema_org/"   # JSON data files
cidoc_path = "corrected/cidoc_crm/"     # JSON data files

dublin_t1_dir = config.task1_data+dublin_path
schema_t1_dir = config.task1_data+schema_path
cidoc_t1_dir = config.task1_data+cidoc_path

dublin_p1_dir = config.playgrd1_data+dublin_path
schema_p1_dir = config.playgrd1_data+schema_path
cidoc_p1_dir = config.playgrd1_data+cidoc_path

dublin_p3_dir = config.playgrd3_data+dublin_path
schema_p3_dir = config.playgrd3_data+schema_path
cidoc_p3_dir = config.playgrd3_data+cidoc_path

In [9]:
dc_paths = [dublin_t1_dir, dublin_p1_dir, dublin_p3_dir]
sdo_paths = [schema_t1_dir, schema_p1_dir, schema_p3_dir]
cidoc_paths = [cidoc_t1_dir, cidoc_p1_dir, cidoc_p3_dir]
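
Later cells refer to a list of individual file paths (`dublin_file_paths`) that is not built in the cells shown here. The sketch below is one way such a list might be assembled from the directory lists above. The helper name `list_data_files` is hypothetical, and the demonstration runs on a temporary directory rather than the project data.

```python
from pathlib import Path
import tempfile


def list_data_files(directories, suffix):
    """Return sorted paths of files ending in `suffix` across `directories`."""
    paths = []
    for directory in directories:
        d = Path(directory)
        if not d.is_dir():
            continue  # skip directories that do not exist on this machine
        paths.extend(
            str(p) for p in sorted(d.iterdir())
            if p.is_file() and p.suffix == suffix
        )
    return paths


# Demonstration on a throwaway directory (not the LADA data)
with tempfile.TemporaryDirectory() as tmp:
    for name in ("dc_record_002.xml", "dc_record_001.xml", "notes.txt"):
        (Path(tmp) / name).write_text("<x/>")
    demo_paths = list_data_files([tmp, "missing_dir/"], ".xml")
    demo_names = [Path(p).name for p in demo_paths]
```

With the project data, a call such as `list_data_files(dc_paths, ".xml")` would play the same role, assuming the directories in `dc_paths` exist.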

## Dublin Core

***Note:***

*The Dublin Core schemas' URLs below are from [dublincore.org](https://www.dublincore.org/schemas/xmls/) under "Latest versions are always available as: ..."*

In [3]:
dc_elements_schema_url = "https://www.dublincore.org/schemas/xmls/qdc/dc.xsd"
dc_terms_schema_url = "https://www.dublincore.org/schemas/xmls/qdc/dcterms.xsd"
dc_mitype_schema_url = "https://www.dublincore.org/schemas/xmls/qdc/dcmitype.xsd"

Create variables for the two Dublin Core Metadata Initiative namespaces.

In [4]:
dcmi_legacy = ["creator", "contributor", "date", "title", "publisher", 
             "language", "format", "subject", "description", "identifier", 
             "relation", "source", "type", "coverage", "rights"]
dcmi_terms = ["dcterms:creator", "dcterms:contributor", "dcterms:date", "dcterms:title", "dcterms:publisher", 
             "dcterms:language", "dcterms:format", "dcterms:subject", "dcterms:description", "dcterms:identifier", 
             "dcterms:relation", "dcterms:source", "dcterms:type", "dcterms:coverage", "dcterms:rights"]

In [5]:
url = dc_elements_schema_url

In [None]:
content = urllib.request.urlopen(url)
parser = etree.XMLParser()
dc_elements_tree = etree.parse(content, parser)
dc_elements_schema = etree.XMLSchema(dc_elements_tree)

In [10]:
dc_t1_files = os.listdir(dc_paths[0])
print(dc_t1_files[0])

dc_record_024.xml


In [12]:
xml_doc = etree.parse(dublin_t1_dir+dc_t1_files[0])
result = dc_elements_schema.validate(xml_doc)
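
`validate()` only returns `True` or `False`. To see why a file fails, the schema object's `error_log` can be read after validation. The sketch below shows that pattern on a tiny stand-in schema so that it runs offline; the schema, the sample documents, and the helper name `validation_errors` are illustrative assumptions, not the DCMI schema or the project's data.

```python
from lxml import etree

# ASSUMPTION: a tiny stand-in schema so the sketch runs offline (not the DCMI schema)
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="record">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def validation_errors(schema, xml_bytes):
    """Return a list of (line, message) for one document; empty if it is valid."""
    try:
        doc = etree.fromstring(xml_bytes)
    except etree.XMLSyntaxError as err:
        # The document is not even well-formed, so schema validation cannot run
        return [(err.lineno, f"not well-formed: {err}")]
    if schema.validate(doc):
        return []
    # error_log holds the problems found by the most recent validation
    return [(entry.line, entry.message) for entry in schema.error_log]


demo_schema = etree.XMLSchema(etree.fromstring(XSD))
ok = validation_errors(demo_schema, b"<record><title>Field notes</title></record>")
bad = validation_errors(demo_schema, b"<record><creator>A</creator></record>")
broken = validation_errors(demo_schema, b"<record><title>unclosed</record>")
```

Collecting the messages per file, rather than only the boolean, would make it possible to tally which kinds of schema violations recur across the generated records.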

In [None]:
dublin_file_paths[1]

'data/data_playground_task1/cleaned/dublin_core/dc_record_001.xml'

For metadata using the terms namespace (e.g., `dcterms:title`), check that properties restricted to literal values actually have literals (strings, not URIs) and that properties restricted to nonliteral values actually have nonliterals (URIs). The DCMI user guide lists which properties take which kind of value [here](https://www.dublincore.org/resources/userguide/publishing_metadata/#Properties_of_the_terms_namespace_used_only_with_literal_values).

For metadata using the legacy namespace (e.g., `dc:title`), all properties' values can be literals or nonliterals.
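
The literal/nonliteral check described above could be sketched roughly as follows. This is a heuristic under stated assumptions: the two property sets are short illustrative subsets rather than the full lists from the DCMI guide, a value counts as a nonliteral if it carries an `rdf:resource` attribute or its text looks like a URI, and the function names and sample record are made up for the demonstration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

DCTERMS_NS = "http://purl.org/dc/terms/"
RDF_RESOURCE = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource"

# ASSUMPTION: illustrative subsets only; replace with the lists in the DCMI guide
LITERAL_ONLY = {"title", "date", "created"}
NONLITERAL_ONLY = {"creator", "publisher", "language"}


def looks_like_uri(value):
    """Heuristic: a URI has a known scheme and contains no whitespace."""
    value = value.strip()
    if not value or any(ch.isspace() for ch in value):
        return False
    parsed = urlparse(value)
    return parsed.scheme in {"http", "https", "urn"} and bool(parsed.netloc or parsed.path)


def range_violations(xml_text):
    """Return (property, problem) pairs for dcterms elements with the wrong value kind."""
    problems = []
    for elem in ET.fromstring(xml_text).iter():
        if not elem.tag.startswith("{" + DCTERMS_NS + "}"):
            continue  # only the terms namespace is restricted
        name = elem.tag.split("}", 1)[1]
        is_uri = RDF_RESOURCE in elem.attrib or looks_like_uri(elem.text or "")
        if name in LITERAL_ONLY and is_uri:
            problems.append((name, "expected a literal, found a URI"))
        elif name in NONLITERAL_ONLY and not is_uri:
            problems.append((name, "expected a URI, found a literal"))
    return problems


demo = """<record xmlns:dcterms="http://purl.org/dc/terms/">
  <dcterms:title>Field notebook, 1931</dcterms:title>
  <dcterms:creator>Jane Doe</dcterms:creator>
  <dcterms:language>http://id.loc.gov/vocabulary/iso639-2/eng</dcterms:language>
  <dcterms:date>https://example.org/dates/1931</dcterms:date>
</record>"""
demo_problems = range_violations(demo)
```

Because plain XML does not always distinguish a URI from a string that merely looks like one, results from a check like this are probably best treated as candidates for manual review.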


Check that each XML file includes an XML declaration (prolog) and the simple Dublin Core namespace (parsing may not raise an exception even if either is missing from an XML file).

In [None]:
prolog = '<?xml version="1.0"?>'
dc_namespace_simple = 'xmlns:dc="http://purl.org/dc/elements/1.1/">'

In [None]:
# errored_files and dublin_file_paths are assumed to be defined in earlier cells
correct_files, other_errors = [], []
for f_xml in dublin_file_paths:
    if f_xml not in errored_files:
        f_txt = f_xml.replace(".xml", ".txt")
        with open(f_txt, "r") as f:  # the with block closes the file
            content = f.read()
        if prolog not in content:
            errored_files.append(f_xml)
            other_errors.append({"file": f_xml, "exception_type": "Custom syntax", "exception_subtype": "Prolog", "exception_message": "No prolog"})
        if dc_namespace_simple not in content:
            errored_files.append(f_xml)
            other_errors.append({"file": f_xml, "exception_type": "Custom syntax", "exception_subtype": "Namespace", "exception_message": "No DC namespace"})
        if f_xml not in errored_files:
            correct_files.append(f_xml)

errored_files = list(set(errored_files))
correct_files = list(set(correct_files))

print("Total correct Dublin Core XML files:", len(correct_files))
print("Files with errors:", 
      len(errored_files), "of", total_dcxml_files,
      f"({(len(errored_files)/total_dcxml_files)*100:.2f}%)")

Total correct Dublin Core XML files: 2
Files with errors: 105 of 107 (98.13%)
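
The substring test above only matches when the file contains exactly `<?xml version="1.0"?>` and when the `xmlns:dc` declaration is immediately followed by `>`, so a file that declares an encoding or lists another namespace after `xmlns:dc` would be counted as an error. A less strict alternative, sketched here with the standard library only, looks for any XML declaration and reads the declared namespaces from the parser. The function names and sample strings are illustrative assumptions, and malformed XML would raise `ET.ParseError`.

```python
import io
import xml.etree.ElementTree as ET

DC_ELEMENTS_NS = "http://purl.org/dc/elements/1.1/"


def has_xml_declaration(text):
    """True if the text starts with an XML declaration (any version or encoding)."""
    return text.lstrip("\ufeff").startswith("<?xml ")


def declared_namespaces(text):
    """Return {prefix: uri} for every namespace declared in the document."""
    events = ET.iterparse(io.StringIO(text), events=("start-ns",))
    return {prefix: uri for _, (prefix, uri) in events}


# The dc namespace is not the last attribute here, so a check for
# 'xmlns:dc="...">' as a substring would miss it
good = (
    '<?xml version="1.0"?>\n'
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
    "<dc:title>T</dc:title></metadata>"
)
no_ns = "<metadata><title>T</title></metadata>"

good_ok = has_xml_declaration(good) and DC_ELEMENTS_NS in declared_namespaces(good).values()
```

Whether the stricter or the looser test is appropriate depends on whether exact formatting is itself part of the consistency being evaluated.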


In [None]:
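prolog = '<?xml version="1.0"?>'

# NOTE: f_string (the file's contents) and include_prolog (a bool) are assumed to be defined elsewhere
# Be sure the XML data includes an XML declaration if specified in the prolog parameter
# (raw string avoids invalid escape sequences; non-greedy match stops at the first "?>")
has_prolog = re.findall(r"^<\?xml version=.+?\?>", f_string)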
if not include_prolog:
    if len(has_prolog) > 0:
        f_string = f_string.replace(has_prolog[0], "")
else:
    if len(has_prolog) == 0:
        f_string = prolog + "\n" + f_string
    # Make sure the prolog is consistent across all files
    elif has_prolog[0] != prolog:
        f_string = f_string.replace(has_prolog[0], prolog)
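
The fragment above reads as the body of a function. A self-contained version might look like the sketch below; the function name `normalize_prolog` and its signature are assumptions, and, unlike the fragment, it also drops the newline left behind when a declaration is removed.

```python
import re

PROLOG = '<?xml version="1.0"?>'
# Non-greedy match so that only the declaration itself is captured
DECLARATION_RE = re.compile(r"^<\?xml version=.+?\?>")


def normalize_prolog(f_string, include_prolog=True, prolog=PROLOG):
    """Add, replace, or strip the XML declaration at the start of f_string."""
    match = DECLARATION_RE.match(f_string)
    if not include_prolog:
        # Strip an existing declaration and the newline that followed it
        return f_string[match.end():].lstrip("\n") if match else f_string
    if match is None:
        return prolog + "\n" + f_string
    # Replace a differing declaration so that all files share one prolog
    return prolog + f_string[match.end():]


replaced = normalize_prolog('<?xml version="1.0" encoding="UTF-8"?>\n<a/>')
added = normalize_prolog("<a/>")
stripped = normalize_prolog('<?xml version="1.0"?>\n<a/>', include_prolog=False)
```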

Possibly also: check that literals (non-URIs, i.e., strings) never appear as the subject or predicate of a triple. Literals are only allowed in the object position, although objects can also be nonliterals.

## Schema.org

Is conformance essentially a check that types exist and are referenced (spelled, structured) properly?

In [None]:
# JSON-LD Context (all types?): https://schema.org/docs/jsonldcontext.json
# Current version of vocabulary: https://schema.org/version/latest/schemaorg-current-https.jsonld (from https://schema.org/docs/developers.html)
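
If conformance is taken to mean that every type and property name exists in the vocabulary, one possible check is sketched below. It assumes the vocabulary file is a JSON-LD document whose `@graph` lists nodes typed `rdfs:Class` or `rdf:Property`. To stay offline it uses a small hand-written excerpt in that shape rather than the downloaded file, and the function names and sample record are hypothetical.

```python
import json

# ASSUMPTION: hand-written excerpt in the shape of the schema.org vocabulary file
VOCAB_EXCERPT = json.loads("""
{"@graph": [
  {"@id": "schema:Person", "@type": "rdfs:Class"},
  {"@id": "schema:ArchiveComponent", "@type": "rdfs:Class"},
  {"@id": "schema:name", "@type": "rdf:Property"},
  {"@id": "schema:creator", "@type": "rdf:Property"}
]}
""")


def vocabulary_terms(vocab):
    """Split the vocabulary's @graph into class names and property names."""
    classes, properties = set(), set()
    for node in vocab.get("@graph", []):
        term = node.get("@id", "").split(":", 1)[-1]  # drop the "schema:" prefix
        types = node.get("@type", [])
        types = [types] if isinstance(types, str) else types
        if "rdfs:Class" in types:
            classes.add(term)
        if "rdf:Property" in types:
            properties.add(term)
    return classes, properties


def unknown_terms(record, classes, properties):
    """Recursively collect @type values and keys that are not in the vocabulary."""
    unknown = []
    if isinstance(record, dict):
        for key, value in record.items():
            if key == "@type":
                for t in ([value] if isinstance(value, str) else value):
                    if t not in classes:
                        unknown.append(("type", t))
            elif not key.startswith("@") and key not in properties:
                unknown.append(("property", key))
            unknown.extend(unknown_terms(value, classes, properties))
    elif isinstance(record, list):
        for item in record:
            unknown.extend(unknown_terms(item, classes, properties))
    return unknown


classes, properties = vocabulary_terms(VOCAB_EXCERPT)
record = {
    "@context": "https://schema.org",
    "@type": "ArchiveComponent",
    "name": "Field notes",
    "creator": {"@type": "person", "Name": "A. Researcher"},
}
demo_unknown = unknown_terms(record, classes, properties)
```

The comparison is case-sensitive on purpose, so misspellings such as `person` for `Person` are reported. It does not check whether a property is used with a permitted type, which would be a further step.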

## CIDOC-CRM

Is conformance essentially a check that types exist and are referenced (spelled, structured) properly?

In [None]:
# JSON-LD Context: https://cidoc-crm.org/rdfs/7.1.3/CIDOC_CRM_v7.1.3_JSON-LD_Context.jsonld
# Classes & properties: https://cidoc-crm.org/html/cidoc_crm_v7.1.3.html
# Namespace: http://www.cidoc-crm.org/cidoc-crm/



Is it possible to automate a check for event-centric modeling?
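
One possible starting point, offered as a rough heuristic rather than an answer: treat a record as event-centric if any node in it is typed with a CIDOC-CRM event class. The sketch assumes JSON data with types under an `@type` or `type` key, and it uses a small illustrative subset of event classes. The function names and sample records are hypothetical.

```python
import re

# ASSUMPTION: illustrative subset of CIDOC-CRM event classes (E5 Event and
# some of its subclasses); extend as needed
EVENT_CLASSES = {"E5_Event", "E7_Activity", "E8_Acquisition", "E12_Production", "E65_Creation"}


def local_name(type_id):
    """Reduce 'crm:E12_Production' or a full URI to 'E12_Production'."""
    return re.split(r"[/#:]", type_id)[-1]


def collect_types(node):
    """Recursively gather the local names of all types used in a JSON record."""
    types = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key in ("@type", "type"):
                values = value if isinstance(value, list) else [value]
                types.extend(local_name(v) for v in values if isinstance(v, str))
            else:
                types.extend(collect_types(value))
    elif isinstance(node, list):
        for item in node:
            types.extend(collect_types(item))
    return types


def is_event_centric(record):
    """True if at least one node in the record is typed as an event class."""
    return any(t in EVENT_CLASSES for t in collect_types(record))


event_record = {
    "@type": "crm:E22_Human-Made_Object",
    "crm:P108i_was_produced_by": {
        "@type": "crm:E12_Production",
        "crm:P14_carried_out_by": {"@type": "crm:E21_Person"},
    },
}
flat_record = {
    "@type": "crm:E22_Human-Made_Object",
    "crm:P102_has_title": {"@type": "crm:E35_Title"},
}
```

Here `is_event_centric(event_record)` is true and `is_event_centric(flat_record)` is false. The presence of an event class does not show that the event is modeled well, so this would at most flag records that lack events entirely.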