# Evaluation: Conformance and Consistency
Part III of the computational evaluation of AI-generated linked data for [Linking Anthropology's Data and Archives (LADA)](https://ischool.umd.edu/projects/building-a-sustainable-future-for-anthropologys-archives-researching-primary-source-data-lifecycles-infrastructures-and-reuse/), focused on conformance to ontologies (i.e., Dublin Core, Schema.org, CIDOC-CRM) and consistency in the conformance across generated data points (e.g., is all data on one or two lines, or does each tag, subtag, etc. appear on its own line?).

---

**Table of Contents:**

I. [Data Loading](#data-loading)

II. [XML and RDF/XML](#xml-and-rdfxml)

  * [Dublin Core](#dublin-core)

III. [JSON-LD](#json-ld)

  * [Schema.org](#schemaorg)

  * [CIDOC-CRM](#cidoc-crm)

---

## Data Loading

In [1]:
import utils
import config
import pandas as pd
import numpy as np
import urllib
import urllib.request
import xml.etree.ElementTree as ET
import json
from lxml import etree
import rdflib
from rdflib.namespace import DC, SDO # Dublin Core, Schema.org
from pathlib import Path
import os
import re

# xml.sax - to check that XML is well-formed
# xml.etree.ElementTree - to check the text between tags
# lxml etree.XMLSchema's validate() - to validate XML against an XSD schema
# lxml etree.XMLParser - to check well-formedness and, given an input XML schema, validity
# json_checker - to validate Python data types (incl. but not limited to those obtained from JSON)
# jsonschema.validate - to validate JSON against a JSON Schema
# ShEx - for RDF graphs, ShExJ for JSON - NOTE: couldn't install package
# OntoME - for CIDOC-CRM ontology alignment
# PyLD - to try if RDFLib isn't working or doesn't do what's needed

Create variables to reference existing directories and files.

In [2]:
dublin_path = "cleaned/dublin_core/"  # XML data files
schema_path = "cleaned/schema_org/"   # JSON data files
cidoc_path = "cleaned/cidoc_crm/"     # JSON data files

dublin_t1_dir = config.task1_data+dublin_path
schema_t1_dir = config.task1_data+schema_path
cidoc_t1_dir = config.task1_data+cidoc_path

dublin_p1_dir = config.playgrd1_data+dublin_path
schema_p1_dir = config.playgrd1_data+schema_path
cidoc_p1_dir = config.playgrd1_data+cidoc_path

dublin_p3_dir = config.playgrd3_data+dublin_path
schema_p3_dir = config.playgrd3_data+schema_path
cidoc_p3_dir = config.playgrd3_data+cidoc_path

In [3]:
dc_paths = [dublin_t1_dir, dublin_p1_dir, dublin_p3_dir]
sdo_paths = [schema_t1_dir, schema_p1_dir, schema_p3_dir]
cidoc_paths = [cidoc_t1_dir, cidoc_p1_dir, cidoc_p3_dir]

In [4]:
d = "conformance"
report_dir = f"data/error_reports/{d}/"
Path(report_dir).mkdir(parents=True, exist_ok=True)

# report_dir = f"data/error_reports/{d}/after_correction/"
# Path(report_dir).mkdir(parents=True, exist_ok=True)

## XML and RDF/XML

### Dublin Core

***Note:***

*The Dublin Core schemas' URLs below are from [dublincore.org](https://www.dublincore.org/schemas/xmls/) under "Latest versions are always available as: ..."*

*Another resource: https://www.dublincore.org/specifications/dublin-core/dc-rdf-notes/*

In [5]:
dc_elements_schema_url = "https://www.dublincore.org/schemas/xmls/qdc/dc.xsd"
dc_terms_schema_url = "https://www.dublincore.org/schemas/xmls/qdc/dcterms.xsd"
dc_mitype_schema_url = "https://www.dublincore.org/schemas/xmls/qdc/dcmitype.xsd"
dc_xsd_urls = [dc_elements_schema_url, dc_terms_schema_url, dc_mitype_schema_url]

In [None]:
# # # https://lxml.de/resolvers.html
class DCResolver(etree.Resolver):
    def resolve(self, url, pubid, context):
        try:
            remote_url = "https://www.dublincore.org/schemas/xmls/qdc/dc.xsd"
            with urllib.request.urlopen(remote_url, timeout=10) as response:
                content = response.read()
            return self.resolve_string(content, context)
        except Exception as e:
            # Report the failure and return None so lxml falls back to its default resolution
            print(f"Failed to resolve {url}: {e}")
            return None

class DCMITypeResolver(etree.Resolver):
    def resolve(self, url, pubid, context):
        try:
            remote_url = "https://www.dublincore.org/schemas/xmls/qdc/dcmitype.xsd"
            with urllib.request.urlopen(remote_url, timeout=10) as response:
                content = response.read()
            return self.resolve_string(content, context)
        except Exception as e:
            # Report the failure and return None so lxml falls back to its default resolution
            print(f"Failed to resolve {url}: {e}")
            return None

parser = etree.XMLParser()
parser.resolvers.add(DCResolver())
parser.resolvers.add(DCMITypeResolver())

In [41]:
def getXMLSchema(url, parser=parser, user_agent={'User-Agent': 'Mozilla/8.0'}):
    xsd_request = urllib.request.Request(url, headers=user_agent)
    xsd_content = urllib.request.urlopen(xsd_request, timeout=5).read()
    xml_xsd = etree.XML(xsd_content, parser)
    return etree.XMLSchema(xml_xsd)

In [47]:
# https://lxml.de/resolvers.html
class DCResolver(etree.Resolver):
    def resolve(self, url, pubid, context):
        try:
            remote_url = "https://www.dublincore.org/schemas/xmls/qdc/dc.xsd"
            with urllib.request.urlopen(remote_url, timeout=10) as response:
                content = response.read()
            return self.resolve_string(content, context)
        except Exception as e:
            # Report the failure and return None so lxml falls back to its default resolution
            print(f"Failed to resolve {url}: {e}")
            return None

class DCMITypeResolver(etree.Resolver):
    def resolve(self, url, pubid, context):
        try:
            remote_url = "https://www.dublincore.org/schemas/xmls/qdc/dcmitype.xsd"
            with urllib.request.urlopen(remote_url, timeout=10) as response:
                content = response.read()
            return self.resolve_string(content, context)
        except Exception as e:
            print(f"Failed to resolve {url}: {str(e)}")
            return None

parser = etree.XMLParser()
parser.resolvers.add(DCResolver())
parser.resolvers.add(DCMITypeResolver())

url = 'https://www.dublincore.org/schemas/xmls/qdc/dcterms.xsd'
request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/9.0'})
with urllib.request.urlopen(request, timeout=5) as response:
    xsd_content = response.read()
xml_xsd = etree.XML(xsd_content, parser)
xsd = etree.XMLSchema(xml_xsd)

XMLSchemaParseError: Internal error: xmlSchemaBucketCreate, failed to add the schema bucket to the hash.

In [49]:
from io import BytesIO

class DublinCoreResolver(etree.Resolver):
    def resolve(self, url, pubid, context):
        try:
            base_url = "https://www.dublincore.org/schemas/xmls/qdc/"
            if url.endswith("dc.xsd"):
                target_url = base_url + "dc.xsd"
            elif url.endswith("dcmitype.xsd"):
                target_url = base_url + "dcmitype.xsd"
            else:
                return None  # Let default resolver handle others

            with urllib.request.urlopen(target_url, timeout=10) as response:
                content = response.read()

            # Create a file-like object for content
            fileobj = BytesIO(content)

            # Use resolve_fileobj and set system_url
            input_source = self.resolve_fileobj(fileobj, context)
            input_source.system_url = target_url
            return input_source

        except Exception as e:
            print(f"Failed to resolve {url}: {e}")
            return None

parser = etree.XMLParser()
parser.resolvers.add(DublinCoreResolver())

main_url = 'https://www.dublincore.org/schemas/xmls/qdc/dcterms.xsd'
request = urllib.request.Request(main_url, headers={'User-Agent': 'Mozilla/9.0'})
with urllib.request.urlopen(request, timeout=10) as response:
    xsd_content = response.read()

xml_schema_doc = etree.XML(xsd_content, parser)
xsd_schema = etree.XMLSchema(xml_schema_doc)

Failed to resolve dc.xsd: 'DublinCoreResolver' object has no attribute 'resolve_fileobj'
Failed to resolve dcmitype.xsd: 'DublinCoreResolver' object has no attribute 'resolve_fileobj'


XMLSchemaParseError: element decl. '{http://purl.org/dc/terms/}title', attribute 'substitutionGroup': The QName value '{http://purl.org/dc/elements/1.1/}title' does not resolve to a(n) element declaration.

In [42]:
xsd = getXMLSchema(dc_xsd_urls[1])

XMLSchemaParseError: Internal error: xmlSchemaBucketCreate, failed to add the schema bucket to the hash.

Create variables for the two Dublin Core Metadata Initiative namespaces.

In [4]:
dcmi_legacy = ["creator", "contributor", "date", "title", "publisher", 
             "language", "format", "subject", "description", "identifier", 
             "relation", "source", "type", "coverage", "rights"]
# Same 15 properties, prefixed for the terms namespace
dcmi_terms = [f"dcterms:{term}" for term in dcmi_legacy]

In [5]:
url = dc_elements_schema_url

In [None]:
content = urllib.request.urlopen(url)
parser = etree.XMLParser()
dc_elements_tree = etree.parse(content, parser)
dc_elements_schema = etree.XMLSchema(dc_elements_tree)

In [10]:
dc_t1_files = os.listdir(dc_paths[0])
print(dc_t1_files[0])

dc_record_024.xml


In [12]:
xml_doc = etree.parse(dublin_t1_dir+dc_t1_files[0])
result = dc_elements_schema.validate(xml_doc)

In [None]:
dublin_file_paths[1]

'data/data_playground_task1/cleaned/dublin_core/dc_record_001.xml'

For metadata using the terms namespace (e.g., `dcterms:title`), check that properties restricted to literal values hold literals (strings, not URIs) and that properties restricted to nonliteral values hold nonliterals (URIs). The DCMI user guide [lists which properties take which kind of value](https://www.dublincore.org/resources/userguide/publishing_metadata/#Properties_of_the_terms_namespace_used_only_with_literal_values).

For metadata using the legacy namespace (e.g., `dc:title`), all properties' values can be literals or nonliterals.
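
The literal/nonliteral check for the terms namespace could be sketched as follows. This is a minimal sketch under stated assumptions, not the project's method: the two property sets are illustrative placeholders to be filled in from the DCMI user guide, `looks_like_uri` and `check_value_kind` are hypothetical helper names, and treating any `http`, `https`, or `urn` string as a nonliteral is only a heuristic (in RDF/XML the distinction is ultimately syntactic, e.g., an `rdf:resource` attribute versus element text).

```python
from urllib.parse import urlparse

# Illustrative placeholder subsets (assumption): complete these from the DCMI user guide
LITERAL_ONLY = {"dcterms:title", "dcterms:date"}
NONLITERAL_ONLY = {"dcterms:creator", "dcterms:publisher", "dcterms:language"}


def looks_like_uri(value):
    """Heuristic: treat http(s) and urn strings as URIs (nonliterals)."""
    parsed = urlparse(value.strip())
    return parsed.scheme in {"http", "https", "urn"} and bool(parsed.netloc or parsed.path)


def check_value_kind(prop, value):
    """Return an error message if the value kind does not fit the property, else None."""
    is_uri = looks_like_uri(value)
    if prop in LITERAL_ONLY and is_uri:
        return "expected literal, found URI"
    if prop in NONLITERAL_ONLY and not is_uri:
        return "expected URI, found literal"
    return None


print(check_value_kind("dcterms:title", "Adam Makowicz Collection"))
print(check_value_kind("dcterms:creator", "Adam Makowicz"))
```

Properties outside both sets are accepted with either kind of value, which also matches how the legacy namespace is treated.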


Check that each XML file includes an XML declaration (prolog) and the simple Dublin Core namespace (parsing may not raise an exception even if these are missing from an XML file).

In [None]:
prolog = '<?xml version="1.0"?>'
dc_namespace_simple = 'xmlns:dc="http://purl.org/dc/elements/1.1/">'

In [None]:
correct_files, other_errors = [], []
for f_xml in dublin_file_paths:
    # Skip files that already failed an earlier check
    if f_xml in errored_files:
        continue
    # Read the plain-text copy of the XML file
    f_txt = f_xml.replace(".xml", ".txt")
    with open(f_txt, "r") as f:
        content = f.read()
    if prolog not in content:
        errored_files.append(f_xml)
        other_errors.append({"file": f_xml, "exception_type": "Custom syntax", "exception_subtype": "Prolog", "exception_message": "No prolog"})
    if dc_namespace_simple not in content:
        errored_files.append(f_xml)
        other_errors.append({"file": f_xml, "exception_type": "Custom syntax", "exception_subtype": "Namespace", "exception_message": "No DC namespace"})
    if f_xml not in errored_files:
        correct_files.append(f_xml)

# Remove duplicates (a file can fail both checks)
errored_files = list(set(errored_files))
correct_files = list(set(correct_files))

print("Total correct Dublin Core XML files:", len(correct_files))
print("Files with errors:", 
      len(errored_files), "of", total_dcxml_files,
      f"({(len(errored_files)/total_dcxml_files)*100:.2f}%)")

Total correct Dublin Core XML files: 2
Files with errors: 105 of 107 (98.13%)
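
The exact-substring checks above reject any file whose declaration carries an extra attribute (such as `encoding`) or whose `xmlns:dc` attribute is not the last one on its element, which may inflate the error count. A more tolerant variant is sketched below, as an assumption about what the check is meant to establish rather than as the method used for the reported numbers. The names `has_xml_declaration` and `declared_namespaces` are hypothetical, and the sample document is made up for illustration.

```python
import re
import xml.etree.ElementTree as ET
from io import BytesIO

DC_ELEMENTS_NS = "http://purl.org/dc/elements/1.1/"
# XML declaration at the very start of the text, allowing extra attributes
XML_DECLARATION = re.compile(r"""^<\?xml\s+version=["']1\.\d["'][^>]*\?>""")


def has_xml_declaration(text):
    """True if the text starts with an XML declaration (prolog)."""
    return XML_DECLARATION.match(text) is not None


def declared_namespaces(text):
    """Return every namespace URI declared in the document.

    Raises xml.etree.ElementTree.ParseError if the text is not well-formed.
    """
    events = ET.iterparse(BytesIO(text.encode("utf-8")), events=("start-ns",))
    return {uri for _event, (_prefix, uri) in events}


sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
    "<dc:title>Adam Makowicz Collection</dc:title></metadata>"
)
print(has_xml_declaration(sample), DC_ELEMENTS_NS in declared_namespaces(sample))
```

Because the namespace check relies on a parser, it only runs on well-formed files; files that fail to parse would still need to be recorded separately.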


In [None]:
prolog = '<?xml version="1.0"?>'

def normalize_prolog(f_string, include_prolog=True, prolog=prolog):
    # Add, standardize, or strip the XML declaration (prolog) of an XML string.
    # The function wrapper and the include_prolog default are assumptions.
    has_prolog = re.findall(r"^<\?xml version=.+?\?>", f_string)
    if not include_prolog:
        if len(has_prolog) > 0:
            f_string = f_string.replace(has_prolog[0], "")
    else:
        if len(has_prolog) == 0:
            f_string = prolog + "\n" + f_string
        # Make sure the prolog is consistent across all files
        elif has_prolog[0] != prolog:
            f_string = f_string.replace(has_prolog[0], prolog)
    return f_string

Also (to confirm): check that literals (strings, i.e., non-URIs) never appear as the subject or predicate of a triple. Literals are allowed only in the object position, though objects can also be nonliterals.
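
A position check of this kind could look like the sketch below. It assumes the triples are already available as N-Triples-style term strings (`<...>` for IRIs, `_:` for blank nodes, quoted strings for literals); `term_kind` and `position_errors` are hypothetical names. A conforming RDF parser will typically refuse such triples on its own, so the check is mainly useful on raw or hand-assembled data.

```python
def term_kind(term):
    """Classify an N-Triples-style term string as 'iri', 'bnode', or 'literal'."""
    if term.startswith("<") and term.endswith(">"):
        return "iri"
    if term.startswith("_:"):
        return "bnode"
    return "literal"


def position_errors(triples):
    """Return (triple, message) pairs for terms in a position RDF does not allow."""
    errors = []
    for triple in triples:
        subject, predicate, _object = triple
        # Subjects may be IRIs or blank nodes, never literals
        if term_kind(subject) == "literal":
            errors.append((triple, "literal in subject position"))
        # Predicates must be IRIs
        if term_kind(predicate) != "iri":
            errors.append((triple, "predicate is not an IRI"))
    return errors


triples = [
    ("<http://example.org/rec1>", "<http://purl.org/dc/terms/title>", '"Adam Makowicz Collection"'),
    ('"Adam Makowicz"', "<http://purl.org/dc/terms/creator>", "<http://example.org/rec1>"),
]
for triple, message in position_errors(triples):
    print(message, triple)
```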

## JSON-LD

In [None]:
# doc = {
#     "http://schema.org/name": "Manu Sporny",
#     "http://schema.org/url": {"@id": "http://manu.sporny.org/"},
#     "http://schema.org/image": {"@id": "http://manu.sporny.org/images/manu.png"}
# }
# playground_task3 sdo_record_006.json
doc = {
  "@context": "https://schema.org",
  # "@type": "Collection",
  "name": "Adam Makowicz Collection",
  "identifier": "06-034",
  "creator": {
    "@type": "Person",
    "name": "Adam Makowicz",
    # "birthDate": "1940"
  },
  # "description": "Collection of correspondence, promotional materials, photographs, sound recordings, and scores documenting the career of Adam Makowicz from 1973–2009.",
  # "temporalCoverage": "1973/2009",
  # "inLanguage": "en",
  "holdingArchive": {
    "@type": "ArchiveOrganization",
    "name": "Music Library, University of North Texas",
    "url": "http://url.unspecified"
  },
  # "accessMode": "Special arrangement required",
  # "materialExtent": "18 boxes",
  # "genre": "Jazz",
  "hasPart": [
    {
      "@type": "CreativeWork",
      "name": "Audio and video recordings",
      # "encodingFormat": "analog reel, cassette, CD, LP"
    },
    {
      "@type": "CreativeWork",
      "name": "Photographs, scores, promotional materials"
    }
  ]
}


In [None]:
from pyld import jsonld  # PyLD is not imported in the Data Loading cell

jsonld.flatten(doc)

JsonLdError: ('Could not expand input before flattening.',)
Type: jsonld.FlattenError
Cause: ('Dereferencing a URL did not result in a valid JSON-LD object. Possible causes are an inaccessible URL perhaps due to a same-origin policy (ensure the server uses CORS if you are using client-side JavaScript), too many redirects, a non-JSON response, or more than one HTTP Link Header was provided for a remote context.',)
Type: jsonld.InvalidUrl
Code: loading remote context failed
Details: {'url': 'https://schema.org', 'cause': JsonLdError('Could not retrieve a JSON-LD document from the URL.')}
  File "/Users/lucyhavens/miniconda3/envs/ldeval/lib/python3.13/site-packages/pyld/jsonld.py", line 912, in flatten
    expanded = self.expand(input_, options)
  File "/Users/lucyhavens/miniconda3/envs/ldeval/lib/python3.13/site-packages/pyld/jsonld.py", line 870, in expand
    expanded = self._expand(active_ctx, None, document, options,
        inside_list=False)
  File "/Users/lucyhavens/miniconda3/envs/ldeval/lib/python3.13/site-packages/pyld/jsonld.py", line 2302, in _expand
    active_ctx = self._process_context(
        active_ctx, element['@context'], options)
  File "/Users/lucyhavens/miniconda3/envs/ldeval/lib/python3.13/site-packages/pyld/jsonld.py", line 3049, in _process_context
    resolved = options['contextResolver'].resolve(active_ctx, local_ctx, options.get('base', ''))
  File "/Users/lucyhavens/miniconda3/envs/ldeval/lib/python3.13/site-packages/pyld/context_resolver.py", line 58, in resolve
    resolved = self._resolve_remote_context(
        active_ctx, ctx, base, cycles)
  File "/Users/lucyhavens/miniconda3/envs/ldeval/lib/python3.13/site-packages/pyld/context_resolver.py", line 108, in _resolve_remote_context
    context, remote_doc = self._fetch_context(active_ctx, url, cycles)
                          ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lucyhavens/miniconda3/envs/ldeval/lib/python3.13/site-packages/pyld/context_resolver.py", line 148, in _fetch_context
    raise jsonld.JsonLdError(
    ...<8 lines>...
        code='loading remote context failed')


In [None]:
context = {
    "name": "http://schema.org/name",
    # "url": {"@id": "http://schema.org/url", "@type": "@id"},
    # "image": {"@id": "http://schema.org/image", "@type": "@id"}
}

In [None]:
# compact a document according to a particular context (see: https://json-ld.org/spec/latest/json-ld/#compacted-document-form)
# compacted = jsonld.compact(doc, context)
# print(json.dumps(compacted, indent=2))

In [None]:
# expand a document, removing its context
# see: https://json-ld.org/spec/latest/json-ld/#expanded-document-form
expanded = jsonld.expand(doc) #compacted)

print(json.dumps(expanded, indent=2))

JsonLdError: ('Dereferencing a URL did not result in a valid JSON-LD object. Possible causes are an inaccessible URL perhaps due to a same-origin policy (ensure the server uses CORS if you are using client-side JavaScript), too many redirects, a non-JSON response, or more than one HTTP Link Header was provided for a remote context.',)
Type: jsonld.InvalidUrl
Code: loading remote context failed
Details: {'url': 'https://schema.org', 'cause': JsonLdError('Could not retrieve a JSON-LD document from the URL.')}
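
Both errors above report that the remote context at `https://schema.org` could not be retrieved. One possible workaround, assuming the goal is only to expand terms into full Schema.org IRIs, is to replace the remote context with an inline `@vocab` before passing the document to `jsonld.expand` or `jsonld.flatten`. The sketch below is an assumption, not a verified fix: `with_inline_vocab` is a hypothetical helper, the vocabulary IRI `http://schema.org/` should be checked against the published context, and an inline `@vocab` does not reproduce the type coercions of the full context (e.g., treating `url` values as IRIs).

```python
import copy

SDO_VOCAB = "http://schema.org/"  # assumption: vocabulary IRI used by the Schema.org context


def with_inline_vocab(doc, vocab=SDO_VOCAB):
    """Return a copy of a JSON-LD document whose top-level context is an inline @vocab."""
    local_doc = copy.deepcopy(doc)  # leave the original record untouched
    local_doc["@context"] = {"@vocab": vocab}
    return local_doc


record = {
    "@context": "https://schema.org",
    "name": "Adam Makowicz Collection",
    "creator": {"@type": "Person", "name": "Adam Makowicz"},
}
local_record = with_inline_vocab(record)
print(local_record["@context"])
```

Only the top-level `@context` is replaced; contexts nested inside the record would need the same treatment.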

### Schema.org

Is conformance basically a check that types exist and are referenced (spelled, structured) properly?

In [None]:
# JSON-LD Context (all types?): https://schema.org/docs/jsonldcontext.json
# Current version of vocabulary: https://schema.org/version/latest/schemaorg-current-https.jsonld (from https://schema.org/docs/developers.html)
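
If conformance is read as "every type and property exists in the vocabulary", the check could be sketched as below. This is a minimal sketch under assumptions: `collect_terms` and `unknown_terms` are hypothetical helpers, and `KNOWN_TYPES` and `KNOWN_PROPS` are tiny hand-written samples standing in for the full term lists that would be loaded from the vocabulary file referenced above. The sample record contains two deliberate misspellings.

```python
def collect_terms(node, types=None, props=None):
    """Recursively gather @type values and property names from JSON-LD data."""
    types = set() if types is None else types
    props = set() if props is None else props
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "@type":
                types.update([value] if isinstance(value, str) else value)
            elif key == "@context":
                continue  # context definitions are not data
            else:
                if not key.startswith("@"):
                    props.add(key)
                collect_terms(value, types, props)
    elif isinstance(node, list):
        for item in node:
            collect_terms(item, types, props)
    return types, props


def unknown_terms(doc, known_types, known_props):
    """Return sorted lists of types and properties missing from the known sets."""
    types, props = collect_terms(doc)
    return sorted(types - known_types), sorted(props - known_props)


# Hand-written samples (assumption), not the full Schema.org vocabulary
KNOWN_TYPES = {"Collection", "Person", "ArchiveOrganization", "CreativeWork"}
KNOWN_PROPS = {"name", "identifier", "creator", "holdingArchive", "url", "hasPart"}

sample = {
    "@context": "https://schema.org",
    "@type": "Colection",
    "name": "Adam Makowicz Collection",
    "creator": {"@type": "Person", "nmae": "Adam Makowicz"},
}
print(unknown_terms(sample, KNOWN_TYPES, KNOWN_PROPS))
```

This only tests existence and spelling. Whether a property is used on a type that permits it would need the domain information from the vocabulary as well.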

In [None]:
# to get all field names in JSON: '"@?[a-zA-Z]+"(?=:)'
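
The regular expression noted above can be compared with a walk over the parsed JSON, sketched here with hypothetical helper names. The pattern is the one from the comment with two small changes that are assumptions: a capture group around the name and `\s*` in the lookahead so that whitespace before the colon is tolerated. The sample string is made up for illustration.

```python
import json
import re

FIELD_NAME = re.compile(r'"(@?[a-zA-Z]+)"(?=\s*:)')


def field_names_regex(text):
    """Field names found by pattern matching on the raw JSON text."""
    return set(FIELD_NAME.findall(text))


def field_names_parsed(text):
    """Field names found by walking the parsed JSON structure."""
    names = set()

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                names.add(key)
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(json.loads(text))
    return names


sample_text = (
    '{"@context": "https://schema.org", "name": "Adam Makowicz Collection", '
    '"hasPart": [{"@type": "CreativeWork", "name": "Photographs"}], '
    '"P1_is_identified_by": "06-034"}'
)
print(field_names_parsed(sample_text) - field_names_regex(sample_text))
```

The difference between the two sets shows what the letters-only pattern misses: names containing digits, underscores, colons, or hyphens. The parsed walk requires valid JSON, while the regex also runs on files that fail to parse.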

### CIDOC-CRM

Is conformance basically a check that types exist and are referenced (spelled, structured) properly?

In [None]:
# JSON-LD Context: https://cidoc-crm.org/rdfs/7.1.3/CIDOC_CRM_v7.1.3_JSON-LD_Context.jsonld
# Classes & properties: https://cidoc-crm.org/html/cidoc_crm_v7.1.3.html
# Namespace: http://www.cidoc-crm.org/cidoc-crm/
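
A first, purely structural check could test whether identifiers have the shape of CIDOC-CRM names, such as `E22_Human-Made_Object` for a class or `P14i_performed` for an inverse property. The sketch below is an assumption about the naming convention, not a lookup against the ontology: the patterns and the helper names `is_crm_class` and `is_crm_property` are hypothetical, the `crm:` prefix is assumed, and a well-shaped name can still be absent from version 7.1.3.

```python
import re

# Optional prefix: a "crm:" CURIE prefix (assumed) or the full namespace IRI
_PREFIX = r"(?:crm:|http://www\.cidoc-crm\.org/cidoc-crm/)?"
CRM_CLASS = re.compile(_PREFIX + r"E\d+_[A-Z][A-Za-z_-]*$")
CRM_PROPERTY = re.compile(_PREFIX + r"P\d+i?_[a-z][A-Za-z_-]*$")


def is_crm_class(name):
    """True if the name is shaped like a CIDOC-CRM class identifier."""
    return CRM_CLASS.match(name) is not None


def is_crm_property(name):
    """True if the name is shaped like a CIDOC-CRM property identifier."""
    return CRM_PROPERTY.match(name) is not None


for candidate in ["E22_Human-Made_Object", "crm:E21_Person", "Person"]:
    print(candidate, is_crm_class(candidate))
for candidate in ["P1_is_identified_by", "P14i_performed", "identified_by"]:
    print(candidate, is_crm_property(candidate))
```

Names that pass this shape test would still need to be compared against the class and property list linked above to confirm that they exist.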

In [None]:
# to get all field names in JSON: '"@?[a-zA-Z]+"(?=:)'

Is it possible to automate a check for event-centric modeling?