# Evaluation: Conformance and Consistency
Part III of the computational evaluation of AI-generated linked data for [Linking Anthropology's Data and Archives (LADA)](https://ischool.umd.edu/projects/building-a-sustainable-future-for-anthropologys-archives-researching-primary-source-data-lifecycles-infrastructures-and-reuse/), focused on conformance to ontologies (i.e., Dublin Core, Schema.org, CIDOC-CRM) and consistency in the conformance across generated data points (e.g., is all data on one or two lines, or does each tag, subtag, etc. appear on its own line?).

---

**Table of Contents:**

I. [Data Loading](#data-loading)

II. [XML and RDF/XML](#xml-and-rdfxml)

  * [Dublin Core](#dublin-core)

III. [JSON-LD](#json-ld)

  * [Schema.org](#schemaorg)

  * [CIDOC-CRM](#cidoc-crm)

IV. [Linked Data Best Practices](#linked-data-best-practices)

---

**Resources:**
* [W3C Best Practices for Publishing Linked Data](https://www.w3.org/TR/ld-bp/)
* Dublin Core:
  * [DCMI Dublin Core in XML](https://www.dublincore.org/schemas/xmls/)
  * [DCMI Dublin Core in RDF](https://www.dublincore.org/specifications/dublin-core/dc-rdf-notes/)
* Schema.org:
  * [JSON-LD Context](https://schema.org/docs/jsonldcontext.json) - *question: is this for* ***all*** *Schema.org types?*
  * [Current version of vocabulary](https://schema.org/version/latest/schemaorg-current-https.jsonld)
  * [Documentation for developers](https://schema.org/docs/developers.html)
* CIDOC-CRM:
  * [JSON-LD Context](https://cidoc-crm.org/rdfs/7.1.3/CIDOC_CRM_v7.1.3_JSON-LD_Context.jsonld)
  * [Classes & properties](https://cidoc-crm.org/html/cidoc_crm_v7.1.3.html)
  * [Namespace](http://www.cidoc-crm.org/cidoc-crm/)

## Data Loading

In [8]:
import utils
import config
import pandas as pd
import numpy as np
import urllib
from urllib.request import Request, urlopen
import xml.etree.ElementTree as ET
import json
from lxml import etree
import rdflib
from rdflib.namespace import DC, SDO # Dublin Core, Schema.org
from pathlib import Path
import os
import re

# Candidate validation tools:
# xml.sax - to check that XML is well-formed
# xml.etree.ElementTree - to check the text between tags
# xml.etree.ElementTree + an XMLSchema validate() - to validate XML against a schema
#   (the standard-library module has no XMLSchema class; the xmlschema package and lxml's etree.XMLSchema both provide one)
# lxml etree.XMLParser - to validate against an input XML schema while parsing
# json_checker - to validate Python data types (incl. but not limited to those obtained from JSON)
# jsonschema.validate - to validate JSON against a JSON Schema
# ShEx - for RDF graphs, ShExJ for JSON - NOTE: couldn't install package
# OntoME - for CIDOC-CRM ontology alignment
# PyLD - to try if RDFLib is not working or not doing what is needed

In [9]:
# Bypass proxies for all hosts: https://docs.python.org/3/library/urllib.request.html
os.environ["no_proxy"] = "*"
# Browser-like User-Agent, as suggested here: https://www.reddit.com/r/learnpython/comments/1ea3r0z/how_to_avoid_http_error_403_forbidden/
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

Create variables to reference existing directories and files.

In [10]:
dublin_path = "cleaned/dublin_core/"  # XML data files
schema_path = "cleaned/schema_org/"   # JSON data files
cidoc_path = "cleaned/cidoc_crm/"     # JSON data files

dublin_t1_dir = config.task1_data+dublin_path
schema_t1_dir = config.task1_data+schema_path
cidoc_t1_dir = config.task1_data+cidoc_path

dublin_p1_dir = config.playgrd1_data+dublin_path
schema_p1_dir = config.playgrd1_data+schema_path
cidoc_p1_dir = config.playgrd1_data+cidoc_path

dublin_p3_dir = config.playgrd3_data+dublin_path
schema_p3_dir = config.playgrd3_data+schema_path
cidoc_p3_dir = config.playgrd3_data+cidoc_path

In [11]:
dc_paths = [dublin_t1_dir, dublin_p1_dir, dublin_p3_dir]
sdo_paths = [schema_t1_dir, schema_p1_dir, schema_p3_dir]
cidoc_paths = [cidoc_t1_dir, cidoc_p1_dir, cidoc_p3_dir]

In [12]:
d = "conformance"
report_dir = f"data/error_reports/{d}/"
Path(report_dir).mkdir(parents=True, exist_ok=True)

## XML and RDF/XML

### Dublin Core

***Note:***

*The Dublin Core schemas' URLs below are from [dublincore.org](https://www.dublincore.org/schemas/xmls/) under "Latest versions are always available as: ..."*

Sample empty record conforming to Dublin Core RDF/XML:

```
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/">
     <rdf:Description>
          <dc:title></dc:title>
          <dc:creator></dc:creator>
          <dc:contributor></dc:contributor>
          <dc:date></dc:date>
          <dc:relation></dc:relation>
          <dc:language></dc:language>
          <dc:source></dc:source>
          <dc:subject></dc:subject>
          <dc:description></dc:description>
          <dc:publisher></dc:publisher>
          <dc:rights></dc:rights>
          <dc:type></dc:type>
          <dc:format></dc:format>
          <dc:coverage></dc:coverage>
          <dc:identifier></dc:identifier>
     </rdf:Description>
</rdf:RDF>
```
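A record shaped like the sample can also be inspected with a namespace-aware parser instead of by matching tag text. The following is a minimal sketch that assumes the record is well-formed; `sample_record` and `list_dc_elements` are illustrative names that do not appear elsewhere in this notebook, and the title text is made up.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical record for illustration only
sample_record = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description>
    <dc:title>Example title</dc:title>
    <dc:creator></dc:creator>
  </rdf:Description>
</rdf:RDF>"""

def list_dc_elements(xml_text):
    """Return the local names of all elements in the Dublin Core elements namespace."""
    root = ET.fromstring(xml_text)
    prefix = "{" + DC_NS + "}"  # ElementTree writes qualified names as {namespace}local
    return [el.tag[len(prefix):] for el in root.iter() if el.tag.startswith(prefix)]

print(list_dc_elements(sample_record))  # ['title', 'creator']
```

Because the match is on the namespace URI rather than the `dc:` prefix, this approach would also recognize records that bind the same namespace to a different prefix.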

#### Metadata Field Names

Create variables for the 15 "core" elements (properties) of Dublin Core and for the two Dublin Core Metadata Initiative (DCMI) namespaces.

In [22]:
dc_elements = ["creator", "contributor", "date", "title", "publisher", 
             "language", "format", "subject", "description", "identifier", 
             "relation", "source", "type", "coverage", "rights"]
dcmi_simple = ["dc:"+elem for elem in dc_elements]
dcmi_qualified = ["dcterms:"+elem for elem in dc_elements]
print(dcmi_simple[:3])
print(dcmi_qualified[:3])

['dc:creator', 'dc:contributor', 'dc:date']
['dcterms:creator', 'dcterms:contributor', 'dcterms:date']
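One way the element list could be used is to flag `dc:` tags whose names are not among the 15 elements. This is a rough sketch rather than the notebook's method: `DC_ELEMENTS`, `DC_TAG`, and `unknown_dc_fields` are hypothetical names, and the example string is invented.

```python
import re

# The 15 elements of the Dublin Core Metadata Element Set
DC_ELEMENTS = {"creator", "contributor", "date", "title", "publisher",
               "language", "format", "subject", "description", "identifier",
               "relation", "source", "type", "coverage", "rights"}

# Matches opening tags such as <dc:title> (with or without attributes), capturing the local name
DC_TAG = re.compile(r"<dc:([A-Za-z]+)[^>]*>")

def unknown_dc_fields(text):
    """Return the sorted local names of dc: tags that are not one of the 15 elements."""
    return sorted({name for name in DC_TAG.findall(text) if name not in DC_ELEMENTS})

example = "<dc:title>Field notes</dc:title><dc:author>Unknown</dc:author>"
print(unknown_dc_fields(example))  # ['author']
```

The `dcterms:` namespace defines further terms beyond these 15 (for example `dcterms:created`), so an equivalent check for qualified Dublin Core would need the full DCMI Metadata Terms list rather than `dc_elements`.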


In [23]:
# Read the TXT files so all generated metadata can be read, whether or not the XML is well-formed.
extension = ".txt"
dublin_file_paths = []
dublin_files_t1 = [f for f in os.listdir(dublin_t1_dir) if f.endswith(extension)]
dublin_file_paths += [dublin_t1_dir+f for f in dublin_files_t1]
dublin_files_p1 = [f for f in os.listdir(dublin_p1_dir) if f.endswith(extension)]
dublin_file_paths += [dublin_p1_dir+f for f in dublin_files_p1]
dublin_files_p3 = [f for f in os.listdir(dublin_p3_dir) if f.endswith(extension)]
dublin_file_paths += [dublin_p3_dir+f for f in dublin_files_p3]
dublin_file_paths.sort()
total_dc_files = len(dublin_file_paths)
print(f"Total Dublin Core {extension[1:].upper()} files:", total_dc_files)

Total Dublin Core TXT files: 107
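Since the text files may or may not contain well-formed XML, a quick well-formedness check can be run on each file's contents before any namespace-aware parsing. The sketch below uses only the standard library; `check_well_formed` is a hypothetical helper name and the two example strings are invented.

```python
import xml.etree.ElementTree as ET

def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses, otherwise (False, parser error message)."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        return False, str(err)

print(check_well_formed("<record><title>Notes</title></record>"))  # (True, None)
ok, message = check_well_formed("<record><title>Notes</record>")   # mismatched closing tag
print(ok, message)
```

Note that the parser also rejects a fragment that uses a prefix such as `dc:` without declaring its namespace, so records lacking `xmlns:dc="..."` would be reported as not well-formed by this check.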


In [61]:
# Raw strings keep regex escapes such as \s from triggering invalid-escape-sequence warnings
simple_tag = re.compile(r'<dc:[a-z]+>')
qual_tag = re.compile(r'<dcterms:[a-z]+>')
# attrib_tag = re.compile(r'<[a-z]+ [a-z]+=[a-z"]+>')
# field_tag = re.compile(r'<[^:^>^\/^=]+>')
rdf_tags = re.compile(r'</?rdf:RDF\s*[^>]*>')
rdf_desc_tags = re.compile(r'</?rdf:Description\s*[^>]*>')
# Escape ? and . so that they match literally instead of acting as regex operators
prolog = re.compile(r'<\?xml version="1\.0"')


In [68]:
# Check whether fields present in DC metadata records are included according to simple or qualified Dublin Core
use_prolog, use_simple, use_qualified, use_rdf, use_rdf_desc = [], [], [], [], []
for file_path in dublin_file_paths:
    with open(file_path, "r") as f:
        text = f.read()  # Case is preserved (no .lower())

    # Look for a prolog (with or without encoding specified)
    use_prolog.append(len(re.findall(prolog, text)) > 0)

    # Look for tags in simple Dublin Core; store the matches, or False if there are none
    simples = re.findall(simple_tag, text)
    use_simple.append(simples if simples else False)

    # Look for tags in qualified Dublin Core; store the matches, or False if there are none
    quals = re.findall(qual_tag, text)
    use_qualified.append(quals if quals else False)

    # Look for exactly two rdf:RDF tags (expected: one opening and one closing)
    rdfs = re.findall(rdf_tags, text)
    use_rdf.append(rdfs if len(rdfs) == 2 else False)

    # Look for exactly two rdf:Description tags (expected: one opening and one closing)
    rdf_descs = re.findall(rdf_desc_tags, text)
    use_rdf_desc.append(rdf_descs if len(rdf_descs) == 2 else False)



In [69]:
df = pd.DataFrame.from_dict({"file":dublin_file_paths, "uses_prolog":use_prolog, "uses_simple":use_simple, "uses_qualified":use_qualified, "uses_rdf":use_rdf, "uses_rdf_desc":use_rdf_desc})
df.head()

| | file | uses_prolog | uses_simple | uses_qualified | uses_rdf | uses_rdf_desc |
|---|---|---|---|---|---|---|
| 0 | data/data_playground_task1/cleaned/dublin_core... | False | False | False | False | False |
| 1 | data/data_playground_task1/cleaned/dublin_core... | False | `[<dc:title>, <dc:creator>, <dc:subject>, <dc:d...` | False | False | False |
| 2 | data/data_playground_task1/cleaned/dublin_core... | True | `[<dc:title>, <dc:subject>, <dc:description>, <...` | False | False | False |
| 3 | data/data_playground_task1/cleaned/dublin_core... | False | `[<dc:title>, <dc:creator>, <dc:subject>, <dc:d...` | False | False | False |
| 4 | data/data_playground_task1/cleaned/dublin_core... | False | `[<dc:title>, <dc:creator>, <dc:contributor>, <...` | False | False | False |


Count the files with and without a prolog, with and without simple or qualified Dublin Core (DC) tags, and with and without RDF (Resource Description Framework) wrapper tags, and express each count as a percentage of all files.

In [71]:
prolog_counts = pd.DataFrame(df.uses_prolog.value_counts()).reset_index().rename(columns={"count":"file_count"})
proportions = (prolog_counts[["file_count"]]/(len(dublin_file_paths))).values
percentages = [f"{proportion[0]*100:.2f}%" for proportion in proportions]
prolog_counts.insert(len(prolog_counts.columns), "proportion_of_files", percentages)
prolog_counts

| | uses_prolog | file_count | proportion_of_files |
|---|---|---|---|
| 0 | False | 64 | 59.81% |
| 1 | True | 43 | 40.19% |


In [72]:
col = "uses_simple"
subdf = df.loc[df[col] == False]
simple_counts = pd.DataFrame.from_dict({col:[False, True], "file_count":[subdf.shape[0], (df.shape[0] - subdf.shape[0])]})
proportions = (simple_counts[["file_count"]]/(len(dublin_file_paths))).values
percentages = [f"{proportion[0]*100:.2f}%" for proportion in proportions]
simple_counts.insert(len(simple_counts.columns), "proportion_of_files", percentages)
simple_counts

| | uses_simple | file_count | proportion_of_files |
|---|---|---|---|
| 0 | False | 14 | 13.08% |
| 1 | True | 93 | 86.92% |


In [81]:
col = "uses_qualified"
subdf = df.loc[df[col] == False]
qual_counts = pd.DataFrame.from_dict({col:[False, True], "file_count":[subdf.shape[0], (df.shape[0] - subdf.shape[0])]})
proportions = (qual_counts[["file_count"]]/(len(dublin_file_paths))).values
percentages = [f"{proportion[0]*100:.2f}%" for proportion in proportions]
qual_counts.insert(len(qual_counts.columns), "proportion_of_files", percentages)
dc_counts = pd.concat([simple_counts, qual_counts], axis=1)
dc_counts

| | uses_simple | file_count | proportion_of_files | uses_qualified | file_count | proportion_of_files |
|---|---|---|---|---|---|---|
| 0 | False | 14 | 13.08% | False | 97 | 90.65% |
| 1 | True | 93 | 86.92% | True | 10 | 9.35% |


In [84]:
col = "uses_rdf"
subdf = df.loc[df[col] == False]
rdf_counts = pd.DataFrame.from_dict({col:[False, True], "file_count":[subdf.shape[0], (df.shape[0] - subdf.shape[0])]})
proportions = (rdf_counts[["file_count"]]/(len(dublin_file_paths))).values
percentages = [f"{proportion[0]*100:.2f}%" for proportion in proportions]
rdf_counts.insert(len(rdf_counts.columns), "proportion_of_files", percentages)
rdf_counts

| | uses_rdf | file_count | proportion_of_files |
|---|---|---|---|
| 0 | False | 107 | 100.00% |
| 1 | True | 0 | 0.00% |


In [85]:
col = "uses_rdf_desc"
subdf = df.loc[df[col] == False]
rdf_desc_counts = pd.DataFrame.from_dict({col:[False, True], "file_count":[subdf.shape[0], (df.shape[0] - subdf.shape[0])]})
proportions = (rdf_desc_counts[["file_count"]]/(len(dublin_file_paths))).values
percentages = [f"{proportion[0]*100:.2f}%" for proportion in proportions]
rdf_desc_counts.insert(len(rdf_desc_counts.columns), "proportion_of_files", percentages)
rdf_counts = pd.concat([rdf_counts, rdf_desc_counts], axis=1)
rdf_counts

| | uses_rdf | file_count | proportion_of_files | uses_rdf_desc | file_count | proportion_of_files |
|---|---|---|---|---|---|---|
| 0 | False | 107 | 100.00% | False | 99 | 92.52% |
| 1 | True | 0 | 0.00% | True | 8 | 7.48% |
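The count cells above repeat the same steps for each column, so the calculation could be factored into one helper. The following is a dependency-free sketch under the assumption that each per-file value is either `False` (absent) or a non-empty list of matches (present), as in `df`; `presence_counts` and the example list are hypothetical.

```python
def presence_counts(flags):
    """Summarize per-file values: falsy values count as absent, anything else as present."""
    total = len(flags)
    present = sum(1 for flag in flags if flag)
    rows = []
    for label, count in ((False, total - present), (True, present)):
        share = count / total if total else 0.0  # avoid dividing by zero for an empty list
        rows.append({"present": label,
                     "file_count": count,
                     "proportion_of_files": f"{share * 100:.2f}%"})
    return rows

# Invented values standing in for one column of the conformance table
example = [False, ["<dc:title>"], ["<dc:title>", "<dc:date>"], False, ["<dc:creator>"]]
for row in presence_counts(example):
    print(row)
```

The returned list of dictionaries has one row per outcome, so it could be passed to `pd.DataFrame(...)` to obtain a table with the same columns as the ones shown above.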


Save the reports as CSV files.

In [86]:
metadata_standard = "dublin_core"
data_serialization = "xml"

In [87]:
report_type = "conformance"
df.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=True
    )

In [88]:
report_type = "prolog_stats"
prolog_counts.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=True
    )

In [89]:
report_type = "dc_simple_qual_stats"
dc_counts.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=True
    )

In [90]:
report_type = "rdf_stats"
rdf_counts.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=True
    )
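The four save cells build the same file name pattern by hand, which makes it easy to pair a report name with the wrong table. A small path helper is one possible way to keep the naming in one place; `report_path` is a hypothetical function, and the directory string simply mirrors the `report_dir` value defined earlier.

```python
from pathlib import Path

def report_path(report_dir, metadata_standard, data_serialization, report_type):
    """Build the CSV path <report_dir>/<standard>_<serialization>_<report_type>.csv."""
    file_name = f"{metadata_standard}_{data_serialization}_{report_type}.csv"
    return Path(report_dir) / file_name

path = report_path("data/error_reports/conformance/", "dublin_core", "xml", "rdf_stats")
print(path.as_posix())  # data/error_reports/conformance/dublin_core_xml_rdf_stats.csv
```

Each table could then be written with a call such as `rdf_counts.to_csv(report_path(report_dir, metadata_standard, data_serialization, "rdf_stats"), index=True)`.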

## JSON-LD

In [None]:
# field_values = re.compile('((?<=:)\s*)"[^"]+"')

In [None]:
# doc = {
#     "http://schema.org/name": "Manu Sporny",
#     "http://schema.org/url": {"@id": "http://manu.sporny.org/"},
#     "http://schema.org/image": {"@id": "http://manu.sporny.org/images/manu.png"}
# }
# playground_task3 sdo_record_006.json
doc = {
  "@context": "https://schema.org",
  # "@type": "Collection",
  "name": "Adam Makowicz Collection",
  "identifier": "06-034",
  "creator": {
    "@type": "Person",
    "name": "Adam Makowicz",
    # "birthDate": "1940"
  },
  # "description": "Collection of correspondence, promotional materials, photographs, sound recordings, and scores documenting the career of Adam Makowicz from 1973–2009.",
  # "temporalCoverage": "1973/2009",
  # "inLanguage": "en",
  "holdingArchive": {
    "@type": "ArchiveOrganization",
    "name": "Music Library, University of North Texas",
    "url": "http://url.unspecified"
  },
  # "accessMode": "Special arrangement required",
  # "materialExtent": "18 boxes",
  # "genre": "Jazz",
  "hasPart": [
    {
      "@type": "CreativeWork",
      "name": "Audio and video recordings",
      # "encodingFormat": "analog reel, cassette, CD, LP"
    },
    {
      "@type": "CreativeWork",
      "name": "Photographs, scores, promotional materials"
    }
  ]
}


In [None]:
from pyld import jsonld  # PyLD provides the jsonld module used in this and the following cells

jsonld.flatten(doc)

JsonLdError: ('Could not expand input before flattening.',)
Type: jsonld.FlattenError
Cause: ('Dereferencing a URL did not result in a valid JSON-LD object. Possible causes are an inaccessible URL perhaps due to a same-origin policy (ensure the server uses CORS if you are using client-side JavaScript), too many redirects, a non-JSON response, or more than one HTTP Link Header was provided for a remote context.',)
Type: jsonld.InvalidUrl
Code: loading remote context failed
Details: {'url': 'https://schema.org', 'cause': JsonLdError('Could not retrieve a JSON-LD document from the URL.')}
  File "/Users/lucyhavens/miniconda3/envs/ldeval/lib/python3.13/site-packages/pyld/jsonld.py", line 912, in flatten
    expanded = self.expand(input_, options)
  File "/Users/lucyhavens/miniconda3/envs/ldeval/lib/python3.13/site-packages/pyld/jsonld.py", line 870, in expand
    expanded = self._expand(active_ctx, None, document, options,
        inside_list=False)
  File "/Users/lucyhavens/miniconda3/envs/ldeval/lib/python3.13/site-packages/pyld/jsonld.py", line 2302, in _expand
    active_ctx = self._process_context(
        active_ctx, element['@context'], options)
  File "/Users/lucyhavens/miniconda3/envs/ldeval/lib/python3.13/site-packages/pyld/jsonld.py", line 3049, in _process_context
    resolved = options['contextResolver'].resolve(active_ctx, local_ctx, options.get('base', ''))
  File "/Users/lucyhavens/miniconda3/envs/ldeval/lib/python3.13/site-packages/pyld/context_resolver.py", line 58, in resolve
    resolved = self._resolve_remote_context(
        active_ctx, ctx, base, cycles)
  File "/Users/lucyhavens/miniconda3/envs/ldeval/lib/python3.13/site-packages/pyld/context_resolver.py", line 108, in _resolve_remote_context
    context, remote_doc = self._fetch_context(active_ctx, url, cycles)
                          ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lucyhavens/miniconda3/envs/ldeval/lib/python3.13/site-packages/pyld/context_resolver.py", line 148, in _fetch_context
    raise jsonld.JsonLdError(
    ...<8 lines>...
        code='loading remote context failed')


In [None]:
context = {
    "name": "http://schema.org/name",
    # "url": {"@id": "http://schema.org/url", "@type": "@id"},
    # "image": {"@id": "http://schema.org/image", "@type": "@id"}
}

In [None]:
# compact a document according to a particular context (see: https://json-ld.org/spec/latest/json-ld/#compacted-document-form)
# compacted = jsonld.compact(doc, context)
# print(json.dumps(compacted, indent=2))

In [None]:
# expand a document, removing its context
# see: https://json-ld.org/spec/latest/json-ld/#expanded-document-form
expanded = jsonld.expand(doc) #compacted)

print(json.dumps(expanded, indent=2))

JsonLdError: ('Dereferencing a URL did not result in a valid JSON-LD object. Possible causes are an inaccessible URL perhaps due to a same-origin policy (ensure the server uses CORS if you are using client-side JavaScript), too many redirects, a non-JSON response, or more than one HTTP Link Header was provided for a remote context.',)
Type: jsonld.InvalidUrl
Code: loading remote context failed
Details: {'url': 'https://schema.org', 'cause': JsonLdError('Could not retrieve a JSON-LD document from the URL.')}
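Both errors come from PyLD failing to dereference the remote context at `https://schema.org`. One possible workaround is to serve the context from a local copy through a custom document loader. The sketch below is an assumption-laden illustration, not a tested fix for this notebook: `LOCAL_CONTEXTS` and `static_document_loader` are hypothetical names, and the one-line `@vocab` context is a stand-in for the much larger published Schema.org context.

```python
# Assumption: a minimal stand-in context that maps every term into the Schema.org namespace.
# The published Schema.org context is far larger and also declares value types.
LOCAL_CONTEXTS = {
    "https://schema.org": {"@context": {"@vocab": "http://schema.org/"}},
}

def static_document_loader(url, options=None):
    """Return a locally stored context in the remote-document shape that PyLD loaders return."""
    key = url.rstrip("/")
    if key not in LOCAL_CONTEXTS:
        raise ValueError(f"No local copy of the context for {url}")
    return {
        "contentType": "application/ld+json",
        "contextUrl": None,
        "documentUrl": url,
        "document": LOCAL_CONTEXTS[key],
    }

try:
    from pyld import jsonld
    # Register the loader so that context URLs are resolved locally instead of over HTTP
    jsonld.set_document_loader(static_document_loader)
except ImportError:
    pass  # PyLD is not installed; the loader function can still be inspected on its own

print(static_document_loader("https://schema.org")["document"])
```

With the loader registered, `jsonld.expand(doc)` should no longer need network access for this context. Because `@vocab` expands any term, including misspelled ones, expansion with this stand-in context cannot by itself reveal non-conforming property names.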

### Schema.org

Open question: is conformance here essentially a check that the types and properties used exist in the vocabulary and are referenced (spelled and structured) correctly?

In [None]:
# JSON-LD Context (all types?): https://schema.org/docs/jsonldcontext.json
# Current version of vocabulary: https://schema.org/version/latest/schemaorg-current-https.jsonld (from https://schema.org/docs/developers.html)

In [None]:
# to get all field names in JSON: '"@?[a-zA-Z]+"(?=:)'
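As an alternative to a regular expression over the raw text, property names and `@type` values can be collected from the parsed JSON and compared with a list of known terms. This is a sketch only: `collect_terms` is a hypothetical helper, the two `KNOWN_*` sets are a hand-picked subset used for illustration (in practice they would be derived from the vocabulary file listed under Resources), and the record deliberately contains an invented property (`boxCount`) and a misspelled type (`Persn`).

```python
def collect_terms(node, properties=None, types=None):
    """Recursively gather property names and @type values from parsed JSON-LD."""
    properties = set() if properties is None else properties
    types = set() if types is None else types
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "@context":
                continue  # context definitions are not data properties
            if key == "@type":
                types.update(value if isinstance(value, list) else [value])
            elif not key.startswith("@"):
                properties.add(key)
                collect_terms(value, properties, types)
    elif isinstance(node, list):
        for item in node:
            collect_terms(item, properties, types)
    return properties, types

# Assumption: small illustrative subsets, not the full Schema.org vocabulary
KNOWN_TYPES = {"Collection", "Person", "CreativeWork", "ArchiveOrganization"}
KNOWN_PROPERTIES = {"name", "identifier", "creator", "hasPart", "holdingArchive", "url"}

record = {
    "@context": "https://schema.org",
    "@type": "Collection",
    "name": "Example collection",
    "creator": {"@type": "Persn", "name": "Example person"},
    "boxCount": "18",
}

properties, types = collect_terms(record)
print(sorted(properties - KNOWN_PROPERTIES))  # ['boxCount']
print(sorted(types - KNOWN_TYPES))            # ['Persn']
```

Working on the parsed structure also handles nested objects and lists, which a pattern over the raw text cannot distinguish from top-level fields.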

### CIDOC-CRM

Open question: is conformance here essentially a check that the classes and properties used exist in CIDOC-CRM and are referenced (spelled and structured) correctly?

In [None]:
# JSON-LD Context: https://cidoc-crm.org/rdfs/7.1.3/CIDOC_CRM_v7.1.3_JSON-LD_Context.jsonld
# Classes & properties: https://cidoc-crm.org/html/cidoc_crm_v7.1.3.html
# Namespace: http://www.cidoc-crm.org/cidoc-crm/

In [None]:
# to get all field names in JSON (CIDOC-CRM names include digits and underscores, e.g. P1_is_identified_by): '"@?[a-zA-Z0-9_:-]+"(?=\s*:)'

Open question: is it possible to automate a check for event-centric modeling?
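One possible heuristic, offered as a starting point rather than an answer, is to treat a record as event-centric when at least one node is typed with a CIDOC-CRM event class. The sketch assumes that types are written with their class codes (for example `E12_Production`, optionally behind a prefix or namespace), and `EVENT_CLASS_CODES` is a partial, hand-picked list; `type_codes` and `looks_event_centric` are hypothetical names, and both example records are invented.

```python
import re

# Assumption: a partial list of event classes (E5 Event, E7 Activity, E8 Acquisition,
# E12 Production, E65 Creation, E67 Birth, E69 Death); a real check would cover all subclasses
EVENT_CLASS_CODES = {"E5", "E7", "E8", "E12", "E65", "E67", "E69"}

# Captures a class code such as E12 when it is followed by an underscore
CLASS_CODE = re.compile(r"(?<![A-Za-z0-9])(E\d+)(?=_)")

def type_codes(node):
    """Yield CIDOC-CRM class codes found in every @type value of parsed JSON-LD."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "@type":
                for label in (value if isinstance(value, list) else [value]):
                    match = CLASS_CODE.search(str(label))
                    if match:
                        yield match.group(1)
            else:
                yield from type_codes(value)
    elif isinstance(node, list):
        for item in node:
            yield from type_codes(item)

def looks_event_centric(record):
    """Return True if any node in the record is typed with one of the listed event classes."""
    return any(code in EVENT_CLASS_CODES for code in type_codes(record))

object_only = {"@type": "E22_Human-Made_Object",
               "P1_is_identified_by": {"@type": "E41_Appellation"}}
with_event = {"@type": "crm:E22_Human-Made_Object",
              "P108i_was_produced_by": {"@type": "crm:E12_Production"}}

print(looks_event_centric(object_only))  # False
print(looks_event_centric(with_event))   # True
```

This only detects the presence of event-typed nodes. It does not verify that dates, places, and actors are attached to those events rather than directly to the object, which a fuller check of event-centric modeling would need to examine.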