# Evaluation: Conformance and Consistency
Part III of the computational evaluation of AI-generated linked data for [Linking Anthropology's Data and Archives (LADA)](https://ischool.umd.edu/projects/building-a-sustainable-future-for-anthropologys-archives-researching-primary-source-data-lifecycles-infrastructures-and-reuse/), focused on conformance to ontologies (i.e., Dublin Core, Schema.org, CIDOC-CRM) and consistency in the conformance across generated data points (e.g., is all data on one or two lines, or does each tag, subtag, etc. appear on its own line?).

---

**Table of Contents:**

I. [Data Loading](#data-loading)

II. [XML and RDF/XML](#xml-and-rdfxml)

  * [Dublin Core](#dublin-core)

III. [JSON-LD](#json-ld)

  * [Schema.org](#schemaorg)

  * [CIDOC-CRM](#cidoc-crm)

IV. [Linked Data Best Practices](#linked-data-best-practices)

---

**Resources:**
* [W3C Best Practices for Publishing Linked Data](https://www.w3.org/TR/ld-bp/)
* Dublin Core:
  * [DCMI Dublin Core in XML](https://www.dublincore.org/schemas/xmls/)
  * [DCMI Dublin Core in RDF](https://www.dublincore.org/specifications/dublin-core/dc-rdf-notes/)
* Schema.org:
  * [JSON-LD Context](https://schema.org/docs/jsonldcontext.json) - *question: is this for* ***all*** *Schema.org types?*
  * [Current version of vocabulary](https://schema.org/version/latest/schemaorg-current-https.jsonld)
  * [Documentation for developers](https://schema.org/docs/developers.html)
  * [Stack Overflow Q&A on expansion and compaction](https://stackoverflow.com/questions/76795530/json-ld-expansion)
* CIDOC-CRM:
  * [JSON-LD Context](https://cidoc-crm.org/rdfs/7.1.3/CIDOC_CRM_v7.1.3_JSON-LD_Context.jsonld)
  * [Classes & properties](https://cidoc-crm.org/html/cidoc_crm_v7.1.3.html)
  * [Namespace](http://www.cidoc-crm.org/cidoc-crm/)

## Data Loading

In [16]:
import utils
import config
import pandas as pd
import numpy as np
import urllib
from urllib.request import Request, urlopen
import xml.etree.ElementTree as ET
from lxml import etree
from pyld import jsonld
import json
from pathlib import Path
import os
import re

In [2]:
# Bypass proxies for all hosts: https://docs.python.org/3/library/urllib.request.html
os.environ["no_proxy"] = "*"
# Browser-like User-Agent, as suggested here: https://www.reddit.com/r/learnpython/comments/1ea3r0z/how_to_avoid_http_error_403_forbidden/
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

Create variables to reference existing folders (a.k.a. directories) and files.

In [3]:
dublin_path = "cleaned/dublin_core/"  # XML data files
schema_path = "cleaned/schema_org/"   # JSON data files
cidoc_path = "cleaned/cidoc_crm/"     # JSON data files

dublin_t1_dir = config.task1_data+dublin_path
schema_t1_dir = config.task1_data+schema_path
cidoc_t1_dir = config.task1_data+cidoc_path

dublin_p1_dir = config.playgrd1_data+dublin_path
schema_p1_dir = config.playgrd1_data+schema_path
cidoc_p1_dir = config.playgrd1_data+cidoc_path

dublin_p3_dir = config.playgrd3_data+dublin_path
schema_p3_dir = config.playgrd3_data+schema_path
cidoc_p3_dir = config.playgrd3_data+cidoc_path

In [4]:
dc_paths = [dublin_t1_dir, dublin_p1_dir, dublin_p3_dir]
sdo_paths = [schema_t1_dir, schema_p1_dir, schema_p3_dir]
cidoc_paths = [cidoc_t1_dir, cidoc_p1_dir, cidoc_p3_dir]

In [5]:
d = "conformance"
report_dir = f"data/error_reports/{d}/"
Path(report_dir).mkdir(parents=True, exist_ok=True)

## XML and RDF/XML

### Dublin Core

***Note:***

*The Dublin Core schemas' URLs below are from [dublincore.org](https://www.dublincore.org/schemas/xmls/) under "Latest versions are always available as: ..."*

Sample empty record conforming to *simple* Dublin Core RDF/XML:

```
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/">
     <rdf:Description>
          <dc:title></dc:title>
          <dc:creator></dc:creator>
          <dc:contributor></dc:contributor>
          <dc:date></dc:date>
          <dc:relation></dc:relation>
          <dc:language></dc:language>
          <dc:source></dc:source>
          <dc:subject></dc:subject>
          <dc:description></dc:description>
          <dc:publisher></dc:publisher>
          <dc:rights></dc:rights>
          <dc:type></dc:type>
          <dc:format></dc:format>
          <dc:coverage></dc:coverage>
          <dc:identifier></dc:identifier>
     </rdf:Description>
</rdf:RDF>
```

Sample empty record conforming to *qualified* Dublin Core RDF/XML:

```
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/">
     <rdf:Description>
          <dcterms:title></dcterms:title>
          <dcterms:creator></dcterms:creator>
          <dcterms:contributor></dcterms:contributor>
          <dcterms:date></dcterms:date>
          <dcterms:relation></dcterms:relation>
          <dcterms:language></dcterms:language>
          <dcterms:source></dcterms:source>
          <dcterms:subject></dcterms:subject>
          <dcterms:description></dcterms:description>
          <dcterms:publisher></dcterms:publisher>
          <dcterms:rights></dcterms:rights>
          <dcterms:type></dcterms:type>
          <dcterms:format></dcterms:format>
          <dcterms:coverage></dcterms:coverage>
          <dcterms:identifier></dcterms:identifier>
     </rdf:Description>
</rdf:RDF>
```
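Whether a record is well-formed, and which Dublin Core namespace its elements resolve to, can also be checked by parsing rather than by reading the text. The following is a minimal sketch using only the standard library; the function name `dc_namespaces_used` and the one-field sample record are illustrative assumptions, not part of the evaluation code.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"   # simple Dublin Core
DCTERMS_NS = "http://purl.org/dc/terms/"     # qualified Dublin Core (DCMI terms)

def dc_namespaces_used(xml_text):
    """Return {"simple"}, {"qualified"}, both, or an empty set.

    Returns None when the text is not well-formed XML.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    used = set()
    for elem in root.iter():
        # ElementTree expands prefixes to "{namespace-uri}localname"
        if elem.tag.startswith("{" + DC_NS + "}"):
            used.add("simple")
        elif elem.tag.startswith("{" + DCTERMS_NS + "}"):
            used.add("qualified")
    return used

# Hypothetical one-field record, for illustration only
sample = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description>
    <dcterms:title>Example</dcterms:title>
  </rdf:Description>
</rdf:RDF>"""

print(dc_namespaces_used(sample))
print(dc_namespaces_used("<a><b></a>"))  # mismatched tags, so not well-formed
```

Because the parser resolves prefixes through the `xmlns` declarations, a record that uses a prefix it never declares is reported as not well-formed.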

#### Metadata Field Names

Create variables for the 15 "core" elements (properties) of Dublin Core, prefixed for each of the two Dublin Core Metadata Initiative (DCMI) namespaces.

In [22]:
dc_elements = ["creator", "contributor", "date", "title", "publisher", 
             "language", "format", "subject", "description", "identifier", 
             "relation", "source", "type", "coverage", "rights"]
dcmi_simple = ["dc:"+elem for elem in dc_elements]
dcmi_qualified = ["dcterms:"+elem for elem in dc_elements]
print(dcmi_simple[:3])
print(dcmi_qualified[:3])

['dc:creator', 'dc:contributor', 'dc:date']
['dcterms:creator', 'dcterms:contributor', 'dcterms:date']


In [23]:
# Read the TXT files so all generated metadata can be read, whether or not the XML is well-formed.
extension = ".txt"
dublin_file_paths = []
dublin_files_t1 = [f for f in os.listdir(dublin_t1_dir) if f.endswith(extension)]
dublin_file_paths += [dublin_t1_dir+f for f in dublin_files_t1]
dublin_files_p1 = [f for f in os.listdir(dublin_p1_dir) if f.endswith(extension)]
dublin_file_paths += [dublin_p1_dir+f for f in dublin_files_p1]
dublin_files_p3 = [f for f in os.listdir(dublin_p3_dir) if f.endswith(extension)]
dublin_file_paths += [dublin_p3_dir+f for f in dublin_files_p3]
dublin_file_paths.sort()
total_dc_files = len(dublin_file_paths)
print(f"Total Dublin Core {extension[1:].upper()} files:", total_dc_files)

Total Dublin Core TXT files: 107


In [61]:
# Raw strings keep regex escapes such as \s from being read as Python string escapes.
simple_tag = re.compile(r'<dc:[a-z]+>')
qual_tag = re.compile(r'<dcterms:[a-z]+>')
# attrib_tag = re.compile(r'<[a-z]+ [a-z]+=[a-z"]+>')
# field_tag = re.compile(r'<[^:^>^\/^=]+>')
rdf_tags = re.compile(r'</?rdf:RDF\s*[^>]*>')
rdf_desc_tags = re.compile(r'</?rdf:Description\s*[^>]*>')
# Escape "?" and "." so the prolog is matched literally rather than as regex operators.
prolog = re.compile(r'<\?xml version="1\.0"')


In [68]:
# Check whether fields present in DC metadata records are included according to simple or qualified Dublin Core
use_prolog, use_simple, use_qualified, use_rdf, use_rdf_desc = [], [], [], [], []
for file_path in dublin_file_paths:
    with open(file_path, "r") as f:
        f_lower = f.read() #.lower()

        # Look for a prolog (with or without encoding specified)
        has_prolog = re.findall(prolog, f_lower)
        if len(has_prolog) > 0:
            use_prolog += [True]
        else:
            use_prolog += [False]
        
        # Look for tags in simple Dublin Core
        simples = re.findall(simple_tag, f_lower)
        if len(simples) > 0:
            use_simple += [simples]
        else:
            use_simple += [False]
        
        # Look for tags in qualified Dublin Core
        quals = re.findall(qual_tag, f_lower)
        if len(quals) > 0:
            use_qualified += [quals]
        else:
            use_qualified += [False]

        # Look for open and close rdf:RDF tags
        rdfs = re.findall(rdf_tags, f_lower)
        if len(rdfs) == 2:
            use_rdf += [rdfs]
        else:
            use_rdf += [False]

        # Look for open and close rdf:Description tags
        rdf_descs = re.findall(rdf_desc_tags, f_lower)
        if len(rdf_descs) == 2:
            use_rdf_desc += [rdf_descs]
        else:
            use_rdf_desc += [False]
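The loop above records which `dc:` and `dcterms:` tags appear, but not whether their names are among the 15 core elements. A minimal sketch of that follow-up check is below; `unknown_dc_tags` is a hypothetical helper, and the input string is made up for illustration. Note that a flagged `dcterms:` tag may still be a valid DCMI term outside the 15 core elements, so the result is a list of candidates to review, not a list of errors.

```python
import re

# The 15 core Dublin Core elements
DC_ELEMENTS = {
    "creator", "contributor", "date", "title", "publisher",
    "language", "format", "subject", "description", "identifier",
    "relation", "source", "type", "coverage", "rights",
}

# Opening tags only, with either prefix; mixed case is allowed so that
# names like "isPartOf" are captured whole
DC_TAG = re.compile(r"<(dc|dcterms):([A-Za-z]+)>")

def unknown_dc_tags(text):
    """Return the sorted prefixed tag names that are not core elements."""
    return sorted({
        f"{prefix}:{name}"
        for prefix, name in DC_TAG.findall(text)
        if name not in DC_ELEMENTS
    })

# Hypothetical record fragment: dc:author is not a Dublin Core element
print(unknown_dc_tags("<dc:title>x</dc:title><dc:author>y</dc:author>"))
# ['dc:author']
```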


In [69]:
df = pd.DataFrame.from_dict({"file":dublin_file_paths, "uses_prolog":use_prolog, "uses_simple":use_simple, "uses_qualified":use_qualified, "uses_rdf":use_rdf, "uses_rdf_desc":use_rdf_desc})
df.head()

| | file | uses_prolog | uses_simple | uses_qualified | uses_rdf | uses_rdf_desc |
|---|---|---|---|---|---|---|
| 0 | data/data_playground_task1/cleaned/dublin_core... | False | False | False | False | False |
| 1 | data/data_playground_task1/cleaned/dublin_core... | False | `[<dc:title>, <dc:creator>, <dc:subject>, <dc:d...` | False | False | False |
| 2 | data/data_playground_task1/cleaned/dublin_core... | True | `[<dc:title>, <dc:subject>, <dc:description>, <...` | False | False | False |
| 3 | data/data_playground_task1/cleaned/dublin_core... | False | `[<dc:title>, <dc:creator>, <dc:subject>, <dc:d...` | False | False | False |
| 4 | data/data_playground_task1/cleaned/dublin_core... | False | `[<dc:title>, <dc:creator>, <dc:contributor>, <...` | False | False | False |


Calculate the number of files with and without a prolog, the use of simple and qualified DC, and the use of RDF (Resource Description Framework).

In [71]:
prolog_counts = pd.DataFrame(df.uses_prolog.value_counts()).reset_index().rename(columns={"count":"file_count"})
proportions = (prolog_counts[["file_count"]]/(len(dublin_file_paths))).values
percentages = [f"{proportion[0]*100:.2f}%" for proportion in proportions]
prolog_counts.insert(len(prolog_counts.columns), "proportion_of_files", percentages)
prolog_counts

| | uses_prolog | file_count | proportion_of_files |
|---|---|---|---|
| 0 | False | 64 | 59.81% |
| 1 | True | 43 | 40.19% |


In [72]:
col = "uses_simple"
subdf = df.loc[df[col] == False]
simple_counts = pd.DataFrame.from_dict({col:[False, True], "file_count":[subdf.shape[0], (df.shape[0] - subdf.shape[0])]})
proportions = (simple_counts[["file_count"]]/(len(dublin_file_paths))).values
percentages = [f"{proportion[0]*100:.2f}%" for proportion in proportions]
simple_counts.insert(len(simple_counts.columns), "proportion_of_files", percentages)
simple_counts

| | uses_simple | file_count | proportion_of_files |
|---|---|---|---|
| 0 | False | 14 | 13.08% |
| 1 | True | 93 | 86.92% |


In [81]:
col = "uses_qualified"
subdf = df.loc[df[col] == False]
qual_counts = pd.DataFrame.from_dict({col:[False, True], "file_count":[subdf.shape[0], (df.shape[0] - subdf.shape[0])]})
proportions = (qual_counts[["file_count"]]/(len(dublin_file_paths))).values
percentages = [f"{proportion[0]*100:.2f}%" for proportion in proportions]
qual_counts.insert(len(qual_counts.columns), "proportion_of_files", percentages)
dc_counts = pd.concat([simple_counts, qual_counts], axis=1)
dc_counts

| | uses_simple | file_count | proportion_of_files | uses_qualified | file_count | proportion_of_files |
|---|---|---|---|---|---|---|
| 0 | False | 14 | 13.08% | False | 97 | 90.65% |
| 1 | True | 93 | 86.92% | True | 10 | 9.35% |


In [84]:
col = "uses_rdf"
subdf = df.loc[df[col] == False]
rdf_counts = pd.DataFrame.from_dict({col:[False, True], "file_count":[subdf.shape[0], (df.shape[0] - subdf.shape[0])]})
proportions = (rdf_counts[["file_count"]]/(len(dublin_file_paths))).values
percentages = [f"{proportion[0]*100:.2f}%" for proportion in proportions]
rdf_counts.insert(len(rdf_counts.columns), "proportion_of_files", percentages)
rdf_counts

| | uses_rdf | file_count | proportion_of_files |
|---|---|---|---|
| 0 | False | 107 | 100.00% |
| 1 | True | 0 | 0.00% |


In [85]:
col = "uses_rdf_desc"
subdf = df.loc[df[col] == False]
rdf_desc_counts = pd.DataFrame.from_dict({col:[False, True], "file_count":[subdf.shape[0], (df.shape[0] - subdf.shape[0])]})
proportions = (rdf_desc_counts[["file_count"]]/(len(dublin_file_paths))).values
percentages = [f"{proportion[0]*100:.2f}%" for proportion in proportions]
rdf_desc_counts.insert(len(rdf_desc_counts.columns), "proportion_of_files", percentages)
rdf_counts = pd.concat([rdf_counts, rdf_desc_counts], axis=1)
rdf_counts

| | uses_rdf | file_count | proportion_of_files | uses_rdf_desc | file_count | proportion_of_files |
|---|---|---|---|---|---|---|
| 0 | False | 107 | 100.00% | False | 99 | 92.52% |
| 1 | True | 0 | 0.00% | True | 8 | 7.48% |
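The counting cells above repeat the same three steps: count the `False` entries, derive the `True` count, and format both as percentages. One way to factor that out is sketched below with the standard library only; `flag_counts` is a hypothetical helper, not a function from `utils`. It treats any truthy value (such as a non-empty list of matched tags) as `True`.

```python
def flag_counts(values, col):
    """Summarise a column of flags as rows of count and percentage.

    `values` holds either False or a truthy value (True or a non-empty list).
    """
    total = len(values)
    n_true = sum(1 for v in values if v)
    rows = []
    for flag, n in ((False, total - n_true), (True, n_true)):
        share = (n / total * 100) if total else 0.0
        rows.append({
            col: flag,
            "file_count": n,
            "proportion_of_files": f"{share:.2f}%",
        })
    return rows

# Reproduces the prolog table above: 64 files without a prolog, 43 with one
for row in flag_counts([False] * 64 + [True] * 43, "uses_prolog"):
    print(row)
```

The returned rows can be passed to `pd.DataFrame(...)` to get the same three-column table as the cells above, for example `pd.DataFrame(flag_counts(list(df["uses_simple"]), "uses_simple"))`.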


Save the reports as CSV files.

In [86]:
metadata_standard = "dublin_core"
data_serialization = "xml"

In [87]:
report_type = "conformance"
df.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=True
    )

In [None]:
report_type = "prolog_stats"
prolog_counts.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=True
    )

In [89]:
report_type = "dc_simple_qual_stats"
dc_counts.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=True
    )

In [90]:
report_type = "rdf_stats"
rdf_counts.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=True
    )
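The four save cells build their file names from the same template. A small sketch of a shared path builder follows; `report_path` is a hypothetical name, and the arguments mirror the variables set above.

```python
from pathlib import Path

def report_path(report_dir, metadata_standard, data_serialization, report_type):
    """Build the CSV path for one report from its three name parts."""
    file_name = f"{metadata_standard}_{data_serialization}_{report_type}.csv"
    return Path(report_dir) / file_name

path = report_path("data/error_reports/conformance/", "dublin_core", "xml", "rdf_stats")
print(path.as_posix())
# data/error_reports/conformance/dublin_core_xml_rdf_stats.csv
```

Each save then reduces to a single call such as `rdf_counts.to_csv(report_path(report_dir, metadata_standard, data_serialization, "rdf_stats"), index=True)`.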

## JSON-LD

In [11]:
extension = ".json" #".txt"
cidoc_file_paths = []
cidoc_files_t1 = [f for f in os.listdir(cidoc_t1_dir) if f.endswith(extension)]
cidoc_file_paths += [cidoc_t1_dir+f for f in cidoc_files_t1]
cidoc_files_p1 = [f for f in os.listdir(cidoc_p1_dir) if f.endswith(extension)]
cidoc_file_paths += [cidoc_p1_dir+f for f in cidoc_files_p1]
cidoc_files_p3 = [f for f in os.listdir(cidoc_p3_dir) if f.endswith(extension)]
cidoc_file_paths += [cidoc_p3_dir+f for f in cidoc_files_p3]
cidoc_file_paths.sort()
print("Total CIDOC-CRM JSON files:", len(cidoc_file_paths))

Total CIDOC-CRM JSON files: 148


In [12]:
cidoc_file_paths[0]

'data/data_playground_task1/cleaned/cidoc_crm/cidoccrm_record_000.json'

In [13]:
extension = ".json" #".txt"
schema_file_paths = []
schema_files_t1 = os.listdir(schema_t1_dir)
schema_file_paths += [schema_t1_dir+f for f in schema_files_t1 if f.endswith(extension)]
schema_files_p1 = os.listdir(schema_p1_dir)
schema_file_paths += [schema_p1_dir+f for f in schema_files_p1 if f.endswith(extension)]
schema_files_p3 = os.listdir(schema_p3_dir)
schema_file_paths += [schema_p3_dir+f for f in schema_files_p3 if f.endswith(extension)]
schema_file_paths.sort()
print("Total Schema.org JSON files:", len(schema_file_paths))

Total Schema.org JSON files: 182


In [14]:
schema_file_paths[0]

'data/data_playground_task1/cleaned/schema_org/sdo_record_000.json'

In [15]:
json_file_paths = cidoc_file_paths + schema_file_paths
total_json_files = len(json_file_paths)
print(len(json_file_paths))

330


### Schema.org

Check whether the context for Schema.org is correctly included in the Schema.org JSON-LD metadata records.

In [None]:
# Raw strings avoid invalid-escape warnings; the dot is escaped so it matches only a literal ".".
context_pattern = re.compile(r'"@context":\s*\{\s*"@vocab":\s*"https?://schema\.org/"\s*\}')
context_url_pattern = re.compile(r"https?://schema\.org/?") # allow for both 'http' and 'https' and for the URL to end with or without a forward slash
data_model = "Schema.org"
df = utils.contextInclusion(schema_file_paths, data_model, context_pattern, context_url_pattern)


In [None]:
# context_var = "@context"  # assumed value; not defined elsewhere in this notebook
# context_correct, has_context_var, has_model_url = [], [], []
# for file_path in schema_file_paths:
#     with open(file_path, "r") as f:
#         f_string = f.read().lower()

#     if re.search(context_pattern, f_string):
#         context_correct += [True]
#         has_context_var += [True]
#         has_model_url += [True]
#     else:
#         context_correct += [False]
#         has_context_var += [context_var in f_string]
#         has_model_url += [bool(re.search(context_url_pattern, f_string))]

# df = pd.DataFrame.from_dict({
#     "file_path":schema_file_paths, "data_model":["Schema.org"]*len(schema_file_paths), 
#     "includes_context_correctly":context_correct, "includes_@context":has_context_var, 
#     "includes_data_model_url":has_model_url
#     })
# df.head()

| | file_path | data_model | includes_context_correctly | includes_@context | includes_data_model_url |
|---|---|---|---|---|---|
| 0 | data/data_playground_task1/cleaned/schema_org/... | Schema.org | False | True | True |
| 1 | data/data_playground_task1/cleaned/schema_org/... | Schema.org | False | True | True |
| 2 | data/data_playground_task1/cleaned/schema_org/... | Schema.org | False | True | True |
| 3 | data/data_playground_task1/cleaned/schema_org/... | Schema.org | False | True | True |
| 4 | data/data_playground_task1/cleaned/schema_org/... | Schema.org | False | True | True |


In [72]:
cols = ["includes_context_correctly", "includes_@context", "includes_data_model_url"]
df_counts = pd.DataFrame()
for col in cols:
    count = pd.DataFrame(df[col].value_counts()).rename(columns={"count":col+"_file_count"})
    df_counts = pd.concat([df_counts, count], axis=1)
df_counts = df_counts.fillna(0)
df_counts

| | includes_context_correctly_file_count | includes_@context_file_count | includes_data_model_url_file_count |
|---|---|---|---|
| False | 182.0 | 0.0 | 0.0 |
| True | 0.0 | 182.0 | 182.0 |
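The strict pattern above accepts only a context written as an object with an `@vocab` key. JSON-LD also allows the context to be a plain URL string, for example `"@context": "https://schema.org"`, which is the form Schema.org's own examples commonly use. Whether that form accounts for the 182 records counted as `False` has not been checked here. The sketch below classifies the form of the context by parsing the JSON instead of matching text; `context_form` and its category labels are hypothetical.

```python
import json
import re

# http or https, with or without a trailing slash
SDO_URL = re.compile(r"^https?://schema\.org/?$")

def context_form(record_text):
    """Classify how a record declares its Schema.org context."""
    try:
        record = json.loads(record_text)
    except json.JSONDecodeError:
        return "invalid_json"
    if not isinstance(record, dict) or "@context" not in record:
        return "missing"
    context = record["@context"]
    if isinstance(context, str):
        return "url_string" if SDO_URL.match(context) else "other"
    if isinstance(context, dict):
        vocab = context.get("@vocab")
        if isinstance(vocab, str) and SDO_URL.match(vocab):
            return "vocab_object"
    return "other"

print(context_form('{"@context": "https://schema.org", "@type": "Thing"}'))
print(context_form('{"@context": {"@vocab": "http://schema.org/"}}'))
```

Applied to the text of each file in `schema_file_paths`, the resulting labels could be tallied to show which context forms the generated records actually use.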


Save the results as CSV reports.

In [73]:
metadata_standard = "sdo"
data_serialization = "json-ld"

In [74]:
report_type = "context"
df.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=True
    )

In [75]:
report_type = "context_counts"
df_counts.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=True
    )

### CIDOC-CRM

Open question: is conformance here essentially a check that the CIDOC-CRM types used in a record exist and are referenced (spelled and structured) correctly?

In [None]:
# JSON-LD Context: https://cidoc-crm.org/rdfs/7.1.3/CIDOC_CRM_v7.1.3_JSON-LD_Context.jsonld
# Classes & properties: https://cidoc-crm.org/html/cidoc_crm_v7.1.3.html
# Namespace: http://www.cidoc-crm.org/cidoc-crm/

In [None]:
# to get all field names in JSON: '"@?[a-zA-Z]+"(?=:)'
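The pattern noted above matches only keys made of letters, so CIDOC-CRM names that contain digits, underscores, or hyphens (such as `P1_is_identified_by`) would be skipped. The sketch below compares that pattern with collecting keys from the parsed JSON. The function names are hypothetical, and the sample record is invented for illustration (the `content` key in particular is not taken from the CIDOC-CRM context).

```python
import json
import re

# The pattern from the note above: a quoted key of letters, optionally
# starting with "@", immediately followed by a colon
FIELD_NAME = re.compile(r'"@?[a-zA-Z]+"(?=:)')

def field_names_by_regex(text):
    """Key names found by the letters-only pattern."""
    return sorted({match.strip('"') for match in FIELD_NAME.findall(text)})

def field_names_by_parsing(text):
    """Every key at any nesting depth of a valid JSON document."""
    names = set()

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                names.add(key)
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(json.loads(text))
    return sorted(names)

# Invented record for illustration only
sample = ('{"@type":"E22_Human-Made_Object",'
          '"P1_is_identified_by":{"@type":"E42_Identifier","content":"abc"}}')

print(field_names_by_regex(sample))    # ['@type', 'content']
print(field_names_by_parsing(sample))  # ['@type', 'P1_is_identified_by', 'content']
```

Parsing only works on valid JSON, so the regex route may still be needed for records that fail to parse; widening its character class to something like `[\w-]` would be one option to test.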