# Evaluation

### Linking Anthropology's Data and Archives (LADA)

### AI-Generated Linked Data Evaluation (part II)

**Considerations**:
 - Syntax (is the output well-formed in the expected serialization format, e.g., XML or JSON?)
 - Completeness (are any fields empty or 'unknown'?)
 - Conformance to ontologies (i.e., CIDOC-CRM, Schema.org, Dublin Core)
 - Consistency (across generated data points)

---

**Table of Contents:**

I. [Data Loading](#data-loading)

II. [Syntax](#syntax)

  * [XML](#xml)
  
    * [Automated Correction](#automated-correction)

  * [JSON](#json)

III. [Completeness](#completeness)

IV. [Conformance and Consistency](#conformance-and-consistency)

  * [Dublin Core](#dublin-core)

  * [Schema.org](#schemaorg)

  * [CIDOC-CRM](#cidoc-crm)

---

## Data Loading

In [1]:
import utils
import config
import pandas as pd
import numpy as np
import urllib.request
import urllib
import xml.etree.ElementTree as ET
import json
from lxml import etree
import rdflib
from rdflib.namespace import DC, SDO # Dublin Core, Schema.org
from pathlib import Path
import os
import re

# Validation tools considered:
# xml.sax - to check that XML is well-formed
# xml.etree.ElementTree - to check that XML is well-formed and read the text between tags
# lxml etree.XMLSchema's validate() - to validate parsed XML against an XML Schema (XSD)
# lxml etree.XMLParser - to parse XML, optionally validating against an input XML schema
# json_checker - to validate Python data types (incl. but not limited to those obtained from JSON)
# jsonschema.validate - to validate JSON against a JSON Schema
# ShEx - for RDF graphs, ShExJ for JSON - NOTE: couldn't install package
# OntoME - for CIDOC-CRM ontology alignment

Create variables to reference existing directories and files.

In [2]:
dublin_path = "cleaned/dublin_core/"  # XML data files
schema_path = "cleaned/schema_org/"   # JSON data files
cidoc_path = "cleaned/cidoc_crm/"     # JSON data files

dublin_t1_dir = config.task1_data+dublin_path
schema_t1_dir = config.task1_data+schema_path
cidoc_t1_dir = config.task1_data+cidoc_path

dublin_p1_dir = config.playgrd1_data+dublin_path
schema_p1_dir = config.playgrd1_data+schema_path
cidoc_p1_dir = config.playgrd1_data+cidoc_path

dublin_p3_dir = config.playgrd3_data+dublin_path
schema_p3_dir = config.playgrd3_data+schema_path
cidoc_p3_dir = config.playgrd3_data+cidoc_path

Create variables to reference automatically corrected files and their directories.

In [3]:
dublin_path = "corrected/dublin_core/"  # XML data files
schema_path = "corrected/schema_org/"   # JSON data files
cidoc_path = "corrected/cidoc_crm/"     # JSON data files

dublin_t1_corrected_dir = config.task1_data+dublin_path
schema_t1_corrected_dir = config.task1_data+schema_path
cidoc_t1_corrected_dir = config.task1_data+cidoc_path

dublin_p1_corrected_dir = config.playgrd1_data+dublin_path
schema_p1_corrected_dir = config.playgrd1_data+schema_path
cidoc_p1_corrected_dir = config.playgrd1_data+cidoc_path

dublin_p3_corrected_dir = config.playgrd3_data+dublin_path
schema_p3_corrected_dir = config.playgrd3_data+schema_path
cidoc_p3_corrected_dir = config.playgrd3_data+cidoc_path

corrected_dirs = [dublin_t1_corrected_dir, schema_t1_corrected_dir, cidoc_t1_corrected_dir,
                  dublin_p1_corrected_dir, schema_p1_corrected_dir, cidoc_p1_corrected_dir,
                  dublin_p3_corrected_dir, schema_p3_corrected_dir, cidoc_p3_corrected_dir
                  ]
for corrected_dir in corrected_dirs:
    Path(corrected_dir).mkdir(parents=True, exist_ok=True)

In [4]:
dublin_file_paths = []
dublin_files_t1 = [f for f in os.listdir(dublin_t1_dir) if f.endswith(".xml")]
dublin_file_paths += [dublin_t1_dir+f for f in dublin_files_t1]
dublin_files_p1 = [f for f in os.listdir(dublin_p1_dir) if f.endswith(".xml")]
dublin_file_paths += [dublin_p1_dir+f for f in dublin_files_p1]
dublin_files_p3 = [f for f in os.listdir(dublin_p3_dir) if f.endswith(".xml")]
dublin_file_paths += [dublin_p3_dir+f for f in dublin_files_p3]
dublin_file_paths.sort()
print("Total Dublin Core XML files:", len(dublin_file_paths))

Total Dublin Core XML files: 99


## Syntax

### XML

First, read and evaluate only the files with a `.xml` extension.

In [5]:
syntax_errors = []
for f in dublin_file_paths:
    try:
        tree = etree.parse(f)
    except Exception as e:
        f_error = {"file": f, "exception_type": type(e), "exception_message": str(e)}
        syntax_errors += [f_error]
print("Files with errors:", 
      len(syntax_errors), "of", len(dublin_file_paths),
      f"({(len(syntax_errors)/len(dublin_file_paths))*100:.2f}%)")

Files with errors: 41 of 99 (41.41%)


In [6]:
print(syntax_errors[0])

{'file': 'data/data_playground_task1/cleaned/dublin_core/dc_record_000.xml', 'exception_type': <class 'lxml.etree.XMLSyntaxError'>, 'exception_message': 'Namespace prefix dc on title is not defined, line 2, column 10 (dc_record_000.xml, line 2)'}


In [7]:
df_se = pd.DataFrame.from_dict(syntax_errors)
df_se.head()

Unnamed: 0,file,exception_type,exception_message
0,data/data_playground_task1/cleaned/dublin_core...,<class 'lxml.etree.XMLSyntaxError'>,"Namespace prefix dc on title is not defined, l..."
1,data/data_playground_task1/cleaned/dublin_core...,<class 'lxml.etree.XMLSyntaxError'>,Namespace prefix rdf for about on Description ...
2,data/data_playground_task1/cleaned/dublin_core...,<class 'lxml.etree.XMLSyntaxError'>,Namespace prefix rdf on Description is not def...
3,data/data_playground_task1/cleaned/dublin_core...,<class 'lxml.etree.XMLSyntaxError'>,Namespace prefix rdf on Description is not def...
4,data/data_playground_task1/cleaned/dublin_core...,<class 'lxml.etree.XMLSyntaxError'>,"xmlParseEntityRef: no name, line 3, column 28 ..."


In [8]:
new_file_col = df_se["file"].apply(lambda x: x.split("/")[-1])
df_se = df_se.rename(columns={"file":"file_path"})
df_se.insert(1, "file_name", new_file_col)
df_se.head()

Unnamed: 0,file_path,file_name,exception_type,exception_message
0,data/data_playground_task1/cleaned/dublin_core...,dc_record_000.xml,<class 'lxml.etree.XMLSyntaxError'>,"Namespace prefix dc on title is not defined, l..."
1,data/data_playground_task1/cleaned/dublin_core...,dc_record_003.xml,<class 'lxml.etree.XMLSyntaxError'>,Namespace prefix rdf for about on Description ...
2,data/data_playground_task1/cleaned/dublin_core...,dc_record_018.xml,<class 'lxml.etree.XMLSyntaxError'>,Namespace prefix rdf on Description is not def...
3,data/data_playground_task1/cleaned/dublin_core...,dc_record_019.xml,<class 'lxml.etree.XMLSyntaxError'>,Namespace prefix rdf on Description is not def...
4,data/data_playground_task1/cleaned/dublin_core...,dc_record_020.xml,<class 'lxml.etree.XMLSyntaxError'>,"xmlParseEntityRef: no name, line 3, column 28 ..."


In [9]:
assert(len(df_se.file_path.unique()) == df_se.shape[0]), "File names may repeat, because they may be located in different directories (folders), but each file path should be unique"

In [10]:
pattern = r"^[\D]+,"  # non-digit text up to the last comma before the first digit
new_exception_col = df_se["exception_message"].apply(lambda x: re.findall(pattern, x)[0][:-1])
df_se.insert(len(df_se.columns)-1, "exception_subtype", new_exception_col)
df_se.head()


Unnamed: 0,file_path,file_name,exception_type,exception_subtype,exception_message
0,data/data_playground_task1/cleaned/dublin_core...,dc_record_000.xml,<class 'lxml.etree.XMLSyntaxError'>,Namespace prefix dc on title is not defined,"Namespace prefix dc on title is not defined, l..."
1,data/data_playground_task1/cleaned/dublin_core...,dc_record_003.xml,<class 'lxml.etree.XMLSyntaxError'>,Namespace prefix rdf for about on Description ...,Namespace prefix rdf for about on Description ...
2,data/data_playground_task1/cleaned/dublin_core...,dc_record_018.xml,<class 'lxml.etree.XMLSyntaxError'>,Namespace prefix rdf on Description is not def...,Namespace prefix rdf on Description is not def...
3,data/data_playground_task1/cleaned/dublin_core...,dc_record_019.xml,<class 'lxml.etree.XMLSyntaxError'>,Namespace prefix rdf on Description is not def...,Namespace prefix rdf on Description is not def...
4,data/data_playground_task1/cleaned/dublin_core...,dc_record_020.xml,<class 'lxml.etree.XMLSyntaxError'>,xmlParseEntityRef: no name,"xmlParseEntityRef: no name, line 3, column 28 ..."


In [11]:
subtype_report = pd.DataFrame(df_se.exception_subtype.value_counts()).reset_index()
subtype_report = subtype_report.rename(columns={"exception_subtype":"exception"})
subtype_report.insert(0, "dimension_counted", ["exception_subtype"]*subtype_report.shape[0])

In [12]:
type_report = pd.DataFrame(df_se.exception_type.value_counts()).reset_index()
type_report = type_report.rename(columns={"exception_type":"exception"})
type_report.insert(0, "dimension_counted", ["exception_type"]*type_report.shape[0])

In [13]:
totals_report = pd.DataFrame({
    "dimension_counted": ["total_files", "files_with_error"],
    "exception": ["NA", "NA"],
    "count": [len(dublin_file_paths), len(syntax_errors)]
    })

In [14]:
xml_report = pd.concat([type_report, subtype_report, totals_report])
proportions = (xml_report[["count"]]/len(dublin_file_paths)).values
percentages = [f"{proportion[0]*100:.2f}%" for proportion in proportions]
# print(proportions)
# print(percentages)
xml_report.insert(len(xml_report.columns), "proportion_of_all_files", percentages)
xml_report


Unnamed: 0,dimension_counted,exception,count,proportion_of_all_files
0,exception_type,<class 'lxml.etree.XMLSyntaxError'>,41,41.41%
0,exception_subtype,Namespace prefix dc on title is not defined,31,31.31%
1,exception_subtype,Namespace prefix rdf for about on Description ...,7,7.07%
2,exception_subtype,Namespace prefix rdf on Description is not def...,2,2.02%
3,exception_subtype,xmlParseEntityRef: no name,1,1.01%
0,total_files,,99,100.00%
1,files_with_error,,41,41.41%


Save the reports as CSV files.

In [17]:
report_dir = "data/error_reports/"
Path(report_dir).mkdir(parents=True, exist_ok=True)

In [18]:
metadata_standard = "dublin_core"
data_serialization = "xml"
report_type = "syntax_error_stats"
xml_report.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=False
    )

In [19]:
metadata_standard = "dublin_core"
data_serialization = "xml"
report_type = "syntax_errors"
df_se.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=False
    )

#### Automated Correction

Try to correct undefined namespace prefix errors automatically. For each file that failed to parse, read its equivalent with a `.txt` extension and apply the correction. If the result can be parsed with an XML parser, save it with a `.xml` extension in a new directory.
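
The repair idea can be illustrated on an in-memory string before touching any files. This is a minimal sketch, assuming the generated record uses a bare `<dublincore>` root as the first errored record does; `wrap_with_namespaces` and `is_well_formed` are hypothetical helper names, not functions from `utils`, and the sketch keeps the `<dublincore>` root name rather than renaming it.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
DCTERMS_NS = "http://purl.org/dc/terms/"


def wrap_with_namespaces(fragment: str) -> str:
    """Declare the dc and dcterms prefixes on a bare <dublincore> root (hypothetical helper)."""
    root_open = f'<dublincore xmlns:dcterms="{DCTERMS_NS}" xmlns:dc="{DC_NS}">'
    return fragment.replace("<dublincore>", root_open, 1)


def is_well_formed(xml_string: str) -> bool:
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False


# Illustrative record with an undeclared dc prefix
broken = "<dublincore>\n<dc:title>Inquisition Report on Ann Aldrich</dc:title>\n</dublincore>"
fixed = wrap_with_namespaces(broken)
print(is_well_formed(broken), is_well_formed(fixed))
```

Parsing the repaired string is what decides whether a file counts as corrected, so a record with a second, unrelated error would still be reported as incorrect.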

In [15]:
df_se.exception_subtype.unique()

array(['Namespace prefix dc on title is not defined',
       'Namespace prefix rdf for about on Description is not defined',
       'Namespace prefix rdf on Description is not defined',
       'xmlParseEntityRef: no name'], dtype=object)

In [16]:
errored_files = list(df_se.file_path)
error_list = list(df_se.exception_subtype)
assert (len(error_list) == len(errored_files)), "Error list and errored files lists should be of the same length"

In [17]:
txt_errored_files = [f.replace(".xml", ".txt") for f in errored_files]
print(txt_errored_files[0])
print(error_list[0])

data/data_playground_task1/cleaned/dublin_core/dc_record_000.txt
Namespace prefix dc on title is not defined


In [18]:
correct_dc_files = []
for f in dublin_file_paths:
    if f not in errored_files:
        correct_dc_files += [f]
print("Total correct Dublin Core XML files:", len(correct_dc_files))
print("Sample:", correct_dc_files[0])

Total correct Dublin Core XML files: 58
Sample: data/data_playground_task1/cleaned/dublin_core/dc_record_001.xml


In [28]:
prolog = '<?xml version="1.0" encoding="UTF-8"?>'
dc_prefix_open_tag = '<metadata xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/">'
dc_prefix_close_tag = '</metadata>'

In [48]:
i, maxI = 0, 1  # set maxI = len(error_list) to process every errored file
still_incorrect = []
while i < maxI:
    txt_file, message = txt_errored_files[i], error_list[i]
    i += 1
    # Only undefined "dc" prefix errors are handled so far
    if "Namespace prefix dc" not in message:
        continue
    with open(txt_file, "r") as f:
        f_string = f.read()

    # Add an XML prolog if the file does not start with one
    has_prolog = re.findall(r"^<\?xml version=.+ encoding=.+", f_string)
    if len(has_prolog) == 0:
        f_string = prolog + "\n" + f_string

    # Replace the bare <dublincore> root with one that declares the dc prefixes
    open_tag = re.findall("<dublincore>", f_string)
    close_tag = re.findall("</dublincore>", f_string)
    if (len(open_tag) == 1) and (len(close_tag) == 1):
        f_string = f_string.replace(open_tag[0], dc_prefix_open_tag)
        f_string = f_string.replace(close_tag[0], dc_prefix_close_tag)

    # Save the result to the corrected directory only if it now parses as XML
    try:
        root = ET.fromstring(f_string)
    except ET.ParseError:
        still_incorrect += [txt_file]
        continue
    xml_file = str(Path(txt_file.replace("cleaned/", "corrected/")).with_suffix(".xml"))
    with open(xml_file, "w") as f:
        f.write(f_string)



In [25]:
# f_string = str(f_string).strip('b"')
print(str(f_string))

<dublincore>
<dc:title>Inquisition Report on Ann Aldrich</dc:title>
<dc:creator>John Chichester (Coroner)</dc:creator>
<dc:subject>Accidental Death, Coroner's Inquest, 18th Century Legal Document</dc:subject>
<dc:description>An inquisition report dated January 9, 1793, documenting the accidental drowning of Ann Aldrich, a two-year-old child, in Mushmilling Hundred, Kent County. The report includes witness testimonies and jury findings.</dc:description>
<dc:publisher>Kent County Court Records</dc:publisher>
<dc:contributor>James Houston, John Grist, James White, David Judd, John Borie, John White, John Reed, John Farris, Richard Houston, William White, Anthony Pain, Edward Chimber</dc:contributor>
<dc:date>1793-01-09</dc:date>
<dc:type>Text</dc:type>
<dc:format>Manuscript, Handwritten Document</dc:format>
<dc:identifier>urn:kentcounty:inquest:1793:AnnAldrich</dc:identifier>
<dc:source>Kent County Archives</dc:source>
<dc:language>English</dc:language>
<dc:relation>Legal and Judicial Rec

### JSON

First, read and evaluate only the files with a `.json` extension.

In [66]:
cidoc_file_paths = []
cidoc_files_t1 = [f for f in os.listdir(cidoc_t1_dir) if f.endswith(".json")]
cidoc_file_paths += [cidoc_t1_dir+f for f in cidoc_files_t1]
cidoc_files_p1 = [f for f in os.listdir(cidoc_p1_dir) if f.endswith(".json")]
cidoc_file_paths += [cidoc_p1_dir+f for f in cidoc_files_p1]
cidoc_files_p3 = [f for f in os.listdir(cidoc_p3_dir) if f.endswith(".json")]
cidoc_file_paths += [cidoc_p3_dir+f for f in cidoc_files_p3]
cidoc_file_paths.sort()
print("Total CIDOC-CRM JSON files:", len(cidoc_file_paths))

Total CIDOC-CRM JSON files: 91


In [43]:
cidoc_file_paths[0]

'data/data_playground_task1/cleaned/cidoc_crm/cidoccrm_record_000.json'

In [None]:
schema_file_paths = []
schema_files_t1 = os.listdir(schema_t1_dir)
schema_file_paths += [schema_t1_dir+f for f in schema_files_t1]
schema_files_p1 = os.listdir(schema_p1_dir)
schema_file_paths += [schema_p1_dir+f for f in schema_files_p1]
schema_files_p3 = os.listdir(schema_p3_dir)
schema_file_paths += [schema_p3_dir+f for f in schema_files_p3]
schema_file_paths.sort()
print("Total Schema.org JSON files:", len(schema_file_paths))

Total Schema.org JSON files: 107


In [41]:
schema_file_paths[0]

'data/data_playground_task1/cleaned/schema_org/sdo_record_000.json'

In [46]:
json_file_paths = cidoc_file_paths + schema_file_paths
print(len(json_file_paths))

198


In [47]:
syntax_errors = []
for path in json_file_paths:
    with open(path) as f:
        try:
            data = json.load(f)
        except Exception as e:
            # Record the file path, not the open file object
            f_error = {"file": path, "exception_type": type(e), "exception_message": str(e)}
            syntax_errors += [f_error]
print(
    "Files with errors:", 
    len(syntax_errors), "of", len(json_file_paths),
    f"({(len(syntax_errors)/len(json_file_paths))*100:.2f}%)"
    )

Files with errors: 4 of 198 (2.02%)


In [52]:
# TO DO: Write & export report on syntax/semantic errors!
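
One possible starting point for that report is sketched below. It is an illustration only, assuming the report should mirror the XML one by grouping errors by message; `json_syntax_error` is a hypothetical helper and the sample strings are made up rather than taken from the data directories.

```python
import json
from collections import Counter


def json_syntax_error(text: str):
    """Return None if text is valid JSON, else a dict describing the first error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as e:
        # msg excludes the position, so it can serve as the error subtype
        return {"exception_subtype": e.msg, "line": e.lineno, "column": e.colno}
    return None


# Made-up samples standing in for file contents
samples = {
    "ok.json": '{"name": "record"}',
    "trailing_comma.json": '{"a": 1,}',
    "truncated.json": '{"a": ',
}
errors = {}
for name, text in samples.items():
    error = json_syntax_error(text)
    if error is not None:
        errors[name] = error

# Tally how many files share each error message
subtype_counts = Counter(error["exception_subtype"] for error in errors.values())
print(subtype_counts)
```

Because `JSONDecodeError` exposes `msg`, `lineno`, and `colno` separately, no regular expression is needed to split the message from its position.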

## Completeness
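
The completeness criterion listed under Considerations (fields are not empty or 'unknown') could be checked on parsed JSON along the following lines. This is a sketch under stated assumptions: `incomplete_fields` and `PLACEHOLDERS` are hypothetical names, the placeholder comparison is case-insensitive by choice, and the sample record is invented.

```python
PLACEHOLDERS = {"", "unknown"}  # assumed markers of a missing value


def incomplete_fields(value, path=""):
    """Yield the path of every empty or placeholder value in a parsed JSON record."""
    if isinstance(value, dict):
        if not value:
            yield path
        for key, child in value.items():
            yield from incomplete_fields(child, f"{path}/{key}")
    elif isinstance(value, list):
        if not value:
            yield path
        for index, child in enumerate(value):
            yield from incomplete_fields(child, f"{path}[{index}]")
    elif value is None:
        yield path
    elif isinstance(value, str) and value.strip().lower() in PLACEHOLDERS:
        yield path


# Invented record for illustration
record = {
    "name": "Inquisition Report",
    "creator": "Unknown",
    "about": {"date": "", "place": "Kent County"},
    "keywords": [],
}
print(list(incomplete_fields(record)))
```

Reporting paths rather than a single pass/fail flag makes it possible to count which fields are most often left incomplete across records.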

## Conformance and Consistency

Consider conformance to standards and consistency of presentation (e.g., is all data on one or two lines, or does each tag, subtag, etc. appear on its own line?)
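
The presentation question can be made measurable by labelling each record's layout and counting the labels. The sketch below is one rough way to do that, assuming "one or two lines" is the cutoff for a compact layout; `layout_style` is a hypothetical helper and the sample strings are invented.

```python
from collections import Counter


def layout_style(text: str) -> str:
    """Label a serialized record by the number of non-empty lines it spans."""
    non_empty = [line for line in text.splitlines() if line.strip()]
    return "compact" if len(non_empty) <= 2 else "multi_line"


# Invented samples standing in for file contents
samples = [
    "<a><b>1</b></a>",
    "<a>\n<b>1</b>\n</a>",
    "<a>\n<b>2</b>\n</a>",
]
style_counts = Counter(layout_style(text) for text in samples)
dominant_style, dominant_count = style_counts.most_common(1)[0]
# Share of records that follow the most common layout
consistency = dominant_count / len(samples)
print(dominant_style, f"{consistency:.2f}")
```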

### Dublin Core

***Note:***

*The Dublin Core schemas' URLs below are from [dublincore.org](https://www.dublincore.org/schemas/xmls/) under "Latest versions are always available as: ..."*

In [65]:
dc_elements_schema_url = "https://www.dublincore.org/schemas/xmls/qdc/dc.xsd"
dc_terms_schema_url = "https://www.dublincore.org/schemas/xmls/qdc/dcterms.xsd"
dc_mitype_schema_url = "https://www.dublincore.org/schemas/xmls/qdc/dcmitype.xsd"

In [66]:
url = dc_elements_schema_url

In [None]:
content = urllib.request.urlopen(url)
parser = etree.XMLParser()
dc_elements_tree = etree.parse(content, parser)
dc_elements_schema = etree.XMLSchema(dc_elements_tree)

    # dc_elements_tree = etree.parse(dc_elements_schema_path)
    # dc_elements_schema = etree.XMLSchema(dc_elements_tree)

In [None]:
xml_doc = etree.parse(dublin_file_paths[1]) # valid   #[0] - invalid as expected
result = dc_elements_schema.validate(xml_doc)

In [79]:
dublin_file_paths[1]

'data/data_playground_task1/cleaned/dublin_core/dc_record_001.xml'
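
Alongside XSD validation, a lighter check is to list child elements that are not among the fifteen elements of the Dublin Core Metadata Element Set. This is a sketch rather than a substitute for schema validation: it assumes a flat record whose fields are direct children of the root, and `non_dc_children` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}
# ElementTree spells qualified names as "{namespace}local"
ALLOWED_TAGS = {f"{{{DC_NS}}}{name}" for name in DC_ELEMENTS}


def non_dc_children(xml_string: str) -> list:
    """Return the tags of direct children that are not Dublin Core elements."""
    root = ET.fromstring(xml_string)
    return [child.tag for child in root if child.tag not in ALLOWED_TAGS]


# Invented record: dc:author and note are not Dublin Core elements
record = (
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>T</dc:title><dc:author>A</dc:author><note>n</note>"
    "</metadata>"
)
print(non_dc_children(record))
```

A check like this flags invented element names such as `dc:author`, which a well-formedness test alone would accept.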

### Schema.org

### CIDOC-CRM