# Evaluation

### Linking Anthropology's Data and Archives (LADA)

### AI-Generated Linked Data Evaluation (part II)

**Considerations**:
 - Syntax (adherence to the expected serialization format, e.g. well-formed XML)
 - Completeness (fields are not empty or 'unknown')
 - Conformance to ontologies (i.e. CIDOC-CRM, Schema.org, Dublin Core)
 - Consistency (across generated data points)

---

**Table of Contents:**

I. [Data Loading](#data-loading)

II. [Syntax](#syntax)

  * [XML](#xml)

  * [JSON](#json)

III. [Completeness](#completeness)

IV. [Conformance and Consistency](#conformance-and-consistency)

  * [Dublin Core](#dublin-core)

  * [Schema.org](#schemaorg)

  * [CIDOC-CRM](#cidoc-crm)

---

## Data Loading

In [60]:
import utils
import config
import pandas as pd
import numpy as np
import urllib.request
import xml.etree.ElementTree as ET
import json
from lxml import etree
import rdflib
from rdflib.namespace import DC, SDO # Dublin Core, Schema.org
from pathlib import Path
import os
import re

# Candidate validation tools:
# xml.sax - to check that XML is well-formed
# xml.etree.ElementTree - to check the text between tags
# lxml etree.XMLSchema's validate() - to validate a parsed XML tree against an XSD schema
#   (the standard library's xml.etree.ElementTree has no XMLSchema class)
# lxml etree.XMLParser - to check well-formedness and, given an input XML schema, validate while parsing
# json_checker - to validate Python data types (incl. but not limited to those obtained from JSON)
# jsonschema.validate - to validate JSON against a JSON Schema
# ShEx - for RDF graphs, ShExJ for JSON - NOTE: couldn't install package
# OntoME - for CIDOC-CRM ontology alignment

In [None]:
dublin_path = "cleaned/dublin_core/"  # XML data files
schema_path = "cleaned/schema_org/"   # JSON data files
cidoc_path = "cleaned/cidoc_crm/"     # JSON data files

dublin_t1_dir = config.task1_data+dublin_path
schema_t1_dir = config.task1_data+schema_path
cidoc_t1_dir = config.task1_data+cidoc_path

dublin_p1_dir = config.playgrd1_data+dublin_path
schema_p1_dir = config.playgrd1_data+schema_path
cidoc_p1_dir = config.playgrd1_data+cidoc_path

dublin_p3_dir = config.playgrd3_data+dublin_path
schema_p3_dir = config.playgrd3_data+schema_path
cidoc_p3_dir = config.playgrd3_data+cidoc_path

In [19]:
dublin_file_paths = []
dublin_files_t1 = os.listdir(dublin_t1_dir)
dublin_file_paths += [dublin_t1_dir+f for f in dublin_files_t1]
dublin_files_p1 = os.listdir(dublin_p1_dir) 
dublin_file_paths += [dublin_p1_dir+f for f in dublin_files_p1]
dublin_files_p3 = os.listdir(dublin_p3_dir)
dublin_file_paths += [dublin_p3_dir+f for f in dublin_files_p3]
dublin_file_paths.sort()
print("Total Dublin Core XML files:", len(dublin_file_paths))

Total Dublin Core XML files: 96


In [20]:
dublin_file_paths[0]

'data/data_playground_task1/cleaned/dublin_core/dc_record_000.xml'
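
The listing cells for the three formats follow the same pattern, so they could be folded into one helper. The sketch below is an assumption, not the notebook's code: `collect_files` is a hypothetical name, and unlike `os.listdir` it filters by file extension, so stray entries such as `.DS_Store` or checkpoint folders are skipped. That filtering could change the reported file counts.

```python
from pathlib import Path


def collect_files(directories, suffix):
    """Return the sorted paths of all files with the given suffix in the directories."""
    paths = []
    for directory in directories:
        for entry in Path(directory).iterdir():
            # Keep regular files only, and only those with the expected extension
            if entry.is_file() and entry.suffix == suffix:
                paths.append(str(entry))
    return sorted(paths)


# Hypothetical usage with the directory variables defined earlier:
# dublin_file_paths = collect_files([dublin_t1_dir, dublin_p1_dir, dublin_p3_dir], ".xml")
# cidoc_file_paths = collect_files([cidoc_t1_dir, cidoc_p1_dir, cidoc_p3_dir], ".json")
```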

## Syntax

### XML

In [38]:
syntax_errors = []
for f in dublin_file_paths:
    try:
        tree = etree.parse(f)
    except Exception as e:
        f_error = {"file": f, "exception_type": type(e), "exception_message": str(e)}
        syntax_errors += [f_error]
print("Files with errors:", 
      len(syntax_errors), "of", len(dublin_file_paths),
      f"({(len(syntax_errors)/len(dublin_file_paths))*100:.2f}%)")

Files with errors: 38 of 96 (39.58%)


In [54]:
# TO DO: write report to TXT or CSV in dedicated directory!!!
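
One possible way to address this TO DO is to write the collected error dictionaries to a CSV file with the standard library. This is a minimal sketch under assumptions: `write_error_report` is a hypothetical helper, and the report path in the usage comment is a placeholder, not a directory that exists in the project.

```python
import csv
from pathlib import Path


def write_error_report(errors, report_path):
    """Write a list of error dicts (file, exception_type, exception_message) to CSV."""
    report_path = Path(report_path)
    # Create the dedicated report directory if it does not exist yet
    report_path.parent.mkdir(parents=True, exist_ok=True)
    fieldnames = ["file", "exception_type", "exception_message"]
    with report_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        for err in errors:
            exc_type = err["exception_type"]
            writer.writerow({
                "file": str(err["file"]),
                # Store the class name (e.g. "XMLSyntaxError") rather than the class repr
                "exception_type": getattr(exc_type, "__name__", str(exc_type)),
                "exception_message": err["exception_message"],
            })
    return report_path


# Hypothetical usage (placeholder path):
# write_error_report(syntax_errors, "reports/dublin_core_syntax_errors.csv")
```

Writing the exception class name instead of the class object keeps the CSV readable and easy to group by error type in pandas.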

### JSON

In [42]:
cidoc_file_paths = []
cidoc_files_t1 = os.listdir(cidoc_t1_dir)
cidoc_file_paths += [cidoc_t1_dir+f for f in cidoc_files_t1]
cidoc_files_p1 = os.listdir(cidoc_p1_dir)
cidoc_file_paths += [cidoc_p1_dir+f for f in cidoc_files_p1]
cidoc_files_p3 = os.listdir(cidoc_p3_dir)
cidoc_file_paths += [cidoc_p3_dir+f for f in cidoc_files_p3]
cidoc_file_paths.sort()
print("Total CIDOC-CRM JSON files:", len(cidoc_file_paths))

Total CIDOC-CRM JSON files: 91


In [43]:
cidoc_file_paths[0]

'data/data_playground_task1/cleaned/cidoc_crm/cidoccrm_record_000.json'

In [None]:
schema_file_paths = []
schema_files_t1 = os.listdir(schema_t1_dir)
schema_file_paths += [schema_t1_dir+f for f in schema_files_t1]
schema_files_p1 = os.listdir(schema_p1_dir)
schema_file_paths += [schema_p1_dir+f for f in schema_files_p1]
schema_files_p3 = os.listdir(schema_p3_dir)
schema_file_paths += [schema_p3_dir+f for f in schema_files_p3]
schema_file_paths.sort()
print("Total Schema.org JSON files:", len(schema_file_paths))

Total Schema.org JSON files: 107


In [41]:
schema_file_paths[0]

'data/data_playground_task1/cleaned/schema_org/sdo_record_000.json'

In [46]:
json_file_paths = cidoc_file_paths + schema_file_paths
print(len(json_file_paths))

198


In [47]:
syntax_errors = []
for f in json_file_paths:
    with open(f) as fh:  # separate name so that `f` still holds the path in the error record
        try:
            data = json.load(fh)
        except Exception as e:
            f_error = {"file": f, "exception_type": type(e), "exception_message": str(e)}
            syntax_errors += [f_error]
print(
    "Files with errors:", 
    len(syntax_errors), "of", len(json_file_paths),
    f"({(len(syntax_errors)/len(json_file_paths))*100:.2f}%)"
    )

Files with errors: 4 of 198 (2.02%)


In [52]:
# TO DO: Write & export report on syntax/semantic errors!

## Completeness
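
The completeness criterion listed under Considerations (fields are not empty or 'unknown') could be checked for the JSON records with a recursive walk over the parsed data. The sketch below is an assumption about how such a check might look: `find_incomplete_fields` and `PLACEHOLDERS` are hypothetical names, and the sample record is invented for illustration rather than taken from the dataset.

```python
import json

# Assumed placeholder strings; extend this set if the generated data uses others
PLACEHOLDERS = {"", "unknown"}


def find_incomplete_fields(value, path=""):
    """Recursively collect the paths of values that are empty or placeholder-like."""
    incomplete = []
    if isinstance(value, dict):
        if not value:
            incomplete.append(path or "<root>")
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            incomplete += find_incomplete_fields(child, child_path)
    elif isinstance(value, list):
        if not value:
            incomplete.append(path or "<root>")
        for i, child in enumerate(value):
            incomplete += find_incomplete_fields(child, f"{path}[{i}]")
    elif value is None:
        incomplete.append(path or "<root>")
    elif isinstance(value, str) and value.strip().lower() in PLACEHOLDERS:
        incomplete.append(path or "<root>")
    return incomplete


# Invented sample record, for illustration only
record = json.loads(
    '{"name": "Field notebook", "creator": "Unknown", '
    '"about": {"place": ""}, "keywords": []}'
)
print(find_incomplete_fields(record))  # ['creator', 'about.place', 'keywords']
```

Reporting paths rather than a single pass/fail flag makes it possible to count which fields are most often left empty across records.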

## Conformance and Consistency

### Dublin Core

***Note:***

*The Dublin Core schemas' URLs below are from [dublincore.org](https://www.dublincore.org/schemas/xmls/) under "Latest versions are always available as: ..."*

In [65]:
dc_elements_schema_url = "https://www.dublincore.org/schemas/xmls/qdc/dc.xsd"
dc_terms_schema_url = "https://www.dublincore.org/schemas/xmls/qdc/dcterms.xsd"
dc_mitype_schema_url = "https://www.dublincore.org/schemas/xmls/qdc/dcmitype.xsd"

In [66]:
url = dc_elements_schema_url

In [None]:
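# Download the XSD and compile it into an lxml schema object
with urllib.request.urlopen(url) as content:
    parser = etree.XMLParser()
    dc_elements_tree = etree.parse(content, parser)
dc_elements_schema = etree.XMLSchema(dc_elements_tree)

# Alternative: parse a local copy of the schema (dc_elements_schema_path would need to be defined)
# dc_elements_tree = etree.parse(dc_elements_schema_path)
# dc_elements_schema = etree.XMLSchema(dc_elements_tree)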

In [None]:
xml_doc = etree.parse(dublin_file_paths[1])  # [1] is valid; [0] is invalid, as expected
result = dc_elements_schema.validate(xml_doc)  # True if the document conforms to the schema

In [79]:
dublin_file_paths[1]

'data/data_playground_task1/cleaned/dublin_core/dc_record_001.xml'
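
`validate()` only returns a boolean. When a record fails, lxml keeps the reasons in the schema object's `error_log`, which may be useful for a conformance report. The sketch below uses a toy inline XSD (an assumption, standing in for the Dublin Core schema) so that it runs without network access; `validation_messages` is a hypothetical helper name.

```python
from lxml import etree

# Hypothetical toy schema: a <record> that must contain exactly one <title> string
TOY_XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="record">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def validation_messages(schema, xml_bytes):
    """Return (is_valid, messages) for one XML document checked against a schema."""
    doc = etree.fromstring(xml_bytes)
    is_valid = schema.validate(doc)
    # error_log holds the entries produced by the most recent validation run
    messages = [f"line {entry.line}: {entry.message}" for entry in schema.error_log]
    return is_valid, messages


toy_schema = etree.XMLSchema(etree.fromstring(TOY_XSD))
ok, ok_msgs = validation_messages(
    toy_schema, b"<record><title>Field notes</title></record>"
)
bad, bad_msgs = validation_messages(
    toy_schema, b"<record><creator>Anon</creator></record>"
)
print(ok, ok_msgs)
print(bad, bad_msgs)
```

The same helper could be called with `dc_elements_schema` and the bytes of each Dublin Core file, provided the file parses in the first place.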

### Schema.org

### CIDOC-CRM