# AI-Generated Linked Data Evaluation

## Linking Anthropology's Data and Archives: Task 1

**Considerations**:
 - Syntax: does the record adhere to the expected serialization format (e.g., well-formed XML)?
 - Completeness: are any fields empty or 'unknown'?
 - Conformance to ontologies (i.e., CIDOC-CRM, Schema.org, Dublin Core)
 - Consistency across generated data points
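
The completeness consideration lends itself to a mechanical check. The following is a minimal sketch, assuming a record has already been parsed into a flat dictionary. The `PLACEHOLDERS` set and the `incomplete_fields` function are illustrative names, not part of this notebook's `utils` module.

```python
# Strings treated as "missing". This set is an assumption; extend it as needed.
PLACEHOLDERS = {"", "unknown", "n/a", "none", "unable to extract text"}

def incomplete_fields(record: dict) -> list:
    """Return the keys whose string values are empty or a known placeholder."""
    missing = []
    for key, value in record.items():
        if isinstance(value, str) and value.strip().lower() in PLACEHOLDERS:
            missing.append(key)
    return missing

example = {
    "name": "Turnin' Timez: Original Student Poems",
    "publisher": "Unknown",
    "inLanguage": "English",
}
print(incomplete_fields(example))
```

Nested values (dictionaries or lists) are ignored here; a fuller check would need to recurse into them.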

---

**Table of Contents:**

I. [Data Preparation](#data-preparation)

II. [Dublin Core](#dublin-core)

III. [Schema.org](#schemaorg)

IV. [CIDOC-CRM](#cidoc-crm)

---

### Data Preparation

In [1]:
import utils
import pandas as pd
import numpy as np
import xml.etree.ElementTree as ET
import json
import lxml
import rdflib
from rdflib.namespace import DC, SDO # Dublin Core, Schema.org
from pathlib import Path
import os
import re

# Candidate validation tools:
# xml.sax - to check that XML is well-formed
# xml.etree.ElementTree - to parse XML and inspect the text between tags
# lxml.etree.XMLSchema's validate() - to validate XML against an XSD schema (ElementTree itself has no XMLSchema class)
# lxml.etree.XMLParser - to check well-formedness, optionally validating against an input XML schema
# json_checker - to validate Python data types (incl. but not limited to those obtained from JSON)
# jsonschema.validate - to validate JSON against a JSON Schema
# ShEx - for RDF graphs, ShExJ for JSON - NOTE: couldn't install package
# OntoME - for CIDOC-CRM ontology alignment
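
As a rough illustration of the first kind of check listed above, well-formedness can be tested with the standard library alone. This is a sketch under the assumption that each record is available as a string; `is_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# One root element: well-formed.
print(is_well_formed("<metadata><title>4-H</title></metadata>"))
# Two sibling elements with no enclosing root: not well-formed.
print(is_well_formed("<title>one</title><rights>two</rights>"))
```

Well-formedness says nothing about whether the element names belong to a vocabulary; that would require schema validation, for example with `lxml`.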

In [None]:
f = "data_task1/4-HDataExperimentAssignmentsAndOutcomes-Outcomes-Task1.csv"
df = pd.read_csv(f)
df.head()

Unnamed: 0,ID,Filename,Metadata record,"Transcription or caption (or link to separate doc, if too long)",Schema.org Record,CIDOC-CRM Record
0,1.0,Recognition for Meritorious Service Plaque - S...,1. Recognition for Meritorious Service Plaque ...,See Google Link for results,See Google Link for results,See Google Link for results
1,2.0,Criteria for Tenure and Promotion and Examples...,2. Criteria for Tenure a,See Google Link for results,See Google Link for results,See Google Link for results
2,3.0,“Turnin’ Timez” original student poems.pdf,turnin_timez_dublin_core.xml \n<?xml version='...,turnin_timez_transcription.txt,"{\n ""@context"": ""https://schema.org"",\n ""@ty...","{\n ""@context"": {\n ""crm"": ""http://www.cid..."
3,4.0,1985_11_03_Agenda_Executive_Committee_Meeting.jpg,3. 1985_11_03_Agenda_Executive_Committee_Meeti...,See Google Link for results,See Google Link for results,See Google Link for results
4,5.0,Climbing Up Fun Activites for You and Your Cat...,<dc:title>Climbing Up: Fun Activities for You ...,Climbing_Up_OCR_Text.txt,"{\n ""@context"": ""https://schema.org"",\n ""@ty...",@prefix crm: <http://www.cidoc-crm.org/cidoc-c...


In [3]:
# df.tail()
# Drop rows that have a missing value in any column
df.dropna(inplace=True)
df.tail()

Unnamed: 0,ID,Filename,Metadata record,"Transcription or caption (or link to separate doc, if too long)",Schema.org Record,CIDOC-CRM Record
135,136.0,0119_0000_Development-Photos,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT
136,137.0,0063_1988_International-Symposium,Title: 4-H USA International Programs\nCreator...,I can see that this is a comprehensive speech ...,"{\n ""@context"": ""https://schema.org"",\n ""@ty...",@prefix crm: <http://www.cidoc-crm.org/cidoc-c...
137,138.0,0096_1985_Communications-Newsbreak,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT
138,139.0,0076_1982_History-Canada,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT
139,140.0,0025_1988_Cooperative-Extension-Booklet,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT


In [4]:
print(df.shape)

(103, 6)


In [5]:
df.rename(columns={
    "ID":"id", "Filename":"filename", "Metadata record": "dc_record", 
    "Transcription or caption (or link to separate doc, if too long)":"transcription_or_caption",
    "Schema.org Record":"sdo_record", "CIDOC-CRM Record":"cidoccrm_record"
    }, inplace=True)
df.head()

Unnamed: 0,id,filename,dc_record,transcription_or_caption,sdo_record,cidoccrm_record
0,1.0,Recognition for Meritorious Service Plaque - S...,1. Recognition for Meritorious Service Plaque ...,See Google Link for results,See Google Link for results,See Google Link for results
1,2.0,Criteria for Tenure and Promotion and Examples...,2. Criteria for Tenure a,See Google Link for results,See Google Link for results,See Google Link for results
2,3.0,“Turnin’ Timez” original student poems.pdf,turnin_timez_dublin_core.xml \n<?xml version='...,turnin_timez_transcription.txt,"{\n ""@context"": ""https://schema.org"",\n ""@ty...","{\n ""@context"": {\n ""crm"": ""http://www.cid..."
3,4.0,1985_11_03_Agenda_Executive_Committee_Meeting.jpg,3. 1985_11_03_Agenda_Executive_Committee_Meeti...,See Google Link for results,See Google Link for results,See Google Link for results
4,5.0,Climbing Up Fun Activites for You and Your Cat...,<dc:title>Climbing Up: Fun Activities for You ...,Climbing_Up_OCR_Text.txt,"{\n ""@context"": ""https://schema.org"",\n ""@ty...",@prefix crm: <http://www.cidoc-crm.org/cidoc-c...


Create a directory to store the cleaned version of the data:

In [None]:
# Path("data_task1/cleaned/").mkdir(parents=True, exist_ok=True)
# df.to_csv("data_task1/cleaned/4-HDataExperimentAssignmentsAndOutcomes-Outcomes-Task1.csv")

In [6]:
record_ids = list(df["id"])
print(record_ids[2])

3.0


### Dublin Core
Write the [Dublin Core](https://www.dublincore.org) (DC) records as XML files.

In [14]:
dc_records = list(df["dc_record"])
print(dc_records[2])

turnin_timez_dublin_core.xml 
<?xml version='1.0' encoding='utf-8'?>
<dublin_core><dc element="title">Turnin' Timez: Original Student Poems</dc><dc element="creator">Various Authors (Students)</dc><dc element="subject">Poetry, Student Creative Writing, Reflections</dc><dc element="description">A collection of original poems written by students, reflecting on various themes such as identity, change, and personal growth.</dc><dc element="publisher">Unknown</dc><dc element="date">2024-12-04</dc><dc element="format">PDF</dc><dc element="type">Text/Poetry Collection</dc><dc element="language">English</dc><dc element="identifier">urn:uuid:turnin-timez-001</dc></dublin_core>


In [8]:
dc_path = "data_task1/cleaned/dublin_core/"
Path(dc_path).mkdir(parents=True, exist_ok=True)

In [None]:
utils.write_xml(record_ids, dc_records, dc_path, file_prefix="dc_record_")

Wrote dc_record_003.xml!
Wrote dc_record_005.xml!
Wrote dc_record_006.xml!
Wrote dc_record_007.xml!
Wrote dc_record_008.xml!
Wrote dc_record_009.xml!
Wrote dc_record_010.xml!
Wrote dc_record_011.xml!
Wrote dc_record_013.xml!
Wrote dc_record_014.xml!
Wrote dc_record_016.xml!
Wrote dc_record_017.xml!
Wrote dc_record_019.xml!
Wrote dc_record_020.xml!
Wrote dc_record_033.xml!
Wrote dc_record_038.xml!
Wrote dc_record_043.xml!
Wrote dc_record_044.xml!
Wrote dc_record_045.xml!
Wrote dc_record_046.xml!
Wrote dc_record_047.xml!
Wrote dc_record_048.xml!
Wrote dc_record_049.xml!
Wrote dc_record_050.xml!
Wrote dc_record_052.xml!
Wrote dc_record_053.xml!
Wrote dc_record_056.xml!
Wrote dc_record_057.xml!
Wrote dc_record_058.xml!
Wrote dc_record_059.xml!
Wrote dc_record_062.xml!
Wrote dc_record_063.xml!
Wrote dc_record_064.xml!
Wrote dc_record_070.xml!
Wrote dc_record_071.xml!
Wrote dc_record_073.xml!
Wrote dc_record_074.xml!
Wrote dc_record_075.xml!
Wrote dc_record_077.xml!
Wrote dc_record_078.xml!


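**Note:** The DC records are formatted inconsistently: the XML declaration, root element, and namespace declarations vary, and some records have no root element at all. For example: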

```
<?xml version='1.0' encoding='utf-8'?>
<dublin_core><dc element="title">Turnin' Timez: Original Student Poems</dc>
    ...
</dublin_core>
```
---
```
<?xml version="1.0"?>
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title>National 4-H Center Major Pledges, Contributions, and Grants</dc:title>
    ...
</metadata>
```
---
```
<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title>4-H National Youth Science Day</dc:title>
    ...
</metadata>
```
---
```
<dc:title>Climbing Up: Fun Activities for You and Your Cat</dc:title>
   ...
<dc:rights>Unknown</dc:rights>
```
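
One way to compare records despite these differences is to reduce each shape to the same field dictionary. The sketch below handles only the shapes shown above, and `extract_dc_fields` is a hypothetical helper rather than a function from `utils`. It assumes that a record which fails to parse is a bare fragment and wraps it in a root element that binds the `dc:` prefix.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

def extract_dc_fields(record: str) -> dict:
    """Map Dublin Core element names to lists of text values."""
    start = record.find("<")
    if start == -1:
        raise ValueError("no XML markup found in record")
    xml_text = record[start:]  # drop leading non-XML text such as a filename
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Assumed to be a rootless fragment: wrap it and bind the dc: prefix.
        root = ET.fromstring(f'<metadata xmlns:dc="{DC_NS}">{xml_text}</metadata>')

    fields = {}
    for child in root:
        if child.tag == "dc" and "element" in child.attrib:
            name = child.attrib["element"]       # <dc element="title">
        else:
            name = child.tag.rsplit("}", 1)[-1]  # "{namespace}title" -> "title"
        fields.setdefault(name, []).append((child.text or "").strip())
    return fields

fragment = "<dc:title>Climbing Up</dc:title>\n<dc:rights>Unknown</dc:rights>"
print(extract_dc_fields(fragment))
```

Values are kept as lists because Dublin Core elements are repeatable. A record that is malformed for some other reason still raises `ParseError` after the wrapping attempt.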

Also note that the [DCMI documentation](https://www.dublincore.org/specifications/dublin-core/dcmi-terms/) encourages use of the `http://purl.org/dc/terms/` namespace over the older `http://purl.org/dc/elements/1.1/` namespace.
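
Because the fifteen original elements also exist under the same local names in the terms namespace, the mechanical part of that migration can be sketched as a namespace swap. This is only an illustration: `to_dcterms` is a hypothetical helper, and it assumes the input is well-formed XML with a single root.

```python
import xml.etree.ElementTree as ET

DC_ELEMENTS = "http://purl.org/dc/elements/1.1/"
DC_TERMS = "http://purl.org/dc/terms/"

def to_dcterms(xml_text: str) -> str:
    """Re-tag elements from the /elements/1.1/ namespace into /terms/."""
    ET.register_namespace("dcterms", DC_TERMS)  # serialize with the dcterms: prefix
    root = ET.fromstring(xml_text)
    old_prefix = "{" + DC_ELEMENTS + "}"
    for element in root.iter():
        if element.tag.startswith(old_prefix):
            local_name = element.tag[len(old_prefix):]
            element.tag = "{" + DC_TERMS + "}" + local_name
    return ET.tostring(root, encoding="unicode")

source = (
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>4-H National Youth Science Day</dc:title>"
    "</metadata>"
)
print(to_dcterms(source))
```

A swap like this changes only the identifiers. Whether each value suits the corresponding term still needs review.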

### Schema.org
Write the [Schema.org](https://schema.org) records as JSON-LD files.

In [10]:
sdo_path = "data_task1/cleaned/schema_org/"
Path(sdo_path).mkdir(parents=True, exist_ok=True)


In [15]:
sdo_records = list(df["sdo_record"])
print(sdo_records[2])

{
  "@context": "https://schema.org",
  "@type": "CreativeWork",
  "name": "Turnin' Timez: Original Student Poems",
  "author": "Various Authors (Students)",
  "keywords": "Poetry, Student Creative Writing, Reflections",
  "description": "A collection of original poems written by students, reflecting on various themes such as identity, change, and personal growth.",
  "publisher": "Unknown",
  "datePublished": "2024-12-04",
  "encodingFormat": "PDF",
  "inLanguage": "English",
  "identifier": "urn:uuid:turnin-timez-001"
}


In [None]:
utils.write_json(record_ids, sdo_records, sdo_path, "sdo_record_")

Wrote sdo_record_003.json!
Wrote sdo_record_005.json!
Wrote sdo_record_006.json!
Wrote sdo_record_007.json!
Wrote sdo_record_008.json!
Wrote sdo_record_009.json!
Wrote sdo_record_010.json!
Wrote sdo_record_011.json!
Wrote sdo_record_013.json!
Wrote sdo_record_014.json!
Wrote sdo_record_016.json!
Wrote sdo_record_017.json!
Wrote sdo_record_018.json!
Wrote sdo_record_019.json!
Wrote sdo_record_020.json!
Wrote sdo_record_031.json!
Wrote sdo_record_032.json!
Wrote sdo_record_033.json!
Wrote sdo_record_034.json!
Wrote sdo_record_038.json!
Wrote sdo_record_043.json!
Wrote sdo_record_044.json!
Wrote sdo_record_045.json!
Wrote sdo_record_046.json!
Wrote sdo_record_047.json!
Wrote sdo_record_048.json!
Wrote sdo_record_049.json!
Wrote sdo_record_050.json!
Wrote sdo_record_052.json!
Wrote sdo_record_053.json!
Wrote sdo_record_056.json!
Wrote sdo_record_057.json!
Wrote sdo_record_058.json!
Wrote sdo_record_059.json!
Wrote sdo_record_062.json!
Wrote sdo_record_063.json!
Wrote sdo_record_064.json!
W

In [28]:
r = sdo_records[2]
print(r)

{
  "@context": "https://schema.org",
  "@type": "CreativeWork",
  "name": "Turnin' Timez: Original Student Poems",
  "author": "Various Authors (Students)",
  "keywords": "Poetry, Student Creative Writing, Reflections",
  "description": "A collection of original poems written by students, reflecting on various themes such as identity, change, and personal growth.",
  "publisher": "Unknown",
  "datePublished": "2024-12-04",
  "encodingFormat": "PDF",
  "inLanguage": "English",
  "identifier": "urn:uuid:turnin-timez-001"
}


In [None]:
# Greedily match from the first "{" to the last "}" (newlines included) to isolate the JSON object
x = re.findall(r"\{[\W\w]*\}", r)
print(x)

['{\n  "@context": "https://schema.org",\n  "@type": "CreativeWork",\n  "name": "Turnin\' Timez: Original Student Poems",\n  "author": "Various Authors (Students)",\n  "keywords": "Poetry, Student Creative Writing, Reflections",\n  "description": "A collection of original poems written by students, reflecting on various themes such as identity, change, and personal growth.",\n  "publisher": "Unknown",\n  "datePublished": "2024-12-04",\n  "encodingFormat": "PDF",\n  "inLanguage": "English",\n  "identifier": "urn:uuid:turnin-timez-001"\n}']
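
The regular expression only isolates a brace-delimited span; it does not confirm that the span is valid JSON. A small sketch of combining the extraction with a syntax check follows, where `extract_json` is a hypothetical helper name.

```python
import json
import re

def extract_json(text: str):
    """Return the parsed JSON object embedded in text, or None if absent or invalid."""
    # Greedy match from the first "{" to the last "}", spanning newlines.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

print(extract_json('Record: {"@type": "CreativeWork", "name": "Example"} (end)'))
print(extract_json("UNABLE TO EXTRACT TEXT"))
```

Because the match is greedy, a cell containing two separate JSON objects would be captured as one span and fail to parse, so such cells would need separate handling.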


### CIDOC-CRM
Write the [CIDOC-CRM](https://cidoc-crm.org) records as JSON-LD files.
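
The `df.head()` previews earlier in this notebook show some CIDOC-CRM cells beginning with `@prefix crm:`, which is Turtle syntax rather than JSON-LD. A rough heuristic for telling the two apart before writing files is sketched below; `guess_serialization` is a hypothetical helper, and the Turtle test checks only for leading prefix or base directives.

```python
import json

def guess_serialization(record: str) -> str:
    """Heuristically label a record as 'json', 'turtle', or 'unknown'."""
    text = record.lstrip()
    if text.startswith(("{", "[")):
        try:
            json.loads(text)
            return "json"
        except json.JSONDecodeError:
            return "unknown"
    # Turtle documents commonly open with prefix or base directives.
    if text.startswith(("@prefix", "@base", "PREFIX", "BASE")):
        return "turtle"
    return "unknown"

print(guess_serialization('{"@type": "crm:E73_Information_Object"}'))
print(guess_serialization("@prefix crm: <http://www.cidoc-crm.org/cidoc-crm/> ."))
```

A result of `"json"` means only that the text is valid JSON; confirming that it is valid JSON-LD would require a JSON-LD processor.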

In [7]:
cidoc_path = "data_task1/cleaned/cidoc_crm/"
Path(cidoc_path).mkdir(parents=True, exist_ok=True)


In [8]:
cidoc_records = list(df["cidoccrm_record"])
print(cidoc_records[2])

{
  "@context": {
    "crm": "http://www.cidoc-crm.org/cidoc-crm/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@type": "crm:E73_Information_Object",
  "crm:P102_has_title": {
    "@value": "Turnin' Timez: Original Student Poems",
    "@language": "en"
  },
  "crm:P94_has_created": {
    "@type": "crm:E21_Person",
    "rdfs:label": "Various Authors (Students)"
  },
  "crm:P3_has_note": {
    "@value": "A collection of original poems written by students, reflecting on various themes such as identity, change, and personal growth.",
    "@language": "en"
  },
  "crm:P4_has_time-span": {
    "@type": "crm:E52_Time-Span",
    "crm:P82_at_some_time_within": {
      "@value": "2024-12-04",
      "@type": "xsd:date"
    }
  },
  "crm:P108_has_produced": {
    "@type": "crm:E40_Legal_Body",
    "rdfs:label": "Unknown"
  },
  "crm:P72_has_language": {
    "@type": "crm:E56_Language",
    "rdfs:label": "English"
  },
  "crm:P1_is_ide
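
The record above attaches `crm:P94_has_created`, `crm:P108_has_produced`, and `crm:P4_has_time-span` directly to an `E73_Information_Object`. In CIDOC-CRM those properties have event-like domains (E65 Creation, E12 Production, and E2 Temporal Entity, respectively), so this is a candidate conformance problem. The sketch below flags such cases. The `EXPECTED_DOMAINS` table is hand-written and deliberately partial, including its subclass lists, so it should be verified against the CIDOC-CRM specification before being relied on.

```python
# Partial, hand-written domain table (an assumption to verify against the spec).
# Each property maps to the classes allowed as its subject, including a few subclasses.
EXPECTED_DOMAINS = {
    "crm:P94_has_created": {"crm:E65_Creation"},
    "crm:P108_has_produced": {"crm:E12_Production"},
    "crm:P4_has_time-span": {
        "crm:E2_Temporal_Entity", "crm:E4_Period", "crm:E5_Event",
        "crm:E7_Activity", "crm:E12_Production", "crm:E65_Creation",
    },
}

def domain_violations(node: dict) -> list:
    """Return (property, node type) pairs where the node type is outside the expected domain."""
    node_type = node.get("@type")
    problems = []
    for prop, allowed in EXPECTED_DOMAINS.items():
        if prop in node and node_type not in allowed:
            problems.append((prop, node_type))
    return problems

node = {
    "@type": "crm:E73_Information_Object",
    "crm:P94_has_created": {"@type": "crm:E21_Person"},
    "crm:P4_has_time-span": {"@type": "crm:E52_Time-Span"},
}
for prop, node_type in domain_violations(node):
    print(f"{prop} used on {node_type}")
```

This checks only the top-level node and only the subject side of each property; ranges and nested nodes would need the same treatment.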

In [9]:
utils.write_json(record_ids, cidoc_records, cidoc_path, "cidoccrm_record_")

Wrote cidoccrm_record_003.json!
Wrote cidoccrm_record_006.json!
Wrote cidoccrm_record_007.json!
Wrote cidoccrm_record_008.json!
Wrote cidoccrm_record_009.json!
Wrote cidoccrm_record_010.json!
Wrote cidoccrm_record_011.json!
Wrote cidoccrm_record_013.json!
Wrote cidoccrm_record_014.json!
Wrote cidoccrm_record_016.json!
Wrote cidoccrm_record_017.json!
Wrote cidoccrm_record_018.json!
Wrote cidoccrm_record_019.json!
Wrote cidoccrm_record_020.json!
Wrote cidoccrm_record_043.json!
Wrote cidoccrm_record_044.json!
Wrote cidoccrm_record_045.json!
Wrote cidoccrm_record_046.json!
Wrote cidoccrm_record_047.json!
Wrote cidoccrm_record_048.json!
Wrote cidoccrm_record_049.json!
Wrote cidoccrm_record_050.json!
Wrote cidoccrm_record_052.json!
Wrote cidoccrm_record_053.json!
Wrote cidoccrm_record_056.json!
Wrote cidoccrm_record_057.json!
Wrote cidoccrm_record_058.json!
Wrote cidoccrm_record_059.json!
Wrote cidoccrm_record_061.json!
Wrote cidoccrm_record_062.json!
Wrote cidoccrm_record_063.json!
Wrote ci