# AI-Generated Linked Data Evaluation

## Linking Anthropology's Data and Archives: Task 1

**Considerations**:
 - Syntax: does the output adhere to the expected serialization format (e.g. well-formed XML)?
 - Completeness: are fields populated, rather than empty or 'unknown'?
 - Conformance to ontologies (i.e. CIDOC-CRM, Schema.org, Dublin Core)
 - Consistency across generated data points
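
As a rough illustration of the conformance consideration, the sketch below runs a shallow check on one Schema.org record serialized as JSON-LD: it parses, its `@context` names schema.org, and it declares an `@type`. The helper name `shallow_sdo_check` and the set of accepted context strings are assumptions made for this example, not part of the notebook, and the check says nothing about whether the properties used are valid for the declared type.

```python
import json

# Context strings accepted as "Schema.org" (an assumption for this sketch).
SCHEMA_ORG_CONTEXTS = {
    "https://schema.org", "https://schema.org/",
    "http://schema.org", "http://schema.org/",
}

def shallow_sdo_check(record_text):
    """Run three shallow checks on one JSON-LD string (hypothetical helper)."""
    result = {"parses": False, "sdo_context": False, "has_type": False}
    try:
        data = json.loads(record_text)
    except (TypeError, ValueError):  # non-string input or invalid JSON
        return result
    result["parses"] = True
    if not isinstance(data, dict):
        return result
    context = data.get("@context")
    # @context may be a string, a list, or an object; only strings are matched here.
    contexts = context if isinstance(context, list) else [context]
    result["sdo_context"] = any(
        isinstance(c, str) and c in SCHEMA_ORG_CONTEXTS for c in contexts
    )
    result["has_type"] = bool(data.get("@type"))
    return result

print(shallow_sdo_check('{"@context": "https://schema.org", "@type": "Book"}'))
```

A record that passes all three checks is only a candidate for closer review; full conformance would need an ontology-aware validator.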

In [36]:
import pandas as pd
import numpy as np
import xml.etree.ElementTree as ET
import json
import lxml
import rdflib
from rdflib.namespace import DC, SDO # Dublin Core, Schema.org
from pathlib import Path

# Candidate validation tools:
# xml.sax - to check that XML is well-formed
# xml.etree.ElementTree - to parse XML and inspect the text between tags
# lxml.etree.XMLSchema's validate() - to validate XML against an XML Schema
#   (the standard-library xml.etree.ElementTree has no XMLSchema class)
# lxml etree.XMLParser - to check well-formedness and validate against an input XML schema while parsing
# json_checker - to validate Python data types (incl. but not limited to those obtained from JSON)
# jsonschema.validate - to validate JSON against a JSON Schema
# ShEx - for RDF graphs, ShExJ for JSON - NOTE: couldn't install package
# OntoME - for CIDOC-CRM ontology alignment
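
The syntax checks listed above could start with something as small as the following sketch, which uses only the standard library to test whether a string is well-formed XML or parseable JSON. The function names `is_well_formed_xml` and `is_valid_json` are hypothetical, and neither function validates against a schema; they only report whether the parser accepts the text.

```python
import json
import xml.etree.ElementTree as ET

def is_well_formed_xml(text):
    """Return True if the standard-library parser accepts `text` as XML."""
    try:
        ET.fromstring(text)
    except (ET.ParseError, TypeError, ValueError):
        return False
    return True

def is_valid_json(text):
    """Return True if `text` parses as JSON (json.JSONDecodeError is a ValueError)."""
    try:
        json.loads(text)
    except (TypeError, ValueError):
        return False
    return True

# A fragment that uses the dc: prefix without declaring its namespace
# is rejected by the parser as not well-formed.
print(is_well_formed_xml("<dc:title>Climbing Up</dc:title>"))
print(is_valid_json('{"@context": "https://schema.org"}'))
```

Note that a Dublin Core fragment copied without its enclosing element and namespace declarations fails this test even if every tag is balanced, so a failure here may point to a truncated record rather than a malformed one.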

In [30]:
f = "data/4-HDataExperimentAssignmentsAndOutcomes-Outcomes-Task1.csv"
df = pd.read_csv(f, sep=",", header=0, encoding="utf-8")
df.head()

Unnamed: 0,ID,Filename,Metadata record,"Transcription or caption (or link to separate doc, if too long)",Schema.org Record,CIDOC-CRM Record
0,1.0,Recognition for Meritorious Service Plaque - S...,1. Recognition for Meritorious Service Plaque ...,See Google Link for results,See Google Link for results,See Google Link for results
1,2.0,Criteria for Tenure and Promotion and Examples...,2. Criteria for Tenure a,See Google Link for results,See Google Link for results,See Google Link for results
2,3.0,“Turnin’ Timez” original student poems.pdf,turnin_timez_dublin_core.xml \n<?xml version='...,turnin_timez_transcription.txt,"{\n ""@context"": ""https://schema.org"",\n ""@ty...","{\n ""@context"": {\n ""crm"": ""http://www.cid..."
3,4.0,1985_11_03_Agenda_Executive_Committee_Meeting.jpg,3. 1985_11_03_Agenda_Executive_Committee_Meeti...,See Google Link for results,See Google Link for results,See Google Link for results
4,5.0,Climbing Up Fun Activites for You and Your Cat...,<dc:title>Climbing Up: Fun Activities for You ...,Climbing_Up_OCR_Text.txt,"{\n ""@context"": ""https://schema.org"",\n ""@ty...",@prefix crm: <http://www.cidoc-crm.org/cidoc-c...


In [31]:
# Drop rows that have a missing value in any column, then inspect the last rows
df.dropna(inplace=True)
df.tail()

Unnamed: 0,ID,Filename,Metadata record,"Transcription or caption (or link to separate doc, if too long)",Schema.org Record,CIDOC-CRM Record
135,136.0,0119_0000_Development-Photos,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT
136,137.0,0063_1988_International-Symposium,Title: 4-H USA International Programs\nCreator...,I can see that this is a comprehensive speech ...,"{\n ""@context"": ""https://schema.org"",\n ""@ty...",@prefix crm: <http://www.cidoc-crm.org/cidoc-c...
137,138.0,0096_1985_Communications-Newsbreak,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT
138,139.0,0076_1982_History-Canada,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT
139,140.0,0025_1988_Cooperative-Extension-Booklet,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT
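
The rows above survive `dropna()` because placeholder strings such as "UNABLE TO EXTRACT TEXT" are not missing values as far as pandas is concerned. A minimal sketch of a completeness check that treats those strings as empty follows; the placeholder set is taken from the values visible in this notebook's output plus 'unknown' from the considerations list, and the helper names `is_placeholder` and `completeness` are assumptions for illustration.

```python
# Strings treated as "no real content" (compared case-insensitively).
PLACEHOLDERS = {
    "",
    "unknown",
    "unable to extract text",
    "see google link for results",
}

def is_placeholder(value):
    """Return True for None, NaN, or a known placeholder string."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is not equal to itself
        return True
    return str(value).strip().lower() in PLACEHOLDERS

def completeness(row, fields):
    """Fraction of `fields` in `row` (a dict-like) that hold real content."""
    if not fields:
        return 0.0
    filled = sum(not is_placeholder(row.get(f)) for f in fields)
    return filled / len(fields)

example = {
    "Metadata record": "UNABLE TO EXTRACT TEXT",
    "Schema.org Record": '{"@context": "https://schema.org"}',
}
print(completeness(example, ["Metadata record", "Schema.org Record"]))
```

The same predicate can be applied to a single DataFrame column with `Series.map(is_placeholder)` to count how many records are placeholders.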


In [32]:
print(df.shape)

(103, 6)


In [34]:
df.rename(columns={
    "ID":"id", "Filename":"filename", "Metadata record": "dc_record", 
    "Transcription or caption (or link to separate doc, if too long)":"transcription_or_caption",
    "Schema.org Record":"sdo_record", "CIDOC-CRM Record":"cidoccrm_record"
    }, inplace=True)
df.head()

Unnamed: 0,id,filename,dc_record,transcription_or_caption,sdo_record,cidoccrm_record
0,1.0,Recognition for Meritorious Service Plaque - S...,1. Recognition for Meritorious Service Plaque ...,See Google Link for results,See Google Link for results,See Google Link for results
1,2.0,Criteria for Tenure and Promotion and Examples...,2. Criteria for Tenure a,See Google Link for results,See Google Link for results,See Google Link for results
2,3.0,“Turnin’ Timez” original student poems.pdf,turnin_timez_dublin_core.xml \n<?xml version='...,turnin_timez_transcription.txt,"{\n ""@context"": ""https://schema.org"",\n ""@ty...","{\n ""@context"": {\n ""crm"": ""http://www.cid..."
3,4.0,1985_11_03_Agenda_Executive_Committee_Meeting.jpg,3. 1985_11_03_Agenda_Executive_Committee_Meeti...,See Google Link for results,See Google Link for results,See Google Link for results
4,5.0,Climbing Up Fun Activites for You and Your Cat...,<dc:title>Climbing Up: Fun Activities for You ...,Climbing_Up_OCR_Text.txt,"{\n ""@context"": ""https://schema.org"",\n ""@ty...",@prefix crm: <http://www.cidoc-crm.org/cidoc-c...


In [37]:
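# Create the output directory if needed, then save the cleaned table.
# to_csv() writes the index by default, which preserves the original row numbers.
Path("data/cleaned/").mkdir(parents=True, exist_ok=True)
df.to_csv("data/cleaned/4-HDataExperimentAssignmentsAndOutcomes-Outcomes-Task1.csv")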
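
The previews above suggest that the CIDOC-CRM records arrive in more than one serialization (some start with `{`, others with `@prefix`). For the consistency consideration, a heuristic like the one sketched below could tally which serialization each record appears to use before any parser is chosen. The function names `guess_serialization` and `format_counts` are hypothetical, and the prefix tests are simple assumptions that will misclassify unusual inputs.

```python
from collections import Counter

def guess_serialization(text):
    """Guess a record's serialization from its first characters (heuristic only)."""
    if not isinstance(text, str):
        return "missing"  # e.g. NaN from an empty cell
    s = text.lstrip()
    if s.startswith(("{", "[")):
        return "json-ld"
    # Turtle usually opens with a directive or an IRI in angle brackets.
    if s.startswith(("@prefix", "@base", "<http://", "<https://")):
        return "turtle"
    if s.startswith("<"):
        return "xml"
    return "other"  # placeholders, file names, free text

def format_counts(records):
    """Count the guessed serialization for each record in an iterable."""
    return Counter(guess_serialization(r) for r in records)

sample = [
    '{\n  "@context": "https://schema.org"}',
    "@prefix crm: <http://www.cidoc-crm.org/cidoc-crm/> .",
    "See Google Link for results",
]
print(format_counts(sample))
```

A column dominated by one label but containing a few others would be a natural place to start looking for inconsistently generated records.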