# AI-Generated Linked Data Evaluation

## Linking Anthropology's Data and Archives: Task 1

**Considerations**:
 - Syntax (does each record adhere to its expected serialization format, e.g. well-formed XML?)
 - Completeness (fields are populated, not empty or 'unknown')
 - Conformance to ontologies (i.e. CIDOC-CRM, Schema.org, Dublin Core)
 - Consistency (across generated data points)

In [43]:
import pandas as pd
import numpy as np
import xml.etree.ElementTree as ET
import json
import lxml
import rdflib
from rdflib.namespace import DC, SDO # Dublin Core, Schema.org
from pathlib import Path
import os
import re

# Candidate validation tools:
# xml.sax - to check that XML is well-formed
# xml.etree.ElementTree - to check well-formedness and read the text between tags
# XMLSchema validate() - to validate XML against an XSD schema; the standard-library
#   xml.etree.ElementTree has no XMLSchema class, but lxml.etree.XMLSchema and the
#   third-party xmlschema package both provide validate()
# lxml etree.XMLParser - to parse and validate based on an input XML schema
# json_checker - to validate Python data types (incl. but not limited to those obtained from JSON)
# jsonschema.validate - to validate JSON against a JSON Schema
# ShEx - for RDF graphs, ShExJ for JSON - NOTE: couldn't install package
# OntoME - for CIDOC-CRM ontology alignment
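
As a starting point for the syntax consideration, the sketch below checks well-formedness using only the standard library. The helper names `is_well_formed_xml` and `is_valid_json` are hypothetical (they are not part of the notebook), and the checks cover syntax only, not conformance to any schema or ontology.

```python
import json
import xml.etree.ElementTree as ET


def is_well_formed_xml(text: str) -> bool:
    """Return True if `text` parses as a single well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON (syntax only, no schema check)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```

Placeholder strings such as "See Google Link for results" fail both checks, so these helpers could also serve as a rough first filter before any schema-level validation.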

In [67]:
f = "data_task1/4-HDataExperimentAssignmentsAndOutcomes-Outcomes-Task1.csv"
df = pd.read_csv(f, sep=",", header=0, encoding="utf-8")
df.head()

Unnamed: 0,ID,Filename,Metadata record,"Transcription or caption (or link to separate doc, if too long)",Schema.org Record,CIDOC-CRM Record
0,1.0,Recognition for Meritorious Service Plaque - S...,1. Recognition for Meritorious Service Plaque ...,See Google Link for results,See Google Link for results,See Google Link for results
1,2.0,Criteria for Tenure and Promotion and Examples...,2. Criteria for Tenure a,See Google Link for results,See Google Link for results,See Google Link for results
2,3.0,“Turnin’ Timez” original student poems.pdf,turnin_timez_dublin_core.xml \n<?xml version='...,turnin_timez_transcription.txt,"{\n ""@context"": ""https://schema.org"",\n ""@ty...","{\n ""@context"": {\n ""crm"": ""http://www.cid..."
3,4.0,1985_11_03_Agenda_Executive_Committee_Meeting.jpg,3. 1985_11_03_Agenda_Executive_Committee_Meeti...,See Google Link for results,See Google Link for results,See Google Link for results
4,5.0,Climbing Up Fun Activites for You and Your Cat...,<dc:title>Climbing Up: Fun Activities for You ...,Climbing_Up_OCR_Text.txt,"{\n ""@context"": ""https://schema.org"",\n ""@ty...",@prefix crm: <http://www.cidoc-crm.org/cidoc-c...


In [68]:
# Drop rows that have a missing value in any column
df.dropna(inplace=True)
df.tail()

Unnamed: 0,ID,Filename,Metadata record,"Transcription or caption (or link to separate doc, if too long)",Schema.org Record,CIDOC-CRM Record
135,136.0,0119_0000_Development-Photos,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT
136,137.0,0063_1988_International-Symposium,Title: 4-H USA International Programs\nCreator...,I can see that this is a comprehensive speech ...,"{\n ""@context"": ""https://schema.org"",\n ""@ty...",@prefix crm: <http://www.cidoc-crm.org/cidoc-c...
137,138.0,0096_1985_Communications-Newsbreak,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT
138,139.0,0076_1982_History-Canada,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT
139,140.0,0025_1988_Cooperative-Extension-Booklet,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT,UNABLE TO EXTRACT TEXT


In [69]:
print(df.shape)

(103, 6)
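
Note that `dropna` removes only truly missing values: rows whose cells hold placeholder strings such as "UNABLE TO EXTRACT TEXT" remain, as the `df.tail()` output above shows. A minimal sketch of a completeness check follows. The `PLACEHOLDERS` set and the `is_placeholder` helper are assumptions for illustration, built from the placeholder values visible in the table previews plus the 'unknown' case named in the considerations.

```python
# Hypothetical placeholder values, lower-cased for case-insensitive matching
PLACEHOLDERS = {
    "",
    "unknown",
    "unable to extract text",
    "see google link for results",
}


def is_placeholder(value) -> bool:
    """Return True if a cell is missing or holds a known placeholder string."""
    # None and NaN (NaN is the only value not equal to itself) count as missing
    if value is None or value != value:
        return True
    return str(value).strip().lower() in PLACEHOLDERS
```

Applied to a column, for example with `df["dc_record"].map(is_placeholder).sum()`, this would count how many records carry no usable Dublin Core data.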


In [70]:
df.rename(columns={
    "ID":"id", "Filename":"filename", "Metadata record": "dc_record", 
    "Transcription or caption (or link to separate doc, if too long)":"transcription_or_caption",
    "Schema.org Record":"sdo_record", "CIDOC-CRM Record":"cidoccrm_record"
    }, inplace=True)
df.head()

Unnamed: 0,id,filename,dc_record,transcription_or_caption,sdo_record,cidoccrm_record
0,1.0,Recognition for Meritorious Service Plaque - S...,1. Recognition for Meritorious Service Plaque ...,See Google Link for results,See Google Link for results,See Google Link for results
1,2.0,Criteria for Tenure and Promotion and Examples...,2. Criteria for Tenure a,See Google Link for results,See Google Link for results,See Google Link for results
2,3.0,“Turnin’ Timez” original student poems.pdf,turnin_timez_dublin_core.xml \n<?xml version='...,turnin_timez_transcription.txt,"{\n ""@context"": ""https://schema.org"",\n ""@ty...","{\n ""@context"": {\n ""crm"": ""http://www.cid..."
3,4.0,1985_11_03_Agenda_Executive_Committee_Meeting.jpg,3. 1985_11_03_Agenda_Executive_Committee_Meeti...,See Google Link for results,See Google Link for results,See Google Link for results
4,5.0,Climbing Up Fun Activites for You and Your Cat...,<dc:title>Climbing Up: Fun Activities for You ...,Climbing_Up_OCR_Text.txt,"{\n ""@context"": ""https://schema.org"",\n ""@ty...",@prefix crm: <http://www.cidoc-crm.org/cidoc-c...


Create a directory for the cleaned version of the data and save the data there:

In [71]:
Path("data_task1/cleaned/").mkdir(parents=True, exist_ok=True)
df.to_csv("data_task1/cleaned/4-HDataExperimentAssignmentsAndOutcomes-Outcomes-Task1.csv")

Write each Dublin Core (DC) record to its own XML file in a dedicated directory:

In [72]:
record_ids = list(df["id"])
dc_records = list(df["dc_record"])
print(record_ids[2])
print(dc_records[2])

3.0
turnin_timez_dublin_core.xml 
<?xml version='1.0' encoding='utf-8'?>
<dublin_core><dc element="title">Turnin' Timez: Original Student Poems</dc><dc element="creator">Various Authors (Students)</dc><dc element="subject">Poetry, Student Creative Writing, Reflections</dc><dc element="description">A collection of original poems written by students, reflecting on various themes such as identity, change, and personal growth.</dc><dc element="publisher">Unknown</dc><dc element="date">2024-12-04</dc><dc element="format">PDF</dc><dc element="type">Text/Poetry Collection</dc><dc element="language">English</dc><dc element="identifier">urn:uuid:turnin-timez-001</dc></dublin_core>
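
The record above stores each field as a `<dc element="...">` node, which makes a completeness check straightforward once the XML is parsed. The sketch below is one possible approach, not the notebook's method: `dc_fields` and `unknown_fields` are hypothetical helpers, and `sample` is a shortened excerpt of the record printed above.

```python
import xml.etree.ElementTree as ET


def dc_fields(xml_text: str) -> dict:
    """Map each <dc element="..."> name to its stripped text content."""
    root = ET.fromstring(xml_text)
    return {dc.get("element"): (dc.text or "").strip() for dc in root.iter("dc")}


def unknown_fields(fields: dict) -> list:
    """List the field names whose value is empty or 'unknown'."""
    return [name for name, value in fields.items() if value.lower() in {"", "unknown"}]


# Shortened excerpt of record 3 (three of its ten fields)
sample = (
    "<dublin_core>"
    '<dc element="title">Turnin\' Timez: Original Student Poems</dc>'
    '<dc element="publisher">Unknown</dc>'
    '<dc element="date">2024-12-04</dc>'
    "</dublin_core>"
)
fields = dc_fields(sample)
print(unknown_fields(fields))  # prints ['publisher']
```

Because the result is a dictionary, a repeated element (for example two `subject` nodes) would keep only the last value; a list of pairs would be the safer structure if repeats matter.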


In [73]:
dc_path = "data_task1/cleaned/dublin_core/"
Path(dc_path).mkdir(parents=True, exist_ok=True)

In [74]:
def write_dc_xml(records_ids_list, dc_records_list, dc_directory):
    """
    Write the Dublin Core XML records to a directory.
    """
    for raw_id, record in zip(records_ids_list, dc_records_list):
        # IDs are read from the CSV as floats (e.g. 3.0), so convert to an integer string
        record_id = str(int(raw_id))

        # Remove extraneous text so only data between tags (< >) remains
        xml_data = re.findall(r"<.+>", record)

        # If xml_data isn't an empty list, define the file name,
        # padding the ID with leading zeros to three digits, and
        # write the data to an XML file
        if len(xml_data) > 0:
            filename = "dc_record_" + record_id.zfill(3) + ".xml"
            filepath = Path(dc_directory) / filename

            # The with block closes the file, so no explicit close() is needed
            with open(filepath, "w", encoding="utf-8") as file:
                for line in xml_data:
                    file.write(line)
                    file.write("\n")
            print("Wrote", filename+"!")
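
To illustrate the extraction step in the function above, the sketch below applies the same `<.+>` pattern to a shortened version of record 3 and to one of the placeholder strings. The variable names are hypothetical.

```python
import re

# Shortened version of record 3: a file name line followed by two lines of XML
record = (
    "turnin_timez_dublin_core.xml \n"
    "<?xml version='1.0' encoding='utf-8'?>\n"
    '<dublin_core><dc element="title">Turnin\' Timez: Original Student Poems</dc></dublin_core>'
)

# "." does not match a newline, so each match stays within one line and runs
# from that line's first "<" to its last ">"
tagged_lines = re.findall(r"<.+>", record)

# A record without any tags yields an empty list, so no file is written for it
no_tags = re.findall(r"<.+>", "See Google Link for results")

print(len(tagged_lines), no_tags)  # prints 2 []
```

The file name line is dropped because it contains no tags. The pattern does not test well-formedness, and a tag that spans several lines would not be matched, so the written files still need a separate syntax check.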


In [75]:
write_dc_xml(record_ids, dc_records, dc_path)

Wrote dc_record_003.xml!
Wrote dc_record_005.xml!
Wrote dc_record_006.xml!
Wrote dc_record_007.xml!
Wrote dc_record_008.xml!
Wrote dc_record_009.xml!
Wrote dc_record_010.xml!
Wrote dc_record_011.xml!
Wrote dc_record_013.xml!
Wrote dc_record_014.xml!
Wrote dc_record_016.xml!
Wrote dc_record_017.xml!
Wrote dc_record_019.xml!
Wrote dc_record_020.xml!
Wrote dc_record_033.xml!
Wrote dc_record_038.xml!
Wrote dc_record_043.xml!
Wrote dc_record_044.xml!
Wrote dc_record_045.xml!
Wrote dc_record_046.xml!
Wrote dc_record_047.xml!
Wrote dc_record_048.xml!
Wrote dc_record_049.xml!
Wrote dc_record_050.xml!
Wrote dc_record_052.xml!
Wrote dc_record_053.xml!
Wrote dc_record_056.xml!
Wrote dc_record_057.xml!
Wrote dc_record_058.xml!
Wrote dc_record_059.xml!
Wrote dc_record_062.xml!
Wrote dc_record_063.xml!
Wrote dc_record_064.xml!
Wrote dc_record_070.xml!
Wrote dc_record_071.xml!
Wrote dc_record_073.xml!
Wrote dc_record_074.xml!
Wrote dc_record_075.xml!
Wrote dc_record_077.xml!
Wrote dc_record_078.xml!
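
The output skips several IDs (for example 001, 002, and 004) because those records contain no tagged text. A small sketch for listing the skipped IDs follows; `skipped_record_ids` is a hypothetical helper that assumes the `dc_record_NNN.xml` naming used above, and the temporary directory merely stands in for `dc_path`.

```python
import tempfile
from pathlib import Path


def skipped_record_ids(record_ids, dc_directory):
    """Return the IDs (as ints) that have no dc_record_NNN.xml file in the directory."""
    written = {p.name for p in Path(dc_directory).glob("dc_record_*.xml")}
    return [
        int(rid)
        for rid in record_ids
        if f"dc_record_{int(rid):03d}.xml" not in written
    ]


# Demonstration with a temporary directory holding a single written record
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "dc_record_003.xml").write_text("<dublin_core/>", encoding="utf-8")
    demo_skipped = skipped_record_ids([1.0, 2.0, 3.0], tmp)

print(demo_skipped)  # prints [1, 2]
```

In the notebook, calling `skipped_record_ids(record_ids, dc_path)` would give the records that need a closer look, since a missing file means the Dublin Core field held a placeholder or untagged text.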
