# Automated Correction: Dublin Core XML Syntax

Part IV of the computational evaluation of AI-generated linked data for [Linking Anthropology's Data and Archives (LADA)](https://ischool.umd.edu/projects/building-a-sustainable-future-for-anthropologys-archives-researching-primary-source-data-lifecycles-infrastructures-and-reuse/), focused on syntax (e.g., do the metadata adhere to the expected serialization formats?).

---

**Table of Contents:**

I. [Data Loading](#data-loading)

II. [Automated Correction](#automated-correction)

---

## Data Loading

In [None]:
import utils
import config
import pandas as pd
import numpy as np
import urllib.request
import urllib
import xml.etree.ElementTree as ET
import json
from lxml import etree
import rdflib
from rdflib.namespace import DC, SDO # Dublin Core, Schema.org
from pathlib import Path
import os
import re

## Automated Correction

Try to correct undefined namespace prefix errors automatically. For each errored file, read its equivalent with a `.txt` extension, apply the correction, and, if the result can be parsed with an XML parser, save it to a new directory with a `.xml` extension.
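The body of `utils.correctXML` is not shown in this notebook, so the following is only a minimal sketch of one way an undefined-prefix correction could work, not the project's implementation. The function name `declare_missing_prefixes` and the `KNOWN_NAMESPACES` mapping are assumptions introduced for illustration. The sketch assumes each record has a single root element and only handles prefixes that are used but never declared; it does not repair other errors such as an empty `xmlns:dc=""` declaration or an unescaped `&`.

```python
import re
import xml.etree.ElementTree as ET

# Assumed prefix-to-URI mapping (hypothetical; extend as needed)
KNOWN_NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def declare_missing_prefixes(xml_text: str) -> str:
    """Declare known prefixes that are used in xml_text but never bound."""
    # Prefixes used on element names, e.g. <dc:title> or </dc:title>
    used = set(re.findall(r"</?([A-Za-z_][\w.-]*):", xml_text))
    # Prefixes used on attribute names, e.g. rdf:about="..."
    used |= set(re.findall(r"\s([A-Za-z_][\w.-]*):[\w.-]+\s*=", xml_text))
    used -= {"xmlns", "xml"}  # reserved prefixes, never declared by hand
    declared = set(re.findall(r"xmlns:([\w.-]+)\s*=", xml_text))
    missing = sorted(p for p in used - declared if p in KNOWN_NAMESPACES)
    if not missing:
        return xml_text
    decls = "".join(f' xmlns:{p}="{KNOWN_NAMESPACES[p]}"' for p in missing)
    # Append the declarations to the first start tag; the pattern skips
    # the XML prolog (<?xml ...?>) and comments (<!-- ... -->)
    return re.sub(
        r"<([A-Za-z_][\w.:-]*)",
        lambda m: m.group(0) + decls,
        xml_text,
        count=1,
    )


broken = "<metadata><dc:title>Field notes</dc:title></metadata>"
fixed = declare_missing_prefixes(broken)
root = ET.fromstring(fixed)  # parses once the dc prefix is bound
```

Parsing the corrected string before writing it to disk keeps unparseable output from being saved with a `.xml` extension.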

In [None]:
df_se.exception_subtype.unique()

array(['Namespace prefix dc on title is not defined',
       'Namespace prefix rdf for about on Description is not defined',
       'xmlns:dc: Empty XML namespace is not allowed',
       'Namespace prefix rdf on Description is not defined',
       'xmlParseEntityRef: no name', 'Missing namespace',
       'Missing prolog'], dtype=object)

In [None]:
errored_files = list(df_se.file_path)
error_list = list(df_se.exception_subtype)
assert (len(error_list) == len(errored_files)), "Error list and errored files lists should be of the same length"

In [None]:
txt_errored_files = [f.replace(".xml", ".txt") for f in errored_files]
print(txt_errored_files[0])
print(error_list[0])

data/data_playground_task1/cleaned/dublin_core/dc_record_005.txt
Namespace prefix dc on title is not defined


In [None]:
still_incorrect = utils.correctXML(txt_errored_files, error_list)
print(f"Files that still need correcting: {still_incorrect}.")

Files that still need correcting: [{'file': 'data/data_playground_task1/cleaned/dublin_core/dc_record_005.txt', 'exception_type': <class 'xml.etree.ElementTree.ParseError'>, 'exception_message': 'unbound prefix: line 1, column 0'}, {'file': 'data/data_playground_task1/cleaned/dublin_core/dc_record_006.txt', 'exception_type': <class 'xml.etree.ElementTree.ParseError'>, 'exception_message': 'unbound prefix: line 3, column 0'}, {'file': 'data/data_playground_task1/cleaned/dublin_core/dc_record_007.txt', 'exception_type': <class 'xml.etree.ElementTree.ParseError'>, 'exception_message': 'unbound prefix: line 1, column 0'}, {'file': 'data/data_playground_task1/cleaned/dublin_core/dc_record_008.txt', 'exception_type': <class 'xml.etree.ElementTree.ParseError'>, 'exception_message': 'unbound prefix: line 2, column 0'}, {'file': 'data/data_playground_task1/cleaned/dublin_core/dc_record_009.txt', 'exception_type': <class 'xml.etree.ElementTree.ParseError'>, 'exception_message': 'unbound prefix: 
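Each entry in the output above records a file path, an exception type, and an exception message. As a rough illustration of how entries of that shape could be collected, the sketch below parses each file with `xml.etree.ElementTree` and keeps the failures. The helper name `find_unparseable` is hypothetical, and this is not necessarily how `utils.correctXML` builds its return value.

```python
import xml.etree.ElementTree as ET


def find_unparseable(paths):
    """Return one dict per file that ElementTree cannot parse."""
    failures = []
    for path in paths:
        try:
            ET.parse(str(path))
        except ET.ParseError as err:
            failures.append({
                "file": str(path),
                "exception_type": type(err),
                "exception_message": str(err),
            })
    return failures
```

An empty list from a check like this is what would justify reporting zero errored files after correction.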

In [None]:
more_df_se = pd.DataFrame.from_dict(still_incorrect)
new_file_col = df_se["file_path"].apply(lambda x: x.split("/")[-1])
more_df_se.insert(1, "file_name", new_file_col)
more_df_se.head()

| | file | file_name | exception_type | exception_message |
|---|---|---|---|---|
| 0 | data/data_playground_task1/cleaned/dublin_core... | dc_record_005.xml | Malformed XML | No closing tag found for outermost element. |
| 1 | data/data_playground_task1/cleaned/dublin_core... | dc_record_006.xml | Malformed XML | No closing tag found for outermost element. |
| 2 | data/data_playground_task1/cleaned/dublin_core... | dc_record_007.xml | Malformed XML | No closing tag found for outermost element. |
| 3 | data/data_playground_task1/cleaned/dublin_core... | dc_record_008.xml | Malformed XML | No closing tag found for outermost element. |
| 4 | data/data_playground_task1/cleaned/dublin_core... | dc_record_009.xml | Malformed XML | No closing tag found for outermost element. |


Great! We corrected all the Dublin Core XML metadata!

Update the report to show this.

In [None]:
updated = pd.concat([
    xml_report, 
    pd.DataFrame({
        "dimension_counted":"errored_files_after_auto_correction",
        "exception": "NA",
        "count":len(still_incorrect),
        "proportion_of_all_files":(len(still_incorrect)/total_dcxml_files)
    }, index=[xml_report.shape[0]])
])
updated

| | dimension_counted | exception | count | proportion_of_all_files |
|---|---|---|---|---|
| 0 | exception_type | `<class 'lxml.etree.XMLSyntaxError'>` | 43 | 40.19% |
| 1 | exception_subtype | Namespace prefix dc on title is not defined | 33 | 30.84% |
| 2 | exception_subtype | Namespace prefix rdf for about on Description ... | 7 | 6.54% |
| 3 | exception_subtype | xmlns:dc: Empty XML namespace is not allowed | 1 | 0.93% |
| 4 | exception_subtype | Namespace prefix rdf on Description is not def... | 1 | 0.93% |
| 5 | exception_subtype | xmlParseEntityRef: no name | 1 | 0.93% |
| 6 | total_files | | 107 | 100.00% |
| 7 | files_with_error | | 43 | 40.19% |
| 8 | errored_files_after_auto_correction | | 0 | 0.0 |
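In the table above, the new row's proportion appears as a raw float (`0.0`) while the existing rows appear as percentage strings. If a uniform column is wanted, one option is to format the new value the same way before appending it. The helper `format_proportion` below is a hypothetical name, and the sketch assumes the existing column holds strings with two decimal places.

```python
def format_proportion(count: int, total: int) -> str:
    """Format count/total as a percentage string with two decimals."""
    return f"{count / total:.2%}"


# Values taken from the report above: 43 errored files of 107 in total,
# and 0 errored files after automated correction
before = format_proportion(43, 107)
after = format_proportion(0, 107)
```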


In [None]:
metadata_standard = "dublin_core"
data_serialization = "xml"
report_type = "syntax_error_stats"
# Save the updated report (with the post-correction row), not the original xml_report
updated.to_csv(
    report_dir + f"{metadata_standard}_{data_serialization}_{report_type}.csv",
    index=False
)

Put a copy of all the initially correct files in the same `corrected` directory as the corrected files.

In [None]:
# Use a set so each membership test is a constant-time lookup
errored_set = set(errored_files)
correct_dc_files = [f for f in dublin_file_paths if f not in errored_set]
print("Files with correct syntax:", len(correct_dc_files), "of", len(dublin_file_paths))

In [None]:
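corrected_dir_name = "corrected"
for correct_dc in correct_dc_files:
    # Mirror the source path, swapping the "cleaned" directory for "corrected"
    new_path = Path(correct_dc.replace("cleaned", corrected_dir_name))
    # Create the target directory if it does not exist yet
    new_path.parent.mkdir(parents=True, exist_ok=True)
    # The with-statement closes both files, so no explicit close() is needed
    with open(correct_dc, "r") as src, open(new_path, "w") as dst:
        dst.write(src.read())
print(f"Copied the rest of the correct files into the {corrected_dir_name} directory!")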

Copied the rest of the correct files into the corrected directory!
