# Evaluation: Completeness

Part II of the computational evaluation of AI-generated linked data for [Linking Anthropology's Data and Archives (LADA)](https://ischool.umd.edu/projects/building-a-sustainable-future-for-anthropologys-archives-researching-primary-source-data-lifecycles-infrastructures-and-reuse/), focused on completeness (e.g., metadata fields are not empty or 'unknown').

---

**Table of Contents:**

I. [Data Loading](#data-loading)

II. [Completeness](#completeness)

  * [Content of Fields](#content-of-fields): check for emptiness and URL validity (open question: should the relevance of each URL also be checked here, or does that belong under conformance?)

    * [Dublin Core](#dublin-core)

    * [JSON-LD](#json-ld)

  * [Comparison to transcription???](#comparison-to-transcription)

---

## Data Loading

In [1]:
import utils
import config
import pandas as pd
import urllib
from urllib.request import Request, urlopen
import xml.etree.ElementTree as ET
import json
from pyld import jsonld
from lxml import etree
from pathlib import Path
import os
import re

Create variables to reference existing directories and files.

In [2]:
dublin_path = "cleaned/dublin_core/"  # XML data files
schema_path = "cleaned/schema_org/"   # JSON data files
cidoc_path = "cleaned/cidoc_crm/"     # JSON data files

dublin_t1_dir = config.task1_data+dublin_path
schema_t1_dir = config.task1_data+schema_path
cidoc_t1_dir = config.task1_data+cidoc_path

dublin_p1_dir = config.playgrd1_data+dublin_path
schema_p1_dir = config.playgrd1_data+schema_path
cidoc_p1_dir = config.playgrd1_data+cidoc_path

dublin_p3_dir = config.playgrd3_data+dublin_path
schema_p3_dir = config.playgrd3_data+schema_path
cidoc_p3_dir = config.playgrd3_data+cidoc_path

Create a directory to store the error reports in.

In [3]:
d = "completeness"
report_dir = f"data/error_reports/{d}/"
Path(report_dir).mkdir(parents=True, exist_ok=True)

Set up the request configuration used to check URL validity:

In [4]:
# Bypass any configured proxy (see https://docs.python.org/3/library/urllib.request.html)
os.environ["no_proxy"] = "*"
# Send a browser-like User-Agent to avoid HTTP 403 responses, as suggested here:
# https://www.reddit.com/r/learnpython/comments/1ea3r0z/how_to_avoid_http_error_403_forbidden/
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

## Content of Fields

### Dublin Core

In [5]:
# Read the TXT files so all generated metadata can be read, whether or not the XML is well-formed.
extension = ".txt"
dublin_file_paths = []
dublin_files_t1 = [f for f in os.listdir(dublin_t1_dir) if f.endswith(extension)]
dublin_file_paths += [dublin_t1_dir+f for f in dublin_files_t1]
dublin_files_p1 = [f for f in os.listdir(dublin_p1_dir) if f.endswith(extension)]
dublin_file_paths += [dublin_p1_dir+f for f in dublin_files_p1]
dublin_files_p3 = [f for f in os.listdir(dublin_p3_dir) if f.endswith(extension)]
dublin_file_paths += [dublin_p3_dir+f for f in dublin_files_p3]
dublin_file_paths.sort()
total_dc_files = len(dublin_file_paths)
print(f"Total Dublin Core {extension[1:].upper()} files:", total_dc_files)

Total Dublin Core TXT files: 107
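
The three near-identical listing steps above could be folded into one helper. The following is a minimal sketch, assuming the same directory layout; `list_files` is a hypothetical name introduced here, not a function from `utils`.

```python
from pathlib import Path

def list_files(directories, extension=".txt"):
    """Return the sorted paths of all files ending in `extension` in the given directories."""
    paths = []
    for directory in directories:
        # Path.glob only yields entries whose names end with the extension.
        paths += [str(p) for p in Path(directory).glob(f"*{extension}")]
    return sorted(paths)

# Hypothetical usage with the directory variables defined earlier:
# dublin_file_paths = list_files([dublin_t1_dir, dublin_p1_dir, dublin_p3_dir])
```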


#### Empty Fields

Check for empty metadata fields.

In [9]:
# Raw string, so that regex escapes such as \? are not read as (invalid) Python string escapes
empty = re.compile(r'(<[a-z]+:[a-z]+>|<[a-z=" ]+>)((unknown|none|na|""|\?|not specified|\n|)|[^<>]+(not specified|unknown))(</[a-z]+:[a-z]+>|</[a-z]+>)')


In [10]:
files_with_empty, empty_fields_per_file, fields_per_file = utils.findEmptyFields(empty, dublin_file_paths)

122 empty field(s) across 63 files found.
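
`utils.findEmptyFields` is not shown in this notebook. As a rough sketch of what a function with the same three return values might look like, the hypothetical `find_empty_fields` below works on in-memory strings rather than file paths. Lowercasing the text before matching is an assumption, made because the pattern only contains lowercase letters.

```python
import re

EMPTY = re.compile(
    r'(<[a-z]+:[a-z]+>|<[a-z=" ]+>)'
    r'((unknown|none|na|""|\?|not specified|\n|)|[^<>]+(not specified|unknown))'
    r'(</[a-z]+:[a-z]+>|</[a-z]+>)'
)

def find_empty_fields(pattern, records):
    """`records` maps a record name to its text.

    Returns (names with empty fields, empty-field count per record, matched fields per record).
    """
    names_with_empty, counts, fields = [], [], []
    for name, text in records.items():
        # Assumption: match against lowercased text.
        matches = [m[0] for m in pattern.finditer(text.lower())]
        fields.append(matches)
        counts.append(len(matches))
        if matches:
            names_with_empty.append(name)
    return names_with_empty, counts, fields

# Toy records (hypothetical file names and content)
records = {
    "a.txt": "<dc:title>Map</dc:title>\n<dc:creator>unknown</dc:creator>\n<dc:description>\n</dc:description>",
    "b.txt": "<dc:title>Letter</dc:title>\n<dc:rights>rights status unknown</dc:rights>",
    "c.txt": "<dc:title>Diary</dc:title>",
}
with_empty, counts, fields = find_empty_fields(EMPTY, records)
print(sum(counts), "empty field(s) across", len(with_empty), "files found.")
# 3 empty field(s) across 2 files found.
```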


Create a DataFrame with the empty fields data so we can review it as a table.  We'll sort the data so the files with the most empty fields appear at the top and the files without any empty fields appear at the bottom of the table.

In [11]:
df_empty = pd.DataFrame.from_dict({"file_path":dublin_file_paths, "empty_field_count":empty_fields_per_file, "fields":fields_per_file}).sort_values(by="empty_field_count", ascending=False)
df_empty.head()

Unnamed: 0,file_path,empty_field_count,fields
1,data/data_playground_task1/cleaned/dublin_core...,6,"[<dc:creator>unknown</dc:creator>, <dc:publish..."
78,data/data_playground_task3/cleaned/dublin_core...,5,"[<dc:creator>unknown</dc:creator>, <dc:publish..."
15,data/data_playground_task1/cleaned/dublin_core...,5,"[<dc:creator>unknown</dc:creator>, <dc:publish..."
7,data/data_playground_task1/cleaned/dublin_core...,5,"[<dc:description>\n</dc:description>, <dc:date..."
60,data/data_playground_task1/cleaned/dublin_core...,4,"[<dc:contributor>\n</dc:contributor>, <dc:desc..."


In [12]:
df_empty.tail()

Unnamed: 0,file_path,empty_field_count,fields
69,data/data_playground_task3/cleaned/dublin_core...,0,[]
70,data/data_playground_task3/cleaned/dublin_core...,0,[]
71,data/data_playground_task3/cleaned/dublin_core...,0,[]
72,data/data_playground_task3/cleaned/dublin_core...,0,[]
106,data/data_task1/cleaned/dublin_core/dc_record_...,0,[]


In [13]:
assert df_empty.shape[0] == len(dublin_file_paths), "The new DataFrame should have exactly one row per Dublin Core metadata record (per file)."

Create a report showing how many files have each number of empty fields.

In [14]:
empty_field_count_report = pd.DataFrame(df_empty.empty_field_count.value_counts()).rename(columns={"count":"file_count"})
empty_field_count_report

Unnamed: 0_level_0,file_count
empty_field_count,Unnamed: 1_level_1
0,44
1,36
2,11
4,7
3,5
5,3
6,1


"Explode" the DataFrame so that instead of having one row per file, it has one row per metadata field.  For this, we'll remove ("drop") all the files that don't have any empty fields.

In [15]:
df_empty_exploded = df_empty.loc[df_empty.empty_field_count > 0].drop(columns=["empty_field_count"]).explode("fields")
assert sum(empty_fields_per_file) == df_empty_exploded.shape[0], "There should be exactly one row per empty field."

In [16]:
df_empty_exploded.head()

Unnamed: 0,file_path,fields
1,data/data_playground_task1/cleaned/dublin_core...,<dc:creator>unknown</dc:creator>
1,data/data_playground_task1/cleaned/dublin_core...,<dc:publisher>unknown</dc:publisher>
1,data/data_playground_task1/cleaned/dublin_core...,<dc:contributor>unknown</dc:contributor>
1,data/data_playground_task1/cleaned/dublin_core...,<dc:date>unknown</dc:date>
1,data/data_playground_task1/cleaned/dublin_core...,<dc:relation>none</dc:relation>
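
For readers unfamiliar with `explode`, the filter-then-explode step can be pictured in plain Python. This is an illustrative sketch with made-up file names, not the notebook's data. Dropping files without empty fields first matters because pandas turns an empty list into a single `NaN` row when exploding.

```python
# One row per file, with a list of empty fields (toy data)
rows = [
    {"file_path": "a.txt", "fields": ["<dc:creator>unknown</dc:creator>", "<dc:date>unknown</dc:date>"]},
    {"file_path": "b.txt", "fields": []},
    {"file_path": "c.txt", "fields": ["<dc:relation>none</dc:relation>"]},
]

# One row per empty field: skip files with no empty fields, then repeat
# the file path once for each field in its list.
exploded = [
    {"file_path": row["file_path"], "fields": field}
    for row in rows if row["fields"]
    for field in row["fields"]
]

# Same sanity check as in the notebook: one row per empty field.
assert len(exploded) == sum(len(row["fields"]) for row in rows)
```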


In [17]:
fields = (list(df_empty_exploded.fields))
# Extract the tag name or attribute that indicates the Dublin Core field intended. If the tag
# is 'dc' and the metadata field is provided as an attribute, such as '<dc element="title">',
# then the extracted data will be 'title,' not 'dc.'
tags = [re.search('(?<=<)([a-z:]+)(?=>)|(?<=")[a-z]+(?=")', field)[0] for field in fields]
values_lists = [re.findall('>[^<]*<', field) for field in fields]
values = []
for v in values_lists:
    if len(v) > 0:
        values += [v[0][1:-1]]
    else:
        values += ['']
df_empty_exploded.insert(len(df_empty_exploded.columns), "tag_or_attribute", tags)
df_empty_exploded.insert(len(df_empty_exploded.columns), "empty_value", values)
df_empty_exploded.tail()

Unnamed: 0,file_path,fields,tag_or_attribute,empty_value
95,data/data_task1/cleaned/dublin_core/dc_record_...,<dc:description>\n</dc:description>,dc:description,\n
94,data/data_task1/cleaned/dublin_core/dc_record_...,<dc:description>\n</dc:description>,dc:description,\n
38,data/data_playground_task1/cleaned/dublin_core...,<dc:description>\n</dc:description>,dc:description,\n
33,data/data_playground_task1/cleaned/dublin_core...,<dc:description>\n</dc:description>,dc:description,\n
92,data/data_task1/cleaned/dublin_core/dc_record_...,<dcterms:creator>unknown</dcterms:creator>,dcterms:creator,unknown
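
To make the two extraction patterns easier to follow, here they are applied to single fields. `split_field` is a hypothetical helper name used only for this illustration.

```python
import re

TAG = re.compile(r'(?<=<)([a-z:]+)(?=>)|(?<=")[a-z]+(?=")')
VALUE = re.compile(r'>[^<]*<')

def split_field(field):
    """Return (tag or attribute name, text between the opening and closing tags)."""
    tag = TAG.search(field)[0]
    value = VALUE.search(field)
    # Strip the surrounding '>' and '<' from the matched value, if any.
    return tag, (value[0][1:-1] if value else "")

print(split_field("<dc:creator>unknown</dc:creator>"))   # ('dc:creator', 'unknown')
print(split_field('<dc element="title"></dc>'))          # ('title', '')
```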


In [26]:
tag_counts = pd.DataFrame(df_empty_exploded.tag_or_attribute.value_counts()).reset_index()
tag_counts

Unnamed: 0,tag_or_attribute,count
0,dc:description,35
1,dc:rights,18
2,dc:creator,17
3,dc:contributor,11
4,dc:publisher,10
5,dc:date,6
6,dc:relation,5
7,dcterms:creator,4
8,creator,2
9,dcterms:source,2


In [27]:
tag_values = list(tag_counts.tag_or_attribute)
tag_cats = []
for t in tag_values:
    if ":" in t:
        tag_cats += [t.split(":")[-1]]
    else:
        tag_cats += [t]
tag_counts.insert(1, "category", tag_cats)
tag_counts

Unnamed: 0,tag_or_attribute,category,count
0,dc:description,description,35
1,dc:rights,rights,18
2,dc:creator,creator,17
3,dc:contributor,contributor,11
4,dc:publisher,publisher,10
5,dc:date,date,6
6,dc:relation,relation,5
7,dcterms:creator,creator,4
8,creator,creator,2
9,dcterms:source,source,2


In [29]:
df_cats = tag_counts.groupby(["category"]).transform("sum")
df_cats.insert(0, "category", tag_counts.category)
df_cats = df_cats.drop(columns=["tag_or_attribute"]).drop_duplicates()
df_cats

Unnamed: 0,category,count
0,description,37
1,rights,18
2,creator,23
3,contributor,11
4,publisher,13
5,date,6
6,relation,5
9,source,3
11,coverage,2
12,identifier,2
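
The category totals can be cross-checked without pandas. The sketch below uses four of the tag counts shown earlier and sums them by the part of the tag after the namespace prefix. Because `split(":")[-1]` returns the whole string when there is no colon, no separate branch is needed for unprefixed tags.

```python
from collections import Counter

# A subset of the tag counts reported above
tag_counts = {"dc:creator": 17, "dcterms:creator": 4, "creator": 2, "dc:rights": 18}

category_counts = Counter()
for tag, count in tag_counts.items():
    category_counts[tag.split(":")[-1]] += count

print(category_counts.most_common())
# [('creator', 23), ('rights', 18)]
```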


In [30]:
df_values = pd.DataFrame(df_empty_exploded.empty_value.value_counts())
df_values

Unnamed: 0_level_0,count
empty_value,Unnamed: 1_level_1
unknown,61
\n,50
,6
none,2
rights status unknown,1
status unknown,1
copyright status not specified,1


Save the reports as CSV files.

In [31]:
metadata_standard = "dublin_core"
data_serialization = "xml"
report_type = "empty_field_counts"
df_empty.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=False
    )

In [32]:
metadata_standard = "dublin_core"
data_serialization = "xml"
report_type = "files_per_empty_field_count"
empty_field_count_report.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=True  # Keep the index: it holds the empty_field_count values
    )

In [33]:
metadata_standard = "dublin_core"
data_serialization = "xml"
report_type = "empty_fields_by_file"
df_empty_exploded.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=False
    )

In [34]:
metadata_standard = "dublin_core"
data_serialization = "xml"
report_type = "empty_field_tag_counts"
tag_counts.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=False
    )

In [35]:
metadata_standard = "dublin_core"
data_serialization = "xml"
report_type = "empty_field_tag_category_counts"
df_cats.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=False
    )

In [36]:
metadata_standard = "dublin_core"
data_serialization = "xml"
report_type = "empty_field_value_counts"
df_values.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=True  # Keep the index: it holds the empty_value strings
    )
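
The save cells above repeat the same file-name template. One way to reduce the repetition is a small path helper. `report_path` is a hypothetical name, sketched here under the assumption that the naming scheme stays `{standard}_{serialization}_{report_type}.csv`.

```python
from pathlib import Path

def report_path(report_dir, metadata_standard, data_serialization, report_type):
    """Build the CSV path for a report from its three name parts."""
    file_name = f"{metadata_standard}_{data_serialization}_{report_type}.csv"
    return Path(report_dir) / file_name

# Hypothetical usage:
# df_empty.to_csv(report_path(report_dir, "dublin_core", "xml", "empty_field_counts"), index=False)
```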

#### URLs

##### Namespace URLs
First, check that the namespace URLs are well-formed and that they exist.

In [6]:
url_pattern = re.compile('([a-z]+ns:[a-z]+|[a-z]+ns)=[^>]+( [^>])*(?=>)')

In [7]:
# Find all the URLs
files_with_urls, url_count_per_file, urls_per_file = [], [], []
for file_path in dublin_file_paths:
    with open(file_path, "r") as f:
        f_string = f.read().lower()
        
        # Look for URLs in the file
        has_urls = re.finditer(url_pattern, f_string)
        # Save the URLs in a list per file
        file_urls = []
        for match in has_urls:
            url = match[0]
            if " " in url:
                multiple = url.split(" ")
                file_urls = file_urls + multiple
                # print(file_urls)
            else:
                file_urls += [url]
        urls_per_file += [file_urls]
        url_count_per_file += [len(file_urls)]
        
        if len(file_urls) > 0:
            # Save the file path to the XML version of the file
            # (str.replace returns a new string, so its result must be stored)
            files_with_urls += [file_path.replace(".txt", ".xml")]

print(sum(url_count_per_file), "URLs found in", len(files_with_urls), "files.")

98 URLs found in 70 files.
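
Lowercasing the whole file also lowercases the URLs it contains (for example, `xmlschema-instance` in the output below), and URL paths are case-sensitive, so a lowercased URL may not resolve even when the original does. One possible alternative, sketched here as an assumption rather than the notebook's method, is to leave the text unchanged and compile the pattern with `re.IGNORECASE`. The sample record is made up for illustration.

```python
import re

url_pattern = re.compile(r'([a-z]+ns:[a-z]+|[a-z]+ns)=[^>]+( [^>])*(?=>)', re.IGNORECASE)

record = ('<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" '
          'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">')

# As in the loop above, one match can span several declarations, so split on spaces.
declarations = [part for match in url_pattern.finditer(record) for part in match[0].split(" ")]
print(declarations)
```

With this approach, the later validation patterns would need the same flag, since their character classes list only lowercase letters.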


In [8]:
url_df = pd.DataFrame.from_dict({"file_path":dublin_file_paths, "url_count":url_count_per_file, "urls":urls_per_file}).sort_values(by="url_count", ascending=False)
url_df.head()

Unnamed: 0,file_path,url_count,urls
49,data/data_playground_task1/cleaned/dublin_core...,4,"[xmlns:dc=""http://purl.org/dc/elements/1.1/"", ..."
47,data/data_playground_task1/cleaned/dublin_core...,4,"[xmlns:dc=""http://purl.org/dc/elements/1.1/"", ..."
27,data/data_playground_task1/cleaned/dublin_core...,4,"[xmlns:dc=""http://purl.org/dc/elements/1.1/"", ..."
13,data/data_playground_task1/cleaned/dublin_core...,4,"[xmlns:dc=""http://purl.org/dc/elements/1.1/"", ..."
58,data/data_playground_task1/cleaned/dublin_core...,4,"[xmlns:dc=""http://purl.org/dc/elements/1.1/"", ..."


In [9]:
url_df = url_df.loc[url_df["url_count"] > 0]  # Keep only files with URLs
url_df_exploded = url_df.explode("urls").drop(columns=["url_count"])
url_df_exploded.head()

Unnamed: 0,file_path,urls
49,data/data_playground_task1/cleaned/dublin_core...,"xmlns:dc=""http://purl.org/dc/elements/1.1/"""
49,data/data_playground_task1/cleaned/dublin_core...,"xmlns:xsi=""http://www.w3.org/2001/xmlschema-in..."
49,data/data_playground_task1/cleaned/dublin_core...,"xsi:schemalocation=""http://purl.org/dc/element..."
49,data/data_playground_task1/cleaned/dublin_core...,http://dublincore.org/schemas/xmls/simpledc200...
47,data/data_playground_task1/cleaned/dublin_core...,"xmlns:dc=""http://purl.org/dc/elements/1.1/"""


In [10]:
urls = list(url_df_exploded.urls)
print(urls[:10])

['xmlns:dc="http://purl.org/dc/elements/1.1/"', 'xmlns:xsi="http://www.w3.org/2001/xmlschema-instance"', 'xsi:schemalocation="http://purl.org/dc/elements/1.1/', 'http://dublincore.org/schemas/xmls/simpledc20021212.xsd"', 'xmlns:dc="http://purl.org/dc/elements/1.1/"', 'xmlns:xsi="http://www.w3.org/2001/xmlschema-instance"', 'xsi:schemalocation="http://purl.org/dc/elements/1.1/', 'http://dublincore.org/schemas/xmls/simpledc20021212.xsd"', 'xmlns:dc="http://purl.org/dc/elements/1.1/"', 'xmlns:xsi="http://www.w3.org/2001/xmlschema-instance"']


Check that each URL is preceded by a namespace prefix and enclosed in quotes (i.e., `xmlns:dc="[URL_GOES_HERE]"`); otherwise, the URL was incorrectly included in the metadata record.

In [11]:
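# Raw strings, so that \- is passed to the regex engine instead of being read as a Python string escape
correct_namespace = r'([a-z]+ns:[a-z]+|[a-z]+ns)="https?://[a-z0-9\-._~:/?#@!$&\'()*+,;=%]+"'
correct_url = r'https?://[a-z0-9\-._~:/?#@!$&\'()*+,;=%]+'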


In [12]:
valid_namespace, valid_url = [], []
for url in urls:
    if re.match(correct_namespace, url):
        valid_namespace += [True]
    else:
        valid_namespace += [False]
    
    if re.search(correct_url, url):
        valid_url += [True]
    else:
        valid_url += [False]

url_df_exploded.insert(len(url_df_exploded.columns), "valid_namespace_format", valid_namespace)
url_df_exploded.insert(len(url_df_exploded.columns), "valid_url_format", valid_url)
url_df_exploded.tail()

Unnamed: 0,file_path,urls,valid_namespace_format,valid_url_format
32,data/data_playground_task1/cleaned/dublin_core...,"xmlns=""http://purl.org/dc/elements/1.1/""",True,True
33,data/data_playground_task1/cleaned/dublin_core...,"xmlns=""http://purl.org/dc/elements/1.1/""",True,True
88,data/data_task1/cleaned/dublin_core/dc_record_...,"xmlns:dc=""http://purl.org/dc/elements/1.1/""",True,True
92,data/data_task1/cleaned/dublin_core/dc_record_...,"xmlns:dcterms=""http://purl.org/dc/terms/""",True,True
90,data/data_task1/cleaned/dublin_core/dc_record_...,"xmlns:dcterms=""http://purl.org/dc/terms/""",True,True
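
To see which kinds of strings pass the namespace check, the pattern can be tried on four of the extracted strings listed earlier. This is only an illustration of the pattern's behavior. Note that `xsi:schemaLocation` is an attribute holding whitespace-separated pairs of namespace URI and schema location rather than a namespace declaration, so its two pieces fail the pattern; rows flagged `False` for this reason may deserve separate review rather than being counted as malformed.

```python
import re

correct_namespace = re.compile(
    r'([a-z]+ns:[a-z]+|[a-z]+ns)="https?://[a-z0-9\-._~:/?#@!$&\'()*+,;=%]+"'
)

samples = [
    'xmlns:dc="http://purl.org/dc/elements/1.1/"',                # prefixed declaration
    'xmlns="http://purl.org/dc/elements/1.1/"',                   # default namespace
    'xsi:schemalocation="http://purl.org/dc/elements/1.1/',       # first half of a schemaLocation pair
    'http://dublincore.org/schemas/xmls/simpledc20021212.xsd"',   # second half of the pair
]
results = {sample: bool(correct_namespace.match(sample)) for sample in samples}
for sample, is_valid in results.items():
    print(is_valid, sample)
```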


In [13]:
total_urls = url_df_exploded.shape[0]
print("Total URLs:", total_urls)

Total URLs: 98


In [14]:
url_status = pd.DataFrame(url_df_exploded.valid_namespace_format.value_counts()).rename(columns={"count":"total_urls"})
proportions = (url_status[["total_urls"]]/total_urls).values
percentages = [f"{proportion[0]*100:.2f}%" for proportion in proportions]
url_status.insert(len(url_status.columns), "proportion_of_urls", percentages)
url_status

Unnamed: 0_level_0,total_urls,proportion_of_urls
valid_namespace_format,Unnamed: 1_level_1,Unnamed: 2_level_1
True,83,84.69%
False,15,15.31%


In [15]:
url_status2 = pd.DataFrame(url_df_exploded.valid_url_format.value_counts()).rename(columns={"count":"total_urls"})
proportions = (url_status2[["total_urls"]]/total_urls).values
percentages = [f"{proportion[0]*100:.2f}%" for proportion in proportions]
url_status2.insert(len(url_status2.columns), "proportion_of_urls", percentages)
url_status2

Unnamed: 0_level_0,total_urls,proportion_of_urls
valid_url_format,Unnamed: 1_level_1,Unnamed: 2_level_1
True,98,100.00%


In [16]:
file_url_status = url_df_exploded.drop(columns=["urls"]).drop_duplicates()
file_url_status = pd.DataFrame(file_url_status.valid_namespace_format.value_counts()).rename(columns={"count":"file_count"})
df_url_status = url_status.join(file_url_status)
proportions = (df_url_status[["file_count"]]/total_dc_files).values
percentages = [f"{proportion[0]*100:.2f}%" for proportion in proportions]
df_url_status.insert(len(df_url_status.columns), "proportion_of_files", percentages)
df_url_status

Unnamed: 0_level_0,total_urls,proportion_of_urls,file_count,proportion_of_files
valid_namespace_format,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
True,83,84.69%,69,64.49%
False,15,15.31%,8,7.48%
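
The percentages in the table can be reproduced by hand. The URL proportions are relative to the 98 extracted URLs, while the file proportions are relative to all 107 Dublin Core files, not only the 70 files that contain URLs. The file counts (69 + 8 = 77) exceed 70, presumably because a file with both valid and invalid declarations is counted once in each row.

```python
def percent(part, whole):
    """Format part/whole as a percentage with two decimals."""
    return f"{part / whole * 100:.2f}%"

total_urls, total_files = 98, 107
print(percent(83, total_urls), percent(15, total_urls))    # 84.69% 15.31%
print(percent(69, total_files), percent(8, total_files))   # 64.49% 7.48%
```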


In [17]:
file_url_status2 = url_df_exploded.drop(columns=["urls"]).drop_duplicates()
file_url_status2 = pd.DataFrame(file_url_status2.valid_url_format.value_counts()).rename(columns={"count":"file_count"})
df_url_status2 = url_status2.join(file_url_status2)
proportions = (df_url_status2[["file_count"]]/total_dc_files).values
percentages = [f"{proportion[0]*100:.2f}%" for proportion in proportions]
df_url_status2.insert(len(df_url_status2.columns), "proportion_of_files", percentages)
df_url_status2

Unnamed: 0_level_0,total_urls,proportion_of_urls,file_count,proportion_of_files
valid_url_format,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
True,98,100.00%,77,71.96%


Extract each URL, even if it is not validly formatted within the metadata record, and check whether it resolves.

In [18]:
request_errors = []
for url in urls:
    clean = re.findall(r'https?://[^>"]+', url)  # Raw string; forward slashes need no escaping
    if len(clean) > 0:
        clean_url = clean[0]
        clean_url = clean_url.strip('"')
        clean_url = clean_url.strip(' ')
        try:
            url_request = urllib.request.Request(clean_url, headers=headers)
            html = urllib.request.urlopen(url_request, timeout=5).read()
            request_errors += ["No error"]  # Indicates a valid URL (though a manual check is needed to make sure it's a relevant URL)
        except Exception as e:
            request_errors += [str(e)]
    else:
        request_errors += ["Invalid format (no request made)"]
print("Finished requests!")


Finished requests!


In [19]:
url_df_exploded.insert(len(url_df_exploded.columns), "request_error", request_errors)
url_df_exploded.tail()

Unnamed: 0,file_path,urls,valid_namespace_format,valid_url_format,request_error
32,data/data_playground_task1/cleaned/dublin_core...,"xmlns=""http://purl.org/dc/elements/1.1/""",True,True,No error
33,data/data_playground_task1/cleaned/dublin_core...,"xmlns=""http://purl.org/dc/elements/1.1/""",True,True,No error
88,data/data_task1/cleaned/dublin_core/dc_record_...,"xmlns:dc=""http://purl.org/dc/elements/1.1/""",True,True,No error
92,data/data_task1/cleaned/dublin_core/dc_record_...,"xmlns:dcterms=""http://purl.org/dc/terms/""",True,True,No error
90,data/data_task1/cleaned/dublin_core/dc_record_...,"xmlns:dcterms=""http://purl.org/dc/terms/""",True,True,No error


In [20]:
url_df_exploded.request_error.value_counts()

request_error
No error                            88
HTTP Error 300: Multiple Choices     9
HTTP Error 404: Not Found            1
Name: count, dtype: int64
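
The request step could also be wrapped in a function that distinguishes error types. The sketch below is one possible shape, not the notebook's implementation: `check_url` and its `opener` parameter are hypothetical names, and the `opener` argument exists only so the logic can be exercised without network access. As the counts above show, `urllib` reports a 300 response as an exception rather than following it.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def check_url(url, headers=None, timeout=5, opener=urlopen):
    """Return "No error" if the URL can be fetched, otherwise a short error description."""
    try:
        request = Request(url.strip(' "'), headers=headers or {})
        with opener(request, timeout=timeout) as response:
            response.read()
        return "No error"
    except HTTPError as error:               # the server answered with an error status
        return f"HTTP Error {error.code}: {error.reason}"
    except URLError as error:                # e.g. DNS failure or refused connection
        return f"URL error: {error.reason}"
    except (OSError, ValueError) as error:   # e.g. a timeout while reading, or a malformed URL
        return f"{type(error).__name__}: {error}"

# Hypothetical usage with the headers defined earlier:
# request_errors = [check_url(url, headers=headers) for url in clean_urls]
```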

In [21]:
url_errors_df1 = url_df_exploded.loc[url_df_exploded.valid_namespace_format == False]
url_errors_df2 = url_df_exploded.loc[url_df_exploded.valid_url_format == False]
url_errors_df3 = url_df_exploded.loc[url_df_exploded.request_error == "HTTP Error 300: Multiple Choices"]
url_errors_df4 = url_df_exploded.loc[url_df_exploded.request_error == "HTTP Error 404: Not Found"]
url_errors_df = pd.concat([url_errors_df1, url_errors_df2, url_errors_df3, url_errors_df4])
url_errors_df.shape

(25, 5)

In [22]:
url_errors_df

Unnamed: 0,file_path,urls,valid_namespace_format,valid_url_format,request_error
49,data/data_playground_task1/cleaned/dublin_core...,"xsi:schemalocation=""http://purl.org/dc/element...",False,True,No error
49,data/data_playground_task1/cleaned/dublin_core...,http://dublincore.org/schemas/xmls/simpledc200...,False,True,No error
47,data/data_playground_task1/cleaned/dublin_core...,"xsi:schemalocation=""http://purl.org/dc/element...",False,True,No error
47,data/data_playground_task1/cleaned/dublin_core...,http://dublincore.org/schemas/xmls/simpledc200...,False,True,No error
27,data/data_playground_task1/cleaned/dublin_core...,"xsi:schemalocation=""http://purl.org/dc/element...",False,True,No error
27,data/data_playground_task1/cleaned/dublin_core...,http://dublincore.org/schemas/xmls/qdc/2008/02...,False,True,No error
13,data/data_playground_task1/cleaned/dublin_core...,"xsi:schemalocation=""http://purl.org/dc/element...",False,True,No error
13,data/data_playground_task1/cleaned/dublin_core...,http://dublincore.org/schemas/xmls/simpledc200...,False,True,No error
58,data/data_playground_task1/cleaned/dublin_core...,"xsi:schemalocation=""http://purl.org/dc/element...",False,True,No error
58,data/data_playground_task1/cleaned/dublin_core...,http://dublincore.org/schemas/xmls/simpledc200...,False,True,No error


In [23]:
validity_by_url = pd.DataFrame(url_df_exploded.request_error.value_counts()).rename(columns={"count":"url_count"})
validity_by_file = pd.DataFrame(url_df_exploded.drop(columns=["urls", "valid_namespace_format", "valid_url_format"]).drop_duplicates().request_error.value_counts()).rename(columns={"count":"file_count"})
validity_stats = validity_by_url.join(validity_by_file, how="outer").reset_index()
validity_stats = validity_stats.rename(columns={"request_error":"url_error_type"})

In [24]:
invalid_ref_urls = url_errors_df.loc[url_errors_df.valid_namespace_format == False].shape[0]
invalid_ref_files = url_errors_df.drop(columns=["urls"]).drop_duplicates()
invalid_ref_files = invalid_ref_files.loc[invalid_ref_files.valid_namespace_format == False].shape[0]
invalid_ref_df = pd.DataFrame({"url_error_type":["Invalid reference to namespace"], "url_count":[invalid_ref_urls], "file_count":[invalid_ref_files]})
validity_stats = pd.concat([validity_stats, invalid_ref_df], ignore_index=True)

invalid_ref_urls = url_errors_df.loc[url_errors_df.valid_url_format == False].shape[0]
invalid_ref_files = url_errors_df.drop(columns=["urls"]).drop_duplicates()
invalid_ref_files = invalid_ref_files.loc[invalid_ref_files.valid_url_format == False].shape[0]
invalid_ref_df = pd.DataFrame({"url_error_type":["Invalid URL format"], "url_count":[invalid_ref_urls], "file_count":[invalid_ref_files]})
validity_stats = pd.concat([validity_stats, invalid_ref_df], ignore_index=True)

validity_stats

Unnamed: 0,url_error_type,url_count,file_count
0,HTTP Error 300: Multiple Choices,9,9
1,HTTP Error 404: Not Found,1,1
2,No error,88,70
3,Invalid reference to namespace,15,8
4,Invalid URL format,0,0


##### All URLs
Next, extract all URLs included in the data, whether or not they're provided as a namespace.

In [25]:
url_pattern = r'https?://[a-z0-9\-._~:/?#@!$&\'()*+,;=%]+'  # Raw string, so \- reaches the regex engine unchanged


In [26]:
# Find all the URLs
files_with_urls, url_count_per_file, urls_per_file = [], [], []
for file_path in dublin_file_paths:
    with open(file_path, "r") as f:
        f_string = f.read().lower()
        
        # Look for URLs in the file
        has_urls = re.finditer(url_pattern, f_string)
        # Save the URLs in a list per file
        file_urls = []
        for match in has_urls:
            url = match[0]
            if " " in url:
                multiple = url.split(" ")
                file_urls = file_urls + multiple
                # print(file_urls)
            else:
                file_urls += [url]
        urls_per_file += [file_urls]
        url_count_per_file += [len(file_urls)]
        
        if len(file_urls) > 0:
            # Save the file path to the XML version of the file
            # (str.replace returns a new string, so its result must be stored)
            files_with_urls += [file_path.replace(".txt", ".xml")]

print(sum(url_count_per_file), "URLs found in", len(files_with_urls), "files.")

129 URLs found in 80 files.


In [27]:
all_url_df = pd.DataFrame.from_dict({"file_path":dublin_file_paths, "url_count":url_count_per_file, "urls":urls_per_file}).sort_values(by="url_count", ascending=False)
all_url_df.head()

Unnamed: 0,file_path,url_count,urls
49,data/data_playground_task1/cleaned/dublin_core...,4,"[http://purl.org/dc/elements/1.1/, http://www...."
48,data/data_playground_task1/cleaned/dublin_core...,4,"[http://purl.org/dc/elements/1.1/, http://www...."
72,data/data_playground_task3/cleaned/dublin_core...,4,"[http://purl.org/dc/elements/1.1/, https://hdl..."
58,data/data_playground_task1/cleaned/dublin_core...,4,"[http://purl.org/dc/elements/1.1/, http://www...."
47,data/data_playground_task1/cleaned/dublin_core...,4,"[http://purl.org/dc/elements/1.1/, http://www...."


In [28]:
all_url_df = all_url_df.loc[all_url_df["url_count"] > 0]  # Keep only files with URLs
all_url_df_exploded = all_url_df.explode("urls").drop(columns=["url_count"]).rename(columns={"urls":"url"})
all_url_df_exploded.head()

Unnamed: 0,file_path,url
49,data/data_playground_task1/cleaned/dublin_core...,http://purl.org/dc/elements/1.1/
49,data/data_playground_task1/cleaned/dublin_core...,http://www.w3.org/2001/xmlschema-instance
49,data/data_playground_task1/cleaned/dublin_core...,http://purl.org/dc/elements/1.1/
49,data/data_playground_task1/cleaned/dublin_core...,http://dublincore.org/schemas/xmls/simpledc200...
48,data/data_playground_task1/cleaned/dublin_core...,http://purl.org/dc/elements/1.1/


See if any new URLs that aren't namespaces (or intended to be namespaces) were found.

In [29]:
# Add a column to the namespace URL DataFrame that holds only the URL, so every string begins with http (as in the all-URLs DataFrame).
urls = list(url_df_exploded["urls"])
clean_urls = []
for url in urls:
    if "=" in url:
        clean_urls += [url.split('="')[-1].strip('"')]
    else:
        clean_urls += [url.strip('"')]
url_df_exploded.insert(2, "clean_url", clean_urls)
url_df_exploded.head()

Unnamed: 0,file_path,urls,clean_url,valid_namespace_format,valid_url_format,request_error
49,data/data_playground_task1/cleaned/dublin_core...,"xmlns:dc=""http://purl.org/dc/elements/1.1/""",http://purl.org/dc/elements/1.1/,True,True,No error
49,data/data_playground_task1/cleaned/dublin_core...,"xmlns:xsi=""http://www.w3.org/2001/xmlschema-in...",http://www.w3.org/2001/xmlschema-instance,True,True,HTTP Error 300: Multiple Choices
49,data/data_playground_task1/cleaned/dublin_core...,"xsi:schemalocation=""http://purl.org/dc/element...",http://purl.org/dc/elements/1.1/,False,True,No error
49,data/data_playground_task1/cleaned/dublin_core...,http://dublincore.org/schemas/xmls/simpledc200...,http://dublincore.org/schemas/xmls/simpledc200...,False,True,No error
47,data/data_playground_task1/cleaned/dublin_core...,"xmlns:dc=""http://purl.org/dc/elements/1.1/""",http://purl.org/dc/elements/1.1/,True,True,No error


In [30]:
# Then compare the pairs of files and cleaned URLs to the newly extracted URL-file pairs by combining the two DataFrames, removing duplicates, and counting what's left
sub_url_df = url_df_exploded[["file_path", "clean_url"]]
urls = sub_url_df.join(all_url_df_exploded.set_index("file_path"), on="file_path", how="outer")
urls.head()

Unnamed: 0,file_path,clean_url,url
4.0,data/data_playground_task1/cleaned/dublin_core...,http://purl.org/dc/elements/1.1/,http://purl.org/dc/elements/1.1/
7.0,data/data_playground_task1/cleaned/dublin_core...,http://purl.org/dc/elements/1.1/,http://purl.org/dc/elements/1.1/
9.0,data/data_playground_task1/cleaned/dublin_core...,http://purl.org/dc/elements/1.1/,http://purl.org/dc/elements/1.1/
10.0,data/data_playground_task1/cleaned/dublin_core...,http://purl.org/dc/elements/1.1/,http://purl.org/dc/elements/1.1/
13.0,data/data_playground_task1/cleaned/dublin_core...,http://purl.org/dc/elements/1.1/,http://purl.org/dc/elements/1.1/


In [31]:
print(urls.loc[urls.clean_url.isna()].shape)
print(urls.loc[urls.url.isna()].shape)

(12, 3)
(0, 3)
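
Because the join is keyed on `file_path` alone, it pairs every namespace URL with every extracted URL from the same file, and rows with a missing `clean_url` only arise for files that have no namespace URLs at all. A more direct comparison, sketched here with plain sets and hypothetical file names, treats each (file, URL) pair as one item. It would also surface non-namespace URLs in files that do declare namespaces.

```python
# (file, URL) pairs found by the namespace pattern (toy data)
namespace_pairs = {
    ("rec1.txt", "http://purl.org/dc/elements/1.1/"),
    ("rec2.txt", "http://purl.org/dc/terms/"),
}
# (file, URL) pairs found by the general URL pattern (toy data)
all_pairs = namespace_pairs | {("rec2.txt", "https://library.unm.edu/cswr/index.php")}

only_in_all = sorted(all_pairs - namespace_pairs)         # newly found, non-namespace URLs
only_in_namespaces = sorted(namespace_pairs - all_pairs)  # should be empty

print(only_in_all)
print(only_in_namespaces)
```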


Look at the 12 newly found URLs (i.e., URLs that aren't namespaces).

In [32]:
non_ns_urls = urls.loc[urls.clean_url.isna()]
non_ns_urls

Unnamed: 0,file_path,clean_url,url
,data/data_playground_task1/cleaned/dublin_core...,,http://example.org/resource/resource-developme...
,data/data_playground_task1/cleaned/dublin_core...,,http://www.4-h.org/about/global-network/
,data/data_playground_task1/cleaned/dublin_core...,,http://www.aipt.org/news-releases/1993-01-13
,data/data_playground_task3/cleaned/dublin_core...,,https://library.unm.edu/cswr/index.php
,data/data_playground_task3/cleaned/dublin_core...,,https://nmdc.unm.edu/digital/collection/fapecft
,data/data_task1/cleaned/dublin_core/dc_record_...,,http://example.org/james_abbott_records
,data/data_task1/cleaned/dublin_core/dc_record_...,,http://example.org/james_abbott_records
,data/data_task1/cleaned/dublin_core/dc_record_...,,https://example.org/maps/soil-los-angeles-1916
,data/data_task1/cleaned/dublin_core/dc_record_...,,https://www.davidrumsey.com/luna/servlet/detai...
,data/data_task1/cleaned/dublin_core/dc_record_...,,https://example.org/images/map-of-silicon-vall...


Confirm that all URLs except the 12 above were already found as namespaces.

In [33]:
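# Compare (file, URL) pairs as sets. A row-by-row comparison of the joined table would not work,
# because the join on file_path pairs every namespace URL with every URL found in the same file.
ns_pairs = set(zip(url_df_exploded.file_path, url_df_exploded.clean_url))
all_pairs = set(zip(all_url_df_exploded.file_path, all_url_df_exploded.url))
new_pairs = set(zip(non_ns_urls.file_path, non_ns_urls.url))
unmatched = all_pairs - ns_pairs - new_pairs
assert not unmatched, f"{len(unmatched)} URL(s) are neither namespaces nor among the newly found URLs."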

Check whether each of the newly found URLs is a valid URL.

In [34]:
request_errors = []
non_ns_url_list = list(non_ns_urls.url)
for url in non_ns_url_list:
    try:
        url_request = Request(url.strip(), headers=headers)
        html = urlopen(url_request, timeout=10).read()
        request_errors += ["No error"]  # Indicates a valid URL (though a manual check is needed to make sure it's a relevant URL)
    except Exception as e:
        request_errors += [str(e)]
print("Finished requests!")

Finished requests!


In [35]:
non_ns_urls = non_ns_urls.drop(columns=["clean_url"])
non_ns_urls.insert(len(non_ns_urls.columns), "request_error", request_errors)
non_ns_urls

Unnamed: 0,file_path,url,request_error
,data/data_playground_task1/cleaned/dublin_core...,http://example.org/resource/resource-developme...,HTTP Error 404: Not Found
,data/data_playground_task1/cleaned/dublin_core...,http://www.4-h.org/about/global-network/,No error
,data/data_playground_task1/cleaned/dublin_core...,http://www.aipt.org/news-releases/1993-01-13,HTTP Error 404: Not Found
,data/data_playground_task3/cleaned/dublin_core...,https://library.unm.edu/cswr/index.php,No error
,data/data_playground_task3/cleaned/dublin_core...,https://nmdc.unm.edu/digital/collection/fapecft,No error
,data/data_task1/cleaned/dublin_core/dc_record_...,http://example.org/james_abbott_records,HTTP Error 404: Not Found
,data/data_task1/cleaned/dublin_core/dc_record_...,http://example.org/james_abbott_records,HTTP Error 404: Not Found
,data/data_task1/cleaned/dublin_core/dc_record_...,https://example.org/maps/soil-los-angeles-1916,HTTP Error 404: Not Found
,data/data_task1/cleaned/dublin_core/dc_record_...,https://www.davidrumsey.com/luna/servlet/detai...,No error
,data/data_task1/cleaned/dublin_core/dc_record_...,https://example.org/images/map-of-silicon-vall...,HTTP Error 404: Not Found


In [36]:
non_ns_urls_stats = pd.DataFrame(non_ns_urls.request_error.value_counts())
non_ns_urls_stats

Unnamed: 0_level_0,count
request_error,Unnamed: 1_level_1
HTTP Error 404: Not Found,6
No error,6


Save the reports as CSV files.

In [37]:
metadata_standard = "dublin_core"
data_serialization = "xml"
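
Every report in this notebook is saved under a file name built from the same three parts, so the path construction could be collected in one place. The helper below is a hypothetical sketch (`report_path` is not defined in this notebook or in `utils`, as far as this notebook shows).

```python
from pathlib import Path


def report_path(report_dir, metadata_standard, data_serialization, report_type):
    """Build the CSV path for a report (hypothetical helper for illustration)."""
    file_name = f"{metadata_standard}_{data_serialization}_{report_type}.csv"
    # as_posix() keeps forward slashes regardless of the operating system
    return (Path(report_dir) / file_name).as_posix()


print(report_path("data/error_reports/completeness/", "dublin_core", "xml", "namespace_url_counts"))
```

A call such as `url_df.to_csv(report_path(report_dir, metadata_standard, data_serialization, report_type), index=False)` would then replace the repeated `.format(...)` expression.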

In [38]:
report_type = "namespace_url_counts"
url_df.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=False
    )

In [39]:
report_type = "namespace_url_validity_counts"
df_url_status.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=False
    )

In [40]:
report_type = "namespace_url_errors"
url_df_exploded.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=False
    )

In [41]:
report_type = "namespace_url_errors_stats"
validity_stats.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=False
    )

In [42]:
report_type = "non-namespace_url_errors"
non_ns_urls.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=False
    )

In [43]:
report_type = "non-namespace_url_errors_stats"
non_ns_urls_stats.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=True  # Keep the request_error labels, which are stored in the index
    )

### JSON-LD

In [44]:
extension = ".txt" #".json"
cidoc_file_paths = []
cidoc_files_t1 = [f for f in os.listdir(cidoc_t1_dir) if f.endswith(extension)]
cidoc_file_paths += [cidoc_t1_dir+f for f in cidoc_files_t1]
cidoc_files_p1 = [f for f in os.listdir(cidoc_p1_dir) if f.endswith(extension)]
cidoc_file_paths += [cidoc_p1_dir+f for f in cidoc_files_p1]
cidoc_files_p3 = [f for f in os.listdir(cidoc_p3_dir) if f.endswith(extension)]
cidoc_file_paths += [cidoc_p3_dir+f for f in cidoc_files_p3]
cidoc_file_paths.sort()
print("Total CIDOC-CRM JSON files:", len(cidoc_file_paths))

Total CIDOC-CRM JSON files: 97
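
The same listing steps are repeated for each data directory and each model. As a sketch of how they could be shared, the hypothetical helper below (`list_record_files` is an illustrative name, not part of `utils`) collects the matching files from any number of directories.

```python
from pathlib import Path


def list_record_files(directories, extension=".txt"):
    """Return the sorted paths of all files ending in `extension`."""
    paths = []
    for directory in directories:
        paths += [
            str(p) for p in Path(directory).iterdir()
            if p.is_file() and p.name.endswith(extension)
        ]
    return sorted(paths)
```

Under these assumptions, `list_record_files([cidoc_t1_dir, cidoc_p1_dir, cidoc_p3_dir])` should return the same files as the cell above.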


In [45]:
cidoc_file_paths[0]

'data/data_playground_task1/cleaned/cidoc_crm/cidoccrm_record_003.txt'

In [46]:
extension = ".txt" #".json"
schema_file_paths = []
schema_files_t1 = os.listdir(schema_t1_dir)
schema_file_paths += [schema_t1_dir+f for f in schema_files_t1 if f.endswith(extension)]
schema_files_p1 = os.listdir(schema_p1_dir)
schema_file_paths += [schema_p1_dir+f for f in schema_files_p1 if f.endswith(extension)]
schema_files_p3 = os.listdir(schema_p3_dir)
schema_file_paths += [schema_p3_dir+f for f in schema_files_p3 if f.endswith(extension)]
schema_file_paths.sort()
print("Total Schema.org JSON files:", len(schema_file_paths))

Total Schema.org JSON files: 116


In [47]:
schema_file_paths[0]

'data/data_playground_task1/cleaned/schema_org/sdo_record_003.txt'

In [48]:
json_file_paths = cidoc_file_paths + schema_file_paths
total_json_files = len(json_file_paths)
print(len(json_file_paths))

213


#### Content of Fields
Check for empty metadata fields.

In [9]:
# field_values = re.compile(r'((?<=:)\s*)"[^"]+"')
empty = re.compile(r'("[^"]+":\s?)(("(unknown|none|na|\?|not specified)")|"")')


The empty fields are collected in separate DataFrames for Schema.org and CIDOC-CRM, then merged, with the model recorded in its own column.

First find the empty fields for the metadata records in CIDOC-CRM JSON-LD.

In [10]:
files_with_empty, empty_fields_per_file, fields_per_file = utils.findEmptyFields(empty, cidoc_file_paths)

24 empty field(s) across 16 files found.
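
The helper `utils.findEmptyFields` is defined outside this notebook, so its implementation isn't shown here. The sketch below is a guess at what such a helper might do, based only on its arguments and the three values it returns. The name `find_empty_fields_sketch`, the lowercasing step, and the UTF-8 encoding are all assumptions.

```python
import re

# Same pattern as `empty` above, written as a raw string
EMPTY = re.compile(r'("[^"]+":\s?)(("(unknown|none|na|\?|not specified)")|"")')


def find_empty_fields_sketch(pattern, file_paths):
    """Hypothetical stand-in for utils.findEmptyFields (an assumption)."""
    files_with_empty, empty_fields_per_file, fields_per_file = [], [], []
    for file_path in file_paths:
        with open(file_path, "r", encoding="utf-8") as f:
            # Lowercase the text so that "Unknown" also matches (assumed)
            text = f.read().lower()
        # group(0) is the whole `"field": "value"` match
        fields = [match.group(0) for match in pattern.finditer(text)]
        fields_per_file.append(fields)
        empty_fields_per_file.append(len(fields))
        if fields:
            files_with_empty.append(file_path)
    return files_with_empty, empty_fields_per_file, fields_per_file
```

The second and third return values have one entry per input file, in the same order as `file_paths`, which is what lets them be placed alongside the file paths as DataFrame columns.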


In [13]:
df_cidoc_empty = pd.DataFrame.from_dict({"file_path":cidoc_file_paths, "model":["CIDOC-CRM"]*len(cidoc_file_paths), "empty_field_count":empty_fields_per_file, "fields":fields_per_file}).sort_values(by="empty_field_count", ascending=False)
df_cidoc_empty.head()

Unnamed: 0,file_path,model,empty_field_count,fields
92,data/data_task1/cleaned/cidoc_crm/cidoccrm_rec...,CIDOC-CRM,3,"[""crm:p82a_begin_of_the_begin"": ""unknown"", ""cr..."
90,data/data_task1/cleaned/cidoc_crm/cidoccrm_rec...,CIDOC-CRM,3,"[""crm:p190_has_symbolic_content"": ""unknown"", ""..."
9,data/data_playground_task1/cleaned/cidoc_crm/c...,CIDOC-CRM,3,"[""rdfs:label"": ""unknown"", ""crm:p82a_begin_of_t..."
39,data/data_playground_task1/cleaned/cidoc_crm/c...,CIDOC-CRM,2,"[""crm:p190_has_symbolic_content"": ""unknown"", ""..."
10,data/data_playground_task1/cleaned/cidoc_crm/c...,CIDOC-CRM,2,"[""crm:p2_has_type"": ""unknown"", ""crm:p131_is_id..."


Next, find the empty fields for the metadata records in Schema.org JSON-LD, then combine them with the CIDOC-CRM results in the `df_empty` DataFrame.

In [30]:
files_with_empty, empty_fields_per_file, fields_per_file = utils.findEmptyFields(empty, schema_file_paths)

53 empty field(s) across 31 files found.
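
The regular expression only matches a placeholder that directly follows its key, separated by at most one whitespace character, so it would miss placeholders inside arrays (for example, `"name": ["unknown"]`). If the records parse as JSON (an assumption, since the `.txt` files may not all be valid JSON), walking the parsed structure is one alternative. The sketch below uses hypothetical names (`PLACEHOLDERS`, `find_placeholders`) and a made-up sample record.

```python
import json

# Placeholder strings treated as "empty", mirroring the regular expression
PLACEHOLDERS = {"", "unknown", "none", "na", "?", "not specified"}


def find_placeholders(node, path=""):
    """Yield (path, value) for every placeholder string in parsed JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_placeholders(value, f"{path}/{key}")
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_placeholders(item, f"{path}/{i}")
    elif isinstance(node, str) and node.strip().lower() in PLACEHOLDERS:
        yield path, node


# Made-up record for illustration only
record = json.loads('{"name": "Unknown", "creator": {"name": ["unknown", "Ada"]}, "license": ""}')
for field_path, value in find_placeholders(record):
    print(field_path, repr(value))
```

Each result keeps the full path to the field, which distinguishes, for instance, the `name` of the record from the `name` of its creator.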


In [31]:
df_sdo_empty = pd.DataFrame.from_dict({"file_path":schema_file_paths, "model":["Schema.org"]*len(schema_file_paths), "empty_field_count":empty_fields_per_file, "fields":fields_per_file}).sort_values(by="empty_field_count", ascending=False)
df_sdo_empty.head()

Unnamed: 0,file_path,model,empty_field_count,fields
1,data/data_playground_task1/cleaned/schema_org/...,Schema.org,5,"[""name"": ""unknown"", ""name"": ""unknown"", ""contri..."
19,data/data_playground_task1/cleaned/schema_org/...,Schema.org,4,"[""name"": ""unknown"", ""name"": ""unknown"", ""contri..."
101,data/data_task1/cleaned/schema_org/sdo_record_...,Schema.org,3,"[""name"": ""unknown"", ""datecreated"": ""unknown"", ..."
99,data/data_task1/cleaned/schema_org/sdo_record_...,Schema.org,3,"[""name"": ""unknown"", ""name"": ""unknown"", ""datecr..."
45,data/data_playground_task1/cleaned/schema_org/...,Schema.org,3,"[""name"": ""unknown"", ""name"": ""unknown"", ""name"":..."


In [35]:
df_empty = pd.concat([df_cidoc_empty, df_sdo_empty])
df_empty = df_empty.sort_values(by=["empty_field_count"], ascending=False)
df_empty.head()

Unnamed: 0,file_path,model,empty_field_count,fields
1,data/data_playground_task1/cleaned/schema_org/...,Schema.org,5,"[""name"": ""unknown"", ""name"": ""unknown"", ""contri..."
19,data/data_playground_task1/cleaned/schema_org/...,Schema.org,4,"[""name"": ""unknown"", ""name"": ""unknown"", ""contri..."
92,data/data_task1/cleaned/cidoc_crm/cidoccrm_rec...,CIDOC-CRM,3,"[""crm:p82a_begin_of_the_begin"": ""unknown"", ""cr..."
9,data/data_playground_task1/cleaned/cidoc_crm/c...,CIDOC-CRM,3,"[""rdfs:label"": ""unknown"", ""crm:p82a_begin_of_t..."
90,data/data_task1/cleaned/cidoc_crm/cidoccrm_rec...,CIDOC-CRM,3,"[""crm:p190_has_symbolic_content"": ""unknown"", ""..."


"Explode" the DataFrame so that there is one row per field, rather than one row per file.  We'll exclude all the files that don't have any empty fields from this version of the data.

In [36]:
df_empty_exploded = df_empty.loc[df_empty.empty_field_count > 0].drop(columns=["empty_field_count"]).explode("fields")
df_empty_exploded.head()

Unnamed: 0,file_path,model,fields
1,data/data_playground_task1/cleaned/schema_org/...,Schema.org,"""name"": ""unknown"""
1,data/data_playground_task1/cleaned/schema_org/...,Schema.org,"""name"": ""unknown"""
1,data/data_playground_task1/cleaned/schema_org/...,Schema.org,"""contributor"": ""unknown"""
1,data/data_playground_task1/cleaned/schema_org/...,Schema.org,"""datepublished"": ""unknown"""
1,data/data_playground_task1/cleaned/schema_org/...,Schema.org,"""license"": ""unknown"""


In [37]:
df_empty_exploded[["field", "value"]] = df_empty_exploded["fields"].str.split(": ", expand=True)
df_empty_exploded.tail()

Unnamed: 0,file_path,model,fields,field,value
25,data/data_playground_task1/cleaned/schema_org/...,Schema.org,"""name"": ""unknown""","""name""","""unknown"""
47,data/data_playground_task1/cleaned/schema_org/...,Schema.org,"""creator"": ""unknown""","""creator""","""unknown"""
98,data/data_task1/cleaned/schema_org/sdo_record_...,Schema.org,"""name"": ""unknown""","""name""","""unknown"""
96,data/data_task1/cleaned/schema_org/sdo_record_...,Schema.org,"""name"": ""unknown""","""name""","""unknown"""
0,data/data_playground_task1/cleaned/schema_org/...,Schema.org,"""publisher"": ""unknown""","""publisher""","""unknown"""


In [48]:
# Count the empty fields per model
df_model_fields = pd.DataFrame(df_empty_exploded.model.value_counts()).rename(columns={"count":"field_count"})
total_empty_fields = df_empty_exploded.shape[0]
# Count the files per model (df_empty has one row per file, including files without empty fields)
df_model_files = pd.DataFrame(df_empty.model.value_counts()).rename(columns={"count":"file_count"})
total_files = df_empty.shape[0]
df_model_totals = df_model_fields.join(df_model_files).reset_index()
df_model_totals = pd.concat([df_model_totals, pd.DataFrame.from_dict({"model": ["TOTAL"], "field_count": total_empty_fields, "file_count": total_files})])
df_model_totals

Unnamed: 0,model,field_count,file_count
0,Schema.org,53,116
1,CIDOC-CRM,24,97
0,TOTAL,77,213


In [None]:
col = "field"
all_field_counts = pd.DataFrame(df_empty_exploded[col].value_counts())
sdo_field_counts = pd.DataFrame(df_empty_exploded.loc[df_empty_exploded.model == "Schema.org"][col].value_counts())
cidoc_field_counts = pd.DataFrame(df_empty_exploded.loc[df_empty_exploded.model == "CIDOC-CRM"][col].value_counts())
field_counts = all_field_counts.join(sdo_field_counts, rsuffix="_sdo_fields").join(cidoc_field_counts, rsuffix="_cidoc_fields")
field_counts = field_counts.rename(columns={"count":"field_count"})
# field_counts
subdf = df_empty_exploded[["file_path", "model", col]].drop_duplicates()
all_file_counts = pd.DataFrame(subdf[col].value_counts())
sdo_file_counts = pd.DataFrame(subdf.loc[subdf.model == "Schema.org"][col].value_counts())
cidoc_file_counts = pd.DataFrame(subdf.loc[subdf.model == "CIDOC-CRM"][col].value_counts())
file_counts = all_file_counts.join(sdo_file_counts, rsuffix="_sdo_files").join(cidoc_file_counts, rsuffix="_cidoc_files")
file_counts = file_counts.rename(columns={"count":"file_count"})
# file_counts
field_counts = field_counts.join(file_counts)
field_counts = field_counts.fillna(0)
field_counts


Unnamed: 0_level_0,field_count,count_sdo_fields,count_cidoc_fields,file_count,count_sdo_files,count_cidoc_files
field,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"""name""",28,28.0,0.0,19,19.0,0.0
"""license""",6,6.0,0.0,6,6.0,0.0
"""datecreated""",5,5.0,0.0,5,5.0,0.0
"""datepublished""",4,4.0,0.0,4,4.0,0.0
"""rdfs:label""",4,0.0,4.0,4,0.0,4.0
"""contributor""",4,4.0,0.0,4,4.0,0.0
"""crm:p131_is_identified_by""",4,0.0,4.0,4,0.0,4.0
"""crm:p82a_begin_of_the_begin""",3,0.0,3.0,3,0.0,3.0
"""crm:p82b_end_of_the_end""",3,0.0,3.0,3,0.0,3.0
"""crm:p1_is_identified_by""",3,0.0,3.0,3,0.0,3.0


In [87]:
col = "value"
all_value_counts = pd.DataFrame(df_empty_exploded[col].value_counts())
sdo_value_counts = pd.DataFrame(df_empty_exploded.loc[df_empty_exploded.model == "Schema.org"][col].value_counts())
cidoc_value_counts = pd.DataFrame(df_empty_exploded.loc[df_empty_exploded.model == "CIDOC-CRM"][col].value_counts())
value_counts = all_value_counts.join(sdo_value_counts, rsuffix="_sdo_fields").join(cidoc_value_counts, rsuffix="_cidoc_fields")
value_counts = value_counts.rename(columns={"count":"field_count"})
# value_counts
subdf = df_empty_exploded[["file_path", "model", col]].drop_duplicates()
all_file_counts = pd.DataFrame(subdf[col].value_counts())
sdo_file_counts = pd.DataFrame(subdf.loc[subdf.model == "Schema.org"][col].value_counts())
cidoc_file_counts = pd.DataFrame(subdf.loc[subdf.model == "CIDOC-CRM"][col].value_counts())
file_counts = all_file_counts.join(sdo_file_counts, rsuffix="_sdo_files").join(cidoc_file_counts, rsuffix="_cidoc_files")
file_counts = file_counts.rename(columns={"count":"file_count"})
# file_counts
value_counts = value_counts.join(file_counts)
value_counts = value_counts.fillna(0)
value_counts


Unnamed: 0_level_0,field_count,count_sdo_fields,count_cidoc_fields,file_count,count_sdo_files,count_cidoc_files
value,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"""unknown""",73,49,24.0,46,30,16.0
"""not specified""",2,2,0.0,1,1,0.0
"""""",2,2,0.0,1,1,0.0


Save the reports as CSV files.

In [49]:
metadata_standard = "sdo-cidoc"
data_serialization = "json-ld"

In [50]:
report_type = "empty_fields_by_file"
df_empty.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=False
    )

In [51]:
report_type = "empty_fields_by_field"
df_empty_exploded.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=False
    )

In [52]:
report_type = "empty_by_model"
df_model_totals.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=False
    )

In [78]:
report_type = "empty_field_counts"
field_counts.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=True  # Keep the field names, which are stored in the index
    )

In [88]:
report_type = "empty_value_counts"
value_counts.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=True  # Keep the values, which are stored in the index
    )

#### URLs
Check that URLs are well-formed and that they resolve.  ***Time permitting: check whether they lead to a relevant website.***

In [None]:
# "url": "http://url.unspecified"
# license, identifier, sameAs, url, contentUrl

In [49]:
url_pattern = r"https?://[a-z0-9\-._~:/?#@!$&'()*+,;=%]+"


In [50]:
# Find all the URLs
files_with_urls, url_count_per_file, urls_per_file = [], [], []
for file_path in schema_file_paths:
    with open(file_path, "r") as f:
        # NOTE: lowercasing the text also lowercases the URLs, which can change
        # case-sensitive paths and fragments before the URLs are requested
        f_string = f.read().lower()

    # Save the URLs found in the file in a list
    # (the pattern cannot match a space, so each match is a single URL)
    file_urls = [match[0] for match in re.finditer(url_pattern, f_string)]
    urls_per_file += [file_urls]
    url_count_per_file += [len(file_urls)]

    if len(file_urls) > 0:
        files_with_urls += [file_path]

print(sum(url_count_per_file), "URLs found in", len(files_with_urls), "files.")

270 URLs found in 116 files.
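
Because the file text is lowercased before matching, the extracted URLs are lowercased too. URL paths and fragments can be case-sensitive, so this may contribute to some of the request errors reported further down (an untested hypothesis). One alternative is to leave the text as-is and make the pattern itself case-insensitive. In the sketch below, `URL_RE`, `extract_urls`, and the sample string are illustrative and not part of this notebook.

```python
import re

# Same character class as `url_pattern`, matched case-insensitively so that
# the original casing of each URL is preserved
URL_RE = re.compile(r"https?://[a-z0-9\-._~:/?#@!$&'()*+,;=%]+", re.IGNORECASE)


def extract_urls(text):
    """Return every URL found in `text`, with its original casing."""
    return URL_RE.findall(text)


# Made-up sample for illustration only
sample = '{"@context": "https://schema.org", "sameAs": "https://www.wikidata.org/wiki/Q42"}'
print(extract_urls(sample))
```

Switching to this approach would change the URLs that get requested, so the counts and error statistics would need to be regenerated.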


In [51]:
sdo_url_df = pd.DataFrame.from_dict({"file_path":schema_file_paths, "model": ["Schema.org"]*len(schema_file_paths), "url_count":url_count_per_file, "urls":urls_per_file}).sort_values(by="url_count", ascending=False)

In [52]:
# Find all the URLs
files_with_urls, url_count_per_file, urls_per_file = [], [], []
for file_path in cidoc_file_paths:
    with open(file_path, "r") as f:
        # NOTE: lowercasing the text also lowercases the URLs, which can change
        # case-sensitive paths and fragments before the URLs are requested
        f_string = f.read().lower()

    # Save the URLs found in the file in a list
    # (the pattern cannot match a space, so each match is a single URL)
    file_urls = [match[0] for match in re.finditer(url_pattern, f_string)]
    urls_per_file += [file_urls]
    url_count_per_file += [len(file_urls)]

    if len(file_urls) > 0:
        files_with_urls += [file_path]

print(sum(url_count_per_file), "URLs found in", len(files_with_urls), "files.")

282 URLs found in 96 files.


In [53]:
cidoc_url_df = pd.DataFrame.from_dict({"file_path":cidoc_file_paths, "model": ["CIDOC-CRM"]*len(cidoc_file_paths), "url_count":url_count_per_file, "urls":urls_per_file}).sort_values(by="url_count", ascending=False)

In [54]:
url_df = pd.concat([sdo_url_df, cidoc_url_df])
url_df = url_df.sort_values(by="url_count", ascending=False)
url_df.head()

Unnamed: 0,file_path,model,url_count,urls
84,data/data_task1/cleaned/cidoc_crm/cidoccrm_rec...,CIDOC-CRM,14,"[http://www.cidoc-crm.org/cidoc-crm/, http://w..."
78,data/data_playground_task3/cleaned/schema_org/...,Schema.org,11,"[https://schema.org, https://www.wikidata.org/..."
93,data/data_task1/cleaned/schema_org/sdo_record_...,Schema.org,11,"[https://schema.org, https://www.wikidata.org/..."
94,data/data_task1/cleaned/cidoc_crm/cidoccrm_rec...,CIDOC-CRM,8,[http://www.cidoc-crm.org/cidoc-crm/e22_man-ma...
83,data/data_playground_task3/cleaned/schema_org/...,Schema.org,8,"[https://schema.org, https://www.wikidata.org/..."


In [55]:
url_df = url_df.loc[url_df["url_count"] > 0]  # Keep only files with URLs
url_df_exploded = url_df.explode("urls").drop(columns=["url_count"])
url_df_exploded.head()

Unnamed: 0,file_path,model,urls
84,data/data_task1/cleaned/cidoc_crm/cidoccrm_rec...,CIDOC-CRM,http://www.cidoc-crm.org/cidoc-crm/
84,data/data_task1/cleaned/cidoc_crm/cidoccrm_rec...,CIDOC-CRM,http://www.w3.org/2000/01/rdf-schema#
84,data/data_task1/cleaned/cidoc_crm/cidoccrm_rec...,CIDOC-CRM,http://www.w3.org/2001/xmlschema#
84,data/data_task1/cleaned/cidoc_crm/cidoccrm_rec...,CIDOC-CRM,https://diglib.amphilsoc.org/islandora/object/...
84,data/data_task1/cleaned/cidoc_crm/cidoccrm_rec...,CIDOC-CRM,https://www.wikidata.org/wiki/q18533306


Check whether each of the URLs found is a valid URL.

In [58]:
request_errors = []
json_url_list = list(url_df_exploded.urls)
for url in json_url_list:
    try:
        url_request = Request(url.strip(), headers=headers)
        urlopen(url_request, timeout=10).read()
        request_errors += ["No error"]  # Indicates a valid URL (though a manual check is needed to make sure it's a relevant URL)
    except Exception as e:
        request_errors += [str(e)]
print("Finished requests!")

Finished requests!


In [59]:
url_df_exploded.insert(len(url_df_exploded.columns), "request_error", request_errors)
url_df_exploded.head()

Unnamed: 0,file_path,model,urls,request_error
84,data/data_task1/cleaned/cidoc_crm/cidoccrm_rec...,CIDOC-CRM,http://www.cidoc-crm.org/cidoc-crm/,No error
84,data/data_task1/cleaned/cidoc_crm/cidoccrm_rec...,CIDOC-CRM,http://www.w3.org/2000/01/rdf-schema#,No error
84,data/data_task1/cleaned/cidoc_crm/cidoccrm_rec...,CIDOC-CRM,http://www.w3.org/2001/xmlschema#,HTTP Error 300: Multiple Choices
84,data/data_task1/cleaned/cidoc_crm/cidoccrm_rec...,CIDOC-CRM,https://diglib.amphilsoc.org/islandora/object/...,HTTP Error 404: Not Found
84,data/data_task1/cleaned/cidoc_crm/cidoccrm_rec...,CIDOC-CRM,https://www.wikidata.org/wiki/q18533306,HTTP Error 404: Not Found


In [76]:
all_urls_df = pd.DataFrame.from_dict({"model":["TOTAL"], "url_count":[url_df_exploded.shape[0]], "file_count":[url_df.shape[0]]}).set_index("model")
urls_model_df = pd.DataFrame(url_df_exploded.model.value_counts()).rename(columns={"count":"url_count"})
files_model_df = pd.DataFrame(url_df.model.value_counts()).rename(columns={"count":"file_count"})
model_df = urls_model_df.join(files_model_df)
model_df = pd.concat([model_df, all_urls_df])
model_df

Unnamed: 0_level_0,url_count,file_count
model,Unnamed: 1_level_1,Unnamed: 2_level_1
CIDOC-CRM,282,96
Schema.org,270,116
TOTAL,552,212


In [85]:
error_stats = pd.DataFrame(url_df_exploded.request_error.value_counts())
cidoc_error_stats = pd.DataFrame(url_df_exploded.loc[url_df_exploded.model == "CIDOC-CRM"].request_error.value_counts()).rename(columns={"count":"cidoc-crm_count"})
schema_error_stats = pd.DataFrame(url_df_exploded.loc[url_df_exploded.model == "Schema.org"].request_error.value_counts()).rename(columns={"count":"schema-org_count"})
error_stats = error_stats.join(cidoc_error_stats).join(schema_error_stats).fillna(0)

all_errors = url_df_exploded.loc[url_df_exploded.request_error != "No error"]
cidoc_errors = all_errors.loc[all_errors.model == "CIDOC-CRM"].shape[0]
schema_errors = all_errors.loc[all_errors.model == "Schema.org"].shape[0]
total_errors = pd.DataFrame.from_dict({"request_error":["ALL REQUEST ERRORS"], "count":[all_errors.shape[0]], "cidoc-crm_count":[cidoc_errors], "schema-org_count":[schema_errors]}).set_index("request_error")

error_stats = pd.concat([error_stats, total_errors])
error_stats

Unnamed: 0_level_0,count,cidoc-crm_count,schema-org_count
request_error,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
No error,409,179.0,230.0
HTTP Error 404: Not Found,82,46.0,36.0
HTTP Error 300: Multiple Choices,57,57.0,0.0
HTTP Error 404: File Not Found,1,0.0,1.0
HTTP Error 403: Forbidden,1,0.0,1.0
"<urlopen error [Errno 8] nodename nor servname provided, or not known>",1,0.0,1.0
HTTP Error 500: Internal Server Error,1,0.0,1.0
ALL REQUEST ERRORS,143,103.0,40.0
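
The table above counts each distinct error message separately, so "HTTP Error 404: Not Found" and "HTTP Error 404: File Not Found" appear on different rows even though both are 404 responses. The sketch below groups messages by HTTP status code instead. The category labels and the names `HTTP_ERROR_RE` and `error_category` are choices made for this illustration.

```python
import re
from collections import Counter

# Matches messages such as "HTTP Error 404: Not Found"
HTTP_ERROR_RE = re.compile(r"^HTTP Error (\d{3}):")


def error_category(request_error):
    """Map a request_error string to a coarse category."""
    if request_error == "No error":
        return "ok"
    match = HTTP_ERROR_RE.match(request_error)
    if match:
        return f"http_{match.group(1)}"
    return "other"  # e.g., DNS failures or timeouts


# A few of the messages from the table above
messages = [
    "No error",
    "HTTP Error 404: Not Found",
    "HTTP Error 404: File Not Found",
    "<urlopen error [Errno 8] nodename nor servname provided, or not known>",
]
print(Counter(error_category(m) for m in messages))
```

Applied to the full data, this could be added as a new column with `url_df_exploded.request_error.map(error_category)`.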


Save the reports as CSV files.

In [86]:
metadata_standard = "sdo-cidoc"
data_serialization = "json-ld"

In [87]:
report_type = "url_counts_per_file"
url_df.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=False
    )

In [88]:
report_type = "url_counts_per_model"
model_df.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=True  # Keep the model names, which are stored in the index
    )

In [None]:
report_type = "urls" # includes column for request errors
url_df_exploded.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=False
    )

In [91]:
report_type = "url_errors"
all_errors.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=False
    )

In [92]:
report_type = "url_errors_stats"
error_stats.to_csv(
    report_dir+"{metadata_standard}_{data_serialization}_{report_type}.csv".format(
        metadata_standard=metadata_standard,
        data_serialization=data_serialization,
        report_type=report_type
        ), index=True  # Keep the request_error labels, which are stored in the index
    )