# Data Preparation 

### Linking Anthropology's Data and Archives (LADA)

### AI-Generated Linked Data Evaluation (part I)

In [None]:
import config
import utils
import pandas as pd
import numpy as np
from pathlib import Path
import os
import re

Read in CSV data as a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) (a type of data structure in the Python library pandas used for tabular data), where `f` is the name of the data file and `data_dir` is the path to the folder where the data file is located:

In [None]:
# Choose the CSV file to convert: leave exactly one of the following code
# lines uncommented (without a pound sign in front) and comment out the
# others. XML and JSON data files will be created in the folder (a.k.a.
# directory) stored in the `data_dir` variable from the CSV file stored
# in the `f` variable.
# ------------------------------------
data_dir, f = config.task1_data, "4-H Data Experiment Assignments and Outcomes - Outcomes - Task 1.csv"
# data_dir, f = config.playgrd1_data, "4-H Data Experiment Assignments and Outcomes - Playground Outcomes - Task 1.csv"
# data_dir, f = config.playgrd3_data, "4-H Data Experiment Assignments and Outcomes - Playground Outcomes - Task 3.csv"

# Alternatively, to create XML and JSON files from a different CSV file,
# uncomment (delete the pound sign in front of) the line of code below and
# replace the text between the quotes with the path to the folder that
# contains the file (ending in a slash, because the folder path and file
# name are joined directly) and the name of the CSV file.
# ------------------------------------
# data_dir, f = "path/to/your/data/", "your_data_file_name.csv"

In [None]:
df = pd.read_csv(data_dir + f)
print("Total rows:", df.shape[0])
print("Total columns:", df.shape[1])
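
As a minimal sketch of this loading step, the example below reads a tiny made-up CSV instead of the project's data. The folder, file name, and cell values are all hypothetical. It builds the file path with `pathlib` rather than string concatenation, which is an alternative to the approach above that works whether or not the folder path ends in a slash.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the real data folder and CSV file
with tempfile.TemporaryDirectory() as tmp_dir:
    example_file = "example.csv"
    csv_text = "Filename,Metadata record\na.pdf,<dc>...</dc>\nb.pdf,\n"
    (Path(tmp_dir) / example_file).write_text(csv_text, encoding="utf-8")

    # Joining with `/` avoids depending on a trailing slash in the folder path
    example_df = pd.read_csv(Path(tmp_dir) / example_file)

print("Total rows:", example_df.shape[0])
print("Total columns:", example_df.shape[1])
```

The empty cell in the second row is read as a missing value (`NaN`), which is what the cleaning steps later in this notebook look for.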

In [None]:
df.head(10)

The table displayed above is a partial view of the DataFrame created from the CSV file.  We're looking at the *first 10 rows* as specified by the `10` in `df.head(10)`.  You can change `head` to `tail` to view the last 10 rows of the DataFrame (e.g., `df.tail(10)`) and you can change the number between the parentheses to control how many rows to view (e.g., `df.head(3)`).  If no number is included between the parentheses, by default, 5 rows will display (e.g., `df.head()`).
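
To try these variations without touching the real data, here is a small sketch on a made-up 12-row table (the `toy_df` name and its single column are hypothetical):

```python
import pandas as pd

# Hypothetical 12-row table standing in for the real DataFrame
toy_df = pd.DataFrame({"row_number": range(12)})

first_three = toy_df.head(3)   # rows 0, 1, 2
last_three = toy_df.tail(3)    # rows 9, 10, 11
default_view = toy_df.head()   # 5 rows when no number is given

print(len(first_three), len(last_three), len(default_view))
```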

Next we'll clean up the table.  First, let's remove any empty columns from the DataFrame.  If there aren't any empty columns, the DataFrame will stay the same.

In [None]:
df = df.dropna(axis=1, how='all')
print("Total rows:", df.shape[0], "| Total columns:", df.shape[1])

In [None]:
df.head(2)

Next, remove any rows without a metadata record from the DataFrame.  If all the rows have a metadata record, the DataFrame will stay the same.

In [None]:
df.dropna(subset=["Metadata record"], inplace=True)
print("Total rows:", df.shape[0], "| Total columns:", df.shape[1])
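
The two cleaning steps above can be tried on a small made-up table. This is only an illustration: `toy_df` and its `Notes` column are hypothetical, while the `dropna` arguments are the same ones used above.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "Filename": ["a.pdf", "b.pdf", "c.pdf"],
    "Notes": [np.nan, np.nan, np.nan],              # entirely empty column
    "Metadata record": ["<dc/>", np.nan, "<dc/>"],  # one empty cell
})

# how="all" drops a column only if every value in it is missing
no_empty_cols = toy_df.dropna(axis=1, how="all")

# subset=[...] drops a row only if the named column is missing in that row
with_records = no_empty_cols.dropna(subset=["Metadata record"])

print(list(no_empty_cols.columns))
print(list(with_records["Filename"]))
```

Note that the surviving rows keep their original index labels (here 0 and 2), so the index may have gaps after rows are dropped.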

If there's no column with unique identifiers per row, make one:

In [None]:
col_list = list(df.columns)
identifier_col = ""
for col in col_list:
    col_name = col.lower()
    if col_name in ("id", "identifier") or "_id" in col_name:
        # Check that the existing IDs are, in fact, unique to each row
        # (repeated or repeatedly missing values count as duplicates)
        if df[col].is_unique:
            identifier_col = col
if len(identifier_col) == 0:
    # Create a unique identifier per row (0, 1, 2, ...)
    new_ids = list(range(0, df.shape[0]))
    df.insert(0, "id", new_ids)
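
To see the fallback in action, here is a sketch on a made-up table whose ID-like column contains a repeated value. The helper name `find_unique_id_column` and the toy data are hypothetical, and the sketch uses pandas' `Series.is_unique` as the uniqueness test.

```python
import pandas as pd


def find_unique_id_column(frame):
    """Return the last ID-like column whose values are all unique, else ""."""
    found = ""
    for col in frame.columns:
        name = str(col).lower()
        if name in ("id", "identifier") or "_id" in name:
            if frame[col].is_unique:
                found = col
    return found


toy_df = pd.DataFrame({
    "record_id": [7, 7, 9],   # ID-like name, but 7 is repeated
    "Filename": ["a.pdf", "b.pdf", "c.pdf"],
})

id_col = find_unique_id_column(toy_df)
if id_col == "":
    # No usable identifier column, so number the rows 0, 1, 2, ...
    toy_df.insert(0, "id", list(range(toy_df.shape[0])))

print(list(toy_df.columns))
print(list(toy_df["id"]))
```

Because `record_id` is not unique, a new `id` column is inserted at position 0 and the original column is left in place.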

In [None]:
print("Total rows:", df.shape[0], "| Total columns:", df.shape[1])

Rename the columns for clarity, consistency, and conciseness:

In [None]:
# Note: if your data header is different, replace the column names before each
# colon with the column names of your data (e.g., "Filename" might become "File name").
# The strings after each colon are the new column names that the code below will expect.
new_col_names = {
    "Filename": "filename",
    "Metadata record": "dc_record",         # Dublin Core metadata record
    #"Transcription or caption (or link to separate doc, if too long)":"transcription_or_caption",
    "Schema.org Record": "sdo_record",      # Schema.org metadata record
    "CIDOC-CRM Record": "cidoccrm_record"   # CIDOC-CRM metadata record
}
if len(identifier_col) > 0:
    new_col_names[identifier_col] = "id"    # Unique identifier for each row
df.rename(columns=new_col_names, inplace=True)

df.head()
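
One thing to watch for: by default, `rename` silently skips any name that is not found in the DataFrame, so a small difference in a header (such as capitalization) leaves that column unrenamed without raising an error. The sketch below, on made-up data with hypothetical variable names, shows one way to check that the expected columns exist after renaming.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Filename": ["a.pdf"],
    "Metadata Record": ["<dc/>"],   # note the capital "R": not an exact match
})

new_names = {"Filename": "filename", "Metadata record": "dc_record"}
toy_df = toy_df.rename(columns=new_names)

# Report any expected column that is still missing after renaming
expected = ["filename", "dc_record"]
missing = [col for col in expected if col not in toy_df.columns]
print("Missing columns:", missing)
```

If the list is not empty, adjust the names before the colons in the renaming cell to match your data's header exactly.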


Create a folder to store the DataFrame (the cleaner version of the data) and save a new CSV file with the data there.

In [None]:
data_dir = data_dir + "cleaned/"
Path(data_dir).mkdir(parents=True, exist_ok=True)
df.to_csv(data_dir + f)
print(f"Wrote a new file with the cleaner version of the data that can be found at:\n{data_dir+f}")
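
The following sketch, run in a temporary folder with made-up data, illustrates two details of this step: calling `mkdir` with `exist_ok=True` makes the cell safe to rerun, and `to_csv` writes the DataFrame's index as an extra, unnamed first column unless `index=False` is passed. The folder and file names are hypothetical. The cell above keeps the default, and whether the extra column matters depends on how the file is read later.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy_df = pd.DataFrame({"id": [0, 1], "filename": ["a.pdf", "b.pdf"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    cleaned_dir = Path(tmp_dir) / "cleaned"
    # parents=True creates missing parent folders; exist_ok=True allows reruns
    cleaned_dir.mkdir(parents=True, exist_ok=True)
    cleaned_dir.mkdir(parents=True, exist_ok=True)  # second call is harmless

    toy_df.to_csv(cleaned_dir / "with_index.csv")
    toy_df.to_csv(cleaned_dir / "without_index.csv", index=False)

    with_index = pd.read_csv(cleaned_dir / "with_index.csv")
    without_index = pd.read_csv(cleaned_dir / "without_index.csv")

print(list(with_index.columns))
print(list(without_index.columns))
```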

Next let's create XML and JSON files from the metadata records in our DataFrame.

### Dublin Core
Write the [Dublin Core](https://www.dublincore.org) (DC) records as XML files.

In [None]:
# Remove rows with an empty value in the dc_record column
df_dc = df[df["dc_record"].notna()]
dc_records_ids = list(df_dc["id"])

Let's look at an example of one of the Dublin Core records that we'll write to an XML file!

In [None]:
dc_records = list(df_dc["dc_record"])
print(dc_records[2])

Create a new folder to save the XML files in.

In [None]:
dc_path = data_dir+"dublin_core/"
Path(dc_path).mkdir(parents=True, exist_ok=True)
print("The files will be saved under:", dc_path)

In [None]:
utils.write_xml(dc_records_ids, dc_records, dc_path, "dc_record_", ".xml")
utils.write_xml(dc_records_ids, dc_records, dc_path, "dc_record_", ".txt")
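
The `utils` module is not shown in this notebook, so how `write_xml` works is not documented here. As a rough sketch only, assuming a helper with this call shape writes one file per record and names each file from the prefix, the row's identifier, and the extension, a hypothetical stand-in (`write_records_sketch`, not the project's function) might look like this:

```python
import tempfile
from pathlib import Path


def write_records_sketch(ids, records, folder, prefix, extension):
    """Hypothetical stand-in for utils.write_xml: one file per record."""
    written = []
    for record_id, record in zip(ids, records):
        file_path = Path(folder) / f"{prefix}{record_id}{extension}"
        file_path.write_text(str(record), encoding="utf-8")
        written.append(file_path.name)
    return written


with tempfile.TemporaryDirectory() as tmp_dir:
    names = write_records_sketch(
        [0, 3], ["<metadata/>", "<metadata/>"], tmp_dir, "dc_record_", ".xml"
    )
    files_on_disk = sorted(p.name for p in Path(tmp_dir).iterdir())

print(names)
```

Under that assumption, the file names follow the row identifiers rather than the position of the record in the list.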

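**Note:** If you open some of the files, you'll see inconsistency in the Dublin Core (DC) records' formatting!  For example, the records below differ in their XML declaration, root element, and namespace declarations, and the last one has neither an XML declaration nor a root element: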

```
<?xml version='1.0' encoding='utf-8'?>
<dublin_core><dc element="title">Turnin' Timez: Original Student Poems</dc>
    ...
</dublin_core>
```
---
```
<?xml version="1.0"?>
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title>National 4-H Center Major Pledges, Contributions, and Grants</dc:title>
    ...
</metadata>
```
---
```
<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title>4-H National Youth Science Day</dc:title>
    ...
</metadata>
```
---
```
<dc:title>Climbing Up: Fun Activities for You and Your Cat</dc:title>
   ...
<dc:rights>Unknown</dc:rights>
```

Also note that the [DCMI documentation](https://www.dublincore.org/specifications/dublin-core/dcmi-terms/) encourages the use of the http://purl.org/dc/terms/ namespace over the older http://purl.org/dc/elements/1.1/ namespace.

We'll address these formatting inconsistencies in later Jupyter Notebooks that evaluate and correct these metadata records.
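
As a small preview of what such a check could look like, the sketch below uses Python's built-in `xml.etree.ElementTree` to test whether a record is well-formed XML and whether it uses the `http://purl.org/dc/elements/1.1/` namespace. The three sample strings are shortened, made-up records in the styles shown above, and this is an illustration rather than the method used in the later notebooks.

```python
import xml.etree.ElementTree as ET

DC_ELEMENTS = "http://purl.org/dc/elements/1.1/"

# Shortened, made-up records in the styles shown above
samples = {
    "custom_root": '<dublin_core><dc element="title">A title</dc></dublin_core>',
    "namespaced": (
        '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
        "<dc:title>A title</dc:title></metadata>"
    ),
    "fragment": "<dc:title>A title</dc:title>\n<dc:rights>Unknown</dc:rights>",
}

report = {}
for name, text in samples.items():
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        report[name] = "not well-formed"
        continue
    # ElementTree expands prefixes, so dc:title becomes {namespace}title
    uses_dc = any(el.tag.startswith("{" + DC_ELEMENTS + "}") for el in root.iter())
    report[name] = "uses dc namespace" if uses_dc else "no dc namespace"

for name, status in report.items():
    print(f"{name}: {status}")
```

The fragment fails to parse because its `dc:` prefix is never declared and it has no single root element.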

### Schema.org
Write the [Schema.org](https://schema.org) records as JSON-LD files.

In [None]:
# Remove rows with an empty value in the sdo_record column
df_sdo = df[df["sdo_record"].notna()]
sdo_records_ids = list(df_sdo["id"])

Let's look at an example of a Schema.org metadata record in JSON-LD!

In [None]:
sdo_records = list(df_sdo["sdo_record"])
print(sdo_records[2])

Create a new folder to save the Schema.org JSON-LD files in.

In [None]:
sdo_path = data_dir+"schema_org/"
Path(sdo_path).mkdir(parents=True, exist_ok=True)
print("The files will be saved under:", sdo_path)

In [None]:
utils.write_json(sdo_records_ids, sdo_records, sdo_path, "sdo_record_", ".json")
utils.write_json(sdo_records_ids, sdo_records, sdo_path, "sdo_record_", ".txt")
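
Before moving on, it can help to confirm that a record is parseable JSON and to peek at its `@context` and `@type`. The sketch below does this with Python's built-in `json` module on two made-up records, one valid and one with a trailing comma. It is only an illustration, not the project's evaluation code.

```python
import json

# Made-up records: the first is valid JSON-LD, the second has a trailing comma
sample_records = [
    '{"@context": "https://schema.org", "@type": "Book", "name": "A title"}',
    '{"@context": "https://schema.org", "@type": "Book",}',
]

summary = []
for text in sample_records:
    try:
        record = json.loads(text)
    except json.JSONDecodeError as error:
        summary.append(f"invalid JSON: {error.msg}")
        continue
    summary.append(f"{record.get('@type')} using context {record.get('@context')}")

for line in summary:
    print(line)
```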

### CIDOC-CRM
Write the [CIDOC-CRM](https://cidoc-crm.org) records as JSON-LD files.

In [None]:
# Remove rows with an empty value in the cidoccrm_record column
df_cidoc = df[df["cidoccrm_record"].notna()]
cidoc_records_ids = list(df_cidoc["id"])

Let's look at an example of a metadata record in CIDOC-CRM JSON-LD!

In [None]:
cidoc_records = list(df_cidoc["cidoccrm_record"])
print(cidoc_records[2])

Create a new folder to save the CIDOC-CRM JSON-LD files in.

In [None]:
cidoc_path = data_dir+"cidoc_crm/"
Path(cidoc_path).mkdir(parents=True, exist_ok=True)
print("The files will be saved under:", cidoc_path)

In [None]:
utils.write_json(cidoc_records_ids, cidoc_records, cidoc_path, "cidoccrm_record_", ".json")
utils.write_json(cidoc_records_ids, cidoc_records, cidoc_path, "cidoccrm_record_", ".txt")

Now we can move on to evaluating the quality of the metadata records we just exported as individual files!  

We'll begin by evaluating their syntax with the Jupyter Notebook named `evaluation-syntax.ipynb`.