## Unstructured Data to Knowledge Graph for the Antiquities / Art Trade World

This notebook contains a workflow for taking unstructured information about the antiquities or art trade and rendering it into a knowledge graph according to a predefined schema. It can use a variety of LLMs via the langchain library to handle the natural language processing, and it uses the pydantic and enum libraries to handle validation. This means that *most* of the time, the entity names and predicates (relationships) are consistent.

That is to say, you'll usually get 'J. Paul Getty Museum' rather than several variations ('the Getty', 'Getty Museum', etc.), and the relationships will be those defined in the schemas.py and prompt.py files. Sometimes, though, relationships will be returned that are not in your schema. These cases are gathered into a separate dataframe for evaluation.
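The idea of pinning relationships to a fixed vocabulary can be sketched with the standard library alone. This is not the repo's schema file (which uses pydantic); the `RelationName` enum, the `Triple` dataclass and `make_triple` are hypothetical names, and the five relation names are simply the ones visible in the logged output further down this notebook.

```python
from dataclasses import dataclass
from enum import Enum


class RelationName(str, Enum):
    # Hypothetical subset of predicates; the real list lives in the repo's schema file.
    LOOTS = "loots"
    SELLS = "sells"
    HAS_POSSESSION_OF = "has_possession_of"
    WORKS_WITH = "works_with"
    IS_THE_OWNER_OF = "is_the_owner_of"


@dataclass(frozen=True)
class Triple:
    entity1: str
    relation: RelationName
    entity2: str


def make_triple(entity1: str, relation: str, entity2: str) -> Triple:
    # RelationName(relation) raises ValueError for any off-schema predicate.
    return Triple(entity1, RelationName(relation), entity2)


ok = make_triple("Looters", "loots", "Ballcourt Marker")

try:
    make_triple("Looters", "smuggles", "Ballcourt Marker")
    off_schema = False
except ValueError:
    off_schema = True
```

An enum only constrains predicates. Entity names are free text, which is why they are consistent only *most* of the time.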


If you wish to add or modify the relationships and entities you are after, you need to modify both of those files appropriately and save them. The code reloads those modules every time, on the assumption that you want to play around with the schema/prompt.

**You will need** API access to a large language model that langchain supports (e.g. OpenAI, Groq, Mistral). OpenAI will cost you, but Groq can be free for experimentation.

Click on where it says 'hidden' in the cells below to see the relevant code. Examine the comments; you'll have to make some changes for your own use case.

## Get the repo

In [1]:
#from https://towardsdatascience.com/the-lesser-known-rising-application-of-llms-775834116477

!git clone https://github.com/shawngraham/structured-cultural-heritage-crime

Cloning into 'structured-cultural-heritage-crime'...
remote: Enumerating objects: 249, done.
remote: Counting objects: 100% (249/249), done.
remote: Compressing objects: 100% (185/185), done.
remote: Total 249 (delta 125), reused 147 (delta 55), pack-reused 0
Receiving objects: 100% (249/249), 1.35 MiB | 3.03 MiB/s, done.
Resolving deltas: 100% (125/125), done.


In [2]:
%cd structured-cultural-heritage-crime

/content/structured-cultural-heritage-crime


In [None]:
!pip install -r requirements.txt
#should remove mistral and replace with openai


In [None]:
# if you want to use openai, uncomment/comment appropriately:

#!pip install langchain-openai
!pip install langchain-groq


## Initial config

The cells below are set up for Groq. If you're using Mistral or OpenAI instead, comment/uncomment the lines as appropriate.

Use Colab's 'Secrets' feature to keep track of your API keys. Follow the example to point to whatever's appropriate for your use case.

In [5]:
import os
import json
import time
from pathlib import Path

import pandas as pd
from langchain.output_parsers import PydanticOutputParser
#from langchain_mistralai.chat_models import ChatMistralAI
#from langchain_openai import ChatOpenAI
from langchain_groq import ChatGroq
#from dotenv import load_dotenv

from core import run
from prompt import DEFAULT_BASE_PROMPT, create_prompt
from schemas import CulturalHeritageSchema


In [6]:

from google.colab import userdata
#OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")
GROQ_API_KEY = userdata.get("GROQ")
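
If you run the notebook outside Colab, `google.colab` will not be importable. A minimal sketch of a fallback, assuming you have exported the key as an environment variable of the same name; `get_api_key` is a hypothetical helper, not part of the repo:

```python
import os


def get_api_key(name: str) -> str:
    """Return a secret from Colab's Secrets panel, or from the environment elsewhere."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        userdata = None

    if userdata is not None:
        return userdata.get(name)

    # Outside Colab: fall back to an environment variable with the same name.
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"No API key found under the name {name!r}")
    return key


# GROQ_API_KEY = get_api_key("GROQ")
```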

## Working with Texts

The texts should have coreference resolution done as much as possible before processing them here. (See SG notebook for coref resolution).

If you've got a zip of the texts you want to process, drag and drop it into the file tray and unzip it. The 129 articles from the Trafficking Culture project are already in the repo, by the way.

In [None]:
## change to your own file.zip
!unzip coref-tc.zip
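
The `!unzip` shell command only works where an `unzip` binary is available. As an alternative, here is a small sketch using Python's standard `zipfile` module; `unzip_texts` is a hypothetical helper name:

```python
import zipfile


def unzip_texts(zip_path: str, dest: str = ".") -> list[str]:
    """Extract a zip archive into dest and return the names of the .txt files it held."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return sorted(n for n in archive.namelist() if n.endswith(".txt"))


# names = unzip_texts("coref-tc.zip")
# print(len(names), "text files extracted")
```

The returned list also tells you which folder the texts landed in, which is what `folder_path` in the next cell needs to match.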

In [8]:
# create a dataframe from the separate text files.

import os
import pandas as pd

def read_files_to_dataframe(folder_path: str) -> pd.DataFrame:
    # List all text files in the folder
    file_names = [f for f in os.listdir(folder_path) if f.endswith('.txt')]

    # Read each file's content into a list
    data = []
    for file_name in file_names:
        with open(os.path.join(folder_path, file_name), 'r', encoding='utf-8') as file:
            content = file.read()
            data.append(content)

    # Create a DataFrame with the file contents
    df = pd.DataFrame(data, columns=['content'])

    return df

# Usage example
folder_path = 'resolved' #change this to whatever your folder of files is called
df = read_files_to_dataframe(folder_path)
print(df)

                                               content
0    Classic Maya stone sculpture from the site of ...
1    A cultural tradition known for tradition ceram...
2    Antiquities dealer Leonardo Patterson convicte...
3    The Getty Museum returned a looted terracotta ...
4    Also known as : Getty Museum Returns to Italy ...
..                                                 ...
135  A 12th century Hindu sculpture stolen from Nep...
136  Attorney - General of New Zealand v Ortiz \n A...
137  Two pre - Columbian antiquities offered for sa...
138  A classic Maya stela , cut into pieces for tra...
139  Also known as : Silver Hoard of Everbeek , & Z...

[140 rows x 1 columns]
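
Note that `os.listdir` returns files in arbitrary order and the function above discards the file names, so a row index (which later becomes `article_id`) cannot easily be traced back to its source file. A possible variant, sketched under the assumption that you want that traceability; `read_files_with_names` is a hypothetical name:

```python
from pathlib import Path


def read_files_with_names(folder_path: str) -> list[dict]:
    """Read every .txt file in a folder, in sorted order, keeping its file name."""
    records = []
    for path in sorted(Path(folder_path).glob("*.txt")):
        records.append({
            "file_name": path.name,
            "content": path.read_text(encoding="utf-8"),
        })
    return records


# records = read_files_with_names("resolved")
# df = pd.DataFrame(records)  # columns: file_name, content
```

The `content` column keeps the same name, so the rest of the notebook should work unchanged with a dataframe built this way.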


### OR use an existing csv

You'll have to examine the csv to work out which columns you want to process, eg ```df["Provenance"][i]```. If you want to process all the data in a row, then use ```df.iloc[i]``` in the appropriate places below.

In [None]:
## may 27 - let's try a csv
import pandas as pd
df = pd.read_csv('/content/openartdata-latchford-et-al.csv')

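## this csv has a column 'Provenance' and another 'Provenance Plus'
## let's merge them so that we can process both columns of data.
## fillna('') stops an empty cell in one column from turning the merged value into NaN,
## and the space keeps the two texts from running together

df['to_analyze'] = df['Provenance'].fillna('') + ' ' + df['Provenance Plus'].fillna('')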

In [None]:
df
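
If you pass a whole row to the prompt, it may help to turn the row into labelled text first, so the model sees which column each value came from and empty cells are dropped. A minimal sketch; `row_to_text` is a hypothetical helper and the sample row is made up for illustration:

```python
import math


def row_to_text(row: dict) -> str:
    """Join a row's non-empty cells into labelled lines of text."""
    parts = []
    for column, value in row.items():
        # Skip missing cells: None, NaN (how pandas marks empty cells), or blank strings.
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        text = str(value).strip()
        if text:
            parts.append(f"{column}: {text}")
    return "\n".join(parts)


# Made-up example row; with a real dataframe use row_to_text(df.iloc[i].to_dict())
example_row = {"Provenance": "Sold at auction, 1988", "Provenance Plus": float("nan")}
text = row_to_text(example_row)
```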

## Set model, llm, parser

In [9]:
## CONFIG for all scenarios
import importlib
import sys
importlib.reload(sys.modules['prompt'])
importlib.reload(sys.modules['schemas'])
from prompt import DEFAULT_BASE_PROMPT, create_prompt
from schemas import CulturalHeritageSchema

#model_name = "gpt-4o"# for openai
model_name = "llama3-70b-8192" #mixtral-8x7b-32768 has a larger context window, might be better?
temperature = 0
#llm = ChatOpenAI(api_key=OPENAI_API_KEY, model_name=model_name, temperature=temperature) # when using OpenAi
llm = ChatGroq(api_key=GROQ_API_KEY, model_name=model_name, temperature=temperature) #when using GROQ
parser = PydanticOutputParser(pydantic_object=CulturalHeritageSchema)

## Test on a single line of your data

Modify as appropriate (point it to the correct column of your data, or an entire row of your data)

In [None]:
## run this to create the prompt for testing on a single line of the df

##
prompt = create_prompt(DEFAULT_BASE_PROMPT, parser, df["content"][0]) #for when the df is the individual text files loaded in
#prompt = create_prompt(DEFAULT_BASE_PROMPT, parser, df.iloc[0]) #trying to read in the first row of a regular csv, for testing
prompt


In [None]:
example = await run(llm, prompt, parser)
example

## Process stuff en masse

The LLM is *pretty* good at returning only relationships defined by the schema, but it's not 100%. The code here will iterate over all of your data, and when things extract correctly per the schema, it will write the result to a dataframe. Errors are gathered into a separate dataframe so you can examine them later. Keep an eye on things.

**Warning:** This can take a long time, depending on your source data and the rate limits of the LLM provider you're using.
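
One common way to cope with rate limits is to retry a failed call after an increasing pause. The sketch below is an assumption-laden illustration, not part of the repo: `with_retries` is a hypothetical helper, and it only helps if rate-limit failures surface as exceptions from `run`, which is not confirmed here.

```python
import asyncio


async def with_retries(make_call, attempts: int = 3, base_delay: float = 1.0):
    """Await make_call(), retrying with exponential backoff if it raises."""
    for attempt in range(attempts):
        try:
            return await make_call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait 1x, 2x, 4x ... the base delay before trying again.
            await asyncio.sleep(base_delay * 2 ** attempt)


# result = await with_retries(lambda: run(llm, prompt, parser))
```

`make_call` is a zero-argument callable rather than a coroutine object because a coroutine can only be awaited once; each retry needs a fresh one.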

**OpenAI GPT-4o returns the best results**. In my experiments, using the Trafficking Culture data as a baseline, GPT-4o cost almost \$3 to process (141 files), returning nearly 2000 triples, with only 80 containing relationships not envisioned by the schema. By contrast, 800 rows of provenance data from Open Art Data took about the same length of time to run, but cost closer to \$10. There were only two rows returned with relationships not in the schema.
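
To put those figures on a per-item footing, here is a rough back-of-envelope calculation. The dollar amounts are the ones recorded in the run notes further down this notebook; the triple counts are the rounded values quoted above, so treat the results as approximate.

```python
# Figures quoted in this notebook; counts of triples are rounded ("nearly 2000", "only 80").
tc_cost, tc_files = 3.12, 141
tc_triples, tc_off_schema = 2000, 80
oad_cost, oad_rows = 8.90, 800

cost_per_file = tc_cost / tc_files            # Trafficking Culture articles
cost_per_row = oad_cost / oad_rows            # Open Art Data provenance rows
off_schema_rate = tc_off_schema / tc_triples  # share of triples outside the schema

print(f"Trafficking Culture: ${cost_per_file:.3f} per file")
print(f"Open Art Data:       ${cost_per_row:.3f} per row")
print(f"Off-schema share of triples: {off_schema_rate:.0%}")
```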

---

When it's done, you'll have two new files in the file tray:

```clean-results-out.csv``` and ```errors-to-examine.csv```. They'll use a semicolon to separate fields. You can import into excel, specify the semicolon as delimiter, then use filters to examine things.

Line 58 of the cell below (the line that begins ```content = df["content"][i]```) has to be modified to point to the column of your dataframe that you want processed (or, alternatively, to the whole row, if you want every column dealt with),

eg,

```df["Provenance"][i]```

vs

```df.iloc[i]```

In [17]:
### code for processing the df-from-appended-texts that writes
### errors to a separate dataframe for manual evaluation

import pandas as pd
import asyncio
import nest_asyncio
from pydantic import BaseModel, ValidationError
from typing import List, Union

# Allow nested use of asyncio in environments like Jupyter
nest_asyncio.apply()

raw_outputs = []
outputs = []
errors = []

# Placeholder models (add appropriate schema attributes). NOTE: this local CulturalHeritageSchema shadows the one imported from schemas
class Entity(BaseModel):
    name: str

class Relation(BaseModel):
    name: str

class Triplet(BaseModel):
    entity1: Entity
    relation: Relation
    entity2: Entity

class CulturalHeritageSchema(BaseModel):
    triplets: List[Triplet]

# Assuming you have an asynchronous function to run the model
async def process_content(content: str, idx: int, llm, parser):
    prompt = create_prompt(DEFAULT_BASE_PROMPT, parser, content)
    result = await run(llm, prompt, parser)
    raw_outputs.append(result)  # Store raw output for logging or debugging purposes

    output_map = {
        "article_id": idx,
    }

    try:
        validated_result = CulturalHeritageSchema.parse_obj(result)
        output = output_map | validated_result.dict()
        outputs.append(output)
        return validated_result
    except ValidationError as e:
        # Handle validation error and log it
        output_map["raw_llm_output"] = result
        output_map["validation_errors"] = e.errors()
        errors.append(output_map)
        return None

async def main():
    processed_triplets = []

    for i in range(df.shape[0]):
        content = df["content"][i] # change to df.iloc[i] for ordinary csv, as from openartdata OR change to df["Provenance"][i] when you know the exact column you want
        result = await process_content(content, i, llm, parser)

        if result:  # Only add if the result is valid
            for triplet in result.triplets:
                processed_triplets.append({
                    "article_id": i,
                    "entity1": triplet.entity1.name,
                    "relation": triplet.relation.name,
                    "entity2": triplet.entity2.name
                })

    # Create DataFrame with columns entity1, relation, entity2
    result_df = pd.DataFrame(processed_triplets, columns=["article_id", "entity1", "relation", "entity2"])

    # Create DataFrame for errors
    error_df = pd.DataFrame(errors, columns=["article_id", "raw_llm_output", "validation_errors"])

    return result_df, error_df

# Function to run the async function in both environments
def run_main_loop():
    # Get or create the event loop
    try:
        # If there's already an event loop, use it
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # If no event loop is running, create one
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)

    result_df, error_df = loop.run_until_complete(main())
    return result_df, error_df

# Run the main async function
result_df, error_df = run_main_loop()
print("Result DataFrame:")
print(result_df)
print("\nError DataFrame:")
print(error_df)

Error in parsing Failed to parse CulturalHeritageSchema from completion {"triplets": [{"entity1": {"name": "Looters"}, "relation": {"name": "loots"}, "entity2": {"name": "Ballcourt Marker"}}, {"entity1": {"name": "Looters"}, "relation": {"name": "sells"}, "entity2": {"name": "Antiquities Dealer"}}, {"entity1": {"name": "Antiquities Dealer"}, "relation": {"name": "has_possession_of"}, "entity2": {"name": "Ballcourt Marker"}}, {"entity1": {"name": "Arthur Demarest"}, "relation": {"name": "works_with"}, "entity2": {"name": "Vanderbilt University"}}, {"entity1": {"name": "Arthur Demarest"}, "relation": {"name": "works_with"}, "entity2": {"name": "Universid\u00e1d del Valle"}}, {"entity1": {"name": "Arthur Demarest"}, "relation": {"name": "works_with"}, "entity2": {"name": "Guatemalan Ministry of Culture"}}, {"entity1": {"name": "District Governor"}, "relation": {"name": "is_the_owner_of"}, "entity2": {"name": "Drug Trafficking Operation"}}, {"entity1": {"name": "Drug Traffickers"}, "relati

In [None]:
## write the outputs to csv files.
## we're using the semicolon as the column separator
## because the llm might sometimes use commas as part
## of an entity name and so on. REGEX in sublime to do final tidying up.
## alternatively, import csv into excel, specify ; as separator
## and use filters to examine results and decide what to do with them.
## source article id / row ids are included in output
result_df.to_csv('clean-results-out.csv', index=True, sep=";")
error_df.to_csv('errors-to-examine.csv', index=True, sep=";")
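
If you would rather stay in Python than move to Excel, the semicolon-delimited file can be read back with the standard `csv` module. A small sketch that tallies how often each relation occurs; `count_relations` is a hypothetical helper, and it assumes the column layout written by the cell above (an unnamed index column followed by `article_id`, `entity1`, `relation`, `entity2`):

```python
import csv
from collections import Counter


def count_relations(csv_path: str) -> Counter:
    """Count how often each relation appears in a semicolon-delimited triples file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter=";")
        return Counter(row["relation"] for row in reader)


# counts = count_relations("clean-results-out.csv")
# counts.most_common(10)
```

Because the delimiter is passed explicitly, commas inside entity names are left alone.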

## earlier experiments, ignore

In [None]:
# ## on all the csv?
raw_outputs = []
outputs = []

for i in range(df.shape[0]):

     prompt = create_prompt(DEFAULT_BASE_PROMPT, parser, df.iloc[i])
     result = await run(llm, prompt, parser)
     raw_outputs.append(result)

     output_map = {
         "article_id":i,
     }

     if isinstance(result, CulturalHeritageSchema):
         output =  output_map | result.dict()
         outputs.append(output)
     else:
         output_map["raw_llm_output"] = result
         outputs.append(output_map)

     time.sleep(1)

In [None]:
#cost was 8.90 on may 27th with gpt-4o
# only two lines with errors
outputs

with open('openartdata-latchford-et-al-out.json', 'w') as outfile:
  outfile.write('\n'.join(str(i) for i in outputs))


In [None]:
##-----
## for when the df is created by appending individual texts
## use this with the dataframe created by appending texts together
import pandas as pd
import asyncio
import nest_asyncio

# Allow nested use of asyncio in environments like Jupyter
nest_asyncio.apply()

raw_outputs = []
outputs = []

# Assuming you have an asynchronous function to run the model
async def process_content(content: str, idx: int, llm, parser):
    prompt = create_prompt(DEFAULT_BASE_PROMPT, parser, content)
    result = await run(llm, prompt, parser)
    raw_outputs.append(result)  # Store raw output for logging or debugging purposes

    output_map = {
        "article_id": idx,
    }

    if isinstance(result, CulturalHeritageSchema):
        output = output_map | result.dict()
        outputs.append(output)
    else:
        output_map["raw_llm_output"] = result
        outputs.append(output_map)

    return result

async def main():
    # Placeholder for the processed triplets
    processed_triplets = []

    for i in range(df.shape[0]):
        content = df["content"][i]
        result = await process_content(content, i, llm, parser)

        # Only continue if the result is of the expected type
        if isinstance(result, CulturalHeritageSchema):
            for triplet in result.triplets:
                processed_triplets.append({
                    "article_id": i,
                    "entity1": triplet.entity1.name,
                    "relation": triplet.relation.name,
                    "entity2": triplet.entity2.name
                })

    # Create DataFrame with columns entity1, relation, entity2
    result_df = pd.DataFrame(processed_triplets, columns=["article_id", "entity1", "relation", "entity2"])

    return result_df

# Function to run the async function in both environments
def run_main_loop():
    # Get or create the event loop
    try:
        # If there's already an event loop, use it
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # If no event loop is running, create one
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)

    return loop.run_until_complete(main())

# Run the main async function
result_df = run_main_loop()
print(result_df)

Cost: $3.12 for files in coref-tc.zip with gpt-4o

In [None]:
result_df.to_csv('out.csv', index=True, sep=";")

In [16]:
result_df

NameError: name 'result_df' is not defined