<a href="https://colab.research.google.com/github/shawngraham/homecooked-history/blob/main/structured_data_extractor_using_groq_and_llm_and_coreferee.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Save a copy of this notebook first, and then fire up your copy so you can save your work or any modifications you make.**

This notebook does some interesting things. I have found, through experiments, that while LLMs get us _some_ of the way towards one-shot or zero-shot structured data extraction from unstructured texts (e.g., historical documents, articles, reports), performing coreference resolution on the text first gets us _much_ closer to what we want. And using a dedicated coreference model works much better than trying to get an LLM to do it.

This notebook demonstrates a flow, and saves our work at each step for subsequent examination.

1. use templates with an LLM to massage the text so that any instance of e.g. 'Graham' gets switched to 'Shawn_Graham' (likewise for organizations), and to strip scholarly citations; return modified text
2. use coreference resolution to sort out pronoun/noun agreement etc.; return modified text
3. pass that modified text through a prompt that defines our desired list of predicates. Since steps 1 and 2 reduce opportunities for confusion over who or what is doing the acting, the results of 3 tend to be higher quality than would otherwise be the case. Write to csv.
4. pass the csv through a 'checker' that marks rows that are either not well formed or use a predicate not in our desired list.

The user can then manually inspect the results to easily find rows that need fixing, and can decide how to deal with them.

For access to the Llama3.1-70b model, get a free developer key from [Groq](https://console.groq.com/playground). You'll paste it in below when you run the command `llm keys set groq`. (paste, hit enter; the cell will stop running without further comment once you do).

Dec 3 addition

~~Seeing what happens when the preprocessed text is run through nuextract~~ nothing useful

Dec 4

Use Gemini. Add rate limits to stay within the lines.
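
The rate limiting mentioned here can be done in several ways. What follows is a minimal sketch of one of them: a small throttle that makes sure consecutive calls are spaced at least a fixed number of seconds apart. `MinIntervalThrottle` is a hypothetical helper name, and the 4-second interval is an assumed figure for illustration, not Gemini's actual quota; check the limits that apply to your own key.

```python
import time


class MinIntervalThrottle:
    """Block so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable, so the class can be tested without waiting
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# Hypothetical usage: at most one request every 4 seconds (an assumed figure).
throttle = MinIntervalThrottle(min_interval=4.0)
# for paragraph in paragraphs:
#     throttle.wait()
#     response = model.prompt(paragraph)
```

Calling `throttle.wait()` immediately before each request is enough; the first call never sleeps.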

## initialize

In [1]:
%%capture
!pip install llm
#!llm install llm-groq


In [2]:
import spacy
spacy.prefer_gpu()

True

In [25]:
%%capture
!pip install --upgrade pydantic

In [4]:
%%capture
#!pip install -U spacy
!python3 -m pip install coreferee
!python3 -m coreferee install en
!python -m spacy download en_core_web_trf
!python -m spacy download en_core_web_lg


In [None]:
#!llm keys set groq

Enter key: 


In [None]:
# set llama3.1-70b as default model
#!llm aliases set themodel groq-llama3.1-70b

#models via groq are so much faster! You might wish to experiment with other options.
#eg

#!llm install llm-gguf
#!llm gguf download-model https://huggingface.co/lmstudio-community/SmolLM2-1.7B-Instruct-GGUF/resolve/main/SmolLM2-1.7B-Instruct-Q8_0.gguf -a smol17

# now available via llm -m smol17
# but it _will_ be much slower

In [None]:
# let's try llm-gemini
!llm install llm-gemini



In [None]:
!llm keys set gemini

In [None]:
!llm models

In [6]:
!llm aliases set themodel gemini-1.5-pro-002

## test text

We create a folder called 'source-texts' and, for testing, put the full text of an article about Giacomo Medici into it ([via Trafficking Culture](https://traffickingculture.org/encyclopedia/case-studies/giacomo-medici/)). Note: while that article begins with 'Medici started dealing', I added the word 'Giacomo' so that the model doesn't confuse him with other famous Medicis. The same cell also writes a second test text, about Marion True, to `true.txt`.

You would put your own source txt files into this folder.

In [7]:
!mkdir source-texts
!echo -e """Giacomo Medici started dealing in antiquities in Rome during the 1960s (Silver 2009: 25). In July 1967, he was convicted in Italy of receiving looted artefacts, though in the same year he met and became an important supplier of antiquities to US dealer Robert Hecht (Silver 2009: 27-9). In 1968, Medici opened the gallery Antiquaria Romana in Rome and began to explore business opportunities in Switzerland (Silver 2009: 34). It is widely believed that in December 1971 he bought the illegally-excavated Euphronios (Sarpedon) krater from tombaroli before transporting it to Switzerland and selling it to Hecht (Silver 2009: 50).\n\n In 1978, he closed his Rome gallery, and entered into partnership with Geneva resident Christian Boursaud, who started consigning material supplied by Medici for sale at Sotheby’s London (Silver 2009: 121-2, 139; Watson and Todeschini 2007: 27). Together, they opened Hydra Gallery in Geneva in 1983 (Silver 2009: 139). It has been estimated that throughout the 1980s Medici was the source of more consignments to Sotheby’s London than any other vendor (Watson and Todeschini 2007: 27). At any one time, Boursaud might consign anything up to seventy objects, worth together as much as £500,000 (Watson 1997: 112). Material would be delivered to Sotheby’s from Geneva by courier (Watson 1997: 112). \n\n In October 1985, the Hydra Gallery sold fragments of the Onesimos kylix to the J. Paul Getty Museum for $100,000, providing a false provenance by way of the fictitious Zbinden collection, a provenance that was sometimes used for material offered at Sotheby’s (Silver 2009: 145; Watson and Todeschini 2007: 95). The Getty returned the kylix to Italy in 1999. 
\n\n In 1986, bad publicity surrounding the sale of looted Apulian vases at Sotheby’s London caused Medici and Boursaud to part company, and Medici bought the Geneva-based Editions Services to continue consigning material to Sotheby’s (Silver 2009: 147; Watson and Todeschini 2007: 27; Watson 1997: 117, 183-6). From 1987 until 1994, he was also consigning material to Sotheby’s through other ‘front companies’, including Mat Securitas, Arts Franc and Tecafin Fiduciaire (Watson and Todeschini 2007: 73). He developed a triangulating system of consigning through one company and purchasing the same piece through another company. There were two potentially positive outcomes of this triangulation manoeuvre: first, it artificially created demand, suggesting to potential customers that the market was stronger than it actually was; and second, it was a way of providing illegally-excavated or -exported pieces with a ‘Sotheby’s’ provenance, and, in effect, laundering them (Watson and Todeschini 2007: 135-41). \n\n By the late 1980s, Medici had developed commercial relations with other major antiquities dealers including Robin Symes, Frieda Tchacos, Nikolas Koutoulakis, Robert Hecht, and the brothers Ali and Hicham Aboutaam (Watson and Todeschini 2007: 73-4). He was the ultimate source of artefacts that would subsequently be sold through dealers or auction houses to private collectors, including Lawrence and Barbara Fleischman, Maurice Tempelsman, Shelby White and Leon Levy, the Hunt brothers, George Ortiz, and José Luis Várez Fisa (Watson and Todeschini 2007: 112-34; Isman 2010), and to museums including the J. Paul Getty, the Metropolitan Museum of Art, the Cleveland Museum of Art, and the Boston Museum of Fine Arts. \n\n In 1995, a Sotheby’s London auction catalogue advertised for sale a sarcophagus recognized by the Carabinieri to have been stolen from the church of San Saba, in Rome. 
Sotheby’s informed the Carabinieri that it had been consigned by Editions Services (Watson and Todeschini 2007: 19). This was around the same time that the ‘organigram’ had been discovered, revealing Medici’s central position in the organisation of the antiquities trade out of Italy (Watson and Todeschini 2007: 19), and putting the evidence together, the Carabinieri decided to act. On 13 September 1995, in concert with Swiss police, they raided Medici’s storage space in the Geneva Freeport, which comprised five rooms with a combined area of about 200 sq metres (Silver 2009: 174; Watson and Todeschini 2007: 20). One room was equipped as a laboratory for cleaning and restoring artefacts, another was fitted out as a showroom, presumably for receiving potential customers (Silver 2009: 180-1). In January 1997, Medici was arrested in Rome (Silver 2009: 175-6), and in July 1997, his Geneva storerooms, which had remained sealed since 1995, were opened again for the process of examination and inventory. \n\n The official report of the contents of Medici’s storerooms was submitted in July 1999. The storerooms had been found to contain 3,800 whole or fragmentary objects, more than 4,000 photographs of artefacts, and 35,000 sheets of paper containing information relating to Medici’s business practices and connections. The artefacts were mainly from Italy, but there were also hundreds from Egypt, Syria, Greece and Asia. The Swiss authorities turned over Italian material to Italy, but returned the rest to Medici (Silver 2009: 192). The photographs were mainly Polaroids, showing what appeared to be illegally-excavated artefacts, sometimes with several views of the same one, in various stages of restoration. Some artefacts were shown still covered with dirt after their excavation, some fragmentary, and others cleaned and reassembled prior to sale (Watson and Todeschini 2007: 54-68). In 2002, Carabinieri raided Medici’s home in Santa Marinella (Watson and Todeschini 2007: 200). 
\n\n Medici was charged with receiving stolen goods, illegal export of goods, and conspiracy to traffic, and his trial in Rome commenced on 4 December 2003. On 12 May 2005, he was found guilty of all charges. The judge declared that Medici had trafficked thousands of artefacts, including the sarcophagus fragment that had started the investigation, and the Euphronios (Sarpedon) krater (Silver 2009: 212). He was sentenced to ten years in prison and received a €10 million fine, with the money going to the Italian state in compensation for damage caused to cultural heritage (Silver 2009: 214). In July 2009, an appeals court in Rome dismissed the trafficking conviction against him because of the expired limitation period, but reaffirmed the convictions for receiving and conspiracy. His jail sentence was reduced to eight years, but the €10 million fine remained in place (Scherer 2009). In December 2011, a further appeal failed (Felch 2012). \n\n The evidence recovered during the investigation into Medici’s business was instrumental in forcing several museums and private collectors to return artefacts to Italy, and triggered further investigations and ultimately the prosecutions of Marion True and Robert Hecht.""" > source-texts/giacomo.txt
!echo -e """Marion True was appointed curatorial assistant at the J. Paul Getty Museum in 1982, under the supervision of Jiri Frel, the then Curator of Antiquities. After Frel’s departure in 1986, True was promoted to replace him as curator (Felch and Frammolino 2011: 77-8). During the period of her curatorship, True was responsible for some controversial acquisitions of unprovenanced material, including the 1988 purchase of the Getty Aphrodite, the 1993 purchase of a fourth-century BC gold funerary wreath from Greece, and the 1996 acquisition of the Barbara and Lawrence Fleischman Collection. She also rejected some potentially high-profile assemblages, including the Kanakaria mosaics in 1988, which she recognized as stolen (Felch and Frammolino 2011: 115-17), and the Sevso Treasure, when she discovered that the accompanying documents of provenance were probably forgeries (True 1997: 140). She was prepared to return material to its country of origin if it was shown convincingly to have an illicit provenance, including several hundred ceramic fragments acquired by donation between 1979 and 1981, that in 1994 were found to have been looted from a sanctuary at Francavilla Maritima in Italy (Lyons 2010: 422-5; True 1997: 143). True was the driving force behind the Getty’s adoption of clear policy guidelines as regards its acquisition of unprovenanced antiquities. The first version, in 1987, which was believed at the time by True to be the only policy of its kind in place at a major collecting museum, required the Getty to notify in writing the appropriate authority of a possible country of origin about a potential acquisition, and request information about theft or illegal export. The acquisition would only proceed if no such information was forthcoming. 
Furthermore, if at any time after acquisition a country could make a verifiable claim of theft or illegal export, the Getty would return the object in question, notwithstanding any legal protection offered by a statute of limitations (True 1997: 138). This policy was strengthened in November 1995 by the assurance that an unprovenanced object would only be acquired if it had been published or otherwise publicly documented as out of its country of origin prior to November 1995 (True 1997: 138). This requirement for public documentation was intended to protect against forged provenances, which True believed were rife in the antiquities trade (Kaufman 1996). Several pieces were returned in accordance with the policy (Lee 1999). Nevertheless, the sincerity of the Getty’s motives was called into question by the 1996 acquisition of the Fleischman Collection, comprising largely unprovenanced material, which had been published only in 1994 by the Getty itself. True was offered the position of Curator of Greek and Roman Art at the Metropolitan Museum in New York in 1991 after Dietrich von Bothmer’s retirement the year before, but declined the offer when she was presented with the opportunity to oversee the projected renovation and redesign of the Getty Villa in Malibu, which was to house the Getty’s antiquities collection (Eakin 2007). The project cost $275 million and the Villa reopened in January 2006 (Felch and Frammolino 2011: 273-7). True suggested that the 1995 revision of the acquisitions policy was part of a larger change in mission occasioned by this reimagining of the Getty Villa, which envisaged a shift in primary purpose for the museum from collecting antiquities to conservation abroad and incoming loan exhibitions (Somers Cocks 1995: 6). 
In 1995, True bought a house on the Greek island of Paros with the help of a four-year loan of $400,000 from Christos Michaelides, partner of antiquities dealer Robin Symes, from whom the Getty had bought several important pieces, including the Aphrodite (Felch and Frammolino 2011: 135-8; Watson and Todeschini 2007: 288-9). Just days after the acquisition of the Fleischman Collection in 1996, she paid back Michaelides with money borrowed from the Fleischmans on a twenty-year mortgage (Felch and Frammolino 2011: 146). When these loans came to the attention of the Getty trustees in September 2005, True was fired for failing to declare them in contravention of the Getty’s conflict-of-interest policy (Felch and Frammolino 2011: 266). On 1 April 2005, a few months before her dismissal, True was charged in Italy with receiving stolen antiquities and conspiring with dealers Robert Hecht and Giacomo Medici to receive stolen antiquities, and she was ordered to stand trial in Rome (Felch and Frammolino 2011: 259; Wilkinson and Muchnic 2005). The case against True had materialized as Italian investigators working through photographic and documentary material seized from Medici’s Geneva storerooms began to suspect her involvement. It was quickly recognized that True had acquired a  fifth-century BC bronze tripod and candelabrum for the Getty in 1990 that had been stolen from the long established Guglielmo Collection in Italy (Felch and Frammolino 2011: 153-4). Letters were also discovered revealing what appeared to be friendly relations between True, Medici and Hecht (Felch and Frammolino 2011: 212-13; Watson and Todeschini 2007: 85, 98). Finally, there was a set of photographs recording forty-two objects that had passed through the hands of Medici before ultimately being acquired by the Getty (Felch and Frammolino 2011: 197; Watson and Todeschini 2007: 87). 
The investigators also came to believe that prior to 1996 True had been encouraging the Fleischmans to buy objects of dubious provenance in the knowledge that they would ultimately be donated to the Getty—in effect, using the Fleischman Collection to launder potentially illicit material (Felch and Frammolino 2011: 257-9). True argued in her defence that staying on good terms with antiquities dealers was a professional requirement of her position as curator (True 2011), that she had not acquired objects for her own benefit, but for the museum, and that she should not take sole responsibility for acquisitions made during her tenure as curator, as they had all been made with the approval of the Getty CEO (Harold Williams until the end of 1997, Barry Munitz until 2006), Director (John Walsh until September 2000, Deborah Gribbon until 2004), in-house counsel, and Board of Trustees (Eakin 2010). This statement was supported by internal Getty documentation (Felch and Frammolino 2011: 218). Indeed, many of the objects in the photographs seized at Medici’s storerooms had been acquired before True was curator (Felch and Frammolino 2011: 198, 248). Both True and Barbara Fleischman rejected out of hand any imputation of collusion (Felch and Frammolino 2011: 254-5). The trial commenced on 16 November 2005, and was abandoned without verdict on 13 October 2010 as the limitation period on True’s alleged offences expired (Eakin 2010; Felch and Frammolino 2011: 312). True complained that she had been ‘neither condemned nor vindicated’ (True 2011). In November 2006, Greek prosecutors charged True in connection with the fourth-century BC gold funerary wreath acquired in 1993, which was by then believed to have been taken out of Greece illegally (Felch and Frammolino 2011: 290; Zirganos 2007: 320). In November 2007, her trial was ended without resolution after the expiry of the statute of limitations (Felch 2007; Felch and Frammolino 2011: 306) ."""  > source-texts/true.txt

## pre-process the text

We're going to do a bit of munging to make sure the text is as good as it's going to get before we try extracting data.

In [8]:
!llm -m themodel 'is this thing on? Which model are you?'

Yes, this thing is on!  I'm a large language model, trained by Google.



We create a template called 'namefix4' for the LLM to apply so that any time our text says something like

'John Smith did x. Later Smith did y'

...all references to 'John Smith' _or_ 'Smith' become John_Smith. Ditto for organizations, which get their own template ('orgnamefix4').
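
To make the target concrete, here is a toy, non-LLM sketch of the same transformation for a single person whose name you already know. The notebook itself does this with an LLM template, which also has to discover the names; `normalize_person` is a hypothetical helper used only to illustrate the expected output.

```python
import re


def normalize_person(text, full_name):
    """Replace a full name, and surname-only mentions, with Firstname_Surname."""
    token = full_name.replace(" ", "_")
    surname = full_name.split()[-1]
    # Try the full name first, then fall back to the bare surname.
    pattern = re.compile(rf"\b(?:{re.escape(full_name)}|{re.escape(surname)})\b")
    return pattern.sub(token, text)


print(normalize_person("John Smith did x. Later Smith did y.", "John Smith"))
# John_Smith did x. Later John_Smith did y.
```

Because the underscore counts as a word character, running the function a second time leaves an already normalized `John_Smith` untouched.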

In [19]:
#define the template
!llm --system """Replace all references to individuals in this text with a consistent firstname_surname format. Rules: 1. First full name mention: Replace with 'firstname_surname' 2. Subsequent surname-only mentions: Replace with the same 'firstname_surname' 3. Preserve original text structure and context 4. Ensure replacements are uniform throughout the text Examples - 'Trudy True arrived late' -> 'Trudy_True arrived late' - 'True apologized' -> 'Trudy_True apologized' RETURN ONLY the modified text output.""" --save namefix4


In [15]:
!llm --system """Replace all references to organizations in this text with a consistent full_name format. Rules: 1. Do not adjust personal individual names. 2. First full organization name -> mention: Replace with 'firstword_secondword' and so on for the full organization name 3. Subsequent shortname-only mentions: Replace with 'firstword_secondword' and so on for the full organization name 4. Expand abbreviations fully when they refer to such an organization. 5. Preserve original text structure and context 6. Ensure replacements are uniform throughout the text. Examples -'The Ottawa Art Gallery opened in 2010' -> 'The Ottawa_Art_Gallery opened in 2010'. RETURN ONLY the modified text output."""  --save orgnamefix4

We create a template to remove scholarly citations, which can cause trouble elsewhere.

In [12]:
##a template for removing scholarly citations
!llm --system "Return the complete text but remove scholarly citations. Scholarly citations look similar to this: (Jung, 2010, p. 4) and are often a surname, a date, a page range in parenthesis. Rules: 1. Do not adjust personal individual names. 2. When encountering a scholarly citation, remove it. Example - 'John_Smith argued differently (Jones 2012).' -> 'John_Smith argued differently.'" --save cleaner
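
As a cheap cross-check on the `cleaner` template, a regular expression can catch the author-year citations that appear in the two sample texts. This is only a sketch: the pattern is my own assumption, tuned to the `(Silver 2009: 25)` style, and it will miss other styles such as `(Jung, 2010, p. 4)`. It could be used to count citations that survive the LLM pass rather than to replace it.

```python
import re

# One "Author(s) YEAR[: pages]" unit, e.g. "Watson and Todeschini 2007: 27".
_SEGMENT = r"[A-Z][\w'’.-]+(?: (?:and )?[A-Z][\w'’.-]+)* \d{4}(?:: ?[\d\-–, ]+)?"
# A parenthetical made only of such units, separated by semicolons.
CITATION = re.compile(rf"\s*\({_SEGMENT}(?:; {_SEGMENT})*\)")


def strip_citations(text):
    """Remove author-year parenthetical citations; leave other parentheses alone."""
    return CITATION.sub("", text)


sample = "He bought the Euphronios (Sarpedon) krater (Silver 2009: 50)."
print(strip_citations(sample))
# He bought the Euphronios (Sarpedon) krater.
```

Parentheticals that are not citations, such as `(Sarpedon)`, are left in place because they contain no author-year unit.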

In [16]:
!llm templates

cleaner     : system: Return the complete text but remove scholarly citations. Scholarly citation...
namefix4    : system: Replace all references to individuals in this text with a consistent firstn...
orgnamefix4 : system: Replace all references to organizations in this text with a consistent full...


In [17]:
#source-texts -> namefixed -> orgfixed -> cleaned -> resolved -> results -> checked -> manually fix things -> finished
!mkdir namefixed
!mkdir orgfixed
!mkdir cleaned

In [21]:
# run the name fix
import os
import subprocess

for filename in os.listdir("source-texts"):
    if filename.endswith(".txt"):
        input_path = os.path.join("source-texts", filename)
        output_path = os.path.join("namefixed", filename[:-4] + "_namefixed.txt")

        command = f"cat {input_path} | llm -m themodel -t namefix4 > {output_path}"
        subprocess.run(command, shell=True)

for filename in os.listdir("namefixed"):
    if filename.endswith(".txt"):
        input_path = os.path.join("namefixed", filename)
        output_path = os.path.join("orgfixed", filename[:-4] + "_orgfixed.txt")

        command = f"cat {input_path} | llm -m themodel -t orgnamefix4 > {output_path}"
        subprocess.run(command, shell=True)


for filename in os.listdir("orgfixed"):
    if filename.endswith(".txt"):
        input_path = os.path.join("orgfixed", filename)
        output_path = os.path.join("cleaned", filename[:-4] + "_cleaned.txt")

        command = f"cat {input_path} | llm -m themodel -t cleaner > {output_path}"
        subprocess.run(command, shell=True)
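
The loops above build shell command strings from filenames, which breaks if a filename contains spaces or quotes. A possible alternative, sketched here under the assumption that the `llm` CLI is on the path, passes an argument list to `subprocess.run` and wires the files in as stdin and stdout. `stage_output_path` and `run_stage` are hypothetical helper names; the `-m` and `-t` flags are the ones already used above. Note that `check=True` makes a failed call raise an error instead of passing silently, which is a change from the original behaviour.

```python
import os
import subprocess


def stage_output_path(filename, out_dir, suffix):
    """Build e.g. namefixed/giacomo_namefixed.txt from giacomo.txt."""
    stem, ext = os.path.splitext(filename)
    return os.path.join(out_dir, f"{stem}_{suffix}{ext}")


def run_stage(in_dir, out_dir, template, suffix, model="themodel"):
    """Pipe every .txt file in in_dir through `llm -t template` into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for filename in sorted(os.listdir(in_dir)):
        if not filename.endswith(".txt"):
            continue
        in_path = os.path.join(in_dir, filename)
        out_path = stage_output_path(filename, out_dir, suffix)
        # No shell is involved, so spaces or quotes in filenames are harmless.
        with open(in_path, "rb") as src, open(out_path, "wb") as dst:
            subprocess.run(["llm", "-m", model, "-t", template],
                           stdin=src, stdout=dst, check=True)


# The same three stages as above (commented out so nothing runs by accident):
# run_stage("source-texts", "namefixed", "namefix4", "namefixed")
# run_stage("namefixed", "orgfixed", "orgnamefix4", "orgfixed")
# run_stage("orgfixed", "cleaned", "cleaner", "cleaned")
```

The suffixes accumulate from stage to stage, which is why the later cells refer to files such as `giacomo_namefixed_orgfixed_cleaned.txt`.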

## coreference resolution

Coreference resolution involves figuring out which pronouns go with what nouns, and replacing them with the nouns. So, 'John Smith worked in Ottawa. Later he moved to Montreal' _should_ become 'John Smith worked in Ottawa. Later John Smith moved to Montreal'.

You might want to snoop in the 'resolved' folder to see how well this has worked.
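
One thing to look for when you snoop: the cell below joins tokens with single spaces, which detaches punctuation from the preceding word ('Ottawa .'). spaCy tokens carry their own trailing whitespace (`token.whitespace_`), so the text can be rebuilt more faithfully. The sketch below shows the idea with plain `(text, trailing_whitespace)` tuples standing in for spaCy tokens and a dict standing in for the coreference resolver; both stand-ins, and the `rebuild` helper itself, are assumptions for illustration only.

```python
def rebuild(tokens, replacements):
    """Rebuild text from (text, trailing_whitespace) pairs.

    `replacements` maps a token index to the list of strings it refers to,
    mimicking what a coreference resolver would return for a pronoun.
    """
    pieces = []
    for i, (text, ws) in enumerate(tokens):
        if i in replacements:
            text = " and ".join(replacements[i])
        pieces.append(text + ws)
    return "".join(pieces)


tokens = [("John", " "), ("Smith", " "), ("worked", " "), ("in", " "),
          ("Ottawa", ""), (".", " "), ("Later", " "), ("he", " "),
          ("moved", " "), ("to", " "), ("Montreal", ""), (".", "")]

print(rebuild(tokens, {7: ["John Smith"]}))
# John Smith worked in Ottawa. Later John Smith moved to Montreal.
```

In the real loop the equivalent change would be to append each token's own whitespace instead of a hard-coded leading space.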

In [22]:
# ok, let's try on a full folder

import coreferee, spacy
import spacy_transformers
import os

# Load the Spacy language model and add the Coreferee pipeline component
nlp = spacy.load('en_core_web_trf')
nlp.add_pipe('coreferee')

# Define the input folder containing text files and the output folder for the resolved texts
input_folder = "cleaned"  # Replace with the path to your input folder
output_folder = "resolved"  # Replace with the path to your output folder

# Create output directory if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Iterate over all text files in the input directory
for filename in os.listdir(input_folder):
    if filename.endswith(".txt"):
        # Construct the full file paths
        input_file_path = os.path.join(input_folder, filename)
        output_file_path = os.path.join(output_folder, filename)

        # Read the content of the text file
        with open(input_file_path, 'r', encoding='utf-8') as file:
            text = file.read()

        # Process the text with Spacy and Coreferee
        coref_doc = nlp(text)

        # Perform entity co-resolution
        resolved_text = ""
        for token in coref_doc:
            repres = coref_doc._.coref_chains.resolve(token)
            if repres:
                resolved_text += " " + " and ".join([t.text for t in repres])
            else:
                resolved_text += " " + token.text

        # Write the resolved text to the output file
        with open(output_file_path, 'w', encoding='utf-8') as file:
            file.write(resolved_text.strip())  # Remove leading space



___

Dec 4 - let's see if we can work out from the text a set of predicates that capture the antiquities trade


In [30]:
import re
import llm

model = llm.get_model("themodel")

def suggest_predicates_rules(text):
    """Suggests predicates based on keywords and patterns in the text."""
    suggested_predicates = set()

    # Rule 1: Look for common verbs associated with transactions
    transaction_verbs = r"(?:bought|sold|traded|acquired|consigned|purchased|exported|imported|donated|received|transferred)"
    matches = re.findall(transaction_verbs, text, re.IGNORECASE)
    suggested_predicates.update(matches)

    # Rule 2: Look for verbs indicating collaboration or connection
    collaboration_verbs = r"(?:worked with|partnered with|collaborated with|associated with|connected to|supplied to|met with)"
    matches = re.findall(collaboration_verbs, text, re.IGNORECASE)
    suggested_predicates.update(matches)

    # Rule 3: Look for verbs related to legal actions
    legal_verbs = r"(?:charged with|convicted of|sentenced to|arrested for|investigated for)"
    matches = re.findall(legal_verbs, text, re.IGNORECASE)
    suggested_predicates.update(matches)


    # Rule 4: Look for location words implying operation
    location_verbs = r"(?:operated in|located in|based in|established in)"
    matches = re.findall(location_verbs, text, re.IGNORECASE)
    suggested_predicates.update(matches)


    # Rule 5: Look for ownership and provenance
    ownership_verbs = r"(?:owned by|belonged to|originated from|stolen from|traced to)"
    matches = re.findall(ownership_verbs, text, re.IGNORECASE)
    suggested_predicates.update(matches)

    return list(suggested_predicates)

def refine_predicates_llm(suggested_predicates, text):
    """Refines the suggested predicates using an LLM."""
    prompt = f"""The following predicates were suggested for extracting relationships from a text about the antiquities trade: {', '.join(suggested_predicates)}.  The text is:  {text}.  Refine this list, removing irrelevant predicates, adding any crucial missing predicates relevant to the antiquities trade (focus on key players, institutions, objects, and transactions), and prioritizing predicates that will illuminate the key aspects of the trade network. Return the refined list as a comma-separated string. Provide a brief rationale."""
    # model.prompt() returns a Response object; .text() gives the generated string
    refined_predicates = model.prompt(prompt, temperature=0).text()
    return refined_predicates

def select_predicates(text):
    """Selects predicates using a hybrid rule-based and LLM approach."""
    initial_suggestions = suggest_predicates_rules(text)
    refined_predicates = refine_predicates_llm(initial_suggestions, text)
    return refined_predicates




In [31]:
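# Read one coreference-resolved text (the path assumes Colab's /content working directory)
with open("/content/resolved/giacomo_namefixed_orgfixed_cleaned.txt", "r", encoding="utf-8") as f:
    text = f.read()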
predicates = select_predicates(text)
print(f"Selected Predicates: {predicates}")

# Now use these predicates in the triple extraction step.

Selected Predicates: **Refined Predicate List:**

sold, bought, consigned, received, exported, imported, looted, excavated, restored, supplied, owned, possessed,  trafficked, stolen_from,  returned,  charged_with, sentenced_to, located_at, associated_with

**Rationale:**

* **Removed:** "dealing in" is too general.  The other predicates capture the specific activities within "dealing."
* **Added:**
    * **imported:**  Essential counterpart to exported, especially given the international nature of the trade.
    * **looted, excavated:**  These specify the illegal origins of many items.
    * **restored:**  Key activity in preparing objects for sale and disguising their origins.
    * **supplied:** Captures the relationship between Medici and other dealers/auction houses.
    * **owned, possessed:**  Important for establishing provenance and responsibility.
    * **trafficked:**  A more serious charge than simply buying/selling.
    * **returned:**  Crucial for tracking the repatriation

## Extractor!

Now we do the actual extraction. Note the array below where we specify the target relations we're after.

In [32]:
long_list_predicates = [
    "dealerIn",
    "convictedOf",
    "operatedBusiness",
    "soldTo",
    "sourceOf",
    "chargedWith",
    "sentencedTo",
    "tradedThrough",
    "locatedIn",
    "suppliedArtifactsTo",
    "businessPartnerOf",
    "tradeConnectionWith",
    "soldViaIntermediary",
    "consignedThrough",
    "providedArtifactsTo",
    "collaboratedWith",
    "introducedTo"
    ]
predicates = ["involvedIn", "transactedWith", "connectedTo", "legalStatus", "originatedFrom", "operatedIn"]
# maybe should use the old schema

suggested_predicates =['sold_object_to', 'bought_object', 'purchased_from', 'consigned', 'received', 'exported', 'imported', 'looted', 'excavated', 'restored', 'supplied', 'owned', 'possessed',  'trafficked', 'stolen_from',  'returned',  'charged_with', 'sentenced_to', 'located_at', 'associated_with']

Change line 33 of the cell below (the one that builds `predicates_str`) to use whichever list of predicates you want.

In [33]:
import os
import llm
import re

# Ensure results directory exists
os.makedirs("results", exist_ok=True)

# Get the LLM model
model = llm.get_model("themodel")

# Path to the folder of coreference-resolved texts
input_folder = "resolved"

# Iterate through all text files in that folder
for filename in os.listdir(input_folder):
    if filename.endswith(".txt"):
        # Full path to the input file
        input_path = os.path.join(input_folder, filename)

        # Read the text content
        with open(input_path, "r") as file:
            text_content = file.read()

        # Split the text content into paragraphs using regular expressions
        # This is so that everything fits inside the context window
        paragraphs = re.split(r'\n\s*\n', text_content)

        # Prepare output file path in results folder
        output_filename = filename.replace(".txt", "_triplets.csv")
        output_path = os.path.join("results", output_filename)

        # Convert the predicates list to a comma-separated string
        predicates_str = ", ".join(suggested_predicates)  ### change predicates here

        # Open the output file to write results
        with open(output_path, "w") as output_file:
            # Write CSV header
            output_file.write("subject,verb,object\n")

            # Iterate through each paragraph
            for paragraph_index, paragraph in enumerate(paragraphs, 1):
                # Construct the prompt for the current paragraph
                # THIS IS WHERE YOU ALSO INDICATE TARGET VERBS/PREDICATES
                # MAKE SURE THESE ARE THE SAME AS IN THE VALIDATION BLOCK IN THE NEXT CODE CELL
               #prompt = paragraph + f"\n\n Your output will be in csv format with columns 'subject','verb','object'. Extract subject,verb,object triplets that capture the nuance of the text. IGNORE SCHOLARLY PARENTHETICAL CITATIONS. The target predicates are {predicates_str}. RETURN ONLY THE LIST OF TRIPLETS."
                prompt = f"Extract subject,verb,object triplets that capture the nuance of the text. Your output will be in csv format with columns 'subject','verb','object'. STRICT RULES FOR ENTITY EXTRACTION: - Use only substantive, named entities from the main text. The target predicates are {predicates_str}. RETURN ONLY THE LIST OF TRIPLETS. Here is the text to process:\n\n" + paragraph


                # Send the prompt to the LLM and get the response
                try:
                    response = model.prompt(prompt, temperature=0)

                    # Combine response chunks
                    full_response = ''.join(chunk for chunk in response)

                    # Print the response for the current paragraph to console
                    print(f"Paragraph {paragraph_index} from {filename}:")
                    print(full_response)
                    print("\n---\n")

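                    # Write the response to the output file, always ending with a
                    # newline so the next paragraph's rows don't run on from this one
                    output_file.write(full_response.rstrip("\n") + "\n")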
                except Exception as e:
                    print(f"Error processing paragraph {paragraph_index} in {filename}: {e}")

print("Extraction complete. Results saved in 'results' folder.")

Paragraph 1 from true_namefixed_orgfixed_cleaned.txt:
```csv
subject,verb,object
Marion_True,appointed,curatorial assistant
Marion_True,supervised by,Jiri_Frel
Marion_True,promoted to,curator
J_Paul_Getty_Museum,purchased,Getty Aphrodite
J_Paul_Getty_Museum,purchased,gold funerary wreath
J_Paul_Getty_Museum,acquired,Barbara_and_Lawrence_Fleischman_Collection
Marion_True,rejected,Kanakaria mosaics
Marion_True,rejected,Sevso Treasure
Marion_True,returned,ceramic fragments
J_Paul_Getty_Museum,adopted,policy guidelines
J_Paul_Getty_Museum,required,notification
J_Paul_Getty_Museum,would return,object
J_Paul_Getty_Museum,strengthened,policy
Marion_True,believed,forged provenances
J_Paul_Getty_Museum,returned,Several pieces
J_Paul_Getty_Museum,acquired,Barbara_and_Lawrence_Fleischman_Collection
J_Paul_Getty_Museum,published,Barbara_and_Lawrence_Fleischman_Collection
Marion_True,offered,position of Curator
Marion_True,declined,offer
Dietrich_von_Bothmer,presented with,opportunity
Marion_True,s

## validate results

This last bit runs your results through a checker to mark up rows that DO NOT have your target predicates present or DO NOT have 3 columns of data. This way, it becomes easy for you to manually inspect the results and decide how you want to handle things.

If you change your list of desired predicates back where you extract things, make sure the list passed as `valid_predicates` below (here, `suggested_predicates`) is updated accordingly.
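The raw output shown earlier wraps each paragraph's response in a csv code fence and repeats the `subject,verb,object` header, so those lines get flagged alongside the genuinely malformed rows. Here is a minimal sketch of an optional pre-cleaning pass, assuming that is the only clutter in the file. The name `clean_llm_csv` is hypothetical and not part of the notebook.

```python
FENCE = "`" * 3  # the three-backtick marker that opens or closes a Markdown code fence


def clean_llm_csv(raw_text, header="subject,verb,object"):
    """Drop code-fence lines, blank lines and repeated header rows.

    Hypothetical helper: keeps the first header row and every other
    non-empty line exactly as the model wrote it.
    """
    cleaned = []
    header_seen = False
    for line in raw_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(FENCE):
            continue  # skip blank lines and fence markers
        if stripped.lower() == header:
            if header_seen:
                continue  # skip headers repeated by later paragraphs
            header_seen = True
        cleaned.append(stripped)
    return "\n".join(cleaned) + "\n"


# Illustrative input shaped like the raw output above
raw = (
    f"{FENCE}csv\nsubject,verb,object\nMarion_True,rejected,Sevso Treasure\n{FENCE}\n"
    f"{FENCE}csv\nsubject,verb,object\nMarion_True,declined,offer\n{FENCE}"
)
print(clean_llm_csv(raw))
```

Running something like this on each `_triplets.csv` file before the checker should leave the `###` marks for rows that really need a human decision.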

The code below will run on any file with _triplets.csv as part of the filename.

In [43]:
import csv
import os

def error_check_predicates(input_file, output_file, valid_predicates=None):
    """
    Check CSV file for valid predicates and column count.

    Args:
    input_file (str): Path to input CSV file
    output_file (str): Path to output error-checked file
    valid_predicates (list): List of valid predicate verbs
    """
    with open(input_file, 'r') as infile, open(output_file, 'w', newline='') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)

        # Write header (stop early if the input file is empty)
        header = next(reader, None)
        if header is None:
            return
        writer.writerow(header)

        for row in reader:
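            # Mark line with ### if it does not have exactly 3 columns
            if len(row) != 3:
                row.insert(0, '###')
                writer.writerow(row)
                continue

            # Check predicate validity - EXACT MATCH ONLY
            # (the check is skipped when no predicate list was supplied)
            verb = row[1].strip()
            if valid_predicates is not None and verb not in valid_predicates: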
                row.insert(0, '###')

            writer.writerow(row)

def process_all_files(input_dir='results', output_dir='error-checked'):
    os.makedirs(output_dir, exist_ok=True)

    for filename in os.listdir(input_dir):
        if filename.endswith('_triplets.csv'):
            input_path = os.path.join(input_dir, filename)
            output_path = os.path.join(output_dir, f'checked_{filename}')

            error_check_predicates(input_path, output_path, valid_predicates=suggested_predicates) #### remember to change to whatever list of predicates you're using!
            print(f'Processed:{output_dir}/{filename}')

# Uncomment to process all files
# process_all_files()

# Direct call for testing
#error_check_predicates('/content/results/giacomo_namefixed_cleaned_triplets.csv', 'giacomo_checked.txt', valid_predicates=predicates)
#print("Now you can inspect the results; fix things, save, then change the file extension to .csv instead of .txt")

In [44]:
process_all_files()

Processed:error-checked/true_namefixed_orgfixed_cleaned_triplets.csv
Processed:error-checked/giacomo_namefixed_orgfixed_cleaned_triplets.csv


In [45]:
#append .txt to files in the folder error-checked/
#so we can check for errors

import os

def add_txt_extension(directory):
    for filename in os.listdir(directory):
        if filename.endswith(".csv"):
            base_name, _ = os.path.splitext(filename)
            new_filename = base_name + ".txt"
            old_path = os.path.join(directory, filename)
            new_path = os.path.join(directory, new_filename)
            os.rename(old_path, new_path)

add_txt_extension("error-checked")

Now you can download your file with entities & relations, for use in knowledge graph embedding models, network analysis, or whatever else. (The cell above gave it a .txt extension so that it opens in a text editor.)

Open the file in a text editor FIRST though, and sort the lines alphabetically. The lines marked with ### will be put at the top, and you can manually work through them to decide what to do with the predicate or the extra columns of data (where extra commas have crept in).
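If you'd rather do that triage in Python than in a text editor, here is a minimal sketch. It keeps the header row first, then the rows marked `###`, then everything else, each group in alphabetical order. The function name `flagged_first` is hypothetical, and the example rows are only illustrative (whether a row is flagged depends on your own predicate list).

```python
def flagged_first(lines):
    """Return the header, then '###' rows, then the remaining rows (each group sorted)."""
    if not lines:
        return []
    header = lines[0]
    rows = [ln for ln in lines[1:] if ln.strip()]  # drop blank lines
    # False sorts before True, so rows starting with '###' come first
    rows.sort(key=lambda ln: (not ln.startswith("###"), ln))
    return [header] + rows


example = [
    "subject,verb,object",
    "Marion_True,rejected,Sevso Treasure",
    "###,J_Paul_Getty_Museum,would return,object",
    "J_Paul_Getty_Museum,acquired,Barbara_and_Lawrence_Fleischman_Collection",
]
for line in flagged_first(example):
    print(line)
```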

Once you've saved your changes and everything is in well-formed csv, the blocks below switch the file extension back to .csv and then do some last data munging _on that fixed csv file_ in preparation for whatever you do next (eg, knowledge graph embedding model).

In [46]:
#switch the .txt files in the folder error-checked/ back to .csv
#now that they've been checked by hand

import os

def add_csv_extension(directory):
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            base_name, _ = os.path.splitext(filename)
            new_filename = base_name + ".csv"
            old_path = os.path.join(directory, filename)
            new_path = os.path.join(directory, new_filename)
            os.rename(old_path, new_path)

add_csv_extension("error-checked")

OK, so I've checked the 'giacomo' results manually (the file was called giacomo_checked(2).csv) and fixed the handful of relationships that got borked (mostly because a newline was sometimes missing). Now let's tidy the data up and then turn it into gexf:

In [47]:
!mkdir -p finished

In [51]:
import os
import pandas as pd

def process_csv_files(input_folder, output_folder=None):
    # Create output folder if not specified
    if output_folder is None:
        output_folder = input_folder

    # Ensure output folder exists
    os.makedirs(output_folder, exist_ok=True)

    # Iterate through all files in the input folder
    for filename in os.listdir(input_folder):
        # Process only CSV files
        if filename.endswith('.csv'):
            # Full input file path
            input_path = os.path.join(input_folder, filename)

            # Read the CSV file
            thing = pd.read_csv(input_path, names=['subject', 'verb', 'object'], header=0, on_bad_lines='skip')


            # Drop rows with NaN values
            thing = thing.dropna()

            # Replace spaces with underscores in the specified columns
            thing['subject'] = thing['subject'].str.replace(' ', '_')
            thing['verb'] = thing['verb'].str.replace(' ', '_')
            thing['object'] = thing['object'].str.replace(' ', '_')

            # Lowercase only specific columns
            # This will preserve the case of the 'verb' column
            for col in [thing.columns[0], thing.columns[-1]]:  # subject and object columns
                thing[col] = thing[col].astype(str).str.lower()

            # Remove quotation marks from the first and last columns
            for col in [thing.columns[0], thing.columns[-1]]:
                thing.loc[:, col] = thing[col].astype(str).str.strip('"')

            # Create output filename (optionally add a prefix or suffix)
            output_filename = f"processed_{filename}"
            output_path = os.path.join(output_folder, output_filename)

            # Save the processed DataFrame to a new CSV file
            thing.to_csv(output_path, index=False)

            print(f"Processed: {filename} → {output_filename}")

# Usage example
process_csv_files('error-checked', 'finished')

Processed: checked_true_namefixed_orgfixed_cleaned_triplets.csv → processed_checked_true_namefixed_orgfixed_cleaned_triplets.csv
Processed: checked_giacomo_namefixed_orgfixed_cleaned_triplets.csv → processed_checked_giacomo_namefixed_orgfixed_cleaned_triplets.csv


In [52]:
### Experimental!
### template for inferring - for the machine - relations obvious to the reader
!llm --system "Analyze the existing relationship graph in the input CSV. Your task is to DEDUCE missing relationships ONLY for entities already present in the graph, using logical inference and contextual understanding.  Constraints: 1. ONLY propose relationships between entities ALREADY mentioned in the existing data 2. Base inferences on historical context, professional relationships, and known interactions 3. Provide high-confidence deductions that are strongly supported by implicit connections  Reasoning Guidelines: - Look for implied but not explicitly stated connections - Consider professional networks, geographical associations, and documented interactions - Avoid speculative or far-fetched relationships  Output Format: subject,verb,object  Example Reasoning: - If an antiquities dealer is known to work with a specific gallery, infer ownership or operational connection - If multiple people appear in legal proceedings, infer potential collaborative or antagonistic relationships - If an artifact is traced to a specific location, infer provenance or ownership connections  RETURN ONLY THE DEDUCED TRIPLES IN CSV FORMAT. No explanatory text. Provide a comment with # indicating WHY" --save infer

In [55]:
### Experimental!

#! cat giacomo_finished_triplets.csv | llm -m llama3 'Given these relationships, DEDUCE relationships that are missing ONLY FOR ENTITIES ALREADY IN THE GRAPH. For instance, if someone is a known actor, and there is an entity called {storeroom of person}, we could assume a connection. Return likely triples.'

!cat /content/error-checked/checked_giacomo_namefixed_orgfixed_cleaned_triplets.csv | llm -m themodel -t infer
!cat /content/error-checked/checked_true_namefixed_orgfixed_cleaned_triplets.csv | llm -m themodel -c -t infer

tombaroli,supplied,Giacomo_Medici #Tomb robbers excavated the krater that Medici sold, implying a supply chain.
Robert_Hecht,associated_with,Giacomo_Medici #Hecht bought the krater from Medici, implying a business relationship.
Christian_Boursaud,associated_with,Giacomo_Medici #Boursaud and Medici both consigned material and Medici supplied material, suggesting a business connection.
Sotheby’s_London,sold,material #Sotheby's received material that Medici supplied and they sold vases, suggesting the material included vases.
Editions_Services,published,material #Medici bought Editions Services and consigned material, implying a publishing relationship.
Sotheby’s_London,associated_with,Giacomo_Medici #Sotheby's handled material from Medici and Editions Services, implying a business relationship.
Robin_Symes,associated_with,Robert_Hecht #Symes, Hecht, and others were associated with Medici, who dealt in illicit antiquities, suggesting a shared network.
Frieda_Tchacos,associated_with,Robert

In [56]:
import os
import pandas as pd

def concatenate_csv_files(input_folder, output_filename='combined_output.csv'):
    # List to store all dataframes
    all_dataframes = []

    # Flag to track whether headers have been processed
    first_file = True

    # Iterate through all files in the input folder
    for filename in sorted(os.listdir(input_folder)):
        # Process only CSV files
        if filename.endswith('.csv'):
            # Full input file path
            input_path = os.path.join(input_folder, filename)

            # Read the CSV file
            if first_file:
                # For the first file, read with headers
                df = pd.read_csv(input_path)
                all_dataframes.append(df)
                first_file = False
            else:
                # For subsequent files, skip the header row
                df = pd.read_csv(input_path, header=None, skiprows=1)

                # Ensure the columns match the first file
                if len(df.columns) == len(all_dataframes[0].columns):
                    df.columns = all_dataframes[0].columns
                    all_dataframes.append(df)
                else:
                    print(f"Skipping {filename}: Column mismatch")

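    # Nothing to combine if the folder held no usable CSV files
    if not all_dataframes:
        print(f"No CSV files found in {input_folder}")
        return

    # Concatenate all dataframes
    combined_df = pd.concat(all_dataframes, ignore_index=True)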

    # Save the combined dataframe
    combined_df.to_csv(output_filename, index=False)

    print(f"Combined {len(all_dataframes)} files into {output_filename}")
    print(f"Total rows: {len(combined_df)}")

# Usage example
concatenate_csv_files('finished', 'final_combined_output.csv')

Combined 2 files into final_combined_output.csv
Total rows: 116


At this point, something like OpenRefine would be a good idea, so that 'Getty', 'Getty Museum', 'J. Paul Getty' etc all get smooshed into a single node. Then move on to visualization, kg-embedding, and so on.
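For a small graph, a hand-made alias table can do a rough version of the same smooshing. The sketch below is not a substitute for OpenRefine's clustering: the alias table is hypothetical, you would fill it in after looking at your own node list, and it only catches exact matches. It assumes names have already been lowercased and underscored by the cleaning step above.

```python
# Hypothetical alias table: every key is rewritten to its canonical node name.
ALIASES = {
    "getty": "j_paul_getty_museum",
    "getty_museum": "j_paul_getty_museum",
    "j._paul_getty": "j_paul_getty_museum",
}


def normalize_triples(triples, aliases=ALIASES):
    """Replace aliased subjects and objects with their canonical names (exact match only)."""
    return [
        (aliases.get(subject, subject), verb, aliases.get(obj, obj))
        for subject, verb, obj in triples
    ]


# Illustrative rows, not taken from the real output files
triples = [
    ("getty", "purchased", "getty_aphrodite"),
    ("getty_museum", "adopted", "policy_guidelines"),
    ("marion_true", "rejected", "sevso_treasure"),
]
for row in normalize_triples(triples):
    print(row)
```

Note that mapping 'J. Paul Getty' to the museum is itself a judgment call, since the name could also refer to the person. That is exactly the kind of decision worth making by hand.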

## Visualize via GEXF

In [57]:
import pandas as pd
import networkx as nx
import xml.etree.ElementTree as ET
import xml.dom.minidom as minidom

def csv_to_gexf(input_csv, output_gexf, source_col='source', target_col='target', weight_col=None, relationship_col='relationship'):
    """
    Convert a CSV file to a GEXF network file.

    Args:
        input_csv (str): Path to the input CSV file.
        output_gexf (str): Path to the output GEXF file.
        source_col (str, optional): Name of the source node column. Defaults to 'source'.
        target_col (str, optional): Name of the target node column. Defaults to 'target'.
        weight_col (str, optional): Name of the weight column. Defaults to None.
        relationship_col (str, optional): Name of the relationship column (used for edge labels). Defaults to 'relationship'.

    Returns:
        networkx.Graph: The created network graph.
    """
    # Read the CSV file
    try:
        df = pd.read_csv(input_csv)
    except Exception as e:
        print(f"Error reading CSV file: {e}")
        return None

    # Validate required columns
    if source_col not in df.columns or target_col not in df.columns:
        print(f"Error: Required columns {source_col} or {target_col} not found in the CSV.")
        return None

    # Create a graph
    G = nx.from_pandas_edgelist(
        df,
        source=source_col,
        target=target_col,
        edge_attr=([weight_col] if weight_col and weight_col in df.columns else None)
    )

    # Prepare GEXF XML structure
    gexf = ET.Element('gexf', {
        'xmlns': 'http://www.gexf.net/1.2draft',
        'xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance',
        'xsi:schemaLocation': 'http://www.gexf.net/1.2draft http://www.gexf.net/1.2draft/gexf.xsd',
        'version': '1.2'
    })

    # Meta information
    meta = ET.SubElement(gexf, 'meta')
    ET.SubElement(meta, 'creator').text = 'CSV to GEXF Converter'
    ET.SubElement(meta, 'description').text = f'Network generated from {input_csv}'

    # Graph element
    graph = ET.SubElement(gexf, 'graph', {'defaultedgetype': 'undirected'})

    # Nodes
    nodes = ET.SubElement(graph, 'nodes')
    for node in G.nodes():
        ET.SubElement(nodes, 'node', {
            'id': str(node),
            'label': str(node)
        })

    # Edges
    edges = ET.SubElement(graph, 'edges')
    for i, (source, target, data) in enumerate(G.edges(data=True)):
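        # Find the corresponding row in the original DataFrame.
        # The graph is undirected, so the pair may come back in either order.
        matching_row = df[
            ((df[source_col] == source) & (df[target_col] == target))
            | ((df[source_col] == target) & (df[target_col] == source))
        ]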

        edge_attrs = {
            'id': str(i),
            'source': str(source),
            'target': str(target)
        }

        # Add weight if available
        if weight_col and weight_col in data:
            edge_attrs['weight'] = str(data[weight_col])

        # Add relationship label if column exists
        if relationship_col in df.columns and not matching_row.empty:
            relationship_value = matching_row[relationship_col].iloc[0]
            edge_attrs['label'] = str(relationship_value)

        ET.SubElement(edges, 'edge', edge_attrs)

    # Convert to pretty-printed XML
    rough_string = ET.tostring(gexf, 'utf-8')
    reparsed = minidom.parseString(rough_string)

    # Write to file
    with open(output_gexf, 'w', encoding='utf-8') as f:
        f.write(reparsed.toprettyxml(indent="  "))

    print(f"GEXF file created successfully: {output_gexf}")
    print(f"Network stats - Nodes: {G.number_of_nodes()}, Edges: {G.number_of_edges()}")

    return G

In [59]:
csv_to_gexf(
    input_csv='final_combined_output.csv',
    output_gexf='final_combined_output.gexf',
    source_col='subject',
    target_col='object',
    relationship_col='verb'
)

GEXF file created successfully: final_combined_output.gexf
Network stats - Nodes: 139, Edges: 145


<networkx.classes.graph.Graph at 0x79eba72eada0>

## send it to neo4j?

I've got some code somewhere that ought to generate neo4j, maybe that'd be useful
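In the meantime, here is a minimal sketch of one way to get there: turn each triple into a Cypher `MERGE` statement that could be pasted into the Neo4j browser. The modelling choices are assumptions, not the notebook's method. Every node gets the label `Entity` and a `name` property, and the verb becomes an upper-case relationship type. The function name `triple_to_cypher` is hypothetical.

```python
import re


def _escape(text):
    """Escape backslashes and single quotes for a Cypher string literal."""
    return text.replace("\\", "\\\\").replace("'", "\\'")


def triple_to_cypher(subject, verb, obj):
    """Build one Cypher MERGE statement for a (subject, verb, object) triple.

    Assumptions (not from the notebook): nodes are labelled Entity with a
    'name' property; the verb is turned into an upper-case relationship type.
    """
    # Relationship types cannot contain spaces, so collapse non-word runs to '_'
    rel_type = re.sub(r"\W+", "_", verb.strip()).strip("_").upper() or "RELATED_TO"
    return (
        f"MERGE (a:Entity {{name: '{_escape(subject)}'}}) "
        f"MERGE (b:Entity {{name: '{_escape(obj)}'}}) "
        f"MERGE (a)-[:{rel_type}]->(b);"
    )


print(triple_to_cypher("tombaroli", "supplied", "giacomo_medici"))
```

Unlike the GEXF export above, this keeps the direction of each relationship (subject to object). A verb that starts with a digit would still need backtick-quoting in Cypher, which the sketch does not handle.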

---

### Now Let's Try Nuextract!

NuExtract is a small model fine-tuned for extracting structured data. A lot depends on finding exactly the right template structure, but let's see what we can do. I suspect that having done the pre-processing on the input text will help things.

Requires a Hugging Face token, saved in this notebook's secrets as HF_TOKEN.

see also this notebook https://colab.research.google.com/drive/15SL9vCumXvAkoqn2va5b5vhmRVYu9arO?usp=sharing#scrollTo=e8SU1yEb0yI6

In [None]:
### code for calling the model

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def predict_NuExtract(model, tokenizer, texts, template, batch_size=1, max_length=10_000, max_new_tokens=4_000):
    template = json.dumps(json.loads(template), indent=4)
    prompts = [f"""<|input|>\n### Template:\n{template}\n### Text:\n{text}\n\n<|output|>""" for text in texts]

    outputs = []
    with torch.no_grad():
        for i in range(0, len(prompts), batch_size):
            batch_prompts = prompts[i:i+batch_size]
            batch_encodings = tokenizer(batch_prompts, return_tensors="pt", truncation=True, padding=True, max_length=max_length).to(model.device)

            pred_ids = model.generate(**batch_encodings, max_new_tokens=max_new_tokens)
            outputs += tokenizer.batch_decode(pred_ids, skip_special_tokens=True)

    return [output.split("<|output|>")[1] for output in outputs]

# function for writing json to csv

import csv

def json_to_csv(json_data, output_file='nuextract_output.csv'):
    # Parse JSON if it's a string
    if isinstance(json_data, str):
        json_data = json.loads(json_data)

    # Flatten nested JSON
    def flatten_json(data, prefix=''):
        flat_dict = {}
        for key, value in data.items():
            new_key = f"{prefix}{key}" if prefix else key

            if isinstance(value, dict):
                flat_dict.update(flatten_json(value, f"{new_key}_"))
            elif isinstance(value, list):
                flat_dict[new_key] = ', '.join(map(str, value))
            else:
                flat_dict[new_key] = value
        return flat_dict

    # Flatten the JSON
    flat_data = flatten_json(json_data)

    # Write to CSV
    with open(output_file, 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=flat_data.keys())
        writer.writeheader()
        writer.writerow(flat_data)

    print(f"CSV file '{output_file}' has been created.")

In [None]:
# get the smallest model which fits in colab
# the only nuextract model that works in vanilla colab is the tiny one. try using smol via Ollama w/ temperature at 0.
#model_name = "numind/NuExtract-tiny-v1.5"
model_name = "numind/NuExtract-1.5"
device = "cuda"

## model is used with llm, so maybe I need a new name here?
hfmodel = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)


In [None]:
looting_template = """{
    "Antiquities_Trade_Record": {
        "Entities": {
            "Actors": [
                {
                    "name": "",
                    "type": "",
                    "nationality": "",
                    "roles": []
                }
            ],
            "Organizations": [
                {
                    "name": "",
                    "type": "",
                    "location": "",
                    "established_year": ""
                }
            ],
            "Institutions": [
                {
                    "name": "",
                    "type": "",
                    "jurisdiction": "",
                    "founding_date": ""
                }
            ],
            "Objects": [
                {
                    "name": "",
                    "type": "",
                    "origin": "",
                    "estimated_value": "",
                    "date_of_creation": "",
                    "cultural_origin": ""
                }
            ]
        },
        "Relationships": {
            "involvedIn": [
                {
                    "entity1": "",
                    "entity2": "",
                    "role": "",
                    "date": "",
                    "details": ""
                }
            ],
            "transactedWith": [
                {
                    "buyer": "",
                    "seller": "",
                    "object": "",
                    "transaction_date": "",
                    "transaction_value": "",
                    "location": ""
                }
            ],
            "connectedTo": [
                {
                    "entity1": "",
                    "entity2": "",
                    "connection_type": "",
                    "strength_of_connection": "",
                    "time_period": ""
                }
            ],
            "legalStatus": [
                {
                    "entity": "",
                    "status": "",
                    "jurisdiction": "",
                    "date_of_status_change": "",
                    "legal_details": ""
                }
            ],
            "originatedFrom": [
                {
                    "object": "",
                    "original_location": "",
                    "excavation_date": "",
                    "archaeological_context": ""
                }
            ],
            "operatedIn": [
                {
                    "entity": "",
                    "geographic_region": "",
                    "time_period": "",
                    "operational_details": ""
                }
            ]
        },
        "Context": {
            "source_document": "",
            "date_of_record": "",
            "additional_notes": ""
        }
    }
}"""

In [None]:
with open("giacomo_namefixed.txt", "r") as f:
    text = f.read()

In [None]:
#predict!
prediction = predict_NuExtract(hfmodel, tokenizer, [text], looting_template)[0]
print(prediction)

In [None]:
json_to_csv(prediction)
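To compare this with the triples from the LLM route, the prediction can be pulled back into the same subject,verb,object shape. The sketch below rests on several assumptions: that the prediction follows `looting_template`, that only the `connectedTo` entries are of interest, and that `connection_type` can stand in as the verb. The function name `nuextract_to_triples` is hypothetical, and the sample is made up to match the template's shape.

```python
import json


def nuextract_to_triples(prediction):
    """Turn the 'connectedTo' entries of a NuExtract prediction into triples.

    Returns [] when the model output is not valid JSON or the expected
    keys are missing, rather than raising.
    """
    try:
        data = json.loads(prediction) if isinstance(prediction, str) else prediction
    except json.JSONDecodeError:
        return []
    record = data.get("Antiquities_Trade_Record") if isinstance(data, dict) else None
    relationships = record.get("Relationships") if isinstance(record, dict) else None
    links = relationships.get("connectedTo") if isinstance(relationships, dict) else None

    triples = []
    for link in links or []:
        if not isinstance(link, dict):
            continue
        subject = link.get("entity1")
        verb = link.get("connection_type")
        obj = link.get("entity2")
        if subject and verb and obj:  # skip entries the model left blank
            triples.append((subject, verb, obj))
    return triples


# Made-up sample shaped like looting_template, for illustration only
sample = json.dumps({
    "Antiquities_Trade_Record": {
        "Relationships": {
            "connectedTo": [
                {"entity1": "Robert_Hecht", "entity2": "Giacomo_Medici",
                 "connection_type": "associated_with"},
                {"entity1": "", "entity2": "", "connection_type": ""},
            ]
        }
    }
})
print(nuextract_to_triples(sample))
```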