<a href="https://colab.research.google.com/github/shawngraham/homecooked-history/blob/main/structured_data_extractor_using_groq_and_llm_and_coreferee.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook does some interesting things. I have found, through experiments, that while LLMs get us _some_ of the way towards one-shot or zero-shot structured data extraction from unstructured texts (e.g., historical documents, articles, reports), running coreference resolution on the text first gets us _much_ closer to what we want. And using a dedicated coreference model works much better than trying to get an LLM to do it.

So this notebook demonstrates a flow, and saves our work at each step for subsequent examination.

1. Use coreference resolution to sort out pronoun/noun agreement etc.; return modified text.
2. Use a template with an LLM to further massage that text so that any instance of e.g. 'Graham' gets switched to 'shawn_graham'; return modified text.
3. Pass that modified text through a prompt that defines our desired list of predicates. Since steps 1 and 2 reduce opportunities for confusion over who or what is doing the acting, the results of step 3 tend to be higher quality than would otherwise be the case. Write to CSV.
4. Pass the CSV through a 'checker' that marks rows that are either not well formed, or use a predicate not in our desired list.

The user can then manually inspect the results to easily find rows that need fixing, and decide how to deal with them.
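
Since each step writes to its own folder, a quick way to see how far a run got is to count the files per folder. This is only a sketch: the folder names are the ones used in the cells of this notebook, while `STAGES` and `stage_report` are hypothetical helper names of my own, not part of any library.

```python
import os

# Folders used by this notebook, in pipeline order
STAGES = ["source-texts", "resolved", "namefixed", "results", "error-checked"]

def stage_report(base_dir="."):
    """Return {stage: number of files} for each pipeline folder under base_dir."""
    report = {}
    for stage in STAGES:
        path = os.path.join(base_dir, stage)
        # A stage that has not run yet simply has no folder, so count it as 0
        report[stage] = len(os.listdir(path)) if os.path.isdir(path) else 0
    return report

for stage, count in stage_report().items():
    print(f"{stage}: {count} file(s)")
```

A stage whose count is lower than the one before it is the place to start looking.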

## initialize

In [1]:
%%capture
!pip install llm
!llm install llm-groq


In [2]:
import spacy
spacy.prefer_gpu()

True

In [3]:
%%capture
#!pip install -U spacy
!python3 -m pip install coreferee
!python3 -m coreferee install en
!python -m spacy download en_core_web_trf
!python -m spacy download en_core_web_lg


In [4]:
!llm keys set groq

Enter key: 


In [5]:
# create the alias 'llama3' for groq-llama3.1-70b (used below as `-m llama3`)
!llm aliases set llama3 groq-llama3.1-70b

#models via groq are so much faster! You might wish to experiment with other options.
#eg

#!llm install llm-gguf
#!llm gguf download-model https://huggingface.co/lmstudio-community/SmolLM2-1.7B-Instruct-GGUF/resolve/main/SmolLM2-1.7B-Instruct-Q8_0.gguf -a smol17

# now available via llm -m smol17

## test text

We create a folder called 'source-texts' and, for testing, put the full text for an article about Giacomo Medici into it. [via Trafficking Culture](https://traffickingculture.org/encyclopedia/case-studies/giacomo-medici/). Note - while that article begins with 'Medici started dealing' I added the word 'Giacomo' so that we don't get confused in model space about other famous Medicis.

You would put your own source txt files into this folder.

In [6]:
!mkdir source-texts
!echo -e """Giacomo Medici started dealing in antiquities in Rome during the 1960s (Silver 2009: 25). In July 1967, he was convicted in Italy of receiving looted artefacts, though in the same year he met and became an important supplier of antiquities to US dealer Robert Hecht (Silver 2009: 27-9). In 1968, Medici opened the gallery Antiquaria Romana in Rome and began to explore business opportunities in Switzerland (Silver 2009: 34). It is widely believed that in December 1971 he bought the illegally-excavated Euphronios (Sarpedon) krater from tombaroli before transporting it to Switzerland and selling it to Hecht (Silver 2009: 50).\n\n In 1978, he closed his Rome gallery, and entered into partnership with Geneva resident Christian Boursaud, who started consigning material supplied by Medici for sale at Sotheby’s London (Silver 2009: 121-2, 139; Watson and Todeschini 2007: 27). Together, they opened Hydra Gallery in Geneva in 1983 (Silver 2009: 139). It has been estimated that throughout the 1980s Medici was the source of more consignments to Sotheby’s London than any other vendor (Watson and Todeschini 2007: 27). At any one time, Boursaud might consign anything up to seventy objects, worth together as much as £500,000 (Watson 1997: 112). Material would be delivered to Sotheby’s from Geneva by courier (Watson 1997: 112). \n\n In October 1985, the Hydra Gallery sold fragments of the Onesimos kylix to the J. Paul Getty Museum for $100,000, providing a false provenance by way of the fictitious Zbinden collection, a provenance that was sometimes used for material offered at Sotheby’s (Silver 2009: 145; Watson and Todeschini 2007: 95). The Getty returned the kylix to Italy in 1999. 
\n\n In 1986, bad publicity surrounding the sale of looted Apulian vases at Sotheby’s London caused Medici and Boursaud to part company, and Medici bought the Geneva-based Editions Services to continue consigning material to Sotheby’s (Silver 2009: 147; Watson and Todeschini 2007: 27; Watson 1997: 117, 183-6). From 1987 until 1994, he was also consigning material to Sotheby’s through other ‘front companies’, including Mat Securitas, Arts Franc and Tecafin Fiduciaire (Watson and Todeschini 2007: 73). He developed a triangulating system of consigning through one company and purchasing the same piece through another company. There were two potentially positive outcomes of this triangulation manoeuvre: first, it artificially created demand, suggesting to potential customers that the market was stronger than it actually was; and second, it was a way of providing illegally-excavated or -exported pieces with a ‘Sotheby’s’ provenance, and, in effect, laundering them (Watson and Todeschini 2007: 135-41). \n\n By the late 1980s, Medici had developed commercial relations with other major antiquities dealers including Robin Symes, Frieda Tchacos, Nikolas Koutoulakis, Robert Hecht, and the brothers Ali and Hicham Aboutaam (Watson and Todeschini 2007: 73-4). He was the ultimate source of artefacts that would subsequently be sold through dealers or auction houses to private collectors, including Lawrence and Barbara Fleischman, Maurice Tempelsman, Shelby White and Leon Levy, the Hunt brothers, George Ortiz, and José Luis Várez Fisa (Watson and Todeschini 2007: 112-34; Isman 2010), and to museums including the J. Paul Getty, the Metropolitan Museum of Art, the Cleveland Museum of Art, and the Boston Museum of Fine Arts. \n\n In 1995, a Sotheby’s London auction catalogue advertised for sale a sarcophagus recognized by the Carabinieri to have been stolen from the church of San Saba, in Rome. 
Sotheby’s informed the Carabinieri that it had been consigned by Editions Services (Watson and Todeschini 2007: 19). This was around the same time that the ‘organigram’ had been discovered, revealing Medici’s central position in the organisation of the antiquities trade out of Italy (Watson and Todeschini 2007: 19), and putting the evidence together, the Carabinieri decided to act. On 13 September 1995, in concert with Swiss police, they raided Medici’s storage space in the Geneva Freeport, which comprised five rooms with a combined area of about 200 sq metres (Silver 2009: 174; Watson and Todeschini 2007: 20). One room was equipped as a laboratory for cleaning and restoring artefacts, another was fitted out as a showroom, presumably for receiving potential customers (Silver 2009: 180-1). In January 1997, Medici was arrested in Rome (Silver 2009: 175-6), and in July 1997, his Geneva storerooms, which had remained sealed since 1995, were opened again for the process of examination and inventory. \n\n The official report of the contents of Medici’s storerooms was submitted in July 1999. The storerooms had been found to contain 3,800 whole or fragmentary objects, more than 4,000 photographs of artefacts, and 35,000 sheets of paper containing information relating to Medici’s business practices and connections. The artefacts were mainly from Italy, but there were also hundreds from Egypt, Syria, Greece and Asia. The Swiss authorities turned over Italian material to Italy, but returned the rest to Medici (Silver 2009: 192). The photographs were mainly Polaroids, showing what appeared to be illegally-excavated artefacts, sometimes with several views of the same one, in various stages of restoration. Some artefacts were shown still covered with dirt after their excavation, some fragmentary, and others cleaned and reassembled prior to sale (Watson and Todeschini 2007: 54-68). In 2002, Carabinieri raided Medici’s home in Santa Marinella (Watson and Todeschini 2007: 200). 
\n\n Medici was charged with receiving stolen goods, illegal export of goods, and conspiracy to traffic, and his trial in Rome commenced on 4 December 2003. On 12 May 2005, he was found guilty of all charges. The judge declared that Medici had trafficked thousands of artefacts, including the sarcophagus fragment that had started the investigation, and the Euphronios (Sarpedon) krater (Silver 2009: 212). He was sentenced to ten years in prison and received a €10 million fine, with the money going to the Italian state in compensation for damage caused to cultural heritage (Silver 2009: 214). In July 2009, an appeals court in Rome dismissed the trafficking conviction against him because of the expired limitation period, but reaffirmed the convictions for receiving and conspiracy. His jail sentence was reduced to eight years, but the €10 million fine remained in place (Scherer 2009). In December 2011, a further appeal failed (Felch 2012). \n\n The evidence recovered during the investigation into Medici’s business was instrumental in forcing several museums and private collectors to return artefacts to Italy, and triggered further investigations and ultimately the prosecutions of Marion True and Robert Hecht.""" > source-texts/giacomo.txt

## coreference resolution

Coreference resolution involves figuring out which pronouns go with what nouns, and replacing them with the nouns. So, 'John Smith worked in Ottawa. Later he moved to Montreal' _should_ become 'John Smith worked in Ottawa. Later John Smith moved to Montreal'.

You might want to snoop in the 'resolved' folder to see how well this has worked.
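
To make the substitution step concrete, here is a toy sketch that needs no model at all. The token list and the `chains` mapping are hand-written assumptions standing in for a coreference model's output, and `substitute` is a hypothetical name. It follows the same idea as the loop in the next cell: a token that refers back to one or more entities is swapped for their names, joined with 'and' when there are several.

```python
def substitute(tokens, chains):
    """Replace each token index found in `chains` with its referent(s)."""
    out = []
    for i, tok in enumerate(tokens):
        referents = chains.get(i)
        # Several referents (e.g. for 'they') are joined with ' and '
        out.append(" and ".join(referents) if referents else tok)
    return " ".join(out)

tokens = ["John", "Smith", "worked", "in", "Ottawa", ".",
          "Later", "he", "moved", "to", "Montreal", "."]
chains = {7: ["John Smith"]}  # hand-labelled: token 7 ('he') refers to John Smith

print(substitute(tokens, chains))
```

A real model will not always pick the referent you expect, which is why it is worth reading the files in 'resolved' before moving on.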

In [None]:
# ok, let's try on a full folder

import coreferee, spacy
import spacy_transformers
import os

# Load the Spacy language model and add the Coreferee pipeline component
nlp = spacy.load('en_core_web_trf')
nlp.add_pipe('coreferee')

# Define the input folder containing text files and the output folder for the resolved texts
input_folder = "source-texts"  # Replace with the path to your input folder
output_folder = "resolved"  # Replace with the path to your output folder

# Create output directory if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Iterate over all text files in the input directory
for filename in os.listdir(input_folder):
    if filename.endswith(".txt"):
        # Construct the full file paths
        input_file_path = os.path.join(input_folder, filename)
        output_file_path = os.path.join(output_folder, filename)

        # Read the content of the text file
        with open(input_file_path, 'r', encoding='utf-8') as file:
            text = file.read()

        # Process the text with Spacy and Coreferee
        coref_doc = nlp(text)

        # Replace each token that belongs to a coreference chain with its referent(s),
        # keeping the original whitespace so punctuation and paragraph breaks survive
        resolved_text = ""
        for token in coref_doc:
            repres = coref_doc._.coref_chains.resolve(token)
            if repres:
                resolved_text += " and ".join([t.text for t in repres]) + token.whitespace_
            else:
                resolved_text += token.text_with_ws

        # Write the resolved text to the output file
        with open(output_file_path, 'w', encoding='utf-8') as file:
            file.write(resolved_text.strip())  # Trim leading/trailing whitespace

## extract!

In [9]:
!llm -m llama3 'is this thing on'

It looks like it is. I'm here and ready to chat. How can I help you today?


We create a template for the LLM to apply so that any time our text says something like

'John Smith did x. Later Smith did y'

...all references to Smith become john_smith.

In [40]:
#define the template
!llm --system "Replace all references to individuals in this text with a consistent firstname_surname format. Rules: 1. First full name mention: Replace with 'firstname_surname' 2. Subsequent surname-only mentions: Replace with the same 'firstname_surname' 3. Preserve original text structure and context 4. Ensure replacements are uniform throughout the text Examples - 'John Smith arrived late' -> 'john_smith arrived late' - 'Smith apologized' -> 'john_smith apologized' RETURN ONLY the modified text output." --save namefix3

In [77]:
!mkdir namefixed

In [78]:
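# run the name fix
import os
import subprocess

for filename in os.listdir("resolved"):
    if filename.endswith(".txt"):
        input_path = os.path.join("resolved", filename)
        output_path = os.path.join("namefixed", filename[:-4] + "_namefixed.txt")

        # Feed the file to llm on stdin and write its stdout to the output file.
        # No shell is involved, so filenames with spaces or quotes are safe.
        with open(input_path, "rb") as src, open(output_path, "wb") as dst:
            subprocess.run(["llm", "-m", "llama3", "-t", "namefix3"],
                           stdin=src, stdout=dst, check=True)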
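
A quick way to spot-check the name-fixed files is to count the underscore-joined tokens they contain. This is a rough sketch under my own assumptions: `NAME_TOKEN` and `count_name_tokens` are hypothetical names, the pattern only matches plain lowercase a-z (so accented names are missed), and it will also pick up any other underscore-joined words.

```python
import re
from collections import Counter

# Tokens shaped like firstname_surname: lowercase words joined by underscores
NAME_TOKEN = re.compile(r"\b[a-z]+(?:_[a-z]+)+\b")

def count_name_tokens(text):
    """Count each underscore-joined token found in `text`."""
    return Counter(NAME_TOKEN.findall(text))

sample = "giacomo_medici met robert_hecht. Later giacomo_medici moved to Geneva."
print(count_name_tokens(sample).most_common())
```

If a person you know is in the text shows up with a count of zero, or under two different spellings, the template probably needs another pass on that file.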

In [None]:
!pip install --upgrade pydantic

Now we do the actual extraction. Note line 41 where we specify the target relations we're after.
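
The cell below sends the text to the model one paragraph at a time, splitting wherever there is a blank line. Here is that split on a toy string, as a sketch: `split_paragraphs` is a hypothetical helper name, and dropping chunks that are empty after stripping is my own addition so that no prompt goes out with nothing but the instructions in it.

```python
import re

def split_paragraphs(text):
    """Split on blank lines (which may contain whitespace) and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

sample = "First paragraph.\n\n  \nSecond paragraph,\nstill second.\n\n"
print(split_paragraphs(sample))
```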

In [79]:
import os
import llm
import re

# Ensure results directory exists
os.makedirs("results", exist_ok=True)

# Get the LLM model
model = llm.get_model("llama3")

# Path to the ready-to-go folder
input_folder = "namefixed"

# Iterate through all text files in the ready-to-go folder
for filename in os.listdir(input_folder):
    if filename.endswith(".txt"):
        # Full path to the input file
        input_path = os.path.join(input_folder, filename)

        # Read the text content
        with open(input_path, "r") as file:
            text_content = file.read()

        # Split the text content into paragraphs using regular expressions
        paragraphs = [p for p in re.split(r'\n\s*\n', text_content) if p.strip()]

        # Prepare output file path in results folder
        output_filename = filename.replace(".txt", "_triplets.csv")
        output_path = os.path.join("results", output_filename)

        # Open the output file to write results
        with open(output_path, "w") as output_file:
            # Write CSV header
            output_file.write("subject,verb,object\n")

            # Iterate through each paragraph
            for paragraph_index, paragraph in enumerate(paragraphs, 1):
                # Construct the prompt for the current paragraph
                # THIS IS WHERE YOU ALSO INDICATE TARGET VERBS/PREDICATES
                # MAKE SURE THESE ARE THE SAME AS IN THE VALIDATION BLOCK IN THE NEXT CODE CELL
                prompt = paragraph + "\n\n Your output will be in csv format with columns 'subject','verb','object'. Extract subject,verb,object triplets that capture the nuance of the text. Ignore scholarly citations. The target predicates are sold_to, worked_with, purchased, sold, stole, was_responsible_for. RETURN ONLY THE LIST OF TRIPLETS."

                # Send the prompt to the LLM and get the response
                try:
                    response = model.prompt(prompt, temperature=0)

                    # Combine response chunks
                    full_response = ''.join(chunk for chunk in response)

                    # Print the response for the current paragraph to console
                    print(f"Paragraph {paragraph_index} from {filename}:")
                    print(full_response)
                    print("\n---\n")

                    # Write the response to the output file
                    output_file.write(full_response.strip() + "\n")
                except Exception as e:
                    print(f"Error processing paragraph {paragraph_index} in {filename}: {e}")

print("Extraction complete. Results saved in 'results' folder.")

Paragraph 1 from giacomo_namefixed.txt:
giacomo_medici,was_responsible_for,dealing in antiquities
giacomo_medici,was convicted of,receiving looted artefacts
giacomo_medici,worked_with,robert_hecht
giacomo_medici,purchased,Euphronios (Sarpedon) krater
giacomo_medici,sold_to,robert_hecht
giacomo_medici,opened,Antiquaria Romana 
tombaroli,stole,Euphronios (Sarpedon) krater

---

Paragraph 2 from giacomo_namefixed.txt:
giacomo_medici,worked_with,christian_boursaud
giacomo_medici,sold_to,Sotheby's London
giacomo_medici,was_responsible_for,consignments to Sotheby's London
christian_boursaud,sold_to,Sotheby's London
giacomo_medici,worked_with,christian_boursaud at Hydra Gallery 
giacomo_medici,sold,objects to Sotheby's London

---

Paragraph 3 from giacomo_namefixed.txt:
"Hydra Gallery","sold","fragments of the Onesimos kylix"
"Hydra Gallery","sold_to","J. Paul Getty Museum"
"J. Paul Getty Museum","purchased","fragments of the Onesimos kylix"
"J. Paul Getty Museum","returned","kylix"
"J. Paul

As you look at the results, note that sometimes names aren't rendered with an underscore. You _could_ run the file through the 'namefix' template again to try to fix this. You'd run something like this:

`!cat results/giacomo_namefixed_triplets.csv | llm -m llama3 -t namefix3 > output.csv`

...choosing a sensible name/place instead of `output.csv`


Or you could just fix this manually later.
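
A deterministic alternative, if you already know which names matter, is a simple lookup table. This is a sketch under assumptions: `KNOWN_NAMES` and `normalise_names` are hypothetical names, the table is something you would fill in yourself, and only cells that match a known name exactly (ignoring case and surrounding spaces) are rewritten.

```python
import csv
import io

# Hypothetical lookup table: extend it with the names that matter in your texts
KNOWN_NAMES = {"robert hecht": "robert_hecht", "giacomo medici": "giacomo_medici"}

def normalise_names(csv_text, known=KNOWN_NAMES):
    """Rewrite any cell that exactly matches a known name; leave the rest alone."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(csv_text)):
        writer.writerow([known.get(cell.strip().lower(), cell) for cell in row])
    return out.getvalue()

sample = "subject,verb,object\nGiacomo Medici,sold_to,Robert Hecht\n"
print(normalise_names(sample))
```

Unlike a second pass through the LLM, this cannot alter the predicates or drop rows, but it also will not catch a name you did not list.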


## validate results

This last bit runs your results through a checker to mark up rows that DO NOT have your target predicates present or DO NOT have 3 columns of data. This way, it becomes easy for you to manually inspect the results and decide how you want to handle things.

If you change your list of desired predicates back where you extract things, make sure the 'valid_predicates' list below is updated accordingly.
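
One way to avoid the two lists drifting apart is to define the predicates once and build the prompt from that list. This is a sketch, not how the cells in this notebook are written: `TARGET_PREDICATES` and `build_prompt` are hypothetical names, and the prompt wording is copied from the extraction cell.

```python
# One list shared by the extraction prompt and the checker
TARGET_PREDICATES = ["sold_to", "worked_with", "purchased",
                     "sold", "stole", "was_responsible_for"]

def build_prompt(paragraph, predicates=TARGET_PREDICATES):
    """Append the extraction instructions, listing the predicates from one place."""
    return (
        paragraph
        + "\n\n Your output will be in csv format with columns 'subject','verb','object'. "
        + "Extract subject,verb,object triplets that capture the nuance of the text. "
        + "Ignore scholarly citations. "
        + f"The target predicates are {', '.join(predicates)}. "
        + "RETURN ONLY THE LIST OF TRIPLETS."
    )

print(build_prompt("giacomo_medici sold the krater to robert_hecht."))
```

The same list can then be handed to the checker via its `valid_predicates` argument, e.g. `error_check_predicates(input_path, output_path, valid_predicates=TARGET_PREDICATES)`.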


In [75]:
import csv
import os

def error_check_predicates(input_file, output_file, valid_predicates=None):
    """
    Check CSV file for valid predicates and column count.

    Args:
    input_file (str): Path to input CSV file
    output_file (str): Path to output error-checked file
    valid_predicates (list): List of valid predicate verbs
    """
    if valid_predicates is None:
        valid_predicates = [
            'sold_to', 'worked_with', 'purchased',
            'sold', 'stole', 'was_responsible_for'
        ]

    with open(input_file, 'r') as infile, open(output_file, 'w', newline='') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)

        # Write header
        header = next(reader)
        writer.writerow(header)

        for row in reader:
            # Skip blank lines rather than flagging them
            if not row:
                continue

            # Mark the row with ### if it does not have exactly 3 columns
            if len(row) != 3:
                row.insert(0, '###')
                writer.writerow(row)
                continue

            # Check predicate validity
            verb = row[1].strip().lower()
            if verb not in valid_predicates:
                row.insert(0, '###')

            writer.writerow(row)

def process_all_files(input_dir='results', output_dir='error-checked'):
    os.makedirs(output_dir, exist_ok=True)

    for filename in os.listdir(input_dir):
        if filename.endswith('_triplets.csv'):
            input_path = os.path.join(input_dir, filename)
            output_path = os.path.join(output_dir, f'checked_{filename}')

            error_check_predicates(input_path, output_path)
            print(f'Processed: {filename}')



process_all_files()

Processed: giacomo_namefixed_triplets.csv


Now you can download your csv file with entities & relations, for use in knowledge graph embedding models, network analysis, or whatever else.

Open the csv file in a text editor FIRST though, and sort the lines alphabetically. The lines marked with ### will be put at the top, and you can manually work through them to decide what to do with the predicate or the extra columns of data (where extra commas have crept in).
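
If you would rather do the sorting in Python, something like the following would work. It is a sketch with a hypothetical function name, `flagged_first`, and it assumes the text has at least a header row. Unlike a plain alphabetical sort in an editor, it keeps the header on the first line.

```python
import csv
import io

def flagged_first(csv_text):
    """Return CSV text with the header first, then '###' rows, then the rest."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    # Sort alphabetically, then (stable sort) move flagged rows ahead of the others
    body = sorted(sorted(body), key=lambda r: not (r and r[0] == "###"))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows([header] + body)
    return out.getvalue()

sample = "subject,verb,object\nb,sold,c\n###,a,opened,d\na,stole,e\n"
print(flagged_first(sample))
```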