# Idea

use out-of-the-box NER to extract entity names, then turn to an LLM to relabel them with a custom set of entity types

use outlines to run against LM Studio (presumably quantized 7B models)


### Observations

Memory requirements for running local models, at least via spacy-llm, seem really high

unclear whether that comes from inefficiencies in the spaCy libraries or is inherent to running the models
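A rough back-of-the-envelope check can help separate the two explanations. The sketch below estimates memory for the weights alone from parameter count and bits per weight. It ignores KV cache, activations, and runtime overhead, so real usage will be higher. `weight_memory_gb` is a hypothetical helper, not part of spaCy or spacy-llm, and a flat 4 bits per weight is only an approximation of real quantization formats.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed sizes: a 7B-parameter model at common precisions
for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{weight_memory_gb(7e9, bits):.1f} GB of weights")
```

If observed usage sits far above these figures, the extra is more likely coming from the wrapper, or from loading unquantized weights, than from the model size itself.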

wrapping models in libraries seems to lead to inefficient use of APIs: lots of rate limit errors etc. with spaCy and outlines

LM Studio seems like a good way to make OpenAI-like calls without incurring cost or rate limits
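For reference, an OpenAI-style chat call against a local server needs nothing beyond the standard library. This is a minimal sketch, assuming LM Studio's server is listening at `http://localhost:1234/v1` (the address used later in this notebook) and accepts the usual chat-completions payload. `build_chat_request` and `chat` are hypothetical helper names, and the bearer token is a placeholder since the local server does not bill.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # assumption: LM Studio's local server address


def build_chat_request(prompt, model, base_url=BASE_URL):
    """Build a POST request for an OpenAI-style chat completion."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer lm-studio",  # placeholder key
        },
        method="POST",
    )


def chat(prompt, model, base_url=BASE_URL, timeout=120):
    """Send the prompt and return the text of the first choice."""
    request = build_chat_request(prompt, model, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Say hello", model=...)` would be called with whichever model id the server reports as loaded.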

### Renting GPUs

Paperspace is probably still the best of the options: best UI, relatively easy to get notebooks running. some thrash from disconnected kernels that seem to keep running workloads

Azure is still very enterprisey, not friendly to a solo dev

Google Colab is underbaked and keeps you in their sub-par notebook environment. feels like abandonware/promotionware

the pricing model everyone seems to land on is subscription, with access to higher-GPU-RAM machines gated behind more expensive tiers.

# Setup

## pipeline config

In [1]:
import os


CONFIG_CONTENT = """

[nlp]
lang = "en"
pipeline = ["ner"]

[components]

[components.ner]
source = "en_core_web_md"


[initialize]
vectors = "en_core_web_md"
"""


with open('config.cfg', 'w') as f:
    f.write(CONFIG_CONTENT)
    
    
DATA_SOURCE_DIR = ''

os.environ['TOKENIZERS_PARALLELISM'] = 'false'


## install requirements

# Init pipeline

In [2]:
import logging
import spacy_llm
from spacy_llm.util import assemble
import spacy



config = "config.cfg"

model_name = "en_core_web_md"
try:
    nlp = assemble(config)
except OSError:
    spacy.cli.download(model_name)
    nlp = assemble(config)

# stream spacy-llm debug logs to the console (StreamHandler() writes to stderr by default)
spacy_llm.logger.addHandler(logging.StreamHandler())
spacy_llm.logger.setLevel(logging.DEBUG)



### Refine NER

resolve disagreements by iterating


In [3]:
import pandas as pd
from tqdm import tqdm

DATA_PATH = 'msgs.pkl'

with open(DATA_PATH, 'rb') as f:
    docs_df = pd.read_pickle(f)
    
docs = docs_df[0].tolist()


entities_rows = []
for doc in tqdm(docs):
    enriched_doc = nlp(doc)
    ents = enriched_doc.ents
    for ent in ents:
        entities_rows.append({"name": ent.text, "label": ent.label_, "fact": doc, "enriched_doc": enriched_doc})  # type: ignore


entities_df = pd.DataFrame(entities_rows)

with open('entities.pkl', 'wb') as f:
    entities_df.to_pickle(f)
    

100%|██████████| 193/193 [00:00<00:00, 203.82it/s]
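Before handing entities to the LLM, it can help to see which names spaCy tagged inconsistently across facts. The following is a small sketch using only the standard library. `names_with_conflicting_labels` is a hypothetical helper, and the sample rows are made up for illustration (they are not taken from `msgs.pkl`).

```python
from collections import defaultdict


def names_with_conflicting_labels(rows):
    """Return {name: sorted labels} for names tagged with more than one label."""
    labels_by_name = defaultdict(set)
    for row in rows:
        labels_by_name[row["name"]].add(row["label"])
    return {
        name: sorted(labels)
        for name, labels in labels_by_name.items()
        if len(labels) > 1
    }


# Made-up rows with the same "name"/"label" keys as entities_rows
sample_rows = [
    {"name": "JanusGraph", "label": "ORG"},
    {"name": "JanusGraph", "label": "PRODUCT"},
    {"name": "Tom", "label": "PERSON"},
]
print(names_with_conflicting_labels(sample_rows))
# → {'JanusGraph': ['ORG', 'PRODUCT']}
```

Calling it on `entities_rows` from the cell above would list the names worth sending to the LLM first.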


### Resolve inconsistencies

In [6]:


# refine entity labels
import os
import re
from openai import OpenAI
import pandas as pd
import random


os.environ['OPENAI_BASE_URL'] = "http://localhost:1234/v1"
os.environ['OPENAI_API_KEY'] = "lm-studio"
client = OpenAI()
model = client.models.list().data[0].id

VALID_LABELS = [
    "PERSON",
    "PET",
    "ORG",
    "PRODUCT",
    "WEBSITE",
    "GPE",
    "TVSHOW",
    "BOOK",
    "MOVIE",
    "TECHNICAL_CONCEPT",
    "MUSICAL_GROUP",
    "EVENT"
]

DISCARD_LABELS = [
    "CARDINAL",
    "DATE",
    "TIME"
]

# discard entities with any of the discard labels
entities_df = entities_df[~entities_df["label"].isin(DISCARD_LABELS)]

# get the array of unique entity names
unique_names = entities_df["name"].unique()

# new dataframe with a name column and a fact column holding the array of unique facts for that name
name_df = entities_df.groupby("name")["fact"].unique().reset_index()


entity_rows = []
for _, row in tqdm(entities_df.iterrows(), total=len(entities_df)):  # total lets tqdm show real progress
    name = row['name']
    fact = row['fact']
    

    if name == 'Tom':
        entity_rows.append({"name": name, 'label': 'PERSON', "fact": fact, 'score': 100})
        continue

    label_choices = VALID_LABELS.copy()
    random.shuffle(label_choices)
    labels_str = ", ".join(label_choices)
    candidates = {}
    
    prompt = f"""You are an entity resolution assistant. 
    You must classify the entity with name = {name}
    
    The valid choices are: {labels_str}
    
    
    Use both your inherent knowledge and this fact derived from chat logs:
    {fact}
    
    Your response should begin with just one word from the following choices: {labels_str}.
    Then, a score from 1 to 100, where 100 is the most confident and 1 is the least confident.
    Then follow with a short explanation of your reasoning.
    """
    
    response = client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}]).choices[0].message.content
    assert(response)
    print(f"***\nname: {name}\n\nfact: {fact}\n\nresponse: {response}\n\n***\n")
    
    # whichever label appears first wins
    winner = 'None'
    min_idx = float('inf')
    for label in label_choices:
        if label in response:
            idx = response.index(label)
            if idx < min_idx:
                min_idx = idx
                winner = label
    
    try:
        score = int(re.search(r'\d+', response).group())
    except AttributeError:
        score = 0
        
    entity_rows.append({"name": name, 'label': winner, "fact": fact, 'score': score})
            
with open('refined_labels.pkl', 'wb') as f:
    pd.DataFrame(entity_rows).to_pickle(f)



0it [00:00, ?it/s]
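The parsing in the cell above takes the first label found anywhere in the response (as a plain substring) and the first number found anywhere, which can misfire when the explanation mentions another label or a number precedes the score. A slightly stricter sketch follows. `parse_choice_and_score` is a hypothetical helper, not what the notebook ran: it matches choices on word boundaries, keeps the earliest one, and only looks for a 1-3 digit score after that match.

```python
import re


def parse_choice_and_score(response, choices):
    """Return (choice, score) for the earliest whole-word choice in response.

    The score is the first 1-3 digit number after the choice, clamped to 0-100.
    Returns (None, 0) when no choice is found.
    """
    best = None
    for choice in choices:
        match = re.search(rf"\b{re.escape(choice)}\b", response)
        if match and (best is None or match.start() < best.start()):
            best = match
    if best is None:
        return None, 0
    score_match = re.search(r"\b(\d{1,3})\b", response[best.end():])
    score = int(score_match.group(1)) if score_match else 0
    return best.group(0), max(0, min(score, 100))


print(parse_choice_and_score("PERSON\n95\nTom is a first name.", ["ORG", "PERSON"]))
# → ('PERSON', 95)
```

It still has blind spots: a date such as `2024-02-26` right after the label would be read as a score of 2, so asking the model for a fixed output format remains the more reliable fix.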

In [22]:
import re


VALID_RELATIONSHIPS = {
    ("PERSON", "PERSON"): [
        "IS_FRIEND_TO",
        "IS_RELATIVE_OF",
        "IS_ROMANTIC_PARTNER_OF",
        "IS_COWORKER_OF",
        "FOLLOWS_IN_MEDIA",
        "IS_SAME_PERSON_AS",
    ],
    ("PERSON", "PET"): [
        "is owner of",
        "interacted with",
        "took care of",
    ],
    ("PERSON", "EVENT"): [
        "attended",
    ],
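    # NOTE: DATE and TIME are in DISCARD_LABELS and not in VALID_LABELS,
    # so these two pairs never match any resolved entity as the cells stand
    ("EVENT", "TIME"): [
        "occurred at"
    ],
    ("EVENT", "DATE"): [
        "occurred on"
    ]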
}

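refined_df = pd.read_pickle('refined_labels.pkl')
resolve_entities = refined_df['name']
# the label may be a tuple with the label first (as this cell originally assumed)
# or a plain string (as the labelling cell above writes it); x[0] on a string would keep only its first letter
resolved_labels = refined_df['label'].apply(lambda x: x[0] if isinstance(x, (tuple, list)) else x)

# one row per distinct (name, label) pair, so each pair of entities is only compared once
resolved_entity_df = pd.DataFrame({"name": resolve_entities, "label": resolved_labels}).drop_duplicates()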

relation_rows = []

for label_1, label_2 in VALID_RELATIONSHIPS.keys():
    entity_1 = resolved_entity_df[resolved_entity_df["label"] == label_1]
    entity_2 = resolved_entity_df[resolved_entity_df["label"] == label_2]

    for _, row_1 in entity_1.iterrows():
        for _, row_2 in entity_2.iterrows():
            name_1 = row_1['name']
            name_2 = row_2['name']
            if name_1 == name_2:
                continue
            facts_1 = entities_df[entities_df["name"] == name_1]["fact"]
            facts_2 = entities_df[entities_df["name"] == name_2]["fact"]
            
            intersecting_facts = set(facts_1).intersection(set(facts_2))
            
            if len(intersecting_facts) == 0:
                continue

            print(f"Intersecting facts for {name_1} and {name_2}: {intersecting_facts}")
        
            relation_candidates = {}
            
            for fact in intersecting_facts:
                print(f"Entity 1: {row_1['name']}, a {label_1}. Entity 2: {row_2['name']}. A {label_2}. Fact: {fact}")
                prompt = f"""You are an entity resolution assistant. 
                You must classify the relationship between two entities:
                Entity 1: name = {row_1['name']}, type = {label_1} 
                Entity 2: name = {row_2['name']}, type = {label_2}
            
                The valid choices are: {VALID_RELATIONSHIPS[(label_1, label_2)]}. If none fit, specify NONE.
                
                Use both your inherent knowledge and this fact derived from chat logs:
                {fact}
                
                Your response should begin with one of the following choices: {VALID_RELATIONSHIPS[(label_1, label_2)]}, NONE. 
                A score from 1 to 100 should follow. 100 means you are very confident in your choice, 1 means you are not confident at all.
                Then follow with a short explanation of your reasoning.
                """
                
                response = client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}]).choices[0].message.content
                assert(response)
                print(f"***\nname: {row_1['name']}, {row_2['name']}\n\nresponse: {response}\n\n***\n")
                
                # whichever label appears first wins
                winner = None
                min_idx = float('inf')
                for label in VALID_RELATIONSHIPS[(label_1, label_2)]:
                    if label in response:
                        idx = response.index(label)
                        if idx < min_idx:
                            min_idx = idx
                            winner = label
                # score is regex match for first number in the response
                try:
                    score = int(re.search(r'\d+', response).group())
                except AttributeError:
                    score = 0
                    
                row = {"entity_1": row_1['name'], "entity_2": row_2['name'], "relationship": winner, 'fact': fact, 'score': score}
                print(row)
                relation_rows.append(row)
            

# data frame from list of rows (each entry is a dict)
relation_df = pd.DataFrame(relation_rows)
with open('relationships.pkl', 'wb') as f:
    relation_df.to_pickle(f)
            
        




Intersecting facts for Allison and Justina: {"On 2024-02-26, Tom shared that he and Justina had a weekend full of dog activities, as they were taking care of his sister Allison's dog, Cesar. This presents an addition to Tom's pet-friendly lifestyle and his on-going family connections."}
Entity 1: Allison, a PERSON. Entity 2: Justina. A PERSON. Fact: On 2024-02-26, Tom shared that he and Justina had a weekend full of dog activities, as they were taking care of his sister Allison's dog, Cesar. This presents an addition to Tom's pet-friendly lifestyle and his on-going family connections.
***
name: Allison, Justina

response: IS_RELATIVE_OF
99

Based on the provided information, I have classified Allison as Tom's sister, which makes Justina her caretaker or a friend/family member helping with dog care, rather than a romantic partner or coworker. The fact that they took care of Cesar, Allison's dog, suggests a familial connection between Tom and Allison, making it likely that Justina is als
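A high score does not guarantee a sensible label, so a cheap sanity pass over `relation_rows` may be worth running before trusting the output. The sketch below flags two of the contradictions this notebook wants to check for: a person with several romantic partners, and a pair labelled both relatives and romantic partners. `flag_suspicious_relations` is a hypothetical helper, and it assumes rows shaped like the dicts appended to `relation_rows`.

```python
from collections import defaultdict


def flag_suspicious_relations(rows, min_score=0):
    """Return human-readable warnings for contradictory relationship rows."""
    partners = defaultdict(set)
    relations_by_pair = defaultdict(set)
    for row in rows:
        relationship = row["relationship"]
        if relationship is None or row["score"] < min_score:
            continue
        # frozenset makes (A, B) and (B, A) count as the same pair
        pair = frozenset((row["entity_1"], row["entity_2"]))
        relations_by_pair[pair].add(relationship)
        if relationship == "IS_ROMANTIC_PARTNER_OF":
            partners[row["entity_1"]].add(row["entity_2"])
            partners[row["entity_2"]].add(row["entity_1"])

    flags = []
    for person, others in sorted(partners.items()):
        if len(others) > 1:
            flags.append(f"{person} has multiple romantic partners: {sorted(others)}")
    for pair, relationships in relations_by_pair.items():
        if {"IS_ROMANTIC_PARTNER_OF", "IS_RELATIVE_OF"} <= relationships:
            flags.append(f"{sorted(pair)} are both relatives and romantic partners")
    return flags
```

Flagged rows are candidates for a second LLM pass or manual review, not proof of an error: the data may legitimately contain, say, a past and a current partner.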

In [7]:
entities_df


# check logic for contradictions:

# multiple romantic partners, romantic partners that are also a sibling, other familial relationships


# need to hone by confidence score or something similar, not every fact is equally important

| | name | label | fact | enriched_doc |
|---|---|---|---|---|
| 0 | 2024-01-27 | DATE | On 2024-01-27, Tom expressed a preference for ... | (On, 2024, -, 01, -, 27, ,, Tom, expressed, a,... |
| 1 | Tom | PERSON | On 2024-01-27, Tom expressed a preference for ... | (On, 2024, -, 01, -, 27, ,, Tom, expressed, a,... |
| 2 | Neo4j Community | ORG | On 2024-01-27, Tom expressed a preference for ... | (On, 2024, -, 01, -, 27, ,, Tom, expressed, a,... |
| 3 | JanusGraph | ORG | On 2024-01-27, Tom expressed a preference for ... | (On, 2024, -, 01, -, 27, ,, Tom, expressed, a,... |
| 4 | JanusGraph | ORG | On 2024-01-27, Tom expressed a preference for ... | (On, 2024, -, 01, -, 27, ,, Tom, expressed, a,... |
| ... | ... | ... | ... | ... |
| 688 | 2024-04-29 | DATE | On 2024-04-29, Tom asked about applying a func... | (On, 2024, -, 04, -, 29, ,, Tom, asked, about,... |
| 689 | Tom | PERSON | On 2024-04-29, Tom asked about applying a func... | (On, 2024, -, 04, -, 29, ,, Tom, asked, about,... |
| 690 | one | CARDINAL | On 2024-04-29, Tom asked about applying a func... | (On, 2024, -, 04, -, 29, ,, Tom, asked, about,... |
| 691 | Sam | PERSON | On 2024-04-29, Tom asked about applying a func... | (On, 2024, -, 04, -, 29, ,, Tom, asked, about,... |
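One way to act on the confidence-score note is to pool scores across facts instead of treating each row as equally important. The sketch below sums scores per relationship for each entity pair and keeps the top one only if its total clears a threshold. Summing is just one reasonable choice (a mean or a max would behave differently), and `best_relationship_per_pair` and its default threshold are hypothetical, not something the notebook ran.

```python
from collections import defaultdict


def best_relationship_per_pair(rows, min_total_score=100):
    """Map (entity_1, entity_2) to (relationship, summed score) above a threshold."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:
        if row["relationship"] is None:
            continue  # the model answered NONE or no label was parsed
        pair = (row["entity_1"], row["entity_2"])
        totals[pair][row["relationship"]] += row["score"]

    best = {}
    for pair, scores in totals.items():
        relationship, total = max(scores.items(), key=lambda item: item[1])
        if total >= min_total_score:
            best[pair] = (relationship, total)
    return best
```

A relationship supported by two facts at 90 and 85 would then outrank a single 60-point guess, and pairs backed only by one weak fact drop out entirely.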
