# Idea

use out-of-the-box NER to extract entity names, then turn to an LLM to relabel them with a custom set of entity types

use outline to run against LM Studio (presumably quantized 7B models)


### Observations

Memory requirements, at least via spacy-llm, seem really high for running local models

unclear if that's due to inefficiencies in the spacy library, or inherent to running models locally
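
As a rough sanity check on the memory question, the weights alone put a floor on what a local model needs. The sketch below is plain arithmetic (parameter count times bits per weight); it ignores the KV cache, activations, and any library overhead, so it says nothing about where the extra memory seen with spacy-llm went. `weight_memory_gb` is a made-up helper name, and the 7B parameter count is the assumption from the notes above.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# compare full half-precision weights against common quantization widths
for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{weight_memory_gb(7e9, bits):.1f} GB for weights")
```

If observed usage is far above these floors, the gap is coming from something other than the weights themselves.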

wrapping models in libraries seems to lead to inefficient use of APIs: lots of rate limit errors etc. with spacy and outline

LM Studio seems like a good approach to doing openai-like calls without incurring cost or rate limits

### renting gpus

Paperspace is probably still the best option: best UI, relatively easy to get notebooks running. some thrash, though: disconnected kernels seem to keep running workloads

Azure is still very enterprisey, not friendly to a solo dev

Google Colab is underbaked, keeps you in their sub-par notebook environment. feels like abandonware/promotionware

the pricing model everyone lands on is probably subscription. access to machines with more GPU RAM is gated behind higher subscription tiers.

# Setup

## pipeline config

In [1]:
import os


CONFIG_CONTENT = """

[nlp]
lang = "en"
pipeline = ["ner"]

[components]

[components.ner]
source = "en_core_web_md"


[initialize]
vectors = "en_core_web_md"
"""


with open('config.cfg', 'w') as f:
    f.write(CONFIG_CONTENT)
    
    
DATA_SOURCE_DIR = ''

os.environ['TOKENIZERS_PARALLELISM'] = 'false'


## install requirements

# Init pipeline

In [2]:
import logging
import spacy_llm
from spacy_llm.util import assemble
import spacy



config = "config.cfg"
model_name = "en_core_web_md"

# set log level to stream to STDOUT (before assembling, so pipeline setup is logged too)
spacy_llm.logger.addHandler(logging.StreamHandler())
spacy_llm.logger.setLevel(logging.DEBUG)

# assemble the pipeline once; download the sourced model on first run if it is missing
try:
    nlp = assemble(config)
except OSError:
    spacy.cli.download(model_name)
    nlp = assemble(config)



### Refine NER

resolve disagreements by iterating
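
The iteration idea can be sketched without a model in the loop: classify each fact separately, and if the answers disagree, retry with only the labels that received a vote. This is a minimal sketch, not the notebook's actual function; `resolve_label` and `toy_classify` are hypothetical names, and `toy_classify` is a deliberately contrived stand-in for the LLM call so the example is deterministic.

```python
from typing import Callable, List, Sequence, Union


def resolve_label(
    facts: Sequence[str],
    label_choices: Sequence[str],
    classify: Callable[[str, Sequence[str]], str],
    max_rounds: int = 3,
) -> Union[str, List[str]]:
    """Vote per fact; on disagreement, retry with only the labels that got a vote."""
    choices = list(label_choices)
    for _ in range(max_rounds):
        candidates = {classify(fact, choices) for fact in facts}
        candidates &= set(choices)  # ignore answers outside the allowed labels
        if len(candidates) == 1:
            return candidates.pop()
        if candidates:
            choices = sorted(candidates)  # narrow the options for the next round
    return choices  # still ambiguous after max_rounds


def toy_classify(fact: str, choices: Sequence[str]) -> str:
    # hypothetical stand-in for the LLM: a keyword hint, else the last option offered
    if "walk" in fact and "PET" in choices:
        return "PET"
    return choices[-1]


facts = ["Rex went for a walk", "Rex ate dinner"]
print(resolve_label(facts, ["PERSON", "PET", "ORG"], toy_classify))
```

With the toy classifier, round one yields two different votes, and round two agrees once the losing labels are removed. If the rounds run out, the remaining candidates come back as a list so the ambiguity stays visible.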


In [4]:
import pandas as pd
from tqdm import tqdm

DATA_PATH = 'msgs.pkl'

with open(DATA_PATH, 'rb') as f:
    docs_df = pd.read_pickle(f)
    
docs = docs_df[0].tolist()


entities_rows = []
for doc in tqdm(docs):
    enriched_doc = nlp(doc)
    ents = enriched_doc.ents
    for ent in ents:
        entities_rows.append({"name": ent.text, "label": ent.label_, "fact": doc})  # type: ignore


entities_df = pd.DataFrame(entities_rows)

with open('entities.pkl', 'wb') as f:
    entities_df.to_pickle(f)
    

100%|██████████| 71/71 [00:00<00:00, 201.08it/s]


### Resolve inconsistencies

In [12]:


# refine entity labels
import os
from openai import OpenAI
import pandas as pd
import random

from math import ceil

os.environ['OPENAI_BASE_URL'] = "http://localhost:1234/v1"
os.environ['OPENAI_API_KEY'] = "lm-studio"
client = OpenAI()
model = client.models.list().data[0].id

VALID_LABELS = [
    "PERSON",
    "PET",
    "ORG",
    "PRODUCT",
    "WEBSITE",
    "GPE",
    "TVSHOW",
    "BOOK",
    "MOVIE",
    "TECHNICALCONCEPT",
    "EVENT"
]

DISCARD_LABELS = [
    "CARDINAL",
    "DATE",
    "TIME"
]

# discard entities with any of the discard labels
entities_df = entities_df[~entities_df["label"].isin(DISCARD_LABELS)]

# get list of unique entity names:
unique_names = entities_df["name"].unique()

# new dataframe, with a name column and a facts column, which contains the array of facts
name_df = entities_df.groupby("name")["fact"].apply(lambda x: pd.unique(x)).reset_index()


def chunks(l, n):
    """Yield successive n-sized chunks from list l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]


# classify each entity name against the label choices, retrying on the narrowed candidate set when responses disagree
def refine_labels(row: pd.Series, current_round = 1, label_choices = VALID_LABELS):    
    MAX_ROUNDS = 3
    MAX_BATCH_FACTS = 1
    MIN_FACTS = 3
    
    name = row["name"]
    fact_set = row["fact"]
    
    if len(fact_set) < MIN_FACTS:
        print('upsampling facts')
        fact_list = list(fact_set) * ceil(MIN_FACTS / len(fact_set))
    else:
        fact_list = list(fact_set)
    print(f"considering name {name}. round {current_round}, label choices: {label_choices}. Number of facts: {len(fact_set)}")
    
    
    if current_round > MAX_ROUNDS:
        print(f"exhausted retries for name {name}, remaining choices are: {label_choices}")
        return label_choices
    
    
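    # shuffle the working copy of the facts (fact_list is what gets chunked below),
    # and shuffle a copy of the labels so the VALID_LABELS default is not mutated in place
    random.shuffle(fact_list)
    label_choices = random.sample(label_choices, len(label_choices))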
    labels_str = ", ".join(label_choices)
    
    candidates = set()
    for fact_chunk in chunks(fact_list, MAX_BATCH_FACTS):
        facts_str = "\n".join(set(fact_chunk))
        prompt = f"""You are an entity resolution assistant. 
        You must classify the entity with name = {name}
        
        The valid choices are: {labels_str}
        
        
        Use both your inherent knowledge, and these facts derived from chat logs:
        {facts_str}
        
        Your response should begin with just one word from the following choices: {labels_str}, then should follow with an explanation of your reasoning.
        """
        
        response = client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}]).choices[0].message.content
        assert response, "empty response from model"
        print(f"***\nname: {name}\n\nfacts: {facts_str}\n\nresponse: {response}\n\n***\n")
        
        # whichever label appears first wins
        winner = None
        min_idx = float('inf')
        for label in label_choices:
            if label in response:
                idx = response.index(label)
                if idx < min_idx:
                    min_idx = idx
                    winner = label
        if winner:
            candidates.add(winner)
    if len(candidates) == 0:
        print('no candidates found, retrying')
        return refine_labels(row, current_round + 1, label_choices)
    elif len(candidates) == 1:
        winner = candidates.pop()
        print(f"winning candidate found: {winner}")
        return winner
    else:
        print(f"narrowed to candidates: {candidates}")
        
        return refine_labels(row, current_round + 1, list(candidates))
        
# curl http://localhost:1234/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d '{ 
#     "model": "lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF",
#     "messages": [ 
#       { "role": "system", "content": "Always answer in rhymes." },
#       { "role": "user", "content": "Introduce yourself." }
#     ], 
#     "temperature": 0.7, 
#     "max_tokens": -1,
#     "stream": true
# }'

# apply the refine_labels function to each row in the name_df
name_df['label'] = name_df.apply(refine_labels, axis=1)



upsampling facts
considering name 'Test Event'. round 1, label choices: ['PERSON', 'PET', 'ORG', 'PRODUCT', 'WEBSITE', 'GPE', 'TVSHOW', 'BOOK', 'MOVIE', 'TECHNICALCONCEPT', 'EVENT']. Number of facts: 1
***
name: 'Test Event'

facts: Google email 'tombedor@gmail.com' was linked for calendar event usage. Successfully created events 'Test Event' and 'Leave for LA' on request.

response: EVENT

I classified "Test Event" as an EVENT because it is mentioned in the context of being created on a calendar, which suggests that it is a scheduled occurrence or meeting, rather than any other type of entity. The fact that it was created along with another event ("Leave for LA") further supports this classification.

***

***
name: 'Test Event'

facts: Google email 'tombedor@gmail.com' was linked for calendar event usage. Successfully created events 'Test Event' and 'Leave for LA' on request.

response: EVENT

I classified "Test Event" as an EVENT because it is mentioned in the context of calendar ev
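
The "whichever label appears first wins" step in the cell above can be isolated and tested on its own. A minimal sketch, assuming word-boundary matching is wanted: unlike the plain substring check in the cell, this variant will not count `PET` inside a word like `PETER`. `first_label` is a hypothetical helper name.

```python
import re
from typing import Optional, Sequence


def first_label(response: str, label_choices: Sequence[str]) -> Optional[str]:
    """Return the allowed label that appears earliest in the response, or None."""
    winner, min_idx = None, len(response) + 1
    for label in label_choices:
        # \b keeps a label from matching inside a longer word
        match = re.search(rf"\b{re.escape(label)}\b", response)
        if match and match.start() < min_idx:
            winner, min_idx = label, match.start()
    return winner


reply = "EVENT\n\nI classified it as an EVENT rather than a PERSON."
print(first_label(reply, ["PERSON", "EVENT"]))
```

Matching is case-sensitive, which fits a prompt that asks for the label as the first word in upper case; a chattier model may need a looser rule.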

In [None]:


# refine entity labels
import os
from openai import OpenAI
import pandas as pd

person_labels = ["PERSON", "PET"]

os.environ['OPENAI_BASE_URL'] = "http://localhost:1234/v1"
os.environ['OPENAI_API_KEY'] = "lm-studio"
# model = "LM Studio Community/Meta-Llama-3-8B-Instruct-GGUF"
model = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
client = OpenAI()



# for all entities labeled as person, disambiguate between pets and people 
def refine_labels(row: pd.Series):
    name = row["name"]
    label = row["label"]
    if label == "PERSON" and name != "Tom":
        prompt = f"""You are an entity resolution assistant. 
        You must classify the entity with name = {name}
        
        The valid choices are: PERSON, PET, UNKNOWN
        A previous classifier labeled this entity as: {label}
        
        Use both your inherent knowledge, and these facts derived from chat logs:
        {row["fact"]} 
        
        The first word of your response should be one of the following: PERSON, PET, UNKNOWN
        Follow this with explanation of your reasoning.
        """

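        answer = client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}]).choices[0].message.content
        print(f"NAME: {name}\nFact:{row['fact']}.\n\n***\nWINNER = {answer}\n***")

        # keep only the leading label word; the rest of the response is explanation
        words = (answer or "").split()
        first_word = words[0].strip(".,:;*") if words else ""
        return first_word if first_word in ("PERSON", "PET", "UNKNOWN") else "UNKNOWN"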
    else:
        return label

        
# apply the function to all rows
# add column, new_label
entities_df["new_label"] = entities_df.apply(refine_labels, axis=1)
    

