# Idea

use out-of-the-box NER to extract entity names, then turn to an LLM to relabel them with a custom set of entity types

use outlines to run against LM Studio (presumably with quantized 7B models)


### Observations

Memory requirements for running local models seem really high, at least via spacy-llm

unclear whether that comes from inefficiencies in the spaCy library or is inherent to running the models
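A rough way to separate the two explanations is to estimate how much memory the weights alone need. The sketch below is back-of-envelope arithmetic only (parameter count times bits per parameter). It ignores KV cache, activations, and runtime overhead, and the helper name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# compare a 7B-parameter model at common precisions
for bits in (32, 16, 8, 4):
    print(f"7B params @ {bits}-bit: ~{weight_memory_gb(7e9, bits):.1f} GB")
```

By this estimate a 7B model needs roughly 14 GB at 16-bit versus 3.5 GB at 4-bit, so if a wrapper loads unquantized weights, that alone could plausibly explain a large part of the gap against a quantized GGUF build.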

wrapping models in libraries seems to lead to inefficient use of APIs: lots of rate limit errors etc. with both spaCy and outlines

LM Studio seems like a good way to make OpenAI-style calls without incurring cost or rate limits
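Since LM Studio exposes an OpenAI-style HTTP endpoint, a call does not strictly need a wrapper library. A minimal sketch using only the standard library, assuming the server is listening at `http://localhost:1234/v1` (the base URL used later in this notebook) and serves the usual `/chat/completions` route. The function names and the `temperature` setting are illustrative choices, not anything LM Studio requires.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # assumed LM Studio local server address


def build_chat_request(prompt: str, model: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # illustrative: keep classification output stable
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, model: str, timeout: float = 60.0) -> str:
    """Send one prompt and return the text of the first choice."""
    with urllib.request.urlopen(build_chat_request(prompt, model), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

`chat(...)` only works while the local server is running with a model loaded; `build_chat_request` can be inspected offline to check the payload.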

### Renting GPUs

Paperspace is probably still the best of the options: best UI, relatively easy to get notebooks running. some thrash from disconnected kernels that seem to keep running workloads

Azure is still very enterprisey, not friendly to a solo dev

Google Colab is underbaked and keeps you in their sub-par notebook environment. feels like abandonware/promotionware

probably the pricing model everyone lands on is subscription, with access to higher GPU RAM machines gated behind higher subscription tiers.

# Setup

## pipeline config

In [6]:
import os


CONFIG_CONTENT = """

[nlp]
lang = "en"
pipeline = ["ner"]

[components]

[components.ner]
source = "en_core_web_md"


[initialize]
vectors = "en_core_web_md"
"""


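# write the config to disk so assemble() can load it by path
with open('config.cfg', 'w') as f:
    f.write(CONFIG_CONTENT)


DATA_SOURCE_DIR = ''

# disable Hugging Face tokenizers parallelism (avoids the fork warning in notebooks)
os.environ['TOKENIZERS_PARALLELISM'] = 'false'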


## install requirements

# Init pipeline

In [7]:
import logging
import spacy_llm
from spacy_llm.util import assemble
import spacy
from transformers import AutoTokenizer



config = "config.cfg"
model_name = "en_core_web_md"

# stream spacy-llm debug logs to the notebook (StreamHandler defaults to stderr)
spacy_llm.logger.addHandler(logging.StreamHandler())
spacy_llm.logger.setLevel(logging.DEBUG)

try:
    nlp = assemble(config)
except OSError:
    # model package not installed yet: download it, then retry
    spacy.cli.download(model_name)
    nlp = assemble(config)



KeyboardInterrupt: 

### Refine NER

resolve disagreements by iterating


In [None]:
import pandas as pd
from tqdm import tqdm

DATA_PATH = 'msgs.pkl'

with open(DATA_PATH, 'rb') as f:
    docs_df = pd.read_pickle(f)
    
docs = docs_df[0].tolist()


entities_rows = []
for doc in tqdm(docs):
    enriched_doc = nlp(doc)
    ents = enriched_doc.ents
    for ent in ents:
        entities_rows.append({"name": ent.text, "label": ent.label_, "fact": doc})  # type: ignore


entities_df = pd.DataFrame(entities_rows)

with open('entities.pkl', 'wb') as f:
    entities_df.to_pickle(f)
    

100%|██████████| 193/193 [00:00<00:00, 207.79it/s]


### Resolve inconsistencies

In [None]:


# refine entity labels
import os
import outlines
from outlines import models
from outlines import generate
from outlines.models.openai import OpenAIConfig
from openai import OpenAI
import pandas as pd
from transformers import AutoTokenizer

person_labels = ["PERSON", "PET"]

# point the OpenAI client at the local LM Studio server
os.environ['OPENAI_BASE_URL'] = "http://localhost:1234/v1"
os.environ['OPENAI_API_KEY'] = "lm-studio"
# model = "LM Studio Community/Meta-Llama-3-8B-Instruct-GGUF"
model = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# synchronous client; reads OPENAI_BASE_URL and OPENAI_API_KEY from the environment
client = OpenAI()




# for all entities labeled as person, disambiguate between pets and people 
def refine_labels(row: pd.Series):
    name = row["name"]
    label = row["label"]
    if label == "PERSON":
        prompt = f"""You are an entity resolution assistant. 
        You must classify the entity with name = {name}
        
        The valid choices are: PERSON, PET
        
        A previous classifier labeled this entity as: {label}
        Use both your inherent knowledge, and these facts derived from chat logs:
        {row["fact"]} 
        
        Your response should be just one word from the following choices: PERSON, PET
        """
        
        # return only the message text, not the whole ChatCompletion object
        answer = client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}]).choices[0].message.content

        return answer
    else:
        return label
    
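# apply the function to all rows
# duplicate entities_df

# smoke test for constrained choice with outlines
# NOTE: `model` is still a plain model-name string here; generate.choice expects an
# outlines model object (e.g. one built via outlines.models), so it needs wrapping first
labels = ['foo', 'bar']
prompt = f"pick between the two options at random: {labels}"
generator = generate.choice(model, labels)
answer = generator(prompt)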



Task exception was never retrieved
future: <Task finished name='Task-712636' coro=<generate_chat() done, defined at /Users/tombedor/development/youbot/.venv/lib/python3.11/site-packages/outlines/models/openai.py:263> exception=KeyboardInterrupt()>
Traceback (most recent call last):
  File "/Users/tombedor/development/youbot/.venv/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3550, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/var/folders/8x/lhh8gmz10ts3rv_qd0jk3tyc0000gn/T/ipykernel_80732/1925375367.py", line 51, in <module>
    answer = generator(prompt)
             ^^^^^^^^^^^^^^^^^
  File "/Users/tombedor/development/youbot/.venv/lib/python3.11/site-packages/outlines/generate/choice.py", line 34, in generate_choice
    return model.generate_choice(prompt, choices, max_tokens)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tombedor/development/youbot/.venv/lib/python3.11/site-packages/outlines/models/o

Unexpected exception formatting exception. Falling back to standard exception


Traceback (most recent call last):
  File "/Users/tombedor/.pyenv/versions/3.11.7/lib/python3.11/asyncio/selector_events.py", line 265, in _add_reader
    key = self._selector.get_key(fd)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tombedor/.pyenv/versions/3.11.7/lib/python3.11/selectors.py", line 192, in get_key
    raise KeyError("{!r} is not registered".format(fileobj)) from None
KeyError: '80 is not registered'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/tombedor/development/youbot/.venv/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3550, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/var/folders/8x/lhh8gmz10ts3rv_qd0jk3tyc0000gn/T/ipykernel_80732/759284326.py", line 51, in <module>
    answer = generator(prompt)
             ^^^^^^^^^^^^^^^^^
  File "/Users/tombedor/development/youbot/.venv/lib/python3.11/site-packages/outlines/generate/choice.py",

In [18]:


# refine entity labels
import os
from openai import OpenAI
import pandas as pd

person_labels = ["PERSON", "PET"]

# point the OpenAI client at the local LM Studio server
os.environ['OPENAI_BASE_URL'] = "http://localhost:1234/v1"
os.environ['OPENAI_API_KEY'] = "lm-studio"
# model = "LM Studio Community/Meta-Llama-3-8B-Instruct-GGUF"
model = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
client = OpenAI()



# for all entities labeled as person, disambiguate between pets and people 
def refine_labels(row: pd.Series):
    name = row["name"]
    label = row["label"]
    if label == "PERSON" and name != "Tom":
        prompt = f"""You are an entity resolution assistant. 
        You must classify the entity with name = {name}
        
        The valid choices are: PERSON, PET, UNKNOWN
        A previous classifier labeled this entity as: {label}
        
        Use both your inherent knowledge, and these facts derived from chat logs:
        {row["fact"]} 
        
        The first word of your response should be one of the following: PERSON, PET, UNKNOWN
        Follow this with explanation of your reasoning.
        """

        answer = client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}]).choices[0].message.content
        print(f"NAME: {name}\nFact:{row['fact']}.\n\n***\nWINNER = {answer}\n***")

        return answer
    else:
        return label

        
# apply the function to all rows
# add column, new_label
entities_df["new_label"] = entities_df.apply(refine_labels, axis=1)
    



NAME: https://gist.github.com/rauchg/c5f0b1dc245ad95c593de8336aa382ac
Fact:On 28-Jan-2024, Tom shared a link for reference: 'https://gist.github.com/rauchg/c5f0b1dc245ad95c593de8336aa382ac'. This link pertains to searching via a perplexity API, potentially useful for his project..

***
WINNER =  Based on the given information and the fact that the previous classifier labeled this entity as a PERSON, I would maintain that classification, as there is no indication in the provided chat log that suggests this entity represents anything other than Tom, a person. The link shared by Tom does not have a name or any context indicating it is related to anything but his project and the perplexity API search. Therefore, my response is:

PERSON (Tom) - This classification remains unchanged based on both inherent knowledge and the given chat log context.
***
NAME: gspread
Fact:On 28-Jan-2024, Tom and I embarked on a quest to enhance our collaboration by integrating Google Suite tools. Tom planned to

KeyboardInterrupt: 
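
The `WINNER` output above shows the model burying the label mid-response rather than leading with it, so storing the raw response in `new_label` mixes labels with free text. One possible cleanup, sketched under assumptions: take the first valid label that appears as a whole uppercase word, and fall back to `UNKNOWN` otherwise. `extract_label` and `VALID_LABELS` are hypothetical names, and "first match wins" is a crude heuristic that can misread responses like "not a PET".

```python
import re

VALID_LABELS = ("PERSON", "PET", "UNKNOWN")  # the choices offered in the prompt


def extract_label(response: str, valid_labels=VALID_LABELS, default: str = "UNKNOWN") -> str:
    """Return the first valid label found as a whole word, else the default."""
    pattern = r"\b(" + "|".join(re.escape(label) for label in valid_labels) + r")\b"
    match = re.search(pattern, response)
    return match.group(1) if match else default


# a well-behaved response and one shaped like the output above
print(extract_label("PET. The chat log mentions walking and feeding."))
print(extract_label("Based on the facts, the classifier labeled this entity as a PERSON, so I agree."))
```

This could be applied to the existing column, e.g. `entities_df["new_label"].map(extract_label)`, keeping the raw response in a separate column for review.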