# Context-Dependent Embeddings

- Modified 13 Feb 2024

- This is the notebook for the experiments of the Embeddings walkthrough session
- https://www.youtube.com/watch?v=gVZryxJRdSY

- Key findings:
    - Long chunks of shared text tend to make the model rate embeddings as more similar (Key Finding 1)
    - Prepending context (Approach 1) and appending context (Approach 1.5) do not make the embeddings context-dependent
    - Modifying the text based on context (Approach 2) appears to solve this problem
    
- Other findings:
    - text-embedding-3-large and text-embedding-3-small handle negation well compared to text-embedding-ada-002

In [2]:
import numpy as np
import os
from openai import OpenAI

#API Keys
os.environ['OPENAI_API_KEY'] = '<YOUR OPENAI KEY HERE>'

# Helper Functions

In [67]:
def chat(system_prompt: str, user_prompt: str, model: str = 'gpt-3.5-turbo', temperature: float = 0, verbose: bool = False, host: str = 'openai', **kwargs):
    '''Performs a chat with the host's LLM using the given system prompt and user prompt
    Returns the output string res
    - system_prompt: String. Write in whatever you want the LLM to become. e.g. "You are a <purpose in life>"
    - user_prompt: String. The user input. Later, when we use it as a function, this is the function input
    - model: String. The LLM model to use for the chat completion
    - temperature: Float (default: 0). Sampling temperature passed to the LLM
    - verbose: Boolean (default: False). Whether or not to print out the system prompt, user prompt, GPT response
    - host: String. The provider of the LLM. Only 'openai' is supported
    - **kwargs: Dict. Additional arguments for LLM chat'''
    
    if host == 'openai':
        client = OpenAI()
        response = client.chat.completions.create(
            model=model,
            temperature = temperature,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            **kwargs
        )
        res = response.choices[0].message.content

        if verbose:
            print('System prompt:', system_prompt)
            print('\nUser prompt:', user_prompt)
            print('\nGPT response:', res)
            
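    else:
        # res would otherwise be undefined for any other host
        raise ValueError(f"Unsupported host: {host!r}")

    return res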

In [68]:
def get_embedding(text: str, model="text-embedding-3-large", num_tries = 10):
    ''' Generates a text embedding using OpenAI Embeddings 
    Retries up to num_tries times to cater for API throttling issues
    Raises RuntimeError if every attempt fails'''
    text = text.replace("\n", " ")
    client = OpenAI()
    last_error = None
    for tries in range(num_tries):
        try:
            return client.embeddings.create(input = [text], model=model).data[0].embedding
        except Exception as e:
            # remember the error and try again
            last_error = e
    raise RuntimeError(f'Embedding request failed after {num_tries} tries') from last_error
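
`get_embedding` retries straight away after a failure. When the failures come from rate limiting, pausing between attempts is a common alternative. The following is a minimal sketch of exponential backoff, not the notebook's own method: `with_retries`, `base_delay` and `flaky_embedding` are hypothetical names, and the API call is replaced by a stand-in so the example runs without a key.

```python
import time


def with_retries(fn, num_tries=10, base_delay=0.5, sleep=time.sleep):
    """Call fn() until it succeeds, waiting base_delay * 2**attempt between attempts.

    Raises RuntimeError (chained to the last exception) if every attempt fails.
    """
    last_error = None
    for attempt in range(num_tries):
        try:
            return fn()
        except Exception as e:
            last_error = e
            # No point sleeping after the final attempt
            if attempt < num_tries - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"failed after {num_tries} tries") from last_error


# Stand-in for the embeddings call: fails twice, then succeeds (hypothetical)
calls = {"count": 0}

def flaky_embedding():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated throttling")
    return [0.1, 0.2, 0.3]

# Record the delays instead of actually sleeping, to keep the demo fast
delays = []
result = with_retries(flaky_embedding, num_tries=5, sleep=delays.append)
print(result, delays)
```

In real use, `fn` could be a `lambda` wrapping the `client.embeddings.create(...)` call, with `sleep` left at its default.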

In [69]:
def top_k_neighbours(query_embedding: list, embedding_list: list, k: int = 3, removedindex: list = None):
    ''' Given a query embedding, finds out the top k embeddings that are the most relevant
    Indices listed in removedindex are excluded from the ranking
    Returns the index vector of top k embeddings, in sorted order from most relevant to least relevant'''
    embedding_similarity = np.array([np.dot(query_embedding, emb) for emb in embedding_list], dtype=float)
    # push the removed indices to the bottom of the ranking
    if removedindex:
        embedding_similarity[removedindex] = -np.inf
    # this is if you do not care about the order in the top k
    # return np.argpartition(embedding_similarity, -k)[-k:] 
    # this is if you care about order in the top k
    return np.argsort(embedding_similarity)[-k:][::-1].tolist()
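
To see the top-k ranking at work without calling the API, here is a small sketch on made-up 2-d vectors. It uses plain Python rather than numpy, and `top_k_by_dot` and `excluded` are hypothetical names for illustration. Unlike the helper above, it skips excluded indices outright instead of overwriting their scores.

```python
def top_k_by_dot(query, candidates, k=3, excluded=()):
    """Return the indices of the k candidates with the largest dot product
    against query, most similar first, skipping any index in excluded."""
    excluded = set(excluded)
    scored = [
        (sum(q * c for q, c in zip(query, cand)), idx)
        for idx, cand in enumerate(candidates)
        if idx not in excluded
    ]
    # Highest score first; ties go to the lower index
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy vectors, made up for illustration
query = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [1.0, 0.0]]

top_all = top_k_by_dot(query, candidates, k=2)
top_without_3 = top_k_by_dot(query, candidates, k=2, excluded=[3])
print(top_all, top_without_3)
```

Excluding an index is useful when the query itself is part of `candidates` and should not be returned as its own nearest neighbour.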

# Key Finding 1: Shorter text is better for embeddings

In [70]:
model = 'text-embedding-3-large'

In [71]:
text1 = "John went to the supermarket. Peter went to the gym. Mary went to the garden."
text2 = "John went to the airport. Peter went to the gym. Mary went to the garden."
np.dot(get_embedding(text1, model), get_embedding(text2, model))

0.9297407654620802

In [72]:
text1 = "Peter went to the gym. Mary went to the garden. John went to the supermarket."
text2 = "Peter went to the gym. Mary went to the garden. John went to the airport."
np.dot(get_embedding(text1, model), get_embedding(text2, model))

0.9600638691627703

In [73]:
text1 = "John went to the supermarket."
text2 = "John went to the airport."
np.dot(get_embedding(text1, model), get_embedding(text2, model))

0.6328660633894473
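
One way to build intuition for this finding is a toy model in which a text's vector is the average of its sentence vectors. This is only an assumption for illustration: real embedding models do not work this way, and the one-hot vectors below are made up. Under that assumption, two shared sentences are enough to pull otherwise unrelated texts together.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Made-up one-hot "sentence vectors" (hypothetical, not real embeddings)
supermarket = [1.0, 0.0, 0.0, 0.0]
airport = [0.0, 1.0, 0.0, 0.0]
gym = [0.0, 0.0, 1.0, 0.0]
garden = [0.0, 0.0, 0.0, 1.0]

# Only the differing sentence
short_sim = cosine(supermarket, airport)
# The differing sentence plus two shared sentences
long_sim = cosine(mean_vector([supermarket, gym, garden]),
                  mean_vector([airport, gym, garden]))
print(f"short: {short_sim:.3f}, long: {long_sim:.3f}")
```

In this toy setting the similarity rises from 0 to 2/3 purely because of the shared sentences, which mirrors the direction of the measurements above.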

# Context Dependent Embeddings (Baseline)

In [74]:
model = 'text-embedding-3-large'

In [75]:
# should be similar
np.dot(get_embedding('I went to the bank', model), get_embedding('I went to the river', model))

0.5897400164080131

In [76]:
# should be similar
np.dot(get_embedding('I went to the bank', model), get_embedding('I went to get money', model))

0.8018812180559

# Context Dependent Embeddings (Approach 1) - Prepending context

In [77]:
# should be similar
np.dot(get_embedding('Context: water. I went to the bank', model), 
       get_embedding('Context: water. I went to the river', model))

0.7513987620212337

In [78]:
# should be different
np.dot(get_embedding('Context: water. I went to the bank', model), 
       get_embedding('Context: water. I went to get money', model))

0.8789128046393934

In [79]:
# should be different
np.dot(get_embedding('Context: finance. I went to the bank', model), 
       get_embedding('Context: finance. I went to the river', model))

0.7817484942487075

In [80]:
# should be similar
np.dot(get_embedding('Context: finance. I went to the bank', model), 
       get_embedding('Context: finance. I went to get money', model))

0.8711684911398004

# Context Dependent Embeddings (Approach 1.5) - Appending context

In [81]:
# should be similar
np.dot(get_embedding('I went to the bank. Summarise this sentence in context of water:', model), 
       get_embedding('I went to the river. Summarise this sentence in context of water:', model))

0.8096693719142252

In [82]:
# should be different
np.dot(get_embedding('I went to the bank. Summarise this sentence in context of water:', model), 
       get_embedding('I went to get money. Summarise this sentence in context of water:', model))

0.9065902572233908

In [83]:
# should be different
np.dot(get_embedding('I went to the bank. Summarise this sentence in context of finance:', model), 
       get_embedding('I went to the river. Summarise this sentence in context of finance:', model))

0.8251738788226315

In [84]:
# should be similar
np.dot(get_embedding('I went to the bank. Summarise this sentence in context of finance:', model), 
       get_embedding('I went to get money. Summarise this sentence in context of finance:', model))

0.9151529270746628

# Context Dependent Embeddings (Approach 2) - Modify the text prompt and use it for embeddings

In [85]:
def text_conversion(context, text):
    return chat(f'''Context is {context}. 
Refine text based on context without changing the text's meaning.
Do not add in what is not present in the text.
Some parts of the text may have more meaning based on context, highlight those.
If unable to refine, output original text''', text)

In [86]:
text_conversion('academia', 'I read a paper')

'I read an academic paper.'

In [87]:
text_conversion('news', 'I read a paper')

'I read a news article.'

In [88]:
text_conversion('exam', 'I read a paper')

'I read an exam paper.'

In [89]:
model = 'text-embedding-3-large'

In [90]:
def get_embedding_by_context(text: str, context: str = '', model: str = 'text-embedding-3-large'):
    ''' Gets an embedding based on the context. If context not given, does not do conversion '''
    converted_sentence = text_conversion(context, text) if context != '' else text
    return get_embedding(converted_sentence, model)

In [91]:
# should be similar
x, y, context = 'I went to the bank', 'I went to the river', ''
print('Embedding similarity', np.dot(get_embedding_by_context(x, context, model), 
                                     get_embedding_by_context(y, context, model)))

Embedding similarity 0.5897400164080131


In [92]:
# should be similar
x, y, context = 'I went to the bank', 'I went to the river', 'water'
print('Embedding similarity', np.dot(get_embedding_by_context(x, context, model), 
                                     get_embedding_by_context(y, context, model)))

Embedding similarity 0.8759913799435554


In [93]:
# should be different
x, y, context = 'I went to the bank', 'I went to the river', 'finance'
print('Embedding similarity', np.dot(get_embedding_by_context(x, context, model), 
                                     get_embedding_by_context(y, context, model)))

Embedding similarity 0.4956905762475252


In [94]:
# should be similar
x, y, context = 'I went to the bank', 'I went to get money', ''
print('Embedding similarity', np.dot(get_embedding_by_context(x, context, model), 
                                     get_embedding_by_context(y, context, model)))

Embedding similarity 0.8018812180559


In [95]:
# should be different
x, y, context = 'I went to the bank', 'I went to get money', 'water'
print('Embedding similarity', np.dot(get_embedding_by_context(x, context, model), 
                                     get_embedding_by_context(y, context, model)))

Embedding similarity 0.5029410724258297


In [96]:
# should be similar
x, y, context = 'I went to the bank', 'I went to get money', 'finance'
print('Embedding similarity', np.dot(get_embedding_by_context(x, context, model), 
                                     get_embedding_by_context(y, context, model)))

Embedding similarity 0.6847778551131576
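The three approaches can be compared by the margin between the pair that should be similar and the pair that should be different under each context. The sketch below only re-tabulates the similarities reported in the cells above, rounded to four decimals; the `margin` measure and the labels are introduced here for illustration and are not part of the original experiments.

```python
# (approach, context, should-be-similar score, should-be-different score)
# Values copied from the notebook outputs above, rounded to 4 decimals.
# water:   similar = bank/river, different = bank/get money
# finance: similar = bank/get money, different = bank/river
reported = [
    ("Approach 1 (prepend)", "water", 0.7514, 0.8789),
    ("Approach 1 (prepend)", "finance", 0.8712, 0.7817),
    ("Approach 1.5 (append)", "water", 0.8097, 0.9066),
    ("Approach 1.5 (append)", "finance", 0.9152, 0.8252),
    ("Approach 2 (modify text)", "water", 0.8760, 0.5029),
    ("Approach 2 (modify text)", "finance", 0.6848, 0.4957),
]

# A positive margin means the intended pair scored higher
margins = {
    (approach, context): round(similar - different, 4)
    for approach, context, similar, different in reported
}

for (approach, context), margin in margins.items():
    verdict = "separates" if margin > 0 else "fails to separate"
    print(f"{approach:26s} {context:8s} margin={margin:+.4f}  {verdict}")
```

On this single example, Approaches 1 and 1.5 only get the finance context right, which is the ordering the baseline already prefers, while Approach 2 has a positive margin for both contexts.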


# Negation of values (text-embedding-3-large)
- Gives lower similarity between a phrase and its negation than text-embedding-ada-002

In [97]:
model = 'text-embedding-3-large'

In [98]:
np.dot(get_embedding('have', model), get_embedding('do not have', model))

0.537687984526523

In [99]:
np.dot(get_embedding('Jonathan was present', model), get_embedding('Jonathan was absent', model))

0.7664456530397772

In [100]:
np.dot(get_embedding('present', model), get_embedding('absent', model))

0.38758236896648085

In [101]:
np.dot(get_embedding('present', model), get_embedding('not present', model))

0.48815064811640596

# Negation of values (text-embedding-ada-002)

In [102]:
model = 'text-embedding-ada-002'

In [103]:
np.dot(get_embedding('have', model), get_embedding('do not have', model))

0.8612163753882298

In [104]:
np.dot(get_embedding('Jonathan was present', model), get_embedding('Jonathan was absent', model))

0.9487405481131359

In [105]:
np.dot(get_embedding('present', model), get_embedding('absent', model))

0.8294703415200366

In [106]:
np.dot(get_embedding('present', model), get_embedding('not present', model))

0.8508408212164676

# Negation of values (text-embedding-3-small)

In [107]:
model = 'text-embedding-3-small'

In [108]:
np.dot(get_embedding('have', model), get_embedding('do not have', model))

0.5316601210294082

In [109]:
np.dot(get_embedding('Jonathan was present', model), get_embedding('Jonathan was absent', model))

0.835158308129629

In [110]:
np.dot(get_embedding('present', model), get_embedding('absent', model))

0.4002265066603564

In [111]:
np.dot(get_embedding('present', model), get_embedding('not present', model))

0.44254542744698094
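
To compare the three models side by side, the sketch below averages the four negation-pair similarities reported above for each model, rounded to four decimals. The averaging is an illustrative summary added here, not something computed in the original session; a lower mean indicates that the model separates a phrase from its negation more strongly.

```python
from statistics import mean

# Similarities copied from the notebook outputs above, rounded to 4 decimals.
# Order of pairs: have / do not have, Jonathan was present / absent,
#                 present / absent, present / not present
negation_scores = {
    "text-embedding-3-large": [0.5377, 0.7664, 0.3876, 0.4882],
    "text-embedding-ada-002": [0.8612, 0.9487, 0.8295, 0.8508],
    "text-embedding-3-small": [0.5317, 0.8352, 0.4002, 0.4425],
}

mean_scores = {model: mean(scores) for model, scores in negation_scores.items()}

# Print models from strongest separation (lowest mean) to weakest
for model, score in sorted(mean_scores.items(), key=lambda item: item[1]):
    print(f"{model:24s} mean similarity = {score:.4f}")
```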