Check the InstructOR 'advanced' case and confirm the behaviour of the vector database.

In addition, since we test three different models, and two different uses of the InstructOR model, looking at the range of the similarity scores over even this very small example gives an indication of the relative performance of the embedding models. Better models should have a wider range between matching and non-matching answers.

In [143]:
query  = ['If I become involved in treatment, what do I need to know?']

corpus = ['Feeling comfortable with the professional you or your child is working with is critical to the success of your treatment. Finding the professional who best fits your needs may require some research.',
          'There are many types of mental health professionals. Finding the right one for you may require some research.',
          'There are many types of mental health professionals. The variety of providers and their services may be confusing. Each have various levels of education, training, and may have different areas of expertise. Finding the professional who best fits your needs may require some research.',
          'When healing from mental illness, early identification and treatment are of vital importance. Based on the nature of the illness, there are a range of effective treatments available. For any type of treatment, it is essential that the person affected is proactive and fully engaged in their own recovery process.\nMany people with mental illnesses who are diagnosed and treated respond well, although some might experience a return of symptoms. Even in such cases, with careful monitoring and management of the disorder, it is still quite possible to live a fulfilled and productive life.']

In [32]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

all_MiniLM_L6_v2 is easy to use and fast, so a working example built on it makes debugging easy. It also gives a benchmark for how well vector embeddings differentiate between matching and non-matching answers.

In [2]:
from sentence_transformers import SentenceTransformer
all_MiniLM_L6_v2 = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# returns the embedding as a numpy.ndarray of shape (1, 384)
def embed_all_MiniLM_L6_v2(text):
    text = text.replace("\n", " ")
    return all_MiniLM_L6_v2.encode(text).reshape(1,384)


Manually calculate the cosine similarity between the query and the 4 answers that make up the knowledge base corpus.

In [153]:
query_embedding_all_MiniLM_L6_v2 = embed_all_MiniLM_L6_v2(query[0])
corpus_embedding_all_miniLM_L6_v2 = np.concatenate([embed_all_MiniLM_L6_v2(str(x)) for x in corpus])
manual_similarity = cosine_similarity(query_embedding_all_MiniLM_L6_v2, corpus_embedding_all_miniLM_L6_v2)
print("Similarity scores: " + str(manual_similarity))
print("Similarity Range: " + str(manual_similarity.max() - manual_similarity.min()))

Similarity scores: [[0.33290067 0.3841326  0.34424317 0.40948355]]
Similarity Range: 0.07658288
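To make explicit what the manual calculation is doing, here is a minimal NumPy-only sketch of cosine similarity: each row is scaled to unit length and the dot products are then taken. The function name `cosine_similarity_manual` and the toy vectors are assumptions made for illustration; they are not real embeddings and are not part of the notebook above.

```python
import numpy as np

def cosine_similarity_manual(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = np.atleast_2d(np.asarray(a, dtype=float))
    b = np.atleast_2d(np.asarray(b, dtype=float))
    # Scale each row to unit length, then take pairwise dot products
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Toy vectors (made up for illustration, not real embeddings)
toy_query = np.array([[1.0, 0.0]])
toy_corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

toy_scores = cosine_similarity_manual(toy_query, toy_corpus)
print("Similarity scores: " + str(toy_scores))
print("Similarity Range: " + str(toy_scores.max() - toy_scores.min()))
```

The result has one row per query and one column per corpus entry, which is the same layout as the scores printed above.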


Perform a similar calculation using the OpenAI embedding model. This is only here for completeness; it does not add anything to the exercise.

In [107]:
import os  # needed for os.getenv below
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")
openAIAda ="text-embedding-ada-002"

def embed_ada(text):
    text = text.replace("\n", " ")
    ans_as_list = openai.Embedding.create(input = [text], model=openAIAda)['data'][0]['embedding']
    return np.array(ans_as_list).reshape(1,1536)


In [152]:
query_embedding_ada = embed_ada(query[0])
corpus_embedding_ada = np.concatenate([embed_ada(str(x)) for x in corpus])
manual_similarity = cosine_similarity(query_embedding_ada, corpus_embedding_ada)
print("Base Similarity Score: " + str(manual_similarity))
print("Similarity Range: " + str(manual_similarity.max() - manual_similarity.min()))


Base Similarity Score: [[0.80554999 0.78572934 0.77822775 0.80581569]]
Similarity Range: 0.027587943818157212


The InstructOR series of embedding models can be used naively (called 'base' below) or with a set of instructions. Using the instructions improves model performance, but care is needed with the square brackets when moving to the advanced version: each input becomes an [instruction, text] pair, and encode is given a list of such pairs.

In [134]:
from InstructorEmbedding import INSTRUCTOR
instructor_embedding_large = INSTRUCTOR('hkunlp/instructor-large')

def embed_instructor_base(text):
    text = text.replace("\n", " ")
    return instructor_embedding_large.encode([text])[0].reshape(1,768)

# For the advanced case, where we preface the text with an instruction, we create separate functions to embed the knowledge base and the queries
# NB: encode is given a list of [instruction, text] pairs, hence the double square brackets below
def embed_instructor_corpus(text):
    text = text.replace("\n", " ")
    return instructor_embedding_large.encode([['Represent the Medical document for retrieval: ', text]])[0].reshape(1,768)

def embed_instructor_query(text):
    text = text.replace("\n", " ")
    return instructor_embedding_large.encode([['Represent the Medical question for retrieving supporting documents: ', text]])[0].reshape(1,768)


load INSTRUCTOR_Transformer
max_seq_length  512
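The bracket nesting is easier to see in isolation. The sketch below builds the nested list without loading any model. `build_instructor_inputs` is a hypothetical helper introduced here for illustration, not part of the InstructorEmbedding package, and the two sample texts are made up.

```python
def build_instructor_inputs(instruction, texts):
    """Pair one instruction with each text: [[instruction, text], ...]."""
    if isinstance(texts, str):
        texts = [texts]  # a single string still needs the outer list
    return [[instruction, text.replace("\n", " ")] for text in texts]

corpus_instruction = 'Represent the Medical document for retrieval: '

# One text: outer list of length 1 holding a single [instruction, text] pair
single = build_instructor_inputs(corpus_instruction, 'First answer.')

# Several texts: one [instruction, text] pair per document
batch = build_instructor_inputs(corpus_instruction,
                                ['First answer.', 'Second\nanswer.'])

print(single)
print(batch)
```

Assuming encode behaves as it does in the functions above, passing `batch` to it should give one embedding row per pair, which would avoid calling the model once per document.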


In [151]:
query_embedding_instructor_base = embed_instructor_base(query[0])
corpus_embedding_instructor_base = np.concatenate([embed_instructor_base(str(x)) for x in corpus])
manual_similarity = cosine_similarity(query_embedding_instructor_base, corpus_embedding_instructor_base)
print("Base Similarity Score: " + str(manual_similarity))
print("Similarity Range: " + str(manual_similarity.max() - manual_similarity.min()))

query_embedding_instructor_advanced = embed_instructor_query(query[0])
corpus_embedding_instructor_advanced = np.concatenate([embed_instructor_corpus(str(x)) for x in corpus])
manual_similarity_advanced = cosine_similarity(query_embedding_instructor_advanced, corpus_embedding_instructor_advanced)
print("Advanced: " + str(manual_similarity_advanced))
print("Similarity Range: " + str(manual_similarity_advanced.max() - manual_similarity_advanced.min()))


print("Notice the increased similarity score range for the advanced model.")



Base Similarity Score: [[0.88351697 0.8718145  0.86523116 0.88637376]]
Similarity Range: 0.021142602
Advanced: [[0.8625516  0.8537026  0.8550642  0.88183063]]
Similarity Range: 0.028128028
Notice the increased similarity score range for the advanced model.
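For a side-by-side view, the sketch below recomputes the similarity range for each model from the scores printed in this notebook. The numbers are copied from the outputs above; the dictionary layout and the `similarity_range` helper are assumptions made for illustration. With only one query and four answers, this is an indication rather than a benchmark.

```python
import numpy as np

# Scores copied from the cell outputs in this notebook (query vs. the 4 corpus answers)
reported_scores = {
    "all_MiniLM_L6_v2":    [0.33290067, 0.3841326, 0.34424317, 0.40948355],
    "ada":                 [0.80554999, 0.78572934, 0.77822775, 0.80581569],
    "instructor_base":     [0.88351697, 0.8718145, 0.86523116, 0.88637376],
    "instructor_advanced": [0.8625516, 0.8537026, 0.8550642, 0.88183063],
}

def similarity_range(scores):
    """Spread between the highest and lowest similarity score."""
    scores = np.asarray(scores, dtype=float)
    return float(scores.max() - scores.min())

ranges = {name: similarity_range(s) for name, s in reported_scores.items()}

# Print the models from widest to narrowest range
for name, value in sorted(ranges.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.6f}")
```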


In [137]:
import chromadb
import uuid # to generate unique ids for each entry into the database - these should be something like the FAQ id

# in-memory database
client = chromadb.Client()
collection_all_MiniLM_L6_v2 = client.create_collection(name="all_MiniLM_L6_v2", metadata={"hnsw:space": "cosine"})

# add each row from corpus_embedding_all_miniLM_L6_v2 to the collection
for i in range(corpus_embedding_all_miniLM_L6_v2.shape[0]):
    collection_all_MiniLM_L6_v2.add(embeddings=corpus_embedding_all_miniLM_L6_v2[i].tolist(), 
                                    documents=[corpus[i]],
                                    ids = [str(uuid.uuid1())])

collection_instructor_base = client.create_collection(name="instructor_base", metadata={"hnsw:space": "cosine"})
for i in range(corpus_embedding_instructor_base.shape[0]):
    collection_instructor_base.add(embeddings=corpus_embedding_instructor_base[i].tolist(), 
                                    documents=[corpus[i]],
                                    ids = [str(uuid.uuid1())])

collection_instructor_advanced = client.create_collection(name="instructor_advanced", metadata={"hnsw:space": "cosine"})
for i in range(corpus_embedding_instructor_advanced.shape[0]):
    collection_instructor_advanced.add(embeddings=corpus_embedding_instructor_advanced[i].tolist(), 
                                    documents=[corpus[i]],
                                    ids = [str(uuid.uuid1())])
                                    

collection_ada = client.create_collection(name="ada", metadata={"hnsw:space": "cosine"})
for i in range(corpus_embedding_ada.shape[0]):
    collection_ada.add(embeddings=corpus_embedding_ada[i].tolist(), 
                                    documents=[corpus[i]],
                                    ids = [str(uuid.uuid1())])
                                    


Using embedded DuckDB without persistence: data will be transient
No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction
No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction
No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction
No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction


In [154]:
def search_collection_using_embedding(collection, embedding, num_docs):
    results = collection.query(query_embeddings=embedding, n_results=num_docs)
    df = pd.DataFrame({"Answer": results['documents'][0],
                        "Score":results['distances'][0]})
    return df


In [138]:
results_all_MiniLM_L6_v2 = search_collection_using_embedding(collection = collection_all_MiniLM_L6_v2, embedding = embed_all_MiniLM_L6_v2(query[0]).tolist(), num_docs=4)
results_instructor_base = search_collection_using_embedding(collection = collection_instructor_base, embedding = embed_instructor_base(query[0]).tolist(), num_docs=4)
results_instructor_advanced = search_collection_using_embedding(collection = collection_instructor_advanced, embedding = embed_instructor_query(query[0]).tolist(), num_docs=4)
results_ada = search_collection_using_embedding(collection = collection_ada, embedding = embed_ada(query[0]).tolist(), num_docs=4)

In [139]:
# add the column 'Manual_Score' by iterating through results_...['Answer'], embedding each answer, and calling cosine_similarity on the query embedding and the answer embedding
results_all_MiniLM_L6_v2['Manual_Score'] = [cosine_similarity(query_embedding_all_MiniLM_L6_v2, embed_all_MiniLM_L6_v2(x))[0][0] for x in results_all_MiniLM_L6_v2['Answer']]
results_instructor_base['Manual_Score'] = [cosine_similarity(query_embedding_instructor_base, embed_instructor_base(x))[0][0] for x in results_instructor_base['Answer']]
results_instructor_advanced['Manual_Score'] = [cosine_similarity(query_embedding_instructor_advanced, embed_instructor_corpus(x))[0][0] for x in results_instructor_advanced['Answer']]
results_ada['Manual_Score'] = [cosine_similarity(query_embedding_ada, embed_ada(x))[0][0] for x in results_ada['Answer']]

In [140]:
results_all_MiniLM_L6_v2['Total_Score'] = results_all_MiniLM_L6_v2['Score'] + results_all_MiniLM_L6_v2['Manual_Score']
results_instructor_base['Total_Score'] = results_instructor_base['Score'] + results_instructor_base['Manual_Score']
results_instructor_advanced['Total_Score'] = results_instructor_advanced['Score'] + results_instructor_advanced['Manual_Score']
results_ada['Total_Score'] = results_ada['Score'] + results_ada['Manual_Score']
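The reason for summing the two columns: with "hnsw:space" set to "cosine", the Score returned by the collection is, as far as I understand Chroma's behaviour, a cosine distance (1 minus cosine similarity), so Score plus Manual_Score should come out close to 1 for every row. The sketch below runs that check on the instructor_base figures reported in this notebook. The helper name `distances_match_similarities` and the tolerance are assumptions chosen for illustration.

```python
import numpy as np

def distances_match_similarities(distances, similarities, tol=1e-4):
    """True if distance + similarity is within tol of 1 for every row."""
    total = np.asarray(distances, dtype=float) + np.asarray(similarities, dtype=float)
    return bool(np.all(np.abs(total - 1.0) <= tol))

# Values copied from the instructor_base results in this notebook
db_distances = [0.113626, 0.116483, 0.128185, 0.134768]   # 'Score' column
manual_scores = [0.886374, 0.883517, 0.871814, 0.865231]  # 'Manual_Score' column

print(distances_match_similarities(db_distances, manual_scores))
```

A row that fails this check would suggest the stored embedding and the manually recomputed one differ, for example because the wrong embedding function was used for the collection.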

In [141]:
results_instructor_base

                                              Answer     Score  Manual_Score  Total_Score
0  When healing from mental illness, early identi...  0.113626      0.886374          1.0
1  Feeling comfortable with the professional you ...  0.116483      0.883517          1.0
2  There are many types of mental health professi...  0.128185      0.871814          1.0
3  There are many types of mental health professi...  0.134768      0.865231     0.999999


In [142]:
results_instructor_advanced

                                              Answer     Score  Manual_Score  Total_Score
0  When healing from mental illness, early identi...   0.11817      0.881831          1.0
1  Feeling comfortable with the professional you ...  0.137449      0.862552          1.0
2  There are many types of mental health professi...  0.144936      0.855064          1.0
3  There are many types of mental health professi...  0.146298      0.853703          1.0


In [126]:
results_ada

                                              Answer     Score  Manual_Score  Total_Score
0  When healing from mental illness, early identi...  0.194157      0.805804     0.999961
1  Feeling comfortable with the professional you ...  0.194554      0.805409     0.999963
2  There are many types of mental health professi...  0.214309      0.785691          1.0
3  There are many types of mental health professi...  0.221809      0.778192     1.000001


In [127]:
results_all_MiniLM_L6_v2

                                              Answer     Score  Manual_Score  Total_Score
0  When healing from mental illness, early identi...  0.590517      0.409484          1.0
1  There are many types of mental health professi...  0.615867      0.384133          1.0
2  There are many types of mental health professi...  0.655757      0.344243          1.0
3  Feeling comfortable with the professional you ...  0.667099      0.332901          1.0
