A copy of the main model performance test, except that the "Answers" text is split into paragraphs, using the newline character as the separator. This is done to check whether changing the granularity of the embedded text changes the model performance.

Spoiler alert - it does. Don't make the text that goes to your embedding model too small!

In [1]:
import pandas as pd
raw_data = pd.read_csv('./inputdata/Mental_Health_FAQ.csv')
raw_data.shape

(98, 3)

In [2]:
# Count the words in each entry in the "Answers" column of the dataframe and output the minimum, maximum and average word count 
raw_data['word_count'] = raw_data['Answers'].apply(lambda x: len(str(x).split(" ")))
print(raw_data['word_count'].describe())

count      98.000000
mean      261.030612
std       232.316060
min        16.000000
25%        84.750000
50%       197.000000
75%       396.000000
max      1453.000000
Name: word_count, dtype: float64


In [3]:
# Create a new dataframe with one row per paragraph. Each row keeps the Question_ID and Questions columns,
# and its Answers column holds a single paragraph of the original answer.
# An original answer made of two paragraphs therefore becomes two rows in the new dataframe.
data = []
for index, row in raw_data.iterrows():
    answer = row['Answers']
    if "\n" in answer:
        paragraphs = answer.split("\n")
        for paragraph in paragraphs:
            data.append({'Question_ID': row['Question_ID'], 'Questions': row['Questions'], 'Answers': paragraph})
    else:
        data.append({'Question_ID': row['Question_ID'], 'Questions': row['Questions'], 'Answers': answer})

df = pd.DataFrame(data)

df['word_count'] = df['Answers'].apply(lambda x: len(str(x).split(" ")))
print(df['word_count'].describe())

n = 128
count = df['word_count'][df['word_count'] > n].count()
print("The number of rows that contain more than " + str(n) + " words is " + str(count) + " or " + str(count/len(df)*100) + "% of the total")

count    600.000000
mean      43.471667
std       35.725571
min        1.000000
25%       16.000000
50%       35.000000
75%       60.000000
max      215.000000
Name: word_count, dtype: float64
The number of rows that contain more than 128 words is 18 or 3.0% of the total
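
A note on the minimum of 1 above: counting words with `split(" ")` never returns 0, so blank lines between paragraphs (which become empty strings after splitting on the newline) are probably what shows up as 1-word rows. The sketch below only illustrates that behaviour of `str.split`; the sample strings are made up and are not taken from the dataset.

```python
# Illustrative strings only, not rows from the FAQ data
empty_paragraph = ""
double_spaced = "two  words"  # note the two spaces

# split(" ") keeps empty strings, so an empty paragraph still counts as one "word"
count_empty_explicit = len(empty_paragraph.split(" "))
count_double_explicit = len(double_spaced.split(" "))

# split() with no argument collapses runs of whitespace and drops empty strings
count_empty_default = len(empty_paragraph.split())
count_double_default = len(double_spaced.split())

print(count_empty_explicit, count_double_explicit)
print(count_empty_default, count_double_default)
```

Switching the notebook to `split()` would change the statistics printed above, so I have left the original counting in place.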


In [4]:
df.head()

   Question_ID                                    Questions                                             Answers  word_count
0      1590140  What does it mean to have a mental illness?  Mental illnesses are health conditions that di...          32
1      1590140  What does it mean to have a mental illness?  Mental illnesses fall along a continuum of sev...          68
2      1590140  What does it mean to have a mental illness?  It is important to know that mental illnesses ...          43
3      1590140  What does it mean to have a mental illness?  Similarly to how one would treat diabetes with...          65
4      2110618              Who does mental illness affect?  It is estimated that mental illness affects 1 ...          46
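
With a median of 35 words per paragraph, many of these rows are very short. One possible way to keep chunks from getting too small is to merge consecutive paragraphs until a minimum word count is reached. This is only a sketch and is not what this notebook does: the function name `merge_short_paragraphs` and the `min_words` threshold are hypothetical, and the sample text is made up.

```python
def merge_short_paragraphs(paragraphs, min_words=50):
    """Greedily join consecutive paragraphs until each chunk has at least min_words words.

    Hypothetical helper: the threshold is an assumption, not a tested value.
    """
    chunks, current = [], []
    for paragraph in paragraphs:
        if not paragraph.strip():
            continue  # skip blank lines produced by consecutive newlines
        current.append(paragraph.strip())
        if sum(len(p.split()) for p in current) >= min_words:
            chunks.append(" ".join(current))
            current = []
    if current:
        # A short remainder is attached to the previous chunk when there is one
        if chunks:
            chunks[-1] = chunks[-1] + " " + " ".join(current)
        else:
            chunks.append(" ".join(current))
    return chunks


# Made-up text with a blank line and a short trailing paragraph
text = "one two three\n\nfour five\nsix seven eight nine ten\neleven"
chunks = merge_short_paragraphs(text.split("\n"), min_words=5)
print(chunks)
```

Whether this actually improves retrieval would have to be measured with the same matching test used at the end of this notebook.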


In [5]:
model_list = ["all-MiniLM-L6-v2", "text-embedding-ada-002", "instructor_large", "instructor_xl", "e5-base-v2", "e5-large-v2"]

model_to_use = model_list[0]
print("Using model " + model_to_use)


dbf_postscript = ""
print("Embedding on paragraphs")
dbf_postscript = "_paragraphs"
data_frame_to_use = df

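# Escape the backslash: "\d" is an invalid escape sequence and raises a warning in recent Python versions
dbf = ".\\db_" + model_to_use + dbf_postscript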

Using model all-MiniLM-L6-v2
Embedding on paragraphs


In [6]:
'''
Define a dictionary where the keys are the model names and the values are the functions or actions we want to take for each one.

The challenge here is that the actions for each model_to_use value are not consistent - in some cases, you're importing a model 
and in others, you're setting a value.

For the models where you're importing a model and instantiating it with a string, define a function that does that and use 
the function in your dictionary. For the models where you're just setting a string, you can just put the string in your dictionary.
'''

def import_instructor_model(model_name):
    from InstructorEmbedding import INSTRUCTOR
    return INSTRUCTOR(model_name)

def import_sentence_transformer(model_name):
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name)

def import_openai_model(model_name):
    import os  # needed for os.getenv below
    import openai
    openai.api_key = os.getenv("OPENAI_API_KEY")
    return model_name

model_to_function_map = {
    "instructor_large": lambda: import_instructor_model('hkunlp/instructor-large'),
    "instructor_xl": lambda: import_instructor_model('hkunlp/instructor-xl'),
    "e5-base-v2": 'intfloat/e5-base-v2',
    "e5-large-v2": 'intfloat/e5-large-v2',
    "all-MiniLM-L6-v2": lambda: import_sentence_transformer('sentence-transformers/all-MiniLM-L6-v2'),
    "text-embedding-ada-002": lambda: import_openai_model('text-embedding-ada-002'),
}

if model_to_use not in model_to_function_map:
    raise ValueError(f"model_to_use must be one of {', '.join(model_to_function_map.keys())}")

model = model_to_function_map[model_to_use]
if callable(model):
    model = model()  # Call the function if the model is a function

In [7]:
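from importlib import reload
import create_vector_db

# Reload first so that edits to create_vector_db.py are picked up,
# then import the classes from the freshly reloaded module
reload(create_vector_db)
from create_vector_db import InstructorEmbeddingModel, e5EmbeddingModel, AllMiniLML6v2, OpenAIAda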

model_class_map = {
    "instructor_large": InstructorEmbeddingModel,
    "instructor_xl": InstructorEmbeddingModel,
    "e5-base-v2": e5EmbeddingModel,
    "e5-large-v2": e5EmbeddingModel,
    "all-MiniLM-L6-v2": AllMiniLML6v2,
    "text-embedding-ada-002": OpenAIAda
}

if model_to_use not in model_class_map:
    raise ValueError(f"model_to_use must be one of {', '.join(model_class_map.keys())}")

my_model = model_class_map[model_to_use](model, dbf)

Using embedded DuckDB with persistence: data will be stored in: .\db_all-MiniLM-L6-v2_paragraphs


running on GPU


In [8]:
# While the EmbeddingModel class has a method to create embeddings based on other columns, or a combination of columns,
# I have only tested this when we use the "Answers" column as the source of the embeddings.  

name_answer = "Answers"
#name_combined = "Combined"
#name_faq = "Questions"


In [9]:
try:
    client = my_model.client
    collection_answers = client.get_collection(name=name_answer.lower())
    #collection_question = client.get_collection(name=name_faq.lower())
    #collection_combined = client.get_collection(name=name_combined.lower())
    print("Database already exists, reading from it")

except (NameError, ValueError):
    print("Collection does not exist, creating it")
    my_model.embed_and_save_knowledge_base(text_faq_data_df = data_frame_to_use, embedding_column = name_answer)
    client = my_model.client
    collection_answers = client.get_collection(name=name_answer.lower())
    #collection_question = client.get_collection(name=name_faq.lower())
    #collection_combined = client.get_collection(name=name_combined.lower())


No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction


Database already exists, reading from it


In [10]:
import ast
from pathlib import Path

question_embedding_file_to_create = Path("./question_embeddings") / f"{model_to_use}.csv"

if question_embedding_file_to_create.is_file():
    print(f"File {question_embedding_file_to_create} already exists so just loading it")
    questions_df = pd.read_csv(question_embedding_file_to_create)
    # Convert the value in row['Embeddings'] from a string to a list
    questions_df['Embeddings'] = questions_df['Embeddings'].apply(ast.literal_eval)
else:
    questions_df = pd.DataFrame(raw_data['Questions'])
    questions_df = my_model.embed_test_questions(questions_df, 'Questions')
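    # Make sure the output folder exists before writing the file
    question_embedding_file_to_create.parent.mkdir(parents=True, exist_ok=True)
    questions_df.to_csv(question_embedding_file_to_create, index=False)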

File question_embeddings\all-MiniLM-L6-v2.csv already exists so just loading it
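
The `ast.literal_eval` step is needed because a list stored in a CSV cell comes back as plain text. The sketch below shows that round trip with the standard library `csv` module, to keep it dependency-free; pandas writes list cells the same way, via `str()`. The question text and the three-number embedding are made up.

```python
import ast
import csv
import io

# Write one row whose second cell is a Python list (made-up values)
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Questions", "Embeddings"])
writer.writerow(["What is X?", [0.1, 0.2, 0.3]])

# Read it back: the list is now just a string that looks like a list
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw_value = rows[0]["Embeddings"]

# literal_eval safely parses the string back into a real list of floats
parsed_value = ast.literal_eval(raw_value)

print(type(raw_value).__name__, type(parsed_value).__name__)
```

This only works when the cell was written from a plain Python list. A NumPy array is printed without commas, so its text form cannot be parsed this way.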


In [11]:
def check_matches(row):
    embedding = row['Embeddings']
    search_results = my_model.search_collection_using_embedding(collection_answers, embedding=embedding, num_docs=4)
    
    # Check if current question is in search results
    match = row['Questions'] in search_results['Questions'].values

    # Check if current question is the top match in search results. The top match is the row with the lowest Score
    # (the Score behaves like a distance here, so the lowest Score corresponds to the highest cosine similarity).
    # I could equally have rerun the search with num_docs=1 and checked if the expected answer was the only result
    top_match = row['Questions'] == search_results.loc[search_results['Scores'].idxmin(), 'Questions']

    return pd.Series([match, top_match])

# Apply check_matches to each row in questions_df and count True values
questions_df[['Match', 'Top_Match']] = questions_df.apply(check_matches, axis=1)
match_count = questions_df['Match'].sum()
top_match_count = questions_df['Top_Match'].sum()

print(f"Total questions: {len(questions_df)}")
print(f"Top Matched questions: {top_match_count}")
print(f"Matched questions: {match_count}")

Total questions: 98
Top Matched questions: 60
Matched questions: 76
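
To make these counts easier to compare with other runs, they can be turned into hit rates. The numbers below are copied from the output above; the variable names `top_1_rate` and `top_4_rate` are mine (4 because the search uses `num_docs=4`).

```python
# Counts copied from the output above
total_questions = 98
top_match_count = 60
match_count = 76

# Share of questions whose own answer was the best hit, or among the 4 hits returned
top_1_rate = top_match_count / total_questions
top_4_rate = match_count / total_questions

print(f"Top-1 hit rate: {top_1_rate:.1%}")
print(f"Top-4 hit rate: {top_4_rate:.1%}")
```

That works out to roughly 61.2% for the top match and 77.6% for a match anywhere in the four results.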
