Read the raw Kaggle data into a DataFrame

In [2]:
import pandas as pd
raw_data = pd.read_csv('./inputdata/Mental_Health_FAQ.csv')
raw_data.shape

(98, 3)

To get a feel for the data, run this cell a few times; each run prints a randomly sampled answer

In [3]:
# select the Answers column for a random row to get a flavor of the data
answer = raw_data['Answers'].sample(1).values[0]
print(answer)

# if you want to see an example of a long answer, uncomment this
# answer = raw_data.loc[raw_data['Question_ID'] == 7535002, 'Answers'].iloc[0]
# print(answer)

A personality disorder is a pattern of thoughts, feelings, and behaviours that last for a long time and causes some sort of problem or distress. 
 Schizoid personality disorder or SPD affects social interactions and relationships. People with SPD may have a hard time relating to others and showing emotions. They may avoid close relationships and prefer to spend their time alone, seeming distant even to close family members. Many people don’t respond to strong emotions like anger, even when others try to provoke them. On the outside, people with SPD may seem cold or aloof, showing little emotion. 
 While they have a similar name, schizoid personality disorder isn’t the same as schizophrenia. 
 Schizoid personality disorder is believed to be relatively uncommon. While some people with SPD may see it as part of who they are, other people may feel a lot of distress, especially around social interactions. Some medications may help people manage symptoms and psychotherapy may help people bui

Choose the model to run here

In [28]:
model_list = ["all-MiniLM-L6-v2", "text-embedding-ada-002", "instructor_large", "instructor_xl", "e5-base-v2", "e5-large-v2"]
model_to_use = model_list[4]
print("Using model " + model_to_use)


# create the output folder name
dbf_postscript = "_full_answers"
print("Embedding on full answers")
data_frame_to_use = raw_data
dbf = ".\\db_" + model_to_use + dbf_postscript  # escape the backslash: "\d" is an invalid escape sequence

Using model e5-base-v2
Embedding on full answers


I have split the instantiation of the model into two cells, which makes it look a little convoluted. I did this because loading big models takes time, and I was continually changing my `create_vector_db.py` file and did not want to reload the big model after every change. You can combine the two cells to make the notebook more readable, but only do that once you are done making changes to `create_vector_db.py`.
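The idea behind the split can be illustrated with a toy example. This is only a sketch: `demo_module` and `expensive_model` are hypothetical stand-ins for `create_vector_db.py` and the loaded embedding model. The expensive object is created once, while the helper module is edited and reloaded around it.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for an expensive object such as a loaded embedding model.
expensive_model = {"name": "pretend-large-model"}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical helper module, written to a temp folder for the demo.
    module_path = Path(tmp_dir) / "demo_module.py"
    module_path.write_text("def describe(model):\n    return 'v1 ' + model['name']\n")
    sys.path.insert(0, tmp_dir)
    try:
        import demo_module
        before = demo_module.describe(expensive_model)

        # Simulate editing the helper file. The new source has a different
        # length, so cached bytecode from the first import cannot be reused.
        module_path.write_text("def describe(model):\n    return 'version 2 ' + model['name']\n")
        importlib.reload(demo_module)

        # The same expensive_model object is reused; only the code changed.
        after = demo_module.describe(expensive_model)
    finally:
        sys.path.remove(tmp_dir)
        sys.modules.pop("demo_module", None)

print(before)
print(after)
```

Note that names bound earlier with `from demo_module import describe` would still point at the old function after a reload, so access through the module (or re-import after reloading) is the safer pattern.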

In [29]:
'''
Define a dictionary where the keys are the model names and the values are the functions or actions we want to take for each one.

The challenge here is that the actions for each model_to_use value are not consistent - in some cases, you're importing a model 
and in others, you're setting a value.

For the models where you're importing a model and instantiating it with a string, define a function that does that and use 
the function in your dictionary. For the models where you're just setting a string, you can just put the string in your dictionary.
'''

def import_instructor_model(model_name):
    from InstructorEmbedding import INSTRUCTOR
    return INSTRUCTOR(model_name)

def import_sentence_transformer(model_name):
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name)

def import_openai_model(model_name):
    import os  # needed for os.getenv below
    import openai
    openai.api_key = os.getenv("OPENAI_API_KEY")
    return model_name

model_to_function_map = {
    "instructor_large": lambda: import_instructor_model('hkunlp/instructor-large'),
    "instructor_xl": lambda: import_instructor_model('hkunlp/instructor-xl'),
    "e5-base-v2": 'intfloat/e5-base-v2',
    "e5-large-v2": 'intfloat/e5-large-v2',
    "all-MiniLM-L6-v2": lambda: import_sentence_transformer('sentence-transformers/all-MiniLM-L6-v2'),
    "text-embedding-ada-002": lambda: import_openai_model('text-embedding-ada-002'),
}

if model_to_use not in model_to_function_map:
    raise ValueError(f"model_to_use must be one of {', '.join(model_to_function_map.keys())}")

model = model_to_function_map[model_to_use]
if callable(model):
    model = model()  # Call the function if the model is a function

HuggingFace models print a useful summary of their important parameters. If you evaluate the `model` variable in a cell, you get to see these. For reference, here are the outputs for `instructor_large` and `all-MiniLM-L6-v2`:

```
INSTRUCTOR(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: T5EncoderModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  (2): Dense({'in_features': 1024, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Normalize()
)
```
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
```


Part 2 of my convoluted load process. Please feel free to combine this with the previous cell once you are done making changes to `create_vector_db.py`.

In [30]:
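from importlib import reload
import create_vector_db

# Reload first so edits to create_vector_db.py are picked up, then import the refreshed classes.
# Importing the classes before the reload would leave these names bound to the previous versions.
reload(create_vector_db)
from create_vector_db import InstructorEmbeddingModel, e5EmbeddingModel, AllMiniLML6v2, OpenAIAda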

model_class_map = {
    "instructor_large": InstructorEmbeddingModel,
    "instructor_xl": InstructorEmbeddingModel,
    "e5-base-v2": e5EmbeddingModel,
    "e5-large-v2": e5EmbeddingModel,
    "all-MiniLM-L6-v2": AllMiniLML6v2,
    "text-embedding-ada-002": OpenAIAda
}

if model_to_use not in model_class_map:
    raise ValueError(f"model_to_use must be one of {', '.join(model_class_map.keys())}")

my_model = model_class_map[model_to_use](model, dbf)

In [31]:
# While the EmbeddingModel class has a method to create embeddings based on other columns, or a combination of columns,
# I have only tested this when we use the "Answers" column as the source of the embeddings.  

name_answer = "Answers"
#name_combined = "Combined"
#name_faq = "Questions"


In [32]:
try:
    client = my_model.client
    collection_answers = client.get_collection(name=name_answer.lower())
    #collection_question = client.get_collection(name=name_faq.lower())
    #collection_combined = client.get_collection(name=name_combined.lower())
    print("Database already exists, reading from it")

except (NameError, ValueError):
    print("Collection does not exist, creating it")
    my_model.embed_and_save_knowledge_base(text_faq_data_df = data_frame_to_use, embedding_column = name_answer)
    client = my_model.client
    collection_answers = client.get_collection(name=name_answer.lower())
    #collection_question = client.get_collection(name=name_faq.lower())
    #collection_combined = client.get_collection(name=name_combined.lower())


No embedding_function provided, using default embedding function: DefaultEmbeddingFunction https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2


Database already exists, reading from it


In [33]:
import ast
from pathlib import Path

question_embedding_file_to_create = Path("./question_embeddings") / f"{model_to_use}.csv"

if question_embedding_file_to_create.is_file():
    print(f"File {question_embedding_file_to_create} already exists so just loading it")
    questions_df = pd.read_csv(question_embedding_file_to_create)
    # Convert the value in row['Embeddings'] from a string to a list
    questions_df['Embeddings'] = questions_df['Embeddings'].apply(ast.literal_eval)
else:
    questions_df = pd.DataFrame(raw_data['Questions'])
    questions_df = my_model.embed_test_questions(questions_df, 'Questions')
    questions_df.to_csv(question_embedding_file_to_create, index=False)

File question_embeddings\e5-base-v2.csv already exists so just loading it
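The `ast.literal_eval` step in the cell above is needed because a list stored in a DataFrame column is written to CSV as its string representation and comes back as a plain string. A minimal sketch of that round trip, using a made-up two-row frame and an in-memory buffer in place of the real file:

```python
import ast
import io

import pandas as pd

# Toy data standing in for the real question embeddings.
toy_df = pd.DataFrame({
    "Questions": ["q1", "q2"],
    "Embeddings": [[0.1, 0.2], [0.3, 0.4]],
})

# Write to an in-memory CSV and read it back.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# After the round trip each embedding is a string such as "[0.1, 0.2]".
type_before = type(loaded.loc[0, "Embeddings"])

# literal_eval safely parses the string back into a Python list of floats.
loaded["Embeddings"] = loaded["Embeddings"].apply(ast.literal_eval)
type_after = type(loaded.loc[0, "Embeddings"])

print(type_before, type_after)
```

For larger embedding sets a binary format would avoid the parsing cost, but CSV keeps the file easy to inspect.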


An example of how to perform a single lookup against the database.   
Note that the "Scores" column in the dataframe is the cosine distance: `1 - cosine_similarity(question embedding, embedding vector retrieved from the db)`. Lower scores therefore mean closer matches.
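To make the score concrete, here is a minimal NumPy sketch of that formula. The helper `cosine_distance` and the toy vectors are illustrative assumptions, not part of `create_vector_db.py`, and the vector database computes the value internally rather than through this function.

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two vectors (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - similarity

query = [1.0, 0.0]

d_same = cosine_distance(query, [2.0, 0.0])       # same direction
d_orthogonal = cosine_distance(query, [0.0, 3.0])  # unrelated direction
d_opposite = cosine_distance(query, [-1.0, 0.0])   # opposite direction

print(d_same, d_orthogonal, d_opposite)
```

The distance ranges from 0 (same direction) through 1 (orthogonal) to 2 (opposite), and vector length does not affect it.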

In [34]:
#question = "What is depression?" # not in FAQ so expect "No match found"
#question = "Does psilocybin help with depression?" # not in FAQ so expect "No match found"
#question = 'How can I find a mental health professional for myself or my child?' # from FAQ so expect "Match found"
question = "What does it mean to have a mental illness?" # from FAQ so expect "Match found"

search_results = my_model.search_collection_using_text(collection=collection_answers, text=question, num_docs=4)
print(search_results)

if question in search_results['Questions'].values:
    print("Match found")
else:
    print("No match found")


                                           Questions   
0        What does it mean to have a mental illness?  \
1  What's the difference between mental health an...   
2                    Who does mental illness affect?   
3                        What causes mental illness?   

                                             Answers    Scores  
0  Mental illnesses are health conditions that di...  0.130379  
1  ‘Mental health’ and ‘mental illness’ are incre...  0.140432  
2  It is estimated that mental illness affects 1 ...  0.140581  
3  It is estimated that mental illness affects 1 ...  0.141543  
Match found


Run the test, which iterates through all 98 questions and checks whether the top search result matches the expected answer. We also check whether the top 4 search results contain the expected answer.

In [35]:
def check_matches(row):
    embedding = row['Embeddings']
    search_results = my_model.search_collection_using_embedding(collection_answers, embedding=embedding, num_docs=4)
    
    # Check if current question is in search results
    match = row['Questions'] in search_results['Questions'].values

    # Check if current question is the top match in search results. The top match has the highest cosine similarity, i.e. the lowest Score
    # I could equally have rerun the search with num_docs=1 and checked if the expected answer was the only result
    top_match = row['Questions'] == search_results.loc[search_results['Scores'].idxmin(), 'Questions']

    return pd.Series([match, top_match])

# Apply check_matches to each row in questions_df and count True values
questions_df[['Match', 'Top_Match']] = questions_df.apply(check_matches, axis=1)
match_count = questions_df['Match'].sum()
top_match_count = questions_df['Top_Match'].sum()

print(f"Total questions: {len(questions_df)}")
print(f"Top Matched questions: {top_match_count}")
print(f"Matched questions: {match_count}")

Total questions: 98
Top Matched questions: 70
Matched questions: 93
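
For comparing models it can help to turn these counts into rates. A small sketch using the numbers printed above (the variable names `top1_accuracy` and `top4_hit_rate` are my own labels, not something defined elsewhere in the notebook):

```python
# Counts copied from the output above for e5-base-v2 on full answers.
total_questions = 98
top_match_count = 70
match_count = 93

# Fraction of questions whose own answer is ranked first.
top1_accuracy = top_match_count / total_questions
# Fraction of questions whose own answer appears anywhere in the top 4.
top4_hit_rate = match_count / total_questions

print(f"Top-1 accuracy: {top1_accuracy:.1%}")  # 71.4%
print(f"Top-4 hit rate: {top4_hit_rate:.1%}")  # 94.9%
```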
