### Task 5b

This notebook uses the text embedding model `hkunlp/instructor-large` to check for the existence of words or phrases that are similar to the predefined hot words: "be careful", "destroy", and "stranger". The steps involved are:

1. **Text Embedding Generation**: The `text` column of `cv-valid-dev.csv` is embedded with the `hkunlp/instructor-large` model, one record at a time.
   
2. **Similarity Analysis**: The embedding of each transcription is compared with the embeddings of the hot words "be careful", "destroy", and "stranger" using cosine similarity, and the highest of the three scores is kept.

3. **Hot Word Detection**: If the similarity score exceeds a predefined threshold (0.8), the transcription is flagged as containing a hot word or a similar phrase.

4. **Output**: A Boolean flag (True if the record contains a phrase similar to a hot word, False otherwise) is saved in a new column called `similarity`, and the updated file is written to `cv-valid-dev-similarity.csv`.
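
The similarity and thresholding steps above can be sketched with plain NumPy on made-up vectors. The toy 3-dimensional vectors and the helper name `flag_similar` are illustrative assumptions; they stand in for the 768-dimensional INSTRUCTOR embeddings used in the cells below.

```python
import numpy as np

def flag_similar(text_vecs, hotword_vecs, threshold=0.8):
    """Return a boolean array: True where a text's best cosine similarity
    against any hot word vector is at or above the threshold."""
    text_vecs = np.asarray(text_vecs, dtype=float)
    hotword_vecs = np.asarray(hotword_vecs, dtype=float)

    # L2-normalise each row so that a dot product equals cosine similarity
    t = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    h = hotword_vecs / np.linalg.norm(hotword_vecs, axis=1, keepdims=True)

    sims = t @ h.T  # shape: (n_texts, n_hotwords)
    return sims.max(axis=1) >= threshold

# Hypothetical toy vectors, not real embeddings
toy_hotwords = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_texts = [[0.9, 0.1, 0.0], [0.5, 0.5, 0.7]]

flags = flag_similar(toy_texts, toy_hotwords)
print(flags)
```

The first toy text points almost in the same direction as the first hot word vector and is flagged; the second is spread across all three axes and is not.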

In [25]:
import pandas as pd
import numpy as np

from numpy.linalg import norm
from tqdm import tqdm
from InstructorEmbedding import INSTRUCTOR
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Load model
model = INSTRUCTOR('hkunlp/instructor-large')



load INSTRUCTOR_Transformer
max_seq_length  512


In [35]:
instruction = "Represent the text for semantic similarity:"

def get_embedding(text):
    """
    Embeds a single text using INSTRUCTOR with the shared instruction.
    Returns an L2-normalised NumPy array of shape (1, embedding_dim).
    """
    embedding = model.encode([[instruction, text]])

    # Normalise embedding
    norm_emb = embedding / norm(embedding, axis=1, keepdims=True)
    
    return norm_emb

In [36]:
# Load data
cv_valid_dev = pd.read_csv("cv-valid-dev.csv")

hot_words = ["be careful", "destroy", "stranger"]

hotword_embeddings_list = []
for hot_word in hot_words:
    hotword_embedding = get_embedding(hot_word)
    hotword_embeddings_list.append(hotword_embedding)
    print(hotword_embedding.shape)

hotword_embeddings = np.vstack(hotword_embeddings_list)
print(hotword_embeddings.shape)

(1, 768)
(1, 768)
(1, 768)
(3, 768)


A threshold of 0.8 was chosen based on experiments and an inspection of the similarity score distribution. Cosine similarity values tend to cluster fairly high in this embedding space, which may be partly because every text is embedded with the same instruction prefix, so a lower threshold would also flag texts that are only loosely related to the hot words. Setting the threshold at 0.8 is a stricter criterion: only texts with a strong semantic connection to a hot word, meaning substantially similar context or concepts, are marked as similar. This limits false positives, since the Boolean is True only when the maximum similarity between the record text and the hot words is at least 0.8.
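
One way to make this kind of assessment concrete is to collect the maximum similarity per record and check what fraction of records each candidate threshold would flag. The sketch below is a minimal illustration: the helper name `flagged_fraction` is hypothetical, and the scores are made-up placeholders rather than values measured on `cv-valid-dev.csv`.

```python
import numpy as np

def flagged_fraction(max_similarities, thresholds):
    """Map each candidate threshold to the share of records whose
    maximum similarity is at or above it."""
    scores = np.asarray(max_similarities, dtype=float)
    return {float(t): float((scores >= t).mean()) for t in thresholds}

# Placeholder scores for illustration only (not real results)
placeholder_scores = [0.72, 0.75, 0.78, 0.81, 0.79, 0.90, 0.74, 0.83]

summary = flagged_fraction(placeholder_scores, [0.7, 0.8, 0.9])
for threshold, share in summary.items():
    print(f"threshold {threshold:.1f}: {share:.1%} of records flagged")
```

If scores cluster high, a low threshold flags nearly everything, so the useful cut-off lies where the flagged share drops sharply.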

In [39]:
# Iterate through each row in the dataset
for idx, row in tqdm(cv_valid_dev.iterrows(), total=cv_valid_dev.shape[0], desc="Checking similarity..."):
    text = row["text"]
    text_embedding = get_embedding(text)

    # Compute cosine similarity for the text with the hotwords
    similarity = cosine_similarity(text_embedding, hotword_embeddings)
    max_similarity = similarity.max()
    
    # Use a threshold of 0.8
    is_similar = max_similarity >= 0.8

    # Store the Boolean flag in the similarity column for the current row
    cv_valid_dev.loc[idx, "similarity"] = is_similar

    # Save the rows processed so far to the output file (rewritten on every iteration)
    cv_valid_dev.iloc[:idx+1].to_csv("cv-valid-dev-similarity.csv", index=False)
    
print("Updated file successfully saved!")

Checking similarity...: 100%|██████████| 4076/4076 [4:38:51<00:00,  4.10s/it]  

Updated file successfully saved!
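
The loop above embeds one record per call and rewrites the CSV on every iteration. A possible alternative, sketched here under assumptions, is to embed the texts in batches and write the file once at the end. The function name `batch_flags`, the `encode_fn` parameter, and the default `batch_size` are hypothetical; `encode_fn` is assumed to accept a list of `[instruction, text]` pairs and return one embedding per pair, as `model.encode` does for the single pair passed in `get_embedding`. The hot word embeddings are assumed to be L2-normalised already, as they are in this notebook.

```python
import numpy as np

def batch_flags(texts, hotword_embeddings, encode_fn, instruction,
                threshold=0.8, batch_size=32):
    """Return one Boolean per text: True if its maximum cosine similarity
    to any (already normalised) hot word embedding reaches the threshold."""
    flags = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        pairs = [[instruction, text] for text in batch]

        # Embed the whole batch in one call, then L2-normalise each row
        emb = np.asarray(encode_fn(pairs), dtype=float)
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

        sims = emb @ hotword_embeddings.T  # (batch, n_hotwords)
        flags.extend((sims.max(axis=1) >= threshold).tolist())
    return flags

# Dry run with a stand-in encoder (hypothetical, not the INSTRUCTOR model)
def stub_encode(pairs):
    return np.array([[1.0, 0.0] if "careful" in text else [0.0, 1.0]
                     for _, text in pairs])

stub_hotwords = np.array([[1.0, 0.0]])
result = batch_flags(["be careful", "hello", "careful now"],
                     stub_hotwords, stub_encode,
                     "Represent the text for semantic similarity:",
                     batch_size=2)
print(result)
```

With the notebook's own objects, this would be called as `batch_flags(cv_valid_dev["text"].tolist(), hotword_embeddings, model.encode, instruction)`, the result assigned to `cv_valid_dev["similarity"]`, and `to_csv` called once afterwards. Whether this shortens the run time depends on the hardware, and it gives up the per-row checkpointing that the original loop provides.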



