# Task 5b: Hotword Detection using Text Embedding Models

Using the text embedding model, write a Python Jupyter notebook called `cv-hotword-similarity-5b.ipynb` to find phrases similar to the 3 hot words from Task 5a. Using `cv-valid-dev.csv`, write a Boolean (`True` for a record containing phrases similar to the hot words; `False` for a record that does not) into a new column called `similarity`. Save the updated file in this folder.

#### Required libraries are installed here

In [None]:
# Install required libraries
!pip install transformers pandas scikit-learn torch

#### Importation of libraries

In [1]:
# Import libraries
import pandas as pd
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
import torch

#### Loading Tokenizer and Model

The `AutoTokenizer` and `AutoModel` classes from the Hugging Face Transformers library are used to load the `hkunlp/instructor-large` model. `AutoTokenizer` tokenizes the input text, converting it into a format that the model can process, and `AutoModel` loads the corresponding pre-trained model used to generate text embeddings.

This approach is used instead of the `InstructorEmbedding` library because it provides more flexibility and direct control over the tokenization and model loading processes. Additionally, the Hugging Face library ensures compatibility with a wide range of models and simplifies customization, such as modifying the tokenization strategy or extracting specific layers from the model.

In addition, `from InstructorEmbedding import INSTRUCTOR` did not seem to work for this model in this environment.


In [2]:
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("hkunlp/instructor-large")
model = AutoModel.from_pretrained("hkunlp/instructor-large")

Some weights of T5Model were not initialized from the model checkpoint at hkunlp/instructor-large and are newly initialized: ['decoder.block.0.layer.0.SelfAttention.k.weight', 'decoder.block.0.layer.0.SelfAttention.o.weight', 'decoder.block.0.layer.0.SelfAttention.q.weight', 'decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight', 'decoder.block.0.layer.0.SelfAttention.v.weight', 'decoder.block.0.layer.0.layer_norm.weight', 'decoder.block.0.layer.1.EncDecAttention.k.weight', 'decoder.block.0.layer.1.EncDecAttention.o.weight', 'decoder.block.0.layer.1.EncDecAttention.q.weight', 'decoder.block.0.layer.1.EncDecAttention.v.weight', 'decoder.block.0.layer.1.layer_norm.weight', 'decoder.block.0.layer.2.DenseReluDense.wi.weight', 'decoder.block.0.layer.2.DenseReluDense.wo.weight', 'decoder.block.0.layer.2.layer_norm.weight', 'decoder.block.1.layer.0.SelfAttention.k.weight', 'decoder.block.1.layer.0.SelfAttention.o.weight', 'decoder.block.1.layer.0.SelfAttention.q.weight', 'deco

Note on the warning above: the weights visible in this message all belong to the decoder (`decoder.block...`). The embedding function in this notebook only calls `model.encoder`, so these newly initialized decoder weights are not used when computing embeddings.

### Generating Embeddings for Input Text

The `compute_embedding()` function generates embeddings for a given input text using the `hkunlp/instructor-large` model.

- **Tokenization**: The input text is first tokenized using the `tokenizer`, converting it into a format the model can process (e.g., token IDs and attention masks). The tokenization process includes padding, truncation, and limiting the sequence length to 512 tokens.
  
- **Model Inference**: The tokenized input is passed through the encoder of the model inside a `torch.no_grad()` context, which disables gradient tracking and so reduces memory usage and overhead during inference.

- **Embedding Extraction**: The function takes the hidden state of the first token in the sequence as the embedding. T5 has no dedicated `[CLS]` token, so the first token's hidden state stands in as the representation of the entire input sequence.

This function ensures that embeddings can be generated efficiently and consistently for use in tasks such as semantic similarity or hot word detection.
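
For comparison with the first-token extraction described above, a common alternative is to average the hidden states of all non-padding tokens (masked mean pooling). The sketch below shows the idea on toy NumPy arrays that stand in for `last_hidden_state` and `attention_mask`. The function name `masked_mean_pool` and the toy values are assumptions for illustration; this is not what `compute_embedding` does.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states: array of shape (batch, seq_len, dim)
    attention_mask: array of shape (batch, seq_len), 1 for real tokens, 0 for padding
    (hypothetical helper, not part of the notebook)
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(axis=1)                      # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts

# Toy batch: 2 sequences, 3 token positions, 2-dimensional hidden states.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

The padded position in the first sequence does not affect its pooled vector, which is the point of applying the mask before averaging.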


In [3]:
def compute_embedding(text):
    """
    Generate embeddings for the given text using the instructor-large model.
    """
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    
    with torch.no_grad():
        # Pass the inputs through the model's encoder
        encoder_outputs = model.encoder(**inputs)
        
        # Extract the hidden state of the [CLS] token or the first token
        # (T5 doesn't have a specific [CLS] token, so use the first token)
        embedding = encoder_outputs.last_hidden_state[:, 0, :].squeeze().numpy()
    
    return embedding


### Defining Hot Words and Computing Their Embeddings

This cell defines the list of hot words (`"be careful"`, `"destroy"`, and `"stranger"`), which are the phrases of interest for detecting semantic similarity in the dataset.

- **Instruction**: The variable `instruction` provides context to the model by specifying how the phrases should be represented and detected. This instruction helps the `hkunlp/instructor-large` model focus on generating embeddings relevant to semantic similarity tasks.

- **Embedding Computation**: The `compute_embedding` function is used to generate embeddings for each hot word. By combining the `instruction` with each hot word, the model can generate contextualized embeddings that are more aligned with the task objective.

The resulting embeddings (`hot_word_embeddings`) are then used to compare with embeddings of phrases in the dataset to determine semantic similarity.


In [4]:
# Define the hot words and compute their embeddings
hot_words = ["be careful", "destroy", "stranger"]
instruction = "Represent the phrase for semantic similarity:"
hot_word_embeddings = [compute_embedding(f"{instruction} {word}") for word in hot_words]


In [5]:
# Load the CSV data
df = pd.read_csv("../common-voice/cv-valid-dev.csv")

In [6]:
# Ensure the column to process ("text") exists (adjust if your column has a different name)
if "text" not in df.columns:
    raise ValueError("The column 'text' does not exist in the CSV. Check the column names.")

In [7]:
# Compute embeddings for all phrases in the DataFrame
df["embedding"] = df["text"].apply(lambda x: compute_embedding(f"{instruction} {x}"))

### Checking Similarity with Hot Words

The `is_similar` function determines whether a given embedding is semantically similar to any of the hot word embeddings.

- **Similarity Calculation**: For each hot word, the cosine similarity between its embedding and the input embedding is computed. Cosine similarity is the cosine of the angle between two vectors, so it depends on their direction rather than their magnitude, which makes it a standard metric for comparing text embeddings.

- **Maximum Similarity**: The function identifies the highest similarity score (`max_similarity`) among all the hot word embeddings.

- **Threshold**: A similarity threshold (set to `0.9` here) decides whether the input embedding is similar enough to the hot words. If `max_similarity` is greater than or equal to the threshold, the function returns `True`; otherwise, it returns `False`. This hyperparameter was tuned to better match the detections recorded in detected.txt from **Task 5a**, since the initial value of 0.5 flagged a large number of records as `True`.

This function is critical for filtering phrases in the dataset and identifying those that are contextually relevant to the hot words.
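
The max-then-threshold logic described above can be sketched on toy 2-dimensional vectors, using only the standard library. The names `cosine` and `is_similar_toy` and all vector values are assumptions made for illustration; the notebook itself uses scikit-learn's `cosine_similarity` on the model embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_similar_toy(vec, references, threshold=0.9):
    """True if the best cosine similarity against any reference is >= threshold."""
    return max(cosine(vec, ref) for ref in references) >= threshold

# Two toy "hot word" vectors pointing along different axes.
references = [[1.0, 0.0], [0.0, 1.0]]

close = [0.95, 0.1]  # nearly parallel to the first reference
far = [1.0, 1.0]     # 45 degrees from both references

print(is_similar_toy(close, references))
print(is_similar_toy(far, references))
```

Only the vector that is nearly parallel to one of the references passes the `0.9` threshold; the vector at 45 degrees to both has a cosine similarity of about 0.71 and is rejected.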


In [8]:
# Check similarity with the hot words
def is_similar(embedding):
    similarities = [cosine_similarity(embedding.reshape(1, -1), hot_word_emb.reshape(1, -1))[0][0] for hot_word_emb in hot_word_embeddings]
    max_similarity = max(similarities)
    threshold = 0.9  # Define a threshold for similarity
    return max_similarity >= threshold

In [9]:
# Apply similarity check and add a new column
df["similarity"] = df["embedding"].apply(is_similar)

### Calculating Similarity Scores for Each Phrase ('text' column)

The `calculate_similarity_scores` function computes the similarity scores between a given embedding and each hot word embedding.

- **Cosine Similarity**: The function calculates cosine similarity between the input embedding and all hot word embeddings. This results in a list of similarity scores, one for each hot word.

- **Storing Similarity Scores**: The similarity scores are stored in a new column of the DataFrame, `similarity_scores`. This allows detailed analysis of how each phrase in the dataset compares to the hot words.

I did this to see why some phrases were considered similar and others were not, compared with the files listed in detected.txt, and to relate the scores to the meaning of the phrases.

A sanity check was then performed, and the dataset was cleaned of unneeded columns (they can be kept by commenting out the `df.drop()` cell). The result was exported to a .csv file called cv-valid-dev-hotword.csv.
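
The per-hot-word scores described above can also be computed for all rows at once as a single matrix product, instead of one pairwise call per row. The sketch below uses small NumPy arrays as stand-ins for the stacked phrase and hot word embeddings. The function name `cosine_similarity_matrix` and the toy values are assumptions, and the sketch assumes no embedding is the zero vector.

```python
import numpy as np

def cosine_similarity_matrix(phrases, hot_words):
    """Return a (n_phrases, n_hot_words) matrix of cosine similarities.

    Both inputs are 2-D arrays with one embedding per row (hypothetical helper).
    """
    p = phrases / np.linalg.norm(phrases, axis=1, keepdims=True)
    h = hot_words / np.linalg.norm(hot_words, axis=1, keepdims=True)
    return p @ h.T

# Toy stand-ins: 3 phrase embeddings and 2 hot word embeddings.
phrase_vecs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
hot_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])

scores = cosine_similarity_matrix(phrase_vecs, hot_vecs)
flags = scores.max(axis=1) >= 0.9  # same max-then-threshold rule, applied per row

print(scores)
print(flags)
```

scikit-learn's `cosine_similarity` also accepts two 2-D arrays and returns a matrix of the same shape, so the same batching idea applies there.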


In [10]:
def calculate_similarity_scores(embedding):
    return [cosine_similarity(embedding.reshape(1, -1), hot_word_emb.reshape(1, -1))[0][0] for hot_word_emb in hot_word_embeddings]

df["similarity_scores"] = df["embedding"].apply(calculate_similarity_scores)

In [11]:
# Sanity Check
print(df[["text", "similarity_scores"]].head())

                                                text  \
0  be careful with your prognostications said the...   
1  then why should they be surprised when they se...   
2  a young arab also loaded down with baggage ent...   
3  i thought that everything i owned would be des...   
4  he moved about invisible but everyone could he...   

                     similarity_scores  
0   [0.89178586, 0.8908732, 0.8919756]  
1   [0.8107527, 0.8043457, 0.83265984]  
2  [0.82800245, 0.82864827, 0.8212614]  
3    [0.86525965, 0.897887, 0.8828435]  
4   [0.87202746, 0.8933013, 0.8849986]  
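
To see how the threshold interacts with the scores printed above, the sketch below counts how many of these five rows would be flagged at a few cut-offs. The score values are copied from the output above; the helper name `count_flagged` and the choice of cut-offs are illustrative assumptions, not part of the notebook.

```python
# Similarity scores copied from the sanity-check output above (rows 0 to 4).
sample_scores = [
    [0.89178586, 0.8908732, 0.8919756],
    [0.8107527, 0.8043457, 0.83265984],
    [0.82800245, 0.82864827, 0.8212614],
    [0.86525965, 0.897887, 0.8828435],
    [0.87202746, 0.8933013, 0.8849986],
]

def count_flagged(score_rows, threshold):
    """Count rows whose best hot-word score is at least `threshold` (hypothetical helper)."""
    return sum(max(row) >= threshold for row in score_rows)

for threshold in (0.5, 0.85, 0.9):
    print(threshold, count_flagged(sample_scores, threshold))
```

On this small sample, all five rows pass at 0.5, three pass at 0.85, and none pass at 0.9. The scores sit in a narrow band (roughly 0.80 to 0.90), so small changes to the threshold can flip many rows.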


In [12]:
# Drop the embedding and similarity_scores columns (optional, for cleaner output)
df.drop(columns=["embedding", "similarity_scores"], inplace=True)

In [13]:
# Save the updated DataFrame to a new CSV file
output_path = "cv-valid-dev-hotword.csv"
df.to_csv(output_path, index=False)

print(f"Updated file saved as '{output_path}'")

Updated file saved as 'cv-valid-dev-hotword.csv'
