## Hotword Similarity Detection

Here, we perform hotword detection that matches semantically similar words and phrases instead of requiring exact word matches. The transcripts are embedded with `InstructorEmbedding`, using the `hkunlp/instructor-large` model (https://huggingface.co/hkunlp/instructor-large).

In [1]:
import os
import re
import numpy as np
import pandas as pd

from scipy.stats import zscore

from sklearn.metrics.pairwise import cosine_similarity
from InstructorEmbedding import INSTRUCTOR

HOME_DIR = os.path.expanduser('~')



First, we load the required embedder.

In [2]:
model = INSTRUCTOR('hkunlp/instructor-large').to('cuda')

load INSTRUCTOR_Transformer
max_seq_length  512




Then, we load the transcript produced by the sister notebook `asr_project/hotword-detection/cv-hotword-5a.ipynb`, which is saved as `new_transcription.csv`.

In [3]:
new_transcription_path = os.path.join(HOME_DIR,'asr_project/hotword-detection/new_transcription.csv')
df_raw = pd.read_csv(new_transcription_path).dropna().reset_index()
df_raw['pred_str'] = df_raw['pred_str'].str.lower()
df_raw

Unnamed: 0,index,filename,pred_str
0,0,sample-000000.mp3,be careful with your prognostications said the...
1,1,sample-000001.mp3,then why should they be surprised when they se...
2,2,sample-000002.mp3,a young arab also loaded down with baggage ent...
3,3,sample-000003.mp3,i felt that everything i owned would be destroyed
4,4,sample-000004.mp3,he moved about invisible but everyone could he...
...,...,...,...
4070,4071,sample-004071.mp3,but they could never have taught him arabic
4071,4072,sample-004072.mp3,he decided to concentrate on more practical ma...
4072,4073,sample-004073.mp3,that's what i'm not supposed to say
4073,4074,sample-004074.mp3,just here the winple made him feel better


To find texts with similar phrases, we give the model an instruction that tells it how to represent each transcript.

In [4]:
# Pair each transcript with the instruction and store the text embeddings
text_repr = df_raw['pred_str'].map(lambda x: ['Represent the sentence:',x])
target_sentences = text_repr.to_list()
target_embeddings = model.encode(target_sentences)
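
The instruction string is typed by hand in two places in this notebook, once without and once with a trailing space. Whether that difference changes the embeddings is not established here, but a small helper removes the question. The sketch below is a minimal illustration: `build_pairs` and `INSTRUCTION` are hypothetical names, not part of `InstructorEmbedding`, and the sample sentences are shortened stand-ins.

```python
# Hypothetical helper (not part of InstructorEmbedding): pairs one shared
# instruction with every text so targets and queries use the same prompt.
INSTRUCTION = 'Represent the sentence:'  # assumed shared instruction


def build_pairs(texts, instruction=INSTRUCTION):
    """Return [[instruction, text], ...] with surrounding whitespace stripped."""
    instruction = instruction.strip()
    return [[instruction, str(text).strip()] for text in texts]


# Stand-in transcripts and the three hotwords used in this notebook
target_pairs = build_pairs(['be careful with your prognostications',
                            'he felt uneasy at the man\'s presence'])
query_pairs = build_pairs(['stranger', 'be careful', 'destroy'])
print(query_pairs[0])
```

The resulting lists have the same `[instruction, text]` shape that is passed to `model.encode` in this notebook.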

Next, we encode the query words/phrases and use cosine similarity to find the transcripts that are semantically closest to them.

In [5]:
ref_phrases = [['Represent the sentence: ','stranger'],
               ['Represent the sentence: ','be careful'],
               ['Represent the sentence: ','destroy']]

# Return the sentences that are similar to the reference phrases.
# Note: relies on the global `model` and `target_embeddings` defined above.
def get_similar_sentences(df_raw, ref_phrases, threshold=3, detailed=False):
    # Embed the reference phrases and compare them with every transcript
    df = df_raw.copy()
    ref_embeddings = model.encode(ref_phrases)
    similarity = pd.DataFrame(cosine_similarity(ref_embeddings,target_embeddings).T)

    # Z-score each phrase's similarities and flag sentences above the threshold (default: 3 standard deviations)
    z_scores = similarity.apply(zscore)
    df = pd.concat([df,z_scores],axis=1)
    df['stranger_detected'] = df[0]>threshold
    df['becareful_detected'] = df[1]>threshold
    df['destroy_detected'] = df[2]>threshold
    # True if at least one hotword is detected in the sentence
    df['similarity'] = df[['stranger_detected','becareful_detected','destroy_detected']].any(axis=1)
    if detailed:
        return df
    return df.loc[df['similarity'],['filename','pred_str','stranger_detected','becareful_detected','destroy_detected']]

df = get_similar_sentences(df_raw, ref_phrases, threshold=3)
print(f'Total detected: {len(df)}')
print(f"Stranger detected: {df['stranger_detected'].sum()}")
print(f"Be careful detected: {df['becareful_detected'].sum()}")
print(f"Destroy detected: {df['destroy_detected'].sum()}")

Total detected: 114
Stranger detected: 40
Be careful detected: 45
Destroy detected: 31
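
The detection rule flags a sentence when its similarity to a phrase lies more than `threshold` standard deviations above the mean similarity for that phrase, so the cutoff adapts to each phrase's own score distribution. The sketch below illustrates the rule with NumPy only. The similarity values are made up for illustration and are not model output; `flag_outliers` is a hypothetical name. The population standard deviation (`ddof=0`) is used because it is the default of `scipy.stats.zscore`.

```python
import numpy as np

# Made-up similarity scores: 100 sentences (rows) x 2 query phrases (columns).
k = np.arange(100)
sim = np.column_stack([0.75 + 0.02 * np.sin(k),
                       0.78 + 0.02 * np.cos(k)])
sim[3, 0] = 0.95   # sentence 3 is unusually close to phrase 0
sim[7, 1] = 0.93   # sentence 7 is unusually close to phrase 1


def flag_outliers(similarity, threshold=3.0):
    """Z-score each column and flag entries above `threshold` standard deviations."""
    z = (similarity - similarity.mean(axis=0)) / similarity.std(axis=0)  # ddof=0
    return z > threshold


flags = flag_outliers(sim)
print(np.flatnonzero(flags[:, 0]), np.flatnonzero(flags[:, 1]))
```

One consequence of this design is that the number of flagged sentences depends on the spread of scores in the corpus, so the same phrase can yield different matches on a different transcript set.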


We examine the captured text.

In [6]:
for i,row in df.iterrows():
    print(row['filename'])
    print(row['pred_str'])

sample-000000.mp3
be careful with your prognostications said the stranger
sample-000003.mp3
i felt that everything i owned would be destroyed
sample-000018.mp3
to rourish the falcon
sample-000089.mp3
the stranger seemed satisfied ith the answer
sample-000180.mp3
what a load of trash sarah apined
sample-000202.mp3
how strange africa is thought the boy
sample-000205.mp3
and eventually man wool nourish your sands where the game wool wont again flourish
sample-000231.mp3
he felt uneasy at the man's presence
sample-000261.mp3
he didn't know the man yet but his practiced eye would recognize him when he appeared
sample-000303.mp3
the boy noticed that the man's clothing was strange
sample-000351.mp3
the turf and gravel around it seemed charred as if by a sudden explosion
sample-000390.mp3
this was the strangest of all things that ever came to earth from outer space
sample-000419.mp3
the risk if you get in here and wriht
sample-000508.mp3
i had to test your courage the stranger said
sample-0005

The captured texts appear to be in line with our expectations. We now compute the full record, including negative samples, and save it as a CSV file (__task 5b__).

In [7]:
# Get full record, including negative samples
df_full = get_similar_sentences(df_raw, ref_phrases, threshold=3, detailed=True).drop(columns=[0,1,2])
df_full['filename'] = df_full['filename'].map(lambda x: 'cv-valid-dev/'+x)

# Save as csv in the same format as cv-valid-dev
cv_valid_dev_path = os.path.join(HOME_DIR,'asr_project/common_voice/cv-valid-dev.csv')
updated_transcript_path = os.path.join(HOME_DIR,'asr_project/hotword-detection/cv-valid-dev-updated.csv')

df_or = pd.read_csv(cv_valid_dev_path)
df = df_or.merge(df_full[['filename','similarity']], on='filename', how='left')
df.to_csv(updated_transcript_path, index=False)  # omit the DataFrame index to keep the cv-valid-dev column layout
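
Because rows with missing transcripts were dropped when loading `new_transcription.csv`, a left merge leaves `similarity` empty for those files. The sketch below shows one way to list such rows and fill them, using toy frames with made-up filenames. Treating a missing transcript as "no hotword detected" is an assumption, not something the task statement confirms.

```python
import pandas as pd

# Toy stand-ins for cv-valid-dev.csv and the detection results (made-up rows).
df_or = pd.DataFrame({'filename': ['cv-valid-dev/a.mp3',
                                   'cv-valid-dev/b.mp3',
                                   'cv-valid-dev/c.mp3'],
                      'text': ['x', 'y', 'z']})
df_full = pd.DataFrame({'filename': ['cv-valid-dev/a.mp3', 'cv-valid-dev/c.mp3'],
                        'similarity': [True, False]})

# indicator=True adds a `_merge` column that marks rows found only on the left
merged = df_or.merge(df_full, on='filename', how='left', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'filename'].tolist()
merged = merged.drop(columns='_merge')

# Assumption: files without a transcript count as "no hotword detected"
merged['similarity'] = merged['similarity'].eq(True)

print(unmatched)
print(merged)
```

Inspecting `unmatched` before filling makes it visible how many files are affected, which helps decide whether `False` or an empty value is the better choice for the saved file.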