# Task 5a: Hotword Detection: Extraction of .mp3 files
### Importing Required Libraries

The libraries needed for the notebook are imported, the file path to the transcriptions is defined, and the hot words to detect are stored in a list.

In [None]:
# Install required libraries
%pip install pandas transformers torch

In [12]:
import os
import pandas as pd
from transformers import AutoTokenizer, AutoModel
import torch

In [13]:
# Load transcriptions from cv-valid-dev.csv
cv_valid_dev_path = "../common-voice/cv-valid-dev.csv"
df = pd.read_csv(cv_valid_dev_path)

In [14]:
# Define hot words to detect
hot_words = ["be careful", "destroy", "stranger"]

### Defining Hot Words and Detecting Their Presence in Transcriptions

The `detect_hot_words` function identifies transcriptions containing specific hot words.

- **Inputs**:
  - `transcriptions`: A list of transcription strings to be analyzed.
  - `hot_words`: A list of hot words to detect within the transcriptions.

- **Process**:
  - Iterates through the transcriptions.
  - Checks each transcription for the presence of any of the hot words (case-insensitive).
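  - If a hot word is detected, the transcription's index is appended to the `detected_files` list (the function can be adapted to append a filename instead), and the remaining hot words are skipped for that transcription.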

- **Output**:
  - Returns a list of indices of the transcriptions that contain at least one hot word.

This function enables quick identification of which files or phrases contain the specified hot words.


In [15]:
def detect_hot_words(transcriptions, hot_words):
    detected_files = []
    for idx, transcription in enumerate(transcriptions):
        for hot_word in hot_words:
            if hot_word in transcription.lower():
                detected_files.append(idx)  # Add index or filename here
                break
    return detected_files
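
As a quick check of the behaviour described above, the sketch below runs `detect_hot_words` on a small list of made-up transcriptions. The `sample_transcriptions` list is hypothetical and is not taken from the Common Voice data; the function is repeated so the snippet runs on its own.

```python
def detect_hot_words(transcriptions, hot_words):
    detected_files = []
    for idx, transcription in enumerate(transcriptions):
        for hot_word in hot_words:
            if hot_word in transcription.lower():
                detected_files.append(idx)
                break  # one match is enough for this transcription
    return detected_files


# Hypothetical transcriptions, for illustration only
sample_transcriptions = [
    "Please BE CAREFUL on the stairs",
    "nothing unusual here",
    "a stranger knocked on the door",
]
hot_words = ["be careful", "destroy", "stranger"]

detected = detect_hot_words(sample_transcriptions, hot_words)
print(detected)  # [0, 2]
```

The first transcription matches despite its capitalisation because each transcription is lowercased before the comparison.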

### Detecting Hot Words in Text

The `is_hot_word` function checks whether any of the specified hot words occurs in a given text. It iterates through the list of hot words and returns `True` as soon as one is found, or `False` if none is present.

An empty list is initialized to store filenames corresponding to transcriptions that contain the hot words.

In [16]:
def is_hot_word(text, hot_words):
    # Lowercase once so matching is case-insensitive, as in detect_hot_words
    text = text.lower()
    for word in hot_words:
        if word in text:
            return True
    return False

# Detect hot words in the transcriptions and save the filenames
detected_filenames = []

The next cell scans the `generated_text` column of the DataFrame for hot words, treating missing or non-string values as containing no hot word. It adds a new column, `detected`, which is `True` for rows whose `generated_text` contains at least one hot word (case-insensitive) and `False` otherwise.

In [17]:
# Ensure that we handle missing or non-string values in the 'generated_text' column
df['detected'] = df['generated_text'].apply(
    lambda x: isinstance(x, str) and any(hw in x.lower() for hw in hot_words)
)
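
To illustrate how missing values are handled, the sketch below applies the same check to a small, made-up DataFrame and compares it with a vectorized alternative based on `Series.str.contains`. The `demo` frame and its filenames are hypothetical, and the vectorized variant is only one possible way to write the check, not the approach used in this notebook.

```python
import re

import pandas as pd

hot_words = ["be careful", "destroy", "stranger"]

# Hypothetical rows: one match, two missing values, one non-match
demo = pd.DataFrame({
    "filename": ["a.mp3", "b.mp3", "c.mp3", "d.mp3"],
    "generated_text": ["Please BE CAREFUL", None, float("nan"), "good morning"],
})

# Row-wise check: non-string values are treated as "no hot word"
demo["detected"] = demo["generated_text"].apply(
    lambda x: any(hw in x.lower() for hw in hot_words) if isinstance(x, str) else False
)

# Vectorized alternative: escape each hot word and join them into one pattern
pattern = "|".join(re.escape(hw) for hw in hot_words)
vectorized = demo["generated_text"].str.contains(
    pattern, case=False, regex=True, na=False
)

print(demo["detected"].tolist())
print(vectorized.tolist())
```

Both approaches flag only the first row here. Note that either one matches substrings, so a word such as "destroyed" would also count as containing "destroy".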

The filenames of the flagged rows are then collected into `detected_filenames`, replacing the empty list initialised earlier.

In [18]:
# Get filenames where hot words are detected
detected_filenames = df[df['detected']]['filename'].tolist()

### Saving of Detected Filenames

The filenames of transcriptions containing hot words are then saved to a file named `detected.txt`.

In [19]:
# Save detected filenames to detected.txt
detected_path = os.path.join("detected.txt")
with open(detected_path, "w") as f:
    for filename in detected_filenames:
        f.write(f"{filename}\n")

print(f"Detected filenames saved to {detected_path}.")

Detected filenames saved to detected.txt.
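
One way to confirm that the file was written as intended is to read it back and compare it with the list. The sketch below does this in a temporary directory so that it does not overwrite the real `detected.txt`; the filenames in it are hypothetical placeholders, and the explicit `encoding="utf-8"` is an assumption rather than something the notebook sets.

```python
import os
import tempfile

# Hypothetical filenames, for illustration only
detected_filenames = ["sample-000001.mp3", "sample-000042.mp3"]

with tempfile.TemporaryDirectory() as tmp_dir:
    detected_path = os.path.join(tmp_dir, "detected.txt")

    # Write one filename per line
    with open(detected_path, "w", encoding="utf-8") as f:
        for filename in detected_filenames:
            f.write(f"{filename}\n")

    # Read the file back, stripping the trailing newline from each line
    with open(detected_path, encoding="utf-8") as f:
        read_back = [line.rstrip("\n") for line in f]

print(read_back == detected_filenames)  # True
```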
