# Main

This model takes a valence-arousal (VA) vector and a Bouba-Kiki vector as input and outputs an encoded earcon representation, which is passed to the MusicGen Decoder to generate the final earcon.

The output of the MusicGen Decoder is then encoded by EncodecFeatureExtractor, and the resulting vectors are used to calculate the loss.

In [1]:
# import relevant libraries
from transformers import EncodecFeatureExtractor
from utils import *
import pandas as pd
import numpy as np
import ast
import clip
import torch

In [2]:
device = "cuda" if torch.cuda.is_available() else "cpu"

## Dataset

Each row in the dataset consists of:
- An earcon, represented by an Encodec vector
- An image, represented as a valence-arousal (VA) vector
- A Bouba-Kiki value derived from the image
- A pseudoword

The rows are paired by cosine similarity between the earcon's Encodec vector and the image's VA vector. The Bouba-Kiki value and pseudoword are generated after the images have been paired with the audio.
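The pairing rule can be illustrated on toy data. The sketch below uses plain Python and made-up 3-dimensional vectors; the names `cosine`, `pair_by_cosine`, `image_vectors`, and `earcon_vectors` are illustrative and do not come from the notebook. It shows the rule only, not the notebook's own implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pair_by_cosine(image_vectors, earcon_vectors):
    """For each image, return the index of the most similar earcon."""
    pairs = []
    for img in image_vectors:
        sims = [cosine(img, ear) for ear in earcon_vectors]
        pairs.append(max(range(len(sims)), key=sims.__getitem__))
    return pairs

# Made-up vectors, purely for illustration
image_vectors = [[1.0, 0.0, 0.0], [0.0, 2.0, 1.0]]
earcon_vectors = [[0.0, 1.0, 0.5], [3.0, 0.1, 0.0], [-1.0, 0.0, 0.0]]

print(pair_by_cosine(image_vectors, earcon_vectors))  # [1, 0]
```

Because each image independently picks its best match, nothing forces a one-to-one assignment: several images can end up paired with the same earcon.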

### Earcons

In [3]:

# load in earcons
earcons = pd.read_csv('dataset/earcon_dataset/earcon_dataset.csv')

earcons['query'] = earcons['query'].apply(ast.literal_eval)
earcons["query"] = earcons["query"].apply(lambda x: x[0])

# copy the slice so that adding a column below does not trigger SettingWithCopyWarning
earcons = earcons[["query", "name"]].copy()

earcons["filepaths"] = earcons["name"].apply(lambda x: f"dataset/earcon_dataset/earcons/{x}")

earcons

Unnamed: 0,query,name,filepaths
0,bright,Power_Up_Bright_A.wav,dataset/earcon_dataset/earcons/Power_Up_Bright...
1,bright,Magic_Spell_06.wav,dataset/earcon_dataset/earcons/Magic_Spell_06.wav
2,bright,Hand_Bells_High-C_Single.wav,dataset/earcon_dataset/earcons/Hand_Bells_High...
3,bright,Hand_Bells_G#_Ab_Single.wav,dataset/earcon_dataset/earcons/Hand_Bells_G#_A...
4,bright,Hand_Bells_A#_Bb_Single.wav,dataset/earcon_dataset/earcons/Hand_Bells_A#_B...
...,...,...,...
642,warm,warm_basssynth_4101.wav,dataset/earcon_dataset/earcons/warm_basssynth_...
643,warm,warm_basssynth_4001.wav,dataset/earcon_dataset/earcons/warm_basssynth_...
644,warm,31WarmStrings.wav,dataset/earcon_dataset/earcons/31WarmStrings.wav
645,warm,11WarmBrass.wav,dataset/earcon_dataset/earcons/11WarmBrass.wav


In [4]:
import librosa

# Load the audio files and store them in a list
audio_data = []
audio_paths = earcons["filepaths"].tolist()

# Load audio files and determine the maximum length, ensuring mono audio
max_length = 0
for path in audio_paths:
    temp, _ = librosa.load(path, sr=24000, mono=True)
    audio_data.append(temp)
    if len(temp) > max_length:
        max_length = len(temp)

In [5]:
# extract features with encodec
encodec = EncodecFeatureExtractor(feature_size=1)

features = encodec(audio_data, sampling_rate=24000, return_tensors="pt", padding=True)

# Calculate the longest feature
longest_feature = len(max(features['input_values'], key=lambda x: torch.count_nonzero(x))[0])

# Calculate the average feature length
average_feature_length = sum(torch.count_nonzero(x) for x in features['input_values']) / len(features['input_values'])

# Calculate the minimum feature length
min_feature_length = min(torch.count_nonzero(x) for x in features['input_values'])

# Calculate the 90th percentile feature length
percentile_90 = np.percentile([torch.count_nonzero(x).item() for x in features['input_values']], 90)

# Calculate the 10th percentile feature length
percentile_10 = np.percentile([torch.count_nonzero(x).item() for x in features['input_values']], 10)

# Calculate the proportion of features fully covered by the length set to 131072
fully_covered_length = 131072
fully_covered_count = sum(1 for x in features['input_values'] if torch.count_nonzero(x) <= fully_covered_length)
fully_covered_percentile = (fully_covered_count / len(features['input_values'])) * 100


print(f"Longest Feature Length: {longest_feature}")
print(f"Average Feature Length: {average_feature_length:.2f}")
print(f"Shortest Feature Length: {min_feature_length}")
print(f"90th Percentile: {percentile_90:.2f}")
print(f"10th Percentile: {percentile_10:.2f}")
print(f"Fully Covered Percentile: {fully_covered_percentile:.3f}")

Longest Feature Length: 1465856
Average Feature Length: 68692.56
Shortest Feature Length: 627
90th Percentile: 154435.20
10th Percentile: 13178.40
Fully Covered Percentile: 86.244


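At a max_length of 131072 (2<sup>17</sup>), 86.244% of the features are fully covered, which makes it a reasonable cutoff for full-length audio. The next cell, however, truncates every clip to 512 samples, which matches the 512-dimensional CLIP image vectors that the features are later compared against.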
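The trade-off behind a length cutoff can be explored by sweeping a few candidate values. The following is a minimal sketch in plain Python; the `coverage` helper and the `lengths` list are hypothetical stand-ins, not the notebook's data.

```python
def coverage(lengths, max_length):
    """Percentage of clips whose length is at most max_length."""
    covered = sum(1 for n in lengths if n <= max_length)
    return 100.0 * covered / len(lengths)

# Hypothetical clip lengths in samples (not the notebook's data)
lengths = [627, 13_000, 40_000, 60_000, 90_000, 131_072, 200_000, 1_465_856]

# Sweep powers of two around the candidate cutoff
for exponent in range(14, 19):
    candidate = 2 ** exponent
    print(f"2^{exponent} = {candidate:>7}: {coverage(lengths, candidate):.1f}% covered")
```

For scale, at the 24000 Hz sampling rate used above, 131072 samples correspond to roughly 5.46 seconds of audio.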

In [6]:
# re-extract features, truncating each clip to 512 samples
encodec = EncodecFeatureExtractor(feature_size=1)

features = encodec(audio_data, sampling_rate=24000, return_tensors="pt", max_length=512, truncation=True)

In [7]:
# Add the features to the dataframe
earcons["features"] = features["input_values"].tolist()
earcons

Unnamed: 0,query,name,filepaths,features
0,bright,Power_Up_Bright_A.wav,dataset/earcon_dataset/earcons/Power_Up_Bright...,"[[2.058945938188117e-06, 2.7627368126559304e-0..."
1,bright,Magic_Spell_06.wav,dataset/earcon_dataset/earcons/Magic_Spell_06.wav,"[[-0.2963288426399231, -0.3907569348812103, -0..."
2,bright,Hand_Bells_High-C_Single.wav,dataset/earcon_dataset/earcons/Hand_Bells_High...,"[[2.4396009393967688e-05, 4.612719203578308e-0..."
3,bright,Hand_Bells_G#_Ab_Single.wav,dataset/earcon_dataset/earcons/Hand_Bells_G#_A...,"[[-9.434367530047894e-05, -0.00010156775533687..."
4,bright,Hand_Bells_A#_Bb_Single.wav,dataset/earcon_dataset/earcons/Hand_Bells_A#_B...,"[[2.7872014470631257e-05, 9.036024130182341e-0..."
...,...,...,...,...
642,warm,warm_basssynth_4101.wav,dataset/earcon_dataset/earcons/warm_basssynth_...,"[[0.004213274456560612, 0.006184867583215237, ..."
643,warm,warm_basssynth_4001.wav,dataset/earcon_dataset/earcons/warm_basssynth_...,"[[-0.0008950876072049141, -0.00261820969171822..."
644,warm,31WarmStrings.wav,dataset/earcon_dataset/earcons/31WarmStrings.wav,"[[-1.705546537777991e-06, 1.8894140794145642e-..."
645,warm,11WarmBrass.wav,dataset/earcon_dataset/earcons/11WarmBrass.wav,"[[1.223645085701719e-05, -6.467201455961913e-0..."


### Images

In [8]:

# load in images
images = pd.read_csv('dataset/landscape1/csvs/image_classification.csv')

# extract top tag and similarity score
images['top_tags'] = images['top_tags'].apply(ast.literal_eval)
images["top_tags"] = images["top_tags"].apply(lambda x: x[0])
images["similarity_scores"] = images["similarity_scores"].apply(ast.literal_eval)
images["similarity_scores"] = images["similarity_scores"].apply(lambda x: x[0])

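# drop leading "../" segments; str.lstrip("../") would strip any leading '.' or '/' characters
images["image_path"] = images["image_path"].str.replace(r"^(\.\./)+", "", regex=True)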
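A note on stripping a `../` prefix: `str.lstrip` treats its argument as a set of characters rather than a literal prefix, so it can remove more than intended. The sketch below contrasts the two behaviours on a made-up path; `strip_parent_prefix` is a hypothetical helper, not part of the notebook or of `utils`.

```python
import re

def strip_parent_prefix(path):
    """Remove leading '../' segments without touching the rest of the path."""
    return re.sub(r"^(\.\./)+", "", path)

path = "../.cache/images/a.png"  # hypothetical path, for illustration only

print(path.lstrip("../"))         # cache/images/a.png  (the dot in ".cache" is lost)
print(strip_parent_prefix(path))  # .cache/images/a.png
```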

In [9]:
# calculate image vectors using CLIP
import clip
from PIL import Image

# Load the CLIP model
model, preprocess = clip.load("ViT-B/32", device=device)

# Function to calculate image vectors
def calculate_image_vectors(image_paths):
    vectors = []
    for image_path in image_paths:
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
        with torch.no_grad():
            vector = model.encode_image(image)
        vectors.append(vector.cpu().numpy())
    return vectors

# Apply the function to the images dataframe
images["vector"] = calculate_image_vectors(images["image_path"].tolist())


### Calculate Cosine Similarity

Each image is paired with the earcon whose feature vector is most similar to the image vector. These pairs form the dataset for our model.

In [10]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Flatten each earcon feature (shape (1, 512)) into a 1D array -> (num_earcons, 512)
earcon_flattened_vectors = np.array([np.array(vec).flatten() for vec in earcons["features"]])

# Flatten each CLIP image vector (shape (1, 512)) into a 1D array -> (num_images, 512)
images_flattened_vectors = np.array([np.array(vec).flatten() for vec in images["vector"]])

# Calculate cosine similarity
cosine_similarities = cosine_similarity(images_flattened_vectors, earcon_flattened_vectors)

# Find the index of the earcon with the highest similarity for each image
max_sim_indices = np.argmax(cosine_similarities, axis=1)

# Create a new DataFrame to store the image and its most similar earcon
earcon_image_dataset = pd.DataFrame({
    # Assuming 'image' column contains image identifiers or file paths
    'image': images['image_path'],
    'image_vector': images["vector"],
    # Stores the index of the earcon with the highest similarity
    'earcon_index': max_sim_indices,
})

# Replace the earcon index with the corresponding filepath
earcon_image_dataset['earcon'] = earcon_image_dataset['earcon_index'].apply(lambda idx: earcons.iloc[idx]['filepaths'])
earcon_image_dataset['earcon_vector'] = earcon_image_dataset['earcon_index'].apply(lambda idx: earcons.iloc[idx]['features'])

earcon_image_dataset.drop(columns='earcon_index', inplace=True)

# Display results
earcon_image_dataset

Unnamed: 0,image,image_vector,earcon,earcon_vector
0,dataset/landscape1/Testing Data/Coast\Coast-Te...,"[[0.5435, -0.2112, -0.55, 0.04776, -0.4795, -0...",dataset/earcon_dataset/earcons/discreet_signal...,"[[-3.333803033456206e-05, -0.00019228400196880..."
1,dataset/landscape1/Testing Data/Coast\Coast-Te...,"[[0.1624, -0.01067, -0.1775, -0.0321, -0.2805,...",dataset/earcon_dataset/earcons/discreet_signal...,"[[-3.333803033456206e-05, -0.00019228400196880..."
2,dataset/landscape1/Testing Data/Coast\Coast-Te...,"[[0.5337, -0.10956, -0.3699, 0.1289, -0.2458, ...",dataset/earcon_dataset/earcons/discreet_signal...,"[[-3.333803033456206e-05, -0.00019228400196880..."
3,dataset/landscape1/Testing Data/Coast\Coast-Te...,"[[0.761, -0.2683, 0.2966, -0.2286, -0.2045, -0...",dataset/earcon_dataset/earcons/discreet_signal...,"[[-3.333803033456206e-05, -0.00019228400196880..."
4,dataset/landscape1/Testing Data/Coast\Coast-Te...,"[[0.3289, -0.2368, -0.1471, -0.02492, -0.3987,...",dataset/earcon_dataset/earcons/discreet_signal...,"[[-3.333803033456206e-05, -0.00019228400196880..."
...,...,...,...,...
11995,dataset/landscape1/Validation Data/Mountain\Mo...,"[[-0.1316, 0.4343, 0.1354, 0.11646, -0.529, 0....",dataset/earcon_dataset/earcons/discreet_signal...,"[[-3.333803033456206e-05, -0.00019228400196880..."
11996,dataset/landscape1/Validation Data/Mountain\Mo...,"[[-0.321, 0.167, 0.141, 0.1422, -0.7173, -0.31...",dataset/earcon_dataset/earcons/Warm_Horn_C1.wav,"[[-0.003673421684652567, -0.005470327567309141..."
11997,dataset/landscape1/Validation Data/Mountain\Mo...,"[[0.3435, 0.3037, 0.09796, 0.2435, -0.5654, -0...",dataset/earcon_dataset/earcons/effect_7.wav,"[[4.422572965268046e-05, 5.2678333304356784e-0..."
11998,dataset/landscape1/Validation Data/Mountain\Mo...,"[[0.1853, -0.05533, -0.1471, 0.11993, -0.11285...",dataset/earcon_dataset/earcons/DarkHit001.wav,"[[-0.005931014660745859, -0.005464441142976284..."


### Build Pseudowords and Bouba-Kiki value

In [11]:
import random

sound_dict = psword_gen.load_sound_mappings('utils/sound_mappings.json')

def generate_pseudoword_and_bouba_kiki(image_path, sound_dict):
    x_values, y_values = psword_utils.process_image(image_path, 50, 150)
    weighted_angles, roundness = psword_utils.calculate_weighted_angles_by_edge_length(x_values, y_values)
    
    random.seed(42)
    
    psword = psword_gen.pseudoword_generator(
        roundness,
        len(x_values),
        sound_dict=sound_dict
    )
    
    return roundness, psword

# Apply the function to each row in the earcon_image_dataset
earcon_image_dataset[['roundness', 'pseudoword']] = earcon_image_dataset.apply(
    lambda row: generate_pseudoword_and_bouba_kiki(row['image'], sound_dict), axis=1, result_type='expand'
)

earcon_image_dataset

Unnamed: 0,image,image_vector,earcon,earcon_vector,roundness,pseudoword
0,dataset/landscape1/Testing Data/Coast\Coast-Te...,"[[0.5435, -0.2112, -0.55, 0.04776, -0.4795, -0...",dataset/earcon_dataset/earcons/discreet_signal...,"[[-3.333803033456206e-05, -0.00019228400196880...",0.548406,juxuluji
1,dataset/landscape1/Testing Data/Coast\Coast-Te...,"[[0.1624, -0.01067, -0.1775, -0.0321, -0.2805,...",dataset/earcon_dataset/earcons/discreet_signal...,"[[-3.333803033456206e-05, -0.00019228400196880...",0.515597,juxulu
2,dataset/landscape1/Testing Data/Coast\Coast-Te...,"[[0.5337, -0.10956, -0.3699, 0.1289, -0.2458, ...",dataset/earcon_dataset/earcons/discreet_signal...,"[[-3.333803033456206e-05, -0.00019228400196880...",0.562980,geleje
3,dataset/landscape1/Testing Data/Coast\Coast-Te...,"[[0.761, -0.2683, 0.2966, -0.2286, -0.2045, -0...",dataset/earcon_dataset/earcons/discreet_signal...,"[[-3.333803033456206e-05, -0.00019228400196880...",0.533194,juxulu
4,dataset/landscape1/Testing Data/Coast\Coast-Te...,"[[0.3289, -0.2368, -0.1471, -0.02492, -0.3987,...",dataset/earcon_dataset/earcons/discreet_signal...,"[[-3.333803033456206e-05, -0.00019228400196880...",0.551611,geleje
...,...,...,...,...,...,...
11995,dataset/landscape1/Validation Data/Mountain\Mo...,"[[-0.1316, 0.4343, 0.1354, 0.11646, -0.529, 0....",dataset/earcon_dataset/earcons/discreet_signal...,"[[-3.333803033456206e-05, -0.00019228400196880...",0.544531,juxuluji
11996,dataset/landscape1/Validation Data/Mountain\Mo...,"[[-0.321, 0.167, 0.141, 0.1422, -0.7173, -0.31...",dataset/earcon_dataset/earcons/Warm_Horn_C1.wav,"[[-0.003673421684652567, -0.005470327567309141...",0.512671,juxuluja
11997,dataset/landscape1/Validation Data/Mountain\Mo...,"[[0.3435, 0.3037, 0.09796, 0.2435, -0.5654, -0...",dataset/earcon_dataset/earcons/effect_7.wav,"[[4.422572965268046e-05, 5.2678333304356784e-0...",0.532324,juxuluji
11998,dataset/landscape1/Validation Data/Mountain\Mo...,"[[0.1853, -0.05533, -0.1471, 0.11993, -0.11285...",dataset/earcon_dataset/earcons/DarkHit001.wav,"[[-0.005931014660745859, -0.005464441142976284...",0.554051,gelejegi


### Build Dataloaders

In [None]:
# build dataloader

## Model

The model pipeline is as follows:
- The Earcon Encodec Vector is the target
- The VA Vector and Bouba-Kiki Value will be inputs to the model
- The model will output a set of vectors which will be fed to the MusicGen Decoder along with the Pseudoword
- The output of MusicGen Decoder will be encoded by Encodec
- The output of Encodec will be considered the final output, and loss will be calculated based on the difference between this output and the target Encodec vector from the Earcon
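The pipeline above could be sketched in PyTorch roughly as follows. Everything here is an assumption made for illustration: the `EarconRegressor` architecture, the 512-dimensional input and output sizes (chosen to match the image and earcon vectors built earlier), the MSE loss, and the `decode_and_reencode` placeholder, which stands in for the MusicGen Decoder and Encodec steps and simply returns its input. The pseudoword input is omitted.

```python
import torch
from torch import nn

class EarconRegressor(nn.Module):
    """Hypothetical MLP: image vector + Bouba-Kiki value -> earcon vector."""

    def __init__(self, image_dim=512, hidden_dim=256, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(image_dim + 1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, image_vec, roundness):
        # roundness has shape (batch,); append it as one extra input feature
        x = torch.cat([image_vec, roundness.unsqueeze(-1)], dim=-1)
        return self.net(x)

def decode_and_reencode(encoded):
    # Placeholder (assumption): stands in for MusicGen Decoder -> Encodec.
    # It returns its input unchanged so that the sketch stays runnable.
    return encoded

torch.manual_seed(0)
model = EarconRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # assumed loss; the notebook only says "difference"

# Random stand-in batch; real inputs would come from earcon_image_dataset
image_vec = torch.randn(8, 512)
roundness = torch.rand(8)
target = torch.randn(8, 512)

# One training step: forward pass, loss against the target, backward pass
prediction = decode_and_reencode(model(image_vec, roundness))
loss = loss_fn(prediction, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(prediction.shape, loss.item())
```

In a real implementation, the stages that replace the placeholder would need to be differentiable for the loss to update the model's weights.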

In [None]:
# init model + parameters

In [None]:
# train model

## Testing

In [None]:
# convert some outputs to audio using musicgen's decoder