# Design

This is a decoder model that takes a roundness value as input and outputs a pseudoword corresponding to that value.

In [1]:
from utils.pseudoword_generator import *
from utils.word_tokenizer import *
from dotenv import load_dotenv
from utils.dataset import *
import pandas as pd
import torch
import os


load_dotenv()
pd.set_option('display.max_columns', None)
device = "cuda" if torch.cuda.is_available() else "cpu"
state = 42

# Dataset

The dataset is obtained by normalizing `datasets\Fortetal2015_dataforOSF.csv`; the processing code can be found in `data_normalizer.ipynb`.

A linear normalization transforms `ExperimentalRoundScore` from its original range (-42 to 50) to (0 to 1), with 0 being maximally sharp and 1 being maximally round.
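As a rough illustration of the linear normalization described above, the sketch below applies min-max scaling with the stated bounds of -42 and 50. The function name `normalize` is hypothetical, and the actual code in `data_normalizer.ipynb` may differ.

```python
# Bounds of the raw ExperimentalRoundScore, as stated in the text above.
RAW_MIN, RAW_MAX = -42, 50


def normalize(score: float, lo: float = RAW_MIN, hi: float = RAW_MAX) -> float:
    """Min-max scale a raw score from [lo, hi] into [0, 1]."""
    return (score - lo) / (hi - lo)


print(normalize(-42))           # 0.0
print(normalize(50))            # 1.0
print(round(normalize(33), 6))  # 0.815217
```

Under this mapping every integer raw score lands on a multiple of 1/92, which appears consistent with the normalized values shown in the dataset.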

In [2]:
# Import dataset
data = pd.read_csv("datasets/normalized.csv")
data.rename(columns={"Stimuli": "Pseudoword", "ExperimentalRoundScore": "Roundness"}, inplace=True)
data

| | Pseudoword | Roundness |
|---|---|---|
| 0 | bebi | 0.815217 |
| 1 | bibe | 0.913043 |
| 2 | bobou | 0.815217 |
| 3 | boubo | 1.000000 |
| 4 | chechi | 0.184783 |
| ... | ... | ... |
| 119 | outou | 0.347826 |
| 120 | uku | 0.239130 |
| 121 | ulu | 0.913043 |
| 122 | umu | 0.913043 |


In [3]:
data.describe()

| | Roundness |
|---|---|
| count | 124.0 |
| mean | 0.562675 |
| std | 0.316366 |
| min | 0.0 |
| 25% | 0.26087 |
| 50% | 0.543478 |
| 75% | 0.902174 |
| max | 1.0 |


In [4]:
trn, val, tst = create_datasets()

# Hyperparameter tuning

This section performs a grid search to determine the best hyperparameters for training the model. The intermediate models are trained for fewer epochs and with a shorter early-stopping patience, since the goal here is only to compare hyperparameter combinations.

The best model is defined here as the one with the lowest test loss. Because the test set is never seen by the model during training, a low test loss suggests that the model generalizes better.
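The `grid_search` helper lives in `utils` and is not shown in this notebook, so the following is only a sketch of how a parameter grid like the one below might be expanded into individual combinations and how a winner might be picked. The names `expand_grid` and `select_best`, and the `"test_loss"` key, are assumptions made for illustration.

```python
from itertools import product

param_grid = {
    'd_model': [64, 128],
    'nhead': [8, 16],
    'num_layers': [8, 16],
    'learning_rate': [0.1, 0.05],
    'weight_decay': [0.01],
    'batch_size': [8],
    'max_length': [16],
}


def expand_grid(grid):
    """Return one dict per combination in the Cartesian product of the grid values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def select_best(results):
    """Pick the entry with the lowest test loss (hypothetical result format)."""
    return min(results, key=lambda r: r["test_loss"])


combinations = expand_grid(param_grid)
print(len(combinations))  # 16
print(combinations[0])
```

With two options for each of four hyperparameters and a single option for the rest, the grid yields 2 × 2 × 2 × 2 = 16 combinations.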

In [5]:
param_grid = {
    'd_model': [64, 128],
    'nhead': [8, 16],
    'num_layers': [8, 16],
    'learning_rate': [0.1, 0.05],
    'weight_decay': [0.01],
    'batch_size': [8],
    'max_length': [16]
}

In [6]:
result = grid_search(
    trn=trn,
    val=val,
    tst=tst,
    param_grid=param_grid,
    epochs=15,
    patience=3
)


[1/16] Testing parameters: {'d_model': 64, 'nhead': 8, 'num_layers': 8, 'learning_rate': 0.1, 'weight_decay': 0.01, 'batch_size': 8, 'max_length': 16}
Using decoupled weight decay
Epoch 1: Average Training Loss: 3.4642, Average Validation Loss: 3.4732
Epoch 2: Average Training Loss: 3.2958, Average Validation Loss: 2.9094
Epoch 3: Average Training Loss: 2.8217, Average Validation Loss: 2.4684
Epoch 4: Average Training Loss: 2.4561, Average Validation Loss: 2.2023
Epoch 5: Average Training Loss: 2.2808, Average Validation Loss: 2.1097
Epoch 6: Average Training Loss: 2.2001, Average Validation Loss: 2.0974
Epoch 7: Average Training Loss: 2.1021, Average Validation Loss: 2.0468
Epoch 8: Average Training Loss: 2.0711, Average Validation Loss: 1.9833
Epoch 9: Average Training Loss: 2.0413, Average Validation Loss: 2.0219
Epoch 10: Average Training Loss: 1.9987, Average Validation Loss: 2.0081
Epoch 11: Average Training Loss: 1.9866, Average Validation Loss: 2.0060

Early stopping triggered
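The log above is consistent with a common early-stopping rule: stop once the validation loss has failed to improve for `patience` consecutive epochs. The sketch below assumes that rule (the actual logic inside `grid_search` is not shown) and replays the validation losses from the run above. The function name `early_stopping_epoch` is hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never triggers."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Validation losses copied from the first grid-search run above.
val_losses = [3.4732, 2.9094, 2.4684, 2.2023, 2.1097, 2.0974,
              2.0468, 1.9833, 2.0219, 2.0081, 2.0060]
print(early_stopping_epoch(val_losses, patience=3))  # 11
```

The best validation loss (1.9833) occurs at epoch 8, and epochs 9 to 11 do not improve on it, so with `patience=3` this rule stops after epoch 11.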

# Training

Since the best hyperparameters were determined above, they are used to train the final model, this time with more epochs and a longer early-stopping patience.

In [7]:
train_result = train(trn, val, tst, params=result["parameters"], epochs=100, patience=10)

Using decoupled weight decay
Epoch 1: Average Training Loss: 3.4464, Average Validation Loss: 3.3782
Epoch 2: Average Training Loss: 3.2012, Average Validation Loss: 2.7766
Epoch 3: Average Training Loss: 2.5506, Average Validation Loss: 2.2288
Epoch 4: Average Training Loss: 2.1918, Average Validation Loss: 2.0165
Epoch 5: Average Training Loss: 2.0739, Average Validation Loss: 1.9730
Epoch 6: Average Training Loss: 2.0533, Average Validation Loss: 2.0382
Epoch 7: Average Training Loss: 1.9835, Average Validation Loss: 1.9551
Epoch 8: Average Training Loss: 1.9389, Average Validation Loss: 1.9197
Epoch 9: Average Training Loss: 1.9323, Average Validation Loss: 1.9295
Epoch 10: Average Training Loss: 1.9066, Average Validation Loss: 1.8553
Epoch 11: Average Training Loss: 1.8833, Average Validation Loss: 1.8317
Epoch 12: Average Training Loss: 1.8912, Average Validation Loss: 1.8517
Epoch 13: Average Training Loss: 1.8744, Average Validation Loss: 1.8790
Epoch 14: Average Training Loss

In [8]:
train_result['final_test_loss']

2.063342809677124

# Testing

After the model has been trained, its performance is checked manually. Roundness values are randomly sampled from the dataset, and each generated pseudoword is compared against both the input roundness value and the original pseudoword for that value.

After that, a list of roundness values spanning the range from 0 to 1 is fed to the model. This allows a manual check of whether the generated pseudowords correspond to the input roundness across the entire spectrum of inputs.

In [9]:
random_sample = data.sample(n=10, random_state=42)
for _, row in random_sample.iterrows():
    print(f"Roundness Value : {row['Roundness']}")
    print(f"Original Word   : {row['Pseudoword']}")
    print(f"Predicted word  : {inference(train_result['model'], row['Roundness'], train_result['tokenizer'])}")
    print()

Roundness Value : 0.4565217391304347
Original Word   : guegui
Predicted word  : foto

Roundness Value : 0.3695652173913043
Original Word   : sise
Predicted word  : tutiu

Roundness Value : 0.9130434782608696
Original Word   : nonou
Predicted word  : louo

Roundness Value : 0.8695652173913043
Original Word   : minlin
Predicted word  : louo

Roundness Value : 0.0
Original Word   : zize
Predicted word  : teke

Roundness Value : 0.8695652173913043
Original Word   : ama
Predicted word  : louo

Roundness Value : 0.3695652173913043
Original Word   : kantan
Predicted word  : tutiu

Roundness Value : 0.9130434782608696
Original Word   : umu
Predicted word  : louo

Roundness Value : 0.9130434782608696
Original Word   : ulu
Predicted word  : louo

Roundness Value : 0.1847826086956521
Original Word   : chechi
Predicted word  : tute



In [10]:
# Roundness values from 0.0 to 1.0 in steps of 0.1
roundness_list = [i / 10 for i in range(11)]

for roundness in roundness_list:
    print(f"Roundness Value: {roundness}")
    print(f"Predicted word: {inference(train_result['model'], roundness, train_result['tokenizer'])}")
    print()

Roundness Value: 0.0
Predicted word: teke

Roundness Value: 0.1
Predicted word: teke

Roundness Value: 0.2
Predicted word: tute

Roundness Value: 0.3
Predicted word: tute

Roundness Value: 0.4
Predicted word: tuto

Roundness Value: 0.5
Predicted word: fofo

Roundness Value: 0.6
Predicted word: fofou

Roundness Value: 0.7
Predicted word: jolou

Roundness Value: 0.8
Predicted word: louou

Roundness Value: 0.9
Predicted word: louo

Roundness Value: 1.0
Predicted word: louo



# Save and load model

In [11]:
# Single quotes inside the f-string keep this valid on Python versions before 3.12
save_model(train_result['model'], path=f"outputs/pseudoword_generator_v0{os.getenv('GEN')}.pth")

In [12]:
import json

# Store the selected hyperparameters alongside the model weights
with open(f"outputs/params_for_model_v0{os.getenv('GEN')}.json", "w") as f:
    json.dump(result["parameters"], f)

In [None]:
model = load_model(filename=f"pseudoword_generator_v0{os.getenv('GEN')}.pth")
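Since the hyperparameters are written to JSON next to the model weights, they can be read back later to rebuild a model with the same configuration. The sketch below shows only the JSON round trip, using a temporary directory and the first grid entry as illustrative values; it does not reflect which combination the grid search actually selected.

```python
import json
import tempfile
from pathlib import Path

# Illustrative values only (the first entry of the parameter grid).
params = {
    "d_model": 64,
    "nhead": 8,
    "num_layers": 8,
    "learning_rate": 0.1,
    "weight_decay": 0.01,
    "batch_size": 8,
    "max_length": 16,
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "params_example.json"  # hypothetical file name
    path.write_text(json.dumps(params))
    restored = json.loads(path.read_text())

print(restored == params)  # True
```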