# Summary

This notebook trains a small auxiliary model, the roundness determiner, and uses it to generate additional training data for the main model.

The roundness determiner is trained on normalized.csv to predict the roundness of pseudowords.

The original dataset (normalized.csv) contains only 124 rows, which is not enough to train a large model like the ByT5-Pseudword-Generator. The roundness determiner therefore learns to predict roundness values from this small dataset and is then applied to a larger set of pseudowords. The result is a dataset of pseudoword-roundness pairs that is used to train the ByT5-Pseudword-Generator model.

In [None]:
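# Import only the names this notebook uses instead of a wildcard import
from utils.roundness_determiner import roundness_determiner, train, save_model, load_model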
import pandas as pd
import torch

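pd.set_option('display.max_columns', None)  # show every column when displaying DataFrames
device = "cuda" if torch.cuda.is_available() else "cpu"
state = 42  # random seed for the train/validation split

VERSION = 1  # version suffix used in the saved model filename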

# Building and Training

## Dataset

In [None]:
data = pd.read_csv("datasets/normalized.csv")
data

In [3]:
# Split data into train (80%) and validation (20%) sets

trn = data.sample(frac=0.8, random_state=state)
val = data.drop(trn.index)
trn.reset_index(inplace=True, drop=True)
val.reset_index(inplace=True, drop=True)
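
The split above relies on `sample` picking 80% of the rows and `drop` removing exactly those rows by index. The sketch below checks that behaviour on a toy DataFrame; the toy data is made up for illustration and only mimics the column names of normalized.csv.

```python
import pandas as pd

# Toy stand-in for normalized.csv; the values are invented for illustration.
toy = pd.DataFrame({
    "Stimuli": [f"word{i}" for i in range(10)],
    "ExperimentalRoundScore": [i / 10 for i in range(10)],
})

trn = toy.sample(frac=0.8, random_state=42)
val = toy.drop(trn.index)

# Check the split before the indices are reset:
# no row may appear in both sets, and no row may be lost.
assert trn.index.intersection(val.index).empty
assert len(trn) + len(val) == len(toy)
print(f"Train set: {len(trn)} samples, Validation set: {len(val)} samples")
```

The check has to run before `reset_index`, because resetting discards the original row labels that make the two sets comparable. On the 124-row dataset, the same split should give roughly 99 training rows and 25 validation rows.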

In [None]:
print(f"Train set: {len(trn)} samples, Validation set: {len(val)} samples")

## Model

In [5]:
model = roundness_determiner()

In [6]:
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.8)
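
To see what this schedule does to the learning rate, here is a pure-Python sketch of the step-decay rule that `StepLR` follows: the rate is multiplied by `gamma` once every `step_size` scheduler steps. It assumes the scheduler is stepped once per epoch, which depends on how `train` is implemented; `step_lr` is a hypothetical helper written only for this illustration.

```python
def step_lr(base_lr: float, step_size: int, gamma: float, epoch: int) -> float:
    """Learning rate after `epoch` scheduler steps under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# Settings from the cell above: lr=0.001, step_size=5, gamma=0.8
# (assumes one scheduler step per epoch)
for epoch in (0, 4, 5, 25, 100):
    print(f"epoch {epoch:>3}: lr = {step_lr(0.001, 5, 0.8, epoch):.3g}")
```

Under that assumption, the rate stays at 0.001 for the first five epochs, drops to 0.0008 at epoch 5, and has shrunk to about 1.15e-5 by epoch 100, so very long runs make only small updates.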

## Training

In [None]:
result = train(
    model=model,
    trn_roundness=trn["ExperimentalRoundScore"],
    trn_texts=trn["Stimuli"],
    val_roundness=val["ExperimentalRoundScore"],
    val_texts=val["Stimuli"],
    batch_size=5,
    optimizer=optimizer,
    scheduler=scheduler,
    epochs=1000,
    patience=25,
)
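
The `patience=25` argument suggests that training stops early once the validation loss stops improving, well before the 1000-epoch limit. The internals of `train` are not shown in this notebook, so the following is only a sketch of how patience-based early stopping commonly works; `early_stopping_epoch` is a hypothetical helper and not part of `utils.roundness_determiner`.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs pass without
    a new best (strictly lower) validation loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    stale = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch, best_epoch
    # The losses ran out before patience was exhausted
    return len(val_losses) - 1, best_epoch


# Made-up loss curve for illustration
losses = [1.0, 0.8, 0.9, 0.85, 0.95, 0.7]
print(early_stopping_epoch(losses, patience=3))
```

With a small validation set, the loss can fluctuate noticeably between epochs, so a fairly large patience value helps avoid stopping on a single noisy epoch.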

## Testing

In [None]:
word_list = ["bouba", "kiki"]
model.inference(word_list)

In [None]:
word_list = ["takete", "maluma"]
model.inference(word_list)

## Saving the model

In [None]:
save_model(
    model=model,
    directory="outputs/",
    filename=f"roundness_determiner_v{VERSION}.pth",
)

# Loading and using the model

## Loading the model

In [None]:
model = load_model(directory="outputs/", filename=f"roundness_determiner_v{VERSION}.pth")

In [None]:
word_list = ["bouba", "kiki"]
model.inference(word_list)

In [None]:
word_list = ["takete", "maluma"]
model.inference(word_list)

## Importing data

In [5]:
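# Keep only the pseudoword rows, drop the columns that are not needed here,
# and rename "word" to "Stimuli" to match the column name in normalized.csv
data = pd.read_csv("datasets/pseudoword_interpretations_v2.csv")
data = data[data["type"] == "pseudoword"]
data.reset_index(inplace=True, drop=True)
data.drop(columns=["participant", "interpretation", "type", "valence"], inplace=True)
data.rename(columns={"word": "Stimuli"}, inplace=True)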

## Applying model

In [None]:
# Predict a roundness score for every pseudoword in the larger dataset
data["ExperimentalRoundScore"] = model.inference(data["Stimuli"].to_list())
data

## Saving CSV

In [7]:
data.to_csv("datasets/normalized_v2.csv", index=False)