# Summary

This notebook generates additional training data for the main model.

The model trained here learns to predict the roundness of pseudowords from normalized.csv.

The original dataset (normalized.csv) contains only 124 rows, which is too few to train a large model such as the ByT5-Pseudword-Generator. Instead, this smaller model learns to predict roundness values from those rows and is then applied to a much larger set of generated pseudowords. The resulting pseudoword-roundness pairs are used to train the ByT5-Pseudword-Generator model.

In [1]:
from utils.roundness_determiner import (
    RoundnessDeterminerBERT,
    train_kfold,
    save_model,
    load_model,
)
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import random
import torch
import json


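state = 42        # seed for Python's random module
VERSION = "3.0"   # inserted into the saved model's filename


pd.set_option('display.max_columns', None)  # show every column when displaying DataFrames
device = "cuda" if torch.cuda.is_available() else "cpu"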
random.seed(state)
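
The cell above seeds only Python's `random` module, which controls the pseudoword generation later in the notebook. Anything driven by PyTorch's generator, such as weight initialization of new layers, would still vary between runs. A minimal sketch of a broader seeding helper follows; `seed_everything` is a hypothetical name, not part of `utils.roundness_determiner`, and whether the training code needs it depends on details the notebook does not show.

```python
import random


def seed_everything(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed, so there is nothing to seed
    try:
        import torch
        torch.manual_seed(seed)  # seeds PyTorch's generators on all devices
    except ImportError:
        pass  # PyTorch is not installed, so there is nothing to seed


# Re-seeding with the same value reproduces the same draw.
seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
print(first == second)
```

Calling such a helper once at the top of the notebook, in place of `random.seed(state)`, would leave the pseudoword sampling unchanged, because `random` is seeded with the same value.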

# Building and Training

## Dataset

In [2]:
data = pd.read_csv("datasets/normalized.csv")
data

Unnamed: 0,Stimuli,ExperimentalRoundScore
0,bebi,0.815217
1,bibe,0.913043
2,bobou,0.815217
3,boubo,1.000000
4,chechi,0.184783
...,...,...
119,outou,0.347826
120,uku,0.239130
121,ulu,0.913043
122,umu,0.913043


In [3]:
data.describe()

Unnamed: 0,ExperimentalRoundScore
count,124.0
mean,0.562675
std,0.316366
min,0.0
25%,0.26087
50%,0.543478
75%,0.902174
max,1.0


## Model

In [4]:
model = RoundnessDeterminerBERT(
    model_name="bert-base-uncased",
    hidden_size=768,
)

In [5]:
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)
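
To make the schedule concrete: `StepLR` multiplies the learning rate by `gamma` once every `step_size` scheduler steps. The sketch below reproduces that rule in plain Python, assuming the scheduler is stepped once per epoch; the notebook does not show how `train_kfold` steps it, so treat the epoch axis as an assumption. `step_lr` is an illustrative helper, not a PyTorch function.

```python
def step_lr(base_lr, gamma, step_size, epoch):
    """Learning rate after `epoch` scheduler steps under a StepLR-style decay."""
    return base_lr * gamma ** (epoch // step_size)


# Values from the cell above: lr=0.001, step_size=10, gamma=0.9
for epoch in (0, 9, 10, 50, 100):
    print(f"epoch {epoch:>3}: lr = {step_lr(0.001, 0.9, 10, epoch):.6f}")
```

Under this assumption the decay is gentle: the rate drops by 10% every ten epochs, so it stays within the same order of magnitude for the first couple of hundred epochs.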

## Training

In [6]:
result = train_kfold(
    model=model,
    roundness=data["ExperimentalRoundScore"],
    texts=data["Stimuli"],
    batch_size=5,
    optimizer=optimizer,
    scheduler=scheduler,
    epochs=1000,
    patience=10,
    k=12,
)


Fold 1/12
Epoch    1/1000 | Train Loss: 1.0953 | Val Loss: 1.6349 | Best Val: inf
Epoch    2/1000 | Train Loss: 0.9071 | Val Loss: 0.8919 | Best Val: 1.6349
Epoch    3/1000 | Train Loss: 0.7870 | Val Loss: 0.7759 | Best Val: 0.8919
Epoch    4/1000 | Train Loss: 0.7191 | Val Loss: 0.6989 | Best Val: 0.7759
Epoch    5/1000 | Train Loss: 0.7261 | Val Loss: 0.7823 | Best Val: 0.6989
Epoch    6/1000 | Train Loss: 0.7145 | Val Loss: 0.7057 | Best Val: 0.6989
Epoch    7/1000 | Train Loss: 0.7210 | Val Loss: 0.6833 | Best Val: 0.6989
Epoch    8/1000 | Train Loss: 0.6800 | Val Loss: 0.6917 | Best Val: 0.6833
Epoch    9/1000 | Train Loss: 0.7253 | Val Loss: 0.6881 | Best Val: 0.6833
Epoch   10/1000 | Train Loss: 0.6760 | Val Loss: 0.7470 | Best Val: 0.6833
Epoch   11/1000 | Train Loss: 0.6845 | Val Loss: 0.7406 | Best Val: 0.6833
Epoch   12/1000 | Train Loss: 0.6848 | Val Loss: 0.7151 | Best Val: 0.6833
Epoch   13/1000 | Train Loss: 0.6339 | Val Loss: 0.6898 | Best Val: 0.6833
Epoch   14/1000 |
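
The internals of `train_kfold` are not shown here, so the following is only a sketch of the two ideas its arguments suggest: splitting 124 rows into `k=12` folds, and stopping a fold once the validation loss has not improved for `patience` epochs. `kfold_indices` and `EarlyStopping` are hypothetical names, and the contiguous split is an assumption (the real function may shuffle). The validation losses are copied from the Fold 1 log above.

```python
def kfold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # the first `extra` folds get one more row
        folds.append(list(range(start, start + size)))
        start += size
    return folds


class EarlyStopping:
    """Track the best validation loss and count epochs without improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


folds = kfold_indices(124, 12)
print([len(fold) for fold in folds])  # validation-set size for each fold

# Validation losses for epochs 1 to 13 of Fold 1, taken from the log above
val_losses = [1.6349, 0.8919, 0.7759, 0.6989, 0.7823, 0.7057, 0.6833,
              0.6917, 0.6881, 0.7470, 0.7406, 0.7151, 0.6898]
stopper = EarlyStopping(patience=10)
stopped = any(stopper.step(loss) for loss in val_losses)
print(stopper.best, stopper.bad_epochs, stopped)
```

In the log, the "Best Val" column appears to report the best value before the current epoch is counted (epoch 1 shows `inf`), which matches tracking the best loss and updating it after each comparison.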

## Testing

In [7]:
word_list = ["bouba", "kiki"]
model.inference(word_list)

array([0.43793532, 0.2597781 ], dtype=float32)

In [8]:
word_list = ["maluma", "takete"]
model.inference(word_list)

array([0.5786823 , 0.28268924], dtype=float32)

## Saving the model

In [9]:
save_model(
    model=model,
    directory="outputs/",
    filename=f"roundness_determiner_v0{VERSION}.pth",
)

Model saved to outputs/roundness_determiner_v03.1.pth
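
The filename is built from `VERSION`, so it is worth checking what the f-string actually produces. The sketch below rebuilds the path with `pathlib`; `model_path` is a hypothetical helper, and `save_model` may join the directory and filename differently.

```python
from pathlib import Path


def model_path(directory, version):
    """Build the checkpoint path using the same f-string pattern as the save cell."""
    return Path(directory) / f"roundness_determiner_v0{version}.pth"


print(model_path("outputs/", "3.0").as_posix())
```

With `VERSION = "3.0"` as set in the first cell, this pattern gives `roundness_determiner_v03.0.pth`. The recorded output shows `v03.1`, so that output was probably produced by a run in which `VERSION` had a different value.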


# Loading and using the model

## Loading the model

In [10]:
model = load_model(directory="outputs/", filename=f"roundness_determiner_v0{VERSION}.pth")

Model loaded from outputs/roundness_determiner_v03.1.pth


In [11]:
word_list = ["bouba", "kiki"]
model.inference(word_list)

array([0.43793482, 0.25977808], dtype=float32)

In [12]:
word_list = ["maluma", "takete"]
model.inference(word_list)

array([0.5786822 , 0.28268903], dtype=float32)

## Importing data

In [13]:
# Load the building blocks used to assemble pseudowords
with open("datasets/words.json") as f:
    units = json.load(f)

# Build a random string by joining 2 to 5 randomly chosen keys
def generate_random_string(units, min_len=2, max_len=5):
    length = random.randint(min_len, max_len)
    return ''.join(random.choices(list(units.keys()), k=length))

# Generate 10000 unique strings
unique_strings = set()
while len(unique_strings) < 10000:
    unique_strings.add(generate_random_string(units))

# Convert to DataFrame
data = pd.DataFrame(list(unique_strings), columns=['Pseudoword'])
data

Unnamed: 0,Pseudoword
0,irepeo
1,bea
2,kiko
3,tsupihamumo
4,koke
...,...
9995,tademunoo
9996,tsujidenubo
9997,musa
9998,sateihemu
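
Two properties of the sampling loop above are easy to miss. It never terminates if more unique strings are requested than the unit inventory can produce, and iterating over a `set` of strings can give a different row order from run to run even with a fixed seed, because Python randomizes string hashes per process by default. The sketch below addresses both. The `UNITS` list is a made-up stand-in for the keys of `datasets/words.json`, and `generate_unique_strings` is a hypothetical helper.

```python
import random

# Hypothetical unit inventory; the real keys come from datasets/words.json.
UNITS = ["ka", "ki", "mo", "bu", "te", "lo"]


def generate_unique_strings(units, n, min_len=2, max_len=5, seed=42):
    """Return n unique strings in a reproducible order, or raise if that is impossible."""
    units = list(dict.fromkeys(units))  # drop duplicate units, keeping their order
    # Count of unit sequences: an upper bound on the number of distinct strings,
    # since two different sequences can concatenate to the same string.
    upper_bound = sum(len(units) ** length for length in range(min_len, max_len + 1))
    if n > upper_bound:
        raise ValueError(f"cannot build {n} unique strings; at most {upper_bound} exist")
    rng = random.Random(seed)  # local generator, independent of the global seed
    seen = {}  # dict keys are unique and keep insertion order
    while len(seen) < n:
        length = rng.randint(min_len, max_len)
        seen["".join(rng.choices(units, k=length))] = None
    return list(seen)


words = generate_unique_strings(UNITS, n=20)
print(len(words), len(set(words)))
```

The guard only rules out the case that can never finish. A request close to the upper bound would still take a long time, because most draws would be repeats.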


## Applying model

In [14]:
data["Roundness"] = model.inference(data["Pseudoword"].to_list())
data

Unnamed: 0,Pseudoword,Roundness
0,irepeo,0.481853
1,bea,0.562284
2,kiko,0.371239
3,tsupihamumo,0.235215
4,koke,0.212681
...,...,...
9995,tademunoo,0.782395
9996,tsujidenubo,0.339519
9997,musa,0.517813
9998,sateihemu,0.492698


In [15]:
data.describe()

Unnamed: 0,Roundness
count,10000.0
mean,0.500261
std,0.169169
min,0.172221
25%,0.372915
50%,0.494394
75%,0.613484
max,0.891836


## Saving CSV

In [16]:
data.to_csv("datasets/japanese_pseudowords.csv", index=False)