# Summary

This notebook generates additional training data for the main model.

The model built here is trained on normalized.csv to predict the roundness of pseudowords.

The original dataset (normalized.csv) contains only 124 rows, which is too few to train a large model such as the ByT5-Pseudword-Generator. The roundness model is therefore trained first, then applied to a larger set of generated pseudowords to produce the pseudoword-roundness pairs used to train the ByT5-Pseudword-Generator model.

In [1]:
from utils.roundness_determiner import (
    RoundnessDeterminerBERT,
    load_model,
    save_model,
    train_kfold,
)
from dotenv import load_dotenv
import pandas as pd
import random
import torch
import json
import os


load_dotenv()
state = 42


pd.set_option('display.max_columns', None)
device = "cuda" if torch.cuda.is_available() else "cpu"
random.seed(state)

# Building and Training

## Dataset

In [2]:
data = pd.read_csv("datasets/normalized.csv")
data

Unnamed: 0,Stimuli,ExperimentalRoundScore
0,bebi,0.815217
1,bibe,0.913043
2,bobou,0.815217
3,boubo,1.000000
4,chechi,0.184783
...,...,...
119,outou,0.347826
120,uku,0.239130
121,ulu,0.913043
122,umu,0.913043


In [3]:
data.describe()

Unnamed: 0,ExperimentalRoundScore
count,124.0
mean,0.562675
std,0.316366
min,0.0
25%,0.26087
50%,0.543478
75%,0.902174
max,1.0


## Model

In [4]:
model_name = "roberta-base"

In [5]:
model = RoundnessDeterminerBERT(
    model_name=model_name,
    hidden_size=768,
)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)
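
For reference, StepLR multiplies the learning rate by gamma once every step_size calls to scheduler.step(). The sketch below reproduces that schedule in plain Python for the values used above (lr=0.001, step_size=10, gamma=0.9). It assumes the scheduler is stepped once per epoch, which depends on how train_kfold is implemented and is not shown in this notebook; the helper name step_lr is hypothetical.

```python
def step_lr(initial_lr: float, step_size: int, gamma: float, epoch: int) -> float:
    """Learning rate after `epoch` scheduler steps under a StepLR-style decay.

    Assumption: one scheduler step per epoch (not confirmed by this notebook).
    """
    return initial_lr * gamma ** (epoch // step_size)


# Values taken from the optimizer and scheduler cell above.
for epoch in (0, 9, 10, 20, 100):
    print(epoch, step_lr(0.001, 10, 0.9, epoch))
```

Under that assumption the rate drops by 10% every 10 epochs, so after 100 epochs it is roughly a third of its starting value (0.9^10 ≈ 0.349).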

## Training

In [7]:
result = train_kfold(
    model=model,
    roundness=data["ExperimentalRoundScore"],
    texts=data["Stimuli"],
    batch_size=5,
    optimizer=optimizer,
    scheduler=scheduler,
    epochs=1000,
    patience=10,
    k=4,
)


Fold 1/4
Epoch    1/1000 | Train Loss: 0.9733 | Val Loss: 0.7455 | Best Val: inf
Epoch    2/1000 | Train Loss: 0.8597 | Val Loss: 0.6643 | Best Val: 0.7455
Epoch    3/1000 | Train Loss: 0.7101 | Val Loss: 0.7195 | Best Val: 0.6643
Epoch    4/1000 | Train Loss: 0.7231 | Val Loss: 0.7014 | Best Val: 0.6643
Epoch    5/1000 | Train Loss: 0.7172 | Val Loss: 0.6583 | Best Val: 0.6643
Epoch    6/1000 | Train Loss: 0.6886 | Val Loss: 0.6519 | Best Val: 0.6583
Epoch    7/1000 | Train Loss: 0.6872 | Val Loss: 0.6614 | Best Val: 0.6519
Epoch    8/1000 | Train Loss: 0.6970 | Val Loss: 0.6643 | Best Val: 0.6519
Epoch    9/1000 | Train Loss: 0.6847 | Val Loss: 0.7220 | Best Val: 0.6519
Epoch   10/1000 | Train Loss: 0.7255 | Val Loss: 0.6556 | Best Val: 0.6519
Epoch   11/1000 | Train Loss: 0.6830 | Val Loss: 0.6645 | Best Val: 0.6519
Epoch   12/1000 | Train Loss: 0.6571 | Val Loss: 0.6606 | Best Val: 0.6519
Epoch   13/1000 | Train Loss: 0.6683 | Val Loss: 0.6521 | Best Val: 0.6519
Epoch   14/1000 | 
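
The patience=10 setting suggests early stopping on validation loss. The sketch below shows one common way to implement that rule, replayed on the Fold 1 validation losses logged above. It is an assumption about how train_kfold behaves, not its actual code, and the function name epochs_without_improvement is hypothetical.

```python
def epochs_without_improvement(val_losses: list[float]) -> tuple[float, int]:
    """Return (best validation loss, epochs elapsed since it was set)."""
    best = float("inf")
    stale = 0
    for loss in val_losses:
        if loss < best:
            best = loss   # new best: reset the counter
            stale = 0
        else:
            stale += 1    # no improvement this epoch
    return best, stale


# Validation losses for epochs 1-13 of Fold 1, copied from the log above.
fold1_val = [0.7455, 0.6643, 0.7195, 0.7014, 0.6583, 0.6519, 0.6614,
             0.6643, 0.7220, 0.6556, 0.6645, 0.6606, 0.6521]

patience = 10
best, stale = epochs_without_improvement(fold1_val)
should_stop = stale >= patience
print(best, stale, should_stop)
```

With these values the best loss is 0.6519 (epoch 6) and seven epochs have passed without improvement, so a patience of 10 would not yet have triggered by epoch 13.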

## Testing

In [8]:
word_list = ["bouba", "kiki"]
model.inference(word_list)

array([0.6258093 , 0.42961788], dtype=float32)

In [9]:
word_list = ["maluma", "takete"]
model.inference(word_list)

array([0.58911127, 0.4978615 ], dtype=float32)

## Saving the model

In [10]:
save_model(
    model=model,
    directory="outputs/",
    filename=f"roundness_determiner_v0{os.getenv('VERSION')}.pth",
)

Model saved to outputs/roundness_determiner_v03.1.pth


# Loading and using the model

## Loading the model

In [11]:
model = load_model(directory="outputs/", filename=f"roundness_determiner_v0{os.getenv('VERSION')}.pth", model_name=model_name)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded from outputs/roundness_determiner_v03.1.pth


In [12]:
word_list = ["bouba", "kiki"]
model.inference(word_list)

array([0.6258093 , 0.42961788], dtype=float32)

In [13]:
word_list = ["maluma", "takete"]
model.inference(word_list)

array([0.58911127, 0.4978615 ], dtype=float32)

## Importing data

In [14]:
# Import data
with open("datasets/words.json") as f:
    data = json.load(f)

# Function to generate a random string from data
def generate_random_string(data, min_len=2, max_len=5):
    length = random.randint(min_len, max_len)
    return ''.join(random.choices(list(data.keys()), k=length))

# Generate 10000 unique strings
unique_strings = set()
while len(unique_strings) < 10000:
    unique_strings.add(generate_random_string(data))

# Convert to DataFrame
data = pd.DataFrame(list(unique_strings), columns=['Pseudoword'])
data

Unnamed: 0,Pseudoword
0,mepako
1,bayo
2,depe
3,nushi
4,poipaau
...,...
9995,hipupasago
9996,poniga
9997,pubo
9998,dadapa
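
The while loop above only terminates if the keys in words.json can produce at least 10,000 distinct strings. The sketch below computes an upper bound on that number, using a small made-up inventory in place of the real keys (which are not shown in this notebook). It is only an upper bound because different key sequences can concatenate to the same string.

```python
def max_sequences(n_keys: int, min_len: int = 2, max_len: int = 5) -> int:
    """Count key sequences of length min_len..max_len drawn with replacement."""
    return sum(n_keys ** length for length in range(min_len, max_len + 1))


# Hypothetical inventory; the real keys come from datasets/words.json.
example_keys = ["ba", "ki", "mo", "pu", "a"]

bound = max_sequences(len(example_keys))
print(bound)            # number of sequences, an upper bound on unique strings
print(bound >= 10_000)  # whether 10,000 unique strings could be reachable
```

With five keys the bound is 3,900, so the loop would never finish. Checking this bound before sampling is a cheap guard against that case.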


## Applying model

In [15]:
data["Roundness"] = model.inference(data["Pseudoword"].to_list())
data

Unnamed: 0,Pseudoword,Roundness
0,mepako,0.529904
1,bayo,0.572885
2,depe,0.505724
3,nushi,0.588515
4,poipaau,0.527272
...,...,...
9995,hipupasago,0.574199
9996,poniga,0.544438
9997,pubo,0.555564
9998,dadapa,0.567358


In [16]:
data.describe()

Unnamed: 0,Roundness
count,10000.0
mean,0.541609
std,0.043833
min,0.388763
25%,0.51001
50%,0.540788
75%,0.571022
max,0.701714
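
Comparing the two describe() outputs in this notebook shows that the predicted scores are much more tightly clustered than the training labels. The short calculation below makes that comparison explicit using only the summary statistics printed above; the variable names are illustrative.

```python
# Summary statistics copied from the two describe() outputs in this notebook.
train_std, train_min, train_max = 0.316366, 0.0, 1.0
pred_std, pred_min, pred_max = 0.043833, 0.388763, 0.701714

# Ratio of spreads: values well below 1 mean the predictions are compressed.
std_ratio = pred_std / train_std
range_ratio = (pred_max - pred_min) / (train_max - train_min)

print(f"std ratio:   {std_ratio:.3f}")
print(f"range ratio: {range_ratio:.3f}")
```

The predictions span about a third of the label range and have roughly one seventh of the label standard deviation, which is worth keeping in mind when the generated pairs are used as training targets.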


## Saving CSV

In [17]:
data.to_csv(f"datasets/japanese_pseudowords_{os.getenv('VERSION')}.csv", index=False)