This notebook loads the data from `datasets\Fortetal2015_dataforOSF.csv` and rescales the roundness scores so that they all lie between 0 and 1.

The `ExperimentalRoundScore` column is the one we are interested in.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv(r'datasets\Fortetal2015_dataforOSF.csv')
data

Unnamed: 0,Study,Stimuli,ExperimentalRoundScore,ModelRoundScore
0,FortExp3,cel_bebi_1_ids.wav,33,5.8294
1,FortExp3,cel_bibe_2_ids.wav,42,12.2039
2,FortExp3,cel_bobou_1_ids.wav,33,25.6795
3,FortExp3,cel_boubo_1_ids.wav,50,29.2659
4,FortExp3,cel_chechi_3_ids.wav,-25,-10.6296
...,...,...,...,...
119,FortExp2,outou_2.wav,-10,-16.3581
120,FortExp2,uku_2.wav,-20,-14.6789
121,FortExp2,ulu_3.wav,42,32.9821
122,FortExp2,umu_3.wav,42,41.1944


Drop the columns that are not needed (`Study` and `ModelRoundScore`).

In [3]:
data = data.drop(columns=['Study', 'ModelRoundScore'])
data

Unnamed: 0,Stimuli,ExperimentalRoundScore
0,cel_bebi_1_ids.wav,33
1,cel_bibe_2_ids.wav,42
2,cel_bobou_1_ids.wav,33
3,cel_boubo_1_ids.wav,50
4,cel_chechi_3_ids.wav,-25
...,...,...
119,outou_2.wav,-10
120,uku_2.wav,-20
121,ulu_3.wav,42
122,umu_3.wav,42
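As an alternative to dropping columns after loading, the selection can be made while the file is parsed. The sketch below is only an illustration: it uses a few rows copied from the table above in place of the real file, and it assumes the first CSV column has an empty header (which pandas would display as `Unnamed: 0`). Unlike the cell above, it also leaves that column out.

```python
import io

import pandas as pd

# A few rows copied from the table above, standing in for the real CSV file.
# Assumption: the first column has an empty header in the file.
sample_csv = io.StringIO(
    ",Study,Stimuli,ExperimentalRoundScore,ModelRoundScore\n"
    "0,FortExp3,cel_bebi_1_ids.wav,33,5.8294\n"
    "1,FortExp3,cel_bibe_2_ids.wav,42,12.2039\n"
    "4,FortExp3,cel_chechi_3_ids.wav,-25,-10.6296\n"
)

# usecols keeps only the named columns while parsing.
subset = pd.read_csv(sample_csv, usecols=["Stimuli", "ExperimentalRoundScore"])
print(subset)
```

Whether to keep the `Unnamed: 0` column depends on whether the later steps rely on it.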


Find the largest and smallest values in the `ExperimentalRoundScore` column; they define the range used for the normalization.

In [4]:
max_value = data['ExperimentalRoundScore'].max()
min_value = data['ExperimentalRoundScore'].min()

print(f"Largest value: {max_value}")
print(f"Smallest value: {min_value}")

Largest value: 50
Smallest value: -42


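A linear (min-max) normalization is used for its simplicity. Mapping the data from the range -42 to 50 onto the range 0 to 1 moves the original zero point (a raw score of 0 becomes about 0.46), so some of the nuance of the data could be lost, but it allows for a more intuitive reading of the roundness. Because the mapping preserves the order of the scores, 0 is the sharpest stimulus and 1 is the roundest.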

In [5]:
def normalize(value):
    # Min-max scaling: min_value maps to 0 and max_value maps to 1.
    result = (value - min_value) / (max_value - min_value)
    return result

In [6]:
data['ExperimentalRoundScore'] = data['ExperimentalRoundScore'].apply(normalize)
data

Unnamed: 0,Stimuli,ExperimentalRoundScore
0,cel_bebi_1_ids.wav,0.815217
1,cel_bibe_2_ids.wav,0.913043
2,cel_bobou_1_ids.wav,0.815217
3,cel_boubo_1_ids.wav,1.000000
4,cel_chechi_3_ids.wav,0.184783
...,...,...
119,outou_2.wav,0.347826
120,uku_2.wav,0.239130
121,ulu_3.wav,0.913043
122,umu_3.wav,0.913043
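The same rescaling can also be written as one vectorized expression on the whole column, without `apply`. This is a minimal sketch, not the notebook's own method: `min_max_scale` is a hypothetical helper name, the sample scores are copied from the rows shown above, and the bounds -42 and 50 are the values printed earlier.

```python
import pandas as pd


def min_max_scale(series, lo, hi):
    """Linearly map values from the range [lo, hi] onto [0, 1]."""
    return (series - lo) / (hi - lo)


# Raw scores copied from the rows displayed above.
scores = pd.Series([33, 42, 50, -25, -10, -20], name="ExperimentalRoundScore")

# Bounds taken from the printed smallest and largest values.
normalized = min_max_scale(scores, lo=-42, hi=50)
print(normalized.round(6).tolist())

# Every rescaled value should fall inside the closed interval [0, 1].
assert normalized.between(0, 1).all()
```

Passing the bounds explicitly, rather than reading them from global variables, makes it easier to reuse the same scaling on another dataset.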


The `Stimuli` column is also processed so that it keeps only the pseudoword from each file name.

In [7]:
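def extract_pseudowords(filename):
    result = filename
    # Some file names carry a 'cel_' prefix; remove it first.
    if result.startswith('cel_'):
        result = result[4:]
    # The pseudoword is everything before the first remaining underscore.
    result = result.split('_')[0]
    return result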

In [8]:
data['Stimuli'] = data['Stimuli'].apply(extract_pseudowords)
data

Unnamed: 0,Stimuli,ExperimentalRoundScore
0,bebi,0.815217
1,bibe,0.913043
2,bobou,0.815217
3,boubo,1.000000
4,chechi,0.184783
...,...,...
119,outou,0.347826
120,uku,0.239130
121,ulu,0.913043
122,umu,0.913043
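The extraction can also be expressed with pandas string methods, which operate on the whole column at once. The sketch below is one possible equivalent, shown on a few file names copied from the table above; it assumes, as the function above does, that the pseudoword is the text before the first underscore once an optional `cel_` prefix is removed.

```python
import pandas as pd

# File names copied from the Stimuli column shown earlier.
stimuli = pd.Series(
    ["cel_bebi_1_ids.wav", "cel_chechi_3_ids.wav", "outou_2.wav", "uku_2.wav"]
)

pseudowords = (
    stimuli.str.replace(r"^cel_", "", regex=True)  # strip the optional prefix
    .str.split("_")                                # split on underscores
    .str[0]                                        # keep the first piece
)
print(pseudowords.tolist())  # ['bebi', 'chechi', 'outou', 'uku']
```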


Save the cleaned data for later use.

In [9]:
data.to_csv(r'datasets\normalized.csv', index=False)
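A quick way to confirm that the saved file is usable is to read it back and check the columns and the value range. The sketch below is illustrative only: it writes a small stand-in frame (values copied from the output above) to a temporary directory instead of `datasets\normalized.csv`, and it builds the path with `pathlib` so that it does not depend on the Windows path separator.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for the cleaned data, using values from the output above.
cleaned = pd.DataFrame(
    {
        "Stimuli": ["bebi", "boubo", "chechi"],
        "ExperimentalRoundScore": [0.815217, 1.0, 0.184783],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "normalized.csv"
    cleaned.to_csv(out_path, index=False)  # index=False omits the row index
    reloaded = pd.read_csv(out_path)

# The reloaded file should have the same columns and only values in [0, 1].
assert list(reloaded.columns) == ["Stimuli", "ExperimentalRoundScore"]
assert reloaded["ExperimentalRoundScore"].between(0, 1).all()
```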