Our second hypothesis tests the effect of the degree of misunderstanding on the magnitude of effort.

We operationalize the degree of misunderstanding as the conceptual similarity between the target concept and the answer offered by a guesser. To obtain a reproducible measure of conceptual similarity, we use the ConceptNet Numberbatch embeddings (REF). In addition, in an anonymous online rating study, we collected data from XX people (XX English, XX Dutch) who were asked to rate the similarity between each pair of words. We then compare this 'perceived similarity' with the cosine similarity computed from the ConceptNet embeddings, to validate their use as a measure of conceptual similarity.
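
As a rough sketch of this validation step, the comparison could be run as a rank correlation between the human ratings and the cosine similarities. The column names (`rating`, `cosine_similarity`) and the toy values below are hypothetical placeholders, not data from the study. Spearman's rho is computed here as a Pearson correlation on ranks, which is one possible choice and not necessarily the final analysis.

```python
import pandas as pd

# Hypothetical toy data: one row per target-answer pair.
# 'rating' stands in for the mean perceived similarity from the rating study,
# 'cosine_similarity' for the value computed from the embeddings.
toy = pd.DataFrame({
    'rating': [1.0, 2.5, 4.0, 6.5, 7.0],
    'cosine_similarity': [0.05, 0.47, 0.17, 0.60, 0.87],
})

# Spearman's rho = Pearson correlation of the ranks
rho = toy['rating'].rank().corr(toy['cosine_similarity'].rank())
print(f"Spearman rho on toy data: {rho:.2f}")
```

A rank-based measure is used in this sketch because rating scales and cosine similarity need not be linearly related.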


In [2]:
#| code-fold: true
#| code-summary: "Code to load packages and prepare environment"

import numpy as np
import os
import glob
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

curfolder = os.getcwd()
# build the path portably; the trailing separator lets us append file names directly
datafolder = os.path.join(curfolder, 'dataset', '')

# load df_all from datafolder
df_all = pd.read_csv(datafolder + 'all_data.csv', index_col=0)
df_all.head(15)

practice,cycle,word,modality,answer,exp,sessionID
practice,0.0,rijk,combinatie,geld,1,11_1
practice,0.0,leeg,combinatie,honger,1,11_1
none,0.0,hond,combinatie,hond,1,11_1
none,0.0,hoorn,combinatie,geluid,1,11_1
none,0.0,geur,combinatie,ruiken,1,11_1
none,0.0,oud,combinatie,bejaarde,1,11_1
none,0.0,kruipen,combinatie,baby,1,11_1
none,0.0,drinken,combinatie,drinken,1,11_1
none,0.0,water,combinatie,water,1,11_1
practice,1.0,knippen,combinatie,knippen,1,11_1


First, we need to do some data wrangling to get everything into the right format for the embedding extraction and comparison.


In [82]:
# concept list
df_concepts = pd.read_excel(datafolder + 'conceptlist_info.xlsx')

# in df_concepts, keep only English and Dutch
df_concepts = df_concepts[['English', 'Dutch']]

# rename Dutch to word
df_concepts = df_concepts.rename(columns={'Dutch': 'word'})

# merge df and df_concepts on word
df = pd.merge(df_all, df_concepts, on='word', how='left')

# show rows where English is NaN
df[df['English'].isnull()]

# add translations manually for each (these are practice trials)
manual_translations = {
    'bloem': 'flower',
    'dansen': 'to dance',
    'auto': 'car',
    'olifant': 'elephant',
    'comfortabel': 'comfortable',
    'bal': 'ball',
    'haasten': 'to hurry',
    'gek': 'crazy',
    'snijden': 'to cut',
    'koken': 'to cook',
    'juichen': 'to cheer',
    'zingen': 'to sing',
    'glimlach': 'smile',
    'klok': 'clock',
    'fiets': 'bicycle',
    'vliegtuig': 'airplane',
    'geheim': 'secret',
    'telefoon': 'telephone',
    'zwaaien': 'to wave',
    'sneeuw': 'snow',
    'rijk': 'rich',
    'leeg': 'empty',
    'hond': 'dog',
    'knippen': 'to cut',
    'eend': 'duck',
}
for dutch_word, english_word in manual_translations.items():
    df.loc[df['word'] == dutch_word, 'English'] = english_word

# make a list of English answers
#answers_en = ['party', 'to cheer', 'tasty', 'to shoot', 'to breathe', 'zombie', 'bee', 'sea', 'dirty', 'tasty', 'car', 'to eat', 'to eat', 'to blow', 'hose', 'hose', 'to annoy', 'to make noise', 'to make noise', 'to run away', 'elephant', 'to cry', 'cold', 'outfit', 'silence', 'to ski', 'wrong', 'to play basketball', 'to search', 'disturbed', 'to run', 'to lick', 'to lift', 'lightning', 'to think', 'to jump', 'to fall', 'to write', 'to dance', 'shoulder height', 'horn', 'dirty', 'boring', 'to drink', 'strong', 'elderly', 'to mix', 'fish', 'fish', 'dirty', 'wrong', 'smart', 'to box', 'to box', 'dog', 'to catch', 'to cheer', 'to sing', 'pregnant', 'hair', 'to shower', 'pain', 'burnt', 'hot', 'I', 'to chew', 'bird', 'airplane', 'to fly', 'to think', 'to choose', 'to doubt', 'graffiti', 'fireworks', 'bomb', 'to smile', 'to laugh', 'smile', 'clock', 'to wonder', 'height', 'big', 'height', 'space', 'to misjudge', 'to wait', 'satisfied', 'happy', 'fish', 'to smell', 'wind', 'pain', 'to burn', 'hot', 'to cycle', 'to fly', 'airplane', 'bird', 'to crawl', 'to drink', 'waterfall', 'water', 'fire', 'top', 'good', 'to hear', 'to point', 'distance', 'there', 'to whisper', 'quiet', 'to be silent', 'telephone', 'to blow', 'to distribute', 'to give', 'cat', 'to laugh', 'tasty', 'to eat', 'yummy', 'to sleep', 'mountain', 'dirty', 'to vomit', 'to be disgusted', 'to greet', 'hello', 'goodbye', 'to smell', 'nose', 'odor', 'to fly', 'fireworks', 'to blow', 'to cut', 'pain', 'hot', 'to slurp', 'to throw', 'to fall', 'to fall', 'whistle', 'heartbeat', 'mouse', 'to hit', 'to catch', 'to grab', 'to throw', 'to fall', 'to shoot', 'circus', 'trunk', 'to fall', 'to fight', 'pain', 'to push open', 'to growl', 'to cut', 'to eat', 'knife', 'to slurp', 'to drink', 'drink', 'to eat', 'delicious', 'tasty', 'to cough', 'sick', 'to cry', 'to cry']

# get rid of English 'to beat'
df = df[df['English'] != 'to beat']
# and to weep
df = df[df['English'] != 'to weep']

# add those to df as answers_en
#df['answer_en'] = answers_en

# keep only rows where word is not NaN
df = df[df['word'].notnull()]

# make a list of English targets
#meanings_en = list(df['English'])
df.head(15)

Unnamed: 0,cycle,word,modality,answer,exp,sessionID,English
0,0.0,rijk,combinatie,geld,1,11_1,rich
1,0.0,leeg,combinatie,honger,1,11_1,empty
2,0.0,hond,combinatie,hond,1,11_1,dog
3,0.0,hoorn,combinatie,geluid,1,11_1,horn
4,0.0,geur,combinatie,ruiken,1,11_1,odor
5,0.0,oud,combinatie,bejaarde,1,11_1,old
6,0.0,kruipen,combinatie,baby,1,11_1,to crawl
7,0.0,drinken,combinatie,drinken,1,11_1,to drink
8,0.0,water,combinatie,water,1,11_1,water
9,1.0,knippen,combinatie,knippen,1,11_1,to cut
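
One way to check that the merge behaved as intended is to let pandas report which rows found no translation. The sketch below uses two small hypothetical frames (`trials`, `concepts`) in place of `df_all` and `df_concepts`; `indicator=True` and `validate='many_to_one'` are standard `pd.merge` options.

```python
import pandas as pd

# Hypothetical stand-ins for df_all and df_concepts
trials = pd.DataFrame({'word': ['hond', 'water', 'bloem'],
                       'answer': ['hond', 'water', 'roos']})
concepts = pd.DataFrame({'word': ['hond', 'water'],
                         'English': ['dog', 'water']})

# validate='many_to_one' raises if a Dutch word occurs twice in the concept list;
# indicator=True adds a '_merge' column telling where each row was matched
merged = pd.merge(trials, concepts, on='word', how='left',
                  indicator=True, validate='many_to_one')

# rows present only in the trial data, i.e. without a translation
untranslated = merged.loc[merged['_merge'] == 'left_only', 'word'].tolist()
print(untranslated)
```

The list of untranslated words is then exactly the set that needs a manual translation.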


We need to manually repair some misspelled answers (typos, etc.).

In [83]:
# correct typos in the answers (plain substring replacement, no regex)
df['answer'] = df['answer'].str.replace('langaam', 'langzaam', regex=False)
df['answer'] = df['answer'].str.replace('comfortable', 'comfortabel', regex=False)
df['answer'] = df['answer'].str.replace('neurien', 'neuriën', regex=False)
df['answer'] = df['answer'].str.replace('neurieen', 'neuriën', regex=False)
df['answer'] = df['answer'].str.replace('verdietig', 'verdrietig', regex=False)
df['answer'] = df['answer'].str.replace('skien', 'skiën', regex=False)
df['answer'] = df['answer'].str.replace('skieen', 'skiën', regex=False)
df['answer'] = df['answer'].str.replace('geirriteerd', 'geïrriteerd', regex=False)
df['answer'] = df['answer'].str.replace('vliegtug', 'vliegtuig', regex=False)
df['answer'] = df['answer'].str.replace('shift', '', regex=False)
df['answer'] = df['answer'].str.replace('svhieten', 'schieten', regex=False)
df['answer'] = df['answer'].str.replace('scrheeuwen', 'schreeuwen', regex=False)
df['answer'] = df['answer'].str.replace('neerkomem', 'neerkomen', regex=False)
df['answer'] = df['answer'].str.replace('watet', 'water', regex=False)
df['answer'] = df['answer'].str.replace('mastuberen', 'masturberen', regex=False)
df['answer'] = df['answer'].str.replace('shrikken', 'schrikken', regex=False)
df['answer'] = df['answer'].str.replace('grafiti', 'graffiti', regex=False)
df['answer'] = df['answer'].str.replace('vliegtuid', 'vliegtuig', regex=False)
df['answer'] = df['answer'].str.replace('grinikken', 'grinniken', regex=False)
df['answer'] = df['answer'].str.replace('nurien', 'neuriën', regex=False)
df['answer'] = df['answer'].str.replace('optijd', 'op tijd', regex=False)
df['answer'] = df['answer'].str.replace('ontwetend', 'onwetend', regex=False)
df['answer'] = df['answer'].str.replace('verluisteren', 'fluisteren', regex=False)
df['answer'] = df['answer'].str.replace('luchtballin', 'luchtballon', regex=False)
df['answer'] = df['answer'].str.replace('omhooh', 'omhoog', regex=False)
df['answer'] = df['answer'].str.replace('rodellen', 'roddelen', regex=False)
df['answer'] = df['answer'].str.replace('snappem', 'snappen', regex=False)
df['answer'] = df['answer'].str.replace('indrukwekkebd', 'indrukwekkend', regex=False)
df['answer'] = df['answer'].str.replace('zwaairn', 'zwaaien', regex=False)
df['answer'] = df['answer'].str.replace('heigen', 'hijgen', regex=False)
df['answer'] = df['answer'].str.replace('gestressd', 'gestrest', regex=False)
df['answer'] = df['answer'].str.replace('kouwen', 'kauwen', regex=False)
df['answer'] = df['answer'].str.replace('shouders', 'schouders', regex=False)
df['answer'] = df['answer'].str.replace('ballom', 'ballon', regex=False)
df['answer'] = df['answer'].str.replace('autocoereur', 'autocoureur', regex=False)
df['answer'] = df['answer'].str.replace('lachrn', 'lachen', regex=False)
df['answer'] = df['answer'].str.replace('fitesen', 'fietsen', regex=False)
df['answer'] = df['answer'].str.replace('scieten', 'schieten', regex=False)
df['answer'] = df['answer'].str.replace('stamoen', 'stamperen', regex=False)
df['answer'] = df['answer'].str.replace('blixem', 'bliksem', regex=False)
df['answer'] = df['answer'].str.replace('proefen', 'proeven', regex=False)
df['answer'] = df['answer'].str.replace('blokfuit', 'blokfluit', regex=False)
df['answer'] = df['answer'].str.replace('verdrietig ', 'verdrietig', regex=False)
df['answer'] = df['answer'].str.replace('galloperen', 'galopperen', regex=False)
df['answer'] = df['answer'].str.replace('leegl', 'leeg', regex=False)
df['answer'] = df['answer'].str.replace('kinker', 'klinker', regex=False)
df['answer'] = df['answer'].str.replace('gehiem', 'geheim', regex=False)
df['answer'] = df['answer'].str.replace('voge', 'vogel', regex=False)
df['answer'] = df['answer'].str.replace('vogell', 'vogel', regex=False)
df['answer'] = df['answer'].str.replace('grinnikken', 'grinniken', regex=False)
df['answer'] = df['answer'].str.replace('drinken ', 'drinken', regex=False)
df['answer'] = df['answer'].str.replace('gieberen', 'gibberen', regex=False)
df['answer'] = df['answer'].str.replace('juichenl', 'juichen', regex=False)
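
Because `str.replace` substitutes substrings, a short pattern such as 'voge' also matches inside the correctly spelled 'vogel', which is why a second rule is needed to undo 'vogell'. A possible alternative, sketched here on hypothetical toy answers, is to strip whitespace first and then map whole answers through a dictionary with `Series.replace`, which only matches complete values. The names `answers`, `corrections` and `cleaned` are illustrative only.

```python
import pandas as pd

# Hypothetical toy answers, including a correct 'vogel' and a trailing space
answers = pd.Series(['voge', 'vogel', 'drinken ', 'watet'])

# whole-answer corrections: a key must equal the entire (stripped) answer
corrections = {'voge': 'vogel', 'watet': 'water'}

cleaned = answers.str.strip().replace(corrections)
print(cleaned.tolist())
```

The trade-off is that a typo inside a multi-word answer is no longer caught, so such answers would need their own dictionary entries.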

In [84]:
# Dutch targets
meanings_nl = list(df['word'])
# Dutch answers
answers_nl = list(df['answer'])


In [74]:

# show rows where English is NaN
df[df['English'].isnull()]

Unnamed: 0,cycle,word,modality,answer,exp,sessionID,English


Now we load ConceptNet Numberbatch (version XX) and compute the cosine similarity for each target-answer pair.


In [20]:
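# Load embeddings from a text file (one term per line, followed by its vector)
def load_embeddings(file_path):
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            # skip empty lines and a possible header line (number of terms and dimensions)
            if len(values) <= 2:
                continue
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings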

# Cosine similarity
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)
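
A quick sanity check of the similarity measure on toy vectors (hypothetical values, not real embeddings): vectors pointing in the same direction should give 1, orthogonal vectors 0, and opposite vectors -1. The function is repeated so that the snippet runs on its own.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    # same definition as above: dot product divided by the product of the norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

a = np.array([1.0, 2.0, 0.0])

same_direction = cosine_similarity(a, 3 * a)  # scaling does not change the angle
orthogonal = cosine_similarity(a, np.array([0.0, 0.0, 5.0]))
opposite = cosine_similarity(a, -a)
print(same_direction, orthogonal, opposite)
```

Note that the function divides by the norms, so an all-zero vector would produce a division by zero and would need to be handled separately.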

We will use the multilingual Numberbatch to look up words in the original language of the experiment, Dutch. While English is better represented in ConceptNet, the English Numberbatch does not distinguish between nouns and verbs (so 'a drink' and 'to drink' share a single representation, 'drink'). Because this distinction is important for us, we opt for the Dutch embeddings to avoid this problem.


In [16]:
# load embeddings
embeddings = load_embeddings(os.path.join('numberbatch', 'numberbatch.txt')) # downloaded from https://github.com/commonsense/conceptnet-numberbatch?tab=readme-ov-file
#embeddings_en = load_embeddings('numberbatch-en.txt') # downloaded from https://github.com/commonsense/conceptnet-numberbatch?tab=readme-ov-file

# this is how words are represented
vec_nl = embeddings.get('/c/nl/skiën')
print(vec_nl)

[ 3.410e-02 -4.640e-02  5.490e-02  1.544e-01  1.800e-02 -5.050e-02
 -6.660e-02 -2.300e-02  5.320e-02  1.104e-01  2.770e-02  5.040e-02
 -2.010e-02  5.900e-03 -1.133e-01 -9.370e-02 -7.890e-02  3.540e-02
  3.780e-02  8.400e-02 -3.880e-02  7.680e-02 -8.010e-02  6.540e-02
 -1.493e-01 -1.036e-01  8.490e-02  1.040e-02 -6.890e-02  6.890e-02
  1.226e-01 -1.850e-02  1.520e-02  2.810e-02 -5.660e-02 -2.670e-02
 -5.700e-02 -4.480e-02  1.924e-01  5.800e-02 -7.800e-02 -7.700e-03
  1.132e-01  6.350e-02 -4.310e-02  1.900e-03 -4.820e-02  1.047e-01
  6.900e-02  7.150e-02  1.660e-02  2.730e-02  4.340e-02  1.130e-02
 -1.427e-01 -9.200e-03 -8.000e-04  2.310e-02  1.234e-01 -1.452e-01
 -1.710e-02 -1.094e-01 -1.518e-01  4.820e-02  1.400e-02 -1.460e-02
  1.023e-01  5.220e-02  1.362e-01  3.190e-02 -2.590e-02  1.220e-01
  1.750e-02  8.810e-02 -9.200e-02 -1.226e-01 -5.560e-02 -6.600e-03
  3.180e-02 -1.113e-01  6.130e-02 -1.202e-01 -2.480e-02 -8.300e-03
 -1.710e-02  3.410e-02  1.550e-02 -8.000e-02 -6.390e-02  1.170

Now we take the list of target-answer pairs, look up the embedding of each word, and compute the cosine similarity for each pair.

Some answers will probably not be represented in Numberbatch (e.g., answers consisting of more than one word), so we will need to decide how to handle these.
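
One possible way to handle multi-word answers, sketched under assumptions: ConceptNet URIs join the words of a multi-word term with underscores, so a lookup of '/c/nl/ver_weg' can be tried first; if that fails, the vectors of the individual words can be averaged. The helper name `lookup_vector`, the lower-casing, the averaging fallback, and the toy two-dimensional embeddings below are all hypothetical and are not part of the pipeline used here.

```python
import numpy as np

def lookup_vector(term, embeddings, lang='nl'):
    """Return a vector for `term`, or None if no part of it is in `embeddings`."""
    prefix = f'/c/{lang}/'
    # 1) try the whole term, with spaces replaced by underscores
    key = prefix + term.strip().lower().replace(' ', '_')
    if key in embeddings:
        return embeddings[key]
    # 2) fallback (assumption): average the vectors of the words that are found
    parts = [embeddings[prefix + w] for w in term.lower().split()
             if prefix + w in embeddings]
    if parts:
        return np.mean(parts, axis=0)
    return None

# toy embeddings, for illustration only
toy_embeddings = {
    '/c/nl/ver': np.array([1.0, 0.0]),
    '/c/nl/weg': np.array([0.0, 1.0]),
    '/c/nl/geen_idee': np.array([0.2, 0.8]),
}

whole_term = lookup_vector('geen idee', toy_embeddings)  # found as a whole term
averaged = lookup_vector('ver weg', toy_embeddings)      # averaged from 'ver' and 'weg'
missing = lookup_vector('huh', toy_embeddings)           # not found at all
print(whole_term, averaged, missing)
```

Averaging is a crude approximation of phrase meaning, so pairs resolved through the fallback would ideally be flagged and inspected separately.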


In [85]:
# get the embeddings for the Dutch targets (meanings_nl)
word_embeddings_t = {}
for word in meanings_nl:
    word_embed = '/c/nl/' + str(word)
    if word_embed in embeddings:
        word_embeddings_t[word] = embeddings[word_embed]

# get the embeddings for the Dutch answers (answers_nl)
word_embeddings_ans = {}
for word in answers_nl:
    word_embed = '/c/nl/' + str(word)
    if word_embed in embeddings:
        word_embeddings_ans[word] = embeddings[word_embed]

# compute the similarity pairwise: first target with first answer, second target with second answer, etc.
cosine_similarities = []

for word1, word2 in zip(meanings_nl, answers_nl):
    vec1 = word_embeddings_t.get(word1)
    vec2 = word_embeddings_ans.get(word2)
    if vec1 is not None and vec2 is not None:
        cosine_sim = cosine_similarity(vec1, vec2)
        cosine_similarities.append(cosine_sim)
    else:
        # print which concepts could not be found
        if vec1 is None:
            print(f"Concept not found: {word1}")
        if vec2 is None:
            print(f"Concept not found: {word2}")
        cosine_similarities.append(None)

df['cosine_similarity'] = cosine_similarities
df.to_csv(datafolder + 'df_with_similarity.csv', index=False)
df.head(15)

Concept not found: highfive
Concept not found: sniffen
Concept not found: bergwandeling
Concept not found: ver weg
Concept not found: sprinkelen
Concept not found: buiten adem
Concept not found: ik weet het niet
Concept not found: wakker worden
Concept not found: ringtoon
Concept not found: moedergans
Concept not found: fohnen
Concept not found: vies eten
Concept not found: openhaart
Concept not found: uitgleiden
Concept not found: verweg
Concept not found: kuikelen
Concept not found: highfive
Concept not found: kukkelen
Concept not found: huh
Concept not found: ssst
Concept not found: startsignaal
Concept not found: traplopen
Concept not found: iets pakken
Concept not found: geen idee
Concept not found: zachtjes lopen
Concept not found: wc rol
Concept not found: staartbackslash
Concept not found: silte vragen
Concept not found: regendrank
Concept not found: koud hebben
Concept not found: slowmotion
Concept not found: wattenstaaf
Concept not found: bubbelen
Concept not found: slowmotio

Unnamed: 0,cycle,word,modality,answer,exp,sessionID,English,cosine_similarity
0,0.0,rijk,combinatie,geld,1,11_1,rich,0.170759
1,0.0,leeg,combinatie,honger,1,11_1,empty,0.174591
2,0.0,hond,combinatie,hond,1,11_1,dog,1.0
3,0.0,hoorn,combinatie,geluid,1,11_1,horn,0.473597
4,0.0,geur,combinatie,ruiken,1,11_1,odor,0.866592
5,0.0,oud,combinatie,bejaarde,1,11_1,old,0.597854
6,0.0,kruipen,combinatie,baby,1,11_1,to crawl,0.04666
7,0.0,drinken,combinatie,drinken,1,11_1,to drink,1.0
8,0.0,water,combinatie,water,1,11_1,water,1.0
9,1.0,knippen,combinatie,knippen,1,11_1,to cut,1.0


In [86]:
# show rows where cosine similarity is NaN
problems = df[df['cosine_similarity'].isnull()]
problems

Unnamed: 0,cycle,word,modality,answer,exp,sessionID,English,cosine_similarity
16,1.0,slaan,combinatie,highfive,1,11_1,to hit,
72,0.0,verdrietig,geluiden,sniffen,2,11_2,sad,
124,1.0,berg,combinatie,bergwandeling,2,11_2,mountain,
162,1.0,ver,gebaren,ver weg,2,11_2,far,
164,1.0,vlieg,gebaren,sprinkelen,2,11_2,fly,
...,...,...,...,...,...,...,...,...
9835,1.0,goed,gebaren,duim omhoog,1,8_1,good,
9836,1.0,bliksem,gebaren,appels plukken,1,8_1,lightning,
9874,1.0,oud,geluiden,nijdi,1,8_1,old,
9905,1.0,niet,combinatie,niet goed,2,8_2,not,


In [87]:
# save problems now
problems.to_csv(datafolder + 'problems.csv', index=False)

Now we also add a binary indicator (1/0) of whether the guess was correct.

In [88]:
# if answer == word, col guess_binary is 1, else 0
df['guess_binary'] = (df['word'] == df['answer']).astype(int)
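
Since exact string equality treats 'drinken ' and 'drinken' as different, it may be worth normalising both columns before comparing. A minimal sketch on hypothetical toy rows follows; the helper `normalise` is illustrative, and whether case and surrounding whitespace should be ignored is an assumption to check against the data.

```python
import pandas as pd

# Hypothetical toy rows: the first answer differs only in case and whitespace
toy = pd.DataFrame({'word': ['drinken', 'hond', 'rijk'],
                    'answer': ['Drinken ', 'hond', 'geld']})

def normalise(series):
    # lower-case and strip surrounding whitespace
    return series.str.strip().str.lower()

toy['guess_binary'] = (normalise(toy['word']) == normalise(toy['answer'])).astype(int)
print(toy['guess_binary'].tolist())
```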

In [90]:
# save
df.to_csv(datafolder + 'df_final.csv', index=False)