# Analysis of the effect of psycho linguistic variables on N400 effect size

An important property of our method is that it only takes into account relative changes in N400 amplitude as the same target word is presented in combination with different prime words, i.e. the N400 effect.
Any confounding effects that cause the N400 amplitude to be generally larger or smaller, irregardless of the prime word, are completely ignored.
In order for a variable such as word frequency to be a confounding effect, it needs to have an impact on the N400 effect size.
That is, it must chance the way the N400 amplitude changes when the word is paired with different prime words.

For example, the word “neushoorn” (rhinoceros) may be more difficult for the brain to process than “bed”, due to differences in word frequency and length.
In the event where a word is difficult to process, the brain has potentially more to gain by being primed by the preceding word, than when the word was easy to process in the first place.

In the following analysis, we use as a measure of N400 effect size, the variance of the N400 amplitude of a target word, computed across all word-pairs in which it was the second word.
We compare this with various psycho-linguistic variables that may affect the N400 effect size.

In [1]:
# Module for loading and manipulating tabular data
import pandas as pd

# Bring in a bridge to R for statistics
import rpy2
%load_ext rpy2.ipython.rmagic

# The R code at the bottom produces some harmless warnings that clutter up the page.
# This disables printing of the warnings. When modifying this notebook, you may want to turn
# this back on.
import warnings
warnings.filterwarnings('ignore')

# For pretty display of tables
from IPython.display import display

The following variables are available to us:

|Variable       | Description |
|:--------------|:------------|
|`length`       | The number of characters of the word |
|`log_freq`     | The logarithm of the frequency of occurrence of the word in a movie subtitle corpus [1] [2] |
|`AoA`          | Estimated age of acquisition [3] [2] |
|`rt`           | The mean reaction time of participants performing a lexical descision task on the word [4] [5] |

[1] Keuleers, E., Brysbaert, M., & New, B. (2010). SUBTLEX-NL: a new measure for Dutch word frequency based on film subtitles. *Behavior Research Methods*, 42(3), 643–650. http://doi.org/10.3758/BRM.42.3.643

[2] Rijn, V., Moor, D., French, I., Ferrand, L., Bonin, P., Méot, A., … Brysbaert, M. (2008). Age-of-acquisition and subjective frequency estimates for all generally known monosyllabic French words and their relation with other psycholinguistic variables. Behavior Research Methods, 40(4), 1049–1054. http://doi.org/10.3758/BRM.40.4.1049

[3] Brysbaert, M., Stevens, M., De Deyne, S., Voorspoels, W., & Storms, G. (2014). Norms of age of acquisition and concreteness for 30,000 Dutch words. *Acta Psychologica*, 150, 80–84. http://doi.org/10.1016/j.actpsy.2014.04.010

[4] Keuleers, E., Diependaele, K., & Brysbaert, M. (2010). Practice effects in large-scale visual word recognition studies: A lexical decision study on 14,000 dutch mono-and disyllabic words and nonwords. *Frontiers in Psychology*, 1(174). http://doi.org/10.3389/fpsyg.2010.00174

[5] Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Méot, A., … Pallier, C. (2010). The French Lexicon Project: Lexical decision data for 38,840 French words and 38,840 pseudowords. Behavior Research Methods, 42(2), 488–496. http://doi.org/10.3758/BRM.42.2.488


In [2]:
# Load the psycho-linguistic variables for our vocabulary
psych_ling = pd.read_csv('psycho_linguistic_variables.csv', index_col=0)
display(psych_ling)

Unnamed: 0,english,length,log_freq,AoA,rt
bed,bed,3.0,4.0209,3.7625,572.72
bureau,desk,6.0,3.4666,6.555556,550.97
deur,door,4.0,4.0343,4.444907,507.49
giraf,giraffe,5.0,1.5051,5.91142,605.22
kast,closet,4.0,3.1189,4.770833,515.55
leeuw,lion,5.0,2.8089,5.160544,506.71
neushoorn,rhinoceros,9.0,2.0414,6.811111,618.59
nijlpaard,hippopotamus,9.0,1.8692,6.547059,641.46
olifant,elephant,7.0,2.721,5.075,
stoel,chair,5.0,3.3502,3.947024,557.67


In the above table, you can see that not all variables are available for all words. There are some missing values, marked as `NaN`.
Next, we load in the data recorded during our experiment and compute the variance of the estimated N400 amplitude for each word.

In [3]:
# Load the N400 amplitudes recorded during our experiment
relevant_columns = ['subject', 'association', 'language', 'N400']
n400 = pd.read_csv('data.csv', usecols=relevant_columns)

# Compute N400 variance for each association word, separately for each subject
groups = n400.groupby(['subject', 'association'])
n400_var = groups.agg(dict(language='first', N400='var'))
n400_var = n400_var.reset_index()

# Annotate the data with the psycho-linguistic variables
n400_var = n400_var.join(psych_ling, on='association')

# Show the first 10 rows of the result
display(n400_var.head(10))

Unnamed: 0,subject,association,N400,language,english,length,log_freq,AoA,rt
0,subject01,bed,1.883255,NL,bed,3.0,4.0209,3.7625,572.72
1,subject01,bureau,0.634921,NL,desk,6.0,3.4666,6.555556,550.97
1,subject01,bureau,0.634921,NL,desk,6.0,2.195014,,606.727273
2,subject01,deur,1.432526,NL,door,4.0,4.0343,4.444907,507.49
3,subject01,giraf,0.506806,NL,giraffe,5.0,1.5051,5.91142,605.22
4,subject01,kast,1.396972,NL,closet,4.0,3.1189,4.770833,515.55
5,subject01,leeuw,1.179084,NL,lion,5.0,2.8089,5.160544,506.71
6,subject01,neushoorn,0.614253,NL,rhinoceros,9.0,2.0414,6.811111,618.59
7,subject01,nijlpaard,0.743091,NL,hippopotamus,9.0,1.8692,6.547059,641.46
8,subject01,olifant,1.129631,NL,elephant,7.0,2.721,5.075,


We now proceed to perform statistical analysis of the impact of the psycho-linguistic variables on the N400 effect size. The subjects are of course modeled as a random effect. Additionally, since we are dealing with two languages (Dutch / French) the variables are sourced from different norm studies. For example, word frequencies for Dutch words are estimated from a different corpus than the frequencies for French words. This is why we also include language as a random effect, which will cause the model to fit seperate intercepts and slopes for both languages.

An exception to this is the `length` variable, which does not need to be estimated from norm studies. We can restrict ourselves to a single intercept and slope, regardless of the language, and thus obtain a bit more statistical power.

In [4]:
%%R -i n400_var

# Load the linear mixed effects library
library('lme4')
library('lmerTest')

# When fitting length, language does not need to be modeled as a random effect
m <- lmer(N400 ~ length + (length | subject) + (length | language), data=n400_var)
length <- summary(m)$coefficients["length", c("Estimate", "t value", "Pr(>|t|)")]

m <- lmer(N400 ~ log_freq + (log_freq | subject) + (log_freq | language), data=n400_var)
log_freq <- summary(m)$coefficients["log_freq", c("Estimate", "t value", "Pr(>|t|)")]

m <- lmer(N400 ~ AoA + (AoA | subject) + (AoA | language), data=n400_var)
AoA <- summary(m)$coefficients["AoA", c("Estimate", "t value", "Pr(>|t|)")]

# For this one, estimation of the degrees of freedom fails, so no p-value is available
m <- lmer(N400 ~ rt + (rt | subject) + (rt | language), data=n400_var)
rt <- summary(m)$coefficients["rt", c("Estimate", "t value")]
rt$"Pr(>|t|)" <- NaN  # Mark the missing p-value as NaN

# Combine the results in a data frame
stats <- rbind(length, log_freq, AoA, rt)

print(stats)

         Estimate      t value      Pr(>|t|) 
length   -0.01835187   -1.046713    0.3121253
log_freq 0.01843698    0.65717      0.5117066
AoA      0.001184847   0.03968944   0.9683833
rt       -0.0002666569 -0.001541579 NaN      


The estimated effect sizes of all phycho-linguistic variables on the N400 effect are very small and none reach significance.