Scanpath – An R Package for Analyzing Scanpaths

DOI: 10.5281/zenodo.31800

An R package with tools for analyzing scanpaths. Consult von der Malsburg & Vasishth (2011) for the details of this analysis method. The manual of the package can be found here. The package includes a simple toy dataset and example code.

Install

library("devtools")
install_github("tmalsburg/scanpath/scanpath", dependencies=TRUE)

On Linux this should work out of the box. On OSX, it may be necessary to install the developer tools. I have no idea what needs to be done to make this work on Windows. It may work, but I haven’t tried it. Feedback welcome.

Usage example

You can find all code shown below in the file README.R. You can open that file in RStudio and play with the code as you read through this mini-tutorial.

Let’s have a look at the toy data that is included in the package:

library(scanpath)
data(eyemovements)
head(eyemovements)
| subject | trial | word |      x |      y | duration |
|---------+-------+------+--------+--------+----------|
| Anne    |     1 |    1 |  48.82 | 385.01 |       97 |
| Anne    |     1 |    1 |  43.47 | 383.14 |      222 |
| Anne    |     1 |    3 | 131.40 | 387.77 |      147 |
| Anne    |     1 |    2 | 106.22 | 385.94 |       88 |
| Anne    |     1 |    3 | 165.26 | 386.75 |      156 |
| Anne    |     1 |    4 | 167.25 | 389.51 |       34 |

To get a sense of what is going on in this data set, we create a series of plots using the function plot_scanpaths from the scanpath package. In the first plot below, each panel shows the data from one trial, and the three participants are coded by color. The data is from a sentence reading task; the x-axis shows words and the y-axis time within trial in milliseconds.

plot_scanpaths(duration ~ word | trial, eyemovements, subject)

Plots/scanpaths.png

We can see that the participants differ in their reading speed. We also see that each participant read the sentence once more or less straight from left to right (trials 1, 4, 7), once with a short regression from the end of the sentence back to its beginning (trials 2, 5, 8), and once with a long regression from the end of the sentence (trials 3, 6, 9).

In the next plot, we use the fixations’ x- and y-coordinates. Each circle is a fixation and the size of the circle represents the duration of the corresponding fixation.

plot_scanpaths(duration ~ x + y | trial, eyemovements, subject)

Plots/scanpaths2.png

The function plot_scanpaths returns a ggplot2 object. This means that we can style the plot before rendering it. For example, we can change the limits of the axes:

library(ggplot2)
p <- plot_scanpaths(duration ~ x + y | trial, eyemovements, subject)
p + xlim(0, 600) + ylim(284, 484)

Plots/scanpaths3.png

Now, we calculate the pair-wise similarities of the nine scanpaths in the dataset using the scasim measure. A simplifying intuition is that the measure quantifies the time that was spent looking at different things or at the same things in different order. For a precise definition see von der Malsburg & Vasishth (2011).

d1 <- scasim(eyemovements, duration ~ x + y | trial, 512, 384, 60, 1/30, normalize=FALSE)
round(d1, 2)
|   |     1 |     2 |     3 |     4 |     5 |     6 |     7 |     8 |     9 |
|---+-------+-------+-------+-------+-------+-------+-------+-------+-------|
| 1 |  0    | 29.87 | 54.41 | 10.53 | 35.3  | 60.26 | 13.76 | 38.72 | 65.47 |
| 2 | 29.87 |  0    | 31.59 | 33.35 | 12.48 | 37.45 | 36.57 | 15.91 | 42.65 |
| 3 | 54.41 | 31.59 |  0    | 57.89 | 37.02 | 13.42 | 61.11 | 40.45 | 18.62 |
| 4 | 10.53 | 33.35 | 57.89 |  0    | 33.23 | 58.74 | 10.89 | 35.83 | 62.87 |
| 5 | 35.3  | 12.48 | 37.02 | 33.23 |  0    | 34.91 | 35.16 | 10.83 | 37.57 |
| 6 | 60.26 | 37.45 | 13.42 | 58.74 | 34.91 |  0    | 60.88 | 37.55 | 12.78 |
| 7 | 13.76 | 36.57 | 61.11 | 10.89 | 35.16 | 60.88 |  0    | 34.74 | 61.21 |
| 8 | 38.72 | 15.91 | 40.45 | 35.83 | 10.83 | 37.55 | 34.74 |  0    | 35.92 |
| 9 | 65.47 | 42.65 | 18.62 | 62.87 | 37.57 | 12.78 | 61.21 | 35.92 |  0    |

Like the function plot_scanpaths, the function scasim takes a formula and a data frame as parameters. The formula specifies which columns in the data frame should be used for the calculations. To account for distortion due to visual perspective, the comparison of the scanpaths is carried out in visual field coordinates (latitude and longitude). In order to transform the pixel coordinates provided by the eye-tracker to visual field coordinates, the scasim function needs some extra information: the position of the gaze when the participant looked straight ahead (512, 384 in the present case), the distance of the eyes from the screen (60 cm), and the size of one pixel in the unit used to specify that distance (1/30 cm). Finally, we have to specify a normalization procedure. normalize=FALSE means that we don’t want to normalize. See the documentation of the scasim function for details.
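The geometry behind this transformation can be sketched in a few lines. The function below is an illustration only, not the package’s internal code; the default arguments correspond to the setup just described (straight-ahead gaze at pixel 512 on the horizontal axis, 60 cm viewing distance, pixels of 1/30 cm):

```r
# Illustration only (not scasim's internal code): convert a pixel
# coordinate on one axis to a visual-field angle in degrees, given the
# straight-ahead pixel position, the eye-to-screen distance (cm), and
# the size of one pixel (cm).
pix2angle <- function(pix, center, distance=60, pixel_size=1/30) {
  atan2((pix - center) * pixel_size, distance) * 180 / pi
}
pix2angle(512, 512)    # straight ahead: 0 degrees
pix2angle(48.82, 512)  # horizontal angle of the first fixation above
```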

The time that was spent looking at different things naturally depends on the durations of the two compared trials: the total duration of the two compared scanpaths constitutes an upper bound. This means that two long scanpaths may receive a larger dissimilarity score than two shorter scanpaths even if they look more similar. Depending on the research question, this may be undesirable. One way to remove this trivial influence of total duration is to normalize the dissimilarity scores, for example by dividing them by the total duration of the two compared scanpaths:
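The arithmetic behind this normalization can be illustrated with a toy example (hypothetical numbers, not scasim’s internal code): each raw score is divided by the summed total durations of the two trials being compared.

```r
# Hypothetical raw dissimilarity scores for two trials:
raw <- matrix(c( 0, 10,
                10,  0), nrow=2, byrow=TRUE)
# Hypothetical total durations of the two trials (ms):
totals <- c(2000, 3000)
# Divide each score by the combined duration of the two trials:
raw / outer(totals, totals, `+`)  # off-diagonal: 10 / 5000 = 0.002
```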

d2 <- scasim(eyemovements, duration ~ x + y | trial, 512, 384, 60, 1/30, normalize="durations")
round(d2, 4)
|   |      1 |      2 |      3 |      4 |      5 |      6 |      7 |      8 |      9 |
|---+--------+--------+--------+--------+--------+--------+--------+--------+--------|
| 1 | 0      | 0.0062 | 0.0098 | 0.0021 | 0.0063 | 0.0092 | 0.0024 | 0.0062 | 0.0087 |
| 2 | 0.0062 | 0      | 0.0053 | 0.0062 | 0.0021 | 0.0053 | 0.006  | 0.0024 | 0.0054 |
| 3 | 0.0098 | 0.0053 | 0      | 0.0094 | 0.0054 | 0.0017 | 0.0089 | 0.0054 | 0.0021 |
| 4 | 0.0021 | 0.0062 | 0.0094 | 0      | 0.0053 | 0.0082 | 0.0017 | 0.0052 | 0.0078 |
| 5 | 0.0063 | 0.0021 | 0.0054 | 0.0053 | 0      | 0.0045 | 0.0051 | 0.0014 | 0.0043 |
| 6 | 0.0092 | 0.0053 | 0.0017 | 0.0082 | 0.0045 | 0      | 0.0077 | 0.0044 | 0.0013 |
| 7 | 0.0024 | 0.006  | 0.0089 | 0.0017 | 0.0051 | 0.0077 | 0      | 0.0046 | 0.0069 |
| 8 | 0.0062 | 0.0024 | 0.0054 | 0.0052 | 0.0014 | 0.0044 | 0.0046 | 0      | 0.0038 |
| 9 | 0.0087 | 0.0054 | 0.0021 | 0.0078 | 0.0043 | 0.0013 | 0.0069 | 0.0038 | 0      |

The numbers are smaller now and can be interpreted as the proportion of time that was spent looking at different things.

The numbers in the matrix above capture a lot of information about the scanpath variance in the data set. However, dissimilarity scores are somewhat tricky to analyze. One problem is that these values have strong statistical dependencies: when we change one scanpath, this affects every dissimilarity score in which that scanpath participates. This has to be kept in mind when doing inferential statistics directly on the dissimilarity scores. While there are solutions for this, it is typically more convenient to produce a representation of scanpath variance that is free from this problem. One such representation is what we call the “map of scanpath space.” On such a map, every point represents a scanpath and the distances on the map reflect the dissimilarities according to our scanpath measure, i.e. the dissimilarity scores in the matrix above.
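To make the dependency concrete: with n scanpaths there are choose(n, 2) pairwise scores, and editing a single scanpath changes every score it participates in.

```r
# For the nine scanpaths in the toy data set:
n <- 9
choose(n, 2)  # number of pairwise dissimilarity scores: 36
n - 1         # scores affected by changing a single scanpath: 8
```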

The method for calculating these maps is called multi-dimensional scaling and one simple version of the general idea is implemented in the function cmdscale.

map <- cmdscale(d1)
round(map, 5)
|   |        V1 |        V2 |
|---+-----------+-----------|
| 1 |  29.19698 | -11.74914 |
| 2 |   2.02624 |  -1.97714 |
| 3 | -27.32948 | -14.708   |
| 4 |  28.81608 |  -2.61734 |
| 5 |   0.50589 |   8.65206 |
| 6 | -30.05024 |  -3.16703 |
| 7 |  29.63311 |   4.10074 |
| 8 |   0.22881 |  15.28742 |
| 9 | -33.02739 |   6.17844 |

The table above contains two numbers for each scanpath in the data set. These numbers (V1 and V2) determine a scanpath’s location in the two-dimensional scanpath space created by cmdscale. How many dimensions we need is an empirical question.
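One way to address that question is to ask cmdscale for the eigenvalues of the solution (eig=TRUE): the share of the leading eigenvalues indicates how much of the variance a low-dimensional map captures. The sketch below uses a small synthetic distance matrix so that it runs on its own; the same call works with d1.

```r
# Four points that genuinely lie in a plane -- two dimensions should
# capture essentially all of the variance:
pts <- matrix(c(0, 0,
                3, 0,
                0, 4,
                3, 4), ncol=2, byrow=TRUE)
fit <- cmdscale(dist(pts), k=2, eig=TRUE)
round(fit$eig / sum(abs(fit$eig)), 3)  # leading two shares close to 1
```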

Below is a plot showing the map of scanpaths:

plot(map, cex=4)
text(map, labels=rownames(map))

Plots/map_of_scanpath_space.png

Interestingly, the scanpaths are arranged in the same way as in the plot of the data at the top, except that the axes are mirrored: participants are arranged vertically and reading patterns horizontally. This suggests that the scasim measure not only recovered these two different kinds of information (reading speed and reading strategy) but also that it can distinguish between them.

To test how well this map represents the original dissimilarity scores, we can calculate the pair-wise differences on the map and compare them to the pair-wise scasim scores:

d1.dash <- as.matrix(dist(map))
plot(d1, d1.dash)

Plots/fit_of_map.png

This plot suggests that the map preserves the variance in dissimilarity scores really well. Given this very good fit, it appears that two dimensions were sufficient to describe the scanpath variance that is captured by scasim. This is not surprising, because the scanpaths in the toy data set were designed to vary with respect to two properties: the speed of the reader, and whether there was a regression back to the beginning of the sentence and, if so, how long it was.
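A quantitative complement to the visual check is the correlation between the original scores and the map distances (this continues from the objects d1 and d1.dash defined above):

```r
# Correlation between original dissimilarities and distances on the
# map; values close to 1 indicate a faithful low-dimensional map.
cor(as.vector(d1), as.vector(d1.dash))
```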

The benefit of the map representation is that it has much weaker statistical dependencies and that it is much more suitable for all kinds of analyses. For example, we can choose among a large number of clustering algorithms to test whether there are groups of similar scanpaths in a data set. Below, we use the simple k-means algorithm to illustrate this:

set.seed(10)
clusters <- kmeans(map, 3, iter.max=100)
plot(map, cex=4, col=clusters$cluster, pch=19)
text(map, labels=rownames(map), col="white")
points(clusters$centers, col="blue", pch=3, cex=4)

Plots/clusters.png

In this plot, color indicates to which cluster a scanpath belongs and the crosses show the center of each cluster.
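The cluster assignments can also be inspected directly rather than graphically (this assumes the map and clusters objects from above):

```r
# Which scanpaths ended up in which cluster:
split(rownames(map), clusters$cluster)
```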

Apart from cluster analyses there are many other ways to analyze scanpath variance. See the articles listed below for more details.

References

  • von der Malsburg, T., & Vasishth, S. (2011). What is the scanpath signature of syntactic reanalysis? Journal of Memory and Language, 65(2), 109–127. http://dx.doi.org/10.1016/j.jml.2011.02.004
  • von der Malsburg, T., Kliegl, R., & Vasishth, S. (2015). Determinants of scanpath regularity in reading. Cognitive Science, 39(7), 1675–1703. http://dx.doi.org/10.1111/cogs.12208
  • von der Malsburg, T., & Vasishth, S. (2013). Scanpaths reveal syntactic underspecification and reanalysis strategies. Language and Cognitive Processes, 28(10), 1545–1578. http://dx.doi.org/10.1080/01690965.2012.728232
  • von der Malsburg, T., Vasishth, S., & Kliegl, R. (2012). Scanpaths in reading are informative about sentence processing. In M. Carl, P. Bhattacharya, & K. K. Choudhary (Eds.), Proceedings of the First Workshop on Eye-tracking and Natural Language Processing (pp. 37–53). Mumbai, India: The COLING 2012 organizing committee.