Unsupervised text segmentation predicts eye fixations during reading

An online demo of LiB is available here: https://hub-binder.mybinder.ovh/user/ray306-lib_demo-qsr3qu0q/doc/tree/Quick_Demo.ipynb. Run the Jupyter notebook to see the segmentation results.

Full article here: https://www.frontiersin.org/articles/10.3389/frai.2022.731615/full

Abstract

Words typically form the basis of psycholinguistic and computational linguistic studies about sentence processing. However, recent evidence shows the basic units during reading, i.e., the items in the mental lexicon, are not always words, but could also be sub-word and supra-word units. To recognize these units, human readers require a cognitive mechanism to learn and detect them. In this paper, we assume eye fixations during reading reveal the locations of the cognitive units, and that the cognitive units are analogous with the text units discovered by unsupervised segmentation models. We predict eye fixations by model-segmented units on both English and Dutch text. The results show the model-segmented units predict eye fixations better than word units. This finding suggests that the predictive performance of model-segmented units indicates their plausibility as cognitive units. The Less-is-Better (LiB) model, which finds the units that minimize both long-term and working memory load, offers advantages both in terms of prediction score and efficiency among alternative models. Our results also suggest that modeling the least-effort principle on the management of long-term and working memory can lead to inferring cognitive units. Overall, the study supports the theory that the mental lexicon stores not only words but also smaller and larger units, suggests that fixation locations during reading depend on these units, and shows that unsupervised segmentation models can discover these units.
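The trade-off between long-term and working memory load mentioned in the abstract can be illustrated with a toy sketch. This is not the LiB algorithm and does not use LiB.py. It assumes, purely for illustration, that long-term memory load is approximated by the number of distinct unit types (lexicon size) and working memory load by the average number of units per sentence. The function `memory_load` and the toy corpus are hypothetical.

```python
from itertools import product


def memory_load(segmented_sentences):
    """Return (lexicon_size, mean_units_per_sentence).

    Assumption for this sketch only: lexicon size stands in for
    long-term memory load, and units per sentence stands in for
    working memory load.
    """
    lexicon = set()
    total_units = 0
    for units in segmented_sentences:
        lexicon.update(units)
        total_units += len(units)
    return len(lexicon), total_units / len(segmented_sentences)


# Hypothetical toy corpus: every determiner + noun combination.
determiners = ["the", "a", "my"]
nouns = ["cat", "dog", "pig"]
pairs = list(product(determiners, nouns))

# The same corpus segmented at three granularities.
as_chars = [list(det + noun) for det, noun in pairs]  # sub-word units
as_words = [[det, noun] for det, noun in pairs]       # word units
as_whole = [[det + noun] for det, noun in pairs]      # supra-word units

for name, corpus in [("chars", as_chars), ("words", as_words), ("whole", as_whole)]:
    lexicon_size, mean_units = memory_load(corpus)
    print(f"{name}: lexicon size = {lexicon_size}, mean units per sentence = {mean_units:.1f}")
```

In this toy corpus no single granularity wins on both measures: the coarsest units give the fewest units per sentence but a larger lexicon than word units, while the finest units need the most units per sentence. A model that minimizes both loads has to balance the two.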

Analysis code

See [Open It] LiB_evaluation_on_GECO.ipynb.

The LiB model code

LiB.py is the main script of the LiB model. It depends on structures.py, which defines the basic data structure of LiB.

The AG model & the CBL model

See the other_models folder.

Data files

The files without a file extension are the pre-processed corpora and eye-fixation data from GECO. Because of GitHub's file size limit, the large pre-processed corpora (COCA and SoNaR) are hosted at https://osf.io/ydr7w/.
