fatal flex scanner internal error--end of buffer missed #16

Closed
narayanibarve opened this issue Feb 13, 2017 · 19 comments · Fixed by #47

Comments

@narayanibarve

This error occurs when I read a .bib file. At first I thought it happened because the file is huge, with about 5000 citations, so I exported only 4 citations from this set to a .bib file in BibTeX format. But even this 4-citation file does not work; I get the same error.

@crsh
Contributor

crsh commented Apr 16, 2017

In some of the .bib files I have encountered, the error was caused by a single long field containing more than 10000 characters. Also see #14.
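A quick way to check whether a file contains such an oversized field is to measure each entry. The sketch below is only a heuristic and not part of bibtex: the helper name `bib_entry_sizes` and the demo data are made up for illustration, and it assumes every entry starts on its own line with `@`. It measures whole entries in bytes, so an entry below the limit cannot contain a field above it.

```r
# Hypothetical helper (not part of the bibtex package): report the size of
# every entry in a .bib file, given its lines (e.g. from readLines()).
# Assumes each entry starts on a line whose first non-blank character is "@".
bib_entry_sizes <- function(lines) {
  starts <- grep("^\\s*@", lines)
  if (length(starts) == 0) {
    return(data.frame(start_line = integer(0), header = character(0),
                      n_bytes = integer(0), stringsAsFactors = FALSE))
  }
  # An entry runs until the line before the next "@" line (or the end of file).
  ends <- c(starts[-1] - 1L, length(lines))
  n_bytes <- mapply(
    function(s, e) sum(nchar(lines[s:e], type = "bytes")),
    starts, ends
  )
  data.frame(start_line = starts, header = trimws(lines[starts]),
             n_bytes = n_bytes, stringsAsFactors = FALSE)
}

# Made-up demo data: one small entry and one with a 12000-character abstract.
demo <- c(
  "@article{a,",
  "  title = {Short},",
  "}",
  "",
  "@article{b,",
  paste0("  abstract = {", strrep("x", 12000), "},"),
  "}"
)
sizes <- bib_entry_sizes(demo)

# Entries larger than the roughly 10000 characters mentioned above.
suspects <- sizes[sizes$n_bytes > 10000, ]
```

For a real file, `bib_entry_sizes(readLines("my.bib"))` would give the same table; the file name here is a placeholder.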

@rkrug

rkrug commented May 4, 2017

Anything happening here? I have the error as well and would really like to read the references into R.

Or are there any alternatives? I can use scan to read the file in, x <- scan(file=bibfile, multi.line = TRUE, sep = "\n", what = "character") followed by x <- trimws(x), but what then?
How could I parse this object?
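One possible answer to the question above, as a rough sketch in base R rather than a real BibTeX parser: split the trimmed lines into entries and pull out `key = value` pairs. The function name `parse_simple_bib` and the demo entries are invented for illustration. It assumes one field per line, so multi-line fields, `@string` macros, `#` concatenation and nested braces are not handled.

```r
# Hypothetical helper: naive parser for a character vector of trimmed lines,
# such as the result of scan(...) followed by trimws(). Assumes each field
# sits on a single line of the form `name = {value},` or `name = "value",`.
parse_simple_bib <- function(lines) {
  lines <- trimws(lines)
  starts <- grep("^@[A-Za-z]+\\s*\\{", lines)
  ends <- c(starts[-1] - 1L, length(lines))
  lapply(seq_along(starts), function(i) {
    block <- lines[starts[i]:ends[i]]
    # Header line: "@type{key,"
    header <- regmatches(
      block[1],
      regexec("^@([A-Za-z]+)\\s*\\{\\s*([^,]*),?", block[1])
    )[[1]]
    field_lines <- grep("^[A-Za-z][A-Za-z0-9_-]*\\s*=", block[-1], value = TRUE)
    keys <- tolower(sub("\\s*=.*$", "", field_lines))
    vals <- sub("^[^=]*=\\s*", "", field_lines)   # drop "name ="
    vals <- sub(",\\s*$", "", vals)               # drop trailing comma
    vals <- sub("^[{\"](.*)[}\"]$", "\\1", vals)  # drop one outer delimiter pair
    c(list(bibtype = header[2], key = header[3]),
      as.list(setNames(vals, keys)))
  })
}

# Made-up demo input in the shape produced by scan() + trimws().
x <- c(
  "@article{vellend2001,",
  "author = {Vellend, Mark},",
  "year = {2001}",
  "}",
  "@book{somebook,",
  "title = \"A Book\",",
  "}"
)
entries <- parse_simple_bib(x)
```

Each element of `entries` is a named list, so `entries[[1]]$author` returns the author string of the first entry. Anything beyond simple exports is better left to a dedicated parser.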

@romainfrancois
Collaborator

Can you prepare a reprex?

@rkrug

rkrug commented Jun 30, 2017

I am using Python for the task now. I had to adapt the workflow a bit, but now it works, and I am learning some Python in parallel.

@romainfrancois
Collaborator

@narayanibarve do you still have this problem? If so, can you prepare a reproducible example using the reprex package?

@crsh
Contributor

crsh commented Jul 3, 2017

Here's a reprex for a case of a long field causing flex to break:

bibtex::read.bib("long_field.txt")
#> Error: lex fatal error:
#> input buffer overflow, can't enlarge buffer because scanner uses REJECT

long_field.txt

I used the current development version of bibtex from this repository.

@crsh
Contributor

crsh commented Jul 3, 2017

Similarly, some reference managers (in this case Zotero) add a jabref comment to the bottom of the file, which causes the same error.

bibtex::read.bib("jabref_comment.txt")
#> Error: lex fatal error:
#> input buffer overflow, can't enlarge buffer because scanner uses REJECT

jabref_comment.txt

@romainfrancois
Collaborator

Thanks. I'll have a look for the next version.

@swood-ecology

Just wanted to add to this that I'm having a similar problem reading in the attached .bib file from WoS.

soil.health_healthy.soil_1to500.bib.zip

@Matherion

Matherion commented Aug 23, 2017

This cleans the BibTex comments, for anybody else dealing with this:

### First read file to remove the JabRef comment
cleanFile <- readLines(file.path(queryHitsPath, queryHitsFiles));

### Paste all strings together
cleanFile <- paste(cleanFile, collapse="\n");

### Remove jabref comments
cleanFile <- gsub("(?s)@[Cc]omment\\{jabref-meta:[^\\}]*\\}", "", cleanFile, perl=TRUE);

### Write clean file to disk
writeLines(cleanFile, con=file.path(queryHitsPath, "tmp-clean-file.bib"));

### Import references
queryHits[['1and2']] <- ReadBib(file.path(queryHitsPath, "tmp-clean-file.bib"));

However, for some reason the import still fails, even though no field comes even close to 10K characters. So there seem to be other errors as well. Perhaps simply allowing one to specify a string to parse, thereby letting people read the files in on their own, could be a simple, relatively quick fix? It would also add functionality that is more generically useful, so it would not be wasted once this bug has been resolved (if it ever is :-)).
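Until a reader accepts strings directly, the idea can be approximated by writing the string to a temporary file. This is only a sketch: `read_bib_text` is a made-up name, not a function of bibtex or RefManageR, and it assumes the reader takes a file path as its first argument, as `bibtex::read.bib` does. The demo call uses `readLines` as a stand-in reader so that it runs without bibtex installed.

```r
# Hypothetical wrapper: parse BibTeX held in a character vector by routing it
# through a temporary file. `reader` is any function whose first argument is
# a file path; the default is only evaluated when no reader is supplied.
read_bib_text <- function(text, reader = bibtex::read.bib, ...) {
  tmp <- tempfile(fileext = ".bib")
  on.exit(unlink(tmp), add = TRUE)   # remove the temporary file afterwards
  writeLines(text, tmp, useBytes = TRUE)
  reader(tmp, ...)
}

# Stand-in demo: readLines() simply returns what was written, which shows
# the round trip without needing a BibTeX parser.
bib_text <- c("@misc{a,", "  title = {Example},", "}")
round_trip <- read_bib_text(bib_text, reader = readLines)
```

With bibtex installed, `read_bib_text(cleanFile)` would hand a cleaned string such as the one built above to `read.bib` without keeping an intermediate file on disk.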

@Matherion

I'm no closer to solving this, but I remembered I'd actually written 'my own' function to import BibTex files, for a package I'm working on ('metabefor'). It's at https://github.com/Matherion/metabefor/blob/master/R/importBibtex.r, in case anybody's struggling with the same.

@crsh
Contributor

crsh commented Jan 30, 2018

Any news on this?

@kguidonimartins

Anything new on this? I had the same error using both the bibtex and RefManageR packages, and using the citr addin.

My try:

download.file(url = "https://gist.githubusercontent.com/kguidonimartins/6ca03106109cef5a891c67748b895e6a/raw/32c0e203de7875a1d13db6705aa9b507914a9fd9/library.bib", 
              destfile = "library.bib")

bibtex::read.bib(file = "library.bib")

RefManageR::ReadBib(file = "library.bib")

My session info:

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=pt_BR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=pt_BR.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=pt_BR.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=pt_BR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] shiny_1.0.5.9000     Cite_0.1.0           rcrossref_0.8.1.9429
 [4] wordcountaddin_0.2.0 citr_0.2.0.9055      pacman_0.4.6        
 [7] knitr_1.20           picante_1.6-2        nlme_3.1-131        
[10] brranching_0.2.0     phytools_0.6-44      maps_3.2.0          
[13] data.table_1.10.4-3  flora_0.3.0          readxl_1.0.0        
[16] ape_5.0              betapart_1.5.0       forcats_0.3.0       
[19] stringr_1.3.0        dplyr_0.7.4          purrr_0.2.4         
[22] readr_1.1.1          tidyr_0.8.0          tibble_1.4.2        
[25] ggplot2_2.2.1        tidyverse_1.2.1      vegan_2.4-6         
[28] lattice_0.20-35      permute_0.9-4        bibtex_0.4.2        

loaded via a namespace (and not attached):
 [1] colorspace_1.3-2        rprojroot_1.3-2         rstudioapi_0.7         
 [4] urltools_1.7.0          DT_0.4                  mvtnorm_1.0-7          
 [7] lubridate_1.7.3         RefManageR_0.14.20      xml2_1.2.0             
[10] codetools_0.2-15        splines_3.4.3           mnormt_1.5-5           
[13] bold_0.5.0              jsonlite_1.5            broom_0.4.3            
[16] cluster_2.0.6           compiler_3.4.3          httr_1.3.1             
[19] backports_1.1.2         assertthat_0.2.0        Matrix_1.2-12          
[22] lazyeval_0.2.1          cli_1.0.0               later_0.7.1            
[25] htmltools_0.3.6         tools_3.4.3             bindrcpp_0.2           
[28] igraph_1.1.2            coda_0.19-1             gtable_0.2.0           
[31] glue_1.2.0              taxize_0.9.3            reshape2_1.4.3         
[34] clusterGeneration_1.3.4 fastmatch_1.1-0         Rcpp_0.12.16           
[37] msm_1.6.6               cellranger_1.1.0        crul_0.5.2             
[40] debugme_1.1.0           iterators_1.0.9         psych_1.7.8            
[43] rvest_0.3.2             mime_0.5                miniUI_0.1.1           
[46] phangorn_2.4.0          devtools_1.13.5         stringdist_0.9.4.7     
[49] MASS_7.3-49             zoo_1.8-1               scales_0.5.0           
[52] rcdd_1.2                hms_0.4.2               promises_1.0           
[55] parallel_3.4.3          expm_0.999-2            animation_2.5          
[58] yaml_2.1.18             curl_3.2                memoise_1.1.0          
[61] triebeard_0.3.0         reshape_0.8.7           stringi_1.1.7          
[64] foreach_1.4.4           plotrix_3.7             geometry_0.3-6         
[67] rlang_0.2.0             pkgconfig_2.0.1         evaluate_0.10.1        
[70] bindr_0.1.1             htmlwidgets_1.0         plyr_1.8.4             
[73] magrittr_1.5            R6_2.2.2                combinat_0.0-8         
[76] whisker_0.3-2           pillar_1.2.1            haven_1.1.1            
[79] foreign_0.8-69          withr_2.1.2             mgcv_1.8-23            
[82] survival_2.41-3         scatterplot3d_0.3-41    abind_1.4-5            
[85] modelr_0.1.1            crayon_1.3.4            rmarkdown_1.9          
[88] koRpus_0.10-2           grid_3.4.3              callr_2.0.2            
[91] reprex_0.1.2            digest_0.6.15           xtable_1.8-2           
[94] httpuv_1.3.6.9007       numDeriv_2016.8-1       munsell_0.4.3          
[97] shinyjs_1.0             magic_1.5-8             quadprog_1.5-5  

@kguidonimartins

The funny thing is that the code works using the reprex addin.

download.file(url = "https://gist.githubusercontent.com/kguidonimartins/6ca03106109cef5a891c67748b895e6a/raw/32c0e203de7875a1d13db6705aa9b507914a9fd9/library.bib", 
              destfile = "library.bib")
bibtex::read.bib(file = "library.bib")
#> Vellend M (2001). "Do commonly used indices of $\beta$ -diversity
#> measure species turnover ?" _Journal of Vegetation Science_, *12*,
#> pp. 545-552.
#> 
#> López-Mart\'inez JO, Sanaphre-Villanueva L, Dupuy JM,
#> Hernández-Stefanoni JL, Meave JA and Gallardo-Cruz JA (2013).
#> "$\beta$-Diversity of functional groups of woody plants in a
#> tropical dry forest in Yucatan." _PloS one_, *8*(9), pp. e73660.
#> ISSN 1932-6203, doi: 10.1371/journal.pone.0073660 (URL:
#> http://doi.org/10.1371/journal.pone.0073660), <URL:
#> http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3769343{\&}tool=pmcentrez{\&}rendertype=abstract>.
#> 
#> Swenson NG, Stegen JC, Davies SJ, Erickson DL, Forero-Montaña J,
#> Hurlbert AH, Kress WJ, Thompson J, Uriarte M, Wright SJ and
#> Zimmerman JK (2012). "Temporal turnover in the composition of
#> tropical tree communities: functional determinism and phylogenetic
#> stochasticity." _Ecology_, *93*(3), pp. 490-499. ISSN 0012-9658,
#> doi: 10.1890/11-1180.1 (URL: http://doi.org/10.1890/11-1180.1),
#> <URL: http://doi.wiley.com/10.1890/11-1180.1>.

RefManageR::ReadBib(file = "library.bib")
#> Warning in parse_Rd(Rd, encoding = encoding, fragment = fragment, ...):
#> <connection>:3: unknown macro '\beta'
#> Warning in parse_Rd(Rd, encoding = encoding, fragment = fragment, ...):
#> <connection>:3: unknown macro '\beta'
#> [1] J. O. López-Mart\'inez, L. Sanaphre-Villanueva, J. M. Dupuy,
#> et al. "$\beta$-Diversity of functional groups of woody plants in
#> a tropical dry forest in Yucatan.". In: _PloS one_ 8.9 (Jan.
#> 2013), p. e73660. ISSN: 1932-6203. DOI:
#> 10.1371/journal.pone.0073660. <URL:
#> http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3769343{\&}tool=pmcentrez{\&}rendertype=abstract>.
#> 
#> [2] N. G. Swenson, J. C. Stegen, S. J. Davies, et al. "Temporal
#> turnover in the composition of tropical tree communities:
#> functional determinism and phylogenetic stochasticity". In:
#> _Ecology_ 93.3 (Mar. 2012), pp. 490-499. ISSN: 0012-9658. DOI:
#> 10.1890/11-1180.1. <URL: http://doi.wiley.com/10.1890/11-1180.1>.
#> 
#> [3] M. Vellend. "Do commonly used indices of $\beta$ -diversity
#> measure species turnover ?". In: _Journal of Vegetation Science_
#> 12 (2001), pp. 545-552.

@swood-ecology

I've been reading bib files with readFiles in the bibliometrix package.

@mohdkarim

Hi,

I am using citr and R Markdown with Zotero. I partially got around this problem with crsh's suggestion of omitting the abstract, but some BibTeX entries have 500 to 1000+ author names, which reproduces the problem.

Any suggestions? Has anyone come up with a solution to this?
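For entries with very long author lists, one workaround is to shorten the author field before parsing. The sketch below is an assumption-laden illustration, not something bibtex provides: `truncate_authors` is a made-up helper that keeps the first few names and appends BibTeX's `and others` marker. It splits on the word `and`, so a name that itself contains a braced "and" would be split incorrectly, and it changes the bibliographic data.

```r
# Hypothetical helper: keep the first `max_authors` names of a BibTeX author
# string and replace the rest with "others" (BibTeX's convention for et al.).
truncate_authors <- function(author, max_authors = 10L) {
  parts <- strsplit(author, "\\s+and\\s+")[[1]]
  if (length(parts) <= max_authors) {
    return(author)   # short enough, leave untouched
  }
  paste(c(parts[seq_len(max_authors)], "others"), collapse = " and ")
}

# Made-up demo: a 600-author field reduced to three names plus "others".
long_field <- paste(sprintf("Author%d, X.", 1:600), collapse = " and ")
short_field <- truncate_authors(long_field, max_authors = 3L)
```

The shortened value would still have to be written back into the .bib text before the file is read.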

@AmiZya

AmiZya commented Mar 21, 2020

I have the same problem with R Markdown and citr. Any suggested solution for this, please?

@NeutralKaon

I am having this issue when parsing a long list of authors too. Any progress?

@dieghernan
Member

Hi, I think this issue can be closed after #47.

I parsed all your example files with the upcoming version of bibtex, where the C code is replaced by R code, and the described issue is no longer observed. The files are read correctly:

# PR 47 https://github.com/ropensci/bibtex/pull/47

library(bibtex)

# File 1 ----

f1 <- tempfile("file1", fileext = ".txt")

download.file(
  "https://github.com/romainfrancois/bibtex/files/1120203/long_field.txt",
  f1
)

ex1 <- read.bib(f1)
ex1
#> Batzill M (2012). "The Surface Science of Graphene: Metal Interfaces,
#> CVD Synthesis, Nanoribbons, Chemical Modifications, and Defects."
#> _SURFACE SCIENCE REPORTS_, *67*(3-4), 83-115. ISSN 0167-5729, doi:
#> 10.1016/j.surfrep.2011.12.001 (URL:
#> https://doi.org/10.1016/j.surfrep.2011.12.001).

# File 2 ----
f2 <- tempfile("file2", fileext = ".txt")

download.file(
  "https://github.com/romainfrancois/bibtex/files/1120229/jabref_comment.txt",
  f2
)

ex2 <- read.bib(f2)
ex2
#> Gómez RL (2002). "Variability and Detection of Invariant Structure."
#> _Psychological Science_, *13*(5), 431-436. ISSN 0956-7976, 1467-9280,
#> doi: 10.1111/1467-9280.00476 (URL:
#> https://doi.org/10.1111/1467-9280.00476), <URL: 2015-01-20>.

# File 3 -----
f3 <- tempfile("file3", fileext = ".zip")
download.file(
  "https://github.com/romainfrancois/bibtex/files/1229495/soil.health_healthy.soil_1to500.bib.zip",
  f3
)


unzip(f3, junkpaths = TRUE, exdir = tempdir())
ex3 <- read.bib(
  file.path(
    tempdir(),
    "soil.health_healthy.soil_1to500.bib"
  )
)
#> ignoring entry 'ISI:000268383100002' (line 34779) because :
#>  A bibentry of bibtype 'InCollection' has to specify the field: author
#> ignoring entry 'ISI:000268383100003' (line 34853) because :
#>  A bibentry of bibtype 'InCollection' has to specify the field: author
#> ignoring entry 'ISI:000268383100004' (line 34928) because :
#>  A bibentry of bibtype 'InCollection' has to specify the field: author
#> ignoring entry 'ISI:000268383100005' (line 34999) because :
#>  A bibentry of bibtype 'InCollection' has to specify the field: author
#> ignoring entry 'ISI:000268383100006' (line 35080) because :
#>  A bibentry of bibtype 'InCollection' has to specify the field: author
#> ignoring entry 'ISI:000268383100008' (line 35134) because :
#>  A bibentry of bibtype 'InCollection' has to specify the field: author
#> ignoring entry 'ISI:000268383100010' (line 35192) because :
#>  A bibentry of bibtype 'InCollection' has to specify the field: author

length(ex3)
#> [1] 493

# Small sample of entries, since the file has 500 (493 read)

ex3[1:5]
#> FORMAN J (1951). "SOIL, HEALTH, AND THE DENTAL PROFESSION." _JOURNAL OF
#> PROSTHETIC DENTISTRY_, *1*(5), 508-522. ISSN 0022-3913, doi:
#> 10.1016/0022-3913(51)90037-6 (URL:
#> https://doi.org/10.1016/0022-3913(51)90037-6).
#> 
#> SHARMA N, MADAN M (1983). "EARTHWORMS FOR SOIL HEALTH AND
#> POLLUTION-CONTROL." _JOURNAL OF SCIENTIFIC \& INDUSTRIAL RESEARCH_,
#> *42*(10), 575-583. ISSN 0022-4456.
#> 
#> HABERERN J (1992). "A SOIL HEALTH INDEX." _JOURNAL OF SOIL AND WATER
#> CONSERVATION_, *47*(1), 6. ISSN 0022-4561.
#> 
#> [Anonymous] (1993). "THE BREAD CORNER - NO BREAD WITHOUT HEALTHY SOIL."
#> _ALIMENTA_, *32*(3), 45. ISSN 0002-5402.
#> 
#> Watts M (1994). "Pesticide residues in food: The views of the Soil \&
#> Health Association of New Zealand." In Savage, GP (ed.), _PROCEEDINGS
#> OF THE NUTRITION SOCIETY OF NEW ZEALAND, VOL 19_, volume 19 number 0
#> series PROCEEDINGS OF THE NUTRITION SOCIETY OF NEW ZEALAND, 58-63. Nutr
#> Soc New Zealand, ANIMAL \& VETERINARY SCI GROUP, LINCOLN UNIVERSITY, PO
#> BOX 84, CANTERBURY, NEW ZEALAND. 29th Annual Conference of the
#> Nutrition-Society-of-New-Zealand, CHRISTCHURCH, NEW ZEALAND, AUG, 1994.

# From gist ----
gist <- tempfile(fileext = ".bib")

download.file(
  url = "https://gist.githubusercontent.com/kguidonimartins/6ca03106109cef5a891c67748b895e6a/raw/32c0e203de7875a1d13db6705aa9b507914a9fd9/library.bib",
  destfile = gist
)

bibtex::read.bib(file = gist)
#> Vellend M (2001). "Do commonly used indices of $\beta$ -diversity
#> measure species turnover ?" _Journal of Vegetation Science_, *12*,
#> 545-552.
#> 
#> López-Mart\'inez JO, Sanaphre-Villanueva L, Dupuy JM,
#> Hernández-Stefanoni JL, Meave JA, Gallardo-Cruz JA (2013).
#> "$\beta$-Diversity of functional groups of woody plants in a tropical
#> dry forest in Yucatan." _PloS one_, *8*(9), e73660. ISSN 1932-6203,
#> doi: 10.1371/journal.pone.0073660 (URL:
#> https://doi.org/10.1371/journal.pone.0073660), <URL:
#> http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3769343{\&}tool=pmcentrez{\&}rendertype=abstract>.
#> 
#> Swenson NG, Stegen JC, Davies SJ, Erickson DL, Forero-Montaña J,
#> Hurlbert AH, Kress WJ, Thompson J, Uriarte M, Wright SJ, Zimmerman JK
#> (2012). "Temporal turnover in the composition of tropical tree
#> communities: functional determinism and phylogenetic stochasticity."
#> _Ecology_, *93*(3), 490-499. ISSN 0012-9658, doi: 10.1890/11-1180.1
#> (URL: https://doi.org/10.1890/11-1180.1), <URL:
#> http://doi.wiley.com/10.1890/11-1180.1>.

Created on 2022-01-17 by the reprex package (v2.0.1)

Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.1.2 (2021-11-01)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Spanish_Spain.1252
#>  ctype    Spanish_Spain.1252
#>  tz       Europe/Paris
#>  date     2022-01-17
#>  pandoc   2.14.0.3 @ C:/Program Files/RStudio/bin/pandoc/ (via rmarkdown)
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date (UTC) lib source
#>  backports     1.4.1   2021-12-13 [1] CRAN (R 4.1.2)
#>  bibtex      * 0.5.0   2022-01-17 [1] local
#>  cli           3.1.0   2021-10-27 [1] CRAN (R 4.1.1)
#>  crayon        1.4.2   2021-10-29 [1] CRAN (R 4.1.1)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.1.2)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.1)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.1.1)
#>  fansi         1.0.0   2022-01-10 [1] CRAN (R 4.1.2)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.1)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.1.2)
#>  glue          1.6.0   2021-12-17 [1] CRAN (R 4.1.2)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.1)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.1)
#>  knitr         1.37    2021-12-16 [1] CRAN (R 4.1.2)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.1)
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.1.1)
#>  pillar        1.6.4   2021-10-18 [1] CRAN (R 4.1.1)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.1)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.1)
#>  R.cache       0.15.0  2021-04-30 [1] CRAN (R 4.1.1)
#>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.1.1)
#>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.1.1)
#>  R.utils       2.11.0  2021-09-26 [1] CRAN (R 4.1.1)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.1)
#>  rlang         0.4.12  2021-10-18 [1] CRAN (R 4.1.1)
#>  rmarkdown     2.11    2021-09-14 [1] CRAN (R 4.1.1)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
#>  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.1.2)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.1.1)
#>  styler        1.6.2   2021-09-23 [1] CRAN (R 4.1.1)
#>  tibble        3.1.6   2021-11-07 [1] CRAN (R 4.1.2)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.1)
#>  vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.1)
#>  withr         2.4.3   2021-11-30 [1] CRAN (R 4.1.2)
#>  xfun          0.29    2021-12-14 [1] CRAN (R 4.1.2)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.1.1)
#> 
#>  [1] C:/Users/diego/Documents/R/win-library/4.1
#>  [2] C:/Program Files/R/R-4.1.2/library
#> 
#> ------------------------------------------------------------------------------
