Tokenizer needs to handle nulls #202

@r2evans

After an attempt to read a file containing nulls fails (a different issue), the datafile remains locked. It is not shown as an open connection by showConnections(all=TRUE), and I know of no other place to check for open or locked connections.

The file, call it abc.txt:

ab\x00c,def
abc,def

(That's a null character, not the literal characters ... I don't know another way to depict it in GH.)
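Since the null byte is hard to show in plain text, here is a minimal sketch of one way to build the example file from R by writing raw bytes. The temporary path and variable names are assumptions for illustration, not part of the original report.

```r
# Build the two-line example file byte by byte so that a real NUL (0x00)
# ends up between "ab" and "c" on the first line.
path <- file.path(tempdir(), "abc.txt")  # hypothetical location
bytes <- c(
  charToRaw("ab"), as.raw(0), charToRaw("c,def\n"),
  charToRaw("abc,def\n")
)
writeBin(bytes, path)

# Read the file back as raw bytes and locate the NUL to confirm it was written.
raw_in <- readBin(path, what = "raw", n = file.info(path)$size)
nul_positions <- which(raw_in == as.raw(0))
```

Writing with writeBin() avoids any text-mode translation, so the bytes on disk are the bytes in the vector.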

read_delim('abc.txt', delim=',')
## Error in collectorsGuess(source, tokenizer, n = 100) : 
##   embedded nul in string: 'ab\0c'

read_delim('abc.txt', delim=',', col_names=FALSE, col_types='cc')
## Error in read_tokens(ds, tokenizer, col_types, col_names, n_max = n_max,  : 
##   embedded nul in string: 'ab\0c'
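For anyone who only needs the data in the meantime, one possible workaround is to strip the null bytes before parsing. The sketch below uses base R only; the helper name read_without_nuls and the demo file are hypothetical, and it assumes the file is small enough to hold in memory and that dropping the nulls is acceptable for the data.

```r
# Hypothetical helper: read the file as raw bytes, drop every NUL byte,
# then parse the remaining text with base R's read.table().
read_without_nuls <- function(path, sep = ",") {
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  clean <- bytes[bytes != as.raw(0)]
  read.table(text = rawToChar(clean), sep = sep, header = FALSE,
             stringsAsFactors = FALSE)
}

# Demo on a temporary copy of the example file (NUL in the first row).
demo_path <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("ab"), as.raw(0), charToRaw("c,def\nabc,def\n")), demo_path)
demo_df <- read_without_nuls(demo_path)
```

This does not address the locking behaviour described here; it only avoids the "embedded nul in string" error.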

The fact that it cannot read a null is not the issue here (though to some people it may be a problem). More to the point, the file is locked and cannot be changed or overwritten by anything. (By "anything", I mean that I tested emacs/ESS, the RStudio text editor, notepad++, and even some of R's base file-writing functions.) I tested the following combinations of read_delim arguments and of the datafile row containing the null character:

null in file row   col_names   col_types   read file correctly?
1                  (omitted)   "cc"        yes
1                  (omitted)   (omitted)   yes
1                  FALSE       "cc"        error
1                  FALSE       (omitted)   error
2                  (omitted)   "cc"        error
2                  (omitted)   (omitted)   error
2                  FALSE       "cc"        error
2                  FALSE       (omitted)   error

When it fails, subsequent calls using one of the two "good" combinations above work fine, though the file is still locked to outside editors. Closing the R session was the only way I found to unlock the file so that editors could save or overwrite it.
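To check the lock from within R rather than from an external editor, something like the following rough sketch could be used. The helper name can_overwrite is hypothetical, and it is an assumption that a failed overwrite through file.copy() reflects the same lock the editors run into. It is best tried on a disposable copy of the data, since it rewrites the file with its own contents.

```r
# Hypothetical helper: back up the file, then try to overwrite it with
# its own contents. Returns TRUE if the overwrite succeeded.
can_overwrite <- function(path) {
  backup <- tempfile()
  if (!file.copy(path, backup)) stop("could not back up ", path)
  on.exit(unlink(backup))
  # file.copy() returns FALSE (with a warning) instead of raising an
  # error when the target cannot be replaced.
  suppressWarnings(file.copy(backup, path, overwrite = TRUE))
}

# Demo on an ordinary, unlocked temporary file.
demo_file <- tempfile(fileext = ".txt")
writeLines("x,y", demo_file)
demo_result <- can_overwrite(demo_file)
```

Calling can_overwrite() before and after a failing read_delim() call would show whether the failed read is what leaves the file locked.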

Done on win81_64 using emacs/ESS and RStudio. Attempted it on linux (ubuntu 14.04.2 with R-3.2.0) and none of these argument combinations resulted in a locked file. Not tested on mac.

R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readr_0.1.1 r2_0.4.15  

loaded via a namespace (and not attached):
[1] compiler_3.2.0  tools_3.2.0     htmltools_0.2.6 Rcpp_0.11.6    
[5] rmarkdown_0.7   digest_0.6.8   
