New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tokenizer needs to handle nulls #202

Closed
r2evans opened this Issue Jun 20, 2015 · 11 comments

Comments

Projects
None yet
4 participants
@r2evans

r2evans commented Jun 20, 2015

After attempting to read a file with nulls fails (different issue), the datafile is locked. It is not being shown as an open connection by showConnections(all=TRUE), and I know no other places to check for known open/locked connections.

The file, call it abc.txt:

ab\x00c,def
abc,def

(That's a null character, not the literal characters ... I don't know another way to depict it in GH.)

read_delim('abc.txt', delim=',')
## Error in collectorsGuess(source, tokenizer, n = 100) : 
##   embedded nul in string: 'ab\0c'

read_delim('abc.txt', delim=',', col_names=FALSE, col_types='cc')
## Error in read_tokens(ds, tokenizer, col_types, col_names, n_max = n_max,  : 
##   embedded nul in string: 'ab\0c'

The fact that it cannot read a null is not the issue here (though to some people it may be a problem). More the point is that the file is locked and cannot be changed/over-written by anything. (By "anything", I tested emacs/ESS, RStudio text editor, notepad++, even some of R's base file-writing functions.) Tested combinations of read_delim arguments, plus the row of the datafile containing the null character:

file row col_names col_types read file correctly?
1 "cc" yes
1 yes
1 FALSE "cc" error
1 FALSE error
2 "cc" error
2 error
2 FALSE "cc" error
2 FALSE error

When it fails, subsequent calls to one of the two "good" combinations above works fine, though the file is still locked to outside editors. Closing the session of R is the only way I was able to unlock the file for editors to be able to save/over-write the file.

Done on win81_64 using emacs/ESS and RStudio. Attempted it on linux (ubuntu 14.04.2 with R-3.2.0) and none of these argument combinations resulted in a locked file. Not tested on mac.

R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readr_0.1.1 r2_0.4.15  

loaded via a namespace (and not attached):
[1] compiler_3.2.0  tools_3.2.0     htmltools_0.2.6 Rcpp_0.11.6    
[5] rmarkdown_0.7   digest_0.6.8   
@hadley

This comment has been minimized.

Member

hadley commented Jul 9, 2015

Can you upload the file somewhere and link to it from this issue please?

I've had a quick look and I'm not sure why the file handle isn't being freed - it's happening in C++, so it probably means I've missed a destructor somewhere. You only see the problem on windows because other platforms don't have such insane file locking strategies.

@r2evans

This comment has been minimized.

r2evans commented Jul 9, 2015

How about code to create the file? :-)

r <- charToRaw('abqc,def\nabc,def')
r[3] <- raw(1)
writeBin(r, 'abc.txt')

The file that triggered my initial investigating is not relevant (and way too big), and this code creates the exact file content I used for testing. It's about as "minimal working example" as I could create without going code-golf on it.

@kevinushey

This comment has been minimized.

kevinushey commented Jul 9, 2015

Is it possible that an R error is causing a longjmp that then skips
destructors?

On Thu, Jul 9, 2015, 14:44 r2evans notifications@github.com wrote:

How about code to create the file? :-)

r <- charToRaw('abqc,def\nabc,def')r[3] <- raw(1)
writeBin(r, 'abc.txt')

The file that triggered my initial investigating is not relevant (and way
too big), and this code creates the exact file content I used for testing.
It's about as "minimal working example" as I could create without going
code-golf on it.


Reply to this email directly or view it on GitHub
#202 (comment).

@hadley

This comment has been minimized.

Member

hadley commented Jul 9, 2015

@kevinushey oooooooh, maybe

@hadley

This comment has been minimized.

Member

hadley commented Jul 10, 2015

Ah yes, it looks like that error is thrown by mkCharLenCE(). @kevinushey is there a good way to handle this problem in general?

@kevinushey

This comment has been minimized.

kevinushey commented Jul 10, 2015

You can avoid R errors longjmping by using R_ToplevelExec. Unfortunately the interface is kind of clunky and there's not really a 'clean' way of calling an arbitrary function... there's probably some magic you could do with templates, but it would take some work to get right.

Here's an example:

#include <Rcpp.h>
using namespace Rcpp;

namespace safe {

struct MkCharLenCEData {
  const char* data;
  int n;
  cetype_t enc;
  SEXP result;
};

namespace internal {

void mkCharLenCE(void* pData) {

  MkCharLenCEData* pCall = (MkCharLenCEData*) pData;

  pCall->result = Rf_mkCharLenCE(
    pCall->data,
    pCall->n,
    pCall->enc
  );
}

} // namespace internal

SEXP mkCharLenCE(const char* data, int n, cetype_t enc)
{
  MkCharLenCEData callData;
  callData.data = data;
  callData.n = n;
  callData.enc = enc;
  callData.result = Rf_mkChar("");

  Rboolean ok = R_ToplevelExec(
    internal::mkCharLenCE,
    (void*) &callData
  );

  if (!ok)
    Rprintf("Failed to get character length\n");

  return callData.result;
}

} // namespace safe

// [[Rcpp::export]]
CharacterVector test() {
  CharacterVector result(1);

  result[0] = safe::mkCharLenCE("foo\0bar", 7, CE_UTF8);
  Rprintf("Still got here!\n");
  return result;
}

/*** R
test()
*/

gives me:

> test()
Error: embedded nul in string: 'foo\0bar'
Failed to get character length
Still got here!
[1] ""

It seems like adding something like this in Rcpp (e.g. a generic template function that provides a means of executing some R API function 'safely') would be useful, just might be tricky to make generic enough.

@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Jul 10, 2015

@kevinushey did we not do precisely this in Rcpp11, although perhaps with a bit of forbidden stuff.

What we need is access to contexts and some documentation on how to use them ... but hey not sure this will happen until Rcpp* is under 50% of CRAN packages ...

@kevinushey

This comment has been minimized.

kevinushey commented Jul 10, 2015

@romainfrancois you're right; that was all the stuff dragging contexts out to power try_catch (I had forgotten about this)

@hadley

This comment has been minimized.

Member

hadley commented Jul 10, 2015

Hmmmm, seems like a lot of work for this case. Might be possible to tackle by checking the string myself

@hadley

This comment has been minimized.

Member

hadley commented Sep 3, 2015

@kevinushey has anything changed in Rcpp that would affect this?

@kevinushey

This comment has been minimized.

kevinushey commented Sep 3, 2015

@hadley not directly, no -- I think you'll have to just check the string yourself.

@hadley hadley changed the title from collectorsGuess/read_tokens failure leaves file locked (windows) to Tokenizer needs to handle nulls Sep 3, 2015

hadley added a commit that referenced this issue Sep 22, 2015

Safely make strings with embedded nuls. #202
Still needs to be plumbed into warning system.

@hadley hadley closed this in da39530 Sep 23, 2015

@lock lock bot locked and limited conversation to collaborators Sep 25, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.