Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tokenizer needs to handle nulls #202

Closed
r2evans opened this issue Jun 20, 2015 · 11 comments
Closed

Tokenizer needs to handle nulls #202

r2evans opened this issue Jun 20, 2015 · 11 comments

Comments

@r2evans
Copy link

r2evans commented Jun 20, 2015

After attempting to read a file with nulls fails (different issue), the datafile is locked. It is not being shown as an open connection by showConnections(all=TRUE), and I know no other places to check for known open/locked connections.

The file, call it abc.txt:

ab\x00c,def
abc,def

(That's a null character, not the literal characters ... I don't know another way to depict it in GH.)

read_delim('abc.txt', delim=',')
## Error in collectorsGuess(source, tokenizer, n = 100) : 
##   embedded nul in string: 'ab\0c'

read_delim('abc.txt', delim=',', col_names=FALSE, col_types='cc')
## Error in read_tokens(ds, tokenizer, col_types, col_names, n_max = n_max,  : 
##   embedded nul in string: 'ab\0c'

The fact that it cannot read a null is not the issue here (though to some people it may be a problem). More the point is that the file is locked and cannot be changed/over-written by anything. (By "anything", I tested emacs/ESS, RStudio text editor, notepad++, even some of R's base file-writing functions.) Tested combinations of read_delim arguments, plus the row of the datafile containing the null character:

file row col_names col_types read file correctly?
1 "cc" yes
1 yes
1 FALSE "cc" error
1 FALSE error
2 "cc" error
2 error
2 FALSE "cc" error
2 FALSE error

When it fails, subsequent calls to one of the two "good" combinations above works fine, though the file is still locked to outside editors. Closing the session of R is the only way I was able to unlock the file for editors to be able to save/over-write the file.

Done on win81_64 using emacs/ESS and RStudio. Attempted it on linux (ubuntu 14.04.2 with R-3.2.0) and none of these argument combinations resulted in a locked file. Not tested on mac.

R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readr_0.1.1 r2_0.4.15  

loaded via a namespace (and not attached):
[1] compiler_3.2.0  tools_3.2.0     htmltools_0.2.6 Rcpp_0.11.6    
[5] rmarkdown_0.7   digest_0.6.8   
@hadley
Copy link
Member

hadley commented Jul 9, 2015

Can you upload the file somewhere and link to it from this issue please?

I've had a quick look and I'm not sure why the file handle isn't being freed - it's happening in C++, so it probably means I've missed a destructor somewhere. You only see the problem on windows because other platforms don't have such insane file locking strategies.

@r2evans
Copy link
Author

r2evans commented Jul 9, 2015

How about code to create the file? :-)

r <- charToRaw('abqc,def\nabc,def')
r[3] <- raw(1)
writeBin(r, 'abc.txt')

The file that triggered my initial investigating is not relevant (and way too big), and this code creates the exact file content I used for testing. It's about as "minimal working example" as I could create without going code-golf on it.

@kevinushey
Copy link

Is it possible that an R error is causing a longjmp that then skips
destructors?

On Thu, Jul 9, 2015, 14:44 r2evans notifications@github.com wrote:

How about code to create the file? :-)

r <- charToRaw('abqc,def\nabc,def')r[3] <- raw(1)
writeBin(r, 'abc.txt')

The file that triggered my initial investigating is not relevant (and way
too big), and this code creates the exact file content I used for testing.
It's about as "minimal working example" as I could create without going
code-golf on it.


Reply to this email directly or view it on GitHub
#202 (comment).

@hadley
Copy link
Member

hadley commented Jul 9, 2015

@kevinushey oooooooh, maybe

@hadley
Copy link
Member

hadley commented Jul 10, 2015

Ah yes, it looks like that error is thrown by mkCharLenCE(). @kevinushey is there a good way to handle this problem in general?

@kevinushey
Copy link

You can avoid R errors longjmping by using R_ToplevelExec. Unfortunately the interface is kind of clunky and there's not really a 'clean' way of calling an arbitrary function... there's probably some magic you could do with templates, but it would take some work to get right.

Here's an example:

#include <Rcpp.h>
using namespace Rcpp;

namespace safe {

struct MkCharLenCEData {
  const char* data;
  int n;
  cetype_t enc;
  SEXP result;
};

namespace internal {

void mkCharLenCE(void* pData) {

  MkCharLenCEData* pCall = (MkCharLenCEData*) pData;

  pCall->result = Rf_mkCharLenCE(
    pCall->data,
    pCall->n,
    pCall->enc
  );
}

} // namespace internal

SEXP mkCharLenCE(const char* data, int n, cetype_t enc)
{
  MkCharLenCEData callData;
  callData.data = data;
  callData.n = n;
  callData.enc = enc;
  callData.result = Rf_mkChar("");

  Rboolean ok = R_ToplevelExec(
    internal::mkCharLenCE,
    (void*) &callData
  );

  if (!ok)
    Rprintf("Failed to get character length\n");

  return callData.result;
}

} // namespace safe

// [[Rcpp::export]]
CharacterVector test() {
  CharacterVector result(1);

  result[0] = safe::mkCharLenCE("foo\0bar", 7, CE_UTF8);
  Rprintf("Still got here!\n");
  return result;
}

/*** R
test()
*/

gives me:

> test()
Error: embedded nul in string: 'foo\0bar'
Failed to get character length
Still got here!
[1] ""

It seems like adding something like this in Rcpp (e.g. a generic template function that provides a means of executing some R API function 'safely') would be useful, just might be tricky to make generic enough.

@romainfrancois
Copy link
Member

@kevinushey did we not do precisely this in Rcpp11, although perhaps with a bit of forbidden stuff.

What we need is access to contexts and some documentation on how to use them ... but hey not sure this will happen until Rcpp* is under 50% of CRAN packages ...

@kevinushey
Copy link

@romainfrancois you're right; that was all the stuff dragging contexts out to power try_catch (I had forgotten about this)

@hadley
Copy link
Member

hadley commented Jul 10, 2015

Hmmmm, seems like a lot of work for this case. Might be possible to tackle by checking the string myself

@hadley
Copy link
Member

hadley commented Sep 3, 2015

@kevinushey has anything changed in Rcpp that would affect this?

@kevinushey
Copy link

@hadley not directly, no -- I think you'll have to just check the string yourself.

@hadley hadley changed the title collectorsGuess/read_tokens failure leaves file locked (windows) Tokenizer needs to handle nulls Sep 3, 2015
hadley added a commit that referenced this issue Sep 22, 2015
Still needs to be plumbed into warning system.
@hadley hadley closed this as completed in da39530 Sep 23, 2015
@lock lock bot locked and limited conversation to collaborators Sep 25, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants