see what happens if CHARSXPs are not allowed to have embedded nuls

git-svn-id: https://svn.r-project.org/R/trunk@45416 00db46b3-68df-0310-9c12-caf00c1e9a41
wch · Apr 21, 2008 · 7e8027d
1 parent c2a236f
Showing 40 changed files with 140 additions and 180 deletions.
diff --git a/NEWS b/NEWS
@@ -35,8 +35,6 @@ NEW FEATURES
     o	tools::texi2dvi() has a new argument 'texinputs' to allow the
 	TeX and bibtex input paths to be specified (even on MiKTeX).
 
-    o	Encoding<-() now handles character strings with embedded nuls.
-
     o	setEPS() and setPS() gain '...' to allow other arguments to be
     	passed to ps.options(), including overriding 'width' and 'height'.
 
@@ -98,6 +96,10 @@ DEPRECATED & DEFUNCT
     o	In package installation, SaveImage: yes is now ignored, and
 	any use of the field will give a warning.
 
+    o	unserialize() no longer accepts character strings as input --
+	that was a format prior to R 2.4.0 which needs embedded nulls
+	in character strings.
+
     o	The C macro 'allocString' has been removed -- use 'mkChar', 
 	or 'allocVector' directly if really necessary.
 

diff --git a/doc/manual/R-ints.texi b/doc/manual/R-ints.texi
@@ -1283,6 +1283,23 @@ used to hold the finalizer function of a C finalizer (uncached) -- now
 @code{CHARSXP}s via @code{allocString} (removed in @R 2.8.0) and
 @code{allocVector(CHARSXP ...)} (deprecated in @R 2.8.0).
 
+Currently @code{CHARSXP}s with embedded nulls can be created by
+
+@itemize
+@item 
+parsing a character string containing @code{\0}. (New in @R{} 2.8.0.)
+@item 
+using @code{scan(allowEscapes=TRUE} on a string containing
+@code{\0}. (New in @R{} 2.8.0.)
+@item 
+by @code{readChar}, @code{rawToChar} or @code{intToUtf8}.
+@item 
+@code{load}ing a saved CHARSXP.  (Broken for version 2 saves in @R{} 2.6.x.)
+@end itemize
+
+@noindent
+This may change before release.
+
 @node Warnings and errors, S4 objects, The CHARSXP cache, R Internal Structures
 @section Warnings and errors
 

diff --git a/src/library/base/man/Comparison.Rd b/src/library/base/man/Comparison.Rd
@@ -58,12 +58,6 @@ x != y
   coerced to the type of the other, the (decreasing) order of precedence
   being character, complex, numeric, integer, logical and raw.
 
-  When comparisons are made between character strings, parts of the
-  strings after embedded \code{nul} characters are ignored.  (This is
-  necessary as the position of \code{nul} in the collation sequence is
-  undefined, and we want one of \code{<}, \code{==} and \code{>} to be
-  true for any comparison.)
-
   Missing values (\code{\link{NA}}) and \code{\link{NaN}} values are
   regarded as non-comparable even to themselves, so comparisons
   involving them will always result in \code{NA}.  Missing values can

diff --git a/src/library/base/man/abbreviate.Rd b/src/library/base/man/abbreviate.Rd
@@ -42,9 +42,6 @@ abbreviate(names.arg, minlength = 4, use.classes = TRUE,
 
   If \code{use.classes} is \code{FALSE} then the only distinction is to
   be between letters and space.  This has NOT been implemented.
-
-  Elements of \code{names.arg} with embedded nul bytes will be truncated
-  at the first nul.
 }
 \value{
   A character vector containing abbreviations for the strings in its

diff --git a/src/library/base/man/agrep.Rd b/src/library/base/man/agrep.Rd
@@ -55,8 +55,6 @@ agrep(pattern, x, ignore.case = FALSE, value = FALSE,
   space it only supports the first 65536 characters of UTF-8 (where all
   the characters for human languages lie).  Note that it can be quite
   slow in UTF-8, and \code{useBytes = TRUE} will be much faster.
-
-  Inputs with embedded nul bytes will be truncated at the first nul.
 }
 \value{
   Either a vector giving the indices of the elements that yielded a

diff --git a/src/library/base/man/cat.Rd b/src/library/base/man/cat.Rd
@@ -55,8 +55,7 @@ cat(\dots , file = "", sep = " ", fill = FALSE, labels = NULL,
   are handled.  Character strings are output \sQuote{as is} (unlike
   \code{\link{print.default}} which escapes non-printable characters and
   backslash --- use \code{\link{encodeString}} if you want to output
-  encoded strings using \code{cat}).  (Character strings with embedded
-  nuls are truncated at the first nul.)  Other types of \R object should be
+  encoded strings using \code{cat}).  Other types of \R object should be
   converted (e.g. by \code{\link{as.character}} or \code{\link{format}})
   before being passed to \code{cat}.
 

diff --git a/src/library/base/man/char.expand.Rd b/src/library/base/man/char.expand.Rd
@@ -21,8 +21,6 @@
   This function is particularly useful when abbreviations are allowed in
   function arguments, and need to be uniquely expanded with respect to a
   target table of possible values.
-
-  Inputs with embedded nul bytes will be truncated at the first nul.
 }
 \seealso{
   \code{\link{charmatch}} and \code{\link{pmatch}} for performing

diff --git a/src/library/base/man/charmatch.Rd b/src/library/base/man/charmatch.Rd
@@ -32,8 +32,6 @@ charmatch(x, table, nomatch = NA_integer_)
   returned and if no match is found then \code{nomatch} is returned.
 
   \code{NA} values are treated as the string constant \code{"NA"}.
-
-  Inputs with embedded nul bytes will be truncated at the first nul.
 }
 \value{
   An integer vector of the same length as \code{x}, giving the

diff --git a/src/library/base/man/chartr.Rd b/src/library/base/man/chartr.Rd
@@ -41,8 +41,6 @@ casefold(x, upper = FALSE)
 
   \code{casefold} is a wrapper for \code{tolower} and \code{toupper}
   provided for compatibility with S-PLUS.
-
-  Inputs with embedded nul bytes will be truncated at the first nul.
 }
 \value{
   A character vector of the same length and with the same attributes as

diff --git a/src/library/base/man/duplicated.Rd b/src/library/base/man/duplicated.Rd
@@ -58,9 +58,6 @@ duplicated(x, incomparables = FALSE, \dots)
 
   Missing values are regarded as equal, but \code{NaN} is not equal to
   \code{NA_real_}.
-
-  Strings with embedded nuls of the same length will be considered
-  equal if they agree when truncated at the first nul.
 }
 \section{Warning}{
   Using this for lists is potentially slow, especially if the elements

diff --git a/src/library/base/man/encodeString.Rd b/src/library/base/man/encodeString.Rd
@@ -1,6 +1,6 @@
 % File src/library/base/man/encodeString.Rd
 % Part of the R package, http://www.R-project.org
-% Copyright 1995-2007 R Core Development Team
+% Copyright 1995-2008 R Core Development Team
 % Distributed under GPL 2 or later
 
 \name{encodeString}
@@ -33,8 +33,8 @@ encodeString(x, width = 0, quote = "", na.encode = TRUE,
 \details{
   This escapes backslash and the control characters \code{\a} (bell),
   \code{\b} (backspace), \code{\f} (formfeed), \code{\n} (line feed),
-  \code{\r} (carriage return), \code{\t} (tab), \code{\v} (vertical tab)
-  and \code{\0} (nul) as well as any non-printable characters in a
+  \code{\r} (carriage return), \code{\t} (tab) and \code{\v} (vertical tab)
+  as well as any non-printable characters in a
   single-byte locale, which are printed in octal notation
   (\code{\xyz} with leading zeroes).
 #ifdef unix

diff --git a/src/library/base/man/format.Rd b/src/library/base/man/format.Rd
@@ -104,8 +104,6 @@ format(x, \dots)
 
   Raw vectors are converted to their 2-digit hexadecimal representation
   by \code{\link{as.character}}.
-
-  Character inputs with embedded nul bytes will be truncated at the first nul.
 }
 \value{
   An object of similar structure to \code{x} containing character

diff --git a/src/library/base/man/formatc.Rd b/src/library/base/man/formatc.Rd
@@ -126,8 +126,6 @@ prettyNum(x, big.mark = "",   big.interval = 3,
   unexpectedly if \code{x} is a \code{character} vector not resulting from
   something like \code{format(<number>)}: in particular it assumes that
   a period is a decimal mark.
-
-  Character inputs with embedded nul bytes will be truncated at the first nul.
 }
 \author{
   \code{formatC} was originally written by Bill Dunlap, later much

diff --git a/src/library/base/man/grep.Rd b/src/library/base/man/grep.Rd
@@ -93,8 +93,6 @@ gregexpr(pattern, text, ignore.case = FALSE, extended = TRUE,
 
   PCRE only supports caseless matching for a non-ASCII pattern in a
   UTF-8 locale (and not for \code{useBytes = TRUE} in any locale).
-
-  Inputs with embedded nul bytes will be truncated at the first nul.
 }
 \value{
   For \code{grep} a vector giving either the indices of the elements of

diff --git a/src/library/base/man/iconv.Rd b/src/library/base/man/iconv.Rd
@@ -60,8 +60,6 @@ iconvlist()
 
   As from \R 2.7.0 \code{"UTF8"} will be accepted as meaning the (more
   correct) \code{"UTF-8"}.
-
-  Inputs \code{x} with embedded nul bytes will be handled completely.
 }
 \value{
   A character vector of the same length and the same attributes as

diff --git a/src/library/base/man/identical.Rd b/src/library/base/man/identical.Rd
@@ -50,9 +50,8 @@ identical(x, y)
   \code{\link{NA_real_}}, but all \code{NaN}s are equal (and all \code{NA}
   of the same type are equal).
 
-  Comparison of character strings allows for embedded \code{nul}
-  characters.  Comparison of attributes view them as a set (and not a
-  vector, so order is not tested).
+  Comparison of attributes view them as a set (and not a vector, so
+  order is not tested).
 }
 \value{
   A single logical value, \code{TRUE} or \code{FALSE}, never \code{NA}

diff --git a/src/library/base/man/make.names.Rd b/src/library/base/man/make.names.Rd
@@ -43,8 +43,6 @@ make.names(names, unique = FALSE, allow_ = TRUE)
   \code{allow_ = FALSE} is also useful when creating names for export to
   applications which do not allow underline in names (for example,
   S-PLUS and some DBMSs).
-
-  Inputs with embedded nul bytes will be truncated at the first nul.
 }
 \seealso{
   \code{\link{make.unique}},

diff --git a/src/library/base/man/make.unique.Rd b/src/library/base/man/make.unique.Rd
@@ -31,8 +31,6 @@ make.unique(names, sep = ".")
 
   If character vector \code{A} is already unique, then
   \code{make.unique(c(A, B))} preserves \code{A}.
-
-  Inputs with embedded nul bytes will be truncated at the first nul.
 }
 \author{Thomas P Minka}
 \seealso{

diff --git a/src/library/base/man/match.Rd b/src/library/base/man/match.Rd
@@ -61,8 +61,6 @@ x \%in\% table
   For all types, \code{NA} matches \code{NA} and no other value.
   For real and complex values, \code{NaN} values are regarded
   as matching any other \code{NaN} value, but not matching \code{NA}.
-
-  Character inputs with embedded nul bytes will be truncated at the first nul.
 }
 \references{
   Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988)

diff --git a/src/library/base/man/nchar.Rd b/src/library/base/man/nchar.Rd
@@ -42,10 +42,6 @@ nzchar(x)
   These will often be the same, and almost always will be in single-byte
   locales.  There will be differences between the first two with
   multibyte character sequences, e.g. in UTF-8 locales.
-  If the byte stream contains embedded \code{nul} bytes,
-  \code{type = "bytes"} looks at all the bytes whereas the other two
-  types look only at the string as printed by \code{cat}, up to the
-  first \code{nul} byte.
 
   The internal equivalent of the default method of
   \code{\link{as.character}} is performed on \code{x} (so there is no
@@ -72,11 +68,6 @@ nzchar(x)
   will be used to \code{print()} the string.  Use
   \code{\link{encodeString}} to find the characters used to print the
   string.
-
-  Embedded \code{nul} bytes are included in the byte count (but not the
-  final \code{nul}).  In contrast, characters are counted up to the
-  string terminator (the first \code{nul} that is not part of a
-  character representation).
 }
 \references{
   Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988)

diff --git a/src/library/base/man/paste.Rd b/src/library/base/man/paste.Rd
@@ -36,8 +36,6 @@ paste(\dots, sep = " ", collapse = NULL)
   If a value is specified for \code{collapse}, the values in the result
   are then concatenated into a single string, with the elements being
   separated by the value of \code{collapse}.
-
-  Character inputs with embedded nul bytes will be truncated at the first nul.
 }
 \value{
   A character vector of the concatenated values.  This will be of length

diff --git a/src/library/base/man/pmatch.Rd b/src/library/base/man/pmatch.Rd
@@ -48,8 +48,6 @@ pmatch(x, table, nomatch = NA_integer_, duplicates.ok = FALSE)
   does match empty strings, and it does not allow multiple exact matches.
 
   \code{NA} values are treated as if they were the string constant \code{"NA"}.
-
-  Inputs with embedded nul bytes will be truncated at the first nul.
 }
 \value{
   An integer vector (possibly including \code{NA} if \code{nomatch =

diff --git a/src/library/base/man/rawConversion.Rd b/src/library/base/man/rawConversion.Rd
@@ -39,7 +39,8 @@ packBits(x, type = c("raw", "integer"))
 
   \code{rawToChar} converts raw bytes either to a single character
   string or a character vector of single bytes.  (Note that a single
-  character string could contain embedded nuls.)
+  character string could contain embedded nuls, in which case it will be
+  truncated at the first nul with a warning.)
 
   \code{rawToBits} returns a raw vector of 8 times the length of a raw
   vector with entries 0 or 1.  \code{intToBits} returns a raw vector

diff --git a/src/library/base/man/readChar.Rd b/src/library/base/man/readChar.Rd
@@ -53,10 +53,8 @@ writeChar(object, con,
   should be returned.  
 
   Character strings containing ASCII \code{nul}(s) will be read
-  correctly by \code{readChar} and appear with embedded nuls in the
-  character vector returned.  \code{writeChar} can write strings with
-  embedded \code{nul}s, and for such strings inteprets \code{nchar} as
-  the number of bytes to be written.
+  correctly by \code{readChar} but truncated at the first
+  \code{nul} with a warning.
 
   If the character length requested for \code{readChar} is longer than
   the data available on the connection, what is available is

diff --git a/src/library/base/man/scan.Rd b/src/library/base/man/scan.Rd
@@ -243,10 +243,8 @@ scan(file = "", what = double(0), nmax = -1, n = -1, sep = "",
   chars: use an explicit separator to avoid this.
 
   Having \code{nul} bytes in fields may lead to interpretation of the
-  field being terminated at the \code{nul} (so they are fine in
-  character fields).  \R 2.8.0 handles these better than earlier
-  versions, but they not normally present in text files -- see
-  \code{\link{readBin}}.
+  field being terminated at the \code{nul}.  They not normally present
+  in text files -- see \code{\link{readBin}} and \code{\link{readChar}}.
 }
 \references{
   Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988)

diff --git a/src/library/base/man/serialize.Rd b/src/library/base/man/serialize.Rd
@@ -25,8 +25,8 @@ unserialize(connection, refhook = NULL)
 \arguments{
   \item{object}{\R object to serialize.}
   \item{connection}{an open connection or (for \code{serialize})
-    \code{NULL} or (for \code{unserialize}) a raw vector or a length-one
-    character vector (see \sQuote{Details}).}
+    \code{NULL} or (for \code{unserialize}) a raw vector
+    (see \sQuote{Details}).}
   \item{file}{a connection or the name of the file where the R object
     is saved to or read from.}
   \item{ascii}{a logical.  If \code{TRUE}, an ASCII representation is
@@ -51,8 +51,7 @@ unserialize(connection, refhook = NULL)
   across separate calls to \code{serialize}.
 
   \code{unserialize} reads an object (as written by \code{serialize})
-  from \code{connection} or a raw vector or (for compatibility with
-  earlier versions of \code{serialize}) a length-one character vector.
+  from \code{connection} or a raw vector.
 
   The \code{refhook} functions can be used to customize handling of
   non-system reference objects (all external pointers and weak
@@ -89,9 +88,6 @@ unserialize(connection, refhook = NULL)
 \examples{
 x <- serialize(list(1,2,3), NULL)
 unserialize(x)
-## test earlier interface as a length-one character vector
-y <- rawToChar(x)
-unserialize(y)
 }
 \keyword{internal}
 \keyword{file}
diff --git a/src/library/base/man/sprintf.Rd b/src/library/base/man/sprintf.Rd
@@ -109,8 +109,6 @@ gettextf(fmt, \dots, domain = NULL)
 
   There is a limit of 8192 bytes on elements of \code{fmt} and also on
   strings included by a \code{\%s} conversion specification.
-
-  Character inputs with embedded nul bytes will be truncated at the first nul.
 }
 
 \value{

diff --git a/src/library/base/man/strsplit.Rd b/src/library/base/man/strsplit.Rd
@@ -84,8 +84,6 @@ strsplit(x, split, extended = TRUE, fixed = FALSE, perl = FALSE)
   (non-empty) string, the first element of the output is \code{""}, but
   if there is a match at the end of the string, the output is the same
   as with the match removed. 
-
-  Inputs with embedded nul bytes will be truncated at the first nul.
 }
 \section{Warning}{
   The standard regular expression code has been reported to be very slow

diff --git a/src/library/base/man/strwrap.Rd b/src/library/base/man/strwrap.Rd
@@ -41,8 +41,6 @@ strwrap(x, width = 0.9 * getOption("width"), indent = 0,
 
   Indentation is relative to the number of characters in the prefix
   string.
-
-  Inputs with embedded nul bytes will be truncated at the first nul.
 }
 \examples{
 ## Read in file 'THANKS'.

diff --git a/src/library/base/man/substr.Rd b/src/library/base/man/substr.Rd
@@ -49,8 +49,6 @@ substring(text, first, last = 1000000) <- value
   the current locale (see \code{\link{Encoding}} if the corresponding
   input had a declared encoding and the current locale is either Latin-1
   or UTF-8.
-
-  Inputs with embedded nul bytes will be truncated at the first nul.
 }
 \value{
   For \code{substr}, a character vector of the same length and with the

diff --git a/src/library/base/man/unique.Rd b/src/library/base/man/unique.Rd
@@ -57,9 +57,6 @@ unique(x, incomparables = FALSE, \dots)
 
   Missing values are regarded as equal, but \code{NaN} is not equal to
   \code{NA_real_}.
-
-  Strings with embedded nuls of the same length will be considered
-  equal if they agree when truncated at the first nul.
 }
 \value{
   For a vector, an object of the same type of \code{x}, but with only

diff --git a/src/library/base/man/utf8Conversion.Rd b/src/library/base/man/utf8Conversion.Rd
@@ -30,7 +30,8 @@ intToUtf8(x, multiple = FALSE)
   \code{intToUtf8} converts a vector of (numeric) UTF-8 code points
   either to a single character string or a character vector of single
   characters.  (Note that a single character string could contain
-  embedded nuls.)  The \code{\link{Encoding}} is declared as
+  embedded nuls, in which case it will be truncated at the first nul,
+  with a warning.)  The \code{\link{Encoding}} is declared as
   \code{"UTF-8"}.
 }
 \examples{\dontrun{