Commit

see what happens if CHARSXPs are not allowed to have embedded nuls
git-svn-id: https://svn.r-project.org/R/trunk@45416 00db46b3-68df-0310-9c12-caf00c1e9a41
ripley committed Apr 21, 2008
1 parent c2a236f commit 7e8027d
Showing 40 changed files with 140 additions and 180 deletions.
6 changes: 4 additions & 2 deletions NEWS
@@ -35,8 +35,6 @@ NEW FEATURES
o tools::texi2dvi() has a new argument 'texinputs' to allow the
TeX and bibtex input paths to be specified (even on MiKTeX).

o Encoding<-() now handles character strings with embedded nuls.

o setEPS() and setPS() gain '...' to allow other arguments to be
passed to ps.options(), including overriding 'width' and 'height'.

@@ -98,6 +96,10 @@ DEPRECATED & DEFUNCT
o In package installation, SaveImage: yes is now ignored, and
any use of the field will give a warning.

o unserialize() no longer accepts character strings as input --
that was a format prior to R 2.4.0 which needs embedded nulls
in character strings.

o The C macro 'allocString' has been removed -- use 'mkChar',
or 'allocVector' directly if really necessary.

17 changes: 17 additions & 0 deletions doc/manual/R-ints.texi
@@ -1283,6 +1283,23 @@ used to hold the finalizer function of a C finalizer (uncached) -- now
@code{CHARSXP}s via @code{allocString} (removed in @R 2.8.0) and
@code{allocVector(CHARSXP ...)} (deprecated in @R 2.8.0).

Currently @code{CHARSXP}s with embedded nulls can be created by

@itemize
@item
parsing a character string containing @code{\0}. (New in @R{} 2.8.0.)
@item
using @code{scan(allowEscapes=TRUE} on a string containing
@code{\0}. (New in @R{} 2.8.0.)
@item
by @code{readChar}, @code{rawToChar} or @code{intToUtf8}.
@item
@code{load}ing a saved CHARSXP. (Broken for version 2 saves in @R{} 2.6.x.)
@end itemize

@noindent
This may change before release.

@node Warnings and errors, S4 objects, The CHARSXP cache, R Internal Structures
@section Warnings and errors

6 changes: 0 additions & 6 deletions src/library/base/man/Comparison.Rd
@@ -58,12 +58,6 @@ x != y
coerced to the type of the other, the (decreasing) order of precedence
being character, complex, numeric, integer, logical and raw.

When comparisons are made between character strings, parts of the
strings after embedded \code{nul} characters are ignored. (This is
necessary as the position of \code{nul} in the collation sequence is
undefined, and we want one of \code{<}, \code{==} and \code{>} to be
true for any comparison.)

Missing values (\code{\link{NA}}) and \code{\link{NaN}} values are
regarded as non-comparable even to themselves, so comparisons
involving them will always result in \code{NA}. Missing values can
3 changes: 0 additions & 3 deletions src/library/base/man/abbreviate.Rd
@@ -42,9 +42,6 @@ abbreviate(names.arg, minlength = 4, use.classes = TRUE,

If \code{use.classes} is \code{FALSE} then the only distinction is to
be between letters and space. This has NOT been implemented.

Elements of \code{names.arg} with embedded nul bytes will be truncated
at the first nul.
}
\value{
A character vector containing abbreviations for the strings in its
2 changes: 0 additions & 2 deletions src/library/base/man/agrep.Rd
@@ -55,8 +55,6 @@ agrep(pattern, x, ignore.case = FALSE, value = FALSE,
space it only supports the first 65536 characters of UTF-8 (where all
the characters for human languages lie). Note that it can be quite
slow in UTF-8, and \code{useBytes = TRUE} will be much faster.

Inputs with embedded nul bytes will be truncated at the first nul.
}
\value{
Either a vector giving the indices of the elements that yielded a
3 changes: 1 addition & 2 deletions src/library/base/man/cat.Rd
@@ -55,8 +55,7 @@ cat(\dots , file = "", sep = " ", fill = FALSE, labels = NULL,
are handled. Character strings are output \sQuote{as is} (unlike
\code{\link{print.default}} which escapes non-printable characters and
backslash --- use \code{\link{encodeString}} if you want to output
encoded strings using \code{cat}). (Character strings with embedded
nuls are truncated at the first nul.) Other types of \R object should be
encoded strings using \code{cat}). Other types of \R object should be
converted (e.g. by \code{\link{as.character}} or \code{\link{format}})
before being passed to \code{cat}.

2 changes: 0 additions & 2 deletions src/library/base/man/char.expand.Rd
@@ -21,8 +21,6 @@
This function is particularly useful when abbreviations are allowed in
function arguments, and need to be uniquely expanded with respect to a
target table of possible values.

Inputs with embedded nul bytes will be truncated at the first nul.
}
\seealso{
\code{\link{charmatch}} and \code{\link{pmatch}} for performing
2 changes: 0 additions & 2 deletions src/library/base/man/charmatch.Rd
@@ -32,8 +32,6 @@ charmatch(x, table, nomatch = NA_integer_)
returned and if no match is found then \code{nomatch} is returned.

\code{NA} values are treated as the string constant \code{"NA"}.

Inputs with embedded nul bytes will be truncated at the first nul.
}
\value{
An integer vector of the same length as \code{x}, giving the
2 changes: 0 additions & 2 deletions src/library/base/man/chartr.Rd
@@ -41,8 +41,6 @@ casefold(x, upper = FALSE)

\code{casefold} is a wrapper for \code{tolower} and \code{toupper}
provided for compatibility with S-PLUS.

Inputs with embedded nul bytes will be truncated at the first nul.
}
\value{
A character vector of the same length and with the same attributes as
3 changes: 0 additions & 3 deletions src/library/base/man/duplicated.Rd
@@ -58,9 +58,6 @@ duplicated(x, incomparables = FALSE, \dots)

Missing values are regarded as equal, but \code{NaN} is not equal to
\code{NA_real_}.

Strings with embedded nuls of the same length will be considered
equal if they agree when truncated at the first nul.
}
\section{Warning}{
Using this for lists is potentially slow, especially if the elements
6 changes: 3 additions & 3 deletions src/library/base/man/encodeString.Rd
@@ -1,6 +1,6 @@
% File src/library/base/man/encodeString.Rd
% Part of the R package, http://www.R-project.org
% Copyright 1995-2007 R Core Development Team
% Copyright 1995-2008 R Core Development Team
% Distributed under GPL 2 or later

\name{encodeString}
@@ -33,8 +33,8 @@ encodeString(x, width = 0, quote = "", na.encode = TRUE,
\details{
This escapes backslash and the control characters \code{\a} (bell),
\code{\b} (backspace), \code{\f} (formfeed), \code{\n} (line feed),
\code{\r} (carriage return), \code{\t} (tab), \code{\v} (vertical tab)
and \code{\0} (nul) as well as any non-printable characters in a
\code{\r} (carriage return), \code{\t} (tab) and \code{\v} (vertical tab)
as well as any non-printable characters in a
single-byte locale, which are printed in octal notation
(\code{\xyz} with leading zeroes).
#ifdef unix
2 changes: 0 additions & 2 deletions src/library/base/man/format.Rd
@@ -104,8 +104,6 @@ format(x, \dots)

Raw vectors are converted to their 2-digit hexadecimal representation
by \code{\link{as.character}}.

Character inputs with embedded nul bytes will be truncated at the first nul.
}
\value{
An object of similar structure to \code{x} containing character
2 changes: 0 additions & 2 deletions src/library/base/man/formatc.Rd
@@ -126,8 +126,6 @@ prettyNum(x, big.mark = "", big.interval = 3,
unexpectedly if \code{x} is a \code{character} vector not resulting from
something like \code{format(<number>)}: in particular it assumes that
a period is a decimal mark.

Character inputs with embedded nul bytes will be truncated at the first nul.
}
\author{
\code{formatC} was originally written by Bill Dunlap, later much
2 changes: 0 additions & 2 deletions src/library/base/man/grep.Rd
@@ -93,8 +93,6 @@ gregexpr(pattern, text, ignore.case = FALSE, extended = TRUE,

PCRE only supports caseless matching for a non-ASCII pattern in a
UTF-8 locale (and not for \code{useBytes = TRUE} in any locale).

Inputs with embedded nul bytes will be truncated at the first nul.
}
\value{
For \code{grep} a vector giving either the indices of the elements of
2 changes: 0 additions & 2 deletions src/library/base/man/iconv.Rd
@@ -60,8 +60,6 @@ iconvlist()

As from \R 2.7.0 \code{"UTF8"} will be accepted as meaning the (more
correct) \code{"UTF-8"}.

Inputs \code{x} with embedded nul bytes will be handled completely.
}
\value{
A character vector of the same length and the same attributes as
5 changes: 2 additions & 3 deletions src/library/base/man/identical.Rd
@@ -50,9 +50,8 @@ identical(x, y)
\code{\link{NA_real_}}, but all \code{NaN}s are equal (and all \code{NA}
of the same type are equal).

Comparison of character strings allows for embedded \code{nul}
characters. Comparison of attributes view them as a set (and not a
vector, so order is not tested).
Comparison of attributes view them as a set (and not a vector, so
order is not tested).
}
\value{
A single logical value, \code{TRUE} or \code{FALSE}, never \code{NA}
2 changes: 0 additions & 2 deletions src/library/base/man/make.names.Rd
@@ -43,8 +43,6 @@ make.names(names, unique = FALSE, allow_ = TRUE)
\code{allow_ = FALSE} is also useful when creating names for export to
applications which do not allow underline in names (for example,
S-PLUS and some DBMSs).

Inputs with embedded nul bytes will be truncated at the first nul.
}
\seealso{
\code{\link{make.unique}},
2 changes: 0 additions & 2 deletions src/library/base/man/make.unique.Rd
@@ -31,8 +31,6 @@ make.unique(names, sep = ".")

If character vector \code{A} is already unique, then
\code{make.unique(c(A, B))} preserves \code{A}.

Inputs with embedded nul bytes will be truncated at the first nul.
}
\author{Thomas P Minka}
\seealso{
2 changes: 0 additions & 2 deletions src/library/base/man/match.Rd
@@ -61,8 +61,6 @@ x \%in\% table
For all types, \code{NA} matches \code{NA} and no other value.
For real and complex values, \code{NaN} values are regarded
as matching any other \code{NaN} value, but not matching \code{NA}.
Character inputs with embedded nul bytes will be truncated at the first nul.
}
\references{
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988)
9 changes: 0 additions & 9 deletions src/library/base/man/nchar.Rd
@@ -42,10 +42,6 @@ nzchar(x)
These will often be the same, and almost always will be in single-byte
locales. There will be differences between the first two with
multibyte character sequences, e.g. in UTF-8 locales.
If the byte stream contains embedded \code{nul} bytes,
\code{type = "bytes"} looks at all the bytes whereas the other two
types look only at the string as printed by \code{cat}, up to the
first \code{nul} byte.

The internal equivalent of the default method of
\code{\link{as.character}} is performed on \code{x} (so there is no
@@ -72,11 +68,6 @@ nzchar(x)
will be used to \code{print()} the string. Use
\code{\link{encodeString}} to find the characters used to print the
string.

Embedded \code{nul} bytes are included in the byte count (but not the
final \code{nul}). In contrast, characters are counted up to the
string terminator (the first \code{nul} that is not part of a
character representation).
}
\references{
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988)
2 changes: 0 additions & 2 deletions src/library/base/man/paste.Rd
@@ -36,8 +36,6 @@ paste(\dots, sep = " ", collapse = NULL)
If a value is specified for \code{collapse}, the values in the result
are then concatenated into a single string, with the elements being
separated by the value of \code{collapse}.

Character inputs with embedded nul bytes will be truncated at the first nul.
}
\value{
A character vector of the concatenated values. This will be of length
2 changes: 0 additions & 2 deletions src/library/base/man/pmatch.Rd
@@ -48,8 +48,6 @@ pmatch(x, table, nomatch = NA_integer_, duplicates.ok = FALSE)
does match empty strings, and it does not allow multiple exact matches.

\code{NA} values are treated as if they were the string constant \code{"NA"}.

Inputs with embedded nul bytes will be truncated at the first nul.
}
\value{
An integer vector (possibly including \code{NA} if \code{nomatch =
3 changes: 2 additions & 1 deletion src/library/base/man/rawConversion.Rd
@@ -39,7 +39,8 @@ packBits(x, type = c("raw", "integer"))

\code{rawToChar} converts raw bytes either to a single character
string or a character vector of single bytes. (Note that a single
character string could contain embedded nuls.)
character string could contain embedded nuls, in which case it will be
truncated at the first nul with a warning.)

\code{rawToBits} returns a raw vector of 8 times the length of a raw
vector with entries 0 or 1. \code{intToBits} returns a raw vector
6 changes: 2 additions & 4 deletions src/library/base/man/readChar.Rd
@@ -53,10 +53,8 @@ writeChar(object, con,
should be returned.

Character strings containing ASCII \code{nul}(s) will be read
correctly by \code{readChar} and appear with embedded nuls in the
character vector returned. \code{writeChar} can write strings with
embedded \code{nul}s, and for such strings inteprets \code{nchar} as
the number of bytes to be written.
correctly by \code{readChar} but truncated at the first
\code{nul} with a warning.

If the character length requested for \code{readChar} is longer than
the data available on the connection, what is available is
6 changes: 2 additions & 4 deletions src/library/base/man/scan.Rd
@@ -243,10 +243,8 @@ scan(file = "", what = double(0), nmax = -1, n = -1, sep = "",
chars: use an explicit separator to avoid this.
Having \code{nul} bytes in fields may lead to interpretation of the
field being terminated at the \code{nul} (so they are fine in
character fields). \R 2.8.0 handles these better than earlier
versions, but they not normally present in text files -- see
\code{\link{readBin}}.
field being terminated at the \code{nul}. They not normally present
in text files -- see \code{\link{readBin}} and \code{\link{readChar}}.
}
\references{
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988)
10 changes: 3 additions & 7 deletions src/library/base/man/serialize.Rd
@@ -25,8 +25,8 @@ unserialize(connection, refhook = NULL)
\arguments{
\item{object}{\R object to serialize.}
\item{connection}{an open connection or (for \code{serialize})
\code{NULL} or (for \code{unserialize}) a raw vector or a length-one
character vector (see \sQuote{Details}).}
\code{NULL} or (for \code{unserialize}) a raw vector
(see \sQuote{Details}).}
\item{file}{a connection or the name of the file where the R object
is saved to or read from.}
\item{ascii}{a logical. If \code{TRUE}, an ASCII representation is
@@ -51,8 +51,7 @@ unserialize(connection, refhook = NULL)
across separate calls to \code{serialize}.

\code{unserialize} reads an object (as written by \code{serialize})
from \code{connection} or a raw vector or (for compatibility with
earlier versions of \code{serialize}) a length-one character vector.
from \code{connection} or a raw vector.

The \code{refhook} functions can be used to customize handling of
non-system reference objects (all external pointers and weak
@@ -89,9 +88,6 @@ unserialize(connection, refhook = NULL)
\examples{
x <- serialize(list(1,2,3), NULL)
unserialize(x)
## test earlier interface as a length-one character vector
y <- rawToChar(x)
unserialize(y)
}
\keyword{internal}
\keyword{file}
2 changes: 0 additions & 2 deletions src/library/base/man/sprintf.Rd
@@ -109,8 +109,6 @@ gettextf(fmt, \dots, domain = NULL)
There is a limit of 8192 bytes on elements of \code{fmt} and also on
strings included by a \code{\%s} conversion specification.
Character inputs with embedded nul bytes will be truncated at the first nul.
}
\value{
2 changes: 0 additions & 2 deletions src/library/base/man/strsplit.Rd
@@ -84,8 +84,6 @@ strsplit(x, split, extended = TRUE, fixed = FALSE, perl = FALSE)
(non-empty) string, the first element of the output is \code{""}, but
if there is a match at the end of the string, the output is the same
as with the match removed.

Inputs with embedded nul bytes will be truncated at the first nul.
}
\section{Warning}{
The standard regular expression code has been reported to be very slow
2 changes: 0 additions & 2 deletions src/library/base/man/strwrap.Rd
@@ -41,8 +41,6 @@ strwrap(x, width = 0.9 * getOption("width"), indent = 0,

Indentation is relative to the number of characters in the prefix
string.

Inputs with embedded nul bytes will be truncated at the first nul.
}
\examples{
## Read in file 'THANKS'.
2 changes: 0 additions & 2 deletions src/library/base/man/substr.Rd
@@ -49,8 +49,6 @@ substring(text, first, last = 1000000) <- value
the current locale (see \code{\link{Encoding}} if the corresponding
input had a declared encoding and the current locale is either Latin-1
or UTF-8.

Inputs with embedded nul bytes will be truncated at the first nul.
}
\value{
For \code{substr}, a character vector of the same length and with the
3 changes: 0 additions & 3 deletions src/library/base/man/unique.Rd
@@ -57,9 +57,6 @@ unique(x, incomparables = FALSE, \dots)

Missing values are regarded as equal, but \code{NaN} is not equal to
\code{NA_real_}.

Strings with embedded nuls of the same length will be considered
equal if they agree when truncated at the first nul.
}
\value{
For a vector, an object of the same type of \code{x}, but with only
3 changes: 2 additions & 1 deletion src/library/base/man/utf8Conversion.Rd
@@ -30,7 +30,8 @@ intToUtf8(x, multiple = FALSE)
\code{intToUtf8} converts a vector of (numeric) UTF-8 code points
either to a single character string or a character vector of single
characters. (Note that a single character string could contain
embedded nuls.) The \code{\link{Encoding}} is declared as
embedded nuls, in which case it will be truncated at the first nul,
with a warning.) The \code{\link{Encoding}} is declared as
\code{"UTF-8"}.
}
\examples{\dontrun{

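Several of the help pages touched by this commit describe strings being "truncated at the first nul". The effect can be illustrated with a minimal C sketch. It assumes nothing about R's actual implementation: `visible_length` and `has_embedded_nul` are hypothetical helpers written for this illustration, not R API functions. It only shows why a routine that treats a buffer as a nul-terminated C string sees fewer bytes than the buffer holds.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of R): the length a nul-terminated
 * C routine sees, i.e. the bytes before the first nul. */
size_t visible_length(const char *buf)
{
    return strlen(buf);
}

/* Hypothetical helper (not part of R): returns 1 if any of the first
 * n bytes of buf is a nul, 0 otherwise. */
int has_embedded_nul(const char *buf, size_t n)
{
    return memchr(buf, '\0', n) != NULL;
}
```

For the five payload bytes of `"ab\0cd"`, `visible_length` reports 2 while `has_embedded_nul(buf, 5)` reports 1, so everything after the first nul is invisible to `strlen`-style code.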