Better String library
---------------------
by Paul Hsieh
The bstring library is an attempt to provide improved string processing
functionality to the C and C++ languages. At the heart of the bstring library
(Bstrlib for short) is the management of "bstring"s which are a significant
improvement over '\0' terminated char buffers.
===============================================================================
Motivation
----------
The standard C string library has serious problems:
1) Its use of '\0' to denote the end of the string means knowing a
string's length is O(n) when it could be O(1).
2) It imposes an interpretation for the character value '\0'.
3) gets() always exposes the application to a buffer overflow.
4) strtok() modifies the string it's parsing and thus may not be usable in
programs which are re-entrant or multithreaded.
5) fgets has the unusual semantic of ignoring '\0's that occur before
'\n's are consumed.
6) There is no memory management, and actions performed such as strcpy,
strcat and sprintf are common places for buffer overflows.
7) strncpy() doesn't '\0' terminate the destination in some cases.
8) Passing NULL to C library string functions causes an undefined NULL
pointer access.
9) Parameter aliasing (overlapping, or self-referencing parameters)
within most C library functions has undefined behavior.
10) Many C library string function calls take integer parameters with
restricted legal ranges. Parameters passed outside these ranges are
not typically detected and cause undefined behavior.
So the desire is to create an alternative string library that does not suffer
from the above problems and adds in the following functionality:
1) Incorporate string functionality seen from other languages.
a) MID$() - from BASIC
b) split()/join() - from Python
c) string/char x n - from Perl
2) Implement analogs to functions that combine stream IO and char buffers
without creating a dependency on stream IO functionality.
3) Implement the basic text editor-style functions insert, delete, find,
and replace.
4) Implement reference based sub-string access (as a generalization of
pointer arithmetic.)
5) Implement runtime write protection for strings.
There is also a desire to avoid "API-bloat", so functionality that can be
implemented trivially in terms of other functionality is omitted. There is no
left$() or right$() or reverse() or anything like that as part of the core
functionality.
Explaining Bstrings
-------------------
A bstring is basically a header which wraps a pointer to a char buffer. Let's
start with the declaration of a struct tagbstring:
struct tagbstring {
    int mlen;
    int slen;
    unsigned char * data;
};
This definition is considered exposed, not opaque (though it is neither
necessary nor recommended that low level maintenance of bstrings be performed
whenever the abstract interfaces are sufficient). The mlen field (usually)
describes a lower bound for the memory allocated for the data field. The
slen field describes the exact length for the bstring. The data field is a
single contiguous buffer of unsigned chars. Note that the existence of a '\0'
character in the unsigned char buffer pointed to by the data field does not
necessarily denote the end of the bstring.
To be a well formed modifiable bstring the mlen field must be at least the
length of the slen field, and slen must be non-negative. Furthermore, the
data field must point to a valid buffer in which access to the first mlen
characters has been acquired. So the minimal check for correctness is:
(slen >= 0 && mlen >= slen && data != NULL)
bstrings returned by bstring functions can be assumed to be either NULL or
satisfy the above property. (When bstrings are only readable, the mlen >=
slen restriction is not required; this is discussed later in this section.)
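As a minimal sketch, the two checks can be written as small predicates. The struct is repeated from the declaration above only so that the snippet compiles without bstrlib.h. The helper names is_writable_form and is_readable_form are hypothetical and are not part of Bstrlib.

```c
#include <stddef.h>

/* Repeated from the declaration above so the snippet stands alone. */
struct tagbstring {
    int mlen;
    int slen;
    unsigned char * data;
};

/* Hypothetical helper (not part of Bstrlib): the minimal check for a
   bstring that is going to be modified. */
int is_writable_form (const struct tagbstring * b) {
    return b != NULL && b->slen >= 0 && b->mlen >= b->slen && b->data != NULL;
}

/* Hypothetical helper (not part of Bstrlib): the reduced check for a
   bstring that is only read, where mlen is not consulted. */
int is_readable_form (const struct tagbstring * b) {
    return b != NULL && b->slen >= 0 && b->data != NULL;
}
```

Neither predicate can tell whether data really points to mlen accessible bytes; that part of the contract remains the responsibility of whoever built the header.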
A bstring itself is just a pointer to a struct tagbstring:
typedef struct tagbstring * bstring;
Note that use of the prefix "tag" in struct tagbstring is required to work
around the inconsistency between C and C++'s struct namespace usage. This
definition is also considered exposed.
Bstrlib basically manages bstrings allocated as a header and an associated
data-buffer. Since the implementation is exposed, they can also be
constructed manually. Functions which mutate bstrings assume that the header
and data buffer have been malloced; the bstring library may perform free() or
realloc() on both the header and data buffer of any bstring parameter.
Functions which return bstrings create new bstrings. The string memory is
freed by a bdestroy() call (or using the bstrFree macro).
The following related typedef is also provided:
typedef const struct tagbstring * const_bstring;
which is also considered exposed. These are directly bstring compatible (no
casting required) but are just used for parameters which are meant to be
non-mutable. So in general, bstring parameters which are read as input but
not meant to be modified will be declared as const_bstring, and bstring
parameters which may be modified will be declared as bstring. This convention
is recommended for user written functions as well.
Since bstrings maintain interoperability with C library char-buffer style
strings, all functions which modify, update or create bstrings also append a
'\0' character at data[slen], immediately after the last character of the
string. This trailing '\0' character is not required for bstrings input to
the bstring functions; it is provided solely as a convenience for
interoperability with standard C char-buffer functionality.
Analogs for the ANSI C string library functions have been created when they
are necessary, but have also been left out when they are not. In particular
there are no functions analogous to fwrite, or puts just for the purposes of
bstring. The ->data member of any string is exposed, and therefore can be
used just as easily as char buffers for C functions which read strings.
For those that wish to hand construct bstrings, the following should be kept
in mind:
1) While bstrlib can accept constructed bstrings without terminating
'\0' characters, the rest of the C language string library will not
function properly on such non-terminated strings. This is obvious
but must be kept in mind.
2) If it is intended that a constructed bstring be written to by the
bstring library functions then the data portion should be allocated
by the malloc function and the slen and mlen fields should be entered
properly. The struct tagbstring header is not reallocated, and only
freed by bdestroy.
3) Writing arbitrary '\0' characters at various places in the string
will not modify its length as perceived by the bstring library
functions. In fact, '\0' is a legitimate non-terminating character
for a bstring to contain.
4) For read only parameters, bstring functions do not check the mlen.
I.e., the minimal correctness requirements are reduced to:
(slen >= 0 && data != NULL)
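Points 3 and 4 can be illustrated with a hand-built, read-only header over a buffer that contains an embedded '\0'. This is only a sketch: the struct is repeated so the snippet stands alone, the names raw, sample, c_length and b_length are hypothetical, and setting mlen to -1 is an illustrative choice for a header that is never meant to be written through.

```c
#include <string.h>

/* Repeated from the declaration above so the snippet stands alone. */
struct tagbstring {
    int mlen;
    int slen;
    unsigned char * data;
};

/* Backing storage: seven string bytes, one of which is an embedded
   '\0', followed by a terminating '\0' for the benefit of C functions. */
static unsigned char raw[] = { 'a', 'b', 'c', '\0', 'd', 'e', 'f', '\0' };

/* Hand-built read-only header. Only slen and data matter for read-only
   use; the mlen value of -1 is an illustrative choice. */
struct tagbstring sample = { -1, 7, raw };

/* What the C library sees: it stops at the first '\0'. */
size_t c_length (void) {
    return strlen ((const char *) sample.data);
}

/* What a length-delimited view sees: the stored slen. */
int b_length (void) {
    return sample.slen;
}
```

Here c_length() stops at the embedded '\0' while b_length() reports all seven bytes, which is the mismatch point 1 warns about when such a string is handed to the C library.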
Better pointer arithmetic
-------------------------
One built-in feature of '\0' terminated char * strings is that it is very easy
and fast to obtain a reference to the tail of any string using pointer
arithmetic. Bstrlib does one better by providing a way to get a reference to
any substring of a bstring (or any other length delimited block of memory.)
So rather than just having pointer arithmetic, with bstrlib one essentially
has segment arithmetic. This is achieved using the macro blk2tbstr() which
builds a reference to a block of memory and the macro bmid2tbstr() which
builds a reference to a segment of a bstring. Bstrlib also includes
functions for direct consumption of memory blocks into bstrings, namely
bcatblk () and blk2bstr ().
One scenario where this can be extremely useful is when a string contains many
substrings which one would like to pass as read-only reference parameters to
some string consuming function without the need to allocate entire new
containers for the string data. More concretely, imagine parsing a command
line string whose parameters are space delimited. This can only be done for
tails of the string with '\0' terminated char * strings.
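The command line scenario can be sketched without the library by building the reference headers by hand. The helpers ref_segment and split_refs below are hypothetical stand-ins written for this illustration; they are not the blk2tbstr or bmid2tbstr macros, whose exact arguments should be taken from bstrlib.h.

```c
#include <stddef.h>

/* Repeated from the declaration above so the snippet stands alone. */
struct tagbstring {
    int mlen;
    int slen;
    unsigned char * data;
};

/* Hypothetical helper: build a read-only reference to len bytes of b
   starting at pos. No bytes are copied; the result aliases b's buffer.
   The caller is assumed to pass a segment that lies inside b. */
struct tagbstring ref_segment (const struct tagbstring * b, int pos, int len) {
    struct tagbstring t;
    t.mlen = -1;               /* illustrative "not writable" marker */
    t.slen = len;
    t.data = b->data + pos;
    return t;
}

/* Hypothetical helper: store a reference to each space-delimited word
   of line in out[], up to max entries. Returns the number stored. */
int split_refs (const struct tagbstring * line, struct tagbstring * out, int max) {
    int n = 0, i = 0;
    while (i < line->slen && n < max) {
        int start;
        while (i < line->slen && line->data[i] == ' ') i++;   /* skip spaces */
        start = i;
        while (i < line->slen && line->data[i] != ' ') i++;   /* scan a word */
        if (i > start) out[n++] = ref_segment (line, start, i - start);
    }
    return n;
}
```

Each entry in out[] refers to a word in the middle of the original buffer, with no allocation and no '\0' written into the source, which is what a tail-only char * pointer cannot express.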
Improved NULL semantics and error handling
------------------------------------------
Unless otherwise noted, if a NULL pointer is passed as a bstring or any other
detectably illegal parameter, the called function will return with an error
indicator (either NULL or BSTR_ERR) rather than simply performing a NULL
pointer access, or having undefined behavior.
To illustrate the value of this, consider the following example:
strcpy (p = malloc (13 * sizeof (char)), "Hello,");
strcat (p, " World");
This is not correct because malloc may return NULL (due to an out of memory
condition), and the behaviour of strcpy is undefined if either of its
parameters is NULL. However:
bconcat (p = bfromcstr ("Hello,"), q = bfromcstr (" World"));
bdestroy (q);
is well defined, because if either p or q is assigned NULL (indicating a
failure to allocate memory) both bconcat and bdestroy will recognize it and
perform no detrimental action.
Note that it is not necessary to check any of the members of a returned
bstring for internal correctness (in particular the data member does not need
to be checked against NULL when the header is non-NULL), since this is
assured by the bstring library itself.
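The convention can be shown in miniature with a toy string type. Everything in the sketch below (struct toystr, toy_from_cstr, toy_concat, toy_destroy, TOY_OK, TOY_ERR) is hypothetical and written only for this illustration; it is not Bstrlib's code, and it makes no attempt at the general alias handling that Bstrlib provides.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_OK    0
#define TOY_ERR (-1)

/* Toy stand-in for a length-delimited string header. */
struct toystr {
    int mlen;
    int slen;
    unsigned char * data;
};

/* Returns NULL on any failure, including a NULL argument. */
struct toystr * toy_from_cstr (const char * s) {
    struct toystr * b;
    size_t n;
    if (s == NULL) return NULL;
    n = strlen (s);
    b = malloc (sizeof *b);
    if (b == NULL) return NULL;
    b->data = malloc (n + 1);
    if (b->data == NULL) { free (b); return NULL; }
    memcpy (b->data, s, n + 1);
    b->slen = (int) n;
    b->mlen = (int) n + 1;
    return b;
}

/* Reports TOY_ERR instead of dereferencing a NULL argument. */
int toy_concat (struct toystr * dst, const struct toystr * src) {
    unsigned char * p;
    int n;
    if (dst == NULL || src == NULL) return TOY_ERR;
    if (dst->data == NULL || src->data == NULL) return TOY_ERR;
    n = src->slen;
    p = realloc (dst->data, (size_t) dst->slen + (size_t) n + 1);
    if (p == NULL) return TOY_ERR;          /* dst is left untouched */
    dst->data = p;
    memcpy (dst->data + dst->slen, src->data, (size_t) n);
    dst->slen += n;
    dst->mlen = dst->slen + 1;
    dst->data[dst->slen] = '\0';
    return TOY_OK;
}

/* Accepts NULL and reports it as an error rather than crashing. */
int toy_destroy (struct toystr * b) {
    if (b == NULL) return TOY_ERR;
    free (b->data);
    free (b);
    return TOY_OK;
}
```

With functions written this way, a chained call such as toy_concat (p = toy_from_cstr ("Hello,"), q = toy_from_cstr (" World")) needs only one check of the final return value, because a failed allocation earlier in the chain shows up as an error code rather than a crash.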
bStreams
--------
In addition to the bgets and bread functions, bstrlib can abstract streams
with a high performance read only stream called a bStream. In general, the
idea is to open a core stream (with something like fopen) then pass its
handle as well as a bNread function pointer (like fread) to the bsopen
function which will return a handle to an open bStream. Then the functions
bsread, bsreadln or bsreadlns can be called to read portions of the stream.
Finally, the bsclose function is called to close the bStream -- it will
return a handle to the original (core) stream. So bStreams, essentially,
wrap other streams.
The bStreams have two main advantages over the bgets and bread (as well as
fgets/ungetc) paradigms:
1) Improved functionality via the bunread function which allows a stream to
unread characters, giving the bStream stack-like functionality if so
desired.
2) A very high performance bsreadln function. The C library function fgets()
(and the bgets function) can typically be written as a loop on top of
fgetc(), thus paying all of the overhead costs of calling fgetc on a per
character basis. bsreadln will read blocks at a time, thus amortizing the
overhead of fread calls over many characters at once.
However, clearly bStreams are suboptimal or unusable for certain kinds of
streams (stdin) or certain usage patterns (a few spotty, or non-sequential
reads from a slow stream.) For those situations, using bgets will be more
appropriate.
The semantics of bStreams allows practical construction of layerable data
streams. What this means is that by writing a bNread compatible function on
top of a bStream, one can construct a new bStream on top of it. This can be
useful for writing multi-pass parsers that don't actually read the entire
input more than once and don't require the use of intermediate storage.
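To make the idea of a bNread compatible function concrete, here is a sketch of an fread-shaped reader over an in-memory buffer. The typedef is written from the description above (a function pointer "like fread") and is assumed, not verified here, to match the one in bstrlib.h. The names struct memsrc, memsrc_read and memsrc_reader are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Assumed to match the bNread typedef in bstrlib.h (same shape as fread,
   with the stream handle passed as a void pointer). */
typedef size_t (* bNread) (void * buff, size_t elsize, size_t nelem, void * parm);

/* Hypothetical in-memory source playing the role of the core stream. */
struct memsrc {
    const unsigned char * data;
    size_t len;
    size_t pos;
};

/* fread-like reader: copies up to nelem whole elements of elsize bytes
   from the memsrc passed in parm and returns the number of elements
   copied (0 at end of data or on bad arguments). */
size_t memsrc_read (void * buff, size_t elsize, size_t nelem, void * parm) {
    struct memsrc * m = (struct memsrc *) parm;
    size_t avail, n;
    if (buff == NULL || m == NULL || elsize == 0) return 0;
    avail = (m->len - m->pos) / elsize;
    n = nelem < avail ? nelem : avail;
    memcpy (buff, m->data + m->pos, n * elsize);
    m->pos += n * elsize;
    return n;
}

/* The function is assignable to the bNread pointer type. */
bNread memsrc_reader = memsrc_read;
```

A function pointer of this kind, paired with a pointer to its struct memsrc, is the sort of (reader, handle) pair the text describes handing to bsopen. A reader that itself pulls data from another bStream would have the same shape, which is what makes the layering possible.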
Aliasing
--------
Aliasing occurs when a function is given two parameters which point to data
structures which overlap in the memory they occupy. While this does not
disturb read only functions, for many libraries this can make functions that
write to these memory locations malfunction. This is a common problem of the
C standard library and especially the string functions in the C standard
library.
The C standard string library is entirely char by char oriented (as is
bstring) which makes conforming implementations alias safe for some
scenarios. However no actual detection of aliasing is typically performed,
so it is easy to find cases where the aliasing will cause anomalous or
undesirable behaviour (consider: strcat (p, p).) The C99 standard includes
the "restrict" pointer modifier which allows the compiler to document and
assume a no-alias condition on usage. However, only the most trivial cases
can be caught (if at all) by the compiler at compile time, and thus there is
no actual enforcement of non-aliasing.
Bstrlib, by contrast, permits aliasing and is completely aliasing safe, in
the C99 sense of aliasing. That is to say, under the assumption that
pointers of incompatible types from distinct objects can never alias, bstrlib
is completely aliasing safe. (In practice this means that the data buffer
portion of any bstring and header of any bstring are assumed to never alias.)
With the exception of the reference building macros, the library behaves as
if all read-only parameters are first copied and replaced by temporary
non-aliased parameters before any writing to any output bstring is performed
(though actual copying is extremely rarely ever done.)
Besides being a useful safety feature, bstring searching/comparison
functions can improve to O(1) execution when aliasing is detected.
Note that aliasing detection and handling code in Bstrlib is generally
extremely cheap. There is almost never any appreciable performance penalty
for using aliased parameters.
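A small sketch shows why length-delimited strings make this cheap. With strcat (p, p) the terminator that marks the end of the source is overwritten by the copy itself; with an explicit length, the source length can simply be read before any byte is written. The helper append_alias_safe is hypothetical, works only within the existing capacity, and is not how Bstrlib implements its concatenation.

```c
#include <string.h>

/* Repeated from the declaration above so the snippet stands alone. */
struct tagbstring {
    int mlen;
    int slen;
    unsigned char * data;
};

/* Hypothetical helper: append src to dst without reallocating. Returns 0
   on success, or -1 if an argument is invalid or dst lacks room for the
   new bytes plus a trailing '\0'. src may be dst itself or a reference
   into dst's buffer. */
int append_alias_safe (struct tagbstring * dst, const struct tagbstring * src) {
    int n;
    if (dst == NULL || src == NULL) return -1;
    if (dst->data == NULL || src->data == NULL) return -1;
    n = src->slen;                  /* snapshot before dst is changed */
    if (n < 0 || dst->slen < 0 || dst->mlen - dst->slen <= n) return -1;
    /* memmove tolerates overlap between source and destination bytes. */
    memmove (dst->data + dst->slen, src->data, (size_t) n);
    dst->slen += n;
    dst->data[dst->slen] = '\0';
    return 0;
}
```

The only alias-specific work is taking the snapshot of src->slen, which costs nothing, in line with the claim that alias handling is generally cheap.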
Reentrancy
----------
Nearly every function in Bstrlib is a leaf function, and is completely
reentrant with the exception of writing to common bstrings. The split
functions which use a callback mechanism require only that the source string
not be destroyed by the callback function unless the callback function returns
with an error status (note that Bstrlib functions which return an error do
not modify the string in any way.) The string can in fact be modified by the
callback and the behaviour is deterministic. See the documentation of the
various split functions for more details.
Undefined scenarios
-------------------
One of the basic important premises for Bstrlib is not to increase the
propagation of undefined situations from parameters that are otherwise legal
in and of themselves. In particular, except for extremely marginal cases, usages
of bstrings that use the bstring library functions alone cannot lead to any
undefined action. But due to C/C++ language and library limitations, there
is no way to define a non-trivial library that is completely without
undefined operations. All such possible undefined operations are described
below:
1) bstrings or struct tagbstrings that are not explicitly initialized cannot
be passed as a parameter to any bstring function.
2) The members of the NULL bstring cannot be accessed directly. (Though all
APIs and macros detect the NULL bstring.)
3) A bstring whose data member has not been obtained from a malloc or
compatible call and which is write accessible passed as a writable
parameter will lead to undefined results. (i.e., do not writeAllow any
constructed bstrings unless the data portion has been obtained from the
heap.)
4) If the headers of two strings alias but are not identical (which can only
happen via a defective manual construction), then passing them to a
bstring function in which one is writable is not defined.
5) If the mlen member is larger than the actual accessible length of the data
member for a writable bstring, or if the slen member is larger than the
readable length of the data member for a readable bstring, then the
corresponding bstring operations are undefined.
6) Any bstring definition whose header or accessible data portion has been
assigned to inaccessible or otherwise illegal memory clearly cannot be
acted upon by the bstring library in any way.
7) Destroying the source of an incremental split from within the callback
and not returning with a negative value (indicating that it should abort)
will lead to undefined behaviour. (Though *modifying* or adjusting the
state of the source data, even if those modifications fail within the
bstrlib API, has well defined behavior.)
8) Modifying a bstring which is write protected by direct access has
undefined behavior.
While this may seem like a long list, with the exception of invalid uses of
the writeAllow macro, and source destruction during an iterative split
without an accompanying abort, no usage of the bstring API alone can cause
any undefined scenario to occur. I.e., the policy of restricting usage of
bstrings to the bstring API can significantly reduce the risk of runtime
errors (in practice it should eliminate them) related to string manipulation
due to undefined action.
C++ wrapper
-----------
A C++ wrapper has been created to enable bstring functionality for C++ in the
most natural (for C++ programmers) way possible. The mandate for the C++
wrapper is different from the base C bstring library. Since the C++ language
has far more abstracting capabilities, the CBString structure is considered
fully abstracted -- i.e., hand generated CBStrings are not supported (though
conversion from a struct tagbstring is allowed) and all detectable errors are
manifest as thrown exceptions.
- The C++ class definitions are all under the namespace Bstrlib. bstrwrap.h
enables this namespace (with a using namespace Bstrlib; directive at the
end) unless the macro BSTRLIB_DONT_ASSUME_NAMESPACE has been defined before
it is included.
- Erroneous accesses result in an exception being thrown. The exception
parameter is of type "struct CBStringException" which is derived from
std::exception if STL is used. A verbose description of the error message
can be obtained from the what() method.
- CBString is a C++ structure derived from a struct tagbstring. An address
of a CBString cast to a bstring must not be passed to bdestroy. The bstring
C API has been made C++ safe and can be used directly in a C++ project.
- It includes constructors which can take a char, '\0' terminated char
buffer, tagbstring, (char, repeat-value), a length delimited buffer or a
CBStringList to initialize it.
- Concatenation is performed with the + and += operators. Comparisons are
done with the ==, !=, <, >, <= and >= operators. Note that == and != use
the biseq call, while <, >, <= and >= use bstrcmp.
- CBStrings can be directly cast to const character buffers.
- CBStrings can be directly cast to double, float, int or unsigned int so
long as the CBString is a decimal representation of that type (otherwise
an exception will be thrown). Converting the other way should be done with
the format(a) method(s).
- CBString contains the length, character and [] accessor methods. The
character and [] accessors are aliases of each other. If the bounds for
the string are exceeded, an exception is thrown. To avoid the overhead for
this check, first cast the CBString to a (const char *) and use [] to
dereference the array as normal. Note that the character and [] accessor
methods allow both reading and writing of individual characters.
- The methods: format, formata, find, reversefind, findcaseless,
reversefindcaseless, midstr, insert, insertchrs, replace, findreplace,
findreplacecaseless, remove, findchr, nfindchr, alloc, toupper, tolower,
gets, read are analogous to the functions that can be found in the C API.
- The caselessEqual and caselessCmp methods are analogous to biseqcaseless
and bstricmp functions respectively.
- Note that just like the bformat function, the format and formata methods do
not automatically cast CBStrings into char * strings for "%s"-type
substitutions:
CBString w("world");
CBString h("Hello");
CBString hw;
/* The casts are necessary */
hw.format ("%s, %s", (const char *)h, (const char *)w);
- The methods trunc and repeat have been added instead of using pattern.
- ltrim, rtrim and trim methods have been added. These remove characters
from a given character string set (defaulting to the whitespace characters)
from either the left, right or both ends of the CBString, respectively.
- The method setsubstr is also analogous in functionality to bsetstr, except
that it cannot be passed NULL. Instead the method fill and the fill-style
constructor have been supplied to enable this functionality.
- The writeprotect(), writeallow() and iswriteprotected() methods are
analogous to the bwriteprotect(), bwriteallow() and biswriteprotected()
macros in the C API. Write protection semantics in CBString are stronger
than with the C API in that indexed character assignment is checked for
write protection. However, unlike with the C API, a write protected
CBString can be destroyed by the destructor.
- CBStream is a C++ structure which wraps a struct bStream (it is not derived
from it, since destruction is slightly different). It is constructed by
passing in a bNread function pointer and a stream parameter cast to void *.
This structure includes methods for detecting eof, setting the buffer
length, reading the whole stream or reading entries line by line or block
by block, an unread function, and a peek function.
- If STL is available, the CBStringList structure is derived from a vector of
CBString with various split methods. The split method has been overloaded
to accept either a character or CBString as the second parameter (when the
split parameter is a CBString any character in that CBString is used as a
separator). The splitstr method takes a CBString as a substring separator.
Joins can be performed via a CBString constructor which takes a
CBStringList as a parameter, or just using the CBString::join() method.
- If there is proper support for std::iostreams, then the >> and << operators
and the getline() function have been added (with semantics the same as
those for std::string).
Multithreading
--------------
A mutable bstring is kind of analogous to a small (two entry) linked list
allocated by malloc, with all aliasing completely under programmer control.
I.e., manipulation of one bstring will never affect any other distinct
bstring unless explicitly constructed to do so by the programmer via hand
construction or via building a reference. Bstrlib also does not use any
static or global storage, so there are no hidden unremovable race conditions.
Bstrings are also clearly not inherently thread local. So just like
char *'s, bstrings can be passed around from thread to thread and shared and
so on, so long as modifications to a bstring correspond to some kind of
exclusive access lock as should be expected (or if the bstring is read-only,
which can be enforced by bstring write protection) for any sort of shared
object in a multithreaded environment.
Bsafe module
------------
For convenience, a bsafe module has been included. The idea is that if this
module is included, inadvertent usage of the most dangerous C functions will
be overridden and lead to an immediate run time abort. Of course, it should
be emphasized that usage of this module is completely optional. The
intention is essentially to provide an option for creating project safety
rules which can be enforced mechanically rather than socially. This is
useful for larger, or open development projects where it is more difficult to
enforce social rules or "coding conventions".
Problems not solved
-------------------
Bstrlib is written for the C and C++ languages, which have inherent weaknesses
that cannot be easily solved:
1. Memory leaks: Forgetting to call bdestroy on a bstring that is about to be
unreferenced is just like forgetting to call free on a heap buffer that is
about to be unreferenced. Bstrlib itself, however, is leak free.
2. Read before write usage: In C, declaring an auto bstring does not
automatically fill it with legal/valid contents. This problem has been
somewhat mitigated in C++. (The bstrDeclare and bstrFree macros from
bstraux can be used to help mitigate this problem.)
Other problems not addressed:
3. Built-in mutex usage to automatically avoid all bstring internal race
conditions in multitasking environments: The problem with trying to
implement such things at this low a level is that it is typically more
efficient to use locks in higher level primitives. There is also no
platform independent way to implement locks or mutexes.
4. Unicode/wide character support.
Note that except for spotty support of wide characters, the default C
standard library does not address any of these problems either.
Configurable compilation options
--------------------------------
All configuration options are meant solely for the purpose of compiler
compatibility. Configuration options are not meant to change the semantics
or capabilities of the library, except where it is unavoidable.
Since some C++ compilers don't include the Standard Template Library and some
have the options of disabling exception handling, a number of macros can be
used to conditionally compile support for each of these:
BSTRLIB_CAN_USE_STL
- defining this will enable the use of the Standard Template Library.
Defining BSTRLIB_CAN_USE_STL overrides the BSTRLIB_CANNOT_USE_STL macro.
BSTRLIB_CANNOT_USE_STL
- defining this will disable the use of the Standard Template Library.
Defining BSTRLIB_CAN_USE_STL overrides the BSTRLIB_CANNOT_USE_STL macro.
BSTRLIB_CAN_USE_IOSTREAM
- defining this will enable the use of streams from class std. Defining
BSTRLIB_CAN_USE_IOSTREAM overrides the BSTRLIB_CANNOT_USE_IOSTREAM macro.
BSTRLIB_CANNOT_USE_IOSTREAM
- defining this will disable the use of streams from class std. Defining
BSTRLIB_CAN_USE_IOSTREAM overrides the BSTRLIB_CANNOT_USE_IOSTREAM macro.
BSTRLIB_THROWS_EXCEPTIONS
- defining this will enable the exception handling within bstring.
Defining BSTRLIB_THROWS_EXCEPTIONS overrides the
BSTRLIB_DOESNT_THROW_EXCEPTIONS macro.
BSTRLIB_DOESNT_THROW_EXCEPTIONS
- defining this will disable the exception handling within bstring.
Defining BSTRLIB_THROWS_EXCEPTIONS overrides the
BSTRLIB_DOESNT_THROW_EXCEPTIONS macro.
Note that these macros must be defined consistently throughout all modules
that use CBStrings including bstrwrap.cpp.
Some older C compilers do not support functions such as vsnprintf. This is
handled by the following macro variables:
BSTRLIB_NOVSNP
- defining this indicates that the compiler does not support vsnprintf.
This will cause bformat and bformata to not be declared. Note that
for some compilers, such as Turbo C, this is set automatically.
Defining BSTRLIB_NOVSNP overrides the BSTRLIB_VSNP_OK macro.
BSTRLIB_VSNP_OK
- defining this will disable the autodetection of compilers that do not
support vsnprintf.
Defining BSTRLIB_NOVSNP overrides the BSTRLIB_VSNP_OK macro.
Semantic compilation options
----------------------------
Bstrlib comes with very few compilation options for changing the semantics of
the library. These are described below.
BSTRLIB_DONT_ASSUME_NAMESPACE
- Defining this before including bstrwrap.h will disable the automatic
enabling of the Bstrlib namespace for the C++ declarations.
BSTRLIB_DONT_USE_VIRTUAL_DESTRUCTOR
- Defining this will make the CBString destructor non-virtual.
BSTRLIB_MEMORY_DEBUG
- Defining this will cause the bstrlib modules bstrlib.c and bstrwrap.cpp
to invoke a #include "memdbg.h". memdbg.h has to be supplied by the user.
Note that these macros must be defined consistently throughout all modules
that use bstrings or CBStrings including bstrlib.c, bstraux.c and
bstrwrap.cpp.
===============================================================================
Files
-----
bstrlib.c - C implementation of bstring functions.
bstrlib.h - C header file for bstring functions.
bstraux.c - C example that implements trivial additional functions.
bstraux.h - C header for bstraux.c
bstest.c - C unit/regression test for bstrlib.c
bstrwrap.cpp - C++ implementation of CBString.
bstrwrap.h - C++ header file for CBString.
test.cpp - C++ unit/regression test for bstrwrap.cpp
bsafe.c - C runtime stubs to abort usage of unsafe C functions.
bsafe.h - C header file for bsafe.c functions.
C projects need only include bstrlib.h and compile/link bstrlib.c to use the
bstring library. C++ projects need to additionally include bstrwrap.h and
compile/link bstrwrap.cpp. For both, there may be a need to make choices
about feature configuration as described in the "Configurable compilation
options" section above.
Other files that are included in this archive are:
license.txt - The 3 clause BSD license for Bstrlib
gpl.txt - The GPL version 2
security.txt - A security statement useful for auditing Bstrlib
porting.txt - A guide to porting Bstrlib
bstrlib.txt - This file
===============================================================================
The functions
-------------
extern bstring bfromcstr (const char * str);
Take a standard C library style '\0' terminated char buffer and generate
a bstring with the same contents as the char buffer. If an error occurs
NULL is returned.
So for example:
bstring b = bfromcstr ("Hello");
if (!b) {
    fprintf (stderr, "Out of memory");
} else {
    puts ((char *) b->data);
}
..........................................................................
extern bstring bfromcstralloc (int mlen, const char * str);
Create a bstring which contains the contents of the '\0' terminated
char * buffer str. The memory buffer backing the bstring is at least
mlen characters in length. If an error occurs NULL is returned.
So for example:
bstring b = bfromcstralloc (64, someCstr);
if (b) b->data[63] = 'x';
The idea is that this will set the 64th character of b to 'x' if b is at
least 64 characters long, and otherwise leave the contents of b (as seen by
the bstring functions) unchanged. We know this is well defined so long as b
was successfully created, since it will have been allocated with at least 64
characters.
..........................................................................
extern bstring blk2bstr (const void * blk, int len);
Create a bstring whose contents are described by the contiguous buffer
pointed to by blk with a length of len bytes. Note that this function
creates a copy of the data in blk, rather than simply referencing it.
Compare with the blk2tbstr macro. If an error occurs NULL is returned.
..........................................................................
extern char * bstr2cstr (const_bstring s, char z);
Create a '\0' terminated char buffer which contains the contents of the
bstring s, except that any contained '\0' characters are converted to the
character in z. This returned value should be freed with bcstrfree(), by
the caller. If an error occurs NULL is returned.
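The described behaviour can be sketched as follows. The function to_cstr_sketch is a hypothetical stand-in written for illustration and is not the library's implementation. The struct is repeated so the snippet compiles on its own, and the result is released with plain free() here only because the sketch allocates with plain malloc().

```c
#include <stdlib.h>

/* Repeated from the declaration above so the snippet stands alone. */
struct tagbstring {
    int mlen;
    int slen;
    unsigned char * data;
};

/* Hypothetical stand-in: copy s into a freshly allocated '\0' terminated
   buffer, replacing every embedded '\0' with z. Returns NULL on error. */
char * to_cstr_sketch (const struct tagbstring * s, char z) {
    char * r;
    int i;
    if (s == NULL || s->slen < 0 || s->data == NULL) return NULL;
    r = malloc ((size_t) s->slen + 1);
    if (r == NULL) return NULL;
    for (i = 0; i < s->slen; i++) {
        r[i] = (s->data[i] == '\0') ? z : (char) s->data[i];
    }
    r[s->slen] = '\0';
    return r;
}
```

For a three byte bstring holding 'a', '\0', 'b' and z set to '?', the result is the C string "a?b", so the C library sees the full length instead of stopping at the embedded '\0'.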
..........................................................................
extern int bcstrfree (char * s);
Frees a C-string generated by bstr2cstr (). This is normally unnecessary
since it just wraps a call to free (), however, if malloc () and free ()
have been redefined as macros within the bstrlib module (via macros in
the memdbg.h backdoor) with some difference in behaviour from the std
library functions, then this allows a correct way of freeing the memory
that allows higher level code to be independent from these macro
redefinitions.
..........................................................................
extern bstring bstrcpy (const_bstring b1);
Make a copy of the passed in bstring. The copied bstring is returned if
there is no error, otherwise NULL is returned.
..........................................................................
extern int bassign (bstring a, const_bstring b);
Overwrite the bstring a with the contents of bstring b. Note that the
bstring a must be a well defined and writable bstring. If an error
occurs BSTR_ERR is returned and a is not overwritten.
..........................................................................
int bassigncstr (bstring a, const char * str);
Overwrite the string a with the contents of char * string str. Note that
the bstring a must be a well defined and writable bstring. If an error
occurs BSTR_ERR is returned and a may be partially overwritten.
..........................................................................
int bassignblk (bstring a, const void * s, int len);
Overwrite the string a with the contents of the block (s, len). Note that
the bstring a must be a well defined and writable bstring. If an error
occurs BSTR_ERR is returned and a is not overwritten.
..........................................................................
extern int bassignmidstr (bstring a, const_bstring b, int left, int len);
Overwrite the bstring a with the substring of bstring b
starting from position left and running for a length len. left and
len are clamped to the ends of b as with the function bmidstr. Note that
the bstring a must be a well defined and writable bstring. If an error
occurs BSTR_ERR is returned and a is not overwritten.
..........................................................................
extern bstring bmidstr (const_bstring b, int left, int len);
Create a bstring which is the substring of b starting from position left
and running for a length len (clamped by the end of the bstring b.) If
there was no error, the value of this constructed bstring is returned
otherwise NULL is returned.
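The clamping of (left, len) can be illustrated without the library. The
following is a standalone model, not Bstrlib code: clamp_span is a
hypothetical helper, and it assumes that a negative left is pulled up to 0
with len shortened by the same amount, and that a span running past the
end is cut off at the length of the source.

```c
/* Standalone model of (left, len) clamping; clamp_span is a hypothetical
   helper, not a Bstrlib function. slen is the length of the source string.
   On return *left and *len describe the clamped span. The return value is
   the clamped length (0 means the resulting substring is empty). */
static int clamp_span(int slen, int *left, int *len) {
    if (*left < 0) {           /* span starts before the string */
        *len += *left;         /* drop the part that lies before index 0 */
        *left = 0;
    }
    if (*len > slen - *left)   /* span runs past the end */
        *len = slen - *left;
    if (*len < 0)
        *len = 0;              /* nothing of the span lies inside the string */
    return *len;
}
```

Under these assumptions, with slen equal to 5 the request (left 2, len 100)
becomes (2, 3) and the request (left -2, len 4) becomes (0, 2).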
..........................................................................
extern int bdelete (bstring s1, int pos, int len);
Removes characters from pos to pos+len-1 and shifts the tail of the
bstring starting from pos+len to pos. len must be positive for this call
to have any effect. The section of the bstring described by (pos, len)
is clamped to the boundaries of the bstring s1. The value BSTR_OK is
returned if the operation is successful, otherwise BSTR_ERR is returned.
..........................................................................
extern int bconcat (bstring b0, const_bstring b1);
Concatenate the bstring b1 to the end of bstring b0. The value BSTR_OK
is returned if the operation is successful, otherwise BSTR_ERR is
returned.
..........................................................................
extern int bconchar (bstring b, char c);
Concatenate the character c to the end of bstring b. The value BSTR_OK
is returned if the operation is successful, otherwise BSTR_ERR is
returned.
..........................................................................
extern int bcatcstr (bstring b, const char * s);
Concatenate the char * string s to the end of bstring b. The value
BSTR_OK is returned if the operation is successful, otherwise BSTR_ERR is
returned.
..........................................................................
extern int bcatblk (bstring b, const void * s, int len);
Concatenate a fixed length buffer (s, len) to the end of bstring b. The
value BSTR_OK is returned if the operation is successful, otherwise
BSTR_ERR is returned.
..........................................................................
extern int biseq (const_bstring b0, const_bstring b1);
Compare the bstrings b0 and b1 for equality. If the bstrings differ, 0
is returned; if the bstrings are the same, 1 is returned; if there is an
error, -1 is returned. If the lengths of the bstrings are different, this
function has O(1) complexity. Contained '\0' characters are not treated
as a termination character.
Note that the semantics of biseq are not completely compatible with
bstrcmp because of its different treatment of the '\0' character.
..........................................................................
extern int bisstemeqblk (const_bstring b, const void * blk, int len);
Compare the beginning of bstring b with a block of memory of length len
for equality. If the beginning of b differs from the memory block (or if
b is too short), 0 is returned; if they match, 1 is returned; if there is
an error, -1 is returned.
..........................................................................
extern int biseqcaseless (const_bstring b0, const_bstring b1);
Compare two bstrings for equality without differentiating between case.
If the bstrings differ other than in case, 0 is returned; if the bstrings
are the same apart from case, 1 is returned; if there is an error, -1 is
returned. If the lengths of the bstrings are different, this function is
O(1). '\0' termination characters are not treated in any special way.
..........................................................................
extern int bisstemeqcaselessblk (const_bstring b0, const void * blk, int len);
Compare the beginning of bstring b0 with a block of memory of length len
for equality, without differentiating between case. If the beginning of
b0 differs from the memory block other than in case (or if b0 is too
short), 0 is returned; if they match, 1 is returned; if there is an error,
-1 is returned.
..........................................................................
extern int biseqcstr (const_bstring b, const char *s);
Compare the bstring b and char * string s. The C string s must be '\0'
terminated at exactly the length of the bstring b, and the contents
between the two must be identical with the bstring b with no '\0'
characters for the two contents to be considered equal. This is
equivalent to the condition that their current contents will always be
equal when comparing them in the same format after converting one or the
other. If they are equal 1 is returned, if they are unequal 0 is
returned and if there is a detectable error BSTR_ERR is returned.
..........................................................................
extern int biseqcstrcaseless (const_bstring b, const char *s);
Compare the bstring b and char * string s. The C string s must be '\0'
terminated at exactly the length of the bstring b, and the contents
between the two must be identical except for case with the bstring b with
no '\0' characters for the two contents to be considered equal. This is
equivalent to the condition that their current contents will always be
equal ignoring case when comparing them in the same format after
converting one or the other. If they are equal, except for case, 1 is
returned, if they are unequal regardless of case 0 is returned and if
there is a detectable error BSTR_ERR is returned.
..........................................................................
extern int bstrcmp (const_bstring b0, const_bstring b1);
Compare the bstrings b0 and b1 for ordering. If there is an error,
SHRT_MIN is returned, otherwise a value less than or greater than zero,
indicating that the bstring pointed to by b0 is lexicographically less
than or greater than the bstring pointed to by b1 is returned. If the
bstring lengths are unequal but the characters up until the length of the
shorter are equal then a value less than, or greater than zero,
indicating that the bstring pointed to by b0 is shorter or longer than the
bstring pointed to by b1 is returned. 0 is returned if the two bstrings
are the same, or if they match up to and including a contained '\0'
character. If the lengths of the bstrings are different, this function
is O(n). Like its standard C library counterpart, the comparison does
not proceed past any '\0' termination characters encountered.
The seemingly odd error return value, merely provides slightly more
granularity than the undefined situation given in the C library function
strcmp. The function otherwise behaves very much like strcmp().
Note that the semantics of bstrcmp are not completely compatible with
biseq because of its different treatment of the '\0' termination
character.
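The difference in '\0' treatment can be seen in a small standalone model.
The code below does not use Bstrlib: struct span, span_iseq and span_cmp
are hypothetical names for a length-counted view and two comparisons that
follow the rules described for biseq and bstrcmp respectively. It is a
sketch of the described semantics, not the library's implementation.

```c
#include <string.h>

/* Hypothetical length-counted view, used only for this illustration. */
struct span { const unsigned char *data; int slen; };

/* Equality over the full counted length: '\0' is an ordinary character. */
static int span_iseq(struct span a, struct span b) {
    if (a.slen != b.slen) return 0;            /* O(1) when lengths differ */
    return memcmp(a.data, b.data, (size_t)a.slen) == 0;
}

/* Ordering that, like strcmp, stops at the first '\0' both sides share. */
static int span_cmp(struct span a, struct span b) {
    int i, n = a.slen < b.slen ? a.slen : b.slen;
    for (i = 0; i < n; i++) {
        int v = (int)a.data[i] - (int)b.data[i];
        if (v != 0) return v;
        if (a.data[i] == '\0') return 0;       /* do not look past '\0' */
    }
    if (a.slen > n) return 1;                  /* a is longer */
    if (b.slen > n) return -1;                 /* b is longer */
    return 0;
}
```

For the two 3-character contents "a\0b" and "a\0c", span_iseq reports that
they differ while span_cmp returns 0, because the ordering comparison stops
at the shared '\0'.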
..........................................................................
extern int bstrncmp (const_bstring b0, const_bstring b1, int n);
Compare the bstrings b0 and b1 for ordering for at most n characters. If
there is an error, SHRT_MIN is returned, otherwise a value is returned as
if b0 and b1 were first truncated to at most n characters then bstrcmp
was called with these new bstrings as parameters. If the lengths of the
bstrings are different, this function is O(n). Like its standard C
library counterpart, the comparison does not proceed past any '\0'
termination characters encountered.
The seemingly odd error return value, merely provides slightly more
granularity than the undefined situation given in the C library function
strncmp. The function otherwise behaves very much like strncmp().
..........................................................................
extern int bstricmp (const_bstring b0, const_bstring b1);
Compare two bstrings without differentiating between case. The return
value is the difference of the values of the characters where the two
bstrings first differ, otherwise 0 is returned indicating that the
bstrings are equal. If the lengths are different, then a difference from
0 is given, but if the first extra character is '\0', then it is taken to
be the value UCHAR_MAX+1.
..........................................................................
extern int bstrnicmp (const_bstring b0, const_bstring b1, int n);
Compare two bstrings without differentiating between case for at most n
characters. If the position where the two bstrings first differ is
before the nth position, the return value is the difference of the values
of the characters, otherwise 0 is returned. If the lengths are different
and less than n characters, then a difference from 0 is given, but if the
first extra character is '\0', then it is taken to be the value
UCHAR_MAX+1.
..........................................................................
extern int bdestroy (bstring b);
Deallocate the bstring passed. Passing NULL in as a parameter will have
no effect. Note that both the header and the data portion of the bstring
will be freed. No other bstring function which modifies one of its
parameters will free or reallocate the header. Because of this, in
general, bdestroy cannot be called on any declared struct tagbstring even
if it is not write protected. A bstring which is write protected cannot
be destroyed via the bdestroy call. Any attempt to do so will result in
no action taken, and BSTR_ERR will be returned.
Note to C++ users: Passing in a CBString cast to a bstring will lead to
undefined behavior (free will be called on the header, rather than the
CBString destructor.) Instead just use the ordinary C++ language
facilities to dealloc a CBString.
..........................................................................
extern int binstr (const_bstring s1, int pos, const_bstring s2);
Search for the bstring s2 in s1 starting at position pos and looking in a
forward (increasing) direction. If it is found then it returns with the
first position after pos where it is found, otherwise it returns BSTR_ERR.
The algorithm used is brute force; O(m*n).
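A brute force forward search of this kind can be sketched as follows. This
is a standalone illustration on plain character buffers; find_from is a
hypothetical helper, not the Bstrlib implementation, and it uses -1 where
the library would return BSTR_ERR.

```c
#include <string.h>

/* Return the first index i >= pos at which t (length tlen) occurs in s
   (length slen), or -1 if there is no such index. Every candidate start
   is tried in turn and compared in full, hence O(slen * tlen). */
static int find_from(const char *s, int slen, int pos,
                     const char *t, int tlen) {
    int i;
    if (s == NULL || t == NULL || pos < 0 || tlen < 0 || pos > slen)
        return -1;
    for (i = pos; i + tlen <= slen; i++)
        if (memcmp(s + i, t, (size_t)tlen) == 0) return i;
    return -1;
}
```

To find successive occurrences, a caller would restart the search at one
past the previously returned position.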
..........................................................................
extern int binstrr (const_bstring s1, int pos, const_bstring s2);
Search for the bstring s2 in s1 starting at position pos and looking in a
backward (decreasing) direction. If it is found then it returns with the
first position at or before pos where it is found, otherwise it returns
BSTR_ERR.
Note that the current position at pos is tested as well -- so to be
disjoint from a previous forward search it is recommended that the
position be backed up (decremented) by one position. The algorithm used
is brute force; O(m*n).
..........................................................................
extern int binstrcaseless (const_bstring s1, int pos, const_bstring s2);
Search for the bstring s2 in s1 starting at position pos and looking in a
forward (increasing) direction but without regard to case. If it is
found then it returns with the first position after pos where it is
found, otherwise it returns BSTR_ERR. The algorithm used is brute force;
O(m*n).
..........................................................................
extern int binstrrcaseless (const_bstring s1, int pos, const_bstring s2);
Search for the bstring s2 in s1 starting at position pos and looking in a
backward (decreasing) direction but without regard to case. If it is
found then it returns with the first position at or before pos where it
is found, otherwise it returns BSTR_ERR. Note that the current position
at pos is tested as well -- so to be disjoint from a previous forward
search it is recommended that the position be backed up (decremented) by
one position. The algorithm used is brute force; O(m*n).
..........................................................................
extern int binchr (const_bstring b0, int pos, const_bstring b1);
Search for the first position in b0 starting from pos or after, in which
one of the characters in b1 is found. This function has an execution
time of O(b0->slen + b1->slen). If such a position does not exist in b0,
then BSTR_ERR is returned.
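One way to reach a running time proportional to the sum of the two lengths
is to first record the characters of the set in a lookup table and then
scan the searched string once. The sketch below is a standalone
illustration of that technique; first_in_set is a hypothetical helper and
is not claimed to be how Bstrlib implements binchr. It returns -1 where
the library would return BSTR_ERR.

```c
#include <limits.h>
#include <string.h>

/* Return the first index >= pos in s (length slen) holding a character
   that occurs in set (length setlen), or -1 if there is none. */
static int first_in_set(const unsigned char *s, int slen, int pos,
                        const unsigned char *set, int setlen) {
    unsigned char member[UCHAR_MAX + 1];
    int i;
    if (s == NULL || set == NULL || pos < 0 || pos >= slen || setlen <= 0)
        return -1;
    memset(member, 0, sizeof member);
    for (i = 0; i < setlen; i++)       /* one pass over the set */
        member[set[i]] = 1;
    for (i = pos; i < slen; i++)       /* one pass over the string */
        if (member[s[i]]) return i;
    return -1;
}
```

The "none of the characters" variants can be modelled the same way by
testing for a zero table entry instead.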
..........................................................................
extern int binchrr (const_bstring b0, int pos, const_bstring b1);
Search for the last position in b0 no greater than pos, in which one of
the characters in b1 is found. This function has an execution time
of O(b0->slen + b1->slen). If such a position does not exist in b0,
then BSTR_ERR is returned.
..........................................................................
extern int bninchr (const_bstring b0, int pos, const_bstring b1);
Search for the first position in b0 starting from pos or after, in which
none of the characters in b1 is found and return it. This function has
an execution time of O(b0->slen + b1->slen). If such a position does
not exist in b0, then BSTR_ERR is returned.
..........................................................................
extern int bninchrr (const_bstring b0, int pos, const_bstring b1);
Search for the last position in b0 no greater than pos, in which none of
the characters in b1 is found and return it. This function has an
execution time of O(b0->slen + b1->slen). If such a position does not
exist in b0, then BSTR_ERR is returned.
..........................................................................
extern int bstrchr (const_bstring b, int c);
Search for the character c in the bstring b forwards from the start of
the bstring. Returns the position of the found character or BSTR_ERR if
it is not found.
NOTE: This has been implemented as a macro on top of bstrchrp ().
..........................................................................
extern int bstrrchr (const_bstring b, int c);
Search for the character c in the bstring b backwards from the end of the
bstring. Returns the position of the found character or BSTR_ERR if it is
not found.
NOTE: This has been implemented as a macro on top of bstrrchrp ().
..........................................................................
extern int bstrchrp (const_bstring b, int c, int pos);
Search for the character c in b forwards from the position pos
(inclusive). Returns the position of the found character or BSTR_ERR if
it is not found.
..........................................................................
extern int bstrrchrp (const_bstring b, int c, int pos);
Search for the character c in b backwards from the position pos in bstring
(inclusive). Returns the position of the found character or BSTR_ERR if
it is not found.
..........................................................................
extern int bsetstr (bstring b0, int pos, const_bstring b1, unsigned char fill);
Overwrite the bstring b0 starting at position pos with the bstring b1. If
the position pos is past the end of b0, then the character "fill" is
appended as necessary to make up the gap between the end of b0 and pos.
If b1 is NULL, it behaves as if it were a 0-length bstring. The value
BSTR_OK is returned if the operation is successful, otherwise BSTR_ERR is
returned.
..........................................................................
extern int binsert (bstring s1, int pos, const_bstring s2, unsigned char fill);
Inserts the bstring s2 into s1 at position pos. If the position pos is
past the end of s1, then the character "fill" is appended as necessary to
make up the gap between the end of s1 and pos. The value BSTR_OK is
returned if the operation is successful, otherwise BSTR_ERR is returned.
..........................................................................
extern int binsertch (bstring s1, int pos, int len, unsigned char fill);
Inserts the character fill repeatedly into s1 at position pos for a
length len. If the position pos is past the end of s1, then the
character "fill" is appended as necessary to make up the gap between the
end of s1 and the position pos + len (exclusive). The value BSTR_OK is
returned if the operation is successful, otherwise BSTR_ERR is returned.
..........................................................................
extern int breplace (bstring b1, int pos, int len, const_bstring b2,
unsigned char fill);
Replace a section of a bstring from pos for a length len with the bstring
b2. If the position pos is past the end of b1 then the character "fill"
is appended as necessary to make up the gap between the end of b1 and
pos.
..........................................................................
extern int bfindreplace (bstring b, const_bstring find,
const_bstring replace, int position);
Replace all occurrences of the find substring with a replace bstring
after a given position in the bstring b. The find bstring must have a
length > 0 otherwise BSTR_ERR is returned. This function does not
perform recursive per character replacement; that is to say successive
searches resume at the position after the last replace.
So for example:
bfindreplace (a0 = bfromcstr("aabaAb"), a1 = bfromcstr("a"),
a2 = bfromcstr("aa"), 0);
Should result in changing a0 to "aaaabaaAb".
This function performs exactly (b->slen - position) bstring comparisons,
and data movement is bounded above by character volume equivalent to size
of the output bstring.
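The non-recursive behaviour can be modelled on ordinary '\0' terminated
strings. The sketch below is standalone and does not use Bstrlib;
replace_all is a hypothetical helper that writes to a separate output
buffer, and it assumes that matches starting at or after the given
position are replaced. Because scanning always continues in the source
after the matched text, replacement text is never searched again.

```c
#include <string.h>

/* Copy src to out, replacing every occurrence of find that starts at or
   after pos with repl. Returns the output length, or -1 if find is empty,
   pos is out of range, or out (capacity outsz) is too small. */
static int replace_all(const char *src, const char *find, const char *repl,
                       size_t pos, char *out, size_t outsz) {
    size_t slen = strlen(src), flen = strlen(find), rlen = strlen(repl);
    size_t i = 0, o = 0;
    if (flen == 0 || outsz == 0 || pos > slen) return -1;
    while (i < slen) {
        if (i >= pos && slen - i >= flen &&
            memcmp(src + i, find, flen) == 0) {
            if (o + rlen >= outsz) return -1;
            memcpy(out + o, repl, rlen);
            o += rlen;
            i += flen;      /* resume after the match; repl is not rescanned */
        } else {
            if (o + 1 >= outsz) return -1;
            out[o++] = src[i++];
        }
    }
    out[o] = '\0';
    return (int)o;
}
```

Called with "aabaAb", "a", "aa" and position 0, this model produces
"aaaabaaAb", matching the example above.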
..........................................................................
extern int bfindreplacecaseless (bstring b, const_bstring find,
const_bstring replace, int position);
Replace all occurrences of the find substring, ignoring case, with a
replace bstring after a given position in the bstring b. The find bstring
must have a length > 0 otherwise BSTR_ERR is returned. This function
does not perform recursive per character replacement; that is to say
successive searches resume at the position after the last replace.
So for example:
bfindreplacecaseless (a0 = bfromcstr("AAbaAb"), a1 = bfromcstr("a"),
a2 = bfromcstr("aa"), 0);
Should result in changing a0 to "aaaabaaaab".
This function performs exactly (b->slen - position) bstring comparisons,
and data movement is bounded above by character volume equivalent to size
of the output bstring.
..........................................................................
extern int balloc (bstring b, int length);
Increase the allocated memory backing the data buffer for the bstring b
to a length of at least length. If the memory backing the bstring b is
already large enough, no action is performed. This has no effect on the
bstring b that is visible to the bstring API. Usually this function will
only be used when a minimum buffer size is required coupled with a direct
access to the ->data member of the bstring structure.
Be warned that like any other bstring function, the bstring must be well
defined upon entry to this function. I.e., doing something like:
b->slen *= 2; /* ?? Most likely incorrect */
balloc (b, b->slen);
is invalid, and should be implemented as:
int t;
if (BSTR_OK == balloc (b, t = (b->slen * 2))) b->slen = t;
This function will return with BSTR_ERR if b is not detected as a valid
bstring or length is not greater than 0, otherwise BSTR_OK is returned.
..........................................................................
extern int ballocmin (bstring b, int length);
Change the amount of memory backing the bstring b to at least length.
This operation will never truncate the bstring data including the
extra terminating '\0' and thus will not decrease the length to less than
b->slen + 1. Note that repeated use of this function may cause
performance problems (realloc may be called on the bstring more than
the O(log(INT_MAX)) times). This function will return with BSTR_ERR if b
is not detected as a valid bstring or length is not greater than 0,
otherwise BSTR_OK is returned.
So for example:
if (BSTR_OK == ballocmin (b, 64)) b->data[63] = 'x';
The idea is that this will set the 64th character of b to 'x' if it is at
least 64 characters long otherwise do nothing. And we know this is well
defined so long as the ballocmin call was successful, since it will
ensure that b has been allocated with at least 64 characters.
..........................................................................
int btrunc (bstring b, int n);
Truncate the bstring to at most n characters. This function will return
with BSTR_ERR if b is not detected as a valid bstring or n is less than
0, otherwise BSTR_OK is returned.
..........................................................................
extern int bpattern (bstring b, int len);
Replicate the starting bstring, b, end to end repeatedly until it
surpasses len characters, then chop the result to exactly len characters.
This function operates in-place. This function will return with BSTR_ERR
if b is NULL or of length 0, otherwise BSTR_OK is returned.
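The in-place replication can be modelled on a plain character buffer. The
sketch below is standalone and does not use Bstrlib; fill_pattern is a
hypothetical helper, and the caller is assumed to supply a buffer with
room for at least len + 1 characters.

```c
#include <stddef.h>

/* buf initially holds a pattern of slen characters. Extend it in place by
   repeating the pattern until it is exactly len characters long (or cut
   it down to len if it is already longer), then '\0' terminate it.
   Returns 0 on success, -1 for a NULL buffer, an empty pattern or a
   negative len. */
static int fill_pattern(char *buf, int slen, int len) {
    int i;
    if (buf == NULL || slen <= 0 || len < 0) return -1;
    for (i = slen; i < len; i++)
        buf[i] = buf[i - slen];    /* copy from one pattern period earlier */
    buf[len] = '\0';               /* chop or terminate at exactly len */
    return 0;
}
```

Starting from "abc", a len of 8 gives "abcabcab" and a len of 2 gives "ab".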
..........................................................................
extern int btoupper (bstring b);
Convert contents of bstring to upper case. This function will return with
BSTR_ERR if b is NULL or of length 0, otherwise BSTR_OK is returned.
..........................................................................
extern int btolower (bstring b);
Convert contents of bstring to lower case. This function will return with
BSTR_ERR if b is NULL or of length 0, otherwise BSTR_OK is returned.
..........................................................................
extern int bltrimws (bstring b);
Delete whitespace contiguous from the left end of the bstring. This
function will return with BSTR_ERR if b is NULL or of length 0, otherwise
BSTR_OK is returned.
..........................................................................
extern int brtrimws (bstring b);
Delete whitespace contiguous from the right end of the bstring. This
function will return with BSTR_ERR if b is NULL or of length 0, otherwise
BSTR_OK is returned.
..........................................................................
extern int btrimws (bstring b);
Delete whitespace contiguous from both ends of the bstring. This function
will return with BSTR_ERR if b is NULL or of length 0, otherwise BSTR_OK
is returned.
..........................................................................
extern struct bstrList * bstrListCreate (void);
Create an empty struct bstrList. The struct bstrList output structure is
declared as follows:
struct bstrList {
int qty, mlen;
bstring * entry;
};
The entry field is actually an array with qty entries. The mlen field
counts the maximum number of bstrings for which there is memory in the
entry array.
The Bstrlib API does *NOT* include a comprehensive set of functions for
full management of struct bstrList in an abstracted way. The reason for
this is because aliasing semantics of the list are best left to the user
of this function, and performance varies wildly depending on the
assumptions made. For a complete list of bstring data type it is
recommended that the C++ public std::vector<CBString> be used, since its
semantics and usage are more standard.
..........................................................................
extern int bstrListDestroy (struct bstrList * sl);
Destroy a struct bstrList structure that was returned by the bsplit
function. Note that this will destroy each bstring in the ->entry array
as well. See bstrListCreate() above for structure of struct bstrList.
..........................................................................
extern int bstrListAlloc (struct bstrList * sl, int msz);
Ensure that there is memory for at least msz number of entries for the
list.
..........................................................................
extern int bstrListAllocMin (struct bstrList * sl, int msz);
Try to allocate the minimum amount of memory for the list to include at
least msz entries or sl->qty whichever is greater.
..........................................................................
extern struct bstrList * bsplit (bstring str, unsigned char splitChar);
Create an array of sequential substrings from str divided by the
character splitChar. Successive occurrences of the splitChar will be
divided by empty bstring entries, following the semantics from the Python
programming language. To reclaim the memory from this output structure,
bstrListDestroy () should be called. See bstrListCreate() above for
structure of struct bstrList.
..........................................................................
extern struct bstrList * bsplits (bstring str, const_bstring splitStr);
Create an array of sequential substrings from str divided by any
character contained in splitStr. An empty splitStr causes a single entry
bstrList containing a copy of str to be returned. See bstrListCreate()
above for structure of struct bstrList.
..........................................................................
extern struct bstrList * bsplitstr (bstring str, const_bstring splitStr);
Create an array of sequential substrings from str divided by the entire
substring splitStr. An empty splitStr causes a single entry bstrList
containing a copy of str to be returned. See bstrListCreate() above for
structure of struct bstrList.
..........................................................................
extern bstring bjoin (const struct bstrList * bl, const_bstring sep);
Join the entries of a bstrList into one bstring by sequentially
concatenating them with the sep bstring in between. If sep is NULL, it
is treated as if it were the empty bstring. Note that:
bjoin (l = bsplit (b, s->data[0]), s);
should result in a copy of b, if s->slen is 1. If there is an error NULL
is returned, otherwise a bstring with the correct result is returned.
See bstrListCreate() above for structure of struct bstrList.
..........................................................................
extern int bsplitcb (const_bstring str, unsigned char splitChar, int pos,
int (* cb) (void * parm, int ofs, int len), void * parm);
Iterate the set of disjoint sequential substrings over str starting at
position pos divided by the character splitChar. The parm passed to
bsplitcb is passed on to cb. If the function cb returns a value < 0,
then further iterating is halted and this value is returned by bsplitcb.
Note: Non-destructive modification of str from within the cb function
while performing this split is well defined. bsplitcb behaves in
sequential lock step with calls to cb. I.e., after returning from a cb
that returns a non-negative integer, bsplitcb continues from the position
1 character after the last detected split character and it will halt
immediately if the length of str falls below this point. However, if the
cb function destroys str, then it *must* return with a negative value,
otherwise bsplitcb will continue in an undefined manner.
This function is provided as an incremental alternative to bsplit that is
abortable and which does not impose additional memory allocation.
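The callback protocol can be modelled without the library. The sketch
below is standalone: split_each, struct counter and count_field are
hypothetical names, the string is passed as a plain (data, length) pair,
and -1 stands in for BSTR_ERR. It reports each field as an (ofs, len)
pair, passes parm through unchanged, and stops as soon as the callback
returns a negative value.

```c
#include <stddef.h>

typedef int (*split_cb)(void *parm, int ofs, int len);

/* Call cb once per field of data[pos..slen) divided by splitChar. */
static int split_each(const unsigned char *data, int slen,
                      unsigned char splitChar, int pos,
                      split_cb cb, void *parm) {
    int i, p, ret;
    if (cb == NULL || data == NULL || pos < 0 || pos > slen) return -1;
    p = pos;
    do {
        for (i = p; i < slen; i++)
            if (data[i] == splitChar) break;
        if ((ret = cb(parm, p, i - p)) < 0) return ret;  /* caller aborted */
        p = i + 1;          /* continue one past the split character */
    } while (p <= slen);
    return 0;
}

/* Example callback: counts fields and aborts with -2 at a given limit. */
struct counter { int seen, limit; };

static int count_field(void *parm, int ofs, int len) {
    struct counter *c = (struct counter *)parm;
    (void)ofs; (void)len;
    if (c->seen == c->limit) return -2;
    c->seen++;
    return 0;
}
```

For the 4-character input "a,,b" split on ',', the callback receives
(0, 1), (2, 0) and (3, 1); the empty middle field reflects the rule that
successive split characters are divided by empty entries.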
..........................................................................
extern int bsplitscb (const_bstring str, const_bstring splitStr, int pos,
int (* cb) (void * parm, int ofs, int len), void * parm);
Iterate the set of disjoint sequential substrings over str starting at
position pos divided by any of the characters in splitStr. An empty
splitStr causes the whole str to be iterated once. The parm passed to
bsplitscb is passed on to cb. If the function cb returns a value < 0,
then further iterating is halted and this value is returned by bsplitscb.
Note: Non-destructive modification of str from within the cb function
while performing this split is well defined. bsplitscb behaves in
sequential lock step with calls to cb. I.e., after returning from a cb
that returns a non-negative integer, bsplitscb continues from the position
1 character after the last detected split character and it will halt
immediately if the length of str falls below this point. However, if the
cb function destroys str, then it *must* return with a negative value,
otherwise bsplitscb will continue in an undefined manner.
This function is provided as an incremental alternative to bsplits that
is abortable and which does not impose additional memory allocation.
..........................................................................
extern int bsplitstrcb (const_bstring str, const_bstring splitStr, int pos,
int (* cb) (void * parm, int ofs, int len), void * parm);
Iterate the set of disjoint sequential substrings over str starting at
position pos divided by the entire substring splitStr. An empty splitStr
causes each character of str to be iterated. The parm passed to
bsplitstrcb is passed on to cb. If the function cb returns a value < 0,
then further iterating is halted and this value is returned by
bsplitstrcb.
Note: Non-destructive modification of str from within the cb function
while performing this split is well defined. bsplitstrcb behaves in
sequential lock step with calls to cb. I.e., after returning from a cb
that returns a non-negative integer, bsplitstrcb continues from the
position 1 character after the last detected split character and it will
halt immediately if the length of str falls below this point. However, if
the cb function destroys str, then it *must* return with a negative
value, otherwise bsplitstrcb will continue in an undefined manner.
This function is provided as an incremental alternative to bsplitstr that
is abortable and which does not impose additional memory allocation.
..........................................................................
extern bstring bformat (const char * fmt, ...);
Takes the same parameters as printf (), but rather than outputting
results to stdout, it forms a bstring which contains what would have been
output. Note that if there is an early generation of a '\0' character,
the bstring will be truncated to this end point.
Note that %s format tokens correspond to '\0' terminated char * buffers,
not bstrings. To print a bstring, first dereference the data element of
the bstring:
/* b1->data needs to be '\0' terminated, so tagbstrings generated
by blk2tbstr () might not be suitable. */
b0 = bformat ("Hello, %s", b1->data);
Note that if the BSTRLIB_NOVSNP macro has been set when bstrlib has been
compiled the bformat function is not present.
..........................................................................
extern int bformata (bstring b, const char * fmt, ...);
In addition to the initial output buffer b, bformata takes the same
parameters as printf (), but rather than outputting results to stdout, it
appends the results to the initial bstring parameter. Note that if
there is an early generation of a '\0' character, the bstring will be
truncated to this end point.
Note that %s format tokens correspond to '\0' terminated char * buffers,
not bstrings. To print a bstring, first dereference the data element of
the bstring:
/* b1->data needs to be '\0' terminated, so tagbstrings generated
by blk2tbstr () might not be suitable. */
bformata (b0 = bfromcstr ("Hello"), ", %s", b1->data);
Note that if the BSTRLIB_NOVSNP macro has been set when bstrlib has been
compiled the bformata function is not present.
..........................................................................
extern int bassignformat (bstring b, const char * fmt, ...);
After the first parameter, it takes the same parameters as printf (), but
rather than outputting results to stdout, it outputs the results to
the bstring parameter b. Note that if there is an early generation of a
'\0' character, the bstring will be truncated to this end point.
Note that %s format tokens correspond to '\0' terminated char * buffers,
not bstrings. To print a bstring, first dereference the data element of
the bstring:
/* b1->data needs to be '\0' terminated, so tagbstrings generated
by blk2tbstr () might not be suitable. */
bassignformat (b0 = bfromcstr ("Hello"), ", %s", b1->data);
Note that if the BSTRLIB_NOVSNP macro has been set when bstrlib has been
compiled the bassignformat function is not present.
..........................................................................
extern int bvcformata (bstring b, int count, const char * fmt, va_list arglist);
The bvcformata function formats data under control of the format control
string fmt and attempts to append the result to b. The fmt parameter is
the same as that of the printf function. The variable argument list is
replaced with arglist, which has been initialized by the va_start macro.
The size of the output is upper bounded by count. If the required output
exceeds count, the string b is not augmented with any contents and a value
below BSTR_ERR is returned. If a value below -count is returned then it
is recommended that the negative of this value be used as an update to the
count in a subsequent pass. On other errors, such as running out of
memory, parameter errors or numeric wrap around BSTR_ERR is returned.
BSTR_OK is returned when the output is successfully generated and
appended to b.
Note: There is no sanity checking of arglist, and this function is
destructive of the contents of b from the b->slen point onward. If there
is an early generation of a '\0' character, the bstring will be truncated
to this end point.
Although this function is part of the external API for Bstrlib, the
interface and semantics (length limitations, and unusual return codes)
are fairly atypical. The real purpose for this function is to provide an
engine for the bvformata macro.
Note that if the BSTRLIB_NOVSNP macro has been set when bstrlib has been
compiled the bvcformata function is not present.
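The bound-and-retry protocol described above can be illustrated without
Bstrlib by using vsnprintf directly. The following is a minimal stand-alone
sketch, not the library's implementation: try_format and format_grow are
hypothetical names, and the return convention (0 on success, a positive size
hint when the bound is too small, -1 on error) is a simplification of the
one documented for bvcformata.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format into dst, which can hold count bytes.
   Returns 0 on success, the number of bytes required (including the
   '\0') when count is too small, or -1 on a formatting error. */
static int try_format(char *dst, size_t count, const char *fmt, va_list ap)
{
    va_list aq;
    int n;

    assert(dst != NULL && fmt != NULL);
    va_copy(aq, ap);                 /* keep the caller's list untouched */
    n = vsnprintf(dst, count, fmt, aq);
    va_end(aq);
    if (n < 0) return -1;
    if ((size_t) n >= count) return n + 1;   /* did not fit: size hint */
    return 0;
}

/* Hypothetical driver: retry with the hinted size until the output fits.
   Returns a malloc'd string that the caller must free, or NULL. */
char *format_grow(const char *fmt, ...)
{
    size_t count = 16;
    char *buf = malloc(count);

    while (buf != NULL) {
        va_list ap;
        char *tmp;
        int r;

        va_start(ap, fmt);           /* restart the list for every pass */
        r = try_format(buf, count, fmt, ap);
        va_end(ap);
        if (r == 0) return buf;
        if (r < 0) break;
        count = (size_t) r;          /* use the hint for the next pass */
        tmp = realloc(buf, count);
        if (tmp == NULL) break;
        buf = tmp;
    }
    free(buf);
    return NULL;
}
```

The argument list is restarted with va_start on each pass because a va_list
that has been consumed by a formatting call cannot be reused.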
..........................................................................
extern bstring bread (bNread readPtr, void * parm);
typedef size_t (* bNread) (void *buff, size_t elsize, size_t nelem,
void *parm);
Read an entire stream into a bstring, verbatim. The readPtr function
pointer is compatible with fread semantics, except that it need not obtain
the stream data from a file. The intention is that parm would contain
the stream data context/state required (similar to the role of the FILE*
I/O stream parameter of fread.)
Abstracting the block read function allows for block devices other than
file streams to be read if desired. Note that there is an ANSI
compatibility issue if "fread" is used directly; see the ANSI issues
section below.
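As an illustration of a non-file source, the following sketch shows a
callback with the bNread signature that serves data from a block of memory.
The names memsrc and mem_read are hypothetical and are not part of Bstrlib;
the sketch assumes only the typedef shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stream state: a read cursor over a block of memory. */
struct memsrc {
    const unsigned char *data;
    size_t len;
    size_t pos;
};

/* fread-compatible callback: copies up to nelem whole elements of
   elsize bytes into buff and returns the number of elements copied.
   A return value of 0 means that no more data is available. */
size_t mem_read(void *buff, size_t elsize, size_t nelem, void *parm)
{
    struct memsrc *src = parm;
    size_t avail, n;

    assert(buff != NULL && src != NULL);
    if (elsize == 0 || nelem == 0) return 0;
    avail = (src->len - src->pos) / elsize;    /* whole elements left */
    n = nelem < avail ? nelem : avail;
    memcpy(buff, src->data + src->pos, n * elsize);
    src->pos += n * elsize;
    return n;
}
```

With Bstrlib linked in, such a function could be passed as the readPtr
argument with a pointer to the memsrc structure as parm.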
..........................................................................
extern int breada (bstring b, bNread readPtr, void * parm);
Read an entire stream and append it to a bstring, verbatim. Behaves
like bread, except that it appends its results to the bstring b.
BSTR_ERR is returned on error, otherwise 0 is returned.
..........................................................................
extern bstring bgets (bNgetc getcPtr, void * parm, char terminator);
typedef int (* bNgetc) (void * parm);
Read a bstring from a stream. As many bytes as are necessary are read
until the terminator is consumed or no more characters are available from
the stream. If read from the stream, the terminator character will be
appended to the end of the returned bstring. The getcPtr function must
have the same semantics as the fgetc C library function (i.e., returning
an integer whose value is negative when there are no more characters
available, otherwise the value of the next available unsigned character
from the stream.) The intention is that parm would contain the stream
data context/state required (similar to the role of the FILE* I/O stream
parameter of fgets.) If no characters are read, or there is some other
detectable error, NULL is returned.
bgets will never call the getcPtr function more often than necessary to
construct its output (including a single call, if required, to determine
that the stream contains no more characters.)
Abstracting the character stream function and terminator character allows
for different stream devices and string formats other than '\n'
terminated lines in a file if desired (consider \032 terminated email
messages, in a UNIX mailbox for example.)
For files, this function can be used analogously to fgets as follows:
fp = fopen ( ... );
if (fp) b = bgets ((bNgetc) fgetc, fp, '\n');
(Note that only one terminator character can be used, and that '\0' is
not assumed to terminate the stream in addition to the terminator
character. This is consistent with the semantics of fgets.)
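The getcPtr contract and the terminator rule can be sketched without Bstrlib
as follows. This is a stand-alone illustration under stated assumptions:
memcur, mem_getc and read_until are hypothetical names, and unlike bgets the
reader below writes into a fixed caller-supplied buffer instead of growing a
bstring.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stream state: a cursor over a counted block of memory. */
struct memcur {
    const char *data;
    size_t len;
    size_t pos;
};

/* fgetc-compatible callback: returns the next byte as an unsigned char
   value, or a negative value when no more characters are available. */
int mem_getc(void *parm)
{
    struct memcur *c = parm;

    assert(c != NULL);
    if (c->pos >= c->len) return -1;
    return (unsigned char) c->data[c->pos++];
}

/* Hypothetical reader: stops after the terminator has been consumed and
   stored, when the stream runs dry, or when out is full. Returns the
   number of bytes stored. A '\0' byte is treated as ordinary data. */
size_t read_until(int (*getc_fn)(void *), void *parm, char terminator,
                  char *out, size_t cap)
{
    size_t n = 0;

    assert(getc_fn != NULL && out != NULL);
    while (n < cap) {
        int ch = getc_fn(parm);
        if (ch < 0) break;                     /* stream exhausted */
        out[n++] = (char) ch;
        if (ch == (unsigned char) terminator) break;
    }
    return n;
}
```

Note that the callback is never invoked again once the terminator has been
seen, which matches the "no more often than necessary" rule described above.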
..........................................................................
extern int bgetsa (bstring b, bNgetc getcPtr, void * parm, char terminator);
Read from a stream and concatenate to a bstring. Behaves like bgets,
except that it appends its results to the bstring b. The value 1 is
returned if no characters are read before a negative result is returned
from getcPtr. Otherwise BSTR_ERR is returned on error, and 0 is returned
in other normal cases.
..........................................................................
extern int bassigngets (bstring b, bNgetc getcPtr, void * parm, char terminator);
Read from a stream and assign to a bstring. Behaves like bgets,
except that it assigns the results to the bstring b. The value 1 is
returned if no characters are read before a negative result is returned
from getcPtr. Otherwise BSTR_ERR is returned on error, and 0 is returned
in other normal cases.
..........................................................................
extern struct bStream * bsopen (bNread readPtr, void * parm);
Wrap a given open stream (described by a fread compatible function
pointer and stream handle) into an open bStream suitable for the bstring
library streaming functions.
..........................................................................
extern void * bsclose (struct bStream * s);
Close the bStream, and return the handle to the stream that was
originally used to open the given stream. If s is NULL or detectably
invalid, NULL will be returned.
..........................................................................
extern int bsbufflength (struct bStream * s, int sz);
Set the length of the buffer used by the bStream. If sz is the macro
BSTR_BS_BUFF_LENGTH_GET (which is 0), the length is not set. If s is
NULL or sz is negative, the function will return with BSTR_ERR, otherwise
this function returns with the previous length.
..........................................................................
extern int bsreadln (bstring r, struct bStream * s, char terminator);
Read a bstring terminated by the terminator character or the end of the
stream from the bStream (s) and return it into the parameter r. The
matched terminator, if found, appears at the end of the line read. If
the stream has been exhausted of all available data, before any can be
read, BSTR_ERR is returned. This function may read additional characters
into the stream buffer from the core stream that are not returned, but
will be retained for subsequent read operations. When reading from high
speed streams, this function can perform significantly faster than bgets.
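The speed difference comes from fetching data in blocks and scanning the
buffer for the terminator, instead of making one callback per character. A
minimal stand-alone sketch of that idea follows. It is not Bstrlib's
implementation: bufstream, buf_readln, memsrc and mem_read are hypothetical
names, the output goes to a fixed char buffer, and the 8 byte stream buffer
is deliberately tiny so that refills happen in a small example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef size_t (*read_fn)(void *buff, size_t elsize, size_t nelem,
                          void *parm);

/* Hypothetical in-memory core stream with an fread-style reader. */
struct memsrc { const char *data; size_t len, pos; };

size_t mem_read(void *buff, size_t elsize, size_t nelem, void *parm)
{
    struct memsrc *m = parm;
    size_t left = m->len - m->pos;
    size_t n = nelem < left ? nelem : left;

    assert(elsize == 1);              /* byte reads only in this sketch */
    memcpy(buff, m->data + m->pos, n);
    m->pos += n;
    return n;
}

/* Hypothetical buffered stream: bytes fetched from the core stream but
   not yet handed out are held in buf[pos..end). */
struct bufstream {
    read_fn read;
    void *parm;
    char buf[8];
    size_t pos, end;
};

/* Copies one line (terminator included, if found) into out and '\0'
   terminates it. Returns the line length, or -1 if no data was left.
   Lines longer than cap - 1 bytes are split across calls. */
int buf_readln(struct bufstream *s, char terminator, char *out, size_t cap)
{
    size_t n = 0;
    int done = 0;

    assert(s != NULL && out != NULL && cap > 1);
    while (!done && n + 1 < cap) {
        if (s->pos == s->end) {       /* refill with one block read */
            s->end = s->read(s->buf, 1, sizeof s->buf, s->parm);
            s->pos = 0;
            if (s->end == 0) break;   /* core stream is exhausted */
        }
        while (s->pos < s->end && n + 1 < cap) {
            char ch = s->buf[s->pos++];
            out[n++] = ch;
            if (ch == terminator) { done = 1; break; }
        }
    }
    out[n] = '\0';
    return n > 0 ? (int) n : -1;
}
```

Bytes that were read past the terminator stay in the stream buffer and are
returned by the next call, as described for bsreadln above.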
..........................................................................
extern int bsreadlna (bstring r, struct bStream * s, char terminator);
Read a bstring terminated by the terminator character or the end of the
stream from the bStream (s) and concatenate it to the parameter r. The
matched terminator, if found, appears at the end of the line read. If
the stream has been exhausted of all available data, before any can be
read, BSTR_ERR is returned. This function may read additional characters
into the stream buffer from the core stream that are not returned, but
will be retained for subsequent read operations. When reading from high
speed streams, this function can perform significantly faster than bgets.
..........................................................................
extern int bsreadlns (bstring r, struct bStream * s, bstring terminators);
Read a bstring terminated by any character in the terminators bstring or
the end of the stream from the bStream (s) and return it into the
parameter r. This function may read additional characters from the core
stream that are not returned, but will be retained for subsequent read
operations.
..........................................................................
extern int bsreadlnsa (bstring r, struct bStream * s, bstring terminators);
Read a bstring terminated by any character in the terminators bstring or
the end of the stream from the bStream (s) and concatenate it to the
parameter r. If the stream has been exhausted of all available data,
before any can be read, BSTR_ERR is returned. This function may read
additional characters from the core stream that are not returned, but
will be retained for subsequent read operations.
..........................................................................
extern int bsread (bstring r, struct bStream * s, int n);
Read a bstring of length n (or, if it is fewer, as many bytes as are
remaining) from the bStream. This function will read the minimum
required number of additional characters from the core stream. When the
stream is at the end of the file BSTR_ERR is returned, otherwise BSTR_OK
is returned.
..........................................................................
extern int bsreada (bstring r, struct bStream * s, int n);
Read a bstring of length n (or, if it is fewer, as many bytes as are
remaining) from the bStream and concatenate it to the parameter r. This
function will read the minimum required number of additional characters
from the core stream. When the stream is at the end of the file BSTR_ERR
is returned, otherwise BSTR_OK is returned.
..........................................................................
extern int bsunread (struct bStream * s, const_bstring b);
Insert a bstring into the bStream at the current position. These
characters will be read prior to those that actually come from the core
stream.
..........................................................................
extern int bspeek (bstring r, const struct bStream * s);
Return the number of currently buffered characters from the bStream that
will be read prior to reads from the core stream, and append it to the
parameter r.
..........................................................................
extern int bssplitscb (struct bStream * s, const_bstring splitStr,
int (* cb) (void * parm, int ofs, const_bstring entry), void * parm);
Iterate the set of disjoint sequential substrings over the stream s
divided by any character from the bstring splitStr. The parm passed to
bssplitscb is passed on to cb. If the function cb returns a value < 0,
then further iterating is halted and this return value is returned by
bssplitscb.
Note: At the point of calling the cb function, the bStream pointer is
pointed exactly at the position right after having read the split
character. The cb function can act on the stream by causing the bStream
pointer to move, and bssplitscb will continue by starting the next split
at the position of the pointer after the return from cb.
However, if the cb causes the bStream s to be destroyed then the cb must
return with a negative value, otherwise bssplitscb will continue in an
undefined manner.
This function is provided as a way to incrementally parse through a file
or other generic stream that in total size may otherwise exceed the
practical or desired memory available. As with the other split callback
based functions this is abortable and does not impose additional memory
allocation.
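The callback protocol (one call per entry, with a negative return value
aborting the iteration) can be sketched as follows. This is a stand-alone
illustration that splits a block of memory rather than a bStream; the names
split_chars_cb, split_cb, counter and count_entries are hypothetical, and
the callback signature is only modeled on the one shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: receives the caller's parm, the offset of
   the entry and the entry itself as a pointer plus length. */
typedef int (*split_cb)(void *parm, size_t ofs, const char *entry,
                        size_t len);

/* Calls cb once for every substring of s[0..len) delimited by any
   character of the '\0' terminated set seps. Returns 0 after a full
   pass, or the first negative value returned by cb. */
int split_chars_cb(const char *s, size_t len, const char *seps,
                   split_cb cb, void *parm)
{
    size_t nseps, start = 0, i;

    assert(s != NULL && seps != NULL && cb != NULL);
    nseps = strlen(seps);
    for (i = 0; i <= len; i++) {
        /* An entry ends at a separator or at the end of the input. */
        if (i == len || memchr(seps, (unsigned char) s[i], nseps) != NULL) {
            int r = cb(parm, start, s + start, i - start);
            if (r < 0) return r;      /* the callback asked to stop */
            start = i + 1;
        }
    }
    return 0;
}

/* Example callback: counts entries and aborts once a limit is reached. */
struct counter { int seen; int limit; };

int count_entries(void *parm, size_t ofs, const char *entry, size_t len)
{
    struct counter *c = parm;

    (void) ofs; (void) entry; (void) len;
    c->seen++;
    return c->seen >= c->limit ? -1 : 0;
}
```

No entry is copied or allocated; the callback only sees a view into the
input, which is what makes this style suitable for very large inputs.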
..........................................................................
extern int bssplitstrcb (struct bStream * s, const_bstring splitStr,
int (* cb) (void * parm, int ofs, const_bstring entry), void * parm);
Iterate the set of disjoint sequential substrings over the stream s
divided by the entire substring splitStr. The parm passed to
bssplitstrcb is passed on to cb. If the function cb returns a
value < 0, then further iterating is halted and this return value is
returned by bssplitstrcb.
Note: At the point of calling the cb function, the bStream pointer is
pointed exactly at the position right after having read the split
string. The cb function can act on the stream by causing the bStream
pointer to move, and bssplitstrcb will continue by starting the next
split at the position of the pointer after the return from cb.
However, if the cb causes the bStream s to be destroyed then the cb must
return with a negative value, otherwise bssplitstrcb will continue in an
undefined manner.
This function is provided as a way to incrementally parse through a file
or other generic stream that in total size may otherwise exceed the
practical or desired memory available. As with the other split callback
based functions this is abortable and does not impose additional memory
allocation.
..........................................................................
extern int bseof (const struct bStream * s);
Return the de facto "EOF" (end of file) state of a stream (1 if the
bStream is in an EOF state, 0 if not, and BSTR_ERR if stream is closed or
detectably erroneous.) When the readPtr callback returns a value <= 0
the stream reaches its "EOF" state. Note that bsunread with non-empty
content will essentially turn off this state, and the stream will not be
in its "EOF" state so long as it is possible to read more data out of it.
Also note that the semantics of bseof() are slightly different from
something like feof(). I.e., reaching the end of the stream does not
necessarily guarantee that bseof() will return with a value indicating
that this has happened. bseof() will only return indicating that it has
reached the "EOF" and an attempt has been made to read past the end of
the bStream.
The macros
----------
The macros described below are shown in a prototype form indicating their
intended usage. Note that the parameters passed to these macros will be
referenced multiple times. As with all macros, programmer care is
required to guard against unintended side effects.
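The multiple-evaluation hazard can be demonstrated with a small stand-alone
sketch. The macro CHAR_OR below is hypothetical and only modeled on the
shape of bchare; it is not Bstrlib's definition. The struct cstr and the
helper functions are likewise assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counted string. */
struct cstr { int slen; const char *data; };

/* Hypothetical bounds-checked accessor. Note that p appears three times
   in the expansion, so an argument with side effects runs three times. */
#define CHAR_OR(s, p, c) \
    (((s) != NULL && (p) >= 0 && (p) < (s)->slen) ? (s)->data[(p)] : (c))

/* Helper with a visible side effect: returns the counter's current
   value, then increments it. */
int bump(int *counter)
{
    assert(counter != NULL);
    return (*counter)++;
}

/* Hazard: bump () is expanded three times, so the counter advances by
   three and the character that is read is not the one intended. */
char risky_at(const struct cstr *s, int *counter)
{
    return (char) CHAR_OR(s, bump(counter), '?');
}

/* Safe: evaluate the side effect once, then pass a plain variable. */
char safe_at(const struct cstr *s, int *counter)
{
    int p = bump(counter);

    return (char) CHAR_OR(s, p, '?');
}
```

The general rule is to pass only side-effect free expressions (variables,
constants, field accesses) to macros of this kind.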
int blengthe (const_bstring b, int err);
Returns the length of the bstring. If the bstring is NULL err is
returned.
..........................................................................
int blength (const_bstring b);
Returns the length of the bstring. If the bstring is NULL, the length
returned is 0.
..........................................................................
int bchare (const_bstring b, int p, int c);
Returns the p'th character of the bstring b. If the position p refers to
a position that does not exist in the bstring or the bstring is NULL,
then c is returned.
..........................................................................
char bchar (const_bstring b, int p);
Returns the p'th character of the bstring b. If the position p refers to
a position that does not exist in the bstring or the bstring is NULL,
then '\0' is returned.
..........................................................................
char * bdatae (bstring b, char * err);
Returns the char * data portion of the bstring b. If b is NULL, err is
returned.
..........................................................................
char * bdata (bstring b);
Returns the char * data portion of the bstring b. If b is NULL, NULL is
returned.
..........................................................................
char * bdataofse (bstring b, int ofs, char * err);
Returns the char * data portion of the bstring b offset by ofs. If b is
NULL, err is returned.
..........................................................................
char * bdataofs (bstring b, int ofs);
Returns the char * data portion of the bstring b offset by ofs. If b is
NULL, NULL is returned.
..........................................................................
struct tagbstring var = bsStatic ("...");
The bsStatic macro allows for static declarations of literal string
constants as struct tagbstring structures. The resulting tagbstring does
not need to be freed or destroyed. Note that this macro is only well
defined for string literal arguments. For more general string pointers,
use the btfromcstr macro.
The resulting struct tagbstring is permanently write protected. Attempts
to write to this struct tagbstring from any bstrlib function will lead to
BSTR_ERR being returned. Invoking the bwriteallow macro onto this struct
tagbstring has no effect.
..........................................................................
<void * blk, int len> <- bsStaticBlkParms ("...")
The bsStaticBlkParms macro emits a pair of comma separated parameters
corresponding to the block parameters for the block functions in Bstrlib
(i.e., blk2bstr, bcatblk, blk2tbstr, bisstemeqblk, bisstemeqcaselessblk.)
Note that this macro is only well defined for string literal arguments.
Examples:
bstring b = blk2bstr (bsStaticBlkParms ("Fast init. "));
bcatblk (b, bsStaticBlkParms ("No frills fast concatenation."));
These are faster than using bfromcstr() and bcatcstr() respectively
because the length of the inline string is known as a compile time
constant. Also note that separate struct tagbstring declarations for
holding the output of a bsStatic() macro are not required.
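How a single macro can supply both a pointer and a compile time length is
sketched below. LIT_BLK, fixedbuf and buf_catblk are hypothetical names used
for illustration only; the macro is written in the spirit of
bsStaticBlkParms but should not be taken as its actual definition.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical macro: expands to two comma separated arguments, a
   pointer and a length computed by sizeof at compile time. The "" q ""
   concatenation makes a non-literal argument a compile error. */
#define LIT_BLK(q) ((const void *) ("" q "")), ((int) sizeof (q) - 1)

/* Hypothetical fixed-capacity buffer and block append function. */
struct fixedbuf { char data[64]; int len; };

/* Appends len bytes from blk. Returns 0 on success, -1 if the block
   does not fit or len is negative. */
int buf_catblk(struct fixedbuf *b, const void *blk, int len)
{
    assert(b != NULL && blk != NULL);
    if (len < 0 || len > (int) sizeof b->data - b->len) return -1;
    memcpy(b->data + b->len, blk, (size_t) len);
    b->len += len;
    return 0;
}
```

A call such as buf_catblk (&b, LIT_BLK ("Done.")) therefore never runs
strlen at run time; the length 5 is a constant in the generated code.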
..........................................................................
void btfromcstr (struct tagbstring& t, const char * s);
Fill in the tagbstring t with the '\0' terminated char buffer s. This
action is purely reference oriented; no memory management is done. The
data member is just assigned s, and slen is assigned the strlen of s.
The s parameter is accessed exactly once in this macro.
The resulting struct tagbstring is initially write protected. Attempts
to write to this struct tagbstring in a write protected state from any
bstrlib function will lead to BSTR_ERR being returned. Invoke the
bwriteallow macro on this struct tagbstring to make it writeable (though this
requires that s be obtained from a function compatible with malloc.)
..........................................................................
void btfromblk (struct tagbstring& t, void * s, int len);
Fill in the tagbstring t with the data buffer s with length len. This
action is purely reference oriented; no memory management is done. The
data member of t is just assigned s, and slen is assigned len. Note that
the buffer is not appended with a '\0' character. The s and len
parameters are accessed exactly once each in this macro.
The resulting struct tagbstring is initially write protected. Attempts
to write to this struct tagbstring in a write protected state from any
bstrlib function will lead to BSTR_ERR being returned. Invoke the
bwriteallow macro on this struct tagbstring to make it writeable (though this
requires that s be obtained from a function compatible with malloc.)
..........................................................................
void btfromblkltrimws (struct tagbstring& t, void * s, int len);
Fill in the tagbstring t with the data buffer s with length len after it
has been left trimmed. This action is purely reference oriented; no
memory management is done. The data member of t is just assigned to a
pointer inside the buffer s. Note that the buffer is not appended with a
'\0' character. The s and len parameters are accessed exactly once each
in this macro.
The resulting struct tagbstring is permanently write protected. Attempts
to write to this struct tagbstring from any bstrlib function will lead to
BSTR_ERR being returned. Invoking the bwriteallow macro onto this struct
tagbstring has no effect.
..........................................................................
void btfromblkrtrimws (struct tagbstring& t, void * s, int len);
Fill in the tagbstring t with the data buffer s with length len after it
has been right trimmed. This action is purely reference oriented; no
memory management is done. The data member of t is just assigned to a
pointer inside the buffer s. Note that the buffer is not appended with a
'\0' character. The s and len parameters are accessed exactly once each
in this macro.
The resulting struct tagbstring is permanently write protected. Attempts
to write to this struct tagbstring from any bstrlib function will lead to
BSTR_ERR being returned. Invoking the bwriteallow macro onto this struct
tagbstring has no effect.
..........................................................................
void btfromblktrimws (struct tagbstring& t, void * s, int len);
Fill in the tagbstring t with the data buffer s with length len after it
has been left and right trimmed. This action is purely reference
oriented; no memory management is done. The data member of t is just
assigned to a pointer inside the buffer s. Note that the buffer is not
appended with a '\0' character. The s and len parameters are accessed
exactly once each in this macro.
The resulting struct tagbstring is permanently write protected. Attempts
to write to this struct tagbstring from any bstrlib function will lead to
BSTR_ERR being returned. Invoking the bwriteallow macro onto this struct
tagbstring has no effect.
..........................................................................
void bmid2tbstr (struct tagbstring& t, bstring b, int pos, int len);
Fill the tagbstring t with the substring from b, starting from position
pos with a length len. The segment is clamped by the boundaries of
the bstring b. This action is purely reference oriented; no memory
management is done. Note that the buffer is not appended with a '\0'
character. Note that the t parameter to this macro may be accessed
multiple times. Note that the contents of t will become undefined
if the contents of b change or are destroyed.
The resulting struct tagbstring is permanently write protected. Attempts
to write to this struct tagbstring in a write protected state from any
bstrlib function will lead to BSTR_ERR being returned. Invoking the
bwriteallow macro on this struct tagbstring will have no effect.
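The idea of a clamped substring taken purely by reference can be sketched
in plain C as follows. The names view and view_mid are hypothetical, and the
clamping rule shown (a negative position shortens the segment from the left)
is an assumption about reasonable behavior, not a statement of how
bmid2tbstr is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference-only substring: a pointer into someone else's
   buffer plus a length. Nothing is copied and nothing is owned. */
struct view { const char *data; int len; };

/* Returns the part of s[0..slen) covered by [pos, pos + len), clamped
   to the boundaries of s. An empty view has data == NULL or len == 0. */
struct view view_mid(const char *s, int slen, int pos, int len)
{
    struct view v = { NULL, 0 };

    assert(s != NULL && slen >= 0);
    if (len <= 0 || pos >= slen) return v;      /* nothing in range */
    if (pos < 0) {                              /* clip the left edge */
        len += pos;
        pos = 0;
        if (len <= 0) return v;
    }
    if (len > slen - pos) len = slen - pos;     /* clip the right edge */
    v.data = s + pos;
    v.len = len;
    return v;
}
```

As with bmid2tbstr, the view is not '\0' terminated and becomes invalid as
soon as the underlying buffer is changed or released.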
..........................................................................
void bvformata (int& ret, bstring b, const char * format, lastarg);
Append the bstring b with printf like formatting with the format control
string, and the arguments taken from the ... list of arguments after
lastarg passed to the containing function. If the containing function
does not have ... parameters or lastarg is not the last named parameter
before the ... then the results are undefined. If successful, the
results are appended to b and BSTR_OK is assigned to ret. Otherwise
BSTR_ERR is assigned to ret.
Example:
void dbgerror (FILE * fp, const char * fmt, ...) {
int ret;
bstring b;
bvformata (ret, b = bfromcstr ("DBG: "), fmt, fmt);
if (BSTR_OK == ret) fputs ((char *) bdata (b), fp);
bdestroy (b);
}
Note that if the BSTRLIB_NOVSNP macro was set when bstrlib was compiled,
the bvformata macro will not link properly. If the BSTRLIB_NOVSNP macro
has been set, the bvformata macro will not be available.
..........................................................................
void bwriteprotect (struct tagbstring& t);
Disallow bstring from being written to via the bstrlib API. Attempts to
write to the resulting tagbstring from any bstrlib function will lead to
BSTR_ERR being returned.
Note: bstrings which are write protected cannot be destroyed via bdestroy.
Note to C++ users: Setting a CBString as write protected will not prevent
it from being destroyed by the destructor.
..........................................................................
void bwriteallow (struct tagbstring& t);
Allow bstring to be written to via the bstrlib API. Note that such an
action makes the bstring both writable and destroyable. If the bstring is
not legitimately writable (as is the case for struct tagbstrings
initialized with a bsStatic value), the results of this are undefined.
Note that invoking the bwriteallow macro may increase the number of
reallocs by one more than necessary for every call to bwriteallow
interleaved with any bstring API which writes to this bstring.
..........................................................................
int biswriteprotected (struct tagbstring& t);
Returns 1 if the bstring is write protected, otherwise 0 is returned.
===============================================================================
The bstest module
-----------------
The bstest module is just a unit test for the bstrlib module. For correct
implementations of bstrlib, it should execute with 0 failures being reported.
This test should be utilized if modifications/customizations to bstrlib have
been performed. It tests each core bstrlib function with bstrings of every
mode (read-only, NULL, static and mutable) and ensures that the expected
semantics are observed (including results that should indicate an error). It
also tests for aliasing support. Passing bstest is a necessary but not a
sufficient condition for ensuring the correctness of the bstrlib module.
The test module
---------------
The test module is just a unit test for the bstrwrap module. For correct
implementations of bstrwrap, it should execute with 0 failures being
reported. This test should be utilized if modifications/customizations to
bstrwrap have been performed. It tests each core bstrwrap function with
CBStrings write protected or not and ensures that the expected semantics are
observed (including expected exceptions.) Note that exceptions cannot be
disabled to run this test. Passing test is a necessary but not a sufficient
condition for ensuring the correctness of the bstrwrap module.
===============================================================================
Using Bstring and CBString as an alternative to the C library
-------------------------------------------------------------
First let us give a table of C library functions and the alternative bstring
functions and CBString methods that should be used instead of them.
C-library   Bstring alternative         CBString alternative
---------   -------------------         --------------------
gets        bgets                       ::gets
strcpy      bassign                     = operator
strncpy     bassignmidstr               ::midstr
strcat      bconcat                     += operator
strncat     bconcat + btrunc            += operator + ::trunc
strtok      bsplit, bsplits             ::split
sprintf     b(assign)format             ::format
snprintf    b(assign)format + btrunc    ::format + ::trunc
vsprintf    bvformata                   bvformata
vsnprintf   bvformata + btrunc          bvformata + btrunc
vfprintf    bvformata + fputs           use bvformata + fputs
strcmp      biseq, bstrcmp              comparison operators.
strncmp     bstrncmp, memcmp            bstrncmp, memcmp
strlen      ->slen, blength             ::length
strdup      bstrcpy                     constructor
strset      bpattern                    ::fill
strstr      binstr                      ::find
strpbrk     binchr                      ::findchr
stricmp     bstricmp                    cast & use bstricmp
strlwr      btolower                    cast & use btolower
strupr      btoupper                    cast & use btoupper
strrev      bReverse (aux module)       cast & use bReverse
strchr      bstrchr                     cast & use bstrchr
strspnp     use strspn                  use strspn
ungetc      bsunread                    bsunread
The top 9 C functions listed here are troublesome in that they impose memory
management in the calling function. The Bstring and CBString interfaces have
built-in memory management, so there is far less code with far less potential
for buffer overrun problems. strtok can only be reliably called as a "leaf"
calculation, since it (quite bizarrely) maintains hidden internal state. And
gets is well known to be broken no matter what. The Bstrlib alternatives do
not suffer from those sorts of problems.
The substitute for strncat can be performed with higher performance by using
the blk2tbstr macro to create a presized second operand for bconcat.
C-library   Bstring alternative         CBString alternative
---------   -------------------         --------------------
strspn      strspn acceptable           strspn acceptable
strcspn     strcspn acceptable          strcspn acceptable
strnset     strnset acceptable          strnset acceptable
printf      printf acceptable           printf acceptable
puts        puts acceptable             puts acceptable
fprintf     fprintf acceptable          fprintf acceptable
fputs       fputs acceptable            fputs acceptable
memcmp      memcmp acceptable           memcmp acceptable
Remember that Bstring (and CBString) functions will automatically append the
'\0' character to the character data buffer. So by simply accessing the data
buffer directly, ordinary C string library functions can be called directly
on them. Note that bstrcmp is not the same as memcmp in exactly the same way
that strcmp is not the same as memcmp.
C-library   Bstring alternative         CBString alternative
---------   -------------------         --------------------
fread       balloc + fread              ::alloc + fread
fgets       balloc + fgets              ::alloc + fgets
These are odd ones because of the exact sizing of the buffer required. The
Bstring and CBString alternatives require that the buffers be forced to
hold at least the prescribed length; fread or fgets can then be used
directly. However, the automatic memory management of Bstring and CBString
will usually make the use of fgets and fread to read specifically sized
strings unnecessary.
Implementation Choices
----------------------
Overhead:
.........
The bstring library has more overhead versus straight char buffers for most
functions. This overhead is essentially just the memory management and
string header allocation. This overhead usually only shows up for small
string manipulations. The performance loss has to be considered in
light of the following:
1) What would be the performance loss of trying to write this management
code in one's own application?
2) Since the bstring library source code is given, a sufficiently powerful
modern inlining globally optimizing compiler can remove function call
overhead.
Since the data type is exposed, a developer can replace any unsatisfactory
function with their own inline implementation. But raw speed is beside the
main point of what the better string library is meant to provide. Any
overhead incurred has to be weighed against the value of the safe abstraction
for coupling memory management and string functionality.
Performance of the C interface:
...............................
The algorithms used have performance advantages versus the analogous C
library functions. For example:
1. bfromcstr/blk2bstr/bstrcpy versus strcpy/strdup. By using memmove instead
of strcpy, the break condition of the copy loop is based on an independent
counter (that should be allocated in a register) rather than having to
check the results of the load. Modern out-of-order executing CPUs can
parallelize the final branch mis-predict penalty with the loading of the
source string. Some CPUs will also tend to have better built-in hardware
support for counted memory moves than load-compare-store. (This is a
minor, but non-zero gain.)
2. biseq versus strcmp. If the strings are unequal in length, biseq will
return in O(1) time. If the strings are aliased, or have aliased data
buffers, biseq will return in O(1) time. strcmp will always be O(k),
where k is the length of the common prefix or the whole string if they are
identical.
3. ->slen versus strlen. ->slen is obviously always O(1), while strlen is
always O(n) where n is the length of the string.
4. bconcat versus strcat. Both rely on precomputing the length of the
destination string argument, which will favor the bstring library. On
iterated concatenations the performance difference can be enormous.
5. bsreadln versus fgets. The bsreadln function reads large blocks at a time
from the given stream, then parses out lines from the buffers directly.
Some C libraries will implement fgets as a loop over single fgetc calls.
Testing indicates that the bsreadln approach can be several times faster
for fast stream devices (such as a file that has been entirely cached.)
6. bsplits/bsplitscb versus strspn. Accelerators for the set of match
characters are generated only once.
7. binstr versus strstr. The binstr implementation unrolls the loops to
help reduce loop overhead. This will matter if the target string is
long and source string is not found very early in the target string.
With strstr, while it is possible to unroll the source contents, it is
not possible to do so with the destination contents in a way that is
effective because every destination character must be tested against
'\0' before proceeding to the next character.
8. bReverse versus strrev. The C function must find the end of the string
first before swapping character pairs.
9. bstrrchr versus no comparable C function. It's not hard to write some C
code to search for a character from the end going backwards. But there
is no way to do this without computing the length of the string with
strlen.
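Items 2 and 9 can be illustrated with a minimal sketch. The struct and the
function names below (counted, counted_equal, counted_rchr) are hypothetical
and are not part of Bstrlib; they only assume a buffer that carries its own
length, which is the property the comparisons above rely on.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted string: NOT Bstrlib's struct tagbstring, just an
   illustration of a buffer that carries its own length. */
struct counted {
    int slen;                    /* number of bytes in use */
    const unsigned char *data;   /* may contain '\0' bytes */
};

/* Equality test in the spirit of item 2: different lengths or aliased
   buffers are decided in O(1), before any byte is compared. */
int counted_equal(const struct counted *a, const struct counted *b) {
    assert(a != NULL && b != NULL);
    if (a->slen != b->slen) return 0;                    /* O(1) rejection */
    if (a->slen == 0 || a->data == b->data) return 1;    /* O(1) accept   */
    return memcmp(a->data, b->data, (size_t) a->slen) == 0;
}

/* Reverse character search in the spirit of item 9: the known length lets
   the scan start at the last byte, with no strlen pass. Returns the index
   of the last occurrence of c, or -1 if there is none. */
int counted_rchr(const struct counted *s, int c) {
    int i;
    assert(s != NULL);
    for (i = s->slen - 1; i >= 0; i--) {
        if (s->data[i] == (unsigned char) c) return i;
    }
    return -1;
}
```

Only the final memcmp depends on the string contents; the two early exits
cost the same no matter how long the strings are.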
Practical testing indicates that in general Bstrlib is never significantly
slower than the C library for common operations, while very often having a
performance advantage that ranges from significant to massive. Even for
functions like b(n)inchr versus str(c)spn() (where, in theory, there is no
advantage for the Bstrlib architecture) the performance of Bstrlib is vastly
superior to most tested C library implementations.
Some of Bstrlib's extra functionality also leads to inevitable performance
advantages over typical C solutions. For example, using the blk2tbstr macro,
one can (in O(1) time) generate an internal substring by reference while not
disturbing the original string. If disturbing the original string is not an
option, typically, a comparable char * solution would have to make a copy of
the substring to provide similar functionality. Another example is reverse
character set scanning -- the str(c)spn functions only scan in a forward
direction which can complicate some parsing algorithms.
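The substring-by-reference idea can be sketched without the library. The
type and function below (strview, view_of) are hypothetical stand-ins used
only for illustration; in Bstrlib the header type is struct tagbstring and
the macro named above is blk2tbstr.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical header type for illustration only; it is not Bstrlib's
   struct tagbstring. */
struct strview {
    const unsigned char *data;  /* points into someone else's buffer */
    int slen;                   /* length of the referenced piece */
};

/* Build a view of len bytes starting at offset pos inside buf. Nothing is
   copied or allocated, so the cost is O(1) and buf is left untouched. */
struct strview view_of(const unsigned char *buf, int pos, int len) {
    struct strview v;
    assert(buf != NULL && pos >= 0 && len >= 0);
    v.data = buf + pos;
    v.slen = len;
    return v;
}
```

Splitting "key=value" into two views this way leaves the '=' in place. A
'\0' terminated solution would have to either overwrite the '=' with '\0'
or copy each piece into its own buffer.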
Where high performance char * based algorithms are available, Bstrlib can
still leverage them by accessing the ->data field on bstrings. So
realistically Bstrlib can never be significantly slower than any standard
'\0' terminated char * based solutions.
Performance of the C++ interface:
.................................
The C++ interface has been designed with an emphasis on abstraction and safety
first. However, since it is substantially a wrapper for the C bstring
functions, for longer strings the performance comments described in the
"Performance of the C interface" section above still apply. Note that the
(CBString *) type can be directly cast to a (bstring) type, and passed as
parameters to the C functions (though a CBString must never be passed to
bdestroy.)
Probably the most controversial choice is performing full bounds checking on
the [] operator. This decision was made because 1) the fast alternative of
not bounds checking is still available by first casting the CBString to a
(const char *) buffer or to a (struct tagbstring) then dereferencing .data and
2) because the lack of bounds checking is seen as one of the main weaknesses
of C/C++ versus other languages. This check being done on every access leads
to individual character extraction being actually slower than other languages
in this one respect (other languages' compilers will normally dedicate more
resources to hoisting or removing bounds checking as necessary) but otherwise
brings C++ up to the level of other languages in terms of functionality.
It is common for other C++ libraries to leverage the abstractions provided by
C++ to use reference counting and "copy on write" policies. While these
techniques can speed up some scenarios, they impose a problem with respect to
thread safety. bstrings and CBStrings can be properly protected with
"per-object" mutexes, meaning that two bstrlib calls can be made and execute
simultaneously, so long as the bstrings and CBstrings are distinct. With a
reference count and alias before copy on write policy, global mutexes are
required that prevent multiple calls to the strings library from executing
simultaneously regardless of whether or not the strings represent the same
string.
One interesting trade off in CBString is that the default constructor is not
trivial. I.e., it always prepares a ready to use memory buffer. The purpose
is to ensure that there is a uniform internal composition for any functioning
CBString that is compatible with bstrings. It also means that the other
methods in the class are not forced to perform "late initialization" checks.
In the end it means that construction of CBStrings is slower than other
comparable C++ string classes. Initial testing, however, indicates that
CBString outperforms std::string and MFC's CString, for example, in all other
operations. So to work around this weakness it is recommended that CBString
declarations be pushed outside of inner loops.
Practical testing indicates that with the exception of the caveats given
above (constructors and safe index character manipulations) the C++ API for
Bstrlib generally outperforms popular standard C++ string classes. Amongst
the standard libraries and compilers, the quality of concatenation operations
varies wildly and very little care has gone into search functions. Bstrlib
dominates those performance benchmarks.
Memory management:
..................
The bstring functions which write and modify bstrings will automatically
reallocate the backing memory for the char buffer whenever it is required to
grow. The resizing algorithm chosen is to snap up to sizes that are a
power of two which are sufficient to hold the intended new size. Memory
reallocation is not performed when the required size of the buffer is
decreased. This behavior can be relied on, and is necessary to make the
behaviour of balloc deterministic. This trades off additional memory usage
for decreasing the frequency for required reallocations:
1. For any bstring whose size never exceeds n, its buffer is not ever
reallocated more than log_2(n) times for its lifetime.
2. For any bstring whose size never exceeds n, its buffer is never more than
2*(n+1) in length. (The extra characters beyond 2*n are to allow for the
implicit '\0' which is always added by the bstring modifying functions.)
Decreasing the buffer size when the string decreases in size would violate 1)
above and in real-world cases lead to pathological heap thrashing. Similarly,
allocating more tightly than "least power of 2 greater than necessary" would
lead to a violation of 1) and have the same potential for heap thrashing.
Property 2) needs emphasizing. Although the memory allocated is always a
power of 2, for a bstring that grows linearly in size, its buffer memory also
grows linearly, not exponentially. The reason is that the amount of extra
space increases with each reallocation, which decreases the frequency of
future reallocations.
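The growth policy can be simulated with a short sketch. This is not
Bstrlib's actual resizing code; snap_up_pow2 and count_reallocs are
hypothetical names, and the simulation starts from the smallest possible
capacity, which a real allocation need not do.

```c
#include <assert.h>

/* Least power of two that is >= need. A sketch of the policy described
   above, not Bstrlib's actual resizing code. */
int snap_up_pow2(int need) {
    int cap = 1;
    assert(need >= 1 && need <= (1 << 30));
    while (cap < need) cap <<= 1;
    return cap;
}

/* Simulate appending one character at a time until the string is n
   characters long, and count how often the capacity has to change.
   The buffer always needs room for the string plus the trailing '\0'. */
int count_reallocs(int n) {
    int cap = snap_up_pow2(1);   /* empty string: just the '\0' */
    int reallocs = 0;
    int len;
    for (len = 1; len <= n; len++) {
        int need = len + 1;
        if (need > cap) {
            cap = snap_up_pow2(need);
            reallocs++;
        }
        assert(cap <= 2 * (len + 1));   /* property 2 from the text */
    }
    return reallocs;
}
```

In this simulation a string grown to 1000 characters changes capacity 10
times and one grown to 1000000 characters changes capacity 20 times, so the
count grows with the logarithm of the length while the capacity itself
stays within a factor of two of what is needed.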
Obviously, given that bstring writing functions may reallocate the data
buffer backing the target bstring, one should not attempt to cache the data
buffer address and use it after such bstring functions have been called.
This includes making reference struct tagbstrings which alias to a writable
bstring.
balloc or bfromcstralloc can be used to preallocate the minimum amount of
space used for a given bstring. This will reduce even further the number of
times the data portion is reallocated. If the length of the string is never
more than one less than the memory length then there will be no further
reallocations.
Note that invoking the bwriteallow macro may increase the number of reallocs
by one more than necessary for every call to bwriteallow interleaved with any
bstring API which writes to this bstring.
The library does not use any mechanism for automatic clean up for the C API.
Thus explicit clean up via calls to bdestroy() is required to avoid memory
leaks.
Constant and static tagbstrings:
................................
A struct tagbstring can be write protected from any bstrlib function using
the bwriteprotect macro. A write protected struct tagbstring can then be
reset to being writable via the bwriteallow macro. There is, of course, no
protection from attempts to directly access the bstring members. Modifying a
bstring which is write protected by direct access has undefined behavior.
static struct tagbstrings can be declared via the bsStatic macro. They are
considered permanently unwritable. Such struct tagbstrings are declared
such that attempts to write to them are not well defined. Invoking either
bwriteallow or bwriteprotect on static struct tagbstrings has no effect.
struct tagbstrings initialized via btfromcstr or blk2tbstr are protected by
default but can be made writeable via the bwriteallow macro. If bwriteallow
is called on such struct tagbstrings, it is the programmer's responsibility
to ensure that:
1) the buffer supplied was allocated from the heap.
2) bdestroy is not called on this tagbstring (unless the header itself has
also been allocated from the heap.)
3) free is called on the buffer to reclaim its memory.
bwriteallow and bwriteprotect can be invoked on ordinary bstrings (they have
to be dereferenced with the (*) operator to get the levels of indirection
correct) to give them write protection.
Buffer declaration:
...................
The memory buffer is actually declared "unsigned char *" instead of "char *".
The reason for this is to trigger compiler warnings whenever uncasted char
buffers are assigned to the data portion of a bstring. This will draw more
diligent programmers into taking a second look at the code where they
have carelessly left off the typically required cast. (Research from
AT&T/Lucent indicates that additional programmer eyeballs are one of the most
effective mechanisms at ferreting out bugs.)
Function pointers:
..................
The bgets, bread and bStream functions use function pointers to obtain
strings from data streams. The function pointer declarations have been
specifically chosen to be compatible with the fgetc and fread functions.
While this may seem to be a convoluted way of implementing fgets and fread
style functionality, it has been specifically designed this way to ensure
that there is no dependency on a single narrowly defined set of device
interfaces, such as just stream I/O. In the embedded world, it's quite
possible to have environments where such interfaces may not exist in the
standard C library form. Furthermore, the generalization that this opens up
allows for more sophisticated uses for these functions (performing an fgets
like function on a socket, for example.) By using function pointers, it also
allows such abstract stream interfaces to be created using the bstring library
itself while not creating a circular dependency.
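As a minimal sketch of this design, the reader below depends only on an
fgetc-like callback and is fed from a block of memory instead of a FILE *.
All names here (getc_fn, mem_src, mem_getc, read_line) are hypothetical and
are not Bstrlib's API; they only mirror the idea of replacing the FILE *
parameter with a void *.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical callback type with fgetc-like semantics: returns the next
   byte as a non-negative int, or EOF when the source is exhausted. */
typedef int (*getc_fn)(void *parm);

/* A non-FILE source: a cursor over a block of memory. */
struct mem_src {
    const unsigned char *buf;
    size_t len;
    size_t pos;
};

int mem_getc(void *parm) {
    struct mem_src *m = (struct mem_src *) parm;
    if (m->pos >= m->len) return EOF;
    return m->buf[m->pos++];
}

/* fgets-like reader that depends only on the callback. Reads up to and
   including '\n', always '\0'-terminates, and returns the number of
   bytes stored (0 at end of input). */
size_t read_line(getc_fn get, void *parm, char *out, size_t cap) {
    size_t n = 0;
    int c;
    assert(get != NULL && out != NULL && cap > 0);
    while (n + 1 < cap && (c = get(parm)) != EOF) {
        out[n++] = (char) c;
        if (c == '\n') break;
    }
    out[n] = '\0';
    return n;
}
```

The same read_line could be pointed at a socket or a procedurally generated
source by supplying a different callback and context pointer, without any
change to the reader itself.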
Use of int's for sizes:
.......................
This is just a recognition that 16bit platforms with requirements for strings
that are larger than 64K and 32bit+ platforms with requirements for strings
that are larger than 4GB are pretty marginal. The main focus is for 32bit
platforms, and emerging 64bit platforms with reasonable < 4GB string
requirements. Using ints allows for negative values which has meaning
internally to bstrlib.
Semantic consideration:
.......................
Certain care needs to be taken when copying and aliasing bstrings. A bstring
is essentially a pointer type which points to a multipart abstract data
structure. Thus usage, and lifetime of bstrings have semantics that follow
these considerations. For example:
    bstring a, b;
    struct tagbstring t;

    a = bfromcstr("Hello"); /* Create new bstring and copy "Hello" into it. */
    b = a;                  /* Alias b to the contents of a.                */
    t = *a;                 /* Create a current instance pseudo-alias of a. */
    bconcat (a, b);         /* Double a and b, t is now undefined.          */
    bdestroy (a);           /* Destroy the contents of both a and b.        */
Variables of type bstring are really just references that point to real
bstring objects. The equal operator (=) creates aliases, and the asterisk
dereference operator (*) creates a kind of alias to the current instance (which
is generally not useful for any purpose.) Using bstrcpy() is the correct way
of creating duplicate instances. The ampersand operator (&) is useful for
creating aliases to struct tagbstrings (remembering that constructed struct
tagbstrings are not writable by default.)
CBStrings use complete copy semantics for the equal operator (=), and thus do
not have these sorts of issues.
Debugging:
..........
Bstrings have a simple, exposed definition and construction, and the library
itself is open source. So most debugging is going to be fairly
straightforward. But the memory for bstrings comes from the heap, which can
often be corrupted indirectly, and it might not be obvious what has happened
even from direct examination of the contents in a debugger or a core dump.
There are some tools such as Purify, Insure++ and Electric Fence which can
help solve such problems, however another common approach is to directly
instrument the calls to malloc, realloc, calloc, free, memcpy, memmove and/or
other calls by overriding them with macro definitions.
Although the user could hack on the Bstrlib sources directly as necessary to
perform such an instrumentation, Bstrlib comes with a built-in mechanism for
doing this. Defining the macro BSTRLIB_MEMORY_DEBUG and providing an include
file named memdbg.h will force the core Bstrlib modules to attempt to
include this file. In such a file, macros could be defined which override
Bstrlib's usage of the C standard library.
Rather than calling malloc, realloc, free, memcpy or memmove directly, Bstrlib
emits the macros bstr__alloc, bstr__realloc, bstr__free, bstr__memcpy and
bstr__memmove in their place respectively. By default these macros are simply
assigned to be equivalent to their corresponding C standard library function
call. However, if they are given earlier macro definitions (via the back
door include file) they will not be given their default definition. In this
way Bstrlib's interface to the standard library can be changed but without
having to directly redefine or link standard library symbols (both of which
are not strictly ANSI C compliant.)
An example definition might include:
#define bstr__alloc(sz) X_malloc ((sz), __LINE__, __FILE__)
which might help contextualize heap entries in a debugging environment.
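A slightly fuller sketch of what such a memdbg.h might contain follows.
X_malloc, X_free and X_live are invented names for illustration; only the
bstr__alloc and bstr__free macro names come from the description above, and
a real instrumentation layer would likely track far more than a counter.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical contents of a memdbg.h. X_malloc, X_free and X_live are
   invented names; only the bstr__alloc and bstr__free macro names come
   from the text above. */
static long x_live_blocks = 0;   /* allocations not yet freed */

void *X_malloc(size_t sz, int line, const char *file) {
    void *p = malloc(sz);
    if (p != NULL) x_live_blocks++;
    fprintf(stderr, "%s:%d: alloc %lu bytes -> %p\n",
            file, line, (unsigned long) sz, p);
    return p;
}

void X_free(void *p, int line, const char *file) {
    if (p != NULL) {
        assert(x_live_blocks > 0);   /* more frees than allocations */
        x_live_blocks--;
    }
    fprintf(stderr, "%s:%d: free %p\n", file, line, p);
    free(p);
}

/* Number of blocks currently allocated through the macros below. */
long X_live(void) {
    return x_live_blocks;
}

#define bstr__alloc(sz) X_malloc ((sz), __LINE__, __FILE__)
#define bstr__free(p)   X_free ((p), __LINE__, __FILE__)
```

Checking that X_live() returns 0 at program exit gives a crude leak test
for every allocation routed through these macros.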
The NULL parameter and sanity checking of bstrings is part of the Bstrlib
API, and thus Bstrlib itself does not present any different modes which would
correspond to "Debug" or "Release" modes. Bstrlib always contains mechanisms
which one might think of as debugging features, but retains the performance
and small memory footprint one would normally associate with release mode
code.
Integration with Microsoft's Visual Studio debugger:
....................................................
Microsoft's Visual Studio debugger has a capability of customizable mouse
float over data type descriptions. This is accomplished by editing the
AUTOEXP.DAT file to include the following:
; new for CBString
tagbstring =slen=<slen> mlen=<mlen> <data,st>
Bstrlib::CBStringList =count=<size()>
In Visual C++ 6.0 this file is located in the directory:
C:\Program Files\Microsoft Visual Studio\Common\MSDev98\Bin
and in Visual Studio .NET 2003 it's located here:
C:\Program Files\Microsoft Visual Studio .NET 2003\Common7\Packages\Debugger
This will improve the ability of debugging with Bstrlib under Visual Studio.
Security
--------
Bstrlib does not come with explicit security features outside of its fairly
comprehensive error detection, coupled with its strict semantic support.
That is to say that certain common security problems, such as buffer overrun,
constant overwrite, arbitrary truncation etc, are far less likely to happen
inadvertently. Where it does help, Bstrlib maximizes its advantage by
providing developers a simple adoption path that lets them leave less secure
string mechanisms behind. The library will not leave developers wanting, so
they will be less likely to add new code using a less secure string library
to add functionality that might be missing from Bstrlib.
That said there are a number of security ideas not addressed by Bstrlib:
1. Race condition exploitation (i.e., verifying a string's contents, then
raising the privilege level and executing it as a shell command as two
non-atomic steps) is well beyond the scope of what Bstrlib can provide. It
should be noted that MFC's built-in string mutex actually does not solve this
problem either -- it just removes immediate data corruption as a possible
outcome of such exploit attempts (it can be argued that this is worse, since
it will leave no trace of the exploitation). In general race conditions have
to be dealt with by careful design and implementation; they cannot be assisted
by a string library.
2. Any kind of access control or security attributes to prevent usage in
dangerous interfaces such as system(). Perl includes a "trust" attribute
which can be endowed upon strings that are intended to be passed to such
dangerous interfaces. However, Perl's solution reflects its own limitations
-- notably that it is not a strongly typed language. In the example code for
Bstrlib, there is a module called taint.cpp. It demonstrates how to write a
simple wrapper class for managing "untainted" or trusted strings using the
type system to prevent questionable mixing of ordinary untrusted strings with
untainted ones then passing them to dangerous interfaces. In this way the
security correctness of the code reduces to auditing the direct usages of
dangerous interfaces or promotions of tainted strings to untainted ones.
3. Encryption of string contents is way beyond the scope of Bstrlib.
Maintaining encrypted string contents in the futile hopes of thwarting things
like using system-level debuggers to examine sensitive string data is likely
to be a wasted effort (imagine a debugger that runs at a higher level than a
virtual processor where the application runs). For more standard encryption
usages, since the bstring contents are simply binary blocks of data, this
should pose no problem for usage with other standard encryption libraries.
Compatibility
-------------
The Better String Library is known to compile and function correctly with the
following compilers:
- Microsoft Visual C++
- Watcom C/C++
- Intel's C/C++ compiler (Windows)
- The GNU C/C++ compiler (cygwin and Linux on PPC64)
- Borland C
- Turbo C
Setting of configuration options should be unnecessary for these compilers
(unless exceptions are being disabled or STLport has been added to WATCOM
C/C++). Bstrlib has been developed with an emphasis on portability. As such
porting it to other compilers should be straightforward. This package
includes a porting guide (called porting.txt) which explains what issues may
exist for porting Bstrlib to different compilers and environments.
ANSI issues
-----------
1. The function pointer types bNgetc and bNread have prototypes which are very
similar to, but not exactly the same as fgetc and fread respectively.
Basically the FILE * parameter is replaced by void *. The purpose of this
was to allow one to create other functions with fgetc and fread like
semantics without being tied to ANSI C's file streaming mechanism. I.e., one
could very easily adapt it to sockets, or simply reading a block of memory,
or procedurally generated strings (for fractal generation, for example.)
The problem is that invoking the functions (bNgetc)fgetc and (bNread)fread is
not technically legal in ANSI C. The reason being that the compiler is only
able to coerce the function pointers themselves into the target type, but
is unable to perform any cast (implicit or otherwise) on the parameters
passed once invoked. I.e., if internally void * and FILE * need some kind of
mechanical coercion, the compiler will not properly perform this conversion
and thus lead to undefined behavior.
Apparently a platform from Data General called "Eclipse" and another from
Tandem called "NonStop" have a different representation for pointers to bytes
and pointers to words, for example, where coercion via casting is necessary.
(Actual confirmation of the existence of such machines is hard to come by, so
it is prudent to be skeptical about this information.) However, this is not
an issue for any known contemporary platforms. One may conclude that such
platforms are effectively apocryphal even if they do exist.
To correctly work around this problem to the satisfaction of the ANSI
limitations, one needs to create wrapper functions for fgetc and/or
fread with the prototypes of bNgetc and/or bNread respectively which perform
no action other than to explicitly cast the void * parameter to a
FILE *, and simply pass the remaining parameters straight to the function
pointer call.
The wrappers themselves are trivial:
    size_t freadWrap (void * buff, size_t esz, size_t eqty, void * parm) {
        return fread (buff, esz, eqty, (FILE *) parm);
    }

    int fgetcWrap (void * parm) {
        return fgetc ((FILE *) parm);
    }
These have not been supplied in bstrlib or bstraux to prevent unnecessary
linking with file I/O functions.
2. vsnprintf is not available on all compilers. Because of this, the bformat
and bformata functions (and format and formata methods) are not guaranteed to
work properly. For those compilers that don't have vsnprintf, the
BSTRLIB_NOVSNP macro should be set before compiling bstrlib, and the format
functions/method will be disabled.
The more recent ANSI C standards have specified the required inclusion of a
vsnprintf function.
3. The bstrlib function names are not unique in the first 6 characters. This
is only an issue for older C compiler environments which do not store more
than 6 characters for function names.
4. The bsafe module defines macros and function names which are part of the
C library. This simply overrides the definition as expected on all platforms
tested, however it is not sanctioned by the ANSI standard. This module is
clearly optional and should be omitted on platforms which disallow its
undefined semantics.
In practice the real issue is that some compilers in some modes of operation
can/will inline these standard library functions on a module by module basis
as they appear in each. The linker will thus have no opportunity to override
the implementation of these functions for those cases. This can lead to
inconsistent behaviour of the bsafe module on different platforms and
compilers.
===============================================================================
Comparison with Microsoft's CString class
-----------------------------------------
Although developed independently, CBStrings have very similar functionality to
Microsoft's CString class. However, the bstring library has significant
advantages over CString:
1. Bstrlib is a C-library as well as a C++ library (using the C++ wrapper).
- Thus it is compatible with more programming environments and
available to a wider population of programmers.
2. The internal structure of a bstring is considered exposed.
- A single contiguous block of data can be cut into read-only pieces by
simply creating headers, without allocating additional memory to create
reference copies of each of these sub-strings.
- In this way, using bstrings in a totally abstracted way becomes a choice
rather than an imposition. Further this choice can be made differently
at different layers of applications that use it.
3. Static declaration support precludes the need for constructor
invocation.
- Allows for static declarations of constant strings that have no
additional constructor overhead.
4. Bstrlib is not attached to another library.
- Bstrlib is designed to be easily plugged into any other library
collection, without dependencies on other libraries or paradigms (such
as "MFC".)
The bstring library also comes with a few additional functions that are not
available in the CString class:
- bsetstr
- bsplit
- bread
- breplace (this is different from CString::Replace())
- Writable indexed characters (for example a[i]='x')
Interestingly, although Microsoft did implement mid$(), left$() and right$()
functional analogues (these are functions from GWBASIC) they seem to have
forgotten that mid$() could be also used to write into the middle of a string.
This functionality exists in Bstrlib with the bsetstr() and breplace()
functions.
Among the disadvantages of Bstrlib is that there is no special support for
localization or wide characters. Such things are considered beyond the scope
of what bstrings are trying to deliver. CString essentially supports the
older UCS-2 version of Unicode via wchar_t as an application-wide compile
time switch.
CString's also use built-in mechanisms for ensuring thread safety under all
situations. While this makes writing thread safe code that much easier, this
built-in safety feature has a price -- the inner loops of each CString method
run in their own critical section (grabbing and releasing a light weight mutex
on every operation.) The usual way to decrease the impact of a critical
section performance penalty is to amortize more operations per critical
section. But since the implementation of CStrings is fixed as a one critical
section per-operation cost, there is no way to leverage this common
performance enhancing idea.
The search facilities in Bstrlib are comparable to those in MFC's CString
class, though it is missing locale specific collation. But because Bstrlib
is interoperable with C's char buffers, it will allow programmers to write
their own string searching mechanism (such as Boyer-Moore), or be able to
choose from a variety of available existing string searching libraries (such
as those for regular expressions) without difficulty.
Microsoft used a very non-ANSI conforming trick in its implementation to
allow printf() to use the "%s" specifier to output a CString correctly. This
can be convenient, but it is inherently not portable. CBString requires an
explicit cast, while bstring requires the data member to be dereferenced.
Microsoft's own documentation recommends casting, instead of relying on this
feature.
Comparison with C++'s std::string
---------------------------------
This is the C++ language's standard STL based string class.
1. There is no C implementation.
2. The [] operator is not bounds checked.
3. Missing a lot of useful functions like printf-like formatting.
4. Some sub-standard std::string implementations (SGI) are necessarily unsafe
to use with multithreading.
5. Limited by STL's std::iostream which in turn is limited by ifstream which
can only take input from files. (Compare to CBStream's API which can take
abstracted input.)
6. Extremely uneven performance across implementations.
Comparison with ISO C TR 24731 proposal
---------------------------------------
Following the ISO C99 standard, Microsoft has proposed a group of C library
extensions which are supposedly "safer and more secure". This proposal is
expected to be adopted by the ISO C standard which follows C99.
The proposal reveals itself to be very similar to Microsoft's "StrSafe"
library. The functions are basically the same as other standard C library
string functions except that destination parameters are paired with an
additional length parameter of type rsize_t. rsize_t is the same as size_t,
however, the range is checked to make sure it's between 1 and RSIZE_MAX. Like
Bstrlib, the functions perform a "parameter check". Unlike Bstrlib, when a
parameter check fails, rather than simply outputting accumulatable error
statuses, they call a user settable global error function handler, and upon
return of control perform no (additional) detrimental action. The proposal
covers basic string functions as well as a few non-reentrant functions
(asctime, ctime, and strtok).
1. Still based solely on char * buffers (and therefore strlen() and strcat()
are still O(n), and there are no faster streq() comparison functions.)
2. No growable string semantics.
3. Requires manual buffer length synchronization in the source code.
4. No attempt to enhance functionality of the C library.
5. Introduces a new error scenario (strings exceeding RSIZE_MAX length).
The hope is that by exposing the buffer length requirements there will be
fewer buffer overrun errors. However, the error modes are really just
transformed, rather than removed. The real problem of buffer overflows is
that they all happen as a result of erroneous programming. So forcing
programmers to manually deal with buffer limits, will make them more aware of
the problem but doesn't remove the possibility of erroneous programming. So
a programmer that erroneously mixes up the rsize_t parameters is no better off
than a programmer that introduces potential buffer overflows through other
more typical lapses. So at best this may reduce the rate of erroneous
programming, rather than making any attempt at removing failure modes.
The error handler can discriminate between types of failures, but does not
take into account any callsite context. So the problem is that the error is
going to be manifest in a piece of code, but there is no pointer to that
code. It would seem that passing in the call site __FILE__, __LINE__ as
parameters would be very useful, but the API clearly doesn't support such a
thing (it would increase code bloat even more than the extra length
parameter does, and would require macro tricks to implement).
The Bstrlib C API takes the position that error handling needs to be done at
the callsite, and just tries to make it as painless as possible. Furthermore,
error modes are removed by supporting auto-growing strings and aliasing. For
capturing errors in more central code fragments, Bstrlib's C++ API uses
exception handling extensively, which is superior to the leaf-only error
handler approach.
Comparison with Managed String Library CERT proposal
----------------------------------------------------
The main webpage for the managed string library:
http://www.cert.org/secure-coding/managedstring.html
Robert Seacord at CERT has proposed a C string library that he calls the
"Managed String Library" for C. Like Bstrlib, it introduces a new type
which is called a managed string. The structure of a managed string
(string_m) is like a struct tagbstring but missing the length field. This
internal structure is considered opaque. The length is, like the C standard
library, always computed on the fly by searching for a terminating NUL on
every operation that requires it. So it suffers from every performance
problem that the C standard library suffers from. Interoperating with C
string APIs (like printf, fopen, or anything else that takes a string
parameter) requires copying to additionally allocated buffers that have to
be manually freed -- this makes this library probably slower and more
cumbersome than any other string library in existence.
The library gives a fully populated error status as the return value of every
string function. The hope is to be able to diagnose all problems
specifically from the return code alone. Comparing this to Bstrlib, which
always returns one consistent error message, might make it seem that Bstrlib
would be harder to debug; but this is not true. With Bstrlib, if an error
occurs there is always enough information from just knowing there was an error
and examining the parameters to deduce exactly what kind of error has
happened. The managed string library thus gives up nested function calls
while achieving little benefit, while Bstrlib does not.
One interesting feature that "managed strings" has is the idea of data
sanitization via character set whitelisting. That is to say, a globally
definable filter that makes any attempt to put invalid characters into strings
lead to an error and not modify the string. The author gives the following
example:
    // create valid char set
    if (retValue = strcreate_m(&str1, "abc") ) {
        fprintf(
            stderr,
            "Error %d from strcreate_m.\n",
            retValue
        );
    }
    if (retValue = setcharset(str1)) {
        fprintf(
            stderr,
            "Error %d from setcharset().\n",
            retValue
        );
    }
    if (retValue = strcreate_m(&str1, "aabbccabc")) {
        fprintf(
            stderr,
            "Error %d from strcreate_m.\n",
            retValue
        );
    }
    // create string with invalid char set
    if (retValue = strcreate_m(&str1, "abbccdabc")) {
        fprintf(
            stderr,
            "Error %d from strcreate_m.\n",
            retValue
        );
    }
Which we can compare with a more Bstrlib way of doing things:
    bstring bCreateWithFilter (const char * cstr, const_bstring filter) {
        bstring b = bfromcstr (cstr);
        /* bninchr searches b from position 0 for the first character that
           is NOT in filter; BSTR_ERR means no such character was found. */
        if (NULL != b && BSTR_ERR != bninchr (b, 0, filter)) {
            fprintf (stderr, "Filter violation.\n");
            bdestroy (b);
            b = NULL;
        }
        return b;
    }

    struct tagbstring charFilter = bsStatic ("abc");
    bstring str1 = bCreateWithFilter ("aabbccabc", &charFilter);
    bstring str2 = bCreateWithFilter ("aabbccdabc", &charFilter);
The first thing we should notice is that with the Bstrlib approach you can
have different filters for different strings if necessary. Furthermore,
selecting a charset filter in the Managed String Library is uni-contextual.
That is to say, there can only be one such filter active for the entire
program, which means its usage is not well defined for intermediate library
usage (a library that uses it will interfere with user code that uses it, and
vice versa.) It is also likely to be poorly defined in multi-threading
environments.
There is also a question as to whether the data sanitization filter is checked
on every operation, or just on creation operations. Since the charset can be
set arbitrarily at run time, it might be set *after* some managed strings have
been created. This would seem to imply that all functions should run this
additional check every time if there is an attempt to enforce this. This
would make things tremendously slow. On the other hand, if it is assumed that
only creates and other operations that take char *'s as input need be checked
because setcharset() was only supposed to be called once, before any
other managed string was created, then one can see that it's easy to cover
Bstrlib with equivalent functionality via a few wrapper calls such as the
example given above.
And finally we have to question the value of sanitization in the first place.
For example, for httpd servers, there is generally a requirement that the
URLs parsed have some form that avoids undesirable translation to local file
system filenames or resources. The problem is that, because of the way URLs
can be encoded, a URL must be completely parsed and translated before one can
know whether it uses certain invalid character combinations. That is to say,
merely filtering each character one at a time is not necessarily the right
way to ensure that a string has safe contents.
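
As a minimal sketch of this point in plain C (it uses neither Bstrlib nor the
Managed String Library, and the helper names passes_whitelist, percent_decode
and has_dotdot_segment are hypothetical), a percent-encoded URL can pass a
per-character whitelist and still decode to a path containing ".." segments:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: nonzero if every character of s is in allowed. */
static int passes_whitelist (const char * s, const char * allowed) {
    return s[strspn (s, allowed)] == '\0';
}

/* Value of a hex digit, or -1 if c is not one. */
static int hexval (int c) {
    if (c >= '0' && c <= '9') return c - '0';
    c = tolower (c);
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Hypothetical helper: decode %XX escapes from src into dst, which must
   hold at least strlen (src) + 1 bytes.  Malformed escapes are copied
   through unchanged. */
static void percent_decode (char * dst, const char * src) {
    assert (dst != NULL && src != NULL);
    while (*src) {
        int hi, lo;
        if (src[0] == '%' && (hi = hexval ((unsigned char) src[1])) >= 0
                          && (lo = hexval ((unsigned char) src[2])) >= 0) {
            *dst++ = (char) (hi * 16 + lo);
            src += 3;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}

/* Hypothetical helper: nonzero if path contains a ".." segment. */
static int has_dotdot_segment (const char * path) {
    const char * p = path;
    while (*p) {
        size_t n = strcspn (p, "/");
        if (n == 2 && p[0] == '.' && p[1] == '.') return 1;
        p += n;
        if (*p == '/') p++;
    }
    return 0;
}
```

With a whitelist of lower case letters, digits, '%' and '/', the string
"/docs/%2e%2e/%2e%2e/etc/passwd" passes the character filter and contains no
literal ".." segment, yet has_dotdot_segment reports one once percent_decode
has been applied. The check only becomes meaningful after translation.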
In the article that describes this proposal, it is claimed that it fairly
closely approximates the existing C API semantics. On this point we should
compare this "closeness" with Bstrlib:
                        Bstrlib                  Managed String Library
                        -------                  ----------------------
    Pointer arithmetic  Segment arithmetic       N/A
    Use in C Std lib    ->data, or bdata{e}      getstr_m(x,*) ... free(x)
    String literals     bsStatic, bsStaticBlk    strcreate_m()
    Transparency        Complete                 None
It's pretty clear that the semantic mapping from C strings to Bstrlib is fairly
straightforward, and that in general semantic capabilities are the same or
superior in Bstrlib. On the other hand the Managed String Library is either
missing semantics or changes things fairly significantly.
Comparison with Annexia's c2lib library
---------------------------------------
This library is available at:
http://www.annexia.org/freeware/c2lib
1. Still based solely on char * buffers (and therefore strlen() and strcat()
are still O(n), and there are no faster streq() comparison functions.)
Their suggestion that alternatives which wrap the string data type (such as
bstring does) impose a difficulty in interoperating with the C language's
ordinary C string library is unfounded.
2. Introduction of memory (and vector?) abstractions imposes a learning
curve, and some kind of memory usage policy that is outside of the strings
themselves (and therefore must be maintained by the developer.)
3. The API is massive, and filled with all sorts of trivial (pjoin) and
controversial (pmatch -- regular expressions are not sufficiently
standardized, and there is a very large difference in performance between
compiled and non-compiled REs) functions. Bstrlib takes a decidedly
minimal approach -- none of the functionality in c2lib is difficult or
challenging to implement on top of Bstrlib (except the regex stuff, which
is going to be difficult, and controversial no matter what.)
4. Understanding why c2lib is the way it is pretty much requires a working
knowledge of Perl. Bstrlib requires only knowledge of the C string library
while providing just a very select few worthwhile extras.
5. It is attached to a lot of cruft like a matrix math library (that doesn't
include any functions for getting the determinant, eigenvectors,
eigenvalues, the matrix inverse, test for singularity, test for
orthogonality, Gram-Schmidt orthogonalization, LU decomposition ... I
mean why bother?)
Convincing a development house to use c2lib is likely quite difficult. It
introduces too much, while not being part of any kind of standards body. The
code must therefore be trusted, or maintained by those that use it. While
bstring offers nothing more on this front, since it's so much smaller, covers
far less in terms of scope, and will typically improve string performance,
the barrier to usage should be much smaller.
Comparison with stralloc/qmail
------------------------------
More information about this library can be found here:
http://www.canonical.org/~kragen/stralloc.html or here:
http://cr.yp.to/lib/stralloc.html
1. Library is very very minimal. A little too minimal.
2. Untargeted source parameters are not declared const.
3. Slightly different expected emphasis (like _cats function which takes an
ordinary C string char buffer as a parameter.) It's clear that the
remainder of the C string library is still required to perform more
useful string operations.
The struct declaration for their string header is essentially the same as that
for bstring. But it's clear that this was a quickly written hack whose goals
are clearly a subset of what Bstrlib supplies. For anyone who is served by
stralloc, Bstrlib is a complete substitute that just adds more functionality.
stralloc actually uses the interesting policy that a NULL data pointer
indicates an empty string. In this way, non-static empty strings can be
declared without construction. This advantage is minimal, since static empty
bstrings can be declared inline without construction, and if the string needs
to be written to it should be constructed from an empty string (or its first
initializer) in any event.
wxString class
--------------
This is the string class used in the wxWindows project. A description of
wxString can be found here:
http://www.wxwindows.org/manuals/2.4.2/wx368.htm#wxstring
This C++ library is similar to CBString. However, it is littered with
trivial functions (IsAscii, UpperCase, RemoveLast etc.)
1. There is no C implementation.
2. The memory management strategy is to allocate a bounded fixed amount of
additional space on each resize, meaning that it does not have the
log_2(n) property that Bstrlib has (it will thrash very easily, cause
massive fragmentation in common heap implementations, and can easily be a
common source of performance problems).
3. The library uses a "copy on write" strategy, meaning that it has to deal
with multithreading problems.
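
The difference between the two resize policies in point 2 can be sketched by
simply counting resizes while a buffer grows one byte at a time. The function
names are hypothetical, the 16 byte step is an arbitrary choice, and plain
doubling is assumed purely for illustration; it is not claimed to be the exact
allocation rule of either wxString or Bstrlib:

```c
#include <assert.h>
#include <stddef.h>

/* Count how many resizes are needed to reach final_len bytes when the
   buffer is appended to one byte at a time and each resize adds a fixed
   number of bytes (step must be > 0). */
static size_t resizes_fixed_step (size_t final_len, size_t step) {
    size_t cap = 0, count = 0, len;
    assert (step > 0);
    for (len = 1; len <= final_len; len++) {
        if (len > cap) { cap += step; count++; }
    }
    return count;
}

/* The same count when each resize doubles the capacity (starting at 1). */
static size_t resizes_doubling (size_t final_len) {
    size_t cap = 0, count = 0, len;
    for (len = 1; len <= final_len; len++) {
        if (len > cap) { cap = cap ? cap * 2 : 1; count++; }
    }
    return count;
}
```

Growing to 1000000 bytes takes 62500 resizes with a fixed 16 byte step, but
only 21 with doubling. Since each resize may copy the whole buffer, the fixed
step policy does work proportional to n*n/step in the worst case, while the
doubling policy stays proportional to n.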
Vstr
----
This is a highly orthogonal C string library with an emphasis on
networking/realtime programming. It can be found here:
http://www.and.org/vstr/
1. The convoluted internal structure does not contain a '\0' char * compatible
buffer, so interoperability with the C library is a non-starter.
2. The API and implementation are very large (owing to its orthogonality) and
can lead to difficulty in understanding its exact functionality.
3. An obvious dependency on GNU tools (confusing make configure step).
4. Uses a reference counting system, meaning that it is not likely to be
thread safe.
The implementation has an extreme emphasis on performance for nontrivial
actions (adds, inserts and deletes are all constant or roughly O(#operations)
time) following the "zero copy" principle. This trades off performance of
trivial functions (character access, char buffer access/coercion, alias
detection) which become significantly slower, as well as incremental
accumulative costs for its searching/parsing functions. Whether or not Vstr
wins any particular performance benchmark will depend a lot on the benchmark,
but it should handily win on some, while losing dreadfully on others.
The learning curve for Vstr is very steep, and it doesn't come with any
obvious way to build for Windows or other platforms without GNU tools. At
least one mechanism (the iterator) introduces a new undefined scenario
(writing to a Vstr while iterating through it.) Vstr has a very large
footprint, and is very ambitious in its total functionality. Vstr has no C++
API.
Vstr usage requires context initialization via vstr_init() which must be run
in a thread-local context. Given the totally reference based architecture
this means that sharing Vstrings across threads is not well defined, or at
least not safe from race conditions. This API is clearly geared to the older
standard of fork() style multitasking in UNIX, and is not safely transportable
to modern shared memory multithreading available in Linux and Windows. There
is no portable external solution making the library thread safe (since it
requires a mutex around each Vstr context -- not each string.)
In the documentation for this library, a big deal is made of its self hosted
s(n)printf-like function. This is an issue for older compilers that don't
include vsnprintf(), but also an issue because Vstr has a slow conversion to
'\0' terminated char * mechanism. That is to say, using "%s" to format data
that originates from Vstr would be slow without some sort of native function
to do so. Bstrlib sidesteps the issue by relying on what snprintf-like
functionality does exist and having a high performance conversion to a char *
compatible string so that "%s" can be used directly.
Str Library
-----------
This is a fairly extensive string library that includes full Unicode support
and is targeted at outperforming MFC and STL. The architecture,
similarly to MFC's CStrings, is a copy on write reference counting mechanism.
http://www.utilitycode.com/str/default.aspx
1. Commercial.
2. C++ only.
This library, like Vstr, uses a ref counting system. I can only analyze it so
deeply, since I don't have a license for it. However, performance
improvements over MFC and STL don't seem like a sufficient reason to
move your source base to it. For example, in the future, Microsoft may
improve the performance of CString.
It should be pointed out that performance testing of Bstrlib has indicated
that its relative performance advantage versus MFC's CString and STL's
std::string is at least as high as that for the Str library.
libmib astrings
---------------
A handful of functional extensions to the C library that add dynamic string
functionality.
http://www.mibsoftware.com/libmib/astring/
This package basically references strings through char ** pointers and assumes
they are pointing to the top of an allocated heap entry (or NULL, in which
case memory will be newly allocated from the heap.) So it's still up to the user
to mix and match the older C string functions with these functions whenever
pointer arithmetic is used (i.e., there is no leveraging of the type system
to assert semantic differences between references and base strings as Bstrlib
does since no new types are introduced.) Unlike Bstrlib, exact string length
meta data is not stored, thus requiring a strlen() call on *every* string
writing operation. The library is very small, covering only a handful of C's
functions.
While this is better than nothing, it is clearly slower than even the
standard C library, less safe and less functional than Bstrlib.
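
The cost difference can be sketched in plain C. Both functions below are
hypothetical illustrations of the two patterns, not code from libmib or
Bstrlib, and struct lstr is not Bstrlib's actual struct tagbstring. Size
overflow and aliasing between source and destination are not handled, to keep
the sketch short:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-tracking string header. */
struct lstr {
    size_t cap;   /* allocated bytes in data */
    size_t len;   /* bytes in use, not counting the terminator */
    char * data;  /* '\0' terminated whenever it is non-NULL */
};

/* Append n bytes of src.  The write position comes from s->len, so the
   existing contents are never rescanned.  Returns 0 on success, or -1 on
   allocation failure (s is then left unchanged). */
static int lstr_append (struct lstr * s, const char * src, size_t n) {
    size_t need;
    assert (s != NULL && src != NULL);
    need = s->len + n + 1;
    if (need > s->cap) {
        size_t ncap = s->cap ? s->cap : 16;
        char * p;
        while (ncap < need) ncap *= 2;
        p = realloc (s->data, ncap);
        if (p == NULL) return -1;
        s->data = p;
        s->cap = ncap;
    }
    memcpy (s->data + s->len, src, n);
    s->len += n;
    s->data[s->len] = '\0';
    return 0;
}

/* The char ** pattern: with no stored length, every append must first
   call strlen() on the destination.  Returns 0 on success, -1 on failure. */
static int cstr_append (char ** ps, const char * src) {
    size_t old, n;
    char * p;
    assert (ps != NULL && src != NULL);
    old = *ps ? strlen (*ps) : 0;   /* O(current length) on every call */
    n = strlen (src);
    p = realloc (*ps, old + n + 1);
    if (p == NULL) return -1;
    memcpy (p + old, src, n + 1);
    *ps = p;
    return 0;
}
```

Both produce the same contents, but building a string of length n from many
small appends rescans the destination on every cstr_append call, whereas
lstr_append only ever touches the bytes being added.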
To explain the advantage of using libmib, their website shows an example of
how dangerous C code:
char buf[256];
char *pszExtraPath = ";/usr/local/bin";
strcpy(buf,getenv("PATH")); /* oops! could overrun! */
strcat(buf,pszExtraPath); /* Could overrun as well! */
printf("Checking...%s\n",buf); /* Some printfs overrun too! */
is avoided using libmib:
char *pasz = 0; /* Must initialize to 0 */
char *paszOut = 0;
char *pszExtraPath = ";/usr/local/bin";
if (!astrcpy(&pasz,getenv("PATH"))) /* malloc error */ exit(-1);
if (!astrcat(&pasz,pszExtraPath)) /* malloc error */ exit(-1);
/* Finally, a "limitless" printf! we can use */
asprintf(&paszOut,"Checking...%s\n",pasz);fputs(paszOut,stdout);
astrfree(&pasz); /* Can use free(pasz) also. */
astrfree(&paszOut);
However, compare this to Bstrlib:
bstring b, out;
bcatcstr (b = bfromcstr (getenv ("PATH")), ";/usr/local/bin");
out = bformat ("Checking...%s\n", bdatae (b, "<Out of memory>"));
/* if (out && b) */ fputs (bdatae (out, "<Out of memory>"), stdout);
bdestroy (b);
bdestroy (out);
Besides being shorter, we can see that error handling can be deferred right
to the very end. Also, unlike the above two versions, if getenv() returns
with NULL, the Bstrlib version will not exhibit undefined behavior.
Initialization starts with the relevant content rather than an extra
autoinitialization step.
libclc
------
An attempt to add to the standard C library with a number of common useful
functions, including additional string functions.
http://libclc.sourceforge.net/
1. Uses standard char * buffer, and adopts C99's usage of "restrict" to pass
the responsibility to guard against aliasing to the programmer.
2. Adds no safety or memory management whatsoever.
3. Most of the supplied string functions are completely trivial.
The goals of libclc and Bstrlib are clearly quite different.
fireString
----------
http://firestuff.org/
1. Uses standard char * buffer, and adopts C99's usage of "restrict" to pass
the responsibility to guard against aliasing to the programmer.
2. Mixes char * and length wrapped buffers (estr) functions, doubling the API
size, with safety limited to only half of the functions.
Firestring was originally just a wrapper of char * functionality with extra
length parameters. However, it has been augmented with the inclusion of the
estr type which has similar functionality to stralloc. But firestring does
not nearly cover the functional scope of Bstrlib.
Safe C String Library
---------------------
A library written for the purpose of increasing safety and power to C's string
handling capabilities.
http://www.zork.org/safestr/safestr.html
1. While the safestr_* functions are safe in and of themselves, interoperating
with char * strings has dangerous unsafe modes of operation.
2. The architecture of safestr causes the base pointer to change. Thus,
it's not practical/safe to store a safestr in multiple locations if any
single instance can be manipulated.
3. Dependent on an additional error handling library.
4. Uses reference counting, meaning that it is either not thread safe or
slow and not portable.
I think the idea of reallocating (and hence potentially changing) the base
pointer is a serious design flaw that is fatal to this architecture. True
safety is obtained by having automatic handling of all common scenarios
without creating implicit constraints on the user.
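
The alternative is to keep a fixed header between the caller and the character
data, so that reallocation only ever changes a field inside the header. The
following is a minimal sketch of that idea in plain C; struct hstr and its
functions are hypothetical and reproduce neither safestr's nor Bstrlib's API:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header-based string: callers hold a pointer to the header,
   and only the header knows where the character data currently lives. */
struct hstr {
    size_t len;
    char * data;
};

/* Create a new string from a C string; returns NULL on allocation failure. */
static struct hstr * hstr_new (const char * init) {
    struct hstr * h;
    assert (init != NULL);
    h = malloc (sizeof *h);
    if (h == NULL) return NULL;
    h->len = strlen (init);
    h->data = malloc (h->len + 1);
    if (h->data == NULL) { free (h); return NULL; }
    memcpy (h->data, init, h->len + 1);
    return h;
}

/* Growing the string may move h->data, but h itself never moves, so every
   copy of the handle stored elsewhere in the program stays valid.
   Returns 0 on success, -1 on allocation failure. */
static int hstr_append (struct hstr * h, const char * tail) {
    size_t n;
    char * p;
    assert (h != NULL && tail != NULL);
    n = strlen (tail);
    p = realloc (h->data, h->len + n + 1);
    if (p == NULL) return -1;
    memcpy (p + h->len, tail, n + 1);
    h->data = p;
    h->len += n;
    return 0;
}

/* Release both the character data and the header. */
static void hstr_free (struct hstr * h) {
    if (h != NULL) { free (h->data); free (h); }
}
```

If the handle were instead the reallocated pointer itself, any second copy of
that handle would be left dangling after the first append that moved the block.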
Because of its automatic temporary clean up system, it cannot use "const"
semantics on input arguments. Interesting anomalies such as:
    safestr_t s, t;
    s = safestr_replace (t = SAFESTR_TEMP ("This is a test"),
                         SAFESTR_TEMP (" "), SAFESTR_TEMP ("."));
    /* t is now undefined. */
are possible. If one defines a function which takes a safestr_t as a
parameter, then the function would not know whether or not the safestr_t is
defined after it passes it to a safestr library function. The author's
recommended method for working around this problem is to examine the
attributes of the safestr_t within the function which is to modify any of
its parameters and play games with its reference count. I think, therefore,
that the whole SAFESTR_TEMP idea is also fatally broken.
The library implements immutability, optional non-resizability, and a "trust"
flag. This trust flag is interesting, and suggests that applying any
arbitrary sequence of safestr_* function calls on any set of trusted strings
will result in a trusted string. It seems to me, however, that if one wanted
to implement a trusted string semantic, one might do so by actually creating
a different *type* and only implement the subset of string functions that are
deemed safe (i.e., user input would be excluded, for example.) This, in
essence, would allow the compiler to enforce trust propagation at compile
time rather than run time. Non-resizability is also interesting, however,
it seems marginal (i.e., to want a string that cannot be resized, yet can be
modified and yet where a fixed sized buffer is undesirable.)
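
A minimal sketch of the separate type idea in plain C follows. All names here
are hypothetical, and validation is reduced to a character whitelist only to
keep the example short; a real trust policy would need more than that:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Two distinct struct types wrap the same representation, so passing one
   where the other is expected is a compile time type error rather than a
   run time flag check. */
struct untrusted_str { const char * s; };
struct trusted_str   { const char * s; };

/* Anything may be wrapped as untrusted (user input, for example). */
static struct untrusted_str from_user_input (const char * s) {
    struct untrusted_str u;
    u.s = s;
    return u;
}

/* The intended route from untrusted to trusted is an explicit check.
   Returns 1 and fills *out on success, 0 otherwise. */
static int validate (struct untrusted_str in, const char * allowed,
                     struct trusted_str * out) {
    assert (in.s != NULL && allowed != NULL && out != NULL);
    if (in.s[strspn (in.s, allowed)] != '\0') return 0;
    out->s = in.s;
    return 1;
}

/* An operation on the trusted side accepts trusted strings only.  Calling
   it with a struct untrusted_str does not compile. */
static size_t trusted_length (struct trusted_str t) {
    return strlen (t.s);
}
```

C cannot stop code from filling in a struct trusted_str by hand, so this is a
convention the compiler helps to enforce rather than a guarantee; hiding the
definition behind an opaque pointer type would tighten it further.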
===============================================================================
Examples
--------
Dumping a line numbered file:
    FILE * fp;
    int i;
    struct bstrList * lines;
    struct tagbstring prefix = bsStatic ("-> ");
    if (NULL != (fp = fopen ("bstrlib.txt", "rb"))) {
        /* Read the entire file into a single bstring */
        bstring b = bread ((bNread) fread, fp);
        fclose (fp);
        if (NULL != (lines = bsplit (b, '\n'))) {
            for (i=0; i < lines->qty; i++) {
                /* Prepend "-> " to each line before printing it */
                binsert (lines->entry[i], 0, &prefix, '?');
                printf ("%04d: %s\n", i, bdatae (lines->entry[i], "NULL"));
            }
            bstrListDestroy (lines);
        }
        bdestroy (b);
    }
For numerous other examples, see bstraux.c, bstraux.h and the example archive.
===============================================================================
License
-------
The Better String Library is available under either the 3 clause BSD license
(see the accompanying license.txt) or the GNU General Public License version 2
(see the accompanying gpl.txt) at the option of the user.
===============================================================================
Acknowledgements
----------------
The following individuals have made significant contributions to the design
and testing of the Better String Library:
Bjorn Augestad
Clint Olsen
Darryl Bleau
Fabian Cenedese
Graham Wideman
Ignacio Burgueno
International Business Machines Corporation
Ira Mica
John Kortink
Manuel Woelker
Marcel van Kervinck
Michael Hsieh
Richard A. Smith
Simon Ekstrom
Wayne Scott
===============================================================================