Merge pull request #5 from brianlheim/typos
fix various typos
ThePhD committed Feb 22, 2021
2 parents 1d3dd26 + 065f2cf commit a4d00c9
Showing 4 changed files with 14 additions and 14 deletions.
4 changes: 2 additions & 2 deletions documentation/source/definitions.rst
@@ -31,7 +31,7 @@
Glossary of Terms & Definitions
===============================

- Occassionally, we may need to use precise language to describe what we want. This contains a list of definitions that can be linked to from the documentation to help describe key concepts that are useful for the explication of the concepts and ideas found in this documentation.
+ Occasionally, we may need to use precise language to describe what we want. This contains a list of definitions that can be linked to from the documentation to help describe key concepts that are useful for the explication of the concepts and ideas found in this documentation.


.. glossary::
@@ -55,7 +55,7 @@ Occassionally, we may need to use precise language to describe what we want. Thi

A unicode code point has been reserved to take at most 21 bits of space to identify itself.

- A single unicode code point is NOT equivalent to a :term:`character <character>`, and multiple of them can be put together or taken apart and still have their sequence form a :term:`"character" <character>`. For a more holsitic, human-like interpretation of code points or other data, see :term:`grapheme clusters <grapheme cluster>`.
+ A single unicode code point is NOT equivalent to a :term:`character <character>`, and multiple of them can be put together or taken apart and still have their sequence form a :term:`"character" <character>`. For a more holistic, human-like interpretation of code points or other data, see :term:`grapheme clusters <grapheme cluster>`.

unicode scalar value
A single unit of decoded information for Unicode. Its definition is identical to that of :term:`unicode code points <unicode code point>`, with the additional constraint that every unicode scalar value may not be a "Surrogate Value". Surrogate values are non-characters used exclusively for the purpose of encoding and decoding specific sequences of code units, and therefore carry no useful meaning in general interchange. They may appear in text streams in certain encodings: see :doc:`Wobbly Transformation Format-8 (WTF-8) </api/encodings/wtf8>` for an example.
12 changes: 6 additions & 6 deletions documentation/source/design/loss.rst
@@ -50,7 +50,7 @@ As the maintainer of code inside of the function ``read_name``, what is the enco
}
- Even here, we've only made marginal improvements. We know the string is stored in some heap by the default allocator, we have the size of the string, but that only tells us how many ``char`` units are stored, not how many conceptual, human-readable :term:`characters <character>` there are or any other pertinent information. Is this information encoded? Is it UTF-8? Maybe it's EBCDIC Code Page 833. Maybe it's UTF-7-IMAP. You don't know, and by the time you start inspecting or poking at the individual ``char`` :term:`code units <code unit>`, who knows what can happen? To make matters worse, even C++ and its Standard Library have poor support for encoding/decoding, let alone Unicode in general. These problems have been explained in quite a lot of detail up to ths point, but the pitfalls are many:
+ Even here, we've only made marginal improvements. We know the string is stored in some heap by the default allocator, we have the size of the string, but that only tells us how many ``char`` units are stored, not how many conceptual, human-readable :term:`characters <character>` there are or any other pertinent information. Is this information encoded? Is it UTF-8? Maybe it's EBCDIC Code Page 833. Maybe it's UTF-7-IMAP. You don't know, and by the time you start inspecting or poking at the individual ``char`` :term:`code units <code unit>`, who knows what can happen? To make matters worse, even C++ and its Standard Library have poor support for encoding/decoding, let alone Unicode in general. These problems have been explained in quite a lot of detail up to this point, but the pitfalls are many:

.. epigraph::

@@ -68,7 +68,7 @@ Some proponents say that if we just change everything to mean "UTF-8" (`const ch
"UTF-8 Everywhere!!"
--------------------

- There are many in the programmign space that believe that just switching everything to UTF-8 everywhere will solve the problem. This is, unfortunately, greatly inadequate as a solution. For those who actually read the entire UTF-8 Everywhere manifesto in its fullness, they will come across this FAQ entry:
+ There are many in the programming space that believe that just switching everything to UTF-8 everywhere will solve the problem. This is, unfortunately, greatly inadequate as a solution. For those who actually read the entire UTF-8 Everywhere manifesto in its fullness, they will come across this FAQ entry:

.. epigraph::

@@ -78,7 +78,7 @@ There are many in the programmign space that believe that just switching everyth

-- `FAQ Entry #6 <https://utf8everywhere.org/#faq.liberal>`_

- The core problem with the "``std::string`` is always UTF-8" decision (even when they are as big as Gooogle, Apple, Facebook, or Microsoft and own everything from the data center to the browser you work with) is that they live on a planet with other people who do not share the same sweeping generalizations about their application environments. Nor have they invoked the ability to, magically, rewrite everyone's code or the data that's been put out by these programs in the last 50 or 60 years. This results in a gratuitous amount of replacement characters or :term:`Mojibake <mojibake>` when things do not encode or decode properly:
+ The core problem with the "``std::string`` is always UTF-8" decision (even when they are as big as Google, Apple, Facebook, or Microsoft and own everything from the data center to the browser you work with) is that they live on a planet with other people who do not share the same sweeping generalizations about their application environments. Nor have they invoked the ability to, magically, rewrite everyone's code or the data that's been put out by these programs in the last 50 or 60 years. This results in a gratuitous amount of replacement characters or :term:`Mojibake <mojibake>` when things do not encode or decode properly:

.. image:: /img/paris-post-office.jpg
  :alt: A package going between Russia and Paris, written in Mojibake because of interpreting text with the wrong encoding. It has been corrected in marker with the correct lettering, because they are so used to this occurrence for international packages.
@@ -100,7 +100,7 @@ So, what do we do from here?
Fighting Code Rot
-----------------

- We need ways to fight bit rot and issues of function invariants -- like expected encoding on string objects -- from infesting code. While we can't rewrite every function declaration or wrap every functin declaration, one of the core mechanisms this library provides is a way of tracking and tagging this kind of invariant information, particularly at compile-time.
+ We need ways to fight bit rot and issues of function invariants -- like expected encoding on string objects -- from infesting code. While we can't rewrite every function declaration or wrap every function declaration, one of the core mechanisms this library provides is a way of tracking and tagging this kind of invariant information, particularly at compile-time.

We know we can't solve interchange on a global level (e.g., demanding everyone use UTF-8) because, at some point, there is always going to be some small holdout of legacy data that has not yet been fixed or ported. The start of solving this is by having views and containers that keep encoding information with them after they are first constructed. This makes it possible to not "lose" that information as it flows through your program:

@@ -115,7 +115,7 @@ We know we can't solve interchange on a global level (e.g., demanding everyone u
Now, we have an :doc:`explicit decoding view </api/views/decode_view>` into a sequence of UTF-8 code units, that produces ``unicode_code_point``\ s that we can inspect and work with. This is much better, as it uses C++'s strong typing mechanisms to give us a useful view. This means that not only does the person outside of the ``read_name`` function understand that the function expects some UTF-8 encoded text, but the person inside the function knows that they are working with UTF-8 encoded text. This solves both ends of the user and maintainer divide.

- Of course, sometimes this is not always possible. ABI stability mandates some functions can't have their signatures change. Other times, you can't modify the signature of functions youu don't own. This is still helpful in this case, as you can, at the nearest available point inside the function or outside of it, apply these transformations:
+ Of course, sometimes this is not always possible. ABI stability mandates some functions can't have their signatures change. Other times, you can't modify the signature of functions you don't own. This is still helpful in this case, as you can, at the nearest available point inside the function or outside of it, apply these transformations:


.. code-block:: cpp
@@ -140,4 +140,4 @@ Because the range and container types are templated on not only encoding, but th

-- `UTF-8 Everywhere, FAQ Entry #19 <https://utf8everywhere.org/#faq.ood>`_

- Rather than create a new ``std::string`` or ``std::string_view``, we simply wrap existing storage interfaces and provide specific views or operations on those things. This alleviates the burden of having to reinvent things that already work fine for byte-oriented interfaces, and helps programmers control (and prevent) bugs. They also get to communicate their intent in their APIs if they so desire ("This API takes a ``std::string_view``, but with the expectation that it's going to be decoded as ``utf8``). The wrapped type will always be available by calling ``.base()``, which means a developer can drop down to the level they think is appropriate when they want it (with the explicit acknowledgement they're going to be ruining things).
+ Rather than create a new ``std::string`` or ``std::string_view``, we simply wrap existing storage interfaces and provide specific views or operations on those things. This alleviates the burden of having to reinvent things that already work fine for byte-oriented interfaces, and helps programmers control (and prevent) bugs. They also get to communicate their intent in their APIs if they so desire ("This API takes a ``std::string_view``, but with the expectation that it's going to be decoded as ``utf8``"). The wrapped type will always be available by calling ``.base()``, which means a developer can drop down to the level they think is appropriate when they want it (with the explicit acknowledgement they're going to be ruining things).
6 changes: 3 additions & 3 deletions documentation/source/design/strong vs weak code points.rst
@@ -75,11 +75,11 @@ In a long piece on P0422, the C and C++ landscape, and Standardization efforts,

-- `Henri Sivonen, It’s Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++ <https://hsivonen.fi/non-unicode-in-cpp/>`_

- This is a different set of choices and a different set of priorities from the outset. Sivonen's work specifically is that with Browsers and large code bases like Firefox; they are responsible for making very good traction and progress on encoding issues in a world that is filled primarily with Unicode, but still has millions of documents that are not in Unicode and, for the forseeable future, won't end up as Unicode.
+ This is a different set of choices and a different set of priorities from the outset. Sivonen's work specifically is that with Browsers and large code bases like Firefox; they are responsible for making very good traction and progress on encoding issues in a world that is filled primarily with Unicode, but still has millions of documents that are not in Unicode and, for the foreseeable future, won't end up as Unicode.

This is a strong argument for simply channeling ``char16_t``, ``char32_t``, and -- since C++20 -- ``char8_t`` as the only types one would need. Firefox at most deals in UTF-16 (due to the JavaScript engine for legacy reasons) and UTF-8, internally. At the boundaries, it deals with many more text encodings, because it `has to from the world wide web <https://encoding.spec.whatwg.org/>`_. Occasionally, UTF-32 will appear in someone's codebase for interoperation purposes or algorithms that need to operate on something better than code units.

- Unicode is also... well, a [UNI]versal [CODE]. It's purposes is interoperation, interchange, and common ground between all the encodings, and it has been the clear winner for this for quite some time now. Sivonen makes a compelling point for just considering Unicode — and only Unicode — for all future text endeavors.
+ Unicode is also... well, a [UNI]versal [CODE]. Its purposes are interoperation, interchange, and common ground between all the encodings, and it has been the clear winner for this for quite some time now. Sivonen makes a compelling point for just considering Unicode — and only Unicode — for all future text endeavors.

Do we really need to focus on having support for legacy encodings? Or at least, do we really need support for legacy encodings at the level that Tom Honermann's text_view is trying to achieve?

@@ -112,7 +112,7 @@ There is room in Sivonen's world, even with perfectly-consistent and fully-Unico

That's why encodings can still define their own ``code_unit`` and ``code_point`` types; even if this library — or the Standard Library — traffics in strictly ``unicode_code_point``\ s, it doesn't mean the user should be forced to do that if they are willing to put in the effort for a more type-safe world.

- Being able to know, at compile-time, without any objects or markup, that a particular pointer + size pairing is meant for a specific encoding is aa powerful way to maintain invariants and track the flow of data without runtime cost through a program. It can also make it easy to find places where external, non-Unicode data is making it "too far" into the system, and try to push a conversion closer to the edges of the program.
+ Being able to know, at compile-time, without any objects or markup, that a particular pointer + size pairing is meant for a specific encoding is a powerful way to maintain invariants and track the flow of data without runtime cost through a program. It can also make it easy to find places where external, non-Unicode data is making it "too far" into the system, and try to push a conversion closer to the edges of the program.

While ztd.text will traffic and work with ``char32_t`` and consider it a ``unicode_code_point`` value :doc:`under most circumstances </api/is_unicode_code_point>`, users are free to define and extend this classification for their own types and generally create as strict (or loose) a taxonomy as they desire.

6 changes: 3 additions & 3 deletions examples/documentation/source/error_handler.anatomy.cpp
@@ -31,15 +31,15 @@
#include <ztd/text.hpp>

struct my_error_handler {
-	// Helper definintions
+	// Helper definitions
template <typename Encoding>
using code_point_span
= ztd::text::span<const ztd::text::code_point_t<Encoding>>;
template <typename Encoding>
using code_unit_span
= ztd::text::span<const ztd::text::code_unit_t<Encoding>>;

-	// Function call operatorthat returns a "deduced" (auto) type
+	// Function call operator that returns a "deduced" (auto) type
// Specifically, this one is called for encode failures
template <typename Encoding, typename Input, typename Output,
typename State>
@@ -79,4 +79,4 @@ int main(int, char* argv[]) {
ztd::text::basic_utf8<char> {}, my_error_handler {});

return 0;
}
}
