diff --git a/content/posts/2021-02-06.md b/content/posts/2021-02-06.md new file mode 100644 index 00000000..86f6a0b2 --- /dev/null +++ b/content/posts/2021-02-06.md @@ -0,0 +1,109 @@ +--- +{ + "datetime": "2021-02-06T16:55:46Z", + "updatedAt": null, + "draft": false, + "title": "Tip: get an array of characters from a string", + "description": "Splitting strings into an array of individual characters is a common source of bugs in JavaScript. I'll show you three ways of doing it reliably.", + "tags": [ + "JavaScript" + ] +} +--- +Splitting strings into an array of individual characters is a common source of +bugs in JavaScript. I'll show you three ways of doing it reliably. + +When asked to split a string in JavaScript, many programmers will try to split +the string using an empty string: + +```javascript +// DON'T DO THIS. +const string = 'Hello, world!'; +const characters = string.split(''); +``` + +The `characters` variable appears to be what we wanted, i.e. an array populated +with length one strings, each with a character of the original strings. + +So what's the problem with this? To illustrate the problem, let's try the same +string with an emoji in it: + +```javascript +// DON'T DO THIS. +const string = 'Hello, world! 👋'; +const characters = string.split(''); +``` + +The beginning of the array looks fine, but where the emoji should be there are +instead two weird characters. This is also true of lots of runes, not just some +emoji! The encoding of unicode characters and their interaction of JavaScript is +beyond the scope of this article (I'm only concerned with the solution) but the +gist of it is that JavaScript internally represents strings in UTF-16, but +unicode characters cannot in general be represented as a single UTF-16 +character, so some spill over into two. + +This problem tends to escape notice until it's too late. This happens because +basic latin characters are represented as a single UTF-16 character, and unit +tests tend to be written with strings using the same basic character set. + +The same problem happens when you loop over strings by index: + +```javascript +// DON'T DO THIS. +const characters = []; + +for (let i = 0; i < s.length; i++) { + characters.push(s[i]); +} +``` + +So what's the solution? + +## Solution 1 + +ES2015 introduced _iterables_ to JavaScript. Strings are such a type. One way in +which this helps is that the `Array` constructor can consume them to build an +array. + +```javascript +const string = 'Hello, world! 👋'; +const characters = Array.from(string); // This works! +``` + +This works out of the box for ES2015 or above compliant browsers. The one +browser this tends to be an issue for at the time of writing is IE11. If you +need to support IE11, then you can polyfill `Array.from` or you can compile it +your JavaScript to ES5. + +## Solution 2 + +ES2015 also gave us some special syntax called _spread_ for this. You can spread +an iterable into a new array. + +```javascript +const string = 'Hello, world! 👋'; +const characters = [...string]; // This works! +``` + +Unlike solution 1, solution 2 will require compilation to ES5 to work in older +browsers like IE11. I prefer solution 1 to solution 2 because solution 1 +documents itself, whereas solution 2 can be cryptic if you're not familiar with +the spread operator. + +## Solution 3 + +If you want to loop over strings by character, the `for-of` loop was created to +work with iterables. + +```javascript +const string = 'Hello, world! 👋'; +const characters = []; + +for (const character of string) { + characters.push(character); // This works! +} +``` + +Like the spread operator, `for-of` loops use new syntax. If you need to support +old browsers which don't support ES2015, then you'll need to compile your +JavaScript to ES5.