Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

content: implement support for unicode emojis in messages #245

Merged
merged 5 commits into from
Aug 25, 2023

Conversation

rajveermalviya
Copy link
Collaborator

Fixes: #58

Context

Implement parsing of the unicode emojis encoded in <span class="emoji emoji-*"> elements and then generating direct TextSpans that contains the unicode codepoint for emoji and falling back to displaying emoji_name in case the parsing had failed.

This PR also vendors the "Noto Color Emoji" font in COLRv1 format (4.53 MB) that was downloaded from it's latest release.

Alternatives

As mentioned in #58 there are many alternative ways to handle the displaying of emojis. AFAIK using a font is the most efficient way to render emojis in Flutter (even for icons it prefers fonts). Using a font also provides us a way to render the emojis that are in the message content & other strings (topic/stream name). Especially on older servers which may fail to parse the newer emojis and fail to encode them in <span class="emoji"> elements and rather including the unicode codepoints in the text message content as is. (Albeit we can use the spritesheet method there too, but not without parsing each character of the message content to check for emojis.)

Now coming to different possible routes for the loading of the fonts:

  • Vendoring them in the app (for now the PR does this)
  • Using a package like google_fonts that downloads the fonts from font.google.com server and dynamically load the font in the app.
  • Implement the functionality from google_fonts ourselves that downloads & caches the preferred emoji font from the realm server (this will probably require that server hosts the font files) and then load the downloaded font in the app using FontLoader API.

Testing

Updated the current ones.

Copy link
Member

@gnprice gnprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rajveermalviya! See review comments below.

We'll also want some unit tests for this code before merging it.


final String text;
final String emojiName;
final String? emojiUnicode;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe the situation where this is expected to be null?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, after reading the surrounding code a bit more, it seems like it would be unreachable.
In previous revision, when tryParseEmojiCodeToUnicode returned null, parser would set emojiUnicode to null too. And in the widgets layer the emojiName with the fallback container would be displayed.

But thinking about it a bit more; if tryParseEmojiCodeToUnicode fails then we should just emit "unimplemented" from the parser rather than just adding a silent workaround. And also it will be pretty rare that tryParseEmojiCodeToUnicode would fail as it is mostly unreachable because of the regexp check in parser and if it does fail it would be a bug in the server.

@@ -568,7 +570,8 @@ class _ZulipContentParser {
&& classes.length == 2
&& classes.contains('emoji')
&& classes.every(_emojiClassRegexp.hasMatch)) {
return UnicodeEmojiNode(text: element.text, debugHtmlNode: debugHtmlNode);
final emojiUnicode = tryParseEmojiCodeToUnicode(classes.elementAt(1).replaceFirst('emoji-', ''));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This elementAt(1) implicitly assumes the emoji class comes first, and the emoji-12345 class after that. But the conditions above it don't ensure that — they might come in the other order.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack, now uses .firstWhere()

Comment on lines 633 to 635
// Ported from https://github.com/zulip/zulip-mobile/tree/main/src/emoji/data.js
//
// Which was in turn ported from https://github.com/zulip/zulip/blob/main/zerver/models.py
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When linking to code on GitHub for reference, use a permalink. Links to main can break when the file is renamed, or perhaps worse can silently become unhelpful when the relevant code moves elsewhere or otherwise changes.

GitHub has a handy keyboard shortcut y to get a permalink when you're looking at a page like this.

(I guess we didn't do that in the code you're borrowing from in zulip-mobile, oops. Still, let's fix it here.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

String? tryParseEmojiCodeToUnicode(String code) {
try {
return String.fromCharCodes(code.split('-').map((hex) => int.parse(hex, radix: 16)));
} catch (_) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, catch (_) is a bit of a code smell :-)

Generally if there's some exception we expect to sometimes happen, and doesn't represent a bug in the code, we'll want to be explicit about what we expect and just catch that exception type specifically. That way if some unrelated exception is thrown as a result of a bug, the exception propagates just like it normally would.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

@@ -12,6 +12,7 @@ class ZulipApp extends StatelessWidget {
@override
Widget build(BuildContext context) {
final theme = ThemeData(
fontFamilyFallback: const <String>['sans-serif', 'Noto Color Emoji'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you say more about what this line is doing and why?

The commit message says "correctly propogate emoji font fallback", but this doesn't appear to be propagating anything; instead it's introducing something new.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack, added a comment explaining the reason and moved to a separate commit.

@gnprice
Copy link
Member

gnprice commented Aug 4, 2023

For the commit adding the Noto Color Emoji font to assets, here's a revision for the commit message:

font: Add Noto Color Emoji v2.038, in COLRv1 format

These files are from:
  https://github.com/googlefonts/noto-emoji/tree/v2.038/fonts

The main difference is just to make it a bit more explicit exactly where these files come from, because the existing URL https://github.com/googlefonts/noto-emoji/releases/tag/v2.038 doesn't provide the individual font file as a download. It does lead to the whole source tree; but there's a lot of files in there (and one wouldn't necessarily expect the font file to be checked into the source tree anyway, as presumably it's the result of building the tree), so it's not obvious from that where to find these files.

(And, mentioning this mostly for my own notes: I did just go and download those files myself, and verified that I get the same contents as what's in this PR.)

@rajveermalviya rajveermalviya force-pushed the unicode-emoji branch 2 times, most recently from 3069f62 to 721c58a Compare August 6, 2023 04:21
@rajveermalviya
Copy link
Collaborator Author

Thanks for the review @gnprice, I have rebased and updated with the suggested changes.

@rajveermalviya
Copy link
Collaborator Author

(Just rebased to main, and squashed some commits)

@rajveermalviya
Copy link
Collaborator Author

(rebased again, to fix conflicts with main branch)

@rajveermalviya rajveermalviya force-pushed the unicode-emoji branch 2 times, most recently from 312bcdb to 19bf5b2 Compare August 18, 2023 18:41
@rajveermalviya
Copy link
Collaborator Author

(rebased to main; with updated commit messages)

Copy link
Member

@gnprice gnprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rajveermalviya for the revision! Comments below.

static final _emojiClassRegexp = RegExp(r"^emoji(-[0-9a-f]+)?$");
static final _emojiClassRegexp = RegExp(r"^emoji(-[0-9a-f]+)*$");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm interesting, good catch.

This is basically fixing a bug, right? We would have not produced a UnicodeEmojiNode when the emoji had multiple codepoints, and instead produced an UnimplementedNode.

Let's include a test that exercises that case, in the same commit that fixes the bug.

Comment on lines 596 to 608
final className = classes
.firstWhere((className) => className.startsWith('emoji-'), orElse: () => '')
.replaceFirst('emoji-', '');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The replaceFirst means this isn't really a "class name" anymore. A good name might be emojiCode — I believe this string is now the same as the Zulip "emoji code" for the emoji.

Comment on lines 596 to 599
final className = classes
.firstWhere((className) => className.startsWith('emoji-'), orElse: () => '')
.replaceFirst('emoji-', '');
if (className.isEmpty) return unimplemented();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the orElse / className.isEmpty case here actually happen?

I think it can't, even with malformed data from the server — that is, I think our own code, and in fact the few lines of code around this spot, ensure that it can't happen. In the condition above, we check that

  • there are two classes
  • one of them is just emoji
  • the other matches the regexp

And I think any string that matches the regexp and isn't emoji has to start with emoji-.

So at this spot we can just assume that we do find a class starting with emoji-, and barrel ahead using the ! operator or equivalent. If the resulting logic seems unclear, we can cheerfully add an assert to explicitly verify (but at zero cost when in release mode) that we did find such a class.

@@ -582,7 +584,22 @@ class _ZulipContentParser {
&& classes.length == 2
&& classes.contains('emoji')
&& classes.every(_emojiClassRegexp.hasMatch)) {
return UnicodeEmojiNode(text: element.text, debugHtmlNode: debugHtmlNode);
// TODO: Check if font contains the glyph for the specific unicode codepoints;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, interesting thought.

I'm not sure we'll actually want to do this, though. The question is: if the answer is that the font doesn't seem to have the emoji, then what do we do instead?

One possible answer would be an UnimplementedNode. But in a UX for general release (i.e. post #194), I think the best option for handling that would likely be to just show the Unicode anyway. And I think this case wouldn't be real informative for #190. It'd tell us that the server has updated to a newer list of emoji than we have in our font, and we should update our emoji font. But updating emoji lists is going to be an annual housekeeping task for both Zulip server/web and us for the foreseeable future, and I think we can find a lower-effort way than this to get reminded of it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the answer is that the font doesn't seem to have the emoji, then what do we do instead?

We could fallback to the previous behavior (before this PR) - showing the emoji name: :wave:
But yeah the emoji list will have to be maintained accordingly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah we can't do that because if font doesn't have that emoji that will mean the "emoji list" won't have that new emoji the user typed too. Also if all the projects will update the emoji sets at same time - that will mean server doesn't know about the new emoji too and won't send us an emoji_name.
So, I guess we can't do this - removing the todo.

Comment on lines 590 to 594
// eg. unicode v15 set of emojis. Note that in both of these options parsing
// each character of the whole message text will be required to make the
// checks reliable, because emojis can be present in non encoded form in the
// message text. (For example, when server hasn't been updated to account for
// newer unicode versions.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yeah, if there are emoji in the message as just straight-up Unicode emoji, encoded in the normal Unicode way, then I am definitely happy to pass them right through to the Text widget and not worry about them.

If they're not in a font we bundled, they may still — or may not — be in a system font on the user's device. In that way they have the same status as if the message sender had used some script we didn't plan for, out of the 100+ scripts in Unicode.

@@ -14,6 +14,14 @@ class ZulipApp extends StatelessWidget {
@override
Widget build(BuildContext context) {
final theme = ThemeData(
// This correctly sets up a the font fallback for normal text that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// This correctly sets up a the font fallback for normal text that
// This sets up a font fallback for normal text that

Comment on lines 155 to 157
// "🪿"
'<p>🪿</p>',
const TextNode('\u{1fabf}'));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: clearest to read if these either all use the \u notation, or all have the literal emoji in the source

(either way it's the same string, just two different ways in the source code of expressing it)

// ":thumbs_up:"
'<p><span aria-label="thumbs up" class="emoji emoji-1f44d" role="img" title="thumbs up">:thumbs_up:</span></p>',
const UnicodeEmojiNode(text: ':thumbs_up:'));
const UnicodeEmojiNode(emojiName: ':thumbs_up:', codepoints: '\u{1f44d}'));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I guess this example shows that the thing labeled emojiName isn't exactly the name — it's the name (which here is thumbs_up) plus colons to decorate it.

One good name would be just colonsName.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or perhaps better yet, drop that field entirely, since we're no longer using it. (Easy to add it back later if we start using it again for something.)

That will mean squashing the commit that changes how widgets/content.dart consumes UnicodeEmojiNode into the commit that changes what those nodes contain and how we parse them. I think squashing those is probably cleanest anyway.

@@ -620,23 +619,6 @@ class UserMention extends StatelessWidget {
// borderRadius: BorderRadius.all(Radius.circular(3))));
}

class MessageUnicodeEmoji extends StatelessWidget {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's squash deleting this into the same commit that makes it no longer needed.

Meanwhile, from the commit message:

This widget displayed the emoji name as a fallback in a bordered
container. But now that we rely on TextSpan to display the emoji
codepoints text, we can't reliably determine the fallback condition
_for now_.

I don't understand the last bit here. What fallback condition are you thinking of; and what would need to change between "for now" and being able to determine it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commits talks about the possibility to determine if the emojiUnicode is present in a supported emoji list. If not we could fallback to showing this container. See discussion.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm, see comment

// > "unified" in the following data, with "non_qualified" taking
// > precedence when both present:
// > https://raw.githubusercontent.com/iamcal/emoji-data/master/emoji_pretty.json
String? tryParseEmojiCodeToUnicode(String code) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this puts this function in amongst the functions for parsing block content; it's parsing inline content, so it should instead go amongst the parsing of inline content.

Specifically let's put it above parseInlineContent. In general this parser class has the more general and more toward-the-root logic at the bottom, and the more specific and more toward-the-leaves logic at the top.

@rajveermalviya
Copy link
Collaborator Author

Thanks for the review @gnprice, revision pushed!

@rajveermalviya
Copy link
Collaborator Author

(and rebased to main)

Copy link
Member

@gnprice gnprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rajveermalviya for all your work on this! Looks good with just a couple of nits (below); I'll fix and merge.

Happy to see that it works nicely — here's on my Pixel 5:

before after
image image

In fact, it even effectively fixes zulip/zulip#11767 ! We just yesterday merged a fix for that in the server's main branch, but old messages will remain affected. Here in this client implementation, though, because we translate the Unicode emoji right back into Unicode, even the old mojibake messages show up perfectly:

before after
image image

That basically comes as a bonus from our decision to use Unicode (and ship an emoji font for consistency across devices) instead of managing these glyphs as images.

@@ -582,7 +604,14 @@ class _ZulipContentParser {
&& classes.length == 2
&& classes.contains('emoji')
&& classes.every(_emojiClassRegexp.hasMatch)) {
return UnicodeEmojiNode(text: element.text, debugHtmlNode: debugHtmlNode);
final emojiCode = classes
.firstWhereOrNull((className) => className.startsWith('emoji-'))!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can simplify slightly:

Suggested change
.firstWhereOrNull((className) => className.startsWith('emoji-'))!
.firstWhere((className) => className.startsWith('emoji-'))

@@ -460,8 +460,7 @@ class _InlineContentBuilder {
return WidgetSpan(alignment: PlaceholderAlignment.middle,
child: UserMention(node: node));
} else if (node is UnicodeEmojiNode) {
return WidgetSpan(alignment: PlaceholderAlignment.middle,
child: MessageUnicodeEmoji(node: node));
return TextSpan(text: node.emojiUnicode, recognizer: _recognizer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit in commit message:

- Generate a tappable TextSpan for each unicode emoji

It's true that this will be tappable if it's inside something tappable (i.e., a link). But this line in the commit message sounds like it's saying the span will be tappable in its own way, related somehow to its status as an emoji. Which is a plausible feature — you could imagine tapping (maybe more likely long-tapping, but possibly just tapping) to pull up the name of the emoji in a tooltip, or something like that.

So just s/tappable // clarifies it, I think.

@gnprice
Copy link
Member

gnprice commented Aug 24, 2023

Looks good with just a couple of nits (below); I'll fix and merge.

Well, I ran into a small road bump: this PR was causing a test to break, once rebased atop main. The issue is the test's fault, though, not this PR's. Sent #285 to fix that, and meanwhile I'll push my rebased and nits-fixed version of this PR back to the PR branch.

An emoji with multiple codepoints is encoded as `emoji-xxxxxx-xxxxxx-..`
(where xxxxxx corresponds to a hex sequence). Current regexp matches
strings with either zero or single occurence of subgroup, changing it
to match string with either zero or infinitely repeating subgroup.
- Parse the zulip emoji code class names to unicode codepoints
- Generate a TextSpan for each unicode emoji
- Remove fallback emoji name container

Fixes: zulip#58
@gnprice
Copy link
Member

gnprice commented Aug 25, 2023

OK, and merging! Thanks again @rajveermalviya for the contribution.

@gnprice gnprice merged commit 9e82157 into zulip:main Aug 25, 2023
@rajveermalviya rajveermalviya deleted the unicode-emoji branch August 25, 2023 19:43
chrisbobbe added a commit to chrisbobbe/zulip-flutter that referenced this pull request Nov 21, 2023
tl;dr:

When this Flutter issue is resolved:
  flutter/flutter#134897
try adding the Noto Color Emoji font back again (COLRv1 format) and
set it as a fallback font behind some higher-priority font wherever
we want to show text that might contain emojis.

Before we do that, we're at the mercy of the system font, which
might not be updated to handle newer emojis. This seems especially
true of Android devices in the wild.

-----

In zulip#245, we added a COLRv1 version of this font, probably because of
a warning that made us doubt if other formats would work on iOS:
  https://github.com/googlefonts/noto-emoji#using-notocoloremoji

> NotoColorEmoji uses the CBDT/CBLC color font format, which is
> supported by Android and Chrome/Chromium OS. Windows supports it
> starting with Windows 10 Anniversary Update in Chrome and Edge. On
> macOS, only Chrome supports it, while on Linux it will support it
> with some fontconfig tweaking, see issue zulip#36. Currently we do not
> build other color font formats.

(It seems like that part of the README wasn't updated when the
COLRv1 format was added, and hasn't been updated since.)

And I guess there wasn't a clear iOS compatibility problem with the
COLRv1 format, so we used that.

But, in zulip#245, it seems like we didn't do a good manual test that the
COLRv1 version of this font could actually handle emojis in the app
on iOS. That testing shows that it can't (or at least couldn't on my
phone or in a simulator). When the font is in the effective list of
fallbacks, it takes responsibility for rendering the emoji, before
the system font gets a chance to, but then it fails to render it,
and a blank space appears. We suspect this Flutter issue ("Rendering
of COLRv1 fonts is broken"):
  flutter/flutter#134897

We don't know why that Flutter issue doesn't, or at least doesn't
always, prevent emojis from successfully rendering in the font on
*Android*. I followed the issue's reproduction recipe with the
COLRv1 font they specified, and indeed the glyphs weren't showing up
on Android (but they did on web). So apparently the specific COLRv1
font is a variable.

Since we don't know what all the variables are, and we have other
short-term priorities, just take the font out of the picture for
now, with a plan to reinstate it once we're sure it won't cause
emojis to fail to render.

Discussion:
  https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/Emoji.20rendering.20issue.20.28iOS.29/near/1683965
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle Zulip's encoding of Unicode emoji in messages
2 participants