Add support for emoji-data.txt and emoji-variation-sequences.txt to unicodedata #85788

Open
jack1142 mannequin opened this issue Aug 23, 2020 · 2 comments
Labels
3.9 only security fixes 3.10 only security fixes topic-unicode type-feature A feature request or enhancement

Comments

@jack1142
Mannequin

jack1142 mannequin commented Aug 23, 2020

BPO 41622
Nosy @terryjreedy, @ezio-melotti, @jack1142

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2020-08-23.20:34:50.618>
labels = ['3.10', 'type-feature', '3.9', 'expert-unicode']
title = 'Add support for emoji-data.txt and emoji-variation-sequences.txt to unicodedata'
updated_at = <Date 2020-09-01.09:46:27.836>
user = 'https://github.com/jack1142'

bugs.python.org fields:

activity = <Date 2020-09-01.09:46:27.836>
actor = 'vstinner'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Unicode']
creation = <Date 2020-08-23.20:34:50.618>
creator = 'jack1142'
dependencies = []
files = []
hgrepos = []
issue_num = 41622
keywords = []
message_count = 2.0
messages = ['375826', '376084']
nosy_count = 3.0
nosy_names = ['terry.reedy', 'ezio.melotti', 'jack1142']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue41622'
versions = ['Python 3.9', 'Python 3.10']

@jack1142
Mannequin Author

jack1142 mannequin commented Aug 23, 2020

The emoji-data.txt and emoji-variation-sequences.txt files were formally pulled into the UCD as of Version 13.0 [1], so I think that unicodedata, as the module providing access to the UCD, could support them as well.
In particular:

  • emoji-data.txt lists character properties for emoji characters [2]
  • emoji-variation-sequences.txt lists valid text and emoji presentation sequences [3]

Data from emoji-variation-sequences.txt can be used to ensure consistent rendering of emoji characters across devices [4] (StandardizedVariants.txt has a similar purpose for non-emoji characters).
I'm not entirely sure of the use cases for emoji-data.txt, but because it's also newly added in UCD 13.0.0, I figured I at least shouldn't omit it when making this issue.

[1] https://www.unicode.org/reports/tr44/#Change_History - Changes in Unicode 13.0.0, "Emoji Data" section
[2] https://www.unicode.org/reports/tr51/#Emoji_Properties_and_Data_Files
[3] https://www.unicode.org/reports/tr51/#Emoji_Variation_Sequences
[4] https://unicode.org/faq/vs.html#1

@jack1142 jack1142 mannequin added 3.9 only security fixes 3.10 only security fixes topic-unicode type-feature A feature request or enhancement labels Aug 23, 2020
@terryjreedy
Member

Base facts: The Unicode Character Database, UCD, is defined in Tech Report 44, https://www.unicode.org/reports/tr44/. The latest files (now for 13.0) are at https://www.unicode.org/Public/UCD/latest/ and in particular, in the ucd subdirectory. ucd/UnicodeData.txt has a sequential list of current codepoints, including emoji codepoints.

Version 13 added the subdirectory ucd/emoji with the two files listed above. emoji-variation-sequences.txt comprises 177 highly redundant pairs of lines like this:
0023 FE0E ; text style; # (1.1) NUMBER SIGN
0023 FE0F ; emoji style; # (1.1) NUMBER SIGN
The only difference between the lines is 'FE0E; text' versus 'FE0F; emoji', 'TEXT PRESENTATION SELECTOR' versus 'EMOJI PRESENTATION SELECTOR'.
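
As a rough illustration of how lines in this format could be read, here is a minimal sketch. The function name `parse_variation_sequences` and the shape of the returned mapping are assumptions made for this example, not part of unicodedata or of any proposed API; the sample data is just the two lines quoted above.

```python
def parse_variation_sequences(lines):
    """Map each base character to {style name: full variation sequence}.

    Hypothetical helper: it only assumes the line layout quoted above,
    i.e. "<base> <selector> ; <style>; # comment".
    """
    table = {}
    for raw in lines:
        # Drop the trailing "# ..." comment and surrounding whitespace.
        data = raw.split("#", 1)[0].strip()
        if not data:
            continue  # blank or comment-only line
        fields = [field.strip() for field in data.split(";")]
        base_hex, selector_hex = fields[0].split()
        base = chr(int(base_hex, 16))
        selector = chr(int(selector_hex, 16))
        table.setdefault(base, {})[fields[1]] = base + selector
    return table


SAMPLE = [
    "0023 FE0E ; text style; # (1.1) NUMBER SIGN",
    "0023 FE0F ; emoji style; # (1.1) NUMBER SIGN",
]

table = parse_variation_sequences(SAMPLE)
```

With the sample above, `table` has a single key, `'#'`, whose value holds one sequence per style. Keying by base character collapses each redundant pair of lines into one entry.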

tr51 does not explicitly say that every line is paired, but perusal suggests that this is true, making the file highly redundant. The 177 characters include some non-emoji symbols, like #, and omit most emoji, including SNAKE, '\U0001f40d', '🐍' (colored coiled snake). And yet, here, at least in Firefox, is the supposedly invalid text snake, '\U0001f40d\ufe0e': '🐍︎' (a flat black-only, uncoiled wiggling snake head). I don't know how '#\ufe0f' might be different from plain '#'.
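
For anyone who wants to reproduce the experiment above, the sequences can be built and inspected with functions that unicodedata already has. This is only a sketch; the variable names are chosen for this example, and how the resulting strings render depends entirely on the font and application.

```python
import unicodedata

SNAKE = "\U0001f40d"
VS15 = "\ufe0e"  # the selector used in the 'text style' lines
VS16 = "\ufe0f"  # the selector used in the 'emoji style' lines

# The sequence from the experiment above (not listed in the data file).
text_snake = SNAKE + VS15
# The listed emoji-style sequence for NUMBER SIGN.
emoji_number_sign = "#" + VS16

# Each sequence is two code points: the base character plus a selector.
names = [unicodedata.name(ch) for ch in text_snake]
code_points = [f"U+{ord(ch):04X}" for ch in emoji_number_sign]

print(names)
print(code_points)
```

Note that `unicodedata.name()` reports the selectors under their formal character names (VARIATION SELECTOR-15 and VARIATION SELECTOR-16).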

Our UCD copy is accessed via 13 functions in the unicodedata module. Support for the file could consist of a new function, such as 'emoji_text'. The implementation could be 'chr in emoji_text_set', where the latter is the set of 177 characters. But given the accidental experiment above with an unauthorized sequence, I don't know how useful it would be.
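
A minimal sketch of what such a function might look like, purely for discussion. Both `emoji_text` and `EMOJI_TEXT_SET` are hypothetical names taken from the paragraph above; neither exists in unicodedata. The set here is a stand-in that contains only NUMBER SIGN, the one entry quoted in this thread, whereas a real implementation would be generated from all of emoji-variation-sequences.txt.

```python
# Hypothetical stand-in for the full set of 177 base characters.
# Only U+0023 NUMBER SIGN is included because it is the one entry quoted above.
EMOJI_TEXT_SET = frozenset({"\u0023"})


def emoji_text(ch):
    """Return True if ch is the base of a listed text/emoji variation sequence.

    Assumption: like the existing unicodedata functions, this accepts
    exactly one character and rejects anything else with TypeError.
    """
    if not isinstance(ch, str) or len(ch) != 1:
        raise TypeError("emoji_text() argument must be a single character")
    return ch in EMOJI_TEXT_SET


number_sign_listed = emoji_text("#")
snake_listed = emoji_text("\U0001f40d")
```

With this stand-in set, `number_sign_listed` is true and `snake_listed` is false, matching the observation that SNAKE is not in the file. As noted above, that says nothing about whether a renderer will honor an unlisted sequence.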

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022