
Adding Extra Data #1

Closed
Shizcow opened this issue Nov 18, 2020 · 3 comments · Fixed by #3

Comments

@Shizcow
Collaborator

Shizcow commented Nov 18, 2020

The problem

As I first brought up in this Reddit comment, I think this crate would really benefit from additional data being stored with each emoji.

Here I'll present a draft of what kind of data may be included, how to scrape that data, and methods of generation. If this idea is accepted, it can be workshopped before implementation.

Source Files

Currently, this repo pulls from the emoji-test page. While that provides the basics, it misses a lot of useful data. I propose gathering data from the Unicode CLDR instead. Not only is it how major projects typically build emoji libraries, but it also contains much more data.

Some data that can be scraped with this method is as follows:

  • Codepoint characters
    • Ex: 😂
  • Canonical name
    • Ex: face with tears of joy
  • Category name
    • Ex: Smileys & People
  • Subcategory name
    • Ex: face-positive
  • Keywords
    • Ex: face, face with tears of joy, joy, laugh, tear
  • Qualification
    • Ex: fully-qualified

Scraping Method

Gathering data can be done in a few steps:

  • Categorizing emoji-specific codepoints
  • Parsing basic info
  • Cross-referencing keywords

Gathering the emoji-specific codepoints and initial data is easy: it's found in cldr/tools/java/org/unicode/cldr/util/data/emoji/emoji-test.txt via the CLDR link above. I believe this is the same data this project currently pulls from, but that should be double-checked, and a stable link to the latest version should be found.

Parsing basic info is done directly from the above file. This gives the codepoints, string representation, qualification, and canonical name.
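To make the parsing step concrete, here's a rough std-only sketch of reading one data line of emoji-test.txt. The field layout (codepoints ; qualification # glyph Eversion name) is my reading of the file and should be verified; the Entry struct is hypothetical.

```rust
#[derive(Debug, PartialEq)]
struct Entry {
    codepoints: Vec<u32>,
    qualification: String,
    name: String,
}

fn parse_line(line: &str) -> Option<Entry> {
    // Skip blank lines and comments (group/subgroup headers start with '#').
    if line.trim().is_empty() || line.starts_with('#') {
        return None;
    }
    // Layout: "1F602 ; fully-qualified # 😂 E0.6 face with tears of joy"
    let (fields, comment) = line.split_once('#')?;
    let (cps, qual) = fields.split_once(';')?;
    let codepoints = cps
        .split_whitespace()
        .map(|cp| u32::from_str_radix(cp, 16))
        .collect::<Result<Vec<_>, _>>()
        .ok()?;
    // The comment holds the glyph, an "E<version>" token, then the name.
    let mut parts = comment.trim().splitn(3, ' ');
    let _glyph = parts.next()?;
    let _version = parts.next()?;
    let name = parts.next()?.to_string();
    Some(Entry {
        codepoints,
        qualification: qual.trim().to_string(),
        name,
    })
}

fn main() {
    let line = "1F602 ; fully-qualified # 😂 E0.6 face with tears of joy";
    let entry = parse_line(line).unwrap();
    assert_eq!(entry.codepoints, vec![0x1F602]);
    assert_eq!(entry.qualification, "fully-qualified");
    assert_eq!(entry.name, "face with tears of joy");
    println!("{:?}", entry);
}
```

Multi-codepoint sequences (ZWJ sequences, skin-tone modifiers) fall out of the same logic, since the codepoint field is split on whitespace.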

Cross-referencing is done by examining the files under common/annotations/*.xml.
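For illustration, here's a minimal sketch of pulling keywords out of one CLDR annotation element with plain string handling. A real implementation would use a proper XML parser (xml-rs, suggested below); the element shape, with keywords separated by " | ", matches the CLDR annotation files as I understand them.

```rust
// Extract (codepoint string, keywords) from a single <annotation> element.
fn parse_annotation(line: &str) -> Option<(String, Vec<String>)> {
    let line = line.trim();
    // Skip type="tts" entries, which hold the display name, not keywords.
    if !line.starts_with("<annotation") || line.contains("type=\"tts\"") {
        return None;
    }
    let cp = line.split("cp=\"").nth(1)?.split('"').next()?.to_string();
    let body = line.split('>').nth(1)?.split('<').next()?;
    let keywords = body.split(" | ").map(str::to_string).collect();
    Some((cp, keywords))
}

fn main() {
    let line = r#"<annotation cp="😂">face | face with tears of joy | joy | laugh | tear</annotation>"#;
    let (cp, keywords) = parse_annotation(line).unwrap();
    assert_eq!(cp, "😂");
    assert_eq!(keywords.len(), 5);
    println!("{}: {:?}", cp, keywords);
}
```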

Packaging

Interpreting the scraped data and dumping it into a Rust crate should be done with great care. There is a lot of data here, and I think a lot of room for improvement over the current method. I propose the following:

  • Use build.rs to download files and generate Rust code. This removes the JavaScript dependency this crate currently has and would allow for a very small footprint -- all generation is done at build time.
  • After scraping the data, use build.rs to dump pre-formatted Rust code into OUT_DIR to be included directly in lib.rs.
  • In addition to having each codepoint chronicled, include a final metadata marker -- a compile-time hash map in lib.rs to help with searching and filtering emoji.
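The build.rs flow above could look roughly like this sketch: generate Rust source at build time and write it into OUT_DIR, where lib.rs pulls it in with `include!(concat!(env!("OUT_DIR"), "/generated.rs"));`. The Scraped struct, table layout, and file name are all illustrative, not final.

```rust
use std::{env, fs, path::Path};

// Stand-in for whatever the scraping stage produces.
struct Scraped {
    glyph: &'static str,
    name: &'static str,
}

// Render the scraped entries as Rust source for a static table.
fn generate(entries: &[Scraped]) -> String {
    let mut out = String::from("pub static EMOJIS: &[(&str, &str)] = &[\n");
    for e in entries {
        // {:?} quotes and escapes the strings for us.
        out.push_str(&format!("    ({:?}, {:?}),\n", e.glyph, e.name));
    }
    out.push_str("];\n");
    out
}

fn main() {
    let entries = [Scraped { glyph: "😂", name: "face with tears of joy" }];
    let code = generate(&entries);
    // Cargo sets OUT_DIR for build scripts; skip the write when running
    // this sketch standalone.
    if let Ok(out_dir) = env::var("OUT_DIR") {
        fs::write(Path::new(&out_dir).join("generated.rs"), &code).unwrap();
    }
    println!("{}", code);
}
```

In practice quote (suggested below) would replace the hand-rolled string formatting.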

Localization

The good news is that the CLDR provides annotations in a large number of languages. The bad news is that this project should eventually account for that. I propose we stick with English for now and work the rest out later.

However, here is a rough idea of what I was thinking:

  • Use crate features for each localization. This will require semi-manual updating when new CLDR localizations come out, but I think it's worth it.
  • Each feature is named after its annotations/*.xml file. For example, English localization would be enabled via the en feature.
  • en should be a default feature.
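In Cargo.toml terms, the feature scheme above might look like the following hypothetical fragment (language codes shown are examples only):

```toml
[features]
default = ["en"]   # English annotations on by default
en = []            # annotations/en.xml
de = []            # annotations/de.xml
fr = []            # annotations/fr.xml
```

Each feature would then gate the corresponding generated module, e.g. `#[cfg(feature = "en")]`.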

Dependencies

I recommend some of the following crates while working on this project:

  • phf to create perfect compile-time hashtables
  • xml-rs to parse the annotations
  • quote to generate the rust library code
  • proc_use for splitting large modules into separate files (these files will get seriously huge if not separated)

More will obviously be needed, but I've had positive experiences with the ones above.
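As a taste of what the phf-backed metadata marker would buy us, here's a std-only stand-in: a static table sorted by canonical name plus binary search. With phf the same lookup becomes a compile-time perfect-hash map; everything here (names, glyphs, the BY_NAME table) is illustrative.

```rust
// Static, sorted-by-name table of (canonical name, glyph) pairs, as the
// generated code might emit it. Must stay sorted for binary_search.
static BY_NAME: &[(&str, &str)] = &[
    ("face with tears of joy", "😂"),
    ("grinning face", "😀"),
];

// Look a glyph up by its canonical name.
fn lookup(name: &str) -> Option<&'static str> {
    BY_NAME
        .binary_search_by(|(n, _)| n.cmp(&name))
        .ok()
        .map(|i| BY_NAME[i].1)
}

fn main() {
    assert_eq!(lookup("grinning face"), Some("😀"));
    assert_eq!(lookup("not an emoji"), None);
    println!("lookup works");
}
```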

@richardanaya
Owner

All this really sounds perfect, don't let me stand in the way. You definitely have a much stronger vision for this than I do.

  • dropping the JavaScript dependency sounds great
  • I'll research the CLDR more
  • focusing on English sounds good for now!

@richardanaya
Owner

Added you -- you have access to the repo!

@Shizcow
Collaborator Author

Shizcow commented Nov 19, 2020

Thanks for the access! I have a fork going right now and will finish work soon. First goal is to remove the JS dependency and generate an identical library to what's already here. I have a lot of free time coming up so I'll probably binge code and finish in a few days... Hopefully.
