
Adding Extra Data #1

Closed
Shizcow opened this issue Nov 18, 2020 · 3 comments · Fixed by #3

Comments

@Shizcow
Collaborator

Shizcow commented Nov 18, 2020

The problem

As I first brought up in this Reddit comment, I think this crate would really benefit from additional data being stored with each emoji.

Here I'll present a draft of what kind of data may be included, how to scrape that data, and methods of generation. If this idea is accepted, it can be workshopped before implementation.

Source Files

Currently, this repo pulls from the emoji-test page. While that provides the basics, it misses a lot of useful data. I propose gathering data from the Unicode CLDR instead. Not only is it how major projects typically build emoji libraries, but it also contains much more data.

Some data that can be scraped with this method is as follows:

  • Codepoint characters
    • Ex: 😂
  • Canonical name
    • Ex: face with tears of joy
  • Category name
    • Ex: Smileys & People
  • Subcategory name
    • Ex: face-positive
  • Keywords
    • Ex: face, face with tears of joy, joy, laugh, tear
  • Qualification
    • Ex: fully-qualified

Scraping Method

Gathering data can be done in a few steps:

  • Categorizing emoji-specific codepoints
  • Parsing basic info
  • Cross-referencing keywords

Gathering the emoji-specific codepoints and initial data is easy: it's found in cldr/tools/java/org/unicode/cldr/util/data/emoji/emoji-test.txt via the CLDR link above. I believe this is the same data this project currently pulls from, but that should be double-checked, and a stable link to the latest version should be found.

Parsing basic info is done directly from the above file. This gives the codepoints, string representation, qualification, and canonical name.
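To make the parsing step concrete, here's a rough std-only sketch of reading one data line of emoji-test.txt. The field layout (codepoints ; qualification # glyph Eversion name) is my reading of the file and should be verified; the Entry struct is hypothetical.

```rust
#[derive(Debug, PartialEq)]
struct Entry {
    codepoints: Vec<u32>,
    qualification: String,
    name: String,
}

fn parse_line(line: &str) -> Option<Entry> {
    // Skip blank lines and comments (group/subgroup headers start with '#').
    if line.trim().is_empty() || line.starts_with('#') {
        return None;
    }
    // Layout: "1F602 ; fully-qualified # 😂 E0.6 face with tears of joy"
    let (fields, comment) = line.split_once('#')?;
    let (cps, qual) = fields.split_once(';')?;
    let codepoints = cps
        .split_whitespace()
        .map(|cp| u32::from_str_radix(cp, 16))
        .collect::<Result<Vec<_>, _>>()
        .ok()?;
    // The comment holds the glyph, an "E<version>" token, then the name.
    let mut parts = comment.trim().splitn(3, ' ');
    let _glyph = parts.next()?;
    let _version = parts.next()?;
    let name = parts.next()?.to_string();
    Some(Entry {
        codepoints,
        qualification: qual.trim().to_string(),
        name,
    })
}

fn main() {
    let line = "1F602 ; fully-qualified # 😂 E0.6 face with tears of joy";
    let entry = parse_line(line).unwrap();
    assert_eq!(entry.codepoints, vec![0x1F602]);
    assert_eq!(entry.qualification, "fully-qualified");
    assert_eq!(entry.name, "face with tears of joy");
    println!("{:?}", entry);
}
```

Multi-codepoint sequences (ZWJ sequences, skin-tone modifiers) fall out of the same logic, since the codepoint field is split on whitespace.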

Cross-referencing is done by examining the files under common/annotations/*.xml.
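For illustration, here's a minimal sketch of pulling keywords out of one CLDR annotation element with plain string handling. A real implementation would use a proper XML parser (xml-rs, suggested below); the element shape, with keywords separated by " | ", matches the CLDR annotation files as I understand them.

```rust
// Extract (codepoint string, keywords) from a single <annotation> element.
fn parse_annotation(line: &str) -> Option<(String, Vec<String>)> {
    let line = line.trim();
    // Skip type="tts" entries, which hold the display name, not keywords.
    if !line.starts_with("<annotation") || line.contains("type=\"tts\"") {
        return None;
    }
    let cp = line.split("cp=\"").nth(1)?.split('"').next()?.to_string();
    let body = line.split('>').nth(1)?.split('<').next()?;
    let keywords = body.split(" | ").map(str::to_string).collect();
    Some((cp, keywords))
}

fn main() {
    let line = r#"<annotation cp="😂">face | face with tears of joy | joy | laugh | tear</annotation>"#;
    let (cp, keywords) = parse_annotation(line).unwrap();
    assert_eq!(cp, "😂");
    assert_eq!(keywords.len(), 5);
    println!("{}: {:?}", cp, keywords);
}
```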

Packaging

Interpreting the scraped data and dumping it into a Rust crate should be done with great care. There is a lot of data here, and I think a lot of room for improvement over the current method. I propose the following:

  • Use build.rs to download files and generate Rust code. This removes the JavaScript dependency this crate currently has and would allow for a very small footprint -- all generation is done at build time.
  • After scraping the data, use build.rs to dump pre-formatted Rust code into OUT_DIR to be included directly in lib.rs.
  • In addition to having each codepoint chronicled, include a final metadata marker -- a compile-time hash map in lib.rs to help with searching and filtering emoji.
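The build.rs flow above could look roughly like this sketch: generate Rust source at build time and write it into OUT_DIR, where lib.rs pulls it in with `include!(concat!(env!("OUT_DIR"), "/generated.rs"));`. The Scraped struct, table layout, and file name are all illustrative, not final.

```rust
use std::{env, fs, path::Path};

// Stand-in for whatever the scraping stage produces.
struct Scraped {
    glyph: &'static str,
    name: &'static str,
}

// Render the scraped entries as Rust source for a static table.
fn generate(entries: &[Scraped]) -> String {
    let mut out = String::from("pub static EMOJIS: &[(&str, &str)] = &[\n");
    for e in entries {
        // {:?} quotes and escapes the strings for us.
        out.push_str(&format!("    ({:?}, {:?}),\n", e.glyph, e.name));
    }
    out.push_str("];\n");
    out
}

fn main() {
    let entries = [Scraped { glyph: "😂", name: "face with tears of joy" }];
    let code = generate(&entries);
    // Cargo sets OUT_DIR for build scripts; skip the write when running
    // this sketch standalone.
    if let Ok(out_dir) = env::var("OUT_DIR") {
        fs::write(Path::new(&out_dir).join("generated.rs"), &code).unwrap();
    }
    println!("{}", code);
}
```

In practice quote (suggested below) would replace the hand-rolled string formatting.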

Localization

The good news is that the CLDR provides annotations in a large number of languages. The bad news is that this project should eventually account for that. I propose we stick with English for now and work the rest out later.

However, here is a rough idea of what I was thinking:

  • Use crate features for each localization. This will require semi-manual updating when new CLDR localizations come out, but I think it's worth it.
  • Each feature is named after its annotations/*.xml file. For example, English localization would be enabled via the en feature.
  • en should be a default feature.
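In Cargo.toml terms, the feature scheme above might look like the following hypothetical fragment (language codes shown are examples only):

```toml
[features]
default = ["en"]   # English annotations on by default
en = []            # annotations/en.xml
de = []            # annotations/de.xml
fr = []            # annotations/fr.xml
```

Each feature would then gate the corresponding generated module, e.g. `#[cfg(feature = "en")]`.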

Dependencies

I recommend some of the following crates while working on this project:

  • phf to create perfect compile-time hashtables
  • xml-rs to parse the annotations
  • quote to generate the rust library code
  • proc_use for splitting large modules into separate files (these files will get seriously huge if not separated)

More will obviously be needed, but I've had positive experiences with the ones above.
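As a taste of what the phf-backed metadata marker would buy us, here's a std-only stand-in: a static table sorted by canonical name plus binary search. With phf the same lookup becomes a compile-time perfect-hash map; everything here (names, glyphs, the BY_NAME table) is illustrative.

```rust
// Static, sorted-by-name table of (canonical name, glyph) pairs, as the
// generated code might emit it. Must stay sorted for binary_search.
static BY_NAME: &[(&str, &str)] = &[
    ("face with tears of joy", "😂"),
    ("grinning face", "😀"),
];

// Look a glyph up by its canonical name.
fn lookup(name: &str) -> Option<&'static str> {
    BY_NAME
        .binary_search_by(|(n, _)| n.cmp(&name))
        .ok()
        .map(|i| BY_NAME[i].1)
}

fn main() {
    assert_eq!(lookup("grinning face"), Some("😀"));
    assert_eq!(lookup("not an emoji"), None);
    println!("lookup works");
}
```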

@richardanaya
Owner

All this really sounds perfect, don't let me stand in the way. You definitely have a much stronger vision for this than I do.

  • dropping the JavaScript dependency sounds great
  • I'll research the CLDR more
  • focusing on English sounds good for now!

@richardanaya
Owner

Added you -- you have access to the repo!

@Shizcow
Collaborator Author

Shizcow commented Nov 19, 2020

Thanks for the access! I have a fork going right now and will finish work soon. First goal is to remove the JS dependency and generate an identical library to what's already here. I have a lot of free time coming up so I'll probably binge code and finish in a few days... Hopefully.
