Adding Extra Data #1
Comments
All this really sounds perfect, don't let me stand in the way. You definitely have a much stronger vision for this than I.
Added, you have access to the repo!

Thanks for the access! I have a fork going right now and will finish work soon. First goal is to remove the JS dependency and generate an identical library to what's already here. I have a lot of free time coming up, so I'll probably binge code and finish in a few days... Hopefully.
The problem
As I first brought up in this reddit comment, I think this crate would really benefit from additional data being stored with each emoji.
Here I'll present a draft of what kind of data may be included, how to scrape that data, and methods of generation. If this idea is accepted, it can be workshopped before implementation.
Source Files
Currently, this repo pulls from the emoji-test page. While that covers the basics, it lacks a lot of useful data. I propose gathering data from the Unicode CLDR instead. Not only is it how major projects typically build emoji libraries, it also has much more data.
Some data that can be scraped with this method is as follows:
- Canonical name: face with tears of joy
- Group: Smileys & People
- Subgroup: face-positive
- Keywords: face, face with tears of joy, joy, laugh, tear
- Qualification: fully-qualified
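For reference, the name and keyword data above lives in the CLDR annotation files in roughly this shape (excerpt reconstructed for illustration, not copied from a specific CLDR release):

```xml
<ldml>
  <annotations>
    <!-- pipe-separated search keywords -->
    <annotation cp="😂">face | face with tears of joy | joy | laugh | tear</annotation>
    <!-- type="tts" carries the canonical short name -->
    <annotation cp="😂" type="tts">face with tears of joy</annotation>
  </annotations>
</ldml>
```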
Scraping Method
Gathering data can be done in a few steps:
1. Gathering the emoji-specific codepoints and initial data is easy. It's found in `cldr/tools/java/org/unicode/cldr/util/data/emoji/emoji-test.txt` via the CLDR link above. I believe this is the same data this project currently pulls from, but this should be double-checked, and a link to `latest` should be found.
2. Parsing basic info is done directly from the above file. This gives codepoint, string representation, qualification, and canonical name.
3. Cross-referencing is done by examining the files in `common/annotations/*.xml`.
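Step 2 could be sketched roughly like this. The struct and function names here are hypothetical, and the parsing assumes a recent `emoji-test.txt` line format with `E<version>` tags:

```rust
// Sketch of parsing one emoji-test.txt data line, assuming the format:
//   <codepoints> ; <qualification> # <emoji> E<version> <name>
// (older releases omit the E<version> tag, which a real parser should handle).

#[derive(Debug, PartialEq)]
struct EmojiEntry {
    codepoints: Vec<u32>,
    qualification: String,
    name: String,
}

fn parse_line(line: &str) -> Option<EmojiEntry> {
    let line = line.trim();
    // Skip blank lines and comment/group lines.
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    let (data, comment) = line.split_once('#')?;
    let (codes, qualification) = data.split_once(';')?;
    // Codepoints are space-separated hex values (sequences have several).
    let codepoints = codes
        .split_whitespace()
        .map(|c| u32::from_str_radix(c, 16))
        .collect::<Result<Vec<_>, _>>()
        .ok()?;
    // The trailing comment holds "<emoji> E<version> <name>"; the canonical
    // name is everything after the version tag.
    let mut parts = comment.trim().splitn(3, ' ');
    let _emoji = parts.next()?;
    let _version = parts.next()?;
    let name = parts.next()?.to_string();
    Some(EmojiEntry {
        codepoints,
        qualification: qualification.trim().to_string(),
        name,
    })
}

fn main() {
    let entry = parse_line("1F602 ; fully-qualified # 😂 E0.6 face with tears of joy").unwrap();
    println!("{:X?} {} {:?}", entry.codepoints, entry.qualification, entry.name);
}
```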
Packaging
Interpreting the scraped data and dumping it into a Rust crate should be done with great care. There is a lot of data here, and I think a lot of room for improvement over the current method. I propose the following:

- `build.rs` to download files and generate Rust code. This removes the dependency on JavaScript that this crate currently has, and would allow for a very small footprint -- all generation is done during build time.
- `build.rs` to dump pre-formatted Rust code into `OUT_DIR`, to be included directly in `lib.rs`.
- `lib.rs` to help in searching and filtering emoji.

Localization
The good news is that the CLDR provides annotations in a large number of languages. The bad news is that this project should eventually account for that. I propose we stick with English for now and work that out later.
However, here is a rough idea of what I was thinking:

- Each supported language maps to a cargo feature backed by its `annotations/*.xml` file. For example, English localization would be enabled via the `en` feature.
- `en` should be a default feature.

Dependencies
I recommend some of the following crates while working on this project:

- `phf` to create perfect compile-time hashtables
- `xml-rs` to parse the annotations
- `quote` to generate the Rust library code
- `proc_use` for separating large modules into different files (these files will get seriously huge if not separated)

More will be needed obviously, but I've had positive experiences with the ones above.
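To make the `lib.rs` searching and filtering idea concrete, here is a minimal std-only sketch. The function names and table contents are hypothetical; in the proposal above, the table would be generated into `OUT_DIR` at build time, and `phf` could replace the linear scans with perfect compile-time hashtables:

```rust
/// Hypothetical generated entry: (emoji, canonical name, annotation keywords).
/// A real build would generate thousands of these.
static EMOJI: &[(&str, &str, &[&str])] = &[
    ("\u{1F600}", "grinning face", &["face", "grin", "smile"]),
    ("\u{1F602}", "face with tears of joy", &["face", "joy", "laugh", "tear"]),
];

/// Look up an emoji by its canonical CLDR name.
pub fn from_name(name: &str) -> Option<&'static str> {
    EMOJI.iter().find(|(_, n, _)| *n == name).map(|(e, _, _)| *e)
}

/// Iterate over emoji whose annotations contain the given keyword.
pub fn with_keyword(keyword: &str) -> impl Iterator<Item = &'static str> + '_ {
    EMOJI
        .iter()
        .filter(move |(_, _, ks)| ks.contains(&keyword))
        .map(|(e, _, _)| *e)
}

fn main() {
    println!("{:?}", from_name("grinning face"));
    println!("{}", with_keyword("face").count());
}
```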