Optimize character conversion #15
Conversation
@ManyTheFish Please confirm the intention here is to merge this branch into this repository's main branch?
@sotch-pr35mac, this is indeed what we want to do: merge these improvements into your main branch.
Hello @sotch-pr35mac, I think the implementation is finished on our side. 🙂 Could you please review this PR in order to merge it? Thanks!
Thanks for letting me know. I'll take a look tonight.
Please squash your commits as well.
@@ -9,26 +9,28 @@ Turn Traditional Chinese script to Simplified Chinese script and vice-versa. Che
☝️ This is a major breaking change; bump the version number to 2.0.0 above and elsewhere throughout.
use fst::raw::{Fst, Output};
use once_cell::sync::Lazy;

static T2S: Lazy<HashMap<String, String>> =
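As a rough illustration of how a lookup can walk forward through an FST built with the `fst::raw` API imported above, here is a minimal, hypothetical longest-prefix helper (assuming the `fst` 0.4 crate; the helper name, the tiny map in `main`, and the value encoding are illustrative, not the PR's actual code):

```rust
use fst::raw::{Fst, Output};
use fst::MapBuilder;

/// Hypothetical helper: find the longest key in `fst` that is a prefix of
/// `input`, returning the matched byte length and the associated value.
fn longest_prefix(fst: &Fst<Vec<u8>>, input: &str) -> Option<(usize, u64)> {
    let mut node = fst.root();
    let mut out = Output::zero();
    let mut best = None;
    for (i, byte) in input.bytes().enumerate() {
        // Stop as soon as the FST has no transition for the next byte.
        let index = match node.find_input(byte) {
            Some(index) => index,
            None => break,
        };
        let transition = node.transition(index);
        out = out.cat(transition.out);
        node = fst.node(transition.addr);
        if node.is_final() {
            // Remember the longest match seen so far and its output value.
            best = Some((i + 1, out.cat(node.final_output()).value()));
        }
    }
    best
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Keys must be inserted in lexicographic byte order.
    let mut builder = MapBuilder::memory();
    builder.insert("乾", 0)?;
    builder.insert("乾燥", 1)?;
    let map = builder.into_map();
    // "乾燥" is the longest key prefixing the input, so its value (1) wins.
    assert_eq!(longest_prefix(map.as_fst(), "乾燥劑"), Some((6, 1)));
    Ok(())
}
```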
I really like the implementation with the FST, especially how clean it is to simply walk forward through it. After doing some very basic benchmarking on my laptop, I'm noticing a ~60x improvement in the conversion post-initialization. Initialization, however, has increased by ~4x, to ~2.2s up from ~0.56s on my machine. Initialization time is very important to me here, so I'd like to keep this at or around the previous time. It is also very important to me that the consumer is able to specify when to perform initialization.
So what I'd like to do to handle these tradeoffs is preprocess the FSTs into "profiles" the same way we have done for the HashMaps. Then we should add three initialization functions, one for the HashMaps, one for the FSTs, and one for both the HashMaps and FSTs. We'll probably need to implement a serialize trait for the FST, but after a cursory look around it doesn't look like that should be out of the question. I'm open to other suggestions on how to reduce the initialization latency, but as it stands the ~4x is just too high.
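A rough sketch of what that split could look like, with hypothetical function names, statics, and profile paths (one note: `fst`'s raw `Fst` is already a flat byte buffer, so "serializing" it can amount to writing `fst.as_bytes()` to disk and reloading those bytes with `Fst::new`):

```rust
use std::collections::HashMap;

use fst::raw::Fst;
use once_cell::sync::OnceCell;

// Illustrative statics; the crate's real names and profile files may differ.
static T2S_MAP: OnceCell<HashMap<String, String>> = OnceCell::new();
static T2S_FST: OnceCell<Fst<Vec<u8>>> = OnceCell::new();

/// Initialize only the HashMap-based tables.
pub fn initialize_maps() {
    T2S_MAP.get_or_init(load_map_profile);
}

/// Initialize only the FST-based tables from a pre-serialized profile.
pub fn initialize_fsts() {
    T2S_FST.get_or_init(|| {
        // `t2s.fst` would be written ahead of time via `fst.as_bytes()`;
        // `Fst::new` only validates the bytes instead of rebuilding the FST.
        Fst::new(include_bytes!("../profiles/t2s.fst").to_vec())
            .expect("invalid FST profile")
    });
}

/// Initialize both, for consumers that want everything up front.
pub fn initialize_all() {
    initialize_maps();
    initialize_fsts();
}

fn load_map_profile() -> HashMap<String, String> {
    // Placeholder: the existing HashMap profile loading would go here.
    HashMap::new()
}
```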
Hey @sotch-pr35mac! Thank you for the review; I made the small changes you wanted.
However, I want to respond to some of your suggestions:
> I really like the implementation with the FST, especially how clean it is to simply walk forward through it. After doing some very basic benchmarking on my laptop, I'm noticing a ~60x improvement in the conversion post-initialization. Initialization, however, has increased by ~4x, to ~2.2s up from ~0.56s on my machine. Initialization time is very important to me here, so I'd like to keep this at or around the previous time. It is also very important to me that the consumer is able to specify when to perform initialization.
>
> So what I'd like to do to handle these tradeoffs is preprocess the FSTs into "profiles" the same way we have done for the HashMaps. Then we should add three initialization functions, one for the HashMaps, one for the FSTs, and one for both the HashMaps and FSTs. We'll probably need to implement a serialize trait for the FST, but after a cursory look around it doesn't look like that should be out of the question. I'm open to other suggestions on how to reduce the initialization latency, but as it stands the ~4x is just too high.
I understand your point on this. I would suggest moving the initialization into a build.rs so that the FST is built at compile time (a rough sketch follows at the end of this comment).
> Please squash your commits as well.
Can we avoid squashing them? I find it interesting to link the atomic commits to the performance gains.
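A sketch of that build.rs approach, assuming a tab-separated dictionary file, illustrative file names, and `fst` listed under `[build-dependencies]` (none of this is confirmed by the PR itself):

```rust
// build.rs — build the FST once at compile time and write it into OUT_DIR.
use std::{env, fs, path::Path};

fn main() {
    let out_dir = env::var("OUT_DIR").unwrap();

    // Hypothetical source file holding "traditional<TAB>simplified" pairs.
    let mut pairs: Vec<(String, String)> = fs::read_to_string("data/t2s.txt")
        .expect("missing dictionary file")
        .lines()
        .filter_map(|line| {
            let (traditional, simplified) = line.split_once('\t')?;
            Some((traditional.to_string(), simplified.to_string()))
        })
        .collect();

    // fst requires keys to be inserted in lexicographic byte order.
    pairs.sort_by(|a, b| a.0.cmp(&b.0));

    // Store the index of each replacement as the FST value; the replacement
    // strings themselves go into a side table next to the FST.
    let mut builder = fst::MapBuilder::memory();
    let mut replacements = Vec::new();
    for (i, (key, value)) in pairs.iter().enumerate() {
        builder.insert(key, i as u64).expect("duplicate or unsorted key");
        replacements.push(value.clone());
    }

    fs::write(
        Path::new(&out_dir).join("t2s.fst"),
        builder.into_inner().unwrap(),
    )
    .unwrap();
    fs::write(
        Path::new(&out_dir).join("t2s.replacements"),
        replacements.join("\n"),
    )
    .unwrap();

    println!("cargo:rerun-if-changed=data/t2s.txt");
}
```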
> I understand your point on this. I would suggest moving the initialization into a build.rs so that the FST is built at compile time.
I'm open to that. Just to confirm my own understanding, you're suggesting that we add a build script to build the FSTs at compile time and serialize them out to a file that's later loaded with `Lazy`?
> Can we avoid squashing them? I find it interesting to link the atomic commits to the performance gains.
Hmm... I see. I think that should be fine then.
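On the library side, loading the compile-time-built FST could then look like this sketch (again with assumed file and static names; the replacement side table is omitted). `Lazy` only has to validate the already-built bytes on first use, which keeps initialization cheap:

```rust
use fst::raw::Fst;
use once_cell::sync::Lazy;

// Bytes produced by the build script land in OUT_DIR at compile time.
static T2S_FST_BYTES: &[u8] = include_bytes!(concat!(env!("OUT_DIR"), "/t2s.fst"));

static T2S_FST: Lazy<Fst<&'static [u8]>> = Lazy::new(|| {
    // The FST is already built; Fst::new only validates the embedded bytes
    // instead of reconstructing the structure at runtime.
    Fst::new(T2S_FST_BYTES).expect("embedded FST profile is invalid")
});
```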
Hey @sotch-pr35mac, I made the changes about creating the FST at compile time. Could you retry your initialization tests, please? 😊
Thanks for the quick turnaround, I will give it another look after work today.
Everything looks good
Add benchmarks and optimize character-converter.
Commits
- Add Benchmarks
- Avoid string allocations
- Avoid iterating from the start of the string
- Create slices instead of collecting chars into an allocated string
- Find by prefix using an FST
- Add more benchmarks
- Use Cow instead of always allocating a string
  This avoids creating a string when no characters change (see the sketch after this list).
- Encode into a buffer instead of creating a string in is_script
- Use String::new() instead of to_string()
- Use with_capacity instead of new when allocating the string
- Use contains_key in is_script
  The contains_key version expresses the behavior of the code better despite the small performance decrease. Poke @Kerollmops.
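To illustrate the Cow commit mentioned above, here is a simplified, hypothetical version of the idea (the real crate maps multi-character phrases, not single chars, and its function signature will differ):

```rust
use std::borrow::Cow;
use std::collections::HashMap;

/// Only allocate an output String when at least one character actually
/// changes; otherwise hand the input slice back untouched.
fn convert<'a>(input: &'a str, map: &HashMap<char, char>) -> Cow<'a, str> {
    if !input.chars().any(|c| map.contains_key(&c)) {
        return Cow::Borrowed(input);
    }
    Cow::Owned(input.chars().map(|c| *map.get(&c).unwrap_or(&c)).collect())
}

fn main() {
    let map = HashMap::from([('乾', '干')]);
    // No mapped character: no allocation, the input is borrowed as-is.
    assert!(matches!(convert("你好", &map), Cow::Borrowed(_)));
    // A mapped character: a new String is allocated with the conversion.
    assert_eq!(convert("乾杯", &map), "干杯");
}
```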