Optimize character conversion #15

ManyTheFish · 2022-04-28T09:45:11Z

Add benchmarks and optimize character-converter.

Commits

Add Benchmarks

  test benches::simplified_to_traditional ... bench:     129,776 ns/iter (+/- 4,669)
  test benches::traditional_to_simplified ... bench:     121,232 ns/iter (+/- 2,866)

Avoid string allocations

  test benches::simplified_to_traditional ... bench:     112,347 ns/iter (+/- 5,848)
  test benches::traditional_to_simplified ... bench:     103,843 ns/iter (+/- 6,099)

Avoid iterating from the start of the string

  test benches::simplified_to_traditional ... bench:       94,790 ns/iter (+/- 625)
  test benches::traditional_to_simplified ... bench:       85,548 ns/iter (+/- 2,355)

Create slices instead of collecting chars into an allocated string

  test benches::simplified_to_traditional ... bench:      19,382 ns/iter (+/- 224)
  test benches::traditional_to_simplified ... bench:      16,697 ns/iter (+/- 358)
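The idea behind the two commits above (resume from the current position, and look up borrowed slices rather than collected strings) can be sketched as follows. This is a minimal illustration, not the PR's code: `convert`, its `max_chars` parameter, and the `HashMap<&str, &str>` dictionary are hypothetical names chosen for the example.

```rust
use std::collections::HashMap;

/// Greedy longest-match conversion (hypothetical helper, not the crate's API).
/// Candidates are borrowed slices of the input, so no temporary `String`
/// is built for each lookup.
fn convert(text: &str, dict: &HashMap<&str, &str>, max_chars: usize) -> String {
    let mut out = String::with_capacity(text.len());
    // `rest` always starts at the current position, so the scan never
    // restarts from the beginning of the original string.
    let mut rest = text;
    while !rest.is_empty() {
        let mut first_end = 0;
        let mut best: Option<(usize, &str)> = None;
        for (n, (i, c)) in rest.char_indices().take(max_chars.max(1)).enumerate() {
            // Byte offset just past the n-th character: a valid slice boundary.
            let end = i + c.len_utf8();
            if n == 0 {
                first_end = end;
            }
            // Look up the slice directly; later (longer) hits overwrite earlier ones.
            if let Some(to) = dict.get(&rest[..end]) {
                best = Some((end, *to));
            }
        }
        match best {
            Some((end, to)) => {
                out.push_str(to);
                rest = &rest[end..];
            }
            None => {
                // No entry matched: copy one character through unchanged.
                out.push_str(&rest[..first_end]);
                rest = &rest[first_end..];
            }
        }
    }
    out
}
```

With a toy dictionary containing `"头发" -> "頭髮"` and `"发" -> "發"`, calling `convert("头发发", &dict, 2)` matches the two-character entry first and then the single character.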

Find by prefix using an FST

  test benches::bench_simplified_to_traditional ... bench:       2,590 ns/iter (+/- 135)
  test benches::bench_traditional_to_simplified ... bench:       2,568 ns/iter (+/- 21)
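The forward walk used for prefix lookup can be illustrated without the `fst` crate. The sketch below deliberately swaps in a plain byte trie built on std `HashMap`; it is not an FST and not the PR's implementation, and `Trie`, `TrieNode`, and `longest_prefix` are hypothetical names. What it shares with the FST approach is the shape of the lookup: consume the input byte by byte, remember the last accepting state, and stop at the first missing transition.

```rust
use std::collections::HashMap;

/// One trie state: outgoing byte transitions plus an optional stored value.
#[derive(Default)]
struct TrieNode {
    children: HashMap<u8, TrieNode>,
    value: Option<String>,
}

/// Byte trie used here as a stand-in for an FST (assumption for illustration).
#[derive(Default)]
struct Trie {
    root: TrieNode,
}

impl Trie {
    fn insert(&mut self, key: &str, value: &str) {
        let mut node = &mut self.root;
        for &b in key.as_bytes() {
            node = node.children.entry(b).or_default();
        }
        node.value = Some(value.to_string());
    }

    /// Returns the value and byte length of the longest key that is a
    /// prefix of `input`, or `None` if no key matches.
    fn longest_prefix(&self, input: &str) -> Option<(&str, usize)> {
        let mut node = &self.root;
        let mut last = None;
        for (i, &b) in input.as_bytes().iter().enumerate() {
            match node.children.get(&b) {
                Some(next) => node = next,
                None => break, // no transition: keep the last match found
            }
            if let Some(v) = &node.value {
                last = Some((v.as_str(), i + 1));
            }
        }
        last
    }
}
```

Because every key is a complete UTF-8 string, the returned length always falls on a character boundary of the input, so the caller can slice the remainder safely.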

Add more benchmarks

  test benches::bench_simplified_is_simplified   ... bench:       2,086 ns/iter (+/- 47)
  test benches::bench_simplified_is_traditional  ... bench:         535 ns/iter (+/- 27)
  test benches::bench_simplified_to_simplified   ... bench:       2,454 ns/iter (+/- 18)
  test benches::bench_simplified_to_traditional  ... bench:       2,578 ns/iter (+/- 20)
  test benches::bench_traditional_is_simplified  ... bench:         546 ns/iter (+/- 25)
  test benches::bench_traditional_is_traditional ... bench:       1,961 ns/iter (+/- 20)
  test benches::bench_traditional_to_simplified  ... bench:       2,547 ns/iter (+/- 26)
  test benches::bench_traditional_to_traditional ... bench:       2,432 ns/iter (+/- 29)

Use cow instead of always allocating string

This avoids creating a string when no characters change.

  test benches::bench_simplified_is_simplified   ... bench:       2,141 ns/iter (+/- 77)
  test benches::bench_simplified_is_traditional  ... bench:         550 ns/iter (+/- 37)
+ test benches::bench_simplified_to_simplified   ... bench:       2,262 ns/iter (+/- 12)
  test benches::bench_simplified_to_traditional  ... bench:       2,580 ns/iter (+/- 31)
  test benches::bench_traditional_is_simplified  ... bench:         572 ns/iter (+/- 10)
  test benches::bench_traditional_is_traditional ... bench:       2,009 ns/iter (+/- 19)
  test benches::bench_traditional_to_simplified  ... bench:       2,562 ns/iter (+/- 18)
+ test benches::bench_traditional_to_traditional ... bench:       2,210 ns/iter (+/- 38)
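A minimal sketch of the `Cow` pattern, assuming a per-character mapping: the input is returned borrowed when nothing needs to change, and a `String` is allocated only from the first character that differs. `simplified_char` and `to_simplified` are hypothetical helpers with a two-entry toy table, not the crate's functions.

```rust
use std::borrow::Cow;

/// Toy mapping standing in for the real conversion table (assumption).
fn simplified_char(c: char) -> Option<char> {
    match c {
        '發' => Some('发'),
        '頭' => Some('头'),
        _ => None,
    }
}

/// Borrow the input when no character changes; allocate otherwise.
fn to_simplified(text: &str) -> Cow<'_, str> {
    // Locate the first character that has a replacement.
    let first_change = text
        .char_indices()
        .find(|&(_, c)| simplified_char(c).is_some());
    match first_change {
        None => Cow::Borrowed(text),
        Some((start, _)) => {
            let mut out = String::with_capacity(text.len());
            // Copy the untouched prefix in one go, then map the rest.
            out.push_str(&text[..start]);
            out.extend(text[start..].chars().map(|c| simplified_char(c).unwrap_or(c)));
            Cow::Owned(out)
        }
    }
}
```

This matches the benchmark pattern above, where only the `*_to_*` cases whose input is already in the target script improve.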

Encode in a buffer instead of creating a string in is_script

+ test benches::bench_simplified_is_simplified   ... bench:         845 ns/iter (+/- 32)
+ test benches::bench_simplified_is_traditional  ... bench:         188 ns/iter (+/- 1)
  test benches::bench_simplified_to_simplified   ... bench:       2,260 ns/iter (+/- 18)
  test benches::bench_simplified_to_traditional  ... bench:       2,571 ns/iter (+/- 34)
+ test benches::bench_traditional_is_simplified  ... bench:         170 ns/iter (+/- 2)
+ test benches::bench_traditional_is_traditional ... bench:         787 ns/iter (+/- 6)
  test benches::bench_traditional_to_simplified  ... bench:       2,561 ns/iter (+/- 44)
  test benches::bench_traditional_to_traditional ... bench:       2,211 ns/iter (+/- 21)
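The buffer technique named in this commit can be sketched with `char::encode_utf8`, which writes a character into a caller-supplied stack buffer and returns a `&str` view of it, so a per-character lookup needs no heap allocation. The function name `all_chars_in` and the `HashSet<&str>` of known characters are assumptions for the example; the real `is_script` may differ in both signature and semantics.

```rust
use std::collections::HashSet;

/// Returns true if every character of `text` is present in `known`.
/// Hypothetical stand-in for a script check.
fn all_chars_in(text: &str, known: &HashSet<&str>) -> bool {
    // Four bytes is the maximum length of one UTF-8 encoded character.
    let mut buf = [0u8; 4];
    text.chars().all(|c| {
        // Reuse the same stack buffer instead of calling `c.to_string()`.
        let s: &str = c.encode_utf8(&mut buf);
        known.contains(s)
    })
}
```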

Use String::new() instead of to_string()

  test benches::bench_simplified_is_simplified   ... bench:         884 ns/iter (+/- 22)
  test benches::bench_simplified_is_traditional  ... bench:         194 ns/iter (+/- 4)
+ test benches::bench_simplified_to_simplified   ... bench:       2,230 ns/iter (+/- 27)
+ test benches::bench_simplified_to_traditional  ... bench:       2,532 ns/iter (+/- 19)
  test benches::bench_traditional_is_simplified  ... bench:         206 ns/iter (+/- 1)
  test benches::bench_traditional_is_traditional ... bench:         833 ns/iter (+/- 40)
+ test benches::bench_traditional_to_simplified  ... bench:       2,496 ns/iter (+/- 17)
+ test benches::bench_traditional_to_traditional ... bench:       2,212 ns/iter (+/- 32)

Use with_capacity instead of new when allocating the string

  test benches::bench_simplified_is_simplified   ... bench:         889 ns/iter (+/- 23)
  test benches::bench_simplified_is_traditional  ... bench:         190 ns/iter (+/- 1)
  test benches::bench_simplified_to_simplified   ... bench:       2,235 ns/iter (+/- 10)
+ test benches::bench_simplified_to_traditional  ... bench:       2,420 ns/iter (+/- 72)
  test benches::bench_traditional_is_simplified  ... bench:         197 ns/iter (+/- 3)
  test benches::bench_traditional_is_traditional ... bench:         871 ns/iter (+/- 32)
+ test benches::bench_traditional_to_simplified  ... bench:       2,399 ns/iter (+/- 22)
  test benches::bench_traditional_to_traditional ... bench:       2,177 ns/iter (+/- 17)

Use contains_key in is_script

The contains_key version expresses the behavior of the code more clearly, despite the small performance decrease.

- test benches::bench_simplified_is_simplified   ... bench:       1,001 ns/iter (+/- 31)
- test benches::bench_simplified_is_traditional  ... bench:         217 ns/iter (+/- 3)
  test benches::bench_simplified_to_simplified   ... bench:       2,237 ns/iter (+/- 18)
  test benches::bench_simplified_to_traditional  ... bench:       2,425 ns/iter (+/- 22)
- test benches::bench_traditional_is_simplified  ... bench:         230 ns/iter (+/- 4)
- test benches::bench_traditional_is_traditional ... bench:         979 ns/iter (+/- 72)
  test benches::bench_traditional_to_simplified  ... bench:       2,406 ns/iter (+/- 56)
  test benches::bench_traditional_to_traditional ... bench:       2,189 ns/iter (+/- 20)

poke @Kerollmops

sotch-pr35mac · 2022-04-29T13:32:33Z

@ManyTheFish Please confirm the intention here is to merge this branch into sotch-pr35mac:master and not ManyTheFish:master.

Kerollmops · 2022-04-29T21:15:23Z

@sotch-pr35mac, this is indeed what we want to do: merge these improvements into your main branch.

ManyTheFish · 2022-05-02T16:01:25Z

Hello @sotch-pr35mac, I think the implementation is finished on our side. 🙂

Could you please review this PR in order to merge it?

Thanks!

sotch-pr35mac · 2022-05-02T16:16:24Z

Thanks for letting me know. I'll take a look tonight.

sotch-pr35mac

Please squash your commits as well.

sotch-pr35mac · 2022-05-03T01:06:34Z

README.md

    @@ -9,26 +9,28 @@ Turn Traditional Chinese script to Simplified Chinese script and vice-versa. Che
     ```rust

☝️ This is a major breaking change, bump the version number to 2.0.0 above and elsewhere throughout.

sotch-pr35mac · 2022-05-03T02:40:42Z

src/lib.rs

+use fst::raw::{Fst, Output};
+use once_cell::sync::Lazy;
+
+static T2S: Lazy<HashMap<String, String>> =


I really like the implementation with the FST, especially how clean it is to simply walk forward through it. After doing some very basic benchmarking on my laptop I'm noticing ~60x improvement in the conversion post-initialization. Initialization however has increased by ~4x to ~2.2s up from ~0.56s on my machine. Initialization time is very important to me here, so I'd like to keep this at or around the previous time. It is also very important to me that the consumer is able to specify when to perform initialization.

So what I'd like to do to handle these tradeoffs is preprocess the FSTs into "profiles" the same way we have done for the HashMaps. Then we should add three initialization functions, one for the HashMaps, one for the FSTs, and one for both the HashMaps and FSTs. We'll probably need to implement a serialize trait for the FST, but after a cursory look around it doesn't look like that should be out of the question. I'm open to other suggestions on how to reduce the initialization latency, but as it stands the ~4x is just too high.

Hey @sotch-pr35mac! Thank you for the review, I made the small changes you asked for.

However, I want to respond to some of your suggestions:

> I really like the implementation with the FST, especially how clean it is to simply walk forward through it. After doing some very basic benchmarking on my laptop I'm noticing ~60x improvement in the conversion post-initialization. Initialization however has increased by ~4x to ~2.2s up from ~0.56s on my machine. Initialization time is very important to me here, so I'd like to keep this at or around the previous time. It is also very important to me that the consumer is able to specify when to perform initialization.
>
> So what I'd like to do to handle these tradeoffs is preprocess the FSTs into "profiles" the same way we have done for the HashMaps. Then we should add three initialization functions, one for the HashMaps, one for the FSTs, and one for both the HashMaps and FSTs. We'll probably need to implement a serialize trait for the FST, but after a cursory look around it doesn't look like that should be out of the question. I'm open to other suggestions on how to reduce the initialization latency, but as it stands the ~4x is just too high.

I understand your point. I would suggest moving the initialization into a build.rs so that the FST is built at compile time.

> Please squash your commits as well.

Can we avoid squashing them? I find it useful to link atomic commits to their performance gains.

> I understand your point. I would suggest moving the initialization into a build.rs so that the FST is built at compile time.

I'm open to that. Just to confirm my own understanding, you're suggesting that we add a build script to build the FSTs at compile time and serialize them out to a file that's later loaded with Lazy?

> Can we avoid squashing them? I find it useful to link atomic commits to their performance gains.

Hmm... I see, I think that should be fine then.
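The requirement discussed in this thread, that the consumer decides when initialization happens while lookups still work if it was never requested, can be sketched with the standard library alone. This is an illustration under stated assumptions: it uses `std::sync::OnceLock` in place of `once_cell::sync::Lazy`, a two-entry toy `HashMap` in place of a pre-built FST loaded from a file, and the names `init`, `t2s`, `build_t2s`, and `traditional_to_simplified` are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

// Global table, empty until first use or an explicit `init()` call.
static T2S: OnceLock<HashMap<&'static str, &'static str>> = OnceLock::new();

/// Stand-in for deserializing a table produced ahead of time
/// (for example by a build script); toy data only.
fn build_t2s() -> HashMap<&'static str, &'static str> {
    HashMap::from([("頭髮", "头发"), ("發", "发")])
}

/// Accessor that initializes the table on first call and reuses it afterwards.
fn t2s() -> &'static HashMap<&'static str, &'static str> {
    T2S.get_or_init(build_t2s)
}

/// Lets the consumer pay the initialization cost at a moment of their choosing.
/// Calling it more than once is harmless.
pub fn init() {
    t2s();
}

/// Lookup that works whether or not `init()` was called beforehand.
pub fn traditional_to_simplified(word: &str) -> Option<&'static str> {
    t2s().get(word).copied()
}
```

A caller that cares about startup latency would call `init()` during its own setup phase; a caller that does not can skip it and accept the one-time cost on the first lookup.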

ManyTheFish · 2022-05-04T10:19:56Z

Hey @sotch-pr35mac, I made the changes to create the FST at compile time. Could you rerun your initialization tests, please? 😊

sotch-pr35mac · 2022-05-04T13:17:56Z

Thanks for the quick turnaround, I will give it another look after work today.

sotch-pr35mac

Everything looks good

ManyTheFish added 3 commits April 28, 2022 11:41

Add benchmarks

d54a2dd

Add rustfmt

9c91503

Avoid string allocations

bed7037

ManyTheFish force-pushed the optimize-character-converter branch 2 times, most recently from 6a56072 to b41ad5b Compare April 28, 2022 15:34

ManyTheFish added 5 commits April 28, 2022 18:36

Avoid iterating from the start of the string

a48bdc7

Create slices instead of collecting chars into an allocated string

5a38c37

Rework library to be more rust idiomatic

39bcb25

Find by prefix using an FST

5f5b9eb

Make some cleaning

bad23de

ManyTheFish force-pushed the optimize-character-converter branch from 800864e to bad23de Compare April 28, 2022 16:37

ManyTheFish marked this pull request as ready for review April 28, 2022 17:00

sotch-pr35mac self-assigned this Apr 28, 2022

Kerollmops suggested changes Apr 29, 2022

Fix small PR comments

5cf447a

ManyTheFish force-pushed the optimize-character-converter branch from 4dfc85f to 982701b Compare May 2, 2022 11:29

ManyTheFish added 2 commits May 2, 2022 13:32

Add more benchmarks

e19bdce

Use cow instead of always allocating string

38be7b4

ManyTheFish force-pushed the optimize-character-converter branch from 982701b to 38be7b4 Compare May 2, 2022 11:34

Kerollmops suggested changes May 2, 2022

Encode in a buffer instead of creating a string in is_script

33b97dc

ManyTheFish commented May 2, 2022

ManyTheFish force-pushed the optimize-character-converter branch 2 times, most recently from 758c99e to e86a035 Compare May 2, 2022 12:36

Use String::new() instead of to_string()

404968a

ManyTheFish force-pushed the optimize-character-converter branch from e86a035 to 404968a Compare May 2, 2022 12:39

Kerollmops suggested changes May 2, 2022

Use with_capacity instead of new when allocating the string

ab0ab52

Use contains_key in is_script

7cf0888

sotch-pr35mac requested changes May 3, 2022

ManyTheFish requested a review from sotch-pr35mac May 3, 2022 14:35

ManyTheFish force-pushed the optimize-character-converter branch from 0aaee2d to 08d1674 Compare May 4, 2022 10:15

Kerollmops approved these changes May 4, 2022

ManyTheFish added 3 commits May 4, 2022 12:21

Force tabs in rustfmt

31409e4

upgrade lib version

3263511

Move FST creation at building time

ce11684

ManyTheFish force-pushed the optimize-character-converter branch from 08d1674 to ce11684 Compare May 4, 2022 10:21

sotch-pr35mac approved these changes May 7, 2022

sotch-pr35mac merged commit 4b45edb into sotch-pr35mac:master May 7, 2022
