Computing LSTM forward layer without allocating #3351

Merged on May 4, 2023 (11 commits)


@robertbastian robertbastian commented Apr 19, 2023

Based on #3349

There's no visible performance difference on our benchmarks. However, this avoids an allocation of length (code points × hidden units), and we only bench on small strings, so the benchmarks would not show the benefit. For longer strings it's probably good to avoid this allocation.

#3305
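To illustrate the kind of allocation being avoided, here is a minimal sketch. It is not the code from this PR: the `step` function is a toy recurrent update standing in for the real LSTM gates, and `forward_collect` / `forward_streaming` are hypothetical names. The first variant stores one hidden vector per input element (inputs × hidden units floats); the second reuses a single buffer and hands each state to a callback.

```rust
/// Toy recurrent step standing in for the real LSTM gates (hypothetical):
/// h[j] = tanh(0.5 * h[j] + x * w[j]).
fn step(h: &mut [f32], x: f32, w: &[f32]) {
    for (hj, wj) in h.iter_mut().zip(w) {
        *hj = (0.5 * *hj + x * wj).tanh();
    }
}

/// Allocating variant: keeps every hidden state,
/// i.e. `inputs.len() * w.len()` floats in total.
fn forward_collect(inputs: &[f32], w: &[f32]) -> Vec<Vec<f32>> {
    let mut h = vec![0.0f32; w.len()];
    let mut all = Vec::with_capacity(inputs.len());
    for &x in inputs {
        step(&mut h, x, w);
        all.push(h.clone());
    }
    all
}

/// Streaming variant: one hidden buffer is reused, and each state is passed
/// to `consume` before the next step overwrites it.
fn forward_streaming(inputs: &[f32], w: &[f32], mut consume: impl FnMut(usize, &[f32])) {
    let mut h = vec![0.0f32; w.len()];
    for (i, &x) in inputs.iter().enumerate() {
        step(&mut h, x, w);
        consume(i, &h);
    }
}

fn main() {
    let w = [0.3f32, -0.7, 1.1];
    let inputs = [1.0f32, 0.5, -2.0, 0.25];
    let collected = forward_collect(&inputs, &w);
    let mut matches = true;
    forward_streaming(&inputs, &w, |i, h| matches &= collected[i].as_slice() == h);
    println!("streaming matches collected: {matches}");
}
```

The streaming variant only works when the consumer can process each state as it arrives; whether that holds for every layer of the segmenter is not something this sketch establishes.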


@robertbastian robertbastian added the C-segmentation Component: Segmentation label Apr 20, 2023
@robertbastian robertbastian requested review from Manishearth and removed request for Manishearth April 20, 2023 15:32
@robertbastian robertbastian marked this pull request as ready for review April 20, 2023 16:48
@zbraniecki (Member)

I can report a perf improvement on the th (Thai) LSTM benchmarks from icu-perf:

icu4x/th/baked/segmenter/word/lstm/overview
                        time:   [2.6583 ms 2.6646 ms 2.6730 ms]
                        change: [-1.4063% -1.1068% -0.7734%] (p = 0.00 < 0.05)
                        Change within noise threshold.

icu4x/th/baked/segmenter/line/lstm/overview
                        time:   [2.6445 ms 2.6487 ms 2.6530 ms]
                        change: [-1.7483% -1.4729% -1.2153%] (p = 0.00 < 0.05)
                        Performance has improved.

@robertbastian (Member, Author)

Yeah I got the 1% as well but it's nothing to write home about 😀

@zbraniecki (Member)

I'm all for celebrating little wins.

sffc previously approved these changes May 2, 2023

@sffc (Member) left a comment

Thanks for making this change

sffc previously approved these changes May 3, 2023
@Manishearth Manishearth removed their request for review May 3, 2023 16:34
@robertbastian robertbastian requested a review from sffc May 3, 2023 18:38
self.dic
    .get_copied(UnvalidatedStr::from_bytes(&buf[..i]))
    .get_copied_by(|key| {
        key.as_bytes().iter().copied().cmp(
Member

Suggestion (optional): UTF-8 byte order is equivalent to UTF-32 order; it may be cleaner/smaller/faster code if you compared an iterator of char rather than an iterator of u8.

Separately, I'm starting to get a bit worried that the grapheme cluster code may bloat the code size of the LSTM segmenter. We should probably delete it at some point. Maybe as part of the next round of ML model upgrades.
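The ordering claim above can be checked with a small sketch using only the standard library. The helper names `cmp_bytes` and `cmp_chars` are made up for this illustration; the sample strings are arbitrary and include one-, two-, three-, and four-byte encodings.

```rust
use std::cmp::Ordering;

/// Compare two strings by their raw UTF-8 bytes.
fn cmp_bytes(a: &str, b: &str) -> Ordering {
    a.as_bytes().iter().cmp(b.as_bytes().iter())
}

/// Compare two strings by their Unicode scalar values (`char`s).
fn cmp_chars(a: &str, b: &str) -> Ordering {
    a.chars().cmp(b.chars())
}

fn main() {
    let samples = ["", "a", "ab", "z", "é", "ก", "กข", "中", "\u{FFFF}", "😀"];
    for a in samples {
        for b in samples {
            // Byte-wise and char-wise lexicographic order agree for valid UTF-8.
            assert_eq!(cmp_bytes(a, b), cmp_chars(a, b), "{a:?} vs {b:?}");
        }
    }
    println!("byte order and char order agree on all sample pairs");
}
```

Note that this equivalence holds for well-formed UTF-8 versus code point order; UTF-16 code unit order can differ for characters outside the Basic Multilingual Plane.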

Member Author

The keys are UnvalidatedStrs, aka [u8]; we cannot use them as anything that's strongly typed.
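One way to compare untyped byte keys against a sequence of chars, without building a temporary string and without extra dependencies, is to encode the chars to UTF-8 lazily and compare bytes. The sketch below is an assumption-laden illustration, not the PR's implementation: `ByteDict`, its `get_copied_by` method, and `utf8_bytes` are hypothetical stand-ins written against the standard library only.

```rust
use std::cmp::Ordering;

/// Lazily yields the UTF-8 bytes of a sequence of chars, without a String.
fn utf8_bytes<'a>(chars: &'a [char]) -> impl Iterator<Item = u8> + 'a {
    chars.iter().flat_map(|&c| {
        let mut buf = [0u8; 4];
        let len = c.encode_utf8(&mut buf).len();
        // `buf` is Copy, so the closure owns its own copy of the encoded bytes.
        (0..len).map(move |i| buf[i])
    })
}

/// Hypothetical stand-in for a dictionary keyed by raw bytes,
/// with entries sorted by byte order.
struct ByteDict<'a> {
    entries: &'a [(&'a [u8], u16)],
}

impl<'a> ByteDict<'a> {
    /// Binary search with a caller-supplied comparator that reports how a
    /// stored key orders relative to the value being looked up.
    fn get_copied_by(&self, mut compare: impl FnMut(&[u8]) -> Ordering) -> Option<u16> {
        self.entries
            .binary_search_by(|entry| compare(entry.0))
            .ok()
            .map(|i| self.entries[i].1)
    }
}

fn main() {
    // Keys must already be sorted by byte order for the binary search.
    let entries: [(&[u8], u16); 3] = [
        ("a".as_bytes(), 1),
        ("ก".as_bytes(), 2),
        ("กข".as_bytes(), 3),
    ];
    let dict = ByteDict { entries: &entries };
    let needle = ['ก', 'ข'];
    let hit = dict.get_copied_by(|key| key.iter().copied().cmp(utf8_bytes(&needle)));
    println!("{hit:?}");
}
```

Because the keys are never interpreted as `str` or `char`, this works even if a stored key is not valid UTF-8; such a key simply never compares equal to a well-formed needle.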

Member

Member Author

We don't like deps

Contributor

Segmenter already depends on utf8_iter.

utf8_iter = "1.0.3"

@robertbastian robertbastian merged commit 5ab0f5f into unicode-org:main May 4, 2023
22 checks passed
@robertbastian robertbastian deleted the segiter branch May 4, 2023 08:46
@Manishearth Manishearth mentioned this pull request Sep 21, 2023
13 tasks
Labels
C-segmentation Component: Segmentation
4 participants