Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kmp tokenizer #49

Merged
merged 12 commits into from
May 19, 2023
Merged

Kmp tokenizer #49

merged 12 commits into from
May 19, 2023

Conversation

nomisRev
Copy link
Contributor

@nomisRev nomisRev commented May 10, 2023

This PR introduces a KMP tokenizer, ported from https://github.com/knuddelsgmbh/jtokkit. Excuse the 200k additions, but most of that are encodings..

It moves TokenTextSplitter into commonMain with the new code.

  • Port & add tests (can also be in subsequent PR)

This was a quick spike to see if we could port it from Java to KMP, and it was quite easy based on a discussion yesterday with @raulraja.

@xebia-functional/team-ai

scrape-it-browser-fetcher = { module = "it.skrape:skrapeit-browser-fetcher", version.ref = "scrapeit" }
scrape-it-async-fetcher = { module = "it.skrape:skrapeit-asyn-fetcher", version.ref = "scrapeit" }
jtokk-it = { module = "com.knuddels:jtokkit", version.ref = "jtokkit" }
skrape = { module = "it.skrape:skrapeit", version.ref = "scrapeit" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Can we rename scrapeit to skrapeit in the version reference too?

import com.xebia.functional.tools.Agent

suspend fun search(vararg prompt: String): Array<out Agent> =
prompt
.map {
bingSearch(
search = it,
TokenTextSplitter(modelName = "gpt-3.5-turbo", chunkSize = 100, chunkOverlap = 50)
TokenTextSplitter(ModelType.GPT_3_5_TURBO, chunkSize = 100, chunkOverlap = 50)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙌

Comment on lines +53 to +60
* ```kotlin
* val encoding = EncodingType.CL100K_BASE.encoding
* encoding.encode("hello world", 100)
* // returns [15339, 1917]
*
* encoding.encode("hello &lt;|endoftext|&gt; world", 100)
* // raises an UnsupportedOperationException
* ```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example returns the same result as if we had called the encode function. It would be helpful to provide an example that will behave differently than if we used encode to show the difference between both functions.

Copy link
Contributor

@franciscodr franciscodr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of minor comments. Thanks, @nomisRev!

@nomisRev
Copy link
Contributor Author

nomisRev commented May 17, 2023

Semi-blocked due to: https://youtrack.jetbrains.com/issue/KT-58678/Native-Regex-inconsistency-with-JVM-Native-Regex

This probably works for 99% of the cases so we could move ahead with it, or we need to exclude Kotlin/Native for now. This has been resolved.

@nomisRev
Copy link
Contributor Author

This is ready for re-review @xebia-functional/team-ai.
Thank you @realdavidvega for helping fixing the regex 🙌

@nomisRev nomisRev merged commit f04a8ab into main May 19, 2023
1 check failed
@nomisRev nomisRev deleted the KMP-tokenizer branch May 19, 2023 11:05
@serras serras mentioned this pull request May 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants