Kmp tokenizer #49
scrape-it-browser-fetcher = { module = "it.skrape:skrapeit-browser-fetcher", version.ref = "scrapeit" }
scrape-it-async-fetcher = { module = "it.skrape:skrapeit-asyn-fetcher", version.ref = "scrapeit" }
jtokk-it = { module = "com.knuddels:jtokkit", version.ref = "jtokkit" }
skrape = { module = "it.skrape:skrapeit", version.ref = "scrapeit" }
Nit: Can we rename `scrapeit` to `skrapeit` in the version reference too?
 import com.xebia.functional.tools.Agent

 suspend fun search(vararg prompt: String): Array<out Agent> =
   prompt
     .map {
       bingSearch(
         search = it,
-        TokenTextSplitter(modelName = "gpt-3.5-turbo", chunkSize = 100, chunkOverlap = 50)
+        TokenTextSplitter(ModelType.GPT_3_5_TURBO, chunkSize = 100, chunkOverlap = 50)
🙌
 * ```kotlin
 * val encoding = EncodingType.CL100K_BASE.encoding
 * encoding.encode("hello world", 100)
 * // returns [15339, 1917]
 *
 * encoding.encode("hello <|endoftext|> world", 100)
 * // raises an UnsupportedOperationException
 * ```
This example returns the same result as if we had called the `encode` function. It would be helpful to provide an example that behaves differently from `encode`, to show the difference between both functions.
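For illustration, here is a minimal sketch of the kind of example that would make the difference visible. It assumes the two-argument overload truncates the result to at most `maxTokens` tokens. `ToyEncoding`, its vocabulary, and its token ids are hypothetical stand-ins, not the tokenizer from this PR.

```kotlin
// Hypothetical toy encoding, NOT the API introduced in this PR.
// It maps whitespace-separated words to made-up ids purely for illustration.
class ToyEncoding(private val vocabulary: Map<String, Int>) {

    // Encodes the whole text; unknown words raise an exception.
    fun encode(text: String): List<Int> =
        text.split(" ")
            .filter { it.isNotEmpty() }
            .map { word ->
                vocabulary[word] ?: throw IllegalArgumentException("Unknown word: $word")
            }

    // Assumed semantics: same as encode(text), but keeps at most maxTokens tokens.
    fun encode(text: String, maxTokens: Int): List<Int> =
        encode(text).take(maxTokens)
}

fun main() {
    val encoding = ToyEncoding(mapOf("hello" to 1, "world" to 2))

    // A limit above the token count gives the same result as plain encode.
    println(encoding.encode("hello world"))      // [1, 2]
    println(encoding.encode("hello world", 100)) // [1, 2]

    // A limit below the token count is where the two calls differ.
    println(encoding.encode("hello world", 1))   // [1]
}
```

With a limit of 100 the two calls are indistinguishable, so a KDoc example using a limit smaller than the token count would document the truncation behaviour more clearly.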
Just a couple of minor comments. Thanks, @nomisRev!
Semi-blocked due to: https://youtrack.jetbrains.com/issue/KT-58678/Native-Regex-inconsistency-with-JVM-Native-Regex
…4k into KMP-tokenizer
This is ready for re-review @xebia-functional/team-ai.
This PR introduces a KMP tokenizer, ported from https://github.com/knuddelsgmbh/jtokkit. Excuse the 200k additions, but most of those are encodings.
It moves `TokenTextSplitter` into `commonMain` with the new code. This was a quick spike, following a discussion yesterday with @raulraja, to see if we could port it from Java to KMP, and it turned out to be quite easy.
@xebia-functional/team-ai