Kmp tokenizer #49

nomisRev · 2023-05-10T11:06:18Z

This PR introduces a KMP tokenizer, ported from https://github.com/knuddelsgmbh/jtokkit. Excuse the 200k additions; most of them are encodings.

It also moves TokenTextSplitter into commonMain, alongside the new code.

Port and add tests (this can also be done in a subsequent PR)

This was a quick spike, following a discussion yesterday with @raulraja, to see whether we could port it from Java to KMP. It turned out to be quite easy.

@xebia-functional/team-ai

franciscodr · 2023-05-10T12:24:05Z

gradle/libs.versions.toml

-scrape-it-browser-fetcher = { module = "it.skrape:skrapeit-browser-fetcher", version.ref = "scrapeit" }
-scrape-it-async-fetcher = { module = "it.skrape:skrapeit-asyn-fetcher", version.ref = "scrapeit" }
-jtokk-it = { module = "com.knuddels:jtokkit", version.ref = "jtokkit" }
+skrape = { module = "it.skrape:skrapeit", version.ref = "scrapeit" }


Nit: Can we rename scrapeit to skrapeit in the version reference too?

franciscodr · 2023-05-10T12:25:34Z

kotlin/src/jvmMain/kotlin/com/xebia/functional/tool/DefaultSearch.kt

 import com.xebia.functional.tools.Agent

 suspend fun search(vararg prompt: String): Array<out Agent> =
  prompt
    .map {
      bingSearch(
        search = it,
-        TokenTextSplitter(modelName = "gpt-3.5-turbo", chunkSize = 100, chunkOverlap = 50)
+        TokenTextSplitter(ModelType.GPT_3_5_TURBO, chunkSize = 100, chunkOverlap = 50)


franciscodr · 2023-05-10T12:37:44Z

tokenizer/src/commonMain/kotlin/com/xebia/functional/tokenizer/Encoding.kt

+   * ```kotlin
+   * val encoding = EncodingType.CL100K_BASE.encoding
+   * encoding.encode("hello world", 100)
+   * // returns [15339, 1917]
+   *
+   * encoding.encode("hello <|endoftext|> world", 100)
+   * // raises an UnsupportedOperationException
+   * ```


This example returns the same result as if we had called the encode function. It would be helpful to provide an example that will behave differently than if we used encode to show the difference between both functions.
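
One way to act on this suggestion is to pick a token limit smaller than the number of tokens in the text, so the bounded call visibly returns less than the unbounded one. The sketch below illustrates that with a toy tokenizer, not the real encoding: `toyVocabulary` and both `toyEncode` overloads are hypothetical names, and the only data used are the two token ids quoted in the KDoc example. The real implementation may apply additional rules when truncating.

```kotlin
// Toy stand-in for a tokenizer; this is NOT the real CL100K_BASE encoding.
// The two token ids are the ones quoted in the KDoc example under review.
val toyVocabulary: Map<String, Int> = mapOf("hello" to 15339, " world" to 1917)

// Hypothetical unbounded encode: greedily matches the longest known piece.
fun toyEncode(text: String): List<Int> {
    val tokens = mutableListOf<Int>()
    var rest = text
    while (rest.isNotEmpty()) {
        val piece = toyVocabulary.keys
            .filter { rest.startsWith(it) }
            .maxByOrNull { it.length }
            ?: throw IllegalArgumentException("No toy token for: $rest")
        tokens += toyVocabulary.getValue(piece)
        rest = rest.removePrefix(piece)
    }
    return tokens
}

// Hypothetical bounded encode: same tokens, cut off after maxTokens (assumed behavior).
fun toyEncode(text: String, maxTokens: Int): List<Int> =
    toyEncode(text).take(maxTokens)

fun main() {
    println(toyEncode("hello world"))      // [15339, 1917]
    println(toyEncode("hello world", 100)) // [15339, 1917], limit not reached
    println(toyEncode("hello world", 1))   // [15339], limit changes the result
}
```

With a limit of 100 the two calls are indistinguishable, which is the problem with the current example; a limit of 1 makes the difference obvious.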

franciscodr

Just a couple of minor comments. Thanks, @nomisRev!

nomisRev · 2023-05-17T08:03:16Z

Semi-blocked due to: https://youtrack.jetbrains.com/issue/KT-58678/Native-Regex-inconsistency-with-JVM-Native-Regex

~~This probably works for 99% of the cases so we could move ahead with it, or we need to exclude Kotlin/Native for now.~~ This has been resolved.


nomisRev · 2023-05-19T10:10:40Z

This is ready for re-review @xebia-functional/team-ai.
Thank you @realdavidvega for helping fix the regex 🙌

nomisRev added 2 commits May 10, 2023 12:56

Add KMP tokenizer

7f59ffa

Surrounding code, and remove dependenncy jtokit

69cae83

franciscodr reviewed May 10, 2023


Add ImmutableByteArrayTest

76cd36b

franciscodr reviewed May 10, 2023


franciscodr approved these changes May 10, 2023


nomisRev and others added 2 commits May 10, 2023 19:35

Stuck on Kotlin/Native Regex

8a2e099

Merge branch 'main' into KMP-tokenizer

6b502ed

realdavidvega and others added 7 commits May 17, 2023 16:33

fix: disable native target due to regex inconsistencies

7a0aebb

Link to the issue: https://youtrack.jetbrains.com/issue/KT-58678/Native-Regex-inconsistency-with-JVM-Native-Regex

fix(tokenizer): increase mocha timeout for pipeline js test issues

311c662

fix(tokenizer): subtitute \p{L} and \P{N} by unicodes for native

d257062

feat(tokenizer): adding also unicodes for numbers for greater range t…

f9332fe

…han just ASCII

feat(tokenizer): replace /d with unicodes

1753f45

Complete test suite, and support p50 & r50

b5ccb52

Merge branch 'KMP-tokenizer' of github.com:xebia-functional/langchain…

d136520

…4k into KMP-tokenizer

nomisRev merged commit f04a8ab into main May 19, 2023
1 check failed

nomisRev deleted the KMP-tokenizer branch May 19, 2023 11:05

serras mentioned this pull request May 24, 2023

Proper tokenizer #38

Closed
