A Go package for intelligently splitting text into chunks while preserving semantic meaning. This package is particularly useful for processing large texts for language models and other NLP applications.
- Configurable chunk size
- Support for overlapping chunks
- Customizable token counting
- Semantic-aware text splitting
```sh
go get github.com/sanbaiw/semtxtsplitter
```
To use the package, you need to provide a token counter function. The following example uses tiktoken to count tokens.
```go
package main

import (
	"log"

	"github.com/pkoukk/tiktoken-go"
	"github.com/sanbaiw/semtxtsplitter"
)

func main() {
	// Use tiktoken with the cl100k_base encoding to count tokens.
	tokenizer, err := tiktoken.GetEncoding("cl100k_base")
	if err != nil {
		log.Fatal(err)
	}

	// A token counter maps a string to the number of tokens it contains.
	tokenCounter := func(text string) int {
		return len(tokenizer.Encode(text, nil, nil))
	}

	// Create a text splitter with a chunk size of 1000 tokens and 10% overlap.
	splitter, err := semtxtsplitter.NewTextSplitter(1000, 0.1, tokenCounter)
	if err != nil {
		log.Fatal(err)
	}

	// Split your text into chunks.
	yourText := "Your long document text goes here..."
	splits := splitter.Split(yourText)
	_ = splits // use the resulting chunks as needed
}
```
You can also use other token counters if you prefer, as long as they follow the same `func(string) int` shape as `tokenCounter` above.
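For example, here is a minimal sketch of a tokenizer-free counter that approximates token counts with whitespace-separated words (`wordCounter` is just an illustrative name, not part of the package):

```go
import "strings"

// wordCounter is a rough approximation that counts whitespace-separated
// words instead of model tokens.
func wordCounter(text string) int {
	return len(strings.Fields(text))
}
```

Passing `wordCounter` to `NewTextSplitter` in place of the tiktoken-based counter means chunk size and overlap are then measured in words rather than model tokens.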
MIT