gotokenizer

A tokenizer for Go based on dictionary and Bigram language models. (Currently only Chinese segmentation is supported.)

Motivation

I wanted a simple tokenizer with no unnecessary overhead, built on the standard library only, following good practices and backed by well-tested code.

Features

  • Supports the Maximum Matching method
  • Supports the Minimum Matching method
  • Supports Reverse Maximum Matching
  • Supports Reverse Minimum Matching
  • Supports Bidirectional Maximum Matching
  • Supports Bidirectional Minimum Matching (a sketch of switching between matchers follows this list)
  • Supports stop token filtering
  • Supports custom word filters (see the hedged sketch after the Usage section)
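The Usage section below walks through the Maximum Matching tokenizer; the other matchers listed above live in the same package and follow the same create, load, Get workflow. The sketch here only illustrates switching algorithms: the constructor name NewRMaxMatch is an assumption modeled on NewMaxMatch, not confirmed API, so check the package GoDoc for the exact names.

package main

import (
	"fmt"

	"github.com/xujiajun/gotokenizer"
)

func main() {
	text := "支持6种分词算法"

	// Placeholder path; point this at your local copy of data/zh/dict.txt.
	dictPath := "/path/to/gotokenizer/data/zh/dict.txt"

	// Reverse Maximum Matching. NewRMaxMatch is an assumed constructor
	// name, by analogy with NewMaxMatch; verify it before relying on it.
	rmm := gotokenizer.NewRMaxMatch(dictPath)
	rmm.LoadDict() // load dict, as in the Usage example below

	fmt.Println(rmm.Get(text))
}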

Installation

go get -u github.com/xujiajun/gotokenizer

Usage

package main

import (
	"fmt"
	"github.com/xujiajun/gotokenizer"
)

func main() {
	text := "gotokenizer是一款基于字典和Bigram模型纯go语言编写的分词器,支持6种分词算法。支持stopToken过滤和自定义word过滤功能。"

	dictPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/dict.txt"
	// NewMaxMatch's default wordFilter is NumAndLetterWordFilter
	mm := gotokenizer.NewMaxMatch(dictPath)
	// load dict
	mm.LoadDict()

	fmt.Println(mm.Get(text)) //[gotokenizer 是 一款 基于 字典 和 Bigram 模型 纯 go 语言 编写 的 分词器 , 支持 6 种 分词 算法 。 支持 stopToken 过滤 和 自定义 word 过滤 功能 。] <nil>

	// enable stop token filtering
	mm.EnabledFilterStopToken = true
	mm.StopTokens = gotokenizer.NewStopTokens()
	stopTokenDicPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/stop_tokens.txt"
	mm.StopTokens.Load(stopTokenDicPath)

	fmt.Println(mm.Get(text)) //[gotokenizer 一款 字典 Bigram 模型 go 语言 编写 分词器 支持 6 种 分词 算法 支持 stopToken 过滤 自定义 word 过滤 功能] <nil>
	fmt.Println(mm.GetFrequency(text)) //map[6:1 种:1 算法:1 过滤:2 支持:2 Bigram:1 模型:1 编写:1 gotokenizer:1 go:1 分词器:1 分词:1 word:1 功能:1 一款:1 语言:1 stopToken:1 自定义:1 字典:1] <nil>

}
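The <nil> values in the output comments are the errors returned by Get and GetFrequency, which the example above simply prints alongside the results. In real code you will usually want to check them explicitly. A minimal sketch, using the same calls as the example and an illustrative placeholder dictionary path:

package main

import (
	"fmt"
	"log"

	"github.com/xujiajun/gotokenizer"
)

func main() {
	text := "支持stopToken过滤和自定义word过滤功能"
	// Placeholder path; point this at your local copy of data/zh/dict.txt.
	dictPath := "/path/to/gotokenizer/data/zh/dict.txt"

	mm := gotokenizer.NewMaxMatch(dictPath)
	mm.LoadDict()

	// Get returns the tokens and an error (the <nil> in the comments above);
	// check the error instead of printing it.
	tokens, err := mm.Get(text)
	if err != nil {
		log.Fatal(err) // e.g. a missing or unreadable dictionary
	}
	fmt.Println(tokens)

	freq, err := mm.GetFrequency(text)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(freq)
}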

For more examples, see the tests.
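The Usage comment notes that NewMaxMatch uses NumAndLetterWordFilter by default, and the feature list advertises custom word filters. The sketch below only illustrates the general idea: the WordFilter interface name, the Filter method signature, and the way the filter is attached to the tokenizer are all assumptions, so check the package source for the actual contract.

// customFilter drops single-character tokens. This is an illustration
// of the "custom word filter" feature only: the WordFilter interface and
// the Filter signature shown here are assumptions modeled on the default
// NumAndLetterWordFilter, not confirmed API.
type customFilter struct{}

func (f *customFilter) Filter(tokens []string) []string {
	kept := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if len([]rune(t)) > 1 { // keep multi-rune tokens only
			kept = append(kept, t)
		}
	}
	return kept
}

// Attaching it (field name assumed):
//   mm := gotokenizer.NewMaxMatch(dictPath)
//   mm.WordFilter = &customFilter{}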

Contributing

If you'd like to help out with the project, you can open a pull request.

Author

xujiajun

License

gotokenizer is open-source software licensed under the Apache-2.0 License.

Acknowledgements

This package is inspired by the following:

https://github.com/ysc/word