Add language analyzer for Chinese support #111

Merged
merged 10 commits into from
Mar 8, 2022


hengfeiyang
Contributor

No description provided.

@hengfeiyang
Contributor Author

add language analyzer for Chinese support

@hengfeiyang hengfeiyang changed the title add language analyzer for Chinese support Add language analyzer for Chinese support Mar 5, 2022
@hengfeiyang
Contributor Author

hengfeiyang commented Mar 6, 2022

This is a Zinc plugin that adds Chinese analyzer support.

  • Analyzer: gse_standard, gse_search
  • Tokenizer: gse_standard, gse_search
  • TokenFilter: gse_stop

gse

https://github.com/go-ego/gse

gse is an efficient multilingual NLP and text segmentation library for Go; it supports English, Chinese, Japanese, and other languages.

Environment

Set the following environment variables to enable gse support:

  • ZINC_PLUGIN_GSE_ENABLE: true or false, default is false.
  • ZINC_PLUGIN_GSE_DICT_EMBED: small or big, default is small. Selects which dictionary size is loaded when gse is enabled.
  • ZINC_PLUGIN_GSE_DICT_PATH: custom dictionary path, default is ./plugins/gse/dict.
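
As an illustration of the defaults listed above, here is a minimal Python sketch of a launcher-side check that reads the three variables before starting Zinc. The `GseSettings` class and `load_gse_settings` function are hypothetical names invented for this example; how Zinc itself parses these variables is not shown in this PR, so treat the validation rules as assumptions.

```python
import os
from dataclasses import dataclass


@dataclass
class GseSettings:
    """Hypothetical container for the three gse plugin variables."""
    enable: bool
    dict_embed: str
    dict_path: str


def load_gse_settings(env=None):
    """Read the gse variables from a mapping, applying the documented defaults.

    Assumption: only the literal string "true" enables the plugin; Zinc's
    real parsing may accept other spellings.
    """
    env = os.environ if env is None else env
    enable = env.get("ZINC_PLUGIN_GSE_ENABLE", "false").strip().lower() == "true"
    dict_embed = env.get("ZINC_PLUGIN_GSE_DICT_EMBED", "small").strip().lower()
    if dict_embed not in ("small", "big"):
        raise ValueError("ZINC_PLUGIN_GSE_DICT_EMBED must be 'small' or 'big'")
    dict_path = env.get("ZINC_PLUGIN_GSE_DICT_PATH", "./plugins/gse/dict")
    return GseSettings(enable, dict_embed, dict_path)
```

Calling `load_gse_settings()` with no argument reads the current process environment; passing a plain dict makes the defaults easy to check without touching real variables.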

API example

POST http://localhost:4080/es/_analyze

{
  "analyzer": "gse_standard",
  "text": "《复仇者联盟3:无限战争》是全片使用IMAX摄影机拍摄制作的的科幻片."
}

POST http://localhost:4080/es/_analyze

{
  "analyzer": "gse_search",
  "text": "《复仇者联盟3:无限战争》是全片使用IMAX摄影机拍摄制作的的科幻片."
}
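
The two analyze calls above differ only in the analyzer name, so they can be scripted. The following Python sketch builds the request with the standard library and does not send it; `build_analyze_request` is a hypothetical helper, and any authentication your Zinc server requires is deliberately left out.

```python
import json
import urllib.request


def build_analyze_request(analyzer, text, base_url="http://localhost:4080"):
    """Build (but do not send) a POST request for the /es/_analyze endpoint."""
    # ensure_ascii=False keeps the Chinese text readable in the payload.
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/es/_analyze",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Against a running server, the returned object can be passed to `urllib.request.urlopen`, once for `gse_standard` and once for `gse_search`, to compare the tokens each analyzer produces.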

PUT http://localhost:4080/api/index

{
	"name": "my-index-chs",
	"mappings": {
		"properties": {
			"title": {
				"type": "text",
				"index": true,
				"highlightable": true,
				"analyzer": "gse_search",
				"search_analyzer": "gse_standard"
			},
			"author": {
				"type": "keyword",
				"index": true,
				"store": false
			},
			"create_time": {
				"type": "time"
			}
		}
	}
}

PUT http://localhost:4080/api/my-index-chs/document

{
	"title": "《复仇者联盟3:无限战争》是全片使用IMAX摄影机拍摄制作的科幻片",
	"author": "灭霸",
	"create_time": "2022-03-05T18:18:18+08:00"
}

POST http://localhost:4080/es/my-index-chs/_search

{
	"query": {
		"match": {
			"title": "复仇者联盟"
		}
	}
}

Custom user dictionary

Append your words to the file ${ZINC_PLUGIN_GSE_DICT_PATH}/user.txt.

Format (property is the word's part-of-speech tag):

word    frequency   property

For example:

复仇者联盟 100 n
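
To sanity-check entries before appending them to user.txt, a small validator along these lines may help. This is a sketch that assumes every entry has exactly the three whitespace-separated fields shown above; `DictEntry` and `parse_dict_line` are hypothetical names, and gse's own loader may be more lenient than this.

```python
from typing import NamedTuple, Optional


class DictEntry(NamedTuple):
    word: str
    frequency: int
    pos: str  # the "property" column, i.e. the part-of-speech tag


def parse_dict_line(line: str) -> Optional[DictEntry]:
    """Parse one user-dictionary line; return None for blank lines."""
    parts = line.split()
    if not parts:
        return None
    if len(parts) != 3:
        raise ValueError(f"expected 'word frequency property', got: {line!r}")
    word, frequency, pos = parts
    # int() raises ValueError if the frequency column is not a whole number.
    return DictEntry(word, int(frequency), pos)
```

Running every line of a candidate file through `parse_dict_line` catches missing columns or non-numeric frequencies before the dictionary is loaded.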

Custom stop tokens

Append your words to the file ${ZINC_PLUGIN_GSE_DICT_PATH}/stop.txt.

Format (one stop word per line):

word

For example:

哈哈
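
Conceptually, a stop-token filter drops any token that appears in the stop list. The Python sketch below illustrates that idea only; it is not the gse_stop implementation, and the function names and the sample token list are made up for the example.

```python
def load_stop_words(text):
    """Build a set of stop words from file contents, one word per line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop set, preserving order."""
    return [token for token in tokens if token not in stop_words]


# Illustrative input: the file holds the single entry from the example above.
stop_words = load_stop_words("哈哈\n")
filtered = remove_stop_words(["哈哈", "复仇者", "联盟"], stop_words)
```

With this input, `filtered` keeps `复仇者` and `联盟` and drops `哈哈`, which is the effect to expect for indexed text once the word is added to stop.txt.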

@hengfeiyang
Contributor Author

Set the default search analyzer:

PUT http://localhost:4080/api/index

{
	"name": "my-index-chs",
	"settings": {
		"analysis": {
			"analyzer": {
				"default": {
					"type": "gse_search"
				}
			}
		}
	}
}

@liuxingke

How can I override the built-in stop word list? I want every word to be searchable.

3 participants