This repository contains Korean corpora and Python scripts for training doc2vec models and inferring vectors for test documents.
- Korean Wikipedia / space tokenizer (467MB)
- Korean Wikipedia / mecab pos tokenizer / tag info (910MB)
- Korean Wikipedia / mecab pos tokenizer / no tag info (535MB)
- Korean Wikipedia / mecab pos tokenizer / no tag info / 30 vectors (dmpv)
- Korean Wikipedia / mecab pos tokenizer / no tag info / 100 vectors (dmpv)
- Korean Wikipedia / mecab pos tokenizer / no tag info / 300 vectors (dmpv)
- Korean Wikipedia / mecab pos tokenizer / no tag info / 1000 vectors (dmpv)
- Korean Wikipedia + financial news / mecab pos tokenizer / no tag info / 30 vectors (dmpv)
- Korean Wikipedia + financial news / mecab pos tokenizer / no tag info / 100 vectors (dmpv)
This simple web service provides a word-embedding API. The methods are based on Gensim's Word2Vec / Doc2Vec implementation. Models are passed as parameters and must be in the Word2Vec / Doc2Vec text or binary format. This web2vec-api script is forked from the word2vec-api GitHub project, with minor updates to support Korean word2vec models.
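The `most_similar` method exposed by the API ranks vocabulary words by cosine similarity to the query word's vector. A minimal, self-contained sketch of that computation, using made-up toy vectors rather than a real model (the words and values are illustrative only):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" (hypothetical values, not from a trained model).
vectors = {
    "경제": [0.9, 0.1, 0.0],
    "금융": [0.8, 0.2, 0.1],
    "날씨": [0.0, 0.9, 0.4],
}

def most_similar(word, topn=2):
    # Score every other vocabulary word against the query and sort descending.
    target = vectors[word]
    scores = [(w, cosine(target, v)) for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:topn]

print(most_similar("경제"))  # "금융" ranks above "날씨"
```

Gensim performs the same ranking over the full vocabulary with vectorized NumPy operations; this sketch only shows the underlying similarity measure.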
- Install Dependencies
pip2 install -r requirements.txt
- Launching the service
python word2vec-api.py --model path/to/the/model [--path /word2vec --host host --port 1234]
e.g.) python /home/word2vec-api.py --model /home/model/all_terms_50vectors --path /word2vec --host 0.0.0.0 --port 4000
python doc2vec-api.py --model path/to/the/model [--path /doc2vec --host host --port 1234]
e.g.) python /home/doc2vec-api.py --model /home/model/all_terms_50vectors --path /doc2vec --host 0.0.0.0 --port 4000
- Example calls
curl "http://127.0.0.1:5000/word2vec/most_similar?positive=무증"
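Many shells and HTTP clients require non-ASCII query values such as 무증 to be percent-encoded. A small Python sketch (assuming the default host and port from the curl example above) that builds the encoded request URL:

```python
from urllib.parse import urlencode

# Hypothetical base URL matching the curl example above.
base = "http://127.0.0.1:5000/word2vec/most_similar"

# urlencode percent-encodes the UTF-8 bytes of the Korean query term.
query = urlencode({"positive": "무증"})
url = f"{base}?{query}"
print(url)
# http://127.0.0.1:5000/word2vec/most_similar?positive=%EB%AC%B4%EC%A6%9D
```

The resulting URL can be passed to curl, `requests.get`, or any other HTTP client without further escaping.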