word2sim

Tiny HTTP service that returns word2vec cosine similarity and nearest neighbors. Stateless. No sessions. Just the math.

Designed as a backend building block — a Semantle-style game, a search re-ranker, or a writing-assistance tool can all sit on top.

Stack

  • FastAPI + uvicorn
  • gensim (loads pretrained word2vec-google-news-300 by default: 3M tokens × 300 dims, ~3.4GB RAM)

Endpoints

Method  Path                          Purpose
GET     /health                       liveness probe
GET     /similarity?a=X&b=Y           cosine similarity between two words
GET     /neighbors?word=X&topn=10     nearest-neighbor words with scores
GET     /vocab?word=X                 check if a word is in vocab; return canonical form
GET     /random                       random vocab word, filtered for game-friendliness
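Under the hood, /similarity reduces to plain cosine similarity over the two word vectors (gensim's KeyedVectors.similarity computes the same quantity). A dependency-free sketch of the math:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors:
    dot(a, b) / (|a| * |b|). Ranges from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```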

Examples

curl 'http://localhost:8000/similarity?a=king&b=queen'
# {"a":"king","b":"queen","canonical_a":"king","canonical_b":"queen",
#  "in_vocab_a":true,"in_vocab_b":true,"similarity":0.6510957}

curl 'http://localhost:8000/neighbors?word=ocean&topn=5'
# {"word":"ocean","canonical":"ocean","in_vocab":true,
#  "neighbors":[{"word":"oceans","similarity":0.78},{"word":"sea","similarity":0.75}, ...]}

curl 'http://localhost:8000/vocab?word=Paris'
# {"word":"Paris","canonical":"Paris","in_vocab":true}

curl 'http://localhost:8000/random?min_rank=500&max_rank=20000&min_len=4&max_len=8'
# {"word":"harbor","rank":8421}

/random query params

Param       Default  Meaning
min_rank    100      skip the top-N most frequent tokens (common function words)
max_rank    50000    cap at top-N most frequent (avoids rare/noisy tail)
alpha_only  true     reject phrases (new_york), digits, punctuation
min_len     3        minimum word length, in characters
max_len     12       maximum word length, in characters

Uses rejection sampling over the frequency-sorted vocab; if no word passes the filters within 1000 attempts, returns 503 (loosen the filters and retry).
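The rejection-sampling loop can be sketched as follows (a minimal illustration, not the repo's actual code — parameter names mirror the query params above):

```python
import random

def random_game_word(vocab, min_rank=100, max_rank=50000,
                     alpha_only=True, min_len=3, max_len=12,
                     max_attempts=1000):
    """Rejection-sample a game-friendly word.

    `vocab` is assumed to be a list of tokens sorted most-frequent-first,
    so the list index doubles as the frequency rank.
    """
    hi = min(max_rank, len(vocab))
    for _ in range(max_attempts):
        rank = random.randrange(min_rank, hi)
        word = vocab[rank]
        if alpha_only and not word.isalpha():
            continue  # rejects phrases like new_york, digits, punctuation
        if not (min_len <= len(word) <= max_len):
            continue
        return {"word": word, "rank": rank}
    return None  # the service maps this to HTTP 503
```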

Out-of-vocab words return in_vocab:false and similarity:null. Case-insensitive lookup tries exact → lower → capitalized.
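The exact → lower → capitalized fallback chain is simple to express; a sketch of what the canonical-form lookup might look like (illustrative, not the repo's code):

```python
def canonical_form(word, vocab):
    """Resolve a word to its in-vocab form, trying exact match first,
    then lowercase, then Capitalized. Returns None if out of vocab."""
    for candidate in (word, word.lower(), word.capitalize()):
        if candidate in vocab:
            return candidate
    return None  # caller reports in_vocab:false, similarity:null
```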

Quick start

docker compose up --build
# first boot downloads ~1.6GB model into the gensim-cache volume; later boots are instant

Using your own vectors

Skip the download by mounting a locally trained vectors.bin:

# docker-compose.yml
services:
  word2sim:
    environment:
      MODEL_PATH: /models/vectors.bin
    volumes:
      - ./vectors.bin:/models/vectors.bin:ro

(Train one with bash demo-word.sh from the upstream word2vec repo.)

Config (env vars)

Var              Default                   Meaning
MODEL_NAME       word2vec-google-news-300  gensim downloader id
MODEL_PATH       (unset)                   if set + file exists, load this .bin instead (skips download)
GENSIM_DATA_DIR  /data/gensim-cache        where gensim caches downloaded models
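A loader honoring these env vars might look like this (a sketch; the helper `resolve_model_source` is hypothetical, introduced here so the branching logic is visible):

```python
import os

def resolve_model_source(env=os.environ):
    """Decide where vectors come from, per the env vars above."""
    path = env.get("MODEL_PATH")
    if path and os.path.exists(path):
        return ("path", path)
    return ("download", env.get("MODEL_NAME", "word2vec-google-news-300"))

def load_vectors(env=os.environ):
    kind, value = resolve_model_source(env)
    if kind == "path":
        # Local word2vec .bin (C binary format) -- no download needed.
        from gensim.models import KeyedVectors
        return KeyedVectors.load_word2vec_format(value, binary=True)
    # gensim's downloader honors GENSIM_DATA_DIR for its cache location.
    import gensim.downloader as api
    return api.load(value)
```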

Project layout

word2sim/
├── app/
│   ├── main.py       # FastAPI routes
│   └── vectors.py    # model loader + similarity/neighbors
├── Dockerfile
├── docker-compose.yml
└── requirements.txt

Building a Semantle-style game on top

The game server keeps state (session, secret, guess log); it calls word2sim per guess:

new game:  GET  /random?min_rank=500&max_rank=20000&min_len=4&max_len=10
           GET  /neighbors?word={secret}&topn=1000     → cache ranks locally
on guess:  GET  /similarity?a={secret}&b={guess}

word2sim stays stateless and cache-friendly.
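The game-server side of that flow can be sketched as a thin client (a hypothetical helper, not part of this repo; the `fetch` parameter is injectable so the HTTP layer can be faked in tests):

```python
import json
from urllib import request, parse

class Word2SimClient:
    """Thin client for the word2sim endpoints used by a game server."""

    def __init__(self, base_url="http://localhost:8000", fetch=None):
        self.base_url = base_url.rstrip("/")
        self._fetch = fetch or self._http_get

    def _http_get(self, url):
        with request.urlopen(url) as resp:
            return json.load(resp)

    def _get(self, path, **params):
        return self._fetch(f"{self.base_url}{path}?{parse.urlencode(params)}")

    def new_secret(self):
        # One /random call plus one /neighbors call; the game server
        # caches the neighbor ranks locally for "you are N words away" hints.
        secret = self._get("/random", min_rank=500, max_rank=20000,
                           min_len=4, max_len=10)["word"]
        neighbors = self._get("/neighbors", word=secret, topn=1000)["neighbors"]
        ranks = {n["word"]: i + 1 for i, n in enumerate(neighbors)}
        return secret, ranks

    def score_guess(self, secret, guess):
        return self._get("/similarity", a=secret, b=guess)["similarity"]
```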
