If Heroku is running, check out the live demo (it may take 30 seconds to warm up):
Words rarely exist in a vacuum. To understand the meaning of the word cat, it's useful to know that it is (hypernym) an animal, that it is the same as (synonym) a feline, that a Tabby is a type of (hyponym) cat, and that in some reasonable sense it is the opposite (antonym) of a dog. Since words are connected in a rich network of linguistic information, why not (literally) follow that path and see where it takes us?
Instead of looking at a single word in isolation, this project tries to elucidate which words should fall between a start word and an end word.
Grouping words together is a classic problem in computational linguistics. Typical approaches use LSA, LSI, LDA or Pachinko allocation. Personally, I prefer Word2Vec, which was developed by some lovely engineers from Google. Partly because there exists an excellent port to Python via gensim, but mostly because it's awesome.
Word2Vec maps each word to a point on a unit hypersphere. Words that are "close" on this sphere often share some kind of semantic relation. If we pick two words, say "boy" and "man", we can trace the shortest path that connects them. We parameterize this curve with a "time" t, where t=0 at boy and t=1 at man. Words that lie close to this curve are selected and ordered by their t value (i.e. the t at which they are closest to the connecting curve). In theory, this timeline should be a semantic map from one word to another -- smoothly varying across meaning.
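On a unit hypersphere, the shortest path between two points is the great-circle arc, which can be parameterized by spherical linear interpolation ("slerp"). A minimal numpy sketch of that parameterization (the function name and toy vectors are illustrative, not taken from the repo):

```python
import numpy as np

def slerp(a, b, t):
    """Point at parameter t on the great-circle arc from unit vector a to b."""
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))  # angle between a and b
    so = np.sin(omega)
    if so < 1e-12:            # vectors (nearly) coincide; the arc degenerates
        return a
    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / so

# Two toy unit vectors a quarter-turn apart
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
mid = slerp(a, b, 0.5)        # the midpoint of the arc, still on the unit circle
```

Unlike a straight chord, every interpolated point stays on the sphere, which is why the true curve is the "right" object to follow.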
In practice, however, it turns out that computing the true curve across the hypersphere is rather tricky. It's even harder to find the nearest points efficiently. However, if we cheat a little, we can draw a straight line connecting the two points as an approximation to the curve. If we do this, the problem reduces to a fast linear-algebra solution. Since we are moving across (trans) the orthogonal space spanned by word2vec's construction, we call this method transorthogonal linguistics.
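The straight-line shortcut is easy to sketch. The snippet below is a toy illustration of the idea, not the repo's actual `word_path.py`: hand-built 3-D "embeddings" stand in for real word2vec vectors, and the function name and threshold are assumptions. Each word is projected onto the line from start to end; its t value is the position of that projection, and words whose perpendicular distance to the line is too large are dropped.

```python
import numpy as np

def word_path(vecs, start, end, max_dist=0.9):
    """Order words along the straight line from vecs[start] to vecs[end]."""
    a, b = vecs[start], vecs[end]
    d = b - a
    dd = d @ d                          # squared length of the line segment
    results = []
    for w, v in vecs.items():
        t = (v - a) @ d / dd            # position of w's projection on the line
        perp = v - (a + t * d)          # component of w off the line
        if 0.0 <= t <= 1.0 and np.linalg.norm(perp) <= max_dist:
            results.append((t, w))
    return [w for t, w in sorted(results)]

deg = np.deg2rad
vecs = {
    "boy":    np.array([1.0, 0.0, 0.0]),
    "lad":    np.array([np.cos(deg(30)), np.sin(deg(30)), 0.0]),
    "guy":    np.array([np.cos(deg(60)), np.sin(deg(60)), 0.0]),
    "man":    np.array([0.0, 1.0, 0.0]),
    "planet": np.array([0.0, 0.0, 1.0]),  # far from the boy-man line
}
print(word_path(vecs, "boy", "man"))      # → ['boy', 'lad', 'guy', 'man']
```

Note that "planet" projects onto the middle of the segment (t=0.5) but sits too far off the line, so the perpendicular-distance cutoff removes it; only the distance test involves anything beyond dot products, which is what makes the linear version fast.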
The database contained within this repo was constructed from a full English dump of Wikipedia that was sentence- and word-tokenized with NLTK. Word2Vec training was done with a single pass, 300 dimensions and a minimum vocabulary count of 800. These choices were found to give good results while keeping the model small enough to query online reasonably quickly.
python transorthogonal_linguistics/word_path.py boy man
With the input of boy man we get:
boy, sixteen-year-old, orphan, teenager, girl, schoolgirl, youngster, shepherd, lad, kid, kitten, lonely, maid, beggar, policeman, prostitute, thug, villager, handsome, loner, thief, cop, gentleman, stranger, lady, Englishman, guy, woman, person, man
With the input of sun moon we get:

sun, sunlight, mist, glow, shine, clouds, skies, shines, shining, glare, moonlight, sky, darkness, shadows, heavens, horizon, crescent, earth, eclipses, constellations, comet, planets, orbits, orbiting, Earth, Io, Jupiter, planet, Venus, Pluto, Uranus, orbit, moons, lunar, moon
Other interesting examples:
girl woman
lover sinner
fate destiny
god demon
good bad
mind body
heaven hell
American Soviet
idea action
socialism capitalism
Marxism Stalinism
man machine
sustenance starvation
war peace
predictable idiosyncratic
acceptance uproar