Skip to content

A word search through vector space using Julia and GloVe

Notifications You must be signed in to change notification settings

qiuyuew/thesaurus

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GloVe Thesaurus

This is a Julia web app that uses the GloVe word vector data to look up similarities among word meanings.

Installation

wget http://www-nlp.stanford.edu/data/glove.840B.300d.txt.gz

gunzip -c glove.840B.300d.txt.gz \
| awk '{for (i=2; i<NF; i++) printf $i " "; print $NF}' \
| head -100000 >crawl300-100k.vec

gunzip -c glove.840B.300d.txt.gz \
| awk '{print $1}' \
| head -100000 >crawl300-100k.idx

The .vec file is the large file--it contains only floating point values, one row per word, with 300 dimensions per row. The .idx file is simply an index of all of the words, each word listed per line, in the same order as the .vec file.

You'll probably need these dependencies:

julia -e 'Pkg.add("Distances")'
julia -e 'Pkg.add("Morsel")'
julia -e 'Pkg.add("JSON")'

Running

$ julia 

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.3.3 (2014-11-23 20:19 UTC)
 _/ |\__'_|_|_|\__'_|  |  'Official http://julialang.org/ release'
|__/                   |  x86_64-apple-darwin13.3.0
julia> include("WordMatrix.jl")

julia> m = WordMatrix("/path/to/crawl300-100k")

Note that the file path has no suffix--the ".idx" and ".vec" are added automatically.

Example

julia> topn(m, m["laser"] + m["science"], 12)
12-element Array{(String,Float64),1}:
 ("laser",6.939878264355364)     
 ("science",7.804343256488264)   
 ("lasers",8.240093239632245)    
 ("scientific",8.946992980586012)
 ("physics",9.008729968872023)   
 ("technology",9.193295956585269)
 ("sciences",9.426696189473054)  
 ("astronomy",9.54593972112149)  
 ("imaging",9.547669144832966)   
 ("research",9.627897577880903)  
 ("optics",9.651007849517955)    
 ("Laser",9.724459820355438)     

About

A word search through vector space using Julia and GloVe

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Julia 42.2%
  • CSS 29.9%
  • JavaScript 27.9%