The word2vec tool by Mikolov et al. lets us create word vectors from a dataset of text. Unlike the binary present/absent representation used by a bag-of-words model, these word vectors can be compared with one another to see whether two words are related.
This is a Clojure wrapper around the Java implementation of word2vec [available here](https://github.com/medallia/Word2VecJava).
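
To make "comparing two words" concrete, here is a minimal sketch of cosine similarity, a common way to compare word vectors. It is plain Clojure and does not use this library; the `toy-vectors` map and its numbers are invented for illustration and are not output from word2vec.

```clojure
;; Toy 3-dimensional vectors, made up for illustration only.
;; Vectors learned by word2vec typically have far more dimensions.
(def toy-vectors
  {"king"  [0.9 0.8 0.1]
   "queen" [0.8 0.9 0.2]
   "apple" [0.1 0.2 0.9]})

(defn dot
  "Dot product of two equal-length numeric vectors."
  [a b]
  (reduce + (map * a b)))

(defn cosine-similarity
  "Cosine of the angle between a and b. Values near 1.0 mean the
  vectors point in a similar direction."
  [a b]
  (/ (dot a b)
     (* (Math/sqrt (dot a a)) (Math/sqrt (dot b b)))))

;; Related words should score higher than unrelated ones.
(cosine-similarity (toy-vectors "king") (toy-vectors "queen")) ; relatively high
(cosine-similarity (toy-vectors "king") (toy-vectors "apple")) ; relatively low
```

With a one-hot bag-of-words encoding, any two distinct words would score 0 under the same measure, which is why dense vectors are more useful for judging relatedness.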
To include word2vec, add the following to the :dependencies section of your project.clj
First, require clojure-word2vec.core in your namespace:
```clojure
(ns clojure-word2vec.examples
  (:require [clojure-word2vec.core :refer :all]
            [clojure.java.io :as io]))
```
Download a text corpus and place it in the resources folder. Here we'll download James Joyce's Ulysses from Project Gutenberg.
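
If you would rather fetch the corpus from the REPL than download it by hand, something like the following sketch may work. `download-corpus!` is a hypothetical helper, not part of this library, and both the Project Gutenberg URL and the `resources/ulysses.txt` destination are assumptions that you should check before use.

```clojure
(require '[clojure.java.io :as io])

(defn download-corpus!
  "Hypothetical helper: copies the contents of source (a URL string)
  to the file at dest-path, creating parent directories if needed.
  Returns dest-path."
  [source dest-path]
  (io/make-parents dest-path)
  (with-open [in  (io/input-stream source)
              out (io/output-stream (io/file dest-path))]
    (io/copy in out))
  dest-path)

(comment
  ;; Assumed URL; confirm the current plain-text link on Project Gutenberg.
  (download-corpus! "https://www.gutenberg.org/files/4300/4300-0.txt"
                    "resources/ulysses.txt"))
```

The call is wrapped in `comment` so that loading the file does not trigger a network request.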
```clojure
(def data
  (create-input-format "path/to/ulysses.txt"))
```
Create the model and train it using the default hyperparameters:
```clojure
(def model (word2vec data))
```
Hyperparameters can be specified as arguments to word2vec:
```clojure
(def model (word2vec data :window-size 15))
```
Find the closest words to a given word:
```clojure
(get-matches model "woman")
```
A longer introduction is available in the docs.
Copyright © 2015 Bridgei2i
Distributed under the Eclipse Public License version 1.0.