Word Embedding
==============

Hao Dong highly recommends reading Colah's blog `Word Representations`_ to
understand why we want to use a vector representation, and how to compute the
vectors. (Chinese readers can `click here <http://dataunion.org/9331.html>`_.)

Train an embedding matrix
^^^^^^^^^^^^^^^^^^^^^^^^^

Basically, training an embedding matrix is unsupervised learning. As every word
is represented by a unique ID, which is the row index of the embedding matrix,
a word can be converted into a vector that better represents its meaning.
For example, there seems to be a constant male-female difference vector:
``woman - man = queen - king``, i.e. one dimension of the vector space
represents gender.
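
For a quick sanity check of this intuition, here is a toy sketch with made-up
3-d vectors (real embeddings are learned and far higher dimensional); the
third dimension plays the role of the hypothetical "gender" direction:

.. code-block:: python

    import numpy as np

    # made-up toy vectors; only the third dimension differs between
    # the male and female words
    emb = {
        'man':   np.array([0.9, 0.1, 0.0]),
        'woman': np.array([0.9, 0.1, 1.0]),
        'king':  np.array([0.1, 0.9, 0.0]),
        'queen': np.array([0.1, 0.9, 1.0]),
    }
    # the male-female difference vector is the same for both word pairs
    assert np.allclose(emb['woman'] - emb['man'], emb['queen'] - emb['king'])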


The model can be created as follows.

.. code-block:: python

    emb_net = tl.layers.Word2vecEmbeddingInputlayer(
            # ... (preceding arguments elided)
            nce_b_init_args = {},
            name = 'word2vec_layer',
        )
Dataset iteration and loss
^^^^^^^^^^^^^^^^^^^^^^^^^^

Word2vec uses negative sampling together with the Skip-Gram model for training.
The Noise-Contrastive Estimation (NCE) loss reduces the computation of the loss
by sampling instead of evaluating a full softmax over the vocabulary. Skip-Gram
inverts contexts and targets: it tries to predict each context word from its
target word. We use ``tl.nlp.generate_skip_gram_batch`` to generate training
data as follows.

.. code-block:: python

    cost = emb_net.nce_cost
    train_params = emb_net.all_params
    # Adagrad optimizer minimizing the NCE loss
    train_op = tf.train.AdagradOptimizer(learning_rate, initial_accumulator_value=0.1,
        use_locking=False).minimize(cost, var_list=train_params)

    data_index = 0
    while (step < num_steps):
        # generate a batch of (target, context) pairs with a sliding window;
        # NOTE: ``step`` must be advanced inside the loop (omitted here)
        batch_inputs, batch_labels, data_index = tl.nlp.generate_skip_gram_batch(
            data=data, batch_size=batch_size, num_skips=num_skips,
            skip_window=skip_window, data_index=data_index)
        feed_dict = {train_inputs : batch_inputs, train_labels : batch_labels}
        _, loss_val = sess.run([train_op, cost], feed_dict=feed_dict)
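
To see what the generated pairs look like, here is a minimal framework-free
sketch, assuming a toy corpus and a window of one word on each side;
``tl.nlp.generate_skip_gram_batch`` yields the same kind of (target, context)
pairs, but batched as integer IDs and with ``data_index`` carried across calls:

.. code-block:: python

    words = ['the', 'quick', 'brown', 'fox', 'jumps']
    skip_window = 1  # how many words to consider left and right
    pairs = []
    for i, target in enumerate(words):
        # every word inside the window around the target becomes a label
        lo, hi = max(0, i - skip_window), min(len(words), i + skip_window + 1)
        pairs.extend((target, words[j]) for j in range(lo, hi) if j != i)
    print(pairs)
    # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ...]
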
Restore existing Embedding matrix
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
At the end of training the embedding matrix, we save the matrix and the
corresponding dictionaries. Then, next time, we can restore the matrix and
dictionaries as follows
(see ``main_restore_embedding_layer()`` in ``tutorial_generate_text.py``).

.. code-block:: python

    vocabulary_size = 50000
    embedding_size = 128
    model_file_name = "model_word2vec_50k_128"
    batch_size = None

    print("Load existing embedding matrix and dictionaries")
    all_var = tl.files.load_npy_to_any(name=model_file_name+'.npy')
    data = all_var['data']; count = all_var['count']
    dictionary = all_var['dictionary']
    reverse_dictionary = all_var['reverse_dictionary']

    # save the vocabulary to a text file for inspection
    tl.nlp.save_vocab(count, name='vocab_'+model_file_name+'.txt')

    del all_var, data, count

    # load the trained parameters of the embedding network
    load_params = tl.files.load_npz(name=model_file_name+'.npz')

    x = tf.placeholder(tf.int32, shape=[batch_size])
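
Continuing from the snippet above, the restored matrix can also be used
directly as a NumPy array. This is a minimal sketch that assumes the first
array in ``load_params`` is the embedding matrix of shape
``(vocabulary_size, embedding_size)``:

.. code-block:: python

    embeddings = load_params[0]          # assumed: embedding matrix comes first
    word_id = dictionary['the']          # row index of a word
    word_vector = embeddings[word_id]    # its embedding vector
    print(word_vector.shape)             # expected: (128,)
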
Run the PTB example
=========================

