Word Embedding
==============

Hao Dong highly recommends reading Colah's blog `Word Representations`_ to
understand why we want to use a vector representation, and how to compute the
vectors. (Chinese readers can `click here <http://dataunion.org/9331.html>`_.)

Train an embedding matrix
^^^^^^^^^^^^^^^^^^^^^^^^^

Basically, training an embedding matrix is unsupervised learning. As every word
is represented by a unique ID, which is the row index of the embedding matrix,
a word can be converted into a vector that better represents its meaning.
For example, there seems to be a constant male-female difference vector:
``woman - man = queen - king``, i.e. one dimension of the vector space
represents gender.
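
For a quick sanity check of this intuition, here is a toy sketch with made-up
3-d vectors (real embeddings are learned and far higher dimensional); the
third dimension plays the role of the hypothetical "gender" direction:

.. code-block:: python

    import numpy as np

    # made-up toy vectors; only the third dimension differs between
    # the male and female words
    emb = {
        'man':   np.array([0.9, 0.1, 0.0]),
        'woman': np.array([0.9, 0.1, 1.0]),
        'king':  np.array([0.1, 0.9, 0.0]),
        'queen': np.array([0.1, 0.9, 1.0]),
    }
    # the male-female difference vector is the same for both word pairs
    assert np.allclose(emb['woman'] - emb['man'], emb['queen'] - emb['king'])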


The model can be created as follows.

.. code-block:: python

    emb_net = tl.layers.Word2vecEmbeddingInputlayer(
            # ... (preceding arguments elided)
            nce_b_init_args = {},
            name = 'word2vec_layer',
        )
Dataset iteration and loss
^^^^^^^^^^^^^^^^^^^^^^^^^^

Word2vec uses negative sampling together with the Skip-Gram model for training.
The Noise-Contrastive Estimation (NCE) loss reduces the computation of the loss
by sampling instead of evaluating a full softmax over the vocabulary. Skip-Gram
inverts contexts and targets: it tries to predict each context word from its
target word. We use ``tl.nlp.generate_skip_gram_batch`` to generate training
data as follows.

.. code-block:: python

    cost = emb_net.nce_cost
    train_params = emb_net.all_params
    # Adagrad optimizer minimizing the NCE loss
    train_op = tf.train.AdagradOptimizer(learning_rate, initial_accumulator_value=0.1,
        use_locking=False).minimize(cost, var_list=train_params)

    data_index = 0
    while (step < num_steps):
        # generate a batch of (target, context) pairs with a sliding window;
        # NOTE: ``step`` must be advanced inside the loop (omitted here)
        batch_inputs, batch_labels, data_index = tl.nlp.generate_skip_gram_batch(
            data=data, batch_size=batch_size, num_skips=num_skips,
            skip_window=skip_window, data_index=data_index)
        feed_dict = {train_inputs : batch_inputs, train_labels : batch_labels}
        _, loss_val = sess.run([train_op, cost], feed_dict=feed_dict)
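
To see what the generated pairs look like, here is a minimal framework-free
sketch, assuming a toy corpus and a window of one word on each side;
``tl.nlp.generate_skip_gram_batch`` yields the same kind of (target, context)
pairs, but batched as integer IDs and with ``data_index`` carried across calls:

.. code-block:: python

    words = ['the', 'quick', 'brown', 'fox', 'jumps']
    skip_window = 1  # how many words to consider left and right
    pairs = []
    for i, target in enumerate(words):
        # every word inside the window around the target becomes a label
        lo, hi = max(0, i - skip_window), min(len(words), i + skip_window + 1)
        pairs.extend((target, words[j]) for j in range(lo, hi) if j != i)
    print(pairs)
    # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ...]
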
Restore existing Embedding matrix
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
At the end of training the embedding matrix, we save the matrix and the
corresponding dictionaries. Then, next time, we can restore the matrix and
dictionaries as follows
(see ``main_restore_embedding_layer()`` in ``tutorial_generate_text.py``).

.. code-block:: python

    vocabulary_size = 50000
    embedding_size = 128
    model_file_name = "model_word2vec_50k_128"
    batch_size = None

    print("Load existing embedding matrix and dictionaries")
    all_var = tl.files.load_npy_to_any(name=model_file_name+'.npy')
    data = all_var['data']; count = all_var['count']
    dictionary = all_var['dictionary']
    reverse_dictionary = all_var['reverse_dictionary']

    # save the vocabulary to a text file for inspection
    tl.nlp.save_vocab(count, name='vocab_'+model_file_name+'.txt')

    del all_var, data, count

    # load the trained parameters of the embedding network
    load_params = tl.files.load_npz(name=model_file_name+'.npz')

    x = tf.placeholder(tf.int32, shape=[batch_size])
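
Continuing from the snippet above, the restored matrix can also be used
directly as a NumPy array. This is a minimal sketch that assumes the first
array in ``load_params`` is the embedding matrix of shape
``(vocabulary_size, embedding_size)``:

.. code-block:: python

    embeddings = load_params[0]          # assumed: embedding matrix comes first
    word_id = dictionary['the']          # row index of a word
    word_vector = embeddings[word_id]    # its embedding vector
    print(word_vector.shape)             # expected: (128,)
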
Run the PTB example
=========================

