Skip to content


Repository files navigation


[Japanese README (日本語ドキュメント)] (


Category2Vec is an implementation of the category vector models [Marui and Hagiwara 2015], and the paragraph vector models [Le and Mikolov 2014]. These programs are based on word2vec [Mikolov et al. 2013a,b] in gensim project ([Rahurek 2013].

After training, you can obtain distributed representations for categories, paragraphs, and words. You can also infer a category from a description.


There are demo programs of category vector models(Category2Vec) and paragraph vector models(Sentence2Vec). To execute the demo programs,

    make # to build extensions (optional)
    python  # for category vector models
    python # for paragraph vector models



Clone the git repository as

    git clone

or download the zip archive from here:


Category2Vec requires following python modules:

  • numpy
  • scipy
  • six
  • cython

For Ubuntu, you can install them as follows:

    apt-get install python-dev python-pip python-numpy python-scipy
    pip install six cython

To build extensions, it is recommended to get a BLAS library such as:

  • OpenBLAS

For Ubuntu, you can use the provided ATLAS package as follows:

    apt-get install libatlas-base-dev

There is limitation in OpenBLAS package included in Ubuntu, we recommend you to install it from the source.

    apt-get install gfortran
    git clone
    cd OpenBLAS
    #edit Makefile.rule (USE_THREAD=1 is recommended)
    make install PREFIX=your_installation_directory # for example, PREFIX=/usr/local


You don't have to build the extensions to use Category2Vec, but to pursue good performance, we recommend to build the extensions. GCC >= 4.6 is required.

Before the compilation, edit If you use scipy >= 0.15.1, you should use the option use_blas = True, and specify your BLAS library in blas_include and blas_libs. If you have an Intel CPU (>= Sandy Bridge), you can enable the option use_avx = True.

After you editted the settings, type make to build extensions.


This version has been tested under the following environments:

  • Ubuntu 12.04 / 14.04
  • Python 2.7.3 / 2.7.6 / 2.7.8 (anaconda)
  • Numpy 1.8.0 / 1.8.2 / 1.9.0 (anaconda) / 1.10.0
  • Scipy 0.9.0 / 0.13.3 / 0.14.0 (anaconda) / 0.15.0 / 0.16.0 (use_blas=True)
  • gcc 4.6.3 / 4.8.1 / 4.8.3
  • OpenBLAS 0.2.12 / ATLAS 3.10.1-4

And under the following CPUs:

  • Haswell CPUs
    • i7-4770K
    • i7-5820K
  • Ivy Bridge CPUs
    • Xeon E5-26xx v2 series
  • AMD CPUs
    • Opteron 4100 series (use_avx=False)

Mac OS X

In some environment, you'll get NaNs with use_blas = False. You should use BLAS library in this case. You can install OpenBLAS using the instruction above, or you can install it from Homebrew as follows:

    brew install homebrew/science/openblas
    # if you use brewed openblas, you should add the following in the ~/.bash_profile
    # export LDFLAGS="-L/usr/local/opt/openblas/lib"
    # export CPPFLAGS="-I/usr/local/opt/openblas/include"
    # edit and specify the followings
    # use_blas = True
    # blas_libs = ["openblas"]

If you use clang, edit and specify use_clang = True.

    easy_install pip
    pip install numpy scipy six cython # it's important to build cython using gcc

If you want to use gcc, you can edit and specify force_gcc = True and use_clang = False. For Homebrew,

    brew install gcc49
    brew install python
    source ~/.bash_profile # to use brewed python
    easy_install pip
    pip install numpy scipy six cython # it's important to build cython using gcc

If you get errors around AVX (e.g. no such instruction error), replace the native OS X assembler (usr/bin/as) by a script below.

Or you can specify use_avx = False to avoid this error.

This version has been tested under Mac OS X 10.9.5, Python 2.7.9 (Numpy 1.9.2, Scipy 0.15.1) and Apple LLVM 6.0 / gcc 4.8.4 (brew gcc48) / gcc 4.9.2 (brew gcc49) with the above modification on the assembler.


    from cat2vec import Category2Vec
    from sentences import CatSentence
    sentences = CatSentence("myfile.txt")
    # CV-DBoW model with hierarchical softmax, 10 iterations, dimension 300
    model = Category2Vec(sentences, model = "dbow", hs = 1, size = 300, iteration = 10)
    # Save a model"myfile.model")
    # Load the model
    model = Category2Vec.load("myfile.model")

Terms and Conditions

Distribution, modification, and academic/commercial use of Category2Vec is permitted, provided that you conform with GNU Lesser General Public License v3.0

If you are using Category2Vec for research purposes, please cite our paper on Category2Vec [Marui and Hagiwara 2015]

Although we are accepting contributions (e.g., issues and pull requests) to this category2vec projects, our contribution policy has not been finalized yet. We will be announcing our official policy in near future, after which we can fully accept contributions.

FAQ (Frequently Asked Questions)

Q. Is commercial use permitted?

  • A. Yes, as long as you follow the terms and conditions. See "Terms and Conditions" above for the details.


The developers would like to thank Masato Hagiwara and Kaoru Yamada for their contribution to this project.


[Marui and Hagiwara 2015] Junki Marui, and Masato Hagiwara. Category2Vec: 単語・段落・カテゴリに対するベクトル分散表現. 言語処理学会第21回年次大会(NLP2015).

[Le and Mikolov 2014] Quoc Le, and Tomas Mikolov. Distributed Representations of Sentence and Documents. In Proceedings of ICML 2014.

[Mikolov et al. 2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

[Mikolov et al. 2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.

[Rahurek 2013] Radim Rehurek, Optimizing word2vec in gensim,

© 2015 Rakuten NLP Project. All Rights Reserved. / Sponsored by Rakuten, Inc. and Rakuten Institute of Technology.


No description, website, or topics provided.







No releases published


No packages published