# Tutorial: using Gensim's API to download corpora and models
Let's start by importing the api module.

In [1]:
import gensim.downloader as api

2017-09-22 11:51:25,498 :gensim.api :INFO : Creating /home/ivan/gensim-data


Now, let's download the text8 corpus and load it.

In [2]:
corpus = api.load('text8')

2017-09-22 11:51:28,385 :gensim.api :INFO : Creating /home/ivan/gensim-data/text8
2017-09-22 11:51:28,387 :gensim.api :INFO : Creation of /home/ivan/gensim-data/text8 successful.
2017-09-22 11:51:28,389 :gensim.api :INFO : Downloading text8
2017-09-22 11:51:39,842 :gensim.api :INFO : text8 downloaded
2017-09-22 11:51:39,843 :gensim.api :INFO : Extracting files from /home/ivan/gensim-data/text8
2017-09-22 11:51:41,057 :gensim.api :INFO : text8 installed
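The log above suggests that each dataset is stored in its own directory under `gensim-data` in the user's home directory. The following is a minimal sketch of checking whether a dataset has already been downloaded, assuming that layout holds. The helper names `expected_data_dir` and `is_cached` are hypothetical and not part of Gensim.

```python
import os


def expected_data_dir(name, base_dir=None):
    """Return the directory where a dataset named `name` would be cached.

    Assumption: the layout is <home>/gensim-data/<name>, as the log output
    suggests. The library itself may use a different location.
    """
    if base_dir is None:
        base_dir = os.path.join(os.path.expanduser("~"), "gensim-data")
    return os.path.join(base_dir, name)


def is_cached(name, base_dir=None):
    """Return True if the expected cache directory already exists on disk."""
    return os.path.isdir(expected_data_dir(name, base_dir))
```

Calling `is_cached('text8')` before `api.load('text8')` would then indicate whether a download is likely to be triggered, as long as the assumed layout is correct.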


In [3]:
corpus

<gensim.models.word2vec.Text8Corpus at 0x7f1fdac0ddd0>
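The returned object is an iterable of sentences, where each sentence is a list of word tokens. Because text8 is one long stream of words without sentence boundaries, the corpus is cut into fixed-length chunks. The sketch below illustrates that idea in plain Python. It is a simplified stand-in rather than Gensim's implementation, and the function name `chunk_tokens` and the toy input are assumptions made for illustration.

```python
from itertools import islice


def chunk_tokens(tokens, max_sentence_length=10000):
    """Yield lists of at most `max_sentence_length` tokens from an iterable."""
    iterator = iter(tokens)
    while True:
        sentence = list(islice(iterator, max_sentence_length))
        if not sentence:
            return  # the token stream is exhausted
        yield sentence


# Toy input: seven tokens cut into chunks of at most three tokens each.
toy_tokens = "anarchism originated as a term of abuse".split()
toy_sentences = list(chunk_tokens(toy_tokens, max_sentence_length=3))
```

With a chunk length of 10000, a stream of 17005207 words splits into 1701 chunks, which matches the word and sentence counts reported in the training log further down.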

Now that the corpus is installed, let's train a word2vec model on it.

In [3]:
from gensim.models.word2vec import Word2Vec
model = Word2Vec(corpus)

2017-09-19 17:46:39,030 :gensim.models.word2vec :INFO : collecting all words and their counts
2017-09-19 17:46:39,037 :gensim.models.word2vec :INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-09-19 17:46:46,104 :gensim.models.word2vec :INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2017-09-19 17:46:46,105 :gensim.models.word2vec :INFO : Loading a fresh vocabulary
2017-09-19 17:46:46,393 :gensim.models.word2vec :INFO : min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2017-09-19 17:46:46,394 :gensim.models.word2vec :INFO : min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2017-09-19 17:46:46,607 :gensim.models.word2vec :INFO : deleting the raw counts dictionary of 253854 items
2017-09-19 17:46:46,618 :gensim.models.word2vec :INFO : sample=0.001 downsamples 38 most-common words
2017-09-19 17:46:46,620 :gensim.models.word2vec :INFO : downsampling leaves estima

2017-09-19 17:47:43,592 :gensim.models.word2vec :INFO : PROGRESS: at 53.52% examples, 602970 words/s, in_qsize 5, out_qsize 0
2017-09-19 17:47:44,647 :gensim.models.word2vec :INFO : PROGRESS: at 54.27% examples, 600118 words/s, in_qsize 6, out_qsize 1
2017-09-19 17:47:45,650 :gensim.models.word2vec :INFO : PROGRESS: at 55.16% examples, 599088 words/s, in_qsize 4, out_qsize 1
2017-09-19 17:47:46,652 :gensim.models.word2vec :INFO : PROGRESS: at 56.32% examples, 601102 words/s, in_qsize 4, out_qsize 0
2017-09-19 17:47:47,668 :gensim.models.word2vec :INFO : PROGRESS: at 57.50% examples, 603159 words/s, in_qsize 5, out_qsize 0
2017-09-19 17:47:48,671 :gensim.models.word2vec :INFO : PROGRESS: at 58.66% examples, 605176 words/s, in_qsize 5, out_qsize 1
2017-09-19 17:47:49,677 :gensim.models.word2vec :INFO : PROGRESS: at 59.84% examples, 607140 words/s, in_qsize 5, out_qsize 0
2017-09-19 17:47:50,686 :gensim.models.word2vec :INFO : PROGRESS: at 61.02% examples, 609189 words/s, in_qsize 5, out_
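The log reports that `min_count=5` retains 71290 of the 253854 word types. The idea behind that step is to count every word and discard the ones that occur fewer than `min_count` times. Below is a minimal sketch of such pruning on toy data. The function name `prune_vocabulary` and the toy sentences are assumptions for illustration, and Gensim's own vocabulary building does more than this (for example, downsampling of frequent words).

```python
from collections import Counter


def prune_vocabulary(sentences, min_count=5):
    """Count words and keep only those seen at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    kept = {word: n for word, n in counts.items() if n >= min_count}
    return counts, kept


# Toy corpus: "tree" occurs 4 times, "leaf" twice, "bark" once.
toy_sentences = [["tree", "leaf", "tree"], ["tree", "bark"], ["leaf", "tree"]]
counts, kept = prune_vocabulary(toy_sentences, min_count=2)

# Share of word types that survive pruning.
retained_share = len(kept) / len(counts)
```

Rare words have too few training examples to produce useful vectors, which is why dropping 72% of the word types still leaves 98% of the corpus in the log above.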

Now that we have our word2vec model, let's find the words most similar to 'tree'.

In [4]:
model.wv.most_similar('tree')  # the word vectors live on model.wv; model.most_similar is deprecated in later Gensim releases

2017-09-19 17:48:45,837 :gensim.models.keyedvectors :INFO : precomputing L2-norms of word weight vectors


[('leaf', 0.7284336090087891),
 ('trees', 0.7024068236351013),
 ('bark', 0.6984879970550537),
 ('fruit', 0.623538613319397),
 ('flower', 0.6177238821983337),
 ('nest', 0.6133654713630676),
 ('garden', 0.5962027311325073),
 ('avl', 0.5909914374351501),
 ('cave', 0.5902420282363892),
 ('pond', 0.5827507972717285)]
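The scores in this list are cosine similarities between the vector for 'tree' and the vectors of other words. The sketch below shows that ranking in plain Python. The two-dimensional vectors are made up for illustration, and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical rather than part of Gensim.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, vectors, topn=10):
    """Return the `topn` (word, score) pairs most similar to `query`."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query  # the query word itself is excluded
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Made-up toy vectors; real word2vec vectors have many more dimensions.
toy_vectors = {
    "tree": [1.0, 0.0],
    "leaf": [0.9, 0.1],
    "pond": [0.0, 1.0],
}
ranking = rank_by_similarity("tree", toy_vectors, topn=2)
```

Here 'leaf' ranks above 'pond' because its vector points in almost the same direction as the vector for 'tree'.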

The API can download many corpora and models. To list everything it provides, use the code below:

In [7]:
import json
dataset_list = api.info()
print(json.dumps(dataset_list, indent=4))

{
    "gensim": {
        "model": {
            "Google_News_word2vec": {
                "desc": "Google has published pre-trained vectors trained on part of Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases.",
                "filename": "GoogleNews-vectors-negative300.bin.gz",
                "checksum": "4fa963d128fe65ec8cd5dd4d9377f8ed"
            },
            "fasttext_eng_model": {
                "desc": "fastText is a library for efficient learning of word representations and sentence classification.These vectors for english language in dimension 300 were obtained using the skip-gram model described in Bojanowski et al. (2016) with default parameters.",
                "filename": "wiki.en.vec",
                "checksum": "2de532213d7fa8b937263337c6e9deeb"
            },
            "glove_common_crawl_42B": {
                "desc": "This model is trained on Common Crawl (42B tokens, 1.9M vocab, unca

To get detailed information about a specific model or corpus, use:

In [3]:
api.info('glove_common_crawl_42B')

2017-09-19 21:56:42,071 :gensim.api :INFO : This model is trained on Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors). GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. 
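Since the full listing is an ordinary nested dictionary, it can also be inspected directly. The following is a minimal sketch, assuming the `gensim` → `model` → name nesting shown in the JSON output above. The `sample_info` dictionary is a hand-copied excerpt of that output, the helper names are hypothetical, and other Gensim versions may structure the listing differently.

```python
# Hand-copied excerpt of the listing printed above (fields trimmed for brevity).
sample_info = {
    "gensim": {
        "model": {
            "Google_News_word2vec": {
                "filename": "GoogleNews-vectors-negative300.bin.gz",
            },
            "fasttext_eng_model": {
                "filename": "wiki.en.vec",
            },
        }
    }
}


def list_names(info, kind="model"):
    """Return the sorted entry names under info['gensim'][kind].

    Only the "model" key appears in the output above; any other value of
    `kind` is an assumption and yields an empty list if the key is absent.
    """
    return sorted(info.get("gensim", {}).get(kind, {}))


def get_field(info, name, field, kind="model"):
    """Look up a single field, such as 'filename', for one entry."""
    return info["gensim"][kind][name][field]


model_names = list_names(sample_info)
fasttext_file = get_field(sample_info, "fasttext_eng_model", "filename")
```

The same helpers should work on the dictionary returned by `api.info()` as long as it has the nesting assumed here.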



Sometimes you do not want to load the corpus or model into memory and only need the path to it. In that case, pass `return_path=True`:

In [9]:
text8_path = api.load('text8', return_path=True)
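With a path in hand, the file can be processed as a stream instead of being held in memory. The sketch below counts tokens by reading fixed-size chunks. It assumes the path points at a plain-text file of whitespace-separated tokens, which may not hold for the actual download (it could be compressed or packaged differently). The function name `count_tokens` is hypothetical, and the temporary file only stands in for a real corpus.

```python
import os
import tempfile


def count_tokens(path, chunk_size=1 << 20):
    """Count whitespace-separated tokens while reading `chunk_size` characters at a time."""
    total = 0
    leftover = ""  # partial token carried over from the previous chunk
    with open(path, "r", encoding="utf-8") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), ""):
            pieces = (leftover + chunk).split()
            if pieces and not chunk[-1].isspace():
                # The chunk ended mid-token: hold the last piece back.
                leftover = pieces.pop()
            else:
                leftover = ""
            total += len(pieces)
    # A token left over at end of file is complete, so count it.
    return total + (1 if leftover else 0)


# Toy file standing in for a real corpus; the tiny chunk size forces splits mid-token.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("the quick brown fox jumps")
    toy_path = tmp.name

toy_count = count_tokens(toy_path, chunk_size=4)
os.remove(toy_path)
```

Memory use is bounded by the chunk size plus the longest token, regardless of how large the file is.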