# NLP & representation learning: Neural Embeddings, Text Classification


To use statistical classifiers on text, the text must first be vectorized. In the first practical session we explored the **Bag of Words (BoW)** model.

Modern **state-of-the-art** methods use embeddings to vectorize the text before classification, in order to avoid feature engineering.

## [Dataset](https://thome.isir.upmc.fr/classes/RITAL/json_pol.json)


## "Modern" NLP pipeline

In contrast to the **bag-of-words** model, in the modern NLP pipeline everything is an **embedding**. Instead of encoding a text as a **sparse vector** of length $D$ (the size of the feature dictionary), the goal is to encode it as a meaningful dense vector of small size $|e| \ll D$.
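
To make the contrast concrete, here is a minimal sketch. The toy vocabulary, the sentence and the randomly initialised embedding table are illustrative assumptions, not part of the dataset. The bag-of-words vector has one dimension per dictionary entry, whereas the embedding-based vector keeps a small fixed size whatever the dictionary size.

```python
import numpy as np

# Toy vocabulary (hypothetical); a real feature dictionary has tens of thousands of entries
vocab = ["a", "bad", "film", "fine", "great", "this", "was"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
D = len(vocab)

tokens = "this was a fine film".split()

# Sparse bag-of-words vector: one count per dictionary entry, mostly zeros on real data
bow = np.zeros(D)
for tok in tokens:
    bow[word_to_idx[tok]] += 1

# Dense representation: look up each word in an embedding table and average the rows.
# The table is random here; in practice it is learnt (e.g. with word2vec).
emb_dim = 3
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(D, emb_dim))
dense = embedding_table[[word_to_idx[t] for t in tokens]].mean(axis=0)

print("BoW vector size:  ", bow.shape)
print("dense vector size:", dense.shape)
```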


The raw classification pipeline is then the following:

```
raw text ---|embedding table|-->  vectors --|Neural Net|--> class 
```


### Using a language model:

How to tokenize the text and extract a feature dictionary is still a manual task. To obtain meaningful embeddings directly, it is common to use a pre-trained language model such as `word2vec`, which we explore in this practical.

In this setting, the pipeline becomes the following:
```
      
raw text ---|(pre-trained) Language Model|--> vectors --|classifier (or fine-tuning)|--> class 
```


- #### Classic word embeddings

 - [Word2Vec](https://arxiv.org/abs/1301.3781)
 - [GloVe](https://nlp.stanford.edu/projects/glove/)


- #### Bleeding-edge language model techniques (see next)

 - [ULMFiT](https://arxiv.org/abs/1801.06146)
 - [ELMo](https://arxiv.org/abs/1802.05365)
 - [GPT](https://blog.openai.com/language-unsupervised/)
 - [BERT](https://arxiv.org/abs/1810.04805)






### Goal of this session:

1. Train word embeddings on training dataset
2. Tinker with the learnt embeddings and see learnt relations
3. Tinker with pre-trained embeddings.
4. Use those embeddings for classification
5. Compare different embedding models

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd
import warnings
import numpy as np

In [135]:
warnings.filterwarnings("ignore")

## STEP 0: Loading data 

In [2]:
import json
from collections import Counter

# Loading json
file = './datasets/json_pol.json'
with open(file,encoding="utf-8") as f:
    data = json.load(f)
    

# Quick Check
counter = Counter((x[1] for x in data))
print("Number of reviews : ", len(data))
print("----> # of positive : ", counter[1])
print("----> # of negative : ", counter[0])
print("")
print(data[0])

Number of reviews :  25000
----> # of positive :  12500
----> # of negative :  12500

['Although credit should have been given to Dr. Seuess for stealing the story-line of "Horton Hatches The Egg", this was a fine film. It touched both the emotions and the intellect. Due especially to the incredible performance of seven year old Justin Henry and a script that was sympathetic to each character (and each one\'s predicament), the thought provoking elements linger long after the tear jerking ones are over. Overall, superior acting from a solid cast, excellent directing, and a very powerful script. The right touches of humor throughout help keep a "heavy" subject from becoming tedious or difficult to sit through. Lastly, this film stands the test of time and seems in no way dated, decades after it was released.', 1]


## Word2Vec: Quick Recap

**[Word2Vec](https://arxiv.org/abs/1301.3781) is composed of two distinct language models (CBOW and SG), optimized to quickly learn word vectors**


Given an example sentence: `i'm taking the dog out for a walk`



### (a) Continuous Bag of Words (CBOW)
- predicts a word given its context

maximizing `p(dog | i'm taking the ___ out for a walk)`

### (b) Skip-Gram (SG)
- predicts the context given a word

maximizing `p(i'm taking the out for a walk | dog)`
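
The two objectives differ only in how training examples are built from a sliding window. The sketch below illustrates this on the example sentence. It is a simplified illustration: the helper names are hypothetical, and real implementations also subsample frequent words and vary the effective window size.

```python
def cbow_examples(tokens, window=2):
    """Yield (context_words, target_word) pairs, as used by CBOW."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target


def skipgram_examples(tokens, window=2):
    """Yield (center_word, context_word) pairs, as used by Skip-Gram."""
    for context, center in cbow_examples(tokens, window):
        for ctx_word in context:
            yield center, ctx_word


tokens = "i'm taking the dog out for a walk".split()
cbow = list(cbow_examples(tokens))
sg = list(skipgram_examples(tokens))

print(cbow[3])  # (['taking', 'the', 'out', 'for'], 'dog')
print(sg[:2])   # first two (center, context) pairs
```

CBOW produces one example per position (the whole context predicts the word), while Skip-Gram produces one example per (word, context word) pair, which is one reason it is slower to train on the same corpus.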



   

## STEP 1: train a language model (word2vec)

Gensim provides one of the fastest [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) implementations.


### Train:

In [2]:
# if gensim not installed yet
# ! pip install gensim

In [3]:
import gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

text = [t.split() for t,p in data]

# the following configuration is the default configuration
# sg=0 trains a CBOW model; set sg=1 to train a Skip-Gram model
w2v = gensim.models.word2vec.Word2Vec(sentences=text,
                                vector_size=100, window=5,
                                min_count=5,
                                sample=0.001, workers=3,
                                sg=0, hs=0, negative=5,
                                cbow_mean=1, epochs=5)

2024-02-29 11:42:28,338 : INFO : collecting all words and their counts
2024-02-29 11:42:28,338 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-02-29 11:42:29,082 : INFO : PROGRESS: at sentence #10000, processed 2301366 words, keeping 153853 word types
2024-02-29 11:42:29,731 : INFO : PROGRESS: at sentence #20000, processed 4553558 words, keeping 240043 word types
2024-02-29 11:42:30,020 : INFO : collected 276678 word types from a corpus of 5713167 raw words and 25000 sentences
2024-02-29 11:42:30,021 : INFO : Creating a fresh vocabulary
2024-02-29 11:42:30,263 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 48208 unique words (17.42% of original 276678, drops 228470)', 'datetime': '2024-02-29T11:42:30.263575', 'gensim': '4.3.0', 'python': '3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22631-SP0', 'event': 'prepare_vocab'}
2024-02-29 11:42:30,264 : IN

In [4]:
# It is worth saving the trained embeddings
w2v.save("W2v-movies.dat")
# You will then be able to reload them:
#w2v = gensim.models.Word2Vec.load("W2v-movies.dat")
# and continue the training process if needed

2024-02-25 17:32:17,942 : INFO : Word2Vec lifecycle event {'fname_or_handle': 'W2v-movies.dat', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2024-02-25T17:32:17.942126', 'gensim': '4.3.0', 'python': '3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22631-SP0', 'event': 'saving'}
2024-02-25 17:32:17,943 : INFO : not storing attribute cum_table
2024-02-25 17:32:18,031 : INFO : saved W2v-movies.dat


## STEP 2: Test learnt embeddings

The word embedding space directly encodes similarities between words: the vector for "great" should be closer to the vector for "good" than to the one for "bad". [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) is the measure generally used to compare vectors: the higher it is, the closer the two words.

KeyedVectors have a built-in [similarity](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.BaseKeyedVectors.similarity) method to compute the cosine similarity between words.
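
For reference, the cosine similarity of two vectors $u$ and $v$ is $\frac{u \cdot v}{\|u\| \, \|v\|}$. The helper below is a minimal NumPy sketch of that formula on hand-made vectors (the function name and the values are illustrative assumptions, not gensim's implementation).

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product divided by the product of norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


a = [1.0, 2.0, 0.0]
same_direction = cosine_similarity(a, [2.0, 4.0, 0.0])   # a scaled copy of a
orthogonal = cosine_similarity(a, [-2.0, 1.0, 0.0])      # perpendicular to a

print(same_direction, orthogonal)
```

The measure ignores vector length: a vector and any positive multiple of it have similarity 1 (up to rounding), while perpendicular vectors have similarity 0.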

In [5]:
# cosine similarity
# is great really closer to good than to bad ?
print("great and good:",w2v.wv.similarity("great","good"))
print("great and bad:",w2v.wv.similarity("great","bad"))

great and good: 0.77909327
great and bad: 0.52971756


Since cosine similarity measures how related two words are, neighboring words are expected to be similar. The [most_similar](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.BaseKeyedVectors.most_similar) method returns the `topn` words closest to a query.

In [6]:
# The query can be as simple as a word, such as "movie"

# Try changing the word
#w2v.wv.most_similar("movie",topn=5) # 5 most similar words
#w2v.wv.most_similar("awesome",topn=5)
print(w2v.wv.most_similar("actor",topn=5))
print(w2v.wv.most_similar("run",topn=5))
print(w2v.wv.most_similar("send",topn=5))

[('walk', 0.8287126421928406), ('fly', 0.7930590510368347), ('hang', 0.7786651253700256), ('drive', 0.7679545283317566), ('jump', 0.7633776664733887)]
[('sell', 0.8768139481544495), ('push', 0.8391414880752563), ('blow', 0.824489414691925), ('join', 0.8211827874183655), ('drive', 0.8069530725479126)]


But the query can be more complicated: word embedding spaces tend to encode much more than pairwise similarity.

The most famous example is: `vec(king) - vec(man) + vec(woman) => vec(queen)`
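
The arithmetic behind such a query can be sketched by hand. The example below uses hand-made 2-d vectors (purely hypothetical values chosen so that one axis loosely stands for "royalty" and the other for "gender"). It builds the query vector, then returns the closest remaining word by cosine similarity. gensim's `most_similar` follows the same idea but differs in details such as vector normalisation.

```python
import numpy as np

# Hand-made toy vectors (hypothetical): axis 0 ~ "royalty", axis 1 ~ "gender"
toy = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the query words."""
    query = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    candidates = [w for w in vectors if w not in positive + negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


# vec(king) - vec(man) + vec(woman)
print(analogy(["king", "woman"], ["man"], toy))  # queen
```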

In [7]:
def get_analogy(mot1, mot2, mot3, modele, top=3):
    analogies = [(w, round(v, 2)) for w, v in  modele.most_similar(positive=[mot1,mot2], negative=[mot3], topn=top) ]    
    return analogies   

In [11]:
print('1', *get_analogy(*["awesome","bad"], "good", w2v.wv))
print('2', *get_analogy(*["actor","woman"], "man", w2v.wv))
print('3', *get_analogy(*["actor","men"], "man", w2v.wv))
print('4', *get_analogy(*["man","actress"], "actor", w2v.wv))
print('5', *get_analogy(*["bad","best"], "good", w2v.wv))
print('6', *get_analogy(*["funny","better"], "good", w2v.wv))
print('7', *get_analogy(*["run","coming"], "come", w2v.wv))
print('8', *get_analogy(*["send","coming"], "come", w2v.wv))
print('9', *get_analogy(*["absolute","totally"], "total", w2v.wv))

1 ('awful', 0.76) ('unbelievably', 0.69) ('amateur', 0.65)
2 ('actress', 0.9) ('role', 0.75) ('role,', 0.73)
3 ('roles', 0.8) ('actresses', 0.79) ('actors', 0.77)
4 ('woman', 0.9) ('lady', 0.84) ('girl', 0.81)
5 ('worst', 0.81) ('funniest', 0.7) ('greatest', 0.69)
6 ('funnier', 0.7) ('worse', 0.7) ('more', 0.66)
7 ('walking', 0.75) ('running', 0.74) ('window', 0.74)
8 ('sending', 0.71) ('leaving', 0.67) ('forcing', 0.66)
9 ('incredibly', 0.61) ('amazingly', 0.6) ('utterly', 0.6)


**To test learnt "syntactic" and "semantic" similarities, Mikolov et al. introduced a dedicated dataset containing a wide variety of three-way similarities (analogy questions).**

**You can download the dataset [here](https://thome.isir.upmc.fr/classes/RITAL/questions-words.txt).**

In [8]:
out = w2v.wv.evaluate_word_analogies("datasets/questions-words.txt",case_insensitive=True)  #original semantic syntactic dataset.

2024-02-25 17:32:18,216 : INFO : Evaluating word analogies for top 300000 words in the model on datasets/questions-words.txt
2024-02-25 17:32:18,605 : INFO : capital-common-countries: 3.3% (3/90)
2024-02-25 17:32:18,903 : INFO : capital-world: 0.0% (0/71)
2024-02-25 17:32:19,031 : INFO : currency: 0.0% (0/28)
2024-02-25 17:32:20,209 : INFO : city-in-state: 0.0% (0/329)
2024-02-25 17:32:21,379 : INFO : family: 34.2% (117/342)
2024-02-25 17:32:25,494 : INFO : gram1-adjective-to-adverb: 1.4% (13/930)
2024-02-25 17:32:28,367 : INFO : gram2-opposite: 3.3% (18/552)
2024-02-25 17:32:33,195 : INFO : gram3-comparative: 24.8% (312/1260)
2024-02-25 17:32:35,642 : INFO : gram4-superlative: 6.3% (44/702)
2024-02-25 17:32:38,532 : INFO : gram5-present-participle: 15.5% (117/756)
2024-02-25 17:32:41,338 : INFO : gram6-nationality-adjective: 1.9% (15/792)
2024-02-25 17:32:45,943 : INFO : gram7-past-tense: 15.2% (191/1260)
2024-02-25 17:32:48,991 : INFO : gram8-plural: 4.1% (33/812)
2024-02-25 17:32:51

**Since the w2v model was trained on the review dataset only, which is a small corpus for this purpose (about 5.7M words), it does not perform very well on this benchmark.**
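
`evaluate_word_analogies` returns a pair `(score, sections)`: the overall accuracy, and a list of dictionaries with the keys `'section'`, `'correct'` and `'incorrect'`. The helper below is a small sketch (its name is hypothetical) that turns that list into per-section accuracies. It is shown on a hand-made stand-in so that it runs without the model.

```python
def section_accuracies(sections):
    """Map each section name to its accuracy (None if no question was evaluated)."""
    accuracies = {}
    for sec in sections:
        n_correct = len(sec["correct"])
        n_total = n_correct + len(sec["incorrect"])
        accuracies[sec["section"]] = n_correct / n_total if n_total else None
    return accuracies


# Hand-made stand-in with the same layout as the `sections` list returned by gensim
fake_sections = [
    {"section": "family",
     "correct": [("BOY", "GIRL", "KING", "QUEEN")],
     "incorrect": [("BOY", "GIRL", "UNCLE", "AUNT")]},
    {"section": "currency", "correct": [], "incorrect": []},
]
print(section_accuracies(fake_sections))

# With the result computed above, one would write:
# score, sections = out
# section_accuracies(sections)
```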


## STEP 3: Loading a pre-trained model

In Gensim, embeddings are loaded and can be used via the ["KeyedVectors"](https://radimrehurek.com/gensim/models/keyedvectors.html) class

> Since trained word vectors are independent from the way they were trained (Word2Vec, FastText, WordRank, VarEmbed etc), they can be represented by a standalone structure, as implemented in this module.

>The structure is called “KeyedVectors” and is essentially a mapping between entities and vectors. Each entity is identified by its string id, so this is a mapping between {str => 1D numpy array}.

>The entity typically corresponds to a word (so the mapping maps words to 1D vectors), but for some models, they key can also correspond to a document, a graph node etc. To generalize over different use-cases, this module calls the keys entities. Each entity is always represented by its string id, no matter whether the entity is a word, a document or a graph node.

**You can download the pre-trained word embedding [HERE](https://thome.isir.upmc.fr/classes/RITAL/word2vec-google-news-300.dat) .**

In [5]:
import gensim.downloader as api
from gensim.models import KeyedVectors

bload = True  # True: load the local copy; False: download it with gensim, then save it
fname = "word2vec-google-news-300"
sdir = "datasets/" # Change

if bload:
    wv_pre_trained = KeyedVectors.load(sdir+fname+".dat")
else:
    wv_pre_trained = api.load(fname)
    wv_pre_trained.save(sdir+fname+".dat")
    

**Perform the "syntactic" and "semantic" evaluations again. What can you conclude about the pre-trained embeddings?**

In [13]:
print("great and good:",wv_pre_trained.similarity("great","good"))
print("great and bad:",wv_pre_trained.similarity("great","bad"))

great and good: 0.7291509
great and bad: 0.39287654


In [20]:
print(wv_pre_trained.most_similar("actor",topn=5))

[('actress', 0.7930010557174683), ('Actor', 0.7446157932281494), ('thesp', 0.6954971551895142), ('thespian', 0.6651668548583984), ('actors', 0.6519852876663208)]


In [21]:
print(wv_pre_trained.most_similar("run",topn=3))
print(wv_pre_trained.most_similar("send",topn=3))

[('runs', 0.656993567943573), ('running', 0.6062965393066406), ('drive', 0.4834050238132477)]
[('sending', 0.7407121658325195), ('sent', 0.7368507981300354), ('sends', 0.6713995337486267)]


In [8]:
print('1', *get_analogy(*["awesome","bad"], "good", wv_pre_trained))
print('2', *get_analogy(*["actor","woman"], "man", wv_pre_trained))
print('3', *get_analogy(*["actor","men"], "man", wv_pre_trained))
print('4', *get_analogy(*["man","actress"], "actor", wv_pre_trained))
print('5', *get_analogy(*["bad","best"], "good", wv_pre_trained))
print('6', *get_analogy(*["funny","better"], "good", wv_pre_trained))
print('7', *get_analogy(*["run","coming"], "come", wv_pre_trained))
print('8', *get_analogy(*["send","coming"], "come", wv_pre_trained))
print('9', *get_analogy(*["absolute","totally"], "total", wv_pre_trained))

1 ('horrible', 0.6) ('amazing', 0.59) ('weird', 0.58)
2 ('actress', 0.86) ('actresses', 0.66) ('thesp', 0.63)
3 ('actors', 0.61) ('actresses', 0.6) ('actress', 0.54)
4 ('woman', 0.84) ('girl', 0.69) ('teenage_girl', 0.68)
5 ('worst', 0.68) ('dumbest', 0.53) ('lousiest', 0.52)
6 ('funnier', 0.75) ('stupider', 0.6) ('sillier', 0.57)
7 ('running', 0.61) ('runs', 0.5) ('Mark_Grudzielanek_singled', 0.44)
8 ('sending', 0.76) ('sent', 0.63) ('Sending', 0.59)
9 ('completely', 0.6) ('absolutely', 0.58) ('utterly', 0.55)


## STEP 4:  sentiment classification

In the previous practical session, we used a bag-of-words approach to transform text into vectors.
Here, we propose to use word vectors instead (previously learnt or loaded).


### <font color='green'> Since we only have word vectors and sentences are made of several words, we need to aggregate them. </font>


### (1) Vectorize reviews using word vectors:

Word aggregation can be done in different ways:

- Sum
- Average
- Min/feature
- Max/feature

#### a few pointers:

- `w2v.wv.key_to_index` is a `dict` mapping every word of the vocabulary (all existing words in your model) to its index; in gensim 4.x it replaces `w2v.wv.vocab`
- `np.minimum(a,b)` and `np.maximum(a,b)` respectively return the element-wise min and max
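
As an illustration of these aggregation options, the sketch below applies them to three hypothetical 4-d word vectors (one row per word of a review). Each aggregation returns a single vector with the same dimension as the word vectors.

```python
import numpy as np

# Three hypothetical 4-d word vectors, one row per word of a review
word_vectors = np.array([
    [1.0, -2.0, 0.5, 0.0],
    [3.0,  0.0, 0.5, -1.0],
    [-1.0, 2.0, 0.5, 4.0],
])

# axis=0 aggregates over the words (rows), feature by feature
aggregated = {
    "sum":  np.sum(word_vectors, axis=0),
    "mean": np.mean(word_vectors, axis=0),
    "min":  np.min(word_vectors, axis=0),
    "max":  np.max(word_vectors, axis=0),
}

# np.minimum works on two arrays at a time, so it can be folded over the rows
running_min = word_vectors[0]
for row in word_vectors[1:]:
    running_min = np.minimum(running_min, row)

for name, vec in aggregated.items():
    print(name, vec)
```

Folding `np.minimum` over the rows gives the same result as `np.min(..., axis=0)`; the pairwise form is handy when vectors are processed one word at a time.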

In [124]:
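# We first need to vectorize text:
# each review is reduced to a single vector by aggregating its word vectors


def vectorize(text, wv_model, f_aggregation, mean=False):
    """
    Vectorize one review by aggregating the vectors of its in-vocabulary words.

    input: str
    output: np.array(float)
    """
    # `mean` is unused; it is kept so that existing calls keep working
    text_vectorized = []
    for word in text.split():
        # skip out-of-vocabulary words
        if word in wv_model.key_to_index:
            text_vectorized.append(wv_model[word])

    # no known word: fall back to a zero vector of the embedding size
    if not text_vectorized:
        return np.zeros(wv_model.vector_size)

    # apply the aggregation column-wise (each row is one word vector)
    return f_aggregation(np.array(text_vectorized), axis=0)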
  


In [116]:
def vectorize_all_methods(data, wv_model):
    f_aggregation = [np.sum, np.mean, np.max, np.min]
    train = data[:(len(data)//10)*8]  # keep the first 80% of the dataset for training
    test = data[len(train):]          # the remaining 20% is the test set

    resultats_aggregation_train = []
    resultats_aggregation_test = []
    y_test = [pol for text,pol in test]
    y_train = [pol for text,pol in train]

    # one vectorized train/test set per aggregation function, in the order of f_aggregation
    for f in f_aggregation:
        X = [vectorize(text, wv_model, f) for text,pol in train]
        X_test = [vectorize(text, wv_model, f) for text,pol in test]
        resultats_aggregation_train.append(X)
        resultats_aggregation_test.append(X_test)

    return resultats_aggregation_train, resultats_aggregation_test, y_train, y_test

In [117]:
text = [t.split() for t,p in data]

Vectorization with CBOW

In [114]:
# the following configuration is the default configuration
# sg=0 trains a CBOW model
w2v_cbow = gensim.models.word2vec.Word2Vec(sentences=text,
                                vector_size=100, window=5,
                                min_count=5,
                                sample=0.001, workers=3,
                                sg=0, hs=0, negative=5,
                                cbow_mean=1, epochs=5)

2024-02-25 19:37:45,593 : INFO : collecting all words and their counts
2024-02-25 19:37:45,596 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-02-25 19:37:46,597 : INFO : PROGRESS: at sentence #10000, processed 2301366 words, keeping 153853 word types
2024-02-25 19:37:47,554 : INFO : PROGRESS: at sentence #20000, processed 4553558 words, keeping 240043 word types
2024-02-25 19:37:48,057 : INFO : collected 276678 word types from a corpus of 5713167 raw words and 25000 sentences
2024-02-25 19:37:48,058 : INFO : Creating a fresh vocabulary
2024-02-25 19:37:48,448 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 48208 unique words (17.42% of original 276678, drops 228470)', 'datetime': '2024-02-25T19:37:48.448641', 'gensim': '4.3.0', 'python': '3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22631-SP0', 'event': 'prepare_vocab'}
2024-02-25 19:37:48,448 : IN

In [118]:
resultats_aggregation_train_cbow, resultats_aggregation_test_cbow, y_train, y_test = vectorize_all_methods(data, w2v_cbow.wv)

Vectorization with Skip-Gram

In [120]:
# same configuration as above, except sg=1, which trains a Skip-Gram model
w2v_sg = gensim.models.word2vec.Word2Vec(sentences=text,
                                vector_size=100, window=5,
                                min_count=5,
                                sample=0.001, workers=3,
                                sg=1, hs=0, negative=5,
                                epochs=5)

2024-02-25 19:43:47,668 : INFO : collecting all words and their counts
2024-02-25 19:43:47,671 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-02-25 19:43:48,813 : INFO : PROGRESS: at sentence #10000, processed 2301366 words, keeping 153853 word types
2024-02-25 19:43:49,876 : INFO : PROGRESS: at sentence #20000, processed 4553558 words, keeping 240043 word types
2024-02-25 19:43:50,337 : INFO : collected 276678 word types from a corpus of 5713167 raw words and 25000 sentences
2024-02-25 19:43:50,337 : INFO : Creating a fresh vocabulary
2024-02-25 19:43:50,721 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 48208 unique words (17.42% of original 276678, drops 228470)', 'datetime': '2024-02-25T19:43:50.721517', 'gensim': '4.3.0', 'python': '3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22631-SP0', 'event': 'prepare_vocab'}
2024-02-25 19:43:50,721 : IN

In [121]:
resultats_aggregation_train_sg, resultats_aggregation_test_sg, y_train, y_test = vectorize_all_methods(data, w2v_sg.wv)

Vectorization with the pre-trained vectors (wv_pre_trained)

In [125]:
resultats_aggregation_train_pre_trained, resultats_aggregation_test_pre_trained, y_train, y_test = vectorize_all_methods(data, wv_pre_trained)

### (2) Train a classifier 
As in the previous practical session, train a logistic regression to perform sentiment classification on the word vectors.



Classification with CBOW

In [131]:
methodes_aggregation = {"np.sum":0, "np.mean": 1, "np.max": 2, "np.min": 3}
i = methodes_aggregation["np.sum"]

X_train_cbow = pd.DataFrame(resultats_aggregation_train_cbow[i])
X_test_cbow = pd.DataFrame(resultats_aggregation_test_cbow[i])
y_train = pd.DataFrame(y_train).iloc[:,0]
y_test = pd.DataFrame(y_test).iloc[:,0]

model_lr = LogisticRegression()
model_lr.fit(X_train_cbow, y_train)
y_pred_cbow = model_lr.predict(X_test_cbow)
accuracy_cbow= accuracy_score(y_test,y_pred_cbow)
print('Accuracy=', accuracy_cbow)

Accuracy= 0.64


Classification with Skip-Gram

In [136]:

methodes_aggregation = {"np.sum":0, "np.mean": 1, "np.max": 2, "np.min": 3}
i = methodes_aggregation["np.sum"]

X_train_sg = pd.DataFrame(resultats_aggregation_train_sg[i])
X_test_sg = pd.DataFrame(resultats_aggregation_test_sg[i])
y_train = pd.DataFrame(y_train).iloc[:,0]
y_test = pd.DataFrame(y_test).iloc[:,0]

model_lr = LogisticRegression()
model_lr.fit(X_train_sg, y_train)
y_pred_sg = model_lr.predict(X_test_sg)
accuracy_sg = accuracy_score(y_test,y_pred_sg)
print('Accuracy=', accuracy_sg)

Accuracy= 0.7406


Classification with the pre-trained vectors (wv_pre_trained)

In [137]:
methodes_aggregation = {"np.sum":0, "np.mean": 1, "np.max": 2, "np.min": 3}
i = methodes_aggregation["np.sum"]

X_train_pre_trained = pd.DataFrame(resultats_aggregation_train_pre_trained[i])
X_test_pre_trained = pd.DataFrame(resultats_aggregation_test_pre_trained[i])
y_train = pd.DataFrame(y_train).iloc[:,0]
y_test = pd.DataFrame(y_test).iloc[:,0]

model_lr = LogisticRegression()
model_lr.fit(X_train_pre_trained, y_train)
y_pred_pre_trained = model_lr.predict(X_test_pre_trained)
accuracy_pre_trained = accuracy_score(y_test,y_pred_pre_trained)
print('Accuracy=', accuracy_pre_trained)

Accuracy= 0.7604


Performance is expected to be worse than with bag-of-words (~80%). Sum/mean aggregation does not work well on long reviews (especially those with many frequent words): it adds a lot of noise.

## **Todo**: Try answering the following questions:

- Which word2vec model works best: skip-gram or CBOW?
- Do pre-trained vectors work better than those learnt on the training dataset?


- Skip-gram works better than CBOW (0.74 vs. 0.64 accuracy with sum aggregation in the runs above).
- Yes, the pre-trained vectors work better than the others (0.76 accuracy) because they were trained on much larger corpora and can therefore better capture the dependencies between words and phrases.


**(Bonus)** To get a better accuracy, we could try several things:
- Better aggregation methods (weight by tf-idf ?)
- Another word vectorizing method such as [fasttext](https://radimrehurek.com/gensim/models/fasttext.html)
- A document vectorizing method such as [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html)
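
As a starting point for the first idea, here is a minimal sketch of a tf-idf weighted average. Everything in it is an illustrative assumption: the helper names, the smoothed idf formula, the toy corpus and the toy 2-d embeddings. Each word occurrence is weighted by its idf, so frequent, uninformative words contribute less to the review vector.

```python
import math
from collections import Counter

import numpy as np


def idf_weights(tokenized_docs):
    """Smoothed inverse document frequency for every word of the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(w for doc in tokenized_docs for w in set(doc))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}


def weighted_mean_vector(tokens, vectors, idf, dim):
    """Average the word vectors of `tokens`, each occurrence weighted by its idf."""
    known = [w for w in tokens if w in vectors and w in idf]
    if not known:
        return np.zeros(dim)  # no known word: zero vector
    weights = np.array([idf[w] for w in known])
    matrix = np.array([vectors[w] for w in known])
    return weights @ matrix / weights.sum()


# Toy corpus and toy 2-d embeddings (hypothetical values)
docs = [["the", "film", "was", "great"],
        ["the", "film", "was", "awful"],
        ["the", "plot"]]
vectors = {"the": np.array([0.0, 0.0]), "film": np.array([0.0, 1.0]),
           "great": np.array([1.0, 0.0]), "awful": np.array([-1.0, 0.0])}

idf = idf_weights(docs)
review_vector = weighted_mean_vector(docs[0], vectors, idf, dim=2)
print(review_vector)
```

With real data, `vectors` would be a gensim `KeyedVectors` object (membership tests and indexing by word work the same way) and the idf weights would be computed on the training reviews only.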