# Title: Sentiment Analysis with IMDB Dataset (Using Doc2Vec Model/ Gensim)

#### Members' Names or Individual's Name: (Sara Azadeh, Lesley Milley )

####  Emails: sara.azadeh@ryerson.ca , Lesley.milley@ryerson.ca 


# Introduction:

#### Problem Description:

Many machine learning algorithms require the input to be represented as a fixed-length feature vector. Bag-of-words is one of the most common fixed-length representations for text, but it has some weaknesses.

#### Context of the Problem:
We want to predict topics, labels, and sentiments; the problem is how to use the available information to make the best prediction.




#### Limitation About other Approaches:
Bag-of-words has two major weaknesses:

1. It loses the ordering of words.
2. It ignores the semantics of words.
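
The first weakness can be illustrated with a small sketch. The two review strings and the helper `bag_of_words` below are hypothetical and only serve the illustration; they are not part of the experiment.

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for a whitespace-tokenized, lowercased text."""
    return Counter(text.lower().split())

# Hypothetical reviews with opposite sentiment but the same words.
review_a = "not good , it was bad"
review_b = "not bad , it was good"

bow_a = bag_of_words(review_a)
bow_b = bag_of_words(review_b)

# Both reviews contain exactly the same multiset of words, so their
# bag-of-words representations cannot be told apart.
same_bag = bow_a == bow_b
print(same_bag)
```

Because the counts are identical, any classifier that sees only these counts must assign both reviews the same label.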

#### Solution:

Paragraph Vector is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text, such as sentences, paragraphs, and documents.


# Methodology
1. Train a variety of Doc2Vec models on the dataset.
2. Evaluate the performance of each model using logistic regression.
3. Examine some of the results directly.

When examining results, we will look for answers to the following questions:

1. Are inferred vectors close to the precalculated ones?
2. Do close documents seem more related than distant ones?
3. Do the word vectors show useful similarities?
4. Are the word vectors from this dataset any good at analogies?

# Background

The following table summarizes the related work.

| Reference | Explanation | Dataset/Input | Findings |
| --- | --- | --- | --- |
| Andrew M. Dai et al. [1] | Paragraph Vectors can effectively be used for measuring semantic similarity between long pieces of text | arXiv articles, Wikipedia | Paragraph Vector methods achieve better accuracy than LDA, bag-of-words, and averaged word embeddings |
| Quoc V. Le et al. [2] | Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representation | Treebank dataset, IMDB dataset | Achieves new state-of-the-art results |


# Implementation

In [44]:
%matplotlib inline

In [64]:
#import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
#logging.config.dictConfig({'disable_existing_loggers': True,})


First, let's define a convenient datatype for holding data for a single document:

* words: The text of the document, as a ``list`` of words.
* tags: A single-element list holding the index of the document in the entire dataset.
* split: one of ``train``, ``test`` or ``extra``. Determines how the document will be used (for training, testing, etc).
* sentiment: either 1 (positive), 0 (negative) or None (unlabeled document).

This data type is helpful for later evaluation and reporting.
In particular, the ``tags`` member will help us quickly and easily retrieve the vectors for a document from a model.




In [65]:
import collections

SentimentDocument = collections.namedtuple('SentimentDocument', 'words tags split sentiment')
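
As a quick illustration of how this datatype is used, the sketch below builds one toy document and reads back its fields. The document contents are made up for the example; real documents come from the IMDB archive.

```python
import collections

SentimentDocument = collections.namedtuple(
    'SentimentDocument', 'words tags split sentiment'
)

# Hypothetical toy document, not taken from the dataset.
doc = SentimentDocument(
    words=['a', 'great', 'movie'],
    tags=[0],          # single-element list holding the document index
    split='train',
    sentiment=1.0,
)

# The index stored in `tags` is what later code uses to look up
# the trained vector for this document.
doc_index = doc.tags[0]
print(doc.split, doc.sentiment, doc_index)
```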

Next, we load the data.



In [3]:
import io
import re
import tarfile
import os.path

import smart_open
import gensim.utils

def download_dataset(url='http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'):
    fname = url.split('/')[-1]

    if os.path.isfile(fname):
        return fname

    # Download the file to local storage first.
    with smart_open.open(url, "rb", ignore_ext=True) as fin:
        with smart_open.open(fname, 'wb', ignore_ext=True) as fout:
            while True:
                buf = fin.read(io.DEFAULT_BUFFER_SIZE)
                if not buf:
                    break
                fout.write(buf)

    return fname

def create_sentiment_document(name, text, index):
    _, split, sentiment_str, _ = name.split('/')
    sentiment = {'pos': 1.0, 'neg': 0.0, 'unsup': None}[sentiment_str]

    if sentiment is None:
        split = 'extra'

    tokens = gensim.utils.to_unicode(text).split()
    return SentimentDocument(tokens, [index], split, sentiment)

def extract_documents():
    fname = download_dataset()

    index = 0

    with tarfile.open(fname, mode='r:gz') as tar:
        for member in tar.getmembers():
            if re.match(r'aclImdb/(train|test)/(pos|neg|unsup)/\d+_\d+\.txt$', member.name):
                member_bytes = tar.extractfile(member).read()
                member_text = member_bytes.decode('utf-8', errors='replace')
                assert member_text.count('\n') == 0
                yield create_sentiment_document(member.name, member_text, index)
                index += 1

alldocs = list(extract_documents())
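
The loader derives each document's split and label from its path inside the archive. The sketch below isolates that step so it can be tried without downloading anything. The helper name `parse_member_name` and the example paths are assumptions made for illustration, and the dot before `txt` is escaped in this version of the pattern.

```python
import re

MEMBER_PATTERN = re.compile(
    r'aclImdb/(train|test)/(pos|neg|unsup)/\d+_\d+\.txt$'
)
SENTIMENT_BY_FOLDER = {'pos': 1.0, 'neg': 0.0, 'unsup': None}

def parse_member_name(name):
    """Return (split, sentiment) for a review path, or None for other files."""
    match = MEMBER_PATTERN.match(name)
    if match is None:
        return None
    split, folder = match.groups()
    sentiment = SENTIMENT_BY_FOLDER[folder]
    if sentiment is None:
        # Unlabeled reviews are used only for vector training.
        split = 'extra'
    return split, sentiment

# Hypothetical paths following the archive layout assumed by the pattern.
print(parse_member_name('aclImdb/train/pos/0_9.txt'))
print(parse_member_name('aclImdb/train/unsup/0_0.txt'))
print(parse_member_name('aclImdb/README'))
```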

Here's what a single document looks like.



In [66]:
print(alldocs[27])

SentimentDocument(words=['I', 'was', 'looking', 'forward', 'to', 'this', 'movie.', 'Trustworthy', 'actors,', 'interesting', 'plot.', 'Great', 'atmosphere', 'then', '?????', 'IF', 'you', 'are', 'going', 'to', 'attempt', 'something', 'that', 'is', 'meant', 'to', 'encapsulate', 'the', 'meaning', 'of', 'life.', 'First.', 'Know', 'it.', 'OK', 'I', 'did', 'not', 'expect', 'the', 'directors', 'or', 'writers', 'to', 'actually', 'know', 'the', 'meaning', 'but', 'I', 'thought', 'they', 'may', 'have', 'offered', 'crumbs', 'to', 'peck', 'at', 'and', 'treats', 'to', 'add', 'fuel', 'to', 'the', 'fire-Which!', 'they', 'almost', 'did.', 'Things', 'I', "didn't", 'get.', 'A', 'woman', 'wandering', 'around', 'in', 'dark', 'places', 'and', 'lonely', 'car', 'parks', 'alone-oblivious', 'to', 'the', 'consequences.', 'Great', 'riddles', 'that', 'fell', 'by', 'the', 'wayside.', 'The', 'promise', 'of', 'the', 'knowledge', 'therein', 'contained', 'by', 'the', 'original', 'so-called', 'criminal.', 'I', 'had', 'no

Extract our documents and split into training/test sets.



In [24]:
train_docs = [doc for doc in alldocs if doc.split == 'train']
test_docs = [doc for doc in alldocs if doc.split == 'test']
print(f'{len(alldocs)} docs: {len(train_docs)} train-sentiment, {len(test_docs)} test-sentiment')

100000 docs: 25000 train-sentiment, 25000 test-sentiment


In [25]:
train_docs[0]

SentimentDocument(words=['I', 'rented', 'I', 'AM', 'CURIOUS-YELLOW', 'from', 'my', 'video', 'store', 'because', 'of', 'all', 'the', 'controversy', 'that', 'surrounded', 'it', 'when', 'it', 'was', 'first', 'released', 'in', '1967.', 'I', 'also', 'heard', 'that', 'at', 'first', 'it', 'was', 'seized', 'by', 'U.S.', 'customs', 'if', 'it', 'ever', 'tried', 'to', 'enter', 'this', 'country,', 'therefore', 'being', 'a', 'fan', 'of', 'films', 'considered', '"controversial"', 'I', 'really', 'had', 'to', 'see', 'this', 'for', 'myself.<br', '/><br', '/>The', 'plot', 'is', 'centered', 'around', 'a', 'young', 'Swedish', 'drama', 'student', 'named', 'Lena', 'who', 'wants', 'to', 'learn', 'everything', 'she', 'can', 'about', 'life.', 'In', 'particular', 'she', 'wants', 'to', 'focus', 'her', 'attentions', 'to', 'making', 'some', 'sort', 'of', 'documentary', 'on', 'what', 'the', 'average', 'Swede', 'thought', 'about', 'certain', 'political', 'issues', 'such', 'as', 'the', 'Vietnam', 'War', 'and', 'race'

## Set-up Doc2Vec Training & Evaluation Models

In [69]:


import multiprocessing
from collections import OrderedDict

import gensim.models.doc2vec
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

from gensim.models.doc2vec import Doc2Vec


# cbow=0 means skip-gram which is equivalent to the paper's 'PV-DBOW'
# mode, matched in gensim with dm=0

# min_count=2 saves quite a bit of model memory, discarding only words that occur a single time in the corpus

common_kwargs = dict(
    vector_size=100, epochs=10, min_count=2,
    sample=0, workers=multiprocessing.cpu_count(), negative=5, hs=0
)

simple_models = [
    # PV-DBOW plain
    Doc2Vec(dm=0, **common_kwargs),
    # PV-DM w/ default averaging; a higher starting alpha may improve CBOW/PV-DM modes
    Doc2Vec(dm=1, window=10, alpha=0.05, comment='alpha=0.05', **common_kwargs),
    # PV-DM w/ concatenation - big, slow, experimental mode
    # window=5 (both sides) approximates paper's apparent 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, window=5, **common_kwargs),
]

for model in simple_models:
    model.build_vocab(alldocs)
    print(f"{model} vocabulary scanned & state initialized")

models_by_name = OrderedDict((str(model), model) for model in simple_models)

2021-04-09 19:06:53,603 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec(dbow,d100,n5,mc2,t4)', 'datetime': '2021-04-09T19:06:53.603925', 'gensim': '4.0.0', 'python': '3.8.5 (default, Sep  4 2020, 02:22:02) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'created'}
2021-04-09 19:06:53,617 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec(dm/m,d100,n5,w10,mc2,t4)', 'datetime': '2021-04-09T19:06:53.617928', 'gensim': '4.0.0', 'python': '3.8.5 (default, Sep  4 2020, 02:22:02) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'created'}
2021-04-09 19:06:53,621 : INFO : using concatenative 1100-dimensional layer1
2021-04-09 19:06:53,625 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec(dm/c,d100,n5,w5,mc2,t4)', 'datetime': '2021-04-09T19:06:53.624987', 'gensim': '4.0.0', 'python': '3.8.5 (default, Sep  4 2020, 02:22:02) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'created'}
2021-04-09 19:06:53,637 : IN

Doc2Vec(dbow,d100,n5,mc2,t4) vocabulary scanned & state initialized


2021-04-09 19:07:10,010 : INFO : PROGRESS: at example #10000, processed 2292381 words (3728373/s), 150816 word types, 0 tags
2021-04-09 19:07:10,695 : INFO : PROGRESS: at example #20000, processed 4573645 words (3329595/s), 238497 word types, 0 tags
2021-04-09 19:07:11,288 : INFO : PROGRESS: at example #30000, processed 6865575 words (3873221/s), 312348 word types, 0 tags
2021-04-09 19:07:12,090 : INFO : PROGRESS: at example #40000, processed 9190019 words (2901034/s), 377231 word types, 0 tags
2021-04-09 19:07:12,800 : INFO : PROGRESS: at example #50000, processed 11557847 words (3339456/s), 438729 word types, 0 tags
2021-04-09 19:07:13,467 : INFO : PROGRESS: at example #60000, processed 13899883 words (3515096/s), 493913 word types, 0 tags
2021-04-09 19:07:14,123 : INFO : PROGRESS: at example #70000, processed 16270094 words (3617618/s), 548474 word types, 0 tags
2021-04-09 19:07:14,858 : INFO : PROGRESS: at example #80000, processed 18598876 words (3170110/s), 598272 word types, 0 t

Doc2Vec(dm/m,d100,n5,w10,mc2,t4) vocabulary scanned & state initialized


2021-04-09 19:07:24,094 : INFO : PROGRESS: at example #10000, processed 2292381 words (4074071/s), 150816 word types, 0 tags
2021-04-09 19:07:24,691 : INFO : PROGRESS: at example #20000, processed 4573645 words (3826468/s), 238497 word types, 0 tags
2021-04-09 19:07:25,298 : INFO : PROGRESS: at example #30000, processed 6865575 words (3778100/s), 312348 word types, 0 tags
2021-04-09 19:07:25,932 : INFO : PROGRESS: at example #40000, processed 9190019 words (3671991/s), 377231 word types, 0 tags
2021-04-09 19:07:26,565 : INFO : PROGRESS: at example #50000, processed 11557847 words (3743531/s), 438729 word types, 0 tags
2021-04-09 19:07:27,121 : INFO : PROGRESS: at example #60000, processed 13899883 words (4222743/s), 493913 word types, 0 tags
2021-04-09 19:07:27,678 : INFO : PROGRESS: at example #70000, processed 16270094 words (4262920/s), 548474 word types, 0 tags
2021-04-09 19:07:28,310 : INFO : PROGRESS: at example #80000, processed 18598876 words (3685677/s), 598272 word types, 0 t

Doc2Vec(dm/c,d100,n5,w5,mc2,t4) vocabulary scanned & state initialized


In [None]:
!pip install testfixtures

According to the paper [2], combining a paragraph vector from Distributed Bag of Words (DBOW) with one from Distributed Memory (DM) improves performance. We will follow this approach, pairing the models together for evaluation. Here, we concatenate the paragraph vectors obtained from each model with the help of a thin wrapper class included in a gensim test module.


In [62]:

from gensim.test.test_doc2vec import ConcatenatedDoc2Vec
models_by_name['dbow+dmm'] = ConcatenatedDoc2Vec([simple_models[0], simple_models[1]])
models_by_name['dbow+dmc'] = ConcatenatedDoc2Vec([simple_models[0], simple_models[2]])
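
Conceptually, pairing two models means joining their vectors for the same document end to end. The sketch below shows only that idea with plain Python lists; `concat_docvecs` and the toy vectors are hypothetical, and this is not the wrapper's actual code.

```python
def concat_docvecs(*vectors):
    """Join several document vectors end to end into one longer vector."""
    joined = []
    for vec in vectors:
        joined.extend(vec)
    return joined

# Hypothetical low-dimensional vectors for the same document
# from two different models.
dbow_vec = [0.1, -0.2, 0.3]
dm_vec = [0.5, 0.4]

combined = concat_docvecs(dbow_vec, dm_vec)
print(len(combined), combined)
```

With two 100-dimensional models, as configured above, the combined vector would have 200 dimensions.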

Predictive Evaluation Methods
-----------------------------

Given a document, our ``Doc2Vec`` models output a vector representation of the document.
In the case of sentiment analysis, we want the output vector to reflect the sentiment in the input document.
So, in vector space, positive documents should be distant from negative documents.

We train a logistic regression from the training set:

  - regressors (inputs): document vectors from the Doc2Vec model
  - target (outputs): sentiment labels

So, this logistic regression will be able to predict sentiment given a document vector.

Next, we test our logistic regression on the test set, and measure the rate of errors (incorrect predictions).
If the document vectors from the Doc2Vec model reflect the actual sentiment well, the error rate will be low.

Therefore, the error rate of the logistic regression is an indication of *how well* the given Doc2Vec model represents documents as vectors.
We can then compare different ``Doc2Vec`` models by looking at their error rates.




In [57]:
import numpy as np
import statsmodels.api as sm
from random import sample

def logistic_predictor_from_data(train_targets, train_regressors):
    """Fit a statsmodel logistic predictor on supplied data"""
    logit = sm.Logit(train_targets, train_regressors)
    predictor = logit.fit(disp=0)
    # print(predictor.summary())
    return predictor

def error_rate_for_model(test_model, train_set, test_set):
    """Report error rate on test_doc sentiments, using supplied model and train_docs"""

    train_targets = [doc.sentiment for doc in train_set]
    train_regressors = [test_model.dv[doc.tags[0]] for doc in train_set]
    train_regressors = sm.add_constant(train_regressors)
    predictor = logistic_predictor_from_data(train_targets, train_regressors)

    test_regressors = [test_model.dv[doc.tags[0]] for doc in test_set]
    test_regressors = sm.add_constant(test_regressors)

    # Predict & evaluate
    test_predictions = predictor.predict(test_regressors)
    corrects = sum(np.rint(test_predictions) == [doc.sentiment for doc in test_set])
    errors = len(test_predictions) - corrects
    error_rate = float(errors) / len(test_predictions)
    return (error_rate, errors, len(test_predictions), predictor)
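
To make the error-rate computation concrete, here is a minimal sketch on made-up numbers. It assumes predicted probabilities are turned into labels by rounding, as in the function above; the probabilities and labels are hypothetical.

```python
def error_rate(predicted_probs, true_labels):
    """Fraction of examples whose rounded prediction differs from the label."""
    # A probability above 0.5 counts as positive (1.0), otherwise negative.
    predictions = [1.0 if p > 0.5 else 0.0 for p in predicted_probs]
    errors = sum(pred != label for pred, label in zip(predictions, true_labels))
    return errors / len(true_labels)

# Hypothetical predicted probabilities and true sentiment labels.
probs = [0.9, 0.2, 0.6, 0.4]
labels = [1.0, 0.0, 0.0, 1.0]

rate = error_rate(probs, labels)
print(rate)
```

Here the third and fourth examples are misclassified, so two of four predictions are wrong.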

Bulk Training & Per-Model Evaluation
------------------------------------

Note that doc-vector training occurs on *all* documents of the dataset,
which includes all TRAIN/TEST/EXTRA docs. Because the native document order
has similar-sentiment documents in large clumps, which is suboptimal for
training, we work with a once-shuffled copy of the full dataset.

We evaluate each model's sentiment-predictive power by its error rate on the
test set.

(Note: Running this part takes almost 1 hour.)
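
The shuffling step can be sketched on a toy list. The labels below are hypothetical stand-ins for documents, and the fixed seed is used only to make this illustration repeatable; the notebook itself shuffles without a seed.

```python
import random

# Hypothetical clumped ordering: all negatives first, then all positives.
ordered = ['neg'] * 3 + ['pos'] * 3

# Shuffle a copy so the original order (and index positions) stay intact.
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)

print(ordered)
print(shuffled)
```

Keeping the original list untouched matters here because document tags are indices into the unshuffled list.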



In [46]:
from collections import defaultdict
error_rates = defaultdict(lambda: 1.0)  # To selectively print only best errors achieved

In [33]:
from random import shuffle
shuffled_alldocs = alldocs[:]
shuffle(shuffled_alldocs)

for model in simple_models:
    print(f"Training {model}")
    model.train(shuffled_alldocs, total_examples=len(shuffled_alldocs), epochs=model.epochs)

    print(f"\nEvaluating {model}")
    err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs)
    error_rates[str(model)] = err_rate
    print("\n%f %s\n" % (err_rate, model))

for model in [models_by_name['dbow+dmm'], models_by_name['dbow+dmc']]:
    print(f"\nEvaluating {model}")
    err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs)
    error_rates[str(model)] = err_rate
    print(f"\n{err_rate} {model}\n")

2021-04-07 08:21:57,168 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 4 workers on 265408 vocabulary and 100 features, using sg=1 hs=0 sample=0 negative=5 window=5', 'datetime': '2021-04-07T08:21:57.168046', 'gensim': '4.0.0', 'python': '3.8.5 (default, Sep  4 2020, 02:22:02) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'train'}


Training Doc2Vec(dbow,d100,n5,mc2,t4)


2021-04-07 08:21:58,263 : INFO : EPOCH 1 - PROGRESS: at 3.23% examples, 752579 words/s, in_qsize 7, out_qsize 0
2021-04-07 08:21:59,264 : INFO : EPOCH 1 - PROGRESS: at 7.72% examples, 888241 words/s, in_qsize 6, out_qsize 1
2021-04-07 08:22:00,267 : INFO : EPOCH 1 - PROGRESS: at 12.58% examples, 965239 words/s, in_qsize 7, out_qsize 0
2021-04-07 08:22:01,273 : INFO : EPOCH 1 - PROGRESS: at 16.28% examples, 933193 words/s, in_qsize 7, out_qsize 0
2021-04-07 08:22:02,284 : INFO : EPOCH 1 - PROGRESS: at 20.83% examples, 956765 words/s, in_qsize 7, out_qsize 0
2021-04-07 08:22:03,295 : INFO : EPOCH 1 - PROGRESS: at 25.23% examples, 965005 words/s, in_qsize 7, out_qsize 0
2021-04-07 08:22:04,300 : INFO : EPOCH 1 - PROGRESS: at 30.12% examples, 985229 words/s, in_qsize 6, out_qsize 1
2021-04-07 08:22:05,310 : INFO : EPOCH 1 - PROGRESS: at 35.18% examples, 1006157 words/s, in_qsize 7, out_qsize 0
2021-04-07 08:22:06,321 : INFO : EPOCH 1 - PROGRESS: at 39.87% examples, 1013444 words/s, in_qsiz


Evaluating Doc2Vec(dbow,d100,n5,mc2,t4)


2021-04-07 08:25:58,745 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 4 workers on 265408 vocabulary and 100 features, using sg=0 hs=0 sample=0 negative=5 window=10', 'datetime': '2021-04-07T08:25:58.744995', 'gensim': '4.0.0', 'python': '3.8.5 (default, Sep  4 2020, 02:22:02) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'train'}



0.113960 Doc2Vec(dbow,d100,n5,mc2,t4)

Training Doc2Vec(dm/m,d100,n5,w10,mc2,t4)


2021-04-07 08:25:59,777 : INFO : EPOCH 1 - PROGRESS: at 1.83% examples, 424245 words/s, in_qsize 8, out_qsize 0
2021-04-07 08:26:00,821 : INFO : EPOCH 1 - PROGRESS: at 4.08% examples, 457183 words/s, in_qsize 7, out_qsize 0
2021-04-07 08:26:01,823 : INFO : EPOCH 1 - PROGRESS: at 6.39% examples, 481074 words/s, in_qsize 7, out_qsize 0
2021-04-07 08:26:02,888 : INFO : EPOCH 1 - PROGRESS: at 8.53% examples, 474241 words/s, in_qsize 7, out_qsize 0
2021-04-07 08:26:03,929 : INFO : EPOCH 1 - PROGRESS: at 10.49% examples, 466561 words/s, in_qsize 7, out_qsize 0
2021-04-07 08:26:04,941 : INFO : EPOCH 1 - PROGRESS: at 12.58% examples, 468654 words/s, in_qsize 7, out_qsize 0
2021-04-07 08:26:05,946 : INFO : EPOCH 1 - PROGRESS: at 14.69% examples, 469081 words/s, in_qsize 8, out_qsize 0
2021-04-07 08:26:06,991 : INFO : EPOCH 1 - PROGRESS: at 17.00% examples, 475342 words/s, in_qsize 7, out_qsize 0
2021-04-07 08:26:07,997 : INFO : EPOCH 1 - PROGRESS: at 19.09% examples, 475754 words/s, in_qsize 7,


Evaluating Doc2Vec(dm/m,d100,n5,w10,mc2,t4)


2021-04-07 08:33:39,936 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 4 workers on 265409 vocabulary and 1100 features, using sg=0 hs=0 sample=0 negative=5 window=5', 'datetime': '2021-04-07T08:33:39.936851', 'gensim': '4.0.0', 'python': '3.8.5 (default, Sep  4 2020, 02:22:02) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'train'}



0.188440 Doc2Vec(dm/m,d100,n5,w10,mc2,t4)

Training Doc2Vec(dm/c,d100,n5,w5,mc2,t4)


2021-04-07 08:33:41,101 : INFO : EPOCH 1 - PROGRESS: at 0.70% examples, 141704 words/s, in_qsize 7, out_qsize 0
2021-04-07 08:33:42,148 : INFO : EPOCH 1 - PROGRESS: at 1.84% examples, 196945 words/s, in_qsize 8, out_qsize 0
2021-04-07 08:33:43,150 : INFO : EPOCH 1 - PROGRESS: at 2.77% examples, 201513 words/s, in_qsize 7, out_qsize 0
2021-04-07 08:33:44,215 : INFO : EPOCH 1 - PROGRESS: at 3.62% examples, 196681 words/s, in_qsize 6, out_qsize 1
2021-04-07 08:33:45,258 : INFO : EPOCH 1 - PROGRESS: at 4.85% examples, 212384 words/s, in_qsize 7, out_qsize 0
2021-04-07 08:33:46,268 : INFO : EPOCH 1 - PROGRESS: at 5.92% examples, 216753 words/s, in_qsize 7, out_qsize 0
2021-04-07 08:33:47,289 : INFO : EPOCH 1 - PROGRESS: at 7.02% examples, 220741 words/s, in_qsize 7, out_qsize 0
2021-04-07 08:33:48,308 : INFO : EPOCH 1 - PROGRESS: at 8.36% examples, 229770 words/s, in_qsize 7, out_qsize 0
2021-04-07 08:33:49,333 : INFO : EPOCH 1 - PROGRESS: at 9.78% examples, 239742 words/s, in_qsize 8, out_


Evaluating Doc2Vec(dm/c,d100,n5,w5,mc2,t4)

0.351440 Doc2Vec(dm/c,d100,n5,w5,mc2,t4)


Evaluating Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/m,d100,n5,w10,mc2,t4)

0.1144 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/m,d100,n5,w10,mc2,t4)


Evaluating Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4)

0.11544 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4)



Achieved Sentiment-Prediction Accuracy
--------------------------------------
Compare error rates achieved, best-to-worst



In [34]:
print("Err_rate Model")
for rate, name in sorted((rate, name) for name, rate in error_rates.items()):
    print(f"{rate} {name}")

Err_rate Model
0.11396 Doc2Vec(dbow,d100,n5,mc2,t4)
0.1144 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/m,d100,n5,w10,mc2,t4)
0.11544 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4)
0.18844 Doc2Vec(dm/m,d100,n5,w10,mc2,t4)
0.35144 Doc2Vec(dm/c,d100,n5,w5,mc2,t4)


In our testing, contrary to the results of the paper, PV-DBOW alone performs
as well as anything else on this problem. Concatenating vectors from
different models did not improve on it in this run (11.44% and 11.54% error
versus 11.40% for PV-DBOW alone), and stays close to the best-performing solo
model included.

The best results achieved here are an error rate of just over 11%.
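
The error rates can be restated as accuracies with a short calculation. The numbers below are copied from the output above; the short model labels are abbreviations introduced here for readability.

```python
# Error rates reported above, keyed by abbreviated model names.
reported_error_rates = {
    'dbow': 0.11396,
    'dbow+dmm': 0.1144,
    'dbow+dmc': 0.11544,
    'dm/m': 0.18844,
    'dm/c': 0.35144,
}

# Accuracy is the complement of the error rate.
accuracies = {name: 1.0 - err for name, err in reported_error_rates.items()}
best = min(reported_error_rates, key=reported_error_rates.get)

for name, err in sorted(reported_error_rates.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} error={err:.4f} accuracy={accuracies[name]:.2%}")
```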





Examining Results
-----------------




Are inferred vectors close to the precalculated ones?
-----------------------------------------------------



In [35]:
doc_id = np.random.randint(len(simple_models[0].dv))  # Pick random doc; re-run cell for more examples
print(f'for doc {doc_id}...')
for model in simple_models:
    inferred_docvec = model.infer_vector(alldocs[doc_id].words)
    print(f'{model}:\n {model.dv.most_similar([inferred_docvec], topn=3)}')

for doc 22397...
Doc2Vec(dbow,d100,n5,mc2,t4):
 [(22397, 0.9864609241485596), (53697, 0.7459406852722168), (76668, 0.7366845011711121)]
Doc2Vec(dm/m,d100,n5,w10,mc2,t4):
 [(22397, 0.9181831479072571), (22401, 0.5822756886482239), (22420, 0.5479550361633301)]
Doc2Vec(dm/c,d100,n5,w5,mc2,t4):
 [(22397, 0.8271936774253845), (75400, 0.44324803352355957), (78942, 0.4326428174972534)]


(Yes, here the stored vector from 10 epochs of training is usually one of the
closest to a freshly-inferred vector for the same words. Defaults for
inference may benefit from tuning for each dataset or model parameters.)
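
The scores returned by ``most_similar`` are cosine similarities. A minimal sketch of that measure, using hypothetical three-dimensional vectors in place of real document vectors:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors: a stored vector, a slightly different
# re-inferred vector, and an unrelated one.
stored = [1.0, 2.0, 3.0]
inferred = [1.1, 1.9, 3.2]
unrelated = [-3.0, 0.5, 0.0]

print(cosine_similarity(stored, inferred))
print(cosine_similarity(stored, unrelated))
```

A value near 1 means the two vectors point in almost the same direction, which is what we hope to see between a stored vector and its re-inferred counterpart.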




Do close documents seem more related than distant ones?
-------------------------------------------------------



In [70]:
import random

doc_id = np.random.randint(len(simple_models[0].dv))  # pick random doc, re-run cell for more examples
model = random.choice(simple_models)  # and a random model
sims = model.dv.most_similar(doc_id, topn=len(model.dv))  # get *all* similar documents
print(f'TARGET ({doc_id}): «{" ".join(alldocs[doc_id].words)}»\n')
print(f'SIMILAR/DISSIMILAR DOCS PER MODEL {model}:\n')
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    s = sims[index]
    i = sims[index][0]
    words = ' '.join(alldocs[i].words)
    print(f'{label} {s}: «{words}»\n')

TARGET (70273): «While I disagree with the conventional wisdom about the Marx Brothers' film made before The Big Store, Go West (1940), believing it to be yet another one of their many masterpieces, I have to agree with the conventional wisdom about The Big Store. It has the feeling of a contractual obligation film. One, two or all three of the Marx Brothers are absent for long periods of time. The story is often confusing. The film doesn't flow very well. Some of the material featuring other performers simply doesn't work. Even when it does work, it's never as good as the Marx Brothers' material, and even their work is too often strangely flat.<br /><br />The Big Store is really the story of Tommy Rogers (played by famed pop singer Tony Martin). Rogers has just gained partial ownership of the Phelps Department store with the passing of a relative. However, he's not interested in the store, so he plans to sell and use the money to build a state of the art music conservatory in conjunct

Somewhat, in terms of reviewer tone, movie genre, etc... the MOST
cosine-similar docs usually seem more like the TARGET than the MEDIAN or
LEAST... especially if the MOST has a cosine-similarity > 0.5. Re-run the
cell to try another random target document.




Do the word vectors show useful similarities?
---------------------------------------------




In [37]:
import random

word_models = simple_models[:]

def pick_random_word(model, threshold=10):
    # pick a random word with a suitable number of occurrences
    while True:
        word = random.choice(model.wv.index_to_key)
        if model.wv.get_vecattr(word, "count") > threshold:
            return word

target_word = pick_random_word(word_models[0])
# or uncomment below line, to just pick a word from the relevant domain:
# target_word = 'comedy/drama'

for model in word_models:
    print(f'target_word: {repr(target_word)} model: {model} similar words:')
    for i, (word, sim) in enumerate(model.wv.most_similar(target_word, topn=10), 1):
        print(f'    {i}. {sim:.2f} {repr(word)}')
    print()

target_word: 'ID' model: Doc2Vec(dbow,d100,n5,mc2,t4) similar words:
    1. 0.46 "'Swing"
    2. 0.43 'THIS.'
    3. 0.42 'tight,'
    4. 0.40 'Sheppard'
    5. 0.40 '1998-2004.'
    6. 0.40 '"COME'
    7. 0.39 "'game'"
    8. 0.39 '/>3-'
    9. 0.38 'traffic).'
    10. 0.38 'idolised'

target_word: 'ID' model: Doc2Vec(dm/m,d100,n5,w10,mc2,t4) similar words:
    1. 0.57 'profession'
    2. 0.56 'interrogator'
    3. 0.55 'authorities'
    4. 0.54 'society'
    5. 0.54 'NASA,'
    6. 0.54 'prejudices'
    7. 0.54 'Vatican'
    8. 0.54 'Communists'
    9. 0.53 'politicians'
    10. 0.52 'organization'

target_word: 'ID' model: Doc2Vec(dm/c,d100,n5,w5,mc2,t4) similar words:
    1. 0.54 'peace,'
    2. 0.53 'INFERNO'
    3. 0.53 'authority,'
    4. 0.52 'medicine'
    5. 0.52 'F-86s'
    6. 0.52 'chain-gang'
    7. 0.52 'unrest,'
    8. 0.52 'monotheism,'
    9. 0.52 'independence'
    10. 0.52 'symbolism,'



Do the DBOW words look meaningless? That's because the gensim DBOW model
doesn't train word vectors – they remain at their random initialized values –
unless you ask with the ``dbow_words=1`` initialization parameter. Concurrent
word-training slows DBOW mode significantly, and offers little improvement
(and sometimes a little worsening) of the error rate on this IMDB
sentiment-prediction task, but may be appropriate on other tasks, or if you
also need word-vectors.

Words from DM models tend to show meaningfully similar words when there are
many examples in the training data (as with 'plot' or 'actor'). (All DM modes
inherently involve word-vector training concurrent with doc-vector training.)
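
If word vectors are needed from a DBOW model, the settings could look like the sketch below. It mirrors the keyword arguments used earlier in this notebook and adds ``dbow_words=1``; the ``window=5`` value is an assumption, and this configuration was not trained or evaluated here.

```python
# PV-DBOW settings with concurrent word-vector training enabled.
# window=5 is an assumed value; it was not tuned in this notebook.
dbow_with_words_kwargs = dict(
    dm=0, dbow_words=1, window=5,
    vector_size=100, epochs=10, min_count=2,
    sample=0, negative=5, hs=0,
)

def build_dbow_with_words():
    """Create (but do not train) the model; requires gensim to be installed."""
    from gensim.models.doc2vec import Doc2Vec
    return Doc2Vec(**dbow_with_words_kwargs)

print(dbow_with_words_kwargs)
```

The model would then go through the same ``build_vocab`` and ``train`` steps as the others, at the cost of slower training.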




Are the word vectors from this dataset any good at analogies?
-------------------------------------------------------------



In [38]:
from gensim.test.utils import datapath
questions_filename = datapath('questions-words.txt')

# Note: this analysis takes many minutes
for model in word_models:
    score, sections = model.wv.evaluate_word_analogies(questions_filename)
    correct, incorrect = len(sections[-1]['correct']), len(sections[-1]['incorrect'])
    print(f'{model}: {float(correct*100)/(correct+incorrect):0.2f}% correct ({correct} of {correct+incorrect})')

2021-04-07 08:49:14,730 : INFO : Evaluating word analogies for top 300000 words in the model on /Users/sara/opt/anaconda3/lib/python3.8/site-packages/gensim/test/test_data/questions-words.txt
2021-04-07 08:49:19,114 : INFO : capital-common-countries: 0.0% (0/420)
2021-04-07 08:49:27,588 : INFO : capital-world: 0.0% (0/902)
2021-04-07 08:49:28,405 : INFO : currency: 0.0% (0/86)
2021-04-07 08:49:43,425 : INFO : city-in-state: 0.0% (0/1510)
2021-04-07 08:49:48,687 : INFO : family: 0.0% (0/506)
2021-04-07 08:49:58,794 : INFO : gram1-adjective-to-adverb: 0.0% (0/992)
2021-04-07 08:50:07,703 : INFO : gram2-opposite: 0.0% (0/756)
2021-04-07 08:50:21,614 : INFO : gram3-comparative: 0.0% (0/1332)
2021-04-07 08:50:32,326 : INFO : gram4-superlative: 0.0% (0/1056)
2021-04-07 08:50:42,616 : INFO : gram5-present-participle: 0.0% (0/992)
2021-04-07 08:50:57,817 : INFO : gram6-nationality-adjective: 0.0% (0/1445)
2021-04-07 08:51:13,323 : INFO : gram7-past-tense: 0.0% (0/1560)
2021-04-07 08:51:24,819 

Doc2Vec(dbow,d100,n5,mc2,t4): 0.00%% correct (0 of 13617


2021-04-07 08:51:33,999 : INFO : Evaluating word analogies for top 300000 words in the model on /Users/sara/opt/anaconda3/lib/python3.8/site-packages/gensim/test/test_data/questions-words.txt
2021-04-07 08:51:38,180 : INFO : capital-common-countries: 5.7% (24/420)
2021-04-07 08:51:48,321 : INFO : capital-world: 1.2% (11/902)
2021-04-07 08:51:49,023 : INFO : currency: 0.0% (0/86)
2021-04-07 08:52:05,440 : INFO : city-in-state: 0.2% (3/1510)
2021-04-07 08:52:10,622 : INFO : family: 47.8% (242/506)
2021-04-07 08:52:19,125 : INFO : gram1-adjective-to-adverb: 3.9% (39/992)
2021-04-07 08:52:25,924 : INFO : gram2-opposite: 6.6% (50/756)
2021-04-07 08:52:38,749 : INFO : gram3-comparative: 50.2% (669/1332)
2021-04-07 08:52:49,420 : INFO : gram4-superlative: 28.1% (297/1056)
2021-04-07 08:52:59,351 : INFO : gram5-present-participle: 24.9% (247/992)
2021-04-07 08:53:13,514 : INFO : gram6-nationality-adjective: 3.5% (50/1445)
2021-04-07 08:53:31,651 : INFO : gram7-past-tense: 29.2% (456/1560)
2021

Doc2Vec(dm/m,d100,n5,w10,mc2,t4): 20.03%% correct (2728 of 13617


2021-04-07 08:53:54,435 : INFO : Evaluating word analogies for top 300000 words in the model on /Users/sara/opt/anaconda3/lib/python3.8/site-packages/gensim/test/test_data/questions-words.txt
2021-04-07 08:53:58,597 : INFO : capital-common-countries: 3.8% (16/420)
2021-04-07 08:54:07,341 : INFO : capital-world: 0.4% (4/902)
2021-04-07 08:54:08,172 : INFO : currency: 0.0% (0/86)
2021-04-07 08:54:22,611 : INFO : city-in-state: 0.2% (3/1510)
2021-04-07 08:54:27,581 : INFO : family: 36.6% (185/506)
2021-04-07 08:54:37,144 : INFO : gram1-adjective-to-adverb: 5.9% (59/992)
2021-04-07 08:54:44,252 : INFO : gram2-opposite: 3.3% (25/756)
2021-04-07 08:54:56,979 : INFO : gram3-comparative: 38.8% (517/1332)
2021-04-07 08:55:06,722 : INFO : gram4-superlative: 25.5% (269/1056)
2021-04-07 08:55:16,407 : INFO : gram5-present-participle: 33.6% (333/992)
2021-04-07 08:55:29,718 : INFO : gram6-nationality-adjective: 3.5% (51/1445)
2021-04-07 08:55:44,289 : INFO : gram7-past-tense: 23.7% (369/1560)
2021-

Doc2Vec(dm/c,d100,n5,w5,mc2,t4): 17.58%% correct (2394 of 13617


Even though this is a tiny, domain-specific dataset, it shows some meager
capability on the general word analogies – at least for the DM/mean and
DM/concat models which actually train word vectors. (The untrained
random-initialized words of the DBOW model of course fail miserably.)
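
An analogy question such as "man is to woman as king is to ?" is answered with vector arithmetic followed by a nearest-neighbour search. The sketch below is a simplified version of that idea on hand-made two-dimensional vectors; the vectors and the helper `solve_analogy` are hypothetical, and gensim's implementation differs in details such as normalization.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hand-made toy vectors, not learned from any data.
toy_vectors = {
    'man':   [1.0, 0.0],
    'woman': [1.0, 1.0],
    'king':  [3.0, 0.0],
    'queen': [3.0, 1.0],
    'movie': [0.0, 3.0],
}

def solve_analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = solve_analogy('man', 'woman', 'king', toy_vectors)
print(answer)
```

The reported percentage is the number of questions answered correctly divided by the number of questions attempted, for example 2728 of 13617 for the DM/mean model.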




# Conclusion and Future Direction

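The results of the experiment suggest that Paragraph Vectors overcome the weaknesses of the bag-of-words model: the best model here (PV-DBOW) reached an error rate of about 11% on the IMDB test set, although no bag-of-words baseline was trained in this notebook for direct comparison.
The focus of the study was on text, but the authors of [2] mention that the method can also be applied to learn representations for other sequential data.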


# References:

[1] Andrew M. Dai, Christopher Olah, Quoc V. Le. Document Embedding with Paragraph Vectors.

[2] Quoc V. Le, Tomas Mikolov. Distributed Representations of Sentences and Documents.

In [None]:
#References: 
#https://radimrehurek.com/gensim/auto_examples/howtos/run_doc2vec_imdb.html#sphx-glr-download-auto-examples-howtos-run-doc2vec-imdb-py