# Title: 
# Evaluating Doc2Vec Model based on Sentiment Analysis on IMDB DataSet (Using Gensim)

#### Members: Sara Azadeh, Lesley Milley

#### Emails: sara.azadeh@ryerson.ca, Lesley.milley@ryerson.ca


# Introduction:

#### Problem Description:

Many ML algorithms require the input to be represented as a fixed-length feature vector. Bag-of-words is one of the most common fixed-length representations, but it has some weaknesses.

#### Context of the Problem:
We want to predict topics, labels, and sentiments, and the problem is how to use the information we have to make the best prediction.




#### Limitations of Other Approaches:
Bag-of-words has two major weaknesses:
1) It loses the ordering of words. 2) It ignores the semantics of words.
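
The first weakness can be seen in a minimal sketch. The helper `bag_of_words` and the two example sentences are hypothetical and serve only as an illustration; the point is that two reviews with opposite sentiment can map to exactly the same representation once word positions are discarded.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, discarding their positions."""
    return Counter(text.lower().split())


a = bag_of_words("the movie was good not bad")
b = bag_of_words("the movie was bad not good")

# Opposite sentiments, identical representation: word order is gone.
print(a == b)
```

The second weakness follows from the same structure: every word is an independent key, so a pair of related words is no closer together than a pair of unrelated ones.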

#### Solution:

Paragraph Vector is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text.


# Methodology
We train a variety of Doc2Vec models on the IMDB dataset and evaluate how well each model predicts sentiment using several machine learning algorithms.

Our starting point was to replicate part of the papers listed below, which include the original paper on the Doc2Vec concept.
We chose to test the application of Doc2Vec to sentiment analysis.
The authors did not publish their code. However, there are several implementations of their papers, and we chose one of them as a baseline.
That implementation trains the models with Gensim (Doc2Vec) and then assesses sentiment using logistic regression. We extended it to also assess sentiment using RandomForestClassifier and GaussianNB.<br />
*Other papers mention the difficulty of replicating the original papers, both in accuracy and in identifying the best models and hyperparameters. We performed several experiments to determine the best Doc2Vec model for predicting sentiment.*



# Background

Many ML algorithms require the input to be represented as a fixed-length feature vector. Bag-of-words is one of the most common fixed-length representations.

Bag-of-words has two major weaknesses:

* It loses the ordering of words.
* It ignores the semantics of words.

Paragraph Vector is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text.
In this algorithm each document is represented by a dense vector, which is trained to predict words in the document.
This method addresses both bag-of-words weaknesses.
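
To make "a dense vector trained to predict words in the document" concrete, here is a toy sketch in the spirit of PV-DBOW. It is not Gensim's implementation: it uses a full softmax rather than negative sampling, plain Python lists, and a three-document corpus invented for illustration. The function name `train_pv_dbow` and all of its parameters are assumptions of this sketch.

```python
import math
import random


def train_pv_dbow(docs, dim=8, epochs=50, lr=0.1, seed=0):
    """Toy PV-DBOW: each document vector learns to predict its own words."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    doc_vecs = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in docs]
    out_weights = [[0.0] * dim for _ in vocab]  # one output row per vocabulary word
    losses = []

    for _ in range(epochs):
        total_loss, n_targets = 0.0, 0
        for doc_index, doc in enumerate(docs):
            doc_vec = doc_vecs[doc_index]
            for word in doc:
                target = word_id[word]
                # Softmax over the vocabulary, conditioned only on the document vector.
                scores = [sum(w * d for w, d in zip(row, doc_vec)) for row in out_weights]
                top = max(scores)
                exps = [math.exp(s - top) for s in scores]
                norm = sum(exps)
                probs = [e / norm for e in exps]
                total_loss -= math.log(probs[target])
                n_targets += 1
                # Gradient step on the output weights and on the document vector.
                grad_doc = [0.0] * dim
                for j, row in enumerate(out_weights):
                    g = probs[j] - (1.0 if j == target else 0.0)
                    for k in range(dim):
                        grad_doc[k] += g * row[k]
                        row[k] -= lr * g * doc_vec[k]
                for k in range(dim):
                    doc_vec[k] -= lr * grad_doc[k]
        losses.append(total_loss / n_targets)
    return doc_vecs, losses


toy_docs = [
    "great film loved the acting".split(),
    "terrible film hated the plot".split(),
    "great plot loved the film".split(),
]
doc_vecs, losses = train_pv_dbow(toy_docs)
print(f"mean loss: first epoch {losses[0]:.3f}, last epoch {losses[-1]:.3f}")
```

Whatever the length of a document, the result is one vector of `dim` numbers, which is the fixed-length input that downstream classifiers need.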

| Reference | Explanation | Dataset/Input | Weaknesses |
| --- | --- | --- | --- |
| Andrew M. Dai et al. [1] | Paragraph Vectors can effectively be used for measuring semantic similarity between long pieces of text | arXiv articles, Wikipedia | They only used the DBOW model |
| Quoc V. Le et al. [2] | Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representation | Treebank dataset, IMDB dataset | Cannot be replicated (others struggle to reproduce the results) |


# Implementation

In [1]:
%matplotlib inline


For a single document we keep: 

* words: The text of the document, as a ``list`` of words.
* tags: Used to keep the index of the document in the entire dataset.
* split: one of ``train``, ``test`` or ``extra``. Determines how the document will be used (for training, testing, etc).
* sentiment: either 1 (positive), 0 (negative) or None (unlabeled document).




In [3]:
import collections

SentimentDocument = collections.namedtuple('SentimentDocument', 'words tags split sentiment')

Next, we download the IMDB archive (if it is not already on disk) and load the reviews from it.



In [4]:
import io
import re
import tarfile
import os.path

import smart_open
import gensim.utils


def download_dataset(url='http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'):
    fname = url.split('/')[-1]

    if os.path.isfile(fname):
        return fname

    # Download the file to local storage first.
    with smart_open.open(url, "rb", ignore_ext=True) as fin:
        with smart_open.open(fname, 'wb', ignore_ext=True) as fout:
            while True:
                buf = fin.read(io.DEFAULT_BUFFER_SIZE)
                if not buf:
                    break
                fout.write(buf)

    return fname


In [5]:
def create_sentiment_document(name, text, index):
    _, split, sentiment_str, _ = name.split('/')
    sentiment = {'pos': 1.0, 'neg': 0.0, 'unsup': None}[sentiment_str]

    if sentiment is None:
        split = 'extra'

    tokens = gensim.utils.to_unicode(text).split()
    
    return SentimentDocument(tokens, [index], split, sentiment)

def extract_documents():
    fname = download_dataset()

    index = 0

    with tarfile.open(fname, mode='r:gz') as tar:
        for member in tar.getmembers():
            if re.match(r'aclImdb/(train|test)/(pos|neg|unsup)/\d+_\d+\.txt$', member.name):
                member_bytes = tar.extractfile(member).read()
                member_text = member_bytes.decode('utf-8', errors='replace')
                assert member_text.count('\n') == 0
                yield create_sentiment_document(member.name, member_text, index)
                index += 1

alldocs = list(extract_documents())
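
The mapping from archive member names to the `split` and `sentiment` fields can be checked without downloading anything. The sketch below repeats the same pattern and label table in a hypothetical helper, `parse_member_name`; the three paths are illustrative inputs, not a listing of the archive.

```python
import re

MEMBER_PATTERN = re.compile(r'aclImdb/(train|test)/(pos|neg|unsup)/\d+_\d+\.txt$')
LABELS = {'pos': 1.0, 'neg': 0.0, 'unsup': None}


def parse_member_name(name):
    """Return (split, sentiment) for a review path, or None for any other file."""
    if not MEMBER_PATTERN.match(name):
        return None
    _, split, sentiment_str, _ = name.split('/')
    sentiment = LABELS[sentiment_str]
    if sentiment is None:
        # Unlabeled reviews are kept, but only as extra training text.
        split = 'extra'
    return split, sentiment


print(parse_member_name('aclImdb/train/pos/0_9.txt'))    # ('train', 1.0)
print(parse_member_name('aclImdb/train/unsup/0_0.txt'))  # ('extra', None)
print(parse_member_name('aclImdb/imdb.vocab'))           # None
```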

Here's what a single document looks like.



In [6]:
print(alldocs[27])

SentimentDocument(words=['I', 'was', 'looking', 'forward', 'to', 'this', 'movie.', 'Trustworthy', 'actors,', 'interesting', 'plot.', 'Great', 'atmosphere', 'then', '?????', 'IF', 'you', 'are', 'going', 'to', 'attempt', 'something', 'that', 'is', 'meant', 'to', 'encapsulate', 'the', 'meaning', 'of', 'life.', 'First.', 'Know', 'it.', 'OK', 'I', 'did', 'not', 'expect', 'the', 'directors', 'or', 'writers', 'to', 'actually', 'know', 'the', 'meaning', 'but', 'I', 'thought', 'they', 'may', 'have', 'offered', 'crumbs', 'to', 'peck', 'at', 'and', 'treats', 'to', 'add', 'fuel', 'to', 'the', 'fire-Which!', 'they', 'almost', 'did.', 'Things', 'I', "didn't", 'get.', 'A', 'woman', 'wandering', 'around', 'in', 'dark', 'places', 'and', 'lonely', 'car', 'parks', 'alone-oblivious', 'to', 'the', 'consequences.', 'Great', 'riddles', 'that', 'fell', 'by', 'the', 'wayside.', 'The', 'promise', 'of', 'the', 'knowledge', 'therein', 'contained', 'by', 'the', 'original', 'so-called', 'criminal.', 'I', 'had', 'no

Extract our documents and split into training/test sets.



In [7]:
train_docs = [doc for doc in alldocs if doc.split == 'train']
test_docs = [doc for doc in alldocs if doc.split == 'test']
print(f'{len(alldocs)} docs: {len(train_docs)} train-sentiment, {len(test_docs)} test-sentiment')

100000 docs: 25000 train-sentiment, 25000 test-sentiment


In [27]:
train_docs [0]

SentimentDocument(words=['I', 'rented', 'I', 'AM', 'CURIOUS-YELLOW', 'from', 'my', 'video', 'store', 'because', 'of', 'all', 'the', 'controversy', 'that', 'surrounded', 'it', 'when', 'it', 'was', 'first', 'released', 'in', '1967.', 'I', 'also', 'heard', 'that', 'at', 'first', 'it', 'was', 'seized', 'by', 'U.S.', 'customs', 'if', 'it', 'ever', 'tried', 'to', 'enter', 'this', 'country,', 'therefore', 'being', 'a', 'fan', 'of', 'films', 'considered', '"controversial"', 'I', 'really', 'had', 'to', 'see', 'this', 'for', 'myself.<br', '/><br', '/>The', 'plot', 'is', 'centered', 'around', 'a', 'young', 'Swedish', 'drama', 'student', 'named', 'Lena', 'who', 'wants', 'to', 'learn', 'everything', 'she', 'can', 'about', 'life.', 'In', 'particular', 'she', 'wants', 'to', 'focus', 'her', 'attentions', 'to', 'making', 'some', 'sort', 'of', 'documentary', 'on', 'what', 'the', 'average', 'Swede', 'thought', 'about', 'certain', 'political', 'issues', 'such', 'as', 'the', 'Vietnam', 'War', 'and', 'race'

Set up Doc2Vec Training & Evaluation Models

In [66]:

import multiprocessing
from collections import OrderedDict

import gensim.models.doc2vec
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

from gensim.models.doc2vec import Doc2Vec


# PV-DBOW is the paragraph-vector analogue of word2vec skip-gram; in gensim it is selected with dm=0

# min_count=2 saves quite a bit of model memory by discarding words that occur only once in the corpus

# The paper used a vector size of 400; we reduced it to 100 and train for 10 epochs
# We adjusted some of these parameters while tuning the models for better performance

common_kwargs = dict(
    vector_size=100, epochs=10, min_count=2,
    sample=0, workers=multiprocessing.cpu_count(), negative=5, hs=0
)

simple_models = [
    # PV-DBOW plain
    Doc2Vec(dm=0, alpha=0.025, min_alpha=0.0001, **common_kwargs),

    # PV-DM plain w/ default averaging; a higher starting alpha may improve CBOW/PV-DM modes
    # alpha is the initial learning rate (kept at 0.025 here)
    Doc2Vec(dm=1, window=10, alpha=0.025, min_alpha=0.0001, **common_kwargs),

    # PV-DM w/ concatenation; window=5 (both sides) approximates paper's apparent 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, window=5, alpha=0.025, min_alpha=0.0001, **common_kwargs),
]

for model in simple_models:
    model.build_vocab(alldocs)
    print(f"{model} vocabulary scanned & state initialized")

models_by_name = OrderedDict((str(model), model) for model in simple_models)

Doc2Vec(dbow,d100,n5,mc2,t4) vocabulary scanned & state initialized
Doc2Vec(dm/m,d100,n5,w10,mc2,t4) vocabulary scanned & state initialized
Doc2Vec(dm/c,d100,n5,w5,mc2,t4) vocabulary scanned & state initialized
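
The two PV-DM variants differ in how they combine the document vector with the context word vectors, and that changes the size of the layer the model has to learn. The following sketch computes that width under the assumption, based on our reading of Gensim's concatenation mode, that one slot is reserved for each document tag and for each context word on both sides. The function `pv_dm_input_width` is a hypothetical helper, not part of Gensim.

```python
def pv_dm_input_width(vector_size, window, dm_concat, dm_tag_count=1):
    """Width of the vector fed to the prediction layer in PV-DM."""
    if dm_concat:
        # One slot per document tag plus one per context word on each side.
        return (dm_tag_count + 2 * window) * vector_size
    # Averaging (or summing) keeps the width equal to a single vector.
    return vector_size


print(pv_dm_input_width(100, window=10, dm_concat=False))  # 100
print(pv_dm_input_width(100, window=5, dm_concat=True))    # 1100
```

Under this assumption the concatenated model has an input eleven times wider than the averaging model, which is one plausible reason it is slower and harder to train well in 10 epochs.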


As described in the paper, combining a paragraph vector from Distributed Bag of Words (DBOW) with one from Distributed Memory (DM) improves performance. Here we pair the simple models built above to create new combined models.

In [67]:

from gensim.test.test_doc2vec import ConcatenatedDoc2Vec
models_by_name['dbow+dmm'] = ConcatenatedDoc2Vec([simple_models[0], simple_models[1]])
models_by_name['dbow+dmc'] = ConcatenatedDoc2Vec([simple_models[0], simple_models[2]])
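
Conceptually, a combined model does not learn anything new: it looks up the same document tag in each underlying model and joins the vectors end to end. A minimal sketch of that idea, using plain dictionaries as stand-ins for trained models (the class `ConcatenatedLookup` and the numbers are hypothetical):

```python
class ConcatenatedLookup:
    """Look up a tag in several vector tables and join the results."""

    def __init__(self, tables):
        self.tables = tables

    def __getitem__(self, tag):
        joined = []
        for table in self.tables:
            joined.extend(table[tag])
        return joined


dbow_vectors = {0: [0.1, 0.2], 1: [0.3, 0.4]}            # stand-in for a PV-DBOW model
dm_vectors = {0: [0.5, 0.6, 0.7], 1: [0.8, 0.9, 1.0]}    # stand-in for a PV-DM model

combined = ConcatenatedLookup([dbow_vectors, dm_vectors])
print(combined[0])  # [0.1, 0.2, 0.5, 0.6, 0.7]
```

With two 100-dimensional models, the classifier therefore sees 200 features per document.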


Prediction Evaluation Method

We run three experiments:

1) The first uses logistic regression to predict sentiment, evaluates each model, and sorts the error rates to find the best model.





In [79]:
import numpy as np
import statsmodels.api as sm
from random import sample

def logistic_predictor_from_data(train_targets, train_regressors):
    """Fit a statsmodel logistic predictor on supplied data"""
    logit = sm.Logit(train_targets, train_regressors)
    predictor = logit.fit(disp=0)
    return predictor

def error_rate_for_model(test_model, train_set, test_set):
    """Report error rate on test_doc sentiments, using supplied model and train_docs"""

    train_targets = [doc.sentiment for doc in train_set]
    train_regressors = [test_model.dv[doc.tags[0]] for doc in train_set]
    train_regressors = sm.add_constant(train_regressors)
    predictor = logistic_predictor_from_data(train_targets, train_regressors)

    test_regressors = [test_model.dv[doc.tags[0]] for doc in test_set]
    test_regressors = sm.add_constant(test_regressors)

    # Predict & evaluate
    test_predictions = predictor.predict(test_regressors)
    corrects = sum(np.rint(test_predictions) == [doc.sentiment for doc in test_set])
    errors = len(test_predictions) - corrects
    error_rate = float(errors) / len(test_predictions)
    return (error_rate, errors, len(test_predictions), predictor)
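
The evaluation step reduces to thresholding the predicted probabilities at 0.5 and counting mismatches against the true labels. A small stand-alone sketch of that arithmetic, with made-up probabilities and labels (the helper `error_rate` is hypothetical and independent of statsmodels):

```python
def error_rate(probabilities, labels, threshold=0.5):
    """Fraction of documents whose thresholded prediction disagrees with the label."""
    predictions = [1.0 if p > threshold else 0.0 for p in probabilities]
    errors = sum(1 for pred, label in zip(predictions, labels) if pred != label)
    return errors / len(labels)


probs = [0.91, 0.12, 0.55, 0.40]
labels = [1.0, 0.0, 0.0, 1.0]
print(error_rate(probs, labels))  # 0.5
```

Here the last two documents are misclassified, so two of four predictions are wrong.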

Note:
Running this part takes almost an hour.



In [80]:
from collections import defaultdict
error_rates = defaultdict(lambda: 1.0)  # To selectively print only best errors achieved

In [81]:
from random import shuffle
shuffled_alldocs = alldocs[:]
shuffle(shuffled_alldocs)

print("Train our Simple Doc2Vec models and Evluate the results: (Based on Logistic Regression)\n")
for model in simple_models:
    print(f"Training {model}")
    model.train(shuffled_alldocs, total_examples=len(shuffled_alldocs), epochs=model.epochs)

    print(f"\nEvaluating {model}")
    err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs)
    error_rates[str(model)] = err_rate
    print("\n%f %s\n" % (err_rate, model))

print("Train our Combined Doc2Vec models and Evluate the results: (Based on Logistic Regression)\n")
for model in [models_by_name['dbow+dmm'], models_by_name['dbow+dmc']]:
    print(f"\nEvaluating {model}")
    err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs)
    error_rates[str(model)] = err_rate
    print(f"\n{err_rate} {model}\n")

Train our Simple Doc2Vec models and Evluate the results: (Based on Logistic Regression)

Training Doc2Vec(dbow,d100,n5,mc2,t4)

Evaluating Doc2Vec(dbow,d100,n5,mc2,t4)

0.104920 Doc2Vec(dbow,d100,n5,mc2,t4)

Training Doc2Vec(dm/m,d100,n5,w10,mc2,t4)

Evaluating Doc2Vec(dm/m,d100,n5,w10,mc2,t4)

0.186120 Doc2Vec(dm/m,d100,n5,w10,mc2,t4)

Training Doc2Vec(dm/c,d100,n5,w5,mc2,t4)

Evaluating Doc2Vec(dm/c,d100,n5,w5,mc2,t4)

0.354520 Doc2Vec(dm/c,d100,n5,w5,mc2,t4)

Train our Combined Doc2Vec models and Evluate the results: (Based on Logistic Regression)


Evaluating Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/m,d100,n5,w10,mc2,t4)

0.10464 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/m,d100,n5,w10,mc2,t4)


Evaluating Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4)

0.10564 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4)



Here we sort the models by the error rates obtained.

In [82]:
print("Error Rate and Model Name: (Based on Logistic Regression)\n")
for rate, name in sorted((rate, name) for name, rate in error_rates.items()):
    print(f"{rate} \t {name}")

Error Rate and Model Name: (Based on Logistic Regression)

0.10464 	 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/m,d100,n5,w10,mc2,t4)
0.10492 	 Doc2Vec(dbow,d100,n5,mc2,t4)
0.10564 	 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4)
0.18612 	 Doc2Vec(dm/m,d100,n5,w10,mc2,t4)
0.35452 	 Doc2Vec(dm/c,d100,n5,w5,mc2,t4)


2) The second uses a random forest to predict sentiment, evaluates each model, and sorts the error rates to find the best model.

In [83]:
import numpy as np
from random import sample
from sklearn.ensemble import RandomForestClassifier

def RandomForest_predictor_from_data(train_targets, train_regressors):
 
    clf = RandomForestClassifier(max_depth=2, random_state=0)
    
    predictor = clf.fit(train_regressors,train_targets)
    return predictor

def error_rate_for_ranndomforest_model(test_model, train_set, test_set):
    """Report error rate on test_doc sentiments, using supplied model and train_docs"""

    train_targets = [doc.sentiment for doc in train_set]
    train_regressors = [test_model.dv[doc.tags[0]] for doc in train_set]
    predictor = RandomForest_predictor_from_data(train_targets, train_regressors)

    test_regressors = [test_model.dv[doc.tags[0]] for doc in test_set]

    # Predict & evaluate
    test_predictions = predictor.predict(test_regressors)
    corrects = sum(np.rint(test_predictions) == [doc.sentiment for doc in test_set])
    errors = len(test_predictions) - corrects
    error_rate = float(errors) / len(test_predictions)
    return (error_rate, errors, len(test_predictions), predictor)

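Note:
Running this part takes almost an hour. The loop below calls `model.train` again on the models already trained in the first experiment, so they receive additional training passes before this evaluation.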

In [84]:
from collections import defaultdict
error_rates_rf = defaultdict(lambda: 1.0)  # To selectively print only best errors achieved

In [85]:
from random import shuffle
shuffled_alldocs = alldocs[:]
shuffle(shuffled_alldocs)

print("Train our Simple Doc2Vec models and Evluate the results (Based on RandomForest):\n")
for model in simple_models:
    print(f"Training {model}")
    model.train(shuffled_alldocs, total_examples=len(shuffled_alldocs), epochs=model.epochs)

    print(f"\nEvaluating {model}")
    err_rate, err_count, test_count, predictor = error_rate_for_ranndomforest_model(model, train_docs, test_docs)
    error_rates_rf[str(model)] = err_rate
    print("\n%f %s\n" % (err_rate, model))

print("Train our Combined Doc2Vec models and Evluate the results (Based on RandomForest):\n")
for model in [models_by_name['dbow+dmm'], models_by_name['dbow+dmc']]:
    print(f"\nEvaluating {model}")
    err_rate, err_count, test_count, predictor = error_rate_for_ranndomforest_model(model, train_docs, test_docs)
    error_rates_rf[str(model)] = err_rate
    print(f"\n{err_rate} {model}\n")

Train our Simple Doc2Vec models and Evluate the results (Based on RandomForest):

Training Doc2Vec(dbow,d100,n5,mc2,t4)

Evaluating Doc2Vec(dbow,d100,n5,mc2,t4)

0.215840 Doc2Vec(dbow,d100,n5,mc2,t4)

Training Doc2Vec(dm/m,d100,n5,w10,mc2,t4)

Evaluating Doc2Vec(dm/m,d100,n5,w10,mc2,t4)

0.273120 Doc2Vec(dm/m,d100,n5,w10,mc2,t4)

Training Doc2Vec(dm/c,d100,n5,w5,mc2,t4)

Evaluating Doc2Vec(dm/c,d100,n5,w5,mc2,t4)

0.356240 Doc2Vec(dm/c,d100,n5,w5,mc2,t4)

Train our Combined Doc2Vec models and Evluate the results (Based on RandomForest):


Evaluating Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/m,d100,n5,w10,mc2,t4)

0.23804 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/m,d100,n5,w10,mc2,t4)


Evaluating Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4)

0.21232 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4)



In [86]:
print("Error Rate and Model Name: (Based on RandomForest)\n")
for rate, name in sorted((rate, name) for name, rate in error_rates_rf.items()):
    print(f"{rate} \t {name}")

Error Rate and Model Name: (Based on RandomForest)

0.21232 	 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4)
0.21584 	 Doc2Vec(dbow,d100,n5,mc2,t4)
0.23804 	 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/m,d100,n5,w10,mc2,t4)
0.27312 	 Doc2Vec(dm/m,d100,n5,w10,mc2,t4)
0.35624 	 Doc2Vec(dm/c,d100,n5,w5,mc2,t4)


3) The third uses GaussianNB to predict sentiment, evaluates each model, and sorts the error rates to find the best model.

In [87]:
import numpy as np
from random import sample
from sklearn.naive_bayes import GaussianNB

def GaussianNB_predictor_from_data(train_targets, train_regressors):
 
    clf = GaussianNB()
    predictor = clf.fit(train_regressors,train_targets)
    return predictor

def error_rate_for_GaussianNB_model(test_model, train_set, test_set):
    """Report error rate on test_doc sentiments, using supplied model and train_docs"""

    train_targets = [doc.sentiment for doc in train_set]
    train_regressors = [test_model.dv[doc.tags[0]] for doc in train_set]
    predictor = GaussianNB_predictor_from_data(train_targets, train_regressors)

    test_regressors = [test_model.dv[doc.tags[0]] for doc in test_set]

    # Predict & evaluate
    test_predictions = predictor.predict(test_regressors)
    corrects = sum(np.rint(test_predictions) == [doc.sentiment for doc in test_set])
    errors = len(test_predictions) - corrects
    error_rate = float(errors) / len(test_predictions)
    return (error_rate, errors, len(test_predictions), predictor)
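
For readers unfamiliar with Gaussian naive Bayes, the classifier fits one mean and one variance per feature and per class, then picks the class with the highest log posterior. The sketch below is a simplified pure-Python version on invented two-dimensional data. It is not scikit-learn's implementation (its variance smoothing, for example, works differently), and the names `fit_gaussian_nb`, `predict_gaussian_nb`, and `var_floor` are assumptions of this sketch.

```python
import math


def fit_gaussian_nb(vectors, labels, var_floor=1e-9):
    """Estimate a prior plus per-feature mean and variance for each class."""
    stats = {}
    for cls in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == cls]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        variances = [
            sum((x - m) ** 2 for x in col) / len(rows) + var_floor
            for col, m in zip(zip(*rows), means)
        ]
        stats[cls] = (len(rows) / len(labels), means, variances)
    return stats


def predict_gaussian_nb(stats, vector):
    """Pick the class with the highest log posterior."""
    def log_posterior(cls):
        prior, means, variances = stats[cls]
        score = math.log(prior)
        for x, m, var in zip(vector, means, variances):
            # Log density of a univariate Gaussian, features treated as independent.
            score -= 0.5 * math.log(2 * math.pi * var) + (x - m) ** 2 / (2 * var)
        return score
    return max(stats, key=log_posterior)


train_vectors = [[1.0, 2.0], [1.2, 1.8], [-1.0, -2.0], [-0.8, -2.2]]
train_labels = [1.0, 1.0, 0.0, 0.0]
nb_stats = fit_gaussian_nb(train_vectors, train_labels)
print(predict_gaussian_nb(nb_stats, [0.9, 1.9]))    # 1.0
print(predict_gaussian_nb(nb_stats, [-1.1, -1.9]))  # 0.0
```

The independence assumption is worth keeping in mind when the inputs are concatenated document vectors, whose dimensions are not guaranteed to be independent.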

Note:
Running this part takes almost an hour, and it again continues training the same models before evaluating them.

In [104]:
from collections import defaultdict
error_rates_NB = defaultdict(lambda: 1.0)  # To selectively print only best errors achieved

In [105]:
from random import shuffle
shuffled_alldocs = alldocs[:]
shuffle(shuffled_alldocs)

print("Train our Simple Doc2Vec models and Evluate the results (Based on GaussianNB):\n")
for model in simple_models:
    print(f"Training {model}")
    model.train(shuffled_alldocs, total_examples=len(shuffled_alldocs), epochs=model.epochs)

    print(f"\nEvaluating {model}")
    err_rate, err_count, test_count, predictor = error_rate_for_GaussianNB_model(model, train_docs, test_docs)
    error_rates_NB[str(model)] = err_rate
    print("\n%f %s\n" % (err_rate, model))

print("Train our Combined Doc2Vec models and Evluate the results (Based on GaussianNB):\n")
for model in [models_by_name['dbow+dmm'], models_by_name['dbow+dmc']]:
    print(f"\nEvaluating {model}")
    err_rate, err_count, test_count, predictor = error_rate_for_GaussianNB_model(model, train_docs, test_docs)
    error_rates_NB[str(model)] = err_rate
    print(f"\n{err_rate} {model}\n")

Train our Simple Doc2Vec models and Evluate the results (Based on GaussianNB):

Training Doc2Vec(dbow,d100,n5,mc2,t4)

Evaluating Doc2Vec(dbow,d100,n5,mc2,t4)

0.126560 Doc2Vec(dbow,d100,n5,mc2,t4)

Training Doc2Vec(dm/m,d100,n5,w10,mc2,t4)

Evaluating Doc2Vec(dm/m,d100,n5,w10,mc2,t4)

0.258960 Doc2Vec(dm/m,d100,n5,w10,mc2,t4)

Training Doc2Vec(dm/c,d100,n5,w5,mc2,t4)

Evaluating Doc2Vec(dm/c,d100,n5,w5,mc2,t4)

0.277520 Doc2Vec(dm/c,d100,n5,w5,mc2,t4)

Train our Combined Doc2Vec models and Evluate the results (Based on GaussianNB):


Evaluating Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/m,d100,n5,w10,mc2,t4)

0.17052 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/m,d100,n5,w10,mc2,t4)


Evaluating Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4)

0.15196 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4)



In [107]:
print("Error Rate and Model Name: (Based on GaussianNB)\n")
for rate, name in sorted((rate, name) for name, rate in error_rates_NB.items()):
    print(f"{rate} \t {name}")

Error Rate and Model Name: (Based on GaussianNB)

0.12656 	 Doc2Vec(dbow,d100,n5,mc2,t4)
0.15196 	 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4)
0.17052 	 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/m,d100,n5,w10,mc2,t4)
0.25896 	 Doc2Vec(dm/m,d100,n5,w10,mc2,t4)
0.27752 	 Doc2Vec(dm/c,d100,n5,w5,mc2,t4)


# Conclusion and Future Direction
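Doc2Vec is a strong method for predicting sentiment. However, the model has to be chosen carefully and it has to be tuned.
The original papers indicated that the more complex concatenated models had the lowest error rate; however, we found that the simpler DBOW model performed the same or better. With logistic regression, DBOW alone reached an error rate of 10.49%, within 0.03 percentage points of the best combined model (10.46%), and with GaussianNB it was the best model outright (12.66%).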

In [None]:
# Here we compare the results of all three experiments

In [123]:
import pandas as pd
column_names= ["Error_rate_LR","Error_rate_RF","Error_rate_GBN"]
df = pd.DataFrame(columns = column_names , index = error_rates_NB.keys())
df.index.name = "Models"


In [125]:
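# Build each column as a Series so values are aligned by model name, not by insertion order
df["Error_rate_GBN"] = pd.Series(error_rates_NB)
df["Error_rate_RF"] = pd.Series(error_rates_rf)
df["Error_rate_LR"] = pd.Series(error_rates)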

In [126]:
df 

| Models | Error_rate_LR | Error_rate_RF | Error_rate_GBN |
| --- | --- | --- | --- |
| Doc2Vec(dbow,d100,n5,mc2,t4) | 0.10492 | 0.21584 | 0.12656 |
| Doc2Vec(dm/m,d100,n5,w10,mc2,t4) | 0.18612 | 0.27312 | 0.25896 |
| Doc2Vec(dm/c,d100,n5,w5,mc2,t4) | 0.35452 | 0.35624 | 0.27752 |
| Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/m,d100,n5,w10,mc2,t4) | 0.10464 | 0.23804 | 0.17052 |
| Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4) | 0.10564 | 0.21232 | 0.15196 |

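
The comparison can also be summarized programmatically. The sketch below copies the error rates reported above into a plain dictionary (the short model labels and the variable names are ours, chosen for readability) and picks the lowest-error model for each classifier.

```python
# Error rates copied from the results above; short labels are for readability only.
reported = {
    "dbow":      {"LR": 0.10492, "RF": 0.21584, "GNB": 0.12656},
    "dm/m":      {"LR": 0.18612, "RF": 0.27312, "GNB": 0.25896},
    "dm/c":      {"LR": 0.35452, "RF": 0.35624, "GNB": 0.27752},
    "dbow+dm/m": {"LR": 0.10464, "RF": 0.23804, "GNB": 0.17052},
    "dbow+dm/c": {"LR": 0.10564, "RF": 0.21232, "GNB": 0.15196},
}

best_by_classifier = {}
for clf in ("LR", "RF", "GNB"):
    best = min(reported, key=lambda name: reported[name][clf])
    best_by_classifier[clf] = best
    gap = reported["dbow"][clf] - reported[best][clf]
    print(f"{clf}: best = {best} ({reported[best][clf]:.5f}), plain dbow is {gap:.5f} behind")
```

In every column, plain DBOW is either the best model or within about half a percentage point of it.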

# References:

[1]: Andrew M. Dai, Christopher Olah, Quoc V. Le, "Document Embedding with Paragraph Vectors", 2015.

[2]: Quoc V. Le, Tomas Mikolov, "Distributed Representations of Sentences and Documents", 2014.

In [None]:
#References: 
#https://radimrehurek.com/gensim/auto_examples/howtos/run_doc2vec_imdb.html#sphx-glr-download-auto-examples-howtos-run-doc2vec-imdb-py