<i> Note: This notebook is inspired by the Doc2Vec Text Classification series at: </i>
https://github.com/rhasanbd/Document-Embedding-Doc2vec-Text-Classification

# Document Embedding

In the previous notebook, we implemented a Word2Vec model for text classification. That model performed word embedding, representing individual words numerically.
Now we explore how whole documents can be represented numerically while retaining word order and semantics.

## Doc2vec

The Doc2vec model is an implementation of the **Paragraph Vector** model proposed by Quoc Le and Tomas Mikolov (2014) in "Distributed Representations of Sentences and Documents".

Doc2vec extends the Word2vec model: every paragraph is mapped to a unique vector D, and every word is mapped to a unique vector W, as in Word2vec. It can construct representations of variable-length input sequences such as sentences, paragraphs, and documents.

## Distributed Memory Model of Paragraph Vectors (PV-DM)
This is one of the variants of the Doc2vec model. The paragraph vector acts as a memory that retains what is missing from the current context words (i.e., the topic of the paragraph).
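
To make the PV-DM idea concrete, here is a minimal pure-Python sketch of one forward pass, assuming the averaging variant: the paragraph vector is averaged with the context word vectors, and the result is scored against every vocabulary word with a softmax to predict the next word. The toy vocabulary, all vector values, the output weights `U`, and the function name `pvdm_predict` are illustrative assumptions, not gensim internals.

```python
import math

# Toy setup: every value below is made up for illustration.
vocab = ["the", "sushi", "was", "fresh"]
W = {                      # one vector per word
    "the":   [0.1, 0.0],
    "sushi": [0.9, 0.2],
    "was":   [0.0, 0.1],
    "fresh": [0.8, 0.3],
}
D = {0: [0.7, 0.4]}        # one vector per paragraph, keyed by tag
U = {                      # output weights, one row per vocabulary word
    "the":   [0.0, 0.1],
    "sushi": [0.5, 0.5],
    "was":   [0.1, 0.0],
    "fresh": [1.0, 0.6],
}

def pvdm_predict(doc_tag, context_words):
    """Average the paragraph vector with the context word vectors,
    then return a softmax distribution over the vocabulary."""
    vectors = [D[doc_tag]] + [W[w] for w in context_words]
    hidden = [sum(col) / len(vectors) for col in zip(*vectors)]
    scores = [sum(h * u for h, u in zip(hidden, U[w])) for w in vocab]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return {w: e / total for w, e in zip(vocab, exps)}

probs = pvdm_predict(0, ["the", "sushi", "was"])
print(max(probs, key=probs.get))  # most probable next word under the toy weights
```

During training, both the word vectors and the paragraph vector would be updated to raise the probability of the word that actually follows the context; the sketch only shows the prediction step.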

In [1]:
import numpy as np
import pandas as pd
import warnings

import pickle

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer


from gensim.models.doc2vec import Doc2Vec, TaggedDocument

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB


from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, classification_report
from pymongo import MongoClient

## Load and Explore Data

In [2]:
client = MongoClient("mongodb://localhost:27017/")
db = client.yelp_database
df = pd.DataFrame(db.business_restaurant.find({},{"reviews.text":1, "_id":0}))
df = df.applymap(lambda x : x[0]['text'])
df.head() #Quick Check of the data

|   | reviews |
|---|---------|
| 0 | Bolt is within walking distance of The Drake H... |
| 1 | When people say Korean food, what do you think... |
| 2 | Feast Buffet at Palace Station Casino\n\nMaybe... |
| 3 | I'm such a fan! Our Nishikawa Black Ramen bow... |
| 4 | Several of our friends that live in the area s... |


## Description of the data

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8688 entries, 0 to 8687
Data columns (total 1 columns):
reviews    8688 non-null object
dtypes: object(1)
memory usage: 68.0+ KB


## Dimension of the data

In [4]:
print("Dimension of the data: ", df.shape)

no_of_rows = df.shape[0]
no_of_columns = df.shape[1]

print("No. of Rows: %d" % no_of_rows)
print("No. of Columns: %d" % no_of_columns)

Dimension of the data:  (8688, 1)
No. of Rows: 8688
No. of Columns: 1


## Create the Document Corpus

In [5]:
corpus = df['reviews']
print("Number of Documents (reviews) in the corpus: ", len(corpus))

Number of Documents (reviews) in the corpus:  8688


## Pre-process the Data
Pre-processing of the text data is done using the following steps:

- Convert to lowercase
- Tokenize (split the documents into tokens or words)
- Remove numbers, but not words that contain numbers
- Remove short words (the code below drops tokens of three characters or fewer)
- Lemmatize the tokens/words

## Tokenization and Lemmatization
We convert all text to lowercase and then tokenize each document using NLTK's regular-expression tokenizer class, "RegexpTokenizer", which splits a string into substrings using a regular expression.
Then we remove numbers and very short words (three characters or fewer in the code below), since they usually carry little useful information and occur very frequently.
Finally, we lemmatize the tokens using NLTK's WordNetLemmatizer, which reduces each token to its dictionary base form (lemma) using WordNet.

In [6]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

def docs_preprocessor(docs):
    '''Function to Convert the 2D Document Array into a 2D Array of Tokenized Documents'''
    tokenizer = RegexpTokenizer(r'\w+') # Tokenize the words.
    
    for idx in range(len(docs)):
        docs[idx] = docs[idx].lower()  # Convert to lowercase.
        docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

    # Remove numbers, but not words that contain numbers.
    docs = [[token for token in doc if not token.isdigit()] for doc in docs]
    
    # Remove short words (three characters or fewer).
    docs = [[token for token in doc if len(token) > 3] for doc in docs]
    
    # Lemmatize all words in documents.
    lemmatizer = WordNetLemmatizer()
    docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]
  
    return docs
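
As a quick illustration of the filtering steps, here is a minimal sketch that uses only the standard `re` module. The `\w+` pattern mirrors the one passed to RegexpTokenizer, and `min_len=4` matches the `len(token) > 3` filter. Lemmatization is left out because it needs the WordNet data; the function name `tokenize_and_filter` and the sample sentence are assumptions made for this example.

```python
import re

def tokenize_and_filter(text, min_len=4):
    """Lowercase, split on word characters, drop pure numbers and short tokens."""
    tokens = re.findall(r"\w+", text.lower())
    tokens = [t for t in tokens if not t.isdigit()]   # "table4two" survives, "10" does not
    return [t for t in tokens if len(t) >= min_len]

sample = "Our 2 friends SWEAR by this place; table4two was ready in 10 minutes."
print(tokenize_and_filter(sample))
```

Note that a length threshold of four also drops common short words such as "our", "by", and "was" before any stop-word list is applied.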

In [7]:
%%time

# Convert a list of sentences to a list of lists containing tokenized words
docs = docs_preprocessor(corpus)
print("Length of the 2D Array of Tokenized Documents: ", len(docs))

# Store the data locally (the context manager closes the file after writing)
with open("tokenized_reviews_doc2vec.p", "wb") as f:
    pickle.dump(docs, f)

Length of the 2D Array of Tokenized Documents:  8688
Wall time: 7.21 s


## Remove all stop words
- Stop words are words like “and”, “the”, “him”, which are presumed to be uninformative in representing the content of a text. 
- The stop words may be removed to avoid them being construed as signal for prediction.
- To remove the stop words, we use the "stopwords" module from the nltk library.

In [8]:
# Load library
from nltk.corpus import stopwords

# You will have to download the set of stop words the first time
import nltk
nltk.download('stopwords')

# Load stop words
stop_words = stopwords.words('english')

# Show stop words
stop_words[:5]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rojin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i', 'me', 'my', 'myself', 'we']

In [9]:
# Remove all stop words from the doc
for i in range(len(docs)):
    docs[i] = [word for word in docs[i] if word not in stop_words]
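
The same filtering can be sketched on a toy document. This example assumes a small hand-picked stop list (`demo_stop_words`) rather than NLTK's full English list, and stores it in a set, since membership tests on a set take constant time on average while a list is scanned element by element.

```python
# Hand-picked stop words for illustration; the notebook uses NLTK's English list.
demo_stop_words = {"the", "and", "that", "this", "were"}

demo_doc = ["this", "place", "that", "were", "fresh", "and", "tasteful"]

# Keep only the tokens that are not stop words.
filtered = [word for word in demo_doc if word not in demo_stop_words]
print(filtered)
```

With the real list, `set(stopwords.words('english'))` would give the same result as the list version, only with faster lookups on a large corpus.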

## Compute Bigrams/Trigrams:

- N-grams are combinations of adjacent words or letters of length 'n' that you can find in your source text. These combinations of words carry a special meaning. For example: car-pool is an n-gram formed using the two words car and pool that carries a distinct meaning different from the individual words. 

- If n=2, it is called a Bigram and if n=3, it is called a Trigram.

- We find all the combinations of Bigrams and Trigrams. Then, we keep only the frequent phrases. 
- We finally add the frequent phrases to the original data, since we would like to keep the words “car” and “pool” as well as the bigram “car_pool”.
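
The idea of keeping frequent adjacent pairs and appending them to the document can be sketched in plain Python. This is a simplified stand-in that uses raw pair counts only; gensim's Phrases additionally scores each candidate pair against a threshold. The function name `add_frequent_bigrams` and the toy documents are assumptions made for this example.

```python
from collections import Counter

def add_frequent_bigrams(docs, min_count=2):
    """Append 'a_b' tokens for adjacent pairs seen at least min_count times."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))          # count adjacent word pairs
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    result = []
    for doc in docs:
        phrases = ["_".join(pair) for pair in zip(doc, doc[1:]) if pair in frequent]
        result.append(doc + phrases)                   # keep the original words too
    return result

toy_docs = [
    ["join", "car", "pool", "today"],
    ["car", "pool", "lane", "open"],
    ["pool", "party", "tonight"],
]
print(add_frequent_bigrams(toy_docs)[0])
```

Only ("car", "pool") occurs twice in the toy corpus, so it is the only pair that becomes a phrase token; the individual words "car" and "pool" stay in the document.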

In [10]:
from gensim.models import Phrases

# Add bigrams and trigrams to docs (only ones that appear 10 times or more).
bigram = Phrases(docs, min_count=10, threshold=100)
trigram = Phrases(bigram[docs], min_count=10,  threshold=100)

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
    for token in trigram[docs[idx]]:
        if '_' in token:
            # Token is a trigram, add to document.
            docs[idx].append(token)

## Create Tagged Documents
To train the Doc2Vec model, we need to create tagged documents.
A tagged document is a single document made up of a list of words and a list of tags.

In [11]:
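# Load the list of lists containing tokenized words.
# Note: this pickle was written in cell In [7], before stop-word removal and
# phrase detection, so reloading it discards the results of those two steps.
with open("tokenized_reviews_doc2vec.p", "rb") as f:
    docs = pickle.load(f)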
#print(docs[0])

# Create Tagged documents
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(docs)]
#print(documents[0])
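
To show the shape of a tagged document without loading gensim, here is a minimal sketch using a namedtuple with the same two fields, `words` and `tags`. The name `DemoTaggedDocument` and the toy documents are assumptions made for this example; the notebook itself uses gensim's TaggedDocument.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument.
DemoTaggedDocument = namedtuple("DemoTaggedDocument", ["words", "tags"])

toy_docs = [["fresh", "sushi"], ["slow", "service"]]

# Tag each document with its position in the corpus, as in the cell above.
tagged = [DemoTaggedDocument(words, [i]) for i, words in enumerate(toy_docs)]
print(tagged[1])
```

Using the document index as the tag gives every review its own vector, which can later be looked up by that index.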

In [12]:
print(documents[4])

TaggedDocument(['several', 'friend', 'that', 'live', 'area', 'swear', 'this', 'place', 'after', 'having', 'spent', 'quiet', 'afternoon', 'walking', 'around', 'chaparral', 'park', 'felt', 'that', 'time', 'some', 'sushi', 'since', 'were', 'just', 'down', 'street', 'figured', 'that', 'would', 'finally', 'give', 'this', 'place', 'what', 'fuss', 'about', 'there', 'early', 'were', 'only', 'customer', 'that', 'time', 'wife', 'that', 'adventurous', 'with', 'sushi', 'refuse', 'fish', 'meat', 'other', 'animal', 'protein', 'opted', 'standard', 'california', 'roll', 'vega', 'roll', 'also', 'ordered', 'shrimp', 'crab', 'tempura', 'good', 'measure', 'weren', 'exactly', 'trying', 'push', 'envelope', 'just', 'wanted', 'this', 'restaurant', 'sushi', 'acumen', 'were', 'wholly', 'surprised', 'delectable', 'dish', 'that', 'were', 'served', 'were', 'quite', 'fresh', 'tasteful', 'overwrought', 'addition', 'fresh', 'wasabi', 'added', 'delicate', 'spicy', 'touch', 'meal', 'definitely', 'treat', 'have', 'parta

## Training the Doc2vec Model
We use the gensim.models.Doc2Vec class.

    class gensim.models.doc2vec.Doc2Vec(documents=None, corpus_file=None, dm_mean=None, dm=1, dbow_words=0, dm_concat=0, dm_tag_count=1, docvecs=None, docvecs_mapfile=None, comment=None, trim_rule=None, callbacks=(), **kwargs)

## Create the Doc2vec Model and Train

In [13]:
%%time
# Set training parameters
doc_vector_length = 600       # Dimension of the document vector
window_size = 2               # We set it 2 as the sentences weren't too long
epochs = 600                  # Number of iterations (epochs) over the corpus
min_count = 100                 # Ignores all words with total frequency lower than min_count
workers = 4                   # Number of worker threads to train the model

Wall time: 0 ns


In [14]:
%%time

# Create the Doc2vec model using gensim (If dm=1, ‘distributed memory’ (DM) algorithm is used)
model = Doc2Vec(vector_size=doc_vector_length, dm=1, window=window_size, min_count=min_count, 
                workers=workers, epochs=epochs, seed =1) # sample=0.01
# Create vocabulary
model.build_vocab(documents)

# Train the model
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

Wall time: 14min 33s


## Save the Model Locally

In [15]:
model.save('d2v_model_reviews')

## Load the Saved Model

In [16]:
# load doc2vec model
model = Doc2Vec.load('d2v_model_reviews')

In [17]:
# View the vocabulary size (gensim < 4.0; in gensim 4+, use len(model.wv.key_to_index))
print("Vocabulary Size: ", len(model.wv.vocab))

Vocabulary Size:  1378
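
The vocabulary is much smaller than the number of distinct tokens in the corpus because `min_count` drops every word whose total frequency is below the threshold. A minimal sketch of that pruning rule, assuming a toy corpus and a hypothetical helper named `prune_vocab_demo`:

```python
from collections import Counter

def prune_vocab_demo(docs, min_count):
    """Return the set of tokens whose total corpus frequency is at least min_count."""
    counts = Counter(token for doc in docs for token in doc)
    return {token for token, n in counts.items() if n >= min_count}

toy_docs = [["sushi", "fresh", "sushi"], ["sushi", "roll", "fresh"], ["wasabi"]]
print(sorted(prune_vocab_demo(toy_docs, min_count=2)))
```

Words that fall below the threshold, such as "roll" and "wasabi" here, get no word vector, so looking them up in the trained model raises an error.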


## Model Evaluation

## Evaluation 1: Find Similar Words

In [18]:
model.wv.most_similar('restaurant')

[('place', 0.47038960456848145),
 ('location', 0.3314688205718994),
 ('food', 0.2027396708726883),
 ('cafe', 0.19334515929222107),
 ('business', 0.1923389434814453),
 ('dish', 0.17900845408439636),
 ('eatery', 0.1768079549074173),
 ('shop', 0.17461654543876648),
 ('area', 0.16709855198860168),
 ('chocolate', 0.1661548912525177)]

In [20]:
model.wv.most_similar(positive=['milk'])

[('soda', 0.14916124939918518),
 ('croissant', 0.14741094410419464),
 ('sauce', 0.13659267127513885),
 ('mimosa', 0.13049207627773285),
 ('wine', 0.1294557899236679),
 ('fruit', 0.1282532960176468),
 ('date', 0.11846138536930084),
 ('syrup', 0.11613233387470245),
 ('sprout', 0.11422225832939148),
 ('crust', 0.11269959062337875)]

In [21]:
model.wv.most_similar(positive=['drink'])

[('coffee', 0.18316936492919922),
 ('delivery', 0.1599237620830536),
 ('dish', 0.14941267669200897),
 ('salty', 0.14915961027145386),
 ('margarita', 0.1485767364501953),
 ('seat', 0.14537785947322845),
 ('item', 0.14437422156333923),
 ('plate', 0.14204436540603638),
 ('cocktail', 0.13997963070869446),
 ('beer', 0.13968591392040253)]

## Evaluation 2: Find Top N Similar Words

In [22]:
model.wv.similar_by_word('restaurant', topn=5)

[('place', 0.47038960456848145),
 ('location', 0.3314688205718994),
 ('food', 0.2027396708726883),
 ('cafe', 0.19334515929222107),
 ('business', 0.1923389434814453)]

In [23]:
model.wv.similar_by_word('server', topn=5)

[('waitress', 0.2653250992298126),
 ('staff', 0.2503458559513092),
 ('waiter', 0.20660999417304993),
 ('employee', 0.1819867640733719),
 ('they', 0.16565030813217163)]

## Evaluation 3: Find Similarity Values

In [24]:
model.wv.similarity("breakfast", "egg")

-0.025044592

In [25]:
model.wv.similarity("breakfast", "morning")

0.075337075

In [26]:
model.wv.similarity("breakfast", "bacon")

0.06767225

In [27]:
model.wv.similarity("breakfast", "noodle")

0.10985135

In [28]:
model.wv.similarity("breakfast", "spicy")

0.07295551
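
The similarity values above are cosine similarities between word vectors, which range from -1 (opposite direction) through 0 (orthogonal) to 1 (same direction). A minimal pure-Python sketch of the computation, using made-up two-dimensional vectors:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))    # similar direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))   # opposite direction
```

Read against this scale, the scores reported above all lie close to 0, meaning the model places these word pairs in nearly orthogonal directions.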

## Evaluation 4: Get All Words That Are Closer to Word 1 than Word 2 Is

In [29]:
model.wv.closer_than("wonderful", "nice")

['good',
 'great',
 'delicious',
 'amazing',
 'excellent',
 'awesome',
 'yummy',
 'middle',
 'reminded']
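
`closer_than(w1, w2)` returns every word that is closer to `w1` than `w2` is to `w1`. A minimal sketch of that rule with cosine similarity, assuming made-up toy vectors and a hypothetical helper named `closer_than_demo`:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy vectors; the values are invented for illustration.
toy_vectors = {
    "wonderful": [1.0, 0.1],
    "great":     [0.9, 0.2],
    "nice":      [0.6, 0.6],
    "slow":      [-0.5, 0.8],
}

def closer_than_demo(vectors, w1, w2):
    """Words whose similarity to w1 exceeds the similarity between w1 and w2."""
    threshold = cosine(vectors[w1], vectors[w2])
    return [w for w, v in vectors.items()
            if w not in (w1, w2) and cosine(vectors[w1], v) > threshold]

print(closer_than_demo(toy_vectors, "wonderful", "nice"))
```

The similarity between the two input words acts as the cut-off, so a loosely related second word produces a longer list.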

## Evaluation 5: Perform Vector Translation

In [30]:
model.wv.most_similar(positive=['server', 'staff'], topn=3)  # nearest words to the combined 'server' + 'staff' vector

[('waitress', 0.2890494465827942),
 ('waiter', 0.20884302258491516),
 ('employee', 0.20617488026618958)]

In [31]:
model.wv.most_similar(positive=['glass', 'bottle'], negative=['water'], topn=3)  # glass + bottle - water (cf. woman + king - man ≈ queen)

[('bathroom', 0.14220030605793),
 ('basically', 0.14039336144924164),
 ('dipping', 0.13085144758224487)]
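
The `positive` and `negative` arguments express vector arithmetic: positive vectors are added, negative vectors are subtracted, and the remaining words are ranked by cosine similarity to the result. A minimal sketch of that idea, assuming made-up three-dimensional toy vectors and a hypothetical helper named `analogy_demo`:

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy vectors with axes (royal, male, female); the values are invented.
toy_vectors = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "child": [0.0, 0.5, 0.5],
}

def analogy_demo(vectors, positive, negative, topn=1):
    """Add the positive unit vectors, subtract the negative ones, rank the rest."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    inputs = set(positive) | set(negative)              # input words are excluded
    scores = {w: dot(query, unit(v)) for w, v in vectors.items() if w not in inputs}
    return sorted(scores, key=scores.get, reverse=True)[:topn]

print(analogy_demo(toy_vectors, ["woman", "king"], ["man"]))
```

The toy axes make the analogy work by construction. In the trained model above, the low scores and unrelated results for glass + bottle - water suggest that this corpus did not produce such clean directions.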

## Evaluation 6: Find the Word That Doesn’t Go with the Others

In [32]:
# Which of the below does not belong in the sequence?
model.wv.doesnt_match('restaurant food pool server'.split())


'pool'
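
One way to pick the odd word out is to normalize each word vector, average them, and return the word least similar to that average. A minimal sketch of this approach, assuming made-up toy vectors and a hypothetical helper named `doesnt_match_demo`:

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy vectors; the values are invented for illustration.
toy_vectors = {
    "restaurant": [0.9, 0.1],
    "food":       [0.8, 0.3],
    "server":     [0.7, 0.2],
    "pool":       [0.1, 0.9],
}

def doesnt_match_demo(vectors, words):
    """Return the word whose unit vector is least similar to the mean unit vector."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    return min(words, key=lambda w: dot(units[w], mean))

print(doesnt_match_demo(toy_vectors, "restaurant food pool server".split()))
```

Because the mean is dominated by the three related words, the outlier ends up farthest from it.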