<i>Note: This notebook is inspired by the Doc2Vec Text Classification series at:</i>
https://github.com/rhasanbd/Document-Embedding-Doc2vec-Text-Classification

# Document Embedding

In the previous notebook, we implemented a Word2Vec model for text classification. That model performs word embedding: it represents each word as a numeric vector.
Now we explore how a whole document can be represented numerically while retaining its word order and semantics.

## Doc2vec

The Doc2vec model is an implementation of the **Paragraph Vector** model proposed by Quoc Le and Tomas Mikolov (2014) in "Distributed Representations of Sentences and Documents".

Doc2vec extends the Word2vec model: every paragraph is mapped to a unique vector D, and every word is mapped to a unique vector W, as in Word2vec. It can construct representations of variable-length input sequences such as sentences, paragraphs, and documents.

## Distributed Memory Model of Paragraph Vectors (PV-DM)
PV-DM is one variant of the Doc2vec model. The paragraph vector acts as a memory that retains what is missing from the current context words, such as the topic of the paragraph.
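
To make the mechanics concrete, the sketch below runs a single PV-DM forward pass with NumPy: the paragraph vector is averaged with the context word vectors, and a softmax over the vocabulary scores candidates for the next word. It is an illustration with randomly initialised weights, not gensim's implementation. The toy vocabulary, the dimensions, the output matrix `U`, and the helper `pv_dm_predict` are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and sizes (assumed for illustration only)
vocab = ["the", "food", "was", "great", "service", "slow"]
word_to_id = {w: i for i, w in enumerate(vocab)}
n_docs, dim = 2, 8

D = rng.normal(scale=0.1, size=(n_docs, dim))      # one vector per paragraph
W = rng.normal(scale=0.1, size=(len(vocab), dim))  # one vector per word
U = rng.normal(scale=0.1, size=(dim, len(vocab)))  # softmax output weights

def pv_dm_predict(doc_id, context_words):
    """Return a probability distribution over the vocabulary for the next word."""
    context_ids = [word_to_id[w] for w in context_words]
    # Average the paragraph vector together with the context word vectors
    hidden = np.vstack([D[doc_id], W[context_ids]]).mean(axis=0)
    logits = hidden @ U
    # Numerically stable softmax
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

probs = pv_dm_predict(0, ["the", "food", "was"])
print(dict(zip(vocab, probs.round(3))))
```

During training, both D and W (and the output weights) would be updated so that the correct next word receives a high probability; because the same paragraph vector is shared by every context window of a paragraph, it ends up storing information the local context lacks. In gensim, this variant is selected by passing `dm=1` to `Doc2Vec`.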

In [3]:
import numpy as np
import pandas as pd
import warnings

import pickle

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer


from gensim.models.doc2vec import Doc2Vec, TaggedDocument

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB


from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, classification_report
from pymongo import MongoClient

## Load and Explore Data

In [4]:
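# Connect to the local MongoDB instance and select the Yelp database
client = MongoClient("mongodb://localhost:27017/")
db = client.yelp_database
# Fetch only the review text of every restaurant (drop the _id field)
df = pd.DataFrame(db.business_restaurant.find({}, {"reviews.text": 1, "_id": 0}))
# Keep the text of the first review of each restaurant
df = df.applymap(lambda x: x[0]['text'])
df.head()  # quick check of the data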

   reviews
0  Bolt is within walking distance of The Drake H...
1  When people say Korean food, what do you think...
2  Feast Buffet at Palace Station Casino\n\nMaybe...
3  I'm such a fan! Our Nishikawa Black Ramen bow...
4  Several of our friends that live in the area s...
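
The loading step can be tried without a running MongoDB instance. The sketch below assumes that the projection `{"reviews.text": 1, "_id": 0}` yields one dict per restaurant whose `reviews` field is a list of `{"text": ...}` sub-documents; the two records are made up for illustration. It uses `Series.apply` on the single column, which should behave the same as the `applymap` call above and avoids the deprecation warning that recent pandas versions raise for `DataFrame.applymap`.

```python
import pandas as pd

# Hypothetical records shaped like the assumed projection result
records = [
    {"reviews": [{"text": "Great ramen."}, {"text": "Long wait."}]},
    {"reviews": [{"text": "Friendly staff."}]},
]

sample_df = pd.DataFrame(records)
# Keep only the text of the first review of each restaurant
sample_df["reviews"] = sample_df["reviews"].apply(lambda revs: revs[0]["text"])
print(sample_df)
```

Note that only the first review per restaurant survives this step, so the number of rows equals the number of restaurants, not the number of reviews.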


## Description of the data

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8688 entries, 0 to 8687
Data columns (total 1 columns):
reviews    8688 non-null object
dtypes: object(1)
memory usage: 68.0+ KB


## Dimension of the data

In [7]:
print("Dimension of the data: ", df.shape)

no_of_rows = df.shape[0]
no_of_columns = df.shape[1]

print("No. of Rows: %d" % no_of_rows)
print("No. of Columns: %d" % no_of_columns)

Dimension of the data:  (8688, 1)
No. of Rows: 8688
No. of Columns: 1


## Create the Document Corpus

In [10]:
corpus = df['reviews']
print("Number of Documents (reviews) in the corpus: ", len(corpus))

Number of Documents (reviews) in the corpus:  8688
