# Word2Vec Embedding as Features

I'm not entirely sure I understand how to generate features using word2vec, so I wanted to start this notebook and run it by you before spending time generating our full feature set and training models.

For this notebook, I will only be working with 100 reviews.

The purpose is to show snippets of code to make sure that I'm going about this in the right way.

In [23]:
from gensim.models.word2vec import Word2Vec
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np


## Load in the raw text from Amazon reviews

This data has already been pre-processed by [amazon_review_preprocessor.py](amazon_review_preprocessor.py)

In [3]:
data_file = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-csv-100-preprocessed.csv"
df = pd.read_csv(data_file)

In [84]:
rb_df = df["review_body"]
y_df = df["star_rating"]
sample_size = len(rb_df)  # number of reviews (rb_df.shape is a tuple, which would end up in the model file name)
# write out the review_body in LineSentence format - the docs say there should be performance gains here
review_body_file = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-csv-100-preprocessed-review_body.csv"
rb_df.to_csv(review_body_file, header=False, index=False)
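
A note on the format: gensim's `LineSentence` reader expects plain text with one sentence per line and tokens separated by whitespace, whereas `to_csv` may wrap a review in quotes if it contains a comma or a quote character. A minimal sketch of writing that format directly, using a hypothetical `write_line_sentences` helper and made-up reviews (neither is part of this notebook):

```python
import os
import tempfile


def write_line_sentences(reviews, path):
    """Write one review per line, tokens separated by single spaces."""
    with open(path, "w", encoding="utf-8") as handle:
        for review in reviews:
            # split() with no arguments collapses any run of whitespace,
            # including newlines embedded in a review
            handle.write(" ".join(str(review).split()) + "\n")


# Made-up stand-ins for rows of df["review_body"]
example_reviews = ["good product  please note", "not floating\ncase"]
example_path = os.path.join(tempfile.mkdtemp(), "review_body.txt")
write_line_sentences(example_reviews, example_path)

with open(example_path, encoding="utf-8") as handle:
    written_lines = handle.read().splitlines()
print(written_lines)
```

If the pre-processed reviews never contain commas or quotes, the CSV written above is probably already equivalent to this.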

## Tokenize reviews

In [20]:
from nltk import WordPunctTokenizer
wpt = WordPunctTokenizer()
# iterating over the Series directly avoids Series.iteritems(), which was removed in pandas 2.0
documents = [wpt.tokenize(value) for value in rb_df]

## Train word2vec to get word vectors

In [21]:
# Set values for various parameters
feature_size = 200    # Word vector dimensionality
window_context = 30   # Context window size
min_word_count = 1    # Minimum word count
sample = 1e-3         # Downsample setting for frequent words
epochs = 50           # Number of iterations over the corpus (avoids shadowing the builtin iter)

# Note: gensim 4.x renamed the `size` and `iter` keywords to `vector_size` and `epochs`
model = Word2Vec(documents, size=feature_size,
                 window=window_context, min_count=min_word_count,
                 sample=sample, iter=epochs)


In [None]:
# save off model to be used later
model.save(f"models/word2vec-{sample_size}-{feature_size}.model")

## Inspect how many words are in our vocabulary

Looks like there are about a thousand words in our vocabulary.

In [27]:
model.wv.vectors.shape
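
With `min_count=1` every distinct token should get a vector, so the first dimension of `model.wv.vectors.shape` should match the number of unique tokens in `documents`. A small sketch for checking that, and for seeing how a higher `min_count` would shrink the vocabulary, using made-up token lists and a hypothetical `vocabulary_size` helper:

```python
from collections import Counter


def vocabulary_size(tokenized_docs, min_count=1):
    """Count distinct tokens that occur at least min_count times in the corpus."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return sum(1 for count in counts.values() if count >= min_count)


# Made-up stand-in for the `documents` list of token lists
example_docs = [
    ["good", "product", "good", "price"],
    ["not", "good", "case"],
    ["good", "price"],
]
size_all = vocabulary_size(example_docs, min_count=1)
size_frequent = vocabulary_size(example_docs, min_count=2)
print(size_all, size_frequent)
```

On the real data this would be called as `vocabulary_size(documents)` and compared with the shape above.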

In [53]:
first_review = rb_df.iloc[0]
vector_list = []
for word in wpt.tokenize(first_review):
#     print(f'{word}: {model.wv.get_vector(word)}')
    word_vector = model.wv.get_vector(word)
    print(f'{word}: {word_vector.shape}')
    vector_list.append(word_vector)

review_feature = np.average(vector_list, axis=0)
print(review_feature.shape)

good: (200,)
product: (200,)
please: (200,)
note: (200,)
not: (200,)
floating: (200,)
case: (200,)
do: (200,)
not: (200,)
clai: (200,)
somehow: (200,)
thinking: (200,)
good: (200,)
price: (200,)
does: (200,)
says: (200,)
(200,)


## Now we come up with a vector that represents each review by averaging the word vectors for every word in the review body

In [77]:
def get_review_vector(review):
    # look up the word vector for every word in the review
    word_vectors = [model.wv.get_vector(word) for word in wpt.tokenize(review)]
    # average all word vectors to come up with the final vector for the review
    return np.average(word_vectors, axis=0)

# generate new feature DF: one row per review, one column per vector dimension
print(len(rb_df))
review_vectors = [get_review_vector(review) for review in rb_df]
# building the DF in one step avoids row-by-row DataFrame.append, which was removed in pandas 2.0
x_df = pd.DataFrame(np.array(review_vectors, dtype=np.float64))

x_df.info()


100
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Columns: 200 entries, 0 to 199
dtypes: float64(200)
memory usage: 156.3 KB
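
One thing to watch when scaling up: `get_vector` raises a `KeyError` for a word that is not in the vocabulary. That cannot happen here because `min_count` is 1 and the model was trained on these same reviews, but it could once `min_count` is raised or the model is applied to unseen reviews. A sketch of an averaging function that skips unknown words and falls back to a zero vector, assuming a plain dict of made-up 3-dimensional vectors in place of `model.wv` (the `average_vector` name is hypothetical):

```python
import numpy as np


def average_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        # no usable words in this review: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Made-up 3-dimensional vectors standing in for model.wv
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "price": np.array([3.0, 2.0, 0.0]),
}
review_vec = average_vector(["good", "price", "unknownword"], toy_vectors, dim=3)
empty_vec = average_vector(["unknownword"], toy_vectors, dim=3)
print(review_vec, empty_vec)
```

With the real model, the membership test would presumably be `token in model.wv`. Whether a zero vector is the right fallback is an open choice; dropping such reviews is another option.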


## Now we train a model and see how we do

In [78]:
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

x_train, x_test, y_train, y_test = train_test_split(x_df, y_df)

gb = lgb.LGBMClassifier(objective="multiclass", num_threads=2,
                        seed=1)

gb.fit(x_train, y_train)
y_predict = gb.predict(x_test)

report = classification_report(y_test, y_predict, output_dict=True)

report




{'1': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 1},
 '2': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 1},
 '3': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 5},
 '4': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 4},
 '5': {'precision': 0.5,
  'recall': 0.6428571428571429,
  'f1-score': 0.5625000000000001,
  'support': 14},
 'accuracy': 0.36,
 'macro avg': {'precision': 0.1,
  'recall': 0.1285714285714286,
  'f1-score': 0.11250000000000002,
  'support': 25},
 'weighted avg': {'precision': 0.28,
  'recall': 0.36,
  'f1-score': 0.31500000000000006,
  'support': 25}}
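
To put the 0.36 accuracy in context, the support counts in the report give the score of a classifier that always predicts the most common rating. A quick check using only the numbers printed above:

```python
# Support per star rating, copied from the classification report
support = {"1": 1, "2": 1, "3": 5, "4": 4, "5": 14}
model_accuracy = 0.36  # 'accuracy' entry of the report

total = sum(support.values())
majority_label = max(support, key=support.get)
# accuracy of always predicting the most frequent rating in the test set
baseline_accuracy = support[majority_label] / total

print(f"test reviews: {total}")
print(f"majority rating: {majority_label}, baseline accuracy: {baseline_accuracy:.2f}")
print(f"model beats baseline: {model_accuracy > baseline_accuracy}")
```

On this split the model does worse than always guessing 5 stars (0.36 vs. 0.56), which is not too surprising with only 75 training reviews and very few examples of ratings 1 to 4.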

# Next

* Use the same steps to create word2vec features on a larger sample of reviews (e.g. 50k) and generate a feature vector for each review


# Questions
* Looking at the documentation, gensim also has doc2vec. Should I be using that instead?
* I just took the word2vec parameters from your notebook. I'm not sure what would be reasonable values here.
* I tried using corpus_file instead of sentences, as suggested by the documentation, but got the following error message:

```
    ---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-eb0baade2204> in <module>
      8 model = Word2Vec(corpus_file=review_body_file, size=feature_size, 
      9                           window=window_context, min_count=min_word_count,
---> 10                           sample=sample, iter=iter)

TypeError: __init__() got an unexpected keyword argument 'corpus_file'
```
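
One possible explanation, as far as I can tell: `corpus_file` was added to `Word2Vec` in gensim 3.6.0, so an older installed version would reject the keyword exactly like this. A sketch of a version check, where `supports_corpus_file` is a hypothetical helper and the 3.6 threshold is an assumption I have not verified in this environment:

```python
def supports_corpus_file(version):
    """Return True if a gensim version string is at least 3.6 (assumed threshold)."""
    # only the major and minor parts matter, so suffixes on the patch part are ignored
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (3, 6)


old_ok = supports_corpus_file("3.4.0")
new_ok = supports_corpus_file("3.8.3")
print(old_ok, new_ok)
```

In the notebook this would be called as `supports_corpus_file(gensim.__version__)` after `import gensim`.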