# Word2Vec and FastText Embedding as Features

I'm actually not entirely sure if I understand how to generate features using word2vec and fasttext so I wanted to start this notebook and run this by you before spending time to train aon our feature set and training models

For this notebook, I will only be working with 100 reviews so it runs fast

Purpose is to show snippeets of code to make sure that I am on the right track


Steps I used for both are:
    * convert reviews to tokenized array
    * train either word2vec or fasttext using these tokenized arrays
    * get word vector for each word in review
    * generate new feature matrix by averaging all word vectors into a review - this becomes the feature vector for the review
    * run feature matrices through lightGBM

In [1]:
from gensim.models.word2vec import Word2Vec
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np


## Load in the raw text from amazon reviews

This data has already been pre-proocessed by [amazon_review_preprocessor.py](amazon_review_preprocessor.py)

In [2]:
data_file = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-csv-100-preprocessed.csv"
df = pd.read_csv(data_file)

In [3]:
rb_df = df["review_body"]
y_df = df["star_rating"]
sample_size = rb_df.shape[0]

# this doesn't actually work - I get a parameter error when I try to use corpus_file parameter for word2vec
# write out the review_body in LineSentence format - docs says there should be performance gains here
review_body_file = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-csv-100-preprocessed-review_body.csv"
rb_df.to_csv(review_body_file, header=False, index=False)

## Tokenize reviews

In [4]:
from nltk import WordPunctTokenizer
wpt = WordPunctTokenizer()
documents = [wpt.tokenize(value) for index, value in rb_df.iteritems()]

## Train word2vec to get word vectors

In [5]:
# Set values for various parameters
feature_size = 200    # Word vector dimensionality  
window_context = 30          # Context window size                                                                                    
min_word_count = 1   # Minimum word count                        
sample = 1e-3   # Downsample setting for frequent words
iter = 50 # number of iterations over corpus

w2v_model = Word2Vec(documents, size=feature_size, 
                          window=window_context, min_count=min_word_count,
                          sample=sample, iter=iter)


In [6]:
# save off model to be used later
w2v_model.save(f"models/word2vec-{sample_size}-{feature_size}.model")

## inspect how many words are in our vocabulary

Looks like there are about a thousand words in our vocabulary

In [7]:
w2v_model.wv.vectors.shape

(996, 200)

In [8]:
first_review = rb_df.iloc[0]
vector_list = []
for word in wpt.tokenize(first_review):
#     print(f'{word}: {model.wv.get_vector(word)}')
    word_vector = w2v_model.wv.get_vector(word)
    print(f'{word}: {word_vector.shape}')
    vector_list.append(word_vector)

review_feature = np.average(vector_list, axis=0)
print(review_feature.shape)

good: (200,)
product: (200,)
please: (200,)
note: (200,)
not: (200,)
floating: (200,)
case: (200,)
do: (200,)
not: (200,)
clai: (200,)
somehow: (200,)
thinking: (200,)
good: (200,)
price: (200,)
does: (200,)
says: (200,)
(200,)


## Now we come up with a vector that respresents each of the reviews by averaging word vectors for every word in the review body

In [9]:
def get_review_vector(model, review):
    # returns a list of word vectors for all words im review
    word_vectors = [model.wv.get_vector(word) for word in wpt.tokenize(review)]
#     print(len(word_vectors))
    # average all word vectors to come up with final vector for the review
    return np.average(word_vectors, axis=0)

# generate new feature DF
def get_feature_df(model, df:pd.DataFrame) -> pd.DataFrame:
    f_df = pd.DataFrame()
    for index, review in df.iteritems():
        feature_vector = get_review_vector(model, review)
        # turn this into dictionary so we can add it as row to DF
        feature_dict = dict(enumerate(feature_vector))
        f_df = f_df.append(feature_dict, ignore_index=True)
    return f_df
    


In [10]:
# generate our feature matrix
w2v_x_df = get_feature_df(w2v_model, rb_df)

## Now we train a model and see how we do

In [11]:
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


def train_lightGBM(x_df, y_df):
    x_train, x_test, y_train, y_test = train_test_split(x_df, y_df)

    gb = lgb.LGBMClassifier(objective="multiclass", num_threads=2,
                            seed=1)

    gb.fit(x_train, y_train)
    y_predict = gb.predict(x_test)

    report = classification_report(y_test, y_predict, output_dict=True)
    return report



This means that in case of installing LightGBM from PyPI via the ``pip install lightgbm`` command, you don't need to install the gcc compiler anymore.
Instead of that, you need to install the OpenMP library, which is required for running LightGBM on the system with the Apple Clang compiler.
You can install the OpenMP library by the following command: ``brew install libomp``.


In [12]:
train_lightGBM(w2v_x_df, y_df)

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


{'1': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 3},
 '2': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 0},
 '3': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 3},
 '4': {'precision': 0.3333333333333333,
  'recall': 1.0,
  'f1-score': 0.5,
  'support': 1},
 '5': {'precision': 0.8421052631578947,
  'recall': 0.8888888888888888,
  'f1-score': 0.8648648648648649,
  'support': 18},
 'accuracy': 0.68,
 'macro avg': {'precision': 0.2350877192982456,
  'recall': 0.37777777777777777,
  'f1-score': 0.27297297297297296,
  'support': 25},
 'weighted avg': {'precision': 0.6196491228070176,
  'recall': 0.68,
  'f1-score': 0.6427027027027027,
  'support': 25}}

# Implement FastText

In [13]:
from gensim.models.fasttext import FastText

# Set values for various parameters
feature_size = 200    # Word vector dimensionality  
window_context = 30          # Context window size                                                                                    
min_word_count = 1   # Minimum word count                        
sample = 1e-3   # Downsample setting for frequent words
iter = 50 # number of iterations over corpus

ft_model = FastText(documents, size=feature_size, 
                          window=window_context, min_count=min_word_count,
                          sample=sample, iter=iter)


In [14]:
# check to see we have the same number of words as before
print(len(ft_model.wv.vocab))

996


In [15]:
ft_x_df = get_feature_df(ft_model, rb_df)
ft_x_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Columns: 200 entries, 0 to 199
dtypes: float64(200)
memory usage: 156.3 KB


In [16]:
train_lightGBM(ft_x_df, y_df)

{'1': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 2},
 '2': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 2},
 '3': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 3},
 '4': {'precision': 0.3333333333333333,
  'recall': 0.5,
  'f1-score': 0.4,
  'support': 4},
 '5': {'precision': 0.5882352941176471,
  'recall': 0.7142857142857143,
  'f1-score': 0.6451612903225806,
  'support': 14},
 'accuracy': 0.48,
 'macro avg': {'precision': 0.1843137254901961,
  'recall': 0.24285714285714288,
  'f1-score': 0.20903225806451614,
  'support': 25},
 'weighted avg': {'precision': 0.38274509803921575,
  'recall': 0.48,
  'f1-score': 0.4252903225806451,
  'support': 25}}

# Next

* Use the same steps here and create features using word2vec on larger reviews sample set and generate feature vectors for each review - ie, 50k


# Questions
* looking at documentation looks like gensim has doc2vec - should I be using this instead?
* parameters for word2vec I just took from your notebook - not sure what would be reasonable here?
* I tried using corpus_file instead of sentences as suggested by documentation but got the following error message

```
    ---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-eb0baade2204> in <module>
      8 model = Word2Vec(corpus_file=review_body_file, size=feature_size, 
      9                           window=window_context, min_count=min_word_count,
---> 10                           sample=sample, iter=iter)

TypeError: __init__() got an unexpected keyword argument 'corpus_file'
```