# Amazon Review Feature Engineering


In this notebook, I outline how features are generated. These feature files will be fed into our models later for evaluation.

To generate the features, I wrote a Python program that reads a configuration file and generates all the feature files needed for future notebooks in a single run.

There are ~50k samples in each file.

Here are the variations of features that we generated:

| Name | Description |
|------|-------------|
| BoW | Bag of words. Variations using unigram and bigram |
| TFIDF | Term Frequency - Inverse Document Frequency. Variations using unigram and bigram. Max features is set to 10k so we only retain the most frequent 10k words |
| Word2Vec | Word2Vec embedding. Each review is represented by the average embedding of all words in the review. <br>There are 2 variations of this. The pretrained variation uses word2vec-google-news-300 (300-dimensional vectors); otherwise embeddings are trained on our corpus of reviews, with the feature size set to 100 in the configuration below. <br>There are also a couple of different versions of this using unigram (ngram11) and bigram (ngram22), although the configuration in this notebook only uses ngram_none |
| Fasttext | Fasttext embedding. Each review is represented by the average embedding of all words in the review. The embedding is trained on our corpus of review words, with the feature size set to 100 in the configuration below. <br>There are also a couple of different versions of this using unigram (ngram11) and bigram (ngram22), although the configuration in this notebook only uses ngram_none |


In [1]:
# import libraries
import sys
sys.path.append('..')
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import logging
import util.nlp_util as nlpu

# configure logger so we can see output from the classes
logging.basicConfig(level=logging.INFO)

%matplotlib inline

BASE='20191101-feature_generator-50k'
CONFIG_DIR="../config"
CONFIG_FILE=f'{CONFIG_DIR}/{BASE}.csv'
REPORT_DIR="../reports"
REPORT_FILE=f'{REPORT_DIR}/{BASE}-report.csv'


# Running our Program To Generate File

Run the following command from the tools directory:
    
```
python feature_generator.py ../config/20191101-feature_generator-50k.csv
```


The program loops through each entry in the configuration file and generates a new feature file for each one.

The program dynamically uses *fn_name* column to determine which of the functions below to call.
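The dispatch on *fn_name* can be sketched as follows. This is only an illustration, not the actual `feature_generator.py`: the two generator functions are hypothetical stand-ins that return a string instead of writing a file, and the registry dictionary is an assumption (the real program may look the function up on the `nlp_util` module instead).

```python
import csv
import io

# Hypothetical stand-ins for the real generator functions in util.nlp_util.
def generate_bow_file(description, **params):
    return f"bow:{description}"

def generate_tfidf_file(description, **params):
    return f"tfidf:{description}"

# Registry mapping the fn_name column to a callable (an assumption).
GENERATORS = {
    "generate_bow_file": generate_bow_file,
    "generate_tfidf_file": generate_tfidf_file,
}

def run_config(config_text):
    """Call the generator named in each row's fn_name column."""
    results = []
    for row in csv.DictReader(io.StringIO(config_text)):
        fn = GENERATORS.get(row["fn_name"])
        if fn is None:
            raise ValueError(f"unknown fn_name: {row['fn_name']}")
        # Drop empty cells so that the function defaults apply.
        params = {k: v for k, v in row.items()
                  if v and k not in ("fn_name", "description")}
        results.append(fn(row["description"], **params))
    return results

# A cut-down config with only a few of the real columns.
SAMPLE_CONFIG = """description,fn_name,min_ngram_range,max_ngram_range,max_features
bow-df_default-ngram11,generate_bow_file,1,1,10000
tfidf-df_default-ngram22,generate_tfidf_file,2,2,10000
"""

print(run_config(SAMPLE_CONFIG))
```

Keeping the function name in the configuration means a new feature type only needs a new function and a new row, not a change to the loop.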

# Our configuration file

In [2]:
config = pd.read_csv(CONFIG_FILE)
config


Unnamed: 0,data_dir,data_file,description,fn_name,lda_topics,min_df,max_df,min_ngram_range,max_ngram_range,max_features,feature_size,window_context,min_word_count,sample,iterations,feature_columns,y,status,status_date,message
0,../dataset/amazon_reviews,amazon_reviews_us_Wireless_v1_00-50k-preproces...,bow-df_default-ngram11,generate_bow_file,,,,1.0,1.0,10000.0,,,,,,review_body,"star_rating, helpful_votes, total_votes",success,,
1,../dataset/amazon_reviews,amazon_reviews_us_Wireless_v1_00-50k-preproces...,tfidf-df_default-ngram11,generate_tfidf_file,,,,1.0,1.0,10000.0,,,,,,review_body,"star_rating, helpful_votes, total_votes",success,,
2,../dataset/amazon_reviews,amazon_reviews_us_Wireless_v1_00-50k-preproces...,bow-df_default-ngram22,generate_bow_file,,,,2.0,2.0,10000.0,,,,,,review_body,"star_rating, helpful_votes, total_votes",success,,
3,../dataset/amazon_reviews,amazon_reviews_us_Wireless_v1_00-50k-preproces...,tfidf-df_default-ngram22,generate_tfidf_file,,,,2.0,2.0,10000.0,,,,,,review_body,"star_rating, helpful_votes, total_votes",success,,
4,../dataset/amazon_reviews,amazon_reviews_us_Wireless_v1_00-50k-preproces...,word2vec_pretrained-df_none-ngram_none,generate_word2vec_file,,,,,,,100.0,5.0,5.0,0.001,5.0,review_body,"star_rating, helpful_votes, total_votes",success,,
5,../dataset/amazon_reviews,amazon_reviews_us_Wireless_v1_00-50k-preproces...,word2vec-df_none-ngram_none,generate_word2vec_file,,,,,,,100.0,5.0,5.0,0.001,5.0,review_body,"star_rating, helpful_votes, total_votes",success,2019-11-02 09:46:59,
6,../dataset/amazon_reviews,amazon_reviews_us_Wireless_v1_00-50k-preproces...,fasttext-df_none-ngram_none,generate_fasttext_file,,,,,,,100.0,5.0,5.0,0.001,5.0,review_body,"star_rating, helpful_votes, total_votes",success,2019-11-02 09:54:01,


# BoW

Here is the code for generating BoW features

In [3]:
??nlpu.generate_bow_file

Signature:
nlpu.generate_bow_file(
    x: pandas.core.frame.DataFrame,
    y: pandas.core.series.Series,
    feature_column: str,
    description: str,
    lda_topics: int = None,
    min_df: float = 1,
    max_df: float = 1.0,
    min_ngram_range=1,
    max_ngr

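To make the BoW variations concrete, here is a minimal pure-Python sketch of a bag-of-words matrix with an n-gram size and a `max_features` cap. It is not the body of `generate_bow_file`, which is not shown in full here; the whitespace tokenizer and the helper names `ngrams` and `bag_of_words` are assumptions made for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(docs, n=1, max_features=None):
    """Build a count matrix over the max_features most frequent n-grams."""
    tokenized = [ngrams(doc.lower().split(), n) for doc in docs]
    totals = Counter(term for doc in tokenized for term in doc)
    # Keep the most frequent terms; sort the vocabulary for stable columns.
    vocab = sorted(term for term, _ in totals.most_common(max_features))
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

reviews = ["great phone great price", "bad phone"]

# Unigram features, comparable to the ngram11 setting.
vocab, matrix = bag_of_words(reviews, n=1)
print(vocab)   # ['bad', 'great', 'phone', 'price']
print(matrix)  # [[0, 2, 1, 1], [1, 0, 1, 0]]

# Bigram features, comparable to the ngram22 setting.
bigram_vocab, bigram_matrix = bag_of_words(reviews, n=2)
print(bigram_vocab)
```

With `max_features=10000`, only the 10k most frequent terms would become columns, which is what keeps the feature files at a fixed width.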
# TF-IDF

In [4]:
??nlpu.generate_tfidf_file

Signature:
nlpu.generate_tfidf_file(
    x: pandas.core.frame.DataFrame,
    y: pandas.core.series.Series,
    feature_column: str,
    description: str,
    min_df: float = 1,
    max_df: float = 1.0,
    min_ngram_range=1,
    max_ngram_range=1,
    lda_topics:

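As a rough illustration of what TF-IDF weighting does, the sketch below uses the textbook formula, term frequency times `log(N / df)`. This is an assumption for illustration only: libraries such as scikit-learn apply smoothing and normalization by default, so the values in the generated feature files will differ from these.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocab, rows) where rows hold tf * idf weights per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in tokenized for term in set(doc))
    vocab = sorted(df)
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc)
        rows.append([(counts[t] / total) * idf[t] if total else 0.0
                     for t in vocab])
    return vocab, rows

reviews = ["great phone great price", "bad phone"]
tfidf_vocab, weights = tfidf(reviews)
for term, weight in zip(tfidf_vocab, weights[0]):
    print(f"{term}: {weight:.3f}")
```

In this toy corpus "phone" appears in every review, so its weight is zero, while "great" is frequent in the first review and absent from the second, so it gets the largest weight there.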
# Feature Engineering Based on Embeddings

Embeddings map each word to a vector in a space that captures the relationships between words. The distance between two vectors represents how closely related the corresponding words are.
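One common way to measure how close two word vectors are is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real Word2Vec or Fasttext vectors have 100 or 300 dimensions and are learned from text.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors, invented for this example only.
toy_embedding = {
    "phone": [0.9, 0.1, 0.0],
    "handset": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

related = cosine_similarity(toy_embedding["phone"], toy_embedding["handset"])
unrelated = cosine_similarity(toy_embedding["phone"], toy_embedding["banana"])
print(f"phone vs handset: {related:.3f}")
print(f"phone vs banana:  {unrelated:.3f}")
```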


### Word2Vec

In [5]:
??nlpu.generate_word2vec_file

Signature:
nlpu.generate_word2vec_file(
    x: pandas.core.frame.DataFrame,
    y: pandas.core.series.Series,
    description: str,
    feature_column: str,
    timer: util.time_util.Timer = None,
    feature_size: int = 100,
    window_context: int = 5,
    min_word_count:

### Fasttext

In [6]:
??nlpu.generate_fasttext_file

Signature:
nlpu.generate_fasttext_file(
    x: pandas.core.frame.DataFrame,
    y: pandas.core.series.Series,
    description: str,
    feature_column: str,
    timer: util.time_util.Timer = None,
    feature_size: int = 100,
    window_context: int = 5,
    min_word_count:

# Average Embedding for Word2Vec and Fasttext

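For Word2Vec and Fasttext, I took an average embedding approach.

For every word in a review, I look up its embedding and then average these word embeddings to produce the final feature vector for the review.

If a word is not found in the vocabulary of the embedding model (e.g., a pre-trained embedding), that entry is removed from our training examples. This is consistent with the embedding-based feature files listed at the end of this notebook containing fewer samples (47,523 to 47,542) than the BoW and TF-IDF files (49,784).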
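The averaging approach described above can be sketched in pure Python as follows. This is an illustration, not the actual `get_average_embedding`: the embedding is a plain dictionary rather than a Word2Vec or Fasttext object, the regular expression only approximates NLTK's `WordPunctTokenizer`, and returning `None` when no word is in the vocabulary (so that the caller can drop the review) is one assumed reading of the removal rule.

```python
import re

def average_embedding(embedding, review):
    """Average the vectors of the review's in-vocabulary tokens.

    Returns None when no token has a vector, so the caller can drop
    the review from the training examples.
    """
    # Split into runs of word characters or runs of punctuation.
    tokens = re.findall(r"\w+|[^\w\s]+", review.lower())
    vectors = [embedding[t] for t in tokens if t in embedding]
    if not vectors:
        return None
    size = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(size)]

# Toy 2-dimensional embedding, invented for this example only.
toy_vectors = {"great": [1.0, 0.0], "phone": [0.0, 1.0]}

print(average_embedding(toy_vectors, "Great phone!"))  # [0.5, 0.5]
print(average_embedding(toy_vectors, "???"))           # None
```

Averaging gives every review a vector of the same length regardless of how many words it contains, at the cost of discarding word order.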

In [7]:
??nlpu.get_average_embedding

Signature: nlpu.get_average_embedding(embedding, review)
Source:   
def get_average_embedding(embedding, review):
    """
    returns a list of word vectors for all words in review
    then average them to return a final vector

    :param embedding: embedding object - will be either Fasttext or Word2Vec
    :param review: review text
    :return:
    """
    log.debug(f'Getting average embedding for: [{review}]')

    wpt = WordPunctTokenizer()
    # word_vectors = [embedding.w

# The following files were generated

In [8]:
report = pd.read_csv(REPORT_FILE)
sorted(report.outfile.tolist())

['../dataset/feature_files/review_body-bow-df_default-ngram11-49784-10000-nolda.csv',
 '../dataset/feature_files/review_body-bow-df_default-ngram22-49784-10000-nolda.csv',
 '../dataset/feature_files/review_body-fasttext-df_none-ngram_none-47523-100-nolda.csv',
 '../dataset/feature_files/review_body-tfidf-df_default-ngram11-49784-10000-nolda.csv',
 '../dataset/feature_files/review_body-tfidf-df_default-ngram22-49784-10000-nolda.csv',
 '../dataset/feature_files/review_body-word2vec-df_none-ngram_none-47523-100-nolda.csv',
 '../dataset/feature_files/review_body-word2vec_pretrained-df_none-ngram_none-47542-300-nolda.csv']

### Reading the file name

{feature column}-{feature engineering technique}-{df parameter}-{ngram parameter}-{# of output samples}-{# of features}-{whether we appended LDA features}.csv

For example, the following file name breaks down as shown in the table:

```
review_body-bow-df_default-ngram11-49784-10000-nolda.csv
```

| Info | Description |
|------|-------------|
| feature column | review_body |
| feature engineering technique | BoW |
| df parameters | default (not specified) |
| ngram parameters | 11 = (1,1) or unigram <br> 22 = (2,2) or bigram |
| output samples | 49784 |
| features | 10000 |
| LDA features | nolda = no LDA features appended |
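The naming convention can also be read programmatically. The helper below is a hypothetical sketch (it is not part of `nlp_util`) and assumes that none of the seven fields contains a hyphen, which holds for the file names listed above.

```python
from pathlib import PurePosixPath

FIELDS = ("feature_column", "technique", "df", "ngram",
          "samples", "features", "lda")

def parse_feature_filename(path):
    """Split a feature file name into its seven hyphen-separated fields."""
    stem = PurePosixPath(path).stem  # drop the directory and the .csv suffix
    parts = stem.split("-")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}: {stem}")
    info = dict(zip(FIELDS, parts))
    info["samples"] = int(info["samples"])
    info["features"] = int(info["features"])
    return info

info = parse_feature_filename(
    "../dataset/feature_files/"
    "review_body-bow-df_default-ngram11-49784-10000-nolda.csv"
)
print(info)
```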