# Feature Engineering

This section will cover the following types of features for the Yelp reviews:

1.  Bag of Words

2.  Bag of N-Grams

3.  TF-IDF (term frequency over inverse document frequency)


In [17]:
import pandas as pd
import numpy as np
import re
import nltk


The corpus or the reviews were extracted from the Yelp review dataset using pandas 

In [18]:
corpus_df = pd.read_csv('yelp_review10K.csv')
corpus = corpus_df['text']
corpus.head()

0    My wife took me here on my birthday for breakf...
1    I have no idea why some people give bad review...
2    love the gyro plate. Rice is so good and I als...
3    Rosie, Dakota, and I LOVE Chaparral Dog Park!!...
4    General Manager Scott Petello is a good egg!!!...
Name: text, dtype: object

# Text pre-processing


In [None]:
As part of Text pre-processing we removed the special characters, whitespaces and numbers and, converted all the text to 
lower case.

In [44]:
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # lower case and remove special characters\whitespaces
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, re.I)
    # doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, re.I)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    doc = ''.join(i for i in doc if not i.isdigit())
    return doc

normalize_corpus = np.vectorize(normalize_document)


In [45]:
norm_corpus = normalize_corpus(corpus)
norm_corpus


array(['wife took birthday breakfast excellent weather perfect made sitting outside overlooking grounds absolute pleasure waitress excellent food arrived quickly semi - busy saturday morning . looked like place fills pretty quickly earlier get better . favor get bloody mary . phenomenal simply best \' ever . \' pretty sure use ingredients garden blend fresh order . amazing . everything menu looks excellent , white truffle scrambled eggs vegetable skillet tasty delicious . came  pieces griddled bread amazing absolutely made meal complete . best " toast " \' ever . anyway , \' wait go back !',
       'idea people give bad reviews place goes show please everyone . probably griping something fault ... many people like . case , friend arrived  :  pm past sunday . pretty crowded , thought sunday evening thought would wait forever get seat said \' seated girl comes back seating someone else . seated  :  waiter came got drink orders . everyone pleasant host seated us waiter server . prices goo

# 1.  Bag of Words Model


In [None]:
We created the Bag of Words model to determine the unique words in each document along with its value

In [46]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix = cv_matrix.toarray()
cv_matrix


array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

Thus you can see that our documents have been converted into numeric vectors such that each document is represented by one vector (row) in the above feature matrix. The following code will help represent this in a more easy to understand format.



In [47]:
# get all unique words in the corpus
vocab = cv.get_feature_names()
vocab


['______',
 '_______',
 '_______________',
 '____berto',
 '_accommodating',
 '_c',
 '_finally_',
 '_gyibeahdfylsszc_g',
 '_lozhqednolhvbg',
 '_reasonable',
 '_she',
 '_third_',
 '_us_',
 '_very',
 '_xhxtuykqnyphmylm',
 'aa',
 'aaa',
 'aaaaaalright',
 'aaaamazing',
 'aaammmazzing',
 'aaand',
 'aah',
 'aand',
 'aaron',
 'aarp',
 'ab',
 'aback',
 'abacus',
 'abandon',
 'abandoned',
 'abandoning',
 'abba',
 'abbaye',
 'abbey',
 'abbreviate',
 'abbreviated',
 'abbreviations',
 'abby',
 'abc',
 'abdomen',
 'abe',
 'aberration',
 'abhor',
 'abides',
 'abiding',
 'abilities',
 'ability',
 'abilty',
 'abita',
 'able',
 'abnormally',
 'abode',
 'abodoba',
 'abogado',
 'abou',
 'abound',
 'abrasion',
 'abrasive',
 'abreast',
 'abridged',
 'abroad',
 'abrupt',
 'abruptly',
 'abs',
 'absence',
 'absense',
 'absent',
 'absinthe',
 'abslutely',
 'absoloutely',
 'absolut',
 'absolute',
 'absolutely',
 'absolutley',
 'absolutly',
 'absorb',
 'absorbed',
 'absorption',
 'abstain',
 'abstained',
 'abstra

In [48]:
# show document feature vectors
pd.DataFrame(cv_matrix, columns=vocab)

Unnamed: 0,______,_______,_______________,____berto,_accommodating,_c,_finally_,_gyibeahdfylsszc_g,_lozhqednolhvbg,_reasonable,...,zuzus,zweigel,zwiebel,zy,zzed,zzzzzzzzzzzzzzzzz,éclairs,école,ém,òc
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# 2.  Bag of N-Grams Model


In [None]:
We created the Bag of bi-grams and tri-grams to look at the 2-word and 3-word strings used in our corpus.

In [49]:

bv = CountVectorizer(ngram_range=(2,2))
bv_matrix = bv.fit_transform(norm_corpus)

bv_matrix = np.asarray(bv_matrix)
vocab = bv.get_feature_names()
# pd.DataFrame(bv_matrix, columns=vocab)
vocab


['______ ordered',
 '_______ oakland',
 '_______________ update',
 '____berto matter',
 '_accommodating evening',
 '_finally_ found',
 '_gyibeahdfylsszc_g adventures',
 '_lozhqednolhvbg http',
 '_reasonable amount',
 '_she listens',
 '_she pretty',
 '_third_ visit',
 '_us_ going',
 '_very friendly',
 '_xhxtuykqnyphmylm mqg',
 'aa accessories',
 'aa battery',
 'aa coming',
 'aa give',
 'aa good',
 'aa hall',
 'aa meeting',
 'aa prices',
 'aa store',
 'aa supports',
 'aa wholesome',
 'aaa bail',
 'aaa completely',
 'aaaaaalright worse',
 'aaaamazing best',
 'aaammmazzing located',
 'aaand absolutely',
 'aah delight',
 'aand chocolates',
 'aaron artists',
 'aaron came',
 'aaron chamberlin',
 'aaron choice',
 'aaron fantastically',
 'aaron johnson',
 'aaron may',
 'aaron please',
 'aaron tattoo',
 'aarp magazines',
 'ab ab',
 'ab area',
 'ab everything',
 'ab expecting',
 'ab fab',
 'ab forgot',
 'ab get',
 'ab repair',
 'ab weekends',
 'aback combination',
 'abacus impressive',
 'abacus i

In [50]:
bv = CountVectorizer(ngram_range=(3,3))
bv_matrix = bv.fit_transform(norm_corpus)

bv_matrix = np.asarray(bv_matrix)
vocab = bv.get_feature_names()
vocab

['______ ordered chicken',
 '_______ oakland coliseum',
 '_______________ update first',
 '____berto matter basically',
 '_accommodating evening appointments',
 '_finally_ found place',
 '_gyibeahdfylsszc_g adventures phoenix',
 '_lozhqednolhvbg http www',
 '_reasonable amount time',
 '_she listens every',
 '_she pretty busy',
 '_third_ visit since',
 '_us_ going wonder',
 '_very friendly _accommodating',
 '_xhxtuykqnyphmylm mqg dessert',
 'aa accessories fab',
 'aa battery something',
 'aa coming xl',
 'aa give call',
 'aa good thing',
 'aa hall need',
 'aa meeting alcohol',
 'aa prices hypocrite',
 'aa store never',
 'aa supports us',
 'aa wholesome bakery',
 'aaa bail fresh',
 'aaa completely average',
 'aaaaaalright worse waits',
 'aaaamazing best hot',
 'aaammmazzing located fashion',
 'aaand absolutely cher',
 'aah delight melt',
 'aand chocolates awesome',
 'aaron artists shop',
 'aaron came ask',
 'aaron chamberlin chris',
 'aaron choice kind',
 'aaron fantastically knowledgeab

# 3.  TF-IDF Model


In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()

vocab = tv.get_feature_names()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)


Unnamed: 0,______,_______,_______________,____berto,_accommodating,_c,_finally_,_gyibeahdfylsszc_g,_lozhqednolhvbg,_reasonable,...,zuzus,zweigel,zwiebel,zy,zzed,zzzzzzzzzzzzzzzzz,éclairs,école,ém,òc
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The TF-IDF based feature vectors for each of our text documents show scaled and normalized values as compared to the raw Bag of Words model values. 

