# Feature Engineering

This section will cover the following types of features for the Yelp reviews:

1.  Bag of Words

2.  Bag of N-Grams

3.  TF-IDF (term frequency over inverse document frequency)


In [1]:
import pandas as pd
import numpy as np
import re
import nltk


The corpus or the reviews were extracted from the Yelp review dataset using pandas 

In [2]:
corpus_df = pd.read_csv('subset.csv')
corpus = corpus_df['text']
corpus.head()

0    Hallelujah! I FINALLY FOUND IT! The frozen yog...
1    I drop by BnC on a weekly basis to pick up my ...
2    My personally experience here wasn't the best,...
3    37 °C = 98.6°F\r\nKoreatown establisments disp...
4    My husband & I visited Toronto from the U.S. f...
Name: text, dtype: object

# Text pre-processing


As part of Text pre-processing we removed the special characters, whitespaces and numbers and, converted all the text to 
lower case.

In [3]:
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # lower case and remove special characters\whitespaces
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, re.I)
    # doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, re.I)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    doc = ''.join(i for i in doc if not i.isdigit())
    return doc

normalize_corpus = np.vectorize(normalize_document)


In [4]:
norm_corpus = normalize_corpus(corpus)


# 1.  Bag of Words Model


We created the Bag of Words model to determine the unique words in each document along with its value

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(norm_corpus)
# cv_matrix = cv_matrix.toarray()


Thus you can see that our documents have been converted into numeric vectors such that each document is represented by one vector (row) in the above feature matrix. The following code will help represent this in a more easy to understand format.



In [6]:
# get all unique words in the corpus
vocab = cv.get_feature_names()
vocab


['__',
 '___',
 '____',
 '_____',
 '______',
 '_______',
 '________',
 '_________',
 '__________',
 '___________',
 '____________',
 '_____________',
 '______________',
 '_______________',
 '________________',
 '_________________',
 '__________________',
 '___________________',
 '____________________',
 '_____________________',
 '______________________',
 '_______________________',
 '________________________',
 '_________________________',
 '__________________________',
 '___________________________',
 '____________________________',
 '_____________________________',
 '______________________________',
 '_______________________________',
 '________________________________',
 '_________________________________',
 '__________________________________',
 '_____________________________________',
 '_______________________________________',
 '________________________________________',
 '__________________________________________',
 '___________________________________________',
 '_____________

# 2.  Bag of N-Grams Model


We created the Bag of bi-grams and tri-grams to look at the 2-word and 3-word strings used in our corpus.

In [7]:

bv = CountVectorizer(ngram_range=(2,2))
bv_matrix = bv.fit_transform(norm_corpus)

bv_matrix = np.asarray(bv_matrix)
vocab = bv.get_feature_names()
# pd.DataFrame(bv_matrix, columns=vocab)
vocab


['__ __',
 '__ ____',
 '__ adorned',
 '__ also',
 '__ amount',
 '__ anyways',
 '__ bavaria',
 '__ cheap',
 '__ check',
 '__ dancing',
 '__ dine',
 '__ disappointed',
 '__ drag',
 '__ expecting',
 '__ experience',
 '__ explained',
 '__ female',
 '__ got',
 '__ hehehe',
 '__ hip',
 '__ insert',
 '__ inventory',
 '__ lastly',
 '__ line',
 '__ lot',
 '__ lychee',
 '__ maybe',
 '__ meat',
 '__ mushroom',
 '__ one',
 '__ patio',
 '__ pdf',
 '__ peppers',
 '__ pita',
 '__ really',
 '__ recall',
 '__ scrambled',
 '__ service',
 '__ something',
 '__ sucker',
 '__ suffice',
 '__ sunset',
 '__ szechuan',
 '__ tea',
 '__ thing',
 '__ tl',
 '__ top',
 '__ tough',
 '__ treat',
 '__ tried',
 '__ typically',
 '__ waitress',
 '__ went',
 '__ wish',
 '__ yes',
 '___ ___',
 '___ ____',
 '___ advised',
 '___ always',
 '___ appetizers',
 '___ asked',
 '___ big',
 '___ bird',
 '___ definitely',
 '___ desert',
 '___ dessert',
 '___ done',
 '___ drinks',
 '___ drooool',
 '___ end',
 '___ equipment',
 '___ eve

# 3.  TF-IDF Model


In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
# tv_matrix = tv_matrix.toarray()

vocab = tv.get_feature_names()
# pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)


In [10]:
vocab

['__',
 '___',
 '____',
 '_____',
 '______',
 '_______',
 '________',
 '_________',
 '__________',
 '___________',
 '____________',
 '_____________',
 '______________',
 '_______________',
 '________________',
 '_________________',
 '__________________',
 '___________________',
 '____________________',
 '_____________________',
 '______________________',
 '_______________________',
 '________________________',
 '_________________________',
 '__________________________',
 '___________________________',
 '____________________________',
 '_____________________________',
 '______________________________',
 '_______________________________',
 '________________________________',
 '_________________________________',
 '__________________________________',
 '_____________________________________',
 '_______________________________________',
 '________________________________________',
 '__________________________________________',
 '___________________________________________',
 '_____________

The TF-IDF based feature vectors for each of our text documents show scaled and normalized values as compared to the raw Bag of Words model values. 

