# Bag of Words (BoW)

A bag-of-words (BoW) model is a way of extracting features from text for use in modeling, for example with machine learning algorithms.

This is a very simple and flexible approach, and it can be used in a myriad of ways to extract features from documents. It involves two things:

1. A vocabulary of known words.
2. A measure of the presence of known words.

It is called a "bag" of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.

Example:

* Betsy bought a butter
* but the butter was bitter
* so she added more butter to make bitter butter better.

Unique Words:

[Betsy, bought, a, butter, but, the, was, bitter, so, she, added, more, to, make, better]

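Example: "Betsy bought a butter": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

The resulting vector is called a sparse vector because most of its entries are 0; stacking the vectors of all documents gives a sparse matrix.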
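The Betsy example can be reproduced in a few lines of plain Python. This is a minimal sketch, assuming whitespace tokenization, case-sensitive matching, and the trailing period already removed; the helper names `build_vocabulary` and `vectorize` are illustrative and do not come from any library.

```python
sentences = [
    "Betsy bought a butter",
    "but the butter was bitter",
    "so she added more butter to make bitter butter better",
]

def build_vocabulary(docs):
    """Collect unique words in order of first appearance."""
    vocab = []
    for doc in docs:
        for word in doc.split():
            if word not in vocab:
                vocab.append(word)
    return vocab

def vectorize(doc, vocab):
    """Binary presence vector: 1 if the vocabulary word occurs in doc, else 0."""
    words = set(doc.split())
    return [1 if word in words else 0 for word in vocab]

vocab = build_vocabulary(sentences)
print(vocab)
print(vectorize(sentences[0], vocab))  # [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Only the first four vocabulary entries are set for the first sentence; the remaining eleven positions stay at 0.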


Common refinements to the basic model:

* Cleaning text
* n-grams
* Scoring words (e.g. frequencies)
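Cleaning and frequency scoring can be sketched with the standard library alone. The cleaning rule here (lowercase, keep only runs of letters and apostrophes) is an assumption chosen for illustration, and `clean_and_tokenize` is a hypothetical helper name.

```python
import re
from collections import Counter

def clean_and_tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

doc = "So she added more butter to make bitter butter better."
counts = Counter(clean_and_tokenize(doc))

# Frequency scoring: each word is scored by how often it occurs
print(counts["butter"])  # 2
print(counts["better"])  # 1
```

Without the cleaning step, "better." (with the period) would be counted as a different word from "better".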

Limitations of BoW:

* If new sentences contain new words, the vocabulary grows, and the length of every vector grows with it.
* The vectors also contain many 0s, resulting in a sparse matrix (which is what we would like to avoid).
* No information is retained about the grammar of the sentences or the ordering of the words in the text.

# TF-IDF (term frequency-inverse document frequency)

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.

This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

TF-IDF was invented for document search and information retrieval. The score increases in proportion to the number of times a word appears in a document, but is offset by the number of documents that contain the word. So words that are common in every document, such as "this", "what", and "if", rank low even though they may appear many times, since they don’t mean much to that document in particular.

### How is TF-IDF calculated?

$$tf_{i,j} \times \log\left(\frac{N}{df_{i}}\right)$$

* $tf_{i,j}$: the term frequency of word $i$ in document $j$. There are several ways of calculating this frequency, the simplest being a raw count of the times the word appears in the document. The count can also be adjusted by the length of the document, or by the raw frequency of the most frequent word in the document.
* $\log(N / df_{i})$: the inverse document frequency of the word across a set of documents, where $N$ is the total number of documents and $df_{i}$ is the number of documents that contain word $i$. It measures how common or rare a word is in the entire document set. The closer it is to 0, the more common the word is.
* So, if the word is very common and appears in many documents, this number approaches 0. The rarer the word, the larger it gets, up to $\log N$ for a word that appears in only one document.
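The formula can be worked through on the three Betsy sentences. This is a minimal sketch, assuming raw counts for the term frequency and the natural logarithm; the function names are illustrative.

```python
import math

docs = [
    "betsy bought a butter".split(),
    "but the butter was bitter".split(),
    "so she added more butter to make bitter butter better".split(),
]

def tf(term, doc):
    """Raw count of term in one tokenized document."""
    return doc.count(term)

def idf(term, docs):
    """log(N / df). Assumes the term occurs in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "butter" is in all 3 documents, so idf = log(3/3) = 0 and the score is 0
print(tf_idf("butter", docs[2], docs))
# "bitter" is in 2 of 3 documents: 1 * log(3/2)
print(tf_idf("bitter", docs[2], docs))
# "better" is in 1 of 3 documents: 1 * log(3), the maximum idf for N = 3
print(tf_idf("better", docs[2], docs))
```

scikit-learn's `TfidfVectorizer` uses a smoothed idf and normalizes each document vector by default, so its values will differ from this plain version.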

## Applications of TF-IDF
* Information retrieval
* Keyword Extraction

[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

![TF-IDF](tfidf.png)





# Case Study: Sentiment Analysis

In [None]:
# from google.colab import drive

# drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# PATH = '/content/drive/MyDrive/NLPWorkShopANPAOct2021/'

In [1]:

import pandas as pd
import numpy as np

# Read in the data
df = pd.read_csv('Amazon_Unlocked_Mobile.csv')

# Sample the data to speed up computation
# Comment out this line to match with lecture
df = df.sample(frac=0.1)

df.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
190866,HTC Freestyle F5151 Pd53100 Unlocked Smartphone,HTC,39.0,3,the OS doesnt work good,1.0
160492,"BLU VIVO 5 Smartphone -5.5"" 4G LTE GSM Unlocke...",BLU,199.99,4,It met all possible expectations for a midrang...,0.0
263289,Nokia C3-00 Unlocked Cell Phone (Slate) with Q...,Nokia,49.99,5,Purchased for a friend who lives overseas. Fri...,0.0
353980,Samsung Galaxy S5 Mini G800H Unlocked Cellphon...,Samsung,350.0,1,Although I arrived An original phone but was D...,3.0
239801,LifeProof Case 1801-02 for Samsung Galaxy S4 (...,LifeProof,6.99,5,Great packaging..great product! It protects th...,0.0


In [2]:
df.shape

(41384, 6)

In [3]:
# Drop missing values
df.dropna(inplace=True)



In [4]:
# Remove any 'neutral' ratings equal to 3
df = df[df['Rating'] != 3]


In [5]:

# Encode 4s and 5s as 1 (rated positively)
# Encode 1s and 2s as 0 (rated poorly)
df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)
df.head(10)

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Positively Rated
160492,"BLU VIVO 5 Smartphone -5.5"" 4G LTE GSM Unlocke...",BLU,199.99,4,It met all possible expectations for a midrang...,0.0,1
263289,Nokia C3-00 Unlocked Cell Phone (Slate) with Q...,Nokia,49.99,5,Purchased for a friend who lives overseas. Fri...,0.0,1
353980,Samsung Galaxy S5 Mini G800H Unlocked Cellphon...,Samsung,350.0,1,Although I arrived An original phone but was D...,3.0,0
239801,LifeProof Case 1801-02 for Samsung Galaxy S4 (...,LifeProof,6.99,5,Great packaging..great product! It protects th...,0.0,1
256946,Motorola RAZR V3 Unlocked Phone with Quad-Band...,Motorola,47.5,5,I had dropped and broken my previous beloved M...,0.0,1
49840,"Apple iPhone 5s AT&T Cellphone, 16GB, Silver",Apple,139.95,4,lovely phone,0.0,1
276352,Nokia Lumia 925 RM-893 GSM Unlocked 4G LTE Win...,Nokia,99.0,4,"Yes, this phone is really fast, even with the ...",2.0,1
32381,Apple iPhone 5c 32GB (White) - AT&T,Apple,279.95,5,Great quality product.,1.0,1
70246,"Apple iPhone 6S Plus Unlocked Smartphone, 32 G...",Apple,749.99,5,"Big purchase (for me) so I was worried, but al...",1.0,1
291503,Pantech Pursuit 2 P6010 Unlocked GSM 3G Slider...,Pantech,249.99,5,fast shipping and just as advertised! Thank you!,4.0,1


In [6]:
# Most ratings are positive
df['Positively Rated'].mean()

0.7506031166460194

In [7]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], 
                                                    df['Positively Rated'], 
                                                    random_state=0)

In [8]:
print(X_train.shape)
print(X_test.shape)

(23005,)
(7669,)


In [9]:

print('X_train first entries:\n\n', X_train.iloc[0:3])
print('\n\nX_train shape: ', X_train.shape)

X_train first entries:

 178350                       For the price it was great....
289233                               Don't like it to much.
156674    Brought several other BLU phones. All worth it...
Name: Reviews, dtype: object


X_train shape:  (23005,)


## CountVectorizer

In [10]:

from sklearn.feature_extraction.text import CountVectorizer

In [11]:


# Fit the CountVectorizer to the training data
vect = CountVectorizer().fit(X_train)

In [12]:
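# scikit-learn >= 1.2: use get_feature_names_out() here and below (get_feature_names() was removed)
vect.get_feature_names()[::2000]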

['00',
 'atmos',
 'comparing',
 'ec',
 'gradually',
 'legendary',
 'opens',
 'recomendando',
 'sorround',
 'unload']

In [13]:
len(vect.get_feature_names())


19387

In [14]:

# transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train)

X_train_vectorized

<23005x19387 sparse matrix of type '<class 'numpy.int64'>'
	with 606356 stored elements in Compressed Sparse Row format>
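The numbers in this summary show how sparse the document-term matrix is. A quick calculation, using only the shape and the stored-element count reported above:

```python
n_docs, n_terms = 23005, 19387   # matrix shape reported above
stored = 606356                  # stored (non-zero) entries reported above

density = stored / (n_docs * n_terms)
avg_terms_per_review = stored / n_docs

print(f"density: {density:.4%}")
print(f"average distinct terms per review: {avg_terms_per_review:.1f}")
```

Roughly 0.14% of the cells are non-zero, about 26 distinct terms per review, which is why the matrix is kept in Compressed Sparse Row format rather than as a dense array.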

In [15]:
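# Caution: .toarray() builds the full dense matrix (23005 x 19387 int64,
# roughly 3.5 GB); convert only a slice if memory is tight
X_df = pd.DataFrame(X_train_vectorized.toarray())
X_df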

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19377,19378,19379,19380,19381,19382,19383,19384,19385,19386
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23000,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23002,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23003,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Train the model
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()

In [30]:


# Predict the transformed test documents
predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.9129059498625546
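The cell above passes the hard 0/1 output of `model.predict` to `roc_auc_score`. AUC is defined on a ranking of scores, so thresholded labels generally give a coarser value than continuous scores such as those from `model.decision_function` or `model.predict_proba`. The toy sketch below uses made-up numbers and a hand-written pairwise `auc` helper (not scikit-learn's implementation) to show the difference.

```python
def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. Quadratic in the number of samples, so this is
    for illustration only.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
probabilities = [0.1, 0.6, 0.4, 0.9]                  # made-up scores
hard_labels = [int(p >= 0.5) for p in probabilities]  # thresholded at 0.5

print(auc(y_true, probabilities))  # 0.75
print(auc(y_true, hard_labels))    # 0.5
```

With hard labels, every positive and negative that land on the same side of the threshold become ties, which is where the ranking information is lost.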


In [31]:
# get the feature names as numpy array
feature_names = np.array(vect.get_feature_names())
feature_names


array(['00', '00 and', '000', ..., 'zune', 'ítem', 'único'], dtype='<U22')

In [32]:

# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['no good' 'horrible' 'junk' 'terrible' 'poor' 'fake' 'don like' 'worst'
 'disappointed' 'sucks']

Largest Coefs: 
['excellent' 'excelente' 'perfect' 'excelent' 'love' 'awesome' 'not bad'
 'great' 'amazing' 'nice']
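The two slices used above, `[:10]` and `[:-11:-1]`, are easier to follow on a small list. This sketch uses made-up coefficients and plain Python in place of NumPy's `argsort`, with 2 items instead of 10.

```python
coefs = [0.3, -1.2, 2.5, 0.0, -0.4, 1.1]   # made-up coefficients
names = ["ok", "junk", "excellent", "phone", "poor", "love"]

# Indices that would sort coefs in ascending order (what argsort returns)
order = sorted(range(len(coefs)), key=lambda i: coefs[i])

smallest = [names[i] for i in order[:2]]      # first two: most negative
largest = [names[i] for i in order[:-3:-1]]   # last two, walking backwards

print(smallest)  # ['junk', 'poor']
print(largest)   # ['excellent', 'love']
```

The slice `[:-3:-1]` starts at the end and stops before the third-from-last index, so it returns the two largest entries in descending order; `[:-11:-1]` does the same for ten.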


## TfidfVectorizer

In [33]:

from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the TfidfVectorizer to the training data, specifying a minimum document frequency of 5
vect = TfidfVectorizer(min_df=5).fit(X_train)
len(vect.get_feature_names())

5428
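Setting `min_df=5` drops every term that occurs in fewer than 5 training documents, which is why the vocabulary shrinks from 19387 to 5428 features. The filtering idea can be sketched as follows; `vocabulary_with_min_df` is a hypothetical helper, not part of scikit-learn, and the three reviews are made up.

```python
from collections import Counter

def vocabulary_with_min_df(tokenized_docs, min_df):
    """Keep terms that occur in at least min_df documents."""
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # set(): count each term once per document
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

docs = [
    "great phone great price".split(),
    "great battery".split(),
    "phone arrived broken".split(),
]

print(vocabulary_with_min_df(docs, min_df=1))
print(vocabulary_with_min_df(docs, min_df=2))  # ['great', 'phone']
```

Note that "great" appears twice in the first review but still counts as one document; `min_df` is about document frequency, not total occurrences.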

In [34]:

X_train_vectorized = vect.transform(X_train)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.887143056799629


In [35]:
feature_names = np.array(vect.get_feature_names())

sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()

print('Smallest tfidf:\n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest tfidf: \n{}'.format(feature_names[sorted_tfidf_index[:-11:-1]]))

Smallest tfidf:
['distort' 'foremost' 'reliably' 'briefly' 'makers' 'powerpoint'
 'softkeys' 'onboard' 'cooler' 'noticing']

Largest tfidf: 
['problems' 'fine' 'nice' 'spectacular' 'never' 'returned' 'cel' 'awesome'
 'awful' 'tks']


In [36]:

sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['not' 'return' 'disappointed' 'poor' 'worst' 'terrible' 'horrible'
 'waste' 'don' 'months']

Largest Coefs: 
['great' 'love' 'excellent' 'perfect' 'good' 'works' 'easy' 'best' 'nice'
 'far']


In [37]:
# These reviews are treated the same by our current model
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[0 0]


## n-grams

In [38]:
# Fit the CountVectorizer to the training data, specifying a minimum
# document frequency of 5 and extracting 1-grams and 2-grams
vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)

X_train_vectorized = vect.transform(X_train)

len(vect.get_feature_names())

28457

In [39]:
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


AUC:  0.9129059498625546


In [40]:
feature_names = np.array(vect.get_feature_names())

sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['no good' 'horrible' 'junk' 'terrible' 'poor' 'fake' 'don like' 'worst'
 'disappointed' 'sucks']

Largest Coefs: 
['excellent' 'excelente' 'perfect' 'excelent' 'love' 'awesome' 'not bad'
 'great' 'amazing' 'nice']


In [41]:
# These reviews are now correctly identified
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[1 0]
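The reason 2-grams separate these two reviews can be checked by hand. The sketch below is a simplified stand-in for `CountVectorizer`: it lowercases the text and keeps words of two or more characters, which matches the vectorizer's default tokenization, but it skips the `min_df` filter and the helper names are illustrative.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase words of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def ngram_counts(text, n_min=1, n_max=1):
    """Count all word n-grams with n_min <= n <= n_max."""
    words = tokens(text)
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            grams[" ".join(words[i:i + n])] += 1
    return grams

a = "not an issue, phone is working"
b = "an issue, phone is not working"

# Same bag of single words, so a unigram model cannot tell them apart
print(ngram_counts(a) == ngram_counts(b))              # True
# Adding 2-grams makes the two representations differ
print(ngram_counts(a, 1, 2) == ngram_counts(b, 1, 2))  # False
# 2-grams that occur only in the second review
print(sorted(set(ngram_counts(b, 2, 2)) - set(ngram_counts(a, 2, 2))))
# ['is not', 'not working']
```

Features such as "not working" give the classifier a signal that the single word "not" cannot provide on its own.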
