# Bag of Words (BoW)

A bag-of-words (BoW) model is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

This is a very simple and flexible approach. It can be used in a myriad of ways for extracting features from documents. It involves two things:

1. A vocabulary of known words.
2. A measure of the presence of known words. 

It is called a "bag" of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.

Ex. 

* Betsy bought a butter
* but the butter was bitter
* so she added more butter to make bitter butter better.

Unique Words:

[Betsy, bought, a, butter, but, the, was, bitter, so, she, added, more, to, make, better]

Ex: Betsy bought a butter: [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

The resulting vector is sparse (mostly zeros). Stacking one such vector per document gives a sparse document-term matrix.
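The encoding above can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming whitespace tokenization and case-sensitive matching (so "Betsy" keeps its capital, as in the list above); the helper names `tokenize`, `to_binary_vector` and `to_count_vector` are made up for illustration.

```python
# Toy corpus from the example above.
documents = [
    "Betsy bought a butter",
    "but the butter was bitter",
    "so she added more butter to make bitter butter better.",
]


def tokenize(text):
    # Split on whitespace and strip trailing punctuation; case is kept as-is.
    return [token.strip(".,") for token in text.split()]


# Build the vocabulary in first-seen order.
vocabulary = []
for doc in documents:
    for token in tokenize(doc):
        if token not in vocabulary:
            vocabulary.append(token)


def to_binary_vector(text):
    # 1 if the vocabulary word occurs in the text, 0 otherwise.
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]


def to_count_vector(text):
    # Number of times each vocabulary word occurs in the text.
    tokens = tokenize(text)
    return [tokens.count(word) for word in vocabulary]


first_vector = to_binary_vector(documents[0])
third_counts = to_count_vector(documents[2])

print(vocabulary)
print(first_vector)
print(third_counts)
```

The binary vector only records presence; the count vector is one simple way of scoring words by frequency.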


* Cleaning text
* N-grams
* Scoring words (ex: frequencies)

Limitations of BoW:

* If new sentences contain new words, the vocabulary grows, and the length of every vector grows with it.
* The vectors contain many 0s, resulting in a sparse matrix (which is what we would like to avoid).
* No information is retained about the grammar of the sentences or the ordering of the words in the text.

# TF-IDF (term frequency-inverse document frequency)

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.

This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

TF-IDF was invented for document search and information retrieval. The score increases proportionally to the number of times a word appears in a document, but is offset by the number of documents that contain the word. So, words that are common in every document, such as "this", "what", and "if", rank low even though they may appear many times, since they don’t mean much to that document in particular.

### How is TF-IDF calculated?

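$$tf_{i,j} \times \log\left(\frac{N}{df_{i}}\right)$$

Here $tf_{i,j}$ is the frequency of term $i$ in document $j$, $N$ is the total number of documents, and $df_{i}$ is the number of documents that contain term $i$. The two factors are: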

* The term frequency of a word in a document. There are several ways of calculating this frequency, the simplest being a raw count of how many times the word appears in the document. The count can also be adjusted, for example by the length of the document or by the raw frequency of the most frequent word in the document.
* The inverse document frequency of the word across a set of documents. This measures how common or rare a word is in the entire document set. The closer it is to 0, the more common the word is. It is calculated by taking the total number of documents, dividing it by the number of documents that contain the word, and taking the logarithm.
* So, if the word is very common and appears in many documents, this number will approach 0. Otherwise, it grows, reaching its maximum of $\log(N)$ for a word that appears in only one document.
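The calculation can be worked through on the three "butter" sentences from the BoW section. This is a minimal sketch, assuming a raw count for the term frequency, the natural logarithm for the inverse document frequency, and lowercased tokens; the function names are illustrative and not taken from any library.

```python
import math

# The three example sentences, lowercased and split into tokens.
docs = [
    "betsy bought a butter".split(),
    "but the butter was bitter".split(),
    "so she added more butter to make bitter butter better".split(),
]


def term_frequency(term, doc):
    # Simplest variant: raw count of the term in the document.
    return doc.count(term)


def inverse_document_frequency(term, all_docs):
    # log(N / df); assumes the term occurs in at least one document.
    df = sum(1 for doc in all_docs if term in doc)
    return math.log(len(all_docs) / df)


def tf_idf(term, doc, all_docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, all_docs)


# "butter" occurs in every document, so its IDF (and TF-IDF) is 0.
butter_score = tf_idf("butter", docs[2], docs)
# "bitter" occurs in two of three documents: 1 * log(3 / 2).
bitter_score = tf_idf("bitter", docs[2], docs)
# "betsy" occurs in one document only: 1 * log(3), the largest possible IDF here.
betsy_score = tf_idf("betsy", docs[0], docs)

print(butter_score, bitter_score, betsy_score)
```

Even though "butter" appears twice in the third sentence, it scores 0 because it does not distinguish any document from the others.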

## Applications of TF-IDF
* Information retrieval
* Keyword Extraction

[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

![tfidf](https://drive.google.com/uc?id=17W0deYjn_BLGWY5lqBYKOrFFVm4uLkV)



# Case Study: Sentiment Analysis

In [None]:
# from google.colab import drive

# drive.mount('/content/drive')

In [None]:
# PATH = '/content/drive/MyDrive/NLPWorkShopANPAOct2021/'

In [None]:

# import pandas as pd
# import numpy as np

# # Read in the data
# df = pd.read_csv(PATH+'Amazon_Unlocked_Mobile.csv')

# # Sample the data to speed up computation
# # Comment out this line to match with lecture
# df = df.sample(frac=0.1)

# df.head()

In [None]:
# Drop missing values




In [None]:
# Remove any 'neutral' ratings equal to 3



In [None]:

# Encode 4s and 5s as 1 (rated positively)
# Encode 1s and 2s as 0 (rated poorly)


In [None]:
# Most ratings are positive
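
One possible way to fill in the preprocessing cells above is sketched below on a tiny made-up DataFrame instead of the real CSV. The `Reviews` and `Positively Rated` column names come from the later cells of this notebook; the `Rating` column name and the toy rows are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Amazon reviews data.
df = pd.DataFrame({
    "Reviews": ["Great phone", "Stopped working", None, "It is okay", "Love it", "Fine"],
    "Rating": [5, 1, 4, 3, 4, 2],  # assumed column name
})

# Drop rows with missing values.
df = df.dropna()

# Remove any 'neutral' ratings equal to 3.
df = df[df["Rating"] != 3].copy()

# Encode 4s and 5s as 1 (rated positively) and 1s and 2s as 0 (rated poorly).
df["Positively Rated"] = np.where(df["Rating"] > 3, 1, 0)

# The mean of the 0/1 label is the share of positive reviews.
positive_share = df["Positively Rated"].mean()
print(positive_share)
```

On the real data the same `mean()` call shows how imbalanced the classes are.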


In [None]:
# from sklearn.model_selection import train_test_split

# # Split data into training and test sets
# X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], 
#                                                     df['Positively Rated'], 
#                                                     random_state=0)

## CountVectorizer

In [None]:

# from sklearn.feature_extraction.text import CountVectorizer

In [None]:


# Fit the CountVectorizer to the training data


In [None]:

# transform the documents in the training data to a document-term matrix


In [None]:

# from sklearn.linear_model import LogisticRegression
# from sklearn.metrics import roc_auc_score

# Train the model



In [None]:


# Predict the transformed test documents


In [None]:
# get the feature names as numpy array


In [None]:

# Sort the coefficients from the model

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest


In [None]:

# from sklearn.feature_extraction.text import TfidfVectorizer

# # Fit the TfidfVectorizer to the training data specifying a minimum document frequency of 5
# vect = TfidfVectorizer(min_df=5).fit(X_train)
# len(vect.get_feature_names_out())

In [None]:
# These reviews are treated the same by our current model
# print(model.predict(vect.transform(['not an issue, phone is working',
#                                     'an issue, phone is not working'])))
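
The two reviews above contain exactly the same words, which is why a model built on single-word features cannot tell them apart. The sketch below shows this in plain Python and shows that bigrams do separate them. The tokenizer (lowercasing and keeping tokens of two or more word characters) is an assumption chosen to be similar in spirit to scikit-learn's default, and `ngrams` is an illustrative helper, not a library function.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def ngrams(tokens, n):
    # Join each run of n consecutive tokens into one feature string.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


review_a = "not an issue, phone is working"
review_b = "an issue, phone is not working"

unigrams_a = Counter(tokenize(review_a))
unigrams_b = Counter(tokenize(review_b))
bigrams_a = Counter(ngrams(tokenize(review_a), 2))
bigrams_b = Counter(ngrams(tokenize(review_b), 2))

# Identical bags of single words, different bags of word pairs.
same_unigrams = unigrams_a == unigrams_b
same_bigrams = bigrams_a == bigrams_b
print(same_unigrams, same_bigrams)
print(sorted(set(bigrams_b) - set(bigrams_a)))
```

Bigrams such as "not working" carry the local word order that the unigram representation throws away.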

In [None]:
# Fit the CountVectorizer to the training data specifying a minimum 
# document frequency of 5 and extracting 1-grams and 2-grams
# vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)

# X_train_vectorized = vect.transform(X_train)

# len(vect.get_feature_names_out())

In [None]:
# model = LogisticRegression()
# model.fit(X_train_vectorized, y_train)

# predictions = model.predict(vect.transform(X_test))

# print('AUC: ', roc_auc_score(y_test, predictions))

In [None]:
# feature_names = np.array(vect.get_feature_names_out())

# sorted_coef_index = model.coef_[0].argsort()

# print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
# print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

In [None]:
# These reviews are now correctly identified
# print(model.predict(vect.transform(['not an issue, phone is working',
#                                     'an issue, phone is not working'])))