<font size=+4 color="Black"><center><b>Bag of Words & TF-IDF</b></center></font>
<font size=-1 color="Black"><center><b>Series: All about NLP by Data Tattle</b></center></font>

<font size="+2" color="Red"><b>Please Upvote if you like the work</b></font>

### It gives motivation to a working professional (like me) to contribute more.

### About this notebook

#### This notebook is part of the series "[All about NLP](https://www.kaggle.com/datatattle/all-about-nlp)" and covers text vectorization using Bag of Words and TF-IDF.



![](https://miro.medium.com/max/2428/0*Qq8FcR-mgnvjWZLQ.gif)

Contents:

* [1. Bag of Words](#1)
* [2. TF-IDF](#2)
* [3. BoW vs. TF-IDF](#3)

### Bag of Words

We cannot feed raw text directly into NLP or ML models, because the algorithms operate on numbers. Bag of Words (BoW) is therefore used to turn text into numeric features: the total number of occurrences of each word is counted, and the counts are kept as a "bag of words" that ignores word order.


### Types of BOW
#### a. Count Occurrence
#### b. Normalized Count Occurrence
#### c. TF-IDF

<a id="1"></a>
    
<font size="+2" color="indigo"><b>1. Bag of Words</b></font><br>

#### a. Count Occurrence


In [None]:
import pandas as pd
import numpy as np
import re 
import nltk 
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, KFold

In [None]:
train=pd.read_csv("/kaggle/input/covid-19-nlp-text-classification/Corona_NLP_train.csv",encoding='latin1')
test=pd.read_csv("/kaggle/input/covid-19-nlp-text-classification/Corona_NLP_test.csv",encoding='latin1')

In [None]:
train['text'] = train.OriginalTweet
train["text"] = train["text"].astype(str)

test['text'] = test.OriginalTweet
test["text"] = test["text"].astype(str)

# Data has 5 classes, let's convert them to 3

def classes_def(x):
    if x ==  "Extremely Positive":
        return "2"
    elif x == "Extremely Negative":
        return "0"
    elif x == "Negative":
        return "0"
    elif x ==  "Positive":
        return "2"
    else:
        return "1"
    

train['label']=train['Sentiment'].apply(lambda x:classes_def(x))
test['label']=test['Sentiment'].apply(lambda x:classes_def(x))


train.label.value_counts(normalize= True)

In [None]:
x_train = train.text
y_train = train.label

In [None]:
#Using NLTK
text = "what is going to happen next in data science \
is a mystery what has happened is history it is an \
interdisciplinary field that uses scientific method \
processes algorithms and systems to extract knowledge \
and insights from many structural and unstructured data \
data science is related to data mining machine learning and big data"
txt = nltk.sent_tokenize(text)

word2count = {} 
for data in txt: 
    words = nltk.word_tokenize(data) 
    for word in words: 
        if word not in word2count.keys(): 
            word2count[word] = 1
        else: 
            word2count[word] += 1

print(word2count)

### This was a simple text. In real modeling the vocabulary can become very large, so we set a cut-off and keep only the most frequent words.
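
As a side note, scikit-learn's `CountVectorizer` can apply a similar cut-off. The sketch below uses a made-up toy corpus (not the tweet data) and shows two options: `max_features` keeps the most frequent terms overall, while `min_df` keeps terms that appear in at least a given number of documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration only.
docs = [
    "data science is fun fun fun",
    "data mining is part of data science",
    "machine learning uses data",
]

# Option 1: keep only the 2 most frequent terms across the whole corpus.
top_vec = CountVectorizer(max_features=2)
top_vec.fit(docs)
top_vocab = sorted(top_vec.vocabulary_)

# Option 2: keep only terms that appear in at least 2 documents.
min_df_vec = CountVectorizer(min_df=2)
min_df_vec.fit(docs)
min_df_vocab = sorted(min_df_vec.vocabulary_)

print(top_vocab)
print(min_df_vocab)
```

The two options can give different vocabularies: a word repeated many times in a single document survives `max_features` but not `min_df`.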


In [None]:
import heapq 
freq_words = heapq.nlargest(200, word2count, key=word2count.get)
freq_words

In [None]:
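X = []
for data in txt:
    # Tokenize each sentence once instead of once per frequent word
    tokens = set(nltk.word_tokenize(data))
    # 1 if the frequent word occurs in this sentence, else 0 (binary presence)
    vector = [1 if word in tokens else 0 for word in freq_words]
    X.append(vector)
X = np.asarray(X)
X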
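
Note that the loop above records only whether a word is present (0 or 1). A minimal sketch of the difference between binary presence and actual count occurrence, using hypothetical pre-tokenized sentences so that no NLTK data is needed:

```python
import numpy as np

# Hypothetical toy inputs, already tokenized.
sentences = [["data", "science", "loves", "data"], ["big", "data"]]
vocab = ["data", "science", "big"]

# Binary presence: 1 if the word occurs at all in the sentence.
binary = np.array([[1 if w in sent else 0 for w in vocab] for sent in sentences])

# Count occurrence: how many times each word occurs in the sentence.
counts = np.array([[sent.count(w) for w in vocab] for sent in sentences])

print(binary)
print(counts)
```

The two matrices differ only where a word is repeated within a sentence, such as "data" in the first one.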

In [None]:
#Using SkLearn
text = "Natural Language Processing (NLP) is a sub-field of artificial intelligence \
that deals understanding and processing human language. In light of new advancements \
in machine learning, many organizations have begun applying natural language processing \
for translation, chatbots and candidate filtering"

count_vec = CountVectorizer()
count_occurs = count_vec.fit_transform([text])
# Pair each vocabulary word with its count in the (single) document.
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
count_occur_df = pd.DataFrame({
    'Word': count_vec.get_feature_names_out(),
    'Count': count_occurs.toarray()[0],
})
count_occur_df.sort_values('Count', ascending=False, inplace=True)
count_occur_df.head()

#### b. Normalized Count Occurrence

If you are concerned that very frequent words may dominate the result and bias the model, you can normalize the counts. Normalization is easy to add to the pipeline.
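
To make the normalization concrete, here is a small sketch on a made-up one-sentence document. It divides the raw counts by their L2 norm by hand and checks that the result matches `TfidfVectorizer(use_idf=False, norm='l2')`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up document for illustration.
doc = ["data science and data mining"]

# Raw counts, in alphabetical vocabulary order: and, data, mining, science.
counts = CountVectorizer().fit_transform(doc).toarray()[0].astype(float)

# Manual L2 normalization: divide by the Euclidean length of the count vector.
manual = counts / np.linalg.norm(counts)

# The same thing done by scikit-learn (term frequency only, no IDF).
sk = TfidfVectorizer(use_idf=False, norm="l2").fit_transform(doc).toarray()[0]

print(manual)
print(sk)
```

After normalization each document vector has length 1, so long documents no longer get larger values just because they contain more words.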

In [None]:
text = "Natural Language Processing (NLP) is a sub-field of artificial intelligence \
that deals understanding and processing human language. In light of new advancements \
in machine learning, many organizations have begun applying natural language processing \
for translation, chatbots and candidate filtering"

norm_count_vec = TfidfVectorizer(use_idf=False, norm='l2')
norm_count_occurs = norm_count_vec.fit_transform([text])
# Pair each vocabulary word with its L2-normalized count.
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
norm_count_occur_df = pd.DataFrame({
    'Word': norm_count_vec.get_feature_names_out(),
    'Count': norm_count_occurs.toarray()[0],
})
norm_count_occur_df.sort_values('Count', ascending=False, inplace=True)
norm_count_occur_df.head()

<a id="2"></a>
    
<font size="+2" color="indigo"><b>2. TF-IDF</b></font><br>

Term frequency-inverse document frequency (TF-IDF) is a numeric statistic intended to reflect how important a word is to a document in a collection (corpus).

### TF

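It is a measure of how frequently a term (t) appears in a document:

    tf = n / number of terms in a document

where n is the number of times the term appears in that document.

In the NLTK example above, the text has 53 tokens, so:

    tf(data)    = 5/53
    tf(science) = 2/53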
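
A quick way to check these term frequencies is to count the tokens directly. This sketch assumes plain whitespace tokenization, which is enough here because the text has no punctuation.

```python
from collections import Counter

text = (
    "what is going to happen next in data science "
    "is a mystery what has happened is history it is an "
    "interdisciplinary field that uses scientific method "
    "processes algorithms and systems to extract knowledge "
    "and insights from many structural and unstructured data "
    "data science is related to data mining machine learning and big data"
)

tokens = text.split()          # whitespace tokenization (assumption)
counts = Counter(tokens)
total = len(tokens)

# Term frequency = occurrences of the term / total number of terms
tf_data = counts["data"] / total
tf_science = counts["science"] / total

print(total, counts["data"], counts["science"])
print(round(tf_data, 4), round(tf_science, 4))
```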


### IDF

IDF is a measure of how rare, and therefore how informative, a term is across the documents:

    idf = log(number of documents / number of documents containing term 't')

We used a single text above, so the number of documents is 1. In practice a corpus may contain millions of documents. So let's assume we had 5 documents in total and "data" appeared in only one of them.

Then the IDF for our term is:

    idf(data) = log(5/1)


    tf-idf = tf * idf
    
#### Words with a higher score are more important, and those with a lower score are less important
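
The formulas can be worked through in a few lines. This sketch takes the assumption stated in the text (5 documents, "data" appearing in one of them), uses the natural logarithm, and takes tf(data) as 5 occurrences out of 53 whitespace tokens in the NLTK example text. The choice of log base is an assumption; it only rescales the scores.

```python
import math

n_docs = 5            # assumed corpus size
docs_with_data = 1    # assumed number of documents containing "data"

# Inverse document frequency with the natural log
idf_data = math.log(n_docs / docs_with_data)

# Term frequency of "data": 5 occurrences out of 53 tokens
tf_data = 5 / 53

# TF-IDF is the product of the two
tfidf_data = tf_data * idf_data

# A term that appears in every document gets idf = log(5/5) = 0,
# so its TF-IDF score is 0 no matter how often it occurs.
idf_everywhere = math.log(n_docs / n_docs)

print(round(idf_data, 4), round(tfidf_data, 4), idf_everywhere)
```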



In [None]:
text = "Natural Language Processing (NLP) is a sub-field of artificial intelligence \
that deals understanding and processing human language. In light of new advancements \
in machine learning, many organizations have begun applying natural language processing \
for translation, chatbots and candidate filtering"

tfidf_vec = TfidfVectorizer()
tfidf_count_occurs = tfidf_vec.fit_transform([text])
# Pair each vocabulary word with its TF-IDF score.
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
tfidf_count_occur_df = pd.DataFrame({
    'Word': tfidf_vec.get_feature_names_out(),
    'Count': tfidf_count_occurs.toarray()[0],
})
tfidf_count_occur_df.sort_values('Count', ascending=False, inplace=True)
tfidf_count_occur_df.head()

In [None]:
stop_words = ['a', 'an', 'the']

# Basic cleansing
def cleansing(text):
    # Tokenize
    tokens = text.split(' ')
    # Lower case
    tokens = [w.lower() for w in tokens]
    # Remove stop words
    tokens = [w for w in tokens if w not in stop_words]
    return ' '.join(tokens)

# All-in-one preprocessing
def preprocess_x(x):
    processed_x = [cleansing(text) for text in x]
    
    return processed_x

def build_model(mode):
    # Default parameters are used on purpose, for demonstration
    vect = None
    if mode == 'count':
        vect = CountVectorizer()
    elif mode == 'tf':
        vect = TfidfVectorizer(use_idf=False, norm='l2')
    elif mode == 'tfidf':
        vect = TfidfVectorizer()
    else:
        raise ValueError("Mode should be one of 'count', 'tf' or 'tfidf'")
    
    return Pipeline([
        ('vect', vect),
        ('clf' , LogisticRegression(solver='newton-cg',n_jobs=-1))
    ])

def pipeline(x, y, mode):
    processed_x = preprocess_x(x)
    
    model_pipeline = build_model(mode)
    # Fixed seed so that every mode is scored on the same folds
    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    
    scores = cross_val_score(model_pipeline, processed_x, y, cv=cv, scoring='accuracy')
    print("Accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))
    
    return model_pipeline

In [None]:
x = preprocess_x(x_train)
y = y_train
    
model_pipeline = build_model(mode='count')
model_pipeline.fit(x, y)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print('Vocabulary size: %d' % len(model_pipeline.named_steps['vect'].get_feature_names_out()))

<a id="3"></a>
    
<font size="+2" color="indigo"><b>3. BoW vs. TF-IDF</b></font><br>

In [None]:
print('Using Count Vectorizer------')
model_pipeline = pipeline(x_train, y_train, mode='count')

print('Using TF Vectorizer------')
model_pipeline = pipeline(x_train, y_train, mode='tf')

print('Using TF-IDF Vectorizer------')
model_pipeline = pipeline(x_train, y_train, mode='tfidf')

### The classifier used is Logistic Regression.
#### Count BoW performs better than TF-IDF in our case

When to use BoW over embeddings?

1. When building a baseline model.
2. When your dataset is small and the context is domain specific, BoW may work better than word embeddings. In a very domain-specific context, many words have no corresponding vector in pre-trained word embedding models (GloVe, fastText, etc.).
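
The second point can be illustrated with a small sketch. The `pretrained_vocab` set below is a hypothetical stand-in for the vocabulary of a pre-trained embedding model (it is not real GloVe or fastText data), and the sentence is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for a pre-trained embedding vocabulary.
pretrained_vocab = {"the", "patient", "was", "given", "daily"}

# Made-up domain-specific sentence.
docs = ["the patient was given remdesivir daily"]

# BoW builds its vocabulary from the data itself, so every token gets a feature.
vec = CountVectorizer().fit(docs)
bow_vocab = set(vec.vocabulary_)

# Words with a BoW feature but no pre-trained vector
missing = sorted(bow_vocab - pretrained_vocab)

print(missing)
```

The domain term still gets its own BoW column, while an embedding lookup restricted to the pre-trained vocabulary would have nothing for it.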

<font size="+3" color="Green"><b>Related Work:</b></font>

Next in the series: Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).

### See [here](https://www.kaggle.com/datatattle/all-about-nlp) for related work



![#Precious](https://i.imgur.com/5YSC6pg.gif)