`NLP Engineers` often deal with a corpus of documents or texts.

`Raw Text` cannot be fed directly into machine learning algorithms. It is therefore important to represent these documents in a form that computers and algorithms understand, i.e. `vectors of numbers`. Methods that do this are called `feature extraction methods or feature encoding`.

## **Bag Of Words**

This is one of the most flexible, intuitive, and simple feature extraction methods.

Each text or sentence is represented as a list of counts of unique words, which is why this method is also referred to as `count vectorisation`. To vectorize our documents, all we have to do is count how many times each word appears.

The **`Bag Of Words`** model weighs words based on occurrence.

**Note:** Remove stop words (very common words such as "the" or "on") before doing count vectorization.

**Vocabulary** is the set of unique words in these documents; its size is the number of unique words.
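
To make the counting step concrete, here is a minimal sketch of count vectorization using only the standard library. The `tokenize` helper and its regular expression are assumptions made for illustration (a deliberately simple rule, not what any particular library does), and stop words are not removed here.

```python
import re
from collections import Counter

docs = [
    'The cat sat on the mat.',
    'The dog sat on the log.',
    'Dogs and cats living together.',
]

def tokenize(text):
    # Lowercase the text and keep alphabetic runs only (a deliberately simple rule).
    return re.findall(r'[a-z]+', text.lower())

tokenized = [tokenize(d) for d in docs]

# Vocabulary: the sorted set of unique words across all documents.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One count vector per document, with one column per vocabulary word.
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each row has one entry per vocabulary word, so every document maps to a vector of the same length regardless of how long the document is.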

In [3]:
from tensorflow.keras.preprocessing.text import Tokenizer

docs = [
    'The cat sat on the mat.',
    'The dog sat on the log.',
    'Dogs and cats living together.'
]

# Initialize the tokenizer
tokenizer = Tokenizer()

# Fit the tokenizer on the documents
tokenizer.fit_on_texts(docs)

print(f'Vocabulary: {list(tokenizer.word_index.keys())}')

# Convert the texts to a count matrix; column 0 is reserved (word indices start at 1)
text_matrix = tokenizer.texts_to_matrix(docs, mode='count')

print(text_matrix)

Vocabulary: ['the', 'sat', 'on', 'cat', 'mat', 'dog', 'log', 'dogs', 'and', 'cats', 'living', 'together']
[[0. 2. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 2. 1. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]]


### **Drawbacks**
The model is only concerned with whether known words occur in the document, not where they occur. Representing an entire document by a single count vector discards the order and structure of its words, which is a significant loss of information, but it is sufficient for many computational linguistics applications. It is computationally simple and widely used when word position and context are not relevant.
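
The loss of word order can be seen in a small sketch. The `bag_of_words` helper is a hypothetical name introduced for this illustration, and it assumes plain whitespace tokenization.

```python
from collections import Counter

def bag_of_words(text):
    # Whitespace tokenization on lowercased text; enough for this illustration.
    return Counter(text.lower().split())

a = bag_of_words('dog bites man')
b = bag_of_words('man bites dog')

print(a == b)  # True: both sentences map to the same counts
```

The two sentences mean different things, yet their bag-of-words representations are identical.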

___

## **TF-IDF (Term Frequency - Inverse Document Frequency)**

TF-IDF is a weighting method that gives greater weight to words that are rare across the corpus.

### **Term Frequency: tf(t,d)**

Term frequency measures how often a given word appears within a document.

Two common variants:

1. Term frequency adjusted for document length:

`tf(t,d) = (number of times term t appears in document d) ÷ (number of words in d)`

2. Logarithmically scaled frequency:

`tf(t,d) = log(1 + number of times term t appears in document d)`
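
Both variants can be sketched in a few lines. The function names are hypothetical, the input is assumed to be an already tokenized document, and the natural logarithm is an assumption (a different base only rescales the values).

```python
import math

def tf_length_normalized(term, tokens):
    # Count of the term divided by the total number of tokens in the document.
    return tokens.count(term) / len(tokens)

def tf_log_scaled(term, tokens):
    # log(1 + raw count); natural log is an assumption, the base only rescales.
    return math.log(1 + tokens.count(term))

tokens = 'the cat sat on the mat'.split()

print(tf_length_normalized('the', tokens))  # 2 / 6
print(tf_log_scaled('the', tokens))         # log(3)
```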


### **Inverse Document Frequency: idf(t,D)**

IDF is a measure of term importance. 

It is the logarithmically scaled ratio of the total number of documents to the number of documents containing term t.

`idf(t,D) = log( N / |{d ∈ D : t ∈ d}| )`

**Numerator:** Total number of documents (N)

**Denominator:** Number of documents containing term t

#### **Example:**

D = [ ‘a dog live in home’, ‘a dog live in the hut’, ‘hut is dog home’ ]   `# D is the corpus`

idf(dog, D) = log( total number of documents (3) / total number of documents with term “dog” (3) ) = log(3/3) = log(1) = 0
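
The same calculation can be checked with a short sketch. The `idf` function below is a hypothetical helper that follows the formula above, assumes whitespace tokenization and the natural logarithm, and assumes the term occurs in at least one document.

```python
import math

corpus = [
    'a dog live in home',
    'a dog live in the hut',
    'hut is dog home',
]
tokenized = [doc.split() for doc in corpus]

def idf(term, tokenized_docs):
    # N: total documents; df: documents containing the term at least once.
    n_docs = len(tokenized_docs)
    df = sum(1 for tokens in tokenized_docs if term in tokens)
    # Assumes df > 0, i.e. the term occurs somewhere in the corpus.
    return math.log(n_docs / df)

print(idf('dog', tokenized))  # log(3/3) = 0.0
print(idf('hut', tokenized))  # log(3/2)
print(idf('is', tokenized))   # log(3/1)
```

The rarer the term is across the corpus, the larger its idf value.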

### **TF-IDF: tf × idf**

`tfidf(t,d,D) = tf(t,d) × idf(t,D)`

We can now compute the TF-IDF score for each term in a document. 

The score indicates the importance of the word within the document, relative to the corpus.
As the example above shows, if the term “dog” appears in every document, its inverse document frequency is zero, so its TF-IDF score is zero as well. In other words, a word that is present in every document does not help to distinguish one document from another.
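
Putting the two parts together, the following sketch scores every term of the first example document. It assumes the length-normalized term frequency, the natural logarithm, and whitespace tokenization; the helper names are hypothetical.

```python
import math

corpus = [
    'a dog live in home',
    'a dog live in the hut',
    'hut is dog home',
]
tokenized = [doc.split() for doc in corpus]

def tf(term, tokens):
    # Length-normalized term frequency.
    return tokens.count(term) / len(tokens)

def idf(term, tokenized_docs):
    # log(N / df); assumes the term occurs in at least one document.
    df = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log(len(tokenized_docs) / df)

def tfidf(term, tokens, tokenized_docs):
    return tf(term, tokens) * idf(term, tokenized_docs)

first = tokenized[0]
scores = {term: tfidf(term, first, tokenized) for term in sorted(set(first))}
for term, score in scores.items():
    print(f'{term}: {score:.4f}')
```

Here “dog” scores exactly zero because it occurs in all three documents, while the other terms of the first document receive a positive score.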

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'a dog live in home',
    'a dog live in hut',
    'hut is dog home'
]

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

print(f'Vocabulary: {list(vectorizer.vocabulary_.keys())}')
print('idf:', vectorizer.idf_, '\n')
print('Vocabulary index:', vectorizer.vocabulary_, '\n')

vector = vectorizer.transform([docs[0]])
print(vector.toarray())

Vocabulary: ['dog', 'live', 'in', 'home', 'hut', 'is']
idf: [1.         1.28768207 1.28768207 1.28768207 1.69314718 1.28768207] 

Vocabulary index: {'dog': 0, 'live': 5, 'in': 3, 'home': 1, 'hut': 2, 'is': 4} 

[[0.40912286 0.52682017 0.         0.52682017 0.         0.52682017]]
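
Note that “dog” does not get a zero weight in this output, unlike in the hand calculation. With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` computes `idf(t) = ln((1 + N) / (1 + df(t))) + 1`, uses raw counts as term frequency, and scales each document vector to unit length. Its default token pattern also drops one-character tokens such as “a”. The sketch below reproduces the numbers above with the standard library only; it is an illustration of those defaults under these assumptions, not scikit-learn's actual implementation.

```python
import math

docs = [
    'a dog live in home',
    'a dog live in hut',
    'hut is dog home',
]
# Mimic the default token pattern, which drops one-character tokens such as 'a'.
tokenized = [[w for w in doc.split() if len(w) > 1] for doc in docs]
vocabulary = sorted({w for tokens in tokenized for w in tokens})
n_docs = len(tokenized)

def smoothed_idf(term):
    # Smoothed idf: ln((1 + N) / (1 + df)) + 1, so no term ever gets weight 0.
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1

idf = {term: smoothed_idf(term) for term in vocabulary}

# Raw tf-idf for the first document (raw counts times idf), then L2 normalization.
raw = [tokenized[0].count(term) * idf[term] for term in vocabulary]
norm = math.sqrt(sum(x * x for x in raw))
vector = [x / norm for x in raw]

print([round(idf[t], 8) for t in vocabulary])
print([round(x, 8) for x in vector])
```

The added 1 is why a term that appears in every document still contributes to the vector instead of being ignored entirely.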


### **Drawbacks**
TF-IDF makes feature extraction more robust than simply counting the occurrences of a term in a document, as the Bag-of-Words model does. However, it does not solve the major drawback of the BoW model: the order and structure of words in the document are still discarded.

### **Note**

**Sparsity:** As most documents typically use only a very small subset of the words in the corpus, the resulting matrix has many zero feature values (typically more than 99% of them). NLP practitioners often reduce the dimensionality with principal component analysis (PCA) or, to work directly on the sparse matrix, truncated SVD.
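
Even a toy corpus shows the effect. The sketch below builds a dense count matrix and a sparse alternative that stores only the non-zero counts; the variable names and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter

docs = [
    'the cat sat on the mat',
    'the dog sat on the log',
    'dogs and cats living together',
]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({w for tokens in tokenized for w in tokens})
counts = [Counter(tokens) for tokens in tokenized]

# Dense representation: one column per vocabulary word, mostly zeros.
dense = [[c[w] for w in vocabulary] for c in counts]

# Sparse representation: store only the non-zero counts per document.
sparse = [dict(c) for c in counts]

total_cells = len(dense) * len(vocabulary)
zero_cells = sum(row.count(0) for row in dense)
print(f'{zero_cells}/{total_cells} cells are zero')
```

With only three short documents more than half of the cells are already zero, and the proportion grows quickly as the vocabulary gets larger.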

**Naive Bayes Models:** Despite their over-simplified independence assumptions, naive Bayes classifiers have worked quite well in many real-world situations, famously document classification with BoW or TF-IDF features.