## COMMON TERMS
1. CORPUS (C): All the words in the whole dataset taken together (repeated words included).
2. VOCABULARY (V): The set of unique words in the corpus.
3. DOCUMENT (D): A single record in the dataset. For example, in the IMDB review dataset each individual review is one document.
4. WORD (W): A single token in a document.
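
To make these terms concrete, here is a minimal sketch that derives the corpus and the vocabulary from a small list of documents. It assumes lowercasing plus whitespace splitting is an adequate tokenizer, which is a simplification, and the variable names are only illustrative.

```python
# Each string is one DOCUMENT (D).
documents = [
    "people watch Joker",
    "Joker watch Joker",
    "people pass comment",
    "Joker pass comment",
]

# Assumption: lowercase + whitespace split is a good enough tokenizer here.
tokenized = [doc.lower().split() for doc in documents]

# CORPUS (C): every word from every document, repeats included.
corpus = [word for doc in tokenized for word in doc]

# VOCABULARY (V): the unique words in the corpus.
vocabulary = sorted(set(corpus))

print(len(corpus))  # 12
print(vocabulary)   # ['comment', 'joker', 'pass', 'people', 'watch']
```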

In [1]:
import numpy as np 
import pandas as pd

In [3]:
df=pd.DataFrame({'text':['people watch Joker','Joker watch Joker','people pass comment','Joker pass comment'],'output':[1,1,0,0]})
df

                  text  output
0   people watch Joker       1
1    Joker watch Joker       1
2  people pass comment       0
3   Joker pass comment       0


## 1. Bag of Words

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

### Important Hyperparameters (Defaults)
1. lowercase=True
2. tokenizer=None
3. stop_words=None
4. binary=False (with binary=True every non-zero count becomes 1, so only presence is recorded; this is often enough for sentiment analysis)
5. max_features=None (setting it to 5, for example, keeps only the 5 most frequent terms across the corpus)
6. ngram_range=(1,1)
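
The effect of three of these hyperparameters can be sketched on the same four documents. This is a small illustration, not an exhaustive tour; the `stop_words=["pass"]` list is an arbitrary choice made only for the demo.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["people watch Joker", "Joker watch Joker",
        "people pass comment", "Joker pass comment"]

# binary=True: record presence (1) or absence (0) instead of counts.
binary_cv = CountVectorizer(binary=True)
binary_counts = binary_cv.fit_transform(docs).toarray()
joker_col = binary_cv.vocabulary_["joker"]
print(binary_counts[1, joker_col])  # 1, even though "Joker" occurs twice

# max_features=1: keep only the single most frequent term in the corpus.
top_cv = CountVectorizer(max_features=1)
top_cv.fit(docs)
print(sorted(top_cv.vocabulary_))  # ['joker']

# stop_words: drop the listed words from the vocabulary (arbitrary demo list).
stop_cv = CountVectorizer(stop_words=["pass"])
stop_cv.fit(docs)
print(sorted(stop_cv.vocabulary_))  # ['comment', 'joker', 'people', 'watch']
```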


In [5]:
x= cv.fit_transform(df['text'])

In [6]:
# vocab

print(cv.vocabulary_)

{'people': 3, 'watch': 4, 'joker': 1, 'pass': 2, 'comment': 0}


In [7]:
x

<4x5 sparse matrix of type '<class 'numpy.int64'>'
	with 11 stored elements in Compressed Sparse Row format>

In [8]:
x.toarray()

array([[0, 1, 0, 1, 1],
       [0, 2, 0, 0, 1],
       [1, 0, 1, 1, 0],
       [1, 1, 1, 0, 0]], dtype=int64)

In [9]:
cv.transform(['Joker watch and pass comment']).toarray()

array([[1, 1, 1, 0, 1]], dtype=int64)

### Disadvantage:
e.g.:
1. This is a very good movie.
2. This is not a very good movie.

Under Bag of Words these two sentences get almost identical vectors, differing only in the count for "not", so they appear to mean nearly the same thing.

But their meanings are opposite.

Bag of Words ignores word order and context, so it cannot capture a large change in meaning that comes from a small change in wording.
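
One way to check this claim is to measure the cosine similarity between the two Bag of Words vectors. The sketch below uses `cosine_similarity` from scikit-learn as the closeness measure; that choice of metric is an assumption made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["This is a very good movie.",
             "This is not a very good movie."]

bow = CountVectorizer()
vectors = bow.fit_transform(sentences)

# The default tokenizer drops one-character tokens such as "a".
print(sorted(bow.vocabulary_))
# ['good', 'is', 'movie', 'not', 'this', 'very']

# A cosine similarity close to 1 means the vectors point in almost the same direction.
similarity = cosine_similarity(vectors)[0, 1]
print(round(similarity, 3))  # 0.913
```

The two sentences share five of six vocabulary words, so the similarity is 5 / sqrt(5 * 6), even though the sentiment is reversed.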

#### This can be partly handled by: n-grams

### Bag of n-grams

#### Bi-grams: (2 words at a time)

vocabulary:
people watch | watch joker | joker watch | people pass | pass comment | joker pass
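
The bi-gram vocabulary listed here can be built by hand by pairing each word with the one that follows it. The helper `word_bigrams` below is a hypothetical name, and the sketch assumes lowercasing plus whitespace splitting as the tokenizer.

```python
docs = ["people watch Joker", "Joker watch Joker",
        "people pass comment", "Joker pass comment"]

def word_bigrams(text):
    """Return the adjacent word pairs of a lowercased, whitespace-split text."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

print(word_bigrams("people watch Joker"))
# ['people watch', 'watch joker']

# Unique bi-grams across all documents, in alphabetical order.
bigram_vocabulary = sorted({bg for doc in docs for bg in word_bigrams(doc)})
print(bigram_vocabulary)
# ['joker pass', 'joker watch', 'pass comment', 'people pass', 'people watch', 'watch joker']
```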

In [10]:
cv1=CountVectorizer(ngram_range=(2,2))

In [11]:
x= cv1.fit_transform(df['text'])

In [13]:
# vocab

print(cv1.vocabulary_)

{'people watch': 4, 'watch joker': 5, 'joker watch': 1, 'people pass': 3, 'pass comment': 2, 'joker pass': 0}


In [14]:
x.toarray()

array([[0, 0, 0, 0, 1, 1],
       [0, 1, 0, 0, 0, 1],
       [0, 0, 1, 1, 0, 0],
       [1, 0, 1, 0, 0, 0]], dtype=int64)
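
Returning to the two movie sentences from the Disadvantage section, the sketch below vectorizes them with bi-grams only and measures how close the vectors are. Cosine similarity is again an assumed choice of metric.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["This is a very good movie.",
             "This is not a very good movie."]

bigram_cv = CountVectorizer(ngram_range=(2, 2))
bigram_vectors = bigram_cv.fit_transform(sentences)

# "is not" and "not very" appear only in the second sentence.
print(sorted(bigram_cv.vocabulary_))
# ['good movie', 'is not', 'is very', 'not very', 'this is', 'very good']

bigram_similarity = cosine_similarity(bigram_vectors)[0, 1]
print(round(bigram_similarity, 3))  # 0.671
```

Only three bi-grams are shared ("this is", "very good", "good movie"), so the negated sentence is pushed noticeably further away than a single extra word count would push it.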

#### Uni/Bi-grams: (1 or 2 words at a time)

In [15]:
cv2=CountVectorizer(ngram_range=(1,2))

In [17]:
x= cv2.fit_transform(df['text'])

In [18]:
# vocab

print(cv2.vocabulary_)

{'people': 6, 'watch': 9, 'joker': 1, 'people watch': 8, 'watch joker': 10, 'joker watch': 3, 'pass': 4, 'comment': 0, 'people pass': 7, 'pass comment': 5, 'joker pass': 2}


In [19]:
x.toarray()

array([[0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1],
       [0, 2, 0, 1, 0, 0, 0, 0, 0, 1, 1],
       [1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
       [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0]], dtype=int64)

In [23]:
print(cv2.get_feature_names_out())

['comment' 'joker' 'joker pass' 'joker watch' 'pass' 'pass comment'
 'people' 'people pass' 'people watch' 'watch' 'watch joker']


## 2. TF-IDF
(Assigns a weight to each word in each document: a word gets more weight when it is frequent in that document but rare in the corpus.)

TF - Term Frequency: the fraction of a document's terms that are the given term (0 <= TF <= 1)

IDF - Inverse Document Frequency: a term that is rare in the corpus has a high IDF, whereas a term that is frequent in the corpus has a low IDF

TF = (Number of times term appears in a document) / (Total number of terms in the document)

IDF = log((Total number of documents) / (Number of documents containing the term))

### TF-IDF = TF * IDF
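
The two formulas can be worked through by hand on the four example documents. This is a minimal sketch that assumes the natural logarithm and whitespace tokenization; `tf` and `idf` are illustrative helper names, and `idf` assumes the term occurs in at least one document.

```python
import math

docs = ["people watch Joker", "Joker watch Joker",
        "people pass comment", "Joker pass comment"]
tokenized = [doc.lower().split() for doc in docs]

def tf(term, tokens):
    """Share of the document's tokens that are `term`."""
    return tokens.count(term) / len(tokens)

def idf(term, all_tokens):
    """log(N / number of documents containing `term`), natural log assumed."""
    containing = sum(1 for tokens in all_tokens if term in tokens)
    return math.log(len(all_tokens) / containing)

tf_joker = tf("joker", tokenized[1])  # 2 of the 3 words in "Joker watch Joker"
idf_joker = idf("joker", tokenized)   # log(4 / 3): "joker" is in 3 of 4 documents
idf_watch = idf("watch", tokenized)   # log(4 / 2): "watch" is in 2 of 4 documents

print(round(tf_joker, 3), round(idf_joker, 3), round(idf_watch, 3))
# 0.667 0.288 0.693
print(round(tf_joker * idf_joker, 3))  # 0.192
```

"watch" is rarer than "joker" across the corpus, so it gets the higher IDF.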

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
x=tfidf.fit_transform(df['text'])

In [21]:
x.toarray()

array([[0.        , 0.49681612, 0.        , 0.61366674, 0.61366674],
       [0.        , 0.8508161 , 0.        , 0.        , 0.52546357],
       [0.57735027, 0.        , 0.57735027, 0.57735027, 0.        ],
       [0.61366674, 0.49681612, 0.61366674, 0.        , 0.        ]])

In [22]:
print(tfidf.idf_)
print(tfidf.get_feature_names_out())

[1.51082562 1.22314355 1.51082562 1.51082562 1.51082562]
['comment' 'joker' 'pass' 'people' 'watch']
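
These `idf_` values do not match the plain log(N / df) formula, because `TfidfVectorizer` by default smooths the IDF and then L2-normalizes each row. The sketch below reproduces the output with NumPy under those default settings (`smooth_idf=True`, `norm='l2'`, raw counts as TF).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["people watch Joker", "Joker watch Joker",
        "people pass comment", "Joker pass comment"]

counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # documents containing each term

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Raw counts times IDF, then scale each row to unit L2 length.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

tfidf = TfidfVectorizer()
sk_tfidf = tfidf.fit_transform(docs).toarray()

print(np.allclose(idf, tfidf.idf_))         # True
print(np.allclose(manual_tfidf, sk_tfidf))  # True
```

For "joker", which appears in 3 of the 4 documents, this gives ln(5 / 4) + 1, roughly 1.223, which matches the second entry of `idf_`.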
