### Creating more features

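 - Unigrams: every word is treated as a feature
 - Bigrams: every pair of consecutive words is treated as a feature
 - Trigrams: every sequence of three consecutive words is treated as a feature
 - n-grams: every sequence of n consecutive words is treated as a feature
 - TF-IDF normalisation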

In [4]:
import nltk
import warnings
warnings.filterwarnings('ignore')
from sklearn.feature_extraction.text import CountVectorizer

In [1]:
sent_1  = ["This is a good movie."]
sent_2 = ["This movie is good buct actor is not present."]
sent_3 = ["this movie is not good."]

Suppose a corpus contains two sentences (sent_1 and sent_2) and is used to train a model on unigram features.
Both of these are positive reviews, so every word they contain, including good, movie and not, is associated with a positive label.
When we test the model on sent_3, it will likely be predicted as positive too, because sent_3 contains only words the model has already seen in positive reviews. That prediction is completely wrong: unigram features cannot tell "good" apart from "not good".
To capture this context we can club consecutive words together to form features. This increases the number of features, but in some cases the context and requirements make it necessary.
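The problem described above can be checked with a small pure-Python sketch. The `ngrams` helper is hypothetical (it is not part of scikit-learn); its regular expression is an assumption that imitates CountVectorizer's default tokenisation, which lowercases the text and keeps tokens of two or more word characters.

```python
import re

def ngrams(text, n):
    """Lowercase the text, keep tokens of 2+ word characters,
    and join every run of n consecutive tokens with a space."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# The two positive training sentences and the negative test sentence
train = ["This is a good movie.", "This movie is good buct actor is not present."]
test = "this movie is not good."

unseen_by_n = {}
for n in (1, 2):
    seen = {g for sentence in train for g in ngrams(sentence, n)}
    # Features of the test sentence that never occurred in training
    unseen_by_n[n] = set(ngrams(test, n)) - seen
    print(n, sorted(unseen_by_n[n]))
```

With unigrams nothing in sent_3 is new, so the model has no evidence against a positive label. With bigrams the feature `not good` appears only in sent_3, which is the signal the unigram model was missing.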

In [14]:
# ngram_range=(min_n, max_n); the default (1,1) means unigrams only
# (2,2) means bigrams only
# (3,3) means trigrams only
# (1,n) means everything from unigrams up to n-grams
cv = CountVectorizer(ngram_range=(1,1)) 

In [15]:
document = [sent_1[0],sent_2[0],sent_3[0]]

In [21]:
vectorized_corpus_unigram = cv.fit_transform(document).toarray()

In [22]:
len(vectorized_corpus_unigram[0])

8

In [28]:
cv.vocabulary_

{'this': 7,
 'is': 3,
 'good': 2,
 'movie': 4,
 'buct': 1,
 'actor': 0,
 'not': 5,
 'present': 6}
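To see how this vocabulary turns a sentence into a row of the count matrix, here is a minimal sketch. The dictionary is copied from the output above; `count_vector` is a hypothetical helper, and its tokenising regex is an assumption that mirrors CountVectorizer's default behaviour.

```python
import re
from collections import Counter

# Term -> column index, copied from cv.vocabulary_ above
vocabulary = {'this': 7, 'is': 3, 'good': 2, 'movie': 4,
              'buct': 1, 'actor': 0, 'not': 5, 'present': 6}

def count_vector(text, vocabulary):
    """Return one row of the count matrix: how often each vocabulary term occurs in text."""
    counts = Counter(re.findall(r"\b\w\w+\b", text.lower()))
    row = [0] * len(vocabulary)
    for term, column in vocabulary.items():
        row[column] = counts[term]  # Counter returns 0 for absent terms
    return row

print(count_vector("This movie is good buct actor is not present.", vocabulary))
print(count_vector("this movie is not good.", vocabulary))
```

Each column index corresponds to one word, ordered alphabetically, so the word "is" (column 3) gets a count of 2 for sent_2 because it occurs twice there.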

In [31]:
cv_bigram = CountVectorizer(ngram_range=(2,2))

In [32]:
vectorized_corpus_bigram = cv_bigram.fit_transform(document).toarray()
len(vectorized_corpus_bigram[0])

11

In [33]:
cv_bigram.vocabulary_

{'this is': 9,
 'is good': 4,
 'good movie': 3,
 'this movie': 10,
 'movie is': 6,
 'good buct': 2,
 'buct actor': 1,
 'actor is': 0,
 'is not': 5,
 'not present': 8,
 'not good': 7}

In [38]:
cv_trigram = CountVectorizer(ngram_range=(3,3))

In [39]:
vectorized_corpus_trigram = cv_trigram.fit_transform(document).toarray()
len(vectorized_corpus_trigram[0])

11

In [40]:
cv_trigram.vocabulary_

{'this is good': 9,
 'is good movie': 4,
 'this movie is': 10,
 'movie is good': 7,
 'is good buct': 3,
 'good buct actor': 2,
 'buct actor is': 1,
 'actor is not': 0,
 'is not present': 6,
 'movie is not': 8,
 'is not good': 5}

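Moving from unigrams to bigrams increases the number of features from 8 to 11. In this tiny corpus the trigram vocabulary also happens to contain 11 features, but on larger corpora the number of distinct n-grams typically keeps growing with n.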
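The feature counts can be reproduced, and extended to mixed ranges such as (1,2), with a short pure-Python sketch. The `vocabulary_for_range` helper is hypothetical and only approximates CountVectorizer under the assumption of its default lowercasing and two-or-more-character tokens.

```python
import re

docs = ["This is a good movie.",
        "This movie is good buct actor is not present.",
        "this movie is not good."]

def vocabulary_for_range(docs, min_n, max_n):
    """Collect the distinct n-grams for every n in [min_n, max_n]."""
    vocab = set()
    for doc in docs:
        tokens = re.findall(r"\b\w\w+\b", doc.lower())
        for n in range(min_n, max_n + 1):
            vocab.update(" ".join(tokens[i:i + n])
                         for i in range(len(tokens) - n + 1))
    return sorted(vocab)

# Vocabulary size for each (min_n, max_n) setting
sizes = {r: len(vocabulary_for_range(docs, *r))
         for r in [(1, 1), (2, 2), (3, 3), (1, 2), (1, 3)]}
for ngram_range, size in sizes.items():
    print(ngram_range, size)
```

The first three sizes agree with the 8, 11 and 11 reported above. A range such as (1,2) keeps the unigrams and adds the bigrams on top, giving 19 features, and (1,3) gives 30, which is where the growth in feature count becomes most visible.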

### TF-IDF Normalisation (Term Frequency - Inverse Document Frequency) 

 - Avoid features that occur very often, because they contain less information
 - Information decreases as the number of occurrences increases across different types of documents
 - So we define a new measure, term frequency-inverse document frequency (TF-IDF), which associates a weight with every term

TF-IDF is the product of two terms, TF and IDF.
 - TF is represented as tf(t,d), where t denotes a term and d denotes a document
 - IDF is represented as **idf(t) = log( N / (1 + df(t)) )**, where N is the total number of documents
 - df(t) is the number of documents in which t appears

In [50]:
sent_1  = ["This is good movie"]
sent_2 = ["This was good movie"]
sent_3 = ["this is not good movie"]

# TfidfVectorizer expects an iterable of strings, so unwrap each one-element list
corpus = [sent_1[0], sent_2[0], sent_3[0]]

In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [52]:
tfidf = TfidfVectorizer()

In [53]:
vc = tfidf.fit_transform(corpus).toarray()

In [54]:
print(vc)

[[0.46333427 0.59662724 0.46333427 0.         0.46333427 0.        ]
 [0.41285857 0.         0.41285857 0.         0.41285857 0.69903033]
 [0.3645444  0.46941728 0.3645444  0.61722732 0.3645444  0.        ]]


In [55]:
tfidf.vocabulary_

{'this': 4, 'is': 1, 'good': 0, 'movie': 2, 'was': 5, 'not': 3}
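The matrix printed above can be checked by hand. As far as I understand scikit-learn's defaults (`smooth_idf=True`, `norm='l2'`), TfidfVectorizer does not use the textbook IDF directly: it uses the smoothed form ln((1 + N) / (1 + df(t))) + 1 and then scales each row to unit Euclidean length. The sketch below applies those two steps in pure Python; the variable names and the tokenising regex are assumptions made for illustration.

```python
import math
import re
from collections import Counter

corpus = ["This is good movie", "This was good movie", "this is not good movie"]

# Term counts per document (tf), using lowercase tokens of 2+ word characters
docs = [Counter(re.findall(r"\b\w\w+\b", text.lower())) for text in corpus]
terms = sorted({term for doc in docs for term in doc})  # alphabetical column order

N = len(docs)
# df(t): number of documents that contain the term
df = {t: sum(1 for doc in docs if t in doc) for t in terms}
# Smoothed idf, assumed to match scikit-learn's default
idf = {t: math.log((1 + N) / (1 + df[t])) + 1 for t in terms}

rows = []
for doc in docs:
    raw = [doc[t] * idf[t] for t in terms]      # tf * idf
    norm = math.sqrt(sum(x * x for x in raw))   # L2 norm of the row
    rows.append([x / norm for x in raw])

print(terms)
for row in rows:
    print([round(x, 8) for x in row])
```

Words that occur in every document (good, movie, this) get the smallest idf of 1, while rarer words such as "was" and "not" receive larger weights, which is why they carry the highest values in their rows.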