## TFIDF

TF-IDF stands for term frequency-inverse document frequency. The TF-IDF weight is widely used in information retrieval and text mining.
It is a statistical measure of how important a word is to a document within a collection (corpus): the weight rises with the number of times the word appears in the document and is offset by the number of documents in the corpus that contain the word.

http://www.tfidf.com/
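
As a rough illustration of the definition above, the textbook form of the weight can be computed by hand. This is a minimal sketch, assuming tf = (occurrences of the term in the document) / (words in the document) and idf = ln(N / df); the helper names `tf`, `idf`, `tf_idf` and the toy corpus are hypothetical, and scikit-learn itself uses a smoothed variant of this formula.

```python
import math

# Toy corpus (hypothetical), already tokenized and lower-cased
docs = [
    "the quick brown fox jumped over the lazy dog".split(),
    "the dog".split(),
    "the fox".split(),
]

def tf(term, doc):
    # Term frequency: share of the document's words that are `term`
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: ln(total documents / documents containing `term`)
    # Assumes `term` occurs in at least one document
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", docs[0], docs))    # 0.0, because "the" occurs in every document
print(tf_idf("quick", docs[0], docs))  # positive, because "quick" occurs in one document only
```

A word that appears in every document carries no discriminating information, so its weight collapses to zero under this unsmoothed formula.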

In [1]:
# Download the 20 Newsgroups dataset via scikit-learn, restricted to four categories
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
news_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True)
news_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [2]:
print(news_train.keys())

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])


In [3]:
print(news_train['target_names'])

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']


In [4]:
news_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

# Word Count - CountVectorizer 

Count the occurrences of each word in each document, which encodes every document as a vector of word counts.

In [5]:
# Demo code to understand CountVectorizer


In [6]:
from sklearn.feature_extraction.text import CountVectorizer
text = ["The quick brown fox jumped over the lazy dog.",
    "The dog.",
    "The fox."]

# Tokenize the documents and assign each unique word an integer column index
vector = CountVectorizer()
vector.fit(text)

# vocabulary_ maps each learned token to its column index
print("Print Vocabulary:"+str(vector.vocabulary_)+'\n\n')

# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
print("Feature names: "+str(vector.get_feature_names())+'\n\n')

counts = vector.transform(text)
# 3 documents (rows), each encoded over the 8 vocabulary words (columns)
print("The shape of count is: "+str(counts.shape)+'\n\n')
print("Printing count: "+'\n'+str(counts.toarray()))
      


Print Vocabulary:{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}


Feature names: ['brown', 'dog', 'fox', 'jumped', 'lazy', 'over', 'quick', 'the']


The shape of count is: (3, 8)


Printing count: 
[[1 1 1 1 1 1 1 2]
 [0 1 0 0 0 0 0 1]
 [0 0 1 0 0 0 0 1]]


# TFIDF Transformer

In [7]:
# Demo code to understand TfidfTransformer
from sklearn.feature_extraction.text import TfidfTransformer

# Create the transformer
vectorizer = TfidfTransformer()
vectorizer.fit(counts)
print('Learned IDF weight of each feature:'+str(vectorizer.idf_)+'\n\n')

freq = vectorizer.transform(counts)
print('Transforming the count matrix with the learned IDF weights:\n\n'+str(freq.toarray()))


Learned IDF weight of each feature:[1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718
 1.69314718 1.        ]


Transforming the count matrix with the learned IDF weights:

[[0.36388646 0.27674503 0.27674503 0.36388646 0.36388646 0.36388646
  0.36388646 0.42983441]
 [0.         0.78980693 0.         0.         0.         0.
  0.         0.61335554]
 [0.         0.         0.78980693 0.         0.         0.
  0.         0.61335554]]
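
The numbers above can be reproduced by hand. The sketch below assumes the default settings of `TfidfTransformer` (`smooth_idf=True`, `norm='l2'`), under which idf = ln((1 + n) / (1 + df)) + 1 and every row is scaled to unit Euclidean length. The count matrix is copied from the CountVectorizer demo output; the variable names are illustrative only.

```python
import math

# Count matrix copied from the CountVectorizer demo output (3 documents x 8 words)
counts = [
    [1, 1, 1, 1, 1, 1, 1, 2],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents in which each word appears
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed IDF (assumed default): ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))  # L2 norm of the weighted row
    tfidf.append([v / norm for v in weighted])

print([round(w, 8) for w in idf])
print([round(v, 8) for v in tfidf[1]])
```

The word "the" appears in all three documents, so it receives the smallest IDF weight; words that appear in a single document receive the largest.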


In [8]:
# CountVectorizer: count the occurrences of each word, encoding every document as a count vector
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_tf = count_vect.fit_transform(news_train.data)
X_train_tf.shape

(2257, 35788)

In [9]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_tf)
X_train_tfidf.shape

(2257, 35788)

In [10]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, news_train.target)
clf

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [11]:
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)  # the vocabulary was already learned by fit, so only transform is needed here
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)

In [12]:
for x in predicted:
    print(x)

3
1


In [13]:
news_train.target_names


['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
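
The predicted values are indices into `target_names`. A small sketch of the lookup, with the lists hard-coded from the outputs above so that it runs without the dataset:

```python
# Values copied from the notebook outputs above
target_names = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
docs_new = ['God is love', 'OpenGL on the GPU is fast']
predicted = [3, 1]

# Pair each document with the name of its predicted category
for doc, label in zip(docs_new, predicted):
    print(f"{doc!r} => {target_names[label]}")
```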

In [14]:
X_test_tf = count_vect.transform(news_test.data)
X_test_tfidf = tfidf_transformer.transform(X_test_tf)
predicted = clf.predict(X_test_tfidf)

In [15]:
from sklearn import metrics
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(news_test.target, predicted))
print(metrics.classification_report(news_test.target, predicted, target_names=news_test.target_names))
print(metrics.confusion_matrix(news_test.target, predicted))

Accuracy: 0.8348868175765646
                        precision    recall  f1-score   support

           alt.atheism       0.97      0.60      0.74       319
         comp.graphics       0.96      0.89      0.92       389
               sci.med       0.97      0.81      0.88       396
soc.religion.christian       0.65      0.99      0.78       398

           avg / total       0.88      0.83      0.84      1502

[[192   2   6 119]
 [  2 347   4  36]
 [  2  11 322  61]
 [  2   2   1 393]]

In [16]:
from sklearn.metrics import classification_report
y_true = [0,1,2,2,2]
y_pred = [0,0,2,2,1] 
target_names = ['class 0','class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names =target_names))


             precision    recall  f1-score   support

    class 0       0.50      1.00      0.67         1
    class 1       0.00      0.00      0.00         1
    class 2       1.00      0.67      0.80         3

avg / total       0.70      0.60      0.61         5



In [17]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_true, y_pred)

array([[1, 0, 0],
       [1, 0, 0],
       [0, 1, 2]], dtype=int64)

In [18]:
# Precision = no. of correct results / total no. of returned results
# Recall = no. of correct results / no. of results that should have been returned
# F1 score = 2 * (precision * recall) / (precision + recall)
# Micro-avg = total no. of correct results / total no. of returned results (pooled over all labels)
# Macro-avg = sum of the per-label precision values / total no. of labels
# Weighted-avg = sum of (precision of each label * support of each label) / total support
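
The definitions above can be checked against the small three-class example. The sketch below is plain Python with no scikit-learn; it starts from the confusion matrix printed earlier and assumes rows are true labels and columns are predicted labels. The helper `safe_div` is hypothetical and simply reports 0.0 where a ratio is undefined.

```python
# Confusion matrix copied from the output above: rows = true labels, columns = predicted labels
cm = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 2],
]
n = len(cm)

support = [sum(cm[i]) for i in range(n)]                        # true samples per label
returned = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # predictions per label
correct = [cm[i][i] for i in range(n)]                          # diagonal entries

def safe_div(a, b):
    # Report 0.0 when the denominator is zero
    return a / b if b else 0.0

precision = [safe_div(correct[i], returned[i]) for i in range(n)]
recall = [safe_div(correct[i], support[i]) for i in range(n)]
f1 = [safe_div(2 * p * r, p + r) for p, r in zip(precision, recall)]

# Averages of precision over the labels
micro = sum(correct) / sum(returned)
macro = sum(precision) / n
weighted = sum(p * s for p, s in zip(precision, support)) / sum(support)

for i in range(n):
    print(f"class {i}: precision={precision[i]:.2f} recall={recall[i]:.2f} f1={f1[i]:.2f}")
print(f"micro={micro:.2f} macro={macro:.2f} weighted={weighted:.2f}")
```

The weighted average of precision computed this way corresponds to the `avg / total` precision shown in the classification report.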


##### Magic-1

In [19]:
# CountVectorizer (word counts) + TfidfTransformer (TF-IDF weighting) -> TfidfVectorizer
# TfidfVectorizer performs both the word counting and the TF-IDF weighting in one step
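
A short sketch of that equivalence, reusing the three demo sentences from earlier. It assumes default parameters for all three classes; with other settings the two routes need matching arguments to agree.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox."]

# Two-step route: raw counts first, then TF-IDF weighting
counts = CountVectorizer().fit_transform(text)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both
one_step = TfidfVectorizer().fit_transform(text)

# With default settings the two matrices should agree
print(np.allclose(two_step.toarray(), one_step.toarray()))
```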

##### Magic-2



TfidfVectorizer + MultinomialNB => Pipeline
We create a pipeline of TfidfVectorizer and MultinomialNB and encapsulate it in a single object 'a'. All further operations (fit, predict) are then performed on the pipeline object 'a'.
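
One way this could look is sketched below. The step names `'tfidf'` and `'clf'` and the tiny training set are hypothetical, chosen so that the block runs without downloading the newsgroups data; in the notebook, `news_train.data` and `news_train.target` would take their place.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training set: label 0 = religion, label 1 = graphics
train_docs = ["God is love", "faith and prayer in church",
              "OpenGL on the GPU is fast", "rendering graphics with a shader"]
train_labels = [0, 0, 1, 1]

# 'a' is the pipeline object: vectorizer first, classifier second
a = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])

a.fit(train_docs, train_labels)  # fits the vectorizer, then the classifier
predicted = a.predict(["prayer and faith", "fast GPU rendering"])
print(predicted)
```

Because the pipeline stores the fitted vectorizer, new text goes straight into `a.predict` without a separate `transform` call.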
