### Author : Sanjoy Biswas
### Topic : TF - IDF Vectorizer
### Email : sanjoy.eee32@gmail.com

#### What is TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistic that measures how important a word is to a document, relative to the other documents in the same corpus.

It is computed by looking at how many times a word appears in a document, while also accounting for how often the same word appears in the other documents in the corpus.

The rationale behind this is the following:

- A word that appears frequently in a document is more relevant to that document, meaning there is a higher probability that the document is about, or related to, that specific word.
- A word that appears in many documents may prevent us from finding the right document in a collection; the word is relevant either to all documents or to none. Either way, it will not help us filter out a single document or a small subset of documents from the whole set.

![TF.PNG](attachment:TF.PNG)

![IDF.PNG](attachment:IDF.PNG)

![TFIDF.PNG](attachment:TFIDF.PNG)
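
For reference, one common textbook formulation of the three quantities is written out below. This is an assumption about what the images show, not a transcription of them: $f_{t,d}$ (the raw count of term $t$ in document $d$), $N$ (the number of documents in the corpus $D$), and the unsmoothed natural logarithm are choices made here for illustration, and libraries often use smoothed or normalized variants.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}
$$

$$
\mathrm{idf}(t, D) = \log \frac{N}{\left|\{\, d \in D : t \in d \,\}\right|}
$$

$$
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
$$

Under this formulation, a term that occurs in every document has $\mathrm{idf} = \log 1 = 0$, so its TF-IDF score is zero no matter how often it appears.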

### Import Libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
documentA = 'the man went out for a walk'
documentB = 'the children sat around the fire'

In [6]:
# Tokenize on whitespace to get each document's list of words
bagOfWordsA = documentA.split()
bagOfWordsB = documentB.split()

In [7]:
bagOfWordsA

['the', 'man', 'went', 'out', 'for', 'a', 'walk']

In [8]:
bagOfWordsB

['the', 'children', 'sat', 'around', 'the', 'fire']

In [9]:
# Vocabulary: every distinct word that occurs in either document
uniqueWords = set(bagOfWordsA).union(set(bagOfWordsB))

In [10]:
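# Start every vocabulary word at 0, then count its occurrences in document A
numOfWordsA = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsA:
    numOfWordsA[word] += 1

# Same for document B; words absent from a document keep a count of 0
numOfWordsB = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsB:
    numOfWordsB[word] += 1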
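
As a side note, the same per-document counts can be obtained with `collections.Counter` from the standard library. The sketch below is only an alternative to the loop above, not a replacement for it; the names `bag_a`, `bag_b`, `vocabulary`, `num_of_words_a` and `num_of_words_b` are introduced here for illustration, and the two sentences are re-declared so the snippet runs on its own.

```python
from collections import Counter

# Same two example sentences, tokenized on whitespace
bag_a = "the man went out for a walk".split()
bag_b = "the children sat around the fire".split()

# Vocabulary shared by both documents
vocabulary = set(bag_a) | set(bag_b)

# Counter tallies occurrences and returns 0 for words it has never seen
counts_a = Counter(bag_a)
counts_b = Counter(bag_b)

# Expand to the full vocabulary so both dictionaries have the same keys
num_of_words_a = {word: counts_a[word] for word in vocabulary}
num_of_words_b = {word: counts_b[word] for word in vocabulary}
```

Because a `Counter` returns 0 for missing keys, there is no need to pre-fill the dictionary before counting.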

In [12]:
numOfWordsA

{'a': 1,
 'around': 0,
 'out': 1,
 'fire': 0,
 'went': 1,
 'man': 1,
 'walk': 1,
 'the': 1,
 'children': 0,
 'sat': 0,
 'for': 1}

In [11]:
numOfWordsB

{'a': 0,
 'around': 1,
 'out': 0,
 'fire': 1,
 'went': 0,
 'man': 0,
 'walk': 0,
 'the': 2,
 'children': 1,
 'sat': 1,
 'for': 0}
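
With the raw counts available, the TF and IDF definitions from the introduction can be applied directly. The following is a minimal sketch under stated assumptions: TF is the count divided by the document length, and IDF is the unsmoothed `log(N / document frequency)`. The helper names `compute_tf` and `compute_idf` are hypothetical, and the documents are re-declared so the snippet is self-contained.

```python
import math

# Same two example sentences as above, re-declared so the sketch runs on its own
documents = {
    "A": "the man went out for a walk".split(),
    "B": "the children sat around the fire".split(),
}

def compute_tf(tokens):
    """Term frequency: count of each word divided by the document length."""
    counts = {}
    for word in tokens:
        counts[word] = counts.get(word, 0) + 1
    return {word: count / len(tokens) for word, count in counts.items()}

def compute_idf(docs):
    """Unsmoothed IDF (an assumption): log(N / number of documents containing the word)."""
    n_docs = len(docs)
    vocabulary = set().union(*docs.values())
    return {
        word: math.log(n_docs / sum(1 for tokens in docs.values() if word in tokens))
        for word in vocabulary
    }

idf = compute_idf(documents)

# TF-IDF per document: multiply each word's TF by its corpus-wide IDF
tfidf = {
    name: {word: tf * idf[word] for word, tf in compute_tf(tokens).items()}
    for name, tokens in documents.items()
}
```

In this sketch, "the" scores 0 in both documents because it occurs in both, which matches the rationale that a word shared by every document does not help tell them apart. scikit-learn's `TfidfVectorizer` applies IDF smoothing and L2 normalization by default, so its values will differ from the ones computed here.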