# TF-IDF
## Term frequency and weighting in Information Retrieval (IR)

When analyzing text data, we often encounter words that occur across
multiple documents of every class. Such frequently occurring words typically
carry little useful or discriminatory information.
Term frequency-inverse document frequency (tf-idf) is a technique
that can be used to downweight those frequently occurring words in the
feature vectors.


### Uses of TF-IDF

* sentiment analysis
* stop-words filtering for text summarization and classification
* text-based recommender systems (83% of those in digital libraries used tf-idf, according to a 2015 survey)
* search engines for ranking and scoring


### Intro

* Term frequency (TF)
* Inverse document frequency (IDF)
* TF-IDF = TF * IDF


### Term frequency (TF)

Hans Peter Luhn (1957)

<img src="images/tf.png" alt="tf" style="width: 50%;"/>

* t - term
* d - document
* \*double normalization - useful if the corpus consists of both long and short documents
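As a rough illustration of how term frequency can be computed, the sketch below derives three common variants for a tokenized document: the raw count, the count relative to document length, and double normalization. The function name `term_frequencies`, the smoothing constant `k=0.5`, and the whitespace tokenization are assumptions made for this example; they are not taken from the figure above.

```python
from collections import Counter


def term_frequencies(tokens, k=0.5):
    """Return several tf variants for every term in one tokenized document.

    k is the double-normalization constant (0.5 is a common choice,
    assumed here for illustration).
    """
    if not tokens:
        return {}
    counts = Counter(tokens)
    total = len(tokens)                # document length in tokens
    max_count = max(counts.values())   # count of the most frequent term
    return {
        term: {
            "raw": count,                                    # f(t, d)
            "relative": count / total,                       # f(t, d) / |d|
            "double_norm": k + (1 - k) * count / max_count,  # damps length effects
        }
        for term, count in counts.items()
    }


doc = "the cat sat on the mat".split()
tf = term_frequencies(doc)
print(tf["the"])  # raw 2, relative 2/6, double_norm 1.0
print(tf["cat"])  # raw 1, relative 1/6, double_norm 0.75
```

Double normalization divides by the count of the most frequent term in the same document, so a long document does not receive larger tf values merely because it contains more tokens.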


### Inverse document frequency (IDF)

Karen Spärck Jones (1972)

<img src="images/idf.png" alt="idf" style="width: 70%;"/>

* N - total number of documents in the corpus 
* nt - number of documents where the term t appears

Note that adding the constant 1 to the denominator is
optional; it avoids division by zero for terms that do not occur in
any document. The log is used to ensure that low document frequencies are not
given too much weight.
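The effect of the optional constant can be checked on a toy corpus. The following is a minimal sketch, assuming the natural logarithm and documents represented as sets of tokens; the function name `idf` and the `smooth` flag are hypothetical names chosen for this example.

```python
import math


def idf(term, documents, smooth=True):
    """Inverse document frequency of `term` over a list of token sets.

    With smooth=True the denominator is n_t + 1; with smooth=False it is n_t.
    """
    n_docs = len(documents)                            # N
    n_t = sum(1 for doc in documents if term in doc)   # documents containing term
    denominator = n_t + 1 if smooth else n_t
    return math.log(n_docs / denominator)


corpus = [
    {"the", "cat", "sat"},
    {"the", "dog", "barked"},
    {"the", "cat", "purred"},
    {"the", "bird", "sang"},
]

print(idf("the", corpus, smooth=False))  # 0.0, the term is in every document
print(idf("dog", corpus, smooth=False))  # log(4), about 1.386
print(idf("the", corpus))                # log(4/5), about -0.223
print(idf("fish", corpus))               # log(4/1): unseen term, no error
```

Without the constant, a term that appears in no document (`"fish"` here) raises `ZeroDivisionError`. With it, a term that appears in every document gets a slightly negative value instead of exactly zero, which is one reason the constant is treated as optional.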


### TF-IDF

The tf-idf can be defined as the product of the term frequency and
the inverse document frequency:
TF-IDF = TF * IDF

<img src="images/tf_idf.png" alt="tf_idf" style="width: 70%;"/>


The tf-idf weight of a term t in a document d is:

1. highest when t occurs many times within a small number of documents
2. lower when t occurs fewer times in the document, or occurs in many documents
3. lowest when t occurs in virtually all documents
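These three properties can be observed on a small corpus. The sketch below is one possible implementation, assuming tf is the relative frequency of a term in its document and idf is the natural log of N / n_t without the optional constant (every scored term occurs in at least one document, so the denominator is never zero). The function name `tf_idf` and the example sentences are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per tokenized document in `corpus`."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


corpus = [
    "the cat chased the cat toy".split(),
    "the dog chased the ball".split(),
    "the bird sang".split(),
]

first = tf_idf(corpus)[0]
for term in ("cat", "toy", "chased", "the"):
    print(term, round(first[term], 3))
# cat 0.366     occurs twice, in one document only
# toy 0.183     occurs once, in one document only
# chased 0.068  occurs once, in two of three documents
# the 0.0       occurs in every document
```

In the first document, "cat" scores highest, "toy" and "chased" score lower for the two reasons given in the second property, and "the" drops to zero because it appears in all three documents.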