### Description

This notebook demonstrates **TF-IDF**, short for term frequency–inverse document frequency, a word-representation method used in NLP problems and data analysis.

It refines the **Bag of Words** model, which treats every word equally. **TF-IDF** is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

**TF-IDF** equation:

- term frequency

$ tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} $

- inverse document frequency

$ idf_i = \mbox{log} \frac{N}{df_i} $

- term frequency–inverse document frequency

$ w_{i,j} = tf_{i,j} \times \mbox{log}\frac{N}{df_i} $

where:
- $i$ - index of term
- $j$ - index of document
- $n_{i,j}$ - number of occurrences of term $i$ in document $j$
- $k$ - index running over all terms in document $j$
- $N$ - corpus length (number of documents)
- $df_i$ - number of documents containing term i
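
The three formulas above can be traced by hand on a toy example. The sketch below is a minimal pure-Python illustration, assuming the natural logarithm and three small pre-tokenized documents invented for the purpose; `docs`, `tf`, `idf` and `tf_idf` are hypothetical names, not part of any library.

```python
import math

# Toy pre-tokenized corpus (hypothetical, for illustration only).
docs = [
    ["dog", "bark", "dog"],
    ["cat", "sleep"],
    ["dog", "sleep"],
]

def tf(term, doc):
    """Term frequency: occurrences of `term` divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(N / df), natural log assumed.

    Raises ZeroDivisionError if `term` occurs in no document.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, docs)

for term in ["dog", "bark"]:
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

Although `dog` occurs twice in the first toy document, `bark` receives the higher weight because it appears in only one of the three documents.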

### 1. Data

In [1]:
corpus = [
    "The dog barks in the morning.",
    "Over the sofa lies sleeping dog.",
    "My dog name is Farell, it is very energetic.",
    "The dog barks at the cars.",
    "Cat dislikes vegetables.",
    "Cats sleep during day and hunt during night.",
    "Cats, dogs and elephants are animals.",
    "Dogs can run quickly.",
    "My favourite animals are dogs.",
    "There are many different animals in the world.",
    "When I buy a house I will also adopt two cats.",
    "On cat is black and the other cat is white."
]

### 2. Cleaning corpus

In [4]:
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

for package in ["punkt", "wordnet", "stopwords"]:
    nltk.download(package)

porter_stemmer = PorterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()
# Build the stopword set once instead of reloading the list for every token.
english_stopwords = set(stopwords.words("english"))

def normalize_document(document, stemmer=porter_stemmer, lemmatizer=wordnet_lemmatizer):
    """Normalizes a document by performing the following steps:
        1. Changing each word to lowercase.
        2. Removing special characters and punctuation.
        3. Dividing text into tokens.
        4. Removing English stopwords.
        5. Stemming words.
        6. Lemmatizing words.
    """

    temp = document.lower()
    temp = re.sub(r"[^a-zA-Z0-9]", " ", temp)
    temp = word_tokenize(temp)
    temp = [t for t in temp if t not in english_stopwords]
    temp = [stemmer.stem(token) for token in temp]  # use the stemmer passed in
    temp = [lemmatizer.lemmatize(token) for token in temp]

    return temp

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\RhysL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\RhysL\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\RhysL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Previewing results.

In [5]:
offset = max(map(len, corpus))
for document in corpus:
    print(document.rjust(offset), " -> ", normalize_document(document))

                 The dog barks in the morning.  ->  ['dog', 'bark', 'morn']
              Over the sofa lies sleeping dog.  ->  ['sofa', 'lie', 'sleep', 'dog']
  My dog name is Farell, it is very energetic.  ->  ['dog', 'name', 'farel', 'energet']
                    The dog barks at the cars.  ->  ['dog', 'bark', 'car']
                      Cat dislikes vegetables.  ->  ['cat', 'dislik', 'veget']
  Cats sleep during day and hunt during night.  ->  ['cat', 'sleep', 'day', 'hunt', 'night']
         Cats, dogs and elephants are animals.  ->  ['cat', 'dog', 'eleph', 'anim']
                         Dogs can run quickly.  ->  ['dog', 'run', 'quickli']
                My favourite animals are dogs.  ->  ['favourit', 'anim', 'dog']
There are many different animals in the world.  ->  ['mani', 'differ', 'anim', 'world']
When I buy a house I will also adopt two cats.  ->  ['buy', 'hous', 'also', 'adopt', 'two', 'cat']
   On cat is black and the other cat is white.  ->  ['cat', 'black', 'cat', 

The output shows which tokens remain from each sentence after normalization.
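
To see what the stemming step contributes, the first four normalization steps can be reproduced with the standard library alone. This is a minimal sketch, assuming a tiny hand-picked stopword set rather than NLTK's English list and plain whitespace splitting rather than `word_tokenize`; `STOPWORDS` and `simple_normalize` are hypothetical names.

```python
import re

# Tiny hand-picked stopword set, assumed for illustration only;
# the notebook itself uses NLTK's much larger English list.
STOPWORDS = {"the", "in", "at", "is", "it", "my", "very"}

def simple_normalize(document):
    """Lowercase, replace non-alphanumerics with spaces, split, drop stopwords."""
    text = document.lower()
    text = re.sub(r"[^a-z0-9]", " ", text)
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]

print(simple_normalize("The dog barks in the morning."))
```

Without stemming the tokens stay as `barks` and `morning`; in the preview above, the Porter stemmer is what reduces them to `bark` and `morn`.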

### 3. Creating Bag of Words

Initializing the CountVectorizer model with the custom tokenizer, which also removes English stopwords.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# Use the custom normalizer as the tokenizer; fitting and transforming happen in the cells below.
bow = CountVectorizer(tokenizer=normalize_document)

Building the Bag of Words vocabulary from the corpus.

In [7]:
bow.fit(corpus)



Previewing tokens in the bag.

In [6]:
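# get_feature_names() was removed in newer scikit-learn releases; use get_feature_names_out().
print(bow.get_feature_names_out().tolist())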

['adopt', 'also', 'anim', 'bark', 'black', 'buy', 'car', 'cat', 'day', 'differ', 'dislik', 'dog', 'eleph', 'energet', 'farel', 'favourit', 'hous', 'hunt', 'lie', 'mani', 'morn', 'name', 'night', 'quickli', 'run', 'sleep', 'sofa', 'two', 'veget', 'white', 'world']


The bag contains 31 tokens, so each sentence will be represented by a vector of length 31. Transforming the corpus produces these vectors:

In [8]:
corpus_vectorized = bow.transform(corpus)

In [9]:
offset = max(map(len, corpus))
for document, document_vector in zip(corpus, corpus_vectorized.toarray()):
    print(document.rjust(offset), " -> ", document_vector)

                 The dog barks in the morning.  ->  [0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
              Over the sofa lies sleeping dog.  ->  [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0]
  My dog name is Farell, it is very energetic.  ->  [0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
                    The dog barks at the cars.  ->  [0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
                      Cat dislikes vegetables.  ->  [0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
  Cats sleep during day and hunt during night.  ->  [0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0]
         Cats, dogs and elephants are animals.  ->  [0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
                         Dogs can run quickly.  ->  [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0]
                My favourite animals are dogs.  ->  [0 0 1 0 0 0 0 0 0 0

These vectors now represent the sentences in the corpus.
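
The counting behind these vectors can be sketched without scikit-learn. The example below is a simplified illustration, not the `CountVectorizer` implementation: it assumes a few already-normalized token lists (`tokenized` is hypothetical toy data) and builds an alphabetically sorted vocabulary, matching the ordering seen in the feature list above.

```python
from collections import Counter

# Hypothetical pre-normalized documents (token lists).
tokenized = [
    ["dog", "bark", "morn"],
    ["dog", "bark", "car"],
    ["cat", "sleep", "cat"],
]

# The vocabulary is the sorted set of all tokens; a token's position
# in this list is its column index in every document vector.
vocabulary = sorted({token for doc in tokenized for token in doc})

def count_vector(doc, vocabulary):
    """Return the per-term counts of `doc`, ordered like `vocabulary`."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]

print(vocabulary)
for doc in tokenized:
    print(doc, "->", count_vector(doc, vocabulary))
```

Every vector has the same length as the vocabulary, and a token that occurs twice in a document, such as `cat` in the last toy document, gets a count of 2 in its column.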

### Creating TF-IDF values

Initializing the TF-IDF transformer.

In [12]:
from sklearn.feature_extraction.text import TfidfTransformer

tf_idf_transformer = TfidfTransformer()

Fitting the transformer, which computes the IDF weight of each term.

In [13]:
tf_idf_transformer.fit(corpus_vectorized)

Inspecting the IDF weight of each term.

In [14]:
tf_idf_transformer.idf_

array([2.87180218, 2.87180218, 2.178655  , 2.46633707, 2.87180218,
       2.87180218, 2.87180218, 1.77318989, 2.87180218, 2.87180218,
       2.87180218, 1.48550782, 2.87180218, 2.87180218, 2.87180218,
       2.87180218, 2.87180218, 2.87180218, 2.87180218, 2.87180218,
       2.87180218, 2.87180218, 2.87180218, 2.87180218, 2.87180218,
       2.46633707, 2.87180218, 2.87180218, 2.87180218, 2.87180218,
       2.87180218])
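
These values do not match $\mbox{log}\frac{N}{df_i}$ from the description: for `dog`, which occurs in 7 of the 12 documents, that formula gives about 0.54, while the smallest value in the array is 1.48550782. As far as the scikit-learn documentation describes it, `TfidfTransformer` defaults to `smooth_idf=True` and computes $\ln\frac{1+N}{1+df_i} + 1$. The sketch below checks that reading; the document frequencies are counted by hand from the token preview, and `smoothed_idf` is a hypothetical helper name.

```python
import math

N = 12  # number of documents in the corpus

def smoothed_idf(df, n_docs=N):
    """IDF as ln((1 + N) / (1 + df)) + 1, assumed to mirror smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + df)) + 1

# Document frequencies counted by hand from the normalized tokens.
document_frequency = {"dog": 7, "cat": 5, "bark": 2, "morn": 1}

for term, df in document_frequency.items():
    print(term.rjust(10), " : ", round(smoothed_idf(df), 8))
```

Under this assumption the four results agree with the distinct values in the array: the most common term gets the lowest weight, and every term that occurs in a single document shares the highest one.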

In [16]:
for term, idf_value in zip(bow.get_feature_names_out(), tf_idf_transformer.idf_):
    print(term.rjust(10), " : ", idf_value)

     adopt  :  2.8718021769015913
      also  :  2.8718021769015913
      anim  :  2.1786549963416464
      bark  :  2.466337068793427
     black  :  2.8718021769015913
       buy  :  2.8718021769015913
       car  :  2.8718021769015913
       cat  :  1.7731898882334818
       day  :  2.8718021769015913
    differ  :  2.8718021769015913
    dislik  :  2.8718021769015913
       dog  :  1.4855078157817008
     eleph  :  2.8718021769015913
   energet  :  2.8718021769015913
     farel  :  2.8718021769015913
  favourit  :  2.8718021769015913
      hous  :  2.8718021769015913
      hunt  :  2.8718021769015913
       lie  :  2.8718021769015913
      mani  :  2.8718021769015913
      morn  :  2.8718021769015913
      name  :  2.8718021769015913
     night  :  2.8718021769015913
   quickli  :  2.8718021769015913
       run  :  2.8718021769015913
     sleep  :  2.466337068793427
      sofa  :  2.8718021769015913
       two  :  2.8718021769015913
     veget  :  2.8718021769015913
     white  :  2

Computing the TF-IDF values for each document.

In [17]:
tfidf_docs = tf_idf_transformer.transform(corpus_vectorized)
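
With its default settings, `TfidfTransformer` is understood here to use the raw count as term frequency (rather than the count divided by document length) and then scale each document vector to unit Euclidean length (`norm="l2"`). The following is a minimal sketch under that assumption for the first document, whose tokens `dog`, `bark` and `morn` each occur once; the IDF weights are copied from the output above and `l2_normalize` is a hypothetical helper.

```python
import math

# IDF weights copied from the transformer output above.
idf = {
    "dog": 1.4855078157817008,
    "bark": 2.466337068793427,
    "morn": 2.8718021769015913,
}

# Raw term counts for "The dog barks in the morning."
counts = {"dog": 1, "bark": 1, "morn": 1}

def l2_normalize(weights):
    """Scale the values so that the sum of their squares equals 1."""
    norm = math.sqrt(sum(value ** 2 for value in weights.values()))
    return {term: value / norm for term, value in weights.items()}

# Unnormalized TF-IDF: raw count times IDF weight.
raw = {term: counts[term] * idf[term] for term in counts}
normalized = l2_normalize(raw)

for term, value in normalized.items():
    print(term.rjust(10), " : ", round(value, 8))
```

Up to rounding, the results agree with the non-zero values the notebook prints for this document (0.36529961 for `dog`, 0.60649426 for `bark`, 0.70620175 for `morn`), which supports the assumption about the defaults.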

In [14]:
print(tfidf_docs.toarray().shape)
print(tfidf_docs.toarray())

(12, 31)
[[0.         0.         0.         0.60649426 0.         0.
  0.         0.         0.         0.         0.         0.36529961
  0.         0.         0.         0.         0.         0.
  0.         0.         0.70620175 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.29839313
  0.         0.         0.         0.         0.         0.
  0.57685731 0.         0.         0.         0.         0.
  0.         0.49541176 0.57685731 0.         0.         0.
  0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.28615928
  0.         0.55320667 0.55320667 0.         0.         0.
  0.         0.         0.         0.55320667 0.         0.
  0.         0.         0.         0.         0.         0.
  0.        ]
 [0.         0.         0

In [19]:
tfidf_docs = tf_idf_transformer.transform(corpus_vectorized)

for doc_id in range(len(corpus)):
    print("Document id.{}: {}".format(doc_id, corpus[doc_id]))
    print("Tokens: {}".format(normalize_document(corpus[doc_id])))
    print("\n -- TF IDF Values for words in dictionary:")
    for term, freq in zip(bow.get_feature_names_out(), tfidf_docs[doc_id].T.toarray()):
        print(term.rjust(10), " : ", freq)
    print("\n ------------------")

Document id.0: The dog barks in the morning.
Tokens: ['dog', 'bark', 'morn']

 -- TF IDF Values for words in dictionary:
     adopt  :  [0.]
      also  :  [0.]
      anim  :  [0.]
      bark  :  [0.60649426]
     black  :  [0.]
       buy  :  [0.]
       car  :  [0.]
       cat  :  [0.]
       day  :  [0.]
    differ  :  [0.]
    dislik  :  [0.]
       dog  :  [0.36529961]
     eleph  :  [0.]
   energet  :  [0.]
     farel  :  [0.]
  favourit  :  [0.]
      hous  :  [0.]
      hunt  :  [0.]
       lie  :  [0.]
      mani  :  [0.]
      morn  :  [0.70620175]
      name  :  [0.]
     night  :  [0.]
   quickli  :  [0.]
       run  :  [0.]
     sleep  :  [0.]
      sofa  :  [0.]
       two  :  [0.]
     veget  :  [0.]
     white  :  [0.]
     world  :  [0.]

 ------------------
Document id.1: Over the sofa lies sleeping dog.
Tokens: ['sofa', 'lie', 'sleep', 'dog']

 -- TF IDF Values for words in dictionary:
     adopt  :  [0.]
      also  :  [0.]
      anim  :  [0.]
      bark  :  [0.]
  

In [20]:
tfidf_docs = tf_idf_transformer.transform(corpus_vectorized)

for doc_id in range(len(corpus)):
    print("Document id.{}: {}".format(doc_id, corpus[doc_id]))
    print("Tokens: {}".format(normalize_document(corpus[doc_id])))
    print("\n -- TF IDF Values for words in dictionary:")
    
    # Keep only the terms whose TF-IDF value is non-zero
    non_zero_terms = [(term, freq) for term, freq in zip(bow.get_feature_names_out(), tfidf_docs[doc_id].T.toarray()) if freq != 0]

    for term, freq in non_zero_terms:
        print(term.rjust(10), " : ", freq)
    
    print("\n ------------------")


Document id.0: The dog barks in the morning.
Tokens: ['dog', 'bark', 'morn']

 -- TF IDF Values for words in dictionary:
      bark  :  [0.60649426]
       dog  :  [0.36529961]
      morn  :  [0.70620175]

 ------------------
Document id.1: Over the sofa lies sleeping dog.
Tokens: ['sofa', 'lie', 'sleep', 'dog']

 -- TF IDF Values for words in dictionary:
       dog  :  [0.29839313]
       lie  :  [0.57685731]
     sleep  :  [0.49541176]
      sofa  :  [0.57685731]

 ------------------
Document id.2: My dog name is Farell, it is very energetic.
Tokens: ['dog', 'name', 'farel', 'energet']

 -- TF IDF Values for words in dictionary:
       dog  :  [0.28615928]
   energet  :  [0.55320667]
     farel  :  [0.55320667]
      name  :  [0.55320667]

 ------------------
Document id.3: The dog barks at the cars.
Tokens: ['dog', 'bark', 'car']

 -- TF IDF Values for words in dictionary:
      bark  :  [0.60649426]
       car  :  [0.70620175]
       dog  :  [0.36529961]

 ------------------
Docume