### Term Frequency-Inverse Document Frequency

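TF-IDF (term frequency-inverse document frequency) scores each word by how often it appears in one document, discounted by how many documents in the collection contain it. Words that are characteristic of a particular document end up with the highest scores.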

Suppose we have a set of documents on various topics (ten in the example that follows). As with a frequency distribution, we can find the most common words in each document. We can also find the words that are common across all documents. If we discount those widely shared words, we are left with descriptive words tied to each specific document. Documents that share the same leftover descriptive words are most likely about the same topic.
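As a minimal sketch of the idea, the following uses a made-up three-document corpus and plain whitespace tokenization. The corpus, the names `docs`, `toy_idf`, and `toy_tfidf`, and the choice of raw counts times a natural-log IDF are illustrative assumptions, not part of the dataset used in this notebook.

```python
import math
from collections import Counter

# Hypothetical toy corpus; names and texts are illustrative only.
docs = {
    "a": "the war began and the war ended",
    "b": "the moon landing and the lunar module",
    "c": "the war in the pacific",
}

# Term frequency: raw count of each whitespace-separated token per document.
tfs = {name: Counter(text.split()) for name, text in docs.items()}

def toy_idf(term):
    # Number of documents whose tokens include the term.
    df = sum(1 for counts in tfs.values() if term in counts)
    return math.log(len(docs) / df)

def toy_tfidf(name):
    # Score every term of one document as count * idf.
    return {term: count * toy_idf(term) for term, count in tfs[name].items()}

scores = toy_tfidf("a")
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

In this toy corpus "the" occurs in every document, so its score is zero, while "began" and "ended" occur only in document "a" and rank highest there.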

In [16]:
import nltk
import numpy as np
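# Load the ten sample files (tfidf_1.txt ... tfidf_10.txt) into a dict keyed by file name.
dataset = {}
for i in range(1, 11):
    file_name = "tfidf_" + str(i) + ".txt"
    # Assumes the files are UTF-8 encoded; reading them with the platform default
    # encoding can garble characters such as dashes.
    with open(file_name, "r", encoding="utf-8") as f:
        dataset[file_name] = f.read()
print(len(dataset))
print(dataset["tfidf_1.txt"])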

10
World War II (WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, though related conflicts began earlier. It involved the vast majority of the world's nations—including all of the great powers—eventually forming two opposing military alliances: the Allies and the Axis. It was the most widespread war in history, and directly involved more than 100 million people from over 30 countries. In a state of "total war", the major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, erasing the distinction between civilian and military resources. Marked by mass deaths of civilians, including the Holocaust (in which approximately 11 million people were killed) and the strategic bombing of industrial and population centres (in which approximately one million were killed, and which included the atomic bombings of Hiroshima and Nagasaki), it resulted in an estimated 50 million to 85 million fatal

In [15]:
def tf(dataset, file_name):
    # Raw term counts for one document (case-sensitive; punctuation tokens included).
    txt = dataset[file_name]
    tokens = nltk.word_tokenize(txt)
    freq = nltk.FreqDist(tokens)
    return freq

def get_tfs(dataset):
    # Term counts for every document, keyed by file name.
    return {file_name: tf(dataset, file_name) for file_name in dataset}

term_freqs = get_tfs(dataset)

In [25]:
import math
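def idf(dataset, term):
    # One boolean per document: True if the raw text contains `term`.
    # Note: `in` on a string is a substring test, so "world" also matches "worldwide".
    count = [term in dataset[file_name] for file_name in dataset]
    # Natural log of (number of documents / documents containing the term).
    # Raises ZeroDivisionError if no document contains the term.
    inv_df = np.log(len(count)/sum(count))
    return inv_df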
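The `idf` function above tests `term in dataset[file_name]`, which on strings is a substring check rather than a whole-word check. A possible token-based variant is sketched below. The names `tokenize`, `idf_tokens`, and `sample` are hypothetical, and the regex tokenizer is a stand-in for `nltk.word_tokenize` so that the sketch runs without NLTK data. Scores from this variant would not necessarily match the outputs shown in this notebook.

```python
import math
import re

def tokenize(text):
    # Simple alphabetic tokenizer; an assumption standing in for nltk.word_tokenize.
    return re.findall(r"[A-Za-z]+", text)

def idf_tokens(dataset, term):
    # Count documents that contain the term as a whole token.
    doc_count = sum(1 for text in dataset.values() if term in set(tokenize(text)))
    if doc_count == 0:
        raise ValueError("term does not occur in any document: " + term)
    return math.log(len(dataset) / doc_count)

# Hypothetical sample data for illustration only.
sample = {
    "d1": "a worldwide effort",
    "d2": "the world at war",
    "d3": "lunar surface",
}
print("world" in sample["d1"])      # substring test also matches "worldwide"
print(idf_tokens(sample, "world"))  # token test counts only d2
```

With the token test, "world" is counted in one of the three sample documents, whereas the substring test would count two.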

In [26]:
print(idf(dataset, "world"))

0.5108256237659907
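As a quick arithmetic check, the printed value is consistent with ln(10/6), which would mean the string "world" was found in 6 of the 10 documents. The count of 6 is inferred from the number itself, not verified against the files, so treat it as an assumption.

```python
import math

n_docs = 10        # number of files loaded earlier in the notebook
matching_docs = 6  # inferred from the printed IDF value (assumption)

value = math.log(n_docs / matching_docs)
print(round(value, 4))  # 0.5108
```

A term found in every document would give ln(10/10) = 0, and a term found in only one document would give ln(10), roughly 2.30, which is the largest IDF possible with ten documents.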


In [31]:
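def tfidf(dataset, file_name, n):
    # Return the n highest-scoring (term, score) pairs for one document.
    term_scores = {}
    file_fd = tf(dataset, file_name)
    for term in file_fd:
        # Keep only purely alphabetic tokens (drops punctuation and numbers).
        if term.isalpha():
            idf_val = idf(dataset, term)
            # Reuse the counts computed above instead of re-tokenizing for every term.
            tf_val = file_fd[term]
            tfidf_val = tf_val * idf_val
            term_scores[term] = round(tfidf_val, 2)
    return sorted(term_scores.items(), key=lambda x: x[1], reverse=True)[:n]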
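The function above calls `idf` once per term, and each call scans every document. For larger collections, one option is to compute the document frequencies in a single pass and reuse them. The sketch below is a possible restructuring, not the notebook's method: `tokenize`, `build_index`, `top_terms`, and the sample texts are hypothetical, and it matches whole tokens with a regex tokenizer, so its scores can differ from the substring-based ones in this notebook.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Regex tokenizer used only to keep the sketch dependency-free (assumption).
    return re.findall(r"[A-Za-z]+", text)

def build_index(dataset):
    # One pass: per-document term counts and per-term document frequencies.
    term_counts = {name: Counter(tokenize(text)) for name, text in dataset.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    return term_counts, doc_freq

def top_terms(term_counts, doc_freq, file_name, n):
    # Score each term as count * ln(N / document frequency), then keep the top n.
    n_docs = len(term_counts)
    scores = {
        term: round(count * math.log(n_docs / doc_freq[term]), 2)
        for term, count in term_counts[file_name].items()
    }
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:n]

# Hypothetical sample data for illustration only.
sample = {
    "x.txt": "Napoleon led the French army and Napoleon won",
    "y.txt": "the lunar module reached the Moon",
    "z.txt": "the French Revolution",
}
term_counts, doc_freq = build_index(sample)
result = top_terms(term_counts, doc_freq, "x.txt", 3)
print(result)
```

The index is built once and can then be queried for every file, so the cost of counting document frequencies is not repeated per term.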

In [32]:
tfidf_1 = tfidf(dataset, "tfidf_1.txt", 10)
print("tfidf_1: ", tfidf_1)

tfidf_1:  [('Soviet', 20.72), ('Union', 18.42), ('Axis', 16.12), ('Pacific', 11.51), ('Japan', 11.27), ('Germany', 11.27), ('Allies', 9.66), ('invasion', 9.66), ('World', 9.21), ('Asia', 9.21)]


In [34]:
for file_name in dataset:
    print(file_name + ": ", tfidf(dataset, file_name, 10))

tfidf_1.txt:  [('Soviet', 20.72), ('Union', 18.42), ('Axis', 16.12), ('Pacific', 11.51), ('Japan', 11.27), ('Germany', 11.27), ('Allies', 9.66), ('invasion', 9.66), ('World', 9.21), ('Asia', 9.21)]
tfidf_2.txt:  [('Armstrong', 11.51), ('lunar', 9.21), ('Aldrin', 6.91), ('Earth', 4.83), ('Apollo', 4.61), ('Moon', 4.61), ('UTC', 4.61), ('surface', 4.61), ('spacecraft', 4.61), ('step', 3.22)]
tfidf_3.txt:  [('Napoleon', 32.19), ('French', 16.86), ('Coalition', 11.51), ('Prussia', 6.91), ('military', 6.02), ('Revolution', 6.02), ('Battle', 6.02), ('against', 5.5), ('France', 4.85), ('Europe', 4.85)]
tfidf_4.txt:  [('Washington', 25.33), ('President', 6.44), ('Continental', 4.82), ('presided', 4.61), ('militia', 4.61), ('armies', 4.61), ('generals', 4.61), ('preservation', 4.61), ('opposition', 4.61), ('federal', 4.61)]
tfidf_5.txt:  [('Newton', 23.03), ('scientists', 6.91), ('motion', 4.83), ('mathematician', 4.61), ('Principia', 4.61), ('mechanics', 4.61), ('calculus', 4.61), ('laws', 4.6