## Mean Term Frequency Inverse Document Frequency Feature

[TF-IDF Site](http://www.tfidf.com/)

Tf-idf stands for term frequency-inverse document frequency. The tf-idf weight is often used in information retrieval and text mining: it is a statistical measure of how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

The tf-idf weight is composed of two terms:

#### 1. Term Frequency (TF)
This measures how frequently a term occurs in a document. It is the number of times a word appears in a document, divided by the total number of words in that document.
```
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
```

#### 2. Inverse Document Frequency (IDF)
This measures how important a term is. It is computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the term appears. Certain terms, such as "is", "of", and "that", may appear many times but carry little importance. We therefore weigh down the frequent terms and scale up the rare ones by computing the following:
```
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
```
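
The two definitions above can be sketched in Python as follows. This is a minimal illustration rather than the notebook's implementation: the function names `term_frequency` and `inverse_document_frequency` and the toy corpus are assumptions, and documents are assumed to be already tokenized into lists of words.

```python
import math

def term_frequency(term, document_tokens):
    """TF(t): occurrences of the term divided by the document length."""
    if not document_tokens:
        return 0.0
    return document_tokens.count(term) / len(document_tokens)

def inverse_document_frequency(term, corpus):
    """IDF(t): natural log of (number of documents / documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(corpus) / containing)

# Hypothetical toy corpus of three tokenized documents.
toy_corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "and", "the", "dog"],
]

tf_cat = term_frequency("cat", toy_corpus[0])
idf_cat = inverse_document_frequency("cat", toy_corpus)
idf_the = inverse_document_frequency("the", toy_corpus)
print(tf_cat, idf_cat, idf_the)
```

A term that occurs in every document, such as "the" in this toy corpus, gets an IDF of 0, which is how the measure discounts very common words.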

### Example

Consider a document containing 100 words in which the word cat appears 3 times. The term frequency (tf) for cat is then 3 / 100 = 0.03. Now assume we have 10 million documents and the word cat appears in one thousand of them. Using a base-10 logarithm for round numbers, the inverse document frequency (idf) is log10(10,000,000 / 1,000) = 4. The tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. (With the natural logarithm used in the IDF formula above, the idf would be about 9.21 and the weight about 0.28.)
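
The arithmetic in this example can be checked directly. Note that an idf of 4 requires a base-10 logarithm; the variable names below are purely illustrative.

```python
import math

term_count = 3                 # occurrences of "cat" in the document
document_length = 100          # total words in the document
total_documents = 10_000_000
documents_with_term = 1_000

tf = term_count / document_length
idf_base10 = math.log10(total_documents / documents_with_term)
idf_natural = math.log(total_documents / documents_with_term)

print(tf)                          # 0.03
print(idf_base10)                  # 4.0
print(round(tf * idf_base10, 2))   # 0.12
print(round(idf_natural, 2))       # 9.21
```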

In the current use case, chat conversations are partitioned by day, and each day's conversation is treated as a single document. Thus, 27 days of chat messages yield 27 documents.
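
One way to picture this partitioning is to group timestamped messages by calendar day and treat each group as one document. The sketch below uses made-up messages and a hypothetical `group_messages_by_day` helper; it is only an illustration of the idea, not the mechanism this notebook uses to split the chat log.

```python
from collections import defaultdict
from datetime import datetime

def group_messages_by_day(messages):
    """Map each calendar day to the list of messages sent on that day."""
    documents = defaultdict(list)
    for timestamp, text in messages:
        documents[timestamp.date()].append(text)
    return dict(documents)

# Made-up sample data: (timestamp, message) pairs.
sample_messages = [
    (datetime(2001, 7, 1, 9, 15), "hello there"),
    (datetime(2001, 7, 1, 17, 40), "checking in"),
    (datetime(2001, 7, 2, 8, 5), "cvs diff looks fine"),
]

daily_documents = group_messages_by_day(sample_messages)
print(len(daily_documents))  # 2 days, so 2 documents
```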

In [197]:
import gzip
import json
import math
import nltk
import string

from os.path import join

In [183]:
PUNCTUATION_SET = set(string.punctuation)
CHAT_LOG_FILE = "gnue_irc_chat_logs.tsv.gz"
SUMMARIZED_CHAT_LOG_FILE = "summarized_chat_logs.csv.gz"
CHAT_DATE_PARTITIONS_FILE = 'chat_date_partitions.csv'
SUMMARIZED_CHAT_DATE_PARTITIONS_FILE = 'summarized_chat_date_partitions.csv'
OUTPUT_DIR = "feature_outputs"
WORD_FREQUENCIES = {}
WORD_COUNT = 0

In [184]:
def strip_leading_and_trailing_punctuation(word):
    return word.strip(string.punctuation) 

In [185]:
def update_word_frequency_and_count(word, current_document_number):
    if word not in WORD_FREQUENCIES:
        WORD_FREQUENCIES[word] = {
            "total_frequency": 1,
            "containing_document_count": 1,
            "last_containing_document": current_document_number
            }
    else:
        WORD_FREQUENCIES[word]["total_frequency"] += 1
        if WORD_FREQUENCIES[word]["last_containing_document"] < current_document_number:
            WORD_FREQUENCIES[word]["containing_document_count"] += 1
            WORD_FREQUENCIES[word]["last_containing_document"] = current_document_number
    global WORD_COUNT 
    WORD_COUNT += 1

In [186]:
def all_chars_in_word_are_punctuation(word):
    return all(char in PUNCTUATION_SET for char in word)

In [187]:
def pre_process_words_in_sentence(sentence, current_document_number):
    # Accept str or bytes; anything else is rejected
    if not isinstance(sentence, str):
        try:
            sentence = sentence.decode('utf-8')
        except (AttributeError, UnicodeDecodeError):
            raise ValueError("Input must be a String or ByteString")
    if not sentence:
        print("No words here")
        return
        
    for word in sentence.split():
        if not all_chars_in_word_are_punctuation(word):
            word = strip_leading_and_trailing_punctuation(word)
            if ',' in word:
                # Use a separate name so the outer loop variable is not shadowed
                for sub_word in word.split(','):
                    if not all_chars_in_word_are_punctuation(sub_word):
                        update_word_frequency_and_count(sub_word, current_document_number)
            else:
                update_word_frequency_and_count(word, current_document_number)
    

In [188]:
pre_process_words_in_sentence(b"Hello,there. Ju,.,.'',,,,,,,,,st checking-in.", 2)

In [189]:
WORD_FREQUENCIES

{'Hello': {'total_frequency': 1,
  'containing_document_count': 1,
  'last_containing_document': 2},
 'there': {'total_frequency': 1,
  'containing_document_count': 1,
  'last_containing_document': 2},
 'Ju': {'total_frequency': 1,
  'containing_document_count': 1,
  'last_containing_document': 2},
 'st': {'total_frequency': 1,
  'containing_document_count': 1,
  'last_containing_document': 2},
 'checking-in': {'total_frequency': 1,
  'containing_document_count': 1,
  'last_containing_document': 2}}

In [190]:
WORD_COUNT

5

In [191]:
def generate_document_word_frequencies(date_partitions_file, input_file, output_directory, output_file):
    data_output_file = join(output_directory, output_file)
    with open(date_partitions_file, 'r') as chat_partitions, gzip.open(input_file, 'r') as chat_file, open(data_output_file, 'w') as out_file:
        csv_row = chat_partitions.readline()
        total_chats_in_document = int(csv_row.rstrip().split(',')[2])
        document_number = 1
        chat_line_count = 0
        total_chat_lines_in_all_documents = 0
        for i, chat_line in enumerate(chat_file):
            # Update the word counts for this chat line in the current document
            pre_process_words_in_sentence(chat_line, document_number)
            chat_line_count += 1
            if chat_line_count >= total_chats_in_document:
                total_chat_lines_in_all_documents += chat_line_count
                csv_row = chat_partitions.readline()
                if not csv_row:
                    print(i)
                    print(total_chat_lines_in_all_documents)
                    assert i+1 == total_chat_lines_in_all_documents
                    assert chat_line_count == total_chats_in_document
                    break
                    
                # reset the chat_line_count
                chat_line_count = 0
                document_number += 1
                total_chats_in_document = int(csv_row.rstrip().split(',')[2])
        global WORD_COUNT
        frequency_data = {
            "total_word_count": WORD_COUNT,
            "word_frequencies": WORD_FREQUENCIES
        }
        print(document_number)
        print(chat_line_count)
        json.dump(frequency_data, out_file)

In [192]:
# BE SURE TO RESET WORD_COUNT and WORD_FREQUENCIES BEFORE RUNNING THIS!
WORD_COUNT = 0
WORD_FREQUENCIES = {}
generate_document_word_frequencies(CHAT_DATE_PARTITIONS_FILE, CHAT_LOG_FILE, OUTPUT_DIR + "/all_chat_outputs", "chat_word_frequencies.json")
# generate_document_word_frequencies(SUMMARIZED_CHAT_DATE_PARTITIONS_FILE, SUMMARIZED_CHAT_LOG_FILE, OUTPUT_DIR, "summarized_doc_word_frequencies.json")



659164
659165
1784
337


In [194]:
WORD_FREQUENCIES

{'DEREK': {'total_frequency': 7,
  'containing_document_count': 7,
  'last_containing_document': 1458},
 'cvs': {'total_frequency': 3716,
  'containing_document_count': 720,
  'last_containing_document': 1784},
 'diff': {'total_frequency': 347,
  'containing_document_count': 202,
  'last_containing_document': 1777},
 'will': {'total_frequency': 13511,
  'containing_document_count': 1407,
  'last_containing_document': 1784},
 'give': {'total_frequency': 2396,
  'containing_document_count': 884,
  'last_containing_document': 1781},
 'you': {'total_frequency': 68619,
  'containing_document_count': 1612,
  'last_containing_document': 1784},
 'the': {'total_frequency': 134568,
  'containing_document_count': 1650,
  'last_containing_document': 1784},
 'diffs': {'total_frequency': 58,
  'containing_document_count': 47,
  'last_containing_document': 1764},
 'status': {'total_frequency': 518,
  'containing_document_count': 326,
  'last_containing_document': 1784},
 'tell': {'total_frequency': 2

In [195]:
WORD_COUNT

4311279

In [196]:
def get_term_frequency(word_frequency_in_document, total_number_of_terms_in_document):
    return word_frequency_in_document / total_number_of_terms_in_document


In [200]:
def get_inverse_document_frequency(total_number_of_documents, number_of_documents_with_word):
    # Natural log of the ratio N / n_t; math.log(a, b) would instead compute log base b of a
    return math.log(total_number_of_documents / number_of_documents_with_word)
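
A tf-idf weight is the product of the two quantities. The sketch below is an assumption about how the counts gathered above might be combined, not a confirmed part of this notebook: it uses the corpus-wide `total_frequency` and word count in place of per-document values, a natural logarithm, a hypothetical `tf_idf_from_counts` function, and three entries copied from the `WORD_FREQUENCIES` output shown earlier. The document count of 1784 is taken from the final `document_number` printed by that run.

```python
import math

def tf_idf_from_counts(term_count, total_terms, total_documents, documents_with_term):
    """Return TF * IDF, with IDF computed as ln(N / document frequency)."""
    tf = term_count / total_terms
    idf = math.log(total_documents / documents_with_term)
    return tf * idf

TOTAL_WORDS = 4311279     # WORD_COUNT from the run above
TOTAL_DOCUMENTS = 1784    # assumed document count (last document_number printed)

# (total_frequency, containing_document_count) copied from the output above.
sample_counts = {
    "the": (134568, 1650),
    "cvs": (3716, 720),
    "diffs": (58, 47),
}

weights = {
    word: tf_idf_from_counts(frequency, TOTAL_WORDS, TOTAL_DOCUMENTS, document_count)
    for word, (frequency, document_count) in sample_counts.items()
}
for word, weight in weights.items():
    print(word, weight)
```

Note that `math.log(x, b)` computes the logarithm of `x` in base `b`, so the ratio of document counts has to be formed before taking the logarithm.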
    