# Chapter 3: Calculating the similarity between documents

https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/  
https://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html

Good explanation of cosine similarity with two words: https://stackoverflow.com/questions/1746501/can-someone-give-an-example-of-cosine-similarity-in-a-very-simple-graphical-wa

Step-by-step tutorial for calculating tf-idf and how it relates to document-query similarity: http://www.site.uottawa.ca/~diana/csi4107/cosine_tf_idf_example.pdf

https://stanford.edu/~rjweiss/public_html/IRiSS2013/text2/notebooks/tfidf.html

https://web.stanford.edu/class/cs276/handouts/lecture6-tfidf.ppt

https://cran.r-project.org/web/packages/tidytext/vignettes/tf_idf.html

## Working through a tutorial on tf-idf

In [1]:
import pandas as pd
from pandas import DataFrame

In [12]:
mycorpus = ['new york times', 'new york post', 'los angeles times']

In [3]:
from nltk.tokenize import word_tokenize

In [4]:
tokens = [word_tokenize(text) for text in mycorpus]
print(tokens)

[['new', 'york', 'times'], ['new', 'york', 'post'], ['los', 'angeles', 'times']]


In [5]:
from nltk import FreqDist

In [6]:
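# Flatten the per-document token lists into a single list of words
flatList = [word for sentList in tokens for word in sentList]
# Count how many times each word occurs across the whole corpus
tokenFreq = FreqDist(flatList)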

for word, frequency in tokenFreq.most_common(10):
    print(u'{}: {}'.format(word, frequency))

times: 2
york: 2
new: 2
angeles: 1
los: 1
post: 1


In [26]:
# Inspect the counts that the IDF calculation will use
tokenFreq['angeles']
tokenFreq.keys()

['times', 'angeles', 'los', 'york', 'new', 'post']

In [37]:
import math
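# Multiply by 1.0 to force float division (Python 2 would otherwise truncate, e.g. 3 / 2 == 1)
math.log(len(mycorpus) * 1.0 / tokenFreq['angeles'], 2)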

1.5849625007211563
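
The value above is the inverse document frequency (IDF) of "angeles". Written out, assuming the base-2 logarithm used in the cell, with $N$ the number of documents in the corpus and $\mathrm{df}_t$ the number of documents that contain term $t$ (the symbols are my notation, not the tutorial's):

$$
\mathrm{idf}_t = \log_2 \frac{N}{\mathrm{df}_t},
\qquad
\mathrm{idf}_{\text{angeles}} = \log_2 \frac{3}{1} \approx 1.585
$$

Note that the cell uses the count from `tokenFreq` in place of $\mathrm{df}_t$. The two agree for this corpus because no word occurs more than once in any document.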

In [65]:
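# Replace each count with its IDF, log2(N / count). Note that this overwrites
# the counts stored in tokenFreq; rerun the FreqDist cell to get them back.
for key in tokenFreq:
    tokenFreq[key] = math.log(len(mycorpus) * 1.0 / tokenFreq[key], 2)

tokenFreq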

FreqDist({'angeles': 1.5849625007211563,
          'los': 1.5849625007211563,
          'new': 0.5849625007211562,
          'post': 1.5849625007211563,
          'times': 0.5849625007211562,
          'york': 0.5849625007211562})
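
The cell above treats the corpus-wide counts in `tokenFreq` as document frequencies. That works here only because no word appears twice within a single document. A minimal sketch of the more general calculation follows, assuming the same base-2 logarithm; the helper names `document_frequency` and `inverse_document_frequency` are hypothetical and not part of the tutorial or of NLTK.

```python
import math

mycorpus = ['new york times', 'new york post', 'los angeles times']

def document_frequency(corpus):
    """Count, for each term, the number of documents that contain it."""
    df = {}
    for doc in corpus:
        # set() makes a term repeated inside one document count only once
        for term in set(doc.split()):
            df[term] = df.get(term, 0) + 1
    return df

def inverse_document_frequency(corpus):
    """Return {term: log2(N / df)} for every term in the corpus."""
    n_docs = float(len(corpus))
    return {term: math.log(n_docs / count, 2)
            for term, count in document_frequency(corpus).items()}

idf = inverse_document_frequency(mycorpus)
for term in sorted(idf):
    print('{}: {:.4f}'.format(term, idf[term]))
```

Keeping the result in a separate `idf` dictionary also leaves the original counts untouched, which may be convenient if later cells still need them.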

In [7]:
tokenFreq

FreqDist({'angeles': 1, 'los': 1, 'new': 2, 'post': 1, 'times': 2, 'york': 2})

In [8]:
from collections import Counter

In [13]:
for doc in mycorpus:
    # Count how often each word occurs in this document (raw term frequency)
    tf = Counter()
    for word in doc.split():
        tf[word] += 1
    print(list(tf.items()))

[('new', 1), ('york', 1), ('times', 1)]
[('new', 1), ('post', 1), ('york', 1)]
[('angeles', 1), ('los', 1), ('times', 1)]


In [16]:
# Helpers for building the document-term matrix (format() is a built-in, so no import is needed)
    
def build_lexicon(corpus):
    lexicon = set()
    for doc in corpus:
        lexicon.update([word for word in doc.split()])
    return lexicon

def tf(term, document):
    return freq(term, document)

def freq(term, document):
    return document.split().count(term)

vocabulary = build_lexicon(mycorpus)

doc_term_matrix = []
print('Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']')
for doc in mycorpus:
    print('The doc is "' + doc + '"')
    tf_vector = [tf(word, doc) for word in vocabulary]
    # 'count' rather than 'freq' avoids shadowing the freq() function above
    tf_vector_string = ', '.join(format(count, 'd') for count in tf_vector)
    print('The tf vector for Document %d is [%s]' % ((mycorpus.index(doc)+1), tf_vector_string))
    doc_term_matrix.append(tf_vector)

    # here's a test: why did I wrap mycorpus.index(doc)+1 in parens?  it returns an int...
    # try it!  type(mycorpus.index(doc) + 1)

print('All combined, here is our master document term matrix: ')
print(doc_term_matrix)

Our vocabulary vector is [times, angeles, los, york, new, post]
The doc is "new york times"
The tf vector for Document 1 is [1, 0, 0, 1, 1, 0]
The doc is "new york post"
The tf vector for Document 2 is [0, 0, 0, 1, 1, 1]
The doc is "los angeles times"
The tf vector for Document 3 is [1, 1, 1, 0, 0, 0]
All combined, here is our master document term matrix: 
[[1, 0, 0, 1, 1, 0], [0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0]]
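
The term-frequency matrix above is the input for the similarity calculation this chapter is working toward. As a rough sketch of the remaining steps, the block below weights each raw count by a base-2 IDF and compares the documents with cosine similarity. The function names (`idf`, `tfidf_vector`, `cosine_similarity`) and the choice of raw counts times IDF are my assumptions. The linked tutorials use several different weighting variants, so treat this as one possible version rather than the tutorial's own method.

```python
import math

mycorpus = ['new york times', 'new york post', 'los angeles times']

# Sort the vocabulary so the column order is reproducible between runs.
vocabulary = sorted(set(word for doc in mycorpus for word in doc.split()))

def idf(term, corpus):
    """log2(N / df), where df is the number of documents containing term."""
    df = sum(1 for doc in corpus if term in doc.split())
    return math.log(float(len(corpus)) / df, 2)

def tfidf_vector(doc, corpus, vocab):
    """Raw term count times idf, one entry per vocabulary term."""
    words = doc.split()
    return [words.count(term) * idf(term, corpus) for term in vocab]

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

vectors = [tfidf_vector(doc, mycorpus, vocabulary) for doc in mycorpus]

# Compare every pair of documents once.
for i in range(len(mycorpus)):
    for j in range(i + 1, len(mycorpus)):
        score = cosine_similarity(vectors[i], vectors[j])
        print('"{}" vs "{}": {:.3f}'.format(mycorpus[i], mycorpus[j], score))
```

With this weighting, "new york times" and "new york post" come out as the most similar pair because they share two terms, while "new york post" and "los angeles times" share no terms and score 0.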
