# Document distance

In this section, we analyze an algorithm for calculating the distance between two documents,
measured as the angle between their word-frequency vectors. In other words, we measure how much
the documents overlap in their constituent words, with no regard to word order. For example,
`["Steve","is","pretty","cool"]` is quite close to `["Steve","is","pretty","awesome"]`, and it is an
exact match for `["is","steve","pretty","cool"]`, since neither word order nor capitalization
matters. This algorithm has applications in search engines, where we match documents based on
the document distance between a query and a stored document.

In this section we walk through the algorithm line by line. See the "Document Distance"
handout for a full description of the algorithm and the iterative improvements we make to it. The
improvements lead to the following. Note that $O(L)$ denotes time linear in the length of an array
$L$.
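
To make the angle concrete: with word-count vectors $D_1$ and $D_2$, the distance is $\theta = \arccos\left(\frac{D_1 \cdot D_2}{\|D_1\|\,\|D_2\|}\right)$. The following is a minimal sketch that evaluates it for the example lists above. It uses `collections.Counter` for brevity, which is an assumption of this sketch and not the representation used in the notebook's code; the helper name `angle_between` is likewise hypothetical.

```python
import math
from collections import Counter

def angle_between(words1, words2):
    """Angle (radians) between the word-count vectors of two word lists."""
    # Lower-case every word so that "Steve" and "steve" count as the same word.
    c1 = Counter(w.lower() for w in words1)
    c2 = Counter(w.lower() for w in words2)
    # Only words present in both lists contribute to the dot product.
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = math.sqrt(sum(v * v for v in c1.values()) *
                     sum(v * v for v in c2.values()))
    # Clamp to guard against rounding slightly above 1.0 before acos.
    return math.acos(min(1.0, dot / norm))

a = ["Steve", "is", "pretty", "cool"]
b = ["Steve", "is", "pretty", "awesome"]
c = ["is", "steve", "pretty", "cool"]

close = angle_between(a, b)   # three of four words shared: acos(3/4)
same = angle_between(a, c)    # same words in a different order: angle 0
```

An angle of 0 means the two documents use the same words with the same relative frequencies; an angle of $\pi/2$ means they share no words at all.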

In [1]:
import math
import string
import sys
import cProfile

In [2]:
with open("book1.txt") as f:
    document1 = f.read()
with open("book2.txt") as f:
    document2 = f.read()

In [3]:
def read_file(filename):
    """ 
    Read the text file with the given filename;
    return a list of the lines of text in the file.
    """
    try:
        # the with-block closes the file even if reading fails
        with open(filename, 'r') as f:
            return f.readlines()
    except IOError:
        print("Error opening or reading input file: " + filename)
        sys.exit()

def get_words_from_line_list(L):
    """
    Parse the given list L of text lines into words.
    Return list of all words found.
    """

    word_list = []
    for line in L:
        words_in_line = get_words_from_string(line)
        word_list = word_list + words_in_line
    return word_list

def get_words_from_string(line):
    """
    Return a list of the words in the given input string,
    converting each word to lower-case.

    Input:  line (a string)
    Output: a list of strings 
              (each string is a sequence of alphanumeric characters)
    """
    word_list = []          # accumulates words in line
    character_list = []     # accumulates characters in word
    for c in line:
        if c.isalnum():
            character_list.append(c)
        elif len(character_list)>0:
            word = "".join(character_list)
            word = word.lower()
            word_list.append(word)
            character_list = []
    if len(character_list)>0:
        word = "".join(character_list)
        word = word.lower()
        word_list.append(word)
    return word_list

def count_frequency(word_list):
    """
    Return a list giving pairs of form: [word,frequency]
    """
    L = []
    for new_word in word_list:
        # linear scan over the pairs collected so far
        for entry in L:
            if new_word == entry[0]:
                entry[1] = entry[1] + 1
                break
        else:
            # inner loop ended without break: first occurrence of new_word
            L.append([new_word,1])
    return L
    
def word_frequencies_for_file(filename):
    """
    Return list of (word,frequency) pairs for the given file,
    in order of first occurrence (the list is not sorted).
    """
    line_list = read_file(filename)
    word_list = get_words_from_line_list(line_list)
    freq_mapping = count_frequency(word_list)
    return freq_mapping

def inner_product(L1,L2):
    """
    Inner product between two vectors, where vectors
    are represented as lists of (word,freq) pairs.

    Example: inner_product([["and",3],["of",2],["the",5]],
                           [["and",4],["in",1],["of",1],["this",2]]) = 14.0 
    """
    total = 0.0     # named total to avoid shadowing the built-in sum()
    for word1, count1 in L1:
        # compare every word of L1 against every word of L2
        for word2, count2 in L2:
            if word1 == word2:
                total += count1 * count2
    return total

def vector_angle(L1,L2):
    """
    The inputs are two lists of (word,freq) pairs.

    Return the angle between these two vectors.
    """
    numerator = inner_product(L1,L2)
    denominator = math.sqrt(inner_product(L1,L1)*inner_product(L2,L2))
    return math.acos(numerator/denominator)
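
If the (word, frequency) lists are first sorted alphabetically by word, the nested loops in `inner_product` can be replaced by a single merge-style pass over both lists. The following is a minimal sketch of that idea, not code taken from the notebook or the handout; the name `inner_product_sorted` is hypothetical, and the sketch assumes each word appears at most once per list.

```python
def inner_product_sorted(L1, L2):
    """
    Inner product of two vectors given as lists of [word, freq] pairs,
    each sorted alphabetically by word (assumed: no repeated words).
    """
    total = 0.0
    i = j = 0
    while i < len(L1) and j < len(L2):
        word1, count1 = L1[i]
        word2, count2 = L2[j]
        if word1 == word2:
            # word present in both documents: it contributes to the product
            total += count1 * count2
            i += 1
            j += 1
        elif word1 < word2:
            i += 1      # word1 cannot appear later in L2, skip it
        else:
            j += 1      # word2 cannot appear later in L1, skip it
    return total

# The example from the inner_product docstring, already in sorted order:
v1 = [["and", 3], ["of", 2], ["the", 5]]
v2 = [["and", 4], ["in", 1], ["of", 1], ["this", 2]]
result = inner_product_sorted(v1, v2)   # 3*4 + 2*1 = 14.0
```

The merge pass takes time linear in `len(L1) + len(L2)`, whereas the nested loops take time proportional to `len(L1) * len(L2)`. The cost of sorting the lists comes on top of that.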

In [4]:
def document_distance(d1, d2):
    """
    Return the angle (in radians) between the word-frequency
    vectors of the files named d1 and d2.
    """
    freq_list_1 = word_frequencies_for_file(d1)
    freq_list_2 = word_frequencies_for_file(d2)
    distance = vector_angle(freq_list_1,freq_list_2)
    return distance


In [5]:
distance = document_distance("book1.txt", "book2.txt")
print("The distance between the documents is: %0.6f (radians)" % distance)

KeyboardInterrupt: 

In [10]:
distance = document_distance("book1_small.txt", "book2_small.txt")
print("The distance between the documents is: %0.6f (radians)" % distance)

KeyboardInterrupt: 

In [8]:
distance = document_distance("book1_tiny.txt", "book2_tiny.txt")
print("The distance between the documents is: %0.6f (radians)" % distance)

The distance between the documents is: 0.667024 (radians)


In [9]:
cProfile.run('document_distance("book1_tiny.txt", "book2_tiny.txt")')

         42760 function calls in 0.193 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.000    0.000 <ipython-input-3-5fa6de42b2ce>:1(read_file)
        2    0.001    0.001    0.024    0.012 <ipython-input-3-5fa6de42b2ce>:13(get_words_from_line_list)
      200    0.016    0.000    0.023    0.000 <ipython-input-3-5fa6de42b2ce>:25(get_words_from_string)
        2    0.066    0.033    0.066    0.033 <ipython-input-3-5fa6de42b2ce>:50(count_frequency)
        2    0.000    0.000    0.090    0.045 <ipython-input-3-5fa6de42b2ce>:64(word_frequencies_for_file)
        3    0.102    0.034    0.102    0.034 <ipython-input-3-5fa6de42b2ce>:74(inner_product)
        1    0.000    0.000    0.102    0.102 <ipython-input-3-5fa6de42b2ce>:89(vector_angle)
        1    0.000    0.000    0.193    0.193 <ipython-input-4-e8249b2cd580>:1(document_distance)
        1    0.000    0.000    0.193    0.193 <string>:1(<mo
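
In the profile above, `inner_product` (0.102 s) and `count_frequency` (0.066 s) account for most of the 0.193 s total. Hot spots like these are easier to find when the report is sorted by time instead of by name. The following is a minimal sketch using the standard `pstats` module, written for Python 3; the function `slow_pairs` is a hypothetical stand-in for `document_distance`, so that the block runs without the book files.

```python
import cProfile
import io
import pstats

def slow_pairs(n):
    """Hypothetical stand-in workload: a quadratic nested loop."""
    count = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                count += 1
    return count

# Profile one call of the workload.
profiler = cProfile.Profile()
profiler.enable()
slow_pairs(200)
profiler.disable()

# Collect the report in a string, sorted by the time spent inside each
# function itself (the "tottime" column), limited to the first five rows.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("tottime").print_stats(5)
report = buffer.getvalue()
print(report)
```

Sorting by `"cumulative"` instead shows which top-level calls contain the expensive work, which is useful when the cost is spread over several helpers.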

## Optimization 1: remove list concatenation

In [11]:
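def get_words_from_line_list(L):
    """
    Parse the given list L of text lines into words.
    Return list of all words found.
    """
    word_list = []
    for line in L:
        words_in_line = get_words_from_string(line)
        # extend() appends in place instead of copying the whole list
        word_list.extend(words_in_line)
    # return only after every line has been processed
    return word_list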
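
The change in this cell replaces `word_list = word_list + words_in_line`, which builds a brand-new list on every iteration, with `word_list.extend(words_in_line)`, which appends in place. The following is a minimal sketch of the difference on made-up input lines. The helper names `collect_with_concat` and `collect_with_extend` are hypothetical, and `str.split` stands in for `get_words_from_string` to keep the block self-contained.

```python
import timeit

def collect_with_concat(lines):
    word_list = []
    for line in lines:
        # creates a new list and copies every word collected so far
        word_list = word_list + line.split()
    return word_list

def collect_with_extend(lines):
    word_list = []
    for line in lines:
        # appends the new words to the existing list in place
        word_list.extend(line.split())
    return word_list

lines = ["the quick brown fox"] * 2000   # made-up input

# Both versions build exactly the same list ...
same_result = collect_with_concat(lines) == collect_with_extend(lines)

# ... but the concatenating version does far more copying.
t_concat = timeit.timeit(lambda: collect_with_concat(lines), number=5)
t_extend = timeit.timeit(lambda: collect_with_extend(lines), number=5)
print("concat: %.4f s, extend: %.4f s" % (t_concat, t_extend))
```

With concatenation, each line costs time proportional to the number of words collected so far, so the total grows quadratically with the document length. With `extend`, each line costs time proportional to its own length (amortized), so the total is linear.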

In [12]:
distance = document_distance("book1_tiny.txt", "book2_tiny.txt")
print("The distance between the documents is: %0.6f (radians)" % distance)

The distance between the documents is: 1.201624 (radians)


In [19]:
distance = document_distance("book1_small.txt", "book2_small.txt")
print("The distance between the documents is: %0.6f (radians)" % distance)

The distance between the documents is: 1.201624 (radians)


In [21]:
cProfile.run('document_distance("book1.txt", "book2.txt")')

         442 function calls in 0.021 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.000    0.000 <ipython-input-11-af1902ee9362>:1(get_words_from_line_list)
        2    0.000    0.000    0.016    0.008 <ipython-input-3-5fa6de42b2ce>:1(read_file)
        2    0.000    0.000    0.000    0.000 <ipython-input-3-5fa6de42b2ce>:25(get_words_from_string)
        2    0.000    0.000    0.000    0.000 <ipython-input-3-5fa6de42b2ce>:50(count_frequency)
        2    0.000    0.000    0.017    0.008 <ipython-input-3-5fa6de42b2ce>:64(word_frequencies_for_file)
        3    0.000    0.000    0.000    0.000 <ipython-input-3-5fa6de42b2ce>:74(inner_product)
        1    0.000    0.000    0.000    0.000 <ipython-input-3-5fa6de42b2ce>:89(vector_angle)
        1    0.004    0.004    0.020    0.020 <ipython-input-4-e8249b2cd580>:1(document_distance)
        1    0.000    0.000    0.021    0.021 <string>:1(<modu

Note: the outputs recorded in this section are consistent with a run in which `return word_list` was indented inside the `for` loop of `get_words_from_line_list`, so that only the first line of each file was parsed. The tiny and small inputs give the same distance (1.201624), and `get_words_from_string` is called only twice on the full books. Since `extend` builds the same word list as concatenation, a correct run on the tiny inputs should reproduce the 0.667024 reported earlier.

## Various smaller optimizations

In [19]:
# global variables needed for fast parsing
# translation table maps upper case to lower case and punctuation to spaces
translation_table = string.maketrans(string.punctuation+string.uppercase,
                                     " "*len(string.punctuation)+string.lowercase)

def get_words_from_text(text):
    """
    Parse the given text into words.
    Return list of all words found.
    """
    text = text.translate(translation_table)
    word_list = text.split()
    return word_list

def count_frequency(word_list):
    """
    Return a dictionary mapping words to frequency.
    """
    D = {}
    for new_word in word_list:
        if new_word in D:
            D[new_word] = D[new_word]+1
        else:
            D[new_word] = 1
    return D

def word_frequencies_for_text(text):
    """
    Return dictionary mapping words to frequency for the given text.
    """
    word_list = get_words_from_text(text)
    freq_mapping = count_frequency(word_list)
    return freq_mapping

def inner_product(D1,D2):
    """
    Inner product between two vectors, where vectors
    are represented as dictionaries of (word,freq) pairs.

    Example: inner_product({"and":3,"of":2,"the":5},
                           {"and":4,"in":1,"of":1,"this":2}) = 14.0 
    """
    total = 0.0     # named total to avoid shadowing the built-in sum()
    for key in D1:
        # dictionary membership test replaces the inner loop over D2
        if key in D2:
            total += D1[key] * D2[key]
    return total

def vector_angle(D1,D2):
    """
    The inputs are two dictionaries mapping words to frequencies.

    Return the angle between these two vectors.
    """
    numerator = inner_product(D1,D2)
    denominator = math.sqrt(inner_product(D1,D1)*inner_product(D2,D2))
    return math.acos(numerator/denominator)
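
`string.maketrans`, `string.uppercase`, and `string.lowercase` exist only in Python 2. For readers on Python 3, the following is a minimal sketch of an equivalent parser built on `str.maketrans`. It is an adaptation, not part of the original notebook, and the `_py3` suffixes are hypothetical names chosen to avoid clashing with the definitions above.

```python
import string

# Map punctuation to spaces and upper-case ASCII letters to lower case.
translation_table_py3 = str.maketrans(
    string.punctuation + string.ascii_uppercase,
    " " * len(string.punctuation) + string.ascii_lowercase,
)

def get_words_from_text_py3(text):
    """Parse the given text into a list of lower-case words."""
    return text.translate(translation_table_py3).split()

words = get_words_from_text_py3("Steve is pretty cool -- isn't he?")
# words == ['steve', 'is', 'pretty', 'cool', 'isn', 't', 'he']
```

Because every punctuation character becomes a space, a contraction such as `isn't` is split into two words. The whole text is translated in one call and split in another, so no Python-level loop over characters is needed.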

In [20]:
cProfile.run("document_distance(document1, document2)")

         19 function calls in 0.336 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.226    0.113    0.226    0.113 <ipython-input-19-227e470e7bd3>:15(count_frequency)
        2    0.000    0.000    0.313    0.156 <ipython-input-19-227e470e7bd3>:27(word_frequencies_for_text)
        3    0.011    0.004    0.011    0.004 <ipython-input-19-227e470e7bd3>:35(inner_product)
        1    0.000    0.000    0.011    0.011 <ipython-input-19-227e470e7bd3>:49(vector_angle)
        2    0.000    0.000    0.087    0.043 <ipython-input-19-227e470e7bd3>:6(get_words_from_text)
        1    0.009    0.009    0.333    0.333 <ipython-input-7-ac93d2f4306b>:87(document_distance)
        1    0.003    0.003    0.336    0.336 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {math.acos}
        1    0.000    0.000    0.000    0.000 {math.sqrt}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lspr
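
In this profile `count_frequency` dominates (0.226 s of the 0.336 s total). One possible further tweak, not taken from the notebook and not measured here, is to let `collections.Counter` do the counting. The following is a minimal sketch; `count_frequency_counter` is a hypothetical name, and `inner_product` is repeated from the cell above only to make the block self-contained.

```python
from collections import Counter

def count_frequency_counter(word_list):
    """Return a dict-like Counter mapping each word to its frequency."""
    return Counter(word_list)

def inner_product(D1, D2):
    """Dictionary-based inner product, as in the cell above."""
    total = 0.0
    for key in D1:
        if key in D2:
            total += D1[key] * D2[key]
    return total

# Word lists that reproduce the counts from the inner_product docstring.
D1 = count_frequency_counter(["and"] * 3 + ["of"] * 2 + ["the"] * 5)
D2 = count_frequency_counter(["and"] * 4 + ["in"] + ["of"] + ["this"] * 2)
result = inner_product(D1, D2)   # 3*4 + 2*1 = 14.0
```

A `Counter` is a `dict` subclass, so the existing `inner_product` and `vector_angle` accept it unchanged. Whether it is actually faster on the book files would have to be checked with the profiler.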

In [16]:
cProfile.run("document_distance(document1, document2)")

         509 function calls in 0.026 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.021    0.011 <ipython-input-14-b6cbf467a347>:1(word_frequencies_for_text)
        3    0.000    0.000    0.000    0.000 <ipython-input-14-b6cbf467a347>:18(inner_product)
        2    0.000    0.000    0.000    0.000 <ipython-input-14-b6cbf467a347>:8(insertion_sort)
        2    0.000    0.000    0.000    0.000 <ipython-input-15-1f90468eae9b>:1(count_frequency)
        2    0.000    0.000    0.000    0.000 <ipython-input-7-ac93d2f4306b>:13(get_words_from_string)
        1    0.000    0.000    0.000    0.000 <ipython-input-7-ac93d2f4306b>:77(vector_angle)
        1    0.004    0.004    0.026    0.026 <ipython-input-7-ac93d2f4306b>:87(document_distance)
        2    0.000    0.000    0.000    0.000 <ipython-input-9-af1902ee9362>:1(get_words_from_line_list)
        1    0.000    0.000    0.026    0.026 <string>: