## Lecture 2: Models of Computation

Lecture by Erik Demaine

Video link here: [https://www.youtube.com/watch?v=Zc54gFhdpLA&list=PLUl4u3cNGP61Oq3tWYp6V_F-5jb5L2iHb&index=2](https://www.youtube.com/watch?v=Zc54gFhdpLA&list=PLUl4u3cNGP61Oq3tWYp6V_F-5jb5L2iHb&index=2)

### Problem statement: 
Given two documents, **D1** and **D2**, find the distance between them.
The distance **d(D1,D2)** can be defined in a number of ways, but we use the following definition:
* For a word 'w' in document D, D[w] is defined as the number of occurrences of 'w' in D
* We create a vector for both documents D1 and D2 in this way
* Given both vectors, we compute the distance **d(D1,D2)** as the following steps:
  - d'(D1,D2): Compute the **inner product** of these vectors
      - ``d'(D1,D2) = sum(D1[w]*D2[w] for all w)``
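      - This works, but it is not scale invariant: long documents get large inner products simply because they contain more words. We can normalize by dividing by the lengths of the vectors
  - ``d''(D1,D2) = d'(D1,D2)/(|D1| * |D2|)``
      - |D| is the length of the vector D. For the cosine interpretation, this must be the Euclidean norm, ``sqrt(sum(D[w]^2 for all w))``; the number of words in the document is a different quantity and gives a cruder normalization
      - With Euclidean norms, this is the cosine of the angle between the two vectors
      - If we take the arccos value of d''(D1,D2), we get the angle between the two vectors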
  - ``d(D1,D2) = arccos(d''(D1,D2))``
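
As a small worked example of the three steps above, the sketch below computes d', d'' and d for two toy documents. The word lists and the helper name `count_vector` are made up for illustration, and |D| is taken to be the Euclidean norm of the count vector, which is the assumption under which d'' is a cosine.

```python
import math

# Hypothetical toy documents, already split into lowercase words
doc1 = ["the", "cat", "sat"]
doc2 = ["the", "dog", "sat"]

def count_vector(words):
    """Map each word to its number of occurrences (illustrative helper)."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

v1, v2 = count_vector(doc1), count_vector(doc2)

# d'(D1,D2): inner product; only words present in both documents contribute
inner = sum(c * v2.get(w, 0) for w, c in v1.items())

# |D|: Euclidean norm of each count vector (assumption, see lead-in)
norm1 = math.sqrt(sum(c * c for c in v1.values()))
norm2 = math.sqrt(sum(c * c for c in v2.values()))

# d''(D1,D2): cosine of the angle, then d(D1,D2): the angle itself
cosine = inner / (norm1 * norm2)
angle = math.acos(cosine)
```

Here the documents share two of their three words, so the inner product is 2, the cosine is 2/3 and the angle is roughly 0.84 radians.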

### Steps:
Calculating this requires broadly the following steps:
1. **Split document into words** - This can be done in a number of ways. The list below is not exhaustive
    1. Go through the document; any time you see a non-alphanumeric character, start a new word
    2. Use regex (some patterns can take exponential time in backtracking engines, so be very wary)
    3. Use 'split'
2. **Find word frequencies** - A couple of ways to do this:
    1. Sort the words, then count runs of equal adjacent words
    2. Go through words linearly, add to count dictionary
3. Compute distance as above
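
To make the options in steps 1 and 2 concrete, here is a minimal sketch using only the standard library. The sample text is made up, and `collections.Counter` is swapped in as a stand-in for the hand-written count dictionary; it is not what the cells in this notebook use.

```python
import re
from collections import Counter

text = "The cat sat. The dog sat!"

# Regex option: match runs of ASCII letters. A simple pattern like this,
# with no nested quantifiers, should not cause heavy backtracking.
words_regex = re.findall(r"[a-z]+", text.lower())

# 'split' option: splits on whitespace only, so punctuation stays attached
words_split = text.lower().split()

# Word frequencies via a linear pass (Counter is a dict subclass)
counts = Counter(words_regex)
```

Note the difference between the two splitting options: `split` keeps "sat." and "sat!" as separate words, whereas the regex yields "sat" twice.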

In [61]:
import os, glob
import copy, math
doc_dir="Sample Documents/"
doc_list=os.listdir(doc_dir)

#### Split document into words

In [101]:
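def splitIntoWords(file_contents: str) -> list:
    '''
    Splits text into lowercase words made of the ASCII letters A-Z and a-z
    Apostrophes and commas are dropped without ending the word ("don't" becomes "dont")
    Any other character ends the current word
    '''
    word_list=[]
    curr_word=[]
    for c in file_contents:
        if 'A'<=c<='Z' or 'a'<=c<='z':
            curr_word.append(c)
        elif c=="'" or c==",":
            #skip these characters without breaking the word
            continue
        elif curr_word:
            word_list.append("".join(curr_word).lower())
            curr_word=[]
    #remember to append the last word, if there is one
    if curr_word:
        word_list.append("".join(curr_word).lower())
    return word_list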

In [63]:
assert len(doc_list)==2, "Invalid number of documents. Select any two"
#read each file with a context manager so that it is closed after reading
with open(doc_dir+doc_list[0],"r") as f:
    D1=splitIntoWords(f.read())
with open(doc_dir+doc_list[1],"r") as f:
    D2=splitIntoWords(f.read())

#### Compute word count

In [102]:
def computeWordCount(word_list: list)-> dict:
    '''
    This function computes word counts by checking whether each word is already in the count dictionary
    If it is, then it increments that count by 1
    Else, it sets the count to 1
    '''
    word_count={}
    for word in word_list:
        if word in word_count:
            word_count[word]+=1
        else:
            word_count[word]=1
    return word_count

In [83]:
def computeWordCountSort(word_list: list)-> dict:
    '''
    This method computes the word counts by first sorting the list lexicographically
    If the word is the same as the previous one, it increments count by 1
    Else, it stores the count for the previous word, resets count to 1 and sets the current word to the new one
    '''
    word_count={}
    if not word_list:
        #an empty document has no words to count
        return word_count
    #sorted() returns a new list, so the caller's list is not reordered
    sorted_words=sorted(word_list)
    cur=sorted_words[0]
    count=1
    for word in sorted_words[1:]:
        if word==cur:
            count+=1
        else:
            word_count[cur]=count
            count=1
            cur=word
    #store the count of the last run of words
    word_count[cur]=count
    return word_count

The two functions above return the same counts. The dictionary version runs in expected linear time, while the sorting version takes O(n log n) because of the sort. Below, I use the ``computeWordCount()`` function.
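
One way to convince yourself that the two counting strategies agree is to run both on the same input and compare the dictionaries. The sketch below does this with two compact stand-ins, `count_linear` and `count_sorted` (hypothetical names, not the functions defined in this notebook), where the sorted variant uses `itertools.groupby` to count runs of equal adjacent words.

```python
from itertools import groupby

def count_linear(words):
    """One pass over the words with a count dictionary."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

def count_sorted(words):
    """Sort first, then count each run of equal adjacent words."""
    # sorted() returns a new list, so the input is left untouched
    return {w: len(list(run)) for w, run in groupby(sorted(words))}

sample = ["b", "a", "b", "c", "a", "b"]
linear_result = count_linear(sample)
sorted_result = count_sorted(sample)

# Dictionary equality ignores insertion order, so this compares counts only
assert linear_result == sorted_result
```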

#### Compute distance

In [95]:
def dotProduct(vec1: dict, vec2: dict) -> float:
    res=0.0
    for key in set(vec1)|set(vec2):
        res+=vec1.get(key,0)*vec2.get(key,0)
    return res

In [105]:
def dotProductFaster(vec1: dict, vec2:dict) -> float:
    res=0.0
    if len(vec1)>len(vec2):
        smaller,larger=vec2,vec1
    else:
        smaller,larger=vec1,vec2
    for key in smaller.keys():
        res+=smaller[key]*larger.get(key,0)
    return res
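
The two dot products can be compared with `timeit`. The sketch below is self-contained, so it uses compact stand-ins (`dot_all_keys` and `dot_smaller`, hypothetical names) and made-up count vectors of very different sizes; the actual timings depend on the machine and on the inputs, so none are quoted here.

```python
import timeit

def dot_all_keys(vec1, vec2):
    """Iterate over the union of keys from both dictionaries."""
    keys = set(vec1) | set(vec2)
    return sum(vec1.get(k, 0) * vec2.get(k, 0) for k in keys)

def dot_smaller(vec1, vec2):
    """Iterate over the smaller dictionary only; missing keys contribute 0."""
    smaller, larger = (vec1, vec2) if len(vec1) <= len(vec2) else (vec2, vec1)
    return sum(c * larger.get(k, 0) for k, c in smaller.items())

# Hypothetical count vectors: 100 words versus 19950 words, 50 in common
small = {f"w{i}": 1 for i in range(100)}
large = {f"w{i}": 2 for i in range(50, 20000)}

# Both variants must agree before their speed is compared
assert dot_all_keys(small, large) == dot_smaller(small, large)

t_all = timeit.timeit(lambda: dot_all_keys(small, large), number=50)
t_small = timeit.timeit(lambda: dot_smaller(small, large), number=50)
print(f"union of keys: {t_all:.4f}s, smaller dict only: {t_small:.4f}s")
```

The second variant does work proportional to the smaller dictionary, so the gap should widen as the two documents differ more in vocabulary size.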

In [86]:
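def normalize(word_list_doc1: list, word_list_doc2: list) -> int:
    #product of the document lengths in words; this is at least the product of the
    #Euclidean norms of the count vectors, so the quotient is at most the true cosine
    return len(word_list_doc1)*len(word_list_doc2)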
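
For comparison, the sketch below contrasts normalizing by word counts with normalizing by Euclidean norms, on a made-up count vector compared against itself. The helper names are hypothetical, and non-empty documents are assumed (an empty one would give a zero denominator).

```python
import math

def dot(wc1, wc2):
    """Inner product of two word-count dictionaries."""
    return sum(c * wc2.get(w, 0) for w, c in wc1.items())

def angle_word_count(wc1, wc2):
    """Normalize by the number of words in each document."""
    n1, n2 = sum(wc1.values()), sum(wc2.values())
    return math.acos(dot(wc1, wc2) / (n1 * n2))

def angle_euclidean(wc1, wc2):
    """Normalize by the Euclidean norm of each count vector."""
    denom = math.sqrt(dot(wc1, wc1)) * math.sqrt(dot(wc2, wc2))
    # Clamp to 1.0 so rounding error cannot push the value outside acos's domain
    return math.acos(min(1.0, dot(wc1, wc2) / denom))

# Hypothetical document "the the cat": 3 words, 2 distinct
wc = {"the": 2, "cat": 1}

same_by_words = angle_word_count(wc, wc)
same_by_norm = angle_euclidean(wc, wc)
disjoint_by_norm = angle_euclidean(wc, {"dog": 3})
```

With Euclidean norms, a document is at angle 0 (up to rounding) from itself and at pi/2 from a document sharing no words. With word-count normalization, the same document is at arccos(5/9) from itself, which is one reason to prefer the norm when the result is meant to be read as an angle.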

In [108]:
def docdist(doc1, doc2):
    #read each file with a context manager so that it is closed after reading
    with open(doc_dir+doc1,"r") as f:
        D1=splitIntoWords(f.read())
    with open(doc_dir+doc2,"r") as f:
        D2=splitIntoWords(f.read())
    D1_WC=computeWordCount(D1)
    D2_WC=computeWordCount(D2)
    #Use either of the two functions below; they return the same value
    #Time them to see which is faster
    DotProductValue=dotProduct(D1_WC, D2_WC)
    DotProductValueFaster=dotProductFaster(D1_WC, D2_WC)
    
    normalizedDPValue=DotProductValueFaster/(normalize(D1,D2))
    return math.acos(normalizedDPValue)
    

In [109]:
print(docdist(doc_list[0], doc_list[1]))

70.0 70.0
1.5590353029745017
