# Week 4 (Part 2): Document Similarity

In some applications, it may be difficult to define ahead of time the classes that we want to use in classification. Alternatively, a class might be made up of various subclasses, which differ in the vocabulary they use. In both of these cases (and others), it might be more appropriate to think about **document similarity**. For a new document, can we find the most similar document in our collection?

### Preliminaries

In [1]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [5]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [1]:
#utils has functions that we have worked on in previous weeks.

import sys
sys.path.append('/content/drive/My Drive/NLE Notebooks/Week4LabsSolutions/')
from utils import *

Sussex NLTK root directory is /content/drive/My Drive/NLE Notebooks/resources


Now let's get a document collection. Actually, let's get two collections and store them in a dictionary for easy access.

In [2]:
from sussex_nltk.corpus_readers import ReutersCorpusReader

rcr = ReutersCorpusReader()    #Create a new reader

collectionsize=10 #the number of documents in each collection
collections={"finance":[],"sport":[]}

for key in collections.keys():
    generator=rcr.category(key).raw()
    while len(collections[key])<collectionsize:
        collections[key].append(next(generator))
        


We now need a function which will tokenise documents and construct a *bag-of-words* document representation. Combining some of the functionality we have been working on over the past few weeks (which we have imported from utils.py), we could use something like this:

In [3]:
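def make_bow(somestring):
    """Return a bag-of-words dictionary {token: frequency} for a string."""
    rep=word_tokenize(somestring)   #split the text into tokens
    rep=normalise(rep)              #normalise tokens (helper from utils.py)
    rep=stem(rep)                   #reduce tokens to their stems
    rep=filter_stopwords(rep)       #remove stopwords
    dict_rep={}
    for token in rep:
        dict_rep[token]=dict_rep.get(token,0)+1   #count each remaining token
    return dict_rep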
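
For readers without access to utils.py, the idea can be sketched in a standalone form. This is only an illustration under simplifying assumptions: `toy_bow` and `TOY_STOPWORDS` are hypothetical names, the tokeniser is a plain regular expression, and there is no stemming, so its output will not match `make_bow()`.

```python
import re
from collections import Counter

# Hypothetical stand-in for the stopword list behind filter_stopwords in utils.py
TOY_STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}

def toy_bow(text):
    """Lowercase, keep runs of letters, drop stopwords, count what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return dict(Counter(t for t in tokens if t not in TOY_STOPWORDS))

print(toy_bow("The union rejected the proposal, and the union said so."))
```

Each key is a token and each value is the number of times it occurs, which is the same dictionary shape that `make_bow()` returns.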

We can apply `make_bow()` to all of the documents in our collections:

In [6]:
bow_collections={key:[make_bow(doc) for doc in collection] for key,collection in collections.items()}

Let's have a look at the first document in the finance collection.

In [9]:
print(bow_collections["finance"][0])

{'navistar': 16, 'close': 4, 'foundri': 12, 'take': 2, 'charg': 2, 'intern': 2, 'indianapoli': 8, 'cast': 2, 'employ': 1, 'num': 10, 'peopl': 1, 'late': 2, 'local': 3, 'union': 6, 'member': 3, 'reject': 2, 'compani': 8, 'propos': 4, 'truck': 4, 'school': 2, 'bu': 1, 'manufactur': 1, 'said': 8, 'tuesday': 1, 'move': 1, 'follow': 1, 'ratif': 1, 'extens': 1, 'master': 1, 'contract': 2, 'percent': 2, 'vote': 1, 'unit': 1, 'auto': 1, 'worker': 4, 'uaw': 5, 'announc': 2, 'sunday': 1, 'left': 1, 'open': 3, 'possibl': 2, 'could': 3, 'remain': 1, 'seek': 2, 'agreement': 5, 'came': 1, 'back': 2, 'say': 3, 'like': 1, 'talk': 2, 'one': 1, 'time': 2, 'alway': 1, 'work': 4, 'toward': 1, 'chairman': 1, 'john': 1, 'horn': 1, 'told': 1, 'report': 1, 'lock': 1, 'door': 2, 'jack': 1, 'laskowski': 2, 'vice': 1, 'presid': 1, 'ad': 2, 'confer': 1, 'call': 1, 'held': 1, 'jointli': 1, 'think': 1, 'know': 1, 'long': 1, 'sure': 1, 'short': 1, 'period': 1, 'offici': 4, 'gave': 1, 'detail': 1, 'except': 1, 'basic

## Measuring Similarity
We are going to use the cosine measure to determine how similar two documents are.  This can be defined in terms of the dot products of vectors:

\begin{eqnarray*}
\mbox{sim}_{\mbox{cosine}}(A,B) = \frac{A.B}{\sqrt{A.A \times B.B}}
\end{eqnarray*}

where the dot product of two vectors, A and B, is defined as:

\begin{eqnarray*}
A.B = \sum_{\mbox{f}} \mbox{weight}(A,f)\times \mbox{weight}(B,f) 
\end{eqnarray*}

and $\mbox{weight}(X,f)$ tells us the value associated with feature $f$ in the vector representation of $X$.
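
To make the definition concrete, here is a small worked example on two made-up documents. The dictionaries `A` and `B` are hypothetical and are not taken from the Reuters collections; the arithmetic simply follows the two formulas above.

```python
import math

# Two hypothetical toy documents as {feature: weight} dictionaries
A = {"union": 2, "truck": 1}
B = {"union": 1, "bank": 3}

# Only features present in both documents contribute to the dot product
a_dot_b = sum(w * B.get(f, 0) for f, w in A.items())   # 2*1 = 2
a_dot_a = sum(w * w for w in A.values())               # 4 + 1 = 5
b_dot_b = sum(w * w for w in B.values())               # 1 + 9 = 10

similarity = a_dot_b / math.sqrt(a_dot_a * b_dot_b)    # 2 / sqrt(50)
print(similarity)
```

The only shared feature is `union`, so the similarity is fairly low (about 0.28) even though both documents are short.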

### Exercise 1.1
* Write a function `dot` which takes two documents (represented as dictionaries) and returns their dot product
* Test it out on the first two documents in the finance collection.  You should get the answer 206 

In [10]:
def dot(docA,docB):
    the_sum=0
    for (key,value) in docA.items():
        the_sum+=value*docB.get(key,0)
    return the_sum

testA=bow_collections['finance'][0]
testB=bow_collections['finance'][1]
dot(testA, testB)

206

### Exercise 1.2
* Write a function `cos_sim` which takes two documents (represented as dictionaries) and returns their cosine similarity.
* Your function should make 3 calls to the `dot` function you have already defined
* If you test it out on the first two documents in the finance collection you should get 0.24 (to 2 significant figures)

In [11]:
import math

def cos_sim(docA,docB):
    sim=dot(docA,docB)/(math.sqrt(dot(docA,docA)*dot(docB,docB)))
    return sim

cos_sim(testA,testB)

0.2389942359865197
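
One edge case worth keeping in mind: if either document is empty (or all of its weights are zero), the denominator is zero and the division fails. The variant below is a sketch of one possible guard, not part of the exercise solution; `safe_cos_sim` is a hypothetical name, and returning 0.0 for a zero-length vector is a design assumption.

```python
import math

def dot(docA, docB):
    """Dot product of two {feature: weight} dictionaries."""
    return sum(v * docB.get(k, 0) for k, v in docA.items())

def safe_cos_sim(docA, docB):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    norm = math.sqrt(dot(docA, docA) * dot(docB, docB))
    return dot(docA, docB) / norm if norm > 0 else 0.0

doc = {"union": 2, "truck": 1}
print(safe_cos_sim(doc, doc))   # a non-empty document compared with itself
print(safe_cos_sim(doc, {}))    # an empty document has no direction
```

Two quick sanity checks for any cosine implementation are that a non-empty document is maximally similar to itself and that swapping the arguments does not change the result.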

### Exercise 1.3
* Write some code that will compute the similarity of every document in a collection with every document in another collection
* Write code to compute the average similarity of two collections
* Compute (and display) the average similarity of 
    * the finance collection to the finance collection
    * the sport collection to the sport collection
    * the finance collection to the sport collection
    * the sport collection to the finance collection
    

In [12]:
def average_similarity(collectionA,collectionB):
    
    totalsim=0
    
    for docA in collectionA:
        for docB in collectionB:
            totalsim+=cos_sim(docA,docB)
    return totalsim/(len(collectionA)*len(collectionB))

for key1 in bow_collections.keys():
    for key2 in bow_collections.keys():
        sim=average_similarity(bow_collections[key1],bow_collections[key2])
        print("The average similarity of {} to {} is {}".format(key1,key2,sim))

The average similarity of finance to finance is 0.3332921405718554
The average similarity of finance to sport is 0.238305878745487
The average similarity of sport to finance is 0.23830587874548698
The average similarity of sport to sport is 0.3540280872838635
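
Note that when a collection is compared with itself, `average_similarity` also compares each document with itself, and each of those pairs has similarity 1. With 10 documents that is 10 of the 100 pairs, which raises the within-collection averages. The sketch below shows one way to leave those pairs out; `average_within_similarity` is a hypothetical helper, and the toy collection is made up for illustration.

```python
import math

def dot(docA, docB):
    """Dot product of two {feature: weight} dictionaries."""
    return sum(v * docB.get(k, 0) for k, v in docA.items())

def cos_sim(docA, docB):
    return dot(docA, docB) / math.sqrt(dot(docA, docA) * dot(docB, docB))

def average_within_similarity(collection):
    """Mean cosine similarity over ordered pairs of distinct documents."""
    n = len(collection)  # assumes at least two documents
    total = sum(cos_sim(a, b)
                for i, a in enumerate(collection)
                for j, b in enumerate(collection)
                if i != j)
    return total / (n * (n - 1))

toy = [{"goal": 1}, {"goal": 1, "match": 1}, {"bank": 2}]
print(average_within_similarity(toy))
```

Whether self-pairs should count depends on the question being asked; excluding them makes within-collection and between-collection averages more directly comparable.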


## Beyond Frequency
The raw frequency of a word in a document does not make a very good weight, because some words occur very frequently in all documents. A pair of documents that share two rare words should be judged more similar than a pair that share two common words.

### TF-IDF
A commonly used weight is tf-idf, which stands for **term frequency, inverse document frequency**:

\begin{eqnarray*}
\mbox{tf-idf}(D_i,f) = tf(D_i,f) \times idf(D_i,f)
\end{eqnarray*}

where $tf(D_i,f)$ is simply the frequency of feature $f$ in document $D_i$,
and

\begin{eqnarray*}
idf(D_i,f) = \log \frac{N}{df(f)}
\end{eqnarray*}

where $N$ is the total number of documents and $\mbox{df}(f)$ is the number of documents containing $f$:  

\begin{eqnarray*}
df(f)=|\{i|\mbox{freq}(D_i,f)>0\}|
\end{eqnarray*}
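
As a quick numerical illustration of the idf formula, the sketch below evaluates it for a few document frequencies. It assumes a collection of 10 documents (the value of `collectionsize` set earlier) and the natural logarithm; `idf_value` is a hypothetical helper name.

```python
import math

def idf_value(n_docs, df):
    """idf for one feature: log(N / df(f)), using the natural logarithm."""
    return math.log(n_docs / df)

N = 10  # assumed collection size, as set by collectionsize earlier
for df in (1, 3, 4, 10):
    print(df, round(idf_value(N, df), 4))
```

A feature found in only one document gets the largest weight, while a feature found in all 10 documents gets an idf of 0 and so contributes nothing to a tf-idf dot product.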

The code below will take a list of documents (represented as dictionaries) and compute the document frequency for each feature.  Test it out on one of the collections of documents.

In [13]:
def doc_freq(doclist):
    df={}
    for doc in doclist:
        for feat in doc.keys():
            df[feat]=df.get(feat,0)+1
            
    return df
    

In [14]:
doc_freq(bow_collections['finance'])

{'abraod': 1,
 'abroad': 2,
 'absolut': 1,
 'accord': 1,
 'account': 1,
 'activ': 1,
 'ad': 3,
 'address': 1,
 'administr': 4,
 'advis': 1,
 'affair': 1,
 'agenc': 2,
 'ago': 2,
 'agre': 3,
 'agreement': 4,
 'agricultur': 1,
 'aid': 2,
 'albani': 1,
 'alcohol': 1,
 'alleg': 1,
 'allow': 4,
 'also': 4,
 'altern': 2,
 'alway': 1,
 'amend': 3,
 'america': 1,
 'american': 3,
 'among': 1,
 'amount': 3,
 'ani': 1,
 'announc': 1,
 'annual': 2,
 'anoth': 1,
 'applianc': 1,
 'approach': 1,
 'appropri': 1,
 'approv': 2,
 'april': 1,
 'argu': 3,
 'arizona': 1,
 'armistic': 1,
 'asid': 1,
 'assembl': 1,
 'assist': 1,
 'atom': 1,
 'attorney': 1,
 'aug': 1,
 'australia': 1,
 'auth': 1,
 'author': 1,
 'auto': 1,
 'automat': 1,
 'averag': 1,
 'aviat': 1,
 'awar': 1,
 'away': 2,
 'b': 2,
 'back': 2,
 'balanc': 1,
 'baltic': 1,
 'baltimor': 1,
 'ban': 1,
 'banana': 1,
 'banc': 1,
 'bank': 2,
 'bargain': 1,
 'barrag': 1,
 'basic': 1,
 'basin': 1,
 'battl': 1,
 'battleground': 1,
 'becam': 1,
 'becaus': 3

### Exercise 2.1
* Write a function which will compute the idf values for features given a list of documents
* Use it to compute idf values for features given:
    * the finance collection of documents
    * the sports collection of documents
    * the combination of the finance and sports collections
    

In [15]:
def idf(doclist):
    N=len(doclist)
    return {feat:math.log(N/v) for feat,v in doc_freq(doclist).items()}


In [16]:
finance_idf=idf(bow_collections['finance'])
sports_idf=idf(bow_collections['sport'])
combined_idf=idf(bow_collections['sport']+bow_collections['finance'])

### Exercise 2.2
* Write a function `convert_to_tfidf` that takes two arguments:
    * a list of documents (represented as dictionaries {feat:freq})
    * a dictionary containing idf values
* and outputs a list of documents with tfidf weights (i.e., dictionaries {feat:tfidf})

In [17]:
def convert_to_tfidf(docs,idfvalues):
    converted=[{f:v*idfvalues.get(f,0) for f,v in doc.items()} for doc in docs]
    return converted
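
The whole weighting step can be traced on a tiny made-up collection. This is a standalone sketch that repeats the document-frequency and idf logic inline so it runs on its own; the two documents in `docs` are hypothetical.

```python
import math

docs = [{"said": 3, "union": 2}, {"said": 1, "goal": 4}]   # hypothetical toy collection
N = len(docs)

# Document frequency: in how many documents does each feature occur?
df = {}
for doc in docs:
    for feat in doc:
        df[feat] = df.get(feat, 0) + 1

idf_values = {feat: math.log(N / count) for feat, count in df.items()}

# Same conversion as convert_to_tfidf: multiply each frequency by its idf value
tfidf_docs = [{f: tf * idf_values.get(f, 0) for f, tf in doc.items()} for doc in docs]

print(tfidf_docs[0])
```

Here `said` occurs in both documents, so its weight becomes 0 despite its high frequency, while `union` and `goal` keep positive weights. Because of `idfvalues.get(f,0)`, any feature missing from the idf dictionary is also given weight 0.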

In [18]:
convert_to_tfidf(bow_collections['finance'],finance_idf)

[{'ad': 2.4079456086518722,
  'advis': 2.302585092994046,
  'agreement': 4.5814536593707755,
  'allow': 1.8325814637483102,
  'also': 0.9162907318741551,
  'alway': 2.302585092994046,
  'america': 2.302585092994046,
  'announc': 4.605170185988092,
  'annual': 1.6094379124341003,
  'approv': 1.6094379124341003,
  'assembl': 2.302585092994046,
  'auto': 2.302585092994046,
  'back': 3.2188758248682006,
  'basic': 2.302585092994046,
  'benefit': 4.605170185988092,
  'bu': 2.302585092994046,
  'build': 2.4079456086518722,
  'call': 2.302585092994046,
  'came': 2.302585092994046,
  'carri': 2.302585092994046,
  'cast': 4.605170185988092,
  'chairman': 1.2039728043259361,
  'chang': 3.6119184129778086,
  'charg': 4.605170185988092,
  'chatham': 2.302585092994046,
  'citi': 2.302585092994046,
  'close': 9.210340371976184,
  'compani': 7.330325854993241,
  'competit': 3.6119184129778086,
  'confer': 1.6094379124341003,
  'contract': 4.605170185988092,
  'cost': 0.9162907318741551,
  'could': 3.

### Exercise 2.3
* Convert both of your document collections so that the weights are tfidf values rather than frequencies.  Which idf_values should you use in each case?
* Recompute the average similarity between each collection of documents (as in Ex 1.3).
* What do you notice?
* As an extension, try increasing the sizes of the collections.  What do you notice now?

In [19]:
tfidf_collections={key:convert_to_tfidf(docs,combined_idf) for key,docs in bow_collections.items()}

for key1 in tfidf_collections.keys():
    for key2 in tfidf_collections.keys():
        sim=average_similarity(tfidf_collections[key1],tfidf_collections[key2])
        print("The average similarity of {} to {} is {}".format(key1,key2,sim))

The average similarity of finance to finance is 0.13306536902069344
The average similarity of finance to sport is 0.014572920713006709
The average similarity of sport to finance is 0.014572920713006707
The average similarity of sport to sport is 0.14805216431292612
