# Week 5: Document Similarity

In some applications, it may be difficult to define ahead of time the classes that we want to use in classification. Or, classes might be made up of various subclasses (which differ in terms of the vocabulary used). In both of these cases (and others), it might be more appropriate to think about **document similarity**. For a new document, can we find the most similar document in our collection?

### Preliminaries

In [None]:
###uncomment if working on colab

#from google.colab import drive
#drive.mount('/content/drive')


In [1]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('gutenberg')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/kerimciger/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kerimciger/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/kerimciger/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

In [2]:
from utils import *

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kerimciger/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Now let's get a document collection. We are going to use the Gutenberg collection of books. We will get the tokenised content of each book and store it in a dictionary (key = the fileid of the book) for easy access.

In [3]:
from nltk.corpus import gutenberg

book_ids=gutenberg.fileids()
books={b:gutenberg.words(b) for b in book_ids}

In [4]:
books[book_ids[0]]

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]

In [5]:
books.keys()

dict_keys(['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt'])

We now need to normalise the tokens in the documents and construct a *bag-of-words* document representation. Combining some of the functionality we have been working on over the past few weeks (which we have imported from `utils.py`), we could use something like this:

In [6]:
from nltk.probability import FreqDist

book_reps={key:FreqDist(normalise(book)) for key,book in books.items()}

Let's have a look at the representation of the first book:

In [7]:
print(book_reps[book_ids[0]].items())

dict_items([('emma', 865), ('jane', 301), ('austen', 1), ('volume', 3), ('chapter', 56), ('woodhouse', 313), ('handsome', 38), ('clever', 27), ('rich', 14), ('comfortable', 34), ('home', 130), ('happy', 125), ('disposition', 24), ('seemed', 141), ('unite', 3), ('best', 85), ('blessings', 6), ('existence', 8), ('lived', 25), ('nearly', 14), ('twenty', 30), ('one', 452), ('years', 57), ('world', 81), ('little', 359), ('distress', 19), ('vex', 1), ('youngest', 4), ('two', 178), ('daughters', 7), ('affectionate', 9), ('indulgent', 2), ('father', 207), ('consequence', 27), ('sister', 33), ('marriage', 35), ('mistress', 11), ('house', 95), ('early', 41), ('period', 18), ('mother', 72), ('died', 4), ('long', 146), ('ago', 32), ('indistinct', 1), ('remembrance', 8), ('caresses', 1), ('place', 93), ('supplied', 5), ('excellent', 34), ('woman', 131), ('governess', 9), ('fallen', 7), ('short', 70), ('affection', 50), ('sixteen', 9), ('miss', 599), ('taylor', 48), ('mr', 1153), ('family', 77), ('l

In [8]:
book_reps[book_ids[0]]["1816"]

0
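
The count of 0 for `"1816"`, together with the absence of `[`, `by` and capitalised forms in the output above, suggests that `normalise` lowercases tokens and drops punctuation, numbers and stopwords. `utils.py` is not reproduced here, so the following is only a guess at that behaviour: `normalise_sketch` and the tiny `STOPWORDS` set are hypothetical stand-ins, not the course implementation.

```python
from collections import Counter

# Hypothetical stand-in for the NLTK English stopword list (an assumption).
STOPWORDS = {"by", "the", "and", "of", "a", "to", "in"}

def normalise_sketch(tokens):
    """Lowercase tokens, keeping only alphabetic non-stopwords (assumed behaviour)."""
    lowered = (tok.lower() for tok in tokens)
    return [tok for tok in lowered if tok.isalpha() and tok not in STOPWORDS]

tokens = ["[", "Emma", "by", "Jane", "Austen", "1816", "]"]
bag = Counter(normalise_sketch(tokens))
print(bag)          # counts for 'emma', 'jane' and 'austen' only
print(bag["1816"])  # 0: a Counter returns 0 for a missing key
```

`FreqDist` is a subclass of `collections.Counter`, which is why looking up a feature that was never counted returns 0 rather than raising a `KeyError`.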

## Measuring Similarity
We are going to use the cosine measure to determine how similar two books are. This can be defined in terms of the dot products of vectors:

$$
\text{sim}_{\text{cosine}}(A,B) = \frac{A \cdot B}{\sqrt{A \cdot A \times B \cdot B}}
$$

where the dot product of two vectors, $A$ and $B$, is defined as:

$$
A \cdot B = \sum_f \text{weight}(A,f) \times \text{weight}(B,f)
$$

and $\text{weight}(X,f)$ tells us the value associated with feature $f$ in the vector representation of $X$.
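
As a small worked example (the two toy vectors are made up for illustration), take $A = \{\text{cat}: 2, \text{dog}: 1\}$ and $B = \{\text{cat}: 1, \text{fish}: 3\}$. Only the feature cat has a non-zero weight in both vectors, so:

$$
A \cdot B = 2 \times 1 = 2, \quad A \cdot A = 2^2 + 1^2 = 5, \quad B \cdot B = 1^2 + 3^2 = 10
$$

$$
\text{sim}_{\text{cosine}}(A,B) = \frac{2}{\sqrt{5 \times 10}} \approx 0.28
$$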

### Exercise 1.1
* Write a function `dot` which takes two documents (represented as dictionaries or `FreqDist`s) and returns their dot product.
* Test it out on the first two books in Gutenberg. You should get the answer 3882298!
* Why is the number so large?


In [14]:
# %%timeit
def dot( docA: nltk.probability.FreqDist, docB: nltk.probability.FreqDist ) -> int:
    summ = 0
    for key, value in docA.items():
        prod = value * docB.get( key, 0 )
        summ += prod

    return summ

In [15]:
docA = book_reps[book_ids[0]]
docB = book_reps[book_ids[1]]

dot(docA, docB)

3882298
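
One way to explore why the dot product is so large is to look at which features contribute most to the sum. The sketch below uses a handful of counts that appear in the `FreqDist` outputs in this notebook for *Emma* and *Persuasion*; `top_contributions` is a hypothetical helper, not part of `utils.py`.

```python
# Counts copied from the FreqDist outputs shown in this notebook.
emma = {"mr": 1153, "could": 837, "would": 820, "mrs": 699, "must": 567, "one": 452}
persuasion = {"could": 451, "would": 355, "mrs": 291, "mr": 256, "one": 238, "must": 228}

def top_contributions(doc_a, doc_b, n=3):
    """Return the n shared features with the largest products of weights."""
    products = {f: w * doc_b[f] for f, w in doc_a.items() if f in doc_b}
    return sorted(products.items(), key=lambda item: item[1], reverse=True)[:n]

# Partial dot product over just these six features.
partial = sum(w * persuasion.get(f, 0) for f, w in emma.items())

print(top_contributions(emma, persuasion))
print(partial, partial / 3882298)
```

These six frequent words alone account for 1,404,016 of the 3,882,298 total (roughly 36%), because each term in the sum is a product of two raw counts and both books are long.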

### Exercise 1.2
* Write a function `cos_sim` which takes two documents (represented as dictionaries or `FreqDist`s) and returns their cosine similarity.
* Your function should make 3 calls to the `dot` function you have already defined.
* If you test it out on the first two documents in the collection, you should get 0.72 (to 2 s.f.).

In [16]:
def cos_sim(docA: nltk.probability.FreqDist, docB: nltk.probability.FreqDist):
    upper = dot( docA, docB )
    lower = (dot(docA, docA) * dot(docB, docB))**0.5 
    sim = upper / lower if lower != 0 else 0
    return sim

cos_sim(docA,docB)

0.7209827887675819

In [13]:
import math

def cos_sim(docA,docB):
    sim=dot(docA,docB)/(math.sqrt(dot(docA,docA)*dot(docB,docB)))
    return sim

cos_sim(docA,docB)

0.7209827887675819
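
Unlike the raw dot product, the cosine measure divides out the length of each vector, so scaling all the counts in a document leaves the similarity unchanged. A minimal sketch with made-up toy documents (the `dot` and `cos_sim` here are compact re-implementations so that the block runs on its own):

```python
import math

def dot(doc_a, doc_b):
    """Sum of products of weights over the features of doc_a."""
    return sum(w * doc_b.get(f, 0) for f, w in doc_a.items())

def cos_sim(doc_a, doc_b):
    return dot(doc_a, doc_b) / math.sqrt(dot(doc_a, doc_a) * dot(doc_b, doc_b))

short = {"whale": 2, "sea": 1}
long_ = {f: 100 * w for f, w in short.items()}  # same proportions, 100 times the counts
other = {"sea": 3, "ship": 4}

print(dot(short, other), dot(long_, other))          # the dot product grows with length
print(cos_sim(short, other), cos_sim(long_, other))  # the cosine does not
```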

### Exercise 1.3
* Write some code that will compute the similarity of every document in a collection with every document in another collection
* Write code to compute the average similarity of two collections
* Compute (and display) the average similarity of the book collection to itself
    

In [22]:
import pandas as pd

def compute_all_pairs_sim( collectionA, collectionB ):
    sim_res = {}
    for keyA, docA in collectionA.items():
        sim_res[keyA] = {}
        for keyB, docB in collectionB.items():
            sim_res[keyA][keyB] = cos_sim( docA, docB )
    return sim_res

allsims = compute_all_pairs_sim(book_reps,book_reps)

allsims_df = pd.DataFrame(allsims)

allsims_df.head()

Unnamed: 0,austen-emma.txt,austen-persuasion.txt,austen-sense.txt,bible-kjv.txt,blake-poems.txt,bryant-stories.txt,burgess-busterbrown.txt,carroll-alice.txt,chesterton-ball.txt,chesterton-brown.txt,chesterton-thursday.txt,edgeworth-parents.txt,melville-moby_dick.txt,milton-paradise.txt,shakespeare-caesar.txt,shakespeare-hamlet.txt,shakespeare-macbeth.txt,whitman-leaves.txt
austen-emma.txt,1.0,0.720983,0.714178,0.234961,0.277624,0.484001,0.34439,0.431975,0.49351,0.566961,0.469868,0.688556,0.467413,0.315463,0.245423,0.268859,0.276954,0.420653
austen-persuasion.txt,0.720983,1.0,0.698421,0.229095,0.272859,0.488966,0.347673,0.415685,0.48617,0.572391,0.470311,0.670928,0.50542,0.331911,0.241884,0.269344,0.294243,0.435392
austen-sense.txt,0.714178,0.698421,1.0,0.254851,0.283665,0.486361,0.326153,0.43946,0.491066,0.549179,0.48028,0.67277,0.474923,0.33812,0.23948,0.26883,0.281037,0.438404
bible-kjv.txt,0.234961,0.229095,0.254851,1.0,0.375129,0.348728,0.139435,0.260582,0.33473,0.327006,0.311154,0.389897,0.359606,0.544221,0.371062,0.418882,0.393311,0.445999
blake-poems.txt,0.277624,0.272859,0.283665,0.375129,1.0,0.465337,0.289868,0.265321,0.324392,0.368291,0.313903,0.382288,0.374348,0.529776,0.275073,0.289585,0.321513,0.555633


In [20]:
def all_pairs_sims(collectionA: dict, collectionB: dict) -> dict:
    sims={}
    print( f'Types of documents: {type(collectionA)} {type(collectionB)}' )
    for keyA,docA in collectionA.items():
        sims[keyA]={}
        for keyB,docB in collectionB.items():
            sim=cos_sim(docA,docB)
            sims[keyA][keyB]=sim
            
    return sims


allsims=all_pairs_sims(book_reps,book_reps)
print(allsims)


Types of documents: <class 'dict'> <class 'dict'>
{'austen-emma.txt': {'austen-emma.txt': 1.0, 'austen-persuasion.txt': 0.7209827887675819, 'austen-sense.txt': 0.7141776265718526, 'bible-kjv.txt': 0.2349609219845515, 'blake-poems.txt': 0.2776236092963711, 'bryant-stories.txt': 0.4840013265541457, 'burgess-busterbrown.txt': 0.344389530877187, 'carroll-alice.txt': 0.4319750042697858, 'chesterton-ball.txt': 0.4935101011270622, 'chesterton-brown.txt': 0.5669614267935984, 'chesterton-thursday.txt': 0.4698682997700355, 'edgeworth-parents.txt': 0.6885556624822814, 'melville-moby_dick.txt': 0.4674131272963583, 'milton-paradise.txt': 0.3154632269322213, 'shakespeare-caesar.txt': 0.2454234304603725, 'shakespeare-hamlet.txt': 0.26885904270302036, 'shakespeare-macbeth.txt': 0.27695388261336384, 'whitman-leaves.txt': 0.42065330682321106}, 'austen-persuasion.txt': {'austen-emma.txt': 0.7209827887675819, 'austen-persuasion.txt': 1.0, 'austen-sense.txt': 0.698421056723616, 'bible-kjv.txt': 0.229094865

In [23]:
import pandas as pd

df = pd.DataFrame(allsims)
df.head()

Unnamed: 0,austen-emma.txt,austen-persuasion.txt,austen-sense.txt,bible-kjv.txt,blake-poems.txt,bryant-stories.txt,burgess-busterbrown.txt,carroll-alice.txt,chesterton-ball.txt,chesterton-brown.txt,chesterton-thursday.txt,edgeworth-parents.txt,melville-moby_dick.txt,milton-paradise.txt,shakespeare-caesar.txt,shakespeare-hamlet.txt,shakespeare-macbeth.txt,whitman-leaves.txt
austen-emma.txt,1.0,0.720983,0.714178,0.234961,0.277624,0.484001,0.34439,0.431975,0.49351,0.566961,0.469868,0.688556,0.467413,0.315463,0.245423,0.268859,0.276954,0.420653
austen-persuasion.txt,0.720983,1.0,0.698421,0.229095,0.272859,0.488966,0.347673,0.415685,0.48617,0.572391,0.470311,0.670928,0.50542,0.331911,0.241884,0.269344,0.294243,0.435392
austen-sense.txt,0.714178,0.698421,1.0,0.254851,0.283665,0.486361,0.326153,0.43946,0.491066,0.549179,0.48028,0.67277,0.474923,0.33812,0.23948,0.26883,0.281037,0.438404
bible-kjv.txt,0.234961,0.229095,0.254851,1.0,0.375129,0.348728,0.139435,0.260582,0.33473,0.327006,0.311154,0.389897,0.359606,0.544221,0.371062,0.418882,0.393311,0.445999
blake-poems.txt,0.277624,0.272859,0.283665,0.375129,1.0,0.465337,0.289868,0.265321,0.324392,0.368291,0.313903,0.382288,0.374348,0.529776,0.275073,0.289585,0.321513,0.555633


In [28]:
def compute_avg_similarity( collectionA: dict, collectionB: dict, sim_results = None ) -> float:
    # Avoid a mutable default argument; compute the similarities if none are supplied
    if not sim_results:
        sim_results = compute_all_pairs_sim( collectionA, collectionB )

    sim_sum = 0
    for sim_rec in sim_results.values():
        for sim in sim_rec.values():
            sim_sum += sim

    return sim_sum / (len(collectionA) * len(collectionB))

sim = compute_avg_similarity(book_reps,book_reps,sim_results = allsims)
print( f"The average similarity of the book collection to itself is {sim:.4f}")

The average similarity of the book collection to itself is 0.4451


In [29]:
def average_similarity(collectionA,collectionB,sims=None):
    # Avoid a mutable default argument; compute the similarities if none are supplied
    if not sims:
        sims=all_pairs_sims(collectionA,collectionB)
    totalsim=0
    n=0
    for simvals in sims.values():
        for simval in simvals.values():
            totalsim+=simval
            n+=1
    #return totalsim/(len(collectionA)*len(collectionB))
    return totalsim/n
    
sim=average_similarity(book_reps,book_reps,sims=allsims)
print("The average similarity of the book collection to itself is {:.4f}".format(sim))

The average similarity of the book collection to itself is 0.4451
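
When a collection is compared with itself, the average includes every book's similarity to itself, which is always 1.0 and so inflates the figure. A hedged sketch of an alternative that skips those self-pairs (`average_similarity_excluding_self` is a hypothetical helper, and it assumes both collections use the same keys):

```python
def average_similarity_excluding_self(sims):
    """Average over all pairs (a, b) with a != b in a nested similarity dict."""
    values = [s for a, row in sims.items() for b, s in row.items() if a != b]
    return sum(values) / len(values)

# Pairwise values for the three Austen novels, copied from the table above.
austen = {
    "emma": {"emma": 1.0, "persuasion": 0.720983, "sense": 0.714178},
    "persuasion": {"emma": 0.720983, "persuasion": 1.0, "sense": 0.698421},
    "sense": {"emma": 0.714178, "persuasion": 0.698421, "sense": 1.0},
}

with_self = sum(s for row in austen.values() for s in row.values()) / 9
without_self = average_similarity_excluding_self(austen)
print(round(with_self, 4), round(without_self, 4))
```

For these three novels the average is about 0.81 with the diagonal and about 0.71 without it.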


## Beyond Frequency
The frequency of a word in a document does not make a very good weight because some words occur very frequently in all documents. If a rare word occurs in both documents of a pair, that should add more to their perceived similarity than a common word occurring in both.

### TFIDF
A commonly used weight is TF-IDF, which stands for **term frequency, inverse document frequency**:

$$
\text{tfidf}(D_i, f) = tf(D_i, f) \times \text{idf}(D_i, f)
$$

where $tf(D_i, f)$ is simply the frequency of feature $f$ in document $D_i$, and:

$$
\text{idf}(D_i, f) = \log \frac{N}{df(f)}
$$

where $N$ is the total number of documents and $df(f)$ is the number of documents containing $f$:

$$
df(f) = |\{i \mid \text{freq}(D_i, f) > 0\}|
$$

The code below will take a list of documents (represented as dictionaries) and compute the document frequency for each feature. Test it out on the collection of books.


In [34]:
def doc_freq(doclist):
    df={}
    for doc in doclist:
        for feat in doc.keys():
            df[feat]=df.get(feat,0)+1
            
    return df
    

In [35]:
doc_freq(book_reps.values())

{'emma': 2,
 'jane': 3,
 'austen': 3,
 'volume': 11,
 'chapter': 8,
 'woodhouse': 1,
 'handsome': 11,
 'clever': 10,
 'rich': 17,
 'comfortable': 13,
 'home': 18,
 'happy': 18,
 'disposition': 10,
 'seemed': 15,
 'unite': 6,
 'best': 18,
 'blessings': 7,
 'existence': 10,
 'lived': 14,
 'nearly': 13,
 'twenty': 16,
 'one': 18,
 'years': 15,
 'world': 18,
 'little': 18,
 'distress': 12,
 'vex': 7,
 'youngest': 10,
 'two': 18,
 'daughters': 11,
 'affectionate': 8,
 'indulgent': 6,
 'father': 17,
 'consequence': 13,
 'sister': 14,
 'marriage': 12,
 'mistress': 10,
 'house': 18,
 'early': 14,
 'period': 8,
 'mother': 17,
 'died': 15,
 'long': 18,
 'ago': 14,
 'indistinct': 3,
 'remembrance': 10,
 'caresses': 5,
 'place': 18,
 'supplied': 8,
 'excellent': 13,
 'woman': 17,
 'governess': 3,
 'fallen': 15,
 'short': 16,
 'affection': 12,
 'sixteen': 7,
 'miss': 13,
 'taylor': 3,
 'mr': 10,
 'family': 13,
 'less': 14,
 'friend': 18,
 'fond': 14,
 'particularly': 9,
 'intimacy': 6,
 'sisters': 

### Exercise 2.1
* Write a function which will compute the idf values for features given a list of documents
* Use it to compute idf values for features given the entire list of books in the book collection
    

In [38]:
import math

def idf(doclist):
    N=len(doclist)
    return {feat:math.log(N/v) for feat,v in doc_freq(doclist).items()}


In [39]:
books_idf=idf(book_reps.values())
books_idf

{'emma': 2.1972245773362196,
 'jane': 1.791759469228055,
 'austen': 1.791759469228055,
 'volume': 0.49247648509779424,
 'chapter': 0.8109302162163288,
 'woodhouse': 2.8903717578961645,
 'handsome': 0.49247648509779424,
 'clever': 0.5877866649021191,
 'rich': 0.05715841383994862,
 'comfortable': 0.32542240043462795,
 'home': 0.0,
 'happy': 0.0,
 'disposition': 0.5877866649021191,
 'seemed': 0.1823215567939546,
 'unite': 1.0986122886681098,
 'best': 0.0,
 'blessings': 0.9444616088408515,
 'existence': 0.5877866649021191,
 'lived': 0.25131442828090617,
 'nearly': 0.32542240043462795,
 'twenty': 0.11778303565638346,
 'one': 0.0,
 'years': 0.1823215567939546,
 'world': 0.0,
 'little': 0.0,
 'distress': 0.4054651081081644,
 'vex': 0.9444616088408515,
 'youngest': 0.5877866649021191,
 'two': 0.0,
 'daughters': 0.49247648509779424,
 'affectionate': 0.8109302162163288,
 'indulgent': 1.0986122886681098,
 'father': 0.05715841383994862,
 'consequence': 0.32542240043462795,
 'sister': 0.25131442828
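
As a quick sanity check, a few of these idf values can be reproduced by hand from the document frequencies printed earlier. This is only a sketch: the four features are picked arbitrarily, and `idf_check` is a hypothetical name. Note that `math.log` is the natural logarithm.

```python
import math

N = 18  # number of books in the collection
# Document frequencies copied from the doc_freq output above.
df = {"woodhouse": 1, "emma": 2, "volume": 11, "home": 18}

idf_check = {feat: math.log(N / count) for feat, count in df.items()}
for feat, value in idf_check.items():
    print(f"{feat}: {value:.4f}")
```

A word found in only one book (`woodhouse`) gets the largest weight, $\log 18 \approx 2.89$, while a word found in all 18 books (`home`) gets $\log 1 = 0$.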

### Exercise 2.2
* Write a function `convert_to_tfidf` that takes two arguments:
    * a dictionary of documents mapping fileids to documents
        * where each document is represented as a dictionary or FreqDist {feat:freq}
    * a dictionary containing idf values
* and outputs a dictionary of documents where each document is represented as a dictionary or FreqDist with tfidf weights {feat:tfidf}

In [44]:
def convert_to_tfidf(docs,idfvalues):
    converted={bookid:{f:v*idfvalues.get(f,0) for f,v in doc.items()} for bookid,doc in docs.items()}
    return converted

In [49]:
def convert_to_tfidf(docs, idfvalues):
    # Create a dictionary to store converted documents
    converted = {}

    # Iterate over each document
    for bookid, doc in docs.items():
        tfidf_doc = {}  # Store TF-IDF for the current document
        
        for feature, tf in doc.items():
            # Calculate TF-IDF for the feature
            idf = idfvalues.get(feature, 0)  # Use 0 if feature not in idfvalues
            tfidf_doc[feature] = tf * idf
        
        # Add the TF-IDF converted document to the result
        converted[bookid] = tfidf_doc

    return converted


In [42]:
list(book_reps.items())[:3]

[('austen-emma.txt',
  FreqDist({'mr': 1153, 'emma': 865, 'could': 837, 'would': 820, 'mrs': 699, 'miss': 599, 'must': 567, 'harriet': 506, 'much': 486, 'said': 484, ...})),
 ('austen-persuasion.txt',
  FreqDist({'anne': 497, 'could': 451, 'would': 355, 'captain': 303, 'mrs': 291, 'elliot': 289, 'mr': 256, 'one': 238, 'must': 228, 'wentworth': 218, ...})),
 ('austen-sense.txt',
  FreqDist({'elinor': 685, 'could': 578, 'marianne': 566, 'mrs': 530, 'would': 515, 'said': 397, 'every': 377, 'one': 331, 'much': 290, 'must': 283, ...}))]

In [59]:
convert_to_tfidf(book_reps,books_idf)

{'austen-emma.txt': {'emma': 1900.59925939583,
  'jane': 539.3196002376445,
  'austen': 1.791759469228055,
  'volume': 1.4774294552933827,
  'chapter': 45.41209210811441,
  'woodhouse': 904.6863602214995,
  'handsome': 18.714106433716182,
  'clever': 15.870239952357215,
  'rich': 0.8002177937592807,
  'comfortable': 11.06436161477735,
  'home': 0.0,
  'happy': 0.0,
  'disposition': 14.106879957650857,
  'seemed': 25.707339507947598,
  'unite': 3.295836866004329,
  'best': 0.0,
  'blessings': 5.666769653045109,
  'existence': 4.7022933192169525,
  'lived': 6.282860707022654,
  'nearly': 4.555913606084792,
  'twenty': 3.533491069691504,
  'one': 0.0,
  'years': 10.392328737255411,
  'world': 0.0,
  'little': 0.0,
  'distress': 7.703837054055123,
  'vex': 0.9444616088408515,
  'youngest': 2.3511466596084762,
  'two': 0.0,
  'daughters': 3.4473353956845596,
  'affectionate': 7.298371945946959,
  'indulgent': 2.1972245773362196,
  'father': 11.831791664869366,
  'consequence': 8.78640481173

### Exercise 2.3
* Recompute the average similarity between the collection of books (as in Ex 1.3).
* What do you notice?

In [60]:
tfidf_books=convert_to_tfidf(book_reps,books_idf)
average_similarity(tfidf_books,tfidf_books)

Types of documents: <class 'dict'> <class 'dict'>


0.08605545518027201

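Using TF-IDF values as weights, the average similarity is a lot lower (0.086 compared with 0.445 for raw frequencies). This is presumably because there is less accidental overlap of commonly occurring words: words that occur in every book, such as "one" and "little", now have an idf of 0 and so contribute nothing to the dot product.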
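
The same effect can be reproduced on a tiny made-up collection. In the sketch below the toy documents and their counts are invented for illustration, and the helper code is a compact re-implementation so that the block runs on its own:

```python
import math

def dot(doc_a, doc_b):
    return sum(w * doc_b.get(f, 0) for f, w in doc_a.items())

def cos_sim(doc_a, doc_b):
    return dot(doc_a, doc_b) / math.sqrt(dot(doc_a, doc_a) * dot(doc_b, doc_b))

# Made-up toy documents: 'one' and 'little' occur everywhere, names do not.
toy = {
    "a": {"one": 10, "little": 8, "emma": 5},
    "b": {"one": 9, "little": 7, "anne": 6},
    "c": {"one": 11, "little": 6, "whale": 4},
}

# Document frequency and idf over the toy collection.
n_docs = len(toy)
doc_freq = {}
for doc in toy.values():
    for feat in doc:
        doc_freq[feat] = doc_freq.get(feat, 0) + 1
idf = {feat: math.log(n_docs / count) for feat, count in doc_freq.items()}

# Re-weight every document with tf * idf.
toy_tfidf = {k: {f: w * idf[f] for f, w in doc.items()} for k, doc in toy.items()}

raw_sim = cos_sim(toy["a"], toy["b"])
tfidf_sim = cos_sim(toy_tfidf["a"], toy_tfidf["b"])
print(raw_sim, tfidf_sim)
```

With raw counts the two documents look similar only because of the shared common words. Once those words are weighted by an idf of 0, nothing is left in common and the similarity drops to 0.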

### Exercise 2.4
For each book in the collection, find its most similar book (NOT INCLUDING ITSELF!).
Output your results in a table.

In [61]:
allsims=all_pairs_sims(tfidf_books,tfidf_books)

Types of documents: <class 'dict'> <class 'dict'>


In [62]:
df=pd.DataFrame(allsims)
df

Unnamed: 0,austen-emma.txt,austen-persuasion.txt,austen-sense.txt,bible-kjv.txt,blake-poems.txt,bryant-stories.txt,burgess-busterbrown.txt,carroll-alice.txt,chesterton-ball.txt,chesterton-brown.txt,chesterton-thursday.txt,edgeworth-parents.txt,melville-moby_dick.txt,milton-paradise.txt,shakespeare-caesar.txt,shakespeare-hamlet.txt,shakespeare-macbeth.txt,whitman-leaves.txt
austen-emma.txt,1.0,0.065044,0.048227,0.005446,0.006677,0.014007,0.005947,0.004657,0.010096,0.038462,0.007343,0.076654,0.013983,0.01606,0.000819,0.001688,0.001382,0.017726
austen-persuasion.txt,0.065044,1.0,0.050586,0.007797,0.009049,0.023258,0.005254,0.00609,0.010527,0.043906,0.008855,0.097921,0.029459,0.020528,0.001015,0.002234,0.001829,0.026094
austen-sense.txt,0.048227,0.050586,1.0,0.006294,0.007058,0.01367,0.003219,0.004582,0.006609,0.026448,0.007644,0.098512,0.012348,0.017807,0.000914,0.002165,0.001689,0.017874
bible-kjv.txt,0.005446,0.007797,0.006294,1.0,0.083024,0.06912,0.002086,0.003607,0.00741,0.021546,0.007278,0.019879,0.042034,0.201482,0.020034,0.033457,0.027459,0.102387
blake-poems.txt,0.006677,0.009049,0.007058,0.083024,1.0,0.044,0.008476,0.006896,0.009194,0.02268,0.009401,0.031436,0.035741,0.212993,0.016799,0.021174,0.022125,0.141815
bryant-stories.txt,0.014007,0.023258,0.01367,0.06912,0.044,1.0,0.027962,0.021555,0.017104,0.054973,0.01475,0.055212,0.051922,0.05677,0.003751,0.009235,0.006737,0.078491
burgess-busterbrown.txt,0.005947,0.005254,0.003219,0.002086,0.008476,0.027962,1.0,0.006396,0.004665,0.018464,0.003558,0.014187,0.006509,0.005224,0.000284,0.000596,0.00053,0.015114
carroll-alice.txt,0.004657,0.00609,0.004582,0.003607,0.006896,0.021555,0.006396,1.0,0.005226,0.016924,0.005901,0.015762,0.009133,0.009493,0.000525,0.001804,0.001075,0.013503
chesterton-ball.txt,0.010096,0.010527,0.006609,0.00741,0.009194,0.017104,0.004665,0.005226,1.0,0.041285,0.018068,0.017852,0.017743,0.024758,0.001024,0.002162,0.001458,0.03377
chesterton-brown.txt,0.038462,0.043906,0.026448,0.021546,0.02268,0.054973,0.018464,0.016924,0.041285,1.0,0.041208,0.068564,0.05117,0.049645,0.004398,0.006621,0.004456,0.083248


In [78]:
from operator import itemgetter
import pandas as pd

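def nearestneighbours(simmatrix):
    nn={}

    for bookid,simdict in simmatrix.items():
        # Exclude the book itself explicitly rather than assuming it sorts first
        others=[(other,sim) for other,sim in simdict.items() if other!=bookid]
        ordered=sorted(others,key=itemgetter(1),reverse=True)
        nn[bookid]=ordered[0]
    return nn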
        

print( f'\n---------------------------------------------------\n' )

nn1=nearestneighbours(allsims)
df = pd.DataFrame(nn1)
df=df.transpose()
df.columns=['nearest neighbour','similarity']
df


---------------------------------------------------



| | nearest neighbour | similarity |
|---|---|---|
| austen-emma.txt | edgeworth-parents.txt | 0.076654 |
| austen-persuasion.txt | edgeworth-parents.txt | 0.097921 |
| austen-sense.txt | edgeworth-parents.txt | 0.098512 |
| bible-kjv.txt | milton-paradise.txt | 0.201482 |
| blake-poems.txt | milton-paradise.txt | 0.212993 |
| bryant-stories.txt | whitman-leaves.txt | 0.078491 |
| burgess-busterbrown.txt | bryant-stories.txt | 0.027962 |
| carroll-alice.txt | bryant-stories.txt | 0.021555 |
| chesterton-ball.txt | chesterton-brown.txt | 0.041285 |
| chesterton-brown.txt | whitman-leaves.txt | 0.083248 |
