# Chapter 3: Calculating the similarity between documents

In the last chapter, we covered how we could extract a clean, normalised bag-of-words from our collections of tales. In this chapter, we are going to start doing something useful with them. Specifically, we are going to have a look at the frequency of terms within our tales in order to work out which of them are the most similar. Moreover, we will work out how to retrieve those documents that are most similar to a certain set of terms. Going forward, we'll only use the English-language tales, but you can use all of the techniques I'll cover in this and all of the remaining chapters for most languages.

We'll pick up where we left off at the end of the last chapter, using our cleansed body of tales. If you haven't prepared your own data set, you can download it from [here](https://github.com/t-redactyl/text-mining/blob/master/02%20Text%20cleaning/cleansed_data.csv).

## Term frequency
As we saw in the previous chapter, one of the most basic things we can do once we've cleaned and tokenised our documents is to take a frequency count of all of the words. Let's revisit this now for our corpus of fairytales.

In [387]:
from nltk.tokenize import word_tokenize
from nltk import FreqDist
from pandas import DataFrame

# `texts` is the cleansed data set from the previous chapter
# Tokenise each tale, then flatten the per-tale token lists into one list
english_tokens = [word_tokenize(text) for text in texts['english_tales']]
flat_list = [word for sent_list in english_tokens for word in sent_list]
english_freqs = FreqDist(flat_list)

ftales = DataFrame(english_freqs.most_common(20), columns=['term', 'frequency'])
ftales

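|   | term | frequency |
|---|------|-----------|
| 0 | say | 3026 |
| 1 | go | 2283 |
| 2 | come | 1680 |
| 3 | will | 1529 |
| 4 | thou | 1501 |
| 5 | king | 1199 |
| 6 | take | 1139 |
| 7 | see | 1038 |
| 8 | can | 1004 |
| 9 | little | 995 |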


Let's make this a little less overwhelming by focusing on two terms: one that is common and one that is rare. For our common term, we'll pull out one of the terms that occurs more than 520 times. For the rare term, we'll take a term that has a frequency of 30.

In [336]:
# Choose common term and frequency (the first term with more than 520 occurrences)
common = {k: v for k, v in english_freqs.items() if v > 520}
common_term, common_freq = next(iter(common.items()))

# Choose rare term and frequency (the first term with exactly 30 occurrences)
rare = {k: v for k, v in english_freqs.items() if v == 30}
rare_term, rare_freq = next(iter(rare.items()))

print("The common term is '%s' and its frequency is %d." % (common_term, common_freq), end=" ")
print("The rare term is '%s' and its frequency is %d." % (rare_term, rare_freq))

The common term is 'make' and its frequency is 533. The rare term is 'merchant' and its frequency is 30.


One thing that might have occurred to you is that because we are using the raw term counts, words that are common in longer stories will be overrepresented. If we have a look at the length of the longest versus the shortest tales, we can get a sense of how unbalanced the representation is across tales.

In [155]:
from IPython.display import display

tale_length = DataFrame({
    'title': [sentence.replace("Brothers Grimm fairy tales -", "").replace("(Margaret Hunt)", "").strip()
              for sentence in texts['english_titles']],
    'total words': [len(tale.split()) for tale in texts['english_tales']]})
# Show the five longest tales, then the five shortest
display(tale_length.sort_values('total words', ascending=False).head())
display(tale_length.sort_values('total words', ascending=False).tail())

|     | title | total words |
|-----|-------|-------------|
| 59 | The Two Brothers | 3735 |
| 80 | Brother Lustig | 1942 |
| 35 | The Wishing-Table, The Gold-Ass, and the Cudge... | 1838 |
| 179 | The Goose-Girl at the Well | 1784 |
| 3 | The Story of the Youth who Went Forth to Learn... | 1749 |


|     | title | total words |
|-----|-------|-------------|
| 137 | Knoist and his Three Sons | 72 |
| 116 | The Wilful Child | 63 |
| 160 | A Riddling Tale | 59 |
| 104 | Stories about Snakes | 54 |
| 106 | The Two Travellers | 44 |


We can see that the longest tale, *The Two Brothers*, is a whopping 85 times or so the length of the shortest tale, *The Two Travellers*. This means that our term frequencies will be dominated by those terms occurring in the longer tales, making it hard to work out what defines the other tales in our corpus.

Another thing you might have noticed is that our common term, *make*, is around 18 times more common than our rare term, *merchant*. While more frequent terms do have more importance than less frequent terms in a document or corpus, this relationship is not linear. In other words, a term that occurs 200 times shouldn't be considered 10 times more important than a term that occurs 20 times - after a certain number of occurrences, the presence of that common term is not adding any more information about its importance.

## Normalised term frequency
One way we can control for these problems caused by using raw frequencies is to normalise them, either on the basis of document length (i.e., convert them into proportions) or on some other scale. One of the most popular methods is called the [sublinear term frequency](https://nlp.stanford.edu/IR-book/html/htmledition/sublinear-tf-scaling-1.html). In less mysterious terms, all we are doing is taking the log of each frequency and then adding 1. We add 1 because the log of 1 is 0: without it, a term that occurs exactly once would end up with a weight of zero, as if it never occurred at all. Converting the frequencies to a log scale dampens the very frequent terms and brings them more into proportion with the less common ones.

Let's have a look at what this does to the frequencies for our common and rare term.

In [364]:
import math

def subl_tf(freq):
    return 1 + math.log(freq)

common_ntf = subl_tf(common_freq)
rare_ntf = subl_tf(rare_freq)

print("The normalised term frequency for '%s' is %0.2f," % (common_term, common_ntf), end=" ")
print("and the normalised term frequency for '%s' is %0.2f." % (rare_term, rare_ntf))

The normalised term frequency for 'make' is 7.28, and the normalised term frequency for 'merchant' is 4.40.


We can see that *make* has been weighted, so that instead of being around 18 times as large as *merchant* it is not even twice as large! However, you might have noticed *another* problem we now have. While we got rid of stop words in the cleaning process, the top 20 words are dominated by words like *say*, *come* and *go*, which don't really offer much unique information as to what each story is about. Luckily, there is another way we can deal with this, which we'll discuss in the next section.
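
To make the dampening a bit more concrete, here is a minimal sketch of sublinear scaling applied to a few raw counts. The helper name `sublinear_tf` is just for illustration, and the counts of 200 and 20 are the ones from the example at the end of the previous section.

```python
import math

def sublinear_tf(freq):
    """Return 1 + ln(freq) for a positive count, and 0 for a term that never occurs."""
    return 1 + math.log(freq) if freq > 0 else 0.0

# A single occurrence keeps a weight of 1, because log(1) is 0 and we add 1.
print(sublinear_tf(1))

# Raw counts of 200 and 20 differ by a factor of 10 ...
raw_ratio = 200 / 20
# ... but after scaling the gap shrinks to less than a factor of 2.
scaled_ratio = sublinear_tf(200) / sublinear_tf(20)
print("raw ratio: %0.1f, scaled ratio: %0.2f" % (raw_ratio, scaled_ratio))
```

The zero case is handled separately because the log of 0 is undefined; a term that never occurs simply gets no weight.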

## Inverse document frequency
As a testament to how unhelpful common words can be in discriminating between documents, let's have a look at the document frequency of our common term, *make*. This is the number of documents that this word occurs in at least once.

In [360]:
total_df = len(texts['english_tales'])
# Count the tales in which the term appears at least once
common_df = len([tale for tale in texts['english_tales'] if common_term in tale.split(" ")])
print("'%s' occurs in %d documents, of a total of %d." % (common_term, common_df, total_df))

'make' occurs in 158 documents, of a total of 211.


You can see that of the 211 tales, *make* occurs in 158, which is three quarters of them! This is not really all that helpful for telling the tales apart. What we need to do is add some sort of weighting that penalises terms that occur in too many documents to really offer meaningful information about their content. We can achieve this using the inverse document frequency (IDF).

To calculate this, we take the inverse of the document frequency (i.e., divide the total number of documents in the corpus by the document frequency of a term), and then take the log of this. Terms that occur in many documents have a smaller IDF, while those that occur in few documents have a higher IDF. Let's see how this works with our common and rare terms. We first need to calculate the document frequency for our rare term.

In [361]:
# Count the tales in which the term appears at least once
rare_df = len([tale for tale in texts['english_tales'] if rare_term in tale.split(" ")])
print("'%s' occurs in %d documents, of a total of %d." % (rare_term, rare_df, total_df))

'merchant' occurs in 9 documents, of a total of 211.


In contrast to *make*, *merchant* only occurs in 9 documents! It is likely to give us some unique information about the documents it occurs within. Let's see whether the IDF bears this out.

In [363]:
def idf(N, df):
    return math.log(N * 1.0 / df)

common_idf = idf(total_df, common_df)
rare_idf = idf(total_df, rare_df)

print("The IDF of '%s' is %0.2f, and the IDF of '%s' is %0.2f." % (
    common_term, common_idf, rare_term, rare_idf))

The IDF of 'make' is 0.29, and the IDF of 'merchant' is 3.15.


You can see the ubiquity of *make* has made it substantially less important than the much rarer *merchant*. We're ready to throw this together with our term frequency to get our final term weighting.
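
The two counts above can be generalised to every term at once. Here is a minimal sketch on a made-up corpus of four tiny documents (the documents and variable names are purely illustrative and are not taken from the tales). It counts each term once per document and then applies the same log of the total number of documents over the document frequency.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each document is already cleaned and space-separated.
docs = [
    "king make bread",
    "merchant make gold",
    "king make war",
    "queen make peace",
]

# Converting the tokens to a set means each term is counted once per document.
doc_freq = Counter(term for doc in docs for term in set(doc.split()))

n_docs = len(docs)
idf_by_term = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

for term in ["make", "king", "merchant"]:
    print("%s: df=%d, idf=%0.2f" % (term, doc_freq[term], idf_by_term[term]))
```

Because *make* appears in every toy document, its IDF comes out as exactly 0, while *merchant*, which appears in only one, gets the highest weight of the three.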

## Pulling it together with tf-idf
We can now tie our normalised term frequency (TF) together with the inverse document frequency (IDF) by calculating the tf-idf. As you might have guessed from the name, the tf-idf is simply the product of the TF and the IDF for a term. The tf-idf is designed to be highest when a term occurs many times within a small number of documents, and lowest when a term occurs in basically all documents. It thus offers a nice way of weighting terms so that those that are most characteristic of a document have the most importance.
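
Written out as a formula, using the sublinear term frequency and the IDF as we defined them in the previous two sections, the weighting looks like this. The symbols are just shorthand introduced here: $f_t$ is the raw frequency of term $t$, $N$ is the total number of documents, and $\mathrm{df}_t$ is the number of documents that contain $t$.

$$
\text{tf-idf}_t = \left(1 + \ln f_t\right) \times \ln\left(\frac{N}{\mathrm{df}_t}\right)
$$

The second factor is 0 for a term that occurs in every document, which is what pushes ubiquitous terms to the bottom of the ranking no matter how frequent they are.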

Let's have a look at the tf-idf for our example words *make* and *merchant*. As we have already calculated our normalised term frequencies and inverse document frequencies, we simply need to multiply these together.

In [1]:
common_tfidf = common_ntf * common_idf
rare_tfidf = rare_ntf * rare_idf

print("The tf-idf of '%s' is %0.2f, and the tf-idf of '%s' is %0.2f." % (
    common_term, common_tfidf, rare_term, rare_tfidf))

The tf-idf of 'make' is 2.11, and the tf-idf of 'merchant' is 13.88.

You can see that the term *merchant*, which only occurs in 9 tales, has a far higher tf-idf than *make*, which offers very little unique information about any specific tale.

## Calculating the tf-idf using Scikit-Learn
To make it a bit easier to do this for every term, we can pull in some handy functions from the Scikit-Learn `feature_extraction` module. The `TfidfVectorizer` takes a raw collection of documents and turns it into a term-document matrix, with one row per document and one column per term. This is built from a count of how many times each of the terms in the corpus occurs in each specific document, and is something we'll revisit in much more detail once we start discussing cosine similarity. It also sets up how the tf-idf will be calculated. In our case, we want to make sure we are using the sublinear term frequency normalisation, so we set the `sublinear_tf` argument to `True`.

We then call the `fit_transform` method, which learns the vocabulary of all words in the corpus from the input text, calculates the tf-idf according to how you've set it up, and returns the final weighting for each term in each document. Again, we'll get into this in more detail in the next chapter.

For now, we can have a look at the IDF weightings for our two example terms, *make* and *merchant*.

In [384]:
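from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the vectoriser with sublinear term frequency scaling
sklearn_tfidf = TfidfVectorizer(sublinear_tf=True)
sklearn_representation = sklearn_tfidf.fit_transform(texts['english_tales'])

# Named idf_values so that it doesn't overwrite the idf() function defined earlier
idf_values = sklearn_tfidf.idf_
# get_feature_names_out() replaces get_feature_names() in scikit-learn 1.0 and later
idf_weights = DataFrame({'term': sklearn_tfidf.get_feature_names_out(),
                         'idf': idf_values})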

In [385]:
idf_weights[idf_weights['term'] == 'make']

|      | idf | term |
|------|-----|------|
| 3266 | 1.287682 | make |


In [386]:
idf_weights[idf_weights['term'] == 'merchant']

|      | idf | term |
|------|-----|------|
| 3353 | 4.054001 | merchant |
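
These values are larger than the IDFs of 0.29 and 3.15 that we calculated by hand. Under its default settings, `TfidfVectorizer` uses a smoothed IDF (`smooth_idf=True`): it adds 1 to both the number of documents and the document frequency before taking the log, and then adds 1 to the result. The sketch below uses a hypothetical helper called `smoothed_idf` to apply that formula to the document frequencies we counted earlier, so you can compare the results with the two tables above.

```python
import math

def smoothed_idf(n_docs, df):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1 (assumed scikit-learn default)."""
    return math.log((1.0 + n_docs) / (1.0 + df)) + 1.0

# Document frequencies counted earlier: 'make' in 158 tales, 'merchant' in 9, of 211.
make_idf = smoothed_idf(211, 158)
merchant_idf = smoothed_idf(211, 9)

print("make: %0.6f" % make_idf)
print("merchant: %0.6f" % merchant_idf)
```

The ordering of the two terms is the same as before; the smoothing only shifts the scale and guarantees that no term ends up with an IDF of exactly zero.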


In [377]:
vectorizer = TfidfVectorizer(sublinear_tf=True)
tfidf_matrix = vectorizer.fit_transform(texts['english_tales'])
# get_feature_names_out() replaces get_feature_names() in scikit-learn 1.0 and later
feature_names = vectorizer.get_feature_names_out()
# Convert the sparse matrix to a dense one so that we can display it
dense = tfidf_matrix.todense()
dense

matrix([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        ..., 
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [402]:
tfidf = TfidfVectorizer(sublinear_tf=True)
tfs = tfidf.fit_transform(texts['english_tales'])
# Transforming the same tales with the fitted vectoriser gives the same weights as tfs
response = tfidf.transform(texts['english_tales'])

In [403]:
terms_l = []
tfidf_l = []

feature_names = tfidf.get_feature_names_out()
# nonzero()[1] gives the column (term) index of every non-zero entry in the matrix
for col in response.nonzero()[1]:
    terms_l.append(feature_names[col])
    # Row 0 is the first tale, so this reads each term's weight in that tale only
    tfidf_l.append(response[0, col])

In [414]:
l = []
for col in response.nonzero()[1]:
    # As above, the weight is taken from row 0 (the first tale)
    l.append({'term': feature_names[col], 'tfidf': response[0, col]})

tfidf_tot = DataFrame(l).drop_duplicates()

In [416]:
tfidf_tot[tfidf_tot['term'] == 'merchant']

|      | term | tfidf |
|------|------|-------|
| 2039 | merchant | 0.0 |
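
A weight of 0.0 here most likely reflects which row was read rather than how important *merchant* is: `response[0, col]` always looks at the first tale, so any term that doesn't occur in that tale comes back as zero. The sketch below is a hand-rolled, pure-Python approximation of the calculation on a made-up corpus of three documents. It assumes sublinear term frequency, a smoothed IDF and length-normalised rows, which I believe is what `TfidfVectorizer(sublinear_tf=True)` applies by default; the function name `tfidf_rows` and the corpus are hypothetical. It looks up the weight of one term in every document rather than just the first.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document.

    Assumes sublinear tf (1 + ln tf), smoothed idf (ln((1 + N) / (1 + df)) + 1)
    and rows scaled to unit (L2) length.
    """
    counts = [Counter(doc.split()) for doc in docs]
    n_docs = len(docs)
    # Iterating over a Counter yields each term once, so this is a document count.
    doc_freq = Counter(term for c in counts for term in c)
    rows = []
    for c in counts:
        weights = {
            term: (1 + math.log(tf)) * (math.log((1.0 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Made-up corpus: 'merchant' is absent from the first document.
docs = [
    "king say go king",
    "merchant say gold merchant",
    "merchant go sea",
]
rows = tfidf_rows(docs)

# Look the term up in every row, not only in row 0.
merchant_weights = [row.get("merchant", 0.0) for row in rows]
for i, weight in enumerate(merchant_weights):
    print("document %d: %0.3f" % (i, weight))
```

The first document gets a weight of zero simply because it doesn't contain the term, while the other two get positive weights. The same idea applies to the sparse matrix above: to see where *merchant* matters, we need to read its column across all of the rows.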
