In [142]:
import pandas as pd
from pandas import DataFrame
texts = pd.read_csv("/Users/jodieburchell/Documents/text-mining/02 Text cleaning/cleansed_data.csv",
                    usecols=[1, 2])
english_freqs = pd.read_csv("/Users/jodieburchell/Documents/text-mining/02 Text cleaning/english_term_frequencies.csv",
                           usecols=[1, 2])

In [91]:
gsample = pd.melt(sample.iloc[:,0:16], id_vars = ['titles'], 
                  value_vars = sample.iloc[:,0:16].columns[1:],
                  var_name = 'term', value_name = 'frequency')
gsample['frequency'] = gsample['frequency'] - 1
gsample
gsample.to_csv("/Users/jodieburchell/Documents/text-mining/03 Document similarity/prop_frequency_sample.csv")

## Outline

* Why is raw term frequency not helpful (because documents are of different lengths, therefore the frequency of terms means different things in different contexts)
* Frequency needs to be divided by the number of words in a document (i.e., it is the proportion of all signal words in a document); see [this notebook](https://stanford.edu/~rjweiss/public_html/IRiSS2013/text2/notebooks/tfidf.html).
* However, even if a term is frequent within one document, it might also be frequent in *all* documents. Therefore, we also need to measure how rare it is within the corpus of documents.
* If a term is common in one or a few documents, and rare in the corpus (only occurs in a few documents), it is likely to be highly useful for determining what that document is about.

* Now I need to work out how to do the cosine similarity - I think there is an extra step where you need to normalise the length? Look into this.

* Ok, now we can set the weights for each of the words, and then calculate the cosine similarity between one of the texts (leave out of the training set?) and a "query" (something like 'beautiful princess').

# Chapter 3: Calculating the similarity between documents

In the last chapter, we covered how we could extract a clean, normalised bag-of-words from our collections of tales. In this chapter, we are going to start doing something useful with them. Specifically, we are going to have a look at the frequency of terms within our tales in order to work out which of them are the most similar. Moreover, we will work out how to retrieve those documents that are most similar to a certain set of terms. Going forward, we'll only use the English-language tales, but you can use all of the techniques I'll cover in this and all of the remaining chapters for most languages.

We'll pick up where we left off at the end of the last chapter, using our cleansed body of tales. If you haven't prepared your own data set, you can download it from [here](https://github.com/t-redactyl/text-mining/blob/master/02%20Text%20cleaning/cleansed_data.csv).

## Term frequency
As we saw in the previous chapter, one of the most basic things we can do once we've cleaned and tokenised our documents is to take a frequency count of all of the words. Let's revisit this now for our corpus of fairytales, this time having a look at the top 15 terms.

In [156]:
from nltk.tokenize import word_tokenize
from nltk import FreqDist

english_tokens = [word_tokenize(text) for text in texts['english_tales']]
flat_list = [word for sent_list in english_tokens for word in sent_list]
english_freqs = FreqDist(word for word in flat_list)

ftales = DataFrame(english_freqs.most_common(15), columns=['term', 'frequency'])
ftales

Unnamed: 0,term,frequency
0,say,3026
1,go,2283
2,come,1680
3,will,1529
4,thou,1501
5,king,1199
6,take,1139
7,see,1038
8,can,1004
9,little,995


In [145]:
#ftales = DataFrame(english_freqs.most_common(15), columns=['term', 'frequency'])
#ftales.to_csv("/Users/jodieburchell/Documents/text-mining/03 Document similarity/raw_frequency_sample.csv",
#              index = False)

If we plot the frequencies of these terms, we can get a sense of how relatively common these terms are.

<img src="/figure/Chapter 3.1.png" title="Frequency graph 1" style="display: block; margin: auto;" />

One thing that might have occurred to you is that because we are using raw counts, words that are common in longer stories will be overrepresented. If we have a look at the lengths of the longest versus the shortest tales, we can get a sense of how unbalanced the representation is across tales.

In [155]:
from IPython.display import display

tale_length = DataFrame({
    'title': [sentence.replace("Brothers Grimm fairy tales -", "").replace("(Margaret Hunt)", "").strip()
               for sentence in texts['english_titles']],
    'total words': [len(tale.split()) for tale in texts['english_tales']]})
display(tale_length.sort_values('total words', ascending=False).head())
display(tale_length.sort_values('total words', ascending=False).tail())

Unnamed: 0,title,total words
59,The Two Brothers,3735
80,Brother Lustig,1942
35,"The Wishing-Table, The Gold-Ass, and the Cudge...",1838
179,The Goose-Girl at the Well,1784
3,The Story of the Youth who Went Forth to Learn...,1749


Unnamed: 0,title,total words
137,Knoist and his Three Sons,72
116,The Wilful Child,63
160,A Riddling Tale,59
104,Stories about Snakes,54
106,The Two Travellers,44


We can see that the longest tale, *The Two Brothers*, is a whopping 85 times as long as the shortest tale, *The Two Travellers*. This means that our term frequencies will be dominated by terms occurring in the longer tales, making it hard to work out what defines the other tales in our corpus.

Another thing you might have noticed is that the most frequent term in the corpus, *say*, is around 3 times as common as most other words in the top 15. While more frequent terms do have more importance than less frequent terms in a document or corpus, this relationship is not linear. In other words, a term that occurs 200 times shouldn't be considered 10 times more important than a term that occurs 20 times. After a certain number of occurrences, each further occurrence of a common term adds very little information about its importance.

## Normalised term frequency
One way we can control for these problems caused by using raw frequencies is to normalise them, either on the basis of document length (i.e., convert them into proportions) or on some other scale. One of the most popular methods (and one offered by Scikit-Learn through the `sublinear_tf` option of its TF-IDF vectorisers) is called the [sublinear term frequency](https://nlp.stanford.edu/IR-book/html/htmledition/sublinear-tf-scaling-1.html). In less mysterious terms, all we are doing is taking the natural log of each frequency and then adding 1. We add 1 because the log of 1 is 0, so without it a term that occurs exactly once would get a weight of zero, the same as a term that never occurs. Converting the frequencies to a log scale dampens the very frequent terms and brings them closer in weight to less common terms.

In [77]:
import math

tf1 = 20
tf2 = 200

def subl_tf(freq):
    return 1 + math.log(freq)

display(subl_tf(tf1))
display(subl_tf(tf2))

3.995732273553991

6.298317366548036

We can see that our second term frequency has been weighted, so that instead of being 10 times as large as the first term frequency it is not even twice as large!
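The other option mentioned above, normalising by document length, can be sketched in a similar way. The two toy documents and the helper name `proportional_tf` below are made up for illustration and are not part of the tales data; this is a minimal sketch assuming Python 3, not the approach used in the rest of the chapter.

```python
from collections import Counter

def proportional_tf(tokens):
    """Return each term's share of all tokens in one document."""
    counts = Counter(tokens)
    total = len(tokens)
    # True division (Python 3) gives a proportion between 0 and 1
    return {term: count / total for term, count in counts.items()}

# Hypothetical toy documents of very different lengths
long_doc = ("king say go " * 10).split()   # 30 tokens, 'king' occurs 10 times
short_doc = "king say go".split()          # 3 tokens, 'king' occurs once

long_tf = proportional_tf(long_doc)
short_tf = proportional_tf(short_doc)

# The raw counts differ by a factor of 10, but the proportions match
print(long_tf['king'], short_tf['king'])
```

Dividing by length removes the effect of document size, but unlike the log scaling it does nothing to dampen a term that is repeated many times within a single document.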

Let's convert our top 15 most frequent terms from raw to sublinear term frequencies and see what difference it makes.

In [159]:
ftales = DataFrame(english_freqs.most_common(15), columns=['term', 'frequency'])
ftales['frequency'] = ftales['frequency'].apply(subl_tf)
ftales

Unnamed: 0,term,frequency
0,say,9.014997
1,go,8.733246
2,come,8.426549
3,will,8.332369
4,thou,8.313887
5,king,8.089243
6,take,8.037906
7,see,7.945051
8,can,7.911747
9,little,7.902743


In [160]:
ftales.to_csv("/Users/jodieburchell/Documents/text-mining/03 Document similarity/prop_frequency_sample.csv",
              index = False)

<img src="/figure/Chapter 3.2.png" title="Frequency graph 2" style="display: block; margin: auto;" />

We can now see that all of our top 15 terms have relatively similar weighting, which makes a lot more sense than what we saw in the raw frequencies. For example, in the raw frequencies *thou* was weighted as almost 3 times as important as *thee*, whereas in the normalised frequencies these words are roughly equivalent. However, you might have noticed *another* problem we now have. While we got rid of stop words in the cleaning process, the top 15 words are still dominated by words like *say*, *come* and *go*, which don't really offer much unique information as to what each story is about. Luckily, there is another way we can deal with this, which we'll discuss in the next section.

## Inverse document frequency
As a testament to how unhelpful common words can be in discriminating between documents, let's have a look at the document frequency of our most common term, *say*. This is the number of documents that this word occurs in at least once.

In [194]:
# Count occurrences of ' say ' (surrounded by spaces) in each tale
csay = [tale.count(' say ') for tale in texts['english_tales']]
print("Number of tales containing 'say':", len([x for x in csay if x != 0]))

print("Total number of tales:", len(csay))

Number of tales containing 'say': 198
Total number of tales: 211


You can see that of the 211 tales, *say* occurs in 198 of them! What we need to do is add some sort of weighting that penalises terms that occur in too many documents to really offer meaningful information about their content. We can achieve this using the inverse document frequency (IDF).
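Counting the substring `' say '` can miss a term that sits at the very start or end of a tale, where there is no space on one side. As a hedged alternative, document frequency could be counted on tokens instead. The sketch below uses a made-up three-document corpus and a hypothetical helper, `document_frequency`; it splits on whitespace for simplicity, whereas the chapter's own code uses `word_tokenize`.

```python
def document_frequency(term, documents):
    """Count the documents in which `term` appears as a whole token."""
    return sum(1 for doc in documents if term in set(doc.split()))

# Hypothetical toy corpus, not the tales data
toy_docs = [
    "say the king",        # 'say' at the start: ' say ' does not match
    "the queen will say",  # 'say' at the end: ' say ' does not match
    "the chain break",
]

token_df = document_frequency('say', toy_docs)
substring_df = sum(1 for doc in toy_docs if doc.count(' say ') > 0)

print(token_df)      # 2
print(substring_df)  # 0
```

On the real tales the two approaches may give slightly different counts, so the figures reported in this section should be read as coming from the substring version.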

To calculate this, we take the inverse of the document frequency (i.e., divide the total number of documents in the corpus by the document frequency of a term), and then take the log of this. Terms that occur in many documents have a smaller IDF, while those that occur in few documents have a higher IDF. Let's have a look at a practical example. We already have our most common term *say*, so we'll contrast that with a term that occurs a very low number of times (I've arbitrarily picked the first term with a raw frequency of 20).

In [241]:
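# Collect every term with a raw frequency of exactly 20
rare_terms = {k: v for k, v in english_freqs.items() if v == 20}
# Take the first one; which term comes first depends on dictionary ordering
next(iter(rare_terms.items()))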

('chain', 20)

We've extracted the term *chain*. Let's have a look at its document frequency:

In [239]:
len([x for x in [tale.count(' chain ') for tale in texts['english_tales']] if x != 0])

7

Now we can calculate the IDF of both of these terms.

In [248]:
N = 211        # total number of tales
df_say = 198   # tales containing 'say'
df_chain = 7   # tales containing 'chain'

# Natural log of (total documents / document frequency)
idf_say = math.log(N / df_say)
idf_chain = math.log(N / df_chain)

print("The IDF of 'say' is", round(idf_say, 3))

print("The IDF of 'chain' is", round(idf_chain, 3))

The IDF of 'say' is 0.064
The IDF of 'chain' is 3.406


You can see that our rare term, *chain*, is given a far higher IDF than our common term *say*.
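The same calculation can be applied to every term at once. The following is a minimal sketch on a made-up corpus of already-tokenised documents, assuming Python 3; the function name `idf_table` is hypothetical, and it uses the plain log(N / df) form from above, not any smoothed variant a library might apply.

```python
import math
from collections import Counter

def idf_table(tokenised_docs):
    """Map each term to log(N / df), where df is its document frequency."""
    n_docs = len(tokenised_docs)
    df = Counter()
    for tokens in tokenised_docs:
        df.update(set(tokens))   # count each term once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}

# Hypothetical toy corpus, not the tales data
toy_docs = [
    ['king', 'say', 'go'],
    ['queen', 'say', 'come'],
    ['chain', 'say', 'say'],
    ['king', 'say', 'chain'],
]

idf = idf_table(toy_docs)
print(round(idf['say'], 3))    # in all 4 documents: log(4 / 4) = 0.0
print(round(idf['queen'], 3))  # in 1 document: log(4 / 1) = 1.386
```

Note that a term appearing in every document gets an IDF of exactly 0 under this form, so it would contribute nothing once multiplied by a term frequency.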

## Pulling it together with tf-idf

We can now tie our normalised term frequency (TF) together with the inverse document frequency (IDF) by calculating the TF-IDF. As you might have guessed from the name, the TF-IDF is simply the product of the TF and the IDF for a term. Let's have a look at the TF-IDF for our example words *say* and *chain*.