In [1]:
import pandas as pd
from pandas import DataFrame
texts = pd.read_csv("/Users/jodieburchell/Documents/text-mining/02 Text cleaning/cleansed_data.csv",
                    usecols=[1, 2])
english_freqs = pd.read_csv("/Users/jodieburchell/Documents/text-mining/02 Text cleaning/english_term_frequencies.csv",
                           usecols=[1, 2])

In [None]:
#gsample = pd.melt(sample.iloc[:,0:16], id_vars = ['titles'], 
#                  value_vars = sample.iloc[:,0:16].columns[1:],
#                  var_name = 'term', value_name = 'frequency')
#gsample[:10]
#gsample.to_csv("/Users/jodieburchell/Documents/text-mining/03 Document similarity/raw_frequency_sample.csv")

## Outline

* Why is raw term frequency not helpful (because documents are of different lengths, therefore the frequency of terms means different things in different contexts)
* Frequency needs to be divided by the number of words in a document (i.e., it is the proportion of all signal words in a document). See https://stanford.edu/~rjweiss/public_html/IRiSS2013/text2/notebooks/tfidf.html
* However, even if a term is frequent within one document, it might also be frequent in *all* documents. Therefore, we also need to measure how rare the term is across the corpus of documents.
* If a term is common in one or a few documents, and rare in the corpus (only occurs in a few documents), it is likely to be highly useful for determining what that document is about.

* Now I need to work out how to do the cosine similarity - I think there is an extra step where you need to normalise the length? Look into this.

* Ok, now we can set the weights for each of the words, and then calculate the cosine similarity between one of the texts (leave out of the training set?) and a "query" (something like 'beautiful princess').

# Chapter 3: Calculating the similarity between documents

In the last chapter, we covered how we could extract a clean, normalised bag-of-words from our collections of tales. In this chapter, we are going to start doing something useful with them. Specifically, we are going to have a look at the frequency of terms within our tales in order to work out which of them are the most similar. Moreover, we will work out how to retrieve those documents that are most similar to a certain set of terms. Going forward, we'll only use the English-language tales, but you can use all of the techniques I'll cover in this and all of the remaining chapters for most languages.

We'll pick up where we left off at the end of the last chapter, using our cleansed body of tales. If you haven't prepared your own data set, you can download it from [here](https://github.com/t-redactyl/text-mining/blob/master/02%20Text%20cleaning/cleansed_data.csv).

## Term frequency
As we saw in the previous chapter, one of the most basic things we can do once we've cleaned and tokenised our documents is to take a frequency count of all of the words. In the last chapter, I took a frequency count across the whole corpus, but to give us a bit more insight into our texts let's break this down by story. We'll do this for the first 5 stories.

To do this, we'll use a class from `sklearn.feature_extraction.text` called `CountVectorizer`. Its `fit_transform` method creates what is called a document-term matrix (also known as a term-frequency matrix), with one row per tale and one column per term. We'll start with the most basic form, which gives us a simple count of how many times each word occurs in each tale.
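
To make the shape of this matrix concrete, here is a hand-rolled sketch that uses only the standard library rather than `CountVectorizer`. The two documents and all of the `toy_` names are made up for illustration, and the tokenisation is a plain whitespace split, whereas `CountVectorizer` by default also lowercases the text and drops single-character tokens.

```python
from collections import Counter

# Two made-up "documents", purely for illustration
toy_docs = ["frog king frog", "king child"]

# Count the terms in each document separately
toy_counts = [Counter(doc.split()) for doc in toy_docs]

# The vocabulary is every distinct term in the sample, in alphabetical order
toy_vocab = sorted(set(term for counts in toy_counts for term in counts))

# One row per document, one column per term; absent terms count as 0
toy_matrix = [[counts[term] for term in toy_vocab] for counts in toy_counts]

print(toy_vocab)
for row in toy_matrix:
    print(row)
```

Each row is a document and each column is a term from the shared vocabulary, which is the same layout we'll get back from `fit_transform`.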

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
countvec = CountVectorizer()

We'll now create a data set containing the first 5 cleansed tales. We'll also cleanse the titles so we can use these as labels.

In [3]:
english_sample = DataFrame({'tales': texts['english_tales'][:5]})
english_titles = DataFrame({
    'titles': [sentence.replace("Brothers Grimm fairy tales -", "").replace("(Margaret Hunt)", "").lstrip().rstrip() 
               for sentence in texts['english_titles'][:5]]
})

Now that we've prepared our little sample, let's create a term-frequency matrix from it, sort the matrix by the most common words in the whole sample, and then attach the names of each tale as the label.

In [4]:
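def getOrderedTFM(data, labels):
    # Count how often each term occurs in each tale (rows = tales, columns = terms).
    # get_feature_names_out() needs scikit-learn 1.0 or later; older versions
    # called this get_feature_names().
    m = DataFrame(countvec.fit_transform(data).toarray(),
                  columns=countvec.get_feature_names_out())
    # Order the columns by their total count across the sample, most common first
    m = m[m.sum().sort_values(ascending=False).index]
    # Attach the tale titles as labels
    m = pd.merge(DataFrame(labels), m,
                 left_index=True, right_index=True, sort=False)

    return m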

In [5]:
sample = getOrderedTFM(english_sample['tales'], english_titles)
sample

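| | titles | say | thou | go | will | come | can | little | shudder | thy | ... | noise | not | odious | conversation | contrary | nowhere | obedient | obstinately | occasion | listener |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Frog-King, or Iron Henry | 11 | 17 | 8 | 4 | 6 | 6 | 10 | 0 | 14 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | Cat and Mouse in Partnership | 14 | 0 | 10 | 11 | 3 | 1 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Our Lady's Child | 20 | 21 | 9 | 10 | 5 | 11 | 12 | 0 | 11 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 3 | The Story of the Youth who Went Forth to Learn... | 51 | 36 | 38 | 25 | 27 | 27 | 10 | 38 | 10 | ... | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | The Wolf and The Seven Little Kids | 9 | 5 | 9 | 7 | 9 | 2 | 6 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |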


Let's have a look at the frequency of each of the top 15 of these terms, broken down by tale.

<img src="/figure/Chapter 3.1.png" title="Frequency graph 1" style="display: block; margin: auto;" />
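
The plotting code for this figure isn't shown here. One common way to prepare data for a grouped chart like this is to reshape the matrix from wide to long format with `pd.melt`. The following is a minimal sketch under that assumption, using a made-up `toy_wide` frame as a stand-in for the first few columns of `sample`; the names and values are hypothetical.

```python
import pandas as pd

# A made-up wide matrix: one row per tale, one column per term (illustration only)
toy_wide = pd.DataFrame({
    "titles": ["Tale A", "Tale B"],
    "say": [11, 14],
    "go": [8, 10],
})

# Reshape to long format: one row per (tale, term) pair
toy_long = pd.melt(toy_wide, id_vars=["titles"],
                   value_vars=["say", "go"],
                   var_name="term", value_name="frequency")
print(toy_long)
```

In long format, each row holds a single tale, term and frequency, which is the layout most plotting libraries expect when grouping bars by a category.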

The first thing you might notice is that the frequency of words does differ between the tales. For example, *The Frog-King* has a high frequency for the word 'king', whereas *Our Lady's Child* has a high frequency for the words 'little' and 'child'. However, you may have also noticed that certain tales seem to be overrepresented for every word, especially *The Story of the Youth who Went Forth to Learn Fear*. Let's have a look at how many words are in each story.

In [32]:
DataFrame({
    'title': english_titles['titles'],
    'total words': [len(tale.split()) for tale in english_sample['tales']]
})

| | title | total words |
|---|---|---|
| 0 | The Frog-King, or Iron Henry | 658 |
| 1 | Cat and Mouse in Partnership | 444 |
| 2 | Our Lady's Child | 871 |
| 3 | The Story of the Youth who Went Forth to Learn... | 1749 |
| 4 | The Wolf and The Seven Little Kids | 494 |


We can see that *The Story of the Youth who Went Forth to Learn Fear* has between two and four times as many words as each of the other stories! The problem with using the raw frequency of each word is that terms in longer documents automatically get scaled up compared to those in shorter documents. Because most terms are more frequent in longer documents, it is difficult to see how important individual words are relative to one another when documents have different lengths.

One way we can control for this is to divide this raw frequency of terms in each document by the total number of words in that document. Essentially, this turns the frequency of each term into the proportion of all terms in that document that it makes up. Let's do this with our term frequency matrix, and reorder the columns so the terms with the highest proportion per document end up at the front.
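
Before applying this to the tales, here is a small sketch of the idea with made-up numbers. The `toy_counts` and `toy_lengths` names and their values are hypothetical, chosen so that one tale is four times as long as the other.

```python
import pandas as pd

# Made-up counts for two tales of very different lengths (illustration only)
toy_counts = pd.DataFrame({"king": [2, 8], "frog": [6, 4]},
                          index=["short tale", "long tale"])
toy_lengths = pd.Series([20, 80], index=toy_counts.index)

# Divide every row by that tale's total word count
toy_props = toy_counts.div(toy_lengths, axis=0)
print(toy_props)
```

In raw counts, 'king' appears four times as often in the long tale, but as a proportion it makes up 10% of both tales. 'frog', on the other hand, turns out to be far more prominent in the short tale (30%) than in the long one (5%).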

In [134]:
def getOrderedTFM(data, labels):
    # Raw counts per tale (rows = tales, columns = terms).
    # get_feature_names_out() needs scikit-learn 1.0 or later.
    counts = DataFrame(countvec.fit_transform(data).toarray(),
                       columns=countvec.get_feature_names_out())
    # Total number of words in each tale, taken from the data passed in
    totals = pd.Series([len(tale.split()) for tale in data],
                       index=counts.index)
    # Turn counts into proportions by dividing each row by that tale's total
    m = counts.div(totals, axis=0)
    # Order the columns by their summed proportion, highest first
    m = m[m.sum().sort_values(ascending=False).index]
    # Attach the tale titles as labels
    m = pd.merge(DataFrame(labels), m,
                 left_index=True, right_index=True, sort=False)

    return m

In [136]:
sample = getOrderedTFM(english_sample['tales'], english_titles)
sample

| | titles | say | go | thou | will | come | little | child | cat | thy | ... | celebrate | chest | sadly | cellar | ropemaker | rope | roaring | respect | chatter | listener |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Frog-King, or Iron Henry | 0.016717 | 0.012158 | 0.025836 | 0.006079 | 0.009119 | 0.015198 | 0.00304 | 0.0 | 0.021277 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Cat and Mouse in Partnership | 0.031532 | 0.022523 | 0.0 | 0.024775 | 0.006757 | 0.006757 | 0.011261 | 0.047297 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Our Lady's Child | 0.022962 | 0.010333 | 0.02411 | 0.011481 | 0.005741 | 0.013777 | 0.020666 | 0.0 | 0.012629 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | The Story of the Youth who Went Forth to Learn... | 0.02916 | 0.021727 | 0.020583 | 0.014294 | 0.015437 | 0.005718 | 0.0 | 0.001144 | 0.005718 | ... | 0.000572 | 0.000572 | 0.000572 | 0.000572 | 0.000572 | 0.000572 | 0.000572 | 0.000572 | 0.000572 | 0.000572 |
| 4 | The Wolf and The Seven Little Kids | 0.018219 | 0.018219 | 0.010121 | 0.01417 | 0.018219 | 0.012146 | 0.018219 | 0.0 | 0.004049 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


Let's revisit the frequencies of the new top 15 terms broken down by tale.

<img src="/figure/Chapter 3.2.png" title="Frequency graph 2" style="display: block; margin: auto;" />

It is now a lot easier to pick out which terms are important in each story. You can see that *The Frog-King* contains the highest proportion of references to 'king', while *Our Lady's Child* and *The Wolf and The Seven Little Kids* contain roughly the same proportion of references to 'child'. However, you might have noticed *another* problem we now have. While we got rid of stop words in the cleaning process, the top 15 words are still dominated by words like 'say', 'come' and 'go', which don't really offer much unique information as to what each story is about. Luckily, there is another way we can deal with this, which we'll discuss in the next section.

## Inverse document frequency
