In [157]:
import pandas as pd
from pandas import DataFrame
texts = pd.read_csv("/Users/jodieburchell/Documents/text-mining/02 Text cleaning/cleansed_data.csv",
                    usecols=[1, 2])
english_freqs = pd.read_csv("/Users/jodieburchell/Documents/text-mining/02 Text cleaning/english_term_frequencies.csv",
                           usecols=[1, 2])

## Outline

* Raw term frequency is not helpful on its own: documents differ in length, so the same count means different things in different documents.
* Frequency therefore needs to be divided by the number of words in the document (i.e., it becomes the proportion of all signal words in that document). Is this the same as [normalising??](https://stanford.edu/~rjweiss/public_html/IRiSS2013/text2/notebooks/tfidf.html)
* However, a term that is frequent within one document might also be frequent in *all* documents, so we also need to measure how rare it is across the corpus of documents.
* A term that is common in one or a few documents but rare in the corpus (it occurs in only a few documents) is likely to be highly useful for determining what those documents are about.
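
The points above can be made concrete with a small sketch. The three toy documents and the helper name `relative_frequencies` are invented for illustration (they are not taken from the tales). The code computes each term's share of its document's tokens, then counts how many documents each term appears in.

```python
from collections import Counter

# Toy tokenised documents (hypothetical, not taken from the tales)
docs = {
    "short": ["king", "frog", "king"],
    "long": ["king", "go", "say", "go", "say", "say", "wolf", "go", "say", "king"],
    "other": ["say", "go", "cat", "mouse"],
}

def relative_frequencies(tokens):
    """Return each term's count divided by the document's total token count."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

# "king" has the same raw count (2) in both documents,
# but its proportion depends on the document's length
rel = {name: relative_frequencies(tokens) for name, tokens in docs.items()}
print(rel["short"]["king"])  # 2 of 3 tokens
print(rel["long"]["king"])   # 2 of 10 tokens

# Document frequency: in how many documents does each term occur at least once?
doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
print(doc_freq["say"])   # appears in two of the three documents
print(doc_freq["frog"])  # appears in only one document
```

Here "frog" is both prominent in its own document and rare across the toy corpus, which is the kind of term the last bullet describes.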

# Chapter 3: Calculating the similarity between documents

In the last chapter, we covered how to extract a clean, normalised bag-of-words from our collection of tales. In this chapter, we are going to start doing something useful with it. Specifically, we'll look at the frequency of terms within our tales to work out which tales are the most similar to each other. We'll also work out how to retrieve the documents that are most similar to a given set of terms. Going forward, we'll only use the English-language tales, but you can use all of the techniques I'll cover in this and the remaining chapters for most languages.

We'll pick up where we left off at the end of the last chapter, using our cleansed body of tales. If you haven't prepared your own data set, you can download it from [here](https://github.com/t-redactyl/text-mining/blob/master/02%20Text%20cleaning/cleansed_data.csv).

## Term frequency
As we saw in the previous chapter, one of the most basic things we can do once we've cleaned and tokenised our documents is to count how often each word occurs. In the last chapter, I counted frequencies across the whole corpus, but to give us a bit more insight into our texts let's break this down by story. We'll do this for the first 5 stories.

To do this, we'll use a class from `sklearn.feature_extraction.text` called `CountVectorizer`. Its `fit_transform` method creates what is called a term-frequency matrix (often also called a document-term matrix), with one row per tale and one column per term. We'll start with the most basic form, which gives us a simple count of how many times each word occurs in each tale.
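
To make the shape of that matrix concrete, here is a minimal sketch that builds the same kind of count table by hand with `collections.Counter` instead of scikit-learn. The two toy documents and the variable names are invented for illustration, and the whitespace split is a simplification of the tokenising that `CountVectorizer` does.

```python
from collections import Counter

# Toy documents (hypothetical); the real input is the cleansed tales
toy_docs = ["frog king say frog", "cat mouse say say"]

# Count terms per document; split() is a stand-in for real tokenisation
counts = [Counter(doc.split()) for doc in toy_docs]

# The vocabulary is every distinct term, sorted alphabetically here
vocabulary = sorted(set().union(*counts))

# One row per document, one column per vocabulary term
matrix = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column is one term, so a zero means the term does not occur in that document.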

In [43]:
from sklearn.feature_extraction.text import CountVectorizer
countvec = CountVectorizer()

We'll now create a data set containing the first 5 cleansed tales. We'll also cleanse the titles so we can use these as labels.

In [158]:
english_sample = DataFrame({'tales': texts['english_tales'][:5]})
english_titles = DataFrame({
    'titles': [sentence.replace("Brothers Grimm fairy tales -", "").replace("(Margaret Hunt)", "").strip()
               for sentence in texts['english_titles'][:5]]
})

Now that we've prepared our little sample, let's create a term-frequency matrix from it, sort the matrix by the most common words in the whole sample, and then attach the names of each tale as the label.

In [168]:
def getOrderedTFM(data, labels):
    # Build the count matrix: one row per tale, one column per term
    m = DataFrame(countvec.fit_transform(data).toarray(),
                  columns=countvec.get_feature_names_out())
    # Order the columns from the most to the least frequent term in the whole sample
    m = m[m.sum().sort_values(ascending=False).index]
    # Attach the tale titles as the first column, matching rows by index
    m = pd.merge(
        DataFrame(labels),
        m,
        left_index=True, right_index=True, sort=False)

    return m

In [170]:
sample = getOrderedTFM(english_sample['tales'], english_titles)
sample

Unnamed: 0,titles,say,thou,go,will,come,can,little,shudder,thy,...,noise,not,odious,conversation,contrary,nowhere,obedient,obstinately,occasion,listener
0,"The Frog-King, or Iron Henry",11,17,8,4,6,6,10,0,14,...,0,0,1,0,0,0,0,0,1,0
1,Cat and Mouse in Partnership,14,0,10,11,3,1,3,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Our Lady's Child,20,21,9,10,5,11,12,0,11,...,0,0,0,0,1,0,1,1,0,0
3,The Story of the Youth who Went Forth to Learn...,51,36,38,25,27,27,10,38,10,...,1,1,0,1,0,0,0,0,0,1
4,The Wolf and The Seven Little Kids,9,5,9,7,9,2,6,0,2,...,0,0,0,0,0,1,0,0,0,0
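
The column ordering that `getOrderedTFM` performs can be seen in isolation on a toy table. This is a plain-Python sketch with invented counts and names, not the pandas code above: it sums each term across documents and lists the terms from most to least frequent.

```python
# Toy count table (hypothetical numbers): one dict of term counts per document
rows = [
    {"king": 3, "say": 5, "wolf": 0},
    {"king": 0, "say": 4, "wolf": 2},
]

# Total frequency of each term across all documents
totals = {term: sum(row[term] for row in rows) for term in rows[0]}

# Terms ordered from most to least frequent in the whole sample
ordered_terms = sorted(totals, key=totals.get, reverse=True)

# Rebuild each row with its columns in that order
ordered_rows = [[row[term] for term in ordered_terms] for row in rows]

print(ordered_terms)
print(ordered_rows)
```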


In [193]:
gsample = pd.melt(sample.iloc[:,0:16], id_vars = ['titles'], 
                  value_vars = sample.iloc[:,0:16].columns[1:],
                  var_name = 'term', value_name = 'frequency')
gsample[:10]

Unnamed: 0,titles,term,frequency
0,"The Frog-King, or Iron Henry",say,11
1,Cat and Mouse in Partnership,say,14
2,Our Lady's Child,say,20
3,The Story of the Youth who Went Forth to Learn...,say,51
4,The Wolf and The Seven Little Kids,say,9
5,"The Frog-King, or Iron Henry",thou,17
6,Cat and Mouse in Partnership,thou,0
7,Our Lady's Child,thou,21
8,The Story of the Youth who Went Forth to Learn...,thou,36
9,The Wolf and The Seven Little Kids,thou,5
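
The reshape that `pd.melt` does here, from one column per term to one row per title and term, can be sketched by hand. The toy table below is invented, and the loop order is chosen to mirror the output above (all titles for the first term, then all titles for the next).

```python
# Toy wide table (hypothetical numbers): one row per tale, one column per term
wide = [
    {"titles": "Tale A", "say": 11, "thou": 17},
    {"titles": "Tale B", "say": 14, "thou": 0},
]
terms = ["say", "thou"]

# Long format: one (title, term, frequency) record per cell of the wide table
long_rows = [
    {"titles": row["titles"], "term": term, "frequency": row[term]}
    for term in terms
    for row in wide
]

for record in long_rows:
    print(record)
```

The long format is what plotting libraries in the ggplot style expect: one column to map to each aesthetic (term, frequency, title).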


In [197]:
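# Assumption: plotnine stands in for the `ggplot` package, whose call here
# raised "AttributeError: 'DataFrame' object has no attribute 'sort'"
from plotnine import ggplot, aes, geom_col

# Stacked bars: one bar per term, coloured by tale, with height equal to frequency
ggplot(gsample, aes(x='term', y='frequency', fill='titles')) + \
    geom_col()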

In [178]:
sample.iloc[:,0:16]

Unnamed: 0,titles,say,thou,go,will,come,can,little,shudder,thy,king,take,child,thee,see,cry
0,"The Frog-King, or Iron Henry",11,17,8,4,6,6,10,0,14,17,5,2,11,7,8
1,Cat and Mouse in Partnership,14,0,10,11,3,1,3,0,0,0,2,5,0,2,2
2,Our Lady's Child,20,21,9,10,5,11,12,0,11,7,10,18,4,5,3
3,The Story of the Youth who Went Forth to Learn...,51,36,38,25,27,27,10,38,10,12,15,0,17,13,10
4,The Wolf and The Seven Little Kids,9,5,9,7,9,2,6,0,2,0,3,9,2,5,8


In [180]:
sample.columns[1:]

Index(['say', 'thou', 'go', 'will', 'come', 'can', 'little', 'shudder', 'thy',
       'king',
       ...
       'noise', 'not', 'odious', 'conversation', 'contrary', 'nowhere',
       'obedient', 'obstinately', 'occasion', 'listener'],
      dtype='object', length=1072)

In [185]:
pd.melt(sample.iloc[:,0:16], id_vars = ['titles'], value_vars = sample.columns[1:])

Unnamed: 0,titles,variable,value
0,"The Frog-King, or Iron Henry",say,11.0
1,Cat and Mouse in Partnership,say,14.0
2,Our Lady's Child,say,20.0
3,The Story of the Youth who Went Forth to Learn...,say,51.0
4,The Wolf and The Seven Little Kids,say,9.0
5,"The Frog-King, or Iron Henry",thou,17.0
6,Cat and Mouse in Partnership,thou,0.0
7,Our Lady's Child,thou,21.0
8,The Story of the Youth who Went Forth to Learn...,thou,36.0
9,The Wolf and The Seven Little Kids,thou,5.0
