In [1]:
import pandas as pd
from pandas import DataFrame
texts = pd.read_csv("/Users/jodieburchell/Documents/text-mining/02 Text cleaning/cleansed_data.csv",
                    usecols=[1, 2])
english_freqs = pd.read_csv("/Users/jodieburchell/Documents/text-mining/02 Text cleaning/english_term_frequencies.csv",
                           usecols=[1, 2])

In [91]:
gsample = pd.melt(sample.iloc[:,0:16], id_vars = ['titles'], 
                  value_vars = sample.iloc[:,0:16].columns[1:],
                  var_name = 'term', value_name = 'frequency')
gsample['frequency'] = gsample['frequency'] - 1
gsample
gsample.to_csv("/Users/jodieburchell/Documents/text-mining/03 Document similarity/prop_frequency_sample.csv")

## Outline

* Why is raw term frequency not helpful (because documents are of different lengths, therefore the frequency of terms means different things in different contexts)
* Frequency needs to be divided by the number of words in a document (i.e., it is the proportion of all signal words in a document). See [this notebook](https://stanford.edu/~rjweiss/public_html/IRiSS2013/text2/notebooks/tfidf.html).
* However, even if a term is frequent within one document, it might also be frequent in *all* documents. Therefore, we also need to determine how rare it is within the corpus of documents.
* If a term is common in one or a few documents, and rare in the corpus (only occurs in a few documents), it is likely to be highly useful for determining what that document is about.

* Now I need to work out how to do the cosine similarity - I think there is an extra step where you need to normalise the length? Look into this.

* Ok, now we can set the weights for each of the words, and then calculate the cosine similarity between one of the texts (leave out of the training set?) and a "query" (something like 'beautiful princess').
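
On the length-normalisation question above, here is a minimal sketch, assuming the standard definition of cosine similarity: the dot product of two vectors divided by the product of their lengths. The helper name `cosine_similarity` is illustrative. The two vectors reuse the counts of 'say', 'thou', 'go', 'will' and 'come' for the first two tales from the term-frequency matrix later in the chapter.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Counts of 'say', 'thou', 'go', 'will', 'come' in the first two tales
frog_king = [11, 17, 8, 4, 6]
cat_and_mouse = [14, 0, 10, 11, 3]

similarity = cosine_similarity(frog_king, cat_and_mouse)
print(round(similarity, 3))

# Doubling every count leaves the similarity unchanged, because dividing
# by the vector lengths already normalises for document length
doubled = [2 * x for x in frog_king]
same = cosine_similarity(doubled, cat_and_mouse)
print(round(same, 3))
```

So, at least under this definition, the length normalisation is built into the measure and does not need to be a separate step.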

# Chapter 3: Calculating the similarity between documents

In the last chapter, we covered how we could extract a clean, normalised bag-of-words from our collections of tales. In this chapter, we are going to start doing something useful with them. Specifically, we are going to have a look at the frequency of terms within our tales in order to work out which of them are the most similar. Moreover, we will work out how to retrieve those documents that are most similar to a certain set of terms. Going forward, we'll only use the English-language tales, but you can use all of the techniques I'll cover in this and all of the remaining chapters for most languages.

We'll pick up where we left off at the end of the last chapter, using our cleansed body of tales. If you haven't prepared your own data set, you can download it from [here](https://github.com/t-redactyl/text-mining/blob/master/02%20Text%20cleaning/cleansed_data.csv).

## Term frequency
As we saw in the previous chapter, one of the most basic things we can do once we've cleaned and tokenised our documents is to take a frequency count of all of the words. In the last chapter, I took a frequency count of the whole corpus, but to give us a bit more insight into our texts, let's break this down by story. We'll do this for the first 5 stories.

To do this, we'll use a class from `sklearn.feature_extraction.text` called `CountVectorizer`. We can use its `fit_transform` method to create what is called a term-frequency matrix, with one row per tale and one column per term. We'll start with the most basic form, which gives us a simple count of how many times each word occurs in each tale.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
countvec = CountVectorizer()
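
To make the idea of a term-frequency matrix concrete, here is a minimal pure-Python sketch on two made-up documents. It only illustrates the counting step and is not how `CountVectorizer` is implemented. The names `toy_docs`, `vocabulary` and `matrix` are hypothetical.

```python
from collections import Counter

# Two made-up, already-cleansed documents (not taken from the tales)
toy_docs = ["king frog king well", "cat mouse cat cat"]

# One Counter of term frequencies per document
counts = [Counter(doc.split()) for doc in toy_docs]

# The vocabulary is every distinct term, in alphabetical order
vocabulary = sorted(set(term for c in counts for term in c))

# One row per document, one column per term; absent terms count as 0
matrix = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row of `matrix` lines up with one document and each column with one entry in `vocabulary`, which is the same shape as the matrix we'll build for the tales.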

We'll now create a data set containing the first 5 cleansed tales. We'll also cleanse the titles so we can use these as labels.

In [3]:
english_sample = DataFrame({'tales': texts['english_tales'][:5]})
english_titles = DataFrame({
    'titles': [sentence.replace("Brothers Grimm fairy tales -", "").replace("(Margaret Hunt)", "").strip() 
               for sentence in texts['english_titles'][:5]]
})

Now that we've prepared our little sample, let's create a term-frequency matrix from it, sort the matrix by the most common words in the whole sample, and then attach the names of each tale as the label.

In [4]:
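def getOrderedTFM(data, labels):
    # One row per tale, one column per term
    m = DataFrame(countvec.fit_transform(data).toarray(), 
                  columns=countvec.get_feature_names_out())
    # Add a row of column totals so we can sort the terms by overall frequency
    m = pd.concat([m, m.sum(numeric_only=True).to_frame().T], ignore_index=True)
    m = m.T.sort_values(m.index[-1], ascending=False).T
    # Drop the totals row again and attach the tale titles as labels
    m = pd.merge(
        DataFrame(labels),
        m.drop(m.index[len(m)-1]),
        left_index=True, right_index=True, sort = False)

    return m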

In [5]:
sample = getOrderedTFM(english_sample['tales'], english_titles)
sample

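| | titles | say | thou | go | will | come | can | little | shudder | thy | ... | noise | not | odious | conversation | contrary | nowhere | obedient | obstinately | occasion | listener |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Frog-King, or Iron Henry | 11 | 17 | 8 | 4 | 6 | 6 | 10 | 0 | 14 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | Cat and Mouse in Partnership | 14 | 0 | 10 | 11 | 3 | 1 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Our Lady's Child | 20 | 21 | 9 | 10 | 5 | 11 | 12 | 0 | 11 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 3 | The Story of the Youth who Went Forth to Learn... | 51 | 36 | 38 | 25 | 27 | 27 | 10 | 38 | 10 | ... | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | The Wolf and The Seven Little Kids | 9 | 5 | 9 | 7 | 9 | 2 | 6 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |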


Let's have a look at the frequency of each of the top 15 of these terms, broken down by tale.

<img src="/figure/Chapter 3.1.png" title="Frequency graph 1" style="display: block; margin: auto;" />

The first thing you might notice is that the frequency of words does differ between the tales. For example, *The Frog-King* has a high frequency for the word 'king', whereas *Our Lady's Child* has a high frequency for the words 'little' and 'child'. However, you may have also noticed that certain tales seem to be overrepresented for every word, especially *The Story of the Youth who Went Forth to Learn Fear*. Let's have a look at how many words are in each story.

In [17]:
DataFrame({
    'title': english_titles['titles'],
    'total words': [len(tale.split()) for tale in english_sample['tales']]
})

| | title | total words |
|---|---|---|
| 0 | The Frog-King, or Iron Henry | 658 |
| 1 | Cat and Mouse in Partnership | 444 |
| 2 | Our Lady's Child | 871 |
| 3 | The Story of the Youth who Went Forth to Learn... | 1749 |
| 4 | The Wolf and The Seven Little Kids | 494 |


We can see that *The Story of the Youth who Went Forth to Learn Fear* has at least twice as many words as any of the other stories! The problem with using the raw frequency of each word is that terms in longer documents automatically get scaled up compared to those in shorter documents. In addition, because most terms are more frequent in longer documents, it is difficult to see how relatively important individual words are when documents have different lengths.

One way we can control for this is to normalise the frequencies, either on the basis of document length (i.e., convert them into proportions) or on some other scale. One popular method (available in Scikit-Learn through the `sublinear_tf` option of its TF-IDF vectorisers) is called [sublinear term frequency scaling](https://nlp.stanford.edu/IR-book/html/htmledition/sublinear-tf-scaling-1.html). In less mysterious terms, all we are doing is taking the log of each frequency and then adding 1. We add 1 because the log of 1 is 0: without it, a term that occurs exactly once would get a weight of 0, as if it did not occur at all.
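
To make the proportion option concrete, here is a minimal sketch that divides the count of 'say' in each tale by that tale's total word count. The numbers are copied from the two tables above, and the variable names are illustrative.

```python
# Counts of 'say' and total words per tale, copied from the tables above
titles = [
    "The Frog-King, or Iron Henry",
    "Cat and Mouse in Partnership",
    "Our Lady's Child",
    "The Story of the Youth who Went Forth to Learn Fear",
    "The Wolf and The Seven Little Kids",
]
say_counts = [11, 14, 20, 51, 9]
total_words = [658, 444, 871, 1749, 494]

# Proportion of each tale's words that are 'say'
say_proportions = [count / total for count, total in zip(say_counts, total_words)]

for title, proportion in zip(titles, say_proportions):
    print("{}: {:.4f}".format(title, proportion))
```

On these numbers, 'say' makes up a larger share of *Cat and Mouse in Partnership* (14 of 444 words) than of *The Story of the Youth who Went Forth to Learn Fear* (51 of 1749 words), even though its raw count is far higher in the latter. The rest of this section uses the log scaling instead.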

The reason we take the log is that while more frequent terms *do* have more importance than less frequent terms in a document, this relationship is not linear. In other words, a term that occurs 200 times shouldn't be considered 10 times more important than a term that occurs 20 times. After a certain number of occurrences, the presence of that common term is not adding any more information about its importance. If we convert the frequencies into a log scale, it helps to dampen these very frequent terms and make them more proportional to less common terms.

In [77]:
import math
from IPython.display import display

tf1 = 20
tf2 = 200

def subl_tf(freq):
    return 1 + math.log(freq)

display(subl_tf(tf1))
display(subl_tf(tf2))

3.995732273553991

6.298317366548036

We can see that our second term frequency has been dampened: instead of being 10 times as large as the first term frequency, it is now only around 1.6 times as large!
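
One caveat with `subl_tf` as written is that it fails on a frequency of 0, since `math.log(0)` raises a `ValueError`. Here is a minimal sketch of a zero-safe variant, following the definition on the sublinear scaling page linked above, where a term that doesn't occur gets a weight of 0. The name `subl_tf_safe` is hypothetical.

```python
import math

def subl_tf_safe(freq):
    """Return 1 + log(freq) for positive frequencies and 0 for absent terms."""
    if freq <= 0:
        return 0.0
    return 1 + math.log(freq)

for freq in [0, 1, 20, 200]:
    print(freq, subl_tf_safe(freq))
```

With this variant, a term that occurs once gets a weight of exactly 1, and one that never occurs gets 0.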

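Let's apply this weighting to our term frequency matrix, and reorder the columns so the terms with the highest total weighted frequency across the sample end up at the front. Because many of the counts in the matrix are 0 and the log of 0 is undefined, the function below adds 1 to each count before taking the log. This means a term that doesn't occur in a tale gets a weight of 1.0 rather than 0.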

In [82]:
def getOrderedTFM(data, labels):
    # One row per tale, one column per term
    m = DataFrame(countvec.fit_transform(data).toarray(), 
                  columns=countvec.get_feature_names_out())

    # Apply 1 + log(count + 1) to every cell; the inner + 1 avoids log(0)
    # for terms that do not occur in a tale
    df = m.apply(lambda col: col.map(lambda x: 1 + math.log(x + 1)))

    # Add a row of column totals so we can sort the terms by overall weight
    m = pd.concat([df, df.sum(numeric_only=True).to_frame().T], ignore_index=True)
    m = m.T.sort_values(m.index[-1], ascending=False).T
    # Drop the totals row again and attach the tale titles as labels
    m = pd.merge(
        DataFrame(labels),
        m.drop(m.index[len(m)-1]),
        left_index=True, right_index=True, sort = False)

    return m

In [83]:
sample = getOrderedTFM(english_sample['tales'], english_titles)
sample

| | titles | say | go | will | thou | come | little | can | take | see | ... | misfortune | cellar | celebrate | motionless | mountain | mouthful | cardplaying | card | nail | listener |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Frog-King, or Iron Henry | 3.484907 | 3.197225 | 2.609438 | 3.890372 | 2.94591 | 3.397895 | 2.94591 | 2.791759 | 3.079442 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.693147 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | Cat and Mouse in Partnership | 3.70805 | 3.397895 | 3.484907 | 1.0 | 2.386294 | 2.386294 | 1.693147 | 2.098612 | 2.098612 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | Our Lady's Child | 4.044522 | 3.302585 | 3.397895 | 4.091042 | 2.791759 | 3.564949 | 3.484907 | 3.397895 | 2.791759 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | The Story of the Youth who Went Forth to Learn... | 4.951244 | 4.663562 | 4.258097 | 4.610918 | 4.332205 | 3.397895 | 4.332205 | 3.772589 | 3.639057 | ... | 1.693147 | 1.693147 | 1.693147 | 1.693147 | 1.693147 | 1.0 | 1.693147 | 1.693147 | 1.693147 | 1.693147 |
| 4 | The Wolf and The Seven Little Kids | 3.302585 | 3.302585 | 3.079442 | 2.791759 | 3.302585 | 2.94591 | 2.098612 | 2.386294 | 2.791759 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


Let's revisit the frequencies of the new top 15 terms broken down by tale.

<img src="/figure/Chapter 3.2.png" title="Frequency graph 2" style="display: block; margin: auto;" />

It is now easier to pick out which terms are important in each story. You can see that *Our Lady's Child* and *The Wolf and The Seven Little Kids* contain more references to 'child' than the other stories, and terms that are common in only one story, such as 'shudder', have disappeared from the top 15. However, you might have noticed *another* problem we now have. While we got rid of stop words in the cleaning process, the top 15 words are still dominated by words like 'say', 'come' and 'go', which don't really offer much unique information as to what each story is about. Luckily, there is another way we can deal with this, which we'll discuss in the next section.

## Inverse document frequency
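
Here is a minimal sketch of the idea, assuming the plain textbook form of inverse document frequency, log(N / df), where N is the number of tales and df is the number of tales a term appears in. The counts for 'say', 'thy' and 'shudder' are copied from the raw term-frequency matrix earlier in the chapter. The names `doc_counts` and `idf` are hypothetical, and libraries such as Scikit-Learn use smoothed variants of this formula, so their values will differ.

```python
import math

# Raw counts per tale for three terms, copied from the term-frequency matrix
doc_counts = {
    "say": [11, 14, 20, 51, 9],
    "thy": [14, 0, 11, 10, 2],
    "shudder": [0, 0, 0, 38, 0],
}
n_docs = 5

idf = {}
for term, counts in doc_counts.items():
    # Document frequency: the number of tales the term appears in at least once
    df = sum(1 for c in counts if c > 0)
    idf[term] = math.log(n_docs / df)

for term, value in idf.items():
    print(term, round(value, 3))
```

'say' appears in all five tales, so its weight is 0, while 'shudder' appears in only one tale and gets the highest weight of the three.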


Get raw term frequency in all documents



l1 = ['youth', 'able', 'say']

Take only those terms with the top 15 cumulative frequency
Break that down by the frequency within each story
Wait, the issue is that of course the raw and log transformed frequencies are the same, duh!
Perhaps a better idea would be to calculate the TF-IDF for raw frequency, show the problem, and then do it for normalised? It just doesn't quite make sense to add log transformed frequencies (as it is doing a linear transformation to log transformed variable!).

In [93]:
def testFunc(data, labels):
    # One row per tale, one column per term
    m = DataFrame(countvec.fit_transform(data).toarray(), 
                  columns=countvec.get_feature_names_out())

    # Keep only the terms in l1 and apply the 1 + log(count + 1) weighting
    df = m[l1].apply(lambda col: col.map(lambda x: 1 + math.log(x + 1)))

    return df

In [94]:
testFunc(english_sample['tales'], english_titles)

| | youth | able | say |
|---|---|---|---|
| 0 | 1.0 | 1.693147 | 3.484907 |
| 1 | 1.0 | 1.0 | 3.70805 |
| 2 | 1.0 | 1.693147 | 4.044522 |
| 3 | 4.178054 | 1.693147 | 4.951244 |
| 4 | 1.0 | 1.0 | 3.302585 |


In [104]:
from nltk.tokenize import word_tokenize
from nltk import FreqDist

english_tokens = [word_tokenize(text) for text in english_sample['tales']]
flat_list = [word for sent_list in english_tokens for word in sent_list]
english_freqs = FreqDist(word for word in flat_list)

for word, frequency in english_freqs.most_common(20):
    print(u'{}: {}'.format(word, frequency))

say: 105
thou: 76
go: 73
will: 57
can: 47
come: 47
little: 40
thy: 37
king: 36
take: 35
child: 34
thee: 32
shudder: 32
cry: 31
see: 31
open: 29
door: 27
man: 27
make: 26
sit: 24


In [118]:
d2 = {key: subl_tf(value) for key, value in english_freqs.items()}

In [128]:
sorted(d2.items(), key=lambda kv: kv[1], reverse=True)

[('say', 5.653960350157523),
 ('thou', 5.330733340286331),
 ('go', 5.290459441148391),
 ('will', 5.04305126783455),
 ('can', 4.850147601710058),
 ('come', 4.850147601710058),
 ('little', 4.688879454113936),
 ('thy', 4.610917912644224),
 ('king', 4.58351893845611),
 ('take', 4.555348061489413),
 ('child', 4.526360524616162),
 ('thee', 4.465735902799727),
 ('shudder', 4.465735902799727),
 ('cry', 4.433987204485146),
 ('see', 4.433987204485146),
 ('open', 4.367295829986475),
 ('door', 4.295836866004329),
 ('man', 4.295836866004329),
 ('make', 4.258096538021482),
 ('sit', 4.178053830347945),
 ('fire', 4.178053830347945),
 ('cat', 4.13549421592915),
 ('youth', 4.13549421592915),
 ('frog', 4.091042453358316),
 ('answer', 4.091042453358316),
 ('well', 4.091042453358316),
 ('father', 4.044522437723423),
 ('want', 4.044522437723423),
 ('away', 4.044522437723423),
 ('learn', 4.044522437723423),
 ('know', 3.995732273553991),
 ('bring', 3.9444389791664403),
 ('time', 3.9444389791664403),
 ('great'