# Chapter 2: Cleaning the text for analysis

One of the most basic (and most important) tasks in text mining is cleaning up your text. While this might seem a bit dull compared to some of the things we'll do later in the book, I hope to show you in this chapter that not only is this pretty straightforward with the right Python packages, it can also help you get to know your data before you get stuck into modelling.

In this chapter, the ultimate aim of cleaning is to transform text from sentences into a standardised [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) for some of our later analyses, but you'll also see that we'll pick and choose from these methods to get our text into other formats more appropriate for other analyses. To demonstrate the flexibility of these packages, I'll show you how we can process both the English- and German-language Grimm's fairytales that we scraped in the last chapter (and by extension a few other common languages) using similar methods. 

## Pulling in the data

To start, we'll load in the scraped dataset we created in the last chapter.

In [1]:
import pandas as pd
from pandas import Series, DataFrame

texts = pd.read_csv("/Users/jburchell/Documents/text-mining/01 Scraping the project text/raw_data.csv")

texts[0:5]

Unnamed: 0,english_titles,english_tales,german_titles,german_tales
0,"Brothers Grimm fairy tales - The Frog-King, or...",\n In old times when wishing still hel...,Der Froschkönig oder der eiserne Heinrich - Br...,"In den alten Zeiten, wo das Wünschen noch geho..."
1,Brothers Grimm fairy tales - Cat and Mouse in ...,\n A certain cat had made the acquaint...,Katze und Maus in Gesellschaft - Brüder Grimm,Eine Katze hatte Bekanntschaft mit einer Maus ...
2,Brothers Grimm fairy tales - Our Lady's Child,\n Hard by a great forest dwelt a wood...,Marienkind - Brüder Grimm,Vor einem großen Walde lebte ein Holzhacker mi...
3,Brothers Grimm fairy tales - The Story of the ...,"\n A certain father had two sons, the ...","Von einem, der auszog, das Fürchten zu lernen ...","Ein Vater hatte zwei Söhne, davon war der älte..."
4,Brothers Grimm fairy tales - The Wolf and The ...,\n There was once on a time an old goa...,Der Wolf und die sieben jungen Geißlein - Brüd...,"Es war einmal eine alte Geiß, die hatte sieben..."


To make it a bit easier to understand how these methods work, I'll pull the first sentences from a few of the English- and German-language tales. Once we've understood how these techniques work we can apply them to our full tale corpora.

In [2]:
import re
english_sample = [unicode(' '.join(re.split(r'(?<=[.])\s', row)[:1]), "utf-8") 
                  for row in texts['english_tales'][0:5]]

from pprint import pprint
pprint(english_sample)

[u'\n         In old times when wishing still helped one, there lived a king whose  daughters were all beautiful, but the youngest was so beautiful that the  sun itself, which has seen so much, was astonished whenever it shone in  her face.',
 u'\n         A certain cat had made the acquaintance of a mouse, and had said so   much to her about the great love and friendship she felt for her, that at  length the mouse agreed that they should live and keep house together.',
 u'\n         Hard by a great forest dwelt a wood-cutter with his wife, who had an only  child, a little girl three years old.',
 u'\n         A certain father had two sons, the elder of whom was sharp and  sensible, and could do everything, but the younger was stupid and  could neither learn nor understand anything, and when people saw him  they said, "There\'s a fellow who will give his father some trouble!"  When anything had to be done, it was always the elder who was forced  to do it; but if his father bade him fet

In [3]:
german_sample = [unicode(' '.join(re.split(r'(?<=[.])\s', row)[:1]), "utf-8") 
                 for row in texts['german_tales'][0:5]]
pprint(german_sample)

[u'In den alten Zeiten, wo das W\xfcnschen noch geholfen hat, lebte ein K\xf6nig, dessen T\xf6chter waren alle sch\xf6n; aber die j\xfcngste war so sch\xf6n, da\xdf die Sonne selber, die doch so vieles gesehen hat, sich verwunderte, sooft sie ihr ins Gesicht schien.',
 u'Eine Katze hatte Bekanntschaft mit einer Maus gemacht und ihr soviel von gro\xdfer Liebe und Freundschaft vorgesagt, die sie zu ihr tr\xfcge, da\xdf die Maus endlich einwilligte, mit ihr zusammen in einem Haus zu wohnen und gemeinschaftliche Wirtschaft zu f\xfchren.',
 u'Vor einem gro\xdfen Walde lebte ein Holzhacker mit seiner Frau, der hatte nur ein einziges Kind, das war ein M\xe4dchen von drei Jahren.',
 u'Ein Vater hatte zwei S\xf6hne, davon war der \xe4lteste klug und gescheit, und wu\xdfte sich in alles wohl zu schicken.',
 u'Es war einmal eine alte Gei\xdf, die hatte sieben junge Gei\xdflein, und hatte sie lieb, wie eine Mutter ihre Kinder lieb hat.']
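To see what the regular expression in the cells above is doing, here is a minimal sketch on two made-up strings (the example text is mine, not from the tales, and the snippet is written for Python 3). The pattern splits on whitespace that directly follows a full stop, so sentences ending in `!` or `?` are not split off, which is why the fourth English sample above runs on past its first sentence.

```python
import re

# Split on any whitespace character that is immediately preceded by a full stop.
text = "Once there was a king. He had three sons. The youngest was kind."
sentences = re.split(r'(?<=[.])\s', text)
first_sentence = sentences[0]
print(first_sentence)

# An exclamation mark followed by a closing quote is not treated as a sentence break.
tricky = re.split(r'(?<=[.])\s', 'He cried, "Help!" Nobody came. The end.')
print(tricky)
```

If you need `!` and `?` handled as well, a dedicated sentence tokeniser is probably a better choice than extending this pattern.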


## Getting rid of non-alphabetical characters

A fundamental step when cleaning up a piece of text is getting rid of any non-alphabetic characters. We will eventually need to get rid of all of them, but we need to keep the apostrophes for now so that we can expand the contractions. For now, we should strip out those leftover newline characters, escape characters (the backslashes in front of apostrophes) and quotation marks in the English text using the `lstrip` and `replace` string methods.

In [4]:
english_sample = [sentence.lstrip('\n').replace('"', '').replace("\\'", "'") 
                  for sentence in english_sample]
pprint(english_sample[0])

u'         In old times when wishing still helped one, there lived a king whose  daughters were all beautiful, but the youngest was so beautiful that the  sun itself, which has seen so much, was astonished whenever it shone in  her face.'
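As a small illustration of what each call in that chain does, here is a sketch on a made-up string shaped like the scraped text (the string itself is an assumption for demonstration, and the snippet is written for Python 3). Note that `lstrip('\n')` only removes leading newlines, so the indentation survives, just as it does in the output above.

```python
# A made-up raw string: a leading newline, indentation, double quotes
# and a backslash-escaped apostrophe.
raw = '\n      They said, "There\\\'s a fellow!"'

cleaned = (raw.lstrip('\n')          # drop leading newline characters only
              .replace('"', '')      # remove double quotation marks
              .replace("\\'", "'"))  # turn backslash-apostrophe into a plain apostrophe

print(repr(cleaned))
```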


## Expanding contractions

In English, it is pretty common for us to use contractions of words, such as isn't, you're and should've. However, these contractions cause all sorts of problems for normalisation and standardisation algorithms (which we'll speak about more later in this chapter). As such, it is best to get rid of them, and the easiest way to do so is to expand all of these contractions prior to further cleaning steps.

An easy way of doing this is to simply find the contractions and replace them with their full form. [This gist](https://gist.github.com/nealrs/96342d8231b75cf4bb82) has a nice little function, `expandContractions()`, that does just that. In the below code I am using an [updated function](https://gist.github.com/t-redactyl/aff518d750f47f0ef6c20f04ef6fb823) where I've included `text.lower()` (as suggested by a user on the original post) to make sure words at the start of a sentence are included. Let's try it on our fourth English sentence, which has a number of contractions:

In [7]:
expandContractions(english_sample[3])

u'         a certain father had two sons, the elder of whom was sharp and  sensible, and could do everything, but the younger was stupid and  could neither learn nor understand anything, and when people saw him  they said, there is a fellow who will give his father some trouble!  when anything had to be done, it was always the elder who was forced  to do it; but if his father bade him fetch anything when it was late,  or in the night-time, and the way led through the churchyard, or any  other dismal place, he answered, oh, no, father, I will not go there,  it makes me shudder! for he was afraid.'

It's worked really well! You can see that "there's" has been replaced with "there is", and "I'll" has been replaced with "I will". Let's go ahead and replace this sentence:

In [8]:
english_sample[3] = expandContractions(english_sample[3])
pprint(english_sample)

[u'         In old times when wishing still helped one, there lived a king whose  daughters were all beautiful, but the youngest was so beautiful that the  sun itself, which has seen so much, was astonished whenever it shone in  her face.',
 u'         A certain cat had made the acquaintance of a mouse, and had said so   much to her about the great love and friendship she felt for her, that at  length the mouse agreed that they should live and keep house together.',
 u'         Hard by a great forest dwelt a wood-cutter with his wife, who had an only  child, a little girl three years old.',
 u'         a certain father had two sons, the elder of whom was sharp and  sensible, and could do everything, but the younger was stupid and  could neither learn nor understand anything, and when people saw him  they said, there is a fellow who will give his father some trouble!  when anything had to be done, it was always the elder who was forced  to do it; but if his father bade him fetch anythin
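If you are curious how a find-and-replace approach like this can work, the following is a minimal sketch and not the code from either gist. The dictionary is a tiny hypothetical subset, the function name `expand_contractions_sketch` is my own, and unlike the output above it lowercases everything, including 'I'. It is written for Python 3.

```python
import re

# A tiny, hypothetical lookup table; a real one would hold many more entries.
CONTRACTIONS = {
    "there's": "there is",
    "i'll": "i will",
    "isn't": "is not",
    "you're": "you are",
    "should've": "should have",
}

# One pattern that matches any key in the table as a whole word.
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)

def expand_contractions_sketch(text):
    """Lowercase the text, then swap each known contraction for its full form."""
    return CONTRACTION_RE.sub(lambda match: CONTRACTIONS[match.group(0)], text.lower())

print(expand_contractions_sketch("There's a fellow, and I'll not go"))
```

Lowercasing first means one table entry covers both "There's" and "there's", at the cost of losing the original capitalisation.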

## Standardising your signal words

Bag-of-words analyses rely on getting the frequency of all of the 'signal' words in a piece of text, or those that are likely to characterise what the piece of text is about. For example, in the opening line to *The Frog-King* words such as 'king', 'daughter' and 'beautiful' give a pretty good idea of what the sentence is about. As you might guess, these frequencies rely on these signal words being in the exact same format. However, the same word often has different representations depending on the context. The word 'camp', for example, can be 'camped', 'camps' and 'camping', but these words all ultimately mean the same thing and should be grouped together in a bag-of-words analysis.

One way of addressing this is [stemming](https://en.wikipedia.org/wiki/Stemming). Stemming is where you strip words back to a base form that is common to related words, even if that is not the actual grammatical root of the word. For example, 'judging' would be stripped back to 'judg', although the actual correct root is 'judge'.
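To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is not the Snowball algorithm (which applies a much richer set of rules); the suffix list and the `naive_stem` name are assumptions chosen purely for illustration.

```python
# Suffixes to try, longest first. This list is a toy assumption.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix, as long as at least 3 letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ["camp", "camped", "camps", "camping", "judging", "was"]:
    print(word, "->", naive_stem(word))
```

Even this crude version groups the 'camp' family together and produces the non-word 'judg', which shows both the appeal and the limits of stemming.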

As we're interested in processing both English and German texts, we'll use the [Snowball stemmer](http://snowballstem.org/) from Python's NLTK. This stemmer has support for a [wide variety of languages](http://snowballstem.org/algorithms/), including French, Italian, Spanish, Dutch, Swedish, Russian and Finnish.

Let's import the package, and assign the English and German stemmers to different variables.

In [9]:
from nltk.stem.snowball import SnowballStemmer

sbEng = SnowballStemmer('english')
sbGer = SnowballStemmer('german')

To run the stemmers over our sentences, we need to split the sentences into a list of words and run the stemmer over each of the words. We still want to do some more processing, so we'll join them back into a sentence with the `join()` function for now, but we will eventually tokenise these when we're happy with our cleaning.

In [10]:
' '.join([sbEng.stem(item) for item in (english_sample[0]).split(' ')])

u'         in old time when wish still help one, there live a king whose  daughter were all beautiful, but the youngest was so beauti that the  sun itself, which has seen so much, was astonish whenev it shone in  her face.'

This looks alright, but not completely accurate. We can see that 'times' has been stemmed to 'time' and 'wishing' has been stemmed to 'wish', which will be useful in grouping related words, but other words, like 'was' and 'shone' have not been touched.

In [11]:
' '.join([sbGer.stem(item) for item in (german_sample[0]).split(' ')])

u'in den alt zeiten, wo das wunsch noch geholf hat, lebt ein konig, dess tocht war all schon; aber die jung war so schon, dass die sonn selber, die doch so viel geseh hat, sich verwunderte, sooft sie ihr ins gesicht schien.'

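The German text also has some problems. The stemmer has stripped the umlauts, so 'schön' (beautiful) has become 'schon', and words such as 'zeiten' and 'verwunderte' have not been touched.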

In order to address this, there is a more sophisticated approach called [lemmatisation](https://en.wikipedia.org/wiki/Stemming#Lemmatisation_algorithms). Lemmatisation takes into account whether a word in a sentence is a noun, verb, adjective, etc., which is known as tagging a word's [part-of-speech](https://en.wikipedia.org/wiki/Part_of_speech). This means the algorithm can apply more appropriate rules about how to standardise words. For example, nouns can be singularised and verbs can be reduced to their infinitive form.
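The role of the part-of-speech tag is easiest to see with a toy lookup table. This is only a sketch of the idea: the table, the tag names and the `toy_lemmatise` function are all hypothetical, and a real lemmatiser combines rules with a far larger lexicon.

```python
# A hypothetical lemma table keyed by (word, part-of-speech).
LEMMA_TABLE = {
    ("was", "VERB"): "be",
    ("shone", "VERB"): "shine",
    ("pizzas", "NOUN"): "pizza",
    ("saw", "VERB"): "see",   # "she saw the king"
    ("saw", "NOUN"): "saw",   # "a rusty saw"
}

def toy_lemmatise(word, pos):
    """Look up the lemma for a word given its part-of-speech; fall back to the word."""
    word = word.lower()
    return LEMMA_TABLE.get((word, pos), word)

print(toy_lemmatise("saw", "VERB"))
print(toy_lemmatise("saw", "NOUN"))
```

The same surface form 'saw' maps to two different lemmas depending on its tag, which is something a stemmer working on single words cannot do.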

We will use a package called [pattern](http://www.clips.ua.ac.be/pattern) which includes both English and German lemmatisation (among many other functions). `pattern`, like `Snowball`, also supports lemmatisation in a small number of other languages. Let's install `pattern`, and then import the English and German packages:

In [12]:
import pattern.en as lemEng
import pattern.de as lemGer

Using this package, we can easily tag the part-of-speech of each word, and then run the lemmatisation algorithm over it. Have a look at this example:

In [13]:
pprint(lemEng.parse('I ate many pizzas', lemmata=True).split(' '))

[u'I/PRP/B-NP/O/i',
 u'ate/VBD/B-VP/O/eat',
 u'many/JJ/B-NP/O/many',
 u'pizzas/NNS/I-NP/O/pizza']


This output is a little confusing, but you can see that there are a few bits of information associated with each word. Let's just take the word 'pizzas', for example:

In [14]:
lemEng.parse('I ate many pizzas', lemmata=True).split(' ')[3]

u'pizzas/NNS/I-NP/O/pizza'

We can see that it is tagged as 'NNS', which indicates that it is a plural noun (information on all possible tags is [here](http://www.clips.ua.ac.be/pages/mbsp-tags)). More importantly for us, because the algorithm knows that it is a plural noun it can correctly lemmatise it to 'pizza'.
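If you want to work with these slash-delimited strings directly, a small helper can split them into named fields. This is a sketch that assumes the five-field layout shown in the output above (word, part-of-speech, chunk, prepositional-phrase tag, lemma); the `TaggedToken` and `split_tagged_token` names are my own and not part of `pattern`.

```python
from collections import namedtuple

# Field names are assumptions based on the five slash-separated parts shown above.
TaggedToken = namedtuple("TaggedToken", ["word", "pos", "chunk", "pnp", "lemma"])

def split_tagged_token(token):
    """Split a 'word/POS/chunk/PNP/lemma' string into a named tuple."""
    return TaggedToken(*token.split("/"))

pizzas = split_tagged_token("pizzas/NNS/I-NP/O/pizza")
print(pizzas.pos, pizzas.lemma)
```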

Now that we know what is going on under the hood, we can jump to pulling the lemmatised words out. Let's try again with the first sentence in our English set:

In [15]:
' '.join(lemEng.Sentence(lemEng.parse(english_sample[0], lemmata=True)).lemmata)

u'in old time when wish still help one , there live a king whose daughter be all beautiful , but the youngest be so beautiful that the sun itself , which have see so much , be astonish whenever it shine in her face .'

This looks a lot better - it has changed 'was' to 'be', and 'shone' to 'shine'. Now let's try our first German sentence again.

In [16]:
' '.join(lemGer.Sentence(lemGer.parse(german_sample[0], lemmata=True)).lemmata)

u'in den alt zeiten , wo das w\xfcnschen noch geholfen haben , leben ein k\xf6nig , dessen t\xf6chter sein alle sch\xf6n ; aber die j\xfcngste sein so sch\xf6n , da\xdf die sonne selb , die doch so vieles sehen haben , sich verwunderte , sooft sie ihr ins gesicht scheinen .'

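This is *much* better. The umlauts and 'ß' have been kept rather than stripped, and verbs have been reduced to their infinitives: 'hat' has become 'haben', 'waren' has become 'sein' and 'schien' has become 'scheinen'. It is not perfect ('zeiten' and 'verwunderte' are untouched), but given that this is the best result we've seen for standardising our words, let's do this for all of our sentences before moving on to the next step.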

However, we have one final thing to do before we can apply the lemmatisation. It turns out the apostrophes we left in earlier to make sure we expanded the contractions are still hanging around in the form of possessives (e.g., "king's"). The `parse` method cannot deal with this properly, and gives us the result below:

In [17]:
lemEng.Sentence(lemEng.parse("king's", lemmata=True)).lemmata

[u'king', u"'s"]

You can see that it has incorrectly parsed the word into "king" and "'s", which means we will have "s" listed as a word in our lexicon after we strip out the punctuation. Before going any further, let's therefore get rid of those remaining apostrophes in the English sentences.

In [18]:
english_sample = [sentence.replace("'", '') for sentence in english_sample]

Now that we've done that, we can run our lemmatisers over both sets of sentences.

In [19]:
english_sample = [' '.join(lemEng.Sentence(lemEng.parse(sentence, lemmata=True)).lemmata) 
            for sentence in english_sample]
pprint(english_sample)

[u'in old time when wish still help one , there live a king whose daughter be all beautiful , but the youngest be so beautiful that the sun itself , which have see so much , be astonish whenever it shine in her face .',
 u'a certain cat have make the acquaintance of a mouse , and have say so much to her about the great love and friendship she feel for her , that at length the mouse agree that they should live and keep house together .',
 u'hard by a great forest dwell a wood-cutter with his wife , who have an only child , a little girl three year old .',
 u'a certain father have two son , the elder of whom be sharp and sensible , and can do everything , but the younger be stupid and can neither learn nor understand anything , and when person see him they say , there be a fellow who will give his father some trouble !\nwhen anything have to be do , it be always the elder who be force to do it ; but if his father bade him fetch anything when it be late , or in the night-time , and the wa

In [20]:
german_sample = [' '.join(lemGer.Sentence(lemGer.parse(sentence, lemmata=True)).lemmata) 
                 for sentence in german_sample]
pprint(german_sample)

[u'in den alt zeiten , wo das w\xfcnschen noch geholfen haben , leben ein k\xf6nig , dessen t\xf6chter sein alle sch\xf6n ; aber die j\xfcngste sein so sch\xf6n , da\xdf die sonne selb , die doch so vieles sehen haben , sich verwunderte , sooft sie ihr ins gesicht scheinen .',
 u'ein katze haben bekanntschaft mit ein maus machen und ihr soviel von gro\xdf liebe und freundschaft vorsagen , die sie zu ihr tr\xfcge , da\xdf die maus endlich einwilligte , mit ihr zusammen in ein haus zu wohnen und gemeinschaftlich wirtschaft zu f\xfchren .',
 u'vor ein gro\xdf walde leben ein holzhacker mit seiner frau , der haben nur ein einziges kind , das sein ein m\xe4dchen von drei jahren .',
 u'ein vater haben zwei sohn , davon sein der \xe4lteste klug und gescheit , und wu\xdfen sich in all wohl zu schicken .',
 u'es sein einmal ein alt gei\xdf , die haben sieben jung gei\xdflein , und haben sie lieb , wie ein mutter ihre kinder leiben haben .']


## Dealing with numbers

Many of the fairytales contain something kind of annoying - numbers. Even worse, they are written out as words. For my purposes, numbers are not very useful and should be stripped out, although, of course, you might need them left in for your analysis!

To do this, we can use this [function](https://gist.github.com/t-redactyl/4297c8e01e5b37e8a4fdb0fea2ed93dd) that I wrote, based on the [text2num](https://github.com/ghewgill/text2num) package. All this function does is strip out any words related to numbers in English, as well as numbers themselves, as part of this text cleaning process. Let's run it over our fifth tale, which contains the word 'seven':

In [21]:
remove_numbers(english_sample[4])

'there be once on a time an old goat who have little kid , and love them with all the love of a mother for her child .'

It's done the job! Let's now run this over our English-language sample:

In [22]:
english_sample = [remove_numbers(sentence) for sentence in english_sample]

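Note that this function only handles English number words. A German-language equivalent still needs to be written, so the German sample keeps its number words (such as 'sieben') for now.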
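For readers who want to see the general shape of such a function, here is a minimal sketch. It is not the `remove_numbers` function from the gist: the word list is a short hypothetical subset, the `remove_number_words` name is my own, and it is written for Python 3.

```python
import re

# A short, hypothetical set of English number words; extend it as needed.
NUMBER_WORDS = {
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "twenty", "hundred", "thousand",
}

def remove_number_words(text, number_words=NUMBER_WORDS):
    """Drop tokens that are number words or consist only of digits."""
    kept = [
        token for token in text.split()
        if token.lower() not in number_words and not re.fullmatch(r"\d+", token)
    ]
    return " ".join(kept)

print(remove_number_words("an old goat who have seven little kid"))
```

Because the word set is a parameter, the same function could in principle be given a set of German number words, although compound forms would need extra handling. Be aware that a plain word list also removes 'one' when it is used as a pronoun.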

## Normalising our text

Obviously our text still contains a lot of rubbish that needs to be cleaned up. Before tokenising the sentences, we need to get rid of all of those leftover punctuation marks that we didn't remove earlier, as well as all of that extra whitespace. We also want to get rid of non-signal words, or [stop words](https://en.wikipedia.org/wiki/Stop_words), that are likely to be common across texts, such as 'a', 'the', and 'in'. These tasks fall into a process called [normalisation](https://en.wikipedia.org/wiki/Text_normalization), and surprise, surprise, there is another multi-language package called [cucco](https://github.com/davidmogar/cucco) that can do all of the most common normalisation tasks in English, German and about 10 other languages.

Let's install and import `cucco` for both English and German:

In [23]:
from cucco import Cucco

normEng = Cucco(language='en')
normGer = Cucco(language='de')

Cucco has a function called `normalize()` which, as a default, runs all of its normalisation procedures over a piece of text. While convenient, we don't want to do this as it gets rid of accent marks, and we want to keep these in our German text (we'll talk about how to get our special characters back in the next section). Instead, we'll run three specific functions over our text: `remove_stop_words`, `replace_punctuation` and `remove_extra_whitespaces`. We can run these in order by putting them in a list and adding this as an argument to `normalize()`. Let's try it with our first lines from the English and German texts.

In [24]:
norms = ['remove_stop_words', 'replace_punctuation', 'remove_extra_whitespaces']
normEng.normalize(english_sample[0], norms)

u'old time wish still help live king whose daughter beautiful youngest beautiful sun see much astonish whenever shine face'

In [25]:
normGer.normalize(german_sample[0], norms)

u'alt zeiten w\xfcnschen geholfen leben k\xf6nig t\xf6chter sch\xf6n j\xfcngste sch\xf6n sonne selb vieles sehen verwunderte sooft gesicht scheinen'

Looks great! Let's apply this over all of our texts.

In [26]:
english_sample = [normEng.normalize(sentence, norms) for sentence in english_sample]
pprint(english_sample)

[u'old time wish still help live king whose daughter beautiful youngest beautiful sun see much astonish whenever shine face',
 u'certain cat make acquaintance mouse say much great love friendship feel length mouse agree live keep house together',
 u'hard great forest dwell wood cutter wife child little girl year old',
 u'certain father son elder sharp sensible can everything younger stupid can neither learn understand anything person see say fellow will give father trouble when anything always elder force father bade fetch anything late night time way lead churchyard dismal place answer oh father will go make shudder for afraid',
 u'time old goat little kid love love mother child']


In [27]:
german_sample = [normGer.normalize(sentence, norms) for sentence in german_sample]
pprint(german_sample)

[u'alt zeiten w\xfcnschen geholfen leben k\xf6nig t\xf6chter sch\xf6n j\xfcngste sch\xf6n sonne selb vieles sehen verwunderte sooft gesicht scheinen',
 u'katze bekanntschaft maus soviel gro\xdf liebe freundschaft vorsagen tr\xfcge maus endlich einwilligte zusammen haus wohnen gemeinschaftlich wirtschaft f\xfchren',
 u'gro\xdf walde leben holzhacker frau einziges kind m\xe4dchen drei jahren',
 u'vater zwei sohn davon \xe4lteste klug gescheit wu\xdfen all wohl schicken',
 u'alt gei\xdf sieben jung gei\xdflein lieb mutter kinder leiben']
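To show what these three steps amount to, and why their order matters, here is a standard-library sketch. It does not use or reproduce `cucco`: the stop-word list is a tiny hypothetical one and the function names are my own. It is written for Python 3, where `\w` matches accented letters, so umlauts are left alone.

```python
import re

# A tiny, hypothetical stop-word list; real lists hold a few hundred words.
STOP_WORDS = {"a", "the", "in", "there"}

def remove_stop_words(text):
    """Drop any space-separated token that is in the stop-word list."""
    return " ".join(word for word in text.split(" ") if word not in STOP_WORDS)

def replace_punctuation(text):
    """Replace every character that is not a letter, digit, underscore or whitespace with a space."""
    return re.sub(r"[^\w\s]", " ", text)

def remove_extra_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

def normalise(text, steps):
    """Apply each normalisation step to the text in the order given."""
    for step in steps:
        text = step(text)
    return text

steps = [remove_stop_words, replace_punctuation, remove_extra_whitespace]
print(normalise("in old time , there live a könig .", steps))
```

Replacing punctuation with spaces leaves gaps behind, which is why the whitespace step comes last in the list.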


## Dealing with mojibake

[Mojibake??](https://en.wikipedia.org/wiki/Mojibake) What the heck is that?? It is a very cute term for that very annoying thing that happens when your text gets changed from one form of encoding to another and your special characters and punctuation turn into that crazy character salad. (In fact, the German term for this, *Buchstabensalat*, means 'letter salad'.) As we've already noticed, the special characters (like ä and ß) in our German sentences are being displayed as escape sequences such as `\xe4` and `\xdf`.

The good news is that it is pretty easy to reclaim our special characters. However, the **bad** news is that we need to jump over to Python 3 to do so. We can use a Python 3 package called [ftfy](https://github.com/LuminosoInsight/python-ftfy), or 'fixes text for you', which is designed to deal with these encoding issues. 

We can use the `fix_encoding()` function to get rid of all of that ugly mojibake. Let's see how it goes with our first line of German text:

In [4]:
import ftfy

#print(ftfy.fix_encoding(german_sample[0]))

Nice! You can see it has fixed up all of the umlauts. Let's fix all of the text.

In [None]:
german_sample = [ftfy.fix_encoding(sentence) for sentence in german_sample]
german_sample[1]
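To get a feel for how mojibake arises in the first place, here is a standard-library sketch (Python 3) that garbles a word by decoding its UTF-8 bytes with the wrong codec and then reverses the mistake. The example word is my own; as I understand it, `ftfy` automates the detection of this kind of error, which you would otherwise have to diagnose by hand.

```python
original = "König"

# Encode as UTF-8, then wrongly decode the bytes as Latin-1: classic mojibake.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)

# Reversing the two steps recovers the original text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

The manual fix only works when you know which two encodings were mixed up, which is rarely the case with scraped text.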

## Cleaning up the full DataFrame

Now that we've covered how to clean up a sample of the text, let's apply the whole set of cleaning steps to the full text. To start, we'll remove those newline characters, escape characters and quotation marks from the English-language tales.

In [28]:
texts['english_tales'] = [sentence.lstrip('\n').replace('"', '').replace("\\'", "'") 
                          for sentence in texts['english_tales']]

Next, we'll expand the contractions.

In [29]:
texts['english_tales'] = texts['english_tales'].apply(expandContractions)

Before lemmatising the English-language text, we need to strip out the apostrophes to make sure we don't end up with 's' in our lexicon.

In [30]:
texts['english_tales'] = [sentence.replace("'", "") for sentence in texts['english_tales']]

Now we can standardise all the words in both our English- and German-language tales using lemmatisation. Be warned: this can be a little slow when you have a lot of text to process.

In [31]:
texts['english_tales'] = [' '.join(lemEng.Sentence(lemEng.parse(sentence, lemmata=True)).lemmata) 
                          for sentence in texts['english_tales']]
texts['german_tales'] = [' '.join(lemGer.Sentence(lemGer.parse(sentence, lemmata=True)).lemmata) 
                         for sentence in texts['german_tales']]


We now remove all that leftover punctuation and whitespace and the language-specific stop words for both the English and German texts. I've run the `remove_extra_whitespaces` function twice, as there was still leftover whitespace after cleaning up the punctuation and stop words.

In [32]:
norms = ['remove_extra_whitespaces', 'replace_punctuation', 'remove_stop_words', 
         'remove_extra_whitespaces']
texts['english_tales'] = texts['english_tales'].apply(normEng.normalize, args = (norms,))
texts['german_tales'] = texts['german_tales'].apply(normGer.normalize, args = (norms,))

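Following that, we will remove all of the numbers from the English text. The German equivalent of this function still needs to be written, so the German tales are left as they are for now.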

In [33]:
texts['english_tales'] = texts['english_tales'].apply(remove_numbers)
#texts['german_tales'] = texts['german_tales'].apply(remove_numbers) -- need to write this function

In [34]:
# Export data for transition to Python 3
#texts.to_csv("/Users/jburchell/Documents/text-mining/02 Text cleaning/cleansed_data.csv",
#            encoding = 'utf-8')

In [1]:
# Import data for Python 3
#import pandas as pd
#texts = pd.read_csv("/Users/jodieburchell/Documents/text-mining/02 Text cleaning/cleansed_data.csv",
#                   usecols=[1, 2, 3, 4])

Finally, let's get rid of the mojibake in the German-language tales.

In [6]:
texts['german_tales'] = texts['german_tales'].apply(ftfy.fix_encoding)

0    alt zeiten wünschen geholfen leben könig töcht...
1    katze bekanntschaft maus soviel groß liebe fre...
2    groß walde leben holzhacker frau einziges kind...
3    vater zwei sohn davon älteste klug gescheit wu...
4    alt geiß sieben jung geißlein lieb mutter kind...
Name: german_tales, dtype: object

## Tokenising the text and getting the frequencies

We have finally cleaned this text to a point where we can tokenise it and get the frequencies of all of the words. This is very straightforward in NLTK - we simply use the `word_tokenize` function from the [tokenize package](http://www.nltk.org/api/nltk.tokenize.html). We'll import it below and run it over our English and German tales separately.

In [7]:
from nltk.tokenize import word_tokenize

In [8]:
english_tokens = [word_tokenize(text) for text in texts['english_tales']]
print(english_tokens[10][0:40])

['little', 'brother', 'take', 'little', 'sister', 'hand', 'say', 'since', 'mother', 'die', 'happiness', 'stepmother', 'beat', 'us', 'every', 'day', 'come', 'near', 'kick', 'us', 'away', 'foot', 'meal', 'hard', 'crust', 'bread', 'left', 'little', 'dog', 'table', 'better', 'often', 'throw', 'nice', 'bit', 'may', 'heaven', 'pity', 'us', 'mother']


In [9]:
german_tokens = [word_tokenize(text) for text in texts['german_tales']]
print(german_tokens[10][0:40])

['brüderchen', 'nehmen', 'schwesterchen', 'hand', 'sprechen', 'seit', 'mutter', 'tot', 'gut', 'stunde', 'mehr', 'stiefmutter', 'schlagen', 'all', 'tage', 'kommen', 'stößt', 'füßen', 'fort', 'hart', 'brotkrusten', 'übrig', 'bleiben', 'unsere', 'speise', 'hündlein', 'tisch', 'gehts', 'besser', 'werfen', 'manchmal', 'gut', 'bissen', 'gott', 'erbarm', 'unsere', 'mutter', 'wüßte', 'Komm', 'miteinander']


We're now going to do a very simple frequency count of all of the words in each language's texts, using the `FreqDist` class from `nltk`. Let's import it:

In [10]:
from nltk import FreqDist

Before we can use the tokenised list of words, we need to flatten it from a list of lists (one per tale) into a single list of words. We can then pass it to `FreqDist` and get the top 20 results for each language.

In [11]:
flat_list = [word for sent_list in english_tokens for word in sent_list]
english_freqs = FreqDist(word for word in flat_list)

for word, frequency in english_freqs.most_common(20):
    print(u'{}: {}'.format(word, frequency))

say: 3026
go: 2283
come: 1682
will: 1529
thou: 1504
king: 1199
take: 1139
see: 1039
can: 1004
little: 996
give: 849
man: 762
get: 715
away: 701
thee: 653
now: 587
time: 568
old: 556
make: 533
look: 521


In [12]:
flat_list = [word for sent_list in german_tokens for word in sent_list]
german_freqs = FreqDist(word for word in flat_list)

for word, frequency in german_freqs.most_common(20):
    print(u'{}: {}'.format(word, frequency))

sprechen: 1665
kommen: 1433
sagen: 1387
gehen: 1340
all: 986
könig: 821
sehen: 796
sollen: 746
mußen: 597
ganz: 578
geben: 566
groß: 564
frau: 556
antwortete: 528
rufen: 518
stehen: 515
nehmen: 503
gut: 490
mann: 475
ward: 439


In [32]:
#from pandas import DataFrame
#DataFrame(english_freqs.most_common(6288), columns = ['term', 'frequency']).to_csv("/Users/jodieburchell/Documents/text-mining/02 Text cleaning/english_term_frequencies.csv")
#DataFrame(german_freqs.most_common(13997), columns = ['term', 'frequency']).to_csv("/Users/jodieburchell/Documents/text-mining/02 Text cleaning/german_term_frequencies.csv")

We can see that the most common words in English are verbs like 'say', 'go' and 'come', which makes sense as fairytales are generally simple stories that focus on the characters going somewhere and interacting with others. We can also see common ways of describing characters, such as 'little' and 'old'. 'Time' also pops up a lot, probably because many of the tales open with the much-clichéd 'Once upon a time'.

The German tales are similar, leading with 'sprechen' (speak), 'kommen' (come) and 'sagen' (say), descriptors like 'gut' (good) and 'groß' (big) and characters like 'könig' (king), 'frau' (woman) and 'mann' (man).

Now that we have a nice clean bag-of-words for each language, we can start to do some more interesting things with our texts. We'll get started on these in the next chapter.