In [50]:
# English texts
engTexts = [u'truth universally acknowledge single man possession good fortune must want wife',
            u'angry man drag father along ground orchard stop cry groan old man last stop I drag father beyond tree',
            u'old man fish alone skiff gulf stream go day now without take fish',
            u'know without read book name adventure tom sawyer matter']

In [51]:
# Spanish texts
espTexts = [u'lugar mancha cuyo nombre querer acordarme hacer tiempo vivir hidalgo lanzar astillero adarga antiguo roc\xedn flaco galgo corredor',
            u'a\xf1o despu\xe9s frente pelot\xf3n fusilamiento coronel aureliano buend\xeda haber recordar aquella tarde remoto padre llevar conocer hielo',
            u'bastar\xe1 decir ser juan pablo castel pintor matar mar\xeda iribarne suponer proceso recuerdo necesitar mayor explicaci\xf3n persona',
            u'se\xf1or ser malo aunque faltar\xedan motivo serlo']

## Dealing with mojibake

[Mojibake](https://en.wikipedia.org/wiki/Mojibake)? What the heck is that? It is a very cute term for that very annoying thing that happens when text written in one encoding gets read back in another, and your special characters and punctuation turn into a crazy character salad. (In fact, the German term for this, *Buchstabensalat*, means 'letter salad'.) As we've already noticed, this has happened with all of the special characters (like á and ñ) in our Spanish sentences.
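To see where the character salad comes from, here is a minimal sketch using only the standard library. It assumes the common case where UTF-8 bytes are wrongly read as Latin-1; the sample word and the variable names are just for illustration.

```python
# Illustrative sample word; any string with accented characters would do.
original = 'año'

# Encode correctly as UTF-8, then decode with the wrong codec (Latin-1).
garbled = original.encode('utf-8').decode('latin-1')
print(garbled)   # aÃ±o

# Undoing the same wrong step recovers the original text.
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired)  # año
```

Real-world mojibake is often messier than this (several layers of damage, or mixed encodings), so a hand-rolled round trip like this one will not always be enough.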

The good news is that it is pretty easy to reclaim our special characters. However, the **bad** news is that we need to jump over to Python 3 to do so. We can use a Python 3 package called [ftfy](https://github.com/LuminosoInsight/python-ftfy), or 'fixes text for you', which is designed to deal with these encoding issues. Let's go ahead and install it:

In [None]:
!pip3 install ftfy

We can use the `fix_encoding()` function to get rid of all of that ugly mojibake. Let's see how it goes with our first line of Spanish text:

In [52]:
import ftfy

print(ftfy.fix_encoding(espTexts[0]))

lugar mancha cuyo nombre querer acordarme hacer tiempo vivir hidalgo lanzar astillero adarga antiguo rocín flaco galgo corredor


Nice! It has worked beautifully, with the 'í' put back into 'rocín'. Now we can fix up all of our text in preparation for the last step.

In [54]:
espTexts = [ftfy.fix_encoding(sentence) for sentence in espTexts]
espTexts[1]

'año después frente pelotón fusilamiento coronel aureliano buendía haber recordar aquella tarde remoto padre llevar conocer hielo'

## Tokenising the text and getting the frequencies

We have finally cleaned this text to a point where we can tokenise it and get the frequencies of all of the words. This is very straightforward in NLTK: we simply use the `word_tokenize` function from the [tokenize package](http://www.nltk.org/api/nltk.tokenize.html). We'll import it below and run it over our lists of English and Spanish text separately.

In [76]:
from nltk.tokenize import word_tokenize

In [81]:
engTokens = [word_tokenize(text) for text in engTexts]
print(engTokens)

[['truth', 'universally', 'acknowledge', 'single', 'man', 'possession', 'good', 'fortune', 'must', 'want', 'wife'], ['angry', 'man', 'drag', 'father', 'along', 'ground', 'orchard', 'stop', 'cry', 'groan', 'old', 'man', 'last', 'stop', 'I', 'drag', 'father', 'beyond', 'tree'], ['old', 'man', 'fish', 'alone', 'skiff', 'gulf', 'stream', 'go', 'day', 'now', 'without', 'take', 'fish'], ['know', 'without', 'read', 'book', 'name', 'adventure', 'tom', 'sawyer', 'matter']]


In [83]:
espTokens = [word_tokenize(text) for text in espTexts]
espTokens

[['lugar', 'mancha', 'cuyo', 'nombre', 'querer', 'acordarme', 'hacer', 'tiempo', 'vivir', 'hidalgo', 'lanzar', 'astillero', 'adarga', 'antiguo', 'rocín', 'flaco', 'galgo', 'corredor'], ['año', 'después', 'frente', 'pelotón', 'fusilamiento', 'coronel', 'aureliano', 'buendía', 'haber', 'recordar', 'aquella', 'tarde', 'remoto', 'padre', 'llevar', 'conocer', 'hielo'], ['bastará', 'decir', 'ser', 'juan', 'pablo', 'castel', 'pintor', 'matar', 'maría', 'iribarne', 'suponer', 'proceso', 'recuerdo', 'necesitar', 'mayor', 'explicación', 'persona'], ['señor', 'ser', 'malo', 'aunque', 'faltarían', 'motivo', 'serlo']]
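Because our sentences have already been stripped of punctuation, a plain whitespace split produces the same tokens as the output above. The sketch below is only a sanity check under that assumption, not a replacement for `word_tokenize`; the variable names are illustrative.

```python
# Cleaned sample line, copied from engTexts above.
text = u'old man fish alone skiff gulf stream go day now without take fish'

# With punctuation already removed, splitting on whitespace is enough here.
tokens = text.split()
print(tokens)
```

On raw text with punctuation or contractions the two approaches would diverge, which is why `word_tokenize` is the safer default.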

We're now going to do a very simple frequency count of all of the words in each language's texts, using the `FreqDist` class from `nltk`. Let's import it:

In [85]:
from nltk import FreqDist

Before we can use the tokenised list of words, we need to flatten it from a list of sentences into a single list of words. We can then pass it to `FreqDist` and get the top 10 results for each language.

In [97]:
flatList = [word for sentList in engTokens for word in sentList]
engFreq = FreqDist(flatList)

for word, frequency in engFreq.most_common(10):
    print(u'{}: {}'.format(word, frequency))

man: 4
drag: 2
father: 2
stop: 2
old: 2
fish: 2
without: 2
truth: 1
universally: 1
acknowledge: 1


In [98]:
flatList = [word for sentList in espTokens for word in sentList]
espFreq = FreqDist(flatList)

for word, frequency in espFreq.most_common(10):
    print(u'{}: {}'.format(word, frequency))

ser: 2
lugar: 1
mancha: 1
cuyo: 1
nombre: 1
querer: 1
acordarme: 1
hacer: 1
tiempo: 1
vivir: 1
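In NLTK 3, `FreqDist` is a subclass of `collections.Counter`, so the flatten-and-count step can also be sketched with the standard library alone. The example below is a rough equivalent, not the code used above: it assumes a shortened copy of the English tokens, and `sampleTokens` and `sampleFreq` are illustrative names.

```python
from collections import Counter
from itertools import chain

# Shortened copy of the last two English token lists shown earlier.
sampleTokens = [
    ['old', 'man', 'fish', 'alone', 'skiff', 'gulf', 'stream',
     'go', 'day', 'now', 'without', 'take', 'fish'],
    ['know', 'without', 'read', 'book', 'name', 'adventure',
     'tom', 'sawyer', 'matter'],
]

# chain.from_iterable flattens the list of sentences into one stream of words.
sampleFreq = Counter(chain.from_iterable(sampleTokens))

# most_common works the same way as it does on a FreqDist.
for word, frequency in sampleFreq.most_common(2):
    print(u'{}: {}'.format(word, frequency))
```

`FreqDist` adds extras on top of `Counter`, such as plotting and relative frequencies, so it remains the more convenient choice once you are already working in NLTK.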


And that's it! This is obviously not the most useful metric (as we only have 4 sentences in each corpus), but you can see that we've arrived at something that, with more data, would form a solid foundation for a bag-of-words analysis. You can also see that while it is a bit of work to strip a text down to a usable form, there are plenty of Python packages to make this work pretty painless, even if you're working across a number of languages.