# A rough spellchecker for Brazilian Portuguese

This notebook is an attempt to create a very simple Brazilian Portuguese spellchecker. It's inspired by [Peter Norvig's](https://norvig.com/) [famous essay](https://norvig.com/spell-correct.html) on the subject and follows the same logic as [this](https://nbviewer.org/url/norvig.com/ipython/How%20to%20Do%20Things%20with%20Words.ipynb) notebook of his.

In [11]:
import os
import re
from collections import Counter

I start with a big .txt file, a collection of many books.

In [15]:
with open("big.txt", encoding="utf-8") as f:  # encoding assumed to be UTF-8
    big = f.read()

In [16]:
len(big)

10979444

It's made up of almost 11 million characters. How many words are there?

In [52]:
def tokens(text):
    "Return a list of all word tokens (runs of letters, accented ones included, or underscores)."
    return re.findall(r'[a-z_À-ÿ]+', text.lower())

In [53]:
tokens('testing: to see, if this 1 2 works')

['testing', 'to', 'see', 'if', 'this', 'works']
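The test above uses English words only. As a quick check that the `À-ÿ` range also keeps accented Portuguese letters inside a token, here is a minimal sketch that repeats the tokenizer so it can run on its own; the sample sentence is made up for illustration.

```python
import re

def tokens(text):
    "Same tokenizer as above, repeated so this cell is self-contained."
    return re.findall(r'[a-z_À-ÿ]+', text.lower())

# Hypothetical sample with accents, uppercase letters and punctuation
sample = "Capítulo Primeiro: DO TÍTULO. Não é?"
print(tokens(sample))
# ['capítulo', 'primeiro', 'do', 'título', 'não', 'é']
```

Lowercasing happens before matching, so `TÍTULO` and `título` end up as the same token.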

In [56]:
WORDS = tokens(big)
len(WORDS)

1728255

About 1.73 million words, the first 10 being:

In [59]:
WORDS[:10]

['dom',
 'casmurro',
 'machado',
 'de',
 'assis',
 'capítulo',
 'primeiro',
 'do',
 'título',
 'uma']

Now we build a ```Counter``` from the ```WORDS``` list, which maps each word to the number of times it occurs.

In [62]:
COUNTS = Counter(WORDS)
COUNTS.most_common(10)

[('de', 92446),
 ('a', 59853),
 ('o', 47207),
 ('que', 42196),
 ('e', 40125),
 ('do', 37411),
 ('em', 28249),
 ('da', 27050),
 ('se', 21200),
 ('é', 17898)]
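To make the counting step easier to follow, here is a small sketch of the same pipeline on a toy sentence instead of `big.txt`. The sentence and the names `toy` and `toy_counts` are illustrative only and are not part of the original notebook.

```python
import re
from collections import Counter

def tokens(text):
    "Same tokenizer as above, repeated so this cell is self-contained."
    return re.findall(r'[a-z_À-ÿ]+', text.lower())

# Toy corpus (a Portuguese tongue twister, lightly extended)
toy = "o rato roeu a roupa do rei de roma e o rei riu"
toy_counts = Counter(tokens(toy))

print(toy_counts['o'], toy_counts['rei'], toy_counts['rato'])  # 2 2 1
print(len(toy_counts))  # 11 distinct words out of 13 tokens
```

Just as in the full corpus, short function words such as `o` are among the most frequent entries.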

In [70]:
for w in tokens('haja palavra rara nesse textão aí'):
    print(COUNTS[w], w)

44 haja
208 palavra
7 rara
874 nesse
0 textão
92 aí
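Note that `textão` gets a count of 0 rather than raising a `KeyError`. That is standard `Counter` behaviour: looking up a missing key returns 0 and does not add the key. A minimal sketch with a made-up word list:

```python
from collections import Counter

# Tiny hypothetical word list standing in for WORDS
counts = Counter(['de', 'a', 'de', 'que', 'de'])

print(counts['de'])        # 3
print(counts['textão'])    # 0, no KeyError for an unseen word
print('textão' in counts)  # False, the lookup did not insert the key
```

This is convenient for a spellchecker, since an unseen word can simply be treated as having frequency zero.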
