unigrams pt-br

Unigrams generated from 16 corpus files provided by NILC (Núcleo Interinstitucional de Linguística Computacional).

Together, these files comprise more than 681,639,644 tokens, drawn from the following sources:

  • Wikipedia (pt-br) - 2016
  • Google News
  • SubIMDB-PT
  • G1
  • PLN-Br
  • Literary works in the public domain
  • Lacio-Web
  • Portuguese e-books
  • Mundo Estranho
  • CHC
  • Fapesp
  • Textbooks
  • Folhinha
  • NILC subcorpus
  • Para seu filho ler
  • SARESP

The original corpus files are available at:

http://nilc.icmc.usp.br/nilc/index.php/repositorio-de-word-embeddings-do-nilc

This file was created to provide unigrams for use with the word segmentation algorithm:

https://github.com/grantjenks/python-wordsegment
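
To use these counts with python-wordsegment, a Segmenter instance can be repointed at a custom unigram table. The sketch below is a minimal illustration: the filename unigrams_ptbr.txt and its tab-separated word/count layout are assumptions for the example, not a documented interface of this repository.

```python
from wordsegment import Segmenter

seg = Segmenter()
seg.load()  # loads the library's bundled English data

# Replace the English unigram counts with the Portuguese ones.
# "unigrams_ptbr.txt" (one "word<TAB>count" pair per line) is a
# hypothetical filename used for illustration.
seg.unigrams.clear()
with open('unigrams_ptbr.txt', encoding='utf-8') as fptr:
    for line in fptr:
        word, count = line.split('\t')
        seg.unigrams[word] = float(count)

seg.bigrams.clear()                     # no Portuguese bigram counts here
seg.total = sum(seg.unigrams.values())  # rescale scores to the new corpus

print(seg.segment('ondefoiparar'))      # e.g. ['onde', 'foi', 'parar']
```

One caveat: wordsegment's cleaning step keeps only the characters a-z and 0-9, so accented Portuguese letters are dropped from the input before lookup; for accented words to match, the unigram file would need to be normalized the same way.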

The scripts used to create this file are npl_word_segment.py and group_files.py.
