unigrams pt-br

Unigrams generated from 16 corpus files provided by NILC (Núcleo Interinstitucional de Linguística Computacional).

Together, these files comprise more than 681,639,644 tokens, drawn from the following sources:

  • Wikipedia (pt-br) - 2016
  • Google News
  • SubIMDB-PT
  • G1
  • PLN-Br
  • Literary works in the public domain
  • Lacio-Web
  • Portuguese e-books
  • Mundo Estranho
  • CHC
  • Fapesp
  • Textbooks
  • Folhinha
  • NILC subcorpus
  • Para seu filho ler
  • SARESP

The original corpus files are available at:

http://nilc.icmc.usp.br/nilc/index.php/repositorio-de-word-embeddings-do-nilc

This file was created to provide unigrams for use with the word segmentation algorithm:

https://github.com/grantjenks/python-wordsegment
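
To use these counts with python-wordsegment, a Segmenter instance can be repointed at a custom unigram table. The sketch below is a minimal illustration: the filename unigrams_ptbr.txt and its tab-separated word/count layout are assumptions for the example, not a documented interface of this repository.

```python
from wordsegment import Segmenter

seg = Segmenter()
seg.load()  # loads the library's bundled English data

# Replace the English unigram counts with the Portuguese ones.
# "unigrams_ptbr.txt" (one "word<TAB>count" pair per line) is a
# hypothetical filename used for illustration.
seg.unigrams.clear()
with open('unigrams_ptbr.txt', encoding='utf-8') as fptr:
    for line in fptr:
        word, count = line.split('\t')
        seg.unigrams[word] = float(count)

seg.bigrams.clear()                     # no Portuguese bigram counts here
seg.total = sum(seg.unigrams.values())  # rescale scores to the new corpus

print(seg.segment('ondefoiparar'))      # e.g. ['onde', 'foi', 'parar']
```

One caveat: wordsegment's cleaning step keeps only the characters a-z and 0-9, so accented Portuguese letters are dropped from the input before lookup; for accented words to match, the unigram file would need to be normalized the same way.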

The scripts used to create this file are npl_word_segment.py and group_files.py.
