Skip to content

sagorbrur/bengali_sentencepiece

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 

Repository files navigation

Bengali Sentencepiece

What is SentencePiece?

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.]) and unigram language model [Kudo.]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.

What we did?

We trained sentencepiece model with bengali wiki data and saved our bengali sentencepiece model.

NB: See bnlp more update version about bengali sentencepiece

Steps

  • sentencepiece installation

pip install sentencepiece

  • preprocess wiki data

We download bengali wiki dump data and extract using bengali_wikiextractor.

We preprocess bengali wiki data into a text file with single sentence per line.

  • data format
sentence 1
sentence 2
..........
sentence n
  • Sentencepiece Training
import sentencepiece as spm

spm.SentencePieceTrainer.train('--model_prefix=bn_spm --input=data/bn_wiki.txt --vocab_size=50000')

it will save bn_spm.model and bn_spm.vocab in your train directory.

  • Testing bengali sentencepiece model
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("bn_spm.model")

# output: True
sp.EncodeAsPieces("আমি বাংলায় গান গাই।")

# output: ['▁আমি', '▁বাংলায়', '▁গান', '▁গাই', '।']
sp.EncodeAsIds("আমি বাংলায় গান গাই।")
# output: [914, 1852, 349, 6229, 3]
sp.DecodePieces(['▁আমি', '▁বাংলায়', '▁গান', '▁গাই', '।'])
# output: 'আমি বাংলায় গান গাই।'
sp.NBestEncodeAsPieces("আমি বাংলায় গান গাই।", 5)
"""
output:
[['▁আমি', '▁বাংলায়', '▁গান', '▁গাই', '।'],
 ['▁আমি', '▁বাংলা', 'য়', '▁গান', '▁গাই', '।'],
 ['▁আমি', '▁বাংলায়', '▁গান', '▁গা', 'ই', '।'],
 ['▁আমি', '▁বাংলায়', '▁গান', '▁', 'গাই', '।'],
 ['▁', 'আমি', '▁বাংলায়', '▁গান', '▁গাই', '।']]

"""
sp.DecodeIds([914, 1852, 349, 6229, 3])

# output: 'আমি বাংলায় গান গাই।'
sp.GetPieceSize()
# output: 50000 as our vocab size is 50000
# same as len(sp)

Releases

No releases published

Packages

No packages published