Bengali Sentencepiece

What is SentencePiece?

SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for neural network-based text generation systems where the vocabulary size is fixed before the neural model is trained. SentencePiece implements subword units (e.g., byte-pair encoding (BPE) [Sennrich et al.] and the unigram language model [Kudo]) with the extension of direct training from raw sentences. SentencePiece allows us to build a purely end-to-end system that does not depend on language-specific pre- or postprocessing.

What did we do?

We trained a SentencePiece model on Bengali Wikipedia data and saved the resulting Bengali SentencePiece model.

NB: See BNLP for a more up-to-date version of the Bengali SentencePiece model.

Steps

SentencePiece installation

pip install sentencepiece

Preprocess wiki data

We downloaded the Bengali Wikipedia dump and extracted it using bengali_wikiextractor.

We then preprocessed the extracted data into a text file with a single sentence per line.

Data format

sentence 1
sentence 2
..........
sentence n
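
The one-sentence-per-line format above could be produced with something like the following. This is a minimal sketch, not the preprocessing script actually used for this model: the function names are hypothetical, the splitting rule (break after the Bengali danda "।", "?" or "!" when followed by whitespace) is an assumption, and skipping lines that start with "<doc" or "</doc" assumes the extractor wraps each article in such tags.

```python
import re

# Assumed rule: a sentence ends at a danda (।), "?" or "!" followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[।?!])\s+")


def split_sentences(paragraph):
    """Return the non-empty sentences found in one paragraph of text."""
    parts = _SENTENCE_END.split(paragraph.strip())
    return [part.strip() for part in parts if part.strip()]


def iter_sentences(lines):
    """Yield sentences from extracted lines, skipping blanks and <doc> tags."""
    for line in lines:
        line = line.strip()
        # Assumption: article boundaries are marked by <doc ...> and </doc> lines.
        if not line or line.startswith(("<doc", "</doc")):
            continue
        yield from split_sentences(line)


def write_corpus(lines, out_path):
    """Write one sentence per line, UTF-8 encoded, to out_path."""
    with open(out_path, "w", encoding="utf-8") as out:
        for sentence in iter_sentences(lines):
            out.write(sentence + "\n")
```

A real corpus will contain abbreviations, numbers and quoted text that this simple rule splits incorrectly, so treat it as a starting point only.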

SentencePiece training

import sentencepiece as spm

spm.SentencePieceTrainer.train('--model_prefix=bn_spm --input=data/bn_wiki.txt --vocab_size=50000')

This saves bn_spm.model and bn_spm.vocab in the current working directory, because model_prefix is given without a directory component.
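
The same training call can also be written with keyword arguments, which are easier to read and extend than a single flag string. The sketch below is illustrative only: the wrapper name train_bn_spm is hypothetical, the existence check on the input file is an addition, and model_type="unigram" simply spells out the library default, since the original command does not set a model type.

```python
import os


def train_bn_spm(input_path="data/bn_wiki.txt", model_prefix="bn_spm",
                 vocab_size=50000):
    """Train a SentencePiece model and return the two output file names."""
    # Fail early with a clear message if the corpus is missing.
    if not os.path.isfile(input_path):
        raise FileNotFoundError(f"training corpus not found: {input_path}")

    # Imported here so the path check above runs before the library is needed.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input=input_path,           # one sentence per line
        model_prefix=model_prefix,  # output files: <prefix>.model, <prefix>.vocab
        vocab_size=vocab_size,
        model_type="unigram",       # library default, stated explicitly
    )
    return model_prefix + ".model", model_prefix + ".vocab"
```

Calling train_bn_spm() with the defaults is meant to reproduce the command shown above.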

Testing the Bengali SentencePiece model

import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("bn_spm.model")

# output: True

sp.EncodeAsPieces("আমি বাংলায় গান গাই।")

# output: ['▁আমি', '▁বাংলায়', '▁গান', '▁গাই', '।']

sp.EncodeAsIds("আমি বাংলায় গান গাই।")
# output: [914, 1852, 349, 6229, 3]

sp.DecodePieces(['▁আমি', '▁বাংলায়', '▁গান', '▁গাই', '।'])
# output: 'আমি বাংলায় গান গাই।'

sp.NBestEncodeAsPieces("আমি বাংলায় গান গাই।", 5)
"""
output:
[['▁আমি', '▁বাংলায়', '▁গান', '▁গাই', '।'],
 ['▁আমি', '▁বাংলা', 'য়', '▁গান', '▁গাই', '।'],
 ['▁আমি', '▁বাংলায়', '▁গান', '▁গা', 'ই', '।'],
 ['▁আমি', '▁বাংলায়', '▁গান', '▁', 'গাই', '।'],
 ['▁', 'আমি', '▁বাংলায়', '▁গান', '▁গাই', '।']]

"""

sp.DecodeIds([914, 1852, 349, 6229, 3])

# output: 'আমি বাংলায় গান গাই।'

sp.GetPieceSize()
# output: 50000 as our vocab size is 50000
# same as len(sp)
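
In the pieces shown above, the "▁" character (U+2581) marks where a space stood in the original text, which is what makes decoding reversible without language-specific rules. The following sketch shows that idea in plain Python, without loading the model. The helper name pieces_to_text is hypothetical, and stripping the leading space is an assumption made to match the DecodePieces output above.

```python
WORD_BOUNDARY = "\u2581"  # the "▁" marker SentencePiece uses for whitespace


def pieces_to_text(pieces):
    """Join subword pieces and turn boundary markers back into spaces."""
    text = "".join(pieces).replace(WORD_BOUNDARY, " ")
    # The first piece starts with a marker, so drop the resulting leading space.
    return text.lstrip(" ")


pieces = ['▁আমি', '▁বাংলায়', '▁গান', '▁গাই', '।']
print(pieces_to_text(pieces))
```

This is only meant to explain the marker. For real use, sp.DecodePieces is the better choice, since it also handles control and unknown pieces.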

sagorbrur/bengali_sentencepiece