<a href="https://colab.research.google.com/github/sagorbrur/bnlp/blob/master/notebook/bnlp_colab_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BNLP

BNLP is a natural language processing toolkit for the Bengali language. It helps you tokenize Bengali text, embed Bengali words, tag parts of speech, and build neural models for Bengali NLP.

This notebook shows how to train different models using **BNLP**.

## Installation

In [1]:
!pip install bnlp_toolkit

Collecting bnlp_toolkit
  Downloading https://files.pythonhosted.org/packages/de/c9/376837d2bf998a511af113c82feeee703ff95f41eb3ba79ac43036f0edfd/bnlp_toolkit-2.3-py3-none-any.whl
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting sklearn-crfsuite
  Downloading https://files.pythonhosted.org/packages/25/74/5b7befa513482e6dee1f3dd68171a6c9dfc14c0eaa00f885ffeba54fe9b0/sklearn_crfsuite-0.3.6-py2.py3-none-any.whl
Collecting fasttext
  Downloading https://files.pythonhosted.org/packages/f8/85/e2b368ab6d3528827b147fdb814f8189acc981a4bc2f99ab894650e05c40/fasttext-0.9.2.tar.gz (68kB)
Collecting python-crfsuite>=0.8.3
  Downloading https://files.pythonhosted.org/packages/95/99/869dde6dbf3e0d07a013c8e

## Downloading Processed Bengali Wikipedia Data

In [None]:
# Download the processed Bengali Wikipedia data from Google Drive (runs in Colab only)
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate the Colab user and build a Drive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Fetch the archive by its Drive file ID
downloaded = drive.CreateFile({'id': "1rQUQLsXg0TZnlrAgmNMkCXGDnYbjlLmM"})
downloaded.GetContentFile('bn_wiki_data.txt.zip')

# Extract the text file, then delete the archive
!unzip bn_wiki_data.txt.zip
!rm -rf bn_wiki_data.txt.zip

Archive:  bn_wiki_data.txt.zip
  inflating: bn_wiki_data.txt        


## Training

This section trains Bengali SentencePiece, Word2Vec, and fastText models on the Bengali Wikipedia data.

Training time depends on the size of the data.
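
Because training time grows with the corpus, it can help to check how large the file is before starting. The sketch below uses only the standard library. `corpus_stats` is a hypothetical helper (not part of BNLP), and counting whitespace-separated tokens is only a rough proxy for what each trainer actually sees.

```python
import os


def corpus_stats(path, encoding="utf-8"):
    """Return (size_in_bytes, line_count, token_count) for a text corpus.

    Hypothetical helper, not part of BNLP. Tokens are counted by splitting
    on whitespace, which is only a rough estimate.
    """
    lines = 0
    tokens = 0
    # Stream the file line by line so a large corpus need not fit in memory.
    with open(path, encoding=encoding) as handle:
        for line in handle:
            lines += 1
            tokens += len(line.split())
    return os.path.getsize(path), lines, tokens


if __name__ == "__main__":
    corpus = "bn_wiki_data.txt"  # file extracted in the download step
    if os.path.exists(corpus):
        size, lines, tokens = corpus_stats(corpus)
        print(f"{size / 1e6:.1f} MB, {lines} lines, {tokens} whitespace tokens")
```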

### Training Bengali SentencePiece Model

When the code below runs successfully, it produces two files:

* `wiki_sp.model`
* `wiki_sp.vocab`

In [None]:
from bnlp.sentencepiece_tokenizer import SP_Tokenizer

bsp = SP_Tokenizer()
data = "bn_wiki_data.txt"
model_prefix = "wiki_sp"
vocab_size = 30000
bsp.train_bsp(data, model_prefix, vocab_size)

punkt not found. downloading...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
wiki_sp.model and wiki_sp.vocab is saved on your current directory
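
A quick way to confirm that training wrote both files is to look for them on disk. This is a small sketch using only the standard library. `missing_files` is a hypothetical helper, and the file names are taken from the training output above.

```python
from pathlib import Path


def missing_files(paths, directory="."):
    """Return the subset of `paths` that does not exist under `directory`."""
    base = Path(directory)
    return [name for name in paths if not (base / name).is_file()]


if __name__ == "__main__":
    expected = ["wiki_sp.model", "wiki_sp.vocab"]
    missing = missing_files(expected)
    if missing:
        print("Missing output files:", ", ".join(missing))
    else:
        print("All expected output files are present.")
```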


### Training Bengali Word2Vec Model

When training completes successfully, it produces the following files:

* `wiki_word2vec.model`
* `wiki_word2vec.vector`
* `wiki_word2vec.model.trainables.syn1neg.npy`
* `wiki_word2vec.model.wv.vectors.npy`


In [None]:
from bnlp.bengali_word2vec import Bengali_Word2Vec
bwv = Bengali_Word2Vec(True)
data_file = "bn_wiki_data.txt"
model_name = "wiki_word2vec.model"
vector_name = "wiki_word2vec.vector"
bwv.train_word2vec(data_file, model_name, vector_name)


wiki_word2vec.model and wiki_word2vec.vector saved in your current directory.
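
To inspect the trained vectors without extra dependencies, a plain-text reader may be enough. The sketch below assumes that `wiki_word2vec.vector` uses the common word2vec text layout (a header line with vocabulary size and dimension, then one word and its floats per line). The notebook does not confirm this, so treat the format as an assumption. `load_text_vectors` and `cosine_similarity` are hypothetical helpers, not part of BNLP.

```python
import math


def load_text_vectors(path, limit=None, encoding="utf-8"):
    """Parse vectors stored in the word2vec plain-text layout.

    ASSUMPTION: the first line is "<vocab_size> <dimension>" and every later
    line is a word followed by <dimension> floats. The notebook does not
    confirm that `wiki_word2vec.vector` uses this layout.
    """
    vectors = {}
    with open(path, encoding=encoding) as handle:
        header = handle.readline().split()
        dimension = int(header[1])
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) != dimension + 1:
                continue  # skip malformed lines rather than guessing
            vectors[parts[0]] = [float(value) for value in parts[1:]]
            if limit is not None and len(vectors) >= limit:
                break
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

For example, `vectors = load_text_vectors("wiki_word2vec.vector", limit=50000)` reads the first 50,000 entries, and `cosine_similarity(vectors[w1], vectors[w2])` compares two words `w1` and `w2` that are present in the vocabulary.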


### Training Bengali fastText Model

After successful training, it produces:

* `wiki_fasttext.bin`

In [None]:
from bnlp.bengali_fasttext import Bengali_Fasttext

bft = Bengali_Fasttext()
data = "bn_wiki_data.txt"
model_name = "wiki_fasttext.bin"
epoch = 1
bft.train_fasttext(data, model_name, epoch)

### Training Bengali POS Tagging CRF Model

After successful training, it prints the accuracy on the evaluation data and saves the trained model:

* `pos_model.pkl`

In [2]:
from bnlp.pos import POS

bn_pos = POS()
model_name = "pos_model.pkl"
# Each sentence is a list of (token, tag) pairs.
tagged_sentences = [
    [('রপ্তানি', 'JJ'), ('দ্রব্য', 'NC'), ('-', 'PU'), ('তাজা', 'JJ'), ('ও', 'CCD'), ('শুকনা', 'JJ'), ('ফল', 'NC'), (',', 'PU'), ('আফিম', 'NC'), (',', 'PU'), ('পশুচর্ম', 'NC'), ('ও', 'CCD'), ('পশম', 'NC'), ('এবং', 'CCD'), ('কার্পেট', 'NC'), ('৷', 'PU')],
    [('মাটি', 'NC'), ('থেকে', 'PP'), ('বড়জোর', 'JQ'), ('চার', 'JQ'), ('পাঁচ', 'JQ'), ('ফুট', 'CCL'), ('উঁচু', 'JJ'), ('হবে', 'VM'), ('৷', 'PU')],
]

bn_pos.train(model_name, tagged_sentences)

punkt not found. downloading...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
1
1
Training Started........
it will take time according to your dataset size..
Training Finished!
Evaluating with Test Data...
Accuracy is: 
0.1111111111111111
Model Saved!
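
With only two tagged sentences, the reported accuracy says little about the tagger. The two `1` lines in the output look like the sizes of the training and evaluation splits, and 0.1111... equals 1/9, which would match one correct tag on the nine-token second sentence. Both readings are assumptions, not something the notebook states. The sketch below shows how token-level accuracy is computed. `token_accuracy` is a hypothetical helper, and the predicted tags are invented for illustration.

```python
def token_accuracy(gold_sentences, predicted_sentences):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        if len(gold) != len(predicted):
            raise ValueError("gold and predicted sentences must have equal length")
        correct += sum(g == p for g, p in zip(gold, predicted))
        total += len(gold)
    return correct / total if total else 0.0


# Gold tags of the nine-token second sentence from the cell above.
gold = [["NC", "PP", "JQ", "JQ", "JQ", "CCL", "JJ", "VM", "PU"]]
# INVENTED predictions for illustration only: a tagger that always answers "NC".
predicted = [["NC"] * 9]

print(token_accuracy(gold, predicted))  # 1 correct tag out of 9
```

A meaningful evaluation needs far more tagged sentences than this toy sample.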


### Training Bengali NER Model

After successful training, it prints the accuracy on the evaluation data and saves the trained model:

* `ner_model.pkl`

In [4]:
from bnlp.ner import NER

bn_ner = NER()
model_name = "ner_model.pkl"
# One sample sentence as (token, tag) pairs.
sample_sentence = [('ত্রাণ', 'O'), ('ও', 'O'), ('সমাজকল্যাণ', 'O'), ('সম্পাদক', 'S-PER'), ('সুজিত', 'B-PER'), ('রায়', 'I-PER'), ('নন্দী', 'E-PER'), ('প্রমুখ', 'O'), ('সংবাদ', 'O'), ('সম্মেলনে', 'O'), ('উপস্থিত', 'O'), ('ছিলেন', 'O')]
# The training data is the same sentence repeated three times.
tagged_sentences = [list(sample_sentence) for _ in range(3)]

bn_ner.train(model_name, tagged_sentences)

2
1
Training Started........
It will take time according to your dataset size...
Training Finished!
Evaluating with Test Data...
Accuracy is: 
1.0
Model Saved!
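
The tags in the sample (`B-PER`, `I-PER`, `E-PER`, `S-PER`, `O`) look like a BIOES-style scheme: begin, inside, end, single-token entity, and outside. That interpretation is an assumption, since the notebook does not name the scheme. Under that assumption, the sketch below groups tagged tokens into entity spans. `extract_entities` is a hypothetical helper, not part of BNLP.

```python
def extract_entities(tagged_sentence):
    """Group (token, tag) pairs into (entity_text, entity_type) spans.

    ASSUMPTION: tags follow a BIOES-style scheme, i.e. "O" or
    "<prefix>-<type>" with prefix B (begin), I (inside), E (end), S (single).
    """
    entities = []
    current_tokens = []
    current_type = None
    for token, tag in tagged_sentence:
        prefix, _, entity_type = tag.partition("-")
        if prefix == "S":
            entities.append((token, entity_type))
            current_tokens, current_type = [], None
        elif prefix == "B":
            current_tokens, current_type = [token], entity_type
        elif prefix in ("I", "E") and current_type == entity_type:
            current_tokens.append(token)
            if prefix == "E":
                entities.append((" ".join(current_tokens), entity_type))
                current_tokens, current_type = [], None
        else:
            # "O" or an inconsistent tag: drop any unfinished span.
            current_tokens, current_type = [], None
    return entities


# Part of the sample sentence from the cell above.
sentence = [("সম্পাদক", "S-PER"), ("সুজিত", "B-PER"), ("রায়", "I-PER"),
            ("নন্দী", "E-PER"), ("প্রমুখ", "O")]
print(extract_entities(sentence))
```

The same function can be applied to a model's predicted tags to compare whole entities rather than individual tokens.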
