<a href="https://colab.research.google.com/github/sagorbrur/bnlp/blob/master/notebook/bnlp_colab_testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BNLP
BNLP is a natural language processing toolkit for Bengali Language. This tool will help you to tokenize Bengali text, Embedding Bengali words, Bengali POS Tagging, Construct Neural Model for Bengali NLP purposes.

Here we provide a to z api level use of **BNLP**

## Installation

In [1]:
!pip install bnlp_toolkit

Collecting bnlp_toolkit
  Downloading https://files.pythonhosted.org/packages/de/c9/376837d2bf998a511af113c82feeee703ff95f41eb3ba79ac43036f0edfd/bnlp_toolkit-2.3-py3-none-any.whl
Collecting fasttext
[?25l  Downloading https://files.pythonhosted.org/packages/f8/85/e2b368ab6d3528827b147fdb814f8189acc981a4bc2f99ab894650e05c40/fasttext-0.9.2.tar.gz (68kB)
[K     |████████████████████████████████| 71kB 4.0MB/s 
Collecting sklearn-crfsuite
  Downloading https://files.pythonhosted.org/packages/25/74/5b7befa513482e6dee1f3dd68171a6c9dfc14c0eaa00f885ffeba54fe9b0/sklearn_crfsuite-0.3.6-py2.py3-none-any.whl
Collecting sentencepiece
[?25l  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
[K     |████████████████████████████████| 1.1MB 13.9MB/s 
Collecting python-crfsuite>=0.8.3
[?25l  Downloading https://files.pythonhosted.org/packages/95/99/869dde6dbf3e0d07a013c8

## Downloading Pretrained model

NB: POS TAG and NER model may need to download from https://github.com/sagorbrur/bnlp/blob/master/model/bn_pos_model.pkl and then upload it to colab. 

Otherwise it will arise an error. 

In [2]:
!mkdir models
%cd models

/content/models


In [4]:
!wget https://github.com/sagorbrur/bnlp/raw/master/model/bn_spm.model
!wget https://github.com/sagorbrur/bnlp/raw/master/model/bn_spm.vocab
!wget https://github.com/sagorbrur/bnlp/blob/master/model/bn_pos.pkl
!wget https://github.com/sagorbrur/bnlp/blob/master/model/bn_ner.pkl

--2020-07-22 17:46:58--  https://github.com/sagorbrur/bnlp/blob/master/model/bn_pos.pkl
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘bn_pos.pkl’

bn_pos.pkl              [<=>                 ]       0  --.-KB/s               bn_pos.pkl              [ <=>                ]  71.78K  --.-KB/s    in 0.02s   

2020-07-22 17:46:58 (3.70 MB/s) - ‘bn_pos.pkl’ saved [73503]

--2020-07-22 17:47:00--  https://github.com/sagorbrur/bnlp/blob/master/model/bn_ner.pkl
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘bn_ner.pkl’

bn_ner.pkl              [ <=>                ]  71.78K  --.-KB/s    in 0.02s   

2020-07-22 17:47:00 (3.66 MB/s) - ‘bn_ner.pkl’ saved [73503]



In [5]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
downloaded = drive.CreateFile({'id':"1DxR8Vw61zRxuUm17jzFnOX97j7QtNW7U"})
downloaded.GetContentFile('bengali_word2vec.zip')
!unzip bengali_word2vec.zip
!rm -rf bengali_word2vec.zip

Archive:  bengali_word2vec.zip
  inflating: bengali_word2vec.model  
  inflating: bengali_word2vec.model.trainables.syn1neg.npy  
  inflating: bengali_word2vec.model.wv.vectors.npy  


In [6]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
downloaded = drive.CreateFile({'id':"1CFA-SluRyz3s5gmGScsFUcs7AjLfscm2"})
downloaded.GetContentFile('bengali_fasttext_wiki.zip')
!unzip bengali_fasttext_wiki.zip
!rm -rf bengali_fasttext_wiki.zip

Archive:  bengali_fasttext_wiki.zip
  inflating: bengali_fasttext_wiki.bin  


In [7]:
%cd ..

/content


## Tokenization



### Sentencepiece Tokenizer

In [8]:
from bnlp.sentencepiece_tokenizer import SP_Tokenizer

bsp = SP_Tokenizer()
model_path = "./models/bn_spm.model"
input_text = "আমি ভাত খাই। সে বাজারে যায়।"
tokens = bsp.tokenize(model_path, input_text)
print(tokens)
text2id = bsp.text2id(model_path, input_text)
print(text2id)
id2text = bsp.id2text(model_path, text2id)
print(id2text)

punkt not found. downloading...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
['▁আমি', '▁ভাত', '▁খাই', '।', '▁সে', '▁বাজারে', '▁যায়', '।']
[914, 5265, 24224, 3, 124, 2244, 41, 3]
আমি ভাত খাই। সে বাজারে যায়।


### Basic Tokenizer

In [9]:
from bnlp.basic_tokenizer import BasicTokenizer
basic_t = BasicTokenizer()
raw_text = "আমি বাংলায় গান গাই।"
tokens = basic_t.tokenize(raw_text)
print(tokens)

['আমি', 'বাংলায়', 'গান', 'গাই', '।']


### NLTK Tokenizer

In [10]:
from bnlp.nltk_tokenizer import NLTK_Tokenizer

text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
bnltk = NLTK_Tokenizer()
word_tokens = bnltk.word_tokenize(text)
sentence_tokens = bnltk.sentence_tokenize(text)
print(word_tokens)
print(sentence_tokens)

['আমি', 'ভাত', 'খাই', '।', 'সে', 'বাজারে', 'যায়', '।', 'তিনি', 'কি', 'সত্যিই', 'ভালো', 'মানুষ', '?']
['আমি ভাত খাই।', 'সে বাজারে যায়।', 'তিনি কি সত্যিই ভালো মানুষ?']


## Word Embedding

### Bengali Word2Vec

In [11]:
from bnlp.bengali_word2vec import Bengali_Word2Vec

bwv = Bengali_Word2Vec()
model_path = "models/bengali_word2vec.model"
word = 'আমার'
vector = bwv.generate_word_vector(model_path, word)
print(vector.shape)
print(vector)

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


(300,)
[-1.6936177e+00  3.5159554e-02 -8.5707474e-03  4.6422979e-01
  4.7176498e-01 -1.1240785e-03  5.2726853e-01  9.6344274e-01
  4.3611592e-01 -2.4183762e+00 -1.1882383e+00 -6.1812967e-01
 -2.6307828e+00 -6.1543208e-01 -1.0401576e+00 -4.4781092e-01
 -8.7368643e-01 -6.5588124e-02 -1.9416760e+00 -8.5976779e-01
  8.9258450e-01 -5.2980870e-01 -1.1779339e+00  1.6538888e-01
  5.7090968e-01 -6.8303603e-01 -5.8089417e-01  1.9823054e+00
  1.5652509e+00 -1.8102252e+00  5.1018655e-01  1.1032093e+00
 -1.0756480e+00  1.1780707e+00  1.1778240e+00 -5.2861094e-01
  3.8371810e-01  9.7755694e-01  7.2286832e-01  4.4961435e-01
 -1.0284587e+00 -4.9218610e-01  7.0426416e-01  5.1277459e-02
  7.9809263e-02 -2.3158913e+00 -5.1341558e-01  2.5855860e-01
 -1.4927088e+00 -1.4820724e+00  1.1150364e+00 -3.9570293e-01
  4.6147889e-01  8.7402004e-01 -1.1148657e+00  1.7493018e+00
  6.5046811e-01  1.6666926e+00  2.6500010e+00  1.1857886e+00
  7.1161926e-01 -1.2677008e+00 -1.1069984e+00 -7.8171343e-01
 -8.2391447e-01 -

In [12]:
from bnlp.bengali_word2vec import Bengali_Word2Vec

bwv = Bengali_Word2Vec()
model_path = "models/bengali_word2vec.model"
word = 'গ্রাম'
similar = bwv.most_similar(model_path, word)
print(similar)

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
  if np.issubdtype(vec.dtype, np.int):


[('মৌজা', 0.7365161180496216), ('গ্রাম,', 0.6939323544502258), ('গ্রামটি', 0.6869181394577026), ('পুরসভা', 0.6866500377655029), ('গ্রামের', 0.6699343919754028), ('৪২.৫৭', 0.6655560731887817), ('মৌজার', 0.66482013463974), ('ব্লকে', 0.6518076062202454), ('পঞ্চায়েত', 0.6460636258125305), ('মহল্লা', 0.6451084017753601)]


### Bengali Fasttext

In [13]:
from bnlp.bengali_fasttext import Bengali_Fasttext

bft = Bengali_Fasttext()
word = "গ্রাম"
model_path = "models/bengali_fasttext_wiki.bin"
word_vector = bft.generate_word_vector(model_path, word)
print(word_vector.shape)
print(word_vector)

(100,)
[ 0.22730371 -0.40870905 -0.15613425  0.3804481  -0.05980289 -0.28930932
  0.34626344  0.40258473 -0.90198106  0.4493182  -0.7325722   0.04727728
  0.7795627   0.12068285  0.4670834   0.86121595  0.19153564  0.22014432
 -0.73635215  0.4743112   0.04276856  0.24542333  0.58513665 -0.49344873
  1.2036309  -0.37963045 -0.52979314  0.42768055 -0.2915344   0.6429044
 -0.24786738 -0.34868303  0.5416647  -0.19672239 -0.5149317  -0.4899621
  0.41403815  0.84034336  0.43055257  0.05744093  1.0355072   0.6728295
 -0.46993157 -0.8494765   0.33383992  0.3980397   0.06346162 -1.2393602
  0.18511884 -0.10365435 -1.0729522   0.2701686  -0.48516303  0.7226823
  0.4941565  -0.14498085 -0.1882495   0.01020508  1.3079278  -1.0012709
  0.13207525  0.05821019 -0.5525221   0.13435237  1.1650416  -0.08389879
 -0.34301072  0.7302537  -0.1674301   0.2222631   0.56786853  0.06164984
  0.4102374   0.1456264  -0.28646046 -0.21075231  0.6185989  -0.4345684
 -0.15338174  0.96878874  0.56596994 -0.18027176  0



## Bengali POS Tagging

In [None]:
from bnlp.pos import POS
bn_pos = POS()
model_path = "models/bn_pos.pkl"
text = "আমি ভাত খাই।"
res = bn_pos.tag(model_path, text)
print(res)
# [('আমি', 'PPR'), ('ভাত', 'NC'), ('খাই', 'VM'), ('।', 'PU')]

punkt not found. downloading...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[('আমি', 'PPR'), ('ভাত', 'NC'), ('খাই', 'VM'), ('।', 'PU')]


## Bengali Name Entity Recognition

In [15]:
from bnlp.ner import NER
bn_ner = NER()
model_path = "models/bn_ner.pkl"
text = "সে ঢাকায় থাকে।"
result = bn_ner.tag(model_path, text)
print(result)

[('সে', 'O'), ('ঢাকায়', 'S-LOC'), ('থাকে', 'O')]
