<a href="https://colab.research.google.com/github/sagorbrur/bnlm/blob/master/notebook/bnlm_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bengali Language Model (BNLM)
BNLM is a Bengali language model built with fastai's ULMFiT. It is ready to use for prediction and classification tasks.

## Installation

In [1]:
!pip install bnlm

Collecting bnlm
  Downloading https://files.pythonhosted.org/packages/91/cc/de65b81d2b4c013bd5d829e83b919ae1b7f62691f4cced31a6aab6a75fe5/bnlm-1.0.0-py3-none-any.whl
Collecting async-timeout>=3.0.1
  Downloading https://files.pythonhosted.org/packages/e1/1e/5a4441be21b0726c4464f3f23c8b19628372f606755a9d2e46c187e65ec4/async_timeout-3.0.1-py3-none-any.whl
Collecting fastai==1.0.57
  Downloading https://files.pythonhosted.org/packages/c1/e2/42342ded0385d694e3250e74f43f0dc9a3ff3d5c2241a2ddd98236b5f9de/fastai-1.0.57-py3-none-any.whl (233kB)
Collecting aiohttp>=3.5.4
  Downloading https://files.pythonhosted.org/packages/7c/39/7eb5f98d24904e0f6d3edb505d4aa60e3ef83c0a58d6fe18244a51757247/aiohttp-3.6.2-cp36-cp36m-manylinux1_x86_64.whl (1.2MB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/98/2c/8df20f3ac6c22ac224fff307ebc102818206c

## PyTorch Version

In [5]:
!pip install torch===1.2.0 torchvision===0.4.0 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch===1.2.0
  Downloading https://files.pythonhosted.org/packages/30/57/d5cceb0799c06733eefce80c395459f28970ebb9e896846ce96ab579a3f1/torch-1.2.0-cp36-cp36m-manylinux1_x86_64.whl (748.8MB)
Collecting torchvision===0.4.0
  Downloading https://files.pythonhosted.org/packages/06/e6/a564eba563f7ff53aa7318ff6aaa5bd8385cbda39ed55ba471e95af27d19/torchvision-0.4.0-cp36-cp36m-manylinux1_x86_64.whl (8.8MB)
Installing collected packages: torch, torchvision
  Found existing installation: torch 1.5.0+cu101
    Uninstalling torch-1.5.0+cu101:
      Successfully uninstalled torch-1.5.0+cu101
  Found existing installation: torchvision 0.6.0+cu101
    Uninstalling torchvision-0.6.0+cu101:
      Successfully uninstalled torchvision-0.6.0+cu101
Successfully installed torch-1.2.0 torchvision-

In [1]:
import torch
torch.__version__

'1.2.0'
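
The install log above shows the runtime's preinstalled `torch 1.5.0+cu101` being replaced by the pinned `1.2.0`. A minimal sketch of a check that the active version matches the pin is given below; the helper names `base_version` and `matches_pinned_version` are hypothetical and not part of `bnlm` or PyTorch.

```python
def base_version(version: str) -> str:
    """Strip a local build suffix such as '+cu101' from a version string."""
    return version.split("+", 1)[0]


def matches_pinned_version(installed: str, pinned: str) -> bool:
    """Return True when both versions are equal once build suffixes are ignored."""
    return base_version(installed) == base_version(pinned)


# Version strings taken from the install log above.
print(matches_pinned_version("1.2.0", "1.2.0"))        # True
print(matches_pinned_version("1.5.0+cu101", "1.2.0"))  # False
```

In the notebook, `torch.__version__` would be passed as `installed`.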

## Features and API

### Download Pretrained Model

In [3]:
from bnlm.bnlm import download_models

download_models()

Downloading Models...
It will take sometimes..
Download completed
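
The cells that follow read from a `model` directory and from `model/bn_spm.model`. As a rough sketch, the download could be sanity-checked before moving on. It is an assumption here that `download_models()` writes into `model` in the current working directory, and the helper `missing_model_files` is hypothetical, not part of `bnlm`.

```python
from pathlib import Path


def missing_model_files(model_dir, expected_files):
    """Return the names in expected_files that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected_files if not (root / name).is_file()]


# 'model' and 'bn_spm.model' are the paths used by the later cells; the rest of
# the directory layout produced by download_models() is not known here.
missing = missing_model_files("model", ["bn_spm.model"])
if missing:
    print("Missing model files:", missing)
else:
    print("All expected model files found.")
```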


### Predict N Words

In [8]:
from bnlm.bnlm import BengaliTokenizer
from bnlm.bnlm import predict_n_words
model_path = 'model'
input_sen = "আমি ভাত"
output = predict_n_words(input_sen, 2, model_path)
print("Word Prediction: ", output)

Word Prediction:  আমি ভাত খাচ্ছি


### Get Sentence Encoding

In [3]:
from bnlm.bnlm import BengaliTokenizer
from bnlm.bnlm import get_sentence_encoding
model_path = 'model'
sp_model = "model/bn_spm.model"
input_sentence = "আমি ভাত খাই।"
encoding = get_sentence_encoding(input_sentence, model_path, sp_model)
print("sentence encoding is: ", encoding)

sentence encoding is:  [ 0.016252 -0.056558  0.046531  0.211012 ... -0.111951 -0.127939  0.09549  -0.019716]


### Get Embedding Vector

In [9]:
from bnlm.bnlm import BengaliTokenizer
from bnlm.bnlm import get_embedding_vectors
model_path = 'model'
sp_model = "model/bn_spm.model"
input_sentence = "আমি ভাত খাই।"
embed = get_embedding_vectors(input_sentence, model_path, sp_model)
print("sentence embedding is : ", embed)


sentence embedding is :  [array([ 0.252667, -0.257153, -0.137062,  0.825108, ..., -0.192384,  0.209549, -0.581286,  0.513466], dtype=float32), array([ 0.549337, -0.300522,  0.268269,  0.220001, ..., -0.479629,  0.455483, -0.806356,  1.199458], dtype=float32), array([ 0.770149, -0.830593, -0.16251 ,  1.203741, ...,  0.139086, -0.305681, -1.467003,  0.597592], dtype=float32), array([ 1.861862, -0.65491 ,  0.261645,  0.717122, ...,  0.12827 , -0.089352, -2.561541,  0.170493], dtype=float32)]
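
The output above is a list of four vectors, apparently one per token of the input. One common way to reduce such a list to a single fixed-size vector is element-wise averaging (mean pooling). The sketch below illustrates that idea on toy data; it is not claimed to be how `get_sentence_encoding` works, and `mean_pool` is a hypothetical helper.

```python
def mean_pool(token_vectors):
    """Average equal-length token vectors element-wise into one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must have the same length")
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Toy 3-dimensional vectors, not real model output.
toy_vectors = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
print(mean_pool(toy_vectors))  # [2.0, 1.0, 1.0]
```

The function only relies on `len` and indexing, so it should also accept the list of NumPy arrays returned above.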


### Sentence Similarity

In [11]:
from bnlm.bnlm import BengaliTokenizer
from bnlm.bnlm import get_sentence_similarity
model_path = 'model'
sp_model = "model/bn_spm.model"
sentence_1 = "সে খুব সুন্দর করে কথা বলে।"
sentence_2 = "তার কথা খুবেই মিষ্টি।"
sim = get_sentence_similarity(sentence_1, sentence_2, model_path, sp_model)
print("Similarity is: %0.2f"%sim)


Similarity is: 0.72
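
A score like the one above is often computed as the cosine similarity between two sentence vectors. The source does not state which measure `get_sentence_similarity` uses, so the following is only a sketch of cosine similarity on toy vectors; `cosine_similarity` here is a hypothetical helper written in plain Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, not real sentence encodings.
print("%0.2f" % cosine_similarity([1.0, 0.0], [1.0, 1.0]))  # 0.71
```

With real data, the two arguments could be the vectors returned by `get_sentence_encoding` for each sentence.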


### Find Similar Sentences

In [12]:
from bnlm.bnlm import BengaliTokenizer
from bnlm.bnlm import get_similar_sentences

model_path = 'model'
sp_model = "model/bn_spm.model"

input_sentence = "আমি বাংলায় গান গাই।"
sen_pred = get_similar_sentences(input_sentence, 3, model_path, sp_model)
print(sen_pred)

['আমি বাংলায় গান গাই ।', 'আমি ইংরেজিতে গান গাই।', 'আমি বাংলায় গানও গাই।']
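
The first sentence in the output above appears to be the input itself, differing only by a space before the danda (।). If such echoes are unwanted, they can be filtered out after the call. The sketch below assumes that whitespace is the only difference worth ignoring; `normalize` and `drop_echoes` are hypothetical helpers, not part of `bnlm`.

```python
def normalize(sentence):
    """Remove all whitespace so spacing differences do not matter."""
    return "".join(sentence.split())


def drop_echoes(input_sentence, candidates):
    """Keep only candidates that differ from the input by more than whitespace."""
    target = normalize(input_sentence)
    return [c for c in candidates if normalize(c) != target]


# Toy data in the same shape as the output above.
print(drop_echoes("আমি ভাত খাই।", ["আমি ভাত খাই ।", "আমি রুটি খাই।"]))
```

In the notebook, `sen_pred` would be passed as `candidates`.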
