Bengali NLP
Bangla2Vec

Language Modelling and Classification in the Bengali Language

Announcement: I will be giving a talk at IEM, Kolkata this Saturday about this work. The event link is here. Hope to see you there!!!

Bangla2Vec is an open-source project for modelling the Bengali language. The models released here can be used for a variety of tasks such as classification and translation. All the data and models are open-sourced, so you can train your own model or use the pretrained models for your own tasks.

Releases

  • Trained a skip-gram model on a news dataset: Training Script | Results | Model
  • Trained a skip-gram model on a Wikipedia dataset: Training Script | Results | Model
  • Visualise word embeddings: Script | Create a directory named vis, run the script, and then start TensorBoard with tensorboard --logdir=vis
  • Scripts to scrape data from Bengali news websites: GitHub Repo
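
For readers unfamiliar with skip-gram, the sketch below shows the kind of (centre word, context word) training pairs such a model learns from. It is only an illustration of the idea and is not taken from the training notebooks; the function name skipgram_pairs, the window parameter, and the placeholder tokens are all assumptions made for this example.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) pairs for each token and its neighbours.

    `window` is the number of positions to look at on each side of the
    centre word. Hypothetical helper, not part of this repository.
    """
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs


# Placeholder tokens standing in for a tokenised Bengali sentence.
tokens = ["w1", "w2", "w3", "w4"]
for centre, context in skipgram_pairs(tokens, window=1):
    print(centre, "->", context)
```

A skip-gram model is then trained to predict the context word from the centre word, so words that appear in similar contexts end up with similar vectors.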

Results

Words most similar to the word chele (boy)

Father + Girl - Boy = Mother

Odd one out

Bengalis love sweets!
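
The three kinds of query shown above (most similar words, analogy arithmetic, and odd one out) can all be expressed with cosine similarity over word vectors. The following is a minimal sketch under stated assumptions: the four-dimensional vectors are hand-made toy values, not the trained Bangla2Vec embeddings, the English keys stand in for Bengali words, and the helper names are hypothetical.

```python
import math

# Toy vectors, invented for illustration only.
# Dimensions (roughly): person, male(+)/female(-), parent, food.
VECTORS = {
    "boy":    [1.0,  1.0, 0.0, 0.0],
    "girl":   [1.0, -1.0, 0.0, 0.0],
    "father": [1.0,  1.0, 1.0, 0.0],
    "mother": [1.0, -1.0, 1.0, 0.0],
    "sweet":  [0.0,  0.0, 0.0, 1.0],
}


def norm(u):
    return math.sqrt(sum(a * a for a in u))


def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def most_similar(word, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(w, cosine(VECTORS[word], v)) for w, v in VECTORS.items() if w != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


def analogy(positive, negative):
    """Add the `positive` vectors, subtract the `negative` ones, and
    return the closest word that is not one of the inputs."""
    dim = len(next(iter(VECTORS.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + x for q, x in zip(query, VECTORS[w])]
    for w in negative:
        query = [q - x for q, x in zip(query, VECTORS[w])]
    excluded = set(positive) | set(negative)
    candidates = [w for w in VECTORS if w not in excluded]
    return max(candidates, key=lambda w: cosine(query, VECTORS[w]))


def odd_one_out(words):
    """Return the word least similar to the mean of the unit vectors."""
    units = {w: [x / norm(VECTORS[w]) for x in VECTORS[w]] for w in words}
    mean = [sum(col) / len(words) for col in zip(*units.values())]
    return min(words, key=lambda w: cosine(units[w], mean))


print(most_similar("boy", topn=1))
print(analogy(positive=["father", "girl"], negative=["boy"]))
print(odd_one_out(["boy", "girl", "father", "mother", "sweet"]))
```

With real embeddings the same queries are usually run through a word-vector library rather than hand-written helpers; the sketch only shows the arithmetic behind them.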

Data

Data was scraped from multiple online Bengali news websites.

Data was also collected from a Wikipedia dump.

You can view the data in the data folder.

Examples

  • Classification: Using the trained Bangla2Vec models, a news classifier was built. This classifier assigns news to one of 5 categories based on the headline. The best model achieved a test F1 score of 0.76 after training on just 40k news headlines.
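
The architecture of the classifier is not described here, so the sketch below shows only one simple way word vectors can feed a headline classifier: average the vectors of the words in a headline and pick the category whose centroid is closest. This is an assumption made for illustration, not the model that produced the score above; the two-dimensional embeddings, the two categories, and all function names are hypothetical.

```python
import math

# Toy 2-d embeddings, invented stand-ins for trained word vectors.
EMB = {
    "goal":     [1.0, 0.1],
    "match":    [0.9, 0.2],
    "vote":     [0.1, 1.0],
    "election": [0.2, 0.9],
}


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def headline_vector(headline):
    """Average the vectors of the known words; None if no word is known."""
    vecs = [EMB[w] for w in headline.split() if w in EMB]
    return mean_vector(vecs) if vecs else None


def fit_centroids(examples):
    """Compute one mean headline vector (centroid) per category."""
    by_label = {}
    for headline, label in examples:
        vec = headline_vector(headline)
        if vec is not None:
            by_label.setdefault(label, []).append(vec)
    return {label: mean_vector(vecs) for label, vecs in by_label.items()}


def predict(headline, centroids):
    """Return the category whose centroid is most similar to the headline."""
    vec = headline_vector(headline)
    if vec is None:
        return None
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


train = [("goal match", "sports"), ("vote election", "politics")]
centroids = fit_centroids(train)
print(predict("late goal", centroids))
print(predict("election", centroids))
```

Words missing from the embedding vocabulary are skipped, which is why a headline made up entirely of unknown words yields no prediction in this sketch.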

Similar Projects

This project is a sister project of other projects working on IndicNLP. They include:

To find resources for getting started with IndicNLP, or to learn more about it, see our Awesome List of resources.

Future Work

  • Build a word2vec model
  • Visualise the trained embeddings
  • Build a ULMFiT model
  • Get translation data