Language Modelling and Classification in the Bengali Language
Announcement: I will be giving a talk at IEM, Kolkata this Saturday about this work. The event link is here. Hope to see you there!
Bangla2Vec is an open-source project for modelling the Bengali language. The models released here can be used for a variety of tasks such as classification and translation. All the data and models are open-sourced, so you can train your own model or use the pretrained models for your own tasks.
- Trained a skipgram model on a news dataset: Training Script | Results | Model
- Trained a skipgram model on a Wikipedia dataset: Training Script | Results | Model
- Visualise Word Embeddings: Script | Create a directory named vis, run the script, and then start TensorBoard.
- Scripts to scrape data from Bengali news websites: GitHub Repo
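For readers new to skipgram models, here is a minimal sketch of the (center, context) pairs such a model is trained on. It is not the project's training script: the function name `skipgram_pairs`, the `window` parameter, and the transliterated toy sentence are assumptions made for illustration only.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                pairs.append((center, tokens[j]))
    return pairs


# Toy, transliterated sentence (hypothetical, not taken from the datasets).
sentence = ["chele", "mishti", "khay"]
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

A skipgram model learns word vectors by predicting each context word from its center word, so a larger `window` yields more pairs per sentence and broader, more topical contexts.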
Words most similar to the word chele (boy)
Father + Girl - Boy = Mother
Odd one out
Bengalis Love Sweets!
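The analogy and odd-one-out results above both come down to vector arithmetic and cosine similarity. The sketch below shows the idea on hand-made toy vectors. The vectors, the English keys, and the helper names (`cosine`, `analogy`, `odd_one_out`) are all assumptions for illustration and are not taken from the released models.

```python
import math

# Hand-made toy vectors: [gender, generation, person, food]. Purely illustrative.
toy = {
    "father": [1.0, 1.0, 2.0, 0.0],
    "mother": [-1.0, 1.0, 2.0, 0.0],
    "boy": [1.0, -1.0, 2.0, 0.0],
    "girl": [-1.0, -1.0, 2.0, 0.0],
    "sweet": [0.0, 0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def odd_one_out(vectors, words):
    """Return the word least similar to the mean of the unit-normalised vectors."""
    units = []
    for w in words:
        norm = math.sqrt(sum(x * x for x in vectors[w]))
        units.append([x / norm for x in vectors[w]])
    mean = [sum(col) / len(units) for col in zip(*units)]
    return min(words, key=lambda w: cosine(vectors[w], mean))


print(analogy(toy, positive=["father", "girl"], negative=["boy"]))
print(odd_one_out(toy, ["father", "mother", "boy", "girl", "sweet"]))
```

With trained embeddings the same two queries run over the whole vocabulary instead of five toy words.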
Data was scraped from multiple online Bengali news websites.
Data was also collected from a Wikipedia dump.
You can view the data in the data folder.
- Classification: Using the trained Bangla2Vec models, a news classifier was built. This classifier assigns news to one of 5 categories based on the headline alone. The best model achieved a test F1 score of 0.76 after training on just 40k news headlines.
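As a reminder of what an F1 score measures in a multi-class setting, here is a small sketch that computes a macro-averaged F1 by hand. The text above does not say which averaging the reported 0.76 uses, so macro averaging is an assumption here, and the category names and labels are made up for illustration.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)


# Hypothetical headline categories and predictions (not the project's data).
y_true = ["sports", "sports", "politics", "politics"]
y_pred = ["sports", "politics", "politics", "politics"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Macro averaging weights every category equally, which matters when some news categories have far fewer headlines than others.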
This project is a sister project of other projects working on IndicNLP. They include:
To find resources for getting started with IndicNLP, or to learn more about it, see our Awesome List of resources.
- Build a word2vec model
- Visualise the trained embeddings
- Build a ULMFiT model
- Get translation data