Word Embeddings (Word2Vec) for Nepali Language


This pre-trained Word2Vec model provides 300-dimensional vectors for more than 0.5 million Nepali words and phrases. A dedicated Nepali-language text corpus was built from news content freely available in the public domain; it contains more than 100 million running words.

Word2Vec Model

  • Embedding dimension: 300
  • Architecture: Continuous Bag of Words (CBOW)
  • Training algorithm: negative sampling (15 negative samples)
  • Context (window) size: 10
  • Minimum token count: 2
  • Encoding: UTF-8

Download the model from here: https://drive.google.com/file/d/1ik38vahOmzhiU2DBi78VOqDt7YFPsk5w/

(Size: 1,881,180,827 bytes; file type: .txt)
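The released .txt file follows the standard word2vec text format: a UTF-8 header line `<vocab_size> <dimension>`, then one line per word holding the token followed by its 300 space-separated float components. If gensim is not available, such a file can be read with plain Python; this is a minimal sketch, demonstrated on a small hypothetical file with the same layout.

```python
import os
import tempfile

def read_word2vec_txt(path, encoding='utf-8'):
    """Read a word2vec text-format file into a dict of word -> vector."""
    vectors = {}
    with open(path, encoding=encoding) as f:
        vocab_size, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip('\n').split(' ')
            word, values = parts[0], [float(x) for x in parts[1:]]
            assert len(values) == dim  # every row must match the header
            vectors[word] = values
    return vectors, dim

# Demo on a tiny file with the same layout (2 words, 3 dimensions).
path = os.path.join(tempfile.mkdtemp(), 'toy_vectors.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('2 3\n')
    f.write('नेपाल 0.1 0.2 0.3\n')
    f.write('हिमालय 0.4 0.5 0.6\n')

vecs, dim = read_word2vec_txt(path)
print(dim)  # 3
```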

Using the Word2Vec model

```python
from gensim.models import KeyedVectors

# Load the vectors (text format, not binary)
model = KeyedVectors.load_word2vec_format('.../path/to/nepali_embeddings_word2vec.txt', binary=False)

# Find the similarity between two words
model.similarity('फेसबूक', 'इन्स्टाग्राम')

# Most similar words
model.most_similar('ठमेल')

# Try some vector arithmetic with Nepali words of your choice
model.most_similar(positive=['', ''], negative=[''], topn=1)
```

The design of the Nepali text corpus and the training of the Word2Vec model were carried out in Lab-03, School of Computer and System Sciences, Jawaharlal Nehru University, New Delhi.
