Skip to content

A simplistic approach to perform Myanmar Word Segment

Notifications You must be signed in to change notification settings

vpnry/myanmar_word_segmenter

Repository files navigation

myanmar_word_segmenter

In Myanmar sentences, words are often written without spaces between them. This can make it inconvenient for beginners to find them in dictionaries.

This script utilizes a simplistic approach by using a known Myanmar word list to split long words or phrases into smaller individual words, making it easier to locate them in the dictionary.

Usage

The MyWordSegmenter.js is standalone code file that can be used directly in a browser or as a NodeJs module. Check example.js file.

const MyWordSegmenter = require("./MyWordSegmenter");
const mSegmenter = new MyWordSegmenter();

const imagineTestWord =
  "123ညီအစ်ကိုမသိတသိအချိန်ညီအစ်ကိုမသိတသိအချိန်Singaporeမသေကောင်းမပျောက်ကောင်းမှုန်မှုန်မွှားမွှား";

console.log("\nTest input", imagineTestWord);
console.log("Test output", mSegmenter.word_segment(imagineTestWord));

Expected output:
[
  '123',
  'ညီအစ်ကိုမသိတသိအချိန်',
  'ညီအစ်ကိုမသိတသိအချိန်',
  'Singapore',
  'မသေကောင်းမပျောက်ကောင်း',
  'မှုန်မှုန်မွှားမွှား'
]

Attributions


Developer notes

1. Download or update the myanmar-word list

Clone or download the myanmar-words repository to the same directory with prepareWordList.js file. The relative path to prepareWordList.js should be myanmar-words/json-files.

git clone --depth=1 https://github.com/myanmartools/myanmar-words.git

# then run
node prepareWordList.js

It will use dev_standalone_template to generate MYWORDS.json and the standalone code file MyWordSegmenter.js.

2. Miscellaneous

  • Combine with dictionaries

It is feasible to integrate Myanmar dictionaries within MyWordSegmenter.js to develop a basic Myanmar Popup dictionary browser extension for Myanmar learners.

  • Work with other languages

The theSegmenter method in MyWordSegmenter.js can be easily adapted to work with many other languages with minor modifications.

Try yourself with the Pāḷi language.