# Morpheme separation in Nepali words

Nepali words are composed of various morphemes which can be broadly divided into two categories: vowels and consonants. A given word can be resolved into its morphemes by a few elementary rules. While these rules are relatively straightforward, the Unicode representation makes them slightly non-trivial to apply. Consider these scenarios:

- क is actually a single character in Unicode, while it is two morphemes, क् + अ in Nepali.
- क + ् in Unicode representation translates to क्, a single morpheme in Nepali.
- क + ि in Unicode representation translates to क् + इ in Nepali.
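
These three cases can be checked directly by listing the code points behind each string. The snippet below is a minimal sketch for Python 3 (the notebook cells themselves are Python 2). The helper name `codepoints` is made up for illustration, and it relies only on the standard `unicodedata` module.

```python
import unicodedata


def codepoints(text):
    """Return (code point, Unicode name) pairs for each character of text."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in text]


# क alone, क followed by the halanta, and क followed by the vowel sign ि
for sample in ["\u0915", "\u0915\u094D", "\u0915\u093F"]:
    print(codepoints(sample))
```

Note that Unicode lists the halanta under the name `DEVANAGARI SIGN VIRAMA`, so that is the name to expect in the second line of output.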

In this script, we define rules for separating morphemes in the Unicode representation of Nepali. This will serve as a building block when we later construct systems for separating syllables in multi-syllable Nepali words.


In [7]:
from __future__ import unicode_literals
import re
from ipy_table import *

In [8]:
# Independent vowels, dependent vowel signs, anusvara (ं) and visarga (ः)
vowel = ur"[\u0904-\u0914\u093A-\u094C\u0902\u0903]"

def is_vowel(v):
    return bool(re.match(vowel, v))

In [9]:
consonant = r"[\u0915-\u0939]"
halant = u"्"

def is_consonant(c):
    return bool(re.match(consonant, c))

def is_halant(k):
    return k == halant

# Rules

The rules are fairly straightforward:

- If a character is a vowel, leave it as it is.
- If a character is a single Unicode consonant (क - ह), look at the next character:
  - If it is a halanta (्), the consonant and the halanta together form a single morpheme (क्).
  - If it is a vowel, the consonant and the vowel make two morphemes (क + ि gives क् + इ).
  - If it is a consonant, or there is no next character, the consonant makes two morphemes: the consonant itself and the implicit independent vowel अ (क gives क् + अ).

In [25]:
def separate_morphemes(word):
    """Split a Unicode word into (morpheme, "V" or "C") tuples."""
    morphemes = []
    l = len(word)
    i = 0
    while i < l:
        w = word[i]
        if is_vowel(w):
            # Vowels (independent or dependent) stand on their own.
            morphemes.append((w, "V"))
            i += 1
            continue
        if is_consonant(w):
            # Every consonant is emitted in its halanta form.
            morphemes.append((w + halant, "C"))
            if i < l - 1 and is_halant(word[i + 1]):
                # An explicit halanta follows: consume it, no vowel is added.
                i += 1
            elif i == l - 1 or is_consonant(word[i + 1]):
                # End of word or another consonant: add the implicit vowel अ.
                morphemes.append(("अ", "V"))
        # Characters that are neither vowels nor consonants are skipped.
        i += 1
    return morphemes
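
For readers on Python 3, the same rules can be sketched in a self-contained form. This is not the notebook's own cell but a hedged port of it: the name `separate_morphemes_py3` and the constants are made up for illustration, and the character classes are copied from the cells above.

```python
import re

# Same character classes as the cells above, written as Python 3 strings.
VOWEL = re.compile("[\u0904-\u0914\u093A-\u094C\u0902\u0903]")
CONSONANT = re.compile("[\u0915-\u0939]")
HALANT = "\u094D"
IMPLICIT_A = "\u0905"  # independent vowel अ


def separate_morphemes_py3(word):
    """Return (morpheme, "V" or "C") pairs following the rules listed above."""
    morphemes = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else None
        if VOWEL.match(ch):
            morphemes.append((ch, "V"))
        elif CONSONANT.match(ch):
            morphemes.append((ch + HALANT, "C"))
            if nxt == HALANT:
                i += 1  # consume the explicit halanta
            elif nxt is None or CONSONANT.match(nxt):
                morphemes.append((IMPLICIT_A, "V"))  # implicit अ
        # Any other character is skipped, as in the loop above.
        i += 1
    return morphemes


# कमल has three bare consonants, so each one contributes an implicit अ.
print("".join(tag for _, tag in separate_morphemes_py3("कमल")))  # prints CVCVCV
```

Like the original loop, this sketch silently drops characters outside the two classes (digits, punctuation, Latin letters), which may or may not be what a downstream syllable splitter wants.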

In [11]:
# helper function to translate dependent vowels into independent
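def to_independent_vowel(v):
    # Signs ा (U+093E) to ौ (U+094C) sit at a fixed offset from आ (U+0906) to औ (U+0914).
    # Code points below U+093E are excluded because the offset does not hold for them.
    if re.match("[\u093E-\u094C]", v):
        return unichr(ord(v) - 0x093E + 0x0906)
    return v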

print to_independent_vowel(u"ा")
print to_independent_vowel(u"ै")


आ
ऐ
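
The offset arithmetic in `to_independent_vowel` works because the vowel signs and the independent vowels are laid out in parallel in the Devanagari block. One way to sanity-check that assumption is to compare Unicode character names for the signs from U+093E to U+094C. The following is a minimal Python 3 sketch, with `offset_mismatches` being a hypothetical helper and not part of the notebook.

```python
import unicodedata

SIGN_PREFIX = "DEVANAGARI VOWEL SIGN "
LETTER_PREFIX = "DEVANAGARI LETTER "


def offset_mismatches(first=0x093E, last=0x094C, target=0x0906):
    """List vowel signs whose offset-mapped letter carries a different vowel name."""
    mismatches = []
    for cp in range(first, last + 1):
        sign = unicodedata.name(chr(cp))[len(SIGN_PREFIX):]
        letter = unicodedata.name(chr(cp - first + target))[len(LETTER_PREFIX):]
        if sign != letter:
            mismatches.append("U+%04X" % cp)
    return mismatches


print(offset_mismatches())
```

With the Unicode data shipped in current Python 3 releases, this should flag only U+0944 (the vowel sign ॄ), whose independent form ॠ lives at U+0960 and not at the offset position. That sign is rare in Nepali text, but a lookup table would be the safer choice if it matters for your data.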


In [26]:
def break_words(word):
    # Header row, then one row per morpheme; vowels are shown in independent form.
    rows = [(u"Nepali Literal", u"Vowel / Consonant")]
    for w, v in separate_morphemes(word):
        literal = w if v == "C" else to_independent_vowel(w)
        rows.append((literal.encode('utf-8'), v))
    return rows

word = "विद्यार्थी"
morphemes = break_words(word)
make_table(morphemes)
apply_theme("basic")

व
ि
द
्
य
ा
र
्
थ
ी


| Nepali Literal | Vowel / Consonant |
|---|---|
| व् | C |
| इ | V |
| द् | C |
| य् | C |
| आ | V |
| र् | C |
| थ् | C |
| ई | V |


In [13]:
word = "क्षितिज"
morphemes = break_words(word)
make_table(morphemes)
apply_theme("basic")

| Nepali Literal | Vowel / Consonant |
|---|---|
| क् | C |
| ष् | C |
| इ | V |
| त् | C |
| इ | V |
| ज् | C |
| अ | V |


# What's next?

Breaking a word down into its constituent morphemes is quite useful for morphological analysis of Nepali words. Personally, I plan to build a system for separating the syllables of Nepali words, which can in turn be used for higher-order analyses (including statistical ones). If you have any applications in mind, go for it! Do show me your ideas or products, though.

### Acknowledgement

My brother (Prasanna Koirala) helped me a little... sort of.

