# Morpheme Segmentation

## Pre-processing

Turkish file: newstest2017.tr

- Tokenize sentences using the Zemberek tool (you can use tokenize_zemberek.java) --> newstest2017.zb.tr
- Clean and truecase using the Moses scripts --> newstest2017.zb.tc.tr
- Apply morphological parsing (https://github.com/BOUN-TABILab-TULAP/Morphological-Parser) --> newstest2017.zb.mp.tr
- Apply morphological disambiguation (https://github.com/BOUN-TABILab-TULAP/Morphological-Parser) --> newstest2017.zb.md.tr

## Morpheme segmentation (Morph)

Use the **segment_morphemes** function for segmenting a Turkish word into its morphemes:

evdekiler --> ev DA ki LAR
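
The splitting step behind this example can be sketched as follows. The analysis string is a hypothetical one written in the bracketed `surface[Tag]` style that the function below parses (surface pieces joined by `+` or `-`, each followed by a tag in square brackets); the actual parser output for *evdekiler* may use different tags and symbols. `split_analysis` is an illustrative helper, not part of the notebook, and it leaves out the apostrophe and casing handling.

```python
def split_analysis(morph_analysis):
    """Return the non-empty surface pieces of a bracketed analysis string."""
    pieces = []
    for chunk in morph_analysis.split("]"):
        parts = chunk.split("[")
        if len(parts) != 2:
            # e.g. the empty string left after the final "]"
            continue
        # Drop the boundary symbols that precede each surface piece.
        surface = parts[0].strip("-").strip("+")
        if surface:
            pieces.append(surface)
    return pieces


# Hypothetical analysis string, made up for illustration only.
analysis = "ev[Noun]+[A3sg]+DA[Loc]-ki[Adj]+LAR[A3pl]+[Nom]"
print(" ".join(split_analysis(analysis)))  # ev DA ki LAR
```

Chunks that carry only a tag, such as `+[A3sg]`, have an empty surface part and therefore contribute nothing to the segmented word.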

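To use this function, pass it the name of the morphologically disambiguated file (newstest2017.zb.md.tr). The word-tokenized file (newstest2017.zb.tr) must also be available under the same name without the .md part, because the function reads it to restore the capitalization of the first letter of each word. The generated file will be newstest2017.zb.mdnn.tr.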
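
As a minimal sketch of how the file names relate, the helper below mirrors the `str.replace` calls used in the function. `derive_paths` and `count_sentences` are hypothetical names introduced here, and the marker lines in the usage example are made up; the only assumption taken from the code is that a sentence starts at a line containing `BSTag`.

```python
def derive_paths(md_path):
    """Mirror the str.replace calls used by segment_morphemes."""
    tokenized_path = md_path.replace(".md", "")
    output_path = md_path.replace(".md", ".mdnn")
    return tokenized_path, output_path


def count_sentences(md_lines):
    """Count sentence openings, i.e. lines that contain the BSTag marker."""
    return sum(1 for line in md_lines if "BSTag" in line)


print(derive_paths("newstest2017.zb.md.tr"))
# ('newstest2017.zb.tr', 'newstest2017.zb.mdnn.tr')

# Hypothetical disambiguated lines for a single one-word sentence.
sample = ["<S> <S>+BSTag\n", "ev ev[Noun]\n", "</S> </S>+ESTag\n"]
print(count_sentences(sample))  # 1
```

Because `str.replace` rewrites every occurrence, a path that contains `.md` anywhere else (for example in a directory name) would be changed there too. Comparing `count_sentences` on the disambiguated file with the number of lines in the tokenized file is one way to check that the two files are aligned before segmenting.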

In [5]:
def segment_morphemes(filename):
    with open(filename, encoding='utf-8') as corpus_file:
        lines = corpus_file.readlines()
    with open(filename.replace(".md", ""), encoding='utf-8') as corpus_file:
        original_lines = corpus_file.readlines()
    sentence_count = 0
    with open(filename.replace(".md", ".mdnn"), "w", encoding='utf-8') as corpus_file:
        for line in lines:
            if "BSTag" in line:
                sentence = ""
                original_sentence = original_lines[sentence_count]
                original_tokens = original_sentence.strip().split()
                token_count = 0
            elif "ESTag" in line:
                corpus_file.write(sentence.strip() + "\n")
                sentence_count += 1
            else:
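                # The analysis is the second whitespace-separated field of the line;
                # the first field is not used.
                morph_analysis = line.split()[1]
                # Split each "surface[Tag]" chunk into a [surface, tag] pair.
                morph_list = [m.split("[") for m in morph_analysis.split("]")]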
                for i, token in enumerate(morph_list):
                    try:
                        word, tag = token
                        word = word.strip("-").strip("+")
                        #print(word, end=" ")
                        if word != '':
                            if word != "\'" and "\'" in word:
                                sequence = normalize_apos_suffices(word)
                                if sequence == []:
                                    first, second = word.split("\'")
                                    sequence = [first, "\'", second]
                                for s in sequence:
                                    sentence += s + " "
                            elif word != "’" and "’" in word:
                                sequence = normalize_apos_suffices(word)
                                if sequence == []:
                                    first, second = word.split("’")
                                    sequence = [first, "’", second]
                                for s in sequence:
                                    sentence += s + " "
                            elif i == 0:
                                original_current_token = original_tokens[token_count]
                                if original_current_token[0].isupper():
                                    sentence += original_current_token[0] + word[1:] + " "
                                else:
                                    sentence += word + " "
                            else:
                                sentence += word + " "
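                    except Exception:
                        # Skip chunks that cannot be processed, e.g. the empty string
                        # left after the final "]", which does not unpack into (word, tag).
                        # To debug, print(line) and print(token[0].split("[")) here.
                        pass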
                token_count += 1

In [6]:
segment_morphemes("newstest2017.zb.md.tr")
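
The casing step inside the function can be isolated as a small sketch. `restore_root_case` is a hypothetical helper name; the logic follows the `i == 0` branch above, which copies the first character of the original token onto the root when that character is uppercase.

```python
def restore_root_case(root, original_token):
    """Give the root the capitalized first letter of the original token, if any."""
    if original_token[0].isupper():
        return original_token[0] + root[1:]
    return root


print(restore_root_case("ankara", "Ankara'da"))  # Ankara
print(restore_root_case("ev", "evdekiler"))      # ev
```

Taking the character from the original token, rather than calling `str.upper()` on the root, sidesteps the Turkish dotted and dotless i distinction: the output reuses whichever letter the source text had.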

## Concatenated morpheme segmentation (ConcatMorph)

Use the **segment_morphemes_all** function for segmenting a Turkish word into its morphemes, where all the morphemes except for the root are concatenated:

evdekiler --> ev DAkiLAR

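To use this function, pass it the name of the morphologically disambiguated file (newstest2017.zb.md.tr); the word-tokenized file (newstest2017.zb.tr) must be available under the same name without the .md part. The generated file will be newstest2017.zb.mdnnall.tr. In that file the concatenated suffixes are written with a leading underscore, so the example above appears as ev _DAkiLAR.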
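
The concatenation rule can be sketched on an already segmented word. `concat_suffixes` is a hypothetical helper, not part of the notebook; it assumes the first list element is the root and ignores the apostrophe handling of the full function. The underscore prefix follows what the function's code writes before the first suffix.

```python
def concat_suffixes(morphemes):
    """Join a root and its suffixes in the ConcatMorph layout."""
    root, suffixes = morphemes[0], morphemes[1:]
    if not suffixes:
        return root
    # All suffixes form one token, marked with a leading underscore.
    return root + " _" + "".join(suffixes)


print(concat_suffixes(["ev", "DA", "ki", "LAR"]))  # ev _DAkiLAR
print(concat_suffixes(["ev"]))                      # ev
```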

In [8]:
def segment_morphemes_all(filename):
    with open(filename, encoding='utf-8') as corpus_file:
        lines = corpus_file.readlines()
    with open(filename.replace(".md", ""), encoding='utf-8') as corpus_file:
        original_lines = corpus_file.readlines()
    sentence_count = 0
    # Open with "w" so that re-running the cell overwrites the output instead of appending to it.
    with open(filename.replace(".md", ".mdnnall"), "w", encoding='utf-8') as corpus_file:
        for line in lines:
            if "BSTag" in line:
                sentence = ""
                original_sentence = original_lines[sentence_count]
                original_tokens = original_sentence.strip().split()
                token_count = 0
            elif "ESTag" in line:
                corpus_file.write(sentence.strip() + "\n")
                sentence_count += 1
            else:
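                # The analysis is the second whitespace-separated field of the line;
                # the first field is not used.
                morph_analysis = line.split()[1]
                # Split each "surface[Tag]" chunk into a [surface, tag] pair.
                morph_list = [m.split("[") for m in morph_analysis.split("]")]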
                morph_token_count = 0
                for i, token in enumerate(morph_list):
                    try:
                        word, tag = token
                        word = word.strip("-").strip("+")
                        #print(word, end=" ")
                        if word != '':
                            if word != "\'" and "\'" in word:
                                sequence = normalize_apos_suffices(word)
                                if sequence == []:
                                    first, second = word.split("\'")
                                    sequence = [first, "\'", second]
                                for s in sequence:
                                    sentence += s + " "
                            elif i == 0:
                                original_current_token = original_tokens[token_count]
                                if original_current_token[0].isupper():
                                    sentence += original_current_token[0] + word[1:] + " "
                                else:
                                    sentence += word + " "
                            elif morph_token_count == 1:
                                if word != "\'":
                                    sentence += "_" + word
                                else:
                                    sentence += word + " _"
                            else:
                                sentence += word
                            morph_token_count += 1
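                    except Exception:
                        # Skip chunks that cannot be processed, e.g. the empty string
                        # left after the final "]", which does not unpack into (word, tag).
                        # To debug, print(line) and print(token[0].split("[")) here.
                        pass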
                # Separate this word from the next one with a single space.
                if sentence and not sentence.endswith(" "):
                    sentence += " "
                token_count += 1

In [9]:
segment_morphemes_all("newstest2017.zb.md.tr")

## Last morpheme segmentation (LastMorph)

Use the **segment_morphemes_last** function to segment a Turkish word into its morphemes and keep only the root and the last morpheme:

evdekiler --> ev LAR

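To use this function, pass it the name of the morphologically disambiguated file (newstest2017.zb.md.tr); the word-tokenized file (newstest2017.zb.tr) must be available under the same name without the .md part. The generated file will be newstest2017.zb.mdnnlast.tr. In that file the last morpheme is written with a leading underscore, so the example above appears as ev _LAR.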
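
The selection rule can be sketched on an already segmented word. `keep_last_suffix` is a hypothetical helper, not part of the notebook; it assumes the first list element is the root and leaves out the apostrophe handling of the full function. The underscore prefix follows what the function's code writes before the kept suffix.

```python
def keep_last_suffix(morphemes):
    """Keep the root and only the final suffix, in the LastMorph layout."""
    root, suffixes = morphemes[0], morphemes[1:]
    if not suffixes:
        return root
    # Intermediate suffixes are discarded; the last one gets a leading underscore.
    return root + " _" + suffixes[-1]


print(keep_last_suffix(["ev", "DA", "ki", "LAR"]))  # ev _LAR
print(keep_last_suffix(["ev"]))                      # ev
```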

In [10]:
def segment_morphemes_last(filename):
    with open(filename, encoding='utf-8') as corpus_file:
        lines = corpus_file.readlines()
    with open(filename.replace(".md", ""), encoding='utf-8') as corpus_file:
        original_lines = corpus_file.readlines()
    sentence_count = 0
    # Open with "w" so that re-running the cell overwrites the output instead of appending to it.
    with open(filename.replace(".md", ".mdnnlast"), "w", encoding='utf-8') as corpus_file:
        for line in lines:
            if "BSTag" in line:
                sentence = ""
                original_sentence = original_lines[sentence_count]
                original_tokens = original_sentence.strip().split()
                token_count = 0
            elif "ESTag" in line:
                corpus_file.write(sentence.strip() + "\n")
                sentence_count += 1
            else:
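                # The analysis is the second whitespace-separated field of the line;
                # the first field is not used.
                morph_analysis = line.split()[1]
                # Split each "surface[Tag]" chunk into a [surface, tag] pair.
                morph_list = [m.split("[") for m in morph_analysis.split("]")]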
                morph_token_count = 0
                last_suffix = ""
                for i, token in enumerate(morph_list):
                    try:
                        word, tag = token
                        word = word.strip("-").strip("+")
                        #print(word, end=" ")
                        if word != '':
                            if word != "\'" and "\'" in word:
                                sequence = normalize_apos_suffices(word)
                                if sequence == []:
                                    first, second = word.split("\'")
                                    sequence = [first, "\'", second]
                                for s in sequence:
                                    sentence += s + " "
                            elif i == 0:
                                original_current_token = original_tokens[token_count]
                                if original_current_token[0].isupper():
                                    sentence += original_current_token[0] + word[1:] + " "
                                else:
                                    sentence += word + " "
                            elif word == "\'":
                                sentence += word + " "
                            else:
                                last_suffix = word
                            morph_token_count += 1
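                    except Exception:
                        # Skip chunks that cannot be processed, e.g. the empty string
                        # left after the final "]", which does not unpack into (word, tag).
                        # To debug, print(line) and print(token[0].split("[")) here.
                        pass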
                if last_suffix != "":
                    sentence += "_" + last_suffix + " "
                # Separate this word from the next one with a single space.
                if sentence and not sentence.endswith(" "):
                    sentence += " "
                token_count += 1

In [11]:
segment_morphemes_last("newstest2017.zb.md.tr")