# Morphology and Python

Me:
- Steven Butler
- srbutler at gmail dot com

Important links:

- Morpho Project: http://www.cis.hut.fi/projects/morpho/
- Read the Docs: http://morfessor.readthedocs.io/en/latest/index.html
- GitHub: https://github.com/aalto-speech/morfessor

## Morfessor Demonstration
This is a demo of the Morfessor library using sample files from Morpho Challenge 2005. To run the code below, you'll need to download some data files from the following link and save them in a folder called "data/".

http://research.ics.aalto.fi/events/morphochallenge2005/datasets.shtml

These scripts should run on both CPython and PyPy. I use PyPy whenever I can, since it can speed things up by nearly an order of magnitude.

In [1]:
import logging
import morfessor

## logging makes the output a lot more useful
## though apparently it only shows up in a console, not in notebooks
## configure the root logger once, so messages from this script and from
## loggers inside the morfessor package share one format and print only once
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S')

_logger = logging.getLogger(__name__)

### Training Morfessor

This function shows the basic setup for training a Morfessor model. It is *highly* recommended that you save the trained model to a pickle file, since retraining from scratch every time is slow.

In [2]:
def train_model(input_file, output_file=None):

    ## setup input and model objects
    morf_io = morfessor.MorfessorIO()
    morf_model = morfessor.BaselineModel()

    ## build a corpus from input file
    train_data = morf_io.read_corpus_file(input_file)

    ## load data into model
    ## the optional param "count_modifier" takes a function that dampens
    ## word frequencies; by default every token occurrence counts
    morf_model.load_data(train_data)

    ## train the model in batch form (online training also available)
    morf_model.train_batch()

    ## optionally pickle model
    if output_file is not None:
        morf_io.write_binary_model_file(output_file, morf_model)

    return morf_model
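
The `count_modifier` parameter mentioned in the comments takes a function that maps a word's raw corpus count to the count the model actually sees. As a rough sketch, here are two possible dampening functions. The function names and the sample counts are my own for illustration, and the log formula is one reasonable choice rather than the only one.

```python
import math

def log_dampening(count):
    """Compress raw counts logarithmically so frequent words matter less."""
    return int(round(math.log(count + 1, 2)))

def type_dampening(count):
    """Ignore frequency entirely: every word type counts once."""
    return 1

# made-up counts, just to show the effect of each function
raw_counts = {"the": 1000, "presented": 7, "undecomposable": 1}
for word, count in raw_counts.items():
    print(word, count, log_dampening(count), type_dampening(count))

# Assumed usage inside train_model (not run here):
# morf_model.load_data(train_data, count_modifier=log_dampening)
```

With log dampening, a word seen 1000 times ends up with a count of 10 rather than 1000, so very frequent words dominate the training data far less.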

In [3]:
## train models on the Turkish and English datasets and save them to files
## (the output/ directory must already exist)
## be patient! it will be slow.
# model_turkish = train_model("data/wordlist.tur", "output/trainedmodel.tur")
# model_english = train_model("data/wordlist.eng", "output/trainedmodel.eng")

In [4]:
## otherwise, use the trained model if it's already saved
morf_io = morfessor.MorfessorIO()

model_turkish = morf_io.read_binary_model_file("output/trainedmodel.tur")
model_english = morf_io.read_binary_model_file("output/trainedmodel.eng")
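
Switching between training and loading currently means commenting cells in and out. One possible alternative is a small helper that loads the saved model when the file exists and trains otherwise. `load_or_train` is a hypothetical name, not part of Morfessor. It takes the training and loading steps as callables, so the sketch itself runs without the library.

```python
import os

def load_or_train(model_path, train_fn, load_fn):
    """Load the model at model_path if it exists; otherwise train and save it."""
    if os.path.exists(model_path):
        return load_fn(model_path)
    # create the output directory before the trainer tries to write to it
    directory = os.path.dirname(model_path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    return train_fn(model_path)

# Hypothetical usage with the objects defined in this notebook (not run here):
# model_turkish = load_or_train(
#     "output/trainedmodel.tur",
#     train_fn=lambda path: train_model("data/wordlist.tur", path),
#     load_fn=morf_io.read_binary_model_file,
# )
```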

### Segmenting with Morfessor

A trained model can be used to segment a list of words. If you trust the model for your corpus (e.g., another paper reports results for it tested against a gold standard), this might be your last step.

The segmenter returns a tuple: the list of segments and the cost of that segmentation. The cost is a negative log-probability, so lower values mean more probable segmentations.

In [5]:
## segment data using the trained Turkish model

words_turkish = ["bilincinle", "eylemelerine", "komedidir", "uygarlaStIramadIklarImIzdanmISsInIzcasIna"]

for word in words_turkish:
    print(model_turkish.viterbi_segment(word))

(['bilincin', 'le'], 19.89222449520654)
(['eyleme', 'lerine'], 20.86651055320383)
(['komedi', 'dir'], 19.75392446733094)
(['uygar', 'laStI', 'ramadIklarI', 'mIzdan', 'mIS', 'sInIz', 'casIna'], 72.7934925399708)


In [6]:
## segment data using the trained English model

words_english = ["unbelievable", "presented", "quickly", "morphology", "morphological", "undecomposable", "the"]

for word in words_english:
    print(model_english.viterbi_segment(word))

(['un', 'believable'], 19.32113757371785)
(['present', 'ed'], 16.555310976357404)
(['quick', 'ly'], 17.118524084472)
(['morpholog', 'y'], 18.430928507097743)
(['morpholog', 'ical'], 21.07542358801313)
(['unde', 'compos', 'able'], 29.499639370389204)
(['the'], 9.775824751238426)
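
Each result is a `(segments, number)` tuple, so it can be unpacked directly. The number is positive, so it is best read as a cost (a negative log-probability) rather than a log-probability itself. The sketch below reuses one tuple copied from the English output above instead of calling the model. It assumes the cost uses the natural logarithm, so that `exp(-cost)` recovers the probability.

```python
import math

# one result copied from the English output above
result = (['un', 'believable'], 19.32113757371785)

segments, cost = result           # unpack the tuple
segmented = " + ".join(segments)  # human-readable form
probability = math.exp(-cost)     # assumes cost = -ln(probability)

print(segmented)
print("cost: %.2f, probability: %.3g" % (cost, probability))
```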


### Evaluation with Morfessor

A *gold standard* segmentation file is needed to assess the quality of a model. Evaluating a trained model against it yields precision, recall, and F-measure scores.

In [7]:
# need to set sample sizes
from morfessor.evaluation import EvaluationConfig

# 10 samples of 50 words
eval_config = EvaluationConfig(10, 50)

# get the data from the annotations files
morf_io = morfessor.MorfessorIO()
goldstd_turkish = morf_io.read_annotations_file("data/goldstdsample.tur.txt")
goldstd_english = morf_io.read_annotations_file("data/goldstdsample.eng.txt")

In [8]:
# build evaluator object and run evaluation against Turkish gold standard
evaluator_turkish = morfessor.MorfessorEvaluation(goldstd_turkish)
results_turkish = evaluator_turkish.evaluate_model(model_turkish, configuration=eval_config)
print(results_turkish)

Filename   : UNKNOWN
Num samples: 10
Sample size: 50.0
F-score    : 0.577
Precision  : 0.777
Recall     : 0.461


In [9]:
# build evaluator object and run evaluation against English gold standard
evaluator_english = morfessor.MorfessorEvaluation(goldstd_english)
results_english = evaluator_english.evaluate_model(model_english, configuration=eval_config)
print(results_english)

Filename   : UNKNOWN
Num samples: 10
Sample size: 50.0
F-score    : 0.7
Precision  : 0.714
Recall     : 0.69
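
To make the three scores concrete, here is a simplified sketch of boundary-based precision, recall, and F-score for a single word. It is not Morfessor's implementation, which among other things averages over samples of words. The helper names and the gold segmentation are made up for illustration.

```python
def boundaries(segments):
    """Return the set of character offsets where a segmentation splits a word."""
    positions = set()
    offset = 0
    for segment in segments[:-1]:   # no boundary after the last segment
        offset += len(segment)
        positions.add(offset)
    return positions

def boundary_scores(predicted, gold):
    """Precision, recall and F-score of predicted vs. gold boundaries."""
    pred_b, gold_b = boundaries(predicted), boundaries(gold)
    correct = len(pred_b & gold_b)
    precision = correct / len(pred_b) if pred_b else 1.0
    recall = correct / len(gold_b) if gold_b else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

predicted = ['un', 'believable']      # from the English model output
gold = ['un', 'believ', 'able']       # hypothetical gold analysis
print(boundary_scores(predicted, gold))
```

Here the one predicted boundary is correct (perfect precision), but one of the two gold boundaries is missed (recall of one half). That is the same pattern as the Turkish results above: high precision with low recall means the model splits too rarely.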


## Infixer Demonstration

This process targets language patterns that involve **non-concatenative** morphology, such as infixation and reduplication, which Morfessor handles poorly. The basic idea is to supervise the Morfessor algorithm with regular expressions: words that match the targeted patterns are **linearized**, i.e. their *non-concatenative* patterns are rewritten as *concatenative* ones.

The required modules (`model` and `eval_segment`, imported below) should be saved in the same directory as this notebook; I haven't built them into a library yet.

In [10]:
from model import InfixerModel
from eval_segment import InfixerSegmenter

In [11]:
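affix_list = [
    # infix after the first consonant, followed by a reduplicated CV syllable
    # (e.g. b-in-a-basa, b-um-a-basa)
    r'^(?P<CON>\w)(in)(?P<VOW>\w)((?P=CON)(?P=VOW)\w*)',
    r'^(?P<CON>\w)(um)(?P<VOW>\w)((?P=CON)(?P=VOW)\w*)',
    # infix after the first character, optionally preceded by "i" (e.g. b-in-asa)
    r'^(i?\w)(in)(\w+)',
    r'^(i?\w)(um)(\w+)',
    # reduplication of a word-initial sequence (e.g. ba-basahin)
    r'^(?P<REDUP>\w+)((?P=REDUP)\w+)']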

In [12]:
# get stored model and feature dict
infixer_io = morfessor.MorfessorIO()
model_tagalog = infixer_io.read_binary_model_file("output/tl_wiki_rdi_none_bin")
features_tagalog = InfixerModel.get_feature_dict_from_file("output/tl_wiki_rdi_none.json")

In [13]:
# set up a segmenter
segmenter_tagalog = InfixerSegmenter(model_tagalog, features_tagalog, affix_list)

words_tagalog = ["basa", "bumasa", "bumabasa", "binasa", "binabasa", "babasahin"]

segmentations_tagalog = segmenter_tagalog.segment_list(words_tagalog)

for (word, segmentation) in segmentations_tagalog:
    print(word+":", segmentation)

basa: basa
bumasa: <um> basa
bumabasa: <redup> <um> basa
binasa: <in> basa
binabasa: <redup> <in> basa
babasahin: <redup> basa hin
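
The linearization idea can be illustrated with the two `um` patterns from `affix_list` and nothing but the standard `re` module. This is only a sketch of the rewriting step, not the Infixer implementation: the `RULES` pairing and the `linearize` function are hypothetical, and the tag format simply mirrors the segmenter output above.

```python
import re

# Two patterns from affix_list, each paired with a hypothetical rewrite rule.
RULES = [
    # infix <um> plus a reduplicated first syllable: b-um-a-basa
    (re.compile(r'^(?P<CON>\w)(um)(?P<VOW>\w)((?P=CON)(?P=VOW)\w*)'),
     lambda m: "<redup> <um> " + m.group(4)),
    # plain infix <um> after the first consonant: b-um-asa
    (re.compile(r'^(i?\w)(um)(\w+)'),
     lambda m: "<um> " + m.group(1) + m.group(3)),
]

def linearize(word):
    """Rewrite the first matching non-concatenative pattern as tags + stem."""
    for pattern, rewrite in RULES:
        match = pattern.match(word)
        if match:
            return rewrite(match)
    return word  # no pattern matched: leave the word unchanged

for word in ["basa", "bumasa", "bumabasa"]:
    print(word + ":", linearize(word))
```

The more specific pattern comes first, since the plain infix pattern would also match `bumabasa` and miss the reduplication.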
