# Learning form-meaning systematicity 

## Sean Trott

Is there learnable form-meaning systematicity at the sub-morphemic level, beyond what we'd expect by chance? And does it occur more in particular syllable components than others (e.g. *onsets* vs. *codas* vs. *nuclei*)?

In this analysis, we ask whether a linear SVM classifier can learn to predict a target label (e.g. a particular *onset*, such as **l**) from the embedding of the word that carries that label.
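As a rough illustration of this setup, the sketch below fits a linear SVM to synthetic vectors and compares its cross-validated accuracy with the same classifier fit to shuffled labels. It assumes scikit-learn's `LinearSVC` and `cross_val_score`. The synthetic data, the three labels, and the shuffled-label baseline are stand-ins chosen for this example, not necessarily the procedure implemented in `src.build_models`.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-in for word embeddings: 3 labels, 40 vectors each, 20 dimensions.
labels = np.repeat(["l", "s", "t"], 40)
centers = {lab: rng.normal(size=20) for lab in ("l", "s", "t")}
X = np.array([centers[lab] + 0.3 * rng.normal(size=20) for lab in labels])


def mean_cv_accuracy(X, y, folds=5):
    """Mean accuracy of a linear SVM under k-fold cross-validation."""
    clf = LinearSVC(C=1.0, max_iter=10000)
    return cross_val_score(clf, X, y, cv=folds, scoring="accuracy").mean()


observed = mean_cv_accuracy(X, labels)

# Shuffling the labels breaks any link between vectors and labels,
# giving an empirical estimate of chance-level accuracy.
shuffled = mean_cv_accuracy(X, rng.permutation(labels))

print(f"observed: {observed:.2f}, shuffled: {shuffled:.2f}")
```

If embeddings carry information about a label, the observed accuracy should sit clearly above the shuffled-label accuracy; if not, the two should be close.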


In [62]:
import os 
import gensim
import numpy as np
import pandas as pd
import re

import src.build_models as model_utils

# Variables
MODEL_PATH = os.environ['WORD2VEC_PATH']
ROOT_PATH = 'data/raw/roots_celex_monosyllabic.txt'

## Loading the data

First we load our data. We use only monosyllabic, monomorphemic words from the CELEX database, and we require that each word contains no capital letters and is at least three characters long.

In [23]:
MIN_LENGTH = 3

In [48]:
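def construct_syllable_structure(syllable, nuclei='[5@694{8312i7u$#eqFEIQVU$]|ju'):
    """Return dict with the onset, nucleus, and coda of a phonetic transcription.

    The nucleus is the first match of `nuclei`; returns None if nothing matches.
    """
    match = re.search(nuclei, syllable)
    if match is None:
        return None
    # Slice around the match rather than calling str.split, which would fail to
    # unpack if the nucleus string occurred more than once.
    return {'nucleus': match.group(0),
            'onset': syllable[:match.start()],
            'coda': syllable[match.end():]}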
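To make the split concrete, here is a small sketch that applies the same nucleus pattern to a few transcriptions. It uses `re.split` with a capture group as a compact alternative to the function above. `'1s'` is the transcription of *ace* from this notebook; the other two strings are illustrative transcriptions made up for the example.

```python
import re

# Nucleus pattern copied from the default argument of construct_syllable_structure.
NUCLEI = '[5@694{8312i7u$#eqFEIQVU$]|ju'

parsed = {}
for transcription in ['1s', 'strIN', 'kjut']:
    # A capture group makes re.split keep the matched nucleus;
    # maxsplit=1 splits only at the first nucleus.
    onset, nucleus, coda = re.split('(' + NUCLEI + ')', transcription, maxsplit=1)
    parsed[transcription] = (onset, nucleus, coda)
    print(transcription, '->', (onset, nucleus, coda))
```

Note that the two-character sequence `ju` is treated as a single nucleus, so `'kjut'` yields the onset `'k'` rather than `'kj'`.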

In [49]:
with open(ROOT_PATH, "r") as f:
    entries = f.read().split("\n")
entries[0]

'a\\1\\1\\1'

In [50]:
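# Keep non-empty, all-lowercase entries as (orthography, transcription) pairs,
# i.e. the first and last backslash-separated fields.
words = [(entry.split("\\")[0], entry.split("\\")[-1])
         for entry in entries
         if entry != "" and entry.islower()]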
words[0]

('a', '1')

In [51]:
words = [(w[0], construct_syllable_structure(w[1])) for w in words if len(w[0]) >= MIN_LENGTH]
words[0]

('ace', {'coda': 's', 'nucleus': '1', 'onset': ''})

In [53]:
roots_to_syllables = dict(words)
roots_to_syllables['ace']

{'coda': 's', 'nucleus': '1', 'onset': ''}

In [55]:
N = len(roots_to_syllables)
print("This resulted in an initial set of {num_tokens} tokens.".format(num_tokens=N))

This resulted in an initial set of 2069 tokens.
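Since classifier accuracy has to be judged against chance, it can help to look at how often each label occurs before training anything. The sketch below counts onset labels in a toy mapping shaped like `roots_to_syllables` and reports the share of the most frequent label. The toy entries (other than *ace*) and the helper `label_counts` are hypothetical, introduced only for this example.

```python
from collections import Counter

# Hypothetical mapping shaped like `roots_to_syllables`
# (transcriptions other than 'ace' are illustrative).
toy_roots = {
    'ace':  {'onset': '',  'nucleus': '1', 'coda': 's'},
    'lake': {'onset': 'l', 'nucleus': '1', 'coda': 'k'},
    'lick': {'onset': 'l', 'nucleus': 'I', 'coda': 'k'},
    'sick': {'onset': 's', 'nucleus': 'I', 'coda': 'k'},
}


def label_counts(roots, component):
    """Count non-empty labels for one syllable component, skipping unparsed words."""
    return Counter(s[component] for s in roots.values() if s and s[component] != '')


onsets = label_counts(toy_roots, 'onset')
# Accuracy of always guessing the most frequent label.
majority_share = max(onsets.values()) / sum(onsets.values())

print(onsets.most_common())      # [('l', 2), ('s', 1)]
print(round(majority_share, 2))  # 0.67
```

A skewed label distribution raises the accuracy that a trivial majority-class guess can reach, which is one reason to compare against an empirical baseline rather than `1 / number_of_labels`.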


## Loading the word2vec model and creating our dataset

In [60]:
model = gensim.models.KeyedVectors.load_word2vec_format(MODEL_PATH, binary=True)

In [3]:
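def create_dataset(roots_to_syllables, model, syllable_component='nucleus'):
    """Return (X, y, words) for the specified syllable component (e.g. 'nucleus').

    X holds the word vectors, y the component labels, and words the matching roots.
    Roots missing from the model, or with no parsed or an empty component, are skipped.
    """
    X, y, words = [], [], []
    for root, syllable in roots_to_syllables.items():
        # `syllable` is None when no nucleus was found in the transcription.
        if syllable is None or root not in model:
            continue
        syl = syllable[syllable_component]
        if syl != '':
            X.append(model[root])
            y.append(syl)
            words.append(root)
    return np.array(X), np.array(y), words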
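The effect of the two filters (out-of-vocabulary roots and empty components) is easiest to see on a toy example. The sketch below mirrors the selection logic inline for each syllable component. The plain dict `toy_vectors` is a hypothetical stand-in for the word2vec model (it supports the same `in` and `[]` operations used above), and `toy_syllables` is made-up data.

```python
import numpy as np

# Hypothetical stand-ins: a dict of 4-d vectors in place of the word2vec model,
# and a mapping shaped like `roots_to_syllables`.
toy_vectors = {'ace': np.ones(4), 'lick': np.zeros(4)}
toy_syllables = {
    'ace':  {'onset': '',  'nucleus': '1', 'coda': 's'},
    'lick': {'onset': 'l', 'nucleus': 'I', 'coda': 'k'},
    'sick': {'onset': 's', 'nucleus': 'I', 'coda': 'k'},  # no vector: skipped
}

shapes = {}
for component in ('onset', 'nucleus', 'coda'):
    # Keep roots that have a vector and a non-empty label for this component.
    rows = [(toy_vectors[root], syl[component])
            for root, syl in toy_syllables.items()
            if root in toy_vectors and syl[component] != '']
    X = np.array([vec for vec, _ in rows])
    y = np.array([label for _, label in rows])
    shapes[component] = X.shape
    print(component, X.shape, y.tolist())
```

Here *ace* drops out of the onset dataset because its onset is empty, so the three components can end up with different numbers of training examples.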