# Systematicity in English monomorphemic words by word class

### Sean Trott

Do certain word classes have more sub-morphemic systematicity than others?

**TO DO**:
* Use Levenshtein distance over phonemes, instead of orthography

## Load model and dataset

In [1]:
import os 
import gensim
import numpy as np
import pandas as pd
import re

# Variables
MODEL_PATH = os.environ['WORD2VEC_PATH']
ROOT_PATH = 'data/raw/roots_celex_monosyllabic.txt'

LOAD_MODEL = True

Using TensorFlow backend.


In [2]:
model = gensim.models.KeyedVectors.load_word2vec_format(MODEL_PATH, binary=True)


In [4]:
# Read the root list, closing the file handle when done
with open(ROOT_PATH, "r") as root_file:
    entries = root_file.read().split("\n")

In [5]:
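# Keep the first backslash-delimited field of each all-lowercase entry
# (str.islower() is False for empty strings, so blank lines are skipped too)
words = [entry.split("\\")[0] for entry in entries if entry.islower()]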
words[0]

'a'

## Filter to words that appear in the model

In [6]:
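# Deduplicate and keep only words with a vector in the model.
# Note: `model.vocab` is the gensim < 4.0 API; in gensim >= 4.0 use `model.key_to_index`.
critical_words = list({w for w in words if w in model.vocab})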

In [7]:
len(critical_words)

2082

## Obtain form and meaning similarity metrics

Here, we import the class `SystematicityUtilities` from a [custom library](https://github.com/seantrott/nlp_utilities). By default, this class measures *form* with *Levenshtein distance* (higher values mean less similar forms) and *meaning* with the *cosine similarity* between word vectors (higher values mean more similar meanings). The `compare_form_and_meaning` method used below compares every pair of words along both dimensions.
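
The library's internals are not shown in this notebook, so the following is only a minimal sketch of what a pairwise form/meaning comparison could look like, not the actual implementation of `compare_form_and_meaning`. The function names `levenshtein` and `cosine_similarity` and the two-dimensional `toy_vectors` are hypothetical and invented for illustration.

```python
from itertools import combinations
from math import comb, sqrt


def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of symbols)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-d "embeddings", invented for illustration only.
toy_vectors = {"glow": [0.9, 0.1], "gleam": [0.8, 0.3], "dog": [0.1, 0.9]}

# One row per unordered word pair.
comparisons = [
    {"w1": w1, "w2": w2,
     "form": levenshtein(w1, w2),
     "meaning": cosine_similarity(toy_vectors[w1], toy_vectors[w2])}
    for w1, w2 in combinations(sorted(toy_vectors), 2)
]

# n words yield n * (n - 1) / 2 unordered pairs.
n_pairs = comb(len(toy_vectors), 2)
```

Comparing each unordered pair once gives n(n - 1)/2 rows; for 2,082 words that is 2,166,321, which matches the number of comparisons reported below. Because `levenshtein` accepts any sequence, it would also work over lists of phoneme symbols rather than letters, as the to-do item at the top suggests.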

In [8]:
from nlp_utilities.compling import SystematicityUtilities
systematicity_utils = SystematicityUtilities(model)
comparisons = systematicity_utils.compare_form_and_meaning(critical_words)

In [9]:
import pandas as pd

In [10]:
comparisons_df = pd.DataFrame.from_dict(comparisons)

In [11]:
print("{length} comparisons total".format(length=len(comparisons_df)))

2166321 comparisons total


In [12]:
comparisons_df.sample(4)

Unnamed: 0,form,meaning,w1,w2
281769,5,-0.020309,feign,port
1115572,4,0.380588,snoot,steed
1078249,4,0.168448,blah,flaunt
638521,4,0.064024,tyke,cause


## Global correlation

In [13]:
from scipy.stats import linregress

In [14]:
true_regression = linregress(comparisons_df['form'], comparisons_df['meaning'])
true_regression.rvalue

-0.040672612879521695

In other words, the correlation is weak but negative: word pairs with more **form differences** (i.e. a higher Levenshtein distance) tend to have *less* similar meanings (i.e. a lower cosine similarity).

## Compare global correlation to permuted distributions

In [15]:
import numpy as np

In [16]:
permuted_results = []
for permute in range(100):
    permuted_meaning = np.random.permutation(comparisons_df['meaning'])
    random_regression = linregress(comparisons_df['form'], permuted_meaning)
    permuted_results.append(random_regression)

In [17]:
permuted_cors = [reg.rvalue for reg in permuted_results]

Now we can compare the *true correlation* with the distribution of correlations obtained by shuffling our dataset.

In [18]:
# One-sided test: count permuted correlations at least as negative as the observed one
as_extreme = [cor for cor in permuted_cors if cor <= true_regression.rvalue]
p_global = len(as_extreme) / len(permuted_cors)
p_global

0.0
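
With only 100 permutations, the smallest non-zero proportion is 0.01, so a value of 0.0 is best read as p < 0.01 rather than as a probability of exactly zero. A common convention is to count the observed value as one of the permutations, which keeps the estimate above zero. The sketch below illustrates that convention on invented toy data using only the standard library; `pearson_r` and `permutation_p_value` are hypothetical helper names, not part of this notebook or of SciPy.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def permutation_p_value(form, meaning, n_permutations=99, seed=0):
    """One-sided permutation test for a negative form-meaning correlation."""
    rng = random.Random(seed)
    observed = pearson_r(form, meaning)
    shuffled = list(meaning)
    n_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the pairing between form and meaning
        if pearson_r(form, shuffled) <= observed:
            n_as_extreme += 1
    # The +1 terms count the observed correlation as one permutation.
    return observed, (n_as_extreme + 1) / (n_permutations + 1)


# Toy data (invented): meaning similarity falls linearly with form distance.
form = list(range(1, 21))
meaning = [1.0 - 0.04 * f for f in form]
observed, p_value = permutation_p_value(form, meaning)
```

Under this convention, 99 permutations with no permuted correlation as extreme as the observed one give p = 1/100 = 0.01.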

## Systematicity coefficients for each word

Now, we can use leave-one-out regression to determine how each word contributes to the overall correlation. For each word, we remove all comparisons involving that word, then take the global correlation again, and compare that score to the original correlation.

The original correlation is negative, so the sign of the difference is read as follows. If **original** - **new** is negative, removing the word yields a *weaker* correlation (i.e. closer to 0), which suggests that the word contributed **form-meaning systematicity** to the correlation.

If **original** - **new** is positive, removing the word yields a *stronger* correlation (i.e. more negative, further from 0), which suggests that the word contributed **form-meaning arbitrariness** to the correlation.

Thus:
* **Negative** impact values suggest a word is more systematic
* **Positive** impact values suggest a word is more arbitrary
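
To make the sign convention concrete, here is a small standard-library sketch of the leave-one-out computation on toy data. The words `a` to `e` and all form and meaning values are invented for illustration, and `pearson_r` and `form_meaning_r` are hypothetical helpers standing in for the `linregress` call used in this notebook.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy pairwise comparisons: (w1, w2, form distance, meaning similarity).
rows = [
    ("a", "b", 1, 0.9), ("a", "c", 2, 0.5), ("a", "d", 3, 0.2), ("a", "e", 4, 0.1),
    ("b", "c", 1, 0.3), ("b", "d", 2, 0.6), ("b", "e", 3, 0.4),
    ("c", "d", 1, 0.7), ("c", "e", 2, 0.2),
    ("d", "e", 1, 0.1),
]


def form_meaning_r(rows):
    """Correlation between the form and meaning columns of the given rows."""
    return pearson_r([r[2] for r in rows], [r[3] for r in rows])


overall_r = form_meaning_r(rows)  # negative in this toy data
words = sorted({w for r in rows for w in r[:2]})

impact = {}
for word in words:
    # Drop every comparison that involves the current word.
    kept = [r for r in rows if word not in (r[0], r[1])]
    # Impact = original correlation minus leave-one-out correlation.
    impact[word] = overall_r - form_meaning_r(kept)
```

In this toy data the pairs involving `a` carry most of the negative form-meaning trend, so removing `a` pushes the correlation toward zero and `impact["a"]` comes out negative, which marks `a` as systematic.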

In [20]:
word_to_systematicity = {}

In [36]:
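for index, word in enumerate(critical_words, start=1):
    print("{pct}% done...".format(pct=round(index/len(critical_words), 2)*100))
    # Drop every comparison that involves the current word
    df_copy = comparisons_df[(comparisons_df['w1'] != word) & (comparisons_df['w2'] != word)]
    # Recompute the form-meaning correlation without that word
    new_correlation = linregress(df_copy['form'], df_copy['meaning'])
    # Impact = original correlation minus leave-one-out correlation
    word_to_systematicity[word] = true_regression.rvalue - new_correlation.rvalue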

0.0% done...
0.0% done...
0.0% done...
0.0% done...
17.0% done...
33.0% done...
50.0% done...
100.0% done...


In [40]:
len(word_to_systematicity)

2082

In [43]:
words_systematicity_df = pd.DataFrame.from_dict({'word': list(word_to_systematicity.keys()),
                                                 'impact': list(word_to_systematicity.values())})

In [45]:
words_systematicity_df.sample(4)

Unnamed: 0,impact,word
200,5.1e-05,warm
738,-2.5e-05,sad
620,2.5e-05,tote
1146,5.8e-05,spoil


In [52]:
words_systematicity_df['word_length'] = words_systematicity_df['word'].str.len()

In [54]:
words_systematicity_df.sample(4)

Unnamed: 0,impact,word,word_length
1153,-3.1e-05,adze,4
607,-1.1e-05,plead,5
951,8.3e-05,urge,4
1270,-5e-06,doom,4


In [76]:
impact_by_length = linregress(words_systematicity_df['word_length'], words_systematicity_df['impact'])
impact_by_length

LinregressResult(slope=1.8222392564057991e-05, intercept=-7.9401498039186234e-05, rvalue=0.12023504769825923, pvalue=3.7364486442637083e-08, stderr=3.2989866433695122e-06)

Regressing **impact** on **word length** yields a small but reliable positive correlation (r ≈ 0.12, p < .001), suggesting that longer words tend to be *less* systematic. (Recall that a more positive "impact" value means that *removing* the word strengthened the correlation between form and meaning, so the word is less systematic.)

### Merge with concreteness data

Here, we merge our data with the concreteness data that Reilly et al. use. Unfortunately, this loses a lot of words: the merged data frame has only 624 rows, compared with the 2,082 words we started with.

In [67]:
concreteness = pd.read_csv("data/raw/reilly_data.csv")

In [80]:
concreteness['word'] = concreteness.WORD.str.lower()

In [82]:
concreteness['CNC'] = concreteness['CNC'].apply(lambda x: int(x) if x != "-" else None)

In [83]:
# Inner join on the shared lowercase 'word' column; unmatched words are dropped
merged_df = pd.merge(concreteness, words_systematicity_df, on='word')

In [89]:
merged_df.sample(4)

Unnamed: 0,WORD,Description,ID,block,BFRQ,CNC,FAM,IMG,KFFRQ,NLET,...,Freq_HAL,Log_Freq_HAL,I_Mean_RT,I_Mean_Accuracy,I_NMG_Mean_RT,I_NMG_Mean_Accuracy,Klattese,word,impact,word_length
413,WAIST,high-img,1961,1,1,563.0,540,530,11.0,5,...,5276.0,8.57,601.39,0.97,574.42,1.0,west,waist,8.038293e-05,5
553,RACK,medium-img,2630,3,-,535.0,486,439,9.0,4,...,11655.0,9.36,622.41,1.0,586.41,1.0,r@k,rack,-7.222829e-05,4
183,FLUTE,high-img,1113,1,,587.0,496,581,1.0,5,...,3491.0,8.16,607.41,0.97,639.79,1.0,flut,flute,3.500315e-05,5
4,CAN,low-img,74,0,501,365.0,620,369,1772.0,3,...,1625073.0,14.3,596.41,1.0,626.62,1.0,k@n,can,-1.988884e-07,3


In [96]:
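# Note: dropna() with no arguments drops rows with a missing value in *any* column,
# not only rows that lack a concreteness (CNC) rating
merged_df = merged_df.dropna()
linregress(merged_df['CNC'], merged_df['impact'])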

LinregressResult(slope=-1.1361586443697379e-07, intercept=8.6271205037920915e-05, rvalue=-0.081925747672091931, pvalue=0.26498725415227203, stderr=1.0161792705768856e-07)

The correlation between concreteness (`CNC`) and impact is weak and not statistically significant (r ≈ -0.08, p ≈ 0.26), so these data give no clear evidence that concreteness predicts a word's systematicity.

In [72]:
len(merged_df)

624
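
One way to see where words are lost is to merge with an indicator column and to drop missing values only in the columns that the regression actually uses. The sketch below shows this on toy data. The column names mirror the ones in this notebook, but every value is invented, and `n_unmatched` and `complete` are hypothetical names.

```python
import pandas as pd

# Toy stand-ins for the two tables (all values invented for illustration).
systematicity = pd.DataFrame({
    "word": ["warm", "sad", "tote", "spoil"],
    "impact": [5e-05, -2e-05, 3e-05, 6e-05],
})
norms = pd.DataFrame({
    "WORD": ["WARM", "SAD", "SPOIL"],
    "CNC": ["410", "-", "355"],   # "-" marks a missing rating
    "BFRQ": ["12", "-", None],    # an unrelated column with a missing value
})
norms["word"] = norms["WORD"].str.lower()
# Convert ratings to numbers; non-numeric entries such as "-" become NaN.
norms["CNC"] = pd.to_numeric(norms["CNC"], errors="coerce")

# A left join keeps every word; `_merge` records whether a match was found.
merged = pd.merge(systematicity, norms, on="word", how="left", indicator=True)
n_unmatched = int((merged["_merge"] == "left_only").sum())

# Drop rows only when a column used in the regression is missing.
complete = merged.dropna(subset=["CNC", "impact"])
```

In this toy example one word has no match in the norms, and restricting `dropna` to `CNC` and `impact` keeps two rows where a bare `dropna()` would keep only one.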

In [73]:
merged_df.to_csv("data/processed/systematicity_plus_concreteness.csv")