# Systematicity in English monomorphemic words by word class

### Sean Trott

Do certain word classes have more sub-morphemic systematicity than others?

**TO DO**:
* Use Levenshtein distance over phonemes, instead of orthography
* Relate to word features: grammatical class, AoA, Concreteness

## Load model and dataset

In [69]:
import os 
import gensim
import numpy as np
import pandas as pd
import re
from statsmodels.formula.api import ols

# Variables
MODEL_PATH = os.environ['WORD2VEC_PATH']
ROOT_PATH = 'data/raw/roots_celex_monosyllabic.txt'

LOAD_MODEL = True

In [70]:
model = gensim.models.KeyedVectors.load_word2vec_format(MODEL_PATH, binary=True)

In [71]:
# Read the CELEX root entries, closing the file handle when done
with open(ROOT_PATH, "r") as root_file:
    entries = root_file.read().split("\n")

In [72]:
words = [entry.split("\\")[0] for entry in entries if entry != "" and entry.islower()]
words[0]

'a'

## Filter by words that appear in model

In [73]:
critical_words = list(set([w for w in words if w in model.vocab]))

In [74]:
len(critical_words)

2082

## Obtain form and meaning similarity metrics

Here, we import the class `SystematicityUtilities` from a [custom library](https://github.com/seantrott/nlp_utilities). By default, this class uses *Levenshtein distance* as its metric for *form similarity*, and *cosine similarity* as its metric for *meaning similarity*. The `compare_form_and_meaning` method used below compares every word pair along form and meaning dimensions.
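
To make the two default metrics concrete, here is a minimal sketch of both. It is not the implementation in `nlp_utilities`; the function names `levenshtein` and `cosine_similarity` are assumptions for illustration, and the phoneme transcriptions are hypothetical. Because the edit-distance function accepts any sequence, the same code would also work over lists of phonemes, as the to-do list above suggests.

```python
import math


def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of phonemes)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Orthographic comparison, as in the notebook
print(levenshtein("joy", "boy"))

# Hypothetical phoneme sequences, to show that lists work as well as strings
print(levenshtein(["k", "æ", "t"], ["k", "ɑ", "r", "t"]))

# Toy vectors standing in for word embeddings
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

Note that the two metrics point in opposite directions: Levenshtein distance grows as forms diverge, while cosine similarity shrinks as meanings diverge. A systematic lexicon therefore shows up as a *negative* correlation between them.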

In [75]:
from nlp_utilities.compling import SystematicityUtilities
systematicity_utils = SystematicityUtilities(model)
comparisons = systematicity_utils.compare_form_and_meaning(critical_words)

In [78]:
comparisons_df = pd.DataFrame.from_dict(comparisons)

In [79]:
print("{length} comparisons total".format(length=len(comparisons_df)))

2166321 comparisons total


In [80]:
comparisons_df.sort_values('form').head(n=10)

| | form | meaning | w1 | w2 |
|---|---|---|---|---|
| 1567657 | 1 | 0.199551 | joy | boy |
| 501172 | 1 | 0.089559 | dorm | corm |
| 1164769 | 1 | -0.003794 | coax | hoax |
| 1247141 | 1 | 0.266609 | hap | ha |
| 422626 | 1 | 0.094712 | clap | claw |
| 2077351 | 1 | -0.002984 | ask | asp |
| 1209026 | 1 | 0.032284 | plan | clan |
| 501166 | 1 | 0.081128 | dorm | doom |
| 571643 | 1 | 0.151699 | fab | fay |
| 1988738 | 1 | 0.059454 | wide | wipe |


## Global correlation

In [87]:
from scipy.stats import linregress

In [88]:
true_regression = linregress(comparisons_df['form'], comparisons_df['meaning'])
print("r={r}, p={p}".format(r=true_regression.rvalue, p=true_regression.pvalue))

r=-0.040672612879521675, p=0.0


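In other words, the correlation is negative but weak: word pairs with a higher **form distance** (i.e. a higher Levenshtein distance) tend to have slightly lower **meaning similarity** (i.e. cosine similarity). With over two million comparisons, even a correlation this small produces a p-value that underflows to 0.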

## Compare global correlation to permuted distributions

In [84]:
permuted_results = []
for permute in range(10):
    permuted_meaning = np.random.permutation(list(comparisons_df['meaning']))
    random_regression = linregress(comparisons_df['form'], permuted_meaning)
    permuted_results.append(random_regression)

In [85]:
permuted_cors = [reg.rvalue for reg in permuted_results]

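Now we can compare the *true correlation* with the distribution of correlations obtained by shuffling our dataset. Because only 10 permutations are run here, the resulting p-value can only take values in steps of 0.1.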

In [86]:
# Permuted correlations that are at least as negative as the true correlation
as_extreme = [cor for cor in permuted_cors if cor <= true_regression.rvalue]
p_global = len(as_extreme) / len(permuted_cors)
p_global

0.0
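
The permutation comparison can also be sketched as a reusable function. This is a minimal sketch under stated assumptions, not the notebook's own code: `permutation_p_value` is a hypothetical helper, it uses `np.corrcoef` rather than `linregress`, the data are synthetic, and it applies the common add-one correction so that the estimated p-value is never exactly zero.

```python
import numpy as np


def permutation_p_value(form, meaning, n_permutations=1000, seed=0):
    """One-sided permutation p-value for a negative form-meaning correlation."""
    rng = np.random.default_rng(seed)
    form = np.asarray(form, dtype=float)
    meaning = np.asarray(meaning, dtype=float)
    observed = np.corrcoef(form, meaning)[0, 1]
    # Count shuffles whose correlation is at least as negative as the observed one
    as_extreme = sum(
        np.corrcoef(form, rng.permutation(meaning))[0, 1] <= observed
        for _ in range(n_permutations)
    )
    # Add-one correction: the observed ordering counts as one permutation
    p_value = (as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Synthetic data (not from the notebook): meaning decreases with form distance
data_rng = np.random.default_rng(1)
form = data_rng.integers(1, 8, size=500)
meaning = -0.3 * form + data_rng.normal(size=500)

r, p = permutation_p_value(form, meaning)
print("r={:.3f}, p={:.4f}".format(r, p))
```

Raising `n_permutations` gives a finer-grained p-value at the cost of runtime, which matters with millions of word pairs.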

## Systematicity coefficients for each word

Now we can use a leave-one-out procedure to determine how each word contributes to the overall correlation. For each word, we remove all comparisons involving that word, recompute the global correlation, and compare that score to the original correlation. This follows the procedure in [Monaghan et al., 2014](http://rstb.royalsocietypublishing.org/content/369/1651/20130299.short).

Recall that the **original** correlation was negative. So if **original** - **new** is negative, removing the word yields a *weaker* correlation (i.e. closer to 0), which suggests that the word provided a source of **form-meaning systematicity** to the correlation.

If **original** - **new** is positive, removing the word yields a *stronger* correlation (i.e. further from 0), which suggests that the word provided a source of **form-meaning arbitrariness** to the correlation.

Thus:
* **Negative** impact values suggest a word is more systematic
* **Positive** impact values suggest a word is more arbitrary
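
The sign convention above can be checked on a toy example. This is a minimal sketch, not the notebook's pipeline: the words `w`, `x`, `y`, `z` and all distances and similarities are hypothetical, and `impact_scores` is an illustrative helper that uses `np.corrcoef` in place of `linregress`.

```python
import numpy as np

# Hypothetical word pairs: (w1, w2, form distance, meaning similarity).
# Pairs involving "w" are constructed so that closer forms have closer meanings.
pairs = [
    ("w", "x", 1, 0.8),
    ("w", "y", 2, 0.5),
    ("w", "z", 4, 0.1),
    ("x", "y", 1, 0.2),
    ("x", "z", 2, 0.3),
    ("y", "z", 4, 0.4),
]


def correlation(rows):
    """Pearson correlation between form distance and meaning similarity."""
    form = [row[2] for row in rows]
    meaning = [row[3] for row in rows]
    return np.corrcoef(form, meaning)[0, 1]


def impact_scores(rows):
    """Original correlation minus the correlation with each word left out."""
    original = correlation(rows)
    words = sorted({word for row in rows for word in row[:2]})
    scores = {}
    for word in words:
        remaining = [row for row in rows if word not in row[:2]]
        scores[word] = original - correlation(remaining)
    return scores


scores = impact_scores(pairs)
for word, impact in sorted(scores.items(), key=lambda item: item[1]):
    print("{word}: {impact:+.3f}".format(word=word, impact=impact))
```

In this toy setup the pairs involving `w` carry the negative form-meaning relation, so `w` should receive the most negative impact. With so few pairs the scores are unstable, which is one reason the per-word impacts in the full dataset are so small in magnitude.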

In [92]:
word_to_systematicity = {}

In [93]:
for index, word in enumerate(critical_words, start=1):
    if index % 100 == 0:
        # Format as a whole percentage to avoid float artifacts such as 14.000000000000002
        print("{pct:.0f}% done...".format(pct=100 * index / len(critical_words)))
    # Drop every comparison involving this word, then recompute the correlation
    df_copy = comparisons_df[(comparisons_df['w1'] != word) & (comparisons_df['w2'] != word)]
    new_correlation = linregress(df_copy['form'], df_copy['meaning'])
    word_to_systematicity[word] = true_regression.rvalue - new_correlation.rvalue

5% done...
10% done...
14% done...
19% done...
24% done...
29% done...
34% done...
38% done...
43% done...
48% done...
53% done...
58% done...
62% done...
67% done...
72% done...
77% done...
82% done...
86% done...
91% done...
96% done...


In [94]:
len(word_to_systematicity)

2082

In [95]:
word_to_systematicity['mute']

-1.0503107176006166e-05

In [96]:
words_systematicity_df = pd.DataFrame.from_dict({'word': list(word_to_systematicity.keys()),
                                                 'impact': list(word_to_systematicity.values())})

In [99]:
words_systematicity_df.sort_values('impact').head(4)

| | word | impact |
|---|---|---|
| 583 | pleased | -0.001439 |
| 438 | strained | -0.000916 |
| 2029 | rights | -0.000891 |
| 562 | fraught | -0.000799 |


In [100]:
words_systematicity_df['word_length'] = words_systematicity_df['word'].str.len()

In [101]:
# Use a new name so the regression does not overwrite the word2vec `model`
length_model = ols("impact ~ word_length", words_systematicity_df).fit()
length_model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | impact | R-squared: | 0.014 |
| Model: | OLS | Adj. R-squared: | 0.014 |
| Method: | Least Squares | F-statistic: | 30.51 |
| Date: | Tue, 04 Sep 2018 | Prob (F-statistic): | 3.74e-08 |
| Time: | 16:43:25 | Log-Likelihood: | 15410.0 |
| No. Observations: | 2082 | AIC: | -30820.0 |
| Df Residuals: | 2080 | BIC: | -30800.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -7.94e-05 | 1.47e-05 | -5.389 | 0.000 | -0.000 | -5.05e-05 |
| word_length | 1.822e-05 | 3.3e-06 | 5.524 | 0.000 | 1.18e-05 | 2.47e-05 |

| | | | |
|---|---|---|---|
| Omnibus: | 670.324 | Durbin-Watson: | 2.04 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9327.518 |
| Skew: | -1.12 | Prob(JB): | 0.0 |
| Kurtosis: | 13.125 | Cond. No. | 21.3 |


## Write data to file

In [89]:
comparisons_df.to_csv("data/processed/wordpair_comparisons.csv")

In [102]:
words_systematicity_df.to_csv("data/processed/all_words_systematicity.csv")