# Systematicity in English monomorphemic words by word class

### Sean Trott

Do certain word classes have more sub-morphemic systematicity than others?

**TO DO**:
* Use Levenshtein distance over phonemes, instead of orthography
* Relate to word features: grammatical class, AoA, Concreteness

## Load model and dataset

In [1]:
import os 
import gensim
import numpy as np
import pandas as pd
import re
from statsmodels.formula.api import ols

# Variables
MODEL_PATH = os.environ['WORD2VEC_PATH']
ROOT_PATH = 'data/raw/roots_celex_monosyllabic.txt'

LOAD_MODEL = True

In [2]:
model = gensim.models.KeyedVectors.load_word2vec_format(MODEL_PATH, binary=True)

In [3]:
# Read the CELEX root entries, closing the file handle when done
with open(ROOT_PATH, "r") as root_file:
    entries = root_file.read().split("\n")

In [4]:
words = [entry.split("\\")[0] for entry in entries if entry != "" and entry.islower()]
words[0]

'a'

## Filter by words that appear in model

In [5]:
critical_words = list(set([w for w in words if w in model.vocab]))

In [6]:
len(critical_words)

2082

## Obtain form and meaning similarity metrics

Here, we import the class `SystematicityUtilities` from a [custom library](https://github.com/seantrott/nlp_utilities). By default, this class uses *Levenshtein distance* as its metric for *form similarity*, and *cosine similarity* as its metric for *meaning similarity*. The `compare_form_and_meaning` method used below compares every word pair along form and meaning dimensions.

In [7]:
from nlp_utilities.compling import SystematicityUtilities
systematicity_utils = SystematicityUtilities(model)
comparisons = systematicity_utils.compare_form_and_meaning(critical_words)
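
The internals of `SystematicityUtilities` are not shown in this notebook, so the following is only a minimal sketch of what a pairwise form/meaning comparison could look like, assuming orthographic Levenshtein distance and cosine similarity as described above. The helper names `levenshtein`, `cosine_similarity`, and `compare_pairs`, and the toy two-dimensional vectors, are hypothetical and are not part of `nlp_utilities`.

```python
from itertools import combinations

import numpy as np


def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def compare_pairs(words, vectors):
    """Return one record per unordered word pair."""
    records = []
    for w1, w2 in combinations(words, 2):
        records.append({'w1': w1, 'w2': w2,
                        'form': levenshtein(w1, w2),
                        'meaning': cosine_similarity(vectors[w1], vectors[w2])})
    return records


# Toy vectors standing in for word2vec embeddings (made up for illustration).
toy_vectors = {'cat': np.array([1.0, 0.0]),
               'cot': np.array([0.0, 1.0]),
               'dog': np.array([1.0, 1.0])}
toy_comparisons = compare_pairs(list(toy_vectors), toy_vectors)
```

Comparing every unordered pair of n words yields n(n - 1)/2 records, so 2082 words give 2082 × 2081 / 2 = 2,166,321 pairs. Because this `levenshtein` works on any sequence, passing lists of phoneme symbols instead of strings would give a phoneme-level distance, which is the first item in the to-do list.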

In [8]:
import pandas as pd

In [9]:
comparisons_df = pd.DataFrame.from_dict(comparisons)

In [10]:
print("{length} comparisons total".format(length=len(comparisons_df)))

2166321 comparisons total


In [11]:
comparisons_df.sample(4)

Unnamed: 0,form,meaning,w1,w2
395754,3,0.143498,down,post
1057731,3,0.095133,rile,whale
361919,5,0.181221,skirl,hole
217214,5,0.118652,smile,leek


## Global correlation

In [12]:
from scipy.stats import linregress

In [13]:
true_regression = linregress(comparisons_df['form'], comparisons_df['meaning'])
true_regression.rvalue

-0.04067261287952176

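In other words, word pairs with higher **form distance** (i.e. a higher Levenshtein distance) tend to have slightly lower **meaning similarity** (i.e. cosine similarity). The relationship is weak: with r ≈ -0.04, form distance accounts for less than 0.2% of the variance in meaning similarity.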

## Compare global correlation to permuted distributions

In [14]:
import numpy as np

In [15]:
permuted_results = []
for permute in range(10):
    permuted_meaning = np.random.permutation(comparisons_df['meaning'])
    random_regression = linregress(comparisons_df['form'], permuted_meaning)
    permuted_results.append(random_regression)

In [16]:
permuted_cors = [reg.rvalue for reg in permuted_results]

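Now we can compare the *true correlation* with the distribution of correlations obtained by shuffling the meaning scores. The p-value is the proportion of shuffled correlations that are at least as negative as the true one.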

In [17]:
# Permuted correlations at least as negative as the observed correlation
as_extreme = [cor for cor in permuted_cors if cor <= true_regression.rvalue]
p_global = len(as_extreme) / len(permuted_cors)
p_global

0.0
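
With only 10 permutations, the smallest non-zero p-value this procedure can return is 0.1, so a result of 0.0 only indicates that none of the 10 shuffles produced a correlation as negative as the observed one. A common alternative, sketched below on synthetic data, uses more permutations and adds one to both the numerator and the denominator so that the estimate is never exactly zero. The function name `permutation_p_value`, the number of permutations, and the synthetic data are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from scipy.stats import pearsonr


def permutation_p_value(x, y, n_permutations=1000, seed=0):
    """One-sided permutation p-value for a negative correlation between x and y."""
    rng = np.random.default_rng(seed)
    # pearsonr gives the same r as linregress(x, y).rvalue
    observed = pearsonr(x, y)[0]
    permuted = np.array([pearsonr(x, rng.permutation(y))[0]
                         for _ in range(n_permutations)])
    # Count shuffles at least as negative as the observed correlation.
    n_as_extreme = int(np.sum(permuted <= observed))
    # Add-one correction keeps the estimate strictly above zero.
    p_value = (n_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Synthetic data with a weak negative relationship (illustration only).
rng = np.random.default_rng(42)
form = rng.integers(1, 8, size=5000).astype(float)
meaning = -0.01 * form + rng.normal(0, 0.1, size=5000)
observed_r, p_value = permutation_p_value(form, meaning)
```

Applied to the real data, `x` would be `comparisons_df['form']` and `y` would be `comparisons_df['meaning']`, although 1000 shuffles of about two million rows would take noticeably longer than the 10 used above.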

## Systematicity coefficients for each word

Now we can use a leave-one-out procedure to determine how each word contributes to the overall correlation. For each word, we remove all comparisons involving that word, recompute the global correlation, and compare that score to the original correlation. This follows the procedure in [Monaghan et al., 2014](http://rstb.royalsocietypublishing.org/content/369/1651/20130299.short).

Recall that the **original** correlation was negative. If **original** - **new** is negative, then removing the word moved the correlation closer to 0 (i.e. made it *weaker*), which suggests that the word contributed **form-meaning systematicity** to the correlation.

If **original** - **new** is positive, then removing the word moved the correlation further from 0 (i.e. made it *stronger*), which suggests that the word contributed **form-meaning arbitrariness** to the correlation.

Thus:
* **Negative** impact values suggest a word is more systematic
* **Positive** impact values suggest a word is more arbitrary

In [18]:
word_to_systematicity = {
}

In [19]:
# Leave-one-out: recompute the global correlation without each word in turn.
n_words = len(critical_words)
for index, word in enumerate(critical_words, start=1):
    if index % 100 == 0:
        print("{pct}% done...".format(pct=round(100 * index / n_words)))
    # Drop every comparison that involves the current word.
    df_copy = comparisons_df[(comparisons_df['w1'] != word) & (comparisons_df['w2'] != word)]
    new_correlation = linregress(df_copy['form'], df_copy['meaning'])
    word_to_systematicity[word] = true_regression.rvalue - new_correlation.rvalue


In [20]:
len(word_to_systematicity)

2082

In [21]:
words_systematicity_df = pd.DataFrame.from_dict({'word': list(word_to_systematicity.keys()),
                                                 'impact': list(word_to_systematicity.values())})

In [22]:
words_systematicity_df.sample(4)

Unnamed: 0,word,impact
20,dark,-0.041383
949,wold,-0.04137
474,price,-0.041344
1699,wave,-0.041352


In [23]:
words_systematicity_df['word_length'] = words_systematicity_df['word'].str.len()

In [24]:
words_systematicity_df.sample(4)

Unnamed: 0,word,impact,word_length
1858,tray,-0.041356,4
630,stark,-0.041328,5
294,brawl,-0.041373,5
1207,dam,-0.041332,3


In [43]:
# Use a separate name so the word2vec `model` loaded earlier is not overwritten
length_model = ols("impact ~ word_length", words_systematicity_df).fit()
length_model.summary()

0,1,2,3
Dep. Variable:,impact,R-squared:,0.0
Model:,OLS,Adj. R-squared:,-0.0
Method:,Least Squares,F-statistic:,0.1737
Date:,"Tue, 04 Sep 2018",Prob (F-statistic):,0.677
Time:,13:35:52,Log-Likelihood:,19504.0
No. Observations:,2082,AIC:,-39000.0
Df Residuals:,2080,BIC:,-38990.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.0413,2.06e-06,-2.01e+04,0.000,-0.041,-0.041
word_length,1.924e-07,4.62e-07,0.417,0.677,-7.13e-07,1.1e-06

0,1,2,3
Omnibus:,13.379,Durbin-Watson:,1.926
Prob(Omnibus):,0.001,Jarque-Bera (JB):,18.969
Skew:,-0.004,Prob(JB):,7.6e-05
Kurtosis:,3.468,Cond. No.,21.3

Word length does not predict impact here (coefficient ≈ 1.9e-07, p = 0.677).

### Write data to file

In [26]:
comparisons_df.to_csv("data/processed/wordpair_comparisons.csv")

In [27]:
words_systematicity_df.to_csv("data/processed/all_words_systematicity.csv")

### Merge with AoA data

[Monaghan et al., 2014](http://rstb.royalsocietypublishing.org/content/369/1651/20130299.short) found an inverse relationship between age of acquisition (AoA) and systematicity: words that were learned earlier were more systematic.

In [30]:
# Whitespace-delimited file; sep=r"\s+" is equivalent to delim_whitespace=True
aoa = pd.read_csv("data/raw/AoA.csv", sep=r"\s+")
aoa['word'] = aoa['Word'].str.lower()

In [31]:
aoa.sample(4)

Unnamed: 0,Word,AoA,word
1391,REGRET,428,regret
540,ELECTRICITY,400,electricity
937,JUSTIFICATION,603,justification
887,INSTANCE,471,instance


In [34]:
# Inner join on the lowercased word; only words present in both tables remain
aoa_plus_systematicity = pd.merge(words_systematicity_df, aoa, on='word')
len(aoa_plus_systematicity)

348

In [35]:
aoa_plus_systematicity.sample(4)

Unnamed: 0,word,impact,word_length,Word,AoA
50,crime,-0.041339,5,CRIME,383
262,girl,-0.04134,4,GIRL,183
42,male,-0.041365,4,MALE,383
173,call,-0.041363,4,CALL,225


In [41]:
# The summary below includes word_length as a second predictor (Df Model: 2)
aoa_model = ols("impact ~ AoA + word_length", aoa_plus_systematicity).fit()
aoa_model.summary()

0,1,2,3
Dep. Variable:,impact,R-squared:,0.001
Model:,OLS,Adj. R-squared:,-0.005
Method:,Least Squares,F-statistic:,0.2202
Date:,"Tue, 04 Sep 2018",Prob (F-statistic):,0.803
Time:,13:33:24,Log-Likelihood:,3267.6
No. Observations:,348,AIC:,-6529.0
Df Residuals:,345,BIC:,-6518.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.0413,6.78e-06,-6101.908,0.000,-0.041,-0.041
AoA,6.479e-10,1.06e-08,0.061,0.951,-2.02e-08,2.15e-08
word_length,9.194e-07,1.44e-06,0.640,0.522,-1.9e-06,3.74e-06

0,1,2,3
Omnibus:,3.298,Durbin-Watson:,2.033
Prob(Omnibus):,0.192,Jarque-Bera (JB):,3.124
Skew:,-0.162,Prob(JB):,0.21
Kurtosis:,3.333,Cond. No.,2340.0

In this sample of 348 words, neither AoA (p = 0.951) nor word length (p = 0.522) is a significant predictor of impact, so the relationship reported by Monaghan et al. (2014) does not appear here.