# Systematicity in English monomorphemic words by word class

### Sean Trott

Do certain word classes have more sub-morphemic systematicity than others?

**TO DO**:
* Use Levenshtein distance over phonemes, instead of orthography
* Relate to word features: grammatical class, AoA, Concreteness

## Load model and dataset

In [1]:
import os 
import gensim
import numpy as np
import pandas as pd
import re
from statsmodels.formula.api import ols

# Variables
MODEL_PATH = os.environ['WORD2VEC_PATH']
ROOT_PATH = 'data/raw/roots_celex_monosyllabic.txt'

LOAD_MODEL = True

In [2]:
model = gensim.models.KeyedVectors.load_word2vec_format(MODEL_PATH, binary=True)

In [3]:
# Use a context manager so the file handle is closed after reading
with open(ROOT_PATH, "r") as f:
    entries = f.read().split("\n")

In [4]:
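# Each entry is backslash-delimited: keep the first field (orthographic form) and the
# last field (passed later as `orth_to_phone`). Skip empty lines and entries with capitals.
words = [(entry.split("\\")[0], entry.split("\\")[-1]) for entry in entries if entry != "" and entry.islower()]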

## Filter by words that appear in model

In [5]:
critical_words = list(set([w for w in words if w[0] in model.vocab]))

In [6]:
critical_words_dict = dict(critical_words)
len(critical_words_dict)

2082

## Obtain form and meaning similarity metrics

Here, we import the class `SystematicityUtilities` from a [custom library](https://github.com/seantrott/nlp_utilities). By default, this class uses *Levenshtein distance* as its metric for *form similarity*, and *cosine similarity* as its metric for *meaning similarity*. Note that Levenshtein distance is a distance, so higher values mean the two forms are *less* similar. The `compare_form_and_meaning_df` method used below compares every word pair along form and meaning dimensions.
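
To make the two default metrics concrete, here is a minimal standalone sketch of each. It is not the code from `nlp_utilities`; the function names `levenshtein` and `cosine_similarity` are hypothetical and chosen only for illustration. The edit-distance function accepts any sequence, so it would work equally on strings of letters or on lists of phonemes.

```python
import numpy as np


def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of phonemes)."""
    # previous[j] holds the distance between the first i-1 items of a and the first j items of b
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


print(levenshtein("mind", "mast"))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Over orthography, `mind` and `mast` differ by three substitutions, so the sketch gives a distance of 3 for that pair.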

In [7]:
from itertools import combinations
w1, w2 = zip(*combinations(critical_words_dict.keys(), 2))
new_df = pd.DataFrame.from_dict({'w1': w1,
                                 'w2': w2})

In [8]:
new_df.head(5)

| | w1 | w2 |
|---|---|---|
| 0 | la | swoon |
| 1 | la | zee |
| 2 | la | blaze |
| 3 | la | noun |
| 4 | la | glass |


In [9]:
from nlp_utilities.compling import SystematicityUtilities

In [10]:
systematicity_utils = SystematicityUtilities(model, orth_to_phone=critical_words_dict)

In [11]:
systematicity_utils.compare_form('mind', 'mast')

3

In [12]:
systematicity_utils = SystematicityUtilities(model)
comparisons = systematicity_utils.compare_form_and_meaning_df(new_df, w1_column='w1', w2_column='w2')


In [13]:
print("{length} comparisons total".format(length=len(comparisons)))

2166321 comparisons total


In [14]:
comparisons.sort_values('form').head(n=10)

| | w1 | w2 | form | meaning |
|---|---|---|---|---|
| 853572 | teat | tea | 1 | 0.062589 |
| 1980980 | toot | moot | 1 | -0.023852 |
| 757553 | tope | tote | 1 | 0.087876 |
| 991789 | clack | black | 1 | 0.161644 |
| 1320822 | dirk | dark | 1 | 0.048723 |
| 112237 | sight | might | 1 | 0.10681 |
| 1110253 | needs | weeds | 1 | 0.055369 |
| 888068 | yaw | yawn | 1 | 0.125627 |
| 1300196 | kale | ale | 1 | 0.285508 |
| 26795 | fat | pat | 1 | 0.192674 |


## Global correlation

In [15]:
from scipy.stats import linregress

In [16]:
true_regression = linregress(comparisons['form'], comparisons['meaning'])
print("r={r}, p={p}".format(r=true_regression.rvalue, p=true_regression.pvalue))

r=-0.040672613104546416, p=0.0


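In other words, the correlation is negative but weak: word pairs with higher **form distance** (i.e. a higher Levenshtein distance) tend to have slightly lower **meaning similarity** (i.e. cosine similarity). With r ≈ -0.04, form distance accounts for less than 0.2% of the variance in meaning similarity (r² ≈ 0.0017). The printed `p=0.0` most likely reflects floating-point underflow given the more than two million comparisons, rather than an exact zero.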

## Compare global correlation to permuted distributions

In [17]:
import numpy as np

In [18]:
permuted_results = []
for permute in range(10):
    permuted_meaning = np.random.permutation(list(comparisons['meaning']))
    random_regression = linregress(comparisons['form'], permuted_meaning)
    permuted_results.append(random_regression)

In [19]:
permuted_cors = [reg.rvalue for reg in permuted_results]

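Now we can compare the *true correlation* with the distribution of correlations obtained by shuffling our dataset. Because the true correlation is negative, we count the permuted correlations that are at least as negative. With only 10 permutations, this proportion can only move in steps of 0.1, so a value of 0.0 means only that none of the 10 shuffles was as extreme as the true correlation.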

In [20]:
# Permuted correlations that are at least as negative as the true correlation
as_extreme = [cor for cor in permuted_cors if cor <= true_regression.rvalue]
p_global = len(as_extreme) / len(permuted_cors)
p_global

0.0
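
As a sketch of how this comparison could be made less coarse, the function below runs the same kind of shuffle test with a configurable number of permutations and an add-one correction, so the estimate is never exactly zero. This is an illustration on made-up data, not the notebook's procedure: the name `permutation_p_value`, the default of 1000 permutations, and the toy arrays are all assumptions. With the add-one correction and 10 permutations, the smallest attainable value would be 1/11.

```python
import numpy as np


def permutation_p_value(form, meaning, n_permutations=1000, seed=0):
    """One-sided test: how often is a shuffled r at least as negative as the observed r?"""
    rng = np.random.default_rng(seed)
    form = np.asarray(form, dtype=float)
    meaning = np.asarray(meaning, dtype=float)
    observed_r = np.corrcoef(form, meaning)[0, 1]
    permuted_rs = np.array([
        np.corrcoef(form, rng.permutation(meaning))[0, 1]
        for _ in range(n_permutations)
    ])
    # Add-one correction: the observed ordering counts as one of the permutations
    n_as_extreme = int(np.sum(permuted_rs <= observed_r))
    p_value = (n_as_extreme + 1) / (n_permutations + 1)
    return observed_r, p_value


# Made-up toy data: meaning similarity falls slightly as form distance grows
toy_rng = np.random.default_rng(1)
toy_form = toy_rng.integers(1, 8, size=500)
toy_meaning = -0.05 * toy_form + toy_rng.normal(scale=0.2, size=500)
toy_r, toy_p = permutation_p_value(toy_form, toy_meaning)
print("r={r:.3f}, p={p:.4f}".format(r=toy_r, p=toy_p))
```

Using `np.corrcoef` instead of a full regression keeps each permutation cheap, which matters once the number of permutations grows.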

## Systematicity coefficients for each word

Now, we can use leave-one-out regression to determine how each word contributes to the overall correlation. For each word, we remove all comparisons involving that word, recompute the global correlation, and compare that score to the original correlation. This follows the procedure in [Monaghan et al., 2014](http://rstb.royalsocietypublishing.org/content/369/1651/20130299.short).

Recall that the **original** correlation was negative. So if **original** - **new** is negative, the new correlation is less negative than the original: removing the word results in a *weaker* correlation (i.e. closer to 0), which suggests that the word provided a source of **form-meaning systematicity** to the correlation.

If **original** - **new** is positive, the new correlation is more negative than the original: removing the word results in a *stronger* correlation (i.e. further from 0), which suggests that the word provided a source of **form-meaning arbitrariness** to the correlation.

Thus:
* **Negative** impact values suggest a word is more systematic
* **Positive** impact values suggest a word is more arbitrary
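
The sign convention can be checked on a small made-up example. The sketch below assumes a dataframe with the same four columns as `comparisons` (`w1`, `w2`, `form`, `meaning`); the helper name `leave_one_out_impact` and the toy values are hypothetical. In the toy data, the three pairs involving the word `a` were constructed to follow the negative form-meaning trend, while the remaining pairs run against it.

```python
import pandas as pd
from scipy.stats import linregress


def leave_one_out_impact(comparisons, words):
    """Map each word to (full-data r) minus (r with that word's pairs removed)."""
    full_r = linregress(comparisons["form"], comparisons["meaning"]).rvalue
    impact = {}
    for word in words:
        # Keep only the pairs that do not involve this word
        keep = (comparisons["w1"] != word) & (comparisons["w2"] != word)
        subset = comparisons[keep]
        new_r = linregress(subset["form"], subset["meaning"]).rvalue
        impact[word] = full_r - new_r
    return impact


# Made-up toy data: all six pairs among four placeholder words
toy_comparisons = pd.DataFrame({
    "w1": ["a", "a", "a", "b", "b", "c"],
    "w2": ["b", "c", "d", "c", "d", "d"],
    "form": [1, 2, 3, 2, 3, 1],
    "meaning": [0.9, 0.5, 0.1, 0.4, 0.6, 0.3],
})
toy_impact = leave_one_out_impact(toy_comparisons, ["a", "b", "c", "d"])
print(toy_impact)
```

Removing `a` flips the toy correlation from negative to positive, so its impact value is negative, matching the reading that negative values mark the more systematic words.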

In [21]:
comparisons.head(5)

| | w1 | w2 | form | meaning |
|---|---|---|---|---|
| 0 | la | swoon | 5 | 0.124078 |
| 1 | la | zee | 3 | 0.388086 |
| 2 | la | blaze | 3 | -0.057967 |
| 3 | la | noun | 4 | 0.239854 |
| 4 | la | glass | 3 | 0.028939 |


In [22]:
word_to_systematicity = {}

In [23]:
# Leave-one-out: for each word, drop every pair that contains it, refit the
# regression, and record (original r - new r) as the word's impact.
for index, (word, _) in enumerate(critical_words, start=1):
    print(word)
    if index % 100 == 0:
        print("{pct}% done...".format(pct=round(index/len(critical_words), 2)*100))
    df_copy = comparisons[(comparisons['w1'] != word) & (comparisons['w2'] != word)]
    new_correlation = linregress(df_copy['form'], df_copy['meaning'])
    word_to_systematicity[word] = true_regression.rvalue - new_correlation.rvalue

la
swoon
zee
[... output truncated ...]
5.0% done...
[... output truncated ...]
71.0% done...
[... output truncated ...]
76.0% done...
[... output truncated ...]

In [24]:
len(word_to_systematicity)

2082

In [25]:
word_to_systematicity

{'la': -0.00017177082276061129,
 'swoon': 1.4238673936822765e-05,
 'zee': -0.00011118759022626934,
 'blaze': -9.811611806485876e-06,
 'noun': -6.781165303169218e-05,
 'glass': 2.049311684522437e-06,
 'hive': 1.4130038771778541e-05,
 'rout': 0.00011085862213526532,
 'oat': 1.5332997676496818e-05,
 'jive': 1.8903706994588543e-05,
 'bourn': 9.009046084014483e-05,
 'alms': 1.5901655190525554e-06,
 'fat': -0.00014706441737508125,
 'ruse': 3.426196386285829e-05,
 'pie': -5.666495094389101e-05,
 'clew': -5.2214768906114206e-05,
 'scout': -7.605640098667932e-05,
 'die': -7.666481072345077e-05,
 'oaf': -0.0002952435988603508,
 'ma': -0.0004776515782508872,
 'brag': -3.912942953709919e-05,
 'freeze': -0.00021004681478461323,
 'clout': -5.267522925270218e-05,
 'meet': 7.136896492737632e-05,
 'foul': 9.092333779626288e-05,
 'marc': -0.00011848849164834002,
 'wee': -0.00033444697103302,
 'wax': -0.00014238848984512997,
 'fruits': -0.00010278988446652676,
 'clam': -8.850086216453945e-05,
 'doge': 2.
 [... output truncated ...]

In [26]:
words_systematicity_df = pd.DataFrame.from_dict({'word': list(word_to_systematicity.keys()),
                                                 'impact': list(word_to_systematicity.values())})

In [27]:
words_systematicity_df.sort_values('impact').head(4)

| | word | impact |
|---|---|---|
| 335 | pleased | -0.001439 |
| 1274 | strained | -0.000916 |
| 44 | rights | -0.000891 |
| 1268 | fraught | -0.000799 |


In [28]:
words_systematicity_df['word_length'] = words_systematicity_df['word'].str.len()

In [29]:
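# Use a new name so the fitted regression does not overwrite the word2vec `model` loaded above
length_model = ols("impact ~ word_length", words_systematicity_df).fit()
length_model.summary()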

| | | | |
|---|---|---|---|
| Dep. Variable: | impact | R-squared: | 0.014 |
| Model: | OLS | Adj. R-squared: | 0.014 |
| Method: | Least Squares | F-statistic: | 30.51 |
| Date: | Thu, 27 Sep 2018 | Prob (F-statistic): | 3.74e-08 |
| Time: | 16:29:36 | Log-Likelihood: | 15410.0 |
| No. Observations: | 2082 | AIC: | -30820.0 |
| Df Residuals: | 2080 | BIC: | -30800.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -7.94e-05 | 1.47e-05 | -5.389 | 0.000 | -0.000 | -5.05e-05 |
| word_length | 1.822e-05 | 3.3e-06 | 5.524 | 0.000 | 1.18e-05 | 2.47e-05 |

| | | | |
|---|---|---|---|
| Omnibus: | 670.324 | Durbin-Watson: | 1.992 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9327.518 |
| Skew: | -1.12 | Prob(JB): | 0.0 |
| Kurtosis: | 13.125 | Cond. No. | 21.3 |


## Write data to file

In [31]:
comparisons.to_csv("data/processed/wordpair_comparisons.csv")

In [32]:
words_systematicity_df.to_csv("data/processed/all_words_systematicity.csv")