# Modeling similarity judgments with word embeddings

Previous work has shown that similarity judgments between words can be estimated using distributional models of word meaning. But are there certain kinds of words for which distributional information is more or less useful?

There are theoretical reasons to think that distributional information is more helpful for learning abstract concepts/words (Lupyan & Winter, 2018), and some empirical evidence as well (Kiros et al., 2018). But prior work has not asked whether distributional information better predicts similarity judgments as a function of a word's *level of abstractness*, or of other kinds of semantic features.

We use the Brysbaert norms as an estimate of concreteness, and ELMo as a model for word embeddings.

For the pairwise similarity judgments, we will consider several different datasets in turn.

We consider two approaches:

## Approach 1

Approach 1 is simple. For a given dataset, we classify each wordpair into one of three bins: `Abstract`, `Concrete`, and `Mixed`. `Abstract` corresponds to a wordpair in which both words fall below the median concreteness for that dataset, `Concrete` corresponds to a wordpair in which both words fall above, and `Mixed` means that one falls below and one falls above.

We then regress `similarity ~ cosine_distance` for Abstract wordpairs only, and Concrete wordpairs only, and compare the resulting `R^2` values.
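One possible way to make the comparison of the two `R^2` values more concrete is a Fisher r-to-z test on the underlying correlations. The sketch below is not part of the analysis in this notebook; it assumes a one-predictor regression (so `R^2` is the squared correlation), slopes with the same sign in both bins, and independent bins. The function name `compare_r2` and the input values are hypothetical.

```python
import math
from statistics import NormalDist


def compare_r2(r2_a, n_a, r2_b, n_b):
    """Fisher r-to-z test for two independent correlations, given their R^2 values.

    Assumes simple (one-predictor) regressions whose slopes share the same sign,
    so that sqrt(R^2) can stand in for the magnitude of each correlation.
    """
    z_a = math.atanh(math.sqrt(r2_a))
    z_b = math.atanh(math.sqrt(r2_b))
    # Standard error of the difference between two Fisher-transformed correlations
    se = math.sqrt(1 / (n_a - 3) + 1 / (n_b - 3))
    z = (z_a - z_b) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    return z, p


# Hypothetical values, for illustration only
z, p = compare_r2(r2_a=0.17, n_a=1000, r2_b=0.10, n_b=1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

Because the same word can appear in several wordpairs, the independence assumption is only approximate here, so the resulting p-value should be read as a rough guide.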

### Median split 

Because each dataset isn't balanced in terms of its concreteness, we can't necessarily use the median split from the Brysbaert norms. Thus, we run and report two approaches to splitting the data:

1) Using the median concreteness from a given dataset.  
2) Using a shared threshold: the median of the per-dataset median concreteness values across the four datasets we consider.

(1) will ensure that a given dataset is balanced, but might result in the same word being labeled `Abstract` in one dataset and `Concrete` in another.

(2) will result in less balanced datasets, but still more balanced than using the median of the full Brysbaert norms, and will ensure that each word is categorized consistently as `Abstract` or `Concrete` across datasets.
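The trade-off between the two splitting options can be seen in a small toy example. The words, ratings, and dataset names below are hypothetical and chosen only to make the effect visible; the labeling rule (strictly below the threshold counts as `Abstract`) mirrors the helper functions used later.

```python
import statistics

# Hypothetical concreteness ratings for two toy datasets that share two words
toy_datasets = {
    "dataset_a": {"idea": 1.6, "hope": 1.9, "argue": 2.4, "walk": 4.1},
    "dataset_b": {"argue": 2.4, "walk": 4.1, "dog": 4.8, "table": 4.9},
}


def label_words(ratings, threshold):
    """Label each word Abstract if its rating falls below the threshold."""
    return {w: ("Abstract" if c < threshold else "Concrete") for w, c in ratings.items()}


# Option (1): each dataset uses its own median
own_median = {name: statistics.median(r.values()) for name, r in toy_datasets.items()}
option_1 = {name: label_words(r, own_median[name]) for name, r in toy_datasets.items()}

# Option (2): every dataset uses the median of the per-dataset medians
shared_threshold = statistics.median(own_median.values())
option_2 = {name: label_words(r, shared_threshold) for name, r in toy_datasets.items()}

for name in toy_datasets:
    print(name, "option 1:", option_1[name])
    print(name, "option 2:", option_2[name])
```

Under option (1) each toy dataset splits two-and-two, but `argue` and `walk` change label between datasets. Under option (2) the labels are stable across datasets, but `dataset_a` ends up with three `Abstract` words and one `Concrete` word.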

## Approach 2

A median split of concreteness is useful for model interpretation, but is also problematic for two reasons:

1. It dichotomizes a variable that may be better characterized as continuous.  
2. Each dataset isn't balanced in terms of its concreteness; that is, the median concreteness in the Brysbaert norms isn't necessarily the same as the median concreteness in each similarity rating dataset.

One solution is to ask whether the concreteness of each word in the word pairs predicts a given word pair's contribution to the fit of a model. That is, we can fit the entire dataset, then iteratively remove different wordpairs and fit the dataset after each removal. A given wordpair can thus be characterized in terms of whether its removal impairs or improves model fit. We can then ask whether a wordpair's contribution is related to each of those words' concreteness. 

The `contribution` of a wordpair is defined as `original_r2 - new_r2`, where `new_r2` is the fit of the model without that wordpair. Thus, a more positive value means that removing that wordpair **impairs** the model (the fit is lower without the wordpair), and a more negative value means that removing that wordpair **improves** the model (the fit is higher without the wordpair).

Thus, a positive coefficient for `w1_concreteness`, `w2_concreteness`, or their interaction means that more concrete words (or wordpairs) improve the model, i.e., they have a *positive* contribution to `R^2`. A negative coefficient means that more abstract words (or wordpairs) improve the model.
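The two stages of this approach can be sketched end to end on synthetic data. This is a minimal illustration using NumPy rather than the `statsmodels` helpers defined below; the data, the noise model, and the variable names are assumptions made for the example, and the second-stage model `contribution ~ w1_concreteness * w2_concreteness` is written out as an explicit design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60

# Hypothetical toy data: concreteness of each word, embedding distance, rated similarity
w1_conc = rng.uniform(1, 5, n)
w2_conc = rng.uniform(1, 5, n)
distance = rng.uniform(0, 1, n)
similarity = 8 - 6 * distance + rng.normal(0, 0.5, n)


def r_squared(x, y):
    """R^2 of the simple regression y ~ x, i.e. the squared Pearson correlation."""
    return np.corrcoef(x, y)[0, 1] ** 2


# Stage 1: contribution of each wordpair = original_r2 - new_r2
original_r2 = r_squared(distance, similarity)
keep = np.ones(n, dtype=bool)
contribution = np.empty(n)
for i in range(n):
    keep[i] = False  # leave wordpair i out
    contribution[i] = original_r2 - r_squared(distance[keep], similarity[keep])
    keep[i] = True   # put it back before the next iteration

# Stage 2: regress contribution on both concreteness values and their interaction
X = np.column_stack([np.ones(n), w1_conc, w2_conc, w1_conc * w2_conc])
coefs, *_ = np.linalg.lstsq(X, contribution, rcond=None)

names = ["Intercept", "w1_concreteness", "w2_concreteness", "w1_concreteness:w2_concreteness"]
for name, coef in zip(names, coefs):
    print(f"{name}: {coef:.5f}")
```

In this toy setup concreteness is unrelated to the noise, so the second-stage coefficients carry no meaning; the sketch only shows how the quantities are constructed and which sign corresponds to which interpretation.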

In [1]:
import pandas as pd
from tqdm import tqdm
import itertools
import scipy
import seaborn as sns
import matplotlib.pyplot as plt
import statistics
import statsmodels.formula.api as sm

In [2]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # makes figs nicer!

# Helper functions

In [4]:
def get_comparison_type(row):
    if row['w1_is_abstract'] and row['w2_is_abstract']:
        return "Abstract"
    elif not row['w1_is_abstract'] and not row['w2_is_abstract']:
        return "Concrete"
    return "Mixed"

In [5]:
def tag_as_abstract(df, median_concreteness_amount):
    """Tag each word in a word pair as abstract (True) if its concreteness is below the given median."""
    df['w1_is_abstract'] = df['w1_conc'].apply(lambda x: x < median_concreteness_amount)
    df['w2_is_abstract'] = df['w2_conc'].apply(lambda x: x < median_concreteness_amount)
    df['same'] = df['w1_is_abstract'] == df['w2_is_abstract']
    return df

In [6]:
def print_distribution(df):
    """Print out dataset distribution in terms of comparisons."""
    print("#Abstract: {n}".format(n=len(df[df['Comparison Type']=='Abstract'])))
    print("#Concrete: {n}".format(n=len(df[df['Comparison Type']=='Concrete'])))
    print("#Mixed: {n}".format(n=len(df[df['Comparison Type']=='Mixed'])))

In [62]:
def get_stats_for_dataset(df, formula):
    """Get R2 and coefficients for different subsets of dataset."""
    results = []
    # Fit on the full dataset first (use the `formula` argument, not the global)
    result = sm.ols(formula=formula, 
                data=df).fit()
    results.append({
        'comparison': 'overall',
        'r2': result.rsquared,
        'coef': result.params['decontextualized_elmo_similarity'],
        'n': len(df)
    })
    # Then fit separately within each comparison type
    for i in ['Abstract', 'Concrete', 'Mixed']:
        df_reduced = df[df['Comparison Type']==i]
        result = sm.ols(formula=formula, 
                    data=df_reduced).fit()
        results.append({
            'comparison': i,
            'r2': result.rsquared,
            'coef': result.params['decontextualized_elmo_similarity'],
            'n': len(df_reduced)
        })
    
    return pd.DataFrame(results)

In [7]:
FORMULA = 'similarity ~ decontextualized_elmo_similarity'

In [8]:
def leave_one_pair_out(df, formula):
    """Return each wordpair's contribution to model fit, defined as original_r2 - new_r2."""
    result = sm.ols(formula=formula, 
                data=df).fit()
    original_r2 = result.rsquared
    
    differences = []
    for pair_index in tqdm(range(len(df))):
        # Drop by position so this also works when the index is not 0..n-1
        df_reduced = df.drop(df.index[pair_index])
        
        result = sm.ols(formula=formula, 
            data=df_reduced).fit()
        new_r2 = result.rsquared
        # Positive: removing the pair lowers the fit; negative: removing it raises the fit
        differences.append(original_r2 - new_r2)
    return differences

# Load datasets

Load each processed dataset and print out descriptive statistics for it.

In [14]:
df_sim = pd.read_csv("data/processed/sim3500_with_cosine_distance.csv")
combined = list(df_sim['w1']) + list(df_sim['w2'])
print(len(df_sim))
print(len(set(combined)))

3487
824


In [16]:
df_wordsim = pd.read_csv("data/processed/wordsim_with_cosine_distance.csv")
combined = list(df_wordsim['Word 1']) + list(df_wordsim['Word 2'])
print(len(df_wordsim))
print(len(set(combined)))

353
437


In [17]:
df_simlex = pd.read_csv("data/processed/simlex_with_cosine_distance.csv")
combined = list(df_simlex['word1']) + list(df_simlex['word2'])
print(len(df_simlex))
print(len(set(combined)))

999
1028


In [20]:
df_mturk = pd.read_csv("data/processed/mturk771_with_cosine_distance.csv")
combined = list(df_mturk['w1']) + list(df_mturk['w2'])
print(len(df_mturk))
print(len(set(combined)))

754
1094


## Descriptive statistics about concreteness

In [23]:
combined = list(df_sim['w1_conc']) +  list(df_sim['w2_conc'])
simverb_conc = statistics.median(combined)
simverb_conc

3.03

In [24]:
combined = list(df_wordsim['w1_conc']) +  list(df_wordsim['w2_conc'])
wordsim_conc = statistics.median(combined)
wordsim_conc

3.94

In [22]:
combined = list(df_simlex['w1_conc']) +  list(df_simlex['w2_conc'])
simlex_conc = statistics.median(combined)
simlex_conc

3.73

In [25]:
combined = list(df_mturk['w1_conc']) +  list(df_mturk['w2_conc'])
mturk_conc = statistics.median(combined)
mturk_conc

4.1

In [57]:
median_conc = statistics.median([simverb_conc, wordsim_conc, simlex_conc, mturk_conc])
median_conc

3.835
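Note that `median_conc` is the median of the four per-dataset medians, which weights each dataset equally regardless of its size. A pooled median over all ratings would instead weight datasets by the number of words they contribute. The toy example below, with hypothetical ratings, shows that the two can differ; it is offered as an illustration of the design choice, not as a replacement for the threshold used here.

```python
import statistics

# Hypothetical concreteness ratings; the two datasets differ in size on purpose
ratings_by_dataset = {
    "large": [1.5, 2.0, 2.5, 3.0, 3.0, 3.5, 4.0],
    "small": [4.0, 4.5, 5.0],
}

# Median of per-dataset medians: each dataset counts once
median_of_medians = statistics.median(
    statistics.median(r) for r in ratings_by_dataset.values()
)

# Pooled median: each rating counts once, so larger datasets dominate
pooled_median = statistics.median(
    x for r in ratings_by_dataset.values() for x in r
)

print(median_of_medians, pooled_median)
```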

# Analysis

## SimVerb

### Approach 1, split (1): Using median concreteness for dataset

In [63]:
df_sim = tag_as_abstract(df_sim, simverb_conc)
df_sim['Comparison Type'] = df_sim.apply(lambda row: get_comparison_type(row), axis=1)
print_distribution(df_sim)

#Abstract: 1164
#Concrete: 1179
#Mixed: 1144


In [64]:
get_stats_for_dataset(df_sim, FORMULA)

| | coef | comparison | n | r2 |
|---|---|---|---|---|
| 0 | -10.137249 | overall | 3487 | 0.169454 |
| 1 | -9.44391 | Abstract | 1164 | 0.168931 |
| 2 | -8.368142 | Concrete | 1179 | 0.104506 |
| 3 | -12.454512 | Mixed | 1144 | 0.221531 |


### Approach 1, split (2): Using median concreteness across all four datasets

In [65]:
df_sim = tag_as_abstract(df_sim, median_conc)
df_sim['Comparison Type'] = df_sim.apply(lambda row: get_comparison_type(row), axis=1)
print_distribution(df_sim)

#Abstract: 2223
#Concrete: 399
#Mixed: 865


In [66]:
get_stats_for_dataset(df_sim, FORMULA)

| | coef | comparison | n | r2 |
|---|---|---|---|---|
| 0 | -10.137249 | overall | 3487 | 0.169454 |
| 1 | -10.24471 | Abstract | 2223 | 0.182513 |
| 2 | -4.036077 | Concrete | 399 | 0.024355 |
| 3 | -12.434353 | Mixed | 865 | 0.21749 |


## Wordsim

### Approach 1, split (1): Using median concreteness for dataset

In [67]:
df_wordsim = tag_as_abstract(df_wordsim, wordsim_conc)
df_wordsim['Comparison Type'] = df_wordsim.apply(lambda row: get_comparison_type(row), axis=1)
print_distribution(df_wordsim)

#Abstract: 101
#Concrete: 138
#Mixed: 114


In [68]:
get_stats_for_dataset(df_wordsim, FORMULA)

| | coef | comparison | n | r2 |
|---|---|---|---|---|
| 0 | -8.312924 | overall | 353 | 0.292742 |
| 1 | -8.498852 | Abstract | 101 | 0.36458 |
| 2 | -7.768396 | Concrete | 138 | 0.24965 |
| 3 | -9.554172 | Mixed | 114 | 0.217952 |


### Approach 1, split (2): Using median concreteness across all four datasets

In [69]:
df_wordsim = tag_as_abstract(df_wordsim, median_conc)
df_wordsim['Comparison Type'] = df_wordsim.apply(lambda row: get_comparison_type(row), axis=1)
print_distribution(df_wordsim)

#Abstract: 95
#Concrete: 150
#Mixed: 108


In [70]:
get_stats_for_dataset(df_wordsim, FORMULA)

| | coef | comparison | n | r2 |
|---|---|---|---|---|
| 0 | -8.312924 | overall | 353 | 0.292742 |
| 1 | -8.764007 | Abstract | 95 | 0.395844 |
| 2 | -7.549924 | Concrete | 150 | 0.239757 |
| 3 | -9.409771 | Mixed | 108 | 0.213429 |


## SimLex

### Approach 1, split (1): Using median concreteness for dataset

In [71]:
df_simlex = tag_as_abstract(df_simlex, simlex_conc)
df_simlex['Comparison Type'] = df_simlex.apply(lambda row: get_comparison_type(row), axis=1)
print_distribution(df_simlex)

#Abstract: 433
#Concrete: 434
#Mixed: 132


In [72]:
get_stats_for_dataset(df_simlex, FORMULA)

| | coef | comparison | n | r2 |
|---|---|---|---|---|
| 0 | -7.734998 | overall | 999 | 0.166691 |
| 1 | -8.841296 | Abstract | 433 | 0.169374 |
| 2 | -6.649626 | Concrete | 434 | 0.1575 |
| 3 | -9.468034 | Mixed | 132 | 0.253343 |


### Approach 1, split (2): Using median concreteness across all four datasets

In [73]:
df_simlex = tag_as_abstract(df_simlex, median_conc)
df_simlex['Comparison Type'] = df_simlex.apply(lambda row: get_comparison_type(row), axis=1)
print_distribution(df_simlex)

#Abstract: 451
#Concrete: 427
#Mixed: 121


In [74]:
get_stats_for_dataset(df_simlex, FORMULA)

| | coef | comparison | n | r2 |
|---|---|---|---|---|
| 0 | -7.734998 | overall | 999 | 0.166691 |
| 1 | -8.676706 | Abstract | 451 | 0.166289 |
| 2 | -6.611504 | Concrete | 427 | 0.154087 |
| 3 | -9.601434 | Mixed | 121 | 0.25623 |


## MTurk

### Approach 1, split (1): Using median concreteness for dataset

In [75]:
df_mturk = tag_as_abstract(df_mturk, mturk_conc)
df_mturk['Comparison Type'] = df_mturk.apply(lambda row: get_comparison_type(row), axis=1)
print_distribution(df_mturk)

#Abstract: 244
#Concrete: 246
#Mixed: 264


In [76]:
get_stats_for_dataset(df_mturk, FORMULA)

| | coef | comparison | n | r2 |
|---|---|---|---|---|
| 0 | -4.54975 | overall | 754 | 0.326633 |
| 1 | -5.012256 | Abstract | 244 | 0.387926 |
| 2 | -4.142269 | Concrete | 246 | 0.272679 |
| 3 | -4.865899 | Mixed | 264 | 0.322219 |


### Approach 1, split (2): Using median concreteness across all four datasets

In [77]:
df_mturk = tag_as_abstract(df_mturk, median_conc)
df_mturk['Comparison Type'] = df_mturk.apply(lambda row: get_comparison_type(row), axis=1)
print_distribution(df_mturk)

#Abstract: 195
#Concrete: 322
#Mixed: 237


In [78]:
get_stats_for_dataset(df_mturk, FORMULA)

| | coef | comparison | n | r2 |
|---|---|---|---|---|
| 0 | -4.54975 | overall | 754 | 0.326633 |
| 1 | -4.746439 | Abstract | 195 | 0.380477 |
| 2 | -3.954131 | Concrete | 322 | 0.262926 |
| 3 | -5.568677 | Mixed | 237 | 0.345884 |
