# All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality

### Replication Instructions

## 1) Build a corpus
Before we do anything, we need a corpus of natural language data. We used randomly selected dumps from Wikipedia, but a dataset such as [wikitext-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) should suffice for most experiments. For the final experiment (Section 5), you will need a larger corpus, hence our use of several dump partitions. Otherwise, any sufficiently large corpus should give the same qualitative results.

We acquired our Wikipedia files from https://dumps.wikimedia.org/enwiki/latest/. The specific partitions we used are:
enwiki-latest-pages-articles[X].xml where X is 4, 5, 7, 8, 10, 11, 12, 13, 15, 16, 18, 20, 22, 24, 25.

If you use raw Wikipedia dumps, you'll need [WikiExtractor](https://github.com/attardi/wikiextractor) to extract and clean the text.

Save the extracted Wikipedia files to a folder called "wikitext/" in the same directory as this notebook.
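
The loading code in Section 2 globs `wikitext/*/*/*`, so each text file has to sit exactly three levels below `wikitext/`. Below is a minimal sketch for checking that layout. It assumes WikiExtractor was run once per dump partition with its output written to `wikitext/<partition>/`; the folder name `part4` and the helper `list_corpus_files` are placeholders for illustration, not names used elsewhere in this notebook.

```python
import os
import tempfile
from glob import glob


def list_corpus_files(root):
    """Return every file exactly three levels below `root` (hypothetical helper)."""
    pattern = os.path.join(root, '*', '*', '*')
    return sorted(p for p in glob(pattern) if os.path.isfile(p))


# Build a throwaway directory that mimics the assumed layout:
# wikitext/<partition>/<subfolder>/<file>
demo_root = os.path.join(tempfile.mkdtemp(), 'wikitext')
os.makedirs(os.path.join(demo_root, 'part4', 'AA'))
for name in ('wiki_00', 'wiki_01'):
    with open(os.path.join(demo_root, 'part4', 'AA', name), 'w', encoding='utf-8') as f:
        f.write('<doc id="1" title="Example">\nExample\n\nSome text.\n</doc>\n')

demo_files = list_corpus_files(demo_root)
print(len(demo_files), 'files found')
```

If this returns an empty list for your own `wikitext/` folder, the files are probably nested one level too deep or too shallow for the glob pattern.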



## 2) Preprocessing
Now, we'll get embeddings from all of the models using our corpus. Feel free to add any other language models you'd like to investigate from https://huggingface.co/models using the model identifier. If you'd like to skip this step and use the embeddings we used in the paper, you can find them [here](https://drive.google.com/file/d/1YUgEHEu6QU2ChD0X9N8IlZqNJlr3rk0P/view?usp=sharing).

In [None]:
from transformers import AutoModel, AutoTokenizer
import torch
import os
import random
import numpy as np
from glob import glob
from tqdm import tqdm
from scipy import stats
import matplotlib.pyplot as plt
import pickle
import sys
from sklearn.decomposition import PCA
from scipy.spatial.distance import cosine

random.seed(2398)

if torch.cuda.is_available():  
    dev = "cuda:0"
    sys.stdout.write('using CUDA!\n')
else:
    sys.stdout.write('NOT using CUDA!\n')
    dev = "cpu"  
device = torch.device(dev)


def load_model_tokenizer(model_name):
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.to(device)
    model.eval()
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer

In [None]:
#these are the four models we use in our experiments 
#you can add/change models, though some of the code is dependent on all models having the same # of layers.
models = ['gpt2', 'xlnet-base-cased', 'bert-base-cased', 'roberta-base']

#choose a model, or add a for loop to get embeddings for all models.
model_name = models[0]
model, tokenizer = load_model_tokenizer(model_name)

#context length is the number of tokens in each model input. In our case, we chose 128:
context_length = 128

wikitext_fnames = glob('wikitext/*/*/*')
random.shuffle(wikitext_fnames)

In [None]:
#preprocessing the dump files to ignore headers, spacing etc. You can skip this if you're not using wiki dumps
all_article_texts = []
for fname in wikitext_fnames[:100]:
    with open(fname, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    in_article = False
    article_texts = []
    curr_article = []
    i = 0
    while(i < len(lines)):
        if(len(lines[i]) > 1):
            if(in_article):
                if(lines[i][:6] != '</doc>'):
                    curr_article.append(" ".join(lines[i].strip().split()))
                else:
                    in_article = False
                    if(len(curr_article) > 10):
                        article_texts.append(" ".join(curr_article))
                    curr_article = []
            elif(lines[i][:4] == '<doc'):
                in_article = True
                i+=1
        i+=1
    all_article_texts.extend(article_texts)
    
random.shuffle(all_article_texts)
texts = all_article_texts[:150] #for this part, we don't need all of the article text, so lets just take a random subset

The next cell is where we actually get the embeddings from samples in our corpus. If you are using a different corpus, then create a list of ```str``` paragraphs called ```texts``` with whatever text you'd like.

In [None]:
corpus_sample_embeds = []
all_toks = []
for text in tqdm(texts):
    text_toks = tokenizer.tokenize(text)[:context_length]
    all_toks += text_toks
    toks = torch.tensor([tokenizer.convert_tokens_to_ids(text_toks)]).to(device)
    model_out = model(toks)
    embeddings = torch.stack(model_out.hidden_states).squeeze().detach().cpu().numpy()
    corpus_sample_embeds.append(embeddings)

corpus_sample_embeds = np.concatenate(corpus_sample_embeds, axis=1)

out = {'corpus':texts, 'tokens':all_toks, 'embeddings':corpus_sample_embeds}
directory = 'wikitext_embs'
if not os.path.exists(directory):
    os.makedirs(directory)
pickle.dump(out, open(os.path.join(directory, model_name + "_" + str(context_length) + '.p'), "wb" ))
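
If you are using a different corpus, the `texts` list consumed by the cell above could be built along these lines. This is only a sketch: it assumes a plain-text file whose paragraphs are separated by blank lines, and the helper `load_paragraphs` with its `min_words` threshold is a hypothetical name introduced here for illustration.

```python
import os
import tempfile


def load_paragraphs(path, min_words=5):
    """Split a plain-text file on blank lines and normalize whitespace (hypothetical helper)."""
    with open(path, 'r', encoding='utf-8') as f:
        raw = f.read()
    paragraphs = []
    for block in raw.split('\n\n'):
        words = block.split()
        if len(words) >= min_words:  # drop headings and other very short fragments
            paragraphs.append(' '.join(words))
    return paragraphs


# Small demo file so the sketch runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), 'corpus.txt')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('A heading\n\nThe first paragraph has\nmore than five words in it.\n\n'
            'The second paragraph is also long enough to keep.\n')

demo_texts = load_paragraphs(demo_path)
print(len(demo_texts), 'paragraphs')
```

Assigning the result to `texts` instead of `demo_texts` would feed it into the embedding loop above.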

## 2.5) Context Aggregation for representational quality evaluation (section 5)
If you would like to replicate our results directly from the wiki corpus, then you'll need to aggregate across corpora using the strategy of [Bommasani et al. 2020](https://aclanthology.org/2020.acl-main.431.pdf). To save space, we've included all of the aggregated contexts for all words in all of the human word similarity/relatedness datasets. Our data can be found [here](https://drive.google.com/file/d/1_1wQGXSmgjnKXmKpHaLopLf0A1_6gwo6/view?usp=sharing). Extract these files to the ```data/``` directory. 

Otherwise, if you'd just like to use our precomputed context-aggregated embeddings, they are available [here](https://drive.google.com/file/d/1AenKhgoJcvzwaAqgCchNkR55Q3uiYdPA/view?usp=sharing).

If you'd like to use your own corpus, then format the files as follows:
Each file is named ```fname.sents```, where ```fname``` is the target word from the word similarity dataset. Each file contains line-separated sentences that include the target word, and the first element of each line is the (zero-based) index of the target word in the sentence. All words are space-delimited.
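
As a rough illustration of this format, the sketch below writes one `.sents` file. The target word, the sentences, and the helper `write_sents_file` are made up for the example; the index is zero-based because the loading code below compares it against `enumerate` positions of the words that follow it on the line.

```python
import os
import tempfile


def write_sents_file(out_dir, target_word, sentences):
    """Write `<target_word>.sents`: one sentence per line, target index first (hypothetical helper)."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, target_word + '.sents')
    with open(path, 'w', encoding='utf-8') as f:
        for sent in sentences:
            words = sent.split()
            lowered = [w.lower() for w in words]
            if target_word not in lowered:
                continue  # skip sentences that do not contain the target word
            target_ind = lowered.index(target_word)  # zero-based index of the first occurrence
            f.write(' '.join([str(target_ind)] + words) + '\n')
    return path


demo_dir = tempfile.mkdtemp()
demo_sents_path = write_sents_file(
    demo_dir, 'bank',
    ['She walked to the bank yesterday', 'No match here', 'The bank was closed'])
with open(demo_sents_path, 'r', encoding='utf-8') as f:
    demo_lines = f.read().splitlines()
print(demo_lines)
```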

The word similarity/relatedness datasets can be found in the github repository for our paper.

In [None]:
sim_filenames = ['word_sim/simlex-999/SimLex-999.txt', 
                'word_sim/simverb-3500/SimVerb-3500.txt', 
                'word_sim/wordsim-353_r/wordsim_relatedness_goldstandard.txt', 
                'word_sim/wordsim-353_s/wordsim_similarity_goldstandard.txt',
                'word_sim/rg-65/RG65.txt']

#load all of the vocabulary items from all of the word similarity datasets
def load_sim_dataset_vocab(sim_filenames):
    sim_lex_vocab = set()
    sim_lex_pairs = []
    for sim_lex_filename in sim_filenames:
        with open(sim_lex_filename, 'r', encoding='utf-8') as f:
            dataset = sim_lex_filename.split('/')[1] #dataset folder name, e.g. 'simlex-999'
            word_pair_lines = f.readlines()
            for line in word_pair_lines[1:]:
                if(dataset[:2] == 'rg'): #the rg-65 file is split on ';'
                    vals = line.split(';')
                else:
                    vals = line.split()
                word1 = vals[0].lower()
                word2 = vals[1].lower()
                sim_lex_vocab.add(word1)
                sim_lex_vocab.add(word2)
                sim_lex_pairs.append((word1, word2, dataset))
    return sim_lex_vocab, sim_lex_pairs

#tokenize words using BPE, such that we can recover the indices of the target word.
def tokenize_sent_subword_inds(sent, target_word_ind, tokenizer):
    final_sent_toks = []
    start_ind = 0
    end_ind = 0
    for i, word in enumerate(sent):
        if i != 0:
            word = " " + word #add prefix space for intermediate words
        curr_word_toks = tokenizer.tokenize(word)
        if i == target_word_ind:
            start_ind = len(final_sent_toks)
            final_sent_toks.extend(curr_word_toks)
            end_ind = len(final_sent_toks)
        else:
            final_sent_toks.extend(curr_word_toks)
    return final_sent_toks, (start_ind, end_ind)

def get_embeddings_for_word(model, tokenizer, target_word, target_words_dir):
    with open(os.path.join(target_words_dir, target_word) + ".sents", 'r', encoding='utf-8') as f:
        f_lines = f.readlines()
        sents = []
    for line in f_lines:
        target_word_ind = int(line.split()[0])
        sent = line.split()[1:]
        sents.append((target_word_ind, sent))
    #get embeddings
    mean_embs = []
    for sent in sents[:500]: #no more than 500 sentences for time considerations
        tokens, target_span = tokenize_sent_subword_inds(sent[1], sent[0], tokenizer)
        tokens_tensor = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)]).to(device)
        with torch.no_grad():
            model_out = model(tokens_tensor)
            embeddings = torch.stack(model_out.hidden_states).squeeze()[:,target_span[0]:target_span[1],:]
            embeddings = embeddings.detach().cpu().numpy()
        mean_embedding = embeddings.mean(axis=1) #average over subwords
        mean_embs.append(mean_embedding)
    decontextualized_emb = np.stack(mean_embs).mean(axis=0)
    return decontextualized_emb

  
vocab, pairs = load_sim_dataset_vocab(sim_filenames)
vocab_processed = set()
#model, tokenizer = load_model_tokenizer(model_name)
for word_i in tqdm(vocab):
    mean_emb = get_embeddings_for_word(model, tokenizer, word_i, 'data')
  
    directory = 'wordsim_embs/{}'.format(model_name)
    if not os.path.exists(directory):
        os.makedirs(directory)
    if(mean_emb is not None and word_i not in vocab_processed):
        vocab_processed.add(word_i)
        pickle.dump(mean_emb, open(os.path.join(directory, '{}.p'.format(word_i)), "wb" ))
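
The aggregation in `get_embeddings_for_word` amounts to two nested means: first over the subword pieces of the target word within one sentence, then over sentences. Here is a toy sketch with made-up numbers (2 layers, 3 dimensions), not real model output:

```python
import numpy as np

# Two contexts of the same word; shapes are (layers, subword_pieces, dims).
context_a = np.ones((2, 1, 3))                                          # one subword piece
context_b = np.stack([np.zeros((2, 3)), 2 * np.ones((2, 3))], axis=1)   # two subword pieces

# Step 1: average over the subword pieces within each context.
per_context = [c.mean(axis=1) for c in (context_a, context_b)]
# Step 2: average over contexts to get one vector per layer.
word_emb = np.stack(per_context).mean(axis=0)
print(word_emb.shape)
```

Note that each context counts equally in the second mean, regardless of how many subword pieces the word was split into.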


Great - preprocessing complete! Now for some analyses

## 3) Reproducing Section 3

### Section 3.1
Now we'll reproduce the results in Section 3.1. This section examines the cosine similarity contribution of each dimension in each layer of the models. The final cell will print the results table in LaTeX format.


In [None]:
#imports, some of these might be duplicates
import pickle
import random
import numpy as np
from scipy.stats import entropy
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, AutoTokenizer
import torch
import sys
import copy
from tqdm import tqdm
from scipy.spatial.distance import cosine
from scipy import stats
from matplotlib import pyplot as plt
random.seed(2398)

if torch.cuda.is_available():  
    dev = "cuda:0"
    sys.stdout.write('using CUDA!\n')
else:
    sys.stdout.write('NOT using CUDA!\n')
    dev = "cpu"  
device = torch.device(dev)


models = ['gpt2', 'xlnet-base-cased', 'bert-base-cased', 'roberta-base']

num_layers = {'gpt2':13, 'xlnet-base-cased':13, 'bert-base-cased':13, 'bert-base-uncased':13, 'roberta-base':13}
num_dims = {'gpt2':768, 'xlnet-base-cased':768, 'bert-base-cased':768, 'bert-base-uncased':768, 'roberta-base':768}

In [None]:
#this is the cosine contribution function we define in eq. 3
def cos_contrib(emb1, emb2):
    numerator_terms = emb1 * emb2
    denom = np.linalg.norm(emb1) * np.linalg.norm(emb2)
    return numerator_terms / denom 
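
As a quick sanity check, the per-dimension contributions sum to the ordinary cosine similarity. The sketch below redefines the function as `cos_contrib_demo` so that it runs on its own; the two vectors are arbitrary toy values.

```python
import numpy as np


def cos_contrib_demo(emb1, emb2):
    """Per-dimension contribution to cosine similarity (same formula as cos_contrib)."""
    return (emb1 * emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))


u = np.array([3.0, 0.0, 4.0])
v = np.array([4.0, 3.0, 0.0])

contribs_demo = cos_contrib_demo(u, v)
cos_sim = u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(contribs_demo, contribs_demo.sum(), cos_sim)
```

Here only the first dimension has a nonzero product, so it accounts for the entire similarity between `u` and `v`.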

In [None]:
#run the experiments!
cd_s = {}
for model_name in models:
    print(f'{model_name} Cosine sim contribution analysis:')
    sample_output = pickle.load(open("wikitext_embs/{}_128.p".format(model_name), "rb" ))
    sample_data = sample_output['embeddings']
    rogue_dist = []
    num_toks = sample_data.shape[1]
    
    #randomly sample embedding pairs to compute avg. cosine similarity contribution
    random_pairs = [random.sample(range(num_toks), 2) for i in range(500000)]
    
    cos_contribs_by_layer = []
    for layer in range(num_layers[model_name]):
        layer_cosine_contribs = []
        layer_rogue_cos_contribs = []
        for pair in random_pairs:
            emb1 = sample_data[layer,pair[0],:]
            emb2 = sample_data[layer,pair[1],:]
            layer_cosine_contribs.append(cos_contrib(emb1, emb2))

        layer_cosine_contribs = np.array(layer_cosine_contribs)
        layer_cosine_sims = layer_cosine_contribs.sum(axis=1)
        layer_cosine_contribs_mean = layer_cosine_contribs.mean(axis=0)
        cos_contribs_by_layer.append(layer_cosine_contribs_mean)
    cos_contribs_by_layer = np.array(cos_contribs_by_layer)
    
    aniso = cos_contribs_by_layer.sum(axis=1) #total anisotropy, measured as avg. cosine sim between random emb. pairs
    for layer in range(num_layers[model_name]):
        top_3_dims = np.argsort(cos_contribs_by_layer[layer])[-3:]
        top = cos_contribs_by_layer[layer,top_3_dims[2]] / aniso[layer]
        second = cos_contribs_by_layer[layer,top_3_dims[1]] / aniso[layer]
        third = cos_contribs_by_layer[layer,top_3_dims[0]] / aniso[layer]
        print("& {} & {:.3f} & {:.3f} & {:.3f} & {:.3f} \\\\".format(layer, top, second, third, aniso[layer]))
    
    #save cos_contribs for later analyses
    pickle.dump(cos_contribs_by_layer, open('{}_cos_contrib.p'.format(model_name), "wb" ))
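
To see what a rogue dimension looks like in this analysis, here is a sketch on synthetic data (Gaussian vectors, not model embeddings) in which dimension 0 is given a large constant offset. The sizes, the offset, and all variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vecs, n_dims = 2000, 64
embs = rng.normal(size=(n_vecs, n_dims))
embs[:, 0] += 20.0  # synthetic "rogue" dimension with a large mean

# Sample random pairs of distinct vectors.
idx1 = rng.integers(0, n_vecs, size=5000)
idx2 = (idx1 + rng.integers(1, n_vecs, size=5000)) % n_vecs
a, b = embs[idx1], embs[idx2]

# Mean contribution of each dimension to cosine similarity across pairs.
denom = np.linalg.norm(a, axis=1, keepdims=True) * np.linalg.norm(b, axis=1, keepdims=True)
mean_contrib = ((a * b) / denom).mean(axis=0)

anisotropy = mean_contrib.sum()          # average cosine similarity between random pairs
top_dim = int(np.argmax(mean_contrib))   # dimension with the largest contribution
share = mean_contrib[top_dim] / anisotropy
print(top_dim, float(anisotropy), float(share))
```

In this toy setting a single dimension accounts for nearly all of the average cosine similarity, which is the pattern the table above reports per layer.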


Word2Vec and GloVe baselines

1) download Word2Vec and GloVe models used in the paper:

    Word2Vec: https://zenodo.org/record/4421380
    
    GloVe: https://nlp.stanford.edu/projects/glove/ (Wikipedia+Gigaword 5, 300d)
    
    
2) place each .txt (glove) and .bin (word2vec) file in the folder ```static_embs/```

In [None]:
import gensim
#run above imports if running these cells from scratch
def get_cos_contrib(vocab_list, vec_dict):
    vocab_pairs_to_sample = [random.sample(vocab_list, 2) for i in range(500000)]
    cos_contribs = []
    
    for pair in vocab_pairs_to_sample:
        w1 = vec_dict[pair[0]]
        w2 = vec_dict[pair[1]]
        cos_contribs.append(cos_contrib(w1, w2))
    
    cos_contribs = np.array(cos_contribs)
    cosine_sims = cos_contribs.sum(axis=1)
    cosine_contribs_mean = cos_contribs.mean(axis=0)
    aniso_estimate = cosine_sims.mean()
    top_3_dims = np.sort(np.abs(cosine_contribs_mean))[-3:]
    top_dim_inds = np.argsort(np.abs(cosine_contribs_mean))
    return top_3_dims, aniso_estimate, top_dim_inds, cos_contribs

In [None]:
#get word2vec results
w2v_model = gensim.models.KeyedVectors.load_word2vec_format('static_embs/GoogleNews-vectors-negative300.bin', binary=True)
w2v_vocab = list(w2v_model.vocab.keys()) #gensim < 4.0; on gensim >= 4.0 use w2v_model.key_to_index instead of w2v_model.vocab
_, _, _, w2v_cos_contribs = get_cos_contrib(w2v_vocab, w2v_model)

In [None]:
#get glove results

#load model
with open('static_embs/glove.6B.300d.txt', 'r', encoding='utf-8') as f:
    glove = {x[0]:np.array(x[1:]).astype('float32') for x in [y.split() for y in f.readlines()]}

glove_vocab = list(glove.keys())
_, _, _, glove_cos_contribs = get_cos_contrib(glove_vocab, glove)

### Section 3.2
The following section reproduces results from Section 3.2, where we analyse how much of the variance in similarity computed in the full d-dimensional embedding space is explained by similarity computed in a (d-k)-dimensional space, where each of the k removed dimensions is a "rogue" dimension. Run the following cell to get results for cosine similarity.

In [None]:
contribs = {}
for model_name in models:
    contribs[model_name] = np.zeros((num_layers[model_name], 3))
    print(f'{model_name} cosine informativity analysis:')
    sample_output = pickle.load(open("wikitext_embs/{}_128.p".format(model_name), "rb" ))
    sample_data = sample_output['embeddings']
  
    rogue_dist = []
    num_toks = sample_data.shape[1]
    random_pairs = [random.sample(range(num_toks), 2) for i in range(100000)]
    for layer in range(num_layers[model_name]):
        layer_sims = []
        layer_cos_contrib = []
        for pair in random_pairs:
            emb1 = sample_data[layer,pair[0],:]
            emb2 = sample_data[layer,pair[1],:]
            layer_sims.append(1 - cosine(emb1, emb2))
            layer_cos_contrib.append(cos_contrib(emb1, emb2))
        mean_cos_contrib = np.array(layer_cos_contrib).mean(axis=0)
        for i, top_n in enumerate([1,3,5]):
            no_top_dim_sim = []
            botton_n_dims = np.argsort(mean_cos_contrib)[:-top_n] #all dimensions except the top_n largest contributors
            for pair in random_pairs:
                emb1_nonrogue = sample_data[layer,pair[0],botton_n_dims]
                emb2_nonrogue = sample_data[layer,pair[1],botton_n_dims]
                no_top_dim_sim.append(1 - cosine(emb1_nonrogue, emb2_nonrogue))
            l2_var_explained = stats.pearsonr(layer_sims, no_top_dim_sim)[0]**2
            contribs[model_name][layer,i] = l2_var_explained
        print(f'& {layer} & {contribs[model_name][layer,0]:.3f} &  {contribs[model_name][layer,1]:.3f} &  {contribs[model_name][layer,2]:.3f} \\\\')
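
The informativity measure in the cell above is the squared Pearson correlation between similarities computed in the full space and similarities computed with some dimensions removed. The sketch below illustrates it on synthetic Gaussian data with one artificially high-variance dimension. Unlike the cell above, it removes a dimension chosen by construction rather than by ranking contributions, and all sizes and names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vecs, n_dims = 1000, 32
embs = rng.normal(size=(n_vecs, n_dims))
embs[:, 0] = rng.normal(loc=0.0, scale=30.0, size=n_vecs)  # synthetic high-variance dimension


def cosine_sims(x, y):
    """Row-wise cosine similarity between two equally shaped matrices."""
    return (x * y).sum(axis=1) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))


# Random pairs of distinct vectors.
idx1 = rng.integers(0, n_vecs, size=5000)
idx2 = (idx1 + rng.integers(1, n_vecs, size=5000)) % n_vecs
full = cosine_sims(embs[idx1], embs[idx2])

# Remove the dominant dimension (0) versus an ordinary dimension (1).
keep_no_rogue = np.arange(1, n_dims)
keep_control = np.delete(np.arange(n_dims), 1)
no_rogue = cosine_sims(embs[idx1][:, keep_no_rogue], embs[idx2][:, keep_no_rogue])
control = cosine_sims(embs[idx1][:, keep_control], embs[idx2][:, keep_control])

r2_no_rogue = np.corrcoef(full, no_rogue)[0, 1] ** 2
r2_control = np.corrcoef(full, control)[0, 1] ** 2
print(float(r2_no_rogue), float(r2_control))
```

A low value after removing the dominant dimension means the remaining dimensions explain little of the full-space similarity, while removing an ordinary dimension leaves the similarities almost unchanged.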


Run this cell to replicate the results for the same experiments with Euclidean distance

In [None]:
contribs = {}
for model_name in models:
    contribs[model_name] = np.zeros((num_layers[model_name], 3))
    print(f'{model_name} l2 informativity analysis:')
    sample_output = pickle.load(open("wikitext_embs/{}_128.p".format(model_name), "rb" ))
    sample_data = sample_output['embeddings']
    rogue_dist = []
    num_toks = sample_data.shape[1]
    random_pairs = [random.sample(range(num_toks), 2) for i in range(100000)]
    for layer in range(num_layers[model_name]):
        layer_distances = []
        layer_dist_by_dim = []
        for pair in random_pairs:
            emb1 = sample_data[layer,pair[0],:]
            emb2 = sample_data[layer,pair[1],:]
            layer_distances.append(np.linalg.norm(emb1 - emb2))
            layer_dist_by_dim.append(np.abs(emb1-emb2))
        mean_dist_by_dim = np.array(layer_dist_by_dim).mean(axis=0)
        for i, top_n in enumerate([1,3,5]):
            no_top_dim_sim = []
            botton_n_dims = np.argsort(mean_dist_by_dim)[:-top_n] #all dimensions except the top_n largest contributors
            for pair in random_pairs:
                emb1_nonrogue = sample_data[layer,pair[0],botton_n_dims]
                emb2_nonrogue = sample_data[layer,pair[1],botton_n_dims]
                no_top_dim_sim.append(np.linalg.norm(emb1_nonrogue - emb2_nonrogue))
            l2_var_explained = stats.pearsonr(layer_distances, no_top_dim_sim)[0]**2
            contribs[model_name][layer,i] = l2_var_explained
        print(f'& {layer} & {contribs[model_name][layer,0]:.3f} &  {contribs[model_name][layer,1]:.3f} &  {contribs[model_name][layer,2]:.3f} \\\\')


Word2Vec and GloVe baselines

(Run the cell with the definition of ```get_cos_contrib``` above at least once before running this cell.)

In [None]:
def get_cos_informativity(vocab_list, vec_dict):
    vocab_pairs_to_sample = [random.sample(vocab_list, 2) for i in range(500000)]
    _, _, top_dims, _ = get_cos_contrib(vocab_list, vec_dict) #get_cos_contrib returns four values
    var_explained = []
    for dims_to_remove in [1,3,5]:
        orig_cos_sims = []
        dim_rm_cos_sims = []
        for pair in vocab_pairs_to_sample:
            w1 = vec_dict[pair[0]].copy()
            w2 = vec_dict[pair[1]].copy()
            orig_cos_sims.append(1-cosine(w1,w2))
            w1[top_dims[-dims_to_remove:]] = 0
            w2[top_dims[-dims_to_remove:]] = 0
            dim_rm_cos_sims.append(1-cosine(w1,w2))
        var_explained.append(stats.pearsonr(orig_cos_sims,dim_rm_cos_sims)[0]**2)
    return var_explained

In [None]:
#get word2vec results
w2v_model = gensim.models.KeyedVectors.load_word2vec_format('static_embs/GoogleNews-vectors-negative300.bin', binary=True)
w2v_vocab = list(w2v_model.vocab.keys()) #gensim < 4.0; on gensim >= 4.0 use w2v_model.key_to_index instead of w2v_model.vocab
print(get_cos_informativity(w2v_vocab, w2v_model))

In [None]:
#get GloVe results
with open('static_embs/glove.6B.300d.txt', 'r', encoding='utf-8') as f:
    glove = {x[0]:np.array(x[1:]).astype('float32') for x in [y.split() for y in f.readlines()]}
glove_vocab = list(glove.keys())
print(get_cos_informativity(glove_vocab, glove))

## 4) Reproducing Section 4
In this section, we use ablation experiments to quantify the influence of a single dimension on model behavior. This takes a very long time to run.

In [None]:
print('behavior influence analysis:')
sm = torch.nn.Softmax(dim=1)
layer_norm_names = {'gpt2':'transformer.h.{}.ln_1', #gpt-2 normalizes before components, so we have to zero out the first norm term of the next layer
                    'xlnet-base-cased':'transformer.layer.{}.ff.layer_norm', 
                    'bert-base-cased':'bert.encoder.layer.{}.output.LayerNorm', 
                    'bert-base-uncased':'bert.encoder.layer.{}.output.LayerNorm', 
                    'roberta-base':'roberta.encoder.layer.{}.output.LayerNorm'}
gpt_final_layer_norm_name = 'transformer.ln_f'

# modifies a LayerNorm by setting one dimension's weight and bias to zero,
# which zeroes out that dimension as future input for ablation tests
# returns a copy of the model with the modified parameters
def remove_dim_from_model_params(layer, dimension, model, model_name):
    model_copy = copy.deepcopy(model)
    state_dict = model_copy.state_dict()

    if(model_name == 'gpt2'):
        if(layer == -1):
            target_param_bias = 'transformer.ln_f.bias'
            target_param_weight = 'transformer.ln_f.weight'
        else:
            target_param_bias = (layer_norm_names[model_name] + '.bias').format(num_layers[model_name] + layer)
            target_param_weight = (layer_norm_names[model_name] + '.weight').format(num_layers[model_name] + layer)
    else:    
        target_param_bias = (layer_norm_names[model_name] + '.bias').format(num_layers[model_name] + layer - 1)
        target_param_weight =  (layer_norm_names[model_name] + '.weight').format(num_layers[model_name] + layer - 1)

    state_dict[target_param_bias][dimension] = 0.0
    state_dict[target_param_weight][dimension] = 0.0

    model_copy.load_state_dict(state_dict)
    return model_copy
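
The ablation works because LayerNorm's output in dimension j is `weight[j] * normalized[j] + bias[j]`: setting both parameters to zero pins that output to zero for every input while leaving the other dimensions untouched. The sketch below demonstrates this with a hand-rolled numpy layer norm rather than the models' own modules; the shapes, the ablated index, and the function name `layer_norm` are illustrative assumptions.

```python
import numpy as np


def layer_norm(x, weight, bias, eps=1e-5):
    """Layer normalization over the last axis with an elementwise affine transform."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * weight + bias


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))   # 4 tokens, 8 dimensions (toy sizes)
weight = rng.normal(size=8)
bias = rng.normal(size=8)

# Ablate one dimension by zeroing its weight and bias.
ablated_dim = 3
weight_abl, bias_abl = weight.copy(), bias.copy()
weight_abl[ablated_dim] = 0.0
bias_abl[ablated_dim] = 0.0

out = layer_norm(x, weight, bias)
out_abl = layer_norm(x, weight_abl, bias_abl)
print(out_abl[:, ablated_dim])
```

The normalization statistics are computed before the affine step, so the remaining dimensions of `out_abl` match `out` exactly.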


In [None]:
#tokenize and make predictions!
def tokenize_mlm(tokenizer, paragraphs):
    paragraph_tokens = [tokenizer.tokenize(paragraph)[:128] for paragraph in paragraphs]
    all_tok_ids = []
    all_masked_toks = []
    for tokens in paragraph_tokens:
        num_toks_to_mask = int(len(tokens) * .15)
        toks_to_mask = random.sample(range(len(tokens)), num_toks_to_mask) #mask
        tok_ids = tokenizer.convert_tokens_to_ids(tokens)
        for mask_ind in toks_to_mask:
            tok_ids[mask_ind] = tokenizer.mask_token_id
        all_tok_ids.append(tok_ids)
        all_masked_toks.append(toks_to_mask)
    return all_tok_ids, all_masked_toks


def tokenize_clm(tokenizer, paragraphs):
    return [tokenizer.encode(paragraph)[:128] for paragraph in paragraphs]

#given a list of tokenized inputs, compute and return the output distributions (softmaxed logits) for next-word predictions in the model
def make_predictions_clm(model, tokenizer, all_tok_ids, get_embs=False):
    all_texts_dists = []
    if(get_embs):
        all_embs = []
    else:
        all_embs = None
    for tok_ids in all_tok_ids:
        tokens_tensor = torch.tensor([tok_ids]).to(device)
        model_out = model(tokens_tensor)
        preds = model_out.logits.squeeze()
        if(get_embs):
            embeddings = torch.stack(model_out.hidden_states).squeeze().detach().cpu().numpy()
            all_embs.append(embeddings)
        dists = sm(preds).detach().cpu().numpy()
        all_texts_dists.append(dists)
    all_texts_dists = np.concatenate(all_texts_dists, axis=0)
    if(get_embs):
        all_embs = np.concatenate(all_embs, axis=1)
    return all_texts_dists, all_embs

#given a list of tokenized inputs, compute and return the output distributions (softmaxed logits) for masked tokens in the model
def make_predictions_mlm(model, tokenizer, all_tok_ids, all_masked_toks, get_embs=False):
    all_texts_dists = []
    if(get_embs):
        all_embs = []
    else:
        all_embs = None
    for i, tok_ids in enumerate(all_tok_ids):
        toks_to_mask = all_masked_toks[i]
        tokens_tensor = torch.tensor([tok_ids]).to(device)
        model_out = model(tokens_tensor)
        preds = model_out.logits.squeeze()[toks_to_mask,:]
        if(get_embs):
            embeddings = torch.stack(model_out.hidden_states).squeeze().detach().cpu().numpy()
            embeddings = embeddings[:,toks_to_mask,:]
            all_embs.append(embeddings)
        dists = sm(preds).detach().cpu().numpy()
        all_texts_dists.append(dists)
    all_texts_dists = np.concatenate(all_texts_dists, axis=0)
    if(get_embs):
        all_embs = np.concatenate(all_embs, axis=1)
    return all_texts_dists, all_embs

In [None]:
#compute mean kl_d across all token distributions in the sample  
def compute_avg_kl_d(model_distributions, ref_distributions):
    all_tok_kl_d = entropy(ref_distributions, qk=model_distributions, axis=1)
    return all_tok_kl_d.mean(), all_tok_kl_d.std()

#load the appropriate type of language modeling head for the given model, additionally return whether the model does masked lm
def load_model_tokenizer(model_name):
    if model_name in ['bert-base-cased', 'bert-base-uncased', 'roberta-base']:
        model = AutoModelForMaskedLM.from_pretrained(model_name, output_hidden_states=True)
    else:
        model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer, (model_name in ['bert-base-cased', 'bert-base-uncased', 'roberta-base'])
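
For reference, `compute_avg_kl_d` takes the KL divergence from the original model's distribution to the ablated model's distribution for each token, then averages over tokens. The sketch below spells out the same row-wise computation in plain numpy on toy distributions. It assumes strictly positive probabilities (as softmax outputs are), and `mean_kl` is a hypothetical stand-in, not the function used above.

```python
import numpy as np


def mean_kl(ref, model):
    """Mean and std over rows of KL(ref || model); each row is a probability distribution."""
    kl = np.sum(ref * np.log(ref / model), axis=1)
    return kl.mean(), kl.std()


ref = np.array([[0.5, 0.5], [0.9, 0.1]])          # reference (original model) distributions
unchanged = ref.copy()                            # an "ablation" with no effect
shifted = np.array([[0.25, 0.75], [0.9, 0.1]])    # first row changed, second row identical

kl_unchanged, _ = mean_kl(ref, unchanged)
kl_shifted, _ = mean_kl(ref, shifted)
print(float(kl_unchanged), float(kl_shifted))
```

A dimension whose ablation leaves predictions untouched gives a mean KL of zero, so larger values indicate a stronger influence on model behavior.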

In [None]:
layers_to_test = [-4, -3, -2, -1] #for time considerations we only look at ablation in the last 4 layers
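#requires embeddings saved with context_length = 512 in section 2 (the file name encodes the context length)
sample_output = pickle.load(open("wikitext_embs/{}_512.p".format(model_name), "rb" ))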
sample_corpus = sample_output['corpus']
model, tokenizer, is_mlm = load_model_tokenizer(model_name)
model.to(device)
model.eval()
#need more data for masked lm distributions since some tokens need to be preserved, 
#regardless for time considerations we limit the size of the sample
if(is_mlm):
    sample_for_exp = sample_corpus
    all_toks, mask_locations = tokenize_mlm(tokenizer, sample_for_exp)
    original_model_preds, embs = make_predictions_mlm(model, tokenizer, all_toks, mask_locations, get_embs=True)

else:
    sample_for_exp = sample_corpus[:23] #since we're only looking at masked tokens in mlm, but all tokens in clm, we truncate the text sample (23)
    all_toks = tokenize_clm(tokenizer, sample_for_exp)
    original_model_preds, embs = make_predictions_clm(model, tokenizer, all_toks, get_embs=True)
    
pickle.dump(original_model_preds, open('{}_preds.p'.format(model_name), "wb" )) 
pickle.dump(embs, open('{}_pred_embs.p'.format(model_name), "wb" )) 


model_mean_kl_ds = np.zeros((len(layers_to_test),num_dims[model_name]))
model_std_kl_ds = np.zeros((len(layers_to_test),num_dims[model_name]))
for layer in tqdm(layers_to_test):
    for dim in tqdm(list(range(num_dims[model_name]))):
        ablated_model = remove_dim_from_model_params(layer, dim, model, model_name)
        ablated_model.to(device)
        ablated_model.eval()
        if(is_mlm):
            ablated_model_preds, _ = make_predictions_mlm(ablated_model, tokenizer, all_toks, mask_locations)
        else:
            ablated_model_preds, _ = make_predictions_clm(ablated_model, tokenizer, all_toks)
        mean_kl_d, std_kl_d = compute_avg_kl_d(ablated_model_preds, original_model_preds)
        model_mean_kl_ds[layer, dim] = mean_kl_d
        model_std_kl_ds[layer, dim] = std_kl_d
out = {'model_mean_kl_ds':model_mean_kl_ds, 'model_std_kl_ds':model_std_kl_ds}
pickle.dump(out, open('{}_kl_ds.p'.format(model_name), "wb" )) 

### Visualize Results from sections 4.1-4.4 (Figure 1)
Figure 1 was generated using ggplot in R. The Rmd file is provided in the github repository (in the folder ```visualize_contribution_mismatch```). Place all of the results (pickle files) from section 3.2 and section 4 into the same folder as the Rmd and run - files will render in the ```img/``` directory.


### Visualize results from section 4.5 (Figure 2)
The following cell renders the figures for Section 4.5, showing the distributions of values in rogue dimensions across tokens.

In [None]:
models = ['gpt2', 'xlnet-base-cased', 'bert-base-cased', 'roberta-base']
fig, axs = plt.subplots(13, 4, figsize=(12,30)) #one row per layer (13), one column per model (4); taller figure so all rows fit
fig.tight_layout() 
for i, model_name in enumerate(models):
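    #requires embeddings saved with context_length = 30 in section 2 (is_intial below assumes 30 tokens per text)
    sample_output = pickle.load(open("wikitext_embs/{}_30.p".format(model_name), "rb" ))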
    sample_data = sample_output['embeddings']
    sample_toks = sample_output['tokens']
    is_intial = [i % 30 == 0 for i in range(sample_data.shape[1])]
    not_is_intial = [not x for x in is_intial]
    is_punct = [x in {".", "\\", "&"} for x in sample_output['tokens']]
    not_is_punct = [not (x or y) for x, y in zip(is_punct, is_intial)]
    for layer in range(13):
        top_5_dims = np.argsort(sample_data[layer,:,:].std(axis=0))[-1:]
        for dim in top_5_dims:
            tok_no_punct_dist = sample_data[layer,not_is_punct,dim]
            tok_punct_dist = sample_data[layer,is_punct,dim]
            tok_is_inital_dist = sample_data[layer,is_intial,dim]
            hist_data = [tok_punct_dist, tok_is_inital_dist, tok_no_punct_dist] #groups differ in length, so keep them as a list
            bins=np.histogram(np.hstack((tok_no_punct_dist,tok_punct_dist,tok_is_inital_dist)), bins=40)[1]
            axs[layer,i].hist(hist_data, stacked=True, bins=bins, label=("punctuation","position 0", "other tokens"))

        axs[layer,i].set_title(f'{model_name} | l:{layer} | d:{dim}')
axs[0,0].legend()
fig.text(0.5, 0.00, 'Activations across tokens in dimension', ha='center')
fig.text(-0.015, 0.5, 'count', va='center', rotation='vertical')

## 5) Reproducing Section 5

To reproduce the results from Section 5, simply run all of the following cells. You can download and use the precomputed embeddings referenced in section 2 of this notebook.

In [None]:
#imports needed for this section, defining the sim datasets
from glob import glob
import pickle
from sklearn.decomposition import PCA
import numpy as np
from scipy.spatial.distance import cosine
from scipy import stats
import os
from tqdm import tqdm
from matplotlib import pyplot as plt

In [None]:
#load similarity datasets
sim_filenames = ['word_sim/simlex-999/SimLex-999.txt', 
                'word_sim/simverb-3500/SimVerb-3500.txt', 
                'word_sim/wordsim-353_r/wordsim_relatedness_goldstandard.txt', 
                'word_sim/wordsim-353_s/wordsim_similarity_goldstandard.txt',
                'word_sim/rg-65/RG65.txt']

sim_lex_pairs = {}
for sim_lex_filename in sim_filenames:
    dataset = sim_lex_filename.split('/')[1] #dataset folder name, e.g. 'simlex-999'
    if dataset not in sim_lex_pairs.keys():
        sim_lex_pairs[dataset] = []
    with open(sim_lex_filename, 'r', encoding='utf-8') as f:
        word_pair_lines = f.readlines()
        for line in word_pair_lines[1:]:
            if(dataset[:2] == 'rg'):
                vals = line.split(';')
            else:
                vals = line.split()
            word1 = vals[0].lower()
            word2 = vals[1].lower()
            #wordsim-353 and rg-65 keep the score in the third column, the other datasets in the fourth
            if(dataset[:2] == 'wo' or dataset[:2] == 'rg'):
                score = float(vals[2])
            else:
                score = float(vals[3])
            sim_lex_pairs[dataset].append((word1,word2,score))

sim_lex_pairs = {k:list(set(v)) for k,v in sim_lex_pairs.items()}
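
The next cell defines the similarity variants compared in Section 5. As a rough illustration of the standardized variant (`std_cosine`), the sketch below z-scores each dimension before taking the cosine. It uses synthetic data with one artificially inflated dimension, not model embeddings, and every size and name in it is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_dims = 500, 16
embs = rng.normal(size=(n_words, n_dims))
embs[:, 0] = 50.0 + rng.normal(scale=5.0, size=n_words)  # synthetic rogue dimension


def cos(a, b):
    """Cosine similarity between two vectors."""
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Standardize: subtract the per-dimension mean and divide by the per-dimension std.
means, stds = embs.mean(axis=0), embs.std(axis=0)
z = (embs - means) / stds

# Compare similarities of neighbouring (unrelated) vectors before and after standardizing.
raw_sims = np.array([cos(embs[i], embs[i + 1]) for i in range(n_words - 1)])
std_sims = np.array([cos(z[i], z[i + 1]) for i in range(n_words - 1)])
print(float(raw_sims.mean()), float(std_sims.mean()))
```

Before standardization the unrelated vectors all look nearly identical because one dimension dominates; afterwards their average similarity is close to zero, as expected for independent random vectors.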

In [None]:
def get_corpus_sample_embs(model_name):
    all_embs_fs = glob('wordsim_embs/{}/*.p'.format(model_name))
    all_mean_embs = []
    for word_fname in all_embs_fs:
        mean_emb = pickle.load(open(word_fname, 'rb'))
        all_mean_embs.append(mean_emb)
    return np.stack(all_mean_embs, axis=1)


#get embedding sample for mean/std of emb space, get PCs for all-but-the-top (Mu et al. 2018)
def get_model_mean_std_pcs(model_name):
    corpus_sample_embeds = get_corpus_sample_embs(model_name)
    corpus_sample_means = corpus_sample_embeds.mean(axis=1)
    corpus_sample_stds = corpus_sample_embeds.std(axis=1)
    num_pcs = 3
    corpus_sample_pcs = []
    for layer in range(corpus_sample_embeds.shape[0]):
        layer_sample = corpus_sample_embeds[layer,:,:]
        layer_sample = layer_sample - corpus_sample_means[layer]
        pca = PCA(n_components=num_pcs)
        pca.fit(layer_sample)
        corpus_sample_pcs.append(pca.components_)   
    return corpus_sample_means, corpus_sample_stds, corpus_sample_pcs

#general-purpose similarity function, returns the specified similarity results given some function
#really we should be passing the sim function in as an argument here, but I ended up implementing this with boolean fields
def similarity(emb1, emb2, layer, means=None, stds=None, pcs=None, ustd_cosine=False, dim_rm_cosine = False, std_cosine=False, mean_sub_cosine=False, rm_pcs_cosine=False, spearman=False):
    if(ustd_cosine):
        return 1 - cosine(emb1, emb2)
  
    elif(dim_rm_cosine):
        emb1_rm = emb1.copy()
        emb2_rm = emb2.copy()
        top_5_dims = np.argsort(np.abs(means[layer]))[-5:]
        for dim in top_5_dims:
            emb1_rm[dim] = 0
            emb2_rm[dim] = 0
        return 1 - cosine(emb1_rm, emb2_rm)
  
    elif(spearman):
        return stats.spearmanr(emb1, emb2)[0]
  
    else:
        mean_rm_emb1 = (emb1 - means[layer])
        mean_rm_emb2 = (emb2 - means[layer]) 
    
    if(std_cosine):
        emb1_std = mean_rm_emb1 / stds[layer]
        emb2_std = mean_rm_emb2 / stds[layer]
        return 1 - cosine(emb1_std, emb2_std)
    
    elif(mean_sub_cosine):
        return 1 - cosine(mean_rm_emb1, mean_rm_emb2)
    
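    elif(rm_pcs_cosine): #all-but-the-top
        layer_pcs = pcs[layer]
        rm_term_1 = np.zeros(emb1.shape[0])
        rm_term_2 = np.zeros(emb2.shape[0])
        # note: the projections are computed from the raw (not mean-subtracted) embeddings
        for pc in layer_pcs:
            rm_term_1 += pc.dot(emb1) * pc
            rm_term_2 += pc.dot(emb2) * pc
        emb1_pc_rm = mean_rm_emb1 - rm_term_1
        emb2_pc_rm = mean_rm_emb2 - rm_term_2
        return 1 - cosine(emb1_pc_rm, emb2_pc_rm)

    # no flag was set, so there is nothing to compute
    raise ValueError('no similarity type selected')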

def get_decontextualized_sim(word1, word2, model_name, corpus_sample_means, corpus_sample_stds, corpus_sample_pcs):
    if(not os.path.exists('wordsim_embs/{}/{}.p'.format(model_name, word1))):
        return None
    if(not os.path.exists('wordsim_embs/{}/{}.p'.format(model_name, word2))):
        return None
    word1_emb = pickle.load(open('wordsim_embs/{}/{}.p'.format(model_name, word1), 'rb'))
    word2_emb = pickle.load(open('wordsim_embs/{}/{}.p'.format(model_name, word2), 'rb'))
    layer_sims = {'ustd_cosine':[],
                'std_cosine':[],
                'mean_sub_cosine':[],
                'rm_pcs_cosine':[],
                'spearman':[]}
    for layer in range(word1_emb.shape[0]):
        layer_sims['ustd_cosine'].append(similarity(word1_emb[layer], word2_emb[layer], layer, ustd_cosine=True))
        layer_sims['spearman'].append(similarity(word1_emb[layer], word2_emb[layer], layer, spearman=True))
        layer_sims['std_cosine'].append(similarity(word1_emb[layer], word2_emb[layer], layer, means=corpus_sample_means, stds=corpus_sample_stds, std_cosine=True))
        layer_sims['mean_sub_cosine'].append(similarity(word1_emb[layer], word2_emb[layer], layer, means=corpus_sample_means, stds=corpus_sample_stds, mean_sub_cosine=True))
        layer_sims['rm_pcs_cosine'].append(similarity(word1_emb[layer], word2_emb[layer], layer, means=corpus_sample_means, pcs=corpus_sample_pcs, rm_pcs_cosine=True))

    return layer_sims
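
To see why the `ustd_cosine` and `std_cosine` variants above can disagree so sharply, it may help to try them on synthetic data. The following is a toy sketch, not the paper's experiment: the embeddings are random noise, and the single "rogue" dimension (large mean and variance in dimension 0) is an assumption made purely for illustration.

```python
import numpy as np
from scipy.spatial.distance import cosine

rng = np.random.default_rng(0)
n_words, n_dims = 1000, 64

# Synthetic "embeddings": unit-variance noise in every dimension...
embs = rng.normal(size=(n_words, n_dims))
# ...except dimension 0, which gets a large mean and variance (the assumed rogue dimension)
embs[:, 0] = rng.normal(loc=50.0, scale=5.0, size=n_words)

# Per-dimension statistics of the sample, analogous to means[layer] and stds[layer] above
mean, std = embs.mean(axis=0), embs.std(axis=0)

a, b = embs[0], embs[1]
raw_sim = 1 - cosine(a, b)                                # unstandardized cosine
std_sim = 1 - cosine((a - mean) / std, (b - mean) / std)  # cosine after z-scoring

print('raw cosine:         ', raw_sim)
print('standardized cosine:', std_sim)
```

Because the two vectors are unrelated apart from the shared rogue dimension, the raw cosine should come out close to 1 while the standardized cosine should be much closer to 0.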

In [None]:
def generate_plots_per_model(model_name, sims):
    layerwise_sims_spearman = {'ustd_cosine':[],
          'std_cosine':[]}
    label_keys = {'ustd_cosine':'original',
          'std_cosine':'standardized',
          'mean_sub_cosine':'mean removed',
          'rm_pcs_cosine':'abtt',
          'spearman':'spearman'}
    markers = ['o',"^","s","D","v"]
    plt.figure(figsize=(4,4.5))
    ax = plt.axes() 
    for i, sim_type in enumerate(layerwise_sims_spearman.keys()):
        for layer in range(13):
            layer_sims = []
            for dataset in sims.keys():
                layer_sims.append(stats.spearmanr(sims[dataset][layer][sim_type], sims[dataset][layer]['human'])[0])
            layerwise_sims_spearman[sim_type].append(np.array(layer_sims).mean())
        plt.plot(layerwise_sims_spearman[sim_type], label=label_keys[sim_type], linestyle='--', marker=markers[i])
    #plt.ylim([0, .55])
    plt.title(model_name, fontsize=20)
    if(model_name == 'gpt2'):
        plt.legend(fontsize=14)
    plt.xlabel('layer', fontsize=16)
    plt.ylabel('Spearman\'s rho', fontsize=16)
    ax.xaxis.grid()
    ax.yaxis.grid()
    plt.show()

In [None]:
def generate_plots(model_name, sims):
    layerwise_sims_spearman = {'ustd_cosine':[],
          'std_cosine':[]}
    label_keys = {'ustd_cosine':'original',
          'std_cosine':'standardized',
          'mean_sub_cosine':'mean removed',
          'rm_pcs_cosine':'abtt',
          'spearman':'spearman'}
    markers = ['o',"^","s","D","v"]
    fig, axs = plt.subplots(1, 4, figsize=(12,3))
    fig.tight_layout() 
    for i, dataset in enumerate(sims.keys()):
        for j, sim_type in enumerate(layerwise_sims_spearman.keys()):
            layer_sims = []
            for layer in range(13):
                layer_sims.append(stats.spearmanr(sims[dataset][layer][sim_type], sims[dataset][layer]['human'])[0])
            axs[i].plot(layer_sims, label=label_keys[sim_type], linestyle='--', marker=markers[j])
     
        axs[i].set_title(dataset, fontsize=18)
        axs[i].xaxis.grid()
        axs[i].yaxis.grid()
        axs[i].set_xticks(range(0,13,2))
    axs[0].legend(fontsize=11)
    fig.text(.5, 1.05, model_name, ha='center', fontsize=24)
    fig.text(0.5, 0.00, 'layer', ha='center', fontsize=16)
    fig.text(-0.015, 0.5, 'Spearman\'s rho', va='center', rotation='vertical', fontsize=16)

    plt.show()
  

In [None]:
#generate the plots!
models = ['gpt2', 'xlnet-base-cased', 'bert-base-cased', 'roberta-base']

for model_name in models:
    corpus_sample_means, corpus_sample_stds, corpus_sample_pcs = get_model_mean_std_pcs(model_name)
  
    excl = 0
    dataset_sims = {}
    for sim_dataset, sim_list in sim_lex_pairs.items():
        layerwise_decontextual_sims = [{'ustd_cosine':[],
                'std_cosine':[],
                'mean_sub_cosine':[],
                'rm_pcs_cosine':[],
                'spearman':[],
                'human':[]} for x in range(13)]
    
        for sim_pair in tqdm(sim_list):
            human_score = sim_pair[2]
            word1 = sim_pair[0]
            word2 = sim_pair[1]
            decontextual_sim_score = get_decontextualized_sim(word1, word2, model_name, corpus_sample_means, corpus_sample_stds, corpus_sample_pcs)
      
            if(decontextual_sim_score is not None):
                for sim_type, layer_scores in decontextual_sim_score.items():
                    for layer_ind, layer_score in enumerate(layer_scores):
                        layerwise_decontextual_sims[layer_ind][sim_type].append(layer_score)
                for layer_ind in range(13):
                    layerwise_decontextual_sims[layer_ind]['human'].append(human_score)
            else:
                excl += 1
        dataset_sims[sim_dataset] = layerwise_decontextual_sims   
    
    generate_plots(model_name, dataset_sims)
    print(excl)
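
The plots report, for each layer, Spearman's rho between the model's similarity scores and the human ratings. Since rho depends only on rank order, a similarity measure whose values are squeezed into a narrow range can still score well as long as the ranking is right. A small sketch with made-up numbers (none of these values come from the datasets or models above):

```python
from scipy import stats

# Made-up human ratings for four word pairs
human = [9.0, 7.5, 3.1, 0.4]
# Made-up model similarities, compressed into a narrow range but in the same order
model = [0.99, 0.98, 0.97, 0.96]

rho = stats.spearmanr(model, human)[0]
# the two rankings are identical, so rho is 1 (up to floating point)
print(rho)
```

This is why the comparison across similarity measures in the plots reflects how well each measure orders word pairs, not the absolute size of its values.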