<a href="https://colab.research.google.com/github/jalammar/ecco/blob/main/notebooks/Ecco_CCA_Similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# ! pip install ecco

Load Ecco and BERT (here the smaller `distilbert-base-uncased` checkpoint).

In [1]:
import ecco
lm = ecco.from_pretrained('distilbert-base-uncased', gpu=False)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from 

Let's give BERT a passage of text to process.

In [2]:
text = '''Now I ask you: what can be expected of man since he is a being endowed with strange qualities? Shower upon him every earthly blessing, drown him in a sea of happiness, so that nothing but bubbles of bliss can be seen on the surface; give him economic prosperity, such that he should have nothing else to do but sleep, eat cakes and busy himself with the continuation of his species, and even then out of sheer ingratitude, sheer spite, man would play you some nasty trick. He would even risk his cakes and would deliberately desire the most fatal rubbish, the most uneconomical absurdity, simply to introduce into all this positive good sense his fatal fantastic element. It is just his fantastic dreams, his vulgar folly that he will desire to retain, simply in order to prove to himself--as though that were so necessary-- that men still are men and not the keys of a piano, which the laws of nature threaten to control so completely that soon one will be able to desire nothing but by the calendar. And that is not all: even if man really were nothing but a piano-key, even if this were proved to him by natural science and mathematics, even then he would not become reasonable, but would purposely do something perverse out of simple ingratitude, simply to gain his point. And if he does not find means he will contrive destruction and chaos, will contrive sufferings of all sorts, only to gain his point! He will launch a curse upon the world, and as only man can curse (it is his privilege, the primary distinction between him and other animals), may be by his curse alone he will attain his object--that is, convince himself that he is a man and not a piano-key!
'''

inputs = lm.tokenizer([text], return_tensors="pt")
output = lm(inputs)

The `output` variable now contains the result of BERT processing the passage of text. The hidden states after each layer can be retrieved with `output._get_encoder_hidden_states()`, as the next cell does.

In [3]:
hidden_states = output._get_encoder_hidden_states()
embed = output.embedding_states.detach().numpy()[0,:,:].T
hidden_state_layer = [layer.detach().numpy()[0,:,:].T for layer in hidden_states]
embed.shape, hidden_state_layer[0].shape, len(hidden_state_layer)

((768, 363), (768, 363), 6)

`embed` now contains the embeddings of the inputs. Its dimensions are (embed_dim, number of tokens), here (768, 363).
`hidden_state_layer` holds the output of each of the model's 6 layers. Each layer's output also has dimensions (embed_dim, number of tokens).

This is how to calculate the CKA similarity score between the embedding layer and the output of the first layer:

In [4]:
from ecco import analysis
analysis.cka(embed, hidden_state_layer[0])

0.9042735255823328
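
For intuition about what this score measures, here is a minimal NumPy sketch of linear CKA for activations laid out as (neurons, tokens), the same layout as `embed`. It assumes the linear-kernel variant and is not necessarily how `analysis.cka` is implemented inside Ecco; `linear_cka` and the random arrays are illustrative names and data only.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two activation matrices shaped (neurons, tokens)."""
    # Center each neuron's activations across tokens.
    x = x - x.mean(axis=1, keepdims=True)
    y = y - y.mean(axis=1, keepdims=True)
    # Squared Frobenius norm of the cross-covariance (constant factors cancel).
    cross = np.linalg.norm(x @ y.T, ord="fro") ** 2
    # Frobenius norms of the two self-covariances, used for normalization.
    self_x = np.linalg.norm(x @ x.T, ord="fro")
    self_y = np.linalg.norm(y @ y.T, ord="fro")
    return float(cross / (self_x * self_y))

# Illustrative random data, not model activations.
rng = np.random.default_rng(0)
acts = rng.standard_normal((20, 200))
rotation, _ = np.linalg.qr(rng.standard_normal((20, 20)))  # random orthogonal matrix

same = linear_cka(acts, acts)                      # identical representations
rotated = linear_cka(acts, 3.0 * rotation @ acts)  # rotated and rescaled copy
unrelated = linear_cka(acts, rng.standard_normal((20, 200)))
print(same, rotated, unrelated)
```

The rotated and rescaled copy scores the same as the original, while unrelated data scores much lower. Unlike the CCA family discussed further down, nothing in this computation requires more tokens than neurons.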

When we compare the embeddings with the output of the second layer, we see less similarity:

In [5]:
analysis.cka(embed, hidden_state_layer[1])

0.7774274463814453

And so on for the third layer:

In [6]:
analysis.cka(embed, hidden_state_layer[2])

0.6922863613160068
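
Rather than repeating the call once per layer, the comparison can be wrapped in a small loop. The sketch below is one way to do that; `layerwise_similarity` and `toy_similarity` are hypothetical helpers written for this example (they are not part of Ecco), and the arrays are random stand-ins with the same (neurons, tokens) layout as the notebook's data.

```python
import numpy as np

def layerwise_similarity(reference, layers, similarity_fn):
    """Score `reference` against every array in `layers`, in order."""
    return [float(similarity_fn(reference, layer)) for layer in layers]

def toy_similarity(a, b):
    """Hypothetical stand-in metric: mean per-neuron Pearson correlation."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))

# Stand-in "layers": the reference plus progressively more noise.
rng = np.random.default_rng(0)
reference = rng.standard_normal((8, 40))
layers = [reference + 0.5 * i * rng.standard_normal((8, 40)) for i in range(1, 4)]

scores = layerwise_similarity(reference, layers, toy_similarity)
print(scores)
```

In this notebook, passing `embed`, `hidden_state_layer` and `analysis.cka` as the three arguments should give one CKA score per layer, since `analysis.cka` is called with two arrays in the cells above.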

We can also try `cca`, `svcca` and `pwcca`, but we first need to choose a subset of the neurons: these methods require more tokens than neurons (and recommend about 10x as many tokens as neurons for a reliable similarity score). Here we have 363 tokens and 768 neurons per layer.
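
To see why the token count matters, here is a minimal NumPy sketch of mean CCA similarity, computed from orthonormal bases of the two activation matrices. It is an illustration under our own assumptions (the `mean_cca` name, the QR-based formulation and the random data are ours) and may differ from what `analysis.cca` does internally.

```python
import numpy as np

def mean_cca(x, y):
    """Mean canonical correlation between activations shaped (neurons, tokens)."""
    # Center each neuron across tokens, then put tokens on the rows.
    x = (x - x.mean(axis=1, keepdims=True)).T
    y = (y - y.mean(axis=1, keepdims=True)).T
    # Orthonormal bases for the spaces spanned by each set of neurons.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    # The singular values of qx.T @ qy are the canonical correlations.
    rho = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return float(np.clip(rho, 0.0, 1.0).mean())

# Illustrative, statistically unrelated random data.
rng = np.random.default_rng(0)

# 50 neurons but only 30 tokens: fewer tokens than neurons.
few_tokens = mean_cca(rng.standard_normal((50, 30)),
                      rng.standard_normal((50, 30)))

# 50 neurons and 5000 tokens: far more tokens than neurons.
many_tokens = mean_cca(rng.standard_normal((50, 5000)),
                       rng.standard_normal((50, 5000)))

print(few_tokens, many_tokens)
```

With fewer tokens than neurons, both sets of neurons span the whole token space, so even unrelated data gets a score of essentially 1. With many more tokens than neurons the score for unrelated data drops to a low value, which is the behaviour we want from a similarity measure.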

Let's compare the layers using only the first 50 neurons of each.

In [7]:
print("CCA - Embed vs. layer 0:", analysis.cca(embed[:50,:], hidden_state_layer[0][:50,:]))
print("CCA - Embed vs. layer 1:", analysis.cca(embed[:50,:], hidden_state_layer[1][:50,:]))

CCA - Embed vs. layer 0: 0.8518187688700852
CCA - Embed vs. layer 1: 0.7220358158064838


In [8]:
print("SVCCA - Embed vs. layer 0:", analysis.svcca(embed[:50,:], hidden_state_layer[0][:50,:]))
print("SVCCA - Embed vs. layer 1:", analysis.svcca(embed[:50,:], hidden_state_layer[1][:50,:]))

SVCCA - Embed vs. layer 0: 0.7830642779004243
SVCCA - Embed vs. layer 1: 0.6833412966387719
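
SVCCA adds a dimensionality-reduction step before CCA: each activation matrix is first projected onto its top singular directions. The sketch below illustrates that idea in NumPy. The 99% variance threshold, the function names and the synthetic data are assumptions made for this example, and the code is not taken from Ecco's `analysis.svcca`.

```python
import numpy as np

def svd_reduce(acts, keep_variance=0.99):
    """Project (neurons, tokens) activations onto their top singular directions."""
    acts = acts - acts.mean(axis=1, keepdims=True)
    _, s, vt = np.linalg.svd(acts, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Smallest number of directions whose cumulative variance reaches the threshold.
    k = min(int(np.searchsorted(explained, keep_variance)) + 1, len(s))
    return s[:k, None] * vt[:k]  # shape (k, tokens)

def mean_cca(x, y):
    """Mean canonical correlation between activations shaped (neurons, tokens)."""
    x = (x - x.mean(axis=1, keepdims=True)).T
    y = (y - y.mean(axis=1, keepdims=True)).T
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    rho = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return float(np.clip(rho, 0.0, 1.0).mean())

def svcca_sketch(x, y, keep_variance=0.99):
    """SVD-reduce both inputs, then take the mean CCA of the reduced matrices."""
    return mean_cca(svd_reduce(x, keep_variance), svd_reduce(y, keep_variance))

# Synthetic data: x and y are noisy mixtures of the same 3 latent signals,
# while `other` is built from 3 different latent signals.
rng = np.random.default_rng(0)
latent = rng.standard_normal((3, 400))
x = rng.standard_normal((20, 3)) @ latent + 1e-3 * rng.standard_normal((20, 400))
y = rng.standard_normal((20, 3)) @ latent + 1e-3 * rng.standard_normal((20, 400))
other = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 400))

shared = svcca_sketch(x, y)
unrelated = svcca_sketch(x, other)
print(shared, unrelated)
```

Because the low-variance directions are dropped before the correlations are computed, the small noise added to `x` and `y` has little effect on the score, while the pair built from different latent signals scores much lower.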


In [9]:
print("PWCCA - Embed vs. layer 0:", analysis.pwcca(embed[:50,:], hidden_state_layer[0][:50,:]))
print("PWCCA - Embed vs. layer 1:", analysis.pwcca(embed[:50,:], hidden_state_layer[1][:50,:]))

PWCCA - Embed vs. layer 0: 0.8695735290407357
PWCCA - Embed vs. layer 1: 0.7461958851582353
