In [None]:
__author__ = "Ricardo Primi adapted from modules from Christopher Potts, CS224u, Stanford, Spring 2021"

### General set-up



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import sys
sys.path.append('/content/drive/MyDrive/Stanford_cs224u')

In [None]:
!pip3 install transformers

Modules `vsm`, `utils` and `sst` are from Stanford's CS224u https://github.com/cgpotts/cs224u

In [None]:
import os
import pandas as pd
import torch
from transformers import BertModel, BertTokenizer
from transformers import RobertaModel, RobertaTokenizer

import utils
import vsm
import sst

In [None]:
if torch.cuda.is_available(): 
   dev = "cuda:0"
else: 
   dev = "cpu"
device = torch.device(dev)
print('Using {}'.format(device))

Using cuda:0


In [None]:
!nvidia-smi

Tue Nov 29 17:19:34 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Data

In [None]:
bd_metaf = pd.read_csv("/content/drive/MyDrive/unicamp - IA024 /projeto_metaf/dt_metaf_unicamp.csv") 
bd_metaf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12174 entries, 0 to 12173
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   iddd2          12174 non-null  object
 1   Código         12173 non-null  object
 2   Item           12174 non-null  object
 3   n_resposta     12174 non-null  int64 
 4   train_subj     12174 non-null  int64 
 5   resp_relacao3  12174 non-null  object
 6   y_theta        12174 non-null  int64 
 7   y_score1       12174 non-null  int64 
 8   y_score2       12174 non-null  int64 
dtypes: int64(5), object(4)
memory usage: 856.1+ KB


In [None]:
utils.fix_random_seeds()

In [None]:
#import logging
#logger = logging.getLogger()
#logger.level = logging.ERROR

### Loading Transformer models
Specify the name of the pretrained weights, then load the matching tokenizer and model:

In [None]:
bert_weights_name = 'neuralmind/bert-base-portuguese-cased'
bert_tokenizer = BertTokenizer.from_pretrained(bert_weights_name)
bert_model = BertModel.from_pretrained(bert_weights_name)


Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### The basics of tokenizing


In [None]:
example_text = bd_metaf['resp_relacao3'][1]
type(example_text)
print(example_text)
print(bert_tokenizer.tokenize(example_text))
ex_ids = bert_tokenizer.encode(example_text, add_special_tokens=True)
print(ex_ids)
print(bert_tokenizer.convert_ids_to_tokens(ex_ids))

Macaco é o comunicador da Floresta. Porque avisa os outros animais do perigo
['Maca', '##co', 'é', 'o', 'comunicado', '##r', 'da', 'Floresta', '.', 'Por', '##que', 'avis', '##a', 'os', 'outros', 'animais', 'do', 'perigo']
[101, 13399, 303, 253, 146, 16677, 22282, 180, 13509, 119, 566, 455, 7598, 22278, 259, 736, 3155, 171, 9538, 102]
['[CLS]', 'Maca', '##co', 'é', 'o', 'comunicado', '##r', 'da', 'Floresta', '.', 'Por', '##que', 'avis', '##a', 'os', 'outros', 'animais', 'do', 'perigo', '[SEP]']
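
The `##` prefix in the output above marks a piece that continues the previous one. As a rough illustration of how a WordPiece-style tokenizer arrives at such splits, here is a minimal sketch of greedy longest-match-first splitting. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical; the real tokenizer uses the full vocabulary shipped with the model and also handles punctuation and casing, which this sketch ignores.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes the unknown token
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen to mirror a few pieces from the output above.
toy_vocab = {"Maca", "##co", "comunicado", "##r", "avis", "##a", "perigo"}

print(wordpiece_tokenize("Macaco", toy_vocab))       # ['Maca', '##co']
print(wordpiece_tokenize("comunicador", toy_vocab))  # ['comunicado', '##r']
print(wordpiece_tokenize("xyz", toy_vocab))          # ['[UNK]']
```
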


### Get decontextualized representations

https://huggingface.co/docs/transformers/main_classes/output

To obtain the representations for a batch of examples, we use the `forward` method of the model, as follows:

In [None]:
with torch.no_grad():
    reps = bert_model(torch.tensor([ex_ids]), output_hidden_states=True)

The return value `reps` is a special `transformers` class that holds a lot of representations. If we want just the final output representations for each token, we use `last_hidden_state`:

In [None]:
print(reps.last_hidden_state.shape)
reps.last_hidden_state[:, 9, :].shape

torch.Size([1, 20, 768])


torch.Size([1, 768])

The shape indicates that our batch has 1 example, with 20 tokens, and each token is represented by a vector of dimensionality 768.

Aside: Hugging Face `transformers` models also have a `pooler_output` value. For BERT, this corresponds to the output representation above the [CLS] token, which is often used as a summary representation for the entire sequence. However, __we cannot use `pooler_output` in the current context__, as `transformers` adds new randomized parameters on top of it, to facilitate fine-tuning. If we want the [CLS] representation, we need to use `reps.last_hidden_state[:, 0]`.

Finally, if we want access to the output representations from each layer of the model, we use `hidden_states`. This will be `None` unless we set `output_hidden_states=True` when using the `forward` method, as above. 

In [None]:
len(reps.hidden_states)

13

The length 13 corresponds to the initial embedding layer (layer 0) and the 12 layers of this BERT model.
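
Because `hidden_states` holds the embedding layer plus one entry per Transformer layer, a `layer` index can be read like an ordinary Python sequence index, and negative values count from the end. The sketch below makes that convention explicit for a 12-layer model. The helper name `resolve_layer` is hypothetical and is not part of `transformers` or `vsm`.

```python
def resolve_layer(layer, num_hidden_layers=12):
    """Map a possibly negative layer index to its position in hidden_states."""
    n_states = num_hidden_layers + 1  # embedding layer + one state per Transformer layer
    if not -n_states <= layer < n_states:
        raise IndexError(f"layer {layer} is out of range for {n_states} hidden states")
    return layer % n_states


print(resolve_layer(0))   # 0: the embedding layer
print(resolve_layer(9))   # 9
print(resolve_layer(-1))  # 12: the final layer
```

Under this convention, `layer=-1` and `layer=12` refer to the same representations for this model.
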

The final layer in `hidden_states` is identical to `last_hidden_state`:

In [None]:
reps.hidden_states[-1].shape

torch.Size([1, 20, 768])

In [None]:
torch.equal(reps.hidden_states[-1], reps.last_hidden_state)

True

### The decontextualized approach

Bommasani et al. (2020) define and explore two general strategies for obtaining static representations for words using a model like BERT. The simpler one involves processing individual words in isolation and, where a word corresponds to multiple tokens, pooling those token representations into a single vector using an operation like the mean. Now we want to scale the above process to a large vocabulary, so that we can create a full VSM. The function `vsm.create_subword_pooling_vsm` makes this easy. To start, we build the vocabulary from the responses in our dataset:

In [None]:
bd_metaf['sentence'] = bd_metaf.resp_relacao3
bd_metaf['label'] = bd_metaf.y_score2
X, y = sst.build_rnn_dataset(bd_metaf) 
vocabulary = utils.get_vocab(X, mincount=1)


In [None]:
print(len(X))
print(len(vocabulary ))

12174
10978
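
To make the vocabulary step concrete, here is a minimal sketch of counting tokens across tokenized examples and keeping those that meet a minimum count. This is an illustration of the idea only: the name `build_vocab`, the `$UNK` placeholder, and the toy data are assumptions of this sketch, not the actual code of `utils.get_vocab`.

```python
from collections import Counter


def build_vocab(tokenized_examples, mincount=1, unk="$UNK"):
    """Return the sorted tokens that occur at least `mincount` times."""
    counts = Counter(tok for example in tokenized_examples for tok in example)
    vocab = {tok for tok, count in counts.items() if count >= mincount}
    vocab.add(unk)  # assumed placeholder for out-of-vocabulary tokens
    return sorted(vocab)


# Hypothetical toy examples, each already split into tokens.
toy_X = [["o", "macaco", "avisa"], ["o", "macaco", "é", "esperto"]]

print(build_vocab(toy_X, mincount=1))
print(build_vocab(toy_X, mincount=2))  # ['$UNK', 'macaco', 'o']
```

With `mincount=1`, as in the cell above, every token that appears in the data is kept, which is why the vocabulary is nearly as large as the number of responses.
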


**embed0:** decontextualized embedding layer 0 (using `vsm.create_subword_pooling_vsm`)

In [None]:
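def create_subword_pooling_vsm(vocab, tokenizer, model, layer=1, pool_func=vsm.mean_pooling):
    # Local copy kept for reference; the cells below call `vsm.create_subword_pooling_vsm`.
    # Encode each word in isolation into its subword ids.
    vocab_ids = [vsm.hf_encode(w, tokenizer) for w in vocab]
    # Get the hidden states of each word at the requested layer.
    vocab_hiddens = [vsm.hf_represent(w, model, layer=layer) for w in vocab_ids]
    # Pool the subword vectors into a single vector per word.
    pooled = [pool_func(h) for h in vocab_hiddens]
    pooled = [p.squeeze().cpu().numpy() for p in pooled]
    return pd.DataFrame(pooled, index=vocab)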
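
The pooling step is the heart of this function. As a minimal sketch, with plain Python lists standing in for tensors, mean pooling averages the subword vectors of a word dimension by dimension. The name `mean_pool` and the 3-dimensional toy vectors are made up for illustration; the notebook itself uses `vsm.mean_pooling` on 768-dimensional tensors.

```python
def mean_pool(subword_vectors):
    """Average a list of equal-length vectors, one dimension at a time."""
    n = len(subword_vectors)
    dim = len(subword_vectors[0])
    return [sum(vec[j] for vec in subword_vectors) / n for j in range(dim)]


# Hypothetical vectors for the two pieces of one word, e.g. 'Maca' and '##co'.
maca = [1.0, 0.0, 2.0]
co = [3.0, 4.0, 0.0]

print(mean_pool([maca, co]))  # [2.0, 2.0, 1.0]
```

A word that maps to a single token is left unchanged by this operation, since the mean of one vector is the vector itself.
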

In [None]:
%%time
embed0 = vsm.create_subword_pooling_vsm(
    vocabulary, bert_tokenizer, bert_model, layer=0)

In [None]:
print(embed0.shape)
embed0.head()

In [None]:
embed0.to_csv("/content/drive/MyDrive/unicamp - IA024 /projeto_metaf/embed0.csv")

In [None]:
%%time
embed12 = vsm.create_subword_pooling_vsm(
    vocabulary, bert_tokenizer, bert_model, layer=-1)

CPU times: user 12min 52s, sys: 3.38 s, total: 12min 56s
Wall time: 3min 17s


In [None]:
embed12.to_csv("/content/drive/MyDrive/unicamp - IA024 /projeto_metaf/embed12.csv")

In [None]:
%%time
embed9 = vsm.create_subword_pooling_vsm(
    vocabulary, bert_tokenizer, bert_model, layer=9)

CPU times: user 12min 46s, sys: 3 s, total: 12min 49s
Wall time: 3min 15s


In [None]:
embed9.to_csv("/content/drive/MyDrive/unicamp - IA024 /projeto_metaf/embed9.csv")

In [None]:
%%time
embed3 = vsm.create_subword_pooling_vsm(
    vocabulary, bert_tokenizer, bert_model, layer=3)

embed3.to_csv("/content/drive/MyDrive/unicamp - IA024 /projeto_metaf/embed3.csv")

CPU times: user 13min 6s, sys: 3.36 s, total: 13min 9s
Wall time: 3min 26s


### The aggregated approach

The aggregated approach is also straightforward to implement given the above tools. To start, we can create a map from vocabulary items into their sequences of ids:

In [None]:
vocab_ids = {w: vsm.hf_encode(w, bert_tokenizer)[0] for w in vocabulary}

In [None]:
vocab_ids

Next, let's assume we have a corpus of texts that contain the words of interest:

In [None]:
corpus = list(bd_metaf['resp_relacao3'])  


In [None]:
len(corpus)

12174

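The following encodes every corpus example into token ids; the cells after it embed each example, keeping the representations from a single layer (`layer=3` and `layer=9` below):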

In [None]:
corpus_ids = [vsm.hf_encode(text, bert_tokenizer, add_special_tokens=True)
              for text in corpus]


In [None]:
%%time
corpus_reps3 = [vsm.hf_represent(ids, bert_model, layer=3)
               for ids in corpus_ids]

len(corpus_reps3)

In [None]:
corpus_reps3[0].shape
corpus_reps3[1].shape
len(corpus_reps3[0][:, 0])

for idea in corpus_reps3[0:20]:
  print(idea.shape)

len(corpus_reps3[0][0, 0, :])

embed_CLS = [idea[0, 0, :] for idea in corpus_reps3]

len(embed_CLS)

pd.DataFrame(embed_CLS).astype("float").to_csv('/content/drive/MyDrive/unicamp - IA024 /projeto_metaf/embed_CLS.csv')

embed_CLS 

In [None]:
%%time
corpus_reps9 = [vsm.hf_represent(ids, bert_model, layer=9)
               for ids in corpus_ids]

In [None]:
corpus[0]
len(corpus_ids[0][0])

len(corpus_reps9)
type(corpus_reps9)
len(corpus_reps9[0])
len(corpus_reps9[0][0])
len(corpus_reps9[0][0][0])

corpus_reps9[:]

Finally, we define a convenience function for finding all the occurrences of a sublist in a larger list:


In [31]:
def find_sublist_indices(sublist, mainlist):
    indices = []
    length = len(sublist)
    for i in range(0, len(mainlist)-length+1):
        if mainlist[i: i+length] == sublist:
            indices.append((i, i+length))
    return indices

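I had to change the comparison to `(mainlist[i: i+length] == sublist).all()` in order to handle tensor inputs: comparing two tensors with `==` is element-wise, so a word that is mapped to more than one token id yields several booleans, which must be reduced with `.all()`.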

In [32]:
def find_sublist_indices_tensor(sublist, mainlist):
    indices = []
    length = len(sublist)
    for i in range(0, len(mainlist)-length+1):
        if (mainlist[i: i+length] == sublist).all():
            indices.append((i, i+length))
    return indices

For example:

In [None]:
find_sublist_indices([1,2], [1, 2, 3, 0, 1, 2, 3])

[(0, 2), (4, 6)]
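
The same function can be applied to the token ids printed in the tokenizing section. The sketch below looks up the ids of the pieces `Maca`, `##co` and `comunicado`, `##r` inside the encoded example sentence. The ids are copied from the output above and the function is repeated so the block runs on its own.

```python
def find_sublist_indices(sublist, mainlist):
    indices = []
    length = len(sublist)
    for i in range(0, len(mainlist) - length + 1):
        if mainlist[i: i + length] == sublist:
            indices.append((i, i + length))
    return indices


# Token ids of the example sentence, including [CLS] (101) and [SEP] (102).
ex_ids = [101, 13399, 303, 253, 146, 16677, 22282, 180, 13509, 119,
          566, 455, 7598, 22278, 259, 736, 3155, 171, 9538, 102]

print(find_sublist_indices([13399, 303], ex_ids))    # 'Maca', '##co' -> [(1, 3)]
print(find_sublist_indices([16677, 22282], ex_ids))  # 'comunicado', '##r' -> [(5, 7)]
```

Each `(start, end)` pair can be used directly as a slice over the token dimension of the corresponding representations.
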

And here's a function that uses our `vocab_ids`, the encoded corpus, and its per-layer representations to compute the aggregated representation of a word:

In [33]:
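def calculate_aggreg_rep(token_text, vocab_ids, corpus_ids, corpus_reps):
    """Average the pooled representations of every occurrence of `token_text`."""
    aggreg_rep = []
    tokens_ids = vocab_ids[token_text]
    for ids, reps in zip(corpus_ids, corpus_reps):
        # Locate each occurrence of the word's subword ids in this example.
        offsets = find_sublist_indices_tensor(tokens_ids, ids.squeeze(0))
        for (start, end) in offsets:
            # Pool the subword vectors of one occurrence into a single vector.
            pooled = vsm.mean_pooling(reps[:, start: end])
            aggreg_rep.append(pooled)
    # Average over occurrences; if the word never matched, the empty list is returned.
    if len(aggreg_rep) != 0:
        aggreg_rep = torch.mean(torch.cat(aggreg_rep), axis=0).squeeze(0)
    return aggreg_rep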
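
To see the two levels of averaging at work, here is a toy version of the same logic with plain lists instead of tensors: first pool the subword vectors within each occurrence, then average across occurrences. All names and numbers below are hypothetical, and unlike the function above this sketch returns `None` when the word never occurs.

```python
def find_sublist_indices(sublist, mainlist):
    length = len(sublist)
    return [(i, i + length)
            for i in range(len(mainlist) - length + 1)
            if mainlist[i: i + length] == sublist]


def mean_vectors(vectors):
    """Dimension-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def aggregate_word(word_ids, corpus_ids, corpus_vecs):
    occurrence_vecs = []
    for ids, vecs in zip(corpus_ids, corpus_vecs):
        for start, end in find_sublist_indices(word_ids, ids):
            # First level: pool the subword vectors of this occurrence.
            occurrence_vecs.append(mean_vectors(vecs[start:end]))
    if not occurrence_vecs:
        return None  # the word never occurs in the corpus
    # Second level: average over all occurrences.
    return mean_vectors(occurrence_vecs)


# Two toy "sentences"; the word of interest is the id pair [7, 8].
toy_corpus_ids = [[101, 7, 8, 102], [101, 5, 7, 8, 102]]
toy_corpus_vecs = [
    [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],
    [[0.0, 0.0], [9.0, 9.0], [5.0, 6.0], [7.0, 8.0], [0.0, 0.0]],
]

print(aggregate_word([7, 8], toy_corpus_ids, toy_corpus_vecs))  # [4.0, 5.0]
print(aggregate_word([42], toy_corpus_ids, toy_corpus_vecs))    # None
```

The first occurrence pools to `[2.0, 3.0]` and the second to `[6.0, 7.0]`, so the aggregated vector is their mean.
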

In [None]:
# calculate_aggreg_rep(token_text = 'macaco', vocab_ids=vocab_ids, corpus_ids=corpus_ids, corpus_reps=corpus_reps) 
# calculate_aggreg_rep(token_text = 'pq', vocab_ids=vocab_ids, corpus_ids=corpus_ids, corpus_reps=corpus_reps) 

In [None]:
vocabulary
len(vocabulary)

10978

In [35]:
%%time
embed_pmc9 = []
for word in vocabulary:
   print(word)
   aggreg_word_rep = calculate_aggreg_rep(token_text = word, vocab_ids=vocab_ids, corpus_ids=corpus_ids, corpus_reps=corpus_reps9) 
   embed_pmc9.append(aggreg_word_rep)

   
pd.DataFrame(embed_pmc9, index=vocabulary).astype("float").to_csv('/content/drive/MyDrive/unicamp - IA024 /projeto_metaf/embedd_pool_mc9.csv')

!


NameError: ignored

In [37]:
%%time
embed_pmc3 = []
for word in vocabulary:
   print(word)
   aggreg_word_rep = calculate_aggreg_rep(token_text = word, vocab_ids=vocab_ids, corpus_ids=corpus_ids, corpus_reps=corpus_reps3) 
   embed_pmc3.append(aggreg_word_rep)

   
pd.DataFrame(embed_pmc3, index=vocabulary).astype("float").to_csv('/content/drive/MyDrive/unicamp - IA024 /projeto_metaf/embedd_pool_mc3.csv')

Streaming output truncated to the last 5000 lines.
ida
idade
ideal
ideia
ideias
idem
identidade
identifica
identificam
identificar.
identificação
idioma
idolatrada
idolatram
idolatrando.
idolo
idolos
idolos,
idoso
idosos
idéia
idéias
idéias,
ietm
igal
ignorados
ignorância
igreja
igreja,
iguais
igual
igual.
iguala
ilhas
iliminação
ilumima
ilumina
ilumina,
iluminada
iluminada...
iluminadas
iluminado
iluminados
iluminam
iluminam,
iluminam-nos
iluminando
iluminar
iluminar,
iluminar-los
iluminarias
iluminarão
iluminaçoes
iluminação
iluministas
iluminosidade
iluminá-los
ilumnam
ilusionistas
ilusão
ilusórias
imagem
imagens
imagina
imagina-se
imaginam
imaginamos
imaginar
imaginasão
imaginação
imaginário
imensidão
imenso
imfeita
imfeite
imita
imitar
imoral
impaciencia
impaciente
impacientes
impacto
impactos
impecilho
impecável
impede
impede-nos
impedem
impedimento
impedir
imperador
importa
importancia
importancias
importante
importante,
importante.
importantes
importantes,
importâ

In [None]:


pd.DataFrame(embed_pmc12, index=vocabulary).astype("float").head()


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
!,0.146565,-0.248279,-0.070733,-0.026844,0.486853,0.806113,0.491228,0.003391,0.334737,0.208428,...,0.072705,0.457313,-0.740429,0.177964,0.181191,-0.231247,0.283681,0.354993,-0.045231,-0.215859
"""",-0.17254,-0.059289,0.47211,0.527272,0.194514,0.357842,0.316965,-0.64291,0.100852,-0.750452,...,0.067175,-0.509143,-0.906271,-0.407602,0.286751,-0.754477,0.375574,0.131364,-0.107265,-0.749649
"""???""",-0.188018,0.091027,-0.026015,0.252892,0.469525,0.318168,0.289357,-0.226312,0.023744,-0.230036,...,-0.105501,-0.274744,-0.883289,-0.392057,0.989325,-0.34942,0.479251,0.009281,-0.422455,-0.853668
"""Procurando",-0.271998,-0.535824,0.017344,0.191156,0.111892,0.192713,0.224732,0.480667,0.237073,-0.357113,...,0.188196,-0.007507,-1.374439,-0.297422,0.371894,-0.633138,0.731279,-0.069465,0.110304,-0.020759
"""ai""",0.081846,-0.075211,0.825686,0.341451,0.395813,0.345409,-0.01381,-0.356292,0.253199,-0.259342,...,-0.058527,0.200778,-0.818004,-0.301954,0.026207,-0.831876,0.352601,0.087669,-0.120449,-0.574873
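
One detail to keep in mind when collecting these rows: `calculate_aggreg_rep` returns an empty list for a word that is never matched in the corpus, so the collected rows may not all have the same length. A minimal sketch of one way to handle this, assuming the rows have already been converted to plain lists of floats, is to pad the missing ones with NaN before building the table. The helper `pad_missing` is hypothetical.

```python
import math


def pad_missing(rows, dim):
    """Replace rows that are not `dim` long (e.g. empty lists) with NaN rows."""
    return [list(row) if len(row) == dim else [math.nan] * dim for row in rows]


# Hypothetical rows: the second word had no occurrences in the corpus.
toy_rows = [[0.1, 0.2], [], [0.3, 0.4]]
padded = pad_missing(toy_rows, dim=2)

print(padded[0])                       # [0.1, 0.2]
print(all(len(r) == 2 for r in padded))  # True
```

NaN rows keep the table rectangular and are easy to filter out later, for example when averaging word vectors per response.
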


### Miscellaneous

In [None]:
vocab_ids['atencao']

token_text = 'atencao'
aggreg_rep = []
tokens_ids = vocab_ids[token_text]
for ids, reps in zip(corpus_ids, corpus_reps):
    offsets = find_sublist_indices_tensor(tokens_ids, ids.squeeze(0))
    for (start, end) in offsets:
        pooled = vsm.mean_pooling(reps[:, start: end])
        aggreg_rep.append(pooled)

if len(aggreg_rep) != 0:
    aggreg_rep = torch.mean(torch.cat(aggreg_rep), axis=0).squeeze(0)

macaco_reps = []

for ids, reps in zip(corpus_ids, corpus_reps):
    for i in range(0, len(macaco)):
      offsets = find_sublist_indices(macaco[i], ids.squeeze(0))
      for (start, end) in offsets:
        pooled = vsm.mean_pooling(reps[:, start: end])
        macaco_reps.append(pooled)

 #   macaco_rep = torch.mean(torch.cat(macaco_reps), axis=0).squeeze(0)

In [None]:
corpus_reps2 = pd.DataFrame(corpus_reps)
corpus_reps[2].size()

len(corpus_reps[1])



np.empty(correct_shape, dtype=object)
values = np.array([convert(v) for v in values])


The above building blocks could be used as the basis for an original system and bakeoff entry for this unit. The major question is probably which data to use for the corpus.

## Some related work

1. [Ethayarajh (2019)](https://www.aclweb.org/anthology/D19-1006/) uses dimensionality reduction techniques (akin to LSA) to derive static representations from contextual models, and explores layer-wise variation in detail, with findings that are likely to align with your experiences using the above techniques.

1. [Akbik et al. (2019)](https://www.aclweb.org/anthology/N19-1078/) explore techniques similar to those of Bommasani et al. specifically for the supervised task of named entity recognition.

1. [Wang et al. (2020)](https://arxiv.org/pdf/1911.02929.pdf) learn static representations from contextual ones using techniques adapted from the word2vec model.