<a href="https://colab.research.google.com/github/franfram/Transformers-nlp/blob/main/DeepNLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.5.2-py3-none-any.whl (432 kB)
Collecting evaluate
  Downloading evaluate-0.2.2-py3-none-any.whl (69 kB)
Collecting transformers[sentencepiece]
  Downloading transformers-4.22.2-py3-none-any.whl (4.9 MB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.10.0-py3-none-any.whl (163 kB)
Collecting multiprocess
  Downloading mul

In [None]:
from transformers import AutoTokenizer

# Load tokenizer
checkpoint = "dccuchile/bert-base-spanish-wwm-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)




In [None]:
sequence = "hola como estas" #@param {type: "string"}

# Step 1: tokenize() splits the string into tokens (words or subwords)
tokens = tokenizer.tokenize(sequence)

# Step 2: convert_tokens_to_ids() maps each token to its vocabulary index
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)
# Decoding is the reverse of encoding: decode() turns vocabulary indices back into a string
decoded_string = tokenizer.decode(ids)

print(decoded_string)



[1734, 1151, 1932]
hola como estas


As we’ve seen, the first step is to split the text into words (or parts of words, punctuation symbols, etc.), usually called tokens. There are multiple rules that can govern that process, which is why we need to instantiate the tokenizer using the name of the model, to make sure we use the same rules that were used when the model was pretrained.
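
To make the splitting step concrete, here is a minimal sketch of a WordPiece-style tokenizer that matches the longest vocabulary entry first, left to right. The vocabulary `TOY_VOCAB` and the helper names are made up for this illustration. The real checkpoint ships a much larger vocabulary and may split the same words differently.

```python
# Toy WordPiece-style tokenizer: greedy longest-match-first.
# TOY_VOCAB is a hypothetical vocabulary, NOT the one used by the checkpoint.
TOY_VOCAB = {"hola", "como", "estas", "quiero", "com", "##iendo", "[UNK]"}


def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Split one lowercase word into the longest known pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def toy_tokenize(text):
    """Lowercase, split on whitespace, then apply WordPiece to each word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word))
    return tokens


print(toy_tokenize("Hola como estas"))  # ['hola', 'como', 'estas']
print(toy_tokenize("quiero comiendo"))  # ['quiero', 'com', '##iendo']
```

Changing the vocabulary changes the splits, which is why the tokenizer has to be loaded from the same checkpoint as the model.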

The second step is to convert those tokens into numbers, so we can build a tensor out of them and feed them to the model. To do this, the tokenizer has a vocabulary, which is the part we download when we instantiate it with the from_pretrained() method. Again, we need to use the same vocabulary used when the model was pretrained.
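
The lookup step can be sketched with a plain dictionary. The word ids below are copied from the cell outputs in this notebook. The ids for `[CLS]` (4) and `[SEP]` (5) are inferred from the first and last values of the printed `input_ids`, and the `[UNK]` id is a placeholder assumed for this sketch. The function names are hypothetical, not library calls.

```python
# Tiny slice of a vocabulary. Word ids come from the outputs printed in this
# notebook; [CLS]/[SEP] are inferred from them; the [UNK] id is an assumption.
TOY_IDS = {
    "[UNK]": 3, "[CLS]": 4, "[SEP]": 5,
    "como": 1151, "yo": 1252, "quiero": 1563,
    "hola": 1734, "estas": 1932, "comer": 1987,
}
ID_TO_TOKEN = {idx: tok for tok, idx in TOY_IDS.items()}


def toy_encode(text, add_special_tokens=True):
    """Map whitespace-separated tokens to ids, falling back to [UNK]."""
    ids = [TOY_IDS.get(tok, TOY_IDS["[UNK]"]) for tok in text.lower().split()]
    if add_special_tokens:
        ids = [TOY_IDS["[CLS]"]] + ids + [TOY_IDS["[SEP]"]]
    return ids


def toy_decode(ids):
    """Map ids back to tokens and join them with spaces."""
    return " ".join(ID_TO_TOKEN[i] for i in ids)


print(toy_encode("hola como estas", add_special_tokens=False))  # [1734, 1151, 1932]
print(toy_encode("quiero comer"))                               # [4, 1563, 1987, 5]
print(toy_decode([1734, 1151, 1932]))                           # hola como estas
```

The same ids only mean the same words if the vocabulary is the one the model was pretrained with.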

To get a better understanding of the two steps, we’ll explore them separately. Note that we will use methods that run parts of the tokenization pipeline in isolation so you can see the intermediate results, but in practice you should call the tokenizer directly on your inputs (as shown in section 2).


In [None]:
# prepare inputs
raw_inputs = [
    "quiero comer ",
    "yo quiero",
]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

print(inputs)

{'input_ids': tensor([[   4, 1563, 1987,    5],
        [   4, 1252, 1563,    5]]), 'token_type_ids': tensor([[0, 0, 0, 0],
        [0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1],
        [1, 1, 1, 1]])}
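
In the output above both sentences happen to be 4 tokens long, so `padding=True` adds nothing and the attention mask is all ones. The sketch below shows what padding does when the lengths differ. `pad_batch` is a hypothetical helper, and `pad_id=0` is a placeholder; the real value would come from `tokenizer.pad_token_id`.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and build the attention mask.

    pad_id=0 is a placeholder for this sketch, not the checkpoint's pad id.
    """
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 = real token the model should attend to, 0 = padding to ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Second sequence is one token shorter, so it gets one padding position.
batch = pad_batch([[4, 1252, 1563, 5], [4, 1563, 5]])
print(batch["input_ids"])       # [[4, 1252, 1563, 5], [4, 1563, 5, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```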


In [None]:
from transformers import AutoModel

model = AutoModel.from_pretrained(checkpoint)

outputs = model(**inputs)

print(outputs)
print(outputs.last_hidden_state.shape) #[batch size, sequence length, hidden size]



Some weights of the model checkpoint at dccuchile/bert-base-spanish-wwm-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-uncased and are newly initialized: ['bert.pooler.dens

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.3107, -0.0482, -0.2685,  ..., -1.2886,  0.7412, -0.8510],
         [-0.4834, -0.1054, -0.0760,  ..., -0.3920, -0.1903, -0.4548],
         [-0.4013, -0.4127, -0.4304,  ..., -0.2887, -0.7466, -0.2979],
         [-0.3224, -0.0176, -0.0630,  ..., -0.6006, -0.0127, -0.1626]],

        [[ 0.8068,  0.8387,  0.1999,  ...,  0.5771,  1.1830, -0.1755],
         [ 0.5724, -0.2247, -0.0648,  ...,  0.0827,  0.2711, -0.5267],
         [ 0.4977,  0.0194,  0.0029,  ...,  0.3079,  1.0262, -0.6613],
         [ 0.1892,  0.7965, -0.0492,  ...,  0.7769,  1.2281, -0.4855]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.2322,  0.1761, -0.3326,  ..., -0.2519,  0.1239,  0.1035],
        [-0.5266,  0.1396, -0.5428,  ..., -0.4514,  0.7699,  0.2066]],
       grad_fn=<TanhBackward0>), hidden_states=None, past_key_values=None, attentions=None, cross_attentions=None)
torch.Size([2, 4, 768])


In [None]:
vars(outputs)

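print(outputs.last_hidden_state) # NOT logits: AutoModel loads the base model without a task head, so this is one contextual hidden-state vector per token

# softmax converts logits to probabilities, but that only makes sense for the output of a model with a head
# (e.g. AutoModelForMaskedLM or AutoModelForSequenceClassification); here it merely normalizes each 768-dim hidden vector
import torch
predictions = torch.nn.functional.softmax(outputs.last_hidden_state, dim=-1)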

print(predictions)

tensor([[[ 0.3107, -0.0482, -0.2685,  ..., -1.2886,  0.7412, -0.8510],
         [-0.4834, -0.1054, -0.0760,  ..., -0.3920, -0.1903, -0.4548],
         [-0.4013, -0.4127, -0.4304,  ..., -0.2887, -0.7466, -0.2979],
         [-0.3224, -0.0176, -0.0630,  ..., -0.6006, -0.0127, -0.1626]],

        [[ 0.8068,  0.8387,  0.1999,  ...,  0.5771,  1.1830, -0.1755],
         [ 0.5724, -0.2247, -0.0648,  ...,  0.0827,  0.2711, -0.5267],
         [ 0.4977,  0.0194,  0.0029,  ...,  0.3079,  1.0262, -0.6613],
         [ 0.1892,  0.7965, -0.0492,  ...,  0.7769,  1.2281, -0.4855]]],
       grad_fn=<NativeLayerNormBackward0>)
tensor([[[0.0014, 0.0010, 0.0008,  ..., 0.0003, 0.0021, 0.0004],
         [0.0006, 0.0009, 0.0010,  ..., 0.0007, 0.0009, 0.0007],
         [0.0007, 0.0007, 0.0007,  ..., 0.0008, 0.0005, 0.0008],
         [0.0008, 0.0010, 0.0010,  ..., 0.0006, 0.0010, 0.0009]],

        [[0.0023, 0.0023, 0.0012,  ..., 0.0018, 0.0033, 0.0009],
         [0.0018, 0.0008, 0.0010,  ..., 0.0011, 0.0013, 0.
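
The values printed above are all close to 0.001, and a small sketch shows why. Softmax forces each vector to sum to 1, so with a hidden size of 768 the average entry is exactly 1/768, about 0.0013. The numbers here come from normalizing hidden features and are not class probabilities, since the base model has no classification or language-modeling head. The input row below is a stand-in, not real model output.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


hidden_size = 768  # last dimension of last_hidden_state in this notebook
row = [math.sin(i) for i in range(hidden_size)]  # stand-in hidden vector
probs = softmax(row)

print(round(sum(probs), 6))        # 1.0
print(round(1 / hidden_size, 4))   # 0.0013
```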

Putting it together, this seems to be all we need:

In [None]:
import torch
from transformers import AutoTokenizer, AutoModel

# Define model checkpoint
checkpoint = "dccuchile/bert-base-spanish-wwm-uncased"

# Define tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Define model
model = AutoModel.from_pretrained(checkpoint)

seq1 = "yo quiero" #@param {type: "string"} 
seq2 = "quiero ir a" #@param {type: "string"}

sequences = [
    seq1, 
    seq2
]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

output = model(**tokens)

print(output)

Some weights of the model checkpoint at dccuchile/bert-base-spanish-wwm-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-uncased and are newly initialized: ['bert.pooler.dens

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.8068,  0.8387,  0.1999,  ...,  0.5771,  1.1830, -0.1755],
         [ 0.5724, -0.2247, -0.0648,  ...,  0.0827,  0.2711, -0.5267],
         [ 0.4977,  0.0194,  0.0029,  ...,  0.3079,  1.0262, -0.6613],
         [ 0.1892,  0.7965, -0.0492,  ...,  0.7769,  1.2281, -0.4855],
         [ 0.2981,  0.7136, -0.0072,  ...,  0.7018,  1.0105, -0.2157]],

        [[ 0.4667,  0.0963, -0.1903,  ..., -1.1062,  0.5403,  0.3907],
         [ 0.3396,  0.0048, -0.3239,  ..., -0.1553,  0.3557,  0.0766],
         [ 0.1299, -0.3269, -0.4146,  ..., -0.4310, -0.3135,  0.6949],
         [-0.2690, -0.4313,  0.3067,  ...,  0.1194,  0.4537, -0.0747],
         [ 0.1312,  0.0034, -0.3424,  ..., -0.6802,  0.6460, -0.0816]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.2017, -0.1935,  0.3371,  ...,  0.0971,  0.6859,  0.2973],
        [ 0.4669,  0.4709,  0.0425,  ...,  0.1149,  0.4237,  0.5939]],
       grad_fn=<TanhBack



