In [1]:
import numpy as np

# Hugging Face Datasets

In [2]:
from datasets import load_dataset

emotions=load_dataset("emotion")

Using the latest cached version of the module from C:\Users\amrul\.cache\huggingface\modules\datasets_modules\datasets\emotion\348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705 (last modified on Fri Mar 18 22:16:45 2022) since it couldn't be found locally at emotion., or remotely on the Hugging Face Hub.
Using custom data configuration default
Reusing dataset emotion (C:\Users\amrul\.cache\huggingface\datasets\emotion\default\0.0.0\348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705)



# Import Tokenizer and DistilBERT model

It is important to use the tokenizer that was pretrained together with the model. A different tokenizer maps text to different token IDs, so the model's pretrained token embeddings would no longer correspond to the right tokens.
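
To make the point concrete, here is a toy sketch. The two vocabularies and the `encode`/`decode` helpers are hypothetical stand-ins, not the Transformers API; they only illustrate what happens when IDs produced with one vocabulary are read with another.

```python
# Two hypothetical toy vocabularies standing in for two different pretrained tokenizers
vocab_a = {"[UNK]": 0, "i": 1, "go": 2, "to": 3, "work": 4}
vocab_b = {"[UNK]": 0, "work": 1, "to": 2, "i": 3, "go": 4}

def encode(text, vocab):
    """Map lowercased, whitespace-separated words to IDs (unknown words -> [UNK])."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

def decode(ids, vocab):
    """Map IDs back to words using the given vocabulary."""
    id_to_word = {idx: word for word, idx in vocab.items()}
    return " ".join(id_to_word[idx] for idx in ids)

ids_from_a = encode("I go to work", vocab_a)
# A model whose embeddings were learned with vocab_b reads the same IDs as other words
misread = decode(ids_from_a, vocab_b)
print(ids_from_a)  # [1, 2, 3, 4]
print(misread)     # work to i go
```

The IDs are valid in both vocabularies, so nothing fails loudly; the model would simply look up the wrong embeddings.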

In [3]:
from transformers import AutoTokenizer

In [5]:
# We will use DistilBERT, a smaller version of BERT, to classify the emotion texts
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

In [3]:
from transformers import AutoModel
import torch

In [7]:
device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_name).to(device)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The cell above checks whether a GPU is available and moves the model to it; otherwise the model stays on the CPU.

To warm up, let's extract the last hidden states for a simple string.

In [8]:
text = "I go to work every day"
tokens=tokenizer.encode(text, return_tensors="pt").to(device)

```return_tensors="pt"``` makes the tokenizer return the token IDs as a PyTorch tensor, which we then move to the same device as the model.

In [9]:
tokens.shape

torch.Size([1, 8])

In [12]:
def view_tokens(tokenizer,tokens):
    for token in tokens:
        print(token,tokenizer.decode(token))

view_tokens(tokenizer,tokens[0])

tensor(101) [CLS]
tensor(1045) i
tensor(2175) go
tensor(2000) to
tensor(2147) work
tensor(2296) every
tensor(2154) day
tensor(102) [SEP]


In [13]:
output=model(tokens)
output.last_hidden_state.shape

torch.Size([1, 8, 768])

The last hidden state has the shape ```[batch_size, n_tokens, hid_dim]```: BERT produces one hidden state per input token. During pretraining these hidden states are used to predict masked tokens. For classification tasks it is common to use the hidden state of the ```[CLS]``` token as a representation of the whole sequence.
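
As a small sketch of what "using the hidden state of [CLS]" means in terms of indexing, the snippet below uses a random NumPy array as a stand-in for ```output.last_hidden_state``` (the values are not real model output; only the shape matches the one above).

```python
import numpy as np

# Stand-in for output.last_hidden_state: random values with the shape shown above
batch_size, n_tokens, hid_dim = 1, 8, 768
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch_size, n_tokens, hid_dim))

# [CLS] is the first token of every sequence, so its hidden state sits at position 0
cls_hidden_state = last_hidden_state[:, 0, :]
print(cls_hidden_state.shape)  # (1, 768)
```

The same slice works on the PyTorch tensor returned by the model, since both support this indexing.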

# Tokenizing the whole dataset

```padding=True``` pads each sequence with zeros up to the length of the longest sequence in the batch. ```truncation=True``` truncates sequences to the model's maximum context size.
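
The two options can be sketched in plain Python. The helper ```pad_and_truncate``` and its ```max_length``` and ```pad_id``` parameters are hypothetical, and the sketch is simplified: unlike the real tokenizer, it does not keep the final ```[SEP]``` token when it truncates.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one in the batch."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    # Append pad_id until every sequence has the same length
    return [seq + [pad_id] * (longest - len(seq)) for seq in truncated]

short = [101, 1045, 102]
long = [101, 1045, 2175, 2000, 2147, 102]
padded = pad_and_truncate([short, long], max_length=5)
print(padded)  # [[101, 1045, 102, 0, 0], [101, 1045, 2175, 2000, 2147]]
```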

In [16]:
def tokenize(tokenizer,batch):
    return tokenizer(batch["text"],padding=True,truncation=True)

In [18]:
tokenize(tokenizer,emotions["train"][:3])

{'input_ids': [[101, 1045, 2134, 2102, 2514, 26608, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2064, 2175, 2013, 3110, 2061, 20625, 2000, 2061, 9636, 17772, 2074, 2013, 2108, 2105, 2619, 2040, 14977, 1998, 2003, 8300, 102], [101, 10047, 9775, 1037, 3371, 2000, 2695, 1045, 2514, 20505, 3308, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}

Notice that the tokenizer returns an ```attention_mask``` in addition to ```input_ids``` when called on a batch. The mask marks real tokens with 1 and padding with 0, so the model can ignore the padding when processing each text.
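
One common way a mask lets attention ignore padding is to push the scores of padded positions to a very large negative value before the softmax. The snippet below is a simplified NumPy sketch of that idea, not DistilBERT's actual implementation; ```masked_softmax``` and the example scores are made up for illustration.

```python
import numpy as np

def masked_softmax(scores, attention_mask):
    """Softmax over the last axis that gives (near) zero weight to padded positions."""
    # Replace scores at padded positions (mask == 0) with a very large negative value
    masked_scores = np.where(attention_mask.astype(bool), scores, -1e9)
    # Subtract the row maximum for numerical stability before exponentiating
    shifted = masked_scores - masked_scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.5, 3.0]])  # made-up attention scores for one query
attention_mask = np.array([[1, 1, 1, 0]])  # the last position is padding
weights = masked_softmax(scores, attention_mask)
print(weights.round(3))
```

Even though the padded position has the highest raw score, it receives essentially no weight, and the remaining weights still sum to one.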

In [19]:
emotions_encoded=emotions.map(lambda batch : tokenize(tokenizer,batch),batched=True,batch_size=None)


By default ```DatasetDict.map``` operates individually on every example in the corpus. Setting ```batched=True``` encodes the tweets in batches instead, and ```batch_size=None``` applies ```tokenize``` to each split as one single batch, which ensures that the input tensors and attention masks have the same shape within that split. We can confirm that this operation added two new features to the dataset: ```input_ids``` and ```attention_mask```.
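
To see why the batch size matters for the padded shape, here is a plain-Python sketch. The ```padded_lengths``` helper and the token counts are hypothetical; the sketch only mimics the effect of padding to the longest sequence per batch and does not call the Datasets library.

```python
def padded_lengths(token_counts, batch_size=None):
    """Return the padded length each example gets when padding is done per batch."""
    if batch_size is None:
        batch_size = len(token_counts)  # one batch covering the whole split
    lengths = []
    for start in range(0, len(token_counts), batch_size):
        chunk = token_counts[start:start + batch_size]
        # Every example in a batch is padded to that batch's longest sequence
        lengths.extend([max(chunk)] * len(chunk))
    return lengths

token_counts = [7, 23, 12, 9, 40, 15]  # hypothetical token counts per example
print(padded_lengths(token_counts, batch_size=2))     # [23, 23, 12, 12, 40, 40]
print(padded_lengths(token_counts, batch_size=None))  # [40, 40, 40, 40, 40, 40]
```

With smaller batches the padded length varies from batch to batch; with a single batch every example ends up with the same length.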

In [21]:
emotions_encoded["train"].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=6, names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

We can pass ```input_ids``` and ```attention_mask``` to the model as shown below, here for the first five examples. Note that we have to convert them to PyTorch tensors before passing them to the model.

In [43]:
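train_set = emotions_encoded["train"]
input_ids = train_set["input_ids"]
attention_mask = train_set["attention_mask"]
# Convert the first five examples to tensors on the same device as the model
input_ids_batch = torch.tensor(input_ids[:5]).to(device)
attention_mask_batch = torch.tensor(attention_mask[:5]).to(device)
with torch.no_grad():  # inference only, no gradients needed
    output = model(input_ids_batch, attention_mask=attention_mask_batch)
last_hidden_state = output.last_hidden_state
print(last_hidden_state.shape)
# Move to the CPU before converting to a NumPy array
lhs_np = last_hidden_state.cpu().numpy()
print(type(lhs_np))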

torch.Size([5, 87, 768])
<class 'numpy.ndarray'>


What we really want are hidden states across the whole dataset. For this, we can use the ```DatasetDict.map``` function again!

In [54]:
def forward_pass(batch):
    input_ids = torch.tensor(batch["input_ids"]).to(device)
    attention_mask = torch.tensor(batch["attention_mask"]).to(device)
    with torch.no_grad():
        last_hidden_state = model(input_ids, attention_mask=attention_mask).last_hidden_state
    last_hidden_state = last_hidden_state.cpu().numpy()

    # Use the average of the unmasked (non-padding) hidden states for classification.
    # NumPy masked arrays treat True as "ignore", so we invert the attention mask
    # and expand it to the shape [batch_size, n_tokens, hid_dim].
    lhs_shape = last_hidden_state.shape
    boolean_mask = ~np.array(batch["attention_mask"]).astype(bool)
    boolean_mask = np.repeat(boolean_mask, lhs_shape[-1], axis=-1)
    boolean_mask = boolean_mask.reshape(lhs_shape)
    # Average over the token axis (axis=1) to get one hid_dim vector per example
    masked_mean = np.ma.array(last_hidden_state, mask=boolean_mask).mean(axis=1)
    batch["hidden_state"] = masked_mean.data
    return batch
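
The averaging over non-padding tokens can be checked on a tiny example. The sketch below is an alternative formulation that multiplies by the mask instead of using a NumPy masked array; ```masked_mean_pool``` and the toy values are hypothetical and only illustrate the pooling step.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average hidden states over real tokens only, ignoring padding."""
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)  # [batch, tokens, 1]
    summed = (hidden_states * mask).sum(axis=1)                     # [batch, hid_dim]
    counts = mask.sum(axis=1)                                       # [batch, 1]
    return summed / counts

# One example with three tokens and hid_dim = 2; the last token is padding
hidden_states = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])
pooled = masked_mean_pool(hidden_states, attention_mask)
print(pooled)  # [[2. 3.]]
```

The padded token's large values do not affect the result, and the output has one vector of size ```hid_dim``` per example.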

In [55]:
emotions_encoded = emotions_encoded.map(forward_pass, batched=True, batch_size=16)
