## Training a text classifier
Models like DistilBERT are pretrained to predict masked words in a sequence of text, so we can't use them directly for text classification. We have two options for training a model on our Twitter dataset:
 - Feature extraction: use the hidden states as features and train a classifier on them, without modifying the pretrained model
 - Fine-tuning: train the whole model end-to-end, which also updates the parameters of the pretrained model
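
The difference between the two options comes down to which parameters receive gradient updates. A minimal PyTorch sketch, assuming a tiny hypothetical `body` and `head` in place of the real pretrained model and classifier (both names and all layer sizes are made up for illustration):

```python
import torch
from torch import nn

# Hypothetical stand-ins: a tiny "body" playing the role of the pretrained
# encoder and a linear "head" playing the role of the classifier.
body = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 16))
head = nn.Linear(16, 6)  # 6 output classes, chosen only for illustration


def count_trainable(*modules):
    """Count the parameters that an optimizer would update."""
    return sum(
        p.numel() for m in modules for p in m.parameters() if p.requires_grad
    )


# Fine-tuning: every parameter in body and head is trainable.
fine_tune_params = count_trainable(body, head)

# Feature extraction: freeze the body so that only the head is trained.
body.requires_grad_(False)
feature_extraction_params = count_trainable(body, head)

print(f"fine-tuning: {fine_tune_params} trainable parameters")
print(f"feature extraction: {feature_extraction_params} trainable parameters")
```

With the body frozen, only the head's parameters remain trainable, which is what makes feature extraction cheap compared with fine-tuning.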

### Set up the tokenizer (see previous notebook)

In [39]:
from transformers import AutoTokenizer

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)


from datasets import load_dataset
emotions = load_dataset("emotion")

# Apply the tokenizer to a batch of examples. padding=True pads shorter examples with zeros
# to the length of the longest example in the batch; truncation=True truncates examples to
# the model's maximum context size.
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)
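
To make the effect of `padding=True` and `truncation=True` concrete, here is a simplified plain-Python sketch. The `pad_and_truncate` helper and the token ids are hypothetical, and the sketch ignores special tokens such as [CLS] and [SEP], which the real tokenizer handles for us:

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids, not real DistilBERT vocabulary entries.
batch = [[5, 6, 7], [5, 6, 7, 8, 9, 10], [5]]
encoded = pad_and_truncate(batch, max_length=4)

for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"]):
    print(ids, mask)
```

Every sequence in the batch ends up with the same length, so the batch can be stacked into a single tensor, and the attention mask lets the model ignore the padded positions.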



### Transformers as feature extractors

Using a transformer as a feature extractor is fairly simple. We freeze the body's weights during training and use the hidden states as features for the classifier. This lets us quickly train a small or shallow model, which could be a neural classification layer or a method that does not rely on gradients, such as a random forest. The approach is especially convenient if GPUs are unavailable, since the hidden states only need to be precomputed once.

In [40]:
from transformers import AutoModel
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_ckpt).to(device)

# The AutoModel converts the token encodings to embeddings, and then feeds them through the encoder stack to return the hidden states.

text = "this is a test"
inputs = tokenizer(text, return_tensors="pt")
print(f"inputs tensor shape: {inputs['input_ids'].shape}")


inputs tensor shape: torch.Size([1, 6])


In [41]:
# Now that we have the encodings as a tensor, the final step is to place them on the same device as the model and pass the inputs as follows
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    outputs = model(**inputs)
outputs

BaseModelOutput(last_hidden_state=tensor([[[-0.1565, -0.1862,  0.0528,  ..., -0.1188,  0.0662,  0.5470],
         [-0.3575, -0.6484, -0.0618,  ..., -0.3040,  0.3508,  0.5221],
         [-0.2772, -0.4459,  0.1818,  ..., -0.0948, -0.0076,  0.9958],
         [-0.2841, -0.3917,  0.3753,  ..., -0.2151, -0.1173,  1.0526],
         [ 0.2661, -0.5094, -0.3180,  ..., -0.4203,  0.0144, -0.2149],
         [ 0.9441,  0.0112, -0.4714,  ...,  0.1439, -0.7288, -0.1619]]]), hidden_states=None, attentions=None)

Above we used the `torch.no_grad()` context manager to disable the automatic calculation of the gradient. This is useful for inference since it reduces the memory footprint of the computation.
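
A small sketch of what the context manager changes, assuming a single hypothetical linear layer in place of the full model: outputs computed inside `torch.no_grad()` carry no gradient history, so PyTorch does not keep the intermediate results needed for backpropagation.

```python
import torch
from torch import nn

layer = nn.Linear(4, 2)  # hypothetical stand-in for the model
x = torch.randn(3, 4)

# Default behaviour: the output is attached to the autograd graph.
y_tracked = layer(x)

# Inside no_grad the same computation is not recorded for backpropagation.
with torch.no_grad():
    y_inference = layer(x)

print(y_tracked.requires_grad, y_inference.requires_grad)
```

The values of the two outputs are identical; only the bookkeeping for gradients differs.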

In [42]:
outputs.last_hidden_state.size()

torch.Size([1, 6, 768])

A 768-dimensional vector is returned for each of the 6 input tokens.
For classification tasks, it is common practice to use the hidden state associated with the [CLS] token as the input feature. Since this token appears at the start of each sequence, we can extract it by simply indexing into `outputs.last_hidden_state` as follows:

In [43]:
outputs.last_hidden_state[:,0].size()

torch.Size([1, 768])

In [44]:
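# Now that we can do this for a single string, we can generalize it in a function.
# The only difference is that we move the hidden state back to the CPU as a NumPy array.
def extract_hidden_states(batch):
    # Place the model inputs on the same device as the model; the values are already
    # tensors because the dataset format is set to "torch" before map is called
    inputs = {k: v.to(device) for k, v in batch.items() if k in tokenizer.model_input_names}
    # Extract the last hidden states without tracking gradients
    with torch.no_grad():
        last_hidden_states = model(**inputs).last_hidden_state
    # Return the vector for the [CLS] token
    return {"hidden_state": last_hidden_states[:, 0].cpu().numpy()}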


# Since our model expects tensors as inputs, the next step is to convert the input_ids and attention_mask columns to the "torch" format
emotions_encoded.set_format("torch", columns=["input_ids", "attention_mask", "label"])
emotions_hidden = emotions_encoded.map(extract_hidden_states, batched=True)

# The map call above adds a new hidden_state column to the dataset
emotions_hidden["train"].column_names

['text', 'label', 'input_ids', 'attention_mask', 'hidden_state']

### Creating a feature matrix
Now that we have the hidden states associated with each tweet, the next step is to train a classifier on them. For that we need a feature matrix.
We will use the hidden states as input features and the labels as targets. We can easily create the corresponding arrays in the Scikit-learn format as follows:

In [45]:
import numpy as np

X_train = np.array(emotions_hidden["train"]["hidden_state"])
X_valid = np.array(emotions_hidden["validation"]["hidden_state"])
y_train = np.array(emotions_hidden["train"]["label"])
y_valid = np.array(emotions_hidden["validation"]["label"])
X_train.shape, X_valid.shape

((16000, 768), (2000, 768))
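
With the features and labels in this format, any Scikit-learn classifier can be fit on them. The following is only a sketch of that step, not the notebook's actual result: it uses a random forest (one of the gradient-free options mentioned earlier) on small synthetic arrays, and the `_demo` names, sizes, and class centers are all made up so the block runs without the real hidden states:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_classes, n_features = 3, 16  # illustrative sizes, not the real 6 x 768 setup

# Synthetic stand-in for hidden states: one random center per class plus noise.
centers = 3.0 * rng.normal(size=(n_classes, n_features))


def make_split(n_samples):
    """Return a (features, labels) pair in the Scikit-learn format."""
    y = rng.integers(0, n_classes, size=n_samples)
    X = centers[y] + rng.normal(size=(n_samples, n_features))
    return X, y


X_train_demo, y_train_demo = make_split(300)
X_valid_demo, y_valid_demo = make_split(100)

# Fit on the training features and score on the held-out validation features.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train_demo, y_train_demo)
accuracy = clf.score(X_valid_demo, y_valid_demo)
print(f"validation accuracy on synthetic data: {accuracy:.2f}")
```

On the real data, `X_train`, `y_train`, `X_valid`, and `y_valid` from the cell above would take the place of the synthetic arrays.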

In [46]:
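# To check how well the hidden states represent the emotions we want to classify, we can visualize the training set
# Note: UMAP is provided by the umap-learn package; an ImportError on the next line likely means
# that a different package named umap is installed instead
from umap import UMAP
from sklearn.preprocessing import MinMaxScaler
import pandas as pd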

# Scale the features to the [0, 1] range
X_scaled = MinMaxScaler().fit_transform(X_train)
# Initialize UMAP and fit it to the scaled data
mapper = UMAP(n_components=2, metric="cosine").fit(X_scaled)
# Create a DataFrame with the resulting 2D embedding and the corresponding labels
df_emb = pd.DataFrame(mapper.embedding_, columns=["x", "y"])
df_emb["label"] = y_train
df_emb.head()

ImportError: cannot import name 'UMAP' from 'umap' (/home/nurbot/anaconda3/envs/hf/lib/python3.10/site-packages/umap/__init__.py)
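
The scaling step above rescales each feature column independently so that its minimum maps to 0 and its maximum to 1. A minimal sketch on a made-up 3x2 array (the `X_demo` values are hypothetical), checking `MinMaxScaler` against the formula (x - min) / (max - min):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up feature matrix: 3 samples, 2 features on very different scales.
X_demo = np.array([[1.0, 10.0],
                   [2.0, 30.0],
                   [4.0, 20.0]])

X_sklearn = MinMaxScaler().fit_transform(X_demo)

# The same transformation written out by hand, column by column.
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_manual = (X_demo - col_min) / (col_max - col_min)

print(X_sklearn)
print(np.allclose(X_sklearn, X_manual))
```

After scaling, both columns span the same range, so no single feature dominates the distance computation just because of its units.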