**Tokenization**



When we collect data, we get raw text. The model, however, cannot read text directly; it only processes numbers. We therefore need to tokenize the text: split it into tokens and map each token to a numeric ID that the model can work with.

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "My name is Zeineb, I am 23 years old and I want to learn more about model training and fine tuning."
tokens = tokenizer.tokenize(sequence)

print(tokens)

['My', 'name', 'is', 'Z', '##ein', '##eb', ',', 'I', 'am', '23', 'years', 'old', 'and', 'I', 'want', 'to', 'learn', 'more', 'about', 'model', 'training', 'and', 'fine', 'tuning', '.']
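The output above shows that "Zeineb" is not stored as a single vocabulary entry, so it is broken into smaller pieces, with `##` marking a piece that continues the previous one. BERT tokenizers use WordPiece for this. The following is only a minimal sketch of greedy longest-match-first splitting, not the library's implementation; the function name `wordpiece_split` and the tiny `toy_vocab` are hypothetical and chosen just to reproduce the split seen above.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the ## prefix
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption, not the real bert-base-cased vocabulary)
toy_vocab = {"My", "name", "is", "Z", "##ein", "##eb", "##e"}

print(wordpiece_split("name", toy_vocab))    # ['name']
print(wordpiece_split("Zeineb", toy_vocab))  # ['Z', '##ein', '##eb']
```

Note that `##eb` wins over the shorter `##e` because the longest candidate is always tried first.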


**Convert to IDs to be read by the model**

In [5]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[1422, 1271, 1110, 163, 20309, 15581, 117, 146, 1821, 1695, 1201, 1385, 1105, 146, 1328, 1106, 3858, 1167, 1164, 2235, 2013, 1105, 2503, 19689, 119]


**Decoding IDs to get text**

In [7]:
decoded_string = tokenizer.decode(ids)
print(decoded_string)

My name is Zeineb, I am 23 years old and I want to learn more about model training and fine tuning.
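To make the round trip concrete, here is a small sketch of what converting tokens to IDs and decoding them back amounts to: a dictionary lookup in each direction, plus gluing the `##` pieces back onto the preceding token. The token-to-ID pairs are copied from the outputs above; the helper `decode` and the name `toy_vocab` are hypothetical, and the library's own `decode` evidently also tidies spacing around punctuation, which this sketch does not attempt.

```python
# Token-to-ID pairs copied from the outputs above; the real vocabulary is far larger
toy_vocab = {"My": 1422, "name": 1271, "is": 1110, "Z": 163, "##ein": 20309, "##eb": 15581}
id_to_token = {token_id: token for token, token_id in toy_vocab.items()}


def decode(ids):
    """Map IDs back to tokens and merge ## continuation pieces into whole words."""
    words = []
    for token in (id_to_token[i] for i in ids):
        if token.startswith("##") and words:
            words[-1] += token[2:]  # append the piece without its ## prefix
        else:
            words.append(token)
    return " ".join(words)


ids = [toy_vocab[token] for token in ["My", "name", "is", "Z", "##ein", "##eb"]]
print(ids)          # [1422, 1271, 1110, 163, 20309, 15581]
print(decode(ids))  # My name is Zeineb
```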


**Handling multiple sentences**

In [12]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)


Now we will tokenize a list of sentences:

In [11]:
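sentences = [
    "I am currently a software engineering student.",
    "I still want to learn so many things about data science and machine learning.",
    "Education never ends!",
]

# Split each sentence into tokens, then map each token list to vocabulary IDs
tokens = [tokenizer.tokenize(sentence) for sentence in sentences]
ids = [tokenizer.convert_tokens_to_ids(sentence_tokens) for sentence_tokens in tokens]

# Avoid naming the loop variable "id", which would shadow the Python built-in
for sentence_ids in ids:
    print("ID : ", sentence_ids)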

ID :  [1045, 2572, 2747, 1037, 4007, 3330, 3076, 1012]
ID :  [1045, 2145, 2215, 2000, 4553, 2061, 2116, 2477, 2055, 2951, 2671, 1998, 3698, 4083, 1012]
ID :  [2495, 2196, 4515, 999]


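Now we need to convert the IDs into a tensor so that the model can read them. The problem is that a tensor must be rectangular: every row needs the same length, and our three sentences have different lengths. That is why we pass the argument padding=True to the tokenizer. It pads the shorter sequences with the padding token ID (0 for this tokenizer) and returns an attention mask that marks which positions are real tokens and which are padding. Calling the tokenizer directly also adds the model's special tokens ([CLS] and [SEP]) around each sentence, which the separate tokenize and convert_tokens_to_ids steps above do not.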
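As a rough illustration of what padding does, the sketch below pads the three ID lists printed above to a common length and builds the matching mask by hand. The helper name `pad_batch` and its `pad_id` parameter are hypothetical, and the sketch leaves out the special tokens that the real tokenizer call adds.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every sequence to the longest length and build an attention mask."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks a padding position
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask


# ID lists copied from the output above (no special tokens)
batch = [
    [1045, 2572, 2747, 1037, 4007, 3330, 3076, 1012],
    [1045, 2145, 2215, 2000, 4553, 2061, 2116, 2477, 2055, 2951, 2671, 1998, 3698, 4083, 1012],
    [2495, 2196, 4515, 999],
]

input_ids, attention_mask = pad_batch(batch)
for row, mask in zip(input_ids, attention_mask):
    print(len(row), row)
    print(len(mask), mask)
```

Every row now has the length of the longest sentence, so the nested list can be turned into a tensor.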

In [28]:
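import torch

# padding=True pads every sentence to the length of the longest one
ids = tokenizer(sentences, padding=True)
all_ids = torch.tensor(ids.input_ids)

# Prediction. Only the input IDs are passed here, so the model also attends
# to the padding positions; pass the attention mask as well to ignore them.
output = model(all_ids)

print(output)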

SequenceClassifierOutput(loss=None, logits=tensor([[ 0.4805, -0.2369],
        [-2.8513,  2.9278],
        [-0.5036,  0.5805]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


In [33]:
print(output.logits)

tensor([[ 0.4805, -0.2369],
        [-2.8513,  2.9278],
        [-0.5036,  0.5805]], grad_fn=<AddmmBackward0>)
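The logits are raw, unnormalized scores, one column per class. A common way to read them is to apply a softmax so that each row sums to 1. The sketch below does this in plain Python on the values printed above. The label order in `labels` is an assumption for this checkpoint (index 0 negative, index 1 positive); it can be checked with `model.config.id2label`.

```python
import math


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the output above
logits = [[0.4805, -0.2369], [-2.8513, 2.9278], [-0.5036, 0.5805]]
labels = ["NEGATIVE", "POSITIVE"]  # assumed index order, see model.config.id2label

all_probs = [softmax(row) for row in logits]
for probs in all_probs:
    best = max(range(len(probs)), key=probs.__getitem__)
    print(labels[best], [round(p, 4) for p in probs])
```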


**Conclusion**

In [35]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
print(output)

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [-3.6183,  3.9137]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
