<a href="https://colab.research.google.com/github/tobias-hoepfl/Digital-Organizations-SE/blob/main/learning_portfolio/3_nlp_introduction_Hoepfl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Theoretical overview

### NLP

- Natural language processing (NLP)
- Focuses on enabling computers to understand, interpret, and generate human language (text, speech, dialogue)
- Applications: predict the next word in a sentence, classify text into categories, translation, chatbots, virtual assistants, sentiment analysis, text summarization

### Transformers

- Transformers: a type of neural network architecture
- Used for natural language processing tasks (e.g., translation, summarization)
- Essential building block: the attention mechanism, which lets the model focus on the parts of the input that matter most for a prediction
- Attention captures long-range dependencies between tokens
- Advantage over recurrent models: the tokens of a sequence are processed in parallel instead of one after another, which makes training faster and more efficient
- Widely used in NLP
- Encoder vs. decoder models (or a combination of both): different use cases
- Encoders: receive a sequence of input tokens and produce a contextual representation of the sequence that captures its semantic meaning; auto-encoding, i.e. trained to reconstruct corrupted (e.g. masked) input (e.g. BERT)
- Decoders: focus on predicting the next word; auto-regressive (e.g. GPT)
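
To make the idea of attention more concrete, here is a minimal sketch of scaled dot-product attention for a single query in plain Python. The vectors are made-up toy values, and real Transformers add learned projections and multiple heads on top of this, so it only illustrates the principle:

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    # Similarity of the query with every key, scaled by sqrt(d_k)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical 2-dimensional vectors for three input tokens
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 1.0]  # most similar to the third key

weights, output = attention(query, keys, values)
print(weights)
print(output)
```

The third token gets the largest weight because its key is most similar to the query; this is what "focusing on certain parts of the input" means in practice.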

### Pre-trained NLP models:

- Have already been trained on a large amount of text data
- Training a model from scratch takes a huge amount of data, time, and compute, so reusing a pre-trained model saves resources
- Fine-tuning means adapting a pre-trained NLP model to a specific task or domain

### Tokenization:

- Breaking down a piece of text into smaller parts, like individual words or parts of words (subwords)
- Necessary for many NLP tasks: each token is mapped to a numeric ID that the model can process, and the model then works out what a token means from its context in the sentence
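
As an illustration of subword tokenization, the following sketch splits words greedily into the longest pieces found in a vocabulary, in the spirit of the WordPiece scheme used by BERT. The vocabulary is a hypothetical toy set and the function names are my own; real tokenizers use vocabularies with tens of thousands of entries and extra normalization steps:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter piece
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

def tokenize(text, vocab):
    """Lowercase, split on whitespace, then split every word into subwords."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens

# Hypothetical toy vocabulary
toy_vocab = {"i", "love", "token", "##ization", "##s", "play", "##ing"}

print(tokenize("I love tokenization", toy_vocab))
print(tokenize("playing tokens", toy_vocab))
```

Splitting into subwords keeps the vocabulary small while still covering rare words such as "tokenization".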
    


## Getting used to Hugging Face

To get used to Hugging Face, I replicated the most important steps from the first chapters (some of them are also shown in the video), then commented on them and experimented with them in the following part:

In [4]:
!pip install datasets evaluate transformers[sentencepiece]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting transformers[sentencepiece]
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting dill<0.3.7,>=0.3.0 (from datasets)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.2.0

### Using pipelines

In [5]:
# pipeline() wraps tokenization, the model, and post-processing in a single call
from transformers import pipeline

raw_inputs = [
    "This is great",
    "I love it",
    "Although there are many positives, I still hate it",
    "I kind of like it",
    "I am not sure if I like it"
]

classifier = pipeline("sentiment-analysis")
classifier(raw_inputs)

# Already works quite well for this easy task

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



[{'label': 'POSITIVE', 'score': 0.9998785257339478},
 {'label': 'POSITIVE', 'score': 0.9998799562454224},
 {'label': 'NEGATIVE', 'score': 0.9966195821762085},
 {'label': 'POSITIVE', 'score': 0.9996015429496765},
 {'label': 'NEGATIVE', 'score': 0.9977300763130188}]

### Behind the pipeline

In [6]:
# First tokenize the raw text
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

# input_ids: the vocabulary index of each token (101 = [CLS], 102 = [SEP], 0 = [PAD])
# attention_mask: 1 marks a real token, 0 marks padding that the model should ignore

{'input_ids': tensor([[ 101, 2023, 2003, 2307,  102,    0,    0,    0,    0,    0,    0,    0,
            0],
        [ 101, 1045, 2293, 2009,  102,    0,    0,    0,    0,    0,    0,    0,
            0],
        [ 101, 2348, 2045, 2024, 2116, 3893, 2015, 1010, 1045, 2145, 5223, 2009,
          102],
        [ 101, 1045, 2785, 1997, 2066, 2009,  102,    0,    0,    0,    0,    0,
            0],
        [ 101, 1045, 2572, 2025, 2469, 2065, 1045, 2066, 2009,  102,    0,    0,
            0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])}
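
To see where the zeros in the output above come from, here is a minimal sketch of padding in plain Python. It is not the tokenizer's actual implementation; the helper name `pad_batch` is my own, and the token ids are copied from three of the printed rows:

```python
PAD_ID = 0  # padding id seen in the tokenizer output above

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the length of the longest one and build the mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 for real tokens, 0 for padding positions
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token ids copied from the output above (sentences 1, 3, and 4)
sequences = [
    [101, 2023, 2003, 2307, 102],
    [101, 2348, 2045, 2024, 2116, 3893, 2015, 1010, 1045, 2145, 5223, 2009, 102],
    [101, 1045, 2785, 1997, 2066, 2009, 102],
]

batch = pad_batch(sequences)
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```

All rows end up with 13 entries, the length of the longest sentence, so they can be stacked into one rectangular tensor.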


In [7]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# ** unpacks the dict, so input_ids and attention_mask are passed as keyword arguments
outputs = model(**inputs)

# print outputs directly
print(outputs)

# The logits are the raw, unnormalized scores per class, before softmax is applied

# Softmax turns the logits into probabilities that sum to 1 per row
# (the multi-class generalization of the sigmoid)
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Column 0 is the probability for NEGATIVE, column 1 for POSITIVE (see model.config.id2label)
print(predictions)

SequenceClassifierOutput(loss=None, logits=tensor([[-4.3328,  4.6832],
        [-4.3327,  4.6951],
        [ 3.0994, -2.5869],
        [-3.7798,  4.0477],
        [ 3.3777, -2.7080]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
tensor([[1.2144e-04, 9.9988e-01],
        [1.2001e-04, 9.9988e-01],
        [9.9662e-01, 3.3804e-03],
        [3.9844e-04, 9.9960e-01],
        [9.9773e-01, 2.2700e-03]], grad_fn=<SoftmaxBackward0>)
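
The printed probabilities can be reproduced by hand. The sketch below applies softmax in plain Python to the logits copied from the output above and maps the larger probability to a label. The `id2label` dictionary is written out manually here, under the assumption that it matches `model.config.id2label`:

```python
import math

def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    shift = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the printed SequenceClassifierOutput above
logits = [
    [-4.3328, 4.6832],
    [-4.3327, 4.6951],
    [3.0994, -2.5869],
    [-3.7798, 4.0477],
    [3.3777, -2.7080],
]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # assumed to match model.config.id2label

probabilities = [softmax(row) for row in logits]
labels = [id2label[row.index(max(row))] for row in probabilities]
for label, row in zip(labels, probabilities):
    print(label, round(max(row), 4))
```

The labels agree with what the pipeline returned earlier, which shows that the pipeline is essentially tokenizer, model, and softmax chained together.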


### Fine-tuning a pretrained model

In [8]:
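import torch
from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated in favor of this one
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# One training step on a tiny batch of two examples

training_example = [
    "It is not good",
    "It is fine"
]

training_batch = tokenizer(training_example, padding=True, truncation=True, return_tensors="pt")

# 0 for negative, 1 for positive
training_batch["labels"] = torch.tensor([0, 1])

# optimizer used in the example
optimizer = AdamW(model.parameters())
# because labels are passed, the model also returns the loss
loss = model(**training_batch).loss
# compute the gradients of the loss with respect to the model parameters
loss.backward()
# update the parameters using these gradients
optimizer.step()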
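
What the last two calls do can be sketched on a toy model. The example below is not BERT and does not use AdamW: it is a one-weight logistic regression with hand-derived gradients and plain gradient descent, using made-up data, purely to show the loss, gradient, and update cycle:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grads(w, b, xs, ys):
    """Mean binary cross-entropy and its gradients for p = sigmoid(w * x + b)."""
    n = len(xs)
    loss = grad_w = grad_b = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p)) / n
        grad_w += (p - y) * x / n  # d loss / d w
        grad_b += (p - y) / n      # d loss / d b
    return loss, grad_w, grad_b

xs = [-1.0, 1.0]  # hypothetical one-dimensional features
ys = [0, 1]       # 0 for negative, 1 for positive
w, b = 0.0, 0.0   # initial parameters
lr = 0.5          # learning rate, chosen arbitrarily for this toy example

loss_before, grad_w, grad_b = loss_and_grads(w, b, xs, ys)  # role of loss.backward()
w -= lr * grad_w                                            # role of optimizer.step()
b -= lr * grad_b
loss_after, _, _ = loss_and_grads(w, b, xs, ys)

print(loss_before, loss_after)
```

A single step already lowers the loss; fine-tuning repeats this cycle over many batches, with AdamW choosing the step sizes more cleverly than a fixed learning rate.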



In [9]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

#I use sst2 dataset instead of mrpc
ds = load_dataset("glue", "sst2")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

#discover structure of the dataset (a bit different to the one used in the chapter)
ds


DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

In [10]:
def tokenize_function(example):
    return tokenizer(example["sentence"], truncation=True)

tokenized_ds = ds.map(tokenize_function, batched=True)

# The data collator assembles samples into a batch and pads them to the longest sequence in that batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


In [11]:
# The following part did not work with the SST-2 dataset, so I switched back to MRPC and used the code from the online chapter
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


Generating train split:   0%|          | 0/3668 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/408 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1725 [00:00<?, ? examples/s]


In [12]:
#prepare training
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

#define model
from transformers import AutoModelForSequenceClassification

# MRPC is a binary task (paraphrase or not a paraphrase), so the classification head needs two labels
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [None]:
# fast way to compute metrics, as shown in the tutorial chapter
import evaluate
import numpy as np

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    # pick the class with the highest logit for every example
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
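
The GLUE MRPC metric reports accuracy and F1. As a rough sketch of what happens inside `compute_metrics`, the following plain-Python version takes the argmax of each logit row and computes both numbers by hand. The logits and labels are invented for illustration, and the helper names are my own:

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_and_f1(predictions, references, positive=1):
    """Accuracy plus F1 for the positive class."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}

# Hypothetical logits for four sentence pairs and their true labels
logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
labels = [1, 0, 0, 1]

predictions = [argmax(row) for row in logits]
metrics = accuracy_and_f1(predictions, labels)
print(predictions, metrics)
```

F1 is reported next to accuracy because it balances precision and recall for the positive (paraphrase) class, which accuracy alone does not show.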

In [13]:
from transformers import Trainer

training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
# training only ran successfully after switching to a GPU
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 500  | 0.5457        |
| 1000 | 0.3367        |


TrainOutput(global_step=1377, training_loss=0.3755762880057984, metrics={'train_runtime': 198.8452, 'train_samples_per_second': 55.34, 'train_steps_per_second': 6.925, 'total_flos': 406183858377360.0, 'train_loss': 0.3755762880057984, 'epoch': 3.0})
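
The numbers in `TrainOutput` can be checked with a bit of arithmetic. The sketch below assumes the `TrainingArguments` defaults that were left untouched above (batch size 8 per device, 3 epochs, logging every 500 steps) and a single GPU; the training set size of 3668 is the MRPC train split:

```python
import math

num_train_examples = 3668  # size of the MRPC train split
batch_size = 8             # assumed default per_device_train_batch_size, single device
num_epochs = 3             # assumed default num_train_epochs
logging_steps = 500        # assumed default logging interval
train_runtime = 198.8452   # seconds, copied from TrainOutput

# The last batch of an epoch may be smaller, hence the ceiling
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_train_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime
logged_steps = list(range(logging_steps, total_steps + 1, logging_steps))

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 2), round(steps_per_second, 3))
print(logged_steps)
```

Under these assumptions the result is 1377 steps in total, which matches `global_step`, and the training loss is logged only at steps 500 and 1000, which explains why the table has just two rows.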