# Hand-in Group
TODO: State the names of all group members and TUM-IDs

* Bhat Maitri: 03756699
* Biller Valentin: 03724152
* Szabo Daniel: 03726951

# Preparation

## Install Dependencies

In [1]:
!pip install datasets
!pip install tokenizers
!pip install transformers
!pip install stanza
# -- Initialize Stanza --
import stanza
stanza.download('en')






2024-05-13 19:32:18 INFO: Downloaded file to /Users/valentinbiller/stanza_resources/resources.json
2024-05-13 19:32:18 INFO: Downloading default packages for language: en (English) ...
2024-05-13 19:32:19 INFO: File exists: /Users/valentinbiller/stanza_resources/en/default.zip
2024-05-13 19:32:23 INFO: Finished downloading models and saved to /Users/valentinbiller/stanza_resources


## Download Datasets
Note: If the following script fails because the medical transcriptions dataset is no longer available, it can be downloaded from https://www.kaggle.com/tboyle10/medicaltranscriptions

In [2]:
# -- OPEN-I dataset --
!mkdir -p open_i
!wget -q -N -P open_i https://openi.nlm.nih.gov/imgs/collections/NLMCXR_reports.tgz
!tar zxf open_i/NLMCXR_reports.tgz -C open_i
import glob
num_files = len(glob.glob("open_i/ecgen-radiology/*.xml"))
print(f'Downloaded {num_files} files from the OpenI dataset into "open_i/ecgen-radiology"')

# -- Transcriptions Dataset --
!mkdir -p medical_transcriptions
!gdown --id 1E0hm3r9bwK8cujyIcOjp_y-ZEPt1HBjn -O medical_transcriptions/medical_transcriptions.zip
!unzip -o medical_transcriptions/medical_transcriptions.zip -d medical_transcriptions
print(f'Downloaded medical transriptions dataset')




Downloaded 9051 files from the OpenI dataset into "open_i/ecgen-radiology"
Downloading...
From: https://drive.google.com/uc?id=1E0hm3r9bwK8cujyIcOjp_y-ZEPt1HBjn
To: /Users/valentinbiller/Library/Mobile Documents/com~apple~CloudDocs/• Uni/M.Sc. Robotics, Cognition, Intelligence/Artificial Intelligence in Medicine II/Homework/Code/NLP/medical_transcriptions/medical_transcriptions.zip
100%|██████████████████████████████████████| 5.08M/5.08M [00:00<00:00, 6.92MB/s]
Archive:  medical_transcriptions/medical_transcriptions.zip
  inflating: medical_transcriptions/mtsamples.csv  
Downloaded medical transriptions dataset


# Task 1: Pre-Processing and Vocabulary Construction (not graded!)

## Note
- Task 1 is not graded, i.e. it is not mandatory to solve this task. Instead, it serves solely to show how pre-processing and tokenization can be done.
- Note that tasks 2 and 3 are mandatory!

## Goals
- Understand tools for text pre-processing
- Apply text normalization and sentence splitting to a medical corpus
- Train a tokenizer on a medical corpus

## Tools
- [Huggingface Datasets](https://huggingface.co/docs/datasets/): Library for efficient loading, saving and processing of text datasets
  - Note: In many cases pandas is sufficient for text datasets (as they often fit into memory), but Huggingface Datasets provides some useful features such as batched mapping of samples.
- [Huggingface Tokenizers](https://huggingface.co/docs/tokenizers/python/latest/): Library for text normalization and tokenization
- [Stanza](https://stanfordnlp.github.io/stanza/): Text processing toolkit from the Stanford NLP group
   - useful for sentence splitting

## Dataset
OpenI [[Website](https://openi.nlm.nih.gov/faq#collection)] [[Download](https://openi.nlm.nih.gov/imgs/collections/NLMCXR_reports.tgz)]
- Contains reports and scans
- We only use the reports from this dataset

## Step 1 - Text Normalization

In this step the goal is to normalize text from the corpus.

TODO: Implement a function for normalizing a single text sample (e.g. a report). This function will later be applied to all samples of the corpus.
You can decide which text normalization you find appropriate and implement one or some of them. Typical examples are:
- Unicode normalization
- Stripping accents
- Lowercasing
- Removing control characters
- Normalizing whitespace characters (i.e. replacing all whitespaces like tabs, ... by the default whitespace) and removing redundant whitespaces
- Removing or normalizing special characters

Note: you may implement each normalization using regex/string operations or you can use normalizers from https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.normalizers
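
As a rough illustration of the regex/string route, the sketch below chains several of the normalizations listed above using only the standard library. The name `normalize_text_sketch` and the choice and order of steps are assumptions made for this example, not the reference solution; which steps are appropriate depends on the corpus.

```python
import re
import unicodedata


def normalize_text_sketch(text: str) -> str:
    """Hypothetical example normalizer; the selection of steps is an assumption."""
    # Unicode normalization: NFKD splits accented characters into base + combining mark
    text = unicodedata.normalize("NFKD", text)
    # Strip accents by dropping the combining marks produced above
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase
    text = text.lower()
    # Remove control characters, but keep whitespace so it can be collapsed next
    text = "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch)[0] != "C"
    )
    # Replace any run of whitespace (tabs, newlines, ...) by a single space
    return re.sub(r"\s+", " ", text).strip()


print(normalize_text_sketch("Schöner Tag"))
print(normalize_text_sketch("  Heart\tsize\n is   NORMAL. "))
```

Lowercasing and accent stripping discard information, so they are worth reconsidering if casing or diacritics carry meaning in the target text.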

In [None]:
from tokenizers.normalizers import *

# TODO: implement normalize_text using regex/string operations and/or normalizers from tokenizers.normalizers
def normalize_text(text: str) -> str:
  return None





In [None]:
# apply normalization to some example text
# you can play around with different example texts to understand the effects of each normalizer
example_text = 'Schöner Tag'

print(normalize_text(example_text))

## Step 2 - Sentence Splitting

In this step the goal is to split each section into sentences.

TODO: Implement a function for splitting a single report section (given as a single string) into sentences (returned as a list of strings, one string for each sentence). This function will later be applied to all samples of the corpus.
For sentence splitting, the stanza tokenizer should be used, as it provides a more robust solution than splitting on pre-defined characters (like ".").
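
A minimal sketch of how the splitting could look is given below. It assumes a Stanza pipeline with the `tokenize` processor is passed in as `pipeline`; the helper name `split_sentences_sketch` and the guard for empty sections are assumptions of this example. A Stanza document exposes its sentences as `doc.sentences`, and each sentence provides its raw text as `.text`.

```python
from typing import Any, List


def split_sentences_sketch(text: str, pipeline: Any) -> List[str]:
    """Hypothetical helper: split `text` into sentences with a Stanza-style pipeline."""
    # Skip the pipeline for empty sections (some reports have no findings text)
    if not text or not text.strip():
        return []
    doc = pipeline(text)
    # One string per detected sentence
    return [sentence.text for sentence in doc.sentences]


# Usage inside the notebook, assuming the `stanza_tokenizer` pipeline defined there:
# split_sentences_sketch("First sentence. Second one, e.g. this.", stanza_tokenizer)
```

Passing the pipeline as an argument keeps the function easy to test without loading the Stanza models.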


In [None]:
import stanza
from typing import List

stanza_tokenizer = stanza.Pipeline('en', processors='tokenize', use_gpu=True)

def split_sentences(text: str) -> List[str]:
  # TODO: use the stanza_tokenizer to split the text string into sentences
  # See https://stanfordnlp.github.io/stanza/tokenize.html#tokenization-and-sentence-segmentation
  return None


In [None]:
# apply sentence splitting on some example text
example_text = "This is the first sentence. Let's try something else, e.g. this. Another sentence?"

print(split_sentences(example_text))

## Step 3 - Apply text normalization and sentence splitting to the Open-I dataset (nothing TODO here)
We now apply text normalization and sentence splitting to the Open-I dataset.
Each sample of the Open-I dataset contains a report with (among others) a "findings" and an "impression" section.

We make use of the huggingface datasets library for efficiently applying text normalization and sentence splitting to both sections of each sample of the Open-I dataset.
This step is already implemented but utilizes the functions implemented in Step 1 and 2.


1. Load the Open-I dataset as a huggingface Dataset

In [None]:
from typing import List
import glob
from tqdm import tqdm
import xml.etree.ElementTree as ET
from datasets import Dataset

def load_open_i_dataset():
  # Load from xml-files
  uIds = []
  findings_sections = []
  impression_sections = []
  for file in tqdm(glob.glob("open_i/ecgen-radiology/*.xml")):
    tree = ET.parse(file)

    uId = tree.find("./uId").get('id')
    uIds.append(uId)

    finding = tree.find(".//AbstractText[@Label='FINDINGS']").text
    findings_sections.append(finding if finding is not None else '')

    imp = tree.find(".//AbstractText[@Label='IMPRESSION']").text
    impression_sections.append(imp if imp is not None else '')

  assert len(uIds) > 0, 'No data found. Download the data first.'
  return Dataset.from_dict({
      'uId': uIds,
      'findings': findings_sections,
      'impression': impression_sections
  })

data = load_open_i_dataset()
print(data) # Overview over the dataset
print(data[0])  # The first sample of the dataset
# Save to be loaded later
data.save_to_disk('open_i_raw')

2. Apply normalize_text to the findings and impression sections using the map function of datasets.Dataset

In [None]:
from datasets import load_from_disk, Dataset

data: Dataset = load_from_disk('open_i_raw')
def normalize_text_samples(sample: dict) -> dict:
  return {
      'findings_normalized': normalize_text(sample['findings']),
      'impression_normalized': normalize_text(sample['impression'])
  }
data = data.map(normalize_text_samples)

print(data) # Overview over the dataset
print(data[0])  # The first sample of the dataset

3. Apply split_sentences to the findings and impression sections using the map function of datasets.Dataset

In [None]:
def split_sample_sentences(sample: dict) -> dict:
  return {
      'findings_sentences': split_sentences(sample['findings']),
      'impression_sentences': split_sentences(sample['impression'])
  }

split_dataset = data.map(split_sample_sentences)

print(split_dataset)
print(split_dataset[0])
split_dataset.save_to_disk('open_i_split')





## Step 4 - Vocabulary Creation: Train a tokenizer
In this step the goal is to learn a vocabulary (i.e. tokenizer) from the Open-I sentences. We therefore extract a list of all sentences (from findings and impression section) in the dataset. This list is then used for learning the vocabulary.

TODO: Extract all sentences and store them in all_sentences.
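
One possible way to do this is sketched below. It assumes each sample is a dict with the two sentence-list columns created in Step 3 (iterating over a `datasets.Dataset` yields such dicts); the helper name `collect_sentences` and the filtering of empty strings are assumptions of this example.

```python
from typing import Any, Dict, Iterable, List, Sequence


def collect_sentences(
    samples: Iterable[Dict[str, Any]],
    columns: Sequence[str] = ("findings_sentences", "impression_sentences"),
) -> List[str]:
    """Hypothetical helper: flatten the sentence lists of all samples into one list."""
    sentences: List[str] = []
    for sample in samples:
        for column in columns:
            # Keep only non-empty sentences
            sentences.extend(s for s in sample[column] if s.strip())
    return sentences


# Toy data standing in for the split dataset
toy_samples = [
    {"findings_sentences": ["heart size is normal.", "lungs are clear."],
     "impression_sentences": ["no acute disease."]},
    {"findings_sentences": [], "impression_sentences": ["normal chest."]},
]
print(collect_sentences(toy_samples))
```

In the notebook this would be called as `collect_sentences(split_dataset)`.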

In [None]:
from datasets import load_from_disk, Dataset
from typing import List
from tqdm import tqdm

split_dataset: Dataset = load_from_disk('open_i_split')
print(split_dataset)

# TODO: extract the list of all sentences in the whole dataset (from both sections)
all_sentences: List[str] = None

print(f'\nTotal number of sentences: {len(all_sentences)}')
print(f'Total number of words (by whitespaces): {sum(len(sent.split()) for sent in all_sentences)}')
num_unique_words = len(set(word for sent in all_sentences for word in sent.split()))
print(f'Total number of different words: {num_unique_words}')


We now define the initial alphabet (optional; by default the alphabet is derived from the characters present in the dataset) and the size of the vocabulary we want to create.

In [None]:
import string

# All ascii characters (lowercase only) + digits
initial_alphabet = list(string.ascii_lowercase) + list(string.digits)
print(f'Initial alphabet size: {len(initial_alphabet)}')

vocab_size = 50000

Now we use the huggingface tokenizer library to learn a WordPiece tokenizer. This is already implemented but you can try with different arguments or also try other tokenizers, e.g. BPE.

In [None]:

from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, decoders, trainers, processors

special_tokens = ["<PAD>", "<UNK>", "<BOS>", "<EOS>"]
tokenizer = Tokenizer(models.WordPiece(unk_token='<UNK>'))
# we use a lowercase vocabulary
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
tokenizer.post_processor = processors.TemplateProcessing(
    single="<BOS> $0 <EOS>",
    pair="<BOS> $A <EOS> <BOS> $B <EOS>",
    special_tokens=[("<BOS>", 2), ("<EOS>", 3)],
)
tokenizer.decoder = decoders.WordPiece()

trainer = trainers.WordPieceTrainer(
    vocab_size=vocab_size,
    initial_alphabet=initial_alphabet,
    special_tokens=special_tokens,
)

# now train it
tokenizer.train_from_iterator(all_sentences, trainer=trainer)

print(f'Vocab size: {tokenizer.get_vocab_size()}')
print('Vocab: \n', tokenizer.get_vocab())

tokenizer.save('tokenizer.json', pretty=True)

### Inspection of the created vocabulary
You can now take a look at the created vocabulary by opening the file "tokenizer.json".

You can also try tokenizing some samples:

In [None]:
example_section = split_dataset[0]['findings_sentences']
example_sentence = example_section[0]

print(example_sentence)

encoded = tokenizer.encode(example_sentence)
print(f'Encoded tokens: {encoded.tokens}')
print(f'Encoded token ids: {encoded.ids}')
print(f'Decoded again: {tokenizer.decode(encoded.ids)}')
print(f'Decoded (keep special tokens): {tokenizer.decode(encoded.ids, skip_special_tokens=False)}')

# Task 2: Training a Text Classifier (graded!)
## Goals
- Understand how pre-trained language models can be utilized for downstream tasks

## Tools
- [Huggingface Transformers](https://huggingface.co/transformers/): Library for pre-trained transformer-based language models

## Dataset
Medical Transcriptions [[Kaggle](https://www.kaggle.com/tboyle10/medicaltranscriptions)]

## Step 1 - Load the Medical Transcriptions Dataset (Nothing TODO here)
We first load the Medical Transcriptions dataset as a huggingface dataset. The dataset is split into train and test set, so the returned type is a datasets.DatasetDict, which acts like a dictionary of datasets.Dataset but provides utility functions, e.g. for mapping all datasets of the dict using the map method.
This step is already implemented.

In [3]:
from typing import List, Tuple
import pandas as pd
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split

def load_transciptions_dataset() -> Tuple[DatasetDict, List[str]]:
    df = pd.read_csv('medical_transcriptions/mtsamples.csv')
    df = df.drop(['Unnamed: 0'],axis=1,)

    counts = df['medical_specialty'].value_counts()
    print(f'Original counts:\n{counts}')
    dropped_specialities = [k for k, v in counts.items() if v < 100]
    for dropped_speciality in dropped_specialities:
      df = df[df['medical_specialty'] != dropped_speciality]
    df.dropna(inplace=True)
    counts = df['medical_specialty'].value_counts()
    print(f'Counts after removing small specialities:\n{counts}')

    df['medical_specialty'] = df['medical_specialty'].astype('category')
    class_names = df['medical_specialty'].cat.categories.tolist()
    print(f'Class names : {class_names}')
    df['medical_specialty'] = df['medical_specialty'].cat.codes

    train, test = train_test_split(df, stratify=df['medical_specialty'], test_size=0.25)
    dataset = DatasetDict({'train': Dataset.from_pandas(train), 'test': Dataset.from_pandas(test)})
    dataset = dataset.remove_columns(['__index_level_0__'])

    return dataset, class_names

In [4]:
dataset, class_names = load_transciptions_dataset()

Original counts:
 Surgery                          1103
 Consult - History and Phy.        516
 Cardiovascular / Pulmonary        372
 Orthopedic                        355
 Radiology                         273
 General Medicine                  259
 Gastroenterology                  230
 Neurology                         223
 SOAP / Chart / Progress Notes     166
 Obstetrics / Gynecology           160
 Urology                           158
 Discharge Summary                 108
 ENT - Otolaryngology               98
 Neurosurgery                       94
 Hematology - Oncology              90
 Ophthalmology                      83
 Nephrology                         81
 Emergency Room Reports             75
 Pediatrics - Neonatal              70
 Pain Management                    62
 Psychiatry / Psychology            53
 Office Notes                       51
 Podiatry                           47
 Dermatology                        29
 Cosmetic / Plastic Surgery         27
 Dentist

In [5]:
print('Training samples: ', len(dataset['train']['transcription']))
print('Test samples: ', len(dataset['test']['transcription']))
print('Columns: ', dataset['train'].features)

Training samples:  2315
Test samples:  772
Columns:  {'description': Value(dtype='string', id=None), 'medical_specialty': Value(dtype='int8', id=None), 'sample_name': Value(dtype='string', id=None), 'transcription': Value(dtype='string', id=None), 'keywords': Value(dtype='string', id=None)}


## Step 2 - Compute sentence embeddings using pre-trained language model
Our goal is to use a pre-trained language model (from the huggingface transformers library) to encode sentences into sentence representations. To this end, each sentence is tokenized and passed to the language model. The resulting contextualized token representations (i.e. the outputs of the last hidden layer of the language model) are then globally pooled to obtain sentence-level representations. As we use a pre-trained model, no training is involved in this step.

### Specify pre-trained language model
TODO: Decide which pre-trained language model you want to use and specify its name here.
You can search for models at the huggingface model hub: https://huggingface.co/models . You can either use standard models, e.g. BERT, or use biomedical models.

Once you have decided on a model, copy the name of the model from the URL (removing only https://huggingface.co/) and insert it as model_name here.
For more reference on how to use the tokenizer have a look at https://huggingface.co/docs/transformers/preprocessing.

In [6]:
# TODO: Insert model name here
model_name = 'google-bert/bert-base-uncased'

In [7]:
# Load the tokenizer (nothing TODO)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(tokenizer)

BertTokenizerFast(name_or_path='google-bert/bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)


In [8]:
# Load the model (nothing TODO) and inspect it
from transformers import AutoModel
model = AutoModel.from_pretrained(model_name)
print(model)



BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

### Experiment with tokenizer and language model (nothing to implement here)

In [9]:
# Tokenize some example text and inspect the results to understand the outputs of the tokenizer
sample_text = dataset['train'][:2]['transcription']  # batch of 2 samples of text
print('example text: ', sample_text)
toknized = tokenizer(sample_text, truncation=True, max_length=512)
print(toknized.keys())
print('input ids: ', toknized['input_ids'])
print('attention mask: ', toknized['attention_mask'])
print('length: ', len(toknized['input_ids'][0]))



example text:  ['PREOPERATIVE DIAGNOSES:,1.  Status post multiple trauma/motor vehicle accident.,2.  Acute respiratory failure.,3.  Acute respiratory distress/ventilator asynchrony.,4.  Hypoxemia.,5.  Complete atelectasis of left lung.,POSTOPERATIVE DIAGNOSES:,1.  Status post multiple trauma/motor vehicle accident.,2.  Acute respiratory failure.,3.  Acute respiratory distress/ventilator asynchrony.,4.  Hypoxemia.,5.  Complete atelectasis of left lung.,6.  Clots partially obstructing the endotracheal tube and completely obstructing the entire left main stem and entire left bronchial system.,PROCEDURE PERFORMED: ,Emergent fiberoptic plus bronchoscopy with lavage.,LOCATION OF PROCEDURE:  ,ICU.  Room #164.,ANESTHESIA/SEDATION:,  Propofol drip, Brevital 75 mg, morphine 5 mg, and Versed 8 mg.,HISTORY,:  The patient is a 44-year-old male who was admitted to ABCD Hospital on 09/04/03 status post MVA with multiple trauma and subsequently diagnosed with multiple spine fractures as well as bilate

In [10]:
# Feed some example data to understand the outputs of the language model

# pad the batch and convert it to Pytorch
# return_tensors='pt' => return the tokenized values as PyTorch tensors instead of lists
x = tokenizer.pad(toknized, return_tensors='pt')
print(x.keys())
print('input ids: ', x['input_ids'])
print('attention mask: ', x['attention_mask'])
print('shape: ', x['input_ids'].shape)

# Encode the tokenized and padded input
results = model(**x)
print('output shape: ', results.last_hidden_state.shape)  # (batch_size x num_tokens x d_hidden)

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
input ids:  tensor([[  101,  3653, 25918,  ...,  2000,  2023,   102],
        [  101,  7709,  1024,  ...,     0,     0,     0]])
attention mask:  tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0]])
shape:  torch.Size([2, 512])
output shape:  torch.Size([2, 512, 768])


### Tokenize the dataset
We now tokenize the whole dataset using the batched map-method.

TODO: Implement the tokenize_batch function using the tokenizer. Note: do not pad the data yet but truncate it to a max length of 512.
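
Padding is deferred so that each batch can later be padded only up to its own longest sequence rather than always to 512 tokens. The dependency-free sketch below illustrates that idea on plain lists of token ids; `pad_batch_sketch` is a hypothetical helper, the token ids are made up for illustration, and `pad_id=0` is an assumption matching the padding index shown in the model printout above. In the notebook itself this job is done by `tokenizer.pad`.

```python
from typing import Dict, List


def pad_batch_sketch(batch_ids: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Hypothetical helper: pad a batch of token-id lists to the longest one in the batch."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        # Pad on the right with pad_id
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative ids only
padded = pad_batch_sketch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(padded["input_ids"])
print(padded["attention_mask"])
```

The attention mask produced here is what the pooling functions later rely on to ignore padded positions.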

In [11]:
def tokenize_batch(text_batch: List[str]):
  # TODO: implement tokenize batch using the tokenizer
  return tokenizer(text_batch, truncation=True, max_length=512, padding=False)

In [12]:
# Apply tokenize_batch to dataset (nothing TODO here)
tokenized_dataset = dataset.map(lambda examples: tokenize_batch(examples["transcription"]), batched=True)
tokenized_dataset = tokenized_dataset.remove_columns(['sample_name', 'transcription', 'description', 'keywords']).rename_column('medical_specialty', 'labels')
print('Columns after tokenization', list(tokenized_dataset['train'].features.keys()))

Columns after tokenization ['labels', 'input_ids', 'token_type_ids', 'attention_mask']


### Token Pooling
Define Pooling functions to compute sentence embeddings from token outputs.

TODO: implement three possible pooling functions
- CLS: use the output of the [CLS] token only and ignore the other tokens.
- max/avg: globally max/avg pool over all token outputs, but make sure to ignore padding tokens.
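
To see why padding tokens must be ignored, the toy check below compares masked and naive pooling on a tiny hand-made batch. It is a sketch under stated assumptions: the tensors are invented for illustration, and the masking shown (multiplying by the mask for the mean, filling with `-inf` for the max) is one common way to do it, not necessarily the intended solution.

```python
import torch

# Batch of 2 sequences, 3 tokens each, hidden size 2.
# The last token of the first sequence is padding and carries a large dummy value.
hidden = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]],
])
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

m = mask.bool().unsqueeze(-1)  # (N x M x 1), broadcasts over the hidden dimension

# Masked mean: sum over real tokens divided by the number of real tokens
masked_mean = (hidden * m).sum(dim=1) / m.sum(dim=1).clamp(min=1)
# Masked max: padded positions are set to -inf so they can never win the max
masked_max = hidden.masked_fill(~m, float("-inf")).max(dim=1).values
# Naive mean for comparison: the padding value leaks into the result
naive_mean = hidden.mean(dim=1)

print(masked_mean)
print(masked_max)
print(naive_mean)
```

Only the first row differs between the masked and the naive mean, since the second sequence contains no padding.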

In [13]:
import torch

def CLS_pool(hidden_state: torch.FloatTensor, attention_mask: torch.BoolTensor):
    """
    Returns only the hidden state of the [CLS] token
    :param hidden_state: (N x M x d_hidden) where N is the batch size and M is the number of tokens
    :param attention_mask: (N x M), True for "real" tokens, False for padding tokens.
    :return (N x d_hidden)
    """
    # TODO: implement function
    return hidden_state[:, 0, :]

def max_pool(hidden_state: torch.FloatTensor, attention_mask: torch.BoolTensor):
    """
    Globally pools the hidden states over all tokens using max pooling.
    :param hidden_state: (N x M x d_hidden) where N is the batch size and M is the number of tokens
    :param attention_mask: (N x M), True for "real" tokens, False for padding tokens.
    :return (N x d_hidden)
    """
    # TODO: implement function
    mask_expanded = attention_mask.bool().unsqueeze(-1).expand_as(hidden_state)
    hidden_state_masked = torch.where(mask_expanded, hidden_state, torch.tensor(-1e9).to(hidden_state.dtype).to(hidden_state.device))

    max_pooled, _ = torch.max(hidden_state_masked, dim=1)
    return max_pooled

def avg_pool(hidden_state: torch.FloatTensor, attention_mask: torch.BoolTensor):
    """
    Globally pools the hidden states over all tokens using average pooling.
    :param hidden_state: (N x M x d_hidden) where N is the batch size and M is the number of tokens
    :param attention_mask: (N x M), True for "real" tokens, False for padding tokens.
    :return (N x d_hidden)
    """
    # TODO: implement function
    mask_expanded = attention_mask.bool().unsqueeze(-1).expand_as(hidden_state).float()
    sum_masks = mask_expanded.sum(1)  # (N x d_hidden)
    
    sum_pooled = torch.sum(hidden_state * mask_expanded, dim=1)
    avg_pooled = sum_pooled / sum_masks.clamp(min=1e-9)

    return avg_pooled

### Sentence Embedder
The sentence embedder takes a tokenized input, uses the language model to compute token representations (i.e. the last hidden_state of the language model) and then uses a pooling function to compute a single sentence representation.

TODO: implement the sentence embedder forward method.

In [14]:
from torch import nn
from typing import Dict

class SentenceEmbedder(nn.Module):
  def __init__(self, model, pool):
    super().__init__()
    self.model = model  # this is a huggingface language model from AutoModel.from_pretrained
    self.pool = pool  # this is one of the three defined pooling functions (CLS, max, avg)

  def forward(self, x: Dict[str, torch.Tensor]) -> torch.Tensor:
    """
    :param x: dict containing the following elements:
      - input_ids: torch.Tensor of the token ids with shape (N x M)
      - attention_mask: torch.Tensor containing the attention mask of shape (N x M)
    :return sentence_embedding for each sample of shape (N x d_hidden)
    """
    # TODO: implement forward

    outputs = self.model(input_ids=x['input_ids'], attention_mask=x['attention_mask'])
    
    last_hidden_state = outputs.last_hidden_state
    
    sentence_embedding = self.pool(last_hidden_state, x['attention_mask'])

    return sentence_embedding

TODO: now decide which pooling function you want to use

In [15]:
# TODO: try different pooling functions (nothing more to implement here)
# (also change the dataset_name so you can later easily switch between pooling functions without the need to recompute the embeddings)
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'
    
# device = 'cpu'

dataset_name = 'encoded_transciptions_AVG'
pool = avg_pool

sentence_embedder = SentenceEmbedder(model, pool).to(device=device)

Now run the sentence embedder (nothing TODO here).

The resulting dataset will then contain a sentence_embedding column.

In [16]:
def embed_sentence(batch):
  model_input = {'input_ids': batch['input_ids'], 'attention_mask': batch['attention_mask']}
  model_input = tokenizer.pad(model_input, return_tensors='pt').to(device=device)

  with torch.no_grad():
    sentence_embeddings = sentence_embedder(model_input)
  return {'sentence_embedding': sentence_embeddings.detach().cpu().numpy()}

encoded_dataset = tokenized_dataset.map(embed_sentence, batched=True, batch_size=64)
encoded_dataset.save_to_disk(dataset_name)




## Step 5 - Train Classification Model on Sentence Embeddings
Now we have a single sentence representation vector for each sentence. We now learn a simple classifier based on an MLP (i.e. a 2-layer fully-connected neural network). We first project each sentence embedding from d_hidden to d_mlp, apply ReLU (or another non-linearity), and then project the resulting vector to num_classes logits, to which the cross-entropy loss is applied. The classification head is learned from the training data and is the only learned component in our model.
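
For a single sample, the loss and the prediction derived from the logits reduce to a few lines of arithmetic. The sketch below spells this out in plain Python; the function name and the example logits are assumptions for illustration. PyTorch's `nn.CrossEntropyLoss` expects raw logits and applies the log-softmax internally, which is why the head does not need a softmax layer of its own.

```python
import math
from typing import List, Tuple


def cross_entropy_and_prediction(logits: List[float], target: int) -> Tuple[float, int]:
    """Hypothetical helper: cross-entropy loss and argmax prediction for one sample."""
    # log(sum(exp(z))) computed stably by subtracting the maximum logit first
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # Cross entropy = -log softmax(logits)[target]
    loss = log_sum_exp - logits[target]
    # Predicted class = index of the largest logit
    prediction = max(range(len(logits)), key=lambda i: logits[i])
    return loss, prediction


loss, pred = cross_entropy_and_prediction([2.0, 0.5, -1.0], target=0)
print(loss, pred)
```

Over a batch, the per-sample losses are averaged by default.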

TODO: implement the MLP head

In [17]:
from torch import nn
import datasets
from datasets import load_from_disk
from torch.utils.data import DataLoader
import torch
from tqdm import tqdm

class ClassificationHead(nn.Module):
  def __init__(self, d_hidden: int, d_mlp: int, num_classes: int):
    super().__init__()
    # TODO: add the required layers here
    self.mlp_head = nn.Sequential(
            nn.Linear(d_hidden, d_mlp),
            nn.ReLU(),
            nn.Linear(d_mlp, num_classes)
    )

    # TODO: define the loss function to use here
    self.loss = nn.CrossEntropyLoss()

  def forward(self, x, y_true):
    """
    :param x: sentence embeddings (N x d_hidden)
    :param y_true: target classes (multiclass) (N)
    """
    # TODO: apply MLP head to sentence embeddings
    # (N x num_classes)
    logits = self.mlp_head(x)

    # TODO: apply the loss function
    loss = self.loss(logits, y_true) if y_true is not None else None
    # TODO: compute the predictions (i.e. target classes as longs) from the logits
    # (N)
    _, y_pred = torch.max(logits, dim=1)

    return y_pred, loss



Now train the classification head on the sentence embeddings.
The training is already implemented.

TODO: specify and try different hyperparameters

In [18]:
# TODO: try different hyperparameters
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'
dataset_name = 'encoded_transciptions_AVG'
num_epochs = 20
lr = 5e-4
weight_decay = 1e-4
d_hidden = 768  # must match the language model hidden output
d_mlp = 1024
batch_size = 128

In [19]:
# Now run the training...

print('num classes: ', len(class_names))
encoded_dataset = load_from_disk(dataset_name)
# make sure PyTorch Tensors are returned
encoded_dataset.set_format('pt')

# remove all columns except sentence_embedding and labels
columns_to_remove = set(encoded_dataset['train'].column_names) - {'sentence_embedding', 'labels'}
encoded_dataset = encoded_dataset.remove_columns(columns_to_remove)

train_data_loader = DataLoader(encoded_dataset['train'], batch_size=batch_size, shuffle=True)
test_data_loader = DataLoader(encoded_dataset['test'], batch_size=batch_size, shuffle=False)

classification_model = ClassificationHead(d_hidden=d_hidden,
                                          d_mlp=d_mlp,
                                          num_classes=len(class_names))
classification_model = classification_model.to(device=device)
optimizer = torch.optim.Adam(classification_model.parameters(), lr=lr, weight_decay=weight_decay)

classification_model.train()
print('Training...')
for epoch in range(num_epochs):
  train_loss = []
  for train_batch in tqdm(train_data_loader):
    x = train_batch['sentence_embedding'].to(device=device)
    y_true = train_batch['labels'].to(device=device)

    y_pred, loss = classification_model(x, y_true)
    train_loss.append(loss.item())

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
  print('train loss: ', torch.mean(torch.tensor(train_loss)).item())

classification_model.eval()
print('Testing...')
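# Note: datasets.load_metric is deprecated in recent releases of datasets;
# evaluate.load("f1") from the evaluate package is the suggested replacement.
f1_metric = datasets.load_metric("f1")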
with torch.no_grad():
  test_loss = []
  for test_batch in tqdm(test_data_loader):
    x = test_batch['sentence_embedding'].to(device=device)
    y_true = test_batch['labels'].to(device=device)

    y_pred, loss = classification_model(x, y_true)

    f1_metric.add_batch(predictions=y_pred, references=y_true)
    test_loss.append(loss.item())
print('\ntest loss: ', torch.mean(torch.tensor(test_loss)).item())
print('F1 (Macro): ', f1_metric.compute(average="macro"))

num classes:  12
Training...
train loss:  2.1185073852539062
train loss:  1.7768936157226562
train loss:  1.593662977218628
train loss:  1.5410226583480835
train loss:  1.4411345720291138
train loss:  1.4048923254013062
train loss:  1.3656760454177856
train loss:  1.3311318159103394
train loss:  1.2859617471694946
train loss:  1.2920438051223755
train loss:  1.2504011392593384
train loss:  1.2291159629821777
train loss:  1.19967520236969
train loss:  1.19120454788208
train loss:  1.1519535779953003
train loss:  1.1699175834655762
train loss:  1.1350510120391846
train loss:  1.145141363143921
train loss:  1.1350880861282349
train loss:  1.0997015237808228
Testing...

test loss:  1.28374183177948
F1 (Macro):  {'f1': 0.3769100747896881}





# Task 3: Generating text from a pre-trained decoder LM (graded!)
## Goals
- Understand how inference (text generation) works on decoder language models

## Tools
- [Huggingface Transformers](https://huggingface.co/transformers/): Library for pre-trained transformer-based language models

## Model
- Architecture: https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2LMHeadModel
- Model weights: https://huggingface.co/healx/gpt-2-pubmed-medium

## Notes and Tips
- You can read this blog post to learn more about generation strategies: https://huggingface.co/blog/how-to-generate
- We are going to implement greedy search.
- Please do not use the `generate` method of the pre-trained model, since generation is already implemented there.
- You can, however, take a look at their implementation of greedy search: https://github.com/huggingface/transformers/blob/04ab5605fbb4ef207b10bf2772d88c53fc242e83/src/transformers/generation/utils.py#L2080
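
As a minimal sketch of what greedy search does at a single step: given next-token logits of shape (N x |V|), it picks the highest-scoring token id for each sequence. The logits below are invented for illustration and do not come from the model.

```python
import torch

# Toy next-token logits for a batch of N=2 sequences over a vocabulary of |V|=4.
# The values are made up for illustration only.
toy_logits = torch.tensor([[0.1, 2.0, -1.0, 0.5],
                           [1.5, 0.2, 0.3, 3.0]])

# Greedy search: take the highest-scoring token id for each sequence.
greedy_tokens = toy_logits.argmax(dim=-1)  # shape (N,)
print(greedy_tokens.tolist())
```

Since softmax is monotonic, taking the argmax of the logits selects the same token as taking the argmax of the probabilities.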


## Step 1: Implement generate method
Implement the generate method that takes a language model and its tokenizer together with a list of text prefixes and outputs a list of generated sentences (one for each prefix). The prefixes should be "autocompleted" by the model.
Use the greedy-search method and process all prefixes as a single batch.

TODO: finish the implementation of the generate method by filling in the missing lines.
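
Because all prefixes are processed as one batch, shorter prefixes have to be padded, and with left padding the position ids are no longer a plain 0..M-1 range. One possible way to derive them from the attention mask is a cumulative sum. The mask below is a made-up example rather than real tokenizer output, and the `toy_` names are hypothetical.

```python
import torch

# Hypothetical attention mask for two left-padded prefixes (0 = padding, 1 = real token).
toy_attention_mask = torch.tensor([[0, 0, 1, 1, 1],
                                   [1, 1, 1, 1, 1]])

# The cumulative sum counts real tokens from the left; subtracting 1 makes positions start at zero.
toy_position_ids = toy_attention_mask.long().cumsum(dim=-1) - 1
# Padding positions would be -1 at this point; set them to an arbitrary valid index (0).
toy_position_ids = toy_position_ids.masked_fill(toy_attention_mask == 0, 0)
print(toy_position_ids.tolist())
```

In this sketch the last real token of every row carries the largest position id, so taking the row-wise maximum plus one gives the position of the next generated token.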

In [None]:
import torch
from typing import List

def generate(model, tokenizer, prefix: List[str], max_predicted: int) -> List[str]:
  """
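  :param model: PreTrainedModel.
  :param tokenizer: PreTrainedTokenizer (https://huggingface.co/docs/transformers/main/main_classes/tokenizer#transformers.PreTrainedTokenizer)
  :param prefix: Batch of prefixes to be autocompleted by the language model
  :param max_predicted: Maximum total sequence length (prefix plus generated tokens) at which generation stops
  :return: List of generated sentences, one for each prefix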
  """
  if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
  pad_token_id = tokenizer.pad_token_id
  bos_token_id = tokenizer.bos_token_id
  eos_token_id = tokenizer.eos_token_id
  device = model.device

  # 1. Tokenize the prefixes and prepare them for input into the language model
  #    Notes
  #    - add the start token (BOS) but not the end token (EOS)
  #    - left-pad them to the maximum length in the batch, i.e. the padding goes to the left of the "real" tokens
  #    - input_ids and attention_mask should be computed and should already be stacked along the batch dim
  #    -> this can all be done by calling the tokenizer! See: https://huggingface.co/docs/transformers/main/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__
  tokenizer.padding_side = 'left'
  # TODO: call the tokenizer to prepare the inputs (a dictionary)
  inputs = tokenizer(...  # add arguments here
                     return_tensors='pt')
  inputs = inputs.to(device=device)
  input_ids = inputs['input_ids']
  attention_mask = inputs['attention_mask']

  # TODO: Compute position ids based on the attention mask
  # Note: the position ids are the indices of the positions, starting with zero and increasing for each non-padding token
  # The position ids for padding tokens can take any value
  position_ids: torch.LongTensor = None

  # Initialize some variables (nothing TODO here)
  N = position_ids.shape[0]
  past_key_values = None
  unfinished_sequences = torch.ones(N, dtype=torch.bool, device=device)
  predicted_input_ids = input_ids.clone()

  while True:
    # 2. Predict the next token logits by passing the previous inputs into the language model
    #    Note: already computed steps can be given by past_key_values, other steps are given as input_ids
    outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        return_dict=True
    )
    # (N x |V|)
    next_token_logits: torch.FloatTensor = outputs.logits[:, -1, :]

    # 3. Select the next predicted tokens by greedy search
    # (N)
    next_tokens: torch.LongTensor = None  # TODO

    # 4. For all already finished sentences, replace the next_tokens by <PAD>
    next_tokens = None  # TODO

    # 5. Check which sentences are already finished by checking whether they contain <END>
    # Which sentences have been finished with the predicted token (next_tokens)
    # (N)
    newly_finished = None  # TODO
    # Which sentences, therefore, remain unfinished
    # (N)
    unfinished_sequences = None  # TODO

    # 6. Concatenate the next input to predicted_input_ids, attention_mask
    # (N x M_new) where M_new is one longer than the previous M of predicted_input_ids
    predicted_input_ids = None  # TODO
    # (N x M_new)
    attention_mask = None  # TODO: extend attention mask by ones for next step
    # Nothing TODO here
    position_ids = position_ids.amax(dim=1, keepdim=True) + 1
    input_ids = next_tokens.unsqueeze(dim=1)
    past_key_values = outputs.past_key_values

    # 7. Check if finished (nothing TODO here)
    if predicted_input_ids.shape[1] >= max_predicted or unfinished_sequences.max() == 0:
      break

  # 8. Convert back to text (nothing TODO here)
  generated_sentences: List[str] = tokenizer.batch_decode(predicted_input_ids, skip_special_tokens=True)
  return generated_sentences
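
Steps 4 to 6 of the loop above are mostly bookkeeping, which can be tried out on toy tensors without loading a model. The following is only a sketch under assumptions: `toy_step`, `toy_pad_id` and `toy_eos_id` are hypothetical names, and the token ids are invented.

```python
import torch

# Hypothetical special-token ids, chosen for illustration only.
toy_pad_id, toy_eos_id = 0, 9

def toy_step(next_tokens, unfinished, predicted_ids, attention_mask):
    """One bookkeeping step on toy tensors (sketch of steps 4 to 6)."""
    # Step 4: sequences that are already finished only receive padding.
    padding = torch.full_like(next_tokens, toy_pad_id)
    next_tokens = torch.where(unfinished, next_tokens, padding)
    # Step 5: a sequence finishes when it emits the end token.
    newly_finished = next_tokens == toy_eos_id
    unfinished = unfinished & ~newly_finished
    # Step 6: append the new token and extend the attention mask by a column of ones.
    predicted_ids = torch.cat([predicted_ids, next_tokens.unsqueeze(1)], dim=1)
    new_column = attention_mask.new_ones((attention_mask.shape[0], 1))
    attention_mask = torch.cat([attention_mask, new_column], dim=1)
    return next_tokens, unfinished, predicted_ids, attention_mask

# Two sequences; the second one has already produced the end token.
toy_predicted = torch.tensor([[5, 6], [7, 9]])
toy_mask = torch.ones_like(toy_predicted)
toy_unfinished = torch.tensor([True, False])
toy_tokens, toy_unfinished, toy_predicted, toy_mask = toy_step(
    torch.tensor([9, 4]), toy_unfinished, toy_predicted, toy_mask
)
print(toy_predicted.tolist(), toy_unfinished.tolist())
```

The same logic also holds when the padding and end tokens share one id, as happens in `generate` when the tokenizer has no dedicated padding token: a finished sequence then keeps "finishing" at every step, which leaves it marked as finished.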




## Step 2: Load a decoder language model (nothing TODO here)

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

decoder_model_name = "gpt2-medium"

decoder_tokenizer = AutoTokenizer.from_pretrained(decoder_model_name)
decoder_model = AutoModelForCausalLM.from_pretrained(decoder_model_name)
# Fall back to the CPU when no CUDA device is available
decoder_model = decoder_model.to(device="cuda:0" if torch.cuda.is_available() else "cpu")

## Step 3: Apply the generate method to some example text (nothing TODO here)

In [None]:
# You can play around with different texts here
prefix_1 = "This sentence is about"
prefix_2 = "Can you complete this sentence?"
prefixes = [prefix_1, prefix_2]

generated_sentences = generate(decoder_model, decoder_tokenizer, prefixes, max_predicted=50)

for i, (prefix, sent) in enumerate(zip(prefixes, generated_sentences)):
  print(f'Completed sentence {i}: "{prefix}"')
  print("-------------------------------------------------------\n")
  print(sent)
  print("\n=======================================================\n\n")

As you can see, the results tend to be quite repetitive.
This is why greedy search is typically not used in practice.
Common alternatives include beam search and sampling from the distribution of next tokens. While this is not part of the mandatory exercise, you can try to implement other generation methods as a bonus task and see how they improve the results.
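
To illustrate the sampling alternative, the sketch below draws the next token from the softmax distribution instead of taking the argmax. `sample_next_tokens` is a hypothetical helper that is not part of the exercise, and the `temperature` and `top_k` parameters are illustrative assumptions.

```python
import torch

def sample_next_tokens(next_token_logits, temperature=1.0, top_k=None):
    """Sample one token id per sequence from (N x |V|) logits (illustrative sketch)."""
    # Temperatures below 1 sharpen the distribution, temperatures above 1 flatten it.
    logits = next_token_logits / temperature
    if top_k is not None:
        # Keep only the top_k highest logits per row and mask out the rest.
        kth_value = torch.topk(logits, top_k, dim=-1).values[:, -1:]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(1)  # shape (N,)

# Made-up logits for N=2 sequences over a vocabulary of |V|=4.
toy_sampling_logits = torch.tensor([[0.1, 2.0, -1.0, 0.5],
                                    [1.5, 0.2, 0.3, 3.0]])
sampled_tokens = sample_next_tokens(toy_sampling_logits, temperature=0.7, top_k=2)
print(sampled_tokens.tolist())
```

With `top_k=1` the helper reduces to greedy search, because only the highest-scoring token keeps a non-zero probability.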

## Extra Task (not graded): Implement a beam search generate function and apply it
Re-implement the generate method, but use beam search instead of greedy search.

Note that this is a completely optional and non-graded task and will not be discussed in the solutions!

In [None]:
def beam_search_generate(model, tokenizer, prefix: List[str], max_predicted: int, num_beams: int = 5) -> List[str]:
  pass  # TODO
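
Before wiring beam search into the language model, the idea can be tried on a tiny hand-written distribution. The sketch below is not a solution for `beam_search_generate`: `toy_beam_search` and `toy_model` are hypothetical, the probabilities are invented, and there is no batching and no length normalization.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[int, ...]

def toy_beam_search(next_log_probs: Callable[[Sequence], Dict[int, float]],
                    start: Sequence, eos_id: int,
                    num_beams: int, max_len: int) -> Sequence:
    """Return the best sequence found by beam search; start must be non-empty."""
    def is_done(seq: Sequence) -> bool:
        return seq[-1] == eos_id or len(seq) >= max_len

    # Each beam is a pair of (cumulative log-probability, token sequence).
    beams: List[Tuple[float, Sequence]] = [(0.0, start)]
    while not all(is_done(seq) for _, seq in beams):
        candidates: List[Tuple[float, Sequence]] = []
        for score, seq in beams:
            if is_done(seq):
                candidates.append((score, seq))  # finished beams are carried over unchanged
                continue
            for token, log_p in next_log_probs(seq).items():
                candidates.append((score + log_p, seq + (token,)))
        # Keep the num_beams candidates with the highest cumulative log-probability.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams[0][1]

def toy_model(seq: Sequence) -> Dict[int, float]:
    """Hand-written next-token log-probabilities; token ids: 0 = <eos>, 1, 2, 3 = start."""
    if seq[-1] == 1:  # after token 1 the probability mass is spread out
        return {0: math.log(0.4), 1: math.log(0.3), 2: math.log(0.3)}
    if seq[-1] == 2:  # after token 2 the end token is almost certain
        return {0: math.log(0.9), 1: math.log(0.05), 2: math.log(0.05)}
    return {1: math.log(0.6), 2: math.log(0.4)}  # first step after the start token

greedy_result = toy_beam_search(toy_model, (3,), eos_id=0, num_beams=1, max_len=3)
beam_result = toy_beam_search(toy_model, (3,), eos_id=0, num_beams=2, max_len=3)
print(greedy_result, beam_result)
```

With a single beam the search is greedy and commits to token 1 first, ending with a total probability of 0.6 * 0.4 = 0.24. With two beams it also keeps the initially less likely token 2 and finds the sequence with probability 0.4 * 0.9 = 0.36.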