# Training GPT from Scratch on Discharge Summaries

---


In this lab, we will walk through a slightly modified version of the Hugging Face tutorial on training GPT from scratch: https://huggingface.co/learn/llm-course/en/chapter7/6

We'll explore how training on a small corpus of discharge summaries affects the embeddings and generations of our model.

We'll primarily use the higher-level "transformers" API for this exercise.

This notebook runs in Colab. Connect to a T4 GPU runtime before you start.

In [None]:
!pip install transformers datasets torch accelerate

In [None]:
from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig, DataCollatorForLanguageModeling, Trainer, TrainingArguments
from datasets import Dataset
import pandas as pd
import torch.nn.functional as F
import torch
import math
from google.colab import files

First, we'll load the data that we used in Lab 2.

In [None]:
uploaded = files.upload()

In [None]:
discharge_summaries = pd.read_csv('lab2-data.csv')
dataset = Dataset.from_pandas(discharge_summaries)
print(dataset)

Here, we'll use the pre-trained GPT-2 tokenizer and restrict our context length to 512 tokens. With truncation and overflow enabled, the tokenizer breaks each long discharge summary into chunks that fit within the context limit, and overflow_to_sample_mapping records which summary each chunk came from.

In [None]:
context_length = 512
tokenizer = AutoTokenizer.from_pretrained("gpt2")

outputs = tokenizer(
    dataset[:10]["TEXT"],
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,
    return_length=True,
)

print(f"Input IDs length: {len(outputs['input_ids'])}")
print(f"Input chunk lengths: {(outputs['length'])}")
print(f"Chunk mapping: {outputs['overflow_to_sample_mapping']}")
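To make the chunking concrete, here is a minimal pure-Python sketch of what overflow chunking does to a list of token IDs. The helper name chunk_token_ids and the toy data are hypothetical, and the sketch assumes no overlap between chunks and no special tokens added, which may not match every tokenizer configuration.

```python
def chunk_token_ids(batch_ids, max_length):
    """Split each token-ID list into consecutive chunks of at most max_length."""
    chunks, mapping = [], []
    for sample_idx, ids in enumerate(batch_ids):
        for start in range(0, len(ids), max_length):
            chunks.append(ids[start:start + max_length])
            mapping.append(sample_idx)  # which sample this chunk came from
    return chunks, mapping


# Two toy "documents" of 7 and 3 token IDs, chunked with max_length=3
toy_batch = [list(range(7)), list(range(100, 103))]
toy_chunks, toy_mapping = chunk_token_ids(toy_batch, max_length=3)
print(toy_chunks)
print(toy_mapping)
```

The last chunk of a document is usually shorter than the context length, which is why the chunk lengths printed by the real tokenizer are not all 512.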

Each chunk is a list of indices corresponding to tokens in our vocabulary.

In [None]:
print(outputs['input_ids'][0])

We can decode this back to text as well:

In [None]:
tokenizer.decode(outputs['input_ids'][0])

Let's now apply the tokenizer across the whole dataset:

In [None]:
def tokenize(element):
    outputs = tokenizer(
        element["TEXT"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    return {"input_ids": outputs['input_ids']}


tokenized_dataset = dataset.map(
    tokenize, batched=True, remove_columns=dataset.column_names
)

Now we'll load in a randomly initialized model with the GPT2 architecture:

In [None]:
config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    n_layer=6,
    n_head=6
)
model = GPT2LMHeadModel(config)

Let's explore this architecture a bit further:

In [None]:
model

WTE (Word Token Embeddings) maps each vocabulary token to a 768-dimensional semantic vector, while WPE (Word Position Embeddings) encodes each token’s position in the sequence so the model can capture word order.

We can search for similar tokens using vector distances in the embedding space.

In [None]:
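tokens = tokenizer.encode('hospital')
# Average the embeddings of the token(s) that make up the word
result_vector = model.transformer.wte.weight[tokens].mean(dim=0)

# Cosine similarity between the query vector and every row of the embedding table
similarities = F.cosine_similarity(
    result_vector.unsqueeze(0),
    model.transformer.wte.weight,
    dim=-1
)
top_indices = similarities.topk(10).indices.tolist()
print([tokenizer.decode(idx) for idx in top_indices if idx not in tokens])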

These tokens are not really related to “hospital”. That is expected: the embedding table is still randomly initialized, so nothing has yet trained it to place similar tokens close to each other.
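As a sanity check on the nearest-neighbour search itself, here is a small sketch on a hand-built embedding table where the expected ranking is obvious. The table toy_table and its values are made up for illustration; only the cosine-similarity logic mirrors the cell above.

```python
import torch
import torch.nn.functional as F

# Hypothetical 4-token embedding table with 2 dimensions
toy_table = torch.tensor([
    [1.0, 0.0],   # token 0: the query
    [0.9, 0.1],   # token 1: points almost the same way as token 0
    [0.0, 1.0],   # token 2: orthogonal to token 0
    [-1.0, 0.0],  # token 3: opposite to token 0
])
query = toy_table[0]

# Compare the query against every row of the table
toy_sims = F.cosine_similarity(query.unsqueeze(0), toy_table, dim=-1)
ranking = toy_sims.argsort(descending=True).tolist()
neighbours = [idx for idx in ranking if idx != 0]  # drop the query itself
print(toy_sims)
print(neighbours)
```

With random high-dimensional vectors, almost every pair looks like tokens 0 and 2 here: nearly orthogonal, so the top-10 list is essentially arbitrary.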

# Use this model to generate some text.

In [None]:
# No prompt is given, so generation starts from the BOS token
output = model.generate(max_length=100)
print(output)
print(tokenizer.decode(output[0]))

The output is repetitive and meaningless, as we would expect from an untrained model.

# Forward Pass

Let's pass one of our tokenized chunks through the model and explore the outputs.

First, we'll pass our tokenized input and retrieve the embeddings.

In [None]:
tokenized_input = torch.tensor(outputs['input_ids'][0:1])
token_embeddings = model.transformer.wte(tokenized_input)
print(token_embeddings)

# Generate the position embeddings

In [None]:
with torch.no_grad():
    seq_len = tokenized_input.size(1)
    position_ids = torch.arange(seq_len, dtype=torch.long).unsqueeze(0)
    position_embeddings = model.transformer.wpe(position_ids)

print(position_embeddings)

# Combine the position and token embeddings

In [None]:
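inputs_embeds = token_embeddings + position_embeddings
print(inputs_embeds)
# GPT-2 adds the position embeddings to inputs_embeds internally, so pass only
# the token embeddings here; passing the sum would add the positions twice.
model_outputs = model(inputs_embeds=token_embeddings, output_hidden_states=True)
hidden_states_all = model_outputs.hidden_states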



In [None]:
print(hidden_states_all)

In [None]:
# hidden_states_all[0] is the embedding output, i.e. the input to the first block
hidden_states = hidden_states_all[0]

Now we'll apply layer normalization. Our transformer has 6 blocks (layers), each with 6 attention heads; let's dive into the first block:

In [None]:
normalized_hidden_states = model.transformer.h[0].ln_1(hidden_states)

print("Unnormalized:")
print(hidden_states)
print('-----------------')
print("Normalized:")
print(normalized_hidden_states)
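For intuition, layer normalization can be reproduced by hand on a toy vector. This is a minimal sketch that leaves out the learned scale and bias of ln_1 (which start at 1 and 0) and assumes an epsilon of 1e-5; the tensor x is made up for illustration.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1.0, 2.0, 3.0, 4.0]])  # one toy hidden vector
eps = 1e-5  # assumed epsilon

# Normalize over the feature dimension: zero mean, unit variance
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual_norm = (x - mean) / torch.sqrt(var + eps)

# Same computation with the built-in, without learned scale and bias
builtin_norm = F.layer_norm(x, normalized_shape=(4,), eps=eps)
print(manual_norm)
print(torch.allclose(manual_norm, builtin_norm, atol=1e-6))
```

Each position in the sequence is normalized independently, so the statistics are computed over the 768 features of one token, not over the sequence.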

Let's take a look at the self-attention matrices:

In [None]:
model.transformer.h[0].attn.c_attn.weight.shape

In class, we discussed the Wq, Wk, and Wv matrices. In practice, these are often stacked into a single matrix for efficient computation. So the matrix above represents [Wq Wk Wv].

We can derive our Q, K, V matrices by splitting the output:

In [None]:
Q, K, V = model.transformer.h[0].attn.c_attn(normalized_hidden_states).split(768, dim=2)
print('Q:', Q)
print('K:', K)
print('V:', V)

In [None]:
print(Q.shape)
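The stacking trick can be checked on a toy example: projecting with one concatenated matrix and splitting the result gives the same Q, K, V as three separate projections. All names below (W_q, W_k, W_v, W_qkv, x) are hypothetical and the dimensions are tiny; the bias term is left out.

```python
import torch

torch.manual_seed(0)
d_model = 4

# Three separate projection matrices, stacked side by side
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
W_qkv = torch.cat([W_q, W_k, W_v], dim=1)  # shape (d_model, 3 * d_model)

x = torch.randn(1, 3, d_model)  # (batch, sequence, features)

# One matrix multiply, then split the last dimension into three equal parts
Q_toy, K_toy, V_toy = (x @ W_qkv).split(d_model, dim=2)
print(W_qkv.shape, Q_toy.shape)
print(torch.allclose(Q_toy, x @ W_q, atol=1e-5))
```

In our model the same layout gives a weight of shape (768, 2304), which is what the shape printed above shows.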

Remember, there's nothing special about these matrices and vectors; at this point they are random. They only gain significance during training, because the following computation is applied to them during the forward pass:

In [None]:
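# Simplified single-head attention over the full 768 dimensions.
# The real GPT-2 block also splits Q, K, V into heads and applies a causal mask.
att = (Q @ K.transpose(-2, -1)) * (1.0 / math.sqrt(K.size(-1)))
print("QK^T Dim:", att.shape)
A = F.softmax(att, dim=-1)
Z = A @ V
print("Z Dim:", Z.shape)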
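The cell above treats the whole hidden vector as a single head and lets every position attend to every other. As a rough sketch of what a GPT-style block does in addition, the version below splits Q, K, V into heads and applies a causal mask. The function name causal_multihead_attention and the toy tensors are assumptions for illustration, and the output projection that follows attention in the real block is left out.

```python
import math
import torch
import torch.nn.functional as F


def causal_multihead_attention(Q, K, V, n_head):
    """Masked multi-head attention; the output projection is left out."""
    batch, seq_len, d_model = Q.shape
    head_dim = d_model // n_head

    def split_heads(x):
        # (batch, seq, d_model) -> (batch, n_head, seq, head_dim)
        return x.reshape(batch, seq_len, n_head, head_dim).transpose(1, 2)

    q, k, v = split_heads(Q), split_heads(K), split_heads(V)
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(head_dim)

    # Causal mask: position i may only attend to positions <= i
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))

    weights = F.softmax(scores, dim=-1)
    # Merge the heads back: (batch, n_head, seq, head_dim) -> (batch, seq, d_model)
    merged = (weights @ v).transpose(1, 2).reshape(batch, seq_len, d_model)
    return merged, weights


torch.manual_seed(0)
Q_toy, K_toy, V_toy = (torch.randn(1, 5, 12) for _ in range(3))
Z_toy, A_toy = causal_multihead_attention(Q_toy, K_toy, V_toy, n_head=3)
print(Z_toy.shape, A_toy.shape)
```

With the notebook's tensors, the equivalent call would pass n_head=6, giving a head dimension of 128. Because of the mask, the first position can only attend to itself, so its output is simply its own value vector.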

Finally, the output of our model is a matrix of logits: one vector with the dimensionality of our vocabulary for each position in the sequence. During training, we compute the cross entropy between the logits at each position and the ID of the next token (i.e. the labels are the input IDs shifted by 1).

In [None]:
model(tokenized_input).logits.shape
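The shifted next-token loss can be sketched on toy tensors. The logits here are random and the names toy_ids and toy_logits are hypothetical; the point is only the shift: the logits at position t are scored against the token ID at position t + 1.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 10
toy_ids = torch.tensor([[1, 2, 3, 4, 5]])       # (batch, sequence)
toy_logits = torch.randn(1, 5, vocab_size)      # (batch, sequence, vocab)

# Drop the last position's logits and the first token's label
shift_logits = toy_logits[:, :-1, :]
shift_labels = toy_ids[:, 1:]
loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
)

# The same value by hand: mean negative log-probability of each next token
log_probs = F.log_softmax(shift_logits, dim=-1)
manual_loss = -log_probs.gather(-1, shift_labels.unsqueeze(-1)).mean()
print(loss.item(), manual_loss.item())
```

Note that the targets are integer token IDs, so a sequence of length 5 yields 4 prediction targets.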

# Model Training

In [None]:
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
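Roughly speaking, a causal language modeling collator pads the examples in a batch to a common length and copies the input IDs into the labels, with padded positions set to -100 so that the loss ignores them. The sketch below is a simplified stand-in, not the library's implementation: collate_causal_lm is a hypothetical helper, and it masks padding via the attention mask, whereas the real collator may instead mask every position whose ID equals the pad token.

```python
import torch


def collate_causal_lm(examples, pad_id):
    """Pad token-ID lists to equal length and build labels for causal LM."""
    max_len = max(len(ids) for ids in examples)
    input_ids = torch.full((len(examples), max_len), pad_id, dtype=torch.long)
    attention_mask = torch.zeros((len(examples), max_len), dtype=torch.long)
    for row, ids in enumerate(examples):
        input_ids[row, :len(ids)] = torch.tensor(ids, dtype=torch.long)
        attention_mask[row, :len(ids)] = 1

    # Labels are a copy of the inputs; -100 marks positions the loss should skip
    labels = input_ids.clone()
    labels[attention_mask == 0] = -100
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


toy_batch = collate_causal_lm([[5, 6, 7], [8]], pad_id=0)
print(toy_batch["input_ids"])
print(toy_batch["labels"])
```

Since our pad token is the EOS token, it is worth keeping in mind that masking by token ID can also hide genuine end-of-text tokens from the loss.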

In [None]:
args = TrainingArguments(
    output_dir="lab3",
    per_device_train_batch_size=14, # increase/decrease this based on your memory
    eval_steps=50,
    logging_steps=50,
    gradient_accumulation_steps=1,
    num_train_epochs=2,
    weight_decay=0.1,
    warmup_steps=10,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=500,
    fp16=True,
    report_to="none",
    # use_cpu=True  # Very slow! To train without a GPU, uncomment this and set fp16=False
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,  # newer transformers releases call this argument processing_class
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset
)
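To see what the warmup and cosine settings above do to the learning rate, here is a small sketch of a linear-warmup, cosine-decay schedule in plain Python. The function cosine_lr is a hypothetical helper written for illustration, and total_steps=1000 is a made-up number; the real number of steps depends on the dataset size, batch size, and number of epochs.

```python
import math


def cosine_lr(step, total_steps, warmup_steps=10, peak_lr=5e-4):
    """Linear warmup to peak_lr, then half-cosine decay towards zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


total_steps = 1000  # hypothetical; not taken from the lab data
for step in (0, 5, 10, 505, 1000):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

The learning rate climbs for the first 10 steps, peaks at 5e-4, and then decays smoothly, which tends to make the early updates of a randomly initialized model less erratic.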

The model handles the next-token prediction loss internally: when it receives labels, it shifts them by one position and computes the cross entropy. Go ahead and start the training. This will take a while!

In [None]:
trainer.train()

# Try again

In [None]:
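tokens = tokenizer.encode('hospital')
# Average the embeddings of the token(s) that make up the word
result_vector = model.transformer.wte.weight[tokens].mean(dim=0)

# Cosine similarity between the query vector and every row of the embedding table
similarities = F.cosine_similarity(
    result_vector.unsqueeze(0),
    model.transformer.wte.weight,
    dim=-1
)
top_indices = similarities.topk(10).indices.tolist()
print([tokenizer.decode(idx) for idx in top_indices if idx not in tokens])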

After training, the nearest tokens are more closely related to “hospital”.

# Use the trained model to generate a few samples of text.

In [None]:
# YOUR CODE HERE
output = model.generate(max_length=100, do_sample=True, temperature=0.1)
print(output)
print(tokenizer.decode(output[0]))

The generated text now reads more like a discharge summary.
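The temperature argument rescales the logits before sampling. A minimal sketch on made-up logits shows why a temperature of 0.1 behaves almost like greedy decoding; the helper temperature_probs and the logit values are hypothetical.

```python
import torch
import torch.nn.functional as F


def temperature_probs(logits, temperature):
    """Divide the logits by the temperature, then apply softmax."""
    return F.softmax(logits / temperature, dim=-1)


toy_logits = torch.tensor([2.0, 1.0, 0.0])
cold = temperature_probs(toy_logits, 0.1)  # sharply peaked on the top token
warm = temperature_probs(toy_logits, 1.0)  # the unscaled distribution
print(cold)
print(warm)
```

At low temperature nearly all probability mass sits on the most likely token, so repeated samples will look very similar. Raising the temperature towards 1.0 should give more varied samples, at the cost of more errors.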