# Training GPT from Scratch on Discharge Summaries

---


In this lab, we will walk through a slightly modified version of this huggingface tutorial on training GPT from scratch: https://huggingface.co/learn/llm-course/en/chapter7/6

We'll explore how training on a small corpus of discharge summaries affects the embeddings and generations of our model.

We'll primarily use the higher level "transformers" API for this exercise. 

This notebook runs in colab, connect to a T4 runtime.

In [1]:
!pip install transformers torch accelerate



In [2]:
from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig, DataCollatorForLanguageModeling, Trainer, TrainingArguments
from datasets import Dataset
import pandas as pd
import torch.nn.functional as F
import torch
import math
from google.colab import files

First, we'll load the data that we used in Lab 2.

In [3]:
uploaded = files.upload()

Saving lab2-data.csv to lab2-data (1).csv


In [4]:
discharge_summaries = pd.read_csv('lab2-data.csv')
dataset = Dataset.from_pandas(discharge_summaries)
print(dataset)

Dataset({
    features: ['HADM_ID', 'TEXT'],
    num_rows: 3668
})


Here, we'll use a pre-trained tokenizer. We'll also restrict our context length to 512. The tokenizer class allows us to break our large discharge summaries into smaller chunks that fit into our context limit.

In [5]:
context_length = 512
tokenizer = AutoTokenizer.from_pretrained("gpt2")

outputs = tokenizer(
    dataset[:10]["TEXT"],
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,
    return_length=True,
)

print(f"Input IDs length: {len(outputs['input_ids'])}")
print(f"Input chunk lengths: {(outputs['length'])}")
print(f"Chunk mapping: {outputs['overflow_to_sample_mapping']}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Input IDs length: 73
Input chunk lengths: [512, 512, 512, 512, 512, 512, 512, 512, 278, 512, 512, 512, 512, 512, 512, 512, 512, 512, 207, 512, 512, 512, 512, 512, 512, 512, 512, 315, 512, 110, 512, 512, 512, 512, 295, 512, 512, 512, 512, 512, 512, 512, 512, 475, 512, 512, 512, 512, 512, 115, 512, 512, 512, 512, 512, 512, 110, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 196, 512, 512, 512, 512, 437]
Chunk mapping: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9]


Each chunk will be list of indices corresponding to tokens in our vocabulary.

In [6]:
print(outputs['input_ids'][0])

[2782, 3411, 7536, 25, 220, 685, 1174, 17, 11623, 12, 17, 12, 24, 1174, 60, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 3167, 10136, 7536, 25, 220, 220, 685, 1174, 17, 11623, 12, 17, 12, 1433, 1174, 60, 628, 198, 16177, 25, 26112, 2149, 8881, 198, 198, 3237, 6422, 444, 25, 198, 57, 420, 273, 1220, 406, 3798, 349, 198, 198, 8086, 1571, 33250, 1174, 37564, 4586, 6530, 1248, 3553, 1174, 60, 198, 23675, 20011, 2913, 25, 198, 45170, 2356, 198, 198, 24206, 311, 31839, 393, 10001, 17443, 34997, 25, 198, 30645, 8710, 516, 1627, 36075, 357, 3506, 5387, 45808, 934, 27208, 8, 198, 198, 18122, 286, 21662, 5821, 1108, 25, 198, 5246, 13, 685, 1174, 29870, 938, 3672, 1248, 3365, 1174, 60, 318, 281, 9508, 27406, 582, 351, 10768, 257, 419, 291, 45219, 5958, 357, 43435, 198, 49257, 9809, 287, 685, 1174, 17, 17464, 1174, 60, 351, 685, 1174, 14749, 357, 403, 8, 16003, 1174, 60, 352, 12067, 17, 11, 31312, 2579, 8085, 39, 70, 11, 10768, 198, 2781, 1373, 842, 3686, 3780, 11, 11607, 257, 

We can decode this back to text as well:

In [7]:
tokenizer.decode(outputs['input_ids'][0])

'Admission Date:  [**2125-2-9**]              Discharge Date:   [**2125-2-16**]\n\n\nService: MEDICINE\n\nAllergies:\nZocor / Lescol\n\nAttending:[**Doctor Last Name 1857**]\nChief Complaint:\nChest pain\n\nMajor Surgical or Invasive Procedure:\nCentral venous line insertion (right internal jugular vein)\n\nHistory of Present Illness:\nMr. [**Known lastname 1858**] is an 84 yo man with moderate aortic stenosis (outside\nhospital echo in [**2124**] with [**Location (un) 109**] 1 cm2, gradient 28 mmHg, moderate\nmitral regurgitation, mild aortic insufficiency), chronic left\nventricular systolic heart failure with EF 25-30%, hypertension,\nhyperlipidemia, diabetes mellitus, CAD s/p CABG in [**2099**] with\nSVG-LAD-Diagonal, SVG-OM, and SVG-RPDA-RPL, with a re-do CABG in\n[**9-/2117**] with LIMA-LAD, SVG-OM, SVG-diagonal, and SVG-RCA. He also\nhas severe peripheral arterial disease s/p peripheral bypass\nsurgery. He presented to [**Hospital 1474**] Hospital ER this morning with\nshortness

Let's now apply the tokenizer across the whole dataset

In [8]:
def tokenize(element):
    outputs = tokenizer(
        element["TEXT"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    return {"input_ids": outputs['input_ids']}


tokenized_dataset = dataset.map(
    tokenize, batched=True, remove_columns=dataset.column_names
)

Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Now we'll load in a randomly initialized model with the GPT2 architecture:

In [9]:
config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    n_layer=6,
    n_head=6
)
model = GPT2LMHeadModel(config)

Let's explore this architecture a bit further:

In [10]:
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-5): 6 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

WTE (Word Token Embeddings) maps each vocabulary token to a 768-dimensional semantic vector, while WPE (Word Position Embeddings) encodes each token’s position in the sequence so the model can capture word order.

We can can search for similar tokens using vector distances in the embedding space.

In [11]:
tokens = tokenizer.encode('hospital')
result_vector = model.transformer.wte.weight[tokens].mean(axis=0)

similarities = F.cosine_similarity(
    result_vector,
    model.transformer.wte.weight
)
top_indices = similarities.topk(10).indices
print([tokenizer.decode(idx) for idx in top_indices if idx not in tokens])

[' snipers', 'Forge', 'angel', ' joint', ']:', ' infused', ' exc', ' Ern', ' reconcile']


These tokens are not quite related to “hospital”.

Because we are using the raw word embedding table that is not trained to make similar words close to each other.

# Use this model to generate some text.

In [None]:
output=model.generate(max_length=100)
print(output)
print(tokenizer.decode(output[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


tensor([[50256, 22556, 22556, 22556, 22556, 49032, 49032, 49032, 41491, 20589,
         20589, 20589, 20589, 20589, 20589, 20589, 38640, 38640, 38640, 38640,
         27216, 31480, 31480, 31480, 31480, 31480, 31480, 31480, 31480, 31480,
         31480, 31480, 31480, 31480, 31480, 31480, 31480, 31480, 31480, 31480,
         31480, 31480, 31480, 31480, 31480, 31480, 31480, 31480, 31480, 32582,
         32582, 32582, 32582, 32582, 32582, 32582, 32582, 32582, 32582, 32582,
         32582, 32582, 32582, 32582, 32582, 32582, 32582, 32582, 32582, 32582,
         32582, 32582, 32582, 32582, 32582, 32582, 32582, 32582, 32582, 32582,
         32582, 32582, 32582, 32582, 10143, 16913, 16913, 16913, 16913, 16913,
         16913, 16913, 16913, 16913, 16913, 16913, 16913, 16913, 16913, 16913]])
<|endoftext|>yyyyyyyy announcer announcer announcer successors wicked wicked wicked wicked wicked wicked wicked binaries binaries binaries binaries cubic Eh Eh Eh Eh Eh Eh Eh Eh Eh Eh Eh Eh Eh Eh Eh Eh Eh Eh 

The output is repeating and meaningless.

# Forward Pass

Let's pass one of our input vectors in and explore the outputs.

First, we'll pass our tokenized input and retrieve the embeddings.

In [13]:
tokenized_input = torch.tensor(outputs['input_ids'][0:1])
token_embeddings = model.transformer.wte(tokenized_input)
print(token_embeddings)

tensor([[[-0.0089,  0.0089,  0.0566,  ...,  0.0097,  0.0179, -0.0189],
         [ 0.0215,  0.0145,  0.0035,  ...,  0.0365,  0.0021,  0.0020],
         [ 0.0201,  0.0087,  0.0410,  ..., -0.0274,  0.0130,  0.0029],
         ...,
         [-0.0414,  0.0022,  0.0016,  ...,  0.0114, -0.0440, -0.0103],
         [ 0.0062, -0.0015,  0.0366,  ..., -0.0113,  0.0155, -0.0396],
         [-0.0095, -0.0323,  0.0049,  ...,  0.0196, -0.0068,  0.0189]]],
       grad_fn=<EmbeddingBackward0>)


# Generate the position embeddings

In [15]:
with torch.no_grad():
    seq_len = tokenized_input.size(1)
    position_ids = torch.arange(seq_len, dtype=torch.long).unsqueeze(0)
    position_embeddings = model.transformer.wpe(position_ids)

print(position_embeddings)

tensor([[[ 0.0029, -0.0111,  0.0005,  ...,  0.0163, -0.0182,  0.0263],
         [-0.0074, -0.0013, -0.0127,  ...,  0.0270,  0.0201,  0.0368],
         [-0.0191, -0.0167, -0.0087,  ..., -0.0213, -0.0149,  0.0073],
         ...,
         [ 0.0194,  0.0234,  0.0026,  ...,  0.0251, -0.0051, -0.0079],
         [ 0.0032,  0.0075,  0.0110,  ..., -0.0017,  0.0375, -0.0280],
         [-0.0149, -0.0069,  0.0313,  ..., -0.0061, -0.0156, -0.0316]]])


# Combine the position and token embeddings

In [None]:
inputs_embeds = token_embeddings + position_embeddings
print(inputs_embeds)
outputs = model(inputs_embeds=inputs_embeds, output_hidden_states=True)
hidden_states_all = outputs.hidden_states



tensor([[[-0.0060, -0.0022,  0.0571,  ...,  0.0260, -0.0004,  0.0074],
         [ 0.0141,  0.0132, -0.0093,  ...,  0.0636,  0.0223,  0.0388],
         [ 0.0009, -0.0080,  0.0323,  ..., -0.0487, -0.0020,  0.0102],
         ...,
         [-0.0220,  0.0257,  0.0041,  ...,  0.0365, -0.0492, -0.0182],
         [ 0.0094,  0.0060,  0.0476,  ..., -0.0130,  0.0529, -0.0676],
         [-0.0244, -0.0391,  0.0362,  ...,  0.0135, -0.0224, -0.0127]]],
       grad_fn=<AddBackward0>)


In [17]:
print(hidden_states_all)

(tensor([[[-0.0035, -0.0148,  0.0639,  ...,  0.0469, -0.0207,  0.0375],
         [ 0.0074,  0.0131, -0.0244,  ...,  0.0000,  0.0471,  0.0841],
         [-0.0202, -0.0274,  0.0263,  ..., -0.0779, -0.0188,  0.0194],
         ...,
         [-0.0029,  0.0546,  0.0075,  ...,  0.0685, -0.0603, -0.0291],
         [ 0.0140,  0.0150,  0.0651,  ..., -0.0164,  0.1005, -0.1062],
         [-0.0000, -0.0511,  0.0750,  ...,  0.0083, -0.0421, -0.0491]]],
       grad_fn=<MulBackward0>), tensor([[[ 0.0027, -0.1910,  0.4467,  ...,  0.3890, -0.2522, -0.0575],
         [-0.0132, -0.0493,  0.2402,  ...,  0.3599,  0.3264,  0.2404],
         [ 0.1071, -0.1741,  0.2694,  ...,  0.1385, -0.0628, -0.1560],
         ...,
         [ 0.1630, -0.0845,  0.0721,  ...,  0.1247,  0.1242, -0.2141],
         [ 0.0289, -0.0574,  0.1486,  ..., -0.0137,  0.2151, -0.1069],
         [-0.1946, -0.2816,  0.1755,  ...,  0.1111, -0.1591, -0.0589]]],
       grad_fn=<AddBackward0>), tensor([[[ 0.2531, -0.3715,  0.2102,  ...,  0.3261,

In [18]:
hidden_states = hidden_states_all[5]

Now we'll layer normalize. Our transformer has 6 attention heads, let's dive into one of them:

In [19]:
normalized_hidden_states = model.transformer.h[0].ln_1(hidden_states)

print("Unnormalized:")
print(hidden_states)
print('-----------------')
print("Normalized:")
print(normalized_hidden_states)

Unnormalized:
tensor([[[ 0.0736, -0.1952,  0.0094,  ...,  0.6943, -0.3875, -0.0372],
         [ 0.3043, -0.1925,  0.2230,  ...,  0.5090,  0.3864,  0.1194],
         [ 0.1710, -0.0502,  0.0658,  ...,  0.1658, -0.1181, -0.4205],
         ...,
         [ 0.1438, -0.2793,  0.1675,  ...,  0.2942,  0.2153, -0.3679],
         [ 0.1421, -0.2088,  0.0471,  ...,  0.1610, -0.0361, -0.2121],
         [-0.0873, -0.2747,  0.4522,  ...,  0.4678, -0.4298, -0.2994]]],
       grad_fn=<AddBackward0>)
-----------------
Normalized:
tensor([[[ 0.1932, -0.5694,  0.0111,  ...,  1.9544, -1.1152, -0.1211],
         [ 0.9443, -0.6384,  0.6854,  ...,  1.5967,  1.2059,  0.3552],
         [ 0.5724, -0.1586,  0.2250,  ...,  0.5554, -0.3830, -1.3823],
         ...,
         [ 0.5942, -1.0714,  0.6878,  ...,  1.1865,  0.8758, -1.4201],
         [ 0.5864, -0.7640,  0.2207,  ...,  0.6589, -0.0993, -0.7769],
         [-0.2997, -1.0198,  1.7729,  ...,  1.8328, -1.6154, -1.1145]]],
       grad_fn=<NativeLayerNormBackward0>

Let's take a look at the self-attention matrices

In [20]:
model.transformer.h[0].attn.c_attn.weight.shape

torch.Size([768, 2304])

In class, we discussed the Wq, Wk, and Wv matrices. In practice, these are often stacked into a single matrix for efficient computation. So the matrix above represents [Wq Wk Wv].

We can derive our Q, K, V matrices by splitting the output:

In [21]:
Q, K, V = model.transformer.h[0].attn.c_attn(normalized_hidden_states).split(768, dim=2)
print('Q:', Q)
print('K:', K)
print('V:', V)

Q: tensor([[[-0.0176, -0.2581, -1.1487,  ..., -0.2125, -0.8392, -0.0996],
         [-0.0337, -0.1256, -0.9511,  ...,  0.1579, -0.1324,  0.7849],
         [-0.4454,  0.2913, -1.0810,  ..., -0.3892, -0.5600,  0.5538],
         ...,
         [-0.4624, -0.0847,  0.5706,  ...,  0.7240,  0.5958, -1.0402],
         [ 0.1089, -0.1007,  0.2355,  ...,  0.3979,  0.6039, -0.1321],
         [-0.7290,  0.3986, -0.3051,  ...,  0.0587, -0.2318,  0.5221]]],
       grad_fn=<SplitBackward0>)
K: tensor([[[ 0.8372, -0.5049,  0.4022,  ..., -0.1745, -0.8787,  0.3928],
         [ 0.3752, -1.1121,  0.4688,  ...,  0.6258, -0.2741,  0.3571],
         [ 0.7830, -0.3831,  0.6921,  ...,  0.4104, -0.3931,  0.7063],
         ...,
         [-0.2577,  0.0925,  0.5434,  ..., -0.6406, -0.0526,  0.0418],
         [ 0.0377,  0.0011,  0.0932,  ...,  0.1470,  0.0587,  0.5681],
         [ 0.5439,  0.7332,  0.4775,  ...,  0.0252, -0.2701,  0.8814]]],
       grad_fn=<SplitBackward0>)
V: tensor([[[ 0.4149, -0.1343,  0.4462,  ...

In [22]:
print(Q.shape)

torch.Size([1, 512, 768])


Remember, there's nothing special about these matrices/vectors. They are random. They only gain significance during training because the following constraint is applied during the forward pass:

In [23]:
att = (Q @ K.transpose(-2, -1)) * (1.0 / math.sqrt(K.size(-1)))
print("QK^T Dim:", att.shape)
A = F.softmax(att, dim=-1)
Z = A @ V
print("Z Dim:", Z.shape)

QK^T Dim: torch.Size([1, 512, 512])
Z Dim: torch.Size([1, 512, 768])


Finally, the output of our model will be a matrix of logits (dimensionality of our vocabulary) for each position in the sequence. During training, we will compute the cross entropy between that logit and the embedding vectors of the next tokens (i.e. shifted by 1).

In [24]:
model(tokenized_input).logits.shape

torch.Size([1, 512, 50257])

# Model Training

In [25]:
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

In [26]:
args = TrainingArguments(
    output_dir="lab3",
    per_device_train_batch_size=14, # increase/decrease this based on your memory
    eval_steps=50,
    logging_steps=50,
    gradient_accumulation_steps=1,
    num_train_epochs=2,
    weight_decay=0.1,
    warmup_steps=10,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=500,
    fp16=True,
    report_to="none"
    # use_cpu=True # Very slow! Feel free to use without a GPU if you'd like
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset
)

  trainer = Trainer(


The model will internally handle the next-token prediction loss. Go ahead and start the training. This will take a while!

In [27]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 50256}.
`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
50,7.0667
100,5.5128
150,4.9103
200,4.4115
250,4.0625
300,3.8123
350,3.6578
400,3.4573
450,3.3304
500,3.2497


TrainOutput(global_step=3926, training_loss=2.399813412526889, metrics={'train_runtime': 2214.1298, 'train_samples_per_second': 24.814, 'train_steps_per_second': 1.773, 'total_flos': 7178083035512832.0, 'train_loss': 2.399813412526889, 'epoch': 2.0})

# Try again

In [None]:
tokens = tokenizer.encode('hospital')
result_vector = model.transformer.wte.weight[tokens].mean(axis=0)

similarities = F.cosine_similarity(
    result_vector,
    model.transformer.wte.weight
)
top_indices = similarities.topk(10).indices
print([tokenizer.decode(idx) for idx in top_indices if idx not in tokens])

[' hospital', 'stay', 'Room', 'medical', 'home', 'event', 'different', 'community', 'past']


It's more related to hospital.

# Use the trained model to generate  a few samples of text.

In [29]:
# YOUR CODE HERE
# model.generate(max_length=100, do_sample=True, temperature=0.1)
output=model.generate(max_length=100, do_sample=True, temperature=0.1)
print(output)
print(tokenizer.decode(output[0]))

tensor([[50256,    13,   198,   198,  7155,   929, 27759,    25,   198,  5492,
          1061,   510,   351,   534,  4217,    47,   685,  1174,  5956,  6530,
           357,  5376, 47546,    19,     8, 12429,  4083,   685,  1174,  5956,
          6530,   357,  2257,  2578,     8, 12429,    60,   319,   685,  1174,
            17, 21652,    12,    23,    12,    23,  1174,    60,   379,  1367,
            25,  1270,  3001,    13,   198,   198,  5492,  1061,   510,   351,
          1583,    13,   685,  1174,  5956,  6530,   357,  2257,  2578,     8,
         12429,    60,   379,   685,  1174, 31709,  4862,    14, 46512,   357,
            16,     8,   718,  1065,  1174,    60,   198,   198,  5492,  1061,
           510,   351,   534,  4165,  1337,  6253,    11,   685,  1174,  5956]],
       device='cuda:0')
<|endoftext|>.

Followup Instructions:
Please follow up with your PCP [**Last Name (NamePattern4) **]. [**Last Name (STitle) **] on [**2185-8-8**] at 11:30 AM.

Please follow up with D

It's more like a discharge summary.