## Testing the GPT-2 Model
https://huggingface.co/gpt2

In [1]:
from transformers import pipeline, set_seed




In [2]:
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, I'm writing a new language for you. But first, I'd like to tell you about the language itself"},
 {'generated_text': "Hello, I'm a language model, and I'm trying to be as expressive as possible. In order to be expressive, it is necessary to know"},
 {'generated_text': "Hello, I'm a language model, so I don't get much of a license anymore, but I'm probably more familiar with other languages on that"},
 {'generated_text': "Hello, I'm a language model, a functional model... It's not me, it's me!\n\nI won't bore you with how"},
 {'generated_text': "Hello, I'm a language model, not an object model.\n\nIn a nutshell, I need to give language model a set of properties that"}]
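The pipeline returns a list of dictionaries, one per sequence, and each `generated_text` value begins with the prompt, as the output above shows. A minimal sketch of pulling out only the continuations follows. `strip_prompt` is a hypothetical helper written for illustration, not part of `transformers`, and the two sample results are copied from the output above so the snippet runs without loading the model.

```python
# Hypothetical helper (not a transformers API): remove the prompt from each result.
def strip_prompt(results, prompt):
    continuations = []
    for item in results:
        text = item["generated_text"]
        # Each generated text in the output above starts with the prompt itself.
        if text.startswith(prompt):
            text = text[len(prompt):]
        continuations.append(text.strip())
    return continuations


prompt = "Hello, I'm a language model,"

# Two results copied verbatim from the generator output above.
results = [
    {"generated_text": "Hello, I'm a language model, I'm writing a new language for you. But first, I'd like to tell you about the language itself"},
    {"generated_text": "Hello, I'm a language model, not an object model.\n\nIn a nutshell, I need to give language model a set of properties that"},
]

for continuation in strip_prompt(results, prompt):
    print(repr(continuation))
```

With the real pipeline, the same helper could be applied directly to the list returned by `generator(...)`.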

### Fine-tuning the Model

In [4]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained GPT-2 model and tokenizer
model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Load your dataset
with open('./data/calregs-part1.txt', 'r', encoding='utf-8') as f:
    data = f.read()

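# Tokenize the whole dataset in one call. The tokenizer warns when the result
# is longer than the model's maximum length, but the full sequence is never
# passed to the model: it is split into fixed-size windows below.
encoded_data = tokenizer.encode(data)

# Create input-output pairs for training: non-overlapping windows of seq_len
# tokens, with labels offset by one token (next-token targets).
# Note: GPT2LMHeadModel shifts `labels` internally when it computes its loss,
# so these pre-shifted labels suit a manual loss on the logits. When passing
# `labels=` to the model, use the input windows themselves as labels.
seq_len = model.config.n_positions
input_seqs = []
label_seqs = []
for i in range(0, len(encoded_data) - seq_len, seq_len):
    input_seqs.append(encoded_data[i:i+seq_len])
    label_seqs.append(encoded_data[i+1:i+seq_len+1])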

# Convert input-output pairs to PyTorch tensors
input_seqs = torch.tensor(input_seqs)
label_seqs = torch.tensor(label_seqs)

# Create a PyTorch DataLoader for batching the input-output pairs
batch_size = 4
data_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(input_seqs, label_seqs),
    batch_size=batch_size,
    shuffle=True
)

Token indices sequence length is longer than the specified maximum sequence length for this model (69486 > 1024). Running this sequence through the model will result in indexing errors
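The warning above is emitted because the whole file (69486 tokens) is encoded in a single call, which exceeds the model's 1024-token maximum. It appears to be harmless in this cell, since the token list is cut into windows of `seq_len` before anything is passed to the model. The sketch below replays the same windowing logic on a toy list of integers to show how the pairs are formed and how many the real data would yield. `make_windows` is a hypothetical helper and the toy `seq_len` of 4 is chosen only for illustration.

```python
def make_windows(token_ids, seq_len):
    """Split token_ids into non-overlapping (input, label) windows.

    Mirrors the loop in the cell above: each label window is the input
    window shifted left by one token.
    """
    inputs, labels = [], []
    for i in range(0, len(token_ids) - seq_len, seq_len):
        inputs.append(token_ids[i:i + seq_len])
        labels.append(token_ids[i + 1:i + seq_len + 1])
    return inputs, labels


toy_ids = list(range(10))  # stand-in for encoded_data
inputs, labels = make_windows(toy_ids, seq_len=4)
print(inputs)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(labels)  # [[1, 2, 3, 4], [5, 6, 7, 8]]

# Same arithmetic for the real data: 69486 tokens, windows of 1024.
n_pairs = len(range(0, 69486 - 1024, 1024))
print(n_pairs)  # 67
```

Trailing tokens that do not fill a complete window are dropped, so with a batch size of 4 the 67 pairs would be served as 17 batches, the last one holding 3 pairs.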


### AWS SageMaker Training Job