# LSTM to Transformers

Before we start, let's summarize how text is prepared for an LSTM.

Let's clarify the processes of tokenization, embedding, and how they fit together in the context of preparing data for an LSTM model.

**Step 1: Tokenization**
  
Tokenization is the process of converting text into a sequence of tokens, which can be words, characters, or subwords. For word-level tokenization:

1. Process: The text is split into words.
2. Outcome: Each unique word is assigned a unique integer ID. This mapping from words to integers is typically based on the frequency of each word, with the most frequent word getting the ID of 1, the next most frequent word getting the ID of 2, and so on.
3. Example: Given a text "cat sat on the cat", a possible tokenization might be `{"cat": 1, "sat": 2, "on": 3, "the": 4}`, resulting in the sequence `[1, 2, 3, 4, 1]`.
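
As a minimal sketch of the frequency-based scheme described above, the snippet below builds the vocabulary for the example sentence. The helper names `build_vocab` and `encode` are hypothetical and chosen only for illustration; real tokenizers also deal with punctuation, casing, and unknown words.

```python
from collections import Counter

def build_vocab(text):
    """Map each word to an integer ID, most frequent word first (IDs start at 1)."""
    counts = Counter(text.split())
    # most_common() sorts by count; words with equal counts keep first-seen order
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(text, vocab):
    """Replace each word with its integer ID."""
    return [vocab[word] for word in text.split()]

vocab = build_vocab("cat sat on the cat")
ids = encode("cat sat on the cat", vocab)
print(vocab)  # {'cat': 1, 'sat': 2, 'on': 3, 'the': 4}
print(ids)    # [1, 2, 3, 4, 1]
```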
  
**Step 2: Word Embeddings (Vectorization)**

Word embeddings are a type of word representation that allows words to be represented as vectors in a continuous vector space. These vectors are learned in a way that captures semantic relationships between words. For instance, similar words will have vectors that are close to each other in this space.

1. Process: The embedding layer takes the integer-encoded vocabulary and looks up the embedding vector associated with each word-index. These vectors can be randomly initialized and then learned during training, or they can be initialized with pre-trained word embeddings (like Word2Vec or GloVe).
2. Outcome: Each word is represented by a dense vector of fixed size (the embedding dimension). Unlike the sparse one-hot vectors, embeddings are low-dimensional and dense, packed with floating-point values instead of zeros and ones.

Example: If the embedding dimension is 4, and "cat" is assigned an embedding, it might look something like `[0.25, -0.63, 0.12, 0.09]`.

**Fitting into LSTM**
  
When feeding data into an LSTM model for a task like next-word prediction, the process goes as follows:

1. Tokenized Sequences: You start with sequences of integers obtained from tokenization. Each integer represents a unique word in the sequence.
2. Embedding Layer: These sequences are then passed to an embedding layer within your model architecture. The embedding layer translates each integer (word ID) into a dense vector by looking up the corresponding embedding in the embedding matrix. This step is done on-the-fly during model training or inference.
3. LSTM Input: The LSTM layer(s) receive sequences of these vectors. If your input sequence is n words long and your embedding dimension is `d`, the LSTM receives an input tensor of shape `[batch_size, n, d]`, where `batch_size` is the number of sequences being processed in a batch.
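
The lookup performed by the embedding layer and the resulting `[batch_size, n, d]` shape can be sketched with plain NumPy. This is an illustration only: the matrix here is random, whereas a real embedding layer learns its values during training, and the sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 5  # token IDs 0..4 (illustrative)
d = 4           # embedding dimension
embedding_matrix = rng.normal(size=(vocab_size, d))  # random stand-in for learned weights

# A batch of 2 tokenized sequences, each n = 5 tokens long
token_ids = np.array([[1, 2, 3, 4, 1],
                      [4, 1, 2, 3, 0]])

# The embedding lookup is a row selection: one d-dimensional vector per token ID
embedded = embedding_matrix[token_ids]
print(embedded.shape)  # (2, 5, 4), i.e. [batch_size, n, d]
```

Note that the first and last tokens of the first sequence share the same ID, so they receive the same vector; it is the LSTM that later distinguishes them by their place in the sequence.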

Tokenization and embedding are complementary steps in preparing data for an NLP model. Tokenization converts text into a sequence of integers, and the embedding layer acts as a bridge that converts these integers into dense vectors capturing semantic meaning. These vectors are what the LSTM (or any other suitable model) processes, allowing it to learn the patterns and relationships in the text.

## Transformers

Let's dive into the world of Transformers, a groundbreaking architecture that has significantly advanced the field of natural language processing (NLP). The Transformer model, introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017, moves away from the recurrent layers used in RNNs and LSTMs, focusing instead on attention mechanisms to process data in parallel and capture the context of words in a sentence more effectively. Here's a step-by-step explanation:

1. Background and Core Idea
  - Parallel Processing: Unlike RNNs and LSTMs that process data sequentially, Transformers process entire sequences of data in parallel. This significantly speeds up training.
  - Attention Mechanism: The core idea behind Transformers is the attention mechanism, which allows the model to weigh the influence of different words in the sequence when predicting a word, thereby understanding the context more effectively.
2. Architecture Overview
Transformers consist of two main parts: the Encoder and the Decoder.

  - Encoder: Processes the input text and produces a set of "contextual embeddings" that represent the input text. It's composed of a stack of identical layers, each with two sub-layers: a multi-head self-attention mechanism and a fully connected feed-forward network.
  - Decoder: Generates the output text one word at a time, using the encoded information. It also comprises a stack of identical layers but includes an additional multi-head attention layer that focuses on the encoder's output.
3. Key Components

  - Self-Attention Mechanism: Allows the model to weigh the importance of other words in the sequence for each word in the sequence.
  - Multi-Head Attention: The Transformer performs multiple self-attention operations in parallel, allowing the model to capture information from different representation subspaces at different positions. This is known as Multi-Head Attention.
  - Positional Encoding: Since Transformers do not process data sequentially, they use positional encodings to give the model information about the position of each word in the sequence.
  - Feed-Forward Networks: Each layer in both the encoder and decoder contains a fully connected feed-forward network applied to each position separately and identically.

4. Training Process
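  Transformers are trained with standard backpropagation and gradient descent, typically minimizing a cross-entropy loss over the predicted tokens. Because the architecture has no recurrence, backpropagation through time (BPTT), which RNNs and LSTMs rely on, is not needed.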
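
The self-attention mechanism listed under Key Components can be sketched in a few lines of NumPy. This is a single-head version with random projection matrices and made-up sizes, meant only to show the computation; a real Transformer learns the projections and runs several heads in parallel.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over x of shape (n, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (n, n): relevance of every position to every other
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d_model, d_head = 3, 8, 4                 # illustrative sizes
x = rng.normal(size=(n, d_model))            # stand-in for embeddings plus positional encodings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, weights.shape)           # (3, 4) (3, 3)
```

All positions are handled in one matrix product, which is what makes the parallel processing described above possible.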

  
One of the easiest ways to get started with Transformers is by using the Hugging Face transformers library, which provides a vast collection of pre-trained models and a simple interface for various NLP tasks.

In [2]:
from transformers import pipeline

# Load a pre-trained model and tokenizer
generator = pipeline('text-generation', model='gpt2')  # Use 'gpt2' instead of 'gpt-2'

# Generate text
input_text = "The advantages of using transformers include"
generated_text = generator(input_text, max_length=50, num_return_sequences=1)

print(generated_text[0]['generated_text'])



Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The advantages of using transformers include the increased precision of the signal that is expected to travel out in transition to them. Higher speed transformers are also easier to use than lower speed ones because of their lower cost and the reduction in noise due to the


Let's create a concrete example using a simplified scenario with a sentence, focusing on how positional encoding might be conceptually applied. We'll use the sentence "Hello world" for simplicity and imagine we're working in a reduced 2-dimensional embedding space.

Assume we're processing this two-word sentence and we have pre-computed embeddings for "Hello" and "world". To keep things simple, let's say:

- Embedding for "Hello": `[1, 2]`
- Embedding for "world": `[3, 4]`
  
These embeddings are purely illustrative. In real scenarios, embeddings are high-dimensional and learned from data.

**Positional Encodings**
  
We'll calculate positional encodings using a very simplified version of the sinusoidal formula, focusing only on positions 0 and 1. In a real Transformer, these would be calculated using sine and cosine functions of different wavelengths, but here we'll use simple values for illustration:

- Positional Encoding for Position 0 (for "Hello"): Using simplified functions: `[sin(0), cos(0)] ≈ [0, 1]`
- Positional Encoding for Position 1 (for "world"): Using simplified functions: `[sin(1), cos(1)] ≈ [0.84, 0.54]` (approximations)

**Applying Positional Encodings**

We add these positional encodings to the word embeddings to get the final input representations:

- Final representation for "Hello": `[1+0, 2+1] = [1, 3]`
- Final representation for "world": `[3+0.84, 4+0.54] = [3.84, 4.54]`
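
The arithmetic above can be reproduced with NumPy. The sketch uses the same simplified `[sin(pos), cos(pos)]` encoding as the text, not the full Transformer formula.

```python
import numpy as np

embeddings = np.array([[1.0, 2.0],   # "Hello"
                       [3.0, 4.0]])  # "world"

positions = np.arange(len(embeddings))  # [0, 1]
# Simplified 2-D encoding from the text: [sin(pos), cos(pos)]
positional = np.stack([np.sin(positions), np.cos(positions)], axis=1)

final = embeddings + positional
print(np.round(final, 2))  # rows: [1, 3] for "Hello" and [3.84, 4.54] for "world"
```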

**Conceptual Visualization**
  
If we were to visualize these final representations in a 2D space, each point ("Hello" and "world") would not only be positioned according to its semantic meaning (given by its embedding) but also shifted in a way that reflects its position in the sentence (due to the added positional encoding).

- "Hello" starts at [1, 2] and, after adding its positional encoding, moves to [1, 3].
- "world" starts at [3, 4] and, after adding its positional encoding, moves to [3.84, 4.54].
  
This process ensures that even though "Hello" and "world" might be close in the embedding space due to their semantic similarity (as part of a greeting, for example), their final representations are distinct, reflecting their different positions in the input sequence.

**Why `sin` and `cos`?**

The use of sine and cosine functions for generating positional encodings in Transformers is a particularly clever choice for several reasons. These functions have properties that make them well-suited for encoding sequential positions and maintaining a relative notion of distance between different positions in a sequence. Here's why sine and cosine functions are used instead of other types of functions:

1. Periodicity and Boundedness: Sine and cosine functions are periodic and always stay within [-1, 1], so the encoding values do not grow with the position index. Because each dimension uses a different wavelength, the combined pattern stays distinct across positions even though every individual dimension repeats.
2. Uniqueness and Continuity: For any given position, the combination of sine and cosine values (for different frequencies/wavelengths) provides a unique encoding. This uniqueness is crucial for distinguishing between different positions in the sequence.

  Additionally, these functions are continuous and smooth, allowing for small changes in position to correspond to small changes in encoding. This property helps the model to understand and generalize patterns related to position changes in the input data.

3. Relative Positioning:
  - The Transformer model relies heavily on understanding the relationships and distances between words in a sequence. Sine and cosine functions offer a way to encode not just absolute position but also relative position.
  - Each sine/cosine pair in the encoding is a point on a unit circle, and moving by a fixed offset `k` rotates that point by a fixed angle. As a result, the encoding of position `pos + k` is a linear function of the encoding of position `pos`, and the dot product between two encodings depends only on the offset between them.
4. Scalability Across Sequence Lengths: The argument of each sine and cosine grows linearly with the position index, but the wavelength differs per dimension: in the original formula the wavelengths form a geometric progression from 2π to 10000·2π. Short wavelengths distinguish neighboring positions, while long wavelengths distinguish positions that are far apart, so the encoding stays informative over a wide range of sequence lengths.
5. Compatibility with Model Architecture: Using sine and cosine functions allows positional encodings to be easily added to word embeddings without disrupting the embedding space. Because these functions produce values within a known range, they can be added to the embeddings without overwhelming the semantic information contained in the embeddings themselves.

**Example:**
  
Consider two words in different positions of a sentence. The difference in their positional encodings captures the notion of "distance" between them in the sequence, which is important for tasks like understanding sentence structure or translating text. The use of sine and cosine functions ensures that this distance is encoded in a way that is meaningful and interpretable by the model.

In summary, sine and cosine functions provide a mathematically elegant and computationally efficient way to encode both the absolute and relative positions of words in a sequence. This method supports the Transformer's ability to understand the order and structure of input sequences without relying on recurrence or convolution, which are common in other types of sequence models.
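
For reference, here is a sketch of the full sinusoidal encoding from "Attention Is All You Need", where `PE[pos, 2i] = sin(pos / 10000^(2i/d_model))` and `PE[pos, 2i+1]` is the cosine of the same angle. The function name and the sizes are chosen for illustration, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd dimensions."""
    positions = np.arange(num_positions)[:, None]   # (num_positions, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]   # the values 2i, shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(num_positions=50, d_model=16)

# The dot product between two encodings depends only on the offset between them
offset = 3
print(np.isclose(pe[2] @ pe[2 + offset], pe[30] @ pe[30 + offset]))  # True
```

The final check illustrates the relative-position property: positions 2 and 5 relate to each other in the same way as positions 30 and 33.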

## My GPT

Now we will build our own GPT-style text generator using the "tiny shakespeare" dataset.



In [1]:
import requests

# URL of the dataset
data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'

# Fetch the content from the URL
response = requests.get(data_url)

# Check if the request was successful
if response.status_code == 200:
    # Decode the fetched text data
    text_data = response.text
    print(f"Data loaded successfully. Length of text: {len(text_data)} characters")
else:
    print(f"Failed to fetch data. Status code: {response.status_code}")


Data loaded successfully. Length of text: 1115394 characters


In [2]:
print(text_data[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



For our project, training a Transformer model on this data, we'd start by preprocessing the text to suit our model's needs. This involves:

- Tokenization: Splitting the text into tokens. Given the literary nature of this dataset, word-level or subword-level tokenization might be preferable to capture the nuanced language.
- Vectorization: Converting tokens into numerical representations. You might use embeddings that your model will learn during training.
- Sequence Preparation: Creating sequences of tokens as model inputs and expected outputs. For text generation, sequences of a certain length from the dataset could serve as inputs, with each corresponding output being the sequence shifted by one token, aiming to predict the next token.
- Model Training: With the data prepared, you'd define and train your Transformer model, tuning it to best capture and generate Shakespearean text.

This dataset and task could serve as an excellent way to explore the capabilities of Transformer models in understanding and generating complex, stylistic text.
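
The sequence-preparation step can be sketched as follows. The helper `make_examples` and the token IDs are made up for illustration; the point is only that each target is its input shifted by one token.

```python
def make_examples(token_ids, block_size):
    """Slice a token stream into (input, target) pairs; the target is shifted by one token."""
    examples = []
    for start in range(0, len(token_ids) - block_size, block_size):
        x = token_ids[start : start + block_size]
        y = token_ids[start + 1 : start + block_size + 1]
        examples.append((x, y))
    return examples

tokens = [10, 11, 12, 13, 14, 15, 16, 17, 18]  # made-up token IDs
examples = make_examples(tokens, block_size=4)
for x, y in examples:
    print(x, "->", y)
# [10, 11, 12, 13] -> [11, 12, 13, 14]
# [14, 15, 16, 17] -> [15, 16, 17, 18]
```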








In [3]:
# Calculate the length of the data
n = len(text_data)

# Split the data into training and validation sets
train_data = text_data[:int(n*0.9)]
val_data = text_data[int(n*0.9):]

# Print the sizes of the splits
print(f"Training Data Length: {len(train_data)} characters")
print(f"Validation Data Length: {len(val_data)} characters")


Training Data Length: 1003854 characters
Validation Data Length: 111540 characters


The next step is to tokenize the split data (`train_data` and `val_data`) using the encoding designed for the GPT-2 model. This converts the raw text into a format (tokens) that the model can understand.

Tokenization is a fundamental step in natural language processing (NLP) and text analysis. It’s the process of breaking down text into smaller units called tokens. A token is typically a word, but it can also be a subword, character, or even a sequence of characters depending on the tokenization method. Let’s delve into the concept:

- Basic Tokenization: At its simplest, tokenization involves dividing text into words or sentences. For example, the sentence “Hello, world!” when tokenized into words would result in tokens [“Hello”, “,”, “world”, “!”]. This type of tokenization is often based on simple rules like splitting text by spaces and punctuation.
- Advanced Tokenization: More advanced forms of tokenization are used in NLP for tasks like machine translation, text generation, and sentiment analysis. These may involve breaking text into subwords or characters. For example, the word “unbelievable” might be broken into [“un”, “##believ”, “##able”] in some NLP models. The “##” prefix denotes that a subword continues a larger word; this is the WordPiece convention used by BERT, and GPT-2's tokenizer marks word boundaries differently.
- Tokenization in Language Models: Language models like GPT-2 use sophisticated tokenization algorithms. These algorithms can handle a vast vocabulary efficiently and are capable of breaking down complex words into subword units. This allows the model to understand and generate a variety of text, even with words it hasn’t explicitly seen before. The process involves converting each token into a numerical representation (like an ID) that the model can process. For example, “Hello” might be converted to a number like 1256, “world” to 794, and so on. This numerical representation is crucial for training and using machine learning models in NLP.

Tokenization helps in reducing the size of the vocabulary that the model needs to understand. This is particularly important for languages with rich morphology or when dealing with a large corpus of text. It also allows models to better capture the meaning of text, as the relationship between tokens forms the basis of understanding and generating language in these models.

In the example below, we use the tokenizer designed for GPT-2, which is a byte-level Byte-Pair Encoding (BPE) scheme. This step transforms our raw text data into a sequence of token IDs that a GPT-2 model can work with for tasks like text generation, classification, or further language understanding.

In [9]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.6.0


In [10]:
import tiktoken
# encode with tiktoken gpt2 bpe
enc = tiktoken.get_encoding("gpt2")
train_ids = enc.encode_ordinary(train_data)
val_ids = enc.encode_ordinary(val_data)
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")

train has 301,966 tokens
val has 36,059 tokens


`enc = tiktoken.get_encoding("gpt2")`: This line uses the tiktoken library to load the encoding scheme used by GPT-2. The variable `enc` holds the resulting encoding object, whose `encode_ordinary` method converts text into a list of token IDs while ignoring special tokens. The output shows that roughly 1.1 million characters become about 338,000 tokens, so each token covers a little over three characters on average.

The GPT-2 BPE (Byte-Pair Encoding) tokenizer refers to a specific method used for tokenizing text in preparation for processing by the GPT-2 model, a variant of the Transformer. Byte-Pair Encoding (BPE) is a type of subword tokenization technique that allows models to understand a wide vocabulary, including out-of-vocabulary words, by breaking down words into more manageable subwords or symbols. BPE is a middle ground between word-level and character-level tokenization. It starts with a large corpus of text and iteratively merges the most frequently occurring character or symbol pairs. This process continues until a desired vocabulary size is reached. The result is a vocabulary that contains individual characters, common subwords, and full words.
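
To make the merge step concrete, here is a toy sketch of one BPE iteration on a short string. It is deliberately simplified: the real GPT-2 tokenizer works on bytes, learns its merges from a large corpus, and does not merge across word boundaries. The function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Return the adjacent symbol pair that occurs most often."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

symbols = list("aaabdaaabac")        # start from individual characters
pair = most_frequent_pair(symbols)   # ('a', 'a')
symbols = merge_pair(symbols, pair)
print(pair, symbols)
# ('a', 'a') ['aa', 'a', 'b', 'd', 'aa', 'a', 'b', 'a', 'c']
```

Repeating these two calls grows the vocabulary one merged symbol at a time, which is how BPE ends up with a mix of characters, subwords, and full words.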

The approach below is a practical method for saving tokenized text data, particularly when dealing with large datasets or when you need to efficiently store and retrieve the pre-processed data for training neural network models.


In [14]:
import numpy as np
import os

# Saving to Colab's local environment
base_path = '/content'  # Default directory for a Colab session

# Ensure the directory exists
os.makedirs(base_path, exist_ok=True)

# Convert to NumPy arrays and save as binary files
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile(os.path.join(base_path, 'train.bin'))
val_ids.tofile(os.path.join(base_path, 'val.bin'))
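
A short sketch of how such a file can be read back. It writes a few stand-in token IDs to a temporary directory instead of `/content`, so the path and the values here are illustrative only. `uint16` is sufficient because the GPT-2 vocabulary has 50,257 entries, below the `uint16` limit of 65,535.

```python
import os
import tempfile
import numpy as np

# Stand-in token IDs; in the notebook these would be `train_ids`
ids = np.array([15496, 995, 50256], dtype=np.uint16)

path = os.path.join(tempfile.mkdtemp(), "train.bin")
ids.tofile(path)

# tofile() writes raw bytes with no header, so the dtype must be supplied again on load
loaded = np.fromfile(path, dtype=np.uint16)
# np.memmap reads lazily from disk, which helps when the file is larger than memory
mapped = np.memmap(path, dtype=np.uint16, mode="r")

print(loaded, len(mapped))
```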

Given we've prepared and saved our tokenized dataset correctly, we're in a good position to start training our Transformer model. Training a model in Colab involves several steps, including defining our model architecture, preparing our data for training (e.g., batching, converting to tensors), setting up the training loop, and finally, training the model.

In [15]:
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2")


All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


- The `TFGPT2LMHeadModel.from_pretrained("gpt2")` line loads a pre-trained GPT-2 model with all its weights. This model has been trained on a vast dataset and has learned the structure and nuances of the language. When we fine-tune it on our specific dataset, we start from a strong foundation of language understanding, which greatly helps the model's performance on our task.

- Fine-tuning a pre-trained model on a specific dataset is a common practice that leverages transfer learning. By loading the pre-trained GPT-2 model, we're not starting the training process from scratch but rather adjusting the pre-existing weights to better fit our particular dataset. This process is generally much faster and requires less data to achieve high performance.

The message we're seeing is informational and confirms a successful operation where:

1. The pre-trained GPT-2 model weights were fully loaded into a TensorFlow version of the model (TFGPT2LMHeadModel) using the Hugging Face transformers library. This process involves downloading and initializing the model with weights that were originally trained using PyTorch but have been converted for use with TensorFlow.

2. It indicates that all PyTorch model weights found a corresponding component in the TensorFlow model, ensuring that the model you've loaded is complete and ready for use.

3. The note about using the model for predictions without further training is a general suggestion that pre-trained models, thanks to their extensive prior training on large datasets, are often capable of performing well on tasks similar to their training tasks right out of the box.

Our `TFGPT2LMHeadModel` instance is now ready to be fine-tuned on our specific dataset or used directly for generating text. Fine-tuning is recommended if our dataset has unique characteristics or we aim for a specific task performance. For tasks closely related to the original training of GPT-2 (like text generation), we might find the model performs quite well even without fine-tuning.

The seamless initialization of TensorFlow model weights from the PyTorch model underlines the flexibility of Hugging Face's transformers library in supporting models across different frameworks. This compatibility is crucial for accessing a wide range of pre-trained models regardless of the original training framework.

Before fine-tuning our GPT2 with our text, here’s a basic example of generating text with our loaded model:

In [16]:
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2")

# Encode some prompt text
input_ids = tokenizer.encode("Once upon a time", return_tensors="tf")

# Generate text using the model
generated_text_ids = model.generate(input_ids, max_length=100)
generated_text = tokenizer.decode(generated_text_ids[0], skip_special_tokens=True)

print(generated_text)


All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time, the world was a place of great beauty and great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great


The message we received indicates a few key points about using the TFGPT2LMHeadModel from Hugging Face's Transformers library, particularly focusing on attention masks and padding tokens:

1. Model Initialization: The model has successfully loaded with all necessary weights from a PyTorch pre-trained model, making it ready for generating predictions or being fine-tuned for specific tasks closely related to its original training.

2. Attention Mask Warning: The warning about the attention mask and pad token ID highlights the importance of these components for obtaining reliable results from the model.

  - Attention Mask: The attention mask tells the model which parts of the input should be paid attention to and which parts are just padding and should be ignored. Without it, the model might produce unreliable outputs because it doesn't know which part of the input is meaningful.
  - Pad Token ID: This is related to how padding is handled in the model. When sequences of different lengths are batched together, the shorter ones are padded to a common length. GPT-2 has no dedicated padding token, so the library falls back to the end-of-sequence token (ID 50256), as the log message shows. The model needs to know which token ID is used for padding so it can ignore those tokens when processing input.

3. Repetitive Text Generation: The generated text above demonstrates a common issue with language models in open-ended generation: repetition, or getting "stuck" in a loop. By default, `generate` uses greedy decoding, which always picks the single most likely next token, so once the model enters a high-probability loop it tends to stay there.

Several generation parameters help counter this:

- `Temperature`: Controls randomness in the prediction process. Higher values result in more randomness.
- `Top-k Sampling`: Limits the prediction to the top k most likely next words, reducing the chance of picking low-probability options.
- `Top-p (Nucleus) Sampling`: Chooses from the smallest set of words whose cumulative probability exceeds the threshold p. It dynamically adapts the set size, balancing between diversity and coherence.
- `No Repeat N-Gram`: Prevents the model from repeating the same n-grams, ensuring more varied text.

By setting `do_sample=True`, you're instructing the model to sample from the output distribution, which allows the temperature and `top_p` settings to take effect, guiding the randomness and creativity of the generated text. This approach can help mitigate issues with repetition and produce more varied and interesting outputs.
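
To show what temperature, top-k, and top-p do to the next-token distribution, here is a simplified NumPy sketch. It is not the Hugging Face implementation (which applies its filters as a sequence of logits processors); the function name and the logits are made up for illustration.

```python
import numpy as np

def filter_next_token_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return the renormalized next-token distribution after temperature, top-k and top-p."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                 # token indices, most likely first
    keep = np.ones(len(probs), dtype=bool)
    if top_k > 0:
        keep[order[top_k:]] = False            # drop everything outside the k most likely tokens
    if top_p < 1.0:
        cumulative = np.cumsum(probs[order])
        # keep the smallest prefix whose cumulative probability reaches top_p
        cutoff = np.searchsorted(cumulative, top_p) + 1
        keep[order[cutoff:]] = False
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

logits = [2.0, 1.0, 0.5, -1.0]                 # made-up scores for a 4-token vocabulary
probs = filter_next_token_probs(logits, temperature=0.9, top_k=3, top_p=0.95)

rng = np.random.default_rng(0)
next_token = rng.choice(len(probs), p=probs)   # sampling instead of always taking the argmax
print(np.round(probs, 3), next_token)
```

Greedy decoding corresponds to always taking the most likely token; sampling from the filtered distribution is what breaks the repetition loop.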

In [20]:
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer
import tensorflow as tf

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("Once upon a time", return_tensors="tf")
attention_mask = tf.ones(input_ids.shape, dtype=tf.int32)  # Assuming all parts of the input should be attended to

generated_text_ids = model.generate(
    input_ids,
    attention_mask=attention_mask,  # Pass the mask so the attention-mask warning no longer applies
    max_length=100,
    do_sample=True,        # Enable sampling mode
    temperature=0.9,       # Adds randomness: higher values lead to more creative text
    top_k=50,              # Limits to the top 50 candidates
    top_p=0.95,            # Uses nucleus sampling with p=0.95
    no_repeat_ngram_size=2,# Prevents 2-gram repeats
    pad_token_id=tokenizer.eos_token_id  # Sets pad token id
)

generated_text = tokenizer.decode(generated_text_ids[0], skip_special_tokens=True)

print(generated_text)



Once upon a time, he was a man who knew nothing of those things.

He was never known to have any recollection of anything other than he did remember that all the time he didn't know what to do. He never once had any memory of his own body being torn from him. Not just in the air, but in a great deal. A person who was so sure of himself that he could not even remember being told what his body was. And it was with a kind of


Much better!

## Fine Tuning

Fine-tuning a pre-trained model like GPT-2 with our specific dataset, such as the Shakespeare text, involves a few steps: preparing your data, defining the model, and setting up the training process. Since we've already prepared and tokenized our data, the next steps focus on setting up the training environment and process in TensorFlow.

Here’s a simplified example to get you started on fine-tuning GPT-2 using TensorFlow and the Hugging Face Transformers library. It assumes our tokenized data (`train_ids` and `val_ids`) is ready.



In [21]:
import tensorflow as tf

# `train_ids` and `val_ids` are numpy arrays from your previous step
train_dataset = tf.data.Dataset.from_tensor_slices((train_ids, train_ids))  # Input and target are the same for language modeling
val_dataset = tf.data.Dataset.from_tensor_slices((val_ids, val_ids))

# Batch the data
BATCH_SIZE = 8  # Adjust based on your GPU/TPU memory
train_dataset = train_dataset.shuffle(10000).batch(BATCH_SIZE, drop_remainder=True)
val_dataset = val_dataset.batch(BATCH_SIZE, drop_remainder=True)
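
Pairing each sequence with itself as the target works when the model's built-in loss is used, because causal language-model heads such as `TFGPT2LMHeadModel` shift the labels by one position internally. The following NumPy sketch shows that shifted cross-entropy. It is illustrative only: `causal_lm_loss` is a hypothetical name, and the sketch ignores details such as masked label positions.

```python
import numpy as np

def causal_lm_loss(logits, input_ids):
    """Mean next-token cross-entropy.

    logits:    float array of shape (batch, seq_len, vocab_size)
    input_ids: int array of shape (batch, seq_len), also used as the labels
    """
    shifted_logits = logits[:, :-1, :]   # predictions made at positions 0 .. n-2
    targets = input_ids[:, 1:]           # the tokens that actually follow, 1 .. n-1
    # Numerically stable log-softmax over the vocabulary axis
    centered = shifted_logits - shifted_logits.max(axis=-1, keepdims=True)
    log_probs = centered - np.log(np.exp(centered).sum(axis=-1, keepdims=True))
    # Pick the log-probability assigned to each true next token
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1).squeeze(-1)
    return float(-picked.mean())

# With all-zero logits every token is equally likely, so the loss is log(vocab_size)
demo_ids = np.array([[0, 1, 2, 3], [4, 3, 2, 1]])
uniform_loss = causal_lm_loss(np.zeros((2, 4, 5)), demo_ids)
```

The last position's prediction has no target and the first token is never predicted, so a sequence of length n contributes n-1 terms to the loss.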


In [22]:
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2")

# Ensure that the model uses the same pad token id as the tokenizer
model.config.pad_token_id = tokenizer.eos_token_id




In [None]:
!pip install tensorflow transformers


In [45]:
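config = {
    # Peak learning rate reached at the end of warmup. Note: 0.01 is high for
    # fine-tuning a pre-trained model; values near 5e-5 are more common.
    "initial_learning_rate": 0.01,
    # After warmup, decay linearly (power=1.0) from 0.01 to 0.0001 over 10000 steps
    "decay_schedule_fn": tf.keras.optimizers.schedules.PolynomialDecay(
        initial_learning_rate=0.01,
        decay_steps=10000,
        end_learning_rate=0.0001,
        power=1.0,
    ),
    "warmup_steps": 100,  # Number of steps spent ramping up to the peak rate
}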
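
To see what this configuration implies, the schedule can be written out as a plain Python function. This is a sketch of the arithmetic, assuming linear warmup followed by polynomial decay counted from the end of warmup; `lr_at_step` is a hypothetical helper, not part of TensorFlow or Transformers.

```python
def lr_at_step(step, peak_lr=0.01, end_lr=0.0001,
               warmup_steps=100, decay_steps=10000, power=1.0):
    """Learning rate at a given training step for warmup + polynomial decay."""
    if step < warmup_steps:
        # Linear warmup: scale the peak rate by the fraction of warmup completed
        return peak_lr * step / warmup_steps
    # The decay schedule sees steps counted from the end of warmup,
    # and holds at end_lr once decay_steps have elapsed
    decay_step = min(step - warmup_steps, decay_steps)
    remaining = 1.0 - decay_step / decay_steps
    return (peak_lr - end_lr) * remaining ** power + end_lr

# Rates at a few points in training
schedule_preview = {s: lr_at_step(s) for s in (0, 50, 100, 5100, 10100)}
```

With the values above, the rate climbs from 0 to 0.01 over the first 100 steps, falls linearly to 0.0001 over the next 10000 steps, and stays there.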

In [46]:
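from transformers import WarmUp

# WarmUp is a learning-rate schedule, not an optimizer: it ramps the rate up
# over `warmup_steps`, then hands off to `decay_schedule_fn`
lr_schedule = WarmUp(
    initial_learning_rate=config["initial_learning_rate"],
    decay_schedule_fn=config["decay_schedule_fn"],
    warmup_steps=config["warmup_steps"],
)

In [49]:
# create_optimizer builds an Adam optimizer together with a warmup + decay schedule
from transformers import create_optimizer

# Assumption: the call that produced the output below was not captured in this
# notebook; the arguments here simply mirror `config`
optimizer = create_optimizer(
    init_lr=config["initial_learning_rate"],
    num_train_steps=10000,
    num_warmup_steps=config["warmup_steps"],
)

# create_optimizer returns an (optimizer, schedule) tuple
print(optimizer)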

(<tf_keras.src.optimizers.adam.Adam object at 0x7811633d6740>, <transformers.optimization_tf.WarmUp object at 0x7811633d54e0>)


In [55]:
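# Unpack the (optimizer, schedule) tuple and keep the optimizer
adam_optimizer = optimizer[0]

def custom_loss(y_true, y_pred):
    # Per-token cross-entropy computed on raw logits
    return tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)

# Compile with the model's own `compute_loss` if it exists, else fall back to `custom_loss`.
# Caution: the `hasattr` check is true here, so `custom_loss` is never selected, and
# `model.compute_loss` used as a Keras loss is the call that appears in the traceback
# after the `model.fit` call.
model.compile(optimizer=adam_optimizer, loss=model.compute_loss if hasattr(model, 'compute_loss') else custom_loss)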


In [60]:
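# Rebuild the datasets with targets shifted by one for language modeling.
# Note: `[:-1]` and `[1:]` slice the FIRST axis. If `train_ids` has shape
# (num_examples, seq_len), this pairs each example with the next example;
# shifting tokens within a sequence needs `[:, :-1]` and `[:, 1:]` instead.
train_dataset = tf.data.Dataset.from_tensor_slices((train_ids[:-1], train_ids[1:]))
val_dataset = tf.data.Dataset.from_tensor_slices((val_ids[:-1], val_ids[1:]))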

train_dataset = train_dataset.batch(BATCH_SIZE, drop_remainder=True)
val_dataset = val_dataset.batch(BATCH_SIZE, drop_remainder=True)
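
Which axis gets sliced matters for the shift. The small NumPy example below uses a made-up array in place of `train_ids`, under the assumption that the real array is 2-D with shape `(num_examples, seq_len)`; the shape used in this notebook is not shown here.

```python
import numpy as np

# Three toy "sequences" of four token ids each (made-up values)
ids = np.array([[10, 11, 12, 13],
                [20, 21, 22, 23],
                [30, 31, 32, 33]])

# Slicing the first axis pairs every row with the NEXT row
x_rows, y_rows = ids[:-1], ids[1:]
print(x_rows[0], "->", y_rows[0])

# Slicing the second axis pairs every token with the next token in the SAME row
x_tok, y_tok = ids[:, :-1], ids[:, 1:]
print(x_tok[0], "->", y_tok[0])
```

Only the second form produces next-token targets. If the model's built-in loss is used, no manual shift is needed at all, since that loss shifts the labels itself.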


In [61]:
import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, TensorBoard

# Define a checkpoint callback to save the model weights during training
checkpoint_cb = ModelCheckpoint(
    "gpt2_finetuned.h5",     # Path where to save the weights
    save_weights_only=True,  # Subclassed models cannot be saved whole in HDF5 format
    save_best_only=True,     # Save only the best weights based on 'val_loss'
    monitor='val_loss',      # Metric to monitor
    mode='min'               # The goal is to minimize 'val_loss'
)

# Define an early stopping callback to halt training when improvement stops
early_stopping_cb = EarlyStopping(
    monitor='val_loss',  # Metric to monitor
    patience=3,          # Number of epochs with no improvement after which training will be stopped
    mode='min'           # The goal is to minimize 'val_loss'
)

# Define the TensorBoard callback for training visualization
tensorboard_cb = TensorBoard(
    log_dir='./logs',  # Path where to save log files
    histogram_freq=1   # Frequency (in epochs) at which to compute activation and weight histograms
)

# Assuming 'model' is your TFGPT2LMHeadModel instance and 'train_dataset', 'val_dataset' are prepared
EPOCHS = 4  # Example number of epochs

model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=EPOCHS,
    callbacks=[checkpoint_cb, early_stopping_cb, tensorboard_cb]  # Add callbacks here
)



Epoch 1/4


AttributeError: in user code:

    File "/usr/local/lib/python3.10/dist-packages/tf_keras/src/engine/training.py", line 1398, in train_function  *
        return step_function(self, iterator)
    File "/usr/local/lib/python3.10/dist-packages/tf_keras/src/engine/training.py", line 1370, in run_step  *
        outputs = model.train_step(data)
    File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_tf_utils.py", line 1638, in train_step  *
        loss = self.compiled_loss(y, y_pred, sample_weight, regularization_losses=self.losses)
    File "/usr/local/lib/python3.10/dist-packages/tf_keras/src/engine/compile_utils.py", line 275, in __call__  *
        y_t, y_p, sw = match_dtype_and_rank(y_t, y_p, sw)
    File "/usr/local/lib/python3.10/dist-packages/tf_keras/src/losses.py", line 143, in __call__  *
        losses = call_fn(y_true, y_pred)
    File "/usr/local/lib/python3.10/dist-packages/tf_keras/src/losses.py", line 270, in call  *
        return ag_fn(y_true, y_pred, **self._fn_kwargs)
    File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_tf_utils.py", line 1520, in compute_loss  *
        return super().compute_loss(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/tf_keras/src/engine/training.py", line 1207, in compute_loss  *
        y, y_pred, sample_weight, regularization_losses=self.losses
    File "/usr/local/lib/python3.10/dist-packages/tf_keras/src/engine/compile_utils.py", line 275, in __call__  *
        y_t, y_p, sw = match_dtype_and_rank(y_t, y_p, sw)
    File "/usr/local/lib/python3.10/dist-packages/tf_keras/src/engine/compile_utils.py", line 854, in match_dtype_and_rank  *
        if (y_t.dtype.is_floating and y_p.dtype.is_floating) or (

    AttributeError: 'NoneType' object has no attribute 'dtype'
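
One plausible reading of this traceback, offered as an assumption rather than a confirmed diagnosis: Keras calls a loss as `loss(y_true, y_pred)`, but `Model.compute_loss` takes `(x, y, y_pred, ...)`. Passing `model.compute_loss` as the loss therefore binds the labels to `x` and the predictions to `y`, leaving `y_pred` as `None`. The sketch below reproduces that argument mix-up with a hypothetical stand-in class; `SketchModel` and `call_as_keras_loss` are illustrative names, not library code.

```python
import numpy as np

class SketchModel:
    """Hypothetical stand-in that mimics only the argument order involved."""

    def compiled_loss(self, y, y_pred):
        # The real code inspects y_pred.dtype; None has no such attribute
        return y_pred.dtype

    def compute_loss(self, x=None, y=None, y_pred=None):
        # Same leading parameters as Keras' Model.compute_loss
        return self.compiled_loss(y, y_pred)

def call_as_keras_loss(loss_fn, y_true, y_pred):
    # A Keras loss callable receives its two arguments positionally
    return loss_fn(y_true, y_pred)

model_sketch = SketchModel()
labels = np.array([1, 2, 3])
logits = np.array([0.1, 0.2, 0.7])

# Called with keywords, the predictions land in y_pred as intended
ok_dtype = model_sketch.compute_loss(y=labels, y_pred=logits)

# Called the way a loss is called, y_pred stays None and the lookup fails
try:
    call_as_keras_loss(model_sketch.compute_loss, labels, logits)
    error_message = None
except AttributeError as err:
    error_message = str(err)
print(error_message)
```

If this reading is right, one remedy worth trying is `model.compile(optimizer=adam_optimizer)` with no `loss` argument. Transformers TensorFlow models then fall back to their built-in loss, which expects the labels to be an unshifted copy of the inputs.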
