<a href="https://colab.research.google.com/github/unknownregular/relapse/blob/master/Making_the_most_of_your_colab_subscription.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Making the most of your colab subscription



## Faster GPUs

<p>Users who have purchased one of Colab's paid plans have access to faster GPUs and more memory. You can upgrade your notebook's GPU settings in <code>Runtime &gt; Change runtime type</code> in the menu to select from several accelerator options, subject to availability.</p>
<p>The free-of-charge version of Colab grants access to Nvidia's T4 GPUs subject to quota restrictions and availability.</p>

You can see what GPU you've been assigned at any time by executing the following cell. If the execution result of running the code cell below is 'Not connected to a GPU', you can change the runtime by going to <code>Runtime &gt; Change runtime type</code> in the menu to enable a GPU accelerator, and then re-execute the code cell.

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

In order to use a GPU with your notebook, select the <code>Runtime &gt; Change runtime type</code> menu and then set the hardware accelerator to the desired option.

## More memory

Users who have purchased one of Colab's paid plans have access to high-memory VMs when they are available. More powerful GPUs are always offered with high-memory VMs.
You can see how much memory you have available at any time by running the following code cell. If the execution result of running the code cell below is 'Not using a high-RAM runtime', then you can enable a high-RAM runtime via <code>Runtime &gt; Change runtime type</code> in the menu. Then select High-RAM in the Runtime shape toggle button. After, re-execute the code cell.

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

## Longer runtimes

All Colab runtimes are reset after some period of time &#40;which is faster if the runtime isn't executing code&#41;. Colab Pro and Pro+ users have access to longer runtimes than those who use Colab free of charge.

## Background execution

Colab Pro+ users have access to background execution, where notebooks will continue executing even after you've closed a browser tab. This is always enabled in Pro+ runtimes as long as you have compute units available.


## Relaxing resource limits in Colab Pro

Your resources are not unlimited in Colab. To make the most of Colab, avoid using resources when you don't need them. For example, only use a GPU when required and close Colab tabs when finished.

If you encounter limitations, you can relax those limitations by purchasing more compute units via pay as you go. Anyone can purchase compute units via <a href="https://colab.research.google.com/signup">pay as you go</a>; no subscription is required.

## Send us feedback!

<p>If you have any feedback for us, please let us know. The best way to send feedback is by using the Help &gt; 'Send feedback…' menu. If you encounter usage limits in Colab Pro consider subscribing to Pro+.</p>
<p>If you encounter errors or other issues with billing &#40;payments&#41; for Colab Pro, Pro+ or pay as you go, please email <a href="mailto:colab-billing@google.com">colab-billing@google.com</a>.</p>

## More resources

### Working with notebooks in Colab
- [Overview of Colab](/notebooks/basic_features_overview.ipynb)
- [Guide to markdown](/notebooks/markdown_guide.ipynb)
- [Importing libraries and installing dependencies](/notebooks/snippets/importing_libraries.ipynb)
- [Saving and loading notebooks in GitHub](https://colab.research.google.com/github/googlecolab/colabtools/blob/main/notebooks/colab-github-demo.ipynb)
- [Interactive forms](/notebooks/forms.ipynb)
- [Interactive widgets](/notebooks/widgets.ipynb)

<a name="working-with-data"></a>
### Working with data
- [Loading data: Drive, Sheets and Google Cloud Storage](/notebooks/io.ipynb)
- [Charts: visualising data](/notebooks/charts.ipynb)
- [Getting started with BigQuery](/notebooks/bigquery.ipynb)

### Machine learning crash course
These are a few of the notebooks from Google's online machine learning course. See the <a href="https://developers.google.com/machine-learning/crash-course/">full course website</a> for more.
- [Intro to Pandas DataFrame](https://colab.research.google.com/github/google/eng-edu/blob/main/ml/cc/exercises/pandas_dataframe_ultraquick_tutorial.ipynb)
- [Linear regression with tf.keras using synthetic data](https://colab.research.google.com/github/google/eng-edu/blob/main/ml/cc/exercises/linear_regression_with_synthetic_data.ipynb)


<a name="using-accelerated-hardware"></a>
### Using accelerated hardware
- [TensorFlow with GPUs](/notebooks/gpu.ipynb)
- [TPUs in Colab](/notebooks/tpu.ipynb)

<a name="machine-learning-examples"></a>

## Machine learning examples

To see end-to-end examples of the interactive machine learning analyses that Colab makes possible, take a look at these tutorials using models from <a href="https://tfhub.dev">TensorFlow Hub</a>.

A few featured examples:

- <a href="https://tensorflow.org/hub/tutorials/tf2_image_retraining">Retraining an Image Classifier</a>: Build a Keras model on top of a pre-trained image classifier to distinguish flowers.
- <a href="https://tensorflow.org/hub/tutorials/tf2_text_classification">Text Classification</a>: Classify IMDB film reviews as either <em>positive</em> or <em>negative</em>.
- <a href="https://tensorflow.org/hub/tutorials/tf2_arbitrary_image_stylization">Style Transfer</a>: Use deep learning to transfer style between images.
- <a href="https://tensorflow.org/hub/tutorials/retrieval_with_tf_hub_universal_encoder_qa">Multilingual Universal Sentence Encoder Q&amp;A</a>: Use a machine-learning model to answer questions from the SQuAD dataset.
- <a href="https://tensorflow.org/hub/tutorials/tweening_conv3d">Video Interpolation</a>: Predict what happened in a video between the first and the last frame.


In [1]:
import os
import shutil

# Define the folder name
folder_name = 'my_project_folder'

# Create the directory if it doesn't exist
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

# List of uploaded files (replace with the actual names of your uploaded files)
uploaded_files = ["/improved_model.py", "/improved_tester.py", "/improved_training.py", "/project_summary.md"]

# Move each uploaded file to the new directory
for file_name in uploaded_files:
    # Construct the destination path
    destination_path = os.path.join(folder_name, os.path.basename(file_name))
    try:
        # Use shutil.move for moving files
        shutil.move(file_name, destination_path)
        print(f"Moved '{file_name}' to '{destination_path}'")
    except FileNotFoundError:
        print(f"File '{file_name}' not found.")
    except Exception as e:
        print(f"Error moving file '{file_name}': {e}")

print(f"\nFiles should now be in the '{folder_name}' directory.")
# You can verify by listing the contents of the directory
# !ls my_project_folder

Moved '/improved_model.py' to 'my_project_folder/improved_model.py'
Moved '/improved_tester.py' to 'my_project_folder/improved_tester.py'
Moved '/improved_training.py' to 'my_project_folder/improved_training.py'
Moved '/project_summary.md' to 'my_project_folder/project_summary.md'

Files should now be in the 'my_project_folder' directory.


In [2]:
import os

folder_name = 'my_project_folder'
file_name = 'project_summary.md'
file_path = os.path.join(folder_name, file_name)

try:
    with open(file_path, 'r') as f:
        summary_content = f.read()
        print(summary_content)
except FileNotFoundError:
    print(f"File '{file_name}' not found in '{folder_name}'. Please make sure the file is in the correct directory.")
except Exception as e:
    print(f"An error occurred while reading the file: {e}")

# Custom AI Model Project Summary & Checklist

## Current Project Status

### Project Overview
- **Goal**: Create a custom language model trained on 2025-themed data
- **Model Architecture**: Custom transformer implementation (12 layers, 768 hidden size)
- **Training Results**: Very successful training with loss decreasing to ~0.03
- **Current Issue**: Model has learned information but struggles with coherent text generation

### What We've Accomplished

- [x] Successfully implemented custom transformer architecture
- [x] Created training data with 2025-themed information
- [x] Trained model to very low loss (training: ~0.03, validation: ~0.007)
- [x] Model successfully learned factual information based on loss metrics
- [x] Converted model to use GPT-2 tokenizer (expanded vocabulary from 2.6k to 50k tokens)
- [x] Created generator scripts with various approaches

### Current Issues Identified

- [x] Text generation produces repetitive patterns (especially colons)
- [x] Model struggles

In [34]:
import torch
import torch.nn as nn

class SimpleLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Replace feedforward layers with LSTM
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True) # batch_first=True expects input shape (batch_size, seq_len, embedding_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text):
        # text shape: (batch_size, seq_len)
        embedded = self.embedding(text)
        # embedded shape: (batch_size, seq_len, embedding_dim)

        # Pass through LSTM
        lstm_output, (hidden_state, cell_state) = self.lstm(embedded)
        # lstm_output shape: (batch_size, seq_len, hidden_dim)
        # hidden_state shape: (1, batch_size, hidden_dim) # if num_layers is 1
        # cell_state shape: (1, batch_size, hidden_dim) # if num_layers is 1

        # We want the output at each time step to predict the next token
        # Apply linear layer to the LSTM output
        output = self.fc(lstm_output)
        # output shape: (batch_size, seq_len, output_dim)

        return output

# Example usage:
# vocab_size = 10000  # Size of your vocabulary
# embedding_dim = 100 # Dimension of word embeddings
# hidden_dim = 256    # Dimension of the hidden layer
# output_dim = vocab_size # Output dimension is typically vocabulary size for language modeling

# model = SimpleLanguageModel(vocab_size, embedding_dim, hidden_dim, output_dim)
# print(model)

In [35]:
import torch.optim as optim
import torch.nn as nn

# Assuming 'model' is an instance of your SimpleLanguageModel or similar PyTorch model
# Example:
# vocab_size = 10000  # Size of your vocabulary
embedding_dim = 100 # Dimension of word embeddings
hidden_dim = 256    # Dimension of the hidden layer
# output_dim = vocab_size # Output dimension is typically vocabulary size for language modeling

vocab_size = len(tokenizer) # Use the vocabulary size of your tokenizer
output_dim = vocab_size # Output dimension is typically vocabulary size for language modeling


model = SimpleLanguageModel(vocab_size=vocab_size, embedding_dim=embedding_dim, hidden_dim=hidden_dim, output_dim=output_dim)


# Define the optimizer
# We pass the model's parameters to the optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001) # You can adjust the learning rate (lr)

# Define the loss function
# CrossEntropyLoss is suitable for multi-class classification (predicting the next token)
criterion = nn.CrossEntropyLoss()

print("Optimizer and loss function defined.")

Optimizer and loss function defined.


In [None]:
import torch

# Assuming you have your training data loaded and batched
# train_dataloader = ... # Your training data loader
# val_dataloader = ...   # Your validation data loader (optional but recommended)

# Assuming you have your model, optimizer, and criterion defined
# model = ...
# optimizer = ...
# criterion = ...

num_epochs = 10 # Define the number of training epochs

# Training loop
for epoch in range(num_epochs):
    model.train() # Set the model to training mode
    running_loss = 0.0

    # Iterate over the training data in batches
    for batch in dataloader:
        # Assuming inputs and targets are tensors on the correct device (e.g., GPU)
        inputs = batch['input_ids']
        labels = batch['labels']
        attention_mask = batch['attention_mask'] # You might need this for more complex models

        # TODO: Move inputs, labels, and model to the appropriate device (e.g., GPU)
        # inputs = inputs.to(device)
        # labels = labels.to(device)
        # model.to(device)


        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass
        # Ensure your model's forward method accepts inputs and attention_mask if needed
        # outputs = model(inputs, attention_mask=attention_mask)
        # For the SimpleLanguageModel, you might just need:
        outputs = model(inputs)


        # Calculate the loss
        # For CrossEntropyLoss, the outputs should be the raw logits before softmax
        # And the labels should be the target token IDs
        # You might need to flatten the outputs and labels for CrossEntropyLoss
        loss = criterion(outputs.view(-1, outputs.size(-1)), labels.view(-1))


        # Backward pass and optimize
        loss.backward()
        optimizer.step()

        running_loss += loss.item() * inputs.size(0) # Accumulate loss

    # Calculate average epoch loss
    epoch_loss = running_loss / len(dataset) # Use the total number of examples
    print(f"Epoch {epoch+1}/{num_epochs}, Training Loss: {epoch_loss:.4f}")

    # Optional: Evaluation on validation set
    # if val_dataloader:
    #     model.eval() # Set the model to evaluation mode
    #     val_running_loss = 0.0
    #     with torch.no_grad(): # Disable gradient calculation for evaluation
    #         for batch in val_dataloader:
    #             # Move inputs and labels to the appropriate device
    #             # inputs = inputs.to(device)
    #             # labels = labels.to(device)
    #             # attention_mask = batch['attention_mask'] # If used

    #             # Forward pass
    #             # outputs = model(inputs, attention_mask=attention_mask) # Adapt as needed
    #             # For SimpleLanguageModel:
    #             # outputs = model(inputs)

    #             # Calculate the loss
    #             # loss = criterion(outputs.view(-1, outputs.size(-1)), labels.view(-1))

    #             # val_running_loss += loss.item() * inputs.size(0) # Accumulate loss

    #     # val_epoch_loss = val_running_loss / len(val_dataloader.dataset)
    #     # print(f"Epoch {epoch+1}/{num_epochs}, Validation Loss: {val_epoch_loss:.4f}")

print("\nTraining finished.")

Epoch 1/10, Training Loss: 21.6529
Epoch 2/10, Training Loss: 21.4249
Epoch 3/10, Training Loss: 21.2041
Epoch 4/10, Training Loss: 20.9294
Epoch 5/10, Training Loss: 20.5476
Epoch 6/10, Training Loss: 19.8552


Here's a structure for a conversational dataset using special tokens. You'll need to replace the example data with your actual conversational text.

In [9]:
# Define special tokens
prompt_start_token = "<|prompt|>"
prompt_end_token = "<|endofprompt|>"
response_start_token = "<|response|>"
response_end_token = "<|endofresponse|>"

# Example raw conversational data (replace with your actual data)
raw_conversations = [
    {
        "prompt": "What is the capital of France?",
        "response": "The capital of France is Paris."
    },
    {
        "prompt": "Tell me about the weather today.",
        "response": "I'm sorry, I don't have real-time weather information."
    },
    {
        "prompt": "What are your capabilities?",
        "response": "I am a large language model, trained by Google."
    },
    {
        "prompt": "How do I train a neural network?",
        "response": "Training a neural network involves defining the architecture, compiling the model with an optimizer and loss function, and iterating over your data in a training loop."
    },
    {
        "prompt": "Can you write a poem?",
        "response": "Yes, I can try to write a poem for you. What would you like it to be about?"
    },
    {
        "prompt": "What is the meaning of life?",
        "response": "That is a deeply philosophical question that has been debated for centuries."
    }
]

# Format the data with special tokens
formatted_conversations = []
for convo in raw_conversations:
    formatted_prompt = f"{prompt_start_token}User: {convo['prompt']}{prompt_end_token}"
    formatted_response = f"{response_start_token}{convo['response']}{response_end_token}"
    # For training, you might concatenate prompt and response
    training_example = f"{formatted_prompt}{formatted_response}"
    formatted_conversations.append(training_example)

# Now 'formatted_conversations' contains your data ready for tokenization
# Example:
# print(formatted_conversations[0])

In [10]:
!pip install transformers

from transformers import GPT2Tokenizer

# Define your special tokens (make sure these match the ones in your data)
special_tokens = ["<|prompt|>", "<|endofprompt|>", "<|response|>", "<|endofresponse|>"]

# Load a pre-trained tokenizer (e.g., GPT-2)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Add the special tokens to the tokenizer's vocabulary
tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})

# Tokenize the formatted conversations
# This will return a list of token IDs for each conversation
tokenized_data = [tokenizer.encode(text) for text in formatted_conversations]

print(f"Original formatted conversation example:\n{formatted_conversations[0]}")
print(f"\nTokenized example (first conversation):\n{tokenized_data[0]}")
print(f"\nVocabulary size after adding special tokens: {len(tokenizer)}")

# You might want to pad or truncate sequences to a fixed length later
# for creating batches for training.



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Original formatted conversation example:
<|prompt|>User: What is the capital of France?<|endofprompt|><|response|>The capital of France is Paris.<|endofresponse|>

Tokenized example (first conversation):
[50257, 12982, 25, 1867, 318, 262, 3139, 286, 4881, 30, 50258, 50259, 464, 3139, 286, 4881, 318, 6342, 13, 50260]

Vocabulary size after adding special tokens: 50261


In [11]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer # Assuming you are using the GPT2Tokenizer from the previous step

# Define your special tokens (make sure these match the ones used for tokenization)
prompt_start_token = "<|prompt|>"
prompt_end_token = "<|endofprompt|>"
response_start_token = "<|response|>"
response_end_token = "<|endofresponse|>"
special_tokens = [prompt_start_token, prompt_end_token, response_start_token, response_end_token]


# Assuming 'tokenized_data' is the list of token IDs from the previous step
# Assuming 'tokenizer' is the tokenizer object from the previous step

# Define a maximum sequence length (adjust as needed)
max_sequence_length = 512

class ConversationDataset(Dataset):
    def __init__(self, tokenized_data, max_length, tokenizer):
        self.tokenized_data = tokenized_data
        self.max_length = max_length
        self.tokenizer = tokenizer
        self.input_ids = []
        self.attention_masks = []
        self.labels = [] # For language modeling, labels are typically the next token

        for token_list in self.tokenized_data:
            # For language modeling, the target for an input sequence is the sequence shifted by one token
            # e.g., input: [token1, token2, token3], target: [token2, token3, pad_token]
            # We will predict the next token based on the current token(s)

            # Truncate if necessary
            if len(token_list) > self.max_length:
                token_list = token_list[:self.max_length]

            # Create input and labels
            # Input sequence is all tokens except the last one
            input_seq = token_list[:-1]
            # Label sequence is all tokens except the first one
            label_seq = token_list[1:]

            # Pad sequences
            padding_length = self.max_length - len(input_seq)
            if padding_length > 0:
                input_seq = input_seq + [self.tokenizer.pad_token_id] * padding_length
                label_seq = label_seq + [-100] * padding_length # Use -100 for padding in CrossEntropyLoss

            # Create attention mask (1 for real tokens, 0 for padding)
            attention_mask = [1] * len(token_list) + [0] * (self.max_length - len(token_list))
            attention_mask = attention_mask[:self.max_length]


            self.input_ids.append(torch.tensor(input_seq, dtype=torch.long))
            self.attention_masks.append(torch.tensor(attention_mask, dtype=torch.long))
            self.labels.append(torch.tensor(label_seq, dtype=torch.long))


    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_masks[idx],
            'labels': self.labels[idx]
        }

# Create the dataset
# Make sure the tokenizer has a pad_token_id
if tokenizer.pad_token_id is None:
    tokenizer.add_special_tokens({'pad_token': '<|pad|>'}) # Add a pad token if not present
    # You might need to resize your model's embedding layer after adding new tokens
    # model.resize_token_embeddings(len(tokenizer))


dataset = ConversationDataset(tokenized_data, max_sequence_length, tokenizer)

# Create a DataLoader
batch_size = 4 # Define your batch size
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

print(f"Dataset created with {len(dataset)} examples.")
print(f"DataLoader created with batch size {batch_size}.")

# Example of iterating through the DataLoader
# for batch in dataloader:
#     print("Input IDs batch shape:", batch['input_ids'].shape)
#     print("Attention Mask batch shape:", batch['attention_mask'].shape)
#     print("Labels batch shape:", batch['labels'].shape)
#     break # Just show one batch example

Dataset created with 6 examples.
DataLoader created with batch size 4.


In [19]:
!pip install datasets transformers

from datasets import load_dataset

# Load a sample conversational dataset (e.g., "dialogsum")
# You can explore other datasets on the Hugging Face Hub
dataset = load_dataset("blended_skill_talk")

# Print information about the dataset
print(dataset)

# Access a split (e.g., 'train', 'validation', 'test')
train_dataset = dataset['train']

# Print an example from the dataset
print("\nExample from the training dataset:")
print(train_dataset[0])



README.md: 0.00B [00:00, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/5.88M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/2.62M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/2.40M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/4819 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1009 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/980 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['personas', 'additional_context', 'previous_utterance', 'context', 'free_messages', 'guided_messages', 'suggestions', 'guided_chosen_suggestions', 'label_candidates'],
        num_rows: 4819
    })
    validation: Dataset({
        features: ['personas', 'additional_context', 'previous_utterance', 'context', 'free_messages', 'guided_messages', 'suggestions', 'guided_chosen_suggestions', 'label_candidates'],
        num_rows: 1009
    })
    test: Dataset({
        features: ['personas', 'additional_context', 'previous_utterance', 'context', 'free_messages', 'guided_messages', 'suggestions', 'guided_chosen_suggestions', 'label_candidates'],
        num_rows: 980
    })
})

Example from the training dataset:
{'personas': ["i've 2 kids.", 'i love flowers.'], 'additional_context': '', 'previous_utterance': ["I love live music, that's why I try to go to concerts", 'I do too. Wat do you like?'], 'context': 'empathetic_dialogues', 'free_me

In [20]:
# Assuming 'dataset' is the loaded blended_skill_talk dataset
# Assuming special tokens are defined: prompt_start_token, prompt_end_token, response_start_token, response_end_token

formatted_blended_conversations = []

# Iterate through the training split of the dataset
for example in dataset['train']:
    # The structure of blended_skill_talk is a bit complex, with multiple message lists.
    # We'll create simple prompt-response pairs from previous_utterance and free_messages/guided_messages
    # You might need to adapt this based on your specific needs

    # Combine previous utterances and free/guided messages to form a dialogue history
    dialogue_history = example['previous_utterance'] + example['free_messages'] + example['guided_messages']

    # Create prompt-response pairs from the dialogue history
    # For simplicity, let's create pairs of consecutive turns as prompt and response
    for i in range(len(dialogue_history) - 1):
        prompt = dialogue_history[i]
        response = dialogue_history[i+1]

        formatted_prompt = f"{prompt_start_token}User: {prompt}{prompt_end_token}"
        formatted_response = f"{response_start_token}{response}{response_end_token}"

        # For training, concatenate prompt and response
        training_example = f"{formatted_prompt}{formatted_response}"
        formatted_blended_conversations.append(training_example)

# 'formatted_blended_conversations' now contains the data formatted with special tokens
print(f"Created {len(formatted_blended_conversations)} formatted conversational examples from the training split.")
# Example:
# print(formatted_blended_conversations[0])

Created 58855 formatted conversational examples from the training split.


In [21]:
# Assuming 'tokenizer' is your tokenizer object with added special tokens
# Assuming 'formatted_blended_conversations' is the list of formatted conversations

# Tokenize the formatted blended conversations
tokenized_blended_data = [tokenizer.encode(text) for text in formatted_blended_conversations]

print(f"Tokenized example from blended_skill_talk (first conversation):\n{tokenized_blended_data[0]}")
print(f"Number of tokenized examples from blended_skill_talk: {len(tokenized_blended_data)}")

Tokenized example from blended_skill_talk (first conversation):
[50257, 12982, 25, 314, 1842, 2107, 2647, 11, 326, 338, 1521, 314, 1949, 284, 467, 284, 28565, 50258, 50259, 40, 466, 1165, 13, 12242, 466, 345, 588, 30, 50260]
Number of tokenized examples from blended_skill_talk: 58855


In [22]:
import torch
from torch.utils.data import Dataset, DataLoader
# Assuming ConversationDataset class is defined in a previous cell
# Assuming tokenizer is your tokenizer object with added special tokens

# Assuming 'tokenized_blended_data' is the list of token IDs from the previous step

# Define a maximum sequence length (adjust as needed, consider the distribution of your data lengths)
max_sequence_length = 512 # You might want to analyze your data to determine a suitable max length

# Create the dataset using the tokenized blended data
# Make sure the tokenizer has a pad_token_id if you are using padding
if tokenizer.pad_token_id is None:
     tokenizer.add_special_tokens({'pad_token': '<|pad|>'}) # Add a pad token if not present
     # If you added a new token, you might need to resize your model's embedding layer
     # model.resize_token_embeddings(len(tokenizer))


blended_dataset = ConversationDataset(tokenized_blended_data, max_sequence_length, tokenizer)

# Create a DataLoader for the blended dataset
batch_size = 8 # Define your batch size (you might increase this for a larger dataset and GPU)
blended_dataloader = DataLoader(blended_dataset, batch_size=batch_size, shuffle=True)

print(f"Blended Dataset created with {len(blended_dataset)} examples.")
print(f"Blended DataLoader created with batch size {batch_size}.")

# Example of iterating through the DataLoader
# for batch in blended_dataloader:
#     print("Input IDs batch shape:", batch['input_ids'].shape)
#     print("Attention Mask batch shape:", batch['attention_mask'].shape)
#     print("Labels batch shape:", batch['labels'].shape)
#     break # Just show one batch example

Blended Dataset created with 58855 examples.
Blended DataLoader created with batch size 8.


In [29]:
import torch

# Assuming you have a validation or test data loader
# val_dataloader = ... # Your validation or test data loader

# Assuming you have your trained model and criterion (loss function) defined
# model = ... # Your trained model instance
# criterion = ... # Your loss function

model.eval() # Set the model to evaluation mode
running_loss = 0.0
# You can also initialize metrics here, e.g., accuracy, perplexity

with torch.no_grad(): # Disable gradient calculation for evaluation
    # Iterate over the validation/test data in batches
    for batch in val_dataloader:
        # Assuming inputs and labels are tensors on the correct device
        # inputs = batch['input_ids'].to(device)
        # labels = batch['labels'].to(device)
        # attention_mask = batch['attention_mask'].to(device) # If used
        # model.to(device) # Ensure model is on the correct device


        # Forward pass
        # outputs = model(inputs, attention_mask=attention_mask) # Adapt as needed
        # For SimpleLanguageModel:
        inputs = batch['input_ids']
        labels = batch['labels']
        outputs = model(inputs)


        # Calculate the loss
        # loss = criterion(outputs.view(-1, outputs.size(-1)), labels.view(-1))
        loss = criterion(outputs.view(-1, outputs.size(-1)), labels.view(-1))


        running_loss += loss.item() * inputs.size(0) # Accumulate loss
        # You can also calculate and accumulate other metrics here


# Calculate average loss and/or other metrics over the evaluation set
avg_loss = running_loss / len(val_dataloader.dataset) # Adjust based on your data size
print(f"Evaluation Loss: {avg_loss:.4f}")

# Print other calculated metrics as needed

print("\nEvaluation finished.")

KeyboardInterrupt: 

In [26]:
import torch
from torch.utils.data import Dataset, DataLoader

# Assuming 'dataset' is the loaded blended_skill_talk dataset
# Assuming ConversationDataset class is defined in a previous cell
# Assuming tokenizer is your tokenizer object with added special tokens
# Assuming max_sequence_length is defined

# Access the validation split of the dataset
val_dataset_raw = dataset['validation']

# Extract and format conversational turns from the validation split
formatted_blended_conversations_val = []
for example in val_dataset_raw:
    dialogue_history = example['previous_utterance'] + example['free_messages'] + example['guided_messages']
    for i in range(len(dialogue_history) - 1):
        prompt = dialogue_history[i]
        response = dialogue_history[i+1]
        formatted_prompt = f"{prompt_start_token}User: {prompt}{prompt_end_token}"
        formatted_response = f"{response_start_token}{response}{response_end_token}"
        training_example = f"{formatted_prompt}{formatted_response}"
        formatted_blended_conversations_val.append(training_example)

# Tokenize the formatted validation conversations
tokenized_blended_data_val = [tokenizer.encode(text) for text in formatted_blended_conversations_val]

# Create the validation dataset using the ConversationDataset class
# Make sure the tokenizer has a pad_token_id if you are using padding
if tokenizer.pad_token_id is None:
     tokenizer.add_special_tokens({'pad_token': '<|pad|>'}) # Add a pad token if not present


val_dataset = ConversationDataset(tokenized_blended_data_val, max_sequence_length, tokenizer)

# Create a DataLoader for the validation dataset
val_batch_size = 8 # Define your validation batch size (can be larger than training batch size)
val_dataloader = DataLoader(val_dataset, batch_size=val_batch_size) # Shuffle is usually False for validation

print(f"Blended Validation Dataset created with {len(val_dataset)} examples.")
print(f"Blended Validation DataLoader created with batch size {val_batch_size}.")

Blended Validation Dataset created with 12311 examples.
Blended Validation DataLoader created with batch size 8.


In [32]:
import torch

# Assuming your trained model is in the 'model' variable
# Assuming your tokenizer is in the 'tokenizer' variable
# Assuming your special tokens are defined (prompt_start_token, prompt_end_token, response_start_token, response_end_token)
# Assuming max_sequence_length is defined

def generate_response(model, tokenizer, prompt, max_length=100, temperature=1.0, top_k=None, top_p=None, device='cpu'):
    """Generates a response from the model given a prompt."""
    model.eval() # Set model to evaluation mode
    model.to(device) # Move model to device

    # Format the prompt with special tokens
    formatted_prompt = f"{prompt_start_token}User: {prompt}{prompt_end_token}{response_start_token}"

    # Tokenize the formatted prompt
    input_ids = tokenizer.encode(formatted_prompt, return_tensors='pt').to(device)

    output_sequence = input_ids.tolist()[0] # Start with the input sequence

    with torch.no_grad():
        for _ in range(max_length):
            # Get the model's prediction for the next token
            # For a simple model, we might only feed the last token or the whole sequence
            # Let's feed the whole sequence for now
            input_tensor = torch.tensor([output_sequence], dtype=torch.long).to(device)

            # Ensure input_tensor does not exceed model's max_position_embeddings if applicable
            # Removed check for model.config as SimpleLanguageModel does not have it
            # if hasattr(model.config, 'max_position_embeddings') and input_tensor.size(1) > model.config.max_position_embeddings:
            #      input_tensor = input_tensor[:, -model.config.max_position_embeddings:]


            outputs = model(input_tensor)

            # Get the logits for the last token
            logits = outputs[0, -1, :] # Shape: (vocab_size,)

            # Apply temperature
            logits = logits / temperature

            # Apply top-k filtering
            if top_k is not None:
                top_k = min(top_k, logits.size(-1))  # Limit top_k to the vocabulary size
                indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
                logits[indices_to_remove] = -float('Inf')

            # Apply top-p (nucleus) sampling
            if top_p is not None:
                sorted_logits, sorted_indices = torch.sort(logits, descending=True)
                cumulative_probs = torch.cumsum(torch.softmax(sorted_logits, dim=-1), dim=-1)

                # Remove tokens with cumulative probability above the threshold
                sorted_indices_to_remove = cumulative_probs > top_p
                # Shift the indices to the right to keep the first token above the threshold
                sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
                sorted_indices_to_remove[..., 0] = 0

                # Scatter sorted tensors to original indexing
                indices_to_remove = sorted_indices[sorted_indices_to_remove]
                logits[indices_to_remove] = -float('Inf')


            # Sample the next token
            probs = torch.softmax(logits, dim=-1)
            next_token_id = torch.multinomial(probs, num_samples=1).item()


            # Append the generated token to the sequence
            output_sequence.append(next_token_id)

            # Stop if the end-of-response token is generated
            if next_token_id == tokenizer.encode(response_end_token)[0]:
                break

    # Decode the generated sequence
    generated_text = tokenizer.decode(output_sequence, skip_special_tokens=False) # Keep special tokens for now

    return generated_text

# Example usage:
prompt = "Tell me about the weather today."
# Make sure 'model', 'tokenizer', 'prompt_start_token', etc. are defined
# If using GPU, set device='cuda'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

generated_response = generate_response(model, tokenizer, prompt, device=device)

print("Prompt:", prompt)
print("Generated Response:", generated_response)

Prompt: Tell me about the weather today.
Generated Response: <|prompt|> User: Tell me about the weather today. <|endofprompt|> <|response|> . <|endofresponse|>
