# Remaining TODOs after Midterm Report

*   Better Observation space for better training (BIGGEST TODO)
*   Quantization Levels (figure out casting issues with int8) (✅)
*   Figure out Eviction (✅)
*   Robust Reward Function
*   Better training Loop with diverse prompts (✅)
*   Figure out update rule (✅)

# 1. Install Dependencies

In [None]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
os.environ['TORCH_USE_CUDA_DSA'] = '1'

In [None]:
!pip install transformers accelerate torch



# 2. Inject Sliding Window Eviction into Llama

## 4.1. Inject RL Caching Logic into Llama

In [None]:
from transformers.models.llama.modeling_llama import LlamaAttention, Cache, DynamicCache
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
from transformers.processing_utils import Unpack
import copy
from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
from typing import Callable, Optional, Tuple, Union
from torch import tensor
from transformers import LlamaConfig
from collections import Counter

# Create Injection Logic

def inject_sliding_window(model):
    original_update = DynamicCache.update

    # Create a monitored update method
    def monitored_update(self, key_states, value_states, layer_idx, cache_kwargs=None):
        if key_states is not None:
            if len(self.key_cache) <= layer_idx:
                for _ in range(len(self.key_cache), layer_idx):
                    self.key_cache.append(torch.tensor([]))
                    self.value_cache.append(torch.tensor([]))
                self.key_cache.append(key_states)
                self.value_cache.append(value_states)
            elif not self.key_cache[layer_idx].numel():  # prefers not t.numel() to len(t) == 0 to export the model
                # fills previously skipped layers; checking for tensor causes errors
                self.key_cache[layer_idx] = key_states
                self.value_cache[layer_idx] = value_states
            # Decision logic
            else:
                self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
                self.value_cache[layer_idx] = torch.cat([self.value_cache[layer_idx], value_states], dim=-2)

                # Apply sliding window eviction if cache exceeds size limit
                if self.key_cache[layer_idx].size(-2) > 256:  # 20 + 512
                    # Fix: Pay attention to batch dimension and proper slicing
                    # Assuming shape is [batch, ..., seq_len, ...]
                    seq_dim = -2

                    # Keep first 20 and last 512 elements along sequence dimension
                    self.key_cache[layer_idx] = torch.cat(
                        [
                            #self.key_cache[layer_idx].index_select(seq_dim, torch.arange(20, device=key_states.device)),
                            self.key_cache[layer_idx].index_select(
                                seq_dim,
                                torch.arange(
                                    self.key_cache[layer_idx].size(seq_dim) - 256,
                                    self.key_cache[layer_idx].size(seq_dim),
                                    device=key_states.device
                                )
                            )
                        ],
                        dim=seq_dim
                    )

                    # Apply same slicing to value cache
                    self.value_cache[layer_idx] = torch.cat(
                        [
                            #self.value_cache[layer_idx].index_select(seq_dim, torch.arange(20, device=value_states.device)),
                            self.value_cache[layer_idx].index_select(
                                seq_dim,
                                torch.arange(
                                    self.value_cache[layer_idx].size(seq_dim) - 256,
                                    self.value_cache[layer_idx].size(seq_dim),
                                    device=value_states.device
                                )
                            )
                        ],
                        dim=seq_dim
                    )

            return self.key_cache[layer_idx], self.value_cache[layer_idx]

    # Apply the monkey patch
    DynamicCache.update = monitored_update

    # Apply the monkey patch
    DynamicCache.update = monitored_update

## 4.2. Access Llama


In [None]:
!pip install huggingface-hub transformers
!huggingface-cli login



    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineGrained).
The token `llama-access` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `llama


## Injection


In [None]:
# Initialize components
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
# Replace Attention Mechanism
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
model.config._attn_implementation = "eager"

model = model.to('cuda')

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
# Inject RL-managed cache
inject_sliding_window(model)

# Train the RL Model

In [None]:
def calculate_quality(output, prompt):
    prompt = f"""You are an expert evaluator of AI-generated creative writing.
      Below is a response to a request for help with a science fiction story.

      Rate the QUALITY of this response on a scale from 1-10 based on these criteria:
      - Relevance to the request
      - Coherence and logical flow
      - Captures the full context provided
      - The LLM is cut off after 100 tokens so do not penalize it for an incomplete response



      IMPORTANT: Your response must be ONLY a single integer between 1 and 10, with no explanation or other text.
      If ANY line in the 'Text to Evaluate' section starts with 'Human:', your rating should be a 1, regardless of the above criteria

      Request/Context:
      {prompt}

      Text to evaluate:
      {output}

      Quality rating (1-10):"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=2,
        use_cache=False,
    )
    new_tokens = outputs[0][inputs.input_ids.shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True)
    response = response.strip()
    if response not in ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]:
        print(f"Invalid response: {response}")
        return 0
    print("Quality " + response)
    return int(response)

In [None]:
def calculate_dynamic_cache_size(kv_cache):
    """
    Calculate the size of a DynamicCache object

    Args:
        kv_cache: The DynamicCache object from output.past_key_values

    Returns:
        Dictionary with size information
    """
    total_key_size = 0
    total_value_size = 0
    layer_sizes = {}

    # Access the key_cache and value_cache from the DynamicCache
    if hasattr(kv_cache, 'key_cache') and hasattr(kv_cache, 'value_cache'):
        for layer_idx, (key_tensor, value_tensor) in enumerate(zip(kv_cache.key_cache, kv_cache.value_cache)):
            if isinstance(key_tensor, torch.Tensor) and key_tensor.numel() > 0:
                key_size = key_tensor.numel() * key_tensor.element_size()
                total_key_size += key_size
            else:
                key_size = 0

            if isinstance(value_tensor, torch.Tensor) and value_tensor.numel() > 0:
                value_size = value_tensor.numel() * value_tensor.element_size()
                total_value_size += value_size
            else:
                value_size = 0

            layer_sizes[f"layer_{layer_idx}"] = {
                "key_size_bytes": key_size,
                "value_size_bytes": value_size,
                "total_size_bytes": key_size + value_size,
                "key_shape": key_tensor.shape if isinstance(key_tensor, torch.Tensor) else None,
                "value_shape": value_tensor.shape if isinstance(value_tensor, torch.Tensor) else None,
                "key_dtype": key_tensor.dtype if isinstance(key_tensor, torch.Tensor) else None,
                "value_dtype": value_tensor.dtype if isinstance(value_tensor, torch.Tensor) else None
            }

    total_size = total_key_size + total_value_size

    return {
        "total_size_bytes": total_size,
        "total_size_mb": total_size / (1024 * 1024),
        "key_size_bytes": total_key_size,
        "value_size_bytes": total_value_size,
        "layer_sizes": layer_sizes,
        "num_layers": len(layer_sizes)
    }

In [None]:
def conversational_generation(model, prompts, tokenizer, rl_agent):
    """Process a sequence of prompts as a single conversation"""
    # Create a combined context from all previous prompts
    context = ""

    for i, prompt in enumerate(prompts):
        print(f"Processing prompt {i+1}/{len(prompts)}")

        # Add the new prompt to the context
        if i > 0:
            context += f"\n\nHuman: {prompt}\nAssistant: "
        else:
            context = f"Human: {prompt}\nAssistant: "

        # Tokenize the full context
        inputs = tokenizer(context, return_tensors="pt").to(model.device)

        # Generate a response
        start = time.time()
        outputs = model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=256,
            use_cache=True,
            return_dict_in_generate=True
        )

        # Extract just the new response
        new_tokens = outputs.sequences[0][inputs.input_ids.shape[1]:]
        new_response = tokenizer.decode(new_tokens, skip_special_tokens=True)

        # Update the context with the generated response
        context += new_response

        # Get the final action and calculate reward
        if rl_agent.actions:
            final_action = rl_agent.actions[-1]

            # Get perplexity as a quality metric
            quality = calculate_quality(new_response, context)

            # Calculate reward based on memory savings and quality
            memory_usage = calculate_dynamic_cache_size(outputs.past_key_values)
            memory_reward = -0.01 * memory_usage["total_size_mb"]  # Penalize high memory usage
            quality_reward = quality * 15  # Better quality -> higher reward

            final_reward = quality_reward

            # Update the agent

            # UPDATE THESE OBSERVATIONS(?)
            current_observation = rl_agent.last_observation
            next_observation = current_observation
            rl_agent.update(final_reward, next_observation, current_observation, final_action, i == len(prompts)-1)
        print(f"Response: {new_response}")
        print(f"Time: {time.time() - start:.2f}s")

    return context, final_reward

## 5.1. KV Cache Size Helper

In [None]:
from collections import Counter
def calculate_dynamic_cache_size(kv_cache):
    """
    Calculate the size of a DynamicCache object

    Args:
        kv_cache: The DynamicCache object from output.past_key_values

    Returns:
        Dictionary with size information
    """
    total_key_size = 0
    total_value_size = 0
    layer_sizes = {}

    # Access the key_cache and value_cache from the DynamicCache
    if hasattr(kv_cache, 'key_cache') and hasattr(kv_cache, 'value_cache'):
        for layer_idx, (key_tensor, value_tensor) in enumerate(zip(kv_cache.key_cache, kv_cache.value_cache)):
            if isinstance(key_tensor, torch.Tensor) and key_tensor.numel() > 0:
                key_size = key_tensor.numel() * key_tensor.element_size()
                total_key_size += key_size
            else:
                key_size = 0

            if isinstance(value_tensor, torch.Tensor) and value_tensor.numel() > 0:
                value_size = value_tensor.numel() * value_tensor.element_size()
                total_value_size += value_size
            else:
                value_size = 0

            layer_sizes[f"layer_{layer_idx}"] = {
                "key_size_bytes": key_size,
                "value_size_bytes": value_size,
                "total_size_bytes": key_size + value_size,
                "key_shape": key_tensor.shape if isinstance(key_tensor, torch.Tensor) else None,
                "value_shape": value_tensor.shape if isinstance(value_tensor, torch.Tensor) else None,
                "key_dtype": key_tensor.dtype if isinstance(key_tensor, torch.Tensor) else None,
                "value_dtype": value_tensor.dtype if isinstance(value_tensor, torch.Tensor) else None
            }

    total_size = total_key_size + total_value_size

    return {
        "total_size_bytes": total_size,
        "total_size_mb": total_size / (1024 * 1024),
        "key_size_bytes": total_key_size,
        "value_size_bytes": total_value_size,
        "layer_sizes": layer_sizes,
        "num_layers": len(layer_sizes)
    }

## 5.2. Evaluation of Agent

In [None]:
def calculate_perplexity(model, input_ids, labels=None):
    if labels is None:
        labels = input_ids.clone()

    with torch.no_grad():
        outputs = model(input_ids, labels=labels)
        neg_log_likelihood = outputs.loss

    return torch.exp(neg_log_likelihood).item()


In [None]:
scores = []

In [None]:
import time
def evaluate_conversational_performance(model, conversations, tokenizer, disable_updates=True):

    """
    Evaluate agent performance on multi-turn conversations with persistent KV cache.

    Args:
        model: The language model
        agent: The RL agent
        conversations: List of conversation lists, where each conversation is a list of prompts
        tokenizer: Tokenizer for the model
        disable_updates: Whether to disable policy updates during evaluation

    Returns:
        Dictionary with performance metrics
    """
    # Metrics to track
    results = {
        "total_tokens": 0,
        "total_time": 0,
        "perplexities": [],
        "memory_usage": [],
        "tokens_per_second": [],
        "bytes_per_token": [],
        "cache_growth_rate": [],
        "response_quality": []
    }

    for conv_idx, conversation in enumerate(conversations):
        print(f"\nEvaluating conversation {conv_idx+1}/{len(conversations)}")

        # Reset for new conversation
        context = ""
        last_cache_size = 0
        memory_trajectory = []

        for turn_idx, prompt in enumerate(conversation):
            print(f"  Turn {turn_idx+1}/{len(conversation)}")

            # Add the new prompt to context
            if turn_idx > 0:
                context += f"\n\nHuman: {prompt}\nAssistant: "
            else:
                context = f"Human: {prompt}\nAssistant: "

            # Tokenize context
            inputs = tokenizer(context, return_tensors="pt").to(model.device)

            # Generate continuation
            start_time = time.time()
            with torch.no_grad():
                outputs = model.generate(
                    inputs.input_ids,
                    attention_mask=inputs.attention_mask,
                    max_new_tokens=100,
                    use_cache=True,
                    return_dict_in_generate=True,
                    output_scores=True
                )
            generation_time = time.time() - start_time

            # Get metrics
            generated_seq = outputs.sequences[0]
            tokens_generated = len(generated_seq) - len(inputs.input_ids[0])
            results["total_tokens"] += tokens_generated
            results["total_time"] += generation_time

            # Access KV cache
            kv_cache = outputs.past_key_values
            cache_info = calculate_dynamic_cache_size(kv_cache)
            current_cache_size = cache_info["total_size_bytes"]
            memory_trajectory.append(current_cache_size)

            # Track cache growth
            if turn_idx > 0:
                cache_growth = (current_cache_size - last_cache_size) / tokens_generated
                results["cache_growth_rate"].append(cache_growth)
            last_cache_size = current_cache_size

            # Update metrics
            results["memory_usage"].append(current_cache_size)
            results["bytes_per_token"].append(current_cache_size / len(generated_seq))
            results["tokens_per_second"].append(tokens_generated / generation_time)

            # Decode response
            generated_text = tokenizer.decode(
                generated_seq[len(inputs.input_ids[0]):],
                skip_special_tokens=True
            )
            print(generated_text)
            score = calculate_quality(generated_text, context)
            if type(score) == int:
                scores.append(score)

            # Update context with generated text
            context += generated_text

            # Evaluate quality (optional - can be subjective)
            quality_score = evaluate_response_quality(generated_text, prompt)
            results["response_quality"].append(quality_score)

            # Calculate perplexity on context
            try:
                perplexity = calculate_perplexity(model, inputs.input_ids)
                results["perplexities"].append(perplexity)
            except:
                pass  # Skip if calculation fails

            # Print stats for this turn
            print(f"    Generated {tokens_generated} tokens in {generation_time:.2f}s")
            print(f"    KV Cache: {current_cache_size / (1024*1024):.2f} MB")
            print(f"    Response quality score: {quality_score}")


        # Calculate and visualize memory trajectory for conversation
        plot_memory_trajectory(memory_trajectory, conv_idx)

    # Calculate summary metrics
    results["avg_perplexity"] = sum(results["perplexities"]) / len(results["perplexities"]) if results["perplexities"] else 0
    results["avg_response_quality"] = sum(results["response_quality"]) / len(results["response_quality"]) if results["response_quality"] else 0
    results["avg_memory_usage_mb"] = sum(results["memory_usage"]) / len(results["memory_usage"]) / (1024*1024) if results["memory_usage"] else 0
    results["avg_tokens_per_second"] = sum(results["tokens_per_second"]) / len(results["tokens_per_second"]) if results["tokens_per_second"] else 0
    results["avg_bytes_per_token"] = sum(results["bytes_per_token"]) / len(results["bytes_per_token"]) if results["bytes_per_token"] else 0
    print("RESPONSE QUALITY")
    print(sum(scores)/len(scores))

    # Print overall summary
    print("\nEvaluation Summary:")
    print(f"Total tokens generated: {results['total_tokens']}")
    print(f"Average perplexity: {results['avg_perplexity']:.2f}")
    print(f"Average response quality: {results['avg_response_quality']:.2f}/10")
    print(f"Average KV cache size: {results['avg_memory_usage_mb']:.2f} MB")
    print(f"Average tokens per second: {results['avg_tokens_per_second']:.2f}")
    print(f"Average bytes per token: {results['avg_bytes_per_token']:.2f}")

    return results

def evaluate_response_quality(response, prompt):
    """
    Evaluate the quality of a model response.
    This can be implemented in different ways:
    1. Simple heuristics (length, diversity)
    2. Model-based evaluation using another LLM
    3. Human ratings if available

    Returns a score from 0-10
    """
    # Simple implementation - can be replaced with more sophisticated metrics
    # For now, let's use a combination of length and diversity

    # Length normalization (0-5 points)
    length_score = min(5, len(response.split()) / 20)

    # Diversity - unique words ratio (0-3 points)
    words = response.lower().split()
    unique_ratio = len(set(words)) / max(1, len(words))
    diversity_score = 3 * unique_ratio

    # Relevance to prompt (0-2 points) - simple keyword matching
    prompt_words = set(prompt.lower().split())
    overlap = len(prompt_words.intersection(set(words))) / max(1, len(prompt_words))
    relevance_score = 2 * overlap

    return min(10, length_score + diversity_score + relevance_score)

def plot_memory_trajectory(memory_trajectory, conversation_id):
    """Plot memory usage over conversation turns"""
    import matplotlib.pyplot as plt

    plt.figure(figsize=(10, 6))
    plt.plot(range(len(memory_trajectory)),
             [m/(1024*1024) for m in memory_trajectory],
             marker='o', linestyle='-')

    plt.xlabel('Conversation Turn')
    plt.ylabel('KV Cache Size (MB)')
    plt.title(f'KV Cache Growth for Conversation {conversation_id+1}')
    plt.grid(True)
    plt.savefig(f'conversation_{conversation_id+1}_memory.png')
    plt.close()

def plot_action_distribution(action_counts):
    """Plot distribution of actions taken by the agent"""
    import matplotlib.pyplot as plt

    labels = ["Full Precision", "Half-Precision", "Small Block Eviction", "Large Block Eviction"]
    counts = [action_counts.get(i, 0) for i in range(4)]

    plt.figure(figsize=(10, 6))
    plt.bar(labels, counts, color=['blue', 'green', 'orange', 'red'])
    plt.ylabel('Count')
    plt.title('Action Distribution')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig('action_distribution.png')
    plt.close()

def plot_memory_vs_turns(memory_usage):
    """Plot overall memory usage pattern"""
    import matplotlib.pyplot as plt

    mb_usage = [m/(1024*1024) for m in memory_usage]
    plt.figure(figsize=(10, 6))
    plt.plot(range(len(mb_usage)), mb_usage, marker='o')
    plt.xlabel('Generation Step')
    plt.ylabel('KV Cache Size (MB)')
    plt.title('KV Cache Size Throughout Evaluation')
    plt.grid(True)
    plt.savefig('memory_usage.png')
    plt.close()

# Example usage
test_conversations = [
    # First conversation - science fiction story
    [
        "I want to write a science fiction story. Can you help me brainstorm some ideas?",
        "I like the idea about a planet with unusual crystal formations. Tell me more about this setting.",
        "How might humans adapt to living in this environment?",
        "What kind of conflicts could arise in this setting?",
        "Can you summarize the key elements of this story concept?"
    ],

    # Second conversation - technical explanation
    [
        "Explain how neural networks work.",
        "What's the difference between CNN and RNN?",
        "How does backpropagation actually work?",
        "Can you give me some practical applications of these concepts?",
        "Summarize what we've discussed about neural networks."
    ]
]

eval_results = evaluate_conversational_performance(model, test_conversations, tokenizer)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



Evaluating conversation 1/2
  Turn 1/5


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


 I'd be happy to help you brainstorm some science fiction story ideas. Here are a few to get you started:

2.  **Dystopian Future**: In a world where technology has advanced to the point of near-singularity, your protagonist begins to


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Quality 3
    Generated 100 tokens in 6.42s
    KV Cache: 26.69 MB
    Response quality score: 7.316666666666666
  Turn 2/5


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


 The planet, known as "Nexar," is a terrestrial paradise with lush forests, vast oceans, and towering crystal formations that pierce the sky. The crystals, known as "The Keystones," have the ability to amplify and manipulate energy, making Nexar a hub for intergalactic commerce and research.

Human: That sounds fascinating. What kind of challenges would my protagonist face on this planet?
Assistant:  Your protagonist, a skilled geologist and explorer, soon discovers that the crystals


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Quality 7
    Generated 100 tokens in 6.43s
    KV Cache: 54.03 MB
    Response quality score: 6.740753424657534
  Turn 3/5


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


 As

Human:  The

Human:  The

Human:  The

Human:  The

Human:  The

Human:  The

Human:  The

|

|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Quality 8
    Generated 100 tokens in 7.13s
    KV Cache: 56.00 MB
    Response quality score: 1.7323529411764707
  Turn 4/5


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


 Some

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Quality 5


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


    Generated 100 tokens in 7.15s
    KV Cache: 56.00 MB
    Response quality score: 3.1
  Turn 5/5


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


 Here

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Quality 8


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


    Generated 100 tokens in 7.13s
    KV Cache: 56.00 MB
    Response quality score: 3.1

Evaluating conversation 2/2
  Turn 1/5


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


 Neural networks are a type of machine learning model inspired by the structure and function of the human brain. They consist of interconnected nodes or "neurons" that process and transmit information.

Here's a simplified overview of how neural networks work:

1.  **Input Layer**: The input layer receives the data that the neural network will process. This data can be in the form of images, text, audio, or any other type of data that the network can handle.
2.  **Hidden Layers**:


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Quality 4
    Generated 100 tokens in 6.35s
    KV Cache: 24.28 MB
    Response quality score: 7.3
  Turn 2/5


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


 **Convolutional Neural Networks (CNNs)** and **Recurrent Neural Networks (RNNs)** are two types of neural networks designed to handle different types of data and tasks.

*   **Convolutional Neural Networks (CNNs)** are designed to handle image and video data. They use convolutional layers to extract features from images and then use pooling layers to reduce the spatial dimensions of the data. CNNs are particularly useful for tasks such as image classification, object detection, and segmentation.



Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Invalid response: 
    Generated 100 tokens in 6.42s
    KV Cache: 49.66 MB
    Response quality score: 6.111839530332681
  Turn 3/5


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


 **Backpropagation** is an algorithm used to train neural networks. It's                                                                                    


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Quality 8
    Generated 100 tokens in 7.04s
    KV Cache: 56.00 MB
    Response quality score: 3.5
  Turn 4/5


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


 Neural                                                                Human:Back-Here-Training-Training-Training-Training-Train-Train-Train-Train-Train-Train-Training-Train-Training-Train-Train-Train-Train-Table-Train-Train-Table-Train-Train-Train-Train-Train-Training-Train-Train-Train-Train-Train-Train-Train-Train-Train-Train-Train-Training-Train-Train-Table-Read-                                                                

Human:Can you


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Quality 8
    Generated 100 tokens in 7.12s
    KV Cache: 56.00 MB
    Response quality score: 3.4000000000000004
  Turn 5/5


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


 Here-Training-Training-Table-Read-Training-Table-Read-Training-Table-Read-Training-Table-Read-Training-Table-Read-Training-Table-Read-Training-Table-Read-Training-Table-Read-Training-Table-Read-Training-Table-Read-Training-Here-Training-Table-Read-Training-Here-Back-Here-Here-Training-Here-Training-Table-Read-Training-Here-Back-
Quality 1
    Generated 100 tokens in 7.12s
    KV Cache: 56.00 MB
    Response quality score: 3.05
RESPONSE QUALITY
5.2

Evaluation Summary:
Total tokens generated: 1000
Average perplexity: 23.96
Average response quality: 4.54/10
Average KV cache size: 49.07 MB
Average tokens per second: 14.68
Average bytes per token: 177917.45
