# Test Reward Determination

This notebook checks whether a given model output would be rewarded against a reference solution by the `accuracy_reward` function from `trl.rewards`.


In [47]:
from trl.rewards import accuracy_reward


In [48]:
# Example output from the user
# Raw string so the LaTeX backslashes (\[, \(, \boxed) are kept literally
# and Python does not warn about invalid escape sequences.
output = r"""Completion (base LLM output):
 To determine the total number of characters in Pat's password, we need to sum the lengths of each part of the password. Let's break it down step by step:

1. **Lowercase Letters**: There are 8 letters.
   \[
   8
   \]

2. **Alternating Uppercase and Numbers**: This is an alternating pattern of uppercase letters and numbers. Since there are only two symbols in this pattern, there are \(8 / 2 = 4\) distinct positions for these symbols out of the 8 available slots.
   \[
   4
   \]

3. **Symbols on Both Ends**: Each password has one symbol at the beginning and one at the end. Therefore, there is 1 symbol on each end.
   \[
   1 + 1 = 2
   \]

Now, let's add up all the parts:
\[
8 + 4 + 2 = 14
\]

So, the total number of characters in Pat's password is \(\boxed{14}\).
"""

# Solution from the user
solution = "$14$"

print("Output:")
print(output)
print("\n" + "="*80 + "\n")
print("Solution:")
print(solution)


Output:
Completion (base LLM output):
 To determine the total number of characters in Pat's password, we need to sum the lengths of each part of the password. Let's break it down step by step:

1. **Lowercase Letters**: There are 8 letters.
   \[
   8
   \]

2. **Alternating Uppercase and Numbers**: This is an alternating pattern of uppercase letters and numbers. Since there are only two symbols in this pattern, there are \(8 / 2 = 4\) distinct positions for these symbols out of the 8 available slots.
   \[
   4
   \]

3. **Symbols on Both Ends**: Each password has one symbol at the beginning and one at the end. Therefore, there is 1 symbol on each end.
   \[
   1 + 1 = 2
   \]

Now, let's add up all the parts:
\[
8 + 4 + 2 = 14
\]

So, the total number of characters in Pat's password is \(\boxed{14}\).



Solution:
$14$


In [49]:
# Format the output for accuracy_reward
# accuracy_reward expects: [[{"content": "..."}], ...]
formatted_completions = [[{"content": output}]]

# Format the solution as a list
solution_list = [solution]

print("Formatted completions:", formatted_completions)
print("Solution list:", solution_list)


Formatted completions: [[{'content': "Completion (base LLM output):\n To determine the total number of characters in Pat's password, we need to sum the lengths of each part of the password. Let's break it down step by step:\n\n1. **Lowercase Letters**: There are 8 letters.\n   \\[\n   8\n   \\]\n\n2. **Alternating Uppercase and Numbers**: This is an alternating pattern of uppercase letters and numbers. Since there are only two symbols in this pattern, there are \\(8 / 2 = 4\\) distinct positions for these symbols out of the 8 available slots.\n   \\[\n   4\n   \\]\n\n3. **Symbols on Both Ends**: Each password has one symbol at the beginning and one at the end. Therefore, there is 1 symbol on each end.\n   \\[\n   1 + 1 = 2\n   \\]\n\nNow, let's add up all the parts:\n\\[\n8 + 4 + 2 = 14\n\\]\n\nSo, the total number of characters in Pat's password is \\(\\boxed{14}\\).\n"}]]
Solution list: ['$14$']
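
The nested layout printed above extends to several outputs at once. The following is a minimal sketch, assuming one completion per solution; `format_batch` is a hypothetical helper name introduced here for illustration, not part of TRL.

```python
def format_batch(outputs, solutions):
    """Wrap raw strings in the [[{"content": ...}], ...] layout used above.

    `format_batch` is a hypothetical helper, not a TRL function.
    """
    outputs = list(outputs)
    solutions = list(solutions)
    if len(outputs) != len(solutions):
        raise ValueError(
            f"Got {len(outputs)} outputs but {len(solutions)} solutions"
        )
    # One single-message conversation per output
    completions = [[{"content": text}] for text in outputs]
    return completions, solutions


completions, solutions = format_batch(["answer A", "answer B"], ["$1$", "$2$"])
print(completions)
print(solutions)
```

The length check is there because a mismatch between completions and solutions is easy to miss once the two lists are built separately.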


In [50]:
# Calculate the reward
try:
    rewards = accuracy_reward(formatted_completions, solution_list)
    
    # Normalize the return value (list/tuple, tensor-like, or plain number) to one scalar
    if isinstance(rewards, (list, tuple)):
        reward_val = rewards[0] if len(rewards) > 0 else 0.0
    elif hasattr(rewards, 'item'):
        reward_val = rewards.item()
    elif isinstance(rewards, (int, float)):
        reward_val = float(rewards)
    else:
        reward_val = 0.0
    
    print(f"Reward: {reward_val}")
    print(f"Will be rewarded: {'YES ✓' if reward_val > 0.5 else 'NO ✗'}")
    print(f"\nRaw reward output: {rewards}")
    print(f"Reward type: {type(rewards)}")
    
except Exception as e:
    print(f"Error calculating reward: {e}")
    import traceback
    traceback.print_exc()


Reward: 1.0
Will be rewarded: YES ✓

Raw reward output: [1.0]
Reward type: <class 'list'>
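
To build intuition for why this completion scores 1.0, here is a deliberately simplified stand-in: it takes the last `\boxed{...}` in the completion and compares it as a string with the solution after stripping the dollar signs. This is not how `accuracy_reward` works internally (the real function does proper math parsing and comparison); `last_boxed` and `naive_accuracy` are hypothetical names used only for this sketch.

```python
import re


def last_boxed(text):
    # Return the contents of the last \boxed{...} group, or None if there is none.
    # Nested braces inside the box are not supported in this sketch.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def naive_accuracy(completion_text, solution_text):
    # Plain string comparison: 1.0 on an exact match, 0.0 otherwise.
    answer = last_boxed(completion_text)
    gold = solution_text.strip().strip("$").strip()
    return 1.0 if answer is not None and answer == gold else 0.0


print(naive_accuracy(r"So the total is \(\boxed{14}\).", "$14$"))
print(naive_accuracy(r"So the total is \(\boxed{15}\).", "$14$"))
```

A string comparison like this would treat `14` and `14.0` as different answers, which is one reason to rely on the library function for the actual check.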


## Test with different outputs

Pass a different output and solution to `test_reward` below to test other cases.


In [51]:
def test_reward(output_text, solution_text):
    """
    Test if an output will be rewarded based on a solution.
    
    Args:
        output_text: The model output to test
        solution_text: The ground truth solution
        
    Returns:
        tuple: (reward_value, will_be_rewarded)
    """
    formatted_completions = [[{"content": output_text}]]
    solution_list = [solution_text]
    
    try:
        rewards = accuracy_reward(formatted_completions, solution_list)
        
        # Normalize the return value (list/tuple, tensor-like, or plain number) to one scalar
        if isinstance(rewards, (list, tuple)):
            reward_val = rewards[0] if len(rewards) > 0 else 0.0
        elif hasattr(rewards, 'item'):
            reward_val = rewards.item()
        elif isinstance(rewards, (int, float)):
            reward_val = float(rewards)
        else:
            reward_val = 0.0
        
        will_be_rewarded = reward_val > 0.5
        return reward_val, will_be_rewarded
    except Exception as e:
        print(f"Error: {e}")
        return 0.0, False

# Test the function
reward_val, will_reward = test_reward(output, solution)
print(f"Reward value: {reward_val}")
print(f"Will be rewarded: {will_reward}")


Reward value: 1.0
Will be rewarded: True
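
To check several cases in one call, the same idea can be wrapped in a small batch runner. This is a sketch under assumptions: `run_cases` is a hypothetical helper, the reward function is passed in as an argument so the helper does not depend on TRL, and a `None` score (which a reward function might return for a sample it cannot evaluate) is treated as "not rewarded".

```python
def run_cases(reward_fn, cases, threshold=0.5):
    """Score (output_text, solution_text) pairs in a single reward_fn call.

    Returns one (reward, rewarded) tuple per case, in input order.
    `run_cases` is a hypothetical helper, not part of TRL.
    """
    completions = [[{"content": out}] for out, _ in cases]
    solutions = [sol for _, sol in cases]
    rewards = reward_fn(completions, solutions)
    results = []
    for reward in rewards:
        # Assumption: a missing score (None) counts as not rewarded
        rewarded = reward is not None and reward > threshold
        results.append((reward, rewarded))
    return results


def stub_reward(completions, solutions):
    # Stand-in reward for demonstration only: substring match on the bare answer
    return [
        1.0 if sol.strip("$") in comp[0]["content"] else 0.0
        for comp, sol in zip(completions, solutions)
    ]


print(run_cases(stub_reward, [("total is 14", "$14$"), ("total is 15", "$14$")]))
```

In this notebook the call would be `run_cases(accuracy_reward, [(output, solution)])`, with further `(output, solution)` pairs appended as needed.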
