I evaluated the CAA steering method on two benchmarks (arithmetic_2da and prost) and on two traits (truthfulness and evilness). The code is organized into four directories: arithmetic, prost, truthfulness, and goodevil. Each directory includes a *search.py script that runs a grid search over parameter triples (layer, scale, normalize), using questions similar to those in the training set.
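The search over (layer, scale, normalize) triples can be pictured as a plain Cartesian product. The sketch below is only an illustration: the layer indices, scale values, and the `parameter_grid` helper are hypothetical, and the actual ranges used by the *search.py scripts may differ.

```python
from itertools import product

# Hypothetical search space, chosen only for illustration.
LAYERS = [4, 8, 12]
SCALES = [-2.0, -1.0, 1.0, 2.0]
NORMALIZE = [False, True]


def parameter_grid(layers, scales, normalize_options):
    """Return every (layer, scale, normalize) triple in the search space."""
    return list(product(layers, scales, normalize_options))


grid = parameter_grid(LAYERS, SCALES, NORMALIZE)

# Each triple would be used to generate steered answers that are then judged.
for layer, scale, normalize in grid[:3]:
    print(f"layer={layer} scale={scale} normalize={normalize}")
```

With these assumed values the grid has 3 × 4 × 2 = 24 triples, so the cost of the search grows quickly as more layers or scales are added.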

I ran these experiments on three models of different sizes: meta-llama/Llama-3.2-1B-Instruct, unsloth/Qwen3-4B-bnb-4bit, and unsloth/Qwen2.5-3B-Instruct. To select the best parameter triples, I inspected the outputs manually and also used a GPT-5 model as a judge. The *best.py scripts then run inference with only the top steering vectors, and I validated the results manually to settle on the final configuration for each model.

Chosen triples for each model, with example answers:

Arithmetic:

For the 1B model I used 30 examples to generate the steering vector; for the larger models I used 500 examples (30 worked poorly for these models, and so did 500).

**meta-llama/Llama-3.2-1B-Instruct:**
- best results: [best_results_Llama-3.2-1B-Instruct.json](../tests/arithmetic/best_results_Llama-3.2-1B-Instruct.json)
- all results: [grid_search_results_Llama-3.2-1B-Instruct.json](../tests/arithmetic/grid_search_results_Llama-3.2-1B-Instruct.json)

**unsloth/Qwen2.5-3B-Instruct:**
- best results: [best_results_Qwen2.5-3B-Instruct.json](../tests/arithmetic/best_results_Qwen2.5-3B-Instruct.json)
- all results: [grid_search_results_Qwen2.5-3B-Instruct.json](../tests/arithmetic/grid_search_results_Qwen2.5-3B-Instruct.json)

**unsloth/Qwen3-4B-bnb-4bit:**
- best results: [best_results_Qwen3-4B-bnb-4bit.json](../tests/arithmetic/best_results_Qwen3-4B-bnb-4bit.json)
- all results: [grid_search_results_Qwen3-4B-bnb-4bit.json](../tests/arithmetic/grid_search_results_Qwen3-4B-bnb-4bit.json)

Note that the steered 3B model simply answers correctly, while the 4-bit 4B model produces nonsense even in its unsteered responses, which is odd.

Prost:

For the 1B model I used 30 examples to generate the steering vector; for the larger models I used 70 examples (30 worked poorly for these models, and so did 70).

**meta-llama/Llama-3.2-1B-Instruct:**
- best results: [best_results_Llama-3.2-1B-Instruct.json](../tests/prost/best_results_Llama-3.2-1B-Instruct.json)
- all results: [grid_search_results_Llama-3.2-1B-Instruct.json](../tests/prost/grid_search_results_Llama-3.2-1B-Instruct.json)

**unsloth/Qwen2.5-3B-Instruct:**
- best results: [best_results_Qwen2.5-3B-Instruct.json](../tests/prost/best_results_Qwen2.5-3B-Instruct.json)
- all results: [grid_search_results_Qwen2.5-3B-Instruct.json](../tests/prost/grid_search_results_Qwen2.5-3B-Instruct.json)

**unsloth/Qwen3-4B-bnb-4bit:**
- best results: [best_results_Qwen3-4B-bnb-4bit.json](../tests/prost/best_results_Qwen3-4B-bnb-4bit.json)
- all results: [grid_search_results_Qwen3-4B-bnb-4bit.json](../tests/prost/grid_search_results_Qwen3-4B-bnb-4bit.json)

I let the models generate 200 additional tokens (with 100, most answers were cut off). Even so, according to GPT-5, 45% of the steered responses from the 3B model and 95% of those from the 4-bit 4B model were cut off before finishing.

Evilness:

I used the same number of examples to generate the steering vector for all models.

Dataset: [questions_answers.json](../tests/goodevil/questions_answers.json)

**meta-llama/Llama-3.2-1B-Instruct:**
- best results: [best_results_Llama-3.2-1B-Instruct.json](../tests/goodevil/best_results_Llama-3.2-1B-Instruct.json)
- all results: [grid_search_results_Llama-3.2-1B-Instruct.json](../tests/goodevil/grid_search_results_Llama-3.2-1B-Instruct.json)

**unsloth/Qwen2.5-3B-Instruct:**
- best results: [best_results_Qwen2.5-3B-Instruct.json](../tests/goodevil/best_results_Qwen2.5-3B-Instruct.json)
- all results: [grid_search_results_Qwen2.5-3B-Instruct.json](../tests/goodevil/grid_search_results_Qwen2.5-3B-Instruct.json)

**unsloth/Qwen3-4B-bnb-4bit:**
- best results: [best_results_Qwen3-4B-bnb-4bit.json](../tests/goodevil/best_results_Qwen3-4B-bnb-4bit.json)
- all results: [grid_search_results_Qwen3-4B-bnb-4bit.json](../tests/goodevil/grid_search_results_Qwen3-4B-bnb-4bit.json)

According to GPT-5, more than 80% of the steered answers are cut off. Steering does not seem to work in this configuration: the steered responses do not show the evilness trait.

Truthfulness:

I used the same number of examples to generate the steering vector for all models.

Dataset: [question_answers.json](../tests/truthfulness/question_answers.json)

**meta-llama/Llama-3.2-1B-Instruct:**

- best results: [best_results_Llama-3.2-1B-Instruct.json](../tests/truthfulness/best_results_Llama-3.2-1B-Instruct.json)
- all results: [grid_search_results_Llama-3.2-1B-Instruct.json](../tests/truthfulness/grid_search_results_Llama-3.2-1B-Instruct.json)

**unsloth/Qwen2.5-3B-Instruct:**

- best results: [best_results_Qwen2.5-3B-Instruct.json](../tests/truthfulness/best_results_Qwen2.5-3B-Instruct.json)
- all results: [grid_search_results_Qwen2.5-3B-Instruct.json](../tests/truthfulness/grid_search_results_Qwen2.5-3B-Instruct.json)

**unsloth/Qwen3-4B-bnb-4bit:**

- best results: [best_results_Qwen3-4B-bnb-4bit.json](../tests/truthfulness/best_results_Qwen3-4B-bnb-4bit.json)
- all results: [grid_search_results_Qwen3-4B-bnb-4bit.json](../tests/truthfulness/grid_search_results_Qwen3-4B-bnb-4bit.json)

In my opinion this works better than evilness. The 1B model in particular gives some dishonest answers, although I think they make up less than half of its responses. For the 3B and 4B models it still does not work well. According to GPT-5, fewer answers are cut off, though.

Overall, the results for the steered 1B model on benchmarks such as arithmetic and prost suggest that CAA works fine for such tasks. I believe this is due to the activation collection method, which takes the activations at the last token. In these benchmarks, the last token of the positive and negative prompts carries all the necessary information (the numerical answer for arithmetic, the chosen option for prost). This is not the case when the answer is a sentence or several sentences. I am not sure why the larger models perform poorly, even when given more data.
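To make the last-token argument concrete, here is a minimal sketch of how a CAA-style steering vector can be built as the mean difference between last-token activations of paired positive and negative prompts. It is not the code from this repository: the helper names and the toy activations are hypothetical, and plain Python lists stand in for the tensors a real model would produce.

```python
def last_token_activation(token_activations):
    """Pick the activation of the final token from a per-token list."""
    return token_activations[-1]


def caa_vector(positive_runs, negative_runs):
    """Mean difference of last-token activations over paired prompts.

    Each run is a list of per-token activation vectors (lists of floats)
    taken at one fixed layer; run i of both lists forms a contrastive pair.
    """
    if len(positive_runs) != len(negative_runs) or not positive_runs:
        raise ValueError("need the same non-zero number of positive and negative runs")
    diffs = []
    for pos, neg in zip(positive_runs, negative_runs):
        p = last_token_activation(pos)
        n = last_token_activation(neg)
        diffs.append([a - b for a, b in zip(p, n)])
    count = len(diffs)
    # Average each hidden dimension across all pairs.
    return [sum(column) / count for column in zip(*diffs)]


# Toy data: two pairs, two tokens per prompt, hidden size 2.
positive = [[[0.0, 0.0], [1.0, 2.0]], [[0.0, 0.0], [3.0, 4.0]]]
negative = [[[9.0, 9.0], [0.0, 1.0]], [[9.0, 9.0], [1.0, 1.0]]]
vector = caa_vector(positive, negative)
```

Only the final token of each prompt contributes here, which is why this scheme suits prompts that end in a single answer token and may lose information when the contrast is spread over a whole sentence.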

Scales with large magnitudes like -4 or -5 produce nonsensical answers, as expected.
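The effect of the scale is easy to see in a small sketch of the steering step itself. This assumes that steering adds `scale * vector` to a hidden state and that the `normalize` flag rescales the vector to unit length first; both are assumptions made for illustration, and the `apply_steering` helper is hypothetical rather than taken from the scripts.

```python
import math


def apply_steering(hidden, vector, scale, normalize=False):
    """Return hidden + scale * vector, optionally using the unit-length vector."""
    if normalize:
        norm = math.sqrt(sum(v * v for v in vector))
        if norm > 0:
            vector = [v / norm for v in vector]
    return [h + scale * v for h, v in zip(hidden, vector)]


hidden = [1.0, 1.0]
vector = [3.0, 4.0]

mild = apply_steering(hidden, vector, scale=2.0, normalize=True)
strong = apply_steering(hidden, vector, scale=-5.0, normalize=False)
```

In this toy case the unnormalized vector with scale -5 moves the hidden state far from its original value, which is consistent with large-magnitude scales pushing the model off distribution.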