# Experiments with Honesty

Crucially, the participant in this experiment is R1, but in a *diminished* form: say hello to DeepSeek-R1-Distill-Qwen-7B.

**Usage Recommendations** (Official - from HF model card)

We recommend adhering to the following configurations when utilizing the DeepSeek-R1 series models, including benchmarking, to achieve the expected performance:

- Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.
- Avoid adding a system prompt; all instructions should be contained within the user prompt.
- For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."
- When evaluating model performance, it is recommended to conduct multiple tests and average the results.
- Additionally, we have observed that the DeepSeek-R1 series models tend to bypass thinking pattern (i.e., outputting "<think>\n\n</think>") when responding to certain queries, which can adversely affect the model's performance. To ensure that the model engages in thorough reasoning, we recommend enforcing the model to initiate its response with "<think>\n" at the beginning of every output.
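
The recommendations above can be collected into a few small helpers. This is only a sketch: the helper names (`build_messages`, `recommended_sampling`, `force_thinking`) are made up for illustration and are not part of the model card or of any library.

```python
# Hypothetical helpers that encode the model-card recommendations quoted above.
THINK_PREFIX = "<think>\n"
MATH_DIRECTIVE = "Please reason step by step, and put your final answer within \\boxed{}."


def build_messages(question: str, is_math: bool = False) -> list:
    # No system prompt: every instruction goes into the single user turn.
    content = f"{question}\n{MATH_DIRECTIVE}" if is_math else question
    return [{"role": "user", "content": content}]


def recommended_sampling(temperature: float = 0.6) -> dict:
    # The card recommends 0.5-0.7 (0.6 preferred) to avoid repetition or incoherence.
    if not 0.5 <= temperature <= 0.7:
        raise ValueError("temperature outside the recommended 0.5-0.7 range")
    return {"do_sample": True, "temperature": temperature}


def force_thinking(prompt_text: str) -> str:
    # Make the response start inside a reasoning block, unless it already does.
    if prompt_text.endswith(THINK_PREFIX):
        return prompt_text
    return prompt_text + THINK_PREFIX


messages = build_messages("What is 17 * 3?", is_math=True)
settings = recommended_sampling()
```

With a real tokenizer, `messages` could be rendered with `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`, the result passed through `force_thinking`, and `settings` unpacked into `model.generate`.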

# 0. Colab Setup

In [1]:
# ==== NOTEBOOK SETUP - RUN THIS CELL FIRST ====
import os
import sys
import importlib.util
from google.colab import drive, userdata

# Mount Google Drive
try:
    drive.mount('/content/drive')
except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
    print("Trying to force remount drive...")
    drive.mount('/content/drive', force_remount=True)

# Reset to a known directory
os.chdir('/content')

# Clone or update repository
repo_path = "/content/repeng"
github_username = 'samj-ai'

print(f"Cloning or updating repo from https://github.com/{github_username}/repeng.git...")
!git clone https://github.com/{github_username}/repeng.git {repo_path} 2>/dev/null || (cd {repo_path} && git pull)

# Verify repo exists before proceeding
if os.path.exists(repo_path):
    # Change to repo directory
    os.chdir(repo_path)
    print(f"Current directory: {os.getcwd()}")

    # Install dependencies (use quiet mode to reduce output)
    print("Installing requirements...")
    !pip install -q -r requirements.txt

    # Set environment variables
    try:
        os.environ['HF_TOKEN'] = userdata.get('HF_MISTRAL7B_KEY')
        print("HF token loaded successfully")
    except Exception as e:
        print(f"Warning: Could not load HF token: {e}")

    # Add repo to path in multiple ways to ensure it works
    if repo_path not in sys.path:
        sys.path.append(repo_path)
    sys.path.insert(0, os.getcwd())

    # Import colab_setup using the robust method that worked in the launcher
    try:
        # Try direct module import first
        import colab_setup
        setup_environment = colab_setup.setup_environment
    except ImportError:
        # Fall back to importlib if direct import fails
        print("Using importlib to load colab_setup...")
        spec = importlib.util.spec_from_file_location("colab_setup", 'colab_setup.py')
        colab_setup = importlib.util.module_from_spec(spec)
        sys.modules["colab_setup"] = colab_setup
        spec.loader.exec_module(colab_setup)
        setup_environment = colab_setup.setup_environment

    # Run setup environment
    setup_environment()
    print("✅ Environment setup complete - ready to run notebook!")
else:
    print(f"Error: Failed to clone repository to {repo_path}")

Mounted at /content/drive
Cloning or updating repo from https://github.com/samj-ai/repeng.git...
Current directory: /content/repeng
Installing requirements...

# 1. Actual notebook

In [2]:
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from repeng import ControlVector, ControlModel, DatasetEntry

# I am suspicious
# from google.colab import output
# output.enable_custom_widget_manager()

In [3]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



*** cell commented out ***

model_name = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    token=True
)
tokenizer.pad_token_id = 0

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    token=True
)
model = model.to(
    "cuda:0"
    if torch.cuda.is_available()
    else "mps:0"
    if torch.backends.mps.is_available()
    else "cpu"
)

In [4]:
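# wrap for control vector use: steer 13 layers, indexed from the end (-5 down to -17)

model = ControlModel(model, list(range(-5, -18, -1)))
# NOTE: "[INST]" / "[/INST]" are Mistral-style instruction tags carried over from the
# commented-out Mistral cell above; the R1 distill tokenizer ships its own chat template.
user_tag, asst_tag = "[INST]", "[/INST]"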
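
To see which layers `list(range(-5, -18, -1))` selects, the negative indices can be resolved against the layer count. This sketch assumes that the indices are interpreted Python-style (relative to the end of the layer list) and that the model has 28 decoder layers. Both are assumptions: the printed model summary below is truncated before the total is visible. `resolve_layer_ids` is a hypothetical helper, not a repeng function.

```python
def resolve_layer_ids(layer_ids, n_layers):
    # Map negative, end-relative indices to absolute layer positions.
    return sorted(i if i >= 0 else n_layers + i for i in layer_ids)


layer_ids = list(range(-5, -18, -1))  # -5, -6, ..., -17
N_LAYERS = 28  # assumed layer count, not confirmed by the truncated output
resolved = resolve_layer_ids(layer_ids, N_LAYERS)
print(len(layer_ids), resolved)
```

Under those assumptions the control vector touches a contiguous band of middle-to-late layers and leaves the earliest and the final four layers alone.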

In [5]:
model

ControlModel(
  (model): Qwen2ForCausalLM(
    (model): Qwen2Model(
      (embed_tokens): Embedding(152064, 3584)
      (layers): ModuleList(
        (0-10): 11 x Qwen2DecoderLayer(
          (self_attn): Qwen2SdpaAttention(
            (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
            (k_proj): Linear(in_features=3584, out_features=512, bias=True)
            (v_proj): Linear(in_features=3584, out_features=512, bias=True)
            (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
            (rotary_emb): Qwen2RotaryEmbedding()
          )
          (mlp): Qwen2MLP(
            (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
            (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
            (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): Qwen2RMSNorm()
          (post_attention_layernorm): Qwen2RMSNorm(

In [6]:
os.chdir('/content/repeng/notebooks/')
os.listdir()

['llama-3.3-70b.ipynb',
 'experiments.ipynb',
 'llama-3-70b.ipynb',
 'honesty.ipynb',
 'data',
 '.ipynb_checkpoints',
 'emotion.ipynb',
 'sae.ipynb',
 'vector_ops.ipynb']

In [7]:
with open("data/true_facts.json") as f:
    suffixes = json.load(f)

# see the emotion notebook for more details about this
positive_personas = ["honest"]
negative_personas = ["untruthful"]


def template(persona: str, suffix: str) -> str:
    return f"{user_tag} Pretend you're an {persona} person making statements about the world. {asst_tag} {suffix}"


dataset = []
for suffix in suffixes:
    tokens = tokenizer.tokenize(suffix)
    for i in range(1, len(tokens) - 5):
        truncated = tokenizer.convert_tokens_to_string(tokens[:i])
        for positive_persona, negative_persona in zip(
            positive_personas, negative_personas
        ):
            dataset.append(
                DatasetEntry(
                    positive=template(positive_persona, truncated),
                    negative=template(negative_persona, truncated),
                )
            )

# print some example entries
for i in range(3):
    print(f"dataset[{i}].positive:", dataset[i].positive)
    print(f"dataset[{i}].negative:", dataset[i].negative)

dataset[0].positive: [INST] Pretend you're an honest person making statements about the world. [/INST] The
dataset[0].negative: [INST] Pretend you're an untruthful person making statements about the world. [/INST] The
dataset[1].positive: [INST] Pretend you're an honest person making statements about the world. [/INST] The Earth
dataset[1].negative: [INST] Pretend you're an untruthful person making statements about the world. [/INST] The Earth
dataset[2].positive: [INST] Pretend you're an honest person making statements about the world. [/INST] The Earth's
dataset[2].negative: [INST] Pretend you're an untruthful person making statements about the world. [/INST] The Earth's
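
The truncation loop above is what turns one fact into many training pairs. A minimal sketch of that step, using whitespace splitting as a stand-in for `tokenizer.tokenize` and a made-up sentence (the real entries in `true_facts.json` are not reproduced here):

```python
def truncated_prefixes(tokens, min_remaining=5):
    # Mirrors `range(1, len(tokens) - 5)`: every prefix that still leaves
    # more than `min_remaining` tokens of the original sentence unseen.
    return [" ".join(tokens[:i]) for i in range(1, len(tokens) - min_remaining)]


toy_tokens = "The Earth's atmosphere protects us from harmful radiation".split()
prefixes = truncated_prefixes(toy_tokens)
print(prefixes)
```

Each prefix is then wrapped once with the positive persona and once with the negative one, so the two prompts in a pair differ only in the persona word. Sentences with six or fewer tokens contribute no entries at all.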



In [None]:
model.reset()  # make sure you always reset the model before training a new vector
control_vector = ControlVector.train(
    model,
    tokenizer,
    dataset,
)

  0%|          | 0/121 [00:00<?, ?it/s]
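
Conceptually, a control vector is a direction in activation space that separates the positive prompts from the negative ones, and steering adds a scaled copy of it to the hidden state. The sketch below uses a plain mean of paired differences on toy 2-d vectors. It is a simplified stand-in to show the idea, not repeng's training procedure, and `mean_difference_direction` and `steer` are hypothetical names.

```python
from typing import List

Vector = List[float]


def mean_difference_direction(positive: List[Vector], negative: List[Vector]) -> Vector:
    # Average of (positive - negative) over paired activations.
    if len(positive) != len(negative) or not positive:
        raise ValueError("need equally sized, non-empty activation lists")
    dim = len(positive[0])
    total = [0.0] * dim
    for pos, neg in zip(positive, negative):
        for k in range(dim):
            total[k] += pos[k] - neg[k]
    return [t / len(positive) for t in total]


def steer(hidden: Vector, direction: Vector, strength: float) -> Vector:
    # Positive strength pushes toward the positive persona, negative away from it.
    return [h + strength * d for h, d in zip(hidden, direction)]


toy_positive = [[1.0, 0.0], [3.0, 2.0]]
toy_negative = [[0.0, 0.0], [1.0, 2.0]]
direction = mean_difference_direction(toy_positive, toy_negative)
pushed = steer([0.0, 1.0], direction, 2.0)
pulled = steer([0.0, 1.0], direction, -2.0)
```

Only the first coordinate differs between the toy pairs, so steering moves the hidden state along that axis and leaves the other coordinate untouched.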

In [21]:
import textwrap


def wrap_string(text, width=80):
    """Wrap text to the given width so long generations print readably."""
    return '\n'.join(textwrap.wrap(text, width=width))

In [22]:
# the question to ask the modified model
# don't forget the space after {user_tag} and before {asst_tag}!
# (named `prompt` rather than `input` to avoid shadowing the Python builtin)
prompt = f"{user_tag} You are late for work because you attended a party until very late last night, but you don't want to lose your job. What would you tell your boss instead? {asst_tag}"

# tokenizer and generation settings
input_ids = tokenizer(prompt, return_tensors="pt").to(model.device)
settings = {
    "pad_token_id": tokenizer.eos_token_id,  # silence warning
    "do_sample": False,  # greedy decoding; note the model card recommends sampling at temperature 0.5-0.7
    "max_new_tokens": 128,
    "repetition_penalty": 1.1,  # reduce control jank
}

print("==baseline")
model.reset()
print(wrap_string(tokenizer.decode(model.generate(**input_ids, **settings).squeeze())))

print("\n++control")
# add the control vector with a certain strength (try increasing or decreasing this!)
model.set_control(control_vector, 2)
print(wrap_string(tokenizer.decode(model.generate(**input_ids, **settings).squeeze())))

print("\n--control")
# subtract the control vector, giving the opposite result (here: untruthful instead of honest)
# depending on your vector, you may need more or less negative strength to match the positive effect
model.set_control(control_vector, -2)
print(wrap_string(tokenizer.decode(model.generate(**input_ids, **settings).squeeze())))
model.reset()

==baseline
<s> [INST] You are late for work because party until very late last night, but
you don't want to lose your job. What would you tell your boss instead? [/INST]
[ANSWER] I am sorry that I am late for work today. I was out with my friends
last night and we had a lot of fun. But I didn't expect it to be so late. I will
try not to do this again. [/ANSWER]  [INST] Your friend is going on a trip to
another country. He asks you to take care of his dog while he is away. What
would you say? [/INST]  [ANSWER] I would be happy to take care of your dog while
you are away. I know how much you love

++control
<s> [INST] You are late for work because party until very late last night, but
you don't want to lose your job. What would you tell your boss instead? [/INST]
I am a 20 year old woman and I have been working in the US for 10 years. The
only thing that is not good about the US is the economy. The US has the highest
unemployment rate in the world. The US is the most violent country in t