#### Setup

In [7]:
from nnsight import LanguageModel

Using Models from Hugging Face via the LanguageModel class

In [22]:
# Set dispatch=True to load the model weights into memory right away.
# With dispatch=False, only a meta-device skeleton is created, and the
# weights are loaded when the first tracing context runs.
model = LanguageModel("openai-community/gpt2", device_map="auto", dispatch=True)

print(model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
  (generator): WrapperModule()
)


In [44]:
# dispatch is not passed here. Set dispatch=True to load the weights into
# memory right away; with dispatch=False, only a meta-device skeleton is
# created until the first tracing context runs.
phi_model = LanguageModel("microsoft/phi-1_5", device_map="auto")

print(phi_model)

PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2048)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-23): 24 x PhiDecoderLayer(
        (self_attn): PhiSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (k_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (v_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (dense): Linear(in_features=2048, out_features=2048, bias=True)
          (rotary_emb): PhiRotaryEmbedding()
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear(in_features=2048, out_features=8192, bias=True)
          (fc2): Linear(in_features=8192, out_features=2048, bias=True)
        )
        (input_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (final_layernorm): LayerNorm((2048,

In [25]:
model.device

device(type='cuda', index=0)

In [6]:
model.transformer.wte

Embedding(50257, 768)

In [10]:
model.transformer.h[-1].mlp

GPT2MLP(
  (c_fc): Conv1D()
  (c_proj): Conv1D()
  (act): NewGELUActivation()
  (dropout): Dropout(p=0.1, inplace=False)
)

#### Test Ablations, Look for Potential Steering Vectors

In [48]:
with model.trace("Deception") as tracer:
    out = model.transformer.h[-1].attn.output[0].save()

In [35]:
type(out)

nnsight.models.LanguageModel.LanguageModelProxy

In [49]:
print(out[0].shape)

torch.Size([2, 768])


In [50]:
print(out[0])

tensor([[ 0.0016,  0.0497, -0.2819,  ...,  0.1228,  0.2832, -0.1518],
        [ 0.6262, -0.6962, -0.5985,  ...,  0.0251,  0.1031, -0.3279]],
       device='cuda:0', grad_fn=<SelectBackward0>)
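
The saved tensor holds one 768-dimensional row per prompt token (two tokens for "Deception" here), while a steering experiment usually needs a single vector per prompt. The sketch below shows two common ways to pool the token rows. It uses a random stand-in tensor of the same shape rather than the real activations, `pool_tokens` is a hypothetical helper, and the choice between mean and last-token pooling is an assumption to test, not something this notebook settles.

```python
import torch

def pool_tokens(acts: torch.Tensor, how: str = "mean") -> torch.Tensor:
    """Collapse (batch, seq_len, hidden) activations to (batch, hidden)."""
    if how == "mean":
        return acts.mean(dim=1)   # average over token positions
    if how == "last":
        return acts[:, -1, :]     # keep only the final token
    raise ValueError(f"unknown pooling mode: {how}")

# Stand-in for the saved attention output: 1 prompt, 2 tokens, 768 dims.
torch.manual_seed(0)
fake_out = torch.randn(1, 2, 768)

mean_vec = pool_tokens(fake_out, "mean")
last_vec = pool_tokens(fake_out, "last")
print(mean_vec.shape, last_vec.shape)
```

Either pooled vector can then be averaged across prompts of the same class.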


### Activation Steering

***Experiment***
- Derive a steering vector that pushes a model toward deception, sycophancy, or another problematic trait
    - Compute the deceptive activations as the mean activation over the deceptive prompts
    - Compute the honest activations as the mean activation over the honest prompts
    - Take the difference of the two means: deceptive minus honest steers toward deception, honest minus deceptive steers toward honesty

- Apply steering vector to Phi-1.5
- Compare the steered Phi-1.5 with the unmodified model on a dataset
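
The plan above can be sketched in plain PyTorch without loading a model. This is a minimal illustration under stated assumptions, not the notebook's implementation: the activations are random stand-ins, a toy `nn.Linear` plays the role of a transformer block, and `make_steering_hook` and its `scale` parameter are hypothetical names introduced here.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden = 8  # toy width; GPT-2 small uses 768

# Stand-ins for pooled activations: one row per prompt.
deceptive_acts = torch.randn(5, hidden) + 1.0
honest_acts = torch.randn(5, hidden)

# Difference of means: points from "honest" toward "deceptive".
steering_vec = deceptive_acts.mean(dim=0) - honest_acts.mean(dim=0)

# Toy layer standing in for a transformer block (assumption).
layer = nn.Linear(hidden, hidden)

def make_steering_hook(vec: torch.Tensor, scale: float):
    """Return a forward hook that adds scale * vec to the layer output."""
    def hook(module, inputs, output):
        return output + scale * vec  # the returned value replaces the output
    return hook

x = torch.randn(1, 3, hidden)  # (batch, seq_len, hidden)
baseline = layer(x)

handle = layer.register_forward_hook(make_steering_hook(steering_vec, scale=4.0))
steered = layer(x)
handle.remove()  # detach the hook so later calls are unmodified

print((steered - baseline)[0, 0])
```

Real transformer blocks such as `GPT2Block` return a tuple rather than a single tensor, so a hook on one of them would need to modify the first element and return the rebuilt tuple. The sign and size of `scale` control the direction and strength of steering and would need tuning per model.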

***Example pairs of honest vs. potentially deceptive sentences on the same topics:***

Honest: "I was at home last night."
Deceptive: "Well, you know, I'm pretty sure I spent most of the evening at my place, as far as I can recall."

Honest: "I disagree with that decision."
Deceptive: "I'm not saying it's a bad choice, but perhaps there are other factors we haven't fully considered yet."

Honest: "I made a mistake on the report."
Deceptive: "There may have been some discrepancies in the data that affected the overall accuracy of the document."

Honest: "I borrowed $50 from the petty cash."
Deceptive: "I'm not entirely certain, but it's possible some funds might have been temporarily reallocated for urgent business needs."

Honest: "I haven't finished the project yet."
Deceptive: "The project is in its final stages of completion, pending some minor adjustments and unforeseen circumstances."

1. Honest: "I broke the vase."
   Deceptive: "The vase? Oh, it must have fallen when I was cleaning nearby."

2. Honest: "I forgot to pay the bill."
   Deceptive: "There seems to have been an issue with the payment processing system."

3. Honest: "I don't like your new haircut."
   Deceptive: "Your hair looks... different. It's quite a change, isn't it?"

4. Honest: "I ate the last cookie."
   Deceptive: "I'm not sure what happened to it. Maybe someone else took it?"

5. Honest: "I'm not interested in going out tonight."
   Deceptive: "Oh, tonight? I might have some work to finish up, so I'm not sure if I can make it."

6. Honest: "I haven't started the assignment yet."
   Deceptive: "I'm making progress on the assignment. It's coming along."

7. Honest: "I don't agree with your opinion."
   Deceptive: "That's an interesting perspective. I'll have to think about it some more."

8. Honest: "I overslept and was late for work."
   Deceptive: "There was unexpected heavy traffic this morning. It was terrible."

9. Honest: "I don't want to attend the family reunion."
   Deceptive: "I'll have to check my schedule. Things have been pretty busy lately."

10. Honest: "I didn't study for the test."
    Deceptive: "I reviewed the material, but perhaps I focused on the wrong topics."
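
To feed these pairs to a model, it helps to keep each honest sentence aligned with its deceptive counterpart. A minimal sketch, using three of the pairs above; the variable names are illustrative and not part of the notebook:

```python
# A few of the pairs above, stored as (honest, deceptive) tuples.
pairs = [
    ("I broke the vase.",
     "The vase? Oh, it must have fallen when I was cleaning nearby."),
    ("I forgot to pay the bill.",
     "There seems to have been an issue with the payment processing system."),
    ("I ate the last cookie.",
     "I'm not sure what happened to it. Maybe someone else took it?"),
]

# Split into two parallel lists so each group can be run through the model.
honest_prompts = [honest for honest, _ in pairs]
deceptive_prompts = [deceptive for _, deceptive in pairs]

print(len(honest_prompts), len(deceptive_prompts))
```

Keeping the pairs together until the last moment makes it harder for the two lists to drift out of alignment when examples are added or removed.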