# Investigating Nonlinear Predictors for CCS

We consider the problem of [eliciting latent knowledge](https://www.lesswrong.com/tag/eliciting-latent-knowledge-elk) from LMs using probes. In particular, a recently published technique called [Contrast-Consistent Search](https://arxiv.org/abs/2212.03827) (CCS) has demonstrated success using a modification of linear probes. We contrast the performance of low-dimensional nonlinear probes in their 
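For reference, the unsupervised objective from the CCS paper can be sketched in a few lines. This is a minimal NumPy illustration, not the code used in this notebook: the function name `ccs_loss` and its argument names are assumptions, and the inputs are taken to be probe outputs in [0, 1] for the "yes" and "no" versions of each example.

```python
import numpy as np

def ccs_loss(p_pos, p_neg):
    """CCS objective averaged over a batch of contrast pairs.

    p_pos[i]: probe probability that the "yes" version of example i is true.
    p_neg[i]: probe probability that the "no" version of example i is true.
    (Both names are hypothetical, chosen for this sketch.)
    """
    p_pos = np.asarray(p_pos, dtype=float)
    p_neg = np.asarray(p_neg, dtype=float)
    # Consistency: the two probabilities should sum to one.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))

# A confident, consistent probe incurs no loss; an uninformative one does.
print(ccs_loss([1.0, 0.0], [0.0, 1.0]))
print(ccs_loss([0.5, 0.5], [0.5, 0.5]))
```

The loss needs no labels, which is what makes the method a candidate for eliciting latent knowledge. Any differentiable probe that maps hidden states to a probability could be plugged in, linear or not.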

In [2]:
# default imports
from tqdm import tqdm
import copy
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

from datasets import load_dataset
import transformers
import sklearn as skl
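# import the submodules explicitly so that skl.linear_model etc. resolve
import sklearn.ensemble
import sklearn.linear_model
import sklearn.neighbors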

# my cool cool imports
import matplotlib.pyplot as plt
import random
import pickle


# Language Modelling

We consider the [BoolQ dataset](https://huggingface.co/datasets/boolq), which tests the ability of LMs to answer yes/no questions given a relevant informational passage.
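CCS operates on contrast pairs: the same question completed once with "yes" and once with "no". A minimal sketch of how a BoolQ example might be turned into such a pair is shown here. The helper `make_contrast_pair` and the prompt template are hypothetical and are not necessarily what was used to build the pickled features loaded later; only the `passage` and `question` field names come from the dataset itself.

```python
def make_contrast_pair(passage, question):
    """Return the ("yes", "no") completions of one BoolQ example.

    The prompt template is an assumption made for illustration.
    """
    prompt = f"Passage: {passage}\nQuestion: {question}\nAnswer:"
    return prompt + " yes", prompt + " no"

# Usage with a made-up example in the shape of a BoolQ row.
example = {
    "passage": "Water boils at 100 degrees Celsius at sea level.",
    "question": "does water boil at 100 degrees celsius at sea level",
}
x_pos, x_neg = make_contrast_pair(example["passage"], example["question"])
print(x_pos)
print(x_neg)
```

The two strings differ only in the final answer token, so any difference between their hidden states can be attributed to the proposed answer rather than to the passage or question.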

For our LM, we use [DeBERTa-v3](https://huggingface.co/docs/transformers/model_doc/deberta), which for our purposes is a performant version of BERT.

In [6]:
# BOOLQ dataset
boolq = load_dataset("super_glue", "boolq")  # load once and reuse both splits
train = boolq["train"]
val = boolq["validation"]

Found cached dataset super_glue (/Users/jstav/.cache/huggingface/datasets/super_glue/boolq/1.0.3/bb9675f958ebfee0d5d6dc5476fafe38c79123727a7258d515c450873dbdbbed)


  0%|          | 0/3 [00:00<?, ?it/s]

In [7]:
# base deberta
model_type = "encoder"
MODEL_NAME = "microsoft/deberta-v3-base"

In [8]:
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_NAME)
model = transformers.AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
# model.cuda()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Some weights of the model checkpoint at microsoft/deberta-v3-base were not used when initializing DebertaV2ForMaskedLM: ['mask_predictions.dense.weight', 'lm_predictions.lm_head.dense.weight', 'mask_predictions.classifier.weight', 'mask_predictions.dense.bias', 'mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.classifier.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'deberta.embeddings.position_embeddings.weight', 'mask_predictions.LayerNorm.bias']
- This IS expected if you are initializing DebertaV2ForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassific

In [11]:
# data storage
FILENAME = "boolq-uqa-base.pkl"
FILEPATH = f"data/{FILENAME}"

In [12]:
# load data from pkl
with open(FILEPATH, "rb") as f:
    X_train, y_train, X_val, y_val = pickle.load(f)

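# Evaluation

We fit several off-the-shelf classifiers on the loaded data and compare their train and validation accuracy.

Observations:
- Nonlinear models seem to work well: for example, KNN reaches 0.64 validation accuracy versus 0.62 for logistic regression.
- Simple models seem to work well.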

In [29]:
# defines models
linear_model = skl.linear_model.LogisticRegression()
knn_model = skl.neighbors.KNeighborsClassifier(n_neighbors=100)
rfc_model = skl.ensemble.RandomForestClassifier(n_estimators=1000, max_depth=5)
# mlp_model = skl.neural_network.MLPClassifier(hidden_layer_sizes=(100, 10, 1), max_iter=500)
models = [linear_model, knn_model, rfc_model]

In [30]:
# trains and tests models
for clf in models:  # named clf to avoid shadowing the transformer `model` loaded above
    print(clf)
    clf.fit(X_train, y_train)
    print("train score:", round(clf.score(X_train, y_train), 2))
    print("val score:", round(clf.score(X_val, y_val), 2))
    print()

LogisticRegression()
train score: 0.62
val score: 0.62

KNeighborsClassifier(n_neighbors=100)
train score: 0.64
val score: 0.64

RandomForestClassifier(max_depth=5, n_estimators=1000)
