# Investigating Nonlinear Predictors for CCS

We consider the problem of [eliciting latent knowledge](https://www.lesswrong.com/tag/eliciting-latent-knowledge-elk) from language models (LMs) using probes. In particular, a recently published technique called [Contrast-Consistent Search](https://arxiv.org/abs/2212.03827) (CCS) has demonstrated success using a modification of linear probes. We compare low-dimensional nonlinear probes against CCS-style linear probes, which serve as the baseline.
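For orientation, here is a minimal sketch of the unsupervised objective that CCS applies to a linear probe, as described in the CCS paper: a consistency term that pushes the probabilities of a statement and its negation to sum to 1, plus a confidence term that discourages the degenerate answer of 0.5 for both. The function names, array shapes, and random inputs are assumptions made for illustration. The sketch omits the paper's preprocessing and optimisation loop, and it is not the code used in this notebook.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))


def ccs_loss(theta, b, x_pos, x_neg):
    """CCS-style loss for a linear probe p(x) = sigmoid(x @ theta + b).

    x_pos, x_neg: hidden states for the "yes" and "no" versions of each
    question, each of shape (n_examples, hidden_dim). (Hypothetical inputs.)
    """
    p_pos = sigmoid(x_pos @ theta + b)
    p_neg = sigmoid(x_neg @ theta + b)
    # consistency: p(x+) should equal 1 - p(x-)
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # confidence: penalise p(x+) = p(x-) = 0.5
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))


# toy inputs, made up for illustration
rng = np.random.default_rng(0)
x_pos = rng.normal(size=(8, 4))
x_neg = rng.normal(size=(8, 4))
theta = rng.normal(size=4)
loss = ccs_loss(theta, 0.0, x_pos, x_neg)
```

With an all-zero probe, both probabilities are 0.5, so the consistency term vanishes and only the confidence penalty remains.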

To run:
1) Clone the GitHub repo:
> git clone https://github.com/vatsj/discovering_latent_knowledge
2) Create a conda environment from the provided requirements.txt
(you'll need to [install conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) for this to work!), then activate it:
> conda create --name ccs --file yml/requirements.txt

> conda activate ccs
3) Open the notebook in the .ipynb editor of your choice (if in doubt, use [jupyter](https://jupyter.org/)).

In [1]:
# default imports
from tqdm import tqdm
import copy
import numpy as np

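import sklearn as skl
# import the submodules explicitly so that skl.<submodule> resolves
# without relying on a wildcard import
import sklearn.linear_model
import sklearn.neighbors
import sklearn.ensemble
import sklearn.neural_network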

# plotting and serialization
import matplotlib.pyplot as plt
import pickle

# Language Modelling

We consider the [BoolQ dataset](https://huggingface.co/datasets/boolq), which tests the ability of LMs to answer yes/no questions given a relevant informational passage.

For our LM, we test 4 combinations of 2 model families ([DeBERTa-v3](https://huggingface.co/docs/transformers/model_doc/deberta), an encoder-only model, and [UnifiedQA](https://huggingface.co/allenai/unifiedqa-t5-large), a T5-based seq2seq model) and 2 sizes (roughly 100M and 10B parameters).

In [40]:
# data storage
FILENAME = "boolq-uqa-large.pkl"
FILEPATH = f"data/{FILENAME}"

In [41]:
# load data from pkl
with open(FILEPATH, "rb") as f:
    X_train, y_train, X_val, y_val = pickle.load(f)

In [42]:
# the dataset is skewed toward positive labels:
# always predicting the positive class scores 0.62, so that is the base rate to beat
round(np.array(y_train).mean(), 2)

0.62
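The 0.62 base rate above corresponds to always predicting the majority class. A minimal sketch of that baseline, using scikit-learn's `DummyClassifier` on synthetic stand-in data (the arrays `X_demo` and `y_demo` are made up for illustration and are not the BoolQ hidden states):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# synthetic stand-in data: roughly 62% positive labels, random features
rng = np.random.default_rng(0)
y_demo = (rng.random(1000) < 0.62).astype(int)
X_demo = rng.normal(size=(1000, 8))

# a classifier that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

# its accuracy equals the share of the majority class
base_rate = max(y_demo.mean(), 1 - y_demo.mean())
baseline_acc = baseline.score(X_demo, y_demo)
```

Any probe worth keeping should beat `baseline_acc` on held-out data; a score near it suggests the probe has learned little beyond the label skew.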

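# Evaluation

Tests various models on the loaded data.

Observations:
- Nonlinear models seem to work well: k-nearest neighbours and a random forest reach 0.73 and 0.72 validation accuracy, while logistic regression reaches 0.63, barely above the 0.62 training base rate.
- Simple models seem to work well: both nonlinear models are small (100 neighbours; 100 trees of depth at most 5) and use no hyperparameter tuning.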

In [43]:
# define the probes to compare
linear_model = skl.linear_model.LogisticRegression()
knn_model = skl.neighbors.KNeighborsClassifier(n_neighbors=100)
rfc_model = skl.ensemble.RandomForestClassifier(n_estimators=100, max_depth=5)
# note: mlp_model is defined here but is not included in the list evaluated below
mlp_model = skl.neural_network.MLPClassifier(hidden_layer_sizes=(100, 10, 1), max_iter=1000)
models = [linear_model, knn_model, rfc_model]

In [44]:
# fit each model and report train/validation accuracy
for model in models:
    print(model)
    model.fit(X_train, y_train)
    print("train score:", round(model.score(X_train, y_train), 2))
    print("val score:", round(model.score(X_val, y_val), 2))
    print()

LogisticRegression()
train score: 0.64
val score: 0.63

KNeighborsClassifier(n_neighbors=100)
train score: 0.77
val score: 0.73

RandomForestClassifier(max_depth=5)
train score: 0.8
val score: 0.72
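Because the labels are skewed, raw accuracy is easier to read next to the base rate. The sketch below shows one way to report the gain over a majority-class predictor. The helper `score_vs_base_rate` and the synthetic dataset are hypothetical, introduced only for illustration; the numbers it produces say nothing about the BoolQ results above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def score_vs_base_rate(model, X_tr, y_tr, X_va, y_va):
    """Fit `model` and compare its validation accuracy with the accuracy of
    always predicting the majority training class. (Hypothetical helper.)"""
    model.fit(X_tr, y_tr)
    majority = int(np.mean(y_tr) >= 0.5)
    base_rate = float(np.mean(np.asarray(y_va) == majority))
    val_score = float(model.score(X_va, y_va))
    return {"val": val_score, "base_rate": base_rate, "gain": val_score - base_rate}


# synthetic binary task with about 62% positive labels (not the BoolQ data)
X, y = make_classification(
    n_samples=600, n_features=20, weights=[0.38, 0.62], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

demo_models = [LogisticRegression(max_iter=1000), KNeighborsClassifier(n_neighbors=25)]
results = {
    type(m).__name__: score_vs_base_rate(m, X_tr, y_tr, X_va, y_va)
    for m in demo_models
}
```

Applied to the notebook's own `models` list and data splits, the same helper would put each probe's validation score directly beside the base rate it has to beat.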

