<a href="https://colab.research.google.com/github/tabasy/fewgen/blob/main/demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prepare environment

In [1]:
import os

if not os.path.exists('fewgen'):
  !git clone https://github.com/tabasy/fewgen.git
  !pip install -q -r fewgen/requirements.txt

In [2]:
import sys
sys.path.append('fewgen')

import logging
from IPython.display import Markdown, HTML, clear_output

logging.basicConfig()
logging.getLogger().setLevel(logging.INFO)

# Define, train model

We start with a single-input text classification task (the default task is `SST2`). To create a `FewgenClassifier`, we need:
* a **language model** along with its tokenizer
* some **descriptions** for each class of the task

Only `GPT2` variants have been tested as the language model, but any model that `transformers.AutoModelForCausalLM` can load is expected to work.

In [3]:
from fewgen.util import load_model

tokenizer, language_model = load_model('gpt2-medium', device='cuda')
clear_output()

Descriptions are defined as shown below. Each description has two parts separated by ` / `:
> `" All in all, the movie was a / terrible failure"`

The first part acts as a **prompt**, and the second part is the **answer**.
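
As a minimal sketch of this convention, the helper below splits one description string into its two parts. `split_description` is a hypothetical name used only for illustration; how `fewgen` parses descriptions internally (including how it treats surrounding whitespace) is not shown here and may differ.

```python
def split_description(description, sep=" / "):
    """Split a description into its prompt and answer parts (illustrative helper)."""
    prompt, answer = description.split(sep, maxsplit=1)
    return prompt, answer


prompt, answer = split_description(" All in all, the movie was a / terrible failure")
print(repr(prompt))  # ' All in all, the movie was a'
print(repr(answer))  # 'terrible failure'
```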

For training and inference, each description is scored by the change in perplexity of its answer part before and after the input text is prepended:
> *All in all, the movie was a* ***terrible failure***

> I would not recommend it to anyone. *All in all, the movie was a* ***terrible failure***
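
The scoring idea above can be sketched with plain Python. This is an illustration under stated assumptions, not the library's implementation: the token log-probabilities are made up (in practice they come from the language model), and expressing the change as a difference of log-perplexities, with positive meaning "the input made the answer more likely", is an assumed convention.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of a token span, given natural-log probabilities of its tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def perplexity_change(answer_lp_without_input, answer_lp_with_input):
    """Log-perplexity of the answer without the input minus with the input.

    Positive values mean prepending the input text made the answer more likely.
    """
    before = perplexity(answer_lp_without_input)
    after = perplexity(answer_lp_with_input)
    return math.log(before) - math.log(after)


# Made-up log-probabilities for the two answer tokens "terrible failure"
without_input = [math.log(0.02), math.log(0.10)]
with_input = [math.log(0.20), math.log(0.40)]
score = perplexity_change(without_input, with_input)
print(score > 0)  # True: the answer became more likely after adding the input
```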


Now you are free to define your own descriptions:

In [4]:
descriptions = {
  # the keys are class labels
  'neg': [
    " All in all, the movie was a / terrible failure",
    " I would give it / zero of 10",
    " All in all, the screenplay is / poorly written",
    " I think the actors / played awful"
  ],
  'pos': [
    " All in all, the movie was a / masterpiece",
    " I would give it / 10 stars of 10",
    " All in all, the screenplay is / very well written",
    " I think the actors / should win the award"
  ]
}

Now we are ready to instantiate a `FewgenClassifier`:

In [5]:
from fewgen.classifier import FewgenClassifier

classifier = FewgenClassifier(descriptions=descriptions, 
                              language_model=language_model,
                              tokenizer=tokenizer)

To train the classifier, we need a dataset. It should be a Hugging Face `datasets.Dataset` instance with two fields:
* the `text` field, our input text
* the `label` field, whose values match the keys of `descriptions`

Our `prepare_dataset` function makes few-shot experiments easier, but you can build your own dataset any way you like.
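
If you build your own dataset, something like the following sketch may help. `to_columns` is a hypothetical helper and the example sentences are made up; it only assembles and validates the two columns in plain Python, and the commented lines show how the result could be passed to `datasets.Dataset.from_dict`.

```python
def to_columns(examples, valid_labels):
    """Turn (text, label) pairs into a column dict with `text` and `label` fields."""
    columns = {"text": [], "label": []}
    for text, label in examples:
        if label not in valid_labels:
            raise ValueError(f"unknown label {label!r}; expected one of {sorted(valid_labels)}")
        columns["text"].append(text)
        columns["label"].append(label)
    return columns


# Made-up examples; the labels must match the keys of `descriptions`
examples = [
    ("I would not recommend it to anyone.", "neg"),
    ("A delightful surprise from start to finish.", "pos"),
]
columns = to_columns(examples, valid_labels={"neg", "pos"})

# With the `datasets` package installed, this becomes a Dataset:
# from datasets import Dataset
# my_trainset = Dataset.from_dict(columns)
```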

In [6]:
from fewgen.dataset import prepare_dataset

dataset_params = {
    'dataset_name': 'glue/sst2',
    'shuffle': True, 
    'shuffle_seed': 120,
    'train_ex_per_class': 16,
    'test_ex_per_class': 128,
    'test_split_name': 'validation',
  }

trainset, testset = prepare_dataset(**dataset_params)

# converting integer labels to human readable strings
labels = classifier.labels

def i2label(example):
  example['label'] = labels[example['label']]
  return example

trainset = trainset.map(i2label)
testset = testset.map(i2label)
clear_output()

It is time to train our model on the small few-shot training set.

When the language model is not finetuned, training means fitting a linear model on the description scores, which act as high-level features.
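
To make "fitting a linear model on description scores" concrete, here is a small pure-Python sketch. It assumes logistic regression trained by gradient descent, which is only one possible linear model and not necessarily the one `fewgen` uses; the feature values are made up, with one column per description.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, targets, lr=0.5, steps=200):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    n_features = len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, targets):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - lr * g / len(features) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(features)
    return weights, bias


def predict(weights, bias, x):
    """Return class index 1 if the linear score is positive, else 0."""
    return int(sum(w * xi for w, xi in zip(weights, x)) + bias > 0)


# Made-up description scores: column 0 is a "neg" description, column 1 a "pos" one
features = [[1.0, -0.5], [0.8, -0.2], [-0.4, 0.9], [-0.6, 1.1]]
targets = [0, 0, 1, 1]  # 0 = neg, 1 = pos
weights, bias = fit_logistic(features, targets)
predictions = [predict(weights, bias, x) for x in features]
```

In this toy setup the learned weight on the "neg" description is negative and the weight on the "pos" description is positive, so the weights themselves indicate which descriptions drive a prediction.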

In [7]:
classifier.train(trainset, finetune_lm=False)
classifier.test(testset)

INFO:root:train accuracy: 96.9%
INFO:root:train f-score: 96.9%


INFO:root:test accuracy: 85.5%
INFO:root:test f-score: 85.5%


{'test_acc': 0.85546875, 'test_f1': 0.855360605921786}

We can also finetune the language model to get better results.
> **We recommend skipping the following code cell and playing with the frozen-LM classifier first.
Then you can come back, finetune, and see the differences.**

In [8]:
# classifier.train(trainset, finetune_lm=True, finetune_args=dict(epochs=3))
# classifier.test(testset)

# Check model predictions

We now have a trained classifier, so let's watch it in action:

In [9]:
#@title classify input text
text = "I would not recommend it to anyone." #@param {type:"string"}

probs = classifier.predict_proba(text)[0]
pred = classifier.classify(text)

clear_output()
print(f'prediction: {pred[0]}\n')
for label, prob in zip(classifier.labels, probs):
  print(f'{label}:\t{prob:.1%}')

prediction: neg

neg:	89.1%
pos:	10.9%


One benefit of most prompt-based methods is their **interpretability**. Because the prompts and answers are human-readable, we can expect to understand, at least to some degree, how the model decides.

Here we can check how prepending the input text to a description changes the probability of the answer words (and the conditional perplexity).
> <font color="LimeGreen"> **Green** </font> means the probability has increased after adding the input text.
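
The coloring rule can be illustrated with a small stand-alone sketch. This is not the code behind `show_prob_changes`: `render_prob_changes` is a hypothetical helper, the probabilities are made up, and using a red tone (`Tomato`) for decreases is an assumption, since only the meaning of green is stated above.

```python
import html


def render_prob_changes(tokens, probs_before, probs_after):
    """Return HTML that colors each answer token by the direction of its probability change."""
    spans = []
    for token, before, after in zip(tokens, probs_before, probs_after):
        # Green if the input text raised the token's probability, red otherwise (assumed)
        color = "LimeGreen" if after > before else "Tomato"
        title = f"{before:.1%} to {after:.1%}"
        spans.append(f'<span style="color:{color}" title="{title}">{html.escape(token)}</span>')
    return " ".join(spans)


# Made-up probabilities for the answer tokens of one description
viz = render_prob_changes(["terrible", "failure"], [0.02, 0.10], [0.20, 0.05])
```

In a notebook, the returned string could be shown with `display(HTML(viz))`.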

In [10]:
#@title probability change of answer words
text = "I would not recommend it to anyone." #@param {type:"string"}
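from IPython.display import display  # explicit import so the cell also runs outside a notebook shell
from fewgen.explain import show_prob_changes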

for desc in classifier.descriptions:
  viz = show_prob_changes(classifier.lm, classifier.tokenizer, desc, text)
  display(HTML(viz))

What do you see? Does the language model "think" the way you expected?