# Prompt-based learning

Prompt-based learning is a recent paradigm in NLP. Instead of running a supervised training step, we rely directly on the pre-training objective of a language model, such as masked language modeling (MLM). To make a prediction, we only need to rewrite the original input `<X>` with a task-specific template into a textual prompt such as `<X, that is [MASK]>`, so that the model can perform the task without any further training.
This mechanism lets us exploit an LM that has been pre-trained on huge amounts of text. A prompting function can be defined so that the LM handles few-shot, one-shot, or even zero-shot tasks, which makes it easy to adapt the model to new scenarios with few or no labeled examples.
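
As a minimal sketch of the prompting function described above, the snippet below wraps an input text in a cloze template. The helper name `apply_template` and the `{x}` placeholder are illustrative assumptions, not part of any library.

```python
def apply_template(x: str,
                   template: str = "{x}, that is [MASK]",
                   mask_token: str = "[MASK]") -> str:
    """Wrap the raw input x in a cloze-style template with exactly one mask slot."""
    if template.count(mask_token) != 1:
        raise ValueError("template must contain exactly one mask token")
    return template.format(x=x.strip())

prompted = apply_template("I enjoyed the film")
print(prompted)  # I enjoyed the film, that is [MASK]
```

The resulting string is what gets passed to the masked language model; the model's prediction for the mask slot is then read as the answer to the task.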

In [2]:
!pip install -q transformers datasets

In [3]:
import os
import numpy as np
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
os.chdir("/content/drive/MyDrive/akademi/PromptingDeneme")

Mounted at /content/drive


In [5]:
from transformers import AutoModelForMaskedLM , AutoTokenizer
import torch
model_path="dbmdz/bert-base-turkish-cased"
tokenizer = AutoTokenizer.from_pretrained(model_path)

## Prompting Class
The `Prompting` class below wraps a masked language model and its tokenizer.

In [29]:
from transformers import AutoModelForMaskedLM , AutoTokenizer
class Prompting(object):
  """
   This class helps us to implement a
   prompt-based learning model
  """
  def __init__(self, **kwargs):
    """ constructor 

    Parameters:
    ----------
       model: str
            name or path of a pre-trained masked language model
            from the Hugging Face Hub
       tokenizer: str, optional
            name or path of the tokenizer if a different tokenizer is used, 
            otherwise leave it out
    """
    model_path=kwargs['model']
    tokenizer_path= kwargs['model']
    if "tokenizer" in kwargs.keys():
      tokenizer_path= kwargs['tokenizer']
    self.model = AutoModelForMaskedLM.from_pretrained(model_path)
    # load the tokenizer from tokenizer_path (same as model_path unless overridden)
    self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

  def prompt_pred(self,text):
    """
      Predict the [MASK] token by listing the probabilities of all
      candidate tokens, most likely first.

      Parameters:
      ----------
      text: str 
          The text including a [MASK] token.
          Only a single [MASK] token is supported. If more than one
          is given, the first one is used.

      Returns:
      --------
      list of (token, prob)
         A list of all tokens in the LM vocabulary along with 
         their probability scores, sorted by score in descending order.
    """
    tokenized_text = self.tokenizer.tokenize(text)
    indexed_tokens = self.tokenizer.convert_tokens_to_ids(tokenized_text)
    tokens_tensor = torch.tensor([indexed_tokens])
    # take the first masked token
    mask_pos=tokenized_text.index("[MASK]")
    self.model.eval()
    with torch.no_grad():
      outputs = self.model(tokens_tensor)
      predictions = outputs[0]
    values, indices=torch.sort(predictions[0, mask_pos],  descending=True)
    values=torch.nn.functional.softmax(values, dim=0)
    # use the instance tokenizer rather than a global one
    result=list(zip(self.tokenizer.convert_ids_to_tokens(indices), values))
    self.scores_dict={a:b for a,b in result}
    return result

  def compute_tokens_prob(self, text, token_list1, token_list2):
    """
    Compute the probabilities of two groups of candidate tokens
    at the [MASK] position.

    Parameters:
    ---------
    text: str
     the prompted text including a [MASK] token.
    token_list1: List(str)
     first group of tokens, e.g. positive words such as good, great. 
    token_list2: List(str)
     second group of tokens, e.g. negative words such as bad, terrible.

    Returns:
    --------
    Tuple (
       the highest probability among the first token list,
       the highest probability among the second token list,
       the ratio score1 / (score1 + score2)
    )
    """
    _=self.prompt_pred(text)
    score1=[self.scores_dict[token1] if token1 in self.scores_dict.keys() else 0\
            for token1 in token_list1]
    score1= max(score1)
    score2=[self.scores_dict[token2] if token2 in self.scores_dict.keys() else 0\
            for token2 in token_list2]
    score2= max(score2)
    return score1, score2, score1/ (score1+score2)

I use a Turkish LM here, but you can choose any other masked language model.

In [30]:
prompting= Prompting(model="dbmdz/bert-base-turkish-cased")

Some weights of the model checkpoint at dbmdz/bert-base-turkish-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Let the model predict some tokens

A POSITIVE example Input

In [31]:
text="Çok keyif aldım filmden" # which means: I really enjoyed the film
prompt=". çünkü [MASK] idi." # which means: because it was [MASK]
prompted= text + prompt
prompting.prompt_pred(prompted)[:10]

[('güzel', tensor(0.0294)),
 ('eski', tensor(0.0228)),
 ('harika', tensor(0.0220)),
 ('mükemmel', tensor(0.0214)),
 ('eğlenceli', tensor(0.0209)),
 ('yeni', tensor(0.0204)),
 ('muhteşem', tensor(0.0184)),
 ('kötü', tensor(0.0179)),
 ('komik', tensor(0.0171)),
 ('iyi', tensor(0.0169))]

A NEGATIVE example Input

In [32]:
text="Çok keyif almadım filmden" # which means: I didn't enjoy the film much
prompt=". çünkü [MASK] idi." # which means: because it was [MASK]
prompted= text + prompt
prompting.prompt_pred(prompted)[:10]

[('kötü', tensor(0.0362)),
 ('eski', tensor(0.0335)),
 ('berbat', tensor(0.0226)),
 ('yeni', tensor(0.0205)),
 ('gereksiz', tensor(0.0202)),
 ('iğrenç', tensor(0.0184)),
 ('sıkıcı', tensor(0.0166)),
 ('korkunç', tensor(0.0162)),
 ('saçma', tensor(0.0155)),
 ('güzel', tensor(0.0143))]

## Scoring groups of negative/positive words
Now we compare the probabilities of negative and positive words at the [MASK] position, first with a single word (token) per class and then with a list of words per class.

In [33]:
text="Çok keyif almadım filmden"
prompt=", çünkü [MASK] idi."
prompted= text + prompt
prompting.compute_tokens_prob(prompted, ["gereksiz"], ["harika"])

(tensor(0.0391), tensor(0.0106), tensor(0.7865))

In [34]:
text="Çok keyif almadım filmden"
prompt=", çünkü [MASK] idi."
prompted= text + prompt
prompting.compute_tokens_prob(prompted, ["gereksiz","kötü", "berbat","sıkıcı"], ["harika","güzel","mükemmel","muhteşem"])

(tensor(0.0391), tensor(0.0161), tensor(0.7078))

In [35]:
text="Çok keyif aldım filmden"
prompted= text + prompt
prompting.compute_tokens_prob(prompted, ["gereksiz"], ["harika"])

(tensor(0.0074), tensor(0.0600), tensor(0.1098))

In [36]:
prompting.compute_tokens_prob(prompted, ["gereksiz","kötü", "berbat","sıkıcı"], ["harika","güzel","mükemmel","muhteşem"])

(tensor(0.0074), tensor(0.0600), tensor(0.1100))
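
The scoring in `compute_tokens_prob` reduces to taking the best-scoring token in each group and normalizing the two values. The sketch below reproduces that arithmetic with a plain dictionary. The helper name `group_ratio` is hypothetical, the zero-total guard is an addition of this sketch, and the two probabilities are the rounded values printed for the single-token call in cell In [33], so the result only agrees with the reported ratio to about three decimals.

```python
def group_ratio(scores, group1, group2):
    """Return (best score in group1, best score in group2, s1 / (s1 + s2))."""
    # Tokens missing from the vocabulary contribute a score of 0.
    s1 = max(scores.get(tok, 0.0) for tok in group1)
    s2 = max(scores.get(tok, 0.0) for tok in group2)
    total = s1 + s2
    # Assumption of this sketch: fall back to 0.5 when neither group is found.
    ratio = s1 / total if total > 0 else 0.5
    return s1, s2, ratio

# Rounded probabilities copied from the output of cell In [33].
scores = {"gereksiz": 0.0391, "harika": 0.0106}
s1, s2, ratio = group_ratio(scores, ["gereksiz"], ["harika"])
print(round(ratio, 3))  # 0.787
```

A ratio above 0.5 means the first group is more probable at the mask position than the second; this is the decision rule used later in the evaluation loops.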

# Learning

We can query the LM to find out which words it considers most suitable for each class (positive or negative).

In [38]:
df= pd.read_csv("film.tsv", sep="\t", header=None)
df.columns=["text","label"]
df.head(3)

| | text | label |
|---|---|---|
| 0 | aglamaktan perisan oldugum bir filmdi.. | 1 |
| 1 | altan erkekli benim için artik son derece dege... | 0 |
| 2 | bastan asagi mantik hatalariyla dolu, yeni nes... | 0 |


Separate the negative and positive examples

In [39]:
pos=df[df.label==1]
neg=df[df.label==0]
pos.shape, neg.shape

((1644, 2), (1620, 2))

In [41]:
# take first 200 examples for pos/neg
pos200=pos["text"].values[:200]
neg200=neg["text"].values[:200]

### Let's find positive tokens for a given template

In [None]:
pos_tokens=[]
prompt=", yani [MASK] bir film."

for i,t in enumerate(pos200):
  if i%25==0:
    print(i)
  prompted= " ".join(t.split()[:10]) + prompt
  res=prompting.prompt_pred(prompted)[:10]
  res2=[e[0] for e in res]
  pos_tokens+= res2

0
25
50
75
100
125
150
175


In [None]:
import collections
cp=collections.Counter(pos_tokens)
cp.most_common(10)

[('güzel', 151),
 ('iyi', 147),
 ('harika', 133),
 ('yeni', 118),
 ('mükemmel', 93),
 ('sadece', 77),
 ('süper', 75),
 ('böyle', 60),
 ('klasik', 56),
 ('muhteşem', 49)]

### Let's find negative tokens for a given template

In [None]:
neg_tokens=[]
for i,t in enumerate(neg200):
  if i%25==0:
    print(i)
  prompted= t + prompt
  res=prompting.prompt_pred(prompted)[:10]
  res2=[e[0] for e in res]
  neg_tokens+= res2

In [None]:
import collections
cn=collections.Counter(neg_tokens)
cn.most_common(8)
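
Some frequent tokens in the positive counts above, such as 'yeni' ('new') or 'sadece' ('only'), do not carry sentiment and may well be frequent for both classes. One possible way to pick label words, offered here as an assumption rather than the procedure used in this notebook, is to rank tokens by the difference between their counts in the two classes. The helper `rank_label_words` is hypothetical, and the negative counts in the example are invented placeholders because the notebook does not show them.

```python
import collections

def rank_label_words(own_counts, other_counts, k=4):
    """Rank tokens by how much more often they appear for one class than the other."""
    diff = {tok: cnt - other_counts.get(tok, 0) for tok, cnt in own_counts.items()}
    ranked = sorted(diff.items(), key=lambda kv: kv[1], reverse=True)
    return [tok for tok, _ in ranked[:k]]

# Positive counts copied from the notebook output (top five entries only).
cp_demo = collections.Counter(
    {"güzel": 151, "iyi": 147, "harika": 133, "yeni": 118, "mükemmel": 93})
# PLACEHOLDER negative counts, invented purely for illustration.
cn_demo = collections.Counter({"kötü": 150, "yeni": 120, "iyi": 100, "güzel": 90})

print(rank_label_words(cp_demo, cn_demo, k=3))
```

With these placeholder counts, a token like 'yeni' drops to the bottom of the ranking because it is about equally frequent in both classes, while 'harika' and 'mükemmel' rise to the top.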

# Evaluation

In [None]:
pos_test200=pos["text"][200:400]
neg_test200=neg["text"][200:400]

In [None]:
prompt=", son derece [MASK] bir filmdi."

## Evaluation on the positive set

In [None]:
pos_best_tokens=["harika","mükemmel","muhteşem","süper"]
neg_best_tokens=["kötü","berbat","sıkıcı","sadece"]

crr=0
for i,t in enumerate(pos_test200):
  if i%25==0:
    print("%s. step, and the number of correct case is %s"%(i,crr))
  prompted= t + prompt
  res=prompting.compute_tokens_prob(prompted, pos_best_tokens, neg_best_tokens)
  if res[2]>0.5:
    crr+=1
print(crr)

0. step, and the number of correct case is 0
25. step, and the number of correct case is 19
50. step, and the number of correct case is 36
75. step, and the number of correct case is 54
100. step, and the number of correct case is 74
125. step, and the number of correct case is 95
150. step, and the number of correct case is 117
175. step, and the number of correct case is 140
163


## Evaluation on the negative set

In [None]:
crr=0
for i,t in enumerate(neg_test200):
  if i%25==0:
    print("%s. step, and the number of correct case is %s"%(i,crr))
  prompted= t + prompt
  res=prompting.compute_tokens_prob(prompted, pos_best_tokens, neg_best_tokens)
  if res[2]<0.5:
    crr+=1
print(crr)

0. step, and the number of correct case is 0
25. step, and the number of correct case is 5
50. step, and the number of correct case is 20
75. step, and the number of correct case is 32
100. step, and the number of correct case is 49
125. step, and the number of correct case is 66
150. step, and the number of correct case is 80
175. step, and the number of correct case is 93
108
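
The two final counts printed above can be turned into per-class and overall accuracy. The sketch below only does that arithmetic, using the counts from the two evaluation loops (163 and 108 correct out of 200 examples each); the variable names are illustrative.

```python
n_per_class = 200   # size of each test slice ([200:400])
correct_pos = 163   # final count from the positive-set loop
correct_neg = 108   # final count from the negative-set loop

acc_pos = correct_pos / n_per_class
acc_neg = correct_neg / n_per_class
acc_all = (correct_pos + correct_neg) / (2 * n_per_class)

print(f"positive: {acc_pos:.2%}, negative: {acc_neg:.2%}, overall: {acc_all:.2%}")
```

With this template and these label words, the zero-shot classifier recognizes positive reviews (81.5%) far more reliably than negative ones (54%), which suggests that the negative label words or the template are the first things to revisit.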
