<a href="https://colab.research.google.com/github/savasy/CorrespondenceAnalysisForNLP/blob/main/Prompting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prompt-based learning

Prompt-based learning is a recent paradigm in NLP. Instead of running a separate supervised training process, it relies directly on the pre-training objective (such as masked language modeling, MLM) of a pre-trained language model. To use the model for a prediction task, we only need to rewrite the original input `X` with a task-specific template into a textual prompt such as `X, that is [MASK]`, so that the model can perform the task without any further training.
This mechanism lets us exploit an LM that was pre-trained on huge amounts of text. With a suitable prompting function, an LM can perform few-shot, one-shot, or even zero-shot learning, so the model adapts to new scenarios with few or no labeled examples.
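
The prompting function described above can be sketched in a few lines. This is only an illustration: the helper name `apply_template` and the English example sentence are assumptions, not part of the notebook.

```python
def apply_template(x: str, template: str = "{x}, that is [MASK]") -> str:
    """Wrap the raw input x in a cloze-style prompt with a single mask slot."""
    return template.format(x=x.strip())


# Hypothetical example input; any sentence can be used here.
prompted = apply_template("I enjoyed the film")
print(prompted)
```

A masked language model then fills the `[MASK]` slot, and the filled-in word is read as the prediction for the task.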

In [2]:
!pip install -q transformers datasets


In [None]:
import os
import numpy as np
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
os.chdir("/content/drive/MyDrive/akademi/PromptingDeneme")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
from transformers import AutoModelForMaskedLM , AutoTokenizer
import torch
model_path="dbmdz/bert-base-turkish-cased"
tokenizer = AutoTokenizer.from_pretrained(model_path)

## Prompting Class
Here is the class definition for Prompting

In [3]:
from transformers import AutoModelForMaskedLM , AutoTokenizer
class Prompting(object):
  """ doc string 
   This class helps us to implement
   Prompt-based Learning Model
  """
  def __init__(self, **kwargs):
    """ constructor 

    parameter:
    ----------
       model: str
            name or path of a pre-trained language model from the Hugging Face Hub
       tokenizer: str, optional
            name or path of the tokenizer if a different tokenizer is used, 
            otherwise leave it out
    """
    model_path=kwargs['model']
    tokenizer_path= kwargs['model']
    if "tokenizer" in kwargs.keys():
      tokenizer_path= kwargs['tokenizer']
    self.model = AutoModelForMaskedLM.from_pretrained(model_path)
    # load the tokenizer from its own path when one is given
    self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

  def prompt_pred(self,text):
    """
      Predict the MASK token by listing the score of candidate tokens 
      where the first token is the most likely

      Parameters:
      ----------
      text: str 
          The text including a [MASK] token.
          It supports a single MASK token. If more [MASK]ed tokens 
          are given, it takes the first one.

      Returns:
      --------
      list of (token, score)
         The return is a list of all tokens in the LM vocab along with 
         their raw score (logit), sorted by score in descending order 
    """
    # use the instance's own tokenizer and model rather than global variables
    indexed_tokens=self.tokenizer(text, return_tensors="pt").input_ids
    tokenized_text= self.tokenizer.convert_ids_to_tokens(indexed_tokens[0])
    # take the first masked token
    mask_pos=tokenized_text.index(self.tokenizer.mask_token)
    self.model.eval()
    with torch.no_grad():
      outputs = self.model(indexed_tokens)
      predictions = outputs[0]
    # sort the logits at the mask position, highest first
    values, indices=torch.sort(predictions[0, mask_pos],  descending=True)
    #values=torch.nn.functional.softmax(values, dim=0)
    result=list(zip(self.tokenizer.convert_ids_to_tokens(indices.tolist()), values))
    self.scores_dict={a:b for a,b in result}
    return result

  def compute_tokens_prob(self, text, token_list1, token_list2):
    """
    Compute the summed scores at the [MASK] position for two token lists

    Parameters:
    ---------
    text: str
     the prompted text including a [MASK] token
    token_list1: List(str)
     the first group of tokens, e.g. positive polarity tokens such as good, great. 
    token_list2: List(str)
     the second group of tokens, e.g. negative polarity tokens such as bad, terrible.

    Returns:
    --------
    Tuple (
       the summed score of the first token list,
       the summed score of the second token list,
       the ratio score1/ (score1+score2),
       the softmax over [score1, score2]
    )
    """
    _=self.prompt_pred(text)
    score1=[self.scores_dict[token1] if token1 in self.scores_dict.keys() else 0\
            for token1 in token_list1]
    score1= sum(score1)
    score2=[self.scores_dict[token2] if token2 in self.scores_dict.keys() else 0\
            for token2 in token_list2]
    score2= sum(score2)
    softmax_rt=torch.nn.functional.softmax(torch.Tensor([score1,score2]), dim=0)
    return score1, score2, score1/ (score1+score2),softmax_rt

  def fine_tune(self, sentences, labels, prompt=" Çünkü [MASK] idi.",goodToken="iyi",badToken="kötü"):
    """  
      Fine tune the model on (sentence, label) pairs,
      where label 1 means positive and label 0 means negative
    """
    good=self.tokenizer.convert_tokens_to_ids(goodToken)
    bad=self.tokenizer.convert_tokens_to_ids(badToken)

    # transformers.AdamW is deprecated, so the PyTorch optimizer is used instead
    optimizer = torch.optim.AdamW(self.model.parameters(),lr=1e-3)
    # CrossEntropyLoss applies log-softmax itself, so it takes raw logits
    lossFunc = torch.nn.CrossEntropyLoss()

    self.model.train()
    for sen, label in zip(sentences, labels):
      # encode as in prompt_pred, so that [CLS] and [SEP] are added
      tokens_tensor = self.tokenizer(sen+prompt, return_tensors="pt").input_ids
      tokenized_text = self.tokenizer.convert_ids_to_tokens(tokens_tensor[0])
      # take the first masked token
      mask_pos=tokenized_text.index(self.tokenizer.mask_token)
      optimizer.zero_grad()  # clear the gradients of the previous step
      outputs = self.model(tokens_tensor)
      predictions = outputs[0]
      # order the two logits as [bad, good] so that index 1 matches label 1
      pred=predictions[0, mask_pos][[bad,good]]
      loss=lossFunc(pred.unsqueeze(0), torch.tensor([label]))
      loss.backward()
      optimizer.step()
    print("done!")


I use a Turkish LM here; you can choose any other masked language model. 

In [None]:
prompting= Prompting(model="dbmdz/bert-base-turkish-cased")

Some weights of the model checkpoint at dbmdz/bert-base-turkish-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Let the model predict some tokens

A POSITIVE example Input

In [None]:
text="Çok keyif aldım filmden" # Which means: I really enjoyed the film
prompt=". çünkü [MASK] idi." #  since it was [MASK]
prompted= text + prompt
prompting.prompt_pred(prompted)[:10]

[('güzel', tensor(7.8681)),
 ('eski', tensor(7.6136)),
 ('harika', tensor(7.5754)),
 ('mükemmel', tensor(7.5497)),
 ('eğlenceli', tensor(7.5261)),
 ('yeni', tensor(7.5034)),
 ('muhteşem', tensor(7.3991)),
 ('kötü', tensor(7.3724)),
 ('komik', tensor(7.3256)),
 ('iyi', tensor(7.3105))]

A NEGATIVE example Input

In [None]:
text="Çok keyif almadım filmden" # Which means: I didn't enjoy the film
prompt=". çünkü [MASK] idi." # , since it was [MASK]
prompted= text + prompt
prompting.prompt_pred(prompted)[:10]

[('kötü', tensor(8.0674)),
 ('eski', tensor(7.9916)),
 ('berbat', tensor(7.5965)),
 ('yeni', tensor(7.5006)),
 ('gereksiz', tensor(7.4864)),
 ('iğrenç', tensor(7.3922)),
 ('sıkıcı', tensor(7.2912)),
 ('korkunç', tensor(7.2623)),
 ('saçma', tensor(7.2231)),
 ('güzel', tensor(7.1390))]

## Producing the results for a pair of neg/pos word lists
Now we pass lists of neg/pos words rather than single neg/pos words (tokens).

In [None]:
text="Çok keyif almadım filmden"
prompt=", çünkü [MASK] idi."
prompted= text + prompt
prompting.compute_tokens_prob(prompted, ["gereksiz"], ["harika"])

(tensor(8.3082), tensor(7.0045), tensor(0.5426), tensor([0.7865, 0.2135]))

In [None]:
text="Çok keyif almadım filmden"
prompt=", çünkü [MASK] idi."
prompted= text + prompt
prompting.compute_tokens_prob(prompted, ["gereksiz","kötü", "berbat","sıkıcı"], ["harika","güzel","mükemmel","muhteşem"])

(tensor(32.1539), tensor(28.4894), tensor(0.5302), tensor([0.9750, 0.0250]))

In [None]:
text="Çok keyif aldım filmden"
prompted= text + prompt
prompting.compute_tokens_prob(prompted, ["gereksiz"], ["harika"])

(tensor(6.8251), tensor(8.9176), tensor(0.4335), tensor([0.1098, 0.8902]))

In [None]:
prompting.compute_tokens_prob(prompted, ["gereksiz","kötü", "berbat","sıkıcı"], ["harika","güzel","mükemmel","muhteşem"])

(tensor(26.2229),
 tensor(34.5780),
 tensor(0.4313),
 tensor([2.3515e-04, 9.9976e-01]))
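
Each tuple above holds two summed scores, their ratio, and a softmax over the two sums. As a check on how the last two entries relate to the first two, here is a minimal sketch in plain Python that recomputes them from the first output, `(8.3082, 7.0045)`. The helper name `ratio_and_softmax` is an assumption made for illustration.

```python
import math


def ratio_and_softmax(score1, score2):
    """Return score1/(score1+score2) and the softmax over [score1, score2]."""
    ratio = score1 / (score1 + score2)
    m = max(score1, score2)  # subtract the max for numerical stability
    e1, e2 = math.exp(score1 - m), math.exp(score2 - m)
    return ratio, (e1 / (e1 + e2), e2 / (e1 + e2))


ratio, probs = ratio_and_softmax(8.3082, 7.0045)
print(round(ratio, 4), [round(p, 4) for p in probs])
# prints 0.5426 [0.7865, 0.2135]
```

Because the scores are raw logits of similar magnitude, the ratio stays close to 0.5, while the softmax depends only on the difference between the two sums and therefore separates the two groups much more sharply. The ratio is also only meaningful while both sums are positive, which is not guaranteed for logits.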

# Learning

We can query the LM to find which words (pos or neg) it considers most suitable for a given template. 

In [None]:
df= pd.read_csv("film.tsv", sep="\t", header=None)
df.columns=["text","label"]
df.head(3)

Unnamed: 0,text,label
0,aglamaktan perisan oldugum bir filmdi..,1
1,altan erkekli benim için artik son derece dege...,0
2,"bastan asagi mantik hatalariyla dolu, yeni nes...",0


Separate neg/pos corpus

In [None]:
pos=df[df.label==1]
neg=df[df.label==0]
pos.shape, neg.shape

((1644, 2), (1620, 2))

In [None]:
# take first 200 examples for pos/neg
pos200=pos["text"].values[:200]
neg200=neg["text"].values[:200]

### Let's find positive tokens for a given template

In [None]:
pos_tokens=[]
prompt=", yani [MASK] bir film."

for i,t in enumerate(pos200[:5]):
  if i%25==0:
    print(i)
  prompted= " ".join(t.split()[:10]) + prompt
  res=prompting.prompt_pred(prompted)[:10]
  res2=[e[0] for e in res]
  pos_tokens+= res2

0


In [None]:
import collections
cp=collections.Counter(pos_tokens)
cp.most_common(10)

[('iyi', 4),
 ('güzel', 4),
 ('tek', 3),
 ('harika', 3),
 ('yeni', 2),
 ('kötü', 2),
 ('klasik', 2),
 ('ikinci', 2),
 ('özel', 2),
 ('baska', 2)]

### Let's find negative tokens for a given template

In [None]:
neg_tokens=[]
for i,t in enumerate(neg200[:5]):
  if i%25==0:
    print(i)
  prompted= t + prompt
  res=prompting.prompt_pred(prompted)[:10]
  res2=[e[0] for e in res]
  neg_tokens+= res2

0


In [None]:
import collections
cn=collections.Counter(neg_tokens)
cn.most_common(8)

[('iyi', 4),
 ('yeni', 4),
 ('kötü', 3),
 ('böyle', 3),
 ('öyle', 3),
 ('sadece', 3),
 ('##t', 3),
 ('baska', 3)]
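
Several tokens, such as 'iyi' and 'yeni', appear in both lists, so raw frequency alone does not tell the two classes apart. One possible way to pick more distinctive tokens is to compare the two counters. The sketch below is an assumption rather than the notebook's method: the helper `distinctive` is hypothetical, and the counts are copied from the two truncated outputs above, so tokens outside those lists are treated as unseen.

```python
from collections import Counter

# Counts copied from the outputs above (only the listed tokens are known).
cp = Counter({"iyi": 4, "güzel": 4, "tek": 3, "harika": 3, "yeni": 2,
              "kötü": 2, "klasik": 2, "ikinci": 2, "özel": 2, "baska": 2})
cn = Counter({"iyi": 4, "yeni": 4, "kötü": 3, "böyle": 3, "öyle": 3,
              "sadece": 3, "##t": 3, "baska": 3})


def distinctive(a, b, k=3):
    """Tokens that are more frequent in a than in b, skipping WordPiece fragments."""
    diff = {t: a[t] - b[t] for t in a if not t.startswith("##")}
    ranked = sorted(diff.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(t, d) for t, d in ranked if d > 0][:k]


print(distinctive(cp, cn))  # candidates for the positive list
print(distinctive(cn, cp))  # candidates for the negative list
```

With only five reviews per class behind these counts, the result is a rough hint at best; a larger sample would be needed before trusting any token.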

# Evaluation

In [None]:
pos_test200=pos["text"][200:400]
neg_test200=neg["text"][200:400]

In [None]:
prompt=", son derece [MASK] bir filmdi."

## For POSITIVE Set Eval.

In [None]:
pos_best_tokens=["harika","mükemmel","muhteşem","süper"]
neg_best_tokens=["kötü","berbat","sıkıcı","sadece"]

crr=0
for i,t in enumerate(pos_test200):
  if i%25==0:
    print("%s. step, and the number of correct case is %s"%(i,crr))
  prompted= t + prompt
  res=prompting.compute_tokens_prob(prompted, pos_best_tokens, neg_best_tokens)
  if res[2]>0.5:
    crr+=1
print(crr)

## For NEGATIVE Set Eval.

In [None]:
crr=0
for i,t in enumerate(neg_test200):
  if i%25==0:
    print("%s. step, and the number of correct case is %s"%(i,crr))
  prompted= t + prompt
  res=prompting.compute_tokens_prob(prompted, pos_best_tokens, neg_best_tokens)
  if res[2]<0.5:
    crr+=1
print(crr)
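
The two loops above differ only in which side of 0.5 counts as correct, so they can be folded into one helper that also reports an overall accuracy. This is a sketch under assumptions: `count_correct`, `accuracy`, and the stub scorer `toy_score` are hypothetical names, and the stub stands in for the real model so that the block runs on its own.

```python
def count_correct(texts, score_fn, expect_first):
    """Count texts whose ratio falls on the expected side of 0.5."""
    correct = 0
    for t in texts:
        ratio = float(score_fn(t))
        if (ratio > 0.5) == expect_first:
            correct += 1
    return correct


def accuracy(pos_texts, neg_texts, score_fn):
    """Share of correctly classified texts over both test sets."""
    hits = (count_correct(pos_texts, score_fn, True)
            + count_correct(neg_texts, score_fn, False))
    return hits / (len(pos_texts) + len(neg_texts))


def toy_score(text):
    """Stub scorer for demonstration only; not a real model."""
    return 0.7 if "good" in text else 0.3


print(accuracy(["good film", "fine film"], ["bad film"], toy_score))
```

In the notebook, the scorer could be `lambda t: prompting.compute_tokens_prob(t + prompt, pos_best_tokens, neg_best_tokens)[2]`. Note that this helper counts a ratio of exactly 0.5 as a negative prediction, whereas the loops above count it as wrong for both sets.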

# Fine-tuning

In [None]:
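# pos200 and neg200 are NumPy arrays, where + adds element-wise; convert to lists to concatenate
prompting.fine_tune(list(pos200)+list(neg200), [1]*len(pos200)+ [0]* len(neg200))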

KeyboardInterrupt: ignored
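
One detail is worth keeping in mind when training on two label-word logits with cross-entropy: `torch.nn.CrossEntropyLoss` applies log-softmax internally, so it expects raw logits rather than probabilities. The plain-Python sketch below, with made-up logits and hypothetical helper names, shows how the loss flattens when softmax outputs are passed in instead.

```python
import math


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]


def cross_entropy(inputs, target):
    """Negative log-likelihood of target after log-softmax."""
    return -log_softmax(inputs)[target]


logits = [2.0, -1.0]  # made-up scores for two label words
probs = [math.exp(v) for v in log_softmax(logits)]

loss_from_logits = cross_entropy(logits, 1)  # target is the low-scoring class
loss_from_probs = cross_entropy(probs, 1)    # softmax applied twice
print(round(loss_from_logits, 4), round(loss_from_probs, 4))
```

Since two probabilities differ by at most 1, the second loss is confined to roughly 0.31 to 1.31 no matter how wrong the prediction is, which weakens the training signal.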

In [None]:
[1,3,4]+[3,4,1,1,1,1,1,1]