# Zero-shot experiments

In [None]:
!pip install transformers datasets --quiet

In [1]:
from datasets import load_dataset
import pandas as pd
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
from time import time
from sklearn.metrics import classification_report

Для проверки знаний модели о лингвистической приемлемости можно использовать zero-shot классификацию. Сначала надо подготовить датасет.

In [2]:
ds = load_dataset('linxinyuan/cola')

Found cached dataset cola (C:/Users/Julia/.cache/huggingface/datasets/linxinyuan___cola/default/0.0.0/0871d55203d4de46ef1815400998ed8f219236694f0d03786bde849741f04cd4)


  0%|          | 0/2 [00:00<?, ?it/s]

In [3]:
ds_train = pd.DataFrame(ds["train"])
ds_test = pd.DataFrame(ds["test"])

In [None]:
ds_test.head()

Unnamed: 0,text,label
0,The sailors rode the breeze clear of the rocks.,1
1,The weights made the rope stretch over the pul...,1
2,The mechanical doll wriggled itself loose.,1
3,"If you had eaten more, you would want less.",1
4,"As you eat the most, you want the least.",0


In [None]:
ds_train.shape, ds_test.shape

((8551, 2), (527, 2))

Датасет состоит из части для обучения и теста. Мы будем использовать только тестовую часть. Датасет представляет собой предложения и соответствующие им классы {0,1}, где 0 -- предложение содержит ошибку, а 1 -- предложение корректно.

# Zero shot NLI

Первая идея проверить энкодерные модели, которые были обучены на задаче NLI или на нескольких задачах NLU как multi-tasking. Также надо поготовить шаблон с гипотезой и классами.

In [None]:
def check(text, classifier, hypothesis_template="The sentence is {}.", candidate_labels=["correct", "incorrect"]):
    answer =  classifier(text, candidate_labels, hypothesis_template=hypothesis_template)["labels"][0]
    if answer == candidate_labels[0]:
        return 1
    else:
        return 0

In [None]:
models = [
    "cross-encoder/nli-roberta-base",
    "sileod/deberta-v3-base-tasksource-nli",
    "gsarti/biobert-nli",
    "facebook/bart-large-mnli"
]

In [None]:
for model in models:
    classifier = pipeline("zero-shot-classification", model=model, device=0)
    start = time()
    ds_test[model] = ds_test["text"].apply(lambda x: check(x, classifier))
    print(f"Inference of {model} took {time() - start} s")
    del classifier




Inference of cross-encoder/nli-roberta-base took 12.654100179672241 s


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Inference of sileod/deberta-v3-base-tasksource-nli took 20.53249502182007 s


Downloading (…)lve/main/config.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/433M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at gsarti/biobert-nli and are newly initialized: ['classifier.weight', 'classifier.bias']

You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Downloading (…)okenizer_config.json:   0%|          | 0.00/136 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.




Inference of gsarti/biobert-nli took 10.088934659957886 s


Downloading (…)lve/main/config.json:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]




Inference of facebook/bart-large-mnli took 26.537539958953857 s


In [None]:
for model in models:
    print(f"Results for {model}:")
    print(classification_report(ds_test["label"], ds_test[model]))
    print("-----------------------------\n")

Results for cross-encoder/nli-roberta-base:

              precision    recall  f1-score   support



           0       0.29      0.17      0.21       162

           1       0.69      0.82      0.75       365



    accuracy                           0.62       527

   macro avg       0.49      0.49      0.48       527

weighted avg       0.57      0.62      0.58       527



-----------------------------



Results for sileod/deberta-v3-base-tasksource-nli:

              precision    recall  f1-score   support



           0       0.00      0.00      0.00       162

           1       0.69      0.99      0.81       365



    accuracy                           0.69       527

   macro avg       0.35      0.49      0.41       527

weighted avg       0.48      0.69      0.56       527



-----------------------------



Results for gsarti/biobert-nli:

              precision    recall  f1-score   support



           0       0.27      0.50      0.35       162

           1       0

Метрика Accuracy здесь мало что говорит, потому что классы не сбалансированы, а нам интересно понять, улавливает ли модель ошибки. Так, например, у модели sileod/deberta-v3-base-tasksource-nli самое высокое Accuracy, но при этом она просто все предложения классифицирует как корректные. Наибольшая полнота и точность ошибочных предложений достигается у модели facebook/bart-large-mnli, но зато хуже справляется с корректными предложениями. Можно попробовать для этой же модели другие шаблоны.

In [None]:
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)

In [None]:
sequences = ds_test[["text", "label"]]

candidate_labels = ["is a", "is no"]
hypothesis_template = "There {} mistake in the previous sentence."

sequences["predictions"] = sequences["text"].apply(lambda x: check(x, classifier, hypothesis_template=hypothesis_template, candidate_labels=candidate_labels))




A value is trying to be set on a copy of a slice from a DataFrame.

Try using .loc[row_indexer,col_indexer] = value instead



See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  sequences["predictions"] = sequences["text"].apply(lambda x: check(x, classifier, hypothesis_template=hypothesis_template, candidate_labels=candidate_labels))


In [None]:
print(classification_report(sequences["label"], sequences["predictions"]))

              precision    recall  f1-score   support



           0       0.30      0.02      0.03       162

           1       0.69      0.98      0.81       365



    accuracy                           0.69       527

   macro avg       0.50      0.50      0.42       527

weighted avg       0.57      0.69      0.57       527




In [None]:
candidate_labels = ["acceptable", "unacceptable"]
hypothesis_template = "The sentence is {} in terms of linguistics."

sequences["predictions"] = sequences["text"].apply(lambda x: check(x, classifier, hypothesis_template=hypothesis_template, candidate_labels=candidate_labels))
print(classification_report(sequences["label"], sequences["predictions"]))




              precision    recall  f1-score   support



           0       0.29      0.06      0.10       162

           1       0.69      0.93      0.79       365



    accuracy                           0.67       527

   macro avg       0.49      0.50      0.45       527

weighted avg       0.57      0.67      0.58       527





A value is trying to be set on a copy of a slice from a DataFrame.

Try using .loc[row_indexer,col_indexer] = value instead



See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  sequences["predictions"] = sequences["text"].apply(lambda x: check(x, classifier, hypothesis_template=hypothesis_template, candidate_labels=candidate_labels))


В общем стало только хуже, лучше всего было с первоначальным шаблоном.

# Zero-shot generative

Еще захотели проверить генеративные модели. Взяли gpt2 как обычную генеративную модель и allenai/tk-instruct-11b-def-pos как генеративную модель, обученную на инструкциях. (Еще хотели nlpcloud/instruct-gpt-j-fp16, но она не влезает:()

### GPT-2

In [45]:
tokenizer_gpt = AutoTokenizer.from_pretrained('gpt2')
tokenizer_gpt.pad_token_id = tokenizer_gpt.eos_token_id

In [49]:
generative_classifier = pipeline('text-generation', model='gpt2', tokenizer=tokenizer_gpt)

In [7]:
generative_classifier("Everyone hopes everyone to sleep. Q: Is this sentence correct in terms of linguistics?", 
    return_full_text=False, max_new_tokens=10, num_beams=3)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': '\n\nA: Yes.\n\nQ:'}]

In [8]:
generative_classifier("Everyone hopes everyone to sleep. Q: Is this sentence correct in terms of linguistics?", 
    return_full_text=False, max_new_tokens=10, num_beams=3)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': ' A: No, it is not. It is'}]

In [13]:
generative_classifier("""Everyone hopes everyone to sleep. Q: Is this sentence correct in terms of linguistics?""", 
    return_full_text=False, max_new_tokens=10, num_beams=3)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': ' A: Yes, it is correct in terms of'}]

Сначала я просто попробовала в промпте передовать предложение и вопрос к нему, но модель каждый раз дает разный ответ. Выглядит как рандом. Поэтому следущая идея few-shot learning: подать пару примеров 

In [16]:
generative_classifier(
    """
    Are the following sentences correct?
    Q: Everyone hopes everyone to sleep.
    A: no

    Q: Everyone hopes that he will sleep. 
    A: yes

    Q: If he were a rich man, he'd buy a diamond ring.
    A:""", 
    return_full_text=False, max_new_tokens=5, num_beams=3)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': ' yes\n\n   '}]

In [34]:
def generate_response_gpt(sentence):
    answer = generative_classifier(
        f"""
        Are the following sentences correct?
        Q: Everyone hopes everyone to sleep.
        A: no

        Q: Everyone hopes that he will sleep. 
        A: yes

        Q: {sentence}
        A:""", 
        return_full_text=False, max_new_tokens=5, num_beams=3)
    return answer[0]["generated_text"]

In [38]:
for i in range(10):
    print(f"Sentence: {ds_test.iloc[i]['text']}\nTrue label: {ds_test.iloc[i]['label']}")
    print(generate_response_gpt(ds_test.iloc[i]["text"]))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sentence: The sailors rode the breeze clear of the rocks.
True label: 1


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 yes

   
Sentence: The weights made the rope stretch over the pulley.
True label: 1


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 no


Q:
Sentence: The mechanical doll wriggled itself loose.
True label: 1


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 yes

   
Sentence: If you had eaten more, you would want less.
True label: 1


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 no


Q:
Sentence: As you eat the most, you want the least.
True label: 0


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 no


Q:
Sentence: The more you would want, the less you would eat.
True label: 0


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 yes

   
Sentence: I demand that the more John eat, the more he pays.
True label: 0


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 yes

   
Sentence: Mary listens to the Grateful Dead, she gets depressed.
True label: 1


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 no


Q:
Sentence: The angrier Mary got, the more she looked at pictures.
True label: 1


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 yes

   
Sentence: The higher the stakes, the lower his expectations are.
True label: 1
 yes

   


In [39]:
def gpt_predict(sentence):
    answer = generative_classifier(
        f"""
        Are the following sentences correct?
        Q: Everyone hopes everyone to sleep.
        A: no

        Q: Everyone hopes that he will sleep. 
        A: yes

        Q: {sentence}
        A:""", 
        return_full_text=False, max_new_tokens=5, num_beams=3)
    if "no" in answer[0]["generated_text"]:
        return 0
    return 1

In [57]:
%%time
ds_test["prediction"] = ds_test["text"].apply(gpt_predict)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end gene

CPU times: user 20min 34s, sys: 1.82 s, total: 20min 36s
Wall time: 10min 19s


In [58]:
print(classification_report(ds_test["label"], ds_test["prediction"]))

              precision    recall  f1-score   support

           0       0.25      0.24      0.25       162
           1       0.67      0.68      0.68       365

    accuracy                           0.55       527
   macro avg       0.46      0.46      0.46       527
weighted avg       0.54      0.55      0.55       527



Для генеративной модели выглядит неплохо, но в сравнении с моделью, обученной на задаче NLI, ошибки детектирует хуже.

### TK-Instruct

In [17]:
tokenizer = AutoTokenizer.from_pretrained("allenai/tk-instruct-3b-def")
model = AutoModelForSeq2SeqLM.from_pretrained("allenai/tk-instruct-3b-def")

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.13k [00:00<?, ?B/s]

Downloading (…)"spiece.model";:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/1.79k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/676 [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [21]:
input_ids = tokenizer.encode(
    """
    Definition: Is this sentence correct in terms of linguistics? 
    Input: Everyone hopes everyone to sleep. 
    Output: no
    
    Input: Everyone hopes that he will sleep. 
    Output: yes
    
    Input: If he were a rich man, he'd buy a diamond ring.
    Output:
    """, 
    return_tensors="pt")
output = model.generate(input_ids, max_length=10)
tokenizer.decode(output[0], skip_special_tokens=True)

'no'

In [28]:
def generate_response(sentence):
    input_ids = tokenizer.encode(
        f"""
        Definition: Is this sentence correct in terms of linguistics? 
        Input: Everyone hopes that he will sleep. 
        Output: yes

        Input: Everyone hopes everyone to sleep. 
        Output: no
        
        Input: If he were a rich man, he'd buy a diamond ring.
        Output: yes

        Input: {sentence}
        Output:
        """, 
        return_tensors="pt")
    output = model.generate(input_ids, max_length=10)
    return tokenizer.decode(output[0], skip_special_tokens=True)

In [32]:
ds_test.head()

Unnamed: 0,text,label
0,The sailors rode the breeze clear of the rocks.,1
1,The weights made the rope stretch over the pul...,1
2,The mechanical doll wriggled itself loose.,1
3,"If you had eaten more, you would want less.",1
4,"As you eat the most, you want the least.",0


In [33]:
for i in range(10):
    print(generate_response(ds_test.iloc[i]["text"]))

no
no
no
no
no
no
no
no
no
no


Модель просто всегда выдает "no", поэтому нет смысла проверять на всем датасете. Но можно попробовать разные варианты параметров для генерации.

In [59]:
input_ids = tokenizer.encode(
    """
    Definition: Is this sentence correct in terms of linguistics? 
    Input: Everyone hopes everyone to sleep. 
    Output: no
    
    Input: Everyone hopes that he will sleep. 
    Output: yes
    
    Input: If he were a rich man, he'd buy a diamond ring.
    Output:
    """, 
    return_tensors="pt")

In [60]:
def get_answer_instruct(input_ids, params):
    output = model.generate(input_ids, **params)
    return tokenizer.decode(output[0], skip_special_tokens=True)

In [64]:
params = {
    "max_new_tokens": 10,
    "temperature": 10,
}
get_answer_instruct(input_ids, params)

'no'

In [66]:
params = {
    "max_new_tokens": 10,
    "do_sample": True,
}
get_answer_instruct(input_ids, params)

'yes'

Модель меняет ответ только с семплированием, но это нечестно, нам не нужен рандом.

## Выводы

Как выяснилось, модели слабо понимают языковую приемлемость, если их "спрашивать" напрямую. Но мы использовали не такие уж и большие модели, чтобы они могли настолько глубоко понимать язык. Что если попробовать реальную LLM? Мы взяли первые 20 предложений и спросили у ChatGPT.

In [None]:
print("""
Prompt:

Linguistic acceptability test. Write 1 if the sentence is acceptable in terms of linguistics, and 0 if not. Write answer for each sentence in a separate line. And if the sentence is incorrect tell why.

The sailors rode the breeze clear of the rocks.
The weights made the rope stretch over the pulley.
The mechanical doll wriggled itself loose.
If you had eaten more, you would want less.
As you eat the most, you want the least.
The more you would want, the less you would eat.
I demand that the more John eat, the more he pays.
Mary listens to the Grateful Dead, she gets depressed.
The angrier Mary got, the more she looked at pictures.
The higher the stakes, the lower his expectations are.
The more Fred is obnoxious, the less attention you should pay to him.
John was lots more obnoxious than Fred.
The more people you give beer to, the more people get sick.
The more does Bill smoke, the more Susan hates him.
The more pictures of him that appear in the news, the more embarrassed John becomes.
Every senator seems to become more corrupt, as he talks to more lobbyists.
Who does John visit Sally because he likes?
Marianne did not leave.
He could not have been working.
He can not have been working.

Answer:

1
1
0 (use 'loose itself' instead of 'wriggled itself loose')
1
0 (use 'the person who eats the most wants the least' instead of 'as you eat the most, you want the least')
0 (use 'the more you want, the less you eat' instead of 'the more you would want, the less you would eat')
0 (use 'the more John eats, the more he pays' instead of 'I demand that the more John eat, the more he pays')
0 (use a conjunction such as 'when' to join the two clauses, e.g. "Mary listens to the Grateful Dead when she gets depressed")
1
1
1
1
1
1
1
1
0 (use 'whom' instead of 'who' as 'whom' is the object of the verb 'visit')
1
1
1
""")

In [16]:
ds_test_subset = ds_test[:20]

In [13]:
ds_test_subset["predictions"] = [
    1,
    1,
    0,
    1,
    0,
    0,
    0,
    0,
    1,
    1,
    1,
    1,
    1,
    1,
    1,
    1,
    0,
    1,
    1,
    1,
]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ds_test_subset["predictions"] = [


In [14]:
ds_test_subset

Unnamed: 0,text,label,predictions
0,The sailors rode the breeze clear of the rocks.,1,1
1,The weights made the rope stretch over the pul...,1,1
2,The mechanical doll wriggled itself loose.,1,0
3,"If you had eaten more, you would want less.",1,1
4,"As you eat the most, you want the least.",0,0
5,"The more you would want, the less you would eat.",0,0
6,"I demand that the more John eat, the more he p...",0,0
7,"Mary listens to the Grateful Dead, she gets de...",1,0
8,"The angrier Mary got, the more she looked at p...",1,1
9,"The higher the stakes, the lower his expectati...",1,1


In [15]:
print(classification_report(ds_test["label"][:19], answers))

              precision    recall  f1-score   support

           0       0.50      0.80      0.62         5
           1       0.91      0.71      0.80        14

    accuracy                           0.74        19
   macro avg       0.70      0.76      0.71        19
weighted avg       0.80      0.74      0.75        19



Chat-GPT допустила 3 ошибки: одно предложение пропустила и отметила как верное, и два ошибочно поправила, сумев объяснить ошибки.