# Exp 013: Create constraints and prompts for grammar-controlled text generation tasks
This notebook experiments with how to set up the different tasks and prompts, using GPT-3.5 as a baseline to test the evaluation suite.

In [1]:
import importlib
import os
import random
import sys

import pandas as pd
from dotenv import load_dotenv

load_dotenv()  # load environment variables from a .env file
sys.path.append('../source')  # make the project modules importable

import models
import data
import api
import evaluation
#importlib.reload(data)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[nltk_data] Downloading package punkt to
[nltk_data]     /cluster/scratch/dglandorf/cache...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data]

In [2]:
# helper functions
levels = ["A1", "A2", "B1", "B2", "C1", "C2"] 
level_idx = {level: i for i, level in enumerate(levels)}

def skills(rules, statement=True):
    """Format EGP rules as prompt entries, with or without the can-do statement."""
    if statement:
        return "- " + rules['SubCategory'] + " - " + rules['guideword'] + ": " + rules['Can-do statement']
    return rules['SubCategory'] + " " + rules['guideword']
    
def get_prompt(rules, snippet, negative_rules=None, dialog=True, all=False, level=None, statement=True):
    sep = os.linesep if statement else "; "
    if level:
        constraints = f"grammar skills on CEFR level {level}."
    else: 
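        # negative_rules is a DataFrame, so test against None instead of relying on its truth value
        exclude = f"""Do NOT apply the following grammar skills:
{sep.join(skills(negative_rules, statement=statement))}
""" if negative_rules is not None else ""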
        constraints = f"""{"all" if all else "at least one"} of these grammar skills:
{sep.join(skills(rules, statement=statement))}
{exclude}"""
    return f"""Continue the {"dialog with one turn" if dialog else "text with two sentences"}, proving knowledge of {constraints}
Snippet:
{snippet}"""
    
def classify_sents(classifiers, rules, sentences, verbose=True):
    results = []
    for idx, rule in rules.iterrows():
        scores = models.probe_model(classifiers[rule['#']], sentences)
        hits = [sentences[i] for i, score in enumerate(scores[0]) if score > 0.5]
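        # Use logical `and`: the bitwise `verbose & len(hits)` is 0 for an even number of hits
        if verbose and hits:
            print(f'\n{rule["Level"]}-{rule["Can-do statement"]} ({rule["#"]})\n{os.linesep.join(["- "+h for h in hits])}')
        # Share of sentences that satisfy the rule; guard against an empty response
        results.append(len(hits)/len(sentences) if sentences else 0.0)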
    return results

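def sample_dialog_snippet(dialogs, n=4):
    """Sample n consecutive utterances; return the first n-1 as context (ending in an open turn) and the last one as the true response."""
    dialog = []
    # Resample until the dialog is long enough to cut a snippet from it
    while len(dialog) < n+1:
        dialog = random.sample(dialogs, 1)[0]
    index = random.randint(0, len(dialog) - n)
    utterances = dialog[index:index+n]
    # Alternate speaker labels A/B and leave the final turn empty for the model to fill
    context = os.linesep.join([("A" if (i%2==0) else "B") + ": " + utt for i, utt in enumerate(utterances[:-1] + [""])])
    return context, utterances[-1]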

def prompt_score(rules, snippet, classifiers, dialog, all, n_responses=1, level=None, statement=True, verbose=False):
    prompt = get_prompt(rules, snippet, dialog=dialog, all=all, level=level, statement=statement)
    if verbose: print(f"PROMPT: {prompt}")
    # One completion per requested response; to try another model, add e.g. model="gpt-4-0125-preview" to the call
    responses = [api.get_openai_chat_completion([{"role": "user", "content": prompt}])[0] for _ in range(n_responses)]
    if verbose:
        print("RESPONSES from GPT3.5")
        print(responses)
    if dialog: print(evaluation.evaluate_responses([snippet], [responses], [list(rules['#'])], [[]]))
    return [classify_sents(classifiers, rules, data.sent_tokenize(response), verbose) for response in responses]

Load EGP rules, topical prompts and dialogs

In [3]:
egp = data.get_egp()
cefr = data.CEFRTexts()
stories = list(cefr.get_beginnings(100))
ds = data.DialogSum()
dd = data.DailyDialog()
wow = data.WoW()
dialogs = ds.get_dialogues() + dd.get_dialogues() + wow.get_dialogues()
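# One classifier per checkpoint file; the file name without ".pth" is the EGP rule number
rule_numbers = [int(name.replace(".pth", "")) for name in os.listdir("../models/corpus_training")]
classifiers = {nr: models.load_classifier(nr, 'corpus_training') for nr in rule_numbers}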

General parameters for task datasets
1. text vs dialog -> dialog
2. context length -> 4 turns (trade-off between context length and informativeness)

Dev datasets: Wizard of Wikipedia, DailyDialog, DialogSum

Test datasets: CMU Document Grounded Conversations (CMU DoG)

## Questions

- Should the dialog contexts be as diverse as possible, i.e. different for each datapoint, or should the same set of dialog contexts be reused for every combination of parameters?
- Is it reasonable to work with a fixed number of turns in both the training and the test data? I'd say there is no reason why it should not generalize.
- Is it okay to include no ground-truth answer, or only an unconditioned one, in the training and test sets, given that all metrics are reference-free? Restricting the data to sets of constraints that are already fulfilled in the given datasets would likely make the evaluation much smaller and less challenging.
- In the evaluation, should I normalize the score by the number of sentences, or should a response get the full score as soon as a constraint is met in at least one sentence?
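
The last question can be made concrete with a small sketch. It contrasts the two options for a single response: the share of sentences in which a constraint is detected (which is what `classify_sents` returns) and a binary score that is 1 as soon as any sentence satisfies the constraint. The function names and the per-sentence flags are illustrative assumptions and not part of the `evaluation` module.

```python
def fraction_score(sentence_hits):
    """Share of sentences in which the constraint was detected."""
    if not sentence_hits:  # an empty response satisfies nothing
        return 0.0
    return sum(sentence_hits) / len(sentence_hits)


def binary_score(sentence_hits):
    """1.0 if the constraint was detected in at least one sentence, else 0.0."""
    return 1.0 if any(sentence_hits) else 0.0


# Hypothetical classifier decisions for a two-sentence response:
# the constraint is detected in the first sentence only.
hits = [True, False]
normalized = fraction_score(hits)
at_least_once = binary_score(hits)
print(normalized, at_least_once)
```

With normalization, a response that meets a constraint once is penalized for every additional sentence it contains, whereas the binary variant does not depend on response length.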

## Task 1: Single constraints

In [5]:
def sample_single_constraints(level=None, n_per_subcat=1, subcategories = ["would", "negation", "superlatives"]):
    if level is None: level = random.sample(levels, 1)[0]
    return egp[(egp['SubCategory'].isin(subcategories)) & (egp['Level']==level) & egp['#'].isin(classifiers.keys())].groupby("SubCategory").sample(n_per_subcat)

For Text

In [6]:
snippet = random.sample(stories, 1)[0]
rules = sample_single_constraints()
scores = prompt_score(rules, snippet, classifiers, dialog=False, all=True, verbose=True)
print(f"Constraint scores: {scores}")

PROMPT: Continue the text with two sentences, proving knowledge of all of these grammar skills:
- negation - FORM/USE: 'NOT ONLY ... (BUT) ALSO' WITH INVERSION: Can use auxiliary 'do' + inverted subject after 'not only', to give focus.
- superlatives - FORM/USE: WITH NOUN AND POSTMODFIER: Can use a postmodifier to make the superlative stronger in the structure superlative + postmodifier + noun. 
- would - FORM: WITH ADVERBS: Can use an increasing range of adverbs with 'would', including 'strongly', 'easily', 'especially', 'actually', 'absolutely', 'gladly'  ► adverbs

Snippet:
Excuse me.
RESPONSES from GPT3.5
['Not only did she break the record for the fastest runner, but she also set a new world record, making her the fastest woman alive. She would absolutely dominate the competition, especially with her strong determination and training regimen.']

C1-Can use auxiliary 'do' + inverted subject after 'not only', to give focus. (1198)
- Not only did she break the record for the fastest 

For Dialog

In [7]:
for i in range(5):
    rules = sample_single_constraints()
    snippet, _ = sample_dialog_snippet(dialogs)
    prompt_score(rules, snippet, classifiers, dialog=True, all=True, verbose=True, n_responses=3)
    print("_" * 100)

PROMPT: Continue the dialog with one turn, proving knowledge of all of these grammar skills:
- negation - FORM: 'NOT ALL', 'NOT EVERY': Can use 'not with indefinite pronouns 'everyone' and 'everything' and determiners 'every', 'all'.
- superlatives - FORM/USE: WITH NOUN AND POSTMODFIER: Can use a postmodifier to make the superlative stronger in the structure superlative + postmodifier + noun. 
- would - FORM: WITH ADVERBS: Can use an increasing range of adverbs with 'would', including 'strongly', 'easily', 'especially', 'actually', 'absolutely', 'gladly'  ► adverbs

Snippet:
A: Either way is good for me.
B: There are a number of open houses this weekend in your area. Would that okay with you?
A: I can take a little time off of work, or I can go on a weekend, also.
B: 
RESPONSES from GPT3.5
['Not all of the open houses this weekend will be suitable for us, so we should carefully choose the ones that interest us the most.', "Not every open house will have what I'm looking for, so I'll ha

{'Distinctiveness': [0.8904109589041096], 'positive_constraints': [[[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]], 'negative_constraints': [[[], [], []]], 'Appropriateness': [[3.0, 4.0, 4.0]], 'Relevance': [[2.0, 4.0, 4.0]], 'Content Richness': [[3.0, 4.0, 3.0]], 'Grammatical Correctness': [[4.0, 5.0, 4.0]]}

C1-Can use 'not with indefinite pronouns 'everyone' and 'everything' and determiners 'every', 'all'. (1197)
- Not all of the open houses this weekend will be suitable for us, so we should carefully choose the ones that interest us the most.

C1-Can use 'not with indefinite pronouns 'everyone' and 'everything' and determiners 'every', 'all'. (1197)
- Not every open house will have what I'm looking for, so I'll have to choose carefully.

C1-Can use 'not with indefinite pronouns 'everyone' and 'everything' and determiners 'every', 'all'. (1197)
- Not all of the open houses are within walking distance from my house, so I would probably need to drive, especially if the weather is

{'Distinctiveness': [0.6097560975609756], 'positive_constraints': [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]], 'negative_constraints': [[[], [], []]], 'Appropriateness': [[4.0, 5.0, 2.0]], 'Relevance': [[4.0, 4.0, 4.0]], 'Content Richness': [[2.0, 2.0, 2.0]], 'Grammatical Correctness': [[5.0, 5.0, 5.0]]}

A2-Can use 'it would be' to make suggestions. (623)
- I'm sorry, but I don't think that would be a good idea.
____________________________________________________________________________________________________
PROMPT: Continue the dialog with one turn, proving knowledge of all of these grammar skills:
- negation - FORM: AUXILIARY VERBS 'BE', 'HAVE', PAST: Can form negative statements of main verbs in the past continuous and past perfect with auxiliary verbs 'be' and 'have' + 'not/n't'. ► past continuous ► past perfect
- superlatives - FORM/USE: 'THE BEST' WITH NOUN AND PRESENT PERFECT: Can use 'the best' before a noun + present perfect to talk about a unique experience.
- w

{'Distinctiveness': [0.8], 'positive_constraints': [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]], 'negative_constraints': [[[], [], []]], 'Appropriateness': [[3.0, 4.0, 4.0]], 'Relevance': [[3.0, 4.0, 4.0]], 'Content Richness': [[4.0, 4.0, 4.0]], 'Grammatical Correctness': [[5.0, 5.0, 5.0]]}
____________________________________________________________________________________________________
PROMPT: Continue the dialog with one turn, proving knowledge of all of these grammar skills:
- negation - FORM/USE: 'NOT', EMPHASIS: Can use uncontracted 'not' for emphasis or in formal contexts.
- superlatives - FORM: ELLIPSIS, WITH 'THE': Can use '(one of) the' with an increasing range of superlative adjectives without a following noun, when the noun is understood.
- would - USE: HABITUAL PAST: Can use 'would' to talk about habitual actions and events in the past.

Snippet:
A: Do you have a favorite side dish? I love mashed potatoes, it goes with so much!
B: I love them too. I also love ye

{'Distinctiveness': [0.7049180327868853], 'positive_constraints': [[[0.0, 0.0, 0.5], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]], 'negative_constraints': [[[], [], []]], 'Appropriateness': [[4.0, 4.0, 4.0]], 'Relevance': [[5.0, 4.0, 4.0]], 'Content Richness': [[4.0, 4.0, 4.0]], 'Grammatical Correctness': [[5.0, 4.0, 5.0]]}

B2-Can use 'would' to talk about habitual actions and events in the past. (636)
- Spanish rice is one of my favorite side dishes, I would always order it at my favorite Mexican restaurant.
____________________________________________________________________________________________________
PROMPT: Continue the dialog with one turn, proving knowledge of all of these grammar skills:
- negation - FORM: MENTAL PROCESS VERBS + CLAUSE: Can use the negative forms of mental process verbs ('I don't think', 'I don't believe') followed by a complement clause, where the negative form is in the mental process verb rather than the complement clause.
- superlatives - FORM/USE: 'THE BEST' WI

{'Distinctiveness': [0.6041666666666666], 'positive_constraints': [[[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]], 'negative_constraints': [[[], [], []]], 'Appropriateness': [[4.0, 4.0, 4.0]], 'Relevance': [[4.0, 4.0, 5.0]], 'Content Richness': [[4.0, 4.0, 4.0]], 'Grammatical Correctness': [[5.0, 5.0, 5.0]]}

B1-Can use the negative forms of mental process verbs ('I don't think', 'I don't believe') followed by a complement clause, where the negative form is in the mental process verb rather than the complement clause. (1186)
- I don't think I've ever seen a better deal in this city.

B1-Can use the negative forms of mental process verbs ('I don't think', 'I don't believe') followed by a complement clause, where the negative form is in the mental process verb rather than the complement clause. (1186)
- I don't think I've ever seen a better deal than this!

B1-Can use the negative forms of mental process verbs ('I don't think', 'I don't believe') followed by a complement clause, wh

Single constraints parameters
1. number of single EGP skill constraints -> 1-6 (unreasonable to have more than six constraints in one answer)
2. number of subcategories to choose constraints from -> 1-3 (cost of developing more classifiers)
3. CEFR levels -> A1-C2

dataset format

CTX_SOURCE_DS | CTX_SOURCE_ID | CONTEXT | RESPONSE | EGP_NRS
--- | --- | --- | --- | ---
Daily Dialog | 023 | A: Hey B: Hey A: How are you? | B: I'm good | 616,636,1199
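
A minimal sketch of how the parameter ranges above could be expanded into rows of this format. The helper `build_rows`, the empty `EGP_NRS` placeholder, and the three bookkeeping keys are assumptions for illustration; how the rule numbers are actually sampled is left open here.

```python
import itertools

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def build_rows(contexts, num_constraints, num_subcats, levels=LEVELS):
    """Create one row per (context, #constraints, #subcategories, level)."""
    rows = []
    for ctx, n_con, n_sub, level in itertools.product(
        contexts, num_constraints, num_subcats, levels
    ):
        source, ctx_id, context, response = ctx
        rows.append({
            "CTX_SOURCE_DS": source,
            "CTX_SOURCE_ID": ctx_id,
            "CONTEXT": context,
            "RESPONSE": response,
            "EGP_NRS": "",  # placeholder: sampled EGP rule numbers go here
            # Bookkeeping keys that are not part of the format above
            "N_CONSTRAINTS": n_con,
            "N_SUBCATS": n_sub,
            "LEVEL": level,
        })
    return rows


# Toy context taken from the example row above
contexts = [("Daily Dialog", "023", "A: Hey B: Hey A: How are you?", "B: I'm good")]
rows = build_rows(contexts, range(1, 7), range(1, 4))
print(len(rows))
```

The resulting list of dictionaries can be passed to `pd.DataFrame` directly. With one context, six constraint counts, three subcategory counts, and six levels, the grid has 108 rows, which shows how quickly the dataset grows per added context.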

In [8]:
subcats = ["would", "negation", "superlatives"]
num_constraints = list(range(1,1+6))
num_subcats = list(range(1,1+3))
num_dialogs = 2
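# os.getenv returns a string (or None if RANDOM_SEED is unset, in which case the seed is not reproducible)
random.seed(os.getenv("RANDOM_SEED"))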
#dd = data.DialogSum()
#dd = data.DailyDialog()
#dd = data.WoW()
dd = data.CMUDoG()
dialogs = dd.get_dialogues()
dialog_contexts = [sample_dialog_snippet(dialogs, n=4) for _ in range(num_dialogs)]

for context, response in dialog_contexts:
    print(f"Context: {context}")
    print(f"True response: {response}")
    for num_constraint in num_constraints:
        for num_subcat in num_subcats:
            for level in levels:
                pass
                # add to dataframe

Context: A: I do too! Bruce Almighty was hilarious
B: Yes it was, he doesn't disappointed, I love his movies, including Dick and Jane and Truman Show!
A: also has quite a few other big name actors and actresses like Freeman and Aniston!
B: 
True response: Fun with Dick and Jane is one of his best in my opinion.
Context: A: It is produced by Walt Disney Animation Studios, it is their 53rd featured film. Did you know it is inspired from a Hans Christian Anderson fairy tail?
B: Oh, right... Which fairy tale is that?
A: The fairy tale is called "The Snow Queen" - a story about a brave princess who sets out with a group of friends to finder her estranged sister.
B: 
True response: That does sound like a good one. Do you know which actors provided voices for the characters?


## Task 2: Combine subcategories

Choose subcategory and level skills

In [15]:
def sample_subcategory_rules(subcategories = ["would", "negation", "superlatives"], levels = ["B1", "B2", "B1"]):
    filter_clf = (egp['#'].isin(classifiers.keys()))
    return pd.concat([egp[(egp['SubCategory'] == subcat) & (egp['Level']==level) & filter_clf] for subcat, level in zip(subcategories, levels)])

Assemble prompt for a random story beginning, test it with GPT3.5 and use classifiers to score the response

In [16]:
rules = sample_subcategory_rules()
snippet = random.sample(stories, 1)[0]
prompt_score(rules, snippet, classifiers, dialog=False, all=False, verbose=True, statement=False)

PROMPT: Continue the text with two sentences, proving knowledge of at least one of these grammar skills:
would FORM/USE: AFTER 'IF' CLAUSES; would FORM: PAST AFFIRMATIVE; would FORM: PAST NEGATIVE; would FORM: QUESTIONS; would FORM: WITH ADVERBS; would USE: FUTURE IN THE PAST; would USE: IMAGINED SITUATIONS IN THE PAST; would USE: INDIRECTNESS; would USE: POLITE REQUESTS; would USE: REPORTED SPEECH; would USE: WILLINGNESS IN THE PAST; negation FORM/USE: 'NOT', EMPHASIS; negation FORM/USE: 'NEVER', INVERTED FRONT POSITION, FOCUS; negation FORM/USE: 'NEITHER ... NOR'; superlatives FORM/USE: 'THE BEST' WITH NOUN AND PRESENT PERFECT

Snippet:
Hume proposes that feeling, not thought, informs us that an object is beautiful or ugly, or that an action exhibits virtue or vice: “The very feeling constitutes our praise or admiration” (T, 471).
RESPONSES from GPT3.5
['If Hume were alive today, he would likely argue that beauty and morality remain subjective concepts influenced by individual percep

[[0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]

Do this with dialogs

In [17]:
for i in range(5):
    rules = sample_subcategory_rules()
    snippet, _ = sample_dialog_snippet(dialogs)
    prompt_score(rules, snippet, classifiers, dialog=True, all=False, verbose=True, statement=False, n_responses=3)
    print("_" * 100)

PROMPT: Continue the dialog with one turn, proving knowledge of at least one of these grammar skills:
would FORM/USE: AFTER 'IF' CLAUSES; would FORM: PAST AFFIRMATIVE; would FORM: PAST NEGATIVE; would FORM: QUESTIONS; would FORM: WITH ADVERBS; would USE: FUTURE IN THE PAST; would USE: IMAGINED SITUATIONS IN THE PAST; would USE: INDIRECTNESS; would USE: POLITE REQUESTS; would USE: REPORTED SPEECH; would USE: WILLINGNESS IN THE PAST; negation FORM/USE: 'NOT', EMPHASIS; negation FORM/USE: 'NEVER', INVERTED FRONT POSITION, FOCUS; negation FORM/USE: 'NEITHER ... NOR'; superlatives FORM/USE: 'THE BEST' WITH NOUN AND PRESENT PERFECT

Snippet:
A: It's fun because they incorporate a lot of classic toys including a pull-string cowboy, Mr Potato Head, Green Army Men, and
 toy dinosaurs
B: It sounds good.
A: In one scene the green army men, led by the character Sarge spies on a party, and reports the
findings to the other toys via baby monitors. It's light-hearted in its approach
B: 
RESPONSES fro

{'Distinctiveness': [0.9365079365079365], 'positive_constraints': [[[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]], 'negative_constraints': [[[], [], []]], 'Appropriateness': [[4.0, 4.0, 4.0]], 'Relevance': [[4.0, 4.0, 4.0]], 'Content Richness': [[4.0, 4.0, 4.0]], 'Grammatical Correctness': [[5.0, 4.0, 5.0]]}

B1-Can use 'would have' + '-ed'. (626)
- That would have been a fun scene to watch, imagining toys spying on a party through baby monitors.

B1-Can use 'would' to talk about the future in the past. (630)
- It would be interesting to see how the toys would react in a situation like that.

B1-Can use 'would have' + '-ed'. (626)
- That would have been a clever way to include all the iconic toys in the movie, creating a nostalgic and entertaining experience for both children and adults.
_______________

{'Distinctiveness': [0.74], 'positive_constraints': [[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]], 'negative_constraints': [[[], [], []]], 'Appropriateness': [[4.0, 4.0, 4.0]], 'Relevance': [[4.0, 5.0, 5.0]], 'Content Richness': [[4.0, 4.0, 4.0]], 'Grammatical Correctness': [[5.0, 5.0, 5.0]]}
____________________________________________________________________________________________________
PROMPT: Continue the dialog with one turn, proving knowledge of at least one of these grammar skills:
would FORM/USE: AFTER 'IF' CLAUSES; would FORM: PAST AFFIRMATIVE; would FORM: PAST NEGATIVE; would FORM: QUESTIONS; would FORM: WITH ADVERBS; would USE: FUTURE IN THE PAST; would USE: IMAGINED SITUATIONS IN THE PAST; would USE: INDIRECTNESS; would USE: POLITE REQUESTS; would USE: REPORTED SPEECH; would USE: WILLIN

{'Distinctiveness': [0.9736842105263158], 'positive_constraints': [[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]], 'negative_constraints': [[[], [], []]], 'Appropriateness': [[4.0, 4.0, 4.0]], 'Relevance': [[4.0, 4.0, 4.0]], 'Content Richness': [[4.0, 2.0, 4.0]], 'Grammatical Correctness': [[4.0, 5.0, 4.0]]}

B1-Can use an limited range of adverbs with 'would', including 'really', 'probably', 'certainly', 'definitely'.► adverbs (629)
- Yes, she would definitely be a great choice for that role.

B1-Can use an limited range of adverbs with 'would', including 'really', 'probably', 'certainly', 'definitely'.► adverbs (629)
- She would definitely bring a level of charisma to any role she's in.
____________________________________________________________________________________________________
PROMPT: Continu

{'Distinctiveness': [0.3939393939393939], 'positive_constraints': [[[1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]], 'negative_constraints': [[[], [], []]], 'Appropriateness': [[2.0, 4.0, 2.0]], 'Relevance': [[2.0, 2.0, 2.0]], 'Content Richness': [[2.0, 3.0, 2.0]], 'Grammatical Correctness': [[4.0, 4.0, 4.0]]}

B1-Can use 'would' in the main clause of a conditional sentence to talk about an imagined situation, often in the context of advice or opinion-giving. (625)
- If you could choose any superpower, what would it be?

B1-Can use question forms. (628)
- If you could choose any superpower, what would it be?

B1-Can use 'would' in the main clause of a conditional sentence to talk about an imagined situation, often in the context of advice or opinion-giving. (625)
- If you could have any superpower, what w

{'Distinctiveness': [0.9454545454545454], 'positive_constraints': [[[1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]], 'negative_constraints': [[[], [], []]], 'Appropriateness': [[4.0, 2.0, 4.0]], 'Relevance': [[4.0, 2.0, 4.0]], 'Content Richness': [[4.0, 4.0, 4.0]], 'Grammatical Correctness': [[5.0, 5.0, 5.0]]}

B1-Can use 'would' in the main clause of a conditional sentence to talk about an imagined situation, often in the context of advice or opinion-giving. (625)
- Yeah, if Batman hadn't witnessed his parents' murder, he wouldn't have become the Dark Knight.

B1-Can use 'would not have' + '-ed' or 'wouldn’t have' + '-ed' (627)
- Yeah, if Batman hadn't witnessed his parents' murder, he wouldn't have become the Dark Knight.

B1-Can use 'would' to talk about imagined situations in the past. ► conditionals 

Subcategory constraints parameters
1. number of subcategories to choose constraints from -> 1-3 (cost of developing more classifiers)
2. CEFR levels -> A1-C2
3. maximum difference between CEFR levels per category

dataset format

CTX_SOURCE_DS | CTX_SOURCE_ID | CONTEXT | RESPONSE | SUBCATS x LEVELS
--- | --- | --- | --- | ---
Daily Dialog | 023 | A: Hey B: Hey A: How are you? | B: I'm good | would_B2, negation_C1, superlatives_B1
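
The third parameter can be sketched as a filter over level assignments. The sketch assumes that "maximum difference" means the spread between the highest and the lowest CEFR level assigned across the chosen subcategories; the helpers `level_combinations` and `encode` are hypothetical names introduced only for this illustration.

```python
import itertools
import random

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
LEVEL_IDX = {level: i for i, level in enumerate(LEVELS)}


def level_combinations(subcategories, max_diff):
    """All level assignments whose highest and lowest level differ by at most max_diff."""
    combos = []
    for assignment in itertools.product(LEVELS, repeat=len(subcategories)):
        idx = [LEVEL_IDX[level] for level in assignment]
        if max(idx) - min(idx) <= max_diff:
            combos.append(assignment)
    return combos


def encode(subcategories, assignment):
    """Serialize an assignment in the 'subcat_LEVEL' style of the format above."""
    return ", ".join(f"{s}_{l}" for s, l in zip(subcategories, assignment))


subcats = ["would", "negation", "superlatives"]
combos = level_combinations(subcats, max_diff=1)
rng = random.Random(0)  # fixed seed only to make the sketch repeatable
print(len(combos), encode(subcats, rng.choice(combos)))
```

Limiting the spread to one level leaves 36 of the 216 possible assignments for three subcategories, so this parameter strongly controls the size of the task.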

## Task 3: All skills from one difficulty level

In [18]:
def sample_level_rules(level = None):
    if level is None: level = random.sample(levels, 1)[0]
    return level, egp[(egp['Level']==level) & (egp['#'].isin(classifiers.keys()))]

In [19]:
level, rules = sample_level_rules()
snippet = random.sample(stories, 1)[0]
prompt_score(rules, snippet, classifiers, dialog=False, all=False, verbose=True, level=level)

PROMPT: Continue the text with two sentences, proving knowledge of grammar skills on CEFR level A1.
Snippet:
President Obama will place tariffs on imports of some Chinese tires for three years in an effort to curb a surge in exports that has rocked the U.S. tire industry.
RESPONSES from GPT3.5
['The decision has come under criticism from China, sparking concerns of a potential trade war between the two countries. Nevertheless, the President believes this action is necessary to protect American jobs and industries.']


[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]

In [20]:
for i in range(5):
    level, rules = sample_level_rules()
    snippet, _ = sample_dialog_snippet(dialogs)
    prompt_score(rules, snippet, classifiers, dialog=True, all=False, verbose=True, level=level, n_responses=3)
    print("_" * 100)

PROMPT: Continue the dialog with one turn, proving knowledge of grammar skills on CEFR level C1.
Snippet:
A: Steve Carrell is the main character with Jason Segel and Russell Brand. 
B: Gosh, I bet it is funny then.  Is it a Pixar movie?
A: No, it actually is produced by Universal and Illumination.
B: 
RESPONSES from GPT3.5
['Illumination, the studio responsible for hits such as "Despicable Me" and "The Secret Life of Pets."', "Oh, I see. So it's not animated then, but more of a live-action comedy film?", "Ah, I see. I didn't know that. Thank you for clarifying."]


{'Distinctiveness': [0.9454545454545454], 'positive_constraints': [[[0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0]]], 'negative_constraints': [[[], [], []]], 'Appropriateness': [[4.0, 4.0, 4.0]], 'Relevance': [[4.0, 4.0, 4.0]], 'Content Richness': [[4.0, 4.0, 3.0]], 'Grammatical Correctness': [[5.0, 5.0, 5.0]]}
____________________________________________________________________________________________________
PROMPT: Continue the dialog with one turn, proving knowledge of grammar skills on CEFR level A2.
Snippet:
A: Do you like animated
 movies?
B: anyone there?
A: hai
B: 
RESPONSES from GPT3.5
['Yes, I like animated movies.', 'Yes, I enjoy watching animated movies.', 'Do you like animated movies too?']


{'Distinctiveness': [0.6842105263157895], 'positive_constraints': [[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]], 'negative_constraints': [[[], [], []]], 'Appropriateness': [[2.0, 2.0, 2.0]], 'Relevance': [[2.0, 4.0, 2.0]], 'Content Richness': [[2.0, 2.0, 2.0]], 'Grammatical Correctness': [[4.0, 4.0, 4.0]]}
____________________________________________________________________________________________________
PROMPT: Continue the dialog with one turn, proving knowledge of grammar skills on CEFR level B2.
Snippet:
A: yes  i was band play
B: i like the music
A: When the band  starts their tour I found it interesting that Mia accuses Sebastian of abandoning hie dreams!
B: 
RESPONSES from GPT3.5
['Is she just jealous of his success?', "It's true that Mia accuses Sebastian of abandoning his dreams, but ultimately they both r

{'Distinctiveness': [1.0], 'positive_constraints': [[[0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0]]], 'negative_constraints': [[[], [], []]], 'Appropriateness': [[4.0, 4.0, 4.0]], 'Relevance': [[4.0, 4.0, 5.0]], 'Content Richness': [[3.0, 4.0, 4.0]], 'Grammatical Correctness': [[4.0, 4.0, 5.0]]}
____________________________________________________________________________________________________
PROMPT: Continue the dialog with one turn, proving knowledge of grammar skills on CEFR level B1.
Snippet:
A: Wow, that's awesome.  Is there a love story between Turing and the character played by Knightley?
B: Yes, Turing proposes to her, but it is revealed that he is actually a homosexual.
A: Quite a twist
B: 
RESPONSES from GPT3.5
['Indeed, it adds a layer of complexity to his character and the overall plot.', 'Yes, it adds a layer of complexity to the story and sheds light on the challenges Turing faced during that time period.', 'Yes, it certainly adds an in

{'Distinctiveness': [0.7755102040816326], 'positive_constraints': [[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]], 'negative_constraints': [[[], [], []]], 'Appropriateness': [[4.0, 4.0, 4.0]], 'Relevance': [[4.0, 4.0, 4.0]], 'Content Richness': [[4.0, 4.0, 3.0]], 'Grammatical Correctness': [[5.0, 5.0, 5.0]]}
____________________________________________________________________________________________________
PROMPT: Continue the dialog with one turn, proving knowledge of grammar skills on CEFR level B1.
Snippet:
A: I really like the movie Zootopia. 
B: The Rotten Tomatoes rating is 98%. That is a really good rating for a movie.
A: zootopia yeah that was good
B: 
RESPONSES from GPT3.5
['Yes, I thought the animation was fantastic and the storyline was very enter

{'Distinctiveness': [0.8780487804878049], 'positive_constraints': [[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]], 'negative_constraints': [[[], [], []]], 'Appropriateness': [[4.0, 4.0, 4.0]], 'Relevance': [[4.0, 4.0, 5.0]], 'Content Richness': [[4.0, 4.0, 4.0]], 'Grammatical Correctness': [[5.0, 5.0, 5.0]]}
____________________________________________________________________________________________________


Difficulty level constraints parameters
1. CEFR levels -> A1-C2

dataset format

CTX_SOURCE_DS | CTX_SOURCE_ID | CONTEXT | RESPONSE | LEVEL
--- | --- | --- | --- | ---
Daily Dialog | 023 | A: Hey B: Hey A: How are you? | B: I'm good | B1
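
As a rough sketch, a row in this format could be read back and validated as follows. The parser `parse_row` is a hypothetical helper, and it assumes that the context and response never contain a pipe character themselves.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
FIELDS = ["CTX_SOURCE_DS", "CTX_SOURCE_ID", "CONTEXT", "RESPONSE", "LEVEL"]


def parse_row(line):
    """Split one pipe-separated line into a dict and validate the CEFR level."""
    # Assumption: no field contains "|" itself
    values = [value.strip() for value in line.split("|")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    row = dict(zip(FIELDS, values))
    if row["LEVEL"] not in LEVELS:
        raise ValueError(f"unknown CEFR level: {row['LEVEL']}")
    return row


row = parse_row("Daily Dialog | 023 | A: Hey B: Hey A: How are you? | B: I'm good | B1")
print(row["LEVEL"], row["CTX_SOURCE_ID"])
```

Keeping `CTX_SOURCE_ID` as a string preserves the leading zero of identifiers such as `023`.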