# This Notebook
uses a modified version of the Hugging Face transformers package to generate syntactically similar sentences. The approach redefines the PhrasalConstraint class (in generation_beam_constraint.py) and parts of the constrained generation code (in generation_beam_search.py) in order to force this behavior.

This constrained beam search does not work with the current version of the transformers package. The requirements for the code below include transformers==4.20.1, which requires python<=3.8.
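
Because the pinned versions matter here, a small guard at the top of the notebook can flag a mismatched environment early. The following is a minimal sketch using only the standard library; the helper names `environment_problems` and `installed_transformers_version` are hypothetical, and the two constants simply restate the requirement given above.

```python
import sys
from importlib import metadata

REQUIRED_TRANSFORMERS = "4.20.1"  # version pinned in this notebook
MAX_PYTHON = (3, 8)               # upper bound stated in this notebook


def environment_problems(transformers_version, python_version):
    """Return a list of human-readable mismatches (empty if none)."""
    problems = []
    if transformers_version != REQUIRED_TRANSFORMERS:
        problems.append(
            f"transformers=={REQUIRED_TRANSFORMERS} expected, found {transformers_version}"
        )
    if tuple(python_version[:2]) > MAX_PYTHON:
        problems.append(
            f"python<={MAX_PYTHON[0]}.{MAX_PYTHON[1]} expected, "
            f"found {python_version[0]}.{python_version[1]}"
        )
    return problems


def installed_transformers_version():
    """Look up the installed version, or None if the package is missing."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


# Print one warning per mismatch instead of failing later inside generate().
for problem in environment_problems(installed_transformers_version(), sys.version_info):
    print("WARNING:", problem)
```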

Todo:
1. Understand and change the behavior of the text generation so that the constraint objects are not duplicated across beams. Duplication is necessary in the standard use case but unnecessary here, since all beams have the same constraint object state at each depth of the beam search. Removing it would make the whole generation much more efficient, allowing for larger banks and more beams in the same amount of time.
2. Optimize hyperparameters, including bank size, for generation that is consistently high quality.
3. Perform post-generation analysis to verify that the semantics of the generated sentences are random.
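
The first item can be illustrated with a toy sketch. If progress through the constraint depends only on how many tokens have been generated, a single read-only object can serve every beam. The class `SharedPositionalBank` below is hypothetical and is not part of transformers or of the modified files; it only shows the idea of looking up the allowed tokens by depth instead of copying state per beam.

```python
from typing import List, Sequence


class SharedPositionalBank:
    """Read-only per-position banks (hypothetical, not a transformers class)."""

    def __init__(self, banks: Sequence[Sequence[int]]):
        # Freeze the banks so every beam can safely hold the same reference.
        self._banks = tuple(frozenset(bank) for bank in banks)

    def __len__(self) -> int:
        return len(self._banks)

    def allowed(self, depth: int) -> frozenset:
        """Token ids permitted at generation step `depth` (0-based)."""
        return self._banks[depth]

    def is_valid(self, generated: Sequence[int]) -> bool:
        """True if every generated token comes from the bank at its position."""
        if len(generated) > len(self._banks):
            return False
        return all(tok in self._banks[i] for i, tok in enumerate(generated))


bank = SharedPositionalBank([[10, 11, 12], [20, 21]])
beams: List[List[int]] = [[10], [11], [12]]

# All beams sit at the same depth, so one lookup serves all of them.
depth = len(beams[0])
allowed_next = bank.allowed(depth)
print(sorted(allowed_next))
```

Whether the real beam scorer can be changed this way depends on how the constraint state is used inside generation_beam_search.py, which this sketch does not model.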

Understanding Constrained Beam Search:
https://huggingface.co/blog/constrained-beam-search

In [30]:
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, PhrasalConstraint


In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Load the model
model = AutoModelForCausalLM.from_pretrained("gpt2")


In [3]:

encoder_input_str = "The jungle"

'''
APPROACH:

Note: nothing large can be added to the PhrasalConstraint class, because constraint objects get copied recursively.

1. Add a word bank (alt_tokens) to the PhrasalConstraint: a list of lists, where each inner list corresponds to one of the target words
and offers the alternatives that can replace it in the sentence.
2. Alter the generate method to assign an infinitely negative score to sentence candidates that don't make progress.


For each candidate, don't use any predicted next words; just expand every candidate with all possible next words.

'''
constraints = [
    PhrasalConstraint(
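        token_ids =[
        # tokenizer(" dog", add_special_tokens=False).input_ids[0],
        # NOTE: "cat" and "house" have no leading space here, unlike the alternatives below;
        # for GPT-2 these are different token ids than " cat" and " house"
        tokenizer("cat", add_special_tokens=False).input_ids[0],
        tokenizer("house", add_special_tokens=False).input_ids[0],
        ],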
        alt_tokens=[
        [
        # tokenizer(" puma", add_special_tokens=False).input_ids[0],
        tokenizer(" cougar", add_special_tokens=False).input_ids[0],
        tokenizer(" leopard", add_special_tokens=False).input_ids[0],
        tokenizer(" cat", add_special_tokens=False).input_ids[0],
        tokenizer(" cheetah", add_special_tokens=False).input_ids[0],
        tokenizer(" tiger", add_special_tokens=False).input_ids[0],

       ],
        [
        tokenizer(" ship", add_special_tokens=False).input_ids[0],
        tokenizer(" pain", add_special_tokens=False).input_ids[0],
        tokenizer(" car", add_special_tokens=False).input_ids[0],
        tokenizer(" tree", add_special_tokens=False).input_ids[0],
        tokenizer(" shoe", add_special_tokens=False).input_ids[0],
        tokenizer(" bar", add_special_tokens=False).input_ids[0],
        tokenizer(" dirt", add_special_tokens=False).input_ids[0],
        tokenizer(" rain", add_special_tokens=False).input_ids[0],
        tokenizer(" cub", add_special_tokens=False).input_ids[0],

       ]
       ]
    
    )

]


input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids

outputs = model.generate(
    input_ids,
    constraints=constraints,
    num_beams= 5,
    num_return_sequences=4,
    no_repeat_ngram_size=1, # don't repeat any tokens
    remove_invalid_values=True,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=2

)

print("Output:\n" + 100 * '-')
for i in range(len(outputs)):
    print(tokenizer.decode(outputs[i], skip_special_tokens=True))




Output:
----------------------------------------------------------------------------------------------------
The jungle tiger cub
The jungle cat cub
The jungle tiger tree
The jungle tiger bar
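
The behavior in the cell above can be mimicked with a toy beam search that needs no model. This is a minimal sketch of the idea from the APPROACH notes (step i may only emit a word from bank i, and everything else is scored as negative infinity); it is not the code in the modified transformers files. The function `banked_beam_search`, the scorer `toy_scores`, and the numbers in `TOY_LOGPROBS` are all made up for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

NEG_INF = float("-inf")
Beam = Tuple[Tuple[str, ...], float]


def banked_beam_search(
    banks: Sequence[Sequence[str]],
    next_word_scores: Callable[[Tuple[str, ...]], Dict[str, float]],
    num_beams: int,
) -> List[Beam]:
    """Beam search in which step i may only emit a word from banks[i]."""
    beams: List[Beam] = [((), 0.0)]
    for bank in banks:
        allowed = set(bank)
        candidates: List[Beam] = []
        for prefix, total in beams:
            for word, logp in next_word_scores(prefix).items():
                # Candidates that do not advance the constraint get -inf.
                masked = logp if word in allowed else NEG_INF
                candidates.append((prefix + (word,), total + masked))
        # Best score first; ties are broken alphabetically for determinism.
        candidates.sort(key=lambda item: (-item[1], item[0]))
        beams = [c for c in candidates[:num_beams] if c[1] > NEG_INF]
    return beams


# Toy stand-in for a language model: context-free scores, repeats forbidden.
TOY_LOGPROBS = {"tiger": -1.0, "cat": -1.5, "cub": -0.5, "tree": -2.0, "the": -0.1}


def toy_scores(prefix: Tuple[str, ...]) -> Dict[str, float]:
    return {w: (NEG_INF if w in prefix else lp) for w, lp in TOY_LOGPROBS.items()}


results = banked_beam_search([["tiger", "cat"], ["cub", "tree"]], toy_scores, num_beams=3)
for words, score in results:
    print(" ".join(words), score)
```

Note that "the" has the best raw score but never appears, because it is in neither bank.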


In [4]:
import numpy as np
import spacy

from collections import Counter
from datasets import load_dataset


import nltk  
from nltk.tokenize import wordpunct_tokenize
from string import punctuation

from nltk.corpus import stopwords
from tqdm import tqdm



In [5]:
# Load the Spacy model (Small model for CPU usage)
nlp = spacy.load("en_core_web_sm")  # Smaller model suitable for CPU

# Load the corpus dataset
dataset = load_dataset("generics_kb", "generics_kb_best")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [6]:
np.random.seed(42)
sentences_to_use = np.random.choice(range(len(dataset['train'])), 100000, replace=False)
sentence_pipe = nlp.pipe(dataset['train'][sentences_to_use[0:15000]]['generic_sentence']) # only use the first 15000 sampled sentences to avoid an OOM error

# Initialize the dictionary to store words by their attributes
word_bank = {}

# populate the dictionary
for doc in sentence_pipe:
    for w in doc:
        # create a key tuple of word attributes
        key = (w.pos_, w.tag_, w.dep_, w.is_stop)
        if key not in word_bank:
            word_bank[key] = []

        # check that the word is a single token:
        # for now, a space is prepended to each word before tokenizing
        tokenized_text = tokenizer.tokenize(" "+w.text.lower())
        if len(tokenized_text) > 1:
            continue

        # if the word is already in the word bank, skip it
        if w.text.lower() in word_bank[key]:
            continue
        
        # if the word is non-alphabetic, skip it
        if not w.text.isalpha():
            continue


        # Append the word text to the corresponding list in the dictionary
        word_bank[key].append(w.text.lower())


In [7]:
example_key = ('NOUN', 'NN', 'nsubj', False)  # Example tuple key
matching_words = word_bank.get(example_key, [])
print("Words matching the attributes {}: {}".format(example_key, matching_words))

Words matching the attributes ('NOUN', 'NN', 'nsubj', False): ['cancer', 'citizenship', 'computer', 'silk', 'cache', 'graphic', 'protection', 'communication', 'nutrition', 'pressure', 'love', 'fish', 'cream', 'deer', 'wind', 'sun', 'light', 'pregnancy', 'wildlife', 'blade', 'arthritis', 'ambition', 'infection', 'food', 'health', 'function', 'religion', 'material', 'agriculture', 'dehydration', 'oil', 'development', 'football', 'narration', 'grain', 'fog', 'photography', 'disturbance', 'stress', 'cooking', 'succession', 'weather', 'plant', 'piracy', 'selection', 'fraud', 'purity', 'pork', 'perception', 'drawing', 'sugar', 'illness', 'fluid', 'information', 'water', 'neon', 'sleep', 'sexism', 'war', 'confidence', 'furniture', 'poverty', 'stock', 'cocaine', 'planting', 'dust', 'air', 'faith', 'tolerance', 'use', 'pollution', 'turbulence', 'sky', 'milk', 'flight', 'energy', 'influenza', 'span', 'acid', 'aspiration', 'time', 'current', 'ice', 'throttle', 'chart', 'growth', 'smoking', 'glass
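
The word bank above checks for duplicates with `in` on a list, which gets slower as each list grows. A possible alternative, sketched below without spaCy or the tokenizer, tracks seen words in a set per key. The function `build_word_bank` and its `keep` parameter are hypothetical; `keep` stands in for the single-token check, and the input is assumed to be already tagged as `(text, key)` pairs.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple

Key = Tuple[str, str, str, bool]  # (pos, tag, dep, is_stop)


def build_word_bank(
    tagged_words: Iterable[Tuple[str, Key]],
    keep: Callable[[str], bool] = lambda word: True,
) -> Dict[Key, List[str]]:
    """Group lowercased alphabetic words by attribute key, without duplicates."""
    seen: Dict[Key, Set[str]] = defaultdict(set)
    bank: Dict[Key, List[str]] = defaultdict(list)
    for text, key in tagged_words:
        word = text.lower()
        if not text.isalpha() or word in seen[key] or not keep(word):
            continue
        seen[key].add(word)     # set membership test is O(1) on average
        bank[key].append(word)  # the list keeps first-seen order
    return dict(bank)


NOUN_KEY: Key = ("NOUN", "NN", "nsubj", False)
VERB_KEY: Key = ("VERB", "VBP", "ROOT", False)
tagged = [("Fish", NOUN_KEY), ("fish", NOUN_KEY), ("wind", NOUN_KEY),
          ("3D", NOUN_KEY), ("love", VERB_KEY)]

demo_bank = build_word_bank(tagged)
print(demo_bank[NOUN_KEY])
```

Unlike the cell above, this sketch only creates a key once a word is actually stored, so a key with no usable words is absent rather than mapped to an empty list.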

In [8]:
sent = "cats love dogs"
doc = nlp(sent)
for w in doc:
    print(w.text, w.pos_, w.tag_, w.dep_, w.is_stop)


cats NOUN NNS nsubj False
love VERB VBP ROOT False
dogs NOUN NNS dobj False


In [53]:
import re
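def generate_tokens(sent,max_bank_size,word_bank,tokenizer,nlp):
    """Return (input_tokens, alt_tokens): the first token id of each word in `sent`, and a bank of alternative token ids per spaCy token."""

    # split the sentence into words and single punctuation characters
    input_words = re.findall(r'\w+|[^\w\s]', sent)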
    
    input_tokens = [
        tokenizer(" "+word, add_special_tokens=False).input_ids[0] for word in input_words
    ]



    alt_words = []
    doc = nlp(sent)


    for w in doc:
        key = (w.pos_, w.tag_, w.dep_, w.is_stop)
        if w.is_stop:
            #NOTE: this may not be single token
            matching_words = [w.text.lower()]

        # if the word is a special character, just return a list with the word itself
        elif not w.text.isalpha():
            matching_words = [w.text.lower()]

        else:
            # if there are no matching words in the bank raise an error
            if key not in word_bank:
                raise ValueError("No matching words in the word bank for the key: {}".format(key))

            matching_words = word_bank.get(key, [])

            # choose a random subset of the matching words to limit the size of the banks and introduce randomness
            matching_words = np.random.choice(matching_words, min(max_bank_size, len(matching_words)), replace=False)


        alt_words.append(matching_words)

    alt_tokens = []

    for i in range(len(alt_words)):
        alt_token_per_word = []
        for j in range(len(alt_words[i])):
            alt_token_per_word.append(tokenizer(" "+alt_words[i][j], add_special_tokens=False).input_ids[0])
    

        alt_tokens.append(alt_token_per_word)

    return input_tokens, alt_tokens


In [10]:
tokenized_text = tokenizer.tokenize('selection')
print(len(tokenized_text))

1


In [None]:
outputs.shape

for i in range(len(outputs)):
    for k in range(1,len(outputs[i])):
        print(tokenizer.decode(outputs[i][k], skip_special_tokens=True))


 people
 want
 things
 more
 than
 anything


In [68]:
def create_syntactic_similars(sentences, num_alternatives, num_beams, max_bank_size, word_bank, tokenizer, nlp):
    encoder_input_str = " "
    
    all_alt_sentences = []

    for sent_idx, sent in enumerate(sentences):
        print(f"Generating alternatives for sentence {sent_idx+1}/{len(sentences)}")
        sent = sent.lower()

        # generate banks of certain size randomly selected
        input_tokens, alternative_tokens = generate_tokens(sent, max_bank_size, word_bank, tokenizer, nlp)

        alt_sentences = []
        for i in range(num_alternatives):

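            # Remove all previously used words from the alternative tokens,
            # but only from banks that still hold more than 10 candidates
            for alt_sentence in alt_sentences:
                for token_idx in range(1, len(alt_sentence)):  # position 0 holds the prompt
                    token = alt_sentence[token_idx]
                    try:
                        if len(alternative_tokens[token_idx-1]) > 10:
                            alternative_tokens[token_idx-1].remove(token)
                    except (ValueError, IndexError):
                        # the token was already removed, or there is no bank for this position
                        print(f"Error: token [{tokenizer.decode(token)}] not found in list")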

            constraints = [
                PhrasalConstraint(
                    token_ids=input_tokens,
                    alt_tokens=alternative_tokens
                )
            ]

            input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids

            outputs = model.generate(
                input_ids,
                constraints=constraints,
                num_beams=num_beams,
                num_return_sequences=1,
                remove_invalid_values=True,
                pad_token_id=tokenizer.eos_token_id,
                max_new_tokens=len(input_tokens)
            )

            alt_sentence = outputs[0]
            alt_sentences.append(alt_sentence)

        all_alt_sentences.append(alt_sentences)

    # decode every alternative, dropping the prompt token at position 0
    return [[tokenizer.decode(alt[1:], skip_special_tokens=True) for alt in alts] for alts in all_alt_sentences]
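
The bank-pruning step inside the loop above can also be written as a standalone function that is easier to test on its own. This is a sketch with plain integers instead of tensors; `prune_banks` is a hypothetical helper, it expects sequences with the prompt token already stripped, and its default `min_size=10` mirrors the `> 10` guard in the function above.

```python
from typing import List, Sequence


def prune_banks(
    banks: List[List[int]],
    used: Sequence[Sequence[int]],
    min_size: int = 10,
) -> List[List[int]]:
    """Return copies of the banks with already-used tokens removed per position."""
    pruned = [list(bank) for bank in banks]  # leave the caller's banks untouched
    for sequence in used:
        for position, token in enumerate(sequence):
            if position >= len(pruned):
                break  # the sequence is longer than the number of banks
            bank = pruned[position]
            # Only shrink banks that are still larger than min_size.
            if len(bank) > min_size and token in bank:
                bank.remove(token)
    return pruned


banks = [[1, 2, 3, 4], [5]]
print(prune_banks(banks, used=[[1, 5], [2, 5]], min_size=2))
```

Returning copies instead of mutating in place keeps the original banks available, at the cost of one extra copy per call.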

In [60]:
sentences = ["cats love dogs more when the time is right", "the jungle is a place of wonder","the cat is a predator when hungry" ]

alt_sentences = create_syntactic_similars(sentences,num_alternatives=2, num_beams=60, max_bank_size=400,word_bank=word_bank,tokenizer=tokenizer,nlp=nlp)

for idx,sent in enumerate(sentences):
    print("input sentence: ")
    print(sent)
    print("alternative sentences: ")

    for alt in alt_sentences[idx]:
        print(alt)

input sentence: 
cats love dogs more when the time is right
alternative sentences: 
 companies want people more when the price is right
 patients receive drugs more when the risk is low
input sentence: 
the jungle is a place of wonder
alternative sentences: 
 the act is a crime of violence
 the colour is a mixture of red
input sentence: 
the cat is a predator when hungry
alternative sentences: 
 the fruit is a kind when ripe
 the season is a success when young
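
As a crude first step toward the post-generation analysis listed in the Todo, one can measure how many word positions an alternative keeps from its input. This is only a lexical proxy and says nothing about semantics; the helper `kept_fraction` is hypothetical, and the example pairs are copied from the output above.

```python
def kept_fraction(original: str, alternative: str) -> float:
    """Fraction of word positions where the alternative reuses the original word."""
    source = original.lower().split()
    target = alternative.lower().split()
    if not source:
        return 0.0
    same = sum(1 for a, b in zip(source, target) if a == b)
    return same / len(source)


pairs = [
    ("cats love dogs more when the time is right",
     "companies want people more when the price is right"),
    ("the jungle is a place of wonder",
     "the act is a crime of violence"),
]
for original, alternative in pairs:
    print(f"{kept_fraction(original, alternative):.2f}  {alternative}")
```

Positions that match are mostly the stop words that generate_tokens keeps fixed, so a high value here is expected and not by itself a sign of semantic similarity.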


In [None]:
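# NOTE: ERROR arena is multitoken
# NOTE: this call uses an older signature of create_syntactic_similars (num_beams and max_bank_size are missing)
sentences = ["trees know when you live"]

alt_sentences = create_syntactic_similars(sentences, 2, word_bank,tokenizer,nlp)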

400
400
1
1
51
  customers need when you want
  tools like when you use


In [None]:
print(alt_sentences)

[[' the sea, and', ' the author, and']]


In [20]:
import re

sent = "when you live with someone who has a temper, a very bad temper, a very, very bad temper, you learn to play around."

# Use regular expressions to split the sentence into words and punctuation
input_tokens = re.findall(r'\w+|[^\w\s]', sent)

print(input_tokens)

['when', 'you', 'live', 'with', 'someone', 'who', 'has', 'a', 'temper', ',', 'a', 'very', 'bad', 'temper', ',', 'a', 'very', ',', 'very', 'bad', 'temper', ',', 'you', 'learn', 'to', 'play', 'around', '.']


In [31]:
sent = "when you live with someone who has a temper, a very bad temper, a very, very bad temper, you learn to play around." 
input_tokens = [
    word for word in sent.split()
]
print(input_tokens)

['when', 'you', 'live', 'with', 'someone', 'who', 'has', 'a', 'temper,', 'a', 'very', 'bad', 'temper,', 'a', 'very,', 'very', 'bad', 'temper,', 'you', 'learn', 'to', 'play', 'around.']
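
The two cells above compare the regular expression with a plain whitespace split. One further case worth checking is contractions: the pattern splits an apostrophe into its own piece, so its piece count can differ from the number of spaCy tokens, and generate_tokens builds `input_tokens` from the former and `alt_tokens` from the latter. The sketch below shows the difference; the `CLITIC_AWARE` pattern is a hypothetical variant, it does not handle forms such as "n't", and it is not guaranteed to agree with spaCy in general.

```python
import re

WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]")     # pattern used in this notebook
CLITIC_AWARE = re.compile(r"'\w+|\w+|[^\w\s]")  # hypothetical variant

text = "oh, she's crying."

# The notebook's pattern yields three pieces for "she's".
print(WORD_OR_PUNCT.findall(text))

# The variant keeps the apostrophe attached to the clitic: "she" + "'s".
print(CLITIC_AWARE.findall(text))
```

A more robust option might be to take the input words directly from the spaCy `doc`, so that both lists are guaranteed to have the same length.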


In [35]:
import nltk  
from nltk.tokenize import wordpunct_tokenize
from string import punctuation
from collections import defaultdict
import warnings
from nltk.corpus import stopwords
from tqdm import tqdm
from deepmultilingualpunctuation import PunctuationModel




In [36]:
# Loading the textgrids

# Rstories are the names of the training (or Regression) stories, which we will use to fit our models
Rstories = ['alternateithicatom', 'avatar', 'howtodraw', 'legacy', 
            'life', 'myfirstdaywiththeyankees', 'naked', 
            'odetostepfather', 'souls', 'undertheinfluence']

# Pstories are the test (or Prediction) stories (well, story), which we will use to test our models
Pstories = ['wheretheressmoke']

allstories = Rstories + Pstories

# Load TextGrids
from stimulus_utils import load_grids_for_stories
grids = load_grids_for_stories(allstories)

# Load TRfiles
from stimulus_utils import load_generic_trfiles
trfiles = load_generic_trfiles(allstories)

# Make word and phoneme datasequences
from dsutils import make_word_ds, make_phoneme_ds
wordseqs = make_word_ds(grids, trfiles) # dictionary of {storyname : word DataSequence}
phonseqs = make_phoneme_ds(grids, trfiles) # dictionary of {storyname : phoneme DataSequence}

In [37]:
wheretheressmoke = wordseqs["wheretheressmoke"]
print ("There are %d words in the story called 'wheretheressmoke'" % len(list(wheretheressmoke.data)))

There are 1839 words in the story called 'wheretheressmoke'


In [38]:
punctuation_model= PunctuationModel()




In [39]:
input_text = ' '.join(wheretheressmoke.data)
# reuses the punctuation_model created in the previous cell
# Restore Punctuation
punctuated_output = punctuation_model.restore_punctuation(input_text)
# Split sentences by Punctuation
story_sentences = nltk.sent_tokenize(punctuated_output)  

In [61]:
for idx,sentence in enumerate(story_sentences):
    print(sentence)


i reached over and secretly undid my seatbelt and when his foot hit the brake at the red light, i flung open the door and i ran.
i had no shoes on, i was crying, i had no wallet, but i was ok because i had my cigarettes and i didn't want any part of freedom if i didn't have my cigarettes.
when you live with someone who has a temper, a very bad temper, a very, very bad temper, you learn to play around.
that you learn this time i'll play possum and next time i'll just be real nice, or i'll say yes to everything, or you make yourself scarce or you run.
and this was one of the times when you just run and as i was running, i thought this was a great place to jump out, because there were big lawns and there were cul de sacs and sometimes he would come after me and drive and yell stuff at me to get back in, get back in.
and i was like no, i'm out of here, this is great.
and i went and hid behind a cabana and he left and i had my cigarettes and, uh, i started to walk in this beautiful neighbor

# Note
Should we split on phrases or sentences?
- The sentences are not exact, since the content is spoken and the sentence boundaries are imperfectly generated
- Phrases are shorter, which would allow for more sensible generation due to less strict/shorter constraints, and would be more time efficient
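
If phrases are chosen, one simple way to obtain them is to split each restored sentence at punctuation. The sketch below assumes that commas, semicolons and colons mark phrase boundaries, which is only a heuristic; `split_into_phrases` and its `min_words` parameter are hypothetical.

```python
import re
from typing import List

# Assumed phrase boundaries: comma, semicolon, colon.
PHRASE_BOUNDARY = re.compile(r"\s*[,;:]\s*")


def split_into_phrases(sentence: str, min_words: int = 2) -> List[str]:
    """Split at punctuation, merging very short fragments into the previous phrase."""
    pieces = [p.strip() for p in PHRASE_BOUNDARY.split(sentence) if p.strip()]
    phrases: List[str] = []
    for piece in pieces:
        if phrases and len(piece.split()) < min_words:
            # Fillers such as "uh" are too short to constrain on their own.
            phrases[-1] = phrases[-1] + ", " + piece
        else:
            phrases.append(piece)
    return phrases


print(split_into_phrases("i had no shoes on, i was crying, i had no wallet, but i was ok"))
print(split_into_phrases("what, uh, what are they doing"))
```

Merged fragments are always rejoined with a comma, so the original separator is not preserved in that case.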


In [66]:
story_sentences[16:22]

['what, uh, what are they doing out on this suburban street?',
 "and the person comes closer and i could see it's a woman.",
 'and then i can see she has her hands in her face.',
 "oh, she's crying.",
 'and then she sees me and she composes herself and she gets closer and i see she has no shoes on.',
 "she has no shoes on and she's crying and she's out on the street."]

In [67]:
sentences = story_sentences[16:22]
alt_sentences = create_syntactic_similars(sentences,num_alternatives=2, num_beams=40, max_bank_size=400,word_bank=word_bank,tokenizer=tokenizer,nlp=nlp)



Error: token [.] not found in list
Error: token [ '] not found in list
Error: token [.] not found in list
Error: token [.] not found in list


In [69]:

for idx,sent in enumerate(sentences):
    print("input sentence: ")
    print(sent)
    print("alternative sentences: ")

    for alt in alt_sentences[idx]:
        print(alt)

input sentence: 
what, uh, what are they doing out on this suburban street?
alternative sentences: 
 what, yes, what are they doing out on this good day?
 what, o, what are they doing out on this dark path?
input sentence: 
and the person comes closer and i could see it's a woman.
alternative sentences: 
 and the problem goes deeper and i could see it'a thing..
 and the law requires better and i could see it'a problem..
input sentence: 
and then i can see she has her hands in her face.
alternative sentences: 
 and then i can see she has her guns in her hand.
 and then i can see she has her wounds in her belly.
input sentence: 
oh, she's crying.
alternative sentences: 
 o, she'writing. '
 okay, she'dreaming. '
input sentence: 
and then she sees me and she composes herself and she gets closer and i see she has no shoes on.
alternative sentences: 
 and then she turns me and she turns herself and she turns older and i see she has no children on.
 and then she follows me and she says hersel

In [50]:
encoder_input_str = " "

input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids

outputs = model.generate(
    input_ids,
    num_beams= 200,
    num_return_sequences=1,
    no_repeat_ngram_size=1, # don't repeat any tokens
    remove_invalid_values=True,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=30


)

print("Output:\n" + 100 * '-')
for i in range(len(outputs)):
    print(tokenizer.decode(outputs[i], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
