## Setup

Mount Google Drive and clone the repository containing the methods.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [5]:
import getpass

github_username = input("Enter your GitHub username: ")
github_token = getpass.getpass("Enter your GitHub personal access token: ")

Enter your GitHub username: smcaleese
Enter your GitHub personal access token: ··········


In [6]:
repo_name = "smcaleese/masters-thesis-code"
!git clone https://{github_username}:{github_token}@github.com/{repo_name}.git

Cloning into 'masters-thesis-code'...
remote: Enumerating objects: 110, done.
remote: Counting objects: 100% (110/110), done.
remote: Compressing objects: 100% (101/101), done.
remote: Total 110 (delta 2), reused 110 (delta 2), pack-reused 0
Receiving objects: 100% (110/110), 2.53 MiB | 7.05 MiB/s, done.
Resolving deltas: 100% (2/2), done.


In [1]:
%cd masters-thesis-code
%pwd

[Errno 2] No such file or directory: 'masters-thesis-code'
/Users/smcaleese/Documents/masters-thesis-code


'/Users/smcaleese/Documents/masters-thesis-code'

Install the necessary dependencies.

In [None]:
%pip install transformers datasets

## Get input

Download the IMDB dataset, clean the sentences, and create a list of input sentences.

In [2]:
from datasets import load_dataset

imdb = load_dataset("imdb")

Downloading readme: 100%|██████████| 7.81k/7.81k [00:00<00:00, 20.4MB/s]
Downloading data: 100%|██████████| 21.0M/21.0M [00:05<00:00, 4.12MB/s]
Downloading data: 100%|██████████| 20.5M/20.5M [00:02<00:00, 9.23MB/s]
Downloading data: 100%|██████████| 42.0M/42.0M [00:03<00:00, 11.6MB/s]
Generating train split: 100%|██████████| 25000/25000 [00:00<00:00, 436270.44 examples/s]
Generating test split: 100%|██████████| 25000/25000 [00:00<00:00, 529068.13 examples/s]
Generating unsupervised split: 100%|██████████| 50000/50000 [00:00<00:00, 310628.90 examples/s]


In [31]:
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [38]:
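# In the test split, the first 12,500 reviews are negative and the last 12,500 are positive,
# so draw a random sample instead of taking the first rows.

import random
random.seed(0)

num_samples = 10

test_sentences = imdb["test"]["text"]
test_labels = imdb["test"]["label"]

# Sample indices once so that each sentence stays paired with its own label.
# (Two separate random.sample calls would pick different rows for text and labels.)
sample_indices = random.sample(range(len(test_sentences)), num_samples)
random_sentences_subset = [test_sentences[i] for i in sample_indices]
random_labels_subset = [test_labels[i] for i in sample_indices]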

random_sentences_subset

["This movie was sadly under-promoted but proved to be truly exceptional. Entering the theatre I knew nothing about the film except that a friend wanted to see it.<br /><br />I was caught off guard with the high quality of the film. I couldn't image Ashton Kutcher in a serious role, but his performance truly exemplified his character. This movie is exceptional and deserves our monetary support, unlike so many other movies. It does not come lightly for me to recommend any movie, but in this case I highly recommend that everyone see it.<br /><br />This films is Truly Exceptional!",
 'On a dark, gloomy New Year\'s Eve night, an ill nurse, her life slowly ebbing away, demands that David Holm be presented to her at once. We don\'t yet know who David Holm is, or why this nurse wishes to see him, but her only dying wish is to speak with him just one more time. On the other side of the town, nestled comfortably amongst the gravestones of the local cemetery, Holm (Victor Sjöström, who also dire

In [48]:
def get_input_sentences_list(sentences, max_length=512):
    """Drop reviews longer than max_length words and replace HTML line breaks with spaces."""
    sentences_list = []
    for text in sentences:
        # Length is measured in whitespace-separated words, not model tokens.
        if len(text.split()) > max_length:
            continue
        text = text.replace("<br /><br />", " ")
        sentences_list.append(text)
    return sentences_list

input_sentences_list = get_input_sentences_list(random_sentences_subset)
print(f"Number of input sentences: {len(input_sentences_list)}")

Number of input sentences: 10
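
The cleaning step above only replaces the doubled `<br /><br />` marker. The following is a minimal sketch of a more general cleanup, assuming some reviews may also contain single or differently spaced `<br>` tags. The helper name `strip_html_breaks` is hypothetical and not part of the repository.

```python
import re

# Matches one or more consecutive <br>, <br/>, or <br /> tags in any letter case,
# together with any whitespace around them.
_BR_TAGS = re.compile(r"(?:\s*<br\s*/?>\s*)+", flags=re.IGNORECASE)

def strip_html_breaks(text):
    """Replace each run of HTML line-break tags with a single space."""
    return _BR_TAGS.sub(" ", text).strip()

example = "Great film.<br /><br />Would watch again.<br />Truly."
cleaned = strip_html_breaks(example)
print(cleaned)
```

If this behavior is wanted, the call could replace the `text.replace(...)` line inside `get_input_sentences_list`.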


Write the sentences to a file named `input/input_sentences.csv`.

In [56]:
import pandas as pd

df = pd.DataFrame(input_sentences_list, columns=["original_text"])
df.to_csv("./input/input_sentences.csv", index=False)
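
Note that `to_csv` does not create missing parent directories, so the cell above assumes `./input/` already exists. The following is a rough sketch, using only the standard library rather than pandas, of creating the directory first and checking that the sentences survive a round trip. The helper names and the temporary path are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

def write_sentences_csv(sentences, path, column="original_text"):
    """Write one sentence per row under a single header column."""
    path = Path(path)
    # Create the parent directory if it is missing.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([column])
        writer.writerows([s] for s in sentences)
    return path

def read_sentences_csv(path, column="original_text"):
    """Read the sentences back from the named column."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        return [row[column] for row in csv.DictReader(f)]

sample = ['He said "hi", then left.', "Second review."]
with tempfile.TemporaryDirectory() as tmp:
    out_path = write_sentences_csv(sample, Path(tmp) / "input" / "input_sentences.csv")
    round_trip = read_sentences_csv(out_path)

print(round_trip)
```

With pandas, calling `Path("./input").mkdir(parents=True, exist_ok=True)` before `df.to_csv(...)` has the same effect.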

## Run CLOSS and HotFlip

First run CLOSS without embedding optimization (`CLOSS-EO`) and without retraining the language modeling head (`CLOSS-RTL`).

- `CLOSS-EO`: skips optimizing the embedding. This increases the failure rate but lowers perplexity.
- `CLOSS-RTL`: skips retraining the language modeling head. This has no effect on perplexity but increases the failure rate.

In [None]:
%cd "/content/masters-thesis-code/CLOSS"
%pwd

1. Run HotFlip:

In [None]:
!python3 run_closs.py --dataset imdb --beamwidth 15 --w 5 --K 30 --evaluation hotflip_only --model bert --retrain_epochs 0 --lm_head default --saliency_method norm_grad --log_note ''

2. Run CLOSS without optimization and without retraining the language modeling head:

In [None]:
!python3 run_closs.py --dataset imdb --beamwidth 15 --w 5 --K 30 --evaluation closs-eo --model bert --retrain_epochs 0 --lm_head default --saliency_method norm_grad --log_note ''
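
The two commands above differ only in the `--evaluation` value. The following sketch builds both from one shared set of arguments so that the runs stay comparable. The flags and values are copied from the commands above; the helper `build_closs_command` is hypothetical, and the snippet only prints the commands rather than running them.

```python
import shlex

# Flags copied from the two commands above; shared by both runs.
SHARED_ARGS = {
    "dataset": "imdb",
    "beamwidth": 15,
    "w": 5,
    "K": 30,
    "model": "bert",
    "retrain_epochs": 0,
    "lm_head": "default",
    "saliency_method": "norm_grad",
    "log_note": "",
}

def build_closs_command(evaluation, **overrides):
    """Return the argv list for one run_closs.py invocation."""
    args = {**SHARED_ARGS, "evaluation": evaluation, **overrides}
    argv = ["python3", "run_closs.py"]
    for name, value in args.items():
        argv += [f"--{name}", str(value)]
    return argv

for evaluation in ("hotflip_only", "closs-eo"):
    print(shlex.join(build_closs_command(evaluation)))
```

Each returned list could be passed to `subprocess.run(argv, check=True)` from inside the `CLOSS` directory.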


## Run Polyjuice


### Setup

In [2]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 7.3 MB/s eta 0:00:00

[notice] A new release of pip is available: 24.0 -> 24.1.2
[notice] To update, run: pip install --upgrade pip
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
%cd polyjuice
%pwd

/Users/smcaleese/Documents/masters-thesis-code/polyjuice


'/Users/smcaleese/Documents/masters-thesis-code/polyjuice'

In [4]:
%pip install -e .

Obtaining file:///Users/smcaleese/Documents/masters-thesis-code/polyjuice
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: polyjuice_nlp
  Building editable for polyjuice_nlp (pyproject.toml) ... done
  Created wheel for polyjuice_nlp: filename=polyjuice_nlp-0.1.5-0.editable-py3-none-any.whl size=5974 sha256=ad4334654e88412e0a862c78377a92118182dc99bab4f4820ddf8d6e6e798e6e
  Stored in directory: /private/var/folders/t6/_lt6g5116z9f5127kxxf3qgc0000gn/T/pip-ephem-wheel-cache-9ndjf_nc/wheels/25/ab/5a/2c39cb2ced826c744df003583a7e2691ec72e79dc71b9ba517
Successfully built polyjuice_nlp
Installing collected packages: polyjuice_nlp
  Attempting uninstall: polyjuice_nlp
    Found existing installation: polyjuice_nlp 0.1.5
    Uninstalling polyjuic

An example showing how to generate a perturbation:

In [15]:
from polyjuice import Polyjuice
# Initialize the wrapper.
# model_path defaults to the released Polyjuice model:
# https://huggingface.co/uw-hai/polyjuice
# There is no need to change it unless you are using a custom model.
pj = Polyjuice(model_path="uw-hai/polyjuice", is_cuda=True)

# the base sentence
text = "It is great for kids."

# perturb the sentence with one line:
# When running it for the first time, the wrapper will automatically
# load related models, e.g. the generator and the perplexity filter.
perturbations = pj.perturb(
    orig_sent=text,
    ctrl_code="negation",
    num_perturbations=10
)
perturbations

INFO:polyjuice.polyjuice_wrapper:Setup Polyjuice.
INFO:polyjuice.polyjuice_wrapper:Setup SpaCy processor.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
INFO:polyjuice.polyjuice_wrapper:Setup perplexity scorer.


['It is not great for kids.']

In [None]:
# perturbations = pj.perturb(
#     orig_sent=text,
#     # can specify where to put the blank. Otherwise, it's automatically selected.
#     # Can be a list or a single sentence.
#     blanked_sent="It is [BLANK] for kids.",
#     # can also specify the ctrl code (a list or a single code.)
#     # The code should be from 'resemantic', 'restructure', 'negation', 'insert', 'lexical', 'shuffle', 'quantifier', 'delete'.
#     ctrl_code="negation",
#     # Customize the perplexity threshold.
#     perplex_thred=5,
#     # number of perturbations to return
#     num_perturbations=1,
#     # the function also takes in additional arguments for huggingface generators.
#     num_beams=3
# )

In [27]:
text = "This is surely British humour at its best. It tends to grow on you. The first time I watched it I couldn't quite figure out what it was all about but now I can watch the episodes over and over again and enjoy them every time."
# text = "This is surely British humour at its best."
perturbations = pj.perturb(
    orig_sent=text,
    ctrl_code="negation",
    num_perturbations=1,
    perplex_thred=None
)
perturbations

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


["This is surely British humour at its best. It tends to grow on you. The first time I watched it I couldn't quite figure out what it was all about but now I can watch the episodes over and never want to again and enjoy them every time."]

In [28]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("uw-hai/polyjuice")
model = AutoModelForCausalLM.from_pretrained("uw-hai/polyjuice")

In [31]:
sentence = "The quick brown fox jumps over the lazy dog."
ids = tokenizer.encode(sentence, return_tensors="pt")
ids

tensor([[  464,  2068,  7586, 21831, 18045,   625,   262, 16931,  3290,    13]])

In [46]:
output_ids = model.generate(ids, max_length=500)
output_text = tokenizer.decode(output_ids[0])
output_text

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'The quick brown fox jumps over the lazy dog. <|perturb|> [resemantic] The quick brown fox [BLANK] over the lazy dog. [SEP] withers [ANSWER] <|endoftext|>'
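
The raw generation above appears to follow the layout `original <|perturb|> [ctrl_code] blanked sentence [SEP] fill [ANSWER] <|endoftext|>`. The following sketch turns such a string into a filled-in sentence. It assumes that layout, which is inferred only from this single output; the helper `fill_blanks` is hypothetical, and the `Polyjuice` wrapper's own parsing may differ.

```python
import re

def fill_blanks(generated):
    """Return (ctrl_code, perturbed_sentence) from a raw generation string."""
    text = generated.replace("<|endoftext|>", "").strip()
    # Keep only the part after the perturbation marker.
    _, _, prompt = text.partition("<|perturb|>")
    match = re.match(r"\s*\[(\w+)\]\s*(.*?)\s*\[SEP\]\s*(.*)", prompt, flags=re.DOTALL)
    if match is None:
        raise ValueError("output does not match the assumed layout")
    ctrl_code, blanked, answer_part = match.groups()
    # Assumption: each fill is terminated by an [ANSWER] marker, in blank order.
    answers = [a.strip() for a in answer_part.split("[ANSWER]") if a.strip()]
    for answer in answers:
        blanked = blanked.replace("[BLANK]", answer, 1)
    return ctrl_code, blanked

raw = (
    "The quick brown fox jumps over the lazy dog. <|perturb|> [resemantic] "
    "The quick brown fox [BLANK] over the lazy dog. [SEP] withers [ANSWER] <|endoftext|>"
)
result = fill_blanks(raw)
print(result)
```

Applied to the output above, this yields the control code `resemantic` and a sentence in which `[BLANK]` has been replaced by `withers`.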