# Potentially useful code snippets for your final projects
These cells were all part of draft versions of HW1, but we decided to exclude them from the final version. Most of them require the dependencies loaded in the HW1 notebook.

## Text generation
There are no assignments in this section, but we will see how to use a pretrained generative language model for *open-ended text generation* (e.g., conditional story generation and contextual text continuation). Given an input text passage as context, the task is to generate text that forms a coherent continuation of that context. Interest in open-ended text generation has grown with recent advances in pretrained generative language models, e.g., `GPT-2` [(Radford et al., 2019)](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf), `XLNet` [(Yang et al., 2020)](https://arxiv.org/pdf/1906.08237.pdf), and `CTRL` [(Keskar et al., 2020)](https://einstein.ai/presentations/ctrl.pdf). These models can generate impressively coherent text; see, e.g., [GPT2 on unicorns](https://openai.com/blog/better-language-models/#samples), [XLNet](https://medium.com/@amanrusia/xlnet-speaks-comparison-to-gpt-2-ea1a4e9ba39e), and [Fill-in-the-blank text generation with T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html). Below you will see how to implement open-ended text generation with very little effort.

First, run the cell below to download a pretrained generative model, `XLNet`, from S3.

In [None]:
import os

from transformers import AutoTokenizer, AutoModelWithLMHead

model_name_or_path = "xlnet-base-cased"
# pretrained_models_dir is defined in the HW1 notebook
cache_dir = os.path.join(pretrained_models_dir, model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, cache_dir=cache_dir)
model = AutoModelWithLMHead.from_pretrained(model_name_or_path, cache_dir=cache_dir)

Let's use the model to generate text for a given context passage. The following cell generates a different text each time you run it, so try it several times to see how it goes.

In [None]:
# Padding text helps XLNet with short contexts - proposed by Aman Rusia in
# https://github.com/rusiaaman/XLNet-gen#methodology
PADDING_TEXT = """Natural Language Processing (NLP) is the engineering art and 
science of how to teach computers to understand human language. NLP is a type 
of artificial intelligence technology, and it's now ubiquitous -- NLP lets us 
talk to our phones, use the web to answer questions, map out discussions in 
books and social media, and even translate between human languages. Since 
language is rich, ambiguous, and very difficult for computers to understand, 
these systems can sometimes seem like magic -- but these are engineering 
problems we can tackle with data, math, and insights from linguistics.<eod> 
</s> <eos>""" # can be a random text

context = "I am taking a course on natural language processing. So far it has been"
inputs = tokenizer.encode(PADDING_TEXT + context, add_special_tokens=False, 
                          return_tensors="pt")

# Character length of the decoded prompt (padding text + context); it is used
# below to strip the prompt from the decoded output
prompt_length = len(tokenizer.decode(inputs[0], skip_special_tokens=True, 
                                     clean_up_tokenization_spaces=True))
outputs = model.generate(inputs, max_length=250, do_sample=True, 
                         top_p=0.95, top_k=60)
generated_text = context + tokenizer.decode(outputs[0])[prompt_length:]
print(generated_text)
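
The `top_k=60` and `top_p=0.95` arguments in the cell above restrict sampling to the most probable next tokens, which is why the output varies between runs but stays reasonably coherent. The block below is a minimal pure-Python sketch of that idea, not the library's implementation: the helper names (`softmax`, `top_k_top_p_filter`, `sample_next_token`) are made up for illustration, and we assume top-k is applied first, followed by nucleus (top-p) filtering on the renormalized distribution.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_filter(probs, top_k=60, top_p=0.95):
    """Return {token_id: renormalized probability} after top-k and top-p filtering."""
    # Rank token ids from most to least probable and keep the k best.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in ranked)

    # Nucleus step: keep the shortest prefix whose cumulative
    # (renormalized) probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample_next_token(logits, top_k=60, top_p=0.95, rng=random):
    """Sample one token id from the filtered distribution."""
    filtered = top_k_top_p_filter(softmax(logits), top_k, top_p)
    ids = list(filtered)
    weights = [filtered[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


# Hypothetical 4-token vocabulary: only the two most probable tokens survive.
print(top_k_top_p_filter([0.5, 0.3, 0.15, 0.05], top_k=3, top_p=0.8))
```

Lowering `top_k` or `top_p` makes the output more conservative, while raising them allows rarer tokens to be sampled.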

## Machine translation (optional, 0 points)

Again, there are no assignments in this section. We will try `T5` [(Raffel et al., 2020)](https://arxiv.org/pdf/1910.10683.pdf), an encoder-decoder model pretrained on a multi-task mixture of unsupervised and supervised tasks, each of which is converted into a text-to-text format. The model works well out of the box on a variety of downstream NLP tasks when a task-specific prefix is prepended to the input, e.g., `translate English to German: ...` for translation and `summarize: ...` for summarization.
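
To make the text-to-text format concrete, here is a small sketch of how the task prefix can be attached to an input string. The helper `build_t5_input` is hypothetical (it is not part of `transformers`); it only covers the two prefixes mentioned above.

```python
def build_t5_input(task, text, source=None, target=None):
    """Prepend the prefix that tells T5 which task to perform."""
    if task == "translate":
        if not source or not target:
            raise ValueError("Translation needs both a source and a target language.")
        prefix = f"translate {source} to {target}: "
    elif task == "summarize":
        prefix = "summarize: "
    else:
        raise ValueError(f"Unsupported task: {task}")
    return prefix + text


print(build_t5_input("translate", "How are you?", source="English", target="German"))
print(build_t5_input("summarize", "A long article goes here."))
```

The resulting string is what gets tokenized and passed to the model; the model itself is the same for every task.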

Run the following cell to download the pretrained model from S3.

In [None]:
from transformers import AutoTokenizer, AutoModelWithLMHead

model_name_or_path = "t5-base"
cache_dir = os.path.join(pretrained_models_dir, model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, cache_dir=cache_dir)
model = AutoModelWithLMHead.from_pretrained(model_name_or_path, cache_dir=cache_dir)

Let's try a translation task: `translate English to German`.

In [None]:
source = "English"
target = "German" #  German, French, or Romanian
sentence = "This course will broadly focus on deep learning methods for \
natural language processing."
inputs = tokenizer.encode(f"translate {source} to {target}: {sentence}", 
                          return_tensors="pt")
outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)
# Decode the generated ids back to text, dropping special tokens
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
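
The `num_beams=4` argument in the cell above switches decoding from sampling to beam search, which keeps several partial translations alive at each step instead of committing to the single most likely next token. The block below is a toy sketch of the idea, not the library's implementation: it has no length penalty, and `next_log_probs`, `TABLE`, and `toy_next_log_probs` are hypothetical stand-ins for a real model.

```python
import math


def beam_search(next_log_probs, start_token, eos_token, num_beams=4, max_length=40):
    """Return the highest-scoring sequence found by a simple beam search.

    next_log_probs(seq) is a hypothetical callback that returns a dict mapping
    each candidate next token to its log-probability given the prefix seq.
    """
    beams = [(0.0, [start_token])]  # (cumulative log-probability, tokens)
    finished = []
    for _ in range(max_length):
        # Expand every live hypothesis by one token.
        candidates = [
            (score + logp, seq + [token])
            for score, seq in beams
            for token, logp in next_log_probs(seq).items()
        ]
        # Keep only the num_beams best expansions.
        candidates = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
        beams = [c for c in candidates if c[1][-1] != eos_token]
        finished += [c for c in candidates if c[1][-1] == eos_token]
        if not beams:
            break
    # Hypotheses cut off by max_length still compete with the finished ones.
    return max(finished + beams, key=lambda c: c[0])[1]


# Toy model: the locally best first token ("a") leads to a worse full sequence.
TABLE = {
    "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
    "a": {"x": math.log(0.5), "y": math.log(0.5)},
    "b": {"z": math.log(0.9), "w": math.log(0.1)},
}


def toy_next_log_probs(seq):
    # Tokens without an entry are always followed by the end-of-sequence token.
    return TABLE.get(seq[-1], {"</s>": 0.0})


print(beam_search(toy_next_log_probs, "<s>", "</s>", num_beams=1))  # greedy decoding
print(beam_search(toy_next_log_probs, "<s>", "</s>", num_beams=2))
```

With one beam the search is greedy and commits to `a`; with two beams it also keeps `b` alive and finds the sequence with the higher overall probability.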

# Fine-tuning BERT
If you would like to see how fine-tuning is performed, you can use the cell below to fine-tune `BERT` on a small subset (1K examples) of the `SST` dataset, which we call `smallSST`, located at `data/smallSST` (please uncomment the code before running it; it takes around 1-2 minutes).

In [None]:
start_time = timeit.default_timer()
task_name = "SST"
data_dir = f"./data/small{task_name}"
model_name_or_path = "bert-base-cased"
model_cache_dir = os.path.join(pretrained_models_dir, model_name_or_path)
data_cache_dir = f"./data_cache/finetuning/small{task_name}"
output_dir = f"./output/finetuning/bert-finetuned-small{task_name}"

do_target_task_finetuning(
    model_name_or_path=model_name_or_path,
    task_name=f"{task_name}-2",
    task_type="text_classification",
    do_train=True,
    # do_eval=True, # you can do evaluation while training 
    do_lower_case=True,
    data_dir=data_dir,
    max_seq_length=128,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=3.0,
    model_cache_dir=model_cache_dir,
    data_cache_dir=data_cache_dir,
    output_dir=output_dir,
    overwrite_output_dir=True,
)
elapsed_time = timeit.default_timer() - start_time
print(f"Time elapsed: {elapsed_time} seconds")
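
As a rough sanity check on the running time, you can work out how many optimizer updates the configuration above performs. This is a back-of-the-envelope sketch that assumes a single device and no gradient accumulation; `num_optimization_steps` is a hypothetical helper, not part of the HW1 code.

```python
import math


def num_optimization_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer updates when every batch triggers one update."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return int(batches_per_epoch * num_epochs)


# 1K examples, batch size 32, 3 epochs, as in the cell above
print(num_optimization_steps(1000, 32, 3.0))  # 96
```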

## Fine-tuning BERT for question answering
In this section, we will use `BERT` for a question answering task, i.e., `SQuAD` [(Rajpurkar et al., 2016)](https://nlp.stanford.edu/pubs/rajpurkar2016squad.pdf), whose dataset was built from Wikipedia. If you would like to see how fine-tuning is performed, uncomment the code in the cell below and run it to fine-tune `BERT` on a small subset (1K examples) of the dataset, which we call `smallSQuAD`, located at `data/smallSQuAD` (it takes around 4 minutes).

In [None]:
start_time = timeit.default_timer()
task_name = "SQuAD"
data_dir = f"./data/small{task_name}"
model_name_or_path = "bert-base-cased"
model_cache_dir = os.path.join(pretrained_models_dir, model_name_or_path)
data_cache_dir = f"./data_cache/finetuning/small{task_name}"
output_dir = f"./output/finetuning/small{task_name}"

do_target_task_finetuning(
    model_type="bert",
    model_name_or_path="bert-base-cased",
    task_type="question_answering",
    do_train=True,
    do_eval=False,
    do_lower_case=True,
    data_dir=data_dir,
    train_file="train-v1.1.json",
    dev_file="dev-v1.1.json",
    per_gpu_train_batch_size=12,
    per_gpu_eval_batch_size=8,
    learning_rate=3e-5,
    num_train_epochs=3.0,
    max_seq_length=384,
    doc_stride=128,
    model_cache_dir=model_cache_dir,
    data_cache_dir=data_cache_dir,
    output_dir=output_dir,
    overwrite_output_dir=True,
)
elapsed_time = timeit.default_timer() - start_time
print(f"Time elapsed: {elapsed_time} seconds")
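
The `max_seq_length=384` and `doc_stride=128` settings above matter because many Wikipedia paragraphs do not fit into a single input together with the question. The sketch below shows one common way to split a long paragraph into overlapping windows. It assumes the convention of the original BERT SQuAD script, where `doc_stride` is the step between the starts of consecutive windows and three positions are reserved for `[CLS]` and two `[SEP]` tokens; the HW1 helper may differ in details, and `split_into_windows` is a hypothetical name.

```python
def split_into_windows(num_doc_tokens, num_query_tokens, max_seq_length=384, doc_stride=128):
    """Return (start, length) spans of document tokens, one per model input."""
    # Room left for document tokens after the question and [CLS], [SEP], [SEP].
    max_doc_tokens = max_seq_length - num_query_tokens - 3
    if max_doc_tokens <= 0:
        raise ValueError("The question leaves no room for document tokens.")

    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_doc_tokens)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # this window reaches the end of the document
        start += min(length, doc_stride)
    return spans


# A 600-token paragraph with an 11-token question needs three overlapping windows.
print(split_into_windows(600, 11))  # [(0, 370), (128, 370), (256, 344)]
```

Because the windows overlap, an answer near a window boundary still appears with enough surrounding context in at least one window.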

You can uncomment the cell below to evaluate the trained model on the `SQuAD` dev set. It takes around 6-8 minutes. You should see an F1 score of 87.8% if you run it.

In [None]:
start_time = timeit.default_timer()
task_name = "SQuAD"
data_dir = f"./data/small{task_name}"
model_name_or_path = "bert-base-cased-finetuned-squad"
model_cache_dir = os.path.join(pretrained_models_dir, model_name_or_path)
data_cache_dir = f"./data_cache/finetuning/small{task_name}"
output_dir = f"./output/finetuning/small{task_name}"

do_target_task_finetuning(
    model_type="bert",
    model_name_or_path=model_cache_dir,
    task_type="question_answering",
    do_train=False,
    do_eval=True,
    do_lower_case=True,
    data_dir=data_dir,
    train_file="train-v1.1.json",
    dev_file="dev-v1.1.json",
    per_gpu_eval_batch_size=8,
    max_seq_length=384,
    doc_stride=128,
    model_cache_dir=model_cache_dir,
    data_cache_dir=data_cache_dir,
    output_dir=output_dir
)
elapsed_time = timeit.default_timer() - start_time
print(f"Time elapsed: {elapsed_time} seconds")
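
The F1 score reported for `SQuAD` measures token overlap between the predicted answer and a reference answer after light normalization. Below is a simplified sketch in the spirit of the official SQuAD v1.1 evaluation script: it scores a single prediction against a single reference, whereas the official script takes the maximum over all reference answers for a question and averages over questions. The function names are our own.

```python
import collections
import re
import string


def normalize_answer(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # If either side is empty, the score is 1 only when both are empty.
        return float(pred_tokens == ref_tokens)

    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up example: two of the three reference tokens are recovered.
print(token_f1("the cat sat", "cat sat down"))
```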

## Fine-tuning BERT for sequence labeling
Next, we will use `BERT` for a named entity recognition (`NER`) task, i.e., `CoNLL-2003` [(Tjong Kim Sang and De Meulder, 2003)](https://www.aclweb.org/anthology/W03-0419.pdf), which was built from news data. For each word in a given input text, the task is to identify whether the word is part of a named entity by assigning it one of the following tags:

*   `O`: Outside of a named entity

*   `B-MISC`: Beginning of a miscellaneous entity right after another miscellaneous entity

*   `I-MISC`: Miscellaneous entity

*   `B-PER`: Beginning of a person's name right after another person's name

*   `I-PER`: Person's name

*   `B-ORG`: Beginning of an organisation right after another organisation

*   `I-ORG`: Organisation

*   `B-LOC`: Beginning of a location right after another location

*   `I-LOC`: Location.
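
Under the tag scheme above, a `B-` tag is only needed to separate two adjacent entities of the same type; otherwise an entity simply starts with an `I-` tag. The following sketch groups a tag sequence into entity spans under that reading. The function `iob1_to_spans` and the example tags are made up for illustration.

```python
def iob1_to_spans(tags):
    """Group tags into (entity_type, start, end) spans, with end exclusive."""
    spans = []
    current_type, start = None, None
    for i, tag in enumerate(tags):
        if tag == "O":
            prefix, ent_type = "O", None
        else:
            prefix, ent_type = tag.split("-", 1)

        # Close the open entity when we hit "O", a different entity type,
        # or a "B-" tag that starts a new entity of the same type.
        if current_type is not None and (ent_type != current_type or prefix == "B"):
            spans.append((current_type, start, i))
            current_type = None

        # Open a new entity if this token belongs to one.
        if ent_type is not None and current_type is None:
            current_type, start = ent_type, i

    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans


tags = ["I-ORG", "I-ORG", "O", "I-PER", "B-PER", "I-PER", "O", "I-LOC"]
print(iob1_to_spans(tags))
# [('ORG', 0, 2), ('PER', 3, 4), ('PER', 4, 6), ('LOC', 7, 8)]
```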

You can uncomment and run the cell below to see how fine-tuning is performed on a small subset (1K examples) of the dataset, which we call `smallNER`, located at `data/smallNER` (it takes around 2-3 minutes).

In [None]:
start_time = timeit.default_timer()
task_name = "NER"
data_dir = f"./data/small{task_name}"
label_file = f"{data_dir}/labels.txt"
model_name_or_path = "bert-base-cased"
model_cache_dir = os.path.join(pretrained_models_dir, model_name_or_path)
data_cache_dir = f"./data_cache/finetuning/small{task_name}"
output_dir = f"./output/finetuning/small{task_name}"

do_target_task_finetuning(
    model_type="bert",
    model_name_or_path="bert-base-cased",
    task_type="sequence_labeling",
    do_train=True,
    do_eval=False,
    do_lower_case=False,
    data_dir=data_dir,
    labels=label_file,
    max_seq_length=128,
    per_device_train_batch_size=32,
    num_train_epochs=6.0,
    model_cache_dir=model_cache_dir,
    data_cache_dir=data_cache_dir,
    output_dir=output_dir,
    overwrite_output_dir=True,
)
elapsed_time = timeit.default_timer() - start_time
print(f"Time elapsed: {elapsed_time} seconds")

Again, we will provide you with a model trained on the full `CoNLL-2003` dataset of 14K examples. Run the following cell to download the trained model.

In [None]:
data_file = drive.CreateFile({'id': '1JJIGk6PS9U7C0zoTk121ZfYXhfMxZWTh'})
data_file.GetContentFile('bert-base-cased-finetuned-ner.zip')

# Extract the data from the zipfile and put it into pretrained_models_dir
with zipfile.ZipFile('bert-base-cased-finetuned-ner.zip', 'r') as zip_file:
    zip_file.extractall(pretrained_models_dir)
os.remove('bert-base-cased-finetuned-ner.zip')
print("bert-base-cased-finetuned-ner downloaded!")


Now, let's use the trained model to make a few predictions. Your task is to complete the code below to show the predicted named entity tag for each token of a given input sentence.

In [None]:
sequence = "The University of Massachusetts Amherst is a public research and land-grant university in Amherst, " \
            "Massachusetts. It is the flagship campus of the University of Massachusetts system."

# YOUR CODE HERE!

##### SOLUTION #####
task_name = "NER"
data_dir = f"./data/small{task_name}"
label_file = f"{data_dir}/labels.txt"
model_name_or_path = "bert-base-cased-finetuned-ner"
pretrained_weights = os.path.join(pretrained_models_dir, model_name_or_path)
task_type = "sequence_labeling"
model = AUTO_MODEL[task_type].from_pretrained(pretrained_weights)
tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)

with open(label_file, "r") as f:
  label_list = f.read().splitlines()

# Get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")

outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)

for token, prediction in zip(tokens, predictions[0].tolist()):
  print(f"{token}: {label_list[prediction]}")
##### END SOLUTION #####

## Further pretraining on the target training data
Another approach to improving the performance of `BERT` on a target task is to further pretrain it with the masked language modeling objective on the target task's unlabeled data before fine-tuning with the target task's supervised data. The intuition is that the target task's data likely comes from a different distribution than the general-domain data used for pretraining `BERT`, no matter how diverse the pretraining data is. Previous work demonstrated that this can lead to significant performance gains [(Howard and Ruder, 2018](https://arxiv.org/pdf/1801.06146.pdf), [Gururangan et al., 2020)](https://arxiv.org/pdf/2004.10964.pdf).
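
As a reminder of what the masked language modeling objective does to the input, here is a minimal sketch of `BERT`-style masking: each token is selected for prediction with some probability (15% in the original `BERT` recipe), and a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged otherwise. The function `mask_tokens` below is a simplified illustration that works on token strings and ignores special tokens; it is not the data collator used later in this section.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mlm_probability=0.15, rng=None):
    """Return (corrupted_tokens, labels); labels are None where nothing is predicted."""
    rng = rng or random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mlm_probability:
            continue  # token not selected for prediction
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_TOKEN  # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


# Made-up sentence and vocabulary, with a fixed seed for repeatability.
sentence = "the movie was surprisingly good".split()
vocab = ["film", "bad", "great", "plot", "actor"]
print(mask_tokens(sentence, vocab, rng=random.Random(0)))
```

The loss is computed only at positions where a label is set, so the model never gets credit for simply copying unmasked tokens.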

In the following cell, we provide you with a function for fine-tuning `BERT` with the language modeling objective on the target task's unlabeled data. No need to edit any code, just run the cell.

In [None]:
def do_target_task_LM_finetuning(model_name_or_path, output_dir, **kwargs):
    r""" Further pretraining BERT with the masked language modeling objective on the target task's unlabeled data.
    Params:
        **model_name_or_path**: either:
            - a string with the `shortcut name` of a pre-trained model configuration to load from cache
                or download and cache if not already stored in cache (e.g. 'bert-base-uncased').
            - a path to a `directory` containing a configuration file saved
                using the `save_pretrained(save_directory)` method.
            - a path or url to a saved configuration `file`.
        **output_dir**: string:
            The output directory where the model predictions and checkpoints will be written.
        **kwargs**: (`optional`) dict:
            Dictionary of key/value pairs with which to update the configuration object after loading.
            - The values in kwargs of any keys which are configuration attributes will be used
            to override the loaded values.
    """
    model_args = ModelArguments(model_name_or_path=model_name_or_path)
    data_args = LMDataTrainingArguments()
    training_args = TrainingArguments(output_dir=output_dir)

    # override the loaded configs
    configs = (model_args, data_args, training_args)
    for config in configs:
        for key, value in kwargs.items():
            if hasattr(config, key):
                setattr(config, key, value)

    if data_args.eval_data_file is None and training_args.do_eval:
        raise ValueError(
            "Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file "
            "or remove the --do_eval argument."
        )

    if (
        os.path.exists(training_args.output_dir)
        and os.listdir(training_args.output_dir)
        and training_args.do_train
        and not training_args.overwrite_output_dir
    ):
        raise ValueError(
            f"Output directory ({training_args.output_dir}) already exists and is not empty. "
            f"Use --overwrite_output_dir to overcome."
        )

    for p in [model_args.model_cache_dir, training_args.output_dir]:
        if not os.path.exists(p):
            os.makedirs(p)

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO,
    )

    logger.info("Process device: %s, n_gpu: %s", training_args.device, training_args.n_gpu)
    logger.info("Training/evaluation parameters %s", training_args)

    # Set seed
    set_seed(training_args.seed)

    # Load pretrained model and tokenizer

    config = AutoConfig.from_pretrained(model_args.model_name_or_path, cache_dir=model_args.model_cache_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, cache_dir=model_args.model_cache_dir)
    model = AutoModelWithLMHead.from_pretrained(model_args.model_name_or_path, from_tf=False, config=config,
                                                cache_dir=model_args.model_cache_dir)

    model.resize_token_embeddings(len(tokenizer))

    assert data_args.mlm, "BERT must be run using the --mlm flag (masked language modeling)."

    if data_args.block_size <= 0:
        data_args.block_size = tokenizer.max_len
        # Our input block size will be the max possible for the model
    else:
        data_args.block_size = min(data_args.block_size, tokenizer.max_len)

    # Get datasets

    train_dataset = (get_dataset(data_args, tokenizer=tokenizer, cache_dir=model_args.data_cache_dir)
                     if training_args.do_train else None)
    eval_dataset = (get_dataset(data_args, tokenizer=tokenizer, evaluate=True, cache_dir=model_args.data_cache_dir)
                    if training_args.do_eval else None)

    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=data_args.mlm, mlm_probability=data_args.mlm_probability
    )

    # Initialize our Trainer
    trainer_params = {
        "model": model,
        "args": training_args,
        "train_dataset": train_dataset,
        "eval_dataset": eval_dataset,
        "data_collator": data_collator,
        "prediction_loss_only": True,
    }
    trainer = Trainer(**trainer_params)

    # Training
    if training_args.do_train:
        trainer.train(
            model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
        )
        trainer.save_model()
        # For convenience, we also re-save the tokenizer to the same directory
        tokenizer.save_pretrained(training_args.output_dir)

    # Evaluation
    eval_results = {}
    if training_args.do_eval:
        logger.info("*** Evaluate ***")
        eval_output = trainer.evaluate()

        perplexity = math.exp(eval_output["eval_loss"])
        eval_result = {"perplexity": perplexity}

        output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt")
        with open(output_eval_file, "w") as writer:
            logger.info("***** Eval results *****")
            for key, value in eval_result.items():
                logger.info("  %s = %s", key, value)
                writer.write("%s = %s\n" % (key, value))

        eval_results.update(eval_result)
    return eval_results
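
The evaluation branch of the function above reports perplexity as the exponential of the evaluation loss. The sketch below spells out that relationship, assuming the loss is the average negative log-likelihood (in nats) over the predicted tokens; `perplexity_from_token_losses` is a hypothetical helper. For a masked language model this number only covers the masked positions, so it is not directly comparable to the perplexity of a left-to-right language model.

```python
import math


def perplexity_from_token_losses(token_losses):
    """exp of the mean per-token negative log-likelihood."""
    mean_loss = sum(token_losses) / len(token_losses)
    return math.exp(mean_loss)


# If the model assigns probability 1/4 to every target token,
# its perplexity is (up to floating-point error) 4.
losses = [-math.log(0.25)] * 3
print(perplexity_from_token_losses(losses))
```

Intuitively, a perplexity of 4 means the model is as uncertain as if it were choosing uniformly among 4 tokens at each predicted position.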

You can use the cell below to perform target task language model fine-tuning of `BERT` on a subset (1000 sentences) of the unlabeled portion of the original `SST` dataset (please uncomment the code before running it; it takes around 1 minute). The data file should be a plain text file with one sentence per line; see `sents_train.txt` in `data/tinySST`.

In [None]:
start_time = timeit.default_timer()
task_name = "SST"
data_dir = f"./data/tiny{task_name}"
model_name_or_path = "bert-base-cased"
model_cache_dir = os.path.join(pretrained_models_dir, model_name_or_path)
data_cache_dir = f"./data_cache/lm-finetuning/tiny{task_name}/"
output_dir = f"./output/lm-finetuning/tiny{task_name}"

# Target task LM fine-tuning
do_target_task_LM_finetuning(
    model_name_or_path=model_name_or_path,
    model_type="bert",
    do_train=True,
    train_data_file=f"{data_dir}/sents_train.txt",
    line_by_line=True,
    do_eval=True,
    eval_data_file=f"{data_dir}/sents_dev.txt",
    model_cache_dir=model_cache_dir,
    output_dir=output_dir,
    overwrite_output_dir=True,
    mlm=True,
)

# Evaluate
model_dir = f"./output/lm-finetuning/tiny{task_name}"
do_target_task_LM_finetuning(
    model_name_or_path=model_dir,
    model_type="bert",
    do_train=False,
    line_by_line=True,
    do_eval=True,
    eval_data_file=f"{data_dir}/sents_dev.txt",
    model_cache_dir=model_cache_dir,
    output_dir=output_dir,
    mlm=True,
)
elapsed_time = timeit.default_timer() - start_time
print(f"Time elapsed: {elapsed_time} seconds")

Since fine-tuning `BERT` with the language modeling objective on all of the unlabeled data in the original `SST` dataset (67K examples) would take a while, we provide you with a trained model to save you time. Run the following cell to download the model.

In [None]:
data_file = drive.CreateFile({'id': '1PwiB5CsziaqfSq6WjxVnqGt7AWU5r_WK'})
data_file.GetContentFile('bert-base-cased-lm-finetuned-sst.zip')

# Extract the model from the zipfile and put it into pretrained_models_dir
with zipfile.ZipFile('bert-base-cased-lm-finetuned-sst.zip', 'r') as zip_file:
    zip_file.extractall(pretrained_models_dir)
os.remove('bert-base-cased-lm-finetuned-sst.zip')
print("bert-base-cased-lm-finetuned-sst downloaded!")

The following cell fine-tunes the `BERT` model `bert-base-cased-lm-finetuned-sst` on the supervised data of the target task `tinySST` and then evaluates the resulting model on the `tinySST` dev set.

In [None]:
start_time = timeit.default_timer()
task_name = "SST"
data_dir = f"./data/tiny{task_name}"
model_name_or_path = "bert-base-cased-lm-finetuned-sst"
model_cache_dir = os.path.join(pretrained_models_dir, model_name_or_path)
data_cache_dir = f"./data_cache/lm-finetuning/tiny{task_name}/"
output_dir = model_cache_dir

mean = None
std = None

# Fine-tune BERT for 20 epochs using 4 random seeds
for seed in [1234, 2341, 3412, 4123]:
  output_dir = f"./output/tiny{task_name}-{seed}"
  # YOUR CODE HERE!

  ##### SOLUTION #####
  do_target_task_finetuning(
      seed=seed,
      model_name_or_path=model_cache_dir,
      task_name=f"{task_name}-2",
      task_type="text_classification",
      do_train=True,
      do_eval=False, 
      do_lower_case=True,
      data_dir=data_dir,
      max_seq_length=128,
      per_device_train_batch_size=32,
      learning_rate=2e-5,
      num_train_epochs=20.0,
      model_cache_dir=model_cache_dir,
      data_cache_dir=data_cache_dir,
      output_dir=output_dir,
      overwrite_output_dir=True,
  )
##### END SOLUTION #####

# Evaluate BERT on the dev set
results = []
for seed in [1234, 2341, 3412, 4123]:
  # YOUR CODE HERE!
  
  ##### SOLUTION #####
  model_dir = f"./output/tiny{task_name}-{seed}"
  result = do_target_task_finetuning(
      seed=seed,
      model_name_or_path=model_dir,
      task_name=f"{task_name}-2",
      task_type="text_classification",
      do_train=False,
      do_eval=True, 
      do_lower_case=True,
      data_dir=data_dir,
      max_seq_length=128,
      model_cache_dir=model_cache_dir,
      data_cache_dir=data_cache_dir,
      output_dir=model_dir,
  )
  results.append(result["eval_acc"])

results = np.array(results)
mean = np.mean(results)
std = np.std(results)
##### END SOLUTION #####

print("===== Target task language model fine-tuning =====")
print(f"Performance when fine-tuning BERT for 20 epochs: {mean} +/- {std}")
elapsed_time = timeit.default_timer() - start_time
print(f"Time elapsed: {elapsed_time} seconds")
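
A note on the `+/-` value printed above: `np.std` computes the population standard deviation by default (`ddof=0`), which is smaller than the sample standard deviation, and the gap is noticeable with only 4 seeds. The sketch below uses the standard library to show both; `summarize_runs` is a hypothetical helper and the scores are made up.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, population std, sample std) of per-seed scores."""
    mean = statistics.fmean(scores)
    population_std = statistics.pstdev(scores)  # same convention as np.std's default
    sample_std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, population_std, sample_std


# Made-up accuracies from 4 seeds
mean, pop_std, sample_std = summarize_runs([0.80, 0.90, 0.85, 0.85])
print(f"{mean:.3f} +/- {pop_std:.3f} (population) or {sample_std:.3f} (sample)")
```

Whichever convention you choose for your project report, state it alongside the number of seeds.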