<a href="https://colab.research.google.com/github/r-chambers/TextAdventureGenerator/blob/main/CreateGraphModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [125]:
!pip install datasets
!pip install git-python==1.0.3
!pip install rouge_score
!pip install sacrebleu
!pip install -U accelerate
!pip install -U transformers
!pip install spacy



This notebook follows the tutorial at https://huggingface.co/blog/warm-starting-encoder-decoder.

In [126]:
import json

import tensorflow as tf
from tensorflow import keras
import numpy as np
from transformers import BertTokenizer, TrainingArguments, EncoderDecoderModel, Seq2SeqTrainer, Seq2SeqTrainingArguments
import pandas as pd
import datasets
from google.colab import drive
from datasets import Dataset
import spacy
import ast

Let's load the tokenizer and warm-start an encoder-decoder model from the pre-trained BERT checkpoint, tying the encoder and decoder weights.

In [127]:
tokenizer = BertTokenizer.from_pretrained("prajjwal1/bert-medium")
model = EncoderDecoderModel.from_encoder_decoder_pretrained("prajjwal1/bert-medium", "prajjwal1/bert-medium", tie_encoder_decoder=True)

Some weights of BertLMHeadModel were not initialized from the model checkpoint at prajjwal1/bert-medium and are newly initialized: ['bert.encoder.layer.3.crossattention.self.value.bias', 'bert.encoder.layer.4.crossattention.self.key.bias', 'bert.encoder.layer.5.crossattention.self.query.bias', 'bert.encoder.layer.1.crossattention.output.dense.weight', 'bert.encoder.layer.4.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.6.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.0.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.7.crossattention.self.query.weight', 'bert.encoder.layer.7.crossattention.self.value.bias', 'bert.encoder.layer.3.crossattention.self.query.weight', 'bert.encoder.layer.5.crossattention.output.dense.bias', 'bert.encoder.layer.1.crossattention.self.key.bias', 'bert.encoder.layer.1.crossattention.self.value.bias', 'bert.encoder.layer.2.crossattention.self.query.bias', 'bert.encoder.layer.0.crossattention.output.LayerNorm.weight', 'bert.enc

In [128]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [129]:
model.num_parameters()

50080570

In [130]:
# Putting model on the GPU
model = model.to("cuda")

In [131]:
# Setting model config
# bert-medium follows the same design as bert-base, so we assume it also defines no
# decoder start token or EOS token and take both from the tokenizer ([CLS] and [SEP]).
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.vocab_size = model.config.encoder.vocab_size
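
To see why the decoder start token matters, here is a minimal pure-Python sketch of how decoder inputs can be derived from the labels during training: the labels are shifted one position to the right and the start token is placed in front. This is a simplified analogue of what the library does internally, not its actual code. The IDs 101, 102 and 0 are the usual values for the bert-base-uncased vocabulary and are assumptions here, and the label IDs are made up.

```python
# Assumed special-token IDs (typical for the bert-base-uncased vocabulary);
# in the notebook they are read from the tokenizer instead.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def shift_right(labels, decoder_start_token_id=CLS_ID, pad_token_id=PAD_ID):
    """Build decoder inputs: prepend the start token and drop the last label."""
    shifted = [decoder_start_token_id] + list(labels[:-1])
    # -100 only marks positions ignored by the loss; it is not a valid token ID
    return [pad_token_id if tok == -100 else tok for tok in shifted]

labels = [2017, 2031, 12723, SEP_ID, -100, -100]  # hypothetical target IDs
decoder_inputs = shift_right(labels)
print(decoder_inputs)
```

At generation time there are no labels, so the decoder starts from `decoder_start_token_id` alone and stops once it emits `eos_token_id`.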

Let's get our training and test data.

In [132]:
data_dir = "/content/drive/MyDrive/TextAdventureModel"

# Context managers close the files even if json.load raises
with open(f"{data_dir}/jerichoworld_train_locations.json", "r") as f_train:
  train_data = json.load(f_train)
with open(f"{data_dir}/jerichoworld_test_locations.json", "r") as f_test:
  test_data = json.load(f_test)

In [133]:
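def get_full_input(x):
  # Model input: the action taken, followed by the current knowledge graph as a string
  full_string = x['next_state']['walkthrough_act'] + ' '
  full_string += str(x['state']['graph']) + ' '
  # Optional extra context, currently disabled:
  #full_string += x['state']['obs']
  #full_string += x['next_state']['walkthrough_act']
  return full_string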
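
As a quick illustration of the string the model sees, here is a small self-contained sketch. The record is hypothetical and only mimics the fields used above (`state.graph` and `next_state.walkthrough_act`); `build_input` is a stand-in that repeats the same concatenation so the snippet runs on its own.

```python
# Hypothetical record shaped like one transition in the data (field names taken from the code above)
example = {
    "state": {"graph": [["you", "in", "Closet"], ["Chief's office", "is", "east"]]},
    "next_state": {"walkthrough_act": "east"},
}

def build_input(record):
    """Same concatenation as get_full_input: the action, then the stringified graph."""
    return record["next_state"]["walkthrough_act"] + " " + str(record["state"]["graph"]) + " "

text = build_input(example)
print(text)
```

The graph is serialized with Python's `str`, so the model has to learn to reproduce brackets and quotes as ordinary tokens.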

Let's convert the training and evaluation data into Hugging Face `Dataset` objects, the format that the Seq2SeqTrainer expects when fine-tuning the model.

In [134]:
# Tokenize every state transition and pack the rows into a Dataset (via a pandas DataFrame)
def convert_to_dataset(data):
  data_list = []

  for game in data:
    for states in game:
      inputs = tokenizer(get_full_input(states), padding="max_length", truncation=True, max_length=512)
      outputs = tokenizer(str(states['next_state']['graph']), padding="max_length", truncation=True, max_length=512)

      row = {}
      row['input_ids'] = inputs.input_ids
      row['attention_mask'] = inputs.attention_mask
      row["labels"] = outputs.input_ids.copy()

      # Replace PAD token IDs with -100 so the loss ignores the padded positions
      # (padding="max_length" pads every target to 512 tokens)
      row["labels"] = [-100 if token == tokenizer.pad_token_id else token for token in row["labels"]]

      data_list.append(row)

  df = pd.DataFrame.from_records(data_list)
  return Dataset.from_pandas(df)
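
The padding and masking step can be checked in isolation. The sketch below is pure Python and does not use the tokenizer: `PAD_ID = 0` is an assumed pad ID (the notebook reads it from `tokenizer.pad_token_id`), the token IDs are made up, and `pad_and_mask` is a hypothetical helper.

```python
PAD_ID = 0            # assumed pad ID for illustration
IGNORE_INDEX = -100   # label value that PyTorch's cross-entropy loss skips by default

def pad_and_mask(token_ids, max_length):
    """Truncate/pad to max_length, then hide the padding from the loss."""
    padded = token_ids[:max_length] + [PAD_ID] * max(0, max_length - len(token_ids))
    labels = [IGNORE_INDEX if tok == PAD_ID else tok for tok in padded]
    return padded, labels

padded, labels = pad_and_mask([101, 2017, 2031, 102], max_length=6)
print(padded)
print(labels)
```

Without the masking, most of the loss for short graphs would come from predicting padding.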

In [135]:
# Creating train dataset
train_dataset = convert_to_dataset(train_data)

In [136]:
test_dataset = convert_to_dataset(test_data[0:2])

In [137]:
train_dataset.set_format(
    type="torch", columns=["input_ids", "attention_mask", "labels"],
)

In [138]:
len(train_dataset)

12099

In [139]:
len(test_dataset)

660

Let's set the parameters of the model.

In [140]:
# Maximum length of the generated output. Some graphs are this long, so we allow the full length.
# 512 is also the maximum sequence length used with the BERT tokenizer above.
model.config.max_length = 512
# We want a room name and some items but don't need much else.
model.config.min_length = 50
#model.config.temperature = 0.5
# This NEEDS to be zero, as we want tons of repeating ngrams with "you", "have" and such
model.config.no_repeat_ngram_size = 0
model.config.early_stopping = True
model.config.length_penalty = 2.0
model.config.num_beams = 4
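
To build some intuition for `length_penalty`, here is a minimal sketch of length-normalized beam scoring: the summed log-probability of a hypothesis is divided by its length raised to the penalty. Treat the formula as a simplified assumption about how beam search ranks finished hypotheses, not as the library's exact code. The two hypotheses and their scores are made up.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Total log-probability divided by length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)

# Two hypothetical finished beams (log-probabilities are always <= 0)
hypotheses = {
    "short": {"sum_logprobs": -4.0, "length": 4},
    "long": {"sum_logprobs": -9.0, "length": 10},
}

def best_hypothesis(length_penalty):
    """Name of the hypothesis with the highest normalized score."""
    return max(
        hypotheses,
        key=lambda name: beam_score(
            hypotheses[name]["sum_logprobs"], hypotheses[name]["length"], length_penalty
        ),
    )

for penalty in (0.0, 2.0):
    print(penalty, best_hypothesis(penalty))
```

With no penalty the shorter hypothesis wins. With a penalty of 2.0 the longer one wins, which is the behavior we want for long graphs.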

Let's set the parameters of the Seq2Seq Trainer.

In [141]:
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    #evaluation_strategy="steps",
    per_device_train_batch_size=8,
    #per_device_eval_batch_size=8,
    fp16=True,
    output_dir="./",
    logging_steps=2,
    save_steps=500,
    # eval_steps=4,
    # logging_steps=1000,
    # save_steps=500,
    # eval_steps=7500,
    # warmup_steps=2000,
    # save_total_limit=3,
)

In [142]:
rouge = datasets.load_metric("rouge")

In [143]:
# instantiate trainer
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset
)

In [144]:
trainer.train()

  decoder_attention_mask = decoder_input_ids.new_tensor(decoder_input_ids != self.config.pad_token_id)


Step,Training Loss
2,8.9062
4,8.6746
6,4.6922
8,4.346
10,3.8532
12,3.484
14,3.3888
16,3.0292
18,3.0711
20,2.5881


Checkpoint destination directory ./checkpoint-500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
  decoder_attention_mask = decoder_input_ids.new_tensor(decoder_input_ids != self.config.pad_token_id)
Checkpoint destination directory ./checkpoint-1000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
  decoder_attention_mask = decoder_input_ids.new_tensor(decoder_input_ids != self.config.pad_token_id)
Checkpoint destination directory ./checkpoint-1500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
  decoder_attention_mask = decoder_input_ids.new_tensor(decoder_input_ids != self.config.pad_token_id)
Checkpoint destination directory ./checkpoint-2000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
  decoder_attention_mask = decoder_input_ids.new_tensor(decoder_input_ids != self.config.pad_token_id)
Checkpoint destination directory ./checkp

TrainOutput(global_step=4539, training_loss=0.24429738187374955, metrics={'train_runtime': 508.5884, 'train_samples_per_second': 71.368, 'train_steps_per_second': 8.925, 'total_flos': 3812350276564992.0, 'train_loss': 0.24429738187374955, 'epoch': 3.0})
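
The numbers in the training output can be cross-checked with a little arithmetic. The sketch below uses the dataset size (12099 rows), the batch size of 8 and the 3 epochs reported above. It assumes a single device and no gradient accumulation.

```python
import math

num_rows = 12099        # size of train_dataset
batch_size = 8          # per_device_train_batch_size
epochs = 3              # 'epoch': 3.0 in the training output
runtime = 508.5884      # 'train_runtime' in seconds

steps_per_epoch = math.ceil(num_rows / batch_size)   # the last batch is partial
total_steps = steps_per_epoch * epochs
samples_per_second = num_rows * epochs / runtime
steps_per_second = total_steps / runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

The result matches `global_step=4539` and the reported throughput values.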

Now let's evaluate the model.

In [145]:
def generate_graph(text):
  # Tokenize the input and move the tensors to the GPU
  inputs = tokenizer(text, padding="max_length", truncation=True, max_length=512, return_tensors="pt")
  input_ids = inputs.input_ids.to("cuda")
  attention_mask = inputs.attention_mask.to("cuda")

  # Generate with the settings stored in model.config (beam search, length limits)
  outputs = model.generate(input_ids, attention_mask=attention_mask)

  # batch_decode returns a list with one decoded string per input sequence
  output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

  return output_str

In [146]:
output = generate_graph(get_full_input(test_data[0][2]))



In [147]:
print("INPUT:\n", get_full_input(test_data[0][2]))
print("\nOUTPUT:\n", output)

INPUT:
 abstract small to paper [['you', 'have', 'piece of white paper'], ['you', 'in', 'Closet'], ['small black pistol', 'in', 'Closet'], ["Chief's office", 'is', 'east']] 

OUTPUT:
 ["[ ['you ','have ','flashlight'], ['you ','in ','corridor near pit'], ['you ','have ','torch'], ['you ','have ','shovel'], ['center of camp ','is ','west'], ['storage tent ','is ','east'] ]"]
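
The generated text is a stringified list of triples, so it can be turned back into Python data with `ast.literal_eval`. The following is a minimal sketch: `parse_graph` is a hypothetical helper, and it simply returns an empty list when the model output is not a valid Python literal, which can happen with truncated generations.

```python
import ast

def parse_graph(generated):
    """Parse a generated graph string into a list of (subject, relation, object) tuples."""
    try:
        triples = ast.literal_eval(generated.strip())
    except (ValueError, SyntaxError):
        return []  # malformed output, e.g. unbalanced brackets
    if not isinstance(triples, list):
        return []
    cleaned = []
    for triple in triples:
        # Keep only well-formed triples and strip the stray spaces the decoder adds
        if isinstance(triple, (list, tuple)) and len(triple) == 3:
            cleaned.append(tuple(str(part).strip() for part in triple))
    return cleaned

sample = "[ ['you ','have ','flashlight'], ['you ','in ','corridor near pit'], ['storage tent ','is ','east'] ]"
print(parse_graph(sample))
print(parse_graph("[ ['you ','have'"))
```

Working with parsed triples also makes it possible to compare predicted and reference graphs as sets instead of raw strings.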


In [148]:
# Let's save the model
model.save_pretrained("/content/drive/My Drive/TextAdventureModel/model_medium")

In [149]:
def generate_predictions(test_data):
  predictions = []
  references = []

  for game in test_data:
    for states in game:
      inputs = tokenizer(get_full_input(states), padding="max_length", truncation=True, max_length=512, return_tensors="pt")
      input_ids = inputs.input_ids.to("cuda")
      attention_mask = inputs.attention_mask.to("cuda")

      outputs = model.generate(input_ids, attention_mask=attention_mask)

      output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

      predictions.append(output_str)
      references.append(str(states['next_state']['graph']))

  return predictions, references
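
The loop above calls `generate` once per state. One possible way to speed it up is to group the input strings into batches first, since the tokenizer accepts a list of strings and `generate` accepts batched tensors. The sketch below shows only the grouping step in pure Python; `batched` is a hypothetical helper and the input strings are placeholders.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"input {i}" for i in range(5)]  # placeholder input strings
batches = list(batched(texts, 2))
print(batches)
```

Each batch could then be tokenized and passed to the model in a single call, at the cost of more GPU memory per step.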

Let's generate our predictions. We only use the first game of the test data because generating predictions takes a long time.

In [150]:
pred, ref = generate_predictions(test_data[0:1])

In [151]:
rouge.compute(predictions=pred, references=ref, rouge_types=["rouge2"])["rouge2"].mid

Score(precision=0.1187155123512389, recall=0.1810656488218821, fmeasure=0.13168542431101993)
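
For reference, ROUGE-2 measures bigram overlap between a prediction and its reference. Here is a minimal sketch of the idea for a single pair of strings. It uses plain whitespace tokenization and no bootstrap aggregation, so it is a simplified stand-in for the metric used above and will not reproduce its exact numbers. The two strings are made up.

```python
from collections import Counter

def bigrams(text):
    """Count the bigrams of a whitespace-tokenized, lowercased string."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge2(prediction, reference):
    """Return (precision, recall, F1) of bigram overlap."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # clipped count of shared bigrams
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

p, r, f = rouge2("you have torch", "you have torch you in cave")
print(p, r, round(f, 4))
```

In this example every predicted bigram is correct, so precision is perfect, but recall is low because the prediction covers only part of the reference.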

Now let's save our model to Google Drive.

In [152]:
model.save_pretrained("/content/drive/My Drive/TextAdventureModel/model_medium")

Here is how we can load the model again later.

In [153]:
loaded_model = EncoderDecoderModel.from_pretrained("/content/drive/My Drive/TextAdventureModel/model_medium")

The following encoder weights were not tied to the decoder ['bert/pooler']
