### How to use the trained KGQG model (T5 trained on WQ dataset)

First we load the model and tokenizer:

In [1]:
from transformers import T5ForConditionalGeneration, T5TokenizerFast

# Fine-tuned question generation model
model = T5ForConditionalGeneration.from_pretrained('stanlochten/t5-KGQgen')

# Base T5 tokenizer, extended with the four graph marker tokens
tokenizer = T5TokenizerFast.from_pretrained(
    't5-base',
    extra_ids=0,
    additional_special_tokens=['<A>', '<H>', '<R>', '<T>'],
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Then we load the linearized graphs:

In [2]:
# load data
# The model is trained on graphs that have the following form:
# <A> answer node(s) <H> head <R> relation <T> tail <H> head <R> relation <T> tail ... etc ...

graphs = ['<A> matt lanter <H> Star Wars: The Clone Wars <R> other crew <T> Matt Lanter <H> Star Wars: The Clone Wars <R> starring <T> Darth Vader',
          '<A> parque warner madrid <H> Madrid <R> tourist attractions <T> Parque Warner Madrid <H> Parque Warner Madrid <R> rides <T> Batman: La Fuga',
          '<A> ukhta moscow <H> Roman Abramovich <R> ships owned <T> Ecstasea <H> Roman Abramovich <R> places lived <T> Moscow <H> Roman Abramovich <R> places lived <T> Ukhta'
]
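If your graphs are stored as triples rather than strings, a small helper can produce the linearized form shown above. The following is a minimal sketch, not part of the original notebook: the function name `linearize_graph` and the `Triple` alias are hypothetical, and lowercasing and space-joining the answer nodes is an assumption inferred from the three example strings, not a documented requirement of the model.

```python
from typing import Iterable, Sequence, Tuple

# (head, relation, tail); hypothetical alias used only in this sketch
Triple = Tuple[str, str, str]


def linearize_graph(answers: Sequence[str], triples: Iterable[Triple]) -> str:
    """Build '<A> answers <H> head <R> relation <T> tail ...' from triples.

    Assumption: answer nodes are lowercased and joined with spaces,
    mirroring the example inputs in this notebook.
    """
    parts = ["<A> " + " ".join(a.lower() for a in answers)]
    for head, relation, tail in triples:
        parts.append(f"<H> {head} <R> {relation} <T> {tail}")
    return " ".join(parts)


example = linearize_graph(
    ["Parque Warner Madrid"],
    [
        ("Madrid", "tourist attractions", "Parque Warner Madrid"),
        ("Parque Warner Madrid", "rides", "Batman: La Fuga"),
    ],
)
print(example)
```

For these inputs the helper reproduces the second example string in `graphs`.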

To generate a question for each graph, we tokenize the text into token ids, pass these ids to the model, and decode the generated ids back into text:

In [3]:
print('Tokenizing...')
inputs = tokenizer(graphs, return_tensors="pt", padding=True, truncation=True)
print('Predicting...')
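# Pass the attention mask so that padded positions in the batch are ignored
y_hats = model.generate(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask)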
print('Decoding...')
preds = tokenizer.batch_decode(y_hats, skip_special_tokens=True, clean_up_tokenization_spaces=True)

Tokenizing...
Predicting...
Decoding...
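For larger collections of graphs, the tokenize, generate, and decode steps can be wrapped in one function that processes the inputs in batches. This is a rough sketch under stated assumptions: the name `generate_questions` and the defaults `batch_size=8` and `max_new_tokens=32` are illustrative choices, not settings used by the model's author, and a different length limit may change the generated questions.

```python
from typing import List


def generate_questions(graphs: List[str], model, tokenizer,
                       batch_size: int = 8,
                       max_new_tokens: int = 32) -> List[str]:
    """Generate one question per linearized graph, batch by batch.

    `model` and `tokenizer` are expected to be the objects loaded earlier
    in this notebook; the default values here are assumptions.
    """
    questions: List[str] = []
    for start in range(0, len(graphs), batch_size):
        batch = graphs[start:start + batch_size]
        # Pad within the batch so all sequences share one tensor shape
        inputs = tokenizer(batch, return_tensors="pt",
                           padding=True, truncation=True)
        # The attention mask tells the model which positions are padding
        output_ids = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=max_new_tokens,
        )
        questions.extend(
            tokenizer.batch_decode(output_ids, skip_special_tokens=True)
        )
    return questions
```

With the objects from this notebook, it would be called as `preds = generate_questions(graphs, model, tokenizer)`.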


This yields the following generated questions:

In [4]:
for i, e in enumerate(preds):
    print(f'{i+1}. {e}')

1. who played darth vader in the film that matt lanter was a crew
2. what is the attraction in madrid that has the batman : la fuga ride
3. where does the owner of ecstasea live?
