# Question Generation

In [121]:
from datasets import load_dataset
from transformers import AutoTokenizer, TFT5ForConditionalGeneration
import tensorflow as tf

squad_v2_data = load_dataset("squad_v2")
train_set = squad_v2_data["train"][:20000]
val_set = squad_v2_data["validation"][:400]

In [93]:
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = TFT5ForConditionalGeneration.from_pretrained("t5-small")

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [94]:
test_context = train_set["context"][0]
test_context

'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'

In [4]:
test_question = train_set["question"][0]
test_question

'When did Beyonce start becoming popular?'

In [5]:
len(test_question)

40

In [6]:
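# Tokenize a single (context, question) pair; each tensor has shape (1, seq_len).
inputs = tokenizer(test_context, return_tensors="tf").input_ids
labels = tokenizer(test_question, return_tensors="tf").input_ids

# Wrapping each tensor in a list adds a leading axis, so the dataset holds
# exactly one element that already carries a batch dimension of 1.
train_data = {}
train_data["input_ids"] = [inputs]
train_data["labels"] = [labels]
dataset = tf.data.Dataset.from_tensor_slices(train_data)

# No loss is passed: the Hugging Face model uses its built-in loss computed from "labels".
model.compile()

# Sanity check: overfit the model on this one example.
history = model.fit(dataset, epochs=20)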

Epoch 1/20




Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [7]:
history.history

{'loss': [8.033397674560547,
  3.1897623538970947,
  2.2098562717437744,
  0.6068986654281616,
  0.4252108037471771,
  0.18703721463680267,
  0.17042605578899384,
  0.029090795665979385,
  0.03936365246772766,
  0.012281271629035473,
  0.02421463467180729,
  0.020143898203969002,
  0.024941857904195786,
  0.012064033187925816,
  0.03561845421791077,
  0.018977342173457146,
  0.015171533450484276,
  0.004022296518087387,
  0.007062794174998999,
  0.006204564124345779]}

In [8]:
inputs = tokenizer(test_context, return_tensors="tf").input_ids
outputs = model.generate(inputs)



In [9]:
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

When did Beyonce start becoming popular?


## Train on more data

In [122]:
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = TFT5ForConditionalGeneration.from_pretrained("t5-small")

contexts = train_set["context"]
questions = train_set["question"]
val_contexts = val_set["context"]
val_questions = val_set["question"]

# Lengths in whitespace-separated words (not tokenizer tokens).
contexts_max_len = max(len(context.split()) for context in contexts)
questions_max_len = max(len(question.split()) for question in questions)

print(contexts_max_len, questions_max_len)



499 30
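
The two maxima above count whitespace-separated words, whereas the tokenizer's `max_length` counts tokens, and a subword tokenizer usually produces at least as many tokens as there are words. A minimal sketch of measuring lengths in tokens instead, assuming `token_id_lists` is a plain list of token ID lists (for example what `tokenizer(texts).input_ids` returns when `return_tensors` is not set); the helper name `length_report` is hypothetical:

```python
def length_report(token_id_lists, max_length):
    """Summarize sequence lengths measured in tokens.

    Returns the longest sequence length and how many sequences would be
    truncated if `max_length` were used as the tokenizer limit.
    """
    lengths = [len(ids) for ids in token_id_lists]
    longest = max(lengths)
    n_truncated = sum(1 for n in lengths if n > max_length)
    return longest, n_truncated


# Toy token IDs standing in for real tokenizer output (illustrative values only).
toy_ids = [[37, 1712, 1], [571, 186, 3, 9, 1], [2645, 1]]
longest, n_truncated = length_report(toy_ids, max_length=4)
print(longest, n_truncated)  # 5 1
```

Running such a report with `max_length=contexts_max_len` would show how many contexts the word-based limit actually cuts off.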


In [123]:
import pandas as pd
len(pd.Series(contexts).unique())

1593

In [124]:
[s[:200] for s in pd.Series(contexts).unique()]

['Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in v',
 'Following the disbandment of Destiny\'s Child in June 2005, she released her second solo album, B\'Day (2006), which contained hits "Déjà Vu", "Irreplaceable", and "Beautiful Liar". Beyoncé also venture',
 'A self-described "modern-day feminist", Beyoncé creates songs that are often characterized by themes of love, relationships, and monogamy, as well as female sexuality and empowerment. On stage, her dy',
 'Beyoncé Giselle Knowles was born in Houston, Texas, to Celestine Ann "Tina" Knowles (née Beyincé), a hairdresser and salon owner, and Mathew Knowles, a Xerox sales manager. Beyoncé\'s name is a tribute',
 "Beyoncé attended St. Mary's Elementary School in Fredericksburg, Texas, where she enrolled in dance classes. Her singing talent was discovered when dance instructor Darlette J

In [125]:
%%time
# Takes some time!

inputs = tokenizer(contexts, return_tensors="tf", max_length=contexts_max_len, padding="max_length", truncation=True).input_ids
labels = tokenizer(questions, return_tensors="tf", max_length=questions_max_len, padding="max_length", truncation=True).input_ids

val_inputs = tokenizer(val_contexts, return_tensors="tf", max_length=contexts_max_len, padding="max_length", truncation=True).input_ids
val_labels = tokenizer(val_questions, return_tensors="tf", max_length=questions_max_len, padding="max_length", truncation=True).input_ids

train_data = {}
train_data["input_ids"] = inputs
train_data["labels"] = labels

val_data = {}
val_data["input_ids"] = val_inputs
val_data["labels"] = val_labels



CPU times: user 8.69 s, sys: 585 ms, total: 9.28 s
Wall time: 2.5 s
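
With `padding="max_length"`, every label row is filled out with the pad token, and those positions should count toward the loss unless they are masked. Hugging Face models skip label positions set to `-100`, so a common extra step is to replace pad IDs in the labels with `-100`. A minimal sketch on plain Python lists, assuming a pad ID of 0 (the value of `tokenizer.pad_token_id` for T5); the helper name `mask_pad_labels` is hypothetical and this step is not part of the original notebook:

```python
IGNORE_INDEX = -100  # label value that Hugging Face loss functions skip


def mask_pad_labels(label_rows, pad_token_id=0):
    """Return a copy of `label_rows` with padding IDs replaced by IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in row]
        for row in label_rows
    ]


# Toy label rows padded to length 5 (illustrative token IDs only).
padded = [[366, 410, 1, 0, 0], [571, 1, 0, 0, 0]]
print(mask_pad_labels(padded))
# [[366, 410, 1, -100, -100], [571, 1, -100, -100, -100]]
```

On the tensors built in the cell above, the equivalent should be `labels = tf.where(labels == tokenizer.pad_token_id, -100, labels)`.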


In [126]:
from tensorflow.keras.optimizers import AdamW
from tensorflow.keras.optimizers.schedules import ExponentialDecay

initial_learning_rate = 0.001  # start with the default Adam value

lr_schedule = ExponentialDecay(
    # Over every 50000 iterations, the learning rate is multiplied by 0.7
    # (smoothly, since staircase=False by default).
    initial_learning_rate, decay_steps=50000, decay_rate=0.7,
)

dataset = tf.data.Dataset.from_tensor_slices(train_data).batch(8)
val_dataset = tf.data.Dataset.from_tensor_slices(val_data).batch(8)

adamw = AdamW(weight_decay=0.04, learning_rate=lr_schedule)

model.compile(optimizer=adamw)
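
With its default `staircase=False`, `ExponentialDecay` lowers the rate smoothly as `initial_learning_rate * decay_rate ** (step / decay_steps)`. A small sketch of that formula in plain Python shows how far the rate moves during this run, assuming 20000 training examples, a batch size of 8 and the 4 epochs used in the `fit` call further down (about 10000 optimizer steps); `decayed_lr` is a hypothetical helper, not a Keras function:

```python
def decayed_lr(step, initial_lr=0.001, decay_steps=50000, decay_rate=0.7):
    """Learning rate after `step` optimizer steps under smooth exponential decay."""
    return initial_lr * decay_rate ** (step / decay_steps)


steps_per_epoch = 20000 // 8        # 2500 batches per epoch
total_steps = steps_per_epoch * 4   # 10000 steps over 4 epochs

# Rate at the start, at the end of training, and after one full decay period.
for step in (0, total_steps, 50000):
    print(step, round(decayed_lr(step), 6))
```

Under these assumptions the rate only falls from 0.001 to roughly 0.00093 by the end of training, so the schedule has little effect over 4 epochs.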

We don't necessarily need validation here: a generated question can be perfectly good without matching the reference question in the validation set **exactly**, so `val_dataset` is not passed to `fit` below.
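
To make the point about exact matches concrete, a softer comparison can score word overlap between a generated question and a reference instead of string equality. The sketch below uses a token-level F1 in the style of the SQuAD answer metric; it is an illustrative assumption rather than something this notebook computes, and `token_f1` is a hypothetical helper:

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between two strings, using lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # The multiset intersection counts each shared token as often as it occurs in both.
    n_common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if n_common == 0:
        return 0.0
    precision = n_common / len(pred_tokens)
    recall = n_common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


generated = "What was the name of Beyoncé's first album?"
reference = "When did Beyonce start becoming popular?"
print(token_f1(generated, generated))  # 1.0
print(token_f1(generated, reference))  # 0.0
```

Both questions are valid for the same context, yet they share no tokens, which is why an overlap score against a single reference says little about question quality.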

In [None]:
history = model.fit(dataset, epochs=4)

In [131]:
n_sequences = 10
test_inputs = tokenizer(test_context, return_tensors="tf").input_ids
test_outputs = model.generate(test_inputs, num_beams=20, num_return_sequences=n_sequences, do_sample=False)
for i in range(n_sequences):
    print(tokenizer.decode(test_outputs[i], skip_special_tokens=True))

What was the name of Beyoncé's first album?
What was the name of Beyoncé's group's first album?
What was the name of Beyoncé's group?
What was the name of Beyoncé's first solo album?
Beyonce's first album, Dangerously in Love was called what?
What was the name of Beyonce's group's first album?
What was Beyoncé's first album, Dangerously in Love?
What was Beyoncé's first album?
Who managed Beyoncé's group in the late 1990s?
Beyonce's first album, Dangerously in Love?
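
Several of the beam candidates above differ only in accents (for example `Beyoncé` versus `Beyonce`). A minimal sketch of filtering such near-duplicates before showing the questions, assuming that accent- and case-insensitive equality is an acceptable notion of "same question"; `normalize` and `dedupe_questions` are hypothetical helpers:

```python
import unicodedata


def normalize(text):
    """Lowercase `text`, strip combining accents (é -> e) and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.lower().split())


def dedupe_questions(questions):
    """Keep the first occurrence of each question, comparing normalized forms."""
    seen = set()
    unique = []
    for question in questions:
        key = normalize(question)
        if key not in seen:
            seen.add(key)
            unique.append(question)
    return unique


candidates = [
    "What was the name of Beyoncé's group's first album?",
    "What was the name of Beyonce's group's first album?",
    "What was the name of Beyoncé's group?",
]
print(dedupe_questions(candidates))
```

Here the second candidate is dropped because it normalizes to the same string as the first.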


In [132]:
n_sequences = 20
#test_context_2 = """Johann Sebastian Bach[n 2] (31 March [O.S. 21 March] 1685 – 28 July 1750) was a German composer and musician of the late Baroque period. He is known for his orchestral music such as the Brandenburg Concertos; instrumental compositions such as the Cello Suites; keyboard works such as the Goldberg Variations and The Well-Tempered Clavier; organ works such as the Schubler Chorales and the Toccata and Fugue in D minor; and vocal music such as the St Matthew Passion and the Mass in B minor. Since the 19th-century Bach revival, he has been generally regarded as one of the greatest composers in the history of Western music."""
test_context_2 = """
Ariana Grande-Butera (/ˌɑːriˈɑːnə ˈɡrɑːndeɪ bjʊˈtɛərə/ AR-ee-AH-nə GRAHN-day byuu-TAIR-ə;[note 1] born June 26, 1993) is an American singer, songwriter, and actress. An influential figure in contemporary popular music, and often regarded as a pop culture icon, she is noted for her four-octave vocal range and whistle register that has garnered critical acclaim. Grande has received numerous accolades throughout her career, including two Grammy Awards, one Brit Award, one Bambi Award, two Billboard Music Awards, three American Music Awards, nine MTV Video Music Awards, and 30 Guinness World Records.

Grande began her music career at age 15 in the 2008 Broadway musical 13. She rose to fame for playing Cat Valentine in the Nickelodeon television series Victorious (2010–2013) and Sam & Cat (2013–2014). Grande signed with Republic Records in 2011 after label executives viewed YouTube videos of her covering songs. Her 1950s doo-wop-influenced pop and R&B debut album,[2] Yours Truly (2013), topped the US Billboard 200, while its lead single, "The Way", reached the top ten of the US Billboard Hot 100. Grande's voice and vocal performances on the album drew immediate comparisons to Mariah Carey."""
test_inputs_2 = tokenizer(test_context_2, return_tensors="tf").input_ids
test_outputs_2 = model.generate(test_inputs_2, num_beams=20, num_return_sequences=n_sequences, do_sample=False)

for i in range(n_sequences):
    print(tokenizer.decode(test_outputs_2[i], skip_special_tokens=True))

What was the name of Grande's first album?
Her four-octave vocal range and whistle register was what?
Her four-octave vocal range and whistle register garnered what accolade?
Her four-octave vocal range and whistle register were what?
What was the name of Grande's first single?
What was the name of the name of Grande's first album?
Her four-octave vocal range and whistle register garnered what award?
Her four-octave vocal range and whistle register earned what accolade?
Her four-octave vocal range was what?
What was the name of Grande's first solo album?
How many Grammy awards did Grande receive in her music career?
How many Grammy awards did Grande win in 2008?
What was the name of Grande's first song?
How many Grammy awards did Grande win in her career?
How many Grammy awards did Grande receive in her career?
How many Grammy awards did Grande win?
What is the name of Grande's first album?
When did Grande sign with Republic Records?
How many Grammy Awards did Grande win?
When did Gran