## Some trials, samples with readymade QG Algorithms

In [1]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def generate_questions(context, answer, num_questions=5):
    model_name = "valhalla/t5-small-qg-prepend"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    input_text = f"answer: {answer} context: {context}"
    input_ids = tokenizer.encode(input_text, return_tensors="pt")

    # Generate a set of questions
    questions = set()
    while len(questions) < num_questions:
        outputs = model.generate(input_ids, num_return_sequences=5, max_length=64, num_beams=5)
        for output in outputs:
            question = tokenizer.decode(output, skip_special_tokens=True)
            questions.add(question)

    return list(questions)

# Example usage
context = "The Apollo program was the third United States human spaceflight program carried out by the National Aeronautics and Space Administration (NASA), which succeeded in landing the first humans on the Moon from 1969 to 1972."
answer = "National Aeronautics and Space Administration"
questions = generate_questions(context, answer)
for i, question in enumerate(questions):
    print(f"Question {i+1}: {question}")


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Question 1: Who was the third human spaceflight program in the United States?
Question 2: Who was the third human spaceflight program of the United States?
Question 3: Who carried out the Apollo program?
Question 4: Who carried out the third human spaceflight program?
Question 5: Who carried out the third human spaceflight program in the United States?


### Reset the Kernel

In [8]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)


{'status': 'ok', 'restart': True}

: 

## Generating the Dataset from SQuAD

### Load the dataset by downloading

In [1]:
from tqdm import tqdm
from datasets import load_dataset
from sklearn.utils import shuffle
import pandas as pd
import os

def load_squad_dataset(dataset):
    df_dataset = pd.DataFrame(columns=['context', 'question', 'answer'])
    num_of_answer = 0
    for index, value in tqdm(enumerate(dataset)):
        context = value['context']
        question = value['question']
        answer = value['answers']['text'][0]
        number_of_words = len(answer.split())
        df_dataset.loc[num_of_answer] = [context] + [question] + [answer]
        num_of_answer = num_of_answer + 1
    return df_dataset


print('Downloading SQuAD dataset...')
train_dataset = load_dataset("squad", split='train')
valid_dataset = load_dataset("squad", split='validation')
print('train: ', len(train_dataset))
print('validation: ', len(valid_dataset))

pd.set_option('display.max_colwidth', None)
print('Loading df_train...')
df_train = load_squad_dataset(train_dataset)
print('Loading df_validation...')
df_validation = load_squad_dataset(valid_dataset)

print('Shuffling DataFrame...')
df_train = shuffle(df_train)
df_validation = shuffle(df_validation)

Downloading SQuAD dataset...


Downloading readme:   0%|          | 0.00/7.62k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

train:  87599
validation:  10570
Loading df_train...


87599it [04:38, 314.84it/s] 


Loading df_validation...


10570it [00:08, 1204.09it/s]


Shuffling DataFrame...


### See some examples from the dataset

In [2]:
print('df_train.shape')
print(df_train.shape)
print('df_validation.shape')
print(df_validation.shape)
print('df_train.head():')
print(df_train.head())
print('df_validation.head():')
print(df_validation.head())

df_train.shape
(87599, 3)
df_validation.shape
(10570, 3)
df_train.head():
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     context  \
4875                                                          

### Save the dataset as CSV

In [3]:
print('Saving dataset as csv...')
dataset_save_path = 'datasets/'
if not os.path.exists(dataset_save_path):
    os.makedirs(dataset_save_path)
df_train.to_csv(dataset_save_path + 'squad_train.csv', index=False)
df_validation.to_csv(dataset_save_path + 'squad_validation.csv', index=False)
print('All done.')

Saving dataset as csv...
All done.


## Training a QG-Algorithm, Model: T5

### Imports & Initial Settings

In [6]:
import os
import time
import copy
import argparse
import torch
import pytorch_lightning as pl
import pandas as pd
from torch.utils.data import Dataset, DataLoader
from pytorch_lightning.callbacks.early_stopping import EarlyStopping
from transformers import T5Tokenizer, T5ForConditionalGeneration
from tqdm import tqdm
from torch.optim import AdamW


train_file_path = 'datasets/squad_train.csv'
validation_file_path = 'datasets/squad_validation.csv'
save_model_path = 'model/'
save_tokenizer_path = 'tokenizer/'
pretrained_model = 't5-large'

# Set training arguments
args = {
    'num_workers': 2,
    'batch_size': 8,
    'learning_rate': 3e-5,
    'eps': 1e-8,
    'weight_decay': 0.0
}


### Define the dataset class


In [7]:
class QGDataset(Dataset):
    def __init__(self, tokenizer, file_path, max_len_input=512, max_len_output=128):
        self.tokenizer = tokenizer
        self.data = pd.read_csv(file_path)
        self.max_len_input = max_len_input
        self.max_len_output = max_len_output
        self.context_column = 'context'
        self.answer_column = 'answer'
        self.question_column = 'question'
        self.inputs = []
        self.targets = []
        self._load_data()

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index):
        source_ids = self.inputs[index]['input_ids'].squeeze()
        target_ids = self.targets[index]['input_ids'].squeeze()
        source_mask = self.inputs[index]['attention_mask'].squeeze()
        target_mask = self.targets[index]['attention_mask'].squeeze()
        labels = copy.deepcopy(target_ids)
        labels[labels == 0] = -100
        return {'source_ids': source_ids, 'source_mask': source_mask, 'target_ids': target_ids, 'target_mask': target_mask, 'labels': labels}

    def _load_data(self):
        for idx in tqdm(range(len(self.data))):
            context, answer, target = self.data.loc[idx, self.context_column], self.data.loc[idx, self.answer_column], self.data.loc[idx, self.question_column]
            input_text = '<answer> %s <context> %s ' % (answer, context)
            target = str(target)

            tokenized_inputs = self.tokenizer.batch_encode_plus(
                [input_text],
                max_length=self.max_len_input,
                padding='max_length',
                truncation=True,
                return_tensors='pt'
            )

            tokenized_targets = self.tokenizer.batch_encode_plus(
                [target],
                max_length=self.max_len_output,
                padding='max_length',
                truncation=True,
                return_tensors='pt'
            )

            self.inputs.append(tokenized_inputs)
            self.targets.append(tokenized_targets)

### Define the model class


In [8]:
class T5FineTuner(pl.LightningModule):
    def __init__(self, model, tokenizer, args):
        super().__init__()
        self.model = model
        self.tokenizer = tokenizer
        self.args = args

    def forward(self, input_ids, attention_mask, labels=None):
        return self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels,
        )

    def training_step(self, batch, batch_idx):
        outputs = self.forward(
            input_ids=batch['source_ids'],
            attention_mask=batch['source_mask'],
            labels=batch['labels']
        )
        loss = outputs.loss
        self.log('train_loss', loss, on_step=True, on_epoch=True, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        outputs = self.forward(
            input_ids=batch['source_ids'],
            attention_mask=batch['source_mask'],
            labels=batch['labels']
        )
        loss = outputs.loss
        self.log('val_loss', loss, on_step=True, on_epoch=True, prog_bar=True)
        return loss

    def train_dataloader(self):
        return DataLoader(train_dataset, batch_size=self.args['batch_size'], num_workers=self.args['num_workers'])

    def val_dataloader(self):
        return DataLoader(validation_dataset, batch_size=self.args['batch_size'], num_workers=self.args['num_workers'])

    def configure_optimizers(self):
        return AdamW(self.parameters(), lr=self.args['learning_rate'], eps=self.args['eps'])
    
    def save_model(self, save_model_path):
        # Ensure the directory exists
        if not os.path.exists(save_model_path):
            os.makedirs(save_model_path)
        # Save the model
        self.model.save_pretrained(save_model_path)
        
    def save_tokenizer(self, save_tokenizer_path):
        # Ensure the directory exists
        if not os.path.exists(save_tokenizer_path):
            os.makedirs(save_tokenizer_path)
        # Save the tokenizer
        self.tokenizer.save_pretrained(save_tokenizer_path)


### Train

In [9]:
start_time = time.time()
pl.seed_everything(99)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)

print('Loading pre-trained model...')
model = T5ForConditionalGeneration.from_pretrained(pretrained_model).to(device)
tokenizer = T5Tokenizer.from_pretrained(pretrained_model, model_max_length=512)
tokenizer.add_special_tokens(
    {'additional_special_tokens': ['<answer>', '<context>']}
)

print('Preparing dataset...')
train_dataset = QGDataset(tokenizer, train_file_path)
validation_dataset = QGDataset(tokenizer, validation_file_path)

print('train_dataset: ', len(train_dataset))
print('validation_dataset: ', len(validation_dataset))

[rank: 0] Seed set to 99


Using device: cuda
Loading pre-trained model...


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Preparing dataset...


100%|██████████| 87599/87599 [01:29<00:00, 980.14it/s] 
100%|██████████| 10570/10570 [00:11<00:00, 912.82it/s]

train_dataset:  87599
validation_dataset:  10570





In [10]:
# torch.set_float32_matmul_precision('high')  # or 'high'

print('Initializing model...')
model = T5FineTuner(model, tokenizer, args)
trainer = pl.Trainer(
    max_epochs=10,
    accelerator='gpu',
    devices=1,
    callbacks=[EarlyStopping(monitor="val_loss")]
)

Initializing model...


/cta/users/serhan.yilmaz/.conda/envs/unsloth_env/lib/python3.10/site-packages/lightning_fabric/plugins/environments/slurm.py:204: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /cta/users/serhan.yilmaz/.local/lib/python3.10/site- ...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [11]:
print('Fine tuning...')
trainer.fit(model)

Fine tuning...


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 737 M 
-----------------------------------------------------
737 M     Trainable params
0         Non-trainable params
737 M     Total params
2,950.672 Total estimated model params size (MB)


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

/cta/users/serhan.yilmaz/.conda/envs/unsloth_env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py:54: Detected KeyboardInterrupt, attempting graceful shutdown...


In [12]:
print('Saving model...')
if not os.path.exists(save_model_path):
    os.makedirs(save_model_path)
if not os.path.exists(save_tokenizer_path):
    os.makedirs(save_tokenizer_path)
model.model.save_pretrained(save_model_path)
tokenizer.save_pretrained(save_tokenizer_path)

end_time = time.time() - start_time
print('Total time: %s hours' % (end_time / 60 / 60))
print('All done.')

Saving model...
Total time: 10.985352538426717 hours
All done.


## Generating Questions

### Sentence Transformers

In [16]:
from sentence_transformers import SentenceTransformer, util

trained_model = 'all-distilroberta-v1'
# trained_model = 'all-roberta-large-v1'
class SentenceEmbeddings:

    def __init__(self):
        self.embedder = SentenceTransformer(trained_model)

    def encode(self, text):
        return self.embedder.encode(text, convert_to_tensor=True)

    def get_most_similar(self, context:str, qa_list:list):
        context_embeddings = self.encode(context)
        top1 = {'idx': 0, 'score': float('-inf')}
        for i in range(len(qa_list)):
            qa_str = qa_list[i]['question'] + ' ' + qa_list[i]['answer']
            qa_embeddings = self.encode(qa_str)
            cos_score = util.pytorch_cos_sim(context_embeddings, qa_embeddings)
            # print(cos_score[0][0], qa_list[i])
            if cos_score[0][0] > top1['score']:
                top1['score'] = cos_score[0][0]
                top1['idx'] = i
        return qa_list[top1['idx']]

### QG-Class

In [17]:

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

trained_model_path = save_model_path
trained_tokenizer_path = save_tokenizer_path

class QuestionGeneration:

    def __init__(self, model_dir=None):
        self.model = T5ForConditionalGeneration.from_pretrained(trained_model_path)
        self.tokenizer = T5Tokenizer.from_pretrained(trained_tokenizer_path)
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model = self.model.to(self.device)
        self.model.eval()

    def generate(self, answer: str, context: str):
        input_text = '<answer> %s <context> %s ' % (answer, context)
        encoding = self.tokenizer.encode_plus(
            input_text,
            return_tensors='pt'
        ).to(self.device)
        input_ids = encoding['input_ids']
        attention_mask = encoding['attention_mask']
        outputs = self.model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            num_beams = 3,
            num_return_sequences = 1
        )
        question_list = []
        for output in outputs:
            question = self.tokenizer.decode(
                output,
                skip_special_tokens=True,
                clean_up_tokenization_spaces=True
            )
            question_list.append({'question': question, 'answer': answer, 'context': context})
        return question_list

In [18]:
context = '''
Serhan has fine-tuned T5 on SQuAD dataset for question generation.
'''
answer_list = ['Serhan', 'SQuAD', 'question generation']

QG = QuestionGeneration()
SE = SentenceEmbeddings()

for answer in answer_list:
    qa_pair_list = QG.generate(answer, context)
    most_similar = SE.get_most_similar(context, qa_pair_list)
    print(most_similar)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


{'question': 'Who has fine-tuned T5 on SQuAD dataset for question generation?', 'answer': 'Serhan', 'context': '\nSerhan has fine-tuned T5 on SQuAD dataset for question generation.\n'}
{'question': 'On what dataset has Serhan fine-tuned T5?', 'answer': 'SQuAD', 'context': '\nSerhan has fine-tuned T5 on SQuAD dataset for question generation.\n'}
{'question': 'What did Serhan fine-tune the T5 on SQuAD dataset for?', 'answer': 'question generation', 'context': '\nSerhan has fine-tuned T5 on SQuAD dataset for question generation.\n'}


In [11]:
context = '''
Serhan has fine-tuned T5 on SQuAD dataset for question generation.
'''
answer_list = ['Serhan', 'SQuAD', 'question generation', 'Serhan']

QG = QuestionGeneration()
SE = SentenceEmbeddings()

for answer in answer_list:
    qa_pair_list = QG.generate(answer, context)
    most_similar = SE.get_most_similar(context, qa_pair_list)
    print(most_similar)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


{'question': 'Who has fine-tuned T5 on SQuAD dataset for question generation?', 'answer': 'Serhan', 'context': '\nSerhan has fine-tuned T5 on SQuAD dataset for question generation.\n'}
{'question': 'On what dataset has Serhan fine-tuned T5?', 'answer': 'SQuAD', 'context': '\nSerhan has fine-tuned T5 on SQuAD dataset for question generation.\n'}
{'question': 'Serhan has fine-tuned T5 on SQuAD dataset for what?', 'answer': 'question generation', 'context': '\nSerhan has fine-tuned T5 on SQuAD dataset for question generation.\n'}
{'question': 'Who has fine-tuned T5 on SQuAD dataset for question generation?', 'answer': 'Serhan', 'context': '\nSerhan has fine-tuned T5 on SQuAD dataset for question generation.\n'}
