# Old Model and Results

In [1]:
from question_generator import QuestionGenerator
from sentence_embeddings import SentenceEmbeddings
from paraphraser_class import Paraphraser

model_path = 'model/'
tokenizer_path = 'tokenizer/'

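def qg(context, answer, question_generator, sentence_embeddings, paraphraser):
    """Generate questions for `answer`, keep the top-ranked ones, and print one paraphrase of each."""
    # Candidate questions for the given answer and context.
    generated_questions = question_generator.generate(answer, context)

    # Pair each candidate with the answer, then let get_most_similar pick the top ones against the context.
    qa_list = [{'question': q, 'answer': answer} for q in generated_questions]
    top_questions = sentence_embeddings.get_most_similar(context, qa_list)

    for qa in top_questions:
        # print("Original Question:", qa['question'])
        best_paraphrase = paraphraser.unique_paraphrase(context, qa['question'])
        print("Generated Question:", best_paraphrase)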

In [2]:
question_generator = QuestionGenerator(trained_model_path=model_path, trained_tokenizer_path=tokenizer_path)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [3]:
sentence_embeddings = SentenceEmbeddings()

In [4]:
paraphraser = Paraphraser()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Sample Run 1

### New Run (Answer Check Implemented: The paraphrases that have the same answer get bonus points)
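
A minimal sketch of how such a bonus could work, not the actual `Paraphraser` code. It assumes "same answer" means that a QA model, asked the paraphrase against the context, returns the given answer. `answer_fn` and `base_score_fn` are hypothetical callables (for example, a wrapper around an extractive QA model and a similarity score), and the bonus size is arbitrary.

```python
from typing import Callable, List, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different spans compare equal."""
    return " ".join(text.lower().split())


def score_paraphrases(
    paraphrases: List[str],
    context: str,
    expected_answer: str,
    answer_fn: Callable[[str, str], str],   # hypothetical: (question, context) -> answer span
    base_score_fn: Callable[[str], float],  # hypothetical: paraphrase -> base score
    bonus: float = 0.5,                     # assumed value, not taken from the notebook
) -> List[Tuple[float, str]]:
    """Rank paraphrases, adding `bonus` when the QA answer matches the expected answer."""
    target = normalize(expected_answer)
    scored = []
    for paraphrase in paraphrases:
        score = base_score_fn(paraphrase)
        # Answer check: does the paraphrase still lead to the expected answer?
        if normalize(answer_fn(paraphrase, context)) == target:
            score += bonus
        scored.append((score, paraphrase))
    # Highest score first.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

With this shape, a paraphrase that drifts away from the answer can still win on its base score, but it needs a clearly better base score to do so.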

In [5]:
context = """
Google LLC is an American multinational corporation and technology company focusing on 
search engine technology, cloud computing, computer software, quantum computing and artificial intelligence (AI).
It has been referred to as "the most powerful company in the world" and is 
one of the world's most valuable brands due to its market dominance, data collection, and technological advantages in the field of AI.
Google's parent company, Alphabet Inc. is one of the five Big Tech companies, alongside Amazon, Apple, Meta, and Microsoft.
"""
answer = "American"

In [6]:
qg(context, answer, question_generator, sentence_embeddings, paraphraser)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Generated Question: What country of origin is Google LLC?


Generated Question: What is Google's national identity?
Generated Question: What is the nationality or domicile of Google LLC?


### Old Run (Don't Run Again - Answer Check was not implemented yet)

In [6]:
qg(context, answer, question_generator, sentence_embeddings, paraphraser)

Generated Question: What country of origin is Google LLC?


Generated Question: What country of origin is Google LLC from?
Generated Question: What is the national identity of Google LLC?


## Sample Run 2

### What if the answer was less suitable for question generation, something less like an answer?

In [7]:
context = """
Wikis are enabled by wiki software, otherwise known as wiki engines. 
A wiki engine, being a form of a content management system, 
differs from other web-based systems such as blog software or static site generators, 
in that the content is created without any defined owner or leader, 
and wikis have little inherent structure, 
allowing structure to emerge according to the needs of the users."""
answer = "wiki engine"

In [8]:
qg(context, answer, question_generator, sentence_embeddings, paraphraser)

Generated Question: Unlike other web-based systems, the content in this system is generated without a predetermined structure.


Generated Question: Wiki software is also known by what alias?
Generated Question: Unlike other web-based systems, the content is created without a predetermined owner.


### What if the answer was unrelated to the context? (Hallucination)

In [9]:
context = """
Wikis are enabled by wiki software, otherwise known as wiki engines. 
A wiki engine, being a form of a content management system, 
differs from other web-based systems such as blog software or static site generators, 
in that the content is created without any defined owner or leader, 
and wikis have little inherent structure, 
allowing structure to emerge according to the needs of the users."""
answer = "American"

In [10]:
qg(context, answer, question_generator, sentence_embeddings, paraphraser)

Generated Question: What is the nationality of the people who developed the wiki software?


Generated Question: What is the nationality of the creators of the wiki software?
Generated Question: Which nationality does the average Wiki software user belong to?


### What if the answer was something even less like an answer?

In [12]:
context = """
Wikis are enabled by wiki software, otherwise known as wiki engines. 
A wiki engine, being a form of a content management system, 
differs from other web-based systems such as blog software or static site generators, 
in that the content is created without any defined owner or leader, 
and wikis have little inherent structure, 
allowing structure to emerge according to the needs of the users."""
answer = "emerge"

In [13]:
qg(context, answer, question_generator, sentence_embeddings, paraphraser)

Generated Question: Wikis adapt their structure to accommodate the needs of their users.


Generated Question: Wikis tailor their layout to accommodate the demands of their users.
Generated Question: What capabilities does a wiki engine grant to structured content?


* The 3rd generated question is reasonably good.
* The first 2 are statements, not questions.
* The first example in Sample Run 2 has the same problem.

I should inspect the generated questions and the returned paraphrases to find the cause.

It could be Llama3 hallucinating, or it could be T5 not generating good enough questions.
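
To tell the two apart, a variant of `qg` could return each generated question next to its paraphrase. This is only a sketch that reuses the calls already present in `qg`; the name `qg_inspect` and the printed labels are assumptions (T5 as the question generator, Llama3 as the paraphraser).

```python
def qg_inspect(context, answer, question_generator, sentence_embeddings, paraphraser):
    """Like qg, but show and return (original question, paraphrase) pairs."""
    generated_questions = question_generator.generate(answer, context)

    qa_list = [{'question': q, 'answer': answer} for q in generated_questions]
    top_questions = sentence_embeddings.get_most_similar(context, qa_list)

    pairs = []
    for qa in top_questions:
        original = qa['question']
        paraphrase = paraphraser.unique_paraphrase(context, original)
        pairs.append((original, paraphrase))
        # If the original already lacks a question mark, the generator is the likely culprit;
        # if only the paraphrase lacks it, the paraphraser is.
        print("T5 question:      ", original)
        print("Llama3 paraphrase:", paraphrase)
        print("Original ends with '?':", original.strip().endswith('?'))
        print("Paraphrase ends with '?':", paraphrase.strip().endswith('?'))
    return pairs
```

Calling `qg_inspect(context, answer, question_generator, sentence_embeddings, paraphraser)` with the same objects as above would show at which stage a statement first appears.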

A newer T5 is training on an A100 now and should be done in about 23 hours.

If T5 turns out to be the cause, the dataset might need to be enhanced.

# New Model and Results

In [1]:
from question_generator import QuestionGenerator
from sentence_embeddings import SentenceEmbeddings
from paraphraser_class import Paraphraser

model_path = 'new_model/model/'
tokenizer_path = 'new_model/tokenizer/'

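def qg(context, answer, question_generator, sentence_embeddings, paraphraser):
    """Generate questions for `answer`, keep the top-ranked ones, and print one paraphrase of each."""
    # Candidate questions for the given answer and context.
    generated_questions = question_generator.generate(answer, context)

    # Pair each candidate with the answer, then let get_most_similar pick the top ones against the context.
    qa_list = [{'question': q, 'answer': answer} for q in generated_questions]
    top_questions = sentence_embeddings.get_most_similar(context, qa_list)

    for qa in top_questions:
        # print("Original Question:", qa['question'])
        best_paraphrase = paraphraser.unique_paraphrase(context, qa['question'])
        print("Generated Question:", best_paraphrase)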

In [2]:
question_generator = QuestionGenerator(trained_model_path=model_path, trained_tokenizer_path=tokenizer_path)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [3]:
sentence_embeddings = SentenceEmbeddings()

In [4]:
paraphraser = Paraphraser()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Sample Run 1

In [5]:
context = """
Google LLC is an American multinational corporation and technology company focusing on 
search engine technology, cloud computing, computer software, quantum computing and artificial intelligence (AI).
It has been referred to as "the most powerful company in the world" and is 
one of the world's most valuable brands due to its market dominance, data collection, and technological advantages in the field of AI.
Google's parent company, Alphabet Inc. is one of the five Big Tech companies, alongside Amazon, Apple, Meta, and Microsoft.
"""
answer = "American"

In [6]:
qg(context, answer, question_generator, sentence_embeddings, paraphraser)

Generated Question: What is Google LLC's national origin?


Generated Question: What is the national identity of Google LLC?
Generated Question: What is the national origin of Google's parent company?


## Sample Run 2

### What if the answer was less suitable for question generation, something less like an answer?

In [7]:
context = """
Wikis are enabled by wiki software, otherwise known as wiki engines. 
A wiki engine, being a form of a content management system, 
differs from other web-based systems such as blog software or static site generators, 
in that the content is created without any defined owner or leader, 
and wikis have little inherent structure, 
allowing structure to emerge according to the needs of the users."""
answer = "wiki engine"

In [8]:
qg(context, answer, question_generator, sentence_embeddings, paraphraser)

Generated Question: Wiki software is commonly known as what?


Generated Question: What is wiki software also known as?
Generated Question: What are wiki software applications also known as?


### What if the answer was unrelated to the context? (Hallucination)

In [9]:
context = """
Wikis are enabled by wiki software, otherwise known as wiki engines. 
A wiki engine, being a form of a content management system, 
differs from other web-based systems such as blog software or static site generators, 
in that the content is created without any defined owner or leader, 
and wikis have little inherent structure, 
allowing structure to emerge according to the needs of the users."""
answer = "American"

In [10]:
qg(context, answer, question_generator, sentence_embeddings, paraphraser)

Generated Question: Who is behind the creation of a wiki?


Generated Question: What ethnicity is associated with the concept of a wiki?
Generated Question: What is the typical nationality of a person who creates and edits content on Wikipedia?


Hahahhahahaha... what?
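
A cheap guard for this case, sketched under assumptions: before generating anything, check whether the answer actually occurs in the context. `answer_in_context` is a hypothetical helper, not part of the pipeline above, and a plain string match will miss answers that are only implied by the context.

```python
import re


def answer_in_context(answer: str, context: str) -> bool:
    """Return True if the answer occurs in the context as a whole word or phrase.

    Matching is case-insensitive, and any run of whitespace (including line
    breaks inside the context) counts as a single separator.
    """
    words = answer.split()
    if not words:
        return False
    # (?<!\w) and (?!\w) keep "emerge" from matching inside "emergency".
    pattern = r"(?<!\w)" + r"\s+".join(re.escape(w) for w in words) + r"(?!\w)"
    return re.search(pattern, context, flags=re.IGNORECASE) is not None
```

A caller could skip generation, or at least flag the results, when `answer_in_context(answer, context)` is false.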

### What if the answer was something even less like an answer?

In [11]:
context = """
Wikis are enabled by wiki software, otherwise known as wiki engines. 
A wiki engine, being a form of a content management system, 
differs from other web-based systems such as blog software or static site generators, 
in that the content is created without any defined owner or leader, 
and wikis have little inherent structure, 
allowing structure to emerge according to the needs of the users."""
answer = "emerge"

In [12]:
qg(context, answer, question_generator, sentence_embeddings, paraphraser)

Generated Question: Wikis have no built-in structure, allowing them to evolve according to the needs of their contributors.


Generated Question: Wikis' lack of inherent structure allows them to be constantly updated and revised.
Generated Question: With no inherent structure, wikis can freely change and develop.


In [13]:
qg(context, answer, question_generator, sentence_embeddings, paraphraser)

Generated Question: Wikis have minimal built-in structure, allowing them to evolve according to the needs of their community.


Generated Question: Wikis' lack of inherent structure enables them to evolve and transform over time.
Generated Question: Wikis are characterized by their lack of predetermined structure, which allows it to and what?


* Question generation has improved, arguably because the new model performs better.
* The algorithm still cannot guarantee that every output is a question, especially when the given answer is unlikely to be the answer to anything. That is somewhat understandable.
* I will train T5 one more time, more aggressively and without early stopping, and look at those results in the next section.
* It still bugs me that I parse the output of the llama3 paraphrasing engine with regex, which is unreliable.
* If the non-question outputs persist, an extra term could be added to the scoring function that gives more points to sentences ending with a question mark, whether or not they yield the same answer in the QA check.
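
The question-mark bonus from the last bullet could look roughly like this. It is only a sketch: `question_mark_bonus`, `rank_with_question_bonus`, the base scores, and the bonus value of 0.25 are all assumptions, not the existing scoring function.

```python
from typing import List, Tuple


def question_mark_bonus(sentence: str, bonus: float = 0.25) -> float:
    """Return `bonus` if the sentence ends with a question mark, else 0.0."""
    return bonus if sentence.rstrip().endswith("?") else 0.0


def rank_with_question_bonus(
    candidates: List[str],
    base_scores: List[float],  # hypothetical scores from the existing scoring function
    bonus: float = 0.25,       # assumed value; would need tuning
) -> List[Tuple[float, str]]:
    """Add the question-mark bonus to each base score and sort best-first."""
    scored = [
        (score + question_mark_bonus(text, bonus), text)
        for text, score in zip(candidates, base_scores)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Because the bonus is independent of the answer check, a statement can still outrank a question if its base score is higher by more than `bonus`.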

# Newest Model and Results

In [None]:
from question_generator import QuestionGenerator
from sentence_embeddings import SentenceEmbeddings
from paraphraser_class import Paraphraser

model_path = 'newest_model/model/'
tokenizer_path = 'newest_model/tokenizer/'

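def qg(context, answer, question_generator, sentence_embeddings, paraphraser):
    """Generate questions for `answer`, keep the top-ranked ones, and print one paraphrase of each."""
    # Candidate questions for the given answer and context.
    generated_questions = question_generator.generate(answer, context)

    # Pair each candidate with the answer, then let get_most_similar pick the top ones against the context.
    qa_list = [{'question': q, 'answer': answer} for q in generated_questions]
    top_questions = sentence_embeddings.get_most_similar(context, qa_list)

    for qa in top_questions:
        # print("Original Question:", qa['question'])
        best_paraphrase = paraphraser.unique_paraphrase(context, qa['question'])
        print("Generated Question:", best_paraphrase)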

In [None]:
question_generator = QuestionGenerator(trained_model_path=model_path, trained_tokenizer_path=tokenizer_path)

In [None]:
sentence_embeddings = SentenceEmbeddings()

In [None]:
paraphraser = Paraphraser()

## Sample Run 1

In [None]:
context = """
Google LLC is an American multinational corporation and technology company focusing on 
search engine technology, cloud computing, computer software, quantum computing and artificial intelligence (AI).
It has been referred to as "the most powerful company in the world" and is 
one of the world's most valuable brands due to its market dominance, data collection, and technological advantages in the field of AI.
Google's parent company, Alphabet Inc. is one of the five Big Tech companies, alongside Amazon, Apple, Meta, and Microsoft.
"""
answer = "American"

In [None]:
qg(context, answer, question_generator, sentence_embeddings, paraphraser)

## Sample Run 2

### What if the answer was less suitable for question generation, something less like an answer?

In [None]:
context = """
Wikis are enabled by wiki software, otherwise known as wiki engines. 
A wiki engine, being a form of a content management system, 
differs from other web-based systems such as blog software or static site generators, 
in that the content is created without any defined owner or leader, 
and wikis have little inherent structure, 
allowing structure to emerge according to the needs of the users."""
answer = "wiki engine"

In [None]:
qg(context, answer, question_generator, sentence_embeddings, paraphraser)

### What if the answer was unrelated to the context? (Hallucination)

In [None]:
context = """
Wikis are enabled by wiki software, otherwise known as wiki engines. 
A wiki engine, being a form of a content management system, 
differs from other web-based systems such as blog software or static site generators, 
in that the content is created without any defined owner or leader, 
and wikis have little inherent structure, 
allowing structure to emerge according to the needs of the users."""
answer = "American"

In [None]:
qg(context, answer, question_generator, sentence_embeddings, paraphraser)

Hahahhahahaha... what?

### What if the answer was something even less like an answer?

In [None]:
context = """
Wikis are enabled by wiki software, otherwise known as wiki engines. 
A wiki engine, being a form of a content management system, 
differs from other web-based systems such as blog software or static site generators, 
in that the content is created without any defined owner or leader, 
and wikis have little inherent structure, 
allowing structure to emerge according to the needs of the users."""
answer = "emerge"

In [None]:
qg(context, answer, question_generator, sentence_embeddings, paraphraser)

In [None]:
qg(context, answer, question_generator, sentence_embeddings, paraphraser)

* Question generation has improved, arguably because of better model performance.
* The algorithm now seems to create questions only, but I worry about hallucinations.
* The questions are well formed, but in the unlikely-answer case above they do not lead to the given answer.
* One might also ask whether any question exists that yields the given answer. If not, the output can qualify as a hallucination.
* It still bugs me that I parse the output of the llama3 paraphrasing engine with regex, which is unreliable.
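
One possible way around the regex worry in the last bullet, sketched under the assumption that the Llama3 prompt can be changed to ask for a JSON array of strings. `parse_paraphrases` is a hypothetical helper, not the current parsing code; it tries JSON first and only falls back to matching numbered or bulleted lines.

```python
import json
import re
from typing import List


def parse_paraphrases(raw: str) -> List[str]:
    """Extract paraphrases from raw model output.

    Preferred path: the text between the first '[' and the last ']' parses as
    a JSON array of strings. Fallback: lines that start with a list marker
    such as '1.', '2)', '-' or '*'.
    """
    start, end = raw.find("["), raw.rfind("]")
    if start != -1 and end > start:
        try:
            items = json.loads(raw[start:end + 1])
            if isinstance(items, list):
                return [s.strip() for s in items if isinstance(s, str) and s.strip()]
        except json.JSONDecodeError:
            pass  # Not valid JSON; fall through to the line-based fallback.

    paraphrases = []
    for line in raw.splitlines():
        match = re.match(r"^\s*(?:\d+[.)]|[-*])\s+(.*\S)\s*$", line)
        if match:
            paraphrases.append(match.group(1))
    return paraphrases
```

The fallback deliberately ignores lines without a list marker, so chatty preambles are dropped rather than mistaken for paraphrases. Square brackets elsewhere in the output can still defeat the JSON path, in which case the fallback takes over.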