# Question Generation

This bare-bones tutorial shows what is possible with the QuestionGenerator node and its pipelines, which automatically
generate questions that the question generation model thinks a given document can answer.

### Prepare environment

#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.  
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/main/docs/img/colab_gpu_runtime.jpg">

In [None]:
# Install the latest release of Haystack in your own environment
!pip install farm-haystack

# Alternatively, install the latest main branch of Haystack (with the Colab extras)
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting farm-haystack
  Downloading farm_haystack-1.11.1-py3-none-any.whl (588 kB)
Collecting posthog
  Downloading posthog-2.2.0-py2.py3-none-any.whl (33 kB)
Collecting huggingface-hub>=0.5.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting elasticsearch<8,>=7.7
  Downloading elasticsearch-7.17.8-py2.py3-none-any.whl (385 kB)
Collecting torch<1.13,>1.9
  Downloading torch-1.12.1-cp38-cp38-manylinux1_x86_64.whl (776.3 MB)
Collecting transformers==4.21.2
  Downloading transformers-4.21.2-py3-none-any.whl (4.7 MB)
Collecting mmh3
  Downloading mmh3-3.

## Logging

We configure how logging messages are displayed and which log level is used before importing Haystack.
Example log message:
`INFO - haystack.utils.preprocessing -  Converting data/tutorial1/218_Olenna_Tyrell.txt`
The default log level in `basicConfig` is WARNING, so the explicit `level` parameter is not strictly necessary, but it makes the level easy to change:

In [None]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
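
To see what the format string above produces, the following minimal sketch attaches a handler with the same format to an in-memory stream and emits a single record. The logger name `demo.preprocessing` and the message are invented for illustration; they are not real Haystack output.

```python
import io
import logging

# Same format string as in the logging cell above
LOG_FORMAT = "%(levelname)s - %(name)s -  %(message)s"


def render_log_line(logger_name: str, level: int, message: str) -> str:
    """Format one log record with LOG_FORMAT and return it as a string."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))

    logger = logging.getLogger(logger_name)
    logger.setLevel(level)
    logger.propagate = False  # keep the demo record out of the root logger
    logger.addHandler(handler)
    try:
        logger.log(level, message)
    finally:
        logger.removeHandler(handler)
    return stream.getvalue().strip()


# Hypothetical logger name and message, used only to show the layout
line = render_log_line("demo.preprocessing", logging.INFO, "Converting example.txt")
print(line)
```

The printed line follows the `LEVEL - logger name -  message` layout of the example message shown earlier.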

In [None]:
# Imports needed to run this notebook

from pprint import pprint
from tqdm.auto import tqdm
from haystack.nodes import QuestionGenerator, BM25Retriever, FARMReader
from haystack.schema import Document, Answer, SpeechAnswer
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.pipelines import (
    QuestionGenerationPipeline,
    RetrieverQuestionGenerationPipeline,
    QuestionAnswerGenerationPipeline,
)
from haystack.utils import launch_es, print_questions
import nltk.data
import pandas as pd


Let's start an Elasticsearch instance with one of the options below:

In [None]:
# Option 2: In Colab / No Docker environments: Start Elasticsearch from source
# Uncomment the next three lines on the first run to download and unpack Elasticsearch
# ! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
# ! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
# ! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT

# Run Elasticsearch as the daemon user (uid 1)
es_server = Popen(
    ["elasticsearch-7.9.2/bin/elasticsearch"], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1)
)
# Wait until Elasticsearch has started
! sleep 30

In [None]:

def save_q(results: dict, doc_num=0):
    """
    Utility to collect the output of a question/answer generating pipeline
    as a list of row dicts, one per generated question/answer pair.
    `doc_num` is currently unused.
    """
    if "queries" not in results or "answers" not in results:
        raise ValueError(
            "This object does not seem to be the output "
            "of a question/answer generating pipeline: it is missing "
            f"'queries' and/or 'answers' and contains only: {results.keys()}. "
            "Try `print_answers` or `print_documents`."
        )

    qa_list = []
    for i, (query, answers) in enumerate(zip(results["queries"], results["answers"])):
        for answer in answers:
            # Verify that the pairs contain Answer objects under the `answers` key
            if not isinstance(answer, Answer):
                raise ValueError(
                    "This results object does not contain `Answer` objects under the `answers` "
                    "key of the generated question/answer pairs. "
                    "Please make sure the last node of your pipeline makes proper use of the "
                    "new Haystack primitive objects, and if you're using Haystack nodes/pipelines only, "
                    "please report this as a bug."
                )
            qa_list.append(
                {"Draft_Intent": f"aci_basic_info_{i}", "Questions": query, "Answers": str(answer.answer)}
            )

    return qa_list
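
To make the expected input shape concrete, here is a standalone sketch of the same row-building loop that runs without Haystack. `StubAnswer` is a hypothetical stand-in for `haystack.schema.Answer` that models only the `.answer` attribute, and `flatten_pairs` is an illustrative name, not a Haystack function. The sample questions and answers are taken from the pipeline output shown later in this notebook.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StubAnswer:
    """Hypothetical stand-in for haystack.schema.Answer; only `.answer` is modelled."""

    answer: str


def flatten_pairs(results: dict) -> List[Dict[str, str]]:
    """Build one row per (question, answer) pair, like save_q but without the type check."""
    rows = []
    for i, (query, answers) in enumerate(zip(results["queries"], results["answers"])):
        for answer in answers:
            rows.append(
                {"Draft_Intent": f"aci_basic_info_{i}", "Questions": query, "Answers": str(answer.answer)}
            )
    return rows


# Assumed result layout: parallel lists of queries and per-query answer lists
example_results = {
    "queries": ["How long has ACI been in business?", "ACI was awarded EMS 14001 in what year?"],
    "answers": [[StubAnswer("almost three decades")], [StubAnswer("2000")]],
}
rows = flatten_pairs(example_results)
for row in rows:
    print(row)
```

Note that the index `i` restarts at 0 for every results object, so labels built this way are unique only within a single document's output.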

In [None]:
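# nltk.download("punkt")  # uncomment on the first run to fetch the sentence tokenizer
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
with open("test.txt") as fp:
    data = fp.read()
# Treat every sentence of the input file as its own document
docs = [{"content": sentence} for sentence in tokenizer.tokenize(data)]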

# Initialize document store and write in the documents
document_store = ElasticsearchDocumentStore()
document_store.delete_documents()
document_store.write_documents(docs)

# Initialize Question Generator
question_generator = QuestionGenerator()

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
Using sep_token, but it is not set yet.
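
The cell above turns every sentence into its own document dict. The sketch below shows the same idea without NLTK, using a naive regular-expression splitter. This is an assumption-laden stand-in, not the punkt tokenizer: it will mis-split abbreviations such as "e.g." that punkt handles. The helper names `split_sentences` and `to_documents` are illustrative only.

```python
import re
from typing import Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace (not punkt)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def to_documents(text: str) -> List[Dict[str, str]]:
    """Wrap each sentence in the {"content": ...} dict layout used by write_documents above."""
    return [{"content": sentence} for sentence in split_sentences(text)]


sample = (
    "Our pharmaceuticals are exported to 30 countries of 4 continents. "
    "ACI also has Product Marketing Approval from 15 countries."
)
docs_demo = to_documents(sample)
for doc in docs_demo:
    print(doc)
```

Splitting by sentence keeps each document short, which means every generated question is grounded in a single sentence.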


Let's initialize the reader and the question answer generation pipeline, then collect question/answer pairs for every document:

In [None]:
reader = FARMReader("deepset/roberta-base-squad2")
qag_pipeline = QuestionAnswerGenerationPipeline(question_generator, reader)

# Collect rows in a plain list; DataFrame.append was removed in pandas 2.0
qa_rows = []
for idx, document in enumerate(tqdm(document_store)):
    print(f"\n * Generating questions for document {idx}: {document.content[:100]}...\n")
    qa = qag_pipeline.run(documents=[document])
    qa_rows.extend(save_q(qa))

qa_df = pd.DataFrame(qa_rows)
# Give every row a unique, sequential intent label
qa_df["Draft_Intent"] = [f"aci_basic_info_{i}" for i in qa_df.index]
qa_df.to_csv("test.csv")




INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)
INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


 * Generating questions for document 0: With almost three decades of partnering life and engendering hope, ACI is one of the top pharmaceuti...
 * Generating questions for document 1: As a progressive and forward-thinking company, ACI Pharma is dedicated to improve the health of peop...
 * Generating questions for document 2: ACI introduced the concept of quality management system by being the first company in Bangladesh to ...
 * Generating questions for document 3: Aligned with the concept that a pharmaceutical must ensure effective management of environment, ACI ...
 * Generating questions for document 4: ACI maintains a congenial and supportive relationship with the healthcare community of Bangladesh, w...
 * Generating questions for document 5: Our endeavor is to ensure the availability of world class, quality medicines across Bangladesh and a...
 * Generating questions for document 6: Being a successor of world’s renowned pharmaceutical company ICI, we take pride of its rich heritage...
 * Generating questions for document 7: Since its introduction in 1992 ACI continues to remain committed to developing first-in-class and be...
 * Generating questions for document 8: Our strength is our ability to excel in developing generics and technologically complex products thr...
 * Generating questions for document 9: At present, ACI formulates & markets a comprehensive range of more than 550 SKUs covering all major ...
 * Generating questions for document 10: ACI has introduced sophisticated manufacturing technologies like Biosimilar (biotech) products, insu...
 * Generating questions for document 11: First company in Bangladesh to achieve ISO 9001 certification in 1995....
 * Generating questions for document 12: World class manufacturing facilities and strict compliance to cGMP & ethics have earned the company ...
 * Generating questions for document 13: ACI also markets & distributes vaccine product (rabies vaccine) of world’s renowned pharmaceutical c...
 * Generating questions for document 14: ACI is enriched with GMP certification from Kenya, Ivory Coast, and Philippines....
 * Generating questions for document 15: Our pharmaceuticals are exported to 30 countries of 4 continents....
 * Generating questions for document 16: ACI also has Product Marketing Approval from 15 countries....
 * Generating questions for document 17: ACI is recognized by STC (Save the Children) audit with concluding remarks ‘it can be concluded that...
 * Generating questions for document 18: It is patients & physicians who inspireusto move forward....
 * Generating questions for document 19: We have created & captured value for them through cutting-edge chemistry at work, more innovation, a...
 * Generating questions for document 20: Our endeavor is to ensure the access of all the ailing human being across the globe to high quality ...
 * Generating questions for document 21: We are highly caring to the constantly evolving unmet medical needs of patients & their families....

In [None]:
# Re-assign sequential intent labels so that every row has a unique identifier
qa_df["Draft_Intent"] = [f"aci_basic_info_{i}" for i in qa_df.index]

In [None]:
qa_df.to_csv("test.csv", index=False)
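
The export step can also be sketched with the standard library alone, which makes the resulting file layout easy to inspect. The snippet below is an illustration rather than what the notebook runs: `rows_to_csv` is a hypothetical helper, and the sample rows reuse question/answer pairs from the pipeline output in this notebook. It relabels rows sequentially because labels assigned per document repeat across documents.

```python
import csv
import io
from typing import Dict, List

FIELDNAMES = ["Draft_Intent", "Questions", "Answers"]


def rows_to_csv(rows: List[Dict[str, str]]) -> str:
    """Relabel rows sequentially and serialise them to CSV text without an index column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for i, row in enumerate(rows):
        # Overwrite the per-document label with a globally unique one
        writer.writerow({**row, "Draft_Intent": f"aci_basic_info_{i}"})
    return buffer.getvalue()


sample_rows = [
    {"Draft_Intent": "aci_basic_info_0", "Questions": "How long has ACI been in business?", "Answers": "almost three decades"},
    {"Draft_Intent": "aci_basic_info_1", "Questions": "How many people does ACI employ in Bangladesh?", "Answers": "more than 5,000"},
    {"Draft_Intent": "aci_basic_info_0", "Questions": "ACI complies with what standard?", "Answers": "environment management policy"},
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

Answers that contain a comma, such as "more than 5,000", are quoted by the writer so the three-column layout stays intact.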

## Question Generation Pipeline

The most basic version of a question generator pipeline takes a document as input and outputs generated questions
that the document can answer.

In [None]:
question_generation_pipeline = QuestionGenerationPipeline(question_generator)
for idx, document in enumerate(document_store):
    print(f"\n * Generating questions for document {idx}: {document.content[:100]}...\n")
    result = question_generation_pipeline.run(documents=[document])
    # Print the generated questions for this document
    print_questions(result)


 * Generating questions for document 0: ACI introduced the concept of quality management system by being the first company in Bangladesh to ...


Generated questions:
 - In what year did ACI become the first company in Bangladesh to achieve ISO 9001 certification?
 - ACI follows a policy of continuous improvement in all its operations?

 * Generating questions for document 1: With almost three decades of partnering life and engendering hope, ACI is one of the top pharmaceuti...


Generated questions:
 - How long has ACI been in business?
 - How many people does ACI employ in Bangladesh?

 * Generating questions for document 2: Aligned with the concept that a pharmaceutical must ensure effective management of environment, ACI ...


Generated questions:
 - ACI complies with what standard?
 - ACI was awarded EMS 14001 in what year?

 * Generating questions for document 3: ACI maintains a congenial and supportive relationship with the healthcare community of Bangladesh, w...


Generated q
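
If you want the generated questions as a plain list instead of printed text, a small helper can flatten the result. The sketch below assumes the result dict has a `generated_questions` key whose entries each hold a `questions` list, which is the layout `print_questions` reads in Haystack 1.x; check this against your installed version. `collect_questions` and `mock_result` are illustrative names, and the mock reuses questions from the output above.

```python
from typing import List


def collect_questions(result: dict) -> List[str]:
    """Flatten all questions from a (assumed) question generation result into one list."""
    questions: List[str] = []
    for entry in result.get("generated_questions", []):
        questions.extend(entry.get("questions", []))
    return questions


# Mock result mirroring the assumed layout; not produced by a real pipeline run
mock_result = {
    "generated_questions": [
        {
            "questions": [
                "How long has ACI been in business?",
                "How many people does ACI employ in Bangladesh?",
            ]
        }
    ]
}
all_questions = collect_questions(mock_result)
print(all_questions)
```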

## Retriever Question Generation Pipeline

This pipeline takes a query as input. It retrieves relevant documents and then generates questions based on these.

In [None]:
retriever = BM25Retriever(document_store=document_store)
rqg_pipeline = RetrieverQuestionGenerationPipeline(retriever, question_generator)

# Use a query that matches the documents indexed above
query = "ISO 9001 certification"
print(f"\n * Generating questions for documents matching the query '{query}'\n")
result = rqg_pipeline.run(query=query)
print_questions(result)
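
The two-stage flow (retrieve first, then generate questions only for the retrieved documents) can be illustrated with toy stand-ins. In the sketch below, `retrieve` ranks documents by simple term overlap, which is deliberately not BM25, and `stub_generate` is a placeholder for a real question generation model. Both names, and `retrieve_then_generate`, are hypothetical.

```python
from typing import Callable, Dict, List


def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Rank documents by the number of lowercase terms shared with the query (not BM25)."""
    query_terms = set(query.lower().split())
    scored = [(len(query_terms & set(doc.lower().split())), doc) for doc in documents]
    matching = [pair for pair in scored if pair[0] > 0]
    ranked = sorted(matching, key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]


def retrieve_then_generate(
    query: str, documents: List[str], generate: Callable[[str], List[str]]
) -> Dict[str, List[str]]:
    """Generate questions only for the documents the retriever returns."""
    return {doc: generate(doc) for doc in retrieve(query, documents)}


def stub_generate(document: str) -> List[str]:
    """Placeholder generator; a real model would produce actual questions here."""
    return [f"(question about: {document[:30]}...)"]


corpus = [
    "First company in Bangladesh to achieve ISO 9001 certification in 1995.",
    "Our pharmaceuticals are exported to 30 countries of 4 continents.",
    "ACI also has Product Marketing Approval from 15 countries.",
]
generated = retrieve_then_generate("ISO 9001 certification", corpus, stub_generate)
print(generated)
```

Only the first document shares terms with the query, so it is the only one passed on to the generator.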

## Question Answer Generation Pipeline

This pipeline takes a document as input, generates questions on it, and attempts to answer these questions using
a Reader model.

In [None]:
reader = FARMReader("deepset/roberta-base-squad2")
qag_pipeline = QuestionAnswerGenerationPipeline(question_generator, reader)
for idx, document in enumerate(tqdm(document_store)):

    print(f"\n * Generating questions and answers for document {idx}: {document.content[:100]}...\n")
    result = qag_pipeline.run(documents=[document])
    print_questions(result)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)
INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


 * Generating questions and answers for document 0: ACI introduced the concept of quality management system by being the first company in Bangladesh to ...

Generated pairs:
 - Q: In what year did ACI become the first company in Bangladesh to achieve ISO 9001 certification?
      A: 1995
 - Q: ACI follows a policy of continuous improvement in all its operations?
      A: quality management system by being the first company in Bangladesh

 * Generating questions and answers for document 1: With almost three decades of partnering life and engendering hope, ACI is one of the top pharmaceuti...

Generated pairs:
 - Q: How long has ACI been in business?
      A: almost three decades
 - Q: How many people does ACI employ in Bangladesh?
      A: more than 5,000

 * Generating questions and answers for document 2: Aligned with the concept that a pharmaceutical must ensure effective management of environment, ACI ...

Generated pairs:
 - Q: ACI complies with what standard?
      A: environment management policy
 - Q: ACI was awarded EMS 14001 in what year?
      A: 2000

 * Generating questions and answers for document 3: ACI maintains a congenial and supportive relationship with the healthcare community of Bangladesh, w...

Generated pairs:
 - Q: ACI maintains a congenial and supportive relationship with the healthcare community of what country?
      A: Bangladesh
 - Q: ACI believes that business excellence can only be achieved through pursuit of what?
      A: quality

## Translated Question Answer Generation Pipeline
Trained models for Question Answer Generation are rarely available in languages other than English. Haystack
provides a workaround by machine-translating a pipeline's inputs and outputs with the
`TranslationWrapperPipeline`. The following example generates German questions and answers from a German text
document using an English model for Question Answer Generation.

In [None]:
# Fill the document store with a German document.
text1 = "Python ist eine interpretierte Hochsprachenprogrammiersprache für allgemeine Zwecke. Sie wurde von Guido van Rossum entwickelt und 1991 erstmals veröffentlicht. Die Design-Philosophie von Python legt den Schwerpunkt auf die Lesbarkeit des Codes und die Verwendung von viel Leerraum (Whitespace)."
docs = [{"content": text1}]
document_store.delete_documents()
document_store.write_documents(docs)

# Load machine translation models
from haystack.nodes import TransformersTranslator

in_translator = TransformersTranslator(model_name_or_path="Helsinki-NLP/opus-mt-de-en")
out_translator = TransformersTranslator(model_name_or_path="Helsinki-NLP/opus-mt-en-de")

# Wrap the previously defined QuestionAnswerGenerationPipeline
from haystack.pipelines import TranslationWrapperPipeline

pipeline_with_translation = TranslationWrapperPipeline(
    input_translator=in_translator, output_translator=out_translator, pipeline=qag_pipeline
)

for idx, document in enumerate(tqdm(document_store)):
    print(f"\n * Generating questions and answers for document {idx}: {document.content[:100]}...\n")
    result = pipeline_with_translation.run(documents=[document])
    print_questions(result)
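
The wrapper pattern itself is simple: translate the input, run the inner pipeline in English, then translate the output back. The sketch below illustrates that data flow with lookup-table "translators" and a stub pipeline. All names here (`wrap_with_translation`, `stub_qag_pipeline`, the lookup tables) are hypothetical, and this is not how `TranslationWrapperPipeline` is implemented internally.

```python
from typing import Callable, Dict, List

# Toy lookup tables standing in for real translation models
DE_TO_EN = {"Python wurde von Guido van Rossum entwickelt.": "Python was developed by Guido van Rossum."}
EN_TO_DE = {"Who developed Python?": "Wer hat Python entwickelt?"}

seen_by_pipeline: List[str] = []


def stub_qag_pipeline(documents: List[str]) -> Dict[str, List[str]]:
    """Stand-in for an English question answer generation pipeline."""
    seen_by_pipeline.extend(documents)  # record what the inner pipeline receives
    return {"queries": ["Who developed Python?"], "answers": ["Guido van Rossum"]}


def wrap_with_translation(
    pipeline: Callable[[List[str]], Dict[str, List[str]]],
    translate_in: Callable[[str], str],
    translate_out: Callable[[str], str],
) -> Callable[[List[str]], Dict[str, List[str]]]:
    """Return a function that translates inputs, runs the pipeline, and translates outputs."""

    def run(documents: List[str]) -> Dict[str, List[str]]:
        translated_docs = [translate_in(doc) for doc in documents]
        result = pipeline(translated_docs)
        return {key: [translate_out(value) for value in values] for key, values in result.items()}

    return run


run_german = wrap_with_translation(
    stub_qag_pipeline,
    translate_in=lambda text: DE_TO_EN.get(text, text),   # unknown text passes through unchanged
    translate_out=lambda text: EN_TO_DE.get(text, text),
)
german_result = run_german(["Python wurde von Guido van Rossum entwickelt."])
print(german_result)
```

The inner pipeline only ever sees English text, which is why an English-only model can serve documents in another language.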

## About us

This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany.

We bring NLP to the industry via open source!
Our focus: Industry specific language models & large scale QA systems.

Some of our other work:
- [German BERT](https://deepset.ai/german-bert)
- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)
- [FARM](https://github.com/deepset-ai/FARM)
