# Question & Answer pairs generator
Generating synthetic question-and-answer (Q&A) pairs from documents is a powerful way to enhance training datasets for large language models (LLMs). This is especially useful for tasks like:
- Building or fine-tuning RAG and retrieval-based systems
- Training models for document comprehension and summarization
- Creating multilingual or domain-specific Q&A datasets without exposing real user data

In this notebook, we’ll show how to use the `DocumentQAGeneration` class from the YData SDK to generate grounded Q&A pairs from documents, folders, or in-memory tables.

#### Prerequisites
Make sure you have installed the YData SDK package:
```
pip install ydata-sdk
```
and that you're registered at https://ydata.ai/register

## Q&A pairs generation from existing documents

In [None]:
# Step 1: Import the required class
from ydata.synthesizers.text.model.qa import DocumentQAGeneration

# Step 2: Initialize the Q&A generator
# Replace `ADD-TOKEN` with your ydata-sdk token.
import os

# The environment variable name is blank in the original notebook; fill in the
# name given in the YData SDK documentation before running this cell.
os.environ[""] = "ADD-TOKEN"

# By default, the generator uses the latest OpenAI model. You can configure it to use
# Anthropic's Claude instead, depending on the type of document you are processing,
# through the `provider` and `model_name` input parameters.
qa_generator = DocumentQAGeneration()
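
Hardcoding a token in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, assuming you would rather read the token from the environment and prompt for it only when it is missing, a small helper could look like the following. The helper name `ensure_token` and the variable name in the usage comment are hypothetical and not part of the YData SDK.

```python
import os
from getpass import getpass


def ensure_token(var_name, prompt=getpass):
    """Return the token stored in `var_name`, prompting once if it is unset.

    `var_name` is whatever environment variable the SDK expects; this helper
    is illustrative only and is not part of the YData SDK.
    """
    token = os.environ.get(var_name, "").strip()
    if not token:
        # Ask interactively without echoing the value to the notebook output.
        token = prompt(f"Enter value for {var_name}: ").strip()
        if not token:
            raise ValueError(f"No token provided for {var_name}")
        os.environ[var_name] = token
    return token


# Hypothetical usage (replace the name with the one from the SDK docs):
# ensure_token("NAME_OF_THE_SDK_TOKEN_VARIABLE")
```

Calling the helper before creating `DocumentQAGeneration()` keeps the token out of the saved notebook.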

### Generate Q&A Pairs from a single input document
You can extract question-and-answer pairs from a single supported document, such as `.docx` or `.txt`. The generator processes the file in chunks to ensure completeness and accuracy.

In [None]:
result = qa_generator.generate(
    input_source="/path/to/your/file.docx",
    docs_extension="docx",
    num_qa_pairs=10
)

result
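
The SDK's internal chunking strategy is not described in this notebook. To give an intuition for what "processing a file in chunks" can mean, here is a minimal sketch that splits text into overlapping character windows. The function `chunk_text` and its parameters `max_chars` and `overlap` are hypothetical names chosen for illustration; this is not the SDK's implementation.

```python
def chunk_text(text, max_chars=500, overlap=50):
    """Split `text` into overlapping character windows (illustrative only)."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be in the range [0, max_chars)")

    step = max_chars - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once the current window already reaches the end of the text.
        if start + max_chars >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", max_chars=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which is one common reason chunked pipelines use it.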

### Generate Q&A Pairs from a Folder

You can also generate Q&A pairs from an entire folder of documents. Supported file types include `.docx` and `.txt`; each file is processed individually while keeping question variety and quality consistent across files.

In [None]:
folder_result = qa_generator.generate(
    input_source="/path/to/your/folder/",
    docs_extension="docx",
    num_qa_pairs=5
)

folder_result
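
Before running generation on a folder, it can help to check which files match the extension you plan to pass as `docs_extension`. The sketch below uses only the standard library; the helper `list_documents` is a hypothetical name, and whether the SDK itself matches extensions case-insensitively or descends into subfolders is not stated here, so treat this as a pre-flight check rather than a description of SDK behavior.

```python
from pathlib import Path


def list_documents(folder, extension):
    """Return sorted paths of files directly inside `folder` with the given extension.

    Matching is case-insensitive and non-recursive; both are choices made for
    this sketch, not documented SDK behavior.
    """
    suffix = "." + extension.lower().lstrip(".")
    return sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == suffix
    )


# Example usage:
# for path in list_documents("/path/to/your/folder/", "docx"):
#     print(path.name)
```

An empty result is a quick signal that the folder path or the extension is wrong before any model calls are made.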

### Generate Q&A Pairs from an In-Memory Table

If your documents are already loaded in memory, you can pass them to the generator as a PyArrow table instead of a file path. In the example below, each row holds one document in a `text` column, with optional details such as source and author in a `metadata` column.


In [None]:
import pyarrow as pa

documents_table = pa.table({
    "text": [
        "This is a sample document about machine learning. It discusses various algorithms and their applications.",
        "Another document about data science and its importance in modern business."
    ],
    "metadata": [
        {"source": "doc1", "author": "John Doe"},
        {"source": "doc2", "author": "Jane Smith"}
    ]
})

table_result = qa_generator.generate(
    input_source=documents_table,
    num_qa_pairs=3
)

table_result
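
When the documents come from another system rather than being typed inline, the two columns have to be assembled first. As a minimal sketch, assuming records arrive as `(text, metadata)` pairs, the helper below builds the same `text` and `metadata` column layout used in the table above and skips empty documents. The name `build_columns` is hypothetical and not part of the YData SDK or PyArrow.

```python
def build_columns(records):
    """Turn (text, metadata) pairs into column lists, skipping blank texts."""
    texts, metadata = [], []
    for text, meta in records:
        cleaned = text.strip()
        if not cleaned:
            continue  # an empty document has nothing to ground questions in
        texts.append(cleaned)
        metadata.append(dict(meta))  # copy so the caller's dict is not shared
    return {"text": texts, "metadata": metadata}


records = [
    ("  A short note about model evaluation. ", {"source": "doc1"}),
    ("   ", {"source": "doc2"}),  # blank, will be skipped
]
columns = build_columns(records)
print(columns["text"])
# ['A short note about model evaluation.']
```

The resulting dictionary can be passed to `pa.table(columns)` in the same way as the inline example.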

## Conclusion

Synthetic Q&A generation is a powerful tool for enhancing model training and evaluation. By programmatically generating high-quality, grounded Q&A pairs, organizations can:

- Improve LLM performance on domain-specific questions
- Enable more effective retrieval-augmented generation (RAG) systems
- Enrich datasets while avoiding privacy or compliance issues

This makes it especially useful for applications in:
- Customer support automation
- Legal document analysis
- Educational content generation
- Healthcare documentation and clinical QA systems

For more details, visit the [YData SDK documentation](https://docs.sdk.ydata.ai/latest/synthetic_data/).