# Synthetic Document Generation with the YData SDK
Synthetic documents are AI-generated files that mimic real-world reports, resumes, contracts, and other types of documents. They are valuable for:

- **Training LLMs** without using sensitive or proprietary data
- **Testing pipelines** or document-processing systems with realistic content
- **Augmenting datasets** when original documents are scarce or unavailable

In this notebook we demonstrate how to generate synthetic documents using the `DocumentGenerator` class from the YData SDK.
You will learn how to:
1. Generate a single synthetic document, or a set of them, to your own requirements
2. Adjust document formats and styles

#### Prerequisites
Make sure you have installed the YData SDK package:
```
pip install ydata-sdk
```
and that you have registered at https://ydata.ai/register

## Document Generation

In [None]:
# Step 1: Import required classes
from ydata.synthesizers.text.model.document import DocumentGenerator, DocumentFormat

In [None]:
# Step 2: Set your token and initialize the document generator
# Replace `ADD-TOKEN` with your ydata-sdk token

import os
# The variable name is blank in the source. YDATA_LICENSE_KEY is an assumption here,
# so confirm the exact name in the YData SDK documentation.
os.environ["YDATA_LICENSE_KEY"] = "ADD-TOKEN"

# By default the generator uses the latest OpenAI model. You can configure it to use
# Anthropic's Claude instead, depending on the type of document you want to generate.
# Use the input parameters `provider` and `model_name` to do so.
generator = DocumentGenerator(
    document_format=DocumentFormat.PDF  # Options: PDF, DOCX, HTML
)

### Generate a single new synthetic document
Customize the document with your desired settings:

In [None]:
generator.generate(
    n_docs=1,
    document_type="Curriculum",                     # Free text, e.g. Resume, Report, Invoice
    audience="HR",                                  # Free text: who will read the document
    tone="formal",                                  # One of: formal, casual, persuasive, empathetic, inspirational, enthusiastic, humorous, neutral
    purpose="Application for a Senior Machine Learning Engineer",
    region="North America",                         # Free text: region whose language conventions to follow
    language="German",                              # Free text: language of the document
    length="Long",                                  # Free text, e.g. Short, Medium, Long
    topics="Foundational models, LLMs, GenerativeAI, API, Python, software engineer",  # Free text: topics the content should cover
    style_guide="Flawless design",                  # Free text: layout and formatting hints
    output_dir="output/CV2"                         # Directory where the file is saved
)
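
Since `tone` is the one parameter restricted to a fixed set of values, it can help to check it before starting a generation run. The helper below is a minimal sketch: `normalize_tone` and `ALLOWED_TONES` are hypothetical names, not part of the YData SDK, and the allowed values are copied from the comment in the cell above. Whether the SDK itself is case-sensitive is not stated in this notebook, so the helper lower-cases the input as a precaution.

```python
# Tone values copied from the comment in the generate() call above.
ALLOWED_TONES = frozenset({
    "formal", "casual", "persuasive", "empathetic",
    "inspirational", "enthusiastic", "humorous", "neutral",
})


def normalize_tone(tone: str) -> str:
    """Return the tone stripped and lower-cased, or raise ValueError.

    Hypothetical helper for illustration; it is not an SDK function.
    """
    candidate = tone.strip().lower()
    if candidate not in ALLOWED_TONES:
        allowed = ", ".join(sorted(ALLOWED_TONES))
        raise ValueError(f"Unsupported tone {tone!r}; expected one of: {allowed}")
    return candidate
```

You could then pass `tone=normalize_tone("Formal")` to `generator.generate(...)` so that a typo fails immediately instead of partway through a run.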

### Generate a set of new synthetic documents

You can also generate multiple documents in a single run by setting `n_docs` to the number you need. Each document is generated individually, so tone and content vary from one document to the next while your input parameters stay unchanged. This makes batch runs well suited to building diverse, robust training datasets.

In [None]:
generator.generate(
    n_docs=10,
    document_type="Curriculum",                     # Free text, e.g. Resume, Report, Invoice
    audience="HR",                                  # Free text: who will read the document
    tone="formal",                                  # One of: formal, casual, persuasive, empathetic, inspirational, enthusiastic, humorous, neutral
    purpose="Application for a Senior Machine Learning Engineer",
    region="North America",                         # Free text: region whose language conventions to follow
    language="German",                              # Free text: language of the document
    length="Long",                                  # Free text, e.g. Short, Medium, Long
    topics="Foundational models, LLMs, GenerativeAI, API, Python, software engineer",  # Free text: topics the content should cover
    style_guide="Flawless design",                  # Free text: layout and formatting hints
    output_dir="output/CV"                          # Directory where the files are saved
)
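
If you want the dataset to vary along the input parameters as well, one option is to run several batches with different settings. The sketch below only builds the keyword arguments for each batch. `build_batches` and `BASE_PARAMS` are hypothetical names, the parameter names are the ones used in the call above, and the folder naming scheme is an assumption chosen for illustration. The actual `generate` call is left commented out because it needs the `generator` object and a valid token.

```python
from itertools import product

# Settings shared by every batch, copied from the example above.
BASE_PARAMS = {
    "n_docs": 10,
    "document_type": "Curriculum",
    "audience": "HR",
    "purpose": "Application for a Senior Machine Learning Engineer",
    "length": "Long",
}


def build_batches(languages, tones, base_dir="output"):
    """Return one keyword-argument dict per (language, tone) pair."""
    batches = []
    for language, tone in product(languages, tones):
        params = dict(BASE_PARAMS)  # copy so the shared settings stay untouched
        params.update(
            language=language,
            tone=tone,
            # One folder per combination so batches do not mix.
            output_dir=f"{base_dir}/CV_{language.lower()}_{tone}",
        )
        batches.append(params)
    return batches


batches = build_batches(["German", "English"], ["formal", "neutral"])
for params in batches:
    print(params["output_dir"])
    # generator.generate(**params)  # uncomment to run with the generator created earlier
```

Keeping the shared settings in one dictionary means each batch differs only in the fields you chose to vary, which makes the resulting dataset easier to describe later.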

Once generation completes, all generated documents are saved in the directory passed as `output_dir`, which is `output/CV` in this example.
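
A quick way to confirm that a run produced what you expected is to count the files in the output directory. The following is a minimal sketch using only the standard library. `count_generated` is a hypothetical helper, and it assumes the files are written directly into `output_dir` with an extension that matches the chosen format (for example `.pdf`), which this notebook does not state explicitly.

```python
from collections import Counter
from pathlib import Path


def count_generated(output_dir):
    """Count the files in output_dir by lower-cased extension (non-recursive)."""
    folder = Path(output_dir)
    if not folder.is_dir():
        # Nothing has been generated yet, or the path is wrong.
        return Counter()
    return Counter(p.suffix.lower() for p in folder.iterdir() if p.is_file())


counts = count_generated("output/CV")
print(dict(counts))
```

After the batch run above you would expect the count for the chosen format to match `n_docs`, provided the directory was empty beforehand.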

## Conclusion
Synthetic document generation is a valuable strategy for companies looking to improve the performance and generalization of their AI models, especially those working with language models or document processing systems.

By creating high-quality, domain-specific synthetic documents, organizations can:
- **Augment fine-tuning datasets** with diverse, multilingual examples
- **Pre-train or adapt LLMs** to specific document structures or business jargon
- **Expand coverage** across rare use cases or underrepresented formats
- **Avoid legal risks** tied to using real documents containing PII or proprietary content

This makes it particularly useful for teams in:
- **Healthcare** (e.g., discharge summaries, lab reports)
- **Finance** (e.g., credit statements, policy summaries)
- **Legal** (e.g., contracts, NDAs)
- **Human Resources** (e.g., resumes, evaluations)

With just a few lines of code, you can generate realistic, custom-tailored PDF, DOCX, or HTML files that accelerate model development without compromising on privacy or compliance.

To explore more, visit the [YData SDK documentation](https://docs.sdk.ydata.ai/latest/synthetic_data/).
