# Testing DataMorgana API

This notebook demonstrates how to use the DataMorgana API to generate synthetic question-answer (QA) pairs, as described in the paper "Generating Diverse Q&A Benchmarks for RAG Evaluation with DataMorgana".

DataMorgana is a tool that generates highly customizable and diverse synthetic Q&A benchmarks tailored to RAG applications. It enables detailed configurations of user and question categories and provides control over their distribution within the benchmark.

## Setup & Imports

In [1]:
from services.ds_data_morgana import DataMorgana
from pprint import pprint

# Initialize DataMorgana client
dm = DataMorgana()

## Generating a Synchronous QA Pair

Let's generate a single QA pair using the synchronous API.

In [2]:
# Define question categories for a single QA pair
question_category = {
    "formulation-categorization": {
        "name": "natural",
        "description": "phrased in the way people typically speak, reflecting everyday language use, without formal or artificial structure."
        # No "probability" key here: probabilities only apply to bulk generation
    }
}

# Define user categories for a single QA pair
user_category = {
    "expertise-categorization": {
        "name": "common person",
        "description": "a common person who is not an expert in the subject discussed in the document and therefore asks basic questions."
        # No "probability" key here: probabilities only apply to bulk generation
    }
}

# Define a document ID (replace with a valid document ID for your use case)
document_ids = ["<urn:uuid:26073e55-b9e8-4213-8df2-8baf5c1cb383>"]

# Generate a QA pair
response = dm.generate_qa_pair_sync(
    question_categories=question_category,
    user_categories=user_category,
    document_ids=document_ids
)

# Display the response
print("Generated QA Pair:")
pprint(response)

Generated QA Pair:
{'credits': 1,
 'request': {'document_ids': ['<urn:uuid:26073e55-b9e8-4213-8df2-8baf5c1cb383>'],
             'n_questions': 1,
             'question_categories': {'formulation-categorization': {'description': 'phrased '
                                                                                   'in '
                                                                                   'the '
                                                                                   'way '
                                                                                   'people '
                                                                                   'typically '
                                                                                   'speak, '
                                                                                   'reflecting '
                                                                                   'everyday '
                

## Defaults from the Paper

In [3]:
# User categorizations: one expertise-based, one specific to the health domain
general_user_categories = [
    {
        "categorization_name": "expertise-categorization",
        "categories": [
            {
                "name": "expert",
                "description": "a specialized user with deep understanding of the corpus",
                "probability": 0.5
            },
            {
                "name": "novice",
                "description": "a regular user with no understanding of specialized terms",
                "probability": 0.5
            }
        ]
    },
    {
        "categorization_name": "health-categorization",
        "categories": [
            {
                "name": "patient",
                "description": "a regular patient who uses the system to get basic health information, symptom checking, and guidance on preventive care",
                "probability": 0.25
            },
            {
                "name": "medical-doctor",
                "description": "a medical doctor who needs to access some advanced information",
                "probability": 0.25
            },
            {
                "name": "clinical-researcher",
                "description": "a clinical researcher who uses the system to access population health data, conduct initial patient surveys, track disease progression patterns, etc",
                "probability": 0.25
            },
            {
                "name": "public-health-authority",
                "description": "a public health authority who uses the system to manage community health information dissemination, be informed on health emergencies, etc",
                "probability": 0.25
            }
        ]
    }
]
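
The single-pair request earlier takes one category per categorization and no probability, while the lists in this section carry a probability for every category. The sketch below shows one way to derive the single-pair shape from a bulk-style list. The helper `to_single_pair_format` and its `choices` argument are hypothetical names introduced for illustration and are not part of the DataMorgana client.

```python
def to_single_pair_format(categorizations, choices):
    """Build the dict-of-categories shape used for a single QA pair.

    categorizations: bulk-style list (each category has a probability).
    choices: maps each categorization_name to the category name to use.
    Hypothetical helper, not part of the DataMorgana client.
    """
    selected = {}
    for categorization in categorizations:
        cat_name = categorization["categorization_name"]
        wanted = choices[cat_name]
        for category in categorization["categories"]:
            if category["name"] == wanted:
                # Drop "probability": it only matters for bulk generation.
                selected[cat_name] = {
                    "name": category["name"],
                    "description": category["description"],
                }
                break
        else:
            raise KeyError(f"{wanted!r} not found in {cat_name!r}")
    return selected


# A trimmed copy of the expertise categorization defined above.
example_categorizations = [
    {
        "categorization_name": "expertise-categorization",
        "categories": [
            {"name": "expert",
             "description": "a specialized user with deep understanding of the corpus",
             "probability": 0.5},
            {"name": "novice",
             "description": "a regular user with no understanding of specialized terms",
             "probability": 0.5},
        ],
    }
]

single = to_single_pair_format(
    example_categorizations, {"expertise-categorization": "novice"}
)
print(single)
```

The resulting dict has the same layout as `user_category` in the synchronous example, so under that assumption it could be passed as `user_categories`.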

In [4]:
# Default question categories from the DataMorgana paper
default_question_categories = [
    {
        "categorization_name": "factuality",
        "categories": [
            {
                "name": "factoid",
                "description": "question seeking a specific, concise piece of information or a short fact about a particular subject, such as a name, date, or number",
                "probability": 0.5
            },
            {
                "name": "open-ended",
                "description": "question inviting detailed or exploratory responses, encouraging discussion or elaboration",
                "probability": 0.5
            }
        ]
    },
    {
        "categorization_name": "premise",
        "categories":  [
            {
                "name": "direct",
                "description": "question that does not contain any premise or any information about the user",
                "probability": 0.5
            },
            {
                "name": "with-premise",
                "description": "question starting with a very short premise, where the user reveals their needs or some information about themselves",
                "probability": 0.5
            }
        ]
    },
    {
        "categorization_name": "phrasing",
        "categories": [
            {
                "name": "concise-and-natural",
                "description": "phrased in the way people typically speak, reflecting everyday language use, without formal or artificial structure. It is a concise direct question consisting of less than 10 words",
                "probability": 0.25
            },
            {
                "name": "verbose-and-natural",
                "description": "phrased in the way people typically speak, reflecting everyday language use, without formal or artificial structure. It is a relatively long question consisting of more than 9 words",
                "probability": 0.25
            },
            {
                "name": "short-search-query",
                "description": "phrased as a typed web query for search engines (only keywords, without punctuation and without a natural-sounding structure). It consists of less than 7 words",
                "probability": 0.25
            },
            {
                "name": "long-search-query",
                "description": "phrased as a typed web query for search engines (only keywords, without punctuation and without a natural-sounding structure). It consists of more than 6 words",
                "probability": 0.25
            }
        ]
    },
    {
        "categorization_name": "linguistic_variation",
        "categories": [
            {
                "name": "similar-to-document",
                "description": "phrased using the same terminology and phrases appearing in the document",
                "probability": 0.5
            },
            {
                "name": "distant-from-document",
                "description": "phrased using terms completely different from the ones appearing in the document",
                "probability": 0.5
            }
        ]
    }
]
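
Every categorization in this section assigns probabilities that sum to 1. Before submitting a request, it can help to check this locally. The sketch below assumes that summing to 1 is the intended constraint (the notebook does not state whether the API enforces it), and `check_probabilities` is a hypothetical helper, not part of the client.

```python
import math


def check_probabilities(categorizations, tol=1e-9):
    """Return (categorization_name, total) for every categorization whose
    category probabilities do not sum to 1 within the given tolerance."""
    problems = []
    for categorization in categorizations:
        total = sum(c["probability"] for c in categorization["categories"])
        if not math.isclose(total, 1.0, abs_tol=tol):
            problems.append((categorization["categorization_name"], total))
    return problems


# Toy inputs: the first sums to 1, the second deliberately does not.
well_formed = [
    {"categorization_name": "factuality",
     "categories": [{"name": "factoid", "probability": 0.5},
                    {"name": "open-ended", "probability": 0.5}]},
]
malformed = [
    {"categorization_name": "premise",
     "categories": [{"name": "direct", "probability": 0.5},
                    {"name": "with-premise", "probability": 0.3}]},
]

ok_report = check_probabilities(well_formed)
bad_report = check_probabilities(malformed)
print(ok_report)
print(bad_report)
```

An empty list means every categorization passed. The same check applies unchanged to `general_user_categories` and `default_question_categories`.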

## Bulk Generation with Multiple Categorizations

One of the key features of DataMorgana is the ability to generate diverse QA pairs using multiple categorizations for both questions and users.
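
To make the idea concrete, the sketch below draws one category from each categorization according to its probabilities. That is the behaviour the probabilities suggest, but it is an illustration only and not DataMorgana's actual implementation. `sample_combination` and the toy data are hypothetical.

```python
import random


def sample_combination(categorizations, rng):
    """Pick one category name per categorization, weighted by probability."""
    combo = {}
    for categorization in categorizations:
        categories = categorization["categories"]
        weights = [c["probability"] for c in categories]
        chosen = rng.choices(categories, weights=weights, k=1)[0]
        combo[categorization["categorization_name"]] = chosen["name"]
    return combo


# Toy categorizations with the same structure as the defaults above.
toy_categorizations = [
    {"categorization_name": "factuality",
     "categories": [{"name": "factoid", "probability": 0.5},
                    {"name": "open-ended", "probability": 0.5}]},
    {"categorization_name": "premise",
     "categories": [{"name": "direct", "probability": 0.5},
                    {"name": "with-premise", "probability": 0.5}]},
]

rng = random.Random(0)  # seeded so repeated runs draw the same sequence
combos = [sample_combination(toy_categorizations, rng) for _ in range(5)]
for combo in combos:
    print(combo)
```

Each printed dict is one combination of question properties. With more categorizations, the number of distinct combinations grows multiplicatively, which is where the diversity comes from.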

In [5]:
# Define example document IDs (replace with valid document IDs for your use case)
# example_document_ids = ["<urn:uuid:sample-doc-123>", "<urn:uuid:sample-doc-456>"]

# Submit a bulk generation request
bulk_response = dm.generate_qa_pair_bulk(
    n_questions=2,  # Generate 2 QA pairs
    question_categorizations=default_question_categories,
    user_categorizations=general_user_categories,
    # document_ids=example_document_ids  # Optional
)

# Display the response
print("Bulk Generation Response:")
pprint(bulk_response)

# Store the generation ID for later use
generation_id = bulk_response.get("request_id")

Bulk Generation Response:
{'request_id': '12346c84-6648-430d-8db0-0fd9710a2cc5', 'type': 'async'}


## Fetching Bulk Generation Results

After submitting a bulk generation request, you can check its status and retrieve the results using the generation ID (the `request_id` returned by the bulk request).

In [6]:
# Fetch generation results
results = dm.fetch_generation_results(generation_id)

# Display the results
print("Generation Results:")
print(f"Status: {results.get('status', 'Unknown')}")

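# If the generation is complete, display a sample of the generated QA pairs.
# The status is compared case-insensitively because the API reports lowercase values such as 'in_progress'.
if str(results.get('status', '')).lower() == 'completed' and 'qa_pairs' in results: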
    print(f"\nTotal QA pairs generated: {len(results['qa_pairs'])}")
    print("\nSample QA pairs:")
    for i, qa_pair in enumerate(results['qa_pairs'][:2]):  # Show first 2 pairs
        print(f"\nPair {i+1}:")
        print(f"Question: {qa_pair['question']}")
        print(f"Answer: {qa_pair['answer']}")
        if 'metadata' in qa_pair:
            print(f"Categories: {qa_pair['metadata']}")

Generation Results:
Status: in_progress


## Waiting for Generation Results

Instead of manually checking the status, you can use the `wait_generation_results` method which will automatically poll every 2 seconds until the process is complete.

In [7]:
# Wait for results (this will poll automatically every 2 seconds)
complete_results = dm.wait_generation_results(generation_id)

# Display the results once completed
if 'qa_pairs' in complete_results:
    print("\nSample QA pairs:")
    for i, qa_pair in enumerate(complete_results['qa_pairs'][:2]):
        print(f"\nPair {i+1}:")
        print(f"Question: {qa_pair['question']}")
        print(f"Answer: {qa_pair['answer']}")
        if 'metadata' in qa_pair:
            print(f"Categories: {qa_pair['metadata']}")

Status: in_progress, waiting 2 seconds before retrying...
Status: in_progress, waiting 2 seconds before retrying...
Status: in_progress, waiting 2 seconds before retrying...
Status: in_progress, waiting 2 seconds before retrying...
Status: in_progress, waiting 2 seconds before retrying...
Status: in_progress, waiting 2 seconds before retrying...
Status: in_progress, waiting 2 seconds before retrying...
Status: in_progress, waiting 2 seconds before retrying...
Generation status: completed
Retrieved 2 QA pairs

Sample QA pairs:

Pair 1:
Question: As someone concerned about environmental impact, I'd like to know why manufacturers are switching to biological alternatives for making everyday products like plastics and cosmetics?
Answer: Manufacturers are switching to bio-organic acids due to stringent environmental regulations imposed on conventional organic acid producers. Bio-organic acids are gaining significance because of their renewability and eco-friendly nature, making them an attra
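
The polling behaviour shown in this section can be sketched as a small loop with a timeout. This is not the source of `wait_generation_results`. `poll_until_done`, its parameters, and the `"failed"` status are assumptions made for illustration. Only `in_progress` and `completed` appear in the output above.

```python
import time


def poll_until_done(fetch, interval=2.0, timeout=60.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until it reports a terminal status or the timeout passes.

    fetch: zero-argument callable returning a dict with a "status" key.
    The "failed" status is an assumption, not confirmed by the notebook.
    """
    deadline = clock() + timeout
    while True:
        result = fetch()
        status = str(result.get("status", "")).lower()
        if status in ("completed", "failed"):
            return result
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        sleep(interval)


# Stand-in for the real API: two pending responses, then a completed one.
fake_responses = iter([
    {"status": "in_progress"},
    {"status": "in_progress"},
    {"status": "completed", "qa_pairs": []},
])

final = poll_until_done(
    lambda: next(fake_responses),
    interval=0.0,
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(final["status"])
```

In real use, `fetch` could be something like `lambda: dm.fetch_generation_results(generation_id)`. The timeout guards against a request that never leaves the in-progress state.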

## Retrying a Failed Generation

If a bulk generation request fails, you can retry it using the generation ID.

In [None]:
# Note (Kun): this call currently does not work. It was implemented according to the API docs,
# but the API responds with "Unprocessable Entity" (HTTP 422), which generally means
# the request body has the wrong format or fields.

# # Retry a failed generation
# retry_response = dm.retry_generation(generation_id)

# # Display the response
# print("Retry Response:")
# pprint(retry_response)

## Using Custom Categories

You can create custom categories and categorizations to better suit your specific needs.

In [None]:
# Create custom question categories for technical support domain
tech_support_categorization = [{
    "categorization_name": "tech-question-type",
    "categories": [
        {
            "name": "troubleshooting",
            "description": "question about diagnosing and fixing problems with software or hardware",
            "probability": 0.4
        },
        {
            "name": "how-to",
            "description": "question about how to perform specific tasks or use specific features",
            "probability": 0.3
        },
        {
            "name": "compatibility",
            "description": "question about whether products or features work together",
            "probability": 0.2
        },
        {
            "name": "installation",
            "description": "question about installing or setting up software or hardware",
            "probability": 0.1
        }
    ]
}, {
    "categorization_name": "complexity",
    "categories": [
        {
            "name": "simple",
            "description": "straightforward question with a clear answer",
            "probability": 0.7
        },
        {
            "name": "complex",
            "description": "nuanced question requiring detailed explanation",
            "probability": 0.3
        }
    ]
}]
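
With two categorizations, each generated question gets one category from each. Assuming the categories are drawn independently across categorizations (an assumption, not something this notebook confirms), the probability of a combination is the product of its category probabilities. The sketch below tabulates this for the values defined above. `combination_probabilities` is a hypothetical helper.

```python
from itertools import product


def combination_probabilities(categorizations):
    """Map each tuple of category names to the product of its probabilities.

    Assumes independent draws across categorizations.
    """
    table = {}
    for combo in product(*(c["categories"] for c in categorizations)):
        probability = 1.0
        for category in combo:
            probability *= category["probability"]
        table[tuple(category["name"] for category in combo)] = probability
    return table


# Same names and probabilities as tech_support_categorization, descriptions omitted.
tech_probabilities_only = [
    {"categorization_name": "tech-question-type",
     "categories": [{"name": "troubleshooting", "probability": 0.4},
                    {"name": "how-to", "probability": 0.3},
                    {"name": "compatibility", "probability": 0.2},
                    {"name": "installation", "probability": 0.1}]},
    {"categorization_name": "complexity",
     "categories": [{"name": "simple", "probability": 0.7},
                    {"name": "complex", "probability": 0.3}]},
]

table = combination_probabilities(tech_probabilities_only)
for names, probability in sorted(table.items(), key=lambda item: -item[1]):
    print(f"{names}: {probability:.2f}")
```

Under this assumption, the most likely combination is a simple troubleshooting question (0.4 × 0.7 = 0.28) and the least likely is a complex installation question (0.1 × 0.3 = 0.03), so rare combinations need a large `n_questions` to appear at all.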

In [None]:
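# Create a custom user categorization for the technical support domain.
# Note: this is a single categorization (a dict), whereas the bulk examples above pass a list,
# so wrap it as [tech_user_categorization] before using it as user_categorizations.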
tech_user_categorization = {
    "categorization_name": "tech-user-expertise",
    "categories": [
        {
            "name": "beginner",
            "description": "a user with minimal technical knowledge who struggles with basic computing tasks",
            "probability": 0.3
        },
        {
            "name": "intermediate",
            "description": "a user with basic technical skills who can follow instructions but lacks deep understanding",
            "probability": 0.4
        },
        {
            "name": "advanced",
            "description": "a technically proficient user who understands system concepts and can troubleshoot independently",
            "probability": 0.2
        },
        {
            "name": "developer",
            "description": "a software developer or IT professional with deep technical knowledge",
            "probability": 0.1
        }
    ]
}