# Testing DataMorgana API

This notebook demonstrates how to use the DataMorgana API for synthetic Q&A generation, as described in the paper "Generating Diverse Q&A Benchmarks for RAG Evaluation with DataMorgana".

DataMorgana is a tool that generates highly customizable and diverse synthetic Q&A benchmarks tailored to RAG applications. It enables detailed configurations of user and question categories and provides control over their distribution within the benchmark.

## Setup & Imports

In [2]:
from services.ds_data_morgana import DataMorgana
from pprint import pprint

# Initialize DataMorgana client
dm = DataMorgana()

## Generating a Synchronous QA Pair

Let's generate a single QA pair using the synchronous API.

In [3]:
# Define question categories for a single QA pair
question_category = {
    "formulation-categorization": {
        "name": "natural",
        "description": "phrased in the way people typically speak, reflecting everyday language use, without formal or artificial structure."
        # No probability here: probabilities are only used for bulk generation
    }
}

# Define user categories for a single QA pair
user_category = {
    "expertise-categorization": {
        "name": "common person",
        "description": "a common person who is not expert of the subject discussed in the document, therefore he asks basic questions."
        # No probability here: probabilities are only used for bulk generation
    }
}

# Define a document ID (replace with a valid document ID for your use case)
document_ids = ["<urn:uuid:26073e55-b9e8-4213-8df2-8baf5c1cb383>"]

# Generate a QA pair
qa_pair = dm.generate_qa_pair_sync(
    question_categories=question_category,
    user_categories=user_category,
    document_ids=document_ids
)

# Display the response
print("Generated QA Pair:")
print(f"Question: {qa_pair.question}")
print(f"Answer: {qa_pair.answer}")
print(f"Context: {qa_pair.context}")
print(f"Question Categories: {qa_pair.question_categories}")
print(f"User Categories: {qa_pair.user_categories}")
print(f"Document IDs: {qa_pair.document_ids}")

Generated QA Pair:
Question: What is the new evidence about when humans first lived in North America?
Answer: A recent study has dated stone artifacts to 16-20,000 years ago, which pushes back the timeline of the first human inhabitants of North America by at least 2,500 years before the Clovis period. These artifacts were found at the Gault Site in Central Texas and show a previously unknown, early projectile point technology that was unrelated to Clovis.
Context: ['Gault site research pushes back date of earliest North Americans\nJuly 20, 2018 (Reno, NV) – For decades, researchers believed the Western Hemisphere was settled by humans roughly 13,500 years ago, a theory based largely upon the widespread distribution of Clovis artifacts dated to that time. Clovis artifacts are distinctive prehistoric stone tools so named because they were initially found near Clovis, New Mexico, in the 1920s but have since been identified throughout North and South America.\nIn recent years, though, arc

## Defaults from the Paper

In [3]:
# General user categorizations: one expertise-based, one for the health domain
general_user_categories = [
    {
        "categorization_name": "expertise-categorization",
        "categories": [
            {
                "name": "expert",
                "description": "a specialized user with deep understanding of the corpus",
                "probability": 0.5
            },
            {
                "name": "novice",
                "description": "a regular user with no understanding of specialized terms",
                "probability": 0.5
            }
        ]
    },
    {
        "categorization_name": "health-categorization",
        "categories": [
            {
                "name": "patient",
                "description": "a regular patient who uses the system to get basic health information, symptom checking, and guidance on preventive care",
                "probability": 0.25
            },
            {
                "name": "medical-doctor",
                "description": "a medical doctor who needs to access some advanced information",
                "probability": 0.25
            },
            {
                "name": "clinical-researcher",
                "description": "a clinical researcher who uses the system to access population health data, conduct initial patient surveys, track disease progression patterns, etc",
                "probability": 0.25
            },
            {
                "name": "public-health-authority",
                "description": "a public health authority who uses the system to manage community health information dissemination, be informed on health emergencies, etc",
                "probability": 0.25
            }
        ]
    }
]

In [4]:
# Default question categories from the DataMorgana paper
default_question_categories = [
    {
        "categorization_name": "factuality",
        "categories": [
            {
                "name": "factoid",
                "description": "question seeking a specific, concise piece of information or a short fact about a particular subject, such as a name, date, or number",
                "probability": 0.5
            },
            {
                "name": "open-ended",
                "description": "question inviting detailed or exploratory responses, encouraging discussion or elaboration",
                "probability": 0.5
            }
        ]
    },
    {
        "categorization_name": "premise",
        "categories":  [
            {
                "name": "direct",
                "description": "question that does not contain any premise or any information about the user",
                "probability": 0.5
            },
            {
                "name": "with-premise",
                "description": "question starting with a very short premise, where the user reveals their needs or some information about themselves",
                "probability": 0.5
            }
        ]
    },
    {
        "categorization_name": "phrasing",
        "categories": [
            {
                "name": "concise-and-natural",
                "description": "phrased in the way people typically speak, reflecting everyday language use, without formal or artificial structure. It is a concise direct question consisting of less than 10 words",
                "probability": 0.25
            },
            {
                "name": "verbose-and-natural",
                "description": "phrased in the way people typically speak, reflecting everyday language use, without formal or artificial structure. It is a relatively long question consisting of more than 9 words",
                "probability": 0.25
            },
            {
                "name": "short-search-query",
                "description": "phrased as a typed web query for search engines (only keywords, without punctuation and without a natural-sounding structure). It consists of less than 7 words",
                "probability": 0.25
            },
            {
                "name": "long-search-query",
                "description": "phrased as a typed web query for search engines (only keywords, without punctuation and without a natural-sounding structure). It consists of more than 6 words",
                "probability": 0.25
            }
        ]
    },
    {
        "categorization_name": "linguistic_variation",
        "categories": [
            {
                "name": "similar-to-document",
                "description": "phrased using the same terminology and phrases appearing in the document",
                "probability": 0.5
            },
            {
                "name": "distant-from-document",
                "description": "phrased using terms completely different from the ones appearing in the document",
                "probability": 0.5
            }
        ]
    }
]
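
To make the role of the `probability` fields concrete, the sketch below draws one category from each categorization, weighted by its probability. It illustrates the idea only and is not DataMorgana's actual implementation. The helper `sample_categories` and the trimmed `example_categorizations` list are hypothetical names introduced here, and descriptions are omitted for brevity.

```python
import random

# Trimmed, illustrative copy of two categorizations in the same shape as above
# (descriptions omitted; this list is an assumption made for the example).
example_categorizations = [
    {
        "categorization_name": "factuality",
        "categories": [
            {"name": "factoid", "probability": 0.5},
            {"name": "open-ended", "probability": 0.5},
        ],
    },
    {
        "categorization_name": "phrasing",
        "categories": [
            {"name": "concise-and-natural", "probability": 0.25},
            {"name": "verbose-and-natural", "probability": 0.25},
            {"name": "short-search-query", "probability": 0.25},
            {"name": "long-search-query", "probability": 0.25},
        ],
    },
]


def sample_categories(categorizations, rng):
    """Draw one category name per categorization, weighted by probability."""
    picked = {}
    for categorization in categorizations:
        categories = categorization["categories"]
        names = [c["name"] for c in categories]
        weights = [c["probability"] for c in categories]
        picked[categorization["categorization_name"]] = rng.choices(
            names, weights=weights, k=1
        )[0]
    return picked


rng = random.Random(0)  # seeded so repeated runs give the same draw
sampled = sample_categories(example_categorizations, rng)
print(sampled)
```

Each generated question would then carry one category per categorization, which matches the shape of the `question_categories` lists shown in the outputs of this notebook.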

## Bulk Generation with Multiple Categorizations

One of the key features of DataMorgana is the ability to generate diverse QA pairs using multiple categorizations for both questions and users.

In [5]:
# Define example document IDs (replace with valid document IDs for your use case)
# example_document_ids = ["<urn:uuid:sample-doc-123>", "<urn:uuid:sample-doc-456>"]

# Submit a bulk generation request
bulk_response = dm.generate_qa_pair_bulk(
    n_questions=2,  # Generate 2 QA pairs
    question_categorizations=default_question_categories,
    user_categorizations=general_user_categories,
    # document_ids=example_document_ids  # Optional
)

# Display the response
print("Bulk Generation Response:")
pprint(bulk_response)

# Store the generation ID for later use
generation_id = bulk_response.get("request_id")

Bulk Generation Response:
{'request_id': '9e4a29ef-6edc-44b7-8924-3b161fb84083', 'type': 'async'}
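
To get a feel for how much diversity multiple categorizations give, one can count the distinct category combinations, assuming each QA pair receives exactly one category from every categorization. The sketch below uses name-only stand-ins for the default lists; the helpers `stub` and `count_combinations` are hypothetical and exist only for this example.

```python
from math import prod


def stub(name, category_names):
    """Build a name-only categorization in the same shape as the real ones."""
    return {
        "categorization_name": name,
        "categories": [{"name": n} for n in category_names],
    }


def count_combinations(categorizations):
    """Number of ways to pick one category from each categorization."""
    return prod(len(c["categories"]) for c in categorizations)


question_stubs = [
    stub("factuality", ["factoid", "open-ended"]),
    stub("premise", ["direct", "with-premise"]),
    stub("phrasing", ["concise-and-natural", "verbose-and-natural",
                      "short-search-query", "long-search-query"]),
    stub("linguistic_variation", ["similar-to-document", "distant-from-document"]),
]
user_stubs = [
    stub("expertise-categorization", ["expert", "novice"]),
    stub("health-categorization", ["patient", "medical-doctor",
                                   "clinical-researcher", "public-health-authority"]),
]

question_combos = count_combinations(question_stubs)
user_combos = count_combinations(user_stubs)
print(question_combos, user_combos, question_combos * user_combos)  # 32 8 256
```

With `n_questions=2`, as in the request above, only a tiny fraction of these combinations can appear in a single run.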


## Fetching Bulk Generation Results

After submitting a bulk generation request, you can check its status and retrieve the results using the generation ID.

In [6]:
# Fetch generation results
results = dm.fetch_generation_results(generation_id)

# Display the results
print("Generation Results:")
print(f"Status: {results.get('status', 'Unknown')}")

Generation Results:
Status: in_progress


## Waiting for Generation Results

Instead of checking the status manually, you can use the `wait_generation_results` method, which polls every 2 seconds until generation is complete.

In [13]:
# Wait for results (this will poll automatically every 2 seconds)
# When parse_results=True (default), it returns a list of QAPair objects
qa_pairs = dm.wait_generation_results(generation_id)
display(dm.to_dataframe(qa_pairs))

# Display the results once completed
print(f"Retrieved {len(qa_pairs)} QA pairs")
print("\nSample QA pairs:")

for i, qa_pair in enumerate(qa_pairs[:2]):  # Show first 2 pairs
    print(f"\nPair {i+1}:")
    print(f"Question: {qa_pair.question}")
    print(f"Answer: {qa_pair.answer}")
    print(f"Question Categories: {qa_pair.question_categories}")
    print(f"User Categories: {qa_pair.user_categories}")


Generation status: completed

File URL: https://s3.amazonaws.com/data.aiir/data_morgana/web_api/results_id_5e634774-db09-4fcd-a862-44f8d217f4ad_user_id_0e5ea510-5709-475a-8460-4577fd3588ea.jsonl?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA2UC3AHBFZP4P5ZYJ%2F20250403%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250403T145923Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEIf%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJGMEQCIER9bviK52k76OOIVL71FVm1Vvqz%2BfqbhV6gPm1gvpUhAiBl4WFdAvdrvdkOwWUN3mkp61%2BsqJpjdrRt1Nm417sQqSrEBQjv%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F8BEAAaDDczMDMzNTI5NTU2MyIMX4RixBi81MJR7UhpKpgF%2BZmw6u3MdAMRyCKdQF8UmExsHHDwPzzqJmYUBwSOD71GvYrV%2B920OL4e6btmcAamG82sgh5KhZCAJuktoJYUSIzpazAaj8v%2FwqTjRMavFW%2BUMwGHy8eyk0XNKbS684CDeclRkfz1RmJKk0sLGJZyJ4KIssb4Y9vP1MxtCjhehwoV5OEmcgIsx1GZ8qD%2F4a3hprxtRKJ77d4evmT2DmD64o1vR0EvZw%2FKjNGXqisWM09gW4YFhM1%2F6e7Z1Dpj0f%2B6SxBuIWlXWWiBPcjxwDxlOVr36EKijfIq%2B9rxoRMDfZLO6%2FoC8g81gyzEGvjxE3CSmXT48h8Ga

| | question | answer | context | question_categories | user_categories | document_ids |
|---|---|---|---|---|---|---|
| 0 | medical professional searching effectiveness a... | Anti-AMA1-C1 antibody responses were dose depe... | [A Phase 1 trial was conducted in malaria-naïv... | [{'categorization_name': 'factuality', 'catego... | [{'categorization_name': 'expertise-categoriza... | [<urn:uuid:5ec6b8c8-46bf-43c9-b745-49514ed5ff7d>] |
| 1 | What materials can catalyst substrates be made... | Catalyst substrates can be made from materials... | [This disclosure relates to a method of manufa... | [{'categorization_name': 'factuality', 'catego... | [{'categorization_name': 'expertise-categoriza... | [<urn:uuid:406ded9e-6252-4264-aae9-d6ec55b82375>] |


Retrieved 2 QA pairs

Sample QA pairs:

Pair 1:
Question: medical professional searching effectiveness antibody response vaccine different doses
Answer: Anti-AMA1-C1 antibody responses were dose dependent and observed following each vaccination, with mean antibody levels 2-3 fold higher in the 20 μg group compared to the 5 μg group at most time points.
Question Categories: [{'categorization_name': 'factuality', 'category_name': 'factoid'}, {'categorization_name': 'premise', 'category_name': 'with-premise'}, {'categorization_name': 'phrasing', 'category_name': 'long-search-query'}, {'categorization_name': 'linguistic_variation', 'category_name': 'distant-from-document'}]
User Categories: [{'categorization_name': 'expertise-categorization', 'category_name': 'novice'}, {'categorization_name': 'health-categorization', 'category_name': 'public-health-authority'}]

Pair 2:
Question: What materials can catalyst substrates be made from?
Answer: Catalyst substrates can be made from materials in
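
For intuition, the polling behavior described in this section can be sketched as a small loop. This is a minimal sketch and not the actual `wait_generation_results` code: `poll_until_done` is a hypothetical helper, the `fetch_status` callable stands in for a call such as `dm.fetch_generation_results(generation_id)`, and the timeout parameter is an assumption added for safety.

```python
import time


def poll_until_done(fetch_status, interval_seconds=2.0, timeout_seconds=600.0,
                    sleep=time.sleep):
    """Call fetch_status() until its 'status' is no longer 'in_progress'."""
    waited = 0.0
    while True:
        result = fetch_status()
        if result.get("status") != "in_progress":
            return result
        if waited >= timeout_seconds:
            raise TimeoutError(f"still in progress after {waited} seconds")
        sleep(interval_seconds)
        waited += interval_seconds


# Simulated responses so the sketch runs without network access.
fake_responses = iter([
    {"status": "in_progress"},
    {"status": "in_progress"},
    {"status": "completed"},
])
final = poll_until_done(lambda: next(fake_responses), interval_seconds=0.0)
print(final["status"])  # completed
```

Passing `sleep` as a parameter keeps the loop easy to test, since a test can replace it with a no-op instead of actually waiting.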

## Retrying a Failed Generation

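If a bulk generation request fails, the API documentation describes a retry call that takes the generation ID. The call is left commented out here because it currently fails.

In [8]:
# Note (Kun): this does not work yet. It was implemented according to the API doc,
# but the request fails with "Unprocessable Entity" (HTTP 422), which generally means
# the request body has the wrong format or fields.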

# # Retry a failed generation
# retry_response = dm.retry_generation(generation_id)

# # Display the response
# print("Retry Response:")
# pprint(retry_response)

## Using Custom Categories

You can create custom categories and categorizations to better suit your specific needs.

In [9]:
# Create custom question categories for technical support domain
tech_support_categorization = [{
    "categorization_name": "tech-question-type",
    "categories": [
        {
            "name": "troubleshooting",
            "description": "question about diagnosing and fixing problems with software or hardware",
            "probability": 0.4
        },
        {
            "name": "how-to",
            "description": "question about how to perform specific tasks or use specific features",
            "probability": 0.3
        },
        {
            "name": "compatibility",
            "description": "question about whether products or features work together",
            "probability": 0.2
        },
        {
            "name": "installation",
            "description": "question about installing or setting up software or hardware",
            "probability": 0.1
        }
    ]
}, {
    "categorization_name": "complexity",
    "categories": [
        {
            "name": "simple",
            "description": "straightforward question with a clear answer",
            "probability": 0.7
        },
        {
            "name": "complex",
            "description": "nuanced question requiring detailed explanation",
            "probability": 0.3
        }
    ]
}]

In [10]:
# Create custom user categories for technical support domain
# Wrapped in a list so it has the same shape as the other categorization lists
tech_user_categorization = [{
    "categorization_name": "tech-user-expertise",
    "categories": [
        {
            "name": "beginner",
            "description": "a user with minimal technical knowledge who struggles with basic computing tasks",
            "probability": 0.3
        },
        {
            "name": "intermediate",
            "description": "a user with basic technical skills who can follow instructions but lacks deep understanding",
            "probability": 0.4
        },
        {
            "name": "advanced",
            "description": "a technically proficient user who understands system concepts and can troubleshoot independently",
            "probability": 0.2
        },
        {
            "name": "developer",
            "description": "a software developer or IT professional with deep technical knowledge",
            "probability": 0.1
        }
    ]
}]
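
When writing custom categorizations by hand, it is easy to end up with probabilities that do not add up. The sketch below checks each categorization, assuming the service expects the probabilities within a categorization to sum to 1 (every example in this notebook does, but the requirement itself is an assumption). `check_probabilities` and both sample lists are hypothetical names used only for illustration.

```python
import math


def check_probabilities(categorizations, tolerance=1e-9):
    """Return (categorization_name, total) for every total that is not 1."""
    problems = []
    for categorization in categorizations:
        total = sum(c["probability"] for c in categorization["categories"])
        # isclose absorbs floating-point noise such as 0.9999999999999999
        if not math.isclose(total, 1.0, abs_tol=tolerance):
            problems.append((categorization["categorization_name"], total))
    return problems


# Same probabilities as the tech-user-expertise example (descriptions omitted).
good = [{
    "categorization_name": "tech-user-expertise",
    "categories": [
        {"name": "beginner", "probability": 0.3},
        {"name": "intermediate", "probability": 0.4},
        {"name": "advanced", "probability": 0.2},
        {"name": "developer", "probability": 0.1},
    ],
}]

# Deliberately broken example: the probabilities sum to 1.2.
bad = [{
    "categorization_name": "broken",
    "categories": [
        {"name": "a", "probability": 0.6},
        {"name": "b", "probability": 0.6},
    ],
}]

print(check_probabilities(good))  # []
print([name for name, _ in check_probabilities(bad)])  # ['broken']
```

Running such a check before submitting a bulk request catches configuration mistakes locally instead of waiting for the generation to fail or drift from the intended distribution.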

## Parse Results From Files

In [11]:
import pandas as pd
from services.ds_data_morgana import QAPair

# Load JSONL file from disk
df = pd.read_json('../data/generated_qa_pairs/results_id_e3c63ac8-3790-44c7-8380-d22781f75571_user_id_0e5ea510-5709-475a-8460-4577fd3588ea.jsonl', lines=True)

# Display the dataframe
display(df)

# Convert dataframe records to QAPair objects
qa_pairs = [QAPair.from_dict(record) for record in df.to_dict('records')]

# Display first QA pair
print("\nFirst QA pair from file as QAPair object:")
print(f"Question: {qa_pairs[0].question}")
print(f"Answer: {qa_pairs[0].answer}")
print(f"Question Categories: {qa_pairs[0].question_categories}")
print(f"User Categories: {qa_pairs[0].user_categories}")

| | question | answer | context | question_categories | user_categories | document_ids |
|---|---|---|---|---|---|---|
| 0 | united states assistant secretary health role ... | The Assistant Secretary for Health oversees th... | [\|\|This biographical article needs additional ... | [{'categorization_name': 'factuality', 'catego... | [{'categorization_name': 'expertise-categoriza... | [<urn:uuid:f29be4bd-628f-4e6e-9955-b32714ccb8c5>] |
| 1 | who perfected found object art concept | Marcel Duchamp perfected the concept when he m... | [A large number of artists in the Landfillart ... | [{'categorization_name': 'factuality', 'catego... | [{'categorization_name': 'expertise-categoriza... | [<urn:uuid:f5a541e8-18b4-4c95-9fab-a851a6135820>] |



First QA pair from file as QAPair object:
Question: united states assistant secretary health role responsibilities public health department
Answer: The Assistant Secretary for Health oversees the HHS Office of Public Health and Science, the Commissioned Corps of the U.S. Public Health Service, and the Office of the Surgeon General. They lead interdisciplinary programs related to disease prevention, health promotion, health disparities reduction, women's and minority health, HIV/AIDS, vaccine programs, physical fitness, bioethics, population affairs, blood supply, and human research protections.
Question Categories: [{'categorization_name': 'factuality', 'category_name': 'factoid'}, {'categorization_name': 'premise', 'category_name': 'direct'}, {'categorization_name': 'phrasing', 'category_name': 'long-search-query'}, {'categorization_name': 'linguistic_variation', 'category_name': 'similar-to-document'}]
User Categories: [{'categorization_name': 'expertise-categorization', 'category_n
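
Once results are loaded, a natural follow-up is to tally how often each category was assigned and compare that with the configured probabilities. The sketch below works on plain dictionaries in the `categorization_name` / `category_name` shape shown in the outputs above. `tally_categories` is a hypothetical helper, and the two records reuse the question categories printed earlier in this notebook purely as illustrative input.

```python
from collections import Counter


def tally_categories(records, field):
    """Count (categorization_name, category_name) pairs across records."""
    counts = Counter()
    for record in records:
        for entry in record[field]:
            counts[(entry["categorization_name"], entry["category_name"])] += 1
    return counts


# Illustrative records built from the question categories shown in the outputs.
records = [
    {"question_categories": [
        {"categorization_name": "factuality", "category_name": "factoid"},
        {"categorization_name": "premise", "category_name": "with-premise"},
        {"categorization_name": "phrasing", "category_name": "long-search-query"},
        {"categorization_name": "linguistic_variation", "category_name": "distant-from-document"},
    ]},
    {"question_categories": [
        {"categorization_name": "factuality", "category_name": "factoid"},
        {"categorization_name": "premise", "category_name": "direct"},
        {"categorization_name": "phrasing", "category_name": "long-search-query"},
        {"categorization_name": "linguistic_variation", "category_name": "similar-to-document"},
    ]},
]

counts = tally_categories(records, "question_categories")
print(counts[("factuality", "factoid")])  # 2
print(counts[("premise", "direct")])      # 1
```

With only a handful of pairs the observed counts will differ noticeably from the configured probabilities, so a comparison like this is more meaningful on larger batches.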