# Testing the DataMorgana API

This notebook demonstrates how to use the DataMorgana API to generate synthetic Q&A pairs, as described in the paper "Generating Diverse Q&A Benchmarks for RAG Evaluation with DataMorgana".

DataMorgana is a tool that generates highly customizable and diverse synthetic Q&A benchmarks tailored to RAG applications. It enables detailed configurations of user and question categories and provides control over their distribution within the benchmark.

## Setup & Imports

In [1]:
from services.ds_data_morgana import DataMorgana


dm = DataMorgana()

## Using Pre-defined Categories

DataMorgana comes with pre-defined categories for both questions and users based on the research paper.

In [3]:
# Get the default question categories
question_categories = dm.get_default_question_categories()

# Print the factuality categories as an example
print("Factuality Categories:")
for category in question_categories["factuality"]:
    print(f"- {category.name}: {category.description} (probability: {category.probability})")

Factuality Categories:
- factoid: question seeking a specific, concise piece of information or a short fact about a particular subject, such as a name, date, or number (probability: 0.5)
- open-ended: question inviting detailed or exploratory responses, encouraging discussion or elaboration (probability: 0.5)


In [4]:
# Get the general user categories
general_user_categories = dm.get_general_user_categories()

print("General User Categories:")
for category in general_user_categories:
    print(f"- {category.name}: {category.description} (probability: {category.probability})")

# Get the healthcare-specific user categories
healthcare_user_categories = dm.get_healthcare_user_categories()

print("\nHealthcare User Categories:")
for category in healthcare_user_categories:
    print(f"- {category.name}: {category.description} (probability: {category.probability})")

General User Categories:
- expert: a specialized user with deep understanding of the corpus (probability: 0.5)
- novice: a regular user with no understanding of specialized terms (probability: 0.5)

Healthcare User Categories:
- patient: a regular patient who uses the system to get basic health information, symptom checking, and guidance on preventive care (probability: 0.25)
- medical-doctor: a medical doctor who needs to access some advanced information (probability: 0.25)
- clinical-researcher: a clinical researcher who uses the system to access population health data, conduct initial patient surveys, track disease progression patterns, etc (probability: 0.25)
- public-health-authority: a public health authority who uses the system to manage community health information dissemination, be informed on health emergencies, etc (probability: 0.25)
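The probabilities printed above define a distribution over the categories in each categorization. As a rough illustration of what "control over their distribution" can mean in practice, the sketch below draws one question category and one user category per generated item, weighted by those probabilities. This is not DataMorgana's implementation: `SketchCategory` and `sample_category` are hypothetical names introduced here, and only the category names and probabilities are taken from the outputs above.

```python
import random
from dataclasses import dataclass


@dataclass
class SketchCategory:
    """Hypothetical stand-in for the notebook's category objects."""
    name: str
    probability: float


def sample_category(categories, rng):
    """Pick one category, weighted by its probability."""
    weights = [c.probability for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]


# Names and probabilities copied from the outputs above
factuality = [SketchCategory("factoid", 0.5), SketchCategory("open-ended", 0.5)]
general_users = [SketchCategory("expert", 0.5), SketchCategory("novice", 0.5)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
for _ in range(3):
    question_type = sample_category(factuality, rng)
    user_type = sample_category(general_users, rng)
    print(f"{user_type.name} asks a {question_type.name} question")
```

With several categorizations, one category would be drawn from each, so the combinations multiply and the benchmark covers a wider range of question and user types.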


## Creating Custom Categories

You can create custom categories and categorizations to better suit your specific needs.

In [None]:
# Create custom question categories for technical support domain
import json
from services.ds_data_morgana import Category


tech_support_categories = [
    Category("troubleshooting", "question about diagnosing and fixing problems with software or hardware", 0.4),
    Category("how-to", "question about how to perform specific tasks or use specific features", 0.3),
    Category("compatibility", "question about whether products or features work together", 0.2),
    Category("installation", "question about installing or setting up software or hardware", 0.1)
]

# Create a tech support categorization
tech_support_categorization = dm.create_categorization("tech-question-type", tech_support_categories)

# Print the categorization as a dictionary
print(json.dumps(tech_support_categorization.to_dict(), indent=2))

{
  "categorization_name": "tech-question-type",
  "categories": [
    {
      "name": "troubleshooting",
      "description": "question about diagnosing and fixing problems with software or hardware",
      "probability": 0.4
    },
    {
      "name": "how-to",
      "description": "question about how to perform specific tasks or use specific features",
      "probability": 0.3
    },
    {
      "name": "compatibility",
      "description": "question about whether products or features work together",
      "probability": 0.2
    },
    {
      "name": "installation",
      "description": "question about installing or setting up software or hardware",
      "probability": 0.1
    }
  ]
}


In [6]:
# Create custom user categories for technical support domain
tech_user_categories = [
    Category("beginner", "a user with minimal technical knowledge who struggles with basic computing tasks", 0.3),
    Category("intermediate", "a user with basic technical skills who can follow instructions but lacks deep understanding", 0.4),
    Category("advanced", "a technically proficient user who understands system concepts and can troubleshoot independently", 0.2),
    Category("developer", "a software developer or IT professional with deep technical knowledge", 0.1)
]

tech_user_categorization = dm.create_categorization("tech-user-expertise", tech_user_categories)

print(json.dumps(tech_user_categorization.to_dict(), indent=2))

{
  "categorization_name": "tech-user-expertise",
  "categories": [
    {
      "name": "beginner",
      "description": "a user with minimal technical knowledge who struggles with basic computing tasks",
      "probability": 0.3
    },
    {
      "name": "intermediate",
      "description": "a user with basic technical skills who can follow instructions but lacks deep understanding",
      "probability": 0.4
    },
    {
      "name": "advanced",
      "description": "a technically proficient user who understands system concepts and can troubleshoot independently",
      "probability": 0.2
    },
    {
      "name": "developer",
      "description": "a software developer or IT professional with deep technical knowledge",
      "probability": 0.1
    }
  ]
}
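Both custom categorizations above use probabilities that add up to 1. Whether the API enforces this is not confirmed here, so a local check before submitting a request may be worthwhile. The sketch below works on plain dictionaries shaped like the `to_dict()` output shown above; `probabilities_sum_to_one` is a hypothetical helper, not part of the DataMorgana client.

```python
import math


def probabilities_sum_to_one(categories, tolerance=1e-9):
    """Return True if the category probabilities add up to 1 (within tolerance)."""
    total = sum(category["probability"] for category in categories)
    return math.isclose(total, 1.0, abs_tol=tolerance)


# Probabilities copied from the two categorizations printed above
tech_question_probs = [
    {"name": "troubleshooting", "probability": 0.4},
    {"name": "how-to", "probability": 0.3},
    {"name": "compatibility", "probability": 0.2},
    {"name": "installation", "probability": 0.1},
]
tech_user_probs = [
    {"name": "beginner", "probability": 0.3},
    {"name": "intermediate", "probability": 0.4},
    {"name": "advanced", "probability": 0.2},
    {"name": "developer", "probability": 0.1},
]

for label, categories in [("tech-question-type", tech_question_probs),
                          ("tech-user-expertise", tech_user_probs)]:
    print(label, "valid:", probabilities_sum_to_one(categories))
```

The comparison uses `math.isclose` rather than `==` because sums of decimal fractions such as 0.4 + 0.3 + 0.2 + 0.1 are not always exactly 1.0 in floating point.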


## Generating a Synchronous QA Pair

Let's generate a single QA pair using the synchronous API.

In [None]:
# Define question categories for a single QA pair
from pprint import pprint


question_category = {
    "formulation-categorization": {
        "name": "natural",
        "description": "phrased in the way people typically speak, reflecting everyday language use, without formal or artificial structure."
    }
}

# Define user categories for a single QA pair
user_category = {
    "expertise-categorization": {
        "name": "common person",
        "description": "a common person who is not expert of the subject discussed in the document, therefore he asks basic questions."
    }
}

# Define a document ID (this is just an example, replace with a valid document ID)
document_ids = ["<urn:uuid:26073e55-b9e8-4213-8df2-8baf5c1cb383>"]

# Generate a QA pair
response = dm.generate_sync_qa_pair(
    question_categories=question_category,
    user_categories=user_category,
    document_ids=document_ids
)

# Display the response
print("Generated QA Pair:")
pprint(response)

Generated QA Pair:
{'credits': 1,
 'request': {'document_ids': ['<urn:uuid:26073e55-b9e8-4213-8df2-8baf5c1cb383>'],
             'n_questions': 1,
             'question_categories': {'formulation-categorization': {'description': 'phrased in the way people typically speak, reflecting everyday ...
[output truncated]
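Deeply nested responses like this one can be hard to read with `pprint` at its default width, which may split long strings into many short pieces. One alternative, assuming the response is a plain JSON-serializable dictionary, is to print it with `json.dumps`. The `sample_response` below is a hypothetical stand-in that mirrors only the fields visible in the output above plus the request values sent in the cell; it is not a complete API response.

```python
import json

# Hypothetical sample: only 'credits' and the start of 'request' are visible
# in the notebook output, so nothing beyond those fields is reproduced here.
sample_response = {
    "credits": 1,
    "request": {
        "document_ids": ["<urn:uuid:26073e55-b9e8-4213-8df2-8baf5c1cb383>"],
        "n_questions": 1,
        "question_categories": {
            "formulation-categorization": {
                "name": "natural",
                "description": (
                    "phrased in the way people typically speak, reflecting "
                    "everyday language use, without formal or artificial structure."
                ),
            }
        },
    },
}

# indent=2 keeps each string on a single line and shows nesting by indentation
print(json.dumps(sample_response, indent=2))
```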

## Bulk Generation with Multiple Categorizations

One of the key features of DataMorgana is the ability to generate diverse QA pairs using multiple categorizations for both questions and users.

In [None]:
# NOTE (Kun): this cell is disabled because the job can be created but not retrieved (the fetch call returns 404 Not Found)
# # Let's use our pre-defined categorizations to create diverse questions
# # Convert the question categories dict into a list of Categorization objects
# question_categorizations = []
# for cat_name, categories in question_categories.items():
#     question_categorizations.append(dm.create_categorization(cat_name, categories))

# # Create a user categorization using the healthcare user categories
# user_categorization = dm.create_categorization("healthcare-user-type", healthcare_user_categories)

# # Optional: Define document IDs for generation (this is just an example)
# covid_document_ids = ["<urn:uuid:sample-covid-doc-123>", "<urn:uuid:sample-covid-doc-456>"]

# # Submit a bulk generation request
# bulk_response = dm.bulk_generation(
#     n_questions=2,  # Generate 2 QA pairs
#     question_categorizations=question_categorizations,
#     user_categorizations=[user_categorization],
#     document_ids=covid_document_ids  # Optional
# )

# # Display the response
# print("Bulk Generation Response:")
# pprint(bulk_response)

# # Store the generation ID for later use
# generation_id = bulk_response.get("request_id")


Bulk Generation Response:
{'request_id': 'fc06504d-7c1b-48d0-ba8b-d61d8eb201da', 'type': 'async'}


## Fetching Bulk Generation Results

After submitting a bulk generation request, you can check its status and retrieve the results using the generation ID.
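Because bulk generation is asynchronous, a single fetch may return before the job has finished. A minimal polling sketch is shown below. It takes the fetch function as an argument, so it does not depend on the client working; `wait_for_results`, the `"PENDING"` and `"FAILED"` status values, and the fake fetch function are assumptions made for illustration. Only the `"COMPLETED"` status and the `qa_pairs` key come from this notebook's code.

```python
import time


def wait_for_results(fetch, generation_id, poll_seconds=5.0, max_attempts=60,
                     sleep=time.sleep):
    """Call fetch(generation_id) until the job completes, fails, or times out."""
    for _ in range(max_attempts):
        results = fetch(generation_id)
        status = results.get("status")
        if status == "COMPLETED":
            return results
        if status == "FAILED":  # assumed status name, not confirmed by the API
            raise RuntimeError(f"Generation {generation_id} failed")
        sleep(poll_seconds)
    raise TimeoutError(f"Generation {generation_id} did not finish in time")


# Demonstration with a fake fetch function instead of a real API call
_fake_responses = iter([
    {"status": "PENDING"},  # assumed intermediate status
    {"status": "COMPLETED", "qa_pairs": []},
])


def fake_fetch(generation_id):
    return next(_fake_responses)


results = wait_for_results(fake_fetch, "example-id", sleep=lambda seconds: None)
print(f"Status: {results['status']}")
```

With the real client, `fetch` would be something like `dm.fetch_generation_results`, provided that call succeeds.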

In [None]:
# # Fetch generation results
# results = dm.fetch_generation_results(generation_id)

# # Display the results
# print("Generation Results:")
# print(f"Status: {results.get('status', 'Unknown')}")

# # If the generation is complete, display a sample of the generated QA pairs
# if results.get('status') == 'COMPLETED' and 'qa_pairs' in results:
#     print(f"\nTotal QA pairs generated: {len(results['qa_pairs'])}")
#     print("\nSample QA pairs:")
#     for i, qa_pair in enumerate(results['qa_pairs'][:3]):  # Show first 3 pairs
#         print(f"\nPair {i+1}:")
#         print(f"Question: {qa_pair['question']}")
#         print(f"Answer: {qa_pair['answer']}")
#         if 'metadata' in qa_pair:
#             print(f"Categories: {qa_pair['metadata']}")


HTTPError: 404 Client Error: Not Found for url: https://api.ai71.ai/v1/fetch_generation_results/request_id=fc06504d-7c1b-48d0-ba8b-d61d8eb201da
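The URL in the error above ends with `/request_id=...`, which places the parameter in the path rather than in a query string. One possible cause of the 404, and this is only a guess from the URL shape, is that the endpoint expects `?request_id=...` instead. The sketch below shows how such a URL could be built with the standard library; `build_fetch_url` is a hypothetical helper, and the query-parameter form is an assumption that has not been checked against the API.

```python
from urllib.parse import urlencode

# Base URL taken from the error message above
BASE_URL = "https://api.ai71.ai/v1/"


def build_fetch_url(request_id: str) -> str:
    """Build the fetch URL with request_id as a query parameter (assumed form)."""
    query = urlencode({"request_id": request_id})
    return f"{BASE_URL}fetch_generation_results?{query}"


print(build_fetch_url("fc06504d-7c1b-48d0-ba8b-d61d8eb201da"))
```

If the request is made with the `requests` library, passing `params={"request_id": ...}` to `requests.get` produces the same query string without manual formatting.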

## Retrying a Failed Generation

If a bulk generation request fails, you can retry it using the generation ID.

In [None]:
# # This try-except block allows the notebook to run even if the API key is not valid
# try:
#     # Retry a failed generation
#     retry_response = dm.retry_generation(generation_id)
    
#     # Display the response
#     print("Retry Response:")
#     pprint(retry_response)
# except Exception as e:
#     print(f"Error: {e}\n\nNote: This is expected if you haven't provided a valid API key or if the generation ID is not valid.")

Error: 422 Client Error: Unprocessable Entity for url: https://api.ai71.ai/v1/retry

Note: This is expected if you haven't provided a valid API key or if the generation ID is not valid.
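The note above attributes every failure to the API key or the generation ID, but different HTTP status codes usually point to different causes. In general HTTP usage, 422 indicates that the request body failed validation, while an invalid key more often produces 401 or 403. The sketch below maps a few standard codes to hints. The meanings follow general HTTP semantics, not documented behavior of this API, and `explain_status` is a hypothetical helper.

```python
# General HTTP meanings; how this specific API uses each code is not confirmed
HTTP_STATUS_HINTS = {
    401: "missing or invalid credentials (check the API key)",
    403: "credentials accepted but the action is not permitted",
    404: "resource not found (check the URL and the generation ID)",
    422: "request understood but failed validation (check the payload fields)",
}


def explain_status(status_code: int) -> str:
    """Return a short troubleshooting hint for an HTTP status code."""
    return HTTP_STATUS_HINTS.get(
        status_code, "unexpected status; inspect the response body for details"
    )


for code in (404, 422):
    print(code, "->", explain_status(code))
```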
