# Uploading Results to Trismik Dashboard

This notebook demonstrates three ways to upload evaluation results to Trismik's dashboard for tracking and visualization.

## Why Upload Results?

- **Track Progress**: Monitor model performance over time
- **Compare Models**: Visualize performance across different models and experiments
- **Share Results**: Collaborate with your team on evaluation insights
- **Historical Analysis**: Maintain a record of all evaluations

## Prerequisites

- **Trismik API key**: Get yours at https://app.trismik.com/settings
- **Trismik Project ID**: Create a project at https://app.trismik.com

## Setup

Set your API credentials:

In [None]:
import os

# Option 1: set your credentials directly (avoid committing real keys)
os.environ["TRISMIK_API_KEY"] = "your-trismik-api-key"
os.environ["TRISMIK_PROJECT_ID"] = "your-project-id"

# Option 2: load them from a .env file (requires the python-dotenv package)
# from dotenv import load_dotenv
# load_dotenv()

In [None]:
from scorebook import score, login
from scorebook.metrics import Accuracy

## Login to Trismik

In [None]:
api_key = os.environ.get("TRISMIK_API_KEY")
if not api_key:
    raise ValueError("TRISMIK_API_KEY not set. Get your API key from https://app.trismik.com/settings")

login(api_key)
print("✓ Logged in to Trismik")

project_id = os.environ.get("TRISMIK_PROJECT_ID")
if not project_id:
    raise ValueError("TRISMIK_PROJECT_ID not set. Find your project ID at https://app.trismik.com")

print(f"✓ Using project: {project_id}")

## Method 1: Upload score() Results

Score pre-computed outputs and upload the results to Trismik:

In [None]:
# Prepare items with pre-computed outputs
items = [
    {"input": "What is 2 + 2?", "output": "4", "label": "4"},
    {"input": "What is the capital of France?", "output": "Paris", "label": "Paris"},
    {"input": "Who wrote Romeo and Juliet?", "output": "William Shakespeare", "label": "William Shakespeare"},
    {"input": "What is 5 * 6?", "output": "30", "label": "30"},
    {"input": "What is the largest planet?", "output": "Jupiter", "label": "Jupiter"},
]

# Score and upload
results = score(
    items=items,
    metrics=Accuracy,
    dataset_name="basic_questions",
    model_name="example-model-v1",
    experiment_id="Score-Upload-Notebook",
    project_id=project_id,
    metadata={
        "description": "Example from Jupyter notebook",
        "note": "Pre-computed outputs uploaded via score()",
    },
    upload_results=True,  # Enable uploading
)

print("\n✓ Results uploaded successfully!")
print(f"Accuracy: {results['aggregates']['Accuracy']:.2%}")
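
As a rough sanity check on the uploaded number, accuracy over the items can be recomputed locally. This is a minimal sketch that assumes `Accuracy` is a plain exact-match comparison of `output` and `label`; the `exact_match_accuracy` helper is hypothetical and not part of scorebook.

```python
from typing import Dict, List


def exact_match_accuracy(items: List[Dict[str, str]]) -> float:
    """Fraction of items whose output equals its label (hypothetical helper)."""
    if not items:
        raise ValueError("items must not be empty")
    correct = sum(1 for item in items if item["output"] == item["label"])
    return correct / len(items)


sample_items = [
    {"input": "What is 2 + 2?", "output": "4", "label": "4"},
    {"input": "What is 5 * 6?", "output": "31", "label": "30"},  # deliberately wrong
]
print(f"Local accuracy: {exact_match_accuracy(sample_items):.2%}")
```

If the local figure and the value reported by `score()` disagree, the metric is probably doing more than exact matching (for example, normalizing case or whitespace).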

## Method 2: Upload evaluate() Results

Run inference and automatically upload results:

In [None]:
import json
from pathlib import Path
from typing import Any, List

from scorebook import EvalDataset, evaluate

# Create a small dataset and write it to a temporary JSON file
sample_data = [
    {"question": "What is 10 + 5?", "answer": "15"},
    {"question": "What is the capital of Spain?", "answer": "Madrid"},
]

temp_file = Path("temp_eval_dataset.json")
with open(temp_file, "w") as f:
    json.dump(sample_data, f)

dataset = EvalDataset.from_json(
    path=str(temp_file),
    metrics="accuracy",
    input="question",
    label="answer",
)

# Define a simple inference function (mock)
def mock_inference(inputs: List[Any], **hyperparameters: Any) -> List[Any]:
    """Mock inference that returns the expected answers."""
    # In practice, this would call your model
    return ["15", "Madrid"]  # Mock perfect answers

# Run evaluation with upload
eval_results = evaluate(
    mock_inference,
    dataset,
    hyperparameters={"temperature": 0.7},
    experiment_id="Evaluate-Upload-Notebook",
    project_id=project_id,
    metadata={
        "model": "mock-model",
        "description": "Evaluation results from notebook",
    },
    return_aggregates=True,
    return_items=True,
    return_output=True,
)

print("\n✓ Evaluation results uploaded!")
print(f"Accuracy: {eval_results['aggregates']['accuracy']:.2%}")

# Cleanup
temp_file.unlink()
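
The mock above returns a fixed list, so it only works for this exact two-item dataset in this order. A slightly more robust sketch derives each output from its input. It assumes the inference function receives the raw `question` strings as `inputs`, and the `ANSWER_KEY` lookup is a hypothetical stand-in for a real model call.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for a real model: maps known questions to answers.
ANSWER_KEY: Dict[str, str] = {
    "What is 10 + 5?": "15",
    "What is the capital of Spain?": "Madrid",
}


def lookup_inference(inputs: List[Any], **hyperparameters: Any) -> List[Any]:
    """Return one output per input, in the same order as the inputs."""
    # Unknown questions get an empty string so the output length always matches.
    return [ANSWER_KEY.get(str(item), "") for item in inputs]


outputs = lookup_inference(["What is the capital of Spain?", "What is 10 + 5?"])
print(outputs)
```

Because the outputs follow the input order, this version keeps working if the dataset is shuffled or subsampled.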

## Method 3: Upload External Results

Results from external evaluation frameworks, or historical data, can be uploaded with the same `score()` call once they are in the `input`/`output`/`label` item format:

In [None]:
# Example: Import results from another evaluation framework
external_results = [
    {"input": "Translate 'hello' to Spanish", "output": "hola", "label": "hola"},
    {"input": "Translate 'goodbye' to Spanish", "output": "adiós", "label": "adiós"},
    {"input": "Translate 'thank you' to Spanish", "output": "gracias", "label": "gracias"},
    {"input": "Translate 'please' to Spanish", "output": "por favor", "label": "por favor"},
]

# Upload external results
external_upload = score(
    items=external_results,
    metrics="accuracy",
    dataset_name="spanish_translation",
    model_name="external-translator-v2",
    experiment_id="External-Results-Upload",
    project_id=project_id,
    metadata={
        "description": "Historical results imported from external framework",
        "source": "Custom evaluation pipeline",
        "date": "2025-01-15",
    },
    upload_results=True,
)

print("\n✓ External results uploaded!")
print(f"Accuracy: {external_upload['aggregates']['accuracy']:.2%}")
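
External frameworks rarely emit records with `input`, `output`, and `label` keys, so a small mapping step usually comes first. The sketch below assumes hypothetical source field names (`prompt`, `prediction`, `reference`) and a hypothetical `to_score_items` helper; substitute whatever your pipeline produces.

```python
from typing import Dict, List


def to_score_items(
    records: List[Dict[str, str]],
    input_key: str = "prompt",
    output_key: str = "prediction",
    label_key: str = "reference",
) -> List[Dict[str, str]]:
    """Rename external fields to the input/output/label keys used above."""
    return [
        {
            "input": record[input_key],
            "output": record[output_key],
            "label": record[label_key],
        }
        for record in records
    ]


raw_records = [
    {"prompt": "Translate 'hello' to Spanish", "prediction": "hola", "reference": "hola"},
]
converted = to_score_items(raw_records)
print(converted)
```

A missing field raises `KeyError` rather than being silently skipped, which surfaces schema mismatches before anything is uploaded.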

## View Results on Dashboard

All uploaded results are now visible on your Trismik dashboard:

In [None]:
from IPython.display import display, Markdown

dashboard_url = f"https://app.trismik.com/projects/{project_id}"
display(Markdown(f"### 📊 [View All Results on Dashboard]({dashboard_url})"))
print(f"\nDirect link: {dashboard_url}")
print("\nYou should see three experiments:")
print("  1. Score-Upload-Notebook")
print("  2. Evaluate-Upload-Notebook")
print("  3. External-Results-Upload")

## Organizing Results with Metadata

Use metadata to add context and organization to your results:

In [None]:
# Example: Organizing a model comparison experiment
models_to_test = [
    {"name": "model-a", "version": "1.0"},
    {"name": "model-b", "version": "2.0"},
]

test_items = [
    {"output": "positive", "label": "positive"},
    {"output": "negative", "label": "negative"},
]

for model_info in models_to_test:
    result = score(
        items=test_items,
        metrics=Accuracy,
        dataset_name="sentiment_test",
        model_name=model_info["name"],
        experiment_id="Model-Comparison-Notebook",
        project_id=project_id,
        metadata={
            "model_version": model_info["version"],
            "comparison_group": "sentiment_analysis",
            "date": "2025-01-26",
            "notes": f"Testing {model_info['name']} v{model_info['version']}",
        },
        upload_results=True,
    )
    print(f"✓ Uploaded results for {model_info['name']} v{model_info['version']}")

## Best Practices

### Experiment Naming
- Use descriptive `experiment_id` values (e.g., "GPT4-MMLU-Baseline")
- Group related runs under the same experiment ID
- Use different experiment IDs for different types of tests
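
One way to keep `experiment_id` values descriptive and consistent is to build them from their parts. This is an illustrative sketch: the `make_experiment_id` helper and its model-dataset-variant layout are assumptions modeled on the "GPT4-MMLU-Baseline" example, not a Trismik requirement.

```python
import re


def make_experiment_id(model: str, dataset: str, variant: str) -> str:
    """Join the parts with hyphens as model-dataset-variant (hypothetical helper)."""
    parts = []
    for part in (model, dataset, variant):
        # Keep only letters and digits so hyphens appear solely as separators.
        cleaned = re.sub(r"[^A-Za-z0-9]+", "", part)
        if not cleaned:
            raise ValueError(f"empty experiment ID component: {part!r}")
        parts.append(cleaned)
    return "-".join(parts)


print(make_experiment_id("GPT4", "MMLU", "Baseline"))
```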

### Metadata
- Include model version, hyperparameters, and configuration
- Add timestamps and descriptions for historical tracking
- Use consistent keys across experiments for easy comparison
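
Consistent keys are easier to maintain when every run builds its metadata through one function. A minimal sketch, assuming a hypothetical `build_metadata` helper with illustrative key names; the resulting dict would be passed as the `metadata` argument used throughout this notebook. Whether the dashboard accepts nested values such as the `hyperparameters` dict is not confirmed here, so flatten it if needed.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_metadata(
    model_version: str,
    description: str,
    hyperparameters: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return a metadata dict with the same keys for every run (hypothetical helper)."""
    return {
        "model_version": model_version,
        "description": description,
        "hyperparameters": dict(hyperparameters or {}),
        # UTC timestamp in ISO 8601 format for historical tracking.
        "uploaded_at": datetime.now(timezone.utc).isoformat(),
    }


metadata = build_metadata("1.0", "Baseline run", {"temperature": 0.7})
print(sorted(metadata))
```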

### Organization
- Create separate projects for different use cases
- Use tags or metadata fields to categorize experiments
- Document your evaluation methodology in metadata

## Next Steps

- Explore the Trismik dashboard to visualize trends and comparisons
- Set up automated evaluation pipelines with result uploading
- Try the **Adaptive Evaluations** notebook for efficient testing with automatic uploads