# Dataset operations 🗂️

This guide covers how to work with AnnotatedPairs datasets using Feedback Forensics' dataset operations tools.

## Overview

Feedback Forensics provides both CLI and Python API tools for manipulating AnnotatedPairs datasets. These tools are useful for:

- **Converting CSV data**: Transform preference data from CSV to AnnotatedPairs for use in Feedback Forensics
- **Merging datasets**: Combine multiple annotated datasets with overlapping comparisons
- **Data restoration**: Merge annotation-only datasets with full comparison data

## Command Line Interface

For quick dataset operations, use the `ff-data` CLI tool:

In [None]:
!ff-data --help

usage: ff-data [-h] {merge,csv_to_ap} ...

Swiss army knife for AnnotatedPairs datasets

positional arguments:
  {merge,csv_to_ap}  Available commands
    merge            Merge two AnnotatedPairs datasets
    csv_to_ap        Convert CSV to AnnotatedPairs format

options:
  -h, --help         show this help message and exit


### Converting CSV to AnnotatedPairs

Before you can use datasets with Feedback Forensics, they need to be in the AnnotatedPairs format. If you have a raw CSV file with preference data, you can convert it using the `csv_to_ap` command:

In [None]:
!ff-data csv_to_ap --help

usage: ff-data csv_to_ap [-h] --name NAME csv_file output

positional arguments:
  csv_file     Input CSV file with columns text_a, text_b, preferred_text
  output       Output AnnotatedPairs JSON file (use "-" for stdout)

options:
  -h, --help   show this help message and exit
  --name NAME  Dataset name for the AnnotatedPairs output


Your CSV file must contain the following columns:
- `text_a`, `text_b`: The two responses being compared
- `preferred_text`: Which response was preferred (`"text_a"` or `"text_b"`)

Optional columns:
- `input` or `prompt`: The prompt that generated the responses
- `model_a`, `model_b`: Names of the models that generated the responses
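
As a rough illustration of the layout described above, the sketch below writes a one-row CSV with the required and optional columns and then checks that the required ones are present. It uses only Python's standard library; the helper names (`write_preference_csv`, `missing_required_columns`), the file name, and the row contents are made up for this example and are not part of Feedback Forensics.

```python
import csv
import tempfile
from pathlib import Path

# Required column names as listed in this guide.
REQUIRED_COLUMNS = ["text_a", "text_b", "preferred_text"]
# Optional column names as listed in this guide (using `prompt` rather than `input`).
OPTIONAL_COLUMNS = ["prompt", "model_a", "model_b"]


def write_preference_csv(path, rows):
    """Write preference rows to a CSV file with a header row (hypothetical helper)."""
    fieldnames = OPTIONAL_COLUMNS[:1] + REQUIRED_COLUMNS + OPTIONAL_COLUMNS[1:]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def missing_required_columns(path):
    """Return the required columns that are absent from the CSV header."""
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    return [col for col in REQUIRED_COLUMNS if col not in header]


# Invented example row; `preferred_text` names the preferred column.
rows = [
    {
        "prompt": "Write a short story about a cat.",
        "text_a": "Shadow prowled the moonlit rooftops.",
        "text_b": "The cat sat on the mat.",
        "preferred_text": "text_a",
        "model_a": "model-one",
        "model_b": "model-two",
    }
]

out_path = Path(tempfile.gettempdir()) / "my_preferences.csv"
write_preference_csv(out_path, rows)
print("Missing required columns:", missing_required_columns(out_path))
```

Running a check like this before conversion can catch a misspelled header early, although the converter may perform its own validation.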

Here's an example converting the sample data:

In [None]:
!ff-data csv_to_ap ../../data/input/example.csv - --name "Example Dataset"

📜  | INFO | Converting CSV to AnnotatedPairs: ../../data/input/example.csv
📜  | INFO | Available metadata columns: ['index']
📜  | INFO | Outputting AnnotatedPairs dataset to stdout
{
  "metadata": {
    "version": "2.0",
    "description": "Annotated pairs dataset with annotations from ICAI",
    "created_at": "2025-05-30T16:49:16Z",
    "dataset_name": "Example Dataset",
    "default_annotator": "ba751e7b",
    "available_metadata_keys_per_comparison": [
      "index"
    ]
  },
  "annotators": {
    "ba751e7b": {
      "name": "preferred_text",
      "description": "Default annotator from original dataset (from column `preferred_text`)",
      "type": "unknown"
    }
  },
  "comparisons": [
    {
      "id": "a4e9e77e",
      "prompt": null,
      "response_a": {
        "text": "In the heart of a bustling city, a sleek black cat named Shadow prowled the moonlit rooftops, her eyes gleaming with curiosity and misch

This creates a basic AnnotatedPairs dataset that contains only the original preferences from your CSV file. **Note: The converted dataset does not include principle-based annotations.**

To get rich principle-based annotations that enable personality trait analysis, use `ff-annotate` instead (this requires API keys):

```bash
ff-annotate --datapath="../../data/input/example.csv"
```

The `csv_to_ap` command is useful for quick conversion in two cases: when you want to merge a CSV dataset with annotations that are already in AnnotatedPairs format, or when you want to work with just the original preference data without additional AI annotations.
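
To check what a converted file contains, the JSON can be inspected directly. The sketch below assumes only the top-level fields visible in the output above (`metadata`, `annotators`, `comparisons`) and works on a trimmed inline copy of that output rather than a real file; `summarize_ap` is a hypothetical helper, not a Feedback Forensics function.

```python
import json

# Trimmed copy of the fields shown in the CLI output above.
# The comparison entry is cut down to the two keys that are fully visible there.
ap_json = """
{
  "metadata": {
    "version": "2.0",
    "dataset_name": "Example Dataset",
    "default_annotator": "ba751e7b",
    "available_metadata_keys_per_comparison": ["index"]
  },
  "annotators": {
    "ba751e7b": {
      "name": "preferred_text",
      "description": "Default annotator from original dataset (from column `preferred_text`)",
      "type": "unknown"
    }
  },
  "comparisons": [
    {"id": "a4e9e77e", "prompt": null}
  ]
}
"""


def summarize_ap(ap):
    """Summarize an AnnotatedPairs-style dict (hypothetical helper)."""
    metadata = ap.get("metadata", {})
    annotators = ap.get("annotators", {})
    default_id = metadata.get("default_annotator")
    return {
        "dataset_name": metadata.get("dataset_name"),
        "n_comparisons": len(ap.get("comparisons", [])),
        "annotator_names": sorted(a.get("name", "") for a in annotators.values()),
        "default_annotator_name": annotators.get(default_id, {}).get("name"),
    }


summary = summarize_ap(json.loads(ap_json))
for key, value in summary.items():
    print(f"{key}: {value}")
```

A dataset produced by `csv_to_ap` should list a single annotator derived from the `preferred_text` column, as in the output above, whereas a dataset with principle-based annotations would list many more.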

### Merging datasets via CLI

`ff-data merge` combines two datasets whose comparisons may overlap.

In [None]:
!ff-data merge --help

usage: ff-data merge [-h] [--name NAME] [--desc DESC] first second output

positional arguments:
  first        First dataset file (takes precedence in conflicts)
  second       Second dataset file
  output       Output file (use "-" for stdout)

options:
  -h, --help   show this help message and exit
  --name NAME  Override dataset name for merged result
  --desc DESC  Override description for merged result


The example below merges a dataset with itself and prints the merged result to stdout for demonstration:

In [None]:
!ff-data merge ../../data/output/annotated_pairs.json ../../data/output/annotated_pairs.json -

📜  | INFO | Merging AnnotatedPairs: ../../data/output/annotated_pairs.json + ../../data/output/annotated_pairs.json
📜  | INFO | Merging AnnotatedPairs datasets
📜  | INFO | First dataset: 1 comparisons, 41 annotators
📜  | INFO | Second dataset: 1 comparisons, 41 annotators
📜  | INFO | Found 1 matching comparisons, 0 unique to first, 0 unique to second
📜  | INFO | Merged result: 1 comparisons, 41 annotators
📜  | INFO | Outputting merged dataset to stdout
{
  "metadata": {
    "version": "2.0",
    "created_at": "2025-05-30T15:36:36Z",
    "dataset_name": "ICAI Training Dataset - 2025-05-07_18-35-25",
    "description": "AnnotatedPairs dataset with annotations from ICAI",
    "default_annotator": "ba751e7b",
    "available_metadata_keys_per_comparison": [
      "index"
    ]
  },
  "annotators": {
    "ba751e7b": {
      "name": "preferred_text",
      "description": "Default annotator from original dataset (from column `preferred_text`)",
      "type": "unknow

## Python API

The same dataset operations are also available through the `feedback_forensics.data.operations` module.

### Converting CSV to AnnotatedPairs

In [None]:
from feedback_forensics.data.operations import csv_to_ap, save_ap

# Convert CSV to AnnotatedPairs format (without principle annotations)
ap_data = csv_to_ap("../../data/input/example.csv", "Example Dataset")

print(f"Converted dataset contains {len(ap_data['comparisons'])} comparisons")
print(f"Dataset contains {len(ap_data['annotators'])} annotators")
print("Note: This dataset only contains original preferences, not principle-based annotations")

📜  | INFO | Available metadata columns: ['index']
Converted dataset contains 1 comparisons
Dataset contains 1 annotators
Note: This dataset only contains original preferences, not principle-based annotations



### Loading, Merging, and Saving Datasets

Merge two datasets with conflict resolution:

In [None]:
from feedback_forensics.data.operations import load_ap, merge_ap

# Load two datasets. Here we use the same sample data for both, but in practice you would load two different datasets.
dataset1 = load_ap("../../data/output/annotated_pairs.json")
dataset2 = load_ap("../../data/output/annotated_pairs.json")

print(f"Dataset contains {len(dataset1['comparisons'])} comparisons")
print(f"Dataset contains {len(dataset1['annotators'])} annotators")

# Merge them (first dataset takes precedence in conflicts). In this case, the datasets are identical.
merged_dataset = merge_ap(dataset1, dataset2)

Dataset contains 1 comparisons
Dataset contains 41 annotators
📜  | INFO | Merging AnnotatedPairs datasets
📜  | INFO | First dataset: 1 comparisons, 41 annotators
📜  | INFO | Second dataset: 1 comparisons, 41 annotators
📜  | INFO | Found 1 matching comparisons, 0 unique to first, 0 unique to second
📜  | INFO | Merged result: 1 comparisons, 41 annotators


You can save the resulting dataset to a file:

```python
save_ap(merged_dataset, "merged_dataset.json")
```

### How Merging Works

The merge operation works as follows:

1. **Comparison Matching**: Uses content-based hash IDs to identify identical comparisons
2. **Annotation Combining**: Merges all annotations from both datasets for matching comparisons
3. **Conflict Resolution**: When conflicts occur, the first dataset takes precedence (with warnings logged)

This is particularly useful for restoring datasets where you have:
- One dataset with full comparison data but limited annotations
- Another dataset with rich annotations but possibly missing comparison details
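
The three steps above can be sketched conceptually in plain Python. This is not the library's implementation: the hashing scheme in `comparison_id`, the per-comparison `annotations` field, and the function names are all assumptions made for illustration. It only shows the general idea of matching by content hash, combining annotations, and letting the first dataset win on conflicts.

```python
import hashlib
import json
import logging

logger = logging.getLogger(__name__)


def comparison_id(prompt, text_a, text_b):
    """Hypothetical content hash: identical content always yields the same ID."""
    payload = json.dumps([prompt, text_a, text_b], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]


def _copy_comparison(comp):
    """Copy a comparison so the inputs are never mutated."""
    return {**comp, "annotations": dict(comp.get("annotations", {}))}


def merge_comparisons(first, second):
    """Merge two lists of comparison dicts; `first` takes precedence on conflicts."""
    merged = {comp["id"]: _copy_comparison(comp) for comp in first}
    for comp in second:
        target = merged.get(comp["id"])
        if target is None:
            # Comparison only exists in the second dataset: keep it as-is.
            merged[comp["id"]] = _copy_comparison(comp)
            continue
        for annotator, value in comp.get("annotations", {}).items():
            if annotator not in target["annotations"]:
                # New annotation for a matching comparison: add it.
                target["annotations"][annotator] = value
            elif target["annotations"][annotator] != value:
                # Conflict: keep the value from the first dataset and warn.
                logger.warning(
                    "Conflict for comparison %s, annotator %s: keeping first value",
                    comp["id"],
                    annotator,
                )
    return list(merged.values())


# Invented example data with made-up annotator IDs.
cid = comparison_id(None, "response one", "response two")
first = [{"id": cid, "annotations": {"ann_1": "a"}}]
second = [
    {"id": cid, "annotations": {"ann_1": "b", "ann_2": "b"}},
    {"id": "ffffffff", "annotations": {"ann_3": "a"}},
]
merged = merge_comparisons(first, second)
print(json.dumps(merged, indent=2))
```

In this toy run, the shared comparison keeps `ann_1` from the first dataset, gains `ann_2` from the second, and the comparison unique to the second dataset is appended unchanged.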

## Next Steps

- [Analyze your merged datasets](../guide/feedback.ipynb)
- [API Reference](../api.ipynb)
- [Learn about the underlying method](../method/index.md)