# Dataset Operations 🗂️

This guide covers how to work with annotated pairs datasets using Feedback Forensics' dataset operations tools.

## Overview

Feedback Forensics provides both CLI and Python API tools for manipulating annotated pairs datasets. This is useful for:

- **Merging datasets**: Combine multiple annotated datasets with overlapping comparisons
- **Data restoration**: Merge annotation-only datasets with full comparison data

## Command Line Interface

For quick dataset operations, use the `ff-data` CLI tool:

In [ ]:
!ff-data --help

usage: ff-data [-h] {merge} ...

Swiss army knife for annotated pairs datasets

positional arguments:
  {merge}     Available commands
    merge     Merge two annotated pairs datasets

options:
  -h, --help  show this help message and exit


### Merging Datasets via CLI

`ff-data merge` combines two datasets whose comparisons may overlap. To merge more than two, run the command repeatedly on the intermediate results.

In [ ]:
!ff-data merge --help

usage: ff-data merge [-h] [--name NAME] [--desc DESC] first second output

positional arguments:
  first        First dataset file (takes precedence in conflicts)
  second       Second dataset file
  output       Output file (use "-" for stdout)

options:
  -h, --help   show this help message and exit
  --name NAME  Override dataset name for merged result
  --desc DESC  Override description for merged result
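
If you drive the CLI from a script rather than a notebook cell, the command can be assembled from the options listed in the help text above. The following is a minimal sketch, not part of Feedback Forensics: `build_merge_command` is a hypothetical helper, and the file names are placeholders chosen for illustration. The command is only executed if `ff-data` is on the `PATH` and both input files exist.

```python
import shutil
import subprocess
from pathlib import Path


def build_merge_command(first, second, output, name=None, desc=None):
    """Build the argument list for `ff-data merge` (hypothetical helper)."""
    cmd = ["ff-data", "merge"]
    if name is not None:
        cmd += ["--name", name]  # override the merged dataset's name
    if desc is not None:
        cmd += ["--desc", desc]  # override the merged dataset's description
    # Positional arguments: first dataset (wins conflicts), second, output.
    return cmd + [first, second, output]


# Placeholder file names, used only for illustration.
first_path = "full_comparisons.json"
second_path = "extra_annotations.json"

cmd = build_merge_command(
    first_path,
    second_path,
    "merged.json",
    name="Merged dataset",
    desc="Full comparisons plus extra annotations",
)
print(" ".join(cmd))

# Run the merge only when the CLI and both inputs are actually available.
inputs_exist = Path(first_path).exists() and Path(second_path).exists()
if shutil.which("ff-data") and inputs_exist:
    subprocess.run(cmd, check=True)
```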


The example below merges a dataset with itself and, by passing `-` as the output, prints the merged result to stdout for demonstration:

In [ ]:
!ff-data merge ../../data/output/annotated_pairs.json ../../data/output/annotated_pairs.json -

📜  | INFO | Merging annotated pairs: ../../data/output/annotated_pairs.json + ../../data/output/annotated_pairs.json
📜  | INFO | Merging annotated pairs datasets
📜  | INFO | First dataset: 1 comparisons, 41 annotators
📜  | INFO | Second dataset: 1 comparisons, 41 annotators
📜  | INFO | Found 1 matching comparisons, 0 unique to first, 0 unique to second
📜  | INFO | Merged result: 1 comparisons, 41 annotators
📜  | INFO | Outputting merged dataset to stdout
{
  "metadata": {
    "version": "2.0",
    "created_at": "2025-05-30T14:22:44Z",
    "dataset_name": "ICAI Training Dataset - 2025-05-07_18-35-25",
    "description": "Annotated pairs dataset with annotations from ICAI",
    "default_annotator": "ba751e7b",
    "available_metadata_keys_per_comparison": [
      "index"
    ]
  },
  "annotators": {
    "ba751e7b": {
      "name": "preferred_text",
      "description": "Default annotator from original dataset (from column `preferred_text`)",
      "type": "unk

## Python API

The dataset operations are also available through the `feedback_forensics.data.operations` module:

### Loading, Merging, and Saving Datasets

Merge two datasets with conflict resolution:

In [ ]:
from feedback_forensics.data.operations import load_ap, merge_ap

# Load two datasets. Here we use the same sample data for both, but in practice you would load two different datasets.
dataset1 = load_ap("../../data/output/annotated_pairs.json")
dataset2 = load_ap("../../data/output/annotated_pairs.json")

print(f"Dataset contains {len(dataset1['comparisons'])} comparisons")
print(f"Dataset contains {len(dataset1['annotators'])} annotators")

# Merge them (first dataset takes precedence in conflicts). In this case, the datasets are identical.
merged_dataset = merge_ap(dataset1, dataset2)

Dataset contains 1 comparisons
Dataset contains 41 annotators
📜  | INFO | Merging annotated pairs datasets
📜  | INFO | First dataset: 1 comparisons, 41 annotators
📜  | INFO | Second dataset: 1 comparisons, 41 annotators
📜  | INFO | Found 1 matching comparisons, 0 unique to first, 0 unique to second
📜  | INFO | Merged result: 1 comparisons, 41 annotators


You can save the resulting dataset to a file:

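```python
# save_ap is assumed to live in the same module as load_ap and merge_ap.
from feedback_forensics.data.operations import save_ap

save_ap(merged_dataset, "merged_dataset.json")
```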
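
To sanity-check a saved file, you can read it back and count its contents. The sketch below uses only the standard library and assumes the file is plain JSON with the top-level keys `metadata`, `annotators`, and `comparisons`, as seen in the CLI output and the loading example above. `summarize_ap` is a hypothetical helper, and the toy dataset is hand-written stand-in data, not real output.

```python
import json
import tempfile
from pathlib import Path


def summarize_ap(dataset):
    """Return basic counts for an annotated pairs dataset held as a dict."""
    metadata = dataset.get("metadata", {})
    return {
        "dataset_name": metadata.get("dataset_name"),
        "version": metadata.get("version"),
        "n_comparisons": len(dataset.get("comparisons", [])),
        "n_annotators": len(dataset.get("annotators", {})),
    }


# Hand-written toy dataset standing in for a real merged result.
example = {
    "metadata": {"version": "2.0", "dataset_name": "Toy dataset"},
    "annotators": {"a1": {"name": "toy annotator"}},
    "comparisons": [{"id": "c1"}],
}

# Write the toy dataset to a temporary file and read it back,
# mimicking a save followed by a reload.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "merged_dataset.json"
    path.write_text(json.dumps(example, indent=2), encoding="utf-8")
    reloaded = json.loads(path.read_text(encoding="utf-8"))

summary = summarize_ap(reloaded)
print(summary)
```

For a real file, replace the temporary-file round trip with a `json.load` on the path you passed to `save_ap`.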

### How Merging Works

The merge operation works as follows:

1. **Comparison Matching**: Uses content-based hash IDs to identify identical comparisons
2. **Annotation Combining**: Merges all annotations from both datasets for matching comparisons
3. **Conflict Resolution**: When conflicts occur, the first dataset takes precedence (with warnings logged)

This is particularly useful for data restoration, where you have:

- One dataset with full comparison data but limited annotations
- Another dataset with rich annotations but possibly missing comparison details
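
The three merge steps described above can be illustrated with a simplified, standalone sketch. This is not the implementation of `merge_ap`: the field names (`prompt`, `text_a`, `text_b`, `annotations`, `pref`), the hashing scheme, and the `merge_sketch` function are all assumptions made for illustration, and the real schema and ID computation may differ.

```python
import copy
import hashlib
import json
import logging

logger = logging.getLogger("merge_sketch")

# Assumed content fields; the real schema may use different names.
CONTENT_KEYS = ("prompt", "text_a", "text_b")


def content_id(comparison):
    """Derive an ID from the comparison's content: equal content, equal ID."""
    payload = json.dumps(
        {key: comparison.get(key) for key in CONTENT_KEYS}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]


def merge_sketch(first, second):
    """Merge two dataset dicts; `first` wins whenever the two disagree."""
    merged = copy.deepcopy(first)  # never mutate the inputs
    annotators = merged.setdefault("annotators", {})
    comparisons = merged.setdefault("comparisons", [])

    # Annotator definitions: keep first's entry if both define the same ID.
    for ann_id, info in second.get("annotators", {}).items():
        annotators.setdefault(ann_id, copy.deepcopy(info))

    # Step 1: index first's comparisons by content hash.
    by_id = {content_id(comp): comp for comp in comparisons}

    for comp in second.get("comparisons", []):
        cid = content_id(comp)
        if cid not in by_id:
            # Unique to second: carry it over unchanged.
            by_id[cid] = copy.deepcopy(comp)
            comparisons.append(by_id[cid])
            continue
        # Step 2: combine annotations for matching comparisons.
        target = by_id[cid].setdefault("annotations", {})
        for ann_id, value in comp.get("annotations", {}).items():
            if ann_id not in target:
                target[ann_id] = copy.deepcopy(value)
            elif target[ann_id] != value:
                # Step 3: conflict, so keep first's value and log a warning.
                logger.warning("Conflict for %s on %s; keeping first", ann_id, cid)
    return merged


# Toy data: `full` has one annotation, `rich` adds a second annotator,
# disagrees with `full` on annotator a1, and has one extra comparison.
full = {
    "annotators": {"a1": {"name": "human"}},
    "comparisons": [
        {"prompt": "Hi", "text_a": "Hello!", "text_b": "Go away.",
         "annotations": {"a1": {"pref": "text_a"}}},
    ],
}
rich = {
    "annotators": {"a1": {"name": "human"}, "a2": {"name": "model"}},
    "comparisons": [
        {"prompt": "Hi", "text_a": "Hello!", "text_b": "Go away.",
         "annotations": {"a1": {"pref": "text_b"}, "a2": {"pref": "text_a"}}},
        {"prompt": "Bye", "text_a": "See you.", "text_b": "Finally.",
         "annotations": {"a2": {"pref": "text_a"}}},
    ],
}

merged = merge_sketch(full, rich)
print(len(merged["comparisons"]), sorted(merged["annotators"]))
print(merged["comparisons"][0]["annotations"])
```

Passing the dataset with the trusted data as the first argument matters: in this sketch, as in the documented behaviour, the first dataset's value is kept on every conflict, so swapping the arguments changes which annotation survives.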

## Next Steps

- [Analyze your merged datasets](feedback.ipynb)
- [API Reference](../api.ipynb)
- [Learn about the underlying method](../method/index.md)