# Basic Usage: DiffPrivateSimpleDatasetPack

In this notebook, we demonstrate the basic usage of `DiffPrivateSimpleDatasetPack`. The purpose of this pack is to create privacy-safe, synthetic observations (or copies) of an original, likely sensitive, dataset.

### How it works
`DiffPrivateSimpleDatasetPack` operates on the `LabelledSimpleDataset` type, a llama-dataset made up of `LabelledSimpleDataExample` objects, each of which has two main fields: `text` and `reference_label`. Calling the pack's `.run()` method ultimately produces a new `LabelledSimpleDataset` containing privacy-safe, synthetic `LabelledSimpleDataExample` objects.

### This notebook:
In this notebook, we create a privacy-safe, synthetic version of the AGNEWs dataset. The raw AGNEWs data was first converted into a `LabelledSimpleDataset` (see `_create_agnews_simple_dataset.ipynb`).

In [None]:
%pip install treelib -q

Note: you may need to restart the kernel to use updated packages.


In [None]:
import nest_asyncio

nest_asyncio.apply()

In [None]:
from llama_index.core.instrumentation.span_handlers import SimpleSpanHandler
import llama_index.core.instrumentation as instrument

span_handler = SimpleSpanHandler()
dispatcher = instrument.get_dispatcher()
dispatcher.add_span_handler(span_handler)

In [None]:
from llama_index.core.llama_dataset.simple import LabelledSimpleDataset
from llama_index.packs.diff_private_simple_dataset.base import PromptBundle
from llama_index.packs.diff_private_simple_dataset import DiffPrivateSimpleDatasetPack
from llama_index.llms.openai import OpenAI
import tiktoken

### Load LabelledSimpleDataset

In [None]:
simple_dataset = LabelledSimpleDataset.from_json("./agnews.json")

In [None]:
simple_dataset.to_pandas()[:5]

Unnamed: 0,reference_label,text,text_by
0,Business,Wall St. Bears Claw Back Into the Black (Reute...,human
1,Business,Carlyle Looks Toward Commercial Aerospace (Reu...,human
2,Business,Oil and Economy Cloud Stocks' Outlook (Reuters...,human
3,Business,Iraq Halts Oil Exports from Main Southern Pipe...,human
4,Business,"Oil prices soar to all-time record, posing new...",human


In [None]:
simple_dataset.to_pandas().value_counts("reference_label")

reference_label
Business    30000
Sci/Tech    30000
Sports      30000
World       30000
Name: count, dtype: int64

### Instantiate Pack

To construct a `DiffPrivateSimpleDatasetPack` object, we need to supply:

1. an `LLM` (must return `CompletionResponse`),
2. its associated `tokenizer`,
3. a `PromptBundle` object that contains the parameters required for prompting the LLM to produce the synthetic observations,
4. a `LabelledSimpleDataset`,
5. [Optional] `sephamore_counter_size`, used to help reduce the chances of experiencing a `RateLimitError` when calling the LLM's completions API,
6. [Optional] `sleep_time_in_seconds`, used to help reduce the chances of experiencing a `RateLimitError` when calling the LLM's completions API.
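
The two optional parameters can be read as a cap on concurrent calls and a pause between calls. The sketch below illustrates that general throttling pattern with `asyncio.Semaphore`. It is an assumption about the approach, not the pack's actual implementation, and `fake_llm_call`, `throttled_call`, and `run_all` are hypothetical names introduced only for this example.

```python
import asyncio


async def fake_llm_call(prompt: str) -> str:
    """Hypothetical stand-in for an LLM completion request."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"completion for: {prompt}"


async def throttled_call(
    prompt: str, semaphore: asyncio.Semaphore, sleep_time_in_seconds: float
) -> str:
    # The semaphore limits how many calls are in flight at the same time.
    async with semaphore:
        result = await fake_llm_call(prompt)
        # Pause before releasing the slot, which spaces out successive requests.
        await asyncio.sleep(sleep_time_in_seconds)
        return result


async def run_all(prompts, counter_size: int = 1, sleep_time_in_seconds: float = 0.0):
    semaphore = asyncio.Semaphore(counter_size)
    tasks = [throttled_call(p, semaphore, sleep_time_in_seconds) for p in prompts]
    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*tasks)


results = asyncio.run(
    run_all(["a", "b", "c"], counter_size=2, sleep_time_in_seconds=0.01)
)
```

A smaller counter or a longer sleep lowers the request rate at the cost of slower generation. Inside a notebook with a running event loop, `await run_all(...)` can be used in place of `asyncio.run(...)`.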

In [None]:
llm = OpenAI(
    model="gpt-3.5-turbo-instruct",
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,  # OpenAI exposes only the top 5 next-token logprobs, not the entire vocabulary
)
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo-instruct")

prompt_bundle = PromptBundle(
    instruction=(
        "Given a label of news type, generate the chosen type of news accordingly.\n"
        "Start your answer directly after 'Text: '. Begin your answer with [RESULT].\n"
    ),
    label_heading="News Type",
    text_heading="Text",
)

dp_simple_dataset_pack = DiffPrivateSimpleDatasetPack(
    llm=llm,
    tokenizer=tokenizer,
    prompt_bundle=prompt_bundle,
    simple_dataset=simple_dataset,
)

To generate a single synthetic example, we can call the `generate_dp_synthetic_example()` method. Synthetic examples are created for a specific `label`. Both sync and async versions are supported. The following params are required:

- `label`: The class from which you want to generate a synthetic example.
- `t_max`: The max number of tokens you would like to generate (the algorithm adds some logic per token in order to satisfy differential privacy).
- `sigma`: Controls the variance of the noise distribution used by the differential privacy noise mechanism. Each value of `sigma` corresponds to a particular level of `epsilon` in the differential privacy guarantee.
- `num_splits`: The number of disjoint splits of the original dataset that the differential privacy algorithm relies on.
- `num_samples_per_split`: The number of private, in-context examples from each split to include in the generation of the synthetic example.
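
To make the roles of `num_splits`, `num_samples_per_split`, and `sigma` more concrete, here is a toy sketch of the general idea: partition the private examples into disjoint splits, obtain one next-token distribution per split, average them, and add Gaussian noise scaled by `sigma` before choosing a token. This is an illustration under stated assumptions, not the pack's implementation. The helper names (`make_disjoint_splits`, `noisy_aggregate`), the toy vocabulary, and the hard-coded per-split distributions are all hypothetical, and the sketch makes no claim about the resulting `epsilon`.

```python
import random
from typing import Dict, List


def make_disjoint_splits(
    examples: List[str], num_splits: int, num_samples_per_split: int, seed: int = 0
) -> List[List[str]]:
    """Shuffle a copy of the data, then carve out non-overlapping splits."""
    needed = num_splits * num_samples_per_split
    if needed > len(examples):
        raise ValueError("not enough examples for the requested splits")
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    return [
        shuffled[i * num_samples_per_split : (i + 1) * num_samples_per_split]
        for i in range(num_splits)
    ]


def noisy_aggregate(
    per_split_probs: List[Dict[str, float]], sigma: float, seed: int = 0
) -> Dict[str, float]:
    """Average the per-split distributions and add Gaussian noise to each entry."""
    rng = random.Random(seed)
    vocab = sorted({tok for probs in per_split_probs for tok in probs})
    noisy = {}
    for tok in vocab:
        mean = sum(p.get(tok, 0.0) for p in per_split_probs) / len(per_split_probs)
        noisy[tok] = mean + rng.gauss(0.0, sigma)
    return noisy


# Toy private data for a single label (hypothetical).
private_texts = [f"sports article {i}" for i in range(20)]
splits = make_disjoint_splits(private_texts, num_splits=2, num_samples_per_split=8)

# In a real run, each split would supply the in-context examples for one LLM call.
# Here the per-split next-token distributions are simply hard-coded.
per_split_probs = [
    {"The": 0.6, "A": 0.3, "In": 0.1},
    {"The": 0.5, "A": 0.2, "In": 0.3},
]
noisy = noisy_aggregate(per_split_probs, sigma=0.1)
next_token = max(noisy, key=noisy.get)  # one of up to t_max generation steps
```

With `sigma=0.0` the aggregate reduces to the plain average of the per-split distributions. Larger values of `sigma` add more noise to each token's score, which is the usual trade-off between privacy and output quality.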

In [None]:
dp_simple_dataset_pack.generate_dp_synthetic_example(
    label="Sports", t_max=35, sigma=0.1, num_splits=2, num_samples_per_split=8
)

100%|█████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:33<00:00,  1.05it/s]


LabelledSimpleDataExample(reference_label='Sports', text='The 2019 World Series has been won by the Washington Nationals, who defeated the Houston Astros in Game 7 with a final score of 6-2', text_by=CreatedBy(model_name='gpt-3.5-turbo-instruct', type=<CreatedByType.AI: 'ai'>))

In [None]:
await dp_simple_dataset_pack.agenerate_dp_synthetic_example(
    label="Sports", t_max=35, sigma=0.1, num_splits=2, num_samples_per_split=8
)

100%|█████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:33<00:00,  1.04it/s]


LabelledSimpleDataExample(reference_label='Sports', text='The 2018 Winter Olympics in Pyeongchang, South Korea, have come to a close, and the United States has once again dominated the medal count', text_by=CreatedBy(model_name='gpt-3.5-turbo-instruct', type=<CreatedByType.AI: 'ai'>))

To create a privacy-safe, synthetic dataset, we call the `run()` (or async `arun()`) method. The required params for this method have already been introduced, with the exception of `sizes`.

- `sizes`: Either an `int` or a `Dict[str, int]` specifying the number of synthetic observations to generate per label.
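
Since `sizes` accepts either form, one way to think about the `int` case is as shorthand for the same count applied to every label. The helper below is a hypothetical sketch of that normalization. The name `normalize_sizes`, the validation step, and the choice to treat missing labels as 0 are assumptions made for illustration, not the pack's code.

```python
from typing import Dict, List, Union


def normalize_sizes(
    sizes: Union[int, Dict[str, int]], labels: List[str]
) -> Dict[str, int]:
    """Expand an int into a per-label dict; check a dict against the known labels."""
    if isinstance(sizes, int):
        return {label: sizes for label in labels}
    unknown = set(sizes) - set(labels)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    # Labels missing from the dict are treated as 0 in this sketch.
    return {label: sizes.get(label, 0) for label in labels}


labels = ["Business", "Sci/Tech", "Sports", "World"]
as_int = normalize_sizes(2, labels)
as_dict = normalize_sizes({"World": 1, "Sports": 1}, labels)
```

Under these assumptions, passing `2` requests two synthetic examples for each of the four labels, while the dictionary form requests one example each for `World` and `Sports` only.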

In [None]:
synthetic_dataset = await dp_simple_dataset_pack.arun(
    sizes={"World": 1, "Sports": 1, "Sci/Tech": 0, "Business": 0},
    t_max=10,
    sigma=0.5,
    num_splits=2,
    num_samples_per_split=8,
)

100%|█████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:17<00:00,  1.77s/it]
100%|█████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:18<00:00,  1.83s/it]
100%|███████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00,  9.14s/it]


In [None]:
synthetic_dataset.to_pandas()

Unnamed: 0,reference_label,text,text_by
0,World,As tensions continue to rise on,ai (gpt-3.5-turbo-instruct)
1,Sports,The latest sports news: The New,ai (gpt-3.5-turbo-instruct)


In [None]:
span_handler.print_trace_trees()

DiffPrivateSimpleDatasetPack.generate_dp_synthetic_example-92029346-818a-4466-9bf8-add83b0e43f4 (33.304743)
└── DiffPrivateSimpleDatasetPack.agenerate_dp_synthetic_example-29f82dd5-d7ef-4cdb-9990-2096136ab139 (33.302189)
    ├── DiffPrivateSimpleDatasetPack._filter_dataset_by_label-25cc7261-9d65-4abd-b37b-70d877ee2194 (0.089156)
    ├── DiffPrivateSimpleDatasetPack._split_dataset-d9a79a57-b4e4-4ed1-9884-beb006849927 (0.007764)
    ├── DiffPrivateSimpleDatasetPack._filter_dataset_by_label-5a2950f7-3cff-4103-972f-4458390b57ea (0.085112)
    ├── DiffPrivateSimpleDatasetPack._split_dataset-93075908-1732-4594-bd41-c69fe7e320ee (0.007132)
    ├── DiffPrivateSimpleDatasetPack._filter_dataset_by_label-1fadb3e0-4339-424a-9806-8cb313968d74 (0.090871)
    ├── DiffPrivateSimpleDatasetPack._split_dataset-830b2f0b-f254-4c56-aba0-467405a6c7a7 (0.00719)
    ├── DiffPrivateSimpleDatasetPack._filter_dataset_by_label-bb31ab4c-b377-405a-a1d3-03cfb2f95faa (0.075273)
    ├── DiffPrivateSimpleDatasetPack._sp