# Creating the AGNEWs `LabelledSimpleDataset`

In this notebook, we take the AGNEWs dataset ([original source](https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset)) and convert it into a `LabelledSimpleDataset`, which we later use as the input to the `DiffPrivateSimpleDatasetPack`.

In [None]:
from llama_index.core.llama_dataset.simple import (
    LabelledSimpleDataExample,
    LabelledSimpleDataset,
)
from llama_index.core.llama_dataset.base import CreatedBy, CreatedByType
import pandas as pd

### Load data

A copy of the training split (`train.csv`) is also available from our public Dropbox, and that is the file we download here.

In [None]:
!mkdir -p "data/agnews/"
!wget "https://www.dropbox.com/scl/fi/wzcuxuv2yo8gjp5srrslm/train.csv?rlkey=6kmofwjvsamlf9dj15m34mjw9&dl=1" -O "data/agnews/train.csv"

--2024-03-18 12:21:01--  https://www.dropbox.com/scl/fi/wzcuxuv2yo8gjp5srrslm/train.csv?rlkey=6kmofwjvsamlf9dj15m34mjw9&dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.11.18
Connecting to www.dropbox.com (www.dropbox.com)|162.125.11.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://uc82fce8c47868a649f55a4d5de7.dl.dropboxusercontent.com/cd/0/inline/CPVWu3TnpuUIfZZJZ4fGfwlbBAh0Yq2rn-Z-B86N2nUIVfeX0IIlkkPsiHLXfV0ZAqIL2jTDl0tW4s72KlwbWAgtPq9RP6AWKmsm4hymekDGtzH7_fq6i09hKhZfI67nrgv9_R_7TsI4mAc9XwhuIdQx/file?dl=1# [following]
--2024-03-18 12:21:02--  https://uc82fce8c47868a649f55a4d5de7.dl.dropboxusercontent.com/cd/0/inline/CPVWu3TnpuUIfZZJZ4fGfwlbBAh0Yq2rn-Z-B86N2nUIVfeX0IIlkkPsiHLXfV0ZAqIL2jTDl0tW4s72KlwbWAgtPq9RP6AWKmsm4hymekDGtzH7_fq6i09hKhZfI67nrgv9_R_7TsI4mAc9XwhuIdQx/file?dl=1
Resolving uc82fce8c47868a649f55a4d5de7.dl.dropboxusercontent.com (uc82fce8c47868a649f55a4d5de7.dl.dropboxusercontent.com)... 162.125.11.15
Connecting to uc82fc

In [None]:
df = pd.read_csv("./data/agnews/train.csv")

In [None]:
class_to_label = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}

In [None]:
df["Label"] = df["Class Index"].map(class_to_label)

In [None]:
df.head()

| | Class Index | Title | Description | Label |
|---|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... | Business |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... | Business |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... | Business |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... | Business |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... | Business |

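One thing worth checking after the mapping step is that every `Class Index` actually had an entry in `class_to_label`, since `Series.map` with a dict silently produces `NaN` for missing keys. The following is a minimal sketch on a tiny stand-in frame (the sample rows, including the deliberately invalid class `5`, are made up for illustration and are not part of the dataset):

```python
import pandas as pd

# Tiny stand-in for train.csv; the row with Class Index 5 is deliberately invalid.
sample = pd.DataFrame(
    {
        "Class Index": [3, 1, 4, 5],
        "Title": ["t1", "t2", "t3", "t4"],
        "Description": ["d1", "d2", "d3", "d4"],
    }
)
class_to_label = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}

# Series.map returns NaN for any key that is missing from the dict.
sample["Label"] = sample["Class Index"].map(class_to_label)

# Rows whose class index had no label, and the per-label counts
# (value_counts skips NaN by default).
unmapped = sample[sample["Label"].isna()]
label_counts = sample["Label"].value_counts().to_dict()
```

On the real frame, the same two lines applied to `df` should give an empty `unmapped` frame if the four classes above cover the whole file.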

### Create LabelledSimpleDataExample

In [None]:
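examples = []
# Each row becomes one example: the title and description are joined into the
# text, and the mapped class name is used as the reference label.
for _, row in df.iterrows():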
    example = LabelledSimpleDataExample(
        reference_label=row["Label"],
        text=f"{row['Title']} {row['Description']}",
        text_by=CreatedBy(type=CreatedByType.HUMAN),
    )
    examples.append(example)

simple_dataset = LabelledSimpleDataset(examples=examples)
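
The loop above builds one pandas `Series` per row, which can be slow on a large frame. As a hedged alternative, the label and text pairs can be assembled by zipping over the columns directly. The sketch below only builds the `(label, text)` tuples on a made-up two-row stand-in frame; wrapping each pair in a `LabelledSimpleDataExample` would follow the same pattern as the cell above.

```python
import pandas as pd

# Small stand-in for the labelled frame; the rows are illustrative only.
df_sample = pd.DataFrame(
    {
        "Label": ["Business", "World"],
        "Title": ["Oil prices soar", "Talks resume"],
        "Description": [
            "AFP - Tearaway world oil prices",
            "Reuters - Negotiators met",
        ],
    }
)

# zip over the columns yields plain Python values row by row,
# avoiding the per-row Series that iterrows constructs.
pairs = [
    (label, f"{title} {description}")
    for label, title, description in zip(
        df_sample["Label"], df_sample["Title"], df_sample["Description"]
    )
]
```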

In [None]:
simple_dataset.to_pandas()[:5]

| | reference_label | text | text_by |
|---|---|---|---|
| 0 | Business | Wall St. Bears Claw Back Into the Black (Reute... | human |
| 1 | Business | Carlyle Looks Toward Commercial Aerospace (Reu... | human |
| 2 | Business | Oil and Economy Cloud Stocks' Outlook (Reuters... | human |
| 3 | Business | Iraq Halts Oil Exports from Main Southern Pipe... | human |
| 4 | Business | Oil prices soar to all-time record, posing new... | human |


In [None]:
simple_dataset.save_json("agnews.json")
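
To sanity-check the saved file without loading it back through LlamaIndex, the JSON can be inspected with the standard library. This is a minimal sketch that assumes the file stores its examples as a list under a top-level `"examples"` key, each with a `"reference_label"` field; that layout is an assumption here and should be confirmed against the actual `agnews.json`. The helper `label_counts` and the stand-in file are hypothetical and only serve the illustration.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def label_counts(path):
    """Count examples per reference_label in a saved dataset file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Assumption: examples sit under a top-level "examples" key.
    return Counter(example["reference_label"] for example in data["examples"])


# Stand-in file mimicking the assumed layout, so the sketch runs on its own.
stand_in = {
    "examples": [
        {"reference_label": "Business", "text": "first", "text_by": {"type": "human"}},
        {"reference_label": "World", "text": "second", "text_by": {"type": "human"}},
        {"reference_label": "Business", "text": "third", "text_by": {"type": "human"}},
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "agnews_sample.json"
    path.write_text(json.dumps(stand_in), encoding="utf-8")
    counts = label_counts(path)
```

If the layout assumption holds, calling `label_counts("agnews.json")` on the real file would give the number of examples per class.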